So I’ve spent the last few weeks getting into web crawling. Not for any really good reason, other than I can.
I started out using Cobweb, a Ruby web crawler that had good ratings, just to see what it could do. And I started with my favorite site on the Internet, Reddit. After crawling about 20K pages, I stopped getting responses from Reddit. Going to their site gave me a wonderful warning about how I had been banned for hitting the site too much. I looked, and I’d been hitting them 20 times a second! Whoops!
For those not in the know, all sites need to worry about denial-of-service attacks, where an attacker tries to take down a web site by flooding it with requests, much like I was doing. The usual defense is to block the offending IP address, which is what Reddit did to me.
After that I started writing my own crawling system, one that would not only scale out but also avoid getting banned from every site in existence. A slow crawler, if you will.
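To give a feel for the "slow" part, here is a minimal sketch of a per-host throttle. It is not the crawler I actually built; the class name HostThrottle, the wait_for method, and the 2-second interval in the usage note are all made up for illustration. The idea is simply to remember when each host may next be hit and to sleep until then.

```ruby
require 'uri'

# Hypothetical helper (not from any gem): enforces a minimum delay
# between successive requests to the same host.
class HostThrottle
  # min_interval: seconds to leave between hits on one host.
  # clock and sleeper are injectable so the logic can be tested without waiting.
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @next_allowed = Hash.new(0.0) # host => earliest time of the next request
  end

  # Waits until the host of +url+ may be hit again.
  # Returns the number of seconds it waited.
  def wait_for(url)
    host = URI.parse(url).host
    now = @clock.call
    delay = [@next_allowed[host] - now, 0.0].max
    @sleeper.call(delay) if delay > 0
    @next_allowed[host] = now + delay + @min_interval
    delay
  end
end
```

A crawler loop would call something like throttle.wait_for(url) right before each fetch, with throttle = HostThrottle.new(2.0). Different hosts do not block each other, so scaling out across many sites still works while each individual site sees only a trickle of requests.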
While working on that, I started playing around with sentiment analysis, or the analysis of what people say to determine if they are positive or negative regarding something. It’s a pretty hard problem, so I started with a very naive method of matching words to words I knew, using a dictionary I found online. Just for something to play with.
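The naive word-matching idea looks roughly like this. It is a sketch only: the six-word LEXICON is a stand-in for the much larger dictionary I found online, and the method names sentiment_score and sentiment_label are invented for this example.

```ruby
# Tiny illustrative lexicon; a real run would load a far larger word list.
LEXICON = {
  'love' => 1, 'great' => 1, 'awesome' => 1,
  'hate' => -1, 'awful' => -1, 'broken' => -1
}.freeze

# Adds up the score of every known word; unknown words count as 0.
def sentiment_score(text, lexicon = LEXICON)
  text.downcase.scan(/[a-z']+/).sum { |word| lexicon.fetch(word, 0) }
end

# Maps the numeric score onto a coarse label.
def sentiment_label(text, lexicon = LEXICON)
  score = sentiment_score(text, lexicon)
  return :positive if score > 0
  return :negative if score < 0
  :neutral
end
```

For instance, sentiment_label('Facebook is broken and awful') comes out as :negative, while a text with no dictionary words at all is :neutral. The approach knows nothing about negation or sarcasm ("not great" still scores as positive), which is a big part of why this is a hard problem.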
This needed data, and I hit upon a source. Twitter has had this mythical “firehose” for a while that streams tweets out to whoever is listening. If you want, you can get the entirety of Twitter and analyze it as fast as you can process it. Pretty much what I need! Test data!
So I found a gem called TweetStream that let me hook into the firehose (who knew it was a public thing anyone could use?!), and set it up to pull in tweets. Here’s the code for your amusement:
require 'tweetstream'
require 'json'
require 'unicode'
require 'awesome_print'
require 'mongoid'

Mongoid.load!("config/mongoid.yml", :development)

# No declared fields; attributes are assigned dynamically below.
# (Mongoid 4+ also needs: include Mongoid::Attributes::Dynamic)
class Result
  include Mongoid::Document
end

TweetStream.configure do |config|
  config.consumer_key       = 'GetFromTwitter'
  config.consumer_secret    = 'GetFromTwitter'
  config.oauth_token        = 'GetFromTwitter'
  config.oauth_token_secret = 'GetFromTwitter'
  config.auth_method        = :oauth
end

@client = TweetStream::Client.new

@client.on_error do |message|
  puts "ERROR: " + message
  puts "sleeping for 1 minute"
  sleep(60)
end

@client.on_delete do |status_id, user_id|
  puts "deleting " + status_id.to_s
  # The original called Tweet.delete, but no Tweet class is defined.
  # Assumption: remove the matching Result by the status id stored below.
  Result.where(status_id: status_id).delete_all
end

@client.on_limit do |skip_count|
  puts "told to limit " + skip_count.to_s
  sleep(skip_count * 5)
end

@client.on_enhance_your_calm do
  puts "told to calm down. Sleeping 1 minute"
  sleep(60)
end

# Characters to keep; with the leading ^, String#delete removes everything else.
# %q avoids having to escape the quote characters inside the set.
ALLOWED_CHARS = %q{^0-9A-Za-z ,.<>/?!;:'"+()*&^%$#@}

puts "starting Twitter firehose"
i = 1
@client.track("Facebook", "facebook") do |status|
  # Decompose accented characters, then strip anything outside the allowed set
  text = Unicode.decompose(status.text).delete(ALLOWED_CHARS)

  mongo = Result.new
  mongo["status_id"] = status.id # stored so that on_delete can find this tweet
  mongo["url"] = "http://twitter.com"
  mongo["word"] = "Facebook"
  mongo["elements"] = [text]
  mongo.save

  puts i.to_s + ": " + text
  i += 1
end
I’ve pulled in around 30 million tweets in the last two months or so, purging those that didn’t have any sentiment in them to get down to a data set of about 10 million tweets. Next up: the analysis, and playing around with Hadoop!