Twitter, you're doing it right!
So I’ve spent the last few weeks getting into web crawling. Not for any really good reason, other than that I can.
I started out using Cobweb, a ruby web crawler that had good ratings, just to see what it could do. And I started with my favorite site on the Internet, Reddit. After crawling about 20K pages, I stopped getting responses from Reddit. Going to their site gave me a wonderful warning about how I had been banned for hitting the site too much. I looked and I’d been hitting them 20 times a second! Whoops!
For those not in the know, all sites need to worry about denial of service attacks, where an attacker tries to take down a website by repeatedly hitting it, much like I was doing. A common defense is to block the offending IP address, which is exactly what Reddit did to me.
After that I started writing my own crawling system, one that could not only scale out, but also avoid getting banned from every site in existence. A slow crawler, if you will.
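The core of a "slow crawler" is per-domain rate limiting: never hit the same host more often than some delay allows. Here's a minimal sketch of that idea; the 2-second default delay and the class name are my own assumptions, not any site's documented limit, and a real crawler should also honor robots.txt.

```ruby
require 'net/http'
require 'uri'

# Minimal "polite crawler" sketch: at most one request per host
# every `delay` seconds, tracked per domain.
class SlowCrawler
  def initialize(delay: 2.0) # assumed default; tune per site
    @delay = delay
    @last_hit = {} # host => Time of the last request to that host
  end

  # Sleep just long enough that this host sees at most one
  # request every @delay seconds, then record the hit.
  def throttle(host)
    last = @last_hit[host]
    if last
      wait = @delay - (Time.now - last)
      sleep(wait) if wait > 0
    end
    @last_hit[host] = Time.now
  end

  def fetch(url)
    uri = URI.parse(url)
    throttle(uri.host)
    Net::HTTP.get_response(uri)
  end
end
```

Because the bookkeeping is keyed by host, requests to different domains interleave freely while each individual site only sees the slow trickle.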
While working on that, I started playing around with sentiment analysis: analyzing what people say to determine whether they feel positive or negative about something. It’s a pretty hard problem, so I started with a very naive method, matching words in the text against a sentiment dictionary I found online. Just something to play with.
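The naive approach boils down to counting dictionary hits. Here's a sketch of what that looks like; the tiny word lists are stand-ins for a real sentiment dictionary like the one I found online.

```ruby
# Stand-in word lists; a real sentiment dictionary has thousands of entries.
POSITIVE = %w[good great love awesome happy].freeze
NEGATIVE = %w[bad terrible hate awful sad].freeze

# Score text by counting positive vs. negative word hits.
# Returns 1 (positive), -1 (negative), or 0 (neutral / no sentiment).
def naive_sentiment(text)
  words = text.downcase.scan(/[a-z']+/)
  score = words.count { |w| POSITIVE.include?(w) } -
          words.count { |w| NEGATIVE.include?(w) }
  score <=> 0
end
```

It falls over on negation ("not good") and sarcasm, which is exactly why this is a starting point and not a finished method.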
This needed data, and I hit upon a source. Twitter has had this mythical “firehose” for a while: a stream of every public tweet, delivered in real time. If you want, you can take in the entirety of Twitter and analyze it as fast as you can process it. Pretty much what I needed: test data!
So I found a gem called TweetStream that let me hook into Twitter’s streaming API (strictly the public filtered stream rather than the full firehose, but who knew there was a public stream anyone could use?!), and set it up to pull in tweets. Here’s the code, for your amusement:
require 'tweetstream'
require 'json'
require 'unicode'
require 'awesome_print'
require 'mongoid'

Mongoid.load!("config/mongoid.yml", :development)

class Result
  include Mongoid::Document
end

TweetStream.configure do |config|
  config.consumer_key       = 'GetFromTwitter'
  config.consumer_secret    = 'GetFromTwitter'
  config.oauth_token        = 'GetFromTwitter'
  config.oauth_token_secret = 'GetFromTwitter'
  config.auth_method        = :oauth
end

@client = TweetStream::Client.new

@client.on_error do |message|
  puts "ERROR: " + message
  puts "sleeping for 1 minute"
  sleep(60)
end

@client.on_delete do |status_id, user_id|
  puts "deleting " + status_id.to_s
  # Remove our stored copy of any tweet Twitter asks us to delete
  Result.where(tweet_id: status_id).delete_all
end

@client.on_limit do |skip_count|
  puts "told to limit " + skip_count.to_s
  sleep(skip_count * 5)
end

@client.on_enhance_your_calm do
  puts "told to calm down. Sleeping 1 minute"
  sleep(60)
end

puts "starting Twitter firehose"

i = 1
@client.track("Facebook", "facebook") do |status|
  # Strip everything outside basic ASCII so the naive analysis
  # doesn't choke on emoji and other non-English characters
  text = Unicode.decompose(status.text).delete("^0-9A-Za-z ,.<>/?!;:'\"+()*&^%$#@")

  mongo = Result.new
  mongo["url"]      = "http://twitter.com"
  mongo["word"]     = "Facebook"
  mongo["tweet_id"] = status.id # kept so delete notices can find the document
  mongo["elements"] = [text]
  mongo.save

  puts i.to_s + ": " + text
  i += 1
end
I’ve pulled in around 30 million tweets in the last two months or so, purging those that didn’t contain any sentiment to get down to a data set of about 10 million tweets. Next up: the analysis, playing around with Hadoop!
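The purge step comes down to a simple predicate: does the tweet text contain any word from the sentiment dictionary at all? A sketch of that check, with a tiny stand-in word list (a real run would use the full dictionary and iterate over the stored Result documents):

```ruby
# Stand-in for the full sentiment dictionary
SENTIMENT_WORDS = %w[good great love bad terrible hate].freeze

# True if the text contains at least one dictionary word.
def has_sentiment?(text)
  !(text.downcase.scan(/[a-z']+/) & SENTIMENT_WORDS).empty?
end

# Purge loop, assuming the Result model from the collection script,
# where "elements" holds the tweet text:
#   Result.all.each do |doc|
#     doc.delete unless has_sentiment?(doc["elements"].to_a.join(" "))
#   end
```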