Twitter, you're doing it right!

So I’ve spent the last few weeks getting into web crawling. Not for any especially good reason, other than that I can.

I started out using Cobweb, a Ruby web crawler that had good ratings, just to see what it could do. And I started with my favorite site on the Internet, Reddit. After crawling about 20K pages, I stopped getting responses from Reddit. Going to their site gave me a wonderful warning about how I had been banned for hitting the site too much. I looked, and I’d been hitting them 20 times a second! Whoops!

For those not in the know, all sites need to worry about denial of service attacks, where an attacker tries to take down a web site by hitting it over and over, much like I was doing. The simplest defense is to block the offender's IP address, which is what Reddit did to me.

After that I started writing my own crawling system, one that would not only scale out but also avoid getting banned from every site in existence. A slow crawler, if you will.
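
The heart of a slow crawler is just remembering when you last hit each host and waiting a bit before hitting it again. Here's a minimal sketch of that idea, not the crawler I actually built. The `PoliteThrottle` class, the two second delay, and the injectable clock and sleeper are all made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: enforces a minimum delay between requests to the same host.
class PoliteThrottle
  def initialize(min_delay: 2.0,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_delay = min_delay
    @clock = clock      # returns the current time in seconds
    @sleeper = sleeper  # pauses for the given number of seconds
    @last_hit = {}      # host => time of the most recent request
  end

  # Waits until it is polite to hit the URL's host again.
  # Returns the number of seconds spent waiting.
  def wait_for(url)
    host = URI.parse(url).host
    last = @last_hit[host]
    waited = 0.0
    if last
      elapsed = @clock.call - last
      if elapsed < @min_delay
        waited = @min_delay - elapsed
        @sleeper.call(waited)
      end
    end
    @last_hit[host] = @clock.call
    waited
  end
end
```

Call `throttle.wait_for(url)` right before each fetch. Requests to different hosts don't slow each other down, so the crawl as a whole can still move along while each individual site only sees a trickle.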

While working on that, I started playing around with sentiment analysis: analyzing what people say to determine whether they feel positive or negative about something. It’s a pretty hard problem, so I started with a very naive method of matching the words in a text against words with a known sentiment, using a dictionary I found online. Just for something to play with.
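
To give a feel for how naive "naive" is, here's a rough sketch of dictionary matching. The two word lists are tiny stand-ins I made up for this example (a real dictionary has thousands of entries), and `naive_sentiment` is a hypothetical name, not the code I actually ran.

```ruby
# Tiny stand-in word lists, invented for this example.
POSITIVE_WORDS = %w[good great love awesome happy].freeze
NEGATIVE_WORDS = %w[bad hate awful terrible sad].freeze

# Returns the number of positive words minus the number of negative words.
# Above zero reads as positive, below zero as negative, zero as no sentiment found.
def naive_sentiment(text)
  words = text.downcase.scan(/[a-z']+/)
  positives = words.count { |word| POSITIVE_WORDS.include?(word) }
  negatives = words.count { |word| NEGATIVE_WORDS.include?(word) }
  positives - negatives
end

puts naive_sentiment("I love Facebook, it is great!")  # prints 2
puts naive_sentiment("I hate the new layout. Awful.")  # prints -2
puts naive_sentiment("Facebook changed its logo")      # prints 0
```

The weakness shows up fast: "not bad" scores -1, because the method only sees the word "bad" and has no idea what "not" does to it.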

This needed data to work on, and I hit upon a source. Twitter has had this mythical “firehose” for a while that streams every tweet to whoever is hooked up to it. If you want, you can get the entirety of Twitter and analyze it as fast as you can process it. Pretty much what I needed! Test data!

So I found a gem called TweetStream that let me hook into Twitter's streaming API (who knew it was a public thing anyone could use?!), and set it up to pull in tweets that mention Facebook. Here’s the code for your amusement:

 

require 'tweetstream'
require 'json'
require 'unicode'
require 'awesome_print'
require 'mongoid'

Mongoid.load!("config/mongoid.yml", :development)

class Result
  include Mongoid::Document
end

TweetStream.configure do |config|
  config.consumer_key       = 'GetFromTwitter'
  config.consumer_secret    = 'GetFromTwitter'
  config.oauth_token        = 'GetFromTwitter'
  config.oauth_token_secret = 'GetFromTwitter'
  config.auth_method        = :oauth
end

@client = TweetStream::Client.new

@client.on_error do |message|
  puts "ERROR: " + message
  puts "sleeping for 1 minute"
  sleep(60)
end

@client.on_delete do |status_id, user_id|
  puts "deleting " + status_id.to_s
  # Remove any stored tweet that Twitter says has been deleted
  Result.where(tweet_id: status_id).delete_all
end

@client.on_limit do |skip_count|
  # skip_count is an Integer, so convert it before concatenating
  puts "told to limit " + skip_count.to_s
  sleep(skip_count * 5)
end

@client.on_enhance_your_calm do
  puts "told to calm down. Sleeping 1 minute"
  sleep(60)
end

puts "starting Twitter firehose"
i = 1
@client.track("Facebook","facebook") do |status|
  # Decompose accented characters, then remove everything that isn't
  # an ASCII letter, digit, space, or common punctuation mark
  text = Unicode.decompose(status.text).delete(%q{^0-9A-Za-z ,.<>/?!;:'"+()*&^%$#@})
  mongo = Result.new
  mongo["tweet_id"] = status.id # lets on_delete find this tweet later
  mongo["url"] = "http://twitter.com"
  mongo["word"] = "Facebook"
  mongo["elements"] = [text]
  mongo.save
  puts i.to_s + ": " + text
  i += 1
end
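
If the cleanup step in the tracking block looks cryptic, here's a small standalone sketch of the same trick. To keep it free of extra gems I've swapped in Ruby's built-in `String#unicode_normalize(:nfd)` for the unicode gem's `Unicode.decompose`, so treat it as an approximation; `KEEP` and `clean_tweet` are names invented for the example.

```ruby
# Characters to keep. The leading ^ tells String#delete to remove everything
# that is NOT in the set; the later ^ is just a literal caret.
KEEP = %q{^0-9A-Za-z ,.<>/?!;:'"+()*&^%$#@}

def clean_tweet(text)
  # NFD splits an accented letter into a base letter plus a combining mark,
  # so deleting the mark leaves the plain ASCII letter behind.
  text.unicode_normalize(:nfd).delete(KEEP)
end

puts clean_tweet("Café on Facebook ☕ #yum")
```

The accent on "Café" gets stripped down to a plain "e", the emoji disappears entirely, and the hashtag survives because `#` is in the keep list.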

I’ve pulled in around 30 million tweets in the last two months or so, purging those that didn’t have any sentiment in them to get down to a data set of about 10 million tweets. Next up: the analysis, and playing around with Hadoop!
