By meatshield — Apr 16, 2013

Twitter, you're doing it right!

So I’ve spent the last few weeks getting into web crawling. Not for any really good reason, other than that I can.

I started out using Cobweb, a Ruby web crawler that had good ratings, just to see what it could do. And I started with my favorite site on the Internet, Reddit. After crawling about 20K pages, I stopped getting responses from Reddit. Going to their site gave me a wonderful warning about how I had been banned for hitting the site too much. I looked, and I’d been hitting them 20 times a second! Whoops!

For those not in the know, every site needs to worry about denial-of-service attacks, where an attacker tries to take down a web site by hitting it over and over, much like I was doing. The usual defense is to rate-limit or block the offending IP address, which is what Reddit did to me.

After that I started writing my own crawling system, one that could not only scale out but also avoid getting banned from every site in existence. A slow crawler, if you will.
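
To give a feel for what “slow” means here, this is a minimal sketch of a per-host throttle, not the crawler I actually built. The class name PoliteThrottle, the wait_for method, and the injectable clock and sleeper are all made up for illustration; the only idea is to remember when each host was last hit and wait out the rest of a minimum interval before hitting it again.

```ruby
require 'uri'

# Hypothetical helper: enforces a minimum delay between requests to one host.
class PoliteThrottle
  # min_interval: seconds to leave between two hits on the same host.
  # clock and sleeper are injectable so the logic can be tested without waiting.
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @last_hit = {} # host => time of the most recent request
  end

  # Waits until it is polite to hit the URL's host again.
  # Returns the number of seconds it waited.
  def wait_for(url)
    host = URI(url).host
    waited = 0.0
    if (last = @last_hit[host])
      remaining = @min_interval - (@clock.call - last)
      if remaining > 0
        @sleeper.call(remaining)
        waited = remaining
      end
    end
    @last_hit[host] = @clock.call
    waited
  end
end
```

Calling throttle.wait_for(url) right before each fetch, with something like PoliteThrottle.new(2.0), would cap the crawl at one request every two seconds per host, a long way from 20 a second.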

While working on that, I started playing around with sentiment analysis, or the analysis of what people say to determine if they are positive or negative regarding something. It’s a pretty hard problem, so I started with a very naive method: matching the words in a text against a dictionary of words with known sentiment that I found online. Just for something to play with.
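
Roughly, the naive approach looks like the sketch below. The two word lists are tiny stand-ins I made up for this example, not the dictionary I found online, and the method names are hypothetical too. The score is simply the number of positive words minus the number of negative words.

```ruby
# Hypothetical word lists; a real run would load a much larger dictionary.
POSITIVE = %w[good great love awesome happy].freeze
NEGATIVE = %w[bad awful hate terrible sad].freeze

# Counts positive words minus negative words in the text.
def sentiment_score(text)
  words = text.downcase.scan(/[a-z']+/)
  words.count { |w| POSITIVE.include?(w) } - words.count { |w| NEGATIVE.include?(w) }
end

# Turns the score into a coarse label.
def sentiment_label(text)
  score = sentiment_score(text)
  return :positive if score > 0
  return :negative if score < 0
  :neutral
end
```

This is exactly as naive as it sounds: “not good” still counts as positive, because each word is looked at on its own.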

I needed data to run this on, and I hit upon a source. Twitter has had this mythical “firehose” for a while that streams every tweet to whoever is connected. With the right access, you can get the entirety of Twitter and analyze it as fast as you can process it. Pretty much what I needed: test data!

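So I found a gem called TweetStream that let me hook into Twitter’s streaming API (who knew it was a public thing anyone could use?!), and set it up to pull in tweets matching a keyword. The full firehose needs special access from Twitter, but the filtered stream is open to anyone with API keys. Here’s the code for your amusement: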


require 'tweetstream'
require 'json'
require 'unicode'
require 'awesome_print'
require 'mongoid'

Mongoid.load!("config/mongoid.yml", :development)

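# No fields are declared here; the attributes set further down rely on
# Mongoid's dynamic fields (Mongoid 4 and later need
# Mongoid::Attributes::Dynamic included in the class for that).
class Result
  include Mongoid::Document
end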

TweetStream.configure do |config|
  config.consumer_key       = 'GetFromTwitter'
  config.consumer_secret    = 'GetFromTwitter'
  config.oauth_token        = 'GetFromTwitter'
  config.oauth_token_secret = 'GetFromTwitter'
  config.auth_method        = :oauth
end

@client = TweetStream::Client.new

@client.on_error do |message|
  puts "ERROR: " + message
  puts "sleeping for 1 minute"
  sleep(60)
end

@client.on_delete do |status_id, user_id|
  puts "deleting " + status_id.to_s
  # Remove any stored tweet with this id (the id is saved in the track block below)
  Result.where(status_id: status_id).delete_all
end

@client.on_limit do |skip_count|
  # skip_count is a number, so convert it before concatenating
  puts "told to limit " + skip_count.to_s
  sleep(skip_count * 5)
end

@client.on_enhance_your_calm do
  puts "told to calm down. Sleeping 1 minute"
  sleep(60)
end

puts "starting Twitter firehose"
i = 1
@client.track("Facebook", "facebook") do |status|
  # Decompose accented characters, then strip everything except ASCII
  # letters, digits, spaces, and common punctuation
  text = Unicode.decompose(status.text).delete(%q{^0-9A-Za-z ,.<>/?!;:'"+()*&^%$#@})
  mongo = Result.new
  mongo["status_id"] = status.id # lets on_delete find this tweet later
  mongo["url"] = "http://twitter.com"
  mongo["word"] = "Facebook"
  mongo["elements"] = [text]
  mongo.save
  puts "#{i}: #{text}"
  i += 1
end
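
The text cleanup is the fiddly part of that script, so here it is pulled out on its own as a small sketch. This version swaps the unicode gem for Ruby’s built-in String#unicode_normalize (Ruby 2.2 and later) so it runs without extra gems; the names KEEP_SET and clean_tweet are made up for this example. Decomposing first splits an accented letter into a base letter plus a combining mark, and the delete call then drops the mark along with anything else outside the keep list.

```ruby
# Characters to keep, written as a negated set for String#delete:
# the leading ^ means "delete everything that is NOT listed here".
KEEP_SET = %q{^0-9A-Za-z ,.<>/?!;:'"+()*&^%$#@}

# Decomposes accented characters, then strips everything outside KEEP_SET.
def clean_tweet(text)
  text.unicode_normalize(:nfd).delete(KEEP_SET)
end
```

For example, clean_tweet("Café ☕ rocks!") gives "Cafe  rocks!": the accent and the emoji are dropped, and the two spaces that surrounded the emoji are left behind.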

I’ve pulled in around 30 million tweets in the last two months or so, purging those that didn’t have any sentiment in them to get down to a data set of about 10 million tweets. Next up, the analysis, playing around with Hadoop!
