One of the most common naive errors in statistics is to confuse correlation with causality. Common sense tells us that when two events co-occur, the first one is causing the second to happen (which often is the case). Red traffic lights correlate with cars stopping, and sure enough, we know that red lights cause (most) people to stop. But sometimes things correlate because a third, external mechanism is influencing both. Drownings increase as ice cream sales rise, but ice cream isn’t causing drownings. The external factor is summer, of course.
This is on my mind because over the last few days, when cold medicine hasn’t fogged my brain so much that I can’t think, or at least can’t think logically, I’ve been working with the Twitter APIs to see what I can come up with for tracking topics as they move around Twitter. I’m attracted to Twitter because its immediacy and brevity make it relatively easy to analyze.
Eventually, what I hope to do is find useful patterns in the interplay of words, Twitter screen names, cited URLs and hashtags (and any other entities that can be extracted). I’m focusing first on URLs. My friend Dave Land mentioned this morning that writing a tweet is like writing a headline; if so, then the URLs cited in those tweets are the stories behind the headlines.
I’ve put Python and SQL to work scraping statuses from Twitter, pulling out word pairs (which I plan to analyze, along with the other entities, via LSA), screen names and URLs. I’m resolving all the little URLs to the pages they actually point to: Twitter users, limited to 140 characters, frequently shorten links with services like TinyURL, but I want to see when people are citing the same page even if the shrunken URLs differ. In fact, the ratio of shrunken URLs to actual URLs is interesting in itself – if it is high, it means that a lot of people are finding the cited page independently, rather than retweeting it or getting it from the same source external to Twitter.
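The resolution step itself is simple in outline. Here is a sketch of the idea (simplified, not the production code), using only the Python standard library: issue a HEAD request and let urllib follow the redirect chain to wherever the shortened link finally lands.

```python
# A minimal sketch of the URL-resolution step: follow a shortened link's
# redirect chain to the page it finally points to. Error handling and
# caching are left out; some servers reject HEAD, so real code would also
# need a GET fallback.
from urllib.error import URLError
from urllib.request import Request, urlopen

def resolve_url(short_url, timeout=10):
    """Follow redirects from a shortened URL to the final page URL."""
    request = Request(short_url, method="HEAD")  # HEAD skips the page body
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.geturl()  # urlopen follows redirects itself
    except URLError:
        return short_url  # if resolution fails, keep the original

# resolve_url("http://tinyurl.com/example") -> the page it redirects to
```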
As I find cited URLs, I’m using Twitter’s search API to get the most recent mentions of them, then storing the identities of the users who also cited them and when they did so. That gives me a timeline of URL citations. I’m not tracking explicit retweets, so I don’t know whether the first people to cite a URL are more influential or not.
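The search-and-store step looks roughly like this sketch. I’m actually using MySQL, but SQLite keeps the example self-contained, and the details – the citations table, the q and rpp parameters, the from_user and created_at fields in the JSON – are stand-ins, so treat them as illustrative.

```python
# A rough sketch of the search-and-store step. SQLite stands in for MySQL,
# and the request/response details of the search endpoint are illustrative.
import json, sqlite3
import urllib.parse, urllib.request

def record_citations(db, url):
    """Search for recent mentions of a URL and store who cited it, and when."""
    query = urllib.parse.urlencode({"q": url, "rpp": 100})
    with urllib.request.urlopen(
            "http://search.twitter.com/search.json?" + query) as resp:
        results = json.load(resp).get("results", [])
    for tweet in results:
        db.execute(
            "INSERT INTO citations (url, screen_name, cited_at) VALUES (?, ?, ?)",
            (url, tweet["from_user"], tweet["created_at"]))
    db.commit()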
I haven’t asked Twitter to white-list me yet, so I’m working within the limitations of their API – 100 requests per hour. That forces me to be as smart as possible about how my code explores the data. I started by choosing somebody who has a decent number of followers, but not too many, so that it wouldn’t take too long to scrape the person’s followers’ tweets. I chose Tim O’Reilly because I suspect he is fairly influential on Twitter and we’ve had some conversations that go back to the mid-90s about how to figure out “what the Internet is thinking today.”
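One simple way to live within that cap is to pace every call: 3,600 seconds divided by 100 requests means waiting at least 36 seconds between calls. A sketch of the pacing (the wrapper name and shape are just for illustration):

```python
# Staying under 100 requests per hour by pacing calls:
# 3600 seconds / 100 requests = at most one request every 36 seconds.
import time

MIN_INTERVAL = 3600 / 100   # seconds between API calls
_last_call = 0.0

def throttled(fetch, *args, **kwargs):
    """Call fetch(...), sleeping first if needed to respect the hourly cap."""
    global _last_call
    wait = MIN_INTERVAL - (time.time() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.time()
    return fetch(*args, **kwargs)
```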
O’Reilly’s company was one of the first, if not the very first, to measure social media for market research. Many years ago, they were scraping Usenet to help decide which technologies would make good topics for books. I recall that one of the first decisions they made from that data was to choose between two open-source database projects, MySQL and mSQL. They chose MySQL… and therein lies another reminder of causality v. correlation. Did MySQL succeed because O’Reilly chose to focus on it, or did O’Reilly succeed because it chose the right books to publish? There is no way of knowing, but I have personal evidence that O’Reilly doesn’t always choose the right topics… or perhaps the right authors. That’s a story for another day.
After a lot of wrangling with Python, MySQL and technical issues having to do with Unicode – and my inability to write a correlated subquery under the influence of Sudafed – I have something working. It starts by scraping Tim’s recent tweets and then searches for people who cited the same URLs. Then it explores the people it has found to cite the greatest number of URLs overall, adding 10 at a time and re-ranking to decide whom to explore next.
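Condensed to its skeleton, the exploration loop looks something like this (scrape_user and find_citers are stand-ins for the scraping and search code above, and this simplified version ranks candidates only by how many of the URLs seen so far they have cited):

```python
# A condensed sketch of the exploration loop: rank unexplored users by how
# many known URLs they cite, expand the top batch, re-rank, repeat.
from collections import Counter

def explore(seed, scrape_user, find_citers, rounds=5, batch=10):
    explored = {seed}
    frontier = Counter()                      # candidate -> known URLs cited
    for url in scrape_user(seed):             # URLs the seed has cited
        for user in find_citers(url):
            if user not in explored:
                frontier[user] += 1
    for _ in range(rounds):
        for user in [u for u, _ in frontier.most_common(batch)]:
            del frontier[user]                # promote from frontier...
            explored.add(user)                # ...to the explored set
            for url in scrape_user(user):
                for citer in find_citers(url):
                    if citer not in explored:
                        frontier[citer] += 1  # re-ranked before next batch
    return explored
```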
I’ve been running this for a couple of days now in various forms. So far I have found about 7,000 unique URLs. Only about 300 of them are duplicates – different shrunken URLs for the same page. The URLs have been mentioned about 20,000 times by 7,500 users. I have found about 40,000 two-word phrases (stop words, URLs, screen names and hashtags are excluded) and 52,000 mentions of those phrases, which means that a number of phrases are being used by multiple people.
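The phrase extraction is about as simple as it sounds. A sketch of the filtering (the stop list here is a tiny placeholder for the real one):

```python
# A sketch of the two-word-phrase extraction: drop URLs, @screen-names,
# #hashtags and stop words, then pair up the words that remain.
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on"}

def word_pairs(tweet):
    """Yield adjacent two-word phrases from a tweet, after filtering."""
    text = re.sub(r"https?://\S+|[@#]\w+", " ", tweet.lower())
    words = [w for w in re.findall(r"[a-z']+", text) if w not in STOP_WORDS]
    return list(zip(words, words[1:]))

# word_pairs("Check out the emergency generator http://tinyurl.com/x")
# -> [('check', 'out'), ('out', 'emergency'), ('emergency', 'generator')]
```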
What the heck, here are the top phrases and the number of Twitters (Twitterers? Twits?) who used them over the last few days (remember, this is far from comprehensive):
I suspect that the words “check out” on Twitter are much like the words “click here” in the early days of the web. “Emergency generator” intrigued me, so I linked it above to Twitter search. Hint: it has to do with the Toyota Prius. I suspect those same people included a link to a New York Times article about it. Interestingly, a number of the people who cited it were not retweeting (at least not explicitly)… but many of them were using a shrunken URL cited by – guess who – Tim O’Reilly. The nice thing about the phrase’s popularity is that it gives my code a way to discover the other shrunken URLs in a single search, instead of having to scrape everything and resolve every shrunken URL to the actual page. Tim may or may not have influenced those people to look at the article, but he is clearly in tune with a topic people care about. That makes him interesting whether he is an influencer or just well-influenced, so to speak.
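That shortcut is worth a sketch of its own: one phrase search, then harvest every URL that appears in the matching tweets (search_tweets is a stand-in that yields tweet text, as in the search sketch above).

```python
# A sketch of the phrase-search shortcut: collect the distinct URLs cited
# in tweets matching a phrase, without resolving links one by one.
import re

def urls_for_phrase(phrase, search_tweets):
    """Collect the distinct URLs cited in tweets matching a phrase."""
    found = set()
    for tweet_text in search_tweets('"%s"' % phrase):
        found.update(re.findall(r"https?://\S+", tweet_text))
    return found

# A single call like urls_for_phrase("emergency generator", search_tweets)
# can turn up several different shrunken URLs for the same article.
```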
Time to publish this post, I guess, even though I’m tempted to wait until today’s cold medicine has worn off to proofread it one more time.
More results here as I come up with them.