Social media analytics for decision-making

07 Jan 09 Time to think about my Twitter data

Time to step back and consider what I’m doing with Twitter code and data.

Background: For the last two weeks, I have been writing code to find interesting URLs being cited in Twitter posts, or tweets.* I now have a database of about 100,000 Twitter users (I will not call them/us Tweeple!) who have cited 40,000 URLs and more than 200,000 two-word phrases that accompanied those URLs. The URLs have been mentioned 230,000 times (2.3 times per URL) and the phrases have been mentioned 330,000 times (1.6 times per phrase). I have gathered all of this data via the Twitter APIs within their constraint of making no more than 100 requests per hour. The primary public output of this work has been the Hot Twitter Cites list.

This morning, I’m going to try to take off my engineer hat, put on my product manager coat and consider what problem this could help solve and how to package it to meet that need. In other words, I’m going to try to extract some focus from my brainstorming. I’ll start by describing the data a bit.

Here’s a graph that shows how many people cited each URL for a one-day period.  A few URLs are cited many times, but the vast majority only pick up a handful of cites – this graph shows a very long tail.  

Figure: Users per URL cited

The pattern of citations per user has more depth.  In other words, this also has a long tail, but a fatter, uh, body.  This is good because it means that there are a lot of people citing URLs.  More people means more points of view.  

Figure: Citations per user
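A minimal Python sketch of how tallies like the two above might be produced from raw citation records. This is not the code behind the graphs: the function names, the `(user, url)` record shape and the sample data are all assumptions made for illustration, and it counts each user at most once per URL, which may differ from how the graphs were built.

```python
from collections import Counter, defaultdict

def tally_citations(citations):
    """citations: iterable of (user, url) pairs; repeated pairs count once."""
    users_by_url = defaultdict(set)
    urls_by_user = defaultdict(set)
    for user, url in citations:
        users_by_url[url].add(user)
        urls_by_user[user].add(url)
    users_per_url = {url: len(users) for url, users in users_by_url.items()}
    urls_per_user = {user: len(urls) for user, urls in urls_by_user.items()}
    return users_per_url, urls_per_user

def histogram(counts):
    """Map each count value to the number of items that have that count."""
    return dict(sorted(Counter(counts.values()).items()))

# Made-up sample records, purely for illustration.
sample = [
    ("alice", "http://example.com/a"),
    ("bob", "http://example.com/a"),
    ("carol", "http://example.com/a"),
    ("alice", "http://example.com/b"),
    ("bob", "http://example.com/c"),
    ("alice", "http://example.com/a"),  # repeat cite, counted once
]
users_per_url, urls_per_user = tally_citations(sample)
url_hist = histogram(users_per_url)    # users-per-URL distribution
user_hist = histogram(urls_per_user)   # citations-per-user distribution
```

Plotting either histogram with the count on one axis and the number of URLs (or users) on the other gives the kind of long-tail picture described here.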

I’d be happier to see a greater variety of URLs being cited, but I’m not going to argue with the data… and I would expect (and hope) that the variety of cited URLs will rise as Twitter attracts a more diverse user base.

I’m generating a score for each user, based on how early they cite a URL that becomes popular.  The URLs listed in the hot cites page are chosen partly because they were cited by people who tended to cite popular URLs in the past.   I want to be sure that this isn’t redundant to how many people follow them.  If it is, then there’s no point in doing all these calculations, I could just watch for the URLs cited by the people with the most followers.  Here is a log-log scatterplot of my scoring v. follower counts.
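To make the idea of an "early citer" score concrete, here is one way such a score could be computed. The post does not give the actual formula, so everything here is an assumption: the `min_citers` popularity threshold, the `1 - k/n` credit rule, the function name and the sample data are hypothetical stand-ins.

```python
from collections import defaultdict

def early_citation_scores(citations, min_citers=3):
    """citations: iterable of (user, url, timestamp) tuples.

    Hypothetical rule: for every URL cited by at least `min_citers`
    distinct users, the k-th user to cite it (k = 0 for the first)
    earns 1 - k/n points, where n is the number of distinct citers.
    """
    first_cite = defaultdict(dict)  # url -> {user: earliest timestamp}
    for user, url, ts in citations:
        seen = first_cite[url]
        if user not in seen or ts < seen[user]:
            seen[user] = ts

    scores = defaultdict(float)
    for url, seen in first_cite.items():
        n = len(seen)
        if n < min_citers:
            continue  # URL never became popular, so nobody earns credit
        # Order citers by time; break timestamp ties by user name.
        ordered = sorted(seen, key=lambda u: (seen[u], u))
        for k, user in enumerate(ordered):
            scores[user] += 1 - k / n
    return dict(scores)

# Made-up sample: "u1" becomes popular, "u2" does not.
sample = [
    ("alice", "u1", 1), ("bob", "u1", 2), ("carol", "u1", 3), ("dave", "u1", 4),
    ("dave", "u2", 1), ("alice", "u2", 5),
]
scores = early_citation_scores(sample)
```

Under this rule a user's score depends only on timing relative to other citers, not on audience size, which is why it is worth checking against follower counts.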

 

Figure: Score v. follower count (log-log)

This is good. If the two data sets had a linear or power law relationship, the dots in the scatterplot would be clustered around a line. They are obviously not, which means that whatever I’m calculating, it is substantially different from ranking based on how many followers the citing user has. I’d like to see a comparison between my score and each user’s follows/followers (a/k/a friend/follower) ratio, but I’ve just started gathering the “follows” (friends) numbers.
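The eyeball test on the scatterplot can be backed with a number. A power law y = c·x^a becomes the straight line log y = log c + a·log x on log-log axes, so the Pearson correlation of the logged values should be near ±1 if such a relationship exists and near 0 if it does not. The sketch below assumes that approach; the function name and both data sets are made up, and it is not the analysis behind the plot.

```python
import math

def log_log_correlation(pairs):
    """Pearson correlation of log10(x) and log10(y).

    Pairs containing a non-positive value are skipped, since their
    logarithm is undefined.
    """
    logs = [(math.log10(x), math.log10(y)) for x, y in pairs if x > 0 and y > 0]
    n = len(logs)
    if n < 2:
        raise ValueError("need at least two positive pairs")
    mean_x = sum(lx for lx, _ in logs) / n
    mean_y = sum(ly for _, ly in logs) / n
    cov = sum((lx - mean_x) * (ly - mean_y) for lx, ly in logs)
    var_x = sum((lx - mean_x) ** 2 for lx, _ in logs)
    var_y = sum((ly - mean_y) ** 2 for _, ly in logs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up data following an exact power law, y = 2 * x**1.5.
power_law = [(x, 2 * x ** 1.5) for x in (1, 10, 100, 1000)]
r_power = log_log_correlation(power_law)      # close to 1

# Made-up data with no consistent trend.
scattered = [(1, 100), (10, 1), (100, 1000), (1000, 10)]
r_scattered = log_log_correlation(scattered)  # close to 0
```

The same function would work for the friend/follower ratio once those numbers are gathered: pass `(score, ratio)` pairs instead of `(score, followers)`.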

Still, I’m not surprised.  Follower relationships on Twitter do not imply significant connections between people, for several reasons:

  • Many popular Twitter “users” are not people at all.  They are aggregators, robots that spew, I mean stream, headlines.
  • Twitter celebrities (I really will not say Twitterarati!) have far too many followers to have a significant relationship with most of them.
  • Some people follow others on Twitter simply to induce them to follow back, which creates the appearance of popularity.

(This topic itself is fairly popular, as demonstrated by the fact that The 10 Users You’ll Meet on Twitter was cited by 120 people in the last few days, which puts it in the top 2 percent of URLs cited.)

More to come as I have time.

 

* In a strange coincidence, around the same time I started this, I added to my office a clock that tweets a bird call for each hour. My wife made me take out the tweeter batteries.  Mute birdies are staring at me.
