Social media analytics for decision-making

07 Jan 09 Time to think about my Twitter data

Time to step back and consider what I’m doing with Twitter code and data.

Background: For the last two weeks, I have been writing code to find interesting URLs being cited in Twitter posts, or tweets.*  I now have a database of about 100,000 Twitter users (I will not call them/us Tweeple!) who have cited 40,000 URLs and more than 200,000 two-word phrases that accompanied those URLs.  The URLs have been mentioned 230,000 times (2.3 times per URL) and the phrases have been mentioned 330,000 times (1.6 times per phrase).  I have gathered all of this data via the Twitter APIs within their constraint of making no more than 100 requests per hour.  The primary public output of this work has been the Hot Twitter Cites list.

This morning, I’m going to try to take off my engineer hat, put on my product manager coat and consider what problem this could help solve and how to package it to meet that need.  In other words, I’m going to try and extract some focus from my brainstorming.  I’ll start by describing the data a bit.

Here’s a graph that shows how many people cited each URL for a one-day period.  A few URLs are cited many times, but the vast majority only pick up a handful of cites – this graph shows a very long tail.  

Users per URL cited

The pattern of citations per user has more depth.  In other words, this also has a long tail, but a fatter, uh, body.  This is good because it means that there are a lot of people citing URLs.  More people means more points of view.  

Citations per user

I’d be happier to see a greater variety of URLs being cited, but I’m not going to argue with the data… and I would expect (and hope) that the variety of cited URLs will rise as Twitter attracts a more diverse user base.  

I’m generating a score for each user, based on how early they cite a URL that becomes popular.  The URLs listed in the hot cites page are chosen partly because they were cited by people who tended to cite popular URLs in the past.   I want to be sure that this isn’t redundant to how many people follow them.  If it is, then there’s no point in doing all these calculations, I could just watch for the URLs cited by the people with the most followers.  Here is a log-log scatterplot of my scoring v. follower counts.

 

Score v. follower count (log-log)

This is good.  If the two data sets had a linear or power law relationship, the dots in the scatterplot would be clustered around a line.  They are obviously not, which means that whatever I’m calculating, it is substantially different from ranking based on how many followers the citing user has.   I’d like to see a comparison between my score and each user’s follows/followers (a/k/a friend/follower) ratio, but I’ve just started gathering the “follows” (friends) numbers.

Still, I’m not surprised.  Follower relationships on Twitter do not imply significant connections between people, for several reasons:

  • Many popular Twitter “users” are not people at all.  They are aggregators, robots that spew, I mean stream, headlines.
  • Twitter celebrities (I really will not say Twitterarati!) have far too many followers to have a significant relationship with most of them.
  • Some people follow others on Twitter simply to induce them to follow back, creating the appearance of popularity.

(This topic itself is fairly popular, as demonstrated by the fact that The 10 Users You’ll Meet on Twitter was cited by 120 people in the last few days, which puts it in the top 2 percent of URLs cited.)
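The post does not spell out how the per-user score is calculated, so the following is only a rough sketch of the general idea (rewarding users who cite a URL early when that URL later becomes popular). The function name `early_citer_scores`, the `min_citers` threshold and the scoring rule itself are all assumptions made for illustration, not the actual method.

```python
from collections import defaultdict


def early_citer_scores(citations, min_citers=3):
    """Score users by how early they cite URLs that become popular.

    citations: iterable of (user, url, timestamp) tuples.
    Hypothetical rule: for every URL cited by at least `min_citers`
    distinct users, each citer earns the fraction of the other citers
    who came after them (1.0 for the first citer, 0.0 for the last).
    """
    # url -> {user: earliest timestamp at which that user cited the url}
    first_cite = defaultdict(dict)
    for user, url, ts in citations:
        seen = first_cite[url]
        if user not in seen or ts < seen[user]:
            seen[user] = ts

    scores = defaultdict(float)
    for url, seen in first_cite.items():
        # Need at least two citers for "earlier than others" to mean anything.
        if len(seen) < max(min_citers, 2):
            continue
        ordered = sorted(seen, key=seen.get)  # earliest citer first
        n = len(ordered)
        for rank, user in enumerate(ordered):
            scores[user] += (n - 1 - rank) / (n - 1)
    return dict(scores)
```

A score built this way depends only on citation timing, which is why it is worth checking against follower counts before trusting it as a separate signal.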

More to come as I have time.

 

* In a strange coincidence, around the same time I started this, I added to my office a clock that tweets a bird call for each hour. My wife made me take out the tweeter batteries.  Mute birdies are staring at me.


30 Dec 08 Influencing and being influenced; tracking topics on Twitter

One of the most common naive errors in statistics is to confuse correlation with causality.  Common sense tries to tell us that when two events co-occur, the first one is causing the second to happen (which often is the case).  Red traffic lights correlate with cars stopping and sure enough, we know that red lights cause (most) people to stop.  But sometimes things correlate because a third, external mechanism is influencing them.  Drownings increase as ice cream sales rise, but ice cream isn’t causing drownings.  The external factor is summer, of course.

This is on my mind because over the last few days, when cold medicine hasn’t fogged my brain up so much that I couldn’t think, or at least couldn’t think logically, I’ve been working with the Twitter APIs to see what I could come up with in terms of tracking topics as they move around on Twitter.   I’m attracted to Twitter because its immediacy and brevity make it relatively easy to analyze.

Eventually, what I hope to do is find useful patterns in the interplay of words, Twitter screen names, URLs cited and hashtags (and any other entities that could be extracted).  I’m focusing first on URLs.  My friend Dave Land mentioned this morning that writing a tweet is like writing a headline.  If so, then the cited URLs in those tweets are like the stories behind the headlines.  

I’ve put Python and SQL to work scraping statuses from Twitter, pulling out word pairs (I’m planning to analyze them with the other entities via LSA), screen names and URLs.  I’m resolving all the little URLs to the pages they actually point to, since Twitter users, limited to 140 characters, frequently use services like TinyURL to shorten them, but I want to see when people are citing the same URLs even if the shrunken URLs are different.  In fact, looking at the ratio of shrunken URLs to actual URLs is interesting – if it is high, that means that a lot of people are finding the cited page independently, rather than retweeting it or getting it from the same source external to Twitter.
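As a minimal sketch of the shrunken-to-actual ratio described above: the resolver is passed in as a function, so the names `group_by_target`, `short_to_actual_ratio` and `resolve` are assumptions for illustration. A real resolver would have to follow HTTP redirects (for instance with `urllib.request.urlopen(url).geturl()`), which is left out here so the example stays self-contained.

```python
from collections import defaultdict


def group_by_target(short_urls, resolve):
    """Group shortened URLs by the page they actually point to.

    resolve: hypothetical callable mapping a short URL to its final URL.
    """
    groups = defaultdict(set)
    for short in short_urls:
        groups[resolve(short)].add(short)
    return dict(groups)


def short_to_actual_ratio(groups):
    """Average number of distinct shortened URLs per actual page."""
    if not groups:
        return 0.0
    total_short = sum(len(shorts) for shorts in groups.values())
    return total_short / len(groups)


# Made-up example data standing in for real redirect lookups.
fake_redirects = {
    "http://tinyurl.example/a": "http://example.com/story",
    "http://short.example/b": "http://example.com/story",
    "http://tinyurl.example/c": "http://example.com/other",
}
groups = group_by_target(fake_redirects, fake_redirects.get)
ratio = short_to_actual_ratio(groups)
```

A ratio near 1 would suggest that most people are passing along the same shortened link, while a higher ratio would suggest more independent discovery.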

As I find cited URLs, I’m using Twitter’s search API to get the most recent mentions of them, then storing the identities of the users who also cited them and when they did so.  That gives me a timeline of URL citations.  I’m not tracking explicit retweets, so I don’t know whether the first people to cite a URL are more influential or not.

I haven’t asked Twitter to white-list me yet, so I’m working within the limitations of their API – 100 requests per hour.  That forces me to be as smart as possible about how my code explores the data.  I started by choosing somebody who has a decent number of followers, but not too many, so that it wouldn’t take too long to scrape the person’s followers’ tweets.  I chose Tim O’Reilly because I suspect he is fairly influential on Twitter and we’ve had some conversations that go back to the mid-90s about how to figure out “what the Internet is thinking today.”  
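One way to stay inside a limit like 100 requests per hour is to keep a small budget object that says how long to wait before the next call. This is a sketch under assumptions: the class name `RequestBudget` is hypothetical, and it uses a sliding one-hour window, which may be stricter than the way Twitter actually counts requests.

```python
import time


class RequestBudget:
    """Allow at most `limit` requests in any `window` seconds."""

    def __init__(self, limit=100, window=3600.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock   # injectable so the logic can be tested without waiting
        self.stamps = []     # times of recent requests, oldest first

    def seconds_until_allowed(self):
        """Return 0.0 if a request may be made now, else the wait in seconds."""
        now = self.clock()
        # Forget requests that have aged out of the window.
        self.stamps = [t for t in self.stamps if now - t < self.window]
        if len(self.stamps) < self.limit:
            return 0.0
        return self.window - (now - self.stamps[0])

    def record(self):
        """Note that a request was just made."""
        self.stamps.append(self.clock())
```

In a scraping loop, the idea would be to call `time.sleep(budget.seconds_until_allowed())` before each API request and `budget.record()` right after it.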

O’Reilly’s company was one of the first, if not the very first, to measure social media for market research.  Many years ago, they were scraping Usenet to help decide which technologies would make good topics for books.  I recall that one of the first decisions they made from that data was to choose between two open-source database projects, MySQL and mSQL.  They chose MySQL… and therein lies another reminder of causality v. correlation.  Did MySQL succeed because O’Reilly chose to focus on it, or did O’Reilly succeed because it chose the right books to publish?  There is no way of knowing, but I have personal evidence that O’Reilly doesn’t always choose the right topics… or perhaps the right authors.  That’s a story for another day.

After a lot of wrangling with Python, MySQL and technical issues having to do with Unicode and my inability to write a correlated subquery under the influence of Sudafed, I have something working.  It started by scraping Tim’s recent tweets and then searched for people who also cited the same URLs.  Then it explores the people it has found who cite the greatest number of URLs overall.  It adds 10 people at a time and then re-ranks to see who it should explore next.
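The exploration loop described above might look roughly like this. It is a sketch, not the actual code: `crawl`, `fetch_cited_urls` and `find_citers` are hypothetical names, and the two callables stand in for the Twitter API requests (a user's recent tweets, and a search for other people citing a URL).

```python
from collections import defaultdict


def crawl(seed, fetch_cited_urls, find_citers, batch_size=10, max_batches=3):
    """Explore outward from `seed`, favoring users known to cite many URLs.

    fetch_cited_urls(user) -> iterable of URLs the user cited (assumed helper)
    find_citers(url)       -> iterable of users who cited the URL (assumed helper)
    """
    known_cites = defaultdict(set)  # user -> URLs we have seen them cite
    searched = set()                # URLs we have already searched for
    explored = []                   # users already scraped, in order

    def explore(user):
        explored.append(user)
        for url in fetch_cited_urls(user):
            known_cites[user].add(url)
            if url in searched:
                continue
            searched.add(url)
            for citer in find_citers(url):
                known_cites[citer].add(url)

    explore(seed)
    for _ in range(max_batches):
        candidates = [u for u in known_cites if u not in explored]
        if not candidates:
            break
        # Re-rank before every batch: most known citations first, ties by name.
        candidates.sort(key=lambda u: (-len(known_cites[u]), u))
        for user in candidates[:batch_size]:
            explore(user)
    return explored, dict(known_cites)
```

Re-ranking between batches matters because each explored user can reveal new citations for people who are still waiting in the queue.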

I’ve been running this for a couple of days now in various forms.  So far I have found about 7,000 unique URLs.  Only about 300 of them are duplicates – different shrunken URLs for the same page.  The URLs have been mentioned about 20,000 times by 7,500 users.  I have found about 40,000 two-word phrases (stop words, URLs, screen names and hashtags are excluded) and 52,000 mentions of those phrases, which means that a number of phrases are being used by multiple people.

What the heck, here are the top phrases and the number of Twitters (Twitterers? Twits?) who used them over the last few days (remember, this is far from comprehensive):

  • check out 52
  • blog post 47
  • New Year 36
  • social media 34
  • New blog 33
  • new years 29
  • New York 25
  • ice storm 24
  • sad true 23
  • emergency generator 23
  • about attitudes 22
  • gas tax 21
  • mornings paper 19
  • Attention Influence 19
  • one best 19
  • Equal Authority 19
  • prices people 18
  • Jeff Jarvis 18
  • Its morning 18
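A list like the one above could be produced along these lines. This is a sketch under assumptions: the stop-word list is a tiny made-up sample, punctuation and apostrophes are simply dropped, and excluded tokens (stop words, URLs, screen names, hashtags) are removed before pairing, which would explain phrases such as “sad true”. The names `word_pairs` and `users_per_phrase` are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "but", "for", "in", "is", "it", "of", "on", "the", "this", "to"}

# Matches a whole URL, or a word optionally prefixed with @ or #.
TOKEN = re.compile(r"https?://\S+|[@#]?\w[\w'’]*")


def word_pairs(text):
    """Return adjacent two-word phrases after dropping excluded tokens."""
    words = []
    for tok in TOKEN.findall(text):
        if tok.startswith(("http://", "https://", "@", "#")):
            continue  # skip URLs, screen names and hashtags
        word = tok.replace("'", "").replace("’", "")
        if word.lower() in STOP_WORDS:
            continue
        words.append(word)
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def users_per_phrase(tweets):
    """tweets: iterable of (user, text). Returns [(phrase, user_count)], most used first."""
    users = defaultdict(set)
    for user, text in tweets:
        for phrase in word_pairs(text):
            users[phrase].add(user)
    return sorted(((p, len(u)) for p, u in users.items()), key=lambda item: (-item[1], item[0]))
```

Counting distinct users rather than raw mentions keeps one chatty account from pushing a phrase to the top. Phrases are compared case-sensitively here, which is also an assumption.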

I suspect that the words “check out” on Twitter are much like the words “click here” were in the early days of the web.  “Emergency generator” intrigued me, so I linked it above to Twitter search.  Hint: it has to do with the Toyota Prius.  I suspect those same people included a link to a New York Times article about it.  Interestingly, a number of the people who cited it were not retweeting (at least not explicitly)… but many of them were using a shrunken URL cited by – guess who – Tim O’Reilly.  The interesting thing about the popularity of the phrase is that it gives my code a way to discover the other shrunken URLs in a single search, instead of having to scrape everything and resolve every shrunken URL to the actual page.  Tim may or may not have influenced those people to look at the article, but it is clear that he is in tune with a topic that people are interested in, which makes him interesting, whether he is an influencer or just well-influenced, so to speak.

Time to publish this post, I guess, even though I’m tempted to wait until today’s cold medicine has worn off to proofread it one more time.

More results here as I come up with them.


06 Dec 08 Bad assumptions

In a lengthy conversation with a fairly well-known analytics thought leader earlier this year, I was startled when he mentioned that he didn’t know much about statistics.  We were talking about my preference for simplicity and I was describing methods of eliminating redundancy.  I gave up.  While I would not claim to be an expert on statistics, I took the essential classes in college, I’ve been mentored by a couple of Ph.D. statisticians and I do know the parts that are important to the kind of analytics I do.

If there is one thing I’d urge on anyone doing this kind of work, it is to be in the habit of examining the assumptions behind your metrics and models.  Bad assumptions lead to ambiguous or meaningless data.

Here is my favorite bad assumption: it is good when visitors view more pages.  No.  That behavior might mean bad site design.  If people can accomplish their goals by clicking fewer pages, that generally would be considered an improvement.  “More page views” is good for increasing the frequency of advertising, which was pretty much an unquestioned good thing before the Internet came along.  It is not necessarily true any more, so it has become a bad assumption.

Here’s a bad assumption that should be thought-provoking: Some of the values will be equal to the average.  Wrong.  My favorite example is that nobody has the average number of arms.  Think about it.

Perhaps the most common bad assumption that leads to statistical struggles is that data is distributed normally.  Plenty has been written about this subject and I won’t try to summarize it, but here are some things to remember:

  • A lot of data that isn’t normally distributed becomes roughly normal once you take the logarithm of each value.  That’s part of why log-log scatterplots are such a great tool.
  • Unlike means and standard deviations, medians and other percentile-based statistics (such as the interquartile range) are “robust,” meaningful even when data is not normally distributed.  Rely on them heavily.
  • Get to know the Poisson distribution.  It shows up frequently in social media.
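To make the point about robustness concrete, here is a small sketch using made-up, long-tailed numbers (think page views per visitor). It compares the mean and median with and without a single extreme value, and also computes the mean on a log scale. The function name `summarize` and the data are assumptions for illustration only.

```python
import math
import statistics


def summarize(values):
    """Return the mean, median and log-scale mean of positive values."""
    logs = [math.log10(v) for v in values]  # requires every value to be > 0
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Mean taken on the log scale, converted back to original units.
        "log_mean": 10 ** statistics.mean(logs),
    }


typical = [1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 40]  # made-up long-tailed data
with_outlier = typical + [2500]                 # one extreme visitor

before = summarize(typical)
after = summarize(with_outlier)
```

With these numbers the median barely moves when the extreme value is added, while the mean jumps by more than an order of magnitude.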


26 Nov 08 Measuring redundancy – the joy of log-log scatterplots

In my previous post here, I mentioned that some variables are not worth including because they essentially measure the same thing. I’ll write a longer piece about this later, but it has been bugging me all day that I didn’t explain the basics of identifying redundant variables. In addition to simplifying complex algorithms, this can also help you avoid “data smog” in your reports. To repeat an example, I don’t think there is any reason to report both page views and time on site. They measure essentially the same thing (please write to me if you find that this is consistently not so on your site, I’d love to hear about it). At the same time, it is a good idea to monitor such relationships and trigger an alert if they stop correlating.

The first thing I almost always do when I’m examining pairs of metrics is generate a log-log scatterplot. As long as you don’t try to graph too many points, Excel is a fine tool for creating log-log scatterplots. All you need to do is generate pairs of values in two columns. For example, if you are comparing page views to time on site for visitors, you would create one row in Excel for each visitor, with two values – the number of page views and the time on site. The order doesn’t matter. The scales of the two numbers should be roughly comparable, so you’d probably want to have time on site in minutes. You should be looking at enough data to put the number of points into the thousands or higher; otherwise it might be hard to see patterns. Labels don’t matter much, either. Select your data, insert a scatterplot, then go into formatting for the X and Y axes and set each scale to logarithmic. That’s it.

You’re looking for something very simple in the scatterplot – whether or not the points cluster into a line or single area. If they are all over the chart, then there is no direct correlation between the two metrics – they are not redundant. If they cluster together, even if there are some outliers (and there usually are), then the two metrics are mostly redundant and you probably can drop one of them.
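One possible way to put a number on what the scatterplot shows is to compute the ordinary (Pearson) correlation on the logarithms of the two metrics. This is an assumption about how one might quantify the relationship, not a method described in this post, and the function name `loglog_correlation` is hypothetical. Pairs with a zero or negative value are skipped because their logarithm is undefined.

```python
import math


def loglog_correlation(xs, ys):
    """Pearson correlation of log10(x) and log10(y) over positive pairs."""
    pairs = [(math.log10(x), math.log10(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two positive pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("one of the metrics is constant")
    return sxy / math.sqrt(sxx * syy)
```

A result close to 1 or -1 corresponds to points hugging a line on the log-log chart (likely redundant metrics), while a result near 0 corresponds to points scattered all over it.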

I’ll write more about this later, including ways to quantify the relationship, why the outliers (the points that are away from the cluster) might be interesting and more.
