Social media analytics for decision-making

26 Jan 09 I think I have developed a Twitter aggregator finder

Blocking aggregators from Tweetsnet is turning into a whack-a-mole game.  As I block them, others pop up – and some of them are brand-new accounts.  This means I’ll have to prioritize creating an algorithm to block them rapidly.  Weird stuff pops into the feed, like a series of wrestling-related posts.

Meanwhile, I’m working on a couple of experimental vertical feeds, on web analytics and social media, since I’m interested in those subjects and I suspect they get a fair bit of attention from Twitter users.

Looking at the most common phrases that Tweetsnet is finding, it seems that the most talked-about subjects are the generally popular ones – President Obama, the Super Bowl, peanut butter, Steve Jobs and “Slumdog Millionaire.”  But that may be because I’ve only recently pared down the aggregators.  We’ll see what develops.

It seems to me that Twitter is related to blogs and search engines as radio has been related to newspapers.  Radio was usually first to cover breaking news; newspapers covered the same events in greater depth.  I’m using past tense because radio and newspapers are changing these days, already fairly different from when I worked in those businesses.

25 Jan 09 Tweetsnet progress

Still debugging Tweetsnet… found a problem in scoring, which was limiting the variety of people whose cites could score anything at all.  That’s fixed, along with more minor items that I’m finding as I review the code.

I have added Twitter Trends as a source for discovering more URL citations.  The system periodically grabs the Trends list (had to learn a little JSON to do that) and combines the phrases with “http” to find trending tweets that also have URLs in them.
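
For the curious, here is a rough Python sketch of the "trend phrase plus http" idea. It is not the Tweetsnet code. The JSON shape (a top-level "trends" list of objects with a "name" field) and the function name trend_queries are assumptions for illustration, and the sketch only builds the query strings; it does not call any Twitter API.

```python
import json


def trend_queries(trends_json: str) -> list[str]:
    """Build one search query per trending phrase, each also requiring "http".

    Assumes (hypothetically) a payload shaped like
    {"trends": [{"name": "..."}, ...]}.
    """
    payload = json.loads(trends_json)
    queries = []
    for trend in payload.get("trends", []):
        name = trend.get("name", "").strip()
        if name:
            # Quote the phrase so multi-word trends are searched as a unit.
            queries.append(f'"{name}" http')
    return queries


sample = '{"trends": [{"name": "Super Bowl"}, {"name": "Obama"}]}'
print(trend_queries(sample))
```

Each resulting string would then be handed to whatever search call the gatherer uses.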

The changes I made earlier today have greatly slowed down the number of published items, as the system builds up information about non-aggregator sources.

25 Jan 09 The problem of robots

A handful of robotic news aggregators have taken over Tweetsnet… Twitter users that spew volumes of URLs an hour, which makes them appear to be on top of whatever is new.  I’m planning to exclude them by algorithm (just the raw number of tweets is a good clue, but they also usually follow few people), but for now I’m excluding the big ones manually.

The idea of Tweetsnet is to leverage smart people, not dumb robots.
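
A minimal sketch of what that algorithm might look like, using only the two clues mentioned above (tweet volume and how few people the account follows). The Account fields, the function name and both thresholds are hypothetical; real cutoffs would have to come from looking at the data.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    tweets_per_hour: float  # observed posting rate
    following: int          # how many accounts this one follows


def looks_like_aggregator(acct: Account,
                          max_tweets_per_hour: float = 10.0,
                          min_following: int = 5) -> bool:
    """Flag an account only when both clues agree.

    The thresholds are made-up placeholders, not measured values.
    """
    high_volume = acct.tweets_per_hour > max_tweets_per_hour
    follows_few = acct.following < min_following
    return high_volume and follows_few


accounts = [Account("feedbot", 40.0, 0), Account("person", 2.0, 150)]
print([a.name for a in accounts if looks_like_aggregator(a)])
```

Requiring both clues is the conservative choice: a chatty human who follows plenty of people is left alone.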

24 Jan 09 You can now follow Tweetsnet on Twitter

You can now receive the stream of Tweetsnet postings by following @Tweetsnet on Twitter. Still having some weirdness with document titles sometimes… still working on it.

@Tweetsnet is also now automatically following anybody who posted a URL that it published. We’ll see how long that works…

23 Jan 09 Tweetsnet

The social network analysis I’ve been doing on Twitter turned into a new site called Tweetsnet, which shows web pages that are hot topics on Twitter. It’s a blog, with a feed. It updates every 10 minutes or so with the five highest scoring, previously unpublished, web pages being talked about.

Each post shows the page title, summary and keywords (as tags) if available, and frequent two-word phrases that appear in conjunction with the page citations.
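
Here is a rough sketch of how two-word phrases could be counted across the tweets that cite a page. It is an illustration, not the production code: the tokenizing rule and the function names are assumptions, and there is no stop-word filtering.

```python
import re
from collections import Counter


def two_word_phrases(text: str) -> list[str]:
    """Return adjacent word pairs from one tweet, ignoring any URLs."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    words = re.findall(r"[a-z0-9']+", text)
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def frequent_phrases(tweets: list[str], top_n: int = 5) -> list[tuple[str, int]]:
    """Count two-word phrases across all tweets and return the most common."""
    counts = Counter()
    for tweet in tweets:
        counts.update(two_word_phrases(tweet))
    return counts.most_common(top_n)


tweets = [
    "Slumdog Millionaire wins http://example.com/a",
    "Loved Slumdog Millionaire!",
]
print(frequent_phrases(tweets, top_n=1))
```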

It’s still beta and I’m still deciding where to go with it. Your thoughts, etc., are more than welcome.

I’m considering similar feeds with a vertical focus. I’m also thinking of splitting out the pages that are cited by the big, popular aggregators, since they’re already well-known.

A lot of what is showing up now is news, so I’m also wondering if I can automate a comparison to something like Google News to see what the differences are.

11 Jan 09 Most-cited sources on Twitter: frequency v. reach

I have extracted domain names from the URLs that I track on Twitter. Below is a table that shows how many citations and how many unique citing users there are for the top 25 domains since midnight last night.  The numbers are quite different.  For example, Engadget and Digg are cited quite a bit – high frequency – but by relatively few people – low reach. ReadWriteWeb, Mashable and TechCrunch seem to do the best job of achieving frequency and reach.

Here are the top 25 ordered by the number of people who cited pages from each source.

domain cites users
www.readwriteweb.com 263 216
mashable.com 210 189
www.techcrunch.com 274 162
news.cnet.com 236 118
friendfeed.com 164 98
www.whoppersacrifice.com 78 78
www.youtube.com 94 74
lifehacker.com 152 72
sethgodin.typepad.com 77 71
twitpic.com 115 69
twitter.com 114 69
www.smashingmagazine.com 76 68
digg.com 490 65
www.cnn.com 106 63
www.microsoft.com 59 59
www.ustream.tv 57 57
news.bbc.co.uk 102 53
truemors.nowpublic.com 78 52
danzarrella.com 50 50
www.google.com 52 48
www10.nytimes.com 82 47
xr.com 47 47
museums.alltop.com 47 47
www.engadget.com 534 45
www.mobilecrunch.com 48 40

Here are the top 25 sorted by number of cites.

domain cites users
www.engadget.com 534 45
digg.com 490 65
www.techcrunch.com 274 162
www.readwriteweb.com 263 216
news.cnet.com 236 118
mashable.com 210 189
www.techmeme.com 190 24
friendfeed.com 164 98
lifehacker.com 152 72
twitpic.com 115 69
twitter.com 114 69
www.cnn.com 106 63
news.bbc.co.uk 102 53
www.youtube.com 94 74
www10.nytimes.com 82 47
truemors.nowpublic.com 78 52
www.whoppersacrifice.com 78 78
sethgodin.typepad.com 77 71
www.smashingmagazine.com 76 68
twitrss.dyndns.org 63 4
www.msnbc.msn.com 61 26
www.microsoft.com 59 59
www.ustream.tv 57 57
news.yahoo.com 54 30
www.google.com 52 48
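
The two columns in the tables above can be tallied along these lines. This is a sketch under assumptions: the input is a list of (user, url) pairs, and short links have already been expanded to their real URLs before counting. The function name is mine.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_stats(citations):
    """Return {domain: (cites, unique_users)} from (user, url) pairs."""
    cites = defaultdict(int)
    users = defaultdict(set)
    for user, url in citations:
        domain = urlparse(url).netloc.lower()
        if not domain:
            continue  # skip anything that does not parse as a full URL
        cites[domain] += 1        # frequency: every citation counts
        users[domain].add(user)   # reach: each user counts once
    return {d: (cites[d], len(users[d])) for d in cites}


sample = [
    ("ann", "http://digg.com/a"),
    ("ann", "http://digg.com/b"),
    ("bob", "http://mashable.com/x"),
    ("cat", "http://mashable.com/x"),
]
stats = domain_stats(sample)
# Order by reach (unique users), as in the first table.
by_users = sorted(stats.items(), key=lambda kv: kv[1][1], reverse=True)
print(by_users)
```

Sorting on kv[1][0] instead gives the second ordering, by number of cites.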

09 Jan 09 Twitter social network leaders: navel-gazing or more?

I’m exploring the Twitter data I’ve gathered over the last few weeks.  I designed the collection to uncover patterns of URL citations, which I believe is one of the service’s most powerful uses.  As I have written, I’m looking at Twitter as a massively parallel self-organizing point-of-view system.  In other words, my premise is that by posting URLs to Twitter, people are saying that they found a web page to be interesting and valuable.

Today, I’m looking at “centrality,” a typical social network metric.   I am interested in degree centrality, which counts how many connections a person has and so shows who the key players are.  I’m considering two people to be connected if they cited the same URL in the same time frame, regardless of whether or not one was an explicit retweet of the other.  Later, I’ll probably weight the connections with explicit retweet and other data.  For now, I want to see if follower count, a far simpler metric than centrality, would work just as well.  Here is a log-log scatterplot of degree centrality v.  follower count.

Follower count v. degree centrality

The data points are scattered all over the place, which means that follower count does not correlate with the connections revealed by citing the same URLs.  I’m not surprised, given all the games people play to get followers, and the robots and such that have little if any human thought behind them.
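
For reference, here is a minimal sketch of degree centrality when two people count as connected because they cited the same URL. It assumes the (user, url) pairs have already been limited to one time window, and it returns the raw number of distinct connections rather than a normalized value. The function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def degree_centrality(citations):
    """Return {user: number of distinct users who cited a URL in common}."""
    users_by_url = defaultdict(set)
    for user, url in citations:
        users_by_url[url].add(user)

    neighbors = defaultdict(set)
    for users in users_by_url.values():
        # Every pair of users citing the same URL gets one connection.
        for a, b in combinations(sorted(users), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)

    all_users = {u for users in users_by_url.values() for u in users}
    return {u: len(neighbors[u]) for u in all_users}


sample = [
    ("ann", "u1"), ("bob", "u1"), ("cat", "u1"),
    ("ann", "u2"), ("dan", "u2"),
    ("eve", "u3"),
]
print(degree_centrality(sample))
```

Pairing every citer of a URL grows quadratically with the number of citers, so a very popular URL would need a cheaper approach on real data.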

As a reality check, let’s look at a similar plot that compares follower count to user mentions.  I would expect that people who have a lot of followers will be mentioned (in the form of @screen name, in a reply, retweet or any other context) more often.  Here’s the graph. 

Followers v. mentions

Bear in mind that my data gatherer is biased toward people who cite a lot of URLs, so when I count mentions, those are mentions by people who tend to cite a lot of URLs in their posts.   As you can see, although there are many outliers, there is an obvious trend upward and to the right, which indicates a positive correlation – people with a lot of followers indeed do tend to be mentioned a lot.  The upper left area is almost empty because it is hard to get any mentions when you don’t have any followers.  On the other hand, you can have lots of followers and few mentions, which is why there are more points toward the lower right.

Outliers are often interesting and I find myself wondering who is getting a lot of mentions even though they have very few followers.  The dot closest to the upper left corner is MsTweet, who is a “customer service evangelist for Mr.Tweet” and therefore doesn’t follow much of anyone, but gets mentioned a lot.  In the upper right border area, with lots of followers and mentions, are Shorty Awards, Chris Brogan, Guy Kawasaki, and ReTweetTrends (in the center of the top, not following nearly as many as the others).  The lower right corner outliers are people who are heavily followed, but rarely mentioned by people who cite URLs.  They include Kevin Rose, Jason Calacanis, Veronica and iJustine.  I’m surprised, actually, that these folks’ huge followings apparently either aren’t mentioning them often or aren’t often citing URLs.  Let’s reality-check that with Twitter search.

I’ll search on each of their user names, then repeat the search with their name and “http,” which will give a rough comparison of all mentions v. mentions with URLs in them.  Twitter’s search doesn’t give a result count, so it’s pretty hard to tell.  All I can go by is the frequency of recent tweets.  Let’s compare it to somebody who is mentioned a lot – Chris Brogan.  He is definitely getting a lot more frequent mentions in conjunction with URLs, so at first glance, the data seems believable.

Perhaps this indicates that the people with big followings yet few mentions have a different kind of influence.  People like Chris and Guy seem to be leading others to look outside of Twitter, while Kevin, Jason, Veronica and Justine have some other, perhaps more Twitter-centric influence.  Is it safe to say that the latter group is more engaged with Twitter for its own sake?  

It seems that some of the popular Twitterers are leading their followers mostly into Twitter navel-gazing, while others are leading people beyond what Twitter itself has to offer.  I find myself wondering how this might change as Twitter matures… and wondering if perhaps the navel-gazers are newer to Twitter and will get bored faster.  I’m gathering more of the user information now, so I should be able to compare the average number of days they have been using it.  In any event, from a business standpoint, I think I know which kind of leader I’d be more interested in.

07 Jan 09 Time to think about my Twitter data

Time to step back and consider what I’m doing with Twitter code and data.

Background: For the last two weeks, I have been writing code to find interesting URLs being cited in Twitter posts, or tweets.*  I now have a database of about 100,000 Twitter users (I will not call them/us Tweeple!) who have cited 40,000 URLs and more than 200,000 two-word phrases that accompanied those URLs.  The URLs have been mentioned 230,000 times (2.3 times per URL) and the phrases have been mentioned 330,000 times (1.6 times per phrase).   I have gathered all of this data via the Twitter APIs within their constraint of making no more than 100 requests per hour.  The primary public output of this work has been the Hot Twitter Cites list.

This morning, I’m going to try to take off my engineer hat, put on my product manager coat and consider what problem this could help solve and how to package it to meet that need.  In other words, I’m going to try and extract some focus from my brainstorming.  I’ll start by describing the data a bit.

Here’s a graph that shows how many people cited each URL for a one-day period.  A few URLs are cited many times, but the vast majority only pick up a handful of cites – this graph shows a very long tail.  

Users per URL cited

The pattern of citations per user has more depth.  In other words, this also has a long tail, but a fatter, uh, body.  This is good because it means that there are a lot of people citing URLs.  More people means more points of view.  

Citations per user

I’d be happier to see a greater variety of URLs being cited, but I’m not going to argue with the data… and I would expect (and hope) that the variety of cited URLs will rise as Twitter attracts a more diverse user base.

I’m generating a score for each user, based on how early they cite a URL that becomes popular.  The URLs listed in the hot cites page are chosen partly because they were cited by people who tended to cite popular URLs in the past.   I want to be sure that this isn’t redundant with how many people follow them.  If it is, then there’s no point in doing all these calculations; I could just watch for the URLs cited by the people with the most followers.  Here is a log-log scatterplot of my scoring v. follower counts.

Score v. follower count (log-log)

This is good.  If the two data sets had a linear or power law relationship, the dots in the scatterplot would be clustered around a line.  They are obviously not, which means that whatever I’m calculating, it is substantially different from ranking based on how many followers the citing user has.   I’d like to see a comparison between my score and each user’s follows/followers (a/k/a friend/follower) ratio, but I’ve just started gathering the “follows” (friends) numbers.
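
One way to put a number on "clustered around a line" is the Pearson correlation of the logged values, since a power law y = a·x^k turns into a straight line on log-log axes. The sketch below is an illustration, not something I ran on the real data. It assumes (follower_count, score) pairs, and it drops pairs with a zero or negative value because they have no logarithm.

```python
import math


def log_log_correlation(pairs):
    """Pearson r between log10(x) and log10(y) for positive (x, y) pairs."""
    pts = [(math.log10(x), math.log10(y)) for x, y in pairs if x > 0 and y > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two positive pairs")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    if sxx == 0 or syy == 0:
        raise ValueError("one variable is constant; r is undefined")
    return sxy / math.sqrt(sxx * syy)


# A perfect power law (y = 3 * x^2) should give r very close to 1.
power_law = [(x, 3 * x ** 2) for x in (1, 10, 100, 1000)]
print(log_log_correlation(power_law))
```

A value near zero would match what the scatterplot shows; a value near 1 or -1 would mean the score adds little beyond follower count.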

Still, I’m not surprised.  Follower relationships on Twitter do not imply significant connections between people, for several reasons:

  • Many popular Twitter “users” are not people at all.  They are aggregators, robots that spew, I mean stream, headlines.
  • Twitter celebrities (I really will not say Twitterarati!) have far too many followers to have a significant relationship with most of them.
  • Some people follow others on Twitter simply to induce them to follow back, which creates the appearance of popularity.

(This topic itself is fairly popular, as demonstrated by the fact that The 10 Users You’ll Meet on Twitter was cited by 120 people in the last few days, which puts it in the top 2 percent of URLs cited.)

More to come as I have time.

* In a strange coincidence, around the same time I started this, I added to my office a clock that tweets a bird call for each hour. My wife made me take out the tweeter batteries.  Mute birdies are staring at me.

05 Jan 09 Not so surprising, aggregators lead in URL scoring

I thought I’d see which Twitter users are scoring the highest in terms of posting URLs that become popular. My code gives them points based on how early they posted and how popular the URL becomes. I suppose it should not have surprised me to find that most of the high scoring users are not real people, but aggregators that feed tons of URLs.
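
To make the idea concrete, here is one possible way to score "early and popular." It is a stand-in, not my actual formula: for each URL, a user earns one point for every person who cited that URL after them. The input format (timestamp, user, url) and the function name are assumptions for illustration.

```python
from collections import defaultdict


def early_citer_scores(citations):
    """Score users by how many people cited each URL after they did.

    citations: iterable of (timestamp, user, url) tuples.
    """
    first_seen = defaultdict(dict)  # url -> {user: earliest timestamp}
    for ts, user, url in citations:
        prev = first_seen[url].get(user)
        if prev is None or ts < prev:
            first_seen[url][user] = ts

    scores = defaultdict(int)
    for times in first_seen.values():
        ordered = sorted(times, key=times.get)  # earliest citer first
        total = len(ordered)
        for rank, user in enumerate(ordered):
            # Points equal the number of later citers of this URL.
            scores[user] += total - 1 - rank
    return dict(scores)


sample = [
    (1, "ann", "u1"), (2, "bob", "u1"), (3, "cat", "u1"),
    (1, "bob", "u2"),
]
print(early_citer_scores(sample))
```

With a rule like this, an account that posts everything will inevitably be early on a lot of popular URLs, which is exactly how aggregators float to the top.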

Who is it that says that web analytics data is always messy? Whoever it is, right you are! Since a fundamental goal of the work I’m doing is to uncover interesting points of view, I need to downgrade sources that aren’t behaving as though they really have a point of view (or at least an intelligent one). I can tell instantly that I’m almost certainly looking at an automated system when I see that the “user” in question follows zero or very few people. That’s grounds for immediately downgrading. I’m not sure if I want to downgrade based on the volume of postings. Certainly beyond a believable number… and perhaps if every single post contains a URL.

Here are the top 25 sources from the last week or so, based on the criteria I described above.

  1. Net2 (878)
  2. techupdates (706)
  3. OriginalSignal (587)
  4. radi8 (565)
  5. Dakshinamurti (542)
  6. GaryTheGeek (453)
  7. techupdate (449)
  8. haripakorss (436)
  9. readmashcrunch (392)
  10. twittfeed (379)
  11. TwitLinksRSS (359)
  12. top_post (342)
  13. tclauss (329)
  14. TechFeed (303)
  15. tc2tw (300)
  16. vcsangels (295)
  17. dlbrown06 (287)
  18. davidsim (279)
  19. mashable (272)
  20. ReTweetTrends (268)
  21. balduaashish (268)
  22. wiredgnome (264)
  23. julieti (259)
  24. TechRSS (248)
  25. davekresta_rss (246)

04 Jan 09 Twitter phishing URL made the list

I’m glad I haven’t automated the hot Twitter URL list yet… a “fake Twitter” phishing link showed up, so I changed its tinyURL in the database to “disabled.” I should create a page for that I guess.
