Blocking aggregators from Tweetsnet is turning into a whack-a-mole game. As I block them, others pop up – and some of them are brand-new accounts. This means I’ll have to prioritize creating an algorithm to block them rapidly. Weird stuff pops into the feed, like a series of wrestling-related posts.
Meanwhile, I’m working on a couple of experimental vertical feeds, on web analytics and social media, since I’m interested in those subjects and I suspect they get a fair bit of attention from Twitter users.
Looking at the most common phrases that Tweetsnet is finding, it seems that the most talked-about subjects are the generally popular ones – President Obama, the Super Bowl, peanut butter, Steve Jobs and “Slumdog Millionaire.” But that may be because I’ve only recently pared down the aggregators. We’ll see what develops.
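A minimal sketch of how common phrases might be counted, assuming a "phrase" means two adjacent words in the text of a tweet once any URLs are stripped out. The function names and the tokenizing rule are illustrative assumptions, not the code Tweetsnet actually runs.

```python
import re
from collections import Counter

# Assumed tokenizing rules: drop URLs, lowercase, keep letters, digits, apostrophes.
URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[a-z0-9']+")


def two_word_phrases(text):
    """Return the adjacent word pairs in a tweet, ignoring any URLs."""
    words = WORD_PATTERN.findall(URL_PATTERN.sub(" ", text.lower()))
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


def most_common_phrases(tweets, limit=10):
    """Count two-word phrases across tweets and return the most frequent."""
    counts = Counter()
    for text in tweets:
        counts.update(two_word_phrases(text))
    return counts.most_common(limit)
```

A real version would probably also drop pairs made up only of stop words ("of the", "in a"), which would otherwise crowd out the interesting phrases.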
It seems to me that Twitter relates to blogs and search engines as radio once related to newspapers. Radio was usually first to cover breaking news; newspapers covered the same events in greater depth. I’m using the past tense because radio and newspapers are changing these days and are already fairly different from when I worked in those businesses.
Still debugging Tweetsnet… found a problem in scoring, which was limiting the variety of people whose cites could score anything at all. That’s fixed, along with more minor items that I’m finding as I review the code.
I have added Twitter Trends as a source for discovering more URL citations. The system periodically grabs the Trends list (had to learn a little JSON to do that) and combines the phrases with “http” to find trending tweets that also have URLs in them.
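Here is a rough sketch of the "trend phrase plus http" step. It assumes the Trends response is a JSON object with a `trends` list whose entries carry a `name` field; that shape, the function name, and the choice to quote multi-word phrases are assumptions for illustration. Fetching the list and running the searches are left out.

```python
import json


def trend_queries(trends_json):
    """Build one search query per trending phrase, each requiring "http"."""
    payload = json.loads(trends_json)
    queries = []
    for trend in payload.get("trends", []):
        name = trend.get("name", "").strip()
        if not name:
            continue  # skip empty entries
        # Assumed design choice: quote multi-word phrases so they match as a unit.
        phrase = f'"{name}"' if " " in name else name
        queries.append(f"{phrase} http")
    return queries
```

For example, a payload listing "Super Bowl" and "Obama" would yield the queries `"Super Bowl" http` and `Obama http`.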
The changes I made earlier today have greatly slowed the rate of published items, as the system builds up information about non-aggregator sources.
A handful of robotic news aggregators have taken over Tweetsnet… Twitter users that spew volumes of URLs every hour, which makes them appear to be on top of whatever is new. I’m planning to exclude them by algorithm (just the raw number of tweets is a good clue, but they also usually follow few people), but for now I’m excluding the big ones manually.
The idea of Tweetsnet is to leverage smart people, not dumb robots.
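The exclusion rule described above (lots of tweets, few people followed) could be sketched like this. The `Account` fields and both thresholds are hypothetical; real cut-offs would need tuning against accounts already blocked by hand.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    tweets_per_hour: float
    following: int


# Hypothetical thresholds, chosen only for illustration.
MAX_TWEETS_PER_HOUR = 10.0
MIN_FOLLOWING = 20


def looks_like_aggregator(account):
    """Flag accounts that post at high volume while following few people."""
    return (account.tweets_per_hour > MAX_TWEETS_PER_HOUR
            and account.following < MIN_FOLLOWING)


def filter_humans(accounts):
    """Keep only the accounts that do not look like robotic aggregators."""
    return [a for a in accounts if not looks_like_aggregator(a)]
```

Requiring both signals is the conservative choice: a chatty person who follows many people is kept. Using tweet volume alone would catch more robots at the cost of more false positives.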
You can now receive the stream of Tweetsnet postings by following @Tweetsnet on Twitter. Still having some weirdness with document titles sometimes… still working on it.
@Tweetsnet is also now automatically following anybody who posted a URL that it published. We’ll see how long that works…
I have extracted domain names from the URLs that I track on Twitter. Below is a table that shows how many citations and how many unique citing users there are for the top 25 domains since midnight last night. The numbers are quite different. For example, Engadget and Digg are cited quite a bit – high frequency – but by relatively few people – low reach. ReadWriteWeb, Mashable and TechCrunch seem to do the best job of achieving frequency and reach.
Here are the top 25 ordered by the number of people who cited pages from each source.
| domain | cites | users |
| www.readwriteweb.com | 263 | 216 |
| mashable.com | 210 | 189 |
| www.techcrunch.com | 274 | 162 |
| news.cnet.com | 236 | 118 |
| friendfeed.com | 164 | 98 |
| www.whoppersacrifice.com | 78 | 78 |
| www.youtube.com | 94 | 74 |
| lifehacker.com | 152 | 72 |
| sethgodin.typepad.com | 77 | 71 |
| twitpic.com | 115 | 69 |
| twitter.com | 114 | 69 |
| www.smashingmagazine.com | 76 | 68 |
| digg.com | 490 | 65 |
| www.cnn.com | 106 | 63 |
| www.microsoft.com | 59 | 59 |
| www.ustream.tv | 57 | 57 |
| news.bbc.co.uk | 102 | 53 |
| truemors.nowpublic.com | 78 | 52 |
| danzarrella.com | 50 | 50 |
| www.google.com | 52 | 48 |
| www10.nytimes.com | 82 | 47 |
| xr.com | 47 | 47 |
| museums.alltop.com | 47 | 47 |
| www.engadget.com | 534 | 45 |
| www.mobilecrunch.com | 48 | 40 |
Here are the top 25 sorted by number of cites.
| domain | cites | users |
| www.engadget.com | 534 | 45 |
| digg.com | 490 | 65 |
| www.techcrunch.com | 274 | 162 |
| www.readwriteweb.com | 263 | 216 |
| news.cnet.com | 236 | 118 |
| mashable.com | 210 | 189 |
| www.techmeme.com | 190 | 24 |
| friendfeed.com | 164 | 98 |
| lifehacker.com | 152 | 72 |
| twitpic.com | 115 | 69 |
| twitter.com | 114 | 69 |
| www.cnn.com | 106 | 63 |
| news.bbc.co.uk | 102 | 53 |
| www.youtube.com | 94 | 74 |
| www10.nytimes.com | 82 | 47 |
| truemors.nowpublic.com | 78 | 52 |
| www.whoppersacrifice.com | 78 | 78 |
| sethgodin.typepad.com | 77 | 71 |
| www.smashingmagazine.com | 76 | 68 |
| twitrss.dyndns.org | 63 | 4 |
| www.msnbc.msn.com | 61 | 26 |
| www.microsoft.com | 59 | 59 |
| www.ustream.tv | 57 | 57 |
| news.yahoo.com | 54 | 30 |
| www.google.com | 52 | 48 |
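For anyone who wants to build tables like the two above, here is a minimal sketch of the counting. It assumes the citations are available as (user, URL) pairs and that shortened links have already been expanded to their final URLs; the function names are made up for this example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_stats(citations):
    """Return {domain: (cites, unique_users)} from (user, url) pairs."""
    cites = defaultdict(int)
    users = defaultdict(set)
    for user, url in citations:
        domain = urlparse(url).netloc.lower()
        if not domain:
            continue  # skip strings that do not parse as absolute URLs
        cites[domain] += 1
        users[domain].add(user)
    return {d: (cites[d], len(users[d])) for d in cites}


def top_domains(stats, by="users", limit=25):
    """Sort domains by reach ("users") or frequency ("cites"), highest first."""
    index = 1 if by == "users" else 0
    ranked = sorted(stats.items(), key=lambda item: item[1][index], reverse=True)
    return ranked[:limit]
```

Calling `top_domains(stats, by="users")` gives the reach ordering and `top_domains(stats, by="cites")` gives the frequency ordering. Note that `www.techcrunch.com` and `techcrunch.com` would be counted as separate domains here, since the hostname is used as-is.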
Time to step back and consider what I’m doing with Twitter code and data.
Background: For the last two weeks, I have been writing code to find interesting URLs being cited in Twitter posts, or tweets.* I now have a database of about 100,000 Twitter users (I will not call them/us Tweeple!) who have cited 40,000 URLs and more than 200,000 two-word phrases that accompanied those URLs. The URLs have been mentioned 230,000 times (2.3 times per URL) and the phrases have been mentioned 330,000 times (1.6 times per phrase). I have gathered all of this data via the Twitter APIs within their constraint of making no more than 100 requests per hour. The primary public output of this work has been the Hot Twitter Cites list.
This morning, I’m going to try to take off my engineer hat, put on my product manager coat and consider what problem this could help solve and how to package it to meet that need. In other words, I’m going to try and extract some focus from my brainstorming. I’ll start by describing the data a bit.
Here’s a graph that shows how many people cited each URL for a one-day period. A few URLs are cited many times, but the vast majority only pick up a handful of cites – this graph shows a very long tail.

Users per citation
The pattern of citations per user has more depth. In other words, this also has a long tail, but a fatter, uh, body. This is good because it means that there are a lot of people citing URLs. More people means more points of view.

Citations per user
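The two distributions charted above can be tallied from raw citation records roughly as follows. This is a sketch that assumes the data is a list of (user, URL) pairs; the function names are illustrative.

```python
from collections import Counter, defaultdict


def users_per_url(citations):
    """Map each URL to the number of distinct users who cited it."""
    citers = defaultdict(set)
    for user, url in citations:
        citers[url].add(user)
    return {url: len(users) for url, users in citers.items()}


def citations_per_user(citations):
    """Map each user to the number of citations they made."""
    return dict(Counter(user for user, _ in citations))


def distribution(counts):
    """Return sorted (value, number_of_items_with_that_value) pairs."""
    return sorted(Counter(counts.values()).items())
```

In a long-tailed data set, `distribution(users_per_url(citations))` starts with a large count at a value of 1 and thins out quickly as the value rises.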
I’d be happier to see a greater variety of URLs being cited, but I’m not going to argue with the data… and I would expect (and hope) that the variety of cited URLs will rise as Twitter attracts a more diverse user base.
I’m generating a score for each user, based on how early they cite a URL that becomes popular. The URLs listed in the hot cites page are chosen partly because they were cited by people who tended to cite popular URLs in the past. I want to be sure that this isn’t redundant with how many people follow them. If it is, then there’s no point in doing all these calculations; I could just watch for the URLs cited by the people with the most followers. Here is a log-log scatterplot of my scoring v. follower counts.

Score v. follower count (log-log)
This is good. If the two data sets had a linear or power-law relationship, the dots in the scatterplot would be clustered around a line. They are obviously not, which means that whatever I’m calculating, it is substantially different from ranking based on how many followers the citing user has. I’d like to see a comparison between my score and each user’s following/followers (a/k/a friend/follower) ratio, but I’ve just started gathering the “follows” (friends) numbers.
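To make the idea concrete, here is one way an "early citer" score and the redundancy check could be sketched. The scoring formula is an assumption invented for this example (the actual formula is not spelled out here): a URL counts as popular once `min_citers` distinct users have cited it, and credit falls linearly from the first citer to the last. The check is a plain Pearson correlation of the logged values, a numeric stand-in for eyeballing the log-log plot.

```python
import math
from collections import defaultdict


def early_citer_scores(citations, min_citers=3):
    """Score users by how early they cited URLs that became popular.

    citations: iterable of (timestamp, user, url) tuples.
    """
    first_cite = defaultdict(dict)  # url -> {user: earliest timestamp}
    for timestamp, user, url in citations:
        seen = first_cite[url]
        if user not in seen or timestamp < seen[user]:
            seen[user] = timestamp

    scores = defaultdict(float)
    for seen in first_cite.values():
        if len(seen) < min_citers:
            continue  # not popular enough to reward anyone
        ordered = sorted(seen, key=seen.get)
        for rank, user in enumerate(ordered):
            # Hypothetical rule: first citer earns 1.0, later ones less.
            scores[user] += 1.0 - rank / len(ordered)
    return dict(scores)


def log_log_correlation(pairs):
    """Pearson correlation of log10(x) and log10(y) over positive pairs."""
    points = [(math.log10(x), math.log10(y)) for x, y in pairs if x > 0 and y > 0]
    if len(points) < 2:
        raise ValueError("need at least two positive pairs")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    var_y = sum((y - mean_y) ** 2 for _, y in points)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)
```

Feeding `log_log_correlation` a list of (score, follower count) pairs gives a value near 1 when the points hug a rising line on the log-log plot and a value near 0 when they form a shapeless cloud. A low value only rules out a linear or power-law link, not every possible relationship.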
Still, I’m not surprised. Follower relationships on Twitter do not imply significant connections between people, for several reasons:
(This topic itself is fairly popular, as demonstrated by the fact that The 10 Users You’ll Meet on Twitter was cited by 120 people in the last few days, which puts it in the top 2 percent of URLs cited.)
More to come as I have time.
* In a strange coincidence, around the same time I started this, I added to my office a clock that tweets a bird call for each hour. My wife made me take out the tweeter batteries. Mute birdies are staring at me.
I’m glad I haven’t automated the hot Twitter URL list yet… a “fake Twitter” phishing link showed up, so I changed its TinyURL in the database to “disabled.” I should create a page for that, I guess.