“At the end of the day, Twitter is a prototype.” That’s a comment on Dave Winer’s blog by Chuck Shotton, who created one of the first web servers, long before most people had even heard of the Internet. Chuck’s main point is that Twitter is a good idea, but it should be implemented as a distributed system, not a centralized one.
Dead on, Chuck. I’m not in any way faulting Twitter by agreeing with Chuck. There are good reasons that they are succeeding where others have failed at microblogging. It is good that they are demonstrating the broad appeal and usefulness of this kind of communication. The problem, as Chuck nailed it, is that they are centralized. Compare this to blogging, which was designed from the start to be decentralized. There are dozens of blogging platforms that you can run locally, on a rented host or at a site dedicated to hosting blogs. Choices, choices, choices. But if you want to tweet, there’s only one way to do it – Twitter.
One reason Twitter succeeded where others failed is that it has a good API and is extremely open when it comes to sharing data. The default, unlike most other social media companies, is that all of your data is open to everyone, except for direct messages. That’s fairly radical and perhaps more than anything else, has inspired developers to create many, many Twitter applications.
I caught the bug myself, attracted by the volume of data that is easily available. I threw together TwURLed News, not with the idea of building a company around it, but because I wanted to see how well something like it would work. It wasn’t very hard to build, the back end requires a BSD machine worth maybe $1,000 and the front end runs on a very low-cost hosting provider. Amazing.
Still, I can’t believe this is the future of microblogging. Instead of running applications that use the Twitter API on our desktops, it seems much more likely that we will end up running something like the Twitter API ourselves, which talks peer-to-peer instead of client-server.
Consider how Twitter and Google have opposing information flow. The Google model is that people publish information on web servers, then Google’s robots gather the data. To access Google, you use a standard web client. In the Twitter world, nothing gets published until and unless it is pushed to Twitter’s servers and a lot of the people who read Twitter-published information do so using custom clients. I guess you can rationalize this by arguing that Twitter is getting its users to do all the work that Google’s robots would otherwise do, but that’s a terrible idea. As Chuck pointed out, it doesn’t scale.
Consider also how different Twitter’s data flow is from blogging. When you post a blog entry, you’re usually also publishing it as an RSS feed. Outfits like Technorati (and Google, of course) send robots out to read those feeds and make them available via the web or newsreaders. People call Twitter microblogging, but instead of encouraging people to tweet locally and make the tweetstream available to anybody who wants to retrieve it from your site, as with RSS, Twitter says no, you have to send your tweets to Twitter and then they become available to the public. The pain of that centralization is already hurting Twitter, as developers complain about being unable to get even a single user’s entire tweet history, about being unable to search more than a few weeks’ data and other limitations.
So, here’s a thought. How about if every Twitter application developer throws off the yoke of centralization and adds local (or hosted, via XML-RPC) RSS publishing as an option? This is relatively simple for desktop apps – it could use the same mechanisms as RSS. It could actually be an RSS feed tagged as a tweetstream, so that anything that reads it will know that no entry will be more than 140 characters, expect hashtags, “@” screen names, etc. Phone apps could use a proxy to do the same while continuing to publish the tweetstream on Twitter.
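To make the idea concrete, here is a minimal sketch of what a locally published tweetstream feed could look like, using only Python's standard library. Everything specific in it is an assumption for illustration: the function name `build_tweetstream`, the use of a channel-level `category` element with the value "tweetstream" as the tag, and the example site URL. No such convention exists today; the point is only that an ordinary RSS 2.0 feed can carry the 140-character promise.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

MAX_TWEET_LENGTH = 140  # the limit a reader of a tweetstream could rely on


def build_tweetstream(author, site_url, tweets):
    """Return an RSS 2.0 document (as a string) for a list of tweets.

    `tweets` is a list of (text, posted_at) tuples, where posted_at is a
    timezone-aware datetime. The "tweetstream" category is a hypothetical
    marker, not an existing standard.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"{author} tweetstream"
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Locally published tweets"
    # Hypothetical tag telling readers that every item is a tweet.
    ET.SubElement(channel, "category").text = "tweetstream"

    for text, posted_at in tweets:
        if len(text) > MAX_TWEET_LENGTH:
            raise ValueError("a tweetstream entry may not exceed 140 characters")
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "description").text = text
        ET.SubElement(item, "pubDate").text = format_datetime(posted_at)

    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    posted = datetime(2009, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(build_tweetstream("alice", "http://example.com/",
                            [("Reading about #rss with @bob", posted)]))
```

A desktop client could write this file next to the blog's existing feed each time it posts, while still sending the same text to Twitter.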
Imagine the services that could bloom if everybody’s tweetstream were available without having to rely exclusively on Twitter and its limited resources. In no time at all, we’d see comprehensive indexing and other value-added services.
So, why not? I’m not suggesting anyone abandon Twitter, I’m just saying that microblogging will take off much faster if Twitter developers realize that they don’t have to depend only on Twitter to publish their tweets.
The bad news is that everything was dead for about 12 hours because my hosting company, Bluehost, shut it down for consuming too many CPU cycles. The culprit was a WordPress plugin that generates XML sitemaps. It was generating an updated sitemap for every post, with a fairly expensive MySQL query each time. No more. The plugin is set to only permit manual updates and I’ll trigger that every few hours, not at every posting. That should also make the site more responsive.
Live and learn.
I suppose it is a cliche to say that many useful things have been created unexpectedly, even accidentally. Here in Silicon Valley, that principle often becomes a problem, as highly creative people see a thousand products or services in their creations, but fail to focus enough to create a viable business. I know that disease well because I have to fight it constantly. Right now, however, with Tweetsnet, I’m still in the brainstorming and experimentation phase, when the point is to explore the possibilities. If it gives rise to a business of some sort, that’ll be just fine, but that’s not the point yet.
The bit of unexpected goodness I’ve noticed in Tweetsnet over the last few days is in the tagging. The tags and the tag cloud achieve one of my goals – self-organization – even though I didn’t really plan on it. If I had stopped to think about it, I guess I would have realized it would happen. It all started when I realized that since I’m fetching page titles from popular Twittered URLs, I could also extract any keywords found on those pages. I had to hack a Python WordPress XML-RPC library to support tags, but that was no big deal.
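For readers who want to try the keyword extraction step, a rough sketch using Python's standard `html.parser` follows. It is not the Tweetsnet code. The class and function names are made up for illustration, and it assumes keywords live in a `<meta name="keywords" content="...">` tag as a comma-separated list.

```python
from html.parser import HTMLParser


class KeywordParser(HTMLParser):
    """Collects the contents of <meta name="keywords"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if (attributes.get("name") or "").lower() != "keywords":
            return
        content = attributes.get("content") or ""
        for keyword in content.split(","):
            keyword = keyword.strip().lower()
            if keyword:
                self.keywords.append(keyword)


def extract_keywords(html):
    """Return the lower-cased meta keywords found in an HTML string."""
    parser = KeywordParser()
    parser.feed(html)
    return parser.keywords


if __name__ == "__main__":
    page = ('<html><head><title>T</title>'
            '<meta name="Keywords" content="Google, GDrive , storage"></head></html>')
    print(extract_keywords(page))
```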
Once those tags were working, I realized that I could treat Twitter hashtags as a special case of tagging. In the Tweetsnet database, tags are identified by source – HTML meta keywords or hashtags. On the Tweetsnet pages, they all look the same.
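A small sketch of that arrangement, under assumptions: the source labels "meta" and "hashtag", the function names and the in-memory dictionary standing in for the database are all invented here for illustration.

```python
import re
from collections import defaultdict

# Simplified hashtag pattern: a "#" followed by word characters.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def collect_tags(meta_keywords, tweets):
    """Map each tag to the set of sources it came from.

    The source labels "meta" and "hashtag" are illustrative names.
    """
    sources = defaultdict(set)
    for keyword in meta_keywords:
        sources[keyword.lower()].add("meta")
    for tweet in tweets:
        for hashtag in HASHTAG_PATTERN.findall(tweet):
            sources[hashtag.lower()].add("hashtag")
    return dict(sources)


def display_tags(tag_sources):
    """On the page, tags look the same whatever their source."""
    return sorted(tag_sources)


if __name__ == "__main__":
    tags = collect_tags(["GDrive", "storage"],
                        ["Rumors about #gdrive again", "More from #google"])
    print(tags)
    print(display_tags(tags))
```

Keeping the source alongside each tag costs little and leaves room to weight or filter by source later.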
When that was working, I found myself staring at the “phrases” that I’m capturing from Twitter. Those are two-word phrases extracted via some very simple rules – end of sentence detection, a stopwords list, hashtags and user names excluded and so forth. I noticed that when the same word showed up in more than one of those phrases, it often would be an appropriate tag. And I noticed that existing tag words often showed up in the phrases, so those get added no matter how frequently they occur. Any word that shows up in at least three of the phrases is also added as a tag, although I’m not storing them in the database, since they are sometimes a bit odd.
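The rules above can be sketched roughly as follows. This is a guess at the general shape, not the actual Tweetsnet code: the stopword list is a tiny placeholder, the regular expressions are simplified, and pairing adjacent surviving words within a sentence is an assumption about what "two-word phrase" means.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "on", "for"}  # placeholder list
NOISE = re.compile(r"https?://\S+|[#@]\w+")  # URLs, hashtags and user names
SENTENCE_END = re.compile(r"[.!?]+")


def two_word_phrases(tweet):
    """Return adjacent word pairs, never crossing a sentence boundary."""
    phrases = []
    cleaned = NOISE.sub(" ", tweet)
    for sentence in SENTENCE_END.split(cleaned):
        words = []
        for token in sentence.split():
            word = token.strip(",;:\"'()").lower()
            if word and word not in STOPWORDS:
                words.append(word)
        phrases.extend(zip(words, words[1:]))
    return phrases


def tags_from_phrases(phrases, existing_tags, min_phrases=3):
    """Pick tag words: frequent words, plus any word that is already a tag."""
    counts = Counter()
    for phrase in set(phrases):        # count each distinct phrase once
        for word in set(phrase):
            counts[word] += 1
    return {word for word, count in counts.items()
            if count >= min_phrases or word in existing_tags}


if __name__ == "__main__":
    print(two_word_phrases("Google launches GDrive today. @bob check http://example.com/x #gdrive"))
    print(tags_from_phrases(
        [("gdrive", "rumor"), ("gdrive", "launch"), ("google", "gdrive"), ("free", "storage")],
        existing_tags={"storage"}))
```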
The result is a set of tags and a tag cloud that do a pretty good job of finding articles related to a particular topic. For example, when an article about the rumored GDrive showed up, it was tagged “gdrive,” which I clicked and found two more articles. Cool. That’s why I recently increased the size of the Tweetsnet tag cloud widget.
As you may have noticed, I have added links to sites that are doing things similar to Tweetsnet. One of those, Twitscoop, offers a tag cloud widget, which gave me the idea that perhaps Tweetsnet should do the same. Soon, I hope. That would be in keeping with my idea that one of the secrets to success is to notice when you’ve invented something useful, then package it well.
I would be remiss if I didn’t point out that all this would not have happened if I wasn’t using WordPress as my platform. Although it gets in the way sometimes, the features that come for free, including all the third-party themes and widgets, are terrific. Ditto for Python and all the libraries people write for it.
Time to step back and consider what I’m doing with Twitter code and data.
Background: For the last two weeks, I have been writing code to find interesting URLs being cited in Twitter posts, or tweets.* I now have a database of about 100,000 Twitter users (I will not call them/us Tweeple!) who have cited 40,000 URLs and more than 200,000 two-word phrases that accompanied those URLs. The URLs have been mentioned 230,000 times (2.3 times per URL) and the phrases have been mentioned 330,000 times (1.6 times per phrase). I have gathered all of this data via the Twitter APIs within their constraint of making no more than 100 requests per hour. The primary public output of this work has been the Hot Twitter Cites list.
This morning, I’m going to try to take off my engineer hat, put on my product manager coat and consider what problem this could help solve and how to package it to meet that need. In other words, I’m going to try and extract some focus from my brainstorming. I’ll start by describing the data a bit.
Here’s a graph that shows how many people cited each URL for a one-day period. A few URLs are cited many times, but the vast majority only pick up a handful of cites – this graph shows a very long tail.
The pattern of citations per user has more depth. In other words, this also has a long tail, but a fatter, uh, body. This is good because it means that there are a lot of people citing URLs. More people means more points of view.
I’d be happier to see a greater variety of URLs being cited, but I’m not going to argue with the data… and I would expect (and hope) that the variety of cited URLs will rise as Twitter attracts a more diverse user base.
I’m generating a score for each user, based on how early they cite a URL that becomes popular. The URLs listed in the hot cites page are chosen partly because they were cited by people who tended to cite popular URLs in the past. I want to be sure that this isn’t redundant to how many people follow them. If it is, then there’s no point in doing all these calculations, I could just watch for the URLs cited by the people with the most followers. Here is a log-log scatterplot of my scoring v. follower counts.
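One way such a score could be computed is sketched below. The formula is an assumption made up for illustration, not the one behind the hot cites page: for every URL cited by at least `min_cites` distinct users, each citer earns a share between 0 and 1 that is largest for the first person to cite it, and a user's score is the sum of those shares.

```python
from collections import defaultdict


def early_citer_scores(citations, min_cites=3):
    """Score users by how early they cite URLs that become popular.

    `citations` is an iterable of (user, url, timestamp) tuples. Both the
    popularity threshold and the rank-based weighting are illustrative.
    """
    by_url = defaultdict(list)
    for user, url, timestamp in citations:
        by_url[url].append((timestamp, user))

    scores = defaultdict(float)
    for cites in by_url.values():
        # Keep only each user's earliest citation of this URL.
        first_cite = {}
        for timestamp, user in sorted(cites):
            first_cite.setdefault(user, timestamp)
        if len(first_cite) < min_cites:
            continue  # not popular enough to count
        ordered = sorted(first_cite, key=first_cite.get)
        total = len(ordered)
        for rank, user in enumerate(ordered):
            scores[user] += (total - rank) / total  # first citer gets 1.0
    return dict(scores)


if __name__ == "__main__":
    sample = [("alice", "u1", 1), ("bob", "u1", 2), ("carol", "u1", 3),
              ("dave", "u2", 1)]
    print(early_citer_scores(sample))
```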
This is good. If the two data sets had a linear or power law relationship, the dots in the scatterplot would be clustered around a line. They are obviously not, which means that whatever I’m calculating, it is substantially different from ranking based on how many followers the citing user has. I’d like to see a comparison between my score and each user’s follows/followers (a/k/a friend/follower) ratio, but I’ve just started gathering the “follows” (friends) numbers.
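The eyeball test can be backed up with a number. A power law y = c·x^k becomes a straight line once both axes are logged, so the Pearson correlation of the logged values should sit near 1 or -1 if the dots hug a line, and near 0 if they don't. The sketch below is a generic calculation, not something run on the data above; the function name and the choice to drop non-positive values are assumptions.

```python
import math


def log_log_correlation(pairs):
    """Pearson correlation of log10(x) and log10(y) for (x, y) pairs.

    Pairs with a non-positive value are skipped, since their logarithm
    is undefined.
    """
    points = [(math.log10(x), math.log10(y)) for x, y in pairs if x > 0 and y > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two positive pairs")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    var_y = sum((y - mean_y) ** 2 for _, y in points)
    if var_x == 0 or var_y == 0:
        raise ValueError("values must vary on both axes")
    return covariance / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    power_law = [(x, 2 * x ** 1.5) for x in (1, 10, 100, 1000)]
    print(log_log_correlation(power_law))
```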
Still, I’m not surprised. Follower relationships on Twitter do not imply significant connections between people, for several reasons:
(This topic itself is fairly popular, as demonstrated by the fact that The 10 Users You’ll Meet on Twitter was cited by 120 people in the last few days, which puts it in the top 2 percent of URLs cited.)
More to come as I have time.
* In a strange coincidence, around the same time I started this, I added to my office a clock that tweets a bird call for each hour. My wife made me take out the tweeter batteries. Mute birdies are staring at me.