I suppose it is a cliche to say that many useful things have been created unexpectedly, even accidentally. Here in Silicon Valley, that principle often becomes a problem, as highly creative people see a thousand products or services in their creations, but fail to focus enough to create a viable business. I know that disease well because I have to fight it constantly. Right now, however, with Tweetsnet, I’m still in the brainstorming and experimentation phase, when the point is to explore the possibilities. If it gives rise to a business of some sort, that’ll be just fine, but that’s not the point yet.
The bit of unexpected goodness I’ve noticed in Tweetsnet over the last few days is in the tagging. The tags and the tag cloud achieve one of my goals – self-organization – even though I didn’t really plan on it. If I had stopped to think about it, I guess I would have realized it would happen. It all started when I realized that since I’m fetching page titles from popular Twittered URLs, I could also extract any keywords found on those pages. I had to hack a Python WordPress XML-RPC library to support tags, but that was no big deal.
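For anyone curious, here is a minimal sketch of how pulling a title and meta keywords out of a page might look. It is not the actual Tweetsnet code: the class and function names are made up for illustration, it uses only Python's standard `html.parser`, and it assumes the page HTML has already been fetched and that keywords are comma-separated.

```python
from html.parser import HTMLParser


class TitleAndKeywords(HTMLParser):
    """Collect the <title> text and any <meta name="keywords"> values.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            # Assumption: keywords are comma-separated; normalize to lowercase.
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title_and_keywords(html):
    """Return (title, [keywords]) for an HTML document given as a string."""
    parser = TitleAndKeywords()
    parser.feed(html)
    return parser.title.strip(), parser.keywords
```

Each keyword returned this way would then become a candidate tag for the post about that URL.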
Once those tags were working, I realized that I could treat Twitter hashtags as a special case of tagging. In the Tweetsnet database, tags are identified by source – HTML meta keywords or hashtags. On the Tweetsnet pages, they all look the same.
When that was working, I found myself staring at the “phrases” that I’m capturing from Twitter. Those are two-word phrases extracted via some very simple rules – end-of-sentence detection, a stopwords list, hashtags and user names excluded and so forth. I noticed that when the same word showed up in more than one of those phrases, it often would be an appropriate tag. And I noticed that existing tag words often showed up in the phrases, so those get added no matter how frequently they occur. Any word that shows up in at least three of the phrases is also added as a tag, although I’m not storing those in the database, since they are sometimes a bit odd.
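The two rules above can be sketched roughly like this. It is an illustration rather than the real code: the function name and the `min_phrases` parameter are hypothetical, and it assumes the phrases have already been extracted and cleaned of stopwords, hashtags and user names.

```python
from collections import Counter


def tags_from_phrases(phrases, existing_tags, min_phrases=3):
    """Pick tag words from two-word phrases.

    A word becomes a tag if it is already a known tag (however rarely it
    appears), or if it shows up in at least `min_phrases` phrases.
    """
    existing = {t.lower() for t in existing_tags}
    counts = Counter()
    for phrase in phrases:
        # Count each word once per phrase, so "new new" counts as one hit.
        for word in set(phrase.lower().split()):
            counts[word] += 1

    tags = set()
    for word, n in counts.items():
        if word in existing or n >= min_phrases:
            tags.add(word)
    return sorted(tags)
```

With the phrases "gdrive rumor", "gdrive launch", "google gdrive" and "steve jobs", and "jobs" already a known tag, this picks "gdrive" (three phrases) and "jobs" (existing tag) and skips the rest.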
The result is a set of tags and a tag cloud that do a pretty good job of finding articles related to a particular topic. For example, when an article about the rumored GDrive showed up, it was tagged “gdrive,” which I clicked and found two more articles. Cool. That’s why I recently increased the size of the Tweetsnet tag cloud widget.
As you may have noticed, I have added links to sites that are doing things similar to Tweetsnet. One of those, Twitscoop, offers a tag cloud widget, which gave me the idea that perhaps Tweetsnet should do the same. Soon, I hope. That would be in keeping with my idea that one of the secrets to success is to notice when you’ve invented something useful, then package it well.
I would be remiss if I didn’t point out that all this would not have happened if I weren’t using WordPress as my platform. Although it gets in the way sometimes, the features that come for free, including all the third-party themes and widgets, are terrific. Ditto for Python and all the libraries people write for it.
Tags: self-organizing, tweetsnet, twitter
Blocking aggregators from Tweetsnet is turning into a whack-a-mole game. As I block them, others pop up – and some of them are brand-new accounts. This means I’ll have to prioritize creating an algorithm to block them rapidly. Weird stuff pops into the feed, like a series of wrestling-related posts.
Meanwhile, I’m working on a couple of experimental vertical feeds, on web analytics and social media, since I’m interested in those subjects and I suspect they get a fair bit of attention from Twitter users.
Looking at the most common phrases that Tweetsnet is finding, it seems that the most talked about subjects are the generally popular ones – President Obama, the Super Bowl, peanut butter, Steve Jobs and “Slumdog Millionaire.” But that may be because I’ve only recently pared down the aggregators. We’ll see what develops.
It seems to me that Twitter is related to blogs and search engines as radio has been related to newspapers. Radio was usually first to cover breaking news; newspapers covered the same events in greater depth. I’m using past tense because radio and newspapers are changing these days, already fairly different from when I worked in those businesses.
Still debugging Tweetsnet… found a problem in scoring, which was limiting the variety of people whose cites could score anything at all. That’s fixed, along with more minor items that I’m finding as I review the code.
I have added Twitter Trends as a source for discovering more URL citations. The system periodically grabs the Trends list (had to learn a little JSON to do that) and combines the phrases with “http” to find trending tweets that also have URLs in them.
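As a rough sketch of that step, the function below turns a Trends response into search queries. The JSON shape is an assumption on my part for illustration (a top-level `trends` list whose items each have a `name`), as is the function name; fetching the JSON and running the searches are left out.

```python
import json


def trend_queries(trends_json):
    """Build one search query per trending phrase by appending "http".

    Assumes (hypothetically) a payload shaped like:
        {"trends": [{"name": "Super Bowl"}, ...]}
    """
    data = json.loads(trends_json)
    queries = []
    for trend in data.get("trends", []):
        name = (trend.get("name") or "").strip()
        if name:
            # Adding "http" limits matches to tweets that contain a URL.
            queries.append(f"{name} http")
    return queries
```

Each resulting query would then be run against Twitter search, and the URLs in the matching tweets fed into the usual citation counting.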
The changes I made earlier today have greatly slowed down the number of published items, as the system builds up information about non-aggregator sources.
A handful of robotic news aggregators have taken over Tweetsnet… Twitter users that spew large volumes of URLs every hour, which makes them appear to be on top of whatever is new. I’m planning to exclude them by algorithm (just the raw number of tweets is a good clue, but they also usually follow few people), but for now I’m excluding the big ones manually.
The idea of Tweetsnet is to leverage smart people, not dumb robots.
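The kind of check I have in mind could look something like the sketch below. The `Account` type, the function name and both thresholds are hypothetical, picked only for illustration; real cutoffs would have to come from looking at the data.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    tweets_per_hour: float  # average posting rate
    following: int          # number of accounts this user follows


def looks_like_aggregator(account, max_rate=20.0, min_following=25):
    """Flag accounts that post at a very high rate and follow few people.

    Both thresholds are made-up placeholders, not tuned values.
    """
    high_volume = account.tweets_per_hour > max_rate
    follows_few = account.following < min_following
    return high_volume and follows_few
```

Requiring both signals is a cautious choice: a prolific human who follows plenty of people stays in, while a feed-driven account that follows almost nobody gets filtered out.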
You can now receive the stream of Tweetsnet postings by following @Tweetsnet on Twitter. Still having some weirdness with document titles sometimes… still working on it.
@Tweetsnet is also now automatically following anybody who posted a URL that it published. We’ll see how long that works…
I have extracted domain names from the URLs that I track on Twitter. Below is a table that shows how many citations and how many unique citing users there are for the top 25 domains since midnight last night. The numbers are quite different. For example, Engadget and Digg are cited quite a bit – high frequency – but by relatively few people – low reach. ReadWriteWeb, Mashable and TechCrunch seem to do the best job of achieving frequency and reach.
Here are the top 25 ordered by the number of people who cited pages from each source.
| domain | cites | users |
| --- | --- | --- |
| www.readwriteweb.com | 263 | 216 |
| mashable.com | 210 | 189 |
| www.techcrunch.com | 274 | 162 |
| news.cnet.com | 236 | 118 |
| friendfeed.com | 164 | 98 |
| www.whoppersacrifice.com | 78 | 78 |
| www.youtube.com | 94 | 74 |
| lifehacker.com | 152 | 72 |
| sethgodin.typepad.com | 77 | 71 |
| twitpic.com | 115 | 69 |
| twitter.com | 114 | 69 |
| www.smashingmagazine.com | 76 | 68 |
| digg.com | 490 | 65 |
| www.cnn.com | 106 | 63 |
| www.microsoft.com | 59 | 59 |
| www.ustream.tv | 57 | 57 |
| news.bbc.co.uk | 102 | 53 |
| truemors.nowpublic.com | 78 | 52 |
| danzarrella.com | 50 | 50 |
| www.google.com | 52 | 48 |
| www10.nytimes.com | 82 | 47 |
| xr.com | 47 | 47 |
| museums.alltop.com | 47 | 47 |
| www.engadget.com | 534 | 45 |
| www.mobilecrunch.com | 48 | 40 |
Here are the top 25 sorted by number of cites.
| domain | cites | users |
| --- | --- | --- |
| www.engadget.com | 534 | 45 |
| digg.com | 490 | 65 |
| www.techcrunch.com | 274 | 162 |
| www.readwriteweb.com | 263 | 216 |
| news.cnet.com | 236 | 118 |
| mashable.com | 210 | 189 |
| www.techmeme.com | 190 | 24 |
| friendfeed.com | 164 | 98 |
| lifehacker.com | 152 | 72 |
| twitpic.com | 115 | 69 |
| twitter.com | 114 | 69 |
| www.cnn.com | 106 | 63 |
| news.bbc.co.uk | 102 | 53 |
| www.youtube.com | 94 | 74 |
| www10.nytimes.com | 82 | 47 |
| truemors.nowpublic.com | 78 | 52 |
| www.whoppersacrifice.com | 78 | 78 |
| sethgodin.typepad.com | 77 | 71 |
| www.smashingmagazine.com | 76 | 68 |
| twitrss.dyndns.org | 63 | 4 |
| www.msnbc.msn.com | 61 | 26 |
| www.microsoft.com | 59 | 59 |
| www.ustream.tv | 57 | 57 |
| news.yahoo.com | 54 | 30 |
| www.google.com | 52 | 48 |
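Numbers like the ones in these two tables can be computed with a few lines of Python. This is a minimal sketch, not the actual Tweetsnet code: the function names are made up, and it assumes the citations are available as (user, URL) pairs with short URLs already expanded to their final destination.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_stats(citations):
    """Map each domain to (cites, unique users).

    `citations` is an iterable of (user, url) pairs.
    """
    cites = defaultdict(int)
    users = defaultdict(set)
    for user, url in citations:
        domain = urlparse(url).netloc.lower()
        if not domain:
            continue  # skip anything that does not parse as a full URL
        cites[domain] += 1       # frequency: every citation counts
        users[domain].add(user)  # reach: each user counts once
    return {d: (cites[d], len(users[d])) for d in cites}


def top_domains(stats, by="users", limit=25):
    """Sort domains by "users" (reach) or "cites" (frequency), highest first."""
    index = 1 if by == "users" else 0
    ranked = sorted(stats.items(), key=lambda item: item[1][index], reverse=True)
    return ranked[:limit]
```

Calling `top_domains(stats, by="users")` gives the reach ordering and `top_domains(stats, by="cites")` gives the frequency ordering, which is why a domain cited many times by a handful of people ranks high in one list and low in the other.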
I’m glad I haven’t automated the hot Twitter URL list yet… a “fake Twitter” phishing link showed up, so I changed its TinyURL in the database to “disabled.” I should create a page for that, I guess.
As I’ve been working on the algorithm for hot Twitter cites, I’ve noticed that a handful of sources are responsible for most of the top 50 URLs cited.
They are:
Under the assumption that repeating content from popular sites doesn’t add much value, here’s what the current list looks like without those sites.
I’ve added a page to this blog for hot Twitter cites, based on the code I’ve been writing over the last few days.