As I’ve been working on the algorithm for hot Twitter cites, I’ve noticed that a handful of sources are responsible for most of the top 50 URLs cited.
They are:
Under the assumption that repeating content from popular sites doesn’t add much value, here’s what the current list looks like without those sites.
I’ve added a page to this blog for hot Twitter cites, based on the code I’ve been writing over the last few days.
Following up on my last post, here’s the up-to-date list of URLs that are frequently cited on Twitter, among the pages and people that my scraper has explored. (Look down a few posts to see how it is somewhat intelligently exploring.) The list shows how many URL variations are being used (an indicator of how many people independently decided to tweet about the page) and how many people have included a URL in their tweets in the last few days.
The list appears below, ranked by the number of users citing the URL. Below that is a list of URLs that had at least 50 cites, ranked by the number of unique URLs, which would be the way to do it if you wanted to find hot topics that people are finding independently of each other.
Here’s the list ranked by the number of unique URLs, for those that were cited by at least 50 users.
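The two rankings can be sketched in a few lines of Python. This is a rough illustration, not the scraper's actual code: the function names and the `(resolved_url, short_url, user)` tuple layout are assumptions made for the example.

```python
from collections import defaultdict

def summarize(citations):
    """Count distinct users and distinct short URLs per resolved page.

    citations: iterable of (resolved_url, short_url, user) tuples
    (a hypothetical layout chosen for this sketch).
    Returns {resolved_url: (user_count, variant_count)}.
    """
    users = defaultdict(set)
    variants = defaultdict(set)
    for resolved, short, user in citations:
        users[resolved].add(user)
        variants[resolved].add(short)
    return {url: (len(users[url]), len(variants[url])) for url in users}

def rank_by_users(summary):
    """Pages ordered by how many people cited them, most first."""
    return sorted(summary, key=lambda url: summary[url][0], reverse=True)

def rank_by_variants(summary, min_users=50):
    """Pages with at least min_users citers, ordered by URL variations."""
    eligible = [url for url in summary if summary[url][0] >= min_users]
    return sorted(eligible, key=lambda url: summary[url][1], reverse=True)
```

The `min_users` threshold defaults to the 50-cite cutoff used for the second list; lowering it makes the sketch easy to try on a small sample.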
Note: There actually were a large number of unique URLs that point back to the Twitter home page and Facebook, but I think that they didn’t resolve properly. In the case of Facebook, I suspect a second redirect is taking place because they all resolved to the login page. As for the ones that point back to the Twitter home page, I don’t know what’s going on there. User error, perhaps, but I’ll have to dig deeper.
My Twitter scraper allows me to see which web pages are being cited by the most people (out of the relatively small, but closely related, sample of Twitterers that it has explored). I can see how many variants of “shrunken” URLs are pointing to the same page, which gives a rough idea of how many people are independently finding and citing a page.
Below is a list of people who recently cited How to Use Twitter to Grow Your Business in a tweet. This was one of the most widely cited URLs with a relatively high number of URL variations that my scraper found over the last few days.
This page was cited by 98 people* with 11 URL variations. They appear below, in chronological order with some typographic hints. Each unique shrunken URL is color-coded, which indicates that those people are near one another in the Twitter social network. That’s especially cool because it did not require scraping through all of their followers to figure out, which would have consumed far more resources. The publication time is underlined for the first person who used each unique URL. Those people are likely to be opinion leaders, particularly if they created the shrunken URL (which a Google search on the URL would be likely to reveal). The people who tweeted the URLs earlier are more likely to be opinion leaders, since more people may have been influenced by them.
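Finding the first person to use each shrunken URL is a small grouping problem. Here is a minimal sketch, assuming the citations are available as `(timestamp, user, short_url)` tuples; that layout and the function name are made up for the example.

```python
def first_citers(citations):
    """Return {short_url: (user, timestamp)} for the earliest use of each short URL.

    citations: iterable of (timestamp, user, short_url) tuples
    (a hypothetical layout; timestamps only need to be sortable).
    """
    first = {}
    # Sorting the tuples orders them by timestamp, so the first time a
    # short URL shows up in the loop is its earliest citation.
    for timestamp, user, short_url in sorted(citations):
        first.setdefault(short_url, (user, timestamp))
    return first
```

Everyone else who used the same short URL can then be listed under that first citer, in time order.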
In addition to identifying hot topics and cliques within the social network, this is also a means of finding people who might be worth following, given that the first people to cite a URL that becomes popular are either influential or are good at spotting trends early (a distinction that can be impossible to resolve).
* I’m getting the first 100 search results for each URL, which means that the results are comprehensive for each “shrunken” URL with 100 or fewer results. However, there’s no way to know, short of looking at the entire Twitter timeline and resolving all the URLs, if there are more people citing the same page. I’m planning to use the words associated with the URLs to do further searches to discover more people who cited popular pages.
One of the most common naive errors in statistics is to confuse correlation with causality. Common sense tries to tell us that when two events co-occur, the first one is causing the second to happen (which often is the case). Red traffic lights correlate to cars stopping and sure enough, we know that red lights cause (most) people to stop. But sometimes things correlate because a third, external mechanism is influencing them. Drownings increase as ice cream sales rise, but ice cream isn’t causing drownings. The external factor is summer, of course.
This is on my mind because over the last few days, when cold medicine hasn’t fogged my brain up so much that I couldn’t think, or at least couldn’t think logically, I’ve been working with the Twitter APIs to see what I could come up with in terms of tracking topics as they move around on Twitter. I’m attracted to Twitter because its immediacy and brevity make it relatively easy to analyze.
Eventually, what I hope to do is find useful patterns in the interplay of words, Twitter screen names, URLs cited and hashtags (and any other entities that could be extracted). I’m focusing first on URLs, since they are sort of the “stories behind the headlines” on Twitter. My friend Dave Land this morning mentioned that writing a tweet is like writing a headline. If so, then the cited URLs in those tweets are like the stories behind the headlines.
I’ve put Python and SQL to work scraping statuses from Twitter, pulling out word pairs (I’m planning to analyze them with the other entities via LSA), screen names and URLs. I’m resolving all the little URLs to the pages they actually point to, since Twitter users, limited to 140 characters, frequently use services like TinyURL to shorten them, but I want to see when people are citing the same URLs even if the shrunken URLs are different. In fact, looking at the ratio of shrunken URLs to actual URLs is interesting – if it is high, that means that a lot of people are finding the cited page independently, rather than retweeting it or getting it from the same source external to Twitter.
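As a rough sketch of the resolving step (not the actual scraper code), the shrunken URLs can be grouped by the page they end up at, and the ratio falls out of the groups. The function names are assumptions for this example. `resolve` relies on `urllib.request.urlopen` following HTTP redirects, which will not catch redirects done in JavaScript or pages that bounce to a login screen.

```python
import urllib.request
from collections import defaultdict

def resolve(short_url, timeout=10):
    """Follow HTTP redirects and return the final URL (makes a network request)."""
    with urllib.request.urlopen(short_url, timeout=timeout) as response:
        return response.geturl()

def group_variants(short_urls, resolver=resolve):
    """Map each actual page to the set of short URLs that point to it."""
    groups = defaultdict(set)
    for short_url in short_urls:
        groups[resolver(short_url)].add(short_url)
    return dict(groups)

def variant_ratio(groups):
    """Average number of short URLs per actual page (0.0 if there are no pages)."""
    if not groups:
        return 0.0
    return sum(len(shorts) for shorts in groups.values()) / len(groups)
```

Passing a different `resolver` (a cache lookup, say) avoids hitting the network again for short URLs that have already been resolved.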
As I find cited URLs, I’m using Twitter’s search API to get the most recent mentions of them, then storing the identities of the users who also cited them and when they did so. That gives me a timeline of URL citations. I’m not tracking explicit retweets, so I don’t know whether the first people to cite a URL are more influential or not.
I haven’t asked Twitter to white-list me yet, so I’m working within the limitations of their API – 100 requests per hour. That forces me to be as smart as possible about how my code explores the data. I started by choosing somebody who has a decent number of followers, but not too many, so that it wouldn’t take too long to scrape the person’s followers’ tweets. I chose Tim O’Reilly because I suspect he is fairly influential on Twitter and we’ve had some conversations that go back to the mid-90s about how to figure out “what the Internet is thinking today.”
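Staying under a limit like 100 requests per hour can be handled with a small sliding-window throttle. This is a generic sketch, not anything provided by Twitter's API; the class name and its parameters are invented for the example.

```python
import time
from collections import deque

class RateLimiter:
    """Block until another request fits within max_requests per period seconds."""

    def __init__(self, max_requests=100, period=3600.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.period = period
        self.clock = clock      # injectable so the class can be tested
        self.sleep = sleep
        self.stamps = deque()   # times of the requests inside the window

    def _expire(self, now):
        # Forget requests that have fallen out of the window.
        while self.stamps and now - self.stamps[0] >= self.period:
            self.stamps.popleft()

    def wait(self):
        """Call once before each API request."""
        now = self.clock()
        self._expire(now)
        while len(self.stamps) >= self.max_requests:
            # Sleep until the oldest request in the window expires.
            self.sleep(self.period - (now - self.stamps[0]))
            now = self.clock()
            self._expire(now)
        self.stamps.append(now)
```

Calling `limiter.wait()` before every request keeps the scraper inside the window without having to count requests by hand.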
O’Reilly’s company was one of the first, if not the very first, to measure social media for market research. Many years ago, they were scraping Usenet to help decide which technologies would make good topics for books. I recall that one of the first decisions they made from that data was to choose between two open-source database projects, MySQL and mSQL. They chose MySQL… and therein lies another reminder of causality v. correlation. Did MySQL succeed because O’Reilly chose to focus on it, or did O’Reilly succeed because it chose the right books to publish? There is no way of knowing, but I have personal evidence that O’Reilly doesn’t always choose the right topics… or perhaps the right authors. That’s a story for another day.
After a lot of wrangling with Python, MySQL and technical issues having to do with Unicode and my inability to write a correlated subquery under the influence of Sudafed, I have something working. It starts by scraping Tim’s recent tweets and then searches for people who also cited the same URLs. Next, it explores the people it has found who cite the greatest number of URLs overall. It adds 10 people at a time and then re-ranks to see who it should explore next.
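The exploration loop described above might look something like this in outline. It is a simplified sketch, not the real code: `fetch_urls` and `find_citers` are hypothetical callbacks standing in for the scraping and search steps, and ties in the ranking are broken alphabetically just to keep the result predictable.

```python
from collections import Counter

def explore(seed, fetch_urls, find_citers, batch_size=10, max_rounds=3):
    """Explore users in batches, re-ranking candidates after each batch.

    fetch_urls(user)  -> iterable of URLs that user cited (hypothetical callback)
    find_citers(url)  -> iterable of users who cited url (hypothetical callback)
    Returns the users explored, in the order they were explored.
    """
    explored = []
    explored_set = set()
    url_counts = Counter()   # user -> number of distinct URLs seen so far
    seen = set()             # (user, url) pairs already counted
    batch = [seed]
    for _ in range(max_rounds):
        for user in batch:
            explored.append(user)
            explored_set.add(user)
            for url in fetch_urls(user):
                for citer in find_citers(url):
                    if (citer, url) not in seen:
                        seen.add((citer, url))
                        url_counts[citer] += 1
        # Re-rank everyone not yet explored by how many URLs they cite.
        candidates = [u for u in url_counts if u not in explored_set]
        candidates.sort(key=lambda u: (-url_counts[u], u))
        batch = candidates[:batch_size]
        if not batch:
            break
    return explored
```

The `max_rounds` cap is an addition for the sketch so that it terminates; a long-running scraper would more likely loop until it is stopped.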
I’ve been running this for a couple of days now in various forms. So far I have found about 7,000 unique URLs. Only about 300 of them are duplicates – different shrunken URLs for the same page. The URLs have been mentioned about 20,000 times by 7,500 users. I have found about 40,000 two-word phrases (stop words, URLs, screen names and hashtags are excluded) and 52,000 mentions of those phrases, which means that a number of phrases are being used by multiple people.
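Here is a minimal sketch of the word-pair extraction, with several assumptions that may not match the real scraper: the stop-word list is a tiny placeholder, tokens are split on whitespace, and words on either side of a dropped token are treated as adjacent.

```python
import string
from collections import defaultdict

# Placeholder stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on",
              "is", "it", "this", "that", "for", "with", "at"}

def word_pairs(text, stop_words=STOP_WORDS):
    """Return the two-word phrases in a tweet as (word, word) tuples."""
    words = []
    for token in text.lower().split():
        # Skip URLs, screen names and hashtags.
        if token.startswith(("http://", "https://", "@", "#")):
            continue
        word = token.strip(string.punctuation)
        if word and word not in stop_words:
            words.append(word)
    return list(zip(words, words[1:]))

def users_per_phrase(tweets):
    """Count how many distinct users used each phrase.

    tweets: iterable of (user, text) tuples (a hypothetical layout).
    """
    users = defaultdict(set)
    for user, text in tweets:
        for pair in word_pairs(text):
            users[pair].add(user)
    return {pair: len(names) for pair, names in users.items()}
```

Counting distinct users rather than raw mentions keeps one chatty person from pushing a phrase to the top of the list.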
What the heck, here are the top phrases and the number of Twitters (Twitterers? Twits?) who used them over the last few days (remember, this is far from comprehensive):
I suspect that the words “check out” on Twitter are much like the words “click here” were in the early days of the web. “Emergency generator” intrigued me, so I linked it above to Twitter search. Hint: it has to do with the Toyota Prius. I suspect those same people included a link to a New York Times article about it. Interestingly, a number of the people who cited it were not retweeting (at least not explicitly)… but many of them were using a shrunken URL cited by – guess who – Tim O’Reilly. The interesting thing about the popularity of the phrase is that it gives my code a way to discover the other shrunken URLs in a single search, instead of having to scrape everything and resolve every shrunken URL to the actual page. Tim may or may not have influenced those people to look at the article, but it is clear that he is in tune with a topic that people are interested in, which makes him interesting, whether he is an influencer or just well-influenced, so to speak.
Time to publish this post, I guess, even though I’m tempted to wait until today’s cold medicine has worn off to proofread it one more time.
More results here as I come up with them.
Tags: causation, correlation, Influence, twitter
Social media proponents (a/k/a people who want your money for social media products and services) urge us in many ways, but it all boils down to “People are talking about you on the Internet, so you’d better pay attention.” Tapping into the swelling ground, grabbing a long tail or otherwise engaging in social media is supposed to help you make better products, faster, leading to happier customers and more money. But how do you decide who to listen to?
Success generates noise. Millions of customers means millions of comments. The first and easiest answer to this dilemma, which may be good enough – for now – is to figure out which comments are most popular. See which ones are getting the most page views, the most Diggs, Tweets or other indicators that somebody cares.
The problem with that approach is that by the time something is popular, it is often too late to do anything about it. This is particularly true in two situations:
If you have a hit on your hands and you find out too late to create more inventory before the crowd has moved on, you missed an opportunity. There’s a corollary to this: the sooner you find out you have a dud, the faster you can stop wasting resources on it.
The really unfortunate thing about angry and unhappy people is that they consistently have more energy to invest in bad-mouthing than happy people have for paying compliments. That’s a well-known fact of marketing. And life. By the time grumpiness about you and your stuff becomes popular, a lot of damage has been done, obviously. Knowing who is influential can help you prevent grumpiness in the first place or do better at quelling it before it becomes popular opinion.
In mass media, the important metrics focus on popularity. Although popularity still matters, digital social networking allows us to measure influence, at least indirectly. The difference between being popular and being influential is very simple to understand in principle.
If I have 5,000 followers on Twitter, I’m obviously fairly popular. I probably am also influential. There are ways to figure that out, such as by measuring how much interaction I engage in or how many times my tweets are “re-tweeted.”
If I have one follower, does that mean I am not influential? Not if that one follower is a fellow named Barack Obama, who has more than 150,000 followers, according to Twitterholic. That is, if that Obama fellow really is following me. I mean really following me, the way we mean “follow” in the real world.
In other words, a few influential followers can be far more significant than thousands with limited influence.
I’m using Twitter as an example because I’ve been working with the Twitter APIs to see how hard it would be to come up with measures of influence. Twitter is growing on me and I think that part of the reason is the language it uses.
From a social network analysis standpoint, Twitter is much easier to deal with. Mostly.
Popularity is a first-order measurement. My popularity on Twitter is the number of followers I have. Influence is a second-order or greater measurement. The simplest measure of potential influence is to see how many followers my followers have. The API makes this very easy to measure (for people who aren’t so popular that the API limits become an obstacle).
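In code, the two measurements are nearly one-liners once the follower lists are in hand. This sketch assumes they have already been fetched into a dictionary, and it simply adds up follower counts, so somebody who follows two of my followers is counted twice. Whether to de-duplicate is a judgment call that the sketch leaves open.

```python
def popularity(user, followers):
    """First-order: how many followers the user has.

    followers: dict mapping a user to a list of that user's followers
    (a hypothetical structure built from earlier API calls).
    """
    return len(followers.get(user, ()))

def second_order(user, followers):
    """Second-order: total followers of the user's followers (not de-duplicated)."""
    return sum(len(followers.get(f, ())) for f in followers.get(user, ()))
```

For example, with `followers = {"me": ["a", "b"], "a": ["x", "y", "z"], "b": ["x"]}`, `popularity("me", followers)` is 2 and `second_order("me", followers)` is 4.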
My Twitter followers are followed by a bit more than 10,000 people. Pretty good, I think, since I only have 28 followers (I haven’t been on Twitter long).
My friend Dave Land has 95 followers and those people are followed by almost 175,000 others. Wow. Dave’s followers are followed by a lot more people than mine are.
Some of my followers are people I believe are influential in the world of web analytics. Let’s see how they do (a third-order measurement of my potential influence, if I did it for all of them). In no particular order:
Avinash Kaushik, Google’s web analytics evangelist, isn’t following me (hey, bub!), but anybody whose title is “evangelist” is supposed to be influential. At the risk of exceeding the Twitter API limits, I ran my gizmo to get his stats. Avinash has about 2,000 followers, who are followed by almost 600,000 others.
If you rank these people by popularity (followers), Avinash is No. 1, hands-down. But if you rank by potential influence, Marshall Sponder’s followers are followed by the most people, which is especially surprising given that Avinash appears to be more than twice as popular.
Dave Land comes in at No. 1 when this group is ranked by the ratio of second-order followers to followers. That means he is doing the best job of attracting followers who attract followers, which is what you need to do if you want your influence to scale beyond immediate popularity. But I should note that having a lot of followers will inevitably dilute your second-order influence, which should comfort Avinash, who came in last on that measurement (thereby saving me from last place in all three rankings).
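To make the ratio concrete, here is the arithmetic for the three people whose figures are quoted in this post. The second-order numbers are rounded from the text ("a bit more than 10,000", "almost 175,000", "almost 600,000"), so treat the ratios as rough.

```python
# (followers, followers-of-followers), rounded from the figures in this post.
stats = {
    "me": (28, 10_000),
    "Dave Land": (95, 175_000),
    "Avinash Kaushik": (2_000, 600_000),
}

def ratio(name):
    """Second-order followers per follower."""
    first, second = stats[name]
    return second / first

by_ratio = sorted(stats, key=ratio, reverse=True)
for name in by_ratio:
    print(f"{name}: about {ratio(name):.0f} second-order followers per follower")
```

With these rounded inputs the ratios come out to roughly 1,842 for Dave, 357 for me and 300 for Avinash.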
I should note a messy bit of this measurement – sites like Woot, Twemes, hashtags.org and others that automatically follow you when you follow them. Ugh. I haven’t figured out a good way to exclude them, so I’m just doing it manually… and I haven’t thoroughly made sure I caught all of them. So there’s hope, Avinash – maybe Marshall is just signed up for more of those. In any case, don’t take these numbers too seriously. I’m going to work on some additional data points – number of replies and such, to strengthen the results.
Or can somebody save me this work and point to a site that has already done this sort of analysis? I searched but didn’t see anybody looking at second-order popularity.
I have switched from hashtags.org to Twemes for the Twitter tag feed for web analytics (#wa) that appears in the sidebar. Sorry, hashtags, but you just weren’t reliable.
I have also added a Flash widget for Twitter, showing my tweets, to the top of the sidebar. It’s kind of flashy, so I might switch to the text version.
Finally, I have added TwitThis (shouldn’t that be “TweetThis”?) at the bottom of each post, allowing you, my fine readers, to Tweet, with ease, I hope, any post you choose.
Ah, the joy of actually using social media!
Tags: admin