Social media analytics for decision-making

09 Dec 08 What business problem are we solving?

“Technology in search of a problem” is a longstanding Silicon Valley criticism of companies with smart people who build cool stuff but fail to generate revenue. Sometimes it seems like the entire world of web analytics could be described that way. The squishier the metrics definitions are (prime example: “engagement”), the more accurate the description is.

The most important word in that description is “a.” In other words, find just one problem to solve at a time, rather than a dozen. This is a normal problem in a developing environment, where inventors are driving. The problem is that people who are good at inventing products and services are also good at seeing problems they can solve, so they do a lot of things moderately well and excel at none. In the day-to-day work of web analytics, this often appears as “data smog,” a term I first encountered in Actionable Web Analytics, which credits it to David Shenk.

Somebody involved in every analytics effort absolutely needs to be talking to the ultimate customer, the one who is actually generating revenue, to understand their business. At the very least, analytics should address the specific business problem a customer knows it has. Ideally, it goes further and sees solutions or opportunities that the customer didn’t realize that data could address.

In other words, I think it is a mistake to just ask or settle for being told what numbers to deliver. Ask why the numbers are needed. Ask for the goals and their priorities so that you can set analysis priorities. For example, when the goal is support, the speed of responses in the social network matters a lot more than when the goal is loyalty.

If there is no way around the need for a lot of measurements, fight data smog by knowing your priorities and packaging the data with some sort of drill-down that puts the most important numbers on top. This is where the dashboard idea becomes critical – hide the complexity behind a one-page (or less) summary. The fact is that even when people think they want complexity, they almost never do. I created a set of Excel-based dashboards with deep drill-down… and I’ll bet that hardly anyone used more than a small percentage of what was there. But it was there if they wanted it and that keeps people happy.

All of this adds up to one word: focus. A venture capitalist friend gave me a mantra that stuck: the five most important things (for startups, but it applies to innovation in general) are focus, focus, focus, distribution and focus.


06 Dec 08 Bad assumptions

In a lengthy conversation with a fairly well-known analytics thought leader earlier this year, I was startled when he mentioned that he didn’t know much about statistics.  We were talking about my preference for simplicity and I was describing methods of eliminating redundancy.  I gave up.  While I would not claim to be an expert on statistics, I took the essential classes in college, I’ve been mentored by a couple of Ph.D. statisticians and I do know the parts that are important to the kind of analytics I do.

If there is one thing I’d urge on anyone doing this kind of work, it is to be in the habit of examining the assumptions behind your metrics and models.  Bad assumptions lead to ambiguous or meaningless data.

Here is my favorite bad assumption: it is good when visitors view more pages.  No.  That behavior might mean bad site design.  If people can accomplish their goals by clicking fewer pages, that generally would be considered an improvement.  “More page views” is good for increasing the frequency of advertising, which was pretty much an unquestioned good thing before the Internet came along.  It is not necessarily true any more, so it has become a bad assumption.

Here’s a bad assumption that should be thought-provoking: Some of the values will be equal to the average.  Wrong.  My favorite example is that nobody has the average number of arms.  Think about it.
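To make the arms example concrete, here is a minimal sketch using a made-up population of 1,000 people. The counts are invented purely for illustration.

```python
from statistics import mean

# Hypothetical population: most people have two arms, a few have one or none.
arms = [2] * 997 + [1, 1, 0]

average = mean(arms)  # slightly below 2

# Collect every value that is exactly equal to the average.
matches = [a for a in arms if a == average]

print(average, len(matches))
```

The average lands a little below two, and the list of matching values is empty: nobody in the data has the average number of arms.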

Perhaps the most common bad assumption that leads to statistical struggles is that data is distributed normally.  Plenty has been written about this subject and I won’t try to summarize it, but here are some things to remember:

  • A lot of data that isn’t normally distributed becomes approximately normal once you take the logarithm of each value.  That’s part of why log-log scatterplots are such a great tool.
  • Unlike means and standard deviations, medians and other percentile-based measures (such as the interquartile range) are “robust,” meaningful even when data is skewed or contains outliers.  Rely on them heavily.
  • Get to know the Poisson distribution.  It shows up frequently in social media.
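The points in this list can be tried out with nothing but the Python standard library. This is only a sketch: the page-view numbers and the rate of three posts per hour are made up, and `poisson_pmf` is a helper written for this example, not a library function.

```python
import math
from statistics import mean, median

# Hypothetical page views per visitor: most are small, one is huge.
page_views = [1, 2, 2, 3, 4, 5, 8, 13, 40, 900]

mean_views = mean(page_views)      # dragged upward by the single heavy visitor
median_views = median(page_views)  # barely notices the heavy visitor

# After a log transform the values are far less skewed,
# so the mean and median land much closer together.
log_views = [math.log10(v) for v in page_views]
log_gap = abs(mean(log_views) - median(log_views))

def poisson_pmf(k, rate):
    """Probability of exactly k events when `rate` events are expected per interval."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

# Made-up rate: a forum that averages 3 new posts per hour.
quiet_hour = poisson_pmf(0, 3)  # chance of an hour with no posts at all

print(mean_views, median_views, round(log_gap, 2), round(quiet_hour, 3))
```

On the raw counts the mean is more than twenty times the median. On the log scale the two are within a fraction of a unit of each other.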


04 Dec 08 The social media data warehouse

I promised an overview of how data moves into and out of a data warehouse, so here goes.  The short version of “data in” is that there are ETL (extract, transform and load) processes that get data from various sources, change it into a structure suitable for the data warehouse, then load it into tables in the database.  The way data comes out is via SQL queries.  Nothing really unusual there, so let me explain how these are different from typical databases.
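For readers who have not seen ETL up close, here is a toy version of “data in” and “data out” using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the log line format, the `page_views` table and the `transform` function are hypothetical, and a real warehouse would use a full database server and far more careful parsing.

```python
import sqlite3

# Extract: raw lines as they might arrive from a (hypothetical) web server log.
raw_lines = [
    "2008-12-04T10:15:00 alice /blog/42",
    "2008-12-04T10:16:30 bob /forum/7",
    "2008-12-04T11:02:10 alice /blog/43",
]

def transform(line):
    """Split one log line into the (day, hour, user_name, page) row the table expects."""
    timestamp, user_name, page = line.split()
    day, time_of_day = timestamp.split("T")
    return day, int(time_of_day[:2]), user_name, page

# Load: insert the transformed rows into a warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE page_views (day TEXT, hour INTEGER, user_name TEXT, page TEXT)"
)
warehouse.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?, ?)",
    [transform(line) for line in raw_lines],
)

# Data out: an ordinary SQL query.
views_per_user = warehouse.execute(
    "SELECT user_name, COUNT(*) FROM page_views GROUP BY user_name ORDER BY user_name"
).fetchall()
print(views_per_user)
```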

In a dimensional data warehouse, the data is extremely “denormalized,” meaning that instead of designing tables to eliminate redundancy and maximize integrity, they are designed to be extremely simple.  Ideally, there are only two types of tables – facts and dimensions – and every query joins fact tables to dimensions.  This is called a “star” schema.  Imagine a fact table at the center with dimensions as points of the star.  A typical web analytics fact table is a clickstream log; associated dimensions might be days, users, visitors, IP addresses and so forth.  That’s it in a nutshell.  Grab a book about data warehousing if you want details (and there are plenty) but I’m going to focus on some of the ways that a data warehouse for social media might be different.  There’s a lot to say about this, so this is just the first post in a series.
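A tiny star schema can be sketched with sqlite3. The table names (`fact_click`, `dim_day`, `dim_user`), the columns and the sample rows are all invented for this example; they are not the schema of any particular warehouse.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Dimension tables: one row per day and one row per user.
    CREATE TABLE dim_day  (day_id INTEGER PRIMARY KEY, calendar_date TEXT, weekday TEXT);
    CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, name TEXT, member_type TEXT);

    -- Fact table: one row per click, pointing at the dimensions.
    CREATE TABLE fact_click (day_id INTEGER, user_id INTEGER, page TEXT);

    INSERT INTO dim_day  VALUES (1, '2008-12-03', 'Wednesday'), (2, '2008-12-04', 'Thursday');
    INSERT INTO dim_user VALUES (1, 'alice', 'moderator'), (2, 'bob', 'member');
    INSERT INTO fact_click VALUES
        (1, 1, '/blog/42'), (1, 2, '/blog/42'), (2, 2, '/forum/7'), (2, 2, '/blog/43');
""")

# Every report has the same shape: join the fact table to its dimensions,
# then group by whichever dimension attributes the question is about.
clicks_by_weekday_and_type = db.execute("""
    SELECT d.weekday, u.member_type, COUNT(*) AS clicks
    FROM fact_click AS f
    JOIN dim_day  AS d ON d.day_id  = f.day_id
    JOIN dim_user AS u ON u.user_id = f.user_id
    GROUP BY d.day_id, d.weekday, u.member_type
    ORDER BY d.day_id, u.member_type
""").fetchall()
print(clicks_by_weekday_and_type)
```

Asking a different question usually means changing only the `GROUP BY` columns, which is a large part of the appeal of keeping the structure this simple.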

If I were starting today, I think I would completely violate one of the principles of data warehousing and plan for the social media data warehouse to be on-line, a production system, rather than keeping it isolated.  The normal model is that the data warehouse only serves reporting and analytics needs of managers, clients, etc.  However, “analytics needs” are becoming part of the user experience.

We are so accustomed to thinking of analytics and reporting as a management tool, a way to keep clients and advertisers happy, that we forget that our communities can benefit from analytics, too.  Increasingly, when we discover something interesting in the data, there is value in exposing it to the community.  Unfortunately, that tends to be very slow to happen because analytics is usually a step-child in the engineering family.  When resources are tight, the production system gets priority, as it should.  So make the data warehouse a production system, not for analytics job security, but because analytics can add real value to social media.

Picture a community where anybody can blog.  Typically, the only feedback is how many comments a posting gets.  Imagine if each blogger could see how many visitors and page views each post gets.  Imagine if they could see which of their words are generating search engine hits.  In other words, empower each user to do their own SEO by giving them analytics-based feedback.  The vast majority probably won’t, but those few who do may have a great impact.  In social media, empowering the super-users may be the key to success.  Analytics is how you identify them, but don’t stop there.

The possibilities are enormous.  Social media analytics can tell us which people or groups have the greatest influence.  Feed that data back to the community.  Analytics can tell us which topics are heating up – feed that back.  Tell the community where the new visitors are coming from; there’s probably something interesting out there when it changes.

The architectural challenges are not simple, given that data warehousing isn’t intended to give real-time results the way that live production systems are.  I suspect that the path to this kind of capability is to mirror an aggregated version of the data warehouse in near real-time and let the production system query the mirror.  In any event, I do believe it is the way things are headed.  Analytics isn’t just a management tool – or perhaps it is, but we forget that every visitor can be a manager.


03 Dec 08 Analytics data warehousing

For many analytics practitioners, how to store data is somebody else’s problem – Google, Omniture, Yahoo/IndexTools or another third-party provider.  As Gary Angel points out, “Sophisticated organizations are increasingly finding good reasons to move data from their web analytics tools to other data processing and analysis platforms.”  Indeed.

Having spent the last few years designing, building and using a terabyte-scale analytics data warehouse, I have to agree with Gary.  He calls it “moving” or “transferring” the data, but I see it as an “also,” rather than “instead of.”  At LiveWorld, our data warehouse was complementary to various third-party analytics systems that our clients used.  Name just about any tag-based solution and we were supporting it.  We were even starting to test Google Analytics in “hybrid” mode, where the data goes to Google and to the local server, potentially allowing tag-based data to go into the data warehouse, which would provide a cross-check, at the very least.  Some of our customers, particularly bankers, banned third-party JavaScript for security reasons, so they relied entirely on the data warehouse.  (We could have hosted the tag scripts locally, but decided not to deal with the resulting maintenance issues.)

One of our clients did something that I suspect many other large businesses will opt for – they asked us to create a daily feed for their data warehouse.  The feed didn’t include full detail; it was aggregated data, relatively easy to customize because the source was a set of queries against our warehouse.  This approach offers the advantage of allowing deep data integration on the client’s systems.  They were able to out-source social media to us, yet run integrated reports while preserving customer privacy.

With a data warehouse, you can go beyond the kind of data you can get from tag- or log-based systems.  Social media is virtually always based on application servers.  There’s a database underlying the app server, which means that you can periodically query the database and stick the results into the data warehouse.  In some cases, you’ll just want a snapshot of the state of the app server.  Much of the time, you’ll want the app server to record events with a timestamp, so you can query for the full detail of what happened and when.
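The difference between a snapshot and timestamped events is easiest to see side by side. In this sketch an in-memory SQLite database stands in for the app server's database; the `posts` and `post_events` tables and their contents are hypothetical.

```python
import sqlite3

app = sqlite3.connect(":memory:")  # stands in for the app server's database
app.executescript("""
    -- Current state only: the app overwrites status as posts are moderated.
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO posts VALUES (1, 'approved'), (2, 'approved'), (3, 'rejected');

    -- Event log: every change is recorded with a timestamp.
    CREATE TABLE post_events (post_id INTEGER, event TEXT, happened_at TEXT);
    INSERT INTO post_events VALUES
        (1, 'submitted', '2008-12-03 09:00:00'),
        (1, 'approved',  '2008-12-03 09:04:00'),
        (2, 'submitted', '2008-12-03 10:30:00'),
        (3, 'submitted', '2008-12-03 11:00:00'),
        (3, 'rejected',  '2008-12-03 11:45:00'),
        (2, 'approved',  '2008-12-03 12:30:00');
""")

# Snapshot: what the app server looks like at the moment we query it.
snapshot = app.execute(
    "SELECT status, COUNT(*) FROM posts GROUP BY status ORDER BY status"
).fetchall()

# Event detail: how many minutes each post waited for a moderation decision.
wait_minutes = app.execute("""
    SELECT s.post_id,
           (strftime('%s', d.happened_at) - strftime('%s', s.happened_at)) / 60
    FROM post_events AS s
    JOIN post_events AS d
      ON d.post_id = s.post_id AND d.event IN ('approved', 'rejected')
    WHERE s.event = 'submitted'
    ORDER BY s.post_id
""").fetchall()

print(snapshot)
print(wait_minutes)
```

The snapshot can say how many posts are approved right now, but only the event log can answer how long moderation took.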

Once you have a data warehouse up and running, the big advantage is reporting flexibility.  The whole point of a data warehouse is to store information in a way that allows fast-running queries to be written easily.  At the most basic level of a typical data warehouse, there are no pre-suppositions about what queries will be run.  However, to gain acceptable performance, aggregates often need to be created (ideally, invisibly to users) for commonly run queries.
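Here is a minimal sketch of what an aggregate looks like, again with sqlite3 and invented names (`fact_click`, `agg_daily_page`). It assumes the aggregate is rebuilt by some scheduled job, which is not shown.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE fact_click (day TEXT, page TEXT, visitor TEXT)")
db.executemany("INSERT INTO fact_click VALUES (?, ?, ?)", [
    ("2008-12-02", "/blog/42", "alice"),
    ("2008-12-02", "/blog/42", "bob"),
    ("2008-12-02", "/forum/7", "alice"),
    ("2008-12-03", "/blog/42", "alice"),
    ("2008-12-03", "/blog/42", "alice"),
])

# Aggregate: pre-compute the commonly requested daily totals once,
# instead of recounting the detail rows for every report.
db.execute("""
    CREATE TABLE agg_daily_page AS
    SELECT day, page,
           COUNT(*) AS page_views,
           COUNT(DISTINCT visitor) AS visitors
    FROM fact_click
    GROUP BY day, page
""")

# Reports read the small aggregate rather than scanning every click.
daily_report = db.execute(
    "SELECT day, page, page_views, visitors FROM agg_daily_page ORDER BY day, page"
).fetchall()
print(daily_report)
```

One caution with this design: page views can be summed across days, but distinct visitors cannot, so anything that needs unique counts over a longer period still has to go back to the detail rows.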

Next, I’ll give a quick overview of how data typically gets into and out of a data warehouse.
