I promised an overview of how data moves into and out of a data warehouse, so here goes. The short version of “data in” is that there are ETL (extract, transform and load) processes that get data from various sources, change into a structure suitable for the data warehouse, then load it into tables in the database. The way data comes out is via SQL queries. Nothing really unusual there, so let me explain how these are different from typical databases.
In a dimensional data warehouse, the data is extremely “denormalized,” meaning that instead of designing tables to eliminate redundancy and maximize integrity, they are designed to be extremely simple. Ideally, there are only two types of tables – facts and dimensions – and every query joins fact tables to dimensions. This is called a “star” schema. Imagine a fact table at the center with dimensions as points of the star. A typical web analytics fact table is a clickstream log; associated dimensions might be days, users, visitors, ip addresses and so forth. That’s it in a nutshell. Grab a book about data warehousing if you want details (and there are plenty) but I’m going to focus on some of the ways that a data warehouse for social media might be different. There’s a lot to say about this, so this just a first post in a series.
If I were starting today, I think I would completely violate one of the principles of data warehousing and plan for the social media data warehouse to be on-line, a production system, rather than keeping it isolated. The normal model is that the data warehouse only serves reporting and analytics needs of managers, clients, etc.. However, “analytics needs” are becoming part of the user experience.
We are so accustomed to thinking of analytics and reporting as a management tool, a way to keep clients and advertisers happy, that we forget that our communities can benefit from analytics, too. Increasingly, when we discover something interesting in the data, there is value in exposing it to the community. Unfortunately, that tends to be very slow to happen because analytics is usually a step-child in the engineering family. When resources are tight, the production system gets priority, as it should. So make the data warehouse a production system, not for analytics job security, but because analytics can add real value to social media.
Picture a community where anybody can blog. Typically, the only feedback is how many comments a posting gets. Imagine if each blogger could see how many visitors and page views each post gets. Imagine if they could see which of their words are generating search engine hits. In other words, empower each user to do their own SEO by giving them analytics-based feedback. The vast majority probably won’t, but those few who do may have a great impact. In social media, empowering the super-users may be the key to success. Analytics is how you identify them, but don’t stop there.
The possibilities are enormous. Social media analytics can tell us which people or groups have the greatest influence. Feed that data back to the community. Analytics can tell us which topics are heating up – feed that back. Tell the community where the new visitors are coming from; there’s probably something interesting out there when it changes.
The architectural challenges are not simple, given that data warehousing isn’t intended to give real-time results the way that live production systems are. I suspect that the path to this kind of capability is to mirror an aggregated version of the data warehouse in near real-time and let the production system query the mirror. In any event, I do believe it is the way things are headed. Analytics isn’t just a management tool – or perhaps it is, but we forget that every visitor can be a manager.
Tags: analytics, data warehouse