In my previous post here, I mentioned that some variables are not worth including because they essentially measure the same thing. I’ll write a longer piece about this later, but it has been bugging me all day that I didn’t explain the basics of identifying redundant variables. In addition to simplifying complex algorithms, dropping redundant variables can also help you avoid “data smog” in your reports. To repeat an example, I don’t think there is any reason to report both page views and time on site. They measure essentially the same thing (please write to me if you find that this is consistently not so on your site; I’d love to hear about it). At the same time, it is a good idea to keep monitoring such relationships and trigger an alert if the metrics stop correlating.
The first thing I almost always do when I’m examining pairs of metrics is generate a log-log scatterplot. As long as you don’t try to graph too many points, Excel is a fine tool for creating log-log scatterplots. All you need to do is generate pairs of values in two columns. For example, if you are comparing page views to time on site for visitors, you would create one row in Excel for each visitor, with two values: the number of page views and the time on site. The order doesn’t matter. The scales of the two numbers should be roughly comparable, so you’d probably want to have time on site in minutes. You should be looking at enough data to give you thousands of points or more; otherwise it might be hard to see patterns. Labels don’t matter much, either. Select your data, insert a scatterplot, then go into formatting for the X and Y axes and set each scale to logarithmic. That’s it.
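If your visitor data lives somewhere other than Excel, a small script can produce the two columns for you. The following is only a sketch in Python: the input format (a list of page-view counts paired with seconds on site), the function names, and the column headers are all assumptions made for illustration. It converts time on site to minutes and skips rows containing a zero, since a logarithmic axis cannot show zero or negative values.

```python
import csv
import io


def build_rows(visits):
    """Turn (page_views, seconds_on_site) pairs into (page_views, minutes) rows.

    The input shape is hypothetical: one pair per visitor.
    """
    rows = []
    for page_views, seconds in visits:
        minutes = seconds / 60.0
        # A logarithmic axis cannot show zero or negative values, so skip them.
        if page_views > 0 and minutes > 0:
            rows.append((page_views, round(minutes, 2)))
    return rows


def to_csv(rows):
    """Render the rows as two-column CSV text that a spreadsheet can open."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["page_views", "minutes_on_site"])  # hypothetical headers
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample data, only to show the shape of the input.
sample_visits = [(1, 0), (3, 95), (12, 610), (2, 40)]
rows = build_rows(sample_visits)
csv_text = to_csv(rows)
```

Save `csv_text` to a file, open it in Excel, and follow the same steps: select the two columns, insert a scatterplot, and set both axes to logarithmic.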
You’re looking for something very simple in the scatterplot – whether or not the points cluster into a line or single area. If they are all over the chart, then there is no direct correlation between the two metrics – they are not redundant. If they cluster together, even if there are some outliers (and there usually are), then the two metrics are mostly redundant and you probably can drop one of them.
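If you want a rough number to go with the eyeball test, or something to drive the alert I mentioned at the top, one option is the ordinary Pearson correlation computed on the logarithms of the two metrics. This is a sketch under my own assumptions, not a recommended method: the function names are made up, and the 0.7 cutoff is an arbitrary placeholder you would need to tune for your own site.

```python
import math


def log_log_correlation(pairs):
    """Pearson correlation of log10(x) and log10(y) over positive (x, y) pairs."""
    positive = [(x, y) for x, y in pairs if x > 0 and y > 0]
    if len(positive) < 2:
        raise ValueError("need at least two positive pairs")
    xs = [math.log10(x) for x, _ in positive]
    ys = [math.log10(y) for _, y in positive]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("one metric is constant; correlation is undefined")
    return cov / math.sqrt(var_x * var_y)


ALERT_THRESHOLD = 0.7  # hypothetical cutoff, not a standard value


def should_alert(pairs, threshold=ALERT_THRESHOLD):
    """Return True when the two metrics no longer look strongly correlated."""
    return log_log_correlation(pairs) < threshold


# Made-up examples: points on a straight line versus points all over the chart.
tight = [(1, 10), (10, 100), (100, 1000)]
scattered = [(1, 1000), (10, 10), (100, 1000), (1000, 10)]
```

A value near 1 corresponds to points hugging a rising line on the log-log chart. A single number can hide interesting outliers, so treat it as a companion to the scatterplot rather than a replacement.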
I’ll write more about this later, including ways to quantify the relationship, why the outliers (the points that are away from the cluster) might be interesting and more.