In a previous post, we worked through the mechanics of linear regression. Today we will cover how to interpret our results and their limitations. Last time, we calculated a best fit line of:
Y = 1.6x – 2, or equivalently, monthly sales = 1.6 * (median home value) – 2
Remember that both monthly sales and median home value are in units of $100,000. So, on average, we would expect a $100,000 increase in median home values to increase our monthly sales by $160,000. In our example, to keep the number of calculations under control, I only used five pairs of data points. To consider our results reliable, we would need more data points. There is no precise consensus on how many, but 30 or more is a good starting point. I prefer 42 because it was worn by Jackie Robinson, and it is, after all, the “Answer to The Ultimate Question of Life, the Universe, and Everything”. The calculations needed to analyze a few thousand data points by hand would be tedious, but we have computers. A harder problem is when we do not have enough data points. In that case, we have to take our results with a grain of salt, or look to other tools for our analysis.
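To make the units concrete, here is a minimal Python sketch of the fitted line. The slope and intercept come from the equation above; the function name `predict_sales` and the constant names are illustrative choices of mine, not part of the original calculation.

```python
SLOPE = 1.6        # slope of the fitted line from the post
INTERCEPT = -2.0   # intercept of the fitted line from the post
UNIT = 100_000     # both variables are measured in units of $100,000


def predict_sales(median_home_value_dollars):
    """Predicted monthly sales in dollars for a median home value in dollars."""
    x = median_home_value_dollars / UNIT      # convert dollars to model units
    return (SLOPE * x + INTERCEPT) * UNIT     # convert the result back to dollars


# The slope is the expected change in sales for a one-unit ($100,000)
# increase in median home value. The two home values are arbitrary.
change = predict_sales(400_000) - predict_sales(300_000)
print(f"Expected change in monthly sales: ${change:,.0f}")
```

The difference between any two predictions that are $100,000 apart in home value is the same, which is exactly what the slope of a straight line means.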
Other potential problems we have to look out for are:
- Omitted variables: In our example we only considered the impact that median home values have on monthly sales. There are more factors that affect monthly sales, and we can measure many of them and expand our analysis to a multiple regression model and/or non-linear models, but we can never be sure we have captured every single variable. If one of your store managers is more capable than her counterparts at other locations, not accounting for her excellence can lead to problems in your analysis of different locations.
- Reverse Causality: If you think home values have an impact on your monthly sales, then you will put your first stores in areas with high home values. Your first locations will also benefit from the greater care you can take with each store. As you expand, the amount of time you can spend on each location will decrease, and sales may suffer. If we do not account for the shrinking amount of time you have for each subsequent store, we will overstate the effect of home values on our monthly sales.
- Limited range of independent variables: In our example, if the median home value of an area is $100,000, then we predict monthly sales of -$40,000. This ridiculous result arose because I chose our data points, but it does bring up a real problem with regression. A regression line is only useful for predicting results when the inputs fall within the range of independent variable values we actually observed.
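The limited-range problem in the last bullet can be guarded against in code. Below is a small Python sketch that flags any prediction made outside the observed range of home values. The range (`X_MIN`, `X_MAX`) is a placeholder I made up, because the five data points are not repeated in this post; substitute the minimum and maximum from your own data. The function name `predict_checked` is also hypothetical.

```python
SLOPE = 1.6        # slope of the fitted line from the post
INTERCEPT = -2.0   # intercept of the fitted line from the post

# HYPOTHETICAL observed range of median home values, in units of $100,000.
# These are placeholders, not the post's data. Use your own min and max.
X_MIN = 2.0
X_MAX = 5.0


def predict_checked(x, x_min=X_MIN, x_max=X_MAX):
    """Return (prediction, in_range), with x and prediction in $100,000 units."""
    prediction = SLOPE * x + INTERCEPT
    in_range = x_min <= x <= x_max   # False means we are extrapolating
    return prediction, in_range


for x in (1.0, 3.0):
    y, ok = predict_checked(x)
    note = "within observed range" if ok else "extrapolation, treat with suspicion"
    print(f"x = {x}: predicted sales = {y:.1f} ({note})")
```

With these placeholder bounds, a median home value of $100,000 (x = 1.0) is flagged as an extrapolation, which is the case that produced the negative sales figure.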
So with all of these problems, why do we bother? With modern statistical software, we are able to run thousands of regressions in a matter of seconds, and the results are accurate and easy to interpret. Advanced regression techniques have been developed to deal with some of the problems we have covered here, as well as more technical difficulties like autocorrelation, multicollinearity, and heteroscedasticity, but we still have to approach our results with care and at least a little skepticism.