As site selection grew into a more rigorous process, regression analysis emerged as a powerful and commonly used tool. It is a well-established technique for finding the relationship between a set of variables. By keeping track of current store sales performance, retailers could begin to predict how a store might perform in a new area by comparing key characteristics of that area to those of existing stores.
If you are not familiar with regression, this series of videos by Brandon Foltz is a good place to start. If you need an algebra refresher, Khan Academy has a great lesson plan that covers everything you need to know. There are many different forms of regression analysis, but we are going to focus on simple linear regression for the purposes of this article. Simple linear regression is powerful, but still easy to understand. Today, we will work through a motivating example. In a subsequent post, we will cover interpreting our results and the limitations of regression analysis.
The type of question that regression can help us answer is “What impact, if any, does median home value in my trade area have on my monthly sales?” For this example, monthly sales will be the dependent variable (because it depends, in some part, on median home values) and median home value will be the independent variable. By convention, we use “X” to denote the independent variable and “Y” to denote the dependent variable. Our goal here is to calculate the equation for a straight line of the form y = ax + b (where a is the slope of the line and b is the y-intercept) that “best fits” this data and will help us predict monthly sales for other median home values. To get started, we first need to collect sales figures and area home values for a few locations. I have created some numbers below for our example. The home values are expressed in units of $100,000, and so are the monthly sales, so you would read the first pair of figures as a median home value of $700,000 and $1.1 million in monthly sales.
Median Home Value (x) | Monthly Sales (y)
7                     | 11
4                     | 3
6                     | 5
3                     | 4
5                     | 7
The process we will be using to calculate the best fit line is called ordinary least squares. If you would like to know why, see here; the short answer is that it makes our lives a lot easier mathematically. Next, we need to find the mean (M), or average, of each set of figures. For median home value we have M(x) = (7+4+6+3+5)/5 = 5 and for monthly sales we have M(y) = (11+3+5+4+7)/5 = 6. We now need to calculate the deviation from the mean for each value of x and y. I added some additional columns that we will need later.
Median Home Value (x) | Monthly Sales (y) | x - M(x) | y - M(y) | (x - M(x))*(y - M(y)) | (x - M(x))^2
7                     | 11                | 2        | 5        | 10                    | 4
4                     | 3                 | -1       | -3       | 3                     | 1
6                     | 5                 | 1        | -1       | -1                    | 1
3                     | 4                 | -2       | -2       | 4                     | 4
5                     | 7                 | 0        | 1        | 0                     | 0
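If you would rather not fill in the table by hand, the same columns can be reproduced with a few lines of Python. This is only a minimal sketch using the made-up example figures from this article; the variable names (`home_values`, `monthly_sales`, `rows`) are my own and not part of any library.

```python
# Example data from the article, both in units of $100,000.
home_values = [7, 4, 6, 3, 5]      # x: median home value
monthly_sales = [11, 3, 5, 4, 7]   # y: monthly sales


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


mean_x = mean(home_values)
mean_y = mean(monthly_sales)

# Build one row per store: x, y, x - M(x), y - M(y),
# the product of the two deviations, and the squared x deviation.
rows = []
for x, y in zip(home_values, monthly_sales):
    dx = x - mean_x
    dy = y - mean_y
    rows.append((x, y, dx, dy, dx * dy, dx ** 2))

print("M(x) =", mean_x, " M(y) =", mean_y)
for row in rows:
    print(row)
```

Each printed row should match the corresponding row of the table, which is a handy check if you swap in your own store data.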
We now need to calculate the sums of the last two columns, that is Σ(x - M(x))(y - M(y)) = 10 + 3 - 1 + 4 + 0 = 16 and Σ(x - M(x))^2 = 4 + 1 + 1 + 4 + 0 = 10. The slope of our best fit line is the ratio of these two sums, a = 16/10 = 1.6, and the y-intercept is b = M(y) - a*M(x) = 6 - 1.6*5 = -2. So the equation for our line becomes:
y = 1.6x - 2 or, equivalently, monthly sales = 1.6*(median home value) - 2
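The whole calculation, from raw figures to slope and intercept, can be sketched as a single small function. The function name `fit_line` is hypothetical and chosen for this illustration; it follows the ordinary least squares steps described above for one independent variable and uses only the Python standard library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = a*x + b."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized lists with at least 2 points")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of products of deviations, and sum of squared x deviations.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    if sum_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Example data from the article, in units of $100,000.
slope, intercept = fit_line([7, 4, 6, 3, 5], [11, 3, 5, 4, 7])
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The guard against identical x values matters in practice: if every store sits in an area with the same median home value, there is no variation to fit a slope to.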
So, we would predict that a $100,000 increase in median home value in your trade area goes along with a $160,000 increase in monthly sales. We will cover the interpretation and limitations of this result in depth in a following post.
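To make the prediction step concrete, here is a short sketch that plugs a new median home value into the fitted line. The $550,000 trade area is a hypothetical value I picked for illustration (it sits inside the range of the example data); the function and constant names are likewise my own.

```python
UNIT = 100_000     # the article expresses both variables in units of $100,000
SLOPE = 1.6        # a, from the worked example
INTERCEPT = -2.0   # b, from the worked example


def predict_monthly_sales(median_home_value_dollars):
    """Predict monthly sales in dollars from a median home value in dollars."""
    x = median_home_value_dollars / UNIT   # convert to article units
    y = SLOPE * x + INTERCEPT              # y = 1.6x - 2
    return y * UNIT                        # convert back to dollars


# Hypothetical new trade area with a $550,000 median home value.
prediction = predict_monthly_sales(550_000)
print(f"Predicted monthly sales: ${prediction:,.0f}")

# The slope is the predicted change in sales per $100,000 of home value.
increase = predict_monthly_sales(600_000) - predict_monthly_sales(500_000)
print(f"Predicted change per $100,000: ${increase:,.0f}")
```

Keeping the unit conversion inside the function avoids the easy mistake of feeding raw dollar amounts into an equation that was fitted in units of $100,000.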