In the course of a normal day, we interact with learning machines several times. Any time you receive a product recommendation after making a purchase, apply for a loan, get a call from your credit card company about a suspicious charge, or check the weather forecast, you are participating in, and seeing the results of, machine learning. Unfortunately, machine learning has become an overused catch-all, just like big data, cloud computing, artificial intelligence, and gluten intolerance. So before we move forward, let's start with a definition. Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look.
Machine learning has become an essential tool for an increasing number of companies, thanks to advances that have made it a suitable solution to a wide variety of problems. Consider this often-cited figure from IBM: every day, 2.5 quintillion bytes of data are created, and 90% of the data in the world was created in the last two years. Humans are unable to process data at this scale. Clever humans, however, are able to program computers to process these huge quantities of data and make predictions from them. A particularly nice advantage of computers is that they have no preconceived notions of which variables should be important for your business. This lack of bias means every possible combination of variables can be considered before coming to a conclusion.
There are several types of machine learning, too many to cover here, but I want to give you an overview of how most machine learning algorithms work. First, the data is gathered and organized into a machine-readable format. For site selection, you can think of each site under consideration as a row in a giant spreadsheet and each variable (demographics, psychographics, traffic, etc.) as a column. So there will be tens of thousands of columns in our spreadsheet. We start with performance data for your existing locations or the locations of your closest competitors to build our training set. Our algorithm uses the training set to identify key relationships in the data and create new variables, or features, that are in turn used to build models and make predictions about your target locations.
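The spreadsheet analogy above can be sketched in a few lines of Python. This is a toy illustration, not our actual pipeline: the column names, values, and the derived feature are all hypothetical, and a real site-selection table would have tens of thousands of columns rather than four.

```python
import pandas as pd

# Hypothetical site data: each row is a candidate site, each column a variable.
sites = pd.DataFrame({
    "median_income":          [54000, 72000, 61000, 48000],
    "daily_traffic":          [12000,  8500, 15000,  9800],
    "competitors_within_1mi": [1,      0,     2,     1],
})

# Performance of existing locations becomes the training target.
annual_sales = pd.Series([1.2e6, 1.5e6, 1.1e6, 0.9e6], name="annual_sales")

# Feature engineering: derive a new variable from the raw columns,
# e.g. traffic available per competing store (an invented example).
sites["traffic_per_competitor"] = (
    sites["daily_traffic"] / (sites["competitors_within_1mi"] + 1)
)

print(sites)
```

A model is then fit to predict the target (here, `annual_sales`) from the rows of this table, and applied to the same columns computed for prospective sites.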
An important concern in machine learning is the problem of overfitting. Overfitting occurs when an algorithm attempts to explain every possible variation in the training set. For instance, if we were studying retail purchase patterns and had the time of each transaction, the general time of day would probably have an effect on a customer's purchase pattern. However, the exact time of each purchase, down to the second, has zero predictive value. My favorite example of overfitting is Washington's professional football team correctly predicting the outcome of every presidential election between 1940 and 2000. With so many variables to consider, spurious relationships like a football game's impact on a presidential election will appear, and you need to be ready to deal with them. One method for dealing with overfitting is cross-validation. We again start with our training set and randomly partition it into two pieces, a training set and a test set. The training set is used to build a model that is then evaluated on how well it predicts the test set. This process is repeated for a large number of random partitions to ensure that only the truly important features are given weight in the final model.
A final note I will leave you with is that even though we call it machine learning, a lot of critical human thinking has to take place to ensure that the proper techniques are applied to each problem. We humans come with our own built-in biases and subjective views, but with careful application of human and computer judgment we can achieve incredible results.