Avoiding the curse of dimensionality
You're a newly promoted data scientist with an important task: find what variables are important to include in your model.
Vexnomics Office - Tuesday Afternoon
As a newly promoted data scientist at Vexnomics, you're about to attend a meeting with your boss to find out your priorities.
Hey, I bet you're looking forward to getting your hands dirty with some more advanced modeling?\nI do actually have something for you\nDog Shed has come back to ask for a more in-depth modeling project, building on the work we've done before\nThey have a big list of variables they think are significant\nWe need to include the most important variables, but we don't have much data to work with so we're risking overfitting\nSo the job is to whittle that list down to produce a model that's a little more practical\nAre you up to the task?
The Curse of Dimensionality
Ideally we'd be able to include as many variables in our model as possible, because so many things may impact our sales. In practice, however, the data we have for modeling is always limited: we never have enough of it! This is where we run into the curse of dimensionality – every new variable we add to our model becomes a new dimension the model has to fit. If we add too many variables the model can simply memorize each observation and predict our data back to us perfectly. This is called overfitting, and it's a problem because the model won't do well predicting data it hasn't seen yet. It would give you good accuracy scores, but wouldn't generalize enough to be useful in future forecasts. In addition the model would suffer from multicollinearity, where multiple variables are correlated with each other, making it harder for the model to separate out the impact of any one variable.\nThe solution is to keep the model as simple as possible, and throw out any variables that don't show a significant impact. But how do you know which variables are important? This process, called 'feature selection', is both a science and an art. Feature selection is one of the fundamental problems in data science. It's tedious to do manually in Excel but also can't be fully automated, so a good analyst needs to understand the different techniques available and when to use them.
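To see overfitting in action, here's a minimal sketch on made-up data: only two columns truly drive the outcome, yet in-sample R squared keeps climbing as we pile on pure-noise variables, because ordinary least squares never fits the training data worse when you add columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 40  # a deliberately small dataset, as is typical in practice

# Only the first two columns actually drive y; the rest is pure noise
X_true = rng.normal(size=(n, 2))
y = 3 * X_true[:, 0] - 2 * X_true[:, 1] + rng.normal(scale=3.0, size=n)
X_noise = rng.normal(size=(n, 30))

in_sample_r2 = {}
for k in (0, 10, 30):
    # Nested design matrices: the real drivers plus k noise columns
    X = np.hstack([X_true, X_noise[:, :k]])
    r2 = LinearRegression().fit(X, y).score(X, y)
    in_sample_r2[2 + k] = r2
    print(f"{2 + k:2d} variables -> in-sample R^2 = {r2:.3f}")
```

The in-sample fit looks better and better, but none of that improvement is real: on fresh data the noise-padded models would do worse, not better.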
Everything should be made as simple as possible, but no simpler.
Hey as promised here's the cleaned data for Dog Shed\nWe have 146 observations here so as a rule of thumb we need to limit to 14 variables at a maximum, but ideally aim for as few as possible\nTry to build the model manually first to get a sense of what's important in the data, and then run a few feature selection algos before deciding on the final list\nLet me know how you get on
As a general rule of thumb you should have a minimum of 7-10 observations in the data for every variable you include in your model. This isn't a hard and fast rule, but it can help you avoid trouble with overfitting. It means that if you had a year's worth of weekly data, 52 observations, you'd want to stick to 5 variables maximum. You might find that more than five variables are significant, and even that model accuracy improves with a sixth, but the model will be simpler and more generalizable if you limit it to the five most significant variables.
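The rule of thumb above is just integer division. A tiny helper (the function name is ours, purely for illustration) makes both of this section's examples concrete:

```python
def max_variables(n_observations, obs_per_variable=10):
    """Conservative cap on model variables, using the 7-10 observations-per-variable rule of thumb."""
    return n_observations // obs_per_variable

print(max_variables(52))   # a year of weekly data -> 5 variables max
print(max_variables(146))  # the Dog Shed dataset -> 14 variables max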
Hand crafted artisanal models
Before you attempt to automate it, it's important you step through it manually first. Work through building a custom model by hand.
Vexnomics Office - Wednesday Morning
Time to dig into your feature selection task. First let's build a model with everything in it and see what variables look important!
Kitchen Sink Model
A 'kitchen sink model' takes its name from the phrase 'everything but the kitchen sink'. It refers to including all variables in a model to see which ones are statistically significant. It can be a good starting point, from which you can work through and drop the least significant variables. Create a model with the LINEST function, then calculate the p-values the same way the following template does. The p-values will tell you which variables were significant; you can then drop the insignificant ones and create a new model. As you drop variables, watch how it impacts the coefficients of the model, as well as the p-values and the overall accuracy. For example, dropping a variable might decrease R Squared, but the coefficients of the remaining variables might make more sense. In particular, if two variables are correlated, removing one might make the other more significant.
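The exercise itself is done in the spreadsheet template, but the same p-value calculation the template performs on LINEST output can be sketched in Python. This is a stand-in on made-up data (the column names are invented, not from the Dog Shed file): fit ordinary least squares, then derive two-sided p-values from the t distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 146
tv = rng.normal(size=n)
search = rng.normal(size=n)
weather = rng.normal(size=n)
sales = 5 + 2 * tv + rng.normal(size=n)  # only TV truly drives sales here

# Design matrix with an intercept column, like LINEST with const=TRUE
X = np.column_stack([np.ones(n), tv, search, weather])
names = ["const", "tv_spend", "search_spend", "weather"]

# Ordinary least squares: coefficients, residuals, standard errors
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
resid = sales - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Two-sided p-values from the t distribution, as the spreadsheet template does
t_stat = beta / se
p_values = 2 * stats.t.sf(np.abs(t_stat), dof)
for name, p in zip(names, p_values):
    print(f"{name:>13s}: p = {p:.4f}")
```

Running this, the genuinely predictive variable comes out with a tiny p-value while the noise columns typically do not clear the 0.05 bar, which is exactly the pattern you're looking for in the template.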
Take a look at the template and how it works, then create a 'kitchen sink' model with every variable: which variables were statistically significant? (p-value less than 0.05)
What was the least statistically significant variable in the model? (apart from the constant)
Copy your work to a new tab, and create a new model but remove the least significant variable: how did the R squared of the model change?
In another new tab, copy the template to make a new model, but this time remove all of the variables that were insignificant in your last model. What variables are left?
Automated feature selection methods
Explore Filters, Wrappers and Embedded methods to find the best way to automate feature selection for your model.
Vexnomics Office - Thursday Afternoon
Now that you've had a few rounds of manual selection, you see the benefit of automating some of the work. This is where it makes sense to break out into coding...
There are three main types of automated feature selection: filter methods, wrapper methods and embedded methods. Filter methods are the cheapest computationally, as they rely on simple calculations and rules, and don't take the whole model or other variables into account. Wrapper methods rebuild the model with multiple combinations of variables until they find the mix that maximizes accuracy or removes insignificant variables. Embedded methods are part of the regression algorithms themselves, and often use feature importance to determine which variables are selected.
Google Colab is a hosted version of Jupyter Notebooks, a way of running code cell by cell that's popular with data scientists. Using a notebook helps you work through the model step by step while showing your work. You can share Colab notebooks just like any Google Doc, and you don't need to set up a coding environment on your computer. If you're unfamiliar with Colab or Python, we recommend taking some time to learn how they work before continuing this simulator, which assumes some Python coding knowledge.
If a variable doesn't have much variance (it doesn't change much), it's not likely to be very informative in the model. This is because regression works best when it can map spikes and dips to events and actions, in order to estimate which variables were most responsible for performance. One simple filter method is to remove variables that have low variance.\nAnother filter method is univariate analysis: regressing each variable individually against sales gives you an idea of how much variance it can explain on its own. This isn't a perfect method because it only considers one variable at a time (without accounting for interactions between variables), but it is computationally simple and easier to interpret than multi-dimensional analysis.\nRead the first two sections of the Scikit Learn documentation (link below) on 'Removing features with low variance' and 'Univariate feature selection', then use these examples to perform variance and univariate feature selection on your data.
Make sure you drop the 'Week' and 'Sales' variables from the dataframe before doing the variance selection. Use the same threshold they use in the Scikit Learn documentation. Rather than using `fit_transform`, try using `fit`, and then `get_support()` on the `sel` variable to get a list of booleans you can use to filter the dataframe. For example: `sel.fit(X)` and then `X.loc[:,sel.get_support()].columns`. The same approach can be applied to univariate selection to get the selected columns.
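Putting the hint above together, here is a rough sketch of both filter methods on stand-in data (the column names are invented for illustration; the real exercise uses the Dog Shed file). It uses the Bernoulli-style threshold from the scikit-learn docs, and `get_support()` rather than `fit_transform` so we keep the column names.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression

# Stand-in for the cleaned Dog Shed data
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "tv_spend": rng.normal(size=146),
    "promo_flag": np.r_[np.ones(140), np.zeros(6)],  # barely varies
    "search_spend": rng.normal(size=146),
})
df["sales"] = 2 * df["tv_spend"] + rng.normal(size=146)

X = df.drop(columns=["sales"])  # drop the target (and Week, in the real data) first

# Filter 1: drop near-constant columns, using the docs' 0.8 * (1 - 0.8) threshold
sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
sel.fit(X)
kept_by_variance = X.loc[:, sel.get_support()].columns.tolist()

# Filter 2: univariate F-test of each column against sales, keeping the top 2
kbest = SelectKBest(f_regression, k=2).fit(X, df["sales"])
kept_by_univariate = X.loc[:, kbest.get_support()].columns.tolist()

print("Variance threshold kept:", kept_by_variance)
print("Univariate selection kept:", kept_by_univariate)
```

The near-constant flag gets filtered out by the variance threshold, and the univariate F-test ranks the column that actually drives sales at the top.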
What variables did the variance threshold select?
What variables did the univariate selection choose?
Can you foresee any downsides or issues with using these filter methods?
One benefit of code is that it's trivial to run many different models with any combination of variables and see what works best. In Excel or GSheets this would take a long time, but in a Colab notebook you just need to create a loop and define the logic you want to use to select the best model.\nOne common approach is backwards feature elimination (BFE), where you start with a 'kitchen sink' model (i.e. all variables), then work your way backwards, dropping the least significant variable until all remaining variables are statistically significant and/or you're below your threshold for the number of variables.\nAnother wrapper method you could use is progressive feature enhancement (PFE). This works in the opposite direction: start with no variables and iterate through each option, choosing whichever one increases accuracy the most. Keep adding variables until you hit your quota or the next variable is insignificant.\nIn this exercise we won't be asking you to code these solutions; instead you should run our implementations and see how they work. We'll ask you how you might modify the logic to improve accuracy, so be sure to play around with the code until you understand roughly what's happening. Feel free to make changes and see how they affect which variables are selected.
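As a reading aid for the notebook, the BFE loop described above can be sketched roughly like this. This is a simplified stand-in for the notebook's implementation, run on made-up data with invented variable names: fit the full model, find the least significant variable, drop it, and repeat until everything left is significant.

```python
import numpy as np
from scipy import stats

def backwards_elimination(X, y, names, alpha=0.05):
    """Drop the least significant variable until every remaining one has p < alpha."""
    names = list(names)
    while names:
        # Refit OLS with an intercept plus the surviving variables
        design = np.column_stack([np.ones(len(y))] + [X[n] for n in names])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        dof = len(y) - design.shape[1]
        se = np.sqrt(np.diag((resid @ resid / dof) * np.linalg.inv(design.T @ design)))
        p = 2 * stats.t.sf(np.abs(beta / se), dof)[1:]  # skip the constant
        worst = int(np.argmax(p))
        if p[worst] < alpha:
            break  # everything left is significant
        names.pop(worst)
    return names

rng = np.random.default_rng(3)
n = 146
X = {"tv": rng.normal(size=n), "radio": rng.normal(size=n), "noise": rng.normal(size=n)}
y = 2 * X["tv"] + 1.5 * X["radio"] + rng.normal(size=n)
selected = backwards_elimination(X, y, ["tv", "radio", "noise"])
print("Selected:", selected)
```

PFE would invert the loop: start with an empty list and, on each pass, add whichever remaining candidate improves the model most.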
How does the backwards feature elimination algorithm work, in your own words?
How do the variables selected differ when running BFE versus PFE?
How would you modify the PFE code to select better variables?
The final and most complex category of feature selection algorithms is embedded methods. This is where the feature selection is built into the model algorithm itself. To use them, you typically need to actually build a model, then extract the feature importances in some way. These methods can be more computationally expensive because they're building models, but with a modern computer and a few hundred observations they typically run in seconds.\nOne popular regression algorithm is Ridge Regression, which can be seen as a more modern alternative to standard linear regression. It's a type of regularization algorithm which penalizes the squared size of the model's coefficients, which in practice decreases the importance of some variables in the model. This helps us fight the curse of dimensionality by decreasing the chance of overfitting, particularly if we then extract only the most important features afterwards. It's in the same family as other regularization algorithms such as LASSO and Elastic-Net, which could also be used.\nThe other algorithm we're going to look at is Random Forest. This is a type of decision-tree algorithm which deals well with non-linear variables, for example media spend (which usually has a weaker incremental effect as spend increases). Random Forest uses 'bagging', which trains a number of strong learners (unconstrained models) in parallel, then averages their predictions. The other option would be boosted trees, which train a series of weak (constrained) decision trees and then combine their predictions. Random Forests tend to help with overfitting, while boosting is good for generating flexible models.\nRead the article (below) by Elite Data Science for more information on algorithm selection. We have already provided you with the templated code for this exercise, so just run it to answer the final questions.
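The templated notebook is what you should run for the exercise, but the core idea of both embedded methods can be sketched in a few lines on made-up data (variable names invented): rank variables by the absolute size of Ridge coefficients, and by Random Forest's impurity-based `feature_importances_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
n = 146
X = rng.normal(size=(n, 4))
names = ["tv", "search", "ooh", "noise"]
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=n)  # ooh and noise are irrelevant

# Embedded method 1: Ridge shrinks coefficients; rank by absolute size
ridge = Ridge(alpha=1.0).fit(X, y)
ridge_rank = [names[i] for i in np.argsort(-np.abs(ridge.coef_))]

# Embedded method 2: Random Forest exposes impurity-based feature importances
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
forest_rank = [names[i] for i in np.argsort(-forest.feature_importances_)]

print("Ridge ranking: ", ridge_rank)
print("Forest ranking:", forest_rank)
```

Note the caveat when comparing the two: Ridge coefficients are only comparable across variables if the inputs are on similar scales (standardize first in real data), while forest importances are scale-free but can be biased towards high-cardinality variables.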
How similar are the variables selected by Ridge vs Random Forests? What variables stand out as unusual to you?
What algorithm selection method gave the most similar results in terms of variables selected to your manual model?
What 2 variables have consistently shown up as important across most algorithms?
Thanks\nI appreciate the work that went into this, it gave me plenty to choose from.\nAs you can see the automated methods are useful, but there's no substitute for common sense!