When the client questions the model
A key client has questioned the accuracy of your model – how do you prove that your model is valid and can be trusted when making decisions?
Vexnomics Office – Monday
You get into work and you have an email from your boss. She's asking you to take a look at a model you built, because the client had concerns about its accuracy...
Goolybib Model - Questions from Client
Hi, I was talking to Goolybib's CEO and he had some questions about the accuracy of the model.
I reminded him that was the best we could do on a short turnaround, but now we've got some breathing room can you take a look?
Take a look at the key statistical tests and let me know how robust it is, so I can decide how much time to allocate to V2.
Appreciate it,
All models are wrong in some way, so before you make decisions using a model it's important to know what flaws it might have, and whether you can live with them. Model accuracy is a surprisingly deep topic, because there's never really a 'ground truth': the user behavior that generated our sales is so complex that we can only estimate it. Every attribution model is a simplification of reality, so it's our job to find ways to build confidence in the model and determine how much uncertainty we can cope with, relative to the amount of time we have to spend on making the model better. Statisticians have developed multiple tests and metrics you can use to estimate model accuracy, but they can be confusing to anyone without a stats background. That said, it's possible to quickly build a working knowledge of how to use these tests to tell whether you're on solid ground or need to dig a little deeper.
All models are wrong, but some are useful
Model accuracy becomes an easier topic to reason about if you start with the assumption that all attribution methods are flawed in some way. People unfamiliar with marketing mix modeling might point out flaws to reject the method, while conveniently ignoring the fact that their last-click attribution model is obviously flawed in other ways. The gold standard of attribution is to run an A/B test, but that isn't always possible and takes a lot of planning and implementation time.
So there is always a tradeoff between accuracy and the cost of information gain: for large, important, irreversible decisions it's worth spending more money and time on a better model, but for most decisions a 'good enough' model will be all you need. Your job becomes understanding the flaws of the model, quantifying the uncertainty, and presenting the tradeoff to decision makers.
The four assumptions of regression
Linear regression is only reliable as a method if certain conditions are met. What assumptions are we making with this model?
It's important to understand what assumptions we're making when we use a linear regression model, the algorithm we use in marketing mix modeling.
There are multiple assumptions that go into a linear regression model, and failing any one of them could indicate your model has a problem. We're going to go through the four major assumptions: Linearity, Independence, Homoskedasticity, and Normality. In each case we're only giving a brief overview, so it can be helpful to do your own reading on the topic.
Each of these assumptions has a lot of statistics behind it, so they're deeper topics than we can cover in this chapter. If you have extra time, it can be worth searching for more information on each topic and building some additional understanding before attempting the exercises. In our case we've simplified things to four assumptions, but you'll find many articles list 5-8+ assumptions as they break them out in more detail, and all have different approaches to testing for them. Don't be intimidated by this, and feel free to skip ahead to the next chapter if you want to get your hands dirty.
Linear regression needs the relationship between the independent and dependent variables to be linear: if you drew a scatter plot of each feature in your model against sales, you should be able to draw a straight line through the dots. If a variable's relationship with sales gets stronger or weaker at scale, that's an example of a non-linear relationship: you'd need to apply a variable transformation to make it linear if you want to use linear regression.
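As an illustration of what a transformation can do (the data below is synthetic and the numbers are hypothetical), a logarithmic relationship with diminishing returns becomes perfectly linear once you model log(spend) instead of raw spend:

```python
import numpy as np

# Hypothetical example: spend with diminishing returns on sales.
rng = np.random.default_rng(42)
spend = rng.uniform(100, 10_000, size=365)
sales = 500 * np.log1p(spend)  # non-linear (logarithmic) relationship

# Correlation with raw spend is weaker than with the transformed variable,
# because the underlying relationship is linear in log(spend), not spend.
corr_raw = np.corrcoef(spend, sales)[0, 1]
corr_log = np.corrcoef(np.log1p(spend), sales)[0, 1]
```

After the `log1p` transform the scatter plot would be a straight line, which is exactly what linear regression needs.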
Marketing mix modeling depends on the variables that drive sales being correlated with sales, but uncorrelated with every other variable. The bigger the differences day to day and between variables, the better the model will be at estimating the impact of each variable. If two or more variables are correlated with each other, that's called multicollinearity. This isn't a strict requirement of linear regression, so you'll probably still get good predictive accuracy, but it can lead to unusual results as the explanatory power of one variable gets mixed up with another. For example, it can lead to a model that says marketing drives negative sales, or a variable looking statistically insignificant when actually it's very important.
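A quick way to spot candidate multicollinearity before running any formal test is a pairwise correlation matrix. A sketch on synthetic data, where two channels' budgets move together (the channel names here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical channels: TV and YouTube budgets move together (shared
# planning), a common source of multicollinearity in marketing mix models.
rng = np.random.default_rng(0)
tv = rng.uniform(0, 1000, size=200)
youtube = 0.8 * tv + rng.normal(0, 50, size=200)  # nearly a function of TV
facebook = rng.uniform(0, 1000, size=200)         # independent channel

df = pd.DataFrame({"tv": tv, "youtube": youtube, "facebook": facebook})
corr = df.corr()

# A pairwise correlation near 1 between two spend columns is a warning sign.
tv_youtube_corr = corr.loc["tv", "youtube"]
```

A correlation matrix only catches pairwise relationships; the VIF test covered later in this chapter also catches a variable that's a combination of several others.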
This means that the errors in your model have equal or almost equal variance across the regression line. They should look like random dots on the chart - if you see patterns in the errors (for example if they fan out over time) that's an indication of Heteroskedasticity. Seeing patterns in the errors is an indication there's something missing from your model that is generating those patterns. There are many tests for heteroskedasticity, but you can also just plot the errors against predicted values and see visually if there are any patterns.
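Here's a minimal sketch of that visual check, on synthetic data built to be heteroskedastic. Instead of eyeballing a chart, it measures the residual spread in the two halves of the fitted values (in practice you'd also plot `resid` against `fitted`):

```python
import numpy as np

# Hypothetical data where the error variance grows with the predictor:
# the classic 'fan' shape you'd see in a residuals-vs-fitted plot.
rng = np.random.default_rng(1)
x = np.linspace(1, 100, 300)
y = 3 * x + rng.normal(0, x)  # noise standard deviation scales with x

# Fit a simple linear regression and compute the errors (residuals)
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
resid = y - fitted

# Residual spread in the top half of fitted values is far larger than in
# the bottom half, which is a sign of heteroskedasticity.
spread_low = resid[fitted < np.median(fitted)].std()
spread_high = resid[fitted >= np.median(fitted)].std()
```

With homoskedastic errors the two spreads would be roughly equal; here the fanning out shows up as a much larger spread in the upper half.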
For the model to be reliable, you should also check if the errors are normally distributed. This is a desirable property because if they aren't normally distributed, it can impact the reliability of some of the metrics we rely on in the model, like standard errors, p-values and confidence intervals. This assumption isn't a strict requirement: you can still use Linear Regression if your errors aren't normally distributed, but it's just something to watch out for. It can be an indication that the model is misspecified in some way, or that you don't have enough data.
Why might marketing spend have a non-linear relationship with sales?
Is there anything about marketing mix modeling specifically that would make it more susceptible to multicollinearity?
What is heteroskedasticity in your model's errors most likely to indicate?
If you see that your errors aren't normally distributed, what metrics can't you trust?
Which two of these four common issues are most worrying if you see them in a model?
There are many ways to test for these issues, and when you find them in your model, you should carefully consider the potential causes. However none of these tests have a veto right: the most important thing is making sure your model is as useful as possible within the time frame you have to devote to it. Statistical tests can inform your opinion of whether the model has issues, but they can't make the decision for you, nor can they tell you what to do about those issues.
Does this model make sense?
Marketing mix modeling requires common sense. Review your model's metrics to see if they match your expectations.
Now we know what assumptions we're looking to uphold, we can test for them. But first, let's do a sense check of the model. What story is it telling us?
Thanks for taking a look at this
We know the R2 is decent and all the variables are statistically significant
But do the coefficients make sense?
Also take a look at the margin of error
Most people have heard of R-squared (R2) and know that you need a p-value of less than 0.05 to reach statistical significance. However, one common mistake is to present a model to a client where the coefficients don't make sense. For example, our model says that we make $2,453 less on Sundays: someone who knows the business could confirm that sounds right, or tell you that actually Sunday is their best day of the week. Modeling is both an art and a science, and there's no use in confidently presenting a model that goes against common sense.
The other thing experienced analysts do is look at the margin of error. Facebook might return $4.2 in revenue on average, but the gap between $3.2 and $5.2 (the confidence interval) is quite wide: at one end of the spectrum the client might hit their bonus, at the other end they might lose their job! It's also a good idea to compare our results with other attribution methods: what does Facebook report in terms of return on ad spend? How does that compare? Knowing whether it's substantially higher or lower can give us more or less confidence in our model. If we get a surprising result, it might still be correct, but we know we need to dig further and give more justification before we make decisions from the model.
The statistics in the model output are often shortened using scientific notation. So for example 11,090 would be shortened to 1.109e+04, and 0.000000000000147 would be shortened to 1.47e-13. This is to save space, and although you might not be used to seeing it, you can easily convert it back to a normal number: move the decimal point as many places as the number following the e - to the right if it's positive, to the left if it's negative. So 1.109e+04 is 1.109 with the decimal point moved four places right (11,090), and 1.47e-13 is 1.47 with the decimal point moved thirteen places left.
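Python itself reads and writes this notation, so you can sanity-check a conversion directly:

```python
# 'e+04' means "move the decimal point four places right",
# 'e-13' means "move it thirteen places left".
coef = float("1.109e+04")     # 11090.0
p_value = float("1.47e-13")   # 0.000000000000147

# Going the other way: format a plain number in scientific notation
shortened = f"{11090:.3e}"    # '1.109e+04'
```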
Looking at the standard error relative to the coefficients, what variable is the model least certain about?
Facebook is reporting a return on ad spend of 2.48 compared to the 4.23 coefficient we got for our model. How would you interpret this result?
Assuming that the client wouldn't willingly lose money advertising on TV, why might a coefficient of 0.39 make sense in our model?
Look at the charts showing actuals versus predicted values, and the error. Is there any pattern in the errors? What does this indicate?
Interpreting the standard statistical tests
F-Statistic, Skewness, Durbin Watson... what do these terms mean? Let's walk through each statistical test in the standard model output.
The model output contains a lot more than coefficients and p-values. There are a lot of terms that you probably don't recognize, words like Skew, Kurtosis and tests named after statisticians like Durbin-Watson and Jarque-Bera. Let's learn more about them.
There's a lot of complex information contained in the standard model output, but you don't need to look at or understand everything. We only need a working knowledge of what's important, to use as a jumping-off point to dig deeper into anything that looks wrong.
Looking at the top right, you'll see the F-statistic and Prob (F-statistic): you can think of the latter like a p-value, but for the whole model. It indicates whether our variables are jointly significant when put together.
The bottom panel is mostly about our assumption of Normality: a normal distribution makes a bell shape, and if Skew were zero and Kurtosis three (i.e. excess kurtosis of zero), the errors would match that shape perfectly. If Skew is higher or lower, the bell is shifted to one side, and if Kurtosis is too high or too low, the curve is too spiky or too flat to be a smooth bell. As a rough rule of thumb, skew and excess kurtosis between -2 and +2 are acceptable, but you're best off using the Omnibus and Jarque-Bera tests, which combine the effects of Kurtosis and Skew. Prob(Omnibus) and Prob(JB) need to be above 0.05: if either is below that level it indicates non-normal errors. This doesn't necessarily mean we have a bad model, just that we can't fully trust our standard errors and confidence intervals.
Durbin-Watson is testing for something different: autocorrelation. We hope for a value between 1.5 and 2.5; above or below that would indicate that each datapoint is correlated with the day before, a common problem with time series data and a potential indicator that we're missing variables. Finally we have the condition number, which tests for multicollinearity. We want it to be below 30, otherwise this is a sign that we have too many redundant variables in our model.
Based on the Prob (F-Statistic) our model is statistically significant.
Which of these tests did we fail (i.e. got an outcome that indicates an issue with our model)?
Which metric likely caused us to fail the Jarque-Bera and Omnibus tests?
Can you explain why we might have a large condition number?
Running your own statistical tests
There are many more tests available outside of the standard output. Let's implement three of the more popular ones and learn what they test for.
Vexnomics Office – Tuesday
You sent your findings over to your boss, and she responds asking you to run a few more statistical tests.
K, thanks it looks like we might have issues with the model
Let's do a little more testing first
I'm concerned about multicollinearity so can you do a VIF test
We should also test for heteroskedasticity with Breusch-Pagan and draw a Q-Q plot to visually see the issue we have with normality of errors.
The statsmodels library we use to create our marketing mix model also has a number of statistical tests we can run, for example Breusch-Pagan and the variance inflation factor (VIF). We can also use other libraries like SciPy and scikit-learn to import other tests we want to run. Whatever you need to test for, it's likely someone has packaged up the necessary code in an open-source library: these statistics libraries are one of the reasons Python is such a useful language to learn for data science!
There are a number of statistical tests you can run, but there are three we need to focus on. The variance inflation factor, or VIF, helps us identify multicollinearity: variables that are correlated with other variables in the dataset. We're looking for a value below 10 for each variable; values above 100 indicate definite multicollinearity. Breusch-Pagan is a test that lets us identify heteroskedasticity, which is where we see patterns in the errors; this tells us if we're likely to have a variable missing from our model. Finally we need to create a Q-Q plot, which will let us see visually the non-normal errors we identified with the Jarque-Bera and Omnibus tests. The code is already written for you, and the following exercises will test your understanding of the results.
Which variable is causing multicollinearity issues?
What does the result of the Breusch Pagan test indicate?
Does the Q-Q plot indicate non-normal errors? Explain how.
These three tests indicate the three types of pattern you're likely to encounter when running statistical tests to validate your model. You usually either run a test on the variables like we did with VIF, run a series of tests using a statistical package like we did with Breusch Pagan, or plot a chart and inspect the data visually, as with the Q-Q plot. With these three patterns you should be able to write the code to create any type of test.
Predicting data we haven't seen before
Our model might be valid, but is it useful? The only way we can know for sure, is if we can use it to accurately predict future values.
Vexnomics Office – Wednesday
Now that you've completed the battery of statistical tests, you understand where your model is valid or might have issues. However, that doesn't tell you a lot about its usefulness! Time to test its predictions...
Excellent work on the statistical tests
the multicollinearity issue seems to be easy to solve: we can just remove the constant
That might straighten out some of the other errors too - but let me deal with that
Can you test the accuracy for me?
NRMSE is preferred, but it's up to you if you want to do train-test split or cross-validation
Models are only useful if they empower you to take action. To give you confidence in the data, your model needs a high degree of accuracy in predicting future values... but you can't wait for the future to happen - that wouldn't be very useful! So we do the next best thing, and hold back some of our dataset. This 'test' dataset is reserved, and your model doesn't get to see it in the training phase. Only once you're finished building the model do you use the test dataset as your 'future values', to see how the model generalizes to data it hasn't seen before. In this way you can measure the real accuracy of your model.
Taking this concept further, you can also run a process called 'cross-validation'. This does the train-test split we just covered, but multiple times across the dataset. For example, if you split the data three ways, you take two of the segments as your training data and test on the third. Then you repeat this two more times, rotating the segments until each has been the test set. Once you're done you can aggregate your accuracy across these tests and get a truer picture of the real accuracy of the model.
But what do we mean by accuracy? We can continue to use the R-squared value, and that's a decent default, but it can be useful to have a measure of accuracy in the units of the variable we're predicting. Mean Squared Error (MSE) takes the error (in our case, sales minus predicted sales), squares each value, then finds the mean, which penalizes larger errors more. The root mean squared error (RMSE) is just the square root of the MSE, which reverses the squaring and turns it back into a number we understand. For example, an RMSE of $1,000 means we were off by an average of $1,000 per day in our model.
We can normalize that figure by dividing it by the mean of our daily sales, which gives us the percentage we were off by per day.
In this chapter we've already provided the code for the normalized root mean square error (NRMSE), train-test split and cross-validation, so the questions are more about your understanding of the concepts.
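As a sketch of what that provided code typically looks like with scikit-learn (the dataset below is synthetic and stands in for the chapter's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Hypothetical dataset: two spend channels driving daily sales over a year.
rng = np.random.default_rng(5)
X = rng.uniform(0, 1000, size=(365, 2))
sales = 50 + 2 * X[:, 0] + 4 * X[:, 1] + rng.normal(0, 100, size=365)

# Hold back 20% of the data as unseen 'future values'
X_train, X_test, y_train, y_test = train_test_split(
    X, sales, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(np.mean((y_test - pred) ** 2))
nrmse = rmse / y_test.mean()  # normalized: the fraction we're off by per day

# Cross-validation: repeat the split across 5 folds and aggregate
cv_rmse = -cross_val_score(LinearRegression(), X, sales,
                           scoring="neg_root_mean_squared_error", cv=5)
cv_nrmse = cv_rmse.mean() / sales.mean()
```

Because the synthetic noise is small relative to daily sales, both the single-split NRMSE and the cross-validated NRMSE come out as a low percentage here.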
What happened to NRMSE under cross validation?
Can you think of a reason why we might not want to use cross-validation?
What problems are there with using the standard R-squared value as a measure of accuracy?
That's good to see: the NRMSE is low at 7%, which gives me confidence
We have to deal with multicollinearity, but hopefully removing the constant solves that. If you look at the chart, I can see the errors depart around the time TV starts, so I think we're missing some variable or interaction effect there which is giving us the non-linear errors.
We can also try transforming the facebook and tv variables to see if there are diminishing returns and/or a lagged adstock effect on brand
I'm also going to recommend to the client that we scale Facebook spend up and down to manufacture some variance and see if we can get at a more realistic coefficient
If all that fails and we can't think of any variables we're missing, it'll be a case of waiting for more observations, or doing something more elaborate like a geo-test
anyway, that's all I need from you for now - thanks again!
There are multiple ways to improve a model that has flaws, and they all revolve around either adding, removing or transforming data. If you have multicollinearity, you can remove redundant variables. If you see non-linearity or non-normality of errors, you can apply a transformation to your non-linear variables. If you see heteroskedasticity, you might be missing an important variable you need to add to the model. It's also often useful to gather more data, particularly when you have some control over operations: for example, you can turn spend on or off in different regions to run a geo-test, or flex your spend up and down to generate more variance in the data. It's also valid to do nothing! Sometimes the cost of improving a model's accuracy isn't worth the decision you're trying to make - so for your end-of-year attribution analysis you want a robust model, but for a quick check to see if a small new campaign is incremental, an afternoon in Excel and 80% accuracy might be enough.
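The two transformations mentioned in this chapter, adstock and diminishing returns, can be sketched in a few lines. The decay value and the log form below are illustrative choices, not the only options:

```python
import numpy as np

def adstock(spend, decay=0.5):
    """Carry a fraction of each day's effect over into following days.
    decay=0.5 means half of yesterday's effect persists today."""
    out = np.zeros_like(spend, dtype=float)
    carry = 0.0
    for i, s in enumerate(spend):
        carry = s + decay * carry
        out[i] = carry
    return out

def saturate(spend):
    """Log transform to model diminishing returns at higher spend."""
    return np.log1p(spend)

# A one-day burst of spend decays over the following days: 100 -> 50 -> 25
burst = np.array([0.0, 100.0, 0.0, 0.0])
transformed = adstock(burst, decay=0.5)
```

You would apply these transforms to the facebook and tv columns before refitting, then re-run the statistical tests to see whether the non-normal errors improve.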