" Finding numbers for Greek letters
Building a marketing mix model is not just about finding the right parameters, but also the hyperparameters used to transform your data to account for adstocks and saturation.
Vexnomics Office – Thursday Afternoon
You've been looking for an excuse to learn how to use evolutionary algorithms for building models automatically, and now this might be your chance. You hear on the grapevine that a colleague is struggling with their model.
Yes you heard correctly
We're struggling with the GoolyBib model for Italy
I'm not satisfied with the model as it stands – the error is over 10%!
This could be a good use case for automated hyperparameter optimization
Doing it manually is getting us nowhere fast!
I'd appreciate your help
Every model has parameters: these are the variables used to make predictions once the model is built. For example, your model might say that Facebook Ads drives a cost per purchase of $5. How the model arrives at that conclusion depends on hyperparameters: parameters that control how the model learns the right answer. To take our Facebook Ads example, our cost per purchase might get worse at higher spend levels due to saturation, or there might be a carryover effect where spend today affects performance tomorrow.

These transformations of the data are key to building an accurate model, but it's hard to know ahead of time what values are correct. The manual way of figuring this out is to choose a value for a parameter, for example an adstock level of 20% (20% of your spend's impact today carries over to tomorrow), and then move that value up or down to see how it affects model accuracy. If you increase it to 30% and the model's accuracy goes up, you know that's a better fit. The problem is that each parameter affects the behavior of every other parameter in the model, so it can be hard to find the right combination.

Several strategies have emerged for this task of hyperparameter optimization. The brute-force approach is a grid search: try every possible value until you find the right one. This isn't always feasible, because there may be more potential combinations than there are stars in the universe! Random search lets you set a limit on how long to look for the right values, guessing randomly within a budget. The method we're using today is an evolutionary approach, which learns more efficiently than random guessing. It deploys a population of potential solutions, kills off the worst ones (those with the highest error) and replaces them by mutating the best performers in the batch. With hundreds or thousands of iterations we can efficiently arrive at the best estimate for even a very complex set of parameters.
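The difference between grid search and random search can be sketched in a few lines of plain Python. Here the error function and its optimum at 0.37 are made up purely for illustration:

```python
import random

# Toy stand-in for "build the model and measure its error" at a given
# hyperparameter value. The optimum at 0.37 is invented for illustration.
def model_error(theta: float) -> float:
    return (theta - 0.37) ** 2

# Grid search: try every value on a fixed grid of 101 points.
grid = [i / 100 for i in range(101)]  # 0.00, 0.01, ..., 1.00
best_grid = min(grid, key=model_error)

# Random search: guess randomly within a fixed budget of 50 trials.
random.seed(42)
best_random = min((random.random() for _ in range(50)), key=model_error)

print(best_grid)    # lands exactly on the grid point nearest 0.37
print(best_random)  # close to 0.37, but depends on the random draws
```

With one parameter both approaches are cheap; the grid becomes infeasible as parameters multiply, which is where the evolutionary approach comes in.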
GoolyBib Italy model
Hi,
Thanks for offering to help: please see the model attached.

We have been choosing the values for theta (adstocks) and beta (saturation) manually, and this is the best combination we've found so far.

Best,
To optimize the whole, we must sub-optimize the parts.
Theta and Beta
Marketing campaigns tend not to have a linear relationship with sales. As you spend more money on a channel, efficiency decreases because you have to outcompete a greater number of other advertisers for ad inventory. To model a non-linear variable in a linear model, we must transform the data. The power saturation curve we're using simply raises advertising spend to the power of beta. But what is beta? It's a Greek letter, because by convention mathematicians use them to identify parameters in a formula.

We don't know what the correct value of beta is – it's our job to estimate it. The way this is done is by trying a lot of beta values, from ~0 (completely saturated) to 1 (linear), and seeing which value most improves the model. If accuracy improves as beta rises from 0.3 to 0.4, but declines again at 0.5, we know the true value is likely around 0.4.

We do a similar job for adstocks, to estimate the percentage of advertising impact that's felt the day after spend. For example, if we choose 0.2 then 20% is felt the day after, 20% of that the next day (4%), and so on. The process is the same: calculate the model's accuracy at each level of adstock and see which level performs best. Of course this gets complicated across multiple variables, because changing one changes the whole model, and the odds of ending up with the best possible combination by hand are low.
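The two transformations described above take only a few lines each. This is a plain-Python sketch; the notebook's own functions may differ in detail:

```python
def adstock(spend, theta):
    """Geometric adstock: a fraction theta of yesterday's (adstocked)
    effect carries over into today."""
    out, carry = [], 0.0
    for s in spend:
        carry = s + theta * carry
        out.append(carry)
    return out

def saturate(spend, beta):
    """Power saturation: diminishing returns as spend grows (beta in (0, 1])."""
    return [s ** beta for s in spend]

# A single day of spend decays geometrically at theta = 0.2.
print(adstock([100, 0, 0, 0], 0.2))   # [100.0, 20.0, 4.0, 0.8]

# At beta = 0.5, quadrupling spend only doubles the modeled effect.
print(saturate([100, 400], 0.5))      # [10.0, 20.0]
```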
For models with up to 5 or 6 parameters, it can be feasible to adjust the model manually and find a good-enough solution. However, this task quickly becomes impossible with more parameters because the number of potential combinations increases exponentially: testing just 6 parameters at 100 values each would mean 100^6, or a trillion, possible combinations.
Better than random guessing
Rather than manually choosing the value for each parameter, let's get Nevergrad to do it for us. It uses an evolutionary algorithm to build hundreds or thousands of models to find the right parameters for us.
Vexnomics Office – Later that day
After digging through the Nevergrad documentation and using some marketing mix modeling code you had from another project, you have something ready to run.
Nevergrad is a Python package created by the Meta (Facebook) team for gradient-free optimization. It works using techniques such as differential evolution and particle swarm optimization, which optimize a problem by iteratively trying to improve candidate solutions with regard to a given measure of quality.

This process is called evolutionary because it is based on the ideas behind evolution in nature. The simplest evolution strategy operates on a population of size two: the current point (the parent) and the result of its mutation. Only if the mutant's fitness is at least as good as the parent's does it become the parent of the next generation; otherwise the mutant is discarded.

Such methods are useful because they make few or no assumptions about the problem being optimized and can search very large spaces of candidate solutions. Because they don't use gradients like other optimization methods, they can be applied to problems that are not even continuous, are noisy, change over time, and so on. This is perfect for machine learning, where the actual model might be a black box and changes in one part of the model can give non-linear responses in another. The downside of evolutionary algorithms is that they are computationally intensive, so they take a while to run, and they never guarantee that an optimal solution is found.
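The population-of-two strategy described above can be sketched in a few lines. This is a minimal illustration, not Nevergrad's actual implementation, and the two-parameter objective with its optimum at (0.2, 0.7) is made up:

```python
import random

def one_plus_one_es(loss, init, sigma=0.1, budget=200, seed=0):
    """Minimal (1+1) evolution strategy: mutate the parent with Gaussian
    noise, keep the mutant only if it is at least as fit, and clamp every
    parameter to [0, 1]."""
    rng = random.Random(seed)
    parent = list(init)
    parent_loss = loss(parent)
    for _ in range(budget):
        mutant = [min(1.0, max(0.0, p + rng.gauss(0, sigma))) for p in parent]
        mutant_loss = loss(mutant)
        if mutant_loss <= parent_loss:        # mutant becomes the new parent
            parent, parent_loss = mutant, mutant_loss
    return parent, parent_loss

# Toy objective standing in for "model error given (theta, beta)".
loss = lambda p: (p[0] - 0.2) ** 2 + (p[1] - 0.7) ** 2
best, err = one_plus_one_es(loss, [0.5, 0.5])
print(best, err)  # best is close to [0.2, 0.7]
```

Nevergrad's algorithms are far more sophisticated, but the accept-or-discard loop is the same basic idea.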
Hi, I'm going to walk you through how Nevergrad works
I'm going to take the file here, and just make a copy, so we have our own copy in Drive
First you want to install Nevergrad
So if you just click into that cell and press shift + enter, or you can click play here
That's just downloading the library and installing it on the machine Google is giving you for Colab
Now we just need to load in all of the libraries that we're going to use
Just shift + enter, or click the play button
And these are some functions we're going to use as well
We need to authenticate Google Drive so we can import the data
So if we just run that cell, it'll give you a link to click through, and you just want to authorize, then copy and paste
Hit enter
Now if we run this cell
It's going to pull in the data from this Google doc, and it's going to display it here as well
So we know that we've got the right data: it's the number of days, then Facebook spend, radio spend, TV spend and then revenue
Now let's plot the data so we can see how it's trending over time
So we see here it's a plot of revenue, and then we see all the Facebook, radio and TV spend underneath
Now in order to run Nevergrad we need to do a little bit of setup
First we need to get all the spend columns, so the media variables are the columns with 'spend' in the name, and then we just need to create the X and y variables
y is revenue, that's what we're trying to predict
Then X is what we're trying to predict it with
This is an adstock formula, it helps us take into account that there will be some lagged effect of advertising
And Nevergrad is going to help us find the right adstock values
It's going to try lots of theta values here
And it needs that function in order to transform the data in each run
Same for saturation, this is diminishing returns: it's just a power transformation, with just one variable
You can have more variables if you do something more complex, like a Weibull adstock or a Hill saturation curve
Now we need one big function to model everything together
So it needs to take in the Facebook spend theta, the Facebook spend beta, and then the same thing for each other channel
And then it pulls it all together and creates the adstocks and the saturation, so it does adstock first and saturation after that, which is best practice
Finally it creates a new data frame, basically an Excel spreadsheet
Then it creates the X and y values, and splits them into X train, X test, y train and y test
So this is a train test split that lets us check the accuracy with data the model hasn't seen before
This prevents overfitting of the data
Finally we have a linear regression model
Although you could replace this with whatever model you want to use, if you want to use Ridge regression or something different, and then it fits the data, and gives us predictions, and we then calculate the accuracy metrics
So normalized root mean squared error is just an accuracy metric that gives a percentage, and it penalizes really big misses more than MAPE does
The MAPE is the mean absolute percentage error, and that's just the general percentage you were wrong each day
We also have the R squared as well
Then we wrap this function and get all of the results, but only return the MAPE value
This function is what Nevergrad is going to optimize against
So here we're optimizing to the MAPE
But we could swap this out for NRMSE or R squared if we wanted to
Now we just need to make these scalar values
Don't worry too much about this, but we're essentially just setting the bounds of what the parameters could be, and then we're going to run the optimizer here
The optimizer, this is actually Nevergrad doing its thing
We're passing the instrumentation into the parameters, and then we're giving it a budget, and we're saying this is how many times we want you to run a trial
And a trial is basically the algorithm choosing what parameters to try next, then running that trial to see if those parameters worked
Then we just want to log the error values, so we can do a chart afterwards of how quickly it approached the lowest MAPE
So that's what we're doing here
We're just registering the callback
Every time it tells us the answer for that trial
OK, and we're going to set a timer, so we can see how long it takes to run
This is the optimizer function, this is where it's actually minimizing the error
This is where all of the work is actually happening
So when you run all of this
Oh, I have an error
Main function is not defined
If you ever get this, it's because you probably ran things out of order
So I'm just going to go back and run this one here
Then run this one again
There you go, you can see it took 1.8 seconds to do those 100 trials
And these are the values that it came back with for Facebook spend beta and Facebook spend theta
So these were the parameters for the adstocks and saturations that got us the best accuracy out of all of the models it tried
You can also push those parameters into the function that you created in order to get the MAPE
So it's 0.076, so it's 7.6%
Then you can see if you plot
So just run that so you can see if it works
It's going to be a little bit different each time, because it's an evolutionary algorithm, so there are some random elements in there and it won't always give the exact same values
Just going to run this
We can see here that we started off with a very high MAPE, it's 18%, and then it even spiked up to 20%
Then over time it evolved towards the better values
It got a lot lower, to 0.08, and then stayed around there for most of the 100 trials
Then if we run the build model function, it's going to return us the MAPE value, the NRMSE, the R squared value, the model and also the result
So we just run that and it's going to give us those values
We can also plot the accuracy
So if we plot the figure, and we take the revenue and the prediction from the result, which is the data frame, then we are showing that
We're seeing here the revenue in purple, vs the red prediction
That's how you run Nevergrad in order to find the right parameter values
What are the benefits of using Nevergrad vs other optimization libraries?
Describe in your own words the steps taken in the Google Colab notebook to use the Nevergrad library
That's great to hear
8% error is much better than we had on our manual model
Looking forward to seeing if we can improve it further by dialling up the number of models
This approach works if you're running up to around 2,000 models, but it takes a while and hits an error if you set the budget too high. To reach 10,000 models we have to parallelize the operation, meaning we apply multiple 'workers' to the calculation rather than computing each iteration one at a time.
Scaling to 10,000 models
We need to make some modifications if we want to scale up the number of models we're building. It will help to parallelize the operation to allow multiple calculations to happen at once.
Vexnomics Office – A few hours later
We have a model that works – we're already getting better accuracy. Let's modify our code so we can run a much higher number of models.
Nevergrad offers parallelization right out of the box; you just need one other module, Python's built-in concurrent.futures, to make it work. It lets you create a ThreadPoolExecutor, which manages the multiple workers you create to do the calculations. These workers can carry out tasks asynchronously, as if they were their own mini computers, and then pass the results back to the main function. We run this within a 'with' statement to ensure the workers are cleaned up properly and shut down after everything finishes executing.

You can have as many workers as you have cores available on your computer, which you can find with `os.cpu_count()` from the os library, though it makes sense to leave one core free for anything else your computer needs to do. For demonstration purposes we're going to use just 2 workers, the minimum needed for a parallel run of the 10,000 models, set in the optimizer on the first line with `num_workers=2`. The following file has the updated script we'll use.
Why do we need to parallelize Nevergrad?
Copy the code from the last model to show the output of the parallelized code. What is the MAPE for the next model? Was it significantly better?
The parameter for TV spend beta changed from 0.45 to 0.10 in the new model. How would we interpret that?
What does the chart showing mape over time tell you? What does it look like Nevergrad was doing? Is this behavior what we'd expect?
Thanks for sending the results across - very interesting
I thought dialling up the models would get us a better result
However seeing that it jumped around a lot and tried lots of options gives me confidence that we really did end up with the best allocation
Thanks again for your work on this
We can reuse the script with modifications for every new model we need to optimize!