" Clean data, clear mind
You've been asked to get data ready for a big modeling exercise, but what format do you need it in? What does clean data look like?
Vexnomics Office – Morning
You've been asked to clean the data for a big model for a new client. But what does it mean to 'clean' data anyway?
Data Cleaning
Hey,
I know this is your first data cleaning task, so let me explain.
Our goal is to build a single file with all the data in tabular format.
That means one column per variable with one row per date.
Let me know how you get on!
Thanks,
Clean data refers to data that's in the right format for modeling. The format might differ depending on the model, but for marketing mix modeling in particular you need it in tabular format. This means one row per date (daily, weekly, or monthly) and one column per variable (for example, one column per channel or campaign). There should be no gaps in the data (fill any missing periods with zeros), and all of the data from different sources should be joined together in one sheet. To get the data into the right format you might need to transpose and/or unstack it, as well as clean up outliers and missing values. The final result is called an ABT (Analytical Base Table): the dataset in its final form, ready for modeling.
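To make the target shape concrete, here's a toy sketch (with made-up channel names and numbers) of what an ABT looks like in pandas: one row per date, one column per variable, and zeros where a channel had no activity.

```python
import pandas as pd

# Toy ABT: one row per date, one column per variable, no gaps
abt = pd.DataFrame({
    "date": ["2021-03-23", "2021-03-24", "2021-03-25"],
    "facebook_spend": [100.0, 0.0, 150.0],  # zero on days with no spend
    "google_spend": [80.0, 90.0, 0.0],
    "revenue": [900.0, 1100.0, 1000.0],
})
print(abt)
```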
Surveys suggest data scientists spend around 60% of their time cleaning and organizing data.
You should do data cleaning as a separate task, in a separate file, and export the final dataset with a name containing 'clean' or '_abt' (Analytical Base Table). This lets you come back to the model in the future and use the exact same data as you did before, as well as letting others replicate your work.
All your spend in one place
First we need to collect our media spend, and get it all in one place, categorized by campaign in tabular format.
Vexnomics Office – Morning
The first big job is to get all of your spend data in one place alongside revenue. To do this we need to unstack the data, then categorize our marketing campaigns.
Data is 'stacked' when some of the values you want as columns are listed as values in a single column. For example, the source/medium column you export from Google Analytics needs unstacking to get one revenue column per channel. This is necessary because we might want to filter out specific values, categorize them into groups, or simply move to a tabular format for modeling. For marketing mix modeling you need one row per date and one column per variable, so unstacking is often a key part of cleaning the data. Take a look at how we unstacked the Facebook ads data and categorized the campaigns in the following Python notebook. Your challenge will be to do the same for Google.
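The categorize-then-unstack pattern can be sketched in pandas. This is a minimal illustration with made-up campaign names and numbers, not the course notebook's exact code: campaigns containing 'goolybib' become 'brand', everything else 'generic', then a pivot turns the stacked rows into one column per category.

```python
import pandas as pd

# Hypothetical stacked export: one row per date per campaign
stacked = pd.DataFrame({
    "date": ["2021-03-23", "2021-03-23", "2021-03-24", "2021-03-24"],
    "campaign": ["goolybib_search", "summer_sale", "goolybib_search", "summer_sale"],
    "spend": [100.0, 50.0, 120.0, 60.0],
})

# Categorize: campaigns containing 'goolybib' are 'brand', the rest 'generic'
stacked["category"] = stacked["campaign"].str.contains("goolybib").map(
    {True: "brand", False: "generic"}
)

# Unstack: pivot to one row per date, one column per category
unstacked = stacked.pivot_table(
    index="date", columns="category", values="spend", aggfunc="sum"
).fillna(0)
print(unstacked)
```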
Categorize the campaigns into 'brand' (containing 'goolybib') and 'generic' (everything else). How many brand campaign entries were there?
Use the unstack data function on the categorized Google data. How much did brand spend on 2021-03-24?
Some of the data we receive when building a model comes with the dates as columns and the variables as rows. This data needs to be transposed: that means swapping the rows for columns. Once it's in the right format we can join it up with the rest of the data we have in tabular format (dates = rows, variables = columns).
First we need to load the TV data from GSheets with the same method we used for Facebook and Google. The GSheet file is linked below. Then we need to use the transpose_data function, which will be provided.
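A minimal sketch of the transpose step, using a made-up TV spend table (the course's transpose_data function may do more, such as header handling):

```python
import pandas as pd

# Hypothetical TV export: variables as rows, dates as columns
raw = pd.DataFrame(
    {"2021-01-04": [500.0], "2021-01-11": [750.0]},
    index=["tv_spend"],
)

# Transpose: swap rows and columns so dates become the index
tv = raw.T
tv.index.name = "date"
print(tv)
```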
Did you manage to transpose the data with the function?
Another important concept is 'interpolation'. At present the TV data is weekly, but we need it daily to match up with the rest of the data we have. To do this we interpolate, which means filling in the gaps for the 6 days a week we don't have yet. The same method is used to zoom in on pictures (which makes them look blurry). We'll provide you with an interpolation function and the code to get the data in the right format for interpolation; you just need to apply it. To use the function, your date column needs to be in the format `YYYY-WW`.
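The idea behind weekly-to-daily interpolation can be sketched with pandas resampling. This assumes the `YYYY-WW` dates have already been converted to real datetimes (which the provided formatting code handles), and uses made-up spend figures:

```python
import pandas as pd

# Hypothetical weekly TV spend, indexed by week start date
weekly = pd.Series(
    [700.0, 1400.0],
    index=pd.to_datetime(["2021-01-04", "2021-01-11"]),
    name="tv_spend",
)

# Resample to daily frequency (inserting NaN for the 6 missing days
# each week), then linearly interpolate to fill the gaps
daily = weekly.resample("D").asfreq().interpolate(method="linear")
print(daily)
```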
Interpolate the transposed TV data and explain your method.
Now for a bigger challenge. Load the data from Google Analytics, unstack the data and filter for just the total revenue and date columns. You can use the methods we've covered so far (go back and review the code) and also feel free to Google for help if you get stuck.
Please detail any issues you encountered that you had to work through to complete this challenge.
You'll see in the file above that we joined the data together for you into one file, based on the date column. Take a look at this code because we'll be using it in a later chapter.
Ok, good progress
The media spend is the important part
Take a look next at what 3rd party data you can add
Importing 3rd party data
Market trends, and global events like COVID have an impact on sales, so how do we include them? The key is getting the data in a usable format.
Vexnomics Office – Afternoon
It's straightforward to incorporate media spend, but how do we account for external factors like market trends and the impact of COVID? Where do we even get that data? You ask a friend.
Yo!
For market trends, I usually just look at Google Search trends
Google also has a pretty good mobility dataset for COVID
One good proxy for market demand is how much the product is being searched for. Google doesn't provide actual search data but it does provide an index, which is all we need for our model. We just want to know when periods of interest were high or low, so we can tell whether our sales dips and spikes were due to something we did in marketing or to a wider shift in the market. Controlling for this will give us a better picture of how marketing is doing: without it you'd be giving too much credit to marketing in boom times and unfairly penalizing marketing in downturns. We'll provide you with the data and code to deal with search trends data (which needs to be interpolated), but take a look at what it's doing, and the output.
Download the clean search trends data and make a line chart in Excel or Google Sheets. In what month does interest peak?
Large-scale changes in human behavior will definitely impact your business, as many have found with the COVID pandemic. Online businesses tended to do well, while traditional businesses did worse. Without controlling for these factors you may be under- or over-attributing performance to marketing. To include this data it's best not to simply use an 'is COVID' dummy variable (1 or 0); instead you need more fine-grained control. When did people stay inside, and when did they start to leave the house? This has changed daily based on lockdowns in different regions: luckily you can get this data from Google, who released this Maps data to help people understand how their community is reacting. Like search trends, this is an index, but that's fine for the purposes of our model. We'll provide the code to handle COVID data for a region; you just need to adapt it and use it to get the data we need.
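The region filter at the heart of that code can be sketched as below. The tiny inline table stands in for Google's mobility CSV; treat the exact column names here as illustrative, and check them against the real file you download:

```python
import pandas as pd

# Hypothetical slice of a mobility CSV: one row per region per date
mobility = pd.DataFrame({
    "sub_region_1": ["New York", "New York", "California"],
    "date": ["2021-03-01", "2021-03-02", "2021-03-01"],
    "retail_and_recreation_percent_change_from_baseline": [-30, -28, -25],
})

# Filter to the region we want, keeping only date + the mobility index
region = mobility[mobility["sub_region_1"] == "New York"]
region = region[["date", "retail_and_recreation_percent_change_from_baseline"]]
print(region)
```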
The files Google provides are quite large, so in our case we actually need to upload the file to the notebook, rather than loading it in from GSheets like we have in other exercises. To do that, go to Files (left-hand side menu) > Upload Icon. It should take about 2-3 minutes.
Get the data for 'New York' using the code and file provided.
Cool to see that COVID data
Didn't know that was available!
Finally we just need to add in fed holidays / sale periods
Making dummies for dates
Holidays and other important dates need to be incorporated, and to do so we need dummy variables – 1s and 0s.
Vexnomics Office – Afternoon
We're almost done, just one last task: incorporating holidays. But how do we handle dates? For that we need dummy variables.
The way we handle important dates in models is by creating dummy variables: columns of 1s and 0s to label when an event was occurring or not. This lets the model determine the impact of that variable on sales relative to the other channels and events that occurred in that time period.
We'll provide you with the code for the exercise; you just need to modify and run it. First we look at the case of adding dummies for single dates, then the case of adding dummies for a range of dates, for example during sale periods.
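Both cases can be sketched in a few lines of pandas. The dates here are made up for illustration; the provided exercise code will read them from the GSheets instead:

```python
import pandas as pd

# Hypothetical daily date spine for the model
dates = pd.DataFrame({"date": pd.date_range("2021-11-20", "2021-11-30")})

# Dummy for a single date (e.g. a holiday falling on 2021-11-25)
dates["thanksgiving"] = (dates["date"] == "2021-11-25").astype(int)

# Dummy for a date range (e.g. a sale period), 1 inside the range, 0 outside
in_sale = dates["date"].between("2021-11-26", "2021-11-29")
dates["sale_period"] = in_sale.astype(int)
print(dates)
```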
Load the holidays gsheet and use the above function to generate dummies. Did you get this working?
Modify the code to get it working with date ranges in the sale period sheet above, rather than single dates. Explain what you did.
The final thing we need to do is merge all of our data together in one table, called the ABT or Analytical Base Table. This is the final cleaned version of the data, all joined together by the date column, which has been present in every data source. There are different ways to merge data, but typically you want to start with your revenue or conversion data as the first table and left join everything else onto it. This is because if you don't have revenue or conversions for a time period, you won't be able to train your model on it anyway. A left join keeps every row from the left table (revenue) and drops any rows in the right table (the one you're merging in) whose dates don't appear on the left, so you'll never have a date without revenue in your table. The other thing you want to make sure of is that you fill any N/A values with 0, which pandas has a native function for: fillna(0). The code below shows you how to merge data with a left join. Do this for all the tables we've cleaned in this course so all of your data is in one place.
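As a minimal sketch of the join (with made-up dates and values), note how the spend row with no matching revenue date is dropped, while a revenue date with no spend comes through as NaN and gets filled with 0:

```python
import pandas as pd

# Revenue is the left table: every modelable date has a revenue row
revenue = pd.DataFrame({
    "date": ["2021-03-23", "2021-03-24"],
    "revenue": [900.0, 1100.0],
})
spend = pd.DataFrame({
    "date": ["2021-03-24", "2021-03-25"],  # 2021-03-25 has no revenue
    "brand_spend": [120.0, 80.0],
})

# Left join on date keeps both revenue dates and drops 2021-03-25
abt = revenue.merge(spend, on="date", how="left")

# 2021-03-23 has no matching spend row, so fill the resulting NaN with 0
abt = abt.fillna(0)
print(abt)
```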
The code for cleaning data will always be a little bit custom. Luckily pandas and other Python libraries have a lot of common things covered, and there's a rich trove of information out there if you get stuck – just Google it (that's what even experienced developers do!). The benefit of doing this in code rather than manually in Excel or Google Sheets is reproducibility and long-term time saving: it might take longer to write the code the first time, but you can then reuse that code again and again, and it'll work in exactly the same way.
Amazing work on all of this
You've taken a big job off my plate
I'll make sure you're included in the modeling project so you get to do some of the fun stuff as your reward!