
Marketing Mix Modelling with Bayesian Regression

1.0 Introduction

“Half of my advertising dollars are wasted, the problem is that I don’t know which half” – John Wanamaker.

With marketing mix modeling (MMM), analysts attempt to answer causal questions like “how does TV spend drive my sales? How much should I spend on TV next year?”

In research, the best practice for addressing causal questions is to run randomized experiments. However, this is rarely practical for companies, because an advertisement is either on or off for the whole population at any given moment. In other words, with above-the-line campaigns (i.e. non-targeted campaigns), it is not possible to have separate control and treatment groups. David et al. (2017) state that, with randomized experiments being infeasible, advertisers often turn to regression models to answer their questions about advertising effectiveness. In this post, we step through the analysis end-to-end, from simulating the MMM data to comparing the results of 4 different variable selection methods for regression.

This post is broken down into the following sections:

  • Section 2: Describe the typical characteristics of the data for MMM.
  • Section 3: Describe typical challenges associated with approaching MMM with regression.
  • Section 4: Demonstrate simulating a dataset for MMM and perform simple feature engineering.
  • Section 5: Demonstrate variable selection using Bayesian approaches and Lasso. Discuss results in brief.
  • Section 6: Highlight the possibility of injecting non-reference priors based on experience.

2.0 Typical Data for MMM

For most companies, the typical dataset assembled for MMM consists of:

  • The dependent variable: e.g. sales volume or customer acquisition count.
  • The various types of input variable:
    • Spend by channel: e.g. TV / Search / newspaper / sponsorship / digital.
    • Interaction effect or “Synergies”: e.g. TV x Search.
    • Company’s price promotion effect: e.g. a price promotion that is not supported by any advertisement would also drive sales.
    • Seasonality: e.g. month of the year – Christmas / other spikes.
    • Trend: e.g. past week revenue, past X months channel spend.
  • A limited number of rows: spend data is typically reported weekly, so a company with 3 years of data for all of the above has about 52 x 3 = 156 rows of data.

3.0 Challenges of the regression approach

  • Collinearity / Multicollinearity: Correlation between the input variables makes the linear regression model unstable, i.e. the estimated coefficients can change substantially with small changes to the data. There are two ways to mitigate this. First, we can use a correlation matrix to remove input variables whose correlation with any other input variable has a magnitude of at least 0.7. Second, if we are worried about an input variable correlating with a combination of two or more other input variables, we can compute the variance inflation factor (VIF). We demonstrate both approaches in Section 4.2.
  • Curse of Dimensionality: As discussed above, the MMM dataset has only about 100 to 150 rows. If there are too many input variables (columns) relative to observations (rows), the model has too much variance and overfitting becomes a problem. A general rule of thumb for checking that there is enough data is that the number of rows divided by the number of columns should be about 10 or more, i.e. about 10 observations per input variable (assuming that all input variables are continuous numeric variables). With 156 rows, that allows roughly 15 input variables.
  • Correlation does not mean causation: Linear regression is meant to show an association, so using it to answer causal questions can be problematic. As a quick, relatable example, if we regress the average lifespan of a country's population against the proportion of the population that eats seafood, we would probably find a statistically significant positive coefficient for the input variable. While we have found that the input variable is associated with the dependent variable, we cannot conclude that giving a fish to every household, and thereby increasing the proportion of seafood eaters, would cause the average lifespan to increase. There is probably a confounding variable, say wealth, which is positively correlated with the proportion of seafood eaters and which is what really allows people to afford better healthcare and live longer.
  • Extrapolation beyond the observed range of data: Suppose the observed Search spend ranges between $50 and $150. What would happen if we increased spending to $400, or ceased spending altogether? We simulated a dataset in which the correlation between X and Y is about 0.81, which implies that the R-squared of the linear fit is about \(0.81^2 \approx 0.66\), and fitted three models with approximately the same R-squared. Notice how the quadratic model does not recommend a search spend of more than $200-300, whilst the other two models show very different outcomes.

The above figure illustrates how three linear regression models with the same quality of fit, as measured by R-squared, give conflicting recommendations on whether to increase search spend beyond $300 (an extrapolation from the observed range). The code that produced the above figure is here.
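The extrapolation problem can be reproduced in a few lines. The following is a minimal Python sketch, not the code behind the figure: the data-generating process, the seed, and the choice of linear, quadratic and log functional forms are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data (not the post's dataset): 156 weeks of search spend
# inside the observed range of $50 to $150, with noisy linear revenue.
spend = rng.uniform(50, 150, size=156)
revenue = 200 + 3.0 * spend + rng.normal(0, 60, size=spend.size)


def design(x, kind):
    """Design matrix (intercept plus transformed spend) for one functional form."""
    x = np.asarray(x, dtype=float)
    ones = np.ones_like(x)
    if kind == "linear":
        return np.column_stack([ones, x])
    if kind == "quadratic":
        return np.column_stack([ones, x, x ** 2])
    if kind == "log":
        return np.column_stack([ones, np.log(x)])
    raise ValueError(f"unknown model kind: {kind}")


def fit_and_extrapolate(kind, new_spend):
    """Fit by least squares; return in-sample R-squared and the prediction at new_spend."""
    X = design(spend, kind)
    beta, *_ = np.linalg.lstsq(X, revenue, rcond=None)
    resid = revenue - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((revenue - revenue.mean()) ** 2)
    prediction = (design([new_spend], kind) @ beta)[0]
    return float(r2), float(prediction)


# $400 is far outside the observed $50 to $150 range.
results = {kind: fit_and_extrapolate(kind, 400.0) for kind in ("linear", "quadratic", "log")}
for kind, (r2, pred) in results.items():
    print(f"{kind:>9}: R^2 = {r2:.3f}, predicted revenue at $400 spend = {pred:,.0f}")
```

All three models are linear in their coefficients and fit the observed range about equally well, yet nothing in the data constrains how they behave at $400, so their predictions there can differ widely.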

4.0 Generating a simulated dataset for MMM

To perform the analysis, we need a dataset that is suitable for MMM. Fortunately, Google Research has built the amss repo, which simulates MMM data. Following the instructions in the amss repo, I generated a CSV with the default parameters, which can be downloaded here. The source code to generate the CSV is here. We use only the revenue, TV spend, and Search spend for the rest of the analysis.

4.1 Feature Engineering

Starting from the CSV, we engineer additional features from revenue, TV spend and Search spend to account for seasonality and trend (source code here):

  • Seasonality: We create a 1-year lagged variable for revenue and for each channel's spend. These variables have a suffix of .lag1y.
  • Trend:
    • We create 3-month and 1-week lagged variables for revenue. These variables have suffixes of .lag3m and .lag1w.
    • For channel spend, we compute the rolling average over the past 1 week / 3 months / 1 year. These variables have suffixes of .p1w, .p3m and .p1y.
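The lagged and rolling features described above can be sketched as follows. This is an illustrative Python version, not the linked source code; it assumes weekly rows, treats 1 year as 52 weeks and 3 months as 13 weeks, and excludes the current week from each rolling average. The helper names `lag`, `past_mean` and `engineer` are hypothetical.

```python
import numpy as np

WEEKS_PER_YEAR = 52  # assumption: weekly rows, 52 weeks per year
WEEKS_PER_3M = 13    # assumption: 3 months treated as 13 weeks


def lag(values, k):
    """Value observed k weeks earlier; the first k entries are NaN."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    if 0 < k < len(values):
        out[k:] = values[:-k]
    return out


def past_mean(values, window):
    """Mean of the `window` weeks before the current week (current week excluded)."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    for t in range(window, len(values)):
        out[t] = values[t - window:t].mean()
    return out


def engineer(revenue, spend_by_channel):
    """Build the seasonality and trend features, keyed by the post's suffixes."""
    features = {
        "revenue.lag1y": lag(revenue, WEEKS_PER_YEAR),
        "revenue.lag3m": lag(revenue, WEEKS_PER_3M),
        "revenue.lag1w": lag(revenue, 1),
    }
    for name, spend in spend_by_channel.items():
        features[f"{name}.lag1y"] = lag(spend, WEEKS_PER_YEAR)
        features[f"{name}.p1w"] = past_mean(spend, 1)
        features[f"{name}.p3m"] = past_mean(spend, WEEKS_PER_3M)
        features[f"{name}.p1y"] = past_mean(spend, WEEKS_PER_YEAR)
    return features


# Hypothetical two-year series, used only to show the resulting column names.
weeks = 104
demo = engineer(
    np.arange(weeks, dtype=float),
    {"tv": np.linspace(10, 20, weeks), "search": np.linspace(5, 8, weeks)},
)
print(sorted(demo))
```

Rows whose lag or window reaches back before the start of the data come out as NaN, so the first year of observations has to be dropped before any model that uses the 1-year features is fitted.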

4.2 Collinearity and Multi Collinearity Checks

To check for collinearity, we use a correlation matrix and observe if the magnitude of the correlation between any two input variables is more than 0.7.

The above correlation matrix suggests that we should remove either revenue.lag1y or revenue.lag1w. I have decided to keep revenue.lag1y to keep the seasonality-related input variable. Next, we check for multicollinearity.

The vif function from the car package helps in detecting multicollinearity. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

Using VIF=10 as a threshold, we remove search_p3m from the rest of the analysis.
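Both checks in this section can be sketched from first principles. The Python below is an illustration, not the post's R code and not the car package: `high_correlation_pairs` and `vif` are hypothetical helper names, and the data is simulated purely to show a case where pairwise correlations look harmless while the VIF does not.

```python
import numpy as np


def high_correlation_pairs(X, names, threshold=0.7):
    """Return (name_a, name_b, correlation) for pairs with |correlation| above threshold."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


# Hypothetical inputs: four independent variables plus one that is almost their sum.
rng = np.random.default_rng(1)
n = 500
base = rng.normal(size=(n, 4))
combo = base.sum(axis=1) + rng.normal(scale=0.2, size=n)
X = np.column_stack([base, combo])
names = ["x1", "x2", "x3", "x4", "combo"]

flagged = high_correlation_pairs(X, names)
vifs = vif(X)
print("pairs above 0.7:", flagged)
print("VIF by variable:", dict(zip(names, np.round(vifs, 1))))
```

In this simulated case no single pair crosses the 0.7 threshold, because `combo` correlates only moderately with each individual input, but the VIF is large because `combo` is almost exactly determined by the other four together. That is the situation the VIF check is meant to catch.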

5.0 Variable Selection Techniques: Lasso vs. Bayesian Adaptive Sampling

Variable selection is important to limit the variance of the model and prevent overfitting. My favorite algorithm for variable selection has been the lasso, available in the glmnet package for R. The lasso algorithm is described in detail in the Introduction to Statistical Learning textbook listed in the references section.
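For readers who have not seen how the lasso ends up selecting variables, here is a minimal coordinate-descent sketch in Python. It is not glmnet and not the code used in this post; the function names, the simulated data and the penalty value are assumptions. It minimises \(\frac{1}{2n}\|y - X\beta\|^2 + \lambda \|\beta\|_1\) on centred, standardised inputs, so no intercept is fitted.

```python
import numpy as np


def soft_threshold(z, lam):
    """Shrink z towards zero by lam, returning exactly zero when |z| <= lam."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    Assumes the columns of X and the vector y are already centred.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's current contribution added back in.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
    return beta


# Hypothetical data: 5 standardised predictors, only the first and fourth matter.
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
true_beta = np.array([3.0, 0.0, 0.0, 1.5, 0.0])
y = X @ true_beta + rng.normal(scale=0.5, size=n)
y = y - y.mean()

beta_lasso = lasso_cd(X, y, lam=0.3)
selected = [j for j in range(X.shape[1]) if beta_lasso[j] != 0.0]
print("lasso coefficients:", np.round(beta_lasso, 3))
print("selected predictor indices:", selected)
```

The soft-thresholding step is what sets small coefficients to exactly zero, which is why the lasso doubles as a variable selection method. In practice the penalty \(\lambda\) would be chosen by cross-validation rather than fixed by hand as it is here.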

In this post, we investigate the variable selection techniques of the BAS (Bayesian Adaptive Sampling) package and compare their performance on the simulated dataset. We perform leave-one-out cross-validation to compare the lasso against 3 variants of Bayesian model averaging (BMA).

Bayesian model averaging is a technique that uses an ensemble of models to perform prediction; it is referred to as a hierarchical model. At prediction time, the BMA prediction is the average of the predictions from all sub-models, weighted by each sub-model's posterior probability, and it performs best under squared error loss. The three variants are:

  1. BMA: Bayesian Model Averaging, uses the predictions of all sub-models.
  2. HPM: Highest Probability Model, uses the prediction of the single sub-model with the highest posterior probability.
  3. MPM: Median Probability Model, uses all predictors whose marginal inclusion probabilities are greater than 0.5.
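A toy calculation shows how the three variants can disagree. The posterior model probabilities and predictions below are made-up numbers for two predictors, not output from the BAS package; the sketch only illustrates the three definitions above.

```python
# Hypothetical sub-models over two predictors, with made-up posterior
# probabilities and made-up revenue predictions for one new observation.
models = [
    {"predictors": (), "prob": 0.05, "prediction": 100.0},
    {"predictors": ("tv",), "prob": 0.40, "prediction": 120.0},
    {"predictors": ("search",), "prob": 0.20, "prediction": 110.0},
    {"predictors": ("tv", "search"), "prob": 0.35, "prediction": 130.0},
]

# BMA: posterior-probability-weighted average of every sub-model's prediction.
bma_prediction = sum(m["prob"] * m["prediction"] for m in models)

# HPM: the single sub-model with the highest posterior probability.
hpm = max(models, key=lambda m: m["prob"])

# Marginal inclusion probability: total probability of the models containing a predictor.
predictors = sorted({p for m in models for p in m["predictors"]})
inclusion = {
    p: sum(m["prob"] for m in models if p in m["predictors"]) for p in predictors
}

# MPM: the model made of every predictor with inclusion probability above 0.5.
mpm_predictors = tuple(p for p in predictors if inclusion[p] > 0.5)
mpm = next(m for m in models if tuple(sorted(m["predictors"])) == mpm_predictors)

print("BMA prediction:", round(bma_prediction, 2))
print("HPM predictors:", hpm["predictors"], "prediction:", hpm["prediction"])
print("inclusion probabilities:", {p: round(v, 2) for p, v in inclusion.items()})
print("MPM predictors:", mpm_predictors, "prediction:", mpm["prediction"])
```

With these numbers the HPM keeps only TV, while the MPM keeps both TV and Search because each has a marginal inclusion probability above 0.5, and the BMA prediction falls between the two.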

After removing some variables in Section 4.2, the full model is specified below:

\(Revenue_i = \beta_{intercept} + \beta_{tv} tv_i + \beta_{search} search_i\)
\(+ \beta_{revenue.lag1y} revenue.lag1y_i + \beta_{revenue.lag3m} revenue.lag3m_i \)
\(+ \beta_{tv.lag1w} tv.lag1w_i + \beta_{search.lag1w} search.lag1w_i \)
\(+ \beta_{tv.p1y} tv.p1y_i + \beta_{search.p1y} search.p1y_i \)
\(+ \beta_{tv.p3m} tv.p3m_i \)
\(+ \epsilon_i \)

A reference prior is adopted, which places a uniform distribution on all coefficients. Whilst the ability to inject non-reference priors is unique to Bayesian approaches, this post focuses on variable selection techniques. The source code for this section is here.

5.1 Variables selected

The following table illustrates the different variables selected by the model.

5.2 Leave-One-Out-Cross-Validation Results

Leave-one-out cross-validation (LOOCV) refers to the process in which each row of the dataset takes a turn as the test set: a model is fitted on the rest of the dataset and used to predict the dependent variable for that one-row test set. This way, an RMSE (root-mean-squared error) is recorded for each row (for a single row, this is simply the absolute prediction error), and we can compute summary statistics such as the mean and standard error of the RMSE for each model type.
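The LOOCV procedure can be sketched as below. This is an illustrative Python version that uses ordinary least squares as a stand-in model; the post's actual comparison fits the lasso and the BAS variants in R, and the function name `loocv_errors` and the simulated data are assumptions.

```python
import numpy as np


def loocv_errors(X, y):
    """Per-row RMSE from leave-one-out cross-validation of an OLS model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])  # add an intercept column
    errors = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i  # every row except row i
        beta, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)
        prediction = X1[i] @ beta
        # RMSE over a one-row test set reduces to the absolute error.
        errors[i] = np.sqrt((y[i] - prediction) ** 2)
    return errors


# Hypothetical data: two predictors and a noisy linear response.
rng = np.random.default_rng(3)
n = 30
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

errors = loocv_errors(X, y)
mean_rmse = errors.mean()
std_error = errors.std(ddof=1) / np.sqrt(n)
print(f"mean RMSE = {mean_rmse:.3f}, standard error = {std_error:.3f}")
```

Running the same loop once per model type gives one mean and one standard error for each, which are the two quantities needed for a bar chart with error bars.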

In brief, the LOOCV results show that the 4 selection methods are not significantly different from one another. The bar chart below shows the RMSE between the predicted and actual revenue for each method, with error bars depicting the standard error.

Personally, this finding is great news for me and for other data scientists who use the lasso as their default go-to. Granted, this is only one experiment, but I hope that it encourages data scientists to try Bayesian regression methods. Moreover, Bayesian regression methods allow the injection of prior experience, which we discuss in the next section.

6.0 Future Exercise: Injecting non-reference priors

A strength of the Bayesian approach is the ability to specify a prior distribution for every coefficient. If an experienced consultant has access to the expected coefficients for TV / Search spend in various industries, it could be valuable to inject this external experience in the form of a prior distribution for each coefficient. I plan to investigate this in a separate post.
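To make the idea concrete, one textbook way to inject such experience is a conjugate normal prior on the coefficients with a known noise standard deviation. The Python sketch below is an assumption-laden illustration, not what the BAS package does and not a result from this post; the prior means, prior standard deviations and data are all hypothetical.

```python
import numpy as np


def posterior_mean(X, y, prior_mean, prior_sd, noise_sd):
    """Posterior mean of the coefficients under independent normal priors.

    Assumes centred data (no intercept) and a known noise standard deviation.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    prior_mean = np.asarray(prior_mean, dtype=float)
    prior_precision = np.diag(1.0 / np.asarray(prior_sd, dtype=float) ** 2)
    data_precision = X.T @ X / noise_sd ** 2
    rhs = prior_precision @ prior_mean + X.T @ y / noise_sd ** 2
    return np.linalg.solve(prior_precision + data_precision, rhs)


# Hypothetical standardised TV and Search spend with made-up true effects.
rng = np.random.default_rng(4)
n = 156
X = rng.normal(size=(n, 2))
y = X @ np.array([2.0, 0.5]) + rng.normal(scale=1.0, size=n)

# Hypothetical "consultant experience": both effects are believed to be near 1.0.
prior_mean = [1.0, 1.0]
ols, *_ = np.linalg.lstsq(X, y, rcond=None)
weak = posterior_mean(X, y, prior_mean, prior_sd=[1e3, 1e3], noise_sd=1.0)
informative = posterior_mean(X, y, prior_mean, prior_sd=[0.1, 0.1], noise_sd=1.0)
strong = posterior_mean(X, y, prior_mean, prior_sd=[1e-3, 1e-3], noise_sd=1.0)

print("OLS estimate:       ", np.round(ols, 3))
print("weak prior:         ", np.round(weak, 3))
print("informative prior:  ", np.round(informative, 3))
print("very strong prior:  ", np.round(strong, 3))
```

A very weak prior reproduces the least-squares estimate, a very strong prior returns the prior mean almost unchanged, and an informative prior pulls the estimate part of the way towards the consultant's expectation, depending on how much data there is relative to the prior's confidence.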

7.0 Conclusion

In this post, we have discussed the motivations for MMM, the typical data available, and the typical challenges associated with approaching MMM with regression. As part of the practical walk-through, we simulated a dataset, performed some feature engineering, and performed variable selection using both Bayesian regression and the lasso. Lastly, we discussed the future direction of injecting non-reference priors and the motivation for doing so. I hope that this helps data scientists to consider adding Bayesian regression methods to their data science toolkit.

Acknowledgments & References

For sections 2 & 3, this post follows the content of David et al. (2017) closely. For the simulation of data, I have used the “Aggregate Marketing System Simulator” from GitHub. For the Bayesian regression models, I found the Coursera course very informative. For the lasso and VIF, I have used the “Introduction to Statistical Learning” textbook (Chapters 3 & 6).