There is an urgent need to better understand what uplift tests are, what they tell us and what they don’t.
We recently spoke with Adexchanger about the challenges advertisers face when running uplift tests. Marketers are under increasing pressure to prove the return on large digital marketing budgets (no wonder why…); as one of our customers wrote in an email, “we are under the microscope” to show results.
“I think advertisers are more aware now than before, one of the reasons [why there’s a rising demand for uplift tests] is because the mobile space brings new challenges for retargeting. There are more user touch-points than before and we want to make sure we are not cannibalizing our different campaigns.” — Adrian Sarasa, Mobile User Acquisition Director at LetGo
In that context, marketers try to answer all sorts of questions with (what they believe is) an uplift test: Does advertising work in general? Which segments are best to target? How does seasonality impact ad effectiveness?
Anyone who’s seen The Hitchhiker’s Guide to the Galaxy understands the frustration of getting an answer that doesn’t seem to match the question… so the first step is to understand what uplift tests can actually tell marketers.
Some of these questions are valid (and answerable in one way or another), but it’s important to understand that uplift tests answer only ONE question: Am I able to change user behavior by showing a specific type of ad (in our context, mobile app retargeting ads)? In other words: does showing users a Jampp ad affect their behavior relative to not showing it?
One question, with only three possible results: a positive uplift, a negative uplift, or no statistically significant effect.
That sounds simple enough, but designing a scientifically robust experiment that actually answers the question is a lot harder than most people assume.
“There are many different methodologies out there to measure the effectiveness of an uplift test. I think it is important to research and choose one you feel comfortable with. Also, it is important to set expectations and thresholds. Finally, making sure that the control and treatment groups are totally homogeneous is key.” — Adrian Sarasa, Mobile User Acquisition Director at LetGo
Our way, your way and the right way to do an uplift test.
Our data science team has been researching the subject, both in the academic literature and through our own experiments. After extensive testing, we developed a method that makes it practical to measure the impact of mobile app ads in the real world. We’ll share a high-level version in this post; for the complete scientific description, please read our technical blog post (disclaimer: it’s actually more of a paper 🤓).
First, the user base that will be targeted (a.k.a “the population”) is randomized into two segments. Comparing different populations or pre-defined segments adds unnecessary bias to the experiment, so it’s very important that the division be truly “random”. Half of the population, from here on called the “Treatment” segment, will be exposed to the advertiser’s ads. The other half, called “Control”, will only be exposed to non-related generic Public Service Ads (PSA). While conducting an Uplift test, advertisers should pause any and all other campaigns: users shouldn’t see the ads through any other channel.
As shown in the image above, some users will not be exposed, simply because reachability has a limit. Since our interest lies in comparing the activity of the Exposed users in both the Treatment and Control groups, it’s important to show the Public Service Ads (non-related ads) to the Control group. If we didn’t show any ads to the Control group, it would be difficult to identify which users within that group would have been exposed.
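The random split described above can be sketched in a few lines of Python. This is an illustrative sketch, not Jampp’s actual implementation: `assign_group` is a hypothetical helper that hashes a user ID so that each user’s assignment is deterministic (stable for the whole test) while the overall split stays an unbiased 50/50.

```python
import hashlib

def assign_group(user_id: str, salt: str = "uplift-test-1") -> str:
    """Deterministically assign a user to Treatment or Control (50/50).

    Hashing the user ID (rather than flipping a coin per impression)
    keeps each user's assignment stable for the duration of the test.
    The salt lets a new test produce a fresh, independent split.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Treatment users see the advertiser's ads; Control users see PSAs.
groups = {uid: assign_group(uid) for uid in ["user-1", "user-2", "user-3"]}
```

Because the assignment depends only on the user ID and the salt, any system that touches the user (ad server, attribution, reporting) can recompute the group independently without sharing state.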
So far so good, except it’s not that simple
Assuming the experiment is set up correctly, there are a few other statistical challenges to overcome. Again, these are all explained in detail in our technical post, but in summary:
Challenge #1: Sample size matters
As the effect we are trying to measure is sometimes quite small, uplift experiments require large samples in order to achieve statistical significance. This is why we check the sample size throughout the test. In our experience, a reasonable sample requires 300,000 to 500,000 users to be exposed to impressions. Of course, the required sample size ultimately depends on which metric you want to compare (CVR, event rate, etc.) and the desired level of confidence. At Jampp, to ensure the statistical significance of the results, we use a metric well known among statisticians and scientists: the p-value (probability value). Without getting too deep into the specifics, a p-value below 0.05 means there is less than a 5% probability of observing a difference this large if the ads had no effect at all; in other words, we can be at least 95% confident that the measured uplift is not simply the result of chance.
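A minimal sketch of the kind of significance check involved, using a standard two-proportion z-test on the conversion rates of the two groups. This is a textbook test, not Jampp’s exact methodology, and the numbers below are purely illustrative:

```python
import math

def two_proportion_z_test(conv_t, n_t, conv_c, n_c):
    """Two-sided z-test for the difference between two conversion rates.

    conv_t/n_t: conversions and exposed users in the Treatment group.
    conv_c/n_c: conversions and exposed users in the Control group.
    """
    p_t, p_c = conv_t / n_t, conv_c / n_c
    # Pooled rate under the null hypothesis (ads have no effect).
    p_pool = (conv_t + conv_c) / (n_t + n_c)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative numbers: 400k exposed users per group.
z, p = two_proportion_z_test(conv_t=6_200, n_t=400_000,
                             conv_c=5_800, n_c=400_000)
significant = p < 0.05  # the 95% confidence threshold
```

Note how even a difference of 400 conversions only becomes detectable because each group has hundreds of thousands of exposed users; with small samples the same relative difference would be indistinguishable from noise.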
Challenge #2: Biased Selection
When advertisers compare a segment of users exposed to an ad with a segment that didn’t see any ads, their experiment is subject to selection bias. Advertising platforms, at least the good ones, are designed to predict which users are likely to be more responsive, and to ensure the ads are shown to those users. Say women were more likely to convert for a particular type of ad: the advertising platform would select more women from the total group to show them the ad.
In the end, there will be more women in the Treatment group and more men in the group that didn’t see ads, so the Treatment users will likely show a higher rate of key events. The selection is biased, and we won’t know whether the event rate is higher because we showed the ad or because there were more women in the segment who were more likely to convert anyway.
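Selection bias is easy to demonstrate with a toy simulation. In the hypothetical scenario below the ad has NO real effect on anyone, yet a naive “exposed vs. unexposed” comparison still shows a large apparent lift, purely because the platform preferentially exposes the high-intent segment (all segment names and rates are made up for illustration):

```python
import random

random.seed(42)

def converts(segment: str) -> bool:
    """Convert at the segment's base rate; the ad itself has NO effect."""
    base_rate = {"A": 0.04, "B": 0.01}[segment]
    return random.random() < base_rate

# Hypothetical population: half segment A (high intent), half segment B.
population = ["A"] * 50_000 + ["B"] * 50_000

exposed, unexposed = [], []
for segment in population:
    # The platform preferentially shows ads to high-intent users.
    p_exposed = 0.8 if segment == "A" else 0.2
    (exposed if random.random() < p_exposed else unexposed).append(segment)

def rate(group):
    return sum(converts(s) for s in group) / len(group)

# rate(exposed) comes out well above rate(unexposed) even though the
# ad did nothing: the apparent "lift" is pure selection bias.
```

Randomizing the split before any optimization happens, as in the Treatment/Control design above, is precisely what removes this confound.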
Challenge #3: Ad Delivery Bias
Since the advertising platform optimizes for more conversions, under normal conditions it will treat the advertiser’s ads and the PSAs differently, so the exposed users in the Treatment and Control groups won’t be comparable.
As long as the delivery platform keeps optimizing normally for both groups, the PSAs are not valid control ads. Instead, fixed-price campaigns should be run so that the platform selects exposed users in both groups with the same criteria, thus reducing delivery bias.
So the question on everybody’s mind… Is there generally a positive Uplift?
Results vary across campaigns (and we’ll spell out a few more learnings), but we’ve run uplift tests with several clients across different verticals, and the results have been overwhelmingly positive.
The results below are from an Uplift Test conducted a couple of months ago for a Shopping App running a Retargeting campaign in the US. The results show there’s incremental value in terms of in-app conversions when showing ads to users.
An uplift of 128% means the Treatment group converted at 2.28 times the baseline rate, so out of any 100 conversions coming from Jampp’s retargeting campaigns, about 56 were likely incremental (and 44 would likely have occurred anyway).
In other words, running retargeting campaigns will help the app see more conversions than if it didn’t run retargeting campaigns.
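The arithmetic behind that 56/44 split is simple: if uplift is the relative increase over the baseline conversion rate, then the baseline accounts for 1/(1 + uplift) of observed conversions and the rest are incremental. A tiny helper makes it concrete:

```python
def incremental_share(uplift: float) -> float:
    """Fraction of observed conversions that are incremental.

    uplift = (treatment_rate - control_rate) / control_rate, so the
    treatment rate is (1 + uplift) times the baseline, and the baseline
    accounts for 1 / (1 + uplift) of the conversions we observe.
    """
    return uplift / (1 + uplift)

share = incremental_share(1.28)  # a 128% uplift
# share is about 0.56: roughly 56 of every 100 observed conversions
# are incremental, and the other 44 would have happened anyway.
```

The same formula works for any measured uplift, which makes it a handy sanity check when reading uplift reports.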
Then, should I be constantly running uplift tests?
Please don’t! Do you ask your doctor to run an experiment every time they prescribe a drug, just to prove its effectiveness?
At this point, we believe it’s a generally accepted fact that app retargeting campaigns generate uplift, but we do understand that every so often (every 4–6 months) there is value in testing the extent of the effect a certain type of activity generates. Running uplift tests is disruptive (you need to pause other activity), expensive (you need to pay for the control group’s media) and difficult to run.
Improving the way we measure the impact of advertising will continue to be a priority for Jampp.
If you’re interested in the technical challenges of the Uplift Methodology and how we used our technology to develop more robust tests, read about it in detail on our tech blog.