Hâtvalues

To A/B or not to A/B

Julian Hatwell

Introduction #

This is the story of a project that began as a straightforward A/B test but quickly revealed more than expected—offering fresh insights and expanding the scope of analysis.

It’s been a while since I worked as an independent data and analytics consultant. I went freelance after many years in data systems, BI, and MIS at a large multinational education company. During that time, I led projects using applied statistics, data mining, and algorithmic forecasting—and discovered a real passion for data science. But the chance to deepen those skills long-term wasn’t there, so I made the leap into freelancing, motivated by clarity about my goals and a desire for more hands-on, impactful work.

One of my first projects was with a startup language app—now a common genre, but at the time still rapidly evolving. These apps rely on gamified exercises, engaging features, and clever design to drive user retention and learning outcomes. Back then, much of that had to be figured out from scratch.

My client, still in the early stages, was looking to boost user growth with tactical feature launches tied to measurable impact. I was brought in to run a time-limited experiment: would a prototype for more interactive exercises lead to better word retention for new vocabulary? The results would inform whether to launch or revisit the design.

Disclaimer #

I’m grateful to that client for agreeing to partially lift our NDA so I could share this work. Enough time has passed—and the tech has evolved enough—that the study no longer holds commercial value. The company remains anonymous, and no personal or sensitive data is included here. The original dataset is not publicly available.

On the Product and Business Goals #

At the time of the project, the product used push notifications to prompt users to complete regular word retention exercises. These took the form of Flashcards, displaying a word in either the target or home (usually native) language. Users could tap buttons to flip the card, confirm memorisation, or skip. This was before gesture-based navigation was common on Android, so buttons were still the primary interface.

The new feature was more interactive—closer to a game. Users saw a sentence in the target language with a blank and had to choose the correct word to fill it from a set of options. The idea was that selecting the right word in context shows deeper understanding. Unlike Flashcards, which rely on self-reported memorisation, this approach generates more objective data on user progress. As a result, it could improve the accuracy of learning personalisation over time.

Strategically, the client was keen on Fill-in-the-Blank exercises for their potential to yield richer formative data. But they came with a trade-off: they were more cognitively demanding and time-consuming, raising concerns about lower engagement and increased churn.

Problem Definition #

The experiment compared two types of exercises—a simple two-level factor:

  1. Flashcards (control group)
  2. Fill-in-the-Blank (test group)

The primary goal was to assess whether these exercises led to different word retention scores, measured through separate weekly in-app vocabulary quizzes. We also aimed to evaluate differences in user engagement and churn between the two groups.

Experimental Hypotheses #

Word Retention #

For word retention, the null hypothesis (H0) is that mean weekly word retention is the same for both exercise types; the alternative (H1) is that it differs. Here, the hoped-for outcome is that H0 can be rejected.

Churn Rate #

For churn, the null hypothesis (H0) is that the churn rate is the same for both exercise types; the alternative (H1) is that it differs. Here, the hoped-for outcome is that H0 cannot be rejected.

Confounding Factors #

While designing the experiment, we quickly identified several potential confounding factors:

  1. Latent user ability, which could plausibly influence engagement, retention, and churn all at once.
  2. Engagement, which might differ between the two exercise types and thereby affect retention indirectly.
  3. Churn, which could remove struggling users from one group at a higher rate and so bias the group averages (survivor bias).

Hypothetical User Model #

To clarify some of our thinking about confounding factors, together with the client we developed a more sophisticated hypothesis about the existence of latent and causal factors. This is represented in the diagram below.

We developed a hypothetical model of the user, assuming each had a latent, unobserved ability. While not directly measurable, this ability could influence how engaged users were, how easily they retained vocabulary, and their likelihood of churn. Users with lower latent ability would likely face more difficulty and need greater resilience to stick with language learning.
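For readers who want to reproduce the diagram, the structure of this user model can be written down as a DAG specification. The sketch below mirrors the appendix code using the ggdag package, with the exercise type node shortened to ExerciseType to give a valid variable name:

library(ggdag)

# Hypothetical user model: latent Ability drives engagement, retention and churn,
# while the treatment (ExerciseType) affects engagement and retention directly.
user_dag <- dagify(
  Engagement ~ Ability + ExerciseType,
  Retention  ~ Engagement + Ability + ExerciseType,
  Churn      ~ Retention + Engagement + Ability,
  latent   = "Ability",
  exposure = "ExerciseType",
  outcome  = "Churn"
)

ggdag(user_dag, text = FALSE, use_labels = "name") + theme_dag()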

As a startup focused on rapid growth, the client was especially wary of anything that might increase churn. Even a short-term experiment carrying that risk was a concern. Minimising this risk—while still running a meaningful test—was a key part of my role, and heavily influenced both the experimental design and the analytical approach.

Experimental Design #

Given the confounding factors, a simple A/B test followed by a t-test at the end of each week wouldn’t suffice. That approach would overlook two key sources of bias:

  1. Indirect effects: Fill-in-the-Blank exercises could affect retention scores indirectly by altering engagement levels.
  2. Survivor bias: If lower-performing users churned at higher rates due to the added effort required, this would skew group averages.

To address this, I proposed a set of guiding principles to structure the experiment in a way that would let us assess these effects in a single pass:

  1. Random assignment: To control for latent ability and user preferences, users were randomly assigned to one exercise type.
  2. Consistent exposure: Each group remained on the same activity for the full duration to allow us to detect differences in engagement and churn over time.
  3. Limited duration: The experiment would run for four weeks to reduce the risk of users churning simply due to prolonged exposure to a potentially less enjoyable feature.
  4. Minimised exposure: We limited the number of users in the study to further reduce potential churn impact during this critical growth phase.

Sample Size Calculation #

The client agreed on a minimum detectable effect (MDE) of 2 additional words retained per week. This would translate to a gain of roughly 104 words annually—modest, but meaningful—on top of a baseline of about 20 words per week (or 1,040 per year). Naturally, they were hoping for more, but this was set as the threshold for meaningful improvement.

Historical data showed a standard deviation of 4.87 in weekly word retention scores, which provided a useful input for estimating sample size. However, we also had to factor in churn: around 25% of users didn’t engage with the app for at least one week during any four-week period. To ensure we had sufficient completions to maintain 80% statistical power, I adjusted for this dropout risk.

I used the short-hand formula

$$N \approx \left(\frac{8\sigma}{\Delta}\right)^2$$

to estimate the total sample size, which gave me

$$\left(\frac{8 \times 4.87}{2}\right)^2 \approx 380$$

This gave us a quick, practical estimate to review feasibility with the client.

Later, I validated the result using pwr.t.test from the pwr package in R and adjusted for churn by inflating the required sample size, using a conservative adjustment based on the churn rate plus two standard deviations (via $\sigma^2 = p(1-p)$).

mde <- 2
pilot_sd_retention <- 4.87
power <- 0.8          # Desired power (80%)
alpha <- 0.05         # Significance level
four_week_churn_rate <- 0.25
sd_four_week_churn_rate <- sqrt(four_week_churn_rate * (1 - four_week_churn_rate))

# calculates the required sample size per group
sample_size <- pwr.t.test(
  d = mde / pilot_sd_retention,
  power = power,
  sig.level = alpha,
  type = "two.sample"
)$n

adjusted_sample_size = ceiling((1 + four_week_churn_rate + 2 * sd_four_week_churn_rate) * sample_size)

The final result was:

## Recommended sample size per group:  200

I don’t know why, but it was weirdly satisfying that such a nice round number popped out by pure chance.

Participation in the Trial and Managing Churn #

Although the sample size was padded to account for churn, we still needed to monitor it closely during the trial. Crucially, we couldn’t exclude users who dropped out—doing so would introduce survivor bias. Users had to be entered into the trial at random, and their outcomes included, regardless of whether they completed the full period.

From a larger pool of about 10,000 users, we randomly selected 400 using R’s random number generator and assigned them to one of the two groups. For exactly four weeks, each participant received two push notifications per week, prompting them to complete a vocabulary retention exercise based on words previously encountered in the app.
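As a minimal sketch of that selection and assignment step (the pool size, seed, and user IDs here are placeholders, not the client's data):

set.seed(42)  # hypothetical seed, for reproducibility only

user_pool <- sprintf("user_%05d", 1:10000)    # stand-in IDs for the ~10,000 active users
trial_users <- sample(user_pool, size = 400)  # draw 400 participants at random

# split the participants evenly and randomly between the two exercise types
assignment <- data.frame(
  user_id = trial_users,
  exercise_type = sample(rep(c("Flashcards", "Fill-Blanks"), each = 200))
)
table(assignment$exercise_type)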

Each user was assigned a set of metrics (a sketch of how the derived fields might be computed follows this list):

  1. exercise_type (treatment factor): “Fill-Blanks” or “Flashcards” — the only exercise type they saw during the trial.
  2. churn_week (int): The first week in which the user didn’t use the app at all; if they were active all four weeks, this was set to 5.
  3. churn (factor): “churned” if churn_week < 5; otherwise, “completed trial”.
  4. engagement (factor): “High” if the user completed at least one exercise that week, “Low” otherwise. If a user churned, engagement was set to “Low” for that week and all subsequent ones.
  5. retention (int): The number of words correctly retained in the weekly quiz. If the user churned, this was set to NA for that and all future weeks.
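A minimal sketch of how these derived fields could be computed from a weekly activity table. The table and column names here (weekly_data, exercises_completed, quiz_score) are hypothetical stand-ins for whatever the developers actually logged:

library(dplyr)

weekly <- weekly_data %>%
  mutate(
    # churned if the first inactive week falls within the four-week trial
    churn = factor(ifelse(churn_week < 5, "churned", "completed trial")),
    # engagement is "Low" from the churn week onwards, regardless of prior activity
    engagement = factor(ifelse(week < churn_week & exercises_completed > 0, "High", "Low")),
    # retention is missing from the churn week onwards
    retention = ifelse(week < churn_week, quiz_score, NA)
  )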

Interlude #

While I was agreeing these parameters with the client and their team, the developers were preparing the A/B test framework. They confirmed that once a user was allocated to a test group, they would only see one type of exercise for the four-week duration, and the requested data would be collected weekly. All other users would continue with business as usual.

There was nothing left to do but run the trial. I took a short SCUBA diving break and then worked with another client for a couple of weeks.

Results Analysis Part 1 - Preliminaries #

Immediately after the test period completed, I collated the data for an initial review.

##    engagement       retention    
##  Min.   :0.0000   Min.   : 7.00  
##  1st Qu.:0.0000   1st Qu.:19.00  
##  Median :1.0000   Median :22.00  
##  Mean   :0.7344   Mean   :22.19  
##  3rd Qu.:1.0000   3rd Qu.:25.00  
##  Max.   :1.0000   Max.   :38.00  
##                   NA's   :223

Here we can see that the grand mean engagement over all users and all weeks was 0.73438, while the mean weekly word retention was 22.19317. The number of NAs in this result set looks high, but there are duplicates: a user who churned in week 1 would present as four NAs. The actual number of churned users was counted separately, but can also be determined by looking at the fourth-week results alone. This was found to be 90, which is 0.225 of the total N of 400 (both groups combined) and slightly below our pre-study expectation of 0.25.

Summary Statistics by Treatment/Control and Weeks #

I tabulated all the weekly metrics that we had collected and summarised their means and standard deviations, as shown here.

## # A tibble: 8 × 6
##   exercise_type  week retention retention_sd engagement engagement_sd
##   <fct>         <int>     <dbl>        <dbl>      <dbl>         <dbl>
## 1 Flashcards        1      21.2         4.62      0.83          0.377
## 2 Flashcards        2      21.7         4.09      0.875         0.332
## 3 Flashcards        3      21.3         4.17      0.835         0.372
## 4 Flashcards        4      22.1         4.14      0.79          0.408
## 5 Fill-Blanks       1      21.8         4.99      0.655         0.477
## 6 Fill-Blanks       2      22.7         5.06      0.64          0.481
## 7 Fill-Blanks       3      22.9         5.22      0.615         0.488
## 8 Fill-Blanks       4      24.2         4.91      0.635         0.483

Observed Four Week Churn Rates #

Aside from the retention and engagement means, we also took the opportunity to check the churn rate. In particular, the four week churn rate was of interest.

## # A tibble: 2 × 3
##   exercise_type  week churn_rate
##   <fct>         <int>      <dbl>
## 1 Flashcards        4      0.205
## 2 Fill-Blanks       4      0.245

The rate for Fill-Blanks was in line with the prior expectations but the rate for Flashcards was a bit lower. Was this significant? I did a quick log odds ratio test to find out.
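The test is a one-liner with the vcd package. This sketch mirrors the appendix code, where churn_ratio is the per-user table of exercise type and churn status:

library(vcd)

# log odds ratio of churning for Fill-Blanks relative to Flashcards
lr <- loddsratio(churn ~ exercise_type, data = churn_ratio)
summary(lr)                    # z test of the log odds ratio
fourfold(table(churn_ratio))   # fourfold display with confidence rings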

## 
## z test of coefficients:
## 
##                                                Estimate Std. Error z value
## completed trial:churned/Fill-Blanks:Flashcards -0.22987    0.24023 -0.9569
##                                                Pr(>|z|)
## completed trial:churned/Fill-Blanks:Flashcards   0.3386

The large p-value indicated that there was no evidence of a significant difference. I plotted the counts to double-check.

The non-finding was backed up by the fourfold plot, which showed overlapping confidence rings on the quarters. The difference in churn rates was entirely within the margin of error. Nevertheless, I chose to reserve my final judgement until I had a chance to look more deeply at the other results.

Naive T-Test #

Sometimes, the only required analysis for an A/B test is a t-test to check for a statistical difference in means between groups, assuming everything has gone well with the experiment. We knew that wasn’t the case here but it’s always an informative measure.

I started by taking the mean weekly retention score per user, provided they didn’t churn in the first week of the study. This was calculated by summing their individual weekly scores and dividing by the number of weeks they remained unchurned.
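A sketch of that aggregation on the weekly table described earlier (the published analysis stored the result in the per-user selected_churn table):

library(dplyr)

per_user_retention <- selected_users %>%
  group_by(user_id, exercise_type) %>%
  summarise(retention = mean(retention, na.rm = TRUE), .groups = "drop") %>%
  # users who churned in week 1 have only NA scores, so their mean is NaN; drop them
  filter(!is.nan(retention))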

## Non-churned grand mean retention:  22.17154

Then I ran a Welch two-sample t-test to get an initial intuition of whether the whole endeavour had yielded a statistically interesting result.

## 
## 	Welch Two Sample t-test
## 
## data:  fill_blanks and flash_cards
## t = 4.5537, df = 360.61, p-value = 7.219e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7751163 1.9534842
## sample estimates:
## mean of Fill-Blanks  mean of Flashcards 
##            22.86821            21.50391

These results suggested a statistically significant difference in means, with a 95% confidence interval of 0.77512 to 1.95348 additional words retained per week, in favour of users who were given Fill-Blanks exercises over users who were given Flashcards. I noted that the upper bound was below the minimum detectable effect of 2 that we had agreed before starting, but it was too early to be disappointed. The possible presence of confounding factors required a deeper analysis.

Results Analysis Part 2 - Engagement #

One of my first priorities was to verify some of our initial assumptions. Did the level of engagement depend on the exercise itself? Also, over the weeks of the study, did the users show a growing or diminishing level of engagement? This information would shine a light on our hypothetical user model and reveal any important interactions.

Analysis of Variance #

The quickest way to check is with an ANOVA test.
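The call behind the table below is a standard two-way ANOVA of weekly engagement on week, exercise type, and their interaction (as in the appendix):

summary(aov(engagement ~ week * exercise_type, data = selected_users, na.action = na.omit))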

##                      Df Sum Sq Mean Sq F value Pr(>F)    
## week                  1   0.30   0.300   1.616  0.204    
## exercise_type         1  15.41  15.406  82.960 <2e-16 ***
## week:exercise_type    1   0.03   0.028   0.151  0.697    
## Residuals          1596 296.38   0.186                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to this result, there is a statistically significant effect of exercise type on engagement, but no detectable effect of week and no interaction between the two.

Mosaic Visualisation and Significance Tests #

A visual analysis can help to better understand what is happening.

I’m a strong advocate for mosaic plots when working with count data like this. I’ve taught workshops on them because, while they can be unfamiliar at first, they’re incredibly revealing once you know how to read them.

The plot recursively divides a canvas by the specified categorical factors, with tile area proportional to counts at each intersection. If the factors are independent, the result is a neat grid. The more skewed or uneven the tiles, the stronger the interactions between variables. Tiles are shaded by Pearson residuals from a χ² test of independence — essentially embedding a statistical test into the visualisation. Blue tiles (positive residuals) mark over-represented combinations; red tiles (negative residuals), under-represented ones.
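Producing the plot takes a single call to vcd::mosaic with shading switched on. This sketch mirrors the appendix code, recoding the 0/1 engagement indicator as a Low/High factor first:

library(vcd)
library(dplyr)

mosaic(
  table(
    selected_users %>%
      mutate(engagement = factor(engagement, labels = c("Low", "High"))) %>%
      dplyr::select(exercise_type, week, engagement)
  ),
  shade = TRUE  # tiles coloured by Pearson residuals from the chi-squared test of independence
)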

What did this analysis reveal about user engagement?

As expected, the distribution across exercise type (Flashcards vs. Fill-Blanks) and week was even, since users were assigned randomly and engagement was recorded weekly regardless of churn. The striking difference appeared in engagement levels: Low engagement was clearly over-represented among Fill-Blanks users.

The implication was straightforward — users were more likely to complete Flashcard exercises. This aligned with expectations: Flashcards are quicker and cognitively lighter — users can simply tap through and self-confirm memorisation. The pattern held consistently over time, suggesting the effect was inherent to the mechanics of the exercises rather than a temporary novelty or fatigue effect.

Causal Analysis #

The key question was this: How much did lower engagement with Fill-in-the-Blank exercises affect overall word retention?

The client and I had anticipated that these exercises might see lower engagement — or even higher churn — which would be counterproductive, especially if it masked what could otherwise be better learning outcomes. To understand this dynamic, I needed to go beyond basic group comparisons and look into causal pathways.

I was already somewhat familiar with structural equation modelling (SEM) from past work on student surveys, but this problem—where an independent variable (exercise type) may influence the response (retention) indirectly through a mediator (engagement)—called for a more targeted approach.

While researching, I came across the mediation package in R, which seemed tailor-made for this use case. To build confidence in the results, I ran two parallel analyses: one using the mediation package, and another using a more traditional SEM approach, allowing me to compare the outputs and validate the findings.

Mediation Analysis #
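The analysis pairs a mediator model (engagement on exercise type) with an outcome model (retention on exercise type and engagement) and passes both to mediate(). This sketch mirrors the appendix code, including the original, rather low, number of bootstrap simulations; selected_users_dropna is the weekly data with the retention NAs removed:

library(mediation)

# mediator model: does the treatment shift engagement?
med_model <- lm(engagement ~ exercise_type, data = selected_users_dropna)
# outcome model: retention as a function of treatment and mediator
outcome_model <- lm(retention ~ exercise_type + engagement, data = selected_users_dropna)

mediation_result <- mediate(med_model, outcome_model,
                            treat = "exercise_type", mediator = "engagement",
                            boot = TRUE, sims = 10)
summary(mediation_result)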

## 
## Causal Mediation Analysis 
## 
## Nonparametric Bootstrap Confidence Intervals with the Percentile Method
## 
##                Estimate 95% CI Lower 95% CI Upper p-value    
## ACME             -0.219       -0.343        -0.09  <2e-16 ***
## ADE               1.513        1.023         2.01  <2e-16 ***
## Total Effect      1.294        0.808         1.77  <2e-16 ***
## Prop. Mediated   -0.169       -0.304        -0.06  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sample Size Used: 1377 
## 
## 
## Simulations: 10

This model does not take into account the repeated measures occurring each week. Although the earlier ANOVA test and mosaic plot confirmed that this wasn’t a particular concern here, I wanted to be very cautious when interpreting the results numerically.

The summary above shows a significant result for Average Causal Mediation Effect (ACME) of -0.21886. Any significant result here is evidence to reject a null hypothesis of no indirect effect. The ACME has the opposite sign of the Average Direct Effect (ADE) at 1.51281, which suggests that the total effect is less than it otherwise would have been.

I checked alignment with my preliminary result (from the earlier t-test) and found that the total effect of 1.29395 is close to the mean difference between groups of 1.3643. So this causal analysis strongly suggests that the mean difference would have been larger by about 0.21886 (or some quantity within the ACME confidence interval shown in the summary above) had engagement not been suppressed by the Fill-Blanks treatment.

Structural Equation Model #

I had used SEM to model latent factors before but had not tried using it to model a mediation effect. In any case, the approach is much the same, except that there are no latent factors here, only manifest variables. I conducted the analysis in much the same way as I usually would.
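In lavaan syntax the mediation structure is just two regressions. This sketch mirrors the appendix code:

library(lavaan)
library(semPlot)

sem_model <- '
  engagement ~ exercise_type               # treatment -> mediator (the "a" path)
  retention ~ engagement + exercise_type   # mediator -> outcome ("b" path) plus direct effect
'

fit <- sem(sem_model, data = selected_users_dropna)
summary(fit, standardized = TRUE)

# path diagram of the fitted model
semPaths(fit, what = "est", fade = FALSE, residuals = FALSE, edge.label.cex = 0.75)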

## lavaan 0.6-19 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##   Number of observations                          1377
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   engagement ~                                                          
##     exercise_type    -0.187    0.018  -10.179    0.000   -0.187   -0.265
##   retention ~                                                           
##     engagement        1.169    0.369    3.164    0.002    1.169    0.087
##     exercise_type     1.513    0.262    5.785    0.000    1.513    0.160
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .engagement        0.116    0.004   26.239    0.000    0.116    0.930
##    .retention        21.885    0.834   26.239    0.000   21.885    0.974

Here we see results that align very well with the mediation method: a direct effect of exercise_type on retention of 1.51281, and an effect of exercise_type on engagement of -0.18724. Multiplying the latter by the engagement coefficient on retention (1.169) recovers the indirect effect of roughly -0.219 reported by the mediation package.

It’s also possible to plot the SEM, which is why I wanted to run the analysis this way.

This was a very nice, simple visual for explaining to the client that Fill-Blanks has a downward effect on engagement, which must have a dampening effect on any increases in retention for users in the Fill-Blanks group.

Results Analysis Part 3 - Word Retention #

This stage of the analysis required the use of a linear mixed model (LMM) for one simple reason: given that we were tracking the users over the course of four weeks, with one overall retention measure per week, the experimental design involves repeated measures, which violates the independence assumption of an OLS linear model. I proceeded by working through a hierarchy of models, increasing the number of interaction terms, to determine the best-fitting model.

Linear Mixed Model Selection and Analysis #

As LMMs are quite a bit more complex than linear models, there is a lot more to their summaries and consequently more console output. For brevity, I only show the summary of the best-fitting model here, after the ANOVA test used to select it.
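The three candidate models are fitted by maximum likelihood (REML = FALSE) so that their deviances can be compared with anova(). This mirrors the appendix code:

library(lme4)
library(lmerTest)

# random intercept per user stands in for latent ability; fixed effects grow in complexity
mm1 <- lmer(retention ~ (1 | user_id) + engagement + exercise_type + week, data = selected_users, REML = FALSE)
mm2 <- lmer(retention ~ (1 | user_id) + engagement + exercise_type * week, data = selected_users, REML = FALSE)
mm3 <- lmer(retention ~ (1 | user_id) + engagement * exercise_type * week, data = selected_users, REML = FALSE)

anova(mm1, mm2, mm3)  # likelihood ratio tests between the nested models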

## Data: selected_users
## Models:
## mm1: retention ~ (1 | user_id) + engagement + exercise_type + week
## mm2: retention ~ (1 | user_id) + engagement + exercise_type * week
## mm3: retention ~ (1 | user_id) + engagement * exercise_type * week
##     npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## mm1    6 8134.6 8166.0 -4061.3   8122.6                       
## mm2    7 8130.4 8167.0 -4058.2   8116.4 6.2290  1    0.01257 *
## mm3   10 8134.4 8186.7 -4057.2   8114.4 1.9689  3    0.57889  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Above you can see the model definitions. The notation shows that user_id has been set as a grouping variable with different intercepts but not different slopes. This means that the model’s intercept is modulated by each user’s random effect (their latent ability, in our user model).

Model mm2, with an interaction term between exercise_type and week, is found to be a significantly better fit to the data than model mm1. The additional interaction terms in model mm3 do nothing to improve on mm2. Therefore, mm2 was selected as the best-fitting model. The model summary and diagnostics are shown next.

## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: retention ~ (1 | user_id) + engagement + exercise_type * week
##    Data: selected_users
## 
##      AIC      BIC   logLik deviance df.resid 
##   8130.4   8167.0  -4058.2   8116.4     1370 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.3654 -0.6649 -0.0156  0.6648  3.1411 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  user_id  (Intercept)  2.401   1.55    
##  Residual             19.184   4.38    
## Number of obs: 1377, groups:  user_id, 376
## 
## Fixed effects:
##                                Estimate Std. Error        df t value Pr(>|t|)
## (Intercept)                     20.3160     0.5138 1367.1255  39.537   <2e-16
## engagement                       0.8852     0.3658 1356.5889   2.420   0.0157
## exercise_typeFill-Blanks         0.1858     0.5914 1370.9265   0.314   0.7535
## week                             0.1671     0.1497 1066.8353   1.116   0.2645
## exercise_typeFill-Blanks:week    0.5330     0.2132 1063.9878   2.500   0.0126
##                                  
## (Intercept)                   ***
## engagement                    *  
## exercise_typeFill-Blanks         
## week                             
## exercise_typeFill-Blanks:week *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) enggmn ex_F-B week  
## engagement  -0.601                     
## exrcs_tyF-B -0.622  0.111              
## week        -0.634 -0.101  0.592       
## exrcs_tF-B:  0.484  0.006 -0.867 -0.696

Note that there are 376 groups (user_id), which accounts for all users who didn’t churn in the first week. There are 1377 observations overall, reflecting the NAs from users who churned in a later week and so contributed at least one but fewer than four observations to the study.

Model Diagnostics #

The model converged without any problems and the scaled residuals are centred on zero. This is also visible in the residual plots. The qqplot looked perfectly normal (no pun intended), and the random effects plot showed symmetry around zero, suggesting a normal distribution for the latent ability factor, as one would expect.

## $user_id

Sense-Making the Linear Predictor for Word Retention #

The model coefficients, as given in the summary, are significant for engagement and for the exercise_type:week interaction, but notably not for the exercise_type main effect (nor for week on its own). This indicated to me that there is a significant difference between the two exercises, but it does not become apparent until the later weeks of the study. I simplified the model accordingly by removing the non-significant main effect and used the ANOVA test to ensure that my simpler model was just as effective in explaining the variance in the retention scores.

## Data: selected_users
## Models:
## mma: retention ~ (1 | user_id) + engagement + exercise_type:week
## mm2: retention ~ (1 | user_id) + engagement + exercise_type * week
##     npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## mma    6 8128.5 8159.9 -4058.3   8116.5                     
## mm2    7 8130.4 8167.0 -4058.2   8116.4 0.0986  1     0.7535

The Simplified Model and Easy Explanations #

The parameter set of the simplified model was much easier to explain to the client.

##                                 Estimate     Pr(>|t|)
## (Intercept)                   20.4163911 < 2e-16 *** 
## engagement                     0.8723426 0.0165 *    
## exercise_typeFlashcards:week   0.1393055 0.2486      
## exercise_typeFill-Blanks:week  0.7302830 1.19e-09 ***

The interpretation of these results is as follows:

  1. The baseline (intercept) retention is around 20.4 words per week, modulated by each user’s random effect (their latent ability).
  2. In weeks where a user engaged with the exercises, retention was roughly 0.87 words higher.
  3. Users in the Flashcards group improved by about 0.14 words per week over the trial, an effect not statistically distinguishable from zero.
  4. Users in the Fill-Blanks group improved by about 0.73 words per week, a highly significant trend that accumulates over the four weeks.
  5. On top of this, the lower engagement in the Fill-Blanks group will have partially masked that group’s gains, as indicated by the earlier mediation analysis.

This plain English explanation made the most sense to my client and their team. I did not attempt to fully quantify the last point because it was the output of a separate modelling procedure. I did explain, however, that we should really expect a different engagement effect for each exercise type, but the LMM was not a suitable tool for discovering it because it does not model relationships between the independent variables; rather, such relationships introduce bias into the estimates.

A Mildly Shocking Oversight #

Going back to the experimental design parameters, I suddenly realised that our sample size calculation was flawed. The difference in retention scores over the four weeks is statistically significant to at least the 95% confidence level, and the MDE of 2 was the effect size the sample had been powered (at 80%) to detect.

Unfortunately, I had made an error with that calculation. The client had sought confirmation (with power 0.8) of their MDE of 2 words per week, which accumulates to a total of 8 over the four weeks of the trial! Powering for that larger effect would have resulted in a drastically smaller sample size. All things considered, this issue was not a material cause for concern in the end, because a much smaller sample could easily have been thrown off by an unusually high churn week, for example. Also, the numbers were still very small with respect to the total number of users. I can laugh about it now, but I was quite embarrassed at the time.

Results Analysis Part 4 - Churn Rate #

I was still tasked with identifying whether the extra effort of completing a Fill-Blanks exercise would have an adverse effect on churn rate. So far, I had some evidence from the four-week churn rate: for the Fill-Blanks group it stayed in line with prior expectations, and even though the Flashcards control group did appear a little lower, the non-significant log odds ratio test reduced any major concern. Nevertheless, I wanted to complete the analysis that I had planned and that data had been collected for.

I proceeded with a non-parametric survival analysis using the per-user data described above. This contains the churn week (or 5 if they made it to the end of the study, as the majority did), a boolean churn event that is TRUE for those who did churn, a weekly average retention for the weeks they participated, and, of course, the exercise type.

Between Groups Log-Rank Test #

It’s easy to compare survival distributions between groups using the log-rank test.
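A sketch of the setup, mirroring the appendix code; the event indicator is assumed to be TRUE for users who churned:

library(survival)

# time-to-event object: churn_week is the week of churn, or 5 if the user completed the trial
surv_obj <- Surv(time = selected_churn$churn_week, event = selected_churn$churn)

# log-rank test comparing the survival curves of the two exercise groups
survdiff(surv_obj ~ exercise_type, data = selected_churn)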

## Call:
## survdiff(formula = surv_obj ~ exercise_type, data = selected_churn)
## 
##                             N Observed Expected (O-E)^2/E (O-E)^2/V
## exercise_type=Flashcards  200       41     45.8     0.497      1.08
## exercise_type=Fill-Blanks 200       49     44.2     0.514      1.08
## 
##  Chisq= 1.1  on 1 degrees of freedom, p= 0.3

The test shows that there is no significant difference between the two groups’ survival. However, this method does not control for retention, which is a concrete measure of the user’s learning progress. Our hypothetical user model, based on intuition and experience in the sector, suggested the possibility that users who felt they were under-performing might be more likely to churn. So I went ahead and checked that as well.

Cox’s Proportional Hazards Model With A Continuous Variable #

I’ve discussed Proportional Hazard (PH) models in a lot more detail in this post, so I will skip over the intricacies and just say that if we find a high value for the hazard at any given moment, the survival rate is falling fast. Equivalently, we might say that the churn rate is going up. It’s usual to report on the log hazard and check for values that are greater than zero with statistical significance.

For continuous independent variables, it can be useful to fit a penalised smoothing spline in place of the raw data. The model then reports on the linear and non-linear parts of the term separately.

## Call:
## coxph(formula = surv_obj ~ exercise_type + pspline(retention, 
##     df = 2), data = selected_churn)
## 
##                              coef se(coef)     se2   Chisq   DF       p
## exercise_typeFill-Blanks   0.0366   0.2561  0.2559  0.0205 1.00    0.89
## pspline(retention, df = 2 -0.0490   0.0362  0.0362  1.8377 1.00    0.18
## pspline(retention, df = 2                          16.6278 1.05 5.1e-05
## 
## Iterations: 3 outer, 11 Newton-Raphson
##      Theta= 0.813 
## Degrees of freedom for terms= 1.0 2.1 
## Likelihood ratio test=21.1  on 3.05 df, p=1e-04
## n= 376, number of events= 66 
##    (24 observations deleted due to missingness)

Notice here that the number of observations used in the model is 376, just as we saw with the LMM. Given that there were 400 individuals to begin with, this means 24 individuals churned in week one and contributed no retention data, so they could not be included in this model. I also noticed that only the non-linear part of the retention spline returned a significant result.

The PH terms can be plotted, which sometimes helps with a non-intuitive analysis as we have here.

We see from these plots that the partial log hazard for the exercise types is close to zero, as the modelling suggests. The retention curve does appear to show a significant divergence from zero, certainly at the lower end. This aligned with the ideas we had in creating the user model at the beginning of the project - the mechanism being that users who are doing less well at memorising vocabulary are more likely to get frustrated and stop returning to the app for many days at a time, or for good.

Survival Model of Churn #

As a final sense check, I created a survival model. Think of it as the reverse of churn: the survival probability $p$ and the churn probability $1 - p$ are directly complementary, so the greater the survival rate, the lower the churn rate. The survival plot provided me with the ideal way to communicate my churn rate analysis to the client because it is such an easy-to-understand visual.
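The plot was produced with survminer's ggsurvplot; this sketch mirrors the appendix code:

library(survminer)

surv_fit <- survfit(surv_obj ~ exercise_type, data = selected_churn)
ggsurvplot(surv_fit, data = selected_churn, pval = TRUE, conf.int = TRUE,
           legend.labs = c("Flashcards", "Fill-Blanks"),
           xlab = "Weeks", ylab = "Survival Probability",
           ylim = c(0.7, 1.0))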

Certainly there is the appearance of a slightly lower survival rate for the Fill-Blanks group but it is very small and well within the margin of error. Happily, we could pretty much rule out the new Fill-Blanks exercise as a significant cause of increased churn risk.

Summary #

What began as a simple A/B test for a new vocabulary exercise evolved into a more complex analysis once we examined the client’s goals and the dynamics of the product. A basic t-test wouldn’t have captured the nuanced interactions at play.

By developing a user model with latent ability as a key concept, we identified several important confounding and mediating variables—particularly engagement and churn—that needed to be accounted for. This led to a multi-variate analysis approach, including a linear mixed model to handle repeated measures per user and individual-level variance.

To explore the causal pathway between exercise type, engagement, and retention, I used two methods: the mediation package and structural equation modelling (SEM). While I didn’t quantify the indirect effect precisely, both analyses helped clarify that engagement did mediate the impact of the exercise format on retention — albeit to a modest degree.

Key findings:

  1. Fill-in-the-Blank exercises were associated with a growing advantage in word retention of roughly 0.73 words per week, compared with around 0.14 words per week (not significant) for Flashcards.
  2. Engagement was significantly lower in the Fill-Blanks group, which partially masked that group’s retention gains (a negative mediation effect of around 0.22 words).
  3. Churn was slightly higher for the Fill-Blanks group, but the difference was not statistically significant by log odds ratio, log-rank test, or proportional hazards model.
  4. Lower retention itself appeared to be associated with a higher risk of churn, consistent with our hypothetical user model.

Conclusions #

This project was a powerful reminder that data science is about more than running models — it’s about framing the right questions and understanding the context in which data is generated.

The client’s openness to a deeper analysis allowed us to uncover insights that went beyond surface-level metrics. Although the headline result wasn’t transformative in terms of retention gains, the finding that Fill-Blanks exercises didn’t drive up churn gave the client confidence to move forward with the launch. The feature offered strategic advantages: better formative assessment data and continued product evolution to keep users engaged.

For me, this work reinforced a key lesson: deep domain understanding, clear causal thinking, and honest evaluation of trade-offs are what transform a data project from informative to actionable. It was a rewarding collaboration—and a great example of the kind of analytical thinking I always aim to bring to client work.

Appendix #

Here you can find the source code.

library(knitr)
library(tidyr)
library(dplyr)
library(survival)
library(survminer)
library(pwr)
library(vcd)
library(lme4)
library(lmerTest)
library(lattice)   # provides qqmath() and dotplot() used in the LMM diagnostics
library(ggdag)
library(mediation)
library(lavaan)
library(semPlot)

opts_chunk$set(warning = FALSE
              , message = FALSE
              , echo = FALSE
              )

hook_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
  lines <- options$output.lines
  if (is.null(lines)) {
    return(hook_output(x, options))  # pass to default hook
  }
  x <- unlist(strsplit(x, "\n"))
  more <- "..."
  if (length(lines)==1) {        # first n lines
    if (length(x) > lines) {
      # truncate the output, but add ....
      x <- c(head(x, lines), more)
    }
  } else {
    x <- c(more, x[lines], more)
  }
  # paste these lines together
  x <- paste(c(x, ""), collapse = "\n")
  hook_output(x, options)
})

par(mar = c(4,3,3,1))
source("HeartTheme.R")
load("retention.RData")
selected_users <- as_tibble(selected_users) %>%
  mutate(exercise_type = relevel(exercise_type, "Flashcards")) # set the control group
dag <- dagify(
  Engagement ~ Ability + "Exercise Type",
  Retention ~ Engagement + Ability + "Exercise Type",
  Churn ~ Retention + Engagement + Ability,
  latent = "Ability",
  exposure = "Exercise Type",
  outcome = "Churn"
)

tidy_dag <- tidy_dagitty(dag, seed = 222, layout = "nicely")

ggdag(tidy_dag, text = FALSE, use_labels = "name") + theme_dag()
mde <- 2
pilot_sd_retention <- 4.87
power <- 0.8          # Desired power (80%)
alpha <- 0.05         # Significance level
four_week_churn_rate <- 0.25
sd_four_week_churn_rate <- sqrt(four_week_churn_rate * (1 - four_week_churn_rate))

# calculates the required sample size per group
sample_size <- pwr.t.test(
  d = mde / pilot_sd_retention,
  power = power,
  sig.level = alpha,
  type = "two.sample"
)$n

adjusted_sample_size = ceiling((1 + four_week_churn_rate + 2 * sd_four_week_churn_rate) * sample_size)
cat("Recommended sample size per group: ", adjusted_sample_size)
summary(dplyr::select(selected_users, engagement, retention))
mean_engagement <- round(mean(selected_users$engagement), 5)
mean_retention <- round(mean(selected_users$retention, na.rm = TRUE), 5)
na_retention <- sum(is.na(selected_users$retention))
na_fourth_week <- sum(is.na(matrix(selected_users$retention, 2 * adjusted_sample_size, 4)[, 4]))
selected_users %>% 
  rename(ret = retention, eng = engagement) %>%
  group_by(exercise_type, week) %>%
  summarise(
    retention = mean(ret, na.rm = TRUE), retention_sd = sd(ret, na.rm = TRUE),
    engagement = mean(eng, na.rm = TRUE), engagement_sd = sd(eng, na.rm = TRUE),
    .groups = "drop"
  )
selected_users %>% 
  rename(ret = retention) %>%
  group_by(exercise_type, week) %>%
  summarise(
    churn_rate = mean(is.na(ret)),
    .groups = "drop"
  ) %>%
  dplyr::filter(week == 4)
churn_ratio <- selected_churn %>%
      dplyr::select(exercise_type, churn) %>%
      mutate(churn = factor(churn, labels = c("completed trial", "churned")))

lr <- loddsratio(churn ~ exercise_type, data = churn_ratio)
summary(lr)
fourfold(table(churn_ratio))
cat("Non-churned grand mean retention: ", round(mean(selected_churn$retention, na.rm = TRUE), 5))
fill_blanks <- selected_churn[selected_churn$exercise_type == "Fill-Blanks", "retention"]
flash_cards <- selected_churn[selected_churn$exercise_type == "Flashcards", "retention"]
tt <- t.test(fill_blanks, flash_cards, var.equal = FALSE)  # Welch's t-test; NAs are dropped automatically

names(tt$estimate) <- paste("mean of", c("Fill-Blanks", "Flashcards"))
tt
summary(aov(engagement ~ week * exercise_type, data = selected_users, na.action = na.omit))
mosaic(
  table(
  selected_users %>%
    dplyr::mutate(engagement = factor(engagement, labels = c("Low", "High"))) %>%
    dplyr::select(exercise_type, week, engagement)
  ),
  shade = TRUE
)
# tidy up the data 
selected_users_dropna <- selected_users %>%
  filter(!is.na(retention))

set.seed(135)
med_model <- lm(engagement ~ exercise_type, data = selected_users_dropna)
outcome_model <- lm(retention ~ exercise_type + engagement, data = selected_users_dropna)
mediation_result <- mediate(med_model, outcome_model, 
                            treat = "exercise_type", mediator = "engagement", 
                            boot = TRUE, sims = 10)

summary_med <- summary(mediation_result)
summary_med
set.seed(135)
sem_model <- '
  engagement ~ exercise_type
  retention ~ engagement + exercise_type
'

# Fit SEM Model
fit <- sem(sem_model, data = selected_users_dropna)
summary_sem <- summary(fit, standardized = TRUE)
summary_sem
semPaths(fit, what="est", fade=FALSE, residuals=FALSE, edge.label.cex=0.75)
mm1 <- lmer(retention ~ (1 | user_id) + engagement + exercise_type + week, data = selected_users, REML = FALSE)
mm2 <- lmer(retention ~ (1 | user_id) + engagement + exercise_type * week, data = selected_users, REML = FALSE)
mm3 <- lmer(retention ~ (1 | user_id) + engagement * exercise_type * week, data = selected_users, REML = FALSE)
anova(mm1, mm2, mm3)
summary(mm2)
plot(mm2)
qqmath(resid(mm2))
dotplot(ranef(mm2, whichel = "user_id"), main = FALSE, scales = list(y = list(draw = FALSE)))
mma <- lmer(retention ~ (1 | user_id) + engagement + exercise_type:week, data = selected_users, REML = FALSE)
anova(mma, mm2)
coef_summary <- as.data.frame(coef(summary(mma))[,c("Estimate", "Pr(>|t|)")])
p_vals <- coef_summary[, 2]
stars <- ifelse(
  p_vals > 0.05,
  "",
  ifelse(
    p_vals > 0.01,
    " *",
    " ***"
  )
)
p_vals <- ifelse(
  p_vals < 2e-16,
  "< 2e-16",
  ifelse(
    p_vals < 0.01,
    formatC(p_vals, format = "e", digits = 2),
    as.character(round(p_vals, 4))
  )
)
p_vals <- format(paste0(p_vals, stars), justify = "left")
coef_summary$`Pr(>|t|)` <- p_vals


mma_intercept <- round(coef_summary["(Intercept)", "Estimate"], 5)
mma_intercept_sd <- round(sqrt(VarCorr(mma)[[1]][1]), 5)
mma_engagement <- round(coef_summary["engagement", "Estimate"], 5)
mma_fillblanks_week <- round(coef_summary["exercise_typeFill-Blanks:week", "Estimate"], 5)
mma_flashcards_week <- round(coef_summary["exercise_typeFlashcards:week", "Estimate"], 5)

coef_summary
selected_churn$exercise_type <- relevel(selected_churn$exercise_type, "Flashcards")
surv_obj <- Surv(time = selected_churn$churn_week, event = selected_churn$churn)
survdiff(formula = surv_obj ~ exercise_type, data = selected_churn)
cox_model <- coxph(formula = surv_obj ~ exercise_type + pspline(retention, df = 2), data = selected_churn)
cox_model
termplot(cox_model, se = TRUE, terms = 1, ylabs = "Log hazard", xlabs = "Exercise Type")
abline(h = 0, col="grey", lty=3)

termplot(cox_model, se = TRUE, terms = 2, ylabs = "Log hazard", xlabs = "Retention")
abline(h = 0, col="grey", lty=3)
surv_fit <- survfit(surv_obj ~ exercise_type, data = selected_churn)
ggsurvplot(surv_fit, data = selected_churn, pval = TRUE, conf.int = TRUE,
           legend.labs = c("Flashcards", "Fill-Blanks"),
           xlab = "Weeks", ylab = "Survival Probability",
           ylim = c(0.7, 1.0))