Analysing SaaS Trial to Subscriber Conversions - Part 3 - Time Dependent Variables

Julian Hatwell
Last Updated: Nov 13, 2023

Introduction #

You can read the Series Introduction here

In the previous post, we saw how survival curves can be created for different strata (the levels of categorical independent variables), giving us a way to determine whether there are significant differences in the median survival time (or another quantile) between groups. The groups are fixed throughout the trial period. In a clinical or randomized controlled trial, these would be set as part of the experimental protocol. In this observational setting, they are often customer segments, such as industry vertical, and other attributes that may not be under our control.

In this third and final post of this series, we use the Cox Proportional Hazards method to model the effect of variables that change over time. In this scenario, these variables are more likely to be things that are under our control, or things we can at least intervene on when we detect a desirable or undesirable effect.

For example, here we will see an analysis of cumulative customer logins, cumulative logged-in time, and the number of distinct core features used. These are factors we can potentially nudge through incentives or gamification (although it will be hard to make strong causal inferences about such complex interactions). We also see the effect of onboarding events and customer success outreach calls. These are very much under our control, both in how consistently our customer success team carries them out and in the content of the calls and interactions with the customer.
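Time-varying covariates such as cumulative logins are usually handed to a Cox model in counting-process form: one row per customer per interval `(start, stop]` during which the covariate values are constant. The post does not show its data preparation, so the following is only a minimal sketch of that layout. The `Interval` class, the `to_intervals` helper, and the sample customer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Interval:
    customer_id: str
    start: int              # interval opens just after this trial day
    stop: int               # interval closes at the end of this trial day
    cumulative_logins: int  # covariate value held throughout (start, stop]
    event: int              # 1 only on the row in which the event occurs


def to_intervals(customer_id: str, login_days: Iterable[int],
                 end_day: int, event_observed: bool) -> List[Interval]:
    """Split one customer's follow-up into (start, stop] rows, opening a
    new row each time the cumulative login count changes."""
    rows: List[Interval] = []
    start, logins = 0, 0
    for day in sorted(login_days):
        if day >= end_day:
            break  # logins on or after the final day are ignored in this sketch
        if day > start:
            # close the interval that ran with the previous login count
            rows.append(Interval(customer_id, start, day, logins, 0))
            start = day
        logins += 1
    # the last interval carries the event indicator (0 if censored)
    rows.append(Interval(customer_id, start, end_day, logins, int(event_observed)))
    return rows


# Hypothetical customer: two logins on day 2, one on day 5, event on day 9.
for row in to_intervals("cust_001", [2, 2, 5], end_day=9, event_observed=True):
    print(row)
```

Each row records the covariate value that was known at the start of its interval, which avoids letting a later login explain an earlier outcome.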

The proportional hazards model tells us how these time-varying covariates influence the instantaneous risk of customer churn at any given moment, using the value each covariate has accumulated up to that moment. Unlike the survival curves from our previous analysis, which showed us static group differences, the Cox model quantifies the dynamic relationship between customer engagement behaviors and churn risk as both evolve throughout the customer lifecycle.

Most importantly, the model provides hazard ratios that translate directly into actionable insights: for every additional core feature a customer uses, or each additional customer success call completed, we can quantify the percentage change in churn risk. This allows us to prioritize interventions based on their estimated impact and helps us understand not just whether these activities matter, but by how much - giving us the foundation for data-driven customer success strategies and resource allocation decisions.

Theory Behind the Cox Proportional Hazards Estimator #

The model estimates the coefficients $\beta_{1}, \beta_{2}, \dots, \beta_{P}$ for $P$ predictor variables, where the instantaneous hazard function for individual $i$ is:

$h_{i}(t) = h_{0}(t) \times e^{(\beta_{1} X_{1i} + \beta_{2} X_{2i} + \dots + \beta_{P} X_{Pi})} = h_{0}(t) \times e^{\beta^{T} X_{i}}$

The result is always interpreted as the ratio between the hazards of two individuals. The baseline hazard $h_{0}(t)$ cancels in that ratio, making it unnecessary to estimate it. For example, if we have just one variable, $X_{1}$, the cumulative logged-in time in hours, and customer A tallies up 20 hours more than customer B, then we have a hazard ratio of:

$\frac{h_{0}(t) \times e^{\beta_{1} \times 20}}{h_{0}(t) \times e^{\beta_{1} \times 0}} = e^{20 \beta_{1}}$

Here, the hazard ratio per additional hour is $e^{\beta_{1}}$: each additional hour logged relative to another individual multiplies the hazard by $e^{\beta_{1}}$, so a 20-hour difference multiplies it by $(e^{\beta_{1}})^{20} = e^{20 \beta_{1}}$.
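To make the arithmetic concrete, here is a small sketch of the hazard ratio calculation. The coefficient value `beta_1 = 0.01` is purely hypothetical and is not an estimate from this analysis. The `hazard_ratio` helper is likewise just an illustration.

```python
import math


def hazard_ratio(beta: float, delta_x: float) -> float:
    """Hazard ratio between two individuals whose covariate differs by
    delta_x; the baseline hazard h0(t) cancels out of the ratio."""
    return math.exp(beta * delta_x)


beta_1 = 0.01  # hypothetical coefficient per logged-in hour (an assumption)

hr_per_hour = hazard_ratio(beta_1, 1)   # multiplier for one extra hour
hr_20_hours = hazard_ratio(beta_1, 20)  # multiplier for twenty extra hours

print(f"per hour: {hr_per_hour:.4f}, per 20 hours: {hr_20_hours:.4f}")

# Per-hour multipliers compound: twenty one-hour steps equal one 20-hour step.
assert math.isclose(hr_20_hours, hr_per_hour ** 20)
```

The per-unit multiplier is $e^{\beta_{1}}$ rather than $\beta_{1}$ itself, which is why coefficients are usually exponentiated before they are reported.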

Modeling #

We proceed to fit a naïve model with all our variables. The time-independent customer variables are also included, and the results table shows only the variables that are significant at the 5% level.

variable	estimate	conf.low	conf.high	p.value	significance
homeware_industry	1.558641	1.234983	1.967121	0.0001861	***
content_channel	1.378438	1.075660	1.766443	0.0112022	*
high_initial_engagement	1.372458	1.017359	1.851501	0.0382071	*

We can see that the only significant variables here were already accounted for by the stratification analysis in our previous post.
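The estimates above can be read as percentage changes in the hazard. The sketch below assumes the `estimate`, `conf.low` and `conf.high` columns are already exponentiated hazard ratios. The post does not say so explicitly, but it is consistent with the reported p-values. The helper name `pct_change_in_hazard` is hypothetical.

```python
def pct_change_in_hazard(hazard_ratio: float) -> float:
    """Convert a hazard ratio into a percentage change in the hazard."""
    return (hazard_ratio - 1.0) * 100.0


# (estimate, conf.low, conf.high) copied from the results table above;
# assumed to be hazard ratios rather than raw coefficients.
results = {
    "homeware_industry": (1.558641, 1.234983, 1.967121),
    "content_channel": (1.378438, 1.075660, 1.766443),
    "high_initial_engagement": (1.372458, 1.017359, 1.851501),
}

for name, (est, low, high) in results.items():
    print(
        f"{name}: {pct_change_in_hazard(est):+.1f}% "
        f"(95% CI {pct_change_in_hazard(low):+.1f}% "
        f"to {pct_change_in_hazard(high):+.1f}%)"
    )
```

A confidence interval whose lower bound sits just above zero percent, as for `high_initial_engagement`, signals an effect that is only marginally distinguishable from no effect.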

We can check the model's assumption that the hazard ratio is constant over time. Again, we show only the variables for which this assumption is rejected at the 5% significance level.

variable	chisq	p
recent_activity	5.792295	0.0160966
high_initial_engagement	4.448864	0.0349245
content_channel	4.713107	0.0299335

We can see that two of our stratification variables violate the assumption and should be modeled as strata. The recent activity marker also shows non-proportional hazards and should not be included in this model.
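The p-values in the table above can be reproduced from the chi-square statistics. This sketch assumes each row is a test with one degree of freedom, as is usual for per-variable proportional hazards tests based on scaled Schoenfeld residuals. The post does not state the degrees of freedom, and the helper name `chisq_1df_pvalue` is hypothetical.

```python
import math


def chisq_1df_pvalue(chisq: float) -> float:
    """Upper-tail probability of a chi-square statistic with 1 degree of
    freedom, using the identity P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(chisq / 2.0))


# (chisq, reported p) copied from the table above.
reported = {
    "recent_activity": (5.792295, 0.0160966),
    "high_initial_engagement": (4.448864, 0.0349245),
    "content_channel": (4.713107, 0.0299335),
}

for name, (chisq, p_reported) in reported.items():
    p = chisq_1df_pvalue(chisq)
    print(f"{name}: computed p = {p:.7f}, reported p = {p_reported}")
```

A small p-value here is evidence against proportionality for that variable, not evidence that the variable affects the hazard.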

We proceed to fit the model again with a refined set of variables.

## # A tibble: 9 × 6
##   variable                   estimate conf.low conf.high  p.value significance
##   <chr>                         <dbl>    <dbl>     <dbl>    <dbl> <chr>       
## 1 homeware_industry             1.55     1.23       1.96 0.000221 "***"       
## 2 cs_outreach_day3              1.24     0.951      1.63 0.111    ""          
## 3 cumulative_session_minutes    1.00     1.00       1.00 0.111    ""          
## 4 cs_outreach_day14             1.19     0.948      1.49 0.136    ""          
## 5 referral_channel              1.17     0.916      1.50 0.205    ""          
## 6 cs_outreach_day21             0.873    0.675      1.13 0.304    ""          
## 7 cs_outreach_day7              0.915    0.735      1.14 0.424    ""          
## 8 login_rate_7day               0.637    0.177      2.29 0.490    ""          
## 9 onboarding_calls_completed    1.02     0.852      1.22 0.842    ""
