Analysing SaaS Trial to Subscriber Conversions - Part 3 - Time Dependent Variables
Introduction #
You can read the Series Introduction here
In the previous post, we saw how survival curves can be created for different strata (levels of the categorical independent variables), giving us a way to determine whether there are significant differences in the median survival time (or another quantile) between groups. The groups are fixed throughout the trial period. In a clinical or randomized controlled trial, these would be set as part of the experimental protocol. In this observational setting, they are often customer segments, such as industry vertical, and other attributes that may not be under our control.
In this third and final post of the series, we use the Cox Proportional Hazards method to model the effect of variables that change over time. In this scenario, these variables are more likely to be things that are under our control, or things we can at least intervene on when we detect a desirable or undesirable effect.
For example, here we will see an analysis of cumulative customer logins, cumulative logged-in time, and the number of distinct core features used. These are factors we can potentially nudge through incentives or gamification (although it will be hard to make strong causal inferences about such complex interactions). We also see the effect of onboarding events and customer success outreach calls. These are very much under our control, both in how consistently our customer success team carries them out and in the content of the calls and the interaction with the customer.
The proportional hazard model tells us how these time-varying covariates influence the instantaneous risk of customer churn at any given moment, while accounting for their cumulative effects over time. Unlike the survival curves from our previous analysis that showed us static group differences, the Cox model quantifies the dynamic relationship between customer engagement behaviors and churn risk as both evolve throughout the customer lifecycle.
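To make the idea of a time-varying covariate concrete, the sketch below splits one customer's follow-up into (start, stop] intervals, each carrying the covariate value known at the start of the interval. This "counting process" layout is the usual input for Cox models with time-dependent variables, but the field names, the `to_intervals` helper, and the sample login days are all hypothetical and are not taken from the data set analysed in this series.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    customer_id: str
    start: int              # day the interval opens (exclusive)
    stop: int               # day the interval closes (inclusive)
    cumulative_logins: int  # covariate value held during (start, stop]
    event: int              # 1 only on the interval in which the event occurs


def to_intervals(customer_id, login_days, end_day, event_observed):
    """Split one customer's follow-up into (start, stop] rows.

    Each row carries the number of logins recorded up to and including
    `start`, so the covariate only uses information available at that time.
    """
    # A new interval starts whenever the covariate changes during follow-up.
    change_points = sorted({d for d in login_days if 0 < d < end_day})
    boundaries = [0] + change_points + [end_day]

    rows = []
    for start, stop in zip(boundaries, boundaries[1:]):
        logins_so_far = sum(1 for d in login_days if d <= start)
        is_last = stop == end_day
        rows.append(
            Interval(customer_id, start, stop, logins_so_far,
                     int(event_observed and is_last))
        )
    return rows


# Hypothetical customer: logs in on days 0, 3 (twice) and 7; event on day 10.
rows = to_intervals("cust_001", login_days=[0, 3, 3, 7],
                    end_day=10, event_observed=True)
for r in rows:
    print(r)
```

Only the final interval can carry the event flag; every earlier row is treated as censored at its `stop` time, which is how the model sees the covariate evolve without counting the customer more than once at any moment.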
Most importantly, the model provides hazard ratios that translate directly into actionable insights: for every additional core feature a customer uses, or each additional customer success call completed, we can quantify the percentage change in churn risk. This allows us to prioritize interventions based on their estimated impact and helps us understand not just whether these activities matter, but by how much - giving us the foundation for data-driven customer success strategies and resource allocation decisions.
Theory Behind the Cox Proportional Hazards Estimator #
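The model estimates the values of the coefficients $\beta_1, \dots, \beta_p$ for $p$ predictor variables, where the instantaneous hazard function for individual $i$ is:

$$h_i(t) = h_0(t) \exp\left(\beta_1 x_{i1}(t) + \dots + \beta_p x_{ip}(t)\right)$$

Here $h_0(t)$ is the baseline hazard shared by all individuals. For example, if we have just one variable $x$, the cumulative logged-in time in hours, and customer A tallies up 20 hours more than customer B, then we have a hazard ratio of:

$$\frac{h_A(t)}{h_B(t)} = \frac{h_0(t) \exp\left(\beta x_A(t)\right)}{h_0(t) \exp\left(\beta x_B(t)\right)} = \exp\left(\beta \left(x_A(t) - x_B(t)\right)\right) = \exp(20\beta)$$

which is equivalent to a multiplier of $\exp(\beta)$ per additional hour logged relative to another individual.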
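As a small numeric illustration of the 20-hour example, the sketch below writes `beta` for the coefficient on cumulative logged-in hours and computes both the per-hour multiplier and the 20-hour hazard ratio. The value of `beta` is purely hypothetical and is not an estimate from this analysis.

```python
import math

# Hypothetical coefficient for cumulative logged-in hours (not a fitted value).
beta = -0.01

# Multiplier on the hazard for each additional hour logged.
per_hour_multiplier = math.exp(beta)

# Hazard ratio of customer A (20 hours more) relative to customer B.
hours_difference = 20
hazard_ratio = math.exp(beta * hours_difference)

print(f"per-hour multiplier: {per_hour_multiplier:.4f}")
print(f"hazard ratio for +{hours_difference} hours: {hazard_ratio:.4f}")
```

The baseline hazard cancels in the ratio, so the comparison between two customers depends only on the difference in their covariate values, and the 20-hour ratio is simply the per-hour multiplier raised to the 20th power.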
Modeling #
We proceed to fit a naïve model with all our variables. The time-independent customer variables are also included, and the result table shows only the variables that are significant at the 5% level.
variable | estimate | conf.low | conf.high | p.value | significance |
---|---|---|---|---|---|
homeware_industry | 1.558641 | 1.234983 | 1.967121 | 0.0001861 | *** |
content_channel | 1.378438 | 1.075660 | 1.766443 | 0.0112022 | * |
high_initial_engagement | 1.372458 | 1.017359 | 1.851501 | 0.0382071 | * |
We can see that the only significant variables shown here are already accounted for by the stratification analysis in our previous post.
We can check the model's assumption that the hazard ratio is constant over time. Again, we show only the variables that fail this check at the 5% significance level.
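One way to read the table is as a percentage change in the hazard. The sketch below assumes the `estimate`, `conf.low` and `conf.high` columns are already exponentiated hazard ratios, which the intervals being symmetric around the estimate on the log scale suggests but the table does not state. The helper name `pct_change` is illustrative only.

```python
# Rows copied from the table above: (variable, estimate, conf.low, conf.high).
rows = [
    ("homeware_industry", 1.558641, 1.234983, 1.967121),
    ("content_channel", 1.378438, 1.075660, 1.766443),
    ("high_initial_engagement", 1.372458, 1.017359, 1.851501),
]


def pct_change(hazard_ratio: float) -> float:
    """Percentage change in the hazard implied by a hazard ratio."""
    return (hazard_ratio - 1.0) * 100.0


# Assumption: all three columns are on the hazard-ratio scale.
summary = {
    name: tuple(round(pct_change(v), 1) for v in (est, low, high))
    for name, est, low, high in rows
}

for name, (est, low, high) in summary.items():
    print(f"{name}: {est:+.1f}% (95% CI {low:+.1f}% to {high:+.1f}%)")
```

A ratio above 1 means a higher hazard for that group relative to its reference level, and an interval whose lower bound stays above 1 is what makes the effect significant at the 5% level.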
variable | chisq | p |
---|---|---|
recent_activity | 5.792295 | 0.0160966 |
high_initial_engagement | 4.448864 | 0.0349245 |
content_channel | 4.713107 | 0.0299335 |
We can see that two of our stratification variables violate the assumptions and should be modeled as strata. The recent activity marker also has a non-constant hazard and should not be included in this model.
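The p-values in the proportional hazards check can be reproduced from the chi-square statistics if each test has one degree of freedom. That is an assumption here, since the table does not list degrees of freedom. Under it, the upper tail probability has a closed form using the complementary error function, as the sketch below shows.

```python
import math


def chi2_sf_1df(chisq: float) -> float:
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chisq / 2.0))


# (chisq, reported p) pairs copied from the table above.
reported = {
    "recent_activity": (5.792295, 0.0160966),
    "high_initial_engagement": (4.448864, 0.0349245),
    "content_channel": (4.713107, 0.0299335),
}

for name, (chisq, p_reported) in reported.items():
    p_computed = chi2_sf_1df(chisq)
    print(f"{name}: computed p = {p_computed:.7f}, reported p = {p_reported:.7f}")
```

A small p-value here is evidence against a constant hazard ratio for that variable, which is why these three are handled differently in the refined model.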
We proceed to apply a refined set of variables to be modeled.