Non-Parametric Survival Analysis of a Sleep Diary

Julian Hatwell
Last Updated: Feb 03, 2021

Introduction #

Earlier this month I carried out a parametric survival analysis of a self-generated dataset: my sleep times for each night of the previous year. Using the scientific method, of course, I set about the task with a null hypothesis that eating certain food groups for dinner had no effect on my sleep. The findings were indeed very interesting, and I was able to reject the null hypothesis using a Weibull survival regression. You can read about that here, so I won’t repeat myself and I’ll skip the exploratory analysis. Just remember, the diagnosis is not great! My median nightly sleep time is 5.84 hours, or 05 hours 50 minutes 34 seconds.

In this post, I take the opportunity to explore the same dataset with a more widely used survival analysis, the non-parametric Kaplan-Meier estimator. As usual, I’ll defer printing out code until the end, unless there is something interesting to show in context.

The K-M Estimator #

The K-M Estimator is calculated cumulatively at each time point where an event occurs.

$$ \hat{S}(t) = \prod_{\substack{t_i \leq t}}{\left( 1 - \frac{d_i}{n_i}\right)} $$ where $n_i$ is the number of observations who have survived up to time $t_i$, or in my case the number of nights where I would still be asleep, and $d_i$ is the number of observations who fail at time $t_i$, or in my case, the number of nights where sleep ends at that elapsed time.
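To make the product concrete, here is a minimal base R sketch that computes the estimator by hand. The seven sleep times are made up for illustration and are not taken from the diary; the names `sleep_hours`, `n_i`, `d_i` and `km` are likewise just for this example.

```r
# Hypothetical toy data: hours slept on seven nights (not the diary data)
sleep_hours <- c(4.5, 5.0, 5.0, 5.5, 6.0, 6.0, 7.5)

# Distinct times at which at least one night's sleep ended
event_times <- sort(unique(sleep_hours))

# n_i: nights on which I was still asleep just before each event time
n_i <- sapply(event_times, function(t) sum(sleep_hours >= t))
# d_i: nights on which sleep ended exactly at each event time
d_i <- sapply(event_times, function(t) sum(sleep_hours == t))

# Kaplan-Meier estimate: running product of (1 - d_i / n_i)
km <- cumprod(1 - d_i / n_i)

data.frame(time = event_times, n_i, d_i, survival = km)
```

With no censored observations, as here, the estimate at each event time is simply the fraction of nights that lasted longer than that time.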

The variance of this estimator is:

$$ \mathrm{var}\left( \hat{S}(t) \right) \approx {\hat{S}(t)}^2 \sum_{t_i \leq t} \frac{d_i}{n_i(n_i - d_i)} $$ and it is usual to build the confidence interval on the complementary log-log scale, using $\mathrm{var} \left( \log \left[ -\log \hat{S}(t) \right] \right)$, so that the back-transformed limits stay between zero and one.
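As a rough sketch of how the variance and the log-log interval fit together, the following base R snippet works through both on a made-up set of seven nights (an assumption for illustration, not the diary data). The standard error on the log-log scale comes from the delta method applied to the sum in the formula above; all variable names are hypothetical.

```r
# Hypothetical toy data: hours slept on seven nights (not the diary data)
sleep_hours <- c(4.5, 5.0, 5.0, 5.5, 6.0, 6.0, 7.5)
event_times <- sort(unique(sleep_hours))
n_i <- sapply(event_times, function(t) sum(sleep_hours >= t))
d_i <- sapply(event_times, function(t) sum(sleep_hours == t))
km  <- cumprod(1 - d_i / n_i)

# The sum is undefined once every night has ended (n_i == d_i),
# so keep only the time points where the estimate is still above zero.
keep   <- n_i > d_i
s_hat  <- km[keep]
gw_sum <- cumsum(d_i[keep] / (n_i[keep] * (n_i[keep] - d_i[keep])))

# Variance of the estimator on its natural scale
var_s <- s_hat^2 * gw_sum

# Standard error of log(-log(S)) via the delta method
se_cloglog <- sqrt(gw_sum) / abs(log(s_hat))

# Back-transform the 95% limits; they cannot leave the (0, 1) range
z     <- qnorm(0.975)
lower <- s_hat^exp( z * se_cloglog)
upper <- s_hat^exp(-z * se_cloglog)

data.frame(time = event_times[keep], s_hat, lower, upper)
```

Because the toy data has no censoring, `var_s` reduces to the familiar binomial variance `s_hat * (1 - s_hat) / 7`, which is a handy sanity check.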

In a typical survival study, there is an additional factor to consider. Survival analysis originates in longitudinal studies of people and conditions generally ending in death. Longitudinal studies are prone to individuals exiting the study over time for reasons other than the events under analysis. When an individual exits the study early, their record is said to be right-censored: it is clipped at the point in time when the individual is no longer observed. It is still possible to use the information provided by their length of survival during participation, so long as the analysis accounts for the uncertainty of not being able to observe if or when the event occurs (e.g. death, relapse, or failure in the case of hardware). Right-censored observations increase the variance of the estimate after the point at which they leave the study.

To represent right-censoring in R, you would pass a Boolean vector of the same length as the observations vector into the Surv object. In my sleep diary, however, each observation is a completely measured night’s sleep and so there aren’t any censored observations. To demonstrate the intricacies of this method, I will first construct the univariate estimator, without adding any co-factors, so the formula is given with just a constant (~1).
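For contrast with the diary data, here is a small sketch of what a censored `Surv` object looks like. The follow-up times and the `observed` flags are invented for illustration; only `Surv()` and `survfit()` from the survival package are real.

```r
library(survival)

# Hypothetical follow-up times; FALSE marks a subject who left the study
# before the event was observed (right-censored).
follow_up <- c(2, 3, 3, 5, 8)
observed  <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

censored_surv <- Surv(follow_up, observed)
print(censored_surv)  # censored times are printed with a trailing "+"

# The censored subjects count as "at risk" until they leave,
# but never contribute an event.
fit <- survfit(censored_surv ~ 1)
summary(fit)
```

The sleep diary needs no such vector, which is why the call that follows passes only the sleep times to `Surv()`.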

estimator <- survfit(Surv(time_sleeping) ~ 1, conf.type="log-log", data = dframe)
estimator

## Call: survfit(formula = Surv(time_sleeping) ~ 1, data = dframe, conf.type = "log-log")
## 
##        n events median 0.95LCL 0.95UCL
## [1,] 365    365   5.84    5.64     6.1

The median sleeping time and 95% confidence intervals are provided by the estimator based on:

$$ \hat{t}_{\mathrm{med}} = \mathrm{inf} \lbrace t : \hat{S}(t) \leq 0.5 \rbrace $$

R makes everything trivial to calculate. The median sleeping time is 05 hours 50 minutes 34 seconds with a lower confidence limit of 05 hours 38 minutes 35 seconds and an upper confidence limit of 06 hours 05 minutes 49 seconds.

The red (reference) line represents the level where $\hat{S}(t) = 0.5$ and the green (median) and blue (ci) drop lines show where the reference line intersects with the estimator and its confidence interval.
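The median definition above can also be applied by hand. This is a minimal sketch on seven made-up nights (not the diary data), chosen so that the curve crosses 0.5 rather than sitting exactly on it; the variable names are hypothetical.

```r
# Hypothetical toy data: hours slept on seven nights (not the diary data)
sleep_hours <- c(4.5, 5.0, 5.0, 5.5, 6.0, 6.0, 7.5)
event_times <- sort(unique(sleep_hours))
n_i <- sapply(event_times, function(t) sum(sleep_hours >= t))
d_i <- sapply(event_times, function(t) sum(sleep_hours == t))
km  <- cumprod(1 - d_i / n_i)

# Median: the earliest event time at which the estimate drops to 0.5 or below
median_time <- min(event_times[km <= 0.5])
median_time
```

The same idea, applied to the lower and upper confidence curves instead of the estimate itself, gives the confidence limits for the median.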

Typically, survival curves from KM estimators do not look like this. Thanks to the even density of data points (this is essentially a time series), the above plot looks like taking the 365 observations stacked on top of each other in order from the shortest on top to the longest at the bottom. The resulting curve has some similarity to the Weibull curves that I was able to infer using the parametric approach.

Multivariate Analysis #

Categorical Variables #

Non-parametric survival analysis, with its roots in clinical trials, is well-developed for comparing two groups and handling perhaps one categorical co-variate with a small number of strata, or one continuous co-variate, but it certainly isn’t common to see the exploratory approach that I used in the previous post on parametric methods.

Sticking to tried and tested methods, I’ll just demonstrate some simple between-groups hypothesis testing, using one food trigger at a time. Recall that eating a meal containing cheese for dinner seemed to have a very pronounced effect on me, reducing the median sleep time by about 1 hour 40 minutes compared with nights when dinner contained none of the analysed food triggers.

I will test the null hypothesis that eating cheese has no effect on sleep times. That is $H_0 : S_C(t) = S_N(t)$ and the alternative is $H_A : S_C(t) \neq S_N(t)$ (or $H_A : S_C(t) < S_N(t)$ for a one-sided test), where $S_C$ is the survival distribution for cheesy nights, and $S_N$ for nights with no food trigger. The standard test is the Mantel-Cox test or log-rank test, which tallies a contingency table for treatment (cheese) and control (no food trigger) observations at each event (end of sleep cycle) time. I won’t reproduce the full derivation here but the resulting statistic follows a $\chi^2$ distribution.
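Without going through the derivation, the mechanics of the log-rank statistic can be sketched in a few lines of base R. The ten nights and their `cheese` flags below are invented for illustration, with no censoring; at each event time the observed number of cheese-night wake-ups is compared with the number expected if both groups shared one survival curve.

```r
# Hypothetical toy data (not the diary data): hours slept and cheese flag
time   <- c(3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5)
cheese <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

event_times <- sort(unique(time))

# At-risk counts and event counts at each event time, overall and for cheese nights
n_all    <- sapply(event_times, function(t) sum(time >= t))
n_cheese <- sapply(event_times, function(t) sum(time >= t & cheese))
d_all    <- sapply(event_times, function(t) sum(time == t))
d_cheese <- sapply(event_times, function(t) sum(time == t & cheese))

# Expected cheese-night events under the null, and the hypergeometric variance
expected_cheese <- d_all * n_cheese / n_all
variance <- ifelse(
  n_all > 1,
  d_all * (n_cheese / n_all) * (1 - n_cheese / n_all) * (n_all - d_all) / (n_all - 1),
  0  # a single remaining night contributes no variance
)

# Log-rank statistic and its p-value on one degree of freedom
chisq   <- (sum(d_cheese) - sum(expected_cheese))^2 / sum(variance)
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
c(observed = sum(d_cheese), expected = sum(expected_cheese),
  chisq = chisq, p = p_value)
```

In practice `survdiff()` from the survival package does this bookkeeping, and its output for the diary data follows.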

## Call:
## survdiff(formula = Surv(time_sleeping) ~ cheese, data = dframe)
## 
##                N Observed Expected (O-E)^2/E (O-E)^2/V
## cheese=FALSE 301      301    344.5      5.48       103
## cheese=TRUE   64       64     20.5     91.92       103
## 
##  Chisq= 102  on 1 degrees of freedom, p= <2e-16

Evidently, this is a significant result and the null hypothesis is rejected. The effect can be visually analysed with a simple plot.

What’s nice about the KM estimator survival curve is that, unlike a parametric distribution, you can clearly see the location of each event, giving you full transparency over your empirical data. In a typical survival study, with perhaps only tens of subjects, the piece-wise nature is really clear and locating sudden changes in the distribution is a matter of a quick visual check.

Let’s take a look at the other food triggers.

## Call:
## survdiff(formula = Surv(time_sleeping) ~ brassica, data = dframe)
## 
##                  N Observed Expected (O-E)^2/E (O-E)^2/V
## brassica=FALSE 244      244    288.3      6.81      33.9
## brassica=TRUE  121      121     76.7     25.59      33.9
## 
##  Chisq= 33.9  on 1 degrees of freedom, p= 6e-09

## Call:
## survdiff(formula = Surv(time_sleeping) ~ meat, data = dframe)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## meat=FALSE 312      312    329.7     0.953        10
## meat=TRUE   53       53     35.3     8.908        10
## 
##  Chisq= 10  on 1 degrees of freedom, p= 0.002

## Call:
## survdiff(formula = Surv(time_sleeping) ~ spice, data = dframe)
## 
##               N Observed Expected (O-E)^2/E (O-E)^2/V
## spice=FALSE 302      302      335      3.26      41.5
## spice=TRUE   63       63       30     36.43      41.5
## 
##  Chisq= 41.5  on 1 degrees of freedom, p= 1e-10

All the above details are consistent with the parametric findings.

Continuous Variables #

In the previous post, by creating autoregressive features from the time sleeping variable, I identified that the sleep time from the night before affected the sleep time for a given night. I assume that there was some physiological pressure to “catch up.” The same was not true for the night before that. We can perform a similar investigation non-parametrically.
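For anyone who has not read the previous post, the autoregressive features are just lagged copies of the sleep time column. A minimal sketch on a made-up five-night vector (the values are hypothetical) shows the construction, including the padding of the first nights, which have no history, with the first observation:

```r
# Hypothetical sleep times for five consecutive nights
time_sleeping <- c(5.0, 6.0, 4.0, 7.0, 5.5)

# ar_1: sleep time on the previous night; ar_2: two nights previous.
# The leading values are padded with the first observation.
ar_1 <- c(time_sleeping[1], head(time_sleeping, -1))
ar_2 <- c(rep(time_sleeping[1], 2), head(time_sleeping, -2))

data.frame(time_sleeping, ar_1, ar_2)
```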

I haven’t referred to the hazard function $h(t)$, because I want to keep these posts a bit light on detail, but this is essentially the instantaneous failure rate at time $t$. It is defined as follows:

$$ h(t) = \lim_{\delta \rightarrow 0} \frac{P(t \leq T < t + \delta \mid T \geq t)}{\delta} $$

where $T$ is the time at which the event occurs, so the numerator is the probability that the event happens within some tiny time increment $\delta$ after $t$, given that it has not happened before $t$. There is a mathematical relationship between hazard and survival but it takes a few steps to derive it, so I’ll be a bit hand-wavey here and just say that where $h(t)$ is high, $S(t)$ is falling fast.

Analysis of hazard functions is commonly done with Cox’s Proportional Hazards model, which is fitted by maximising a partial likelihood and can be used for survival regression. The coefficient estimates of a proportional hazards regression are approximately normally distributed in large samples and so can be subject to familiar significance tests.
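For readers who want the missing steps, the relationship follows from writing the hazard as the density of $T$ divided by the survival function. This is the standard textbook result, with $f(t)$ denoting the density of $T$ (a symbol not used elsewhere in this post) and using $f(t) = -\frac{d}{dt} S(t)$:

$$ h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt} \log S(t) \quad \Longrightarrow \quad S(t) = \exp\left( -\int_0^t h(u) \, du \right) $$

The integral is the cumulative hazard, so a high hazard over any stretch of time pulls the survival curve down quickly there.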

Let’s quickly see how this looks for the cheese trigger to develop some intuition before using it to assess the autoregressive effect of previous night sleeping time.

## Call:
## coxph(formula = Surv(time_sleeping) ~ factor(cheese), data = dframe)
## 
##   n= 365, number of events= 365 
## 
##                      coef exp(coef) se(coef)     z Pr(>|z|)    
## factor(cheese)TRUE 1.4012    4.0603   0.1492 9.391   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                    exp(coef) exp(-coef) lower .95 upper .95
## factor(cheese)TRUE      4.06     0.2463     3.031      5.44
## 
## Concordance= 0.591  (se = 0.011 )
## Likelihood ratio test= 69.73  on 1 df,   p=<2e-16
## Wald test            = 88.19  on 1 df,   p=<2e-16
## Score (logrank) test = 102.5  on 1 df,   p=<2e-16

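This analysis indicates a log hazard ratio of 1.4012 for nights with a meal containing cheese relative to nights without (the reference level, at zero), which is statistically significant. Exponentiating gives a hazard ratio of about 4.06: under the proportional hazards assumption, at any given moment I am roughly four times as likely to wake up on a cheese night.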
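The remaining figures in the summary can be recovered from the coefficient and its standard error alone. The sketch below copies those two numbers from the printed output above and assumes the usual Wald construction (estimate plus or minus a normal quantile times the standard error, then exponentiated); the variable names are hypothetical.

```r
# Values copied from the coxph summary above
coef_cheese <- 1.4012
se_cheese   <- 0.1492

# Hazard ratio and Wald z statistic
hazard_ratio <- exp(coef_cheese)
z_stat       <- coef_cheese / se_cheese

# 95% confidence interval, built on the log scale and then exponentiated
z_crit   <- qnorm(0.975)
ci_lower <- exp(coef_cheese - z_crit * se_cheese)
ci_upper <- exp(coef_cheese + z_crit * se_cheese)

round(c(hazard_ratio = hazard_ratio, z = z_stat,
        lower = ci_lower, upper = ci_upper), 3)
```

These agree with the `exp(coef)`, `z`, `lower .95` and `upper .95` columns to within the rounding of the printed inputs.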

Now for the analysis for my autoregressive features of one and two nights previous. I don’t anticipate a linear relationship so I will pass a penalized spline into the model.

## Call:
## coxph(formula = Surv(time_sleeping) ~ pspline(ar_1, df = 2) + 
##     pspline(ar_2, df = 2), data = dframe)
## 
##                              coef se(coef)     se2   Chisq   DF      p
## pspline(ar_1, df = 2), li -0.1021   0.0334  0.0333  9.3549 1.00 0.0022
## pspline(ar_1, df = 2), no                           2.1989 1.06 0.1494
## pspline(ar_2, df = 2), li -0.1691   0.0343  0.0342 24.3589 1.00  8e-07
## pspline(ar_2, df = 2), no                           2.4917 1.05 0.1224
## 
## Iterations: 4 outer, 13 Newton-Raphson
##      Theta= 0.936 
##      Theta= 0.925 
## Degrees of freedom for terms= 2.1 2.1 
## Likelihood ratio test=41.4  on 4.12 df, p=3e-08
## n= 365, number of events= 365

This is an interesting result that differs from the parametric findings. The printout shows a coefficient and a significance test for each spline’s linear part, and a significance test only for the non-linear part. Sleep time on both the previous night (coef = -0.1021) and two nights previous (coef = -0.1691) has an effect, with the effect of two nights previous being slightly larger in magnitude. Both linear coefficients are significant and negative, so the hazard falls as prior sleep increases. The interpretation is that less time sleeping on previous nights increases the hazard. This runs counter to intuition and the parametric findings. However, R provides a built-in plotting function that reveals the relationship very clearly.

The plots confirm the model results that hazard increases with less sleep on the previous two nights. There is an inflection point somewhere near the median sleep time where the effect is neutral in both cases. The non-intuitive inverse relationship could be a result of not running the other co-factors in the model (something akin to Simpson’s paradox).

Unlike with the KM estimator survival curve analysis, it is straightforward to run a multivariate proportional hazards regression with many more co-factors. I’ll run the model with the seasonal (month) effect included to see if my hunch is correct.

## Call:
## coxph(formula = Surv(time_sleeping) ~ pspline(ar_1, df = 2) + 
##     pspline(ar_2, df = 2) + month, data = dframe)
## 
##                              coef se(coef)     se2   Chisq   DF       p
## pspline(ar_1, df = 2), li  0.1051   0.0418  0.0418  6.3128 1.00 0.01199
## pspline(ar_1, df = 2), no                           1.7641 1.05 0.19541
## pspline(ar_2, df = 2), li -0.0078   0.0411  0.0411  0.0360 1.00 0.84960
## pspline(ar_2, df = 2), no                           4.7869 1.05 0.03084
## monthAug                   1.6808   0.2938  0.2932 32.7242 1.00 1.1e-08
## monthDec                  -0.5794   0.2685  0.2673  4.6548 1.00 0.03097
## monthFeb                  -0.7621   0.2776  0.2767  7.5353 1.00 0.00605
## monthJan                  -0.9732   0.2801  0.2784 12.0708 1.00 0.00051
## monthJul                   1.8935   0.2978  0.2967 40.4382 1.00 2.0e-10
## monthJun                   0.6349   0.2656  0.2651  5.7124 1.00 0.01685
## monthMar                  -0.4502   0.2631  0.2628  2.9282 1.00 0.08704
## monthMay                   0.2949   0.2612  0.2605  1.2747 1.00 0.25889
## monthNov                  -0.3267   0.2609  0.2607  1.5679 1.00 0.21051
## monthOct                  -0.0930   0.2586  0.2583  0.1292 1.00 0.71923
## monthSep                   0.6727   0.2655  0.2652  6.4219 1.00 0.01127
## 
## Iterations: 4 outer, 13 Newton-Raphson
##      Theta= 0.924 
##      Theta= 0.918 
## Degrees of freedom for terms=  2.1  2.0 10.9 
## Likelihood ratio test=152  on 15 df, p=<2e-16
## n= 365, number of events= 365

Wow! I was absolutely spot on. Controlling for the seasonal effect of just sleeping less in the shorter summer nights, the results are consistent with the findings from last time. Sleep time on the previous night (coef = 0.1051) has a significant effect that is no longer reversed, and two nights previous (coef = -0.0078) no longer has a significant effect.

The plots confirm this finding, with previous night hazard appearing to fall monotonically with less sleep, while for two nights previous, the zero reference line is entirely contained inside the confidence interval.

Further work #

It is possible to go further with proportional hazards regression and to select the best model using likelihood ratio tests for nested models and AIC for non-nested models. This is familiar territory for linear models and GLMs. I performed this procedure in the previous post on parametric survival analysis. So for the moment, my investigation will end here.
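As a pointer for anyone who wants to try it, the comparison might look something like the sketch below. It uses the `lung` dataset that ships with the survival package as a stand-in, not the sleep diary, and the two models are arbitrary examples chosen only because one is nested in the other.

```r
library(survival)

# `lung` is bundled with the survival package; it stands in for the diary here
fit_small <- coxph(Surv(time, status) ~ age, data = lung)
fit_large <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Nested models: likelihood ratio test for the added term
anova(fit_small, fit_large)

# Models fitted to the same data, nested or not: the lower AIC is preferred
c(small = AIC(fit_small), large = AIC(fit_large))
```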

Conclusion #

Running a typical survival analysis on an atypical dataset was an interesting exercise. It did yield a confusing result at first, as I progressed to a more advanced regression analysis with autoregressive continuous variables, but this was cleared up by approaching the problem with a critical mindset.

The parametric approach from the previous post yielded a very satisfying analysis because the data fitted the Weibull distribution well, as well as achieving goodness of fit with a Gamma distribution and normal distribution. However, the non-parametric survival curves yielded by the Kaplan-Meier estimator actually show you the true, empirical picture of your data set rather than some theoretical distribution. In most cases, this is preferable to work with.

We couldn’t look into some of the aspects of survival analysis that only present themselves in longitudinal studies, such as observations being censored by exiting the study prior to the analysed events. I may return to this topic in a future post.

Overall, it was really great fun to work with a self-generated data set that defines a problem that is pretty core to my personal life. I learned a lot about how to manage my chronic insomnia and that can only be a good thing.

Code Appendix #

library(lattice)
library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)
library(survival)
library(Hmisc)
library(goftest)
library(fitdistrplus)
library(patchwork)

dframe <- read.csv("data/insomnia-diary.csv")

days_in_months <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
month_vector <- rep(1:12, times = days_in_months)
dframe$month_number <- month_vector
month_name <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
month_positions <-c(0, cumsum(days_in_months) + 1)[-13]
dframe$month <- month_name[dframe$month_number]

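# Calendar quarter (1 to 4) derived from the month number
dframe$quarter <- (dframe$month_number - 1) %/% 3 + 1
# Encode each combination of the four food triggers as one factor level (bit flags)
dframe[["combo_foods"]] <- factor(with(dframe, cheese * 1 + brassica * 2 + meat * 4 + spice * 8))

# Day index, then lagged sleep times (ar_k = sleep time k nights earlier).
# The first k values have no history, so they are padded with the first observation.
dframe[["trend"]] <- seq_along(dframe$time_sleeping)
dframe[["ar_1"]] <- c(dframe$time_sleeping[1], head(dframe$time_sleeping, -1))
dframe[["ar_2"]] <- c(rep(dframe$time_sleeping[1], 2), head(dframe$time_sleeping, -2))
dframe[["ar_3"]] <- c(rep(dframe$time_sleeping[1], 3), head(dframe$time_sleeping, -3))
dframe[["ar_4"]] <- c(rep(dframe$time_sleeping[1], 4), head(dframe$time_sleeping, -4))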

convert_hms <- function(hours) {
  total_seconds <- hours * 3600
  hrs <- as.integer(floor(total_seconds / 3600))
  mins <- as.integer(floor((total_seconds %% 3600) / 60))
  secs <- as.integer(total_seconds %% 60)

  sprintf("%02d hours %02d minutes %02d seconds", hrs, mins, secs)
}

estimator <- survfit(Surv(time_sleeping) ~ 1, conf.type="log-log", data = dframe)
estimator

med <- quantile(estimator, 0.5)
med_sleep <- med$`quantile`
lc_sleep <- med$lower
uc_sleep <- med$upper

plot(estimator, conf.int = TRUE, xlab="Time (hr)", ylab = "Survival (still sleeping) Probability")
title("Kaplan-Meier Estimator of Insomnia Diary Data")
abline(h = 0.5, col = "red", lty = 3)
lines(rep(x = uc_sleep, 2), y = c(0, 0.5), col = "blue", lty = 2)
lines(rep(x = lc_sleep, 2), y = c(0, 0.5), col = "blue", lty = 2)
lines(rep(x = med_sleep, 2), y = c(0, 0.5), col = "green", lty = 2)

survdiff(Surv(time_sleeping) ~ cheese, data = dframe)

plot(survfit(Surv(time_sleeping) ~ cheese, data = dframe), xlab="Time (hr)", ylab = "Survival (still sleeping) Probability", col=c("black", "red"), lwd = 1)
legend("topright", legend=c("no cheese", "cheese"), col=c("black", "red"), lwd = 1)

survdiff(Surv(time_sleeping) ~ brassica, data = dframe)

plot(survfit(Surv(time_sleeping) ~ brassica, data = dframe), xlab="Time (hr)", ylab = "Survival (still sleeping) Probability", col=c("black", "red"), lwd = 1)
legend("topright", legend=c("no brassica", "brassica"), col=c("black", "red"), lwd = 1)

survdiff(Surv(time_sleeping) ~ meat, data = dframe)

plot(survfit(Surv(time_sleeping) ~ meat, data = dframe), xlab="Time (hr)", ylab = "Survival (still sleeping) Probability", col=c("black", "red"), lwd = 1)
legend("topright", legend=c("no meat", "meat"), col=c("black", "red"), lwd = 1)

survdiff(Surv(time_sleeping) ~ spice, data = dframe)

plot(survfit(Surv(time_sleeping) ~ spice, data = dframe), xlab="Time (hr)", ylab = "Survival (still sleeping) Probability", col=c("black", "red"), lwd = 1)
legend("topright", legend=c("no spice", "spice"), col=c("black", "red"), lwd = 1)

cph_cheese <- coxph(Surv(time_sleeping) ~ factor(cheese), data = dframe)
summary(cph_cheese)

termplot(cph_cheese, se = TRUE, terms = 1, ylabs = "Log hazard")

cph_ar <- coxph(Surv(time_sleeping) ~ pspline(ar_1, df=2) + pspline(ar_2, df=2), data=dframe)
cph_ar

termplot(cph_ar, se = TRUE, terms = 1, ylabs = "Log hazard", xlabs = "Previous Night")
abline(h = 0, col="grey", lty=3)

termplot(cph_ar, se = TRUE, terms = 2, ylabs = "Log hazard", xlabs = "Two Nights Previous")
abline(h = 0, col="grey", lty=3)

cph_ar <- coxph(Surv(time_sleeping) ~ pspline(ar_1, df=2) + pspline(ar_2, df=2) + month, data=dframe)
cph_ar

summary_output <- summary(cph_ar)
linear_coeffs <- summary_output$coefficients

termplot(cph_ar, se = TRUE, terms = 1, ylabs = "Log hazard", xlabs = "Previous Night")
abline(h = 0, col="grey", lty=3)

termplot(cph_ar, se = TRUE, terms = 2, ylabs = "Log hazard", xlabs = "Two Nights Previous")
abline(h = 0, col="grey", lty=3)
