Educational Attainment in England - A Deeper Dive

Last Updated: Aug 30, 2023

Table of Contents

Introduction #

On the 25th July 2023, the UK Office for National Statistics produced a wonderful piece of data journalism with open access to their dataset measuring educational attainment across England. The article’s title “Why do children and young people in smaller towns do better academically than those in larger towns?” hides a bold claim in the form of a question. I like to assume that their research question(s) did not start from such a knowledge claim but rather the title emerged from the themes they discovered during their investigation.

In fact, the article’s subtitle provides more of a clue over the sort of questions they must have been asking, uncovering correlations to attainment with factors in the social environment:

"Town size, income deprivation and the education level of other residents are all related to the educational attainment of pupils in English towns."

Overall, I found the article fascinating and enjoyed the approach they took, with plots and visuals that offer immediacy of comprehension to a non-technical audience, and explanations that are free of statistical jargon.

Grateful that the authors have published the data set along with embeddable chart objects, I thought it would be interesting to consider some deeper analyses with a more technical audience in mind. So, I’ve re-run some of the key findings, adding a handful of more advanced statistical tools. These tools allow me to make more detailed inferences, and possibly deeper insights.

Obviously, the ONS do not need me to tell them how to analyse data! I’m sure that the ideas presented here were considered, discussed, and then pared back to the most digestible form. Part of the ONS remit is to choose a level of detail to communicate to the broadest possible audience. My intention with this write up is simply to work in a way that is not constrained by the same concerns.

Note, as usual, I defer printing the source code until the end of the document, except where I have modified the data. In that case, the code is shown so the reader can understand the actions in context

Note also, where I have embedded plots from the ONS article, you will see the appropriate credits. Everything else here is my own.

Research Questions #

The ONS investigation uncovers some important interactions between town size and level of income deprivation, but the complex nature of this effect is set aside in favour of a series of easier to understand bivariate analyses. I would like to consider the following:

Is town size a factor in educational attainment when controlling for other variables, such as level of income deprivation?
Are regional effects a significant factor in educational attainment, or are there more powerful local factors at play?

Analytic Strategy #

I will proceed in two phases:

An initial exploratory analysis with descriptive statistics, undertaking an enhanced approach based on the original article:

A critical assessment of the descriptive plots and charts inn the original report, adding a further level of detail, where possible.
An audit of the differences between groups that were shown in the original article, engineering new features from the provided statistics that offer a greater level of statistical scrutiny (modeling variance, uncertainty, etc).

A modelling approach that handles hierarchical groupings such as town and region, enabling us quantify the differences in attainment at different geographic levels.

Initial Exploratory Analysis #

I have downloaded the original full data set. The data refer to a cohort who sat GCSEs in 2012-2013 and is a longitudinal study of their performance from Key Stage 2 (2007-8), and comprising GCSE results, and qualifications and/or other indicators at ages 18 and 22. That is to say, all the counts, percentages and scores listed relate to this one cohort, unless otherwise stated. Where something is listed with the word “score” in the field name, this is a derived score for which the formula may or may not be available. However, it is applied across the cohort and can be used for between group comparisons.

The data are imported from a csv file. There are 1104 rows and 33 columns. All the necessary detail is in the original article. So, for brevity, I will skip over a comprehensive EDA and descriptive statistics. Instead, I will look at some of the original visual analyses and the claims made about them, and then carry out an enhanced analysis.

“Young People Do Better Where The Older Generations Have Higher Attainment” #

The article presents findings of an inter-generational relationship, which stands up to instinct and reasoning. It is fair to assume that educated parents, other family members and friends are better equipped, through their own experience, to support their children’s education more deliberately and methodically.

The main dataset, in fact, contains a categorical variable for older HE attainment, and here I present one of the first exploratory plots I created from it, as an alternative view of the same information. This is a one-sided bee swarm plot of the cohort attainment score, coloured by the older HE attainment category.

I like this plotting approach when there are enough data points over the range of given values because the shape gives a sense of the data density. A kernel density estimation (KDE) plot is a standard tool to use in the visualisation of univariate data but there are subtleties that are good to view with this alternative perspective. Consider the following two plots for comparison:

You can see with the KDE plots we still visualise the distribution among the factor levels, but the emphasis is on their separateness. The bee swarm plot, on the other hand, gives and impression of the distribution while allowing the data points to mix. It’s more intuitive than scientific but given the subject matter is families and communities, it seems to offer a better metaphor for the towns and villages from where the data are drawn.

Reproducing the original scatterplots #

The raw percentage of older graduates in the population was available as a separate file download in the article. The original authors presented a scatter plot of this relationship, showing a strong correlation. I have recreated this plot, annotating with the Pearson correlation.

“Pupils in Small Towns Do Better on Average” #

Now we move on to the claim that I am most sceptical about. The claim is supported by a beautiful beeswarm plot in the original article.

While it seems to be true that the groups preset with noticeable mean differences, these differences do seem to be very small. Furthermore, the variance for small towns is certainly the largest by quite some margin, and we see that there are many small towns scoring worst in the country. So, the sub-heading claim doesn’t appear to tell the whole story. To their credit, the authors do expand on this point for themselves and the claim itself appears to be sound under further scrutiny. For example, I can verify it through an ANOVA test of the three designations (Small/Medium/Large Towns), and a Student’s t test between just the Small and Medium Towns. Here are the results:

1
2
3
4
5


##               Df Sum Sq Mean Sq F value  Pr(>F)   
## size_flag      2    137   68.27   5.226 0.00551 **
## Residuals   1079  14097   13.07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


## 
## 	Welch Two Sample t-test
## 
## data:  small_towns and medium_towns
## t = 2.3189, df = 758.57, p-value = 0.01033
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.1593041       Inf
## sample estimates:
##  mean of x  mean of y 
##  0.2966166 -0.2530946

For both tests, the p-value is well below the 0.05 significance level, supporting the main statistical claim about differences between groups. However, I am left with niggling concerns:

The variance for small towns is really large, with the extremes eclipsing the other groups.
I am also bothered about the fact that there be an inflationary effect of repeating samples from the same broader geographical area.

While the original report does dig into large geographical regions, such as the North West, or the South East, these are enormous designations that do not capture granular socio-economic realities neighbourinng of small towns.

I have a hypothesis that nearby and neighbouring small towns are very unlikely to be genuinely independent observations when it comes to issues such as income deprivation. What if all the towns in a close geographical area suffered similar issues after the loss of a major local employer, such as a factory, mine or call centre? If a cluster of nearby towns provide similar measures because they are related by geography, proximity and recent economic realities, then their shared misfortune provides a cluster of non-independent samples. As statisticians, we must be vigilant towards sampling issues like this. Such sampling issues are the greatest source of bias and error in statistical investigations.

I will return to this topic later on.

“Towns With High Educational Attainment Have Low Income Deprivation” #

This is the most expected and predictable claim and any rational thinking person would intuit a causal relationship: income deprivation contributes to low educational attainment. I note that the original article is worded in the inverse order than my former statement, and teh wording also lands on the least stigmatising category labels (high educational attainment, low income deprivation). I think this is probably a deliberate editorial decision, made to ensure that no one accuses the ONS of implying such a causal relationship when they can only provide statistical evidence for correlation. I make no such claim either but I’m happy to point out that a causal relationship is very likely what people would assume.

The relationship can be seen in another scatter plot of attainment score vs income score. There was an accompanying additional csv file (for whatever reason, the income score was not part of the main file). Here I annotate the plot again with the Pearson correlation.

We see here a heteroskedastic pattern (reversed), with variance growing as the attainment score falls. This pattern adds to my suspicion other factor(s) are at play, mediating the relationship to attainment in lower income areas.

“Small Towns Are Less Frequently Classified as Income Deprived” #

Given the findings so far, this claim about the nature of small towns seems extremely important. It notes the under-representation of income deprivation as town size decreases. This is an interaction effect that really needs a closer look to understand the drivers for difference in attainment. Think about it; if the sample of small towns contains many more higher income towns, this underlying disproportion will bias the mean attainment score upwards, given the previous correlation claim.

The proportion of each town size designation under each income deprivation class (Higher/Mid/Lower Income Deprivation) is shown in the article with a fixed width horizontal bar plot.

Such plots are useful for their immediacy but some information in lost. I find various kinds of strucplots much more illuminating, especially when there are two- or more-way interactions. I present two mosaic plots for illustration. The Expected frequencies plot shows the expected counts under the Null hypothesis that income deprivation and town size are independent. The second Actual frequencies presents the observed counts.

A mosaic plot is ideal when dealing with categorical and count data. It uses tile area to represent the count of each category intersection, resulting in tiles of varying width and height. The shading of the Actual frequencies mosaic follows a scheme according to the magnitude of the Pearson residuals (of a test of actual vs. expected under independence) at each intersecting level of the categories. The darkest blue indicates a significant over count and the darkest red is conversely a significant under count, when compared to the Null hypothesis. By linking shading depth to the Pearson residuals, the mosaic plot is more than just a visual check. It is a rigorous statistical test.

The mosaic plot provides evidence of the interaction that gave rise to my concerns and why I think it is essential to control for town size and income deprivation when investigating educational attainment.

“The Relationship Between Attainment and Income Differs Between Regions” #

In the original article, there is no mention of the potential effect of geographical proximity between towns and their recorded observations. I can understand that a choice was made not to dive into the increasing complexity of hierarchical modeling for the sake of holding the audience’s attention on the main findings. The authors present a very nice dot plot, illustrating the North West region’s best scores at all three income levels.

As mentioned before, I believe that geographical clusters of smaller towns may have economic effects in common that could bias the statistical inferences drawn from this data set. It would be prudent to find a way to adjust for the amplified signal arising from shared socio-political histories that are linked by physical proximity and connections between communities. Furthermore, the nearest large urban centre could also be an influence because of the presence of centralised education authorities and local funding decisions. To analyse attainment, taking into account the geographical hierarchy of nearest large urban centre and region, I will use a Linear Mixed Model (LMM).

When determining the modeling approach with LMM, it is important to make reasoned decisions about a plausible structure for the hierarchical relationship. LMM allow for different intercepts, or different intercepts and slopes for each independent variable. The proximity relationship is essentially a binary choice: Aylesbury either has London as its closest city, or it does not. Hull is in the region Yorkshire and the Humber, or it is not. Both of these examples are true, by the way. So, it only makes sense to model different intercepts here for both hierarchical factor levels. Slopes would be applied to numerical independent variables. The “/” notation indicates a hierarchical grouping here of ttwa11nm (urban centre) within rgn11nm (region).

1
2
3
4
5
6
7


lmer_model <- lmer(
  education_score ~ 1 + size_flag + income_flag + (1 | rgn11nm/ttwa11nm),
  data = english_education,
  subset = town_size_filter
)

summary(lmer_model)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32


## Linear mixed model fit by REML ['lmerMod']
## Formula: education_score ~ 1 + size_flag + income_flag + (1 | rgn11nm/ttwa11nm)
##    Data: english_education
##  Subset: town_size_filter
## 
## REML criterion at convergence: 5114.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9102 -0.6342 -0.0349  0.6261  3.3471 
## 
## Random effects:
##  Groups           Name        Variance Std.Dev.
##  ttwa11nm:rgn11nm (Intercept) 1.1978   1.0944  
##  rgn11nm          (Intercept) 0.3904   0.6248  
##  Residual                     5.8334   2.4152  
## Number of obs: 1082, groups:  ttwa11nm:rgn11nm, 164; rgn11nm, 8
## 
## Fixed effects:
##                                    Estimate Std. Error t value
## (Intercept)                         -2.4851     0.3629  -6.848
## size_flagMedium Towns               -0.2876     0.3007  -0.956
## size_flagSmall Towns                -0.4828     0.2916  -1.656
## income_flagMid deprivation towns     2.2839     0.2263  10.091
## income_flagLower deprivation towns   5.8011     0.2068  28.051
## 
## Correlation of Fixed Effects:
##             (Intr) sz_fMT sz_fST in_Mdt
## sz_flgMdmTw -0.619                     
## sz_flgSmllT -0.632  0.825              
## incm_flgMdt -0.131 -0.094 -0.137       
## incm_flgLdt -0.103 -0.125 -0.257  0.477

There is a bit more to interpret than we would get from a multivariate linear model. Importantly, the model converges easily and the scaled residuals are fairly symmetrical about zero. The fixed effects only show correlations between different levels of the same factor. These are all good diagnostic indicators, as is the residual plot here:

The main statistics are really interesting. The intercept can be taken as the base value of attainment score and is set to reference levels “Large Towns” and “Higher deprivation towns”. The fixed effects of income deprivation agree with the official investigation. As the levels of income deprivation ease, the attainment outcomes rise dramatically. As I said before, this is the most obvious conclusion of the whole report.

However, in a classic case of Simpson’s paradox, the reported correlations between town size and attainment are reversed once we control for income and the geographical proximity in this hierarchical model. This is evident from the negative values associated with medium and small towns. The model shows that wealthy small towns do well while poor small towns do worst. The disparity in effect sizes explains the heteroskedastic scatter plot we saw earlier.

	coefficients
Large Towns / Higher deprivation towns	-2.4851
Medium Towns	-0.2876
Small Towns	-0.4828
Mid deprivation towns	2.2839
Lower deprivation towns	5.8011

Looking at the variation in the random effects, we can see that the regional differences are also present, as reported in the article but rather more of the variation is explained by the nearest urban centres (the standard deviation is approximately double). So, things are much as I initially suspected. This result seems to validates my intiution; clusters of towns in the same geographical area are subject to local effects that also contribute to their educational outcomes. This effect is stronger than the wider regional effect. This concept is not reported anywhere in the original report.

Group	Std.Dev.
ttwa11nm:rgn11nm	1.0944333
rgn11nm	0.6247878
Residual	2.4152362

1

## $rgn11nm

1

## $`ttwa11nm:rgn11nm`

Summary and Conclusions #

I used a publically available data set that was provided with an already comprehensive piece of data journalism and analysis, to see if I could uncover any deeper insights than those presented to the original article’s non-technical audience.

I confirmed the validity of some of the claims using standard statistical tests that were not discussed in the original write-up.
I showed some alternative visualisations for the data that provide more nuance and information than the simpler graphics that had been chosen for a less technical audience.
I used a hierarchical linear mixed model to control for multiple interacting factors and demonstrated that:
- proximal geographical effects are stronger than broad regional effects
- the original article may have made an erroneous claim about effect of town size on educational attainment because the effect was reversed when controlling for income deprivation

This detailed analysis shows that having multiple tools for different depths of analysis allows you to use multiple perspectives when looking for patterns and draw inferences from your data. Some tools suit a less technical audience and offer immediacy of understanding, while other more detailed tools are required for high impact decision support and policy making.

Appendix #

Here you can find the source code.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221


library(knitr)
library(tidytuesdayR)
library(magrittr)
library(dplyr)
library(tidyr)
library(ggbeeswarm)
library(vcd)
library(vcdExtra)
library(lattice)
library(survey)
library(ggplot2)
library(gridExtra)
library(forcats)
library(lme4)
library(MCMCpack)

opts_chunk$set(warning = FALSE
              , message = FALSE
              , echo = FALSE
              )

hook_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
  lines <- options$output.lines
  if (is.null(lines)) {
    return(hook_output(x, options))  # pass to default hook
  }
  x <- unlist(strsplit(x, "\n"))
  more <- "..."
  if (length(lines)==1) {        # first n lines
    if (length(x) > lines) {
      # truncate the output, but add ....
      x <- c(head(x, lines), more)
    }
  } else {
    x <- c(more, x[lines], more)
  }
  # paste these lines together
  x <- paste(c(x, ""), collapse = "\n")
  hook_output(x, options)
})

par(mar = c(4,3,3,1))
source("HeartTheme.R")
english_education <- as_tibble(read.csv("english_education.csv")) %>%
  dplyr::mutate(
      proportion_of_higher_ed=factor(
      level4qual_residents35_64_2011,
      levels = c("Low", "Medium", "High", NA)
    ),
      income_flag=factor(
      income_flag,
      levels = c("Higher deprivation towns", "Mid deprivation towns", "Lower deprivation towns")
      )
  )

income_and_edu <- read.csv("income_and_education_score.csv", sep = ";") %>%
  dplyr::select(-X) %>%
  rename(
  town=Town.name,
  attainment_score=Educational.attainment.score,
  income_score=Income.deprivation.score
  ) %>%
  mutate(
    income_score=as.numeric(sub(",", ".", income_score)),
    attainment_score=as.numeric(sub(",", ".", attainment_score)),
    town=sub(" BUA.*", "", town)
  ) %>%
  as_tibble()

digits_only <- function(vec) {
  re <- regexpr("[[:digit:]]{1,2}", vec)
  matches <- regmatches(vec, re)
  return(matches)
}

older_and_edu <- read.csv("education_and_older_attainment.csv", sep = ";") %>%
  dplyr::select(-X) %>%
  rename(
    town=Town.name,
    attainment_score=Educational.attainment.score,
    percentage_of_higher_ed=Residents.aged.35.64.with.Level.4.or.above.qualifications
  ) %>%
  mutate(
    percentage_of_higher_ed=as.numeric(digits_only(percentage_of_higher_ed)),
    attainment_score=as.numeric(sub(",", ".", attainment_score)),
    town=sub(" BUA.*", "", town)
  ) %>%
  as_tibble()
english_edu_scatter <- function(g, correlation, corr_x, corr_y) {
  g + geom_point(colour = myPal[4], alpha = 0.75) +
  geom_point(shape = 1, colour = myDarkCornFlower) +
  annotate("label", x = corr_x, y = corr_y,
           label = paste("Correlation:\n", round(correlation, 2))) +
  theme_bw()
}
size_flags <- c("Large Towns", "Medium Towns", "Small Towns")
town_size_filter <- english_education$size_flag %in% size_flags
shorten <- function(string, end_pos) {
  shortened <- substr(string, 1, end_pos)
  ifelse(nchar(string) > end_pos,  paste0(shortened, "..."), shortened)
   }
shorten_20 <- function(string) { shorten(string, 20) }
classes <- sapply(english_education, class)
full_names <- names(classes)
short_names <- sapply(full_names, shorten_20)
names(classes) <- NULL
names(short_names) <- NULL
notes <- c("Unique ID", "Raw Town Name", "Population (Census 2011)", "Size Category", "Region Name", "Coastal Category", "Coastal Detail Category", "Nearest Urban Centre ID", "Nearest Urban Centre Name", "Urban/Rural Category", "Job Density Category", "Income Category", "University Category", "Graduates Among Older Residents Category", "Count Students at Key Stage 4", "% Key Stage 2 Attainment", "% Key Stage 4 Attainment", "% Students at Level 2", "% Students at Level 3", "% 19yo in Full Time Higher Education", "% 19yo in Further Education", "% 19yo in Apprenticeship", "% 19yo With Earnings Above 0", "% 19yo With Earnings Above 10K", "% 19yo Jobless", "Ignore (most NA)", "% 22yo Highest Qual up to L2", "% 22yo Highest Qual L3-5", "% 22yo Highest Qual L6 and up", "Score 22yo Highest Qual", "Score Relative Attainment")
kable(cbind(short_names[1:31], classes[1:31], notes[1:31]))
ggplot(
  english_education,
  aes(y = 1, x = education_score, colour = proportion_of_higher_ed),
  size = 0.5) +
  geom_beeswarm(side = 1) +
  scale_y_discrete(
    expand = expansion(mult = c(0.02, 0.1))) +
  theme_bw() +
  theme(axis.text.y = element_blank(), axis.title.y = element_blank())
ggplot(
  english_education[english_education$proportion_of_higher_ed %in% c("High", "Medium", "Low"),],
  aes(x = education_score, fill = proportion_of_higher_ed, colour = proportion_of_higher_ed),
) +
  geom_density(alpha = 0.25) +
  theme_bw() +
  theme(axis.text.y = element_blank(), axis.title.y = element_blank())
ggplot(
  english_education[english_education$proportion_of_higher_ed %in% c("High", "Medium", "Low"),],
  aes(x = education_score, fill = proportion_of_higher_ed, colour = proportion_of_higher_ed),
) +
  geom_density(position="stack", alpha = 0.25) +
  theme_bw() +
  theme(axis.text.y = element_blank(), axis.title.y = element_blank())
with(older_and_edu,
  correlation_oe <<- cor(
      attainment_score, percentage_of_higher_ed, method="pearson"
    )
  )

g <- ggplot(
  older_and_edu,
  aes(x = attainment_score, y = percentage_of_higher_ed),
)

english_edu_scatter(g, correlation_oe, 7.5, 20)
summary(
  aov(
    education_score ~ size_flag,
    data = dplyr::filter(
      english_education,
      town_size_filter
    )
  )
)
small_towns <- english_education %>%
  dplyr::filter(size_flag == "Small Towns") %>%
  dplyr::select(education_score)

medium_towns <- english_education %>%
  dplyr::filter(size_flag == "Medium Towns") %>%
  dplyr::select(education_score)
  

t.test(small_towns, medium_towns, "greater")
with(income_and_edu,
  correlation_ie <<- cor(
      attainment_score, income_score, method="pearson"
    )
  )

g <- ggplot(
  income_and_edu,
  aes(x = attainment_score, y = income_score),
)

english_edu_scatter(g, correlation_ie, 5, 0.8)
size_income <- table(english_education[c("size_flag", "income_flag")])
size_income <- size_income[
  size_flags,
  c("Lower deprivation towns", "Mid deprivation towns", "Higher deprivation towns")
]
dimnames(size_income) <- list(
  size_flag = c("Large", "Medium", "Small"),
  income_flag = c("Lower", "Mid", "Higher")
)
expected_size_income <- round(independence_table(size_income), 0)

mosaic(
  expected_size_income,
  main="Expected frequencies",
  labeling = labeling_values,
  value_type = "expected",
  gp_text = gpar(fontface = 1),
  rot_labels = c(top = -20, left = 70)
)

mosaic(
  size_income,
  gp = shading_hsv,
  gp_args = myShadingPar,
  main="Actual frequencies",
  labeling = labeling_values,
  value_type = "observed",
  gp_text = gpar(fontface = 1),
  rot_labels = c(top = -20, left = 70),
  legend = FALSE
)
lmer_model <- lmer(
  education_score ~ 1 + size_flag + income_flag + (1 | rgn11nm/ttwa11nm),
  data = english_education,
  subset = town_size_filter
)

summary(lmer_model)
plot(lmer_model)
kable(cbind(c("Large Towns / Higher deprivation towns", "Medium Towns", "Small Towns", "Mid deprivation towns", "Lower deprivation towns"), coefficients = round(lmer_model@beta, 4)))
coeff_sd <- as.data.frame(VarCorr(lmer_model))[c(1, 5)]
names(coeff_sd) <- c("Group", "Std.Dev.")
kable(coeff_sd)
dotplot(ranef(lmer_model))[2]
dotplot(ranef(lmer_model))[1]

Tags:

Categories: