Superspreading and the Gini Coefficient

[This article was first published on Theory meets practice…, and kindly contributed to R-bloggers.]



We look at superspreading in infectious disease transmission from a statistical point of view. We characterise heterogeneity in the offspring distribution by the Gini coefficient instead of the usual dispersion parameter of the negative binomial distribution. This allows us to consider more flexible offspring distributions.

Creative Commons License: This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.


The recent Science report on superspreading during the COVID-19 pandemic by Kai Kupferschmidt has made the dispersion parameter \(k\) of the negative binomial distribution a hot quantity (see footnote 1) in the discussions of how to determine effective interventions. This short blog post aims at understanding the math behind statements such as “Probably about 10% of cases lead to 80% of the spread” and at replicating them with computations in R.

Warning: This post reflects my own learning process of what superspreading is, rather than trying to make any statements of significance.


Lloyd-Smith et al. (2005) show that the 2002-2004 SARS-CoV-1 epidemic was driven by a small number of events where one case directly infected a large number of secondary cases, a so-called superspreading event. This means that for SARS-CoV-1 the distribution of how many secondary cases each primary case generates is heavy tailed. More specifically, the effective reproduction number describes the mean number of secondary cases a primary case generates during the outbreak, i.e. it is the mean of the offspring distribution. In order to address dispersion around this mean, Lloyd-Smith et al. (2005) use the negative binomial distribution with mean \(R(t)\) and over-dispersion parameter \(k\) as a probability model for the offspring distribution. The number of offspring that case \(i\), which got infected at time \(t_i\), causes is given by
\[
Y_{i} \sim \operatorname{NegBin}(R(t_i), k),
\]
s.t. \(\operatorname{E}(Y_{i}) = R(t_i)\) and \(\operatorname{Var}(Y_{i}) = R(t_i) (1 + \frac{1}{k} R(t_i))\). This parametrisation makes it easy to see that the negative binomial model has an additional factor \(1 + \frac{1}{k} R(t_i)\) for the variance, which allows it to have excess variance (aka. over-dispersion) compared to the Poisson distribution, which has \(\operatorname{Var}(Y_{i}) = R(t_i)\). If \(k\rightarrow \infty\) we get the Poisson distribution, and the closer \(k\) is to zero the larger the variance, i.e. the heterogeneity, in the distribution is. Note the deliberate use of the effective reproduction number \(R(t_i)\) instead of the basic reproduction number \(R_0\) (as done in Lloyd-Smith et al. (2005)) in the model. This is to highlight that one is likely to observe clusters in the context of interventions and depletion of susceptibles.
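The mean and variance stated above can be checked numerically. The following is a minimal sketch, assuming the values \(R(t)=2.5\) and \(k=0.45\) used later in this post and an arbitrarily chosen truncation of the support at 5000; it relies only on base R's `dnbinom` with its `mu`/`size` parametrisation.

```r
# Assumed example values (the ones used later in the post)
Rt <- 2.5
k  <- 0.45

# Evaluate the PMF on a long grid so that the truncated tail is negligible
# (the upper limit 5000 is an arbitrary choice for this sketch)
y   <- 0:5000
pmf <- dnbinom(y, mu = Rt, size = k)

# Mean and variance computed directly from the PMF
mean_num <- sum(y * pmf)
var_num  <- sum((y - mean_num)^2 * pmf)

# Variance according to the formula R(t) * (1 + R(t) / k)
var_formula <- Rt * (1 + Rt / k)

c(mean = mean_num, var_numeric = var_num, var_formula = var_formula)

# For a very large k the variance formula approaches the Poisson variance R(t)
var_large_k <- Rt * (1 + Rt / 1e6)
```

The numerically computed variance should agree with the formula, and `var_large_k` is close to `Rt`, which illustrates the Poisson limit.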

That the dispersion parameter \(k\) is gaining epidemiological fame is a bit surprising, because it is a parameter in a specific parametric model, and this model might be inadequate for the observed data. A secondary objective of this post is thus to focus more on describing the heterogeneity of the offspring distribution using classical statistical concepts such as the Gini coefficient.

Negative binomial distributed number of secondary cases

Let’s assume \(k=0.45\) as done in Adam et al. (2020). This is a slightly higher estimate than the \(k=0.1\) estimate by Endo et al. (2020) (see footnote 2) quoted in the Science article. We want to derive statements like “the x% most active spreaders infected y% of all cases” as a function of \(k\). The PMF of the offspring distribution with mean 2.5 and dispersion 0.45 looks as follows:

library(tidyverse)  # provides %>%, mutate(), ggplot() and str_c() used below

Rt <- 2.5
k <- 0.45

# Evaluate on a large enough grid, so that E(Y_t) is determined accurately enough
# We also include -1 in the grid to get a point (0,0) needed for the Lorenz curve
df <- data.frame(x=-1:250) %>% mutate(pmf= dnbinom(x, mu=Rt, size=k))

So we observe that 43% of the cases never manage to infect a secondary case, whereas some cases manage to generate more than 10 new cases. The mean of the distribution is checked empirically to equal the specified \(R(t)\) of 2.5:

sum(df$x * df$pmf)
## [1] 2.5
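The two probabilities mentioned above can also be read off directly from the negative binomial distribution functions. A small sketch, assuming the same `Rt` and `k` as above and using only base R:

```r
Rt <- 2.5
k  <- 0.45

# P(Y = 0): probability that a primary case infects nobody
p_zero <- dnbinom(0, mu = Rt, size = k)

# P(Y > 10): probability that a primary case generates more than 10 new cases
p_more_than_10 <- pnbinom(10, mu = Rt, size = k, lower.tail = FALSE)

round(c(p_zero = p_zero, p_more_than_10 = p_more_than_10), 3)
```

For the negative binomial, `p_zero` equals \((k/(k+R(t)))^k\), which gives the 43% quoted in the text.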

Lloyd-Smith et al. (2005) define a superspreader to be a primary case that generates more secondary cases than the 99th percentile of the Poisson distribution with mean \(R(t)\). We use this to compute the proportion of superspreaders in our distribution:

(superspreader_threshold <- qpois(0.99, lambda=Rt))
## [1] 7
(p_superspreader <- pnbinom(superspreader_threshold, mu=Rt, size=k, lower.tail=FALSE))
## [1] 0.09539277

So 10% of the cases will generate more than 7 new cases. To get to statements such as “10% generate 80% of the cases” we also need to know how many cases those 10% generate out of the 2.5 average.

# Compute proportion of the overall expected number of new cases
df <- df %>% mutate(cdf = pnbinom(x, mu=Rt, size=k),
                    expected_cases = x * pmf,
                    prop_of_Rt = expected_cases / Rt,
                    cum_prop_of_Rt = cumsum(prop_of_Rt))

# Summarise
(info <- df %>% filter(x > superspreader_threshold) %>%
  summarise(expected_cases = sum(expected_cases), prop_of_Rt = sum(prop_of_Rt)))
##   expected_cases prop_of_Rt
## 1       1.192786  0.4771144

In other words, the superspreaders generate (on average) 1.19 of the 2.5 new cases of a generation, i.e. 48%.

These statements can also be made without formulating a superspreader threshold by graphing the cumulative share of the distribution of primary cases against the cumulative share of secondary cases these generate. This is exactly what the Lorenz curve is doing. However, for outbreak analysis it appears clearer to graph the cumulative distribution in decreasing order of the number of offspring, i.e. following Lloyd-Smith et al. (2005) we plot the cumulative share as \(P(Y\geq y)\) instead of \(P(Y \leq y)\). This is a variation of the Lorenz curve, but it allows statements such as “the x% of cases with the highest number of offspring generate y% of the secondary cases”.

# Add info for plotting the modified Lorenz curve
df <- df %>% 
  mutate(cdf_decreasing = pnbinom(x-1, mu=Rt, size=k, lower.tail=FALSE)) %>%
  arrange(desc(x)) %>%
  mutate(cum_prop_of_Rt_decreasing = cumsum(prop_of_Rt))
# Plot the modified Lorenz curve as in Fig 1b of Lloyd-Smith et al. (2005)
ggplot(df, aes(x=cdf_decreasing, y=cum_prop_of_Rt_decreasing)) + geom_line() +
  coord_cartesian(xlim=c(0,1)) +
  xlab("Proportion of the infectious cases (cases with most secondary cases first)") +
  ylab("Proportion of the secondary cases") +
  scale_x_continuous(labels=scales::percent, breaks=seq(0,1,length=6)) +
  scale_y_continuous(labels=scales::percent, breaks=seq(0,1,length=6)) +
  # Dashed diagonal: reference line of perfect equality
  geom_line(data=data.frame(x=seq(0,1,length=100)) %>% mutate(y=x), aes(x=x, y=y), lty=2, col="grey") +
  ggtitle(str_c("Scenario: R(t) = ", Rt, ", k = ", k))
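Statements of the form “the x% of cases with the highest number of offspring generate y% of the secondary cases” can also be read off the modified Lorenz curve numerically. The following is a minimal sketch in base R; the function name `prop_generated_by_top` and the truncation point `x_max` are assumptions made for this illustration and are not part of the original analysis. Between the points of the discrete distribution the curve is interpolated linearly, as in the plot.

```r
# Share of secondary cases generated by the fraction p of primary cases with
# the most offspring, under a negative binomial offspring distribution.
# x_max is an assumed truncation point of the support.
prop_generated_by_top <- function(p, Rt, k, x_max = 1000) {
  x   <- x_max:0                          # offspring counts, largest first
  pmf <- dnbinom(x, mu = Rt, size = k)
  cum_cases     <- cumsum(pmf)            # P(Y >= x): share of primary cases
  cum_offspring <- cumsum(x * pmf) / Rt   # share of secondary cases they cause
  # Linear interpolation between the knots of the modified Lorenz curve
  approx(c(0, cum_cases), c(0, cum_offspring), xout = p,
         ties = max, rule = 2)$y
}

# Share generated by the top 10% and top 20% for two values of k
prop_generated_by_top(c(0.1, 0.2), Rt = 2.5, k = 0.45)
prop_generated_by_top(c(0.1, 0.2), Rt = 2.5, k = 0.1)
```

Evaluating the function at the superspreader proportion computed earlier (about 9.5%) should reproduce the 48% share obtained from the threshold-based calculation.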

The Gini coefficient can be computed using the standard formula for a discrete distribution with support on the non-negative integers, i.e.
\[
G = \frac{1}{2\mu} \sum_{y=0}^\infty \sum_{z=0}^\infty f(y) f(z) |y-z|,
\]
where \(f(y)\), \(y=0,1,\ldots\) denotes the PMF of the distribution and \(\mu=\sum_{y=0}^\infty y f(y)\) is the mean of the distribution. In our case \(\mu=R(t)\). From this we get

# Gini index for a discrete probability distribution
gini_coeff <- function(df) {
  mu <- sum(df$x * df$pmf)
  sum <- 0
  # Double sum over all pairs (y, z) of the support
  for (i in seq_len(nrow(df))) {
    for (j in seq_len(nrow(df))) {
      sum <- sum + df$pmf[i] * df$pmf[j] * abs(df$x[i] - df$x[j])
    }
  }
  return(sum / (2 * mu))
}

gini_coeff(df)
## [1] 0.704049
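The double sum in the formula can also be evaluated without explicit loops. The following sketch uses base R's `outer()` to build all pairwise products and absolute differences at once; the function name `gini_vec` is an assumption for this illustration, and the approach trades memory (a matrix with one entry per pair of support points) for speed.

```r
# Vectorised Gini coefficient of a discrete distribution with support x
# and probabilities pmf (hypothetical helper, not from the original post)
gini_vec <- function(x, pmf) {
  mu <- sum(x * pmf)
  # outer(pmf, pmf) holds f(y) f(z); abs(outer(x, x, "-")) holds |y - z|
  sum(outer(pmf, pmf) * abs(outer(x, x, "-"))) / (2 * mu)
}

# Negative binomial offspring distribution with R(t) = 2.5 and k = 0.45
x <- 0:250
gini_nb <- gini_vec(x, dnbinom(x, mu = 2.5, size = 0.45))
gini_nb
```

This should give the same value as the loop-based `gini_coeff` above, since both evaluate the same sum on the same grid.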

A plot of the relationship between the dispersion parameter and the Gini index, given a fixed value of \(R(t)=2.5\), looks as follows

We see that the Gini index converges from above to the Gini index of the Poisson distribution with mean \(R(t)\). In our case this limit is

gini_coeff( data.frame(x=0:250) %>% mutate(pmf = dpois(x, lambda=Rt)))
## [1] 0.3475131
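The relationship between the dispersion parameter and the Gini index described above can be recomputed on a grid of \(k\) values. This is a minimal sketch in base R; the helper `gini_discrete`, the grid `k_grid` and the truncation of the support at 1000 are assumptions chosen for illustration, not the values used for the original figure.

```r
# Gini coefficient of a discrete distribution (vectorised double sum)
gini_discrete <- function(x, pmf) {
  mu <- sum(x * pmf)
  sum(outer(pmf, pmf) * abs(outer(x, x, "-"))) / (2 * mu)
}

Rt <- 2.5
x  <- 0:1000                                   # assumed truncation of the support
k_grid <- c(0.1, 0.2, 0.45, 1, 2, 5, 10, 100)  # assumed grid of dispersion values

# Gini index of the negative binomial for each k, with R(t) held fixed
gini_nb <- sapply(k_grid, function(k) gini_discrete(x, dnbinom(x, mu = Rt, size = k)))

# Limit for k -> infinity: Gini index of the Poisson distribution
gini_pois <- gini_discrete(x, dpois(x, lambda = Rt))

data.frame(k = k_grid, gini = gini_nb, poisson_limit = gini_pois)
```

Plotting `gini` against `k` (for example on a logarithmic k-axis) together with a horizontal line at `gini_pois` gives a picture of the convergence described in the text.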

Red Marble Toy Example

We now consider the toy example offspring distribution used by Christian Drosten in his Coronavirus Update podcast episode 44 on COVID-19 superspreading (in German). The described hypothetical scenario is translated to an offspring distribution, where a primary case either generates 1 (with probability 9/10) or 10 (with probability 1/10) secondary cases:

# Offspring distribution
df_toyoffspring <- data.frame( x=c(1,10), pmf=c(9/10, 1/10))

# Hypothetical outbreak with 10000 cases from this offspring distribution
y_obs <- sample(df_toyoffspring$x, size=10000, replace=TRUE, prob=df_toyoffspring$pmf)

# Fit the negative binomial distribution to the observed offspring distribution
# Note: It would be better to fit the PMF directly instead of to the hypothetical
# outbreak data
(fit <- MASS::fitdistr(y_obs, "negative binomial"))
##       size           mu
##    1.69483494    1.90263640
##   (0.03724779)  (0.02009563)
# Note: different parametrisation of the k parameter
(k.hat <- 1/fit$estimate["size"])
##     size
## 0.590028

In other words, when fitting a negative binomial distribution to these data (probably not a good idea) we get a dispersion parameter of 0.59.

The Gini coefficient allows for a more sensible description of offspring distributions which are clearly not negative binomial.

gini_coeff(df_toyoffspring)
## [1] 0.4263158
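For this two-point distribution the Gini coefficient can also be worked out by hand from the formula given earlier. The mean is \(\mu = \frac{9}{10}\cdot 1 + \frac{1}{10}\cdot 10 = 1.9\), and only the two mixed pairs \((1,10)\) and \((10,1)\) contribute to the double sum:

```latex
\[
G = \frac{1}{2\mu} \sum_{y} \sum_{z} f(y) f(z) |y-z|
  = \frac{1}{2 \cdot 1.9} \cdot 2 \cdot \frac{9}{10} \cdot \frac{1}{10} \cdot |1-10|
  = \frac{1.62}{3.8} \approx 0.4263 .
\]
```

This agrees with the numerical value shown above.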


The effect of superspreaders underlines the stochastic nature of the dynamics of a person-to-person transmitted disease in a population. The dispersion parameter \(k\) is conditional on the assumption of a given parametric model for the offspring distribution (negative binomial). The Gini index is an alternative characterisation to measure heterogeneity. However, in both cases the parameters are to be interpreted together with the expectation of the distribution. Estimation of the dispersion parameter is orthogonal to the mean in the negative binomial and it is straightforward to also get confidence intervals for it. This is less straightforward for the Gini index.
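One generic way to attach an uncertainty interval to an estimated Gini index is a nonparametric bootstrap. The following is only a rough sketch of that idea and not a method used in this post: the helper `gini_sample`, the number of bootstrap replicates `B`, the seed and the percentile interval are all assumptions made for illustration, and the simulated data come from the toy offspring distribution above.

```r
set.seed(123)  # arbitrary seed, only to make this sketch reproducible

# Sample Gini coefficient via the sorted-values identity, which is
# equivalent to sum_i sum_j |y_i - y_j| / (2 * n^2 * mean(y))
gini_sample <- function(y) {
  y <- sort(y)
  n <- length(y)
  2 * sum(seq_len(n) * y) / (n * sum(y)) - (n + 1) / n
}

# Hypothetical outbreak with 10000 cases from the toy offspring distribution
y_obs <- sample(c(1, 10), size = 10000, replace = TRUE, prob = c(9/10, 1/10))

# Nonparametric bootstrap: resample cases with replacement and
# recompute the Gini coefficient for each resample
B <- 500
gini_boot <- replicate(B, gini_sample(sample(y_obs, replace = TRUE)))

gini_hat <- gini_sample(y_obs)
ci <- quantile(gini_boot, probs = c(0.025, 0.975))  # percentile interval

gini_hat
ci
```

Whether such a percentile interval has good coverage for heavy tailed offspring distributions would need to be checked separately.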

A heavy tailed offspring distribution can make the disease easier to control by targeting intervention measures to restrict superspreading (Lloyd-Smith et al. 2005). The hope is that such interventions are “cheaper” than interventions which target the entire population of infectious contacts. However, the success of such a targeted strategy also depends on how large the contribution of superspreaders really is. Hence, some effort is needed to quantify the effect of superspreaders. Furthermore, the above treatment also underlines that heterogeneity can be a helpful feature to exploit when trying to control a disease. Another aspect of such heterogeneity, namely its influence on the threshold of herd immunity, has recently been investigated by my colleagues at Stockholm University (Britton, Ball, and Trapman 2020).


Adam, DC, P Wu, J Wong, E Lau, T Tsang, S Cauchemez, G Leung, and B Cowling. 2020. “Clustering and Superspreading Potential of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) Infections in Hong Kong.” Research Square.

Britton, T, F Ball, and P Trapman. 2020. “The Disease-Induced Herd Immunity Level for Covid-19 Is Substantially Lower Than the Classical Herd Immunity Level.”

Endo, A, Centre for the Mathematical Modelling of Infectious Diseases COVID-19 Working Group, S Abbott, AJ Kucharski, and S Funk. 2020. “Estimating the Overdispersion in Covid-19 Transmission Using Outbreak Sizes Outside China [Version 1; Peer Review: 1 Approved, 1 Approved with Reservations].” Wellcome Open Res.

Lloyd-Smith, J. O., S. J. Schreiber, P. E. Kopp, and W. M. Getz. 2005. “Superspreading and the Effect of Individual Variation on Disease Emergence.” Nature 438 (7066): 355–59.

  1. To be added to the list of characterising quantities such as doubling time, reproduction number, generation time, serial interval, …

  2. Lloyd-Smith et al. (2005) estimated \(k=0.16\) for SARS-CoV-1.
