Statistical analysis of highly skewed immune response data

https://doi.org/10.1016/S0022-1759(96)00216-5

Abstract

This paper considers methods of statistical analysis for highly skewed immune response data. Observations from population studies of immunological variables are rarely normally distributed between individuals; typically the distribution shows extreme levels of skewness. In some situations, skewness remains considerable even after transforming the data. Using resampling techniques, applied to several actual datasets of ELISA data, we consider the robustness of normal parametric methods, e.g. t tests and linear regression. Despite the skewness of the transformed data, we demonstrate that such methods are quite robust, depending on the number of observations, type of analysis and severity of skewness. We also illustrate how bootstrap resampling can provide a valid alternative method of analysis, either for checking normal parametric analysis or as a direct method of analysis. We illustrate this combined approach by analysing real data to test for association between human serum antibodies to the malaria merozoite surface proteins MSP1 and MSP2 and resistance to clinical malaria; we confirm the protective effect of antibodies to MSP1 and demonstrate a similar protective effect for some antibodies to MSP2.

Introduction

Population surveys of naturally acquired immune responses can be used to obtain quantitative and qualitative data for a number of immunological variables which can be related to clinical variables such as susceptibility to disease, disease progression or prognosis and individual variables such as age, sex, previous medical history, etc. From our own work, and the published data of many other research groups, we have observed that immune response data collected in such studies are often distributed, between individuals, in a very uneven or asymmetrical pattern. A rapid search of the published literature has revealed a number of serological datasets for a variety of different parasitic organisms, all of which showed some degree of skewness (Nutman et al., 1985; Gabra et al., 1986; Perlmann et al., 1989; Tenter et al., 1991; Deplazes et al., 1992; Helbeig et al., 1993; Van Gelder et al., 1993; Paranhos-Bacalla et al., 1994; Ferrari et al., 1995); additional examples can be found in a recent review by Muller et al. (1995). The degree of skewness appears to depend, to some extent, on the type of antigen used in the assay: responses to defined single antigens, typically represented by recombinant proteins or synthetic peptides, are highly positively skewed, whereas responses to crude parasite extracts, which are mixtures of many different antigens, tend to be less skewed. For example, severely skewed data were obtained for purified antigens of Echinococcus granulosus (Helbeig et al., 1993) and Plasmodium falciparum (Gabra et al., 1986; Perlmann et al., 1989) and moderately skewed data for defined Toxoplasma gondii (Tenter et al., 1991; Van Gelder et al., 1993) and Trypanosoma cruzi (Paranhos-Bacalla et al., 1994) antigens. In contrast, serological responses to crude extracts of a number of helminth parasites, including Echinococcus and Schistosoma, are only slightly skewed (Nutman et al., 1985; Deplazes et al., 1992; Ferrari et al., 1995).

A recent paper (Bennett and Riley, 1992) confirmed that skewness across a range of datasets is more the norm than the exception and that there is no consensus among immunologists about how such data should be analysed. Skewness may be so pronounced that the application of standard analysis can be rather problematic. One common approach to analysis is to ignore the skewness in the data and apply normal parametric methods such as t tests, analysis of variance (or covariance) or standard linear regression, with or without applying a transformation (usually logarithmic) to the data to reduce or remove skewness. As will be confirmed in this paper, transforming immune response data does not always normalize the distribution and may only have a minor impact on reducing the level of skewness (Bennett and Riley, 1992). At first sight, using normal parametric methods on skewed data may be expected to be invalid: confidence intervals intended to provide 95% probability of coverage (of the population value) may actually attain only a lower level of coverage and perhaps give misleading 'statistically significant' results. However, because of the 'Central Limit Theorem' (CLT) of statistics (Ross, 1976), normal methods may still provide approximately valid confidence intervals (or equivalently, hypothesis test p values) provided that (a) the sample size is large enough, and (b) the distribution is not too severely skewed. A second approach is to use standard non-parametric methods, e.g. Mann-Whitney or Kruskal-Wallis significance tests. However, these methods do not make full use of the available quantitative data and are limited when a multivariate analysis (e.g. multiple regression) is required, which is often the case with epidemiological data. One aim of this paper is to investigate the range of values relating to (a) and (b) which will permit the use of normal parametric methods for sero-epidemiologic studies.

A third approach to the analysis of immune response data is to avoid analysing the original quantitative data altogether but instead to analyse corresponding binary data after categorising each individual as a 'responder' or a 'non-responder', depending on whether the measured response is above or below some estimated cut-off value. Standard statistical methods such as the chi-squared test or logistic regression can then be validly applied. Although this approach may make biological sense, it is not ideal. Firstly, when plots of the data show no obvious division of individuals into two groups, defining a cut-off value is a rather arbitrary and potentially unjustified process. Results from such analysis may also be sensitive to the particular choice of cut-off, especially for small studies. In addition, it has been shown previously (Bennett and Riley, 1992) that statistical power will be decreased when qualitative rather than quantitative data are analysed.
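
Conditions (a) and (b) above can be explored by simulation. The following Python sketch is illustrative only and is not the resampling procedure used in this paper: it draws samples from a lognormal distribution (an assumed stand-in for skewed OD data), builds the usual t-based 95% interval for the mean, and records how often the interval covers the true mean. The function names, the lognormal model, the chosen sample sizes and the tabulated t quantiles are all assumptions introduced for illustration.

```python
import math
import random
import statistics

# Two-sided 95% t quantiles from standard tables, keyed by sample size n (df = n - 1).
# Only these illustrative sample sizes are supported in this sketch.
T_CRIT = {10: 2.262, 30: 2.045, 100: 1.984}


def t_interval(sample, t_crit):
    """Normal-theory 95% confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean - t_crit * se, mean + t_crit * se


def coverage(n, sigma, n_sim=2000, seed=1):
    """Fraction of simulated t intervals that contain the true lognormal mean."""
    rng = random.Random(seed)
    true_mean = math.exp(sigma ** 2 / 2)  # mean of lognormal(mu=0, sigma)
    hits = 0
    for _ in range(n_sim):
        sample = [rng.lognormvariate(0.0, sigma) for _ in range(n)]
        lo, hi = t_interval(sample, T_CRIT[n])
        hits += lo <= true_mean <= hi
    return hits / n_sim


if __name__ == "__main__":
    # Larger sigma means more severe skewness in the assumed lognormal model.
    for sigma in (0.5, 1.5):
        for n in (10, 30, 100):
            print(f"sigma={sigma} n={n} coverage={coverage(n, sigma):.3f}")
```

Under these assumptions, the estimated coverage would be expected to fall further below the nominal 95% as sigma increases or as n decreases, which is the trade-off between sample size and severity of skewness described above.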

A relatively new approach to constructing confidence intervals, without making any parametric assumptions about the shape of the underlying population of data, is the bootstrap resampling method, first introduced by Efron (1979). A very readable account of the method is provided by Efron and Tibshirani (1993). The bootstrap method involves Monte Carlo sampling and is computationally intensive. It differs from traditional formula-driven methods in that the original data are resampled, many times, to simulate sampling variability and to provide an approximation for the unknown sampling distribution of the statistic of interest. This sampling distribution is used to provide estimates of standard error and to estimate confidence intervals. Traditional normal methods are based on the assumption that the underlying data distribution is normal, so that the sampling variability of the statistic of interest is also normal and therefore predictable. The bootstrap method does not require assumptions about the underlying population of data, and therefore, intuitively, it is appropriate for analysing immune response data regardless of, for example, severe skewness. The bootstrap is a non-parametric method of analysis and provides a 'gold standard' against which other analyses can be compared.
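
The resampling idea described above can be sketched in a few lines of Python. This is a minimal illustration of the simple percentile bootstrap for a sample mean, not necessarily the variant used in this paper; the function name `bootstrap_percentile_ci`, its parameters and the example OD values are hypothetical.

```python
import random
import statistics


def bootstrap_percentile_ci(data, stat=statistics.fmean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval and bootstrap standard error for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n observations with replacement and recompute the statistic each time.
    replicates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the replicates.
    lo_idx = round((alpha / 2) * n_boot)
    hi_idx = round((1 - alpha / 2) * n_boot) - 1
    boot_se = statistics.stdev(replicates)  # SD of the replicates estimates the standard error
    return replicates[lo_idx], replicates[hi_idx], boot_se


if __name__ == "__main__":
    # Hypothetical, positively skewed OD readings (not data from this study).
    od = [0.02, 0.03, 0.03, 0.05, 0.06, 0.08, 0.10, 0.15, 0.40, 1.20]
    lo, hi, se = bootstrap_percentile_ci(od)
    print(f"mean={statistics.fmean(od):.3f} 95% CI=({lo:.3f}, {hi:.3f}) SE={se:.3f}")
```

Passing a different `stat` function (for example a difference in group means computed on paired resamples) would follow the same pattern; that extension is left as an assumption here.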

This paper aims to: (i) consider the robustness and limitations of normal parametric methods for the analysis of highly skewed immune response data, and (ii) explore the use of the bootstrap method as a tool for checking normal analysis and also as a direct method of analysis.

We shall use as examples datasets obtained from enzyme-linked immunosorbent assays (ELISA) of human sera tested for the presence of antibodies to the malaria parasite, Plasmodium falciparum. The data obtained are examined in relation to both clinical and demographic covariates in order to test for any statistical association between immune responsiveness and malarial disease.

Data and immunological methods

Serum samples were collected from children aged 3–8 years living in a highly malaria endemic area of West Africa. Details of the study area, malaria endemicity and demography have been published previously (Greenwood et al., 1987; Riley et al., 1990). Sera were tested for antibodies to recombinant malaria antigens by ELISA (Egan et al., 1995; Taylor et al., 1995). The recombinant antigens used were produced as glutathione-S-transferase fusion proteins in pGex expression vectors, as described

Normal parametric methods

Three different and commonly encountered situations were considered to assess the robustness of normal parametric analysis.

    (a) The 1-sample situation of estimating the population average immune response. The relevant normal method is a one-sample t test (or an equivalent t-based confidence interval). This problem can arise in practice when paired data, i.e. two OD measurements per individual are involved and the within-subject differences are calculated for analysis.

    (b) The 2-sample situation of

Descriptive analysis

Fig. 1 shows histograms for five datasets of OD data ordered from left to right by level of skewness (upper histograms). Histograms for the corresponding log-transformed data are also shown (lower histograms). A standard skewness statistic (see formula in Fig. 1) was calculated for each set of data and all values were significantly greater than zero (two standard errors ≅ 0.34, p < 0.05). For untransformed data, skewness ranged from extreme levels of 6.15 (MAD2) down to moderate levels of 1.02
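
As an illustration of the kind of calculation described in this section, the Python sketch below computes a moment-based skewness coefficient before and after log transformation. The formula given in Fig. 1 is not reproduced in this text, so the definition used here (third central moment divided by the 3/2 power of the second) and the large-sample standard error approximation sqrt(6/n) are assumptions, as are the function names and the example values.

```python
import math


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (assumed definition, see lead-in)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def skewness_se(n):
    """Approximate large-sample standard error of skewness under normality."""
    return math.sqrt(6.0 / n)


if __name__ == "__main__":
    # Hypothetical OD readings (not data from this study).
    od = [0.02, 0.03, 0.03, 0.05, 0.06, 0.08, 0.10, 0.15, 0.40, 1.20]
    log_od = [math.log(x) for x in od]
    print(f"raw skewness = {skewness(od):.2f}")
    print(f"log skewness = {skewness(log_od):.2f}")
    print(f"two standard errors = {2 * skewness_se(len(od)):.2f}")
```

A skewness value exceeding two standard errors would, under this approximation, be flagged as significantly greater than zero at roughly the 5% level.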

Discussion

Using several real immune response datasets we have considered (i) the robustness of normal parametric methods for analysing skewed immune response data and (ii) the use of bootstrap resampling as an alternative method of analysis. These data are typical of highly skewed data which are often obtained in immuno-epidemiological studies. Although some datasets, for particularly immunogenic antigens, may give normal or near normal distributions, for less immunogenic antigens, where the proportion

Acknowledgements

We are very grateful to Andrea Egan and Rachel Taylor for providing ELISA data, Bruce Worton for helpful discussions, and to the Wellcome Trust for providing financial support. Steve Bennett was supported by the Medical Research Council.

References (25)

  • Aitken, M. (1990) Statistical Modelling in GLIM. Oxford Science Publications,...
  • Armitage, P. and Berry, G. (1994) Statistical Methods in Medical Research. Blackwell Scientific Publications,...
  • Bennett, S. and Riley, E. (1992) The statistical analysis of data from immunoepidemiological studies. J. Immunol....
  • Critchfield, G. et al. (1992) Nonparametric assessment of toxicologic assay linearity by bootstrap analysis. J. Analyt....
  • Deplazes, P. et al. (1992) Detection of Echinococcus coproantigens by enzyme-linked immunosorbent assay in dogs, dingos...
  • Efron, B. (1979) Bootstrap methods: another look at the jackknife. Ann. Stat. 7,...
  • Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap. Chapman and Hall,...
  • Egan, A. et al. (1995) Serum antibodies from malaria-exposed people recognize conserved epitopes formed by the two...
  • Egan, A. et al. (1996) Clinical immunity to Plasmodium falciparum malaria is associated with serum antibodies to the...
  • Ferrari, T. et al. (1995) The value of an enzyme-linked immunosorbent assay for the diagnosis of schistosomiasis...
  • Fulford, A.C. (1994) Dispersion and Bias: Can we trust geometric means? Parasitol. Today 10,...
  • Gabra, M. et al. (1986) Defined Plasmodium falciparum antigens in malaria serology. Bull. WHO 64,...

Cited by (41)

    • Fatigue and sleepiness responses to experimental inflammation and exploratory analysis of the effect of baseline inflammation in healthy humans

      2020, Brain, Behavior, and Immunity
      Citation Excerpt :

      All analyses were performed using Stata 14 (StataCorp, College Station, Texas) using an alpha level of 0.05. Following recommendations by (McGuinness et al., 1997), cytokine data were log-transformed and analyses were performed using bootstrap resampling method (1000 resamples). The data files are freely available on the OSF repository: [https://osf.io/u78h6/?view_only=a8a203227c854e35a496597d40e2cf5e].

    • Maternal BCG scar is associated with increased infant proinflammatory immune responses

      2017, Vaccine
      Citation Excerpt :

      Cytokine and chemokine concentrations showed skewed distributions. Results were transformed to log10 (cytokine concentration + 1) for graphical representation using GraphPad Prism v6.0c (GraphPad software, Inc., La Jolla, CA, USA) and for analysis by linear regression using bootstrapping [33] using STATA v. 13.1 (College Station, TX, USA). Results from regression analyses are presented as adjusted geometric mean ratios (aGMR) [95% confidence interval (CI)].

    • Factors associated with tuberculosis infection, and with anti-mycobacterial immune responses, among five year olds BCG-immunised at birth in Entebbe, Uganda

      2015, Vaccine
      Citation Excerpt :

      Logistic regression was used to examine associations with LTBI at age five. Cytokine responses were transformed to log10 (concentration + 1) and then analysed using linear regression with bootstrapping to estimate bias corrected accelerated confidence intervals [31]; results were back-transformed to give geometric mean ratios. Spearman's correlation coefficients between cytokines responses at one and five years were calculated.

    • The influence of BCG vaccine strain on mycobacteria-specific and non-specific immune responses in a prospective cohort of infants in Uganda

      2012, Vaccine
      Citation Excerpt :

      Random effects were used to account for potential between-lot variability (since several lots of vaccine were administered within each BCG strain group). As some cytokine results remained skewed after log10 transformation, analyses were bootstrapped [33] with 10,000 repeats to calculate bias-corrected accelerated confidence intervals. Cytokine responses of infants with and without a BCG scar were compared using the same methods but without random effects (being independent of potential between-lot variability).
