Review
Reference intervals: an update

https://doi.org/10.1016/S0009-8981(03)00133-5

Abstract

Reference intervals serve as the basis of laboratory testing and aid the physician in differentiating between the healthy and the diseased patient. The standard method for determining a reference interval is to define and obtain a healthy population of at least 120 individuals and to use nonparametric estimates of the 95% reference interval. This method is less accurate when the group is substantially smaller and does not allow for the exclusion of outliers. To overcome these limitations, many authors in the current literature report reference intervals after arbitrary truncation of the data or use inappropriate parametric calculations. We argue that the use of outlier removal and robust estimators, with or without transformation to normality, addresses the shortcomings of the standard method and eliminates the need for these less valid approaches.

To test these methods of analysis, well-defined test groups are required. In a few studies, physician-determined health status is provided for each subject along with commonly measured analytes; the NHANES and Fernald studies provide such groups. With such data it is possible to show the range of effects on the reference interval width of including a known non-healthy subgroup. With the NHANES data the effect ranged from negligible to a 30% increase in reference interval width. We found that the use of outlier detection together with the robust estimator yielded reference intervals that were closer to those of the true healthy group.

Another issue is one of demographics, that is, whether or not one should derive separate reference intervals for different demographic groups, e.g., males and females. The standard mathematical test for deciding whether to derive separate reference intervals is due to Harris and Boyd. Using the NHANES data, we examined 33 analytes for each of three ethnic groups (separated by gender). Applying the Harris and Boyd procedure, we observed that it was necessary to derive separate reference intervals for approximately 30% of the comparisons. The most notable analytes were glucose and gamma GT.

The methods used by most laboratories have similar precision, use identical units, are linearly related (often on a 1:1 basis), and correlate well with each other. As a result, the only substantive difference between them is the method bias. Because the reference interval width is unaffected by a constant shift, using it eliminates this bias. We argue that the log ratio of the reference interval widths is a good estimate of the variability between groups.

Introduction

The reference interval is the most widely used medical decision-making tool. It is central to determining whether or not an individual is healthy. Simply put, a reference interval for a particular analyte, e.g., creatinine, provides a range of acceptable values for healthy cohorts (individuals of the same race, gender, age, etc.). For example, if a patient's creatinine concentration, as determined by a laboratory, lies outside of the reference interval, the value is flagged and the patient is designated for further examination. In this sense the reference interval is a comparative measurement.

The National Committee for Clinical Laboratory Standards (NCCLS) procedure for reference interval determination is defined by the scheme given in NCCLS [1].

The problem facing the laboratory scientist is to derive reference intervals for healthy populations [1], [2]. The effects of age, race, exceptional exercise, diet, or non-healthy status on the estimate of the (healthy) reference interval must be considered. The problem here is twofold: first, the reference interval is determined from the samples available to the particular laboratory, and these samples may or may not be comprised entirely of healthy individuals. Second, the size of the sample may not be adequate to estimate the endpoints of the reference interval to a reasonable degree of precision. The end result can be an increased number of incorrect decisions, leading to increased cost, unnecessary investigations, and risks to patient safety. The goal of this review is to present new approaches to reference interval estimation that make more efficient use of the data while being resistant to outliers. Testing methods of reference interval estimation requires both mathematical simulations and analyses of analytes on well-defined groups.

In the computation of reference intervals, extreme, atypical values can exert a disproportionately large influence on an estimator. One definition of such a value, or outlier, “is some observation whose discordancy from the majority of the sample is excessive in relation to the assumed distributional model for the sample, thereby leading to the suspicion that it is not generated by this model” [3]. Possible sources of outliers include recording errors, laboratory errors, and the erroneous inclusion of a second group in the test group, among others.

The advent of the National Health and Nutrition Examination Survey (NHANES) data [4] and our Fernald population [5], both with extensive subject histories, has made it possible to establish reference intervals taking into consideration partitioning by some of the above parameters. The Third National Health and Nutrition Examination Survey (NHANES III), 1988–1994 on CD-ROM (referred to here as NHANES for brevity) contains data for 33,994 persons ages 2 months and older who participated in the survey. The CD-ROM with all the data can be obtained from the National Center for Health Statistics, Data Dissemination Branch, Centers for Disease Control and Prevention, 6525 Belcrest Road, Room 1064, Hyattsville, MD 20782-2003, USA [4]. The data are the result of a complex survey design involving stratification and clustering, and thus weights were assigned to each individual. The weight for an individual indicates the number of people represented by that individual. In our published work, we treated the individuals as coming from a random sample, i.e., individual weights were ignored. Analyses involving the individual weights will be explored in future work.

Clinical chemistry measurements were made on a number of analytes including glucose, sodium, potassium, etc. A physician also determined health status based on a physical examination and participant medical history. The Fernald population is a group of residents who lived near a nuclear feed plant about 15 miles west of Cincinnati, OH. (A nuclear feed plant processes uranium ore and spent uranium into metal for atomic weapons.) Clinical chemistry analyte measurements were made on the Fernald population, similar to those recorded for the NHANES. The health status of the 9000 residents from Fernald was evaluated in a manner similar to that of the NHANES. The scoring differentiated degrees of health into five (NHANES) or six (Fernald) categories.

The issue of whether to use separate reference intervals is of great importance to the clinical chemist. Combining different groups in order to derive a single reference interval has several advantages. First, it allows the laboratory to more easily attain a large number of values. Second, it is easier for physicians to interpret the data if only one or two reference intervals have to be considered. However, combining two disparate demographic groups may increase the likelihood of misclassification because the distributions of their analytical values are inherently different. Mathematical solutions to the question of whether or not to combine have been addressed by Harris and Boyd [6], [7]. Their recommendation, which is used by the NCCLS [1], is that two groups should be combined into a single group unless the differences in their means and/or standard deviations exceed appropriate predetermined thresholds (see also references [8], [9], [10], [11], [12], [13]).


Sample size requirements

In order to compute a 95% reference interval, the 2.5% and 97.5% points of the distribution of the population of interest must be estimated. Here, P=2.5 and the minimum required number of observations is n=(100/P)−1=39. In this case, with n=39, the 95% reference interval would consist of the minimum value (1st order statistic) and the maximum value (39th order statistic). As a further example, if n=99, then the lower limit of the nonparametric reference interval corresponds to rank 0.025*(99+1)=2.5, i.e., an interpolation between the 2nd and 3rd order statistics (and, analogously, rank 97.5 for the upper limit).
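As a concrete illustration of this rank-based calculation, the sketch below (Python; the function name and the simulated data are ours, not from the paper) computes the nonparametric 95% reference interval using the rank formula r = p(n+1) and linear interpolation between adjacent order statistics.

```python
import numpy as np

def nonparametric_ri(values, lower_pct=0.025, upper_pct=0.975):
    """Nonparametric reference interval via the rank formula r = p*(n+1),
    interpolating linearly between adjacent order statistics."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)

    def percentile(p):
        r = p * (n + 1)              # 1-based rank
        lo = min(max(int(np.floor(r)), 1), n)   # clamp to available order statistics
        hi = min(max(int(np.ceil(r)), 1), n)
        frac = r - np.floor(r)
        return x[lo - 1] + frac * (x[hi - 1] - x[lo - 1])

    return percentile(lower_pct), percentile(upper_pct)

# With n = 39 the interval runs from the minimum to the maximum, as described
# above; with n = 99 the limits interpolate between the 2nd/3rd and 97th/98th
# order statistics.
rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=10, size=120)   # simulated analyte values
print(nonparametric_ri(sample))
```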

Methods of calculation: established approaches and recent advances

In this section we will review and compare methods of calculating reference intervals. These approaches include traditional normal theory, data transformation followed by methods based on normal theory, nonparametric, data truncation, and robust.

Normal-theory calculations estimate the reference interval as the mean plus or minus two (more precisely, 1.96) standard deviations (S.D.) of the data set. This calculation yields the 2.5% and 97.5% reference limits. The method is valid only if the data come from a Gaussian (normal) distribution.
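A minimal sketch of this normal-theory calculation; the hypothetical "sodium" data and the function name are illustrative only.

```python
import numpy as np

def normal_theory_ri(values):
    """Normal-theory 95% reference interval: mean +/- 1.96 * S.D.
    Valid only if the data are (at least approximately) Gaussian."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    sd = x.std(ddof=1)               # sample standard deviation
    return mean - 1.96 * sd, mean + 1.96 * sd

rng = np.random.default_rng(1)
sodium = rng.normal(loc=140, scale=2.0, size=120)   # hypothetical sodium values, mmol/l
print(normal_theory_ri(sodium))
```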

Confidence intervals for RI endpoints: nonparametric, normal, bootstrap

It is often desirable to attach a measure of precision to the estimates of the endpoints of the reference interval. This is achieved by calculating confidence intervals for these endpoints. The International Federation of Clinical Chemistry (IFCC) recommends that nonparametric 90% confidence intervals be computed for each endpoint of the nonparametric 95% reference interval, which forces the requirement that the sample consist of at least 120 observations. Note, however, that even for the minimum recommended sample size of 120 these confidence intervals remain quite wide.
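One way to obtain such confidence intervals is the percentile bootstrap named in the heading above. The sketch below (illustrative names, simulated data) resamples the data with replacement, recomputes the nonparametric limits each time, and reports a 90% confidence interval for each limit; numpy's "weibull" quantile method corresponds to the rank formula p(n+1).

```python
import numpy as np

def bootstrap_ci_for_limits(values, n_boot=2000, ci=0.90, seed=0):
    """Percentile-bootstrap confidence intervals for the endpoints of the
    nonparametric 95% reference interval."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    lowers, uppers = [], []
    for _ in range(n_boot):
        resample = rng.choice(x, size=len(x), replace=True)
        lowers.append(np.quantile(resample, 0.025, method="weibull"))
        uppers.append(np.quantile(resample, 0.975, method="weibull"))
    alpha = (1.0 - ci) / 2.0
    lower_limit_ci = np.quantile(lowers, [alpha, 1.0 - alpha])
    upper_limit_ci = np.quantile(uppers, [alpha, 1.0 - alpha])
    return lower_limit_ci, upper_limit_ci

rng = np.random.default_rng(2)
data = rng.normal(100, 10, size=120)
print(bootstrap_ci_for_limits(data))
```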

Outlier problem

Outliers are a known problem in statistical data [20]. One way to determine whether an outlier is affecting the results is to examine the distribution of the data. The distribution of values for an analyte can be described by a number of mathematical models; the most commonly used is the Gaussian, or normal, distribution (Fig. 2).

Because it has been extensively studied, data described by this model can be easily analyzed. The most common description of a reference interval under this model is the mean ± 1.96 S.D.
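As one example of an outlier screen (not necessarily the specific procedure used in the studies reviewed here), Tukey's interquartile-range fences flag values far outside the central bulk of the data so that they can be reviewed before the reference interval is computed.

```python
import numpy as np

def tukey_outlier_flags(values, k=1.5):
    """Flag points outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR].
    k = 1.5 is the conventional choice; larger k is more conservative."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(x, [0.25, 0.75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

rng = np.random.default_rng(3)
data = np.append(rng.normal(100, 10, size=118), [155.0, 160.0])  # two gross outliers
flags = tukey_outlier_flags(data)
print("flagged values:", data[flags])
print("remaining n:", int((~flags).sum()))
```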

Current practice

For many analytes the current practice is to assume that the reference intervals of two different groups are equal, e.g., sodium concentrations of males and females. This is due to the large sample sizes required by the nonparametric approach to reference interval computation. It is also often the case that groups are partitioned without any analysis to determine whether they are comparable. To state the obvious, hormone levels, such as testosterone, differ markedly between males and females, and in such cases separate reference intervals are clearly required.
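The Harris and Boyd criterion mentioned in the Introduction compares the subgroup means with a z statistic whose critical value grows with the square root of the average subgroup size. The sketch below is a simplified rendering of that rule; the supplementary 1.5 check on the ratio of standard deviations is a commonly quoted rule of thumb and should be treated as an assumption here, not as the authors' exact procedure.

```python
import math

def harris_boyd_partition(mean1, sd1, n1, mean2, sd2, n2):
    """Simplified Harris-Boyd check: partition the two subgroups if the
    z statistic for the difference in means exceeds a critical value that
    scales with sqrt(average n), or if the S.D.s are very dissimilar."""
    z = abs(mean1 - mean2) / math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z_crit = 3.0 * math.sqrt(((n1 + n2) / 2.0) / 120.0)
    sd_ratio = max(sd1, sd2) / min(sd1, sd2)
    partition = (z > z_crit) or (sd_ratio > 1.5)   # 1.5: common rule of thumb (assumption)
    return partition, z, z_crit, sd_ratio

# Hypothetical male/female subgroups for a single analyte
print(harris_boyd_partition(mean1=5.2, sd1=0.6, n1=120,
                            mean2=4.8, sd2=0.5, n2=120))
```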

The reference interval width as a diagnostic tool

The reference interval width (RIW) is the value obtained by subtracting the lower reference limit from the upper one [5]; for the purposes of this discussion, the 2.5% value subtracted from the 97.5% value. Calculated in this manner, it is independent of the mean or median, and it is therefore a useful measure for evaluating the influence of different estimators, outliers, and populations. Since the RIW is location invariant, that is, unchanged by a constant shift in the data, it is also insensitive to method bias.
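The RIW, and the log ratio of two RIWs mentioned in the abstract, can be computed directly from the estimated limits. The sketch below uses the rank-based nonparametric limits (numpy's "weibull" quantile method) on simulated groups; the example illustrates that a pure location shift leaves the log RIW ratio near zero.

```python
import numpy as np

def reference_interval_width(values):
    """RIW = upper reference limit minus lower reference limit
    (97.5% value minus 2.5% value, rank formula p*(n+1))."""
    lo, hi = np.quantile(np.asarray(values, dtype=float),
                         [0.025, 0.975], method="weibull")
    return hi - lo

def log_riw_ratio(values_a, values_b):
    """Log ratio of two reference interval widths; near zero when the
    two groups (or methods) have comparable dispersion."""
    return np.log(reference_interval_width(values_a) /
                  reference_interval_width(values_b))

rng = np.random.default_rng(4)
group_a = rng.normal(100, 10, size=150)
group_b = rng.normal(105, 10, size=150)   # shifted mean, same spread
print(round(float(log_riw_ratio(group_a, group_b)), 3))  # close to 0: RIW ignores the shift
```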

Transference of RI to a new analytical method (e.g., new instrument or reagent)

For any new method or instrument that measures an analyte, the most rigorous approach is to follow the NCCLS guidelines [1], that is, to select an appropriate reference sample in which to measure the analyte. Of course, as discussed in the previous section, one may need to subdivide the reference sample into appropriate subsamples. Each of these subsamples will require at least 120 observations, which is often not feasible, and thus for many analytes reference intervals are calculated from smaller samples.
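Under the premise stated in the abstract that methods are linearly related and differ mainly in bias, one simple, purely illustrative way to transfer existing limits to a new method is to fit a straight line to paired measurements on the same specimens and map the limits through it; this is a sketch of that idea, not a procedure prescribed by the guidelines.

```python
import numpy as np

def transfer_limits(old_results, new_results, old_lower, old_upper):
    """Transfer existing reference limits to a new method by fitting a
    straight line to paired old/new measurements on the same specimens
    (ordinary least squares is used here purely for simplicity)."""
    slope, intercept = np.polyfit(old_results, new_results, deg=1)
    return slope * old_lower + intercept, slope * old_upper + intercept

rng = np.random.default_rng(5)
old = rng.normal(100, 10, size=40)                 # comparison specimens, old method
new = 1.02 * old + 3.0 + rng.normal(0, 1.0, 40)    # new method: slight slope and bias
# 80.4 and 119.6 are hypothetical existing limits for the old method
print(transfer_limits(old, new, old_lower=80.4, old_upper=119.6))
```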

Failure to review the data

A common practice is to compute reference intervals without visually reviewing the data. In any analysis of reference interval data, a visual plot of all the data, such as a histogram (Fig. 8), a stem-and-leaf plot, and/or a boxplot (Fig. 9), should be done first. Data that are heavily skewed can then be reviewed with the possibility in mind that the skew is caused by diseased individuals in the sample. Another important use of this type of plot is to check whether the (subsequently) calculated reference limits are plausible given the spread of the data.
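A minimal sketch of this first, visual step, assuming matplotlib is available; it draws a histogram and a boxplot of simulated, right-skewed analyte values so that skewness and suspect points are visible before any limits are calculated.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
analyte = rng.lognormal(mean=3.0, sigma=0.25, size=300)   # right-skewed, as many analytes are

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(analyte, bins=30)
ax_hist.set_title("Histogram")
ax_box.boxplot(analyte, vert=False)
ax_box.set_title("Boxplot")
fig.tight_layout()
plt.show()
```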

Summary/conclusion/recommendations

To summarize, the establishment of a reference interval depends on the size of the data set and the method of evaluation. The first step is to look at the data, preferably in graphical form, e.g., a histogram, box-and-whisker plot, etc. This provides an indication of the nature of the underlying distribution of the analyte for the population of interest. Next, a suitable transformation of the data to achieve normality should be examined [18], [22], [23], [31]. Transformation is also possible in combination with the robust estimation approach.
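Putting these recommendations together, the following sketch composes one plausible workflow from the pieces illustrated earlier: a Box-Cox transformation toward normality (via scipy), a Tukey-fence outlier screen on the transformed scale, normal-theory limits, and back-transformation to the original units. It is an illustrative composition under those assumptions, not the authors' exact algorithm.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

def reference_interval_workflow(values):
    """Illustrative workflow: Box-Cox transform, Tukey outlier screen on the
    transformed scale, normal-theory limits, back-transform to original units."""
    x = np.asarray(values, dtype=float)
    t, lam = boxcox(x)                       # requires strictly positive data
    q1, q3 = np.quantile(t, [0.25, 0.75])
    iqr = q3 - q1
    keep = (t >= q1 - 1.5 * iqr) & (t <= q3 + 1.5 * iqr)   # screen outliers
    mean, sd = t[keep].mean(), t[keep].std(ddof=1)
    lower_t, upper_t = mean - 1.96 * sd, mean + 1.96 * sd
    return float(inv_boxcox(lower_t, lam)), float(inv_boxcox(upper_t, lam))

rng = np.random.default_rng(7)
analyte = rng.lognormal(3.0, 0.25, size=200)   # simulated right-skewed analyte
print(reference_interval_workflow(analyte))
```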

References (32)

  • H.E. Solberg et al. Approved recommendation (1988) on the theory of reference values: Part 3. Preparation of individuals and collection of specimens for the production of reference values. Clin. Chim. Acta (1988)
  • H.E. Solberg et al. Approved recommendation on the theory of reference values: Part 4. Control of analytical variation in the production, transfer and application of reference values. Eur. J. Clin. Chem. Clin. Biochem. (1991)
  • H.E. Solberg. Approved recommendations (1987) on the theory of reference values: Part 5. Statistical treatment of collected reference values: determination of reference limits. J. Clin. Chem. Clin. Biochem. (1987); Clin. Chim. Acta (1987)
  • R. Dybkaer et al. Approved recommendations (1987) on the theory of reference values: Part 6. Presentation of observed values related to reference values. J. Clin. Chem. Clin. Biochem. (1987); Clin. Chim. Acta (1987); Labmedica (1988)
  • W.J. Dixon. Processing data for outliers. Biometrics (1953)
  • R.G. Hoffmann. Statistics in the practice of medicine. J. Am. Med. Assoc. (1963)