Article Text

## Abstract

**Objective** Microarray-related studies often involve a very large number of genes and small sample size. Cross-validating or bootstrapping is therefore imperative to obtain a fair assessment of the prediction/classification performance of a gene signature. A deficiency of these methods is the reduced training sample size because of the partition process in cross-validation and sampling with replacement in bootstrapping. To address this problem, we aim to obtain a prediction performance estimate that strikes a good balance between bias and variance and has a small root mean squared error.

**Methods** We propose to make a one-step extrapolation from the fitted learning curve to estimate the prediction/classification performance of the model trained by all the samples.

**Results** Simulation studies show that the method strikes a good balance between bias and variance and has a small root mean squared error. Three microarray data sets are used for demonstration.

**Conclusions** Our method is advocated to estimate the prediction performance of a gene signature derived from a small study.

## Statistics from Altmetric.com

- gene signature
- prediction
- classification
- receiver operating characteristic curve
- microarray
- learning curve

### Strengths and limitations of this study

The proposed method estimates the prediction performance of a gene signature derived from a small study.

The proposed method strikes a good balance between bias and variance and has a small root mean squared error.

The proposed method can be applied to linear and non-linear prediction models.

The proposed method may not work well for studies with an extremely small sample size (eg, n<10).

## Introduction

With the advances in microarray technology, hundreds of thousands of genes with expression information on an individual can be obtained in a single experiment. This high throughput technology enables us to make diagnostic and prognostic predictions based on a participant's gene signature.1–14 There are four key steps in microarray-based studies: (1) data processing (eg, data normalisation), (2) gene selection, (3) prediction model construction and (4) prediction performance evaluation.15 This paper focuses on the last step, the evaluation of prediction performances.

Microarray-based studies often involve a very large number of genes and a relatively small sample size. The same small data set being used for constructing the prediction model and subsequently evaluating the prediction performance tends to give over-optimistic estimates. This is why cross-validating or bootstrapping is imperative if a fair assessment of the prediction/classification performance of a gene signature is to be made. Popular cross-validation (CV) methods are k-fold CV, Monte Carlo CV and leave-one-out CV (LOOCV, also known as jackknifing).16 These methods partition the original data into a training set and a testing set. The training sample size, therefore, is reduced. The bootstrap method is an alternative to CV that operates by sampling with replacement of the original data.17 ,18 Even though the bootstrap sample has the same sample size as the original one, the overlapping of subjects between the bootstrap sample and the original data still reduces the effective (non-overlapping) training sample size. The reduced training sample size will curtail the prediction/classification performance of a gene signature, especially when the sample size of a study is already small.19

Estimating the performance of a prediction model built from a small study is a vexing task. For CV purposes, the already small sample size still needs to be partitioned further into a training set and a testing one. If we make the training size as large as possible (such as LOOCV), it would be nearly unbiased, but the effective sample size left for validation is one subject and we would get a variance that is unduly large, that is, the notorious bias–variance dilemma.20 ,21

The accuracy rate, the error rate and the area under the receiver operating characteristic curve (AUC) are commonly used performance indicators of a prediction model for a binary outcome.15 ,22 In this paper, we focus on the AUC index because it evaluates the global performance of a prediction model, not just at a particular cut-off point, but for each and every possible cut-off point. In a small data set, each and every subject is indispensable. We propose to make a one-step extrapolation from the fitted learning curve for AUC to estimate the prediction/classification performance of the model trained by all the samples.

## Methods

### Monte Carlo CVs

Suppose that a given prediction model is to be evaluated based on a data set with a total of N_{1} cases (or individuals with adverse events) and N_{0} controls (or individuals without adverse events). First, we use CVs with five different folds (LOOCV, 10-fold, 5-fold, 3-fold and 2-fold CV, respectively) to evaluate the performances of the prediction model. (LOOCV is the largest training size possible and the 2-fold CV is the smallest. Between these two extremes, we add three additional fold numbers. More folds can also be tried, but the results are similar.) To be more precise, we use the Monte Carlo random partition to partition the data set into two parts: the training set and the testing set. The random partitions are operated separately for the case group and the control group to ensure that the number of cases and controls is as balanced as possible;23 ,24 the training set has a total of cases and controls (both to the nearest integers), where for LOOCV, 10-fold, 5-fold, 3-fold and 2-fold CVs, respectively.

To be representative of all possible partitions, we suggest running at least 100 Monte Carlo random partitions for all folds, although this may be superfluous for some situations. (For example, there are only distinctive partitions for LOOCV with . Note that in a random partition, we leave out ‘one pair’ of a case and a control, instead of ‘one subject’.24) For each Monte Carlo partition, a prediction model will be built from the training set, and the testing set used to evaluate the performance of this model. The AUCs under the same fold are to be averaged (denoted as ), which leaves us with a total of five s.

### Learning curve for AUC

A learning curve is an assumed functional relation between prediction performance and training sample size.25 In this study, we take as our learning curve, where and . Essentially, this learning curve is a straight line in a double-inverse coordinate. For the ordinate, the AUC values are first transformed to quantiles of standard normal distribution, then squared and finally inversed. This expands the range of 0.5–1.0 in an AUC to a range of in the ordinate. For the abscissa, the range of the sample size is also between 0 and ∞. Such a sample size in inverse should be more sensitive to changes when the original sample size is small. The online supplementary appendix 1 shows that this learning curve is the exact functional relation between prediction performance and training sample size when normality and independence of the data are assumed.

### One-step extrapolation from the learning curve

We calculate and for each fold. On the basis of the five coordinate points, (x, y)s, we draw a linear regression line, that is, To extrapolate the performance (denoted as AUC_{T}) of the prediction model when all the subjects () are used as training samples, we enter into the equation to get .

With calculated, we then obtain , where is the cumulative distribution for the standard normal.

## Simulation studies

### Data and prediction models

We consider four different sample sizes: (cases)+10 (controls), 15+15, 20+20 and 25+25, respectively. A total of 10 genes are considered. The gene expression levels are generated from a normal distribution with a variance of 1. For the cases, the means of the gene expressions are distributed as uniform (−0.8, 0.8); for the controls, the means are set to 0.

Four different data structures are considered. The first three are normally distributed with a correlation coefficient of 0 (independence), 0.2 and 0.5 (dependency). The last data structure is more complicated. For the cases, the expression level of each gene is distributed as a mixture of three normal distributions with variances of 1. The three means are generated from a uniform (−1.5, 1.5) distribution with a probability of 0.6, a uniform (−1.2, 1.2) distribution with a probability of 0.3 and a uniform (−1, 1) distribution with a probability of 0.1, respectively (see online supplementary appendix 2 for this non-normal distribution). The gene expression level for the controls follows the standard normal distribution. The correlation coefficient between any two genes is set at 0.5 in these complex data.

There are many methods to build prediction models. In this paper, we use the naïve multiple regression and the support vector machine (SVM), as detailed below, to build prediction models. Another machine learning method, the random forest (RF), is detailed in online supplementary materials.

The naïve multiple regression is a simple prediction method. First, the β-coefficients ( where p is the number of genes in the gene signature) are calculated as the mean expression difference for each gene between the case and the control groups in the training set. The prediction score of the naïve multiple regression for the jth subject in the testing set is then , where χ_{ij} is the observed gene expression level of the ith gene for this jth subject. (The naïve multiple regression used in this study is similar to a previously proposed compound covariate method26 where the prediction score for the jth subject in the testing set is with the two-sample t-statistic of each gene serving its own weight in the prediction model).

SVM is a more sophisticated method; it is a very efficient learning algorithm for high-dimensional data in classification, regression and pattern recognition. The basis of SVM is to implicitly map data to a higher dimensional space via a kernel function in order to identify an optimal hyperplane that maximises the margin between the two groups.27 There are many software packages available to implement SVM. In this study, we use the e1071-package of R with a default radial basis function kernel to obtain the prediction scores.28

### CVs of the prediction models

In our simulation study, we perform a total of 5000 simulations. In each simulation, a total of 100 random partitions are performed for each fold CV (LOOCV, 10-fold, 5-fold, 3-fold and 2-fold CVs, respectively). From these, we use the previously described learning curve to make a one-step extrapolation to the cross-validated AUC when all the samples are utilised to train the model. For a comparison, we also calculate the internally validated AUCs of the LOO bootstrap in each simulation. This is a modified bootstrap procedure of the ordinary bootstrap. We draw a total of 100 resamplings. At each draw, the observations left out serve as the testing set. The effective (non-overlapping) training sample size of the LOO bootstrap is around 63.2% of the total sample size.17 ,18 (Out-of-bag (OOB)29 estimation employs a majority vote on the multiple prediction made for observation i based on the bootstrap samples at each draw, while the LOO bootstrap takes an average on error of these predictions. Therefore, OOB estimation may have larger variability than the LOO bootstrap when the sample size is small.30) In the simulation, we additionally create a large data set of 1000 cases and 1000 controls for external validation. For a prediction model, an externally validated AUC against this data set is considered as its true AUC value (one true AUC for each round of simulation). It should be pointed out that in real practice, one rarely has the luxury to conduct such a large-scale external validation, but will often have to settle for a satisfactory internal validation method which is precisely the focal point of this paper.

### Bias, variance and root mean squared error

In each round of the simulation, we calculate an estimated AUC, an error (the difference between the estimated AUC and the true AUC) and an error square for each performance evaluation method. On the basis of the 5000 simulations, the bias is calculated as the sample mean of the errors; the variance is the sample variance of the estimated AUCs; and the mean squared error (MSE) is the sample mean of the error squares. Finally, the root mean squared error (RMSE) is calculated from the square root of MSE. (RMSE simultaneously considers bias and variance. This value represents the ‘average’ (root-mean-square average, to be precise) difference between the estimated AUC and the true AUC).

### Simulation results

In figure 1, we present the bias (panels A–D), the variance (panels E–H) and the RMSE (panels I–L), respectively, using naïve multiple regression under different sample sizes. When the variables are independently distributed (panels A, E and I), the bias becomes smaller as the sample size becomes larger (closer to zero; panel A). All the fold-based CV methods underestimate the true AUC value because the training sample sizes they use are smaller than the total sample size given. The training sample size of LOOCV is closest to the total sample size (total sample size minus one pair); hence, it is the least biased (blue line) among all the fold-based CV methods. The training sample size of the LOO bootstrap is about 63% of the total sample size,17 ,18 which makes its bias (black dashed line) comparable to that of the twofold CV (with 50% of the total sample size; green line). As for the bias of our extrapolation method (red line), it is comparable to that of LOOCV.

In figure 1E, we see that the variance reveals a different story; the LOOCV now has the largest variance, and the twofold CV has the smallest variance among the fold-based CV methods. In terms of variance, the extrapolation method is now comparable to the twofold CV. From the RMSE index (figure 1I), we see that the proposed extrapolation method strikes a good balance between the bias and variance.

Similar results can be found when the variables are correlated (panels B, F and J, with a correlation coefficient of 0.2; panels C, G and K, with a correlation coefficient of 0.5) and when they are not normally distributed (panels D, H and L), or when SVM (figure 2) is used for constructing prediction models.

We simulated a more substantially non-normal data set (see online supplementary appendix 3). We found that the proposed extrapolation method can still strike a good balance between bias and variance (see online supplementary appendix 4). In addition, we examined the performances of the 0.632 bootstrap17 and the 0.632+ bootstrap,31 both of which are weighted averages between the LOO bootstrap estimate and the resubstitution estimate. (The 0.632+ bootstrap is an improved version of the 0.632 bootstrap.) We found that the 0.632 bootstrap produces very large upward biases while the 0.632+ bootstrap is quite comparable to our method (see online supplementary appendix 5). We also see that the proposed extrapolation method can outperform the 0.632+ bootstrap in terms of RMSE when sample size (online supplementary appendix 6).

We also tried extrapolation based on different learning curves (a linear equation with and , and a quadratic equation with and ), but we found the results to be no better than using the learning curve in this paper (see online supplementary appendices 7 and 8).

## Real data demonstration

We take three microarray data sets to demonstrate how the extrapolation method can be applied step by step.32–34 As this paper focuses on prediction model evaluation, and not on gene selection, we conveniently construct a 10-gene signature based on the top 10 genes with the smallest Mann-Whitney U test p values for the first two data sets, respectively. In the last example, we directly take the 76-gene signature identified by the original study as the prediction model. Both naïve multiple regression and SVM are used to build the prediction model for each data set. Monte Carlo random partition (a total of 1000 partitions for each fold) is performed to obtain the cross-validated AUCs.

### Example 1

The first data set is colon cancer data.32 The data consist of 2000 gene expressions in 62 tissue samples (40 tumour and 22 normal colon tissue samples). The data are available at http://genomics-pubs.princeton.edu/oncology/. The gene expression level is presented in intensity value, and is otherwise unprocessed. Hence, we first normalise the data by the mean and SD of each gene. The data are then randomly divided into two parts, one for gene selection (28 tumour tissue samples and 10 normal colon tissue samples) and the other for model building and CV (12 tumour tissue samples and 12 normal colon tissue samples). In the gene selection data set, we use the Mann-Whitney U test to identify the top 10 genes with the smallest p value from among the 2000 genes. These are *Hsa.627_M26383*, *Hsa.6814_H08393*, *Hsa.37937_R87126*, *Hsa.692_M76378-3*, *Hsa.3016_T47377*, *Hsa.31630_R64115*, *Hsa.831_M22382*, *Hsa.36689_Z50753*, *Hsa.3331_T86473* and *Hsa.43279_H64489*. In the remaining data set, we build and cross-validate a prediction model for this 10-gene signature.

For naïve multiple regression, the s (averaged from 1000 Monte Carlo partitions) are 0.936 (LOOCV), 0.929 (10-fold CV), 0.928 (5-fold CV), 0.925 (3-fold CV) and 0.921 (2-fold CV), respectively. The (x, y)s are then calculated as: =(0.182, 0.432) for LOOCV, (0.200, 0.465) for 10-fold CV, (0.220, 0.466) for 5-fold CV, (0.250, 0.483) for 3-fold CV and (0.330, 0.503) for 2-fold CV, respectively. These results are plotted in figure 3A. We then draw a linear regression based on the five (x, y) points: (the red line in figure 3A). To predict the performance with a sample size of 24 (all samples in the model building and CV data set are used as the training set, ie, 12 tumour and 12 normal tissue samples), we enter into the regression equation to get (* in figure 3A). The extrapolated performance is therefore . We next perform a total of 100 bootstrapping for this example and the bootstrapped SE for is calculated as 0.080. The results for this example when SVM is used for constructing the prediction model are shown in figure 3B. The (± bootstrapped SE) is calculated as 0.940 (±0.079).

### Example 2

The second example is a breast cancer data set,33 which is available at the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo), with accession code GSE2990. These data consist of 22 215 genes for 189 patients with breast cancer (120 patients without relapse and 67 with relapse; 2 patients with unknown relapse status are omitted from our demonstration). The provided data set has already been processed (with background correction, quantile normalisation and log transformation). The data set details and patient profile can be found in the corresponding reference and aforementioned GEO website. We choose those ‘extreme’ patients to be in the gene selection data set, that is, those 43 patients who developed relapse within 5 years and those 91 patients who were free of relapse for at least 5 years. The remaining data set (for model building and CV) now consists of 67−43=24 patients who developed relapses after 5 years and 120−91=29 patients free of relapses during their less than 5-year follow-up periods.

In the gene selection data set, we again pick the top 10 genes from among the 22 215 genes with the smallest p value after Mann-Whitney U test. These 10 genes are *203213_at*, *210222_s_at*, *205898_at*, *218883_s_at*, *203485_at*, *201890_at*, *214710_s_at*, *202779_s_at*, *202503_s_at* and *201291_s_at*. The remaining data set is used to build and cross-validate the prediction model for this 10-gene signature.

The results for naïve multiple regression are presented in figure 3A (blue line). The one-step extrapolated AUC (± bootstrappedSE) is calculated as 0.803 (±0.082). The results for SVM are presented in figure 3B (blue line). The one-step extrapolated AUC (± bootstrappedSE) is calculated as 0.781 (±0.063).

### Example 3

The third example is a different breast cancer data set. The data set34 and its patient profile are available at the GEO database (http://www.ncbi.nlm.nih.gov/geo) with accession code GSE2034. The data consisting of 107 patients with breast cancer with distant relapse and 197 without distant relapse) was divided into training (115 patients) and testing (171 patients) by concentration of the oestrogen receptor and a 76-gene signature was identified by previous researchers.34 We use our method to estimate the prediction performance of this 76-gene signature.

The results for naïve multiple regression are presented in figure 3A (green line). The one-step extrapolated AUC (± bootstrappedSE) is calculated as 0.726 (±0.005). The results for SVM are presented in figure 3B (green line). The one-step extrapolated AUC (± bootstrapped SE is calculated as 0.716 (±0.010).

## Discussion

When estimating the performance of a model derived from a small study, there seems to be no reason to settle for a sample size of (or , in a leave-one-pair-out CV), since what we are looking for is the performance at sample size . In this study, we extrapolate the performance to by exploiting the linear relation between and . The extrapolation is based on five CV methods (LOOCV, 10-fold, 5-fold, 3-fold and 2-fold CV) and is carried out only one-step ahead. The resulting estimate thus inherits the lack of bias in LOOCV and strikes a satisfying variance among the five CV methods. A computer simulation shows that our method performs the best in terms of RMSE when sample sizes are small.

The learning curve for AUC used in this study is based on a linear prediction model (online supplementary appendix 1). However, our simulation study shows that the learning curve is equally suited for non-linear prediction models, such as SVM (figure 2) and RF (online supplementary appendix 9). When making the extrapolation, we may sometimes encounter a slope (b in ) that is near zero (eg, the colon cancer example demonstrated in figure 3). This may occur when the model performance has reached its plateau; thus, varying the training size (as in different CV methods) has little effect on AUC estimates. This can also occur at the other extreme when the prediction/classification problem at hand is more complex and requires a much larger training size than is currently available, to significantly enhance the model performance. In either case, our method amounts to taking the average of the five CV estimates, thereby stabilising the variances.

Two previous studies30 ,35 also exploited the extrapolation concept. Both used an inverse power-law model as the empirical learning curve. In this paper, we are only interested in how the performance will result if all the samples we have are utilised to train the model. We do need an equation (a learning curve) for extrapolation, but this requires extrapolating a mere one step ahead. Our results should therefore be less dependent on what learning curves are being used.

## Acknowledgments

The authors would like to thank Dr Yung-Hsiang Huang and Mr Po-Chang Hsiao for technical supports.

## References

## Supplementary Data

This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

**Files in this Data Supplement:**- Data supplement 1 - Online supplement

## Footnotes

Contributors L-YW designed the simulation study and drafted the manuscript. W-CL conceived the study and participated in its design and coordination.

Funding Ministry of Science and Technology, Taiwan (grant number NSC 102-2628-B-002-036-MY3) and National Taiwan University, Taiwan (grant number NTU-CESRP-104R7622-8).

Competing interests None declared.

Provenance and peer review Not commissioned; externally peer reviewed.

Data sharing statement No additional data are available.

## Request permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.