Table 1

The effect of different data processing methods on model prediction performance (bootstrapping)

| Method | AUC (mean±SD) | AUC (95% CI) | Accuracy (mean±SD) | Accuracy (95% CI) | Precision (mean±SD) | Precision (95% CI) | Recall rate (mean±SD) | Recall rate (95% CI) | F1 value (mean±SD) | F1 value (95% CI) |
|---|---|---|---|---|---|---|---|---|---|---|
| Data filling | | | | | | | | | | |
| No filling | 0.786±0.101 | 0.785 to 0.787 | 0.770±0.070 | 0.769 to 0.771 | 0.437±0.162 | 0.435 to 0.438 | 0.546±0.208 | 0.544 to 0.548 | 0.460±0.142 | 0.459 to 0.461 |
| Simple filling | 0.687±0.094 | 0.686 to 0.688 | 0.761±0.076 | 0.760 to 0.761 | 0.455±0.180 | 0.453 to 0.456 | 0.491±0.165 | 0.489 to 0.492 | 0.442±0.126 | 0.441 to 0.443 |
| RF filling | 0.677±0.095 | 0.676 to 0.678 | 0.759±0.077 | 0.758 to 0.760 | 0.446±0.181 | 0.444 to 0.447 | 0.488±0.162 | 0.487 to 0.490 | 0.440±0.129 | 0.439 to 0.441 |
| RF improve filling | 0.678±0.092 | 0.677 to 0.678 | 0.756±0.077 | 0.755 to 0.757 | 0.443±0.179 | 0.442 to 0.445 | 0.485±0.161 | 0.483 to 0.486 | 0.435±0.125 | 0.434 to 0.436 |
| p value | p<0.0001 | | p<0.0001 | | p<0.0001 | | p<0.0001 | | p<0.0001 | |
| Data sampling | | | | | | | | | | |
| No sampling | 0.738±0.101 | 0.737 to 0.739 | 0.823±0.050 | 0.822 to 0.823 | 0.585±0.229 | 0.583 to 0.588 | 0.390±0.178 | 0.388 to 0.391 | 0.441±0.172 | 0.439 to 0.442 |
| Random over sampler | 0.718±0.109 | 0.717 to 0.719 | 0.765±0.070 | 0.764 to 0.765 | 0.437±0.154 | 0.435 to 0.438 | 0.531±0.189 | 0.529 to 0.533 | 0.457±0.135 | 0.456 to 0.458 |
| Random under sampler | 0.696±0.106 | 0.695 to 0.697 | 0.710±0.069 | 0.709 to 0.711 | 0.364±0.107 | 0.363 to 0.365 | 0.596±0.161 | 0.594 to 0.597 | 0.441±0.109 | 0.440 to 0.442 |
| SMOTE over sampler | 0.683±0.100 | 0.682 to 0.684 | 0.755±0.067 | 0.754 to 0.755 | 0.416±0.137 | 0.414 to 0.417 | 0.490±0.143 | 0.488 to 0.491 | 0.435±0.113 | 0.434 to 0.436 |
| Borderline SMOTE | 0.699±0.104 | 0.698 to 0.700 | 0.755±0.072 | 0.755 to 0.756 | 0.424±0.143 | 0.422 to 0.425 | 0.506±0.143 | 0.505 to 0.508 | 0.446±0.115 | 0.445 to 0.447 |
| p value | p<0.0001 | | p<0.0001 | | p<0.0001 | | p<0.0001 | | p<0.0001 | |
| Variable selection | | | | | | | | | | |
| No selection | 0.702±0.109 | 0.702 to 0.703 | 0.758±0.078 | 0.758 to 0.759 | 0.440±0.184 | 0.438 to 0.441 | 0.493±0.187 | 0.492 to 0.494 | 0.434±0.137 | 0.433 to 0.435 |
| Lasso selection | 0.713±0.105 | 0.712 to 0.713 | 0.761±0.074 | 0.760 to 0.761 | 0.447±0.173 | 0.445 to 0.448 | 0.513±0.177 | 0.512 to 0.514 | 0.448±0.128 | 0.447 to 0.449 |
| Boruta selection | 0.706±0.103 | 0.705 to 0.707 | 0.766±0.073 | 0.765 to 0.766 | 0.449±0.170 | 0.448 to 0.450 | 0.501±0.166 | 0.500 to 0.503 | 0.450±0.127 | 0.449 to 0.451 |
| p value | p<0.0001 | | p<0.0001 | | p<0.0001 | | p<0.0001 | | p<0.0001 | |
  • AUC, area under curve; RF, random forest; SMOTE, synthetic minority oversampling technique.
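
Each cell in Table 1 reports a metric as mean±SD with a 95% CI. The sketch below shows one way such a summary could be produced from a list of per-resample metric values. It is not the authors' procedure: the table does not say how the intervals were calculated, so the normal-approximation interval for the mean, the function names `summarize` and `format_cell`, and the simulated AUC values are all assumptions made for illustration.

```python
import math
import random
import statistics


def summarize(values, z=1.96):
    """Return (mean, sd, ci_low, ci_high) for a list of metric values.

    Assumption: the CI is a normal-approximation interval for the mean,
    mean +/- z * sd / sqrt(n). The source table does not state its method.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate the SD")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = z * sd / math.sqrt(n)
    return mean, sd, mean - half_width, mean + half_width


def format_cell(values):
    """Format a summary in the style used by the table cells."""
    mean, sd, low, high = summarize(values)
    return f"{mean:.3f}±{sd:.3f} ({low:.3f} to {high:.3f})"


# Hypothetical data: simulated AUC values standing in for bootstrap results,
# clipped to the valid [0, 1] range. These are not values from the study.
rng = random.Random(0)
hypothetical_auc = [min(1.0, max(0.0, rng.gauss(0.70, 0.10))) for _ in range(1000)]
print(format_cell(hypothetical_auc))
```

Under this assumption the interval width shrinks with the square root of the number of resamples, so a narrow interval can accompany a large SD when many resamples are pooled. A percentile interval taken directly from the bootstrap distribution would instead stay roughly as wide as the spread of the values themselves.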