The effect of different data processing methods on model prediction performance (bootstrapping)
Method | AUC, Mean±SD | AUC, 95% CI | Accuracy, Mean±SD | Accuracy, 95% CI | Precision, Mean±SD | Precision, 95% CI | Recall, Mean±SD | Recall, 95% CI | F1 score, Mean±SD | F1 score, 95% CI |
Data filling
No filling | 0.786±0.101 | 0.785 to 0.787 | 0.770±0.070 | 0.769 to 0.771 | 0.437±0.162 | 0.435 to 0.438 | 0.546±0.208 | 0.544 to 0.548 | 0.460±0.142 | 0.459 to 0.461 |
Simple filling | 0.687±0.094 | 0.686 to 0.688 | 0.761±0.076 | 0.760 to 0.761 | 0.455±0.180 | 0.453 to 0.456 | 0.491±0.165 | 0.489 to 0.492 | 0.442±0.126 | 0.441 to 0.443 |
RF filling | 0.677±0.095 | 0.676 to 0.678 | 0.759±0.077 | 0.758 to 0.760 | 0.446±0.181 | 0.444 to 0.447 | 0.488±0.162 | 0.487 to 0.490 | 0.440±0.129 | 0.439 to 0.441 |
RF improve filling | 0.678±0.092 | 0.677 to 0.678 | 0.756±0.077 | 0.755 to 0.757 | 0.443±0.179 | 0.442 to 0.445 | 0.485±0.161 | 0.483 to 0.486 | 0.435±0.125 | 0.434 to 0.436 |
p value | p<0.0001 | p<0.0001 | p<0.0001 | p<0.0001 | p<0.0001 |
Data sampling
No sampling | 0.738±0.101 | 0.737 to 0.739 | 0.823±0.050 | 0.822 to 0.823 | 0.585±0.229 | 0.583 to 0.588 | 0.390±0.178 | 0.388 to 0.391 | 0.441±0.172 | 0.439 to 0.442 |
Random over sampler | 0.718±0.109 | 0.717 to 0.719 | 0.765±0.070 | 0.764 to 0.765 | 0.437±0.154 | 0.435 to 0.438 | 0.531±0.189 | 0.529 to 0.533 | 0.457±0.135 | 0.456 to 0.458 |
Random under sampler | 0.696±0.106 | 0.695 to 0.697 | 0.710±0.069 | 0.709 to 0.711 | 0.364±0.107 | 0.363 to 0.365 | 0.596±0.161 | 0.594 to 0.597 | 0.441±0.109 | 0.440 to 0.442 |
SMOTE over sampler | 0.683±0.100 | 0.682 to 0.684 | 0.755±0.067 | 0.754 to 0.755 | 0.416±0.137 | 0.414 to 0.417 | 0.490±0.143 | 0.488 to 0.491 | 0.435±0.113 | 0.434 to 0.436 |
Borderline SMOTE | 0.699±0.104 | 0.698 to 0.700 | 0.755±0.072 | 0.755 to 0.756 | 0.424±0.143 | 0.422 to 0.425 | 0.506±0.143 | 0.505 to 0.508 | 0.446±0.115 | 0.445 to 0.447 |
p value | p<0.0001 | p<0.0001 | p<0.0001 | p<0.0001 | p<0.0001 |
Variable selection
No selection | 0.702±0.109 | 0.702 to 0.703 | 0.758±0.078 | 0.758 to 0.759 | 0.440±0.184 | 0.438 to 0.441 | 0.493±0.187 | 0.492 to 0.494 | 0.434±0.137 | 0.433 to 0.435 |
Lasso selection | 0.713±0.105 | 0.712 to 0.713 | 0.761±0.074 | 0.760 to 0.761 | 0.447±0.173 | 0.445 to 0.448 | 0.513±0.177 | 0.512 to 0.514 | 0.448±0.128 | 0.447 to 0.449 |
Boruta selection | 0.706±0.103 | 0.705 to 0.707 | 0.766±0.073 | 0.765 to 0.766 | 0.449±0.170 | 0.448 to 0.450 | 0.501±0.166 | 0.500 to 0.503 | 0.450±0.127 | 0.449 to 0.451 |
p value | p<0.0001 | p<0.0001 | p<0.0001 | p<0.0001 | p<0.0001 |
AUC, area under the curve; CI, confidence interval; RF, random forest; SD, standard deviation; SMOTE, synthetic minority oversampling technique.
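The Mean±SD and 95% CI columns can be read as summaries of a metric over bootstrap replicates. The intervals are far narrower than the SDs, which is consistent with a normal-approximation interval for the mean over a large number of replicates. The source does not state the exact procedure, so this is an assumption. A minimal sketch under that assumption follows; the names `summarize` and `format_cells` are hypothetical, and the scores are synthetic, not study data.

```python
import math
import random
import statistics


def summarize(scores, z=1.96):
    """Return (mean, sd, ci_low, ci_high) for per-replicate metric scores.

    Assumption: the 95% CI is a normal-approximation interval for the MEAN,
    i.e. mean +/- z * sd / sqrt(n). The source table does not confirm this.
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)        # sample SD (n - 1 denominator)
    half_width = z * sd / math.sqrt(n)   # shrinks as replicates increase
    return mean, sd, mean - half_width, mean + half_width


def format_cells(scores):
    """Format the two table cells: 'Mean±SD' and '95% CI'."""
    mean, sd, low, high = summarize(scores)
    return f"{mean:.3f}±{sd:.3f}", f"{low:.3f} to {high:.3f}"


# Synthetic per-replicate AUC values (illustrative only, clipped to [0, 1]).
rng = random.Random(0)
demo_scores = [min(1.0, max(0.0, rng.gauss(0.70, 0.10))) for _ in range(40000)]
print(format_cells(demo_scores))
```

With tens of thousands of replicates, an SD near 0.1 gives a half-width near 0.001, which matches the scale of the intervals in the table. A percentile interval over the replicates would instead be roughly as wide as the spread of the scores themselves.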