Objectives This research aimed to develop a simple and effective acute coronary syndrome (ACS) screening model in order to intervene early and focus on prevention in patients presenting with arteriosclerosis.
Design A case–control study.
Setting The study used a cross-sectional survey to collect anonymous electronic medical record (EMR) data and coronary angiography results from 2243 patients at a hospital in Shandong Province between December 2013 and April 2016.
Participants Adults aged 18 years and above, diagnosed as ACS or non-ACS according to the hospital EMR database, with complete basic information (age and sex).
Predictors 54 laboratory biomarkers and demographic factors (age and sex).
Statistical analysis A dataset of all patients' laboratory indicators and demographic factors, with no missing data, was divided into a training set and a validation set. After the training set was balanced, area under the curve of random forest (AUCRF) and least absolute shrinkage and selection operator (LASSO) regression were used for feature selection. Two random forest (RF) models were then built with the two feature sets and compared on sensitivity, accuracy and area under the receiver operating characteristic curve (AUC) in the internal validation set to identify the optimal model.
Main outcome measures To establish an ACS screening model.
Results The RF model with 31 features selected by LASSO had an AUC of 0.616 (95% CI 0.650 to 0.772), a sensitivity of 0.832 and an accuracy of 0.714 in the validation set. The RF model with 27 features selected by AUCRF had an AUC of 0.621 (95% CI 0.664 to 0.785), a sensitivity of 0.849 and an accuracy of 0.728 in the validation set.
Conclusions The ACS screening model built on 27 clinical features performed better and offers a practical solution for predicting ACS.
- Adult cardiology
- Coronary heart disease
- Ischaemic heart disease
- Myocardial infarction
Data availability statement
Data are available on reasonable request. Data can be made available to researchers approved at request of the corresponding author, with a signed data access agreement.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
STRENGTHS AND LIMITATIONS OF THIS STUDY
Two feature selection methods, area under the curve of random forest (RF) and least absolute shrinkage and selection operator regression, were used to construct the models.
The RF algorithm captures interactions between features, is robust to overfitting and is more efficient than bagging or boosting.
Key considerations for applying RF learning to imputation and imbalanced data are addressed.
This was a single-centre study validated by internal validation only.
Acute coronary syndrome (ACS) is one of the leading causes of cardiovascular disease-related death and is encountered frequently in the emergency department (ED).1 It is a comprehensive disease mainly caused by disruption of coronary artery plaque and consequent thrombosis-induced severe coronary artery stenosis or occlusion.2 Despite major improvements in medical treatment, the in-hospital diagnosis of ACS for patients complaining of chest pain is time-consuming and expensive.3 The potential risk of some procedures, such as invasive coronary angiography (CAG), may outweigh the diagnostic benefits.4 Furthermore, some major adverse cardiac events, including ACS, do not present with premorbid clinical manifestations. In such cases, the opportunity for early diagnosis may be lost for patients without obvious clinical symptoms, leading to poor prognosis. Thus, a case–control study to develop simple and accurate ACS screening methods for use in regular health check-ups was essential for patients with arteriosclerosis (AS); such screening can prompt treatment, avoid the risk of fatal complications and improve the prognosis of patients.
Several risk factors are well known to be associated with ACS, such as body mass index,5 smoking status, alcohol consumption, hypertension and diabetes.6 7 However, data on these risk factors are often unreliable and their long-term effects are hard to quantify (for example, the number of cigarettes smoked per day, the duration of smoking and alcohol intake), so they may not have high sensitivity and specificity for ACS screening, individually or in combination. Biomarkers used in clinical practice are therefore preferred in the ACS diagnostic process. Several biomarkers have been proposed for ACS detection, such as high-sensitivity cardiac troponin (hs-cTn).8 9 With its high sensitivity and specificity, hs-cTn has become the criterion biomarker for acute myocardial infarction (MI). Because elevated hs-cTn indicates myocardial injury, patients with elevated hs-cTn usually go to hospital first to receive treatment as early as possible. However, hs-cTn is of little use in people attending regular medical examinations, who usually have no myocardial injury and therefore a normal result. Selecting an optimal subset of common clinical biomarkers is therefore an important step in constructing the ACS screening model.
Random forest (RF)10 is a widely used machine learning method for data classification, and has been applied in clinical settings to predict disease and shown to have higher accuracy for diagnosis than some classical methods, such as logistic regression.11
The least absolute shrinkage and selection operator (LASSO)12 regression model was designed to overcome the shortcomings of multiple regression analysis in high-dimensional data, which lead to predictions with large variance.13
CAG is the gold standard in the diagnosis of coronary stenosis. However, CAG is an invasive test with strict indications that render it difficult to use as a population screening tool.14 Furthermore, it is difficult to determine the nature of the plaque on CAG.15 In contrast, coronary CT angiography (CCTA) is a non-invasive examination16 that can also detect coronary stenosis.17 Abdulla et al compared CCTA with CAG and concluded that CCTA has high accuracy in coronary heart disease (CHD) screening and can, to some degree, safely substitute CAG as a screening tool for ACS.18 These studies demonstrated the accuracy of CCTA in detecting coronary stenosis, thereby laying the foundation for the use of CCTA as an ACS-screening tool.
In this study, we proposed a simple and accurate RF model to screen for ACS at health check-ups, in order to intervene early and focus on prevention. The flow chart of the study population selection is shown in figure 1, and a case example is shown in figure 2.
Patient and public involvement
No patients or members of the public were involved.
Data source and participants
This was a cross-sectional, observational, single-centre study. Eligible patients were adults aged 18 years and above, diagnosed as ACS or non-ACS according to the hospital electronic medical record (EMR) database, with complete basic information (age and sex), recorded from December 2013 to April 2016. Patients with known connective tissue disorders, pulmonary embolism, dissection, systemic inflammatory conditions, prior acute MI or prior heart surgery were excluded. Patients were labelled as having AS when CCTA or CAG showed stenosis above 50% in any coronary segment. All patients underwent at least one of CCTA or CAG.
Inclusion criteria: (1) a definite diagnosis of AS according to the CAG records in the hospital EMR database, with coronary artery stenosis of more than 50%; (2) age of at least 18 years; (3) no missing basic information (age and sex) and (4) available medical records of clinical examinations (missing rate below 50%).
Exclusion criteria: (1) age <18 years; (2) hs-cTnI elevation on the first blood test after admission; (3) previous history of MI, CAG or percutaneous transluminal coronary intervention (PCI); (4) missing basic information (age, sex) and (5) known connective tissue disorders, pulmonary embolism, dissection or systemic inflammatory conditions.
Our outcome was ACS. According to the consensus guidelines of the European Society of Cardiology,19 patients with a diagnosis of unstable angina pectoris or acute MI, including ST-segment elevation MI and non-ST-segment elevation MI, were considered as having ACS. All patients who completed CAG were assessed by a professional cardiovascular panel during the hospitalisation period.
A total of 54 laboratory biomarkers and two demographic factors (age and sex) were included on the basis of their role as risk factors in the pathology of ACS. All laboratory tests were performed under the supervision of a professional cardiovascular panel during hospitalisation. For laboratory biomarkers that were evaluated more than once, the first measurement after admission was selected in order to eliminate the impact of medical treatments during hospitalisation. The time elapsed between biomarker evaluation and ACS or atherosclerosis detection was 21.35±36.24 hours.
Because all predictors were objective measurements taken before the outcomes were judged, the predictors were assessed blind to the outcome and to the other predictors.
The original data (n=2243) were retrospectively collected from December 2013 to April 2016 for patients who underwent CAG or CCTA and whose hs-cTn value was below the upper reference limit, according to the inclusion and exclusion criteria. The baseline population data were derived from the original data. An intermediate dataset was formed by deleting columns in which more than 20% of values were missing and rows in which more than 20% of values were missing. Missing values remaining in the intermediate dataset were handled by multiple imputation,20 and the resulting baseline population data are shown in table 1.
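The 20% missing-data filter can be sketched as follows. This is a minimal Python illustration, not the R code used in the study; the function names, the use of None for a missing value and the choice to filter columns before rows are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Row = List[Optional[float]]  # None marks a missing measurement (assumption)


def missing_fraction(values: Row) -> float:
    """Share of entries in `values` that are missing."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def drop_sparse(rows: List[Row], threshold: float = 0.20) -> Tuple[List[Row], List[int]]:
    """Drop columns, then rows, whose missing fraction exceeds `threshold`.

    Returns the filtered rows and the indices of the columns that were kept.
    The column-then-row order is an assumption; the text does not state it.
    """
    n_cols = len(rows[0])
    kept_cols = [
        j for j in range(n_cols)
        if missing_fraction([row[j] for row in rows]) <= threshold
    ]
    reduced = [[row[j] for j in kept_cols] for row in rows]
    kept_rows = [row for row in reduced if missing_fraction(row) <= threshold]
    return kept_rows, kept_cols
```

Values still missing after this step would then go to the imputation stage, which is not reproduced here.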
The baseline population data were divided into a training set and a validation set in a ratio of 9:1. Imbalanced data usually lead to overly optimistic performance estimates in machine learning methods such as RF, k-nearest neighbours (KNN) and deep learning. Performance metrics are also biased: positive predictive values can remain unacceptably low even when a model achieves a high area under the curve (AUC). We therefore balanced the training set only, using the ROSE package.21 The validation set kept its original imbalanced class distribution so that it reflected the distribution expected in practice.22
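The split-then-balance order matters, because balancing before the split would leak resampled cases into the validation set. The sketch below illustrates the order in Python under stated assumptions: the split is stratified by class (the text does not say whether it was), and plain random oversampling of the minority class stands in for ROSE, which instead generates synthetic cases by a smoothed bootstrap. All function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], train_frac: float = 0.9,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train, validation) index lists, split within each class."""
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_frac)
        train_idx += idx[:cut]
        valid_idx += idx[cut:]
    return train_idx, valid_idx


def oversample_minority(train_idx: Sequence[int], labels: Sequence[int],
                        seed: int = 0) -> List[int]:
    """Resample training indices with replacement until classes are equal.

    Only training indices are touched; the validation set keeps its
    original class distribution.
    """
    rng = random.Random(seed)
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced += idx + [rng.choice(idx) for _ in range(target - len(idx))]
    return balanced
```

With 80 cases of one class and 20 of the other, this gives a 90/10 split and a balanced training set of 72 cases per class, while the 10 validation cases keep the 8:2 ratio.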
We compared the groups with and without ACS. Continuous biomarkers were compared using t-tests (for normally distributed variables) or the Wilcoxon rank sum test (for variables with skewed distributions) after testing for normality with the Kolmogorov-Smirnov test, whereas categorical variables were assessed using the χ² test. A two-tailed p<0.05 was considered statistically significant.
To construct a more robust and automated model with the most relevant features, two feature selection techniques, LASSO and AUC of RF (AUCRF), were used to determine the optimal subset of features.
AUCRF is an algorithm that optimises the AUC of the RF instead of its classification error. AUCRF builds an RF with all the features, ranks the features by importance according to the Gini index and then eliminates them backwards. The algorithm reports both the out-of-bag (OOB) error and the OOB-AUC, based on the OOB predictions.
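The backward elimination loop can be sketched generically. The Python below is an illustration only and does not reproduce the AUCRF package: the importance values and the scoring function are supplied by the caller (in practice, Gini importance and OOB-AUC from a fitted forest), and the 20% drop fraction per step, the single initial ranking and the tie-breaking rule are assumptions.

```python
from typing import Callable, Dict, List, Tuple


def backward_eliminate(importance: Dict[str, float],
                       score: Callable[[List[str]], float],
                       drop_frac: float = 0.2,
                       min_features: int = 1) -> Tuple[List[str], float]:
    """Drop the least important features step by step; keep the best subset.

    `importance` maps each feature to its importance (ranked once, up front).
    `score` returns the quality of a feature subset, e.g. an OOB-AUC.
    """
    current = sorted(importance, key=importance.get, reverse=True)
    best_subset, best_score = list(current), score(current)
    while len(current) > min_features:
        n_drop = max(1, int(len(current) * drop_frac))
        current = current[:len(current) - n_drop]  # remove the lowest ranked
        s = score(current)
        if s >= best_score:  # on ties, prefer the smaller subset
            best_subset, best_score = list(current), s
    return best_subset, best_score
```

The function returns the subset with the highest score seen along the elimination path, which corresponds to choosing the feature set with the highest OOB-AUC.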
LASSO regression handles high-dimensional data and overcomes the large variance caused by multiple regression analysis.13 LASSO penalises the coefficients, shrinking as many as possible to exactly zero so that as few features as possible are retained.
RF, first introduced by Breiman,23 is a collection of classification and regression trees. The method addresses two main classes of problems: assessing and ranking variables by their ability to predict the response, and constructing a prediction rule for a supervised learning problem.24 RF variable importance measures can successfully identify predictors involved in interactions. In addition, RF overcomes the overfitting problem of individual decision trees and has been reported to offer consistently the highest prediction accuracy in classification, even with highly correlated variables, compared with other models.25–27
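How the penalty sets coefficients to exactly zero can be seen in the simplest case. For an orthonormal design and a squared-error loss, the LASSO estimate is the least-squares coefficient passed through a soft-thresholding operator. The Python sketch below shows only that operator; it is not the fit used in the study, and the penalty value and coefficients are illustrative.

```python
from typing import List


def soft_threshold(z: float, lam: float) -> float:
    """Shrink z towards zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(ols_coefs: List[float], lam: float) -> List[float]:
    """LASSO solution for an orthonormal design (illustrative special case)."""
    return [soft_threshold(b, lam) for b in ols_coefs]


def selected_features(coefs: List[float]) -> List[int]:
    """Indices of features whose coefficient survived the penalty."""
    return [j for j, b in enumerate(coefs) if b != 0.0]
```

For least-squares coefficients [2.5, -0.25, 0.75, -1.5] and a penalty of 1.0, the two small coefficients are set to zero and only features 0 and 3 remain, each shrunk by 1.0.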
Two models were developed with RF, using the features selected by LASSO and by AUCRF, respectively. For each RF, mtry was set to √(number of variables) and ntree was chosen large enough to discriminate.
We compared the performance of the two models using AUC, accuracy, sensitivity, specificity and F1-score.
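As a small worked example of the mtry rule, the Python sketch below computes the value for the two feature sets. Rounding the square root down is an assumption, chosen because it matches the default of the randomForest package for classification; the function name is hypothetical.

```python
import math


def mtry_for(n_features: int) -> int:
    """Number of candidate variables per split: floor(sqrt(p)), at least 1."""
    return max(1, math.isqrt(n_features))


# Feature counts reported for the two models.
mtry_lasso = mtry_for(31)
mtry_aucrf = mtry_for(27)
```

Under this rule both models would try five variables at each split, since the square roots of 31 and 27 both lie between 5 and 6.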
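The threshold-based metrics all derive from the same confusion matrix. The following Python sketch shows the definitions; it is an illustration with hypothetical names, not the caret output used in the study.

```python
from typing import Dict, Sequence


def screening_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                      positive: int = 1) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity and F1-score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as 0.0 here.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    f1_den = precision + sensitivity
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "f1": 2 * precision * sensitivity / f1_den if f1_den else 0.0,
    }
```

Sensitivity is the share of ACS cases that the model flags and specificity is the share of non-ACS cases that it clears. The F1-score combines sensitivity with precision, which is why it is sensitive to class imbalance in the validation set.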
The statistical analyses in this study were performed using R software V.4.1.1 with the packages caret,28 randomForest,23 mice,20 29 tableone,30 ROSE21 and pROC (receiver operating characteristic analysis).31 We used the TRIPOD checklist when writing our report.32
The demographic and clinical characteristics of the patients in the whole dataset are shown in table 1. Among the 2243 cases, 1791 (14.3%) patients had ACS. Thirteen laboratory indicators showed significant differences between the case (ACS) and control (non-ACS) groups. In terms of sex, males accounted for 65.7% and females for 34.3% of patients with non-ACS, while males accounted for 68% and females for 32% of patients with ACS. Statistically significant differences were found in aspartate aminotransferase (AST), alanine aminotransferase (ALT), alkaline phosphatase (AKP), uric acid (UA), DD_I, prothrombin time, K, hs_CTNI, prothrombin time ratio, PT_international standard ratio, haematocrit (HCT), prealbumin (PA) and total bile acid (TBA), whereas the other 43 variables showed no significant differences.
In the LASSO feature selection, the best lambda was obtained as shown in figure 3, and lambda.1se=−4.95 was chosen to address the multicollinearity problem. The result of feature selection with AUCRF is shown in figure 4: 27 variables were selected, with the highest OOB-AUC of 0.9544. All the selected variables are listed in table 2. In the LASSO selection, 31 variables had non-zero weights. In the AUCRF selection, all the variables were ranked by mean decrease in Gini.
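The quantity behind the mean decrease in Gini is the drop in Gini impurity achieved by a split. The Python sketch below computes it for a single split. In a forest, these decreases are accumulated over every split that uses a variable and averaged over the trees, which is not reproduced here. The function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent: Sequence[int], left: Sequence[int],
                  right: Sequence[int]) -> float:
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children
```

A node with three cases of each class has an impurity of 0.5, and a split that separates the classes perfectly removes all of it. A variable that often produces such splits therefore ranks high.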
To evaluate the prediction performance of the two models, the AUC with its 95% CI, accuracy, sensitivity, specificity and F1-score are listed in table 3. Model B had a better performance. The ROC curves used for the evaluation are shown in figure 5.
Finally, model B, built on the variables selected by AUCRF, had the better prediction performance.
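The AUC has a direct interpretation: it is the probability that a randomly chosen ACS case receives a higher predicted score than a randomly chosen non-ACS case, with ties counted as one half. The Python sketch below computes it from this definition. It is an illustration with hypothetical names, not the pROC routine used in the study, and it does not compute the confidence interval.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float],
            positive: int = 1) -> float:
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With labels [1, 1, 0, 0] and scores [0.9, 0.4, 0.5, 0.1], three of the four case/control pairs are ranked correctly, so the AUC is 0.75. A value of 0.5 corresponds to ranking no better than chance.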
Our study shows that 19 variables appear in both models: ALB, age, monocyte ratio, mean corpuscular haemoglobin concentration (MCHC), HCT, PA, LDL_C, UA, AKP, TG, K, blood urea nitrogen (BUN), TBA, mean corpuscular volume, GLU, CK, platelets (PLT), AST and ALT. Age is an independent predictor of ACS,33 and diabetes is a known risk factor for both ACS and AS. Although ACS is associated with sudden clinical manifestations, disease progression may occur gradually in a clinically asymptomatic manner over many years; in fact, the rupture of plaques is usually asymptomatic.34 Inflammation plays an important role in plaque instability as well as in the pathogenesis of ACS. After vascular endothelial injury, white blood cells in the peripheral circulation, including neutrophils, eosinophils and monocytes, are guided to the site of the lesion and promote the formation of plaques. Mature atherosclerotic plaques contain large amounts of lipids, foam cells (macrophages or smooth muscle cells loaded with fat), proliferating smooth muscle cells and matrix components (collagen, elastin). As a formed plaque develops further, the lipid core enlarges and the fibrous cap thins. When a plaque ruptures, PLT adhesion and aggregation follow and a thrombus forms. PLT are involved in the production of blood clots, provide mediators for the formation and maintenance of local inflammatory responses and play a role in the systemic inflammatory response. Nozawa et al35 showed that circulating monocytes play an important role in the progression of coronary plaque in acute MI (AMI), and suggested that the peak monocyte count might be a predictor for plaque. LDL and TG also play an important role in ACS prediction.
PLT are involved in the inflammatory process, and platelet distribution width (PDW) is an important and simple marker of significantly elevated PLT activation in inflammatory processes.36 UA37 has been identified as a significant determinant of ACS and heart failure. Ren et al38 suggested that serum AKP may be a potential predictive biomarker. Adam et al39 found that K, along with BUN and creatinine (Cr), has a strong relationship with the diagnosis of ACS, especially in the mortality-risk assessment of patients with ACS. High MCHC and mean platelet volume (MPV) levels may be beneficial for ACS.40 ALT and AST are widely present in the myocardium, brain and other organs, and are related to ACS.41 However, an effect of TBA on ACS has not been found.
Our study shows that the other 20 variables appear in only one model: red blood cell, high-density lipoprotein cholesterol, lymphocyte count, monocyte count, lymphocyte ratio, Cr, total cholesterol and MPV in model A; and basophilic granulocyte count, PT_R, eosinophil count, DD_I, PDW, neutrophil count, creatine kinase isoenzymes, direct bilirubin, thrombin time ratio, Ca, sex and hs_CTNI in model B. Nevertheless, they still play different roles in the development of MI.
Biomarkers in clinical practice usually have complex interaction structures or are highly correlated. In this study, we used AUCRF and LASSO to assess the importance of each variable and constructed the screening model with RF, which can identify predictors involved in interactions and has been reported to offer consistently the highest prediction accuracy in classification, even with highly correlated variables, compared with other models.42 43 RF is also more suitable for feature selection.
Our study suggests that optimising glucose levels, blood pressure and lipid levels, and managing the variables used to construct the models, can improve secondary prevention and reduce major adverse cardiovascular events in the future.
Our study has several limitations. First, model B, which had the better prediction performance, only has an AUC of 0.621, which might be caused by the small amount of data, the unbalanced data distribution and other reasons. ECG has always played a vital role in the diagnosis of ACS, and at this stage it is necessary to combine the model with ECG to reach a full judgement when patients with AS attend a regular health check. It is reasonable to believe that the effectiveness of our ACS-screening model can be further improved by introducing new variables, a more complete dataset and a greater number of indices in future studies. Second, only indicators with a missing data rate of less than 20% were included in this study. Therefore, indicators with high missing data rates and high discriminant values could have been missed, and these can be explored in future studies. Third, the datasets included in this study involved hospitalised patients with AS who received CAG during hospitalisation, and this population may not be fully representative of all patients with AS; a further multicentre study should be done. Finally, the RF algorithm, while robust, has some inherent defects, such as the difficulty of explaining the contribution of a single predictor.
The established ACS screening model with 27 clinical features provides a better performance and a practical solution for predicting ACS. It can be integrated into laboratory test equipment for early diagnostic warning and a focus on prevention in regular health check-ups.
Patient consent for publication
This study involves human participants and was approved by the Public Health Ethics Committee of Shandong University (Approval No. 20180801). Participants gave informed consent to participate in the study before taking part.
This study is a joint effort of many investigators and staff members, and their contribution is gratefully acknowledged. We especially thank the Health Management Center, as well as the individuals who participated in this study. We are also grateful for the support of the Qila National Key Research and Development Program of China, the Shandong Provincial Natural Science Foundation of China, Shandong Provincial Key Research and Development project, the National Natural Science Foundation of China, the Taishan Pandeng Scholar Program of Shandong Province, and the National Natural Science Foundation of Shandong Province.
XL and FY contributed equally.
Contributors FY and XL analysed data. XL and FY wrote the manuscript. ML helped to collect data. JL made efforts on supervision for data quality. YC and CL designed the study, assisted with the methods and data analysis discussion, and reviewed all drafts. CL is the guarantor of the article taking full responsibility for the finished work and the conduct of the study, had access to the data, and controlled the decision to publish.
Funding This work was supported by the National Natural Science Foundation of China (Grant number 82070388 and 82170442), the Taishan Pandeng Scholar Program of Shandong Province (Grant number tspd20181220) and the National Natural Science Foundation of Shandong Province (Grant number ZR2020MH035).
Competing interests None declared.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.