Predicting the risk of asthma attacks in children, adolescents and adults: protocol for a machine learning algorithm derived from a primary care-based retrospective cohort

Introduction
Most asthma attacks and subsequent deaths are potentially preventable. We aim to develop a prognostic tool for identifying patients at high risk of asthma attacks in primary care by leveraging advances in machine learning.

Methods and analysis
Current prognostic tools use logistic regression to develop a risk scoring model for asthma attacks. We propose to build on this by systematically applying various well-known machine learning techniques to a large longitudinal deidentified primary care database, the Optimum Patient Care Research Database, and comparatively evaluate their performance with the existing logistic regression model and against each other. Machine learning algorithms vary in their predictive abilities based on the dataset and the approach to analysis employed. We will undertake feature selection, classification (both one-class and two-class classifiers) and performance evaluation. Patients who have had actively treated clinician-diagnosed asthma, aged 8–80 years and with 3 years of continuous data, from 2016 to 2018, will be selected. Risk factors will be obtained from the first year, while the next 2 years will form the outcome period, in which the primary endpoint will be the occurrence of an asthma attack.

Ethics and dissemination
We have obtained approval from OPCRD’s Anonymous Data Ethics Protocols and Transparency (ADEPT) Committee. We will seek ethics approval from The University of Edinburgh’s Research Ethics Group (UREG). We aim to present our findings at scientific conferences and in peer-reviewed journals.
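The cohort design described in the abstract (3 years of continuous data from 2016 to 2018, risk factors from the first year, outcome in the following 2 years) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Patient` record layout, the exact calendar boundaries and the choice to measure age at the start of the study window are all assumptions, and the "actively treated clinician-diagnosed asthma" criterion is not modelled.

```python
from dataclasses import dataclass
from datetime import date

# Assumed study calendar: 3 years of data, 2016 to 2018.
STUDY_START = date(2016, 1, 1)
BASELINE_END = date(2016, 12, 31)  # risk factors are taken from the first year
STUDY_END = date(2018, 12, 31)     # the next 2 years form the outcome period


@dataclass
class Patient:
    # Hypothetical record layout for illustration; not the OPCRD schema.
    birth_date: date
    registered_from: date
    registered_to: date
    attack_dates: list


def age_on(birth_date: date, on: date) -> int:
    """Age in completed years on a given day."""
    had_birthday = (on.month, on.day) >= (birth_date.month, birth_date.day)
    return on.year - birth_date.year - (0 if had_birthday else 1)


def is_eligible(p: Patient) -> bool:
    """Aged 8-80 at study start (assumption) with continuous data for the whole window."""
    continuous = p.registered_from <= STUDY_START and p.registered_to >= STUDY_END
    return continuous and 8 <= age_on(p.birth_date, STUDY_START) <= 80


def had_attack_in_outcome_period(p: Patient) -> bool:
    """Primary endpoint: any asthma attack after the baseline year, up to study end."""
    return any(BASELINE_END < d <= STUDY_END for d in p.attack_dates)
```

Attacks recorded during the baseline year are deliberately not counted towards the endpoint here; in a design like this they would more naturally serve as a risk factor.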

1. Regarding determination of BTS step feature (in Table 1), will this determination be based on analysis of prescribed medications at a certain time, documentation in the note, or read codes?
2. I would like more detail about your exclusion criteria; are you including patients with co-morbidities such as COPD or interstitial lung disease, or other serious respiratory ailments? I understand the desire to get a "real world cohort" but the concern for confounding is substantial.
3. What (if any) is the target time window you are hoping to predict? 1 week, 2 months, 6 months before exacerbation?
4. Does the data set include prescription fill data as a proxy for adherence or non-adherence? Any other features that might capture medication use?

GENERAL COMMENTS
The manuscript entitled "Predicting the risk of asthma attacks in adolescents and adults: protocol for a machine learning algorithm derived from a primary care-based retrospective cohort" is acceptable for publication in this journal, but there are some points that should be corrected before publication. Asthma is highly prevalent in children under 12 years old. Why was the age range 12-80? What about children under 12 years old?
The study design, methods and analysis need more explanation.
Please revise the abstract.

GENERAL COMMENTS
It is quite an ambitious undertaking, comprehensive, deploying multiple machine learning methods/algorithms to see if there are significant differences (in terms of utility) in the results precipitated by each. This paper is acceptable as a protocol paper. The authors correctly stated the problem and the issues of the different machine learning methods. The study protocol has been described clearly. Conditional probabilities vs. unconditional probabilities are a big issue when applying machine learning methods in medical studies. I'm glad to see a study which is carried out to compare the one-class classifier and two-class classifier in a medical study. As each of these two methods has both limitations and strengths, it would be a good idea to come up with a generalization method which can minimize the limitations and maximize the strengths of both methods.
Here is a suggestion. If the data permits, I would also recommend the addition of another data point. Studies have recorded differences, perhaps related to immunological differences, in asthma prevalence by ethnicity (or race). If the data permits results, such extension would be at minimal cost. Another suggestion is gathering the environmental exposures data (maybe through Asthma Smartphone APPs). This is a solid issue of asthma. Good luck to the authors; I look forward to the publication of findings.

I'm afraid to say that there is no "Result" section in this manuscript.
Thank you for your comment. Since this is a protocol paper, we have not undertaken the work proposed and hence there are no results.

Reviewer: 2
Exciting work. Well thought out methods and clear description of ML techniques.
Thank you very much for your support and positive feedback.

1. Regarding determination of BTS step feature (in Table 1), will this determination be based on analysis of prescribed medications at a certain time, documentation in the note, or read codes?
These will be determined by analysis of prescribed medications (also coded in READ codes).

2. I would like more detail about your exclusion criteria; are you including patients with co-morbidities such as COPD or interstitial lung disease, or other serious respiratory ailments? I understand the desire to get a "real world cohort" but the concern for confounding is substantial.
Thank you for this very important point. We aim to include patients with various co-morbidities and will not attempt to exclude patients on the basis of certain comorbidities. We have clarified this in the manuscript (reproduced below): "We will not attempt to exclude patients with co-morbidities. We have, however, included comorbidities (see Table 1) that will allow us to adjust for any potential confounders arising from comorbidities."

3. What (if any) is the target time window you are hoping to predict? 1 week, 2 months, 6 months before exacerbation?
We aim to predict asthma attacks over 3-, 6-, 12- and 24-month periods. This was mentioned in the introduction under "Research Aims" and we have reproduced this below: "2. Systematically apply several machine learning algorithms (both one-class classifier and two-class classifiers) to predict the risk of asthma attacks, over 3-, 6-, 12- and 24-month outcome periods."
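As an illustration only, the 3-, 6-, 12- and 24-month outcome periods mentioned above could be turned into one binary label per horizon. The sketch below assumes an index date at the end of the baseline year and a plain list of attack dates; `add_months`, `horizon_labels` and the window convention (index date excluded, window end included) are hypothetical choices, not taken from the protocol.

```python
import calendar
from datetime import date

# Outcome horizons in months, as listed in the research aims.
HORIZON_MONTHS = (3, 6, 12, 24)


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping the day to the target month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def horizon_labels(index_date: date, attack_dates) -> dict:
    """Map each horizon to True if any attack falls in (index_date, index_date + horizon]."""
    labels = {}
    for months in HORIZON_MONTHS:
        window_end = add_months(index_date, months)
        labels[months] = any(index_date < d <= window_end for d in attack_dates)
    return labels


# Example: one attack about 4.5 months after an assumed index date of 31 Dec 2016.
example = horizon_labels(date(2016, 12, 31), [date(2017, 5, 15)])
```

Each horizon then defines its own classification target, so the same feature set from the baseline year can be paired with four different outcome labels.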

4. Does the data set include prescription fill data as a proxy for adherence or non-adherence? Any other features that might capture medication use?
We only have access to the prescription records in the primary care database to rely on. Pharmacy records would have given us additional information to help us better estimate patient adherence. However, in this study, we acknowledge this as a limitation, and we have now included this in the discussion accordingly (reproduced below).
"Furthermore, we do not have access to pharmacy records for prescription data (which may help us better estimate patient adherence to medication prescription) and would therefore use prescription records to determine patient usage which may not always be correct."

Reviewer: 3
"The manuscript entitled "Predicting the risk of asthma attacks in adolescents and adults: protocol for a machine learning algorithm derived from a primary care-based retrospective cohort" is acceptable for publication in this journal"
Thank you for your comment.

"In the children under 12 years old Asthma has high prevalence. Why age was 12-80? What about under 12 years old?"
The comparable work (by Blakey et al. 2016) that we originally aimed to improve on in terms of developing prognostic models included patients aged 12-80. However, it is also important to look at children under 12 years of age. We have consequently modified our inclusion criteria to include children aged 8-12 years as well. We chose the cut-off of 8 years as this is the cut-off chosen by NICE as part of the Quality and Outcomes Framework for asthma diagnosis (https://www.nice.org.uk/standards-and-indicators/qofindicators). A potential predictor that might be useful to identify asthma attacks is eosinophil count. We have consequently modified the inclusion criteria and also included the eosinophil count as an additional potential feature in Table 1. The relevant changes are reproduced below: "Identify significant risk factors associated with asthma attacks in children, adolescents and adults (aged 8-80 years), and appropriately select these for inclusion in our analysis."

Eosinophil Count
Blood eosinophil count (cells/L) categorised into high and not high (threshold of 0.35 × 10^9 cells/L to define high/not high eosinophil count [13])

The study design, methods and analysis need more explanation.
Please revise the abstract.

Thank you very much for your support and positive feedback.
If the data permits, I would also recommend the addition of another data point. Studies have recorded differences, perhaps related to immunological differences, in asthma prevalence by ethnicity (or race). If the data permits results, such extension would be at minimal cost.
This is very helpful and we would aim to get this information and incorporate it in our model. At the moment, though, we do not have access to ethnicity information (but the OPCRD does contain linkage data for both ethnicity and deprivation that we will aim to acquire during the course of this project). We have amended Table 1

Another suggestion is gathering the environmental exposures data (maybe through Asthma Smartphone APPs). This is a solid issue of asthma.
Thank you for the suggestion; environmental exposure data would indeed be very relevant for asthma attack prediction. Unfortunately, the dataset we have is anonymised and, as a condition of use, we will make no attempt to identify or contact the patients we analyse (and we currently do not have the capacity to collect such information at a national level).

REVIEWER
Quan Do, Mayo Clinic, USA

REVIEW RETURNED
01-Apr-2020

GENERAL COMMENTS
The paper is in the editing format with all comments?