Introduction Administrative healthcare databases are useful tools to study healthcare outcomes and to monitor the health status of a population. Patients with cancer can be identified through disease-specific codes, prescriptions and physician claims, but prior validation is required to achieve an accurate case definition. The objective of this protocol is to assess the accuracy of International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes for breast, lung and colorectal cancers in identifying patients diagnosed with the respective disease in three Italian administrative databases.
Methods and analysis Data from the administrative databases of Umbria Region (910 000 residents), Local Health Unit 3 of Napoli (1 170 000 residents) and Friuli-Venezia Giulia Region (1 227 000 residents) will be considered. In each administrative database, patients with the first occurrence of diagnosis of breast, lung or colorectal cancer between 2012 and 2014 will be identified using the following groups of ICD-9-CM codes in primary position: (1) 233.0 and (2) 174.x for breast cancer; (3) 162.x for lung cancer; (4) 153.x for colon cancer and (5) 154.0–154.1 and 154.8 for rectal cancer. Only incident cases will be considered, that is, excluding cases that have the same diagnosis in the 5 years (2007–2011) before the period of interest. A random sample of cases and non-cases will be selected from each administrative database and the corresponding medical charts will be assessed for validation by pairs of trained, independent reviewers. Case ascertainment within the medical charts will be based on (1) the presence of a primary nodular lesion in the breast, lung or colon–rectum, documented with imaging or endoscopy and (2) a cytological or histological documentation of cancer from a primary or metastatic site. Sensitivity and specificity with 95% CIs will be calculated.
Dissemination Study results will be disseminated widely through peer-reviewed publications and presentations at national and international conferences.
- administrative database
- validating ICD-9 codes
- breast, lung and colorectal cancers
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Strengths and limitations of this study
The study will evaluate the validity of the International Classification of Diseases-Ninth Revision—Clinical Modification (ICD-9-CM) codes for breast, lung and colorectal cancers in three large Italian administrative databases.
The strength of this study is that it will use a medical chart review to ascertain cases of cancer diseases.
Once these administrative databases are validated for breast, lung and colorectal cancer diseases, they can be used for outcome research including pharmacoepidemiology, health service research and quality of care research.
This study will be the first to validate ICD-9-CM codes of three cancers in three large administrative databases in Italy.
The results of validation studies are specific to the context in which the administrative data were collected and may not be generalisable to other settings.
As computer technology continues to advance, administrative databases are growing in numerous healthcare settings worldwide. These databases anonymously store data about the healthcare assistance patients have received, including birth, death or disease treatment. Usually, the diagnosis of a disease is associated with a specific code from the International Classification of Diseases, Ninth Revision (ICD-9) or 10th Revision (ICD-10). The ICD is designed to map health conditions to corresponding generic categories together with specific variations.1 The merging of individual patient data from administrative databases with other sources (eg, prescription and laboratory data) allows one to investigate a wide range of relevant and often unique public health questions,2 monitor population health status over time and perform population-based pharmacoepidemiological research.2–4
Administrative healthcare databases must be adequately validated before they can serve as a reliable source for research studies. While non-clinical information in healthcare databases, such as demographic and prescription data, is highly accurate,5,6 the validity of registered diagnoses and procedures is variable.6,7 Determining the accuracy of the latter two categories of clinical information is vitally important to all potential users and involves confirming the consistency of information within the databases with the corresponding clinical records of patients.5
In Italy, all the Regional Health Authorities maintain large healthcare information systems containing patient data from all hospital and territorial sources. These databases have the potential to address important issues in postmarketing surveillance,8,9 epidemiology,10 quality performance and health services research.11 However, there is a concern that their considerable potential as a source of reliable healthcare information has not been realised since they have not been widely validated. A systematic review of ICD-9 code validation in Italian administrative databases12 reported that only a few regional databases have been validated for a limited number of ICD-9 codes of diseases including stroke,13,14 gastrointestinal bleeding,15 thrombocytopenia,16 epilepsy,17 infections,18 chronic obstructive pulmonary disease,19,20 Guillain-Barré syndrome21 and cancers.22,23 In addition, the use of these databases was scarce, as only six administrative databases served as sources for published research articles based on the validated ICD-9 codes. Hence, it is imperative that Regional Health Authorities systematically validate their databases for critical diseases to productively use the information they contain.
Breast, colorectal and lung cancers are the most commonly diagnosed neoplasms worldwide, as well as in Italy.24 Consequently, they attract interest from the scientific community and industry as targets for the development of new drugs, and from governments because of the public health and economic burden they cause. For example, variation in the epidemiology of breast,25 colorectal26,27 and lung28 cancers, the treatment (pharmacological or surgical) administered to patients with these cancers and potential clinical and economic outcomes29–31 can all be evaluated using validated administrative databases.
The objective of the present protocol is to evaluate the accuracy of the ICD-9-CM codes related to breast, lung and colorectal cancers in correctly identifying the respective diseases using three large Italian administrative healthcare databases.
Setting and data source
Starting from the early 1990s, local and regional Italian healthcare administrative databases have collected information from all patient medical records from public and private hospitals including demographics, hospital admission and discharge dates, vital statistics, the admitting hospital department, the principal diagnosis and a maximum of five secondary discharge diagnoses, and the principal and five secondary, surgical and diagnostic procedures. In addition, these databases contain the records of all drug prescriptions listed in the National Drug Formulary and the basic characteristics of patients' physicians. Each resident has a unique national identification code with which it is possible to link the various types of information, corresponding to each person, within the database. In Italy, healthcare assistance is covered almost entirely by the Italian National Health System (NHS); therefore, most residents' significant healthcare information can be found within the healthcare databases.
The target administrative databases for the present study will be from the Umbria Region (910 000 residents), Local Health Unit 3 of Napoli (1 170 000 residents) and the Friuli-Venezia Giulia Region (1 227 000 residents). For each database, the corresponding Unit (Regional Health Authority of Umbria for Umbria Region, Registro Tumori Regione Campania for Local Health Unit 3 of Napoli and Centro di Riferimento Oncologico Aviano for Friuli-Venezia Giulia Region) will conduct the same validation process.
The source population will be represented by permanent residents aged 18 years or above of Umbria Region, Local Health Unit 3 of Napoli and the Friuli-Venezia Giulia Region. Any resident who has been discharged from hospital with a diagnosis of breast, lung or colorectal cancer will be considered. Residents who have been hospitalised outside the regional territory of competence will be excluded from analysis due to the difficulty in obtaining the medical charts.
Case selection and sampling method
In each administrative database, patients with the first occurrence of diagnosis of breast, lung or colorectal cancer between 2012 and 2014 will be identified using the following groups of ICD-9-CM codes located in primary position: (1) 233.0 and (2) 174.x for breast cancer; (3) 162.x for lung cancer; (4) 153.x for colon cancer and (5) 154.0–154.1 and 154.8 for rectal cancer. Only incident cases will be considered, that is, excluding cases with the same diagnosis (ICD-9-CM codes in any position) in the 5 years (2007–2011) before the period of interest. Subsequently, for each of the above reported groups of ICD-9-CM codes, a random sample of cases will be selected from each administrative database. Table 1 displays the description of the ICD-9-CM codes for each of the cancer diseases of interest.
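The selection rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually run on the three databases: the `Discharge` record layout, the group labels and the function names are hypothetical, and the sketch assumes that "the same diagnosis" in the look-back period means any code from the same code group.

```python
import random
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

# Code groups from the protocol; the group labels are illustrative only.
CATEGORY_GROUPS = {"174": "breast_invasive", "162": "lung", "153": "colon"}
EXACT_GROUPS = {
    "233.0": "breast_in_situ",
    "154.0": "rectum",
    "154.1": "rectum",
    "154.8": "rectum",
}


@dataclass(frozen=True)
class Discharge:
    """Hypothetical discharge record; diagnoses[0] is the primary diagnosis."""
    patient_id: str
    discharge_date: date
    diagnoses: Tuple[str, ...]


def code_group(code: str) -> Optional[str]:
    """Map an ICD-9-CM code to its protocol code group, or None."""
    code = code.strip()
    if code in EXACT_GROUPS:
        return EXACT_GROUPS[code]
    return CATEGORY_GROUPS.get(code.split(".")[0])


def incident_cases(records):
    """Return {(patient_id, group): first index discharge date in 2012-2014}."""
    prior = set()  # (patient, group) coded in any position during 2007-2011
    first = {}     # (patient, group) -> earliest primary-position discharge
    for rec in records:
        year = rec.discharge_date.year
        if 2007 <= year <= 2011:
            for code in rec.diagnoses:
                group = code_group(code)
                if group:
                    prior.add((rec.patient_id, group))
        elif 2012 <= year <= 2014 and rec.diagnoses:
            group = code_group(rec.diagnoses[0])  # primary position only
            if group:
                key = (rec.patient_id, group)
                if key not in first or rec.discharge_date < first[key]:
                    first[key] = rec.discharge_date
    # Assumption: a prior code from the same group makes the case prevalent.
    return {key: day for key, day in first.items() if key not in prior}


def sample_group(cases, group, n, seed=0):
    """Draw a reproducible random sample of patient ids for one code group."""
    pool = sorted(pid for pid, g in cases if g == group)
    return random.Random(seed).sample(pool, min(n, len(pool)))
```

A fixed seed is used only so that the sample can be reproduced; the protocol does not state how the random sample will be drawn.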
Chart abstraction and case ascertainment
The corresponding medical charts of the randomly selected sample cases will be obtained from hospitals for validation purposes. From each medical chart, the following information will be retrieved: initials of the patient, date of birth, sex, dates of hospital admission and discharge, any diagnostic procedure that contributed to the diagnosis of the cancer, any pharmacological or surgical intervention that was provided for the treatment of the cancer.
Within each unit, two reviewers will receive training on data abstraction. An initial consensus chart review will be performed, with each reviewer independently examining the same set of medical charts (n=20). The inter-rater agreement regarding the presence or absence of breast, lung or colorectal cancer between the two reviewers within each unit will be calculated using the κ statistic. This process will be repeated until the strength of agreement between the two reviewers is near perfect (κ statistic between 0.81 and 1.00). Any discrepancies will be discussed and resolved through third party involvement (RC).
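As an illustration of the agreement check, the following Python sketch computes Cohen's κ for two reviewers who each classify the same charts as cancer present or absent. The function name and the example ratings are hypothetical; the 0.81 threshold is the one stated in the protocol.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who rated the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same, non-empty set of charts")
    n = len(rater_a)
    # Observed agreement: share of charts given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single, identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical consensus round of 20 charts with a single disagreement.
reviewer_1 = [True] * 12 + [False] * 8
reviewer_2 = [True] * 11 + [False] * 9
kappa = cohen_kappa(reviewer_1, reviewer_2)
near_perfect = kappa >= 0.81  # threshold used in the protocol
```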
Case ascertainment of cancer within a medical chart will be based on (1) the presence of a primary nodular lesion in the breast, lung or colon–rectum, documented with imaging or endoscopy and (2) the cytological or histological documentation of cancer from a primary or metastatic site.
Following consensus review, data abstraction will be completed independently. To ensure consistency among all the reviewers, cases with uncertainty will be discussed and resolved through third party involvement (RC).
For non-invasive breast cancer, we will consider the ICD-9-CM code 233.0 valid when there is evidence of a breast nodule documented with imaging (eg, mammography) and a histological diagnosis of ductal or lobular breast carcinoma in situ (pTis).
For invasive breast cancer, we will consider the ICD-9-CM codes 174.x valid when there is evidence of a breast nodule documented with imaging (eg, mammography) and a cytological or histological diagnosis from a primary or metastatic site positive for ductal or lobular adenocarcinoma.
For lung cancer, we will consider the ICD-9-CM codes 162.x valid when there is evidence of a pulmonary nodule documented with imaging (eg, CT scan) and a cytological or histological diagnosis from a primary or metastatic site positive for either small cell lung cancer (microcitoma) or non-small cell lung cancer (NSCLC).
For colon cancer, we will consider the ICD-9-CM codes 153.x valid when there is evidence of a neoplastic lesion within the colon, documented with endoscopy (eg, colonoscopy) or imaging (eg, barium enema), and a histological diagnosis from a primary or metastatic site positive for adenocarcinoma, squamous cell carcinoma or neuroendocrine carcinoma.
For rectal cancer, we will consider the ICD-9-CM codes 154.0–154.1 and 154.8 valid when there is evidence of a neoplastic lesion in the rectosigmoid junction or the rectum, documented with endoscopy (eg, colonoscopy) or imaging (eg, barium enema), and a histological diagnosis from a primary or metastatic site positive for adenocarcinoma, or squamous cell carcinoma.
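The five validity rules above share one structure: a documented lesion at the relevant site plus a positive pathology result, where cytology is accepted only for invasive breast cancer and lung cancer. A minimal Python sketch of that structure is given below. The `ChartFindings` fields and group labels are hypothetical, and "positive" is assumed to mean positive for one of the tumour types listed for that code group.

```python
from dataclasses import dataclass

GROUPS = {"breast_in_situ", "breast_invasive", "lung", "colon", "rectum"}
# Groups for which the protocol accepts cytology as well as histology.
CYTOLOGY_ACCEPTED = {"breast_invasive", "lung"}


@dataclass(frozen=True)
class ChartFindings:
    """Hypothetical summary of what a reviewer abstracted from one chart."""
    lesion_documented: bool   # lesion at the relevant site on imaging or endoscopy
    histology_positive: bool  # histology positive for a listed tumour type
    cytology_positive: bool   # cytology positive for a listed tumour type


def code_confirmed(group: str, chart: ChartFindings) -> bool:
    """Return True when the chart supports the ICD-9-CM code group."""
    if group not in GROUPS:
        raise ValueError(f"unknown code group: {group}")
    pathology = chart.histology_positive or (
        group in CYTOLOGY_ACCEPTED and chart.cytology_positive
    )
    return chart.lesion_documented and pathology
```

For example, a chart with a documented lesion and positive cytology only would confirm a 162.x code under this sketch but not a 153.x code.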
We calculated that a sample of 130 charts of cases will be necessary to estimate an expected sensitivity of 80% with a precision of 10% and a power of 80%. For the specificity calculation, we will randomly select non-cases, that is, records without the ICD-9-CM codes of interest, from each administrative database. The corresponding medical charts will be retrieved and evaluated. We calculated that a sample of 94 charts of non-cases will be necessary to estimate an expected specificity of 90% with a precision of 10% and a power of 80%. Overall, each unit will evaluate 1120 charts, which corresponds to 130 case charts plus 94 non-case charts for each of the five groups of ICD-9-CM codes.
Sensitivity and specificity will be analysed separately for each ICD-9-CM code by constructing 2×2 tables. Sensitivity is the proportion of ‘true positives’ (ie, cancer cases classified as positive by both the administrative database and medical record review) among all cases deemed positive by the medical chart review. Specificity is the proportion of ‘true negatives’ (ie, cases without cancer according to both the administrative database and medical record review) among all cases deemed negative by the medical chart review. For both sensitivity and specificity, 95% CIs will be calculated.
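The calculation from a 2×2 table can be sketched in Python as follows. The protocol does not state which method will be used for the 95% CIs; the Wilson score interval used here is an assumption, and the counts in the example are hypothetical.

```python
from math import sqrt


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def accuracy(tp, fp, fn, tn):
    """Sensitivity and specificity with CIs; medical chart review is the reference.

    tp: code present, chart positive    fp: code present, chart negative
    fn: code absent,  chart positive    tn: code absent,  chart negative
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": (sensitivity, *wilson_ci(tp, tp + fn)),
        "specificity": (specificity, *wilson_ci(tn, tn + fp)),
    }


# Hypothetical counts for one code group, for illustration only.
result = accuracy(tp=104, fp=9, fn=26, tn=85)
```

Each entry of `result` holds the point estimate followed by the lower and upper confidence limits.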
Complete, transparent and accurate reporting is essential in diagnostic accuracy studies because it allows readers to assess internal validity as well as to evaluate the generalisability and applicability of results.32 To ensure quality reporting, any reporting or publication of the results from this study will follow recommended guidelines based on the criteria published by the Standards for Reporting of Diagnostic accuracy (STARD) initiative for the accurate reporting of investigations of diagnostic studies.32–34
In this protocol, we present the approach we will use to analyse the validity of ICD-9-CM codes for breast, lung and colorectal cancers in administrative databases representing northern, central and southern Italy.
Administrative databases constitute a valid alternative in situations in which randomised trials are not able to provide the required evidence for practical or economic reasons. In addition, although epidemiological studies on cancer are frequently based on cancer registries,35–37 administrative databases can add further value, especially in pharmacoepidemiology3,12,38 and health services research.39,40
Accurate identification of cancer cases using the ICD-9-CM codes may contribute to monitoring cancer trends and to proposing interventions to improve cancer care. In 2008, an Italian study developed and validated an algorithm using a regional administrative database to determine incident cases of breast, lung and colorectal cancers and found a sensitivity of 76.7%, 80.8% and 72.4%, respectively, for the three cancers.22 This study will add value to the knowledge of the three cancer diseases given that it covers different areas of Italy.
Ethics and dissemination
Study results will be disseminated widely through peer-reviewed publications and presentations at national and international conferences.
Contributors AM, IA, MF and DS conceived the study and all authors were responsible for designing the protocol. IA and AM drafted the protocol manuscript. IA, DS, GG, FS, PC, GA, EB, RCh, RCi, MDG, DF, MFV, MF and AM critically revised the successive versions of the manuscript and approved the final version.
Funding This study protocol was developed within the D.I.V.O. project (Realizzazione di un Database Interregionale Validato per l'Oncologia quale strumento di valutazione di impatto e di appropriatezza delle attività di prevenzione primaria e secondaria in ambito oncologico) supported by funding from the National Centre for Disease Prevention and Control (CCM 2014), Ministry of Health, Italy. The study funder was not involved in the study design or the writing of the protocol.
Competing interests None declared.
Ethics approval Regional Ethics Committee of Umbria (CEAS).
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement All raw data will be available from the corresponding author.