Accuracy of lung cancer ICD-9-CM codes in Umbria, Napoli 3 Sud and Friuli Venezia Giulia administrative healthcare databases: a diagnostic accuracy study
  1. Alessandro Montedori1,
  2. Ettore Bidoli2,
  3. Diego Serraino2,
  4. Mario Fusco3,
  5. Gianni Giovannini1,
  6. Paola Casucci4,
  7. David Franchini4,
  8. Annalisa Granata3,
  9. Valerio Ciullo3,
  10. Maria Francesca Vitale3,
  11. Michele Gobbato5,
  12. Rita Chiari6,
  13. Francesco Cozzolino1,
  14. Massimiliano Orso1,
  15. Walter Orlandi7,
  16. Iosief Abraha1,8
  17. for the D.I.V.O. Group
    1. 1 Health Planning Service, Regional Health Authority of Umbria, Perugia, Italy
    2. 2 Cancer Epidemiology Unit, Centro di Riferimento Oncologico Aviano, Aviano, Italy
    3. 3 Registro Tumori Regione Campania, ASL Napoli 3 Sud, Brusciano, Italy
    4. 4 Health ICT Service, Regional Health Authority of Umbria, Perugia, Italy
    5. 5 SOC Epidemiologia Oncologica, Centro di Riferimento Oncologico Aviano, Aviano, Italy
    6. 6 Dipartimento di Oncologia, Azienda Ospedaliera Perugia, Perugia, Italy
    7. 7 Direzione salute, Regional Health Authority of Umbria, Perugia, Italy
    8. 8 Centro Regionale Sangue, Azienda Ospedaliera di Perugia, Perugia, Italy
    1. Correspondence to Dr Ettore Bidoli; bidolie{at}


    Objectives To assess the accuracy of International Classification of Diseases 9th Revision–Clinical Modification (ICD-9-CM) codes in identifying subjects with lung cancer.

    Design A cross-sectional diagnostic accuracy study comparing ICD-9-CM 162.x code (index test) in primary position with medical chart (reference standard). Case ascertainment was based on the presence of a primary nodular lesion in the lung and cytological or histological documentation of cancer from a primary or metastatic site.

    Setting Three operative units: administrative databases from Umbria Region (890 000 residents), ASL Napoli 3 Sud (NA) (1 170 000 residents) and Friuli Venezia Giulia (FVG) Region (1 227 000 residents).

    Participants Incident subjects with lung cancer (n=386) diagnosed in primary position between 2012 and 2014 and a population of non-cases (n=280).

    Outcome measures Sensitivity, specificity and positive predictive value (PPV) for 162.x code.

    Results 130 cases and 94 non-cases were randomly selected from each database and the corresponding medical charts were reviewed. Most lung cancer diagnoses were made in medical departments.

    True positive rates were high for all three units. Sensitivity was 99% (95% CI 95% to 100%) for Umbria, 97% (95% CI 91% to 100%) for NA, and 99% (95% CI 95% to 100%) for FVG. The false positive rates were 24%, 37% and 23% for Umbria, NA and FVG, respectively. PPVs were 79% (73% to 83%) for Umbria, 58% (53% to 63%) for NA and 79% (73% to 84%) for FVG.

    Conclusions Case ascertainment for lung cancer based on imaging or endoscopy associated with histological examination yielded an excellent sensitivity in all three administrative databases. PPV was moderate for Umbria and FVG but lower for NA.

    • validity
    • sensitivity and specificity
    • administrative database
    • lung cancer
    • ICD-9-CM
    • positive predictive value

    This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:

    Strengths and limitations of this study

    • This study is the first to have validated International Classification of Diseases 9th Revision–Clinical Modification (ICD-9-CM) codes for lung cancer in three large administrative databases in Italy using the same case definition.

    • Medical chart review was used as reference standard to ascertain cases of lung cancer.

    • Case ascertainment was based on the presence of a primary nodular lesion in the lung documented by imaging and cytological or histological documentation of cancer from a primary or metastatic site.

    • Validation studies of administrative data are related to the context and are not generalisable to other settings.

    • We were not able to determine cancer staging and the accuracy of lung cancer ICD-9-CM codes in secondary position.


    There is increasing interest in the use of administrative healthcare databases in clinical and health services research as they provide timely and easy access to a large source of information regarding subjects in a defined geographical area.1–4 This information may include a combination of hospital discharge data, emergency department visit information, physician prescription data or laboratory data.5 Administrative databases provide easy and cheap access to large numbers of patients over wide geographic regions.1 Generally, the diagnoses of the disease are stored in administrative databases using specific codes from the International Classification of Diseases, 9th Revision (ICD-9) or 10th Revision (ICD-10) edition.6

    The use of administrative databases for research is based on an assumption that they avoid recall bias and that these databases convey plausibly accurate data for healthcare utilisation as well as outcome research.7 However, the most critical elements that need to be considered when using healthcare databases are completeness and validity of the data. Regarding an event or outcome, a database is complete when the proportion of events observed in the population is identical to that detected in the database; missing data can introduce bias.7 Validity, on the other hand, expresses the proportion of ‘true’ events (disease or exposure) that are verified within the population covered in the database. To avoid biased results based on the use of inaccurate data, an adequate validation of administrative healthcare databases is mandatory.8 In other words, since validity of registered diagnoses and procedures is variable,7 9 the accuracy of the source of information (administrative database) needs to be determined by verifying the corresponding clinical information within the reference source of information (eg, medical charts).4 10–12

    Lung cancer is the most commonly diagnosed neoplasm worldwide and it is the leading cause of cancer-related mortality.13 14 Consequently, lung cancer raises particular interest within the research community15 16 and the government as it has enormous implications for public health, quality of cancer care,17 economic burden18 as well as for industry in terms of the development of innovative drugs.19 Administrative databases can play an important role in the evaluation of the quality of cancer care,20 variation in the epidemiology and outcome of lung cancer,21 22 survival and other benefits of treatment23 24 as well as healthcare utilisation and costs.25

    Several assessments of the validity of oncological codes have been made26–31 using different case definitions or algorithms as well as multiple sources, including inpatient and physician office records, and the accuracy estimates differed depending on the cancer site and the case definition.30 31 In Italy, evidence on the validity of ICD-9 codes related to lung cancer in administrative databases is limited.2 32 A systematic review identified only one study that assessed the accuracy of ICD-9 codes related to lung cancer disease.31 To exploit the potential of Italian administrative databases for research, evaluation of quality of care and drug utilisation review, three groups of researchers submitted a research proposal, within a call, to determine the accuracy of ICD-9 codes of relevant cancer diseases in their respective administrative databases.3 33 The aim of this study was to assess the validity of ICD-9 codes related to lung cancer based on a simple case definition ascertained using the medical chart across three large healthcare databases from Umbria, ASL Napoli 3 Sud (NA) and Friuli Venezia Giulia (FVG).


    Setting and data source

    Administrative databases

    The target administrative databases for the present study were those of the Umbria Region (890 000 residents), the NA (1 170 000 residents), and the FVG Region (1 227 000 residents). For each database, the corresponding unit (Regional Health Authority of Umbria for the Umbria Region, Registro Tumori Regione Campania for the Local Health Unit 3 of Napoli and Centro di Riferimento Oncologico Aviano for the FVG Region) conducted the same validation process.

    In Italy, administrative databases began collecting healthcare information on their residents in the early 1990s. These databases gather diagnostic discharge data from public and private hospitals, vital statistics, hospital admission and discharge dates, the admitting hospital department, the principal diagnosis and a maximum of five secondary discharge diagnoses, the principal and up to five secondary surgical or pharmacological treatments and diagnostic procedures, as well as all drug prescriptions listed in the National Drug Formulary together with the basic characteristics of patients’ physicians. The various types of information can be linked within the database and all residents’ data can be traced as each resident has a unique, lifetime national identification code. In Italy, healthcare is covered almost entirely by the Italian National Health System; therefore, most residents’ significant healthcare information can be found within the healthcare databases.

    Every resident has a unique code within the entire national/regional database. For every medical chart, a Hospital Discharge Register is generated and this has a unique code which is generated in a chronologically progressive way throughout the year and is independent from the type of admission (hospital or day hospital, week surgery, etc). The code comprises a root of numbers that are a combination of the regional code, the hospital code and the department code that helps avoid any duplicate even at the national level. Other controls to avoid duplication of the medical charts identity include control of duplicates of rows and potential duplication based on the admission and/or discharge dates of the same subjects independent from the department in which the patient has been admitted.

    Source population

    The source population consisted of permanent residents aged 18 or above in the Umbria Region, the Local Health Unit 3 of Napoli and the FVG Region. Any resident who had been discharged from hospital with a diagnosis of lung cancer was considered. Residents who had been hospitalised outside the regional territory of competence were excluded from analysis due to the difficulty in obtaining the medical charts.

    Patient and public involvement

    This was a retrospective study based on consultation of medical charts. Patients were not directly involved.

    Case selection and sampling method

    In each administrative database, the following process was followed to identify new cases with lung cancer: (1) records of patients with occurrence of diagnosis of lung cancer between 1 January 2012 and 31 December 2014 were identified using the ICD-9-CM codes 162.x located in primary position of the hospital discharge; (2) records subsequent to the index date were deleted; (3) prevalent cases, that is, those with the same diagnosis (ICD-9-CM codes 162.x in any position) in the 5 years (2007–2011) before the period of interest, were excluded.

    This cohort represented our target population from which a sample of cases was obtained using a simple random method.
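
    The three selection steps above can be sketched in code. This is a minimal illustration rather than the procedure actually run on the regional databases: the record layout (`Discharge`, with the primary diagnosis in the first position of `codes`), the dotted code format and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Discharge:
    """Hypothetical hospital discharge record; codes[0] is the primary diagnosis."""
    patient_id: str
    discharge_date: date
    codes: tuple[str, ...]


STUDY_START, STUDY_END = date(2012, 1, 1), date(2014, 12, 31)
LOOKBACK_START, LOOKBACK_END = date(2007, 1, 1), date(2011, 12, 31)


def is_lung_cancer(code: str) -> bool:
    # Assumes codes are stored in dotted form, e.g. "162.9".
    return code == "162" or code.startswith("162.")


def incident_cases(records):
    """Return {patient_id: index record} for incident lung cancer cases."""
    index = {}
    # Steps 1 and 2: 162.x in primary position within the study window,
    # keeping only the earliest qualifying record per patient.
    for rec in sorted(records, key=lambda r: r.discharge_date):
        in_window = STUDY_START <= rec.discharge_date <= STUDY_END
        if in_window and rec.codes and is_lung_cancer(rec.codes[0]):
            index.setdefault(rec.patient_id, rec)
    # Step 3: drop prevalent cases (162.x in any position during 2007-2011).
    prevalent = {
        rec.patient_id
        for rec in records
        if LOOKBACK_START <= rec.discharge_date <= LOOKBACK_END
        and any(is_lung_cancer(c) for c in rec.codes)
    }
    return {pid: rec for pid, rec in index.items() if pid not in prevalent}


# Illustrative records only (not study data).
records = [
    Discharge("A", date(2013, 5, 2), ("162.9",)),
    Discharge("A", date(2014, 1, 9), ("162.3", "197.7")),
    Discharge("B", date(2010, 3, 1), ("486", "162.9")),
    Discharge("B", date(2012, 6, 1), ("162.9",)),
    Discharge("C", date(2013, 8, 8), ("197.0", "162.9")),
]
cases = incident_cases(records)
```

    In this toy input only patient A is retained: B has a 162.x code in the look-back period, and C carries the code only in secondary position.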

    For controls (non-cases), the following process was followed: (1) subjects aged 18 or older with a diagnosis of cancer disease (ie, patients having a diagnosis of cancer in primary position (ICD-9 140–239)) were identified; (2) from this cohort, subjects with lung cancer (ICD-9-CM codes 162.x in primary position) were excluded; (3) prevalent cases, that is, those with the same diagnosis (ICD-9 140–239 codes in any position) in the 5 years (2007–2011) before the period of interest, were excluded.

    This cohort represented our target population from which a sample of non-cases (controls) was obtained using a simple random method.
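
    Simple random sampling without replacement from a cohort can be sketched as follows. The identifier format, the seed and the helper name `draw_sample` are hypothetical; the cohort size of 1690 and the sample size of 130 are the Umbria figures reported in the Results.

```python
import random


def draw_sample(patient_ids, n, seed):
    """Draw n distinct identifiers by simple random sampling without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    # Sorting first makes the result independent of the input order.
    return rng.sample(sorted(patient_ids), n)


# Hypothetical identifiers for a cohort of 1690 incident cases.
cohort = [f"P{i:04d}" for i in range(1690)]
sample = draw_sample(cohort, 130, seed=2015)
```

    Fixing the seed is a design choice that lets the same sample be regenerated when the chart request list has to be audited.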

    Chart abstraction and case ascertainment

    Medical charts of the randomly selected samples of cases and non-cases were obtained from hospitals for case ascertainment. From each medical chart, the following data were collected: clinical chart number, hospital and ward of admission, date of birth, sex, dates of hospital admission and discharge, signs and symptoms, any diagnostic procedures that contributed to the diagnosis of the cancer, any pharmacological or surgical interventions that were provided for the treatment of the cancer.

    Within each unit, two medical doctors (MDs) acting as reviewers received training on data abstraction evaluating the same (n=20) medical charts independently. The inter-rater agreement among the pairs of reviewers within each unit was near perfect (κ >0.9). Following the consensus review, data abstraction was completed independently. To ensure consistency among all the reviewers, cases with uncertainty were discussed and resolved through a third party involvement (IA, RC).
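
    Agreement between two reviewers on the same set of charts is commonly summarised with Cohen's kappa. The text does not say which kappa statistic was used, so the sketch below assumes the unweighted Cohen's kappa for two raters; the ratings are invented for illustration and are not the study's training data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for 20 training charts with a single disagreement.
rater_a = ["case"] * 12 + ["non-case"] * 8
rater_b = ["case"] * 11 + ["non-case"] * 9
kappa = cohen_kappa(rater_a, rater_b)
```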

    We considered the ICD-9-CM codes 162.x valid when there was evidence of a pulmonary nodule documented with (1) imaging (eg, CT scan) or endoscopy and (2) a cytological or histological diagnosis from a primary or metastatic site positive for either small cell lung cancer or non-small cell lung cancer. Cases and non-cases were validated by pairs of MDs, one of whom was an oncologist.

    Statistical analysis

    We calculated that a sample of 130 charts of cases was necessary to obtain an expected sensitivity of 80% with a precision of 10% and a power of 80%. For specificity, we calculated that a sample of 94 charts of non-cases was necessary to obtain an expected specificity of 90% with a precision of 10% and a power of 80%,3 according to binomial exact calculation.34 The 2×2 tables were developed to calculate sensitivity and specificity with their corresponding 95% CI. Accuracy data were calculated separately for each administrative database.
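
    The accuracy measures derived from a 2×2 table can be sketched as below, with an exact binomial (Clopper-Pearson) 95% CI obtained by bisection. The interval method is an assumption based on the "binomial exact calculation" mentioned above; the counts are invented for illustration and are not those of Table 2.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for x successes out of n trials."""

    def solve(k, target):
        # binom_cdf(k, n, p) decreases as p grows, so bisect on p.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit solves P(X >= x) = alpha/2; upper limit solves P(X <= x) = alpha/2.
    lower = 0.0 if x == 0 else solve(x - 1, 1 - alpha / 2)
    upper = 1.0 if x == n else solve(x, alpha / 2)
    return lower, upper


def accuracy(tp, fp, fn, tn):
    """Point estimates and exact 95% CIs from the four cells of a 2x2 table."""
    return {
        "sensitivity": (tp / (tp + fn), exact_ci(tp, tp + fn)),
        "specificity": (tn / (tn + fp), exact_ci(tn, tn + fp)),
        "ppv": (tp / (tp + fp), exact_ci(tp, tp + fp)),
    }


# Hypothetical counts, not taken from the study.
result = accuracy(tp=90, fp=10, fn=5, tn=95)
```

    Each entry of `result` holds the point estimate followed by its `(lower, upper)` limits.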

    In case of missing medical charts, we performed a formal sensitivity analysis based on a worst case scenario in which the missing cases were considered as false positives and missing controls were considered false negatives.
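
    The worst case scenario described above amounts to moving each missing chart into the least favourable cell of the 2×2 table. A minimal sketch follows, using hypothetical counts and a hypothetical helper name:

```python
def worst_case_table(tp, fp, fn, tn, missing_cases, missing_noncases):
    """Reassign missing charts to the least favourable cell of the 2x2 table."""
    # Missing case charts are counted as false positives,
    # missing non-case charts as false negatives.
    return tp, fp + missing_cases, fn + missing_noncases, tn


def sensitivity(tp, fp, fn, tn):
    return tp / (tp + fn)


def specificity(tp, fp, fn, tn):
    return tn / (tn + fp)


# Hypothetical observed counts and numbers of missing charts.
observed = (90, 10, 5, 95)
adjusted = worst_case_table(*observed, missing_cases=4, missing_noncases=2)
sens_observed, sens_worst = sensitivity(*observed), sensitivity(*adjusted)
spec_observed, spec_worst = specificity(*observed), specificity(*adjusted)
```

    Comparing the observed and worst-case values shows how far the estimates could move if every missing chart had contradicted the code.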


    The exclusion of prevalent cases of lung cancer in primary position allowed the identification of a cohort of 1690 new cases from Umbria, 1655 from NA and 2013 from FVG. Subsequently, each unit randomly selected 130 cases, for which the corresponding medical charts were requested for evaluation. These random samples represented 7.7% of the original new cases for Umbria, 7.6% for NA and 6.5% for FVG. Four (3%) medical charts were not available from NA. Figure 1 displays the identification of cases from the three operative units. For the non-cases, each unit randomly selected 94 medical charts. Two medical charts of non-cases from Umbria were missing (see online supplemental table A).

    Figure 1

    Flow-chart of incident cases identification using the administrative databases and the corresponding charts (final cell) identified and examined.

    The most common ICD-9-CM subgroup was the code 162.9 (ie, bronchus and lung, unspecified) accounting for 51% of cases in Umbria, 58% in NA and 35% in FVG, followed by the code 162.3 (ie, upper lobe, bronchus or lung) accounting for 25% in Umbria, 14% in NA and 24% in FVG. The mean age of the patients was 70 years in Umbria, 68 years in NA and 72 years in FVG. Most of the diagnoses (range 66% to 87%) of lung cancer were made in medical departments. The instrumental tools for diagnosis included CT scan, bronchoscopy, chest X-ray and positron emission tomography/CT. Surgical interventions were limited to 12%–26% of patients and included lobectomy, pneumonectomy and other surgical interventions. Table 1 displays the basic characteristics of lung cancer cases in each unit.

    Table 1

    Characteristics of patients with lung cancer who were identified in the three administrative healthcare databases

    True positive rates were very high for all three units. The sensitivity was 99% (95% CI 95% to 100%) for Umbria, 97% (95% CI 91% to 100%) for NA and 99% (95% CI 95% to 100%) for FVG. The false positive rates were 24%, 37% and 23% for Umbria, NA and FVG, respectively. PPVs were 79% (73% to 83%) for Umbria, 58% (53% to 63%) for NA and 79% (73% to 84%) for FVG.

    Table 2 provides cross tabulation of the ICD-9-CM code results by the results of the medical charts, whereas figure 2 displays sensitivities and specificities across the three operative units.
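
    Because specificity is the complement of the false positive rate, the specificities implied by the rates above can be worked out directly. The short sketch below only restates the reported percentages; the variable names are illustrative.

```python
# Reported false positive rates, in percent, for each operative unit.
false_positive_rate = {"Umbria": 24, "NA": 37, "FVG": 23}

# Specificity (%) = 100 - false positive rate (%).
specificity = {unit: 100 - fpr for unit, fpr in false_positive_rate.items()}
```

    The value obtained for NA (63%) is the specificity quoted for that unit in the discussion.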

    Figure 2

    Sensitivity and specificity with 95% CIs for lung cancer International Classification of Diseases 9th Revision–Clinical Modification codes for the three administrative databases.

    Table 2

    Cross tabulation of the index test (ICD-9-CM code) results by the results of the reference standard (medical chart)

    Misclassification of cases and non-cases is described in table 3. Most false positive cases (89%) were due to missing histological documentation (28 in Umbria, 39 in NA and 23 in FVG), whereas in 11 (11%) cases overall, the histological documentation was negative for lung cancer. Overall, only four false negatives were identified, and these were due to unclear or possible lung cancer histology. No coding errors were identified.

    Table 3

    Reason for incorrect identification of cases and controls

    Missing data for cases and non-cases did not affect the estimates of sensitivity and specificity.

    A subgroup analysis based on age showed that false positive rates were higher in the age group ≥65 than in the age group <65 years influencing specificity in the Umbria and FVG databases (see online supplemental table B).


    This study evaluated the ability of three administrative databases (Umbria, NA and FVG) to identify incident lung cancer cases. According to our case definition, that is, the requirement of a clinical or instrumental documentation of a lesion together with the presence of histological documentation within the same medical chart, we determined that ICD-9 codes have excellent sensitivity across the three databases but moderate specificity and PPV in Umbria and FVG, while NA yielded lower specificity (63%) and PPV (58%). The rate of false positives drove the specificity and PPV results, and it was predominantly due to histological documentation that was missing from the first medical chart of admission. Had we used broader criteria, such as the evaluation of a subsequent medical chart,28 the addition of surgical procedures35 or a combination of both,28 the PPVs might have been higher. However, another Italian study31 that compared the accuracy of lung cancer ICD-9 codes in a regional administrative database against a cancer registry obtained a PPV of 78.7%, similar to that of Umbria and FVG, even though its algorithm was based on a combination of ICD-9-CM diagnosis, surgical procedures, chemotherapy and radiotherapy codes.31

    In our study, biopsies or surgical procedures could not be performed in patients with critical clinical conditions or of advanced age, which may explain in part the higher rate of false positives from the NA operative unit. Indeed, published medical literature reports that most lung cancer patients present with advanced disease and are diagnosed based mainly on symptoms.36 37 This condition may also explain why in our assessment the most prevalent ICD-9-CM subgroup code was 162.9, namely ‘bronchus and lung, unspecified’, in which case, given the metastatic or locally advanced disease, the site of the primary tumour loses its relevance because a radical surgical approach is not possible.

    The validation of our algorithm can be extended and tested in other regional settings as well as at national level especially in the areas that are not covered by cancer registries. By combining the lung cancer ICD-9 codes with prescription databases, mortality databases and other sources, researchers at regional and national level can efficiently identify a cohort with lung cancer and perform pharmacoepidemiological studies or other health services-related research.

    Strengths and limitations

    Strengths of our work include the requirement for validation purposes of the presence of histological or cytological documentation in addition to a radiological or endoscopic presence of a primary lesion. Unlike studies that used cancer registries to validate lung cancer codes, we ascertained the presence of the disease by using clinical charts to confirm the accuracy of cases that were identified in the administrative databases.

    Additionally, our study assessment was based on a prepublished protocol3 and no deviation from protocol occurred during the study development. We followed recommended guidelines based on the criteria published by the Standard Protocol Items: Recommendations for Interventional Trials initiative for the accurate reporting of investigations of diagnostic studies. Hence, we used detailed and explicit eligibility criteria, as well as duplicate and independent processes for medical charts review and data abstraction.38–40

    We acknowledge some limitations in our study. First, although we considered three Italian regions from three different areas (North, Middle, South) of Italy, the accuracy results of this validation study may not be generalisable to other settings due to the specific characteristics of the patients included in the three regions (such as age, sex, clinical conditions, comorbidities). Second, the stage of the disease could be an important factor that may have influenced the sensitivity, but we could not perform this analysis because cancer staging is an element that cannot be found in the index test. Third, we did not assess the accuracy of cancer codes in secondary position, which may lead to an underestimate of the incidence of lung cancer; further research is necessary to quantify this.


    We developed a case definition for lung cancer based on imaging or endoscopy associated with histological examination that yielded excellent sensitivity for three population-based healthcare databases, two of which had a moderate PPV. In the NA healthcare database, the PPV was lower and future research is needed to address the reason for a higher rate of false positives. This case definition can be extended to other regional and local areas of Italy where cancer registries are lacking. Results from our study support the use of healthcare databases as a valuable tool to investigate several aspects of lung cancer and to conduct population-based longitudinal studies with long-term outcomes.



    • Contributors AM, IA, MF and DS conceived the original idea of the study. IA, DS, AM, MF, EB, GG, FC, MO and WO designed the study. PC, DF, PC, AG, MFV, VC and MG identified the cohort using administrative database with the supervision of WO, EB, DS, MF, and AM. IA, FC, MO, AG, PC, VC, MFV and MG undertook the data abstraction with the supervision of AM, GG, WO, FS, MF, EB, PC and DS. IA, RC, AM, DS and MF performed case ascertainment. IA, AM, FC, EB, MF, MG and MO performed the analysis. DS, GG, PC, DF, AG, VC, RC, MFV and WO helped with the interpretation of the results. The initial draft of the manuscript was prepared by IA, AM, EB, DS and MF. DS, GG, PC, DF, AG, VC, MFV, MG, RC, FC, MO and WO revised critically the manuscript for important intellectual content. All the authors read and approved the final manuscript. AM, MF and EB are the guarantors of the data for the respective operative units.

    • Funding This study was developed within the D.I.V.O. project (Realizzazione di un Database Interregionale Validato per l’Oncologia quale strumento di valutazione di impatto e di appropriatezza delle attività di prevenzione primaria e secondaria in ambito oncologico) supported by funding from the National Centre for Disease Prevention and Control (CCM 2014), Ministry of Health, Italy. The study funder was not involved in the study design or the writing of the protocol.

    • Competing interests None declared.

    • Patient consent Not required.

    • Ethics approval Regional Ethics Committee of Umbria (CEAS), authorisation number: 2656/15 (04/11/2015).

    • Provenance and peer review Not commissioned; externally peer reviewed.

    • Data sharing statement No additional data are available.

    • Collaborators Giuliana Alessandrini; Marcello De Giorgi; Roberto Cirocchi; Paolo Collarile; Fabrizio Stracci.
