
Validation and optimisation of an ICD-10-coded case definition for sepsis using administrative health data
  1. Rachel J Jolley1,
  2. Hude Quan1,2,
  3. Nathalie Jetté1,2,3,4,
  4. Keri Jo Sawka1,
  5. Lucy Diep1,
  6. Jade Goliath1,
  7. Derek J Roberts1,5,6,
  8. Bryan G Yipp6,7,
  9. Christopher J Doig1,6,7
  1. 1Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
  2. 2O'Brien Institute for Public Health, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
  3. 3Department of Clinical Neurosciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
  4. 4Hotchkiss Brain Institute, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
  5. 5Department of Surgery, Cumming School of Medicine, University of Calgary, Calgary, Canada
  6. 6Department of Critical Care Medicine, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
  7. 7Snyder Institute of Chronic Diseases, University of Calgary, Calgary, Alberta, Canada
  1. Correspondence to Dr Christopher J Doig; cdoig{at}


Objective Administrative health data are important for health services and outcomes research. We optimised and validated in intensive care unit (ICU) patients an International Classification of Disease (ICD)-coded case definition for sepsis, and compared this with an existing definition. We also assessed the definition's performance in non-ICU (ward) patients.

Setting and participants All adults (aged ≥18 years) admitted to a multisystem ICU with general medicosurgical ICU care from one of three tertiary care centres in the Calgary region in Alberta, Canada, between 1 January 2009 and 31 December 2012 were included.

Research design Patient medical records were randomly selected and linked to the discharge abstract database. In ICU patients, we validated the Canadian Institute for Health Information (CIHI) ICD-10-CA (Canadian Revision)-coded definition for sepsis and severe sepsis against a reference standard medical chart review, and optimised this algorithm through examination of other conditions apparent in sepsis.

Measures Sensitivity (Sn), specificity (Sp), positive predictive value (PPV) and negative predictive value (NPV) were calculated.

Results Sepsis was present in 604 of 1001 ICU patients (60.4%). The CIHI ICD-10-CA-coded definition for sepsis had Sn (46.4%), Sp (98.7%), PPV (98.2%) and NPV (54.7%); and for severe sepsis had Sn (47.2%), Sp (97.5%), PPV (95.3%) and NPV (63.2%). The optimised ICD-coded algorithm for sepsis increased Sn by 25.5% and NPV by 11.9% with slightly lowered Sp (85.4%) and PPV (88.2%). For severe sepsis both Sn (65.1%) and NPV (70.1%) increased, while Sp (88.2%) and PPV (85.6%) decreased slightly.

Conclusions This study demonstrates that sepsis is highly undercoded in administrative data, thus under-ascertaining the true incidence of sepsis. The optimised ICD-coded definition has a higher validity with higher Sn and should be preferentially considered if used for surveillance purposes.

  • Sepsis
  • Diagnosis Validation
  • Administrative Data
  • Patient Care/Classification
  • International Classification of Disease

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:


Strengths and limitations of this study

  • This study examined the validity of an optimised International Classification of Disease (ICD)-10-CA-coded case definition to identify sepsis and severe sepsis in an inpatient administrative database for both ICU and non-ICU patients.

  • Sepsis is undercoded in administrative data. Although sepsis is undercoded, our algorithm identifies with confidence a cohort of patients with sepsis (a minimum number of false-positive cases). This algorithm is optimal for studies where identifying a cohort of true sepsis cases is important.

  • We also report an algorithm that optimises the identification of patients with an increased case-capture rate for sepsis (although a slight increase in the number of false positives): this algorithm may be optimal for surveillance studies.

  • Sepsis is a hard-to-define condition. A validated algorithm to identify patients with sepsis from administrative data may facilitate health services research into this expensive and high morbidity and mortality condition.


Sepsis is a life-threatening condition with a high rate of occurrence in the intensive care unit (ICU).1 ,2 It is one of the most costly diseases to treat3 ,4 leaving long-term physical and cognitive effects on its survivors.5 Historically, sepsis has been difficult to define, diagnose and treat.6 In 1992, the American College of Chest Physicians and Society for Critical Care Medicine (ACCP/SCCM) published the first consensus clinical definitions of sepsis outlining the terminology and clinical characteristics of the spectrum of illness.7 In 2001, these clinical definitions were updated to provide more clarification on the signs and symptoms of the disease, and to identify methodologies to increase the accuracy and reliability of the diagnosis of sepsis.8 Since the consensus conference clinical definitions were published, most studies use these clinical definitions regardless of study type (ie, clinical trial or health services research) and/or data source (ie, administrative data, or prospective clinical record).

Administrative health data are widely collected, and are a generally cost-effective way of studying multiple outcomes, health service usage and resource allocation in large populations.9 Administrative data typically use WHO's International Classification of Diseases (ICD)10 codes. ICD is an alphanumeric classification system whose core code category, made up of the first three characters, is mandatory for reporting to facilitate international comparisons; the most recent update, ICD-10, was released in 1994. A major advantage of ICD-10 is that it contains almost twice the number of codes (12 420 codes in ICD-10 compared against 6882 in ICD-9), permitting richer and more precise capture of clinical information and allowing for improved international comparability.11 ,12

However, irrespective of coding systems, it may be difficult to recognise and translate complex conditions, such as sepsis, into a single code. Therefore, often for complex conditions such as sepsis, multiple codes may exist. Some studies have used infection codes,13 or a more limited number of codes, for sepsis.14 Reported sensitivities in validation studies have ranged from 5.9% to 82.3%.15–25 These studies varied significantly in the number and types of codes applied, and the methods in developing the ICD coding algorithms.

The Canadian Institute for Health Information (CIHI) created an ICD-10-CA (Canadian Revision)-coded case definition to define sepsis in administrative data.26 This particular definition uses 49 ICD codes to define sepsis (in adult and neonate populations) and 28 codes specific to organ dysfunction for severe sepsis. The Canadian Revision which includes more detailed subcodes, however, remains true to the original ICD-10 implementation. The CIHI administrative data-coded definition, although using the enhanced capability of ICD-10, has not been validated. An accurate and validated ICD-coded case definition is important, as healthcare resource allocation and other healthcare delivery system decisions can be and have been determined from these data.9

We therefore examined the validity of the CIHI ICD-10-coded case definition in ICU and non-ICU settings, and determined if it could be improved to increase the accuracy of case capture for a diagnosis of sepsis.


Data sources and study population

This study used two databases. The first, the inpatient discharge abstract database (DAD), has detailed information including demographic, administrative and procedural data on inpatient hospital visits. Each inpatient visit record contains up to 50 ICD-10-CA diagnosis coding fields per hospital encounter, of which 25 are released to researchers. In prior research in acutely ill patient populations, including diagnoses of catheter-related blood-stream infections and postoperative sepsis, the minimum number of diagnostic coding fields needed to capture at least 90% of secondary diagnosis cases was 15 fields.27 The second, an ICU-specific clinical database (TRACER; details described elsewhere),28 contains ICU-specific clinical and demographic characteristics including APACHE (acute physiology and chronic health evaluation) II29 and SOFA30 (sequential organ failure assessment) scores. Medical charts were also reviewed. All data were linked using the Alberta personal health number, which is a unique lifetime identifier.

Our study population comprised two separate validation cohorts. The first cohort included all adult patients (aged 18 years and older) admitted to an ICU in one of three hospitals in the Calgary region in Alberta, Canada, between 1 January 2009 and 31 December 2012. All three hospitals contain a multisystem ICU with general medicosurgical ICU care; Foothills Medical Centre (FMC) includes a regional specialty programme of burns, trauma surgery, neurosciences, thoracic surgery and transplant surgery (renal, pancreas, bone marrow); Peter Lougheed Centre includes a vascular surgery programme and the Rockyview General Hospital includes the regional urological and ENT programme. The second cohort included a random selection of all non-ICU, or general medical and surgical inpatient medical records from the FMC in Alberta, Canada, between 1 January 2009 and 31 December 2012.

Defining sepsis in medical chart and data abstraction

Sepsis was defined in the medical record review using a checklist criteria tool (table 1) developed based on the ACCP/SCCM 2001 Consensus Conference updated definitions8 and consensus of clinical experts. The tool was tested through a consensus review completed by two independent physicians, one trained in intensive care medicine and the other in surgery (BGY and DJR). Each physician was given the same 10 randomly selected health records, with health record coding masked, and using the tool, determined if sepsis was present or absent for each case. If sepsis was present, the classification of severity (sepsis, severe sepsis and septic shock) was indicated. These results were compared and discussed to ensure full consensus. A full consensus agreement (κ statistic=1.00) occurred after the first round of 20 medical charts, validating the tool for use in the subsequent part of the study.

Table 1

Diagnostic criteria used to determine a diagnosis of sepsis, severe sepsis or septic shock

Four chart reviewers underwent data abstraction training with two of the principal investigators (CJD and HQ) using the above-described checklist criteria tool. An initial consensus chart review was performed with each reviewer independently reviewing the same 20 charts. The inter-rater agreement among all four reviewers was calculated using the κ statistic. This was done until the strength of agreement achieved among all four reviewers was near perfect (κ statistic between 0.81 and 1.00).31 Two rounds of review were performed; the κ score was calculated after each round until full consensus was reached; any remaining discrepancies were discussed and resolved through a third-party expert reviewer (CJD). Following the consensus review, data abstraction was completed independently. Cases with uncertainty were discussed to ensure consistency among all reviewers, and any major unresolved cases were brought to a third-party critical care physician (CJD) for resolution.
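The agreement check described above can be illustrated with a minimal sketch of Cohen's κ for one pair of reviewers. The source does not state which κ variant was used for the four reviewers, so the pairwise form is an assumption, and the ratings below are hypothetical, not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who classified the same charts."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of charts")
    n = len(rater_a)
    # Observed agreement: share of charts given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for 20 charts (1 = sepsis present, 0 = absent).
reviewer_1 = [1] * 12 + [0] * 8
reviewer_2 = [1] * 11 + [0] * 9  # disagrees with reviewer_1 on one chart
kappa = cohen_kappa(reviewer_1, reviewer_2)
near_perfect = 0.81 <= kappa <= 1.00  # threshold band quoted in the text
```

With these made-up ratings the pair disagrees on a single chart, which still falls in the near-perfect band used as the stopping rule.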

Defining sepsis in ICD administrative data

Administrative data from the DAD were obtained for each patient corresponding to the specified inpatient visit during the study period. Using the DAD, sepsis was defined as per CIHI's 2009 report26 by searching through any 1 of the 25 diagnosis coding fields for any of the codes listed in table 2. Any neonate and paediatric-specific codes from the original definition remained in the algorithm, although we limited our study population to adults. Severe sepsis was indicated by the combination of a code of sepsis and at least one organ dysfunction code.
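As a rough sketch of this rule, the function below flags an abstract as sepsis when any of the first 25 diagnosis fields holds a sepsis code, and as severe sepsis when an organ dysfunction code is also present. The two code sets are illustrative placeholders and do not reproduce table 2; exact (rather than prefix) code matching and the function and variable names are also assumptions.

```python
# Placeholder code sets for illustration only; NOT the CIHI lists in table 2.
SEPSIS_CODES = {"A400", "A419"}
ORGAN_DYSFUNCTION_CODES = {"J960", "N179"}


def normalise(code):
    """Strip the decimal point and whitespace so 'A41.9' matches 'A419'."""
    return code.replace(".", "").strip().upper()


def classify_abstract(diagnosis_fields,
                      sepsis_codes=SEPSIS_CODES,
                      organ_codes=ORGAN_DYSFUNCTION_CODES,
                      max_fields=25):
    """Apply the case definition to one discharge abstract.

    diagnosis_fields: ordered list of ICD-10-CA codes (None for empty fields).
    Only the first max_fields fields are searched, mirroring the 25 fields
    released to researchers.
    """
    codes = {normalise(c) for c in diagnosis_fields[:max_fields] if c}
    has_sepsis = bool(codes & sepsis_codes)
    has_organ_dysfunction = bool(codes & organ_codes)
    return {
        "sepsis": has_sepsis,
        # Severe sepsis needs a sepsis code AND at least one organ dysfunction code.
        "severe_sepsis": has_sepsis and has_organ_dysfunction,
    }


example = classify_abstract(["J189", "A41.9", None, "N179"])
```

An organ dysfunction code on its own never produces a severe sepsis flag in this sketch, which follows the combination rule stated in the text.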

Table 2

ICD-10-CA codes used to define sepsis and severe sepsis in administrative data by ICD-coded case definition

After the primary analysis, we revised the CIHI ICD-10-CA-coded case definition for sepsis informed by a systematic review of the existing literature.32 We examined ICD-10-CA codes to determine if codes, which may indicate sepsis, were missing and should be included in the new definition based on clinical knowledge of the resulting diagnosis (see table 3 for a list of all ICD codes used with description). As well, we determined the codes in the primary diagnostic coding position that had a high frequency in the false-negative population. We performed an additive analysis in which each possible new code was added individually to the original CIHI definition (see online supplementary table S1), as well as the inverse in which all new codes were included in the original definition, with the removal of each individually to determine the changes in accuracy until the most optimal values of sensitivity (Sn), specificity (Sp), positive predictive value (PPV) and negative predictive value (NPV) were achieved.
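The additive step can be sketched as a loop that adds one candidate code at a time to the base definition and recomputes Sn and Sp against the chart review. The records, the one-code base set and the candidate list below are toy values chosen for illustration; the study's actual candidates are in its supplementary table S1, and the function names are hypothetical.

```python
def sn_sp(records, code_set):
    """Sensitivity and specificity of a code set against the reference standard.

    records: iterable of (codes_on_abstract, sepsis_on_chart_review) pairs.
    """
    tp = fp = fn = tn = 0
    for codes, has_sepsis in records:
        flagged = bool(set(codes) & code_set)
        if flagged and has_sepsis:
            tp += 1
        elif flagged:
            fp += 1
        elif has_sepsis:
            fn += 1
        else:
            tn += 1
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tn / (tn + fp) if tn + fp else float("nan")
    return sn, sp


def additive_analysis(records, base_codes, candidate_codes):
    """Add each candidate code individually to the base definition."""
    return {code: sn_sp(records, base_codes | {code}) for code in candidate_codes}


# Toy data: (codes on the abstract, sepsis present on chart review).
toy_records = [
    (["A419"], True),
    (["J189"], True),
    (["R572", "J189"], True),
    (["J189"], False),
    (["I10"], False),
    (["I10"], False),
    (["K359"], True),
]
base = {"A419"}  # stand-in for the original definition
baseline = sn_sp(toy_records, base)
results = additive_analysis(toy_records, base, ["J189", "R572", "I10"])
```

In the toy data, adding J189 raises Sn at some cost to Sp, adding R572 raises Sn with no loss of Sp, and adding I10 only lowers Sp; the same trade-off is what the additive and subtractive passes weigh.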

Table 3

ICD-10-CA codes and descriptions

Statistical analysis

A sample size calculation estimated that 409 charts were required using an estimated prevalence of 19%,33 at a significance level of 5% and 99% confidence. In order to gain a representative sample of the population, a random sample of 1001 patients was selected spread across the three tertiary care hospitals. Descriptive statistics were used to describe the study populations acquired by each ICD-coded case definition. The Charlson comorbidity score was calculated using previously described methods.34 Sn, Sp, PPV, NPV and their 95% CIs for the CIHI and optimised coding algorithm were calculated. Sn was calculated as the proportion of cases classified as positive by both the administrative data (DAD) and medical record review or ‘true positives’ (TP) compared with all cases positive by the reference standard (medical record review). Sp was calculated as the proportion of cases without sepsis identified by both the DAD and medical record review, or ‘true negatives’ (TN), compared with all cases negative by the reference standard. PPV was calculated as the proportion of TP cases of sepsis compared with all the cases identified as sepsis by the DAD. NPV was calculated as the proportion of cases without sepsis (TN) compared with all the cases identified as not sepsis by the DAD. All statistical analyses were performed using STATA V.12 (Stata Corp., College Station, Texas, USA).35
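The four measures defined above reduce to ratios over a 2×2 table, sketched below. The source does not say how the 95% CIs were computed, so the Wilson score interval used here is an assumption, and the counts are hypothetical.

```python
import math


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a proportion (assumed CI method, see lead-in)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def validity_measures(tp, fp, fn, tn):
    """Return {measure: (estimate, ci_low, ci_high)} from 2x2 table counts.

    tp/fn: reference-positive cases flagged / not flagged by the DAD codes.
    fp/tn: reference-negative cases flagged / not flagged by the DAD codes.
    """
    pairs = {
        "Sn": (tp, tp + fn),    # TP over all reference-standard positives
        "Sp": (tn, tn + fp),    # TN over all reference-standard negatives
        "PPV": (tp, tp + fp),   # TP over all cases flagged by the DAD
        "NPV": (tn, tn + fn),   # TN over all cases not flagged by the DAD
    }
    return {name: (k / n, *wilson_ci(k, n)) for name, (k, n) in pairs.items()}


# Hypothetical counts, not study data.
measures = validity_measures(tp=80, fp=10, fn=20, tn=90)
```

Sn and Sp are conditioned on the chart review result, whereas PPV and NPV are conditioned on the coded result, which is why the latter two shift with the prevalence of sepsis in the sample.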


Patient characteristics for reference standard diagnosis

A total of 1001 patients admitted to the ICU were included and linked to the DAD and TRACER databases. Of these, 604 patients were classified as sepsis (86 (14.2%) with sepsis, 203 (33.6%) with severe sepsis, 315 (52.2%) with septic shock) and 397 were classified as not sepsis. Of the sepsis patients included in the study, 59.3% were men, their median age was 61 years, 76.5% were admitted through the emergency department (ED), and 44.9% had two or more Charlson comorbidities (table 4). The mean APACHE II score within the first 24 h of admission was 20.8, and the admission SOFA score was 6.6. Median hospital length of stay (LOS) was 19 days, and median ICU LOS was 5.8 days. ICU mortality was 17.1% and hospital mortality was 24.0%.

Table 4

Patient clinical characteristics and demographics of the study population by ICD-coded algorithm and reference standard definition (n=1001)

Patient characteristics for the CIHI and optimised algorithm

There were 285 cases of sepsis identified by the CIHI algorithm, and 257 cases of severe sepsis. The optimised ICD-coded case definition increased the number of cases of sepsis identified by 207 (n=492), and of severe sepsis by 138 (n=395). The optimised definition had similar cohort characteristics in the sepsis and severe sepsis populations compared with the CIHI definition; however, patients identified by the CIHI definition had higher mean APACHE II scores for both sepsis (22.9 vs 20.9) and severe sepsis (23.6 vs 22.4), and higher admission SOFA scores for sepsis (7.5 vs 6.6) and severe sepsis (7.7 vs 7.0) (see table 4). Median ICU LOS was higher in the patients identified with the optimised severe sepsis ICD-coded case definition at 6.3 vs 5.9 days in the CIHI definition, while overall hospital LOS was similar among each cohort. ICU mortality was 6.6% higher in patients with sepsis, and 4.4% higher in patients with severe sepsis classified based on the CIHI coding definition. Hospital mortality was 7.0% higher in patients with sepsis, and 4.2% higher in patients with severe sepsis identified by the CIHI coding definition.

Performance of ICD-coded case definitions for sepsis classification in ICU patients

The results of the performance of each ICD-coded case definition are shown in table 5. The CIHI ICD-10-CA definition had a moderate Sn of 46.4% and NPV of 54.7%, but was highly specific (98.7%) with a PPV of 98.2%. The severe sepsis CIHI ICD-10-CA definition had Sn of 47.2%, NPV of 63.2%, Sp of 97.5% and PPV of 95.3%. With the optimised coding algorithm for sepsis, Sn increased significantly, by 25.5 percentage points to 71.9%, and NPV increased to 66.6%, while Sp and PPV decreased to 85.4% and 88.2%, respectively. For the severe sepsis optimised coding algorithm, the same trend was noted: Sn increased by approximately 18 percentage points to 65.1% and NPV increased to 70.1%, while Sp and PPV decreased to 88.2% and 85.6%, respectively.
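The reported sepsis figures can be cross-checked with a short back-calculation. The sketch below rebuilds the 2×2 table from numbers stated in the text (1001 patients, 604 with sepsis on chart review, 285 and 492 cases flagged, PPV of 98.2% and 88.2%); the cell counts it produces are reconstructions, not values published in the article, and the function names are hypothetical.

```python
def reconstruct_table(n_total, n_reference_positive, n_flagged, reported_ppv):
    """Back-calculate (tp, fp, fn, tn) from totals and a reported PPV."""
    tp = round(n_flagged * reported_ppv)  # nearest whole number of true positives
    fp = n_flagged - tp
    fn = n_reference_positive - tp
    tn = n_total - tp - fp - fn
    return tp, fp, fn, tn


def as_percentages(tp, fp, fn, tn):
    """Sn, Sp, PPV and NPV as percentages rounded to one decimal place."""
    return {
        "Sn": round(100 * tp / (tp + fn), 1),
        "Sp": round(100 * tn / (tn + fp), 1),
        "PPV": round(100 * tp / (tp + fp), 1),
        "NPV": round(100 * tn / (tn + fn), 1),
    }


# Inputs taken from the text; the resulting cell counts are reconstructed.
cihi = as_percentages(*reconstruct_table(1001, 604, 285, 0.982))
optimised = as_percentages(*reconstruct_table(1001, 604, 492, 0.882))
sn_gain = round(optimised["Sn"] - cihi["Sn"], 1)  # percentage points
```

Under these reconstructed counts, all four measures for both sepsis definitions round to the published values, and the Sn gain comes out at 25.5 percentage points.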

Table 5

Validity by administrative data definition/coding algorithm

Performance of ICD-coded case definitions for sepsis classification in non-ICU patients

A total of 202 non-ICU patient medical records were included and linked to the DAD. For the non-ICU population, the CIHI-coded case definition for a diagnosis of sepsis had an extremely low Sn of 6.7%, and for severe sepsis it was slightly higher with Sn of 25%; however, both were highly specific at 100% and had high PPV and NPV (table 5). The optimised ICD-coded case definition improved the Sn for sepsis cases to 60%, while the Sn remained the same for severe sepsis at 25%. In both cases, however, the PPV decreased substantially, to 52.6% for sepsis and 50% for severe sepsis.


This study examined the validity of an optimised ICD-10-CA-coded case definition to identify sepsis and severe sepsis in an inpatient administrative database. We identified ICD codes that optimised the performance of the coded definitions, and our data show the new, optimised ICD-10-CA-coded definitions with added codes achieve a higher validity than the existing CIHI definition. We increased the Sn by over 25% in the ICU population without losing much Sp by including codes for pneumonia (J189), enterocolitis due to Clostridium difficile (A047), chronic obstructive pulmonary disease with acute lower respiratory infection (J440), other Streptococcus as the cause of diseases classified elsewhere (B9548), Staphylococcus aureus as the cause of diseases classified elsewhere (B956) and Escherichia coli as the cause of diseases classified elsewhere (B962). The code for septic shock (R572) was missing from the original CIHI definition, and was also included in the new definition.

When sepsis is identified and coded, it is relatively accurate, as determined by the moderate to excellent Sp and high PPV in our results. This optimised ICD-based case definition, although capturing more cases, is still only moderately sensitive suggesting that sepsis is undercoded in administrative data. Our ICD case definition has Sn of 71.9%, similar to that of other hospital-acquired infections internationally,36 and for non-communicable diseases, such as hypertension37 and diabetes,38 in Canadian data. The low NPV achieved by our definition for both sepsis and severe sepsis codes may be related to the high prevalence of sepsis in ICU patients.39 In patients admitted to non-ICU settings, sepsis may not be detected well at any point during their hospital stay, as shown in our analysis of non-ICU patients. Although some studies have suggested that patients with severe sepsis are commonly admitted to non-ICU settings,40 ,41 these studies have sometimes been based on administration of antibiotics in the ED as the criterion for suspected infection, or case identification using anecdotal screening rather than a developed objective instrument. In our anecdotal experience, most patients with an estimated mortality rate of 20% or higher at the time of hospital admission usually receive treatment in an ICU setting. Approximately 80% of our patients were admitted to the ICU directly from the ED, whereas, the remaining patients were admitted from another hospital ward. It may be that severe sepsis is not highly prevalent in non-ICU settings, or it may be that coding for sepsis in non-ICU settings is often missed. Although our sample size in the non-ICU patients was smaller than the ICU, our results did demonstrate a high Sp and NPV indicating that when sepsis was coded as not present in the non-ICU population, it was accurate.
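The link between prevalence and NPV suggested above follows from Bayes' rule and can be sketched numerically. The Sn (71.9%), Sp (85.4%) and ICU prevalence (60.4%) are taken from the text; the 10% comparison prevalence is a hypothetical value chosen only to show the direction of the effect.

```python
def predictive_values(sn, sp, prevalence):
    """PPV and NPV implied by Sn, Sp and prevalence (all as fractions)."""
    tp = sn * prevalence              # true-positive share of the population
    fn = (1 - sn) * prevalence        # false-negative share
    tn = sp * (1 - prevalence)        # true-negative share
    fp = (1 - sp) * (1 - prevalence)  # false-positive share
    return tp / (tp + fp), tn / (tn + fn)


# Optimised sepsis definition at the ICU prevalence reported in the text.
ppv_icu, npv_icu = predictive_values(sn=0.719, sp=0.854, prevalence=0.604)

# Same Sn and Sp at a hypothetical 10% prevalence, for comparison only.
ppv_low, npv_low = predictive_values(sn=0.719, sp=0.854, prevalence=0.10)
```

At the ICU prevalence the implied NPV is about 0.67, in line with the reported 66.6%; holding Sn and Sp fixed, a lower prevalence pushes NPV up and PPV down.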

Undercoding may have important implications if used for surveillance of sepsis, or planning of resources, and allocation of services. Other conditions have also been found to be grossly undercoded, resulting in inaccurate assessments of prevalence, and thereby contributing to inadequate allocation of resources for monitoring and appropriate treatment.42 For sepsis survivors, it is important to have an accurate way of capturing these patients for future planning as they are at a high risk for long-term neurocognitive and physical conditions.43–45 Further, these coding definitions could be used for quality assessment surveillance monitoring studies, for example, to document the rapidity of administration of antibiotics.

The undercoding of sepsis could be due to a variety of other reasons including physician documentation in the medical record. Healthcare coders may not identify a diagnosis of sepsis based on the physician's documentation alone. Physicians may not explicitly state the term ‘sepsis’ within the medical chart; instead, they may use terms such as ‘SIRS’ or ‘shock’, or identify only the infection present. Rothberg et al46 suggest that patients may be diagnosed with respiratory failure having the symptoms of pneumonia, and/or criteria of sepsis without identifying the specific condition or sepsis. As well, selective undercoding of a milder form of sepsis may occur, as coders may intentionally disregard coding sepsis if there are other more resource-intensive and very apparent diagnoses present. For example, when an acute but mild case of sepsis resolves quickly but the patient has an extended hospital stay for another reason complicating the episode, sepsis may be missed as contributing to the hospital stay.47 Although new clinical definitions for sepsis have been developed, and/or may be developed in the future, how these definitions are applied in research involving administrative data is uncertain. Definitions that rely on specific laboratory results, such as pro-calcitonin levels, may not be captured by healthcare coders unfamiliar with the implications of these diagnostic results.

Other studies that have examined the definition of sepsis in administrative data have also identified variations in reporting. Gaieski et al48 examined four previously published methods of capturing cases of severe sepsis in administrative data using ICD-9 codes including the well known ‘Angus’ and ‘Martin’ implementations, and compared the incidence and mortality over a 6-year period. They identified up to a 3.5-fold variation among four sepsis case definitions in incidence, with a number of cases ranging from 894 013 to 3 110 630, and mortality ranging from 14.7% to 29.9% depending on the ICD-9 definition used. Iwashyna et al49 validated the ICD-9 coding definitions for the Angus and Martin implementations and found these to have low sensitivities when identifying severe sepsis using administrative data. These studies along with our results suggest the need for linkages of administrative to other types of data, such as pharmacy data (eg, antibiotics or inotropic use), to enhance the ascertainment of sepsis for surveillance purposes.


There are several limitations to this study. First, we defined our reference standard using medical record data extracted by reviewers to assess the validity of the ICD-10-CA data. Misclassification of sepsis within the chart review may have occurred; however, we used a comprehensive process for training and validation to mitigate this possibility. Second, the ICU patient population was selected from tertiary care centres in a large metropolitan area, which may limit the generalisability of case capture to data coming from smaller community hospitals. Third, we could not validate the optimised algorithm on a different patient sample because further medical record review was not feasible, which may also affect the generalisability of the case capture. However, we believe that based on the representativeness of the original sample, the optimised definition would still have performed better than the original CIHI case definition. We would encourage other investigators to examine the performance of our reported algorithm in other data sets.


This study validated and optimised ICD-10-CA-coded case definitions for the identification of sepsis and severe sepsis in administrative data. We revised these ICD-coded definitions and optimised the performance, improving the Sn, with a small decrease in Sp and PPV. Sepsis, regardless of severity level, is undercoded, but with the improved Sn and high NPV, these definitions can be used for better defining cohorts of patients with sepsis. Further studies are needed to determine if an ICD-coded case definition for sepsis in administrative data in combination with other data can maximise both the Sn and Sp to improve diagnostic accuracy.


The authors acknowledge the Alberta Sepsis Network Research grant provided by Alberta Innovates: Health Solutions (AIHS) for funding this research.


Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.


  • Twitter Follow Derek Roberts at @DerekRoberts01

  • Contributors RJJ, CJD, HQ and NJ developed the research question and study methods and performed data analysis. Data collection was performed by RJJ, CJD, KJS, LD, JG, DJR and BGY. All authors contributed meaningfully to the drafting and editing, and all approved the final manuscript. RJJ and CJD are responsible for the integrity of the data.

  • Funding The work is funded by an Alberta Sepsis Network grant from Alberta Innovates: Health Solutions (AIHS). NJ holds a Canada Research Chair in Neurological Health Services Research from the Canadian Institutes of Health Research and a Population Health Investigator Award from AIHS. HQ is funded by a Population Health Scholar Award from AIHS. BGY holds a Canada Research Chair in pulmonary Immunology, Inflammation and Host Defense.

  • Competing interests None declared.

  • Ethics approval Conjoint Health Research Ethics Board at the University of Calgary.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.
