

Identification of acute myocardial infarction from electronic healthcare records using different disease coding systems: a validation study in three European countries
  1. Preciosa M Coloma1,
  2. Vera E Valkhoff1,2,
  3. Giampiero Mazzaglia3,
  4. Malene Schou Nielsson4,
  5. Lars Pedersen4,
  6. Mariam Molokhia5,
  7. Mees Mosseveld1,
  8. Paolo Morabito6,
  9. Martijn J Schuemie1,
  10. Johan van der Lei1,
  11. Miriam Sturkenboom1,7,
  12. Gianluca Trifirò1,6,
  13. on behalf of the EU-ADR Consortium
  1. 1Department of Medical Informatics, Erasmus MC University Medical Center, Rotterdam, The Netherlands
  2. 2Department of Gastroenterology and Hepatology, Erasmus MC University Medical Center, Rotterdam, The Netherlands
  3. 3Department of Research, Health Search, Italian College of General Practitioners, Florence, Italy
  4. 4Department of Clinical Epidemiology, Aarhus University Hospital, Aarhus, Denmark
  5. 5Primary Care and Population Sciences, Kings College, London, UK
  6. 6Department of Clinical and Experimental Medicine and Pharmacology, University of Messina, Messina, Italy
  7. 7Department of Epidemiology, Erasmus MC University Medical Center, Rotterdam, The Netherlands
  1. Correspondence to Dr Preciosa M Coloma; p.coloma{at}


Objective To evaluate positive predictive value (PPV) of different disease codes and free text in identifying acute myocardial infarction (AMI) from electronic healthcare records (EHRs).

Design Validation study of cases of AMI identified from general practitioner records and hospital discharge diagnoses using free text and codes from the International Classification of Primary Care (ICPC), International Classification of Diseases 9th revision-clinical modification (ICD9-CM) and ICD-10th revision (ICD-10).

Setting Population-based databases comprising routinely collected data from primary care in Italy and the Netherlands and from secondary care in Denmark from 1996 to 2009.

Participants A total of 4 034 232 individuals with 22 428 883 person-years of follow-up contributed to the data, from which 42 774 potential AMI cases were identified. A random sample of 800 cases was subsequently obtained for validation.

Main outcome measures PPVs were calculated overall and for each code/free text. ‘Best-case scenario’ and ‘worst-case scenario’ PPVs were calculated, the latter taking into account non-retrievable/non-assessable cases. We further assessed the effects of AMI misclassification on estimates of risk during drug exposure.

Results Records of 748 cases (93.5% of sample) were retrieved. ICD-10 codes had a ‘best-case scenario’ PPV of 100% while ICD9-CM codes had a PPV of 96.6% (95% CI 93.2% to 99.9%). ICPC codes had a ‘best-case scenario’ PPV of 75% (95% CI 67.4% to 82.6%) and free text had PPV ranging from 20% to 60%. Corresponding PPVs in the ‘worst-case scenario’ all decreased. Use of codes with lower PPV generally resulted in small changes in AMI risk during drug exposure, but codes with higher PPV resulted in attenuation of risk for positive associations.

Conclusions ICD9-CM and ICD-10 codes have good PPV in identifying AMI from EHRs; strategies are necessary to further optimise utility of ICPC codes and free-text search. Use of specific AMI disease codes in estimation of risk during drug exposure may lead to small but significant changes, at the expense of decreased precision.


Article summary

Article focus

  • This article evaluates the positive predictive value (PPV) of different disease codes and free-text search in identifying cases of acute myocardial infarction (AMI) from population-based healthcare databases in three countries in Europe.

Key messages

  • The overall PPV of different disease coding systems for identifying AMI was good, with ‘best-case scenario’ PPVs ranging from 75% (International Classification of Primary Care (ICPC), Netherlands) to 95% (International Classification of Diseases 9th revision-Clinical Modification (ICD9-CM), Italy) to 100% (ICD-10th revision (ICD-10), Denmark). These findings are consistent with PPV estimates for ICD9-CM and ICD-10 cited in the literature. To date, no study has described the PPV of ICPC codes for identifying AMI.

  • Use of free text alone had a lower PPV, with ‘best-case scenario’ PPVs ranging from 20% to 60%. Strategies are necessary to optimise use of natural language processing in the identification of AMI in these electronic healthcare record (EHR) data.

  • Misclassification of AMI cases resulting from the use of disease codes (or free text) with low PPV has corresponding implications in the estimation of incidence rates. Studies using EHR data to derive incidence rates of clinical events should thus correct for this potential misclassification.

  • Use of more specific disease codes for identifying AMI during drug use may lead to a small but significant change in risk estimates, at the expense of decreased precision. Further studies are warranted to investigate the effect of different PPVs on outcome misclassification; these should take into account the type of database, test more drug-event associations and control for other confounders.

Strengths and limitations of this study

  • Large healthcare databases covering a total population of over four million from three countries were investigated, a formidable challenge in itself because of the diversity in healthcare and disease coding practices. The implementation of a standardised validation questionnaire facilitated harmonised data collection and analysis across databases without compromising data protection. The opportunity to simultaneously evaluate different disease coding systems as well as free text also allowed the investigation of the effect of outcome misclassification.

  • This study evaluated the accuracy of the codes using the PPV; however, there are other measures such as sensitivity and negative predictive value that could not be calculated.

  • Despite the reasonable size of the random sample used in this validation study, it was not adequate to permit evaluation of some of the individual, less frequently occurring, codes.


Cardiovascular diseases remain an important cause of morbidity and mortality worldwide and the conduct of disease surveillance has changed with the availability of secondary data sources as well as changes in disease coding terminologies. Information derived from multicountry databases containing electronic healthcare records (EHRs) is increasingly being used for drug safety surveillance, including drug-related adverse cardiovascular outcomes.1–3 Cases of acute myocardial infarction (AMI) may be identified using electronic databases from different countries, which may differ not only in their healthcare systems, but also in their disease registration and coding procedures. Innovations in recent years have brought about discovery and subsequent clinical use of biomarkers that allow earlier recognition of disease as well as therapeutic interventions that reduce the extent of myocardial injury and mortality. Such developments have led to revisions in the definition of AMI and changes in diagnosis and prognosis.4–6 Studies that estimate AMI incidence from EHR data must also then consider the implications of new diagnostic criteria on the disease coding practices of such databases.7–10

The accuracy of specific disease coding terminologies in identifying AMI from healthcare data has been evaluated in previous studies. These studies, mostly performed on data representing administrative/insurance claims, have derived positive predictive values (PPVs) of International Classification of Diseases-9th revision-Clinical Modification (ICD9-CM) codes as well as diagnosis-related group codes used in billing.11–14 A recent study evaluated the PPV of ICD-10th revision (ICD-10) diagnostic codes used to assess Charlson comorbidity index conditions, including myocardial infarction, in the population-based Danish National Registry of Patients.15 To date, no study has evaluated the validity of International Classification of Primary Care (ICPC) codes and unstructured (free text) search, or the combination of diagnosis codes and free-text search, in the identification of AMI from electronic healthcare data. Furthermore, the opportunity to simultaneously evaluate different disease coding systems as well as free text permits investigation of the effect of outcome misclassification.

We conducted a validation study within the context of the EU-ADR Project (Exploring and Understanding Adverse Drug Reactions by Integrative Mining of Clinical Records and Biomedical Knowledge). Funded by the European Commission under its Seventh Framework Programme, the EU-ADR Project has designed and developed a computerised integrative system that exploits EHR data from different countries (as well as biomedical data) to facilitate early detection of adverse drug reactions.3 Databases contributing EHR data to the Project are part of the EU-ADR network and represent a huge resource for monitoring of drug safety in Europe. In this validation study, we evaluated and compared the PPV of free-text search and of disease codes from three different terminologies (ICPC, ICD9-CM and ICD-10) in identifying cases of AMI from population-based healthcare databases in Denmark, Italy and the Netherlands. We further assessed the effect of outcome misclassification on the estimation of risk of AMI during drug use.


Data sources

The EU-ADR database network currently comprises anonymised demographic and clinical data of over 20 million individuals from eight population-based EHR databases in three European countries.3 The data are pooled using a distributed network approach that allows data holders to maintain control over their protected data. Validation of AMI case identification was performed in three of these databases: (1) Integrated Primary Care Information (IPCI, the Netherlands); (2) Health Search/CSD Patient DB (HSD, Italy) and (3) Aarhus University Hospital Database (Aarhus, Denmark). IPCI and HSD are both general practice (GP) databases documenting patient consults, including referrals for hospitalisation or specialist care as well as prescriptions for medications. Aarhus is a comprehensive record-linkage database system in which drug dispensation data are linked to a registry of hospital discharge diagnoses and various other registries, including death registries. All these databases have been extensively used for epidemiological research.16–18 A more detailed description of the characteristics of the databases has been previously published.3 ,19 A table of database characteristics of the entire EU-ADR network is provided in online supplementary appendix 1. The three databases employ different disease coding terminologies: IPCI uses ICPC; HSD uses ICD9-CM and Aarhus uses ICD-10. Clinical narratives from general practitioners’ notes in both HSD and IPCI are also recorded as unstructured text that can be used to identify medical events. Standardised data extraction was carried out using the Java-based software Jerboa, developed within the EU-ADR Project.3

Cohort definition and follow-up time

To harmonise follow-up definitions across databases, we defined the eligibility period for each patient as starting on the date of registration in the database and ending on the date the patient transferred out of the system, the date of the last supply of data, the date of occurrence of AMI (as described below) or the date of the patient's death, whichever was earliest. In order to be included in the study cohort, participants had to have at least 1 year of continuous and valid data.
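The eligibility rule above can be illustrated with a minimal sketch. The function and field names are hypothetical and are not taken from the Jerboa software, and the 1-year requirement is interpreted here as a window of at least 365 days, which is only one possible reading of the text.

```python
from datetime import date
from typing import Optional, Tuple


def follow_up_window(
    registration: date,
    transfer_out: Optional[date],
    last_data_supply: Optional[date],
    first_ami: Optional[date],
    death: Optional[date],
) -> Tuple[date, date]:
    """Return (start, end) of the eligibility period for one patient.

    The period ends at the earliest of the available end dates.
    """
    end_dates = [d for d in (transfer_out, last_data_supply, first_ami, death)
                 if d is not None]
    if not end_dates:
        raise ValueError("at least one end date is required")
    return registration, min(end_dates)


def has_minimum_data(start: date, end: date, min_days: int = 365) -> bool:
    """Assumed reading of 'at least 1 year of continuous and valid data'."""
    return (end - start).days >= min_days


# Hypothetical patient: registered in 2000, first AMI in mid-2004.
start, end = follow_up_window(
    registration=date(2000, 1, 1),
    transfer_out=None,
    last_data_supply=date(2009, 12, 31),
    first_ami=date(2004, 6, 15),
    death=None,
)
```

In this hypothetical example follow-up is censored at the first AMI, because it precedes the last supply of data.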

Identification of AMI

Potential cases of AMI were initially identified using harmonised and database-specific codes derived from hospital discharge diagnoses (in the case of Aarhus) or from general practitioner diagnoses (in the case of IPCI and HSD). These codes included the ICPC code K75 (IPCI), ICD9-CM codes 410/410.x/410.x0 (HSD), and the ICD-10 codes I21.x (Aarhus). IPCI and HSD also performed free-text search using specific key words. The ICD9-CM code 411.81 (corresponding to acute coronary occlusion) was specifically used in HSD, in combination with free text to refine the search. The free-text search strings employed in IPCI and HSD are given in online supplementary appendix 2. The process of mapping and harmonisation of event data extraction from different EHR databases in the EU-ADR project was based on medical concepts derived from the Unified Medical Language System.20 ,21 We only considered the first occurrence (first diagnosis) of AMI in each patient.

Case validation

Random sampling of cases for validation was carried out separately in each of the three databases using a specific module in the Jerboa software designed for this purpose. The module uses the standard random function in Java that generates random numbers. We required a sample size of 200 cases/database. Since the use of free-text search was known to be more extensive in IPCI, an additional 200 potential cases identified by free text were obtained in IPCI. A manual review of GP records and hospitalisation charts was performed by medically trained assessors using a standardised questionnaire, pilot-tested in the databases and reviewed by a panel of experts. Diagnostic criteria for AMI as prescribed in the current guidelines4 ,7 were incorporated in the questionnaire, as well as information regarding cardiovascular risk factors and potential alternative diagnoses that could explain findings suggestive of AMI. For GP databases (IPCI and HSD), it was also determined whether the AMI diagnosis was made either directly by the general practitioner or by a medical specialist. The standardised questionnaire was then implemented as a computerised data entry algorithm using the custom-built software Chameleon. This software was installed locally in each database, allowing the data holders to keep patient-level data within their protected environment. The data entry algorithm is shown in figure 1 and the questionnaire in online supplementary appendix 3. On the basis of the information collected in the questionnaire, the potential AMI cases were classified as: (1) definite case; (2) non-case or (3) non-assessable case, if the available information was deemed to be insufficient for the case validation.
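As a rough illustration of the sampling step, the sketch below draws a simple random sample of 200 potential cases without replacement. It is written in Python rather than Java and is not the Jerboa module itself; the function name, the seed argument and the case identifiers are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional


def sample_cases(case_ids: Iterable[int], n: int = 200,
                 seed: Optional[int] = None) -> List[int]:
    """Draw n potential cases at random, without replacement.

    If fewer than n potential cases exist, all of them are returned.
    """
    ids = list(case_ids)
    if len(ids) <= n:
        return ids
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(ids, n)


# Hypothetical database with 1000 potential AMI cases.
validation_sample = sample_cases(range(1000), n=200, seed=42)
```

Fixing the seed is a design choice for auditability: the same validation sample can be regenerated later inside the data holder's protected environment.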

Figure 1

Data entry algorithm implemented based on a standardised questionnaire.

Assessment of index date

We determined how the coded date of the event (which is detected automatically) was related to the actual date of diagnosis of AMI and to the date of onset of first symptoms, as derived from manual validation. In addition, for the administrative database Aarhus, we compared the coded date with the date of hospital admission related to the pertinent case.

Statistical analyses

  1. PPV and corresponding 95% CI were calculated overall in each database and specifically for each code or free-text search, using medical charts as the gold standard. PPV was calculated as the proportion of the number of confirmed AMI cases out of the total number of randomly selected potential cases. Non-assessable cases were not initially included in either the numerator or the denominator for the PPV calculation, under the assumption that these would not constitute significant bias. However, because the number of non-assessable and non-retrievable cases turned out to be unexpectedly high, we defined, a posteriori, two levels of PPV in order to account for the effects of both non-retrievable and non-assessable cases. We recalculated a ‘worst-case scenario’ PPV as the proportion of confirmed AMI cases out of the total number of randomly selected potential cases, this time including both non-retrievable and non-assessable cases. We retained as a ‘best-case scenario’ PPV the proportion of confirmed AMI cases out of the total number of cases that excluded both non-retrievable and non-assessable cases.

  2. Effect of outcome misclassification on AMI risk estimation during drug use. To investigate the impact of outcome misclassification on estimation of risk of drug-related AMI, we evaluated the association between drug exposure and risk of AMI in the entire population covered by the three databases (ie, not only the randomly selected cases). Drug prescription and/or dispensation data were used to estimate the incidence rate of AMI during drug exposure. Drug prescriptions and dispensations are locally coded in each database (see online supplementary appendix 1), but these codes are linked to the Anatomical Therapeutic Chemical (ATC) classification system, which is used as the common drug coding system in the EU-ADR network. Overlapping treatment episodes with the same drug (same ATC code) are combined into a single episode of drug use that starts when the first prescription begins and stops when the last prescription ends. When a patient uses more than one drug at a time, the corresponding person-time is labelled accordingly. Using individual data on the start date and end date of prescription or dispensation, those periods during which an individual is included in the study, but is not using any drug, are marked as unexposed. Events are then assigned to the episodes (ie, drug use/non-use) in which they occurred. The duration covered by each prescription or dispensation is estimated within each database, according to the legend duration (if dosing regimen is available), or is otherwise based on the defined daily dose.
We estimated the incidence rate of AMI during the current use of six reference drugs: three drugs well known from the literature to be positively associated with AMI (positive controls: rofecoxib, rosiglitazone and levonorgestrel/oestrogen); and three other drugs, unlikely to be associated with AMI, based on the currently available literature (negative controls: ferrous sulfate, gemfibrozil, amoxicillin/clavulanic acid).22 We employed the case definitions of AMI taking into account codes and free text with varying values of PPV: (1) ‘AMI’ included all eligible codes and free text to identify patients with AMI; (2) ‘AMI50’ included codes and free text having PPV ≥50% and (3) ‘AMI75’ included codes and free text having PPV of ≥75%. We calculated incidence rate ratios (IRRs) for AMI during drug exposure (with non-exposure to the specified drug as reference) for each of the six drugs, pooled across all the databases. A Mantel-Haenszel test was used to assess the differences between the incidence rates, corrected for age and sex.
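The construction of drug-use episodes described above can be sketched as follows. This is an illustrative reading rather than the Jerboa implementation: the function name and the placeholder codes `ATC_A` and `ATC_B` are hypothetical, and two prescriptions are treated as overlapping when one starts on or before the day the previous one ends.

```python
from datetime import date
from typing import Dict, Iterable, List, Tuple

Span = Tuple[date, date]


def merge_episodes(
    prescriptions: Iterable[Tuple[str, date, date]]
) -> Dict[str, List[Span]]:
    """Combine overlapping prescriptions of the same drug into episodes.

    Each input item is (atc_code, start, end). The result maps each ATC code
    to a list of non-overlapping (start, end) episodes in date order.
    """
    by_drug: Dict[str, List[Span]] = {}
    for atc, start, end in prescriptions:
        by_drug.setdefault(atc, []).append((start, end))

    merged: Dict[str, List[Span]] = {}
    for atc, spans in by_drug.items():
        spans.sort()
        episodes = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = episodes[-1]
            if start <= last_end:
                # Overlap: extend the current episode to the later end date.
                episodes[-1] = (last_start, max(last_end, end))
            else:
                episodes.append((start, end))
        merged[atc] = episodes
    return merged


# Hypothetical prescriptions for one patient.
rx = [
    ("ATC_A", date(2005, 1, 1), date(2005, 1, 30)),
    ("ATC_A", date(2005, 1, 20), date(2005, 2, 18)),  # overlaps the first
    ("ATC_A", date(2005, 6, 1), date(2005, 6, 30)),   # separate episode
    ("ATC_B", date(2005, 1, 10), date(2005, 1, 17)),
]
episodes = merge_episodes(rx)
```

Person-time inside the eligibility period that falls outside every episode of a given drug would then be labelled as unexposed to that drug.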


The three healthcare databases considered for this analysis comprised data from 4 034 232 individuals with 22 428 883 person-years of follow-up during the period 1996–2009. Within this population, a total of 42 774 potential cases of AMI were identified. From the random sample of 800 potential cases of AMI (200 cases/database plus an additional 200 cases for free text-identified cases in IPCI) selected for validation, the medical records/charts could be retrieved and reviewed for 748 (93.5%) of them. The hospital medical charts of 52 potential cases in Aarhus could not be accessed because no institutional agreement was in place to allow access to the medical charts. The demographic and clinical characteristics of the randomly selected 748 cases are shown in table 1. The mean age was 67 years across all the databases and the patients were predominantly men (62–70% overall). Chest pain at rest or with exertion was the most frequently reported symptom of AMI cases across all the databases (more than 50% of confirmed cases in both IPCI and Aarhus and 11% in HSD). Hypertension, cigarette smoking and dyslipidaemia were the most frequently recorded cardiovascular risk factors.

Table 1

Characteristics of patients in the random sample of potential AMI cases

All 148 potential cases of AMI identified in Aarhus were confirmed by manual chart review. As regards IPCI, 93 (46.5%) potential ICPC-coded cases were confirmed, 31 (15.5%) were judged as non-cases and 76 (38%) cases were judged as non-assessable. From the 200 potential cases identified by free-text search, 26 (13%) cases were confirmed and 68 (34%) were considered non-assessable, while the remaining (106, 53%) were classified as non-cases. For HSD, 115 (57.5%) cases were confirmed and 79 (39.5%) were declared non-assessable. Table 2 shows the ‘best-case scenario’ PPV and ‘worst-case scenario’ PPV overall for the codes used to identify AMI in each database. In table 3, the percentage distribution and PPV for the specific diagnosis codes from each coding scheme and free-text search are given. All the ICD-10 codes used in Aarhus had a ‘best-case scenario’ PPV of 100%. The PPVs in the ‘worst-case scenario’ all decreased, ranging from 66.7% (95% CI 1.3% to 100%) to 75.0% (95% CI 58.7% to 91.3%). Overall, the ICD9-CM codes had good PPV, with 410.9*, the most frequently reported code, having a ‘best-case scenario’ PPV of 96.9% (95% CI 93.5% to 100%) and a ‘worst-case scenario’ PPV of 59.9% (95% CI 50.1% to 69.6%). The ‘best-case scenario’ PPV of free-text search alone in HSD was 60% (95% CI 17.1% to 100%). In IPCI, the ICPC code K75 had a ‘best-case scenario’ PPV of 75% (95% CI 67.4% to 82.6%), while free-text search alone had a PPV of 19.7% (95% CI 12.9% to 26.5%). The ‘worst-case scenario’ PPVs were correspondingly lower. All validated cases of AMI in IPCI and HSD were supported with confirmation of the diagnosis by a medical specialist (ie, cardiologist).
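The two PPV definitions can be checked against the counts reported above with a short sketch. The CI method is not stated in the text; the sketch assumes a normal-approximation (Wald) interval truncated to the range 0 to 1, which appears to be consistent with the interval reported for ICPC code K75. Function names are hypothetical.

```python
import math
from typing import Tuple

Estimate = Tuple[float, float, float]  # (ppv, lower, upper)


def ppv_with_ci(confirmed: int, denominator: int, z: float = 1.96) -> Estimate:
    """PPV with an assumed Wald 95% CI, truncated to [0, 1]."""
    p = confirmed / denominator
    half_width = z * math.sqrt(p * (1 - p) / denominator)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def best_and_worst_case(confirmed: int, non_cases: int, non_assessable: int,
                        non_retrievable: int = 0) -> Tuple[Estimate, Estimate]:
    """Best case excludes non-assessable and non-retrievable cases from the
    denominator; worst case includes them."""
    best = ppv_with_ci(confirmed, confirmed + non_cases)
    worst = ppv_with_ci(
        confirmed, confirmed + non_cases + non_assessable + non_retrievable
    )
    return best, worst


# IPCI, ICPC-coded cases: 93 confirmed, 31 non-cases, 76 non-assessable.
icpc_best, icpc_worst = best_and_worst_case(93, 31, 76)

# Aarhus: 148 confirmed, no non-cases, 52 charts not retrievable.
aarhus_best, aarhus_worst = best_and_worst_case(148, 0, 0, non_retrievable=52)
```

With these inputs the ICPC best-case estimate is 93/124 and the worst-case estimate is 93/200, while the Aarhus estimate moves from 148/148 to 148/200.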

Table 2

Overall positive predictive value (PPV) for acute myocardial infarction (AMI) identification, according to database

Table 3

Number and distribution of confirmed AMI cases by diagnostic code or free text

The relationship between the coded date and the date of onset of symptoms across the three databases is shown in figure 2. There was not enough information for this assessment in 25 cases from Aarhus (16.9%). The lag time between coded date and date of symptom onset (as manually validated) ranged from 1 day before to more than 60 days before the automatically detected event date. The coded date for the majority of cases coincided with the onset of symptoms in all databases: Aarhus=72 cases (48.6%); HSD=110 cases (95.6%) and IPCI=67 cases (56.3%). For the administrative/claims database Aarhus, the characterisation of the coded index date with respect to hospitalisation is as follows: (1) date of hospital admission=100 cases (67.6%); (2) during hospital stay=9 cases (6.1%); (3) ≥7 days preceding hospitalisation=23 cases (15.5%) and (4) not possible to establish=16 cases (10.8%).

Figure 2

Differences in automatically recorded date of acute myocardial infarction (AMI; time 0) and manually validated date of onset of AMI symptoms across the databases.

Figure 3 shows the IRRs, adjusted for age and sex, for six drugs across the different PPV categories. In general, although the number of AMI cases identified using all eligible codes (‘AMI’) was greater compared with case definitions based on codes with ≥50% PPV or ≥75% PPV (ie, ‘AMI50’ and ‘AMI75’), there was only a small change in the resulting IRRs. The clear exception is the positive control drug rosiglitazone, for which the IRR of 2.44 (95% CI 1.62 to 3.67) with the ‘AMI’ definition decreased to 1.62 (95% CI 0.77 to 3.41) with ‘AMI75’, the risk then becoming non-significant; the IRR remained fairly stable at 1.64 (95% CI 0.78 to 3.45) with the ‘AMI50’ definition. The same trend was observed for rofecoxib and levonorgestrel/oestrogen: although the IRR changes corresponding to each definition were smaller compared with rosiglitazone, the risk disappeared with both the ‘AMI75’ and ‘AMI50’ definitions. For the negative controls (where the 95% CIs all included 1), the impact of using codes with different PPVs was less pronounced.
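For readers unfamiliar with stratified rate ratios, the sketch below shows one common way to pool an IRR over age and sex strata, the Mantel-Haenszel rate ratio for person-time data. The text does not specify the exact estimator used, so this choice is an assumption, and all counts and person-years in the example are invented for illustration.

```python
from typing import Iterable, Tuple

# (events_exposed, person_years_exposed, events_unexposed, person_years_unexposed)
Stratum = Tuple[float, float, float, float]


def mantel_haenszel_irr(strata: Iterable[Stratum]) -> float:
    """Mantel-Haenszel pooled incidence rate ratio across strata."""
    numerator = 0.0
    denominator = 0.0
    for a, t1, b, t0 in strata:
        total = t1 + t0
        numerator += a * t0 / total    # exposed events weighted by unexposed time
        denominator += b * t1 / total  # unexposed events weighted by exposed time
    return numerator / denominator


# Two hypothetical age/sex strata, each with a stratum-specific IRR of 2.
hypothetical_strata = [
    (10, 1000, 20, 4000),
    (30, 2000, 15, 2000),
]
pooled_irr = mantel_haenszel_irr(hypothetical_strata)
```

With a single stratum the estimator reduces to the crude ratio of the exposed and unexposed incidence rates.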

Figure 3

Impact of codes and free text with different ‘best-case scenario’ positive predictive values on age-adjusted and sex-adjusted incidence rate ratio estimates for acute myocardial infarction during drug exposure (non-exposure to the same drug as reference).


We examined PPV of primary hospital discharge diagnosis codes and general practitioner-recorded diagnoses for AMI in three European EHR databases. The overall ‘best-case scenario’ PPV for the coding scheme-based diagnoses was good, ranging from 75% (IPCI, ICPC coding) to 95% (HSD, ICD9-CM) to 100% (Aarhus, ICD-10). The use of free-text search was more extensive in IPCI compared with HSD, largely due to the lesser granularity of the ICPC coding system. The use of free text alone had a lower PPV, ranging from a ‘best-case scenario’ PPV of 20% in IPCI to 60% in HSD. Although 52 of the initially identified cases of AMI in Aarhus were missing and could not be validated, the inaccessibility of the corresponding medical charts was random and thus was deemed unlikely to introduce bias. However, to account for any potential bias introduced by these non-retrievable cases (as well as non-assessable cases), ‘worst-case scenario’ PPVs were calculated. The impact on the corresponding PPVs was high: for ICD-10 codes overall, PPV dropped to 74% while for ICD9-CM and ICPC codes PPV decreased to 60% and 46%, respectively. These findings reiterate the need for adequate case retrieval in outcome validation studies and, if necessary, to perform resampling and take into account the impact of missing cases in the analysis. More importantly, misclassification of AMI cases resulting from use of disease codes (or free text) with low PPV has analogous implications on the estimation of incidence rates. Studies using EHR data to derive incidence rates of clinical events should thus correct for this potential misclassification.

Routinely collected EHR data are increasingly being used in many areas of biomedical research and a recently identified promising area for EHRs is the proactive surveillance of potentially drug-induced outcomes. The validity of such surveillance activities depends, however, on the accuracy of the definitions of the outcomes being investigated, while at the same time preserving data confidentiality. The use of a standardised questionnaire implemented in an automated data entry validation algorithm facilitated harmonised data collection and analysis across different databases without compromising data protection. The procedure also enabled us to document recorded database information on cardiovascular risk factors such as a family history of coronary artery disease, hypertension, diabetes, dyslipidaemia, smoking and obesity. Such information may be useful in evaluating potential confounder effects when conducting epidemiological studies. Other information related to diagnostic procedures or interventions requiring hospitalisation (eg, coronary angiography) may not be consistently recorded in GP databases, unless provided with the discharge letters or referrals from specialists, hence the observed higher proportion of such information from reimbursement claims data (Aarhus). In the same way, the documentation of initiation of long-term pharmacotherapy for the management of AMI may not be as well documented in claims data as in GP data. It is important to note that information derived from GP databases are data recorded in the course of routine clinical care and provide a different perspective from those derived from databases documenting reimbursement claims for utilisation of healthcare services, which are more for auditing purposes.23 ,24

Since the context within which clinical events are recorded differs between GP databases and administrative/claims databases, there is often also an expected delay between onset of first symptoms (which are more likely to be documented in GP records) and diagnoses recorded upon hospital discharge (documented in reimbursement claims, and also in GP data if referral letters from a cardiologist are available). Our evaluation of the automatically detected index date shows that most of the time the coded event date coincided with the date of onset of first symptoms (and with the date of hospital admission for the administrative database), although there can be a wide range between these two dates.

For this validation study, we have chosen PPV as the relevant measure of accuracy for the codes used in identifying AMI from EHR. Such a metric enables the use of GP and claims records to determine the probability that an individual has an AMI, based on such data. PPV measurements are correlated with disease prevalence, however, and are strongly dependent on specificity. Specificity and other measures of validity, such as sensitivity (ie, how many cases of AMI are missed) and negative predictive value, cannot be calculated from our data, because the data extraction was based on searching for the codes/free text pertinent to the diagnosis of interest. Another limitation is that, in the estimation of drug-related IRR of AMI, we only adjusted for age and sex and did not consider other potential confounding factors.

The results we obtained in this study are consistent with the PPV estimates for ICD-10 and for ICD9-CM cited in the literature. The ICD-10 codes I21, I22 and I23 were found to have 98% PPV in a Danish study evaluating the accuracy of ICD-10-coded myocardial infarction as a component of the Charlson comorbidity index.15 Previous studies evaluating earlier versions of ICD have also demonstrated the accurate coding practices in Danish administrative registries, including the Danish MONICA (Monitoring Trends and Determinants in Cardiovascular Disease) study where 93.5% of the patients in the Danish National Patient Register were found to have definite or possible AMI.25 The PPV of the ICD9-CM code 410 to identify cases of AMI among records with a prior primary hospital discharge code in the Saskatchewan Hospital Automated Database was 97%.12 In another study using Medicare claims, the PPV of several ICD9-CM codes (410.01, 410.11, 410.21, 410.31, 410.41, 410.51, 410.61, 410.71, 410.81 or 410.91) for identifying AMI in either primary or secondary hospital discharge diagnoses was 94.1%.11 While ICPC codes are often used to estimate the incidence or prevalence of various clinical outcomes,26–28 we are not aware of any published studies that have assessed the accuracy of ICPC codes in the identification of AMI in EHR data. A study in the Netherlands evaluated ICPC-coded diagnoses in GP records in the context of cardiovascular risk factor assessment after pre-eclampsia, but only the validity of an ICPC-coded pre-eclampsia diagnosis was determined.29

The value of free-text mining in identifying outcomes from EHR data is an area of research that is gaining considerable interest.30 ,31 Our findings show that there is potential for the use of free-text search in identification of AMI from EHR databases, but that the appropriate combination of key words and natural language processing techniques needs to be further evaluated and optimised.

Our investigation of the impact of outcome misclassification on estimation of AMI risk with drug use showed that the use of codes with lower PPV generally resulted in small changes in the estimated relative risks, but the use of codes with higher PPV may lead to attenuation or disappearance of risk for positive associations (non-differential misclassification biases the risk estimates towards the null). It is important to note that the change in the estimated risk of AMI during drug use when using more specific criteria is almost entirely due to the exclusion of AMI cases identified by free-text search: with AMI50, cases identified by free text in IPCI were excluded while with AMI75 all cases identified by free text in IPCI and HSD were excluded. The impact analyses were performed on aggregated data, but should ideally be stratified according to database. This is because the impact on IRR is not only a function of using specific versus non-specific codes, but also a function of the database characteristics. Future studies should thus take into account the data source as well as test more drug-event associations, control for other confounders and increase sample size, especially since these estimates were based on relatively small numbers (as reflected in the wide CIs). Although we considered only the ‘best-case scenario’ PPVs in the analyses for outcome misclassification, these findings suggest that similar implications would be expected with the ‘worst-case scenario’ PPVs.
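The way the three case definitions select case sources can be illustrated with a small sketch that uses the overall ‘best-case scenario’ PPVs reported in this article. Grouping all ICD9-CM and ICD-10 codes under a single pooled PPV is a simplification made for the example (the study assessed individual codes), and the dictionary and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Source = Tuple[str, str]  # (database, identification method)

# Overall 'best-case scenario' PPVs as reported in the text.
BEST_CASE_PPV: Dict[Source, float] = {
    ("IPCI", "ICPC K75"): 0.75,
    ("IPCI", "free text"): 0.197,
    ("HSD", "ICD9-CM codes"): 0.966,
    ("HSD", "free text"): 0.60,
    ("Aarhus", "ICD-10 codes"): 1.00,
}


def case_definition(min_ppv: float = 0.0) -> List[Source]:
    """Sources retained when only codes/free text with PPV >= min_ppv count."""
    return sorted(src for src, ppv in BEST_CASE_PPV.items() if ppv >= min_ppv)


ami = case_definition()        # all eligible codes and free text
ami50 = case_definition(0.50)  # PPV >= 50%
ami75 = case_definition(0.75)  # PPV >= 75%

excluded_by_ami50 = sorted(set(ami) - set(ami50))
excluded_by_ami75 = sorted(set(ami) - set(ami75))
```

Under these inputs AMI50 drops only the IPCI free-text cases and AMI75 drops the free-text cases from both IPCI and HSD, matching the description above.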


We have shown that a network of EHR databases from different countries with different disease coding systems can accurately identify patients with AMI and that adequate case retrieval remains an essential step in validation. The results obtained in this study are consistent with the PPV estimates for ICD9-CM and ICD-10 cited in the literature. Strategies are necessary to optimise the use of ICPC, in combination with free-text search, in the identification of AMI from EHR data. Use of more specific disease codes for identifying AMI during drug use may lead to a small but significant change in risk estimates, at the expense of decreased precision.



  • Contributors PMC, VEV, GT, MMol, MJS, MS and JvdL contributed to the conception and design of the study. MJS developed the Jerboa software for data extraction. MMos developed the Chameleon software for validation. PMC, VEV, MMol, PM and GT developed and tested the validation algorithms. GM, PM, MSN, LP and MS provided the data and performed local database analyses. PMC and VEV performed the aggregated analyses and PMC drafted the manuscript. All authors contributed to critically revising the manuscript for important intellectual content and approved the final version for submission.

  • Funding This research has been funded by the European Commission's Seventh Framework Programme (FP7/2007–2013) under grant no. 215847, The EU-ADR Project. The funding agency had no role in the design and conduct of the study, the collection and management of data, the analysis or interpretation of the data, and preparation, review or approval of the manuscript.

  • Competing interests None.

  • Ethics approval The respective scientific and ethics committees of each database approved the use of the data for this study.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.
