
Internal deterministic record linkage using indirect identifiers for matching of same-patient hospital transfers and early readmissions after acute coronary syndrome in a nationwide hospital discharge database: a retrospective observational validation study
  1. Afonso Rocha1,2,
  2. Luís Filipe Azevedo3,
  3. J C Silva Cardoso4,
  4. Thomas G Allison5,
  5. Alberto Freitas6
  1. 1Center for Health Technology and Services Research (CINTESIS), University of Porto-Faculty of Medicine, Porto, Portugal
  2. 2Cardiovascular Rehabilitation Unit, Physical Medicine and Rehabilitation, Centro Hospitalar Universitário São João EPE, Porto, Portugal
  3. 3Department of Health Information and Decision Sciences (CIDES) & Center for Health Technology and Services Research (CINTESIS), University of Porto-Faculty of Medicine, Porto, Portugal
  4. 4Department of Cardiology, Centro Hospitalar Universitário São João, University of Porto-Faculty of Medicine, Porto, Portugal
  5. 5Department of Cardiovascular Medicine and Cardiovascular Surgery, Mayo School of Medicine, Rochester, Minnesota, USA
  6. 6Department of Health Information and Decision Sciences (CIDES) & Center for Health Technology and Services Research (CINTESIS), University of Porto-Faculty of Medicine, Porto, Portugal
  1. Correspondence to Dr Afonso Rocha; afonsomrocha{at}


Objectives To assess validity of record linkage using multiple indirect personal identifiers to identify same-patient hospitalisations and definition of episode of care (EC) due to acute coronary syndrome (ACS).

Methods Using national hospital discharge data to identify all admissions due to ACS, we applied six linkage rules based on indirect identifiers with increasing level of detail and compared their validity against a pseudonymised unique identifier used as gold standard (GS). Contiguous hospitalisations within each matched group of hospitalisations occurring within 28 days of each other were considered one EC. We classified hospitalisations according to time between the first pair of hospitalisations as hospital transfer (HT: ≤1 day), early readmission (ER: 2–28 days) or recurrent cases (>28 days).

Results There were 146 671 hospitalisations (unlinked), 121 987 ACS 28-day EC (linked GS), with 18 398 HTs (≤1 day), and 6286 ERs (≤28 days). Among linkage rules using demographic and residence code variables, validity was highest for the rule using sex, date of birth and four-digit residence code, with sensitivity of 98.4% (95% CI: 98.4 to 98.5), specificity of 97.8% (95% CI: 97.6 to 98.0) and Cohen’s κ of 0.9 to detect ACS-EC, compared with the GS linkage rule. Similarly, validity for HT and ER was high and of similar magnitude, with sensitivity ranging between 97.2% and 98.1%, and specificity between 98.8% and 99.9%, respectively.

Conclusions Our internal linkage validation study using indirect patient identifiers will allow calibration of incidence rates and performance indicators, accounting for the effect of HT and readmissions.

  • acute coronary syndrome
  • medical record linkage
  • deterministic linkage
  • hospital admissions

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:


Strengths and limitations of this study

  • Demonstrates the validity of using deterministic record linkage with indirect identifiers to link patient-level hospitalisations allowing for aggregation of hospitalisations within the same episode of care.

  • Shows a valid method to overcome limitations of anonymised large administrative databases, allowing for epidemiological research with retrospective analysis and calibration of past and present ACS hospitalisation incidence trends, in-hospital mortality and performance indicators.

  • This methodology is applicable in different countries and settings having high rates of hospital transfers and readmissions, such as trauma, stroke and intensive care patients.

  • There was no assessment of quality of gold standard used for validation (unique pseudonymised identifier).

  • Validation was done on the National Hospital Discharge Database which has very low missing/invalid rates, whereby it may not be applicable in databases with higher error rates and for external record linkage between different data sets, where stringent deterministic methods result in a high number of false negatives.


Hospital administrative data provide a valuable source of information to address healthcare management, resource utilisation and quality of care research questions. Strong points of these databases are very wide coverage and low-cost systematic data collection.1 On the downside, hospital administrative data are not designed for research purposes, often lack a unique patient identifier and are recorded per hospitalisation, which does not allow linkage of multiple hospitalisations within the same episode of care (EC). They are thereby susceptible to imprecision and overestimation when patients are transferred between hospitals or have multiple readmissions for a single EC.2 This is especially problematic in the case of acute coronary syndromes (ACS), where clinical pathways and referral networks have been implemented to assure timely access to coronary angiography and revascularisation procedures, with hospital transfer (HT) rates up to 30%.3 4

Identifying whether a hospital admission is a transfer from another hospital, an early readmission (ER) within the same EC, or a late readmission due to a new ACS event remains challenging and is of paramount importance for analysing and interpreting outcome data and for monitoring trends of ACS subtypes, therapeutic measures and healthcare services performance.5 Additionally, in the US, from 2012 onwards, hospitals in which 30-day hospital readmission rates for certain conditions, including acute myocardial infarction, exceed the national average are financially penalised under the Patient Protection and Affordable Care Act.6

A standard approach to minimise multiple counting has been to exclude HTs and readmissions but, since these are not random events, this method introduces bias and leads to loss of relevant information.7 8 On the other hand, treating sequential hospitalisations as independent EC results in overestimation of standardised ACS trends, lowers estimates of the proportion of patients submitted to revascularisation treatment and may artificially decrease in-hospital mortality rates.4 5 7 9 Therefore, sequential hospitalisations for the same patient, occurring within a preset time frame, should be combined as one EC, which should be considered the preferred unit of analysis. When only unlinked data are available and there is no unique patient identifier, using an internal linkage method through demographic and event-based variables is desirable to identify and account for HTs and readmissions within the same EC.10

We aimed to build and assess the validity of a matching algorithm using secondary non-unique patient identifiers and event-based variables, using a stepwise deterministic linkage method, to identify patient-level ACS hospitalisations and contiguous hospitalisations occurring within 28 days from each other, classified as one ACS-EC, by using pseudonymised data (unique direct identifier) as gold standard (GS).


Study population and data sources

Data for the study were obtained retrospectively from the administrative national hospital discharge database provided by the Portuguese Ministry of Health’s Central Administration for the Health System, which includes hospitalisations occurring in all public acute care hospitals of the Portuguese National Health Service in mainland Portugal. Data provision is mandatory for every hospitalisation and is used for hospital reimbursement purposes, but also for disease prevalence estimation and healthcare utilisation assessment. Collected information includes demographics (age, sex, residence code), hospital admission and discharge dates, discharge diagnosis in a principal diagnosis field and up to 30 secondary diagnosis fields using the International Classification of Diseases—ninth revision—clinical modification (ICD9-CM) and discharge status (deceased or alive).

Due to data privacy issues, administrative health data have traditionally been released to researchers without unique direct identifiers. From 2011 onwards, a pseudonymised unique patient identifier was provided, allowing same-patient hospitalisations to be tracked and providing a reference against which we aimed to assess and validate our matching algorithm. Therefore, our analysis was restricted to all hospitalisation episodes, both inpatient and outpatient, between 2011 and 2015. We followed the modified Standards for Reporting of Diagnostic Accuracy criteria to report our findings.11

Event identification and classification

Coding procedures for ACS-EC vary considerably between institutions and with time, especially in the case of HTs for specialised care and treatment, ranging from both institutions (referring and receiving) coding the ACS hospitalisation and the procedure (duplicating both counts) to only the receiving institution coding the hospitalisation episode and procedure either as an inpatient or outpatient code. To capture all information pertaining to each ACS-EC, we included all hospitalisation episodes, both inpatient and outpatient, with a primary discharge diagnosis field showing ICD9-CM codes 410.x, 411.0–411.1 and 414.x and procedural codes: cardiac catheterisation (37.21, 37.22, 37.23), percutaneous coronary intervention (00.66, 36.03, 36.04, 36.06, 36.07, 36.09) and surgical coronary revascularisation (coronary artery bypass grafting (CABG): 36.10–36.17, 36.19). An exploratory analysis revealed heterogeneity of coding practices for HTs and elective readmissions among hospitals, ranging from both institutions coding the admission with an ACS code (410.x, 411.0–411.1), to the first institution coding ACS and the receiving institution coding the hospitalisation with a 414.x code. Since we wanted to capture and aggregate all the information related to hospitalisations within each ACS-EC, including revascularisation procedures, we decided to use all codes (410.x, 411.0–411.1 and 414.x) and selected, for each matching rule, only episodes having at least one hospitalisation with a 410.x or 411.0–411.1 code.

Since there is no specific coding allowing identification of ACS subtypes, we used codes 410.0–410.6 and 410.8 for ST-segment elevation myocardial infarction (STEMI), codes 410.7 and 410.9 for non-STEMI (NSTEMI) and codes 411.0 and 411.1 for unstable angina.12 We used the diagnostic hierarchy method proposed by Lopez et al, which reflects the severity of ACS subtypes, from STEMI (most severe), over NSTEMI to unstable angina (least severe). For an ACS-EC with multiple hospitalisations, the most severe category was used.2
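The coding scheme and severity hierarchy described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's actual implementation: the function names, the subtype labels and the assumption that codes are stored as dotted strings (for example "410.71") are hypothetical.

```python
# Illustrative sketch only: names and labels are hypothetical, and codes are
# assumed to be dotted ICD9-CM strings such as "410.71".
SEVERITY = {"STEMI": 3, "NSTEMI": 2, "UA": 1, "OTHER": 0}

STEMI_DIGITS = {"0", "1", "2", "3", "4", "5", "6", "8"}   # 410.0-410.6 and 410.8
NSTEMI_DIGITS = {"7", "9"}                                # 410.7 and 410.9


def acs_subtype(icd9_code):
    """Map a principal diagnosis code to an ACS subtype label."""
    code = icd9_code.strip()
    if code.startswith("410."):
        fourth = code[4:5]  # first digit after the decimal point
        if fourth in STEMI_DIGITS:
            return "STEMI"
        if fourth in NSTEMI_DIGITS:
            return "NSTEMI"
    if code.startswith("411.0") or code.startswith("411.1"):
        return "UA"  # unstable angina
    return "OTHER"   # e.g. 414.x


def episode_subtype(codes):
    """Return the most severe subtype among an episode's hospitalisations.

    Requires at least one code.
    """
    return max((acs_subtype(c) for c in codes), key=SEVERITY.__getitem__)
```

For an episode whose hospitalisations carry the codes 414.01, 410.71 and 411.1, `episode_subtype` returns the NSTEMI label, because NSTEMI ranks above unstable angina and the 414.x category in the hierarchy.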

The steps taken to select linkable inpatient and outpatient hospitalisation episodes with an ACS-related primary diagnosis are shown in online supplementary figure 1. First, we identified redundant episodes (n=21) that had the same combination of values for all variables (60 variables), and kept only one record among duplicates. Second, we excluded records with missing or invalid values in the linking variables contained in each linkage rule (n=1095). Lastly, we restricted hospitalisation episodes to patients aged ≥30 years (n=171) due to concerns of unreliability of ACS estimates in younger patients.
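The three selection steps can be pictured as a simple filter. The sketch below is an assumption-laden illustration: records are represented as dictionaries, the field names (`sex`, `birth_date`, `residence_code`, `age`) are hypothetical, and only missing values are checked, since the validity checks applied in the study are not described in detail.

```python
# Illustrative sketch only: field names are hypothetical and "invalid" values
# are not checked beyond being missing (None or empty string).
def select_linkable(records,
                    linking_vars=("sex", "birth_date", "residence_code"),
                    min_age=30):
    """Apply the three selection steps to a list of record dictionaries."""
    seen = set()
    kept = []
    for rec in records:
        # Step 1: keep one record among exact duplicates on all variables.
        fingerprint = tuple(sorted(rec.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Step 2: drop records with missing values in any linking variable.
        if any(rec.get(var) in (None, "") for var in linking_vars):
            continue
        # Step 3: restrict to patients aged 30 years or older
        # (age is assumed to be present and numeric).
        if rec["age"] < min_age:
            continue
        kept.append(rec)
    return kept
```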

Linkage method

We used internal deterministic data linkage requiring matches on different combinations of person-level identifiers and calculated time interval (in days) between matched hospitalisations to define a 28-day ACS-EC comprising first admission and all contiguous admissions occurring within 28-day period from each other (HT: ≤1 day; ER: 2–7 days; late readmission: 8–28 days).2 13 Cases with identical demographic identifiers (matched hospitalisations) admitted to the same hospital or in two separate hospitals within 28 days of each other were considered as belonging to the same 28-day ACS-EC, counted only once and had all their information aggregated. Matched hospitalisations occurring beyond 28 days from each other were considered as a new ACS-EC.
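The chaining of contiguous hospitalisations into a 28-day ACS-EC can be illustrated as follows. This is a sketch under stated assumptions, not the study's code: the function names are hypothetical, and the interval is taken here as the number of days from the previous discharge to the next admission, which is one possible reading since the text does not specify which dates were used.

```python
from datetime import date

# Illustrative sketch only. Assumption: the interval between two matched
# hospitalisations is measured from previous discharge to next admission.


def classify_gap(days):
    """Label a subsequent hospitalisation by its interval in days."""
    if days <= 1:
        return "HT"                 # hospital transfer
    if days <= 7:
        return "ER"                 # early readmission
    if days <= 28:
        return "late readmission"
    return "new episode"            # beyond 28 days: a new ACS-EC


def build_episodes(stays, window_days=28):
    """Group one patient's (admission, discharge) pairs into episodes of care.

    A stay joins the current episode when it begins within `window_days`
    of the previous stay, so contiguous admissions chain together.
    """
    episodes = []
    for admit, discharge in sorted(stays):
        if episodes:
            previous_discharge = episodes[-1][-1][1]
            if (admit - previous_discharge).days <= window_days:
                episodes[-1].append((admit, discharge))
                continue
        episodes.append([(admit, discharge)])
    return episodes


example_stays = [
    (date(2012, 1, 1), date(2012, 1, 2)),    # index admission
    (date(2012, 1, 2), date(2012, 1, 6)),    # same-day transfer
    (date(2012, 1, 20), date(2012, 1, 22)),  # readmission 14 days later
    (date(2012, 6, 1), date(2012, 6, 4)),    # more than 28 days later
]
example_episodes = build_episodes(example_stays)
```

In this invented example the first three stays form one episode and the fourth starts a new one, so the patient contributes two ACS-EC rather than four hospitalisations.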

We set six test linkage rules using various combinations and granularity of the following linkage variables: sex, date of birth and residence code. Deterministic linkage rules require, for identification of matched hospitalisations (hospitalisations pertaining to the same patient), an exact match on the values of all linkage variables specified in each matching rule (table 1). Residence code consists of a sequential combination of six digits according to the administrative level of detail: two identifying districts (total of 18); two for municipalities within each district (total of 278) and two for parishes within each district and municipality (4050 up to 2012; and 2882 after the administrative reform of 2013).14 A direct pseudonymised identifier (unique patient ID nine-digit combination derived from national identification number) was used as the GS.15 We sequentially tested rules with increasing level of granularity to assess validity and linkage error rate compared with the GS.
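As an illustration of exact-match linkage on indirect identifiers, the sketch below builds a match key from sex, date of birth and the first digits of the residence code. It is a hypothetical example: the field names and function names are assumptions, and it does not reproduce the six rules of table 1. Only the idea of varying residence code granularity (two digits for district, four for municipality, six for parish) is taken from the text.

```python
from collections import defaultdict

# Illustrative sketch only: field names are hypothetical, and the residence
# code is assumed to be stored as a six-character string.


def match_key(rec, residence_digits=4):
    """Build the exact-match key: sex, date of birth and a residence prefix."""
    return (rec["sex"], rec["birth_date"], rec["residence_code"][:residence_digits])


def link(records, residence_digits=4):
    """Return groups of record indices that share the same match key."""
    groups = defaultdict(list)
    for index, rec in enumerate(records):
        groups[match_key(rec, residence_digits)].append(index)
    # Only keys shared by two or more records count as matched hospitalisations.
    return [indices for indices in groups.values() if len(indices) > 1]


example_records = [
    {"sex": "M", "birth_date": "1950-03-14", "residence_code": "131204"},
    {"sex": "M", "birth_date": "1950-03-14", "residence_code": "131207"},
    {"sex": "F", "birth_date": "1950-03-14", "residence_code": "131204"},
]
```

With these invented records, `link(example_records, 4)` groups the first two hospitalisations, whereas `link(example_records, 6)` finds no match because the parish digits differ. This mirrors how a stricter residence code can turn a true match into a missed match when the parish code changes or is miscoded.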

Table 1

Stepwise deterministic matching algorithms according to detail of identifying variables

Despite contiguous admissions, within 28 days from each other, being counted only once as a single 28-day ACS-EC, we aggregated information from different hospitalisations regarding revascularisation procedures, severity indicators, comorbidities and in-hospital mortality.

Statistical analysis

For each matching rule we calculated total matching rate (proportion of total hospitalisations successfully linked) and matching rate for ACS-EC. Using unique ID as GS we calculated the number of matching errors (missed matches; false matches) for each matching rule. Comparative linkage quality was assessed by calculating sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) with their 95% CIs using the one-sample Clopper–Pearson and the standard logit methods, respectively.16 17 Chance-weighted proportional agreement between matching rules and GS was calculated using Cohen’s κ and classified as poor if κ≤0.20; fair if 0.21<κ≤0.40; moderate if 0.41<κ≤0.60; high if 0.61<κ≤0.80 and excellent agreement if κ>0.80.18 We compared baseline characteristics between true matches and false matches and between missed matches and true non-matches, using independent samples t-test and Chi-square for continuous and categorical variables, respectively. We then described the characteristics of the study population with a single hospital admission compared with those having multiple hospitalisations within a 28-day ACS-EC. Analyses were performed using IBM SPSS Statistics V.25 and Microsoft Excel V.16.30.
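The validity measures above follow from a two-by-two table of rule classification against the GS. The sketch below shows the point estimates and Cohen's κ only; the confidence intervals (Clopper–Pearson and logit) are omitted for brevity. The function names are hypothetical and the counts used in the usage note are invented, not study data. The analyses in the study were run in SPSS and Excel, not Python.

```python
# Illustrative sketch only: point estimates without confidence intervals.
# tp = true matches, fp = false matches, fn = missed matches,
# tn = true non-matches, all judged against the gold standard.


def validity(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV, NPV and Cohen's kappa."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (observed - expected) / (1 - expected),
    }


def agreement_label(kappa):
    """Classify kappa using the thresholds stated in the text."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "high"
    return "excellent"
```

For example, with invented counts of 90 true matches, 5 false matches, 10 missed matches and 95 true non-matches, `validity(90, 5, 10, 95)` gives a sensitivity of 0.90, a specificity of 0.95 and a κ of 0.85, which `agreement_label` classifies as excellent.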

Patient and public involvement

There was no patient or public involvement in any step of this study.


During the study period there were 146 671 hospitalisations due to ACS with mean age 67.7 (12.3) years and 68.9% were men. Median length of stay was 3 days (IQR 6). Unlinked data revealed 26 842 (18.3%) hospitalisations with STEMI, 36 597 (24.9%) with NSTEMI, 10 347 (7.5%) with unstable angina and 72 885 (49.6%) classified as other acute and subacute forms of ischaemic heart disease. Cardiac catheterisation was performed in 70% of hospitalisation episodes, percutaneous coronary intervention in 38.2% and CABG in 6.3%, while in 23.8% of hospitalisations no cardiac procedure was performed. Heart failure was present in 19 699 (13.4%), 15 019 (10.2%) had atrial fibrillation, 2981 (2.0%) ventricular fibrillation, along with 1469 (1.0%) cardiac arrests and a total of 6241 (4.3%) in-hospital deaths (online supplementary table 1).

The linkage rule requiring an exact match on the unique patient ID (GS) identified 34 948 matched hospitalisations corresponding to 23.8% of all hospitalisations, with 16.8% readmissions within 28 days from initial hospitalisation. Among the test rules based on indirect identifiers, matching rate decreased from 99.9% for rule 1 to 28.1% for rule 6, and the matching rate of ACS-EC with multiple hospitalisations decreased from 99.5% to 15.3%, from rules 1 to 6 (table 2).

Table 2

Total number and proportion of matched hospitalisations and 28-day ACS episode of care using each matching rule

The proportion of ACS-EC with multiple hospitalisations increased from 16.9% in 2011 to 17.9% in 2015, being less frequent in women and with advancing age compared with single hospitalisation ACS-EC. The rate of multiple hospitalisations within same EC was lower for those with unstable angina, and higher in those submitted to cardiac procedures, especially CABG. There was considerable geographical heterogeneity in incidence of ACS hospitalisations and proportion of ACS-EC with multiple hospitalisations with major coastal districts (Lisbon and Porto) being responsible for 40.8% of all ACS-EC, but smaller inland districts depicting the highest rate of ACS-EC with multiple hospitalisations ranging up to 37.4% (online supplementary table 2).

All test rules overestimated the number of recurrent ACS and underestimated first ACS hospitalisation compared with the GS. Rule 6 had the lowest detection of HTs (11.0% vs 12.5% for GS) but the second highest proportion of first ACS hospitalisation identification (71.9% vs 76.2% for GS) (table 3).

Table 3

Total number and proportion of ACS hospitalisations according to time between first and subsequent hospital admission for matched hospitalisations using each matching rule

As level of detail of variables included in matching rules increased, the number of false matches decreased from 121 255 (82.7% of matches) for rule 1 to 1490 for rule 6 (3.6%) and, inversely, the proportion of missed matches increased from 0 (0.0%) for rule 1 to 3343 (8.1%) for rule 6. Furthermore, validity measures showed that adding residence code to demographic variables in matching rules significantly increased validity against the GS, with sensitivity decreasing only slightly from 100% for rule 1 to 97.8% for rule 5, with a steeper decrease to 86.2% for rule 6, with both specificity and PPV increasing as matching rule granularity increases. Cohen’s κ depicted an excellent agreement between rules using sex, date of birth and residence codes (linkage rules 4 to 6) and the GS for the detection of 28-day ACS-EC, with rule 5 showing the highest degree of agreement (κ=0.941), closely followed by rule 4 (κ=0.927), then decreasing for rule 6 (κ=0.876) (table 4). Table 5 shows the matching quality according to time between first and subsequent hospital admission for matched hospitalisations identified using matching rule 5, with somewhat lower PPV for HTs and late readmissions.

Table 4

Measures of matching quality for each matching rule in the detection of 28-day ACS episode of care

Table 5

Measures of matching quality according to time between first and subsequent hospital admission for matched hospitalisations identified using matching rule 5

Using matching rule 5 to identify multiple hospitalisations within the same ACS-EC, 98.3% of episodes were correctly classified, while there was a false match rate of 7.3% and a false non-match rate of 0.4%. Table 6 compares the characteristics of ACS hospitalisations erroneously classified by rule 5 as a match (false match) or non-match (missed match) with those of true matches and true non-matches, respectively. False matches were more common in those presenting with unstable angina or coded as other ACS and subacute ACS (ICD9-CM 414) and in patients submitted to either cardiac catheterisation or percutaneous coronary intervention. Missed matches were more common in younger age, in hospitalisations coded as ICD9-CM 414 and in patients submitted to coronary artery bypass surgery (table 6). When analysing mismatch rates at district level, Lisbon had an exceedingly high proportion of false matches (21.8%) compared with the other districts (4.8%), with three municipalities alone being responsible for 77.2% of all false matches. Exclusion of these three municipalities from the analysis resulted in a drop in false-match rate in Lisbon district from 21.8% to 6.6%, approaching the district and national average.

Table 6

Characteristics of matching errors of 28-day ACS-EC identified, using matching rule 5


Using the National Hospital Discharge Database, we built, tested and compared the validity of deterministic internal record linkage using different combinations of indirect identifiers for the identification of 28-day EC consisting of patient-level sequential hospitalisations occurring within 28 days from each other. We found that linkage rules which include demographic and residence code variables showed comparable linkage rates and high validity compared with the GS. We found that false match rate was significantly reduced by increasing the level of detail of residence code, from district to municipality and to parish but, in case of inclusion of parish coding (rule 6), at the expense of an increase in missed matches, loss of sensitivity and agreement with the GS. To our knowledge this is the first study to validate a matching algorithm, without direct identifiers, for matching and identification of ACS patient-level hospitalisations, incorporating all subtypes of ACS and including a wider range of hospitalisations (eg, code 414.x) in order to capture and aggregate information from all sequential hospitalisations within the same ACS-EC, and to assess the impact of aggregating information on ACS hospitalisation counts, characterisation of ACS patients and on indicators of performance.

Most record linkage studies using indirect identifiers have been designed to externally link different data sets, namely clinical registries with claims data, whereby two records are considered a true match, given agreement or disagreement on a set of partial identifiers.19 20 For our study we took a different perspective: we aimed to internally link same-patient hospitalisation episodes due to ACS to build patient-level data on consecutive hospitalisations, using event-based variables to define a time frame to build an EC. We chose deterministic linkage for its simplicity and appropriateness in scenarios in which missing and invalid values in matching variables are rare and these matching variables are sufficiently discriminative, as is often the case in large administrative data sets.21 By analysing different sets of identifiers against the GS, we demonstrated that the combination of demographic and residence code (at district and municipality levels) variables showed the highest validity. Westfall and McGloin7 found similar results in a subanalysis of 120 206 myocardial infarction hospital admissions where they used a matching algorithm of indirect identifiers (age or month–year of birth, sex, zip code, ICD9 code) to detect HTs as same-patient hospitalisations occurring within 7 days of first hospitalisation, and found a sensitivity of 96.7% and specificity of 98.7%.

Choosing the appropriate matching rule is highly dependent on the aim of the analysis, and on the type, quality and completeness of data pertaining to the matching variables chosen.15 22 Moreover, use of a matching rule for record linkage should ideally be preceded by a pilot study, where validity against a given GS (usually a unique patient identifier) is assessed. In our study, we found that a stricter residence code matching rule (six digits) resulted in a higher proportion of missed matches and loss of agreement with the GS compared with more relaxed rules (four digits), possibly because it is more susceptible to coding errors, changes of residency and administrative reforms that change the number and codes of parishes over time.23

Our matching algorithm with highest face validity (rule using sex, date of birth and four-digit residence code—rule 5) showed a low missed-match rate of 0.4% and higher false-match rate of 7.4%. We found high regional heterogeneity with clustering of false matches in three municipalities within the same district. It possibly reflects regional variations in data quality, reporting or coding procedures,24 and it reinforces the need for detailed analysis of characteristics associated with linkage error when doing validation studies for matching algorithms used in record linkage studies. We found missed matches to be more common in younger ages and those with planned procedures (ICD9-CM 414; surgical revascularisation); while false matches were more frequent in those with unstable angina and submitted to catheterisation and/or percutaneous revascularisation procedures. Nonetheless, these linkage errors had limited impact on the overall performance of the matching algorithm with specificity above 98% in detection of all contiguous hospitalisations’ subtypes. Linkage methods that maximise specificity lead to the most robust study results and should therefore be the main focus when building matching rules for record linkage studies.25

Our study has some limitations. Although we used a pseudonymised unique identifier as GS, it consists of a long string of numbers and is therefore susceptible to errors, the impact of which has not been assessed. In our study, we did an internal record linkage to identify patient-level contiguous hospitalisations, classified according to time elapsed between sequential hospitalisations, using indirect identifiers with low missing/invalid rates.

Different studies have compared linkage rates for a linkage rule using indirect identifiers with one using direct identifiers to link records from registries to Medicare claims data and showed, like we did in our study, highly valid linkages compared with the GS rule(s) that included direct identifiers.10 15 We used a deterministic linkage method and required exact matches on >3 variables in our rules. The expected error rates are low, and the rate for false-positive linkages is anticipated to be small. However, false-negative linkages are a concern in all rules, including the GS. The degree of bias from the imperfect GS depends on the number of false-negative matches in the GS and the prevalence of the true linkage.

Our results are likely generalisable to attempts that link hospitalisation-level records, but both expected error rates of linkage variables and prevalence of the condition should be considered. We have used standard demographic variables as linking variables, such as gender, date of birth and residence code, which are much less prone to errors and missing data, since most of these are automatically uploaded to the database. Nonetheless, our results may not be applicable in settings in which databases have high error rates in these linkage variables, since they will produce a large number of false-negative links, warranting the addition of probabilistic linkage methods.


Deterministic linkage using multiple indirect identifiers allows for accurate and valid internal linkage of patient-level contiguous hospitalisations in a preset time frame defining an EC, comparable with linkage with direct identifiers in hospital administrative data. Most data on nationwide or large-scale trends of ACS incidence, management and mortality have been abstracted from unlinked administrative health data and released to researchers without a unique patient identifier, and even in those jurisdictions that have recently introduced pseudonymised databases, longer-term trends analysis still relies heavily on unlinked records.26 Therefore, our method of identifying, classifying and aggregating information of contiguous hospitalisations within the same EC will allow calibration of incidence rates and performance indicators to the number of EC and not to hospitalisations, and will be of value in different countries. Furthermore, it might also be useful in other clinical conditions that have high rates of transfers and readmissions, such as trauma,27 stroke28 and intensive care patients.23




  • Contributors AR, LFA and AF conceived the study, developed the study design, performed the data analysis and interpretation and drafted the manuscript. LFA and AF provided statistical expertise for data analysis. JCSC and TGA critically revised the work for important intellectual content. All authors contributed to refinement of the study protocol and approved the final manuscript.

  • Funding This article was supported by National Funds through Fundação para a Ciência e a Tecnologia within CINTESIS, R&D Unit (reference UID/IC/4255/2019), and by project NORTE-01-0145-FEDER-000026—Symbiotic technology for societal efficiency gains: Deus ex Machina, financed by NORTE2020 under PORTUGAL2020.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Ethics approval Ethical approval was not required for the present study because it uses anonymised secondary data obtained during routine care, systematically reported by all public health hospitals in mainland Portugal and publicly available.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement Data are available upon reasonable request. Data source for this study is a retrospective administrative health database with data from all hospitalisations that were already anonymised. These are anonymised secondary data obtained during routine care, systematically reported by all public health hospitals in mainland Portugal and publicly available. Data were provided by the ACSS, an official organ of the Ministry of Health, through a cooperation protocol with the Faculty of Medicine, University of Porto and CINTESIS for research purposes.
