
A retrospective cohort study assessing patient characteristics and the incidence of cardiovascular disease using linked routine primary and secondary care data
  Rupert A Payne (1), Gary A Abel (1), Colin R Simpson (2)
  1. General Practice and Primary Care Research Unit, University of Cambridge, Cambridge, UK
  2. eHealth Research Group, Centre for Population Health Sciences, The University of Edinburgh, Edinburgh, UK
  Correspondence to Dr Rupert A Payne; rap55{at}


Objectives Data linkage combines information from several clinical data sets. The authors examined whether coding inconsistencies for cardiovascular disease between components of linked data sets result in differences in apparent population characteristics.

Design Retrospective cohort study.

Setting Routine primary care data from 40 Scottish general practitioner (GP) surgeries linked to national hospital records.

Participants 240 846 patients, aged 20 years or older, registered at a GP surgery.

Outcomes Cases of myocardial infarction, ischaemic heart disease and stroke (cerebrovascular disease) were identified from GP and hospital records. Patient characteristics and incidence rates were assessed for all three clinical outcomes, based on GP, hospital, paired GP/hospital (similar diagnoses recorded simultaneously in both data sets) or pooled GP/hospital records (diagnosis recorded in either or both data sets).

Results For all three outcomes, the authors found evidence (p<0.05) of different characteristics when using different methods of case identification. Prescribing of cardiovascular medicines for ischaemic heart disease was greatest for cases identified using paired records (p≤0.013). For all conditions, 30-day case fatality rates were higher for cases identified using hospital compared with GP or paired data, most noticeably for myocardial infarction (hospital 20%, GP 4%, p=0.001). Incidence rates were highest using pooled GP/hospital data and lowest using paired data.

Conclusions Differences exist in patient characteristics and disease incidence for cardiovascular conditions, depending on the data source. This has implications for studies using routine clinical data.

This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license.


Article summary

Article focus

  • Data linkage allows information to be combined from different routine clinical data sources.

  • Previous work has shown differences between sources of data but has not examined this at the patient level.

Key messages

  • Patients' apparent characteristics, and disease incidence and severity, vary depending on whether primary care, hospital or combined definitions of cardiovascular events are used.

  • Use of isolated routine primary care or hospital data may result in biased patient selection.

  • This has implications in the public health arena, clinical trial patient recruitment and validity and reliability of secondary data in clinical trials.

Strengths and limitations of this study

  • The strengths of this study are the novel analytical approach, using a large routine data set linked at individual patient level from multiple GP surgeries.

  • Limitations of this study include restricting our analysis to four coding groups, uncertainty as to whether GP and hospital events could be considered to be recorded simultaneously, potential diagnostic coding inaccuracies and the relatively small number of GP surgeries, which may not have been representative.


Primary care data sets are commonly used for assessment of cardiovascular outcomes. Such events often are associated with hospitalisation.1 However, it is possible that the manner in which outcomes are coded and recorded in electronic health records may differ between primary and secondary care. This may result not only in differences in the apparent incidence of a condition, depending on whether primary or secondary care records are used, but also in differences in the observed characteristics of patients. Studies have observed that variations in diagnostic criteria can affect estimates of disease prevalence,2 and the complexities of clinical coding systems for electronic healthcare records can lead to inconsistent data recording.3 This will lead to uncertainties with respect to disease prevalence and mortality,4 impact on clinical care, have additional health service implications such as affecting funding5 and potentially influence identification of patients for clinical trials. Previous studies have compared general practice coding and disease prevalence with other unlinked data sources, including paper notes.6 7 However, the effect of combining information from two sources has not been previously examined. This study used linked individual patient electronic health records collected from primary and secondary care to examine the effect of using data from different parts of the healthcare service on the incidence rates, case fatality rates and patient characteristics of myocardial infarction (MI), ischaemic heart disease (IHD) and cerebrovascular disease (CVD).


Data sources

Sixty general practitioner (GP) surgeries take part in the Scottish national Practice Team Information (PTI) project, of which 40 self-selected surgeries contributed to the data set used in this study. Practices involved in the PTI project provide routine central recording of clinical activity and morbidity from a sample of GP surgeries considered reasonably representative of the Scottish population. Practices are reimbursed to ensure that data recording is optimal. Clinical coding used the Read code system. Data are used to calculate national estimates and used by various organisations (eg, NHS Boards, Scottish Government) to inform policies and better understand health in Scotland.

Patient details from the PTI data set were linked to the corresponding admissions recorded in Scottish national hospital data (the Scottish Morbidity Record, SMR-01) using probabilistic matching. Matching was based on Soundex-encoded name, date of birth, sex, postcode and a unique nationwide identifier, the Community Health Index. Experienced human review was used to set a threshold for linkage. A substantial proportion of patients in this GP cohort have no hospital admissions, and as such, it is difficult to know whether the absence of a match is either due to a genuine lack of corresponding hospital record or due to a false-negative error. Match rates are thus difficult to quantify, although the use of multiple identifiers should improve linkage quality. The linkage was carried out by the Information Services Division, NHS National Services Scotland. The work was approved by the Privacy Advisory Committee of NHS National Services Scotland. For the 2004–2006 period, SMR-01 data are considered to be 88% accurate.8 SMR-01 records are generated for all inpatient hospital medical discharges and transfers. Coding is based on the International Statistical Classification of Diseases and Related Health Problems (ICD) system (ICD9 prior to 2000, ICD10 thereafter), with up to six inpatient diagnoses per record. Accident and emergency, maternity and psychiatric admissions, along with outpatient attendances, are not recorded in SMR-01. SMR-01 itself is also routinely linked to national mortality data (General Registrar's Office for Scotland, GROS). SMR-GROS data are also used to generate Scottish National Statistics.
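The name component of the linkage can be illustrated with a short sketch. The text states only that names were Soundex-encoded; the variant used and the probabilistic scoring are not described, so the code below assumes the common American Soundex rules and shows the encoding step alone. The function name `soundex` is illustrative.

```python
def soundex(name: str) -> str:
    """Encode a name with American Soundex (assumed variant): letter + 3 digits."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept after it
    return (result + "000")[:4]  # pad or truncate to four characters


if __name__ == "__main__":
    for surname in ("Robert", "Rupert", "Payne", "Pain"):
        print(surname, soundex(surname))
```

Names that sound alike but are spelled differently map to the same key, which is why such an encoding tolerates minor spelling variation between two record systems.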

Identification and classification of cases

We first identified all records of MI, IHD and CVD from both GP and hospital data sets using the following Read codes (MI: G30%/35%/38%, Gyu34/35/36; IHD: G3%, Gyu3%; CVD: G6%, Gyu6%, F4236; where % indicates a ‘wildcard’ match) and ICD codes (MI: ICD10 I21–22, ICD9 410; IHD including MI: ICD10 I20–25, ICD9 410–414; CVD (stroke) including haemorrhage and transient ischaemic attack (TIA): ICD10 I60–69, G45–46; ICD9 430–438). Hospital events were identified from any of the six diagnostic positions. These were not necessarily first events.
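The code lists above can be expressed as a small lookup, shown here as a minimal sketch rather than the query actually run on the study database. It assumes that a trailing % is a prefix wildcard, that codes without % must match exactly, and that ICD10 codes are compared on their three-character stem; ICD9 ranges are omitted for brevity. All names (`READ_PATTERNS`, `classify_read`, `classify_icd10`) are illustrative.

```python
# Read code patterns from the text; a trailing % is treated as a prefix wildcard.
READ_PATTERNS = {
    "MI": ["G30%", "G35%", "G38%", "Gyu34", "Gyu35", "Gyu36"],
    "IHD": ["G3%", "Gyu3%"],
    "CVD": ["G6%", "Gyu6%", "F4236"],
}

# ICD10 ranges from the text, as inclusive three-character stems.
ICD10_RANGES = {
    "MI": [("I21", "I22")],
    "IHD": [("I20", "I25")],
    "CVD": [("I60", "I69"), ("G45", "G46")],
}


def read_matches(code: str, pattern: str) -> bool:
    """Prefix match for wildcard patterns, exact match otherwise (assumption)."""
    if pattern.endswith("%"):
        return code.startswith(pattern[:-1])
    return code == pattern


def classify_read(code: str) -> set:
    """Return every outcome whose Read patterns match the code."""
    return {outcome for outcome, patterns in READ_PATTERNS.items()
            if any(read_matches(code, p) for p in patterns)}


def classify_icd10(code: str) -> set:
    """Return every outcome whose ICD10 range contains the code's stem."""
    stem = code[:3].upper()
    return {outcome for outcome, ranges in ICD10_RANGES.items()
            if any(low <= stem <= high for low, high in ranges)}


if __name__ == "__main__":
    print(classify_read("G30.."))   # an MI code also counts as IHD
    print(classify_icd10("I63.9"))  # a stroke code
```

Because IHD is defined to include MI, a single code can fall into more than one outcome, so the functions return a set rather than a single label.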

We then found all episodes of a similar GP and hospital event type occurring within a 30-day period and made the assumption that these pairings represented the same clinical event. Where the GP and hospital dates differed for these paired episodes, the first of the two dates was taken. The choice of 30 days was a pragmatic one but supported by visual evaluation of the distribution of time gaps between similar hospital and GP event types over a 2-year period. Of note, an event recorded by the GP does not necessarily require a face-to-face consultation or a referral to be made; hospital admissions will usually be retrospectively recorded by the GP, using the admission date as opposed to the data-entry date.
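The pairing rule can be sketched as follows for one patient and one event type. The text does not say how multiple candidate matches were resolved or whether the 30-day limit was inclusive; this sketch assumes a greedy one-to-one match in date order and an inclusive limit. The function name `pair_events` is illustrative.

```python
from datetime import date, timedelta


def pair_events(gp_dates, hospital_dates, window_days=30):
    """Return one date per GP/hospital pair: the earlier of the two dates.

    Each hospital record is used at most once (assumption), and a pair is
    formed when the two dates are no more than `window_days` apart.
    """
    window = timedelta(days=window_days)
    unused_hospital = sorted(hospital_dates)
    paired = []
    for gp in sorted(gp_dates):
        for hosp in unused_hospital:
            if abs(gp - hosp) <= window:
                paired.append(min(gp, hosp))  # the first of the two dates is taken
                unused_hospital.remove(hosp)
                break  # safe: iteration stops immediately after the removal
    return paired


if __name__ == "__main__":
    gp = [date(2005, 3, 10), date(2006, 1, 1)]
    hospital = [date(2005, 3, 1), date(2006, 6, 1)]
    print(pair_events(gp, hospital))  # only the March 2005 records pair up
```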

Analysis was carried out over the period 1 January 2005 to 1 January 2007. The total population was randomly allocated to one of four methods of identifying cardiovascular events: those based on GP events only; those based on hospital events only; those based on pooled GP/hospital events, with an event in GP data only, hospital data only or both the GP and hospital data (although not necessarily occurring within 30 days); and those based on paired GP/hospital events (those recorded in both GP and hospital data within 30 days). An episode was included as an incident event only if there was no record of a similar clinical event at any time prior to 1 January 2005 coded in the same data set(s).

This method of identifying incident events is shown in figure 1. For example, for an event to be included using only GP data, the first event would have to be recorded by the GP during the 2-year period of interest, with no similar events recorded by the GP prior to 1 January 2005; hospital data are completely ignored in this case. A similar approach is used for identifying events using hospital-only data, with GP records ignored in this situation. For the third method, identifying events using pooled GP/hospital data, the first event needs to be recorded by either the hospital or the GP during the 2-year study period; there must be no similar event recorded in either data set prior to 1 January 2005. For the final method, the first occurrence of paired (ie, within 30 days) records in both GP and hospital data sets constituted an incident event if it occurred during the 2-year period; any unpaired GP or hospital records occurring prior to 1 January 2005 were ignored.
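The four definitions can be summarised in a short sketch for a single patient and condition. It is an illustration under stated assumptions, not the study's SQL: the study period is treated as half-open (1 January 2007 excluded), pairing is greedy and one-to-one with an inclusive 30-day limit, and the names `paired_dates` and `incident_date` are hypothetical.

```python
from datetime import date, timedelta

STUDY_START = date(2005, 1, 1)
STUDY_END = date(2007, 1, 1)  # treated as exclusive (assumption)


def paired_dates(gp_dates, hospital_dates, window_days=30):
    """Earlier date of each GP/hospital pair lying within the window."""
    window = timedelta(days=window_days)
    remaining = sorted(hospital_dates)
    out = []
    for gp in sorted(gp_dates):
        match = next((h for h in remaining if abs(gp - h) <= window), None)
        if match is not None:
            remaining.remove(match)
            out.append(min(gp, match))
    return out


def incident_date(gp_dates, hospital_dates, method):
    """Return the incident event date under one method, or None.

    The earliest qualifying record must fall inside the study period; an
    earlier record in the same data set(s) means the event is not incident.
    """
    if method == "gp":
        candidates = list(gp_dates)            # hospital data ignored
    elif method == "hospital":
        candidates = list(hospital_dates)      # GP data ignored
    elif method == "pooled":
        candidates = list(gp_dates) + list(hospital_dates)
    elif method == "paired":
        candidates = paired_dates(gp_dates, hospital_dates)  # unpaired records ignored
    else:
        raise ValueError(f"unknown method: {method}")
    if not candidates:
        return None
    first = min(candidates)
    return first if STUDY_START <= first < STUDY_END else None


if __name__ == "__main__":
    gp = [date(2003, 5, 1), date(2005, 6, 10)]
    hospital = [date(2005, 6, 1)]
    for method in ("gp", "hospital", "pooled", "paired"):
        print(method, incident_date(gp, hospital, method))
```

In the usage example, the historic GP record from 2003 rules out an incident event under the GP-only and pooled definitions, whereas the hospital-only and paired definitions both yield 1 June 2005.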

Figure 1

Identification of incident events. The figure shows how incident events can be identified from linked general practice (GP) and hospital data sets, for eight hypothetical patients, illustrating some of the potential coding combinations. Circles correspond to the presence of a GP (○) or hospital (●) clinical code, with numbers illustrating the order. Immediately adjacent circles represent codes occurring within 30 days of one another. It can be seen that, for any given patient, it is possible to classify them as having an incident event in up to four ways: GP data only, hospital data only, paired GP/hospital and pooled GP/hospital; the code that identifies an incident event for each of these methods is shown on the right of the figure. Codes do not count as incident events if a further, similarly classified, event has occurred prior to the start of the study period. In our study, patients were randomly allocated to one of the four coding methods. For instance, if patient E was allocated to ‘hospital only’ coding, they would not be classified as having had an event; in contrast, they would be classified as having had an event if they were allocated to any of the other three coding methods.

For each incident event, we determined the patient's age, sex, socioeconomic status (Scottish Index of Multiple Deprivation quintile),9 recorded current smoking status, record of hypertension, record of diabetes and Charlson Index.10 Comorbidities, including Charlson Index, were determined from the GP data as the presence of any relevant diagnostic Read code prior to the incident episode date; the list of codes used is available from the authors on request. Although we have not formally evaluated performance of our Charlson Index Read code list, we match 87% of those events identified by the method described by Khan et al,11 and as such believe that this represents a reasonable, albeit pragmatic, measure of comorbidity. Death from any cause within 30 days of the event was ascertained from linked national mortality (GROS) data. Drug therapy recorded in the GP record, starting prior to or within 30 days after the event, and continuing for any period of time after the event, was ascertained for patients alive at 30 days. Drug classes included were ACE inhibitors (including angiotensin receptor blockers), β-blockers, calcium channel blockers, diuretics (including potassium sparing and combination diuretics), nitrates, statins and antiplatelet agents (aspirin or clopidogrel for MI or IHD; aspirin or dipyridamole for CVD).
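As an illustration of the 30-day case fatality measure, the sketch below computes the proportion of incident cases followed by death from any cause within 30 days. It assumes that a death on the event date or exactly 30 days later is counted; the text does not specify how these boundaries were handled. Function names are illustrative.

```python
from datetime import date


def died_within(event_date, death_date, days=30):
    """True if death occurred from the event date up to `days` later (inclusive)."""
    if death_date is None:  # no death recorded for this patient
        return False
    return 0 <= (death_date - event_date).days <= days


def case_fatality_rate(cases, days=30):
    """Proportion of (event_date, death_date) cases that died within `days`."""
    cases = list(cases)
    if not cases:
        raise ValueError("no cases supplied")
    deaths = sum(died_within(event, death, days) for event, death in cases)
    return deaths / len(cases)


if __name__ == "__main__":
    cases = [
        (date(2005, 3, 1), date(2005, 3, 20)),  # died after 19 days
        (date(2005, 3, 1), None),               # alive
        (date(2005, 3, 1), date(2006, 1, 1)),   # died much later
        (date(2005, 3, 1), date(2005, 3, 31)),  # died on day 30
    ]
    print(case_fatality_rate(cases))
```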

Statistical analysis

Incidence rates were calculated excluding patients with events in the relevant data set(s) prior to 1 January 2005. Incidence rates are expressed per 100 000 patient-years (based on total number of days of follow-up for each patient within each respective group). Statistical differences in patient characteristics (including drug treatment) between coding categories were evaluated using χ2 tests (for proportions) and Kruskal-Wallis non-parametric analysis of variance (for continuous data). The association between coding and 30-day case fatality was assessed by logistic regression, including the covariates age, sex, deprivation, smoking status, hypertension, diabetes and Charlson Index. Differences in the four incidence rates obtained were examined using Poisson regression.
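The incidence rate calculation can be sketched as below. The conversion of follow-up days to years uses 365.25 days per year, which is an assumption; the text states only that rates were based on total days of follow-up. The function name is illustrative and the figures in the example are invented for demonstration, not taken from the study.

```python
def incidence_rate_per_100k(n_events, follow_up_days):
    """Incident events per 100 000 patient-years.

    `follow_up_days` holds one entry per patient: the number of days that
    patient was followed up within the group.
    """
    patient_years = sum(follow_up_days) / 365.25  # days-per-year value is an assumption
    if patient_years <= 0:
        raise ValueError("total follow-up must be positive")
    return 100_000 * n_events / patient_years


if __name__ == "__main__":
    # Invented example: 1000 patients followed for 730.5 days each, 10 events.
    print(incidence_rate_per_100k(10, [730.5] * 1000))
```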

Data management was carried out using Microsoft SQL Server 2000. Statistical analysis was performed using SPSS V.17 (SPSS Inc.).


Differences in identification of incident events

There were a total of 240 846 patients, evenly distributed between the four coding groups. Numbers of incident events are shown in table 1. Incidence rates for the three conditions are shown in figure 2. There was strong evidence (p<0.001, Poisson regression) that the incidence rates for all three clinical conditions depend on which data set(s) are used to identify cases. In all cases, the pooled GP/hospital data produced the highest incidence rates (376, 1089 and 767 per 100 000 patient-years for MI, IHD and CVD, respectively), and the paired GP/hospital data gave the lowest incidence rates (188, 489 and 272 per 100 000 patient-years, respectively). There was no evidence that the incidence rates based on only GP data differed from those of the hospital data for either MI (p=0.14) or CVD (p=0.27), but there was strong evidence that hospital-based rates were higher for IHD (975 and 673 events per 100 000 patient-years for hospital and GP, respectively, p<0.001). The pooled GP/hospital data produced slightly higher incidence rates than hospital data alone for CVD (p<0.001) and marginally so for MI (p=0.048) and IHD (p=0.066).

Table 1

Variation of patient characteristics with different methods of identifying cases

Figure 2

Incidence rates, expressed per 100 000 patient-years, for different clinical conditions over a 2-year time period beginning 1 January 2005, based on general practice (GP), hospital, paired GP/hospital and pooled GP/hospital data. CVD, cerebrovascular disease; IHD, ischaemic heart disease; MI, myocardial infarction.

Patient characteristics

Patient characteristics are shown in table 1 for all three clinical conditions. There was no evidence that rates of diabetes and hypertension, or the distribution of sex or deprivation, varied between coding groups. Greater numbers of smokers were found in the paired GP/hospital group for patients with MI (45% in the paired group compared with 28%–34% in the other groups, p=0.028) and IHD (35% compared with 24%–27%, p=0.021). The level of comorbidity for all conditions, as measured by the Charlson Index, is lower in the paired GP/hospital group (1.8, 1.3 and 1.9 for MI, IHD and CVD, respectively) and higher in the hospital group (2.2, 1.7 and 2.4, respectively, p≤0.014). For IHD and CVD, there is evidence that patients identified using solely GP or solely hospital data were slightly younger.


Differences in prescribing rates were observed between coding groups (table 2). These were most marked for IHD, where rates of prescribing of ACE inhibitors, β-blockers, nitrates, statins and antiplatelet agents were higher in the paired group (p≤0.013). However, this finding did not appear to be replicated for MI specifically. For CVD, prescribing rates for statins and antiplatelet agents were lower in the hospital group (p≤0.022).

Table 2

Variation of patient characteristics with different methods of identifying cases

Case fatality

Considerable 30-day case fatality rate differences exist for all three conditions depending on the coding used (p≤0.002, table 3). Rates for all conditions are highest in patients coded only in hospital and lower in the GP and paired GP/hospital groups. The most striking differences were observed for MI, with a 30-day case fatality rate of 20% for the hospital group but only 4% for the GP group.

Table 3

Variation of case fatality rates with different methods of identifying cases


As electronic healthcare data are increasingly used for clinical trials and epidemiological research, researchers need to understand whether additional information can be gained by linking two (or indeed more) electronic health record data sources. However, where there is overlap between the constituent data sets, such as with coding of clinical conditions, the researcher needs to decide which data set to rely on for identifying cases, or indeed whether combining information from both data sets may be of value. Our study demonstrates that the method of coding MI, IHD and CVD appears to result in identification of different types of patient, in particular as characterised by prescribing and case fatality rates. Incidence rates of disease also vary depending on the coding method used.

Previous work examining the epidemiology of cardiovascular disease has been conducted in Scotland using routine clinical data. Primary care data have been used to demonstrate that IHD is a common problem associated with male gender, increasing age and socioeconomic deprivation.12 Yet the recording of IHD data varies in general practice with different methods used for case detection.13 Furthermore, external factors such as payment-for-performance have been shown to improve the recording of IHD-related health indicators.14 Such incentivisation was introduced to UK general practice (but not hospital practice) in 2004, and so it is possible that this may have reduced the discrepancies between hospital and GP data in our study. Interestingly, pooling of GP and SMR records has previously been advocated for detecting MI cases,15 and pooled GP/SMR data from the same data set we used have demonstrated differences between cohorts of incident and prevalent MI.16 However, the effect of using only one component of such a data set has been hitherto unknown.

Reasons for differences in incidence rates and patient characteristics

Our data do not allow us to determine the exact cause of our findings, but a number of hypotheses may be proposed. Incident disease is reassuringly similar between GP and hospital groups for MI and CVD. The lower incidence of IHD for the GP group reflects the fact that many patients will have had relatively stable coronary disease for a number of years but not necessarily required acute hospital admission. Thus, many GP episodes of IHD do not count as true incident cases because the patients concerned have had prior contact with the GP, whereas a higher number of hospital episodes are incident cases because these patients have never previously been admitted. The lower incidence rates for the paired GP/hospital group, and higher incidence rates for the pooled GP/hospital group, are inevitable consequences of the way in which the two data sets are united, although the magnitude of these differences will nonetheless reflect the degree of inconsistency in coding between the two. Furthermore, because the paired GP/hospital data considerably underestimate the true disease incidence, pairing is probably not a useful method for identifying cases, even though such cases might be more rigorously identified. In addition, the increase in incidence rate using the pooled GP/hospital data demonstrates the potential advantage of combining two data sets, over use of a single data set, from the perspective of improving case finding.

The discrepancies in death rates are probably relatively straightforward to explain. Acute MI admission has a high case fatality,1 but those surviving beyond discharge have a much lower case fatality subsequently. It seems likely that the GP may fail to record the cause of death in patients who do not survive the hospital admission, thus resulting in the lower case fatality rates observed in the paired GP/hospital coding group. Furthermore, it is possible that patients coded only by the GP may represent ‘less serious’ illness, where hospitalisation is not deemed necessary by the GP. It is recognised that many patients suffering relatively minor strokes may not be admitted to hospital,17 resulting in lower case fatality for CVD in the GP group, although with the growing availability of active treatment options for ischaemic stroke in the form of thrombolysis, this may well change. We used national mortality data to identify deaths from both GP and SMR data sets, so discrepancies in recording of death between GP and hospital are unlikely to explain the differences in case fatality rates observed. Furthermore, the majority of paired events share exactly the same date, suggesting that retrospective date entry by the GP of the hospital event is common, and thus, there is no reason why this could not be carried out for fatal events.

The higher prescribing rates for IHD in the paired coding group are probably due to GPs responding appropriately to secondary care instigated intervention, reflected in appropriate treatment. That such differences were not observed for MI may be due to better communication and awareness for this specific condition compared with other IHD, such as angina, meaning that prescribing in the hospital group appears just as good as for the paired GP/hospital group. However, fewer MI events may have left us underpowered to detect differences. The lack of difference in the GP and paired groups for CVD may reflect poorer awareness of stroke management guidelines18 in comparison with coronary heart disease, and so prescribing rates are consequently no higher in the paired group. The lower prescribing rates of statins and antiplatelet agents in the CVD hospital group may reflect the GP being unaware of these patients' clinical need resulting in undertreatment; this is supported by the higher prescribing rates in the paired group. The differences in other patient characteristics—specifically smoking and comorbidity—are less easy to understand but may represent increased disease severity and mortality in hospitalised smokers and multimorbid patients. The small differences in age (<3 years) seem unlikely to be clinically relevant, although may be pertinent from the public health perspective. Finally, it may be that miscoding of diagnoses may explain some of the above differences; for instance, heart failure may be used as an alternative but incorrect code for MI.19 Furthermore, the introduction of sensitive troponin assays has influenced MI detection rates20; it is possible that lack of familiarity among some clinicians for the resulting terms (eg, non-ST elevation MI, acute coronary syndrome) may result in inaccurate diagnoses being recorded.


This study has highlighted important issues related to patient coding and linked data, but although it has the advantage of using a reasonably large routine data set, linked at the individual patient level, a number of issues and limitations should be considered. The relatively small number of GP surgeries (40) may not have been fully representative. In addition, the number of events is relatively small, and given the conservative nature of the χ2 test, this increases the possibility of type 2 errors; thus, a larger data set may have identified more differences between groups. We restricted our analysis to four simple coding groups—GP, hospital, paired and pooled GP/hospital. However, it is clear that there are many further ways of categorising events, including the presence or absence of prior or subsequent coding based on the alternative half of the data set. For instance, an incident GP event with a historical hospital event may be coded differently to a GP event with no previous hospital record. However, we found that many of these theoretical categories have only a handful of cases. Furthermore, even when we examined six or seven separate smaller coding categories, similar differences in patient characteristics persisted between groups (data not shown). Our choice of four main groups was therefore a pragmatic one, which reflects the choice that would face a researcher dealing with a similar linked data set. The decision to use a 30-day limit for pairing data could also be questioned; we are unable to prove that these two events are truly the same clinical episode. The choice was again, therefore, partly pragmatic, although supported by examination of the distribution of time gaps between the GP and hospital data. We did not limit the lead-in time period prior to 1 January 2005 in any way. 
Length of GP records is generally greater and more variable than SMR records, and there is the potential to see a lower number of new incident events among persons with longer GP records. Our study used routine GP data, and it is possible that such profound differences may not be found with research-standard databases, such as the General Practice Research Database (GPRD).21 Nonetheless, work linking primary care research databases to hospital (and other) records is ongoing, and the issues raised by our study must be acknowledged. The SMR data set only records hospital events in Scotland and thus fails to capture events elsewhere in the UK or abroad. Similar issues face the English equivalent Hospital Episode Statistics, and a UK-wide hospital events data set would be valuable. SMR (and Hospital Episode Statistics) also provide multiple diagnostic codes for a single event. We elected to use all six diagnostic positions to ensure maximum capture of relevant hospital events. However, the robustness of low-priority diagnoses might be questioned. Nonetheless, we found similar results when we used only two diagnostic positions (data not shown). We also did not examine miscoding of events, for example, a code of angina being used rather than the code for MI. Coding of SMR is considered 99% complete and 88% accurate8; corresponding metrics are not available for PTI data (although the completeness and accuracy of Read coding of morbidity in Scottish general practice have been shown previously to be greater than 91%22). Furthermore, the two data sets use different coding systems, so completely reliable comparison is not possible. However, we used relatively broad definitions, and the Read code system is based on ICD. Nonetheless, we may in particular have missed some administrative Read codes, which might have enabled identification of additional cases in the GP group. 
Of course, ideally further validation of the coding should be conducted; linkage to laboratory data might be one way of achieving this. Finally, our 30-day limit for prescribing was selected from a pragmatic perspective. However, it is possible that patients who were admitted for over 30 days would not have had a new prescription issued by the GP within the 30-day post-event period, resulting in an apparent underestimation of prescribing. We believe that these numbers will be relatively small, however, and unlikely to alter the overall interpretation of our findings.

Research and policy implications

These results have significant implications for linked data; the drug management, disease severity and to some degree the patient characteristics vary depending on how the disease cohort is defined. They also have implications for the use of unlinked routine data—use of isolated primary or secondary care data may result in a biased selection of patients. This may affect patient recruitment as well as the validity and reliability of such information sources as secondary data in clinical trials, including clinical outcomes. It is similarly relevant to the public health environment. Using linked data allows one to have a more robust definition, by using pairs of GP and hospital codes only, but it is clear that the apparent incidence of a disease will be considerably lower. Alternatively, linked data enable a looser but more inclusive disease definition, using both GP and hospital data, but not relying on the coding occurring simultaneously. When using separate data from only one source, one needs to take into account that patient characteristics may not be representative of the wider population. It is difficult to recommend one coding approach over another, however, and the decision will need to be based on the specific question being posed.


In conclusion, patient characteristics vary depending on whether GP, hospital or combined definitions of cardiovascular events are used. In particular, disease severity as measured by mortality varies considerably. This has important implications for studies using linked routine primary and secondary care data, and for studies where information is only available from one of these sources. These issues should be acknowledged by studies using routine data as a secondary data source, and further work is merited to examine whether similar discrepancies exist for other clinical conditions or within primary care research databases.



  • To cite: Payne RA, Abel GA, Simpson CR. A retrospective cohort study assessing patient characteristics and the incidence of cardiovascular disease using linked routine primary and secondary care data. BMJ Open 2012;2:e000723. doi:10.1136/bmjopen-2011-000723

  • Contributors RAP conceived the study. RAP and GAA contributed to the study design, analysis and interpretation and to the drafting of the article. CRS acquired the data and set up the linked database. All authors contributed to the critical revision of the paper and approval of the final version.

  • Funding This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement There are no additional data available.