Purpose The Royal College of General Practitioners Research and Surveillance Centre (RCGP RSC) is one of the longest established primary care sentinel networks. In 2015, it established a new data and analysis hub at the University of Surrey. This paper evaluates the representativeness of the RCGP RSC network against the English population.
Participants and method The cohort includes 1 042 063 patients registered in 107 participating general practitioner (GP) practices. We compared the RCGP RSC data with English national data in the following areas: demographics; geographical distribution; chronic disease prevalence, management and completeness of data recording; and prescribing and vaccine uptake. We also assessed practices within the network participating in a national swabbing programme.
Findings to date We found a small over-representation of people in the 25–44 age band and of less deprived people, and a small under-representation of white ethnicity. Practices are concentrated in London, with fewer practices in the southwest and east of England. We found differences in the prevalence of diabetes (national: 6.4%, RCGP RSC: 5.8%), learning disabilities (national: 0.44%, RCGP RSC: 0.40%), obesity (national: 9.2%, RCGP RSC: 8.0%), pulmonary disease (national: 1.8%, RCGP RSC: 1.6%), and cardiovascular diseases (national: 1.1%, RCGP RSC: 1.2%). Data completeness for risk factors in the diabetic population is high (77–99%). We found differences in prescribing rates and costs for infections (national: 5.58%, RCGP RSC: 7.12%), and for nutrition and blood conditions (national: 6.26%, RCGP RSC: 4.50%). Differences in vaccine uptake were seen in patients aged 2 years (national: 38.5%, RCGP RSC: 32.8%). Owing to large numbers, most differences were significant (p<0.00015).
Future plans The RCGP RSC is a representative network, having only small differences with the national population, which have now been quantified and can be assessed for clinical relevance for specific studies. This network is a rich source for research into routine practice.
- PRIMARY CARE
- PUBLIC HEALTH
- STATISTICS & RESEARCH METHODS
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Strengths and limitations of this study
The Royal College of General Practitioners Research and Surveillance Centre network is representative of the national population on a variety of domains, both demographic and clinical, supporting the network's suitability for real-world evidence research.
Statistically significant differences were found, owing to the large sample size; this profile quantifies them so that their clinical relevance can be assessed on a study-by-study basis.
Some of the analysis compared data extracted using different methodology, which may have skewed the results.
The network comprises volunteer practices, which show better management of clinical conditions; this may introduce a bias to research into clinical outcomes.
Data from primary care sentinel networks is widely used in research and surveillance;1 the Royal College of General Practitioners Research and Surveillance Centre (RCGP RSC) is one of the longest established.2 These networks are of particular importance for conditions largely managed in primary care.3 However, outputs may be biased if the sample is not representative of the monitored population, or if there is uncertainty about the denominator.4 Estimating rates of disease is easiest from countries that have registration-based primary care systems,5 though useful data can be obtained from health systems that do not require registration.6
Demographics and geographical spread are important, and many sentinel networks are involved in the surveillance of infectious conditions, such as influenza, and assessment of vaccine effectiveness. Different age-groups and differing levels of deprivation have different rates of many illnesses, including rates of swab-confirmed influenza.7 There are also disparities associated with ethnicity, though these are not consistent.8–10
For infectious diseases like influenza, geographic spread of sentinel practices is important, as well as consistent case ascertainment from year to year.11 It is also important that the characteristics of the participating practices are representative.12 In particular, when assessing vaccine effectiveness, vaccine uptake needs to be properly recorded within the database, in a way that is representative of the national population rates.13
Disease, prescription and vaccination patterns, as well as the underlying population demographics, are inter-related. Any assessment of quality requires an assessment of the rate of case ascertainment followed by an evaluation of the management of the condition.14 Some sentinel networks have focused on the management of long-term conditions, including diabetes.15 Chronic conditions, such as diabetes, benefit from improved recording of comorbidities and risk factors.
The RCGP RSC is a network of general practices, which extracts data from the computerised medical record systems of over 100 practices in England. The network established a weekly returns service in 1964, which has enabled prompt surveillance of infectious diseases and identification of epidemics, with influenza surveillance as a key priority for the network.
The characteristics of the RCGP RSC practice network have previously been compared with population-level data to ascertain the representativeness of the sample.16 However, since the most recent report in 2009, there have been substantial changes within the network, including the commissioning in 2015 of an entirely new data and analytics hub at the University of Surrey.
We carried out this study to describe the representativeness of the RCGP RSC practice network comparing population demographics; geographical distribution; the prevalence and management of chronic conditions, and data completeness taking type 2 diabetes mellitus (T2DM) as an exemplar condition; medication prescribing; and vaccine uptake.
We compared four key areas of representativeness with the national population:
Demographics: age, gender, ethnicity and deprivation;
Geographical distribution of practices;
Prevalence and management of chronic conditions, and the data completeness of risk factors for T2DM patients;
Medications prescribing rates and costs, and influenza vaccine uptake.
We also compared the demographic characteristics of the practices within the RCGP RSC network which provide nasopharyngeal virology swabbing specimens for Public Health England's (PHE) viral infection surveillance programme, with those that do not. The data for this study was extracted from practices providing data uploads in March 2015 into the new RCGP RSC data upload system. RCGP RSC member practices are all volunteers who sign a written agreement to participate in surveillance and research. The practice upload includes all registered patients' coded data.
The data extracted by the RCGP RSC sentinel system is the principal primary care public health surveillance data used by PHE. All data extracted are pseudonymised as close to source as possible, and kept in a secure network database. Patients with flags in their records suggesting they have opted out of data sharing do not have their data retained or analysed. Data used for surveillance is part of direct patient care. However, all planned research use of this data requires review by research ethics committees prior to use.
The RCGP RSC population was divided by gender, aggregated into the age bands used in the 2011 Census for England and Wales, and compared with national census data available from the Office for National Statistics. Age was calculated from the date of birth at the point of extraction. Similarly, patients were classified into five ethnic groups (according to census categories), and compared against national data. We used an updated version of an established ethnicity classification mapping to maximise the use of any ethnicity data.17
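The age banding step can be sketched as follows. This is a minimal illustration, not the network's extraction code: the function names are hypothetical, the bands up to 65–74 years are those named in this paper, and the open-ended 75+ band is an assumption.

```python
from bisect import bisect_right
from datetime import date

# Lower bounds of the age bands. The bands up to 65-74 are named in this
# paper; the open-ended "75+" band is an assumption made for illustration.
BAND_STARTS = [0, 5, 15, 25, 45, 65, 75]
BAND_LABELS = ["0-4", "5-14", "15-24", "25-44", "45-64", "65-74", "75+"]


def age_at(date_of_birth: date, extraction_date: date) -> int:
    """Age in completed years on the extraction date."""
    had_birthday = (extraction_date.month, extraction_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return extraction_date.year - date_of_birth.year - (0 if had_birthday else 1)


def age_band(age: int) -> str:
    """Map an age in completed years to its band label."""
    return BAND_LABELS[bisect_right(BAND_STARTS, age) - 1]
```

For example, a patient born on 15 March 1980 has not yet had a birthday on a hypothetical extraction date of 14 March 2015, so `age_at` gives 34 and `age_band` places them in the 25–44 band.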
In the process of data extraction, patient postcodes were converted into Lower Super Output Areas (LSOAs),18 and these were used to assign an Index of Multiple Deprivation (IMD) score to each patient.19 Postcode data was then removed from the final extract to maintain pseudonymisation. We also compared the demographics (age, gender, ethnicity and IMD scores) of practices providing viral throat swabs for PHE with practices which do not.
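A minimal sketch of this postcode handling is given below. The lookup tables, codes and scores are invented for illustration (real lookups would come from published postcode-to-LSOA and LSOA-to-IMD files), and the choice to keep only the IMD score in the output is an assumption of the sketch rather than a description of the network's extract.

```python
# Hypothetical lookup tables; all postcodes, LSOA codes and scores are invented.
POSTCODE_TO_LSOA = {"AB1 2CD": "E01000001", "EF3 4GH": "E01000002"}
LSOA_TO_IMD = {"E01000001": 12.3, "E01000002": 34.5}


def pseudonymise(record: dict) -> dict:
    """Return a copy of the record with the postcode replaced by an IMD score."""
    lsoa = POSTCODE_TO_LSOA.get(record["postcode"])
    imd = LSOA_TO_IMD.get(lsoa)  # None when the postcode is not recognised
    # Copy every field except the postcode, so it never reaches the extract.
    cleaned = {key: value for key, value in record.items() if key != "postcode"}
    cleaned["imd_score"] = imd
    return cleaned
```

The deprivation score is derived before the postcode is discarded, so the output record carries the area-level measure without the identifying field.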
We used LSOA data, coordinates and map files provided by the Office for National Statistics geographical survey,20 to display the geographical distribution of the practices. We plotted each GP practice postcode on a map of English Clinical Commissioning Group (CCG) regions, using the statistical software R and the package maptools.21
We compared the prevalence and quality of management of key chronic diseases. We used the data collected for a national pay-for-performance scheme, the Quality Outcomes Framework (QOF), to compare the prevalence of chronic conditions that are part of the scheme,22 for RCGP RSC practices and at the national level. QOF quality points were used as a surrogate of the quality of care and management of these conditions.
Additionally, we selected diabetes as an exemplar condition to demonstrate the level of data quality found in the database for chronic conditions. We assessed the level of completeness in data recording for a number of demographics (age, gender, deprivation) and risk factors (smoking status; systolic blood pressure; body mass index (BMI, or height and weight); ethnicity; cholesterol and glycaemic control), within the RCGP RSC database.
We only analysed coded data to define data completeness; while free-text data can be extracted, there are risks of extracting patient-identifiable data, so this data is not used within the RCGP RSC network. Coded data in primary care has been widely used in health outcomes research,23 and has known strengths and limitations.5
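Completeness of coded data can be expressed as the percentage of patients with a recorded value for each field. The sketch below assumes a hypothetical record layout in which a missing coded value is stored as `None`; the function and field names are illustrative only.

```python
def completeness(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Percentage of records holding a non-missing coded value for each field."""
    total = len(records)
    if total == 0:
        raise ValueError("at least one record is required")
    return {
        field: 100.0 * sum(r.get(field) is not None for r in records) / total
        for field in fields
    }


# Invented example records, not taken from the RCGP RSC database.
example = [
    {"smoking_status": "never", "bmi": 27.1},
    {"smoking_status": "current", "bmi": None},
    {"smoking_status": None, "bmi": 31.4},
    {"smoking_status": "ex", "bmi": None},
]
result = completeness(example, ["smoking_status", "bmi"])
```

Running the same function separately on patients with and without diabetes would give the kind of side-by-side comparison described in this section.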
Medication prescribing and vaccine uptake
Prescription data for England was collated via the Prescription Cost Analysis 2014 report,24 and these were compared with the RCGP RSC data for the same year. We calculated proportions for the most commonly prescribed British National Formulary (BNF) chapters (gastrointestinal, cardiovascular, respiratory, central nervous, endocrine, nutrition/blood and infections), and compared these with the national proportions.
The average cost per item of each BNF chapter was multiplied by the quantities prescribed to obtain costs per chapter, for both the RCGP RSC practices and national data. Additionally, we compared influenza vaccine uptake in the RCGP RSC database with national rates published by PHE,25 focusing on four at-risk age groups (children aged 2, 3 and 4 years, and adults aged over 65 years).
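The cost calculation can be sketched as below. The item counts and average costs are invented, the function name is hypothetical, and taking each proportion over the chapters supplied (rather than over all prescribing) is an assumption of this sketch.

```python
def chapter_shares(
    items: dict[str, int], avg_cost_per_item: dict[str, float]
) -> tuple[dict[str, float], dict[str, float]]:
    """Return (item share %, cost share %) for each BNF chapter supplied."""
    # Cost per chapter = average cost per item x number of items prescribed.
    costs = {ch: items[ch] * avg_cost_per_item[ch] for ch in items}
    total_items = sum(items.values())
    total_cost = sum(costs.values())
    item_share = {ch: 100.0 * n / total_items for ch, n in items.items()}
    cost_share = {ch: 100.0 * c / total_cost for ch, c in costs.items()}
    return item_share, cost_share


# Invented figures for two chapters only.
item_share, cost_share = chapter_shares(
    {"infections": 200, "respiratory": 800},
    {"infections": 2.0, "respiratory": 1.0},
)
```

In this invented example the infections chapter accounts for 20% of items but a third of the cost, which is why items and costs are reported separately.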
We used standard descriptive statistics to compare the RCGP RSC with national data. The χ2 test was used to compare proportions across most variables. Differences in the cumulative frequency distribution of IMD scores were assessed using the Kolmogorov-Smirnov test. The statistical software R was used for all tests of significance, and the p value was reported. Owing to the large number of statistical tests (69 tests), we applied a Bonferroni-Šidák correction26,27 to a level of significance of 0.01, resulting in a new level of significance of 0.00015 (Equation 1). This version of the correction was used since all the tests are independent.
Equation 1 (Bonferroni-Šidák): $\alpha' = 1 - (1 - \alpha)^{1/m}$, where $\alpha'$ = new significance level, $\alpha$ = current significance level, and $m$ = number of tests.
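The corrected threshold can be checked numerically. The sketch below assumes the standard Šidák form of the correction, α′ = 1 − (1 − α)^(1/m), with the values stated in the text (α = 0.01, m = 69); the function names are illustrative, and the simple Bonferroni value is shown only for comparison.

```python
def sidak_alpha(alpha: float, m: int) -> float:
    """Per-test significance level under the Sidak correction."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test significance level under the simple Bonferroni correction."""
    return alpha / m


corrected = sidak_alpha(0.01, 69)        # about 0.0001456
comparison = bonferroni_alpha(0.01, 69)  # about 0.0001449
```

The Šidák value rounds to the 0.00015 threshold used in this paper, and is marginally less strict than the simple Bonferroni value.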
Cohort description and findings to date
The cohort includes data from 1 042 063 patients registered at the time of data upload in one of the 107 participating practices.
Data for age and gender was recorded for the complete RCGP RSC population. The age and gender distribution of the RCGP RSC population was similar to the census distribution (figure 1 and online supplementary table S1). There was a significantly higher proportion of both males and females in the 25–44 years age band when compared with the census, and a lower proportion of people in the 0–4 years age band. All differences were statistically significant, with the exception of people aged 65–74 years, men aged 45–64 years and women aged 5–14 years.
The ethnicity of 630 754 (61%) patients from the RCGP RSC cohort was identified from their medical records. The proportions of the five census ethnicity groups were similar to those from census data (see online supplementary figure S1 and table S2). The majority of patients in the RCGP RSC cohort were of white ethnicity (84.4%), similar to the census population (85.4%). Owing to the large numbers, the differences between the census and RCGP RSC data were significant, even when applying a Bonferroni-Šidák correction.
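The effect of sample size on significance can be illustrated with the white ethnicity figures above. This is a simplified sketch, not the analysis performed in the paper: it collapses the five census groups into white versus all other groups, treats the census proportion as a fixed reference value, and uses a 1-degree-of-freedom χ2 goodness-of-fit test. The function name is hypothetical.

```python
import math


def one_proportion_chi2(
    n: int, observed_prop: float, reference_prop: float
) -> tuple[float, float]:
    """Chi-squared statistic (1 df) and p value for one proportion vs a reference."""
    observed_yes = n * observed_prop
    observed_no = n - observed_yes
    expected_yes = n * reference_prop
    expected_no = n - expected_yes
    chi2 = (observed_yes - expected_yes) ** 2 / expected_yes + (
        observed_no - expected_no
    ) ** 2 / expected_no
    # For 1 degree of freedom, the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Figures from the text: 630 754 patients, 84.4% white vs 85.4% in the census.
chi2_large, p_large = one_proportion_chi2(630754, 0.844, 0.854)
# The same proportions in a hypothetical sample of 1000 patients.
chi2_small, p_small = one_proportion_chi2(1000, 0.844, 0.854)
```

Under these assumptions, the one percentage point difference gives a χ2 statistic of about 506 in the full cohort, far beyond the corrected threshold of 0.00015, whereas the same difference in a sample of 1000 would not approach significance.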
IMD scores were calculated for all patients included in the analysis, using LSOAs (see online supplementary figure S2). The mean IMD score for the RCGP RSC population was 19.8 (SD 0.00682), which was less deprived than the English population (mean IMD score 21.8; SD 0.00050). This is mostly due to a small over-representation of patients in the two least deprived deciles (see online supplementary figure S3). The Kolmogorov-Smirnov test showed that this difference was significant.
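The Kolmogorov-Smirnov statistic is the largest vertical gap between two cumulative frequency distributions. The sketch below computes the two-sample statistic only (no p value) on invented scores; it illustrates the idea and does not reproduce the R routine used for the analysis.

```python
from bisect import bisect_right


def ks_statistic(sample_a: list[float], sample_b: list[float]) -> float:
    """Maximum absolute difference between two empirical cumulative distributions."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # Both step functions change only at observed values, so it is enough
    # to compare them at every observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Invented IMD-like scores: the second sample is shifted towards higher values.
gap = ks_statistic([10, 20, 30, 40], [20, 30, 40, 50])
```

A statistic of 0 means the two distributions coincide, and 1 means they do not overlap at all; with very large samples, even a small gap can be statistically significant.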
The demographic characteristics of the practices participating in the PHE swabbing programme were compared with non-participating practices. There was a significantly higher proportion of people in the 25–44 years age band for participating practices, and a lower proportion of people in the 65+ years band (see online supplementary figure S4). All differences were statistically significant, with the exception of men aged 15–24 years and women aged 5–14 years.
Participating practices had significantly fewer people in the white census group (75%) compared to non-participating practices (83%), and more people in ethnic minority groups (see online supplementary figure S5). Finally, the population in participating practices was more deprived (mean IMD score 21.5, SD 0.018) than that of the non-participating practices (mean IMD score 19.0, SD 0.010, see online supplementary figure S6), and the Kolmogorov-Smirnov test showed this to be significant.
The RCGP RSC practices are broadly distributed across England, with a higher concentration of practices in London, and a slightly lower number in the southwest and east of England (figure 2). The distribution shows a number of practice clusters, probably due to effective recruitment strategies using the support of CCGs.
The prevalence of common chronic diseases was similar to that reported nationally in the QOF scheme for chronic disease management (figure 3 and online supplementary table S3i). Diabetes (national: 6.4%, RCGP RSC: 5.8%), learning disabilities (national: 0.44%, RCGP RSC: 0.40%), obesity (national: 9.2%, RCGP RSC: 8.0%), and pulmonary disease (national: 1.8%, RCGP RSC: 1.6%) are slightly under-represented, while cardiovascular diseases (national: 1.1%, RCGP RSC: 1.2%) are slightly over-represented.
All differences were significant, with the exception of asthma, chronic kidney disease, depression, palliative care and stroke. The median percentage of QOF targets achieved by the RCGP RSC network (98.76%; IQR 2.64%) was higher than the national median percentage (97.04%; IQR 5.45%), due to an over-representation of the network in the higher score deciles (see online supplementary figure S7).
Diabetes was taken as an exemplar condition to demonstrate the completeness of recorded data within the RCGP RSC database (table 1). This analysis was performed on patients who have complete data for the past 10 years. The recording of data related to cardiovascular risk factors ranges from 77% to 99% for the diabetic population, and from 26% to 69% for the non-diabetic population. Owing to the introduction of QOF targets, the data completeness related to risk factors or clinical management is expected to be higher for patients with a chronic condition.28
The comparison with non-diabetic patients is provided to demonstrate the improvement in recording expected for patients with a chronic condition, compared to those without. Since the QOF targets incentivise accurate recording of chronic conditions diagnoses, and of continuous assessments of risk factors for patients with these conditions, primary care data is particularly well suited for longitudinal research into chronic conditions.
Medication prescribing and vaccine uptake
Prescription items and costs by BNF chapter in the RCGP RSC data set were compared with national data. The RCGP RSC had higher prescribing rates and costs for drugs within the infections chapter (national: 5.58%, RCGP RSC: 7.12%), while the practices in the network prescribed fewer drugs in the nutrition and blood chapter (national: 6.26%, RCGP RSC: 4.50%). These differences were significant (figure 4), and were also reflected in the prescription costs (see online supplementary figure S8).
Influenza vaccine exposure rates within the RCGP RSC network were compared with the national rates published by PHE. Influenza vaccine uptake in the RCGP RSC network was similar to, although slightly lower than, the national rates (figure 5). The largest difference was seen in patients aged 2 years (national: 38.5%, RCGP RSC: 32.8%), while the smallest difference was in patients aged over 65 years (national: 73%, RCGP RSC: 71%).
Strengths and limitations
This study shows the representativeness of the RCGP RSC against nationally published data on a variety of domains, both demographic and clinical. Beyond the areas concerning surveillance, the RCGP RSC database is also representative in a number of clinically relevant areas, which support the network's suitability for real-world evidence research.
This paper provided a clear assessment of the prevalence of chronic conditions when compared with national data, and the completeness of recorded data for risk factors, comparing patients with T2DM and those without. The purpose of this was to assess the suitability of longitudinal research into chronic conditions. It was shown that the data completeness for risk factors drastically improves for patients with an exemplar chronic condition (T2DM), due to the QOF scheme in English primary care.
This paper also highlighted key differences of this network against a national population, due to the nature of its primary purpose. Prescriptions for infections were higher than national levels (though not vaccine uptake), and there was an over-representation of practices in the higher deciles of QOF scores. This is because the RCGP RSC is a sentinel network comprised of volunteer practices, and focused around infectious and respiratory disease surveillance. These differences have been evaluated in this paper, and their relevance can be considered for research using this database.
Most of the differences found were statistically significant; this is due to the large numbers involved in the analysis, which means that very small differences can be easily detected, even when a Bonferroni-Šidák correction is applied to adjust the significance level. Though the difference introduces a bias to the cohort, it has, nonetheless, been quantified in this paper, and can be assessed on a study-by-study basis for its clinical relevance.
An important limitation of this study is that a number of areas (ethnicity, prescriptions, vaccine uptake) were assessed by comparing publicly available national rates with data extracted from the RCGP RSC database. The criteria used to extract the network's rates may not have been the same as those used for the published national data, which may have contributed to the significant differences observed.
It must be noted that recruitment efforts within the network continue to be expanded, particularly regarding new projects, such as the Integrate study for gastrointestinal disease surveillance29 which is focused in the northwest of England; this shift in geographical focus may change the demographic structure of the network. Therefore, the assessment of representativeness presented in this paper may change in future.
Conclusions and future collaborations
This network provides a representative sample of the population of England in terms of demographics and clinical outcomes. Future recruitment needs to ensure that, as far as possible, any areas of difference are minimised. The RCGP RSC network, in addition to surveillance, could also be used for research into routine practice, and the interaction of infectious disease with long-term conditions.
The use of large healthcare data sets continues to expand,30 with many networks having been set up in the UK providing regular updates to large data sets.31–33 These networks are recognised as having a vital role in disease surveillance, monitoring response to policy change, identification of health-environment interactions and monitoring of side effects of medication.34–37 More recently, it has been suggested that these large data networks could also be used to perform large-scale clinical trials and ensure real-world medication effectiveness for new medications.38
This is particularly important, as clinical trial populations are often not representative of the real-world population, where any interventions or new treatments are ultimately implemented.39,40 If clinical outcomes are to be measured using these data sets, it is important that they provide a representative sample of the underlying population, reporting any deviations. The RCGP RSC network is representative of the underlying English population, and key significant differences have been clearly quantified in this paper, making this database a rich source for health outcomes research.
It is expected that this database will be available to researchers on a case-by-case basis. Ethical approval by the NHS Research Ethics Committee is needed for data requests to be considered. Data requests for aggregated data will be provided by the University of Surrey team. Researchers wishing to directly analyse the patient-level anonymised data will be required to complete information governance training and work on the data from the secure servers at the University of Surrey. We encourage interested researchers to attend the short courses on how to analyse primary-care data offered by the university twice a year.
The authors would like to thank the participating practices and patients for providing the data for this cohort.
Contributors AC led the analysis of data and drafting of the paper. WH and AM supported the analysis of data. AM contributed to the drafting of the paper. JvV and SJ supported the statistical analysis of the data. IY supported the development of the cohort and contributed to the drafting of the paper. SdL led the development of the cohort, conceived of the idea, and contributed to the drafting of the paper.
Funding The development of the RCGP RSC data set is supported by surveillance work funded by Public Health England.
Competing interests The RCGP RSC is primarily funded by Public Health England.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement The RCGP RSC data set can be accessed by bona fide researchers on a case-by-case basis. Ethical approval by the NHS Research Ethics Committee is needed for data requests to be considered. Aggregated data tables may be created from the source data to allow specific analyses for approved research and surveillance projects. Researchers wishing to directly analyse the patient-level anonymised data will be required to complete information governance training and work on the data from the secure servers at the University of Surrey. Patient-level data cannot be taken out of the secure servers at the University of Surrey. We encourage interested researchers to attend the short courses on how to analyse primary-care data offered by the university twice a year.
i The denominators vary by condition, depending on the QOF definition (all population, population over 18 years, etc).