Objective The use of electronic medical record (EMR) data is necessary to improve clinical research efficiency. However, it is not easy to identify patients who meet research eligibility criteria and collect the necessary information from EMRs because the data collection process must integrate various techniques, including the development of a data warehouse and translation of eligibility criteria into computable criteria. This research aimed to demonstrate an electronic medical records retrieval system (ERS) and an example of a hospital-based cohort study that identified both patients and exposure with an ERS. We also evaluated the feasibility and usefulness of the method.
Design The system was developed and evaluated.
Participants In total, 800 000 cases of clinical information stored in EMRs at our hospital were used.
Primary and secondary outcome measures The feasibility and usefulness of the ERS, the method to convert text from eligible criteria to computable criteria, and a confirmation method to increase research data accuracy.
Results To comprehensively and efficiently collect information from patients participating in clinical research, we developed an ERS. To create the ERS database, we designed a multidimensional data model optimised for patient identification. We also devised practical methods to translate narrative eligibility criteria into computable parameters. We applied the system to an actual hospital-based cohort study performed at our hospital and converted the test results into computable criteria. Based on this information, we identified eligible patients and extracted data necessary for confirmation by our investigators and for statistical analyses with our ERS.
Conclusions We propose a pragmatic methodology to identify patients from EMRs who meet clinical research eligibility criteria. Our ERS allowed for the efficient collection of information on the eligibility of a given patient, reduced the labour required from the investigators and improved the reliability of the results.
- Oral & Maxillofacial Surgery
- Public Health
- Statistics & Research Methods
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.
Statistics from Altmetric.com
The focus of this work was to establish a pragmatic methodology to efficiently collect information from electronic medical records (EMRs) about patients who meet clinical research eligibility criteria.
The use of EMR data is necessary to improve clinical research efficiency. However, it is not easy to identify patients who meet research eligibility criteria and collect necessary data from EMRs because the data collection process must integrate various techniques, including the development of a data warehouse and the translation of eligibility criteria into computable criteria. An efficient ERS and a standardised data processing model that integrates these techniques are essential to facilitate clinical research that utilises EMRs.
Strengths and limitations of this study
Our method uses a specialised data model for patient identification in clinical research and efficient data conversion that does not depend on the EMR database structure when converting narrative criteria to computable criteria.
We propose that computable criteria should not be a result of the automated conversion of narrative criteria but rather a result of research preparation involving medical concepts that are not expressed logically or explicitly in the narrative criteria. Therefore a large amount of the conversion of the eligibility criteria to computable criteria should be executed at the protocol development stage.
It is important to further discuss protocol standardisation, including eligibility criteria representation for computable use.
Enabling medical records retrieval system use in and across multiple institutions is an important future task.
Medical information technology has recently advanced in many countries, and enormous amounts of clinical data are already stored as electronic medical records (EMRs). Utilising the data collected in EMRs is necessary to improve clinical research efficiency.1–3 An EMR is a large database of patient data and is used in observational research to investigate the relationships among diseases, treatments and outcomes,4–7 to conduct surveillance for rare drug reactions,4 ,8 and to recruit patients for clinical trials.9–13 However, it is not easy to identify patients who meet research eligibility criteria and collect necessary information from EMRs.2 ,3 Herein, we describe three major issues concerning EMR-based observational studies: EMR patient data retrieval function, eligibility criteria protocol representation and EMR data accuracy.
To identify patients who meet research eligibility criteria, it is necessary to obtain various types of information stored in EMRs by subject, for example, diagnosis and prescribed medications. However, the EMR database is designed to facilitate online transaction processing for rapid and detail-oriented clinical information searches on individual patients, and the current EMR system does not facilitate this retrieval function.2 ,3 ,14 Data warehouses are essential components of data-driven decision support. To allow for efficient research analyses, EMR data must first be warehoused to enable data analyses across patient populations.15–21 However, healthcare data modelling is difficult and time consuming because of the complexity of the medical knowledge involved. Thus, the most common approaches to clinical data warehouse modelling are variations on the entity-attribute-value (EAV) model,22–28 where data are stored in a single table with three columns: entity identification, attribute and attribute value. The EAV design has advantages, including flexibility and ease of storage; however, it requires transforming EAV data into another analytical format before analysis.25 ,28 Online analytical processing (OLAP) is most frequently used for searching data stored in the data warehouse.29–31 OLAP systems in relational databases are typically designed based on Kimball's star schema.32 However, the star schema was devised to facilitate online measurement analyses. In healthcare, this method can be used to dynamically gather online analyses of numeric data (eg, a specific dose of a drug for a specific disease) in clinical practice. Therefore, this method is not suitable for identifying patients who meet the complicated eligibility criteria for a given clinical research study. Data-modelling methods that facilitate the identification of patients and enable the collection of necessary information from EMRs remain to be established.28
Current eligibility criteria are written in a text format that cannot be computationally processed. Additionally, to be applied in actual EMR, eligible criteria need to be integrated with the data model of EMRs.33 Several investigations have sought to establish computable eligibility criteria.34–41 However, there is no consensus regarding a standard patient information model,33 and the eligibility criteria are not yet completely standardised. Using natural language processing (NLP) technologies to convert the text format of eligibility criteria to a computer or to extract patient identifications from EMRs is far from perfect without human intervention.3 ,42 ,43
Current EMRs have been used to support claims for medical service fees and the treatments administered to each patient; therefore, data gathered specifically for research purposes may be incomplete and unreliable.2 ,3 ,44
Although various investigations on each technique are executed individually, standardised methods must still be established that integrate these techniques, facilitate the identification of patients who are eligible for clinical research, and collect necessary information from EMRs.
We designed a pragmatic data processing model optimised for patient identification and for the collection of necessary information from EMRs for clinical research. These tools are implemented as an electronic medical records retrieval system (ERS).44
This research aimed to demonstrate an ERS and an example of a hospital-based cohort study that used the ERS to identify both patients and exposure. Another aim was to evaluate the feasibility and usefulness of the ERS, the method to convert text form eligible criteria to computable criteria, and a confirmation method to increase research data accuracy.
Materials and methods
Outline of our procedure for patient identification and data collection from the EMR
To identify patients who met the eligibility criteria for the clinical research in question, data were collected in the following ways:
The text form of the narrative criteria was converted into computable criteria.
A targeted patient list was created.
A flag was added for investigators to confirm the targeted patient list.
Reports were created for the investigators to confirm.
After confirmation by the investigator, the statistical analyses were executed.
EMR retrieval system
In our hospital, EMR use was introduced in 2005; approximately 800 000 cases of clinical information have already been stored. To comprehensively and efficiently collect information about patients participating in clinical research, we developed an ERS.44
EMRs store various types of information, integrating billing, pharmacy, radiology, laboratory information and others.4 In creating the ERS database, we designed a new data model based on the star schema that was optimised for patient identification in clinical research. We identified nine data categories from EMRs that are useful for clinical research: demographic characteristics, physical findings, diagnostic studies, laboratory tests, diagnoses, progress reports on an EMR template,44 ,45 medications and injections, operation records and other treatments. We then designated these categories to ‘entities’. In our hospital, the diagnosis is managed by codes that were originally defined by our hospital and mapped with International Statistical Classification of Diseases (ICD) 10 codes46 for medical insurance purposes. Operations codes were also managed by codes that originally were defined by our hospital and mapped with ICD-9 Clinical Modification codes. We identified available columns (eg, ICD code, diagnosis date) from the EMR data model and designated these columns as ‘attributes’ of the entities.
Figure 1 presents our data model. In our model, all entities in a given schema are independent and complete; this allows for logical operations and for the creation of eligible patient lists for each respective parameter in a study. The target patient list is generated by combining these patient lists. The data model also supports the inference of medical concepts expressed in the eligibility criteria in reference to corresponding patient data accumulated in EMRs.33 ,34
In our hospital, a replicate of the EMR database known as ‘Open DB’ was established for the secondary use of accumulated EMR data.7 A data mart for our ERS was created to ensure that the data retrieval process was practical and independent of the EMR system structure; the data mart was created on the relational database management system by extracting, transforming and loading (ETL) information from the Open DB.7 ,44 The ETL process is performed automatically once nightly except for the ‘Progress notes by EMR template’ entity, which is referred directly from the Open DB to ensure real-time visibility for the eClinical trial.44
An OLAP tool was installed to efficiently search through data from multiple patients.44 The OLAP tool runs in an Internet browser and can generate structured query language (SQL) based on predefined metadata (ie, a data model) by defining logical queries (ie, programmes) using a graphical user interface (GUI). Moreover, this tool allows reports on information retrieved from the browser to be transcribed using hypertext markup language (HTML). The reports are created in various formats, including portable document format (PDF), comma-separated values (CSV) and extensible markup language (XML).44
To protect personal information in medical records at our hospital, the EMR network is separated physically from other networks. Our data mart and OLAP servers are deployed in the same EMR network and managed using the same EMR security policies. Additionally, the use of our ERS is limited to clinical research approved by the ethics committee at our hospital, and only designated staff members at our centre are allowed to retrieve data. Our centre creates and manages ERS user identification separate from the EMRs. For the external output of CSV and other data, permission must be obtained from our department of medical informatics, and data extraction must be executed in the presence of supervisors who are responsible for protecting personal information at our hospital.
Application to clinical research
We applied the system to a hospital-based cohort study performed at our hospital titled ‘Risk of osteomyelitis of the jaw induced by oral bisphosphonates (BP) in patients taking medications for osteoporosis: a hospital-based cohort study in Japan’,47 in which we identified eligible patients, extracted research data and evaluated the feasibility of our system. The ethics committee at Kyoto University Hospital approved this research. A different paper details the purpose, methods, results and discussion of this research.47
The eligibility criteria were as follows:
Patients diagnosed with osteoporosis and treated with osteoporosis medications at Kyoto University Hospital between November 2000 and October 2010.
Patients aged 20 years or older.
Patients with a history of treatment with radiation therapy to the maxillofacial region.
Patients with primary or metastatic tumours in the maxillofacial region.
Patients treated with intravenous BP.
The data collected were diagnosis, date of diagnosis, sex, birthdate and the doses and dates when osteoporosis medications, steroids, anticancer drugs, diabetes drugs and HbA1c tests were administered.
Conversion of the text form of the narrative criteria to computable criteria
To identify eligible patients and collect the necessary data from the EMRs, narrative criteria and data must be converted to computable criteria. Such computable criteria include entities, attributes, logical operators (ie, ‘and’ and ‘or’), codes and parameters.33–37 The clinical research purpose and clinical practice demands made it necessary to perform this task.
We manually executed the conversion from text eligibility criteria to computable criteria. As an example of the conversion from narrative criteria to computable criteria, we present the following two-step conversion procedure:
Step 1: Convert the narrative criteria into entity-level criteria
Medical concepts expressed as narrative criteria are mapped onto entities in the data model and converted into entity-level criteria. This task is manually performed at the protocol development stage of the study by the investigators. For each entity, a criterion is created to extract patients who meet each condition. If exclusive conditions for the same entity must be defined, a different criterion is created. Additionally, the list of codes for drugs and diagnoses (ie, ICD-10) is created, and the period of treatments and others are defined by investigators. In this study, we mapped ‘osteoporotic patients’ onto two entities (ie, ‘diagnosis’ and ‘medications and injections’) and converted it to a combination of two criteria (ie, ‘diagnosis of osteoporosis’ and ‘osteoporosis drug administration’). In the test research, we defined the entity-level criteria according to the entered diagnosis and ordered treatments rather than the diagnostic criteria of the disease. This process reflects that the test research aimed to estimate some risks of osteomyelitis of the jaw with BP administration instead of diagnosing osteoporosis patients accurately. The recorded diagnosis in the EMR was typically designed to ensure payment for medical claims. We thus sought to reduce the number of false-positives by extracting patients with a given treatment type.
Step 2: Convert entity-level criteria into attribute-level criteria (ie, computable criteria)
The abovementioned corresponding codes, date and parameters are mapped onto attributes of the entity-level criteria, and these factors become computable criteria.
Creating a targeted patient list
A targeted patient list is created from the entire set of patients for whom EMRs have been obtained by defining logical queries (ie, programmes defined by the GUI) based on the computable criteria included in the ERS.
Logical queries are first defined in the ERS to identify patients who meet the conditions for each criterion. The ERS automatically generates the SQL necessary for data extraction according to the logical queries. Logical queries are then defined to include or exclude eligible patients who meet each criterion for the demographic entity. The targeted patient list is created by executing the logical query. Figure 2 presents an example of an SQL automatically generated by the ERS.
We thus designed our data model to enable the creation of a targeted patient list by defining the patients extracted from each criterion (ie, ‘in’ or ‘not in’) as conditions for the demographic entity that was the unique patient list for the entire hospital. If logical queries are defined using our method, even if the eligibility criteria are complicated, it is not necessary to dramatically change the SQL structure generated in the ERS.
Flagging entries for investigators to confirm
To improve research data accuracy, confirmation by the investigators is necessary. When confirmation is required, additional information is linked.
For the targeted patient list, logical queries are defined to flag certain items according to the investigators’ interest. Necessary logical queries are first defined for each criterion. Logical queries are then defined for addition to the patient list as ‘1’ if the data correspond or ‘0’ if they do not. Data sets created by these operations are joined by ‘union’ and pivoted on a cross-tabulation list using statistical analysis software. We show an example of an SQL generated by the ERS in figure 3.
Create reports for investigators to confirm
To help investigators confirm the targeted patient list, reports are created by linking the findings for diagnostic imaging, pathological diagnosis, operations and other findings. Investigators confirm these entries using the reports and EMR information, including progress notes and images. When the diagnosis history, medication, laboratory results, progress notes and other information are necessary, the same operation is executed for each instance. For example, the list of radiological findings involves ‘patient ID’, ‘study category’, ‘report name’, ‘diagnosis’, ‘findings’ and ‘comment’. The reports may improve the investigators’ confirmation efficiency because they prevent the need to refer to the medical records for each patient who needs confirmation.
Confirmation by the investigator and execution of the statistical analyses
The investigators confirm the accumulated data and execute the statistical analysis. In this test research, two oral and maxillofacial surgeons diagnosed cases by a chart review with an observation of imaging findings.47
To evaluate our system, we collected information about the research period using the recall method. For the accuracy of the data collected by the ERS, we evaluated the results after they were confirmed by the investigator.
Computable criteria, datasets and system evaluation
We present the computable criteria in table 1. To increase data accuracy, we collected all of the exclusion criteria for the investigators to confirm. As table 1 shows, we extracted information from EMRs. For investigator confirmation, we also reported all targeted patients using the following lists: osteoporosis drugs administered, oral BP administered, intravenous BP administered, diabetes drugs administered, anticancer drugs administered, steroid drugs administered, osteoporosis diagnoses, oral cancer diagnoses, patients diagnosed with inflammation of the jaw, patients diagnosed with other suspicious diseases, patients diagnosed with diabetes, HbA1c values, radiological findings, pathological findings and radioisotope findings. These data were extracted from the ERS for statistical analyses, presented in CSV format, and analysed using statistics software.
Among the approximately 800 000 cases at our hospital, 8772 were categorised using the terms ‘Inclusion criteria: Osteoporosis diagnosis’; among this group, 7195 were further categorised using ‘Inclusion criteria: Osteoporosis drug administration’. We then calculated the time that had elapsed since the osteoporosis diagnosis, determined that 7062 patients were aged 20 years or older, and created a targeted patient list. Among those on the targeted patient list, 23 patients were placed under the heading ‘Exclusion criteria: Oral cancer diagnosis’, 110 under ‘Exclusion criteria: Intravenous BP administration’, 4200 under ‘Oral BP administration’, 84 under ‘Inflammatory jaw condition diagnosis’, 2064 as ‘Other suspicious disease diagnosis’, 1700 as ‘Diabetes diagnosis’, 4551 as ‘Steroid drug administration’, 904 as ‘Anticancer drug administrations’, 1055 as ‘Diabetes drug administrations’ and 3641 as ‘HbA1c test execution’. Because of the end point considered, patients who were classified under ‘Inflammatory jaw condition diagnosis’ or ‘Other suspicious disease diagnosis’ were confirmed using predefined hierarchical diagnostic criteria by investigators who performed the statistical analyses and arranged the research results. We show the schema of data collection and confirmation as figure 4.47
The accuracy of the data extracted by the ERS was then characterised. Reviewing the medical records revealed that 2817 patients were not labelled as ‘Oral BP administration’, including seven (one who received intravenous BP) treated at other hospitals. Six patients had been treated with radiation therapy to the oral and maxillofacial regions. Among the 72 patients classified under ‘Inflammatory jaw condition diagnosis’, 35 cases and 37 non-cases were identified.
The data extraction period lasted approximately 3 months. Ten meetings were held during the protocol development stage to create and validate the computable criteria and the list of codes for various drugs and diagnoses (ie, ICD-10). The time required for logical query definition when using the ERS was approximately 20 h. The investigator confirmations and statistical analyses took approximately 4 months.
We identified eligible patients for this research and extracted the data necessary for confirmation by investigators and for statistical analyses.
We asked the chart reviewers to evaluate the system in a questionnaire about ‘the effect of computer programming support for data retrieval from the EMR’, ‘the result of the data retrieval’, ‘the positive and negative aspects of our ERS use’ and ‘the aspects of our method that should be improved’. The investigators evaluating the system mentioned that the following points: (1) the method enabled them to extract the necessary data for diagnosis and drug administration without exception; (2) by screening the entire patient population at the hospital using the ERS, they could identify not just eligible patients in the department of oral and maxillofacial surgery but all eligible patients, which reduced the study bias and (3) by creating reports for confirmation, it enabled investigators to devote their time to reading images, thus effectively reducing the time required for reviewing medical records. The aspects of our method that should be improved are the ‘lack of claim data’ and the ‘administrative complexity of EMR data use’. No negative aspects of our ERS use were noted.
The ERS allowed for the collection of information on patient eligibility by efficiently combining clinical information. Although we did not compare our method with other methods, our proposed method reduced the labour normally required from investigators and improved the reliability of test research results, which indicated that it was useful.
To design the ERS database, we designed a new data model optimised for patient identification. The main differences between our data model and the star schema were as follows: (1) demographic data, which were presented in list form in our EMR system, were presented as a fact-less fact table and (2) date, time, measurements and text information were presented in dimension tables.32 The most significant characteristic of our method for patient identification is the use of a specialised data model in clinical research and the ability to execute a large number of conversion tasks at the protocol development stage. Data can be converted efficiently in a way that does not depend on the EMR database structure when converting narrative criteria to computable criteria. In this research, we considered whether data were extracted directly from EMRs at the protocol development stage. However, EMR data were recorded in a sequential format for every medical practice, and the database structure was complicated. Comprehending the location and meaning of the necessary data thus required tremendous effort. It was difficult to make precise logical queries for patient identification. However, because our ERS data model was arranged by subjects (eg, tests, diagnosis), it was easy to interpret the available information. Due to the standardisation of computable criteria and SQL possible with the ERS, it was also possible to create computable criteria in little time. Additionally, verifying the patient identification accuracy was easy because it was possible to test each individual criterion.
The SQL generated by our ERS does not reduce the time required for data retrieval. Our ERS also cannot retrieve information that is not in the data model. Current EMRs do not store all necessary data for clinical research, including information related to pregnancy, performance status, cancer stage, availability of transportation to the hospital, specific tests that are not typically performed, drug regimen, outcomes (including death) and adverse events. Additionally, all tests are not administered to all patients, and necessary information may have been recorded in medical records at another hospital.44 To facilitate EMR use in clinical research, it is necessary to accumulate as much of this information as possible. In the hospital, much of this information does not integrate well with EMRs, including test reports stored only in the departmental system.50 However, it is important to utilise this information. Additionally, enabling ERS use in and across multiple institutions is also an important future task.
Currently, most clinical research studies that use data from EMRs are planned according to the concept that the primary use of EMRs is for clinical practice and a secondary use is for clinical research.44 Therefore, most investigators attempt to convert the text form eligibility criteria that already have been defined on a protocol to computable criteria at the data collecting stage.35 ,36 However, we propose that computable criteria should not be a result of the automated conversion of narrative criteria but rather a result of research preparation involving medical concepts that are not expressed logically or explicitly in the narrative criteria. Some medical concepts may be interpreted differently depending on the research and the investigator caring for the patients. Additionally, current eligibility criteria are vague or complex, and they do not consider the use of the actual EMR. To convert computable criteria appropriately, high-level medical decisions to answer the research question are required. Therefore, we thought that a large amount of the conversion of the eligibility criteria to computable criteria should be executed at the protocol development stage. In addition, the conversion process should be divided into entity-level conversions that require higher medical decisions and attribute-level conversions. To reduce the burden of conversion, it may be useful to apply NLP technology for the conversion from entity-level criteria to attribute-level criteria. Moreover, it is important to further discuss protocol standardisation, including eligibility criteria representation for computable use. For instance, the attribute-level criteria that describe the search conditions in detail may be useful in global studies to address diseases that vary according to the diagnostic criteria used in each country.
Concerning EMR data accuracy, the ICD10 code (osteomyelitis of the jaw) sensitivity was 48.6% (35/72). The investigators reported six simple diagnosis errors, seven oral BP administrations at other hospitals, and six patients who were treated with radiation therapy in the oral and maxillofacial region.47 For the accuracy of current EMRs, the investigators had to confirm the information. However, the EMRs provided rich confirmation data and were useful in improving research data accuracy. In this study, we checked the data from actual EMRs manually and identified patients precisely and extensively using coded information, narrative information, and images. However, only information from existing EMRs was available. Current EMRs have a high degree of flexibility in data entry and are not currently managed for research purposes, which decreases their reliability. It is necessary to improve data quality through quality control without placing too much of a burden on clinical practice. Alternatively, it may be possible to organise data sufficiently before research use.51–53 Standardising the terminology and exchange formats used in the healthcare setting has facilitated international discourse.46 ,54–58 It is necessary to further discuss not only clinical practice but also research purposes, particularly how to utilise various standards when using EMRs beyond the hospital setting.
We propose a pragmatic method for EMR-based observational studies. Our ERS is already used to support hospital-based cohort studies, clinical trial recruitment and the eClinical trial infrastructure44 at our centre. We believe an efficient ERS and standardised data processing model are essential to facilitate clinical research that utilises EMRs.
The authors would like to acknowledge the staff of the department of medical informatics of Kyoto University Hospital for their generous support.
Contributors KY designed the study, developed the ERS system, identified the computable eligibility criteria, wrote logical queries, collected data and wrote the manuscript. ES is grant holder who designed the study, developed the ERS system and wrote and edited the manuscript. TY designed and conducted the ‘Risk of osteomyelitis of the jaw induced by oral bisphosphonates in patients taking medications for osteoporosis: a hospital-based cohort study in Japan’ (BRONJ study) study and the current study, identified the computable eligibility criteria and wrote and edited the manuscript. KA and MY designed and conducted the BRONJ study. ST designed the study and provided comments and feedback. KB is the principal investigator of the BRONJ study. MY owns the ERS system and supervised the study. MF supervised the study and provided comments and feedback. All of the authors read and approved the final manuscript.
Funding This work was supported by the Coordination, Support and Training Program for Translational Research of the Ministry of Education, Culture, Sports, Science and Technology of Japan and by Grants-in-Aid for Scientific Research of Japan (23790566).
Competing interests None.
Ethics approval This research was approved by the ethics committee of Kyoto University Hospital.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement No other data are available.
If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.