Development of the individualised Comparative Effectiveness of Models Optimizing Patient Safety and Resident Education (iCOMPARE) trial: a protocol summary of a national cluster-randomised trial of resident duty hour policies in internal medicine

Introduction: Medical trainees' duty hours have received attention globally; limits in Europe, New Zealand and some Canadian provinces are much lower than the 80 hours per week enforced in the USA. In the USA, resident duty hour rules have been implemented without evidence that simultaneously addresses the competing concerns of patient safety and physician education. The objective of this trial is to prospectively evaluate the implications of alternative resident duty hour rules for patient safety, trainee education, and intern sleep and alertness.

Methods and analysis: 63 US internal medicine training programmes were randomly assigned 1:1 to the 2011 Accreditation Council for Graduate Medical Education resident duty hour rules or to rules more flexible in intern shift length and number of hours off between shifts for academic year 2015-2016. The primary outcome is calculated for each programme as the 30-day mortality rate among Medicare beneficiaries with any of several prespecified principal diagnoses in the intervention year minus the corresponding 30-day mortality rate in the preintervention year. Additional safety outcomes include readmission rates, prolonged length of stay and costs. Measures derived from trainees' and faculty responses to surveys and from time-motion studies of interns compare the educational experiences of residents. Measures derived from wrist actigraphy, subjective ratings and psychomotor vigilance testing compare the sleep and alertness of interns. Differences between duty hour groups in outcomes will be assessed by intention-to-treat analyses.

Ethics and dissemination: The University of Pennsylvania Institutional Review Board (IRB) approved the protocol and served as the IRB of record for the 40 programmes that agreed to sign an Institutional Affiliation Agreement. Twenty-three programmes opted for a local review process.

Trial registration number: NCT02274818; Pre-results.

The long hours worked by resident physicians received some academic attention in the 1970s, [1] but it was not until 1984 that those long hours became publicly linked with concerns about patient safety. [2] Those concerns resonated with a public who found it self-evident that the often 30-hour shifts of resident physicians would lead to fatigue, and that fatigue would lead to errors that would harm patients. Nor was the possible link between duty hours and patient safety solely a US concern: many countries began to limit hours in the 1990s, and New Zealand has had a limit of 72 hours per week since 1985. In the United States, regulation of duty hours across all specialties occurred first in New York State as a reaction to the Libby Zion case, and then nationally in 2003, when the Accreditation Council for Graduate Medical Education (ACGME), the organization that oversees resident education in the United States, limited residents to 80 hours of work per week averaged over 4 weeks and limited the length of individual shifts to 24 hours, with an additional 4 hours to safely transfer care to the next resident. Partly on the basis of an Institute of Medicine report and a trial from Brigham and Women's Hospital, [3,4] those regulations were further tightened for first-year residents (interns) in 2011, limiting their maximum shift length to 16 hours. This change prompted a charged debate. [5] Proponents argued that the restrictions did not go far enough. Others argued that the regulations were overly restrictive and inflexible, and that they harbored increased risk to patients by increasing patient handoffs. [6,7] Meanwhile, large observational studies, using data from Medicare and the Veterans Administration across millions of hospitalizations, found essentially no difference in important patient outcomes following implementation of either the 2003 [8-16] or 2011 duty hour standards. [17] Several studies associated the 2011 standards with less direct patient contact, increased perception of medical errors, increased transitions of care, decreased educational opportunities, and only modestly increased quantities of sleep. [18-20] Others found no changes in trainees' educational test scores. [21] Program directors and trainees expressed concern that the rules reduced training quality and increased rather than decreased medical errors. [22-24] The increasing recognition of the importance of supervision, with separate mandates implemented by the ACGME over the same period, added further uncertainty to the debate. In the end, a well-meaning effort to manage resident fatigue was perceived by many to promote burnout, increase handoffs, decrease educational opportunities and delay the professional maturation required to produce competent, independent physicians. Currently, the evidence available to resolve these controversies is limited to a patchwork of laboratory and in vivo studies of sleep deprivation, large-scale epidemiological observations of patient outcomes, surveys of resident and educator opinions, and often single-center trials of unique residency duty hour designs that focused on workload or sleep but not patient outcomes.
In this context, we created the iCOMPARE (Individualized Comparative Effectiveness of Models Optimizing Patient Safety and Resident Education) trial, a cluster randomized trial carried out by internal medicine (IM) residency programs in the United States. For the 2015-2016 academic year, participating residency programs were randomized to one of two groups: 1) maintain standard (STD) duty hour rules, or 2) permit a more flexible (FLEX) set of duty hour rules, noted principally for removing the 16-hour shift length restriction for interns and allowing them to work up to 24 hours with an additional 4 hours for care transitions. In contrast to prior work, iCOMPARE was designed to simultaneously assess the impact of duty hour rules on patient safety, resident education, and intern sleep and alertness.
While the iCOMPARE trial was being planned, leaders in surgical education developed a roughly parallel trial, the Flexibility in Duty Hours Requirements for Surgical Trainees (FIRST) trial, which was fielded in the 2014-2015 academic year. The design of the FIRST trial and its initial results have been published. [25,26] Here we describe the design of the iCOMPARE trial.

Funding and Organization
The iCOMPARE trial is funded primarily by the National Heart, Lung, and Blood Institute (NHLBI), with additional funding from the ACGME. The NHLBI appointed an independent Data and Safety Monitoring Board (DSMB) to advise the Institute regarding the trial's progress, monitor data quality, and safeguard the interests of study participants.

Study Design

iCOMPARE is a cluster-randomized trial comparing two alternative duty hour standards in 63 IM training programs in the United States, fielded in academic year 2015-2016. Table 2 compares the duty hour standards between the FLEX and STD arms. Cluster randomization occurred at the level of the residency program, the level at which duty hour policies are implemented. Although duty hour standards are mandated, individual programs vary considerably in how they schedule their trainees within those standards. The trial is pragmatic in that the intervention arm effectively represents the exposure of residency programs to an alternative set of duty hour standards. The exposure is the policy change, not the actual duty hours that are implemented in response. This approach is akin to clinical trials of outpatient pharmaceuticals in which the exposure is the prescription of the control or intervention drug, regardless of participant adherence.
Hypotheses regarding how interns spend their time require detailed time-motion observations, and hypotheses about interns' sleep and alertness require detailed observations of sleep cycles and psychomotor vigilance; both are accomplished through sub-studies deployed in a sample of participating programs.

Sample Size and Power
The non-inferiority 30-day patient mortality hypothesis (H1a) is the trial's primary hypothesis. Defining the outcome as the trial year rate minus the pre-trial year rate adjusts each program's outcome for secular trends in 30-day mortality, as well as for differential program-to-program patient risk profiles.
Using a two-sample t-test for non-inferiority of the between-group mean year-over-year difference in 30-day mortality and assuming 80% power, a Type I error of 5%, a non-inferiority margin of 1%, a pooled SD for the outcome of 1.5%, and a 30-day mortality rate of 11%, we calculated a required sample of 58 programs, 29 per treatment group. The pooled SD of the outcome and the 30-day mortality rate in the STD group were estimated using Medicare data from 2008 for the population of target IM programs (i.e., all IM programs meeting the study-specific size and population criteria). The required sample for the Sleep and Alertness sub-study was calculated to be 290 interns (145 per treatment group) and increased to 384 interns (192 per treatment group) to anticipate data loss related to non-adherence and dropouts.
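As a rough check on these figures, the per-group sample size can be approximated with the standard normal-approximation formula for a one-sided, two-sample non-inferiority test on a difference in means. This sketch is ours, not the protocol's actual calculation, and the function name is illustrative:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(sd, margin, alpha=0.05, power=0.80):
    """Normal-approximation per-group sample size for a one-sided,
    two-sample non-inferiority test on a difference in means."""
    z = NormalDist().inv_cdf
    # n = 2 * ((z_{1-alpha} + z_{power}) * sd / margin)^2 per group
    return ceil(2 * ((z(1 - alpha) + z(power)) * sd / margin) ** 2)

# iCOMPARE inputs: pooled SD 1.5 percentage points, margin 1 point
print(n_per_group(sd=1.5, margin=1.0))  # 28 per group
```

The normal approximation yields 28 programs per group; substituting t quantiles with 2n - 2 degrees of freedom, as appropriate for a t-test, nudges this to 29 per group, consistent with the 58 programs reported above.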

Aim 2 - Education
Education measures are derived from multiple sources. The primary education measures are specified in Table 3 and come from the Time-Motion sub-study, the ACGME year-end trainee and core faculty surveys, and the interns' In-Training Examination (ITE) scores provided by the American College of Physicians (ACP).
Time-Motion sub-study. We recruited 3 IM programs randomized to STD and 3 randomized to FLEX to participate in direct observations of some of their interns, targeting programs in Philadelphia or Baltimore for operational convenience; programs with tertiary hospitals as well as community-based programs were included in both arms. Participating programs received $3,000 to support these sub-study activities. We recruited interns rotating on General Medicine services between March and May 2016.
Eligible and interested interns provided written consent; of the 129 interns invited to participate, 120 (93%) consented.
Twenty-three observers (medical students and undergraduates) were trained to follow participating interns. They used a custom-built tablet-based software program to document start and stop times for various intern activities: direct patient care, indirect patient care, education, rounds, work, handoffs, and miscellaneous, each with various subcategories reflecting greater specificity of tasks. For example, direct patient care had subcategories for patient interactions, family interactions, and physical contact (e.g., physical exam). At least one activity had to be selected at all times, although more than one could be selected to reflect multitasking. At the start and stop of the shift an observer completed brief surveys that summarized total patient census numbers for that intern, including the numbers of transfers, discharges, admissions, and patients received at the beginning of a shift and handed off at the end.
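To illustrate how such interval data can be aggregated, here is a minimal sketch (our own, not the trial's actual software) that totals observed time per activity category, counting multitasked intervals fully toward every selected activity:

```python
from datetime import datetime

def seconds_per_category(intervals):
    """Aggregate observed time per activity category.

    `intervals` is a list of (start, stop, categories) tuples, where
    `categories` is the set of activities selected during that interval;
    multitasked intervals count fully toward every selected category.
    """
    totals = {}
    for start, stop, categories in intervals:
        length = (stop - start).total_seconds()
        for cat in categories:
            totals[cat] = totals.get(cat, 0.0) + length
    return totals

# Hypothetical fragment of one observed shift
shift = [
    (datetime(2016, 3, 1, 7, 0), datetime(2016, 3, 1, 7, 30), {"handoffs"}),
    (datetime(2016, 3, 1, 7, 30), datetime(2016, 3, 1, 9, 0),
     {"rounds", "education"}),  # multitasking: counted under both
]
print(seconds_per_category(shift))
```

Whether multitasked time should count fully toward each activity or be split between them is an analytic choice; the version above matches the "more than one could be selected" logging rule.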
At each site, shifts were selected with the aim of capturing 30 shifts in proportions mirroring, site by site, how interns generally spend their time on a general medicine inpatient rotation in a given week.
Observers shadowed approximately 10-15 interns per site for 1-3 shifts per intern. Extended shifts (e.g., 24-hour call cycles) were often split between two observers. A 10% sample of shifts was observed simultaneously by two observers to estimate inter-rater reliability. Interns received a $50 Amazon gift card for each observed shift.
ACGME survey data. Data provided by the ACGME to iCOMPARE were aggregated at the level of treatment group arm (FLEX or STD) and year collected (2014, 2015, 2016). All data were stripped of program identifiers and individual respondent identifiers. iCOMPARE identified the programs enrolled in iCOMPARE and their treatment group assignments to the ACGME and received summary treatment-group level information for the single-question items specified in hypotheses H2b and H2c and for many content areas (combined responses over several questions covering related content) that comprise secondary education outcomes.
ITE score data from ACP. The ACP provided iCOMPARE with trainee-level ITE scores from 2015 and 2016; identifying information for a score was limited to PGY year and program identifier. The ACP provided scores for those trainees who had given permission to share their score for research purposes.
APDIM annual survey data. The APDIM provided data from its surveys of program directors in fall 2015 and fall 2016. The deidentified data set did not include the name of the program or program director, but did include a flag indicating whether each program was in the STD arm, in the FLEX arm, or not in the study.
End-of-Shift Surveys. From August 31, 2015 to April 26, 2016, iCOMPARE conducted 16 two-week cycles of daily surveys administered to all trainees at all participating programs. At the start of each cycle, we randomized each trainee into one of 14 groups and then allocated each group to a day of the 2-week period to receive a survey so that each trainee was surveyed once every two weeks.
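The allocation scheme described above can be sketched as follows; this is a simplified illustration under our own naming, not the trial's code. Shuffling the roster once per cycle and dealing trainees round-robin into 14 day-groups guarantees each trainee is surveyed exactly once per cycle with group sizes as equal as possible:

```python
import random

def assign_survey_days(trainees, cycle_length=14, seed=None):
    """Allocate each trainee to exactly one survey day per 2-week cycle,
    with day-group sizes as equal as possible."""
    rng = random.Random(seed)
    roster = list(trainees)
    rng.shuffle(roster)
    # Deal round-robin: day d gets every cycle_length-th trainee.
    return {day: roster[day::cycle_length] for day in range(cycle_length)}

groups = assign_survey_days([f"trainee{i:03d}" for i in range(100)], seed=2016)
sizes = [len(g) for g in groups.values()]
print(min(sizes), max(sizes))  # group sizes differ by at most one
```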
Surveys alternated between two sets of questions. Survey 1, sent on alternating days, asked: [1] the name of the rotation the trainee was on in the past 24 hours, specifying details such as whether it was inpatient, the type of inpatient rotation, and whether it was at the main teaching hospital, another setting, or not in hospital; and, if the trainee was on an inpatient rotation, [2] the number of new patient evaluations completed in the past 24 hours; [3] the number of handoffs experienced in the past 24 hours; and [4] the number of patients for whom the trainee was responsible. Survey 2, sent on the alternate days, asked the same first question as Survey 1, as well as the trainee's ratings (too little, just right, too much) of [1] time spent in educational conference and related activities, [2] sense of ownership of patients, [3] work intensity, and [4] continuity of care. Data for these questions relate to satisfaction and complement the study's End-of-Year survey (described below) and the ACGME survey (described above).
Trainees were entered into an incentive lottery designed so that in each 2-week cycle, one intern and one resident at each of the 63 IM programs were awarded either a $25 or $100 Amazon gift card if they had completed their survey during that period. After the first cycle, cycle response rates ranged between 39% and 42%.
End-of-Year surveys. All trainees and program directors received iCOMPARE study-specific surveys in May of 2016 with up to 6 reminders to nonrespondents. At the end of the intervention year, a $2,500 cash incentive was provided to each of the 9 programs with the highest response rates.
The trainee survey was administered to all trainees, with only slight differences between the versions for interns and for PGY2 and higher trainees. The instrument was initially developed for the FIRST trial [26] (available online) and included items on trainee satisfaction, experience of duty hours, supervision, fatigue management, and resident and patient safety, and ended with the Maslach Burnout Inventory-Human Services Survey (MBI-HSS). [38] The MBI-HSS is a 22-item rating scale assessing three domains: emotional exhaustion (9 items), depersonalization (5 items), and lack of personal accomplishment (8 items). Items are answered on a frequency scale of 0 to 6, where 0 indicates never and 6 indicates every day. The program director survey (available online) was modeled on an earlier survey of program directors [39] and included items on resident and faculty workload, resident morale, continuity, education, patient safety, and program finances and administration.

Aim 3 - Sleep and alertness
Outcomes for the third aim include sleep duration and both subjective and objective measures of alertness among interns at 6 sites randomized to STD and 6 sites randomized to FLEX. At each of these 12 sites, program coordinators recruited interns scheduled to be on general medicine, medical intensive care, cardiology, or cardiac care rotations. Each participating program received $8,000 to cover the costs associated with scheduling interns for data acquisition and managing study equipment. Data collection spanned November 5, 2015 to May 31, 2016.
Interns providing written consent underwent 14 days of continuous measurement of rest and activity via wrist actigraphs (model wGT3X-BT, The Actigraph Corp., Pensacola, FL). [40-42] Interns were instructed to wear the actigraph continuously, even on days off, except during activities that might damage it (e.g., water immersion or contact sports) or that would impede the delivery of clinical care.
They were asked to remove the actigraph for up to 2 waking hours for recharging on days 1 and 7. Each participating intern received a $10 Amazon gift card for each day for which data were received.
Each morning, interns completed a brief online survey including the name of the shift the intern was working; a log reflecting sleep times and quality, and experiences of excessive sleepiness; and the Karolinska Sleepiness Scale. [43] Interns then completed a 3-minute Psychomotor Vigilance Test (PVT-B) [44,45] on an Android smartphone (Samsung Galaxy SIII Neo). The PVT-B is based on simple reaction time to stimuli occurring at random inter-stimulus intervals and is the gold standard for measuring the neurobehavioral effects of acute and chronic sleep loss and circadian misalignment. [46]
Actigraphy, survey, and PVT-B data were automatically encrypted, uploaded to a secure server, and checked daily for protocol adherence and potential technical issues with the equipment. If problems were detected or no data were received at all, the study team contacted interns to resolve any difficulties with the equipment. Details of scoring are presented in Supplementary Appendix: Appendix Materials 2 (Sleep Actigraphy Scoring) and supplementary appendix Figures 1 and 2.
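For context, a PVT bout is typically summarized by response speed and a count of lapses (unusually slow responses). The sketch below is a generic illustration, not the trial's scoring code; the 355 ms lapse threshold is the convention described in the literature for the 3-minute PVT-B, assumed here rather than taken from the protocol:

```python
def pvt_summary(reaction_times_ms, lapse_ms=355):
    """Summarize one PVT bout: mean response speed (1/RT, in 1/s)
    and the number of lapses (RT at or above `lapse_ms`)."""
    speeds = [1000.0 / rt for rt in reaction_times_ms if rt > 0]
    lapses = sum(1 for rt in reaction_times_ms if rt >= lapse_ms)
    return {"mean_speed_per_s": sum(speeds) / len(speeds), "lapses": lapses}

# Hypothetical reaction times (ms) from one 3-minute bout
bout = [250, 310, 420, 290, 600, 275]
print(pvt_summary(bout))
```

Response speed (the reciprocal of reaction time) is commonly preferred over raw reaction time because it is less dominated by a few extremely slow responses.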

Statistical Considerations
Non-inferiority tests will be one-sided and superiority tests will be two-sided. All primary analyses will compare the FLEX and STD treatment groups as randomized, regardless of adherence to the assigned duty hour standards, according to the intention-to-treat (ITT) principle. Since directors at programs assigned to FLEX have considerable latitude in design of trainee schedules, we expect variation amongst the duty-hour schedules followed in the FLEX group. Protocol-specified secondary analyses addressing the degree of difference between FLEX and STD schedules will be completed. We will also report mortality results adjusting for the clinical condition associated with patient's principal diagnosis as well as demographic variables and comorbidities determined using a 6-month look-back period. Similar approaches will be used for the other patient safety outcomes specified in hypotheses H1b-H1e.
The outcome measures for the education hypotheses (H2a-H2d) and the sleep and alertness hypotheses (H3a-H3b) are person-level (intern, trainee, or faculty) measures. For each outcome, the repeated measures on an individual will be averaged, which eliminates within-person correlations. Each person-level mean will be the response variable in a mixed random effects linear regression model with a single fixed term for treatment group and with restricted maximum likelihood estimation of the random IM program effect, as implemented in SAS (Cary, NC) or Stata (College Station, TX), to account for within-program correlations in outcome measures. When pre-trial year data are available, additional analyses will adjust an individual's trial year mean response for the pre-trial year program-level mean response (i.e., a difference-of-differences analysis). The pre-trial program-level mean response must be used because the individuals measured in the trial year (e.g., each year's incoming interns) are not the same individuals as those measured in the pre-trial year.
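The first, averaging stage of this analysis is simple enough to sketch directly; the mixed model itself would then be fitted in SAS or Stata as stated. Names and values here are illustrative only:

```python
from collections import defaultdict
from statistics import mean

def person_level_means(records):
    """Collapse repeated measures to one mean per person, keeping the
    program label needed to model within-program correlation."""
    by_person = defaultdict(list)
    program_of = {}
    for person, program, value in records:
        by_person[person].append(value)
        program_of[person] = program
    return [(person, program_of[person], mean(vals))
            for person, vals in by_person.items()]

# Hypothetical repeated measures: (person, program, outcome value)
records = [("i1", "progA", 7.0), ("i1", "progA", 6.0),
           ("i2", "progB", 5.0), ("i2", "progB", 9.0), ("i2", "progB", 7.0)]
print(person_level_means(records))  # one (person, program, mean) row each
```

Averaging first, rather than modeling every repeated observation, removes within-person correlation so that the subsequent model only needs a random effect for program.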

Ethics and Dissemination
The University of Pennsylvania Institutional Review Board (IRB) approved the iCOMPARE protocol and served as the IRB of record for all participating programs that agreed to sign an Institutional Affiliation Agreement. Forty programs chose this option and submitted the required paperwork. Twenty-three programs opted for a local review process.
Plans for dissemination include submitting results for the various aims to academic meetings and peer-reviewed publications as they become available. A Data Safety Monitoring Board will review and approve results related to the primary hypotheses before dissemination.

Discussion
iCOMPARE is a one-year, cluster-randomized, pragmatic trial designed to evaluate how making an alternative resident duty hour schedule available affects patient safety, resident education, and intern sleep and alertness. iCOMPARE's design is similar to that of the FIRST trial conducted among surgical trainees. What distinguishes both of these trials from prior work on duty hour schedules, in addition to their large size, comprehensive approach, and immediate policy relevance, is that they help elevate the standards of evidence applied to graduate medical education policy more generally.
Although randomized trials provide strong evidence for the causal inference required for good policy, they typically answer narrow questions. Those questions today revolve around the length of shifts within an 80-hour work week. Inside the field of graduate medical education, these issues have been hotly debated. With the results of the FIRST trial demonstrating non-inferiority in patient outcomes when a more flexible schedule was available, the ACGME issued new duty hour standards, effective July 2018, that correspond to the intervention arm of iCOMPARE. [47] Given that surgery and internal medicine are large fields with many residents caring for many patients, it is important to study the duty hour rules in both specialties as surgical and medical training programs differ in structure, process, culture, the kinds of residents they attract, the patients they serve, and the duties of trainees.
The aims are broad and include patient safety, trainees' sleep and alertness, and a host of education outcomes.
The protocol draws on a number of data collection methods, including real-time direct observation of trainee activities, existing and study-administered surveys, robust recording of sleep periods over a 14-day period, and patient medical record data.
The lag between the observation period and data release for patient medical record data is substantial.
iCOMPARE promises narrow but substantive evidence to inform resident duty hour standards in internal medicine and signals new interest in methodologically stronger research in medical education.


Study Aims and Hypotheses
Study hypotheses are presented in Table 1. The primary hypothesis for the trial is that 30-day any-site patient mortality under FLEX will be non-inferior to 30-day patient mortality under STD, measured as the difference in difference across STD and FLEX programs between a program's 30-day patient mortality rate in the trial year and that rate in the pre-trial year (i.e., the 30-day patient mortality rate at a program during the trial year minus the 30-day patient mortality rate at the program in the pre-trial year). Secondary outcomes related to patient safety include 7-day and 30-day hospital readmission rates, complication rates, the probability of a prolonged length of stay, total resource utilization, and Medicare payments.
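For concreteness, the arithmetic of this difference-in-difference outcome can be sketched in a few lines; the rates below are made up purely for illustration, and the actual analysis uses Medicare claims data:

```python
def program_outcome(trial_rate, pretrial_rate):
    """Primary outcome for one program: trial-year 30-day mortality
    minus pre-trial-year 30-day mortality, in percentage points."""
    return trial_rate - pretrial_rate

# Made-up program rates (percentage points), two programs per arm
flex = [program_outcome(10.8, 11.0), program_outcome(11.3, 11.1)]
std = [program_outcome(10.9, 11.2), program_outcome(11.0, 10.9)]

# Non-inferiority compares this FLEX-minus-STD contrast to the margin
contrast = sum(flex) / len(flex) - sum(std) / len(std)
print(round(contrast, 3))
```

Subtracting each program's own pre-trial rate is what absorbs secular trends and stable program-to-program differences in patient risk before the arms are compared.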
iCOMPARE's education hypotheses are that trainees in the FLEX arm will spend more time in direct patient care and education and have greater satisfaction with educational experiences compared with their STD arm peers; that standardized test scores for interns in FLEX will not be lower than such scores for interns in the STD arm; and that faculty in the FLEX arm will report greater satisfaction with their clinical experience. iCOMPARE's sleep hypotheses are that average daily sleep will not be less among interns in the FLEX arm compared with those in the STD arm, and that interns in the FLEX arm will not have greater subjective sleepiness or lower behavioral alertness than interns in the STD arm.



Study Population and Inclusion Criteria
The CONSORT diagram is provided in Figure 1. In 2014 there were 379 ACGME-accredited IM training programs in the country not on probation. The 54 programs in New York State were excluded because that state's legislated duty hour standards were not subject to the waiver required for the intervention arm. Since the patient safety outcomes would be measured in Medicare patients, Veterans Administration hospitals did not qualify for inclusion. Sufficiently precise estimates of our patient safety outcome measures necessitated exclusion of 84 programs training only at hospitals in the bottom 50% by trainee-to-bed ratio or the bottom 25% by patient volume in the measured diagnoses. In about 50% of teaching hospitals, the number of residents per bed is so small that the residents' impact on patient care is minimal, and so any effect of changing those residents' schedules would be too small to measure.
An additional 62 programs in the lowest quartile of number of trainees were excluded to ensure measurement of enough trainees. Two other programs with deviations from these criteria were approved by the Steering Committee. One was approved despite deviations in the size of its two affiliated hospitals because together the two hospitals met the qualifying hospital size. The second was approved because its affiliated hospital far exceeded the trainee-to-bed ratio even though its patient volume was slightly low. A total of 179 programs were eligible for inclusion.

Patient and Public Involvement
The study was motivated by a fundamental concern about the impact of training schedules on patient, trainee and program director experiences. However, no patients were directly involved in the design and conduct of the study. Program directors from each participating program were involved and essential to recruitment, which occurred at the training program level. Program directors were fully informed about the intervention, and individual trainees had the opportunity to decline participation in surveys and other education outcome measurements. Results will be presented to program directors at national meetings and published in peer-reviewed journals.

Data Sources
Hypothesis testing required outcome data from patients, trainees, and program directors and faculty.
Because of the available and highly reliable patient mortality information (both in hospital and after discharge) in the Medicare fee for service (FFS) program, the iCOMPARE patient population was limited to Medicare FFS beneficiaries with a qualifying principal diagnosis on hospital admission (Supplementary Appendix: Appendix Materials 1); these diagnoses were chosen for their common treatment on internal medicine services (excluding oncology and neurology diagnoses) and their elevated mortality rates.
Because similar data are not available for patients enrolled in Medicare managed care programs, those beneficiaries were excluded.
Randomized programs were further invited to participate in the observational Time-Motion and Sleep and Alertness sub-studies, described below.

Timeline
Trial development began in 2013. The Medicare data required for the analyses of the patient safety outcomes are not expected to be available until mid-2018 (Figure 2).

Outcomes, Measures, and Data Collection
Aim 1 - Patient Safety

The patient safety outcomes include 30-day mortality, readmission, [19,37] complication rate, [13,34,38,39] and resource utilization (cost) [17,35] and payment measures among fee-for-service Medicare beneficiaries with specific diagnoses. Details regarding justification for and operationalization of these outcomes are provided in Supplementary Appendix: Appendix Materials 2.

Aim 2 -Education
Education measures are derived from multiple sources. The primary education measures are specified in Table 3 and come from the Time-Motion sub-study, the ACGME year-end trainee and core faculty surveys, and the interns' In-Training Examination (ITE) scores provided by the American College of Time-Motion sub-study. We recruited 3 IM programs randomized to STD and 3 randomized to FLEX to participate in direct observations of some of their interns, targeting programs in the mid-Atlantic region for operational convenience; programs with tertiary hospitals as well as community based programs were included in both arms. Participating programs received $3,000 to support these sub-study activities. We recruited interns rotating on General Medicine services between March-May 2016.
Eligible and interested interns provided written consent. Among the 129 interns invited to participate, 120 (93%) consented.
Twenty-three observers (medical students and undergraduates) were trained to follow participating interns. They used a custom-built tablet-based software program to document start and stop times for various intern activities: direct patient care, indirect patient care, education, rounds, work, handoffs, and miscellaneous, each with various subcategories reflecting greater specificity of tasks. For example, direct patient care had subcategories for patient interactions, family interactions, and physical contact (e.g., physical exam). At least one activity had to be selected at all times, although more than one could be selected to reflect multitasking. At the start and stop of each shift, an observer completed brief surveys that summarized total patient census numbers for that intern, including the numbers of transfers, discharges, admissions, and patients received at the beginning of a shift and handed off at the end.
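The start/stop logging described above lends itself to a simple interval aggregation. The sketch below, which assumes a hypothetical record schema (the study's tablet software and data format are not specified here), shows how overlapping records can each count in full toward their category, mirroring the multitasking rule:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ActivityRecord:
    """One observer-logged start/stop interval (hypothetical schema)."""
    category: str
    start: datetime
    stop: datetime

def minutes_by_category(records):
    """Sum observed minutes per category. Overlapping records each count
    in full, mirroring the rule that multitasking selects several
    activities at once."""
    totals = {}
    for r in records:
        mins = (r.stop - r.start).total_seconds() / 60.0
        totals[r.category] = totals.get(r.category, 0.0) + mins
    return totals

# Two overlapping intervals from one shift (illustrative times only).
shift_start = datetime(2016, 3, 1, 7, 0)
log = [
    ActivityRecord("direct patient care", shift_start, shift_start + timedelta(minutes=45)),
    ActivityRecord("education", shift_start + timedelta(minutes=30), shift_start + timedelta(minutes=60)),
]
totals = minutes_by_category(log)
```

Because multitasked intervals are counted toward every selected category, per-category minutes can sum to more than the shift length; comparisons of time in direct care across arms are unaffected as long as the rule is applied uniformly.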
Shifts were selected at each site with the aim of capturing 30 shifts in proportions mirroring, specific to each site, how interns generally spend their time on a general medicine inpatient rotation in a given week.
Observers shadowed approximately 8-10 shifts each, for 1-3 shifts per intern. Extended shifts (e.g., 24-hour call cycles) were often split between two observers. A 10% sample of shifts was observed simultaneously by two observers to estimate inter-rater reliability.
Daily surveys. Trainees also received brief surveys on alternating days. Survey 1 asked trainees to report, among other items, [3] the number of handoffs experienced in the past 24 hours and [4] the number of patients for which the trainee was the primary provider; these questions provide another view of how interns spend their time. Survey 2 was sent on the alternate days and asked the same first question as Survey 1, as well as the trainee's ratings (too little, just right, too much) of [1] time spent in educational conferences and related activities, [2] sense of ownership of patients, [3] work intensity, and [4] continuity of care. Data for these questions are related to satisfaction and complement iCOMPARE's end-of-year survey (described below) and the ACGME survey (described above).
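The 10% of shifts watched by two observers at once supports an inter-rater reliability estimate. The protocol does not specify which reliability statistic was used; as a minimal sketch, epoch-by-epoch percent agreement between the two observers' activity labels can be computed like this:

```python
def percent_agreement(labels_a, labels_b):
    """Epoch-by-epoch percent agreement between two observers watching
    the same shift. A simple stand-in; the study's actual reliability
    statistic is not specified here."""
    if len(labels_a) != len(labels_b):
        raise ValueError("observers must cover the same epochs")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Four one-minute epochs labeled by each observer (illustrative data).
obs_a = ["direct", "direct", "rounds", "handoffs"]
obs_b = ["direct", "indirect", "rounds", "handoffs"]
agreement = percent_agreement(obs_a, obs_b)
```

A chance-corrected statistic such as Cohen's kappa would typically be reported alongside raw agreement, since some categories dominate an intern's day.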
Trainees were entered into an incentive lottery designed so that in each 2-week cycle, one intern and one resident at each of the 63 IM programs were each awarded either a $25 or $100 Amazon gift card if they had completed their survey during that period. After the first cycle, the cycle response rate ranged between 39% and 42%. At the end of the year, a $2,500 cash incentive was provided to each of the 9 programs with the highest response rates.
The trainee survey was administered to all trainees, with only slight differences between the versions for interns and for PGY2 and higher trainees. The instrument was initially developed for the FIRST trial [26] and included items on trainee satisfaction, experience of duty hours, supervision, and fatigue management, rated on a frequency scale of 0 to 6, where 0 indicates never and 6 indicates every day.
The program director survey was modeled from an earlier survey to program directors [41] and included items on resident and faculty workload, resident morale, continuity, education, patient safety, and program finances and administration.

Aim 3 -Sleep and alertness
Outcomes for the third aim include sleep duration and both subjective and objective measures of alertness among interns at 6 sites randomized to STD and 6 sites randomized to FLEX. At each of these sites, intern sleep was recorded by wrist actigraphy, and alertness was assessed by subjective sleepiness ratings and the brief psychomotor vigilance test (PVT-B). Scoring methods are detailed in Supplementary Appendix: Appendix Materials 3 (Sleep Actigraphy Scoring) and Supplementary Appendix Figures 1 and 2.

Statistical Considerations
Non-inferiority tests will be one-sided and superiority tests will be two-sided. All primary analyses will compare the FLEX and STD treatment groups as randomized, regardless of adherence to the assigned duty hour standards, according to the intention-to-treat principle. Since directors at programs assigned to FLEX have considerable latitude in the design of trainee schedules, we expect variation among the duty hour schedules followed in the FLEX group. Protocol-specified secondary analyses addressing the degree of difference between FLEX and STD schedules will be completed. We will also report mortality results adjusting for the clinical condition associated with the patient's principal diagnosis as well as demographic characteristics. Where pre-trial data are available, analyses will adjust an individual's trial-year mean response for the pre-trial-year program-level mean response (i.e., a difference-of-differences analysis). The pre-trial program-level mean response must be used since the trainees and faculty providing data in the pre-trial year are not the same trainees and faculty providing data in the trial year, but are at the same IM program. Specifically, we have pre-trial data for the Medicare patient-level safety analyses and for some education outcomes (the ACGME end-of-year surveys, ITE score data from the ACP, and the iCOMPARE end-of-year surveys).

Dissemination
Plans for dissemination include submitting results for the various aims to academic meetings and peer-reviewed publications as they become available. The Data and Safety Monitoring Board will comment on manuscripts reporting results related to the primary hypotheses before journal submission.

Discussion
Some may wonder why duty hours or shift lengths were not imposed on the intervention group and why, instead, the intervention group merely had permission to use more flexibility in its scheduling. The design of this study means that the potency of the duty hour changes actually implemented by programs may be less extreme than what was permitted by the intervention, adding noise to the comparison. One might ask what happens when interns work longer hours overall or in certain shifts. But the question at hand is: what happens when programs are allowed flexibility in their scheduling of shifts? The outcome of those policy decisions is, in its implemented state, a product of the flexibility of the rules and the degree to which individual programs take advantage of that flexibility. In some ways, this study design is consistent with that of effectiveness trials of drugs, where the anticipated effect is a product not just of whether one was randomized to the study drug, but also of whether one was adherent to the drug. In the real world, adherence is relevant. Context is similarly relevant. The effects we observe in this trial also depend critically on the oversight and supervision provided to interns by more senior residents and on other safety nets built into the environment of hospital practice. While those safety nets potentially blunt observed effects, they take this study beyond the in vitro relevance of laboratory study to a pragmatic context.
With the results of the FIRST trial demonstrating non-inferiority in patient outcomes when a more flexible schedule was available, the ACGME issued new duty hour standards, effective July 2018, that correspond to the intervention arm of iCOMPARE. [49] Given that surgery and internal medicine are large fields with many residents caring for many patients, it is important to study the duty hour rules in both specialties as surgical and medical training programs differ in structure, process, culture, the kinds of residents they attract, the patients they serve, and the duties of trainees.

Conclusions
There is considerable interest in understanding concerns in the US health care system, such as why US health care is so expensive, why its overall effects on health lag behind those of peer nations, in particular with respect to safety, and why its individual effects are so unevenly distributed.

Competing interests
The authors report no competing interests.

Appendix Materials 1. Rationale and ICD-9 Codes for Principal Diagnoses Qualifying Hospital/Patient for Inclusion in iCOMPARE Randomization/Analysis
The original ICD-9 code list was created by searching for relevant codes for each medical condition that were found to be associated with high mortality rates in our preliminary 2008 data. The outcomes team reviewed and expanded the code list in two ways:
- We consulted the official ICD-9 code book and looked for additional high-volume codes related to the medical conditions of interest. The list of proposed expansion codes was reviewed and approved by the coding consultant to the study, Dr. Patrick Romano of the University of California, Davis.
- To account for the presence of ICD-10 codes on claims with discharge dates beginning October 1, 2015, we utilized the General Equivalence Mappings (GEMs) made available by the Centers for Medicare and Medicaid Services, which provide the closest possible approximation of a translation between the ICD-9 and ICD-10 code systems. Our goal was to translate each ICD-10 code back to its most closely equivalent ICD-9 code, and our review of the crosswalk caused us to pick up a small number of additional ICD-9 codes. As with the original code list expansion, these proposed additional codes were reviewed and approved by our coding consultant to the study.
All patient safety outcomes are calculated by the outcomes team from the claims data after validating them through an adjudicated process.
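The GEM back-mapping step above is, at heart, a code lookup. The sketch below uses a tiny hypothetical crosswalk fragment; real CMS GEM files are far larger and can map many-to-many, and the codes shown are illustrative, not the study's qualifying list:

```python
# Hypothetical GEM-style fragment: ICD-10 code -> closest ICD-9 code.
# Illustrative entries only, not drawn from the study's code list.
GEM_ICD10_TO_ICD9 = {
    "I21.09": "410.11",  # acute myocardial infarction example
    "J18.9": "486",      # pneumonia, organism unspecified
}

def backmap_to_icd9(icd10_codes, crosswalk):
    """Translate ICD-10 codes to their closest ICD-9 equivalents,
    dropping codes absent from the crosswalk."""
    return sorted({crosswalk[c] for c in icd10_codes if c in crosswalk})

# One code (Z99.89) has no entry in this fragment and is dropped.
added_codes = backmap_to_icd9(["I21.09", "J18.9", "Z99.89"], GEM_ICD10_TO_ICD9)
```

In the study, any ICD-9 codes surfaced this way were then manually reviewed by the coding consultant rather than accepted automatically.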
The reliability of the data used for calculating the patient safety outcomes of mortality, readmissions, and length of stay (LOS) is excellent. The reliability of the data used to calculate complications, costs, and payments is also high, but lower than that of the data used in calculating the 30-day mortality, readmission, and LOS outcomes, owing to variation across hospitals in the number of diagnostic and procedure codes recorded on the claim. However, given the randomized design, we would expect no difference in these outcomes between the two study arms (FLEX and STD). References regarding the reliability of the various patient safety measures are provided below.
In addition to mortality, the following outcome measures were collected:
- Readmission: calculated from the admission and discharge dates in the Medicare claims.
- Length of stay and prolonged length of stay: calculated from the admission and discharge dates of the index claim.
- Complications: calculated using the Agency for Healthcare Research and Quality Patient Safety Indicators (see below).
- Payments: calculated using the payment variables that appear in the inpatient, outpatient, and Part B claims. The total amounts paid by Medicare, the beneficiary, and the primary payer are summed. Year-based adjustments for inflation are applied to the payment figures.
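The date-based measures above reduce to simple arithmetic on claim dates. A minimal sketch, assuming a 30-day readmission window measured from index discharge (the protocol's exact window anchoring is not restated here):

```python
from datetime import date

def length_of_stay(admit, discharge):
    """LOS in days from the index claim's admission and discharge dates."""
    return (discharge - admit).days

def is_30day_readmission(index_discharge, next_admit):
    """True when the next admission begins within 30 days of the index
    discharge (a simplified reading of the claims-based rule)."""
    gap = (next_admit - index_discharge).days
    return 0 <= gap <= 30

# Illustrative claim dates, not trial data.
los = length_of_stay(date(2016, 1, 4), date(2016, 1, 9))
readmitted = is_30day_readmission(date(2016, 1, 9), date(2016, 2, 5))
```

A production implementation would also handle same-day transfers and overlapping claims, which this sketch deliberately ignores.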
The Patient Safety Indicators were calculated by the research team using SAS programs provided by the Agency for Healthcare Research and Quality, which were run on the Medicare claims. Some of the Patient Safety Indicators that were considered "postoperative" or "perioperative" were modified for use with the study's population of medical patients. For these Patient Safety Indicators, the portion of the code that required the patient to have had surgery was deleted.
The following Patient Safety Indicators were used:
- PSI 03 - Pressure ulcer rate
- PSI 06 - Iatrogenic pneumothorax rate
- PSI 07 - Central venous catheter-related blood stream infection rate
- PSI 08 - Postoperative hip fracture rate
- PSI 09 - Perioperative hemorrhage or hematoma rate
- PSI 10 - Postoperative physiologic and metabolic derangement rate
- PSI 11 - Postoperative respiratory failure rate
- PSI 12 - Perioperative pulmonary embolism or deep vein thrombosis rate
- PSI 13 - Postoperative sepsis rate

Costs are calculated using a resource utilization-based method of cost estimation. The following items, which are calculated using the Medicare claims data, are included in the total cost estimate:
- Accommodation costs, based on the number of general floor days and the number of intensive care unit days during the index admission. This information comes from the revenue center files.
- Operating room cost, based on the amount of time spent in the operating room (for patients who had a surgical procedure performed). This is determined using Part B claims.
- Emergency room visit fixed costs, based on post-discharge visits to the emergency room. This is determined using Part B claims.
- Costs of services provided, based on Relative Value Units (RVUs), determined using the Current Procedural Terminology codes on bills. This is determined using Part B claims.
In addition, any costs that occurred within 30 days of the index admission date, and all the costs associated with any readmissions that began within 30 days, are also included in the total cost calculation.
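Putting the components together, the total cost estimate is the sum of the index-admission items plus any costs within the 30-day window. The sketch below uses made-up component names and dollar figures (none are CMS rates or trial values):

```python
def total_cost_estimate(index_components, follow_on_costs):
    """Resource-utilization-based total: index-admission components plus
    any costs incurred within 30 days of the index admission date.
    Component names and dollar figures are illustrative, not CMS rates."""
    return sum(index_components.values()) + sum(follow_on_costs)

index_components = {
    "accommodation_floor": 3 * 1200.0,  # 3 general floor days at an assumed rate
    "accommodation_icu": 1 * 3500.0,    # 1 ICU day at an assumed rate
    "er_visit_fixed": 450.0,            # post-discharge ER visit fixed cost
    "rvu_priced_services": 820.0,       # Part B services priced via RVUs
}
total = total_cost_estimate(index_components, [600.0])  # one 30-day follow-on cost
```

Inflation adjustment, as with the payment measures, would be applied to each component by claim year before summing.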

Appendix Materials 3: Sleep Actigraphy Scoring
According to conventional standards, actigraphy data were classified in 1-minute epochs as wake, sleep, or missing. The first classification was performed by the algorithm of the device manufacturer (Actilife software, version 6.13.3, standard settings, Sadeh scoring algorithm). Off-wrist periods were identified by visual scoring. During this visual scoring process, both Pulsar and the study investigators were blinded to study arm (STD or FLEX). Likewise, PVT-B data were inspected by study sleep experts blinded to arm (Appendix Figure 2). PVT-B performance was classified into three categories: [1] adherent (i.e., PVT-B data reflected an effort to do the task correctly, and comments left by the subject did not suggest non-adherence); [2] possibly non-adherent (i.e., PVT-B data reflected a consistently poor effort to do the task correctly, but comments left by the subject did not suggest non-adherence); and [3] non-adherent (i.e., PVT-B data reflected a consistently poor effort to do the task correctly, and comments left by the subject did suggest non-adherence, e.g., performing the task while brushing teeth). Comments left by interns were inspected for distractions and non-fatigue-related impairment and flagged accordingly.
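The three adherence categories are determined by two expert-review findings, and the decision rule can be written down directly (this encodes the stated rules only; the experts' judgment of what counts as "consistently poor effort" is, of course, not mechanical):

```python
def classify_pvt_adherence(consistently_poor_effort, comments_suggest_nonadherence):
    """Encode the three PVT-B adherence categories described above.
    Inputs are the two findings of the blinded expert review."""
    if consistently_poor_effort and comments_suggest_nonadherence:
        return "non-adherent"
    if consistently_poor_effort:
        return "possibly non-adherent"
    return "adherent"
```

For example, a bout with consistently poor performance but no revealing comment lands in the middle, "possibly non-adherent", category.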
Comments that could have revealed the study arm were blacked out by Pulsar prior to classification by the study sleep experts.

Strengths and limitations of this study
- iCOMPARE is the largest randomized trial examining the impact of duty hours in internal medicine training programs.
- The aims are broad and include patient safety, trainees' sleep and alertness, and a host of education outcomes.
- The protocol draws on a number of data collection methods, including real-time direct observation of trainee activities, existing and study-administered surveys, robust recording of sleep periods over a 14-day period, and patient medical record data.
- The lag between the observation period and data release for patient medical record data is substantial.
- iCOMPARE promises narrow but substantive evidence to inform resident duty hour standards in internal medicine and signals new interest in methodologically stronger research in medical education.

Introduction
Restrictions on resident duty hours in the USA grew out of concerns about patient safety. [2] Safety concerns resonated with a public who found it self-evident that the often 30-hour shifts of resident physicians would lead to fatigue, and that fatigue would lead to errors that would harm patients. Notably, the possible link between duty hours and patient safety was not just a US concern; many countries began to limit hours in the 1990s, and New Zealand has had a limit of 72 hours a week since 1985. In 2003, the ACGME limited US residents to 80 duty hours per week and capped shifts at 24 hours, with additional hours to safely transfer care to the next resident. Partly on the basis of an Institute of Medicine report and a trial from Brigham and Women's Hospital, [5,6] those regulations were further tightened for first-year residents (interns) in 2011, limiting their maximum shift length to 16 hours. This change prompted a charged debate. [7] Proponents argued that the restrictions did not go far enough. Others argued that the regulations were overly restrictive and inflexible, and harbored increased risk to patients by increasing patient handoffs. [8,9] Meanwhile, large observational studies, using data from Medicare and the Veterans Administration across millions of hospitalizations, found essentially no difference in important patient outcomes after the limits took effect, while program directors and trainees expressed concern that the rules reduced training quality and increased rather than decreased medical errors.
[24-26] The increasing recognition of the importance of supervision, with separate mandates implemented by the ACGME over the same period, added further uncertainty to the debate. In the end, a well-meaning effort to manage resident fatigue was perceived by many to promote burnout, increase handoffs, decrease educational opportunities, and delay the professional maturation required to produce competent, independent physicians. Currently, the evidence available to resolve these controversies is limited to a patchwork of laboratory and in vivo studies of sleep deprivation, large-scale epidemiological observations of patient outcomes, surveys of resident and educator opinions, and often single-center trials of unique residency duty hour designs that focused on workload or sleep but not patient outcomes.
In this context, we created the iCOMPARE (individualized Comparative Effectiveness of Models Optimizing Patient Safety and Resident Education) trial, a cluster-randomized trial carried out by internal medicine (IM) residency programs in the U.S. during the 2015-2016 academic year. Participating residency programs were randomized to one of two groups: 1) maintain standard (STD) duty hour rules, or 2) permit a more flexible (FLEX) set of duty hour rules, noted principally for removing the 16-hour shift length restriction for interns and allowing them to work up to 24 hours with an additional 4 hours for care transitions. In contrast to prior work, iCOMPARE was designed to simultaneously assess the impact of duty hour rules on patient safety, resident education, and intern sleep and alertness.

Funding and Organization
The iCOMPARE trial is funded primarily by the National Heart, Lung, and Blood Institute (NHLBI), with additional funding from the ACGME. The NHLBI appointed an independent Data and Safety Monitoring Board (DSMB) to advise the Institute regarding the trial's progress, monitor data quality and safeguard the interests of study participants.

Study Aims and Hypotheses
Study hypotheses are presented in Table 1. The primary hypothesis for the trial is that 30-day any-site patient mortality under FLEX will not exceed (will not be inferior to) 30-day patient mortality under STD, measured as the difference in difference across STD and FLEX programs between a program's 30-day patient mortality rate in the trial year and that rate in the pre-trial year (i.e., 30-day patient mortality rate at a program during the trial year minus 30-day patient mortality rate at the program in the pre-trial year). Secondary outcomes related to patient safety include 7-day and 30-day hospital readmission rates, complication rates, the probability of a prolonged length of stay, total resource utilization, and Medicare payments.
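The difference-in-difference construction of the primary outcome can be made concrete in a few lines. The rates below are illustrative, not trial data, and the sketch shows only the point estimate, not the non-inferiority test itself:

```python
def program_outcome(trial_year_rate, pre_trial_rate):
    """Per-program primary outcome: trial-year 30-day mortality rate
    minus the same program's pre-trial-year rate."""
    return trial_year_rate - pre_trial_rate

def group_mean_difference(flex_outcomes, std_outcomes):
    """Mean year-over-year change in FLEX minus mean change in STD; the
    non-inferiority test asks whether this stays below the 1% margin."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(flex_outcomes) - mean(std_outcomes)

# Two programs per arm with made-up (trial year, pre-trial year) rates.
flex = [program_outcome(0.112, 0.110), program_outcome(0.108, 0.111)]
std = [program_outcome(0.109, 0.110), program_outcome(0.111, 0.109)]
did = group_mean_difference(flex, std)
```

Using each program's own pre-trial rate as the baseline absorbs stable between-program differences in case mix and mortality, so the comparison isolates the year-over-year change.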
iCOMPARE's education hypotheses are that trainees in the FLEX arm will spend more time in direct patient care and education and have greater satisfaction with educational experiences compared to their STD arm peers; that standardized test scores for interns in the FLEX arm will not be lower than such scores for interns in the STD arm; and that faculty in the FLEX arm will report greater satisfaction with their clinical teaching experiences and greater perceptions of safety, teamwork, and supervision than faculty in the STD arm.
iCOMPARE's sleep hypotheses are that average daily sleep will not be less among interns in the FLEX arm compared to those in the STD arm, and that interns in the FLEX arm will not have greater subjective sleepiness or lower behavioral alertness than interns in the STD arm.

Sample Size and Power
Using a two-sample t-test for non-inferiority of the between-group mean year-over-year difference in 30-day mortality and assuming 80% power, Type I error of 5%, a non-inferiority margin of 1%, a pooled SD for the outcome of 1.5%, and a 30-day mortality rate of 11%, we calculated a required sample of 58 programs, 29 per treatment group. The pooled SD of the outcome (year-over-year difference in 30-day mortality) and the 30-day mortality rate in the STD group were estimated using available Medicare data. The Time-Motion study has at least 80% power to detect a 3% difference between FLEX and STD in time spent in direct care.
Hypothesis H3a addresses the FLEX versus STD difference in average intern daily sleep over a 14-day period. Targeting 90% power, a one-sided Type I error of 5%, a non-inferiority margin of 0.5 hours, and assuming expected average daily sleep in STD of 6.9 ± 1.5 hours, [31] the sample size for the Sleep and Alertness study was calculated to be 290 interns (145 per treatment group) and increased to 384 interns (192 per treatment group) to anticipate data loss related to non-adherence and dropouts.
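The arithmetic behind the program-level sample size can be reproduced with a standard normal-approximation formula, n per group = 2(z_alpha + z_beta)^2 * sd^2 / margin^2. This approximation gives 28 per group; the protocol's t-test-based calculation, which adds a small correction, lands at 29 per group (58 programs):

```python
import math
from statistics import NormalDist

def n_per_group(alpha_one_sided, power, margin, sd):
    """Normal-approximation sample size per group for a one-sided
    non-inferiority comparison of two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha_one_sided)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / margin ** 2

# Protocol inputs: 1% margin, 1.5% pooled SD, 80% power, 5% one-sided alpha.
n = math.ceil(n_per_group(alpha_one_sided=0.05, power=0.80, margin=1.0, sd=1.5))
```

Note that the unit of analysis here is the training program, not the patient, because randomization is by program (cluster).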

Study Population and Inclusion Criteria
The CONSORT diagram is provided in Figure 1. In 2014 there were 379 IM ACGME-accredited training programs in the country not on probation. The 54 programs in New York State were excluded because that state's legislated duty hour standards were not subject to the waiver required for the intervention arm. Since the patient safety outcomes would be measured in Medicare patients, Veterans Administration hospitals did not qualify for inclusion. Sufficiently precise estimates of our patient safety outcome measures necessitated exclusion of 84 programs training only at hospitals in the bottom 50% by trainee-to-bed ratio or the bottom 25% by patient volume in the measured diagnoses. In about 50% of teaching hospitals, the number of residents per bed is so few that the residents' impact on patient care is minimal, and so any changes in the impact of those residents' schedules would be too insignificant to measure.
An additional 62 programs in the lowest quartile of number of trainees were excluded to ensure measurement of enough trainees. Two other programs in the lowest quartile of number of trainees were allowed by the Steering Committee. One program was approved for inclusion despite deviations in the size of its two affiliated hospitals because together the two hospitals met the qualifying hospital size. A second program was approved for inclusion because its affiliated hospital far exceeded the trainee-to-bed ratio threshold even though its patient volume was slightly low. A total of 179 programs were eligible for inclusion. Eligible programs were checked for shared staffing with other training programs eligible for and interested in iCOMPARE so that those programs could be randomized together, avoiding different duty hour schedules for trainees at a common hospital. This occurred for 3 pairs of hospitals. In total, 63 programs agreed to participate and were randomized.
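The main eligibility screen above can be summarized as a boolean filter. This sketch covers only the stated exclusion criteria; the lowest-quartile trainee-count exclusion and the Steering Committee's case-by-case exceptions are deliberately not modeled:

```python
def program_eligible(on_probation, in_new_york, va_hospital,
                     trainee_bed_ratio_pctile, volume_pctile):
    """Apply the program exclusions described above: probation, New York
    State, VA hospitals, bottom 50% trainee-to-bed ratio, and bottom 25%
    patient volume in the measured diagnoses. Percentile inputs are
    national ranks (0-100, higher means larger)."""
    if on_probation or in_new_york or va_hospital:
        return False
    return trainee_bed_ratio_pctile >= 50 and volume_pctile >= 25

# A non-probation, non-NY, non-VA program at the 60th percentile of
# trainee-to-bed ratio and 30th percentile of volume qualifies.
ok = program_eligible(False, False, False, 60, 30)
blocked = program_eligible(False, True, False, 60, 30)  # NY programs excluded
```

The trainee-to-bed and volume cutoffs serve the same end: ensuring residents' schedules plausibly influence measured patient outcomes.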

Patient and Public Involvement
The study was based on a fundamental concern about the impact of training schedules on patient, trainee, and program director experiences. However, no patients were directly involved in the design and conduct of the study. Program directors from each participating program were involved in and essential to recruitment, which was at the training program level. Program directors were fully informed about the intervention, and individual trainees had the opportunity to decline participation in surveys and other education outcomes. Results will be presented at national meetings to program directors and published in peer-reviewed journals.

Data Sources
Hypothesis testing required outcome data from patients, trainees, and program directors and faculty.
Because of the available and highly reliable patient mortality information (both in hospital and after discharge) in the Medicare fee-for-service (FFS) program, the iCOMPARE patient population was limited to Medicare FFS beneficiaries with a qualifying principal diagnosis on hospital admission (Supplementary Appendix: Appendix Materials 1); these diagnoses were chosen for their common treatment on internal medicine services (excluding oncology and neurology diagnoses) and their elevated mortality rates.
Randomized programs were further invited to participate in the observational Time-Motion and the Sleep and Alertness sub-studies, described below.

Timeline
Trial development began in 2013. The Medicare data required for the analyses of the patient safety outcomes are not expected to be available until mid 2018 ( Figure 2).

Aim 2 -Education
Education measures are derived from multiple sources. The primary education measures are specified in Table 3 and come from the Time-Motion sub-study, the ACGME year-end trainee and core faculty surveys, and the interns' In-Training Examination (ITE) scores provided by the American College of  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  Time-Motion sub-study. We recruited 3 IM programs randomized to STD and 3 randomized to FLEX to participate in direct observations of some of their interns, targeting programs in the mid-Atlantic region for operational convenience; programs with tertiary hospitals as well as community-based programs were included in both arms. Participating programs received $3,000 to support these sub-study activities. We recruited interns rotating on General Medicine services between March-May 2016.
Eligible and interested interns provided written consent. Among the 129 interns invited to participate, 120 (93%) consented.
Twenty-three observers (medical students and undergraduates) were trained to follow participating interns. They used a custom-built tablet-based software program to document start and stop times for various intern activities: direct patient care, indirect patient care, education, rounds, work, handoffs, and miscellaneous, each with various subcategories reflecting greater specificity of tasks. For example, direct patient care had subcategories for patient interactions, family interactions, and physical contact (e.g., physical exam). At least one activity had to be selected at all times, although more than one could be selected to reflect multitasking. At the start and stop of the shift an observer completed brief surveys that summarized total patient census numbers for that intern, including the numbers of transfers, discharges, admissions, and patients received at the beginning of a shift and handed-off at the end.
Shifts were selected at each site aiming to capture 30 shifts in a proportion mirroring -specific for each site -how interns generally spend their time on a general medicine inpatient rotation in a given week.
Observers shadowed approximately 8-10 shifts, for 1-3 shifts per intern. Extended shifts (e.g., 24 hour call cycles) were often split by two observers. A 10% sample of shifts was observed simultaneously by two observers to estimate inter-rater reliability. The methods for the Time-Motion sub-study were    [3] number of handoffs experienced in past 24 hours; and [4] the number of patients for which the trainee was the primary provider. These questions provide another view of how interns spend their time.
Survey 2 was sent on the alternate days and asked the same first question as Survey 1, as well as the trainee's ratings (too little, just right, too much) for [1] time spent in educational conference and related activities, [2] sense of ownership of patients, [3] work intensity, and [4] continuity of care. Data for these questions are related to satisfaction and complement the iCOMPARE's end-of-year survey (described below) and ACGME survey (described above).
Trainees were entered into an incentive lottery designed so that in each 2-week cycle, one intern and one resident at each of the 63 IM programs were each awarded either a $25 or $100 Amazon gift card if they had completed their survey during that period. After the first cycle, the cycle response rate ranged between 39% and 42%. year, a $2,500 cash incentive was provided to each of the 9 programs with the highest response rates.
The trainee survey was administered to all trainees with only slight differences between versions for interns and PGY2 and higher trainees. The instrument was initially developed for the FIRST trial 26 and included items on trainee satisfaction, experience of duty hours, supervision, fatigue management, and  The program director survey was modeled from an earlier survey to program directors [41] and included items on resident and faculty workload, resident morale, continuity, education, patient safety, and program finances and administration.

Aim 3 -Sleep and alertness
Outcomes for the third aim include sleep duration and both subjective and objective measures of alertness among interns at 6 sites randomized to STD and 6 sites randomized to FLEX. At each of these Sleep Actigraphy Scoring and Supplementary Appendix Figure 1 and Figure 2.

Statistical Considerations
Non-inferiority tests will be one-sided and superiority tests will be two-sided. All primary analyses will compare the FLEX and STD treatment groups as randomized, regardless of adherence to the assigned duty hour standards, according to the intention-to-treat principle. Since directors at programs assigned to FLEX have considerable latitude in design of trainee schedules, we expect variation amongst the dutyhour schedules followed in the FLEX group. Protocol-specified secondary analyses addressing the degree of difference between FLEX and STD schedules will be completed. We will also report mortality results adjusting for the clinical condition associated with patient's principal diagnosis as well as demographic

Dissemination
Plans for dissemination include submitting results for the various aims to academic meetings and peerreviewed publications as they become available. The Data and Safety Monitoring Board will comment on manuscripts reporting results related to the primary hypotheses before journal submission. Some may wonder why duty hours or shift lengths were not imposed on the intervention group and why, instead, the intervention group merely had permission to use more flexibility in their scheduling.

Discussion
The design of this study means that the potency of the duty hour changes actually implemented by programs may be less extreme than what was permitted by the intervention, adding noise to or in certain shifts? But the question at hand is: What happens when programs are allowed flexibility in their scheduling of shifts? The outcome of those policy decisions is, in its implemented state, a product of the flexibility of the rules and the degree to which individual programs take advantage of that flexibility. In some ways, this study design is consistent with that of effectiveness trials of drugs, where the anticipated effect is a product not just of whether one was randomized to the study drug, but also whether one was adherent to the drug. In the real world, adherence is relevant. Context is similarly relevant. The effects we observe in this trial also depend critically on the oversight and supervision provided to interns by more senior residents and on other safety nets built into the environment of hospital practice. While those safety nets potentially blunt observed effects, they take this study beyond the in vitro relevance of laboratory study to a pragmatic context.
With the results of the FIRST trial demonstrating non-inferiority in patient outcomes when a more flexible schedule was available, the ACGME issued new duty hour standards, effective July 2018, that correspond to the intervention arm of iCOMPARE.[49] Given that surgery and internal medicine are large fields with many residents caring for many patients, it is important to study the duty hour rules in both specialties, as surgical and medical training programs differ in structure, process, culture, the kinds of residents they attract, the patients they serve, and the duties of trainees.

Appendix Materials 1. Rationale and ICD-9 Codes for Principal Diagnoses Qualifying Hospital/Patient for Inclusion in iCOMPARE Randomization/Analysis
The original ICD-9 code list was created by searching for relevant codes for each medical condition that were found to be associated with high mortality rates in our preliminary 2008 data. The outcomes team reviewed the code list and expanded it in two ways:
- We consulted the official ICD-9 code book and looked for additional high-volume codes related to the medical conditions of interest. The list of proposed expansion codes was reviewed and approved by the coding consultant to the study, Dr. Patrick Romano of the University of California, Davis.
- To account for the presence of ICD-10 codes on claims with discharge dates beginning October 1,

Appendix Materials 2: Patient Safety Outcome Measures
The reliability of the data used for calculating the patient safety outcomes of mortality, readmissions, and length of stay (LOS) is excellent. The reliability of the data used to calculate complications, costs, and payments is also high, but lower than that of the data used in calculating the 30-day mortality, readmission, and LOS outcomes, because the number of diagnostic and procedure codes recorded on a claim varies across hospitals. However, given the randomized nature of the study, we would expect no difference in these outcomes between the two study arms (FLEX and STD). References regarding the reliability of the various patient safety measures are provided below.
In addition to mortality, the following outcome measures were collected:
- Readmission: calculated based on the admission and discharge dates in the Medicare claims.
- Length of stay and prolonged length of stay: calculated based on the admission and discharge dates of the index claim.
- Complications: calculated using the Agency for Healthcare Research and Quality Patient Safety Indicators (see below).
- Costs: calculated using inpatient, revenue center, and Part B claims. We use a resource costing method (more details are provided below) to estimate the costs associated with accommodations (general floor and intensive care unit), the operating room, post-discharge emergency room visits, and other services indicated by the presence of Current Procedural Terminology codes (which are translated to Relative Value Units).
- Payments: calculated using the payment variables that appear in the inpatient, outpatient, and Part B claims. The total amounts paid by Medicare, the beneficiary, and the primary payer are summed. Year-based adjustments for inflation are applied to the payment figures.
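The readmission calculation from claim dates can be illustrated with a minimal sketch; the function name and the 30-day window anchored at the index discharge date are illustrative assumptions, not the study's actual code:

```python
from datetime import date

def is_readmission_within_30d(index_discharge, later_admission_dates):
    # True if any subsequent admission begins within 30 days of the
    # index discharge date, per the Medicare claim dates supplied.
    return any(0 <= (adm - index_discharge).days <= 30
               for adm in later_admission_dates)

# Discharged June 1, readmitted June 20: counts as a 30-day readmission.
print(is_readmission_within_30d(date(2016, 6, 1), [date(2016, 6, 20)]))  # True
# A July 15 admission falls outside the window.
print(is_readmission_within_30d(date(2016, 6, 1), [date(2016, 7, 15)]))  # False
```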
The Patient Safety Indicators were calculated by the research team using SAS programs provided by the Agency for Healthcare Research and Quality, which were run on the Medicare claims. Some of the Patient Safety Indicators that were considered "postoperative" or "perioperative" were modified for use with the study's population of medical patients. For these Patient Safety Indicators, the portion of the code that required the patient to have had surgery was deleted.
The following Patient Safety Indicators were used:
- PSI 03 - Pressure ulcer rate
- PSI 06 - Iatrogenic pneumothorax rate
- PSI 07 - Central venous catheter-related blood stream infection rate
- PSI 08 - Postoperative hip fracture rate
- PSI 09 - Perioperative hemorrhage or hematoma rate
- PSI 10 - Postoperative physiologic and metabolic derangement rate
- PSI 11 - Postoperative respiratory failure rate
- PSI 12 - Perioperative pulmonary embolism or deep vein thrombosis rate
- PSI 13 - Postoperative sepsis rate

Costs are calculated using a resource utilization-based method of cost estimation. The following items, which are calculated using the Medicare claims data, are included in the total cost estimate:
- Accommodation costs, based on the number of general floor days and the number of intensive care unit days during the index admission. This information comes from the revenue center files.
- Operating room cost, based on the amount of time spent in the operating room (for patients who had a surgical procedure performed). This is determined using Part B claims.
- Emergency room visit fixed costs, based on post-discharge visits to the emergency room. This is determined using Part B claims.
- Costs of services provided, based on Relative Value Units (RVUs), determined using the Current Procedural Terminology codes on bills. This is determined using Part B claims.

In addition, any costs that occurred within 30 days of the index admission date, and all the costs associated with any readmissions that began within 30 days, are included in the total cost calculation.
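The cost components above can be sketched as a simple sum. Every unit cost and the RVU conversion factor below are made-up placeholders, not the study's actual figures:

```python
# Illustrative resource-costing sketch; all constants are assumptions.
FLOOR_DAY_COST = 1500.0      # assumed cost per general floor day
ICU_DAY_COST = 4000.0        # assumed cost per intensive care unit day
OR_COST_PER_MIN = 40.0       # assumed operating room cost per minute
ER_VISIT_FIXED_COST = 750.0  # assumed fixed cost per post-discharge ER visit
RVU_CONVERSION = 36.0        # assumed dollars per Relative Value Unit

def total_cost(floor_days, icu_days, or_minutes, er_visits, total_rvus):
    # Sum the components named in the text: accommodation (floor + ICU),
    # operating room time, post-discharge ER visits, and RVU-based services.
    return (floor_days * FLOOR_DAY_COST
            + icu_days * ICU_DAY_COST
            + or_minutes * OR_COST_PER_MIN
            + er_visits * ER_VISIT_FIXED_COST
            + total_rvus * RVU_CONVERSION)

# A 5-day stay (4 floor days, 1 ICU day), no surgery, one ER visit,
# and 20 RVUs of billed services.
print(total_cost(floor_days=4, icu_days=1, or_minutes=0,
                 er_visits=1, total_rvus=20.0))  # 11470.0
```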

Appendix Materials 3: Sleep Actigraphy Scoring
According to conventional standards, actigraphy data were classified in 1-minute epochs as wake, sleep, or missing. The first classification was performed by the algorithm of the device manufacturer (Actilife software, version 6.13.3, standard settings, Sadeh scoring algorithm). Off-wrist periods were identified, and the automated scoring was then visually reviewed by study investigators who are experts in sleep research. In an iterative process, any discrepancies were documented and then corrected by Pulsar until agreement with the study investigators was reached.
During this visual scoring process, both Pulsar and the study investigators were blinded to study arm (STD or FLEX). Likewise, PVT-B data were inspected by study sleep experts blinded to arm (Appendix Figure 2). PVT-B performance was classified into three categories: (1) adherent (PVT-B data reflected an effort to do the task correctly, and comments left by the subject did not suggest non-adherence); (2) possibly non-adherent (PVT-B data reflected a consistently poor effort to do the task correctly, but comments left by the subject did not suggest non-adherence); and (3) non-adherent (PVT-B data reflected a consistently poor effort to do the task correctly, and comments left by the subject did suggest non-adherence, e.g., performing the task while brushing teeth). Comments left by interns were inspected for distractions and non-fatigue-related impairment and flagged accordingly.
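The three adherence categories can be expressed as a small decision rule. The boolean inputs and the function below are illustrative; boundary cases the text does not spell out (e.g., good effort but suspicious comments) are resolved here by checking effort first:

```python
def classify_pvt_session(consistently_poor_effort, comments_suggest_nonadherence):
    # Map the two inspection criteria described in the text to the
    # three adherence categories.
    if not consistently_poor_effort:
        return "adherent"
    if comments_suggest_nonadherence:
        return "non-adherent"
    return "possibly non-adherent"

print(classify_pvt_session(False, False))  # adherent
print(classify_pvt_session(True, False))   # possibly non-adherent
print(classify_pvt_session(True, True))    # non-adherent
```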