Comparing hormone therapy effects in two RCTs and two large observational studies that used similar methods for comprehensive data collection and outcome assessment
  1. Arthur Hartz1,
  2. Tao He2,
  3. Robert Wallace3,
  4. John Powers4
  1. 1Huntsman Cancer Institute at the University of Utah, St Louis, Missouri, USA
  2. 2Health Services Research, University of Utah College of Medicine, Salt Lake City, Utah, USA
  3. 3Department of Epidemiology, College of Public Health, University of Iowa, Iowa City, Iowa, USA
  4. 4Department of Medicine, George Washington University School of Medicine, District of Columbia, DC, USA
  1. Correspondence to Dr Arthur Hartz; hartzarthur{at}


Objectives Prospective observational studies (OSs) that collect adequate information about confounders can validly assess treatment consequences. However, what constitutes adequate information is unknown. This study investigated whether the extensive information collected by the Women's Health Initiative (WHI) in two OSs and two randomised controlled trials (RCTs) was adequate.

Design Secondary analysis of WHI data. Cox regression was used to select from all baseline risk factors those that best predicted outcome. Cox regression that included these risk factors was used for two types of analyses: (1) comparing RCT and OS assessments of the effects of hormone therapy on outcome for participants with specific characteristics and (2) evaluating whether adjustment for measured confounders could eliminate outcome differences among datasets.

Setting The WHI included more than 800 baseline risk factors and outcomes during a median follow-up of 8 years.

Participants 151 870 postmenopausal women ages 50–79.

Primary and secondary outcome measures Myocardial infarction and stroke.

Results RCT and OS results differed for the association of hormone therapy with outcome after adjusting for confounding factors and stratifying on factors that were hypothesised to modulate the effects of hormone therapy (eg, age and time since menopause) or that empirically modulated the effects of hormone therapy in this dataset (eg, blood pressure, previous coronary revascularisation and private medical insurance). Some of the four WHI datasets had significantly worse outcomes than others even after adjusting for risk and stratifying by type of hormone therapy, for example, the risk-adjusted HR for myocardial infarction was 1.37 (p<0.0001) in an RCT placebo group compared with an OS group not taking hormone therapy.

Conclusions Apparently the WHI did not collect sufficient information to give reliable assessments of treatment effects. If the WHI did not collect sufficient data, it is likely that few OSs collect sufficient information.

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:

Article summary

Article focus

  • Observational studies (OSs) are frequently used to compare outcomes of patients who choose different treatments.

  • Results of OSs may be invalid because of confounding due to an association between patient risk and treatment choice.

  • The present study assessed whether the extensive information collected by the Women's Health Initiative (WHI) was adequate to eliminate confounding and give valid results.

Key messages

  • The effects of hormone therapy on stroke and myocardial infarction differ for OSs and randomised controlled trials even after taking advantage of extensive participant information to remove confounding and to select similar participants.

  • Participants who self-selected for different studies had different outcomes that could not be explained by differences in measured risk factors.

  • As comprehensive data such as collected by the WHI appear to be inadequate to ensure the validity of an observational study, it is unclear what observational study results can be accepted with confidence.

Strengths and limitations of this study

  • The WHI dataset is unusually comprehensive and provided a good test of whether excellent datasets can ensure valid results for an observational study. The conditions for valid OSs were not identified.

Medical practice often depends on observational studies (OSs) that compare outcomes of similar patients treated differently. However, OS results may be erroneous because patient risk factors are confounded with treatment choice. The effects of confounding factors can be removed with statistical methods only if those factors are adequately measured. The success of removing confounding errors has been vigorously debated.1–3

The strongest evidence against the validity of the OSs has been discrepancies between OSs and randomised controlled trials (RCTs). In particular, RCTs from the Women's Health Initiative (WHI) found that hormone therapy (HT) increased the risk of myocardial infarction (MI)4 or had no effect5 and increased the risk of stroke.4 ,5 These findings contradicted a large body of well-performed OSs suggesting that HT may reduce the risk of cardiovascular disease by 30–50%.6–8

However, RCT/OS discrepancies do not prove that the OS design is invalid. Another possibility is that the discrepancies are caused by differences in characteristics of the study population, therapy or outcome measurements (eg, duration of follow-up). For example, the women evaluated in the WHI RCT were older than those in most OSs, and there is some evidence that HT has a greater adverse effect on older women or women who began HT several years after menopause.9–12 There is also evidence that the influence of HT on MI risk is greatest soon after initiation,13 and OSs that can follow participants soon after they begin therapy may give results similar to RCTs.10 ,14 It may be possible that other patient characteristics (eg, obesity, smoking or health status) that differ between types of studies alter the associations between HT and outcomes.

The WHI offers an excellent opportunity to assess the value of OSs for three reasons: (1) The same type of data were collected in almost the same way for two RCTs and two OSs of HT; (2) the data collected included comprehensive information about numerous potentially relevant risk factors that are rarely available in OSs, including many often suspected to cause confounding (eg, those related to socioeconomic status, functional status, psychological status, lifestyle factors and healthcare behaviours) and (3) the sample sizes were large enough to enable subgroup comparisons.


The ability of an OS to eliminate confounding was examined by testing three hypotheses:

  1. Result differences between OSs and RCTs can be eliminated by adjusting for the WHI risk factors.

  2. Differences between OSs and RCTs are caused by differences in modulating factors such as the time after menopause that HT is initiated,9–12 the time OS participants are on HT prior to beginning the study13 ,14 or other participant characteristics that have not been previously suggested.

  3. Confounding factors associated with which specific WHI study recruited the participant can be eliminated by adjusting for the WHI risk factors.

WHI dataset

Data were obtained from the WHI, which has been described in detail.5 ,6 The study was approved by institutional review boards, and all participants signed informed consent forms. In brief, it was a long-term national health study that focused on strategies for preventing heart disease, breast and colorectal cancer and osteoporosis in postmenopausal women. Women aged 50–79 were enrolled from 1993 to 1998 at 40 clinical centres throughout the USA for clinical trials. Women were asked to enrol in an RCT, and those who were ineligible or not interested were given the opportunity to enrol in the WHI OS.

There were four WHI studies relevant to the present analysis: (1) an RCT of oestrogen therapy (E-alone) for women without a uterus, (2) an RCT of oestrogen plus progesterone (E+P) for women with a uterus, (3) an RCT of diet and (4) WHI OS with no interventions. The RCT of diet served as a second OS for the effects of HT because HT use was not randomised for these patients. Participants who were enrolled in the RCT for diet as well as an RCT for HT were considered to be only in the RCT for HT dataset.

For follow-up and outcome ascertainment, all participants completed a self-administered self-report questionnaire. This questionnaire was completed semiannually by the RCT participants and annually by the OS participants. Adjudicated outcomes were based on medical records, autopsy reports and death certificates.

The more than 800 baseline risk factors analysed in the present study were in the following categories: demographics, general health, clinical and anthropometric, functional status, healthcare behaviours, reproductive, medical history, family history, personal habits, thoughts and feelings, therapeutic class of medication, hormones, supplements and dietary intake.

Statistical analysis

The Cox proportional hazard regression analysis was used to test the association between outcome and the primary risk factor after adjusting for covariables. The outcomes analysed in this study were MI or stroke that developed after the participants were enrolled in the study. The primary risk factors were HT (either the binary variable for any HT use or the three category variable for use of E-alone, E+P or neither) or the categorical variable for the four datasets.
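For readers less familiar with the method, the standard form of the Cox proportional hazards model is shown below. The symbols are generic textbook notation rather than notation taken from this article: $x_{\mathrm{HT}}$ is an indicator for the primary risk factor, $z_j$ are the adjustment covariables and $h_0(t)$ is the unspecified baseline hazard.

```latex
h(t \mid x_{\mathrm{HT}}, z)
  = h_0(t)\,\exp\!\Big(\beta_{\mathrm{HT}}\,x_{\mathrm{HT}} + \sum_{j}\gamma_j z_j\Big),
\qquad
\mathrm{HR}_{\mathrm{HT}} = \exp(\beta_{\mathrm{HT}})
```

Under this form, a risk-adjusted HR of 1.00 corresponds to $\beta_{\mathrm{HT}} = 0$, that is, no association between the primary risk factor and outcome after adjustment.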

The primary risk factors were represented by an indicator variable for every category except the reference category. The HR associated with an indicator variable for a category represented the risk for participants with that variable compared with the risk of participants in the reference category. The reference category for the HT variables was no HT use, and the reference category for dataset was the WHI OS.
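The indicator coding described above can be sketched in Python as follows. This is a minimal illustration and not the authors' SAS code; the function name `indicator_columns` and the category labels are assumptions made for the example.

```python
def indicator_columns(values, categories, reference):
    """Return one 0/1 indicator column per non-reference category."""
    if reference not in categories:
        raise ValueError("reference must be one of the categories")
    non_reference = [c for c in categories if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in non_reference}


# Hypothetical labels for the three-category HT variable; "none" is the reference.
ht_use = ["none", "E-alone", "E+P", "none", "E+P"]
columns = indicator_columns(ht_use, ["none", "E-alone", "E+P"], reference="none")
```

A participant in the reference category has a zero in every column, so each fitted HR compares its own category with the reference category.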

To identify which covariables should be included in a Cox model, we first tested the statistical significance of more than 800 risk factors by including only the risk factor and age in the Cox model for a given outcome. All risk factors that were statistically significant at the p<0.01 level after adjusting for age alone were then included in a backwards stepwise Cox proportional hazard regression analysis, and variables that remained statistically significant at the p<0.0001 level were retained in the model. We then used the Cox forward stepwise procedure to test whether any of the variables not already in the model could enter at the p<0.0001 level. It is unlikely that many of these variables were significant by chance alone and even less likely that adjusting for spurious variables would distort the association between HT and outcome.
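The control flow of this three-step selection can be sketched as below. It is only an outline under stated assumptions: `model_p` is a hypothetical callback standing in for fitting a Cox model and returning p-values, the toy p-values are invented for illustration, and the forward step is assumed to consider every candidate not already in the model, which the text does not specify.

```python
def select_covariates(candidates, age_adjusted_p, model_p,
                      screen=0.01, keep=0.0001):
    """Screen, then backward-eliminate, then forward-add covariates.

    age_adjusted_p: dict of p-values from models with age plus one factor.
    model_p: hypothetical callback; given a list of variables it returns
             a dict of their p-values in a jointly fitted model.
    """
    # Step 1: keep factors significant at p < screen after adjusting for age.
    selected = [v for v in candidates if age_adjusted_p[v] < screen]

    # Step 2: backward elimination until every variable has p < keep.
    while selected:
        p = model_p(selected)
        worst = max(selected, key=lambda v: p[v])
        if p[worst] < keep:
            break
        selected.remove(worst)

    # Step 3: forward step over variables not already in the model.
    remaining = [v for v in candidates if v not in selected]
    while remaining:
        trial = {v: model_p(selected + [v])[v] for v in remaining}
        best = min(trial, key=trial.get)
        if trial[best] >= keep:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy stand-in for model fitting: fixed p-values, independent of the model.
fixed_p = {"a": 1e-7, "b": 0.003, "c": 1e-5, "d": 0.5}
toy_model_p = lambda variables: {v: fixed_p[v] for v in variables}
screen_p = {"a": 1e-6, "b": 0.005, "c": 0.2, "d": 0.02}

chosen = select_covariates(["a", "b", "c", "d"], screen_p, toy_model_p)
```

In the toy run, `b` passes the age-adjusted screen but is dropped in the backward step, while `c` fails the screen but enters in the forward step.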

To identify which risk factors modulated the association between HT and outcome we tested the interactions of HT with the risk factors that had been tested with the timing hypothesis or that had a statistically significant association with outcome at the p<0.01 level after adjusting for age and dataset.

In an analysis that only included OS participants not taking HT at baseline, follow-up began at the time the participant completed the questionnaire that first reported HT or, if they never began HT, follow-up began at the time they completed their first questionnaire after baseline. (If follow-up for these participants had begun as late as it did for the HT participants, it would have diminished the HR associated with HT.) The baseline age of participants in this analysis was computed for the time that follow-up began.

Stepwise procedures were used to find a logistic regression equation that included the risk factors independently associated at the p<0.0001 level with taking baseline HT in the WHI OS. An individual's propensity score was the probability derived from her characteristics and the estimated parameters in this equation. We evaluated whether grouping participants with similar propensity scores decreased confounding in the OSs so that OS and RCT results became more similar.

The median follow-up time was 8 years. However, for the E+P RCT, treatment was ended after a mean follow-up of 5.2 years even though follow-up on all participants was continued. To make time on HT in the study comparable for the OS and each RCT, we ended follow-up at 5 years.
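Ending follow-up at 5 years amounts to administrative censoring. A minimal sketch follows; the function name `censor_at` and the treatment of an event occurring exactly at 5 years as counted are assumptions, not details given in the article.

```python
def censor_at(follow_up_years, had_event, horizon=5.0):
    """Truncate follow-up at the horizon; later events become censored."""
    if follow_up_years > horizon:
        # Anything observed after the horizon is ignored.
        return horizon, False
    return follow_up_years, had_event


late_event = censor_at(7.2, True)    # event after 5 years is not counted
early_event = censor_at(3.1, True)   # event within 5 years is kept
no_event = censor_at(8.0, False)     # event-free follow-up is shortened
```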

All statistical analyses were performed using SAS V.9 (SAS Institute Inc, Cary, North Carolina, USA).

Sample size

Participants available for analysis included 161 748 WHI participants: 93 651 from the observational study, 16 590 from the RCT of oestrogen plus progesterone (E+P), 10 722 from the RCT of oestrogen only (E-alone) and 40 785 additional women who were in the diet study and not in an RCT of HT. Of the 161 748 WHI participants, 9584 were excluded because they met one or more of the following RCT exclusion criteria: platelets less than 75 000/mm3, haematocrit less than 32%, oral daily use of a glucocorticosteroid, body mass index less than 18, systolic blood pressure greater than 200 mm Hg, diastolic blood pressure greater than 105 mm Hg, breast cancer ever, other cancers in the last 10 years, or stroke, transient ischaemic attack (TIA) or MI in the past 6 months. An additional 294 were missing information on the use of HT at baseline.
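The counts above can be checked with simple arithmetic. The sketch below assumes, as the word "additional" suggests, that the 294 participants with missing HT information do not overlap with the 9584 exclusions; the variable names are illustrative.

```python
# Enrolment counts as reported in the text.
enrolled = {
    "observational study": 93_651,
    "E+P RCT": 16_590,
    "E-alone RCT": 10_722,
    "diet study only": 40_785,
}
total = sum(enrolled.values())

excluded_by_rct_criteria = 9_584
missing_baseline_ht = 294

# Participants remaining after both exclusions.
analysed = total - excluded_by_rct_criteria - missing_baseline_ht
```

The four dataset counts sum to 161 748, and the remainder of 151 870 agrees with the number of participants given in the abstract.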

Missing data for the covariables were imputed by the mean value for ordinal or binary variables and the mode value for variables with three or more categories. After determining which risk factors were independently associated with a given outcome at the p<0.0001 level, we created a corresponding indicator variable for each of those risk factors that indicated if the variable was missing. If the missing indicator variable was statistically significant at the p<0.05 level, participants missing the corresponding risk factor were excluded. There were 146 936 participants included in the fully adjusted Cox model for MI and 149 470 included in the fully adjusted Cox model for stroke. The ability of the Cox model to predict outcome as measured by the C statistic was not improved by excluding participants with estimated values of the covariates.
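The imputation rule and the accompanying missing-value indicator can be sketched as follows. This is an illustration of the rule as described, not the authors' code; the function name `impute` and the use of `None` to mark a missing value are assumptions.

```python
from statistics import mean, mode


def impute(values, kind):
    """Fill missing values (None) and return a 0/1 missing indicator.

    kind: "ordinal" or "binary" uses the mean of observed values;
          any other kind (three or more categories) uses the mode.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind in ("ordinal", "binary") else mode(observed)
    imputed = [fill if v is None else v for v in values]
    missing_flag = [1 if v is None else 0 for v in values]
    return imputed, missing_flag


ordinal_values, ordinal_flag = impute([1, None, 3], "ordinal")
category_values, category_flag = impute(["a", None, "b", "a"], "categorical")
```

The indicator is what allows the later significance test: if missingness itself predicts outcome, participants with the imputed value are excluded rather than retained.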


Baseline participant characteristics for participants in the four datasets are compared in table 1. For two datasets participants on HT were compared with participants without HT. That was not necessary, however, for the RCTs for E+P and for E-alone because randomisation in these studies made the treatment arm unrelated to baseline characteristics. In the OS and RCT for diet datasets the risks due to age, race, income, educational level, physical functioning and smoking were most favourable for participants on E+P and least favourable for participants not taking HT. With the exception of smoking these characteristics were also more favourable for participants in the RCT for E+P than in the RCT for E-alone. Both socioeconomic status variables (education and income levels) are lower for the two RCTs of HT datasets than for the other two datasets, p<0.0001. For this reason it was important to evaluate whether socioeconomic status influenced the association between HT and outcome.

Table 1

Percentage of participants in a given category by dataset and type of hormone therapy

Propensity score

The logistic regression equation to predict the probability that a participant in the OS used HT (ie, the propensity score) included 94 independent risk factors statistically significant at the p<0.0001 level and had a C statistic of 0.90, indicating that the equation was highly predictive of HT use.
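For a binary outcome such as HT use, the C statistic is the proportion of user/non-user pairs in which the user has the higher predicted probability, with ties counted as one half. A minimal sketch with invented scores follows; the function name `c_statistic` is an assumption, and the C statistic for a Cox model must additionally handle censoring, which this sketch does not.

```python
def c_statistic(scores, labels):
    """Concordance for a binary outcome (equivalent to the ROC area)."""
    users = [s for s, y in zip(scores, labels) if y == 1]
    non_users = [s for s, y in zip(scores, labels) if y == 0]
    if not users or not non_users:
        raise ValueError("need at least one participant in each group")
    concordant = 0.0
    for u in users:
        for n in non_users:
            if u > n:
                concordant += 1.0   # user ranked above non-user
            elif u == n:
                concordant += 0.5   # tie counts as half
    return concordant / (len(users) * len(non_users))


# Invented propensity scores: two HT users (1) and two non-users (0).
toy_c = c_statistic([0.9, 0.4, 0.7, 0.2], [1, 1, 0, 0])
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination, which is why 0.90 is described as highly predictive.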

Risk factors for MI and stroke

We identified 16 risk factors (in addition to dataset) that were independently associated with MI at the p≤0.0001 level. The variables, with their associated χ² values for the full dataset in parentheses, were age (594.3), taking medication for diabetes (284.3), smoking at baseline (182.4), systolic blood pressure (150.1), history of coronary artery bypass surgery (110.1), history of cardiovascular disease (67.1), limited in climbing stairs (62.8), worse general health (52.1), family history of MI (50.0), lower income (46.4), current history of MI (44.2), white race (44.1), the ratio of waist circumference to hip circumference (38.1), hypertensive medications (33.4), taking calcium channel blockers (24.0) and higher haematocrit (18.6). The C statistic for this equation was high, 0.78 (95% CI 0.77 to 0.79).

Twelve risk factors were independently associated with stroke at the p≤0.0001 level: age (667.4), systolic blood pressure (181.4), history of diabetes (110.3), medication for hypertension (85.3), current smoking (79.9), physical function (68.2), history of stroke (49.1), history of cardiovascular disease (38.8), TIAs (30.8), cardiotonic medication, especially digitalis (27.1), lower income (21.7) and lifetime HT duration (14.9). The C statistic for these variables was 0.76 (95% CI 0.76 to 0.77).

Association of HT with MI and stroke

The risk-adjusted HRs for a specific type of HT (E+P or E-alone) and for either HT are shown in table 2 for each dataset. In the WHI OS dataset E+P and E-alone had similar HRs. In the diet dataset E-alone was significantly protective for MI (HR=0.65) but E+P was not (HR=0.96, p=0.04 for the difference between HRs for E-alone and E+P), and there was no association of either type of HT with stroke. In the RCT datasets there was an association of E+P with an increased risk of MI (HR=1.30) as well as stroke (HR=1.34), but E-alone was not associated with MI.

Table 2

Risk-adjusted HRs for hormone therapy in different datasets

To test for differences in HRs among the datasets, we combined all datasets and included main effects, interactions between HT and dataset and risk factors in the Cox model. The MI HR for E+P was larger in the E+P RCT than in the OS (p=0.07), and the MI HR for E-alone was higher in the RCT for E-alone than in the diet dataset (p=0.06). For stroke, where the evidence for the HT risk is stronger, the HR in the combined RCT datasets was significantly higher than it was in the WHI OS dataset (p<0.0001) and in the diet dataset (p=0.005).

Influence of patient characteristics on the association between HT and outcomes

The analyses reported in tables 3 and 4 examined how OS and RCT differences might be influenced by the timing of the HT with respect to age, menopausal status and previous hormones. Also these tables show the effects of additional adjustment for confounding using propensity scores. The HRs and their CIs are presented for women on any HT. Where it might be informative, HRs without CIs are presented for women on a specific type of HT (either E+P or E-alone).

Table 3

MI HRs for hormone therapy in subgroups defined by participant characteristics associated with hormone exposure

Table 4

Stroke HRs for hormone therapy in subgroups defined by participant characteristics associated with hormone exposure

Myocardial infarction

Table 3 presents the MI HR for HT, E+P and E-alone. The timing hypothesis suggests that HRs should be significantly lower in the 50–59 age group or in the group less than 10 years past menopause than in the other groups, but none of these differences was significant in the expected direction. On the contrary, the E+P HR for women aged 50–59 was much higher (1.63) than it was for older women (1.01 for women aged 60–69).

The HR for HT during the first 3 years (1.26) is greater than the subsequent risk (1.08). For the RCT for E+P the difference is greater, 1.45 vs 1.11, and the test of the time-dependent covariables of duration of exposure was of marginal statistical significance (p<0.05). Since OS participants on HT began HT several years before enrolment, a diminished effect of HT with time could explain an OS/RCT difference. However, results of other analyses do not support this explanation. First, there was no evidence that previous HT exposure reduced the HR in the RCT (ie, the HR was lower for participants with no previous exposure, 1.07, than for those with previous exposure, 1.51). Second, there was no indication in the WHI OS dataset of increased MI risk for participants who began HT after study baseline; their HR was lower than it was for participants who began HT at baseline. (Information on HT usage after baseline was not available for the diet RCT study.)

The last rows in table 3 are HRs stratified by propensity scores. Stratifying by propensity score in addition to adjusting for the significant covariables was expected to reduce confounding, but there was no evidence that doing this gave results similar to the RCTs.

Additional factors that significantly modulated the association between HT and MI in the OS dataset at the p<0.05 level included blood pressure, previous coronary revascularisation, hours of sleep, haematocrit, working status, thyroid disease, antineoplastics, private medical insurance, bone fracture after age 55, colon polyps, ever lived or worked on farm and hostility. Neither education nor income was a statistically significant modulating variable. No factors that significantly modulated the HT HR in the WHI OS dataset also significantly modulated this HR in the RCT datasets. The MI HRs in the RCT and OS datasets did not become similar if they were stratified by the modulating variables.


Because E+P and E-alone had similar associations with stroke, results in table 4 include only the HRs for HT and no HRs for E+P and E-alone. As shown in this table, there was no consistent evidence that the HT HR for stroke was lower for women who were younger or had menopause recently. In contrast to the MI analyses, there was also no RCT evidence that the HT HR for stroke was stronger soon after beginning HT.

The only variable found to significantly influence the HT HR for stroke in the WHI OS dataset was endometrial aspiration; the HR was 0.85 for those who had had an endometrial aspiration and 1.16 for participants who had not (p<0.001). Stratifying on this variable did not make the OS and RCT results more similar. In addition, the lack of an obvious medical explanation, the number of factors tested and the lack of this relationship in the RCT datasets make it likely that this result occurred by chance.

After recalculating the HR in the WHI OS dataset for only those participants with midrange of propensity scores (those with a probability of using HT between 0.25 and 0.75), the HR for stroke was virtually unchanged. This suggests that adjusting for the propensity score did not diminish confounding.
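The midrange restriction can be sketched as a simple filter. The record layout and the function name `midrange` are hypothetical, and treating the bounds 0.25 and 0.75 as inclusive is an assumption, since the text does not say.

```python
def midrange(participants, low=0.25, high=0.75):
    """Keep participants whose propensity score lies in [low, high]."""
    return [p for p in participants if low <= p["propensity"] <= high]


# Invented records for illustration only.
cohort = [
    {"id": 1, "propensity": 0.10},
    {"id": 2, "propensity": 0.25},
    {"id": 3, "propensity": 0.60},
    {"id": 4, "propensity": 0.92},
]
kept_ids = [p["id"] for p in midrange(cohort)]
```

Restricting to the midrange keeps participants for whom either treatment choice was plausible, which is where users and non-users are expected to be most comparable.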

Adequacy of WHI information to eliminate confounding

In table 5, the MI risks are compared for participants in the four different WHI datasets who are on the same treatment at baseline (E+P, E-alone or no HT). The HR in the table represents the risk of the outcome for participants in that dataset compared with participants on the same treatment in the WHI OS dataset. If the WHI variables are adequate to eliminate confounding, the adjusted HRs should be near 1.00.

Table 5

MI HRs comparing participants in each of the three RCT datasets to WHI OS participants

Some HRs shown in the table were statistically significant at p<0.0001. For participants not taking HT the risk-adjusted HR was 1.37 for the RCT for E-alone. For participants taking E-alone the HR in the RCT was 1.44, and for participants taking E+P the HR was 1.53 for intervention participants. Risk-adjustment sometimes made HRs closer to 1.00 as expected (eg, intervention participants in the RCT for E+P), sometimes had minimal effect on HRs, and sometimes made a non-significant HR significant (eg, participants not on HT in the diet dataset).


The WHI data analysed contained information on more than 800 possible confounders including information that made it possible to accurately predict HT use. It also contained information on factors that might have influenced response to HT. Some of these factors were related to the timing hypothesis (eg, age, time since menopause, previous HT use, beginning HT after baseline), and some were identified empirically (eg, blood pressure, previous coronary revascularisation and private medical insurance). Since OS and RCT participants differed with respect to these factors, these factors could have conceivably contributed to differences between the OSs and the RCTs. However, after taking into account all of these confounding factors and stratifying on factors that may have influenced the response to HT, OS and RCT differences remained.

The WHI data also contained information from four different studies, and the participants in these studies had different outcomes. After stratifying participants with respect to the type of HT and taking into account the information available in the WHI, we could not eliminate the outcome differences from the four studies.

The above results suggest that there were important risk factors not captured by the WHI that contributed to confounding. Since the WHI dataset is unusually comprehensive, it is likely that most OSs do not capture information on these risk factors. Without including information on potentially important confounders OSs cannot give reliably valid results.

Comparison to previous studies

OSs prior to WHI suggested a 30–50% reduction in coronary heart disease incidence among women using HT.6–8 There was a smaller benefit shown in the analyses of the observational data in the present study: a 17% reduction in the OS and a 25% reduction in the RCT for diet.

After the WHI results were published, six studies of the association between HT and stroke or MI compared RCT results from the WHI with observational study results: three of these studies used observational data from the WHI13 ,15 ,16 and three used observational data from the Nurses’ Health Study.9 ,10 ,14 Two of the WHI studies found that, after controlling for time on HT and covariables, E+P HRs for MI did not significantly differ for the two study designs but HRs for stroke were higher in the RCT.

The goals and analytic methods of the present study differ substantially from previous studies using WHI data. The lead author believed that the extensive WHI data would be sufficient to give reliably valid results and extraordinary efforts were made to confirm this hypothesis. These efforts included an assessment of more than 800 risk factors as potential confounders and evaluating all marginally significant or previously suggested factors as potential effect modifiers. Even when the OS and RCT results were not the same, it was possible that the OS results were still valid. As a more definitive test of the adequacy of the WHI data we tried to eliminate differences in risk-adjusted outcomes from different datasets, which few if any other studies have attempted.

The present study differed from previous WHI studies in the following ways: (1) it included participants with and without a uterus, which made it possible to assess the effect of HT preparation. (2) It included participants in the diet RCT, which made it possible to compare risk-adjusted outcomes for two RCT and two OS datasets. (3) It evaluated more than 800 possible risk factors including those often suspected to cause confounding such as socioeconomic status, health behaviours, life style, stress and psychological characteristics. (4) It screened numerous participant characteristics for possible modulating effects on the association between HT and outcomes. (5) It analysed the risk for OS participants who began taking HT after enrolment. (6) It compared participants on the same treatment in different datasets and demonstrated that adjusting for WHI variables does not necessarily eliminate risk differences between datasets.

One of the WHI studies previously evaluated the timing hypothesis and did not find effects of prior HT use or menopause within 5 years.16 Another analysis of WHI data has been often cited as supporting the timing hypothesis.17 Although we tried to define coronary heart disease and years since menopause to get the same results, we could not. This suggests that the trends in the previous analysis were not robust to changing definitions.

A WHI study also found, as we did in the present study, that the MI HR for E+P was greatest in the early years of treatment. This could explain OS and RCT differences because most OS participants taking HT at baseline began HT several years prior to baseline. However, some analyses in the present study did not support this explanation: (1) the RCT did not find that the effect of E-alone on MI changed over time; (2) none of the datasets found that the effect of any HT on stroke changed over time; (3) WHI OS participants who began HT after baseline had low MI risk and (4) prior HT exposure did not reduce the association between HT and cardiovascular disease.

Results from the OS performed by the Nurses’ Health Study differed from our analysis of the WHI OS in important respects. One was that there was no protective association of HT and CHD for women over the age of 60.9 (Other studies have also suggested that HT is less protective for older women.11 ,12) A second was that there was increased risk for new initiators of HT during the first 2 years after initiation and the risk increased 10 years after menopause.14 Based on these findings the researchers in the Nurses’ Health Study hypothesised that the OS results might be influenced by timing of HT initiation in relation to menopause onset or age and by length of follow-up. A third result that differed from ours was that HT significantly increased the risk of stroke.10 Since this latter result was similar to the WHI RCTs and the previous results might have explained differences between OSs and RCTs, the Nurses’ Health Study suggested that OSs of HT could get the same results as RCTs.

The disagreements between our results and the results of the Nurses’ Health Study do not show that the analyses or interpretation in either study are necessarily incorrect. The disagreements do demonstrate, however, the difficulty of getting valid results from OSs.

In addition to OSs of the Nurses’ Health Study that give results similar to RCTs there is also an RCT that found oestradiol had an extraordinary protective effect on cardiovascular disease, which is consistent with the weaker protective effect of a different oestrogen preparation in the WHI OS.12

A previously published analysis of the WHI data shows that WHI risk factors cannot eliminate the association of adherence to placebo with MI, stroke or breast cancer.18 Since the effect of adherence to placebo is probably a marker of unmeasured confounders, that study supports the implication of the present study that WHI risk factors are inadequate to eliminate unmeasured confounders.


This study provided strong evidence that the WHI did not collect information on important risk factors related to MI or stroke. Although the WHI is unusually comprehensive, other datasets may provide information about these risk factors or about the risk factors that could cause confounding for the outcomes they assessed. It is also possible that the WHI did collect the necessary information on the confounding factors, but the analytic methods used here were inadequate to take advantage of this information. However, the concerns raised by this study are still valid because both the dataset and the analytic methods used were much more comprehensive than is practical for almost all OSs.

Conclusion and future directions

We did not find that the comprehensive data provided by the WHI were adequate to overcome problems often attributed to OSs. The findings do not imply that most OSs are invalid. They do suggest, however, that given the current methodology, even very good OS datasets may not be adequate to give reliably valid results.

Owing to the key role that OSs are likely to play in studies of comparative effectiveness, it is critical to find ways to make OSs more valid. Although there has been some research on OS methodology,14 more is required. There should be investigations to learn why some OSs agree with RCTs and others do not. More specific research goals include the following: (1) identify criteria for treatments unlikely to have confounding problems (eg, when there is little patient input to treatment, and one treatment is not preferred for higher risk patients), (2) find new risk factors that better adjust for patient behaviours that affect outcomes (eg, factors related to choosing or adhering to treatment) and (3) develop methods for assessment of confounding after data collection (eg, finding good markers for important unmeasured confounding factors). Without better OS methodology there will be underuse or misuse of OSs for comparative effectiveness research.


The Women's Health Initiative Study (WHI) is conducted and supported by the NHLBI in collaboration with the WHI Investigators. This manuscript was prepared using a limited access dataset obtained by the NHLBI and does not necessarily reflect the opinions or views of the WHI or the NHLBI. The research was supported in part by the Huntsman Cancer Foundation and the Beaumont Foundation.


  • Contributors AH supervised the study and prepared the manuscript. AH, TH, RW and JP participated in the conception and design, interpretation of data, revising the article and final approval of the version submitted.

  • Funding This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement A de-identified dataset that contains all of the information used for the current study can be obtained by applying to the NHLBI.