Original research
Critical appraisal and issues regarding generalisability of comparative effectiveness studies of NOACs in atrial fibrillation and their relation to clinical trial data: a systematic review
  1. Eveline M Bunge1,
  2. Ben van Hout2,
  3. Sylvia Haas3,
  4. Georgios Spentzouris4,
  5. Alexander Cohen5
  1. 1 Pallas health research and consultancy BV, Rotterdam, The Netherlands
  2. 2 School of Health and Related Research, University of Sheffield, Sheffield, UK
  3. 3 Formerly Technical University of Munich, Munich, Germany
  4. 4 Daiichi Sankyo Europe, Munchen, Germany
  5. 5 Department of Haematology, Guys and St Thomas' NHS Foundation Trust, King's College London, London, UK
  1. Correspondence to Ms Eveline M Bunge; bunge{at}

Abstract

Objective To critically appraise the published comparative effectiveness studies on non-vitamin K antagonist oral anticoagulants (NOACs) in non-valvular atrial fibrillation (NVAF). Results were compared with expectations formulated on the basis of trial results with specific attention to the patient years in each study.

Methods All studies that compared the effectiveness or safety between at least two NOACs in patients with NVAF were eligible. We performed a systematic literature review in Medline and EMbase (search date 23 April 2019) to investigate the way comparisons between NOACs were made. Critical appraisal of the studies was done using, among others, the ISPOR Good Research Practices for comparative effectiveness research.

Results We included 39 studies in which direct comparisons between at least two NOACs were made. Almost all studies concerned patient registries, pharmacy or prescription databases and/or health insurance database studies using a cohort design. Correction for differences in patient characteristics was applied in all but two studies. Eighteen studies matched using propensity scores (PS), 8 studies weighted patients based on the inverse probability of treatment, 1 study used PS stratification and 10 studies applied a proportional hazards model. These studies have some important limitations regarding unmeasured confounders and channelling bias, even though the larger part of the studies were technically well conducted. On the basis of trial results, expected differences are small, and a naïve analysis suggests that trials with between 7200 and 56 500 patients are needed to confirm the observed differences in bleeding and between 51 800 and 7 994 300 patients to confirm differences in efficacy.

Discussion Comparisons regarding effectiveness and safety between NOACs on the basis of observational data, even after correction for baseline characteristics, may not be reliable due to unmeasured confounders, channelling bias and insufficient sample size. These limitations should be kept in mind when results of these studies are used to decide on ranking NOAC treatment options.

  • adult cardiology
  • statistics & research methods
  • cardiology

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Strengths and limitations of this study

  • To our knowledge, this is the first systematic review that critically appraised the quality and generalisability of the comparative effectiveness studies on non-vitamin K antagonist oral anticoagulants (NOACs) in patients with atrial fibrillation and related this to clinical trial data.

  • A naïve trial analysis was conducted to estimate the number of patients needed in a randomised clinical trial to confirm the differences in efficacy and bleeding.

  • Thirty-nine articles were included, of which only one included all four NOACs.

Introduction

Guidelines state a preference for non-vitamin K antagonist oral anticoagulants (NOACs) above vitamin K antagonists (VKAs) in patients with non-valvular atrial fibrillation (NVAF) requiring prevention of stroke and systemic embolism.1 2 However, no recommendation for a specific NOAC is made in these guidelines, and in daily practice, physicians have to make a choice which of the four available NOACs (dabigatran, rivaroxaban, apixaban, edoxaban) they prescribe for a particular patient.3–6

In the absence of head-to-head trials, comparative effectiveness research (CER) has been conducted to compare the NOACs with regard to effectiveness and safety. This is also described as real-world evidence; that is, the data will come from patients treated in daily practice. Comparisons on effectiveness and safety between NOACs are however not easy to make, as patients will not be prescribed one of the NOACs at random. The choice of a certain NOAC for a patient will at least partly be driven by patient characteristics, such as age, concomitant medications, and the risk of stroke and/or bleeding. This can lead to systematic differences between the treatment groups, which is known as channelling bias.7 In order to make a valid comparison on effectiveness and safety between the NOACs, adjusting for these characteristics is necessary when these characteristics are also related to the outcome (confounding variables).

Several techniques exist to correct for imbalances in risks, but there is no gold standard and all methods have advantages and disadvantages. Cox proportional hazards (Cox PH) regression model adjustment can be used, but large sample sizes are needed when the number of events is relatively low and the number of covariates is high (as a rule of thumb, about 10 events per predictor variable8), and these large sample sizes are not always available. Event rates are low, around 1 per 100 patient years for efficacy outcomes, and to detect differences, even in a randomised clinical trial, one needs a substantial number of patients. This number would only increase when the results are contaminated by a lack of balance between the patient groups. Another method to adjust for confounding is using propensity scores (PS) to create comparable patient groups before the analysis. A PS is the probability of an individual receiving a specific treatment given a specific set of patient characteristics (eg, age, gender, comorbidities).9 Variables related to the outcome should be included in the PS regardless of the strength of their association with treatment (exposure) selection. This will increase the precision of the estimated exposure effect, while bias will not be increased. Variables that are related to the exposure but not the outcome will decrease the precision of the estimated exposure effect without decreasing bias.10 Adjustment for confounding using PS can be done by matching the treatment groups on the PS, by weighting treatment groups based on the PS (inverse probability of treatment weighting, IPTW), by PS stratification or by covariate adjustment using the PS.9 11 Well-conducted PS methods will lead to treatment groups that are closely comparable regarding important confounders, which increases the confidence in the results; however, there are also some disadvantages. For instance, in PS matching studies, patients who cannot be matched to another patient will be excluded from the analyses, and in IPTW, when patients on one treatment have a low PS and patients treated with the other treatment have a high PS, extreme weights can occur which can bias the results.12
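
The extreme-weight problem in IPTW can be illustrated with a minimal sketch. It assumes the commonly used weight definition (1/PS for patients who received the treatment of interest, 1/(1 − PS) for the others); the function name and the toy data are hypothetical and are not taken from any included study.

```python
def iptw_weights(treated, propensity):
    """Return inverse probability of treatment weights.

    treated: list of 0/1 flags (1 = received the treatment of interest).
    propensity: estimated probability of receiving that treatment.
    Assumed weight definition: 1/PS if treated, 1/(1 - PS) otherwise.
    """
    weights = []
    for flag, ps in zip(treated, propensity):
        if not 0.0 < ps < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 / ps if flag == 1 else 1.0 / (1.0 - ps))
    return weights


# Hypothetical toy data: the last patient received the treatment
# despite a very low estimated probability of receiving it.
treated = [1, 1, 0, 0, 1]
propensity = [0.60, 0.45, 0.30, 0.55, 0.02]
weights = iptw_weights(treated, propensity)
largest = max(weights)  # dominated by the patient with PS = 0.02
```

In this toy example a single patient carries a weight many times larger than any other, which is why reporting on extreme weights matters when IPTW results are appraised.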

To gain more understanding of how the methodologies described above were applied in peer-reviewed CER on effectiveness and safety of NOACs in patients with NVAF, we conducted a systematic literature review. Within this review, we also performed a naïve analysis of the results of the four major trials of rivaroxaban, apixaban, dabigatran and edoxaban, and compared the results from the various observational analyses with those from the trials.

Methods

Information sources, search strategy and eligibility criteria

We performed a systematic literature review to identify peer-reviewed CER on NOACs in patients with atrial fibrillation. A search in Medline (accessed through PubMed) and EMbase was performed combining search strings on NOAC, VKA and atrial fibrillation (see online supplemental appendix 1 for the search strings). The search was conducted on 23 April 2019 and we checked all articles published in the English language. The title and abstract selection was done in duplicate by two independent researchers.

The following inclusion criteria were used:

  • Population: patients with NVAF.

  • Intervention: NOAC (dabigatran, rivaroxaban, apixaban or edoxaban).

  • Comparator: other NOAC(s) (dabigatran, rivaroxaban, apixaban and/or edoxaban).

  • Outcomes: effectiveness and safety.

  • Study type: comparative effectiveness studies with a cohort design.

The following exclusion criteria were applied:

  • Studies on only one NOAC.

  • Studies in which VKA is the comparator for the NOACs, and NOACs are not compared against each other.

  • Studies on cost-effectiveness and healthcare resources use.

  • Studies on adherence or persistence.

Critical appraisal

We checked the setting, inclusion and exclusion criteria, and the following baseline characteristics: age, proportion males, CHA2DS2-VASc (Congestive heart failure, Hypertension, Age ≥75 years, Diabetes Mellitus, Prior Stroke or TIA or thromboembolism, Vascular disease, Age 65–74 years, Sex) score and comorbidity index.

We used the criteria suggested by ISPOR, Yao et al 13 and Austin et al as guidance to critically appraise the articles in which PS were used.12 14 15 The criteria we checked concerned:

  • The variables included in the PS model.

  • Explanation of the variable selection procedure for PS model.

  • Distribution of baseline characteristics for each group before PS analysis.

  • In case of PSM:

    • Matching ratio.

    • Distance metric.

    • With or without replacement.

    • Comparability of baseline characteristics in the matched groups.

    • Sample size before and after matching.

  • In case of IPTW:

    • Comparability of baseline characteristics in the weighted groups.

    • Extreme weights.

  • In case of PS stratification:

    • Number of strata, comparability of baseline characteristics.

  • In case of analyses in which no PS was used in the main analyses:

    • We evaluated whether the ratio of the number of events to the number of covariates seemed sufficient to produce valid results.8

  • Sensitivity analyses to further explore the magnitude of residual confounding (ie, case–cross-over study designs; clinical details in a subsample; proxy measures; or instrumental variable techniques).
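
The events-per-variable check applied to the studies without PS in the main analysis can be sketched as follows. The threshold of 10 follows the rule of thumb cited in this article; the function names and the example counts are hypothetical.

```python
def events_per_variable(n_events, n_covariates):
    """Ratio of outcome events to covariates in a regression model."""
    if n_covariates <= 0:
        raise ValueError("the model must contain at least one covariate")
    return n_events / n_covariates


def epv_sufficient(n_events, n_covariates, threshold=10.0):
    """True when the ratio meets the rule-of-thumb threshold (assumed: 10)."""
    return events_per_variable(n_events, n_covariates) >= threshold


# Hypothetical counts, for illustration only.
well_supported = epv_sufficient(n_events=120, n_covariates=10)
under_supported = epv_sufficient(n_events=45, n_covariates=12)
```

The check is applied per outcome, because the number of events differs between outcomes within the same study.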

Naïve trial analysis

Trials are quite often designed with a null hypothesis and associated with a power calculation, while real-world studies are often dictated by the number of observations available. To give the results from the real-world evidence some perspective, we undertook a naïve trial analysis in which the risk reductions from each trial with respect to efficacy and safety outcomes were applied to an average number of outcomes observed in the warfarin arms in each trial. This leads to an estimate of the relevant rates for each drug and the differences are illustrated by the number of patients (sample size) needed in a randomised clinical trial to confirm the estimated differences.

Results

In total, we found 1302 unique articles in our search, of which 39 articles fulfilled the inclusion and exclusion criteria and were included for data extraction (see figure 1). In tables 1–5, study characteristics are presented. The most important differences between the studies are outlined in table 6.

Table 1

Characteristics of the included articles that used propensity score matching (PSM) as primary analyses (n=18)

Table 2

Characteristics of the included articles that used inverse probability of treatment weighting as primary analyses (n=8)

Table 3

Characteristics of the included articles that used adjusted Cox proportional hazard models as primary analyses (n=10)

Table 4

Characteristics of the included articles that used unadjusted primary analysis (n=2)

Table 5

Characteristics of the included articles that used propensity score stratification as primary analyses (n=1)

Table 6

Main differences between the included studies (n=39)

Figure 1

PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flowchart. NOACs, non-vitamin K antagonist oral anticoagulants.

More than 50% of the studies were conducted in the USA (n=24),16–39 five were conducted in Denmark,40–44 four in Taiwan,45–48 and one each in France,49 Sweden,50 Scotland,51 the UK,52 Spain53 and China.54 Dabigatran and rivaroxaban were included in all 39 studies, apixaban was included in 26 studies and edoxaban was included in 1 study. In addition to these NOACs, VKA was included in 25 of these studies as one of the comparators. The results below focus on the NOAC to NOAC comparisons only.

In the studies that included apixaban, dabigatran and rivaroxaban, rivaroxaban was the most prescribed NOAC in the USA, the UK, Scotland and Taiwan, while dabigatran was the most prescribed NOAC in Denmark. In three other European studies, the distribution was about equal between the three NOACs. Apixaban was not the most prescribed NOAC in any of the included studies.


Most studies concerned patient registries, pharmacy or prescription databases and/or health insurance databases (n=36), while there were three clinical practice-based studies.50 53 54

Study population

All studies included only patients with NVAF. In seven studies, it was specifically described that patients were newly diagnosed with NVAF and initiated NOAC treatment during the study period.21 27 34 37 40 45 54 None of the other studies included prevalent users of (N)OAC; instead, they included, for example, ‘newly treated’, ‘initiating treatment’, ‘new users’ or ‘first-time prescription’ patients with NVAF who were prescribed (N)OAC. In some studies, (N)OAC use in the past (between 3 months and 2 years before index date) was allowed, while in other studies this seemed not to be allowed or was not described.

Inclusion criteria

Five studies concerned elderly patients specifically (ie, ≥65 years old),19 21 23–25 two included adults ≥45 years old33 40 and one study included patients between 30 and 100 years of age.44 The other studies included all adults with atrial fibrillation (it was assumed that if no further age specification was provided, ‘adults’ meant that all >18 years old were included). In one study, only patients who were hospitalised for bleeding after starting OAC treatment were included.22 No other focus on a specific group of patients with AF was found.

Exclusion criteria

NOAC use that could be related to other disorders, such as transient AF, major knee or hip surgery, venous thromboembolism or pulmonary embolism, was specifically described as an exclusion criterion in all but 10 studies.16 27 28 33–35 50 52–54 In one study, patients with liver injury before their first oral anticoagulant (OAC) prescription were specifically excluded.18

Baseline characteristics

Baseline characteristics of patients with NVAF differed between studies. Mean age ranged from 65 to 84 years between the studies. The percentage of males ranged from 39% to 73%, and the mean CHA2DS2-VASc score ranged from 2.1 to 4.9. Excluding the five studies that specifically focused on an elderly population of ≥65 years old and the two additional studies that used the Medicare database (only patients of 65 years or older are in Medicare), the mean age ranged from 65 to 78 years. Different measures were used to assess the comorbidity index: Charlson Comorbidity Index, Charlson-Deyo Index and Gagne Comorbidity Score, while in 30 of the 39 studies no comorbidity index was presented.

Selection of covariates

Most studies (n=34) did not provide a rationale for the selection of covariates that were included in the PS model or in adjusted analysis. However, in one of the articles, an extensive rationale and selection procedure of covariates that were included in the analysis was provided.33 In three other studies, the authors selected covariates based on medical knowledge on risk factors with reference to earlier published studies.31 39 52 In one other study, it was reported that sociodemographic and clinical characteristics that were associated with treatment initiation and the risk of major bleeding were included in the model to adjust for differences across cohorts, without further explanation or reference.30

Definition of primary study outcomes

Primary outcomes differed between the studies. Effectiveness outcomes included stroke, systemic embolism (or a composite of stroke/systemic embolism), all-cause death, myocardial infarction and venous thromboembolism; safety outcomes included major bleeding or a specific type of bleeding (eg, intracranial haemorrhage, gastrointestinal bleeding) and liver injury. In most studies, ICD-9 or ICD-10 codes were used, but whether these concerned a primary diagnosis only or either a primary or a secondary diagnosis differed between the studies. In some studies, this was not described.

Statistical approaches to adjust for confounding (primary analysis)

In 18 studies, PS matching was done.16 19–21 23 26 29 30 32–37 39 40 47 49 IPTW was used in eight studies.17 22 24 25 28 43 46 48 A PS-stratified analysis was done in one study.41 In 10 studies, the primary analyses used a Cox PH regression model in which adjustment for confounding was done.18 27 31 38 42 44 45 50–52 Finally, in two studies no adjustment for differences in baseline characteristics was performed.53 54

PS matching


Creatinine clearance was not included as a covariate in any of the 18 studies. All 18 studies took the following covariates into account: age, sex, CHA2DS2-VASc score and/or the individual comorbidities included in this score, HAS-BLED score (Hypertension, Abnormal renal and liver function, Stroke, Bleeding, Labile INR, Elderly, Drugs or alcohol) and/or the individual conditions included in this score (except alcohol use in Lai et al 47), renal disease and co-medication use such as antiplatelets. Some included other comorbidities, such as cancer, rheumatic disease, specific heart diseases, Chronic Obstructive Pulmonary Disease (COPD), HIV, dementia, depression, neurological disorders and/or a various list of co-medications as well.

Matching method

In one study, the matching method was not described.49 In two studies, the calliper used was not described.23 29 In seven studies, 1:1 PS matching without replacement was used and a calliper of 0.01 was applied.16 19 20 26 30 32 36 Five other studies also matched 1:1 without replacement but used another calliper: in three studies, a calliper of 0.2 was used,39 40 47 while two others used a calliper of <0.25.33 35 In three studies, three-way matching was used.21 34 37

Balance covariates

In two studies, it was not described how the balance between covariates was evaluated.33 35 In two studies, the balance was evaluated using p<0.05 (of which one also used standardised difference of <10%),23 47 and in another study, it was stated that the groups were comparable even though a p value of >0.05 was found.29 Balance was checked with an absolute standardised difference of <10% in 13 studies.16 19–21 26 30 32 34 36 37 39 40 47 49 Balance was reached in all studies after matching.
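
The standardised difference used as a balance criterion can be sketched as below. The sketch assumes the common pooled-SD definition for continuous and binary covariates; the included studies may have computed it differently. All function names and numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def std_diff_continuous(x_group_a, x_group_b):
    """Standardised difference for a continuous covariate (pooled-SD form)."""
    pooled_sd = sqrt((variance(x_group_a) + variance(x_group_b)) / 2)
    return (mean(x_group_a) - mean(x_group_b)) / pooled_sd


def std_diff_binary(p_group_a, p_group_b):
    """Standardised difference for a binary covariate given two proportions."""
    pooled_sd = sqrt((p_group_a * (1 - p_group_a) + p_group_b * (1 - p_group_b)) / 2)
    return (p_group_a - p_group_b) / pooled_sd


def is_balanced(std_diff, threshold=0.10):
    """Apply the <10% absolute standardised difference criterion."""
    return abs(std_diff) < threshold


# Hypothetical covariates: ages in two small groups, and the proportion
# of males in two groups before and after matching.
age_diff = std_diff_continuous([70, 72, 74], [71, 73, 75])
male_before = std_diff_binary(0.60, 0.50)
male_after = std_diff_binary(0.52, 0.50)
```

Unlike a p value, the standardised difference does not depend on sample size, which is one reason it is often preferred for checking balance in matched or weighted samples.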

Sample size

In four studies, the sample size before matching was not reported,29 35 36 39 and in one study, the sample size after matching was not reported.34 At study start (before PSM), sample size between the NOACs differed greatly, except in three studies.21 37 40


Among the IPTW studies, balance was tested in one study using analysis of variance (ANOVA) for significant differences.22 Balance was checked with an absolute standardised difference of <10% in the other seven studies.17 24 25 28 43 46 48 Balance was reached in all studies after IPTW.

There was no reporting on extreme weights in the eight included studies.17 22 24 25 28 43 46 48

PS stratification

In one study, asymmetric trimming of the PS was done, which resulted in a small part of both treatment groups being removed in order to gain in comparability. Balance in covariates was reached with standardised difference of <10%. In a Cox model, this trimmed PS was used in 10 deciles as strata.41

Cox PH regression models

In 10 studies, Cox PH regression models were applied with adjustment for a number of confounders.18 27 31 38 42 44 45 50–52 In one of these studies, the number of events per variable was not sufficient for such an analysis.50 The ratio was acceptable in the other studies for at least some of the outcomes.18 28 31 38 42 44 45 51 52

Unadjusted analysis

In two studies, no adjustment for confounding factors seemed to have been done, even though significant differences between treatment groups existed at baseline. Cerdá et al 53 presented events per 100 patient-years and used a log-rank test to determine whether outcomes differed between the NOACs. Li et al 54 conducted a Cox proportional hazard model, likely unadjusted, but this was not clearly described in the article.

Sensitivity analyses

Although sensitivity analyses were done in some articles, none of the included studies further explored the magnitude of residual confounding in their sensitivity analyses using one of the approaches recommended by ISPOR (see the Methods section).

Study results

Which NOAC performed best differed between the included studies. We found only one study that included all four NOACs, in which no preference for one specific NOAC was found, except that rates of major bleeding were lower with rivaroxaban.53 Of the 26 studies in which apixaban, rivaroxaban and dabigatran were included, apixaban was favourable compared with dabigatran and rivaroxaban in 13 studies, of which 10 were from the USA, 2 from Europe and 1 from Asia,16 17 19 20 23 26 28 29 32 36 42 50 52 while dabigatran and rivaroxaban were not found to be the single most favourable NOAC in any of the remaining 13 studies. Results for these 13 studies were mixed, with either no favourable NOAC at all or one NOAC was selected as the least favourable, while the other two NOACs did not differ.

Naïve trial analysis

The rates of the primary efficacy endpoint (stroke/SE) in the warfarin arms were estimated at 1.69% (RE-LY),3 2.2% (ROCKET),6 1.60% (ARISTOTLE)5 and 1.50% (ENGAGE)4 (see table 7). From this range, we chose a relatively arbitrary base rate of 1.6% and applied the observed risk reduction to estimate comparable base rates of 1.05% for dabigatran, 1.24% for rivaroxaban, 1.26% for edoxaban and 1.27% for apixaban. Using the sample size calculator,55 the biggest expected difference was between dabigatran and apixaban, and it was estimated that a trial with a sample size of 51 847 patients would be needed to confirm this difference. The smallest difference was between edoxaban and apixaban, and a trial of 7 994 340 patients would be required to confirm that difference.

Table 7

Primary efficacy and safety endpoints of the four pivotal trials

The primary safety endpoint was major bleeding for RE-LY, ARISTOTLE and ENGAGE AF and major bleeding plus clinically relevant non-major bleeding for ROCKET AF, but data on major bleeds only are available for ROCKET AF as well. Major bleeds in the warfarin arms were estimated at 3.36% (RE-LY),3 3.4% (ROCKET),6 3.09% (ARISTOTLE)5 and 3.43% (ENGAGE).4 From this range, we chose a relatively arbitrary base rate of 3.2% and applied the observed risk reduction to estimate comparable base rates of 2.21% for apixaban, 2.57% for edoxaban, 2.96% for dabigatran and 3.29% for rivaroxaban. Using the sample size calculator,55 the biggest expected difference was between rivaroxaban and apixaban, and it was estimated that a trial with 7196 patients would be needed to confirm this difference. A much smaller difference exists between edoxaban and apixaban, which would require a trial of 56 512 patients to confirm.
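
The kind of calculation behind such sample sizes can be sketched with the standard normal-approximation formula for comparing two proportions. This is an illustration only: the two-sided alpha of 0.05, the 80% power, the equal allocation and the formula itself are assumptions, and the calculator cited above may use a different method, so the totals need not match the reported figures exactly. The function name is hypothetical; the event rates are the naïve estimates given above.

```python
from math import ceil
from statistics import NormalDist


def total_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Total patients (two equal arms) needed to detect p1 versus p2.

    Uses the unpooled normal-approximation formula for two proportions.
    alpha (two-sided) and power are assumed defaults, not values
    reported in the article.
    """
    if p1 == p2:
        raise ValueError("the two event rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n_per_arm = (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2
    return 2 * ceil(n_per_arm)


# Naïve major bleeding rates from the text above.
rivaroxaban_vs_apixaban = total_sample_size(0.0329, 0.0221)
edoxaban_vs_apixaban = total_sample_size(0.0257, 0.0221)
```

Because the required size scales with the inverse square of the difference in rates, halving the difference roughly quadruples the number of patients needed, which explains the very large numbers for the efficacy comparisons.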

Discussion

In total, we found 39 studies directly comparing the effectiveness and/or safety of at least two NOACs in patients with NVAF. Three studies can be considered to be of low quality due to insufficiently described methods and/or small sample size.50 53 54

Even though the remaining studies could be considered of sufficient quality based on the technical aspects of the studies, there are some issues that can hamper the generalisability of the results. These issues concern residual confounding, the use of a smaller or broader calliper, differences in baseline characteristics between studies, channelling bias and change in treatment paradigm, and the high number of patients needed.

Balance in baseline characteristics between NOACs was checked with p values or a standardised difference of <10%. Balance was good at baseline in some studies, or was reached after PS matching or IPTW.56 Even though some studies included over 40 covariates in their PS, in most studies, it was not described how the covariates were selected. The ISPOR Good Research Practices for Retrospective Database Analysis recommends including all factors that are theoretically related to outcome or treatment selection, even if the relation is weak or statistically non-significant.15 Directed acyclic graphs might be helpful as well.57 Even when balance is reached for all of these variables, one should keep in mind that balance between unmeasured or unmeasurable factors cannot be assumed.15 Therefore, due to the lack of randomisation, there is always a possibility of residual confounding. This possibility was acknowledged in all included studies, and all studies have largely the same missing covariates. Hardly any laboratory results or lifestyle information, such as body mass index, smoking status and alcohol consumption, were included, although these are also risk factors for ischaemic stroke and bleeding events. Creatinine clearance, for instance, seems to be an important covariate, as subgroup analyses from the pivotal trials suggest that renal clearance might be an effect modifier.5 58 In only one study, however, were the authors able to take renal clearance into account in the adjusted analyses.50 Especially when prescription of a certain NOAC in daily practice is driven by creatinine clearance, not adjusting for this variable may lead to biased results. However, the magnitude and direction of this potential bias due to lack of randomisation (ie, whether the differences in effectiveness and safety between NOACs would be smaller or larger) are unknown. The magnitude of residual confounding was not further explored in the sensitivity analyses of the included studies.

In general, a calliper of <0.2 of the SD of the logit of the PS is considered to be ‘optimal’.59 About half of the included PS matching studies used a smaller calliper, namely, of <0.1. This means that the matching is more precise in these studies, but the disadvantage is that possibly more patients cannot be matched to another patient because of the smaller maximum difference allowed, and thus will be excluded from the analysis. Excluding patients from the analysis will limit the generalisability of the results to the total patient population, especially when the excluded patients differ from the included patients, for example, on the baseline risk for stroke.
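
The trade-off between calliper width and the number of excluded patients can be illustrated with a minimal sketch of greedy 1:1 nearest-neighbour matching without replacement on the logit of the PS. Several simplifications are assumed: the SD is taken over the logits of both groups combined, matching is greedy in input order, and the PS values are hypothetical. It is not the algorithm of any included study.

```python
from math import log
from statistics import stdev


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return log(p / (1.0 - p))


def calliper_match(ps_a, ps_b, calliper_sd=0.2):
    """Greedy 1:1 matching without replacement within a calliper.

    ps_a, ps_b: propensity scores of the two treatment groups.
    calliper_sd: calliper width as a fraction of the SD of the logit PS
    (SD computed over both groups combined, a simplifying assumption).
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    logit_a = [logit(p) for p in ps_a]
    logit_b = [logit(p) for p in ps_b]
    width = calliper_sd * stdev(logit_a + logit_b)
    available = list(range(len(logit_b)))
    pairs = []
    for i, value in enumerate(logit_a):
        if not available:
            break
        # Closest still-unmatched patient in the other group.
        j = min(available, key=lambda k: abs(logit_b[k] - value))
        if abs(logit_b[j] - value) <= width:
            pairs.append((i, j))
            available.remove(j)
        # Otherwise patient i stays unmatched and is excluded.
    return pairs


# Hypothetical propensity scores for two small groups.
group_a = [0.30, 0.50, 0.90]
group_b = [0.31, 0.52, 0.55, 0.10]
wide = calliper_match(group_a, group_b, calliper_sd=0.2)
narrow = calliper_match(group_a, group_b, calliper_sd=0.01)
```

With the wider calliper two of the three patients in the first group find a match, while the narrow calliper leaves all of them unmatched in this toy example, mirroring the loss of patients described above.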

All included studies focused on patients with NVAF only. In eight studies, inclusion criteria regarding age were applied. Three of these will likely still cover the largest part of NOAC users as they set relatively broad age ranges. The other five focused on an elderly population of patients with NVAF aged ≥65 years. Besides applying specific inclusion criteria regarding age in some studies, these differences also depended on the specific registry or database that was used, for example, Medicare is for people of 65 years old or older. Even though only five of the included studies focused on an elderly NVAF population, and the others applied broad age ranges, there were differences in mean age, proportion of males and mean CHA2DS2-VASc score between the studies, which can have an impact on the results and jeopardise the generalisability of the results.

Rivaroxaban was the most prescribed NOAC in almost all included studies from the USA. However, in the first quarter of 2017, apixaban was the most prescribed NOAC in NVAF in the USA (ie, in 50% of new OAC prescriptions). In particular, older age, female sex, increased stroke or bleeding risk and the presence of comorbidities were associated with prescription of apixaban versus other NOACs.60 Rivaroxaban was also the most prescribed NOAC in the included studies from the UK and Scotland. Based on the Clinical Practice Research Datalink (CPRD), 56.5% of the OAC prescriptions concerned a NOAC, of which rivaroxaban was still prescribed most often in 2015.61 Dabigatran was prescribed most often in the studies from Denmark. Haastrup et al described that most patients with AF who initiated a NOAC received dabigatran between 2008 and 2016, but a trend was observed that per 1000 person-years the number of patients prescribed dabigatran decreased and the number of patients receiving rivaroxaban and apixaban increased.62 This shows that the treatment paradigm changed over time, and might still be changing, and this pattern differs between the USA, Europe and Asia. Channelling bias therefore likely occurs and might shift between the NOACs. Although in a few studies it was mentioned that selective prescriptions were noticed and that these might have changed over time, none of the included studies dealt with temporal trends in prescription patterns.

Our naïve analysis predicts that, in terms of the primary efficacy outcome, observational studies will need a relatively high number of patients to be able to demonstrate the differences between the NOACs, and a small sample size will not allow robust comparisons to be made.

The pattern of major bleeding events seen in the included observational studies confirms the expectation from our naïve analysis of the pivotal clinical trials that rivaroxaban seems to have the least favourable safety profile compared with apixaban and dabigatran. The findings are not consistent enough to allow for a robust conclusion between apixaban and dabigatran, which confirms the need for a high number of patients, although a trend for a slightly better safety profile of apixaban can be observed.

The requirement for a high number of patients to compare NOACs both in terms of efficacy and safety as predicted by the pivotal trial results is confirmed by the findings of the observational studies. This finding may support the claim that the differences between the NOACs are relatively small.

In the process of conducting systematic reviews, it is inevitable that the review will never be completely up to date with the most recently published evidence. Even though our search ended in April 2019, more recently published studies will have encountered the same issues as described above. Residual confounding and channelling bias cannot have been ruled out in newer publications. Ideally, head-to-head trials should be conducted to compare the efficacy/effectiveness and safety of the four NOACs to overcome the methodological issues in the comparative effectiveness studies. To our knowledge, one head-to-head trial including all four NOACs is currently running. This nationwide cluster randomised cross-over study aims to compare efficacy and safety of the four NOACs (NCT03129490).

In conclusion, even though most of these studies were conducted as well as possible considering what data were available, there are some important limitations regarding the generalisability of the study results, especially given the relatively high number of patients required for a meaningful comparison between NOACs. Most studies included all patients with NVAF on a NOAC available in the registry/database during the study period and did not apply further specific inclusion and exclusion criteria, but baseline characteristics differed between studies. Mean age at study start and baseline risk for stroke (CHA2DS2-VASc score) differed between the studies. As channelling bias cannot be ruled out, the results of these studies might not be generalisable. Furthermore, results from the PS studies are only applicable to the patients who were kept in the analyses, as patients excluded from the analysis likely differ from those who were included. The size of the 1:1 matched cohorts depended on the sample size of the NOAC with the smallest number of patients and, as a result, many patients from the larger of the two NOAC groups were excluded because they could not be matched. In clinical practice, these limitations should be kept in mind when results of these studies are used to decide which NOAC should be prescribed for a certain patient. Given the small differences in efficacy and safety outcomes between NOACs, the element of patient preference should be taken into consideration,63 as tailoring anticoagulation treatment towards patient preferences can promote adherence to treatment.
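The point above about 1:1 matching can be made concrete with a minimal sketch of greedy nearest-neighbour matching without replacement on a propensity score. This is one common matching approach and not necessarily the algorithm used in any of the included studies; the function name `greedy_match`, the caliper value, the patient identifiers and the scores are all hypothetical.

```python
def greedy_match(smaller_group, larger_group, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    Both arguments map a patient id to a propensity score. Returns the
    matched pairs and the ids from the larger group left unmatched,
    which would be excluded from a matched analysis.
    """
    available = dict(larger_group)  # candidates that can still be matched
    pairs = []
    for patient, score in sorted(smaller_group.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining candidate on the propensity score.
        best = min(available, key=lambda k: abs(available[k] - score))
        if abs(available[best] - score) <= caliper:
            pairs.append((patient, best))
            del available[best]  # without replacement: each control used once
    return pairs, sorted(available)


# Hypothetical propensity scores for two treatment groups of unequal size.
group_a = {"a1": 0.30, "a2": 0.55}
group_b = {"b1": 0.28, "b2": 0.33, "b3": 0.57, "b4": 0.80, "b5": 0.90}

pairs, excluded = greedy_match(group_a, group_b)
print(pairs)     # matched pairs; never more than the size of the smaller group
print(excluded)  # patients from the larger group dropped from the analysis
```

In this toy example the matched cohort cannot exceed the two patients in the smaller group, and the excluded patients from the larger group include those with the highest propensity scores, so the matched estimate describes only the region of overlap between the groups.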


The authors would like to thank Pearl Gumbs for the initiation of this project and her input with regard to study design and interpretation of the data. Pearl Gumbs was working at Daiichi Sankyo Europe at that time.



  • Contributors EMB: conceptualisation (support); methodology (equal); writing—original draft preparation; writing—review and editing (equal). BAvH: conceptualisation (support); methodology (equal); writing—review and editing (equal). SH: conceptualisation (support); writing—review and editing (equal). GS: conceptualisation (support), supervision; writing—review and editing (equal). AC: conceptualisation (lead); writing—review and editing (equal).

  • Funding This work was supported by Daiichi Sankyo Europe.

  • Competing interests EMB reports grants from Daiichi Sankyo during the conduct of the study; grants from Daiichi Sankyo, outside the submitted work. BAvH reports grants from Daiichi Sankyo during the conduct of the study. SH reports personal fees from Aspen, personal fees from Bayer, personal fees from BMS/Pfizer, personal fees from Daiichi-Sankyo, personal fees from Portola, outside the submitted work. GS reports personal fees from Daiichi Sankyo Europe GmbH, outside the submitted work. AC reports personal fees from Daiichi Sankyo Europe during the conduct of the study; grants and personal fees from Bayer AG, personal fees from Boehringer Ingelheim, grants and personal fees from Bristol-Myers Squibb, grants and personal fees from Pfizer Limited, personal fees from Portola Pharmaceuticals, personal fees from Janssen, personal fees from ONO Pharmaceuticals, from AbbVie, outside the submitted work.

  • Patient and public involvement statement This research was done without patient involvement. Patients were not invited to comment on the study design and were not consulted to develop patient relevant outcomes or interpret the results. Patients were not invited to contribute to the writing or editing of this document for readability or accuracy.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement All data relevant to the study are included in the article or uploaded as supplementary information.
