Are there researcher allegiance effects in diagnostic validation studies of the PHQ-9? A systematic review and meta-analysis
  1. Laura Manea1,2,
  2. Jan Rasmus Boehnke3,
  3. Simon Gilbody1,2,
  4. Andrew S Moriarty2,
  5. Dean McMillan1,2
  1. 1 Department of Health Sciences, University of York, York, UK
  2. 2 Hull York Medical School, University of York, York, United Kingdom
  3. 3 Dundee Centre for Health And Related Research, University of Dundee, Dundee, United Kingdom
  1. Correspondence to Dr Laura Manea; laura.manea{at}york.ac.uk

Abstract

Objectives To investigate whether an authorship effect is found that leads to better performance in studies conducted by the original developers of the Patient Health Questionnaire (PHQ-9) (allegiant studies).

Design Systematic review with random effects bivariate diagnostic meta-analysis. Search strategies included electronic databases, examination of reference lists and forward citation searches.

Inclusion criteria Included studies provided sufficient data to calculate the diagnostic accuracy of the PHQ-9 against a gold standard diagnosis of major depression using the algorithm or the summed item scoring method at cut-off point 10.

Data extraction Descriptive information, methodological quality criteria and 2×2 contingency tables.

Results Seven allegiant and 20 independent studies reported the diagnostic performance of the PHQ-9 using the algorithm scoring method. The pooled diagnostic OR (DOR) was 64.40 for the allegiant group and 15.05 for the non-allegiant group. Allegiance status was a significant predictor of DOR variation (p<0.0001).

Five allegiant studies and 26 non-allegiant studies reported the performance of the PHQ-9 at the recommended cut-off point of 10. The pooled DOR was 49.31 for the allegiant group and 24.96 for the non-allegiant group. Allegiance status was a significant predictor of DOR variation (p=0.015).

Some potential alternative explanations for the observed authorship effect including differences in study characteristics and quality were found, although it is not clear how some of them account for the observed differences.

Conclusions Allegiant studies reported better performance of the PHQ-9. Allegiance status was predictive of variation in the DOR. Based on the observed differences between independent and non-independent studies, we were unable to conclude or exclude that allegiance effects are present in studies examining the diagnostic performance of the PHQ-9. This study highlights the need for future meta-analyses of diagnostic validation studies of psychological measures to evaluate the impact of researcher allegiance in the primary studies.

  • screening
  • PHQ-9
  • diagnostic meta-analysis
  • allegiance effect

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/

Strengths and limitations of this study

  • An original study—the first meta-analysis of diagnostic validation studies of psychological measures to evaluate the impact of researcher allegiance.

  • Using rigorous methodology—strict inclusion/exclusion and quality assessment criteria.

  • We found that the allegiance effect was a significant predictor of the variation of the diagnostic OR in the meta-regression analysis.

  • Substantial variability observed in methodological quality of included studies.

  • Based on the observed methodological differences between the independent and non-independent studies, we were unable to conclude or exclude that allegiance effects are present in studies examining the diagnostic performance of the Patient Health Questionnaire (PHQ-9).

Research on allegiance effects has a long tradition in psychotherapy research. In this context, allegiance describes the phenomenon that researchers and clinicians who developed a treatment approach or are for other reasons invested in it tend to find larger effect sizes in favour of their treatment than for comparison groups.1 This finding has been extensively replicated2 3 and is also robust when the quality of research is controlled for. Researcher allegiance is the subject of ongoing debates about the design of efficacy studies as well as implications for policy.2 4 5 Researcher allegiance is also discussed widely in the literature on experimental as well as evaluation research.6 Since the motivational underpinnings of allegiance effects are potentially far more ingrained into human behaviour and decision making than previously thought,7 they may occur commonly in clinical research in general.

Although it has been suggested that allegiance effects may play a role in the validation of psychological screening and case-finding tools (eg, O’Shea et al., in press), systematic evaluations of this hypothesis are rare, and studies that acknowledge potential allegiance effects in such studies mainly come from forensic psychology and psychiatry backgrounds.8–11 Diagnostic validation studies aim to establish the sensitivity and specificity of a screening or case-finding tool, which is used in practice to differentiate cases from non-cases or to decide whether further assessment or treatment is indicated or will be offered. An allegiance effect in such studies would be seen in systematically higher sensitivities or specificities if the original author(s) is (are) part of the team of such a study. Such a bias would have a deleterious effect on practice through promising overoptimistic accuracy of the screening or case-finding tool or in evaluating the cost-effectiveness of the measure in a screening or case-finding context.

The depression module of the Patient Health Questionnaire (PHQ-9) is a widely used depression-screening instrument in non-psychiatric settings. The PHQ-9 was developed by a team of researchers, with its development underwritten by an educational grant from Pfizer US Pharmaceuticals.12 The PHQ-9 can be scored using different methods, including an algorithm based on Diagnostic and Statistical Manual of Mental Disorders (DSM)-IV criteria and a cut-off based on summed-item scores. The psychometric properties of these two approaches have been summarised in two recently published meta-analyses.13 14 The goal of the current review is to investigate, based on an established database of PHQ-9 diagnostic validation studies,13 14 whether an allegiance effect is found that leads to an increased sensitivity and specificity in studies that were conducted by researchers closely connected to the original developers of the instrument.

Methods

Study selection

Similar search strategies were used in both systematic reviews (for full details, please see Manea et al and Moriarty et al 13 14). Embase, Medline and PsycINFO were searched from 1999 (when the PHQ-9 was first developed) to August 2013 and September 2013, respectively, using the terms ‘PHQ-9’, ‘PHQ’, ‘PHQ$’ and ‘patient health questionnaire’. The search strategy is presented in online supplementary appendix 1. The reference lists of studies fitting the inclusion criteria were manually searched and a reverse citation search in Web of Science was performed. The authors of unpublished studies were contacted and conference abstracts were reviewed in an attempt to minimise publication bias.

The following inclusion-exclusion criteria were used:

Population: adult population. Instrument: studies that used the PHQ-9. Comparison (reference standard): the accuracy of the PHQ-9 had to be assessed against a recognised gold-standard instrument for the diagnosis of either DSM or International Classification of Disease (ICD) criteria for major depression. Studies were included if the diagnoses were made using a standardised diagnostic structured interview schedule (eg, Mini International Neuropsychiatric Interview (MINI), Structured Clinical Interview for DSM Disorders (SCID)). Unguided clinician diagnoses with no reference to a standard structured diagnostic schedule or comparisons of the PHQ-9 with other self-report measures were excluded. Studies were also excluded if the target diagnosis was not major depressive disorder (MDD), eg, any depressive disorder. Outcome: studies had to report sufficient information to calculate a 2×2 contingency table for the algorithm or the recommended cut-off point 10. Study design: any design. Additional criterion: we avoided double counting of evidence by ensuring that only one study of those that reported overlapping datasets in different journals was included in the meta-analysis. Citations with overlapping samples were examined to establish whether they contained information relevant to the research question that was not contained in the included report.

Quality assessment

Quality assessment was performed using the Quality Assessment of Diagnostic Accuracy Studies (Revised) (QUADAS-2) tool, which evaluates the risk of bias and applicability of primary diagnostic accuracy studies in diagnostic systematic reviews.15 It covers the areas of patient selection, index test, reference standard and flow and timing.16 This tool was adapted for the two reviews and quality assessments were carried out by two independent reviewers for all studies included in the reviews.

Data synthesis and statistical analysis

We constructed 2×2 tables for cut-off point 1014 and the algorithm scoring method.13 Pooled estimates of sensitivity, specificity, positive/negative likelihood ratios and diagnostic ORs (DOR) were calculated using random effects bivariate meta-analysis.17 Heterogeneity was assessed using I² for the DOR, an estimate of the proportion of study variability that is due to between-study variability rather than sampling error. We considered values of ≥50% to indicate substantial heterogeneity.18 Summary receiver operating characteristic curves (sROC) were constructed using the bivariate model to produce a 95% confidence ellipse within ROC space.19 Each data point in the sROC space represents a separate study, unlike a traditional ROC plot, which explores the effect of varying thresholds on sensitivity and specificity in a single study.
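The per-study quantities named above can be illustrated with a minimal Python sketch. It is not the bivariate random effects model used in the review; it only shows how sensitivity, specificity, likelihood ratios and the DOR follow from a single 2×2 table, and how I² can be obtained from Cochran's Q under simple inverse-variance weighting. The function names and the example counts are hypothetical and assume no zero cells.

```python
import math

def diagnostic_accuracy(tp, fp, fn, tn):
    """Accuracy measures for one study's 2x2 table (assumes no zero cells)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": sensitivity / (1 - specificity),
        "lr_neg": (1 - sensitivity) / specificity,
        "dor": (tp * tn) / (fp * fn),
        # Standard error of log(DOR), used for inverse-variance weighting
        "se_log_dor": math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn),
    }

def i_squared(log_dors, ses):
    """I² (%) from Cochran's Q with inverse-variance weights."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * y for w, y in zip(weights, log_dors)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_dors))
    df = len(log_dors) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100

# Hypothetical study: 40 true positives, 30 false positives,
# 10 false negatives, 420 true negatives.
study = diagnostic_accuracy(tp=40, fp=30, fn=10, tn=420)
print(study["sensitivity"], study["specificity"], study["dor"])
```

Values of `i_squared` at or above 50 would correspond to the threshold for substantial heterogeneity stated above.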

We undertook a meta-regression analysis of logit DOR using researcher allegiance as a covariate in the meta-regression model.20 21 Analyses were conducted using STATA V.12, with the metan, metandi and metareg user-written commands.
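As a rough illustration of what a meta-regression with a binary allegiance covariate estimates, the sketch below fits an inverse-variance weighted regression of log DOR on a 0/1 indicator. This is a simplified fixed-effect version written for illustration only; it is not a reproduction of the model fitted in the review and includes no between-study variance component. The function name and the input values are hypothetical.

```python
import math

def meta_regression_binary(log_dors, ses, allegiant):
    """Weighted least squares of log DOR on a 0/1 covariate.

    Solves the 2x2 normal equations (X'WX) beta = X'Wy with
    weights w = 1 / se^2 (fixed-effect assumption).
    """
    w = [1.0 / s ** 2 for s in ses]
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, allegiant))
    swxx = sum(wi * x * x for wi, x in zip(w, allegiant))
    swy = sum(wi * y for wi, y in zip(w, log_dors))
    swxy = sum(wi * x * y for wi, x, y in zip(w, allegiant, log_dors))
    det = sw * swxx - swx ** 2
    intercept = (swxx * swy - swx * swxy) / det  # pooled log DOR, x = 0
    slope = (sw * swxy - swx * swy) / det        # difference in log DOR
    se_slope = math.sqrt(sw / det)
    return intercept, slope, se_slope, slope / se_slope

# Hypothetical data: two non-allegiant (0) and two allegiant (1) studies.
intercept, slope, se_slope, z = meta_regression_binary(
    log_dors=[2.0, 3.0, 4.0, 4.0],
    ses=[0.5, 0.5, 0.5, 0.5],
    allegiant=[0, 0, 1, 1],
)
print(intercept, slope, se_slope, z)
```

With a binary covariate, the intercept is the weighted mean log DOR of the reference group and the exponentiated slope is the ratio of pooled DORs between the two groups.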

Allegiance rating

We rated authorship of a paper by any of the developers of the PHQ-9 (Kurt Kroenke, MD, Robert L Spitzer, MD and Janet BW Williams) as an indicator of potential allegiance. We also rated as evidence of allegiance any acknowledged collaboration with the developers of the PHQ-9, even if they were not listed as coauthors, and any acknowledged funding from Pfizer to conduct the study.

Results

Overview of included studies

Thirty-one studies reported the diagnostic properties of the PHQ-9 at cut-off point 10 or above and were included in this analysis.14 Twenty-seven studies were included in the algorithm review.13 The study selection flow charts can be found in online supplementary appendix 2 (figures 1 and 2). The characteristics of these studies are reported in tables 1 and 2 and the results of the methodological assessment are presented in tables 3 and 4.

Table 1

Descriptive characteristics of algorithm studies13

Table 2

Descriptive characteristics of the summed items scoring method studies cut-off point 1014

Table 3

Quality assessment of included studies in the algorithm meta-analysis13

Table 4

Quality assessment of included studies in the summed item scoring method cut-off point 10 meta-analysis14

Algorithm scoring method

Descriptive characteristics

The descriptive characteristics of the included studies are presented in table 1. Seven individual studies that reported the diagnostic performance of the PHQ-9 using the algorithm scoring method were coauthored by the original developers of the PHQ-9,22–26 specifically acknowledged one of the developers and support by an educational grant from Pfizer USA,27 or were coauthored by the first author of a previous study that had also been coauthored by one of the developers.28 Twenty non-allegiant studies reported the diagnostic properties of the PHQ-9 using the algorithm scoring method.

Three (43%, 3/7) of the allegiant studies were conducted exclusively in hospital settings.22 26 28 The remaining four studies (57%, 4/7) were conducted in other or non-exclusively hospital settings: one in primary care25 and three in mixed settings: psychosomatic walk-in clinics and family practices,23 outpatient clinics and family practices24 and primary care and hospital settings.27 In the non-allegiant group, 13 (65%, 13/20) studies were conducted in hospital settings.29–41 Of the remaining seven studies, six were conducted in primary care settings42–47 and one in a community sample.48

In both groups (non-allegiant and allegiant studies), the majority of studies validated a translated version of the PHQ-9. Two of the studies authored by developers (29%, 2/7),25 26 and eight (40%, 8/20) non-allegiant studies29 30 37–40 42 48 were conducted in English.

The mean prevalence of MDD in the group of allegiant studies was 13.4% (range 6.1%–29.2%); in the non-allegiant group it was 15.5% (range 3.9%–32.4%). The mean age of patients in the PHQ-9 developers group was 45.7; all but one study had a mean age in the range of 40–50 years. In the non-allegiant group, the mean age was 54.6 (range 29.3–75.0), with almost half (8) of the studies reporting a mean age of over 60. The percentage of females in the PHQ-9 developers group was 56.8% (range 28.6%–67.8%) and in the non-allegiant group was 59.1% (range 18%–100%).

All allegiant studies used a self-reported PHQ-9, whereas in six non-allegiant studies (30%, 6/20) the PHQ-9 was administered by a researcher.30–33 43 48 Apart from Muramatsu et al.,27 all allegiant studies used the SCID as a gold standard; the non-allegiant studies used a wider range of gold standards including SCAN, CIDI, MINI and C-DIS, although the SCID was also frequently used by the independent studies (45%, 9/20 studies).

Four out of the seven allegiant studies (57%) did not include a conflict of interest statement.22 23 25 27 Also, four (57%) of the allegiant studies acknowledged funding from Pfizer.23–25 27 Only one study27 acknowledged the collaboration with one of the developers of the PHQ-9.

Of the non-allegiant studies, 12 (60%) did not include a conflict of interest statement.29–32 35–37 39 44–46 48 It appears that newer studies were more likely to include a conflict of interest statement, which may reflect a recent change in reporting. Funding was acknowledged by most studies (18/20) and most received funding from academic and/or health research institutions. Two studies received funding from pharmaceutical companies (Lundbeck43 and Pfizer35) and one study acknowledged that Pfizer Italia provided the Italian version of PHQ-9 and gave the authors permission to use it.36

Diagnostic test accuracy

Pooled sensitivity and specificity were calculated separately for the non-allegiant and allegiant studies. Pooled sensitivity for the allegiant studies of the PHQ-9 was 0.77 (95% CI 0.70 to 0.84), pooled specificity was 0.94 (95% CI 0.90 to 0.97) and the pooled DOR was 64.40 (95% CI 34.15 to 121.43). Heterogeneity was high (I²=78.9%). Figure 1 represents the sROCs for this set of studies.

Figure 1

Patient Health Questionnaire-9 algorithm scoring method summary receiver operating characteristic plot for the diagnosis of major depressive disorder in allegiant studies (panel A) and non-allegiant studies (panel B). Pooled sensitivity and specificity estimates using a bivariate meta-analysis. HSROC, hierarchical receiver operating characteristic.

Pooled sensitivity for the non-allegiant studies was lower compared with the developer authored studies group at 0.48 (95% CI 0.41 to 0.91), pooled specificity was the same at 0.94 (95% CI 0.91 to 0.95). The pooled DOR was approximately four times lower at 15.05 (95% CI 11.03 to 20.52) (see figure 1). Heterogeneity was substantial at I²=68.1%.

The meta-regression analysis for algorithm studies with non-allegiant status as the predictor of the DOR showed that non-allegiant status was a significant predictor of the DOR (p<0.0001) and explained a substantial amount of the observed heterogeneity (51.5%).

Quality assessment

The results of the quality assessment using QUADAS-2 are given in table 3 for the studies reporting on the diagnostic performance of the algorithm scoring method. In the patient selection domain, more non-allegiant studies (65%, 13/20) than allegiant (29%, 2/7) met the criterion for consecutive referrals. There were no marked differences on the other two criteria in this domain (avoid case-control design, avoid inappropriate exclusions). In the index test domain, the proportion of studies reporting that the PHQ-9 was conducted blind to the reference test was comparable between the two groups. There were differences in this domain for those studies using a translated version of the test. All non-English allegiant studies (5/5) used an appropriately translated version of the PHQ-9, whereas just over a half of the non-allegiant studies reported this (55%, 6/11). However, the majority of both sets of studies did not report details of psychometric properties of the translated version. For the reference test domain, nearly all studies in both groups were rated as using a reference test that would correctly classify the condition. While most allegiant studies reported that the reference test was interpreted blind to the PHQ-9 score (86%, 6/7), this was reported in only 60% (12/20) of the non-allegiant studies.

The two sets of studies that used translated versions of the reference test were broadly comparable. There was a slight indication that the allegiant studies were more likely to use an appropriately translated version of the reference test and report data on the psychometric properties of the translated version, although the numbers for the translated comparison are very low. There were, however, some more notable differences on the flow and timing domain. Most allegiant studies ensured that the time between the index and reference test was under 2 weeks (86%, 6/7) in comparison to 70% (14/20) of the non-allegiant studies. More allegiant studies met the criterion for ‘all participants included in the analysis’ (57%, 4/7) than non-allegiant studies (25%).

Summed items scoring method (cut-off point 10 or above)

Descriptive characteristics

Table 2 presents the sample characteristics of the 31 PHQ-9 validation studies that reported the psychometric properties of the PHQ-9 at cut-off point 10 or above. Five of these studies were coauthored by the original developers of the instrument or acknowledged collaboration12 23 26 49 or were coauthored by the first author of a previous study that had also been coauthored by one of the developers.28 Twenty-six studies were conducted by independent researchers.

Three (60%, 3/5) allegiant studies26 28 49 and 11 non-allegiant studies (42%, 11/26)30–32 34 37 38 50–54 were conducted in hospital settings.

Three (60%, 3/5) allegiant studies12 26 49 and 13 non-allegiant studies (50%, 13/26)30 37 38 42 48 51–53 55–59 were conducted in English.

The mean prevalence of MDD in the allegiant group was 13.2% (range 6.1%–33.5%) and in the non-allegiant group was 16.1% (range 2.5%–43.2%). The mean age of patients in the allegiant group studies was 48.1 (range 41.9–61.0) and in the 26 non-allegiant studies that reported these data was 49.1 (range 23.0–78.0). The percentage of females in the allegiant studies that reported these data12 23 26 28 was 56.3% (range 28.6%–67.8%) and in the non-allegiant group was 64.9% (range 12%–100%).

Three allegiant studies used the self-reported mode of administration and two of them did not specify how the PHQ-9 was administered. In nine non-allegiant studies (34%, 9/26), the PHQ-9 was administered by the researcher.30–32 48 56 58–61 All allegiant studies used SCID as a gold standard; the non-allegiant studies used a wider range of gold standards including SCAN, CIDI, MINI, CIS-R, C-DIS, although the SCID was used in half of the studies (50%, 13/26 studies).

Three allegiant studies (60%) did not include a conflict of interest statement.12 23 49 Two of these studies12 23 acknowledged funding from Pfizer. None of the allegiant studies acknowledged collaboration or authorship of one of the developers of the PHQ-9.

Of the non-allegiant studies, 13 (42%) did not include a conflict of interest statement.30–32 37 42 46 48 53 55 60 62–64 Similar to the algorithm studies, the newer studies were more likely to include a conflict of interest statement. Funding was acknowledged by most studies (27/31) and most received funding from academic and/or health research institutions. One study57 acknowledged that the last author was involved in the development of one of the instruments (CORE-OM), ‘but does not gain financially from its use’. One study51 acknowledged funding from industry, AHA Pharmaceuticals Roundtable, but stated that ‘the funding organisations had no role in the design or conduct of the study, collection, management, analysis or interpretation of data; or preparation, review or approval of the manuscript’. Fine et al. disclosed that the last author had financial and consulting interests (Pfizer was not cited as one of them).56

Diagnostic test accuracy

Pooled sensitivity of allegiant studies was 0.87 (95% CI 0.77 to 0.93), pooled specificity was 0.87 (95% CI 0.76 to 0.94) and the pooled DOR was 49.31 (95% CI 25.74 to 94.48)—see table 5. Heterogeneity was moderate (I²=55.1%). Figure 2 represents the sROCs for this group.

Figure 2

Patient Health Questionnaire-9 summed items scoring method at cut-off point 10 summary receiver operating characteristic plot for diagnosis of major depressive disorder in allegiant studies (panel A) and non-allegiant studies (panel B). Pooled sensitivity and specificity using a bivariate meta-analysis. HSROC, hierarchical receiver operating characteristic.

Table 5

Pooled estimates of diagnostic properties of the Patient Health Questionnaire-9 at cut-off point 10 and using algorithm scoring method in the non-independent vs independent studies groups

Pooled sensitivity of non-allegiant studies was 0.76 (95% CI 0.67 to 0.83), pooled specificity was 0.88 (95% CI 0.85 to 0.91) and the pooled DOR was 24.96 (95% CI 14.81 to 42.08), approximately half that of the allegiant studies (table 5). Heterogeneity was high at I²=81.5%. Figure 2 represents the sROCs for this group.

The meta-regression for the studies using a cut-off point of 10 or above with allegiance status as the predictor showed that allegiance status was a significant predictor of the DOR (p=0.015) and explained 19.0% of observed heterogeneity.
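The size of the gap between the two groups can be checked directly from the pooled DORs reported above for both scoring methods. The short sketch below uses only those four reported point estimates; the dictionary layout and function name are illustrative choices, and the calculation ignores the confidence intervals around each estimate.

```python
import math

# Pooled DOR point estimates as reported in the Results.
pooled_dor = {
    "algorithm": {"allegiant": 64.40, "non_allegiant": 15.05},
    "cutoff_10": {"allegiant": 49.31, "non_allegiant": 24.96},
}

def dor_ratio(method):
    """Ratio of the allegiant to the non-allegiant pooled DOR."""
    group = pooled_dor[method]
    return group["allegiant"] / group["non_allegiant"]

for method in pooled_dor:
    ratio = dor_ratio(method)
    # The log of the ratio is the difference on the log DOR scale,
    # which is the scale used by the meta-regression.
    print(f"{method}: ratio {ratio:.2f}, log difference {math.log(ratio):.2f}")
```

The ratios come out at roughly 4.3 for the algorithm scoring method and roughly 2.0 for the cut-off point of 10, consistent with the descriptions "approximately four times lower" and "approximately half" in the text.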

Quality assessment

The results of the quality assessment using the QUADAS-2 are given in table 4. For the patient selection domain, the two groups of studies were broadly comparable on two items (consecutive or random sample, avoid case-control design). However, all allegiant studies were rated as avoiding inappropriate exclusions (5/5) in contrast to 58% (15/26) of the non-allegiant studies.

On the index test domain, there were a number of differences between the two groups of studies. More of the non-allegiant studies (81%, 21/26) reported that the PHQ-9 was interpreted blind to the reference test compared with 60% (3/5) of the allegiant studies. All (5/5) allegiant studies were rated as prespecifying the threshold on the PHQ-9 compared with 73% (19/26) of the non-allegiant studies. The two sets of studies were broadly comparable in terms of two items from the reference test domain (correctly classify target condition, reference test interpreted blind). Only one allegiant study used a translated version of the index test or reference test, so it is not possible to comment on differences between the two sets of studies in terms of these items from the index or reference test domains. For the flow and timing domain, the two groups of studies were broadly comparable for two of the criteria (interval of 2 weeks or less, all participants receive same reference test). However, fewer than half of the non-allegiant studies met the criterion for ‘all participants included in the analysis’ (42%, 11/26), whereas all allegiant studies met this criterion.

Discussion

To our knowledge, this is the first systematic examination of a possible ‘allegiance’ or authorship effect in the validation of a screening or case-finding psychological instrument for a common mental health disorder. We reviewed diagnostic validation studies of the PHQ-9, a widely used depression screening instrument. We found that allegiant studies reported higher sensitivity paired with similar specificity compared with non-allegiant studies. When entered as a covariate in meta-regression analyses, allegiance status was predictive of variation in the DOR for both the algorithm scoring method and the summed-item scoring method at a cut-off point of 10 or above.

Previous research has proposed several possible explanations for the allegiance effect.9–11 One possibility is the advertent bias that may serve to inflate the performance of a test when evaluated by those who have developed it. However, before concluding that the differences are due to this, it is important to explore and rule out alternative explanations. First, it is possible that any observed differences are a result of differences in study characteristics of the two sets of studies (eg, setting, clinical population). Second, differences in the methodological quality of the studies may also account for any differences. These possibilities are examined below.

Difference in study characteristics as potential alternative explanations

The two sets of studies were broadly comparable in terms of gender and the prevalence of depression, so these variables are unlikely to offer an explanation for the differences. While there were some indications from both sets of comparisons that the PHQ-9 may have been researcher-administered more often in the independent studies, it is not immediately clear how this would lead to lowered diagnostic performance.

The diagnostic meta-analyses of the PHQ-913 14 have shown that the sensitivity and DOR of the PHQ-9 tend to be lower in hospital settings for both algorithm and summed-item scoring methods. While the fact that proportionally more non-allegiant algorithm studies were conducted in secondary care could explain the lower sensitivity and DOR values in the algorithm studies, this would not be the case in the studies that reported the cut-off point of 10 or above, as proportionally more allegiant studies were conducted in hospital settings.

Similarly, differences in the proportions of studies using translated versions of the PHQ-9 are also unlikely to offer an obvious explanation of the difference in diagnostic performance, because in the algorithm set of studies more of the allegiant studies used a translated version of the test, but the proportions were in the opposite direction for the studies using a cut-off of 10 or above. We tested this by carrying out a sensitivity analysis restricting the sample to English studies and studies with adequate translation. Allegiance status was still predictive of DOR variation between allegiant and non-allegiant studies in both the algorithm (p=0.00) and the summed item scoring at cut-off point of 10 meta-analyses (p=0.02).

A similar conclusion is also likely to apply to the age of the samples. There were more studies of older adults among the non-allegiant than the allegiant studies in the algorithm comparison. Depression could be more difficult to identify in older adults due to physical comorbidities that may present with similar symptomatology to depression, and this could account for the lower diagnostic performance in the non-allegiant studies. However, the non-allegiant studies that reported the psychometric properties at cut-off point 10 or above had younger samples than the allegiant studies, so this would not support this interpretation.

The SCID was used as the gold standard in nearly all allegiant studies. The fact that some non-allegiant studies used other gold standards could potentially explain the poorer psychometric properties of the PHQ-9 in these studies. The SCID is often regarded as the most valid of the available semi-structured interviews used as the reference standard in depression diagnostic validity studies. If we assume that this is the case and, furthermore, that the PHQ-9 is an accurate method of screening for depression, then the PHQ-9 may be more likely to agree with the SCID than other reference standards. However, when we carried out a sensitivity analysis restricting the sample to SCID-only studies, allegiance status was still predictive of DOR variation between allegiant and non-allegiant studies in both the algorithm (p=0.01) and the summed item scoring at cut-off point of 10 reviews (p=0.02).

Differences in methodological quality as potential alternative explanations

The quality of the studies was evaluated using the QUADAS-2. Although there were several potential methodological differences between the two groups of studies from the algorithm papers, not all of these offer obvious explanations of the observed differences and some are unlikely as explanations. For example, more allegiant studies ensured that the reference test was interpreted blind to the index test. This is unlikely to account for the observed differences, because a lack of blinding is typically associated with artificially increased diagnostic performance, which is in the opposite direction to the pattern of results observed here. The impact of some other differences is less clear-cut. For example, a higher number of the non-allegiant studies met the criterion for consecutive referrals. For this to provide an explanation of the observed differences, the non-consecutive nature of the referrals in the studies by those who had developed the PHQ-9 would need to have led to the overinclusion of true positives or underinclusion of false negatives given that these studies tended to report higher sensitivity relative to the non-allegiant studies (and vice versa for the independent studies). It is not immediately obvious how this would occur. The allegiant studies were more likely to have met the criterion of ‘included all participants in the analysis’. It is possible that the greater loss of participants from the non-allegiant studies may have artificially reduced the observed diagnostic accuracy, although, again, it is not immediately obvious how this would have affected the true positive and false negative rates. Although there is not an obvious explanation of how these differences in methodological quality could account for the observed differences in diagnostic performance, it is important to recognise that they cannot on that basis be ruled out.

There are, however, two differences in methodological quality among the algorithm studies that are clearer potential alternative explanations. The higher rate of appropriate translations among the allegiant studies is potentially important, because lower diagnostic estimates may be expected from studies that have poorly translated versions of the index test. In the flow and timing domain, more allegiant studies ensured that there was a less than 2-week interval between the index and reference test. This is consistent with lower diagnostic performance in the non-allegiant studies: as the interval increases it is likely that depression status may change and this would lead to lower levels of agreement between the index test and the reference test.

There were also differences on some quality assessment items between the two sets of studies in the summed item scoring method comparison. The threshold was reported as prespecified in all allegiant studies, in contrast to approximately three-quarters of the non-allegiant studies. On the face of it, this is unlikely to explain the observed differences, because the use of a prespecified cut-off point is likely to be associated with lower, not higher, diagnostic test performance. One possibility, however, is that studies that performed poorly at this cut-off point were less likely to be reported by those who had developed the measure. As discussed in more detail in the ‘Limitations’ section, we were unable to explore this possibility through the use of formal tests for publication bias.

All allegiant studies avoided inappropriate exclusions, compared with approximately half of the non-allegiant studies. While this is a potential alternative explanation of the differences, it is not immediately obvious how this would explain the differences in diagnostic performance between the two sets of studies. Fewer than half of the non-allegiant studies met the criterion for ‘all participants included in the analysis’, whereas all of the allegiant studies met this criterion, but again this difference would usually be expected to work against the studies that included all participants, not those that excluded cases. More of the non-allegiant studies reported that the PHQ-9 was interpreted blind to the reference test. This does offer a potential explanation, because the absence of blinding may artificially inflate diagnostic accuracy.

Limitations

The results of this review need to be viewed in light of the limitations of the primary studies that contributed to the review and the review itself. An important consideration is to establish whether any observed differences between the diagnostic performance of the non-allegiant and allegiant studies are better accounted for by differences in study characteristics or methodology. Caution, however, is needed in interpreting any differences, because of the small number of allegiant studies in both the algorithm and cut-off 10 or above comparisons. The small number of allegiant studies also meant that we were unable to explore the potential role of publication bias in the non-allegiant and allegiant studies. At least 10 studies are required to use standard methods of examining publication bias, but the number of allegiant studies in both the algorithm and cut-off 10 or above comparisons was fewer than this. Papers published from August 2013 onwards are not covered by the literature search, so the review potentially misses some more recent studies that would be eligible for inclusion, although it is unlikely that many, if any, new allegiant studies have been published since.

Conclusions and implications for further research

The aim of the review was to investigate whether an allegiance effect is found that leads to an increased diagnostic performance in diagnostic validation studies that were conducted by teams connected to the original developers of the PHQ-9. Our analyses showed that diagnostic studies conducted by independent/non-allegiant researchers had lower sensitivity paired with similar specificity compared with studies that were classified as allegiant. This conclusion held for both the algorithm and cut-off 10 or above studies. We explored a range of possible alternative explanations for the observed allegiance effect including both differences in study characteristics and study quality. A number of potential differences were found, although for some of these it is not clear how they would necessarily account for the observed differences. However, there were a number of differences that offered potential alternative explanations unconnected to allegiance effects. In the algorithm studies, the studies rated as allegiant were also more likely to use an appropriate translation of the PHQ-9 and were also more likely to ensure that the index and reference test were conducted within 2 weeks of each other, both of which may be associated with an improvement in observed diagnostic performance of an instrument. The majority of studies in both meta-analyses did not provide clear statements about potential conflict of interest and/or funding; however, the newer studies were more likely to provide such statements, which may reflect increasing transparency in this area of research.

We cannot, therefore, conclude that allegiance effects are present in studies examining the diagnostic performance of the PHQ-9, but nor can we rule them out. Conflicts of interest are an important area of investigation in medical and behavioural research, particularly due to concerns about trial results being influenced by industry sponsorship. Future diagnostic validity studies in this area should, as a matter of routine, present clear statements about potential conflicts of interest and funding, particularly relating to the development of the instrument under evaluation. Future meta-analyses of diagnostic validation studies of psychological measures should routinely evaluate the impact of researcher allegiance in the primary studies examined in the meta-analysis.

Acknowledgments

One of the authors of this paper (SG) was supported by the NIHR Collaboration for Leadership in Applied Health Research and Care Yorkshire and Humber (NIHR CLAHRC YH). The views and opinions expressed are those of the author(s), and not necessarily those of the NHS, the NIHR or the Department of Health.


Footnotes

  • i This study provided separate estimates for the two settings in which it was conducted; therefore separate psychometric estimates were generated for each sample for both algorithm scoring method and summed items scoring method at cut-off point 10 (see below).

  • Contributors LM led on all stages of the review and is the guarantor. We used an established database of diagnostic validation studies of the PHQ-9 (Manea et al., 2015; Moriarty et al., 2015). SG provided expert advice on methodology and approaches to assessment of the evidence base. AM carried out the literature searches, screened the studies, extracted data and assessed the quality of the included studies for one of the systematic reviews (Moriarty et al., 2015). LM carried out the literature searches, screened the studies, extracted data and assessed the quality of the included studies for the other systematic review (Manea et al., 2015), analysed the data for both systematic reviews and drafted the report. JB was involved in the development of the study, wrote the introduction section of the review and contributed to the production of the final report. DM supervised the quality assessment, methodology and approaches to evidence synthesis, provided senior advice and support throughout and contributed to the production of the final report. All parties were involved in drafting and/or commenting on the report.

  • Funding LM was an NIHR Clinical Lecturer when this research was carried out. The NIHR had no role in the study design, methods, data collection, analysis or interpretation of data, nor any role in the preparation of the manuscript or decision to submit the manuscript for publication.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.
