Are there researcher allegiance effects in diagnostic validation studies of the PHQ-9? A systematic review and meta-analysis

Objectives To investigate whether there is an authorship effect, such that studies conducted by the original developers of the Patient Health Questionnaire-9 (PHQ-9) (allegiant studies) report better diagnostic performance. Design Systematic review with random effects bivariate diagnostic meta-analysis. Search strategies included electronic databases, examination of reference lists and forward citation searches. Inclusion criteria Included studies provided sufficient data to calculate the diagnostic accuracy of the PHQ-9 against a gold standard diagnosis of major depression, using the algorithm or the summed item scoring method at cut-off point 10. Data extraction Descriptive information, methodological quality criteria and 2×2 contingency tables. Results Seven allegiant and 20 independent studies reported the diagnostic performance of the PHQ-9 using the algorithm scoring method. The pooled diagnostic odds ratio (DOR) was 64.40 for the allegiant group and 15.05 for the non-allegiant group. Allegiance status was a significant predictor of DOR variation (p<0.0001). Five allegiant studies and 26 non-allegiant studies reported the performance of the PHQ-9 at the recommended cut-off point of 10. The pooled DOR was 49.31 for the allegiant group and 24.96 for the non-allegiant group. Allegiance status was a significant predictor of DOR variation (p=0.015). Some potential alternative explanations for the observed authorship effect, including differences in study characteristics and quality, were found, although it is not clear how some of them account for the observed differences. Conclusions Allegiant studies reported better performance of the PHQ-9, and allegiance status was predictive of variation in the DOR. Based on the observed differences between independent and non-independent studies, we were unable to confirm or exclude the presence of allegiance effects in studies examining the diagnostic performance of the PHQ-9.
This study highlights the need for future meta-analyses of diagnostic validation studies of psychological measures to evaluate the impact of researcher allegiance in the primary studies.

• An original study: the first meta-analysis of diagnostic validation studies of psychological measures to evaluate the impact of researcher allegiance.

• Rigorous methodology: strict inclusion/exclusion and quality assessment criteria.

• We found that allegiance status was a significant predictor of the variation in the diagnostic odds ratio in the meta-regression analysis.

• Substantial variability was observed in the methodological quality of included studies.

• Based on the observed methodological differences between the independent and non-independent studies, we were unable to confirm or exclude the presence of allegiance effects in studies examining the diagnostic performance of the PHQ-9.

Reports based on overlapping samples were examined to establish whether they contained information relevant to the research question that was not contained in the included report.

The mean prevalence of major depressive disorder in the group of studies co-authored by PHQ-9 developers was 13.4% (range 6.1%-29.2%); in the independent group it was 15.5% (range 3.9%-32.4%). The mean age of patients in the PHQ-9 developers group was 45.75 years; all but one study had a mean age in the range of 40 to 50 years. In the independent group the mean age was 54.6 years (range 29.3-75.0), with almost half (8) of the studies reporting a mean age of over 60. The percentage of females in the PHQ-9 developers group was 56.8% (range 28.6%-67.8%) and in the independent group 59.1% (range 18%-100%).

The meta-regression analysis for algorithm studies with independent status as the predictor of the diagnostic odds ratio showed that independent status was a significant predictor of the diagnostic odds ratio (p < 0.0001).

The results of the quality assessment using QUADAS-2 are given in table 3 for the studies reporting on the diagnostic performance of the algorithm scoring method. In the patient selection domain, more of the independent studies (65%, 13/20) than the non-independent studies (29%, 2/7) met the criterion for consecutive referrals. There were no marked differences on the other two criteria in this domain (avoiding a case-control design and avoiding inappropriate exclusions). In the index test domain, the proportion of studies reporting that the PHQ-9 was conducted blind to the reference test was comparable between the two groups. There were differences in this domain for the studies using a translated version of the test: all non-English non-independent studies (5/5) used an appropriately translated version of the PHQ-9, whereas just over half of the independent studies reported this (55%, 6/11).

Five studies were co-authored by one of the developers of the PHQ-9, or were co-authored by the first author of a previous study that had also been co-authored by one of the developers (Navinés et al., 2012). Twenty-six studies were conducted by independent researchers. The mean prevalence of major depressive disorder in the group of studies authored by PHQ-9 developers was 13.2% (range 6.1%-33.5%), and in the independent group it was 16.1%. Pooled estimates for this comparison are given in table 5. Heterogeneity was high at I² = 81.5%. Figure 5 presents the summary ROC curves for this group.

The meta-regression for the studies using a cut-off point of 10 with allegiance status as the predictor showed that allegiance status was a significant predictor of the diagnostic odds ratio (p = 0.015) and explained 18.95% of the observed heterogeneity.

An allegiance or authorship effect is one mechanism that may serve to inflate the performance of a test when evaluated by those who have developed it. However, before concluding that the differences are due to this, it is important to explore and rule out alternative explanations. First, it is possible that any observed differences are a result of differences in study characteristics of the two sets of studies (e.g., setting, clinical population). Secondly, differences in the methodological quality of the studies may also account for any differences. These possibilities are examined below.

Differences in study characteristics as potential alternative explanations

The two sets of studies were broadly comparable in terms of gender and the prevalence of depression, so these variables are unlikely to offer an explanation for the differences. While there were some indications from both sets of comparisons that the PHQ-9 may have been researcher-administered more often in the independent studies, it is not immediately clear how this would lead to lowered diagnostic performance.

Differences in methodological quality as potential alternative explanations

The quality of the studies was evaluated using the QUADAS-2. Although there were several potential methodological differences between the two groups of studies from the algorithm papers, not all of these offer obvious explanations of the observed differences.

The results of this review need to be viewed in the light of the limitations of the primary studies that contributed to the review and of the review itself. An important consideration is to establish whether any observed differences between the diagnostic performance of the independent and non-independent studies are better accounted for by study characteristics or methodological differences. Caution, however, is needed in interpreting any differences, because of the small number of non-independent studies in both the algorithm and cut-off 10 comparisons. The small number of non-independent studies also meant that we were unable to explore the potential role of publication bias in the independent and non-independent studies: at least 10 studies are required to use standard methods of examining publication bias, but the number of non-independent studies in both comparisons was fewer than this.

The aim of the review was to investigate whether an allegiance effect is found that leads to increased diagnostic performance in diagnostic validation studies conducted by teams connected to the original developers of the PHQ-9. Our analyses showed that diagnostic studies conducted by independent researchers had lower sensitivity paired with similar specificity compared to studies that were classified as non-independent. This conclusion held for both the algorithm and cut-off 10 studies. We explored a range of possible alternative explanations for the observed allegiance effect, including differences in study characteristics and study quality. A number of potential differences were found, though for some of these it is not clear how they would necessarily account for the observed differences. However, there were a number of differences that offered potential alternative explanations unconnected to allegiance effects. These included the greater use of the SCID in the studies rated as non-independent in both the algorithm and the cut-off 10 studies. In the algorithm studies, the studies rated as non-independent were also more likely to use an appropriate translation of the PHQ-9 and to ensure that the index and reference tests were conducted within two weeks of each other, both of which may be associated with an improvement in the observed diagnostic performance of an instrument.
Objectives 4 Provide an explicit statement of questions being addressed with reference to participants, interventions, comparisons, outcomes, and study design (PICOS).

METHODS
Protocol and registration 5 Indicate if a review protocol exists, if and where it can be accessed (e.g., Web address), and, if available, provide registration information including registration number.

No
Eligibility criteria 6 Specify study characteristics (e.g., PICOS, length of follow-up) and report characteristics (e.g., years considered, language, publication status) used as criteria for eligibility, giving rationale.

5
Information sources 7 Describe all information sources (e.g., databases with dates of coverage, contact with study authors to identify additional studies) in the search and date last searched.

Study selection 9 State the process for selecting studies (i.e., screening, eligibility, included in systematic review, and, if applicable, included in the meta-analysis).

5
Data collection process 10 Describe method of data extraction from reports (e.g., piloted forms, independently, in duplicate) and any processes for obtaining and confirming data from investigators.

6
Data items 11 List and define all variables for which data were sought (e.g., PICOS, funding sources) and any assumptions and simplifications made.

5-6
Risk of bias in individual studies 12 Describe methods used for assessing risk of bias of individual studies (including specification of whether this was done at the study or outcome level), and how this information is to be used in any data synthesis.

Risk of bias across studies 15 Specify any assessment of risk of bias that may affect the cumulative evidence (e.g., publication bias, selective reporting within studies).

RESULTS
Study selection 17 Give numbers of studies screened, assessed for eligibility, and included in the review, with reasons for exclusions at each stage, ideally with a flow diagram.

Appendix
Study characteristics 18 For each study, present characteristics for which data were extracted (e.g., study size, PICOS, follow-up period) and provide the citations. Tables 1 and 2

Risk of bias within studies 19 Present data on risk of bias of each study and, if available, any outcome level assessment (see item 12). Tables 3 and 4

Results of individual studies 20 For all outcomes considered (benefits or harms), present, for each study: (a) simple summary data for each intervention group (b) effect estimates and confidence intervals, ideally with a forest plot.

N/A
Synthesis of results 21 Present results of each meta-analysis done, including confidence intervals and measures of consistency. Table 5

Risk of bias across studies 22 Present results of any assessment of risk of bias across studies (see Item 15). Tables 3 and 4

Additional analysis 23 Give results of additional analyses, if done (e.g., sensitivity or subgroup analyses, meta-regression [see Item 16]). 11 and 17

DISCUSSION
Summary of evidence 24 Summarize the main findings including the strength of evidence for each main outcome; consider their relevance to key groups (e.g., healthcare providers, users, and policy makers).

17-21
Limitations 25 Discuss limitations at study and outcome level (e.g., risk of bias), and at review-level (e.g., incomplete retrieval of identified research, reporting bias).


Although it has been suggested that allegiance effects may play a role in the validation of psychological screening and case-finding tools [12], this possibility has not previously been examined systematically. The PHQ-9 can be scored using different methods, including an algorithm based on DSM-IV criteria and a cut-off based on summed-item scores. The psychometric properties of these two approaches have been summarised in two recently published meta-analyses [13], [14]. We investigated whether an allegiance effect is found that leads to increased sensitivity and specificity in studies that were conducted by researchers closely connected to the original developers of the instrument.

Electronic databases were searched using the terms "PHQ-9", "PHQ", "PHQ$" and "patient health questionnaire". The search strategy is presented in Appendix 2.

The reference lists of studies fitting the inclusion criteria were manually searched, and a forward citation search in Web of Science was performed. Authors of unpublished studies were contacted and conference abstracts were reviewed in an attempt to minimise publication bias.

The following inclusion and exclusion criteria were used:

Allegiance rating
We rated authorship of a paper by any of the developers of the PHQ-9 (Kurt Kroenke, MD, Robert L Spitzer, MD, and Janet B W Williams) as an indicator of potential allegiance. We also rated acknowledged collaborations with the developers of the PHQ-9 as evidence of allegiance, even if the developers were not listed as co-authors, and likewise if the authors acknowledged funding from Pfizer to conduct the study.

¹ This study provided separate estimates for the two settings in which it was conducted; therefore separate psychometric estimates were generated for each sample for both the algorithm scoring method and the summed-item scoring method at cut-off point 10 (see below).

The meta-regression analysis for algorithm studies with non-allegiant status as the predictor of the diagnostic odds ratio showed that non-allegiant status was a significant predictor of the diagnostic odds ratio (p < 0.0001) and explained a substantial amount of the observed heterogeneity (51.5%). The results of the quality assessment using QUADAS-2 are given in table 3. The two sets of studies that used translated versions of the reference test were broadly comparable.

The meta-regression for the studies using a cut-off point of 10 or above with allegiance status as the predictor showed that allegiance status was a significant predictor of the diagnostic odds ratio (p = 0.015) and explained 19.0% of the observed heterogeneity.

This is, to our knowledge, the first systematic examination of a possible 'allegiance' or authorship effect in the validation of a screening or case-finding psychological instrument for a common mental health disorder. We reviewed diagnostic validation studies of the PHQ-9, a widely used depression screening instrument. We found that allegiant studies reported higher sensitivity paired with similar specificity compared to non-allegiant studies.
When entered as a covariate in meta-regression analyses, allegiance status was predictive of variation in the DOR for both the algorithm scoring method and the summed-item scoring method at a cut-off point of 10 or above.

The diagnostic meta-analyses of the PHQ-9 [13], [14] have shown that the sensitivity and DOR of the PHQ-9 tend to be lower in hospital settings. In the algorithm comparison, more of the allegiant studies used an appropriately translated version of the PHQ-9, but the proportions were in the opposite direction for the studies using a cut-off of 10 or above. We tested this by carrying out a sensitivity analysis restricting the sample to English studies and studies with adequate translation. The allegiance effect was still predictive of DOR variation in both the algorithm (p < 0.01) and summed-item scoring at cut-off point 10 meta-analyses (p = 0.02).
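The meta-regressions referred to above regress the logit (log) DOR on an allegiance dummy. The sketch below is a simplified illustration of that idea, not the model or data used in the review: it uses made-up study values and fixed inverse-variance weights, omitting the between-study variance τ² that a full random-effects meta-regression would add to each weight. With a binary covariate, the slope reduces to the difference in weighted mean log-DOR between allegiant and non-allegiant studies.

```python
import math

# Hypothetical per-study log-DORs, variances, and allegiance flags (1 = allegiant).
ldor  = [4.1, 3.9, 4.3, 2.6, 2.9, 2.4, 3.1]    # log diagnostic odds ratios
var   = [0.30, 0.25, 0.40, 0.20, 0.22, 0.35, 0.28]
alleg = [1, 1, 1, 0, 0, 0, 0]

w   = [1 / v for v in var]                      # fixed (inverse-variance) weights
S   = sum(w)
Sx  = sum(wi * xi for wi, xi in zip(w, alleg))
Sxx = sum(wi * xi * xi for wi, xi in zip(w, alleg))
Sy  = sum(wi * yi for wi, yi in zip(w, ldor))
Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, alleg, ldor))

# Weighted least squares: solve the 2x2 normal equations for intercept and slope.
det       = S * Sxx - Sx ** 2
slope     = (S * Sxy - Sx * Sy) / det           # allegiant minus non-allegiant mean log-DOR
intercept = (Sxx * Sy - Sx * Sxy) / det         # non-allegiant mean log-DOR
se_slope  = math.sqrt(S / det)                  # fixed-effect standard error of the slope

# Crude analogue of "heterogeneity explained": drop in Cochran's Q after the covariate.
q_total = sum(wi * (yi - Sy / S) ** 2 for wi, yi in zip(w, ldor))
q_resid = sum(wi * (yi - intercept - slope * xi) ** 2
              for wi, xi, yi in zip(w, alleg, ldor))
explained = max(0.0, (q_total - q_resid) / q_total) * 100

print(round(math.exp(slope), 2))    # ratio of DORs, allegiant vs non-allegiant
print(round(slope / se_slope, 2))   # z-statistic for the allegiance effect
print(round(explained, 1))          # % of Q removed by the covariate
```

Exponentiating the slope gives the multiplicative difference in DOR attributable to allegiance status under these assumptions.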

A similar conclusion is also likely to apply to the age of the samples. There were more older-adult studies in the non-allegiant than the allegiant group in the algorithm comparison. Depression can be more difficult to identify in older adults because physical co-morbidities may present with similar symptomatology to depression, which could account for the lower diagnostic performance in the non-allegiant studies.

However, the non-allegiant studies that reported the psychometric properties at cut-off point 10 or above had younger samples than the allegiant studies, so this comparison does not support that interpretation.

The SCID was used as the gold standard in nearly all allegiant studies. The fact that some non-allegiant studies used other gold standards could potentially explain the poorer psychometric properties of the PHQ-9 in these studies. The SCID is often regarded as the most valid of the available semi-structured interviews used as the reference standard in depression diagnostic validity studies. If we assume that this is the case and, furthermore, that the PHQ-9 is an accurate method of screening for depression, then the PHQ-9 may be more likely to agree with the SCID than with other reference standards. However, when we carried out a sensitivity analysis restricting the sample to SCID-only studies, the allegiance effect was still predictive of DOR variation in both the algorithm (p = 0.01) and summed-item scoring at cut-off point 10 meta-analyses (p = 0.02).

The quality of the studies was evaluated using the QUADAS-2. Although there were several potential methodological differences between the two groups of studies from the algorithm papers, not all of these offer obvious explanations of the observed differences and some are unlikely as explanations. For example, more allegiant studies ensured that the reference test was interpreted blind to the index test. This is unlikely to account for the observed differences, because a lack of blinding is typically associated with artificially increased diagnostic performance, which would be expected to favour the non-allegiant studies rather than lower their apparent performance. There are, however, two differences in methodological quality among the algorithm studies that are clearer potential alternative explanations.

The higher rate of appropriate translations among the allegiant studies is potentially important, because lower diagnostic estimates may be expected from studies that used poorly translated versions of the index test. In the flow and timing domain, more allegiant studies ensured that there was a less than two-week interval between the index and reference tests. This is consistent with lower diagnostic performance in the non-allegiant studies: as the interval increases, it is likely that depression status may change, which would lead to lower levels of agreement between the index test and the reference test.

There were also differences in some quality assessment items between the two sets of studies in the summed-item scoring method comparison.




Quality assessment
Quality assessment was performed using the QUADAS-2 tool, a tool for evaluating the risk of bias and applicability of primary diagnostic accuracy studies when conducting diagnostic systematic reviews [15]. It covers the areas of patient selection, index test, reference standard, and flow and timing [16]. The tool was adapted for the two reviews, and quality assessments were carried out by two independent reviewers for all studies included in the reviews.

111
Data synthesis and statistical analysis

We constructed 2×2 tables for cut-off point 10 [14] and the algorithm scoring method. [13] Pooled estimates of sensitivity, specificity, positive/negative likelihood ratios, and diagnostic odds ratios were calculated using random effects bivariate meta-analysis. [17] Heterogeneity was assessed using I² for the diagnostic odds ratio, an estimate of the proportion of study variability that is due to between-study variability rather than sampling error; we considered values of ≥50% to indicate substantial heterogeneity. [18] Summary Receiver Operating Characteristic (sROC) curves were constructed using the bivariate model to produce a 95% confidence ellipse within ROC space. [19] Each data point in the summary ROC space represents a separate study, unlike a traditional ROC plot, which explores the effect of varying thresholds on sensitivity and specificity in a single study. We undertook a meta-regression analysis of the logit diagnostic odds ratio using research allegiance as a covariate in the meta-regression model. [20]
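As a concrete illustration of the accuracy measures pooled above, the following sketch computes sensitivity, specificity, likelihood ratios, and the diagnostic odds ratio (DOR) from a single 2×2 contingency table. The counts are invented for illustration and are not taken from any included study.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Sensitivity, specificity, likelihood ratios, and DOR from a 2x2 table."""
    sens = tp / (tp + fn)        # proportion of depressed cases screening positive
    spec = tn / (tn + fp)        # proportion of non-depressed cases screening negative
    lr_pos = sens / (1 - spec)   # positive likelihood ratio
    lr_neg = (1 - sens) / spec   # negative likelihood ratio
    dor = lr_pos / lr_neg        # diagnostic odds ratio, equal to (tp * tn) / (fp * fn)
    return sens, spec, lr_pos, lr_neg, dor

# Hypothetical counts: 40 true positives, 15 false positives,
# 10 false negatives, 135 true negatives
sens, spec, lr_pos, lr_neg, dor = diagnostic_measures(40, 15, 10, 135)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} DOR={dor:.1f}")
```

The DOR is the ratio of the odds of a positive screen in depressed versus non-depressed participants, which is why a single number can summarise overall discrimination across both sensitivity and specificity.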

Allegiance Rating

We rated authorship by any of the developers of the PHQ-9 (Kurt Kroenke, MD; Robert L Spitzer, MD; and Janet B W Williams) as an indicator of potential allegiance. We also rated as evidence of allegiance any acknowledged collaboration with the developers of the PHQ-9, even if they were not listed as co-authors, or acknowledged funding from Pfizer to conduct the study.

… previous study that had also been co-authored by one of the developers [28]. Twenty non-allegiant studies reported the diagnostic properties of the PHQ-9 using the algorithm scoring method.

… funding from pharmaceutical companies (Lundbeck [43] and Pfizer [35]), and one study acknowledged that Pfizer Italia provided the Italian translation.

The meta-regression analysis for algorithm studies with non-allegiant status as the predictor showed that non-allegiant status was a significant predictor of the diagnostic odds ratio (p < 0.0001) and explained a substantial amount of the observed heterogeneity (51.5%). The results of the quality assessment using QUADAS-2 are given in table 3.

The meta-regression for the studies using a cut-off point of 10 or above with allegiance status as the predictor showed that allegiance status was a significant predictor of the diagnostic odds ratio (p = 0.015) and explained 19.0% of the observed heterogeneity.
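The meta-regression of the log diagnostic odds ratio on allegiance status can be sketched as follows. This is a minimal illustration using a DerSimonian-Laird moment estimator for the between-study variance, not the bivariate model used in the review, and the study-level log DOR values, variances, and allegiance labels below are made up for illustration.

```python
import numpy as np

def dl_tau2(y, v, X):
    """DerSimonian-Laird between-study variance for a meta-regression."""
    w = 1.0 / v
    WX = X * w[:, None]
    xtwx = X.T @ WX
    beta = np.linalg.solve(xtwx, WX.T @ y)    # fixed-effect WLS fit
    resid = y - X @ beta
    q = float((w * resid) @ resid)            # generalised Cochran's Q
    df = len(y) - X.shape[1]
    # tr(P), with P = W - W X (X'WX)^-1 X'W, for the moment estimator
    tr_p = w.sum() - np.trace(np.linalg.solve(xtwx, X.T @ (X * (w ** 2)[:, None])))
    return max(0.0, (q - df) / tr_p)

def wls_beta(y, v, X, tau2):
    """Random-effects weighted least squares coefficients."""
    w = 1.0 / (v + tau2)
    WX = X * w[:, None]
    return np.linalg.solve(X.T @ WX, WX.T @ y)

# Made-up study-level data: log DOR, within-study variance, allegiance (1/0)
log_dor = np.array([4.1, 3.9, 4.3, 2.7, 2.9, 2.5, 3.1])
var = np.array([0.30, 0.25, 0.35, 0.20, 0.22, 0.18, 0.28])
alleg = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])

X0 = np.ones((len(log_dor), 1))                       # intercept only
X1 = np.column_stack([np.ones(len(log_dor)), alleg])  # + allegiance covariate

tau2_null = dl_tau2(log_dor, var, X0)
tau2_model = dl_tau2(log_dor, var, X1)
beta = wls_beta(log_dor, var, X1, tau2_model)

# Heterogeneity explained by the covariate, analogous to the percentages
# reported in the text (51.5% and 19.0% for the real data)
explained = 100.0 * (1.0 - tau2_model / tau2_null) if tau2_null > 0 else 0.0
print(f"allegiance coefficient on the log-DOR scale: {beta[1]:.2f}")
print(f"between-study heterogeneity explained: {explained:.0f}%")
```

A positive coefficient on the allegiance indicator corresponds to higher pooled DOR in allegiant studies, and the drop in the between-study variance when the covariate is added is the "heterogeneity explained" quantity quoted in the results.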
This is, to our knowledge, the first systematic examination of a possible 'allegiance' or authorship effect in the validation of a screening or case-finding psychological instrument for a common mental health disorder. We reviewed diagnostic validation studies of the PHQ-9, a widely used depression screening instrument. We found that allegiant studies reported higher sensitivity paired with similar specificity compared to non-allegiant studies. When entered as a covariate in meta-regression analyses, allegiance status was predictive of variation in the DOR for both the algorithm scoring method and the summed-item scoring method at a cut-off point of 10 or above.

The two sets of studies were broadly comparable in terms of gender and the prevalence of depression, so these variables are unlikely to offer an explanation for the differences. While there were some indications from both sets of comparisons that the PHQ-9 may have been researcher-administered more often in the independent studies, it is not immediately clear how this would lead to lowered diagnostic performance.

The diagnostic meta-analyses of the PHQ-9 [13], [14] have shown that the sensitivity and DOR of the PHQ-9 tend to be lower in hospital settings … but the proportions were in the opposite direction for the studies using a cut-off of 10 or above. We tested this by carrying out a sensitivity analysis restricting the sample to English studies and studies with adequate translation.
The allegiance effect was still predictive of DOR variation between allegiant and non-allegiant studies in both the algorithm (p = 0.00) and the summed item scoring at cut-off point 10 (p = 0.02) meta-analyses.

A similar conclusion is also likely to apply to the age of the samples. There were more studies of older adults among the non-allegiant than the allegiant studies in the algorithm comparison. Depression could be more difficult to identify in older adults because physical co-morbidities may present with symptomatology similar to depression, which could account for the lower diagnostic performance in the non-allegiant studies.

However, the non-allegiant studies that reported the psychometric properties at cut-off point 10 or above had younger samples than the allegiant studies, which does not support this interpretation.

The SCID was used as the gold standard in nearly all allegiant studies. The fact that some non-allegiant studies used other gold standards could potentially explain the poorer psychometric properties of the PHQ-9 in these studies. The SCID is often regarded as the most valid of the available semi-structured interviews used as the reference standard in depression diagnostic validity studies. If we assume that this is the case and, furthermore, that the PHQ-9 is an accurate method of screening for depression, then the PHQ-9 may be more likely to agree with the SCID than with other reference standards. However, when we carried out a sensitivity analysis restricting the sample to SCID-only studies, the allegiance effect was still predictive of DOR variation between allegiant and non-allegiant studies in both the algorithm (p = 0.01) and the summed item scoring at cut-off point 10 (p = 0.02) reviews.

The quality of the studies was evaluated using the QUADAS-2. Although there were several potential methodological differences between the two groups of studies from the algorithm papers, not all of these offer obvious explanations of the observed differences, and some are unlikely as explanations. For example, more allegiant studies ensured that the reference test was interpreted blind to the index test.
This is unlikely to account for the observed differences, because a lack of blinding is typically associated with artificially increased diagnostic performance.

There are, however, two differences in methodological quality among the algorithm studies that are clearer potential alternative explanations.

The higher rate of appropriate translations among the allegiant studies is potentially important, because lower diagnostic estimates may be expected from studies that have poorly translated versions of the index test. In the flow and timing domain, more allegiant studies ensured that there was a less than two-week interval between the index and reference test. This is consistent with lower diagnostic performance in the non-allegiant studies: as the interval increases, it is likely that depression status may change, and this would lead to lower levels of agreement between the index test and the reference test.

There were also differences in some quality assessment items between the two sets of studies in the summed item scoring method comparison.

The threshold was reported as pre-specified in all allegiant studies, in contrast to approximately three quarters of the non-allegiant studies. On the face of it, this is unlikely to explain the observed differences, because the use of a pre-specified cut-off point is likely to be associated with lower, not higher, diagnostic test performance. One possibility, however, is that studies that performed poorly at this cut-off point were less likely to be reported by those who had developed the measure. As discussed in more detail in the limitations section, we were unable to explore this possibility through the use of formal tests for publication bias.

… excluding cases. More of the non-allegiant studies reported that the PHQ-9 was interpreted blind to the reference test. This does offer a potential explanation, because the absence of blinding may artificially inflate diagnostic accuracy.

The aim of the review was to investigate whether an allegiance effect is found that leads to increased diagnostic performance in diagnostic validation studies conducted by teams connected to the original developers of the PHQ-9. Our analyses showed that diagnostic studies conducted by independent/non-allegiant researchers had lower sensitivity paired with similar specificity compared to studies that were classified as allegiant. This conclusion held for both the algorithm and the cut-off 10 or above studies. We explored a range of possible alternative explanations for the observed allegiance effect, including both differences in study characteristics and study quality.
A number of potential differences were found, though for some of these it is not clear how they would necessarily account for the observed differences. However, there were a number of differences that offered potential alternative explanations unconnected to allegiance effects. In the algorithm studies, the studies rated as allegiant were more likely to use an appropriate translation of the PHQ-9 and to ensure that the index and reference test were conducted within two weeks of each other, both of which may be associated with an improvement in the observed diagnostic performance of an instrument. The majority of studies in both meta-analyses did not provide clear statements about potential conflicts of interest and/or funding; however, the newer studies were more likely to provide such statements, which may reflect increasing transparency in this area of research.

PRISMA checklist (item name, item number, description, location reported)

METHODS
Protocol and registration (item 5; reported: No): Indicate if a review protocol exists, if and where it can be accessed (e.g., Web address), and, if available, provide registration information including registration number.
Eligibility criteria (item 6; page 5): Specify study characteristics (e.g., PICOS, length of follow-up) and report characteristics (e.g., years considered, language, publication status) used as criteria for eligibility, giving rationale.
Information sources (item 7): Describe all information sources (e.g., databases with dates of coverage, contact with study authors to identify additional studies) in the search and date last searched.
Study selection (item 9; page 5): State the process for selecting studies (i.e., screening, eligibility, included in systematic review, and, if applicable, included in the meta-analysis).
Data collection process (item 10; page 6): Describe method of data extraction from reports (e.g., piloted forms, independently, in duplicate) and any processes for obtaining and confirming data from investigators.
Data items (item 11; pages 5-6): List and define all variables for which data were sought (e.g., PICOS, funding sources) and any assumptions and simplifications made.
Risk of bias in individual studies (item 12): Describe methods used for assessing risk of bias of individual studies (including specification of whether this was done at the study or outcome level), and how this information is to be used in any data synthesis.

RESULTS
Study characteristics (item 18; Tables 1 and 2): For each study, present characteristics for which data were extracted (e.g., study size, PICOS, follow-up period) and provide the citations.
Risk of bias within studies (item 19; Tables 3 and 4): Present data on risk of bias of each study and, if available, any outcome level assessment (see item 12).
Results of individual studies (item 20; N/A): For all outcomes considered (benefits or harms), present, for each study: (a) simple summary data for each intervention group; (b) effect estimates and confidence intervals, ideally with a forest plot.
Synthesis of results (item 21; Table 5): Present results of each meta-analysis done, including confidence intervals and measures of consistency.
Risk of bias across studies (item 22; Tables 3 and 4): Present results of any assessment of risk of bias across studies (see item 15).
Additional analysis (item 23; pages 11 and 17): Give results of additional analyses, if done (e.g., sensitivity or subgroup analyses, meta-regression [see item 16]).

DISCUSSION
Summary of evidence (item 24): Summarize the main findings including the strength of evidence for each main outcome; consider their relevance to key groups (e.g., healthcare providers, users, and policy makers).