Occurrence and nature of questionable research practices in the reporting of messages and conclusions in international scientific Health Services Research publications: a structured assessment of publications authored by researchers in the Netherlands

Objectives: To explore the occurrence and nature of questionable research practices (QRPs) in the reporting of messages and conclusions in international scientific Health Services Research (HSR) publications authored by researchers from HSR institutions in the Netherlands.
Design: In a joint effort to assure the overall quality of HSR publications in the Netherlands, 13 HSR institutions participated in this study. Together with these institutions, we constructed and validated an assessment instrument covering 35 possible QRPs in the reporting of messages and conclusions. Two reviewers independently assessed a random sample of 116 HSR articles authored by researchers from these institutions and published in international peer-reviewed scientific journals in 2016.
Setting: Netherlands, 2016.
Sample: 116 international peer-reviewed HSR publications.
Main outcome measures: Median number of QRPs per publication, the percentage of publications with observed QRP frequencies, the occurrence of specific QRPs, and differences in the total number of QRPs by methodological approach, type of research, and study design.
Results: We identified a median of six QRPs per publication out of 35 possible QRPs. QRPs occurred most frequently in the reporting of implications for practice, recommendations for practice, contradictory evidence, study limitations, and conclusions based on the results and in the context of the literature. We identified no differences in the total number of QRPs between papers based on different methodological approaches, types of research, or study designs.
Conclusions: Given the applied nature of HSR, both the severity of the identified QRPs and the recommendations for policy and practice in HSR publications warrant discussion. We recommend that the HSR field further define and establish its own scientific norms in publication practices to improve scientific reporting and strengthen the impact of HSR. The results of our study can serve as an empirical basis for continuous critical reflection on the reporting of messages and conclusions.

In 2009, Chalmers and Glasziou estimated that 85% of research funding in biomedical sciences was wasted avoidably, 1 resulting in The Lancet's series "Increasing value: reducing waste". This series has stirred the international scientific community, prompting funders, regulators, academic institutions, and scientific publishers to act. Funders of biomedical research have responded by organising conferences on research waste, and journal editors have initiated discussions on data sharing and open access. 2 While evidence for questionable research practices (QRPs) in biomedical sciences is mounting, 1 little is known about the occurrence and nature of QRPs in the policy-and management-oriented field of health services research (HSR). The term 'questionable' covers a wide range of practices. A questionable practice is not necessarily wrongful, but does 'raise questions'.
The HSR field is an applied field of research, and produces evidence on topics such as co-payments, evaluation of quality improvement efforts, cost-effectiveness of medications, patient empowerment, therapy compliance, and effects of policies. Given the growing evidence for the prevalence of QRPs in the reporting of messages and conclusions in the biomedical field, 3,4 QRPs may also occur in the HSR field. Just like biomedical researchers, health services researchers are under pressure to publish in high-impact journals to increase their citation scores and attract media attention to augment their prestige and chances for future research funding and job security. [5][6][7][8] Unlike biomedical research, HSR findings are not easily generalised from one local or national health services setting to another, and messages and conclusions tend to be limited to a specific national context. 9 A broad spectrum of quantitative and qualitative methods is used in HSR, including designs that are less subject to strict codes of execution than randomised controlled trials, such as observational and case study designs. Furthermore, HSR has difficulty creating alignment between the construction of scientific knowledge and the implementation of that knowledge in policy and practice. 10

Although reporting in scientific publications is highly standardised, the discussion and conclusion sections offer researchers relative freedom when deriving messages and conclusions from study results. 4 We explored the occurrence and nature of QRPs in the reporting of messages and conclusions in international scientific HSR publications authored by researchers from HSR institutions in the Netherlands. We also examined the relationship between study type, methodology, and design and the occurrence of QRPs. With our study, we want to fuel the debate on fostering responsible messages and conclusions, and provide a basis for the discussion on QRPs in the international HSR field.

What is already known about this topic
 In the biomedical field, estimates for the occurrence of questionable research practices (QRPs) in the interpretation of results in scientific publications vary from 10% of publications deriving discordant conclusions from study results to 100% of publications containing rhetorical practices resulting in spin.
 The debate on fostering responsible reporting practices to date mainly focusses on the biomedical field.
 Knowledge on scientific reporting in the applied field of Health Services Research (HSR) is lacking.

Added value of this study
 With this explorative study, we identify a broad scope of QRPs in the reporting of messages and conclusions in HSR publications.
 Furthermore, we demonstrate that recommendations for policy and practice are not commonly reported in HSR publications, despite the policy- and management-oriented nature of HSR.
 To ensure the applicability of HSR, those in the field should reflect on the severity of the identified QRPs, and on the inclusion and form of recommendations for policy and practice.

Characteristics of included publications
Table 1 presents the characteristics of the 116 included publications from the 13 participating HSR institutions. To summarise, 54.3% of the publications were quantitative, 28.4% were qualitative, and 17.2% applied a mixed-methods approach. Sixteen percent of the publications were based on a published study protocol. The mean impact factor of the journals was 2.81, and the average number of authors was six.

Occurrence of QRPs per publication
Across the 116 HSR publications, the median number of QRPs per publication was six (interquartile range 5·75), out of 35 possible QRPs. The distribution of the observed frequency of QRPs across publications is visualised in figure 1.
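As a minimal illustration of the summary statistics reported here, the median and interquartile range of per-publication QRP counts can be computed as follows. The counts below are hypothetical, chosen only to show the calculation; they are not the study's data.

```python
import statistics

# Hypothetical per-publication QRP counts (0-35 possible per publication);
# illustrative values only, not the study's data.
qrp_counts = [0, 2, 3, 4, 5, 6, 6, 7, 8, 9, 11, 18]

def iqr(values):
    """Interquartile range (Q3 - Q1), using the standard library's
    default 'exclusive' quantile method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

median_qrps = statistics.median(qrp_counts)
spread = iqr(qrp_counts)
```

Note that `statistics.quantiles` interpolates between order statistics, so the interquartile range can differ slightly from the quartile conventions of other statistical software.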

Frequency of QRPs per type
For each of the QRPs, we counted how often they were identified in the included publications. Appendix 1 presents the percentage of occurrence per QRP type.
QRPs that occurred most frequently were:
 Implications for policy and practice do not adequately reflect the results in the context of the referenced literature (69·0%)*;
 Recommendations for policy and practice do not adequately reflect the results in the context of the referenced literature (65·5%)**;
 Contradicting evidence is poorly documented (63·8%);
 Conclusions do not adequately reflect the findings as presented in the results section (46·6%);
 Possible impact of the limitations on the results is not or poorly discussed (44·0%);
 Conclusions are not supported by the results as presented in the context of the referenced literature (43·1%).
*In 50·0% of publications, no implications for policy and practice were mentioned, and in 19·0% of publications, implications were mentioned without adequate justification.
**In 34·5% of publications, no recommendations for policy and practice were reported, and in 31·0% of publications, recommendations were mentioned without adequate justification.
QRPs that occurred least frequently were:
 The main source of evidence for supporting the results is based on the same underlying data (2·6%);
 Generalising findings to populations not included in the original sample is not justified (2·6%);
 Causative wording is used in the hypothesis/research question, although there is no theory to support causation (2·4%);
 Possible clinical relevance of statistically non-significant results is not addressed (2·4%);
 Generalising findings to time periods not included in the original study is not justified (0·0%).

Figure 2 shows the distribution of QRPs across publications. The horizontal axis shows the publications (n=116) ordered from the publication with the lowest (0) to the highest (18) number of observed QRPs in the reporting of messages and conclusions. The vertical axis shows the QRPs ordered from least (generalisation to a different time period) to most (implications for practice are lacking) frequently observed. On the right vertical axis, the occurrence of QRPs is presented as the number of QRPs counted. Each dot represents a QRP.

The difference in the number of QRPs by publication characteristics
Table 2 shows the associations between the total number of QRPs (applicable to all study designs) and methodological approach (quantitative, qualitative, and mixed), type of research (descriptive, exploratory, hypothesis testing, and measurement instruments), and study design (observational, (quasi-)experimental, systematic review, economic evaluation, case study, and meta-analysis). No statistically significant differences in the number of QRPs were found by type of research, methodological approach, or study design.

Discussion
We explored the occurrence and nature of QRPs in the reporting of messages and conclusions in international scientific HSR publications authored by researchers from HSR institutions in the Netherlands, and examined the relationship between study type, methodology, and design and the occurrence of QRPs. Our results indicate that HSR publications have a median of six QRPs per publication. We identified most QRPs in the reporting of implications for policy and practice, recommendations for policy and practice, contradictory evidence, study limitations, and conclusions based on the results and in the context of the literature. No significant associations between the number of QRPs and type of study, study design, or methodological approach were identified.

Limitations and Strengths
We applied a broad and sensitive definition of 'questionable', for instance by considering the absence of contradictory evidence or the absence of implications and recommendations for policy and practice as a QRP. The choice not to present contradictory evidence does not defy current publication checklists, yet this practice may hinder interpretation of findings in the full context of evidence. If authors searched for contradictory evidence but found none and did not report that absence, readers of the publication would have no clues about its existence. With our broad definition encompassing 35 possible QRPs, we bring to light the areas that offer possibilities for further enhancing publication practices in HSR. Consequently, this definition allows for a discussion in the field of HSR on the extent to which the identified QRPs are acceptable. This is an important strength of our applied approach.
Although we endeavoured to develop a reliable measurement instrument that would guide the review process, the instrument allowed latitude for the reviewer's interpretation. Consequently, a different group of reviewers might arrive at somewhat different scoring frequencies for observed QRPs. However, because we defined each QRP in detail, it is unlikely that there would be substantial differences in the overall distribution of different types of QRPs across publications. Our consensus method contains a degree of subjectivity, and there is the risk that one reviewer's opinion will dominate. To counteract this, NK and DK performed random checks on 10% of all assessments. By recording the motivation for every identified QRP, we supported the consistency of our measurement and justified our results. Because publications were selected based on the title, selection bias might have occurred. Considering we found no relationship between study characteristics and number of QRPs, it is unlikely that a different sample would have led to different results. Inevitably, reviewers sometimes assessed publications written by authors they knew professionally or personally, and as such, a positive view of a colleague's work might have led to underestimating the QRPs in these publications.
Our study results may be representative of HSR publications internationally. Given that publication in international journals is highly standardised in terms of language (English) and format, our findings can most likely be transferred to HSR communities in other countries.

Interpretation
In HSR publications, recommendations for policy and practice warrant most attention. A study can be conducted properly, using a sound design and appropriate methodology. However, making recommendations without adequate justification could lead to incorrect inferences in policy and the management of healthcare, and undermine society's confidence in science. 10,[21][22][23][24] Measures for safeguarding scientific soundness like those often used in biomedical research (eg, trial registration, open data policies, and an improved reporting and archiving infrastructure 25 ) do not address the reporting of conclusions not supported by study results, and are not tailored to the observational and explorative designs most prevalent in HSR. Moreover, existing publication checklists address a report's completeness, but do not question the justification of the conclusions. 4 If we intend to improve the reporting of HSR conclusions and recommendations, we will need to better understand the factors that influence authors when reporting the discussion and conclusions section of an HSR publication, eg, media pressure and relationships with funders. 5,6,8,26 Consequently, subsequent research can focus on what influences researchers when writing their scientific publications, and what factors play a role in the process from research design to the acceptance of a manuscript by a peer-reviewed journal.
A third of the HSR publications studied gave no recommendations for policy or practice, while another third did not provide an adequate justification for the recommendations. One could argue that HSR is an applied field of research, and that its ultimate goal should be to contribute to better health services and systems; researchers should therefore take responsibility for providing guidance to those who can act on the research findings instead of leaving them empty-handed. 10 On the other hand, health services researchers may feel more comfortable committing to a more traditional interpretation of the role of academics, refraining from normative judgement. If the latter is the dominant viewpoint, the HSR community needs to consider the role of scientific evidence in helping decision makers address the challenges they face, and informing policies and practices. Internationally, the HSR community has been promoting further strengthening of the link between HSR and practice. 27 In biomedical research, research being "new" might contribute to a confused assessment of implications. 28 This problem is amplified in HSR, where there is a limited accumulation of evidence. HSR considers a larger range of contextual factors and stakeholders in politics or management. Moreover, HSR recommendations are often based on observational or exploratory research, which is considered to be weak evidence in biomedical circles (eg, the GRADE checklist). 29 Perhaps the norms determined by the biomedical research field make health services researchers hesitant to provide any implications or recommendations at all.

Implications and recommendations for policy and practice
The HSR field currently seems to adhere to the norms and expectations set by the biomedical field, even though HSR is multidisciplinary, and differences in approach and type of methodology pose serious challenges to observing these norms. Therefore, the HSR community needs to further define specific scientific norms appropriate to the field.
Scientific norms are developed through the forum of a scientific community. 30 This forum function is particularly strong in the Netherlands, where a community of HSR institutions works together closely. Our study was able to bring together the main Dutch academic and non-academic HSR institutions. Consequently, the results of our study help to facilitate critical reflection on the current state of research and encourage debate on how to systematically advance the reporting of messages and conclusions in HSR. Such a debate in the Dutch context is needed, given the attempts over the past decade by the Netherlands Organisation for Health Research and Development (ZonMw) to strengthen the link between research and practice. It would also be very timely, considering the ongoing, overarching Dutch research programme on responsible research practices funded by ZonMw, of which this study is a part. We recommend that the HSR community reflect on the questions our results bring forward: how do we include implications and recommendations for policy and practice in scientific publications? How should we describe conclusions in the context of the literature when the accumulation of evidence is limited? And what is the severity of the identified QRPs? Through this publication, we would like to urge journal editors and those working in the international field of HSR to join in this debate.

Conclusions
QRPs in the reporting of messages and conclusions occur frequently in peer-reviewed international scientific HSR publications from Dutch HSR institutions. These QRPs differ in severity and cannot always be qualified as wrongful, but they do 'raise questions'. To ensure the applicability of HSR research in policy and practice, the HSR field should reflect on scientific norms for the reporting of conclusions and the inclusion of recommendations for policy and practice. Our study can serve as an empirical basis for continuous critical reflection on the current state of research, and encourage debate on how to systematically advance the reporting of messages and conclusions in HSR.

Appendix 1 (excerpt): occurrence per QRP type (percentages: QRP identified / not identified / not assessable).
 Sources, direction and magnitude of bias are not or poorly discussed, or just listed without further discussion: 27·6 / 72·4 / 0·0
 The conclusions in the abstract do not adequately reflect the conclusions in the main text: 22·4 / 75·0 / 2·6
 The main results discussed in the discussion paragraph do not adequately address the original objectives/research questions as posed in the introduction: 20·7 / 75·9 / 3·4
 The outcome measure used does not allow the conclusions that are stated:* 18·1 / 81·9 / 0·0
 Lack of distinction between results and discussion (the results section contains elements of discussion and interpretation beyond the scope of explaining the results): 17·2 / 82·8 / 0·0
 The sampling methodology does not allow the type of generalisation provided: 15·5 / 84·5 / 0·0
 The objectives/research questions of the study are differently phrased in the introduction and the discussion: 14·7 / 36·2 / 49·1
 The order of presenting the results in the discussion is inconsistent with the ordering of the objectives/research questions as posed in the introduction: 14·7 / 75·0 / 10·3
 Hyperboles and exaggerating adjectives are unjustifiably used: 12·1 / 87·9 / 0·0
 The title does not adequately reflect the main findings: 11·2 / 88·8 / 0·0
 The abstract does not adequately reflect the main findings: 10·3 / 89·7 / 0·0
 A potential causal relationship claimed in the discussion paragraph is not justified: 10·3 / 89·7 / 0·0

6.1.2 The abstract does not adequately reflect the main findings.
6.1.3 The conclusions in the abstract do not adequately reflect the conclusions in the main text.
6.1.4 The objectives/research questions of the study are differently phrased in the introduction and the discussion.
6.1.5 The outcome measure does not adequately reflect the objectives/research questions of the study.
6.1.6 The main results discussed in the discussion paragraph do not adequately address the original objectives/research questions as posed in the introduction.
6.1.7 The order of presenting the results in the discussion is inconsistent with the ordering of the objectives/research questions as posed in the introduction.
6.1.10 The outcome measure used does not allow the conclusions that are stated.
6.1.11 The conclusion/discussion distracts from the main outcomes by overstating the relevance of secondary outcomes.
6.1.12 The conclusions are not supported by the results as presented in the context of the referenced literature.
6.1.13 Recommendations do not adequately reflect the results in the context of the referenced literature.
6.1.14 Implications for policy and practice do not adequately reflect the results in the context of the referenced literature.
6.1.15 Lack of distinction between results and discussion: the results section contains elements of discussion and interpretation beyond the scope of explaining the results.

6.2 Main results are not or inadequately interpreted in the context of evidence
6.2.1 Supporting evidence is poorly documented.
6.2.2 Contradicting evidence is poorly documented.
6.2.3 Evidence is used inappropriately to support the findings (ie, the argument is not supported by the actual message of the cited evidence). Will be measured as: evidence seems to be used selectively to support the findings, given the title of the referenced evidence.
6.2.4 The main source of evidence to support the results is based on the same underlying data.

6.3 Limitations are not adequately mentioned
6.3.1 Sources, direction and magnitude of bias are not or poorly discussed, or just listed without further discussion.
6.4 Unjustified generalisations
6.4.1 The sampling methodology does not allow the type of generalisation provided.
6.4.2 Generalisation of findings to populations not included in the original sample is not justified.
6.4.3 Generalisation of findings to time periods not included in the original study is not justified.
6.4.4 Generalisation of findings to geographical locations not included in the original study is not justified.
6.4.5 Generalisation of findings to settings/institutions not included in the original study is not justified.

6.5 Unjustified causation
6.5.1 Causative wording is used in the hypothesis/research question, although there is no theory supporting causation.
6.5.2 A causal relationship is claimed, although the research design is not appropriate to determine causation (methods lack control of potential confounding or systematic bias).
6.5.3 A causal relationship is claimed although potential sources of bias and their potential impact on the findings were not discussed.
6.5.4 A potential causal relationship claimed in the discussion paragraph is not justified.

Tables properly represent results
Tables give a reflection of actual results instead of cherry-picking.
The objectives/research questions cannot be answered with the outcome measure that is studied.
6.1.6 The main results discussed in the discussion paragraph do not adequately address the original objectives/research questions as posed in the introduction.

6.2.2 Contradicting evidence is poorly documented.
Only limited evidence opposing the main results is provided, and it is discussed only superficially; there is no thorough reflection on the findings in the perspective of contradicting evidence.

6.2.3 Evidence is used inappropriately to support the findings (ie, the argument is not supported by the actual message of the cited evidence). Will be measured as: evidence seems to be used selectively to support the findings, given the title of the referenced evidence.
6.2.4 The main source of evidence to support the results is based on the same underlying data.
Most supporting evidence is grounded in the same data source as was used for the reviewed study (not necessarily self-citing), inducing circularity in the argumentation.

6.3 Limitations are not adequately mentioned
6.3.1 Sources, direction and magnitude of bias are not or poorly discussed, or just listed without further discussion.
6.3.2 The possible impact of the limitations on the results (ie, the magnitude and direction of any potential sources of bias) is not or poorly discussed.
Is the impact of the limitations discussed? (If no limitations are mentioned, this is considered a QRP.) The extent to which potential risks of bias affect the interpretation of the findings is not thoroughly discussed.
6.4 Unjustified generalisations
6.4.1 The sampling methodology does not allow the type of generalisation provided.
The sample is too specific, too small, or flawed (for instance by attrition or selection bias) for the generalisation that is made.

6.4.2 Generalisation of findings to populations not included in the original sample is not justified.
6.4.3 Generalisation of findings to time periods not included in the original study is not justified.
The characteristics of the included time period are too specific (for instance an election period, affecting the policy that was studied), and no or inadequate evidence is provided to support the generalisation that is made.
6.4.4 Generalisation of findings to geographical locations not included in the original study is not justified.
The characteristics of the included geographical location(s) are too specific to generalise to other geographical locations (for instance from a highly urbanised area to a rural setting), and no or inadequate evidence is provided to support the generalisation that is made.
6.4.5 Generalisation of findings to settings/institutions not included in the original study is not justified.
6.5.2 A causal relationship is claimed, although the research design is not appropriate to determine causation (methods lack control of potential confounding or systematic bias).

6.5.3 A causal relationship is claimed although potential sources of bias and their potential impact on the findings were not discussed.
No or inadequate discussion is included concerning the impact of potential sources of bias on the possible causation that was found in the results.
6.5.4 A potential causal relationship claimed in the discussion paragraph is not justified.
When a causal relation may not be assumed solely on the basis of the study's findings, no or inadequate supporting and contradicting evidence is used to discuss the possible causation that was found in the results.
6.6 Effect size
6.6.1 The relevance of statistically significant results with small effect size is overstated.
The importance of the findings is exaggerated: although (some) results are statistically significant, the clinical/practical relevance is minor due to the small effect size, or causation is unlikely.
6.6.2 The possible clinical relevance of statistically non-significant results is not addressed.
The importance of the findings is dismissed because no statistical significance was reached, although the findings reflect likely causation and non-significance was likely due to a lack of power.
Jargon, and technical and complex language that does not fit the journal audience, is used without properly explaining its meaning.
The journal audience is not properly addressed by the language used; language use seems to be overly complex, to impress or distract the reader.

7.1 Overall qualitative evaluation of the study (eg, quality, reporting style).
If a certain aspect impacts the answer to multiple questions, specify this in "other comments"; eg, if the discussion section does not contain the main results, then this item cannot be assessed.
*Give information separately for cases and controls in case-control studies and, if applicable, for exposed and unexposed groups in cohort and cross-sectional studies.

Strengths and limitations of this study
 This study describes an assessment of publications and is therefore able to detect QRPs that go unnoticed in survey studies that rely on self-report.
 Although we aimed to develop a reliable measurement instrument that would guide the review process, the instrument allowed latitude for the reviewer's interpretation.
 In our assessment method, we relied on consensus among assessors, which inevitably introduces some subjectivity. Independent assessments showed a consensus rate of >80% between assessors.
 Because publications were selected based on the title, selection bias might have occurred.
…and describing a hypothesis after finding significant results. 3 In this study, we further define the meaning of questionable research practices in the reporting of messages and conclusions in the field of HSR specifically.



We developed an extensive list of QRPs in the reporting of messages and conclusions. Items were based on the EQUATOR checklists 15 and earlier checklists for identifying "spin" (ie, "a way to distort science reporting without actually lying") 5 or other QRPs. 13,14,16,17 The proposed list of QRPs was reviewed, refined, and complemented using 14 semi-structured interviews with the directors/leaders and representatives (n=19) of the 13 participating HSR institutions. Next, the five participating international health services researchers provided email feedback on the list resulting from these interviews; the list was adapted accordingly, resulting in 35 possible QRPs in the reporting of messages and conclusions in HSR publications.

We developed a data extraction form in Excel that contained the list of QRPs and bibliometric information, and conducted a pilot to evaluate its feasibility and usability. In the pilot, two assessors (RG, TJ) independently … lists. In a consensus meeting between TJ and RG, differences in selected titles were resolved by discussing their fit with the definition. Consensus was reached on all included publications.

The HSR publications (n=717) were each assigned a random number. Per institution, the publications with unique first authors and the lowest assigned numbers were included in the sample. Three HSR institutions did not have enough publications with unique first authors, resulting in a selection of nine, eight, and two publications for these institutions. Furthermore, two publications were excluded during assessment because they concerned research protocols; these were replaced by another publication authored by the same institution.
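The per-institution random sampling described above can be sketched as follows. This is an illustrative reconstruction: the record fields, author counts, and per-institution quota are assumptions, not the study's actual data or code.

```python
import random

# Hypothetical publication records; field names and counts are
# illustrative assumptions, not data from the study.
publications = [
    {"id": i, "institution": i % 13, "first_author": f"author{i % 300}"}
    for i in range(717)
]

random.seed(42)
random.shuffle(publications)  # stands in for assigning each publication a random number

def sample_per_institution(pubs, quota):
    """Per institution, take publications in random-number order,
    skipping repeated first authors, until the quota is reached."""
    sample = []
    for inst in sorted({p["institution"] for p in pubs}):
        seen_authors = set()
        for p in pubs:
            if p["institution"] != inst or p["first_author"] in seen_authors:
                continue
            seen_authors.add(p["first_author"])
            sample.append(p)
            if len(seen_authors) == quota:
                break
    return sample

sample = sample_per_institution(publications, quota=9)
```

An institution with fewer eligible publications than the quota simply contributes what it has, mirroring the nine, eight, and two publications reported above.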

One publication was excluded because its methodology was considered incomprehensible by the reviewers.

The assessment started with a test phase. During this phase, agreements and disagreements in the assessments of the first 30 publications were thoroughly discussed (by RG, TJ, NK, and DK) to increase the accuracy of the assessments; agreement between the two reviewers (TJ, RG) was 81% for the first 20 publications, which increased to 82% when assessing the next 10 publications. It became clear that each publication needed to be assessed independently by two reviewers with complementary expertise, followed by a consensus procedure and a random check by the project leaders. RG trained the third reviewer (JM).


Ethics approval

A waiver for ethical approval was obtained for this study from the medical ethics review committee at Amsterdam UMC. To avoid negative consequences for the authors of the included publications, each publication was assigned a unique identification number. Extracted data were entered in SPSS using this number to separate author information from the study data.

Characteristics of included publications
Table 1 presents the characteristics of the 116 included publications from the 13 participating HSR institutions.

To summarise, 54.3% of the publications were quantitative, 28.4% were qualitative, and 17.2% applied a mixed-methods approach. Sixteen percent of the publications were based on a published study protocol. The mean impact factor of the journals was 2.81, and the average number of authors was six.

QRPs that occurred least frequently were:

The difference in the number of QRPs by publication characteristics
Table 2 shows the associations between the total number of QRPs (applicable to all study designs) and methodological approach (quantitative, qualitative, and mixed), type of research (descriptive, exploratory, hypothesis testing, and measurement instruments), and study design (observational, (quasi-)experimental, systematic review, economic evaluation, case study, and meta-analysis).

No statistically significant differences in the number of QRPs were found by type of research, methodological approach, or study design.

Limitations and Strengths

We applied a broad and sensitive definition of 'questionable', for instance by considering the absence of contradictory evidence or the absence of implications and recommendations for policy and practice as a QRP.

The choice not to present contradictory evidence does not defy current publication checklists, yet this practice may hinder interpretation of findings in the full context of evidence. If authors searched for contradictory evidence but did not mention its absence, readers of the publication would have no clues as to its existence.

Knowledge on the occurrence of QRPs is often derived from survey studies, relying on self-report.

Implications and recommendations for policy and practice

The HSR field currently seems to adhere to the norms and expectations set by the biomedical field, even though HSR is multidisciplinary, and differences in approach and type of methodology pose serious challenges to observing these norms. Therefore, the HSR community needs to further define specific scientific norms appropriate to the field.

Scientific norms are developed through the forum of a scientific community.33

Transparency statement

The lead author (DK) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as originally planned have been explained.

Data sharing statement

The data concern the quantitative results from a review of scientific publications. Anonymized data is available

6.1.1
The title does not adequately reflect the main findings.

6.1.2
The abstract does not adequately reflect the main findings.

6.1.3
The conclusions in the abstract do not adequately reflect the conclusions in the main text.

6.1.4
The objectives/research questions of the study are differently phrased in the introduction and the discussion.

6.1.5
The outcome measure does not adequately reflect the objectives/research questions of the study.

6.1.6
The main results discussed in the discussion paragraph do not adequately address the original objectives/research questions as posed in the introduction.

6.1.7
The order of presenting the results in the discussion is inconsistent with the ordering of the objectives/research questions as posed in the introduction.

6.1.8
The conclusions do not adequately reflect the objectives of the study.

6.1.9
The conclusions do not adequately reflect the findings as presented in the results paragraph.

6.1.10
The outcome measure used does not allow the conclusions that are stated.

6.1.11
The conclusion/discussion distracts from main outcomes by overstating the relevance of secondary outcomes.

6.1.12
The conclusions are not supported by the results as presented in context of the referenced literature.

6.1.13
Recommendations do not adequately reflect the results in context of the referenced literature.
6.1.14
Implications for policy and practice do not adequately reflect the results in the context of the referenced literature.

6.1.15
Lack of distinction between results and discussion. The results section contains elements of discussion and interpretation beyond the scope of explaining the results.

6.2
Main results are not or inadequately interpreted in the context of evidence

6.2.1
Supporting evidence is poorly documented.

6.2.2
Contradicting evidence is poorly documented.

6.2.3
Evidence is used inappropriately to support the findings (i.e. the argument is not supported by the actual message of the cited evidence). Will be measured as: Evidence seems to be used selectively to support the findings, given the title of the referenced evidence.

6.2.4
The main source of evidence to support the results is based on the same underlying data.

6.3
Limitations are not adequately mentioned

6.3.1
Sources, direction and magnitude of bias are not or poorly discussed, or just listed without further discussion.

6.3.2
The possible impact of the limitations on the results (i.e., magnitude and direction of any potential sources of bias) is not or poorly discussed.

6.4
Unjustified generalisations

6.4.1
The sampling methodology does not allow the type of generalization provided.

6.4.2
Generalization of findings to populations not included in the original sample is not justified.

6.4.3
Generalization of findings to time periods not included in the original study is not justified.

6.4.4
Generalization of findings to geographical locations not included in the original study is not justified.

6.4.5
Generalization of findings to settings/institutions not included in the original study is not justified.

6.5
Unjustified causation
6.5.1 Causative wording is used in the hypothesis/research question, although there is no theory supporting causation.

6.5.2
A causal relationship is claimed, although the research design is not appropriate to determine causation (methods lack control of potential confounding or systematic bias).

6.5.3
A causal relationship is claimed although potential sources of bias and their potential impact on the findings were not discussed.

6.5.4
A potential causal relationship claimed in the discussion paragraph is not justified.

6.6
Effect size

6.6.1
The relevance of statistically significant results with small effect size is overstated.

6.6.2
The possible clinical relevance of statistically nonsignificant results is not addressed.

Description of quantitative and/or qualitative methods of analyses is reported
4.12 Handling of missing data is reported
4.13 Comparator is explained

Tables properly represent results
Tables give a reflection of actual results instead of cherry picking

Graphs properly represent results
Scaling is appropriate

(Statistical) uncertainty is reported
Confidence intervals are provided for the main results

6
Questionable messages and conclusions

The outcome measure does not adequately reflect the objectives/research questions of the study.
The objectives/research questions cannot be answered with the outcome measure that is studied.

The main results discussed in the discussion paragraph do not adequately address the original objectives/research questions as posed in the introduction.
The research questions and/or objectives that were stated in the introduction section are not or only partly answered by the main results.

6.1.7
The order of presenting the results in the discussion is inconsistent with the ordering of the objectives/research questions as posed in the introduction.
Not an actual QRP, but it does conflict with transparency in presenting the study's findings. If there's just one objective/research question, this item is not applicable (no structuring possible) and should be scored -8.

6.1.8
The conclusions do not adequately reflect the objectives of the study.
The objectives of the study are not met by the conclusions the study arrives at. Conclusions can be stated in the discussion paragraph and/or the conclusion paragraph. Either the study shifted perspective along the way without providing justification, or the write-up of the conclusions is flawed. Framing the conclusion as an extension of the discussion is not a QRP (undesirable, but beyond the scope of this indicator).

6.1.9
The conclusions do not adequately reflect the findings as presented in the results paragraph.

6.1.11
The conclusion/discussion distracts from main outcomes by overstating the relevance of secondary outcomes.
The main outcomes are ignored or their importance reduced, while favouring secondary outcomes. Most space is taken by discussing these secondary outcomes.

6.1.12
The conclusions are not supported by the results as presented in context of the referenced literature.
Only limited evidence to support the main results is provided and only superficially discussed. No thorough reflection of the findings in perspective of supporting evidence.

6.2.2
Contradicting evidence is poorly documented.
Only limited evidence opposing the main results is provided and only superficially discussed. No thorough reflection of the findings in perspective of contradicting evidence.

6.2.3
Evidence is used inappropriately to support the findings (i.e. the argument is not supported by the actual message of the cited evidence). Will be measured as: Evidence seems to be used selectively to support the findings, given the title of the referenced evidence.

6.2.4
The main source of evidence to support the results is based on the same underlying data.
Most supporting evidence is grounded in the same data source as was used for the reviewed study (not necessarily self-citing), inducing circularity in argumentation.

6.3
Limitations are not adequately mentioned

6.3.1
Sources, direction and magnitude of bias are not or poorly discussed, or just listed without further discussion.

6.3.2
The possible impact of the limitations on the results (i.e., magnitude and direction of any potential sources of bias) is not or poorly discussed.
Is the impact of the limitations discussed? (If no limitations are mentioned, this is considered a QRP.) The extent to which potential risks of bias affect the interpretation of the findings is not thoroughly discussed.

6.4
Unjustified generalisations

6.4.1
The sampling methodology does not allow the type of generalization provided.
The sample is too specific, small, or flawed (for instance by attrition or selection bias) for the generalization that is made.

6.4.2
Generalization of findings to populations not included in the original sample is not justified.

The included sample is too specific, small, or flawed (for instance by attrition or selection bias), and no or inadequate evidence is provided to support the generalization that is made. 'Population' does not include geographical location (this is a separate QRP); it does include population characteristics such as gender, ethnicity, and age.

6.4.3
Generalization of findings to time periods not included in the original study is not justified.

6.4.4
Generalization of findings to geographical locations not included in the original study is not justified.

6.4.5
Generalization of findings to settings/institutions not included in the original study is not justified.

6.5.2
A causal relationship is claimed, although the research design is not appropriate to determine causation (methods lack control of potential confounding or systematic bias).

6.5.3
A causal relationship is claimed although potential sources of bias and their potential impact on the findings were not discussed.

6.5.4
A potential causal relationship claimed in the discussion paragraph is not justified.
When a causal relation may not be assumed solely based on the study's findings, no or inadequate supporting and contradicting evidence is used to discuss the possible causation that was found in the results.
6.6
Effect size

6.6.1
The relevance of statistically significant results with small effect size is overstated.
Importance of findings is exaggerated. Although (some) results are statistically significant, the clinical/practical relevance is minor due to the small effect size, or causation is unlikely.

6.6.2
The possible clinical relevance of statistically nonsignificant results is not addressed.

Additional information to the methods of the development of the definition and measurement instrument for "questionable research practices in the reporting of messages and conclusions in scientific health services research publications"
This document describes the methods used to develop a definition of questionable research practices (QRPs) in the reporting of messages and conclusions, and to construct a measurement instrument that allows for the identification of questionable research practices in the reporting of messages and conclusions in Health Services Research (HSR).

Methodology
Methods included an explorative review of definitions in the literature; a consultation meeting with the project group, institution/department leaders of Dutch HSR institutions, and project advisors (n=13); semi-structured interviews with the 13 HSR institutes (n=19); and an expert consultation (n=5).
Setting
Thirteen HSR groups, departments, or institutions (hereafter referred to as "HSR institutions") in the Netherlands, including both academic and non-academic institutions, participated in this study. These institutions all agreed to participate in an effort to assure the overall quality of HSR publications in the Netherlands.

Literature review
First, a literature review was conducted searching for existing definitions of questionable research practices in the reporting of conclusions and messages, and operationalisations of QRPs. Search terms included in different order and combination: 'questionable research practices', 'spin', 'over interpretation', 'discordant conclusions', 'QRPs', 'outcome reporting bias', 'questionable conclusions' and 'responsible conclusions'. Documents were included if they described methods to measure questionable research practices in scientific publication, or provided definitions of the above key terms. Referred documents that fit the criteria were also included in the review.
After identifying the main literature that suited our aim, we arrived at a preliminary definition of QRPs based on Boutron 2010, Ochodo 2013, and Horton 1995.1-3
An extensive list of possible types of QRPs in the reporting of messages and conclusions was developed, based on the EQUATOR checklists4 and instruments from previous studies, for example instruments for identifying 'spin', for the reporting of qualitative research, and for other QRPs.3,5-7 Spin in this context refers to "a way to distort science reporting without actually lying".

Consultation meeting
Second, we presented the preliminary QRP definition and the first draft of items referring to QRPs (see page 3) during a consultation meeting of participating HSR institutions on 6 June 2017. The meeting lasted three hours, during which the research project and the preliminary definition and draft of QRP items were discussed. Representatives of the participating HSR institutions (n=7), project advisors (n=2), and project group members (n=4) attended the meeting. The attendees discussed their thoughts about the definition and its operationalisation. Detailed notes from this meeting were summarized and shared with the representatives of all participating institutes (including those who did not attend).
The central conclusion of the meeting was to focus on the 'measurability' of the QRPs. An important consideration in developing the instrument for the assessment of scientific publications is the possibility to measure the QRPs. Therefore, the focus should be on QRPs that can be quantified. These should be distinguished from QRPs that, although possibly important, are not quantifiable.

Semi-structured interviews
Third, we conducted fourteen semi-structured interviews with nineteen leaders/representatives of the thirteen HSR institutions. These representatives had to have a clear overview of the process of reporting research in their institute. One of the institutions was represented by two separate departments, hence two representatives were interviewed separately. Three interviews were conducted with both the institute leader and a second representative. One of the interviews included three representatives of an institution. The aim of the interviews was to discuss our draft of QRP items and identify additional measurable QRPs in the reporting of messages and conclusions in HSR, to explore potential causes of QRPs in messages and conclusions, and to discuss the institute leaders' experiences with these QRPs. A semi-structured interview guide was developed by the project team (see pages 4-5). During the interviews, we presented the interviewees with a draft of QRP items. The draft list was iteratively adjusted, i.e. after each interview we drafted a new version incorporating the findings of the previous interviews.
Interviewees were approached through e-mail to schedule an appointment. Two researchers conducted the interviews, of which thirteen took place at the participating institutions and one in a public space. During the first interview, both researchers were present to align the approach. The remaining interviews were divided equally between them. The interviews lasted one hour. In accordance with ethical guidelines, the goal of the interview was explained at the start of each interview, and permission to audio-record the interview was obtained.
With the support of the recordings, a report was written and shared with the interviewees for validation. All interviewees confirmed the reports after mostly minor edits. From the interview reports, we drew up a new draft of the list of QRP items (see pages 6-7). Within the research group, we paid specific attention to the correct wording of the QRPs.

Expert consultation
Fourth, ten leading international health services researchers were asked to provide feedback on this list of QRP items. These HSR experts were invited through e-mail in which we explained the aim of the study, and included the definition of QRPs and the list of QRP items. Five experts provided their comments to the items. Five experts did not respond after a reminder, or indicated not having time to review the QRP items. Feedback was summarized, and comments were used to adapt the QRP definition and list of QRP items.

Measurement instrument
We developed the measurement instrument in Excel format by taking items from earlier developed checklists (EQUATOR and COREQ) and the list of QRPs. The measurement instrument was completed after a final consensus meeting of the research group. The measurement instrument consists of three sections: 1) bibliographic information of the publication (e.g., funder, journal, number of authors), 2) basic methodological information (e.g., included population, analysis method), and 3) possible QRPs in messages and conclusions. A pilot was conducted to assess the feasibility and usability of the instrument. In the pilot, two project members independently assessed five international HSR publications to identify modifications needed to improve the items in the instrument, and to align the interpretation of the items. The project group discussed the proposed modifications, resulting in the final version: the data extraction form (see supplementary material 1).

List of possible questionable research practices presented during the consultation meeting and the interviews
With each interview, new QRPs were added to the list, which were then presented during the next interview.

Definition: Questionable reporting of messages and conclusions:
"The use of reporting, from whatever motive, consciously or unconsciously, to make conclusions or messages weaker or stronger than results justify."

Interview protocol, first consultation round, June/July 2017
Explanation of the interview
The goal of the ZonMw project is to arrive at recommendations for promoting responsible reporting on health services research (responsible conclusions and messages in health services research).

Expert consultation
Definition: "To frame, from whatever motive, consciously or unconsciously, conclusions or messages as an answer to the research question that are not justified by the results" [Comments concerning definition]

Measuring questionable reporting of conclusions & messages
Title, abstract, main text, and conclusions do not align
The conclusions in the abstract do not adequately reflect the conclusions in the main text. 22.4 75.0 2.6
The main results discussed in the discussion paragraph do not adequately address the original objectives/research questions as posed in the introduction. 20.7 75.9 3.4
The outcome measure used does not allow the conclusions that are stated. * 18.1 81.9 0.0
Lack of distinction between results and discussion. The results section contains elements of discussion and interpretation beyond the scope of explaining the results. 17.2 82.8 0.0
The sampling methodology does not allow the type of generalization provided. 15.5 84.5 0.0
The objectives/research questions of the study are differently phrased in the introduction and the discussion. 14.7 36.2 49.1
The order of presenting the results in the discussion is inconsistent with the ordering of the objectives/research questions as posed in the introduction. 14.7 75.0 10.3
Hyperboles and exaggerating adjectives are unjustifiably used. 12.1 87.9 0.0
The title does not adequately reflect the main findings. 11.2 88.8 0.0
The abstract does not adequately reflect the main findings. 10.3 89.7 0.0
A potential causal relationship claimed in the discussion paragraph is not justified. 10.3 89.7 0.0
The outcome measure does not adequately reflect the objectives/research questions of the study. * 9.6 90.4 0.0
A causal relationship is claimed, although the research design is not appropriate to determine causation. 9.6 90.4 0.0
The relevance of statistically significant results with small effect size is overstated. * 9.6 90.4 0.0
Generalising findings to settings/institutions not included in the original study is not justified. 9.5 89.7 1.0
The conclusion/discussion distracts from main outcomes by overstating the relevance of secondary outcomes. * 8.4 91.6 0.0

- This study describes an assessment of publications and is therefore able to detect QRPs that go unnoticed in survey studies that rely on self-report.

- Although we aimed to develop a reliable measurement instrument that would guide the review process, the instrument allowed latitude for the reviewer's interpretation.

- In our assessment method, we relied on consensus among assessors, which inevitably introduces some subjectivity.

- Because publications were selected based on the title, selection bias might have occurred.

and describing a hypothesis after finding significant results.3 A questionable practice is not necessarily wrongful, but does 'raise questions'. In this study we further define the meaning of questionable research practices in the reporting of messages and conclusions in the field of HSR specifically.

The HSR field is an applied field of research, and produces evidence on topics such as co-payments, evaluation

QRPs that occurred most frequently were:

Although we endeavoured to develop a reliable measurement instrument that would guide the review process,

Our study results may be representative of HSR publications internationally. Given that publication in international journals is highly standardised in terms of language (English) and format, our findings can most likely be transferred to HSR communities in other countries.

In HSR publications, recommendations for policy and practice warrant most attention. A study can be conducted

A third of the HSR publications studied gave no recommendations for policy or practice, while another third did not provide an adequate justification for the recommendations. One could argue that HSR is an applied field of research, and that its ultimate goal should be to contribute to better health services and systems; researchers should therefore take responsibility for providing guidance to those who can act on the research findings instead

376
The HSR field currently seems to adhere to the norms and expectations set by the biomedical field, even though 377 HSR is multidisciplinary, and differences in approach and type of methodology pose serious challenges to 378 observing these norms. Therefore, the HSR community needs to further define specific scientific norms 379 appropriate to the field.

440
Data sharing statement

441
The data concern the quantitative results from a review of scientific publications. Anonymized data is available  The title does not adequately reflect the main findings.

6.1.2
The abstract does not adequately reflect the main findings.

6.1.3
The conclusions in the abstract do not adequately reflect the conclusions in the main text. 6.1.4 The objectives/research questions of the study are differently phrased in the introduction and the discussion. 6.1.5 The outcome measure does not adequately reflect the objectives/research questions of the study. 6.1.6 The main results discussed in the discussion paragraph do not adequately address the original objectives/research questions as posed in the introduction. 6.1.7 The order of presenting the results in de discussion is inconsistent with the ordering of the objectives/research questions as posed in the introduction.

6.1.8
The conclusions do not adequately reflect the objectives of the study. The conclusions do not adequately reflect the findings as presented in the results paragraph.

6.1.10
The outcome measure used does not allow the conclusions that are stated. 6.1.11 The conclusion/discussion distracts from main outcomes by overstating the relevance of secondary outcomes. 6.1.12 The conclusions are not supported by the results as presented in context of the referenced literature.

6.1.13
Recommendations do not adequately reflect the results in context of the referenced literature.
6.1.14 Implications for policy and practice do not adequately reflect the results in the context of the referenced literature.

6.1.15
Lack of distinction between results and discussion. The results section contains elements of discussion and interpretation beyond the scope of explaining the results.

6.2
Main results are not or inadequately interpreted into the context of evidence Supporting evidence is poorly documented.

6.2.2
Contradicting evidence is poorly documented.

6.2.3
Evidence is used inappropriately to support the findings (i.e. the argument is not supported by the actual message of the cited evidence). Will be measured as: Evidence seems to be used selectively to support the findings, given the title of the referenced evidence.

6.2.4
The main source of evidence to support the results is based on the same underlying data.

6.3
Limitations are not adequately mentioned Sources, direction and magnitude of bias are not or poorly discussed, or just listed without further discussion.

6.3.2
The possible impact of the limitations on the results (i.e., magnitude and direction of any potential sources of bias) is not or poorly discussed. The sampling methodology does not allow the type of generalization provided.

6.4.2
Generalization of findings to populations not included in the original sample is not justified.

6.4.3
Generalization of findings to time periods not included in the original study is not justified.

6.4.4
Generalization of findings to geographical locations not included in the original study is not justified.

6.4.5
Generalization of findings to settings/institutions not included in the original study is not justified.

Unjustified causation
6.5.1 Causative wording is used in the hypothesis/research question, although there is no theory supporting causation.

6.5.2
A causal relationship is claimed, although the research design is not appropriate to determine causation (methods lack control of potential confounding or systematic bias).

6.5.3
A causal relationship is claimed although potential sources of bias and their potential impact on the findings were not discussed.

6.5.4
A potential causal relationship claimed in the discussion paragraph is not justified.

6.6
Effect size

6.6.1
The relevance of statistically significant results with small effect size is overstated.

6.6.2
The possible clinical relevance of statistically non-significant results is not addressed.

Description of quantitative and/or qualitative methods of analyses is reported

4.12
Handling of missing data is reported

4.13
Comparator is explained

Tables properly represent results
Tables reflect the actual results rather than cherry-picked results

Graphs properly represent results
Scaling is appropriate

(Statistical) uncertainty is reported
Confidence intervals are provided for the main results

6
Questionable messages and conclusions

The outcome measure does not adequately reflect the objectives/research questions of the study.
The objectives/research questions cannot be answered with the outcome measure that is studied.

The main results discussed in the discussion paragraph do not adequately address the original objectives/research questions as posed in the introduction.
The research questions and/or objectives stated in the introduction section are not, or only partly, answered by the main results.

6.1.7
The order of presenting the results in the discussion is inconsistent with the ordering of the objectives/research questions as posed in the introduction.
Not an actual QRP, but it does conflict with transparency in presenting the study's findings. If there is only one objective/research question, this item is not applicable (no structuring possible) and should be scored -8.

6.1.8
The conclusions do not adequately reflect the objectives of the study.
The objectives of the study are not met by the conclusions the study arrives at. Conclusions can be stated in the discussion paragraph and/or the conclusion paragraph. Either the study shifted perspective along the way without providing justification, or the write-up of the conclusions is flawed. Framing the conclusion as an extension of the discussion is not a QRP (undesirable, but beyond the scope of this indicator).

6.2.2
Contradicting evidence is poorly documented.
Only limited evidence opposing the main results is provided, and it is discussed only superficially; there is no thorough reflection on the findings in the light of contradicting evidence.

6.2.3
Evidence is used inappropriately to support the findings (i.e. the argument is not supported by the actual message of the cited evidence). Will be measured as: evidence seems to be used selectively to support the findings, given the title of the referenced evidence.

6.2.4
The main source of evidence to support the results is based on the same underlying data.
Most supporting evidence is grounded in the same data source as was used for the reviewed study (not necessarily self-citation), inducing circularity in the argumentation.

6.5.3
A causal relationship is claimed although potential sources of bias and their potential impact on the findings were not discussed.
No or inadequate discussion is included concerning the impact of potential sources of bias on the possible causation that was found in the results.

6.5.4
A potential causal relationship claimed in the discussion paragraph is not justified.
When a causal relation may not be assumed solely on the basis of the study's findings, no or inadequate supporting and contradicting evidence is used to discuss the possible causation that was found in the results.

6.6
Effect size

6.6.1
The relevance of statistically significant results with small effect size is overstated.
The importance of the findings is exaggerated: although (some) results are statistically significant, the clinical/practical relevance is minor because the effect size is small or causation is unlikely.

6.6.2
The possible clinical relevance of statistically non-significant results is not addressed.

Methodology
Methods included an explorative review of definitions in the literature, a consultation meeting with the project group, institution/department leaders of Dutch HSR institutions and project advisors (n=13), semi-structured interviews with representatives of 13 HSR institutions (n=19) and an expert consultation (n=5).

Setting
Thirteen HSR groups, departments, or institutions (hereafter referred to as "HSR institutions") in the Netherlands, including both academic and non-academic institutions, participated in this study. These institutions all agreed to participate in an effort to assure the overall quality of HSR publications in the Netherlands.

Literature review
First, a literature review was conducted to search for existing definitions of questionable research practices in the reporting of conclusions and messages, and for operationalisations of QRPs. Search terms included, in different orders and combinations: 'questionable research practices', 'spin', 'over interpretation', 'discordant conclusions', 'QRPs', 'outcome reporting bias', 'questionable conclusions' and 'responsible conclusions'. Documents were included if they described methods to measure questionable research practices in scientific publications, or provided definitions of the above key terms. Referenced documents that fit the criteria were also included in the review.
After identifying the main literature that suited our aim, we arrived at a preliminary definition of QRPs based on Boutron 2010, Ochodo 2013, and Horton 1995 1-3 .
An extensive list of possible types of QRPs in the reporting of messages and conclusions was developed, based on the EQUATOR checklists 4 and instruments from previous studies, for example instruments for identifying 'spin', for the reporting of qualitative research, and for other QRPs 3,5-7 . Spin in this context refers to "a way to distort science reporting without actually lying".

Consultation meeting
Second, we presented the preliminary QRP definition and the first draft of items referring to QRPs (see page 3) during a consultation meeting of participating HSR institutions on 6 June 2017. The meeting lasted three hours, during which the research project, the preliminary definition and the draft QRP items were discussed.
Representatives of the participating HSR institutions (n=7), project advisors (n=2) and project group members (n=4) attended the meeting. The attendees discussed their thoughts about the definition and its operationalisation. Detailed notes from this meeting were summarized and shared with the representatives of all participating institutes (including those who did not attend).
The central conclusion of the meeting was to focus on the 'measurability' of the QRPs. An important consideration in developing the instrument for the assessment of scientific publications is the possibility of measuring the QRPs. Therefore, the focus should be on QRPs that can be quantified; these should be distinguished from QRPs that, although possibly important, are not quantifiable.

Semi-structured interviews
Third, we conducted fourteen semi-structured interviews with nineteen leaders/representatives of the thirteen HSR institutions. These representatives had to have a clear overview of the process of reporting research at their institute. One of the institutions was represented by two separate departments; hence two representatives were interviewed separately. Three interviews were conducted with both the institute leader and a second representative. One interview included three representatives of an institution. The aim of the interviews was to discuss our draft of QRP items, identify additional measurable QRPs in the reporting of messages and conclusions in HSR, explore potential causes of QRPs in messages and conclusions, and discuss the institute leaders' experiences with these QRPs. A semi-structured interview guide was developed by the project team (see pages 4-5). During the interviews, we presented the interviewees with a draft of QRP items. The draft list was iteratively adjusted, i.e. after each interview we drafted a new version incorporating the findings of the previous interviews.
Interviewees were approached through e-mail to schedule an appointment. Two researchers conducted the interviews; thirteen took place at the participating institutions and one in a public space. During the first interview, both researchers were present to align their approach. The remaining interviews were divided equally between them. The interviews lasted one hour each. In accordance with ethical guidelines, the goal of the interview was explained at the start, and permission to audio-record the interview was obtained.
With the support of the recordings, a report of each interview was written and shared with the interviewees for validation. All interviewees confirmed the reports, after mostly minor edits. From the interview reports, we drew up a new draft of the list of QRP items (see pages 6-7). Within the research group, we paid specific attention to the correct wording of the QRPs.

Expert consultation
Fourth, ten leading international health services researchers were asked to provide feedback on the list of QRP items. These HSR experts were invited through e-mail, in which we explained the aim of the study and included the definition of QRPs and the list of QRP items. Five experts provided comments on the items; the other five did not respond after a reminder, or indicated not having time to review the QRP items. Feedback was summarized, and comments were used to adapt the QRP definition and the list of QRP items.

Interview protocol for the first consultation round, June/July 2017
Explanation of the interview
The aim of the ZonMw project is to arrive at recommendations for promoting responsible reporting of health services research (responsible conclusions and messages in health services research).

QRP list and comment form used for expert consultation
Experts provided comments in the comment boxes

Expert consultation
Definition: "To frame, from whatever motive, consciously or unconsciously, conclusions or messages as an answer to the research question that are not justified by the results" [Comments concerning definition]