Risk factors for pain and functional impairment in people with knee and hip osteoarthritis: a systematic review and meta-analysis

Objective To identify risk factors for pain and functional deterioration in people with knee and hip osteoarthritis (OA) to form the basis of a future ‘stratification tool’ for OA development or progression. Design Systematic review and meta-analysis. Methods An electronic search of the literature databases Medline, Embase, CINAHL and Web of Science (1990–February 2020) was conducted. Studies that identified risk factors for pain and functional deterioration in knee and hip OA were included. Where data and study heterogeneity permitted, meta-analyses presenting mean difference (MD) and ORs with corresponding 95% CIs were undertaken. Where this was not possible, a narrative analysis was undertaken. The Downs & Black tool was used to assess the methodological quality of selected studies before data extraction. Pooled analysis outcomes were assessed and reported using the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) approach. Results 82 studies (41 810 participants) were included. On meta-analysis, there was moderate quality evidence that knee OA pain was associated with factors including Kellgren and Lawrence grade ≥2 (MD: 2.04, 95% CI 1.48 to 2.81; p<0.01), increasing age (MD: 1.46, 95% CI 0.26 to 2.66; p=0.02) and whole-organ MRI scoring method (WORMS) knee effusion score ≥1 (OR: 1.35, 95% CI 0.99 to 1.83; p=0.05). On narrative analysis, knee OA pain was associated with factors including WORMS meniscal damage ≥1 (OR: 1.83). Predictors of joint pain in hip OA were large acetabular bone marrow lesions (BML; OR: 5.23), chronic widespread pain (OR: 5.02) and large hip BMLs (OR: 4.43). Conclusions Our study identified risk factors for clinical pain in OA by imaging measures that can assist in predicting and stratifying people with knee/hip OA. A ‘stratification tool’ combining verified risk factors that we have identified would allow selective stratification based on pain and structural outcomes in OA. PROSPERO registration number CRD42018117643.

not well understood
We aimed to identify risk factors for pain and functional deterioration in people with primary knee and hip OA to create a 'stratification tool' for OA development or progression.
We found 82 studies with 41,810 participants, which were included for analysis.
Knee OA pain was associated with MRI knee effusion score ≥1, meniscal damage ≥1, Kellgren and Lawrence grade ≥2 and increasing age.
Predictors for painful hip bone marrow lesion development were knee pain and hip pain.

INTRODUCTION
It has been reported that over 30.8 million US adults suffer from osteoarthritis (OA) (1). Between 1990 and 2010, the years lived with disability worldwide caused by OA increased from 10.5 million to 17.1 million, an increase of 62.9% (2). No disease-modifying treatments for OA are currently available, so management focuses on symptoms rather than on the underlying disease (3). The clinical symptoms of OA can be assessed using several questionnaires, the most common of which is the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) (4,5,6). Although pain is recognised as an important outcome measure in OA, it is not clear which assessment tools are optimal or how they relate to other risk factors.
OA has various subtypes and, since current therapies cannot prevent OA progression, early detection and stratification of those at risk may enable effective pre-symptomatic interventions (7,8). Several methods are used to define, diagnose and measure OA progression, including imaging techniques [e.g. plain radiography, Computed Tomography (CT) and Magnetic Resonance Imaging (MRI)]. Plain radiography provides high contrast and high resolution images for cortical and trabecular bone, but not for non-ossified structures (e.g. synovial fluid) (9). The most recognised radiographic measure classifying OA severity is Kellgren and Lawrence (KL) grading, which assesses osteophytes, joint space narrowing (JSN), sclerosis and bone deformity (10,11). However, it has been argued that MRI may be more suitable for imaging arthritic joints, providing a whole-organ image of the joint (12). The whole-organ MRI scoring method (WORMS) is used to assess OA-related damage on MRI, providing a detailed analysis of the joint.

Recently, OMERACT-OARSI (Outcome Measures in Rheumatology-Osteoarthritis Research Society International) have published a core domain set for clinical trials in hip and/or knee OA (13). Six domains were assessed as being mandatory in the assessment of OA, including pain, physical function, quality of life, patient's global assessment of the target joint, and adverse events including mortality and/or joint structure, depending on the intervention tested. However, there still remains a need to identify risk factors for pain and structural damage in OA so that potential interventions can be studied in a timely manner. In this study, we aimed to identify risk factors for pain, worsening function and structural damage that can predict knee/hip OA development and progression. Our results report a systematic review, with meta-analysis where enough studies were identified for valid comparisons. By identifying risk factors for OA pain and structural damage, tools for stratifying specific disease groups could be developed in the future.

METHODS
This systematic review has been reported in accordance with the PRISMA reporting guidelines. The review protocol was registered a priori through PROSPERO (Registration: CRD42018117643).

Study Identification
Studies were eligible for inclusion if they were a full-text article that satisfied all of the following: 1) 100 or more participants analysed in the study (to increase power for comparisons); 2) convincing definition of OA using American College of Rheumatology criteria; 3) abstract/title that must refer to pain and/or structure in relation to OA as a primary disease; 4) knee or hip osteoarthritis. Non-English studies, letters, conference articles and reviews were excluded.
The titles and abstracts were reviewed by one reviewer (SS). The full-text for each paper was assessed for eligibility by one reviewer (SS) and double-checked by a second (TS). Any disagreements were addressed through discussion and adjudicated by a third reviewer (NS or FH). All studies which satisfied the criteria were included in the review.

Quality assessment
To assess the risk of bias and the power of the methodology, the Downs & Black (D&B) tool was applied (14). This tool assessed the following aspects of each study: reporting quality, external validity, internal validity-bias, selection bias and power. The D&B tool was modified to apply to both interventional and observational studies, resulting in an 'observational Downs and Black tool' (18 items) and an 'interventional Downs and Black tool' (27 items) (Supplementary File 2). Critical appraisal was performed by one reviewer (SS) and verified by a second (KT). Any disagreements were dealt with by discussion and adjudicated through a third reviewer (TS). In previous literature, D&B score ranges were given corresponding quality levels: excellent (26-28); good (20-25); fair (15-19); and poor (<14) (14). The D&B tool was therefore used to exclude poor quality studies, defined as a score of 15/28 or lower in interventional studies and 10/19 or lower in observational studies.
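The banding and exclusion rules described above can be illustrated with a minimal sketch. The function names and the dictionary are hypothetical and are not part of the review's methods; the quoted ranges leave a score of exactly 14 unassigned, so the sketch assumes that any score below 15 is banded as poor.

```python
# Illustrative sketch only: names are hypothetical, thresholds are those quoted in the text.

def quality_band(score: int) -> str:
    """Band a Downs & Black total score using the published ranges quoted above."""
    if score >= 26:
        return "excellent"
    if score >= 20:
        return "good"
    if score >= 15:
        return "fair"
    # Assumption: the text gives "poor (<14)"; a score of 14 is treated as poor here.
    return "poor"


# Exclusion cut-offs stated in the text: 15/28 or lower (interventional),
# 10/19 or lower (observational).
EXCLUSION_CUTOFF = {"interventional": 15, "observational": 10}


def is_excluded(score: int, design: str) -> bool:
    """Return True when a study falls at or below the exclusion cut-off for its design."""
    return score <= EXCLUSION_CUTOFF[design]


if __name__ == "__main__":
    for score, design in [(27, "interventional"), (15, "interventional"), (11, "observational")]:
        print(score, design, quality_band(score), "excluded" if is_excluded(score, design) else "kept")
```

Note that, under the cut-offs as stated, an interventional study scoring exactly 15 falls in the published 'fair' band but is still excluded.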

Data extraction
Data were extracted including: subject demographic data, study design, pain and function outcome measures, imaging used, OA severity scores, change in pain and function outcome measures and change in OA severity scores. After all relevant data had been extracted, authors of these papers

Outcomes
The primary outcome was to determine the development of pain and functional impairment for those with KOA. The secondary outcome was to determine which factors are associated with structural changes in KOA.

Data analysis
All data were assessed for study heterogeneity through scrutiny of the data extraction tables. These identified that there was minimum study-based heterogeneity based on: population, study design and interventions-exposure variabilities for given outcomes. Where there was study heterogeneity, a narrative analysis was undertaken. In this instance, the odds ratios (ORs) of all predictor variables were tabulated with a range of ORs presented. Where the range did not pass through 1, this was interpreted as significant. Where there were sufficient data to pool and study homogeneity was evident, a pooled meta-analysis was deemed appropriate. When I² was 50% or greater, a random-effects model meta-analysis was undertaken. When I² was less than 50%, a fixed-effects model approach was adopted. Continuous outcomes were assessed using standardised mean difference (SMD) scores of measures for developing severe OA, whereas dichotomous variables were assessed through OR data. All data were presented with 95% confidence intervals (CI) and forest-plots.
Due to the presentation of the data, there were minimal data to permit meta-analyses. Where there were insufficient data to pool the analysis, a narrative analysis was undertaken to assess risk factors for the development of increased pain and functional impairment.
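The narrative rule, in which a predictor's ORs are tabulated as a range and the range is read as significant when it does not pass through 1, can be sketched as follows. The predictor names and OR values are invented for illustration and are not data from the included studies; the function names are hypothetical.

```python
# Illustrative sketch only: predictor names and OR values are made up.

def or_range(odds_ratios):
    """Return the (lowest, highest) odds ratio reported for one predictor."""
    return min(odds_ratios), max(odds_ratios)


def range_excludes_one(odds_ratios):
    """True when the whole range of ORs lies above 1 or below 1."""
    low, high = or_range(odds_ratios)
    return low > 1.0 or high < 1.0


if __name__ == "__main__":
    reported = {
        "hypothetical predictor A": [1.4, 2.1, 1.8],  # all ORs above 1
        "hypothetical predictor B": [0.8, 1.3],       # range passes through 1
    }
    for predictor, ors in reported.items():
        low, high = or_range(ors)
        verdict = "interpreted as significant" if range_excludes_one(ors) else "not interpreted as significant"
        print(f"{predictor}: OR range {low} to {high}, {verdict}")
```

This rule summarises point estimates only; it does not use the confidence interval of any individual study.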

Search Strategy
The results of the search strategy are presented in Figure 1. In total, 11,010 citations were identified.
Of these, 141 papers were deemed potentially eligible and screened at full-text level. Of these, 82 met the selected criteria and were included.
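The screening counts above imply the number of records excluded at each stage. The short sketch below works through that arithmetic, assuming (the text does not state this) that every citation not taken to full-text review was excluded at title and abstract screening; the variable names are illustrative.

```python
# Counts reported in the text.
identified = 11_010          # citations identified by the search
full_text_screened = 141     # papers screened at full-text level
included = 82                # studies meeting the selection criteria

# Assumption: all remaining citations were excluded at title/abstract screening.
excluded_at_title_abstract = identified - full_text_screened   # 10,869
excluded_at_full_text = full_text_screened - included          # 59
inclusion_rate = included / identified                         # proportion of citations included

print(f"Excluded at title/abstract: {excluded_at_title_abstract}")
print(f"Excluded at full text: {excluded_at_full_text}")
print(f"Included: {included} ({inclusion_rate:.2%} of citations)")
```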

Characteristics of Included Studies
A summary of the included studies is presented in Table 1. The included studies consisted of 27 observational studies, 51 RCTs and four case-control studies.

Methodological Quality
The methodological quality of the evidence was moderate (Supplementary Table 1; Supplementary Table 2). Based on the results of the Downs and Black Observational Studies Checklist, recurrent strengths of the evidence were clear description of the methods adopted (35 studies; 95%), appropriate acknowledgment of principal confounders in each group and their distribution presented (30 studies; 81%) and variability in data presented for the main outcomes (37 studies; 100%). Furthermore, the main outcome measures were deemed reliable and valid in all studies (37

Knee OA systematic review and meta-analysis
Findings from the narrative analysis found the following were predictors for worsening joint pain:  Figure 2. Results show that female gender, increasing age and a knee effusion score ≥1 at baseline were all significantly associated with an increased probability of knee OA (p<0.05). Interestingly, in this meta-analysis, BMI did not reach statistical significance. The analysis conducted revealed six variables significantly associated  N=493) were all associated with OA knee development based on lower quality evidence. The variables of gender (when combining male and female), BML score, ethnicity, BMI and synovitis were not shown to be significantly associated with KOA development (Table 2).

Hip OA systematic review
We found that baseline knee pain score (MD: -1.42; 95% CI: -1.61 to -1.23; p<0.01; N=198) and baseline hip pain score (MD: -0.72; 95% CI: -0.97 to -0.47; p<0.01; N=198) were both significantly associated with the development of hip BMLs and pain. However, our findings were based on low quality evidence. There was no association between the development of hip BML and BMI or age.  female gender were associated with worsening function in people with KOA. In contrast, our meta-analysis of two studies which could be analysed showed that age, radiological features (KL score of 2 or more) and osteophyte presence, knee effusion, poor baseline function, cartilage loss graded 2 or more zones and meniscal tears were associated with KOA development and/or progression.
Our meta-analysis identified risk factors that are appreciated only when results were pooled together. These were namely: WORMS-defined knee effusion score ≥1, cartilage loss graded 2 or more, meniscal damage graded 1 or more and baseline function score. To our knowledge, this is the largest and most up to date systematic review of its kind so far, reviewing 82  conclusions that without standardisation it is difficult to pool data from different trials (94).
After plain radiography, MRI was the most used modality with WORMS as the commonest scoring  from WORMS, having no MOAKS studies included in our final selection was surprising. This could be due to the eligibility criteria being too restrictive. A future systematic review and meta-analysis focusing on the imaging aspect of evaluating OA will be important.
In HOA, the evaluation of BML size and location is essential in predicting pain progression and these can be assessed effectively using MRI. We recommend that all MRI studies for HOA evaluate BML size and location. Due to the few MRI studies included, further work is needed to determine whether MOAKS or WORMS is the most appropriate scoring system to recommend in KOA studies.
Gait analysis is considered a risk factor for pain/function and was therefore included as a target outcome measure. However, few studies included gait analysis measures, which could not be included in the analysis, perhaps due to the minimum sample size (n=100) being too restrictive.
There were several limitations within our study. Despite identifying novel risk factors for exhibiting KOA, a small dataset was pooled together for the meta-analysis (2 studies) compared to Silverwood
Standardising data collection and reporting is important in conducting meta-analyses. We believe the following should be undertaken to improve data pooling in future work: ensuring group comparisons in studies are selected from the same population (people with confirmed OA) to improve internal validity, observational studies should conduct a power analysis to determine sample sizes and all studies should include absolute frequency of events data rather than summary odds ratios. Such considerations will improve future meta-analyses to identify OA risk factors.

Checklist items

Observational Downs and Black tool (18 items)
1. Is the hypothesis/aim/objective of the study clearly described?
2. Are the main outcomes to be measured clearly described in the Introduction or Methods section?
3. Are the characteristics of the patients included in the study clearly described?
4. Are the distributions of principal confounders in each group of subjects to be compared clearly described?
5. Are the main findings of the study clearly described?
6. Does the study provide estimates of the random variability in the data for the main outcomes?
7. Have the characteristics of patients lost to follow-up been described?
8. Have actual probability values been reported (e.g. 0.035 rather than <0.05) for the main outcomes except where the probability value is less than 0.001?
9. Were the subjects asked to participate in the study representative of the entire population from which they were recruited?
10. Were those subjects who were prepared to participate representative of the entire population from which they were recruited?
11. If any of the results of the study were based on "data dredging", was this made clear?
12. Were the statistical tests used to assess the main outcomes appropriate?
13. Were the main outcome measures used accurate (valid and reliable)?
14. Were study participants in different groups (trials and cohort studies) or were the cases and controls (case-control studies) recruited over the same period of time?
15. Were the participants in different groups (trials and cohort studies) or were the cases and controls (case-control studies) recruited from the same population?
16. Was there adequate adjustment for confounding in the analyses from which the main findings were drawn?
17. Were losses of patients to follow-up taken into account?
18. Did the study mention having conducted a power analysis to determine the sample size needed to detect a significant difference in effect size for one or more outcome measures?

Interventional Downs and Black tool (27 items)
1. Is the hypothesis/aim/objective of the study clearly described?
2. Are the main outcomes to be measured clearly described in the Introduction or Methods section?
3. Are the characteristics of the patients included in the study clearly described?
4. Are the interventions of interest clearly described?
5. Are the distributions of principal confounders in each group of subjects to be compared clearly described?
6. Are the main findings of the study clearly described?
7. Does the study provide estimates of the random variability in the data for the main outcomes?
8. Have all important adverse events that may be a consequence of the intervention been reported?
9. Have the characteristics of patients lost to follow-up been described?
10. Have actual probability values been reported (e.g. 0.035 rather than <0.05) for the main outcomes except where the probability value is less than 0.001?
11. Were the subjects asked to participate in the study representative of the entire population from which they were recruited?
12. Were those subjects who were prepared to participate representative of the entire population from which they were recruited?
13. Were the staff, places, and facilities where the patients were treated, representative of the treatment the majority of patients receive?
14. Was an attempt made to blind study subjects to the intervention they have received?
15. Was an attempt made to blind those measuring the main outcomes of the intervention?
16. If any of the results of the study were based on "data dredging", was this made clear?
17. In trials and cohort studies, do the analyses adjust for different lengths of follow-up of patients, or in case-control studies, is the time period between the intervention and outcome the same for cases and controls?
18. Were the statistical tests used to assess the main outcomes appropriate?
19. Was compliance with the intervention/s reliable?
20. Were the main outcome measures used accurate (valid and reliable)?
21. Were the patients in different intervention groups (trials and cohort studies) or were the cases and controls (case-control studies) recruited from the same population?
22. Were study subjects in different intervention groups (trials and cohort studies) or were the cases and controls (case-control studies) recruited over the same period of time?
23. Were study subjects randomized to intervention groups?
24. Was the randomized intervention assignment concealed from both patients and health care staff until recruitment was complete and irrevocable?
25. Was there adequate adjustment for confounding in the analyses from which the main findings were drawn?
26. Were losses of patients to follow-up taken into account?
27. Was there sufficient power to detect treatment effect at significance level of 0.05?


Study Identification
Studies were eligible for inclusion if they were a full-text article that satisfied all of the following: 1) 100 or more participants analysed in the study (to increase power for comparisons); 2) convincing definition of OA using American College of Rheumatology criteria (14), based on symptoms of sustained pain and stiffness in the affected joint, radiographic changes Non-English studies, letters, conference articles and reviews were excluded.

Quality Assessment
To assess the risk of bias and the power of the methodology, the Downs & Black (D&B) tool was applied (15). This tool assessed the following aspects of each study: reporting quality, external validity, internal validity-bias, selection bias and power. The modified D&B tool was used.
Accordingly, the 27-item randomised controlled trial (RCT) version was used for RCTs whilst the 18-item non-RCT version was used for non-RCT designs (Supplementary File 2). Both the 18-item and 27-item tools have been demonstrated to be valid and reliable tools to assess RCT and non-RCT papers (14). Critical appraisal was performed by one reviewer (SS) and verified by a second (KT). Any disagreements were dealt with by discussion and adjudicated through a third reviewer (TS). In previous literature, D&B score ranges were given corresponding quality levels: excellent (26-28); good (20-25); fair (15-19); and poor (<14) (14). Item 4 on the non-RCT and Item 5 from the RCT tool are scored

Data Extraction
Data were extracted including: subject demographic data, study design, pain and function outcome measures, imaging used, OA severity scores, change in pain and function outcomes and change in OA severity scores. After all relevant data had been extracted, authors of these papers were approached to try to obtain individual patient data (IPD) related to baseline and change in pain, function and structural scores for each study. No data were received from authors to inform this analysis.

Outcomes
The primary outcome was to determine the development of pain and functional impairment for those with knee and hip OA. The secondary outcome was to determine which factors are associated with structural changes in knee and hip OA.

Data Analysis
All data were assessed for study heterogeneity through scrutiny of the data extraction tables. These identified that there was minimum study-based heterogeneity based on: population, study design and interventions-exposure variabilities for given outcomes. Where there was study heterogeneity, a narrative analysis was undertaken. In this instance, the odds ratios (ORs) of all predictor variables were tabulated with a range of ORs presented. Where there were sufficient data to pool (two or more studies with data available to analyse) and study homogeneity was evident, a pooled meta-analysis was deemed appropriate. As interpreted by the Cochrane Collaboration (16), when I² was 50% or greater, representing high statistical heterogeneity, a random-effects model meta-analysis was undertaken.
When I² was less than this figure, a fixed-effects model approach was adopted. Continuous whereas dichotomous variables were assessed through OR data. All data were presented with 95% confidence intervals (CI) and forest-plots.
Due to the presentation of the data, there were minimal data to permit meta-analyses. Where there was insufficient data to pool the analysis (data only available from one study), a narrative analysis was undertaken to assess risk factors for the development of increased pain and functional impairment.
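The model-selection rule described above (random effects when I² is 50% or greater, fixed effects otherwise) can be sketched for odds ratios as follows. This is a minimal illustration under stated assumptions, not the software or estimator used in the review: it assumes inverse-variance pooling on the log-OR scale and a DerSimonian-Laird estimate of between-study variance, neither of which is specified in the text, and the function name is hypothetical.

```python
import math


def pool_log_odds_ratios(log_ors, std_errs, i2_threshold=50.0):
    """Pool study log-ORs; choose fixed or random effects from I-squared.

    Assumptions (not stated in the source): inverse-variance weights and a
    DerSimonian-Laird estimate of the between-study variance (tau^2).
    Requires at least two studies.
    """
    weights = [1.0 / se ** 2 for se in std_errs]
    total_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, log_ors)) / total_w

    # Cochran's Q and I-squared (percentage of variability due to heterogeneity).
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, log_ors))
    df = len(log_ors) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    model = "fixed"
    if i2 >= i2_threshold:
        # DerSimonian-Laird tau^2, then re-weight each study by 1 / (se^2 + tau^2).
        c = total_w - sum(w ** 2 for w in weights) / total_w
        tau2 = max(0.0, (q - df) / c)
        weights = [1.0 / (se ** 2 + tau2) for se in std_errs]
        total_w = sum(weights)
        model = "random"

    pooled = sum(w * y for w, y in zip(weights, log_ors)) / total_w
    se_pooled = math.sqrt(1.0 / total_w)
    return {
        "model": model,
        "i2": i2,
        "or": math.exp(pooled),
        "ci95": (math.exp(pooled - 1.96 * se_pooled), math.exp(pooled + 1.96 * se_pooled)),
    }


if __name__ == "__main__":
    # Made-up inputs for two hypothetical studies (log-OR and its standard error).
    print(pool_log_odds_ratios([0.0, 1.0], [0.1, 0.1]))
```

With only two studies, as in the knee OA meta-analysis reported here, I² is estimated imprecisely, so the choice between models rests on limited information.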


Characteristics of Included Studies
A summary of the included studies is presented as Table 1. This consisted of 31 non-RCTs (27 observational cohort studies/four case-control studies) and 51 RCTs.
In total, 45,767 knees were included in the analysis. This consisted of 13,870 males and 23,497 females; four studies did not report the gender of their cohorts (17,18,19,20). Thirty-six studies

Meta-Analysis
Two studies were identified where data could be evaluated for OA risk factors by meta-analysis (41, 67). Six variables were significantly associated with the development of knee OA. As illustrated in Table 2 and were significantly associated with the development of hip BMLs and pain.

Meta-Analysis
There were insufficient data to permit meta-analysis for the hip OA dataset. Further work may impact our confidence in the estimated effect, for both studies recruiting participants with hip and knee OA. Secondly, the eligibility criteria may have been too restrictive, resulting in limited papers including gait analysis or MOAKS. Wet biomarkers were not included in our analyses. Finally, the inability to pool data was partly attributed to variability in methods to report data. Standardising data collection and reporting is important in conducting meta-analyses.

DISCUSSION
To conclude, our work takes steps towards building a stratification tool for risk factors for knee OA pain and structural damage development. We also highlight the need for collection of core datasets based on defined domains, as recently highlighted by OMERACT-OARSI (13). Collection of future datasets based on standardised core outcomes will assist in more robust identification of risk factors for large joint OA.
Figure 2b: Forest-plot to present the association between age and presentation of knee OA.
Figure 2c: Forest-plot to present the association between knee effusion score greater or equal to 1 and presentation of knee OA.
Figure 2d: Forest-plot to present the association between BMI and presentation of knee OA.

1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  Eligibility criteria 6 Specify study characteristics (e.g., PICOS, length of follow-up) and report characteristics (e.g., years considered, language, publication status) used as criteria for eligibility, giving rationale.

Methods, Study Identification
Information sources 7 Describe all information sources (e.g., databases with dates of coverage, contact with study authors to identify additional studies) in the search and date last searched.

Methods, Search Strategy
Search 8 Present full electronic search strategy for at least one database, including any limits used, such that it could be repeated.

PRISMA 2009 Checklist
Data collection process 10 Describe method of data extraction from reports (e.g., piloted forms, independently, in duplicate) and any processes for obtaining and confirming data from investigators.

Methods, Data Extraction
Data items 11 List and define all variables for which data were sought (e.g., PICOS, funding sources) and any assumptions and simplifications made.

Methods, Data Extraction & Methods Outcomes
Risk of bias in individual studies 12 Describe methods used for assessing risk of bias of individual studies (including specification of whether this was done at the study or outcome level), and how this information is to be used in any data synthesis.

Methods, Quality Assessment
Summary measures 13 State the principal summary measures (e.g., risk ratio, difference in means).

Section/topic # Checklist item Reported on page #
Risk of bias across studies 15 Specify any assessment of risk of bias that may affect the cumulative evidence (e.g., publication bias, selective reporting within studies).

Methods, Quality Assessment
Additional analyses 16 Describe methods of additional analyses (e.g., sensitivity or subgroup analyses, meta-regression), if done, indicating which were pre-specified.
Methods, Data Analysis, Paragraph 2

RESULTS
Study selection 17 Give numbers of studies screened, assessed for eligibility, and included in the review, with reasons for exclusions at each stage, ideally with a flow diagram.

Risk of bias within studies 19 Present data on risk of bias of each study and, if available, any outcome level assessment (see item 12).

Results, Methodological Quality Assessment
Results of individual studies 20 For all outcomes considered (benefits or harms), present, for each study: (a) simple summary data for each intervention group (b) effect estimates and confidence intervals, ideally with a forest plot.

Results Table 2
Risk of bias across studies 22 Present results of any assessment of risk of bias across studies (see Item 15).

Supplementary File 2 and 3; Results, Methodological Quality
Additional analysis 23 Give results of additional analyses, if done (e.g., sensitivity or subgroup analyses, meta-regression [see Item 16]).

DISCUSSION
Summary of evidence 24 Summarize the main findings including the strength of evidence for each main outcome; consider their relevance to key groups (e.g., healthcare providers, users, and policy makers).

Recently, OMERACT-OARSI (Outcome Measures in Rheumatology-Osteoarthritis Research Society International) have published a core domain set for clinical trials in hip and/or knee OA (13). Six domains were assessed as mandatory in the assessment of OA, including pain, physical function,

METHODS
This systematic review has been reported in accordance with the PRISMA reporting guidelines. The review protocol was registered a priori through PROSPERO (Registration: CRD42018117643).

Study Identification
Studies were eligible for inclusion if they were a full-text article that satisfied all of the following: 1) 100 or more participants analysed in the study (to increase power for comparisons); 2) convincing definition of OA using American College of Rheumatology criteria (14), based on symptoms of sustained pain and stiffness in the affected joint, radiographic changes Non-English studies, letters, conference articles and reviews were excluded.
The titles and abstracts were reviewed by one reviewer (SS). The full text of each paper was assessed for eligibility by one reviewer (SS) and double-checked by a second (TS). Any disagreements were addressed through discussion and adjudicated by a third reviewer (NS or FH). All studies that satisfied the criteria were included in the review.

Quality Assessment
To assess the risk of bias and the power of the methodology, the Downs & Black (D&B) tool was applied (15). This tool assessed the following aspects of each study: reporting quality, external validity, internal validity (bias), selection bias and power. The modified D&B tool was used: the 27-item randomised controlled trial (RCT) version for RCTs and the 18-item non-RCT version for non-RCT designs (Supplementary File 2). Both the 18-item and 27-item tools have been demonstrated to be valid and reliable for assessing RCT and non-RCT papers (14). Critical appraisal was performed by one reviewer (SS) and verified by a second (KT). Any disagreements were dealt with by discussion and adjudicated by a third reviewer (TS). Previous literature has assigned quality levels to D&B score ranges: excellent (scored 26-28); good (scored 20-25); fair (scored 15-19); and poor (scored <14) (14). Item 4 on the non-RCT tool and Item 5 on the RCT tool are each scored out of two points, hence the total scores equate to 19 and 28 points, respectively.

Data Extraction
Data were extracted including: subject demographic data, study design, pain and function outcome measures, imaging used, OA severity scores, change in pain and function outcomes and change in OA severity scores. After all relevant data had been extracted, the authors of these papers were approached to obtain individual patient data (IPD) on baseline and change in pain, function and structural scores for each study. No data were received from authors to inform this analysis.

Outcomes
The primary outcome was to determine the development of pain and functional impairment for those with knee and hip OA. The secondary outcome was to determine which factors are associated with structural changes in knee and hip OA.

Data Analysis
All data were assessed for study heterogeneity through scrutiny of the data extraction tables. This identified minimal study-level heterogeneity in population, study design and intervention or exposure for given outcomes. Where there was study heterogeneity, a narrative analysis was undertaken. In this instance, the odds ratios (ORs) of all predictor variables were tabulated and the range of ORs presented. Where there were sufficient data to pool (two or more studies with data available to analyse) and study homogeneity was evident, a pooled meta-analysis was deemed appropriate. Following the Cochrane Collaboration's interpretation (16), when I² was 50% or greater, representing high statistical heterogeneity, a random-effects meta-analysis was undertaken. When I² was below this threshold, a fixed-effects model was adopted. Continuous
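The model-selection rule described above can be illustrated with a minimal sketch. It is not the analysis code used in this review: the function name, the inverse-variance weighting on the log-OR scale, the recovery of standard errors from 95% CIs, and the DerSimonian-Laird estimator for the between-study variance are all assumptions made for illustration, since the text specifies only the I² threshold of 50%.

```python
import math

Z_95 = 1.959963984540054  # normal quantile for a two-sided 95% CI


def pool_odds_ratios(odds_ratios, ci_lowers, ci_uppers, i2_threshold=50.0):
    """Pool study ORs; use random effects when I^2 >= i2_threshold, else fixed.

    Hypothetical helper: assumes each study reports an OR with a 95% CI.
    """
    # Work on the log scale, where the CI is symmetric around the estimate.
    y = [math.log(o) for o in odds_ratios]
    se = [(math.log(u) - math.log(l)) / (2 * Z_95)
          for l, u in zip(ci_lowers, ci_uppers)]
    w = [1.0 / s ** 2 for s in se]  # inverse-variance (fixed-effects) weights

    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # Cochran's Q and the I^2 statistic (percentage of variation beyond chance).
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    tau2 = 0.0
    model = "fixed"
    if i2 >= i2_threshold:
        # Assumed estimator: DerSimonian-Laird between-study variance.
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - df) / c)
        w = [1.0 / (s ** 2 + tau2) for s in se]
        model = "random"

    pooled = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    se_pooled = math.sqrt(1.0 / sum(w))
    return {
        "model": model,
        "i2": i2,
        "tau2": tau2,
        "pooled_or": math.exp(pooled),
        "ci": (math.exp(pooled - Z_95 * se_pooled),
               math.exp(pooled + Z_95 * se_pooled)),
    }
```

Calling `pool_odds_ratios([1.0, 4.0], [0.8, 3.2], [1.25, 5.0])` with these invented inputs would select the random-effects branch, because the two studies disagree far more than their CIs allow by chance. Two studies reporting the same OR would stay on the fixed-effects branch.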

Search Strategy
The results of the search strategy are presented in Figure 1. In total, 11,010 citations were identified. Of these, 141 papers were deemed potentially eligible and screened at full-text level. Of these, 82 met the selection criteria and were included.

Characteristics of Included Studies
A summary of the included studies is presented as females; four studies did not report the gender of their cohorts (17)(18)(19)(20). Thirty-six studies were undertaken in the USA; 30 were undertaken in Europe; nine were conducted in Australasia and seven in Asia. Mean age of the cohorts was 61.7 years (standard deviation (SD): 7.56); 36 studies did not report age (17,21,. Mean follow-up period was 35.4 months (SD: 33.6). The most common measures of pain were WOMAC pain (n=55; 50%) and Visual Analogue Scale (VAS) Pain (n=21; 19%).

Methodological Quality Assessment
The baseline hip pain score (MD: -0.7; 95% CI: -1.0 to -0.5) were significantly associated with the development of hip BMLs and pain.

Meta-Analysis
There were insufficient data to permit meta-analysis for the hip OA dataset. Further work may impact our confidence in the estimated effect, for both studies recruiting participants with hip and knee OA. Secondly, the eligibility criteria may have been too restrictive, resulting in limited papers including gait analysis or MOAKS. Wet biomarkers were not included in our analyses. Finally, the inability to pool data was partly attributed to variability in methods to report data. Standardising data collection and reporting is important in conducting meta-analyses.
We believe the following should be undertaken to improve data pooling in future work: group comparisons in studies should be drawn from the same population (people with confirmed OA) to improve internal validity; observational studies should conduct a power analysis to determine sample sizes; and all studies should report absolute frequency of events data rather than summary odds ratios. Such considerations will improve future meta-analyses to identify OA risk factors.

DECLARATIONS