Objectives To assess the benefits and harms of exercise in patients with depression.
Design Systematic review
Data sources Bibliographical databases were searched until 20 June 2017.
Eligibility criteria and outcomes Eligible trials were randomised clinical trials assessing the effect of exercise in participants diagnosed with depression. Primary outcomes were depression severity, lack of remission and serious adverse events (eg, suicide) assessed at the end of the intervention. Secondary outcomes were quality of life and adverse events such as injuries, as well as assessment of depression severity and lack of remission during follow-up after the intervention.
Results Thirty-five trials enrolling 2498 participants were included. The effect of exercise versus control on depression severity was −0.66 standardised mean difference (SMD) (95% CI −0.86 to −0.46; p<0.001; grading of recommendations assessment, development and evaluation (GRADE): very low quality). Restricting this analysis to the four trials that seemed less affected by bias, the effect was no longer significant at −0.11 SMD (−0.41 to 0.18; p=0.45; GRADE: low quality). Exercise decreased the relative risk of no remission to 0.78 (0.68 to 0.90; p<0.001; GRADE: very low quality). Restricting this analysis to the two trials that seemed less affected by bias, the effect was no longer significant at 0.95 (0.74 to 1.23; p=0.78). Trial sequential analysis excluded random error when all trials were analysed, but not when focusing on trials less affected by bias. Subgroup analyses found that trial size and intervention duration were inversely associated with effect size for both depression severity and lack of remission. There was no significant effect of exercise on secondary outcomes.
Conclusions Trials with less risk of bias suggested no antidepressant effects of exercise and there were no significant effects of exercise on quality of life, depression severity or lack of remission during follow-up. Data for serious adverse events and adverse events were scarce, not allowing conclusions for these outcomes.
Systematic review registration The protocol was published in the journal Systematic Reviews: 2015; 4:40.
- Systematic Review
- Evidence Based Medicine
- Randomised Clinical Trials
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Strengths and limitations of this study
The protocol for this review has previously been published.
Using meta-regression analysis, trial sequential analysis and the grading of recommendations assessment, development and evaluation system, the conclusions from this review are based on a firm and transparent platform.
Based on an extensive literature search, this review included 35 trials allocating almost 2500 participants diagnosed with depression to exercise or control interventions that could be analysed.
The effect estimates are largely based on trials at high risk of bias.
Effect estimates from included trials had considerable heterogeneity.
Depression is a common disorder affecting up to 17% of the population during their lifetime.1 2 Based on data from WHO, depression is ranked as the second largest healthcare problem globally, in terms of years lived with disability.3 Depending on its severity, depression is often treated using psychotherapy, antidepressants or a combination of both. However, the clinical benefits of antidepressants4–6 and psychotherapy7–9 have been challenged. Both treatments are costly in terms of time and money and may also have adverse effects. Compliance with antidepressant treatment is poor; the dropout rate in clinical trials is reported to be between 12% and 40% within the initial 6–8 weeks of treatment.4 10
The weakness of evidence for the beneficial effect of current interventions, along with problems related to low compliance and harms, has resulted in an interest in using alternative interventions. The use of exercise as an intervention has attracted considerable attention, and various forms of exercise varying in intensity have been assessed in a number of randomised clinical trials to test their effectiveness as a treatment for patients with depression. In 2011, we published a meta-analysis of randomised clinical trials examining the effect of exercise on depressive symptoms in patients with clinical depression.11 The results suggested that referring patients with clinical depression to an exercise programme was associated with a small-to-moderate effect on depressive symptoms. However, when the analysis was restricted to three trials at low risk of bias, the effect estimate was non-significant. Since 2011, other reviews have been published on the effect of exercise on depressive symptoms,12 in older people,13 and in patients with chronic illnesses.14 However, none of these reviews addressed the specific population of adults diagnosed with major depression according to valid diagnostic criteria, such as the International Classification of Diseases15 or the Diagnostic and Statistical Manual of Mental Disorders.16 The reviews contained a number of trials that included volunteers who were defined as being depressed on the basis of psychometric testing (eg, Beck Depression Inventory17), as opposed to individuals with a clinical diagnosis of major depression. Furthermore, several randomised clinical trials investigating the effect of exercise in clinically depressed individuals have been published since our 2011 review.11
The objectives of the present systematic review are to investigate the beneficial and harmful effects of exercise, in terms of severity of depression, lack of remission, quality of life and suicide versus controls with or without co-interventions in adults with a clinical diagnosis of major depression. The current systematic review differs from our previous review in a number of aspects.11 We only considered trials including participants diagnosed with depression according to a validated diagnostic system. We also included trials including participants with somatic comorbidity, for example, cancer or diabetes. The harmful effects of exercise interventions are also addressed, the intervention effects being assessed according to the grading of recommendations assessment, development and evaluation (GRADE) framework, and bibliographical searches have been extended to include a Chinese and a South American database until 2016.
The protocol for this review has previously been published.18
The following bibliographical databases were searched: CENTRAL, MEDLINE, EMBASE, Science Citation Index (Web of Science), LILACS and Wanfang using medical subject headings (MeSH or similar) when possible or text word terms: depression, depressive disorder and exercise, aerobic, non-aerobic, physical activity, physical fitness, walking, jogging, running, bicycling, swimming, strength or resistance (see online supplementary material S1 for an example of a bibliographical search). The main search was conducted in August 2015, and the latest search was conducted on 20 June 2017.
One investigator (JK) examined titles and abstracts to remove obviously irrelevant reports. Two investigators (JK and HS) examined full-text reports and abstracts to determine compliance with the inclusion criteria. A trial was considered eligible if it was a randomised clinical trial including participants diagnosed as having major depression according to a valid and recognised diagnostic system (ie, Research Diagnostic Criteria,19 International Classification of Diseases (ICD)15 or Diagnostic and Statistical Manual of Mental Disorders (DSM)16) and included participants aged >17 years. Abstracts and full-text reports were included.
Trials were excluded if they measured depression immediately after a single bout of exercise, compared one form of exercise versus another, or compared different exercise intensities without including a control group. The trials had to allocate participants to an exercise intervention versus a control group (ie, exercise vs a control group receiving no intervention or treatment as usual or an attention control using light exercise) or using exercise as an add-on treatment (ie, exercise plus usual treatment in the experimental group vs usual treatment alone in the control group). Exercise intervention was defined as a systematic physical intervention with the intention to increase muscle strength and/or cardiovascular fitness, for example, running, swimming or weight lifting. In case of attention control, it should specifically be mentioned by the authors of the trial report that the intervention was intended as a control intervention.
The primary outcomes were: 1) depressive symptoms measured on a continuous scale assessed at the end of the intervention; 2) lack of remission, that is, a binary outcome of the proportion of participants in each intervention group of the trial who did not obtain remission at the end of the intervention according to the authors’ own definition and 3) serious adverse events defined according to International Council for Harmonisation, Good Clinical Practice (ICH-GCP) as any untoward medical occurrence that was life threatening, resulted in death or persistent or significant disability (ICH-GCP 1997).20 Serious adverse events accordingly include suicide attempts as well as suicides. The secondary outcomes were quality of life, non-serious adverse events (eg, muscle injuries) as well as depressive symptoms and lack of remission assessed after the intervention.
Two authors (JK, HS) independently extracted data using a prepiloted structured form. Any discrepancies in the data extraction or inclusion/exclusion of trials were resolved by referring to the original papers. CG or MN assisted as adjudicator in cases of disagreement. Data extraction included, in addition to outcomes, information regarding country of origin, number of randomised participants, number of participants included in efficacy analysis, mean age of participants, diagnostic system, baseline assessment of depression severity, type of intervention, frequency of intervention and duration of intervention. Continuous outcomes were preferred in the following order: postintervention scores with corresponding SD, mean change from baseline with SD, mean difference between groups postintervention. Reported outcomes were preferred to figures. JK and CH independently performed the assessment of bias domains. The authors JK, CG and MN have previously published trial reports assessing the effect of exercise in participants with depression,21 22 and to reduce the risk of academic bias two additional authors were included in the current systematic review (CH, HS).
Risk of bias assessment
The risk of bias of each trial was assessed according to the Cochrane Handbook for Systematic Reviews of Interventions23 in the following domains: allocation sequence generation, allocation concealment, blinding of participants and personnel, blinding of outcome assessors, incomplete outcome data, selective outcome reporting, for-profit bias and other bias. Trials assessed as having ‘low risk of bias’ in all of the above specified domains were considered ‘trials at low risk of bias’. Trials assessed as having ‘uncertain risk of bias’ or ‘high risk of bias’ in one or more of the above specified domains were considered trials at ‘high risk of bias’. In line with our previous systematic review11 and the latest Cochrane review on exercise for depression,24 trials at low risk of bias in the allocation concealment domain, blinded outcome assessment domain and the incomplete outcome data domain were characterised as ‘trials potentially having less risk of bias than other trials at high risk of bias’. Trials assessing the effect of behavioural interventions are rarely able to mask the allocation, and participants and healthcare providers are therefore not blinded. Therefore, we also report the number of trials at low risk of bias in the remaining domains.
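The classification rule described above can be sketched in a few lines of Python. This is an illustration only: the domain keys and the function name `classify_trial` are hypothetical labels chosen for the example, not identifiers used by the review authors.

```python
from typing import Dict

# Hypothetical keys for the eight bias domains named in the text.
DOMAINS = [
    "sequence_generation",
    "allocation_concealment",
    "blinding_participants_personnel",
    "blinding_outcome_assessors",
    "incomplete_outcome_data",
    "selective_outcome_reporting",
    "for_profit_bias",
    "other_bias",
]

# The three domains used to flag 'potentially less risk of bias'.
KEY_DOMAINS = [
    "allocation_concealment",
    "blinding_outcome_assessors",
    "incomplete_outcome_data",
]


def classify_trial(ratings: Dict[str, str]) -> str:
    """Classify a trial from per-domain ratings ('low', 'uncertain' or 'high')."""
    if all(ratings[d] == "low" for d in DOMAINS):
        return "low risk of bias"
    if all(ratings[d] == "low" for d in KEY_DOMAINS):
        return "high risk of bias (potentially less risk)"
    return "high risk of bias"


# Made-up example: everything rated low except blinding of participants.
example = {d: "low" for d in DOMAINS}
example["blinding_participants_personnel"] = "high"
example_class = classify_trial(example)
```

Under this rule, an exercise trial that cannot blind its participants can never reach the overall 'low risk of bias' category, which is why the intermediate category matters for the subgroup analyses.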
Data synthesis and analysis
In order to be able to include all of the trials in our meta-analysis, the standardised mean difference (SMD) was estimated for each individual trial. SMD is the mean difference in depression score between the exercise and control groups divided by the pooled SD at follow-up. The result is a unit-free effect size. By convention, SMD effect sizes of 0.2, 0.5 and 0.8 are considered small, medium and large intervention effects.23 For dichotomous variables, we calculated the risk ratio (RR) with a 95% CI. It was expected that some trials would have several intervention groups. Data from the experimental groups were pooled and compared with the data from the control group. In case of discrepancies between the random-effects model analysis and the fixed-effect model analysis, both results are reported; otherwise, only results from the random-effects analysis are reported. The degree of heterogeneity was quantified using the I2 statistic,25 which can be interpreted as the percentage of variation observed between the trials attributable to between-trial differences, rather than sampling error (chance). Heterogeneity was explored by analyses of subgroups (see below).
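As a rough illustration of the quantities described above, the following Python sketch computes an SMD from group summary data and pools trial estimates with a random-effects model. It assumes the uncorrected Cohen's d form of the SMD and the DerSimonian-Laird estimator of between-trial variance; the review software may apply a small-sample (Hedges) correction or a different estimator, and the input numbers are made up for the example.

```python
import math
from typing import List, Tuple


def smd(mean_e: float, sd_e: float, n_e: int,
        mean_c: float, sd_c: float, n_c: int) -> Tuple[float, float]:
    """Return the SMD (mean difference / pooled SD) and its approximate variance."""
    pooled_sd = math.sqrt(
        ((n_e - 1) * sd_e ** 2 + (n_c - 1) * sd_c ** 2) / (n_e + n_c - 2)
    )
    d = (mean_e - mean_c) / pooled_sd
    var = (n_e + n_c) / (n_e * n_c) + d ** 2 / (2 * (n_e + n_c))
    return d, var


def random_effects(effects: List[float],
                   variances: List[float]) -> Tuple[float, float, float, float]:
    """DerSimonian-Laird pooling; returns (pooled, se, tau2, I2 in percent)."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, se, tau2, i2


# Hypothetical trial: exercise mean 10 (SD 5, n=20) vs control mean 12 (SD 5, n=20).
d_example, var_example = smd(10, 5, 20, 12, 5, 20)
```

A negative SMD here means lower depression scores in the exercise group, matching the sign convention of the results reported in this review.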
For the primary outcomes, trial sequential analysis was performed.26 27 In order to calculate the required information size and the cumulative Z-curve’s eventual breach of relevant trial sequential monitoring boundaries, the required information size for the primary continuous outcome was based on type I error of 5%, a beta of 10%, the SE of the meta-analysis and a minimal difference of three points on the Hamilton Depression Scale, 17 items (HAM-D17).18 Post hoc we calculated the required information size including all trials. This was done by converting effect estimates from trials reporting other outcome scales into the HAM-D17 scale as described by Thorlund et al.28 In order to calculate the required information size and the cumulative Z-curve’s eventual breach of relevant trial sequential monitoring boundaries, the required information size for lack of remission was based on type I error of 5%, a beta of 10%, the proportion of participants in the control group with the outcome and a relative risk reduction of 15% and 30%.
Bayes factors were calculated for all primary outcomes.29 Low p values suggest that we can reject the null hypothesis. But even a low p value from a meta-analysis can be misleading if there is also a low probability that data are compatible with the anticipated intervention effect. In other words, the probability that the actual measured difference in effect of the compared interventions resulted from an a priori anticipated ‘true’ difference needs to be considered. For this purpose, it is helpful to calculate the Bayes factor, which is the ratio of the probability of the meta-analysis result under the null hypothesis to its probability under the anticipated, or ‘true’, effect.29 As suggested by Jakobsen et al,29 a Bayes factor <0.1 together with a low p value suggests, if bias can be ruled out, that the observed result is compatible with the a priori expected effect. If the Bayes factor is >0.1, the result is not compatible with the a priori expected effect and the effect may be lower.
To assess the potential impact of missing data (incomplete outcome data bias), we did sensitivity analysis of missing data using the following strategy: a ‘best-worst’ case scenario was assessed, assuming that all participants lost to follow-up in the intervention group had a beneficial outcome (the group mean minus 1 SD), and all those with missing outcomes in the control group had a harmful outcome (the group mean plus 1 SD and 2 SD). In addition, the reverse ‘worst-best’ case scenario analysis was also performed.29 Missing data for the ‘lack of remission’ outcome were imputed in sensitivity analysis according to the following scenarios30: 1) poor outcome analysis: assuming that all of the drop-outs/participants lost from both the experimental and the control arms experienced the outcome, including all randomised participants in the denominator; 2) good outcome analysis: assuming that none of the drop-outs/participants lost from the experimental and the control arms experienced the outcome, including all randomised participants in the denominator; 3) extreme case analysis favouring the experimental intervention (‘best-worst’ case scenario): none of the drop-outs/participants lost from the experimental arm, but all of the drop-outs/participants lost from the control arm experienced the outcome, including all randomised participants in the denominator and 4) extreme case analysis favouring the control (‘worst-best’ case scenario): all of the drop-outs/participants lost from the experimental arm, but none from the control arm experienced the outcome, including all randomised participants in the denominator.
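The four imputation scenarios for the dichotomous outcome can be made concrete with a small Python sketch. The counts are invented for illustration, the function names are hypothetical, and the 95% CI uses the usual large-sample log risk ratio approximation (which requires at least one event in each arm); this is not the authors' own code.

```python
import math
from typing import Dict, Tuple

RiskRatio = Tuple[float, float, float]  # (RR, lower 95% limit, upper 95% limit)


def risk_ratio(events_e: int, n_e: int, events_c: int, n_c: int) -> RiskRatio:
    """Risk ratio with a 95% CI from the log-scale normal approximation."""
    rr = (events_e / n_e) / (events_c / n_c)
    se = math.sqrt(1 / events_e - 1 / n_e + 1 / events_c - 1 / n_c)
    return rr, math.exp(math.log(rr) - 1.96 * se), math.exp(math.log(rr) + 1.96 * se)


def missing_data_scenarios(events_e: int, analysed_e: int, randomised_e: int,
                           events_c: int, analysed_c: int, randomised_c: int
                           ) -> Dict[str, RiskRatio]:
    """'Event' means lack of remission; all randomised participants form the denominator."""
    miss_e = randomised_e - analysed_e
    miss_c = randomised_c - analysed_c
    return {
        # 1) all drop-outs in both arms experienced the outcome
        "poor": risk_ratio(events_e + miss_e, randomised_e,
                           events_c + miss_c, randomised_c),
        # 2) no drop-outs in either arm experienced the outcome
        "good": risk_ratio(events_e, randomised_e, events_c, randomised_c),
        # 3) favours exercise: only control drop-outs experienced the outcome
        "best-worst": risk_ratio(events_e, randomised_e,
                                 events_c + miss_c, randomised_c),
        # 4) favours control: only exercise drop-outs experienced the outcome
        "worst-best": risk_ratio(events_e + miss_e, randomised_e,
                                 events_c, randomised_c),
    }


# Hypothetical trial: 50 randomised per arm; 40 and 42 analysed; 20 and 28 not remitted.
scenarios = missing_data_scenarios(20, 40, 50, 28, 42, 50)
```

The spread between the 'best-worst' and 'worst-best' estimates gives a sense of how much the conclusion could depend on participants who were lost to follow-up.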
In subgroup analyses, the possible effects of variables on intervention effects on outcomes and heterogeneity were compared. Trials potentially having less risk of bias (ie, trials with adequate allocation concealment, blinded outcome assessment and intention-to-treat analysis) were compared with trials at high risk of bias. The effect of age was assessed by comparing trials including older participants (mean age >59 years) to trials including younger participants (mean age <60 years). The effect of type of exercise was assessed by comparing trials using group exercises with trials using individual exercise. The effect of duration of intervention was assessed by comparing trials with short duration of intervention to trials with long duration of intervention, split at the median duration. The effect of type of control group was assessed by comparing trials using attention control to trials with waitlist controls and comparing trials with exercise as add-on to medication to trials not using any medication. In addition, a within-study comparison of low-dose exercise versus high-dose exercise in trials using different exercise intensities was performed. The effect of comorbid somatic disease was assessed by comparing the effect estimates from trials including participants with depression with those from trials including participants with depression in addition to a somatic disease. Publication bias was assessed by visual inspection of a funnel plot and by Egger’s test and, if publication bias appeared plausible, Duval and Tweedie’s trim and fill procedure was conducted.31
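For readers unfamiliar with Egger's test, the core of it is a simple regression of each trial's standardised effect (effect divided by its SE) on its precision (1/SE); an intercept far from zero indicates funnel plot asymmetry. The Python sketch below computes the intercept and its SE only, using ordinary least squares and made-up data. The p value, which requires a t distribution with n−2 degrees of freedom, is left out, and the function name is a hypothetical one chosen for the example.

```python
import math
from typing import List, Tuple


def egger_intercept(effects: List[float],
                    std_errors: List[float]) -> Tuple[float, float]:
    """Return (intercept, SE of intercept) of Egger's regression; needs >= 3 trials."""
    x = [1.0 / se for se in std_errors]                 # precision
    y = [e / se for e, se in zip(effects, std_errors)]  # standardised effect
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = resid_ss / (n - 2)
    se_intercept = math.sqrt(s2 * (1.0 / n + mx ** 2 / sxx))
    return intercept, se_intercept


# Made-up asymmetric data: the least precise trials show the largest benefits.
asym_effects = [-1.2, -0.9, -0.5, -0.3, -0.2]
asym_ses = [0.5, 0.4, 0.25, 0.15, 0.1]
asym_intercept, asym_se = egger_intercept(asym_effects, asym_ses)
```

When every trial estimates the same effect regardless of its size, the standardised effect is exactly proportional to precision and the intercept is zero; small trials with exaggerated benefits pull the intercept away from zero, as in the example data.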
We assessed and graded the evidence according to the GRADE for high risk of bias, imprecision, indirectness, heterogeneity and publication bias.32 Based on this assessment, the intervention was graded accordingly: ‘high quality’—we are very confident that the true effect lies close to that of the estimate of the effect; ‘moderate quality’—we are moderately confident in the effect estimate. The true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different; ‘low quality’—our confidence in the effect estimate is limited: the true effect may be substantially different from the estimate of the effect; ‘very low quality’—we have very little confidence in the effect estimate: the true effect is likely to be substantially different from the estimate of the effect.33
Deviations from our protocol
Post hoc we included trials using the Chinese Classification of Mental Disorders (CCMD) as well as a few trials including participants classified as having ‘minor depression’. The CCMD system closely adheres to the ICD and DSM systems and has been found highly compatible with them in field studies, so these studies were included.34 A few trials included some participants classified as having ‘minor depression’ according to the trial’s chosen diagnostic system (eg, DSM), and it is questionable if these participants have major depression. We therefore decided to include these trials and to conduct a subgroup analysis exclusively including participants with major depression. To further explore heterogeneity, we post hoc included subgroup analysis comparing intervention effects in inpatients and outpatients as well as an analysis according to trial size. Trials were divided into small or large trials using the median of total n included in the efficacy analysis. The effect of exercise capacity was post hoc assessed by comparing trials with a high increase in maximal oxygen uptake (VO2max) with trials with a lower increase in maximal oxygen uptake. Assessment of exercise capacity was based on the increase of VO2max in the intervention groups and trials were stratified to either high or low increase in exercise capacity by median. We did not conduct trial sequential analysis based on a relative risk reduction of 30% of lack of remission as this was an implausible effect.
Depressed participants were not involved in this study.
Bibliographical search and trial characteristics
The main bibliographical search was conducted on 26 August 2015 and the final updates were conducted on 20 June 2017. As illustrated in online supplementary figure S1, we identified 45 publications reporting the effect of exercise on depressive symptoms in 35 randomised clinical trials.21 22 35–78 Seventeen trials were conducted in Europe,21 22 40 49 52 53 55 61 65–68 74 75 77 79 80 eight in the USA,38 39 43 45 60 64 76 81 six in Asia,47 69–73 two in Australia54 58 and two in South America.56 63 A total of 2630 participants were randomised and 2498 were included in the efficacy analysis of benefit. Ten trials included inpatients47 49 56 67 69–73 79 and five trials included participants with a mean age >60 years.52 54 58 60 61 No trials exclusively included participants with comorbid somatic disease. Four trials reported the continuous outcome as mean change from baseline in each group with a corresponding SD,39 53 65 68 and one trial presented data as mean difference between groups postintervention.40 The remaining trials reported postscores in each group with corresponding SD (see table 1 for trial characteristics).
Bias risk assessment
Sequence generation was adequate in 15/35 (43%) trials, allocation concealment was adequate in 13/35 (37%) trials, blinding of participants and trial personnel was adequate in 0/35 (0%), blinded outcome assessment was performed in 16/35 (46%), low risk of bias in the ‘incomplete outcome data’ domain was found in 12/35 (34%) trials, the selective outcome reporting domain was adequate in 31/35 (89%), the for-profit bias domain was adequate in 19/35 (54%) and 25/35 (71%) were free of other bias. Accordingly, all trials were at high risk of bias. Given the nature of the intervention, no trial had blinded participants or trial personnel; however, two trials had low risk of bias in all other bias domains.22 54 Five trials (16%) were sponsored by for-profit organisations: three trials were supported by pharmaceutical companies,53 79 82 one trial by a company producing fitness machines45 and one trial by an insurance company.21 According to our a priori defined criteria, 4/35 (11%) trials potentially had less risk of bias than the other trials at high risk of bias21 22 54 56 (see table 2 for details on assessment of risk of bias).
The effect of exercise on depression severity
All included trials provided a continuous outcome on depression severity for the assessment of the exercise intervention encompassing 2498/2630 randomised participants (95%). The effect of intervention versus control was an SMD of −0.66 (95% CI −0.86 to −0.46; p<0.001) (figure 1). This corresponds to an effect on the HAM-D17 scale of −4.1 (95% CI −5.3 to −2.9) points.
Missing outcome analysis for depression as a continuous outcome did not markedly change the effect estimates. The least favourable outcome for the exercise intervention was the worst/best outcome analysis using +2 SD resulting in an effect estimate of −0.57 SMD (95% CI −0.78 to −0.36; p<0.001) (see online supplementary table S1).
Heterogeneity and subgroup analysis
The I2 was 81% suggesting substantial heterogeneity. Subgroup analysis revealed that the effect estimate for trials potentially having less risk of bias was −0.11 SMD (95% CI −0.41 to 0.18; p=0.45; I2=62%) compared with −0.75 SMD (−0.98 to −0.52; p<0.001; I2=81%) for the trials at high risk of bias (test of subgroup difference, p<0.001). In addition, trials including 50 participants or less had a pooled estimate of −1.11 (−1.52 to −0.72; p<0.001; I2=78%) compared with −0.37 (−0.57 to −0.18; p<0.001; I2=75%) for larger trials (test of subgroup difference, p=0.001). Trials of short duration of intervention (<10 weeks) had an SMD of −0.92 (−1.09 to −0.74; p<0.001; I2=14%) compared with −0.49 (−0.75 to −0.23; p<0.001; I2=83%) for trials with longer duration of intervention (test of subgroup difference, p=0.007). Effect estimates from trials including participants with minor depression compared with trials exclusively including participants with major depression did not differ (test of subgroup difference, p=0.53).
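The tests of subgroup difference reported above can be approximated from the published subgroup estimates alone. The Python sketch below recovers each subgroup's SE from its 95% CI and applies a two-sided z-test to the difference between the two pooled estimates (with two subgroups this is equivalent to the usual between-subgroup Q test on one degree of freedom). It is an approximation under those assumptions, not the authors' analysis; the function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple

_NORMAL = NormalDist()
_Z975 = _NORMAL.inv_cdf(0.975)  # about 1.96


def se_from_ci(lower: float, upper: float) -> float:
    """Back-calculate a standard error from a symmetric 95% CI."""
    return (upper - lower) / (2 * _Z975)


def subgroup_difference_p(est1: float, ci1: Tuple[float, float],
                          est2: float, ci2: Tuple[float, float]) -> float:
    """Two-sided p value for the difference between two independent estimates."""
    se_diff = math.sqrt(se_from_ci(*ci1) ** 2 + se_from_ci(*ci2) ** 2)
    z = (est1 - est2) / se_diff
    return 2 * (1 - _NORMAL.cdf(abs(z)))


# Subgroup SMDs and 95% CIs as reported in the text.
p_bias = subgroup_difference_p(-0.11, (-0.41, 0.18), -0.75, (-0.98, -0.52))
p_size = subgroup_difference_p(-1.11, (-1.52, -0.72), -0.37, (-0.57, -0.18))
p_duration = subgroup_difference_p(-0.92, (-1.09, -0.74), -0.49, (-0.75, -0.23))
```

With the rounded inputs from the text, the three p values come out close to the reported p<0.001, p=0.001 and p=0.007, respectively.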
Four trials allocated 206 participants to different exercise intensities/doses.45 58 73 83 Comparing the postintervention depression scores for participants allocated to either high-intensity/high-dose versus low-intensity/low-dose exercise showed a difference of −0.40 SMD (95% CI −0.67 to −0.12; p=0.005; I2=0%) in favour of high-intensity/high-dose exercise. As shown in table 3, no other trial characteristic significantly explained any of the observed heterogeneity (see online supplementary table S2 for trial characteristics used to explore heterogeneity).
Trial sequential analysis and diversity adjusted required information size
The diversity adjusted required information size for HAM-D17 as a continuous outcome was calculated to be 2610 participants, based on our anticipated intervention effect of a minimal relevant difference of 3.0 HAM-D17 points, an SD of 6.78 points, a risk of type I error of 0.05, a power of 90% and the observed diversity of 92%. Only 14 trials reported results from HAM-D17 21 22 38 39 43 44 52 53 55 56 58 68 70 83 with an accrued 1124 participants. As shown in online supplementary figure S2, the cumulative Z-curve just crossed the trial sequential monitoring boundary for benefit. With the aforementioned settings, the pooled estimate is therefore less likely to be a random finding due to lack of power or multiple testing if bias could be ignored. Post hoc, we calculated the adjusted required information size for HAM-D17 including all trials as shown in online supplementary figure S3. As with the original analysis, the Z-curve crossed the trial sequential monitoring boundary for benefit, supporting that the pooled estimate is less likely to represent a type I error if bias could be ignored.
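To show how a diversity adjusted required information size of this order arises, the sketch below uses the standard two-group sample size formula for a continuous outcome and inflates it by 1/(1 − D²). This is an assumption about the calculation, not a reproduction of the trial sequential analysis software, and the function name is hypothetical. With the rounded inputs given in the text it returns roughly 2700 participants rather than exactly 2610, presumably because the reported diversity and SD are rounded.

```python
from statistics import NormalDist


def required_information_size(delta: float, sd: float, alpha: float = 0.05,
                              beta: float = 0.10, diversity: float = 0.0) -> float:
    """Total participants needed to detect `delta`, adjusted for diversity D^2.

    delta: minimal relevant difference; sd: standard deviation of the outcome;
    alpha: two-sided type I error; beta: type II error (1 - power);
    diversity: D^2 expressed as a proportion (e.g. 0.92 for 92%).
    """
    z = NormalDist().inv_cdf
    n_unadjusted = 4 * (z(1 - alpha / 2) + z(1 - beta)) ** 2 * sd ** 2 / delta ** 2
    return n_unadjusted / (1 - diversity)


# Inputs as reported in the text: 3.0 HAM-D17 points, SD 6.78, diversity 92%.
ris_hamd = required_information_size(delta=3.0, sd=6.78, diversity=0.92)
```

The example makes the role of diversity visible: without heterogeneity the same settings would require only a little over 200 participants, so most of the required information size is driven by the between-trial variation.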
Fourteen trials reported effect estimates using the HAM-D17.21 22 38 39 43 45 52 53 55 63 68 70 83 84 Based on these trials, Bayes factor was calculated (δ=−3.37; SEδ=0.96; µa=−3.0) and was found to be 0.002, which is below the Bayes factor threshold for significance of 0.1, supporting the intervention effect if bias could be ignored.
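The reported Bayes factor can be approximated from the three quantities given above. The sketch assumes a normal likelihood for the pooled estimate, so that the Bayes factor is the likelihood of the observed result under a null effect of zero divided by its likelihood under the anticipated effect; the function name and argument names are hypothetical.

```python
import math


def bayes_factor(observed: float, se: float, anticipated: float) -> float:
    """Likelihood ratio of 'no effect' versus the anticipated effect.

    Assumes the pooled estimate is normally distributed with standard error `se`.
    Equivalent to exp(-z0**2 / 2) / exp(-zA**2 / 2), where z0 = observed / se
    and zA = (observed - anticipated) / se.
    """
    return math.exp((anticipated ** 2 - 2 * anticipated * observed) / (2 * se ** 2))


# Values reported in the text for HAM-D17: estimate -3.37, SE 0.96, anticipated -3.0.
bf_hamd = bayes_factor(observed=-3.37, se=0.96, anticipated=-3.0)
```

With these inputs the function returns approximately 0.002, in line with the value reported above and well below the 0.1 threshold.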
Inspection of the funnel plot (not shown) suggested that small trials with small or no effect of exercise were missing (see online supplementary figure S4). Egger’s test supported the suspicion of publication bias, p<0.00001. Using Duval and Tweedie’s trim and fill procedure, the estimate was reduced to −0.27 SMD (95% CI −0.50 to −0.05). This corresponds to an effect on the HAM-D17 scale of −1.7 (95% CI −3.1 to −0.31) points.
The effect of exercise on depression—lack of remission
Nineteen trials, randomising 1825 participants and including 1639 participants (90%) in final analysis reported remission as an outcome.21 22 38–40 43 45 47 49 53 54 56 60 61 65 68–70 72 Remission postintervention was defined in various ways: a postintervention score on the HAM-D17<8 points,44 53 56 69 70 not fulfilling the DSM criteria for depression and a HAM-D17<8 points,21 22 39 not fulfilling the DSM criteria for depression,38 54 60 a BDI score <9 points,43 a BDI score <10 points,40 a HAM-D17 score <10 points,83 a Montgomery-Asberg Depression Rating Scale (MADRS) score <10 points,47 a MADRS score <10 points and a 50% reduction in symptom score,65 a 75% reduction in HAM-D24,72 a HAM-D17 score <11.28 points and a reduction in HAM-D17 scores >7.74 points68 and one study used MADRS not specifying the cut-off for remission.49 The RR for lack of remission was 0.78 (95% CI 0.68 to 0.90; p=0.0008) in favour of the intervention using a random-effects analysis. The I2 was 69% suggesting substantial heterogeneity. The forest plot for the intervention effect on lack of remission is illustrated in online supplementary figure S5.
The scenario least favourable to the intervention was the ‘poor’ outcome analysis, with an effect estimate of RR 0.88 (95% CI 0.83 to 0.94; p=0.0002; I2=69%). As shown in online supplementary table S1, the remaining scenarios did not substantially differ from the main analysis.
Heterogeneity and subgroup analysis
I2 was 69% for the outcome lack of remission suggesting substantial heterogeneity. For this outcome, only two trials22 84 were considered as trials potentially having less risk of bias than the other trials at high risk of bias. The RR of these two trials was 0.95 (95% CI 0.74 to 1.23; p=0.78) compared with 0.77 (95% CI 0.64 to 0.92; p=0.003) for trials at high risk of bias (test of subgroup difference, p=0.19). Trials including 52 participants or less in their final analysis had an RR of 0.62 (95% CI 0.50 to 0.76; p<0.001; I2=45%) compared with 0.95 (95% CI 0.80 to 1.12; p=0.52; I2=68%) for larger trials (test of subgroup difference, p=0.002). Also, trials with a duration of <10 weeks had an RR of 0.63 (95% CI 0.51 to 0.77; p<0.001; I2=40%) compared with 0.93 (95% CI 0.78 to 1.10; p=0.39; I2=69%) for trials of a longer duration (test of subgroup difference, p=0.004). As shown in online supplementary table S3, no other trial characteristic significantly explained any of the observed heterogeneity (see online supplementary table S2 for trial characteristics used to explore heterogeneity).
Trial sequential analysis and diversity adjusted required information size
The diversity adjusted required information size for lack of remission was calculated based on our observed diversity of 74%, a proportion in the control group with lack of remission of 66%, an anticipated intervention effect of 15% relative risk reduction, a risk of type I error of 0.05 and a power of 90%. As shown in online supplementary figure S6, the cumulative Z-curve just crossed the trial sequential monitoring boundary for benefit. With the aforementioned settings, the pooled estimate is therefore less likely to be a random finding due to lack of power or multiple testing if bias could be ignored.
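For a dichotomous outcome, the corresponding calculation can be sketched as follows. The formula is the conventional two-group sample size approximation based on the average event proportion, inflated by 1/(1 − D²), with a two-sided type I error of 5% assumed; it is an illustrative assumption, the function name is hypothetical, and the text does not report the resulting number, so no agreement with the authors' software is claimed.

```python
from statistics import NormalDist


def required_information_size_binary(control_risk: float, rrr: float,
                                     alpha: float = 0.05, beta: float = 0.10,
                                     diversity: float = 0.0) -> float:
    """Total participants needed to detect a relative risk reduction `rrr`.

    control_risk: proportion with the outcome in the control group;
    rrr: anticipated relative risk reduction (0.15 for 15%);
    diversity: D^2 expressed as a proportion (0.74 for 74%).
    """
    z = NormalDist().inv_cdf
    experimental_risk = control_risk * (1 - rrr)
    mean_risk = (control_risk + experimental_risk) / 2
    n_unadjusted = (4 * (z(1 - alpha / 2) + z(1 - beta)) ** 2
                    * mean_risk * (1 - mean_risk)
                    / (control_risk - experimental_risk) ** 2)
    return n_unadjusted / (1 - diversity)


# Settings from the text: control risk 66%, 15% relative risk reduction, diversity 74%.
ris_remission = required_information_size_binary(0.66, 0.15, diversity=0.74)
```

Under these assumptions the required information size is on the order of 3900 participants, more than twice the 1639 participants analysed for this outcome.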
Bayes factor was calculated based on the observed relative risk of remission, the associated SE and an anticipated intervention effect of relative increase in number of participants with remission by 15% (δ=−0.248; SEδ=0.08; µδ=−0.163). Bayes factor was 0.02, which is below the Bayes factor threshold for significance of 0.1.
Inspection of the funnel plot (not shown) suggested that small trials with small or no effect of exercise were missing. Egger’s test supported the suspicion of publication bias, p=0.002. Imputing theoretically missing studies using Duval and Tweedie’s trim and fill procedure reduced the estimate of the intervention effect to a relative risk of 0.93 (95% CI 0.79 to 1.11).
The effect of exercise on serious adverse events
Serious adverse events (ie, death or suicide attempts) were reported in only three trials.21 22 58 In these trials, one suicide attempt22 and one death by suicide21 were recorded in the intervention groups. The RR for death or suicide in the two trials was 2.21 (95% CI 0.24 to 20.21; p=0.48; I2=0%) as illustrated in online supplementary figure S7.
Missing outcome analysis for ‘serious adverse events’ varied according to missing data scenario: poor outcome analysis relative risk, 0.92 (95% CI 0.37 to 2.30; p=0.86; I2=60.0%), good outcome analysis, 2.19 (95% CI 0.23 to 20.76; p=0.50; I2=0.0%), best/worst outcome analysis 0.08 (95% CI 0.02 to 0.34; p=0.001; I2=5.4%), worst/best outcome analysis 19.17 (95% CI 2.64 to 139.2; p=0.004; I2=0.0%).
Trial sequential analysis and Bayes analysis
We decided not to conduct trial sequential analysis or Bayes analysis because the data were too sparse.
Only 3/35 trials reported on this outcome and no formal assessment for publication bias was made. However, the lack of reporting in the vast majority of trials suggests a risk of publication bias.
The effect of exercise on quality of life
Nine trials randomising 827 participants reported on quality of life,21 22 38 40 56 60 71 76 85 observing that participants allocated to exercise did not have significantly better quality of life (SMD 0.40; 95% CI −0.03 to 0.83; p=0.07). The I2 was 88% showing substantial heterogeneity (see online supplementary figure S8).
Non-serious adverse events
Non-serious adverse events were reported in only 10 trials.21 22 39 56 58 60 65 67 68 75 Five trials reported on musculoskeletal adverse events without conducting formal tests58 60 65 67 68 and four trials reported on number of participants with high depression scores postintervention compared with baseline assessment.21 22 65 68 The RR for increased severity of depression in patients allocated to exercise postintervention was 0.83 (95% CI 0.40 to 1.70; p=0.60; I2=0.0%).
The effect of exercise on depression beyond the duration of the intervention
Assessment of depression beyond the intervention was conducted in seven trials,21 38 40 52 60 63 86 with a median duration between end of intervention and assessment of depression of 6 months (range 5–23.5 months). The SMD between the intervention group and the control group using a random-effects analysis was −0.10 (95% CI −0.28 to 0.09; p=0.31). The I2 for this estimate was 19.5%, suggesting low heterogeneity (see online supplementary figure S9).
Remission beyond the intervention was assessed in five trials,21 38–40 54 and the relative risk of lack of remission was 0.95 (95% CI 0.82 to 1.11; p=0.53) with an I2 of 0.0% (see online supplementary figure S10).
The GRADE assessments are presented in table 4, and quality of evidence for both primary and secondary outcomes was very low or low.
Four studies reported change in scores from baseline with corresponding SDs, and one study reported mean difference between groups postintervention. Comparing the effect size of these five studies with the remaining did not seem to explain part of the heterogeneity (p=0.23).
Thirty-five clinical trials allocating more than 2498 participants diagnosed with depression according to validated diagnostic instruments were included in the present systematic review. Pooled estimates suggested moderate antidepressant effect assessed both as a continuous outcome and as lack of remission. Due to risk of bias, inconsistency of effect estimates and publication bias, we have, however, very little confidence in these effect estimates. Subgroup analyses exploring reasons for the heterogeneity found that trials potentially having less risk of bias than other trials at high risk of bias had no effect of exercise on depression. Furthermore, duration of intervention and trial size were inversely associated with effect estimates. Exercise did not improve quality of life or depression or remission after the intervention. Serious adverse events or adverse events were reported inconsistently and only by a few trials not permitting firm conclusions regarding these outcomes.
Strengths and limitations
The strengths of this systematic review are that it is based on the published protocol, a comprehensive search strategy and the inclusion of patient-centred outcomes such as quality of life as well as adverse events. Also, to avoid spurious findings from repeated testing, trial sequential analysis and Bayes analysis were undertaken, and these analyses did not suggest that the pooled estimates could be reduced to random errors for effect on depression severity or no remission. Neither trial sequential analysis nor Bayes factor analysis is, however, able to wash out spurious effects induced by bias, fraud or other reasons.26 29 87–89 Had we restricted the trial sequential analysis to trials of potentially lower risk of bias, the number of trials and participants would have been limited and we would have seen evidence far from crossing any boundaries for benefit, harms or futility. The conclusions for serious adverse events and adverse events were associated with wide CIs due to lack of data and firm conclusions for these outcomes are presently not available.
The number of trials with adequate allocation concealment was 37% in the current systematic review compared with only 15.1% in trials assessing non-drug interventions for depression.90 Blinded outcome assessment was performed in 46% of the included trials compared with 44% in non-drug antidepressant trials in general.90 The incomplete outcome bias domain was adequate in 34% of our included trials compared with 32.9% of antidepressant non-drug trials in general.90 Compared with non-drug trials assessing interventions for participants with depression, the included exercise trials have more bias domains with low risk of bias. However, all our included trials were at high risk of bias. Two trials had low risk of bias for all bias domains except for blinding of participants and trial personnel, and four trials fulfilled our criteria for trials at potentially less risk of bias than the rest of the trials at high risk of bias. Despite a search strategy including bibliographical databases and trials from China and South America, the vast majority of included trials were conducted in North America and western Europe, which is comparable to the geographical distribution of non-drug trials in general,90 limiting the applicability to other geographic regions.
All outcomes for the primary analysis reflect depression severity; however, the different psychometrics may represent different aspects of depression not reflected in the pooled estimate. An in-depth discussion of the included assessment scales is beyond the scope of this review, but in the current systematic review we found no significant differences in effect estimates between trials using HAM-D17 and trials using other assessment scales (data not shown).
The effect of exercise on depression
Our present results are similar to the latest Cochrane review by Cooney et al, 24 who found a moderate effect of exercise on depressive symptoms (−0.62 SMD) when including all trials and no effect when restricting the analysis to trials with less risk of bias (−0.18 SMD). The Cochrane review did find evidence of a small antidepressant effect beyond the intervention, which we could not confirm in our present systematic review. Bridle et al 13 included nine trials allocating old (>60 years) participants with depression to exercise interventions versus control interventions. Restricting the analysis to four trials at lower risk of bias, they found small-to-moderate effect estimates (SMD −0.34) in favour of exercise. The studies by Cooney et al 24 and Bridle et al 13 both included trials allocating participants with depressive symptoms and not necessarily diagnosed using a validated diagnostic system, potentially explaining the differences in the effect sizes. However, in our present systematic review the estimate for four trials at potentially less risk of bias than the remaining trials was −0.11 SMD, and in the study by Cooney et al, the effect estimate for eight trials with lower risk of bias was −0.18 SMD24 compared with −0.34 in the study by Bridle et al.13 Meta-analyses of randomised clinical trials assessing the effects of exercise for depression consistently find positive effects; however, when the analysis is restricted to trials with less risk of bias, the pooled effect sizes become very small or negligible. Meta-analyses examining the effect of exercise beyond the intervention also find no or small effects of exercise. When interpreting effect estimates in the current research field, it is important to recognise that effect estimates from trials with non-blinded outcome assessment are at high risk of bias, as reported by Savović et al.91 Sixteen of 35 trials in the current systematic review did not use blinded outcome assessment.
In contradiction to the current systematic review, a recent meta-analysis by Schuch et al 12 concluded that ‘exercise has a large and significant antidepressant effect in people with depression………Our data strongly support the claim that exercise is an evidence-based treatment for depression’. This statement was based on a meta-analysis of 25 randomised clinical trials allocating participants with depression or depressive symptoms to exercise or control conditions and excluding trials using any form of active control group. Surprisingly, the authors found that adjusting for publication bias using the trim and fill procedure31 increased the estimate from an SMD of 0.98 to 1.11. The effect in SMD in included studies ranged from −0.23 to 4.56, representing considerable heterogeneity.12 The authors classified four trials as having lower risk of bias using the same criteria as in our systematic review and 21 trials as having high risk of bias. This illustrates some of the challenges in meta-analysis of exercise and depression: the large heterogeneity driven by small studies inflating the effects of random-effects analysis,92 and the misconception that we can restrict our analysis to statistics and not consider the evident effect of bias.23 91 Compared with our previous review,10 we now included 35 trials including 2498 participants versus previously 13 trials and 687 participants. It may seem a paradox that this large increase in data has not provided us with a similar increase in certainty of conclusions, as reflected by the heterogeneity of trial results as well as our conclusions from the systematic reviews. The increase in available data is, however, primarily provided by small trials at high risk of bias introducing exaggerated effect estimates.
In the current systematic review, we included four trials with 530 participants at lower risk of bias compared with three trials with 239 participants in our previous review, reflecting that only a small part of the additional data comes from trials at lower risk of bias. The continuous increase in data associated with high risk of bias will not provide patients, clinicians or policymakers with adequate information and represents an unethical enrolment of trial participants and waste of resources.93–99 We therefore recommend that future systematic reviews and meta-analysis a priori should have a primary outcome restricting effect analysis to larger trials with lower risk of bias and that any recommendations regarding exercise interventions for participants with depression should be assessed with the GRADE framework.
The I2 of 81% and 69% for the primary outcomes indicate substantial heterogeneity of intervention effects, that is, variation in effect estimates beyond chance. Part of this heterogeneity was explained by bias and by trial size: trials at high risk of bias or small trials have very large effect estimates compared with trials potentially at less risk of bias or larger trials. The funnel plots and Egger’s test indicate publication bias; however, the association between trial size and effect estimates could suggest that the asymmetry in the funnel plots is due to small study bias rather than publication bias.100 It could be argued that both the delivery of exercise as well as the actual increase in fitness are fundamental to the assessment of the antidepressant effects of exercise, and in line with our previous review, we found duration of intervention inversely associated with effect size.11 Comparing different exercise intensities, we did find a small effect of high-intensity exercise compared with lower-intensity exercise. However, assessing delivered exercise expressed as increase in maximal oxygen uptake, we could not reproduce this finding. Future trials need to pay more attention to the dose of the intervention as well as compliance with intervention.101 We suggest using maximal oxygen uptake or one repetition maximum as the gold standard to assess the received exercise. Several studies compare exercise with control interventions rather than waitlist control to reduce the influence of non-specific effects, for example, the DEpression og MOtion (DEMO) trials and the trials by Mather et al.21 22 52 Also, it could be speculated that the effect of exercise would be harder to detect if participants also received medical treatment. The current systematic review could not confirm that the type of control condition explained heterogeneity.
The discussion of control group is important in non-drug trials: when a waitlist control group is chosen, the results potentially reflect non-specific effects; when an active control group (eg, relaxation exercise) is chosen, the trial is potentially a comparison between two active treatments. However, in the current systematic review we found no evidence that trials using an attention control group or exercise as add-on to pharmacotherapy had significantly different effect estimates compared with other trials.
Our systematic review did not find indications of a positive effect on quality of life in participants with depression allocated to exercise interventions, which is in concordance with the review by Cooney et al.24 Only 3/35 trials reported on serious adverse events, and we found no significant effects of exercise on risk of death or suicide attempt. No indication of increased severity of depression or other adverse events in participants allocated to exercise could be detected. However, data on adverse events were reported sporadically in a minority of trials, and currently it is not possible to draw conclusions on the risk of serious adverse events or adverse events from exercise interventions in participants with depression.
We have little confidence in the pooled effect estimates, especially because trials with less than high risk of bias produced significantly lower effect estimates, suggesting that exercise interventions only produce small or negligible antidepressant effects, depending on how much of the effect is caused by bias and how much is caused by the intervention. There was no effect of exercise on depression beyond the intervention itself. We found no effect on quality of life. There is currently no evidence in favour of exercise for patients with depression with a view to ameliorate depressive symptoms. Our systematic review did not evaluate possible beneficial effects of exercise on, for example, metabolism or cardiovascular fitness,22 102 and it is possible that exercise may have beneficial effects on these factors in patients diagnosed with depression.
Despite the large number of published trials, further trials with more robust methodology still seem to be required to establish progress in this field. Also, additional trials from outside North America and Europe may be required for results to be valid for patients in Asia, Africa and South America. To further elaborate on the current findings, we recommend that future trials include blinded outcome assessors and outcomes assessing quality of life, metabolic effects and long-term effects beyond the intervention. It is also important that future trials systematically collect and report data on death, suicide events, musculoskeletal injuries and other potential adverse effects in both the intervention group and the control group. Moreover, future trials ought to be designed according to the standard protocol items: recommendations for interventional trials (SPIRIT) guidelines and reported according to the consolidated standards for reporting of trials (CONSORT) guidelines103 104 and transparently report deidentified individual participant data enabling individual participant data meta-analyses.105
The authors appreciate the help from Youling He with the Chinese Wanfang bibliographical database and translation of Chinese papers. The authors also thank Janus C Jakobsen for assistance with the calculation of Bayes factor.
Contributors JK conceived the project, collected data, did the statistical analysis, analysed the data, drafted and revised the manuscript. He is guarantor. CH collected the data, analysed the data and revised the manuscript. HS conceived the project, collected data, analysed the data and revised the manuscript. CG conceived the project, analysed the data and revised the manuscript. MN conceived the project, analysed the data and revised the manuscript.
Competing interests JK, CG and MN have previously published two trials and a meta-analysis on this topic, which could introduce an academic bias in the current systematic review. We asked new authors (HS and CH) to be involved in the preparation of the protocol, trial selection and bias assessment. No support from any organisation was received for the submitted work; no financial relationship with any organisations that might have an interest in the submitted work in the previous three years; and apart from the above no other relationship or activities that could appear to have influenced the submitted work.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement All data used in this study are available in figures and tables. No other data were used.