
Which are the most useful scales for predicting repeat self-harm? A systematic review evaluating risk scales using measures of diagnostic accuracy
  1. L Quinlivan1,
  2. J Cooper1,
  3. L Davies2,
  4. K Hawton3,
  5. D Gunnell4,
  6. N Kapur1,5
  1. 1Centre for Mental Health and Safety, University of Manchester, Manchester, UK
  2. 2Institute of Population Health, University of Manchester, Manchester, UK
  3. 3Department of Psychiatry, Centre for Suicide Research, University of Oxford, Warneford Hospital, Oxford, UK
  4. 4School of Social and Community Medicine, University of Bristol, Bristol, UK
  5. 5Manchester Mental Health and Social Care Trust, Manchester, UK
  1. Correspondence to Dr L Quinlivan; leah.quinlivan{at}manchester.ac.uk

Abstract

Objectives The aims of this review were to calculate the diagnostic accuracy statistics of risk scales following self-harm and consider which might be the most useful scales in clinical practice.

Design Systematic review.

Methods We based our search terms on those used in the systematic reviews carried out for the National Institute for Health and Care Excellence self-harm guidelines (2012) and evidence update (2013), and updated the searches through to February 2015 (CINAHL, EMBASE, MEDLINE, and PsycINFO). Methodological quality was assessed and three reviewers extracted data independently. We limited our analysis to cohort studies in adults using the outcome of repeat self-harm or attempted suicide. We calculated diagnostic accuracy statistics including measures of global accuracy. Statistical pooling was not possible due to heterogeneity.

Results The eight papers included in the final analysis varied widely according to methodological quality and the content of scales employed. Overall, sensitivity of scales ranged from 6% (95% CI 5% to 6%) to 97% (95% CI 94% to 98%). The positive predictive value (PPV) ranged from 5% (95% CI 3% to 9%) to 84% (95% CI 80% to 87%). The diagnostic OR ranged from 1.01 (95% CI 0.434 to 2.5) to 16.3 (95% CI 12.5 to 21.4). Scales with high sensitivity tended to have low PPVs.

Conclusions It is difficult to be certain which, if any, are the most useful scales for self-harm risk assessment. No scale performed sufficiently well to be recommended for routine clinical use. Further robust prospective studies are warranted to evaluate risk scales following an episode of self-harm. Diagnostic accuracy statistics should be considered in relation to specific service needs, and scales should only be used as an adjunct to assessment.

  • PSYCHIATRY
  • PUBLIC HEALTH
  • Diagnostic accuracy
  • HEALTH SERVICES ADMINISTRATION & MANAGEMENT

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/


Strengths and limitations of this study

  • We evaluated the diagnostic accuracy of widely used scales which were tested for predictive use in studies between 2002 and 2014, and included 98 600 hospital presentations of self-harm or attempted suicide.

  • The study provides an important critical evaluation of the scales, including a wide range of diagnostic accuracy statistics which are likely to be useful for clinicians, commissioners and hospital risk managers.

  • We did not conduct a meta-analysis due to the wide heterogeneity of the scales and studies themselves.

  • We limited our analyses to cohort studies of adults which used repeat self-harm or attempted suicide as an outcome, and reported measures of diagnostic accuracy.

Introduction

Self-harm is a frequent clinical challenge and a strong predictor of future suicide.1 ,2 One in six individuals presenting to hospital with self-harm will repeat the behaviour within 1 year.2–4 Psychosocial assessment on presentation to hospital is a key component of recommended clinical management.5 ,6 Guidelines recommend that all patients presenting to the hospital services with self-harm should receive a preliminary psychosocial assessment to determine mental capacity and evaluate willingness to stay for further treatment.5 Mental health professionals should conduct a more comprehensive evaluation of risk and needs at a later stage, and risk scales are typically a core component of assessments despite limited evidence of their effectiveness.6 ,7 Some clinical guidelines advise against the use of scales to determine management, but suggest they can be used to help structure assessments.6 Other guidelines recommend that only scales that have undergone formal testing should be used as part of clinical assessments.8

Our recent study in 32 English hospitals found that at least 20 risk tools were in use, suggesting a lack of consensus over which scales are best for evaluating risk of further self-harm.7 The uncertainty is perhaps due to methodological differences between studies and variable standards of reporting. A small number of reviews consider the predictive ability of risk scales for repeat self-harm and may help clinicians to select the most helpful tools,6 ,9 ,10 but the information provided is mostly limited to dual indicators such as sensitivity/specificity and positive/negative predictive values, and there is little practical guidance for clinicians in selecting the ‘most useful tools’.

While these dual indicators are useful for determining the predictive validity of a scale, a broader range of diagnostic test criteria may be helpful when selecting an appropriate scale for clinical use, given the inevitable trade-off between sensitivity (the proportion of individuals who repeat self-harm identified by the test as high risk) and specificity (the proportion of people who did not repeat self-harm identified as low risk by the test). For example, a highly sensitive test might identify all patients at risk of future self-harm but could be over-inclusive, with cost and resource implications. Conversely, the higher threshold inherent in highly specific tests may result in false negatives and a host of deleterious consequences for patients and clinical services.

We have conducted a systematic review of existing research on risk scales to consider these issues.

The objectives were to:

  1. Investigate the performance of risk scales following self-harm or attempted suicide on a wide range of dual measures, as well as more global measures of accuracy.

  2. Consider which might be the most useful scales following self-harm in clinical practice settings.

This information may be useful to clinicians, commissioners and hospital risk managers, who need to critically evaluate scales for use in clinical practice.

Method

This study extends the reviews carried out as part of the National Institute for Health and Care Excellence (NICE) self-harm guidelines6 and evidence update11 on the use of risk scales for repeat self-harm. We included recent evidence and considered a much broader range of diagnostic accuracy statistics than the original reviews.

Literature search

We identified studies evaluating the predictive validity of risk scales for repeat self-harm from the NICE review on the longer term management of self-harm and the evidence update.6 ,11 We used the same published search strategy11 (see online supplementary appendix 1) on CINAHL, EMBASE, MEDLINE and PsychINFO databases through to February 2015. Reference lists were also screened and related references reviewed.

Inclusion and exclusion criteria

Consistent with the NICE self-harm evidence update,11 studies were included if they used a cohort design, the optimal design for evaluating the diagnostic accuracy of scales, as case–control studies can overestimate diagnostic accuracy.12 Although suicide is an extremely important outcome following self-harm, the low base rate hinders predictive efforts even in high-risk populations.13 We focused on repeat self-harm or attempted suicide as an outcome, as the incidence rate is higher and the prediction of repetition may be more feasible than predicting suicide.14 Studies were included if measures of diagnostic accuracy (such as sensitivity, specificity and positive predictive values) were reported.

Studies were excluded if the scales were validated on a specific or restricted sample (eg, veterans, prisoners or a specialist mental healthcare population), or on a sample which did not include people presenting with self-harm or attempted suicide. One study15 recruited a mixed sample of people (presenting with suicidal ideation or self-harm), but since a majority of the sample (>75%) had a history of self-harm and the study outcome was self-harm repetition, this study was included.

Some tools were validated in more than one setting; these were included once in the final analysis, using the original paper, provided it met the inclusion criteria. We did this in order to gain an indication of the ‘best-case’ scenario for different instruments (the first study of a new screening tool in the setting where it was developed might be expected to give the most positive results) and because of the potential difficulty of combining measures of diagnostic accuracy from different settings. However, in order to contextualise results, we also examined the broader performance of scales which had been tested in multiple studies in a post hoc analysis.

Assessment of bias and study quality

Study bias was evaluated at the study level using the QUADAS (Quality Assessment of Diagnostic Accuracy Studies) and STARD (Standards for Reporting of Diagnostic Accuracy) guidelines.16 ,17

Statistical analysis

True positives, false positives, true negatives and false negatives were extracted from the papers by two researchers (LQ and JC) independently, and results discussed with the third author (NK). Authors were contacted where these data were unavailable.

We used a wide range of recommended diagnostic accuracy estimates18 ,19 to evaluate the predictive validity of the risk scales (box 1 and see online supplementary appendix 2), including sensitivity (proportion of individuals who repeat self-harm identified as high risk by the test); specificity (proportion of people who did not repeat self-harm identified as low risk by the test); positive predictive values (probability that a person identified by the test as high risk will actually go on to self-harm); negative predictive values (probability that a person identified as low risk will not go on to self-harm).

Box 1

Measures of diagnostic accuracy for scales following self-harm

Key terms

  • Sensitivity (Sens): Proportion of individuals who repeat self-harm identified as high risk by the test, that is, how well the test identifies patients who repeat self-harm

  • Specificity (Spec): Proportion of people who did not repeat self-harm identified as low risk by the test, that is, how well the test identifies patients who will not repeat self-harm

  • Positive predictive value (PPV): The probability that the person identified as at risk for repeat self-harm will actually repeat self-harm

  • Negative predictive value (NPV): The probability that the person identified as low risk for repeat self-harm will not actually repeat self-harm

  • Positive likelihood ratio (LR+): How much more likely a positive test result is to occur in a patient who repeats self-harm versus one who does not repeat

  • Negative likelihood ratio (LR−): How much less likely a negative test result is to occur in a patient who repeats self-harm compared to a patient who does not repeat

  • Diagnostic OR (DOR): Overall global measure of test performance and represents the strength of the association between the test result and repeat self-harm (interpreted the same as an OR)

  • Number allowed to diagnose (NAD): The number of people correctly classified as having a repeat self-harm episode before a misclassification occurs20
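To make these key terms concrete, the short sketch below shows how the four dual measures are derived from a 2×2 cross-tabulation of test result against observed repetition. It is a minimal illustration in Python; the counts are hypothetical and are not taken from any study included in this review.

```python
# Illustrative only: hypothetical counts, not data from any included study.
def dual_measures(tp, fp, fn, tn):
    """Dual accuracy measures (box 1) from a 2x2 table of test result vs repetition."""
    sensitivity = tp / (tp + fn)  # repeaters correctly identified as high risk
    specificity = tn / (tn + fp)  # non-repeaters correctly identified as low risk
    ppv = tp / (tp + fp)          # probability a 'high risk' result goes on to repeat
    npv = tn / (tn + fn)          # probability a 'low risk' result does not repeat
    return sensitivity, specificity, ppv, npv

# Hypothetical cohort: 1000 presentations, 150 (15%) followed by repeat self-harm.
sens, spec, ppv, npv = dual_measures(tp=140, fp=600, fn=10, tn=250)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.2f} NPV={npv:.2f}")
```

With these illustrative counts the scale is highly sensitive (0.93) but has a low positive predictive value (0.19), the pattern reported for several of the scales reviewed here.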

Positive and negative likelihood ratios (how much more or less likely test results are to occur in patients who repeat self-harm vs those who do not) were also calculated.19 Likelihood ratios of 1 indicate no change in likelihood of disease or outcome (in this case repeat self-harm). Positive likelihood ratios of 1–2, 2–5, 5–10 and >10 indicate minimal, small, moderate and large increases in risk, respectively.19 ,21 Negative likelihood ratios of 0.5–1.0, 0.2–0.5, 0.1–0.2 and <0.1 indicate minimal, small, moderate and large decreases in risk.21
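Expressed in terms of the dual measures defined above, the likelihood ratios take the standard forms (a general identity, not specific to any included study):

\[
\mathrm{LR}^{+} = \frac{\text{sensitivity}}{1 - \text{specificity}}, \qquad
\mathrm{LR}^{-} = \frac{1 - \text{sensitivity}}{\text{specificity}}
\]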

We also calculated global diagnostic statistics that summarise the diagnostic performance of a test as a single indicator,18 including the ‘number allowed to diagnose’ (the number of individuals who are correctly assigned as at high risk of repetition before one is misassigned),20 and the diagnostic OR18 (the odds of a positive test result in repeaters relative to the odds of a positive result in non-repeaters). Higher values indicate greater test discriminatory power.18 ,20
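The diagnostic OR can be written directly in terms of the cross-tabulated counts or the likelihood ratios (again a standard identity rather than anything specific to the included studies), which is why it behaves as a single global summary of both dual measures:

\[
\mathrm{DOR} = \frac{TP/FN}{FP/TN} = \frac{TP \times TN}{FP \times FN} = \frac{\mathrm{LR}^{+}}{\mathrm{LR}^{-}}
\]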

CIs for sensitivity and specificity were calculated using the Wilson score method without correction.22 CIs for positive and negative likelihood ratios were produced using the method of Simel et al.23 The CI for the diagnostic OR was produced using the method published by Armitage and Berry.24 CIs for ‘number allowed to diagnose’ were constructed using the method based on constant χ2 boundaries from Press et al.25 Results were not pooled due to heterogeneity across the studies.
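For reference, a minimal Python sketch of the Wilson score interval without continuity correction is given below; the counts are hypothetical and the function name is ours rather than taken from any statistical package used in the review.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval (no continuity correction) for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half_width, centre + half_width

# Hypothetical example: 140 of 150 repeaters correctly identified (sensitivity 0.93).
low, high = wilson_interval(140, 150)
print(f"95% CI: {low:.3f} to {high:.3f}")
```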

Stata V.13.0 (StataCorp. Stata Statistical Software: Release 13. College Station, Texas: StataCorp LP, 2013) and RevMan V.5.1 (Cochrane Collaboration)26 were used for statistical analysis.

Results

Search results

The NICE 2011 review on the longer term management of self-harm included seven cohort studies testing the predictive validity of risk scales for repeat self-harm.27–33 Four were excluded as they did not meet our inclusion criteria: they examined global measures rather than scales,28 ,32 were statistically derived without testing in a defined cohort,30 or used a restricted clinical population31 (figure 1). The NICE evidence update11 included one additional cohort study.34 The search strategy from January 2012 to February 2015 resulted in an additional 60 papers, of which three were relevant prospective cohort studies,15 ,35 ,36 and one additional cohort study14 was retrieved from related references (see figure 1). We also reran the searches for the earlier time periods; no additional studies were identified. In total, eight studies examining 11 scales were included in the final analysis (figure 1).

Figure 1

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram17 describing the search process for included studies. NICE, National Institute for Health and Care Excellence.

Description of studies

The methodological characteristics of the eight studies evaluating 11 scales are described in table 1. Further detailed information on bias and reporting is presented in online supplementary appendix 3. The studies were conducted between 2002 and 2014, and included 98 600 hospital presentations of self-harm or attempted suicide. In terms of service context, the studies were generally carried out across multiple sites, the majority in publicly funded health services. Four studies were based on self-harm emergency department populations.15 ,27 ,35 ,36 Randall et al15 included a mix of patients presenting with self-harm or suicidal ideation. One study was based on patients treated for self-poisoning.29 Two studies were based on hospital presentations for suicide attempts with suicidal intent as an inclusion criterion,33 ,34 and one study14 was based on patients admitted to a medical bed after self-harm. The length of follow-up ranged between 3 and 36 months, and outcome data was mostly ascertained through hospital databases. The incidence of repeat self-harm across studies ranged from 3%34 to 37%,33 possibly suggesting differences in casemix.

Table 1

Methodological characteristics of the studies

Four studies involved developing a tool which was then validated on a split site or external data set.14 ,27 ,35 ,36 The remainder were validation studies of existing scales.15 ,29 ,33 ,34 The scales varied in length, ranging from four items (Manchester Self-Harm Rule, ReACT Self-Harm Rule37) to 53 items for the Global Severity Index. Most scales included previous history of self-harm or suicide attempts or prior psychiatric treatment as items. Other scale items included personality factors (eg, Barratt Impulsivity Scale), clinical symptomatology (eg, Global Severity Index), drug misuse (eg, Drug Abuse Screening Test) and variations in symptoms associated with suicidal thoughts and behaviours (eg, Suicide Assessment Scale).

None of the studies were explicitly formatted according to standard guidelines (eg, STARD17) and reporting varied across the studies. For example, there were variations across studies in the reporting of recruitment flow34 and patient characteristics,29 ,33 ,34 cross-tabulations of raw data,14 ,33 ,34 ,36 CIs for diagnostic accuracy statistics,15 ,29 ,33 and use of thresholds (eg, Randall et al15 did not use any). The database studies14 ,27 ,35 ,36 were the most robustly reported according to STARD indices.

Diagnostic accuracy statistics

The full range of diagnostic accuracy statistics is presented in table 2. Figures 2 and 3 show forest plots for sensitivity and positive predictive values, respectively. Sensitivity (how well the test identifies people who repeat self-harm) ranged from 5.6% for the Repeated Episodes of Self-Harm scale14 using the threshold for the highest risk, to 97% for the Manchester Self-Harm Rule,27 95% for the ReACT Self-Harm Rule36 and 89% for the Södersjukhuset Self-harm Rule.35

Table 2

Diagnostic accuracy statistics with 95% CIs*

Figure 2

Forest plot of sensitivity and 95% CIs for individual scales. BIS, Barratt Impulsivity Scale; DAST, Drug Abuse Screening Test; ERRS, Edinburgh Risk of Repetition Scale; GSI, Global Severity Index; MSHR, Manchester Self-Harm Rule; MSPS, Modified SAD PERSONS Scale; ReACT, ReACT Self-Harm Rule; RESH, Repeated Episodes of Self-Harm score; SoSHR, Södersjukhuset Self-harm Rule; SUAS, Suicide Assessment Scale.

Figure 3

Forest plot of positive predictive values and 95% CIs for individual scales. BIS, Barratt Impulsivity Scale; DAST, Drug Abuse Screening Test; ERRS, Edinburgh Risk of Repetition Scale; GSI, Global Severity Index; MSHR, Manchester Self-Harm Rule; MSPS, Modified SAD PERSONS Scale; ReACT, ReACT Self-Harm Rule; RESH, Repeated Episodes of Self-Harm score; SoSHR, Södersjukhuset Self-harm Rule; SUAS, Suicide Assessment Scale.

Positive predictive values for the three high-sensitivity scales above were low (26%, 21% and 11%, respectively). Positive predictive values were highest for the Repeated Episodes of Self-Harm scale at the highest threshold (84%),14 followed by the Global Severity Index (73%)15 and the Drug Abuse Screening Test15 (figure 3). It should be noted that the Repeated Episodes of Self-Harm score was tested on inpatients admitted to hospital services for self-harm.14

Positive likelihood ratios ranged from 15.7 for the Repeated Episodes of Self-Harm scale14 at the highest threshold (indicating a large increase in the likelihood of repetition) to 1.0 for the Södersjukhuset Self-harm Rule35 and the Suicide Assessment Scale33 (indicating no change in the likelihood of repetition) (table 2). The diagnostic OR, which summarises the accuracy of a test as a single global indicator, was highest for the Repeated Episodes of Self-Harm scale at the highest threshold14 (16.34) and the Manchester Self-Harm Rule27 (10.77), and lowest (1.01) for the Södersjukhuset Self-harm Rule35 and the Suicide Assessment Scale33 (table 2).

Although the length of follow-up varied, there were no clear patterns in relation to the prediction of shorter versus longer term risk. As noted previously, there was a wide variation in the methodological characteristics of the studies and in the scales themselves.

Operational issues

Operational characteristics (ie, the time taken to administer the scale, technical specifications, ease of use, cost, staff training, user acceptability) are important to the clinical use of a scale and are listed in detail in table 3. Scales with characteristics which may need to be considered before their use include the Global Severity Index (copyright protected, costs associated with use, and a 53-item scale requiring training prior to use).37 The Drug Abuse Screening Test may also be of limited use for clinicians working with self-harm populations, as it is designed to assess drug-related problems.

Table 3

Scale operational factors

Discussion

Main findings

Risk scales are in widespread use in health services managing self-harm patients.7 We examined the diagnostic accuracy of a number of scales after self-harm and found wide variation in samples, follow-up, reporting and thresholds, as well as differences in the content of the scales themselves. This heterogeneity was reflected in the variation in predictive accuracy across scales. For example, the Manchester Self-Harm Rule was high in sensitivity (97%) but had a low positive predictive value (22%).27 Conversely, the Drug Abuse Screening Test had low sensitivity (15%) but a high positive predictive value (98%).15 Scales which scored highly on global measures of diagnostic accuracy included the Repeated Episodes of Self-Harm scale at the highest threshold14 (16.34), the Manchester Self-Harm Rule27 (10.77), the Drug Abuse Screening Test15 (8.66) and the Barratt Impulsivity Scale15 (8.25), but even these scales varied markedly in their sensitivity, from 6% for the Repeated Episodes of Self-Harm scale14 to 97% for the Manchester Self-Harm Rule.27

Methodological limitations

We did not conduct any meta-analyses due to the heterogeneity of the studies, nor did we calculate receiver operating characteristic curves for the scales, as we did not have the raw interval data. However, we provided a range of diagnostic accuracy statistics and associated CIs, which are useful in the critical evaluation of risk scales following self-harm. Some scales were tested in several settings, and we made no attempt to pool accuracy statistics across studies. Instead, we focused on a single study for each scale: the original study, where this met the inclusion criteria. We did this in order to gain an indication of scale performance under potentially optimal conditions and because of the difficulty in pooling results from different settings.

Two scales in particular had been tested in multiple studies and settings (Edinburgh Risk of Repetition Scale and the Manchester Self-Harm Rule).9 Sensitivities for the Edinburgh Risk of Repetition Scale ranged from 26% to 41%, and specificities ranged from 84% to 91% in an early study.44 A further validation study conducted in Australia provided similar results (sensitivity: 26%, specificity: 84%).29 Broadly similar results were found in Oxford,45 for the Edinburgh Risk of Repetition Scale, but sensitivities were lower when tested on a 12-month rather than a 6-month follow-up, and ranged from 3% to 16%.45

The Manchester Self-Harm Rule was validated in Sweden,35 Manchester28 ,36 and Canada.15 The results were similar to those of Cooper et al27 in demonstrating the high sensitivity (94%, 94%, 98% and 95.1% for the studies, respectively) and low specificity (18%, 26%, 17% and 14.7%, respectively) of the scale.

We were keen to replicate the searches carried out as part of UK national guidance as far as possible. In some senses, the current paper was intended as an update of the review carried out as part of the NICE self-harm (longer term management) guidelines, and we were constrained by the original methodology. Some well-known scales were not included in the NICE review6 ,11 on the basis of the prespecified inclusion criteria (for example, because they did not explicitly report diagnostic accuracy outcomes), and therefore did not find their way into the current paper.44 ,46 Data in the papers45 ,46 (sensitivities ranging from 5.3% to 14.6% and specificities ranging from 93% to 97%) and from subsequent reviews9 indicate that, in any case, these older studies and scales did not have superior results to those described in our study. Inclusion of these additional scales would not have changed our findings. Although we used a published search strategy,6 ,11 it is possible that additional scales were excluded by the search criteria, and that publication bias affected the included studies, as some studies with negative results may not be widely accessible.

We considered the performance of these scales only in relation to people who self-harmed rather than the wider general or clinical population. However, this is an important clinical group, and in many settings risk scales are an intrinsic part of their management. Our main outcome was repeated self-harm or attempted suicide rather than suicide. While suicide is extremely important, because it is a relatively low-frequency event, it is much harder to predict. This is reflected in the poorer performance of scales in relation to suicide than repeat self-harm as outlined in UK guidance.6 Only two of the studies included in this review also reported suicide outcomes.27 ,36 The Manchester Self-Harm Rule identified 100% of the 22 suicide deaths that occurred within the 6-month follow-up period.27 The ReACT Rule identified 60 of the 66 suicide deaths (91%) in the derivation data set and 23 of 26 (88%) in the test data set within 6 months of the index episode.36 These results indicate high sensitivity, but this is once again at the expense of low specificity and poor positive predictive value. Two other studies combined suicide and repeat self-harm as an outcome,14 ,33 and deaths by suicide were not included in the remaining studies.15 ,29 ,34 ,35

Clinical implications

What is the most useful scale following self-harm?

The usefulness of a scale depends on multiple factors. The scales are not directly comparable because of differences in the incidence of repeat self-harm across studies and in methodological quality. Many of the studies were conducted in high-income countries in centrally funded health services,15 ,27 ,33–36 and so the findings may not be applicable to different settings. The Repeated Episodes of Self-Harm scale was developed on an inpatient sample, so its performance is unlikely to be transferable to emergency department services. The performance of the scales may additionally be influenced by cultural context. For example, the Barratt Impulsivity Scale15 ,40 was developed in the USA, and the terminology of some of the items may reduce the performance of the scale in other cultures (eg, ‘I squirm at plays or lectures’). There is also a challenging balance to strike when selecting scales based on diagnostic accuracy statistics, and no scale performed well across all indices.

Global indicators such as the diagnostic OR express the strength of the association between the test result and repeat self-harm and are readily interpreted by clinicians. False-positive and false-negative results are equally weighted, which is advantageous for research and meta-analyses, but may limit clinical use as clinicians cannot evaluate the separate contributions of sensitivity and specificity.18 The scales which had the highest global diagnostic ORs were the Repeated Episodes of Self-Harm scale at the highest threshold (16.34) and the Manchester Self-Harm Rule (10.77).14 ,27

The balance between sensitivity and specificity is dependent on various factors such as resources, the purpose of the test and the stage of treatment. Clinicians may prefer a test high in sensitivity to capture as many repeat self-harm episodes as possible, for example, the Manchester Self-Harm Rule27 or the ReACT Self-Harm Rule.36 Highly sensitive tests are sometimes used to screen patients or can assist in ‘ruling out’ patients as the possibility of a false negative is relatively low.19 The Manchester Self-Harm Rule was also validated in other prospective cohort studies and similar sensitivities and specificities were reported.15 ,28 ,35 ,36 However, the ReACT Self-Harm Rule36 and Manchester Self-Harm Rule27 have poor specificity and positive predictive values, and there is a possibility that many patients could be false positives (ie, incorrectly labelled as at high risk), which has cost and resource implications.47

Scales high in specificity, such as the Repeated Episodes of Self-Harm scale at the highest threshold,14 may be useful for a later stage of assessment or where the resulting treatments are expensive, medically invasive or burdensome to patients. Scales high in specificity can also be used to ‘rule in’ patients, as the number of false positives is low (so people labelled as at high risk are quite likely to be at high risk). However, the clinical utility of highly specific scales may be limited because of the small number of patients who screen positive, and because the high risk of the patients who reach the threshold is already fairly obvious on the basis of conventional clinical risk factors (eg, for the Repeated Episodes of Self-Harm scale at the highest threshold, the small number of patients who have multiple prior episodes of self-harm, a psychiatric diagnosis and recent psychiatric hospitalisation are clearly at elevated risk14). The sensitivities of such scales in this review were poor, and there is a possibility of false negatives (people being labelled as at low risk when they are actually at high risk).

Clinicians might consider scales with high positive predictive values, such as the Repeated Episodes of Self-Harm scale at the highest threshold, as positive predictive values are a measure of the probability that an individual identified as at high risk actually goes on to repeat self-harm. However, positive predictive values are affected by how common the outcome is, which affects their transferability to clinical settings with a different incidence of repeat self-harm. The scales with high positive predictive values (eg, the Repeated Episodes of Self-Harm scale at the highest threshold14 and the Global Severity Index15) were also low in sensitivity, which is a further consideration when evaluating the usefulness of scales for clinical practice.
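To illustrate this prevalence dependence with Bayes' theorem (hypothetical figures, not values taken from the included studies), consider a scale with sensitivity 0.90 and specificity 0.70:

\[
\mathrm{PPV} = \frac{\mathrm{sensitivity} \times \mathrm{prevalence}}{\mathrm{sensitivity} \times \mathrm{prevalence} + (1 - \mathrm{specificity})(1 - \mathrm{prevalence})}
\]

At a repetition incidence of 15% the positive predictive value is approximately 0.35, but at an incidence of 5% it falls to approximately 0.14, even though the scale itself is unchanged.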

Scales can also be evaluated using likelihood ratios (the probability of a given test result among people who repeat self-harm divided by the probability of the same result among people who do not repeat self-harm), and these are widely used in evidence-based medicine.48 They are advantageous in evaluating scales, as information from both sensitivity and specificity is used, they are not affected by prevalence, and they are fairly easy to interpret (eg, a positive likelihood ratio >10 indicates a useful test). The Repeated Episodes of Self-Harm scale at the highest threshold had the highest positive likelihood ratio (15.7),14 which indicates that the highest risk threshold is useful in predicting repeat self-harm, but it had low sensitivity (6%), which limits the scale for screening purposes. There are limitations to the use of likelihood ratios in clinical practice: the estimation of baseline risk may depend on clinical experience, accurate estimates of prevalence and familiarity with expressing risk in terms of probabilities.
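In practice, a likelihood ratio updates a clinician's pre-test (baseline) probability through odds, as in the standard calculation below; the baseline figure of 0.15 is an assumption for illustration only:

\[
\mathrm{odds}_{\mathrm{post}} = \frac{p_{\mathrm{pre}}}{1 - p_{\mathrm{pre}}} \times \mathrm{LR}, \qquad
p_{\mathrm{post}} = \frac{\mathrm{odds}_{\mathrm{post}}}{1 + \mathrm{odds}_{\mathrm{post}}}
\]

With a baseline repetition risk of 0.15, a positive result on a scale with a positive likelihood ratio of 15.7 would raise the estimated probability of repetition to roughly 0.73, whereas a likelihood ratio of 1.0 would leave it unchanged.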

Clinicians may prefer to use a scale for predicting completed suicide, but scales which do so are perhaps more likely to have high sensitivity and be over-inclusive (eg, the Manchester Self-Harm Rule27 and the ReACT Self-Harm Rule36). Only two of the studies in this review27 ,36 evaluated suicide separately as an outcome, and the predictive utility of the scales for suicide needs to be investigated further.

We were unable to examine the usefulness of the scales in predicting shorter versus longer term risk of self-harm repetition due to the heterogeneity of the scales and of the studies' methodological characteristics. The use of scales to predict shorter versus longer term risk is clinically important and should be investigated further using prospective cohort studies.

Conclusion

On the basis of our review, no scale appears to perform sufficiently well to be used routinely. The limitations of risk scales in clinical practice are well documented, and it has been suggested that the clinical focus should be on ‘conducting comprehensive clinical assessments of each patient’s situation and needs’ rather than on categorising patients as high or low risk (p.463).50–52 The focus on risk assessment can detract from the therapeutic relationship,53 and studies have reported that patients and staff can find assessments with scales an adverse experience.8 However, risk scales continue to be widely used in self-harm services, with hospitals commonly developing local instruments.7 Traditional paradigms which simply aim to balance sensitivity against specificity may be of limited usefulness in the development of risk scales for use following self-harm. Future research should involve head-to-head comparisons, which may have more validity than comparing scales used in different patient groups across different settings. Future studies need to determine the predictive accuracy of risk scales using robust cohort designs that are clearly reported according to STARD criteria.54 Until then, it is difficult to evaluate which instruments are the most useful and, in line with clinical guidance, scales should not be used in isolation to determine management or to predict risk of future self-harm.6

Acknowledgments

The authors would like to thank Dr David While for his statistical assistance and the authors of papers who provided us with additional data.

References

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • Contributors NK and JC designed the study with input from DG, KH and LD. LQ analysed the data with assistance from NK and JC. LQ, NK and JC interpreted the results and wrote the first draft. All authors contributed to subsequent drafts and have approved the final version of the manuscript.

  • Funding This paper presents independent research funded by the National Institute for Health Research (NIHR) under its Programme Grants for Applied Research Programme (grant reference number RP-PG-0610-10026).

  • Disclaimer The views expressed are those of the authors and not necessarily those of the NHS, the National Institute for Health Research or the Department of Health.

  • Competing interests DG, KH and NK are members of the Department of Health's (England) National Suicide Prevention Advisory Group. NK chaired the NICE guideline development group for the longer term management of self-harm, the NICE Topic Expert Group (which developed the quality standards for self-harm services), the NICE evidence update for self-harm, and is a chair of the NICE guideline for Depression. KH and DG are NIHR Senior Investigators. KH is also supported by the Oxford Health NHS Foundation Trust and NK by the Manchester Mental Health and Social Care Trust. NK, KH and JC are authors of some of the papers and scales included in this review.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.