Intertester reliability of clinical shoulder instability and laxity tests in subjects with and without self-reported shoulder problems
  1. Henrik Eshoj1,
  2. Kim Gordon Ingwersen2,3,
  3. Camilla Marie Larsen2,4,
  4. Birgitte Hougs Kjaer2,5,
  5. Birgit Juul-Kristensen2,6
  1. Department of Haematology, Quality of Life Research Center, Odense University Hospital, Odense, Denmark
  2. Department of Sports Science and Clinical Biomechanics, University of Southern Denmark, Odense, Denmark
  3. Research Unit at Department of Physical Therapy and Occupational Therapy, Hospital Lillebaelt–Vejle Hospital, Vejle, Denmark
  4. Health Sciences Research Centre, University College Lillebaelt, Odense, Denmark
  5. Research Unit at Department of Physical Therapy and Occupational Therapy, Bispebjerg and Frederiksberg University Hospital, Copenhagen, Denmark
  6. Department of Health Sciences, Institute of Occupational Therapy, Physiotherapy and Radiography, Bergen University College, Bergen, Norway

Correspondence to Birgit Juul-Kristensen; bjuul-kristensen{at}


Objective First, to investigate the intertester reliability of clinical shoulder instability and laxity tests, and second, to describe the mutual dependency of each test evaluated by each tester for identifying self-reported shoulder instability and laxity.

Methods A standardised protocol for conducting reliability studies was used to test the intertester reliability of six clinical shoulder instability and laxity tests: apprehension, relocation, surprise, load-and-shift, sulcus sign and Gagey. Cohen’s kappa (κ) with 95% CIs and prevalence-adjusted and bias-adjusted kappa (PABAK), which accounts for insufficient prevalence and bias, were computed to establish the intertester reliability and mutual dependency.

Results Forty individuals (13 with self-reported shoulder instability and laxity-related shoulder problems and 27 normal shoulder individuals) aged 18–60 were included. Fair (relocation), moderate (load-and-shift, sulcus sign) and substantial (apprehension, surprise, Gagey) intertester reliability were observed across tests (κ 0.39–0.73; 95% CI 0.00 to 1.00). PABAK improved reliability across tests, resulting in substantial to almost perfect intertester reliability for the apprehension, surprise, load-and-shift and Gagey tests (κ 0.65–0.90). Mutual dependencies between each test and self-reported shoulder problem showed apprehension, relocation and surprise to be the most often used tests to characterise self-reported shoulder instability and laxity conditions.

Conclusions Four of the six tests (apprehension, surprise, load-and-shift and Gagey) were considered intertester reliable for clinical use, while the relocation and sulcus sign tests need further standardisation before their reliability can be considered acceptable. Furthermore, the validity of the tests for shoulder instability and laxity needs to be studied.

  • reliability
  • shoulder
  • instability
  • laxity
  • clinical test

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:

Strengths and limitations of this study

  • A strength of the study is the use of a three-phased standardised study protocol.

  • Presentation of raw findings increases transparency and interpretation of study findings.

  • No valid gold standard for including shoulder instability and laxity subjects was used.

  • A 50/50 prevalence of positive and negative tests for all six tests was not accomplished.


Introduction

Shoulder complaints, which affect shoulder-related quality of life (QoL), are frequent and may be caused by shoulder instability and/or laxity1 due to traumatic or non-traumatic injuries to the shoulder joint.2 Traumatic shoulder instability is mainly caused by a high-impact injury during sports participation, resulting in a shoulder dislocation, predominantly in the anterior direction.3 Non-traumatic shoulder instability is usually related to repetitive overhead activities and/or seen in patients with generalised joint hypermobility or glenohumeral hyperlaxity, and is often referred to as multidirectional shoulder instability.2 4 5

Irrespective of aetiology, shoulder instability and laxity are often accompanied by a variety of symptoms, including shoulder discomfort and pain as well as glenohumeral subluxations and/or repeated dislocations.6–8 Clinically, shoulder instability and laxity are diagnosed and verified by a group of shoulder pain and instability provoking/relief tests, supplemented by shoulder laxity tests.9 10 The former usually comprise the anterior shoulder instability tests (apprehension, relocation and surprise), and the latter the load-and-shift, sulcus sign and Gagey tests.11–13 An ongoing discussion concerns the use of pain as a diagnostic criterion for anterior shoulder instability with the apprehension, relocation and surprise tests.14–16 On the one hand, pain may be a confounding factor, since it has been shown to be less predictive and reliable as a diagnostic criterion.14 On the other hand, others have suggested that unrecognised, underlying glenohumeral instability may lead to repetitive microtrauma and painful shoulder conditions,15 16 justifying pain as a diagnostic criterion when testing for anterior shoulder instability.

Nonetheless, symptoms may become chronic and lead to reduced work and sports capability,17–19 with exercise-based management being the most often recommended first-choice treatment.20 21 Hence, early diagnosis using reliable and accurate clinical tests to guide focused treatment is essential. However, the few studies that have investigated the reliability of clinical shoulder instability and laxity tests show large variations in reliability and have limited methodological quality, hampering interpretation and comparison with other studies.14 22 23

Therefore, the objective of this study was, first, to investigate the intertester reliability of commonly used clinical shoulder instability and laxity tests and, second, to describe the mutual dependency for each test evaluated by each tester, in a group of sports-active individuals with and without self-reported shoulder problems.

Materials and methods

Study design

An intertester reliability study was conducted involving two physiotherapists as intertester examiners. A third physiotherapist (study coordinator), not involved in the actual intertester reliability study (test phase), managed all practical aspects during the study period. The Guidelines for Reporting Reliability and Agreement Studies, a consensus document on how to report reliability and agreement studies, were followed.24 A standardised protocol for reliability studies, consisting of three phases (preparation and training of clinical tests, overall agreement, and test phase (the actual reliability study)), was applied.25 Two early career physiotherapists with 6 months of clinical experience were involved in the intertester reliability study. A test protocol describing each clinical test was developed and subsequently used by the two testers to practise all tests in order to reach uniformity and mutual agreement in performing and interpreting each test. In the overall agreement phase, the two testers examined 19 individuals (8 affected shoulders and 11 normal shoulders). The two testers were mutually blinded to the health status of the individuals (affected shoulders vs normal shoulders) and also to each other’s test results. Before proceeding to the final study phase, the two testers needed an overall agreement of at least 80% based on findings from the six clinical shoulder tests.25 In the actual intertester reliability test phase, the two testers examined a new group of individuals with affected or normal shoulders using the six clinical shoulder tests. The procedure was the same as in the agreement phase, meaning that testers were blinded to the health status of the individuals and to each other’s test results.
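The 80% criterion of the overall agreement phase can be illustrated with a minimal sketch. It assumes that overall agreement is the proportion of identical paired ratings pooled across tests, which the protocol description above does not spell out; the function names and the ratings are hypothetical and are not study data.

```python
def overall_agreement(ratings_a, ratings_b):
    """Proportion of paired ratings on which the two testers agree."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def may_proceed(ratings_a, ratings_b, threshold=0.80):
    """True if agreement reaches the threshold required before the test phase."""
    return overall_agreement(ratings_a, ratings_b) >= threshold


# Hypothetical ratings (True = positive test), pooled over tests and subjects.
tester_a = [True, False, False, True, False, False, True, False, False, False]
tester_b = [True, False, True, True, False, False, True, False, False, False]

print(overall_agreement(tester_a, tester_b))
print(may_proceed(tester_a, tester_b))
```

In this made-up example the testers disagree on one of ten ratings, so the criterion would be met.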

Study subjects

A sample size of at least 40 individuals was targeted based on recommendations for performing clinical reliability studies.25 Sixty-five individuals (women and men aged 18–60 years) were recruited and screened for eligibility from Metropolitan University College, Copenhagen, and Bispebjerg and Frederiksberg University Hospital, Copenhagen. Of these, 13 individuals with instability-related and/or laxity-related shoulder problems (hereinafter referred to as shoulder affected) and 27 individuals with normal shoulders were included.

Individuals answering yes to at least one of two questions (“Do you have a sense of shoulder instability?” and “Have you ever had a shoulder injury?”) were eligible for a clinical shoulder examination performed by the study coordinator. They were then included as shoulder affected if they presented with at least one positive clinical shoulder test out of the following: apprehension, relocation, surprise, load-and-shift, sulcus sign or Gagey. Individuals with normal shoulders were recruited through public advertisements followed by a telephone interview and included if they presented with no self-reported shoulder pathology or complaints. Individuals with prior shoulder surgery were excluded. In the actual test phase, individuals completed a short questionnaire with basic demographic details (age, gender, weight, height), in addition to the following: pain level during rest and activity (Numeric Pain Rating Scale),26 shoulder injury ever (yes/no), subjective shoulder instability (yes/no) and sports-related activity (hours/week). Further, all individuals completed the patient-reported Western Ontario Shoulder Instability (WOSI) questionnaire designed to measure shoulder function and QoL in patients with shoulder instability and laxity symptoms.27 The time period between each test phase was approximately 2 weeks, and new subjects were included for each phase. Only the study phase is reported in the current manuscript. The study was exempt from notification to the Danish Health Research Study Board due to its non-invasive and non-treating design. However, oral and written consent was obtained from all individuals, and ethical guidelines were followed according to the Declaration of Helsinki.28

Clinical tests

The clinical shoulder tests consisted of three shoulder joint-provoking tests for anterior shoulder instability (apprehension, relocation and surprise) besides three shoulder laxity tests (load-and-shift, sulcus sign and Gagey) (table 1).11 13 14 22 23 29

Table 1

Performance and evaluation of the clinical shoulder instability and laxity tests 

The apprehension test (table 1, figure 1) was positive if glenohumeral apprehension and/or pain were evoked during testing, whereas relief of symptoms with the relocation test (table 1, figure 2) was regarded as a positive test. As for the apprehension test, the surprise test (table 1, figure 3) was positive if glenohumeral apprehension and/or pain were evoked during testing. The load-and-shift test (table 1, figures 4 and 5) was rated on a four-point scale ranging from 0 to 3 (best to worst; 0=little glenohumeral movement; 3=humeral head moves beyond the glenoid rim and remains dislocated).12 Also, to enhance mutual agreement between testers when performing the load-and-shift test, only the direction (anterior vs posterior) with most glenohumeral head translation was scored. Sulcus sign (table 1, figure 6) was measured in centimetres (continuous scale) with a small ruler and graded according to previously used grading scales as follows: I (<1 cm translation), II (1–2.0 cm translation) or III (>2.0 cm translation).29 Finally, the Gagey test (table 1, figure 7) was rated as positive with passive abduction above 105°.13
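The scoring rules described above can be sketched in a few lines. This is an illustrative reading, not the study's test protocol: the function names are hypothetical, and scoring "only the direction with most translation" is interpreted here as taking the higher of the anterior and posterior grades, which is an assumption.

```python
def sulcus_grade(translation_cm):
    """Grade the sulcus sign: I (<1 cm), II (1-2.0 cm), III (>2.0 cm)."""
    if translation_cm < 0:
        raise ValueError("translation cannot be negative")
    if translation_cm < 1.0:
        return "I"
    if translation_cm <= 2.0:
        return "II"
    return "III"


def gagey_positive(passive_abduction_deg):
    """The Gagey test is positive with passive abduction above 105 degrees."""
    return passive_abduction_deg > 105


def load_and_shift_score(anterior_grade, posterior_grade):
    """Score only the direction with most translation (assumed: higher grade)."""
    for grade in (anterior_grade, posterior_grade):
        if grade not in (0, 1, 2, 3):
            raise ValueError("load-and-shift grade must be 0, 1, 2 or 3")
    return max(anterior_grade, posterior_grade)


print(sulcus_grade(1.5), gagey_positive(110), load_and_shift_score(1, 3))
```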

Figure 4

Load-and-shift—anterior direction.

Figure 5

Load-and-shift—posterior direction.


Statistics

Demographics and descriptive data were tested for normality by visual inspection of histograms and the Shapiro-Wilk test. Group differences (affected shoulders vs normal shoulders) were tested by Fisher’s exact test for categorical variables, whereas Student’s t-test and the Mann-Whitney U test were used for normally and non-normally distributed data, respectively.
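For readers unfamiliar with Fisher's exact test, a minimal standard-library sketch of the two-sided version for a 2×2 table follows. The study itself used SPSS; the function name is hypothetical, and the example counts (9 of 13 affected vs 3 of 27 normal individuals reporting subjective instability) are back-calculated from the percentages reported in the results, which is an assumption rather than data taken from the tables.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Assumed counts: rows = affected / normal, columns = instability yes / no.
print(fisher_exact_two_sided(9, 4, 3, 24))
```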

The apprehension, relocation, surprise and Gagey tests were dichotomous variables, whereas the load-and-shift and sulcus sign tests were dichotomised to also allow for nominal statistics. Thus, load-and-shift was rated positive when scored 2 or 3, while sulcus sign was rated positive when the measurement exceeded 1 cm.29 For transparency, data from each test are presented in 2×2 contingency tables, and McNemar’s test was used to test for significant between-tester differences. Furthermore, observed and expected agreements are presented along with prevalence and bias indexes.30 Reliability was evaluated with Cohen’s kappa (κ) coefficients including 95% CIs.25 Since kappa is sensitive to imbalances in prevalence and bias (eg, if a 50/50 distribution of positive and negative tests cannot be accomplished), prevalence-adjusted and bias-adjusted kappa (PABAK) is a valid supplement to the original kappa values.30 31 By definition, PABAK reflects the ideal situation, thereby accounting for the variation in prevalence and bias between testers that occurs in the ‘real’ world.32 PABAK is calculated by adjusting for high or low prevalence, substituting the average of cells a and d in a cross table for the actual values in those cells. Similarly, an adjustment for bias is achieved by substituting the mean of cells b and c for those actual cell values.30 Finally, for each tester, the relationship between the individual tests and the classification by self-reported shoulder problems (mutual dependency) was tested by Cohen’s kappa (κ) coefficients, and the characterisation of the groups was tested with Fisher’s exact tests.
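The calculations described above can be sketched as follows: dichotomise the graded tests, build the 2×2 table, and derive observed and expected agreement, kappa and PABAK. The function names and example scores are hypothetical. The prevalence and bias indexes are computed here as |a−d|/n and |b−c|/n, a commonly used definition; the indexes reported in the article's tables may be defined differently, so this is an assumption. For two categories, substituting the cell averages as described reduces PABAK to 2 × observed agreement − 1.

```python
def load_and_shift_positive(score):
    """Dichotomise the 0-3 grade: a score of 2 or 3 is positive."""
    if score not in (0, 1, 2, 3):
        raise ValueError("load-and-shift grade must be 0, 1, 2 or 3")
    return score >= 2


def sulcus_positive(translation_cm):
    """Dichotomise the sulcus sign: positive when exceeding 1 cm."""
    return translation_cm > 1.0


def contingency(ratings_a, ratings_b):
    """Return cells (a, b, c, d): both +, A+ B-, A- B+, both -."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("rating lists must have equal length")
    pairs = list(zip(ratings_a, ratings_b))
    a = sum(1 for x, y in pairs if x and y)
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if not x and y)
    d = sum(1 for x, y in pairs if not x and not y)
    return a, b, c, d


def kappa_summary(a, b, c, d):
    """Agreement statistics for one 2x2 contingency table."""
    n = a + b + c + d
    if n == 0:
        raise ValueError("empty table")
    po = (a + d) / n  # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # chance agreement
    kappa = (po - pe) / (1 - pe) if pe < 1 else float("nan")
    return {
        "observed_agreement": po,
        "expected_agreement": pe,
        "kappa": kappa,
        "pabak": 2 * po - 1,
        "prevalence_index": abs(a - d) / n,  # assumed definition
        "bias_index": abs(b - c) / n,  # assumed definition
    }


# Hypothetical load-and-shift grades from two testers (not study data).
scores_a = [0, 1, 2, 3, 0, 1, 0, 2]
scores_b = [0, 2, 2, 3, 0, 1, 1, 1]
table = contingency(
    [load_and_shift_positive(s) for s in scores_a],
    [load_and_shift_positive(s) for s in scores_b],
)
summary = kappa_summary(*table)
print(table, summary)
```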

The classification system proposed by Landis and Koch was used to interpret reliability as follows: 0.00–0.20 (Slight); 0.21–0.40 (Fair); 0.41–0.60 (Moderate); 0.61–0.80 (Substantial) and 0.81–1.00 (Almost perfect).33
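As a small illustration, the bands above can be expressed as a lookup. The function name is hypothetical; the "Poor" label for negative values comes from the original Landis and Koch scale and is not listed in the text above, and assigning values that fall between the printed bands (eg, 0.205) to the lower band is an assumption.

```python
def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch reliability label."""
    if kappa > 1:
        raise ValueError("kappa cannot exceed 1")
    if kappa < 0:
        return "Poor"  # band below zero, not listed in the text above
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost perfect"


for value in (0.39, 0.43, 0.73, 0.90):
    print(value, landis_koch(value))
```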

Statistical Package for the Social Sciences (SPSS, Chicago, Illinois, USA), V.22, was used for all statistical analyses, with P<0.05 interpreted as significant.


Results

Characteristics of the participating individuals are presented in table 2. Demographics showed no difference between the individuals with affected shoulders (n=13) and normal shoulders (n=27). Furthermore, both groups were relatively active, with 92% and 74% (P=0.18) participating in sports-related activity for more than 4 hours per week. However, as expected due to the design, affected shoulders had significantly higher pain during activity (4.23 vs 1.44; P=0.02), higher frequency of shoulder injury ever (62% vs <1%; P<0.001), higher subjective shoulder instability (69% vs 11%; P<0.001) and worse total WOSI score (506 vs 136; P=0.001) (table 2).

Table 2

Participant characteristics, study phase

Prevalence of positive tests was especially low for the load-and-shift test (table 3), and significant between-tester differences were found for relocation and sulcus sign tests (P=0.021) (not shown in tables).
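McNemar's test for between-tester differences uses only the discordant pairs (cells b and c of the contingency table). A minimal sketch of the exact (binomial) version is given below; whether SPSS applied the exact or the chi-square version in this study is not stated, so the exact version is an assumption, and the function name and counts are hypothetical rather than taken from table 3.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value from the discordant cells b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial tail probability under H0: discordant pairs split 50/50.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical example: 10 discordant pairs split 9 vs 1 between testers.
print(round(mcnemar_exact(9, 1), 3))
```

A lopsided split of the discordant pairs, as in this made-up example, is what indicates systematic bias between testers.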

Table 3

Contingency tables with findings from tester A and B

Reliability ranged from κ 0.39 to 0.73 (95% CI 0.00 to 1.00), indicating fair (relocation; κ 0.39), moderate (load-and-shift, sulcus sign; κ 0.43 and 0.48) and substantial (apprehension, surprise, Gagey; κ 0.65–0.73) reliability (table 4). The prevalence index of all six tests ranged from 0.05 to 0.44 (lowest for load-and-shift, relocation and sulcus sign; 0.05, 0.28 and 0.30, respectively), whereas the bias index ranged from 0.03 to 0.20 (highest for relocation and sulcus sign). PABAK improved reliability for the relocation, load-and-shift, sulcus sign and Gagey tests, now corresponding to moderate (relocation and sulcus sign; κ 0.50), substantial (Gagey; κ 0.80) and almost perfect (load-and-shift; κ 0.90) reliability (table 4).
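Because PABAK for a two-category rating equals 2 × observed agreement − 1, the reported PABAK values can be translated back into the observed agreement they imply. The sketch below does this for the three values quoted above; the function name is hypothetical, and the implied number of concordant pairs assumes that all 40 individuals were rated on each test.

```python
def observed_agreement_from_pabak(pabak):
    """Invert PABAK = 2 * Po - 1 to recover the observed agreement Po."""
    if not -1 <= pabak <= 1:
        raise ValueError("PABAK must lie between -1 and 1")
    return (pabak + 1) / 2


N_SUBJECTS = 40  # assumed number of rated individuals per test

for label, pabak in (("relocation/sulcus sign", 0.50),
                     ("Gagey", 0.80),
                     ("load-and-shift", 0.90)):
    po = observed_agreement_from_pabak(pabak)
    print(label, round(po, 2), round(po * N_SUBJECTS))
```

Under these assumptions, PABAK values of 0.50, 0.80 and 0.90 correspond to observed agreement of 75%, 90% and 95%, respectively.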

Table 4

Reliability of six clinical shoulder instability and laxity tests

The κ values for mutual dependency indicate that apprehension, relocation and surprise tests for both examiners were the most frequently used tests for characterising self-reported shoulder problems (table 5). This was further confirmed by the significant group difference in the presence of positive tests.

Table 5

Kappa statistics for mutual dependency of the individual tests and self-reported shoulder problems for each tester


Discussion

The intertester reliability across the selected six clinical shoulder instability and laxity tests ranged from fair to substantial. Use of PABAK calculations improved intertester reliability to substantial or almost perfect for most tests, except for the relocation and sulcus sign tests. The tests most often used to characterise self-reported shoulder instability and laxity (mutual dependency) were the apprehension, relocation and surprise tests.

The intertester reliability for the apprehension, relocation and surprise tests was higher than, or equivalent to, previously reported results for these tests using the same diagnostic procedures (apprehension and/or pain).23 Specifically for the apprehension and surprise tests, the present κ values were somewhat higher than previously reported (0.65 vs 0.44–0.45). The reason for this may be that the current study included both affected and normal shoulder individuals as opposed to only including symptomatic subjects.23 This may have increased subject variation, which is known to affect reliability positively. Also, PABAK calculations did not affect the overall reliability of the apprehension and surprise tests, probably due to an optimal prevalence index of positive and negative tests (close to 0.50). For the relocation test, the present intertester reliability was similar to, although slightly lower than, that previously reported (κ 0.39 vs 0.44).23 Apparently, the primary reason for the current poor reliability of the relocation test was the presence of systematic bias between testers, as indicated by the raw data (contingency tables) and the statistically significant between-tester difference. Likewise, systematic bias between testers was also found for the sulcus sign test in the present study. Hypothetically, this may be explained by intertester variability in the force produced to translate the humeral head in the posterior (relocation test) or inferior (sulcus sign test) direction. This is, however, only speculative, and further studies are needed to standardise these tests.

Reliability for the present sulcus sign test was slightly lower than previously reported (κ 0.39 vs >0.50).22 23 The discrepancy in reliability observed may be due to the use of different test positions with participants in the current study sitting upright29 as opposed to a previous lying test position.22 However, due to the presence of systematic bias in both the relocation and sulcus sign test, PABAK did not affect the overall reliability much.

For the load-and-shift test, reliability was relatively low (with a wide CI). This may be due to the current low prevalence index, well below the 50% that is optimal in reliability studies.25 The present dichotomous rating of the load-and-shift test (meaning that only individuals in whom the shoulder could be either subluxated or dislocated during testing were deemed positive) may have largely influenced the prevalence of positive tests. Consequently, using PABAK, reliability of the load-and-shift test improved considerably (from moderate to almost perfect). Nevertheless, different statistics (kappa vs intraclass correlation coefficients), different scoring systems (dichotomous rating (positive yes/no) versus four-point grading scale (0–3))23 and inclusion of shoulder asymptomatic athletes only22 make comparison across studies difficult.

Finally, reliability of the Gagey test was substantial and PABAK did not affect reliability much due to a nearly optimal prevalence and low bias between testers. Unfortunately, there is no other study to compare with.

Although the current study was designed to investigate reliability, and not diagnostic accuracy, the mutual dependency between the individual tests and self-reported shoulder problems was analysed. This revealed that the tests most often used to characterise those with and without self-reported shoulder instability and laxity were the apprehension, relocation and surprise tests. This may indicate a relationship between these tests, which comes as no surprise, since the relocation and surprise tests are a continuation of the apprehension test and, thus, closely related.9 Nevertheless, for clinicians it is of interest to specify the clinical characteristics of patients with self-reported shoulder problems. Thus, the current prevalence of positive tests may mirror these characteristics of the included patients and should be taken into consideration in the management of such musculoskeletal conditions. It is recommended to develop and test the clinimetric properties of a more comprehensive test battery for evaluating such self-reported shoulder problems. No prior studies were found addressing mutual dependency of the current tests for shoulder instability and laxity, which hampers comparison.

The present study has several limitations. First, the lack of standardised measurement of the amount of force exerted by the two testers, especially during the relocation and sulcus sign tests, may have limited the current intertester reliability. Further standardisation in both performance and interpretation is therefore needed. Also, the current study did not randomise the order of the clinical tests. However, we do not believe this to have biased the reliability of the data, since the same order was used for both testers.

Second, no valid gold standard for classifying shoulder instability and laxity was used. To compensate for this, self-reported confirmation of shoulder-related problems was applied, but this was not reflected in the current WOSI scores, which were relatively low. Lack of a more objective gold standard may have decreased diagnostic accuracy, however, not reliability, which was the primary objective of the present study. Also, in the group with normal shoulders, one individual reported having had a previous shoulder injury and three individuals reported subjective shoulder instability, which does not comply with the inclusion criteria for being regarded as shoulder healthy in the current study. At the clinical session, a self-reported questionnaire was completed regarding demographic data and historical information. Apparently, in the baseline questionnaire three shoulder healthy individuals answered yes to perceiving instability in their shoulder and one had had a previous shoulder injury, even though they all had reported no shoulder trouble during the telephone inclusion interview. However, as depicted in table 2, WOSI and pain scores in the group with normal shoulders do not seem to be severely influenced by these four individuals. Also, recalculations of demographic data and mutual dependency with the revised classification into affected/normal shoulders did not change the mutual dependency of the most frequently used tests for classification into affected/normal shoulders, nor were kappa and demographics affected (data not shown).

Third, due to a relatively short recruitment period and difficulties in recruiting subjects with shoulder instability and laxity, only 13 subjects with an affected shoulder were included. Naturally, this also affected the prevalence of positive and negative test findings, meaning that the prevalence of 0.50 recommended in reliability studies25 was not accomplished in all six tests. However, to overcome this, PABAK calculations were used and reported along with kappa, to show transparently how data would have been with equal distributions of positive and negative test results. Nevertheless, future studies should use inclusion criteria of more established shoulder instability and laxity conditions, if possible verified by objective criteria as a surrogate for a gold standard of shoulder instability and laxity. This may optimise prevalence as well as diagnostic accuracy in studies where this is a further aim.
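The effect of prevalence on kappa described above can be illustrated with two hypothetical contingency tables that share the same observed agreement (90% of 40 subjects) but differ in the prevalence of positive tests. The counts and function names are made up for illustration and are not study data.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table (a: both +, b: A+ B-, c: A- B+, d: both -)."""
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (po - pe) / (1 - pe)


def pabak(a, b, c, d):
    """Prevalence-adjusted and bias-adjusted kappa for a 2x2 table."""
    n = a + b + c + d
    return 2 * (a + d) / n - 1


balanced = (18, 2, 2, 18)  # positives and negatives equally common
skewed = (1, 2, 2, 35)     # positives rare, as with a low-prevalence test

for name, cells in (("balanced", balanced), ("skewed", skewed)):
    print(name, round(cohen_kappa(*cells), 2), round(pabak(*cells), 2))
```

With identical observed agreement, kappa drops markedly in the skewed table while PABAK is unchanged, which is why both values are informative when prevalence is far from 0.50.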

The strengths of the study are the use of standardised procedures (including blinding to patient status and the use of a three-phased protocol for conducting reliability studies). Also, presentation of raw data, using contingency tables, along with kappa and PABAK values, increases data transparency and improves interpretation of the reliability study.


Conclusion

This study showed acceptable intertester reliability for four of six clinical shoulder instability and laxity tests in relatively sports-active individuals with and without self-reported shoulder problems. However, the relocation and sulcus sign tests need further standardisation before being recommended for use in clinical practice. Based on the frequency and mutual dependency of the current tests, the apprehension and surprise tests in particular seem important in the characterisation of self-reported shoulder problems. Future research on the validity of tests for shoulder instability and laxity is needed.


We would like to thank Physiotherapists Rasmus Fitzner, Pernille Madsen and Jacob Hansen from Metropolitan University College, Copenhagen, Denmark for recruitment and testing of study participants. Furthermore, a special thanks to Bispebjerg Frederiksberg University Hospital, Copenhagen, Denmark for providing facilities for data collection.




  • Contributors HE, KGI, CML and BJ-K conceived and designed the study and interpreted the results. HE and BHK recruited study participants and collected data. HE performed the statistical analysis. HE drafted the manuscript with KGI, CML, BHK and BJ-K contributing to the manuscript. All authors have read and approved the final manuscript. HE is the guarantor.

  • Funding This work was supported by Region of Southern Denmark’s Research fund and The Danish Rheumatism Association.

  • Competing interests None declared.

  • Patient consent Obtained.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.