Ultrasound assessment for grading structural tendon changes in supraspinatus tendinopathy: an inter-rater reliability study
  1. Kim Gordon Ingwersen1,2,
  2. John Hjarbaek3,
  3. Henrik Eshoej1,
  4. Camilla Marie Larsen1,4,
  5. Jette Vobbe5,
  6. Birgit Juul-Kristensen1,6
  1. 1Department of Sports Science and Clinical Biomechanics, University of Southern Denmark, Odense, Denmark
  2. 2Department of Rehabilitation, Hospital Lillebaelt—Vejle Hospital, Vejle, Denmark
  3. 3Department of Radiology, Musculoskeletal section, Odense University Hospital, Odense, Denmark
  4. 4Health Sciences Research Centre, University College Lillebaelt, Odense, Denmark
  5. 5Shoulder Unit, Orthopaedic Department, Hospital Lillebaelt, Vejle Hospital, Vejle, Denmark
  6. 6Department of Health Sciences, Institute of Occupational Therapy, Physiotherapy and Radiography, Bergen University College, Bergen, Norway
  1. Correspondence to Dr Kim Gordon Ingwersen; kim.riis{at}

Abstract

Aim To evaluate the inter-rater reliability of measuring structural changes in the tendons of patients clinically diagnosed with supraspinatus tendinopathy (cases) and of healthy participants (controls), on ultrasound (US) images captured by standardised procedures.

Methods A total of 40 participants (24 patients) were included for assessing inter-rater reliability of measurements of fibrillar disruption, neovascularity, as well as the number and total length of calcifications and tendon thickness. Linear weighted κ, intraclass correlation (ICC), SEM, limits of agreement (LOA) and minimal detectable change (MDC) were used to evaluate reliability.

Results ‘Moderate’ to ‘almost perfect’ κ was found for grading fibrillar disruption, neovascularity and number of calcifications (κ 0.60–0.96). For total length of calcifications and tendon thickness, ICC was ‘excellent’ (0.85–0.90), with SEM(Agreement) ranging from 0.63 to 2.94 mm and MDC(group) ranging from 0.28 to 1.29 mm. In general, SEM, LOA and MDC showed larger variation for calcifications than for tendon thickness.

Conclusions Inter-rater reliability was moderate to almost perfect when a standardised procedure was applied for measuring structural changes on captured US images and movie sequences of relevance for patients with supraspinatus tendinopathy. Future studies should test intra-rater and inter-rater reliability of the method in vivo for use in clinical practice, in addition to validation against a gold standard, such as MRI.

Trial registration number NCT01984203; Pre-results.

  • Reliability
  • Tendinopathy

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:

Strengths and limitations of this study

  • A standardised procedure for US capturing and measuring structural changes of the supraspinatus tendon is presented.

  • A specific procedure for grading and interpreting tendinopathy-related changes is presented.

  • Grading and measurement can be performed reliably.

  • Performance of the method in vivo is warranted to validate the method in clinical practice.

Introduction

Rotator cuff (RC) tendinopathy can be considered a continuum of pathology, and tailored rehabilitation according to the stage in this continuum is recommended.1 ,2 Anamnesis and special orthopaedic tests are often used when diagnosing RC tendinopathy, but these tests often lack high specificity and sensitivity, making diagnosis uncertain,3 thus challenging precise and targeted treatment.

Grey-scale (GS) ultrasound (US) and Power Doppler (PD) visualisation of RC tendons may be helpful to detect signs of tendinopathy, such as hypoechoic areas, fibrillar disruption (FD), neovascularisation (NV), calcifications (CAs) embedded in the tendon or oedema, and confirm the ‘a priori’ hypothesis of RC tendinopathy, provided satisfactory clinimetric properties of the US method.4 ,5

However, US is an operator-dependent technique and requires thorough training and experience in performance and assessment before precise diagnoses can be made, especially in relation to the more subtle changes often seen in tendinopathy.6 Poor to fair reliability has previously been found when comparing US diagnoses made by novice and experienced clinicians.7–9 Further, when grading subtle structural tendon changes, especially hypoechoic areas, only fair and therefore unsatisfactory reliability has been found, even among experienced clinicians.6 ,8 ,10–12

Standardised procedures for capturing and assessing US are known to increase reliability of US-based diagnoses.6 Previously, assessment of tendinopathy was found to be reliable in patients with tendinopathy in the elbow, ankle or knee when standardised procedures for measuring GS and PD were used.4 ,11

For the shoulder, however, there is a lack of clinically relevant, standardised and reliable methods for assessing tendinopathy. Since US is highly influenced by clinician experience and technique, both standardised US procedures for image and movie capturing and standardised procedures for assessment of structural changes in relation to tendinopathy need to be defined.

Therefore, the aim of this study was to evaluate the inter-rater reliability of measuring and grading structural changes in the tendons of patients clinically diagnosed with supraspinatus tendinopathy (cases) and healthy participants (controls), on images and movies captured through standardised US procedures.

Materials and methods

Study design

The study followed the protocol for diagnostic procedures in reproducibility studies.13 This protocol includes a three-phase study design consisting of (1) training, (2) an overall agreement and (3) a study phase (the actual reliability study; figure 1).

Figure 1

Flow chart of the training, overall agreement and study phase. US, ultrasound.

The phases constitute a methodological model for optimising procedures, and aim at eliminating clinician subjectivity as much as possible. The aim of the training phase is to ensure that raters have sufficient competence and experience in performing the procedures. The overall agreement phase is an extended training phase and ensures that gross systematic bias between raters is minimised, and requires at least 80% agreement between raters before proceeding to phase 3. The study phase is the final evaluation of reliability of the developed procedures.13

Inter-rater reliability (phase 3) between two raters (raters A and B) was tested on measuring and grading structural changes relevant to tendinopathy on US-captured images and movies. Rater A (KGI; physiotherapist) had 1 year of clinical musculoskeletal US experience, and rater B (JH; radiologist) had more than 15 years of clinical musculoskeletal US experience.

US image capturing and measurement

On the basis of the literature,4 ,10 ,11 ,14–17 consensus was reached on definitions of relevant potential pathological structural changes related to tendinopathy, including (1) FD, (2) NV, (3) CA and (4) tendon thickness (TT). Thereafter, a standardised protocol for US capturing was developed, consisting of three static images (GS), three dynamic movie sequences (GS) and one Doppler movie sequence (table 1).

Table 1

Description of US procedures for capturing image and movie sequences of FD, NV, CAs and TT

Second, on the basis of previous scales used to measure structural changes in tendinopathy at the elbow,4 ,16 two ordinal grading scales for FD and NV were adjusted for use in the shoulder.19 The scales ranged from 0 to 4 (FD: 0=normal tendon; 4=partial rupture, corresponding to disruption of the fibres in the full thickness of the tendon; NV: 0=normal, including no signal; 4=extreme, including Doppler activity in more than 50% of the region of interest, ROI; table 2; see online supplementary appendix).

Table 2

Grading scales with definitions for FD and NV

CA was analysed as number of CAs and total length (in mm), while TT was measured in mm.18

Rater A performed capture of all US images and movie sequences with the participant seated, the shoulder internally rotated with the dorsal side of the hand placed on the sacrum, and the elbow flexed and directed laterally, to optimise visualisation of the supraspinatus tendon.20

A GE LOGIQ e B12 (GE Healthcare) with a 5.0–13.0 MHz linear transducer was used for image capturing. All US scans were standardised: GS imaging was performed at 13.0 MHz and 56% gain, while PD scanning was performed with a pulse repetition frequency of 0.41 kHz and 56% gain. Manufacturer recommendations for musculoskeletal imaging of the shoulder were preset for the remaining parameters.

Captured images and movie sequences were stored with unique identifier labels on an external hard disk. Measurement of captured images and movie sequences was performed in ‘OsiriX V.5.8.2 32-bit’ (rater A) and RadiAnt DICOM viewer V.1.9.16 (32 bit; rater B).

In the overall agreement and study phase, raters were blinded to each other's results and the participant status (case/control), and images and movies were stored for at least 21 days before measurements to secure blinding of rater A.

Training and overall agreement phases

In the training phase, raters A and B practised the US procedures for capturing, measuring and grading the captured images and movies on 10 participants (cases and controls). The overall agreement phase was performed on 20 participants (10 cases and 10 controls), and an overall agreement of at least 80% on each parameter (present/not present for the dichotomised variables CA, NV and FD; no significant (p>0.05) rater difference for the continuous variables TT and CA) was obtained before the actual reliability study.
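The 80% criterion for the dichotomised variables can be illustrated with a minimal Python sketch. It assumes that any grade above 0 counts as 'present'; the function name and the example grades are hypothetical and are not study data.

```python
def overall_agreement(grades_a, grades_b):
    """Proportion of participants on whom both raters agree on present/not present."""
    if len(grades_a) != len(grades_b):
        raise ValueError("both raters must grade the same participants")
    # Assumption: any grade above 0 counts as 'present'.
    agreements = sum((a > 0) == (b > 0) for a, b in zip(grades_a, grades_b))
    return agreements / len(grades_a)


# Hypothetical grades for five participants (not study data).
rater_a = [0, 1, 2, 0, 3]
rater_b = [0, 2, 0, 0, 1]
agreement = overall_agreement(rater_a, rater_b)
print(f"{agreement:.0%}", "proceed" if agreement >= 0.80 else "repeat training")
```

The raters disagree on presence for one of the five participants, so this example sits exactly on the 80% threshold.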

Study phase 3 (actual reliability study)

Participants

General inclusion criteria were: 18–65 years old; the ability to understand spoken and written Danish; no prior shoulder surgery/dislocation; no sensory or motor deficits in the neck/arm; no suspected competing diagnoses (rheumatoid arthritis, cancer, neurological disorders, fibromyalgia, psychiatric illness).

Inclusion criteria for cases were: clinical diagnosis of RC tendinopathy with current shoulder symptoms lasting for at least 3 months prior to inclusion; pain located in the proximal lateral aspect of the upper arm (C5 dermatome) aggravated by shoulder abduction; positive ‘full can test’ and/or ‘Jobe's test’, and/or pain at ‘resisted external rotation test’; and positive ‘Hawkins-Kennedy test’ and/or ‘Neer's test’; and US verification of at least one of the following characteristics: FD, NV, CA (the involved side), or side difference (increased/decreased) TT of the supraspinatus tendon.21

Exclusion criteria for cases were pain (during rest) rated above 40 mm (visual analogue pain scale, range 0–100 mm); bilateral shoulder pain; <90° of active elevation of the arm; full thickness rupture in the supraspinatus tendon (verified by US); CA above 5 mm in the vertical distance (X-ray); corticosteroid injection within the latest 6 weeks; humerus fracture (X-ray); diagnoses of glenohumeral osteoarthritis; frozen shoulder; clinically suspected labrum lesion; symptomatic osteoarthritis in the acromioclavicular joint; or symptoms from the cervical spine.21

Inclusion criteria for controls were no shoulder discomfort within the latest 3 months and negative clinical shoulder tests.

Cases were consecutively recruited from specialised shoulder units at three hospitals in Denmark as part of a randomised controlled trial.21 Controls were recruited by advertisement among staff from The Department of Sports Science and Clinical Biomechanics, University of Southern Denmark, and the Rehabilitation Department, Lillebaelt hospital—Vejle hospital.

Informed consent was obtained from participants before inclusion.

Statistics

Linear weighted Cohen's κ (LWk) was used to calculate inter-rater reliability with 95% CIs for the ordinal variables (FD, number of CA and NV). First, a linear weighting (LWk V.1) was applied, with agreement weights given by the formula 1−|i−j|/(k−1), where i and j index the rows and columns of the cross-table of the two raters' grades, and k is the maximum number of possible ratings.22 Second, the same weighting was used (LWk V.2), but with the restriction that disagreement between grades 0 and >0 was given a weight of 0, to account for the ability to differentiate between healthy and non-healthy.
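The two weighting schemes can be sketched in Python as follows. This is an illustration under stated assumptions, not the Stata code used in the study: grades are coded 0 to k−1, the function names are hypothetical, and the restriction in V.2 is implemented by setting the agreement weight to 0 whenever exactly one rater gives grade 0.

```python
import numpy as np


def agreement_weights(k, restrict_zero=False):
    """Linear agreement weights 1 - |i - j| / (k - 1) for k ordered grades."""
    grades = np.arange(k)
    w = 1.0 - np.abs(grades[:, None] - grades[None, :]) / (k - 1)
    if restrict_zero:
        # V.2 (assumed implementation): no partial credit when one rater
        # grades 0 (healthy) and the other grades above 0.
        w[0, 1:] = 0.0
        w[1:, 0] = 0.0
    return w


def weighted_kappa(grades_a, grades_b, k, restrict_zero=False):
    """Weighted Cohen's kappa for two raters grading the same participants."""
    a = np.asarray(grades_a, dtype=int)
    b = np.asarray(grades_b, dtype=int)
    observed = np.zeros((k, k))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= observed.sum()
    # Chance-expected proportions from the raters' marginal distributions.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    w = agreement_weights(k, restrict_zero)
    p_observed = np.sum(w * observed)
    p_expected = np.sum(w * expected)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical grades on a 0-4 scale (not study data).
rater_a = [0, 0, 1, 2, 3, 4, 0, 1]
rater_b = [0, 1, 1, 2, 4, 4, 0, 2]
print(weighted_kappa(rater_a, rater_b, k=5))
print(weighted_kappa(rater_a, rater_b, k=5, restrict_zero=True))
```

With five grades, a one-grade disagreement receives a weight of 0.75 under V.1, but a weight of 0 under V.2 if it crosses the boundary between grade 0 and grade 1.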

The κ was interpreted as ≤0.00=poor; 0.01 to 0.20=slight; 0.21 to 0.40=fair; 0.41 to 0.60=moderate; 0.61 to 0.80=substantial and 0.81 to 1.00=almost perfect.23
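As a small illustration, the interpretation bands above can be written as a lookup in Python. The function name is hypothetical, and values falling between the printed bands (for example 0.005) are assigned to the next band up, which is an assumption.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the descriptive band listed in the text."""
    bands = [
        (0.00, "poor"),
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label
    return "almost perfect"


for value in (0.51, 0.60, 0.96):
    print(value, interpret_kappa(value))
```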

For the continuous variables (TT, total length of CA), intraclass correlation (ICC; 3.1) was calculated as a measure of reliability. ICC was interpreted as <0.40=poor, 0.40 to 0.75=fair to good and >0.75=excellent reliability.24 Bland-Altman plots with 95% limits of agreement (LOA) were constructed as a measure of absolute agreement for TT (right and left) and total length of CA, and between-rater difference was tested by a paired t-test. Funnel effects and systematic bias were assessed visually and from Pearson’s correlation coefficient (r). SEM was calculated as SEM(Agreement)25 to extrapolate results to the general population of potential raters, and minimal detectable change (MDC) was calculated at individual (MDCIndividual) and group (MDCgroup) levels.26 An unpaired t-test was used to define a potential cut-point of TT between cases and controls.
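The text names ICC (3.1) and SEM(Agreement) without printing the formulas. The Python sketch below assumes the usual two-way ANOVA definitions: ICC(3,1) as a consistency coefficient, and SEM(Agreement) as the square root of the rater variance plus the residual variance. Function names and example values are hypothetical, and the study itself used Stata.

```python
import numpy as np


def two_way_mean_squares(ratings):
    """Mean squares for an (n participants x k raters) table, one value per cell."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    ss_subjects = k * np.sum((x.mean(axis=1) - grand_mean) ** 2)
    ss_raters = n * np.sum((x.mean(axis=0) - grand_mean) ** 2)
    ss_error = np.sum((x - grand_mean) ** 2) - ss_subjects - ss_raters
    ms_subjects = ss_subjects / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return n, k, ms_subjects, ms_raters, ms_error


def icc_3_1(ratings):
    """ICC(3,1): two-way mixed effects, consistency, single measurement."""
    n, k, ms_subjects, _, ms_error = two_way_mean_squares(ratings)
    return (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)


def sem_agreement(ratings):
    """SEM(Agreement): sqrt(rater variance + residual variance) (assumed definition)."""
    n, _, _, ms_raters, ms_error = two_way_mean_squares(ratings)
    rater_variance = max((ms_raters - ms_error) / n, 0.0)
    return float(np.sqrt(rater_variance + max(ms_error, 0.0)))


# Hypothetical tendon thickness values in mm (rows: participants; columns: raters A, B).
thickness = [[5.0, 5.5], [6.0, 6.5], [7.0, 7.5], [8.0, 8.5]]
print(icc_3_1(thickness), sem_agreement(thickness))
```

In this example rater B is always 0.5 mm above rater A. The consistency ICC ignores that constant offset, whereas SEM(Agreement) includes it, which is why the two measures are usually reported together.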

For the study phase, a sample size of 40 participants was applied, as previously recommended for reliability studies.13

Data were analysed in Stata/IC V.14 (2015, Statacorp, College Station, Texas, USA). p Values <0.05 were considered significant.

Results

There were no differences in demographics between cases and controls, except for pain and discomfort, as expected, due to the study design (table 3).

Table 3

Demographics (study phase; n=40)

Total agreement ranged from 83% to 99%. LWk V.1 for FD, NV and CA ranged from 0.60 to 0.96, and κ with the restricted weighting (LWk V.2) ranged from 0.51 to 0.98, representing ‘moderate’ to ‘almost perfect’ reliability (table 4).

Table 4

Inter-rater reliability of grading presence of FD, NV and number of CAs (study phase; n=40)

For total length of CA and TT, ICC ranged from 0.85 to 0.90 (excellent), with SEM(Agreement) ranging from 0.63 to 2.94 mm, MDC(group) from 0.28 to 1.29 mm and MDC(individual) from 1.75 to 8.15 mm (table 5).
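The MDC formulas are cited rather than printed in the text. Assuming the commonly used definitions (1.96 × √2 × SEM at the individual level, divided by √n at the group level, with n=40), the reported ranges can be reproduced to rounding:

```latex
\begin{align*}
\mathrm{MDC}_{\text{individual}} &= 1.96 \times \sqrt{2} \times \mathrm{SEM},
&
\mathrm{MDC}_{\text{group}} &= \frac{\mathrm{MDC}_{\text{individual}}}{\sqrt{n}} \\[4pt]
\mathrm{SEM} = 0.63\ \text{mm}: \quad
1.96 \times 1.414 \times 0.63 &\approx 1.75\ \text{mm},
&
\frac{1.75}{\sqrt{40}} &\approx 0.28\ \text{mm} \\[4pt]
\mathrm{SEM} = 2.94\ \text{mm}: \quad
1.96 \times 1.414 \times 2.94 &\approx 8.15\ \text{mm},
&
\frac{8.15}{\sqrt{40}} &\approx 1.29\ \text{mm}
\end{align*}
```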

Table 5

Inter-rater reliability of TT and total length of CA (study phase; n=40)

No systematic rater differences were found in measured TT and total length of CA (table 5). Bland-Altman plots showed no funnel effects, but a small interaction between difference and increased mean was found for TT in the left shoulder (r=0.35, p=0.03; figure 2). In general, LOA showed a larger variation for CA than for TT (table 5 and figure 2).
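A minimal Python sketch of the quantities behind a Bland-Altman analysis is given below. It assumes that the 95% LOA are the mean difference ± 1.96 SD of the differences, and that the reported r is the Pearson correlation between the between-rater difference and the between-rater mean. The function name and the measurements are hypothetical.

```python
import numpy as np


def bland_altman(rater_a, rater_b):
    """Return bias, 95% limits of agreement, and r between difference and mean."""
    a = np.asarray(rater_a, dtype=float)
    b = np.asarray(rater_b, dtype=float)
    difference = a - b
    mean = (a + b) / 2.0
    bias = difference.mean()
    sd = difference.std(ddof=1)  # sample SD of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    # A clearly non-zero r suggests that disagreement changes with the size
    # of the measurement (proportional bias).
    r = np.corrcoef(mean, difference)[0, 1]
    return bias, limits, r


# Hypothetical tendon thickness measurements in mm (not study data).
rater_a = [5.0, 6.0, 7.0, 8.0]
rater_b = [5.2, 5.9, 7.3, 7.8]
bias, (lower, upper), r = bland_altman(rater_a, rater_b)
print(f"bias={bias:.2f} mm, LOA={lower:.2f} to {upper:.2f} mm, r={r:.2f}")
```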

Figure 2

Bland-Altman plots with 95% LOA for TT (right and left) and total length of CAs. CA, calcifications; LOA, limits of agreement; TT, tendon thickness.

No significant difference was found between cases and controls in TT.

Discussion

The inter-rater reliability study showed moderate to almost perfect reliability for grading FD, NV and number of CAs, using standardised procedures. Inter-rater reliability for measuring the total length of CA and TT was excellent, and MDC indicated small detectable changes at group level, especially in TT.

FD and hypoechoic areas

Despite merging hypoechoic areas and FD into one scale, reliability was still only moderate (LWk of 0.60 and 0.51). This was, however, in line with previous studies of tendinopathy, where agreement on subtle changes (‘mild abnormality’ and ‘normal’) was considered especially difficult, presumably due to difficulties in differentiating between structural changes and anisotropy.4 ,6 ,8 ,10 ,11 Grading FD may be easier during in vivo US examinations, as the examiner has more flexibility when evaluating the presence of anisotropy.

Neovascularisation

The current reliability of NV was almost perfect. The reason for the high reliability in the current study may be the grading of NV in relation to a predetermined ROI (fixed box of 5×5 mm placed over the area with most NV), as opposed to grading NV relative to the TT or the tendon in general as previously in tendinopathy of the elbow.4 ,16 The current modification was performed to increase standardisation, as well as to account for between and within variations in TT, of interest in intervention studies.

Other studies have found NV in 30–65% of symptomatic shoulders, compared with only 25% of asymptomatic shoulders.27 ,28 This study found NV in 38% of the cases and in none of the controls. This large variation in prevalence across studies may be due to different populations, PD settings, measurement methods and the position of the participant's arm during US image capturing. This study placed the hand at the sacrum to maximally stretch the supraspinatus tendon, which may have increased the risk of overlooking NV due to restricted flow in the neovessels. Different study designs across studies make it difficult to compare prevalence and establish normative levels for use in clinical practice.

Calcifications

The substantial κ for detecting the total number of CA is in line with previous studies,4 ,8 but LOA, SEM and MDC showed considerable variation in the total length of CA. This variation may be due to US methodological problems, for example, that shadows underneath CA may falsely be interpreted as FD and/or that normal tendon structure may appear hyperechoic, thus resembling CA, which may result in misclassifications. However, reliability of the number of CA was high, indicating that measurement of individual CA lengths and/or a few undetected or misclassified CA influenced agreement on the total length of CA. One outlier seen in the Bland-Altman plots (figure 2) indicates that raters A and B disagreed on at least one larger structural change, which, owing to the generally small size and low prevalence of CA, has influenced the variation considerably.

Tendon thickness

Excellent reliability and an MDC of ≤0.33 mm indicate that the variable is sensitive for detecting changes, in line with a previous study using the same method for measuring TT.18 This means that it may be a clinically relevant measurement for assessment of changes in tendon properties, such as increased/decreased oedema. Some studies have found significant differences in TT between symptomatic and non-symptomatic participants,29 ,30 which is in contrast with the current study and a recent study.18 The discrepancy across studies may be due to the use of different methods for measuring TT, small sample sizes, different inclusion criteria or, as in this study, the inclusion of more active controls (recruited among health personnel) with potentially thicker tendons than the average population.

One limitation of the study is the transferability to clinical settings, as this study used captured images and strictly standardised procedures, which are rarely used in clinical practice. In vivo, raters would be more flexible when evaluating the presence of anisotropy in the interpretation of potential FD; they would also be able to repeat image capturing and measurements when CA or NV were suspected to be present. Use of a standardised protocol for reliability studies13 may be a weakness, since reliability of the current US method and procedures may have been deceptively high compared with a clinical setting. However, if the standardised method has poor reliability in a standardised setting, reliability is also assumed to be poor, and the method less relevant, in a clinical setting. The raters measured and graded the captured images and movies on different DICOM viewers. Whether this has influenced the reliability is unknown. However, since reliability was found to be high on most variables, this is considered unimportant and to mimic clinical practice.

The strengths of this study are the design, incorporating a stepwise and standardised procedure in order to minimise bias and increase reliability.13 The present standardisation of US image and movie capturing, and of measuring and grading structural changes, is anticipated to increase reliability and sensitivity of the method. Despite one of the raters having relatively few years of US experience, reliability was still high and satisfactory, indicating that the protocol can be followed even by clinicians without extensive US experience. By using captured images and movie sequences, it was ensured that both raters had the same underlying basis for interpretation in the reliability study.

Further, the use of weighting with restrictions when calculating κ was considered important, due to the importance of being able to differentiate between cases and controls.

Conclusion

Inter-rater reliability was moderate to almost perfect when a standardised procedure was applied for measuring structural changes on captured US images and movie sequences of relevance for patients with supraspinatus tendinopathy. Future studies should test intra-rater and inter-rater reliability of the method in vivo for use in clinical practice, in addition to validation against a gold standard, such as MRI.



  • ▸ Additional material is available. To view please visit the journal online (

  • Contributors KGI, BJ-K, HE and CML conceived and designed the study protocol. KGI and BJ-K procured the project funding. KGI, BJ-K, JV and JH developed and standardised the ultrasound procedure and defined the grading scale. JV and JH secured access and coordinated screening procedures at the shoulder units. KGI was the project coordinator and performed the inclusion and US image capturing. KGI and JH were raters. KGI and BJ-K planned and coordinated the statistical analyses. KGI performed the statistical analyses. KGI drafted the manuscript, and BJ-K, JH, JV, HE and CML contributed to the manuscript. All authors read and approved the final manuscript. KGI is the guarantor.

  • Funding Region of Southern Denmark's Research fund, The Danish Rheumatism Association and the Ryholts Foundation funded the trial.

  • Competing interests None declared.

  • Patient consent Obtained.

  • Ethics approval All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and the 1964 Helsinki declaration and its later amendments or comparable ethical standards. The Regional Scientific Ethics Committee of Southern Denmark has approved the trial (project ID: S-20130071).

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.