
Reproducibility of tender point examination in chronic low back pain patients as measured by intrarater and inter-rater reliability and agreement: a validation study
  1. Ole Kudsk Jensen1,3,
  2. Jacob Callesen2,
  3. Merete Graakjaer Nielsen1,3,
  4. Torkell Ellingsen2,3
  1. 1Spine Center, Diagnostic Center, Regional Hospital Silkeborg, Silkeborg, Denmark
  2. 2Institute of Public Health, Aarhus University, Aarhus, Denmark
  3. 3Department of Rheumatology, Diagnostic Center, Region Hospital Silkeborg, Silkeborg, Denmark
  1. Correspondence to Dr Ole Kudsk Jensen; olejesen{at}


Objectives To evaluate the reliability and agreement of digital tender point (TP) examination in chronic low back pain (LBP) patients.

Design Cross-sectional study.

Settings Hospital-based validation study.

Participants Among sick-listed LBP patients referred from general practitioners for low back examination and return-to-work intervention, 43 patients entered the study and 39 (18 women, 46%) completed it.

Main outcome measures The reliability was estimated by the intraclass correlation coefficient (ICC), and agreement was calculated for up to ±3 TPs. Furthermore, the smallest detectable difference was calculated.

Results TP examination was performed twice by two consultants in rheumatology and rehabilitation at 20 min intervals and repeated 1 week later. Intrarater reliability in the more and less experienced rater was ICC 0.84 (95% CI 0.69 to 0.98) and 0.72 (95% CI 0.49 to 0.95), respectively. The figures for inter-rater reliability were intermediate between these figures. In more than 70% of the cases, the raters agreed within ±3 TPs in both men and women and between test days. The smallest detectable difference between raters was 5, and for the more and less experienced rater it was 4 and 6 TPs, respectively.

Conclusions The reliability of digital TP examination ranged from acceptable to excellent, and agreement was good in both men and women. The smallest detectable differences varied from 4 to 6 TPs. Thus, TP examination in our hands was a reliable but not precise instrument. Digital TP examination may be useful in daily clinical practice, but regular use and training sessions are required to secure quality of testing.

  • Rheumatology
  • Statistics & Research Methods
  • Pain Management

This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non-commercial and is otherwise in compliance with the license.


Article summary

Article focus

  • Diffuse hyperalgesia may be evaluated by tender point (TP) examination and may reflect deficient descending pain inhibition as in fibromyalgia.

  • TP examination is increasingly relevant to improve clinical assessment in inflammatory as well as non-inflammatory rheumatological disorders.

  • Reproducibility of this examination technique is not well documented and was therefore investigated.

Key messages

  • In sick-listed chronic low back pain (LBP) patients, digital TP examination was a reliable but not precise instrument.

  • In both women and men, there was more than 70% agreement within ±3 TPs.

  • The method was quick and easy to use with no requirements of equipment, except in initial training sessions.

Strengths and limitations of this study

  • The study included a well-defined chronic LBP population that was referred from general practitioners for LBP examination and return-to-work intervention.

  • The number of patients was limited and only two raters were involved, resulting in wide CIs and limited generalisability.


Introduction

Tender point (TP) examination has been the cornerstone examination in patients with chronic widespread pain (CWP) to distinguish fibromyalgia patients from patients with CWP only. In the general population, the former and latter conditions have been identified in 0.5–4%1 and 10–13%,2 ,3 respectively. Persons fulfilling the fibromyalgia criteria (CWP and ≥11 TPs) report more pain and disability than persons with CWP who have less than 11 TPs.4 TP examination is performed by standardised digital palpation at 18 points symmetrically distributed on the body (figure 1).5 In the general population, men and women had a median of 3 and 6 TPs, respectively,6 and women may have up to 4 TPs more than men.7

Figure 1

Locations of tender points according to the American College of Rheumatology.5

TP examination may be relevant in conditions other than CWP or regional pain syndromes. In inflammatory rheumatic diseases, TP examination may also contribute to the clinical evaluation. For instance, high-disease activity in the absence of inflammatory activity in rheumatoid arthritis is often seen in patients with many TPs.8 This may lead to inappropriate treatment of disease activity. In systemic lupus erythematosus, health status has been shown to be inferior in patients with many TPs as compared with patients with few TPs.9

In sick-listed low back pain (LBP) patients, the intensity of back pain is associated with the number of TPs, and patients with radiculopathy have fewer TPs than patients with non-specific LBP.10 Furthermore, TPs are associated with the reporting of widespread pain and with long-term prognosis.11 According to another study,12 patients with both CWP and non-specific LBP have more pain, higher disability and more TPs than patients with LBP only.

Reliability and agreement studies are, however, few and insufficient. The original study defining fibromyalgia5 included 293 patients and 265 controls. Since then, we have been able to identify only three small studies comparing the reliability of digital palpation and dolorimetry with TPs defined as in the original study.13–15 Each study included 15–25 individuals. The reliability was acceptable and comparable for both dolorimetry and digital palpation, and κ values of 0.44–0.92 were reported for the digital examination. However, only the reliability of testing each TP location as positive was estimated, not the reliability of the total TP counts. In other non-specific pain studies, the reliability of TP examination was not formally tested, or digital examination was not used.16–20

Since the total TP count—and not each single TP—is used for the clinical evaluation in rheumatological conditions, more reliability and agreement studies of the total TP count are needed.

Accordingly, the purpose of the present study was to investigate the reproducibility of total TP counts based on digital TP examination in chronic sick-listed LBP patients in terms of (1) intrarater and inter-rater reliability and (2) intrarater and inter-rater agreement.


Methods

The patients were recruited among patients referred from their general practitioners to the Spine Center for participation in a controlled study.

Inclusion criteria: partly or fully sick-listed for more than 4 weeks due to LBP with or without radiculopathy, LBP should be the prime reason for sick-listing and at least as bothersome as pain elsewhere, age 16–60 years, referred from a well-defined geographical area of about 280 000 inhabitants, and the patient should be able to speak and understand Danish.

Exclusion criteria: living outside the referral area, continuing or progressive radiculopathy resulting in plans for surgery, low back surgery within the last year, previous lumbar fusion operation, suspected cauda equina syndrome, progressive paresis or other serious back disease (eg, tumour), pregnancy, known dependency on drugs or alcohol, or primary psychiatric disease.

The patients were contacted between 1 November 2009 and 1 March 2010 and were only included in the present study after more than 3 weeks had passed since their first consultation at the Spine Center. They were offered participation in the study by one of the authors (JC), who was the leader of the project but was not a staff member, and they were told that the investigation had nothing to do with the management of their LBP. The patients were informed that the examination would only include measuring of diffuse tenderness by TP examination and spinal range of motion (not reported in this paper). Previously, all patients had been subjected to a clinical low back examination and TP examination at their first consultation at the Spine Center.

The examinations were performed by two clinicians (OKJ and MGN), both consultants in rheumatology and rehabilitation. Beforehand, the TP examination method was taught by the more experienced rater (OKJ=Rater A) to the less experienced rater (MGN=Rater B) during a 2 h session. Each test day, before starting examinations, the two raters calibrated their thumbs with a dolorimeter,21 which was able to register four pressures at a time and calculate means and SDs.

The examinations were performed during two test days, days 1 and 2, at 1-week intervals. To include all patients, the test days were repeated twice. The patients were randomised so that half of the patients were first tested by Rater A, the other half first by Rater B, but keeping the same sequence on day 2 as on day 1. Twenty minutes passed between the examinations.

Before examination, the patients filled out a questionnaire including questions regarding back+leg pain22 and disability,23 increasing scores representing increasing pain and disability. At the clinical examination, the patient's range of spinal motion was first measured in the standing position. Subsequently, the patient was asked to lie prone, and a 4 kg digital pressure was demonstrated on the distal, dorsal aspect of the forearm. The patient was instructed in the following way: “This is a firm pressure. Afterwards, this pressure will be applied on different spots on the body. At every spot, I would like you to report if the pressure is painful or is felt like firm pressure.” The TPs (figure 1) were tested in a standardised manner from right to left, first testing the medial fat pads of the knees and the posterior aspects of the greater trochanter. Afterwards, with the patient seated, the spots were tested from the top and downwards as follows: the suboccipital muscle insertions, the anterior-lateral aspect of the intertransverse aspects of C5–7, the midpoints of the upper borders of the trapezius, the medial parts of the supraspinatus, the costochondral junctions of costa 2, the forearm 2 cm distal to the epicondyles and the outer upper quadrants of the buttocks. The patients were instructed not to tell the result of the TP examination to the raters or others.

Positive TPs (ie, pressures causing pain) were memorised by the raters and summed up to the total number of TPs (the TP count). The procedure lasted 6–8 min per examination. A secretary was associated with each rater. The TP counts were reported to this secretary, who passed the data to the project leader (JC). In this way, the raters were blinded in relation to each other.

The secretary also registered pain response at every single TP location.

Statistical analyses

A sample size of at least 40 persons was planned for testing intrarater and inter-rater reliability.24 The TP counts were discrete numerical variables and were normally distributed. For the quantification of intrarater and inter-rater reproducibility of TP examination, two types of analysis were applied: the intraclass correlation coefficient (ICC) and the Bland-Altman method for assessing agreement.25 ,26 The ICC indicates how well a measurement distinguishes variation between subjects from measurement variation. The ICC was defined as the ratio of variance among patients (subject variability) over the total variance (subject variability, observer variability and measurement variability). The ICC ranges between 0 (no reliability) and 1 (perfect reliability); values are considered excellent when >0.75 and poor when <0.40. Values between these limits represent moderate-to-good reliability.27 According to another reference, ICC >0.7 is considered good.25
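The variance-ratio definition of the ICC given above can be illustrated with a minimal sketch. The text does not state which ICC form was used; the sketch assumes the two-way random-effects, absolute-agreement, single-measurement form (often written ICC(2,1)), which matches a total variance made up of subject, observer and residual components. The function name `icc_absolute_agreement` and the example counts are hypothetical.

```python
import numpy as np

def icc_absolute_agreement(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    scores: 2-D array with one row per patient and one column per rater
    (or per test day). This form is an assumption, not the study's stated method.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    # Sums of squares for patients (rows), raters/sessions (columns) and residual.
    ss_rows = k * np.sum((x.mean(axis=1) - grand_mean) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand_mean) ** 2)
    ss_total = np.sum((x - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols
    # Mean squares from the two-way ANOVA table.
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical TP counts for five patients examined on two occasions.
example_counts = [[7, 8], [3, 2], [12, 11], [5, 7], [9, 9]]
example_icc = icc_absolute_agreement(example_counts)
```

Because the rater (column) variance enters the denominator, a systematic offset between raters lowers this ICC, which is the behaviour one would want when raters are meant to be interchangeable.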

The Bland-Altman method provides insight into the distribution of differences in relation to mean values.28 Agreement was quantified by calculating the mean difference between two sets of observations and the SD of this difference. The closer the mean difference was to 0 and the smaller the SD of this difference, the better the agreement. The differences were plotted against the mean values. The 95% limits of agreement were defined as the mean difference between the raters ±1.96 × SD of the difference. Furthermore, agreement within ±1 TP and ±3 TPs was calculated.
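As an illustration of the agreement measures described above, the following sketch computes the mean difference, its SD, the 95% limits of agreement and the share of paired counts that agree within a given tolerance. The function name `bland_altman_summary`, the dictionary keys and the example counts are hypothetical and not taken from the study.

```python
import numpy as np

def bland_altman_summary(first, second, tolerance=3):
    """Summarise agreement between two sets of paired TP counts."""
    a = np.asarray(first, dtype=float)
    b = np.asarray(second, dtype=float)
    diff = a - b
    mean_diff = diff.mean()
    sd_diff = diff.std(ddof=1)  # sample SD of the paired differences
    return {
        "mean_diff": mean_diff,
        "sd_diff": sd_diff,
        "lower_limit": mean_diff - 1.96 * sd_diff,
        "upper_limit": mean_diff + 1.96 * sd_diff,
        # Proportion of pairs whose counts differ by at most `tolerance` TPs.
        "share_within_tolerance": float(np.mean(np.abs(diff) <= tolerance)),
    }

# Hypothetical counts from two examinations of the same four patients.
summary = bland_altman_summary([5, 7, 9, 4], [4, 7, 11, 8], tolerance=3)
```

Calling the function with `tolerance=1` and `tolerance=3` would give the two agreement proportions mentioned in the text.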

For a change in outcome to be regarded as real in clinical practice and research, it must be at least as large as the smallest detectable difference (SDD) of the measurement procedure.25 The SDD was calculated as 1.96 × √(2 × SEM²), where the SE of measurement (SEM) was defined as SD of the difference/√2. The SDD was rounded up to the nearest whole number.
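The SDD calculation described above can be sketched in a few lines. The function name `sem_and_sdd` and the example SD of 2.0 are hypothetical; the formulas are those given in the text.

```python
import math

def sem_and_sdd(sd_diff):
    """Return (SEM, SDD rounded up) from the SD of the paired differences."""
    sem = sd_diff / math.sqrt(2)          # SEM = SD of the difference / sqrt(2)
    sdd = 1.96 * math.sqrt(2 * sem ** 2)  # SDD = 1.96 * sqrt(2 * SEM^2)
    return sem, math.ceil(sdd)            # SDD rounded up to a whole number of TPs

# Hypothetical SD of the difference of 2.0 TPs.
sem, sdd = sem_and_sdd(2.0)
```

Since the two √2 factors cancel, the unrounded SDD equals 1.96 × SD of the difference, which is also the half-width of the 95% limits of agreement.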

Cronbach's α is a measure of internal consistency indicating if different items of a test battery are intercorrelated and measure the same construct. Values >0.9 are considered excellent.
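A minimal sketch of Cronbach's α follows, using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of the summed score). The text does not state which measurements were treated as items; the sketch assumes each column is one examination (for example, day 1 and day 2 for the same rater). The function name `cronbach_alpha` and the example counts are hypothetical.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a 2-D array: rows are patients, columns are items."""
    x = np.asarray(items, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)      # variance of each item
    total_variance = x.sum(axis=1).var(ddof=1)  # variance of the summed score
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Hypothetical TP counts: column 0 = first examination, column 1 = second.
alpha = cronbach_alpha([[7, 8], [3, 2], [12, 11], [5, 7], [9, 9]])
```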

The reliability of each TP location was measured by κ statistics.
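For a single TP location, each examination gives a binary result (painful or not), so a κ statistic compares observed agreement with the agreement expected by chance. The text does not specify the κ variant; the sketch below assumes Cohen's κ for two binary ratings. The function name `cohen_kappa_binary` and the example ratings are hypothetical.

```python
import numpy as np

def cohen_kappa_binary(rater_a, rater_b):
    """Cohen's kappa for two binary ratings of the same patients.

    Undefined (division by zero) when both raters give the same answer
    for every patient, because chance agreement is then 1.
    """
    a = np.asarray(rater_a, dtype=bool)
    b = np.asarray(rater_b, dtype=bool)
    observed = np.mean(a == b)
    # Chance agreement: both positive plus both negative, from the marginals.
    expected = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
    return float((observed - expected) / (1 - expected))

# Hypothetical results at one TP location (1 = painful) for four patients.
kappa = cohen_kappa_binary([1, 1, 0, 0], [1, 0, 0, 0])
```

Because κ depends on the marginal proportions, a location that is positive in nearly all or nearly none of the patients can show high percentage agreement together with a low κ.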


Results

Eighty-three patients were invited to join the study, and 39 patients completed both test days (figure 2). Four patients dropped out between days 1 and 2: three without explanation, and the fourth was excluded because of hospital admission and change of pain medication between the two test days. Pain medication was unchanged in the other patients.

Baseline characteristics are displayed in table 1.

Table 1

Baseline characteristics

Intrarater reliability and agreement

The mean TP count was seven and differed little between test days (table 2). The ICC in Rater A was excellent, 0.83 (95% CI 0.69 to 0.98), reflecting a high degree of reliability. ICC was somewhat lower, but still good in Rater B, 0.72 (CI 0.49 to 0.95). The relations between TP counts on days 1 and 2 are graphically displayed in figure 3 (left panel). The circles representing more than one observation were all located near the equality lines, and the observations were distributed over the whole range of TP counts.

Table 2

Intrarater differences, reliability and agreement

Figure 3

Intrarater reliability and agreement. Reliability with lines of equality shown in the left panel. Agreement shown by Bland-Altman plots in the right panel displaying differences of tender point (TP) counts on the y-axis and average of TP counts on the x-axis. The upper and the lower horizontal lines represent 95% limits of agreement. Areas of the circles are proportional to the number of observations.

In about half of the observations, agreement was within ±1 TP. For both raters, more than 75% of the TP counts were within ±3 TPs in both sexes. The limits of agreement were within ±4 and ±6 TPs for Rater A and Rater B, respectively (figure 3 right panel), corresponding to the SDD (table 2). Measurement errors (SEM) were 1.34 (1.90/√2) and 1.89 (2.68/√2) for Rater A and Rater B, respectively. Cronbach's α was 0.96 and 0.92 for Rater A and B, respectively.
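As a check on the reported figures, the SDD formula from the statistical analyses can be applied to the SDs of the differences quoted above (1.90 for Rater A and 2.68 for Rater B). The symbol SD_diff is introduced here only for the worked arithmetic.

```latex
\begin{align*}
\mathrm{SDD} &= 1.96 \times \sqrt{2 \times \mathrm{SEM}^{2}}
              = 1.96 \times \sqrt{2} \times \frac{\mathrm{SD}_{\text{diff}}}{\sqrt{2}}
              = 1.96 \times \mathrm{SD}_{\text{diff}} \\
\text{Rater A:}\quad \mathrm{SDD} &= 1.96 \times 1.90 \approx 3.72 \;\rightarrow\; 4 \text{ TPs} \\
\text{Rater B:}\quad \mathrm{SDD} &= 1.96 \times 2.68 \approx 5.25 \;\rightarrow\; 6 \text{ TPs}
\end{align*}
```

Rounded up, these values agree with the SDDs of 4 and 6 TPs and with limits of agreement lying within ±4 and ±6 TPs.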

Inter-rater reliability and agreement

The mean differences of TP counts differed little between the two raters (table 3). The relations between TP counts of Raters A and B are shown in figure 4, left panel, and the limits of agreement in the right panel. The circles representing more than one observation were all located near the equality and zero lines. On both test days, ICC was higher than 0.75. In more than 70% of the cases, Rater B agreed with Rater A within ±3 TPs in both men and women. The limits of agreement were within ±5 TPs, corresponding to SDD of 5 TPs. Measurement errors (SEM) were 1.63 (2.30/√2) and 1.47 (2.08/√2) on days 1 and 2, respectively. Cronbach's α was 0.94 and 0.96 on days 1 and 2, respectively.

Table 3

Inter-rater differences, reliability and agreement

Figure 4

Inter-rater reliability and agreement. Reliability with lines of equality shown in the left panel. Agreement shown by Bland-Altman plots in the right panel displaying differences of tender point (TP) counts on the y-axis and the average of TP counts on the x-axis. The upper and the lower horizontal lines represent 95% limits of agreement. Areas of the circles are proportional to the number of observations.

Reliability of testing each TP location

The reliability of testing each TP location is shown in the appendix. Agreement varied from 69% to 90%, and κ values varied from 0.13 to 0.89.


Discussion

The present study showed that digital TP examination resulted in total TP counts with acceptable-to-excellent reliability when calibration of the thumbs with a dolorimeter was performed before the testing. This indicated that the measurement error, which was less than 2 TPs, was considerably smaller than the variation between individuals. The less experienced Rater B did not perform as well as the more experienced Rater A, which was especially evident when comparing the lower limits of the CIs. The reliability of Rater B was nevertheless acceptable, and more training and regular use would probably improve the results. Training has been shown to reduce the variability in applying a 4 kg digital force.29

Agreement is independent of the variation between subjects. We consider an agreement of more than 70% as good, and it was found for ±3 TPs in both men and women, indicating that digital TP examination in daily practice may be used, keeping in mind the uncertainty of ±3 TPs. This part of the result was especially important, since we found that TP counts were higher in women than in men, in line with other studies. In the general population, TP counts of more than 10 and 6 have been identified in 10–20% of women and men, respectively.6 ,7 Thus, a TP count of 9 may be normal in women, but high in men.

The median TP count of 8 was elevated as compared with the median TP count in the general population, which is between 3 and 6 TPs.6 Previously, it has been shown that TP counts were elevated in regional pain conditions as compared with pain-free controls, but lower than in fibromyalgia.30

However, the SDD ranged from 4 to 6 TPs, indicating that TP examination was less precise than it was reliable. Thus, according to the present study, TP examination may yield TP counts that differentiate between high, intermediate and low levels, but not between different levels within the low or high range. Moreover, TP examination as used in the present study would not be sufficiently precise to classify patients as above or below the cut-off of 10/11 TPs used in the diagnosis of fibromyalgia.

An SDD of 4–6 TPs was not impressive, but it did not differ greatly from other measures in LBP. The minimal detectable change, which is defined similarly to the SDD,25 ,31 has been shown to be 4–5 points in the Roland Morris Questionnaire,32 a commonly used instrument in LBP.

In fibromyalgia, the peripheral sensory thresholds are normal, but pain processing is augmented, primarily due to dysfunction of the descending pain inhibition system in the brainstem.33 In the present study, the patients were sick-listed because of chronic LBP, and we have previously presented data making it plausible that LBP can partly be explained by mechanisms similar to those seen in fibromyalgia patients.10

We found high internal consistency, as all of Cronbach's α values were above 0.90. This may support the assumption that TP counts measure the same construct, that is, insufficient pain inhibition, rather than local abnormality. Therefore, in chronic LBP patients, TPs may be interpreted as follows: a high TP count may indicate an insufficiently functioning descending pain inhibition system, whereas a low TP count may indicate a well-functioning system. TP counts in the middle of the distribution are inconclusive. The present study does not provide sufficient data to set limits for high or low TP counts in LBP patients.

In the present chronic LBP population, there was no significant change in TP counts during 1 week. We could have chosen a shorter or longer interval, but 1 week was chosen for pragmatic reasons, because we assumed that 1 week would not be too long in a patient population with long-lasting pain. One might expect more change in TP counts during 1 week in patients with acute LBP. A systematic difference in TP count between the first and second TP examinations might have occurred, but such a potential difference was not apparent because the raters were randomised to be either the first or second rater.

The value of TP examination has been questioned. First, the examination method may be unreliable, because the pain response may be affected by expectations1 or distress.34 When the examination is performed randomly with the patient blinded for the pressure gradient, the results are different as compared with non-blinded testing.34 ,35 Second, it may be inadequate to use a sharp cut-point (≥11 TPs) to distinguish health from disease in pain conditions.36 At present, fibromyalgia is considered part of a larger continuum.37 ,38 Third, there have been problems with implementation of the examination technique, especially in primary care. Often, it has been incorrectly performed, and some physicians have refused to use the method.39

Therefore, new criteria for diagnosing fibromyalgia have been developed and validated. These criteria do not include TP examination, and therefore they will enable clinicians and researchers to diagnose fibromyalgia by surveys. However, the new criteria were not meant to replace the original American College of Rheumatology (ACR) criteria, but to represent an alternative method of diagnosis39; and the new criteria have not been tested in rheumatic conditions and may not be relevant in patients with inflammatory rheumatic diseases. In these conditions, fibromyalgia symptoms may be caused by rheumatic disease and not by dysfunction of the descending pain inhibition system. Therefore, TP examination will still be relevant both at present and in the future.

The reliability of testing each TP location was not different from previous reporting in the literature.13–15


The present study was conducted in a well-defined population recruited by general practitioners on the basis of sick-listing due to LBP, and all had chronic LBP. TPs were normally distributed, making it possible to analyse data with parametric methods.


The number of patients was small, resulting in wide CIs of ICC, and only two raters participated. If more raters had participated, the results would have been more generalisable.


The possible advantages of using TP examination in LBP patients include ease and speed, no requirements of equipment and good reliability and agreement. Furthermore, malingering or appealing distress will probably not induce bias in LBP patients, who do not know what to prefer, many or few TPs.

The possible disadvantages include lack of precision and the need for training and equipment (dolorimeter).

We need to know more about the variability of the TP count over time, and we need reproducibility studies comparing TP counts with other measures of dysfunction of the descending pain-inhibiting system.37 As an example, lack of cold tolerance has been documented in whiplash patients with prolonged symptoms.40 TP counts may be compared with cold tolerance.

Furthermore, it would be interesting to see reliability and agreement studies of the total TP count in fibromyalgia patients and patients with inflammatory rheumatic diseases. Findings resembling the results of the present study may have implications for the fibromyalgia criteria.


Digital TP examination in sick-listed chronic LBP patients was a reliable but not precise instrument. More reliability and agreement studies are needed in LBP patients and other populations, including patients with inflammatory rheumatic diseases.


The authors would like to thank Senior Biostatistician Robin Christensen, MSc, PhD, Head of the Musculoskeletal Statistics Unit, The Parker Institute, Copenhagen University Hospital, Denmark, for invaluable help with the statistical analyses.



  • Contributors JC, OKJ and TE planned the study. JC designed the study in detail and was responsible for acquisition of data and obtaining funding. MGN and OKJ performed the clinical examinations. JC and OKJ were responsible for analysing and interpreting the data. OKJ wrote the manuscript, which was again revised by JC, TE and MGN. OKJ was responsible for administrative and technical support. All authors discussed the results and commented on the manuscript. All authors read and approved the final manuscript.

  • Funding The study was supported by The Research Fund of Regional Hospital Silkeborg, Denmark.

  • Competing interests None.

  • Patient consent Obtained.

  • Ethics approval All patients signed informed consent. The study was reported to the Regional Ethics Committee, who answered that approval was not necessary because only methodology was studied. The study was reported to the Danish Data Protection Agency (No. 2007-58-0010).

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.