Psychometric properties of gross motor assessment tools for children: a systematic review
Alison Griffiths1,2,3, Rachel Toovey3,4, Prue E Morgan1, Alicia J Spittle3,4

  1. Department of Physiotherapy, School of Primary and Allied Health Care, Monash University, Frankston, Victoria, Australia
  2. Department of Physiotherapy, The Royal Children’s Hospital, Parkville, Victoria, Australia
  3. Murdoch Children’s Research Institute, Parkville, Victoria, Australia
  4. Department of Physiotherapy, The University of Melbourne, Parkville, Victoria, Australia

Correspondence to Dr Alicia J Spittle; aspittle{at}


Objective Gross motor assessment tools have a critical role in identifying, diagnosing and evaluating motor difficulties in childhood. The objective of this review was to systematically evaluate the psychometric properties and clinical utility of gross motor assessment tools for children aged 2–12 years.

Method A systematic search of MEDLINE, Embase, CINAHL and AMED was performed between May and July 2017. Methodological quality was assessed with the COnsensus-based Standards for the selection of health status Measurement INstruments checklist and an outcome measures rating form was used to evaluate reliability, validity and clinical utility of assessment tools.

Results Seven assessment tools from 37 studies/manuals met the inclusion criteria: Bayley Scale of Infant and Toddler Development-III (Bayley-III), Bruininks-Oseretsky Test of Motor Proficiency-2 (BOT-2), Movement Assessment Battery for Children-2 (MABC-2), McCarron Assessment of Neuromuscular Development (MAND), Neurological Sensory Motor Developmental Assessment (NSMDA), Peabody Developmental Motor Scales-2 (PDMS-2) and Test of Gross Motor Development-2 (TGMD-2). Methodological quality varied from poor to excellent. Validity and internal consistency varied from fair to excellent (α=0.5–0.99). The Bayley-III, NSMDA and MABC-2 have evidence of predictive validity. Test–retest reliability is excellent in the BOT-2 (intraclass correlation coefficient (ICC)=0.80–0.99), PDMS-2 (ICC=0.97), MABC-2 (ICC=0.83–0.96) and TGMD-2 (ICC=0.81–0.92). TGMD-2 has the highest inter-rater (ICC=0.88–0.93) and intrarater reliability (ICC=0.92–0.99).

Conclusions The majority of gross motor assessments for children have good to excellent validity. Test–retest reliability is highest in the BOT-2, MABC-2, PDMS-2 and TGMD-2. The Bayley-III has the best predictive validity at 2 years of age for later motor outcome. None of the assessment tools demonstrate good evaluative validity. Further research on evaluative gross motor assessment tools is urgently needed.

  • paediatrics
  • reliability
  • validity
  • rehabilitation medicine
  • gross motor assessment

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Strengths and limitations of this study

  • This systematic review comprehensively assesses methodological quality of included studies using the COnsensus-based Standards for the selection of health status Measurement INstruments checklist.

  • Results of this systematic review can provide guidance to clinicians when choosing gross motor assessment tools based on test psychometric properties and clinical utility.

  • Areas for future research are identified including improving the evidence of inter-rater and intrarater reliability and responsiveness to change as well as the ascertainment of predictive validity over a longer period of time.

  • Only articles or test manuals written in English were included.

  • Only one reviewer screened titles and abstracts for inclusion.


Motor function promotes cognitive and perceptual development in children and contributes to their ability to participate in their home, school and community environments.1 Motor impairment can negatively affect activity and participation levels of children,2 which may lead to lower levels of physical activity, fitness and health into adulthood.3 While severe motor deficits are usually diagnosed before 2 years of age, mild motor deficits may not become evident until children are in preschool and primary school environments where they are exposed to increasingly complex tasks and compared with their peers.3 Identification of motor difficulties is an important step towards support and intervention for the child and their family.

Healthcare professionals and researchers require standardised assessment tools to identify, classify and diagnose motor problems in children.4 Furthermore, assessment tools are essential to monitor the effects of interventions.4 There is no gold standard of motor assessment for children and the available tests vary in their ease of use and interpretability in clinical and research settings, and in whether they are norm or criterion referenced.5 Criterion referenced tests are designed to be scored as items or criteria are demonstrated, meaning that the score is a reflection of a child’s competence on the test items. Most available assessments, however, are norm referenced, meaning that a child’s results are reported in relation to a specific population.4 The characteristics of the normed population should be taken into consideration when interpreting test results, as environmental and cultural differences have been found to affect motor development.6

Healthcare professionals should be aware of the validity and reliability of assessment tools to assist in their instrument selection and interpretation of results. Validity refers to ‘the degree to which (an instrument) is an adequate reflection of the construct to be measured’.7 If an instrument does not have adequate construct or content validity then it may not be assessing the skills that it purports to. Reliability refers to ‘the degree to which the measurement is free from measurement error’,7 which is significant when interpreting results. If a child is assessed as being significantly delayed in their gross motor skills, the reliability of that tool indicates the likelihood that a result is due to error.

A systematic review in 2010 by Slater et al 8 evaluated performance-based gross motor tests for children with developmental coordination disorder; however, it did not include the second and most recent version of the Movement Assessment Battery for Children-2 (MABC-2), which is widely used. Brown and Lalor9 suggested that as a result of the changes to the original MABC in age range, age bands, materials and tasks, the MABC-2 requires independent reliability and validity assessment. Over the past 8 years, there has also been a significant increase in the number of papers assessing the psychometric properties of motor assessment tools in children. A systematic review of these and previous papers is warranted, in order to add to our understanding of the psychometrics of standardised gross motor assessment tools.

The primary aim of this systematic review is to identify and evaluate the clinical utility and psychometric properties of gross motor assessment tools appropriate for use in preschool and school age children from 2 to 12 years by assessing the methodological quality of the included studies. The secondary aim of this review is to identify any areas for further research.


A comprehensive search strategy was completed in databases OVID Medline (1996 to May 2017), CINAHL plus (1937 to July 2017), Embase (1974–May 2017) and AMED (1985–July 2017) (see online supplementary tables 1-4). The search strategy used MeSH terms and text words for (‘child’ or ‘paediatric’) and (‘motor skills’ or ‘motor activity’ or ‘gross motor’ or ‘psychomotor’ or ‘developmental coordination disorder’) and (‘questionnaires’ or ‘outcome assessment’ or ‘instrument’ or ‘task performance’) and (‘reliability’ or ‘validity’ or ‘psychometrics’). Reference lists of included articles were also screened to identify any additional papers. If full texts were unavailable or further information was required regarding availability of manuals, the authors were contacted.
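The Boolean structure of the search described above can be sketched as follows. This is an illustrative sketch only: the helper names (`or_group`, `build_query`) are hypothetical, the output is a generic Boolean string, and the database-specific syntax actually used (MeSH headings, field tags, truncation) is given in the online supplementary tables rather than reproduced here.

```python
# Illustrative sketch: combine the four concept groups named in the text.
# Synonyms within a concept are joined with OR; concepts are joined with AND.
# Function names are hypothetical and not taken from the review.

def or_group(terms):
    """Join synonyms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(concept_groups):
    """Join the concept groups with AND."""
    return " AND ".join(or_group(group) for group in concept_groups)


CONCEPTS = [
    ["child", "paediatric"],
    ["motor skills", "motor activity", "gross motor", "psychomotor",
     "developmental coordination disorder"],
    ["questionnaires", "outcome assessment", "instrument", "task performance"],
    ["reliability", "validity", "psychometrics"],
]

query = build_query(CONCEPTS)
print(query)
```

A record is retrieved only if it matches at least one term from every concept group, which is why adding synonyms widens a search while adding concept groups narrows it.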

Assessment tools were included if they (1) were discriminative, predictive or evaluative of gross motor skills, (2) assessed ≥2 gross motor items (eg, balance, jumping), (3) allowed a meaningful gross motor subscore to be extracted, (4) were applicable to children aged 2–12 years, (5) were criterion or norm referenced tests with a standardised assessment procedure and (6) had instructional manuals that were published or commercially available.

Articles describing use of the assessment tool were included if ≥90% of the study population were aged 2–12 years, the article was available in English and the validity and/or reliability of the assessment tool was reported.
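The age criterion for articles can be expressed as a simple proportion check. The sketch below is illustrative: the function name and the treatment of both age bounds as inclusive are assumptions, since the text does not state how boundary ages were handled.

```python
# Illustrative sketch of the >=90% age criterion for study populations.
# Assumption: both bounds (2 and 12 years) are treated as inclusive.

def meets_age_criterion(ages_years, low=2.0, high=12.0, min_share=0.90):
    """Return True if at least min_share of participants are aged low..high."""
    if not ages_years:
        return False
    in_range = sum(1 for age in ages_years if low <= age <= high)
    return in_range / len(ages_years) >= min_share
```

For example, a hypothetical sample in which 9 of 10 children fall within the range would satisfy the criterion, whereas 8 of 10 would not.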

Assessment tools were excluded if they met any of the following criteria: (1) they were questionnaires or screening tools, (2) they were only applicable to children with a specific diagnosis (eg, cerebral palsy, Down’s syndrome), (3) the test manual was not available in English or (4) the version of the test had been superseded.

Titles and abstracts were screened by the first author with any studies that clearly did not meet inclusion criteria excluded. The remaining papers were obtained in full text and reviewed by two authors (AG, RT or PM) with selection based on inclusion and exclusion criteria. Papers and assessment tools were included after discussing with both raters, with conflicting decisions discussed until a consensus was reached.

Methodological assessment of the papers was completed using the four-point scale of the COnsensus-based Standards for the selection of health status Measurement INstruments (COSMIN) checklist.10 The COSMIN incorporates three quality domains: validity, reliability and responsiveness consisting of seven measurement properties: content, construct and criterion validity, internal consistency, reliability, measurement error and responsiveness7 (see online supplementary table 5). Cross-cultural validity, structural validity and hypothesis testing are all considered to be a component of construct validity.7 While predictive validity is considered to be a component of content validity, it is reported separately in this paper for interpretability of results.7

Supplementary file 5

The overall score for each measurement property on the COSMIN checklist is determined by a ‘worst score counts’ approach.10 Each property is rated as excellent, good, fair or poor methodological quality based on descriptive criteria. Data extraction and assessment of methodological quality were performed independently by two assessors (AG and RT). In the case of any uncertainty, a third reviewer (AS) performed a COSMIN assessment and disagreement was resolved through discussion.
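The ‘worst score counts’ rule can be illustrated with a short sketch: the overall rating for a measurement property is the lowest rating given to any checklist item in that property’s box. The function name and the example ratings are hypothetical; the item wording and the criteria behind each rating come from the COSMIN checklist itself.

```python
# Illustrative sketch of the 'worst score counts' rule on the four-point scale.
# Ratings are ordered from lowest to highest methodological quality.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def worst_score_counts(item_ratings):
    """Return the lowest rating among the items scored for one property."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # list.index raises ValueError for a label outside the four-point scale
    return min(item_ratings, key=RATING_ORDER.index)


# Hypothetical ratings for the items in one checklist box
print(worst_score_counts(["excellent", "good", "fair", "excellent"]))  # fair
```

Under this rule a single low-rated item determines the overall rating, which is the basis of the floor-effect criticism noted later in the discussion.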

A data extraction form for each assessment tool was adapted from the CanChild Outcome Measures Rating Form to collate information on clinical utility, validity, reliability and responsiveness.11 Items chosen to represent the clinical utility of the assessment tools were the cost of manuals, kits, training requirements, time to administer the assessment and the ease of scoring. All reported values for reliability were collected; however, only those papers reporting intraclass correlation coefficient (ICC) were directly compared.
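For readers less familiar with the ICC, the sketch below shows how one common form is calculated from a table of scores. It assumes the two-way random-effects, absolute-agreement, single-measure form, often written ICC(2,1); the included studies may have used other forms, and this review does not prescribe one. The function name is hypothetical.

```python
# Illustrative sketch: ICC(2,1) from a complete children-by-raters score table.
# Assumption: two-way random effects, absolute agreement, single measures.

def icc_2_1(scores):
    """scores: one row per child, one column per rater (or test occasion)."""
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >=2 rows and >=2 columns")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-children mean square
    ms_cols = ss_cols / (k - 1)               # between-raters mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

Unlike Pearson’s correlation, this form penalises systematic differences between raters (the between-raters mean square appears in the denominator), which is one reason ICCs were preferred for direct comparison.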

Patient and public involvement

As this was a systematic review of existing papers, there was no patient or public involvement.


Figure 1 provides details of study selection. Seven assessment tools were identified for inclusion: Bayley Scale of Infant and Toddler Development III (Bayley-III), Bruininks-Oseretsky Test of Motor Proficiency 2 (BOT-2), MABC-2, McCarron Assessment of Neuromuscular Development (MAND), Neurological Sensory Motor Developmental Assessment (NSMDA), Peabody Developmental Motor Scales 2 (PDMS-2) and Test of Gross Motor Development 2 (TGMD-2). The corresponding manuals were then added to the final yield resulting in 30 papers and 7 manuals. Twenty assessment tools were excluded (see online supplementary table 6).

Supplementary file 6

Figure 1

Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow diagram detailing study selection.

The majority of assessment tools identified in this review are discriminative and most lend themselves towards use in a research setting. All norm referenced tools are from western countries and each identified test covers a different age range as shown in table 1.

Table 1

Gross motor assessment tool characteristics

The TGMD-2 is the only tool that assesses gross motor skills in isolation and that focusses on quality of performance. The other gross motor assessments were either in conjunction with assessment of fine motor and/or balance (MAND, MABC-2, BOT-2 and PDMS-2) or as a component of a developmental assessment (NSMDA, Bayley-III).

Despite the variability in test structures, there is some consistency of items included within the gross motor skill subsets between tests. Most include a locomotion task such as walking, running or stair climbing; an object control or manipulation task such as throwing or catching a ball and a static or dynamic balance task such as standing on one leg or hopping. The PDMS-2, BOT-2 and the MAND also include strength assessments (the PDMS-2 only in some age groups).

The number of gross motor items for assessment varies both within and between the tools (table 1). For example, the number of items tested in the Bayley-III and the PDMS-2 depends on the age and ability of the child. Several assessments report criteria for describing gross motor delay, although all test manuals warn against diagnosing delay based on a single assessment.

The PDMS-2 is notable for the inclusion of credit towards incomplete skills in the scoring system. Most other tests award a point or credit towards a skill only if it is demonstrated to the full satisfaction of the stated criteria (score of 0 or 1). The PDMS-2, however, is scored 0–2, allowing 1 mark to be allocated as a child progresses towards a skill without mastering it. The TGMD-2 is also notable for its marking system, in which points are awarded for the quality of the action performed, instead of satisfactory completion of the task only. These actions include preparatory movements prior to running and jumping, or arm position during movements. The NSMDA marking criteria are somewhat more complicated, with a system of scores 1–4 and a symbol of ‘+’ denoting hyperactive response and ‘–’ a hyporeactive response. The PDMS-2, MABC-2, BOT-2, MAND, TGMD-2 and Bayley-III all require raw scores to be converted to a standard (or scaled) score based on tables supplied in the manuals. For the BOT-2, this is a multiple step process, the result of which can be converted to either sex-specific or combined standard scores and percentile ranks. A summary of assessment tool characteristics can be found in table 1.

Clinical utility

The clinical utility of the assessment tools is summarised in table 2, while scoring and administration are detailed in online supplementary table 7. The shortest administration time is 15–20 min for the TGMD-2 and the MAND, while most manuals report that 20–60 min is required to complete an assessment. These times are not inclusive of equipment set up, pack up and scoring, which vary depending on the amount of equipment and complexity of the scoring process. All assessments require the user to be familiar with the test before administration and to possess a high level of understanding of child movement and development. The MABC-2 and PDMS-2 are the only assessments that come with supporting material to guide intervention postassessment (when the complete kit is purchased).

Supplementary file 7

Table 2

Clinical utility of gross motor assessment tools

Methodological quality

All articles were assessed using the COSMIN checklist to determine methodological quality. Several studies were marked down for failing to report missing data, small sample sizes and for using inappropriate statistical methods. A summary of the articles and corresponding COSMIN methodology rating is provided in table 3.

Table 3

Methodological quality of included articles


The content and construct validity of the included assessment tools are summarised in table 4. Most assessments were developed by or with input from experts in the field, with most also performing literature reviews. Bruininks and Bruininks12 performed comprehensive surveys, pilot, tryout and standardisation studies before finalising the BOT-2, providing the most comprehensively reported content validity.

Table 4

Content and construct validity of assessment tools

Construct validity was confirmed with factor analysis (either exploratory or confirmatory) in most assessment tools. The TGMD-2 has the most evidence for construct validity with several papers performing confirmatory and exploratory factor analysis.13–18 The MABC-2, BOT-2, Bayley-III, MAND and PDMS-2 had factor analysis performed only in one paper. The MABC-2 was shown to require changes to remain valid in the Chinese-speaking and Dutch-speaking populations.19 20 The BOT-2, MABC-2 and TGMD-2 all provide evidence of the ability to discriminate between particular age or diagnosis groups, which can be considered to support their content validity. The NSMDA has minimal assessment of construct validity in children over 2 years. The Bayley-III, NSMDA and MABC-2 are the only assessments that provide evidence of predictive validity (table 5). Concurrent validity between the MABC-2, PDMS-2 and BOT-2 is moderate to high, while the TGMD-2 is only weakly correlated with the MABC-2 5 (table 5). The PDMS-2, TGMD-2 and NSMDA report correlations with other criteria such as paediatrician diagnosis, physical fitness or psychomotor/intelligence tests.

Table 5

Criterion and predictive validity of assessment tools


Internal consistency of assessments is summarised in table 6. The high internal consistency of the BOT-2 is well supported, including for children with an intellectual disability.21 22 The MABC-2 appears to have lower internal consistency than the BOT-2, which may relate to the limited number of test items (eight) on the MABC-2. The highest values for internal consistency for the MABC-2 were obtained in specific populations (intellectual disability and developmental coordination disorder) with poor to fair methodology only. Conversely, the highest quality articles reported the lowest values, although it should be noted that these assessed age band 1 (3–6 years) only. Internal consistency is reported to be high for the PDMS-2, while the Bayley-III is shown to have excellent internal consistency in children aged 24–42 months. The TGMD-2 is reported by two good quality (and four poor to fair quality) articles to have excellent internal consistency, including for children with vision impairment and intellectual disability. The MAND is the only assessment tool included in this review without published data of internal consistency or reliability in this age group.
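Internal consistency is usually reported as Cronbach’s α, as in the abstract of this review. The sketch below shows the conventional calculation from item-level scores; the function name is hypothetical and sample variances are assumed. Because the factor k/(k−1) and the ratio of summed item variances to total-score variance both depend on the number of items k, a short test can return a lower α than a longer one even when its items are similarly related, which is consistent with the explanation offered above for the MABC-2.

```python
# Illustrative sketch: Cronbach's alpha from a children-by-items score table.
# Assumption: sample (n-1) variances, as computed by statistics.variance.
from statistics import variance


def cronbach_alpha(item_scores):
    """item_scores: one row per child, one column per test item."""
    n = len(item_scores)
    k = len(item_scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in item_scores):
        raise ValueError("need a complete table with >=2 children and >=2 items")

    # Variance of each item across children
    item_variances = [variance([row[j] for row in item_scores]) for j in range(k)]
    # Variance of each child's total score
    total_variance = variance([sum(row) for row in item_scores])

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```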

Table 6

Reliability of assessment tools

The reliability findings are summarised in table 6 and in figures 2 and 3. Test–retest reliability was excellent in the Bayley-III (table 6), BOT-2 and PDMS-2; and was good to excellent in the MABC-2 and TGMD-2 (figure 2). Intrarater reliability was rarely investigated or reported for most tools, with the TGMD-2 demonstrating better results than the MABC-2 (figure 3). Only the TGMD-2 and MABC-2 report inter-rater reliability values using an ICC (figure 3).23 24 Inter-rater reliability is also supported in the BOT-2 with Pearson’s correlation coefficient and Kappa, respectively. The studies referred to in the test manuals for the TGMD-2, Bayley-III, BOT-2 and MABC-2 all report reliability findings using Pearson’s correlation, which is less ideal than an ICC or weighted kappa for statistical analysis.25 26 Only studies reporting ICCs are visually represented in figure 2 (test–retest) and figure 3 (inter-rater and intra-rater). The TGMD-2 test–retest reliability results from Houwen et al 16 were believed to contain an error as the reported ICC was outside of the reported CIs (ICC 0.92, 95% CI 0.82 to 0.91). This data set was therefore excluded from figure 2.
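The consistency check that led to this exclusion is straightforward to express: a point estimate should lie within its own reported CI. The sketch below uses the values quoted above; the function name is hypothetical.

```python
# Illustrative sketch: flag a reported estimate that falls outside its own CI.

def estimate_within_ci(estimate, ci_low, ci_high):
    """Return True if the point estimate lies within the reported interval."""
    if ci_low > ci_high:
        raise ValueError("lower bound must not exceed upper bound")
    return ci_low <= estimate <= ci_high


# Values as reported for TGMD-2 test-retest reliability in Houwen et al
print(estimate_within_ci(0.92, 0.82, 0.91))  # False
```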

Figure 2

Test–retest reliability of gross motor assessment tools. BOT-2, Bruininks-Oseretsky Test of Motor Proficiency second edition12; ICC, intraclass correlation coefficient; MABC-2, Movement Assessment Battery for Children second edition29; PDMS-2, Peabody Developmental Motor Scales second edition34; TGMD-II, Test of Gross Motor Development second edition.15

Figure 3

Inter-rater and intrarater reliability of gross motor assessment tools. ICC, intraclass correlation coefficient; MABC-2, Movement Assessment Battery for Children second edition29; TGMD-II, Test of Gross Motor Development second edition.15

Responsiveness was reported for the Bayley-III, BOT-2, MABC-2 and PDMS-2 with minimal detectable change (MDC) or a standard error of measurement (SEM).21 Sensitivity and specificity for detecting change was shown to be satisfactory in the MABC-2, PDMS-2 and MABC-2 21 (table 6). There have been no studies to date on the responsiveness of the TGMD-2, NSMDA or MAND.
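The relationship between reliability, SEM and MDC can be sketched with the conventional formulas SEM = SD × √(1 − r) and MDC95 = 1.96 × √2 × SEM. These formulas are standard in the measurement literature but are not stated in this review, and the included studies may have derived their values differently; the function names and the example values below are hypothetical.

```python
# Illustrative sketch of the conventional SEM and MDC formulas.
# Assumptions: r is a reliability coefficient (eg, a test-retest ICC) and
# sd is the standard deviation of scores in the sample of interest.
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def minimal_detectable_change(sem, z=1.96):
    """MDC at the confidence level implied by z (1.96 gives MDC95)."""
    return z * math.sqrt(2.0) * sem


# Hypothetical values, not taken from any included study
sem = standard_error_of_measurement(sd=10.0, reliability=0.96)
mdc95 = minimal_detectable_change(sem)
```

With these hypothetical inputs the SEM is 2 points and the MDC95 is about 5.5 points, so a change smaller than that could not be distinguished from measurement error.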


This review identified seven gross motor assessment tools appropriate for use in clinical or research settings, each with their own strengths and limitations. Interestingly, only one of the seven assessments (TGMD-2) measured gross motor skills in isolation. This is likely a reflection of current practice to assess children’s development as a whole, rather than assessing individual domains in isolation. A gross motor assessment embedded within a developmental assessment, such as that of the Bayley-III, may be more appropriate than an isolated gross motor assessment for children where there is suspicion of multiple impairments.

A review by Slater et al 8 reported that the TGMD-2 and the MABC (first edition) were recommended for assessing gross motor skills in children with developmental coordination disorder, but found that the MABC needed further evidence of validity. Cools et al 27 also published a detailed review of the clinical utility of gross motor assessment tools for children, but did not address the validity, reliability or responsiveness to change of these measures. This review adds to the literature by including updated information on the psychometric properties of the measures and a thorough methodological assessment using the COSMIN checklist, which allows the reader to interpret these results with confidence. We have identified 10 additional publications to support the content, construct and criterion validity of the MABC-2 and have demonstrated an overall higher methodological quality of the papers assessing the MABC-2 when compared with the TGMD-2. Lower methodological scores on the COSMIN can be attributed to inadequate reporting of statistical methods, small sample sizes and non-independent assessors. Further research in this area should consider addressing these limitations in study design to reduce potential error and increase confidence when interpreting results.

Content validity has been established for five of the included assessment tools; however, further research into the content validity for the MAND and NSMDA is required. The NSMDA’s ability to predict a diagnosis of CP and motor outcomes over time does support its content validity; however, the methodology scored as poor to fair on the COSMIN and as such content validity cannot be fully established. The use of expert panels, focus groups and/or stakeholder feedback for the BOT-2, MABC-2, TGMD-2 and PDMS-2 demonstrates thorough consideration of the relevance and comprehensiveness of each test’s assessment items during development.

The TGMD-2 is the only assessment tool considered to have well-established construct validity, with several papers reporting factor analysis. The NSMDA has undergone factor analysis for children up to, but not beyond 2 years of age and as such further research is needed to support its validity in older children. All other included assessment tools have undergone factor analysis assessment of their construct validity in one paper and are supported by the ability to discriminate between medical diagnosis or age, and as such are considered to have adequate construct validity. The criterion validity indicates that the TGMD-2 may be measuring a slightly different construct to the other assessment tools included in this study as it has poor agreement with the MABC-2, which in turn has good agreement with the PDMS-2 and the BOT-2. This difference may be related to the inclusion of the assessment of quality of movement in the TGMD-2, or the inclusion of balance and/or fine motor tasks on the other assessments. There is scope to investigate the criterion validity of the MAND and the gross motor subsections of the Bayley-III and the NSMDA with the other assessment tools in this study in the future.

The BOT-2 was the only assessment tool to have its reliability assessed with excellent methodology. In conjunction with its reported results, it can be considered to have the strongest evidence for internal consistency and test–retest reliability out of the included assessment tools. The PDMS-2 and the MABC-2 can be considered to have the next best established test–retest reliability with good methodological quality. The reported test–retest reliability values for the TGMD-2 are impacted by the poor to fair methodological quality, and further high-quality research needs to be done to support its body of evidence. Test–retest, inter-rater or intrarater reliability has not been assessed in the MAND and NSMDA. In the clinical context, gross motor assessments are often repeated over time or between therapists and as such these measures of reliability should be established. The Bayley-III would also benefit from further research into its reliability, with no published inter-rater or intrarater reliability measures, and with only one, fair quality report of good test–retest reliability.

As yet, there is little evidence to support the use of these assessments as outcome measures. The inclusion in some of the articles of minimal detectable change (MDC) and minimal clinically important difference (MCID) is valuable for clinicians.7 The difference between MDC and MCID is also of importance, as a change in score does not necessarily relate to a meaningful change for the child or their family. Only the Bayley-III, BOT-2, MABC-2 and PDMS-2 have a reported MCID with satisfactory sensitivity and specificity; however, due to the fair methodological quality used to obtain these values they cannot be used with a high level of confidence until further studies have been performed. The TGMD-2 was created in part to be used as an outcome measure; however, there are no articles to date investigating its responsiveness to change.15 It should also be noted that all of the included assessment tools measure impairment and activity limitations, but do not specifically address the other elements of the International Classification of Functioning, Disability and Health domains of participation, personal factors and environment.2 Clinicians should use appropriate assessments or questionnaires to ensure that these domains of health are also addressed in line with the WHO guidelines.2

When considering a test’s reliability, all three elements of test error should be taken into account: time sampling (assessed with test–retest reliability), content sampling (assessed as internal consistency) and interscorer difference (or inter-rater reliability).15 This is one of the reasons that clinicians should consider repeating assessments and/or completing a second alternative assessment. All assessments should be interpreted in conjunction with clinical reasoning and observation. Included assessment tools are not intended to be diagnostic on their own; results need to be combined with other assessments and expert opinion to arrive at a clinical diagnosis.

The clinical utility varied across all of the included assessment tools, with the primary differences being in cost and time to administer the assessments. Clinicians and researchers should select their assessment tool with consideration of psychometric properties (inclusive of the methodological rigour behind them), clinical utility and the population, situation and age group in question.

A potential limitation of this study was that one author screened the titles and abstracts, which may have led to a sampling bias. While care was taken to include all potentially relevant papers and assessment tools until the second round of assessment with two authors, the potential for exclusion of papers relevant to this review remains. The process of excluding both papers and assessment tools in this single step may also be seen as a limitation, as the total number of assessment tools (or different versions of tools) was not reported. This process does, however, comply with the COSMIN and PRISMA guidelines. A second limitation was the restriction of included papers and manuals to those published in English. Unfortunately, this resulted in the exclusion of three assessment tools that have been reported as commonly used in Europe: the Motoriktest für Vier- bis Sechsjährige Kinder 4–6, the Körperkoordinationstest für Kinder and the Maastrichtse Motoriek Test.27 The authors also note the third edition of the TGMD is soon to be published and will need to be subjected to a similar level of assessment of psychometric properties in the future.

Clinicians and parents who need guidance to set realistic therapy goals and to understand future intervention requirements benefit from understanding a test’s predictive ability. The NSMDA and the MABC-2 are the only tools that have demonstrated long-term (≥4 years follow-up) predictive validity, while the Bayley-III has good predictive validity at 2 years for future movement difficulties and for the diagnosis of cerebral palsy at 4 years. However, further research into the long-term predictive validity of all included gross motor assessment tools is warranted.

While validity and reliability should guide selection of assessment tools, clinical utility must also be taken into consideration. Most tests have ongoing costs associated with forms and equipment replacement, which may be prohibitive to some users. The NSMDA requires the therapist to handle the child for several items, which should be considered in relation to manual handling policies of institutions. Assessment burden for children and families should also be taken into consideration when selecting an assessment tool. Younger children are more likely to be distracted and may not understand test items as well, which may also increase assessment times.28

When a new edition of an assessment tool is released resulting in a change in age groups, scoring or tasks, it is insufficient to rely on the psychometric assessments that were performed on the original test. The MABC-2 manual provides justification for the inclusion of reliability and validity assessment of the original MABC29; however, owing to the significant changes in age groups and tasks between editions these were not included for the analysis of the MABC-2 in this review. Two studies quoted in the MABC-2 manual to support the validity and reliability are both unpublished works and as such are also unable to be included in this systematic review. This could indicate a publication bias for the MABC-2.

The thorough methodological assessment of the included articles using the COSMIN checklist should be seen as a strength of this paper, as should the range of assessment tools included in this review. While it has previously been argued that the ‘worst score counts’ criteria in the COSMIN creates a floor effect,30 the COSMIN authors argue that only ‘fatal flaws’ contribute to an overall score of poor.10 There are few tools available to assess the psychometric properties of assessment tools and arguably none so robustly validated as the COSMIN.

There are many appropriate gross motor assessment tools available for use in research and clinical settings today. Most of the available tools demonstrate adequate validity and reliability in children aged 2–12 years, and as such the authors do not believe that new assessment tools need to be developed. There is scope, however, to improve the evidence of inter-rater and intra-rater reliability, and predictive validity should be ascertained over a longer period of time and with greater methodological rigour. Tools also need clearer assessment of their responsiveness to change to assist clinicians and researchers with outcome measure selection. Researchers should be mindful of the methods they use to assess validity and reliability. Clarity of reporting, statistical methods and sample sizes should be carefully considered to ensure the highest quality of evidence.


Currently available gross motor assessment tools for children have good to excellent content and construct validity. The BOT-2, MABC-2, PDMS-2 and TGMD-2 are the most reliable assessments in this age group. The Bayley-III has the best predictive validity at 2 years of age, and the NSMDA and the MABC-2 both have good predictive validity at 4 years of age. There is scope for further research into the predictive validity, reliability and responsiveness of gross motor assessment tools in preschool and school-aged children. In practice, clinicians should choose assessments with consideration of their psychometric properties in the context of the child that they are assessing.

Supplementary file 1

Supplementary file 2

Supplementary file 3

Supplementary file 4




  • Contributors All individuals listed as authors meet the appropriate authorship criteria and have approved the acknowledgement of their contributions. The primary author, AG, was responsible for the drafting of the paper and liaising with the coauthors on findings and conclusions. RT contributed to the paper through interpretation of data, completing methodological assessments and revising manuscript content throughout its development. PEM and AJS both contributed to the paper through assisting with the development of research design, interpretation of data and revising manuscript content through its development.

  • Funding This study was part-funded by grants from the National Health and Medical Research Council Career Development Fellowship (AJS) 1053767 and Centre of Research Excellence in Newborn Medicine 1060733 (AJS and AG) and the Victorian Government’s Operational Infrastructure Support Programme.

  • Competing interests None declared.

  • Patient consent Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement This paper includes data obtained from reviewing papers of published manuscripts. Data can be accessed by contacting the primary author.
