A systematic review of behavioural marker systems in healthcare: what do we know about their attributes, validity and application?
  1. Aaron S Dietz (1),
  2. Peter J Pronovost (1,2),
  3. Kari N Benson (1),
  4. Pedro Alejandro Mendez-Tellez (2),
  5. Cynthia Dwyer (3),
  6. Rhonda Wyskiel (1),
  7. Michael A Rosen (1,2)

Affiliations:
  1. The Armstrong Institute for Patient Safety and Quality, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
  2. Department of Anesthesiology and Critical Care Medicine, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
  3. Surgical Intensive Care Unit, Johns Hopkins Hospital, Baltimore, Maryland, USA

Correspondence to Dr Michael A Rosen, Armstrong Institute for Patient Safety and Quality, and Department of Anesthesiology & Critical Care Medicine, Johns Hopkins University School of Medicine, 750 East Pratt Street, 15th Floor, Baltimore, MD 21202, USA; mrosen44@jhmi.edu

Abstract

Objective Behavioural marker systems are advocated as a method for providing accurate assessments, directing feedback and determining the impact of teamwork improvement initiatives. The present article reports on the state of quality surrounding their use in healthcare and discusses the implications of these findings for future research, development and application. In doing so, this article provides a practical resource where marker systems can be selected and evaluated based on their strengths and limitations.

Methods Four research questions framed this review: what are the attributes of behavioural marker systems? What evidence of reliability and validity exists? What skills and expertise are required for their use? How have they been applied to investigate the relationship between teamwork and other constructs?

Results Behavioural marker systems are generally designed for specific work domains or tasks. They often cover similar content with inconsistent terminology, which complicates the comparison of research findings across clinical domains. Although several approaches were used to establish the reliability and validity of marker systems, the marker system literature as a whole requires more robust reliability and validity evidence. The impact of rater training on rater proficiency was mixed, but evidence suggests that improvements can be made over time.

Conclusions Consensus definitions for teamwork constructs must be reached to ensure that the meaning behind behavioural measurement is understood across disciplines, work domains and task types. Future development efforts should focus on the cost effectiveness and feasibility of measurement tools, including the time spent training raters. Further, standards for the testing and reporting of psychometric evidence must be established. Last, a library of tools should be compiled, organised by whether each instrument measures general or domain-specific behaviours.

  • Teamwork
  • Qualitative research
  • Performance measures


Reliable data collection is what separates mumbo jumbo from science, hope from reality (ref. 1, p.175).

Introduction

Breakdowns in teamwork have been recognised as a prominent contributor to medical errors and incidents of patient harm for over two decades.2–4 Accordingly, patient safety researchers and practitioners alike have sought a better understanding of effective team processes and improvement strategies.5–8 The validity of conclusions drawn from these efforts and the extent to which teamwork improves as a result of interventions is contingent upon rigorous, psychometrically driven measurement practices.9

Teamwork measurement in healthcare relies primarily on observational or self-report methods, each with inherent tradeoffs.10 Behavioural marker systems are an observational measurement approach used widely in aviation and other high-risk industries.11 Behavioural markers are concrete and observable examples of some aspect of effective or ineffective performance. The development and use of valid marker systems is essential for providing accurate assessments, directing feedback and determining the impact of teamwork improvement initiatives.12 Without strong reliability and validity evidence, the link between team effectiveness and safety and performance outcomes may be misrepresented or misleading. Therefore, the definition of constructs, content of measurement items and measurement procedures demand careful scrutiny.13

The purpose of this article is twofold. First, this article systematically reviews the state of science and practice surrounding the use of behavioural marker systems in healthcare by answering four questions: (1) What are the attributes of behavioural marker systems? (2) What evidence of reliability and validity exists? (3) What skills and expertise are required for their use? and (4) How have behavioural marker systems been applied to investigate the relationship between teamwork and other constructs? Answering these questions provides details about the current state of marker systems in healthcare, the evidence supporting their use in making decisions or providing feedback, the level of investment in human capital (ie, rater training) required for effective use and present applications of marker systems. Next, this article discusses the implications of these findings for future research, development and application of marker systems in healthcare. In doing so, we seek to provide both an understanding of the state of quality surrounding the use of behavioural marker systems in healthcare and a practical resource whereby marker systems can be selected based on the context of behavioural measurement and the logistical demands associated with establishing psychometric quality and training requirements. As all measurement should be based in theory, we begin by reviewing the science of teams and performance measurement to provide structure for the review.

Background

The science of teams

A robust multidisciplinary science of teams has explicated a broad set of factors related to team effectiveness.14 ,15 This literature as well as the existing literature in healthcare has been plagued with inconsistent terminology.16 Consequently, this section briefly defines key terms used in this review. A team refers to two or more individuals with specific roles who work interdependently and adaptively toward a shared goal.17 Behaviour within teams can be classified in terms of taskwork (ie, behaviours related to how individual team members carry out their individual work) and teamwork (ie, behaviours related to team member interactions).18 Team performance is the culmination of taskwork and teamwork activities (ie, what the team actually does) and team performance effectiveness refers to whether team performance outcomes fulfil performance goals and expectations.19 In healthcare, the term non-technical skills is also used to describe individual-related and team-related behaviours that are not related to technical aspects of clinical practice.20

Team performance is generally characterised in terms of inputs, mediators and outcomes (IMO).14 This IMO framework has been adopted in healthcare as well.21 As illustrated in figure 1, the influence of team inputs (eg, composition characteristics) on team outputs (eg, quality/quantity of performance outcomes, safety outcomes) is mediated by team processes (eg, communication) and emergent states (eg, situational awareness (SA)). Categories of team processes include action (ie, task execution), transition (eg, planning or preparing) and interpersonal (eg, conflict management).22 ,23

Figure 1 Generic inputs, mediators and outcomes (IMO) model.

Behavioural marker systems

Behavioural markers are ‘a prescribed set of behaviours indicative of some aspect of performance’ (ref. 11, p.96). Ratings from marker systems are used to make inferences about latent team skills and cognitions. For example, SA is a cognitive construct that involves perception, comprehension and anticipation.24 ,25 Marker systems have evaluated SA by rating behaviours related to gathering information (eg, cross-checking), recognising and understanding (eg, articulation of cues and their importance), and anticipation (eg, actions taken to circumvent a problem).26

Behavioural marker systems rely on trained raters to assess overt behaviours, making them uniquely suited to capture teamwork skills with enhanced objectivity.11 Marker systems are also competency-driven and afford a standardised lexicon to structure assessments and feedback because of their specificity.6 ,26 ,27 For example, teams can debrief following a performance episode or an improvement initiative with a clear understanding of what ratings on certain team competencies actually signify.

Although behavioural marker systems have great potential for evaluating team performance, theoretical and logistical factors may limit widespread adoption and use. First, consistent with findings that suggest contextual and task-related factors will dictate which competencies are important,28 teamwork assessment strategies are not guaranteed to generalise across domains of work or research.29 Because behavioural markers are specific descriptions of performance, they may need to be adapted when applied to new situations.30 For example, multiple situations might occur simultaneously (eg, admissions, arrest management) or transpire at different times. Adapting marker systems to account for unique situational attributes may entail significant time and resources. Second, staff time is required to make observations, and raters must be trained to ensure data are collected reliably. A variety of rater-training strategies exist, but all require an investment in staff time.31 Third, team performance can only be assessed during periods of observation; inferences drawn from marker systems are constrained by the task being observed or the period of observation. This means criterion variables must also be collected during the same process of data collection. To illustrate, it would be methodologically inappropriate to use data collected with a marker system to predict performance outcomes (eg, errors) drawn from patient safety reporting systems; the researchers would instead have to identify errors as they occurred during the observation period.

Methods

Our approach to the literature search was intentionally broad. We did not restrict our focus to widely employed systems, so that we could report fully on the state of quality surrounding the use of these systems across the healthcare community (eg, reliability and validity reporting, training requirements). That said, citations for each marker system provide an index of systems that are frequently used for a given purpose (see online supplementary table S1) in relation to evidence of reliability and validity and the logistical considerations involved in training raters (see online supplementary table S2). In both tables, systems are organised around specific purposes of measurement.

A Boolean search consisting of Medical Subject Headings (MeSH) terms and other key words was conducted using PubMed to identify articles related to: (1) health professionals/healthcare, (2) teamwork/non-technical skills and (3) behavioural assessment. Figure 2 summarises the screening process, while a more complete representation of the search strategy is provided in online supplementary file 1. A coding scheme was iteratively developed to systematically capture article content germane to the objectives of this review, including: attributes of marker systems (ie, behaviours, techniques, targets of measurement), psychometric properties and application in healthcare research. A complete description of key variables is listed in online supplementary file 2. Articles were coded by one individual (AD), and 14% of articles (n=5) were reviewed by two coders (AD, KB) to establish inter-rater reliability (κ=0.743).
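For orientation, Cohen's κ (the agreement statistic reported above) corrects raw per-cent agreement between two coders for the agreement expected by chance. The standard formula is given here for reference; it is not reproduced from the article itself:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

where p_o is the observed proportion of coding decisions on which the two coders agree and p_e is the proportion of agreement expected by chance from each coder's marginal frequencies. By the commonly used Landis and Koch benchmarks, the reported κ=0.743 falls in the 'substantial agreement' band (0.61–0.80).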

Figure 2 Methodological approach.

The state of quality

Thirty-eight articles describing 20 unique marker systems met the inclusion criteria (one article was added during the review process32). Findings from our review are organised around the four key research questions described earlier.

What are the attributes of the behavioural marker systems?

Validity cannot be established globally for a given measure, only for a given purpose under some set of conditions. Therefore, we address how teamwork behaviours are being conceptualised, and for what purpose, to understand how marker systems vary in their content and structure. This question also addresses the techniques used for assessment, which have implications for the ease of rater training and the generalisability of study findings. Defining an appropriate range for a measurement scale, for example, is an important consideration for contrasting the relative magnitude of observed score differences.33 Online supplementary table S1 summarises the purpose of the behavioural marker systems identified (eg, clinical context, type of personnel the system was developed for) and how teamwork behaviours are assessed (eg, scoring format).

Context of measurement

The majority of marker systems were developed for a specific clinical work area (n=15; 75%), with surgery (n=7; 35%) and resuscitation (n=6; 30%) being the most common.

Content of measurement

The systems reviewed used a variety of classification structures varying in their level of specificity or granularity. Six systems used a hierarchical structure to cluster behaviours. To illustrate, the Non-Technical Skills for Surgeons (NOTSS) system includes four behavioural categories each with three elements that constitute a taxonomy of non-technical skills.27 Each element is paired with positive and negative examples of behaviours to guide assessment. Other systems developed for the same task present different factorial structures34 or do not categorise behaviours with subdimensions at all.35 Similar differences in granularity occur at the construct level, where the Just-In-Time Pediatric Airway Provider Performance Scale (JIT-PAPPS)36 assesses decision making as a unidimensional construct, while the Anaesthetists’ Non-Technical Skills (ANTS) system assesses decision making as the product of (1) identifying options, (2) balancing and selecting options and (3) re-evaluating.26 ,37

To examine what behaviours were targeted for measurement, we amalgamated behaviours (both categories and elements) from each marker system. One hundred and four unique behaviours remained after exact duplicates were removed. Next, we removed nominal duplicates to account for redundancies in terminology ostensibly describing the same attribute (eg, coordination, coordinating with others). Seventy-nine unique constructs were retained following this qualitative data reduction. There were other instances where two discrete constructs were paired within a single behaviour (eg, leadership and team coordination38; teamwork and cooperation39). The quantity of constructs precluded a meaningful comparison of behaviours assessed across marker systems.

Structure of measurement

Marker systems varied in their temporal structure and resolution, with the majority using Likert scales (n=14; 70%) with behavioural anchors as an assessment aid (n=12; 60%). For example, Observational Teamwork Assessment for Surgery (OTAS) ratings cover five behaviours, three subteams (surgical, anaesthetic, nursing) and three operative phases (preoperative, intraoperative, postoperative).35 This results in 45 behavioural ratings for a single surgery, each made on a 7-point Likert scale ranging from zero to six.35 By contrast, the Oxford Non-Technical Skills (NOTECHS) scale relies on a summative scoring of behaviours over the entire observation, with raters assessing performance on a 4-point Likert scale ranging from one to four.34
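To make the OTAS rating grid concrete, the following minimal Python sketch enumerates the 5 × 3 × 3 structure described above; the code and identifier names are ours for illustration and do not reproduce any published OTAS implementation:

```python
from itertools import product

# Illustrative OTAS-style grid: 5 behaviours x 3 subteams x 3 operative phases.
BEHAVIOURS = ["communication", "coordination", "cooperation", "leadership", "monitoring_sa"]
SUBTEAMS = ["surgical", "anaesthetic", "nursing"]
PHASES = ["preoperative", "intraoperative", "postoperative"]

def empty_rating_sheet():
    """One cell per (behaviour, subteam, phase); each rated 0-6 by a trained observer."""
    return {cell: None for cell in product(BEHAVIOURS, SUBTEAMS, PHASES)}

sheet = empty_rating_sheet()
assert len(sheet) == 45  # 45 behavioural ratings for a single surgery
```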

As an alternative to Likert scales, three marker systems relied on checklists and one marker system used a frequency count. Andersen et al40 developed a 22-item checklist to evaluate resuscitation teams, but there is no chronological sequence for when raters can expect behaviours to occur. Conversely, JIT-PAPPS uses a temporal structure to assess whether certain actions during airway management simulations were accomplished, partially accomplished or not done at all. Specific actions are linked to competencies such as SA, decision making and teamwork, and are weighted to connote the heightened importance of a particular skill.
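The short Python sketch below illustrates how a weighted, three-level checklist of this kind could be scored. The item names and weights are hypothetical, and the published JIT-PAPPS instrument may weight and aggregate items differently:

```python
# Hypothetical weighted checklist scoring in the spirit of JIT-PAPPS.
# Each item: (description, weight); credit: 1.0 done, 0.5 partially done, 0.0 not done.
ITEMS = [
    ("verbalises airway plan before induction", 2.0),  # decision making
    ("assigns roles to team members", 1.0),            # teamwork
    ("monitors oxygen saturation aloud", 1.5),         # situation awareness
]

def checklist_score(credits):
    """credits: list of 1.0/0.5/0.0 aligned with ITEMS; returns % of weighted maximum."""
    earned = sum(w * c for (_, w), c in zip(ITEMS, credits))
    maximum = sum(w for _, w in ITEMS)
    return 100.0 * earned / maximum

print(checklist_score([1.0, 0.5, 1.0]))  # ~88.9: one partially accomplished action
```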

What evidence of reliability and validity exists?

With an understanding of the purpose and method of measurement in mind, we turn to a synthesis of reliability and validity evidence, because the inferences drawn from measurement must be considered in relation to established psychometric properties. The reliability of a measure concerns its consistency over repeated measurements; validity addresses its accuracy and the quality of inferences that can be made from a specific process of data collection.13 Establishing the reliability of a measure is necessary, but not sufficient, for ensuring its validity.41 Online supplementary table S2 summarises the extent of evidence to illuminate strengths and limitations of existing marker systems while also providing a practical guide for future development and validation efforts. Online supplementary table S3 defines the types of reliability and validity evidence reported, for reference.

Reliability evidence was reported for 15 marker systems (75%) and evidence of validity was reported for 14 marker systems (70%). Multiple sources of validity evidence were reported for 12 marker systems (60%). Many studies involved a multipronged approach to establishing evidence. The Crossley et al42 psychometric evaluation of NOTSS included the application of generalisability (G) theory to demonstrate reliability, exploratory factor analysis to verify NOTSS' internal structure and intercorrelations with other measures to examine the relationship between NOTSS scores and external variables. The analysis revealed that one behaviour loaded on two non-technical skill dimensions.

To validate the Scrub Practitioners' List of Intraoperative Non-Technical Skills system, Mitchell et al43 ,44 first established the content of the measurement tool through focus groups, and followed this effort with a statistical assessment. Evaluation criteria focused on reliability (within-group agreement and internal consistency), validity (accuracy, completeness, observability) and usability (acceptability and usability). Within-group agreement was good for each skill category, but one-third of skill elements did not reach acceptable thresholds (rwg>0.7) and also varied by scenario. These examples highlight the importance of collecting multiple forms of reliability and validity evidence to confirm the veracity of a marker system across multiple indices.
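For reference, the within-group agreement index underlying the rwg>0.7 threshold is conventionally computed, for a single item rated by a group of judges, as follows (James, Demaree and Wolf's standard formulation; the cited study may have used a multi-item variant):

```latex
r_{wg} = 1 - \frac{s_x^{2}}{\sigma_{EU}^{2}},
\qquad
\sigma_{EU}^{2} = \frac{A^{2} - 1}{12}
```

where s_x^2 is the observed variance of the judges' ratings and σ_EU^2 is the variance of a uniform null distribution over the A points of the rating scale; the index approaches 1 as judges converge on the same rating.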

What skills or training are required?

Calibrating rater scores is necessary to ensure that results are reliable, which is generally achieved through rater training. Rater training was specified in 61% of articles (n=23), but information detailing the length of training was reported in only 29% of articles (n=11); the time spent training raters ranged from just over 2 h45 to over 2 days.46 A juxtaposition of the length of rater training with reliability and validity evidence is reported in online supplementary table S2 to provide a brief overview of the resource requirements needed to achieve associated levels of reliability and validity.

Overall, the effectiveness of rater training varied. Agreement between novice raters and expert referents reached good reliability in as little as 4–6 h of training,37 ,44 while other examples were much more time-intensive, lasting over 2 days.46 Russ et al46 reported how the reliability between expert and novice ratings using OTAS improved at each stage of rater training, with the learning curve contingent upon the construct being measured. Rater training involved approximately 2 h of declarative information presentation followed by 1 h of video-based practice. Next, raters observed 10 surgical cases and received immediate feedback on their assessments during postobservation debriefings (approximately 18 h in total). High rater calibration for coordination was established immediately, so improvements were not significant owing to a ceiling effect. Considerable improvements were demonstrated for communication, cooperation and leadership over the first seven observations, while steady improvements in rater calibration for monitoring/SA were demonstrated over the entire observation period. Further, there was no significant difference between novice raters with different professional backgrounds (ie, surgery and psychology).

The impact of rater training on rater performance was mixed, however. Following a two-and-a-half-hour NOTSS training course, the mode rating of non-technical behaviours by novices matched that of experts only half the time.45 Additionally, novices tended to under-rate non-technical performance compared with experts.45 Graham et al47 found considerable differences between expert and novice ratings following a 1-day ANTS training session, with a major source of disagreement being the misclassification of non-technical skills; raters were identifying behaviours, but scoring them as different elements of teamwork. Finally, Lamb et al48 reported a significant difference between the ratings made by different disciplines (ie, surgeon and psychologist), though there were significant improvements as more cases were observed.

How have behavioural measurement systems been applied in healthcare research?

Fifteen articles employed marker systems to test the relationship between constructs (n=4), study the effects of an intervention (n=5) or describe teamwork in relation to task events (n=7). Westli et al49 investigated team skills during trauma simulations. Positive relationships were reported between performance and competencies such as information exchange, coordination, communication and SA. Surprisingly, higher-performing teams demonstrated less supporting behaviour. Other studies reported differences in teamwork scores based on professional background50 ,51 and years of experience.52

Behavioural marker systems have also been employed to establish the effectiveness of training interventions.36 ,53 Frengley et al38 evaluated the relative effectiveness of simulation-based training and case-based learning on the management of airway and cardiac crises with the Teamwork Behaviour Rater. The authors reported that teamwork skills significantly improved under both intervention strategies. Feedback/debriefing on teamwork skills during training was described in four articles, but none described how feedback was delivered or whether it was structured.

With respect to task events, Symons et al54 adapted OTAS to study teamwork skills in handoffs. Despite establishing adequate inter-rater reliability and concurrent validity with another teamwork scale, the authors did not observe significant correlations between teamwork skills and the completion of handoff content, handoff length, interruptions during handoffs or attendance at handoffs. Sevdalis et al55 observed that communication events were most often initiated by surgeons (80%) and were received by either surgeons (46%–56%) or nurses (38%–40%). Additionally, laparoscopic surgeries tended to involve more communication events that were equipment related and directive compared with open surgeries. Another study found that surgeons' SA was negatively correlated with technical errors.51

Forty per cent of applied research articles did not report evidence of rater reliability or training.

Implications for research and practice

This review answered four questions of significance surrounding the use of behavioural marker systems in healthcare. Implications for research and practice are summarised in table 1. First, this review identified attributes of behavioural marker systems. We found a surprisingly large number of unique skills being targeted for measurement. It is likely that marker systems cover similar content, but the inconsistent terminology and differing levels of granularity used to describe constructs complicate the comparison of behavioural marker content across systems. This finding is consistent with a previous review of medical teamwork56 as well as reviews of marker systems in other domains.11

Table 1 Research needs for behavioural marker systems in healthcare

The majority of marker systems were developed for a specific task, yet systems for the same task varied greatly in content and structure. Both NOTECHS34 and NOTSS27 use a hierarchical framework to describe teamwork behaviours for surgery, but the factorial structure of each framework differs (ie, number of dimensions and elements), while OTAS35 does not examine behaviours in relation to a hierarchical framework at all. Given this variability, critical research needs of great practical impact include investigating which attributes of marker systems produce the most reliable and valid ratings at the lowest 'cost' in terms of rater training, as well as any differences that may emerge in using data for different purposes (eg, feedback, assessment, detection of change over time). For instance, researchers and practitioners could then select the marker system with the lowest logistical cost of implementation.

The temporal structure or resolution of a measurement system is a key attribute with implications for ease of training and data use. Most marker systems used a low-resolution time scale where assessments of behaviours were made once over the entire rating period (eg, a team received one score for a dimension for the entire observation period). Low temporal resolution ratings may illuminate what teamwork deficiencies exist, but not necessarily why they occurred.57 Conversely, systems with higher levels of temporal resolution identify phases of performance or multiple time blocks within an observational period. For example, OTAS rates teamwork dimensions across three phases of surgery and JIT-PAPPS uses an event-based approach to measurement (EBAT).36 EBAT tools rate teamwork competencies and skills relative to stimulus events.6 ,58–60 This approach is most useful for training, where scripted scenario events provide opportunities for trainees to exhibit teamwork skills.12 These systems are viable for providing explicit feedback on processes that explain why deficiencies in teamwork may exist. Further, this approach may reduce the cognitive load placed on raters by explicating what is supposed to be assessed and when; raters detect the presence or absence of events following a temporally structured observation checklist, which can enhance objectivity.11 A key shortcoming of EBAT is that generalisability is limited to the context and task being trained. For instance, stimulus events indicative of teamwork skills for a resuscitation task would be fundamentally different from those for a handoff.
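As a sketch of this event-based logic, the Python fragment below pairs scripted trigger events with expected teamwork behaviours; the triggers, behaviours and scoring are hypothetical and are not drawn from JIT-PAPPS or any published EBAT tool:

```python
# Hypothetical EBAT-style checklist: scripted trigger events paired with
# expected teamwork behaviours; raters mark each expected response as seen or not.
EVENTS = [
    {"trigger": "simulated SpO2 drop at minute 5",
     "expected": ["team leader verbalises the change",       # situation awareness
                  "role assigned to check airway"]},          # coordination
    {"trigger": "confederate nurse reports wrong drug dose",
     "expected": ["dose cross-checked aloud before giving"]}, # closed-loop communication
]

def score_event_checklist(observed):
    """observed: set of expected-behaviour strings actually seen; returns hit rate."""
    expected = [b for e in EVENTS for b in e["expected"]]
    return sum(b in observed for b in expected) / len(expected)

print(score_event_checklist({"team leader verbalises the change"}))  # 0.333...
```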

Second, this review examined evidence of reliability and validity. Without this evidence, researchers cannot verify whether interventions are actually affecting team processes, safety outcomes or performance outcomes. The most widely cited index of reliability was the calibration of scores among raters, yet inter-rater reliability estimates only one source of measurement error: the rater. In reality, error variance and systematic bias in ratings can come from other sources, such as the time of observation, the participants being observed and the context of observation. Unlike traditional approaches to reliability testing, G studies partition variance among multiple sources simultaneously61 ,62 to determine whether alternative designs would minimise measurement error in future applications.63 Although G theory is a paragon of reliability testing, it was carried out in only one study42 and provides a future opportunity for researchers to unequivocally define and account for sources of measurement error.
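To illustrate the logic, in the simplest person × rater G study the generalisability coefficient for a design averaging over n_r raters is (a generic textbook formulation, not the specific model fitted by Crossley et al):

```latex
E\rho^{2} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{pr,e} / n_{r}}
```

where σ²_p is the variance among the persons (or teams) being rated and σ²_pr,e pools the person × rater interaction with residual error. A follow-on decision (D) study simply varies n_r (or other facets, such as occasions) to project how alternative measurement designs would reduce error.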

Evidence of validity should come from a variety of sources, such as a tool's content, whether competencies can be observed, the tool's internal structure and convergent and discriminant relationships with other constructs, to name only a few.33 Clearly, extensive evidence is needed to establish the validity of a marker system, yet multiple sources of evidence were reported for only 12 marker systems. While exceptions exist (see online supplementary table S2), the marker system literature requires further validation research. This finding is consistent with previous reviews on performance measurement in healthcare.64

Third, this review sought an understanding of the expertise required to judge performance sufficiently. Accurate judgments of behaviours directly influence the validity of inferences drawn from measurement, and all raters are susceptible to biases, no matter their professional background. Rater training is therefore necessary to immerse raters in the content of the marker system and its appropriate use for observation, and to curtail the possibility of rater biases manifesting during assessments.11 Best practices for behavioural assessment call for recurrent rater training and reliability testing to ensure rater scores are calibrated and accurate over time.12 The impact of rater training on rater proficiency was mixed,36 ,44 ,45 but evidence suggests that improvements can be made over time.46 Additionally, many techniques for calibrating raters have been developed, but not all have been applied in this context.31 Future research should examine which approaches to rater calibration are most cost effective.

The final aim of this review was to examine the application of behavioural marker systems in healthcare research. Focal shortcomings identified in our review were deficiencies in rater training and reliability reporting. Just because a measure has demonstrated evidence of reliability and validity in previous research does not mean it will inevitably be successful in a new context.29 Raters must still be trained in the use of the measurement system and reliability testing should be reported to ensure the veracity of conclusions.

Finally, this review has several limitations of its own. First, the quality of research design and reporting of evidence was not scrutinised. Second, we only examined characteristics of behavioural marker systems, not all approaches to team performance measurement. While marker systems are best suited for quantifying competencies and skills that underlie team processes, other techniques may be more appropriate for capturing implicit knowledge and attitudinal components of teamwork that are not readily observed (eg, collective efficacy, shared mental models).12 ,65 Finally, we only used one database (PubMed) to identify relevant articles. While further queries of other databases would likely yield additional studies, we expect that the reported sample sufficiently represents behavioural marker research within healthcare.

Conclusion

Accurate and meaningful data are a vital asset because they shape inferences and decisions.66 Findings from our review suggest several implications for healthcare, including the need to: (1) agree on concepts and terms to describe teamwork constructs, (2) generate a library of tools to measure team performance around the purpose of measurement (eg, targeted work domain, general and domain-specific behaviours) and (3) establish standards for the testing and reporting of psychometric evidence. Coordinated efforts through consensus conferences and through funding agency support for related research streams provide an opportunity to advance the field to this end.

Footnotes

  • Contributors All authors made unique contributions to the conception, drafting and revising of this manuscript. Each of them provided final approval for this work to be published and agreed to be accountable for the integrity of the information presented.

  • Funding This work was supported by funding from the Gordon and Betty Moore Foundation (Grant #3186.01). The views expressed in this paper are those of the authors and not necessarily reflective of Johns Hopkins University, Johns Hopkins Hospital, or the Gordon and Betty Moore Foundation.

  • Competing interests None.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Author note References 67–87 are cited in the online supplementary tables.