Original research
Questionnaire validation practice within a theoretical framework: a systematic descriptive literature review of health literacy assessments
  1. Melanie Hawkins1,2,
  2. Gerald R Elsworth1,2,
  3. Elizabeth Hoban1,
  4. Richard H Osborne2
  1. 1Faculty of Health, Deakin University School of Health and Social Development, Burwood, Victoria, Australia
  2. 2Centre for Global Health and Equity, Faculty of Health, Arts and Design, Swinburne University of Technology, Hawthorn, Victoria, Australia
  1. Correspondence to Dr Melanie Hawkins; melaniehawkins{at}


Objective Validity refers to the extent to which evidence and theory support the adequacy and appropriateness of inferences based on score interpretations. The health sector is lacking a theoretically-driven framework for the development, testing and use of health assessments. This study used the Standards for Educational and Psychological Testing framework of five sources of validity evidence to assess the types of evidence reported for health literacy assessments, and to identify studies that referred to a theoretical validity testing framework.

Methods A systematic descriptive literature review investigated methods and results in health literacy assessment development, application and validity testing studies. Electronic searches were conducted in EBSCOhost, Embase, Open Access Theses and Dissertations and ProQuest Dissertations. Data were coded to the Standards’ five sources of validity evidence, and for reference to a validity testing framework.

Results Coding on 46 studies resulted in 195 instances of validity evidence across the five sources. Only nine studies directly or indirectly referenced a validity testing framework. Evidence based on relations to other variables was the most frequently reported.

Conclusions The health and health equity of individuals and populations are increasingly dependent on decisions based on data collected through health assessments. An evidence-based theoretical framework provides structure and coherence to existing evidence and stipulates where further evidence is required to evaluate the extent to which data are valid for an intended purpose. This review demonstrates the use of the Standards’ theoretical validity testing framework to evaluate sources of evidence reported for health literacy assessments. Findings indicate that theoretical validity testing frameworks are rarely used to collate and evaluate evidence in validation practice for health literacy assessments. Use of the Standards’ theoretical validity testing framework would improve evaluation of the evidence for inferences derived from health assessment data on which public health and health equity decisions are based.

  • public health
  • qualitative research
  • statistics & research methods

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Strengths and limitations of this study

  • This is the first time a theoretical validity testing framework, the five sources of evidence from the Standards for Educational and Psychological Testing, has been applied to the examination of validity evidence for health literacy assessments.

  • A strength of this study is that validity is clearly defined, in accordance with the authoritative validity testing literature, as the extent to which theory and evidence (quantitative and qualitative) support score interpretation and use.

  • A limitation was the restriction of the search to studies and health literacy assessments published or administered in English, which may introduce an English language and culture bias to the sample.

  • A further limitation was the lack of clarity in some papers about the methods used and results obtained, which made coding validity evidence difficult and may have led to some misclassification of reported evidence.


It has been argued that the health sector is lacking a theoretically-driven framework of validation practice for the development, testing and use of health assessments.1–6 Such a framework could guide and strengthen validation planning for the interpretation and use of health assessment data.2 3 7 Interpretations of scores from health literacy assessments are increasingly being used to make decisions about the design, selection and evaluation of interventions and policies to improve health equity for individuals, communities and populations.2–4 8 9 To ensure that decisions based on data from all health assessments are justified, and lead to equitable outcomes, validation practice must generate information about the degree to which the intended interpretations and use of data are supported by evidence and the theory of the construct being measured.10–19 Validation research is complex7 20 and a theoretical framework would facilitate an evaluation of a range of evidence to determine valid interpretation and use of health assessment data.2 4 18 20 21

Health literacy

Health literacy is a relatively new field of research with a range of definitions for different settings22–25 and advances in the approaches to its measurement.26–32 Some health literacy assessments measure an observer’s (eg, clinician’s or researcher’s) observations of a person’s health literacy, which often consists of testing a person’s health-related numeracy, reading and comprehension.33 34 Objective measurement can support a clinician to provide health information in formats and at reading levels that are suited to individual patients but usually these measures do not assess other important dimensions of the health literacy construct.35 Self-report measures of health literacy have become useful with the rise of the patient-centred healthcare movement, and these typically provide individuals’ perspectives of a range of aspects of their health and health contexts.23 36 This type of measurement can capture the multidimensional aspects of the health literacy construct to look at broader implications of treatment, care and intervention outcomes.37 Assessments could also combine both objective and self-report measurement of health literacy. Data from health literacy assessments have been used to inform health literacy interventions8 19 38–41 and, increasingly, health policies.42–46 However, despite the different definitions that health literacy assessments are based on (and thus, necessarily, the different score interpretations and uses), the data are often correlated and compared as if the scores have the same meaning, which is an incorrect assumption.27 A theoretical validity testing framework would help researchers, clinicians and policy-makers to differentiate between the meanings of data from different health literacy assessments, and evaluate existing evidence to support data interpretations, to enable them to choose the assessment that is most appropriate for their intended clinical or research purpose.

Contemporary validity testing theory

The validity testing framework of the 2014 Standards for Educational and Psychological Testing (the Standards) is the authoritative text for contemporary validity testing theory.5 It results from about 100 years of the evolution of validity theory.47 48 The Standards defines validity as ‘the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests’ (p.11) and validation as the process of ‘…accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations’ (p.11). The framework describes five types of validity evidence that can be evaluated to justify test score interpretation and use: (1) test content, (2) response processes of respondents and users, (3) internal structure of the assessment test, (4) relations to other variables and (5) consequences of testing, as related to validity (table 1).5 6 49 50 Evidence from each of these sources may be needed to verify data interpretation and use.

Table 1

The five sources of validity evidence5 49

The expectation of the Standards and leading validity theorists is that the validation process consists of an evaluative integration of different types of validity evidence (not types of validity) to support score meaning for a specific use.2 4 5 13–15 51–57 Integral to this framework are quantitative methods to evaluate an assessment’s statistical properties, but also important is validity evidence based on qualitative research methods.4 58–65 Qualitative methods are used to ensure technical evidence for test content and response processes, and to investigate validity-related consequences of testing.7 12 52 63–69 There are guides to assess quantitative measurement properties70–72 but still needed are reviews that include qualitative validity evidence, and that place validity evidence for health assessments within a validity testing framework such as the Standards.2 4 6 49


As a guide to inform and improve the processes used to develop and test health assessments, this review will examine validation practice for health literacy assessments. Health literacy is a relatively new area of research that appears to have proceeded with the ‘types of validity’ paradigm of early validation practice in education, and so it is ideally poised to embrace advancements in validity testing practices. Thus, an assumption underlying this review is that the field of health is not applying contemporary validity testing theory to guide validation practice, and that the focus of validation studies remains on the general psychometric properties of a health assessment rather than on the interpretation and use of scores. This study will provide an example of the application of the Standards’ theoretical validity testing framework through the review of sources of validity evidence (generated through quantitative and qualitative methods) reported for health literacy assessments.

The aim of this systematic descriptive literature review was to use the validity testing framework of the Standards to categorise and count the sources of validity evidence reported for health literacy assessments and to identify studies that used or made reference to a theoretical validity testing framework. Specifically, the review addressed the following questions:

  1. What is being reported as validity evidence for health literacy assessment data?

  2. Is the validity evidence currently provided for health literacy assessments placed within a validity testing framework, such as that offered by the Standards?


King and He situate systematic descriptive literature reviews toward the qualitative end of a continuum of review techniques.73 Nevertheless, this type of review employs a frequency analysis to categorise qualitative and quantitative research data to reveal interpretable patterns.32 73–78 This review will appraise validation practice for health literacy assessments using the Standards’ framework of five evidence sources. It will neither critique nor assess the quality of individual health literacy assessments or studies.

Inclusion and exclusion criteria, information sources and search strategy

The method for this review was previously reported in a protocol paper.49 The eligibility and exclusion criteria, information sources and search terms are summarised in table 2. Peer reviewed full articles and examined theses were included in the search. Online supplementary file 1 shows the MEDLINE database search strategy, and this was modified for the other databases. The review was reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement.79 See online supplementary file 2 for the PRISMA checklist.

Table 2

Summary of inclusion and exclusion criteria, information sources and search terms

Article selection, and data extraction, analysis and synthesis

Duplicates were removed and a title and abstract screening of identified articles was performed in EndNote Reference Manager X9 by one author (MH). Identified full text articles (n=92) were screened for relevance by MH and corroborated with an independent screening of 10% (n=9) of the search results by a second author (GRE). Additionally, MH consulted with GRE when a query arose about inclusion of an article in the review.

Data extraction from articles for final inclusion was undertaken by one author (MH) with all data extraction comprehensively and independently checked by a second author (GRE). Both authors then conferred to achieve categorisation consistency. General characteristics for each study were extracted but of primary interest were the sources of validity evidence reported, as were statements about or references to a theoretical validity testing framework. The validity evidence reported in each article was categorised according to the five sources of validity evidence in the Standards, whether or not the authors of the articles reported it that way. When the methods were unclear, the results were interpreted to determine the type of evidence generated by the study. A study was categorised as using or referencing a theoretical validity testing framework if the authors made a statement that referred to a framework and directly cited the framework document or if there was a clear citation path to the framework document.

Descriptive and frequency analyses of the extracted data were conducted to identify patterns in the sources of validity evidence being reported, and for the number of studies that made reference to a validity testing framework.

Patient and public involvement

Patients and the public were not involved in the development or design of this literature review.


Overall, 46 articles were identified for the review. The PRISMA flow diagram in figure 1 summarises the results of the search.79 There were 3379 records identified through database searches with four articles identified through other sources. There were 1922 records remaining after 1457 duplicates were removed. After applying the exclusion and inclusion criteria to all abstracts, with full text screening of 92 articles and theses, 40 articles and 6 theses were included in the review (n=46). Reasons for exclusion were that the health literacy assessment was developed in or administered in a language other than English (n=19); the assessment was specific to a disease or condition (n=8) or to a demographic group (n=2); the article was not a validity study (n=8); the study was not using a health literacy assessment (n=3) or used an adapted assessment (n=4); the assessment was based on an item-bank, which required a different approach to validity testing (n=1), or was a composite assessment where health literacy data were collected and analysed with another type of data (n=1).
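The counts in this paragraph can be cross-checked with simple arithmetic. The sketch below is illustrative only: the numbers are taken from the text, the variable names are hypothetical, and it assumes that the listed exclusion reasons all apply to the 92 items screened at full text.

```python
# Counts taken from the text; variable names are illustrative only.
database_records = 3379
duplicates_removed = 1457
records_after_dedup = database_records - duplicates_removed

full_text_screened = 92
# Assumption: every listed exclusion reason applies at the full-text stage.
exclusions = {
    "language other than English": 19,
    "disease- or condition-specific assessment": 8,
    "demographic-group-specific assessment": 2,
    "not a validity study": 8,
    "not a health literacy assessment": 3,
    "adapted assessment": 4,
    "item-bank assessment": 1,
    "composite assessment": 1,
}
excluded_total = sum(exclusions.values())
included = full_text_screened - excluded_total

print(records_after_dedup)  # 1922
print(excluded_total)       # 46
print(included)             # 46 (40 articles and 6 theses)
```

Under that assumption, the exclusion reasons account for exactly the difference between the 92 full texts screened and the 46 studies included.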

Figure 1

Flow diagram for Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

Four papers were identified from the broader literature. Two papers were identified from the references of previous literature reviews.80 81 The other two papers were known to the authors and were in their personal reference lists. These two papers were by Davis and colleagues and describe the development of the Rapid Estimate of Adult Literacy in Medicine (REALM)33 and the shortened version of the REALM.82 Neither of these papers was detected by the systematic search because Davis et al do not claim these to be measures of health literacy but of literacy in medicine. They state that both versions of the REALM are designed to be used by physicians in public health and primary care settings to identify patients with low reading levels.33 82–84 Nevertheless, we included these papers because the REALM and the shortened REALM have been used by clinicians and researchers as measures of health literacy, and are used either as the primary assessment or a comparator assessment in many studies.

Three papers identified in the database search were included in this review even though data were collected using translations of assessments originally developed in English. These studies were included because of the frequency of use of these assessments in the field of health literacy measurement, and because at least part of the data were based on English language research. The Test of Functional Health Literacy in Adults (TOFHLA)85 and the Newest Vital Sign (NVS)34 both collected data in English and Spanish. The analyses for the European Health Literacy Survey (HLS-EU) study23 used data from the English (Ireland), as well as Dutch and Greek versions of the HLS-EU.

Of the 46 studies, 34 were conducted in the USA, 8 in Australia, 2 in Singapore and 1 each in Canada and the Netherlands. There were 4 studies published in the decade between 1990 and 1999, 8 studies between 2000 and 2009 and 34 between 2010 and 2019.

Reports of reliability evidence were provided in 33 studies (72%). This resulted in 44 instances of reliability evidence, of which 29 (66% of all instances) were calculated using Cronbach’s alpha for internal consistency, 4 (9% of all instances) using test-retest, 4 (9%) using inter-rater reliability calculations and 7 (16%) using other methods. See table 3 for country and year of publication, and reliability evidence.
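Because Cronbach’s alpha accounts for most of the reliability evidence counted here, a minimal sketch of how the coefficient is computed may be useful. It uses the standard formula, alpha = k/(k−1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the response data are hypothetical and are not drawn from any reviewed study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondent rows (one score per item)."""
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_columns = list(zip(*scores))
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(col) for col in item_columns)
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical responses: 5 respondents x 4 items, each scored 1 to 4.
responses = [
    [4, 3, 4, 4],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 1],
    [4, 4, 3, 4],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Alpha summarises internal consistency for one sample only, which is one reason it is treated in this review as reliability evidence rather than as validity evidence in its own right.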

Table 3

Country and year of publication, and reliability evidence

Validity evidence for health literacy assessment data

The data extraction framework (online supplementary file 3) was adapted from Hawkins et al (p.1702)6 and Cox and Owen (p.254).58 More detailed sub-coding of the five Standards’ categories was done and will be drawn on selectively to describe aspects of the results (online supplementary file 4).

Data analysis consisted of coding instances of validity evidence into the five sources of validity evidence of the Standards. The results of the review are presented as: (1) the total number of instances of validity evidence for each evidence source reported across all studies, (2) the number of instances reported for objective, subjective and mixed methods health literacy assessments and (3) the number of instances of evidence within each of the Standards’ five sources, and a breakdown of the methods used to generate evidence.

Table 4 displays the overall results of the review. For the 46 studies that reported validity evidence for health literacy assessments, we identified 195 instances of validity evidence across the five sources: test content (n=52), response processes (n=7), internal structure (n=28), relations to other variables (n=107) and consequences of testing (n=1). Across types of health literacy assessments, there were 102 instances of validity evidence reported for health literacy assessments with an objective measurement approach (n=23 studies); 78 instances reported for assessments with a subjective measurement approach (n=20 studies) and 15 instances for assessments with a mixed methods approach or when multiple types of health literacy assessments were under investigation (n=3 studies).
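The frequency analysis behind these totals can be sketched as a simple tally. The counts below are those reported in the text; the percentage shares of the 195 instances are derived here for illustration and are not quoted from the source, and the dictionary names are hypothetical.

```python
# Instance counts per evidence source, as reported in the text.
instances = {
    "test content": 52,
    "response processes": 7,
    "internal structure": 28,
    "relations to other variables": 107,
    "consequences of testing": 1,
}
# Instance counts per measurement approach, as reported in the text.
by_approach = {"objective": 102, "subjective": 78, "mixed or multiple": 15}

total = sum(instances.values())

# Shares are derived for illustration; they are not quoted from the source.
for source, n in instances.items():
    share = 100 * n / total
    print(f"{source}: n={n} ({share:.0f}%)")
print(f"total: {total}")
```

Both breakdowns (by evidence source and by measurement approach) sum to the same 195 instances, which is a quick consistency check on the coding.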

Table 4

Sources of evidence for all studies, total instances of validity evidence and for objective, subjective and multiple/mixed methods health literacy assessments

Evidence based on test content

Nearly half of all studies (n=22) reported evidence based on test content, which resulted in 52 instances of validity evidence (table 4 and online supplementary table 1). Expert review was the most frequently reported method used to generate evidence (n=14 instances; 27% of all evidence based on test content),23 33 34 36 82 83 86–93 followed by the use of existing measures of the construct (n=8; 15%).34 36 83 90–92 94 95 Analysis of item difficulty was used five times (10%),36 86 89 92 96 with literature reviews,23 90 93 97 participant feedback processes about items23 34 83 89 and construct descriptions23 36 91 97 each used four times (8% each). Participant concept mapping23 36 88 and examination of administration methods36 98 99 were each used three times (6% each), and participant interviews88 100 were used twice (4%). Five other methods were each used once in five different studies: item intent descriptions,36 items tested against item intent descriptions,101 item-response theory (IRT) analysis for item selection within domains,90 item selection based on hospital medical texts85 and item selection based on a health literacy conceptual model.100

Evidence based on response processes

Only seven instances based on response processes were reported across 6 of the 46 studies (table 4 and online supplementary table 2). The methods used were cognitive interviews with respondents (n=3 instances; 43% of all evidence based on response processes)36 88 101 and with users (clinicians) (n=1; 14%),101 as well as recording and timing the response times of respondents (n=3; 43%).89 98 100

Evidence based on internal structure

There were 15 studies (33% of all studies) that reported evidence based on the internal structure of health literacy assessments resulting in 28 instances (table 4 and online supplementary table 3). The most frequently reported methods were exploratory factor analysis (including principal component analysis) (n=7 instances; 25% of all evidence based on internal structure)88 93 100 102–105 and confirmatory factor analysis (also n=7; 25%).91 106 107 Differential item functioning was reported three times (11%),88 91 102 and item-remainder correlations twice (7%).36 92 There were nine other methods used to generate evidence for internal structure, including a variety of specific IRT analyses for fit, item selection and internal consistency. Each method was reported once, with some authors reporting more than one method.36 86 89 90 103 106

Evidence based on relations to other variables

This was the most commonly reported type of validity evidence across studies (n=42 studies; 91%) (table 4 and online supplementary table 4). There were 18 studies that only reported evidence based on relations to other variables.80 81 104 108–122 Evidence within this category was coded, as per the Standards, into convergent evidence (ie, relationships between items and scales of the same or similar construct), discriminant evidence (ie, assessments measuring different constructs determined to be sufficiently uncorrelated), criterion-referenced evidence (ie, how accurately scores predict criterion performance) and evidence for group differences (ie, relationships of scores with background characteristics such as demographic information). The Standards also includes evidence for generalisation but states that this relies primarily on studies that conduct research syntheses, and this review excluded studies that conducted meta-analyses. Across all studies, there were 107 instances of validity evidence reported for relations to other variables: 57 instances of convergent evidence (53% of all evidence in this category), 3 instances of discriminant evidence (3%), 17 instances of criterion-referenced evidence (16%) and 30 instances of evidence for group differences (28%).

The most frequently-used methods for convergent evidence were Spearman’s80 85 94 96 99 105 108 110 116 118 122 and Pearson’s33 34 82 83 90 93 104 112 113 120 123 correlation coefficients (11 instances and 19% each). These were closely followed by the receiver operating characteristic (ROC) curve and the area under the ROC curve (also n=11 instances; 19%).81 97 99 103 110 111 117 120 123 A further eight instances (14%) of correlation calculations with similar measures were reported but the types of calculation they performed were unclear.86 87 92 95 103 115 119 121
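To make the distinction between the two most common coefficients concrete, the sketch below computes Pearson’s and Spearman’s correlations for a pair of score lists. It is a minimal pure-Python illustration: the function names and the scores are hypothetical, and it does not reproduce the analysis of any reviewed study.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson's product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores on a new assessment and an established comparator.
new_assessment = [12, 15, 9, 20, 17, 11]
comparator = [30, 41, 22, 66, 45, 29]

r = pearson(new_assessment, comparator)
rho = spearman(new_assessment, comparator)
print(round(r, 3), round(rho, 3))
```

In this invented example the two score lists order respondents identically but are not linearly related, so Spearman’s rho reaches 1 while Pearson’s r stays below 1. That difference is one reason a report should state which coefficient was used.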

Harper, Elsworth et al and Osborne et al36 90 106 were the only three studies to generate discriminant evidence, as defined by the Standards. Harper90 used the Pearson correlation coefficient to assess the association of components of a new health literacy instrument with the shortened version of the Test of Functional Health Literacy in Adults (S-TOFHLA). Elsworth et al106 compared the average variance extracted and the variance shared between the nine scales of the Health Literacy Questionnaire (HLQ) (discriminant validity evidence between HLQ scales). Similarly, Osborne et al36 conducted a multiscale factor analysis to investigate if the nine HLQ scales were conceptually distinct.
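One common way to compare average variance extracted (AVE) with the variance shared between two scales is to check that each scale’s AVE exceeds the squared correlation between them. The sketch below illustrates that rule under stated assumptions: the function names, loadings and correlations are hypothetical, and whether Elsworth et al applied exactly this decision rule is an assumption, not something stated in the text.

```python
def average_variance_extracted(loadings):
    """Mean of the squared standardised factor loadings for one scale."""
    return sum(value ** 2 for value in loadings) / len(loadings)


def scales_are_distinct(loadings_a, loadings_b, correlation_ab):
    """True when each scale's AVE exceeds the variance the two scales share."""
    shared_variance = correlation_ab ** 2
    return (
        average_variance_extracted(loadings_a) > shared_variance
        and average_variance_extracted(loadings_b) > shared_variance
    )


# Hypothetical standardised loadings for two scales of one questionnaire.
scale_a = [0.80, 0.75, 0.70, 0.85]
scale_b = [0.65, 0.70, 0.72, 0.68]

print(scales_are_distinct(scale_a, scale_b, correlation_ab=0.60))  # True
print(scales_are_distinct(scale_a, scale_b, correlation_ab=0.75))  # False
```

With these invented values, the two scales pass the check at an inter-scale correlation of 0.60 but fail at 0.75, because the shared variance (0.5625) then exceeds the AVE of the second scale.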

Linear regression models were the most common method to generate criterion-referenced evidence (n=6 instances; 35% of all criterion-referenced evidence).86 90 107 114 115 121 The χ2 test of independence was used by three studies (18%),87 115 121 with Spearman’s correlation coefficient110 115 and logistic regression models86 115 each used by two studies (12% each).

There were 16 methods used to generate evidence for group differences and these were spread across 19 studies. The most frequently used methods were analysis of variance (n=5 instances; 17%)88 92 93 103 121 and linear regression models (n=4; 13%).80 83 91 123

Evidence based on validity and consequences of testing

One study conducted investigations that led to conclusions about validity and the consequences of testing (p.221).83 Elder et al found that the REALM under-represented the construct of health literacy when defined as the ability to obtain, interpret and understand basic health information.

Use of a validity testing framework when reporting validity evidence for health literacy assessments

Few studies referred to a validity testing framework or used a framework to structure or guide their work. Of the 46 studies, 9 directly or indirectly referenced a validity testing framework, and made a statement to support the citation (see online supplementary file 3). The frameworks directly cited by three studies87 101 106 were the 2014 Standards;5 Michael T Kane’s argument-based approach to validation;14 Samuel J Messick’s unified theory of validation;17 124 and Francis et al’s checklist operationalising measurement characteristics of patient-reported outcome measures.125 There were six studies36 83 93 96 102 107 that indirectly cited Messick, Kane and/or the 1985, 1999 or 2014 versions of the Standards5 126 127 through other citations. A 10th study88 referenced Buchbinder et al,128 which cites the Standards, but there was no clear statement about validity testing to support the citation.


This systematic descriptive literature review found that studies in health literacy measurement rarely use or reference a structured theoretical framework for validation planning or testing. Further, this review’s use of the Standards’ framework revealed that validity testing studies for health literacy assessments most frequently, and often only, report evidence based on relations to other variables. It is usual and reasonable for a single validity study to not provide comprehensive evidence about a patient-reported outcome measure, and this is why an organising framework for evaluating evidence from a range of studies is so important. The findings from this review show that validation practice for health literacy assessments does not use established validity testing criteria and is yet to embrace the structural framework of contemporary validity testing theory.5 6

In this review, evidence based on relations to other variables was the most frequent type of validity evidence reported across the 46 studies. It was reported more than twice as frequently as evidence based on test content, which was the second most commonly reported source of validity evidence. Evidence based on internal structure was reported in about a third of the studies (n=15). The predominance of evidence based on relations to other variables is not unexpected given the propensity for validity testing studies to almost routinely conduct correlation of an assessment with another variable (eg, a similar or different assessment).129 In the early 20th Century, the focus of test validation was primarily on predictive validity practices (eg, prediction of student academic achievement) and so correlation with known criteria was a common validation practice.48 130 131 Development of the theory and practice of validation, and the need to use tests in various contexts with different population groups, has required consideration of the meaning of test scores, and that score interpretations usually lead to decisions or actions that can affect people’s lives.2 3 52 66 As Kane explains, ‘ultimately, the need for validation derives from the scientific and social requirement that public claims and decisions be justified’ (p.17).13 A structured theoretical framework, such as the Standards, facilitates validation planning, testing and integration of evidence for decision-making. It can also support new users of a health assessment to judge existing evidence and previous rationales for data interpretation and use, and how these might justify the use of the assessment in a new context.

Reports of evidence based on response processes and on consequences of testing were negligible in this review. This is the first time this has been observed in the field of health literacy although it has been observed previously in other fields of research.50 68 132 Evidence based on the cognitive (response) processes of respondents (and of assessment users59 101) can be essential to understanding the meanings derived from assessment scores for each new testing purpose.69 Consequential evidence, although a controversial area of research,50 66 can reveal important outcomes for equitable decision-making, such as those discussed by Elder et al83 regarding the use of the REALM, a word recognition assessment, with non-native speakers of English in a world in which health literacy is understood to be about equitable access to, and understanding and use of health information and services.42 133–135 Potential risks for unintended consequences of testing can be lessened through the development of the content of health assessments using comprehensive grounded practices that ensure wide and deep coverage of the lived experiences of intended respondents.36 136–138

The findings of this review are important because institutions and governments around the world are increasingly implementing health literacy as a basis for health policy and practice development and evaluation.43–46 139 There needs to be certainty that inferences made from health literacy measurement data are leading to accurate and equitable decision-making about healthcare, interventions and policies, and that these decisions are as fair for the people with the lowest health literacy as for those with the highest.11 19 46 52 140–143 Some types of health interventions are known to widen health inequalities.143–147 Messick emphasises construct under-representation and construct-irrelevant variance as causes for negative testing consequences, as related to validity.124 148 For example, if a health assessment is biassed by a specific perspective about causes of health disparities then construct under-representation can be a threat to the validity of inferences and actions taken from the scores. Likewise, if an assessment reflects a particular social perspective (eg, middle class values and language embedded in the items) then there is the threat that the responses to the assessment are perfused with irrelevant variance derived from that perspective. Evidence from a range of sources is required to justify the use of measurement data in specific contexts (eg, socioeconomic, demographic, cultural, language), and to assure decision-makers of the absence of validity threats.4 51 54

This is the first time that a comprehensive review of sources of validity evidence for health literacy assessments has been undertaken within the theoretical validity testing framework of the Standards. For some methods, coding into the five sources of validity evidence was not straightforward and, in these cases, the Standards were consulted closely for guidance. Coding of studies by Elsworth et al and Osborne et al36 106 to relations to other variables (discriminant evidence) required some deliberation because the evidence in both studies was for discrimination analyses between independent scales within a multiscale health literacy assessment, rather than between different health literacy assessments. The developers of the HLQ view the nine scales as measuring distinct, although related, constructs.36 The Standards (p.16) explain that 'external variables may include measures of some criteria that the test is expected to predict, as well as relationships to other tests hypothesised to measure the same constructs, and tests measuring related or different constructs'.5 It was on the basis of the last part of this statement about tests measuring related or different constructs that these two studies were coded in relations to other variables as discriminant evidence.

In a few studies, some assessments seemed to be regarded as proxies for health literacy, which suggested that the researchers were thinking of them as measuring similar constructs to health literacy. In these cases, evidence was coded in relations to other variables as convergent evidence (ie, convergence between measures of the same or similar construct) rather than as criterion-referenced evidence (ie, prediction of other criteria). For example, Curtis et al86 explored correlations of the Comprehensive Health Activities Scale with the Mini Mental Status Exam as well as with the TOFHLA, the REALM and the NVS. Driessnack et al108 looked at correlations of parents’ and children’s NVS scores with their self-reports of the number of children’s books in the home. Dykhuis et al87 correlated the Brief Medical Numbers Test with the Montreal Cognitive Assessment as well as with two versions of the REALM.

Coding for relations to other variables also required distinguishing between convergent evidence, criterion-referenced evidence and evidence for group differences. Coding to convergent evidence was based on analyses of assessments of the same or similar construct (eg, typically, comparisons of one health literacy assessment with another health literacy assessment). Coding to criterion-referenced evidence was based on analyses of prediction (eg, a health literacy assessment with a disease knowledge survey). Coding for evidence of group differences was based on analyses of relationships with background characteristics such as demographic information.
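The three coding distinctions described above can be written out as a simple decision rule. The following is only an illustrative sketch and not the procedure used in this review: the category labels, the function name `code_relation` and the example comparisons are hypothetical and were chosen to mirror the wording of the paragraph.

```python
from collections import Counter
from enum import Enum


class RelationEvidence(Enum):
    """Subtypes of evidence based on relations to other variables."""
    CONVERGENT = "convergent evidence"
    CRITERION = "criterion-referenced evidence"
    GROUP_DIFFERENCES = "evidence for group differences"


# Hypothetical labels describing what an analysis compares the assessment with.
RULES = {
    "same_or_similar_construct": RelationEvidence.CONVERGENT,
    "predicted_criterion": RelationEvidence.CRITERION,
    "background_characteristic": RelationEvidence.GROUP_DIFFERENCES,
}


def code_relation(comparison_kind: str) -> RelationEvidence:
    """Map the kind of comparison reported in a study to an evidence subtype."""
    try:
        return RULES[comparison_kind]
    except KeyError:
        raise ValueError(f"Unrecognised comparison kind: {comparison_kind!r}")


# Invented examples, for illustration only (not data from the review).
reported = [
    "same_or_similar_construct",   # eg, one health literacy assessment vs another
    "predicted_criterion",         # eg, health literacy vs disease knowledge survey
    "background_characteristic",   # eg, scores compared across demographic groups
    "same_or_similar_construct",
]
tally = Counter(code_relation(kind) for kind in reported)
```

In practice, deciding which label applies to a reported analysis is the judgement step; the sketch only makes explicit how each judgement maps to a category and how instances could then be tallied.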

Reliability was not coded within the five sources of evidence even though it does contribute to understanding the validity of score interpretations and use, especially for purposes of generalisation.5 The Standards (p.33) classify reliability into reliability/precision (ie, consistency of scores across different instances of testing) and reliability/generalisability coefficients (ie, in the way that classical test theory refers to reliability as the correlation between scores on two equivalent forms of a test, with the assumption that there is no effect of the first test instance on the second test instance). The predominant focus in the reviewed papers was on the latter conception of reliability, most often calculated using Cronbach’s alpha.
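For readers less familiar with the coefficient, Cronbach’s alpha for a scale of k items is commonly computed as k/(k − 1) × (1 − sum of item variances / variance of total scores). The sketch below is a minimal illustration of that standard formula and is not drawn from any of the reviewed studies: the function name `cronbach_alpha` and the response data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items table of numeric scores.

    scores: list of rows, one row per respondent, one value per item.
    Uses sample variances (n - 1 denominator) throughout.
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("Need at least two items and two respondents")
    if any(len(row) != n_items for row in scores):
        raise ValueError("Every respondent must answer every item")

    # Variance of each item across respondents.
    item_variances = [variance(item) for item in zip(*scores)]
    # Variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("Total scores have zero variance")

    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Invented responses: three items that rank respondents identically.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
# Invented responses: two items that are unrelated to each other.
unrelated = [[1, 1], [1, 2], [2, 1], [2, 2]]

alpha_consistent = cronbach_alpha(consistent)
alpha_unrelated = cronbach_alpha(unrelated)
```

With these invented data the first scale yields an alpha of 1 and the second an alpha of 0 (up to floating point rounding), which shows that the coefficient summarises how consistently the items vary together rather than whether score interpretations are valid for a given purpose.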

Strengths and limitations

An element of bias is potentially present in this review because of the restriction of the search to studies published and health literacy assessments developed and administered in the English language. Future studies may be improved by including other languages. The health literacy assessments reviewed are those that are predominant in the field and may well provide a foundation for validity studies of more specifically targeted assessments.

Two papers on an instrument that is frequently used to measure health literacy were known to the authors, and two further papers were identified from published literature reviews, so it may be that other papers relevant to this review were not identified. However, since the 1991 publication of the REALM, which was not designed as a health literacy assessment but has since been used as such, we predict that most assessments for the measurement of health literacy will be identified as being for this purpose, and would thus have been captured by the present search strategy. Validation practice is complex, and many groups publishing validity testing studies may have limited training and experience in the area.1–4 There was a lack of clarity in some papers and theses about the methods used and results obtained, which caused difficulties with classifying the evidence within the Standards framework, so some misclassification is possible. Future work in this area would be improved if researchers used clearly defined and structured validity testing frameworks (eg, the five validity evidence sources of the Standards) in which to classify evidence.

The main strength of this study was that validity is clearly defined as the extent to which theory and evidence (quantitative and qualitative) support score interpretation and use. This definition is in accordance with leading authorities in the validity testing literature.2 5 13 51 A second strength of this study was the use of an established and well-researched theoretical validity testing framework, the Standards, to examine sources of evidence for health literacy assessments. Different health literacy assessments have different measurement purposes. Validation planning with a structured framework would help to determine the sources of evidence needed to justify the inferences from data, and to guide potential users. Application of theory to validation practice will provide a scientific basis for the development and testing of health assessments, enable systematic evaluations of validity evidence and help detect possible threats to the validity of the interpretation and use of data in different contexts.2 3 15

Conclusions
Arguments for the validity of decisions based on health assessment data must be based on evidence that the data are valid for the decision purpose to ensure the integrity of the consequences of the measurement, yet this is frequently overlooked. This literature review demonstrated the use of the Standards’ validity testing framework to collate and assess existing evidence and identify gaps in the evidence for health literacy assessments. Potentially, the framework could be used to assess the validity of data interpretation and use of other health assessments in different contexts. Developers of health assessments can use the Standards’ framework to clearly outline their measurement purpose, and to define the relevant and appropriate validity evidence needed to ensure evidence-based, valid and equitable decision-making for health. This view of validity being about score interpretation and use challenges the long-held view that validity is about the properties of the assessment instrument itself. It is also the basis for establishing a sound argument for the authority of decisions based on health assessment data, which is critical to health services research and to the health and health equity of the populations affected by those decisions.

Acknowledgements
The authors acknowledge and thank Rachel West, Deakin University Liaison Librarian, for her expertise in systematic literature reviews and her patient guidance through the detailed process of searching the literature.



  • Twitter @4MelanieHawkins, @richardosborne4

  • Contributors MH and RHO conceptualised the research question and analytical plan. MH led, with all authors contributing to, the development of the search strategy, selection criteria, data extraction criteria and analysis method. MH conducted the literature search with guidance from EH. MH screened the literature, and extracted and analysed the data with the continuous support of and comprehensive checking by GRE. MH drafted the initial manuscript and led subsequent drafts. GRE, RHO and EH read and provided feedback on manuscript iterations, and approved the final manuscript. RHO is the guarantor.

  • Funding MH was funded by a National Health and Medical Research Council (NHMRC) of Australia Postgraduate Scholarship (APP1150679). RHO was funded in part through a National Health and Medical Research Council (NHMRC) of Australia Principal Research Fellowship (APP1155125).

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement All data relevant to the study are included in the article or uploaded as supplementary information.