Original research
Public’s understanding of swab test results for SARS-CoV-2: an online behavioural experiment during the April 2020 lockdown
Stefania Pighin, Katya Tentori
Center for Mind/Brain Sciences, University of Trento, Rovereto (TN), Italy
Correspondence to Professor Katya Tentori; katya.tentori{at}unitn.it

Abstract

Objective Although widespread testing for SARS-CoV-2 is in place, little is known about how well the public understands these results. We aimed to provide a comprehensive overview of the general public’s grasp of the accuracy and significance of the results of the swab test.

Design Web-based behavioural experiment.

Setting Italy during the April 2020 lockdown.

Participants 566 Italian residents.

Main outcome measures Participants’ estimates of the SARS-CoV-2 prevalence; the predictive and diagnostic accuracy of the test; the behavioural impact of (positive vs negative) test results; the perceived usefulness of a short-term repetition of the test following positive or negative results; and rankings of causes for false positives and false negatives.

Results Most participants considered the swab test useful (89.6%) and provided predictive values consistent with their estimates of test diagnostic accuracy and infection prevalence (67.0%). Participants acknowledged the effects of symptomatic status and geographical location on prevalence (all p<0.001) but failed to take this information into account when estimating the positive or negative predictive value. Overall, test specificity was underestimated (91.5%, 95% CI 90.2% to 92.8%); test sensitivity was overestimated (89.7%, 95% CI 88.3% to 91.0%). Positive results were evaluated as more informative than negative ones (91.6, 95% CI 90.2 to 93.1 and 41.0, 95% CI 37.9 to 44.0, respectively, p<0.001); a short-term repetition of the test was considered more useful after a positive than a negative result (62.7, 95% CI 59.6 to 65.7 and 47.2, 95% CI 44.4 to 50.0, respectively, p=0.013). Human error and technical characteristics were assessed as more likely to be the causes of false positives (p<0.001); the level of the viral load as the cause of false negatives (p<0.001).

Conclusions While some aspects of the swab for SARS-CoV-2 are well grasped, others are not and may have a strong bearing on the general public’s health and well-being. The obtained findings provide policymakers with a detailed picture that can guide the design and implementation of interventions for improving efficient communication with the general public as well as adherence to precautionary behaviour.

  • COVID-19
  • primary care
  • statistics & research methods

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

Strengths and limitations of this study

  • This study provides a first comprehensive overview of the general public’s understanding of the most commonly used test for detecting SARS-CoV-2.

  • This study considers not only participants’ estimates of the positive predictive value but also of the negative predictive value and the diagnostic accuracy of tests, as well as their grasp of the test results’ behavioural consequences.

  • This study employs a rigorous experimental design that allowed for control of many potentially confounding variables, such as participants’ geographical location, worry and perceived individual risk.

  • The findings provide policymakers with a detailed picture that can guide the design of interventions for improving both efficient communication with the public and adherence to precautionary behaviour.

  • Further research is needed to extend this investigation to a wider population, including older adults, individuals with pre-existing conditions and those who have already been tested for SARS-CoV-2.

Introduction

The global outbreak of COVID-19 has abruptly placed testing at the centre of everyone’s thoughts, actions and feelings. Widespread testing has been strongly recommended by the WHO,1 as the rapid identification of possible new infectious cases is considered decisive in reducing clinical progression, in containing onward transmission2 and, in the long run, in saving lives and resuming normal life.3 Although employing different strategies and timelines,2 4 mass testing has now been implemented in many countries: thousands of individuals are tested every day, and even more are seeking to be tested. However, little is currently known about how this testing is perceived. To what extent does the general population consider it to be accurate and informative? What sense is there of the impact of test results on behaviour and of the usefulness of a short-term repetition of the test? Are possible test errors ascribed to the most probable causes? In this study, we address these questions by investigating, under various experimental conditions, the public’s grasp of the accuracy and significance of results of the reference standard for the detection of the novel coronavirus responsible for COVID-19 (SARS-CoV-2): the real-time reverse transcriptase polymerase chain reaction (rRT-PCR) performed on respiratory specimens.2 5

A PubMed search from inception to 24 November 2020 did not identify any research article that investigated how the results of molecular (or serological) tests for SARS-CoV-2 are interpreted or understood. A similar absence is found in the systematic literature review on SARS-CoV-2 by Rajendran and colleagues.6 Studies on how laypeople and experts understand the accuracy of screening and diagnostic tests are available with reference to other conditions (hypothetical and real life), such as breast cancer and genetic disorders.7–12 The main result across these investigations, especially those that considered low-prevalence conditions, is that the great majority of individuals, including a wide range of healthcare providers, systematically overestimate the probability that individuals with a positive test result truly have the disease (namely the positive predictive value of the test, hereafter PPV).7 This robust finding is an expression of a more general tendency to discount or even ignore base rate information in favour of relevant evidence13 (a phenomenon known as the base rate fallacy or base rate neglect) and can be, at least partially, modulated by various factors, such as the format of the statistical information conveyed12 14 or the specific probability question posed.8 For instance, in Garcia-Retamero et al’s study,15 patients’ incorrect diagnostic inferences concerning various positive screening test results decreased from more than two-thirds to less than half when the numerical information concerning the prevalence rate, the sensitivity (SE) and the false positive rate was presented along with a visual display representing the overall number of individuals at risk, the number of individuals who obtained a positive result and the number of individuals who have the disease.

Though extremely interesting from a cognitive perspective and useful for regular medical practice, the results of earlier studies cannot be extrapolated to the public’s understanding of the extensive testing now underway. Indeed, the situation generated by the COVID-19 pandemic is new in various respects.

First, as the clinical validation of the newly developed rRT-PCR test for detecting SARS-CoV-2 is still at an early stage, its accuracy is not yet fully known. Preliminary results suggest a high rate of false negatives and a limited rate of false positives. Reported SE varies widely (depending, for example, on the site, quality and timing of sampling), with most values converging on 70%–85%,16–19 while specificity (SP) has received less attention20 and is assumed to be greater than 98%.20–22 The lack of precise statistics on test performance and, above all, on the true prevalence of this infection in specific populations23 makes it currently impossible to calculate the exact PPV. In the absence of this normative value, estimates of the accuracy of this test need to be assessed according to different criteria.

Second, given the considerable extent of asymptomatic carriage2 24 for the SARS-CoV-2 infection, test results are particularly crucial. Previous research focused almost exclusively on the understanding of PPV. Yet, when millions stand to be exposed to an infection, it is in fact the interpretation of negative results that becomes most challenging. This is because the more prevalence rises, the greater the proportion of false negatives among all negative results and, consequently, the lower the predictive value of a negative result (NPV).22 25 So we cannot assume that previous findings on the difficulty that is encountered in calculating the PPV generalise to the assessment of the NPV. Nor can they be applied to the predictive value of a double negative test result at a 24-hour interval (NNPV), which has been used as a discharge criterion26 for patients with COVID-19 from hospitals in various countries.
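The dependence of the NPV on prevalence described above can be illustrated with a short sketch. The values SE = 0.80 and SP = 0.98 are assumptions chosen only because they fall within the ranges quoted earlier in this section; the function name `npv` and the prevalence grid are likewise illustrative and not part of the study.

```python
def npv(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Probability of being uninfected given a negative result (Bayes' theorem)."""
    true_neg = specificity * (1.0 - prevalence)    # uninfected and test negative
    false_neg = (1.0 - sensitivity) * prevalence   # infected but test negative
    return true_neg / (true_neg + false_neg)


# Assumed, illustrative test characteristics (not estimates from the study).
SE, SP = 0.80, 0.98

# As prevalence rises, false negatives make up a larger share of all
# negative results, so the NPV falls.
for prev in (0.01, 0.10, 0.30, 0.50):
    print(f"prevalence={prev:.2f}  NPV={npv(prev, SE, SP):.3f}")
```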

A third element of novelty concerns participants’ high personal involvement. The COVID-19 pandemic and associated containment measures have had a tremendous impact on behaviours and priorities,27 with the consequence that many individuals who are not ‘COVID-19 patients’, and who may never be, feel threatened and anxious.28 29 This, together with extraordinary exposure to health-related information,30 including data and arguments about testing, as well as the utility of implementation on a large scale, raises the question of how far the population in the current emergency situation is comparable to participants of previous experiments, who typically have been presented with medical test scenarios quite remote from their direct experience.

Our study provides insight into the questions and novelties outlined above by offering new empirical data and a new methodological approach for evaluating the accuracy of the PPV and NPV estimates in the absence of definitive evidence on the diagnostic accuracy of a test (see the coherence criterion in box 1). It also complements our comprehension of the understanding of test results by exploring the public’s grasp of the various implications of these results. Gaining deeper insight into the general public’s understanding of the accuracy and significance of the most widespread test for SARS-CoV-2 offers a unique opportunity to improve scientific knowledge on reasoning about medical tests and could have tangible implications for pandemic health policies now by facilitating more efficient risk communication and by promoting adherence to precautionary behaviour.

Box 1

Test usefulness and qualitative coherence criteria

The restricted data set included all participants whose estimation of test characteristics met criterion 1 and whose judgements of PVs, prevalence and test characteristics met all of the conditions in criterion 2 (n=358, 63.3%).

Criterion 1: test usefulness (n=507, 89.6%)

To establish whether the test characteristics provided by participants were compatible with those of a ‘useful’ test, we applied the following rule37: SE+SP≥1.5, that is, estimated diagnostic accuracy was at least halfway between 1 (a completely useless test) and 2 (a perfect test).

Criterion 2: coherence between probability estimates (n=379, 67%)

To determine whether the PVs, prevalence and test characteristics provided by participants were coherent with each other, at least from a qualitative point of view, we set the following conditions:

Coherence between PPV, prevalence rate (PR) and test characteristics (SE and SP) (n=531, 93.8%)

(PPV>PR∧SE>(1-SP))∨(PPV<PR∧SE<(1-SP))∨(PPV=PR∧SE=(1-SP))∨(PPV=1∧PR=1∧SE>(1-SP))

Coherence between NPV, prevalence rate (PR) and test characteristics (SE and SP) (n=441, 77.9%)

(NPV>(1-PR)∧SP>(1-SE))∨(NPV<(1-PR)∧SP<(1-SE))∨(NPV=(1-PR)∧SP=(1-SE))∨(NPV=1∧(1-PR)=1∧ SP>(1-SE))

Coherence between NNPV, NPV and test characteristics (SE and SP) (n=486, 85.9%)

(NNPV>NPV∧SP>(1-SE))∨(NNPV<NPV∧SP<(1-SE))∨(NNPV=NPV∧SP=(1-SE))∨(NNPV=1∧NPV=1∧SP>(1-SE))

  • NNPV, double negative predictive value; NPV, negative predictive value; PPV, positive predictive value; PVs, predictive values; SE, sensitivity; SP, specificity.
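The two criteria in box 1 can be sketched as simple predicates. This is a minimal illustration, assuming that each judgement is recorded as a whole percentage between 0 and 100 (so that equality comparisons are exact); the function names are hypothetical and the authors' actual implementation is not described in the text.

```python
def is_useful(se: int, sp: int) -> bool:
    # Criterion 1: SE + SP >= 1.5 on the 0-1 scale, i.e. >= 150 in percentages.
    return se + sp >= 150


def _coherent(posterior: int, prior: int, hit: int, false_alarm: int) -> bool:
    # Pattern shared by the three conditions of criterion 2: the judged
    # posterior must move away from the prior in the direction implied by
    # the judged test characteristics (or stay put at the ceiling of 100).
    return (
        (posterior > prior and hit > false_alarm)
        or (posterior < prior and hit < false_alarm)
        or (posterior == prior and hit == false_alarm)
        or (posterior == 100 and prior == 100 and hit > false_alarm)
    )


def is_coherent(pr: int, ppv: int, npv: int, nnpv: int, se: int, sp: int) -> bool:
    ppv_ok = _coherent(ppv, pr, se, 100 - sp)          # PPV vs PR
    npv_ok = _coherent(npv, 100 - pr, sp, 100 - se)    # NPV vs (1-PR)
    nnpv_ok = _coherent(nnpv, npv, sp, 100 - se)       # NNPV vs NPV
    return ppv_ok and npv_ok and nnpv_ok


def in_restricted_data_set(pr, ppv, npv, nnpv, se, sp) -> bool:
    return is_useful(se, sp) and is_coherent(pr, ppv, npv, nnpv, se, sp)


# Hypothetical participant: every judgement moves in the expected direction.
print(in_restricted_data_set(pr=10, ppv=60, npv=95, nnpv=99, se=85, sp=95))
```

Under these assumptions, a participant who judged the NPV to be lower than 1-PR while also judging SP to exceed 1-SE would fail the second condition and be left out of the restricted data set.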

Methods

Study design, stimuli and procedure

Our online behavioural experiment consisted of four parts (online supplemental material 1 fully reports the stimuli, translated from Italian) and was carried out through Prolific Academic (http://prolific.ac), one of the most popular and reliable crowdsourcing platforms for behavioural research.31 This gave us access to a general population, although with a lower representation of older adults (in Italy, the proportion of Prolific participants older than 65 years is lower than it is in other countries, such as the UK or the USA). The experiment was computer based and participants carried it out at home. There were no time limits, and the task was typically completed in less than 7 min. Participants received £0.60 compensation.

The first part of the experiment aimed to explore whether participants’ estimates of predictive and diagnostic accuracy depended on the prevalence rate. To this end, we employed a 4×3 between-subjects design, in which participants were asked to consider the hypothetical case of a person identified by a combination of two factors: her symptomatic status (SX: unspecified, absent, mild or severe) and her geographical location (Italy, Sassari or Bergamo). Italy represents a generic location (for an Italian participant) while Sassari and Bergamo are two well-known cities of comparable population that largely differ in reported infection and death rates (figure 1). Participants in all groups were asked to estimate the prior probability that the person had the SARS-CoV-2 infection (we consider this evaluation generalisable to the subpopulation to which the hypothetical person belongs and, therefore, will refer to it as judged prevalence) and three predictive values (PVs): the probability that a person from the same subpopulation had the SARS-CoV-2 infection, given a positive test result (judged PPV); the probability that a person from the same subpopulation did not have the SARS-CoV-2 infection, given a negative test result (judged NPV) or given two negative test results at a 24-hour interval (judged NNPV). The phrasing of the questions was adapted from previous studies on Bayesian inferences in the medical domain.8–10 Participants were then asked to provide their best estimates of the test SE and SP by judging the probability that a person from the same subpopulation who had (did not have) the SARS-CoV-2 infection would receive a positive (negative) test result. Although irrelevant for the last two questions, we kept the reference to the subpopulation in order to see whether participants’ estimates were affected by it. 
To reduce possible misunderstandings, we asked for complementary judgements in pairs and clarified that they had to sum up to 100%; participants’ compliance with this requirement also served as an attentional check.

Figure 1

Zones 1 and 2. Geographical distribution of COVID-19 cases in Italy on the first day of data collection (April 6) in the two prevalence areas: zone 1 (latitude ≥42°50), which includes the hardest hit regions (more than 500 000 cases, test positivity rate ≥20%), and zone 2 (latitude <42°50), which comprises the least affected regions (fewer than 200 000 cases, test positivity rate ≤10%). Bergamo (in zone 1) and Sassari (in zone 2) are two cities of comparable populations but very different COVID-19 infection and death rates: while the former was the epicentre of the first COVID-19 outbreak in Italy, the latter was left relatively unscathed, with about 1/20 of the cases of the former.

Since emotion-related variables such as worry and perceived risk are acknowledged to drive probability judgements and attitudes towards medical tests,32 participants were also asked to report their worry for the pandemic, to estimate their likelihood of contracting the infection and to evaluate its severity. The last two measures were then multiplied to obtain a perceived risk score.33
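As a rough sketch of the perceived risk score, the product of the two measures could be computed as follows. The scaling is an assumption: the judged likelihood is taken to be a percentage that is converted to a proportion before being multiplied by a 0 to 100 severity rating, which is in line with the magnitude of the scores reported later but is not stated explicitly in the text. The function name is hypothetical.

```python
def perceived_risk(likelihood_pct: float, severity: float) -> float:
    """Perceived risk = judged likelihood of infection x judged severity.

    Assumption: likelihood is a percentage (0-100) rescaled to a proportion,
    severity is a 0-100 rating, so the score also ranges from 0 to 100.
    """
    return (likelihood_pct / 100.0) * severity


# Hypothetical participant: 50% judged likelihood, severity rated 80.
print(perceived_risk(50, 80))
```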

The remainder of the experiment was identical for all groups. The second part aimed to investigate whether participants were aware of the asymmetric implications of positive and negative results in terms of impact on behaviour and the usefulness of a short-term test/retest. In this regard, a positive result should be considered rather informative because it implies self-isolation, while a negative result is not expected to substantially affect behaviour, especially during the lockdown. On the other hand, repeating the test might be useful after a negative result (because of the high rate of false negatives) while it appears less justified after a positive result (due to the low rate of false positives, but also because positivity in itself does not impact treatment).

To investigate participants’ explanations of test errors, in the third part of the experiment, we asked them to rank, according to their probability, three possible causes of false positives and false negatives (human error, technical characteristics of the test and level of viral load).

Responses to all questions in the first three parts of the experiment were mandatory for completion of the experiment and to receive the Prolific payment.

In the fourth and final part of the experiment, participants were asked about their personal experience with the swab test and COVID-19. Demographic information was obtained from Prolific.ac.

Evaluation criteria

Participants who evaluated the test as useful and whose judgements of prevalence, test characteristics, PPV, NPV and NNPV were qualitatively coherent with each other were identified using the criteria reported in box 1.

Statistical analysis

Participants’ characteristics were analysed by means of χ2 tests for categorical variables and t-tests for continuous variables. For all variables, we calculated descriptive statistics such as means and 95% CIs. Statistical comparisons were evaluated by multivariate analysis of variance (MANOVA) followed by post hoc pairwise comparisons using Tukey’s honest significant difference, t-tests and repeated measures analysis of variance (ANOVA) when appropriate. In order to improve the accuracy of the models, the ANOVAs were performed including participants’ age, gender and education as covariates. Wilcoxon signed-rank tests were used to compare participants’ rankings of causes of false-positive and false-negative test results. Possible effects of participants’ characteristics (ie, age, gender and education) on rankings were preliminarily investigated by means of Mann-Whitney tests. All analyses concerning the prevalence, the predictive and the diagnostic accuracy were performed twice: once including all participants (full data set) and once including only participants whose judgements met the criteria reported in box 1 (restricted data set). Data analysis was performed with SPSS V.23. Only p values below 0.05 were considered significant and reported within the text.

Participants and data collection

A total of 591 native Italian speakers residing in Italy were recruited on 6–9 April 2020, during the total lockdown. We excluded from the analyses 22 participants who failed to pass the attention checks (ie, their complementary responses did not sum up to 100) and three participants who assigned extreme (0 and 100) values both to prevalence and test characteristics, making it impossible to compute a meaningful value for some of their expected PVs. The final sample thus included 566 participants (see table 1 for related statistics).

Table 1

Participants’ characteristics

Figure 1 reports the geographical distribution of COVID-19 cases in Italy on the first day of data collection. Since the disease mainly affected the northern regions, we classified participants’ locations into two different areas: zone 1 (latitude ≥42°50), which encompasses the hardest hit regions, and zone 2 (latitude <42°50), which includes the regions with the lowest incidence of the disease. All participants provided informed, written consent.

Results

Participants’ characteristics

The mean age of participants was 28 years (95% CI 27.5 to 29.0), ranging from 18 to 66 years. Age did not significantly differ for males and females, nor did educational level. Participants in zones 1 and 2 divided 58% to 42%, respectively, and this roughly parallels the percentage split of Italians living in the two zones (about 56% and 44%). Educational level did not differ significantly in the two zones. Only three participants declared they had undergone the swab test, all residing in zone 1. More participants in zone 1 than in zone 2 reported that a person in their circle (ie, relatives, friends, colleagues) had undergone the swab test or had been diagnosed with COVID-19 (p=0.001 and p=0.012, respectively, χ2 tests). These results support our partition by confirming a greater spread of the virus in zone 1 than in zone 2. The mean worry for the pandemic was 66.9 (95% CI 65.1 to 68.8). Somewhat surprisingly, participants in zone 1 reported lower worry than those in zone 2 (64.9, 95% CI 62.2 to 67.5 and 69.8, 95% CI 67.2 to 72.4, respectively, p=0.011, independent samples t-test). No matter the zone, younger (18–25 years) participants reported less worry than older (>26 years) ones (64.4, 95% CI 61.8 to 66.9 and 69.5, 95% CI 66.7 to 72.2, respectively, p=0.008, independent samples t-test).

A similar pattern was observed for the severity of contracting the virus, which was lower for participants in zone 1 than in zone 2 (60.5, 95% CI 57.6 to 63.3 and 65.3, 95% CI 62.4 to 68.2, respectively, p=0.022, independent samples t-test) and for younger participants than older ones (57.9, 95% CI 55.0 to 60.7 and 67.0, 95% CI 64.2 to 69.9, respectively, p<0.001, independent samples t-test). Participants in zone 1 estimated the probability of contracting the virus as higher than those in zone 2 (35.6%, 95% CI 33.0% to 38.3% and 29.2%, 95% CI 26.7% to 31.7%, respectively, p=0.001, independent samples t-test). The perceived risk differed between younger and older participants (19.1, 95% CI 17.0 to 21.1 and 24.2, 95% CI 22.1 to 26.4, respectively, p=0.001, independent samples t-test), but not between zones 1 and 2.

Predictive and diagnostic accuracy: qualitative coherence and test usefulness

The great majority of participants (89.6%) evaluated the test as useful (see criterion 1 in box 1) and provided estimates of predictive accuracy that were coherent with their evaluation of diagnostic accuracy and with their beliefs about prevalence (67.0%) (see box 1). Participants (63.3% of the total) whose judgements met both these criteria were included in the restricted data set. It is worth noting that this does not indicate that the remaining participants hold irrational beliefs; they may simply not have read all questions or response options carefully (in particular, some participants seem to have confused the order of two complementary responses in the NPV question, see the online supplemental material 1).

Predictive and diagnostic accuracy: effects of SX and location

The MANOVA used to investigate the effect of the SX and location on judged prevalence, PPV, NPV, NNPV, SE and SP (with age, gender, educational level, zone, worry and perceived risk as covariates) showed that, in both data sets, participants’ prevalence judgements were affected in the expected direction by the manipulation of SX and location. The effect was less systematic for judged PPV, which depended on SX in both data sets but only on location in the restricted data set, as well as for judged NPV, which depended on SX and location in the restricted data set alone (all p<0.05; see table 2 for the outputs of the MANOVA and figure 2 for Tukey’s post hoc tests; see also online supplemental material 2 for mean judgements and 95% CI). These results indicate that participants, at least when they provided qualitatively coherent probability judgements, were sensitive to factors that can affect prevalence, PPV and NPV. Irrespective of the data set, judged SE and SP did not differ significantly across groups (table 2), indicating that participants correctly estimated the test’s diagnostic accuracy independently of prevalence. Among the covariates, the most robust effects were those exerted by worry and perceived risk, both on judged prevalence; irrespective of the experimental condition, participants who expressed a greater worry and/or greater perceived risk also provided higher estimates of prevalence.

Figure 2

Prevalence, PPV and NPV judgements in the 12 experimental groups, with corresponding results of the Tukey’s post hoc tests for the two independent variables (SX and location). Means in one subset significantly differ (at least p<0.05) from those in other subsets. NPV, negative predictive value; PPV, positive predictive value; SX, symptomatic status.

Table 2

Main results of the MANOVA conducted on participants’ judgements in full and restricted data sets

Predictive accuracy: consistency between judged PVs and expected PVs

Since the exact prevalence of the infection in the considered subpopulations is unknown, objective PVs cannot be computed. The PPV and NPV provided by each participant were therefore compared with the expected PPV and NPV that were obtained by inserting his/her judgements of prevalence and test characteristics into Bayes’ theorem. The comparison between judged and expected PVs (table 3, figure 3) reveals that participants overestimated the PPV in the full data set (p<0.001, paired sample t-test) but underestimated the NPV in both data sets (all p<0.001, paired sample t-test). To control for the base rate fallacy, we performed two further analyses that focused on the judged PPV and NPV of participants who provided low values (≤20%) for the prevalence and 1-prevalence, respectively. Regardless of the data set, judged PPV was greater than expected PPV (all p<0.01, paired sample t-test). This result supports those of previous research on the base rate fallacy and indicates that participants underweighted their own estimates of prevalence and/or diagnostic accuracy when updating their beliefs based on a positive test result. By contrast, no significant difference was observed between judged and expected NPV when the prevalence was assumed to be high (ie, 1-prevalence ≤20%). Such findings suggest that the base rate fallacy that has been repeatedly observed for the PPV does not extend to the NPV and, if replicated, would require modification of most theoretical models that have been proposed to explain this phenomenon.
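The computation of expected PVs from a participant's own judgements can be sketched as below. This is an illustrative reading of the procedure, not the authors' code (the analyses were run in SPSS); the function name `expected_pvs` and the example participant are hypothetical, and all inputs are assumed to be proportions between 0 and 1.

```python
def expected_pvs(prevalence: float, sensitivity: float, specificity: float):
    """Return (expected PPV, expected NPV) implied by one participant's
    judged prevalence, sensitivity and specificity via Bayes' theorem."""
    p, se, sp = prevalence, sensitivity, specificity
    p_positive = se * p + (1 - sp) * (1 - p)   # probability of a positive result
    p_negative = sp * (1 - p) + (1 - se) * p   # probability of a negative result
    if p_positive == 0 or p_negative == 0:
        # Extreme judgements (0 or 1 throughout) leave a predictive value undefined.
        raise ValueError("predictive value undefined for these estimates")
    return se * p / p_positive, sp * (1 - p) / p_negative


# Hypothetical participant: prevalence 10%, SE 90%, SP 92%, judged PPV 85%.
judged_ppv = 0.85
exp_ppv, exp_npv = expected_pvs(0.10, 0.90, 0.92)
print(f"expected PPV={exp_ppv:.3f}, judged PPV={judged_ppv:.3f}")
print("PPV overestimated" if judged_ppv > exp_ppv else "PPV not overestimated")
```

The `ValueError` branch mirrors the reason given in the Methods for excluding participants who assigned extreme values to both prevalence and test characteristics.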

Figure 3

Judged and expected PPV (left panel) and NPV (right panel) as a function of judged prevalence, in full and restricted data sets. NPV, negative predictive value; PPV, positive predictive value.

Table 3

Means of judged and expected PPV and NPV (and 95% CI) in full and restricted data sets, with corresponding P values of paired sample t-tests

Predictive accuracy: NNPV

To check whether participants acknowledged that a double negative result supports the absence of the infection more than a single negative result does, judged NNPV and NPV were compared. Irrespective of the experimental condition, participants correctly indicated a higher value (all p<0.001, paired sample t-test) for judged NNPV (91.9%, 95% CI 90.2% to 93.5% and 97.6%, 95% CI 97.1% to 98.1% in full and restricted data sets, respectively) than judged NPV (81.1%, 95% CI 79.0% to 83.2% and 88.5%, 95% CI 87.0% to 90.1% in full and restricted data sets, respectively). Furthermore, in both data sets, the judged NNPV was lower than 100% (all p<0.001, one-sample t-tests), suggesting that participants had a correct grasp of the fact that a double negative test result does not rule out the possibility of an infection.
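Why a double negative result should raise the predictive value without reaching 100% can be shown with a small sketch. It rests on a strong assumption that the study does not make: that the two test results are conditionally independent given the person's true infection status, which may not hold for swabs taken 24 hours apart. The test characteristics and prevalence are illustrative values, and the function name is hypothetical.

```python
def npv_after_negatives(prevalence: float, sensitivity: float,
                        specificity: float, n_negatives: int = 1) -> float:
    """Probability of no infection after n negative results, assuming the
    results are conditionally independent given infection status."""
    not_infected = (specificity ** n_negatives) * (1.0 - prevalence)
    infected = ((1.0 - sensitivity) ** n_negatives) * prevalence
    return not_infected / (not_infected + infected)


# Assumed, illustrative values (not estimates from the study).
prev, se, sp = 0.30, 0.80, 0.98
single = npv_after_negatives(prev, se, sp, n_negatives=1)
double = npv_after_negatives(prev, se, sp, n_negatives=2)
print(f"NPV={single:.3f}  NNPV={double:.3f}")
```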

Diagnostic accuracy: consistency of judged SE and SP with experts’ estimates

In line with experts’ assessments, irrespective of the experimental condition and data set, participants judged the SP (91.5%, 95% CI 90.2% to 92.8% and 95.9%, 95% CI 95.2% to 96.5% for full and restricted data sets, respectively) as higher than the SE (89.7%, 95% CI 88.3% to 91.0% and 94.4%, 95% CI 93.6% to 95.2% for full and restricted data sets, respectively) according to paired sample t-tests (all p<0.05). In both data sets, judged SE was above the upper bound of the current reference range (85%) and judged SP was below the lower bound of the current reference range (98%) (all p<0.001, one-sample t-tests). These results complement those obtained for the PPV and NPV and confirm that participants, on one hand, overestimate the test’s ability to correctly detect infected individuals and, on the other hand, underestimate the test’s ability to correctly identify non-infected individuals.

Informativeness of positive and negative test results

To assess whether participants were aware of the differences between the informativeness of the positive and negative results, their perceived usefulness for changing behaviour was compared using a repeated measures ANOVA, and the same was done for the perceived usefulness of repeating the test after a negative or positive result. As expected, participants evaluated positive results as more useful than negative ones for changing current behaviour (91.6, 95% CI 90.2 to 93.1 and 41.0, 95% CI 37.9 to 44.0, respectively, p<0.001). The analysis revealed a significant interaction between participants’ judgements and age (p=0.038). More specifically, the difference between the perceived usefulness of positive and negative results was greater in younger (91.8, 95% CI 89.8 to 93.8 and 38.6, 95% CI 34.4 to 42.7, respectively) than in older participants (91.4, 95% CI 89.5 to 93.4 and 43.6, 95% CI 38.9 to 47.8, respectively). Yet, both evaluations (especially the one concerning a possible negative result) appear surprisingly high, given that participants were under lockdown and the tested person was assumed to be asymptomatic. Even less easily understandable is that participants found a short-term repetition of the test more useful after a positive rather than a negative result (62.7, 95% CI 59.6 to 65.7 and 47.2, 95% CI 44.4 to 50.0, respectively, p=0.013). These results are worth investigating in greater depth and, if confirmed in future studies, would indicate that participants do not consider some crucial behavioural implications of test results.

Causes of test errors

The Mann-Whitney tests did not reveal significant effects of age, gender and education on participants’ rankings of possible causes of test errors, with one exception: the technical characteristics of the test were considered to be more probable as a cause of false positives by participants with a higher education level than by those with a lower education level (2.12, 95% CI 2.03 to 2.21 and 1.95, 95% CI 1.87 to 2.03, respectively, p=0.007). Wilcoxon signed-rank tests showed that participants properly distinguished between the causes of false-positive and false-negative test results (figure 4): the level of viral load was considered more likely to be the cause of false negatives, while human error and technical characteristics of the test were assessed as more likely to generate false positives (all p<0.001). These evaluations appear in line with experts’ assessments.20 34

Figure 4

Ranking of causes for test errors. To display participants’ rankings on a scale between 0 and 1, we assigned each cause a score from 1 (least probable) to 3 (most probable), and then normalised total scores using the MinMax normalisation method.
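The scoring and normalisation described in this caption can be sketched as follows. The caption does not say which minimum and maximum were used, so this sketch assumes the theoretical bounds of the total score (every participant ranking a cause last, or first); the function name and the two example rankings are hypothetical.

```python
def normalised_rank_scores(rankings: list[dict[str, int]]) -> dict[str, float]:
    """Sum each cause's scores (1 = least probable, 3 = most probable)
    across participants and rescale the totals to the 0-1 range."""
    n = len(rankings)
    lowest, highest = 1 * n, 3 * n   # assumed bounds: all-last vs all-first
    totals: dict[str, int] = {}
    for ranking in rankings:
        for cause, score in ranking.items():
            totals[cause] = totals.get(cause, 0) + score
    # MinMax normalisation: (x - min) / (max - min)
    return {cause: (total - lowest) / (highest - lowest)
            for cause, total in totals.items()}


# Two hypothetical participants ranking the causes of one type of test error.
example = [
    {"human error": 3, "technical characteristics": 2, "viral load": 1},
    {"human error": 3, "technical characteristics": 1, "viral load": 2},
]
print(normalised_rank_scores(example))
```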

Discussion

This study provides a first comprehensive overview of the general public’s understanding of the most commonly used test for detecting SARS-CoV-2. Overall, some aspects of the test appear to be fairly well grasped while others are not (for a detailed summary of the main results, see box 2). With regard to the latter, consistent with earlier research that considered different conditions and medical tests, our data show that, although laypeople are sensitive to several factors that can influence prevalence, they are not always able to integrate this information with evidence provided by test results. Our data also indicate that the estimate of the NPV can be flawed in a different way from that generally observed for the PPV. Moreover, the examination of participants’ beliefs about the diagnostic accuracy of the test allowed us to document an overestimation of the false-positive rate together with an underestimation of the false-negative rate. Finally, the high behavioural impact attributed to test results in the absence of symptoms appears to be unjustified, especially in the case of negative outcomes, as is the utility assigned to a short-term repetition of the test after a positive result. Among the aspects of the current SARS-CoV-2 testing that participants best understood are: the dependence of prevalence, but not of sensitivity (SE) and specificity (SP), on symptomatic status and geographical location; the evaluation of the false-negative rate as higher than the false-positive rate; and the proper probabilistic ordering of causes of false-positive and false-negative test results.

Box 2

Results in a nutshell

Because the (minimal) differences observed in the two data sets can be reasonably attributed to noise, we report here only the results of the restricted data set.

Participants’ most relevant attitudes towards SARS-CoV-2 and COVID-19

  • Older participants and those residing in the least affected area expressed greater worry about the COVID-19 pandemic.

  • Older participants and those residing in the least affected area assessed a possible SARS-CoV-2 infection as more severe.

  • Participants residing in the most affected area indicated a higher perceived probability of contracting the SARS-CoV-2 infection, while there was no difference between younger and older participants.

  • Participants’ prevalence judgements were predicted by their worry about the COVID-19 pandemic and their perceived risk of a possible SARS-CoV-2 infection.

Aspects of the test that are well understood by participants

  • Fairly good qualitative coherence between judgements of prevalence, test characteristics and predictive values (PVs).

  • Dependency of judged prevalence, positive predictive value (PPV) and negative predictive value (NPV) on geographic location and severity of symptoms.

  • No base rate fallacy for judged NPV (when prevalence >80).

  • Double negative results acknowledged both as supporting the absence of infection more than a single negative result and as not ruling out the possibility of infection.

  • Estimates of test characteristics (sensitivity (SE) and specificity (SP)) compatible with those of a useful test (ie, SE+SP≥1.5) and independent of symptomatic status and geographic location.

  • Higher estimates for false-negative than false-positive rates.

  • Positive results evaluated as more informative than negative ones with respect to an asymptomatic person’s current behaviour.

  • Human error and technical characteristics of the test judged more likely causes for false positives; level of viral load for false negatives.

Participants’ main errors

  • Base rate fallacy for judged PPV (when prevalence ≤20).

  • Judged NPV lower than expected based on judged prevalence and characteristics of the test.

  • General underestimation of false-negative rate.

  • General overestimation of false-positive rate.

  • General overestimation of the impact of both positive and negative results on an asymptomatic person’s behaviour.

  • Confusion about the utility of a short-term repetition of the test after positive or negative results.
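The relationship between prevalence, test characteristics and predictive values that underlies the coherence and base rate items in this box can be illustrated with Bayes' theorem. The sketch below is only an illustration under stated assumptions: the function name is hypothetical, and the sensitivity, specificity and prevalence values are invented for the example rather than taken from the study or from any real test.

```python
import math
from typing import Tuple


def predictive_values(prevalence: float, sensitivity: float,
                      specificity: float) -> Tuple[float, float]:
    """Return (PPV, NPV) via Bayes' theorem; inputs are proportions in [0, 1]."""
    for name, value in (("prevalence", prevalence),
                        ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    pos_total = true_pos + false_pos
    neg_total = true_neg + false_neg
    # A predictive value is undefined if that result can never occur.
    ppv = true_pos / pos_total if pos_total > 0 else math.nan
    npv = true_neg / neg_total if neg_total > 0 else math.nan
    return ppv, npv


# Hypothetical test characteristics (not estimates from the study).
SE, SP = 0.90, 0.90
ppv_low, npv_low = predictive_values(0.05, SE, SP)    # low prevalence
ppv_high, npv_high = predictive_values(0.50, SE, SP)  # high prevalence
```

With these illustrative values, the PPV is roughly 0.32 at 5% prevalence and 0.90 at 50% prevalence, even though the test itself is unchanged. Treating the PPV as if it equalled the sensitivity, regardless of prevalence, is the pattern usually described as the base rate fallacy.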

As noted in the Introduction section, the findings of earlier studies cannot be extrapolated to the testing now underway. In particular, previous research has focused almost exclusively on whether participants properly calculate the PPV when explicitly provided with information about the prevalence of a condition and the diagnostic performance of a test used to detect it. Estimates of NPV or of the diagnostic accuracy of tests have not been studied, nor have the various behavioural consequences of the comprehension of test results. For the first time in the literature, participants’ perception of the accuracy of a medical test was explored and combined with the behavioural impact of its positive and negative results, the perceived usefulness of a short-term repetition of the test following positive or negative results, and the ranking of causes for false positives and false negatives. Thanks to the wider range of measures considered, this study extends scientific knowledge of how the general public interprets test results and challenges most theoretical models that have been proposed to explain the difficulties in computing the PPV and, more generally, base rate neglect. Furthermore, our study expanded existing methodology by introducing a qualitative coherence criterion that allows documentation of the base rate fallacy, even in the absence of normative values for test characteristics and prevalence.

The main limitation of this study is the narrow age range of participants (more than 95% younger than 50 years). Another limitation is that it included mainly participants who had not undergone the swab test at the time of data collection. Moreover, although there is no apparent reason to expect substantial cross-national differences in the accuracy of adults’ performance in these kinds of tasks,8 the generalisation to other countries cannot be taken for granted. Future research could extend our research questions to users of different healthcare systems and, especially, to specific subsets of the population, including older adults and patients with pre-existing conditions, and even to primary care physicians or specialists,ii who, together with public health agencies, have key roles in helping laypeople understand the reasons behind recommendations and obligations. Finally, it may also be of interest to consider other tests for SARS-CoV-2 infection (eg, rapid antigen tests or serological tests for the detection of antiviral antibodies).

Undoubtedly, mass testing plays a major role in the collection of epidemiological information and in the management of pandemics. However, it also has unavoidable effects at an individual level, as test results might well influence personal inferences and decisions. The aspects of the test that escape common understanding may indeed have a strong bearing on the public’s health and well-being. For example, the systematic underestimation of the false-negative rate could well lead to neglect of precautions and, in the event of subsequent development of symptoms, to mistrust of medical services and institutions. Similarly, the disproportionate behavioural impact attributed to test results in the absence of symptoms and the confusion about the utility of a short-term repetition of the test after a positive result could give rise to overtesting, with all the serious consequences this entails.
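The concern about underestimating the false-negative rate can be made concrete with a small calculation of the probability that a person is infected despite one or more negative results. This is a hedged sketch, not part of the study: the function name is hypothetical, all numerical values are invented for illustration, and repeated results are assumed to be conditionally independent given infection status, which may not hold for real swab tests.

```python
def prob_infected_after_negatives(prior: float, sensitivity: float,
                                  specificity: float,
                                  n_negatives: int = 1) -> float:
    """Posterior probability of infection after n consecutive negative results.

    Assumption: results are conditionally independent given infection status.
    """
    for name, value in (("prior", prior), ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if n_negatives < 0:
        raise ValueError("n_negatives must be non-negative")
    p = prior
    for _ in range(n_negatives):
        missed = (1 - sensitivity) * p   # infected, but test is negative
        cleared = specificity * (1 - p)  # not infected, test is negative
        if missed + cleared == 0:
            raise ValueError("a negative result is impossible with these inputs")
        p = missed / (missed + cleared)  # Bayes update on a negative result
    return p


# Hypothetical values: a 30% pre-test probability and 95% specificity.
PRIOR, SP = 0.30, 0.95
believed = prob_infected_after_negatives(PRIOR, 0.90, SP)  # believed sensitivity
actual = prob_infected_after_negatives(PRIOR, 0.70, SP)    # lower, hypothetical
double = prob_infected_after_negatives(PRIOR, 0.70, SP, n_negatives=2)
```

Under these assumptions, a person who believes the sensitivity to be higher than it really is assigns a lower residual probability of infection to a negative result than is warranted. A second negative result lowers the residual probability further but does not bring it to zero.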

In conclusion, the dissemination of correct medical information35 and the implementation of health literacy interventions36 are certainly essential for dealing with this (or any) pandemic emergency. Yet, for these policies to be truly effective, they must be grounded in empirical evidence that indicates exactly where the difficulties lie and, ideally, provides precise guidance on how to overcome them. As more than 50 years of cognitive studies on human rationality have shown, the problem goes beyond providing laypeople with accurate information: it also involves understanding how they use this information in their reasoning.

Footnotes

  • Contributors Both authors contributed equally to the design of the study. SP collected data and performed the statistical analysis. Both authors drafted the manuscript for important intellectual content, approved the final version submitted and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. KT acts as guarantor and corresponding author.

  • Funding The study was supported by the MIUR project “Dipartimenti di eccellenza”.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Ethics approval The study was approved by the Human Research Ethics Committee of the University of Trento (protocol number 2019/2020-026). All participants provided informed, written consent and data obtained were fully anonymous.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement Data are available upon reasonable request. Data sharing: the authors support data sharing and queries in this regard can be addressed to the corresponding author.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.

  • i. The literature search was conducted with no restrictions on language using the search string: (2019-nCoV [All fields] OR SARS-CoV-2 [All fields] OR novel coronavirus [All fields] OR Covid-19 [All fields]) AND (testing result* [All fields] OR test result* [All fields]) AND (interpretation [All fields] OR understanding [All fields] OR assessment [All fields] OR predictive value* [All fields]) AND (probability [All fields] OR reasoning [All fields]).

  • ii. A preliminary analysis of our data indicated that, overall, the judgements of participants who had a medical/health-related university degree (eg, medicine, neuroscience, biology, neurobiology, psychology; n=82) did not differ from those of all other participants in any of our dependent measures (ie, prevalence rate, PVs, informativeness of a positive and negative test result and utility of short-term repetition after a positive or negative test result, all p>0.05).
