Abstract
Background The reliability of binary exposure classification methods is routinely reported in the occupational health literature because it is viewed as an important component of evaluating the trustworthiness of the exposure assessment by experts. The kappa statistic (κ) is typically employed to assess how well raters or classification systems agree in a variety of contexts, such as identifying exposed participants in a population-based epidemiological study of risks due to occupational exposures. However, the question we are really interested in is not so much the reliability of an exposure assessment method, although this holds value in itself, but the validity of the exposure estimates. The validity of binary classifiers can be expressed as a method's sensitivity (SN) and specificity (SP), estimated from its agreement with the error-free classifier.
Methods and results We describe a simulation-based method for deriving information on SN and SP from κ and the prevalence of exposure, since an analytic solution is not possible without restrictive assumptions. This work is illustrated in the context of a comparison of job-exposure matrices assessing occupational exposures to polycyclic aromatic hydrocarbons.
Discussion Our approach allows investigators to evaluate how good their exposure assessment methods truly are, not just how well they agree with each other, and should lead to the incorporation of information on the validity of expert assessment methods into formal uncertainty analyses in epidemiology.
Strengths and limitations of this study

The main strength of our approach is that it is flexible and easy to implement.

Our methodology accounts for realistic uncertainties that an epidemiologist faces in evaluating the plausible extent of exposure misclassification.

The main limitation of our work is that it does not yet account for correlated errors in exposure estimates that are common in the field, and the importance of this limitation remains to be understood.
Introduction
The reliability of binary exposure classification methods is routinely reported in the occupational health literature because it is viewed as an important component of evaluating the trustworthiness of the exposure assessment. The kappa statistic (κ) is typically employed to assess how well raters or classification systems agree in a variety of contexts, such as identifying exposed participants in a population-based epidemiological study of risks due to occupational exposures. Most recently, Offermans et al1 estimated agreement among various expert-based methods of assessing exposures in a cohort (job-exposure matrices and case-by-case evaluations). The authors reported κ coefficients for these methods that are not unlike those presented previously in a review by Teschke et al,2 which seems to suggest that κ values of about 0.6 or worse are a fair summary of what these methods generally yield in terms of inter-rater agreement in a typical study of occupational exposures. However, the question we are really interested in is not so much the reliability of a method to assess exposure, although this holds value in itself, but the validity of the exposure estimates.
The validity of binary classifiers can be expressed as a method's sensitivity (SN) and specificity (SP), estimated from its agreement with the error-free classifier (also known as the ‘gold standard’).3 But how does one infer what κ tells us about the validity of exposure estimates (ie, SN and SP) when a true value (gold standard) is unavailable? Generally, reliability contains information on validity,3 but in the case of κ, its relationship with SN and SP is also affected by the prevalence of exposure (Pr). An analytic solution in this case is not possible without restrictive assumptions about the actual prevalence and the relationship between SN and SP.4 Therefore, we developed a simulation-based method for deriving information on SN and SP based on κ and Pr. We illustrate this method in the context of a comparison of job-exposure matrices assessing occupational exposures to polycyclic aromatic hydrocarbons (PAHs).1
Method
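We propose a simulation-based method to calculate the values of SN and SP that are consistent with the observed κ and Pr. If we assume two conditionally independent raters with the same validity, the relationship among κ, SN, SP and Pr can be described mathematically by:
$$\kappa = \frac{\mathrm{Pr}\,(1-\mathrm{Pr})\,(\mathrm{SN}+\mathrm{SP}-1)^{2}}{\{\mathrm{Pr}\cdot\mathrm{SN}+(1-\mathrm{Pr})(1-\mathrm{SP})\}\,\{\mathrm{Pr}\,(1-\mathrm{SN})+(1-\mathrm{Pr})\,\mathrm{SP}\}} \tag{1}$$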
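As a numerical check on Eq. (1), the following Python sketch (not the authors' R code from Appendix 1) compares a closed-form expression for κ with κ computed directly from the 2×2 agreement table of two conditionally independent raters who share the same SN and SP. The function names are hypothetical, and the closed form is the expression that follows algebraically from the stated assumptions; it is offered as an illustration rather than a transcription of the published equation.

```python
def kappa_closed_form(sn, sp, pr):
    """Kappa for two conditionally independent raters with common SN and SP."""
    # Marginal probability that either rater classifies a subject as exposed.
    p_pos = pr * sn + (1 - pr) * (1 - sp)
    return pr * (1 - pr) * (sn + sp - 1) ** 2 / (p_pos * (1 - p_pos))


def kappa_from_table(sn, sp, pr):
    """Kappa from its definition: (observed - expected) / (1 - expected)."""
    # Both raters say "exposed": truly exposed and both correct,
    # or truly unexposed and both wrong.
    both_pos = pr * sn ** 2 + (1 - pr) * (1 - sp) ** 2
    # Both raters say "unexposed".
    both_neg = pr * (1 - sn) ** 2 + (1 - pr) * sp ** 2
    observed = both_pos + both_neg
    p_pos = pr * sn + (1 - pr) * (1 - sp)
    expected = p_pos ** 2 + (1 - p_pos) ** 2  # agreement expected by chance
    return (observed - expected) / (1 - expected)


# Illustrative inputs only (not taken from the article).
for sn, sp, pr in [(0.8, 0.95, 0.05), (0.6, 0.9, 0.3), (0.99, 0.7, 0.5)]:
    print(sn, sp, pr, kappa_closed_form(sn, sp, pr), kappa_from_table(sn, sp, pr))
```

The two functions should agree to rounding error for any inputs with 0 < Pr < 1, and κ equals 1 for a perfect classifier and 0 when SN + SP = 1.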
We assume that exposure classification by experts is better than chance, as expressed by:
$$\mathrm{SN}+\mathrm{SP} > 1 \tag{2}$$
First, we define the distributions of the lower (κ_{l}) and upper (κ_{h}) bounds of κ by using uniform distributions (U) as κ_{l}∼U(a_{1}, a_{2}) and κ_{h}∼U(b_{1}, b_{2}). We further define the distribution of Pr as a Beta distribution, Pr∼Beta(c, d). Information required to specify these distributions with reasonable credibility is available in reports evaluating inter-rater agreement, as in reference.1 We can then calculate the (multiple) lower bounds of SN and SP (SN_{l} and SP_{l}) that are consistent with these distributions, following:
$$\mathrm{SN}_{l} = \frac{\kappa_{l}}{\kappa_{l}+(1-\kappa_{l})(1-\mathrm{Pr})} \tag{3}$$
and
$$\mathrm{SP}_{l} = \frac{\kappa_{l}}{\kappa_{l}+(1-\kappa_{l})\,\mathrm{Pr}} \tag{4}$$
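One way to see where lower bounds of this kind come from: if SP is set to its maximum of 1, the relationship between κ, SN and Pr for two conditionally independent raters reduces to κ = (1 − Pr)·SN / (1 − Pr·SN), which can be solved for SN; setting SN = 1 gives the analogous expression for SP. The Python sketch below computes both bounds and checks the round trip. It rests on the assumption that the bounds are obtained in this way, and the function names are hypothetical.

```python
def kappa_from_validity(sn, sp, pr):
    """Kappa implied for two conditionally independent raters sharing SN and SP."""
    p_pos = pr * sn + (1 - pr) * (1 - sp)  # marginal probability of "exposed"
    return pr * (1 - pr) * (sn + sp - 1) ** 2 / (p_pos * (1 - p_pos))


def lower_bounds(kappa_l, pr):
    """Smallest SN (taking SP = 1) and smallest SP (taking SN = 1) that give kappa_l."""
    sn_l = kappa_l / (kappa_l + (1 - kappa_l) * (1 - pr))
    sp_l = kappa_l / (kappa_l + (1 - kappa_l) * pr)
    return sn_l, sp_l


# Illustrative inputs: kappa_l = 0.3 and Pr = 0.05.
sn_l, sp_l = lower_bounds(0.3, 0.05)
print(round(sn_l, 3), round(sp_l, 3))
# Round trip: each bound, paired with a perfect counterpart, recovers kappa_l.
print(kappa_from_validity(sn_l, 1.0, 0.05), kappa_from_validity(1.0, sp_l, 0.05))
```

With a low prevalence, the bound on SP is far tighter than the bound on SN, which is why κ is much more informative about SP than about SN for rare exposures.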
The upper theoretical bounds on SN and SP are known (ie, these are 1) and, even though no other information is available, this enables us to sample plausible SN and SP values from the uniform distribution constrained by the lower bounds (SN_{l} and SP_{l}, respectively) and the upper bounds of 1. Using Monte Carlo sampling, this procedure is repeated multiple times to generate sets of possible (SN and SP) combinations.
The proposed procedure is a hierarchical process that starts with (a) selecting a set of (κ_{l}, Pr) values from the specified distributions to calculate (SN_{l}, SP_{l}) using Eqs. (3) and (4), is followed by (b) selecting a candidate set of (SN, SP) from values uniformly distributed between the lower bounds (SN_{l} and SP_{l}) and the theoretical maximum of 1, and is completed by (c) imposing the constraints on the candidate set of (SN, SP) that are implied by Eqs. (1) and (2) (see the next paragraph for details of the last step). The purpose of step (a) is to calculate SN_{l} and SP_{l}. The purpose of step (b) is to sample candidate values of SN and SP that lie between their respective theoretical lower and upper boundaries. The purpose of step (c) is to limit the sets of values of SN and SP selected in step (b) to only those that, first, are congruent with the theoretical model that relates validity to reliability (Eq. 1), and, second, satisfy the assumption that classification of exposure is better than random (Eq. 2).
By chance, some values of Pr, SN and SP selected in this way will correspond to values of κ, implied by Eq. (1), that lie outside of bounds on κ that we have specified by choosing specific values of κ_{l} and κ_{h} from corresponding distributions. Furthermore, some combinations of SN and SP will not be consistent with Eq. (2) (ie, imply that exposure classification was worse than chance). Consequently, the candidate sets of values of SN and SP that are not in agreement with our starting assumptions are eliminated from the sample used to estimate the distributions of SN and SP. The resulting combinations are consistent with our knowledge of agreement between different exposure assessment methods and foretell how valid these exposure assessment methods can be expected to be in general.
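Steps (a) to (c) can be sketched as a rejection sampler. The Python code below is a minimal illustration using only the standard library; it is not the authors' R implementation from Appendix 1. The function and argument names are hypothetical, and the expressions used for κ and for the lower bounds assume two conditionally independent raters with common SN and SP.

```python
import random
import statistics


def kappa_from_validity(sn, sp, pr):
    """Kappa implied for two conditionally independent raters sharing SN and SP."""
    p_pos = pr * sn + (1.0 - pr) * (1.0 - sp)  # marginal probability of "exposed"
    return pr * (1.0 - pr) * (sn + sp - 1.0) ** 2 / (p_pos * (1.0 - p_pos))


def simulate_validity(n_draws, kl_range, kh_range, beta_params, seed=1):
    """Return the (SN, SP) pairs that survive steps (a) to (c)."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        # Step (a): draw the kappa bounds and the prevalence, then the lower bounds.
        k_lo = rng.uniform(*kl_range)
        k_hi = rng.uniform(*kh_range)
        pr = rng.betavariate(*beta_params)
        sn_lo = k_lo / (k_lo + (1.0 - k_lo) * (1.0 - pr))
        sp_lo = k_lo / (k_lo + (1.0 - k_lo) * pr)
        # Step (b): candidate SN and SP between their lower bounds and 1.
        sn = rng.uniform(sn_lo, 1.0)
        sp = rng.uniform(sp_lo, 1.0)
        # Step (c): discard candidates that are no better than chance or whose
        # implied kappa falls outside the drawn interval [k_lo, k_hi].
        if sn + sp <= 1.0:
            continue
        if k_lo <= kappa_from_validity(sn, sp, pr) <= k_hi:
            kept.append((sn, sp))
    return kept


# Example call with inputs chosen to mirror a kappa between about 0.3 and 0.6
# and a prevalence prior of Beta(6.2, 99.7).
kept = simulate_validity(10_000, (0.29, 0.31), (0.59, 0.61), (6.2, 99.7))
mean_sn = statistics.mean(sn for sn, _ in kept)
mean_sp = statistics.mean(sp for _, sp in kept)
print(len(kept), round(mean_sn, 2), round(mean_sp, 2))
```

The retained pairs approximate the joint distribution of SN and SP that is compatible with the stated bounds on κ and the prior on Pr; the fraction of candidates retained depends on how narrow the κ interval is.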
The calculation can be implemented in R; the code is available in Appendix 1 (available online), with input values specific to the illustrative example described below.
Results
We apply our method to information provided in table 2 in the article by Offermans et al1 for PAH exposure assessment. First, we define the distributions of κ_{l} and κ_{h} for PAH by using U as κ_{l}∼U(0.29, 0.31) and κ_{h}∼U(0.59, 0.61). Some degree of judgement is involved in this, but our formulation reflects the observation that in this case κ for PAHs lies between 0.3 and 0.6. We further define the distribution of Pr (mode of 5%, with 95% certainty that Pr does not exceed 10%) as Pr∼Beta(6.2, 99.7).5 The results of the rest of the calculations are summarised in figure 1, derived from 10 000 Monte Carlo samples for candidate values of SN and SP (step (b) above). They reveal that the mean SN for this example is about 0.78 (SD 0.15) and the mean SP is about 0.96 (SD 0.03).
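The stated properties of the prevalence prior can be checked directly. The Python sketch below computes the mode of Beta(6.2, 99.7) from the standard formula (c − 1)/(c + d − 2) and estimates its 95th percentile by sampling; the seed and sample size are arbitrary choices made for illustration.

```python
import random

C, D = 6.2, 99.7  # Beta parameters used for Pr in the illustrative example

# Mode of a Beta(c, d) distribution with c > 1 and d > 1.
mode = (C - 1) / (C + D - 2)

# Monte Carlo estimate of the 95th percentile.
rng = random.Random(2013)
draws = sorted(rng.betavariate(C, D) for _ in range(100_000))
q95 = draws[int(0.95 * len(draws))]

print(round(mode, 3), round(q95, 3))
```

The mode is close to 0.05 and the estimated 95th percentile is close to 0.10, in line with the description of the prior as having a mode of 5% and 95% certainty that Pr does not exceed 10%.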
Discussion
Our approach allows investigators to evaluate how good their exposure assessment methods truly are, not just how well they agree with each other, and should lead to the incorporation of information on the validity of expert assessment methods into formal uncertainty analyses in epidemiology.6 Specifically, once we can represent knowledge about SN and SP by a joint distribution, we can use a number of existing techniques to evaluate the impact of exposure misclassification on epidemiological results and to correct such results for known imperfections in exposure classification. Until now, knowledge of κ and exposure prevalence did not enable such analyses. It is noteworthy that Bayesian analyses that appraised the SN and SP of another job-exposure matrix produced a very similar appraisal for SP and a lower value for average SN with a similarly wide distribution.7,8 This perhaps points to a commonality in the quality of expert assessment methods used in occupational epidemiology. It is important to note that simple comparison of measures of agreement across studies and instruments is not helpful because values of κ depend on Pr, which may differ between applications even for the same SN and SP. Our method has a distinct advantage for such comparisons and for the assessment of validity. With knowledge about validity, even if it is uncertain, we can begin the work of incorporating this knowledge into routine epidemiological analyses.9
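The dependence of κ on prevalence for a fixed SN and SP can be illustrated numerically. The Python sketch below holds SN = 0.78 and SP = 0.96 (the mean values from the example) constant and varies Pr; the prevalence values of 0.05, 0.20 and 0.50 are arbitrary choices, and the expression for κ again assumes two conditionally independent raters with the same validity.

```python
def kappa_from_validity(sn, sp, pr):
    """Kappa implied for two conditionally independent raters sharing SN and SP."""
    p_pos = pr * sn + (1 - pr) * (1 - sp)  # marginal probability of "exposed"
    return pr * (1 - pr) * (sn + sp - 1) ** 2 / (p_pos * (1 - p_pos))


SN, SP = 0.78, 0.96
# Same classifier quality, three hypothetical exposure prevalences.
results = [(pr, kappa_from_validity(SN, SP, pr)) for pr in (0.05, 0.20, 0.50)]
for pr, kappa in results:
    print(f"Pr = {pr:.2f}: kappa = {kappa:.3f}")
```

Under these assumptions κ is about 0.37 at a prevalence of 5% but about 0.57 at prevalences of 20% and 50%, even though SN and SP are identical in all three cases.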
Footnotes

Contributors All authors equally contributed to the writing of the manuscript. IB and PG jointly developed the algorithm. Theoretical derivations were performed by PG. Simulations were conducted by IB and verified by PG and FdV.

Funding This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

Competing interests None.

Provenance and peer review Not commissioned; externally peer reviewed.

Data sharing statement An appendix is available online.