Method effects associated with negatively and positively worded items on the 12-item General Health Questionnaire (GHQ-12): results from a cross-sectional survey with a representative sample of Catalonian workers
  1. Maria F Rodrigo1,
  2. J Gabriel Molina1,
  3. Josep-Maria Losilla2,
  4. Jaume Vives2,
  5. José M Tomás1
  1. 1 Methodology of behavioural sciences, University of Valencia, Valencia, Spain
  2. 2 Psychobiology and methodology of health sciences, Autonomous University of Barcelona, Barcelona, Spain
  1. Correspondence to Dr Jaume Vives; Jaume.Vives{at}


Objective Recent studies into the factorial structure of the 12-item version of the General Health Questionnaire (GHQ-12) have shown that it was best represented by a single substantive factor when method effects associated with negatively worded (NW) items are considered. The purpose of the present study was to examine the presence of method effects, and their relationships with demographic covariates, associated with positively worded (PW) and/or NW items.

Design A cross-sectional, observational study to compare a comprehensive set of confirmatory factor models, including method effects associated with PW and/or NW items with GHQ-12 responses.

Setting Representative sample of all employees living in Catalonia (Spain).

Participants 3050 participants (44.6% women) who responded to the Second Catalonian Survey of Working Conditions.

Results A confirmatory factor analysis showed that the best fitting model was a unidimensional model with two additional uncorrelated method factors associated with PW and NW items. Furthermore, structural equation modelling (SEM) revealed that method effects were differentially related to both the sex and age of the respondents.

Conclusion Individual differences related to sex and age can help to identify respondents who are prone to answering PW and NW items differently. Consequently, it is desirable that both the constructs of interest as well as the effects of method factors are considered in SEM models as a means of avoiding the drawing of inaccurate conclusions about the relationships between the substantive factors.

  • psychological health
  • General Health Questionnaire (GHQ–12)
  • method effects
  • item wording effects
  • confirmatory factor analysis

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:


Strengths and limitations of this study

  • Sampling quality: a random and large representative sample of workers and face-to-face administration by professional interviewers.

  • Comparison of confirmatory models for positively worded (PW) and/or negatively worded (NW) items and the use of two different parameterisations.

  • There are no previous studies regarding the demographic correlates of wording effects on the 12-item version of the General Health Questionnaire.

  • The different response scale used for the NW items and the PW items in the questionnaire could be a confounding variable.

  • The results might not be generalised to other specific populations, for example, adolescents and elderly retired people.


Originally developed by Goldberg,1 the General Health Questionnaire (GHQ) has been widely used as a screening instrument for measuring General Psychological Health (GPH) in both community and non-psychiatric clinical settings.2 The shortest, 12-item version (GHQ-12) is the most popular. It has been employed in different settings and in several countries, and as part of multiple major national health, social well-being and occupational surveys, with results indicating that it is highly reliable and valid.3–11

Despite its broad application, the factor structure underlying the responses to the GHQ-12 remains a controversial issue. Although the GHQ-12 was originally developed as a unidimensional scale, this one-factor latent structure has found little empirical support, and some alternative multidimensional models have been proposed as more appropriate. Among these, the one with the most empirical support is the three-factor model proposed by Graetz.5 12–22 It is important to note that, in this model, the six positively worded (PW) items make up the first factor, whereas the other two factors are made up of the six negatively worded (NW) items (see figure 1, model 8). On the other hand, the bidimensional model, where the six NW and the six PW items in the GHQ-12 are grouped into two factors, has also obtained wide support, especially in studies based on exploratory factor analysis.5 10 23–28 The arguments against these models and in favour of the unidimensional solution are the high correlations between the factors13 and the low discriminant validity of the factor scores derived from these models.16 29 30

Figure 1

Competing models tested for the 12-item General Health Questionnaire. Underlined numbers identify negatively worded items. GPH: General Psychological Health; NW: method factor associated with negatively worded items; PW: method factor associated with positively worded items.

As Hankins31 pointed out, multifactor models may just be an artefact of the inclusion of PW and NW items in the questionnaire, and so the controversy about the factorial structure of the GHQ-12 might relate to the effect of item wording on subjects’ response patterns as part of a more general category called ‘method’.32 33 Hankins31 found that, after modelling the wording effects for the NW items, the unidimensional model fitted better than both the two-factor model (NW vs PW items) and Graetz’s three-factor model. Other studies have called into question the substantive meaning of the GHQ-12 multifactor solutions, suggesting that they might just be an artefact due to the wording effects associated with NW items.29 30 34–40 See Molina et al 36 for a more detailed review of the dimensionality of the GHQ-12.

Some studies of other instruments, however, have suggested considering the wording effects not only for the NW items but also for the PW items.41 42 Regarding the GHQ-12, only one recent meta-analysis has modelled method effects for both NW and PW items, concluding that positively keyed items explained incremental variance beyond a general mental health factor.43

Therefore, another source of variability in the results about the factor structure of the GHQ-12 could come from the statistical control of method biases, which has been mainly achieved through the correlated traits–correlated methods (CTCM) and the correlated traits–correlated uniquenesses (CTCU) confirmatory factor analysis models. Both procedures have been used with the GHQ-12 to deal with method effects: some studies applied the CTCM model,30 44 others the CTCU model,29 31 39 40 and others both the CTCM and CTCU models.34–37

To date, we have not found any study of the GHQ-12 that analyses the wording effects associated with either PW items alone, or with NW and PW items simultaneously, comparing both CTCU and CTCM models. There are several multivariate statistical models for analysing method effects, and among them the CFA-based approaches are the most popular,45 in particular the CFA with CTCM (CFA-CTCM) and the CFA with CTCU (CFA-CTCU). On the one hand, the CTCM model specifies that indicators’ variance can be explained by a linear combination of trait, method and error effects,46 with trait and method effects specified as latent variables. When the method factors are specified as independent (uncorrelated), the CTCM model directly translates into the well-known bifactor model.47 48 On the other hand, the CTCU model specifies trait factors, while method effects are modelled by correlating the uniquenesses of items (indicators) sharing a common method.49 Both CTCM and CTCU models have strengths and shortcomings and are therefore usually employed together.50 This work extends the previous work by Molina et al,36 which compared the fit of the unidimensional model, the multifactor models and the CTCM and CTCU unidimensional models with method effects for the NW items only.
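As a rough sketch of the two parameterisations just described, written for a single trait factor, the measurement equations can be put as follows. The notation is an assumption introduced here for illustration and is not taken from the article: x_i is the response to item i, T the trait factor, M_{m(i)} the method factor for the wording m(i) (PW or NW) of item i, and \varepsilon_i the uniqueness.

```latex
% CTCM: method effects enter as latent factors; uniquenesses stay uncorrelated
x_i = \lambda_{T,i}\,T + \lambda_{M,i}\,M_{m(i)} + \varepsilon_i,
\qquad \operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad (i \neq j)

% CTCU: no method factors; uniquenesses of same-worded items may correlate
x_i = \lambda_{T,i}\,T + \varepsilon_i,
\qquad \operatorname{Cov}(\varepsilon_i, \varepsilon_j) \neq 0
\ \text{ allowed when } m(i) = m(j),\ i \neq j
```

When method effects are modelled for only one wording (for example, NW items only), the method term or the correlated uniquenesses apply to that subset of items and are absent for the rest.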

To clarify, figure 1 (models 1 to 9) shows the nine CFA models estimated to test the potential method effects associated with the PW items, the NW items or both. Model 1 is a one-factor model of general health. This model also works as a baseline against which to compare the other, more complex models. Models 2 and 3 are the CTCU and CTCM models that include method effects for the NW items. These were the best fitting models in Molina et al.36 Models 4 and 5 are the CTCU and CTCM models including method effects for the PW items. Model 6 is the CTCM model including method factors for both the NW and PW items (a CTCU model with method effects for both PW and NW items was not estimated because it is not identified). Model 7 is a bifactor model with a general trait factor of general health and two method factors associated with NW and PW items. The three factors are independent (uncorrelated). Additionally, considering the best fitting multidimensional model in Tomás, Gutiérrez and Sancho51 based on the results by Graetz,12 models 8 and 9 were also tested. Model 8 posited three substantive dimensions: social dysfunction, anxiety and depression, and loss of confidence. Model 9 included an additional method factor associated with NW items. Models with a method factor associated with PW items made no sense here, as all PW items were indicators of social dysfunction.
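To make the structure of model 7 concrete, the following minimal sketch lays out which factors each item loads on. The item keying used here (PW items 1, 3, 4, 7, 8 and 12; NW items 2, 5, 6, 9, 10 and 11) is an assumption based on the usual GHQ-12 numbering, and the names `PW_ITEMS`, `NW_ITEMS` and `bifactor_pattern` are hypothetical; this is an illustration of the layout, not the authors' model specification.

```python
# Assumed item keying (usual GHQ-12 numbering); the text names only some items.
PW_ITEMS = [1, 3, 4, 7, 8, 12]
NW_ITEMS = [2, 5, 6, 9, 10, 11]


def bifactor_pattern(pw_items, nw_items):
    """Return {item: set of factors it loads on} for a model-7-style layout."""
    pattern = {}
    for item in sorted(pw_items + nw_items):
        factors = {"GPH"}  # every item loads on the general trait factor
        # each item also loads on exactly one wording (method) factor
        factors.add("PW" if item in pw_items else "NW")
        pattern[item] = factors
    return pattern


PATTERN = bifactor_pattern(PW_ITEMS, NW_ITEMS)

# In model 7 all three factors are uncorrelated, so every factor
# correlation is fixed to zero rather than estimated.
FACTOR_CORRELATIONS = {
    ("GPH", "PW"): 0.0,
    ("GPH", "NW"): 0.0,
    ("PW", "NW"): 0.0,
}

if __name__ == "__main__":
    for item, factors in PATTERN.items():
        print(item, sorted(factors))
```

Under the same assumptions, model 6 would differ only in leaving the PW–NW correlation free instead of fixing it to zero.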

As stressed by Marsh et al,52 it becomes necessary to consider this comprehensive set of competing models to determine the relative importance and substantive nature of the method effects.

Finally, there has been some research carried out on the demographic correlates of method effects, such as sex,53–57 age55 58 or educational level.41 59 With respect to the GHQ-12, to date, we have not found any studies that analyse demographic correlates of method effects.

Building on the previous studies, the first aim of this study was to overcome the limitation pointed out in Molina et al 36 and examine method effects associated with both PW and NW items. The second aim was to further understand the meaning of the method factors; therefore, we evaluated the relationships between the method factors and three covariates (ie, sex, age and educational level) within a structural equation modelling (SEM) framework.



The data used in this study came from the Second Catalonian Survey of Working Conditions60 and were based on a representative random sample of all employees living in Catalonia (Spain). Data were collected between September and November 2010 by professional interviewers in private households. The sample comprised a total of 3050 participants who responded to the GHQ-12 included in the survey. Main sociodemographic characteristics of the sample are shown in table 1.

Table 1

Main sociodemographic characteristics

Public involvement

Respondents were not involved in any stage of the design of the study and were only requested to respond to the survey. In the selected households, interviewers identified themselves personally and informed respondents that this was an official survey about the working conditions of employed Catalonian people, commissioned by the Catalonian Government Work Department.

Results were published on the Catalonian Government Work Department website60 and are available at


The GHQ-12 is a self-report scale that contains 6 PW items (eg, ‘Have you been able to face up to problems?’) and 6 NW items (eg, ‘Have you been losing confidence in yourself?’). The GHQ-12 was validated in Spain by Lobo and Muñoz.61 Table 2 shows the statements of these items in the same order as they were presented in the survey. It must be noted that the GHQ-12 has a different response scale for the PW items (ie, more than usual; same as usual; less than usual and much less than usual) and the NW items (ie, not at all; no more than usual; rather more than usual and much more than usual). Accordingly, the four-point scoring scheme was applied in our study, and so the total scores in the GHQ-12 ranged from 0 to a maximum of 36, with higher scores indicating lower levels of GPH.
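The four-point scoring described above can be sketched as follows, assuming each of the 12 responses has already been coded 0 to 3 with higher codes indicating poorer health. The function name `ghq12_total` is hypothetical and the sketch is not the survey's actual scoring code.

```python
def ghq12_total(responses):
    """Sum twelve item responses coded 0-3; the total ranges from 0 to 36."""
    if len(responses) != 12:
        raise ValueError("expected 12 item responses")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be coded 0, 1, 2 or 3")
    # Higher totals indicate lower levels of general psychological health.
    return sum(responses)


if __name__ == "__main__":
    print(ghq12_total([1] * 12))
```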

Table 2

Descriptive statistics, standardised factor loadings from model 7 and correlations between the model 7 factors and the covariates

For the purposes of exploring the correlates of method effects (ie, item wording effects), we used the following three covariates: (a) sex (0=men and 1=women); (b) age and (c) educational level, which was measured with a self-report question with seven graded response categories ranging from incomplete primary studies to postgraduate studies. The educational level was scored as the highest level of education reached.

Statistical analysis

A set of competing confirmatory factor models was estimated using Mplus V.8.3.62 Figure 1 shows the specification of all these CFA models. The goodness-of-fit indices computed were the χ2 statistic; the Comparative Fit Index (CFI); the Root Mean Square Error of Approximation (RMSEA) with its 90% CI and the Standardised Root Mean Square Residual (SRMR). Values greater than 0.95 for CFI, and lower than 0.06 and 0.08 for RMSEA and SRMR, respectively, are considered to indicate good model fit.
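The cut-offs quoted above can be expressed as a small check. This is only an illustrative sketch: the function names `meets_fit_cutoffs` and `good_fit` are hypothetical, and such thresholds are conventional guidelines rather than strict tests.

```python
def meets_fit_cutoffs(cfi, rmsea, srmr):
    """Compare fit indices against the cut-offs quoted in the text."""
    return {
        "CFI": cfi > 0.95,      # greater than 0.95
        "RMSEA": rmsea < 0.06,  # lower than 0.06
        "SRMR": srmr < 0.08,    # lower than 0.08
    }


def good_fit(cfi, rmsea, srmr):
    """True only when all three indices meet their cut-offs."""
    return all(meets_fit_cutoffs(cfi, rmsea, srmr).values())


if __name__ == "__main__":
    # Values reported in the Results for the model with covariates.
    print(meets_fit_cutoffs(cfi=0.99, rmsea=0.040, srmr=0.029))
```

Applied to the values reported later for the model with covariates (CFI=0.99, RMSEA=0.040, SRMR=0.029), all three criteria are met.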

As concerns the estimation of CFA models, most studies into the GHQ-12 factor structure have used maximum likelihood.16 31 35 40 44 This estimation method relies on several assumptions which should be met to be confident about the results obtained. One of them is the assumption of multivariate normality, which implies, first, that the variables are continuous in nature and, second, that the joint distribution of the variables is normal. The first condition is unlikely to be met with the GHQ-12 Likert-type response data; nor is the second if the variables depart markedly from normality, as is the case for the responses to the NW items, which were heavily positively skewed (see figure 2). An alternative when these conditions are not met is to use the weighted least squares (WLS) estimator,63 which has already been used in some studies about the GHQ-12 factor structure13 18 20 29 and is the estimation method used here. Thus, the various CFA models were estimated using diagonally weighted least squares.

Figure 2

Bar charts of the response distributions for the 12-item General Health Questionnaire. Responses were given on a different four-point response scale for the positively worded items (0=better than usual, 1=same as usual, 2=less than usual, 3=much less than usual) and for the negatively worded items (0=not at all, 1=no more than usual, 2=more than usual, 3=much more than usual).

Finally, correlates of the GHQ-12 factors were evaluated using SEM through the inclusion in the finally selected model of the three covariates considered in this study: sex was treated as categorical, whereas age and educational level were treated as continuous variables.


The goodness-of-fit statistics and indices obtained for the nine models compared here are shown in table 3.

Table 3

Fit indexes for the alternative models of the 12-item General Health Questionnaire

Model 1, with a single factor of general health, and model 8, with three substantive factors, had worse fit than the models that include wording effects. That is, a careful look at the fit indexes makes clear that the inclusion of method effects always improves model fit. Indeed, both NW and PW method effects are needed to get the best fitting models. These best fitting models were models 6 and 7. Their fit was practically indistinguishable and, given that they differ only in that model 7 is more parsimonious because it constrains the correlation between the method factors to zero, model 7 was retained as the best representation of the observed data.

An in-depth inspection of the parameter estimates in model 7 (see table 2) showed that all factor loadings were statistically significant for the three factors, except for items 2 and 5 in the method factor comprising the NW items.

Finally, a statistical analysis of the relationships between the latent factors in model 7 and the three covariates considered in this study (ie, sex, age and educational level) was performed through a Multiple Indicator Multiple Causes (MIMIC) SEM model in which the effects between the three latent factors in model 7 and the three covariates were freely estimated, the focus being on the relationships between the method factors and the covariates. The model fit was excellent (RMSEA=0.040; RMSEA 90% CI 0.037 to 0.049; CFI=0.99; SRMR=0.029). As can be seen in table 2, the relationship of age with the NW method factor was close to 0 and statistically non-significant, whereas its relationship with the PW method factor was positive and significant, although small (0.08). Sex was significantly related to the method factor associated with PW items (–0.08), whereas the educational level was not significantly related to the method factors. Thus, men and women differ in the way they answer PW items, meaning that men are slightly more likely than women to endorse PW items, and method effects associated with PW items also increased with age.


This study focused on the examination of the latent structure underlying the responses to the GHQ-12, considering the role of method effects associated with both PW and NW items, and using two alternative parameterisations of the CFA measurement models. What should first be noted is that studies that have included method effects in the measurement model of the GHQ-12 have been more the exception than the rule in previous research into the factor structure of this questionnaire.

According to the results of the present study, we conclude that the GHQ-12 factor structure is best characterised by introducing latent method factors that capture the method effects associated with both NW and PW items (model 7). These results support the conclusion from previous research that the good fit obtained by multidimensional models (mainly the two-factor model and Graetz’s three-factor model) could simply be explained by the artificial grouping of PW and NW items. However, the interpretation of the latent (method) factors as purely reflecting method bias due to wording is not straightforward. It is obvious that the NW items share a common wording, as do the PW items. It is also clear that this three-factor bifactor model (one trait and two method factors) fitted the data best. And finally, there is a lot of empirical evidence on these wording effects. However, it is also relevant to discuss the large loadings of many items on the method factors, which are sometimes larger than their loadings on the trait factor. The general factor explains 52% of the shared variance, but there are some items that deserve careful attention. For example, items 3 (‘playing useful part in things’) and 4 (‘capable of making decisions’) had very low loadings on the trait factor. If we understand the PW method factor as reflecting only method bias, then it follows that these two items are purely method effects, but surely they must share some trait variance. In the same vein, items 10 (‘losing confidence in yourself’) and 11 (‘thinking of yourself as a worthless person’) load very highly on the NW method factor and, as a reviewer pointed out, a likely (post hoc) explanation is that wording bias is still confounded with a confidence/self-image factor. Therefore, the interpretation of these effects as purely method effects and, accordingly, the interpretation of an overall score for the scale may be compromised.
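One common way to quantify the share of common variance attributable to the general factor in an orthogonal bifactor model is the explained common variance (ECV): the sum of squared general-factor loadings divided by the sum of squared loadings on all factors. The article does not state how its 52% figure was computed, so the following sketch is an assumption about the kind of index involved, and the loadings used are made-up illustrative values, not those in table 2.

```python
def explained_common_variance(general_loadings, method_loadings):
    """ECV for an orthogonal bifactor model.

    general_loadings: standardised loadings on the general factor.
    method_loadings: standardised loadings on the method (group) factors,
        one value per item on its own method factor.
    """
    general = sum(l ** 2 for l in general_loadings)
    method = sum(l ** 2 for l in method_loadings)
    return general / (general + method)


if __name__ == "__main__":
    # Hypothetical loadings for 12 items, for illustration only.
    g = [0.6] * 12
    m = [0.5] * 12
    print(round(explained_common_variance(g, m), 3))
```

An ECV near 0.5 would mean that the method factors jointly account for about as much common variance as the trait factor, which is the kind of situation that calls for caution in interpreting total scores.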

The second aim of this study was to examine the relationship between the method factors associated with both NW and PW items and three demographic variables, namely the sex, age and educational level of the respondents. Regarding sex, we found a statistically significant, but weak, relationship between the PW method factor and sex, so that men were more likely than women to endorse PW items. These results are in line with previous works that, in the context of the Rosenberg Self-Esteem Scale (RSES), have found sex differences in wording effects.56 57 As for the explanatory role of age on method effects, we found that the relationship between age and the NW effect was not statistically significant, which supports previous research using other questionnaires (eg, self-esteem scales,50 Hospital Anxiety & Depression Scale64). Moreover, our results give support to previous studies which had stated that, in older adults, the strongest method effects would be associated with PW items, rather than NW items.55 58

As to the educational level, we found that this variable was not significantly related to either of the two method factors. This result supports and extends the evidence obtained in Tomás et al,50 who found that the educational level of the respondents had no effect on the negative method factor using self-esteem questionnaires. However, this result contradicts previous research on the relationship between the NW factor and the educational level/verbal ability with different questionnaires and samples.41 64–69

Overall, the significant effects of sex and age on the trait and method factors indicate that women have worse well-being, although this effect is partly modified by a method effect on the PW items, whereas the results for age suggest that older respondents have worse well-being and that this effect is magnified by a method effect on the PW factor. The results on the individual differences related to the demographic variables considered in this study can help not only to understand the presence of wording method effects but also to identify respondents who are prone to answering PW and NW items differently. In this sense, the clearest relationships are those for the age and sex variables.

Another practical consequence of our study concerns the relationship between the intended measure of the GHQ-12 (ie, the GPH factor) and other constructs of interest. Several studies have shown that method effects can inflate, deflate or have no effect at all on estimates of the relationship between two constructs (see Podsakoff et al 70 for a further review of the effects that method biases have on individual measures and on the covariation between different constructs). Thus, it is desirable that both the constructs of interest as well as the effects of method factors, like PW and NW, are considered in SEM models as a means of controlling these systematic sources of bias, and thus avoiding the drawing of inaccurate conclusions about the relationship between the substantive factors.

Previous research on the GHQ-1231 36 has outlined the asymmetry in the participants’ responses as a function of the wording of the items, as well as the different response scales for the PW and NW items. This asymmetry is consistent with results from previous research into wording effects for contrastive survey questions.71 The extent to which the presence of method effects is linked to the asymmetric pattern of responses and/or to the different response scales for the PW and NW items in the GHQ-12 should be examined in future research.

Comparing the current work with previous studies into the factorial structure of the GHQ-12, to our knowledge, this is the first study that tests a comprehensive set of models including method effects associated with both PW and NW items and also explores some demographic correlates of these method effects. Another strength of this work was the fact that it used a large representative sample of workers, but the results might not be generalised to other specific populations, for example, adolescents and elderly retired people.



  • Twitter @jmlosilla, @VivesJ_Research

  • Contributors All authors meet the criteria recommended by the International Committee of Medical Journal Editors (ICMJE). All authors made substantial contributions to conception and design, acquisition of data or analysis and interpretation of data. MFR and JGM: drafted the article. JV and JML: critically revised the draft for important intellectual content. JMT: worked on the statistical analysis and interpretation of data. All authors agreed on the final version.

  • Funding This work was supported by the Grant PGC2018-100675-B-I00, Spanish Ministry of Science, Innovation and Universities (Spain). The funders had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript.

  • Disclaimer All authors have agreed to authorship in the indicated order. All authors declare that this paper is an original unpublished work and that it is not being submitted elsewhere. None of the authors has any financial interests that might be interpreted as influencing the research, and APA ethical standards were followed in the conduct of the study.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Ethics approval The research was not submitted to approval by an institutional review board since this is not a requirement at our universities for this type of study. Ethics approval was not sought for this study since this was a secondary analysis of anonymised data.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement Data are available upon reasonable request.