Development and validation of brief scales to measure emotional and behavioural problems among Chinese adolescents
Minxue Shen1,2, Ming Hu2, Zhenqiu Sun2
1 Xiangya Hospital, Central South University, Changsha, China
2 Department of Epidemiology and Health Statistics, Xiangya School of Public Health, Central South University, Changsha, China
Correspondence to Ming Hu; xysm2011{at}


Objectives To develop and validate brief scales to measure common emotional and behavioural problems among adolescents in the examination-oriented education system and collectivistic culture of China.

Setting Middle schools in Hunan province.

Participants 5442 middle school students aged 11–19 years were sampled. 4727 valid questionnaires were collected and used for validation of the scales. The final sample included 2408 boys and 2319 girls.

Primary and secondary outcome measures The tools were assessed using item response theory, classical test theory (reliability and construct validity) and differential item functioning.

Results Four scales to measure anxiety, depression, study problem and sociality problem were established. Exploratory factor analysis showed that each scale had a two-factor solution. Confirmatory factor analysis showed acceptable to good model fit for each scale. Internal consistency and test–retest reliability of all scales were above 0.7. Item response theory showed that all items had acceptable discrimination parameters and most items had appropriate difficulty parameters. Ten items demonstrated differential item functioning with respect to gender.

Conclusions Four brief scales were developed and validated among adolescents in middle schools of China. The scales have good psychometric properties with minor differential item functioning. They can be used in middle school settings, and will help school officials to assess the students’ emotional/behavioural problems.

Strengths and limitations of this study

  • First study to develop and validate brief tools for assessing common emotional and behavioural problems among Chinese adolescents.

  • Modern test theory and classical test theory are used to validate the tool.

  • The scales are specifically relevant to the examination-oriented education system and collectivistic culture of China.

  • A convenience sampling method was used.

  • Diagnoses were not determined by psychiatrists.


Adolescence is a time of profound biological and social transition during which new behaviours are developed that can either benefit the health and social adaptation of youth or, alternatively, undermine adjustment in adulthood.1 Adolescents have to cope with increasing independence and the growing importance of social relationships, while developing and exercising self-control. However, difficulties in the development of executive control in adolescence can lead to a lack of balance in the regulation of cognition, emotions and behaviours when dealing with negative thoughts and feelings, and then result in anxiety and depression.2 This is an important issue, especially in low-income and middle-income countries where enormous environmental threats are more common (eg, poverty, war, internal conflicts, sex trafficking, early pregnancy and marriage, absence of access to education).

Adolescents, defined by the WHO as those between 10 and 19 years of age, represent an estimated 1.2 billion of the world's population.3 A meta-analysis that synthesised data from over 60 000 adolescents aged 13–18 years estimated the prevalence of depression to be 6% in the community.4 Anxiety disorders have an estimated cumulative prevalence of 32% among adolescents aged 13–18 years according to numerous population-based studies.5 Mental disorders also contribute to heavy social burdens. Globally, neuropsychiatric disorders account for 45% of years lost due to disability for adolescents.6 Up to 20% of young adults have a disabling mental illness, and 50% of adult mental health disorders have their onset in adolescence.7 Study and sociality problems are associated with anxiety and depression disorders in adolescents, while these disorders per se lead to negative outcomes including behavioural problems, poor school performance and impaired social and family functioning.8 ,9

Culture influences the sources of distress, the form of illness experience, symptomatology, the interpretation of symptoms, modes of coping with distress, help-seeking and the social response to distress and disability. In China, academic pressure under the examination-oriented education system, cultural differences between China and Western societies with respect to the view of social and sexual relationships, as well as limited mental health literacy and the social inequity experienced by the large younger generation of internal migrants have resulted in an increased prevalence of anxiety, depression and other mental disorders.10–12 The examination-oriented education system of China results in a score-based friend-making criterion: adolescents prefer to make friends with those who perform better in examinations, and ignore those with poor performance. This further aggravates the association of anxiety and depression with study and sociality problems among them. However, since individual socioemotional well-being has traditionally been neglected in the collectivistic culture of China, adolescent mental and behavioural problems are not well identified, and have not received adequate attention from professionals and the public.13

Collectivism, which is continually reinforced by Chinese parents and teachers, also results in differences in the expression of anxiety and depression. A collectivistic culture values harmony within the group, and individual gain is considered to be less important than improvement of the social group.14 Embarrassment may be more common in collectivistic cultures because it is induced by external sanctions.15 ‘Taijin kyofusho’ (the fear of offending or embarrassing the other person) is an example of a culturally specific expression of anxiety in Asian countries.16 Biological evidence also shows that people who live in collectivist cultures are more likely than those in individualistic cultures to have a form of the serotonin transporter gene that correlates with higher rates of anxiety and depression.17

In addition, there are cultural differences with respect to treatment response. Stigmatisation of people with mental illness is especially pronounced in China.18 Family members try to conceal any history of mental illness within the family to avoid any negative impact on the family and potential of the young person to get married. A study examined culture-related influences on willingness to seek treatment for anxiety in first-generation and second-generation students of Chinese heritage and their European-heritage counterparts, and found that first-generation Chinese participants were significantly less willing to seek treatment.19 The reluctance was associated with greater Chinese-heritage acculturation rather than perceiving symptoms as less impairing.

A psychological scale seeks to identify and evaluate patients who may have current disorders but have not sought treatment.20 Currently, widely used mental/behaviour problem scales for children and adolescents include the Achenbach Child Behavior Checklist,21 Personality Diagnostic Questionnaire,22 Rutter's Behavior Scale,23 Spence Children's Anxiety Scale,24 Zung's Self-Rating Anxiety Scale (SAS),25 Zung's Self-Rating Depression Scale (SDS),26 Children's Depression Inventory,27 Child and Adolescent Psychiatric Assessment,28 Hospital Anxiety and Depression Scale,29 etc. In China, however, most studies have employed translated versions of foreign scales,30–34 and no dedicated scale has been developed, validated and used to screen for common emotional and behaviour problems among adolescents in middle school. Owing to the cross-cultural differences in the formation and expression of mental problems described above, in the current study we developed and validated four brief scales according to the characteristics of adolescents in the examination-oriented education system and collectivistic culture of China.


Study design

The study had a cross-sectional design. The participants were junior school (grade 7th–9th) and high school (grade 10th–12th) students aged 11–19. A stratified cluster sampling frame was used. Two cities (Changsha and Shaoyang) and two counties (Liuyang and Ningxiang) were selected from Hunan province through a convenience sampling method. In each city or county, two junior schools and two high schools were selected using a random number table. In each school, two classes were selected from each grade using a random number table. All students in selected classes were recruited through a questionnaire survey.

Scale development

Based on literature review and expert advice, common emotional and behavioural problems among Chinese middle school students were selected (emotional problems, learning problems and interpersonal problems), and four scales were formed (anxiety, depression, study problems and sociality problem). Scale development was performed in four phases:

Phase I: Programmed decision processing was used to develop the scales by a nominal group. A total of 90 items was drafted by interviewing the nominal group.

Phase II: Individual questions were edited and redundant questions were eliminated by a focus group consisting of 10 experts of child and adolescent psychology (n=3), epidemiology (n=2), biostatistics (n=3) and medical sociology (n=2). An initial pool of 59 items was derived. The responses to all items were graded on a four-point scale (1=never/occasionally; 2=a little time; 3=quite some time; 4=most of the time). Responses were transformed into raw scores, which signified the severity of a problem.

Phase III: A pilot study was conducted in 538 middle school students (one class per grade from a junior school and a high school in Changsha and Ningxiang, respectively). The items were selected by statistical methods as follows: (1) t-test. Participants were ranked by their score on the scale, and a high-score group and a low-score group were formed according to percentiles (P75 and P25, respectively). The score of each item was compared between the two groups using Student’s t-test. Any item with no statistically significant difference between the two groups was eliminated. (2) Correlation coefficient. Any item with a Pearson's correlation coefficient <0.40 with the scale score was eliminated. (3) Factor analysis. Any item with a factor loading <0.40 was eliminated. (4) Selection rate. Any item with three options that presented a selection rate <10% was eliminated. After the pilot test, four scales consisting of 46 items were derived. Details of the pilot study can be found in our published paper.35
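As an illustration of the item-screening rules in Phase III, the following is a minimal sketch of the correlation rule (2) and the selection-rate rule (4) only. It is not the authors' code: the function names, the toy data and the reading of rule (4) as "at least three of the four options were each chosen by fewer than 10% of respondents" are assumptions made for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = sqrt(sxx * syy)
    # A constant item has no variance; treat its correlation as 0.
    return sxy / denom if denom > 0 else 0.0


def has_rare_options(item_scores, options=(1, 2, 3, 4), min_rate=0.10, max_rare=3):
    """Assumed reading of rule (4): True if at least `max_rare` options
    were each selected by fewer than `min_rate` of respondents."""
    n = len(item_scores)
    rare = sum(1 for opt in options if item_scores.count(opt) / n < min_rate)
    return rare >= max_rare


def screen_items(responses, min_r=0.40):
    """Return indices of items that pass rules (2) and (4).

    `responses` is a list of respondents, each a list of item scores (1-4).
    The scale score is the plain sum of item scores, as in the text.
    """
    totals = [sum(row) for row in responses]
    kept = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        if pearson_r(item, totals) < min_r:
            continue  # rule (2): weak item-scale correlation
        if has_rare_options(item):
            continue  # rule (4): responses concentrated on one option
        kept.append(j)
    return kept


# Hypothetical toy data: 8 respondents, 3 items.
responses = [
    [1, 1, 2], [1, 2, 1], [2, 2, 4], [2, 3, 1],
    [3, 3, 3], [3, 4, 1], [4, 4, 2], [4, 3, 1],
]
print(screen_items(responses))  # the third item correlates weakly and is dropped
```

Rules (1) and (3) need a t distribution and a factor model, so in practice they would be run in a statistics package rather than by hand.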

Phase IV: The scales were tested among 5442 middle school students. A total of 220 students took the test again 1 week after the initial test. The construct validity, reliability, item performance and differential item functioning (DIF) were assessed.

Statistical analysis

The Kaiser-Meyer-Olkin (KMO) test was used to evaluate the adequacy of exploratory factor analysis (EFA), and a KMO value >0.8 indicates that factor analysis will be useful. EFA was used to examine the number of factors in each scale. Factors with an eigenvalue >1.0 were selected. The quartimax rotation was used to achieve rotated factor loadings.

Under the a priori hypothesis that each scale measures a distinct latent trait, construct validity was assessed by confirmatory factor analysis (CFA) for each scale separately, using structural equation models. For each scale, the number of dimensions was determined by the EFA result. Goodness-of-fit of CFA was assessed by the root mean square error of approximation (RMSEA), comparative fit index (CFI) and Tucker-Lewis index (TLI).

Values of the CFI and TLI ≥0.90 represent an acceptable fit, and >0.95 represents a good fit. Values of the RMSEA≤0.08 represent an acceptable fit, and <0.05 represents a good fit.36 Factor loadings were reported.
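The cut-offs above can be written as a small lookup, which may help when reading table 3. This is only a sketch: the function name and the example values are hypothetical and are not results from this study.

```python
def classify_fit(cfi, tli, rmsea):
    """Label each fit index using the thresholds stated in the text."""

    def incremental(value):
        # CFI and TLI: >0.95 good, >=0.90 acceptable.
        if value > 0.95:
            return "good"
        if value >= 0.90:
            return "acceptable"
        return "unacceptable"

    def absolute(value):
        # RMSEA: <0.05 good, <=0.08 acceptable.
        if value < 0.05:
            return "good"
        if value <= 0.08:
            return "acceptable"
        return "unacceptable"

    return {"CFI": incremental(cfi), "TLI": incremental(tli), "RMSEA": absolute(rmsea)}


# Hypothetical values, for illustration only.
print(classify_fit(cfi=0.93, tli=0.88, rmsea=0.06))
```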

Concurrent validity of scales was assessed by Zung's Self-Rating Anxiety Scale (SAS), Zung's Self-Rating Depression Scale (SDS), and the interpersonal sensitivity dimension of Chinese SCL-90-R, respectively, using Pearson's correlation coefficients. Concurrent validity of the study problem scale was not evaluated owing to the lack of relevant scales.

Reliability of scales was assessed by Cronbach's α coefficient, Spearman-Brown's split-half coefficient and test–retest reliability. Item performance was assessed by the two-parameter polytomous item response theory (IRT) models. IRT is a family of associated mathematical models that relate latent traits (ability) to the probability of responses to items in an assessment, and it has been widely used in health assessment.37 ,38 It describes a nonlinear relationship between binary, ordinal or categorical responses and the latent trait θ (mental/behavioural problems in this study). Equation (1) specifies a polytomous IRT model, which is used for items with multiple categories (eg, Likert-type).

$$P(X_k = c \mid \theta) = \frac{1}{1 + \exp[-a_k(\theta - b_{k,c})]} - \frac{1}{1 + \exp[-a_k(\theta - b_{k,c+1})]} \qquad (1)$$

In this model, the probability (P) of scoring in a specific category (c) is modelled as the probability of responding in this category or higher minus the probability of responding in the next category or higher. b_{k,c} is the upper grade difficulty parameter (the point on the ability scale that corresponds to a probability of a certain response of 0.5) for category c of item k, and a_k is the discrimination parameter (an estimate of how well an item can differentiate among respondents with different levels of ability) for item k. An acceptable a_k should be above 0.5, and an appropriate mean b_k should be between –3.0 and 3.0.39
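The category probabilities described above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the IRTPRO implementation: it assumes a logistic graded-response form, takes the cumulative probability as 1 below the lowest category and 0 above the highest, and uses hypothetical parameter values.

```python
from math import exp


def cumulative(theta, a, b):
    """Probability of responding at or above the boundary located at b."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def category_probs(theta, a, thresholds):
    """Probabilities of each response category for one item.

    `thresholds` holds the ascending difficulty parameters of the item;
    an item with four response options has three thresholds.
    """
    # Assumed boundary convention: 1 before the first boundary, 0 after the last.
    cum = [1.0] + [cumulative(theta, a, b) for b in thresholds] + [0.0]
    return [cum[c] - cum[c + 1] for c in range(len(cum) - 1)]


# Hypothetical item: discrimination 1.5, thresholds at -1, 0 and 1.
probs = category_probs(theta=0.0, a=1.5, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

Because each probability is a difference of adjacent cumulative curves, the values always sum to 1, and a respondent located exactly at a threshold has a 0.5 probability of answering at or above that boundary.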

Test information function (TIF) describes the precision of the measure. A measure has most discriminative power among participants with ability that corresponds to the peak of the TIF curve.

Measurement invariance of the scales between genders was evaluated using the DIF test. Gender differences in the discrimination (Δa_i) and difficulty (Δb_i) parameters of all items were examined using χ2 tests with a significance level of 0.001.

IRT parameters were estimated using the Bock-Aitkin procedure40 in IRTPRO 3 (Scientific Software International, Lincolnwood). Other analyses were performed using SAS 9.4 (SAS Institute, Cary, North Carolina, USA). The significance level for DIF was 0.001 owing to the large sample size and Bonferroni correction for multiple comparisons; for other statistical tests, the α was 0.01.

Ethics statement

The methods were carried out in accordance with the STROBE statement.41 The purpose and implication of the survey were explained by the investigators. Written informed consent was obtained from all parents or main caregivers of the students.


In all, 5442 middle school students were sampled, 5273 (97%) returned the questionnaires and 4727 (87%) completed the survey without apparent logical errors or missing values on items. The average age of the participants was 14.9±1.9 years (range 11–19 years). The demographic characteristics of the students and their parents are shown in table 1.

Table 1

Demographic characteristics of participants

The KMO test values for four scales were 0.893 (anxiety), 0.864 (depression), 0.870 (study problem) and 0.870 (sociality problem), respectively, all with Bartlett's sphericity test p<0.01. The results of EFA are shown in table 2. For each scale, two factors with an eigenvalue >1.0 were identified. For the anxiety scale, the two factors signified general symptoms and sleep-related symptoms, respectively. For the depression scale, the two factors loaded on different symptoms that were psychologically difficult to group. For the study problem, the two factors referred to interest of study and study/exams stress, respectively. For the sociality problem, the two factors signified sociality problems at home and at school, respectively.

Table 2

Exploratory factor analyses for the four scales

The results of CFA are shown in table 3. For each scale, the two dimensions identified by EFA were used for CFA. Mixed evidence was found for the dimensionality tested for each of the scales, that is, there was acceptable to good RMSEA and CFI for all scales but unacceptable TLI (<0.9) for the sociality scale. Despite the generally good model fit, local dependence was identified between items A6 and A7, and between items D9 and D11. CFA models were modified accordingly, and the goodness of fit improved slightly. Pearson's correlation coefficient and standardised regression weight (factor loading) of each item are shown in table 4.

Table 3

Construct validity and reliability of the four scales

Table 4

Psychometric parameters of items in the original scale

For concurrent validity, the anxiety scale had a correlation coefficient of 0.78 (p<0.0001) with SAS, the depression scale had a coefficient of 0.79 (p<0.0001) with SDS, and the sociality problem scale had a coefficient of 0.47 (p<0.0001) with the interpersonal sensitivity subscale of the Chinese SCL-90-R.

Cronbach's α, Spearman-Brown's split-half coefficient and test–retest reliability of each scale are shown in table 3. Overall, all scales showed good internal consistency and test–retest reliability (above 0.7).
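For readers who want to compute the same internal-consistency coefficients on their own data, the textbook formulas can be sketched as below. This is an illustration only, not the SAS code used in the study; the function names and the toy responses are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def spearman_brown(half_correlation):
    """Step up the correlation between two half-tests to full-test reliability."""
    return 2 * half_correlation / (1 + half_correlation)


# Hypothetical toy data: 4 respondents answering 3 perfectly consistent items.
toy = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
print(round(cronbach_alpha(toy), 3))
print(round(spearman_brown(0.5), 3))
```

Test–retest reliability is not shown because it is simply the correlation between scale scores from the two administrations.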

The discrimination (a_i) and difficulty (b_i) parameters of the polytomous IRT models are presented in table 4. All items exhibited an acceptable discrimination parameter (a_i>0.5) and many had high discrimination power (a_i>1.5). Most items had an appropriate difficulty parameter (–3.0<b_i<3.0) except A9, ST11 and SO5. As shown in figure 1, for the anxiety, depression and study scales, TIF reached a peak where students’ ability was around 2.0; this indicates that the measurement was most discriminative among students with a high level of problems. The TIF of the sociality scale reached a peak among students with a moderate level of problem (ability around 0).

Figure 1

Test information curves. Test information curves are presented for each dimension of a scale. Ability signifies the severity of anxiety, depression, study problem and sociality problem, respectively, estimated from the item response model; it ranges from −3 to 3. The test information of anxiety, depression and study problem peaked among students with high level of the traits, while the test information of sociality problem reached a peak among students with moderate level of the trait.

The DIF test indicated that 2/11, 2/12, 4/13 and 2/10 items in the anxiety, depression, study and sociality scales, respectively, showed significant (p<0.001) DIF with respect to gender (table 4). A negative Δb_i indicated that girls were more likely to endorse a higher score than boys; conversely, a positive Δb_i indicated that boys were more likely to endorse a higher score. A mixed pattern of DIF was observed.


We developed and validated four short scales to measure emotional and behavioural problems among adolescents in middle schools of Hunan, China. In this study, we (1) developed the initial item pool based on a focus group, broad literature review and existing scales; (2) examined the psychometric properties of the scales in a pilot study among 538 middle school students; (3) examined the psychometric properties of the scales according to classical and modern test theories among a large sample of adolescents. This is the first systematic study to develop and validate short scales to measure emotional (anxiety and depression) and behavioural (study and sociality) problems among adolescents in China.

Cultural differences profoundly impact the sources of distress (study maladaptation in our case), formation and expression of anxiety/depression (characterised by amplified somatic symptoms among the Asian population) and sociality problems (collectivism in Chinese culture vs individualism in western cultures). As a result, the scales are specific in several aspects: (1) In the anxiety and depression scales, some somatic symptom items were excluded to minimise the effect of amplification of somatic symptoms, because somatisation among Asians has been well documented in comparison studies.42–44 Sex-related items were also excluded because they were not applicable. Important and frequently reported items were retained, such as cardiopulmonary and vestibular symptoms, pain and fatigue. (2) The study scale measures interest in study and study-related stress. Since students are under substantial pressure in the examination-oriented education system, it is of great importance to measure study maladaptation, which is a significant and unique source of distress among Chinese adolescents. (3) The sociality scale was only moderately correlated with the SCL-90 interpersonal dimension according to the concurrent validity results. Our scale measures sociality problems from the perspective of the collectivist culture of China. It emphasises relationships with parents, teachers and classmates. It differs from the SCL-90 in that the latter measures individual feelings and thoughts that are more relevant to individualistic cultures. (4) The brief scales measure the most common mental/behavioural problems in this population and are time-saving.

Under the a priori hypothesis, the four scales measured different latent traits respectively. According to the EFA results, each scale had two dimensions. Dimensions of the anxiety, study and sociality scales were well explained, while the division of the depression scale dimension was psychologically obscure. CFA showed acceptable to good RMSEA and CFI for all scales, although the TLI of the sociality scale was unacceptable and local dependency was detected between a few items. Further revisions will be needed to optimise the sociality problem scale.

The scales had good internal consistency and test–retest reliability. IRT parameters showed that all items had moderate to high discriminative power, and most items had appropriate difficulty. TIF showed that all scales had strong reliability, and they were most discriminative among adolescents with moderate to high levels of emotional/behavioural problems. The peaked information function of the scales reflected the quasi-traits of psychopathology constructs. A quasi-trait refers to a unipolar construct in which one end of the scale represents severity, while the other pole represents its absence (eg, depression vs not depressed).37 This is in contrast to a bipolar construct where both ends of the scale represent meaningful variation (eg, depression vs happiness, high-health literacy vs low-health literacy), which is commonly seen in public health settings.45 ,46

Good measurement requires that test scores have the same meaning across all relevant examinee groups. In the current study, significant gender DIF was detected at the item level, which might compromise the ability to scale boys and girls onto a common metric. In an observation on DIF research in clinical settings, McHorney and Fleishman47 suggest that women are more likely to report physical and emotional distress as well as pain, fatigue and other markers of negative affect. However, we observed a mixed pattern of DIF between boys and girls. For the anxiety and depression scales, girls were more likely to endorse higher scores on affective symptoms, while boys were more likely to endorse somatic symptoms (rapid heartbeat). Boys were also more likely to endorse study and sociality problems. Evidence for gender differences in somatisation remains inconclusive.48 Nevertheless, most items showed no significant DIF. Reise and Waller37 suggest that the presence of item-level DIF does not necessarily lead to bias at the level of scale scores. Although the DIF of some items in our scales was significant, the overall location parameters were essentially equal between boys and girls. We considered that the observed DIF in a few items was not psychologically important, and we concluded that no large bias was observed at the composite level.

The study has some limitations. First, owing to the large sample size and the infeasibility of performing structured interviews, diagnoses were not determined by psychologists/psychiatrists; as a result, cut-off points for the scales were not provided, and the scales can only be used to assess the level of a problem rather than to screen for those with a problem. We will conduct a higher standard of validation in a further study. Second, a convenience sampling method was used to select cities and counties, considering the feasibility of the field survey. Although schools and classes were randomly selected, the representativeness of the study population may be limited. Third, item-level DIF was detected, although this DIF does not necessarily lead to bias at the composite level. Testing for and finding DIF is better than ignoring a potential problem. In spite of these limitations, the study provides new short scales to measure common emotional and behavioural problems among adolescents in middle schools. The scales meet psychometric standards, and can serve as a reliable tool to measure the severity of common emotional, study and sociality problems among Chinese adolescents.


  • Contributors MH and ZS conceived and designed the study. MH collected the data. MS analysed the data and drafted the manuscript. All authors gave final approval to the version submitted for publication.

  • Funding This work was supported by the Natural Science Foundation of China (30400355) and China Scholarship Council (201406370034).

  • Competing interests None declared.

  • Patient consent Obtained.

  • Ethics approval The Ethics Committee of Central South University.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.

