Objectives To examine whether respondents to a survey of health and physical activity and potential determinants could be grouped according to the questions they missed, known as ‘item missing’.
Design Observational study of longitudinal data.
Setting Residents of Brisbane, Australia.
Participants 6901 people aged 40–65 years in 2007.
Materials and methods We used a latent class model with a mixture of multinomial distributions and chose the number of classes using the Bayesian information criterion. We used logistic regression to examine if participants’ characteristics were associated with their modal latent class. We used logistic regression to examine whether the amount of item missing in a survey predicted wave missing in the following survey.
Results Four per cent of participants missed almost one-fifth of the questions, and this group missed more questions in the middle of the survey. Eighty-three per cent of participants completed almost every question, but had a relatively high missing probability for a question on sleep time, a question which had an inconsistent presentation compared with the rest of the survey. Participants who completed almost every question were generally younger and more educated. Participants who completed more questions were less likely to miss the next longitudinal wave.
Conclusions Examining patterns in item missing data has improved our understanding of how missing data were generated and has informed future survey design to help reduce missing data.
- public health
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Strengths and limitations of this study
A better understanding of item missing data could help improve survey design and help determine if the data are missing at random.
Identifying patterns in missing data is just part of the process and researchers need to use additional methods to impute or weight data to account for the potential biases of missing data.
We show results for two different numbers of latent groups (3 and 14); choosing the optimal number may be based on a combination of goodness-of-fit statistics and the researchers’ preferences.
Missing data is an almost ubiquitous problem for surveys, and participants may fail to complete whole surveys or partially complete surveys. Ignoring missing data by using a complete case analysis can create potentially serious biases when the ratio of missing to complete data is large, or when participants with missing data are very different from participants with complete data.1 For example, participants with complete data may be healthier, which would be a particular problem if we were interested in measuring general health.
Longitudinal studies often have both wave missing data, where a participant failed to complete an entire wave, and item missing data, where a participant partially completed a wave.2 In this paper, our focus is on item missing data in returned mail surveys. Wave missing usually creates more serious biases than item missing as more information is lost. However, item missing data can compound the biases of wave missing data.
An understanding of the potential causes of item missing data may inform procedures to impute missing data, and may help to improve future data collections by changing the way questions are worded, displayed or ordered.3 4 A better understanding of the patterns of missing data may also help determine if the data are missing at random.5 To identify the missing data mechanism, we need to use as much information as possible to determine why the questions were not completed.6
In this analysis, we identify patterns of item missing data in survey responses using a case study of a survey of physical activity and potential predictors of activity such as neighbourhood perceptions, attitudes, health and sociodemographics. We used latent class analysis to identify groups of participants who tended to miss similar questions. For example, we might find a group of participants who rarely completed potentially sensitive questions such as income.
There is a large literature on wave missing data and item missing data, but most papers have been concerned with imputing missing data in order to overcome the potential biases caused by missing data.7 Examining the pattern of missing data using tables and plots of both wave missing and item missing (yes/no) has been recommended by Rubin and Little8 as this can inform the method of imputation and highlight problems such as variable combinations that were never jointly observed. This can be achieved using the ‘mi’ package in R which plots the overall pattern of missing data with the option of ordering participants by their proportion missing.9 Pattern-mixture models have been used to model common patterns of missing as a between-subject factor for longitudinal analysis.10 In this paper, we model patterns of item missing data in order to identify problem questions or groups of questions, and to examine whether there are groups of participants who miss similar questions.
Materials and methods
We used data from the longitudinal HABITAT (How Areas in Brisbane Influence healTh And acTivity) study.11 This is a population-based study of people aged 40–65 years in 2007 living in Brisbane, Australia. The aims of the study are to understand the factors enabling and limiting physical activity in mid-age. Most questions in the self-completed mail surveys were either answered by ticking a box on a Likert scale or writing a number in a box with just a few free text (open-ended) responses. Some questions could be legitimately missing, for example, hours of work only needed to be completed if the respondent was currently working. We excluded these conditional questions from this analysis as they could be legitimately missing.
The study began in 2007 and had follow-up waves in 2009 and 2011. The response rates per wave for all participants sent a questionnaire were 68% in 2007 (n=11 035 returned questionnaires), 72% in 2009 (n=7867) and 67% in 2011 (n=6901).
Data were collected using detailed mail surveys of 250 or more questions. The survey questions were grouped in sections such as physical activity participation, neighbourhood characteristics (eg, traffic), general health and lifestyle, and demographics (eg, employment). The full surveys are available online at Institute for Health and Ageing’s website (https://iha.acu.edu.au/research/research-projects/habitat-project). A number of strategies were used to encourage a high response rate, including advanced notice, personalised cover letters, surveys labelled with suburb area, and reminder letters.11 Participants were encouraged to answer every question. The instructions at the front of the survey stated, “Some of the questions may sound the same. However, it will help us greatly if you answer all questions”.
The HABITAT study including the procedure for participants providing informed consent was approved by the human research ethics committee of the Queensland University of Technology.
Our overall aim was to find similar patterns of item missing data using latent groups. This could also be called unsupervised classification or clustering.12
We used a mixture of multivariate multinomial distributions to create latent groups with similar patterns of item missing using Rmixmod.13 The multinomial distributions modelled the average probability of missing across the 286 questions from the 2011 survey. Each participant was allocated the latent group that best matched their pattern of item missing. We allowed unequal group sizes, so, for example, two latent groups might have 90% of participants in one group and 10% in the other. The probability density function is a weighted sum of multinomial distributions:
$$f(x_i) = \sum_{k=1}^{K} p_k \prod_{j=1}^{Q} \alpha_{kj}^{x_{ij}} (1-\alpha_{kj})^{1-x_{ij}}, \qquad i = 1, \ldots, n,$$

where x is the observed binary data of missing (yes/no), with $x_{ij}=1$ if participant i missed question j; n is the total number of observations; Q is the total number of questions; p are the latent group probabilities that sum to 1; and $\alpha$ are the missing probabilities in each group. A key question is the optimal number of latent groups needed to capture the range of missing patterns in the data (K in the above formula). We chose the optimal number of latent groups using the Bayesian information criterion (BIC), which makes a trade-off between a good fit to the data and model complexity.14 The models were fitted using the Rmixmod package in the R software V.3.3.1 (code and data are available at GitHub (https://github.com/agbarnett/item.missing)).15 We tested between 2 and 20 latent groups and chose the optimal number using the smallest BIC. We discarded models which included any group with fewer than 1% of participants, as such groups were considered too small to be meaningful.
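The fitting and model selection step can be illustrated with a minimal sketch. It is not the Rmixmod implementation used in the study: it is a plain expectation-maximisation (EM) fit of a mixture of independent yes/no missing indicators, written with the Python standard library and run on simulated data. The function names (`fit_mixture`, `bic`), the starting values, the number of iterations and the simulated missing rates are all assumptions made for illustration.

```python
import math
import random


def row_log_terms(row, p, alpha):
    """log(p_k) plus the log-likelihood of one 0/1 row under each group k."""
    return [
        math.log(pk) + sum(math.log(a if xij else 1.0 - a) for xij, a in zip(row, ak))
        for pk, ak in zip(p, alpha)
    ]


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def fit_mixture(x, K, iters=50, seed=1, eps=1e-6):
    """EM for a K-group mixture of independent Bernoulli missing indicators."""
    rng = random.Random(seed)
    n, Q = len(x), len(x[0])
    p = [1.0 / K] * K
    alpha = [[rng.uniform(0.2, 0.8) for _ in range(Q)] for _ in range(K)]
    for _ in range(iters):
        # E-step: posterior probability of each group for each participant
        resp = []
        for row in x:
            logs = row_log_terms(row, p, alpha)
            norm = log_sum_exp(logs)
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: update group shares and per-question missing probabilities
        for k in range(K):
            nk = max(sum(r[k] for r in resp), eps)
            p[k] = nk / n
            for j in range(Q):
                num = sum(r[k] * row[j] for r, row in zip(resp, x))
                alpha[k][j] = min(max(num / nk, eps), 1.0 - eps)
        total = sum(p)
        p = [pk / total for pk in p]
    loglik = sum(log_sum_exp(row_log_terms(row, p, alpha)) for row in x)
    return p, alpha, loglik


def bic(loglik, K, Q, n):
    """BIC = -2 log-likelihood + (free parameters) * log(n); smaller is better."""
    n_params = (K - 1) + K * Q
    return -2.0 * loglik + n_params * math.log(n)


# Simulated data (not HABITAT data): about 80% rarely miss, about 20% miss often
rng = random.Random(42)
data = []
for _ in range(300):
    rate = 0.02 if rng.random() < 0.8 else 0.40
    data.append([1 if rng.random() < rate else 0 for _ in range(12)])

bics = {}
for K in (1, 2, 3):
    p, alpha, loglik = fit_mixture(data, K)
    bics[K] = bic(loglik, K, Q=12, n=len(data))
best_K = min(bics, key=bics.get)
print(bics, best_K)
```

In practice EM should be run from several starting points, because a single run can stop at a local maximum of the likelihood.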
To examine the differences between latent groups, we plotted the mean estimated missing probability per question against question order, as we believed that fewer questions would be missed at the start of the survey. Using the equation notation, the plotted mean is the estimated missing probability $\hat{\alpha}_{kj}$ for question j and group k. We used a kernel density smoother to illustrate the average probability of missing by question order.16
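As an illustration of smoothing by question order, the sketch below applies a Gaussian-kernel weighted average (a Nadaraya-Watson smoother) to one group's per-question missing probabilities. The text does not state which kernel or bandwidth was used, so the kernel, the `bandwidth` value and the toy probabilities are assumptions.

```python
import math


def kernel_smooth(probs, bandwidth=5.0):
    """Gaussian-kernel weighted average of probabilities over question order."""
    Q = len(probs)
    smoothed = []
    for j in range(Q):
        weights = [math.exp(-0.5 * ((j - l) / bandwidth) ** 2) for l in range(Q)]
        smoothed.append(sum(w * a for w, a in zip(weights, probs)) / sum(weights))
    return smoothed


# Toy values: a flat low missing probability with one outlying question
toy = [0.01] * 20
toy[10] = 0.30
smoothed = kernel_smooth(toy)
print([round(s, 3) for s in smoothed])
```

A smoother like this shows the broad trend across the survey while damping single outlying questions, which is why such questions are better read off the unsmoothed points.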
We calculated the overall mean probability of item missing in each latent group, using the below equation:
$$\bar{\alpha}_k = \frac{1}{Q} \sum_{j=1}^{Q} \hat{\alpha}_{kj},$$

where Q is the total number of questions and $\hat{\alpha}_{kj}$ is the estimated missing probability for question j in group k. We also give the range in each group. To ease comparisons in the tables and plots, we numbered the latent groups using this mean probability from lowest to highest. We also refer to the latent groups based on their mean overall probability, such as ‘low missing’ and ‘high missing’.
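The group summary and renumbering can be sketched as follows. The function name `order_groups` and the toy probabilities are hypothetical; the code simply averages each group's per-question missing probabilities, records the range, and numbers the groups from lowest to highest mean.

```python
def order_groups(alpha):
    """Summarise each fitted group and number groups from lowest to highest mean."""
    summaries = []
    for k, ak in enumerate(alpha):
        summaries.append({
            "fitted_index": k,            # position in the fitted model output
            "mean": sum(ak) / len(ak),    # overall mean missing probability
            "min": min(ak),
            "max": max(ak),
        })
    summaries.sort(key=lambda s: s["mean"])
    for number, s in enumerate(summaries, start=1):
        s["group"] = number               # group 1 = lowest mean missing
    return summaries


# Toy per-question missing probabilities for three groups (illustrative only)
toy_alpha = [
    [0.20, 0.15, 0.22],
    [0.01, 0.02, 0.01],
    [0.05, 0.30, 0.04],
]
summaries = order_groups(toy_alpha)
for s in summaries:
    print(s["group"], round(s["mean"], 3), s["min"], s["max"])
```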
We expected the characteristics of the participants to vary by latent group, as a previous analysis of similar physical activity data found that individual characteristics, such as gender, predicted the number of missing items.17 We examined whether a participant’s age, gender and education predicted whether they would be in the latent group with the least amount of missing. We used predictors at the same wave (year 2011), but if these were missing then we used the value from the previous wave with an adjustment for age (eg, age from 2009 plus 2 years). We fitted this multiple logistic regression model using a Bayesian paradigm in WinBUGS18 V.1.4.3 and presented the results as prevalence ratios.19 We used a burn-in of 5000 Markov chain Monte Carlo (MCMC) iterations followed by a sample of 15 000 thinned by 3. The Bayesian multiple logistic regression model is given below:
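$$L(i) \sim \text{Bernoulli}(\pi_i), \qquad \text{logit}(\pi_i) = X_i \beta, \qquad i = 1, \ldots, n,$$

where L(i) is the binary dependent variable of being in latent group 1, $\pi_i$ is the probability that participant i is in latent group 1, $\beta$ is the vector of 11 regression coefficients and X is an n×11 matrix of an intercept and the predictors of gender, age and education, which has nine categories with the lowest educational level as the reference category. To aid convergence of the MCMC samples, we standardised age by subtracting 56 years (the mean) and dividing by 10 years. A standard frequentist logistic regression model could equally have been used.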
We used the highest probability to assign each participant to their latent class, L(i). However, there could be some uncertainty in this assignment as a participant may have a relatively high probability for multiple groups. For example, if there were five latent groups a participant with a probability of 0.2 of being in each group would reflect great uncertainty. To examine this issue, we used summary statistics for the highest probability for all participants.
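A minimal sketch of the modal assignment and the uncertainty check follows. The posterior probabilities are invented toy values, including the five-group example of a participant with a probability of 0.2 for every group; the function name `modal_class` and the 0.99 threshold used in the summary are illustrative choices.

```python
import statistics


def modal_class(posterior):
    """Return (group number, probability) for the most probable latent group."""
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best + 1, posterior[best]


# Toy posterior membership probabilities for three participants and five groups
toy_posteriors = [
    [0.996, 0.001, 0.001, 0.001, 0.001],  # clear assignment
    [0.2, 0.2, 0.2, 0.2, 0.2],            # great uncertainty
    [0.01, 0.97, 0.01, 0.005, 0.005],     # fairly clear assignment
]
assigned = [modal_class(row) for row in toy_posteriors]
max_probs = [prob for _, prob in assigned]

# Summary statistics of the highest probability across participants
share_above_099 = sum(prob > 0.99 for prob in max_probs) / len(max_probs)
print(assigned)
print(min(max_probs), statistics.median(max_probs), share_above_099)
```

With tied probabilities, as in the second row, the first group is returned, so the maximum probability itself is the better indicator of how reliable the assignment is.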
We used logistic regression to examine whether the amount of item missing in a survey predicted wave missing for the next survey. Our hypothesis was that greater item missing data would increase the probability that a participant failed to complete the next wave. We used the proportion of item missing in the 2007 survey (wave 1) to predict the return of the 2009 survey (wave 2) and the proportion of item missing in the 2009 survey (wave 2) to predict the return of the 2011 survey (wave 3). We modelled these two waves using a generalised linear mixed model with a random intercept per participant to account for the non-independence of data from the same participant.20 Participants who did not complete the 2009 survey were only included once.
The association between the proportion of item missing data and missing the next wave could be non-linear, for example, with a stronger effect at higher proportions. To allow for a range of non-linear curves, we used the fractional polynomials approach to give interpretable curves using the equation:
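$$\text{logit}\{\Pr(W(i,w)=1)\} = \beta_0 + \beta_1 m_{i,w-1}^{P} + \gamma_i,$$

where W(i, w) is the binary dependent variable of wave missing for participant i at wave w; $m_{i,w-1}$ is the proportion of item missing from the previous survey; $\beta_0$ and $\beta_1$ are the intercept and slope; $\gamma_i$ is a random intercept to control for repeated results from the same participant; and the power P is one of: –2, –1, –0.5, 0, 0.5, 1, 2, 3. For P=0 we use $\log(m_{i,w-1})$ in place of $m_{i,w-1}^{P}$. We chose the best model (best P) as that with the smallest deviance.21 To evaluate the best model, we computed the overall goodness of fit statistic as the difference in deviance compared with a model that did not include the proportion of item missing from the previous survey: $\chi^2 = D_0 - D_P$, where $D_0$ and $D_P$ are the deviances of the models without and with the item missing term.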
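The search over powers can be sketched in a simplified form. This is not the glmer model used in the study: the random intercept is omitted, so each candidate is an ordinary logistic regression fitted by Newton-Raphson, and the data are simulated with proportions strictly above zero, because the text does not say how a proportion of exactly 0 was handled for P≤0. All function names and simulation settings are assumptions.

```python
import math
import random

POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_transform(m, P):
    """Fractional polynomial term: m**P, with log(m) used when P == 0."""
    return math.log(m) if P == 0 else m ** P


def sigmoid(eta):
    eta = max(-30.0, min(30.0, eta))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-eta))


def deviance(y, mu):
    """Binomial deviance for 0/1 outcomes y and fitted probabilities mu."""
    return -2.0 * sum(math.log(m) if yi else math.log(1.0 - m) for yi, m in zip(y, mu))


def fit_deviance(z, y, iters=25):
    """Deviance of logit(Pr(y=1)) = b0 + b1*z, fitted by Newton-Raphson."""
    mean = sum(z) / len(z)
    sd = math.sqrt(sum((v - mean) ** 2 for v in z) / len(z))
    zs = [(v - mean) / sd for v in z]  # rescaling does not change the deviance
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for v, yi in zip(zs, y):
            mu = sigmoid(b0 + b1 * v)
            w = mu * (1.0 - mu)
            g0 += yi - mu
            g1 += (yi - mu) * v
            h00 += w
            h01 += w * v
            h11 += w * v * v
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return deviance(y, [sigmoid(b0 + b1 * v) for v in zs])


# Simulated data (not HABITAT data): wave missing rises with sqrt(item missing)
rng = random.Random(7)
m = [rng.uniform(0.01, 0.5) for _ in range(2000)]
y = [1 if rng.random() < sigmoid(-2.0 + 3.0 * math.sqrt(v)) else 0 for v in m]

deviances = {P: fit_deviance([fp_transform(v, P) for v in m], y) for P in POWERS}
best_P = min(deviances, key=deviances.get)

# Goodness of fit against the model without the item missing term (df = 1)
ybar = sum(y) / len(y)
chi2 = deviance(y, [ybar] * len(y)) - deviances[best_P]
print(best_P, round(chi2, 1))
```

Adding the participant-level random intercept requires a mixed model fitter such as the glmer function named in the text, and is outside the scope of this sketch.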
The model was fitted using the glmer function in the lme4 library in R.20
Summary statistics on the amounts of missing data by participant and question are shown in table 1. The question with the least amount of missing data was “I live on or near a main road or busy throughway for motor vehicles” with just 0.24% missing, followed by gender with just 0.32% missing. The question with the most amount of missing data was “Do you plan to use the Bike Hire Scheme?”, with 13.97% missing. A small number of participants (0.1%) missed 82% or more of the questions.
The plot of the BIC for choosing the number of latent groups is shown in figure 1. The best model (smallest BIC) had 14 latent groups. However, 7 of the 14 groups contained less than 1% of the sample, with the smallest group having just 28 participants. We considered these groups to be too small to have a meaningful interpretation. We therefore selected three latent groups, as this model had the smallest BIC among those in which every group contained more than 1% of participants. We examine smaller latent groups in a sensitivity analysis below.
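The selection rule described above (smallest BIC among models whose groups all exceed a minimum share) can be written as a short sketch. The BIC values below are made-up placeholders, not the values plotted in figure 1, and the four-group model is hypothetical; only the three-group shares of 83%, 13% and 4% come from the results.

```python
def choose_number_of_groups(fits, min_share=0.01):
    """Pick the K with the smallest BIC among fits with no group below min_share."""
    eligible = {K: fit["bic"] for K, fit in fits.items()
                if min(fit["shares"]) >= min_share}
    return min(eligible, key=eligible.get)


# Placeholder fits: 'bic' values are invented for illustration only
toy_fits = {
    2: {"bic": 1000.0, "shares": [0.90, 0.10]},
    3: {"bic": 950.0, "shares": [0.83, 0.13, 0.04]},
    4: {"bic": 940.0, "shares": [0.80, 0.12, 0.075, 0.005]},  # one group under 1%
}

constrained = choose_number_of_groups(toy_fits)                   # applies the 1% rule
unconstrained = choose_number_of_groups(toy_fits, min_share=0.0)  # BIC only
print(constrained, unconstrained)
```

Setting `min_share` to 0 reproduces selection by BIC alone, which corresponds to the sensitivity analysis with smaller groups.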
Missing pattern by question order
The estimated probabilities of missing for the three latent groups are shown in figure 2. Group 1 had a low average probability of missing of just 0.01 (range <0.0001 to 0.13) and this was relatively constant throughout the survey except for an unusually high probability of missing of over 0.1 for two questions on bicycle use and a question on sleeping hours on weekend days. Group 2 had an average missing probability of 0.05 (range 0.002 to 0.45), with a high missing probability of 0.30 for the set of 10 questions on days walking to places in the last month. Group 3 had the highest average missing probability of 0.19 (range 0.03 to 0.35). In this group the probability of missing increased from the start to around the middle of the survey, with a reduction towards the end of the survey. The numbers (percentages) in each group were 5726 (83%), 877 (13%) and 298 (4%).
Age, gender and education were all strongly associated with being in the ‘low missing’ latent group 1 (table 2). Every 10-year increase in age reduced the probability of being in the low missing group with a mean prevalence ratio of 0.945 (95% CI 0.921 to 0.965), so older people had generally more missing data. Women also had more missing data as their prevalence ratio of being in the low missing group was 0.940 (95% CI 0.910 to 0.969). Higher levels of education were associated with fewer missing items. Compared with the reference group with the least amount of education, every other education group had a mean prevalence ratio above 1, although for two groups the 95% CIs included 1.
Ninety-nine per cent of participants had a maximum probability of latent group membership above 0.99. Hence uncertainty in latent group membership is unlikely to be an issue.
Item missing predicting subsequent wave missing
There was a rising probability of wave missing with greater item missing (figure 3), and this association was strongly statistically significant (overall model fit χ²=66.2 (df=1), p<0.0001). The best non-linear curve using fractional polynomials was a square-root transformation (P=0.5). The non-linear curve is steepest as the proportion tends to 0, which likely reflects the increased diligence of those participants who miss only a few questions.
Smaller latent group sizes
We initially used a minimum group size of 1%, but this may hide small latent groups with interesting patterns of missingness. We therefore plot the mean probabilities of missing using the minimum BIC of 14 latent groups (figure 4). The largest group was 4745 participants who almost fully completed the questionnaire with a mean missing probability of 0.003 (range 0.0001 to 0.11). The next largest group had a similar pattern of missingness to group 1, but with a higher mean missing probability of 0.006 (range 0.0008 to 0.14).
Some of the groups had high probabilities of missing for specific groups of questions, such as: group 3 for the questions on number of days in the last month walking to each of 10 specified places; group 4 for the questions on 12 long-term health conditions; and group 11 for the 16 questions on attitudes about transport and 15 questions on attitudinal barriers and motivations to physical activity. Two other interesting patterns (both with fewer than 50 participants) were group 12, who generally completed fewer questions with increasing question order, and group 13, who, after the first 50 questions, generally completed more questions with increasing question order.
Mixture modelling identified three latent groups that made logical sense as they represented excellent, good and poor item completers. The average probability of missing in the three groups was 0.01, 0.05 and 0.19. The good news for the HABITAT study was that 83% of participants were excellent completers, with just 4% as poor completers.
For the low missing group the question with the most missing was: “On a usual 24 hour day, how many hours do you spend sleeping on a weekend day?” (11.3% missing in low missing group), and this was a clear outlier compared with most of their other answers (figure 2). The question as it appeared in the survey is shown in figure 5, which shows how it was paired with a question on sleeping hours during the week. It is possible that people skipped the weekend question because their answer was the same as the weekday answer. But as this group had such low rates of item missing for the rest of the survey there may have been some misunderstanding. Starting the question with ‘On a usual 24 hour day’ may have primed people to think about all days. Alternatively it may be that the weekend numbers looked like they were the minutes of the weekday question, even with the large ‘AND’, and so were left blank because the participants were thinking of whole hours. Another explanation is that most other questions in the survey used one response per row. For example, just above this question was a question on weight that had the two options of kilograms or stone and pounds on the same row. Changing the layout of this question to have separate rows for weekend and weekday sleeping could reduce missingness.
The question on usual sleeping hours on the weekend was the most frequently missing across all questions (12.7% missing over all participants), so this question could have been identified as a problem using simple percentages. However, the fact that it was so often missing in the group that completed almost all the survey strongly suggests that the number of missing responses could be decreased.
For the middle missing group (group 2), there were three questions or sets of questions with relatively high missing probabilities (figure 2): the questions on number of days in the last month walking to each of 10 specified places (30.4% missing), the questions on 12 long-term health conditions, for example, diabetes (19.3% missing), and the question on usual weekend sleeping hours (17.4% missing).
The walking questions may have been poorly completed because they asked about incidental behaviour such as walking to cafés and restaurants, which is an unstructured behaviour that may be hard to recall. Participants may have found the count of walking during the past month too difficult to attempt. Responses required a written answer (compared with a tick box), which may have been perceived as too burdensome. It is also possible that some people never walked to particular places (such as the library), but failed to write ‘zero’ as requested in the instructions. The lowest number of missing answers was for the supermarket, which arguably most people would need to visit, and the highest missing was for work, which may not be applicable for retired respondents (see online supplementary appendix). This suggests the need for clearer instructions about writing ‘zero’ rather than skipping the question, and possibly also including a ‘not applicable’ or ‘do not know’ response option.
The 12 questions on long-term health conditions had a simple dichotomous response option (yes/no), but may have had more missing data because they asked about potentially sensitive issues such as depression. It is also possible that people found it hard to answer because conditions needed to have lasted for 6 months or more, and this may be difficult to recall especially for conditions that occurred some time ago or intermittently.
For the high missing group, the two least completed questions were the set of 15 questions on barriers to and motivations for physical activity (85.6% missing) and the set of six questions on confidence to do physical activity in the context of competing demands, for example, fatigue (75.5% missing). These were in the middle section of the survey that was most missed by this group (figure 2), and hence it may be the questions’ placement rather than content that was important. However, these questions also require more insight and personal reflection compared with other more factual questions (such as neighbourhood features), and so some respondents may have been reluctant to take the extra time needed.
The average pattern in the high missing group of a rise and fall in the probability of missing through the survey (figure 2) could be because many participants completed some questions at the start before losing interest and skipping to the end. For future survey waves, we could consider sending this group a shorter survey22 or reordering the survey to put the most important questions at the start or end. However, the current survey starts with some general ‘warm-up’ questions (asking people about their neighbourhood area) and putting more challenging questions first may increase item and wave missing numbers.
The strong association between higher item missing and future wave missing (figure 3) is not surprising and could well be because people who partially completed the survey were not engaged with the study and are also likely to miss a wave. This association could be used when trying to impute wave missing data, as we could choose to give greater weight to the responses from partial completers when using techniques such as inverse probability weighting.23 However, the obvious catch is that by definition these participants may also be missing the variables of interest. This problem could be somewhat overcome by putting the most important questions early in the survey.
Related methods and extensions
Latent classes have previously been used to identify patterns of wave missing in longitudinal data24 and to identify participants who did not complete a question because it was not relevant.25 Classification trees have been used to identify patterns in participant characteristics using the count of item missing per participant.26
An initial dimension reduction step may have been useful, such as principal components analysis, to combine questions with a similar pattern of missing.27 However, this may have combined questions with an interesting difference in the pattern of missing between two latent groups.
We originally used a Bayesian multinomial mixture model that gave very similar results to those shown here as it also selected a maximum of three latent groups (using the deviance information criterion) with similar patterns of missing to those in figure 2. However, the Bayesian model did not perform well in simulation studies with poor convergence using multiple MCMC chains which too often found divergent modes in the likelihood. This poor convergence occurred because of the complexities of label switching and because the model has hundreds of parameters.28 One advantage of a Bayesian model was that a longitudinal structure could be added and we were able to track patterns of item missing over time in the same participants. The Rmixmod package used here can only model cross-sectional data.
We could have given more structure to the latent classes, for example, by making the participants’ latent class membership dependent on variables such as gender using multinomial regression. More structure would also be provided by making the probability of missing dependent on question order using a spline for each latent group. Such structure could be included in a Bayesian model that accounts for repeated data from the same participant if the convergence issues could be addressed.
The methods shown here do not help identify data that are missing not at random, where the probability of missingness depends on the missing value. Our methods also do not help adjust for the potential bias caused by missing data, and researchers need to use other methods for imputing missing data or weighting observed data.4 8
Examining item missing data in detail has improved our understanding of how missing data were generated in these surveys. The latent groups were distinctly different and had characteristics of non-responders that make sense to the study team. Thinking about the types of responders will help us improve the design or instructions of future surveys and the wording and layout of questions. Our results also highlighted a potentially poorly presented question that could be changed in future surveys to hopefully produce fewer missing responses.
Computational resources and services used in this work were provided by the High Performance Computer and Research Support Group, Queensland University of Technology, Brisbane, Australia. We thank Dr Xing Lee for useful advice on using multinomial mixture models.
Contributors AGB had the original idea, ran the statistical analysis, wrote the first draft of the paper and is the study guarantor. GT and NWB critiqued the methods used, interpreted the results and commented on drafts of the paper. PM and AN helped collect, manage and clean the data, commented on the results and read drafts of the paper.
Funding The HABITAT project was awarded funding by the Australian National Health and Medical Research Council (NHMRC) (ID290521; ID497236; ID1047453). At the time the manuscript was written, GT was supported by an NHMRC Senior Research Fellowship (ID1003710). AGB is supported by an NHMRC Senior Research Fellowship (ID1117784).
Competing interests None declared.
Ethics approval The Queensland University of Technology Human Research Ethics Committee.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement All the data and R code are freely available on GitHub (https://github.com/agbarnett/item.missing).