Using decision trees to understand structure in missing data
Nicholas J Tierney (1, 2), Fiona A Harden (3, 4), Maurice J Harden (5), Kerrie L Mengersen (1, 2)

  1. Department of Statistical Science, Mathematical Sciences, Science & Engineering Faculty, Queensland University of Technology, Brisbane, Queensland, Australia
  2. ARC Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS), Brisbane, Queensland, Australia
  3. Faculty of Health, Clinical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
  4. Institute of Health and Biomedical Innovation, Brisbane, Queensland, Australia
  5. Hunter Industrial Medicine, Newcastle, New South Wales, Australia

Correspondence to Nicholas J Tierney; nicholas.tierney{at}qut.edu.au

Abstract

Objectives Demonstrate the application of decision trees—classification and regression trees (CARTs), and their cousins, boosted regression trees (BRTs)—to understand structure in missing data.

Setting Data taken from employees at 3 different industrial sites in Australia.

Participants 7915 observations were included.

Materials and methods The approach was evaluated using an occupational health data set comprising results of questionnaires, medical tests and environmental monitoring. Statistical methods included standard statistical tests and the ‘rpart’ and ‘gbm’ packages for CART and BRT analyses, respectively, from the statistical software ‘R’. A simulation study was conducted to explore the capability of decision tree models in describing data with missingness artificially introduced.

Results CART and BRT models were effective in highlighting a missingness structure in the data, related to the type of data (medical or environmental), the site at which it was collected, the number of visits, and the presence of extreme values. The simulation study revealed that CART models were able to identify the variables and values responsible for inducing missingness. Variable importance varied more for unstructured missingness than for structured missingness.
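
The simulation finding above can be illustrated with a small sketch. It is not the authors' analysis, which used the 'rpart' and 'gbm' packages in R. It is a plain-Python stand-in that fits only the first split of a CART-style tree, chosen by Gini impurity, to a missing/observed indicator. The variable names (age, noise_level), the sample size, and the rule that a value goes missing whenever noise_level exceeds 85 are all assumptions made up for illustration.

```python
import random


def gini(n_missing, n_total):
    """Gini impurity of a node whose outcome is missing (1) or observed (0)."""
    if n_total == 0:
        return 0.0
    p = n_missing / n_total
    return 2.0 * p * (1.0 - p)


def best_split(rows, is_missing, predictors):
    """Find the single split that best separates missing from observed rows.

    Returns (variable, threshold, weighted_impurity). This is only the
    root split of a CART-style tree, not a full recursive fit.
    """
    n = len(rows)
    total_missing = sum(is_missing)
    best = (None, None, gini(total_missing, n))  # start from the unsplit node
    for var in predictors:
        # Sort row indices by this predictor so every split is a prefix.
        order = sorted(range(n), key=lambda i: rows[i][var])
        left_missing = 0
        for pos in range(1, n):
            left_missing += is_missing[order[pos - 1]]
            lo = rows[order[pos - 1]][var]
            hi = rows[order[pos]][var]
            if lo == hi:
                continue  # no threshold fits between equal values
            impurity = (
                pos * gini(left_missing, pos)
                + (n - pos) * gini(total_missing - left_missing, n - pos)
            ) / n
            if impurity < best[2]:
                best = (var, (lo + hi) / 2.0, impurity)
    return best


# Simulated data with hypothetical variables (not taken from the study).
random.seed(1)
rows = [
    {"age": random.uniform(20, 65), "noise_level": random.uniform(60, 100)}
    for _ in range(500)
]

# Assumed structured missingness: a test result is missing
# whenever noise_level exceeds 85.
is_missing = [1 if r["noise_level"] > 85 else 0 for r in rows]

var, threshold, impurity = best_split(rows, is_missing, ["age", "noise_level"])
print(var, round(threshold, 2), impurity)
```

Because the assumed rule is deterministic, the split should land on noise_level at a threshold close to 85, recovering both the variable and the value that induced the missingness. With unstructured (random) missingness, no split would reduce impurity much, and the chosen variable would be expected to vary from one simulated data set to the next.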

Discussion Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data and for selecting variables important for predicting missingness. BRT models can show how the values of other variables influence missingness, which may prove useful for researchers.

Conclusions Researchers are encouraged to use CART and BRT models to explore and understand missing data.

  • EPIDEMIOLOGY
  • OCCUPATIONAL & INDUSTRIAL MEDICINE
  • PUBLIC HEALTH
  • STATISTICS & RESEARCH METHODS

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.