Discretization of continuous features in clinical datasets

J Am Med Inform Assoc. 2013 May 1;20(3):544-53. doi: 10.1136/amiajnl-2012-000929. Epub 2012 Oct 11.

Abstract

Background: The increasing availability of clinical data from electronic medical records (EMRs) has created opportunities for secondary uses of health information. When used in machine learning classification, many data features must first be transformed by discretization.

Objective: To evaluate six discretization strategies, both supervised and unsupervised, using EMR data.

Materials and methods: We classified laboratory data (arterial blood gas (ABG) measurements) and physiologic data (cardiac output (CO) measurements) derived from adult patients in the intensive care unit using decision trees and naïve Bayes classifiers. Continuous features were partitioned using two supervised and four unsupervised discretization strategies. The resulting classification accuracy was compared with that obtained with the original, continuous data.

Results: Supervised methods were more accurate and consistent than unsupervised methods, but tended to produce larger decision trees. Among the unsupervised methods, equal frequency and k-means performed well overall, while equal width was significantly less accurate.
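To make the difference between two of the unsupervised methods concrete, the following is a minimal sketch of equal width and equal frequency binning in plain Python. The function names and the sample values are illustrative assumptions, not the study's implementation or data. The skewed sample shows one plausible way the two methods can diverge; the abstract itself does not state why equal width was less accurate.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Return k-1 interior cut points splitting [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Return k-1 interior cut points so each interval holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def assign_bins(values, edges):
    """Map each value to the index of its interval (0 .. len(edges))."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical skewed sample with one extreme value (not data from the study).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

ew = assign_bins(sample, equal_width_edges(sample, 2))
ef = assign_bins(sample, equal_frequency_edges(sample, 2))

print(ew)  # nine values share interval 0; only the extreme value lands in interval 1
print(ef)  # five values fall in each interval
```

In this sketch a single extreme value stretches the equal width intervals so that almost all observations collapse into one bin, whereas equal frequency keeps the bins balanced.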

Discussion: This is, we believe, the first dedicated evaluation of discretization strategies using EMR data. It is unlikely that any one discretization method applies universally to EMR data. Performance was influenced by the choice of class labels and, in the case of unsupervised methods, the number of intervals. In selecting the number of intervals there is generally a trade-off between greater accuracy and greater consistency.

Conclusions: In general, supervised methods yield higher accuracy, but are constrained to a single specific application. Unsupervised methods do not require class labels and can produce discretized data that can be used for multiple purposes.
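The dependence of supervised discretization on class labels can be illustrated with a generic entropy-based cut point search. This is a sketch under stated assumptions: the abstract does not name the two supervised methods evaluated, so the technique, the function names, and the toy values below are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return the single cut point that minimizes the weighted class entropy of the two sides."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_score, best_point = float("inf"), None
    for i in range(1, n):
        # A cut is only possible between two distinct feature values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_score = score
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_point


# Hypothetical feature values with two different labelings (not data from the study).
values = [10, 20, 30, 40, 50, 60]
labels_a = ["low", "low", "low", "normal", "normal", "normal"]
labels_b = ["normal", "normal", "normal", "normal", "normal", "high"]

print(best_cut(values, labels_a))  # cut falls between 30 and 40
print(best_cut(values, labels_b))  # cut falls between 50 and 60
```

The same feature values yield different intervals under the two labelings, which is consistent with the point that a supervised discretization is tied to the classification task it was built for, while an unsupervised one can be reused.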

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Adult
  • Artificial Intelligence*
  • Bayes Theorem
  • Blood Gas Analysis
  • Cardiac Output
  • Data Mining / methods*
  • Decision Trees*
  • Electronic Health Records*
  • Humans