Pattern Recognition

Volume 44, Issue 8, August 2011, Pages 1761-1776

An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes

https://doi.org/10.1016/j.patcog.2011.01.017

Abstract

Classification problems involving multiple classes can be addressed in different ways. One of the most popular techniques consists of dividing the original data set into two-class subsets and learning a different binary model for each new subset. These techniques are known as binarization strategies.

In this work, we are interested in ensemble methods built by binarization techniques; in particular, we focus on the well-known one-vs-one and one-vs-all decomposition strategies, paying special attention to the final step of the ensembles: the combination of the outputs of the binary classifiers. Our aim is to develop an empirical analysis of different aggregations to combine these outputs. To do so, we carry out a two-fold study: first, we use different base classifiers in order to observe the suitability and potential of each combination within each classifier. Then, we compare the performance of these ensemble techniques with the classifiers themselves. Hence, we also analyse the improvement with respect to classifiers that handle multiple classes inherently.

We carry out the experimental study with several well-known algorithms from the literature, such as Support Vector Machines, Decision Trees, Instance Based Learning and Rule Based Systems. We will show, supported by several statistical analyses, the good behaviour of the binarization techniques with respect to the base classifiers, and finally we will point out the most robust techniques within this framework.

Research highlights

  • One-vs-one and one-vs-all are ensembles for multi-class problems.
  • The confidence estimates and their aggregation are key factors of these ensembles.
  • Aggregations based on voting and estimation of probabilities are the most robust.
  • One-vs-one is more robust; one-vs-all has received less attention.
  • Binarization is beneficial even when it is not necessary.

Introduction

Supervised Machine Learning consists of extracting knowledge from a set of n input examples x1, …, xn, each characterized by i features a1, …, ai ∈ A, which may take numerical or nominal values. Each instance has an associated desired output yj, and the aim is to learn a system capable of predicting this output for a new unseen example in a reasonable way (that is, with good generalization ability). This output can be a continuous value yj ∈ ℝ or a class label yj ∈ C (considering an m-class problem C = {c1, …, cm}). In the former case, it is a regression problem, while in the latter it is a classification problem [22]. In classification, the system generated by the learning algorithm is a mapping function defined over the patterns, A^i → C, and it is called a classifier.

Classification tasks are widely used in real-world applications, and many of them involve more than two classes: the so-called multi-class problems. Their application domains are diverse. In bioinformatics, for instance, the classification of microarrays [51] and tissues [71] operates with several class labels. In computer vision, multi-class classification techniques play a key role in object [72], fingerprint [41] and sign language [8] recognition tasks, whereas in medicine, multiple categories are considered in problems such as cancer [6] or electroencephalogram signal [38] classification.

Usually, it is easier to build a classifier that distinguishes between only two classes than one that considers more than two, since the decision boundaries in the former case can be simpler. This is why binarization techniques have emerged to deal with multi-class problems: they divide the original problem into easier-to-solve binary classification problems, each faced by a binary classifier. These classifiers are usually referred to as the base learners or base classifiers of the system [30].

Different decomposition strategies can be found in the literature [52]. The most common strategies are called “one-vs-one” (OVO) [47] and “one-vs-all” (OVA) [17], [7].

  • OVO divides the problem into as many binary problems as there are pairs of classes; one classifier is learned to discriminate between each pair, and the outputs of these base classifiers are then combined in order to predict the output class.

  • OVA learns one classifier per class, in which that class is distinguished from all the other classes, so the base classifier giving a positive answer indicates the output class.
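As a minimal sketch of the two decompositions (not the authors' implementation; the function names and the trivial list-based data representation are ours for illustration), OVO builds one two-class subset per pair of classes, while OVA builds one class-vs-rest subset per class:

```python
from itertools import combinations

def ovo_subsets(X, y):
    """OVO: one two-class training subset per pair of classes.

    Returns a dict keyed by (class_i, class_j), each holding only the
    examples belonging to one of those two classes.
    """
    classes = sorted(set(y))
    return {
        (ci, cj): [(x, lbl) for x, lbl in zip(X, y) if lbl in (ci, cj)]
        for ci, cj in combinations(classes, 2)
    }

def ova_subsets(X, y):
    """OVA: one two-class subset per class (class vs. all the rest).

    Labels are rewritten to 1 (the class) or 0 (any other class).
    """
    classes = sorted(set(y))
    return {
        c: [(x, 1 if lbl == c else 0) for x, lbl in zip(X, y)]
        for c in classes
    }
```

For an m-class problem, OVO therefore trains m(m-1)/2 base classifiers on reduced subsets, while OVA trains m classifiers, each on the whole (relabelled) data set.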

In recent years, different methods to combine the outputs of the base classifiers from these strategies have been developed, for instance, new approaches in the framework of probability estimates [76], binary-tree based strategies [23], dynamic classification schemes [41] or methods using preference relations [44], [24], in addition to more classical well-known combinations such as Pairwise Coupling [39], the Max-Wins rule [29] or Weighted Voting (whose robustness has recently been proved in [46]).
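To make the aggregation step concrete, the following sketch (our own hedged illustration, assuming an OVO score matrix R where R[i][j] ∈ [0, 1] is the confidence of the classifier for pair (i, j) in favour of class i, and typically R[j][i] = 1 - R[i][j]) implements two of the classical combinations named above:

```python
def max_wins(R):
    """Max-Wins rule: each pairwise classifier casts one vote for the
    class it prefers; the class with the most votes is predicted."""
    m = len(R)
    votes = [0] * m
    for i in range(m):
        for j in range(i + 1, m):
            if R[i][j] >= R[j][i]:
                votes[i] += 1
            else:
                votes[j] += 1
    return votes.index(max(votes))

def weighted_voting(R):
    """Weighted Voting: sum the raw confidences for each class, so a
    classifier's vote counts in proportion to how sure it is."""
    m = len(R)
    scores = [sum(R[i][j] for j in range(m) if j != i) for i in range(m)]
    return scores.index(max(scores))
```

The two rules can disagree: a class winning many pairwise duels narrowly may beat one winning fewer duels by large margins under Max-Wins, but lose under Weighted Voting.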

In the specialized literature, there are few works comparing these techniques, either between OVO and OVA or among the different aggregation strategies. In [42], a study of OVO, OVA and Error Correcting Output Codes (ECOC) [21] is carried out, but only within the multi-class Support Vector Machine (SVM) framework, whereas in [52] an enumeration of the existing binarization methodologies is presented, again without comparing them against each other. Fürnkranz [31] compared the suitability of OVO strategies for decision trees and decision lists with other ensemble methods such as boosting and bagging, also showing the improvement obtained by using confidence estimates in the combination of the outputs. In [76], a comparison in the framework of probability estimates is developed, but no other possible aggregations of the classifiers' outputs are considered.

Our aim is to carry out an exhaustive empirical study of OVO and OVA decompositions, paying special attention to the different ways in which the outputs of the base classifiers can be combined. The main novelties of this paper with respect to the aforementioned previous studies [42], [31], [76], [52] consist in the following points:

  • We develop a study of the state-of-the-art on aggregation strategies for OVO and OVA schemes. To do so, we present an overview of the existing combination methods and compare their performance over a set of different real-world problems. Although a previous comparison of probability estimates by pairwise coupling exists [76], to the best of our knowledge a comparison spanning all kinds of aggregation methods is missing.

  • We analyse the behaviour of the OVO and OVA schemes with different base learners, studying the suitability of these techniques in each base classifier.

  • Since binarization techniques have already been proven appropriate for dealing with multi-class problems [30], [31], [42], [63] where the original classifiers do not naturally handle multiple class labels, we analyse whether they also improve the behaviour of classifiers that have built-in multi-class support.

Thus, our intention is to make a thorough analysis of the framework of binarization, answering two main questions:

  • 1.

    Given that we want or have to use binarization, how should we do it? This is the main objective of this paper: to show the most robust aggregation techniques within the framework of binarization, which is still an open question. Therefore, we analyse empirically which binarization technique is most appropriate and which aggregation should be used in each case.

  • 2.

    But, should we do binarization at all? This is an essential question when the multi-class problem can be overcome in different ways (that is, when the base classifier is able to manage multiple classes directly). Previous works have shown the benefits of binarization techniques [30], [31], [42], [63]; we develop a complementary study that stresses their suitability with a complete statistical analysis across different learning paradigms that support multi-class data.

In order to achieve well-founded conclusions, we develop a complete empirical study. The experimental framework includes a set of nineteen real-world problems from the UCI repository [9]. The measures of performance are based on the accuracy rate and Cohen's kappa metric [18]. The significance of the results is supported by the proper statistical tests as suggested in the literature [20], [35], [34]. We chose several well-known classifiers from different Machine Learning paradigms as base learners, namely, SVMs [73], decision trees [62], instance-based learning [1], fuzzy rule based systems [16] and decision lists [19].
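Cohen's kappa, used above as a performance measure alongside the accuracy rate, corrects the observed agreement for the agreement expected by chance. A minimal sketch of the metric, computed from a confusion matrix (our own illustration; the function name is hypothetical):

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    Rows are true classes, columns are predicted classes.
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement (diagonal mass) and p_e the chance agreement implied
    by the row and column marginals.
    """
    m = len(confusion)
    n = sum(sum(row) for row in confusion)
    p_o = sum(confusion[i][i] for i in range(m)) / n
    p_e = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(m)
    ) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Unlike raw accuracy, kappa is 0 for a classifier that merely reproduces the class marginals, which makes it more informative on imbalanced multi-class data sets.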

Finally, we include an in-depth discussion of the results acquired throughout the experimental study. This allows us to answer the issues previously raised and to summarize the lessons learned in this paper. Additionally, we point out some new challenges on the topic in light of the obtained results.

The rest of this paper is organized as follows. Section 2 presents a thorough overview of the existing binarization techniques, with special attention to the OVO and OVA strategies. Section 3 presents the state-of-the-art on the aggregation strategies for the outputs of those strategies that we use in this work. The experimental framework set-up is given in Section 4, that is, the algorithms used as base classifiers, the performance measures, the statistical tests, the data sets, the parameters for the algorithms and the description of a Web page associated with the paper (http://sci2s.ugr.es/ovo-ova), which contains complementary material to the experimental study. We develop the empirical analysis in Section 5. The discussion, including the lessons learned throughout this study and the future works that remain to be addressed, is presented in Section 6. Finally, in Section 7 we make our concluding remarks.

Section snippets

Reducing multi-class problems by binarization techniques

In this section, we first describe the idea behind binarization techniques to deal with multi-class problems and review the existing decomposition strategies. Then, we explain with relative detail the most common strategies that we have used in the experimental study: OVO and OVA.

State-of-the-art on aggregation schemes for binarization techniques

In this section, we describe the state-of-the-art on aggregation strategies for binarization techniques. We divide them into two subsections: the first is devoted to the combinations for the OVO decomposition, where the aggregation is made from a score matrix; the second reviews the combinations for the OVA scheme, where the outputs of the classifiers are given by a score vector.

A more extensive and detailed description of these methods can be found in the web page http://sci2s.ugr.es/ovo-ova.

Experimental framework

In this section, we present the set-up of the experimental framework used to develop the experiments in Section 5. We first describe the algorithms that we have selected to use as base classifiers in the study in Section 4.1. Section 4.2 describes the measures employed to evaluate the performance of the algorithms analysed in this paper. Next, we present the statistical tests applied to compare the results obtained with the different aggregations and decomposition techniques in Section 4.3.

Experimental study

In this section, we present the results of the experimental study. We will answer the following questions:

  • 1.

    Should we do binarization? How should we do it?

  • 2.

    Which is the most appropriate aggregation for each decomposition scheme?

The study is thereby divided into two parts, each one dedicated to one question. Since the main objective of this paper is the analysis of the different combinations, we will answer these questions in reverse order, starting from the second one. Hence we

Discussion: lessons learned and future work

This paper has provided an exhaustive empirical analysis of the main ensemble methods for binary classifiers in multi-class problems, specifically the methods based on the OVO and OVA strategies. We structured the analysis in two sections, studying the different ways in which the outputs of the underlying binary classifiers can be combined and then completing the analysis by investigating the use of binarization techniques when the multi-class problem can also be faced by a single classifier.

From

Concluding remarks

We made a thorough analysis of several ensemble methods applied to multi-class classification problems in a general classification framework. All of them are based on two well-known strategies, OVO and OVA, whose suitability has been tested on several real-world data sets.

From this work we conclude that actually OVO methods and specifically the ensembles using WV, LVPC, PC and PE combinations are the ones with the best average behaviour, but the best aggregation within a problem depends on

Acknowledgements

This work has been supported by the Spanish Ministry of Education and Science under Projects TIN2010-15055 and TIN2008-06681-C06-01.

References (77)

  • J. Luengo et al., Domains of competence of fuzzy rule based classification systems with data complexity measures: a case of study using a fuzzy hybrid genetic based machine learning method, Fuzzy Sets and Systems (2010)
  • S.A. Orlovsky, Decision-making with a fuzzy preference relation, Fuzzy Sets and Systems (1978)
  • O. Pujol et al., An incremental node embedding technique for error correcting output codes, Pattern Recognition (2008)
  • L. Rueda et al., Multi-class pairwise linear dimensionality reduction using heteroscedastic schemes, Pattern Recognition (2010)
  • D.W. Aha et al., Instance-based learning algorithms, Machine Learning (1991)
  • J. Alcalá-Fdez et al., KEEL Data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, Journal of Multiple-Valued Logic and Soft Computing (2011)
  • J. Alcalá-Fdez et al., KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Computing (2009)
  • E.L. Allwein et al., Reducing multiclass to binary: a unifying approach for margin classifiers, Journal of Machine Learning Research (2000)
  • E. Alpaydin, Introduction to Machine Learning (2004)
  • R. Anand et al., Efficient classification for multiclass problems using modular neural networks, IEEE Transactions on Neural Networks (1995)
  • A. Asuncion, D.J. Newman, UCI machine learning repository (2007), URL: ...
  • R.A. Baeza-Yates et al., Modern Information Retrieval (1999)
  • M. Basu et al., Data Complexity in Pattern Recognition (2006)
  • E. Bernado-Mansilla et al., Domain of competence of XCS classifier system in complexity measurement space, IEEE Transactions on Evolutionary Computation (2005)
  • N.V. Chawla, N. Japkowicz, A. Kolcz (Eds.), Special Issue on Learning from Imbalanced Datasets, vol. 6, no. 11, ...
  • Y. Chen et al., Support vector learning for fuzzy rule-based classification systems, IEEE Transactions on Fuzzy Systems (2003)
  • P. Clark et al., Rule induction with CN2: some recent improvements
  • J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement (1960)
  • W.W. Cohen, Fast effective rule induction, in: ICML'95: Proceedings of the 12th International Conference on Machine ...
  • J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research (2006)
  • T.G. Dietterich et al., Solving multiclass learning problems via error-correcting output codes, Journal of Artificial Intelligence Research (1995)
  • R.O. Duda et al., Pattern Classification (2001)
  • B. Fei et al., Binary tree of SVM: a new fast multiclass training and classification algorithm, IEEE Transactions on Neural Networks (2006)
  • A. Fernández, M. Calderón, E. Barrenechea, H. Bustince, F. Herrera, Enhancing fuzzy rule based systems in ...
  • A. Fernández et al., Genetics-based machine learning for rule induction: state of the art, taxonomy and comparative study, IEEE Transactions on Evolutionary Computation (2010)
  • E. Frank et al., Ensembles of nested dichotomies for multi-class problems
  • J.H. Friedman, Another approach to polychotomous classification, Technical Report, Department of Statistics, Stanford ...
  • J. Fürnkranz, Round robin classification, Journal of Machine Learning Research (2002)

    Mikel Galar received his M.Sc. degree in Computer Sciences from the Public University of Navarra, Pamplona, Spain, in 2009. Currently he holds a research position at the Department of Automatics and Computation. His research interests are data-mining, classification, multi-classification, ensemble learning, evolutionary algorithms and fuzzy systems.

    Alberto Fernández received his M.Sc. degree in Computer Sciences in 2005 and the Ph.D. degree in Computer Science in 2010, both from the University of Granada, Spain. He is currently a Supply Assistant Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. His research interests include data mining, classification in imbalanced domains, fuzzy rule learning, evolutionary algorithms and multi-classification problems.

    Edurne Barrenechea is an Assistant Lecturer at the Department of Automatics and Computation, Public University of Navarra. She received an M.Sc. in Computer Science at the Pais Vasco University in 1990. She worked in a private company (Bombas Itur) as analyst programmer from 1990 to 2001, and then she joined the Public University of Navarra as Associate Lecturer. She obtained the Ph.D. in Computer Science in 2005 on the topic interval-valued fuzzy sets applied to image processing. Her research interests are fuzzy techniques for image processing, fuzzy sets theory, interval type-2 fuzzy sets theory and applications, decision making, and industrial applications of soft computing techniques. She is member of the board of the European Society for Fuzzy Logic and Technology (EUSFLAT).

    Humberto Bustince is a Full Professor at the Department of Automatics and Computation, Public University of Navarra, Spain. He holds a Ph.D. degree in Mathematics from Public University of Navarra from 1994. His research interests are fuzzy logic theory, extensions of Fuzzy sets (Type-2 fuzzy sets, Interval-valued fuzzy sets, Atanassov's intuitionistic fuzzy sets), Fuzzy measures, Aggregation functions and fuzzy techniques for Image processing. He is author of over 50 published original articles and involved in teaching Artificial Intelligence for students of Computer Sciences.

    Francisco Herrera received the M.Sc. degree in Mathematics in 1988 and the Ph.D. degree in Mathematics in 1991, both from the University of Granada, Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has published more than 150 papers in international journals. He is coauthor of the book “Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases” (World Scientific, 2001). As edited activities, he has co-edited five international books and co-edited 20 special issues in international journals on different Soft Computing topics. He acts as associated editor of the journals IEEE Transactions on Fuzzy Systems, Information Sciences, Mathware and Soft Computing, Advances in Fuzzy Systems, Advances in Computational Sciences and Technology, and International Journal of Applied Metaheuristics Computing. He currently serves as area editor of the Journal Soft Computing (area of genetic algorithms and genetic fuzzy systems), and he serves as member of several journal editorial boards, among others Fuzzy Sets and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems and Memetic Computation. His current research interests include computing with words and decision making, data mining, data preparation, instance selection, fuzzy rule based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.
