An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes
Research highlights
- One-vs-one and one-vs-all are ensembles for multi-class problems.
- The confidence estimates and their aggregation are key factors in these ensembles.
- Aggregations based on voting and on the estimation of probabilities are the most robust.
- One-vs-one is more robust; one-vs-all has received less attention.
- Binarization is beneficial even when it is not necessary.
Introduction
Supervised Machine Learning consists of extracting knowledge from a set of n input examples x1, …, xn, each characterized by i features (numerical or nominal), where each instance has an associated desired output yj. The aim is to learn a system capable of predicting this output for a new, unseen example in a reasonable way (i.e., with good generalization ability). This output can be a continuous value or a class label (considering an m-class problem). In the former case, it is a regression problem, while in the latter it is a classification problem [22]. In classification, the system generated by the learning algorithm is a mapping function defined over the patterns and is called a classifier.
Classification tasks are common in real-world applications, and many of them involve more than two classes, the so-called multi-class problems. Their application domains are diverse: in bioinformatics, for instance, the classification of microarrays [51] and tissues [71] operates with several class labels; in computer vision, multi-class techniques play a key role in object [72], fingerprint [41] and sign language [8] recognition tasks; and in medicine, multiple categories are considered in problems such as cancer [6] or electroencephalogram signal [38] classification.
Usually, it is easier to build a classifier that distinguishes between only two classes than one that considers more than two, since the decision boundaries in the former case can be simpler. This is why binarization techniques have emerged to deal with multi-class problems: they divide the original problem into easier-to-solve binary classification problems, each faced by a binary classifier. These classifiers are usually referred to as the base learners or base classifiers of the system [30].
Different decomposition strategies can be found in the literature [52]. The most common strategies are called “one-vs-one” (OVO) [47] and “one-vs-all” (OVA) [17], [7].
- OVO consists of dividing the problem into as many binary problems as there are possible pairs of classes; one classifier is learned to discriminate between each pair, and the outputs of these base classifiers are then combined to predict the output class.
- The OVA approach learns a classifier for each class, which distinguishes that class from all the others; the base classifier giving a positive answer indicates the output class.
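The two decompositions above can be sketched in a few lines of code. The following is a minimal illustration (not the paper's implementation): a toy dataset with three classes is split into the binary problems that OVO and OVA would train base classifiers on; all names are illustrative.

```python
from itertools import combinations

# Toy dataset: one feature value per example and one of three class labels.
X = [0.1, 0.2, 0.9, 1.0, 1.9, 2.1]
y = ["a", "a", "b", "b", "c", "c"]

def ovo_problems(X, y):
    """OVO: one binary problem per unordered pair of classes,
    keeping only the examples that belong to that pair."""
    classes = sorted(set(y))
    return {
        (ci, cj): [(x, lab) for x, lab in zip(X, y) if lab in (ci, cj)]
        for ci, cj in combinations(classes, 2)
    }

def ova_problems(X, y):
    """OVA: one binary problem per class, relabelling that class
    as positive (1) and all other classes as negative (0)."""
    classes = sorted(set(y))
    return {c: [(x, int(lab == c)) for x, lab in zip(X, y)] for c in classes}

print(len(ovo_problems(X, y)))  # 3 pairwise problems for m = 3 classes
print(len(ova_problems(X, y)))  # 3 one-vs-rest problems
```

Note that for m classes OVO produces m(m-1)/2 smaller problems (each using only the examples of two classes), whereas OVA produces m problems that each use the full training set.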
In recent years, different methods to combine the outputs of the base classifiers from these strategies have been developed, for instance, new approaches in the framework of probability estimates [76], binary-tree based strategies [23], dynamic classification schemes [41] and methods using preference relations [44], [24], in addition to more classical, well-known combinations such as Pairwise Coupling [39], the Max-Wins rule [29] and Weighted Voting (whose robustness has recently been shown in [46]).
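As a concrete illustration of the simplest of these combinations, the Max-Wins rule can be sketched as follows. This is a hedged toy example, not the code used in the study; the pairwise predictions are invented for illustration.

```python
# Hypothetical OVO outputs: each pairwise base classifier predicts a winner
# between its two classes (the class names here are purely illustrative).
pairwise_preds = {
    ("a", "b"): "a",
    ("a", "c"): "c",
    ("b", "c"): "c",
}

def max_wins(pairwise_preds, classes):
    """Max-Wins rule: each pairwise classifier casts one vote for the
    class it predicts; the class collecting the most votes is the output."""
    votes = {c: 0 for c in classes}
    for winner in pairwise_preds.values():
        votes[winner] += 1
    # Ties are broken here by class order; real implementations vary.
    return max(sorted(votes), key=votes.get)

print(max_wins(pairwise_preds, ["a", "b", "c"]))  # "c", with 2 of 3 votes
```

Weighted Voting follows the same scheme but replaces the hard votes with the base classifiers' confidence values.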
In the specialized literature, few works compare these techniques, either between OVO and OVA or among the different aggregation strategies. In [42] a study of OVO, OVA and Error Correcting Output Codes (ECOC) [21] is carried out, but only within the multi-class Support Vector Machine (SVM) framework, whereas in [52] the different existing binarization methodologies are enumerated, but again without comparing them with one another. Fürnkranz [31] compared the suitability of OVO strategies for decision trees and decision lists against other ensemble methods such as boosting and bagging, also showing the improvement obtained by using confidence estimates in the combination of the outputs. In [76], a comparison within the framework of probability estimates is developed, but no other possible aggregations of the classifier outputs are considered.
Our aim is to carry out an exhaustive empirical study of OVO and OVA decompositions, paying special attention to the different ways in which the outputs of the base classifiers can be combined. The main novelties of this paper with respect to the previous studies referred to above [42], [31], [76], [52] are the following:
- We develop a study of the state-of-the-art aggregation strategies for OVO and OVA schemes. To do so, we present an overview of the existing combination methods and compare their performance over a set of different real-world problems. Whereas a previous comparison exists among probability estimates obtained by pairwise coupling [76], to the best of our knowledge, a comparison spanning all kinds of aggregation methods is missing.
- We analyse the behaviour of the OVO and OVA schemes with different base learners, studying the suitability of these techniques for each base classifier.
- Since binarization techniques have already been proven appropriate for dealing with multi-class problems [30], [31], [42], [63] when the original classifiers do not naturally handle multiple class labels, we analyse whether they also improve the behaviour of classifiers that have built-in multi-class support.
Thus, our intention is to make a thorough analysis of the framework of binarization, answering two main questions:
1. Given that we want or have to use binarization, how should we do it? This is the main objective of this paper: to identify the most robust aggregation techniques within the framework of binarization, which is still an open question. Therefore, we analyse empirically which binarization technique is the most appropriate and which aggregation should be used in each case.
2. But should we binarize at all? This is an essential question when the multi-class problem can be overcome in other ways (i.e., when the base classifier is able to manage multiple classes). Previous works have shown the benefits of binarization techniques [30], [31], [42], [63]; we develop a complementary study to stress their suitability with a complete statistical analysis across different learning paradigms that support multi-class data.
In order to reach well-founded conclusions, we develop a complete empirical study. The experimental framework includes a set of nineteen real-world problems from the UCI repository [9]. The performance measures are based on the accuracy rate and Cohen's kappa metric [18]. The significance of the results is supported by the appropriate statistical tests, as suggested in the literature [20], [35], [34]. As base learners we chose several well-known classifiers from different Machine Learning paradigms, namely, SVMs [73], decision trees [62], instance-based learning [1], fuzzy rule based systems [16] and decision lists [19].
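Cohen's kappa, one of the two performance measures above, corrects the observed accuracy by the agreement expected by chance. A minimal sketch of its computation from a confusion matrix (the matrix values below are invented for illustration):

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    accuracy and p_e the agreement expected by chance,
    computed from the row and column marginals."""
    n = sum(sum(row) for row in confusion)
    p_o = sum(confusion[i][i] for i in range(len(confusion))) / n
    p_e = sum(
        (sum(confusion[i]) / n) * (sum(row[i] for row in confusion) / n)
        for i in range(len(confusion))
    )
    return (p_o - p_e) / (1 - p_e)

# 90% accuracy on a balanced 2-class toy problem gives kappa = 0.8,
# since a random classifier would already reach 50% agreement.
print(cohen_kappa([[45, 5], [5, 45]]))  # 0.8
```

Unlike raw accuracy, kappa penalizes classifiers that exploit skewed class distributions, which is why both measures are reported in the study.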
Finally, we include an in-depth discussion of the results acquired throughout the experimental study. This allows us to answer the issues previously raised and to summarize the lessons learned in this paper. Additionally, we point out some new challenges on the topic arising from the obtained results.
The rest of this paper is organized as follows. Section 2 presents a thorough overview of the existing binarization techniques, with special attention to the OVO and OVA strategies. Section 3 presents the state-of-the-art aggregation strategies for the outputs of those strategies that we use in this work. The experimental framework set-up is given in Section 4, that is, the algorithms used as base classifiers, the performance measures, the statistical tests, the data sets, the parameters for the algorithms and the description of a Web page associated with the paper (http://sci2s.ugr.es/ovo-ova), which contains complementary material to the experimental study. We develop the empirical analysis in Section 5. The discussion, including the lessons learned throughout this study and the future works that remain to be addressed, is presented in Section 6. Finally, in Section 7 we make our concluding remarks.
Section snippets
Reducing multi-class problems by binarization techniques
In this section, we first describe the idea behind binarization techniques for dealing with multi-class problems and review the existing decomposition strategies. Then, we explain in more detail the two most common strategies, which we use in the experimental study: OVO and OVA.
State-of-the-art on aggregation schemes for binarization techniques
In this section we describe the state-of-the-art aggregation strategies for binarization techniques. We divide them into two subsections: the first is devoted to the combinations for the OVO decomposition, where the aggregation is made from a score matrix; the second reviews the combinations for the OVA scheme, where the outputs of the classifiers are given by a score vector.
A more extensive and detailed description of these methods can be found in the web page http://sci2s.ugr.es/ovo-ova.
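The two aggregation inputs mentioned above, a score matrix for OVO and a score vector for OVA, can be illustrated with a small sketch. This is a toy example with invented confidence values, showing weighted voting over an OVO score matrix and the usual arg-max decision over an OVA score vector:

```python
# Illustrative score matrix R for an OVO ensemble with 3 classes:
# R[i][j] is the confidence of the (i, j) base classifier in favour
# of class i (and R[j][i] in favour of class j); the diagonal is unused.
R = [
    [0.0, 0.8, 0.3],
    [0.2, 0.0, 0.4],
    [0.7, 0.6, 0.0],
]

def weighted_voting(R):
    """OVO weighted voting: each class is scored by summing the
    confidences cast in its favour; the highest-scoring class wins."""
    scores = [sum(row) for row in R]
    return scores.index(max(scores))

def ova_decision(score_vector):
    """OVA decision: the class whose one-vs-rest classifier answers
    with the largest confidence is the output class."""
    return score_vector.index(max(score_vector))

print(weighted_voting(R))             # class 2 (total score 1.3)
print(ova_decision([0.1, 0.9, 0.4]))  # class 1
```

The other OVO aggregations reviewed in the paper (e.g., Pairwise Coupling or the probability-estimate approaches) operate on this same score matrix but combine its entries in more elaborate ways.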
Experimental framework
In this section, we present the set-up of the experimental framework used to develop the experiments in Section 5. We first describe the algorithms that we have selected to use as base classifiers in the study in Section 4.1. Section 4.2 describes the measures employed to evaluate the performance of the algorithms analysed in this paper. Next, we present the statistical tests applied to compare the results obtained with the different aggregations and decomposition techniques in Section 4.3.
Experimental study
In this section, we present the results of the experimental study. We answer the following questions:
1. Should we do binarization? How should we do it?
2. Which is the most appropriate aggregation for each decomposition scheme?
The study is therefore divided into two parts, each dedicated to one question. Since the main objective of this paper is the analysis of the different combinations, we will address these questions in reverse order, starting from the second one. Hence we
Discussion: lessons learned and future work
This paper has provided an exhaustive empirical analysis of the main ensemble methods for binary classifiers in multi-class problems, specifically the methods based on the OVO and OVA strategies. We structured the analysis in two sections, first studying the different ways in which the outputs of the underlying binary classifiers can be combined and then completing the analysis by investigating the use of binarization techniques when the multi-class problem can also be faced by a single classifier.
From
Concluding remarks
We have made a thorough analysis of several ensemble methods applied to multi-class classification problems in a general classification framework. All of them are based on two well-known strategies, OVO and OVA, whose suitability has been tested on several real-world data sets.
From this work we conclude that OVO methods, and specifically the ensembles using the WV, LVPC, PC and PE combinations, are the ones with the best average behaviour, but the best aggregation within a problem depends on
Acknowledgements
This work has been supported by the Spanish Ministry of Education and Science under Projects TIN2010-15055 and TIN2008-06681-C06-01.
References (77)
- Multiclass cancer classification by support vector machines with class-wise optimized genes and probability estimates, Journal of Theoretical Biology (2009)
- A multi-class classification strategy for fisher scores: application to signer independent sign language recognition, Pattern Recognition (2010)
- Strategies for learning in class imbalance problems, Pattern Recognition (2003)
- A lot of randomness is hiding in accuracy, Engineering Applications of Artificial Intelligence (2007)
- Solving multi-class problems with linguistic fuzzy rule based classification systems based on pairwise learning and preference relations, Fuzzy Sets and Systems (2010)
- An experimental comparison of performance measures for classification, Pattern Recognition Letters (2009)
- Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Information Sciences (2010)
- Fingerprint classification using one-vs-all support vector machines dynamically ordered with Naïve Bayes classifiers, Pattern Recognition (2008)
- Learning valued preference structures for solving classification problems, Fuzzy Sets and Systems (2008)
- Combining predictions in pairwise classification: an optimal adaptive voting strategy and its relation to weighted voting, Pattern Recognition (2010)
- Domains of competence of fuzzy rule based classification systems with data complexity measures: a case of study using a fuzzy hybrid genetic based machine learning method, Fuzzy Sets and Systems
- Decision-making with a fuzzy preference relation, Fuzzy Sets and Systems
- An incremental node embedding technique for error correcting output codes, Pattern Recognition
- Multi-class pairwise linear dimensionality reduction using heteroscedastic schemes, Pattern Recognition
- Instance-based learning algorithms, Machine Learning
- KEEL Data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, Journal of Multiple-Valued Logic and Soft Computing
- KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Computing
- Reducing multiclass to binary: a unifying approach for margin classifiers, Journal of Machine Learning Research
- Introduction to Machine Learning
- Efficient classification for multiclass problems using modular neural networks, IEEE Transactions on Neural Networks
- Modern Information Retrieval
- Data Complexity in Pattern Recognition
- Domain of competence of XCS classifier system in complexity measurement space, IEEE Transactions on Evolutionary Computation
- Support vector learning for fuzzy rule-based classification systems, IEEE Transactions on Fuzzy Systems
- Rule induction with CN2: some recent improvements
- A coefficient of agreement for nominal scales, Educational and Psychological Measurement
- Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research
- Solving multiclass learning problems via error-correcting output codes, Journal of Artificial Intelligence Research
- Pattern Classification
- Binary tree of SVM: a new fast multiclass training and classification algorithm, IEEE Transactions on Neural Networks
- Genetics-based machine learning for rule induction: state of the art, taxonomy and comparative study, IEEE Transactions on Evolutionary Computation
- Ensembles of nested dichotomies for multi-class problems
- Round robin classification, Journal of Machine Learning Research
Mikel Galar received his M.Sc. degree in Computer Sciences from the Public University of Navarra, Pamplona, Spain, in 2009. Currently he holds a research position at the Department of Automatics and Computation. His research interests are data-mining, classification, multi-classification, ensemble learning, evolutionary algorithms and fuzzy systems.
Alberto Fernández received his M.Sc. degree in Computer Sciences in 2005 and the Ph.D. degree in Computer Science in 2010, both from the University of Granada, Spain. He is currently a Supply Assistant Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. His research interests include data mining, classification in imbalanced domains, fuzzy rule learning, evolutionary algorithms and multi-classification problems.
Edurne Barrenechea is an Assistant Lecturer at the Department of Automatics and Computation, Public University of Navarra. She received an M.Sc. in Computer Science at the Pais Vasco University in 1990. She worked in a private company (Bombas Itur) as analyst programmer from 1990 to 2001, and then she joined the Public University of Navarra as Associate Lecturer. She obtained the Ph.D. in Computer Science in 2005 on the topic interval-valued fuzzy sets applied to image processing. Her research interests are fuzzy techniques for image processing, fuzzy sets theory, interval type-2 fuzzy sets theory and applications, decision making, and industrial applications of soft computing techniques. She is member of the board of the European Society for Fuzzy Logic and Technology (EUSFLAT).
Humberto Bustince is a Full Professor at the Department of Automatics and Computation, Public University of Navarra, Spain. He holds a Ph.D. degree in Mathematics from Public University of Navarra from 1994. His research interests are fuzzy logic theory, extensions of Fuzzy sets (Type-2 fuzzy sets, Interval-valued fuzzy sets, Atanassov's intuitionistic fuzzy sets), Fuzzy measures, Aggregation functions and fuzzy techniques for Image processing. He is author of over 50 published original articles and involved in teaching Artificial Intelligence for students of Computer Sciences.
Francisco Herrera received the M.Sc. degree in Mathematics in 1988 and the Ph.D. degree in Mathematics in 1991, both from the University of Granada, Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has published more than 150 papers in international journals. He is coauthor of the book “Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases” (World Scientific, 2001). As edited activities, he has co-edited five international books and co-edited 20 special issues in international journals on different Soft Computing topics. He acts as associated editor of the journals IEEE Transactions on Fuzzy Systems, Information Sciences, Mathware and Soft Computing, Advances in Fuzzy Systems, Advances in Computational Sciences and Technology, and International Journal of Applied Metaheuristics Computing. He currently serves as area editor of the Journal Soft Computing (area of genetic algorithms and genetic fuzzy systems), and he serves as member of several journal editorial boards, among others Fuzzy Sets and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems and Memetic Computation. His current research interests include computing with words and decision making, data mining, data preparation, instance selection, fuzzy rule based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.