DEVELOPMENT OF ADVANCED DATA SAMPLING SCHEMES TO ALLEVIATE CLASS IMBALANCE PROBLEM IN DATA MINING
CLASSIFICATION ALGORITHMS
BY
SAKINAT OLUWABUKONLA FOLORUNSO MATRICULATION NUMBER: 112644
B. Tech. Computer Science (Akure), M. Sc. Computer Science (Ibadan)
A Thesis in the Department of Computer Science, Submitted to the Faculty of Science,
In partial fulfilment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
of the
UNIVERSITY OF IBADAN, IBADAN NIGERIA
SEPTEMBER, 2015
Certification
This is to certify that this research was carried out by Sakinat Oluwabukonla FOLORUNSO with matriculation number 112644 in the Department of Computer Science, University of Ibadan, Ibadan, Nigeria.
...
Supervisor
Dr. A. B. Adeyemo
B.Sc. (Ife), PGD, M.Tech., Ph.D (Akure)
Department of Computer Science, University of Ibadan, Ibadan, Nigeria.
...
Head of Department
Dr. A. B. Adeyemo
Department of Computer Science, University of Ibadan, Ibadan, Nigeria.
Dedication
This work is dedicated to the glory of Almighty Allah.
Also to my Mother, Alhaja R. A. Tijani and to the loving memory of my late father, Alhaji (Chief) R. O. Tijani.
Acknowledgements
My immense gratitude goes to Almighty Allah, the most Beneficent, the Merciful, the Firm. It is by His mercy that I have gone this far.
I wish to express my profound gratitude to my supervisor, Dr. A. B. Adeyemo, for his guidance, painstaking supervision and meticulous reading of this thesis. Your concern and tutelage are an addition to my life. I thank you for imparting knowledge to me.
May God bless you (Amen).
My appreciation also goes to Dr S. O. Akinola for his words of encouragement. To my mentor, Prof. A. O. Osofisan, and my sisters and mentors, Dr Y. O. Folajimi, Dr B. O. Oladejo and Dr I. T. Ayorinde, I am very grateful. Dr O. B. Akinkunmi, Dr O. W. F. Onifade, Dr A. B. C. Roberts, Dr Osunade and the entire academic and non-academic staff of the department, thank you for your support.
I also wish to express my profound gratitude to Professor Gustavo E.A.P.A. Batista of Departamento de Ciências de Computação, Universidade de São Paulo, Campus de São Carlos, Brazil for his priceless and selfless contribution to the success and completion of this thesis. Also, Diego Furtado Silva is highly appreciated for his selfless and immense support.
The effort of all the academic and non-academic staff of the Department of Mathematical Sciences, Olabisi Onabanjo University, Ago Iwoye, Ogun State cannot be measured with ordinary thanks. I appreciate the undying support, assistance and mentoring of Dr. D. A. Agunbiade, Dr. T. O. Olatayo, Dr. O. S. Odetunde, Dr B. O. Oguntunde, Mrs K-K. A. Abdullah and Mr Taiwo Abbas. I also wish to express my appreciation to Dr. O. M. Ajibade and Mrs S. O. Ogunsanya of the Geology Department, Olabisi Onabanjo University, Ago Iwoye, Ogun State for their endless support.
I also appreciate the support and contribution of Prof. O. Longe of Adeleke University, Ede, and Dr Gabriel Iwasokun of the Department of Computer Science, The Federal University of Technology, Akure (FUTA), Ondo State. My gratitude also goes to Mr Hameed Olaide of the IT department of IITA, Nigeria.
The effort and assistance from my mother, Alhaja R. A. Tijani, and my uncle, Dr. M. A. Tijani, are priceless and cannot be measured in cash or kind. Special thanks and unquantifiable indebtedness to all my siblings for their moral and financial support: Mr Oludare Samsudeen Tijani, Mr Ayodele Kabir Tijani, Alhaja Fatimah Oluwaseyi Salami and Mr Oyeniyi Quadri Tijani. I want to thank you all for being there for me at all times.
My appreciation also goes to all my in-laws: the Folorunsos, Nurenis and Aderintos, especially Mrs Biodun Rafiat Olagunju and Mrs Lola Salewa Akindele, who always play the dual role of sister and mother-in-law.
I must not forget to thank the entire membership of the Badrul-din As-salat women circle of Nigeria, especially Murshid Taofeek Salaudeen and Alhaja R. A. Adabanija, for their support, and all the members of the Association for Computing Machinery (ACM), Ibadan-Nigeria Chapter. My undying gratitude goes to my close friends, especially Dr. Oluwafemi Oriola, Alhaji Fagbenro, Alhaji Dr. and Alhaja Muyibi, Alhaji Sulaiman Salami, Mrs Aderemi and Khalifa Musa, and to those who, for want of space, cannot be mentioned here.
Thank you all.
My appreciation goes to all my students who have graduated and those still in school especially Adediji Michael (Angel) and Awoyemi Emmanuel of Olabisi Onabanjo University, Ago Iwoye, Ogun State.
My profound gratitude also goes to my children: Temitope Sariy Abeke, Mubarak Temidayo Adisa, Temilade Hamdalat Ajoke and Abdul Mateen Temidara Alao, who were always with me day and night praying persistently for the success of this work. Thank You for your understanding, perseverance and love. I am indeed grateful to you all.
Finally, many thanks to my husband, the man of my youth, Alhaji Dr. Rasheed Olayode Folorunso, for being there for me always, for the sleepless nights, for your prayers and for being my eyes to read this work. May Almighty Allah strengthen you and make all that you touch prosper in your sight.
ABSTRACT
Classification is the process of finding a set of models that distinguish data classes in order to predict unknown class labels in data mining. The class imbalance problem occurs when one class heavily outnumbers the others, making standard classifiers majority-biased while the minority class is ignored. Existing classifiers tend to maximise overall prediction accuracy and minimise error at the expense of the minority class. However, research has shown that the misclassification cost of the minority class is higher and should not be ignored, since it is the class of interest. This work was therefore designed to develop advanced data sampling schemes that improve classification performance on imbalanced datasets, with a view to increasing the recall of the minority class.
Synthetic Minority Oversampling Technique (SMOTE) was extended to SMOTE+300% and combined with existing under-sampling schemes: Random Under-Sampling (RUS), Neighbourhood Cleaning Rule (NCL), Wilson's Edited Nearest Neighbour (ENN) and Condensed Nearest Neighbour (CNN). Five advanced data sampling scheme algorithms, SMOTE300ENN, SMOTE300RUS, SMOTE300NCL, SMOTENCL and SMOTERUS, were coded in Java and implemented as an Application Programming Interface in WEKA, a data mining tool. The existing and developed schemes were applied to the Diabetes Mellitus (DM, 886 records), Senior Secondary School Certificate Result (SSSCR, 1,163 records) and Contraceptive Methods (CM, 786 records) datasets, collected in Ilesha and Ibadan, Nigeria. Their performances were determined with different classification algorithms using the Receiver Operating Characteristics (ROC), recall of the minority class and performance gain metrics. Friedman's Test at p = 0.05 was used to analyse these schemes against the classification algorithms.
The ROC metric revealed that the mean rank values for the DM, SSSCR and CM datasets treated with the advanced schemes ranged from 6.9-13.8, 3.8-12.8 and 6.6-13.5, respectively, compared with the existing schemes, which ranged from 3.4-7.8, 2.6-12.6 and 2.8-7.9, respectively. These results signify improved classification performance. The Recall metric analysis for the DM, SSSCR and CM datasets under the advanced schemes ranged from 9.4-13.0, 6.3-14.0 and 7.3-13.6, respectively, compared with the existing schemes at 2.0-7.5, 2.5-8.9 and 2.1-7.4, respectively. These results show increased detection of the minority class. Performance gains by the advanced
schemes over the original datasets (DM, SSSCR and CM) were: SMOTE300ENN (27.1%), SMOTE300RUS (11.6%), SMOTE300NCL (15.5%), SMOTENCL (8.3%) and SMOTERUS (7.3%). A significant difference was observed amongst all the schemes; the higher the mean rank value and performance gain, the better the scheme. The SMOTE300ENN scheme gave the highest ROC and recall values on the three datasets, which were 13.8, 12.8, 12.3 and 13.0, 14.0, 13.6, respectively.
The developed Synthetic Minority Oversampling Technique 300 Wilson's Edited Nearest Neighbour (SMOTE300ENN) scheme significantly improved classification performance and increased the recall of the minority class over the existing schemes on the same datasets. It is therefore recommended for the classification of imbalanced datasets.
Keywords: Imbalanced dataset, Receiver operating characteristics, Data reduction techniques, Cost sensitive learning
Word count: 445
Table of Contents

Contents                                                                Page
Title Page … i
Certification … ii
Dedication … iv
Acknowledgements … v
Abstract … vi
Table of Contents … vii
List of Tables … xiii
List of Figures … xvi

Chapter One: Introduction
1.1 Background to the study … 1
1.2 Motivation of study … 3
1.3 Justification for the study … 3
1.4 Research aim and objectives … 3
1.5 Research methodology … 4
1.6 Scope of the study … 5
1.7 Organisation of thesis … 5
1.8 Glossary of terms … 5

Chapter Two: Literature Review
2.0 Introduction … 8
2.1 Class imbalance problem … 9
2.2 Problems associated with class imbalance … 10
2.2.1 Difficulties encountered in imbalanced classification … 10
2.2.2 Multiple class problems … 14
2.3 Methods of multiple class problem decomposition … 14
2.3.1 Direct multiclass classification … 14
2.3.2 Multiclass extension: decomposition … 15
2.3.3 Methods of decomposing multiple class problems … 15
2.3.3.1 One-versus-one (OVO) method … 15
2.3.3.2 One-versus-all (OVA) method … 16
2.3.3.3 P-against-Q (PAQ) method … 16
2.3.3.4 Error-correcting code design method … 16
2.4 Evaluation metrics … 17
2.4.1 Confusion matrix … 17
2.4.2 F-measure … 18
2.4.3 Kappa statistics or Cohen's kappa coefficient … 20
2.4.4 G-means criterion … 20
2.4.5 Matthews correlation coefficient … 21
2.4.6 ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve of ROC) … 21
2.4.7 Precision-recall curves … 23
2.4.8 H-measure … 24
2.4.9 Cross validation … 24
2.4.10 Root Mean Squared Error (RMSE) … 24
2.5 Challenges posed by the class imbalance problem … 24
2.6 Solutions to the class imbalance problem … 25
2.6.1 Sampling schemes … 28
2.6.1.1 Under-sampling schemes … 28
2.6.1.2 Over-sampling schemes … 35
2.6.1.3 Advanced sampling … 38
2.6.2 Solutions at the algorithm level … 38
2.6.2.1 Adjusting the algorithm itself … 38
2.6.2.2 One-class learning … 38
2.6.2.3 Cost-sensitive learning … 39
2.6.2.4 Ensemble learning … 40
2.7.1 Why is an ensemble better than a single classifier? … 42
2.7.2 Ensemble methods … 43
2.7.3 Methods of combination of ensembles … 43
2.7.4 Diversity in ensembles … 44
2.7.4.1 Measures of diversity … 44
2.8 Learning algorithms … 45
2.8.1 Random Forest … 45
2.8.2 Random Subspace Method (Decision Forest) … 45
2.8.3 Random Committee … 46
2.8.4 MultiClass Classifier … 46
2.8.5 Boosting … 46
2.8.6 Stacking … 47
2.8.7 Repeated Incremental Pruning to Produce Error Reduction (RIPPER) … 47
2.8.8 Bootstrap AGGregatING (BAGGING) … 47
2.8.9 Support Vector Machine (SVM) … 47
2.8.10 Artificial Neural Network (ANN) … 48
2.8.11 K-Nearest Neighbour … 49
2.8.12 Reduced Error Pruning (REPTree) … 49
2.9 Critical appraisal and comparison of the under-sampling schemes … 49
2.9.1 Nearest Neighbours (NN) … 50
2.9.2 Properties of under-sampling schemes … 50
2.9.3 Comparison of under-sampling techniques … 54
2.10 Review of related work … 57
2.11 Remarks … 63

Chapter Three: Research Methodology
3.1 Data mining process … 65
3.2 Model development … 67
3.2.1 The enhanced data sampling scheme algorithms … 67
3.2.2 Algorithm SMOTE (T, N, k) … 68
3.2.3 Algorithm ENN (T, θ, k) … 70
3.2.4 Algorithm NCL (T, C, k) … 70
3.2.5 Algorithm RUS (T, N, C) … 70
3.3 Implementation of models in WEKA … 70
3.3.1 Basic functionality … 72
3.3.2 Graphical user interfaces … 74
3.3.3 Extending WEKA … 75
3.3.3.1 Writing a new filter … 75
3.4 Evaluation metrics and statistical analysis tools … 80
3.4.1 Hypothesis testing … 81
3.4.2 Friedman test … 81
3.4.3 Analysis of Variance (ANOVA) … 82
3.4.4 Tukey–Kramer method … 83
3.4.5 Box and whisker plots … 83
3.5 The datasets … 84
3.5.1 Diabetes Mellitus (DM) dataset … 84
3.5.2 Senior Secondary School Certificate Examination Result (SSS Result) dataset … 85
3.5.3 Tuberculosis (TB) dataset … 85
3.5.4 Contraceptive Method (CM) dataset … 86
3.6 Experimental design … 87
3.6.1 Classification algorithms' configuration … 92
3.6.1.1 Random Tree … 92
3.6.1.2 RIPPER … 89
3.6.1.3 Decision Tree … 89
3.6.1.4 K-Nearest Neighbours classifier (IB3) … 89
3.6.1.5 REPTree … 89
3.6.1.6 Support Vector Machine (SVM) … 90
3.6.1.7 MultiLayerPerceptron (MLP) … 90
3.6.1.8 Multiple Class Classifiers … 90
3.6.1.9 RandomCommittee … 90
3.6.1.10 Random Forest … 91
3.6.1.11 Random Subspace (Decision Forest) … 91
3.6.1.12 Stacking … 91
3.6.1.13 Bagging … 91
3.6.1.14 Boosting (AdaBoostM1) … 91
3.6.2 Ten-fold cross validation … 92
3.7 Percentage reduction/increment in the dataset … 92
3.8 Percentage number of the minority class in the dataset … 92
3.9 Measuring the impact of class distribution on classifier performance … 92
3.10 Performance loss/gain on classifiers … 93

Chapter Four: Results and Discussion
4.1 Introduction … 94
4.1.1 Analysis of error rates of the minority and majority class distributions … 94
4.1.1.1 Discussion on the error rates … 98
4.1.2 Steps involved in evaluation of results … 98
4.1.3 Dataset distribution … 100
4.1.4 Analysis of classification results of performance metrics on all datasets … 109
4.1.4.1 Analysis of ROC_AUC metrics … 109
4.1.4.2 Analysis of Kappa statistics metrics … 116
4.1.4.3 Analysis of RMSE metrics … 123
4.1.4.4 Analysis of RECALL of minority class metrics … 130
4.1.4.5 Analysis of performance loss/gain metrics … 137
4.1.4.6 Analysis of all metrics on Tuberculosis (TB) dataset … 144
4.1.5 Statistical analysis of classification results on performance metrics … 147
4.1.5.1 Report of Friedman's test on ROC_AUC metric for all datasets … 147
4.1.5.2 Report of Friedman's test on Kappa statistics metric for all datasets … 150
4.1.5.3 Report of Friedman's test on RMSE metric for all datasets … 153
4.1.5.4 Report of Friedman's test on RECALL of the minority class metric for all datasets … 156
4.1.5.5 Report of Friedman's test on ROC_AUC metric for all classifiers … 159
4.1.5.6 Report of Friedman's test on performance loss/gain metric for all datasets … 162
4.1.6 Results of the Analysis of Variance (ANOVA) … 165
4.1.6.1 ANOVA test on ROC_AUC metric on all datasets … 165
4.1.6.2 ANOVA test on all datasets using Kappa statistics metric … 168
4.1.6.3 ANOVA test on RMSE metric on all datasets … 170
4.1.6.4 ANOVA test on RECALL of minority class metric on all datasets … 172
4.1.6.5 ANOVA test on all classifiers … 174
4.1.6.6 ANOVA on all datasets with all data sampling schemes using performance loss/gain metric … 177
4.1.7 Box and whisker plots … 181
4.2 Remarks … 200
4.2.1 Analysis of performance of datasets generated from existing data sampling schemes … 200
4.2.2 Analysis of performance of datasets generated from the enhanced data sampling schemes … 202
4.2.3 Analysis of performance of the Tuberculosis dataset … 204
4.2.4 Analysis of classifiers … 205

Chapter Five: Summary, Conclusion and Recommendations
5.1 Summary … 206
5.2 Contribution to the study … 207
5.3 Conclusion … 207
5.4 Recommendations … 208

References … 209
Appendix A: The screenshots of the predictions of datasets on classification algorithms … 222
Appendix B: The class boundary diagram for DM dataset … 229
Appendix C: Java codes for SMOTE class … 236
Appendix D: Java codes for Wilson's Edited Nearest Neighbour (ENN) class … 248
Appendix E: Java codes for Neighbourhood Cleaning Rule (NCL) class … 264
Appendix F: Java codes for Condensed Nearest Neighbour (CNN) class … 280
Appendix G: Java codes for Random Under-Sampling (RUS) class … 294
LIST OF TABLES
Table 2.1: Confusion matrix … 19
Table 2.2: Cost matrix … 41
Table 2.3: Comparison of under-sampling techniques … 56
Table 3.1: Summary of datasets … 88
Table 4.1: Confusion matrix of the bi-class from Decision Tree classifier on study datasets … 95
Table 4.2: Error rates of the study datasets with Decision Tree classifier … 96
Table 4.3: DM dataset class distribution … 101
Table 4.4: SSS Result dataset class distribution … 103
Table 4.5: TB dataset class distribution … 105
Table 4.6: CM dataset class distribution … 107
Table 4.7: ROC_AUC metric values for DM dataset … 110
Table 4.8: ROC_AUC metric values for SSS Result dataset … 112
Table 4.9: ROC_AUC metric values for CM dataset … 114
Table 4.10: Kappa statistic metric values for DM dataset … 117
Table 4.11: Kappa statistic metric values for SSS Result dataset … 119
Table 4.12: Kappa statistics metric values for CM dataset … 121
Table 4.13: RMSE metric values for DM dataset … 124
Table 4.14: RMSE metric values for SSS Result dataset … 126
Table 4.15: RMSE metric values for CM dataset … 128
Table 4.16: RECALL of the minority class (GDM) metric values for DM dataset … 131
Table 4.17: RECALL of the minority class (PASSWAEC) metric values for SSS Result dataset … 133
Table 4.18: RECALL of the minority class (NONE) metric values for CM dataset … 135
Table 4.19: Performance loss/gain values for DM dataset against RAW DATA using ROC_AUC metric … 138
Table 4.20: Performance loss/gain values for SSS Result dataset against RAW DATA using ROC_AUC metric … 140
Table 4.21: Performance loss/gain values for CM dataset against RAW DATA using ROC_AUC metric … 142
Table 4.22: All metric values for Tuberculosis dataset … 145
Table 4.23: Result of Friedman's analysis on ROC_AUC metric for all datasets … 148
Table 4.24: Result of Friedman's analysis on Kappa statistics metric for all datasets … 151
Table 4.25: Result of Friedman's analysis on RMSE metric for all datasets … 154
Table 4.26: Report of Friedman's analysis on RECALL of minority class metric for all datasets … 157
Table 4.27: Report of Friedman's analysis on ROC_AUC metric for all classifiers … 159
Table 4.28: Result of Friedman's analysis on performance loss/gain metric for all datasets … 163
Table 4.29: ANOVA on all datasets with all data sampling schemes using ROC_AUC metric … 167
Table 4.30: ANOVA on all datasets with all data sampling schemes using Kappa statistics metric … 169
Table 4.31: ANOVA on all datasets with all data sampling schemes using RMSE metric … 171
Table 4.32: ANOVA on all datasets with all data sampling schemes using RECALL of minority class metric … 173
Table 4.33: ANOVA on all classifiers with all data sampling schemes using ROC_AUC metric … 176
Table 4.34: ANOVA on all datasets with all data sampling schemes using performance loss/gain metric … 178
Table 4.35: Summary of performance gain/loss of the schemes compared to the RAW DATA in percentage … 179
LIST OF FIGURES
Figure 2.1: Class overlapping problem … 11
Figure 2.2: Small class disjunct … 13
Figure 2.3: ROC_AUC graph … 22
Figure 2.4(a): An imbalanced dataset … 26
Figure 2.4(b): A balanced dataset … 27
Figure 2.5(a): Imbalanced dataset … 31
Figure 2.5(b): Balanced dataset using TLink … 32
Figure 2.6(a): Imbalanced dataset … 33
Figure 2.6(b): A balanced dataset after CNN … 34
Figure 3.1: The research methodology … 66
Figure 3.2: The standard WEKA GUI with filters … 71
Figure 3.3: The enhanced data sampling schemes implemented in WEKA … 74
Figure 4.1: Steps used for the pre-processing and analysis of results … 99
Figure 4.2: Chart showing the DM dataset class distribution … 102
Figure 4.3: Chart showing the SSS Result dataset class distribution … 104
Figure 4.4: Chart showing the TB dataset class distribution … 106
Figure 4.5: Chart showing the CM dataset class distribution … 108
Figure 4.6: Chart showing the ROC_AUC metric values for DM dataset … 111
Figure 4.7: Chart showing the ROC_AUC metric values for SSS Result dataset … 113
Figure 4.8: Chart showing the ROC_AUC metric values for CM dataset … 115
Figure 4.9: Chart showing the Kappa statistics metric values for DM dataset … 118
Figure 4.10: Chart showing the Kappa statistics metric values for SSS Result dataset … 120
Figure 4.11: Chart showing the Kappa statistics metric values for CM dataset … 122
Figure 4.12: Chart showing the RMSE metric values for DM dataset … 125
Figure 4.13: Chart showing the RMSE metric values for SSS Result dataset … 127
Figure 4.14: Chart showing the RMSE metric values for CM dataset … 129
Figure 4.15: Chart showing the RECALL of the minority class (GDM) metric values for DM dataset … 132
Figure 4.16: Chart showing the RECALL of the minority class (PASSWAEC) metric values for SSS Result dataset … 134
Figure 4.17: Chart showing the RECALL of the minority class (NONE) metric values for CM dataset … 136
Figure 4.18: Chart showing the performance loss/gain values for DM dataset against RAW DATA using ROC_AUC metric … 139
Figure 4.19: Chart showing the performance loss/gain values for SSS Result dataset against RAW DATA using ROC_AUC metric … 141
Figure 4.20: Chart showing the performance loss/gain values for CM dataset against RAW DATA using ROC_AUC metric … 143
Figure 4.21: Chart showing all metric values for Tuberculosis dataset … 146
Figure 4.22: Chart showing the Friedman's analysis on ROC_AUC metric for all datasets … 149
Figure 4.23: Chart showing the Friedman's analysis on Kappa statistics metric for all datasets … 152
Figure 4.24: Chart showing the Friedman's analysis on RMSE metric for all datasets … 155
Figure 4.25: Chart showing the Friedman's analysis on RECALL of minority class metric for all datasets … 158
Figure 4.26: Chart showing the report of Friedman's analysis on ROC_AUC metric for all classifiers … 161
Figure 4.27: Chart showing the result of Friedman's analysis on performance loss/gain metric for all datasets … 164
Figure 4.28: Chart showing the summary of performance loss/gain of the schemes compared to the RAW DATA in percentages … 180
Figure 4.29: Box and whisker plots for ROC_AUC metric for DM dataset … 182
Figure 4.30: Box and whisker plots for ROC_AUC metric for SSS Result dataset … 183
Figure 4.31: Box and whisker plots for ROC_AUC metric for CM dataset … 184
Figure 4.32: Box and whisker plots for Kappa statistics metric for DM dataset … 185
Figure 4.33: Box and whisker plots for Kappa statistics metric for SSS Result dataset … 186
Figure 4.34: Box and whisker plots for Kappa statistics metric for CM dataset … 187
Figure 4.35: Box and whisker plots for RMSE metric for DM dataset … 188
Figure 4.36: Box and whisker plots for RMSE metric for SSS Result dataset … 189
Figure 4.37: Box and whisker plots for RMSE metric for CM dataset … 190
Figure 4.38: Box and whisker plots for RECALL metric for DM dataset … 191
Figure 4.39: Box and whisker plots for RECALL metric for SSS Result dataset … 192
Figure 4.40: Box and whisker plots for RECALL metric for CM dataset … 193
Figure 4.41: Box and whisker plots for classifiers on DM dataset … 194
Figure 4.42: Box and whisker plots for classifiers on SSS Result dataset … 195
Figure 4.43: Box and whisker plots for classifiers on CM dataset … 196
Figure 4.44: Box and whisker plots on performance loss/gain on DM dataset … 197
Figure 4.45: Box and whisker plots on performance loss/gain on SSS Result dataset … 198
Figure 4.46: Box and whisker plots on performance loss/gain on CM dataset … 199
CHAPTER ONE
Introduction

1.1 Background to the study
Data mining is defined as the process of discovering patterns in data (Witten et al., 2011). It can also be referred to as the extraction, or "mining", of knowledge from large amounts of data (Han and Kamber, 2001).
Data mining has attracted a great deal of attention in the information industry due to the availability of large amounts of data and the pressing need to transform these data into information and knowledge. Data mining techniques such as class description, association analysis, classification and prediction, cluster analysis and outlier analysis are used to specify the kinds of patterns to be found in data mining tasks. Classification is the process of finding a set of models that describe and distinguish data classes and concepts in order to predict the unknown class labels of objects. These models are based on the analysis of training data whose class labels are known. Since the class label of each training sample is provided, this step is known as supervised learning. The model may be presented in various forms, such as classification (IF-THEN) rules or mathematical formulae. These rules can be used to categorise future data samples, as well as to provide a better understanding of the dataset contents.
Classification can be used to predict the class labels of data objects. In many application domains, users may be interested in mining descriptions that distinguish a target class from its contrasting classes in the same dataset. Classification models (also called algorithms, classifiers or learners) in data mining include Decision Trees, Artificial Neural Networks, k-Nearest Neighbour classifiers and Random Forest. Hence, classification is an important task but a general problem in data mining and machine learning.
Some of the issues regarding datasets for classification are data cleaning, relevance analysis, data transformation, the class imbalance problem and comparison of classifiers. These sub-problems impede learning. When constructing a classification model, the learning algorithm reveals the underlying relationship between the attribute set and the class label and identifies the model that best fits the training data. This model should accurately predict the class labels of previously unseen instances.
However, standard classifiers usually perform poorly on imbalanced datasets because they are designed to generalise from the training data and output the simplest hypothesis that best fits the data, and this simplest hypothesis pays less attention to rare cases. In many cases, however, identification of these rare objects of the minority class is of crucial importance: classification performance on the small, rare or minority classes is the main concern in determining the quality of a classification model (Sun et al., 2006; Hoens, 2012).
Most traditional classifiers operate on data drawn from the same distribution as the training data and assume that maximising accuracy is the principal goal. In a problem with an imbalance level of 99%, a learning algorithm that minimises the error rate could decide to classify all examples as the majority class in order to achieve a low error rate of 1%. A practical example is a domain trying to distinguish terrorists from non-terrorists, or cancer patients from non-cancer patients. The number of samples representing non-terrorists and non-cancer patients is far greater than that of terrorists and cancer patients. Most classifiers will assume that the costs of misclassification for the two classes are the same, but the cost of misclassifying a non-terrorist is much lower than that of missing an actual terrorist who carries a bomb into a cinema. Nevertheless, all minority examples will be wrongly classified in this case (Xu-Ying et al., 2009). This class imbalance
problem has been observed to cause a significant deterioration in the performance of standard classifiers (Barandela et al., 2003a; Johnson et al., 2012).
1.2 Motivation of study
Traditional classifiers such as Decision Trees, Artificial Neural Networks (ANN) and Support Vector Machines (SVM) are ineffective at identifying samples from the minority class, which is the class of interest during classification (Garcia et al., 2012). New techniques are required to ensure that classifiers can effectively identify these most important yet rarely occurring examples.
Secondly, there is also the advantage of a low storage requirement and a high computational gain when datasets are reduced. Therefore, enhanced sampling schemes, which are external to and independent of any classifier, and also versatile, are desirable.
This study is therefore motivated by the need to identify specific domains for which an imbalance has been shown to hurt the performance of standard classifiers, to show whether these class imbalances are always damaging to classification, and to determine to what extent different types of imbalances affect classification performance.
1.3 Justification for the study
The justification for this research is that most standard classifiers work towards achieving generalised accuracy and a low error rate, which biases them towards the majority class while completely ignoring the minority class; yet the minority class is the class of interest. Closely related is the class distribution of a domain: classifiers assume that the classification algorithm will work on data drawn from the same class distribution for both the training and testing datasets, but this is not always true. Furthermore, there is the error cost, characterised by the situation whereby the classifier assumes that errors coming from the different classes in the dataset cost the same; this is not correct, as the misclassification cost of the minority class is higher than that of the majority class.
1.4 Research aim and objectives
The aim of this study is to develop enhanced data sampling schemes for improving the performance of classification models trained on imbalanced datasets, so as to increase the RECALL of the minority class, which is the class of interest.
The specific objectives of the study are to:
i. develop enhanced data sampling schemes to alleviate the effects of the class imbalance problem;
ii. evaluate the performance of these new data sampling schemes on different classifiers as well as on homogeneous and heterogeneous ensembles, and compare their performances.
1.5 Research methodology
The following methodologies were used in this study:
i. Extensive review of literature in related work
ii. Development of a theoretical taxonomy of the relationship between under-sampling schemes in class imbalance learning, their underlying data reduction techniques and their time complexity.
iii. Development of the enhanced data sampling schemes; SMOTE300ENN, SMOTENCL, SMOTERUS, SMOTE300NCL and SMOTE300RUS using Java programming language.
iv. Extension of WEKA, a data mining tool, to accommodate these enhanced data sampling schemes (a usage sketch is given after this list).
v. Testing of the new data sampling schemes on selected datasets obtained locally:
Contraceptive Methods (CM), Senior Secondary School Result (SSS Result), Diabetes Mellitus (DM) and Tuberculosis (TB) datasets in Nigeria.
vi. Testing of the enhanced and existing data sampling schemes (CNN, ENN and NCL) on various base classifiers (Decision Tree, RIPPER, Artificial Neural Network (ANN), Random Tree, Fast Decision Tree Learner (REPTree), Support Vector Machine (SVM) and k-Nearest Neighbours classifier (IB3)) and homogeneous ensembles (Boosting (AdaBoostM1), Bagging, Random Subspace (Decision Forest), Random Forest, Random Committee and MultiClass Classifier), and comparing the results with a heterogeneous ensemble (Stacking, using RIPPER, Decision Tree, IB3, SVM and MLP as base classifiers in this order, and Decision Tree as the meta classifier).
vii. Evaluating the results obtained from the study of these datasets using the following metrics: Receiver Operating Characteristics Area Under Curve (ROC_AUC), Kappa Statistics, Root Mean Square Error (RMSE), RECALL of the minority class and Performance Loss/Gain.
viii. Analyses of the results obtained with the performance metrics using both parametric and non-parametric statistical methods: ANOVA, box and whisker plots and the Friedman test at a statistical significance level of 0.05 (95% confidence level), in the SPSS package.
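To illustrate item iv, the minimal sketch below shows how a data sampling scheme implemented as a WEKA filter is invoked from Java through the standard Filter API. It uses WEKA's SMOTE filter (assumed to be installed as a WEKA package) at the 300% setting adopted in this study; the ARFF file name is a placeholder, and an enhanced scheme such as SMOTE300ENN would be applied the same way once registered as a filter.

import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.SMOTE;

public class ApplySamplingScheme {
    public static void main(String[] args) throws Exception {
        // Load an imbalanced dataset; "diabetes.arff" is a placeholder file name.
        Instances data = new Instances(new BufferedReader(new FileReader("diabetes.arff")));
        data.setClassIndex(data.numAttributes() - 1); // class label is the last attribute

        // Configure SMOTE to generate 300% synthetic minority examples,
        // mirroring the SMOTE+300% setting used in this study.
        SMOTE smote = new SMOTE();
        smote.setPercentage(300.0);
        smote.setNearestNeighbors(5);
        smote.setInputFormat(data);

        // Any scheme written as a WEKA filter is applied through the same call.
        Instances balanced = Filter.useFilter(data, smote);
        System.out.println("Before: " + data.numInstances()
                + " instances, after: " + balanced.numInstances());
    }
}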
1.6 Scope of the study
This study spans datasets with imbalanced class distributions, where the imbalanced distribution among the classes in the dataset hinders the performance of classifiers. The study focuses on how to increase the RECALL of the minority class, which is the class of interest. The study also identified specific domains for which an imbalanced dataset was shown to hinder the performance of standard classifiers, determined whether these imbalances were always damaging, and examined to what extent different types of imbalances affect classification performance. The CM, SSS Result and DM datasets are examples of such cases, where a traditional classifier trained on them is overwhelmed by the number of majority-class examples, thereby misclassifying the minority class, which is the class of interest.
1.7 Organisation of thesis
The rest of this thesis is organised as follows:
Chapter two gives an extensive review of class imbalance learning and several solutions reported in the literature, including sampling schemes, ensemble techniques, evaluation metrics and related work. Chapter three gives a comprehensive explanation of the methods used, the model development and the experimental setup. Chapter four presents the results obtained and a detailed discussion of them.
Finally, chapter five gives the summary of the study, the conclusions drawn from it and the recommendations for future work.
1.8 Glossary of terms
Association analysis: This is the discovery of association rules showing attribute-value conditions that occur frequently together in a given dataset.
Bias: This is defined by Mitchell (1980) as a rule or method that causes an algorithm to choose one generalised output over another, as explained by Wilson and Martinez (1997b).
CBOS: Cluster-based Oversampling
Cluster analysis: This technique analyses data objects without consulting a known class label.
ENN: Wilson's Edited Nearest Neighbour
MLP: Multi Layer Perceptron
NCL: Neighbourhood Cleaning Rule
Noise: This is a random error or variance in a measured variable.
OSS: One Sided Selection
Outlier analysis: This is the study of data objects that do not comply with the general behaviour or models of the data.
Patterns: These are rules generated from mining a dataset. A pattern represents knowledge if it is easily understood by humans, valid on test data with some degree of certainty, and novel.
REPTree: Reduced Error Pruning Tree
RIPPER: Repeated Incremental Pruning to Produce Error Reduction
RNN: Reduced Nearest Neighbour
ROC: Receiver Operating Characteristics
ROS: Random Oversampling
RUS: Random Under Sampling
Samples: This term is used synonymously with examples, instances or objects. It refers to data tuples (rows or records) in a dataset.
SMO: Sequential Minimal Optimization
SMOTE: Synthetic Minority Oversampling TEchnique
SNN: Selective Nearest Neighbour
Supervised learning: This is learning in which the class label of each training sample is provided, so that the model is built under the supervision of known classes.
TLink: Tomek Link
Training dataset: These are the data tuples collectively analysed to build the model.
CHAPTER TWO
Literature Review
This chapter presents the review of literature on the class imbalance problem and related work.
2.1 Class Imbalance Problem
The class imbalance problem corresponds to a domain for which one class is represented by a large number of examples while the other is represented by only a few (Japkowicz, 2003). Class imbalance learning is the learning problem in which instances in some classes heavily outnumber the instances in other classes; an Imbalanced Data Set (IDS) is one drawn from a domain that suffers from this problem (Wang et al., 2009).
In such cases, standard classifiers tend to be overwhelmed by the large classes and ignore the small ones. This imbalance causes sub-optimal classification performance or worse (Chawla et al., 2004; Fernandez et al., 2011). It is a fundamental problem of data mining research (Yang and Wu, 2005) and pattern recognition (Ghanem et al., 2010).
When a prediction model is trained on such an imbalanced dataset, it tends to show a strong bias towards the majority class, since typical learning algorithms aim to maximise the overall prediction accuracy. In fact, if 95% of the entire dataset belongs to the majority class, the model could ignore the remaining 5% of the minority class and predict that all of the test data are in the majority class. Though the accuracy would be 95%, the instances belonging to the minority class would be completely misclassified (Hido and Kashima, 2008). The misclassification cost for the minority class, however, is usually much higher than that of the majority class and should not be ignored (Hido and Kashima, 2008; Thai-Nghe et al., 2009).
Domains suffering naturally from class imbalance include the detection of oil spills in satellite radar images (Kubat and Matwin, 1997), the diagnosis in medicine of rare diseases (such as cancer) and rare gene mutations (such as albinism), medical diagnosis (Yang and Ma, 2010), network monitoring and intrusion detection (Vegard, 2010), detection of earthquakes, nuclear explosions and helicopter faults (Guo et al., 2008), risk management (Chawla et al., 2004), text classification (Estabrooks et al., 2004), education (the ratio of the number of passing students to failing students), detection of fraudulent or defaulting banking (Thai-Nghe et al., 2009), species distribution prediction in ecology and conservation biology (Johnson et al., 2012), information retrieval and filtering (Lewis and Gale, 1994), response optimisation in Customer Relationship Management (CRM) (Lessmann, 2004), document classification (Manevitz and Yousef, 2001), image retrieval (Chen et al., 2001), Deoxyribonucleic Acid (DNA) microarray time series (Pearson et al., 2003), spam detection and filtering (Kolcz et al., 2003) and sentence boundary detection in speech (Liu et al., 2006).
In practical applications, the ratio of the small to the large class can be as drastic as 1:100, 1:1,000 or 1:10,000, and sometimes even more (Chawla et al., 2004). In a classification problem, an algorithm is used to construct a model by learning from a training set which contains examples with class labels (Boontarika and Maythapolnum, 2011).
2.2 Problems associated with class imbalance
Class imbalance occurs when there are significantly fewer training instances of one class compared to the others (Thai-Nghe et al., 2009; Chawla et al., 2004). In some applications, the data are naturally imbalanced; examples are credit card fraud and rare diseases such as cancer. However, imbalanced datasets can also occur in areas where data for the minority class are too expensive to obtain, e.g. shuttle failure (Chawla et al., 2004; Guo et al., 2008), or where there are limitations in collecting data, such as cost, privacy and the large effort required to obtain a representative dataset (Thai-Nghe et al., 2009), thus creating 'artificial' imbalances (Chawla et al., 2004). Class imbalance gives rise to various difficulties in learning.
2.2.1 Difficulties encountered in imbalanced classification
Some of the difficulties associated with imbalanced dataset classification, particularly when the imbalance is combined with other data problems, according to Lopez et al. (2013), Batista et al. (2004), Nguyen (2011), Johnson et al. (2012) and Fernandez et al. (2011), include:
a. Small Sample Size
This corresponds to the situation where the size of the minority class is extremely small, because there are limitations in collecting data, data are too expensive to obtain, or the dataset is naturally imbalanced. The learning algorithm then cannot make generalisations about the class distribution for lack of information or sufficient data, and the minority class becomes poorly represented. The combination of imbalanced data and the small sample size problem presents a new challenge: the minority class can be so poorly represented that a classifier learned on this data space becomes too specific, leading to overfitting.
b. Class Overlapping
The problem of overlapping between classes appears when a region of the data space contains a similar quantity of training data from each class, as shown in Figure 2.1. This may lead to an inference with almost the same a priori probabilities in the overlapping area, which makes it very hard or even impossible to distinguish between the two classes. Classification of an imbalanced dataset becomes sub-optimal when the imbalance is combined with the class overlapping problem; in contrast, any linearly separable problem can be solved by any base classifier irrespective of class imbalance.
Figure 2.1: Class overlapping problem (Galar et al., 2012)

c. Small Disjuncts
The class imbalance problem is closely associated with the small disjuncts problem. Small disjuncts arise when samples from the minority class are represented within sub-clusters, which occur as a direct result of an underrepresented concept, as established by Weiss and Provost (2003), Galar et al. (2012) and Rahman and Raju (2014), and shown in Figure 2.2. Although these small disjuncts are hidden in most problems, their existence greatly increases the complexity of the problem in the case of imbalance, because it becomes hard to know whether these samples represent an actual sub-concept or are merely attributable to noise, as concurred by Jo and Japkowicz (2004).

Figure 2.2: Small class disjunct (Galar et al., 2012)
d. Dataset Shift
This phenomenon occurs when there is a difference between the distributions of the training and test samples of the same dataset, as confirmed by Quinonero et al. (2009). The issue is especially relevant when dealing with imbalanced classification because, in highly imbalanced domains, the minority class is particularly sensitive to singular classification errors due to the typically low number of examples it presents; even a few misclassifications on the minority class can cause sub-optimal classifier performance.
e. Concept Complexity
This is an important factor in a classifier's ability to alleviate the class imbalance problem. Concept complexity corresponds to the level of separability of classes within the dataset (Japkowicz and Stephen, 2002). The class imbalance factor starts affecting the classifier's generalisation ability as the degree of data complexity increases. High complexity refers to inseparable datasets with highly overlapped classes, complex boundaries and high noise levels. When samples of different classes overlap in the feature space, finding the ideal class boundary becomes difficult (Nguyen et al., 2009).
f. Noise
The class imbalance problem is more significant when the dataset has a high level of noise. Noise in datasets can emerge from various sources: data samples may be poorly acquired or incorrectly labelled, or the extracted features may not be sufficient for classification, as explained by Nguyen et al. (2009).
2.2.2 Multiple Class Problems
Typically, there are two class structures for imbalanced datasets: bi-class and multiple class (more than two classes, or multi-class). In a bi-class application, the imbalance problem is observed as one class represented by a large number of samples while the other is represented by only a few. The class with few training samples, usually associated with high identification importance, is referred to as the positive class; the other is the negative class (Thai-Nghe et al., 2010; Sun et al., 2006). In practice, most applications have more than two classes, and an unbalanced class distribution hinders the performance of the classifier; such applications suffer from even more classification difficulties.

Most of the solutions reported so far to alleviate the class imbalance problem address mainly two-class imbalance problems. Most real-world applications, however, have more than two classes with imbalanced distributions, and these pose new challenges that are not observed in two-class problems. The multi-class classification problem is an extension of the traditional binary class problem in which a dataset consists of 𝑘 different classes instead of two. Though class imbalance exists in binary class datasets, where one class severely outnumbers the other, it also extends to multiple classes, where the effects of imbalance are even more problematic. That is, given 𝑘 different classes, there are multiple ways for class imbalance to manifest itself in the dataset. One typical way is that there is one "super majority" class which contains most of the instances in the dataset. Another typical example of class imbalance in multi-class datasets is the result of a single minority class: in such instances, each of the other 𝑘 − 1 classes consists of roughly 1/(𝑘 − 1) of the dataset, and the "minority" class makes up the rest (Hoens et al., 2012). For example, with 𝑘 = 5 and a minority class holding 2% of the data, each of the other four classes holds roughly 24.5%. Multi-class imbalance problems suffer from more classification difficulties.
2.3 Methods of multiple classes’ problem decomposition
There are several methods by which multi-class classification can be resolved. These are discussed in sections 2.3.1 to 2.3.3:
2.3.1 Direct multiclass classification
This scheme applies the learning algorithm directly, without any changes in parameters, to alleviate a multiple class problem. Examples of such algorithms are K-Nearest Neighbour, Decision Tree, the Bayes Classifier (Naïve Bayes) and Support Vector Machine.
2.3.2 Multiclass Extension: Decomposition
This is a technique for processing multiple classes by transforming the problem into several binary (two-class) classification sub-problems. Decomposing a big problem has some advantages which, according to Wang (2011), include:
a. Individual classifiers are likely to be simpler than a classifier learnt from the whole data set.
b. They can be trained simultaneously, for less modelling time.
c. They can be trained independently, which allows different feature spaces, feature dimensions and architectures; a change in one classifier will not affect the others.
However, the potential drawbacks of the decomposition method, according to Wang (2011), are:
a. each individual classifier is trained without full data knowledge; and
b. it can cause classification ambiguity or uncovered data regions with respect to each type of decomposition.
2.3.3 Methods of Decomposing Multiple Class Problems
There are several approaches to decomposing a multiclass classification problem. These methods, as outlined by Boontarika and Maythapolnum (2011), are discussed in sections 2.3.3.1 to 2.3.3.4:
2.3.3.1 One-versus-One (OVO) Method
This approach creates a classifier for each pair of classes. The training set for each pair classifier (𝑖, 𝑗) includes only those instances that belong to either class 𝑖 or class 𝑗. A new instance, 𝑥, belongs to the class upon which most pair classifiers agree; the prediction decision follows the majority vote technique. There are 𝑛(𝑛 − 1)/2 classifiers to be computed, where 𝑛 is the number of classes in the dataset. Evidently, the disadvantage of this scheme is the need to generate a large number of classifiers, especially if there are a large number of classes in the training set. For example, for a training set of 1,000 classes, 499,500 classifiers are needed. On the other hand, the
size of the training set for each classifier is small because all instances that do not belong to that pair of classes are excluded as discussed by Awad et al. (2009).
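To make the OVO decomposition concrete, the minimal sketch below (an outline, not code from this thesis) trains one binary learner per pair of classes on only the instances of those two classes and predicts by majority vote; BinaryLearner and its factory are hypothetical stand-ins for any base classifier.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for any binary base classifier. */
interface BinaryLearner {
    void train(List<double[]> xs, List<Integer> ys); // labels in {0, 1}
    int predict(double[] x);
}

/** One-versus-one decomposition: n(n - 1)/2 pairwise learners, majority vote. */
class OneVersusOne {
    private final int n;                  // number of classes
    private final BinaryLearner[][] pair; // pair[i][j] trained on classes i vs j (i < j)

    OneVersusOne(int numClasses, Supplier<BinaryLearner> factory,
                 List<double[]> xs, List<Integer> ys) {
        n = numClasses;
        pair = new BinaryLearner[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                // Keep only the instances belonging to class i or class j.
                List<double[]> px = new ArrayList<>();
                List<Integer> py = new ArrayList<>();
                for (int k = 0; k < xs.size(); k++) {
                    int label = ys.get(k);
                    if (label == i || label == j) {
                        px.add(xs.get(k));
                        py.add(label == i ? 0 : 1); // relabel: i -> 0, j -> 1
                    }
                }
                pair[i][j] = factory.get();
                pair[i][j].train(px, py);
            }
        }
    }

    /** A new instance belongs to the class upon which most pairwise learners agree. */
    int predict(double[] x) {
        int[] votes = new int[n];
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                votes[pair[i][j].predict(x) == 0 ? i : j]++;
        int best = 0;
        for (int c = 1; c < n; c++)
            if (votes[c] > votes[best]) best = c;
        return best;
    }
}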
2.3.3.2 One-versus-All (OVA) Method
This method creates a classifier for each class in the dataset. The training set is pre-processed such that, for a classifier 𝑗, instances that belong to class 𝑗 are marked as class (+1) and instances that do not belong to class 𝑗 are marked as class (−1). In the OVA scheme, one computes 𝑛 classifiers, where 𝑛 is the number of classes in the dataset. A new instance, 𝑥, is predicted by assigning it to the class whose classifier outputs the largest positive value (that is, the maximal margin). The main advantage of this method is that it introduces redundancy, which aids generalisation in the classification, but it causes an overfitting problem when applied to a small sample size, because each classifier uses data from the two groups of interest and ignores the rest, as noted by Zhou and Tuck (2007).
The advantage of the OVA scheme compared to the OVO scheme is that it has fewer classifiers. On the other hand, the size of the training set is larger for the OVA scheme than for the OVO scheme, because the whole original training set is used to compute each classifier.
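A matching minimal sketch for OVA follows, under the assumption that the hypothetical learner can also expose a signed confidence score, which is needed to pick the "largest positive value" mentioned above.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical binary learner exposing a signed confidence score. */
interface ScoringLearner {
    void train(List<double[]> xs, List<Integer> ys); // labels in {0, 1}
    double score(double[] x);                        // larger = more positive
}

/** One-versus-all: one learner per class, each trained on the whole dataset. */
class OneVersusAll {
    private final ScoringLearner[] perClass;

    OneVersusAll(int numClasses, Supplier<ScoringLearner> factory,
                 List<double[]> xs, List<Integer> ys) {
        perClass = new ScoringLearner[numClasses];
        for (int c = 0; c < numClasses; c++) {
            // Relabel the full training set: class c is positive, all others negative.
            List<Integer> relabelled = new ArrayList<>();
            for (int y : ys) relabelled.add(y == c ? 1 : 0);
            perClass[c] = factory.get();
            perClass[c].train(xs, relabelled);
        }
    }

    /** Assign the class whose classifier outputs the largest score. */
    int predict(double[] x) {
        int best = 0;
        for (int c = 1; c < perClass.length; c++)
            if (perClass[c].score(x) > perClass[best].score(x)) best = c;
        return best;
    }
}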
2.3.3.3 P-Against-Q (PAQ) Method
This is a generalised concept of a coding scheme. The code word length is equal to the sum of P and Q, where 𝑃 ≥ 1 and 𝑄 ≥ 1; P is the number of "on" (binary 1) bits and Q is the number of "off" (binary 0) bits. OVA is a PAQ scheme with 𝑃 = 1 and a code word length of 𝑘 (i.e. 𝑄 = 𝑘 − 1), as discussed by Ou et al. (2004).
2.3.3.4 Error-Correcting Code Design Method
Error-correcting output code is defined to be a matrix of binary values. The length of a code is the number of columns in the code. The number of rows in the code is equal to the number of classes in the multiclass learning problem. A “code word" is a row in the code.
A good error-correcting output code for a k-class problem should satisfy two properties:
row separation, where each code word should be well separated in Hamming distance from each of the other code words; and column separation, where each bit-position function f_i should be uncorrelated with the functions to be learned for the other bit positions f_j, j ≠ i. The power of a code to correct errors is directly related to the row separation, as concluded by Dietterich and Bakiri (1995).
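The sketch below illustrates only the decoding half of such a design: given a code matrix (one row, i.e. code word, per class) and the bits predicted by the learned bit-position functions for one instance, the predicted class is the row nearest in Hamming distance. The 4-class, 7-bit code is invented for illustration.

/** Minimal error-correcting output code decoding: nearest code word in Hamming distance. */
class EcocDecoder {
    private final int[][] code; // one row (code word) per class

    EcocDecoder(int[][] code) { this.code = code; }

    /** bits = outputs of the learned bit-position functions f_i for one instance. */
    int decode(int[] bits) {
        int best = 0, bestDist = Integer.MAX_VALUE;
        for (int c = 0; c < code.length; c++) {
            int dist = 0;
            for (int i = 0; i < bits.length; i++)
                if (code[c][i] != bits[i]) dist++; // Hamming distance
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Invented 4-class, 7-bit code; every pair of rows is at least 4 bits apart.
        int[][] code = {
            {0, 0, 0, 0, 0, 0, 0},
            {0, 1, 1, 1, 1, 0, 1},
            {1, 0, 1, 1, 0, 1, 1},
            {1, 1, 0, 1, 1, 1, 0},
        };
        EcocDecoder d = new EcocDecoder(code);
        // Even with one flipped bit, the nearest code word is still class 1.
        System.out.println(d.decode(new int[]{0, 1, 1, 1, 1, 1, 1})); // prints 1
    }
}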
2.4 Evaluation metrics
Accuracy is the most common evaluation metric used by most traditional applications. However, accuracy is not suitable for evaluating imbalanced datasets, as it places more weight on the majority class than on the minority class, as affirmed by Weiss and Provost (2003) and Guo et al. (2008). It has also been observed that, for extremely skewed class distributions, the RECALL/True Positive Rate (TPR) of the minority class is often 0, meaning that no classification rules are generated for the minority class (Guo et al., 2008). The metrics listed in sections 2.4.1 to 2.4.10 are the most frequently used.
2.4.1 Confusion Matrix
In a bi-class problem, the confusion matrix records the numbers of correctly and incorrectly recognised examples of each class (Galar et al., 2012; Thai-Nghe et al., 2009). Table 2.1 presents the confusion matrix of a bi-class problem, where the positive class represents the minority class and the negative class represents the majority class. True Positive (TP) is the number of positive examples correctly classified as positive, while True Negative (TN) is the number of negative examples correctly classified as negative. False Positive (FP) is the number of negative examples incorrectly classified as positive, while False Negative (FN) is the number of positive examples incorrectly classified as negative. The Recall/Sensitivity/True Positive Rate (TPR) is the likelihood that a positive example is correctly classified as positive, as depicted by (Equation 2.3). Positive Predictive Value (PPV)/Precision is the likelihood that a positive prediction is correct, as depicted by (Equation 2.2), while NPV is the likelihood that a negative prediction is correct, as depicted by (Equation 2.7). The False Negative Rate (FNR) is the likelihood that a positive example is classified as negative, as depicted by (Equation 2.5). This confusion matrix can be extended and expanded to multiple class problems.
Accuracy = (TP + TN) / (TP + TN + FP + FN)        (2.1)
Precision/PPV = TP / (TP + FP)        (2.2)
Recall (TPR or Sensitivity) = TP / (TP + FN)        (2.3)
FPR = FP / (FP + TN)        (2.4)
FNR = FN / (TP + FN)        (2.5)
Specificity/TNR = TN / (TN + FP)        (2.6)
NPV = TN / (TN + FN)        (2.7)

2.4.2 F-measure
This metric is the harmonic mean of Recall and Precision, as depicted by (Equation 2.8); its values are taken from Table 2.1. It focuses the learning accuracy on the positive class from the completeness and exactness aspects, respectively (Ding, 2011). It is high when both Recall and Precision are high, and can be adjusted by changing the value of β (Guo et al., 2008; Thai-Nghe et al., 2009). The relative importance of precision versus recall is denoted by β, which is usually set to 1 (Chawla, 2005).
F-Measure = ((1 + β²) × Recall × Precision) / (β² × Precision + Recall)        (2.8)
Table 2.1: Confusion matrix

                   Positive Prediction     Negative Prediction
Positive Class     True Positive (TP)      False Negative (FN)
Negative Class     False Positive (FP)     True Negative (TN)
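Given the four counts of Table 2.1, Equations 2.1 to 2.8 can be computed directly; the minimal sketch below uses invented counts chosen to mimic a skewed dataset.

/** Evaluation metrics from the four bi-class counts of Table 2.1 (Equations 2.1-2.8). */
class ConfusionMetrics {
    final double tp, fn, fp, tn;

    ConfusionMetrics(double tp, double fn, double fp, double tn) {
        this.tp = tp; this.fn = fn; this.fp = fp; this.tn = tn;
    }

    double accuracy()    { return (tp + tn) / (tp + tn + fp + fn); } // (2.1)
    double precision()   { return tp / (tp + fp); }                  // (2.2) PPV
    double recall()      { return tp / (tp + fn); }                  // (2.3) TPR / sensitivity
    double fpr()         { return fp / (fp + tn); }                  // (2.4)
    double fnr()         { return fn / (tp + fn); }                  // (2.5)
    double specificity() { return tn / (tn + fp); }                  // (2.6) TNR
    double npv()         { return tn / (tn + fn); }                  // (2.7)

    double fMeasure(double beta) {                                   // (2.8)
        double b2 = beta * beta, p = precision(), r = recall();
        return (1 + b2) * r * p / (b2 * p + r);
    }

    public static void main(String[] args) {
        // Invented counts for a skewed dataset: 50 positives, 950 negatives.
        ConfusionMetrics m = new ConfusionMetrics(20, 30, 10, 940);
        System.out.printf("accuracy=%.3f recall=%.3f precision=%.3f F1=%.3f%n",
                m.accuracy(), m.recall(), m.precision(), m.fMeasure(1.0));
        // Accuracy is 0.960 although minority recall is only 0.400 (F1 = 0.500),
        // which is why accuracy alone is misleading on imbalanced data.
    }
}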
2.4.3 Kappa Statistic or Cohen's Kappa Coefficient
This is used to measure the agreement between the predicted and observed categorisations of a dataset, while correcting for agreement that occurs by chance. Its maximum value is 100% (perfect agreement), and the expected value for a random predictor with the same column totals is 0 (no agreement) (Witten et al., 2011). Cohen's kappa coefficient is a statistical measure of inter-rater or inter-annotator agreement for qualitative (categorical) items (Carletta, 1996). It is generally thought to be a more robust measure than a simple percent agreement calculation, since κ takes into account the agreement occurring by chance.
The equation for κ is depicted in (Equation 2.9):

κ = (Pr(a) − Pr(e)) / (1 − Pr(e))        (2.9)
where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probability of each observer randomly assigning each category. If the raters are in complete agreement, then 𝑘 = 1. If there is no agreement among the raters other than what would be expected by chance (as defined by Pr(e)), 𝑘 = 0. Landis and Koch (1977) characterised kappa values as follows: 𝑘 < 0 indicates no agreement, 0 < 𝑘 ≤ 0.20 slight, 0.21 < 𝑘 ≤ 0.40 fair, 0.41 < 𝑘 ≤ 0.60 moderate, 0.61 < 𝑘 ≤ 0.80 substantial, and 0.81 < 𝑘 ≤ 1 almost perfect agreement.
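For a bi-class confusion matrix, Pr(a) and Pr(e) can be computed directly from the four counts; a minimal sketch follows, reusing the invented counts from the earlier example.

/** Cohen's kappa from a bi-class confusion matrix (Equation 2.9). */
class KappaStatistic {
    static double kappa(double tp, double fn, double fp, double tn) {
        double total = tp + fn + fp + tn;
        double prA = (tp + tn) / total; // observed agreement
        // Chance agreement: product of marginal proportions, summed over both classes.
        double prE = ((tp + fn) / total) * ((tp + fp) / total)
                   + ((tn + fp) / total) * ((tn + fn) / total);
        return (prA - prE) / (1 - prE);
    }

    public static void main(String[] args) {
        // Same invented counts as before: high accuracy, only moderate agreement
        // beyond chance (kappa is about 0.481).
        System.out.printf("kappa = %.3f%n", kappa(20, 30, 10, 940));
    }
}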
2.4.4 G-Means Criterion
Also known as the geometric mean, this metric combines the performance on both the positive and negative classes, i.e. the geometric mean of the accuracies measured separately on each class, as depicted by (Equation 2.10) with values calculated from Table 2.1. It is calculated as the square root of the product of the prediction accuracies for the two classes; high prediction accuracy on both the positive and negative classes gives rise to a high G-means value (Ding, 2011). It also measures the avoidance of overfitting to the negative class and the degree to which the positive class is ignored.
G-Mean = √(Sensitivity × Specificity)        (2.10)

2.4.5 Matthews Correlation Coefficient (MCC)
This is a strong metric that considers both the accuracies and the error rates on both classes, since all four values in the confusion matrix are involved in its formula. A high MCC value means the learner has high accuracy on both the positive and negative classes, and also little misclassification on the two classes. Therefore, MCC can be considered the best singular assessment metric so far (Ding, 2011). It is represented in (Equation 2.11).
MCC = (TP × TN − FP × FN) / √(Pc × Nc × Pr × Nr)        (2.11)

where Pc = TP + FN, Nc = TN + FP, Pr = TP + FP and Nr = FN + TN.
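Equations 2.10 and 2.11 can likewise be computed from the same four counts; a brief sketch with the same invented example:

/** G-mean (Equation 2.10) and Matthews correlation coefficient (Equation 2.11). */
class BalancedMetrics {
    static double gMean(double tp, double fn, double fp, double tn) {
        double sensitivity = tp / (tp + fn);
        double specificity = tn / (tn + fp);
        return Math.sqrt(sensitivity * specificity);
    }

    static double mcc(double tp, double fn, double fp, double tn) {
        double pc = tp + fn, nc = tn + fp;  // actual positives / negatives
        double pr = tp + fp, nr = fn + tn;  // predicted positives / negatives
        return (tp * tn - fp * fn) / Math.sqrt(pc * nc * pr * nr);
    }

    public static void main(String[] args) {
        // Invented counts as before: both metrics penalise the poor minority recall
        // that plain accuracy (0.96) hides.
        System.out.printf("G-mean = %.3f, MCC = %.3f%n",
                gMean(20, 30, 10, 940), mcc(20, 30, 10, 940));
    }
}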
2.4.6 ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve of ROC)
The ROC graph is a technique for visualising, organising and selecting classifiers based on their performance (Fawcett, 2003). It has properties that make it especially useful for domains with skewed class distributions and unequal classification error costs (Fawcett, 2006). It is a two-dimensional graph in which the TP rate is plotted on the Y-axis and the FP rate on the X-axis; it depicts the relative trade-offs between benefits (TPR) and costs (FPR), as shown in Figure 2.3. All the metrics discussed so far are based on fixed values of TP, TN, FP and FN, which can easily be collected when the class labels and predicted values are both discrete. However, in some other cases, such as Bayesian networks, some neural networks or some ensemble classifiers, the predictions on testing data are continuous values, and a threshold has to be chosen to discretise them. Shifting the threshold within a certain range produces different groups of TP, TN, FP and FN values.
By linking these TPR and FPR values jointly and plotting them on a two-dimensional axis, a Receiver Operating Characteristics (ROC) graph is constructed, as depicted in Figure 2.3.

Figure 2.3: ROC_AUC graph (Ding, 2011)

The ideal model should produce a point at position A, the top left corner of Figure 2.3, where TPR is 1 and FPR is 0, while the worst model would be point B at the bottom right corner.
Hence, a good classification model should be as close to the top left corner as possible.
Meanwhile, a model making random guesses will be located on the diagonal, where TPR and FPR are equal to each other. Note that point D at the bottom left corner means the classifier predicts every example as negative, and point C at the top right means all the predictions are positive. The ROC curve is created by connecting all groups of TPR and FPR values, together with points D and C. The closer the ROC curve approaches the top-left corner, the better the classification performance (Weiss and Provost, 2003). However, directly comparing two or more ROC curves is challenging and often impractical; for example, two curves may be interleaved, making it hard to claim which is better. Thus, a single numerical value representing the effectiveness of the ROC curve is necessary, which brings about the Area Under the ROC Curve (ROC_AUC).
Clearly, the ROC_AUC value ranges from 0 to 1, and the higher it is, the better the classifier. Although the ROC curve provides a straightforward visualisation for performance evaluation, it also has a particular limitation when applied to highly imbalanced datasets (Davis and Goadrich, 2006).
An attractive property of ROC graphs is that they are insensitive to changes in class distribution: if the proportion of positive to negative instances changes in a test set, the ROC curves will not change. Because ROC graphs are based upon TPR and FPR, each of which is a strict columnar ratio of the confusion matrix, they do not depend on the class distribution (Fawcett, 2006). (Equation 2.12) gives the formula for calculating the ROC_AUC of a single (TPR, FPR) point.
ROC_AUC = (1 + TPR − FPR) / 2        (2.12)
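For a single discrete classifier, the ROC "curve" consists of the two segments joining (0, 0), (FPR, TPR) and (1, 1), and Equation 2.12 is exactly the area under that polyline; a one-method sketch, reusing the earlier invented counts:

/** AUC of a single-point ROC curve (Equation 2.12): area under (0,0) -> (FPR,TPR) -> (1,1). */
class SinglePointAuc {
    static double auc(double tpr, double fpr) {
        return (1 + tpr - fpr) / 2;
    }

    public static void main(String[] args) {
        // With the invented counts used earlier: TPR = 0.400, FPR = 10/950.
        System.out.printf("ROC_AUC = %.3f%n", auc(0.400, 10.0 / 950.0));
    }
}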
2.4.7 Precision-Recall Curves (PRC)
The PRC depicts the relationship between precision and recall as the classification threshold varies (Thai-Nghe et al., 2009; Davis and Goadrich, 2006). The recall, which measures how often a positive-class instance in the dataset is predicted as positive, is plotted on the X-axis, and the precision, which measures how often an instance predicted as positive is actually positive, is plotted on the Y-axis.