UNIVERSITY OF IBADAN LIBRARY
𝐻0: All 𝛽𝑗 are equal.
𝐻1: At least one 𝛽𝑗 is different.
3.4.4 Tukey–Kramer method
The Tukey–Kramer method is a single-step multiple comparison procedure and statistical test. It can be applied to raw data or used in conjunction with an ANOVA (as a post-hoc analysis) to find means that are significantly different from each other. It compares all possible pairs of means and is based on the studentized range distribution (q).
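As an illustration only, the pairwise comparison can be sketched as follows. The function name `tukey_kramer_q` and the group data in the usage note are hypothetical and are not taken from this study. The sketch computes only the studentized range statistic q for each pair of groups; deciding significance still requires comparing each q with a critical value of the studentized range distribution, which is not computed here.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def tukey_kramer_q(groups):
    """Return {(name_a, name_b): q} for every pair of groups.

    `groups` maps a group name to a list of observations.
    Group sizes may differ (the Kramer extension of Tukey's test).
    """
    k = len(groups)
    n_total = sum(len(obs) for obs in groups.values())
    means = {name: mean(obs) for name, obs in groups.items()}

    # Within-group mean square (MSE) from a one-way ANOVA.
    sse = sum((x - means[name]) ** 2
              for name, obs in groups.items() for x in obs)
    mse = sse / (n_total - k)

    q_values = {}
    for a, b in combinations(groups, 2):
        # Standard error of the difference for possibly unequal group sizes.
        se = sqrt(mse / 2 * (1 / len(groups[a]) + 1 / len(groups[b])))
        q_values[(a, b)] = abs(means[a] - means[b]) / se
    return q_values
```

For example, `tukey_kramer_q({"A": [1, 2, 3], "B": [2, 3, 4], "C": [7, 8, 9]})` returns one q value for each of the three pairs; a pair is declared significantly different when its q exceeds the tabulated critical value for the chosen significance level, number of groups and error degrees of freedom.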
3.4.5 Box and Whisker Plots
These plots offer a pictorial summary of important dataset characteristics, including central tendency, dispersion, asymmetry and extremes, arrived at through percentile rank analysis and the plotting of the maximum and minimum dataset values. Their graphically compact nature facilitates side-by-side comparison of multiple datasets, which can otherwise be difficult to interpret using more complete representations such as the histogram (Banaco, 2011). Ordered data are divided into a lower and an upper half by the median. The median of the lower half is the lower quartile, and the median of the upper half is the upper quartile. The lower extreme is the least data value and the upper extreme is the greatest value.
Each box in the box plots represents a data sampling scheme, so the important characteristics of each scheme (central tendency, skewness, dispersion and extremes) are easy to interpret and visualise. The whiskers at the ends of each box show the minimum and maximum values, while the bar shows the median. If the median bar is at or above zero, the data sampling scheme represented by the box plot is doing better on average than the data sampling scheme it is being compared with. If the complete box, including the whiskers, is above zero, then that data sampling scheme is better than the other data sampling schemes.
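The quantities drawn in a box plot, and the reading of the plot relative to zero, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the plotted values are assumed to be performance differences between one data sampling scheme and the scheme it is compared with, and for an odd number of values the median itself is left out of both halves (the text does not say which convention was used).

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Quartiles are the medians of the lower and upper halves of the
    ordered data. Requires at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Assumption: with an odd count, the middle value is excluded from both halves.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


def compare_with_baseline(differences):
    """Interpret a box of (scheme minus baseline) performance differences."""
    minimum, _, med, _, _ = five_number_summary(differences)
    if minimum > 0:
        # Whole box, including the lower whisker, lies above zero.
        return "better throughout"
    if med >= 0:
        # Median bar is at or above zero.
        return "better on average"
    return "not better on average"
```

For instance, `five_number_summary([7, 1, 3, 9, 5, 11, 13])` orders the values, splits them around the median 7, and takes the medians of the two halves as the quartiles.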
3.5.1 Diabetes Mellitus (DM) dataset
Diabetes mellitus or simply diabetes is a group of metabolic diseases in which a person has high blood sugar content, either because the pancreas does not produce enough insulin, or because cells do not respond to the insulin that is produced. This high blood sugar content produces the classical symptoms of polyuria (frequent urination), polydipsia (increased thirst) and polyphagia (increased hunger).
The three main types of DM considered were:
a. Type 1 DM, which is the outcome of the body's failure to produce insulin and requires the person to inject insulin or wear an insulin pump. This form was previously referred to as "insulin-dependent diabetes mellitus" (IDDM) or "juvenile diabetes".
b. Type 2 DM, which results from insulin resistance, a condition in which cells fail to use insulin properly. This is sometimes combined with an absolute insulin deficiency. This form was previously referred to as "non-insulin-dependent diabetes mellitus" (NIDDM) or "adult-onset diabetes".
c. Gestational diabetes (GDM), the third main form, which occurs when pregnant women without a previous diagnosis of diabetes develop a high blood glucose level. It may precede the development of Type 2 DM and is the class of interest (minority class) in this study.
Other forms of DM include congenital diabetes, which is due to genetic defects of insulin secretion, cystic fibrosis-related diabetes, steroid diabetes induced by high doses of glucocorticoids, and several forms of monogenic diabetes (Sarwar et al., 2010).
The raw data for this disease condition used in this study were obtained from the records department of the Family Medicine Clinic of the Wesley Guild Unit of Obafemi Awolowo University Teaching Hospital Complex, Ilesha, Osun State, Nigeria. The records of outgoing patients suffering from DM were extracted, reviewed and processed. This dataset was collected by Awokola (2010) for research purposes and covers the period from January 2009 to May 2010. It contained 886 instances of complete records of DM patients, 18 attributes and three classes, namely TYPE1, TYPE2 and Gestational Diabetes Mellitus (GDM). The dataset class distribution was 807:62:17, where TYPE2 had 807 instances, TYPE1 had 62 instances and GDM had only 17 instances (the minority class and also the class of interest). The dataset is highly skewed.
3.5.2 Senior Secondary School Certificate Examination Result (SSS Result) dataset
The data collected comprised the results of students in five secondary schools in Ibadan, Nigeria. The records of students who sat for the SSS examinations, consisting of both West African Examinations Council (WAEC) and National Examinations Council (NECO) results in the schools, were used for analysis. These results were for both public and private secondary schools within Ibadan metropolis, Oyo State, Nigeria, for a period of five years (2005-2009). This dataset was collected by Agboola (2010) for research purposes. For the purpose of this study, only data on English Language and Mathematics were used for the analysis because they were compulsory for all students. Students who passed both English Language and Mathematics in both WAEC and NECO were regarded as PASSBOTH, while students who failed English Language and Mathematics in both WAEC and NECO were regarded as FAILBOTH. Students who passed English Language and Mathematics in WAEC alone were regarded as PASSWAEC, while students who passed English Language and Mathematics in NECO alone were regarded as PASSNECO. The dataset contains 1163 instances with 8 attributes and four classes, namely FAILBOTH with 775 instances, PASSNECO with 248 instances, PASSWAEC with 45 instances and PASSBOTH with 95 instances. PASSWAEC is the class of interest and also the minority class in this study. The dataset class distribution was 775:248:45:95.
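The class labelling rule described above can be sketched as a small function. This is an illustration, not the procedure actually used to prepare the dataset: the function name and its boolean parameters are hypothetical, and it assumes that a student counts as passing an examination only when both English Language and Mathematics were passed in that examination, so that a student who passed only one of the two subjects is treated as not having passed it.

```python
def label_student(waec_english, waec_maths, neco_english, neco_maths):
    """Assign one of the four SSS Result classes.

    Each argument is True when the student passed that subject in
    that examination (hypothetical input encoding).
    """
    # Assumption: passing an examination means passing both subjects in it.
    passed_waec = waec_english and waec_maths
    passed_neco = neco_english and neco_maths

    if passed_waec and passed_neco:
        return "PASSBOTH"
    if passed_waec:
        return "PASSWAEC"
    if passed_neco:
        return "PASSNECO"
    return "FAILBOTH"
```

For example, `label_student(True, True, False, True)` yields the PASSWAEC class, the minority class of interest, because both subjects were passed in WAEC but not in NECO.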
3.5.3 Tuberculosis (TB) dataset
Tuberculosis (TB) is a disease caused by the bacterium Mycobacterium tuberculosis. It is usually spread through the air and attacks people with weakened immunity, such as patients with Human Immunodeficiency Virus (HIV) (Asha et al., 2011). It is a disease which can affect virtually all organs, not sparing even the relatively inaccessible sites. The microorganisms usually enter the body by inhalation through the lungs. They spread from the initial location in the lungs to other parts of the body via the bloodstream. It presents a diagnostic dilemma even for physicians with a great deal of experience with this disease.
In summary, TB is a contagious bacterial disease caused by mycobacteria that usually affects the lungs and often occurs as a co-infection with HIV/AIDS (Asha et al., 2012).
Nigeria has the tenth highest burden of TB among the 22 high-burden TB countries in the world (Lawson et al., 2012).
The medical dataset that was classified comprised 768 real records of patients suffering from tuberculosis (TB), obtained during the course of this study from Ijaye State Hospital, Ogun State. The entire dataset was stored in one file. Each record corresponds to the most relevant information about one patient. The doctors' initial queries about symptoms and the required test result details of patients were taken as the main attributes.
In aggregate, there were 12 attributes (symptoms) and four classes, namely Pulmonary TB (PTB), Extra PTB (EPTB), Retroviral PTB (RPTB) and Retroviral EPTB (REPTB). The dataset class distribution was 589:124:37:6, where PTB had 589 instances, RPTB had 124 instances, EPTB had 37 instances and REPTB had only 6 instances (the minority class and also the class of interest).
3.5.4 Contraceptive Method (CM) dataset
This dataset was collected during this study from the Government Health Centre clinic at Ibadan North East Local Government, Ibadan, Oyo State. The data cover a period of seven (7) years (2008–2014) and were collected for the purpose of this research. The dataset contained 775 instances and 20 attributes, with 5 classes, namely NONE, SECONDARY+, SECONDARY, PRIMARY and PRIMARY+. NONE represented patients without any education, that is, illiterate patients. PRIMARY represented patients who attended primary school but did not complete it. PRIMARY+ represented patients who completed primary education with a certificate. SECONDARY represented patients who went to secondary school but did not complete senior secondary school, whether or not they completed junior secondary school. SECONDARY+ stood for patients who completed secondary education, with or without further College of Education, Polytechnic or University education. The dataset class distribution was 414:247:45:53:16, where SECONDARY+ had 414 instances, SECONDARY had 247 instances, PRIMARY had 45 instances, PRIMARY+ had 53 instances and NONE had only 16 instances (the minority class and also the class of interest).
The summary of the datasets used in this study is presented in Table 3.1. It shows, for each dataset, the number of attributes, the number of classes and the percentage of the minority class.
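The minority class percentage of each dataset can be derived directly from the class distributions reported in Sections 3.5.1 to 3.5.4, as in the sketch below. The names `CLASS_COUNTS` and `minority_summary` are hypothetical. The totals are computed from the listed class counts, so the values may differ slightly from those in Table 3.1; in particular, the TB class counts listed above sum to 756, whereas 768 records are cited for that dataset.

```python
# Class counts as reported in Sections 3.5.1 to 3.5.4.
CLASS_COUNTS = {
    "DM": {"TYPE2": 807, "TYPE1": 62, "GDM": 17},
    "SSS Result": {"FAILBOTH": 775, "PASSNECO": 248,
                   "PASSWAEC": 45, "PASSBOTH": 95},
    "TB": {"PTB": 589, "RPTB": 124, "EPTB": 37, "REPTB": 6},
    "CM": {"SECONDARY+": 414, "SECONDARY": 247, "PRIMARY": 45,
           "PRIMARY+": 53, "NONE": 16},
}


def minority_summary(counts):
    """Summarise one dataset from its {class_name: instance_count} mapping."""
    total = sum(counts.values())
    minority = min(counts, key=counts.get)  # class with the fewest instances
    return {
        "instances": total,
        "classes": len(counts),
        "minority_class": minority,
        "minority_pct": 100 * counts[minority] / total,
    }


if __name__ == "__main__":
    for name, counts in CLASS_COUNTS.items():
        s = minority_summary(counts)
        print(f"{name}: {s['instances']} instances, {s['classes']} classes, "
              f"minority {s['minority_class']} = {s['minority_pct']:.2f}%")
```

In every dataset the class with the fewest instances coincides with the class of interest named in the text, and it makes up only a few percent of the instances, which is what makes the datasets highly skewed.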