controlled diabetes in pregnancy increases the danger of fetal death and other complications Dowling and Yap (2014).

Early-stage diagnosis: Diabetes mellitus may present with characteristic symptoms such as thirst, polyuria, blurring of vision, and weight loss. In its most severe forms, ketoacidosis or a non-ketotic hyperosmolar state may develop and lead to stupor and, in the absence of effective treatment, death Hauner and Scherbaum (2002).

2. STATEMENT OF THE PROBLEM

Diabetes mellitus is a global public health issue that contributes significantly to heart disease, stroke, chronic kidney failure, leg amputation, foot ulcers, nerve damage, and eye damage. Ethiopia ranks first among the top four African countries, with 2.6 million people living with diabetes. This is due to a number of factors, such as lack of awareness, limited screening protocols, limited promotion of intervention programs, globalization, rapid adoption of a western lifestyle, unhealthy eating habits such as skipping breakfast or eating junk food because of financial hardship, and poor access to health care services for diabetes in Ethiopia (scarcity of specialists, practitioners, and health facilities, and low expenditure).

Diabetes prevalence is rising at an alarming rate. According to the WHO fact sheet, 77% of individuals with diabetes live in low- and middle-income countries, making socially disadvantaged countries the most vulnerable to diabetes-related complications. In developing countries, the human and financial costs of diabetes management are high and continue to escalate.

According to the literature, about 80% of diabetes detection is based on clinical suspicion and is confirmed by a laboratory assessment of the glucose (sugar) level in the patient's blood sample. These methods are not feasible for screening because they require skilled manpower and are time consuming, which makes them inaccessible to many segments of the population. According to CSA data from 1997, 84% of people in the country live in rural areas, whereas health institutions are heavily concentrated in city centres. In addition, the current health care setup is a busy outpatient setting, and the shortage of highly trained health care providers is an acute problem in Ethiopia. This ultimately raises the cost of health care services for patients with non-communicable diseases such as diabetes and affects the standard of care Dagnew (2021).

3. RELATED WORK AND LITERATURE REVIEW

In a study led by Hongmei Yan, which considered three data mining strategies, the authors described the development of a clinical decision network to predict the presence of myocardial infarction in a cohort of 4,770 patients presenting with acute pain at two university hospitals and four community hospitals. The decision network had comparable sensitivity (88.0% versus 87.8%) but significantly higher specificity (74% versus 71%) in predicting the absence of myocardial infarction compared with physicians' decisions on whether patients needed to be admitted to the coronary care unit. If the decision to admit had depended entirely on the decision network, the admission of patients without infarction to the coronary care unit would have been reduced by 11.5% without adversely affecting patient outcomes or the quality of care Yan et al. (2006).

Beck, Huain, and Y. Huajo focused on the prevention of diabetes-related complications. Around 30 to 80% of type 2 diabetes cases remain undiagnosed. Type 2 diabetes is characterised by an asymptomatic phase of 4 to 7 years between the onset of diabetic hyperglycaemia and clinical diagnosis. The study proposed training decision trees on data covering groups with different levels of prevalence. The data were collected from clinic visitors between 2009 and 2011, and the attributes were selected by linking individual and epidemiological information to patient status at the time of the clinic visit. The selected risk-factor attributes were obesity or overweight, history of diabetes in first-degree relatives, hypertension in pregnancy, previous history of gestational diabetes, history of abortion, stillbirth, and birth of a baby under 4 kg, together with background patient and epidemiological information. The features included age, sex, history of diabetes, and body mass index (BMI). The researchers used the J48 algorithm in WEKA (version 3.9.5) to build the decision tree. The accuracy and precision of the model were 71.7% and 97.6%, respectively. The researchers concluded that the model developed using the decision tree for T2DM screening does not need laboratory tests for diagnosis Habibi et al. (2015).

Razak and Bakar conducted a study on mining association rules from asthma patients' profile datasets. The purpose of the study was to identify the attributes that influence asthma patients. The asthma patient profile dataset in this study consists of 16,384 records and 118 attributes in several formats. These attributes are grouped into demographic attributes and asthma-related attributes. The mining method includes a data preparation stage and an association rule mining stage. Understanding the character of the dataset, identifying data types and formats, detecting incomplete data, analysing the data distribution, and discretizing the data are the stages required to preprocess the data efficiently. After data preprocessing and cleaning, just 31 attributes were left from which to derive association rules. The association rule mining stage uses the Apriori algorithm. It involves determining the training and testing datasets, setting threshold values, mining association rules, and analysing the resulting rules Region (2017).

Zerihun (2017) conducted a study to develop a predictive model for pre-diabetes screening by using data mining technology on 4,529 diabetic instances with sixteen attributes from Adare General Hospital in Hawassa City, Ethiopia, with the class label prediabetes (yes or no). He focused on implementing the J48 decision tree and PART to address the problem. The experimental results show that the PART rules outperformed the decision tree classifier, with 96.9% accuracy.

Selam, A conducted a study on predicting measles outbreaks in various districts of Ethiopia. The methodology for building a predictive model using data mining techniques in this research was the hybrid six-step Cios KDP, which has six main steps. The predictive model was built from 13 selected attributes. The experiments were conducted with two classification algorithms, the decision tree and naive Bayes models, which differ in how they classify some outbreaks. The classifier has a sensitivity of 86.8%, indicating that the model is capable of recognising true cases, and a specificity of 99.7%. The next experiment used 9 attributes and scored the best accuracy, 93.31%, with a 70% split test option compared with the other experiments. Experiment number three scored the highest accuracy with both test options. The chosen algorithm supports district-based measles outbreak prediction SELAM (2012).

The Shegaw-led study Anagaw (2002) investigated the potential applicability of data mining technology to predict child mortality based on a side-by-side comparison of community-based epidemiological datasets. The researcher used neural network and decision tree techniques. In building and testing the models with the neural network approach, the simplest model was identified as the one trained using the default parameters with 9 attributes. This model had an accuracy rate of 93%. The classifier achieved an accuracy of 95% on the training cases and 95% on the test cases.

Iyer and Sumbaly (2015) used two methods, namely the Decision Tree and Naive Bayes algorithms, for the diagnosis of diabetes using classification mining techniques. They used the Pima Indians diabetes dataset from the University of California, Irvine (UCI) repository, which originates from the National Institute of Diabetes and Digestive and Kidney Diseases and has 768 instances and eight attributes, with the class labels tested positive and tested negative. The experimental results show that the Naive Bayes algorithm, with 79.5652% accuracy, outperformed the Decision Tree (J48), with 76.9565% accuracy, using a percentage split of 70:30.

4. METHODOLOGY

1) Research Design

This study follows an experimental research approach, because experiments are carried out to extract results from real-world implementations, and all the experiments and results should be reproducible. The CRISP-DM methodology is followed to explore the utility of data mining in diabetes screening across all eligible groups. This model was chosen because it exhibits the benefits of a well-known and widely used methodology and provides a more general, research-oriented description James and Sarvanakumar (2017).

2) Data Collection Methods

The primary data was gathered through interviews with domain experts, and the secondary data was gathered from written documents, conference articles, and journal publications. The dataset was collected from the medical histories of baseline diabetic patients using a secondary data collection method, also referred to as the retrospective method Tella (2015).

3) Evaluation

The performance and accuracy of the models created by the J48 decision tree, Naive Bayes, JRIP, and PART rule induction algorithms were evaluated. The models were assessed using a confusion matrix, the ROC curve, 10-fold cross-validation, and a percentage split of the prepared dataset, with 70% for training and 30% for testing.
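As a rough illustration of the two test options, the sketch below computes how many records fall into the training and testing parts of a 70% split, and how large each fold is under 10-fold cross-validation. It assumes the 731 records reported in the experiments; the function names are hypothetical, and the rounding rule is an assumption rather than a description of WEKA's internals.

```python
def percentage_split_sizes(n_records: int, train_fraction: float) -> tuple[int, int]:
    """Return (train_size, test_size) for a percentage split.

    Assumption: the training size is rounded to the nearest whole record.
    """
    train_size = round(n_records * train_fraction)
    return train_size, n_records - train_size


def k_fold_sizes(n_records: int, k: int) -> list[int]:
    """Return the size of each of the k test folds (sizes differ by at most one)."""
    base, extra = divmod(n_records, k)
    return [base + 1 if i < extra else base for i in range(k)]


# 731 is the dataset size reported in the experiments.
train_size, test_size = percentage_split_sizes(731, 0.70)
fold_sizes = k_fold_sizes(731, 10)

print("70% split:", train_size, "training /", test_size, "testing")
print("10-fold test fold sizes:", fold_sizes)
```

Under these assumptions the 70% portion is the larger (training) part and the 30% portion is the smaller (testing) part, and every record is used for testing exactly once across the ten folds.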

5. DATA UNDERSTANDING AND PREPARATION

5.1. Handling Missing Value

There were some missing values in the data collected for this research project, such as the type of food typically consumed and the age of the patient. Some of these values were filled in from the patient's next visit, and others were filled in with the assistance of the domain experts, based on the type of diabetes and the risk-factor profile of the patient, so that all the missing values were filled with an appropriate value.

Table 1 Attributes with missing values
No	Attribute	Missed Values
1	Pregnancies	108
2	Glucose	5
3	Blood Pressure	35
4	Skin Thickness	216
5	Insulin	354
6	BMI	11
7	Cholesterol	1
8	DBP	18
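To put the counts in Table 1 in proportion, the following sketch converts each count into a percentage of the dataset. It assumes that the counts refer to the 731 records used in the experiments, which the table itself does not state; the variable and function names are illustrative only.

```python
TOTAL_RECORDS = 731  # assumption: dataset size reported in the experiments

# Missing-value counts copied from Table 1.
missing_counts = {
    "Pregnancies": 108,
    "Glucose": 5,
    "Blood Pressure": 35,
    "Skin Thickness": 216,
    "Insulin": 354,
    "BMI": 11,
    "Cholesterol": 1,
    "DBP": 18,
}


def missing_rate(count: int, total: int = TOTAL_RECORDS) -> float:
    """Percentage of records in which the attribute is missing."""
    return 100.0 * count / total


rates = {name: missing_rate(count) for name, count in missing_counts.items()}

# List the attributes from most to least affected.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:15s} {rate:5.1f}%")
```

Under this assumption, Insulin and Skin Thickness are the attributes most affected by missing values, so the way they are filled in has the largest influence on the prepared dataset.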

5.2. Data Discretization

Discretization replaces actual data values with interval labels. Smoothing techniques such as binning, and the construction of new attributes from a derived concept hierarchy, are the most commonly used approaches. In this dataset, the continuous "AGE" and "BMI" attributes were changed to discrete values through a discretization (binning) process Marzuki and Ahmad (2007). After the discretization process, the number of distinct values of the age attribute was reduced from 46 to 6.

Table 2 Summary of Derived Attributes with Their Values
No	Original Attributes	New value
1	Age of participant	0-34, 35-43, 44-52, 53-61, 62-70, >71
2	Body mass index (BMI)	BMI <= 11.8 kg/m2 = underweight; BMI 11.8-22.36 kg/m2 = normal; BMI 23-33.6 kg/m2 = overweight; BMI 34-44.7 kg/m2 = obese class 1; BMI 45-55.9 kg/m2 = obese class 2; BMI >= 56 kg/m2 = very obese
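The age binning in Table 2 can be sketched as a lookup over the upper bounds of the intervals. This is a minimal illustration rather than the procedure actually used in WEKA: it assumes whole-number ages, and it assumes that every age above 70 falls into the last interval, which the table labels ">71". The names `AGE_UPPER_BOUNDS`, `AGE_LABELS`, and `age_bin` are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the first five age intervals in Table 2.
AGE_UPPER_BOUNDS = [34, 43, 52, 61, 70]
# Interval labels as printed in Table 2 (the last one covers all older ages,
# which is an assumption about how the ">71" bin was applied).
AGE_LABELS = ["0-34", "35-43", "44-52", "53-61", "62-70", ">71"]


def age_bin(age: int) -> str:
    """Map a whole-number age to its interval label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_left returns the index of the first upper bound >= age,
    # or len(AGE_UPPER_BOUNDS) when the age exceeds every bound.
    return AGE_LABELS[bisect_left(AGE_UPPER_BOUNDS, age)]


for sample_age in (29, 35, 52, 70, 84):
    print(sample_age, "->", age_bin(sample_age))
```

Whatever the number of distinct ages in the raw data, the binned attribute can take at most six values, which matches the reduction from 46 to 6 distinct values described above.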

6. EXPERIMENTATION AND RESULTS ANALYSIS

6.1. J48 Algorithm

Experiment I

This experiment was conducted under the 10-fold cross-validation test option with the default parameters of WEKA. The algorithm generated a decision tree with 91 leaves and a size of 176. Of the 731 instances, 467 (63.88%) were correctly classified and 264 (36.11%) were incorrectly classified. The model took 0.01 seconds to build.

Table 3 10-fold test for J48 algorithm
Algorithm	Test Option	Precision	Recall	ROC Area	Class
J48	10-fold	52.20%	51.40%	59.40%	Diabetes
		70.08%	71.40%	59.40%	Free Diabetes

Experiment II

This experiment was conducted using the percentage split test option to train and test the classification model Eyasu et al. (2020). Out of the 731 total records, 512 (70%) of the instances were used as a training dataset and the remaining 219 (30%) were used as a testing dataset. Of the 219 testing instances, the J48 learning algorithm classified 138 (63.01%) correctly and the remaining 81 (36.98%) incorrectly. The algorithm generated a decision tree with 91 leaves and a size of 176 and took 0.06 seconds to build the model.

Table 4 70% split test for J48 Classification algorithm
Algorithm	Test Option	Precision	Recall	ROC Area	Class
J48	70% split	51.20%	50.60%	59%	Diabetes
		70.10%	70.60%	59%	Free Diabetes

To conclude, experiments I and II were performed to build the classification model using the J48 classification algorithm by applying k-fold cross-validation and the percentage split method, respectively Yu et al. (2004).

Table 5 Detailed Accuracy by Class for J48 classification algorithm
Detailed Accuracy by Class
J48	Precision	Recall	ROC Area	Class
	92.40%	88%	96.80%	Diabetes
	92.90%	95.60%	96.80%	Free diabetes

Confusion matrix for J48 Algorithm

The confusion matrix is useful for analysing how well the classifier recognises tuples of the different classes. From the two-way table, the sensitivity (true positive rate) is (243 / (243 + 33)) * 100 = 88.04%, and the specificity (true negative rate) is (435 / (435 + 20)) * 100 = 95.60%. The overall accuracy of this algorithm was 92.74%, the highest among the algorithms used in this study.

Table 6 Confusion Matrix for J48 Decision Tree algorithm
Confusion Matrix
Diabetes	Free diabetes	Class
243	33	Diabetes
20	435	Free diabetes
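The sensitivity, specificity, and accuracy calculations can be reproduced directly from the four cells of Table 6. The sketch below is a minimal illustration that treats "Diabetes" as the positive class; the function name `binary_metrics` and its return format are assumptions made for this example.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """Compute percentage metrics from a 2x2 confusion matrix.

    tp/fn are the actual-positive row; fp/tn are the actual-negative row.
    """
    total = tp + fn + fp + tn
    return {
        "sensitivity": 100.0 * tp / (tp + fn),  # true positive rate
        "specificity": 100.0 * tn / (tn + fp),  # true negative rate
        "precision": 100.0 * tp / (tp + fp),    # positive predictive value
        "accuracy": 100.0 * (tp + tn) / total,
        "total": float(total),
    }


# Cells copied from Table 6 (rows are actual classes, columns are predictions).
j48 = binary_metrics(tp=243, fn=33, fp=20, tn=435)

for name, value in j48.items():
    print(f"{name:12s} {value:8.2f}")
```

With these cells the four counts add up to the 731 instances in the dataset, and the accuracy comes out at about 92.7%, in line with the 92.74% reported for J48 in the conclusion.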

ROC Analysis for J48 Algorithm

ROC analysis provides tools to select the best models and discard suboptimal ones, and it is directly related to the cost-benefit analysis of diagnostic decisions. Figure 1 depicts the area under the ROC curve for the diabetes screening cases. For the class value yes, the area under the ROC curve is 98.45% for the model built from all 18 attributes.

Figure 1 ROC curve of the J48 classification algorithm
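The area under the ROC curve can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by comparing every positive-negative pair. The labels and scores are made-up toy values used only to show the calculation; they are not taken from the study's dataset, and the function name is hypothetical.

```python
def auc_from_scores(labels: list[int], scores: list[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier confidence that the instance is positive.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): three positive and three negative instances.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(f"toy AUC = {toy_auc:.3f}")
```

A value of 1.0 means every positive case is ranked above every negative case, while 0.5 corresponds to random ranking, which is why ROC areas close to one indicate a well-separating model.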

6.2. PART Algorithm

Experiment I

This experiment was conducted under the 10-fold cross-validation test option with the default parameters of WEKA, and the algorithm generated a PART model. Of the 731 instances, 458 (62.65%) were correctly classified and 273 (37.34%) were incorrectly classified. The model took 0.03 seconds to build.

Table 7 10-fold test for PART Classification algorithm
Algorithm	Test Option	Precision	Recall	ROC Area	Class
PART	10-fold	50.60%	49.30%	63.30%	Diabetes
		69.70%	70.80%	63.30%	Free Diabetes

Experiment II

The percentage split test option was used to train and test the classification model. Out of the 731 total records, 512 (70%) of the instances were used as a training dataset and the remaining 219 (30%) were used as a testing dataset. Of the 219 testing instances, the PART algorithm classified 133 (60.73%) correctly and the remaining 86 (39.26%) incorrectly.

Table 8 70% split test for PART Classification algorithm
Algorithm	Test Option	Precision	Recall	ROC Area	Class
PART	70% split	48.10%	47.00%	55.90%	Diabetes
		70.04%	69.10%	55.90%	Free Diabetes

Experiment I and Experiment II show the classification accuracy of the models under the two test options. The first experiment, based on 10-fold cross-validation, classified with a 62.65% accuracy rate, and the second experiment, based on a 70%:30% percentage split, classified with a 60.73% accuracy rate.

Table 9 Detailed Accuracy by Class for PART algorithm
Detailed Accuracy by Class
PART	Precision	Recall	ROC Area	Class
	97.00%	81.50%	97.20%	Diabetes
	89.80%	98.50%	97.20%	Free diabetes

Confusion matrix for PART algorithm

The confusion matrix is useful for analysing how well the classifier recognises tuples of the different classes Kabakchieva (2016). From the two-way table, the sensitivity (true positive rate) is (225 / (225 + 51)) * 100 = 81.5%, and the specificity (true negative rate) is (448 / (448 + 7)) * 100 = 98.46%. The overall accuracy of this algorithm was 92.06%, the second highest among the algorithms used in this study.

Table 10 Confusion Matrix for PART algorithm
Confusion Matrix
Diabetes	Free diabetes	Class
225	51	Diabetes
7	448	Free diabetes

Figure 2 Number of people suffering from diabetes

ROC Analysis for PART Algorithm

ROC analysis is directly related to the cost-benefit analysis of diagnostic decisions made with PART rule induction. Figure 3 shows the area under the ROC curve for the prediabetes screening instances. For the class value yes, the area under the ROC curve is 99.22% for the model built from all attributes.

Figure 3 ROC curve of the PART algorithm

6.3. Naive Bayes Algorithm

Experiment I

This experiment was conducted under the 10-fold cross-validation test option with the default parameters of WEKA, and the algorithm generated a Naive Bayes model. Of the 731 instances, 484 (66.21%) were correctly classified and 247 (33.79%) were incorrectly classified.

Table 11 10-fold test for Naive Bayes classification algorithm
Algorithm	Test Option	Precision	Recall	ROC Area	Class
Naive Bayes	10-fold	56.60%	45.30%	66.20%	Diabetes
		70.40%	78.90%	66.20%	Free Diabetes

Experiment II

The percentage split test option was used to train and test the classification model, with 70% of the 731 total records used as a training dataset and the remaining 30% as a testing dataset. Of the 512 instances evaluated, the Naive Bayes learning algorithm classified 291 (56.83%) correctly and the remaining 221 (43.16%) incorrectly.

Table 12 70% split for Naïve Bayes classification algorithm
Algorithm	Test Option	Precision	Recall	ROC Area	Class
Naïve Bayes	70% split	49.30%	43.40%	64.80%	Diabetes
		67.80%	72.80%	64.80%	Free Diabetes

Experiment I and Experiment II show the classification accuracy of the models under the two test options. The first experiment, based on 10-fold cross-validation, classified with a 66.21% accuracy rate, and the second experiment, based on a 70%:30% percentage split, classified with a 61.64% accuracy rate.

Table 13 Detailed Accuracy by Class for Naïve Bayes algorithm
Detailed Accuracy by Class
Naïve Bayes	Precision	Recall	ROC Area	Class
	59.70%	47.80%	70.30%	Diabetes
	71.80%	80.40%	70.30%	Free diabetes

Confusion matrix for Naive Bayes Algorithm

From the two-way table, the sensitivity (true positive rate) is (132 / (132 + 144)) * 100 = 47.8%, and the specificity (true negative rate) is (366 / (366 + 89)) * 100 = 80.44%. The overall accuracy of this algorithm was 68.12%.

Table 14 Confusion Matrix for Naïve Bayes Algorithm
Confusion Matrix
Diabetes	Free diabetes	Class
132	144	Diabetes
89	366	Free diabetes

ROC Analysis for Naive Bayes Algorithm

ROC analysis is directly related to the cost-benefit analysis of diagnostic decisions. Figure 4 shows the area under the ROC curve for the diabetes screening instances. For the class value yes, the area under the ROC curve for the selected attributes is 70.31%.

Figure 4 ROC curve of the Naive Bayes algorithm

6.4. JRIP Algorithm

Experiment I

This experiment was performed using the JRIP Rule induction algorithm with 10-fold cross validation, and the outcome of this experiment is presented in Table 15 below.

Table 15 10-fold Cross Validation for JRIP algorithm
Algorithm	Test Option	Precision	Recall	ROC Area	Class
JRIP	10-fold	56.30%	52.20%	64.60%	Diabetes
		72.40%	75.40%	64.60%	Free Diabetes

Experiment II

Of the 219 testing instances, the JRIP algorithm classified 147 (67.12%) correctly and the remaining 72 (32.87%) incorrectly.

Table 16 70% split for JRIP classification algorithm
Algorithm	Test Option	Precision	Recall	ROC Area	Class
JRIP	70% split	56.30%	59.30%	66.90%	Diabetes
		74.20%	72.10%	66.90%	Free Diabetes

To conclude, experiments I and II were performed to build the classification model using the JRIP classification algorithm by applying k-fold cross-validation and the percentage split method, respectively.

Table 17 Detailed Accuracy by Class for JRIP algorithm
Detailed Accuracy by Class
JRIP	Precision	Recall	ROC Area	Class
	64.50%	49.30%	66.40%	Diabetes
	73.10%	83.50%	66.40%	Free diabetes

Confusion Matrix for JRIP Algorithm

From the two-way table, the sensitivity (true positive rate) is (136 / (136 + 140)) * 100 = 49.27%, and the specificity (true negative rate) is (380 / (380 + 75)) * 100 = 83.51%. The overall accuracy of this algorithm was 70.58%, which is considerably lower than that of the J48 and PART models.

Table 18 Confusion Matrix for JRIP Algorithm
Confusion Matrix
Diabetes	Free diabetes	Class
136	140	Diabetes
75	380	Free diabetes

ROC Analysis for JRIP Algorithm

ROC analysis is directly related to the cost-benefit analysis of diagnostic decisions. Figure 5 depicts the area under the ROC curve for the diabetes screening cases. For the class value yes, the area under the ROC curve for the selected attributes is 66.87%.

Figure 5 ROC curve of the JRIP Algorithm

Comparison among Classification Algorithms

One of the aims of this research is to select the classification algorithm that builds the best performing model. Therefore, the table below compares the output of all four models in terms of the accuracy of the model, the time taken to build the model, and the correctly classified positive (Yes) and negative (No) instances, under the 10-fold cross-validation and 70% split test options.

Table 19 Comparison of 10-fold test and 70% split test option
	10-fold test option		70% split test option
Algorithm	Correctly classified	Incorrectly Classified	Correctly classified	Incorrectly Classified
J48	62.51%	37.48%	61.64%	38.35%
Naive Bayes	66.21%	33.79%	61.64%	38.35%
PART	62.38%	37.61%	59.36%	40.63%
JRIP	66.62%	33.37%	67.12%	32.87%

Among the tested classification algorithms, the JRIP algorithm had the highest accuracy of 67.12%. Accordingly, this algorithm was chosen for classifications of diabetes risk.
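The selection step can be expressed as a simple lookup over the accuracies in Table 19. The sketch below is only an illustration of that comparison; the dictionary layout and the function name `best_algorithm` are assumptions introduced for this example.

```python
# Percentage of correctly classified instances from Table 19:
# (10-fold cross-validation, 70% split).
table_19 = {
    "J48": (62.51, 61.64),
    "Naive Bayes": (66.21, 61.64),
    "PART": (62.38, 59.36),
    "JRIP": (66.62, 67.12),
}

TEST_OPTIONS = {"10-fold": 0, "70% split": 1}


def best_algorithm(test_option: str) -> tuple[str, float]:
    """Return the algorithm with the highest accuracy for a test option."""
    column = TEST_OPTIONS[test_option]
    name = max(table_19, key=lambda algorithm: table_19[algorithm][column])
    return name, table_19[name][column]


for option in TEST_OPTIONS:
    algorithm, accuracy = best_algorithm(option)
    print(f"{option:10s} best: {algorithm} ({accuracy:.2f}%)")
```

On the values in Table 19, the same algorithm comes out on top under both test options, so the choice does not depend on which of the two options is used for the comparison.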

Figure 6 Predicted accuracy of each algorithm under the 10-fold and 70% split test options

7. DISCUSSION OF THE MAJOR FINDINGS

For this study, four algorithms, namely J48, PART, Naive Bayes, and JRIP, were selected and tested on the diabetic datasets in order to generate rules. The results of each algorithm in the previous experiments are analysed one by one below.

The J48 algorithm is the most accurate model among the four, based on its results in terms of performance, time, labelling, specificity, and confusion matrix. The J48 algorithm took 0.02 seconds and correctly classified 678 records according to the class they belong to. The model also performed well more consistently than the others. Its ROC area, 96.8%, is close to one, and its precision and recall (92.9% and 95.6%) are similarly high.

The second best performing model according to the above criteria is the PART classifier. This model scored an accuracy of 92.06% on the full dataset in classifying the status of diabetic patients. The algorithm took 0 seconds and correctly classified 673 instances. The precision was 89.8% and the recall was 98.5%. This is the most promising result after the J48 algorithm.

The third best performing model according to the above criteria is the JRIP model. This model scored an accuracy of 70.58% on the full dataset in classifying the status of diabetic patients. The algorithm took 0 seconds and correctly classified 516 instances. The precision was 73.1% and the recall was 83.5%.

The fourth best performing model according to the above criteria is the Naive Bayes model, whose results are close to those of the JRIP classifier. This model scored an accuracy of 68.12% on the full dataset in classifying the status of diabetic patients. The algorithm took 0.1 seconds and correctly classified 498 instances. The precision was 71.8% and the recall was 80.4%.

Overall, the J48 model is the best performing model, with good accuracy. PART rule induction is the second best performing model, whereas the JRIP and Naive Bayes algorithms are the weakest classifiers. Among these algorithms, J48 is the best at classifying the diabetic patient datasets and generating rules.

8. CONCLUSION AND RECOMMENDATIONS

Conclusion

This experimental research, which followed the CRISP-DM methodological approach, used predictive modelling techniques to address the problem. The experimental results show that, among the selected algorithms tested, the decision tree classifier (J48) scored the highest accuracy and was the best predictor (92.74%), followed by PART (92.06%), JRIP (70.58%), and Naive Bayes (68.12%).

Recommendation and Future Work

This study showed the potential applicability of data mining classification algorithms to diabetes screening datasets. Based on the findings of the study, we recommend the following as future research directions:

· We used the J48 decision tree, PART, JRIP, and the Naive Bayes classifier. Further research could use ANN, KNN, SVM, and other algorithms.

· It is difficult to get well-organized, correct, and high-quality data for the mining algorithms. We suggest that health centres record and organise their data systematically for data analysis.

· More research and development efforts are needed to explore the variety of data mining techniques that can be applied to diabetes and free diabetes datasets.

· Integrating data mining techniques into existing systems and computerizing manual recording systems in databases is a priority issue.

· Web-based software should be developed for the performance evaluation of various classifiers, where users can simply submit their dataset and evaluate the results for the patient.

REFERENCES

A. G. Eapen, (2004) "Application of Data mining in Medical Applications by," Univ. Waterloo, Retrieved from https://uwspace.uwaterloo.ca/handle/10012/772

A. Iyer, J. S, and R. Sumbaly, (2015) "Diagnosis of Diabetes Using Classification Mining Techniques," Int. J. Data Min. Knowl. Manag. Process, vol. 5, no. 1, pp. 01-14, Retrieved from https://doi.org/10.5121/ijdkp.2015.5101

A. SELAM, (2012) "PREDICTING THE OCCURRENCE OF MEASLES OUTBREAK IN ETHIOPIA USING DATA MINING TECHNOLOGY." Addis Ababa University,

A. Tella, (2015) "Electronic and paper based data collection methods in library and information science research: A comparative analyses," New Libr. World, vol. 116, no. 9-10, pp. 588-609, Retrieved from https://doi.org/10.1108/NLW-12-2014-0138

B. Dagnew et al., (2021) "Hypertriglyceridemia and Other Plasma Lipid Profile Abnormalities among People Living with Diabetes Mellitus in Ethiopia: A Systematic Review and Meta-Analysis," Biomed Res. Int., vol. 2021, Retrieved from https://doi.org/10.1155/2021/7389076

B. S. Kumar and D. G. R., (2016) "A Survey on Data Mining Approaches to Diabetes Disease Diagnosis and Prognosis," Ijarcce, vol. 5, no. 12, pp. 463-467, Retrieved from https://doi.org/10.17148/IJARCCE.2016.512105

B. Zerihun, (2017) "Developing a Predictive Model for Pre-Diabetes Screening by Using Data Mining Technology." Addis Ababa University,

D. Kabakchieva, (2016) "Predicting Student Performance by Using Data Mining Methods for Classification," no. March 2013, Retrieved from https://doi.org/10.2478/cait-2013-0006

H. Hauner and W. A. Scherbaum, (2002) "Type 2 diabetes," DMW - Dtsch. Medizinische Wochenschrift, vol. 127, no. 19, pp. 1003-1005, Retrieved from https://doi.org/10.1055/s-2002-28326

H. Yan, Y. Jiang, J. Zheng, C. Peng, and Q. Li, (2006) "A multilayer perceptron-based medical decision support system for heart disease diagnosis," Expert Syst. Appl., vol. 30, no. 2, pp. 272-281, Retrieved from https://doi.org/10.1016/j.eswa.2005.07.022

I. M. Ahmed, A. M. Mahmoud, M. Aref, and A.-B. M. Salem, (2012) "A study on expert systems for diabetic diagnosis and treatment," Recent Adv. Inf. Sci., pp. 363-367,

J. James and K. Sarvanakumar, (2017) "Empirical Study on Data Mining Algorithms related to Breast Cancer," Indusedu.Org, vol. 07, no. 03, pp. 14-18, Retrieved from http://www.indusedu.org/pdfs/IJRIME/IJRIME_1088_90543.pdf

J. M. Dowling and C.-F. Yap, (2014) "Communicable Diseases in Developing Countries," Commun. Dis. Dev. Ctries., 2014. Retrieved from https://doi.org/10.1057/9781137354785

J. Yu, H. Huang, and S. Tian, (2004) "Cluster validity and stability of clustering algorithms," Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 3138, no. 3, pp. 957-965, Retrieved from https://doi.org/10.1007/978-3-540-27868-9_105

K. Eyasu, W. Jimma, and T. Tadesse, (2020) "Developing a Prototype Knowledge-Based System for Diagnosis and Treatment of Diabetes Using Data Mining Techniques," Ethiop. J. Health Sci., vol. 30, no. 1, pp. 115-124, Retrieved from https://doi.org/10.4314/ejhs.v30i1.15

O. Region, (2017) "Research in Molecular Medicine Prevalence of Prediabetes and its Risk Factors among the Employees of Ambo," vol. 5, no. 3, pp. 11-20, Retrieved from https://doi.org/10.29252/rmm.5.3.11

R. Williams et al., (2020) "Global and regional estimates and projections of diabetes-related health expenditure: Results from the International Diabetes Federation Diabetes Atlas, 9th edition," Diabetes Res. Clin. Pract., vol. 162, Retrieved from https://doi.org/10.1016/j.diabres.2020.108072

S. Anagaw, (2002) "Application of data mining technology to predict child mortality patterns: the case of Butajira Rural Health Project (BRHP)," Unpubl. Masters thesis, Addis Ababa Univ.

S. Habibi, M. Ahmadi, and S. Alizadeh, (2015) "Type 2 Diabetes Mellitus Screening and Risk Factors Using Decision Tree: Results of Data Mining," Glob. J. Health Sci., vol. 7, no. 5, pp. 304-310, Retrieved from https://doi.org/10.5539/gjhs.v7n5p304

W. Gao and Q. Qiao, (2012) "Screening for type 2 diabetes," Epidemiol. Type 2 Diabetes, pp. 29-38, Retrieved from https://doi.org/10.2174/978160805361211201010029

Z. Marzuki and F. Ahmad, (2007) "Data Mining Discretization Methods and Performances," Mach. Learn., no. 1, pp. 978-980, Retrieved from https://d1wqtxts1xzle7.cloudfront.net/50217711/Data_Mining_Discretization_Methods_and_P20161109-21049-ukdace-with-cover-page-v2.pdf?Expires=1640247769&Signature=aBcWHXg6eVqFLq6aaQIxKpqA4KuDOdOhq7Nifd2cwm9wtkdzUHvlfkD6eiW4pllyKw0cPci26sAMcHgSU57tGBn9HeS4nqR6WsQCKUN-8w4OoreY-1Pjq1ecaCSZrh-1HLt0V0lapzSmtmWGZzP9gYJqfejBAvchirFY-3FH1F4TPbbgT7xyCA5HNSbUJFiOyAtUvjV-fzf~VhFAK3yREd9nwbhqc0-tHLL9aPQ2MIV-btIn6jYi0BIOlgGLT~b7XWM0NlotydSBaDP~l7CfKGJFl3UWZhUCp96wFIS5gla~kudQL12Rz0n2poR0XuaeLFVZ-hS4kQz5dwr1ODffOw__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA

This work is licensed under a Creative Commons Attribution 4.0 International License.

PREDICTION OF DIABETES SCREENING BY USING DATA MINING ALGORITHMS

Aberham Tadesse Zemedkun ¹

¹ Faculty of Engineering & Technology, Rift Valley University, Department of Computer Science, P.O. Box 80734, Addis Ababa, Ethiopia.

Received 10 November 2021
Accepted 05 December 2021
Published 24 December 2021

Corresponding Author: Aberham Tadesse Zemedkun, abrehamt373@gmail.com

DOI 10.29121/IJOEST.v5.i6.2021.253

Funding: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Copyright: © 2021 The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

ABSTRACT
Diabetes is one of the most common non-communicable diseases in the world. It impairs the body's ability to produce or use the hormone insulin. If it remains unidentified and untreated, complications may occur, and these contribute significantly to increased morbidity, mortality, and hospital admission rates in both developed and developing countries. In this study, medical records of the cases were reviewed retrospectively, and anthropometric and biochemical information was collected. From these data, four machine learning (ML) classification algorithms, namely Decision Tree (J48), Naïve Bayes, PART rule induction, and JRIP, were used to predict diabetes. Precision, recall, F-measure, Receiver Operating Characteristic (ROC) scores, and the confusion matrix were calculated to determine the performance of the algorithms; performance was also measured by sensitivity and specificity. All four algorithms achieved high classification accuracy and were generally comparable in distinguishing diabetic from non-diabetic patients. Among the algorithms tested, the Decision Tree classifier (J48) was the best predictor, with the highest classification accuracy of 92.74%.
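As an illustration of the evaluation measures named in the abstract, the following minimal Python sketch derives accuracy, precision, recall (sensitivity), specificity, and F-measure from a two-class confusion matrix. It is not the tooling used in the study: the names `ConfusionMatrix`, `safe_ratio`, and `classification_metrics` are hypothetical, and the counts in the example are invented for demonstration only. ROC scores are omitted because they require per-case prediction scores rather than counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a two-class problem (diabetic = positive class)."""
    tp: int  # diabetic, predicted diabetic
    fn: int  # diabetic, predicted non-diabetic
    fp: int  # non-diabetic, predicted diabetic
    tn: int  # non-diabetic, predicted non-diabetic


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(cm: ConfusionMatrix) -> dict[str, float]:
    """Compute the standard measures from the four confusion-matrix counts."""
    total = cm.tp + cm.fn + cm.fp + cm.tn
    precision = safe_ratio(cm.tp, cm.tp + cm.fp)
    recall = safe_ratio(cm.tp, cm.tp + cm.fn)  # identical to sensitivity
    return {
        "accuracy": safe_ratio(cm.tp + cm.tn, total),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": safe_ratio(cm.tn, cm.tn + cm.fp),
        "f_measure": safe_ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration; NOT results from the study.
    example = ConfusionMatrix(tp=80, fn=20, fp=10, tn=90)
    for name, value in classification_metrics(example).items():
        print(f"{name}: {value:.4f}")
```

Sensitivity and recall are the same quantity under two names, which is why both keys return one value; specificity is the corresponding rate for the non-diabetic class.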
Keywords: Diabetes, Data Mining, ML, J48, PART, JRIP, Naïve Bayes

1. INTRODUCTION

Diabetes mellitus (DM) is a syndrome characterized by chronic hyperglycaemia due to an absolute or relative deficiency of circulating insulin Ahmed et al. (2012). There are three main types of diabetes: type 1, type 2, and gestational diabetes. People with type 1 diabetes produce very little or no insulin, which is why it is called insulin-dependent diabetes. Type 2 diabetes, formerly called non-insulin-dependent or adult-onset diabetes, accounts for at least 90% of all cases of diabetes. Gestational diabetes mellitus (GDM) is a type of diabetes characterized by high blood glucose levels during pregnancy Williams et al. (2020). The number of people with diabetes is expected to rise to 366 million by 2030. DM is the commonest of all metabolic diseases worldwide Gao and Qiao (2012). The burden of diabetes is increasing worldwide, including in developing countries like Ethiopia. The International Diabetes Federation reported that Ethiopia ranked third in Africa in 2012, with 1.4 million people with DM and a prevalence of 3.32% Kumar and D. G. R (2016). Diabetes affects all segments of the population, regardless of age and sex Eapen (2004). Diabetes of all kinds can cause complications that increase the overall risk of dying prematurely. Possible complications include heart attacks, strokes, renal failure, leg amputations, vision loss, and nerve damage. Poorly

