PRE-DETECTION OF DISEASES WITH DEEP LEARNING METHOD AND AN APPLICATION ON DIABETES

Abstract: Diabetes is a chronic disorder caused by the inability of the pancreas to produce adequate insulin or the inability of the body to use the insulin it produces. After a certain time it leads to serious complications such as eye, cardiovascular and kidney diseases. Because of its high treatment costs, the discomfort it causes the patient and the loss of labour it brings, diabetes is an important health problem with a heavy socioeconomic burden. It is often seen in middle age and in adults. For these reasons, early diagnosis is as important in diabetes as in many other diseases. Various chemical tests are performed on blood and urine for diagnosis. In this study, the clinical data of individuals were examined and data mining techniques were used to determine whether individuals were diabetic.


Introduction
Diabetes is a disease with chronic and acute side effects, not all of them life-threatening, whose patients need medical assistance and care to prevent or relieve these effects and the direct and indirect economic and social costs they bring. Inactivity, urbanization and changes in food consumption increase the risk of diabetes (Gedik, 2016). The World Health Organization estimates that by 2025 more than 300 million people worldwide will have diabetes.
Diabetes has two main subtypes, Type 1 and Type 2. Type 1 diabetes usually occurs before age 30, while the more common Type 2 diabetes occurs at a later age. Type 2 accounts for about 80% of diabetes cases, and the majority of Type 2 patients, about 90%, are obese adults. Type 1 diabetes makes up 5-10% of diabetes cases. Although very rare, diabetes can also arise from other causes (Oksay Şahin, 2015).
One form of diabetes mellitus, gestational diabetes mellitus (GDM), is among the most common metabolic complications of pregnancy and affects up to 14% of pregnant women. Women with GDM are at greater risk of developing diabetes in the future and of perinatal morbidity and mortality related to the pregnancy. For this reason, early detection of women who are at risk of GDM or who are in fact developing it is strongly warranted (Graziano Dijani and Giuseppe Seghieri, 2007). The types of diabetes that develop after GDM have generally not been studied. However, the insulin resistance and poor insulin secretion that contribute to GDM are likely to be involved in the diabetes that occurs after GDM as well (Buchanan, Xiang, Kjos and Watanabe, 2007).
Today, computer-aided methods are widely used in medicine. The efficiency gains and successes achieved with computer-aided medical systems have paved the way for the diversification and spread of these methods. Among them, artificial intelligence studies, methods, techniques and approaches are the most remarkable. Current applications show that artificial intelligence methods are used in the computer-assisted diagnosis of many different diseases (Köse, Güraksin and Deperlioğlu, 2015).
The techniques that can be used in health and the policies that can be developed are based on information obtained from data. Health data are gathered by many institutions: clinics, other clinical institutions, insurance companies and related public institutions. The first concept that comes to mind for data on this scale is "data mining". Data mining is the process of applying a range of analysis tools to find patterns and relationships in data and to produce valid predictions (Koyuncugil and Özgülbaş, 2009).

Goal
This study discusses the prediction of diseases with the deep learning method. To make the subject more concrete, data on diabetes were compiled and used in the research. The data were processed and tested with two separate tools that are widely used in the field of deep learning: WEKA and the Python programming language. The study relies mainly on WEKA, which gave higher and more reliable results. The aim is to train the machine with Linear Regression and Logistic Regression, which are data mining algorithms, and then to check whether the trained algorithm predicts the test sets correctly. In this way, the probability that future test records belong to a sick or a healthy individual was investigated. Finally, the two algorithms were compared to see which of them worked better.

What is Diabetes?
Diabetes is a metabolic disease that develops when the human body cannot produce enough insulin to meet its own needs or cannot use the available insulin as required. In diabetes, blood sugar is higher than it should be. Insulin is a hormone secreted by the pancreas. It allows sugar to pass from the blood into the cells, where it is converted into energy. After food is consumed, the amount of sugar in the blood increases. In healthy people the increased sugar quickly returns to its normal level. In people with glucose disorders this system does not work because insulin is insufficient or absent: sugar cannot enter the cell and glucose accumulates in the blood (Güçlü et al., 2008). In the case of hypoglycemia, the blood vessels and, first of all, the nervous system are severely damaged (Erdost and Çetinkale, 2008). Individuals with diabetes try to balance the calories of the foods they consume and take medication in order to keep their blood glucose stable. Blood sugar rises or falls when the foods consumed are high in calories, when the medication is not taken adequately, or when no food is eaten during the day.

Diabetes Types and Diagnostic Criteria
Diabetes diagnosis criteria were defined in 2010, when HbA1c was added to them for the first time (Yanık, 2011). Diabetes is divided into four groups: Type 1, Type 2, other specific types and gestational diabetes. Type 1 diabetes arises from the destruction of pancreatic beta cells and appears as an acute disease with insulin deficiency, mostly in individuals under 18 years of age. In Type 2 diabetes, insulin resistance and impaired insulin secretion are the most common findings. The other specific types are caused by genetic defects in insulin action or beta cell function, pancreatic disorders, infections, endocrine diseases, drugs and chemical agents (Samancıoğlu, 2013).

Symptoms and Findings of Diabetes
The main metabolic changes that accompany high blood sugar stem from insufficient insulin production, as in Type 1 diabetes, or from peripheral insulin resistance, as in Type 2 diabetes. In both cases insulin cannot function effectively, glucose cannot be taken into the cell, and its concentration in the blood increases. When glucose cannot be used by the body, glycogen stores are broken down into glucose to meet the energy need. The breakdown of fat stores causes hyperlipidemia by keeping the blood lipid concentration high.
High blood sugar levels increase plasma osmolarity. When the glucose concentration rises above 180 mg/dl, the renal glucose threshold is exceeded and glucose is excreted in the urine. The breakdown of protein stores leads to polyphagia. Despite intense hunger and polyphagia, weight is lost through the breakdown of protein and fat stores and through water loss. Weakness, fatigue and malaise occur because little glucose enters the cells, plasma volume decreases, and muscle proteins are depleted to meet the energy needs of the body (Talaz, 2007).

Deep Learning and Artificial Intelligence
Deep learning systems use representation learning methods to extract meaningful patterns (Ting, Yim, Cheung and Lim, 2017). Deep learning applies computational models to learn data representations with multiple levels of abstraction. Despite the success of deep learning, understanding the internal operations and behavior of its models has become a research topic in its own right (Fang, 2017). Machine learning is a general-purpose method of artificial intelligence that can learn relationships from data without having to define them in advance. Its main advantage is the ability to obtain predictive models without strong assumptions about the underlying mechanisms, which are often unknown or poorly defined. A key benefit of deep learning is the analysis and learning of large quantities of unsupervised data. This makes it a valuable tool for big data analytics, where raw data is largely unlabeled and uncategorized (Najafabadi, Villanustre, Khoshgoftaar, Seliya, Wald and Muharemagic, 2015). Deep learning is a branch of artificial intelligence (Kamble, 2016).
Intelligence is the sum of the human abilities of reasoning, thinking, comprehending objective facts, perception, drawing conclusions, judgment, abstraction and decision-making. Skills such as learning and adapting to new situations also fall within the scope of intelligence. Artificial intelligence is intelligence with these characteristics realised by non-organic means. It is described as the ability of a computer or a computer-controlled machine to perform tasks tied to higher mental processes, such as reasoning, generalization, making sense of things and learning from past experience, that are assumed to be mostly human characteristics (Kalaycı, 2006).

Data Mining
Data mining is a result of the emergence of very large data stores. In the 1960s, data began to be accumulated in electronic media and historical data began to be analysed with computers. In the 1980s, relational databases and SQL made dynamic and easy analysis of data possible. By the 1990s the volume of accumulated data had become very large, and data warehouses came into use to store it. Data mining emerged from the combined use of statistics and artificial intelligence methods to handle these extensive data stacks.
Advancing technology has made it easier to convert unexamined data into a form that meets administrative and market requirements and creates new opportunities, and in a sense this has pushed organizations toward data mining. Despite differences and conflicts in approach, the health sector needs data mining more than most (Canlas, 2009). The last two decades in particular can be described as the big data era, in which digital data has become increasingly important in many fields such as health, science, technology and society. Large amounts of data have been captured and produced from multiple sources in many areas, such as streaming machines, high-throughput instruments, sensor networks, mobile applications and especially healthcare services. This high data volume represents big data (Daoudy and Maalmi, 2019). Data mining is therefore generally associated with a wider knowledge discovery process (Losiewicz, Oard and Kostoff, 2000).
The steps of the data mining process are as follows:
• Identification of the problem: the first step is to specify the goal for which data mining will be applied. This step defines the requirements and the target to which the data will ultimately be applied.
• Data identification and collection: this step defines what kind of data will be used and from which sources.
• Data preparation: the collected data are brought into a form suited to the goal of the data mining design. The preparation steps are data cleaning, data consolidation, data reduction and data conversion (Aydemir, 2017).
Models and techniques in data mining are grouped as follows: predictive models comprise classification and regression analysis, while descriptive models and techniques comprise cluster analysis and clustering methods (Sivri, 2015).

K-Fold Cross-Validation
Cross-validation is a technique for evaluating and comparing classification algorithms by splitting the available data into two parts. One part is used to train the model and the other is used to check the accuracy of the model (Çataloluk, 2012).
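The splitting idea above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by WEKA: the function name k_fold_indices, the shuffling seed and the sample count of 768 (taken from the 768x1 column mentioned later in the paper) are assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # All remaining folds together form the training set for this round.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Hypothetical usage: 768 records, 10 folds.
splits = list(k_fold_indices(768, k=10))
```

Each record appears in exactly one test fold, so every record is used once for testing and k - 1 times for training.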

Logistic Regression Analysis
When a relationship between two variables is assumed, that relationship can be represented by a line passing through the points of the scatter plot. This line is called the "Regression Line", and its mathematical expression is called the "Regression Equation" (Şata, 2015).
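As a hedged illustration of a regression line and its equation, the sketch below fits y = slope * x + intercept by ordinary least squares and then shows the logistic function, which logistic regression applies to such a linear expression to obtain a probability. The function names and the sample points are hypothetical and are not taken from the study.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept of the line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def logistic(z):
    """Map a linear score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative points lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

In logistic regression the coefficients are not found with this closed-form fit; they are usually estimated iteratively, and the linear score is passed through logistic to give the probability of the positive class.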

Linear Regression
Linear regression analysis is examined under two headings: simple regression and multiple linear regression. Simple regression analysis involves one response variable and one explanatory variable. If a linear or curvilinear relationship between a single response variable and multiple explanatory variables is to be described, the relationship is examined with multiple linear regression analysis (Arı and Önder, 2013).

Weka
WEKA is a machine learning program developed at the University of Waikato in 1993; its name is formed from the first letters of the words "Waikato Environment for Knowledge Analysis". It includes many of the machine learning algorithms preferred today. Because it is developed in Java and distributed as "jar" files, it can easily be integrated into projects written in Java, which has helped it spread even further. WEKA is built entirely from modular parts and covers data analysis, visualization, business intelligence applications and data mining on data sets. Regression, data preprocessing, classification, clustering, feature selection and feature extraction are some of the methods it provides. It also includes visualization tools that present the results of these methods in visual form (Pehlivan, 2014). All algorithms included in WEKA accept the ARFF file format, a simple relational table format, as input (Aydın, 2011).
As shown in Figure 1, the data were applied in WEKA as follows:

• Data Partition and Data Validation Processes
Test and training sets were used during the analysis. They were not kept as two separate files; part of the data was taken as the test set and part as the training set. At this stage, the k-fold cross-validation method was used. Since k = 10 is generally preferred in the literature, k = 10 was also preferred in this project. Thus, the data were divided into 10 equal parts, with 10% used as the test set and 90% as the training set in each round.
• Data Feature Scaling
The data set contains both numerical and nominal data. The 'Nominal to Binary' filter, which performs this conversion in WEKA, was applied to the data set. The filter was applied only to the 9th column, because the class in the 9th column takes nominal values.
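The conversion of a two-valued nominal class into a binary column can be sketched as follows. This is only an illustration of the idea and not WEKA's filter implementation; the function name and the label strings "tested_positive" and "tested_negative" are assumptions, since the paper does not list the class values.

```python
def nominal_to_binary(labels, positive_label):
    """Encode a two-valued nominal column as 1 (positive) or 0 (otherwise)."""
    return [1 if label == positive_label else 0 for label in labels]

# Hypothetical class values for three records (assumed label names).
classes = ["tested_positive", "tested_negative", "tested_negative"]
encoded = nominal_to_binary(classes, positive_label="tested_positive")
```

After this step every column is numeric, which is what a regression algorithm needs as input.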

Linear Regression Results
Results were obtained in WEKA using Linear Regression with 10-fold cross-validation (k = 10). The algorithm used 6 of the 8 features. The 2 features left out either have no effect on the result or an effect too small to change it. The two unused features are 'skin' and 'insu'.
Figure 2 gives the formulas produced by the algorithm. Calculations on the data sets are made using these formulas. The figure also gives the error values of Linear Regression: the relative absolute error was calculated as 74.01119% and the root relative squared error as 84.6013%.
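The two error measures reported above can be sketched with their textbook definitions, in which the model's error is divided by the error of always predicting the mean of the actual values. The function names and the four sample values are hypothetical, and WEKA's own implementation may use a slightly different baseline.

```python
import math

def relative_absolute_error(actual, predicted):
    """Total absolute error relative to predicting the mean, in percent."""
    mean_a = sum(actual) / len(actual)
    num = sum(abs(p - a) for a, p in zip(actual, predicted))
    den = sum(abs(a - mean_a) for a in actual)
    return 100.0 * num / den

def root_relative_squared_error(actual, predicted):
    """Root of squared error relative to predicting the mean, in percent."""
    mean_a = sum(actual) / len(actual)
    num = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    den = sum((a - mean_a) ** 2 for a in actual)
    return 100.0 * math.sqrt(num / den)

# Illustrative values only, not taken from the study.
actual = [0.0, 1.0, 0.0, 1.0]
predicted = [0.2, 0.7, 0.4, 0.9]
rae = relative_absolute_error(actual, predicted)
rrse = root_relative_squared_error(actual, predicted)
```

A value of 100% means the model is no better than always predicting the mean, so lower values indicate a better fit.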

Logistic Regression Results
The Logistic Regression algorithm was applied to the data set without data pre-processing. The success rate of the algorithm is 77.2135%.

Figure 3: Confusion Matrix
The confusion matrix is given in Figure 3. According to the confusion matrix, 440 + 153 = 593 instances are classified correctly and 115 + 60 = 175 instances are classified incorrectly.
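The counts in the confusion matrix can be checked against the reported success rate with a short calculation. The variable names are illustrative; the only assumption is the usual layout in which correct classifications lie on the diagonal of the matrix.

```python
# Counts reported for the confusion matrix in Figure 3.
correct_counts = [440, 153]    # diagonal entries (correct classifications)
incorrect_counts = [115, 60]   # off-diagonal entries (misclassifications)

correct = sum(correct_counts)
incorrect = sum(incorrect_counts)
total = correct + incorrect

# Accuracy as a percentage of all classified instances.
accuracy = 100.0 * correct / total
```

With these counts the total is 768 instances and the accuracy rounds to 77.2135%, which matches the success rate reported for Logistic Regression.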

Application of Data on Python
The code was written in the Python programming language, in the Spyder environment of Anaconda Navigator. Tests were carried out under the same conditions as in WEKA, so that the accuracy of the two programs could be compared (Şanlı, 2018). Figure 4 shows the loading stages of the data:
➢ Data Pre-Processing Stage
The data were pre-processed. The pre-processing step is crucial for obtaining more reliable results.

➢ Separation of Data
The data are separated into columns, and the relevant columns are combined as needed where they are used, which simplifies the code, as seen in Figure 5.

• Separation of Data as a Training and Test Set
The data were divided into three parts with the percentage split method: one part was used for testing and two for training. In other words, 30% of the data was used as the test set and 70% as the training set.
• Attribute Scaling
The effects of the attributes must be measured on the same scale, so all data should be evaluated within the same range; this gives more reliable results. For example, when a person's insulin value is compared with their age, the insulin value can exceed 100 while the age is below 100. This does not mean that the insulin value is more important than the age value, and the difference needs to be corrected by scaling. Here the data are scaled to the range -1 to 1.
• Creating Probability Values and Reporting
First, a 768x1 column was created with every element set to 1. Because the data are stored as an array whose index starts from zero, this column is placed at index zero. The value 1 is used because multiplying by 1 does not change the result.
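The three steps above can be sketched in plain Python. This is a sketch under stated assumptions rather than the code used in the study: min-max scaling is assumed as the way of reaching the range -1 to 1, the function names are hypothetical, and the insulin and age values are illustrative numbers, not records from the data set.

```python
import random

def scale_to_range(column, low=-1.0, high=1.0):
    """Min-max scale a numeric column into [low, high] (assumed method)."""
    c_min, c_max = min(column), max(column)
    span = c_max - c_min
    if span == 0:
        # A constant column carries no information; map it to the midpoint.
        return [(low + high) / 2.0 for _ in column]
    return [low + (high - low) * (v - c_min) / span for v in column]

def add_ones_column(rows):
    """Put a column of ones at index zero of every row."""
    return [[1.0] + list(row) for row in rows]

def percentage_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and return (train, test) by percentage split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Illustrative values only: insulin can exceed 100, age stays below 100.
insulin = [0.0, 94.0, 168.0, 846.0]
age = [21.0, 33.0, 50.0, 81.0]
scaled = list(zip(scale_to_range(insulin), scale_to_range(age)))
design = add_ones_column(scaled)

# Percentage split of 768 record indices into 70% training and 30% test.
train, test = percentage_split(list(range(768)))
```

After scaling, both attributes lie in the same range, so neither dominates the model only because of its units. The column of ones lets the constant term of the model be handled like any other coefficient.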

Result
Diabetes mellitus is a chronic disease characterized by hyperglycemia, and it can cause many complications. Given the increasing morbidity of recent years, the number of diabetic patients in the world will reach 642 million by 2040, which means that one in ten adults will have diabetes. There is no doubt that this worrying figure deserves great attention. With its rapid development, machine learning has been applied to many aspects of medical health. In this study, the Pima Indians data set was used to predict diabetes mellitus, and a systematic effort was made to identify and review the machine learning and data mining approaches applied in DM research.
The small differences in success rates are due to small differences in the criteria used by the programs. WEKA and Python are completely separate from each other. WEKA is more preferred today; the reason for this is its different classification of the error rate. To reduce the error rate in Python, the P probability value must be reduced to zero. The aim is to compare the change in accuracy rates between platforms when the values are changed. Logistic Regression reached a success rate of 77% in WEKA and 75% in Python. In this algorithm the criteria are kept closer together and, as seen, close results are obtained.
Both algorithms work with an acceptable error rate on this data set, so either can be selected as desired. As a result of this study, good results have been obtained. The platform used for data mining does not matter; what matters most is efficient pre-processing and obtaining the best result.