Granthaalayah
HEALTH INSURANCE PREMIUM PREDICTION USING BLOCKCHAIN TECHNOLOGY AND RANDOM FOREST REGRESSION ALGORITHM


 

Ghosh Madhumita 1, Ravi Gor 2

 

1 Research Scholar, Department of Mathematics, Gujarat University, Ahmedabad-380009, India

2 Department of Mathematics, Gujarat University, Ahmedabad-380009, India

 


ABSTRACT

Blockchain technology is based on a sequence of blocks, where each block carries a certain amount of information. Medical records can be cryptographically secured in the health insurance ecosystem with blockchain technology. Here, a blockchain technology model is used to create a user interface for storing data block-wise. In addition, the insurance premium is predicted using Support Vector Regression, Lasso Regression, Ridge Regression, Multiple Linear Regression and Random Forest Regression algorithms. Of these algorithms, Random Forest Regression gives the best result.

 

Received 23 April 2022

Accepted 28 May 2022

Published 24 June 2022

Corresponding Author

Ghosh Madhumita,

madhumitaghoshfmg@gujaratuniversity.ac.in

DOI 10.29121/IJOEST.v6.i3.2022.346   

Funding: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Copyright: © 2022 The Author(s). This work is licensed under a Creative Commons Attribution 4.0 International License.

With the license CC-BY, authors retain the copyright, allowing anyone to download, reuse, re-print, modify, distribute, and/or copy their contribution. The work must be properly attributed to its author.

 

Keywords: Blockchain Technology, Supervised Learning, Uniform Resource Locator, Support Vector Regression, Lasso Regression, Ridge Regression, Multiple Linear Regression, Random Forest Regression, Health Insurance Data

 

 

 


1. INTRODUCTION

Due to the competitive environment, people are constantly under stress and suffer from physical and mental health problems. Therefore, it is important to purchase adequate health insurance plans for the treatment of physical and mental illness. Financial challenges can be avoided if a person has a health insurance policy at the time of medical treatment. So, today more and more people are realizing the importance of health insurance and opt to buy it.

Nowadays, machine learning and blockchain technology are used to run automated applications in every sector of the insurance industry. With efficiently designed algorithms, these technologies can store the medical history of patients effectively, widen access to medical records, and improve the delivery of services related to patient care.

The most important benefit of blockchain technology in the insurance industry is that, because of the immutable nature of the contracts stored on the blockchain, there is no need for any third party to act as an intermediary. Using blockchain technology, insurance companies can store data such as transactions and claims in a secure manner that is very difficult to compromise. Blockchain technology also helps the insurance company reduce the human error that occurs when updating records (excellarate, 2021).

Machine learning is widely used across the insurance industry. It helps insurance companies in fraud detection, claim management, billing, and customer service (azure.microsoft.com). Many models are created using machine learning techniques to enhance customer service. These models are used for everything from claim registration to claim settlement. In addition, predictive models can be used to predict future claims and the costs associated with claims (accenture, n.d.).

 

2. LITERATURE REVIEW

Shreyas et al. (2016) implemented thirteen regression algorithms to predict the popularity of online articles. They compared the results obtained by all algorithms and, based on the R2 score, identified the five algorithms giving the best results: Random Forest, Linear Regression, Lasso, Ridge, and Nearest Neighbor. Of these five algorithms, Random Forest Regression gave the best result, with an accuracy of 88.8%.

Gururaj et al. (2019) used Linear Regression (LR) and Support Vector Machine (SVM) algorithms to predict the stock price. They compared the two algorithms by calculating MSE, MAE, RMSE and R-Squared, and concluded that SVM performs better than LR. They also explained the advantages and disadvantages of SVM and LR.

D'Costa et al. (2020) predicted the true value of cars by using machine learning algorithms. They divided the data into two parts, training and testing, and applied the Multiple Linear Regression algorithm to train and test the data. In this model, car model, fuel type, emission, mileage, and year of registration are taken as independent variables and car price is taken as the dependent variable.

Bajaj et al. (2020) used Machine Learning models such as Linear Regression, K-Neighbours Regressor, XGBoost Regressor, and Random Forest Regressor to forecast future sales of Big Mart Companies based on the previous year's sales. The input criteria for the prediction are item weight, fat content, visibility, item kind, MRP, outlet establishment year, outlet size, and outlet location type. They determined the precision of the outcomes by calculating Root Mean Squared Error (RMSE), Variance Score, and Training and Testing Accuracy, and concluded that the Random Forest Algorithm is the best of all, with a precision of 93.53%.

Venkatesan et al. (2020) applied linear regression to find the optimum water supply for the best strawberry production. The data was collected from September 2015 to May 2016. The 233 valid samples were separated into two groups: training and testing. 186 training sets are utilised in modelling, whereas 46 test sets are used to evaluate the model's prediction performance. Nutrition water, average temperature, humidity, and CO2 are used as independent variables. They also calculated R-squared, Root Mean Squared Error (RMSE) and P-value.

Vali et al. (2020) studied the previous years' sales of a supermarket to predict its future sales. The Linear Regression Algorithm is applied for the prediction. Sales data from 2017 to 2019 are taken for this model. Data from 2017 and 2018 is used for training purposes, and data for 2019 is predicted. To calculate the accuracy of the prediction, the actual data for the year 2019 was compared with the predicted data.

Rohith et al. (2020) used the Decision Tree regression algorithm to predict crop price. Rainfall, Minimum Support Price, and Wholesale Price Index are taken as independent variables. The max depth parameter is used to reduce the complexity of the model and the size of the tree. The Flask module is used to forecast the crop price through a web application. Mean Square Error, Mean Absolute Error, and R-score are calculated to measure the performance of decision tree regression.

Kausthub (2021) applied the Multiple Linear Regression algorithm to predict sales related to commercials displayed in three main forms of media: TV, Radio and Newspaper. He examined the prediction error using the yellow-brick library and calculated the RMSE value to check the accuracy of the model.

Zhang (2021) used two methods to predict housing price. He analysed the significant factors affecting house prices for use as input variables. These factors were selected using the Spearman correlation coefficient method. A Multiple Linear Regression model for housing price prediction was then established. He concluded that the Multiple Linear Regression model can effectively predict and analyse the housing price to some extent.

Dabreo et al. (2021) predicted real estate prices using XGBoost, Random Forest, Decision Tree, and Linear Regression algorithms. Thirteen independent variables are used, such as the fraction of residential land zoned for plots larger than 25,000 square feet, the proportion of non-retail commercial acres per town, the full-value property-tax rate per $10,000, and so on. They also calculated the Root Mean Squared Error mean, Root Mean Squared Error standard deviation, and Mean Cross Validation Score to evaluate the model's performance. The XGBoost Regression algorithm came in first, followed by the Random Forest regression algorithm, and the Decision Tree came in third with a significant difference.

Bhavsar and Gor (2022) predicted restaurant ratings with the help of a Machine Learning model. Information such as restaurant id, country, categories for dining, cost, currency, online delivery option, aggregate rating, rating, and votes was provided to the Artificial Neural Network model. The ratings were classified into 5 different categories from Poor to Excellent. The results of three different optimizers, Adam, Adamax and Nadam, were compared, and Nadam showed the best accuracy (Bhavsar and Gor, 2022; Ghosh and Gor, 2022).

 

3. REPRESENTATION OF BLOCKCHAIN TECHNOLOGY AND RANDOM FOREST REGRESSION MODEL         

3.1. BLOCKCHAIN

Blockchain technology is defined as a chain of blocks that contain information. It is difficult to change, hack or cheat information that is stored in a blockchain system (excellarate, 2021). A blockchain collects sets of information in groups, known as blocks (geeksforgeeks.org). Blocks have a specific storage capacity; when they are filled, they are closed and attached to the previously filled block, which forms a chain of data known as a blockchain.

Figure 1 Structure of Blockchain

 

In a blockchain, the first block is called the Genesis block. Each new block is linked to the previous block (excellarate, 2021).
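The linking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the field names (`index`, `data`, `previous_hash`, `hash`), the choice of SHA-256, and the all-zero previous hash of the Genesis block are all hypothetical choices made here.

```python
import hashlib
import json


def block_hash(index, data, previous_hash):
    """Return the SHA-256 hex digest of a block's contents."""
    payload = json.dumps(
        {"index": index, "data": data, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_block(index, data, previous_hash):
    """Build a block as a plain dict; the field names are illustrative."""
    return {
        "index": index,
        "data": data,
        "previous_hash": previous_hash,
        "hash": block_hash(index, data, previous_hash),
    }


def build_chain(records):
    """Create a Genesis block, then append one block per record."""
    chain = [make_block(0, "genesis", "0" * 64)]  # assumed Genesis convention
    for record in records:
        previous = chain[-1]
        chain.append(make_block(previous["index"] + 1, record, previous["hash"]))
    return chain


chain = build_chain([
    {"age": 19, "sex": "female", "bmi": 27.9, "children": 0},
    {"age": 18, "sex": "male", "bmi": 33.8, "children": 1},
])
for block in chain:
    # Each block's previous_hash equals the hash of the block before it.
    print(block["index"], block["previous_hash"][:8], block["hash"][:8])
```

Because every block stores the hash of its predecessor, changing an earlier block would change its hash and break the link held by the next block.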

 

3.2. REGRESSION

Regression is a supervised learning technique used when the output variable has a real or continuous value, such as salary or weight (Seely, 2018). Many algorithms can be used to solve both classification and regression problems (Seely, 2018). Here, the Random Forest algorithm is used to solve a regression problem.

 

3.3. RANDOM FOREST

The Random Forest (RF) algorithm can be used for both classification and regression. This algorithm is an ensemble of decision trees. The prediction of the random forest is based on the predictions of each individual tree (Vali et al., 2020). In this paper, Random Forest regression is applied to predict the health insurance premium.

Steps involved in Random Forest Regression (Biau, 2012):

Step-1 Select the sample randomly from the training data set.

Step-2 Apply the decision tree algorithm individually on the collected sample.

Step-3 Construct each decision tree as follows (Coursera, n.d.):

1)     Start with the root node, which contains the complete data set.

2)     Find the best attribute using Attribute Selection Measure (ASM).

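       Two popular techniques for ASM:

Information Gain:

$$\text{Information Gain} = \text{Entropy}(s) - \sum_{v} \frac{|s_v|}{|s|}\,\text{Entropy}(s_v)$$

Where, s = total set of samples at the node, and s_v = the subset of samples for which the attribute takes the value v.

$$\text{Entropy}(s) = -\sum_{i} p_i \log_2 p_i$$

where p_i = proportion of the samples in s that belong to class i.

Gini Index:

$$\text{Gini Index} = 1 - \sum_{i} p_i^2$$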

3)     Divide the root node into subsets that contain possible values for the best attributes.

4)     Generate the decision tree node, which contains the best attribute.

5)     Recursively make new decision trees using the subsets of the dataset created in step 3). Continue this process until a stage is reached where the nodes cannot be split any further.

Step-4 Calculate the average of the predictions made by the individual decision trees.

$$\hat{y} = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t$$

where, T = number of decision trees in the Random Forest,

$\hat{y}_t$ = prediction made by the t-th decision tree.
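Steps 1 to 4 can be sketched with a deliberately small, pure-Python forest. This is a simplified illustration, not the paper's model: each tree is a depth-1 "stump" on a single feature (age), the split is chosen by minimising the sum of squared errors (the usual criterion for regression targets, whereas Information Gain and the Gini Index apply to class labels), and the names `fit_stump`, `fit_forest` and `predict_forest`, the 25 trees and the seed are assumptions made here. The ten (age, expenses) pairs are the first ten rows of Table 1.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold and two leaf means."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the maximum so both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        # Sum of squared errors around the two leaf means.
        sse = sum((y - left_mean) ** 2 for y in left) + sum(
            (y - right_mean) ** 2 for y in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x identical: fall back to the overall mean
        overall = mean(ys)
        return (float("inf"), overall, overall)
    return best[1:]


def predict_stump(stump, x):
    """Return the leaf mean on the side of the threshold where x falls."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=13):
    """Steps 1-3: draw a bootstrap sample for each tree and fit a tree on it."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(xs)) for _ in xs]  # sample with replacement
        forest.append(fit_stump([xs[i] for i in picks], [ys[i] for i in picks]))
    return forest


def predict_forest(forest, x):
    """Step 4: average the predictions of the individual trees."""
    return mean(predict_stump(stump, x) for stump in forest)


ages = [19, 18, 28, 33, 32, 31, 46, 37, 37, 60]
expenses = [16884.92, 1725.55, 4449.46, 21984.46, 3866.86,
            3756.62, 8240.59, 7281.51, 6406.41, 28923.14]

forest = fit_forest(ages, expenses)
print(round(predict_forest(forest, 40), 2))
```

A practical forest would grow deeper trees over all features and usually also subsample the features at each split; the sketch keeps only bootstrap sampling and averaging.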

 

4. METHODOLOGY USED IN THE PAPER

The health insurance data, collected from Kaggle, is shown in Table 1. In this model, age, sex, BMI, and children are taken as independent variables to predict the insurance premium.

Table 1 Health Insurance dataset (kaggle.com)

| Age | Sex | BMI | Children | Smoker | Region | Expenses |
|-----|--------|------|----------|--------|-----------|----------|
| 19 | female | 27.9 | 0 | yes | southwest | 16884.92 |
| 18 | male | 33.8 | 1 | no | southwest | 1725.55 |
| 28 | male | 33 | 3 | no | southwest | 4449.46 |
| 33 | male | 22.7 | 0 | no | northwest | 21984.46 |
| 32 | male | 28.9 | 0 | no | northwest | 3866.86 |
| 31 | female | 25.7 | 0 | no | southwest | 3756.62 |
| 46 | female | 33.4 | 1 | no | southwest | 8240.59 |
| 37 | female | 27.7 | 3 | no | northwest | 7281.51 |
| 37 | male | 29.8 | 2 | no | northwest | 6406.41 |
| 60 | female | 25.8 | 0 | no | northwest | 28923.14 |
| 25 | male | 26.2 | 0 | no | northwest | 2721.32 |
| 62 | female | 26.3 | 0 | yes | southwest | 27808.73 |
| 23 | male | 34.4 | 0 | no | southwest | 1826.84 |
| 56 | female | 39.8 | 0 | no | southwest | 11090.72 |
| 27 | male | 42.1 | 0 | yes | southwest | 39611.76 |
| 19 | male | 24.6 | 1 | no | southwest | 1837.24 |
| 52 | female | 30.8 | 1 | no | northwest | 10797.34 |
| 23 | male | 23.8 | 0 | no | northwest | 2395.17 |
| 56 | male | 40.3 | 0 | no | southwest | 10602.39 |
| 30 | male | 35.3 | 0 | yes | southwest | 36837.47 |

 

 

Age: age of the person holding the policy

Sex: gender of the person holding the policy (female=0, male=1) (kaggle.com)

BMI: body mass index, i.e. body weight divided by the square of height (kg/m^2), ideally 18.5 to 25 (kaggle.com)

Children: total number of children (kaggle.com)

Expenses (charges): medical costs billed by health insurance for the particular individual

Here, two different techniques, Blockchain and Regression, are used to store the health insurance data and to predict the premium.

First, a user interface (web page) is created using blockchain technology. With the help of this web page, data can be stored block-wise, and because each block is secured in the blockchain, the stored data is difficult to tamper with. Then, different machine learning algorithms are applied to this data to predict the health insurance premium, and the results obtained by all the algorithms are compared. In these models, age, sex, BMI, and children are taken as independent variables to predict the charges of health insurance. 70% of the data is used for training and 30% for testing (Bhavsar and Gor, 2022).
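The feature selection and the 70/30 split can be sketched as follows. This is a minimal illustration with hypothetical helper names (`encode`, `split_train_test`) and an arbitrary shuffle seed; the paper does not state how the split was drawn. The ten rows are copied from Table 1.

```python
import random

# Rows copied from Table 1: (age, sex, bmi, children, expenses).
rows = [
    (19, "female", 27.9, 0, 16884.92),
    (18, "male", 33.8, 1, 1725.55),
    (28, "male", 33.0, 3, 4449.46),
    (33, "male", 22.7, 0, 21984.46),
    (32, "male", 28.9, 0, 3866.86),
    (31, "female", 25.7, 0, 3756.62),
    (46, "female", 33.4, 1, 8240.59),
    (37, "female", 27.7, 3, 7281.51),
    (37, "male", 29.8, 2, 6406.41),
    (60, "female", 25.8, 0, 28923.14),
]


def encode(row):
    """Map one row to (features, target) with sex encoded as female=0, male=1."""
    age, sex, bmi, children, expenses = row
    return [age, 0 if sex == "female" else 1, bmi, children], expenses


def split_train_test(samples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # the seed is an arbitrary assumption
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


samples = [encode(row) for row in rows]
train, test = split_train_test(samples)
print(len(train), len(test))  # 7 3
```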

 

5. RESULT AND DISCUSSION

Information is filled into the user interface and submitted. After submission, a block is created using a hash function. In this way, blocks are connected one by one to create a blockchain.
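The submission flow can be sketched as one function that appends a block per submitted form and another that checks the chain afterwards. The function names (`append_block`, `is_valid`), the dict layout, and the use of SHA-256 are assumptions for illustration; the paper does not specify which hash function or storage format its interface uses.

```python
import hashlib
import json


def digest(index, data, previous_hash):
    """Hash the block contents together with the previous block's hash."""
    payload = json.dumps([index, data, previous_hash], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, data):
    """Append a block holding one submitted form and link it to the last block."""
    previous_hash = chain[-1]["hash"] if chain else "0" * 64
    index = len(chain)
    chain.append({
        "index": index,
        "data": data,
        "previous_hash": previous_hash,
        "hash": digest(index, data, previous_hash),
    })


def is_valid(chain):
    """Recompute every hash and check every link to the previous block."""
    for position, block in enumerate(chain):
        expected = digest(block["index"], block["data"], block["previous_hash"])
        if block["hash"] != expected:
            return False
        if position > 0 and block["previous_hash"] != chain[position - 1]["hash"]:
            return False
    return True


chain = []
append_block(chain, {"note": "genesis"})
append_block(chain, {"age": 28, "sex": "male", "bmi": 33.0, "children": 3})
append_block(chain, {"age": 33, "sex": "male", "bmi": 22.7, "children": 0})
print(is_valid(chain))          # True: untouched chain

chain[1]["data"]["bmi"] = 25.0  # simulate someone editing a stored record
print(is_valid(chain))          # False: the stored hash no longer matches
```

The second check fails because the edited record no longer hashes to the value stored in its block, which is what makes silent changes detectable.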

Figure 2 User Interface of Insurance Premium

 

The data set cannot be used directly for regression because it contains null values and some attributes that carry unnecessary information or decrease the accuracy of the model (Bhavsar and Gor, 2022). To improve the accuracy and speed of the model, the null values and the attributes region and smoker are removed from the dataset.
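A minimal sketch of this cleaning step, assuming the records are held as Python dicts, is shown below. The helper name `clean` is hypothetical, and the missing BMI in the second record is introduced here purely to illustrate the null-value filter; it is not a value from the dataset.

```python
def clean(records, drop=("smoker", "region")):
    """Drop rows with missing values, then remove the unwanted columns."""
    cleaned = []
    for record in records:
        if any(value is None or value == "" for value in record.values()):
            continue  # skip rows that contain a null value
        cleaned.append({k: v for k, v in record.items() if k not in drop})
    return cleaned


records = [
    {"age": 19, "sex": "female", "bmi": 27.9, "children": 0,
     "smoker": "yes", "region": "southwest", "expenses": 16884.92},
    {"age": 18, "sex": "male", "bmi": None, "children": 1,  # null added for illustration
     "smoker": "no", "region": "southwest", "expenses": 1725.55},
    {"age": 28, "sex": "male", "bmi": 33.0, "children": 3,
     "smoker": "no", "region": "southwest", "expenses": 4449.46},
]

cleaned = clean(records)
print(len(cleaned), sorted(cleaned[0]))
```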

After cleaning the data, the correlation between the variables is checked. Among the available kernels, the RBF kernel is employed in the Support Vector Regression algorithm. A maximum depth of 5 and a random state of 13 are used in the Random Forest Regression. Then, the premium is predicted using Support Vector Regression, Lasso Regression, Ridge Regression, Multiple Linear Regression and Random Forest Regression. Among these algorithms, Random Forest Regression gives the best result.
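One way to set up the five models with the stated settings is sketched below using scikit-learn, which is an assumption since the paper does not name its library. Only the RBF kernel, the maximum depth of 5 and the random state of 13 come from the text; every other hyperparameter is left at the library default, the split seed is arbitrary, and the data is a synthetic stand-in, so the printed numbers will not match Table 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the Kaggle dataset): age, sex, bmi, children.
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.integers(18, 65, n),     # age
    rng.integers(0, 2, n),       # sex, female=0 / male=1
    rng.uniform(18.0, 45.0, n),  # bmi
    rng.integers(0, 4, n),       # children
])
# Made-up target used only so that the models have something to fit.
y = 250.0 * X[:, 0] + 300.0 * X[:, 2] + rng.normal(0.0, 2000.0, n)

# 70% training, 30% testing; the split seed is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=13
)

models = {
    "Support Vector Regression": SVR(kernel="rbf"),
    "Lasso Regression": Lasso(),
    "Ridge Regression": Ridge(),
    "Multiple Linear Regression": LinearRegression(),
    "Random Forest Regression": RandomForestRegressor(max_depth=5, random_state=13),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    results[name] = (
        mean_absolute_error(y_test, predicted),
        r2_score(y_test, predicted),
    )
    print(f"{name}: MAE={results[name][0]:.2f}, R2={results[name][1]:.4f}")
```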

Table 2 Mean Absolute Error and R-Squared of regression algorithms

| Algorithms | Mean Absolute Error | R-squared |
|----------------------------|---------------------|-----------|
| Support Vector Regression | 8817.232 | 0.1146 |
| Lasso Regression | 8858.1903 | 0.1146 |
| Ridge Regression | 8858.8802 | -0.0668 |
| Multiple Linear Regression | 4307.8631 | 0.7991 |
| Random Forest Regression | 2444.9258 | 0.8954 |

 

Consequently, the error between the actual charges and the predicted charges is calculated by the Mean Absolute Error (MAE) method. To check the accuracy of the model, the R-squared value is also calculated. The performance of the various methods based on Mean Absolute Error and R-squared value is depicted in Figure 3 and Figure 4. Random Forest Regression obtains the highest R-squared value (0.8954) and the lowest Mean Absolute Error (2444.9258), as listed in Table 2.
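For reference, the two measures can be written as MAE = (1/n) Σ |y_i − ŷ_i| and R-squared = 1 − Σ (y_i − ŷ_i)² / Σ (y_i − ȳ)², where y_i are the actual charges, ŷ_i the predicted charges and ȳ the mean of the actual charges. The short sketch below implements both; the function names and the four sample values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """One minus the ratio of residual to total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [100.0, 200.0, 300.0, 400.0]     # hypothetical charges
predicted = [110.0, 190.0, 320.0, 380.0]  # hypothetical model output

print(mean_absolute_error(actual, predicted))  # 15.0
print(r_squared(actual, predicted))            # 0.98

# Predicting a constant other than the mean fits worse than the mean itself,
# so R-squared drops below zero.
print(r_squared(actual, [300.0] * 4) < 0)      # True
```

This also explains how a negative R-squared, such as the value reported for Ridge Regression in Table 2, can arise: the model's squared error on the test data exceeds that of simply predicting the mean.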

Figure 3 Mean Absolute Error graph

 

Figure 4 R-Squared error graph

 

6. CONCLUSION

The health insurance premium data is stored using Blockchain Technology and the premium is predicted using the Random Forest Regression algorithm. Here, five supervised learning regression-based algorithms were used to predict the premium, namely Support Vector Regression, Lasso Regression, Ridge Regression, Multiple Linear Regression and Random Forest Regression. To build the model, four features that affect the premium price were taken. The performance of the five algorithms was measured in terms of MAE and R-Squared. The comparison shows that the Random Forest Regressor gives the best result.

In future, this type of problem can be solved with other supervised learning techniques.

 

CONFLICT OF INTERESTS

None. 

 

ACKNOWLEDGMENTS

None.

 

REFERENCES

accenture. (n.d.). Machine Learning in Insurance  

Bajaj, P., Ray, R., Shedge, S., Vidhate, S., & Nilkumar. (2020). Sales Prediction using Machine Learning Algorithms. International Research Journal of Engineering and Technology, 7(6).

Bhavsar, S. & Gor, R. (2022). Predicting Restaurant Ratings using Back Propagation Algorithm. International Organization of Scientific Research Journal of Applied Mathematics (IOSR-JM), 18(2), 5-9.

Biau, G. (2012). Analysis of a Random Forests Model. Journal of Machine Learning Research, 13, 1063-1095.

Coursera. (n.d.). Medical Insurance Premium Prediction with Machine Learning.     

D'Costa, L., D'Souza, A., Abhijith, K., & Varghese, D. (2020). Predicting True Value of Used Car using Multiple Linear Regression Model. International Journal of Recent Technology and Engineering, 8(5). https://doi.org/10.35940/ijrte.E1010.0285S20

Dabreo, S., Rodrigues, S., Rodrigues, V., & Shah, P. (2021). Real Estate Price Prediction. International Journal of Engineering Research & Technology, 10(4).

David, D. (2020). Random Forest Classifier Tutorial : How to Use Tree-Based Algorithms for Machine Learning.

excellarate. (2021). Blockchain in Insurance : Use Cases and the Way Forward.

Ghosh, M. & Gor, R. (2022). Short Message Service Classifier Application using Naïve Bayes algorithm (In Press). International Organization of Scientific Research Journal of Computer Engineering (IOSR-JCE).

Gururaj, V., Shriyaand, V., & Ashwini, K. (2019). Stock Market Prediction using Linear Regression and Support Vector Machines. International Journal of Applied Engineering Research, 14, 1931-1934.

Hanafy, M. (2021). Predict Health Insurance Cost by using Machine Learning and DNN Regression Models.

javatpoint. (2022). Decision Tree Classification Algorithm.    

kaggle. (n.d.). Medical Cost Personal Datasets.  

Kharwal, A. (2021). Health Insurance Premium Prediction with Machine Learning.

Kausthub, K. (2021). Commercials Sales Prediction Using Multiple Linear Regression. International Research Journal of Engineering and Technology, 8(3).

Rohith, R., Vishnu, R., Kishore, A., & Chakkarawarthi, D. (2020). Crop Price Prediction and Forecasting system using Supervised Machine Learning Algorithms. International Journal of Advanced Research in Computer and Communication Engineering, 9(3).

Ronaghan, S. (2018). The Mathematics of Decision Trees, Random Forest and Feature Importance in Scikit-learn and Spark.

Seely, S. (2018). Eight use cases for machine learning in insurance.

Shreyas, R., Akshata, D., Mahanand, B., Shagun, B., & Abhishek, C. (2016). Predicting Popularity of Online Articles using Random Forest Regression. Institute of Electrical and Electronics Engineers, 1-5. https://doi.org/10.1109/CCIP.2016.7802890

Vali, M., Sankeerthana, K., Naveen, B., & Vishal, N. (2020). Prediction of Online Sales using Linear Regression. International Journal of Creative Research Thoughts, 8(2).

Venkatesan, S., Sathishkumar, V., Park, J., Shin, C., & Cho, Y. (2020). A Prediction of Nutrition Water for Strawberry Production using Linear Regression. International Journal of Advanced Smart Convergence, 132-140.

Zhang, Q. (2021). Housing Price Prediction Based on Multiple Linear Regression. Hindawi Scientific Programming, (3), 1-9.

 

 

 

 

 

     

 

 

 
