GEOGRAPHICALLY WEIGHTED REGRESSION AND MULTIPLE LINEAR REGRESSION FOR TOPSOIL TEXTURE PREDICTION

Henny Pramoedyo *¹, Sativandi Riza ², Afiati Oktaviarina ¹,⁴, Deby Ardianti ³

¹,²,³ Department of Statistics, Brawijaya University, Malang, Indonesia

⁴ Department of Mathematics, Surabaya State University, Surabaya, Indonesia

DOI: https://doi.org/10.29121/granthaalayah.v9.i2.2021.3112

Article Type: Research Article

Article Citation: Henny Pramoedyo, Sativandi Riza, Afiati Oktaviarina, and Deby Ardianti. (2021). GEOGRAPHICALLY WEIGHTED REGRESSION AND MULTIPLE LINEAR REGRESSION FOR TOPSOIL TEXTURE PREDICTION. International Journal of Research -GRANTHAALAYAH, 9(2), 64-71. https://doi.org/10.29121/granthaalayah.v9.i2.2021.3112

Received Date: 15 January 2021

Accepted Date: 23 February 2021

Keywords: Regression; Geographically Weighted Regression; Soil Texture Modelling; Terrain Analysis; Digital Elevation Model

ABSTRACT

Land resource management requires extensive land mapping. Conventional soil mapping is slow and expensive; using geographic information system data as predictors in soil texture modeling is an alternative that shortens the time and reduces the cost. From digital elevation model data, topographic variables can be derived and used as independent variables for predicting soil texture. Geographically weighted regression is used to observe the effects of spatial heterogeneity. This study uses a data set of 50 observation points, each of which has soil particle-size fraction attributes and eight local morphological variables. The covariates are eastness aspect, northness aspect, slope, unsphericity curvature, vertical curvature, horizontal curvature, accumulation curvature, and elevation. Prediction using geographically weighted regression gives a higher coefficient of determination than the multiple linear regression model for all three fractions. Spatial location therefore affects the response variable Y: the geographically weighted regression model attains an R² of 0.81 for the sand fraction, 0.57 for the silt fraction, and 0.33 for the clay fraction.

1. INTRODUCTION

Soil texture is influenced by topographic variability, which modifies water flow and material distribution to produce a soil pattern in a landscape [1]. Mapping of soil texture is needed as the main source of information in land resource management [2]. Soil mapping is conducted using conventional methods, which require large amounts of time and high costs. This results in minimal information regarding the broad spatial distribution of soil textures. In studies on soil texture mapping, many methods are utilized, including modeling [2], [3], [4], which produces soil texture mapping efficiently and accurately.

The combination of statistical modeling and geographic information systems (GIS) is an alternative solution that shortens the time and reduces the cost. GIS data can be used as predictor variables in modeling [5]; for soil texture, the GIS data that describe topographic variability are the digital elevation model (DEM) [6]. From DEM data, topographic variables can be derived and used as independent predictors of soil texture.

The simplest model for two or more predictor variables is multiple linear regression (MLR), which models or predicts a dependent variable from its relationship with a group of independent variables [7]. Regression analysis, however, rests on several assumptions. When it is applied to data that are influenced by spatial aspects or geographic conditions, some of these assumptions are difficult to fulfill because of spatial heterogeneity [8], a situation in which conditions differ from one location to another [9]. This study therefore also uses geographically weighted regression (GWR) to observe the effects of spatial heterogeneity. GWR is based on locally weighted regression, a non-parametric technique developed in statistics for curve fitting and smoothing [10]. We then compare the results of multiple linear regression with those of GWR. The aim is a soil texture prediction model with high accuracy.

2. MATERIALS AND METHOD

Topsoil samples at a depth of 0-10 cm were taken at 50 randomly selected points in the Kalikonto watershed, Malang, during June-July 2020. Soil texture was determined by laboratory analysis and used as the primary data in this study. Because soil texture is a combination of three particle-size fractions (PSFs), namely sand, silt, and clay, modeling is conducted on each of the three PSFs as the Y variable. The X variables used in this study are eastness aspect (Ae) as X1, northness aspect (An) as X2, slope (S) as X3, unsphericity curvature (M) as X4, vertical curvature (Kv) as X5, horizontal curvature (Kh) as X6, accumulation curvature (Ka) as X7, and elevation (Elv) as X8.

2.1. DATA SETS

The data set of this study consists of 50 observation points, each of which has soil PSF attributes and eight local morphological variables (LMVs), which describe the curvature diversity of the topography [11]. The LMVs were obtained from the formulas shown in Table 1. These formulas require the derivatives of the elevation, which were computed from the DEM, whose digital number values are the elevations. To obtain the derivatives of the elevation, the following formula is used [12]:

Where z is the elevation, and w is the cell size in pixels. We apply a 3x3 window calculation to perform this analysis.

Table 1: Formulas to obtain the LMVs [11].

Covariates	Formula
eastness aspects (Ae)
northness aspects (An)
slope (S)
unsphericity curvature (M)
vertical curvature (Kv)
horizontal curvature (Kh)
accumulation curvature (Ka)
elevation (Elv)	Direct DEM’s pixel value
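
As an illustration of how terrain covariates can be derived from a 3 × 3 DEM window, the following Python sketch computes slope, eastness, and northness. It is not the formula of [11] or [12]. It assumes simple central differences for the two first derivatives, the common definitions eastness = sin(aspect) and northness = cos(aspect) with the aspect taken as the downslope direction, a window whose first row is the northern edge, and a cell size `w` in the same unit as the elevation. The function name `terrain_from_window` is hypothetical.

```python
import math

def terrain_from_window(z, w):
    """Slope (degrees), eastness, and northness at the centre of a 3x3 window.

    z : 3x3 nested list of elevations, row 0 = north, column 0 = west (assumed).
    w : cell size in the same unit as the elevations (assumed).
    """
    # Central differences (an assumption; not necessarily the formula of [12]).
    p = (z[1][2] - z[1][0]) / (2.0 * w)  # dz/dx, x increasing eastwards
    q = (z[0][1] - z[2][1]) / (2.0 * w)  # dz/dy, y increasing northwards
    g = math.hypot(p, q)                 # gradient magnitude
    slope_deg = math.degrees(math.atan(g))
    if g == 0.0:
        # Flat cell: aspect is undefined, so both components are set to zero.
        return slope_deg, 0.0, 0.0
    # The downslope direction is (-p, -q); its unit vector gives
    # eastness = sin(aspect) and northness = cos(aspect).
    return slope_deg, -p / g, -q / g

# A plane rising towards the east with gradient 1 faces west.
window = [[0.0, 1.0, 2.0],
          [0.0, 1.0, 2.0],
          [0.0, 1.0, 2.0]]
slope, eastness, northness = terrain_from_window(window, w=1.0)
print(slope, eastness, northness)
```

Splitting the aspect into eastness and northness avoids the circular discontinuity of an angle (359° and 1° are nearly the same direction), which is why the two components, rather than the aspect itself, are used as regression covariates.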

2.2. MULTILINEAR REGRESSION ANALYSIS

Multilinear regression analysis extends simple regression analysis to explain and describe the relationship between the response variable and more than one predictor variable [13]. The regression model with n observations and p predictor variables can be written as follows [7]:

$$y_i = \beta_0 + \sum_{k=1}^{p} \beta_k x_{ik} + \varepsilon_i$$

Where:

$i$	:	Index of the ith observation, with i = 1, 2, …, n
$x_{ik}$	:	Value of the kth predictor at the ith observation, with k = 1, 2, …, p
$\beta_0$	:	The intercept value for all observations
$\beta_k$	:	Regression coefficient of the kth predictor
$\varepsilon_i$	:	Error of the ith observation

Before starting the analysis, we performed several assumption tests as a standard procedure in regression analysis: a normality test of the residuals, a test of the homogeneity of the residual variance, and a multicollinearity test of the predictors.
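
The multiple linear regression model of this subsection can be sketched in Python as an ordinary least squares fit with NumPy. The data below are synthetic stand-ins with the same shape as the study's data set (50 points, eight predictors), not the measured values, and the helper name `fit_mlr` is hypothetical.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares fit; returns (coefficients, R^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])     # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # beta[0] is the intercept
    residuals = y - A @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot

# Synthetic stand-in data (NOT the study's measurements).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
assumed_beta = np.array([40.0, -2.0, 1.0, -3.0, 5.0, 2.0, 0.5, -4.0, -8.0])
y = assumed_beta[0] + X @ assumed_beta[1:] + rng.normal(scale=5.0, size=50)

beta_hat, r2 = fit_mlr(X, y)
print(np.round(beta_hat, 2), round(r2, 3))
```

The returned R² is the share of the variance of the response that the fitted model reproduces, which is the quantity later used to compare the models.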

2.3. GEOGRAPHICALLY WEIGHTED REGRESSION

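For the spatial aspect, we tested spatial autocorrelation with the Moran's I test statistic, based on the following hypotheses [14]:

Hypotheses:

$H_0 : I = 0$ (no spatial correlation).

$H_1 : I \neq 0$ (there is a spatial correlation).

Reject $H_0$ if $|Z(I)| > Z_{\alpha/2}$, with the test statistic

$$Z(I) = \frac{I - E(I)}{\sqrt{\mathrm{Var}(I)}}$$

and

$$I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\left( \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \right) \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Where $\bar{x}$ is the mean of $x$, $w_{ij}$ is the element of the weighting matrix, $I$ is Moran's index, $E(I)$ is the expected value of Moran's index, and $n$ is the number of samples.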
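
To make the Moran's index concrete, the sketch below computes it with NumPy in its standard form, I = (n / S0) · Σ_i Σ_j w_ij (x_i − x̄)(x_j − x̄) / Σ_i (x_i − x̄)², where S0 is the sum of all weights. The weight matrix and values are a hypothetical four-point example, not the study's data, and the function names are illustrative.

```python
import numpy as np

def morans_i(x, w):
    """Moran's I for values x and an n-by-n spatial weight matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = x.size
    d = x - x.mean()                       # deviations from the mean
    return (n / w.sum()) * (d @ w @ d) / (d @ d)

def expected_i(n):
    """Expected value of Moran's I under no spatial autocorrelation."""
    return -1.0 / (n - 1)

# Four points along a line; adjacent points are neighbours (hypothetical weights).
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = [1.0, 2.0, 3.0, 4.0]
print(morans_i(x, w), expected_i(len(x)))
```

A value of I above its expectation indicates that neighbouring points tend to have similar values; a value below it indicates that neighbours tend to differ.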

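The Breusch–Pagan test was used to test for spatial heterogeneity, based on the following hypotheses [15]:

Hypotheses:

$H_0 : \sigma_1^2 = \sigma_2^2 = \dots = \sigma_n^2 = \sigma^2$

$H_1$ : there is at least one j where $\sigma_j^2 \neq \sigma^2$

Reject $H_0$ if the test statistic BP exceeds the chi-square critical value at significance level $\alpha$, with

$$BP = \frac{1}{2} f^T Z (Z^T Z)^{-1} Z^T f + \frac{1}{T} \left( \frac{e^T W e}{\sigma^2} \right)^2$$

Where the ith element of $f$ is $f_i = e_i^2 / \sigma^2 - 1$, $e$ is the error (residual) vector, $W$ is the weighting matrix, $Z$ is the matrix containing the standardized predictor variables, and T is $\mathrm{tr}(W^T W + W^2)$.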
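
As a rough illustration of the heterogeneity test, the sketch below computes the classic (non-spatial) Breusch–Pagan quantity ½ fᵀZ(ZᵀZ)⁻¹Zᵀf from a vector of regression residuals, with f_i = e_i²/σ² − 1. It leaves out any term involving the spatial weighting matrix, adds an intercept column to Z, and estimates σ² as the mean squared residual; these are simplifying assumptions, and the function name `breusch_pagan` is hypothetical.

```python
import numpy as np

def breusch_pagan(residuals, Z):
    """Classic Breusch-Pagan statistic 0.5 * f' Z (Z'Z)^-1 Z' f."""
    e = np.asarray(residuals, dtype=float)
    Z = np.column_stack([np.ones(e.size), np.asarray(Z, dtype=float)])
    sigma2 = (e @ e) / e.size                  # mean squared residual (assumed)
    f = e ** 2 / sigma2 - 1.0
    coef, *_ = np.linalg.lstsq(Z, f, rcond=None)
    fitted = Z @ coef                          # projection of f onto the columns of Z
    return 0.5 * float(fitted @ fitted)

# Residuals whose spread grows with a single (hypothetical) predictor z.
z = np.array([[0.0], [0.0], [1.0], [1.0]])
e = np.array([1.0, -1.0, 2.0, -2.0])
print(breusch_pagan(e, z))
```

The statistic is zero when the squared residuals are unrelated to Z and grows as the residual variance varies with the predictors; it is compared with a chi-square critical value.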

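The GWR model considers geographic factors and produces local estimates of the model parameters for each point or location [16]. The GWR model is as follows:

$$y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{p} \beta_k(u_i, v_i) x_{ik} + \varepsilon_i$$

Where $y_i$ is the observed value of the response variable at location i, $x_{ik}$ is the observed value of the kth predictor variable at location i, $(u_i, v_i)$ is the coordinate point of location i, $\beta_0(u_i, v_i)$ is the intercept of the regression model at location i, $\beta_k(u_i, v_i)$ is the regression coefficient of the kth predictor variable at location i, and $\varepsilon_i$ is the error at location i.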

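The weighted least squares method is used to estimate the parameters of the GWR model, with a different weighting at each location. The parameter estimator for the GWR model is as follows [16]:

$$\hat{\beta}(u_i, v_i) = \left( X^T W(u_i, v_i) X \right)^{-1} X^T W(u_i, v_i)\, y \qquad (5)$$

From equation (5), the parameter coefficients of the GWR model take different values at each location, because the weight matrix $W(u_i, v_i)$ changes with the location.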
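
The local estimator can be sketched in Python as a weighted least squares solve for one location. The function name and the toy data are hypothetical; the weights are passed in directly here, whereas in GWR they would come from a kernel function of the distances to the location.

```python
import numpy as np

def gwr_local_coefficients(X, y, weights):
    """Weighted least squares estimate (X'WX)^-1 X'W y for one location.

    X       : n-by-p predictor matrix (an intercept column is added here).
    y       : n response values.
    weights : n non-negative weights of the observations for this location.
    """
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    XtW = X.T * w                 # same as X.T @ diag(w), without building diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)

# Toy data: y = x for the first two points, y = 2x for the last two.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 6.0])
print(gwr_local_coefficients(X, y, [1, 1, 0, 0]))  # fit dominated by the first two points
print(gwr_local_coefficients(X, y, [0, 0, 1, 1]))  # fit dominated by the last two points
```

The same data produce different coefficients under different weights, which is how GWR obtains one set of coefficients per location.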

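The weighting by kernel function is divided into fixed kernels and adaptive kernels (Fotheringham). A fixed kernel function has the same bandwidth at all locations [17], whereas an adaptive kernel has a bandwidth that varies with the location. For the Gaussian kernel, the weights are

$$w_j(u_i, v_i) = \exp\left( -\frac{1}{2} \left( \frac{d_{ij}}{h} \right)^2 \right) \quad \text{(fixed kernel)}$$

$$w_j(u_i, v_i) = \exp\left( -\frac{1}{2} \left( \frac{d_{ij}}{h_i} \right)^2 \right) \quad \text{(adaptive kernel)}$$

Where $h$ is the bandwidth, $h_i$ is the adaptive bandwidth, and $d_{ij}$ is the Euclidean distance

with,

$$d_{ij} = \sqrt{(u_i - u_j)^2 + (v_i - v_j)^2}$$

Where $(u_i, v_i)$ is the coordinate point of location i, and $(u_j, v_j)$ is the coordinate point of location j.

Additionally, $h$ is the optimum bandwidth obtained with the cross validation (CV) method

$$CV(h) = \sum_{i=1}^{n} \left( y_i - \hat{y}_{\neq i}(h) \right)^2$$

Where n is the number of samples, and $\hat{y}_{\neq i}(h)$ is the estimated value of $y_i$ when the observation at location i is omitted from the estimation.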
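
The bandwidth selection can be sketched as follows, assuming a fixed Gaussian kernel of the form exp(−½ (d/h)²) and a leave-one-out cross validation score. The data are synthetic stand-ins, the candidate bandwidths are arbitrary, and the function names are hypothetical; a simple grid search stands in for whatever optimizer the GWR software uses.

```python
import numpy as np

def gaussian_weights(coords, i, h):
    """Fixed Gaussian kernel weights of all points relative to point i."""
    d = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))  # Euclidean distances
    return np.exp(-0.5 * (d / h) ** 2)

def cv_score(coords, X, y, h):
    """Leave-one-out CV: sum of squared errors when each point is predicted
    from a local fit that excludes the point itself."""
    A = np.column_stack([np.ones(len(y)), X])
    total = 0.0
    for i in range(len(y)):
        w = gaussian_weights(coords, i, h)
        w[i] = 0.0                          # omit observation i
        AtW = A.T * w
        beta = np.linalg.solve(AtW @ A, AtW @ y)
        total += (y[i] - A[i] @ beta) ** 2
    return total

# Synthetic stand-in data (NOT the study's): the slope of x drifts from west to east.
rng = np.random.default_rng(1)
coords = rng.uniform(0.0, 10.0, size=(50, 2))
X = rng.normal(size=(50, 1))
y = 1.0 + (0.5 + 0.3 * coords[:, 0]) * X[:, 0] + rng.normal(scale=0.1, size=50)

candidates = [2.0, 4.0, 8.0, 16.0, 32.0]    # hypothetical bandwidth grid
scores = {h: cv_score(coords, X, y, h) for h in candidates}
best_h = min(scores, key=scores.get)
print(best_h, scores[best_h])
```

A small bandwidth lets the coefficients follow local variation but uses few effective observations per fit, whereas a very large bandwidth gives nearly equal weights everywhere and approaches the global regression; the CV score balances the two.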

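Partial testing of the GWR model parameters is used to determine which predictor variables influence the response variable at each location, based on the following hypotheses:

Hypotheses:

$H_0 : \beta_k(u_i, v_i) = 0$

$H_1 : \beta_k(u_i, v_i) \neq 0$

The test statistic can be written as [16]:

$$T = \frac{\hat{\beta}_k(u_i, v_i)}{\hat{\sigma} \sqrt{c_{kk}}}$$

Where,

$$C_i = \left( X^T W(u_i, v_i) X \right)^{-1} X^T W(u_i, v_i)$$

$\hat{\sigma}$ is the estimated standard deviation of the error, and $c_{kk}$ is the kth diagonal element of the matrix $C_i C_i^T$

Reject $H_0$ if the test statistic satisfies $|T| > t_{\alpha/2;\, df}$, where df is the degrees of freedom of the model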

3. RESULTS AND DISCUSSION

3.1. MULTILINEAR REGRESSION RESULT

For the sand model, the equation for the multiple linear regression model obtained is as follows:

Based on the model obtained, An, M, Kv, and Kh have a positive relationship with the sand fraction. Meanwhile, Ae, S, Ka, and Elv have a negative relationship with the sand fraction. For example, the lower the Ka value, the higher the sand fraction. This multiple linear regression model produces an R² value of 0.6285, which means that the study's independent variables jointly explain 62.85% of the variation in the sand fraction, and variables outside the research variables account for the remaining 37.15%.

The equation for the silt model obtained is as follows:

Based on this model, An, M, Kv, and Ka have a negative relationship with the silt fraction. Meanwhile, Ae, S, Kh, and Elv have a positive relationship with the silt fraction. For example, the lower the Ka value, the higher the silt fraction. This multiple linear regression model produces an R² value of 0.5503, which means that the study's independent variables jointly explain 55.03% of the variation in the silt fraction, and variables outside the research variables account for the remaining 44.97%.

For the clay model, the equation for the multiple linear regression model obtained is as follows:

Based on this model, An, M, and Kh have a negative relationship with the clay fraction. Meanwhile, Ae, S, Kv, Ka, and Elv have a positive relationship with the clay fraction. For example, the lower the Ka value, the lower the clay fraction. The multiple linear regression model produces an R² value of 0.3034, which means that the independent variables jointly explain 30.34% of the variation in the clay fraction, and variables outside the research variables account for the remaining 69.66%. The above models met the standard assumption tests for multiple regression analysis.

3.2. GWR ANALYSIS RESULT

Based on the results of the spatial dependence test, the p-values for the three soil fractions are smaller than α = 0.05; therefore, spatial dependence among the observations exists. Likewise, the heterogeneity test indicates spatial heterogeneity in all three PSFs. Because both spatial dependence and spatial heterogeneity are present, the multiple linear regression method is not appropriate for describing the soil fractions, and it is better to use a model that accommodates the location of each observation.

The first step in GWR modeling is to determine the optimal bandwidth and minimum CV by using fixed Gaussian spatial weighting. The minimum CV and bandwidth results are shown in Table 2.

Table 2: Minimum CV and bandwidth

Statistic	Sand	Silt	Clay
Minimum CV	3459.44	3767.20	3594.77
Bandwidth	4645.38	22043.22	22043.22

The GWR results are shown in Table 3.

Table 3: GWR model coefficients (minimum and maximum of the local estimates, and the global estimate)

PSF	Variable	Min	Max	Global
Sand	Intercept	41.5669	48.4123	46.1294
	X1	-2.9916	-1.5447	-2.4215
	X2	-0.5120	1.6084	1.0453
	X3	-4.3435	-1.2016	-3.1946
	X4	-0.6242	5.5791	6.1098
	X5	-2.6798	3.4149	2.3281
	X6	-1.6738	3.4592	0.5892
	X7	-9.0681	1.5708	-3.5741
	X8	-12.2402	-4.1612	-9.4836

Silt	Intercept	27.5163	27.7998	27.500
	X1	1.14747	1.3448	1.1832
	X2	-0.4801	-0.2678	-0.3682
	X3	2.9984	3.1486	3.0438
	X4	-4.4882	-3.9209	-4.2890
	X5	-5.5011	-4.9580	-5.2969
	X6	1.5558	2.1242	1.8954
	X7	-4.1397	-4.0602	-4.1780
	X8	7.0855	7.5849	7.2919

Clay	Intercept	26.1564	26.5512	26.3665
	X1	1.1501	1.2355	1.2383
	X2	-0.7888	-0.5473	-0.6771
	X3	-0.0714	0.2663	0.1508
	X4	-2.038	-1.5131	-1.18208
	X5	2.8305	2.9345	2.9688
	X6	-2.8077	-2.1079	-2.4846
	X7	7.4888	8.1884	7.7521
	X8	2.0445	2.2463	2.1917

Table 4: Comparison of the coefficient of determination (R²) of the MLR and GWR models

Model	Sand	Silt	Clay
MLR	0.63	0.55	0.30
GWR	0.81	0.57	0.33

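Based on Table 4, the R² of the GWR model is greater than the R² of the multiple linear regression model for all three soil fractions, meaning that the GWR model fits the existing data better. The gain is largest for the sand fraction (0.81 versus 0.63) and small for the silt fraction (0.57 versus 0.55) and the clay fraction (0.33 versus 0.30).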

4. CONCLUSION

Prediction using GWR gives a higher coefficient of determination than the multiple linear regression model for all three soil fractions. Spatial location therefore affects the response variable Y: the GWR model attains an R² of 0.81 for the sand fraction, 0.57 for the silt fraction, and 0.33 for the clay fraction.

SOURCES OF FUNDING

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

CONFLICT OF INTEREST

The authors have declared that no competing interests exist.

ACKNOWLEDGMENT

This research was sponsored by the professor and doctor grants of the Faculty of Mathematics and Natural Sciences, University of Brawijaya. The authors are grateful to the anonymous referees for carefully checking the details and for helpful comments that improved the overall presentation of this paper.