OPTIMIZATION OF THE HIGUCHI METHOD

Background and Objective: Higuchi’s method of determining fractal dimension (HFD) occupies a valuable place in the study of a wide variety of physical signals. In comparison to other methods, it provides more rapid, accurate estimations for the entire range of possible fractal dimensions. However, a major difficulty in using the method is the correct choice of the tuning parameter (k max) needed to compute the most accurate results. In the past, researchers have used various ad hoc methods to determine the appropriate k max choice for their data. We provide a more objective method of determining, a priori, the best value of the tuning parameter for a data set of a given length. Methods: We generate numerous realizations of fractional Brownian motion to perform Monte Carlo simulations of the distribution of the calculated HFD. Results: Experimental results show that HFD depends not only on k max but also on the length of the time series, which enables derivation of an expression for finding the appropriate k max for an input time series of unknown fractal dimension. Conclusion: The Higuchi method should not be used indiscriminately without reference to the type of data whose fractal dimension is examined. Monte Carlo simulations with different fractional Brownian motions increase the confidence in the evaluation results.


INTRODUCTION
The Higuchi algorithm Higuchi (1988) is one of many widely used methods to compute fractal properties of complex nonlinear physical signals Esteller et al. (2001). It is often preferred when big data are analyzed because it is stable, rapid, accurate, relatively low-cost, and outperforms better-known linear methods. We call the fractal dimension calculated via the Higuchi algorithm the Higuchi fractal dimension (HFD).
The Higuchi algorithm was first applied to turbulence in space plasmas Higuchi (1988) but is applicable to any data generated by complex systems since these tend to exhibit multidimensional fluctuations over many orders of magnitude. The method thus leverages the multiple scaling domains generated by complex systems and derives their characteristic fractal (or multifractal) scaling exponents.
Over the past several decades the Higuchi algorithm has been extensively used to study pathologies in biological systems. Its utility increases in an era of big data where real-time computation continues to grow in importance Gomolka et al. (2018), Klonowski et al. (2004). The Higuchi algorithm has been successfully employed in numerous medical studies including human gait Doyle et al. (2004), epilepsy Khoa and Toi (2012), disorders Wajnsztejn et al. (2016), magnetoencephalograms Gómez et al. (2009), Khoa and Toi (2012), neurophysiology Kesić and Spasić (2016), electroencephalograms Paramanathan and Uthayakumar (2008), Zappasodi et al. (2014), and brain entrainment Shamsi et al. (2021). The Higuchi method generally exceeds the efficiency of well-known linear methods such as the fast Fourier and wavelet transforms. Those methods work well when signals are stationary, but the increasing richness of data that are not only nonstationary but generated by nonequilibrium and noisy processes eliminates their advantages.
The Higuchi algorithm exhibits sensitive dependence on a tuning parameter, k max, defined in the following section. In the original paper Higuchi did not elaborate on the selection of the tuning parameter and for illustration used k max = 2^11 for time series of length N = 2^17. Subsequent authors used similar values for the tuning parameter, but it has since been discovered that the tuning parameter plays a crucial role in estimating the HFD. Several studies have addressed the issue of proper selection of the tuning parameter k max. An early paper Affinito et al. (1997), in a study of electroencephalograms, sought the most suitable pair (k max, N) by calculating HFD values for k max = 3-10 for time series ranging in length from N = 50-1000, settling on k max = 6 as the optimum. Decades later the literature is unclear on the method of determining the appropriate value of the tuning parameter, usually suggesting that k max must be determined on a case-by-case basis. Several papers recommend plotting the calculated HFD against a range of k max, and selecting the appropriate k max at the location where the calculated HFD approaches a local maximum or asymptote, considered to be a saturation point Doyle et al. (2004), Wajnsztejn et al. (2016). Gomolka et al. (2018) select k max on the basis of statistical tests that allow discrimination between known healthy and diabetic subjects. Paramanathan and Uthayakumar (2008) proposed to determine k max based on a size-measure relationship that employs recursive lengths of the signal at different measuring scales. In most cases researchers found that the calculated HFD is not much affected by the length of the time series but depends more strongly on k max. Therefore, a poorly chosen tuning parameter can severely prejudice results. Table 1 shows a sample of pairs of (k max, N) from the literature. There is clearly no widely accepted procedure for determining the tuning parameter.
The plethora of values selected for the tuning parameter makes it clear that the community would benefit from a careful consideration of selection criteria for the tuning parameter, especially with long time-series and big data where it is impractical to examine an almost infinite number of curves of HFD against k max .
In summary, one of the main difficulties in applying the Higuchi algorithm is that it relies on a tuning parameter, k max, that in most cases must be selected before the fractal dimension is computed. Our goal in this paper is to explore the optimal pairing of the tuning parameter and the length of the time series for the most general types of data.

MATERIALS AND METHODS
The Higuchi algorithm takes a signal and discretizes it into the form of a time series, X(1), X(2), … , X(N). From this series we derive k new time series, X_k^m, defined as follows:

X_k^m : X(m), X(m + k), X(m + 2k), … , X(m + [(N − m)/k]·k),

where [·] represents the integer part of the enclosed value. The integer m = 1, 2, … , k is the start time; k is the time interval, with k = 1, … , k max; k max is a free tuning parameter. This means that a given time interval equal to k spawns k sets of new time series. For instance, if k = 3 and the time series has length N = 40, the following three time series are derived from the original data:

X_3^1 : X(1), X(4), X(7), … , X(37), X(40)
X_3^2 : X(2), X(5), X(8), … , X(38)
X_3^3 : X(3), X(6), X(9), … , X(39)

The length of any one of these curves is given by:

L_m(k) = (1/k) { [ Σ_{i=1}^{[(N−m)/k]} |X(m + ik) − X(m + (i−1)k)| ] · (N − 1) / ([(N − m)/k]·k) }

The length of the curve for the time interval k is then defined as the average over the k sets of L_m(k):

L(k) = (1/k) Σ_{m=1}^{k} L_m(k)

If this equation scales according to the rule L(k) ∝ k^(−D), then the time series behaves as a fractal with dimension D. Thus, the HFD is defined as the slope of the straight line that fits the curve of ln(L(k)) versus ln(1/k).

The data we experiment on are theoretically derived fractional Brownian motions (fBm). fBm is a continuous-time random process proposed by Mandelbrot and Van Ness (1968). A signal that displays fBm is expressed in terms of a stochastic integral with respect to a white noise process:

B_H(t) = (1/Γ(H + 1/2)) { ∫_{−∞}^{0} [(t − s)^(H−1/2) − (−s)^(H−1/2)] dW(s) + ∫_{0}^{t} (t − s)^(H−1/2) dW(s) }

Here W is a white noise process defined on (−∞, ∞), and H ∈ (0, 1) is known as the Hurst parameter. The Hurst exponent for the signal is its roughness averaged over many length scales. The covariance function is given by

E[B_H(t) B_H(s)] = (1/2) ( |t|^(2H) + |s|^(2H) − |t − s|^(2H) )

This means that for the special case H = 1/2, fBm reduces to the well-known random walk. The relationship between H and HFD is HFD = 2 − H, with values between 1 and 2. Thus, we are able to use the fBm process as our data source with a well-defined HFD in order to determine how well the Higuchi algorithm is able to accurately recover the theoretical value. Figure 1 shows sample data of fBm signals for different values of the HFD.
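As a concrete illustration of the steps above, here is a minimal Python sketch of the algorithm (our own illustrative implementation, not the code used for the experiments in this paper; the function name higuchi_fd is ours):

```python
import numpy as np

def higuchi_fd(x, kmax):
    """Estimate the Higuchi fractal dimension of a 1-D time series.

    x    : array-like signal X(1), ..., X(N)
    kmax : free tuning parameter (largest time interval k considered)
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    ks = np.arange(1, kmax + 1)
    L = np.empty(len(ks))
    for j, k in enumerate(ks):
        Lm = []
        for m in range(1, k + 1):            # k subseries with start times m = 1..k
            n = (N - m) // k                 # [(N - m)/k] increments in this subseries
            if n < 1:
                continue
            idx = (m - 1) + np.arange(n + 1) * k   # 0-based indices of X(m), X(m+k), ...
            # curve length with Higuchi's normalization (N-1)/(n*k), then the 1/k factor
            Lm.append(np.abs(np.diff(x[idx])).sum() * (N - 1) / (n * k) / k)
        L[j] = np.mean(Lm)
    # HFD is the slope of ln L(k) against ln(1/k)
    slope, _ = np.polyfit(np.log(1.0 / ks), np.log(L), 1)
    return slope
```

As a sanity check, for a straight line the curve length scales exactly as L(k) ∝ 1/k, so the estimate returns D = 1; a simple random walk should come out near D = 1.5.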
We create these data using a wavelet-based synthesis of fBm based on the biorthogonal wavelet method proposed by Meyer and Sellan Abry and Sellan (1996), Bardet et al. (2003), implemented in Matlab software. Figure 2 shows examples of how, in the Higuchi algorithm, the HFD is calculated from the slope of linear fits to the curve of ln(L(k)) versus ln(1/k), for theoretical HFD values of 1.3 and 1.7.
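For readers without access to the Matlab wavelet synthesis, fBm realizations can also be drawn directly from the covariance function given above by Cholesky factorization. This is a simple but O(n^3) alternative, suitable only for short series; the sketch and the name fbm_cholesky are ours, not part of the wavelet method used in this paper:

```python
import numpy as np

def fbm_cholesky(n, H, rng=None):
    """Sample an fBm path B_H(t) at t = 0, 1, ..., n by Cholesky factorization
    of the exact covariance cov(t, s) = 0.5*(t^2H + s^2H - |t - s|^2H)."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(1, n + 1, dtype=float)
    cov = 0.5 * (t[:, None] ** (2 * H) + t[None, :] ** (2 * H)
                 - np.abs(t[:, None] - t[None, :]) ** (2 * H))
    path = np.linalg.cholesky(cov) @ rng.standard_normal(n)
    return np.concatenate(([0.0], path))   # fBm starts at B_H(0) = 0
```

For H = 1/2 the covariance reduces to min(t, s), so the sampled path is a random walk with independent unit-variance Gaussian increments, matching the special case noted above.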

RESULTS AND DISCUSSIONS
One suggested way to find the optimal k max is to plot the calculated HFD against k max and select the value in the range for which HFD(k max) achieves a plateau. This assumes that a plateau is achieved in every case for different classes of HFD. It is not a priori evident that this will be the case since, as is demonstrated in Figure 1, different values of the fractal dimension create time series with different roughness, which will influence the accuracy of the Higuchi algorithm. Those with HFD below 1.5 are smoother curves having clear long-range dependence, while curves with higher values are rougher. Figure 3 shows the error between the theoretical HFD and the value calculated via the Higuchi algorithm, for time series of length N=1,000 and varying values of k max. In general, for the smallest value k max = 2, the persistent time series (HFD < 1.5) produce an underestimate of the theoretical fractal dimension, and the antipersistent time series an overestimate. We note that not every case shows the HFD estimate reaching a plateau, demonstrating that seeking the optimal tuning parameter via a plateau is not in general a valid procedure for the Higuchi algorithm. Indeed, for the persistent time series no plateau is achieved. The antipersistent curves do show a clear plateau, yet, contrary to general recommendations, selecting k max near the plateau does not yield the lowest error, which is instead achieved for k max values far beyond the plateau (i.e., for HFD = 1.9, 1.7).

In order to explore how the HFD depends on the tuning parameter we created 100 independent time series with lengths, N, varying between 1,000 and 200,000 data points. Next, for each of these series, we computed the HFD for different values of k max, allowing k max to vary between k max = 2 and k max = N/2. The idea is to produce Monte Carlo simulations from which to derive a distribution of outcomes that can be analyzed.
This results in a set of HFD values as a function of the time series length, N, and the tuning parameter, k max. Next, we create surface plots comparing the HFD versus the tuning parameter k max and time series length, N. To best compare the results for the various theoretical HFD from the wavelet method, we calculate an error metric, defined as the percentage error:

E(k max, N) = 100 × |HFD_calculated − HFD_theoretical| / HFD_theoretical

Figure 8 shows the surface plots of E(k max, N) for values of HFD = 1.9, 1.7, 1.5, 1.3, 1.1. The grey line overlaid in each figure shows the curve of minimum error, corresponding to the k max tuning parameter that yields the best correspondence between the theoretical and calculated HFD values for a given N, averaged over 100 unique time series. By taking a simple average of these minimum error curves we are able to derive a best-fit cubic function, shown in Figure 9 as the dashed curve, for the average relationship between the time series length and the tuning parameter, for different HFD values. This function can be used to generate a tuning parameter estimate that can be expected to yield accurate results for the HFD, for cases where N < 200,000.
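The Monte Carlo procedure can be condensed into a short self-contained sketch (illustrative Python; a Cholesky fBm generator stands in for the wavelet synthesis, and the trial count is kept far below the 100 used here so the example runs quickly; all function names are ours):

```python
import numpy as np

def higuchi_fd(x, kmax):
    """Higuchi fractal dimension: slope of ln L(k) versus ln(1/k)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    ks = np.arange(1, kmax + 1)
    L = np.empty(len(ks))
    for j, k in enumerate(ks):
        Lm = []
        for m in range(1, k + 1):
            n = (N - m) // k
            if n < 1:
                continue
            idx = (m - 1) + np.arange(n + 1) * k
            Lm.append(np.abs(np.diff(x[idx])).sum() * (N - 1) / (n * k) / k)
        L[j] = np.mean(Lm)
    return np.polyfit(np.log(1.0 / ks), np.log(L), 1)[0]

def fbm(n, H, rng):
    """fBm sample by Cholesky factorization of the exact covariance."""
    t = np.arange(1, n + 1, dtype=float)
    cov = 0.5 * (t[:, None] ** (2 * H) + t[None, :] ** (2 * H)
                 - np.abs(t[:, None] - t[None, :]) ** (2 * H))
    return np.linalg.cholesky(cov) @ rng.standard_normal(n)

def mc_error(N, H, kmax, trials, rng):
    """Mean percentage error E(kmax, N) between theoretical and computed HFD."""
    target = 2.0 - H                      # theoretical HFD = 2 - H
    errs = [100.0 * abs(higuchi_fd(fbm(N, H, rng), kmax) - target) / target
            for _ in range(trials)]
    return float(np.mean(errs))

rng = np.random.default_rng(1)
for kmax in (2, 4, 8, 16, 32):            # scan the tuning parameter at fixed N
    print(kmax, round(mc_error(1000, 0.5, kmax, trials=5, rng=rng), 2))
```

Sweeping kmax (and, in the full experiment, N) over a grid of these mean errors is what produces the E(k max, N) surfaces described above.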

Figure 9
Comparison of the average minimum error curve (solid) and the best fit cubic function (dashed).

APPLICATION TO GENETICS
We next demonstrate analysis on physical data, that of a complete genome of a bacterial organism, viz. the proteobacteria Acidovorax avenae which attacks the Cucurbitaceae family. These data are freely available from the National Center for Biotechnology Information (NCBI), which is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH).
In order to analyze hidden patterns in the genome, the nucleotide sequence must be converted from a symbolic sequence (A, G, C, T) to a numeric representation. We followed the method of Peng et al. (1992), wherein DNA is represented as a "random walk" with two parameters ruling the direction of the "walk" and the resulting displacement. Starting from the first nucleotide, every time we encounter a pyrimidine base we move up one position; when a purine base is encountered, we move down one position. The distance from the first nucleotide is then plotted against the displacement, as in Figure 10. Figure 11 shows the computed HFD against the tuning parameter k max for time series of length N=200,000 that are subsets of the original series shown in Figure 10. We took 39 non-overlapping series and computed curves for each of them. Figure 11 shows that the curves (black lines) computed from different subsets of the genome are very similar, which indicates that the fractal dimension does not vary significantly along the genome.
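The Peng mapping is simple to implement; here is a minimal sketch (our own, with the hypothetical name dna_walk; skipping ambiguous symbols such as N is our choice, not part of the original method):

```python
def dna_walk(seq):
    """Peng et al. (1992) DNA random walk: +1 for a pyrimidine (C, T),
    -1 for a purine (A, G); returns the running displacement."""
    step = {"C": 1, "T": 1, "A": -1, "G": -1}
    y, walk = 0, []
    for base in seq.upper():
        y += step.get(base, 0)   # skip ambiguous bases such as N
        walk.append(y)
    return walk

print(dna_walk("AGCTTAGC"))  # -> [-1, -2, -1, 0, 1, 0, -1, 0]
```

Applying this to the full genome sequence yields the displacement series of Figure 10, whose subsets feed the Higuchi algorithm.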
In each case the results show a clear plateau and an HFD < 1.5, indicating a persistent time series, though with widely divergent values depending on the tuning parameter.

Figure 11
HFD calculated for kmax varying between 2 and 100,000 for non-overlapping time series of length N=200,000 from Acidovorax avenae. The grey triangle shows where our research indicates the appropriate kmax should be situated.
The plateau region is shown with the arrows and, using a plateau as the criterion for selecting k max, suggests the best tuning parameter should be in the range k max = 10-50, thus yielding HFD ~ 1.49. However, our analysis, given the time series length N=200,000, recommends a value around k max = 1,000 (grey triangle), which yields HFD ~ 1.42. This difference makes it easy to see where the Higuchi algorithm best approximates the correct HFD, and where it produces over- or underestimates.

CONCLUSIONS AND RECOMMENDATIONS
The Higuchi algorithm Higuchi (1988) is one of many widely used methods to compute fractal properties of complex nonlinear physical signals from a wide variety of research areas Esteller et al. (2001), Wanliss and Reynolds (2003), Mitsutake et al. (2004), Cersosimo and Wanliss (2007). Over the past decades the Higuchi algorithm has been extensively used to study pathologies in biological systems and to discriminate between healthy signals and those displaying pathologies. It is often preferred when large amounts of data are analyzed because it is stable, rapid, accurate, relatively low-cost, and outperforms better-known linear methods for studying the fractal properties of a time series. However, the method requires the user to input a free tuning parameter, k max, the selection of which influences the computational efficiency of the algorithm and, more importantly, the value of the computed HFD. Thus, all the benefits of the algorithm can be negated by poor selection of the tuning parameter. Different values of k max can produce widely divergent estimates of the HFD from the same time series, so it is imperative to have a method for appropriately determining the best tuning parameter.
It has been suggested (e.g., by Doyle et al. (2004) and Wajnsztejn et al. (2016)) that the best k max for deriving the most precise HFD is where the calculated HFD(k max) approaches a local maximum or asymptote, considered to be a saturation point. However, as we have demonstrated in Figure 3, it is not a given that any physical data will result in such a plateau or saturation point. This methodology can produce spurious results and can be a time-consuming, iterative process.
In this paper we have explored Monte Carlo computer realizations of wavelet-derived fBm time series with known HFD. We have demonstrated that calculation of an accurate HFD depends not only on the appropriate k max but also on the time series length, N. We have derived a relationship, a third-order polynomial, that determines the appropriate k max to obtain the best HFD given the time series length, N. In general, persistent time series, with HFD < 1.5, tended to need smaller k max than antipersistent series.
In future work we will extend these results to larger data sets and explore the effects of simulations of synthetic time series with known fractal properties, to see how well the method performs and how it is influenced by the generation algorithms.