In Search of Determinism-Sensitive Region to Avoid Artefacts in Recurrence Plots

,


Introduction
Recurrence is a fundamental property of many dynamical systems, which can be exploited to characterize the systems behavior in phase space, while a recurrence plot (RP) is the visualization tool for the analysis of this property. In this study, the phase space reconstruction method of time delay embedding [Packard et al., 1980;Takens, 1981] is used [Eq. (1)]. Such a reconstruction is particularly useful when not all variables required to describe the system are available (i.e. data scarcity or limited set of observation variables), and where the topology of the system dynamicsx i can still be created using only a single variable or observation u i .
where m is the embedding dimension and τ is the time delay. The vectors (e j ) are unit vectors and span an orthogonal coordinate system (e i ·e j ) = δ i,j . The calculation of recurrence as elements of the RP is based on Eq. (2): . . . , N, (2) where N is the number of measured points x i , ε is a threshold distance, · is a norm and Θ(·) the Heaviside function. The RP is basically the visual representation of the square matrix, in which the matrix elements correspond to those times at which a state of a dynamical system recurs (columns and rows correspond then to a certain pair of times). RPs are especially useful for nonstationary pattern in time series [Eckmann et al., 1987;Marwan et al., 2007]. Besides using RPs for the visual analysis of time series, RPs can also quantify structures hidden within the series through recurrence quantification analysis (RQA) [Zbilut & Webber, 1992;Marwan et al., 2007]. In RQA, important elements are the diagonal and vertical/horizontal straight lines because they reveal typical dynamical features of the investigated system, such as range of predictability, chaos-order, and chaos-chaos transitions [Trulla et al., 1996]. One of the prominent diagonal line measures is called determinism [DET,Eq. (3)], from which the system predictability can be inferred.
where P (l) = {l i ; i = 1, . . . , N l } is the histogram of the lengths l of diagonal structures, and N l is the absolute number of those diagonal lines.
For a deterministic signal (including chaos), many diagonal lines in the RP are typical, leading to high value of DET [Marwan, 2010]. However, single, isolated recurrence points can occur if states are rare, if they do not persist, or if they fluctuate heavily. For instance, stochastic or random signals would comprise such single points and result in a very low DET.
Since the use of RPs relies on the reconstructed phase space, its parameters' uncertainty includes those of the phase space reconstruction method, such as embedding dimension (m) and time delay (τ ), in addition to the recurrence threshold (ε). Standard approaches for finding optimal embedding parameters are false nearest neighbors (FNN) for m, and auto-correlation or mutual information (MI) for τ [Kantz & Schreiber, 2005;Kennel et al., 1992;Fraser & Swinney, 1986]. Other methods include wavering-products, fill-factor or integral local deformation [Buzug & Pfister, 1992]. Moreover, Marwan [2010] concludes that τ is sometimes overestimated by auto-correlation and mutual information, and that the choice of the embedding dimension has to be considered with care, as a wrong choice artificially increases diagonal lines, and hence DET, and leads to artefacts. For instance, a RP resulting from a random series should exhibit scattered or nondeterministic patterns (i.e. single points). However, when m increases to 2 and beyond with τ = 1, the number and the length of diagonal lines start to increase and dominate the plot as artefacts. This may be misinterpreted as if the series was highly deterministic (Fig. 1).
In this study, we focus on the artefacts related to these embedding parameters. The impact of the recurrence threshold ( ) is not elaborated, since the selection of the optimal values of the recurrence threshold has been discussed earlier [Gao & Jin, 2009;Koebbe et al., 1994;Zbilut & Webber, 1992;Zbilut et al., 2002;Mindlin & Gilmore, 1992;Schinkel et al., 2008;Thiel et al., 2002]. Hence, the recurrence threshold is fixed to a 10% recurrence rate (recurrence points density). Supplementary information on the impact of changing this threshold is enclosed in the Appendix. The Appendix also includes the evaluation of DET values of a correlated random series, using an AR1 series as example, to showcase that high DET values are indeed associated with deterministic systems instead of its auto-correlation structures, although there are also cases at certain parameter values where the number/length of diagonal lines artificially increase. It is important to note that the proposed technique is not intended to be used as a new, independent method, but rather as an additional consideration during parameterization, when the dynamical system is known to be deterministic.

Methodology
Artificially biased line length distributions due to the embedding can overlay the true line length distributions and lead to wrong conclusions. Hence, it would be desirable to separate the contribution of the embedding induced line length distributions from the real underlying dynamics. However, separating both contributions is not possible without additional knowledge about the system (such as precise model or amount of observational noise). Therefore, we propose an approach that minimizes the contribution of the embedding. This approach is based on comparing the fraction of recurrence points that form diagonal lines in the RPs of the original time series (which includes both the real underlying dynamics as well as the embedding effect) with that of a random time series (which consists of the embedding effect only). As random time series we use simply shuffled versions of the original time series, because this preserves its value distribution and, thus, allows to use the same recurrence threshold and allows to compare the resulting RPs. As mentioned above, RPs of random time series should consist mainly of single points, but embedding artifacts would increase the fraction of recurrence points that form diagonal lines in the RP. Thus, this fraction measure is well suited for our purpose. Moreover, this measure is equal to the DET measure. Other measures that use the line length distribution (e.g. average and longest line length, entropy of the length distribution) would be possible but are less intuitive and interpretable. The advantage of the DET measure is that it considers the influence of scattered points that appear within the RP as well as in addition to just the diagonal lines, while the index of average and longest line length could easily suffer from large statistical uncertainty and can easily be influenced by a few extreme values.  τ fixed at 3 following the first minimum of its automutual information, with m varying from 1 to 10. In contrast, for the Gaussian random series, the DET values are shown to be between 0 to 0.2 (Fig. 3).
The resulting difference (determinism distance) between DET i and DET o would therefore be high. The undesired effect by the embedding should be minimal for the difference between DET i and DET o . Both median (M d ) and standard deviation (S d ) of these distances are used for identifying this determinism-sensitive region [Eqs. (4) and (5)]. The further (larger) the M d of each parameter combination, the safer it is in terms of avoiding the mentioned artefacts, under the condition that S d should be reasonably small (e.g. within 0.1) where DET o and DET i are the recurrence determinism values of the original series and each shuffled iteration (i), and n is the total number of shuffling iterations.

Case Study Applications
This paper presents two application examples using Lorenz series derived from a mathematical model, and daily runoff observations from the station Burghausen at the Salzach River in south Germany. These signals are chosen for their nonlinear characteristics with known presence of determinism [Sivakumar, 2000;Martins et al., 2011]. The resulting region of artefact-safe parameter set will be presented and discussed in Sec. 4. Caution should be taken when τ = 1 because artificially high DET values can lead to misinterpretations [ Figs. 1(b), 1(d) and 1(e)], and hence should be excluded. In addition to the resulting artefact-safe region as the boundary of the parameter sets, the final choice of the parameter set is still necessary to be optimal, i.e. being able to reconstruct the topology of system dynamics and minimal in the sense not to over-reduce data points in the signal. There are many approaches to find optimal embedding parameters, such as the standard approaches mentioned in Sec. 1.
This Lorenz system is described by three variables and integrated using the Euler scheme, and hence, we know the three-dimensional phase space that describes the topology of the system dynamics. In this study, the x variable is used as our Lorenz series test set [ Fig. 4(a)] with its phase space reconstructed using the time delay embedding method. Thereafter, its DET is calculated. The reliability of these DET values is checked by using median and standard deviation of their determinism distance values (M d and S d ) to qualitatively evaluate how much the constructed RP of a certain parameter set is influenced by artefacts. This Lorenz series is derived from a mathematical model with well-known phase space topology and recurrence characteristics, whereas real world observations are most likely contaminated by noise. Therefore, we also investigate the impact of noise on the method, i.e. in respect to the values of determinism and determinism distance. Gaussian white noise with a magnitude range corresponding to the standard deviation of the Lorenz signal is applied, i.e. added to the signal [Eq. (7)] where,x(t) is the resulting new series with the addition of noise and x(t) is the original series (Lorenz); k is the noise level, while β(t) is the Gaussian white noise with magnitude range corresponding to the standard deviation of x(t). The noise levels used are 5%, 10%, 30% and 50%. For each of the noise-added signal, its determinism and determinism distance are calculated.

River runoff series
The second test application uses daily river runoff observations extracted from station Burghausen in south Germany for the year 1961. This station measures the streamflow of the Salzach River with a catchment area of 6600 km 2 . The time series [ Fig. 4(b)] is used as a test set representing real world data, i.e. it is potentially nonstationary and contaminated by noise and observation error.

Results and Discussion
This section presents the results of our proposed method for selecting an artefact-safe parameter region with the assumption of recurrence rate fixed at 10%. The range of embedding dimension (m) bounds from 1 to 10 and time delay (τ ) from 1 to 20.

Lorenz series
The Lorenz series is known for its deterministic feature, i.e. high determinism value, yet certain parameter combinations can give incorrect, low determinism values, e.g. when m = 1 or m = 10, τ = 6 [Figs. 5(a)-5(c)]. Increasing the time delay at high embedding dimension is also seen to thicken the line structures of the RP [ Fig. 5(i)]. Low determinism values reflect nonoptimal parameterization, and hence, misleading RP structures [Figs. 5(d) and 5(g)] with diagonal line structures as wobbly and perpendicular to the main diagonal [Marwan et al., 2007].  In order to assess the reliability of the resulting RP corresponding to the m and τ parameter combinations, the proposed shuffling techniques is applied to find the determinism-sensitive region. Using the proposed technique (n = 100 shuffles), M d is low for the case without embedding (m = 1) as well as for τ = 1, when m > 1 (Fig. 6). The latter suggests artefacts due to embedding. Those parameter values where M d is high, e.g. for τ ≥ 2, when m > 1, can be considered to be less influenced by embedding artefacts. It can be noticed that when τ and m are higher, M d starts to decrease and to fluctuate, as indicated by S d . In this case, the use of the median is quite reliable due to the low S d value (i.e. below 6%). The identified determinism-sensitive region can be referenced with the standard approaches, such as FNN and MI, to find the optimal parameter set. This also serves to prevent the use of unnecessarily high parameter values that result in the reduction of data points (i.e. by (m − 1)τ ). For instance, in the case of the Lorenz series, the optimal parameter set found by the standard approach is m = 3 and τ = 3 (Fig. 7) which coincides well with the domain of high M d values.
To investigate the impact of noise as in a real world scenario, Gaussian white noise with different noise levels is added to the signal as described in show both false nearest neighbor and mutual information characteristics for the added-noise signal. The false nearest neighbor approach slightly increases at the optimal dimension of 3 causing a shift to the next dimension value, i.e. m = 4. When the noise level reaches 30% and 50%, the mutual information characteristics start to differ from the original, whereas the noise levels of 5% and 10% still preserve the original signal characteristics. Noise needs to be handled with care, as high level noise contamination potentially alters the determinism of the signal. It decreases in this case when Gaussian white noise is added, hence the determinism distance between the original and the shuffled series gets smaller.

River runoff series
A river runoff series is used to represent an example for field observations which are usually contaminated with noise. River runoff is typically a nonlinear deterministic series and exhibits chaos properties [Porporato & Ridolfi, 1997;Sivakumar, 2000;Martins et al., 2011], hence, its DET is expected to be high. However, its recurrence determinism is low when parameter m = 1 and when both m and τ reach high values, e.g. m > 8 and τ > 9 [ Fig. 9(a)]. For instance, for τ = 10 the DET value starts to decrease when m > 7 [ Fig. 9(b)], while for m = 10 the increase of τ (i.e. above 4) also starts to reduce DET values [ Fig. 9(c)]. When evaluated through 100 shuffles, the parameter set of τ = 1, m > 1 should not be used due to the clear artefact potential suggested by its low determinism distance [see Fig. 10 As cross-check with the standard approach of parameter identification [ Fig. 11(a)] suggested optimal embedding parameters in this case would be τ = 10 days and m = 5.

Summary
We propose a method to identify a determinismsensitive parameter region with minimal impact of artefacts due to embedding when constructing a Recurrence Plot (RP). The method utilizes both deterministic (incl. chaos) and stochastic characteristics of recurrence quantification, i.e. diagonal structures, as indicated by their determinism values. It is useful when the evaluated signal is known to be deterministic. The method involves randomly shuffling the time series for an abundant number of times in order to destroy its original characteristics and its determinism. Thereafter, determinism values are calculated for each shuffle iteration and compared with the determinism of the original signal for a range of parameters, resulting in a measure called determinism distance.
The matrix of the median values of this measure is plotted to depict the determinism-sensitive parameter region. The larger the determinism τ = 1, m > 1, the increase of recurrence rate further decreases the determinism distance [ Fig. 12(d)]. This confirms that the use of such parameter value should be ignored regardless of the choice of recurrence threshold. However when m = 1, there is an increase of determinism distance i.e. peaking at 20% recurrence rate and decreases thereafter [ Fig. 12(d)]. Despite the increase, the determinism distance is still regarded as low (i.e. below 0.5). Both original and shuffled time series experience an increase of their DET values when recurrence rate is increased [Figs. 13(a)-13(c)]. However, when m > 1 the increase of DET values is rather sharp, changing significantly from low to high. The shuffled series recurrence plots with m = 1, and recurrence rate of 20% (i.e. at the peak of DET distance) still do not present any noteworthy deterministic features [Figs. 13(d)-13(f)]. In this case, users should avoid using large threshold values and special attention should be made on using such shuffling technique (i.e. when choosing m = 1).

Appendix B
In this Appendix, we include the evaluation of DET values of a correlated random series (exemplified using AR1 series) in contrast with uncorrelated random series and Lorenz to showcase that the high DET values is indeed associated with deterministic system instead of its auto-correlation structures (see Figs. 14 and 15). Although there are cases at certain embedding parameter set where diagonal lines of RP artificially increase, similar to random uncorrelated series, when τ equals 1, AR1 appears to artificially induce long diagonal lines and hence high DET when embedding dimension gets higher. Meanwhile, in addition to what was mentioned, there appear some (but rare) relatively high DET at certain parameter sets, e.g. when 6 ≤ m ≤ 10 and τ = 7 (see Figs. 13 and 14). However, there is a possibility that we have excluded some rare structures of a nondeterministic signal that also showcase high DET values. Therefore, it is important to note that this proposed technique is intended to be used when the user knows that the dynamical system is deterministic.