Statistics

Confidence bands

indsl.statistics.confidence.bands(data: Series, period: str = '1h', K: float = 2.0, as_json: bool = True)

Confidence bands

Confidence bands, also known as Bollinger Bands, are a statistical characterization of a time series' fluctuations. They form a graphical envelope (upper and lower bands) around the data, and the width of the envelope expresses the deviation. The width is estimated as a multiple of the standard deviation over a given time period.

Two input parameters describe the historical behavior of the data: a time window, N (the period parameter), and a multiplication factor, K. The window influences the “responsiveness” of the bands to the magnitude and frequency of data variations. The multiplication factor influences the width of the envelope.

The Bollinger Bands consist of an N-period moving average (MA) and upper and lower bands at K times an N-period standard deviation above and below the moving average (MA +/- K*stdev).
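The MA +/- K*stdev construction can be sketched directly with pandas. This is a minimal illustration, not the library's implementation: the function name rolling_bands, the column names, and the synthetic data are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def rolling_bands(data: pd.Series, period: str = "1h", K: float = 2.0) -> pd.DataFrame:
    """Moving average with upper and lower bands at K rolling standard deviations."""
    window = data.rolling(period)  # time-based window; requires a DatetimeIndex
    ma = window.mean()
    stdev = window.std()  # sample standard deviation; NaN for the first point
    return pd.DataFrame({"ma": ma, "upper": ma + K * stdev, "lower": ma - K * stdev})


# Synthetic one-minute data, for illustration only.
index = pd.date_range("2023-01-01", periods=240, freq="min")
rng = np.random.default_rng(0)
series = pd.Series(10.0 + rng.normal(0.0, 1.0, size=len(index)), index=index)

bands = rolling_bands(series, period="1h", K=2.0)
```

A longer period smooths the moving average and makes the bands react more slowly, while a larger K widens the envelope without changing its center.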

Parameters
  • data – Time series.

  • period – Window. Window length, given as a time string. Used to estimate the moving average and standard deviation. Defaults to ‘1h’ (3600 seconds).

  • K – Factor. Factor used to estimate the width of the envelope K*stdev. Defaults to 2.

  • as_json – JSON? Return a JSON dictionary (True) or a pandas DataFrame (False). Defaults to True.

Returns

Time index, moving average, and upper and lower rolling confidence bands.

Return type

JSON or pandas.DataFrame

Outlier removal

indsl.statistics.remove_outliers(data: Series, reg_smooth: float = 0.9, min_samples: int = 4, eps: float = None, time_window: str = '60min', del_zero_val: bool = False) → Series

Outlier removal

Identifies outliers by combining two methods, dbscan and csaps.

  • dbscan: Density-based clustering algorithm used to identify clusters of varying shape and size within a data set. It does not require a predetermined number of clusters and can label outliers as noise instead of assigning them to a cluster. Its flexibility regarding the size and shape of clusters makes it well suited to noisy, real-life data.

  • csaps regression: Cubic smoothing spline algorithm. Residuals from the regression are computed. Data points with high residuals (more than 3 standard deviations from the mean) are considered outliers.
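The residual rule in the second method can be illustrated on its own. In the sketch below the cubic smoothing spline is replaced by an ordinary polynomial fit from NumPy, purely to keep the example dependency-free; it is not csaps and not the library's implementation. The function name residual_outliers and the synthetic data are assumptions.

```python
import numpy as np


def residual_outliers(x, y, degree=3, n_sigma=3.0):
    """Flag points whose fit residual is more than n_sigma standard deviations from the mean residual."""
    coeffs = np.polyfit(x, y, degree)  # stand-in smoother, NOT a csaps spline
    residuals = y - np.polyval(coeffs, x)
    return np.abs(residuals - residuals.mean()) > n_sigma * residuals.std()


# Smooth synthetic signal with small noise and one injected spike.
x = np.linspace(0.0, 10.0, 200)
rng = np.random.default_rng(1)
y = 2.0 + 0.5 * x - 0.03 * x**2 + rng.normal(0.0, 0.05, size=x.size)
y[50] += 5.0

mask = residual_outliers(x, y)
cleaned = y[~mask]  # data with the flagged points removed
```

Because the spike itself inflates the residual standard deviation, this rule only catches points that stand far out from the fit, which is one reason it is useful to pair it with a density-based method.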

Parameters
  • data – Time series.

  • reg_smooth – Smoothing factor. The smoothing parameter that weights the terms of the regression objective; it must lie in the range [0, 1]. Defaults to 0.9. Ref: https://csaps.readthedocs.io/en/latest/formulation.html#definition

  • min_samples – Minimum samples. Minimum number of data points required to form a distinct cluster (MinPts in the DBSCAN literature). Defaults to 4. Rules of thumb for selecting the minimum samples value:

    • The larger the data set, the larger the value of MinPts should be.

    • If the data set is noisier, choose a larger value of MinPts.

    • Generally, MinPts should be greater than or equal to the dimensionality of the data set. For 2-dimensional data, use DBSCAN’s default value of 4 (Ester et al., 1996).

    • If your data has more than 2 dimensions, choose MinPts = 2*dim, where dim is the number of dimensions of your data set (Sander et al., 1998).

  • eps – Distance threshold. Defines the maximum distance between two samples for one to be considered as in the neighborhood of the other (i.e. belonging to the same cluster). This is not a maximum bound on the distances of points within a cluster. It is the most important DBSCAN parameter to choose appropriately for your data set and distance function. If no value is given, it is set automatically: a Nearest Neighbors algorithm calculates the average distance between each point and its k nearest neighbors, where k = min_samples, and these average k-distances are sorted in ascending order to form a k-distance graph. The optimal threshold is at the point of maximum curvature, where the slope of the graph starts to increase sharply. Defaults to None.

  • time_window – Window. Length of the time period used to compute the rolling mean. The rolling mean and the data point value are the two features considered when calculating the distance to the furthest neighbour. This distance is used to find the right epsilon when training dbscan. Defaults to ‘60min’. Accepted string format: ‘3w’, ‘10d’, ‘5h’, ‘30min’, ‘10s’. If a number without a unit (such as ‘60’) is given, it is interpreted as a number of minutes.

  • del_zero_val – Remove zeros. Removes data points containing a value of 0. Defaults to False.
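The automatic choice of eps described above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the library's code: the function name k_distance_eps is hypothetical, the pairwise distance computation only suits small inputs, and the point of maximum curvature is approximated by the point of the sorted curve that lies farthest below the straight line joining its two ends. In the documented function the two feature columns would be the data value and its rolling mean; here they are synthetic.

```python
import numpy as np


def k_distance_eps(features: np.ndarray, k: int = 4) -> float:
    """Estimate eps as the elbow of the sorted average k-nearest-neighbour distances."""
    # Pairwise Euclidean distances; O(n^2) memory, fine for a small illustration only.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    # Sort each row; column 0 is the distance of a point to itself, so skip it.
    k_dist = np.sort(dist, axis=1)[:, 1 : k + 1].mean(axis=1)
    curve = np.sort(k_dist)  # the k-distance graph, in ascending order
    # Elbow heuristic (an assumption): farthest point below the chord between the ends.
    chord = np.linspace(curve[0], curve[-1], curve.size)
    return float(curve[np.argmax(chord - curve)])


# One dense cluster plus three isolated points, for illustration only.
rng = np.random.default_rng(2)
cluster = rng.normal(0.0, 0.1, size=(100, 2))
isolated = np.array([[3.0, 3.0], [-3.0, 2.5], [4.0, -3.5]])
features = np.vstack([cluster, isolated])

eps = k_distance_eps(features, k=4)
```

With data like this, the estimate stays at the scale of the spacing inside the dense cluster, so the isolated points fall outside every neighborhood and would be labelled as noise by a density-based clustering step.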

Returns

Time series without outliers.

Return type

pandas.Series
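
The accepted time_window strings listed under Parameters (‘3w’, ‘10d’, ‘5h’, ‘30min’, ‘10s’, and a bare number meaning minutes) can be interpreted roughly as follows. The helper parse_time_window and its unit table are hypothetical and only mirror the documented examples; the library's own parsing may differ.

```python
import re
from datetime import timedelta

# Unit suffixes taken from the documented examples; the mapping itself is an assumption.
_UNITS = {"w": "weeks", "d": "days", "h": "hours", "min": "minutes", "s": "seconds"}


def parse_time_window(window: str) -> timedelta:
    """Convert strings such as '3w', '10d', '5h', '30min', '10s' or '60' to a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*(w|d|h|min|s)?\s*", window)
    if match is None:
        raise ValueError(f"Unsupported time window: {window!r}")
    amount = int(match.group(1))
    unit = match.group(2) or "min"  # a number without a unit is read as minutes
    return timedelta(**{_UNITS[unit]: amount})


default_window = parse_time_window("60min")
```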