Statistics

Confidence bands

indsl.statistics.confidence.bands(data: Series, period: str = '1h', K: float = 2.0, as_json: bool = True) str | DataFrame

Confidence bands.

Confidence bands, also known as Bollinger Bands, statistically characterize the fluctuations of a time series. They form a graphical envelope (upper and lower bands) around the data, and the width of the envelope expresses the deviation. The envelope width is estimated as a multiple of the standard deviation over a given time period.

Two input parameters are required to describe the historical behavior of the data: a time window, N, and a multiplication factor, K. The window influences the “responsiveness” of the bands to the magnitude and frequency of data variations. The multiplication factor influences the width of the envelope.

The Bollinger Bands consist of an N-period moving average (MA) and upper and lower bands at K times an N-period standard deviation above and below the moving average (MA +/- K*stdev).
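The MA +/- K*stdev construction can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name rolling_bands, the column names ma, upper and lower, and the synthetic sine data are all hypothetical.

```python
import numpy as np
import pandas as pd


def rolling_bands(data: pd.Series, period: str = "1h", k: float = 2.0) -> pd.DataFrame:
    """Moving average with bands at k rolling standard deviations (hypothetical helper)."""
    window = data.rolling(period)  # time-based window; needs a DatetimeIndex
    ma = window.mean()
    stdev = window.std()  # NaN where the window holds fewer than 2 points
    return pd.DataFrame({"ma": ma, "upper": ma + k * stdev, "lower": ma - k * stdev})


# Synthetic one-minute data, used only for illustration.
index = pd.date_range("2023-01-01", periods=240, freq="1min")
values = pd.Series(np.sin(np.linspace(0, 12, 240)), index=index)

bands = rolling_bands(values, period="30min", k=2.0)
```

A longer period smooths the moving average and makes the bands react more slowly, while a larger k widens the envelope without changing its center line.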

Parameters:
  • data – Time series.

  • period – Window. Window length used to estimate the moving average and standard deviation. Defaults to ‘1h’ (3600 seconds).

  • K – Factor. Factor used to estimate the width of the envelope K*stdev. Defaults to 2.

  • as_json – JSON? Return a json dictionary (True) or a pandas DataFrame (False). Defaults to True.

Returns:

Time index, moving average, and upper and lower rolling confidence bands.

Return type:

JSON or pandas.DataFrame

Outlier detection

indsl.statistics.outliers.detect_outliers(data: Series, reg_smooth: float = 0.9, min_samples: int = 4, eps: float | None = None, time_window: Timedelta = Timedelta('0 days 01:00:00'), del_zero_val: bool = False) Series

Outlier detection.

Identifies outliers by combining two methods, dbscan and csaps.

  • dbscan: Density-based clustering algorithm used to identify clusters of varying shape and size within a data set. Does not require a pre-determined number of clusters. Able to identify outliers as noise, instead of classifying them into a cluster. Flexible when it comes to the size and shape of clusters, which makes it more useful for noisy, real-life data.

  • csaps regression: Cubic smoothing spline algorithm. Residuals from the regression are computed. Data points with high residuals (more than 3 standard deviations from the mean) are considered outliers.
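The residual rule in the second bullet can be sketched without the csaps package. The example below is a hedged illustration only: it swaps the cubic smoothing spline for a centered rolling median (an assumption made to keep the sketch dependency-free), and the helper name flag_high_residuals and the synthetic signal are hypothetical.

```python
import numpy as np
import pandas as pd


def flag_high_residuals(data: pd.Series, smooth: pd.Series, n_sigma: float = 3.0) -> pd.Series:
    """Return 1 where a residual lies more than n_sigma standard deviations from the mean residual."""
    residuals = data - smooth
    threshold = n_sigma * residuals.std()
    return ((residuals - residuals.mean()).abs() > threshold).astype(int)


# Synthetic signal with one injected spike, used only for illustration.
index = pd.date_range("2023-01-01", periods=200, freq="1min")
signal = pd.Series(np.sin(np.linspace(0, 6, 200)), index=index)
signal.iloc[120] += 5.0

# Stand-in smoother (assumption): centered rolling median instead of a csaps spline.
smooth = signal.rolling(11, center=True, min_periods=1).median()
flags = flag_high_residuals(signal, smooth)
```

Only the thresholding step mirrors the description above; the choice of smoother determines what counts as a residual in the first place.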

Parameters:
  • data – Time series. The data has to be non-uniform.

  • reg_smooth – Smoothing factor. The smoothing parameter that determines the weighted sum of terms in the regression, and it is limited by the range [0,1]. Defaults to 0.9. Ref: https://csaps.readthedocs.io/en/latest/formulation.html#definition

  • min_samples

    Minimum samples. Minimum number of data points required to form a distinct cluster. Defaults to 4. Rules of thumb for selecting the minimum samples value (MinPts):

    • The larger the data set, the larger the value of MinPts should be.

    • If the data set is noisier, choose a larger value of MinPts.

    • Generally, MinPts should be greater than or equal to the dimensionality of the data set. For 2-dimensional data, use DBSCAN’s default value of 4 (Ester et al., 1996).

    • If your data has more than 2 dimensions, choose MinPts = 2*dim, where dim is the number of dimensions of your data set (Sander et al., 1998).

  • eps – Distance threshold. Defines the maximum distance between two samples for one to be considered as in the neighborhood of the other (i.e. belonging to the same cluster). This is the most important DBSCAN parameter to choose appropriately for your dataset and distance function. If no value is given, it is set automatically: a Nearest Neighbors algorithm calculates the average distance between each point and its k nearest neighbors, where k = min_samples (minimum samples). With these average k-distances plotted in ascending order on a k-distance graph, the optimal value for the threshold is at the point of maximum curvature (i.e. where the graph has the greatest slope). This is not a maximum bound on the distances of points within a cluster. Defaults to None. If a value is given, it has to be > 0.0.

  • time_window – Time window. Length of the time period to compute the rolling mean. The rolling mean and the data point value are the two features considered when calculating the distance to the furthest neighbour. This distance allows us to find the right epsilon when training dbscan. Defaults to ‘60min’. Accepted format: ‘3w’, ‘10d’, ‘5h’, ‘30min’, ‘10s’. If a number without a unit (such as ‘60’) is given, it is interpreted as a number of minutes.

  • del_zero_val – Remove zeros. Removes data points containing a value of 0. Defaults to False.
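The automatic choice of eps described for the eps parameter above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the library's code: the function names average_k_distances and knee_value are hypothetical, the knee is located with a simple farthest-from-chord heuristic rather than an exact curvature calculation, and the two-feature data set is synthetic.

```python
import numpy as np


def average_k_distances(features: np.ndarray, k: int) -> np.ndarray:
    """Ascending average distance from each point to its k nearest neighbours."""
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))  # full pairwise distance matrix
    dist.sort(axis=1)  # column 0 is each point's distance to itself (0)
    return np.sort(dist[:, 1 : k + 1].mean(axis=1))


def knee_value(sorted_distances: np.ndarray) -> float:
    """Distance at the point farthest below the chord joining the curve's endpoints."""
    span = sorted_distances[-1] - sorted_distances[0]
    if span == 0:  # flat curve: every point is equally dense
        return float(sorted_distances[0])
    x = np.linspace(0.0, 1.0, len(sorted_distances))
    y = (sorted_distances - sorted_distances[0]) / span
    return float(sorted_distances[np.argmax(x - y)])


# Synthetic data: one dense cluster plus three isolated points.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.1, size=(100, 2))
isolated = np.array([[3.0, 3.0], [-3.0, 2.5], [4.0, -4.0]])
features = np.vstack([cluster, isolated])

k_dist = average_k_distances(features, k=4)  # k plays the role of min_samples
eps = knee_value(k_dist)
```

On this data the sorted curve stays nearly flat across the cluster and jumps at the isolated points, so the knee lands at the edge of the flat region and gives a threshold on the scale of the within-cluster spacing.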

Returns:

Time series.

Binary time series indicating outliers: outlier = 1, not an outlier = 0.

Return type:

pandas.Series
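A binary indicator of this shape can be used as a mask on the original data. The sketch below builds a hypothetical indicator by hand instead of calling the function, and it assumes that the indicator shares its index with the input series.

```python
import pandas as pd

index = pd.date_range("2023-01-01", periods=6, freq="1min")
data = pd.Series([1.0, 1.1, 9.0, 1.2, 0.9, 1.0], index=index)

# Hypothetical indicator shaped like the documented return value:
# 1 = outlier, 0 = not an outlier.
indicator = pd.Series([0, 0, 1, 0, 0, 0], index=index)

cleaned = data[indicator == 0]  # keep only the points that were not flagged
outlier_times = indicator.index[indicator == 1]  # timestamps of flagged points
```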

Outlier removal

indsl.statistics.outliers.remove_outliers(data: Series, reg_smooth: float = 0.9, min_samples: int = 4, eps: float | None = None, time_window: Timedelta = Timedelta('0 days 01:00:00'), del_zero_val: bool = False) Series

Outlier removal.

Identifies and removes outliers by combining two methods, dbscan and csaps.

  • dbscan: Density-based clustering algorithm used to identify clusters of varying shape and size within a data set. Does not require a pre-determined number of clusters. Able to identify outliers as noise, instead of classifying them into a cluster. Flexible when it comes to the size and shape of clusters, which makes it more useful for noisy, real-life data.

  • csaps regression: Cubic smoothing spline algorithm. Residuals from the regression are computed. Data points with high residuals (more than 3 standard deviations from the mean) are considered outliers.

Parameters:
  • data – Time series.

  • reg_smooth – Smoothing factor. The smoothing parameter that determines the weighted sum of terms in the regression, and it is limited by the range [0,1]. Defaults to 0.9. Ref: https://csaps.readthedocs.io/en/latest/formulation.html#definition

  • min_samples

    Minimum samples. Minimum number of data points required to form a distinct cluster. Defaults to 4. Rules of thumb for selecting the minimum samples value (MinPts):

    • The larger the data set, the larger the value of MinPts should be.

    • If the data set is noisier, choose a larger value of MinPts.

    • Generally, MinPts should be greater than or equal to the dimensionality of the data set. For 2-dimensional data, use DBSCAN’s default value of 4 (Ester et al., 1996).

    • If your data has more than 2 dimensions, choose MinPts = 2*dim, where dim is the number of dimensions of your data set (Sander et al., 1998).

  • eps – Distance threshold. Defines the maximum distance between two samples for one to be considered as in the neighborhood of the other (i.e. belonging to the same cluster). This is the most important DBSCAN parameter to choose appropriately for your dataset and distance function. If no value is given, it is set automatically: a Nearest Neighbors algorithm calculates the average distance between each point and its k nearest neighbors, where k = min_samples (minimum samples). With these average k-distances plotted in ascending order on a k-distance graph, the optimal value for the threshold is at the point of maximum curvature (i.e. where the graph has the greatest slope). This is not a maximum bound on the distances of points within a cluster. Defaults to None. If a value is given, it has to be > 0.0.

  • time_window – Time window. Length of the time period to compute the rolling mean. The rolling mean and the data point value are the two features considered when calculating the distance to the furthest neighbour. This distance allows us to find the right epsilon when training dbscan. Defaults to ‘60min’. Accepted format: ‘3w’, ‘10d’, ‘5h’, ‘30min’, ‘10s’. If a number without a unit (such as ‘60’) is given, it is interpreted as a number of minutes.

  • del_zero_val – Remove zeros. Removes data points containing a value of 0. Defaults to False.

Returns:

Time series without outliers.

Return type:

pandas.Series

Pearson correlation

indsl.statistics.pearson_correlation(data1: Series, data2: Series, time_window: Timedelta = Timedelta('0 days 00:15:00'), min_periods: int = 1, align_timesteps: bool = False) Series

Pearson correlation.

This function measures the linear correlation between two time series along a rolling window. Pearson’s definition of correlation: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
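A rolling Pearson correlation over a time window can be sketched with pandas alone. This is an illustration rather than the function's implementation; the synthetic series and the choice of min_periods=2 are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic one-minute data, used only for illustration.
index = pd.date_range("2023-01-01", periods=120, freq="1min")
rng = np.random.default_rng(42)
data1 = pd.Series(np.linspace(0.0, 10.0, 120) + rng.normal(0.0, 0.1, 120), index=index)
data2 = 2.0 * data1 + 1.0  # exact linear function of data1

# Correlation of the two series inside each trailing 15-minute window.
# A single observation has no defined correlation, hence min_periods=2.
rolling_corr = data1.rolling("15min", min_periods=2).corr(data2)
```

Because data2 is an exact linear function of data1 with a positive slope, every window with at least two observations yields a coefficient of 1 up to floating-point error.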

Parameters:
  • data1 – Time series.

  • data2 – Time series.

  • time_window – Time window. Length of the time period to compute the Pearson correlation. Defaults to ‘minutes=15’. Time unit should be in days, hours, minutes or seconds. Accepted formats can be found here: https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html.

  • min_periods – Minimum samples. Minimum number of observations required in the given time window (otherwise, the result is set to 0). Defaults to 1.

  • align_timesteps (bool) – Auto-align. Automatically align the timestamps of the input time series. Defaults to False.

Returns:

Time series

Return type:

pandas.Series

Raises: