Data Quality

Completeness

Completeness Score

indsl.data_quality.completeness.completeness_score(x: Series, cutoff_good: float = 0.8, cutoff_med: float = 0.6, method_period: str = 'median') → str

Completeness score.

This function determines the completeness of a time series from a completeness score. The score is computed from the inferred data sampling period (the median or minimum of the timestamp differences) and the expected total number of data points for the observed time span at that sampling frequency. The completeness score is defined as the ratio of the actual number of data points to the expected number of data points based on the sampling frequency. The completeness is categorized as good if the score is above the specified cutoff ratio for good completeness, medium if the score falls between the cutoff ratios for good and medium completeness, and poor if the score is below the medium completeness cutoff.

Parameters:
  • x – Time series

  • cutoff_good – Good cutoff. Value between 0 and 1. A completeness score above this cutoff value indicates good data completeness. Defaults to 0.80.

  • cutoff_med – Medium cutoff. Value between 0 and 1 and lower than the good data completeness cutoff. A completeness score above this cutoff and below the good completeness cutoff indicates medium data completeness. Data with a completeness score below it are categorized as poor data completeness. Defaults to 0.60.

  • method_period – Method. Name of the method used to estimate the sampling period of the time series; either ‘median’ or ‘min’. Defaults to ‘median’.

Returns:

Data quality

The data quality is defined as Good when completeness score >= cutoff_good, Medium when cutoff_med <= completeness score < cutoff_good, and Poor when completeness score < cutoff_med.

Return type:

string

Raises:
  • TypeError – cutoff_good or cutoff_med is not a number

  • ValueError – x has fewer than ten data points

  • TypeError – x is not a time series

  • TypeError – index of x is not datetime

  • ValueError – method_period is not ‘median’ or ‘min’

  • ValueError – the completeness score is greater than 1
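
For example, a regularly sampled series expected to hold 1000 points at its inferred sampling period but containing only 700 has a score of 0.7 and falls in the Medium band with the default cutoffs. Below is a minimal usage sketch on synthetic data; it assumes indsl is installed and that the import path matches the documented module name.

```python
import numpy as np
import pandas as pd

from indsl.data_quality.completeness import completeness_score  # path as documented above

# One point per second for ~17 minutes.
idx = pd.date_range("2023-01-01", periods=1000, freq="1s")
values = pd.Series(np.random.default_rng(0).normal(size=1000), index=idx)

# Remove a contiguous block (~30% of the points) to simulate missing data.
incomplete = values.drop(values.index[300:600])

# Score ~0.7 with the default cutoffs (0.8 / 0.6), so this should be reported as Medium.
print(completeness_score(incomplete, cutoff_good=0.8, cutoff_med=0.6))
```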

Data Gaps Detection

Using Z scores

indsl.data_quality.gaps_identification_z_scores(data: Series, cutoff: float = 3.0, test_normality_assumption: bool = False) → Series

Gaps detection, Z-scores.

This function detects gaps in the time stamps using Z-scores. The Z-score of a raw score (i.e., an observed value or data point) is the number of standard deviations by which it is above or below the mean of what is being observed or measured. This method assumes that the time step sizes are normally distributed. Gaps are defined as time periods where the Z-score is larger than the cut-off.

Parameters:
  • data – Time series

  • cutoff – Cut-off. Time periods are considered gaps if the Z-score is over this cut-off value. Defaults to 3.0.

  • test_normality_assumption – Test for normality. Raises a warning if the data is not normally distributed. The Shapiro-Wilk test is used. The test is only performed if the time series contains fewer than 5000 data points. Defaults to False.

Returns:

Time series

The returned time series is an indicator function that is 1 where there is a gap, and 0 otherwise.

Return type:

pd.Series

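A minimal usage sketch on synthetic data, assuming indsl is installed and that the function is importable directly from indsl.data_quality as documented:

```python
import numpy as np
import pandas as pd

from indsl.data_quality import gaps_identification_z_scores  # name as documented above

# Regular 1-minute sampling with a multi-hour hole in the middle.
before_gap = pd.date_range("2023-01-01 00:00", periods=500, freq="1min")
after_gap = pd.date_range("2023-01-01 12:00", periods=500, freq="1min")
idx = before_gap.append(after_gap)
data = pd.Series(np.ones(len(idx)), index=idx)

gap_flags = gaps_identification_z_scores(data, cutoff=3.0)
print(gap_flags[gap_flags == 1])  # points flagged as lying in a gap
```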

Using modified Z scores

indsl.data_quality.gaps_identification_modified_z_scores(data: Series, cutoff: float = 3.5) → Series

Gaps detection, mod. Z-scores.

Detect gaps in the time stamps using modified Z-scores. Gaps are defined as time periods where the modified Z-score is larger than the cut-off.

Parameters:
  • data – Time series

  • cutoff – Cut-off. Time periods are considered gaps if the modified Z-score is over this cut-off value. Defaults to 3.5.

Returns:

Time series

The returned time series is an indicator function that is 1 where there is a gap, and 0 otherwise.

Return type:

pd.Series

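The documentation does not spell out the formula; as a point of reference, the modified Z-score is conventionally defined with the median and the median absolute deviation (MAD) instead of the mean and standard deviation, which makes it more robust to the very gaps it is trying to detect. A sketch of that conventional definition applied to the time-step sizes (not necessarily the library's exact implementation):

```python
import numpy as np
import pandas as pd

def modified_z_scores_of_time_steps(data: pd.Series) -> np.ndarray:
    """Iglewicz-Hoaglin modified Z-score, 0.6745 * (x - median) / MAD, of the timestamp differences."""
    deltas = data.index.to_series().diff().dropna().dt.total_seconds().to_numpy()
    median = np.median(deltas)
    mad = np.median(np.abs(deltas - median))
    # Note: a MAD of 0 (perfectly regular sampling) would need special handling.
    return 0.6745 * (deltas - median) / mad
```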

Using the interquartile range method

indsl.data_quality.gaps_identification_iqr(data: Series) → Series

Gaps detection, IQR.

Detect gaps in the time stamps using the interquartile range (IQR) method. The IQR is a measure of statistical dispersion, which is the spread of the data. Any time steps more than 1.5 IQR above Q3 are considered gaps in the data.

Parameters:

data – time series

Returns:

time series

The returned time series is an indicator function that is 1 where there is a gap, and 0 otherwise.

Return type:

pd.Series

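The rule stated above translates directly into a cut-off on the time-step sizes; an illustrative sketch of that rule using pandas (not necessarily the exact internal computation):

```python
import pandas as pd

def iqr_gap_threshold(data: pd.Series) -> pd.Timedelta:
    """Time-step size above which a step counts as a gap: Q3 + 1.5 * IQR of the timestamp differences."""
    deltas = data.index.to_series().diff().dropna()
    q1, q3 = deltas.quantile(0.25), deltas.quantile(0.75)
    return q3 + 1.5 * (q3 - q1)
```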

Using a time delta threshold

indsl.data_quality.gaps_identification_threshold(data: Series, time_delta: Timedelta = Timedelta('0 days 00:05:00')) → Series

Gaps detection, threshold.

Detect gaps in the time stamps using a timedelta threshold.

Parameters:
  • data – time series

  • time_delta – Time threshold. Maximum time delta allowed between points. Defaults to 5 min.

Returns:

time series

The returned time series is an indicator function that is 1 where there is a gap, and 0 otherwise.

Return type:

pd.Series

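A minimal usage sketch on synthetic data, assuming indsl is installed and the documented import path:

```python
import pandas as pd

from indsl.data_quality import gaps_identification_threshold  # name as documented above

# Three points one minute apart, then a 30-minute hole before the final point.
idx = pd.to_datetime(
    ["2023-01-01 00:00", "2023-01-01 00:01", "2023-01-01 00:02", "2023-01-01 00:32"]
)
data = pd.Series([1.0, 1.0, 1.0, 1.0], index=idx)

gap_flags = gaps_identification_threshold(data, time_delta=pd.Timedelta("5min"))
```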

Low data density

Using Z scores

indsl.data_quality.low_density_identification_z_scores(data: Series, time_window: Timedelta = Timedelta('0 days 00:05:00'), cutoff: float = -3.0, test_normality_assumption: bool = False) → Series

Low density, Z-scores.

Detect periods with low density of data points using Z-scores. The Z-score of a raw score (i.e., an observed value or data point) is the number of standard deviations by which it is above or below the mean of what is being observed or measured. This method assumes that the densities over a rolling window are normally distributed. Low density periods are defined as time periods where the Z-score is lower than the cut-off.

Parameters:
  • data – Time series

  • time_window – Rolling window. Length of the time period to compute the density of points. Defaults to 5 min.

  • cutoff – Cut-off. Number of standard deviations from the mean. Low density periods are detected if the Z-score is below this cut-off value. Default -3.0.

  • test_normality_assumption – Test for normality. Raises an error if the data densities over the rolling windows are not normally distributed. The Shapiro-Wilk test is used. The test is only performed if the time series contains fewer than 5000 data points. Defaults to False.

Returns:

Time series

The returned time series is an indicator function that is 1 where there is a low density period, and 0 otherwise.

Return type:

pd.Series


Using modified Z scores

indsl.data_quality.low_density_identification_modified_z_scores(data: Series, time_window: Timedelta = Timedelta('0 days 00:05:00'), cutoff: float = -3.5) → Series

Low density, mod. Z-scores.

Detect periods with a low density of data points using modified Z-scores. Low density periods are defined as time periods where the Z-score is lower than the cutoff.

Parameters:
  • data – Time series

  • time_window – Rolling window. Length of the time period to compute the density of points. Defaults to 5 min.

  • cutoff – Cut-off. Low density periods are detected if the modified Z-score is below this cut-off value. Default -3.5.

Returns:

Time series

The returned time series is an indicator function that is 1 where there is a low density period, and 0 otherwise.

Return type:

pd.Series


Using the interquartile range method

indsl.data_quality.low_density_identification_iqr(data: Series, time_window: Timedelta = Timedelta('0 days 00:05:00')) → Series

Low density, IQR.

Detect periods with a low density of data points using the interquartile range (IQR) method. The IQR is a measure of statistical dispersion, which is the spread of the data. Densities that are more than 1.5 IQR below Q1 are considered as low density periods in the data.

Parameters:
  • data – time series

  • time_window – Rolling window. Length of the time period to compute the density of points. Defaults to 5 min.

Returns:

time series

The returned time series is an indicator function that is 1 where there is a low density period, and 0 otherwise.

Return type:

pd.Series


Using a density threshold

indsl.data_quality.low_density_identification_threshold(data: Series, time_window: Timedelta = Timedelta('0 days 00:05:00'), cutoff: int = 10) → Series

Low density, threshold.

Detect periods with a low density of points using a minimum number of points per rolling window as the cut-off value.

Parameters:
  • data – time series

  • time_window – Rolling window. Length of the time period to compute the density of points. Defaults to 5 min.

  • cutoff – Density cut-off. Low density periods are detected if the number of points is less than this cut-off value. Default is 10.

Returns:

time series

The returned time series is an indicator function that is 1 where there is a low density period, and 0 otherwise.

Return type:

pd.Series

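A minimal usage sketch on synthetic data, assuming indsl is installed and the documented import path:

```python
import numpy as np
import pandas as pd

from indsl.data_quality import low_density_identification_threshold  # name as documented above

# Regular 10-second sampling (about 30 points per 5-minute window), followed by a
# sparse hour sampled only every 2 minutes (fewer than 10 points per window).
dense = pd.date_range("2023-01-01 00:00", "2023-01-01 06:00", freq="10s")
sparse = pd.date_range("2023-01-01 06:02", "2023-01-01 07:00", freq="2min")
idx = dense.append(sparse)
data = pd.Series(np.ones(len(idx)), index=idx)

low_density = low_density_identification_threshold(
    data, time_window=pd.Timedelta("5min"), cutoff=10
)
```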

Rolling standard deviation of time delta

indsl.data_quality.rolling_stddev_timedelta(data: Series, time_window: Timedelta = Timedelta('0 days 00:15:00'), min_periods: int = 1) → Series

Rolling stdev of time delta.

Rolling standard deviation computed for the time deltas of the observations. This metric aims to measure the amount of variation or dispersion in the frequency of time series data points.

Parameters:
  • data – Time series.

  • time_window – Time window. Length of the time period over which to compute the standard deviation. Defaults to 15 minutes. The time unit should be days, hours, minutes, or seconds. Accepted formats can be found at https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html.

  • min_periods – Minimum samples. Minimum number of observations required in the given time window (otherwise, the result is set to 0). Defaults to 1.

Returns:

Time series

Return type:

pandas.Series

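A minimal usage sketch on synthetic data, assuming indsl is installed and the documented import path; the output quantifies how irregular the sampling is within each 15-minute window:

```python
import numpy as np
import pandas as pd

from indsl.data_quality import rolling_stddev_timedelta  # name as documented above

rng = np.random.default_rng(1)
# Nominal 1-minute sampling with up to 30 seconds of random jitter per step.
offsets_s = np.cumsum(60 + rng.uniform(-30, 30, size=500))
idx = pd.Timestamp("2023-01-01") + pd.to_timedelta(offsets_s, unit="s")
data = pd.Series(np.ones(len(idx)), index=idx)

stddev = rolling_stddev_timedelta(data, time_window=pd.Timedelta("15min"), min_periods=1)
```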

Validity

Extreme Outliers Removal

indsl.data_quality.extreme(data: Series, alpha: float = 0.05, bc_relaxation: float = 0.167, poly_order: int = 3) → Series

Extreme outliers removal.

Outlier detection and removal based on the paper by Gustavo A. Zarruk. The procedure is as follows:

  • Fit a polynomial curve to the model using all the data

  • Calculate the studentized deleted (or externally studentized) residuals

  • These residuals follow a t distribution with n - p - 1 degrees of freedom

  • The Bonferroni critical value can be computed using the significance level (alpha) and the t distribution

  • Any values that fall outside of the critical value are treated as anomalies

Use of the hat matrix diagonal allows for the rapid calculation of deleted residuals without having to refit the predictor function each time.
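
For reference, the Bonferroni critical value used in this kind of outlier test is conventionally obtained from the t distribution with the significance level divided by the number of observations. A sketch of that conventional formula (the library's exact implementation may differ):

```python
from scipy import stats

def bonferroni_critical_value(n_obs: int, n_params: int, alpha: float = 0.05) -> float:
    """Two-sided t critical value at the Bonferroni-adjusted level alpha / n_obs,
    with n_obs - n_params - 1 degrees of freedom (as stated in the procedure above)."""
    return stats.t.ppf(1 - alpha / (2 * n_obs), df=n_obs - n_params - 1)
```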

Parameters:
  • data – Time series.

  • alpha – Significance level. This is a number higher than or equal to 0 and lower than 1. In statistics, the significance level is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 means that there is a 5% risk of detecting an outlier that is not a true outlier.

  • bc_relaxation – Relaxation factor for the Bonferroni critical value. Smaller values will make anomaly detection more conservative. Defaults to 1/6.

  • poly_order – Polynomial order. It represents the order of the polynomial function fitted to the original time series. Defaults to 3.

Returns:

Time series.

Return type:

pandas.Series

Raises:

UserValueError – Alpha must be a number between 0 and 1
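
A minimal usage sketch on synthetic data, assuming indsl is installed and that extreme is importable from indsl.data_quality as documented:

```python
import numpy as np
import pandas as pd

from indsl.data_quality import extreme  # name as documented above

rng = np.random.default_rng(2)
idx = pd.date_range("2023-01-01", periods=300, freq="1min")
t = np.linspace(-1.0, 1.0, len(idx))
values = t**3 + rng.normal(scale=0.05, size=len(idx))
values[[50, 150, 250]] += 5.0  # inject a few gross outliers
data = pd.Series(values, index=idx)

cleaned = extreme(data, alpha=0.05, poly_order=3)  # series with extreme outliers removed
```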

Out of Range Values

indsl.data_quality.outliers.out_of_range(data: Series, window_length: List[int] = [20, 20], polyorder: List[int] = [3, 3], alpha: List[float] = [0.05, 0.05], bc_relaxation: List[float] = [0.25, 0.5], return_outliers: bool = True) → Series

Out of range.

The main objective of this function is to detect data points outside the typical range or unusually far from the main trend. The method is data adaptive; that is, it should work independently of the data characteristics. The only restriction is that the method is designed for non-linear, non-stationary sensor data, one of the most common types of time series in physical processes.

Outliers are detected using an iterative and data-adaptive method. It is essentially a three-step process, carried out in two iterations; additional details on the analytical methods are provided below. The three steps are:

  1. Estimate the main trend of the time series using Savitzky-Golay (SG) smoothing.

  2. Estimate the studentized residuals.

  3. Identify outliers using the Bonferroni correction (Bonferroni outlier test).

The results from the first iteration (a new time series without the detected outliers) are used to carry out the second iteration. The Student’s t-distribution is used because it is useful for estimating the mean of a normally distributed population when the sample size is small and the population’s standard deviation is unknown. Finally, the Bonferroni correction (Bonferroni test) is a simple and efficient method to test for extreme outliers. Additional details on each of these methods are provided below.

Savitzky-Golay Smoother:

The SG smoother is a digital filter ideal for smoothing data without distorting the data tendency. Its true value comes from being independent of the sampling frequency (unlike most digital filters), which makes it simple and robust to apply to data with non-uniform sampling (e.g., industrial sensor data). Two parameters are required to apply the SG smoother: a point-wise window length and a polynomial order. The window length is the number of data points used to estimate the local trend, and the polynomial order determines whether those points are fitted with a linear (order 1) or non-linear (order higher than 1) function.
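
A sketch of this trend-estimation step using SciPy's Savitzky-Golay filter; it illustrates the smoother described above, not necessarily the exact call used internally:

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(3)
raw = np.sin(np.linspace(0, 6, 200)) + rng.normal(scale=0.1, size=200)

# 21-point window, third-order polynomial: the local trend of the signal.
trend = savgol_filter(raw, window_length=21, polyorder=3)
residuals = raw - trend  # studentized and tested against the Bonferroni critical value later
```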

Studentized residuals and the Bonferroni Correction:

The Student’s t-distribution is typically used when the sample size is small and the standard deviation is unknown. In this case, we assume that there will be a small number of out-of-range values and that the statistical properties of the data are not known. Studentizing the residuals is analogous to normalizing the data and is a useful technique for detecting outliers. Furthermore, we apply the Bonferroni correction to test whether a residual is an outlier or not. To studentize the residuals, a significance level must be defined. The Bonferroni correction is very conservative in classifying a data point as an outlier. Consequently, a sensitivity factor is used to relax the test and identify points that are located close to the main trend and that can be removed in the second iteration.

Parameters:
  • data – Time series.

  • window_length – Window length. Point-wise window length used to estimate the local trend using the SG smoother. Two integer values are required, one for each iteration. If the SG smoother is not used, these values are ignored. Default value is 20 data points for both iterations.

  • polyorder – Polynomial order. It represents the order of the polynomial function fitted to the data when using the SG smoother. Default value is 3 for both iterations.

  • alpha – Significance level. Number higher than or equal to 0 and lower than 1. Statistically speaking, the significance level is the probability of detecting an outlier that is not a true outlier. A value of 0.05 means that there is a 5% risk of detecting an outlier that is not a true outlier. Defaults to 0.05 for both iterations.

  • bc_relaxation – Sensitivity. Number higher than 0 used to make outlier detection more or less sensitive. Smaller values will make the detection more conservative. Defaults to 0.25 and 0.5 for the first and second iterations, respectively.

  • return_outliers – Output outliers. If selected (True) the method outputs the detected outliers, otherwise the filtered (no-outliers) time series is returned. Defaults to True.

Returns:

Time series.

Return type:

pandas.Series
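
A minimal usage sketch on synthetic data, assuming indsl is installed and the documented import path:

```python
import numpy as np
import pandas as pd

from indsl.data_quality.outliers import out_of_range  # path as documented above

rng = np.random.default_rng(4)
idx = pd.date_range("2023-01-01", periods=400, freq="1min")
values = np.sin(np.linspace(0, 8, len(idx))) + rng.normal(scale=0.05, size=len(idx))
values[[40, 180, 320]] += 3.0  # inject a few out-of-range points
data = pd.Series(values, index=idx)

outliers = out_of_range(data, return_outliers=True)   # only the flagged points
cleaned = out_of_range(data, return_outliers=False)   # the series with outliers removed
```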

Value Decrease Indication

indsl.data_quality.value_decrease_check(x: Series, threshold: float = 0.0) → Series

Decrease in time series values.

Identify decreases in the values of a time series whose values should not be decreasing over time. One example is a Running Hours (or Hour Count) time series: a specific type of time series that counts the number of running hours of a pump. Given that we expect the number of running hours either to stay the same (if the pump is not running) or to increase with time (if the pump is running), a decrease in the running hours value indicates bad data quality. Although the algorithm was originally meant for Running Hours time series, it can be applied to any time series where a decrease in value is a sign of bad data quality.

Parameters:
  • x – Time series

  • threshold – Threshold for value drop. This threshold indicates how many hours the time series value needs to drop before it is considered bad data quality. The threshold must be a non-negative float. By default, the threshold is set to 0.

Returns:

Time series

The returned time series is an indicator function that is 1 where there is a decrease in time series value, and 0 otherwise. The indicator will be set to 1 until the data gets “back to normal” (that is, until time series reaches the value it had before the value drop).

Return type:

pandas.Series

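A minimal usage sketch on synthetic data, assuming indsl is installed and the documented import path; the series mimics an hour counter that erroneously drops and later recovers:

```python
import pandas as pd

from indsl.data_quality import value_decrease_check  # name as documented above

idx = pd.date_range("2023-01-01", periods=8, freq="1h")
running_hours = pd.Series([100, 101, 102, 103, 98, 99, 103, 104], index=idx)  # 5 h drop, then recovery

flags = value_decrease_check(running_hours, threshold=0.0)
# Expect 1 from the drop until the series climbs back to its pre-drop value, 0 elsewhere.
```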

Datapoint difference over a period of time

indsl.data_quality.datapoint_diff_over_time_period(data: Series, time_period: Timedelta = Timedelta('1 days 00:00:00'), difference_threshold: int = 24, tolerance: Timedelta = Timedelta('0 days 01:00:00')) → Series

Diff. between two datapoints.

This function automates a data quality check for time series whose values should not increase by more than a certain threshold over a certain number of hours. For each data point in a given time series, the function calculates the difference in value between that data point and the data point the defined period earlier (i.e., it calculates the value change over a period).

An example is a Running Hours (or Hour Count) time series: a specific type of time series that counts the number of running hours of a pump. Given that we expect the number of running hours to increase by at most 24 over a period of 24 hours, a larger change over the last 24 hours would indicate poor data quality. In short, the value difference over 24 hours should not be higher than 24 for an Hour Count time series.

Although the algorithm was originally created for Running Hours time series, it can be applied to any time series where exceeding the threshold defined for the difference between two data points a certain period apart is a sign of bad data quality.

Parameters:
  • data – Time series.

  • time_period – Time period. The length of the period over which to calculate the difference between two data points. The value must be non-negative. Defaults to 1 day.

  • difference_threshold – Threshold for data point diff. The threshold for difference calculation between two data points. Defaults to 24.

  • tolerance – Tolerance range. The tolerance period allowed between timestamps when looking for the closest timestamp. Defaults to 1 hour.

Returns:

Time series

The returned time series is an indicator function that is 1 where the difference between two data points over the given time period exceeds the defined threshold, and 0 otherwise.

Return type:

pandas.Series

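A minimal usage sketch on synthetic data, assuming indsl is installed and the documented import path; the series mimics an hour counter whose value jumps by more than 24 within a single day:

```python
import numpy as np
import pandas as pd

from indsl.data_quality import datapoint_diff_over_time_period  # name as documented above

idx = pd.date_range("2023-01-01", periods=72, freq="1h")
running_hours = pd.Series(np.arange(72, dtype=float), index=idx)  # normally +1 per hour
running_hours.iloc[48:] += 30  # a sudden 30-hour jump partway through

flags = datapoint_diff_over_time_period(
    running_hours,
    time_period=pd.Timedelta("1 day"),
    difference_threshold=24,
    tolerance=pd.Timedelta("1h"),
)
```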