Data Quality

Completeness

Completeness Score

indsl.data_quality.completeness.completeness_score(x: Series, cutoff_good: float = 0.8, cutoff_med: float = 0.6, method_period: str = 'median') -> str

Completeness score

Determine the completeness of a time series from a completeness score. The data sampling period is inferred as the median or minimum of the timestamp differences, and the expected total number of data points for the period follows from that sampling frequency. The completeness score is defined as the ratio between the actual number of data points and the expected number of data points. The completeness is categorized as good if the score is above the specified cutoff ratio for good completeness, medium if the score falls between the cutoff ratios for good and medium completeness, and poor if the score is below the medium completeness ratio.

Parameters
  • x – Time series

  • cutoff_good – Good cutoff. Value between 0 and 1. A completeness score above this cutoff value indicates good data completeness. Defaults to 0.80.

  • cutoff_med – Medium cutoff. Value between 0 and 1, lower than the good data completeness cutoff. A completeness score above this cutoff and below the good completeness cutoff indicates medium data completeness; data with a score below it are categorized as poor data completeness. Defaults to 0.60.

  • method_period – Method. Name of the method used to estimate the period of the time series, either 'median' or 'min'. Defaults to 'median'.

Returns

Data quality

The data quality is defined as Good when completeness score >= cutoff_good, Medium when cutoff_med <= completeness score < cutoff_good, and Poor when completeness score < cutoff_med.

Return type

string

Raises
  • TypeError – cutoff_good or cutoff_med is not a number

  • ValueError – x has fewer than ten data points

  • TypeError – x is not a time series

  • TypeError – index of x is not datetime

  • ValueError – method_period is not median or min

  • ValueError – completeness score is greater than 1
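
The score computation described above can be sketched in a few lines of pandas. This is an illustration of the idea, not the library's implementation, and `completeness_score_sketch` is a hypothetical helper name:

```python
import pandas as pd

def completeness_score_sketch(x: pd.Series, method_period: str = "median") -> float:
    """Ratio of actual to expected number of points for the inferred period."""
    deltas = x.index.to_series().diff().dropna()
    # Infer the sampling period from the timestamp differences.
    period = deltas.median() if method_period == "median" else deltas.min()
    span = x.index[-1] - x.index[0]
    expected = span / period + 1  # expected number of points at this rate
    return min(len(x) / expected, 1.0)

# 10 one-minute samples with two points removed: 8 actual vs 10 expected.
idx = pd.date_range("2023-01-01", periods=10, freq="1min").delete([3, 7])
score = completeness_score_sketch(pd.Series(range(8), index=idx))
```

Here the score is 0.8, which the default cutoffs classify as Good.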

Data Gaps Detection

Using Z scores

indsl.data_quality.gaps_identification_z_scores(data: Series, cutoff: float = 3.0, test_normality_assumption: bool = False) -> Series

Gaps detection, Z-scores

Detect gaps in the time stamps using Z-scores. The Z-score is the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean of what is being observed or measured. This method assumes that the time step sizes are normally distributed. Gaps are defined as time periods where the Z-score is larger than the cut-off value.

Parameters
  • data – Time series

  • cutoff – Cut-off. Time periods are considered gaps if the Z-score exceeds this cut-off value. Defaults to 3.0.

  • test_normality_assumption – Test for normality. Raise a warning if the data is not normally distributed. The Shapiro-Wilk test is used, and it is only performed if the time series contains fewer than 5000 data points. Defaults to False.

Returns

Time series

The returned time series is an indicator function that is 1 where there is a gap, and 0 otherwise.

Return type

pd.Series

Raises
  • UserTypeError – data is not a time series

  • UserTypeError – cutoff is not a number

  • UserValueError – data is empty

  • UserValueError – time series is not normally distributed
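
A minimal sketch of the approach with pandas, flagging the timestamp that ends each abnormally large time step. `gaps_z_scores_sketch` is a hypothetical name, and the library's exact flagging convention may differ:

```python
import pandas as pd

def gaps_z_scores_sketch(data: pd.Series, cutoff: float = 3.0) -> pd.Series:
    """1 where a time step is a gap (Z-score above the cut-off), 0 otherwise."""
    deltas = data.index.to_series().diff().dt.total_seconds().dropna()
    z = (deltas - deltas.mean()) / deltas.std()
    out = pd.Series(0, index=data.index)
    out.loc[z[z > cutoff].index] = 1  # mark the point closing the gap
    return out

# Twenty points at one-minute spacing followed by a two-hour gap.
idx = pd.date_range("2023-01-01", periods=20, freq="1min")
idx = idx.append(pd.DatetimeIndex([idx[-1] + pd.Timedelta("2h")]))
flags = gaps_z_scores_sketch(pd.Series(range(21), index=idx))
```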

Using modified Z scores

indsl.data_quality.gaps_identification_modified_z_scores(data: Series, cutoff: float = 3.5) -> Series

Gaps detection, mod. Z-scores

Detect gaps in the time stamps using modified Z-scores, which replace the mean and standard deviation with the median and the median absolute deviation (MAD) and are therefore robust to outliers. Gaps are defined as time periods where the modified Z-score is larger than the cut-off value.

Parameters
  • data – Time series

  • cutoff – Cut-off. Time periods are considered gaps if the modified Z-score exceeds this cut-off value. Defaults to 3.5.

Returns

Time series

The returned time series is an indicator function that is 1 where there is a gap, and 0 otherwise.

Return type

pd.Series

Raises
  • UserTypeError – data is not a time series

  • UserTypeError – cutoff is not of type float

  • UserValueError – data is empty
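
The modified Z-score replaces the mean with the median and the standard deviation with the MAD, scaled by the constant 0.6745 so it is comparable to an ordinary Z-score. A sketch of that computation, with `gaps_modified_z_sketch` as a hypothetical helper name:

```python
import pandas as pd

def gaps_modified_z_sketch(data: pd.Series, cutoff: float = 3.5) -> pd.Series:
    deltas = data.index.to_series().diff().dt.total_seconds().dropna()
    med = deltas.median()
    mad = (deltas - med).abs().median()
    m = 0.6745 * (deltas - med) / mad  # modified Z-score (median/MAD based)
    out = pd.Series(0, index=data.index)
    out.loc[m[m > cutoff].index] = 1
    return out

# Roughly one-minute spacing, then a large gap at the end.
base = pd.Timestamp("2023-01-01")
offsets = [0, 60, 130, 190, 260, 320, 390, 450, 3600]
idx = pd.DatetimeIndex([base + pd.Timedelta(seconds=s) for s in offsets])
flags = gaps_modified_z_sketch(pd.Series(range(len(idx)), index=idx))
```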

Using the interquartile range method

indsl.data_quality.gaps_identification_iqr(data: Series) -> Series

Gaps detection, IQR

Detect gaps in the time stamps using the interquartile range (IQR) method. The IQR is a measure of statistical dispersion, which is the spread of the data. Any time steps that are more than 1.5 IQR above Q3 are considered gaps in the data.

Parameters

data – Time series

Returns

time series

The returned time series is an indicator function that is 1 where there is a gap, and 0 otherwise.

Return type

pd.Series

Raises
  • UserTypeError – data is not a time series

  • UserValueError – data is empty
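
The IQR rule is Tukey's upper fence applied to the time deltas. A sketch with a hypothetical helper name:

```python
import pandas as pd

def gaps_iqr_sketch(data: pd.Series) -> pd.Series:
    deltas = data.index.to_series().diff().dt.total_seconds().dropna()
    q1, q3 = deltas.quantile(0.25), deltas.quantile(0.75)
    upper = q3 + 1.5 * (q3 - q1)  # Tukey's fence: Q3 + 1.5 * IQR
    out = pd.Series(0, index=data.index)
    out.loc[deltas[deltas > upper].index] = 1
    return out

# Roughly one-minute spacing, then a large gap at the end.
base = pd.Timestamp("2023-01-01")
offsets = [0, 60, 130, 190, 260, 320, 390, 450, 3600]
idx = pd.DatetimeIndex([base + pd.Timedelta(seconds=s) for s in offsets])
flags = gaps_iqr_sketch(pd.Series(range(len(idx)), index=idx))
```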

Using a time delta threshold

indsl.data_quality.gaps_identification_threshold(data: Series, time_delta: Timedelta = Timedelta('0 days 00:05:00')) -> Series

Gaps detection, threshold

Detect gaps in the time stamps using a timedelta threshold.

Parameters
  • data – Time series

  • time_delta – Time threshold. Maximum allowed time delta between consecutive points. Defaults to 5 min.

Returns

time series

The returned time series is an indicator function that is 1 where there is a gap, and 0 otherwise.

Return type

pd.Series

Raises
  • UserTypeError – data is not a time series

  • UserValueError – data is empty

  • UserTypeError – time_delta is not a pd.Timedelta
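
The threshold variant reduces to a single comparison per time step. A sketch (hypothetical helper name):

```python
import pandas as pd

def gaps_threshold_sketch(data: pd.Series, time_delta=pd.Timedelta("5min")) -> pd.Series:
    deltas = data.index.to_series().diff()
    out = pd.Series(0, index=data.index)
    out.loc[deltas[deltas > time_delta].index] = 1  # step exceeds the threshold
    return out

# Ten points at one-minute spacing, then a 30-minute jump.
idx = pd.date_range("2023-01-01", periods=10, freq="1min")
idx = idx.append(pd.DatetimeIndex([idx[-1] + pd.Timedelta("30min")]))
flags = gaps_threshold_sketch(pd.Series(range(11), index=idx))
```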

Low data density

Using Z scores

indsl.data_quality.low_density_identification_z_scores(data: Series, time_window: Timedelta = Timedelta('0 days 00:05:00'), cutoff: float = -3.0, test_normality_assumption: bool = False) -> Series

Low density, Z-scores

Detect periods with low density of data points using Z-scores. The Z-score is the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean of what is being observed or measured. This method assumes that the densities over a rolling window are normally distributed. Low density periods are defined as time periods where the Z-score is lower than the cut-off value.

Parameters
  • data – Time series

  • time_window – Rolling window. Length of the time period over which to compute the density of points. Defaults to 5 min.

  • cutoff – Cut-off. Number of standard deviations from the mean. Low density periods are detected if the Z-score is below this cut-off value. Defaults to -3.0.

  • test_normality_assumption – Test for normality. Raise a warning if the data is not normally distributed. The Shapiro-Wilk test is used, and it is only performed if the time series contains fewer than 5000 data points. Defaults to False.

Returns

Time series

The returned time series is an indicator function that is 1 where there is a low density period, and 0 otherwise.

Return type

pd.Series

Raises
  • UserTypeError – data is not a time series

  • UserTypeError – cutoff is not a number

  • UserValueError – data is empty

  • UserValueError – time series is not normally distributed
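
The density can be measured as a rolling count of points in the time window, with the counts then Z-scored. A sketch of that idea (hypothetical helper name; the library's windowing details may differ):

```python
import pandas as pd

def low_density_z_sketch(data: pd.Series, time_window="5min", cutoff: float = -3.0) -> pd.Series:
    counts = data.rolling(time_window).count()  # points per rolling window
    z = (counts - counts.mean()) / counts.std()
    return (z < cutoff).astype(int)

# 400 points at 10 s spacing, then five points 15 minutes apart.
dense = pd.date_range("2023-01-01", periods=400, freq="10s")
sparse = pd.date_range(dense[-1] + pd.Timedelta("15min"), periods=5, freq="15min")
flags = low_density_z_sketch(pd.Series(1.0, index=dense.append(sparse)))
```

Note that the first few points of the series also get low counts while the window fills up, which a production implementation would need to handle.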

Using modified Z scores

indsl.data_quality.low_density_identification_modified_z_scores(data: Series, time_window: Timedelta = Timedelta('0 days 00:05:00'), cutoff: float = -3.5) -> Series

Low density, mod. Z-scores

Detect periods with low density of data points using modified Z-scores. Low density periods are defined as time periods where the modified Z-score is lower than the cut-off value.

Parameters
  • data – Time series

  • time_window – Rolling window. Length of the time period over which to compute the density of points. Defaults to 5 min.

  • cutoff – Cut-off. Low density periods are detected if the modified Z-score is below this cut-off value. Defaults to -3.5.

Returns

Time series

The returned time series is an indicator function that is 1 where there is a low density period, and 0 otherwise.

Return type

pd.Series

Raises
  • UserTypeError – data is not a time series

  • UserTypeError – cutoff is not of type float

  • UserValueError – data is empty

Using the interquartile range method

indsl.data_quality.low_density_identification_iqr(data: Series, time_window: Timedelta = Timedelta('0 days 00:05:00')) -> Series

Low density, IQR

Detect periods with low density of data points using the interquartile range (IQR) method. The IQR is a measure of statistical dispersion, that is, the spread of the data. Densities that are more than 1.5 IQR below Q1 are considered low density periods in the data.

Parameters
  • data – Time series

  • time_window – Rolling window. Length of the time period over which to compute the density of points. Defaults to 5 min.

Returns

time series

The returned time series is an indicator function that is 1 where there is a low density period, and 0 otherwise.

Return type

pd.Series

Raises
  • UserTypeError – data is not a time series

  • UserValueError – data is empty

Using a density threshold

indsl.data_quality.low_density_identification_threshold(data: Series, time_window: Timedelta = Timedelta('0 days 00:05:00'), cutoff: int = 10) -> Series

Low density, threshold

Detect periods with low density of points using a minimum point count as the cut-off value.

Parameters
  • data – Time series

  • time_window – Rolling window. Length of the time period over which to compute the density of points. Defaults to 5 min.

  • cutoff – Density cut-off. Low density periods are detected if the number of points in the window is less than this cut-off value. Defaults to 10.

Returns

time series

The returned time series is an indicator function that is 1 where there is a low density period, and 0 otherwise.

Return type

pd.Series

Raises
  • UserTypeError – data is not a time series

  • UserValueError – data is empty
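
With a fixed cut-off the check is a direct comparison of the rolling point count. A sketch (hypothetical helper name):

```python
import pandas as pd

def low_density_threshold_sketch(data: pd.Series, time_window="5min", cutoff: int = 10) -> pd.Series:
    counts = data.rolling(time_window).count()  # points per rolling window
    return (counts < cutoff).astype(int)

# Thirty points at 10 s spacing, then three points 15 minutes apart.
dense = pd.date_range("2023-01-01", periods=30, freq="10s")
sparse = pd.date_range(dense[-1] + pd.Timedelta("15min"), periods=3, freq="15min")
flags = low_density_threshold_sketch(pd.Series(1.0, index=dense.append(sparse)))
```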

Rolling standard deviation of time delta

indsl.data_quality.rolling_stddev_timedelta(data: Series, time_window: Timedelta = Timedelta('0 days 00:15:00'), min_periods: int = 1) -> Series

Rolling stdev of time delta

Rolling standard deviation computed for the time deltas of the observations. The purpose of this metric is to measure the amount of variation or dispersion in the frequency of time series data points.

Parameters
  • data – Time series.

  • time_window – Time window. Length of the time period over which to compute the standard deviation. Defaults to 15 minutes. The time unit should be days, hours, minutes, or seconds. Accepted formats are listed in the pandas.Timedelta documentation (https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html).

  • min_periods – Minimum samples. Minimum number of observations required in the given time window (otherwise, the result is set to 0). Defaults to 1.

Returns

Time series

Return type

pandas.Series

Raises
  • UserTypeError – data is not a time series

  • UserValueError – data is empty

  • UserTypeError – time_window is not of type pandas.Timedelta
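
A sketch of the metric with pandas: take the time deltas, then a time-based rolling standard deviation. The helper name is hypothetical, and the zero-fill mirrors the documented min_periods behaviour:

```python
import pandas as pd

def rolling_stddev_timedelta_sketch(data: pd.Series, time_window="15min", min_periods: int = 1) -> pd.Series:
    deltas = data.index.to_series().diff().dt.total_seconds()
    stddev = deltas.rolling(time_window, min_periods=min_periods).std()
    return stddev.fillna(0.0)  # too few observations -> 0, as documented

# Irregularly spaced observations.
base = pd.Timestamp("2023-01-01")
idx = pd.DatetimeIndex([base + pd.Timedelta(seconds=s) for s in [0, 60, 120, 200, 260, 380]])
stddev = rolling_stddev_timedelta_sketch(pd.Series(range(6), index=idx))
```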

Accuracy

Uncertainty estimation

indsl.data_quality.uncertainty_rstd(data: Series, resample_rate: Timedelta = Timedelta('0 days 00:30:00'), emd_sift_thresh: float = 1e-08, emd_max_num_imfs: Optional[int] = None, emd_error_tolerance: float = 0.05) -> Series

Relative uncertainty

The relative uncertainty is computed as the ratio between the standard deviation of the signal noise and the mean of the true signal. The noise and the true signal are estimated using the empirical mode decomposition (EMD) method. The relative uncertainty is computed on segments of the input data of size resample_rate. In mathematical notation, this means:

\[rstd = \sigma(F_t - A_t)/|\mu(A_t)|\]

where \(F_t\) is the resampled input time series, and \(A_t\) is the resampled and detrended time series obtained using the empirical mode decomposition (EMD) method.

Parameters
  • data – Time series. Input time series.

  • resample_rate – Resample rate. Resample rate used when estimating the relative standard deviation. Defaults to 30 minutes.

  • emd_sift_thresh – Sifting threshold. Threshold to stop the EMD sifting process. This threshold is based on the Cauchy convergence test and represents the residue between two consecutive oscillatory components (IMFs). A small threshold (close to zero) results in more components being extracted. Typically, a few IMFs are enough to build the main trend, and choosing a high threshold might not affect the outcome. Defaults to 1e-8.

  • emd_max_num_imfs – Maximum number of components. Maximum number of EMD oscillatory components (IMFs) used to estimate the main trend. If no value (None) is given, the process continues until the sifting threshold is reached. Defaults to None.

  • emd_error_tolerance – Energy tolerance. Threshold for the EMD cross-energy-ratio validation used for choosing oscillatory components (IMFs). Defaults to 0.05.

Returns

Time series

Return type

pd.Series
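
The formula can be illustrated without a full EMD implementation by substituting a simple detrending step. The sketch below uses a centered rolling mean as a stand-in for the EMD trend \(A_t\); it shows the per-segment ratio only, not the library's actual decomposition, and the helper name is hypothetical:

```python
import numpy as np
import pandas as pd

def uncertainty_rstd_sketch(data: pd.Series, resample_rate="30min") -> pd.Series:
    # Stand-in for EMD detrending: a centered 25-point rolling mean.
    trend = data.rolling(25, center=True, min_periods=1).mean()
    noise = data - trend
    sigma = noise.resample(resample_rate).std()      # sigma(F_t - A_t) per segment
    mu = trend.resample(resample_rate).mean().abs()  # |mu(A_t)| per segment
    return sigma / mu

# Two hours of one-minute data: a constant level of 10 plus small noise.
rng = np.random.default_rng(1)
idx = pd.date_range("2023-01-01", periods=120, freq="1min")
rstd = uncertainty_rstd_sketch(pd.Series(10.0 + 0.1 * rng.standard_normal(120), index=idx))
```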

Validity

Extreme Outliers Removal

indsl.data_quality.extreme(data: Series, alpha: float = 0.05, bc_relaxation: float = 0.167, poly_order: int = 3)

Extreme outliers removal

Outlier detection and removal based on the paper by Gustavo A. Zarruk. The procedure is as follows:

  • Fit a polynomial curve to the model using all of the data

  • Calculate the studentized deleted (or externally studentized) residuals

  • These residuals follow a t distribution with n - p - 1 degrees of freedom

  • Compute the Bonferroni critical value from the significance level (alpha) and the t distribution

  • Any values that fall outside of the critical value are treated as anomalies

Use of the hat matrix diagonal allows for the rapid calculation of deleted residuals without having to refit the predictor function each time.

Parameters
  • data – Time series.

  • alpha – Significance level. A number greater than or equal to 0 and lower than 1. In statistics, the significance level is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 means that there is a 5% risk of flagging an outlier that is not a true outlier. Defaults to 0.05.

  • bc_relaxation – Relaxation factor for the Bonferroni critical value. Smaller values will make anomaly detection more conservative. Defaults to 1/6.

  • poly_order – Polynomial order. The order of the polynomial function fitted to the original time series. Defaults to 3.

Returns

Time series without outliers.

Return type

pandas.Series

Raises

UserValueError – alpha is not a number between 0 and 1
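
The procedure above can be sketched with numpy and scipy. The helper name is hypothetical, the Bonferroni cut-off is applied directly (the library's exact scaling by bc_relaxation is not reproduced here), and the hat-matrix diagonal gives the deleted residuals without refitting:

```python
import numpy as np
from scipy import stats

def extreme_sketch(t, y, alpha=0.05, bc_relaxation=1.0, poly_order=3):
    """Flag outliers via externally studentized residuals of a polynomial fit."""
    n = len(y)
    X = np.vander(t, poly_order + 1)                 # polynomial design matrix
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                                 # raw residuals
    # Hat-matrix diagonal, h_i = x_i (X'X)^{-1} x_i', avoids refitting per point.
    h = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
    dof = n - (poly_order + 1) - 1
    sse = e @ e
    # Externally studentized (deleted) residuals.
    t_res = e * np.sqrt(dof / (sse * (1 - h) - e**2))
    # Bonferroni-corrected critical value, scaled by the relaxation factor.
    crit = stats.t.ppf(1 - alpha / (2 * n), dof) * bc_relaxation
    return np.abs(t_res) <= crit                     # True = keep the point

t = np.linspace(0.0, 1.0, 50)
rng = np.random.default_rng(7)
y = 1 + 2 * t - 3 * t**2 + 0.5 * rng.standard_normal(50)
y[25] += 10.0  # inject one extreme outlier
keep = extreme_sketch(t, y)
```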

Negative Running Hours

indsl.data_quality.negative_running_hours_check(x: Series, threshold: float = 0.0) -> Series

Negative running hours

The negative running hours model automates a data quality check for time series whose values should not decrease over time. One example is a Running Hours (or Hour Count) time series, a specific type of time series that counts the number of running hours of, for example, a pump. Since the number of running hours can only stay the same (if the pump is not running) or increase (if the pump is running), a decrease in the value indicates bad data quality. Although the algorithm was originally created for Running Hours time series, it can be applied to any time series where a decrease in value is a sign of bad data quality.

Parameters
  • x – Time series

  • threshold – Threshold for value drop. This threshold indicates by how many hours the time series value needs to drop before the data is considered bad quality. The threshold must be a non-negative float. Defaults to 0.

Returns

Time series

The returned time series is an indicator function that is 1 where there is a decrease in the time series value, and 0 otherwise. The indicator stays 1 until the data gets “back to normal” (that is, until the time series reaches the value it had before the drop).

Return type

pandas.Series

Raises
  • UserTypeError – x is not a time series

  • UserValueError – x is empty

  • UserTypeError – index of x is not a datetime

  • UserValueError – index of x is not increasing

  • UserTypeError – threshold is not a number

  • UserValueError – threshold is a negative number
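
A sketch of the check using a running maximum. The helper name is hypothetical, and in this sketch the indicator clears as soon as the value is back within `threshold` of the running maximum, which may differ in detail from the library's recovery rule:

```python
import pandas as pd

def negative_running_hours_sketch(x: pd.Series, threshold: float = 0.0) -> pd.Series:
    # The running maximum is the value before any drop; the indicator is 1
    # while the series sits more than `threshold` below that value.
    running_max = x.cummax()
    return ((running_max - x) > threshold).astype(int)

# An hour counter that briefly decreases, then recovers.
idx = pd.date_range("2023-01-01", periods=7, freq="1h")
hours = pd.Series([1.0, 2.0, 3.0, 2.5, 2.8, 3.0, 4.0], index=idx)
flags = negative_running_hours_sketch(hours)
```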