Data Quality
Completeness
Completeness Score
- indsl.data_quality.completeness.completeness_score(x: Series, cutoff_good: float = 0.8, cutoff_med: float = 0.6, method_period: str = 'median') → str
Completeness score
Function to determine the completeness of a time series based on a completeness score. The score is a function of the inferred data sampling period (the median or minimum of the timestamp differences) and the expected total number of data points for the period at that sampling frequency. The completeness score is defined as the ratio of the actual number of data points to the expected number of data points. Completeness is categorised as good if the score is above the cutoff ratio for good completeness, medium if the score falls between the cutoff ratios for good and medium completeness, and poor if the score is below the cutoff ratio for medium completeness.
- Parameters
x – Time series
cutoff_good – Good cutoff. Value between 0 and 1. A completeness score above this cutoff value indicates good data completeness. Defaults to 0.80.
cutoff_med – Medium cutoff. Value between 0 and 1, lower than the good-completeness cutoff. A completeness score above this cutoff and below the good-completeness cutoff indicates medium data completeness; data with a completeness score below it are categorised as poor. Defaults to 0.60.
method_period – Method. Name of the method used to estimate the period of the time series; either 'median' or 'min'. Defaults to 'median'.
- Returns
- Data quality
The data quality is defined as Good when completeness score >= cutoff_good, Medium when cutoff_med <= completeness score < cutoff_good, and Poor when completeness score < cutoff_med.
- Return type
string
- Raises
TypeError – cutoff_good or cutoff_med is not a number
ValueError – x has fewer than ten data points
TypeError – x is not a time series
TypeError – index of x is not datetime
ValueError – method_period is neither 'median' nor 'min'
ValueError – completeness score is greater than 1
Examples:
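As an illustration, the scoring logic can be sketched in a few lines of pandas (a minimal sketch, not the indsl implementation; the name completeness_sketch is hypothetical):

```python
import pandas as pd

def completeness_sketch(x: pd.Series, cutoff_good: float = 0.8,
                        cutoff_med: float = 0.6, method_period: str = "median") -> str:
    # Infer the sampling period from the timestamp differences.
    deltas = x.index.to_series().diff().dropna()
    period = deltas.median() if method_period == "median" else deltas.min()
    # Expected number of points over the observed span at that sampling period.
    expected = (x.index[-1] - x.index[0]) / period + 1
    score = len(x) / expected
    if score >= cutoff_good:
        return "Good"
    return "Medium" if score >= cutoff_med else "Poor"

idx = pd.date_range("2023-01-01", periods=100, freq="1min")
full = pd.Series(1.0, index=idx)
gappy = full.drop(full.index[25:75])  # a 50-point hole halves the score
print(completeness_sketch(full), completeness_sketch(gappy))  # Good Poor
```

Note that uniformly sparse data still scores well, because the sampling period is inferred from the data itself; the score penalises holes relative to the inferred period.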
Data Gaps Detection
Using Z scores
- indsl.data_quality.gaps_identification_z_scores(data: Series, cutoff: float = 3.0, test_normality_assumption: bool = False) → Series
Gaps detection, Z-scores
Detect gaps in the time stamps using Z-scores. Z-score stands for the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean value of what is being observed or measured. This method assumes that the time step sizes are normally distributed. Gaps are defined as time periods where the Z-score is larger than cutoff.
- Parameters
data – Time series
cutoff – Cut-off. Time periods are considered gaps if the Z-score is over this cut-off value. Defaults to 3.0.
test_normality_assumption – Test for normality. Raise a warning if the data is not normally distributed. The Shapiro-Wilk test is used; the test is only performed if the time series contains fewer than 5000 data points. Defaults to False.
- Returns
- Time series
The returned time series is an indicator function that is 1 where there is a gap, and 0 otherwise.
- Return type
pd.Series
- Raises
UserTypeError – data is not a time series
UserTypeError – cutoff is not a number
UserValueError – data is empty
UserValueError – time series is not normally distributed
Examples:
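A minimal pandas sketch of the idea, assuming the Z-score is computed on the time deltas in seconds (gaps_z_scores_sketch is a hypothetical helper, not the indsl implementation):

```python
import pandas as pd

def gaps_z_scores_sketch(data: pd.Series, cutoff: float = 3.0) -> pd.Series:
    # Z-score of each time step relative to the mean step size.
    deltas = data.index.to_series().diff().dt.total_seconds()
    z = (deltas - deltas.mean()) / deltas.std()
    # A point is flagged 1 if the step leading up to it exceeds the cut-off.
    return (z > cutoff).astype(int)

idx = pd.date_range("2023-01-01 00:00", periods=60, freq="1min").append(
    pd.date_range("2023-01-01 02:00", periods=5, freq="1min"))
data = pd.Series(range(65), index=idx)
flags = gaps_z_scores_sketch(data)
print(int(flags.sum()))  # the single one-hour gap is flagged
```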
Using modified Z scores
- indsl.data_quality.gaps_identification_modified_z_scores(data: Series, cutoff: float = 3.5) → Series
Gaps detection, mod. Z-scores
Detect gaps in the time stamps using modified Z-scores. Gaps are defined as time periods where the modified Z-score is larger than the cut-off.
- Parameters
data – Time series
cutoff – Cut-off. Time periods are considered gaps if the modified Z-score is over this cut-off value. Defaults to 3.5.
- Returns
- Time series
The returned time series is an indicator function that is 1 where there is a gap, and 0 otherwise.
- Return type
pd.Series
- Raises
UserTypeError – data is not a time series
UserTypeError – cutoff is not of type float
UserValueError – data is empty
Examples:
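The modified Z-score replaces the mean and standard deviation with the median and the median absolute deviation (MAD), making the statistic robust to the very gaps it is trying to detect. A hedged sketch (gaps_mod_z_scores_sketch is a hypothetical name):

```python
import numpy as np
import pandas as pd

def gaps_mod_z_scores_sketch(data: pd.Series, cutoff: float = 3.5) -> pd.Series:
    deltas = data.index.to_series().diff().dt.total_seconds()
    med = deltas.median()
    mad = (deltas - med).abs().median()
    # Modified Z-score (Iglewicz & Hoaglin): median/MAD instead of mean/std,
    # so a handful of huge gaps cannot mask themselves.
    mz = 0.6745 * (deltas - med) / mad
    return (mz > cutoff).astype(int)

# Slightly jittered one-minute sampling with a single 30-minute gap.
steps = np.array([60, 62, 58, 60] * 10 + [1800] + [60, 62, 58, 60] * 3)
idx = pd.Timestamp("2023-01-01") + pd.to_timedelta(steps.cumsum(), unit="s")
data = pd.Series(range(len(idx)), index=idx)
flags = gaps_mod_z_scores_sketch(data)
print(int(flags.sum()))  # only the 30-minute gap is flagged
```

The jitter matters: with perfectly regular sampling the MAD is zero and the statistic degenerates.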
Using the interquartile range method
- indsl.data_quality.gaps_identification_iqr(data: Series) → Series
Gaps detection, IQR
Detect gaps in the time stamps using the interquartile range (IQR) method. The IQR is a measure of statistical dispersion, which is the spread of the data. Any time steps that are more than 1.5 IQR above Q3 are considered gaps in the data.
- Parameters
data – time series
- Returns
- time series
The returned time series is an indicator function that is 1 where there is a gap, and 0 otherwise.
- Return type
pd.Series
- Raises
UserTypeError – data is not a time series
UserValueError – data is empty
Examples:
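A sketch of the IQR rule on the time deltas (gaps_iqr_sketch is a hypothetical helper, not the indsl implementation). Note that with perfectly regular sampling the IQR degenerates to zero, so any longer step is flagged:

```python
import pandas as pd

def gaps_iqr_sketch(data: pd.Series) -> pd.Series:
    deltas = data.index.to_series().diff().dt.total_seconds()
    q1, q3 = deltas.quantile(0.25), deltas.quantile(0.75)
    # Tukey's fence: steps more than 1.5 IQR above Q3 count as gaps.
    return (deltas > q3 + 1.5 * (q3 - q1)).astype(int)

idx = pd.date_range("2023-01-01", periods=30, freq="1min").append(
    pd.date_range("2023-01-01 01:00", periods=10, freq="1min"))
flags = gaps_iqr_sketch(pd.Series(1.0, index=idx))
print(int(flags.sum()))  # the 31-minute jump is the only step above the fence
```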
Using a time delta threshold
- indsl.data_quality.gaps_identification_threshold(data: Series, time_delta: Timedelta = Timedelta('0 days 00:05:00')) → Series
Gaps detection, threshold
Detect gaps in the time stamps using a timedelta threshold.
- Parameters
data – time series
time_delta – Time threshold. Maximum time delta between points. Defaults to 5 min.
- Returns
- time series
The returned time series is an indicator function that is 1 where there is a gap, and 0 otherwise.
- Return type
pd.Series
- Raises
UserTypeError – data is not a time series
UserValueError – data is empty
UserTypeError – time_delta is not a pd.Timedelta
Examples:
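This is the simplest of the gap detectors and can be sketched directly (gaps_threshold_sketch is a hypothetical helper name):

```python
import pandas as pd

def gaps_threshold_sketch(data: pd.Series,
                          time_delta: pd.Timedelta = pd.Timedelta("5min")) -> pd.Series:
    # Flag every point whose preceding time step exceeds the threshold.
    deltas = data.index.to_series().diff()
    return (deltas > time_delta).astype(int)

idx = pd.to_datetime(["2023-01-01 00:00", "2023-01-01 00:04",
                      "2023-01-01 00:20", "2023-01-01 00:23"])
flags = gaps_threshold_sketch(pd.Series(1.0, index=idx))
print(flags.tolist())  # [0, 0, 1, 0]
```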
Low data density
Using Z scores
- indsl.data_quality.low_density_identification_z_scores(data: Series, time_window: Timedelta = Timedelta('0 days 00:05:00'), cutoff: float = -3.0, test_normality_assumption: bool = False) → Series
Low density, Z-scores
Detect periods with low density of data points using Z-scores. Z-score stands for the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean value of what is being observed or measured. This method assumes that the densities over a rolling window are normally distributed. Low density periods are defined as time periods where the Z-score is lower than cutoff.
- Parameters
data – Time series
time_window – Rolling window. Length of the time period used to compute the density of points. Defaults to 5 min.
cutoff – Cut-off. Number of standard deviations from the mean. Low-density periods are detected if the Z-score is below this cut-off value. Defaults to -3.0.
test_normality_assumption – Test for normality. Raise a warning if the data is not normally distributed. The Shapiro-Wilk test is used; the test is only performed if the time series contains fewer than 5000 data points. Defaults to False.
- Returns
- Time series
The returned time series is an indicator function that is 1 where there is a low density period, and 0 otherwise.
- Return type
pd.Series
- Raises
UserTypeError – data is not a time series
UserTypeError – cutoff is not a number
UserValueError – data is empty
UserValueError – time series is not normally distributed
Examples:
Using modified Z scores
- indsl.data_quality.low_density_identification_modified_z_scores(data: Series, time_window: Timedelta = Timedelta('0 days 00:05:00'), cutoff: float = -3.5) → Series
Low density, mod. Z-scores
Detect periods with low density of data points using modified Z-scores. Low-density periods are defined as time periods where the modified Z-score is lower than the cut-off.
- Parameters
data – Time series
time_window – Rolling window. Length of the time period used to compute the density of points. Defaults to 5 min.
cutoff – Cut-off. Low-density periods are detected if the modified Z-score is below this cut-off value. Defaults to -3.5.
- Returns
- Time series
The returned time series is an indicator function that is 1 where there is a low density period, and 0 otherwise.
- Return type
pd.Series
- Raises
UserTypeError – data is not a time series
UserTypeError – cutoff is not of type float
UserValueError – data is empty
Examples:
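A sketch using the median/MAD-based modified Z-score of the rolling counts (low_density_mod_z_sketch is a hypothetical helper). When the counts are nearly constant the MAD degenerates to zero, so this sketch falls back to the mean absolute deviation with the 0.7979 factor, following Iglewicz and Hoaglin; the library's exact handling of this case is an assumption here:

```python
import pandas as pd

def low_density_mod_z_sketch(data: pd.Series,
                             time_window: pd.Timedelta = pd.Timedelta("5min"),
                             cutoff: float = -3.5) -> pd.Series:
    counts = data.rolling(time_window).count()
    med = counts.median()
    mad = (counts - med).abs().median()
    if mad > 0:
        mz = 0.6745 * (counts - med) / mad
    else:
        # Iglewicz-Hoaglin fallback when the MAD degenerates to zero:
        # use the mean absolute deviation with the 0.7979 factor instead.
        mz = 0.7979 * (counts - med) / (counts - med).abs().mean()
    return (mz < cutoff).astype(int)

idx = pd.date_range("2023-01-01", periods=300, freq="1s").append(
    pd.date_range("2023-01-01 00:06:00", periods=5, freq="1min"))
flags = low_density_mod_z_sketch(pd.Series(1.0, index=idx),
                                 time_window=pd.Timedelta("1min"))
print(int(flags.iloc[-5:].sum()))  # the five sparse trailing points are flagged
```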
Using the interquartile range method
- indsl.data_quality.low_density_identification_iqr(data: Series, time_window: Timedelta = Timedelta('0 days 00:05:00')) → Series
Low density, IQR
Detect periods with low density of data points using the interquartile range (IQR) method. The IQR is a measure of statistical dispersion, which is the spread of the data. Densities that are more than 1.5 IQR below Q1 are considered low-density periods in the data.
- Parameters
data – time series
time_window – Rolling window. Length of the time period used to compute the density of points. Defaults to 5 min.
- Returns
- time series
The returned time series is an indicator function that is 1 where there is a low density period, and 0 otherwise.
- Return type
pd.Series
- Raises
UserTypeError – data is not a time series
UserValueError – data is empty
Examples:
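A sketch of the IQR rule applied to the rolling point counts (low_density_iqr_sketch is a hypothetical helper; with near-constant density the IQR collapses to zero and the fence reduces to Q1):

```python
import pandas as pd

def low_density_iqr_sketch(data: pd.Series,
                           time_window: pd.Timedelta = pd.Timedelta("5min")) -> pd.Series:
    counts = data.rolling(time_window).count()
    q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
    # Densities more than 1.5 IQR below Q1 are low-density periods.
    return (counts < q1 - 1.5 * (q3 - q1)).astype(int)

idx = pd.date_range("2023-01-01", periods=300, freq="1s").append(
    pd.date_range("2023-01-01 00:06:00", periods=5, freq="1min"))
flags = low_density_iqr_sketch(pd.Series(1.0, index=idx),
                               time_window=pd.Timedelta("1min"))
print(int(flags.iloc[-5:].sum()))  # the five sparse trailing points are flagged
```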
Using a density threshold
- indsl.data_quality.low_density_identification_threshold(data: Series, time_window: Timedelta = Timedelta('0 days 00:05:00'), cutoff: int = 10) → Series
Low density, threshold
Detect periods with a low density of points using a point-count threshold as the cut-off value.
- Parameters
data – time series
time_window – Rolling window. Length of the time period used to compute the density of points. Defaults to 5 min.
cutoff – Density cut-off. Low-density periods are detected if the number of points in the window is less than this cut-off value. Defaults to 10.
- Returns
- time series
The returned time series is an indicator function that is 1 where there is a low density period, and 0 otherwise.
- Return type
pd.Series
- Raises
UserTypeError – data is not a time series
UserValueError – data is empty
Examples:
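A direct sketch of the count-threshold rule (low_density_threshold_sketch is a hypothetical helper name):

```python
import pandas as pd

def low_density_threshold_sketch(data: pd.Series,
                                 time_window: pd.Timedelta = pd.Timedelta("5min"),
                                 cutoff: int = 10) -> pd.Series:
    # Fewer than `cutoff` points inside the rolling window means low density.
    counts = data.rolling(time_window).count()
    return (counts < cutoff).astype(int)

idx = pd.date_range("2023-01-01", periods=300, freq="1s").append(
    pd.date_range("2023-01-01 00:06:00", periods=5, freq="1min"))
flags = low_density_threshold_sketch(pd.Series(1.0, index=idx),
                                     time_window=pd.Timedelta("1min"))
print(int(flags.iloc[-5:].sum()))  # the sparse tail falls below the cut-off
```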
Rolling standard deviation of time delta
- indsl.data_quality.rolling_stddev_timedelta(data: Series, time_window: Timedelta = Timedelta('0 days 00:15:00'), min_periods: int = 1) → Series
Rolling stdev of time delta
Rolling standard deviation computed for the time deltas of the observations. The purpose of this metric is to measure the amount of variation or dispersion in the frequency of time series data points.
- Parameters
data – Time series.
time_window – Time window. Length of the time period to compute the standard deviation for. Defaults to 15 minutes. The time unit should be days, hours, minutes, or seconds. Accepted formats are described in the pandas.Timedelta documentation (https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html).
min_periods – Minimum samples. Minimum number of observations required in the given time window (otherwise, the result is set to 0). Defaults to 1.
- Returns
Time series
- Return type
pandas.Series
- Raises
UserTypeError – data is not a time series
UserValueError – data is empty
UserTypeError – time_window is not of type pandas.Timedelta
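The metric can be sketched as a rolling standard deviation over the time deltas, in seconds (rolling_stddev_timedelta_sketch is a hypothetical helper; windows with too few observations are set to 0, as described above):

```python
import pandas as pd

def rolling_stddev_timedelta_sketch(data: pd.Series,
                                    time_window: pd.Timedelta = pd.Timedelta("15min"),
                                    min_periods: int = 1) -> pd.Series:
    # Standard deviation of the time deltas (in seconds) over a rolling window;
    # under-filled windows come back NaN and are set to 0.
    deltas = data.index.to_series().diff().dt.total_seconds()
    return deltas.rolling(time_window, min_periods=min_periods).std().fillna(0)

regular = pd.Series(1.0, index=pd.date_range("2023-01-01", periods=20, freq="1min"))
irregular = pd.Series(1.0, index=pd.to_datetime(
    ["2023-01-01 00:00", "2023-01-01 00:01", "2023-01-01 00:05", "2023-01-01 00:06"]))
print(rolling_stddev_timedelta_sketch(regular).max())    # 0.0 for regular sampling
print(rolling_stddev_timedelta_sketch(irregular).max())  # > 0 for irregular sampling
```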
Accuracy
Uncertainty estimation
- indsl.data_quality.uncertainty_rstd(data: Series, resample_rate: Timedelta = Timedelta('0 days 00:30:00'), emd_sift_thresh: float = 1e-08, emd_max_num_imfs: Optional[int] = None, emd_error_tolerance: float = 0.05) → Series
Relative uncertainty
The relative uncertainty is computed as the ratio between the standard deviation of the signal noise and the mean of the true signal. The noise and true signals are estimated using the empirical mode decomposition (EMD) method. The relative uncertainty is computed on segments of the input data of size resample_rate. In mathematical notation, this means:
\[rstd = \sigma(F_t - A_t)/|\mu(A_t)|\]
where \(F_t\) is the resampled input time series, and \(A_t\) is the resampled and detrended time series obtained using empirical mode decomposition.
- Parameters
data – Time series. Input time series.
resample_rate – Resample rate. Resample rate used when estimating the relative standard deviation.
emd_sift_thresh – Sifting threshold. Threshold to stop the EMD sifting process. This threshold is based on the Cauchy convergence test and represents the residue between two consecutive oscillatory components (IMFs). A small threshold (close to zero) will result in more components being extracted. Typically, a few IMFs are enough to build the main trend. Choosing a high threshold might not affect the outcome. Defaults to 1e-8.
emd_max_num_imfs – Maximum number of components. Maximum number of EMD oscillatory components (IMFs) used to estimate the main trend. If no value (None) is given, the process continues until the sifting threshold is reached. Defaults to None.
emd_error_tolerance – Energy tolerance. Threshold for the EMD cross-energy-ratio validation used when choosing oscillatory components (IMFs). Defaults to 0.05.
- Returns
Time series
- Return type
pd.Series
Examples:
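The formula can be illustrated with a toy stand-in for the EMD step (relative_uncertainty_sketch is hypothetical: a centered rolling mean plays the role of the EMD trend \(A_t\), and the result is collapsed to a single ratio rather than the per-segment series the library returns):

```python
import numpy as np
import pandas as pd

def relative_uncertainty_sketch(data: pd.Series,
                                resample_rate: pd.Timedelta = pd.Timedelta("30min")) -> float:
    # Resample to F_t, estimate the true signal A_t with a crude rolling-mean
    # trend (a stand-in for EMD), then rstd = sigma(F_t - A_t) / |mu(A_t)|.
    f_t = data.resample(resample_rate).mean().dropna()
    a_t = f_t.rolling(5, center=True, min_periods=1).mean()
    return float((f_t - a_t).std() / abs(a_t.mean()))

rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=24 * 60, freq="1min")
noisy = pd.Series(10.0 + rng.normal(0, 0.5, len(idx)), index=idx)
print(relative_uncertainty_sketch(noisy))  # small positive ratio
```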
Validity
Extreme Outliers Removal
- indsl.data_quality.extreme(data: Series, alpha: float = 0.05, bc_relaxation: float = 0.167, poly_order: int = 3)
Extreme outliers removal
Outlier detection and removal based on the paper by Gustavo A. Zarruk. The procedure is as follows:
1. Fit a polynomial curve to the model using all of the data.
2. Calculate the studentized deleted (or externally studentized) residuals.
3. These residuals follow a t-distribution with n - p - 1 degrees of freedom.
4. Compute the Bonferroni critical value from the significance level (alpha) and the t-distribution.
5. Treat any values that fall outside of the critical value as anomalies.
Use of the hat-matrix diagonal allows the deleted residuals to be calculated rapidly without refitting the predictor function each time.
- Parameters
data – Time Series.
alpha – Significance level. A number greater than or equal to 0 and lower than 1. In statistics, the significance level is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 means that there is a 5% risk of detecting an outlier that is not a true outlier. Defaults to 0.05.
bc_relaxation – Relaxation factor for the Bonferroni critical value. Smaller values make the anomaly detection more conservative. Defaults to 1/6.
poly_order – Polynomial order. The order of the polynomial function fitted to the original time series. Defaults to 3.
- Returns
Time series without outliers.
- Return type
pandas.Series
- Raises
UserValueError – Alpha must be a number between 0 and 1
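A sketch of the procedure above (extreme_sketch is a hypothetical helper, not the library implementation; the t quantile is approximated by the normal quantile, which is reasonable for large n, and the bc_relaxation factor is omitted):

```python
import numpy as np
import pandas as pd
from statistics import NormalDist

def extreme_sketch(data: pd.Series, alpha: float = 0.05, poly_order: int = 3) -> pd.Series:
    t = (data.index - data.index[0]).total_seconds().to_numpy()
    t = (t - t.mean()) / t.std()                      # scale for a stable fit
    y = data.to_numpy()
    n, p = len(y), poly_order + 1
    X = np.vander(t, p)                               # polynomial design matrix
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    h = np.einsum("ij,ji->i", X, np.linalg.pinv(X))   # hat-matrix diagonal
    sse = resid @ resid
    # Externally studentized (deleted) residuals, t-distributed, n - p - 1 dof.
    t_stat = resid * np.sqrt((n - p - 1) / (sse * (1 - h) - resid**2))
    # Bonferroni critical value; t quantile approximated by the normal quantile.
    crit = NormalDist().inv_cdf(1 - alpha / (2 * n))
    return data[np.abs(t_stat) <= crit]

idx = pd.date_range("2023-01-01", periods=100, freq="1h")
x = np.linspace(-1, 1, 100)
rng = np.random.default_rng(1)
y = 5 * x**3 - 2 * x + rng.normal(0, 1, 100)
y[40] += 15                                           # inject one extreme outlier
clean = extreme_sketch(pd.Series(y, index=idx))
print(len(clean))
```

The injected outlier is removed while essentially all regular points survive the Bonferroni-corrected cut.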
Negative Running Hours
- indsl.data_quality.negative_running_hours_check(x: Series, threshold: float = 0.0) → Series
Negative running hours
The negative running hours model automates a data quality check for time series whose values should not decrease over time. A typical example is a running hours (or hour count) time series, which counts the number of hours a piece of equipment, such as a pump, has been running. Since the number of running hours can only stay the same (if the pump is not running) or increase (if the pump is running), a decrease in the value indicates bad data quality. Although the algorithm was originally created for running hours time series, it can be applied to any time series in which a decrease in value is a sign of bad data quality.
- Parameters
x – Time series
threshold – Threshold for value drop. How far (in hours) the time series value must drop before it is considered bad data quality. Must be a non-negative float. Defaults to 0.
- Returns
- Time series
The returned time series is an indicator function that is 1 where there is a decrease in time series value, and 0 otherwise. The indicator will be set to 1 until the data gets “back to normal” (that is, until time series reaches the value it had before the value drop).
- Return type
pandas.Series
- Raises
UserTypeError – x is not a time series
UserValueError – x is empty
UserTypeError – index of x is not a datetime
UserValueError – index of x is not increasing
UserTypeError – threshold is not a number
UserValueError – threshold is a negative number
Examples:
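The indicator can be sketched with a running maximum (negative_running_hours_sketch is a hypothetical helper; with a non-zero threshold this sketch clears the flag as soon as the series is back within the threshold of its previous maximum, a slight simplification of the behaviour described above):

```python
import pandas as pd

def negative_running_hours_sketch(x: pd.Series, threshold: float = 0.0) -> pd.Series:
    # Flag points that sit more than `threshold` hours below the running
    # maximum; the flag clears once the series climbs back up to that maximum.
    return (x.cummax() - x > threshold).astype(int)

idx = pd.date_range("2023-01-01", periods=7, freq="1h")
hours = pd.Series([100, 101, 103, 98, 101, 103, 104], index=idx)
print(negative_running_hours_sketch(hours).tolist())  # [0, 0, 0, 1, 1, 0, 0]
```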