Identifying low density periods

Detecting the density of data points in a time series is important for checking whether the expected number of data points has been received during a given time window, such as an hour or a day.
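As a minimal sketch of what "density" means here, the number of points per window can be counted with plain pandas. The series, the two-hour gap, and the one-hour window below are hypothetical values chosen for illustration; this is not how InDSL computes density internally.

```python
import pandas as pd

# Hypothetical series: one point every 5 minutes for one day (288 points)
index = pd.date_range("2022-01-01", periods=288, freq="5min")
series = pd.Series(0.0, index=index)

# Drop a two-hour block to simulate missing data
gap = (index >= pd.Timestamp("2022-01-01 10:00")) & (index < pd.Timestamp("2022-01-01 12:00"))
series_with_gap = series[~gap]

# Count the points received in each one-hour window;
# empty windows are reported with a count of 0
counts_per_hour = series_with_gap.resample(pd.Timedelta(hours=1)).count()

print(counts_per_hour[counts_per_hour < 12])
```

At a 5-minute sampling interval a complete hour holds 12 points, so any window with fewer points is a candidate low-density period.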

In this example, we apply four low-density identification methods to a time series. The methods are:

Z-scores: Marks a period with low density if the number of data points is 3 standard deviations below the mean.

Modified Z-scores: A modified version of the Z-score method, which uses the median absolute deviation instead of the standard deviation.

Interquartile range (IQR): Uses the IQR, a measure of the spread of the data, to identify low density periods.

Density threshold: Marks a period with low density if the number of data points is lower than the provided threshold.
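The four rules can be sketched on a plain array of per-window counts. This is an illustration only, not the InDSL implementation: the counts are made up, and the 0.6745 scaling factor, the 3.5 cutoff for the modified Z-score, the 1.5 multiplier for the IQR fence, and the threshold of 4 are conventional or arbitrary values assumed here.

```python
import numpy as np

# Hypothetical number of data points received in each of 21 windows;
# the last window is empty
counts = np.array([6, 5, 7, 6] * 5 + [0], dtype=float)

# Z-score: distance from the mean in units of the standard deviation
z = (counts - counts.mean()) / counts.std()
low_z = z <= -3.0

# Modified Z-score: distance from the median in units of the
# median absolute deviation (MAD); 0.6745 and 3.5 are assumed conventions
median = np.median(counts)
mad = np.median(np.abs(counts - median))
modified_z = 0.6745 * (counts - median) / mad
low_modified_z = modified_z <= -3.5

# IQR: counts below the lower fence Q1 - 1.5 * IQR (1.5 is an assumed convention)
q1, q3 = np.percentile(counts, [25, 75])
low_iqr = counts < q1 - 1.5 * (q3 - q1)

# Density threshold: fewer points than a user-supplied minimum (4 is arbitrary)
low_threshold = counts < 4

for name, flags in [
    ("z-score", low_z),
    ("modified z-score", low_modified_z),
    ("iqr", low_iqr),
    ("threshold", low_threshold),
]:
    print(name, np.flatnonzero(flags))
```

On this toy input all four rules flag only the empty window. On real data they can disagree, because the mean and standard deviation are pulled by the gaps themselves while the median-based and quartile-based statistics are not.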

In the plots below, we apply the four methods listed above to a time series ranging from 2022/01/01 to 2022/01/02 with a sampling interval of 5 minutes. In this time series, 35% of the data is removed by introducing five gaps at random locations. The plots show the different characteristics of the low density identification methods.

Low density identification using the modified Z-score method is plotted for two different time windows, one of 30 minutes and the other of 6 hours. The plot for the 30-minute time window is a straight line because the modified Z-score method measures how much an outlier differs from the typical value, as given by the median.
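One plausible way to see why a short window can produce a flat result is to look at the median absolute deviation (MAD), which the modified Z-score divides by. The sketch below uses made-up counts (31 complete 30-minute windows and 17 empty ones, roughly 35% removal) and a hypothetical helper function; it does not reproduce how InDSL handles this case.

```python
import numpy as np


def median_absolute_deviation(counts):
    """Median of the absolute deviations from the median (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    return np.median(np.abs(counts - np.median(counts)))


# Hypothetical 30-minute counts at 5-minute sampling:
# 31 complete windows (6 points each) and 17 empty windows
counts_30m = np.array([6] * 31 + [0] * 17)

# The same data aggregated into four 6-hour windows (12 half-hour windows each)
counts_6h = counts_30m.reshape(4, 12).sum(axis=1)

mad_30m = median_absolute_deviation(counts_30m)
mad_6h = median_absolute_deviation(counts_6h)

print("30-minute windows: MAD =", mad_30m)
print("6-hour windows:    MAD =", mad_6h)
```

When more than half of the windows hold the same count, the median equals that count and the MAD is zero, so the score carries no usable scale. With longer windows the counts vary more and the MAD becomes non-zero.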

Figure: Low density identification applied to a time series with five gaps and 35% removal. Panels: Z-scores (time window of 30 minutes); Modified Z-Scores (time window of 30 minutes); Modified Z-Scores (time window of 6 hours); IQR (time window of 30 minutes); Density threshold.

import matplotlib.pyplot as plt
import pandas as pd

from indsl.data_quality.low_density_identification import (
    low_density_identification_iqr,
    low_density_identification_modified_z_scores,
    low_density_identification_threshold,
    low_density_identification_z_scores,
)
from indsl.signals.generator import insert_data_gaps, line


start = pd.Timestamp("2022/01/01")
end = pd.Timestamp("2022/01/02")

# Create a time series with five gaps of random location and size
remove = 0.35
data = line(start_date=start, end_date=end, slope=0, intercept=0, sample_freq=pd.Timedelta("5m"))
ts_mult_gaps = insert_data_gaps(data=data, fraction=remove, method="Multiple", num_gaps=5)

# Apply low density identification methods to time series
ts_low_density_z_scores = low_density_identification_z_scores(ts_mult_gaps, time_window=pd.Timedelta("30m"))
ts_low_density_modified_z_scores_time_window_30m = low_density_identification_modified_z_scores(
    ts_mult_gaps, time_window=pd.Timedelta("30m")
)
ts_low_density_modified_z_scores_time_window_6h = low_density_identification_modified_z_scores(
    ts_mult_gaps, time_window=pd.Timedelta("6h"), cutoff=1
)
ts_low_density_iqr = low_density_identification_iqr(ts_mult_gaps, time_window=pd.Timedelta("30m"))
ts_low_density_w_threshold = low_density_identification_threshold(ts_mult_gaps, time_window=pd.Timedelta("60m"))

fig, (ax1, ax2, ax3, ax4, ax5) = plt.subplots(5, 1, figsize=(17, 10))
ax1.plot(ts_mult_gaps, "bo", mec="b", markerfacecolor="None", markersize=4)
ax1.plot(ts_low_density_z_scores, "r-")

ax2.plot(ts_mult_gaps, "bo", mec="b", markerfacecolor="None", markersize=4)
ax2.plot(ts_low_density_modified_z_scores_time_window_30m, "r-")

ax3.plot(ts_mult_gaps, "bo", mec="b", markerfacecolor="None", markersize=4)
ax3.plot(ts_low_density_modified_z_scores_time_window_6h, "r-")


ax4.plot(ts_mult_gaps, "bo", mec="b", markerfacecolor="None", markersize=4)
ax4.plot(ts_low_density_iqr, "r-")

ax5.plot(ts_mult_gaps, "bo", mec="b", markerfacecolor="None", markersize=4)
ax5.plot(ts_low_density_w_threshold, "r-")

ax1.set_title("Z-scores - time window of 30 minutes")
ax2.set_title("Modified Z-Scores - time window of 30 minutes")
ax3.set_title("Modified Z-Scores - time window of 6 hours")
ax4.set_title("IQR - time window of 30min")
ax5.set_title("Density threshold - time window of 60 minutes")

fig.suptitle("Low density identification applied to a time series with five gaps and 35% removal", fontsize=16)
fig.tight_layout()

plt.show()
