{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Detect out of range outliers in sensor data\n\n## Introduction\nThe :func:`out_of_range` function uses Savitzky-Golay smoothing (SG) - :func:`sg` to estimate the main trend of the data:\n\nThe algorithm carries out two iterations to determine the trend and detect extreme outliers.\n\nThe outlier detection is based on the |studentized residuals| and the |bonferroni correction|. The residuals between\nthe original data and the estimated trend are studentized and the Bonferroni Correction is used to identify potential\noutliers.\n\nNote that the `Out of Range` algorithm is designed to be used with *non-linear, non-stationary* sensor data. Therefore,\nlets start by generating a synthetic signal with those characteristics. We will use some of the signal generator\nfunctions in InDSL. We also need a few additional InDSL helper functions and algorithms to demonstrate how the outlier\ndetection works step by step.\n\n.. |studentized residuals| raw:: html\n\n studentized residuals\n\n.. |bonferroni correction| raw:: html\n\n Bonferroni Correction\n\n.. |Student's-t distribution| raw:: html\n\n Student's-t distribution\n\n.. |Numpy's| raw :: html\n\n Numpy's\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nfrom scipy.stats import t as student_dist\n\nfrom datasets.data.synthetic_industrial_data import non_linear_non_stationary_signal\nfrom indsl.data_quality.outliers import (\n _calculate_hat_diagonal,\n _calculate_residuals_and_normalize_them,\n _split_timeseries_into_time_and_value_arrays,\n out_of_range,\n)\nfrom indsl.resample import reindex\nfrom indsl.smooth import sg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Non-linear, non-stationary synthetic signal\nWe'll use a pre-defined test data set with a non-linear, non-stationary time series with \"`industrial`\"\ncharacteristics. This data set is a time series composed of 3 oscillatory components, 2 nonlinear trends, sensor\nlinear drift (small decrease over time) and white noise. The signal has non-uniform\ntime stamps, a fraction (35%) of the data is randomly removed to generate data gaps. The data gaps and white\nnoise are inserted with a constant seed to have a reproducible behavior of the algorithm. 
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "seed_array = [10, 1975, 2000, 6000, 1, 89756]\nseed = seed_array[4]\ndata = non_linear_non_stationary_signal(seed=seed)\n\n\n# A simple function to style and annotate the figures.\ndef style_and_annotate_figure(\n    ax,\n    text,\n    x_pos=0.50,\n    y_pos=0.9,\n    fsize=12,\n    fsize_annotation=18,\n    title_fsize=14,\n    ylimits=[-3005, 8000],\n    title_txt=None,\n):\n    ax.text(\n        x_pos, y_pos, text, transform=ax.transAxes, ha=\"center\", va=\"center\", fontsize=fsize_annotation, color=\"dimgrey\"\n    )\n    ax.legend(fontsize=fsize, loc=4)\n    ax.tick_params(axis=\"both\", which=\"major\", labelsize=fsize)\n    ax.tick_params(axis=\"both\", which=\"minor\", labelsize=fsize)\n    ax.tick_params(axis=\"x\", rotation=45)\n    ax.set_title(title_txt, fontsize=title_fsize)\n    ax.grid()\n    ax.set_ylim(ylimits)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Insert extreme outliers\nThis algorithm was tested with outliers generated at different locations. Five percent (5%) of the data points at\nrandom locations were replaced by outliers. To do so we used |Numpy's| random generator with different seeds.\nFeel free to use one of the six seeds in the `seed_array` defined earlier. The algorithm achieves 100% precision for the conditions and\nparameters used in this example. Or use another seed to generate a different signal and further test the\nlimits of the algorithm.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = data.dropna()\n\nrng = np.random.default_rng(seed)\noutlier_fraction = 0.05  # Fraction of the signal that will be replaced by outliers\nnum_outliers = round(len(data) * outlier_fraction)\nlocations = np.unique(rng.integers(low=0, high=len(data), size=num_outliers))\ndirection = rng.choice([1, -1], size=len(locations))\noutliers = data.iloc[locations] + data.mean() * rng.uniform(0.5, 1, len(locations)) * direction\n\ndata_w_outliers = data.copy()\ndata_w_outliers.iloc[locations] = outliers.to_numpy()  # Assign the outlier values by position" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Initial conditions: test data set\nThe figure below shows the original data set and the outliers inserted. We took care to give the outliers random\nvalues, some close to and some far from the main trend, but always far enough to be categorized as an extreme deviation from\nthe expected behavior of the data.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig_size = (9, 7)\nfig, ax = plt.subplots(figsize=fig_size)\nax.plot(data, linewidth=2, label=\"Original data set\")\nax.plot(outliers, \"ro\", markersize=3, label=\"Outliers inserted\")\noutliers_inserted = (\n    f\"{len(outliers)} outliers inserted ({round(100 * len(outliers) / len(data), 1)}% of the data points)\"\n)\nstyle_and_annotate_figure(ax, text=outliers_inserted, title_txt=\"Test Data Set\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Initial iteration\n\n### 1. Trend estimate\nAs mentioned before, we will use an SG smoother to estimate the trend. To demonstrate how the initial pass works, we'll run the SG smoother on its own. The SG smoother\nrequires a point-wise window length and a polynomial order. The bigger the window, the more data\nis used to estimate the local trend. The polynomial order controls how closely the fit follows the\nnon-linear characteristics of the data (1 gives a linear fit, higher orders give an increasingly non-linear fit). In this case we will use a window\nof 20 data points and a 3rd order polynomial fit.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Estimate the trend using the Savitzky-Golay smoother\ntolerance = 0.08\ntrend_pass01 = sg(data_w_outliers, window_length=20, polyorder=3)" ] },
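{ "cell_type": "markdown", "metadata": {}, "source": [ "The short comparison below is illustrative only and is not part of the original walkthrough. It reuses :func:`sg` with a couple of alternative, hypothetical parameter choices (a wider window and a lower polynomial order) to show that a stiffer trend follows the data less closely, which appears as a larger spread of the raw residuals around the trend.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative only: alternative (hypothetical) Savitzky-Golay parameter choices.\ntrend_stiff = sg(data_w_outliers, window_length=51, polyorder=3)  # wider window -> smoother, stiffer trend\ntrend_low_order = sg(data_w_outliers, window_length=20, polyorder=1)  # lower order -> closer to a local linear fit\n\n# A stiffer trend follows the data less closely, so the raw residuals spread out more.\nfor label, trend in [\n    (\"window=20, polyorder=3\", trend_pass01),\n    (\"window=51, polyorder=3\", trend_stiff),\n    (\"window=20, polyorder=1\", trend_low_order),\n]:\n    print(f\"{label}: residual standard deviation = {float((data_w_outliers - trend).std()):.1f}\")" ] },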
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Studentized residuals and Bonferroni correction\nIdentifying potential outliers is done by comparing how much each data point deviates from the estimated\nmain trend (i.e. the residuals). Since in most cases little information about the data is readily\navailable and extreme outliers are expected to be sparse and uncommon, the |Student's-t distribution| is well\nsuited for the current task, where the sample size is small and the standard deviation is\nunknown. To demonstrate how the residuals are studentized, we use a helper function from InDSL;\nthese steps are already integrated into the :func:`out_of_range` function.\nFinally, since we aim to identify extreme outliers, a simple t-test does not suffice, hence the Bonferroni Correction.\nFurthermore, we apply a relaxation factor to the Bonferroni critical value to adjust the sensitivity\nof the correction. Again, the Bonferroni Correction is calculated explicitly here, but it is integrated into the\n:func:`out_of_range` function.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Statistical parameters\nalpha = 0.05  # Significance level, i.e. the probability of rejecting the null hypothesis when it is true\nbc_relaxation = 1 / 4  # Bonferroni relaxation coefficient\n\nx, y = _split_timeseries_into_time_and_value_arrays(data_w_outliers)\ny_pred_pass01 = trend_pass01.to_numpy()\nhat_diagonal = _calculate_hat_diagonal(x)\n\n# Calculate degrees of freedom (n - p - 1)\nn = len(y)\ndof = n - 3  # Using p = 2 for a model based on a single time series\n\n# Calculate the Bonferroni critical value and the studentized residuals\nbc = student_dist.ppf(1 - alpha / (2 * n), df=dof) * bc_relaxation\nt_res = _calculate_residuals_and_normalize_them(dof, hat_diagonal, y, y_pred_pass01)\n\n# Boolean mask: True where the studentized residual is within the critical value (i.e. not an outlier)\nmask = np.logical_and(t_res < bc, t_res > -bc)\nfiltered_ts_pass01 = pd.Series(y[mask], index=data.index[mask])  # Time series with the detected outliers removed\n\noutliers_pass01 = pd.Series(y[~mask], index=data.index[~mask])" ] },
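{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, the quantities computed above can be written out explicitly. The exact scale estimate used inside the InDSL helper is an implementation detail, so the expressions below are only the standard form of a studentized residual and of a Bonferroni-corrected cutoff, with symbols chosen for this sketch. With trend estimate $\\hat{y}_i$, hat-matrix diagonal $h_{ii}$, and residual scale estimate $\\hat{\\sigma}$, the studentized residual of point $i$ is\n\n$t_i = \\dfrac{y_i - \\hat{y}_i}{\\hat{\\sigma}\\sqrt{1 - h_{ii}}}$\n\nand a point is kept (not flagged as an outlier) when\n\n$|t_i| < c\\, t_{1 - \\alpha/(2n),\\, n - p - 1}$\n\nwhere $t_{q,\\,\\nu}$ is the quantile function of the Student's-t distribution with $\\nu$ degrees of freedom, $n$ is the number of samples, $p = 2$ the number of model parameters, $\\alpha$ the significance level, and $c$ the relaxation factor (here $c = 1/4$), matching the call to `student_dist.ppf` above.\n\n" ] },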
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Outliers detected with the initial pass\nThe figure below shows the results of the initial pass. The SG method does a good job of estimating the trend, except\nfor a few periods in the data where a larger number of outliers are grouped together. This causes strongly non-linear\nbehavior in the estimated trend, and as a consequence some data points are misidentified as outliers. But overall,\nit is a good enough baseline.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig2, ax2 = plt.subplots(figsize=fig_size)\nax2.plot(data_w_outliers, \"--\", color=\"orange\", label=\"Data with outliers\")\nax2.plot(trend_pass01, \"k\", linewidth=2, label=\"Savitzky-Golay trend\")\nax2.plot(outliers_pass01, \"wo\", markersize=7, alpha=0.85, mew=2, markeredgecolor=\"green\", label=\"Outliers detected\")\nax2.plot(outliers, \"ro\", markersize=3, label=\"Outliers inserted\")\n\ntext_outlier_res = (\n    f\"{len(outliers_pass01)} out of {len(outliers)} outliers detected \"\n    f\"({round(100 * len(outliers_pass01) / len(outliers), 1)}%)\"\n)\nstyle_and_annotate_figure(ax2, text=text_outlier_res, title_txt=\"First Iteration: Savitzky-Golay trend\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Last iteration\nFor the last iteration, the previously detected outliers are removed from the time series and the SG method is used again to estimate the main trend.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tolerance_pass02 = 0.01\ntrend_pass02 = sg(filtered_ts_pass01, window_length=20, polyorder=3)\n\n# Filtering parameters\nalpha_pass02 = 0.05\nbc_relaxation_pass02 = 1 / 2\nbc_pass02 = student_dist.ppf(1 - alpha_pass02 / (2 * n), df=dof) * bc_relaxation_pass02\n\ny_pred_pass02 = reindex(trend_pass02, data_w_outliers)\ny_pred_pass02 = y_pred_pass02.to_numpy()\nt_res_pass02 = _calculate_residuals_and_normalize_them(dof, hat_diagonal, y, y_pred_pass02)\n\n# Boolean mask: True where the studentized residual is within the critical value (i.e. not an outlier)\nmask_pass02 = np.logical_and(t_res_pass02 < bc_pass02, t_res_pass02 > -bc_pass02)\nfiltered_ts_pass02 = pd.Series(y[mask_pass02], index=data.index[mask_pass02])  # Time series with the detected outliers removed\n\n# Outliers detected in the second pass\noutliers_pass02 = pd.Series(y[~mask_pass02], index=data.index[~mask_pass02])\n\n# Run the InDSL function that carries out the entire analysis with the same parameters\nindsl_outliers = out_of_range(data_w_outliers)" ] },
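{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick, illustrative sanity check (not part of the original example), the cell below compares the timestamps of the outliers detected by :func:`out_of_range` with the ones inserted earlier and reports precision and recall. The set-based matching and the `precision`/`recall` names are an assumption chosen for this sketch; the annotation in the final figure reports the detection percentage in a similar way.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative sanity check: match detected and inserted outliers by timestamp\ninserted_idx = set(outliers.index)\ndetected_idx = set(indsl_outliers.index)\ntrue_positives = len(inserted_idx & detected_idx)\n\nprecision = true_positives / len(detected_idx) if detected_idx else float(\"nan\")\nrecall = true_positives / len(inserted_idx)\nprint(f\"precision: {precision:.1%}, recall: {recall:.1%}\")" ] },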
This \"perfect\" performance will not always be the case but this function provides a very\nrobust option for detecting and removing out of range or extreme outliers in *non-linear,\nnon-stationary sensor data*.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# sphinx_gallery_thumbnail_number = 3\nfig3, ax3 = plt.subplots(figsize=fig_size)\n\nax3.plot(data_w_outliers, \"--\", color=\"orange\", label=\"Data with outliers\")\nax3.plot(trend_pass02, \"k\", linewidth=2, label=\"Savitzky-Golay trend\")\nax3.plot(\n indsl_outliers,\n \"wo\",\n markersize=7,\n alpha=0.85,\n mew=2,\n markeredgecolor=\"green\",\n label=\"Outliers detected\",\n)\nax3.plot(outliers, \"ro\", markersize=3, label=\"Outliers inserted\")\n\ntext_outlier_res = (\n f\"{len(indsl_outliers)} out of {len(outliers)} outliers detected \"\n f\"({round(100 * len(indsl_outliers) / len(outliers), 1)}%)\"\n)\n\nstyle_and_annotate_figure(ax3, text=text_outlier_res, title_txt=\"Final Iteration: EMD-HHT trend\")\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.16" } }, "nbformat": 4, "nbformat_minor": 0 }