Merge pull request #82 from zillow/feature/window_density_model_improvements

Feature/window density model improvements
sayanchk authored Feb 23, 2021
2 parents 3bcc374 + ee24284 commit f59f4b3
Showing 5 changed files with 216 additions and 125 deletions.
296 changes: 191 additions & 105 deletions docs/tutorial/streaming.rst
@@ -6,124 +6,210 @@ Luminaire *WindowDensityModel* implements the idea of monitoring data over compa…
.. image:: windows.png
:scale: 40%

Although *WindowDensityModel* is designed to track anomalies over streaming data, it can be used to track any sustained fluctuations over a window, even for low-frequency time series. This detection type is suggested for up to hourly data frequency.

Anomaly Detection: Pre-Configured Settings
------------------------------------------
This window-based anomaly detection feature in Luminaire operates fully automatically: the underlying model detects the frequency at which the data has been observed, the optimal size of the window (using the periodic signals in the data), and the optimal detection method, given identified characteristics of the input time series. Moreover, the user also has the ability to override the configuration for custom use cases.

Luminaire provides the capability to configure model parameters based on the frequency at which the data has been observed and the methods that can be applied (please refer to the Window Density Model user guide for detailed configuration options). Luminaire settings for the window density model are already pre-configured for some typical pandas frequency types; settings for any other frequency types should be configured manually (see the API reference for `Streaming Anomaly Detection Models <https://zillow.github.io/luminaire/api_reference/streaming.html>`_).
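
For instance, the pre-configured values attached to a supported frequency can be inspected directly from the hyperparameter object. A minimal sketch (the exact keys and default values depend on the installed Luminaire version):

>>> from luminaire.model.window_density import WindowDensityHyperParams
>>> # Instantiating with a supported pandas frequency exposes the pre-configured settings.
>>> print(WindowDensityHyperParams(freq='H').params)
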
Fully Automated Anomaly Detection using Time-windows
----------------------------------------------------

Luminaire provides a fully automated anomaly detection method that tracks time series abnormalities over time-windows. Luminaire is capable of selecting the best possible setting by studying different characteristics of the input time series. Unlike the Luminaire outlier detection module, the window-based anomaly detection does not require running any separate configuration optimization to obtain the best hyperparameters; rather, the automation process is embedded within the data exploration and the training process.

Similar to the outlier detection module, the Luminaire Window Density Model comes with a streaming data profiling module to extract different characteristics of the high-frequency time series.

>>> from luminaire.model.window_density import WindowDensityHyperParams, WindowDensityModel
>>> from luminaire.exploration.data_exploration import DataExploration
>>> print(data)
raw interpolated
index
2020-05-25 00:00:00 10585.0 10585.0
2020-05-25 00:01:00 10996.0 10996.0
2020-05-25 00:02:00 10466.0 10466.0
2020-05-25 00:03:00 10064.0 10064.0
2020-05-25 00:04:00 10221.0 10221.0
... ... ...
2020-06-16 23:55:00 11356.0 11356.0
2020-06-16 23:56:00 10852.0 10852.0
2020-06-16 23:57:00 11114.0 11114.0
2020-06-16 23:58:00 10663.0 10663.0
2020-06-16 23:59:00 11034.0 11034.0

>>> hyper_params = WindowDensityHyperParams(freq='T').params
>>> wdm_obj = WindowDensityModel(hyper_params=hyper_params)
>>> success, model = wdm_obj.train(data=data)
>>> print(success, model)
(True, <luminaire_models.model.window_density.WindowDensityModel object at 0x7f8cda42dcc0>)

The model object contains the data density structure over a pre-specified window, given the frequency. Luminaire sets the following defaults for some typical pandas frequencies (any custom requirements can be updated in the hyperparameter object instance, as sketched after the list below):

- 'S': Hourly windows
- 'T': 24-hour windows
- '15T': 24-hour windows
- 'H': 24-hour windows
- 'D': 4-week windows
- 'custom': User-specified windows
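
For example, a minimal sketch of overriding these defaults for a custom frequency (the values below are hypothetical; the parameter names follow the manual-configuration example later on this page):

>>> # Hypothetical override: one-day windows over 5-minute observations.
>>> custom_params = WindowDensityHyperParams(freq='custom',
                                             window_length=12*24,
                                             min_window_length=12,
                                             max_window_length=12*24*28,
                                             ).params
>>> custom_model = WindowDensityModel(hyper_params=custom_params)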

In order to score a new window innovation given the trained model object, we have to provide an equal-sized window that represents a similar time interval. For example, if each of the windows in the training data represents a 24-hour window from 9 AM to 8:59:59 AM (next day) for the last few days, the scoring data should represent the same interval of a different day and should have the same window size.
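
As an illustration, a minimal pandas sketch of carving out such an aligned scoring window, assuming a hypothetical minute-frequency DataFrame ``latest_data`` that covers the scoring day:

>>> # Select the 9 AM to 8:59 AM (next day) window of a different day; the dates are hypothetical.
>>> scoring_window = latest_data.loc['2020-06-17 09:00:00':'2020-06-18 08:59:00']
>>> len(scoring_window) == 24 * 60   # same number of observations as each training window
True
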
>>> print(data)
raw
index
2020-06-04 00:00:00 227798
2020-06-04 00:10:00 224593
2020-06-04 00:20:00 229400
2020-06-04 00:30:00 217813
2020-06-04 00:40:00 217862
... ...
2020-07-02 23:20:00 221226
2020-07-02 23:30:00 218762
2020-07-02 23:40:00 225726
2020-07-02 23:50:00 220783
2020-07-03 00:00:00 260981

>>> config = WindowDensityHyperParams().params
>>> de_obj = DataExploration(**config)
>>> data, pre_prc = de_obj.stream_profile(df=data)
>>> print(data, pre_prc)
raw interpolated
2020-06-04 00:10:00 224593 224593.0
2020-06-04 00:20:00 229400 229400.0
2020-06-04 00:30:00 217813 217813.0
2020-06-04 00:40:00 217862 217862.0
2020-06-04 00:50:00 226861 226861.0
... ... ...
2020-07-02 23:20:00 221226 221226.0
2020-07-02 23:30:00 218762 218762.0
2020-07-02 23:40:00 225726 225726.0
2020-07-02 23:50:00 220783 220783.0
2020-07-03 00:00:00 260981 260981.0
[4176 rows x 2 columns]
{'success': True, 'freq': '0 days 00:10:00', 'window_length': 144, 'min_window_length': 10, 'max_window_length': 100000}

Luminaire *stream_profile* performs missing data imputation if necessary, extracts the frequency information and obtains the optimal size of the window to be monitored (if not specified by the user). All the information obtained by the profiler can be used to update the configuration for the actual training process.

>>> config.update(pre_prc)
>>> wdm_obj = WindowDensityModel(hyper_params=config)
>>> success, training_end, model = wdm_obj.train(data=data)
>>> print(success, training_end, model)
True 2020-07-03 00:00:00 <luminaire.model.window_density.WindowDensityModel object at 0x7fb6fab80b00>

The training process generates the success flag, the model timestamp and the actual trained model. The trained model here is a collection of several sub-models that can be used to score any equal-length time segment of the day and does not depend on the specific patterns of the selected time window.
In order to score a new window innovation given the trained model object, we have to provide an equal-sized time window. Moreover, Luminaire allows the user to perform basic processing (imputing a missing index, etc.) of the scoring window in order to get the data ready for scoring.

.. image:: window_train_score_auto.png
:scale: 45%

>>> scoring_data
raw interpolated
index
2020-06-17 00:00:00 11021.0 11021.0
2020-06-17 00:01:00 10931.0 10931.0
2020-06-17 00:02:00 10637.0 10637.0
2020-06-17 00:03:00 10845.0 10845.0
2020-06-17 00:04:00 10163.0 10163.0
... ... ...
2020-06-17 23:55:00 9680.0 9680.0
2020-06-17 23:56:00 9985.0 9985.0
2020-06-17 23:57:00 9363.0 9363.0
2020-06-17 23:58:00 9686.0 9686.0
2020-06-17 23:59:00 9220.0 9220.0

>>> scores = model.score(scoring_data)
>>> print(scores)
{'Success': True, 'ConfLevel': 99.9, 'IsAnomaly': False, 'AnomalyProbability': 0.6956745734841678}

Anomaly Detection: Manual Configuration
---------------------------------------

There are several options in the *WindowDensityHyperParams* class that can be manually configured. The configuration should be selected mostly based on the frequency at which the data has been observed.

>>> print(scoring_data)
raw
index
2020-07-03 00:00:00 260981
2020-07-03 00:10:00 274249
2020-07-03 00:20:00 293194
2020-07-03 00:30:00 272722
2020-07-03 00:40:00 276930
... ...
2020-07-03 23:10:00 287773
2020-07-03 23:20:00 255438
2020-07-03 23:30:00 277127
2020-07-03 23:40:00 266263
2020-07-03 23:50:00 275432
>>> freq = model._params['freq']
>>> de_obj = DataExploration(freq=freq)
>>> processed_data, pre_prc = de_obj.stream_profile(df=scoring_data, impute_only=True, impute_zero=True)

The processed data can then be scored as follows:

>>> score, scored_window = model.score(processed_data)
>>> print(score)
{'Success': True, 'ConfLevel': 99.9, 'IsAnomaly': True, 'AnomalyProbability': 1.0}

The user can also score rolling (or overlapping) windows instead of sequential windows for more frequent anomaly detection use cases.

>>> print(scoring_data)
raw
index
2020-07-02 12:10:00 203836
2020-07-02 12:20:00 209813
2020-07-02 12:30:00 206271
2020-07-02 12:40:00 209135
2020-07-02 12:50:00 207085
... ...
2020-07-03 11:20:00 255009
2020-07-03 11:30:00 260246
2020-07-03 11:40:00 248541
2020-07-03 11:50:00 246094
2020-07-03 12:00:00 252223
>>> freq = model._params['freq']
>>> de_obj = DataExploration(freq=freq)
>>> processed_data, pre_prc = de_obj.stream_profile(df=scoring_data, impute_only=True, impute_zero=True)
>>> score, scored_window = model.score(processed_data)
>>> print(score)
{'Success': True, 'ConfLevel': 99.9, 'IsAnomaly': True, 'AnomalyProbability': 0.9999867236}

Reusing Past Trained Model
^^^^^^^^^^^^^^^^^^^^^^^^^^

The Luminaire Window Density model also comes with the capability of ingesting a previously trained model into future model trainings. This can be part of a sequential process that always passes the last trained model to the next training. This ensures richer data accumulation and more reliable scores, especially when the training history is limited to a fixed-length rolling window. This way, the model is able to keep a larger history as metadata even though the actual training history is limited.

>>> # past_model is the WindowDensityModel object returned by a previous training run
>>> print(past_model)
<luminaire.model.window_density.WindowDensityModel object at 0x7fb6fab80b00>
>>> print(new_training_data)
raw
index
2020-06-04 00:00:00 227798
2020-06-04 00:10:00 224593
2020-06-04 00:20:00 229400
2020-06-04 00:30:00 217813
2020-06-04 00:40:00 217862
... ...
2020-07-03 23:10:00 287773
2020-07-03 23:20:00 255438
2020-07-03 23:30:00 277127
2020-07-03 23:40:00 266263
2020-07-03 23:50:00 275432
>>> success, training_end, model = wdm_obj.train(data=new_training_data, past_model=past_model)
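
A minimal sketch of such a sequential process, assuming a hypothetical ``daily_training_windows`` iterable of fixed-length training DataFrames and the ``config`` dictionary built earlier:

>>> # Carry the previously trained model forward into every new training run.
>>> past_model = None
>>> for window_data in daily_training_windows:
        wdm_obj = WindowDensityModel(hyper_params=config)
        if past_model:
            success, training_end, past_model = wdm_obj.train(data=window_data, past_model=past_model)
        else:
            success, training_end, past_model = wdm_obj.train(data=window_data)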

Anomaly Detection using Time-windows: Manual Configuration
----------------------------------------------------------

There are several options in the *WindowDensityHyperParams* class that can be manually configured. The user can select among different options: the desired window size, whether all previous windows or only the last window should be used to identify anomalies, the detection method, how to manage nonstationarity and periodicity present in the data, and so on. Please refer to the API reference for `Streaming Anomaly Detection Models <https://zillow.github.io/luminaire/api_reference/streaming.html>`_.

>>> from luminaire.model.window_density import WindowDensityHyperParams, WindowDensityModel
>>> print(data)
raw interpolated
index
2020-05-20 00:03:00 6393.451190 6393.451190
2020-05-20 00:13:00 6491.426190 6491.426190
2020-05-20 00:23:00 6770.469444 6770.469444
2020-05-20 00:33:00 6490.798810 6490.798810
2020-05-20 00:43:00 6273.786508 6273.786508
... ... ...
2020-06-09 23:13:00 5619.341270 5619.341270
2020-06-09 23:23:00 5573.001190 5573.001190
2020-06-09 23:33:00 5745.400000 5745.400000
2020-06-09 23:43:00 5761.355556 5761.355556
2020-06-09 23:53:00 5558.577778 5558.577778
>>> hyper_params = WindowDensityHyperParams(freq='custom',
detection_method='kldiv',
baseline_type="last_window",
min_window_length=6*12,
max_window_length=6*24*84,
window_length=6*24,
ma_window_length=24,
).params
>>> wdm_obj = WindowDensityModel(hyper_params=hyper_params)
>>> success, model = wdm_obj.train(data=data)
>>> print(success, model)
(True, <luminaire_models.model.window_density.WindowDensityModel object at 0x7f8d5f1a6940>)

The trained model object can be used to score data representing the same interval from a different day and having the same window size.
>>> print(data)
raw
index
2020-06-04 00:00:00 227798
2020-06-04 00:10:00 224593
2020-06-04 00:20:00 229400
2020-06-04 00:30:00 217813
2020-06-04 00:40:00 217862
... ...
2020-07-02 23:20:00 221226
2020-07-02 23:30:00 218762
2020-07-02 23:40:00 225726
2020-07-02 23:50:00 220783
2020-07-03 00:00:00 218315
>>> config = WindowDensityHyperParams(freq='10T',
detection_method='kldiv',
baseline_type="last_window",
window_length=6*6,
detrend_method='modeling'
).params
>>> de_obj = DataExploration(**config)
>>> data, pre_prc = de_obj.stream_profile(df=data)
>>> print(data, pre_prc)
raw interpolated
2020-06-05 00:10:00 227504 227504.0
2020-06-05 00:20:00 225664 225664.0
2020-06-05 00:30:00 227586 227586.0
2020-06-05 00:40:00 223805 223805.0
2020-06-05 00:50:00 222679 222679.0
... ... ...
2020-07-02 23:20:00 221226 221226.0
2020-07-02 23:30:00 218762 218762.0
2020-07-02 23:40:00 225726 225726.0
2020-07-02 23:50:00 220783 220783.0
2020-07-03 00:00:00 218315 218315.0
[4032 rows x 2 columns]
{'success': True, 'freq': '10T', 'window_length': 36, 'min_window_length': 10, 'max_window_length': 100000}
>>> config.update(pre_prc)
>>> wdm_obj = WindowDensityModel(hyper_params=config)
>>> success, training_end, model = wdm_obj.train(data=data)
>>> print(success, training_end, model)
True 2020-07-03 00:00:00 <luminaire.model.window_density.WindowDensityModel object at 0x7ff33ef74550>

The trained model object can be used to score the data of a similar window size.

.. image:: window_train_score_manual.png
:scale: 45%

>>> scoring_data
raw interpolated
index
2020-06-10 00:00:00 5532.556746 5532.556746
2020-06-10 00:10:00 5640.711905 5640.711905
2020-06-10 00:20:00 5880.368254 5880.368254
2020-06-10 00:30:00 5842.397222 5842.397222
2020-06-10 00:40:00 5827.231746 5827.231746
... ... ...
2020-06-10 23:10:00 7210.905952 7210.905952
2020-06-10 23:20:00 5739.459524 5739.459524
2020-06-10 23:30:00 5590.413889 5590.413889
2020-06-10 23:40:00 5608.291270 5608.291270
2020-06-10 23:50:00 5753.794444 5753.794444
>>> scores = model.score(scoring_data)
>>> print(scores)
{'Success': True, 'ConfLevel': 99.9, 'IsAnomaly': True, 'AnomalyProbability': 0.9999999851834622}

>>> print(data)
raw
index
2020-07-03 06:10:00 222985
2020-07-03 06:20:00 210951
2020-07-03 06:30:00 210094
2020-07-03 06:40:00 215166
2020-07-03 06:50:00 212968
... ...
2020-07-03 11:20:00 209008
2020-07-03 11:30:00 211170
2020-07-03 11:40:00 203302
2020-07-03 11:50:00 204498
2020-07-03 12:00:00 203234
>>> freq = model._params['freq']
>>> de_obj = DataExploration(freq=freq)
>>> processed_data, pre_prc = de_obj.stream_profile(df=data, impute_only=True, impute_zero=True)
>>> score, scored_window = model.score(processed_data)
>>> print(score)
{'Success': True, 'ConfLevel': 99.9, 'IsAnomaly': False, 'AnomalyProbability': 0.330817121756509}
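
As a usage note, the returned score payload can be consumed directly; a minimal sketch that flags a window only above a stricter, hypothetical probability threshold:

>>> # Combine the success flag with a custom threshold instead of the default decision.
>>> if score['Success'] and score['AnomalyProbability'] > 0.999:
        print('sustained anomaly detected in the scored window')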



Binary file modified docs/tutorial/window_train_score_auto.png
Binary file modified docs/tutorial/window_train_score_manual.png
