AnoScout

A System for Visual Recommendations of Anomalies in Time-Oriented Data

Master's Thesis by Julian Rakuschek
Supervised by Prof. Dr. Tobias Schreck

Three Easy Pieces

Introduction
AnoScout
Behind the scenes

Introduction

This is a Time Series

This is an Anomaly (Outlier)

This is an Anomaly (Shapelet)

Some anomalies mess with the seasonality ...

... while others cause a drift

Easy to see visually, but how does a computer see it?

We use anomaly detection algorithms!

Desired Output: Scoring

Distance-Based Methods

Forecasting-Based Methods

Anomaly Score is the deviation from the forecast.
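
As a minimal sketch of this scoring idea (assuming a forecast vector aligned with the observations; AnoScout's exact scoring may normalize or smooth this further):

```python
import numpy as np

def forecast_deviation(observed: np.ndarray, forecast: np.ndarray) -> np.ndarray:
    # Per time step, the anomaly score is the absolute deviation
    # of the observation from the forecast.
    return np.abs(observed - forecast)
```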

AnoScout offers seven algorithms

AutoEncoder
Discrete Wavelet Transform
Subsequence Isolation Forest
Subsequence Local Outlier Factor
LSTM Forecasting
Random Black Forest
ARIMA
AutoEncoder

Image Source: https://towardsdatascience.com/autoencoders-and-the-denoising-feature-from-theory-to-practice-db7f7ad8fc78
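
A minimal sketch of reconstruction-based scoring with a dense autoencoder. Keras is assumed here, and the window and layer sizes are illustrative rather than AnoScout's actual configuration.

```python
import numpy as np
import tensorflow as tf

def autoencoder_scores(series: np.ndarray, window: int = 32) -> np.ndarray:
    # Slice the series into overlapping windows.
    X = np.lib.stride_tricks.sliding_window_view(series, window)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window,)),
        tf.keras.layers.Dense(16, activation="relu"),  # encoder
        tf.keras.layers.Dense(8, activation="relu"),   # bottleneck
        tf.keras.layers.Dense(16, activation="relu"),  # decoder
        tf.keras.layers.Dense(window),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, X, epochs=10, batch_size=64, verbose=0)
    # Anomaly score per window: reconstruction error.
    return np.mean((model.predict(X, verbose=0) - X) ** 2, axis=1)
```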

Discrete Wavelet Transform I
Discrete Wavelet Transform II
  1. Store sliding windows for each layer $l$ in matrices $\mathbf{D}^{(l)}$ and $\mathbf{C}^{(l)}$.
  2. Estimate a Gaussian distribution on the matrices through MLE.
  3. Create a binary vector $\mathbf{a}$ that labels each row as either anomalous (1) or normal (0).
  4. At each level, the number of coefficients is halved, thus creating multiple binary trees across the time series.
  5. If an anomaly is detected in one of the nodes, it is marked as an event $e$, which gets pushed down to the leaves of the tree.
  6. The event count in the leaves is the final anomaly score (see the sketch below).
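
A rough sketch of the per-level Gaussian scoring and event push-down, using PyWavelets. The wavelet, window size, and quantile threshold are illustrative, and the binary-tree bookkeeping is simplified to an index mapping back to the original resolution.

```python
import numpy as np
import pywt

def dwt_scores(series: np.ndarray, levels: int = 4, window: int = 4,
               q: float = 0.99) -> np.ndarray:
    """Event count per time step, accumulated over all detail levels."""
    n, events = len(series), np.zeros(len(series))
    coeffs = pywt.wavedec(series, "haar", level=levels)
    for i, c in enumerate(coeffs[1:]):           # detail coefficients D^(l)
        level = levels - i                       # coarsest level comes first
        if len(c) <= window:
            continue
        W = np.lib.stride_tricks.sliding_window_view(c, window)
        mu, cov = W.mean(axis=0), np.cov(W, rowvar=False)
        cov += 1e-6 * np.eye(window)             # regularize the MLE fit
        diff = W - mu                            # Mahalanobis distance per row
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        flags = d2 > np.quantile(d2, q)          # binary vector a
        scale = 2 ** level                       # support of one coefficient
        for w in np.flatnonzero(flags):          # push events to the leaves
            events[w * scale : min(n, (w + window) * scale)] += 1
    return events                                # event count = anomaly score
```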
Discrete Wavelet Transform III
Subsequence Isolation Forest I

Partition the dataset using a tree: anomalies are easy to isolate and therefore end up closer to the root node.

Isolation tree construction:

  1. Randomly select an attribute $q$ in the dataset.
  2. Randomly select a split value $p$ from the range of the selected attribute and split the dataset into two partitions based on that value.
  3. Two new nodes are attached to the current node; the left node contains all values smaller than $p$, while the right node contains all values greater than or equal to $p$.
  4. These steps are recursively repeated until either the dataset is depleted or the maximum tree height specified by the user has been reached (see the sketch after this list).
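
A minimal sketch of this construction in plain Python (the dict-based tree and default maximum height are illustrative; in the subsequence variant, each data point is a sliding window of the series):

```python
import random

def build_itree(X, height=0, max_height=8):
    """X: list of equal-length feature vectors; returns a nested dict tree."""
    if height >= max_height or len(X) <= 1:
        return {"size": len(X)}                      # external (leaf) node
    q = random.randrange(len(X[0]))                  # 1. random attribute
    lo, hi = min(x[q] for x in X), max(x[q] for x in X)
    if lo == hi:                                     # attribute is constant
        return {"size": len(X)}
    p = random.uniform(lo, hi)                       # 2. random split value
    return {"q": q, "p": p,                          # 3. attach two nodes
            "left":  build_itree([x for x in X if x[q] < p],  height + 1, max_height),
            "right": build_itree([x for x in X if x[q] >= p], height + 1, max_height)}
```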
Subsequence Isolation Forest II
Subsequence Isolation Forest III

Assign anomaly scores to each point $x$

  1. Compute the path length $h(x)$ of point $x$ in each isolation tree; its average over the trees gives $E(h(x))$.
  2. Estimate the average path length of an unsuccessful search in a binary tree of $n$ points: \[ c(n) = 2H(n-1) - \frac{2(n-1)}{n} \] where $H(i)$ denotes the $i$-th harmonic number.
  3. Anomaly score $s$ for a point $x$ (sketched below): \[ s(x, n) = 2^{-\frac{E(h(x))}{c(n)}} \]
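
Continuing the sketch above (same hypothetical tree layout), the scoring step follows these formulas directly, approximating the harmonic number $H(i)$ as $\ln(i)$ plus the Euler-Mascheroni constant:

```python
import math

def H(i: int) -> float:               # harmonic number, H(i) ≈ ln(i) + γ
    return math.log(i) + 0.5772156649

def c(n: int) -> float:               # avg. path length of unsuccessful search
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2 * H(n - 1) - 2 * (n - 1) / n

def path_length(x, node, depth=0):
    if "size" in node:                # leaf: add c(size) for the unbuilt subtree
        return depth + c(node["size"])
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, depth + 1)

def anomaly_score(x, trees, n):
    e_h = sum(path_length(x, t) for t in trees) / len(trees)  # E(h(x))
    return 2 ** (-e_h / c(n))         # s close to 1 => likely anomalous
```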
Subsequence Local Outlier Factor

Compute a Local Outlier Factor (LOF) for each point to measure the degree of deviation from normal data in the local neighborhood.

  1. Reachability distance (RD) of $p$ w.r.t. $q$: \[RD_k(p, q) = \max \left\{ \text{k-distance}(q), d(p, q) \right\}\]
  2. Local reachability density (LRD) for a point $p$: \[ LRD_k(p) = \frac{1}{\sum_{q \in N_k(p)} \frac{RD_k(p, q)}{\left\| N_k(p) \right\|} } \]
  3. Local Outlier Factor (LOF), sketched below: \[ LOF_k(p) = \frac{\sum_{q \in N_k(p)} \frac{LRD_k(q)}{LRD_k(p)} }{\left\| N_k(p) \right\|} \]
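
A minimal Sub-LOF sketch: scikit-learn's LocalOutlierFactor applied to sliding windows of the series (window size and neighborhood size are illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def sub_lof_scores(series: np.ndarray, window: int = 32, k: int = 20) -> np.ndarray:
    X = np.lib.stride_tricks.sliding_window_view(series, window)
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)
    # negative_outlier_factor_ holds -LOF; negate so larger = more anomalous.
    return -lof.negative_outlier_factor_
```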
LSTM Forecasting I

Idea of an RNN:

Neural networks for time series!

LSTM Forecasting II

An RNN cannot retain information from far in the past, therefore the LSTM is introduced:
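
A minimal LSTM forecasting sketch (Keras assumed, sizes illustrative); the anomaly score is the deviation of each observation from its one-step forecast:

```python
import numpy as np
import tensorflow as tf

def lstm_scores(series: np.ndarray, window: int = 32) -> np.ndarray:
    X = np.lib.stride_tricks.sliding_window_view(series[:-1], window)
    y = series[window:]                        # value following each window
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window, 1)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X[..., None], y, epochs=10, batch_size=64, verbose=0)
    forecast = model.predict(X[..., None], verbose=0).ravel()
    return np.abs(y - forecast)                # deviation from the forecast
```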

Random Black Forest

RBF is a nested regression ensemble method:

[Diagram: an outer Bagging Regressor aggregates multiple Random Forest Regressors, each of which in turn aggregates multiple Decision Tree Regressors.]
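
The nesting above maps directly onto scikit-learn: a BaggingRegressor whose base estimator is itself a RandomForestRegressor (the parameter is named `estimator` since scikit-learn 1.2, `base_estimator` before). A sketch with illustrative ensemble sizes:

```python
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

rbf = BaggingRegressor(
    estimator=RandomForestRegressor(n_estimators=10),  # inner forests of trees
    n_estimators=5,                                    # outer bagging layer
)
# Usage as a forecaster (illustrative): rbf.fit(X_windows, y_next),
# then anomaly score = |y_next - rbf.predict(X_windows)|.
```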
ARIMA I

Forecasting model with three components:

  1. Autoregressive model: \[ y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_p y_{t-p} + \epsilon_t \]
  2. Moving average: \[ y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \ldots + \theta_q \epsilon_{t-q} \]
  3. Differencing: \[ y_t^{\prime} = y_t - y_{t-1} \]
ARIMA II

All combined:

\[ \begin{aligned} y_t^{\prime} = c &+ \phi_1 y^{\prime}_{t-1} + \phi_2 y^{\prime}_{t-2} + \ldots + \phi_p y^{\prime}_{t-p} \\ &+ \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \ldots + \theta_q \epsilon_{t-q} + \epsilon_t \end{aligned} \]
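
A forecasting sketch with statsmodels (the order $(p, d, q)$ is illustrative); as with the LSTM, the anomaly score is the deviation from the in-sample one-step prediction:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arima_scores(series: np.ndarray, order=(2, 1, 2)) -> np.ndarray:
    fit = ARIMA(series, order=order).fit()
    predicted = fit.predict(start=0, end=len(series) - 1)
    return np.abs(series - predicted)
```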

Ensemble Method

Compute weighted average of anomaly scores
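
A minimal sketch (algorithm names and weights are illustrative):

```python
import numpy as np

def ensemble_score(scores: dict, weights: dict) -> np.ndarray:
    # Weighted average of per-algorithm score vectors; weights are
    # normalized so they sum to one.
    total = sum(weights.values())
    return sum((w / total) * scores[name] for name, w in weights.items())

# e.g. ensemble_score({"lstm": s_lstm, "arima": s_arima},
#                     {"lstm": 0.7, "arima": 0.3})
```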

What if there are many time series?

A Data Scientist wants to know:

Which anomaly patterns arise?

Which algorithms work well?

Which anomalies are the most severe?

This is a Visual Analytics Process!

Image Source: J. Bernard, "Exploratory search in time-oriented primary data"

Inspiration: MTV

Image Source: Liu et al., "MTV: Visual Analytics for Detecting, Investigating, and Annotating Anomalies in Multivariate Time Series"

Inspiration: Cyclic Time Series Data with Matrix and Glyph Representations

Image Source: Suschnigg et al., "Visual Exploration of Anomalies in Cyclic Time Series Data with Matrix and Glyph Representations"

AnoScout

A System for Anomaly Exploration

GutenTAG

Good Time Series Anomaly Generator

Anomaly Types in the Dataset

Upload and Computation

Do we have anomalies now?

Not yet!

Anomaly Extraction

Algorithm Settings

Anomaly Representation

Manual Inspection

Exploring Anomalies

Heatmap

Scatterplot

Cluster Overview

Dissimilarities

Heatmap

Bird's Eye Perspective

The Severity of an Anomaly

\[\text{severity} = \sqrt{\text{score}^2 + \text{length}^2}\]

The Scatterplot is Interactive!

Ranking Anomalies

Two methods implemented:

  1. Sort by severity and rating
  2. Estimate the rating based on similar items' ratings (sketched below):
    \[ \hat{r}_i = \frac{\sum_{j \in I^k} sim(i, j)r_j}{\sum_{j \in I^k} sim(i,j)} \] where $I^k$ = the k-nearest neighbors of anomaly $i$.
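
A sketch of method 2 (names are illustrative; `sim` stands in for the DTW-based similarity described on the next slide):

```python
def estimate_rating(i, rated, sim, k=5):
    """rated: {anomaly_id: rating}; sim(i, j): similarity of two anomalies."""
    neighbors = sorted(rated, key=lambda j: sim(i, j), reverse=True)[:k]
    num = sum(sim(i, j) * rated[j] for j in neighbors)   # weighted ratings
    den = sum(sim(i, j) for j in neighbors)              # normalization
    return num / den if den > 0 else 0.0
```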

Measuring similarity

AnoScout uses Dynamic Time Warping
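
A minimal dynamic-programming DTW sketch, which can compare anomaly subsequences of different lengths (quadratic in the sequence lengths; an optimized library can be substituted):

```python
import numpy as np

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of match, insertion, or deletion.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return float(D[n, m])
```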

The Recommender in action

How do we gain an overview of anomaly patterns?

Clustering!

Clustering of Anomalies

Finding The Most Dissimilar Anomalies

Optimizing the Ensemble Hyperparameters

Hypothesis: user labels $y_i$ are approximated by anomaly scores $\hat{y}_i$

\[ \frac{\mathbf{a}^{T}_i \mathbf{w}}{\left\| \mathbf{w} \right\|} = \hat{y}_i \approx y_i \]

The Optimization Problem

\[ \begin{gather*} \min \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{k} w_j \hat{y}_{ij} \right)^2 \text{ s.t. } \\ \sum_{j=1}^{k} w_j = 1, \\ w_j \geq 0, \forall j = 1 \ldots k \end{gather*} \]
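
A sketch of this constrained least-squares problem with SciPy's SLSQP solver ($\mathbf{Y}$ is the $n \times k$ matrix of per-algorithm scores, $\mathbf{y}$ the user labels):

```python
import numpy as np
from scipy.optimize import minimize

def fit_weights(Y: np.ndarray, y: np.ndarray) -> np.ndarray:
    k = Y.shape[1]
    result = minimize(
        lambda w: np.mean((y - Y @ w) ** 2),             # objective (MSE)
        x0=np.full(k, 1.0 / k),                          # start uniform
        bounds=[(0.0, None)] * k,                        # w_j >= 0
        constraints=[{"type": "eq",
                      "fun": lambda w: w.sum() - 1.0}],  # sum(w) = 1
        method="SLSQP",
    )
    return result.x
```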

Gridsearch for the other parameters (sketched after this list):

  1. Differencing: To apply or not to apply, that is the question.
  2. Threshold: Search space in the interval $[0, 1]$ with step size $0.001$.
  3. Savitzky-Golay filter: Search space in the interval $[10, 200]$ with step size $1$.
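
A sketch of the search loop; `evaluate` is a hypothetical callback standing in for the objective (agreement of the resulting detections with the user labels):

```python
import itertools
import numpy as np

def grid_search(evaluate):
    # Exhaustively score every parameter combination and keep the best.
    return max(
        itertools.product(
            [False, True],                # 1. apply differencing?
            np.arange(0.0, 1.0, 0.001),   # 2. threshold in [0, 1]
            range(10, 201),               # 3. Savitzky-Golay parameter
        ),
        key=lambda params: evaluate(*params),
    )
```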

Result

Caution: Garbage in, garbage out!

WIP: Shape Matching

Boost shapelets that are similar to highly rated anomalies.

Mute shapelets rated as "Nominals".

Summary

Architecture

Architecture

Contributions

  1. Exploration pipeline for anomalies in time-oriented data.
  2. "Playground" for testing various algorithms.
  3. Using user labels to fine-tune the system.

Future Work

  1. Advanced Recommenders, e.g. RecWalk or BPR [Prof. Lex]
  2. Explainable AI, e.g. a 2D Projection for the Sub-LOF method [Prof. Holzinger]
  3. Impact of user labels [Prof. Schreck]
  4. Evaluation: Domain experts and proper use case

Please tell me your ideas! 🙏

Conclusion

AnoScout combines a diverse set of anomaly detection algorithms and presents their results to the user through interactive visual recommendations.

[QR code]

Thank you!