Anomaly Detection Using Robust Random Cut Forest (RRCF)
Summer 2021 Remote Software Engineering Internship at Google — Google Cloud Platform
Internship Introduction
This summer, I worked on the SmartOps team under Google Cloud Platform (GCP); the internship ran from 5/24/2021 to 8/13/2021.
Project Background
Before going into the main content, I want to start with two generic comparisons that come up in practical machine learning applications: multivariate vs. univariate and streaming vs. batch operations. They capture the core values of this project well.
Streaming vs Batch Operations
A streaming operation runs live as data streams in, maintaining an internal state that is incrementally updated with each new datapoint. A batch operation, in contrast, processes a batch of data all at once to produce results; if queried again later, the entire operation reruns end to end. For these reasons, streaming operations enable more real-time results and better computational efficiency.
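As a toy illustration (not part of the project code), compare maintaining a running mean incrementally against recomputing it over the whole batch:

```python
import numpy as np

class StreamingMean:
    """Streaming operation: keeps an internal state (count, running sum)
    that is incrementally updated as each new value arrives."""
    def __init__(self):
        self.count, self.total = 0, 0.0

    def update(self, value):
        # O(1) work per new point; old data never needs to be revisited.
        self.count += 1
        self.total += value
        return self.total / self.count

def batch_mean(values):
    # Batch operation: reprocesses the whole dataset end to end on every query.
    return float(np.mean(values))

data = [3.0, 5.0, 4.0, 6.0]
stream = StreamingMean()
for x in data:
    live_estimate = stream.update(x)   # a result is available as data arrives
print(live_estimate, batch_mean(data)) # both 4.5, computed at different cost
```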
Multivariate vs Univariate
A multivariate scheme combines different metrics into a single datapoint with multiple dimensions, while a univariate scheme treats each metric separately as its own 1-dimensional datapoint. A multivariate scheme can capture correlations between metrics, enabling detections that would be impossible under a univariate scheme.
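For example (with made-up metric values), the same three metrics can be kept as separate univariate streams or stacked into multivariate points:

```python
import numpy as np

# Three metrics observed at the same three timestamps (made-up values).
cpu_util = np.array([0.42, 0.45, 0.95])
memory   = np.array([0.61, 0.60, 0.62])
latency  = np.array([0.10, 0.11, 0.50])

# Univariate scheme: each metric is its own stream of 1-dimensional points.
univariate_streams = {"cpu": cpu_util, "mem": memory, "lat": latency}

# Multivariate scheme: one 3-dimensional point per timestamp, so correlated
# behavior (CPU and latency spiking together at t=2) is visible to the model.
multivariate_points = np.column_stack([cpu_util, memory, latency])
print(multivariate_points.shape)  # (3, 3): 3 timestamps x 3 metrics
```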
Next, I want to preface the main content with brief introductions to the key terms and technologies.
Anomaly Detection in Cloud Operations
In cloud operations, alerting is conventionally based on a predefined threshold on some metric. A more advanced approach uses the history of a metric to classify each new datapoint as either "normal" or "anomalous". In this setting, each point is assigned an "anomaly score" representing how anomalous it is given the history, and a threshold is then set on the anomaly scores to perform labeling. An anomaly alert is triggered when the timeseries produces a series of anomaly scores exceeding this threshold. This frees the cloud operator from having to choose a somewhat arbitrary threshold on the raw timeseries.
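As a minimal sketch of this idea (the alerting rule and values here are hypothetical, not the production logic):

```python
import numpy as np

def label_anomalies(scores, threshold, min_consecutive=3):
    """Flag a point once the anomaly scores have exceeded the threshold for
    `min_consecutive` consecutive points (a hypothetical alerting rule)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.zeros(len(scores), dtype=bool)
    run = 0
    for i, above in enumerate(scores > threshold):
        run = run + 1 if above else 0
        labels[i] = run >= min_consecutive
    return labels

scores = [0.2, 0.3, 1.8, 2.1, 2.4, 0.4]       # hypothetical anomaly scores
print(label_anomalies(scores, threshold=1.5))  # alert fires at the 5th point
```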
Robust Random Cut Forest
RRCF is a tree-based anomaly detection model that, given a timeseries, produces an anomaly score for each datapoint. At a high level, given a multivariate timeseries dataset, it picks a dimension, splits the dataset on that dimension at a random value, and then repeats the same operation recursively on both sub-datasets until every point is isolated. Conceptually, points isolated earlier are more likely to be anomalous because they do not merge well with the rest of the data, so they receive higher anomaly scores. We chose RRCF because 1) it is a state-of-the-art model that supports both multivariate and streaming data thanks to its efficient model-updating mechanism, and 2) it has demonstrated better performance than the model currently in production for anomaly detection on other problems.
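For a concrete picture of the streaming usage, here is a minimal sketch on synthetic data using the open-source `rrcf` Python package (not the production implementation). It shows how the forest is updated point by point and how the average collusive displacement (CoDisp) across trees serves as the anomaly score:

```python
import numpy as np
import rrcf  # open-source RRCF implementation (pip install rrcf)

num_trees, tree_size = 40, 256                 # NoT and SS, discussed below
forest = [rrcf.RCTree() for _ in range(num_trees)]

rng = np.random.default_rng(0)
stream = rng.normal(size=(1000, 2))            # synthetic 2-D multivariate stream
stream[500] += 8                               # inject one obvious anomaly

avg_codisp = {}
for index, point in enumerate(stream):
    for tree in forest:
        # Keep each tree near the target sample size by forgetting the oldest point.
        if len(tree.leaves) > tree_size:
            tree.forget_point(index - tree_size)
        tree.insert_point(point, index=index)
    # CoDisp (collusive displacement) is the RRCF anomaly score; average over trees.
    avg_codisp[index] = np.mean([tree.codisp(index) for tree in forest])

print(max(avg_codisp, key=avg_codisp.get))     # highest score lands near index 500
```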
For more details on how RRCF works, see the Amazon SageMaker Developer Guide on RRCF.
This project aims to exploit the theoretical advantages of multivariate, streaming anomaly detection using Robust Random Cut Forest (RRCF), as opposed to the current univariate, batch-based anomaly detection using Isolation Forest (IF).
Approach
Here is the detailed design and approach for this project. Starting from the top of the pipeline, univariate data streams for different metrics arrive from Pub/Sub topics. The first step is to retrieve these univariate streams; the data construction module then processes them and builds a persistent multivariate dataset in BigQuery. Next, the multivariate data is fed into the RRCF anomaly detection module in a streaming manner to compute anomaly scores. Finally, the anomaly scores are used for further analysis.
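A simplified, self-contained sketch of this pipeline shape is below. The Pub/Sub source, BigQuery sink, and detector are mocked with hypothetical stand-ins so the flow can run end to end; none of the helper names reflect the actual project code.

```python
import numpy as np

def mock_metric_streams(n_points=100, n_metrics=4, seed=0):
    """Stand-in for the Pub/Sub subscriptions: yields one value per metric
    at each timestamp (synthetic data)."""
    rng = np.random.default_rng(seed)
    for t in range(n_points):
        yield t, rng.normal(size=n_metrics)

def persist_row(table, timestamp, point):
    """Stand-in for the data construction module writing to BigQuery."""
    table.append({"timestamp": timestamp, "metrics": point.tolist()})

class MockDetector:
    """Stand-in for the RRCF module (see the streaming sketch above);
    here the score is just the point's norm so the example is runnable."""
    def score_point(self, point):
        return float(np.linalg.norm(point))

detector, table, scores = MockDetector(), [], []
for t, values in mock_metric_streams():
    persist_row(table, t, values)                # persistent multivariate dataset
    scores.append(detector.score_point(values))  # streaming anomaly score
print(len(table), max(scores))
```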
Preliminary Results
In this section, I go over some meaningful experiments and the qualitative/quantitative evaluations performed, and state the preliminary conclusions and takeaways from each analysis.
Qualitative Analysis
Hyperparameter — Sample Size (SS): When SS is small, fewer points are taken into consideration when determining whether the next point is anomalous, so anomaly scores for natural noise can be high. When SS is large, more points are taken into consideration; if there have been anomalous points in the past, a large SS is more likely to include those past anomalies when evaluating the next datapoint, which results in lower anomaly scores. SS should therefore cover enough data to capture the natural variance well without incorporating too many past anomalies, and should be chosen carefully for each dataset and problem space.
Hyperparameter — Number of Trees (NoT): From the experiments we ran to observe its effect, we found that NoT just needs to be large enough to produce consistent and stable behavior.
Normalization: One key assumption of RRCF is that all dimensions have a similar range of values under normal conditions. When this assumption does not hold, the model tends to be overly sensitive to the dimensions with wider value ranges. To bring all dimensions into the same range, we experimented with min-max normalization.
After applying min-max normalization, we noticed a reduced bias towards the metrics with a wider range of values. However, it introduces a new bias towards metrics that oscillate more. Therefore, more normalization techniques need to be explored and experimented with to achieve optimal performance.
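For reference, here is a minimal min-max normalization sketch; the reference window, helper names, and values are hypothetical, not the production implementation:

```python
import numpy as np

def fit_min_max(history):
    """Estimate per-metric min/max from a reference window of normal data
    (hypothetical helper; the actual pipeline may estimate these differently)."""
    history = np.asarray(history, dtype=float)
    return history.min(axis=0), history.max(axis=0)

def min_max_normalize(point, lo, hi, eps=1e-9):
    # Rescale every metric into roughly [0, 1] so no single wide-range
    # dimension dominates the random cuts.
    return (np.asarray(point, dtype=float) - lo) / (hi - lo + eps)

history = [[200.0, 0.01], [400.0, 0.03], [300.0, 0.02]]  # e.g. QPS vs. error rate
lo, hi = fit_min_max(history)
print(min_max_normalize([350.0, 0.025], lo, hi))  # both metrics now comparable
```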
Shingles: Shingles are sliding windows over a timeseries that construct new datapoints; this is a widely used technique in anomaly detection. If a shingle of size 4 is set over a univariate stream, the first 4 values received at times t1, t2, t3, t4 are treated as one 4-dimensional point. Then, at time t5, the values at t2, t3, t4, t5 form the next 4-dimensional point; the window slides over by one unit at each time step. A shingle encapsulates the typical shape of a curve, so a departure from that shape could be an anomaly. In our case with multivariate data streams, a shingle encapsulates the typical shapes of multiple curves, one per metric, as well as the correlations between these curve segments. However, it remains an open question whether such complex information can be properly learned by a tree-based model.
The expected effects are noise reduction and fewer false alarms. We chose shingle sizes of 4 and 10 because the RRCF paper experimented with shingle sizes of 4 and 48 on univariate data; since we have at most 4 metrics, a shingle size of 10 yields length-40 datapoints, which is close to 48.
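Here is a small sketch of how shingling turns a stream into overlapping windows, including the multivariate case where a shingle of size 10 over 4 metrics becomes a length-40 datapoint (illustrative code, not the project implementation):

```python
import numpy as np

def shingle(points, size):
    """Turn a stream of points into overlapping windows ("shingles").
    For a stream of d-dimensional points and shingle size s, each shingled
    point is flattened into a vector of length s * d."""
    points = np.asarray(points, dtype=float)
    return np.array([points[i:i + size].ravel()
                     for i in range(len(points) - size + 1)])

# Univariate example: values at t1..t5 with a shingle of size 4.
print(shingle([1.0, 2.0, 3.0, 4.0, 5.0], 4))
# [[1. 2. 3. 4.]
#  [2. 3. 4. 5.]]

# Multivariate example: 4 metrics with shingle size 10 -> length-40 datapoints.
multi = np.random.default_rng(0).normal(size=(100, 4))
print(shingle(multi, 10).shape)  # (91, 40)
```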
Conclusions from the experiments:
- When shingle size is large enough, some false alarms can be removed.
- When significant anomalies are present, noise is reduced as the shingle size increases, while scores remain high for the anomalies.
- When no significant anomalies are present, shingles make the predictions noisier and assign lower scores to some anomalies.
To find the best shingle size, more experiments are needed with larger shingle sizes until we see missed anomalies or significantly delayed detection, since large shingles theoretically delay detection.
Quantitative Analysis
The goal here is to quantify model performance in each experiment using an evaluation metric. In anomaly detection research, one of the most popular evaluation metrics is the F1-score on the positive label; for reference, common metrics like precision, recall, accuracy, and AUC are also computed. In the evaluation pipeline, given the anomaly scores from an experiment, we first assign labels to the points so as to optimize the F1-score, then feed them, along with the ground truth labels, into the computation process to obtain the quantitative evaluation metrics.
A quick note: it is generally difficult to obtain ground truth labels for anomaly detection, so we use the current production results as the ground truth, which is sufficient for this preliminary phase.
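A minimal sketch of this evaluation flow using scikit-learn, with hypothetical scores and labels; the simple threshold search here is only a stand-in for the actual label-assignment step:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def best_f1_labels(scores, truth):
    """Choose the score threshold that maximizes F1 on the positive label and
    return the induced binary labels (a stand-in for the label-assignment step)."""
    scores, truth = np.asarray(scores), np.asarray(truth)
    best_thr = max(np.unique(scores),
                   key=lambda t: f1_score(truth, (scores >= t).astype(int),
                                          zero_division=0))
    return (scores >= best_thr).astype(int)

truth  = np.array([0, 0, 0, 1, 1, 0, 0, 0])            # hypothetical ground truth
scores = np.array([.1, .2, .1, .9, .7, .3, .2, .1])    # hypothetical anomaly scores
labels = best_f1_labels(scores, truth)

print("F1:       ", f1_score(truth, labels))
print("Precision:", precision_score(truth, labels))
print("Recall:   ", recall_score(truth, labels))
print("Accuracy: ", accuracy_score(truth, labels))
print("AUC:      ", roc_auc_score(truth, scores))      # AUC uses the raw scores
```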
The following conclusions are produced from deeper analysis of the quantitative results:
- Multivariate anomaly detection clearly demonstrated a performance improvement with RRCF. We observed that the F1-score increases as we increase the number of metrics, which is quantitative evidence for the effectiveness of the multivariate scheme.
- We also observed cases where RRCF catches the anomaly perfectly. This shows that RRCF already produces strong results even without much fine-tuning. Most importantly, these results come from real-time detection on streaming data, meaning the model can deliver real-time, efficient detection without much loss of effectiveness. This is a strong signal for a better streaming anomaly detection method compared to the current batch-based one.
- RRCF can catch anomalies quicker than the current method, which is a known trait of RRCF. In our data, RRCF detects the anomaly 30 seconds to 1 minute before the current method does. This behavior can substantially benefit customers: in some system failures, catching the anomaly even a few minutes or seconds earlier can save a lot of cost.
Future Work
Better Ground Truth Data
The first is obtaining more and better ground truth data. Anomaly detection problems inherently have extremely imbalanced data, which is unavoidable, but we can always collect more data, which would increase the number of positive labels. In fact, the data construction process can be left running live until stopped, accumulating as much multivariate data as needed.
Extensive Search on Normalization Techniques
Next is running more experiments on normalization techniques, because most of the noise and false alarms are due to oversensitivity to one or two metrics. Moreover, unnormalized data breaks the RRCF assumption discussed earlier, so an optimal normalization technique is necessary to achieve optimal RRCF performance. A good place to continue would be statistical standardization.
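A minimal sketch of statistical standardization (z-scoring) as the next candidate, with made-up values; the reference window used to estimate the mean and standard deviation would need to be chosen carefully:

```python
import numpy as np

def standardize(point, mean, std, eps=1e-9):
    """Statistical standardization (z-scoring): center each metric at 0 with
    unit variance, as an alternative to min-max normalization."""
    return (np.asarray(point, dtype=float) - mean) / (std + eps)

# Mean/std estimated from a reference window of recent data (made-up values).
history = np.array([[200.0, 0.01], [400.0, 0.03], [300.0, 0.02]])
mean, std = history.mean(axis=0), history.std(axis=0)
print(standardize([350.0, 0.025], mean, std))  # both metrics on a comparable scale
```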
Acknowledgment
Finally, I want to especially thank my hosts Mark and Reza for all their help and guidance, and director Dave for his support and leadership.