Senior Manager, Analytics and Algorithms, at American Express Global Business Travel
Bradford Tuckfield is a Senior Manager, Analytics and Algorithms, at American Express Global Business Travel. He studied behavioral economics for his PhD, and has worked in data analysis and data science across several industries.
Track: Data Mining
Tuesday, April 16, 1:50–2:40pm
Anomaly Detection in the Wild
Anomaly detection remains crucial for organizations in a variety of industries. It can be used to detect fraud, find data entry errors, optimize revenue streams and cut costs, among other things. This presentation will analyze and explore several machine learning methods that can improve the speed and accuracy of anomaly detection. This presentation will also include practical suggestions for implementing effective anomaly detection programs in real-world organizations.
The presentation will begin with the simplest type of anomaly detection: statistical outlier detection. We will quickly cover the best way to make a Gaussian model of a variable, calculate z-scores, and find outliers. We will also cover nonparametric methods that depend on quantiles and interquartile ranges.
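As a rough illustration of the two approaches named above, the following Python sketch flags outliers first by z-score under a fitted Gaussian and then by interquartile-range fences. The function names, the threshold of 3, and the multiplier of 1.5 are common conventions chosen here for illustration; they are assumptions, not necessarily the settings used in the talk.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices whose z-score under a fitted Gaussian exceeds threshold.

    The threshold of 3.0 is a conventional default, assumed for illustration.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Return indices outside the fences [Q1 - k*IQR, Q3 + k*IQR].

    The multiplier k=1.5 is the usual box-plot convention, assumed here.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]
```

One caveat with the Gaussian version: an extreme value inflates the very mean and standard deviation it is measured against. With the population standard deviation used here, no z-score in a sample of n points can exceed sqrt(n - 1), so a threshold of 3 can never fire on ten or fewer points. The quantile-based fences are less sensitive to this effect.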
We will move on to data transformations. Raw outlier detection is not always a workable solution, since some data naturally follows a fat-tailed distribution. In order to successfully perform outlier detection on fat-tailed data, it is necessary to first apply suitable transformations. We explore log-normal distributions, which are heavy-tailed and which commonly occur in nature, and we show that an easy logarithm transformation enables simple Gaussian outlier detection to successfully detect anomalies.
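A minimal Python sketch of this idea follows: take logarithms, then apply ordinary Gaussian z-scores to the transformed values. It assumes strictly positive data; the function names and the threshold of 3 are illustrative choices, not taken from the talk.

```python
import math
import statistics


def log_zscores(values):
    """Return z-scores of log(values); assumes every value is strictly positive."""
    logs = [math.log(v) for v in values]  # raises ValueError if any v <= 0
    mean = statistics.fmean(logs)
    std = statistics.pstdev(logs)
    if std == 0:
        return [0.0] * len(logs)  # identical values: no spread to measure
    return [(x - mean) / std for x in logs]


def log_outliers(values, threshold=3.0):
    """Return indices whose log-scale z-score exceeds threshold in absolute value."""
    return [i for i, z in enumerate(log_zscores(values)) if abs(z) > threshold]
```

On the log scale, values such as 0.01, 0.1, 1, 10, and 100 are evenly spaced, so the largest one is no more extreme than the smallest. On the raw scale, the same 100 would sit far from everything else.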
We will move on to seasonality decomposition, which enables the analysis of temporal trends in data: both long-term, directed trends and seasonal, cyclic trends. Seasonality decomposition “breaks down” data into a trend component, a cyclic component, and a noise component. Statistical outliers in the noise component of decomposed data provide strong evidence for anomalies in the underlying data, but they are not at all obvious before the decomposition is performed. Using retail sales data for automobiles, we show how to perform this type of decomposition and how it enables anomaly detection.
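To make the three components concrete, here is a sketch of one simple variant, a classical additive decomposition: a centered moving average estimates the trend, per-phase averages of the detrended values estimate the cyclic component, and whatever remains is treated as noise. The abstract does not say which decomposition method the talk uses, so this choice, the function names, and the z-score threshold are all assumptions made for illustration.

```python
import statistics


def decompose_additive(series, period):
    """Split series into (trend, seasonal, residual) lists of the same length.

    Trend and residual are None near the edges, where the centered
    moving average cannot be computed.
    """
    n = len(series)
    if period < 2 or n < 2 * period:
        raise ValueError("need period >= 2 and at least two full periods")
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2:
            trend[t] = sum(window) / period
        else:
            # Even period: the window has period + 1 points, so the two
            # end points get half weight to keep the average centered.
            trend[t] = (sum(window) - 0.5 * (window[0] + window[-1])) / period
    # Average the detrended values for each position within the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    phase_means = [statistics.fmean(b) for b in buckets]
    offset = statistics.fmean(phase_means)  # force the cycle to average zero
    seasonal = [phase_means[t % period] - offset for t in range(n)]
    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


def residual_outliers(series, period, threshold=3.0):
    """Return indices whose residual z-score exceeds threshold (assumed 3.0)."""
    _, _, residual = decompose_additive(series, period)
    known = [(t, r) for t, r in enumerate(residual) if r is not None]
    mean = statistics.fmean(r for _, r in known)
    std = statistics.pstdev(r for _, r in known)
    if std == 0:
        return []
    return [t for t, r in known if abs(r - mean) / std > threshold]
```

For example, with a synthetic quarterly series (made-up numbers, not the automobile sales data mentioned above) built as `100 + 2*t` plus a repeating pattern of `[5, -3, -4, 2]`, adding 20 to a single quarter produces a value that still lies inside the overall range of the series, yet it stands out clearly in the residual.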
Finally, we will cover two advanced types of anomalies: contextual anomalies and collective anomalies. Contextual anomalies are data points that are anomalous only when considered in the context of their immediate neighbors (temporal or otherwise). Collective anomalies are situations in which no individual data point is an anomaly, but the occurrence of many such data points together constitutes one. Detecting either type reliably requires specific data transformation techniques.
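The two definitions can be sketched in Python as follows. For contextual anomalies, each point is compared with the median of its neighbors rather than with the whole data set; for collective anomalies, values are summed over a sliding window and the window total is tested instead of the individual points. These particular transformations, the function names, and the parameters `window`, `max_dev`, and `max_total` are hypothetical choices for illustration, not the talk's own techniques.

```python
import statistics


def contextual_anomalies(values, window=2, max_dev=5.0):
    """Flag points that differ from the median of their neighbors by more than max_dev.

    Neighbors are up to `window` points on each side; the point itself is excluded.
    """
    flagged = []
    for i, v in enumerate(values):
        neighbors = values[max(0, i - window):i] + values[i + 1:i + 1 + window]
        if len(neighbors) < 2:
            continue  # too little context to judge this point
        if abs(v - statistics.median(neighbors)) > max_dev:
            flagged.append(i)
    return flagged


def collective_anomalies(values, window, max_total):
    """Return start indices of length-`window` runs whose sum exceeds max_total."""
    flagged = []
    running = sum(values[:window])
    for start in range(len(values) - window + 1):
        if start > 0:
            # Slide the window: add the entering value, drop the leaving one.
            running += values[start + window - 1] - values[start - 1]
        if running > max_total:
            flagged.append(start)
    return flagged
```

In the series `[10, 12, 14, 16, 18, 20, 22, 12, 26, 28, 30, 28, 26, 24, 22]`, the second 12 is unremarkable globally (the same value appears earlier) but is far from its neighbors, so only the contextual check catches it. Likewise, in a stream of 0/1 event flags a single 1 is ordinary, while four in a row can be flagged through the window total.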
We will not only present these anomaly detection methods, but also show how we have optimized them and improved their performance.
The presentation will include extensive examples of real, working code that enables the implementation of these anomaly detection techniques. We will show code written in R and Python, including a discussion of the differences between them. We will also show examples of some implementation methods that can be written strictly with SQL. We will conclude the presentation with practical considerations drawn from our experience implementing and optimizing these anomaly detection methods.