Anomaly detection can be used in a variety of applications, such as:
① Fraud detection
② Detecting defective products in manufacturing
③ Data cleansing -- removing outliers from a data set before training another model
You may have noticed that some imbalanced classification problems are often solved with anomaly detection algorithms. For example, spam detection can be framed as a classification task (spam is much rarer than ordinary e-mail), but we can also approach it with anomaly detection.
A related task is novelty detection. It differs from anomaly detection in that the algorithm is assumed to be trained on a clean data set (free of outliers). It is widely used in online learning, when it is necessary to identify whether a new instance is an outlier.
Another related task is density estimation: the task of estimating the probability density function of the random process that generated the data set. Density estimation is commonly used for anomaly detection (instances located in low-density regions are likely to be anomalies) and for data analysis. It is usually solved with density-based clustering algorithms such as Gaussian mixture models or DBSCAN.
Statistical methods
The easiest way to detect outliers is to start with statistical methods, which were developed long ago. One of the most popular is Tukey's method for outlier detection (also known as the interquartile range, or IQR, method).
Its essence is to compute the interquartile range: the distance between the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile). Data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers. Below you can see an example using a data set of people's heights: heights below 54.95 inches (139 cm) and above 77.75 inches (197 cm) are considered outliers.
This and other statistical methods (such as the z-score method for detecting outliers) are usually used for data cleaning.
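As a minimal sketch of the IQR method, the snippet below computes Tukey's fences with NumPy; the heights array here is synthetic illustration data, not the data set from the example above.

```python
import numpy as np

# Synthetic height data (inches), for illustration only
rng = np.random.default_rng(42)
heights = rng.normal(loc=66.0, scale=4.0, size=1000)

q1, q3 = np.percentile(heights, [25, 75])  # first and third quartiles
iqr = q3 - q1                              # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = heights[(heights < lower) | (heights > upper)]
print(f"fences: [{lower:.2f}, {upper:.2f}] inches, {outliers.size} outliers")
```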
Clustering and dimensionality reduction algorithms
Another simple, intuitive, and often effective approach to anomaly detection is to use clustering algorithms that solve the density estimation task (such as Gaussian mixture models or DBSCAN). Any instance located in a low-density region can then be considered an anomaly; we only need to set a density threshold.
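Here is a minimal sketch of that idea using sklearn's GaussianMixture; the synthetic data, the number of components, and the 4th-percentile density cutoff are all illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # synthetic data for illustration

gm = GaussianMixture(n_components=3, random_state=0).fit(X)
log_density = gm.score_samples(X)          # log-density of each instance
threshold = np.percentile(log_density, 4)  # flag the lowest-density 4%
anomalies = X[log_density < threshold]
```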
In addition, any dimensionality reduction algorithm that provides an inverse_transform() method can be used. This works because the reconstruction error of an anomaly is always much larger than that of a normal instance.
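Below is a sketch of this reconstruction-error approach using sklearn's PCA; the data, the number of components, and the 98th-percentile cutoff are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # synthetic data for illustration

pca = PCA(n_components=3).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))  # reconstruct instances
rec_error = np.mean((X - X_rec) ** 2, axis=1)    # per-instance error

threshold = np.percentile(rec_error, 98)  # flag the worst-reconstructed 2%
anomalies = X[rec_error > threshold]
```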
Isolation forest and one-class SVM
Some unsupervised learning algorithms can also be used for anomaly detection, the two most popular being isolation forest and one-class SVM. These algorithms are better suited to novelty detection, but they usually work for anomaly detection as well.
The isolation forest algorithm builds a random forest in which each decision tree is grown randomly. At each step, the forest isolates more and more points, until all points are isolated. Because anomalies lie far from the usual data points, they are typically isolated in fewer steps than normal instances. The algorithm performs well on high-dimensional data, but needs a larger data set than SVM.
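A minimal sketch with sklearn's IsolationForest follows; the synthetic data and the contamination value (the assumed fraction of anomalies) are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # synthetic data for illustration

# contamination = expected share of anomalies (an assumption)
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = iso.fit_predict(X)     # 1 = normal, -1 = anomaly
anomalies = X[labels == -1]
```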
One-class SVM is also widely used for anomaly detection. A kernelized SVM can construct an effective separating hyperplane that divides normal points from abnormal ones. Like any SVM variant, it handles high-dimensional or sparse data well, but it is only suitable for small and medium-sized data sets.
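The following sketch uses sklearn's OneClassSVM with an RBF kernel; the training data and the nu parameter (a rough upper bound on the fraction of outliers) are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 4))  # synthetic, mostly "normal" data
X_new = rng.normal(size=(50, 4))     # new instances to score

# nu roughly bounds the fraction of training outliers (an assumption)
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
labels = ocsvm.predict(X_new)        # 1 = normal, -1 = anomaly
```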
Local outlier factor
The local outlier factor (LOF) algorithm is based on the assumption that anomalies are located in low-density regions. However, instead of simply setting a density threshold (as we can do with DBSCAN), it compares the density of a point with the densities of its k nearest neighbors. If the density of this particular point is much lower than the densities of its neighbors (which means it lies far from them), it is considered an anomaly.
The algorithm can be used for both anomaly detection and novelty detection. Because it is computationally simple and gives good results, it is often used.
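Here is a minimal sketch using sklearn's LocalOutlierFactor, covering both modes; the data and the n_neighbors value are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # synthetic data for illustration

# Anomaly (outlier) detection: score the training data itself
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)    # 1 = normal, -1 = anomaly
anomalies = X[labels == -1]

# Novelty detection: fit on (assumed clean) data, score unseen instances
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X)
new_labels = lof_novelty.predict(rng.normal(size=(10, 3)))
```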
Minimum covariance determinant
The minimum covariance determinant (MCD, or its faster modification Fast-MCD) is useful for outlier detection, especially in data cleaning. It assumes that the inliers are generated from a single Gaussian distribution, while the outliers are not. Since many data sets follow a normal distribution (or can be approximated by one), the algorithm usually performs well. In sklearn, it is implemented by the EllipticEnvelope class.
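A minimal sketch with sklearn's EllipticEnvelope follows; the roughly Gaussian synthetic data and the contamination value are assumptions made for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))  # synthetic, roughly Gaussian data

# contamination = assumed share of outliers in the data
env = EllipticEnvelope(contamination=0.02, random_state=0)
labels = env.fit_predict(X)    # 1 = inlier, -1 = outlier
outliers = X[labels == -1]
```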
How to choose an anomaly detection algorithm?
If you need to clean up a data set, first try classical statistical methods such as Tukey's method for outlier detection. If you know that the data distribution is Gaussian, you can use Fast-MCD.
If you are not doing anomaly detection for data cleaning, first try the simple and fast LOF. If it does not work well (or if you need a separating hyperplane for some reason), try other algorithms depending on your task and data set:
One-class SVM for sparse high-dimensional data, or isolation forest for continuous high-dimensional data.
If you can assume that the data is generated by a mixture of several Gaussian distributions, try a Gaussian mixture model.