## Deep Anomaly Detection for Large-Scale Enterprise Data

In general terms, anomaly detection aims to identify events that are rare or that deviate from the norm. This matters a great deal in finance: in consumer banking, for example, an anomaly can be something critical such as credit card fraud. In other cases, an anomaly may be something a company actively looks for in order to capitalize on it. Other applications include intrusion detection in communication networks, fake news and misinformation, healthcare analysis, industrial damage detection, manufacturing, and security and surveillance.

The use case in this article comes from the SAP domain, specifically Finance. The business goal is to find anomalous behavior in financial transactions.

A typical financial transaction in an Accounting Information System would look like this.

Most such entries are regular transactions, but a small number show malicious behavior and turn out to be anomalies. Fraud detection is the most common use case across financial domains, and anomaly detection methods can help substantially where finding fraud would otherwise take a great deal of manual effort.

In this article, I will talk about a deep learning-based anomaly detection method that uses an Autoencoder Neural Network (AENN).

##### Well, about the dataset

The dataset used for this use case can be found in the GitHub link provided. It is a synthetic dataset of financial data, modified to resemble the real-world data one usually observes in SAP ERP systems, especially in the Finance and Cost Controlling (FICO) module.

The dataset contains 7 categorical and 2 numerical attributes that are available in the FICO tables BKPF (posted journal entry headers) and BSEG (posted journal entry segments).

An additional attribute, “label”, records the true nature of each transaction: regular or anomalous (local or global). It is provided to validate the model and is not used in training.

##### Classification of anomalies:

In industry, anomalies are classified in many ways depending on the use case. A detailed examination of real-world journal entries, as recorded in large-scale AIS or ERP systems, shows that entries can be unusual in two ways: through rare individual attribute values, or through rare combinations of otherwise common attribute values.

Derived from this observation, two classes of anomalous journal entries can be distinguished, namely “global” and “local” anomalies.

Global accounting anomalies are journal entries that exhibit unusual or rare individual attribute values. Such anomalies usually relate to skewed attributes, e.g. rarely used ledgers or unusual posting times. Traditionally, the “red flag” tests that auditors perform during an annual audit are designed to capture this type of anomaly. However, such tests often produce a high volume of false-positive alerts, because events such as reverse postings, provisions and year-end adjustments trigger them while usually carrying a low fraud risk. Furthermore, according to auditors and forensic accountants, “global” anomalies more often indicate “error” than “fraud”.

Local accounting anomalies are journal entries that exhibit an unusual or rare combination of attribute values, even though each individual attribute value occurs quite frequently, e.g. unusual accounting records, irregular combinations of general ledger accounts, or user accounts used by several accounting departments. This type of anomaly is significantly more difficult to detect, since perpetrators try to disguise their activities by imitating a regular activity pattern. As a result, such anomalies usually pose a high fraud risk, since they correspond to processes and activities that might not be conducted in compliance with organizational standards.
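To make the distinction concrete, here is a minimal sketch on a made-up toy table. The column names ledger and user, the values, and the "seen only once" cut-off are all assumptions for illustration and are not taken from the dataset. A global candidate has a rare individual value; a local candidate has only common values, but in a rare combination.

```python
import pandas as pd

# Hypothetical toy journal: column names and values are illustrative only.
toy = pd.DataFrame({
    "ledger": ["A"] * 50 + ["B"] * 50 + ["Z"] + ["A"],
    "user":   ["u1"] * 50 + ["u2"] * 50 + ["u1"] + ["u2"],
})

# How often each individual attribute value occurs, aligned per row.
ledger_freq = toy["ledger"].map(toy["ledger"].value_counts())
user_freq = toy["user"].map(toy["user"].value_counts())

# How often each (ledger, user) combination occurs, aligned per row.
combo_freq = toy.groupby(["ledger", "user"])["ledger"].transform("size")

rare = 1  # hypothetical cut-off: "seen only once"

# Global candidate: at least one individual value is rare.
global_candidates = toy[(ledger_freq <= rare) | (user_freq <= rare)]

# Local candidate: every individual value is common, but the combination is rare.
local_candidates = toy[(ledger_freq > rare) & (user_freq > rare) & (combo_freq <= rare)]

print(global_candidates)
print(local_candidates)
```

Simple frequency rules like these are only meant to illustrate the two definitions; the autoencoder described below learns such regularities from the data instead of relying on hand-set cut-offs.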

*Prerequisites: Readers are expected to be familiar with the basics of how neurons and neural networks work in deep learning. Here is an excellent tutorial to give you a precise understanding of neural networks.*

##### Anomaly Detection using Autoencoder Neural Networks — Theory

Autoencoders have been widely used in computer vision and speech processing. It is less widely known that they can also be used for anomaly detection. In this section, we introduce the main elements of autoencoder neural networks.

A typical autoencoder consists of two non-linear mapping functions, implemented as neural networks: the encoder f(x) and the decoder g(x). The encoder usually follows a funnel-like design with a decreasing number of neurons per layer, and the decoder is typically the symmetric mirror of the encoder. Between them sits a central hidden layer, the latent layer, with a lower dimension than the input. It holds a compressed representation of the input data that is rich enough to reconstruct the input with minimal reconstruction error.

Using this design for anomaly detection involves two main steps: learning the normal behavior of the system (based on past data) and detecting anomalous behavior in real time (by processing real-time data).

Because the dataset is heavily skewed towards regular transactions, the network learns to reconstruct a regular transaction well and fails to do so for an anomaly. A high reconstruction error therefore indicates that a transaction is an anomaly rather than a regular one. Here our loss function is the reconstruction error itself.

```
reconstruction error = || x - g(f(x)) ||
training objective   = arg min over f, g of || x - g(f(x)) ||
```

In this use case, we used the binary cross-entropy loss, given by:

```
-(x * log(x') + (1 - x) * log(1 - x'))
```

Here x is the input data and x' is its reconstruction g(f(x)). The loss measures how closely the reconstruction matches the input: the lower the loss, the more similar the input and its reconstruction are.
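As a small worked example of the binary cross-entropy formula above, the sketch below evaluates it with NumPy for a close and a poor reconstruction of the same input. The helper name binary_cross_entropy, the eps clipping constant, and the toy vectors are assumptions made for illustration; this is not the loss implementation used later in the article.

```python
import numpy as np

def binary_cross_entropy(x, x_rec, eps=1e-12):
    """Mean of -(x*log(x') + (1-x)*log(1-x')) over all attributes."""
    x_rec = np.clip(x_rec, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(x * np.log(x_rec) + (1.0 - x) * np.log(1.0 - x_rec))))

x = np.array([1.0, 0.0, 0.0, 1.0])      # one-hot style input
good = np.array([0.9, 0.1, 0.1, 0.9])   # close reconstruction
bad = np.array([0.1, 0.9, 0.9, 0.1])    # poor reconstruction

loss_good = binary_cross_entropy(x, good)  # equals -log(0.9), about 0.105
loss_bad = binary_cross_entropy(x, bad)    # equals -log(0.1), about 2.303

print(loss_good, loss_bad)
```

The poor reconstruction yields a loss roughly twenty times higher, which is the gap the anomaly detection relies on.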

##### Implementation

*Note: Here is where it gets a bit technical, so I advise non-technical readers to skip this section. You can go through it, but don’t get intimidated by it 🙂*

Import the necessary libraries and set some parameters.

```
# importing utilities
import os
import sys
from datetime import datetime
# importing data science libraries
import pandas as pd
import random as rd
import numpy as np
# importing pytorch libraries
import torch
from torch import nn
from torch import autograd
from torch.utils.data import DataLoader
# import visualization libraries
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from IPython.display import Image, display
sns.set_style('darkgrid')
# ignore potential warnings
import warnings
warnings.filterwarnings("ignore")
```

Set random seed and use GPU if available.

```
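# enable GPU usage (only takes effect if cuDNN is available)
USE_CUDA = True
# seed all random number generators for reproducibility
rseed = 1234
rd.seed(rseed)
np.random.seed(rseed)
torch.manual_seed(rseed)
# seed the GPU generator as well if cuDNN is available and CUDA is enabled
if torch.backends.cudnn.version() is not None and USE_CUDA:
    torch.cuda.manual_seed(rseed)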
```

Import the data into a pandas data frame.

```
ad_dataset = pd.read_csv('./data/fraud_dataset_v2.csv')
ad_dataset.head()
```

Look at the shape of the data and the counts per label.

```
ad_dataset.shape
Out[#]: (533009, 10)
ad_dataset.label.value_counts()
Out[#]: regular    532909
        global         70
        local          30
        Name: label, dtype: int64
```

As you can see, it’s a highly imbalanced dataset, which is true for most real-world data. Anomalies make up about 0.019% of the total data (100 of 533,009 rows). Typical supervised machine learning algorithms tend to perform poorly in such cases. The approach shown in this article is a clever trick that leverages autoencoders to find anomalies.

Let’s remove the label before further processing, since the autoencoder is an unsupervised technique.
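The share of anomalies follows directly from the label counts printed above. The short check below recomputes it; the dictionary is simply a hand-typed copy of those counts.

```python
# Label counts copied from the value_counts() output above.
counts = {'regular': 532909, 'global': 70, 'local': 30}

total = sum(counts.values())
anomaly_share = (counts['global'] + counts['local']) / total

print(total)                      # 533009
print(f"{anomaly_share:.4%}")     # 0.0188%
```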

```
label = ad_dataset.pop('label')
```

Now let’s split categorical and numerical attributes. Add one-hot encodings to the categorical attributes to vectorize them. Apply log scaling and min-max scaling to the numerical variables.

```
categorical_attr = ['KTOSL', 'PRCTR', 'BSCHL', 'HKONT', 'WAERS', 'BUKRS']
ad_dataset_categ_transformed = pd.get_dummies(ad_dataset[categorical_attr])
numeric_attr_names = ['DMBTR', 'WRBTR']
# add a small epsilon to eliminate zero values from data for log scaling
numeric_attr = ad_dataset[numeric_attr_names] + 1e-7
numeric_attr = numeric_attr.apply(np.log)
ad_dataset_numeric_attr = (numeric_attr - numeric_attr.min()) / (numeric_attr.max() - numeric_attr.min())
```
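To see what these transformations do, here is a minimal sketch on a made-up three-row table. It reuses the real column names WAERS and DMBTR, but the values and the toy variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table; the values are made up for illustration.
toy = pd.DataFrame({
    "WAERS": ["EUR", "USD", "EUR"],
    "DMBTR": [10.0, 1000.0, 100.0],
})

# One-hot encode the categorical column: one 0/1 column per distinct value.
toy_categ = pd.get_dummies(toy[["WAERS"]]).astype(float)

# Log-scale, then min-max scale the numeric column into [0, 1].
toy_log = np.log(toy[["DMBTR"]] + 1e-7)
toy_numeric = (toy_log - toy_log.min()) / (toy_log.max() - toy_log.min())

# Put both parts side by side, as done for the full dataset.
toy_transformed = pd.concat([toy_categ, toy_numeric], axis=1)
print(toy_transformed)
```

The log step compresses the wide range of amounts: 10, 100 and 1000 end up evenly spaced at roughly 0, 0.5 and 1 after scaling, instead of 100 sitting close to the minimum.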

Concatenate the numerical and categorical attributes.

```
ad_subset_transformed = pd.concat([ad_dataset_categ_transformed, ad_dataset_numeric_attr], axis = 1)
ad_subset_transformed.shape
Out[#]: (533009, 618)
```

Now let’s implement the encoder network (618–512–256–128–64–32–16–8–4–3).

```
# implementation of the encoder network
class encoder(nn.Module):

    def __init__(self):
        super(encoder, self).__init__()
        # specify layer 1 - in 618, out 512
        self.encoder_L1 = nn.Linear(in_features=ad_subset_transformed.shape[1], out_features=512, bias=True)  # add linearity
        nn.init.xavier_uniform_(self.encoder_L1.weight)  # init weights (Xavier uniform)
        self.encoder_R1 = nn.LeakyReLU(negative_slope=0.4, inplace=True)  # add non-linearity (leaky ReLU)
        # specify layer 2 - in 512, out 256
        self.encoder_L2 = nn.Linear(512, 256, bias=True)
        nn.init.xavier_uniform_(self.encoder_L2.weight)
        self.encoder_R2 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 3 - in 256, out 128
        self.encoder_L3 = nn.Linear(256, 128, bias=True)
        nn.init.xavier_uniform_(self.encoder_L3.weight)
        self.encoder_R3 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 4 - in 128, out 64
        self.encoder_L4 = nn.Linear(128, 64, bias=True)
        nn.init.xavier_uniform_(self.encoder_L4.weight)
        self.encoder_R4 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 5 - in 64, out 32
        self.encoder_L5 = nn.Linear(64, 32, bias=True)
        nn.init.xavier_uniform_(self.encoder_L5.weight)
        self.encoder_R5 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 6 - in 32, out 16
        self.encoder_L6 = nn.Linear(32, 16, bias=True)
        nn.init.xavier_uniform_(self.encoder_L6.weight)
        self.encoder_R6 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 7 - in 16, out 8
        self.encoder_L7 = nn.Linear(16, 8, bias=True)
        nn.init.xavier_uniform_(self.encoder_L7.weight)
        self.encoder_R7 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 8 - in 8, out 4
        self.encoder_L8 = nn.Linear(8, 4, bias=True)
        nn.init.xavier_uniform_(self.encoder_L8.weight)
        self.encoder_R8 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 9 - in 4, out 3
        self.encoder_L9 = nn.Linear(4, 3, bias=True)
        nn.init.xavier_uniform_(self.encoder_L9.weight)
        self.encoder_R9 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # init dropout layer with probability p
        self.dropout = nn.Dropout(p=0.0, inplace=True)

    def forward(self, x):
        # define forward pass through the network
        x = self.encoder_R1(self.dropout(self.encoder_L1(x)))
        x = self.encoder_R2(self.dropout(self.encoder_L2(x)))
        x = self.encoder_R3(self.dropout(self.encoder_L3(x)))
        x = self.encoder_R4(self.dropout(self.encoder_L4(x)))
        x = self.encoder_R5(self.dropout(self.encoder_L5(x)))
        x = self.encoder_R6(self.dropout(self.encoder_L6(x)))
        x = self.encoder_R7(self.dropout(self.encoder_L7(x)))
        x = self.encoder_R8(self.dropout(self.encoder_L8(x)))
        # no dropout on the latent layer
        x = self.encoder_R9(self.encoder_L9(x))
        return x
```

Instantiate the encoder and put it on GPU.

```
# init training network classes / architectures
encoder_train = encoder()
# push to cuda if cudnn is available
if torch.backends.cudnn.version() is not None and USE_CUDA:
    encoder_train = encoder_train.cuda()
```

Now, the decoder network implementation which is the symmetric mirror of the encoder. (3–4–8–16–32–64–128–256–512–618)

```
# implementation of the decoder network
class decoder(nn.Module):

    def __init__(self):
        super(decoder, self).__init__()
        # specify layer 1 - in 3, out 4
        self.decoder_L1 = nn.Linear(in_features=3, out_features=4, bias=True)  # add linearity
        nn.init.xavier_uniform_(self.decoder_L1.weight)  # init weights (Xavier uniform)
        self.decoder_R1 = nn.LeakyReLU(negative_slope=0.4, inplace=True)  # add non-linearity (leaky ReLU)
        # specify layer 2 - in 4, out 8
        self.decoder_L2 = nn.Linear(4, 8, bias=True)
        nn.init.xavier_uniform_(self.decoder_L2.weight)
        self.decoder_R2 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 3 - in 8, out 16
        self.decoder_L3 = nn.Linear(8, 16, bias=True)
        nn.init.xavier_uniform_(self.decoder_L3.weight)
        self.decoder_R3 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 4 - in 16, out 32
        self.decoder_L4 = nn.Linear(16, 32, bias=True)
        nn.init.xavier_uniform_(self.decoder_L4.weight)
        self.decoder_R4 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 5 - in 32, out 64
        self.decoder_L5 = nn.Linear(32, 64, bias=True)
        nn.init.xavier_uniform_(self.decoder_L5.weight)
        self.decoder_R5 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 6 - in 64, out 128
        self.decoder_L6 = nn.Linear(64, 128, bias=True)
        nn.init.xavier_uniform_(self.decoder_L6.weight)
        self.decoder_R6 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 7 - in 128, out 256
        self.decoder_L7 = nn.Linear(128, 256, bias=True)
        nn.init.xavier_uniform_(self.decoder_L7.weight)
        self.decoder_R7 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 8 - in 256, out 512
        self.decoder_L8 = nn.Linear(256, 512, bias=True)
        nn.init.xavier_uniform_(self.decoder_L8.weight)
        self.decoder_R8 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # specify layer 9 - in 512, out 618
        self.decoder_L9 = nn.Linear(in_features=512, out_features=ad_subset_transformed.shape[1], bias=True)
        nn.init.xavier_uniform_(self.decoder_L9.weight)
        self.decoder_R9 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # init dropout layer with probability p
        self.dropout = nn.Dropout(p=0.0, inplace=True)

    def forward(self, x):
        # define forward pass through the network
        x = self.decoder_R1(self.dropout(self.decoder_L1(x)))
        x = self.decoder_R2(self.dropout(self.decoder_L2(x)))
        x = self.decoder_R3(self.dropout(self.decoder_L3(x)))
        x = self.decoder_R4(self.dropout(self.decoder_L4(x)))
        x = self.decoder_R5(self.dropout(self.decoder_L5(x)))
        x = self.decoder_R6(self.dropout(self.decoder_L6(x)))
        x = self.decoder_R7(self.dropout(self.decoder_L7(x)))
        x = self.decoder_R8(self.dropout(self.decoder_L8(x)))
        # no dropout on the output layer
        x = self.decoder_R9(self.decoder_L9(x))
        return x
```

Instantiate the decoder and put it on GPU.

```
# init training network classes / architectures
decoder_train = decoder()
# push to cuda if cudnn is available
if torch.backends.cudnn.version() is not None and USE_CUDA:
    decoder_train = decoder_train.cuda()
```

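Now set the loss function and some hyperparameters. Note that BCEWithLogitsLoss applies a sigmoid to the decoder output internally before computing the binary cross-entropy, so the networks themselves output raw scores (logits).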

```
# define the optimization criterion / loss function
loss_function = nn.BCEWithLogitsLoss(reduction='mean')
# define learning rate and optimization strategy
learning_rate = 1e-3
encoder_optimizer = torch.optim.Adam(encoder_train.parameters(), lr=learning_rate)
decoder_optimizer = torch.optim.Adam(decoder_train.parameters(), lr=learning_rate)
# specify training parameters
num_epochs = 8
mini_batch_size = 128
```

Load the data into a tensor and onto GPU.

```
# convert pre-processed data to a float32 pytorch tensor
# (the explicit cast also covers one-hot columns stored as booleans)
torch_dataset = torch.from_numpy(ad_subset_transformed.values.astype(np.float32))
# build the data loader on CPU (CUDA not enabled)
dataloader = DataLoader(torch_dataset, batch_size=mini_batch_size, shuffle=True, num_workers=0)
# note: we set num_workers to zero to retrieve deterministic results
# if CUDA is available, rebuild the loader on top of the GPU-resident tensor
if torch.backends.cudnn.version() is not None and USE_CUDA:
    dataloader = DataLoader(torch_dataset.cuda(), batch_size=mini_batch_size, shuffle=True)
```

Now to our training. (Note: I advise against copy-pasting the code below, as the formatting may break. Please get the code from the GitHub link mentioned below.)

```
# init collection of per-epoch losses
losses = []
# convert encoded transactional data to torch Variable
data = autograd.Variable(torch_dataset)
# make sure the snapshot directory exists
os.makedirs("./models", exist_ok=True)
# train autoencoder model
for epoch in range(num_epochs):
    # init mini batch counter
    mini_batch_count = 0
    # determine if CUDA is available at compute node
    use_gpu = torch.backends.cudnn.version() is not None and USE_CUDA
    if use_gpu:
        # set networks / models in GPU mode
        encoder_train.cuda()
        decoder_train.cuda()
    # set networks in training mode (apply dropout when needed)
    encoder_train.train()
    decoder_train.train()
    # start timer
    start_time = datetime.now()
    # iterate over all mini-batches
    for mini_batch_data in dataloader:
        # increase mini batch counter
        mini_batch_count += 1
        # convert mini batch to torch variable
        mini_batch_torch = autograd.Variable(mini_batch_data)
        # =================== (1) forward pass ============================
        z_representation = encoder_train(mini_batch_torch)  # encode mini-batch data
        mini_batch_reconstruction = decoder_train(z_representation)  # decode mini-batch data
        # =================== (2) compute reconstruction loss ======
        reconstruction_loss = loss_function(mini_batch_reconstruction, mini_batch_torch)
        # =================== (3) backward pass ====================
        # reset graph gradients
        decoder_optimizer.zero_grad()
        encoder_optimizer.zero_grad()
        # run backward pass
        reconstruction_loss.backward()
        # =================== (4) update model parameters =========
        decoder_optimizer.step()
        encoder_optimizer.step()
        # =================== monitor training progress ===================
        # print training progress every 1,000 mini-batches
        if mini_batch_count % 1000 == 0:
            # report the training mode: either GPU or CPU
            mode = 'GPU' if use_gpu else 'CPU'
            # print mini batch reconstruction results
            now = datetime.utcnow().strftime("%Y%m%d-%H:%M:%S")
            end_time = datetime.now() - start_time
            print('[LOG {}] training status, epoch: [{:04}/{:04}], batch: {:04}, loss: {}, mode: {}, time required: {}'.format(now, (epoch+1), num_epochs, mini_batch_count, np.round(reconstruction_loss.item(), 4), mode, end_time))
            # reset timer
            start_time = datetime.now()
    # =================== evaluate model performance ================
    # move networks to CPU and set evaluation mode (don't apply dropout)
    encoder_train.cpu().eval()
    decoder_train.cpu().eval()
    # gradients are not needed for evaluation, so skip building the graph
    with torch.no_grad():
        # reconstruct encoded transactional data
        reconstruction = decoder_train(encoder_train(data))
        # determine reconstruction loss - all transactions
        reconstruction_loss_all = loss_function(reconstruction, data)
    # collect reconstruction loss
    losses.append(reconstruction_loss_all.item())
    # print reconstruction loss results
    now = datetime.utcnow().strftime("%Y%m%d-%H:%M:%S")
    print('[LOG {}] training status, epoch: [{:04}/{:04}], loss: {:.10f}'.format(now, (epoch+1), num_epochs, reconstruction_loss_all.item()))
    # =================== save model snapshot to disk ================
    # save trained encoder model file to disk
    encoder_model_name = "ep_{}_encoder_model.pth".format((epoch+1))
    torch.save(encoder_train.state_dict(), os.path.join("./models", encoder_model_name))
    # save trained decoder model file to disk
    decoder_model_name = "ep_{}_decoder_model.pth".format((epoch+1))
    torch.save(decoder_train.state_dict(), os.path.join("./models", decoder_model_name))
```

Plotting the losses.

```
# plot the training progress
plt.plot(range(0, len(losses)), losses)
plt.xlabel('[training epoch]')
plt.xlim([0, len(losses)])
plt.ylabel('[reconstruction-error]')
#plt.ylim([0.0, 1.0])
plt.title('AENN training performance')
```

This completes our training. Now let’s look at how to leverage our models to get predictions.

Load the pre-trained models.

```
# restore pretrained model checkpoint
encoder_model_name = "ep_8_encoder_model.pth"
decoder_model_name = "ep_8_decoder_model.pth"
# init training network classes / architectures
encoder_eval = encoder()
decoder_eval = decoder()
# load trained models
encoder_eval.load_state_dict(torch.load(os.path.join("models", encoder_model_name)))
decoder_eval.load_state_dict(torch.load(os.path.join("models", decoder_model_name)))
```

Perform the reconstruction for the whole dataset.

```
# convert encoded transactional data to torch Variable
data = autograd.Variable(torch_dataset)
# set networks in evaluation mode (don't apply dropout)
encoder_eval.eval()
decoder_eval.eval()
# reconstruct encoded transactional data
reconstruction = decoder_eval(encoder_eval(data))
```

Get the reconstruction loss for the whole dataset.

```
# determine reconstruction loss - all transactions
reconstruction_loss_all = loss_function(reconstruction, data)
print('reconstruction loss: {:.10f}'.format(reconstruction_loss_all.item()))
Out[#]: reconstruction loss: 0.0034663924
```

Determine reconstruction loss for individual transactions.

```
# init binary cross entropy errors
reconstruction_loss_transaction = np.zeros(reconstruction.size()[0])
# iterate over all detailed reconstructions
for i in range(0, reconstruction.size()[0]):
    # determine reconstruction loss - individual transactions
    reconstruction_loss_transaction[i] = loss_function(reconstruction[i], data[i]).item()
```
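Looping over more than half a million rows in Python can be slow. As an alternative sketch (not part of the original code), the same per-transaction values can be computed in chunks by asking BCEWithLogitsLoss for unreduced, element-wise losses and averaging over the attribute dimension. The helper name per_row_bce_with_logits, the batch_size default, and the toy tensors are assumptions for illustration.

```python
import torch
from torch import nn

def per_row_bce_with_logits(reconstruction, data, batch_size=4096):
    """Return one mean BCE-with-logits value per row, computed in chunks."""
    elementwise = nn.BCEWithLogitsLoss(reduction='none')
    row_losses = []
    with torch.no_grad():  # no gradients are needed for scoring
        for start in range(0, data.size(0), batch_size):
            chunk = elementwise(reconstruction[start:start + batch_size],
                                data[start:start + batch_size])
            row_losses.append(chunk.mean(dim=1))  # average over attributes
    return torch.cat(row_losses).cpu().numpy()

# Tiny made-up example: 3 rows, 4 attributes.
toy_data = torch.tensor([[1., 0., 0., 1.],
                         [0., 1., 0., 0.],
                         [1., 1., 0., 0.]])
toy_logits = torch.tensor([[4., -4., -4., 4.],     # matches row 0
                           [-4., 4., -4., -4.],    # matches row 1
                           [-4., -4., 4., 4.]])    # opposite of row 2
toy_scores = per_row_bce_with_logits(toy_logits, toy_data)
print(toy_scores)
```

On the real data this would be called as per_row_bce_with_logits(reconstruction, data), giving an array that can stand in for reconstruction_loss_transaction.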

Plot the data points by their reconstruction losses, colored according to their labels.

```
# prepare plot
fig = plt.figure()
ax = fig.add_subplot(111)
# assign unique id to transactions
plot_data = np.column_stack((np.arange(len(reconstruction_loss_transaction)), reconstruction_loss_transaction))
# obtain regular transactions as well as global and local anomalies
regular_data = plot_data[label == 'regular']
global_outliers = plot_data[label == 'global']
local_outliers = plot_data[label == 'local']
# plot reconstruction error scatter plot
ax.scatter(regular_data[:, 0], regular_data[:, 1], c='C0', alpha=0.4, marker="o", label='regular') # plot regular transactions
ax.scatter(global_outliers[:, 0], global_outliers[:, 1], c='C1', marker="^", label='global') # plot global outliers
ax.scatter(local_outliers[:, 0], local_outliers[:, 1], c='C2', marker="^", label='local') # plot local outliers
# add plot legend of transaction classes
ax.legend(loc='best')
```

The plot shows how the chosen approach separates the anomalies from the regular transactions in a highly imbalanced dataset. Let’s look at how many anomalies were identified.

```
ad_dataset['label'] = label
ad_dataset[reconstruction_loss_transaction >= 0.1].label.value_counts()
Out[#]: global    59
        local      2
        Name: label, dtype: int64
ad_dataset[(reconstruction_loss_transaction >= 0.018) & (reconstruction_loss_transaction < 0.05)].label.value_counts()
Out[#]: local    23
        Name: label, dtype: int64
```

As you can see, 59 of the 70 global anomalies were detected (84.3%). Of the 30 local anomalies, 23 fall in the 0.018 to 0.05 band (76.7%), and 2 more exceed the 0.1 threshold. The outputs above list no regular transactions in either band. That is a strong result considering that anomalies make up only about 0.019% of the whole data.

Here is the GitHub link for the code implementation along with the dataset.
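Counts like the ones above can be turned into per-label shares with a small helper, which also makes it easy to see how many regular transactions a given threshold would flag. The sketch below is an illustration only: the function name flagged_share and the toy scores and labels are made up, and the threshold is just one of the values used above.

```python
import numpy as np
import pandas as pd

def flagged_share(scores, labels, threshold):
    """Share of each label whose score is at or above the threshold."""
    flagged = pd.Series(np.asarray(scores) >= threshold, index=labels.index)
    return flagged.groupby(labels).mean()

# Made-up scores and labels for illustration only.
toy_labels = pd.Series(['regular', 'regular', 'regular', 'global', 'global', 'local'])
toy_scores = np.array([0.001, 0.002, 0.020, 0.300, 0.150, 0.030])

shares = flagged_share(toy_scores, toy_labels, threshold=0.018)
print(shares)
```

On the real data, the call would be flagged_share(reconstruction_loss_transaction, label, threshold) for whichever threshold is under consideration.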

I hope this gives a clear understanding of the approach and how to implement it.

##### Conclusion

This shows that applying deep learning algorithms to classical machine learning problems on structured data can give promising results when designed well. Identifying the right algorithm, an appropriate loss function and a suitable dataset can help data scientists tap into deep learning and improve on long-established approaches. The use case in this article involves financial transactions, but the idea of deep anomaly detection can be extended to other domains such as manufacturing and marketing.
