Deep Anomaly Detection for large scale enterprise data

ARTIFICIAL INTELLIGENCE

FEBRUARY 4, 2020


In generic terms, anomaly detection aims to distinguish events that are rare or that deviate from the norm. This matters a great deal in finance: in consumer banking, for example, an anomaly can be something critical such as credit card fraud. In other cases, an anomaly is something a company actively looks for because it can be put to use. Other applications include intrusion detection in communication networks, fake news and misinformation detection, healthcare analysis, industrial damage detection, manufacturing, and security and surveillance.

The use case in this article comes from the SAP domain, specifically Finance. The business goal is to find anomalous behavior in financial transactions.

A typical financial transaction in an Accounting Information System would look like this.

Most such entries are regular transactions, but a small number show malicious behavior and turn out to be anomalies. Fraud detection is the most common use case across financial domains, and anomaly detection methods can help substantially where finding fraud would otherwise take a great deal of manual effort.

In this article, I will talk about a cutting-edge, deep learning-based anomaly detection method that uses an Autoencoder Neural Network (AENN).

About the dataset

The dataset used for this use case can be found at the GitHub link provided. It is a synthetic dataset of financial data, modified to resemble the real-world data one usually observes in SAP ERP systems, especially in the Finance and Cost Controlling module.

The dataset contains 7 categorical and 2 numerical attributes available in the FICO BKPF table (containing the posted journal entry headers) and the BSEG table (containing the posted journal entry segments).

One more attribute, “label”, records the true nature of each transaction: regular or anomalous (local or global). It is provided to validate the model and is not used during training.

Classification of anomalies:

In industry, anomalies are classified in many ways depending on the use case. A detailed examination of real-world journal entries, as usually recorded in large-scale AIS or ERP systems, reveals two prevalent characteristics:

  • specific transaction attributes exhibit a high variety of distinct attribute values, e.g. customer information, posted sub-ledgers, amount information, and
  • the transactions exhibit strong dependencies between specific attribute values, e.g. between customer information and type of payment, or between posting type and general ledgers.

Derived from this observation, two classes of anomalous journal entries can be distinguished, namely “global” and “local” anomalies.

    Global accounting anomalies are journal entries that exhibit unusual or rare individual attribute values. Such anomalies usually relate to skewed attributes, e.g. rarely used ledgers or unusual posting times. Traditionally, the “red flag” tests performed by auditors during an annual audit are designed to capture this type of anomaly. However, such tests often result in a high volume of false-positive alerts due to events such as reverse postings, provisions and year-end adjustments, which are usually associated with a low fraud risk. Furthermore, according to auditors and forensic accountants, “global” anomalies often turn out to be “errors” rather than “fraud”.

    Local accounting anomalies are journal entries that exhibit an unusual or rare combination of attribute values while the individual attribute values occur quite frequently, e.g. unusual accounting records, irregular combinations of general ledger accounts, or user accounts used by several accounting departments. This type of anomaly is significantly more difficult to detect, since perpetrators intend to disguise their activities by imitating a regular activity pattern. As a result, such anomalies usually pose a high fraud risk, since they correspond to processes and activities that might not be conducted in compliance with organizational standards.
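
To make the distinction concrete, here is a minimal sketch on a made-up toy ledger. The `(ledger, user)` pairs, the `min_count` cut-off and the helper names `is_global` and `is_local` are all hypothetical and not taken from the dataset. Counting single values is enough to flag the global anomaly, while the local one only shows up when value pairs are counted.

```python
from collections import Counter

# Hypothetical toy journal entries as (ledger, user) pairs.
entries = (
    [("A100", "alice")] * 50
    + [("B200", "bob")] * 50
    + [("Z999", "alice")]   # global: ledger Z999 is rare on its own
    + [("A100", "bob")]     # local: both values are common, the pair is not
)

ledger_counts = Counter(ledger for ledger, _ in entries)
user_counts = Counter(user for _, user in entries)
pair_counts = Counter(entries)

def is_global(entry, min_count=5):
    """Flag entries with at least one individually rare attribute value."""
    ledger, user = entry
    return ledger_counts[ledger] < min_count or user_counts[user] < min_count

def is_local(entry, min_count=5):
    """Flag entries whose values are all common but whose combination is rare."""
    return not is_global(entry, min_count) and pair_counts[entry] < min_count

global_hits = sorted({e for e in entries if is_global(e)})
local_hits = sorted({e for e in entries if is_local(e)})
print(global_hits)
print(local_hits)
```

With hundreds of one-hot attributes, counting every combination by hand does not scale, which is part of the motivation for letting a network learn the regular combinations instead.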

    Prerequisites: Readers are expected to be familiar with the basics of how neurons and neural networks work in deep learning. Here is an excellent tutorial to give you a precise understanding of neural networks.

    Anomaly Detection using Autoencoder Neural Networks: Theory

    Autoencoders have been widely used in computer vision and speech processing. But it is a little known fact that they can also be used for anomaly detection. In this section, we introduce the main elements of autoencoder neural networks.

    A typical autoencoder consists of two non-linear mapping functions, the encoder f and the decoder g, each implemented as a neural network. The encoder usually follows a funnel-like design with a decreasing number of neurons per layer, and the decoder is typically its symmetric mirror. Between them sits a hidden central layer of lower dimension, referred to as the latent layer. It holds a compressed representation of the input that is rich enough to reconstruct it with minimal reconstruction error.

    The idea behind using this algorithmic paradigm for anomaly detection consists of two main steps: learning the normal behavior of the system (based on past data) and detecting anomalous behavior in real-time (by processing real-time data).

    Because the dataset is heavily skewed towards regular transactions, the network learns to reconstruct a regular transaction well and fails to do so for an anomaly. A high reconstruction error therefore indicates that a transaction is an anomaly rather than a regular one. Here our loss function is the reconstruction error itself.

    
    Loss function (reconstruction error) = arg min || x - g(f(x)) ||
    

    In this use case, we used the binary cross-entropy loss, given by:

    
    −(x log(x’) + (1 − x) log(1 − x’))
    

    Here x is the input data and x’ is its reconstruction g(f(x)). The loss measures how closely the reconstruction matches the input: the lower the loss, the more similar the two are.
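
As a small worked example of the formula above, the sketch below evaluates the binary cross-entropy by hand for one close and one poor reconstruction. The vectors and the helper name `binary_cross_entropy` are made up for illustration, and averaging over the attributes is an assumption that mirrors a mean reduction.

```python
import math

def binary_cross_entropy(x, x_rec, eps=1e-12):
    """Mean of -(x*log(x') + (1-x)*log(1-x')) over all attributes."""
    total = 0.0
    for xi, ri in zip(x, x_rec):
        ri = min(max(ri, eps), 1.0 - eps)  # keep log() finite
        total += -(xi * math.log(ri) + (1.0 - xi) * math.log(1.0 - ri))
    return total / len(x)

x = [1.0, 0.0, 0.0, 1.0]       # one-hot style input
good = [0.9, 0.1, 0.1, 0.9]    # close reconstruction
bad = [0.1, 0.9, 0.9, 0.1]     # poor reconstruction

loss_good = binary_cross_entropy(x, good)
loss_bad = binary_cross_entropy(x, bad)
print(round(loss_good, 4), round(loss_bad, 4))  # prints: 0.1054 2.3026
```

The poor reconstruction is penalized roughly twenty times more heavily, and that gap is what the anomaly score relies on.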

    Implementation

    Note: This is where it gets a bit technical, so I advise non-technical readers to skip this section. You can go through it, but don’t be intimidated by it 🙂

    Import the necessary libraries and set some parameters.

    
    # importing utilities
    import os
    import sys
    from datetime import datetime
    # importing data science libraries
    import pandas as pd
    import random as rd
    import numpy as np
    # importing pytorch libraries
    import torch
    from torch import nn
    from torch import autograd
    from torch.utils.data import DataLoader
    # import visualization libraries
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    import seaborn as sns
    from IPython.display import Image, display
    sns.set_style('darkgrid')
    # ignore potential warnings
    import warnings
    warnings.filterwarnings("ignore")
    

    Set random seed and use GPU if available.

    
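    # USE_CUDA must be defined before the check below reads it
    USE_CUDA = True
    rseed = 1234
    rd.seed(rseed)
    np.random.seed(rseed)
    torch.manual_seed(rseed)
    # also seed the GPU random number generator when CUDA is in use
    if (torch.backends.cudnn.version() != None and USE_CUDA == True):
        torch.cuda.manual_seed(rseed)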
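
As an alternative to the `USE_CUDA` flag, here is a minimal sketch of the common device-object pattern in PyTorch. It is not what the article's code uses; the variable names `device`, `layer` and `batch` are illustrative only.

```python
import torch

# Pick the GPU when one is actually usable, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Any module or tensor can then be moved with .to(device) or created on it.
layer = torch.nn.Linear(4, 2).to(device)
batch = torch.rand(8, 4, device=device)
output = layer(batch)
print(device.type, tuple(output.shape))
```

The same code then runs unchanged on machines with and without a GPU.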
    

    Import the data into a pandas data frame.

    
    ad_dataset = pd.read_csv('./data/fraud_dataset_v2.csv')
    ad_dataset.head()
    

    Look at shape and label value_counts.

    
    ad_dataset.shape
    Out[#]: (533009, 10)
    ad_dataset.label.value_counts()
    Out[#]: regular    532909
             global         70
             local          30
             Name: label, dtype: int64
    

    As you can see, it’s a highly imbalanced dataset, which is true of most real-world data. Anomalies make up about 0.019% of the total (100 out of 533,009). Typical supervised machine learning algorithms tend to perform poorly in such cases. The approach shown in this article instead leverages autoencoders to find the anomalies.

    Let’s remove the label before further processing, since the autoencoder is trained as an unsupervised technique.

    
    label = ad_dataset.pop('label')
    

    Now let’s split the categorical and numerical attributes. One-hot encode the categorical attributes to vectorize them, and apply log scaling followed by min-max scaling to the numerical attributes.

    
    categorical_attr = ['KTOSL', 'PRCTR', 'BSCHL', 'HKONT', 'WAERS', 'BUKRS']
    ad_dataset_categ_transformed = pd.get_dummies(ad_dataset[categorical_attr])
    numeric_attr_names = ['DMBTR', 'WRBTR']
    # add a small epsilon to eliminate zero values from data for log scaling
    numeric_attr = ad_dataset[numeric_attr_names] + 1e-7
    numeric_attr = numeric_attr.apply(np.log)
    ad_dataset_numeric_attr = (numeric_attr - numeric_attr.min()) / (numeric_attr.max() - numeric_attr.min())
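
To see what these transformations produce, here is a minimal sketch on a hypothetical three-row frame. The column names `currency` and `amount` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are made up.
toy = pd.DataFrame({
    "currency": ["EUR", "USD", "EUR"],
    "amount": [0.0, 100.0, 10000.0],
})

# One-hot encode the categorical column: one 0/1 column per distinct value.
one_hot = pd.get_dummies(toy[["currency"]]).astype(float)

# Log-scale (the epsilon avoids log(0)), then min-max scale into [0, 1].
log_amount = np.log(toy[["amount"]] + 1e-7)
scaled = (log_amount - log_amount.min()) / (log_amount.max() - log_amount.min())

features = pd.concat([one_hot, scaled], axis=1)
print(features.columns.tolist())
print(features.shape)
```

Every resulting feature lies in [0, 1], which is the range a binary cross-entropy target is expected to have.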
    

    Concatenate the numerical and categorical attributes.

    
    ad_subset_transformed = pd.concat([ad_dataset_categ_transformed, ad_dataset_numeric_attr], axis = 1)
    ad_subset_transformed.shape
    Out[#]: (533009, 618)
    

    Now let’s implement the encoder network (618–512–256–128–64–32–16–8–4–3).

    
    # implementation of the encoder network
    class encoder(nn.Module):
        def __init__(self):
            super(encoder, self).__init__()
            # specify layer 1 - in 618, out 512
            self.encoder_L1 = nn.Linear(in_features=ad_subset_transformed.shape[1], out_features=512, bias=True) # add linearity
            nn.init.xavier_uniform_(self.encoder_L1.weight) # Xavier uniform weight initialization
            self.encoder_R1 = nn.LeakyReLU(negative_slope=0.4, inplace=True) # add non-linearity (leaky ReLU)
            # specify layer 2 - in 512, out 256
            self.encoder_L2 = nn.Linear(512, 256, bias=True)
            nn.init.xavier_uniform_(self.encoder_L2.weight)
            self.encoder_R2 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 3 - in 256, out 128
            self.encoder_L3 = nn.Linear(256, 128, bias=True)
            nn.init.xavier_uniform_(self.encoder_L3.weight)
            self.encoder_R3 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 4 - in 128, out 64
            self.encoder_L4 = nn.Linear(128, 64, bias=True)
            nn.init.xavier_uniform_(self.encoder_L4.weight)
            self.encoder_R4 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 5 - in 64, out 32
            self.encoder_L5 = nn.Linear(64, 32, bias=True)
            nn.init.xavier_uniform_(self.encoder_L5.weight)
            self.encoder_R5 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 6 - in 32, out 16
            self.encoder_L6 = nn.Linear(32, 16, bias=True)
            nn.init.xavier_uniform_(self.encoder_L6.weight)
            self.encoder_R6 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 7 - in 16, out 8
            self.encoder_L7 = nn.Linear(16, 8, bias=True)
            nn.init.xavier_uniform_(self.encoder_L7.weight)
            self.encoder_R7 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 8 - in 8, out 4
            self.encoder_L8 = nn.Linear(8, 4, bias=True)
            nn.init.xavier_uniform_(self.encoder_L8.weight)
            self.encoder_R8 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 9 - in 4, out 3
            self.encoder_L9 = nn.Linear(4, 3, bias=True)
            nn.init.xavier_uniform_(self.encoder_L9.weight)
            self.encoder_R9 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # init dropout layer with probability p
            self.dropout = nn.Dropout(p=0.0, inplace=True)

        def forward(self, x):
            # define forward pass through the network
            x = self.encoder_R1(self.dropout(self.encoder_L1(x)))
            x = self.encoder_R2(self.dropout(self.encoder_L2(x)))
            x = self.encoder_R3(self.dropout(self.encoder_L3(x)))
            x = self.encoder_R4(self.dropout(self.encoder_L4(x)))
            x = self.encoder_R5(self.dropout(self.encoder_L5(x)))
            x = self.encoder_R6(self.dropout(self.encoder_L6(x)))
            x = self.encoder_R7(self.dropout(self.encoder_L7(x)))
            x = self.encoder_R8(self.dropout(self.encoder_L8(x)))
            x = self.encoder_R9(self.encoder_L9(x))
            return x
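
The same funnel can be written more compactly. The sketch below is an alternative formulation, not the article's implementation: `build_mlp` is a hypothetical helper that stacks Linear and LeakyReLU pairs from a list of layer sizes, and it leaves out the dropout layer, which has no effect at p=0.0.

```python
import torch
from torch import nn

def build_mlp(sizes, negative_slope=0.4):
    """Stack Linear + LeakyReLU pairs for consecutive entries of `sizes`."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        linear = nn.Linear(n_in, n_out, bias=True)
        nn.init.xavier_uniform_(linear.weight)
        layers += [linear, nn.LeakyReLU(negative_slope=negative_slope)]
    return nn.Sequential(*layers)

encoder_sizes = [618, 512, 256, 128, 64, 32, 16, 8, 4, 3]
compact_encoder = build_mlp(encoder_sizes)

# A batch of 5 random rows is compressed to 5 latent vectors of size 3.
z = compact_encoder(torch.rand(5, 618))
print(tuple(z.shape))
```

Reversing the size list, `build_mlp(encoder_sizes[::-1])`, would give the mirrored decoder.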
    
    

    Instantiate the encoder and put it on the GPU.

    
    # init training network classes / architectures
    encoder_train = encoder()
    # push to cuda if cudnn is available
    if (torch.backends.cudnn.version() != None and USE_CUDA == True):
        encoder_train = encoder().cuda()
    

    Next comes the decoder network, which is the symmetric mirror of the encoder (3–4–8–16–32–64–128–256–512–618).

    
    # implementation of the decoder network
    class decoder(nn.Module):
        def __init__(self):
            super(decoder, self).__init__()
            # specify layer 1 - in 3, out 4
            self.decoder_L1 = nn.Linear(in_features=3, out_features=4, bias=True) # add linearity
            nn.init.xavier_uniform_(self.decoder_L1.weight) # Xavier uniform weight initialization
            self.decoder_R1 = nn.LeakyReLU(negative_slope=0.4, inplace=True) # add non-linearity (leaky ReLU)
            # specify layer 2 - in 4, out 8
            self.decoder_L2 = nn.Linear(4, 8, bias=True)
            nn.init.xavier_uniform_(self.decoder_L2.weight)
            self.decoder_R2 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 3 - in 8, out 16
            self.decoder_L3 = nn.Linear(8, 16, bias=True)
            nn.init.xavier_uniform_(self.decoder_L3.weight)
            self.decoder_R3 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 4 - in 16, out 32
            self.decoder_L4 = nn.Linear(16, 32, bias=True)
            nn.init.xavier_uniform_(self.decoder_L4.weight)
            self.decoder_R4 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 5 - in 32, out 64
            self.decoder_L5 = nn.Linear(32, 64, bias=True)
            nn.init.xavier_uniform_(self.decoder_L5.weight)
            self.decoder_R5 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 6 - in 64, out 128
            self.decoder_L6 = nn.Linear(64, 128, bias=True)
            nn.init.xavier_uniform_(self.decoder_L6.weight)
            self.decoder_R6 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 7 - in 128, out 256
            self.decoder_L7 = nn.Linear(128, 256, bias=True)
            nn.init.xavier_uniform_(self.decoder_L7.weight)
            self.decoder_R7 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 8 - in 256, out 512
            self.decoder_L8 = nn.Linear(256, 512, bias=True)
            nn.init.xavier_uniform_(self.decoder_L8.weight)
            self.decoder_R8 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # specify layer 9 - in 512, out 618
            self.decoder_L9 = nn.Linear(in_features=512, out_features=ad_subset_transformed.shape[1], bias=True)
            nn.init.xavier_uniform_(self.decoder_L9.weight)
            self.decoder_R9 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
            # init dropout layer with probability p
            self.dropout = nn.Dropout(p=0.0, inplace=True)

        def forward(self, x):
            # define forward pass through the network
            x = self.decoder_R1(self.dropout(self.decoder_L1(x)))
            x = self.decoder_R2(self.dropout(self.decoder_L2(x)))
            x = self.decoder_R3(self.dropout(self.decoder_L3(x)))
            x = self.decoder_R4(self.dropout(self.decoder_L4(x)))
            x = self.decoder_R5(self.dropout(self.decoder_L5(x)))
            x = self.decoder_R6(self.dropout(self.decoder_L6(x)))
            x = self.decoder_R7(self.dropout(self.decoder_L7(x)))
            x = self.decoder_R8(self.dropout(self.decoder_L8(x)))
            x = self.decoder_R9(self.decoder_L9(x))
            return x
    

    Instantiate the decoder and put it on GPU.

    
    # init training network classes / architectures
    decoder_train = decoder()
    # push to cuda if cudnn is available
    if (torch.backends.cudnn.version() != None) and (USE_CUDA == True):
        decoder_train = decoder().cuda()
    

    Now set the loss function and some hyperparameters.

    
    # define the optimization criterion / loss function
    loss_function = nn.BCEWithLogitsLoss(reduction='mean')
    # define learning rate and optimization strategy
    learning_rate = 1e-3
    encoder_optimizer = torch.optim.Adam(encoder_train.parameters(), lr=learning_rate)
    decoder_optimizer = torch.optim.Adam(decoder_train.parameters(), lr=learning_rate)
    # specify training parameters
    num_epochs = 8
    mini_batch_size = 128
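
`nn.BCEWithLogitsLoss` applies a sigmoid to its first argument internally before computing the binary cross-entropy, so the raw decoder output is passed in directly. The sketch below checks this equivalence on random stand-in tensors. The tensors `logits` and `targets` are made up and are not the article's data.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 6)    # stand-in for raw decoder outputs
targets = torch.rand(4, 6)    # stand-in for inputs scaled into [0, 1]

# One step: sigmoid and binary cross-entropy fused in a single loss.
with_logits = nn.BCEWithLogitsLoss(reduction='mean')(logits, targets)

# Two steps: explicit sigmoid followed by plain binary cross-entropy.
two_step = nn.BCELoss(reduction='mean')(torch.sigmoid(logits), targets)

print(torch.allclose(with_logits, two_step, atol=1e-6))
```

The fused version is generally preferred because it is numerically more stable for large positive or negative outputs.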
    

    Load the data into a tensor and onto GPU.

    
    # convert pre-processed data to pytorch tensor
    # (cast to float32 first: newer pandas versions return boolean one-hot columns)
    torch_dataset = torch.from_numpy(ad_subset_transformed.values.astype(np.float32)).float()
    # convert to pytorch tensor - none cuda enabled
    dataloader = DataLoader(torch_dataset, batch_size=mini_batch_size, shuffle=True, num_workers=0)
    # note: we set num_workers to zero to retrieve deterministic results
    # determine if CUDA is available at compute node
    if (torch.backends.cudnn.version() != None) and (USE_CUDA == True):
        dataloader = DataLoader(torch_dataset.cuda(), batch_size=mini_batch_size, shuffle=True)
    

    Now to our training. (Note: I advise against copy-pasting the code below, as the formatting may break. Please get the code from the GitHub link mentioned below.)

    
    # init collection of per-epoch losses
    losses = []
    # convert encoded transactional data to torch Variable
    data = autograd.Variable(torch_dataset)
    # make sure the checkpoint directory exists before saving to it
    os.makedirs("./models", exist_ok=True)
    # train autoencoder model
    for epoch in range(num_epochs):
        # init mini batch counter
        mini_batch_count = 0
        # determine if CUDA is available at compute node
        if (torch.backends.cudnn.version() != None) and (USE_CUDA == True):
            # set networks / models in GPU mode
            encoder_train.cuda()
            decoder_train.cuda()
        # set networks in training mode (apply dropout when needed)
        encoder_train.train()
        decoder_train.train()
        # start timer
        start_time = datetime.now()
        # iterate over all mini-batches
        for mini_batch_data in dataloader:
            # increase mini batch counter
            mini_batch_count += 1
            # convert mini batch to torch variable
            mini_batch_torch = autograd.Variable(mini_batch_data)
            # =================== (1) forward pass ============================
            z_representation = encoder_train(mini_batch_torch) # encode mini-batch data
            mini_batch_reconstruction = decoder_train(z_representation) # decode mini-batch data
            # =================== (2) compute reconstruction loss =============
            reconstruction_loss = loss_function(mini_batch_reconstruction, mini_batch_torch)
            # =================== (3) backward pass ===========================
            # reset graph gradients
            decoder_optimizer.zero_grad()
            encoder_optimizer.zero_grad()
            # run backward pass
            reconstruction_loss.backward()
            # =================== (4) update model parameters =================
            decoder_optimizer.step()
            encoder_optimizer.step()
            # =================== monitor training progress ===================
            # print training progress every 1,000 mini-batches
            if mini_batch_count % 1000 == 0:
                # report the training mode: either GPU or CPU
                mode = 'GPU' if (torch.backends.cudnn.version() != None) and (USE_CUDA == True) else 'CPU'
                # print mini batch reconstruction results
                now = datetime.utcnow().strftime("%Y%m%d-%H:%M:%S")
                end_time = datetime.now() - start_time
                print('[LOG {}] training status, epoch: [{:04}/{:04}], batch: {:04}, loss: {}, mode: {}, time required: {}'.format(now, (epoch+1), num_epochs, mini_batch_count, np.round(reconstruction_loss.item(), 4), mode, end_time))
                # reset timer
                start_time = datetime.now()
        # =================== evaluate model performance ======================
        # set networks in evaluation mode on the CPU (don't apply dropout)
        encoder_train.cpu().eval()
        decoder_train.cpu().eval()
        # no gradients are needed for evaluation, which saves memory
        with torch.no_grad():
            # reconstruct encoded transactional data
            reconstruction = decoder_train(encoder_train(data))
            # determine reconstruction loss - all transactions
            reconstruction_loss_all = loss_function(reconstruction, data)
        # collect reconstruction loss
        losses.extend([reconstruction_loss_all.item()])
        # print reconstruction loss results
        now = datetime.utcnow().strftime("%Y%m%d-%H:%M:%S")
        print('[LOG {}] training status, epoch: [{:04}/{:04}], loss: {:.10f}'.format(now, (epoch+1), num_epochs, reconstruction_loss_all.item()))
        # =================== save model snapshot to disk =====================
        # save trained encoder model file to disk
        encoder_model_name = "ep_{}_encoder_model.pth".format((epoch+1))
        torch.save(encoder_train.state_dict(), os.path.join("./models", encoder_model_name))
        # save trained decoder model file to disk
        decoder_model_name = "ep_{}_decoder_model.pth".format((epoch+1))
        torch.save(decoder_train.state_dict(), os.path.join("./models", decoder_model_name))
    

    Plotting the losses.

    
    # plot the training progress
    plt.plot(range(0, len(losses)), losses)
    plt.xlabel('[training epoch]')
    plt.xlim([0, len(losses)])
    plt.ylabel('[reconstruction-error]')
    #plt.ylim([0.0, 1.0])
    plt.title('AENN training performance')
    

    This completes our training. Now let’s look at how to leverage our models to get predictions.

    Load the pre-trained models.

    
    # restore pretrained model checkpoint
    encoder_model_name = "ep_8_encoder_model.pth"
    decoder_model_name = "ep_8_decoder_model.pth"
    # init training network classes / architectures
    encoder_eval = encoder()
    decoder_eval = decoder()
    # load trained models
    encoder_eval.load_state_dict(torch.load(os.path.join("models", encoder_model_name)))
    decoder_eval.load_state_dict(torch.load(os.path.join("models", decoder_model_name)))
    

    Perform the reconstruction for the whole dataset.

    
    # convert encoded transactional data to torch Variable
    data = autograd.Variable(torch_dataset)
    # set networks in evaluation mode (don't apply dropout)
    encoder_eval.eval()
    decoder_eval.eval()
    # reconstruct encoded transactional data
    reconstruction = decoder_eval(encoder_eval(data))
    

    Get the reconstruction loss for the whole dataset.

    
    # determine reconstruction loss - all transactions
    reconstruction_loss_all = loss_function(reconstruction, data)
    print('reconstruction loss: {:.10f}'.format(reconstruction_loss_all.item()))
    Out[#]: reconstruction loss: 0.0034663924
    

    Determine reconstruction loss for individual transactions.

    
    # init binary cross entropy errors
    reconstruction_loss_transaction = np.zeros(reconstruction.size()[0])
    # iterate over all detailed reconstructions
    for i in range(0, reconstruction.size()[0]):
        # determine reconstruction loss - individual transactions
        reconstruction_loss_transaction[i] = loss_function(reconstruction[i], data[i]).item()
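
Looping over more than half a million rows in Python can be slow. As a sketch of a vectorized alternative (not the article's code), the per-transaction losses can be computed in one call with `reduction='none'` and then averaged over the attribute axis. The tensors `reconstruction_demo` and `data_demo` are small random stand-ins.

```python
import torch
from torch import nn

torch.manual_seed(0)
reconstruction_demo = torch.randn(6, 4)   # hypothetical decoder outputs
data_demo = torch.rand(6, 4)              # hypothetical encoded inputs

# reduction='none' keeps one loss value per element, shape (rows, columns).
elementwise = nn.BCEWithLogitsLoss(reduction='none')(reconstruction_demo, data_demo)

# Averaging over the attribute axis gives one loss per transaction.
per_row = elementwise.mean(dim=1)

# Reference: the row-by-row loop with reduction='mean'.
mean_loss = nn.BCEWithLogitsLoss(reduction='mean')
looped = torch.stack([mean_loss(reconstruction_demo[i], data_demo[i]) for i in range(6)])

print(torch.allclose(per_row, looped, atol=1e-6))
```

Calling `per_row.detach().numpy()` would then yield an array that can stand in for `reconstruction_loss_transaction`.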
    

    Plot the data points by their reconstruction losses, together with their labels.

    
    # prepare plot
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # assign unique id to transactions
    plot_data = np.column_stack((np.arange(len(reconstruction_loss_transaction)), reconstruction_loss_transaction))
    # obtain regular transactions as well as global and local anomalies
    regular_data = plot_data[label == 'regular']
    global_outliers = plot_data[label == 'global']
    local_outliers = plot_data[label == 'local']
    # plot reconstruction error scatter plot
    ax.scatter(regular_data[:, 0], regular_data[:, 1], c='C0', alpha=0.4, marker="o", label='regular') # plot regular transactions
    ax.scatter(global_outliers[:, 0], global_outliers[:, 1], c='C1', marker="^", label='global') # plot global outliers
    ax.scatter(local_outliers[:, 0], local_outliers[:, 1], c='C2', marker="^", label='local') # plot local outliers
    # add plot legend of transaction classes
    ax.legend(loc='best')
    

    The plot shows how the chosen approach separates the anomalies from the regular transactions in a highly imbalanced dataset. Let’s look at how many anomalies were identified.

    
    ad_dataset['label'] = label
    ad_dataset[reconstruction_loss_transaction >= 0.1].label.value_counts()
    Out[#]: global    59
            local      2
            Name: label, dtype: int64
    ad_dataset[(reconstruction_loss_transaction >= 0.018) & (reconstruction_loss_transaction < 0.05)].label.value_counts()
    Out[#]: local   23
            Name: label, dtype: int64
    

    As you can see, 59 of the 70 global anomalies were detected, which is about 84%. Of the 30 local anomalies, 23 fall in the 0.018 to 0.05 loss band (about 76.7%), and 2 more lie above the 0.1 threshold. Judging by the outputs above, neither band contains a regular transaction. That is a strong result considering that anomalies make up only about 0.019% of the whole dataset.

    Here is the GitHub link for the code implementation along with the dataset.
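
The percentages above can be computed for any threshold with a few lines of code. The sketch below is illustrative only: `detection_rates` is a hypothetical helper, and the losses, labels and threshold are made up rather than taken from the model.

```python
def detection_rates(losses, labels, threshold):
    """Share of each label whose reconstruction loss is >= threshold."""
    totals, hits = {}, {}
    for loss, lab in zip(losses, labels):
        totals[lab] = totals.get(lab, 0) + 1
        hits[lab] = hits.get(lab, 0) + (loss >= threshold)
    return {lab: hits[lab] / totals[lab] for lab in totals}

# Hypothetical losses and labels, for illustration only.
toy_losses = [0.002, 0.003, 0.004, 0.150, 0.220, 0.030, 0.001]
toy_labels = ["regular", "regular", "regular", "global", "global", "local", "regular"]

rates = detection_rates(toy_losses, toy_labels, threshold=0.02)
print(rates)
```

The rate reported for "regular" is the share of false alarms, so lowering the threshold to catch more local anomalies has to be weighed against it.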

    I hope this gives a clear understanding of the approach and how to implement it.

    Conclusion

    This shows that applying deep learning algorithms to classical machine learning problems on structured data can give promising results if designed well. Identifying the right algorithm, an appropriate loss function and a suitable dataset can help data scientists tap into deep learning and improve on long-established approaches. The use case in this article concerns financial transactions, but the idea of deep anomaly detection can be extended to other domains such as manufacturing and marketing.

    WRITTEN BY

    Rehan Ahmad

    AI Expert at wavelabs.ai
