Tuesday, 8 April 2025

Fighting Digital Fraud with Artificial Intelligence and Machine Learning


The goal of this paper is to give the reader a comprehensive overview of how Artificial Intelligence and Machine Learning are transforming the fight against digital fraud. It acquaints the reader with common types of digital fraud and the options for detecting them. Real-world examples from two companies then give an insight into how these solutions are put into practice.

The paper opens with the most common types of digital fraud, highlighting the differences between serial and opportunistic fraud and the exploitation of newly emerged AI tools. Key AI&ML algorithms for digital fraud detection are then discussed, covering both supervised and unsupervised methods. The workflow of model development is also explained, showing the importance of appropriate model selection. Next, an industry example of successful fraud detection is presented, along with a discussion of its consequences. Finally, the last section is devoted to Resistant AI, its story and its expertise, and closes with an insight from Resistant AI’s Data Science Team Lead on the current challenges and future trends of AI&ML fraud detection.

1 Introduction to digital fraud

1.1 The growing threat of digital fraud

With ongoing digitalisation, an increasing number of institutions are shifting their services online, moving closer to the idea of automated administration. ID verification, document submission or even contract signing no longer require in-person interaction. These digitalised processes offer lower costs, faster execution and more convenience for customers. On the other hand, such technological advancements open new possibilities for exploitation by digital fraudsters. In addition, the cybercriminal ecosystem is becoming increasingly industrialised, giving even non-technical fraudsters access to digital exploitation tools (Bank for International Settlements, 2024).

To give an example, AI photo editors have emerged as powerful tools, but their advanced capabilities are a double-edged sword. Fraudsters can now generate multiple identities with just a couple of clicks. Adding a beard, changing eye colour or virtually ageing a person? In just a few steps, a number of fraud-ready synthetic identities can be created. The combination of these factors could have only one result: digital fraud is at an all-time high. According to Veriff’s 2024 Fraud Report, there was a 20% year-over-year increase in overall digital fraud in 2024. The strongest surge was seen in the e-commerce sector, with an alarming 40% increase in net fraud compared to 2022 (Veriff, 2024).

Picture 1: Annual mean fraud rate comparison for different sectors in 2022 vs 2023

(Source: veriff.com/ebooks/veriff-fraud-report-2024 )

With the statistics mentioned above, there is no doubt that digitalisation is like a honey pot for digital criminals. Globally, companies are annually targeted by not thousands but millions of fraudulent activities, resulting in significant financial losses, reputational damage and increased operational expenses. But what techniques in particular do fraudsters employ to deliver such damage and achieve their malicious objectives?

1.2 Types of digital fraud

Digital fraud is a broad term covering various forms of deceptive practices, but in this paper we focus primarily on the types that can be effectively detected by machine learning techniques.

Additionally, we should distinguish between serial and opportunistic fraud, as these categories differ significantly in their characteristics, primarily in scale and in the methods used. Serial fraud is repetitive, organized fraudulent activity, typically characterized by a high level of sophistication and persistence and often involving automated processes. In practice, this could be the reproduction and distribution of fake documents, chargeback fraud or subscription abuse. Serial fraud can be further divided into decentralized and concentrated. Decentralized serial fraud resembles a fraud-as-a-service model, in which criminals distribute editable templates via internet marketplaces and social media to a large segment of end users. These individuals then commit crimes on their own using the acquired documents. Concentrated serial fraud, on the other hand, is run by highly organized crime groups, who use technical tools and iterative experimentation to test and bypass automated controls and create a large number of accounts, which can then be used for various financial crimes (Resistant AI, 2023).

Opportunistic fraud occurs when individuals exaggerate or manipulate otherwise legitimate claims to gain an unfair advantage or financial benefit. This type of fraud is commonly observed in the insurance industry, where policyholders inflate the value of their claims, falsify details about an incident or misrepresent circumstances. An example could be a policyholder “photoshopping” details about an injury to receive a higher payout from the insurance company.

1.2.1 Document forgery

Moving to the specific fraud techniques used by criminals, document forgery is the act of creating a falsified document from scratch, imitating a genuine one. This type of fraud is as old as documents themselves, but with the increasing sophistication of technology, it has taken on new and more complex forms. It was once an underground industry requiring artistic skill; nowadays all criminals need is a computer and image editing software. This means that document forgery is becoming easier and vastly more common. Still, just because making fraudulent documents is easier than ever, it doesn’t necessarily mean that creating a good fake document is easy: fraudsters often give themselves away with typos, unprofessional formatting or non-matching fonts.

1.2.2 Synthetic identity fraud

Next on the list is synthetic identity fraud, a newer form of identity fraud that combines real and fake information to create a new fictional identity. This combination makes synthetic fraud particularly challenging to detect. Criminals may, for instance, begin by stealing photos of personal identification or credit card information. On their own, these pieces of information may not be sufficient to open a bank account, but document forgery can easily fill in the gaps to make an unauthorized registration successful.

Picture 2: Illustration of how a synthetic identity could be created

(Source: omnisecure.berlin/wp-content/uploads/os23_Muerl_Carsten.pdf )

According to MasterCard’s Fraud Prevention e-book, synthetic identity fraud has recently surpassed credit card fraud and identity theft and is now the fastest-growing crime in the world (MasterCard, 2024). As stated by TransUnion, synthetic identity fraud was up 132% in 2022, with 46% of global companies having experienced such crime that year (TransUnion, 2023).

1.2.3 Template fraud

Picture a website selling pre-designed layouts used to create documents. Formally a legal business, right? Yes, but the templates on offer are used almost exclusively to commit digital crimes. Surprisingly, these marketplaces, often referred to as “template farms”, operate with a level of organization that goes far beyond what meets the eye. While they resemble ordinary template websites like Canva or Freepik on the surface, their true nature becomes apparent in their offerings. These platforms provide thousands of files designed to mimic official documents such as utility bills, bank statements and passports. Moreover, criminal organizations operate strategically, spreading tens of thousands of links to their template farms across the internet to increase their reach. Many of the websites share a similar structure, differing only in logo and name, which hints that they may be operated by the same organization. Customers visiting these sites are frequently redirected to Telegram, where they are offered “24/7 support” to facilitate their fraudulent activities and ensure seamless transactions (Resistant AI, 2024).

Picture 3: An example of a template farm website

(Source: resistant.ai/blog/types-of-document-fraud#heading-0)

1.2.4 Authorized Push Payments Fraud

Authorized Push Payment (APP) fraud happens when a fraudster convinces a person to authorize a payment under false pretences. This type of fraud could sometimes be categorized as pre-digital, since the process starts by deceiving a person and only then continues digitally. APP fraud takes many forms, such as purchase, investment and romance scams. Purchase fraud occurs when a customer believes they are making a legitimate payment for goods or services when, in reality, the product does not exist. This scam typically takes place online or through social media, with scammers offering deals that seem unrealistically favourable. After making the payment, the consumer neither receives the product nor sees the money again. Romance scams, which have recently become popular, typically happen on dating apps or websites, where a fraudster creates a fake identity and pretends to build a romantic connection with the victim. The scammer builds trust and emotional commitment over several months, eventually requesting money, often under the pretence of an emergency. These types of fraud are particularly challenging to detect, but in some cases ML algorithms can still be useful (ACI Worldwide, 2024).

1.2.5 Money laundering

Money laundering owes its name to the fact that it precisely describes what takes place: illegally obtained money is put through a cycle of transactions so that it appears to have been gained legally. Money laundering traces its origins to mafia undergrounds in the 1920s. While it has evolved over time, the arrival of digital banking and online transactions has made the process more efficient and harder to detect, contributing to its continued presence.

The legitimization of funds can be divided into three stages: placement, layering and integration. In the first stage, placement, the launderer attempts to put the “dirty money” into the financial system unnoticed. This is commonly done by asking a group of people to make small deposits to their accounts, making the funds seem legitimate. This first stage is where detection is most probable. The second stage, layering, consists of a series of transactions that, by reason of their frequency, volume or complexity, appear to be legitimate. The aim of this process is to make the funds untraceable back to their criminal origin. In the last stage, integration, the launderer tries to bring the illicit money back into the economy, making the funds appear to have been earned legitimately (for example, as business earnings or property) (Organization of American States, 2013).

1.3 Why is AI&ML the key to fight digital fraud?

As previously mentioned, more and more companies are shifting their services online, with the unintended consequence of a booming digital fraud industry. With the increasing scale of these criminal activities, manual review is no longer an effective technique. When processing, for instance, 10,000 documents a day, one can imagine how time-consuming and costly manual control would be. The next, somewhat more sophisticated approach, the rule-based system, is becoming obsolete as well. While useful for detecting known patterns of fraudulent activity, static rules struggle to adapt to new and emerging threats. For example, rule-based systems are often ineffective against synthetic identity fraud and account tampering (Whitrow, Hand, & Juszczak, 2009). As criminals continuously invent new ways to commit digital fraud, rule-based systems lead to higher false negative rates and missed detections.

One solution for combating endlessly evolving criminal processes is machine learning. It offers a proactive and efficient approach that suits the scale of thousands of documents being processed daily. ML algorithms can analyse vast datasets to identify irregular patterns or behaviours that may not be apparent through traditional methods. Additionally, by learning from historical data, these algorithms can predict fraud with high accuracy. The agility of ML systems also enables companies to detect fraud in real time, saving significant amounts of both time and money (Babatope, 2024).

Companies that have already adopted machine learning methods to detect suspicious activity report impressive reductions in financial losses. According to Visa’s 2019 press article, Visa Advanced Authorization (VAA), which uses artificial intelligence, helped financial institutions prevent an estimated $25 billion in annual fraud (Visa Inc., 2019). As another example, JPMorgan Chase employs ML to continuously monitor online transactions and achieved a 50 percent reduction in credit card losses over the five years leading up to 2019 (JPMorgan Chase & Co., 2019). Moreover, the renowned payment company PayPal spends around $300 million on anti-fraud measures, with machine learning being a solid pillar of its fraud detection systems.

2 Key AI&ML techniques used for digital fraud detection

Just as many types of digital fraud exist, so do techniques to detect them, with Artificial Intelligence being one of the most popular. In particular, a branch of AI known as Machine Learning has brought the most success. Machine learning was defined in the 1950s as “a field of study that gives computers the ability to learn without explicitly being programmed”, and some 70 years later the definition still holds true.

Machine learning starts with data (tables, photos or text), which is gathered and processed to serve as training data. The next, equally important step is choosing an appropriate model, which is then trained on the available data. After training, the model is evaluated, providing developers with performance metrics such as precision, recall and F1 score, which indicate how well the model performs. If needed, an updated set of parameters is chosen and the model is retrained to give better results.

Machine learning systems serve diverse functions: descriptive systems use data to explain what happened, predictive systems predict what will happen, and prescriptive systems use data to suggest what actions to take. Additionally, machine learning can be categorized into two primary approaches: supervised and unsupervised learning. In the following sections, we will explore these categories in detail, highlighting the models most commonly used for digital fraud detection and explaining how the algorithms work.
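To make the evaluation step concrete, the following minimal Python sketch computes precision, recall and F1 score from predicted and true labels. The function names and the toy labels are invented for illustration; they do not come from any of the cited studies.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = fraud, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def precision_recall_f1(y_true, y_pred):
    """Precision: share of flagged cases that are fraud.
    Recall: share of fraud cases that were flagged.
    F1: harmonic mean of the two."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented toy data: ten transactions, three of them fraudulent.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(confusion_counts(y_true, y_pred), precision, recall, f1)
```

In this toy case the model catches two of the three fraud cases and raises one false alarm, so precision, recall and F1 all come out at two thirds even though 8 of the 10 predictions are correct.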

2.1 Supervised learning

Supervised machine learning models are trained on labelled datasets, which means that the input data is paired with the desired output. Training datasets are manually labelled by a human, who decides whether a document or transaction is fraudulent or not. Once trained, the model is given new, unlabelled data and uses its prior training to predict values or classify the data into categories. Supervised learning is particularly effective at identifying known types of fraud, but very limited in recognizing new or evolving fraud techniques.

Picture 4: Supervised learning process

(Source: geeksforgeeks.org/supervised-unsupervised-learning )

2.1.1 Logistic regression

The first algorithm we will discuss, logistic regression, estimates the probability of an event occurring, such as whether a transaction is fraudulent or not. This model, also known as the logit model, is often used for classification and predictive analysis. Since the outcome is a probability, the dependent variable lies between 0 (not fraud) and 1 (fraud), inclusive. Unlike generative algorithms, logistic regression does not create or generate new information. Instead, it assigns items to a class by estimating the probability of belonging to that class. The most commonly used variant, binomial logistic regression, distinguishes just two classes (for example, fraud/no fraud), while more complex variants such as multinomial logistic regression can predict three or more outcomes. Logistic regression is built on the sigmoid function, a curve that maps predicted values to probabilities.

Picture 5: Comparison of Linear and Logistic regression

(Source: saedsayad.com/logistic_regression.htm )

As seen on the plot, logistic regression is described by the logistic function, which is given by the following equation:

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}$$

The formula gives the probability that a given input X belongs to the positive class (Y = 1). β0 is the intercept, which sets the log odds of the positive class when all predictors (X1, X2, …, Xn) are zero, while β1, β2, …, βn are the coefficients, or weights, assigned to the input features. Equally important is the concept of log odds. In logistic regression, a logit transformation is applied to the odds, that is, the probability of success divided by the probability of failure. The result is commonly known as the log odds and can be represented by the following formula:

$$\ln\left(\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
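The relationship between the logistic function and the log odds can be checked numerically. The sketch below is only an illustration: the intercept, the weights and the two features (a scaled amount and a foreign-country flag) are hypothetical values, not coefficients from a fitted model.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, intercept, weights):
    """P(Y = 1 | X) for one observation."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def log_odds(p):
    """Logit transformation: log of P(success) / P(failure)."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients and one hypothetical transaction.
intercept = -3.0
weights = [0.8, 1.5]        # assumed: scaled amount, foreign-country flag
transaction = [2.0, 1.0]

p = predict_proba(transaction, intercept, weights)
print(round(p, 4))            # probability of fraud for this transaction
print(round(log_odds(p), 4))  # recovers the linear term -3.0 + 0.8*2.0 + 1.5*1.0
```

Applying the logit to the predicted probability returns the linear combination of the features, which is why the coefficients are interpreted as changes in log odds.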

The beta parameters are usually estimated via maximum likelihood estimation (MLE). This method tests different values of beta over multiple iterations to find the best fit of the log odds. These iterations produce the log-likelihood function, and logistic regression seeks to maximize this function to find the best parameter estimates. Once the optimal coefficients are found, the conditional probability for each observation can be calculated to yield a predicted probability. This changes slightly when logistic regression is used in machine learning. There, the negative log-likelihood is commonly used as the loss function, and gradient descent is applied to find its minimum, which is equivalent to maximizing the likelihood (IBM, 2024).

As discussed in a paper on credit card fraud detection (Alenzi & Aljehane, 2020), logistic regression can be a powerful tool for detecting fraudulent credit card transactions. A database of credit card transactions was split into training and testing sets to build and evaluate predictive algorithms. Multiple classifiers, such as k-nearest neighbours and a voting classifier, were used, but logistic regression showed the best performance. Accuracy was determined from a confusion matrix, a table containing counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), with the aim of achieving the highest proportion of correct predictions.
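As a rough illustration of the machine learning variant, the sketch below fits the coefficients by plain batch gradient descent on the mean negative log-likelihood. The learning rate, the iteration count and the eight toy transactions (scaled amount, foreign-country flag) are assumptions made for this example only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def negative_log_likelihood(X, y, b0, b):
    """Mean negative log-likelihood (the loss being minimized)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)


def fit_logistic(X, y, learning_rate=0.1, iterations=2000):
    """Batch gradient descent on the negative log-likelihood."""
    n_features = len(X[0])
    b0, b = 0.0, [0.0] * n_features
    for _ in range(iterations):
        grad0, grad = 0.0, [0.0] * n_features
        for xi, yi in zip(X, y):
            error = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi))) - yi
            grad0 += error
            for j, xj in enumerate(xi):
                grad[j] += error * xj
        # Step against the gradient, i.e. downhill on the loss.
        b0 -= learning_rate * grad0 / len(X)
        b = [bj - learning_rate * gj / len(X) for bj, gj in zip(b, grad)]
    return b0, b


# Invented toy data: [scaled amount, foreign-country flag] -> fraud label.
X = [[0.2, 0], [0.5, 0], [1.0, 0], [1.2, 1], [3.0, 1], [3.5, 0], [4.0, 1], [0.8, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 0]

loss_before = negative_log_likelihood(X, y, 0.0, [0.0, 0.0])
b0, b = fit_logistic(X, y)
loss_after = negative_log_likelihood(X, y, b0, b)
print(loss_before, loss_after)
```

The loss starts at ln 2 (all probabilities equal to 0.5) and decreases as the coefficients are updated. Production libraries typically use faster solvers and add regularization, which this sketch omits.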

Picture 6: Detailed plot of Logistic regression, showing values for fraud (1) and non-fraud (0)

(Source: thesai.org/Downloads/Volume11No12/Paper_65-Fraud_Detection_in_Credit_Cards.pdf )

2.1.2 Random forests

Random forest is a supervised machine learning algorithm that builds multiple decision trees during the training phase and combines their outputs. First, it is best to introduce how decision trees work.

A decision tree is a structure consisting of nodes (representing decisions or tests on attributes), branches (representing the outcomes of these decisions) and leaf nodes (representing final outcomes). For example, the root node corresponds to the entire dataset and the initial decision to be made, while leaf nodes correspond to final decisions with no further splits (GeeksForGeeks, 2024).
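The node, branch and leaf structure can be written down directly as nested data. The tree below is built by hand purely for illustration; the features `amount` and `account_age_days` and both thresholds are invented, whereas a real tree would learn them from training data.

```python
# Hypothetical hand-built tree; feature names and thresholds are invented.
tree = {
    "feature": "amount", "threshold": 1000,      # root node: first test
    "left": {"label": "legitimate"},             # branch amount <= 1000 ends in a leaf
    "right": {                                   # branch amount > 1000 leads to another test
        "feature": "account_age_days", "threshold": 30,
        "left": {"label": "fraud"},              # leaf: large amount, new account
        "right": {"label": "legitimate"},        # leaf: large amount, established account
    },
}


def classify(node, sample):
    """Follow branches from the root until a leaf node is reached."""
    while "label" not in node:
        goes_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["label"]


print(classify(tree, {"amount": 50, "account_age_days": 5}))
print(classify(tree, {"amount": 5000, "account_age_days": 3}))
print(classify(tree, {"amount": 5000, "account_age_days": 400}))
```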

Picture 7: A simple visual example of decision tree  

(Source: botpenguin.com/glossary/decision-trees )

While decision trees are commonly used, they can be prone to problems such as bias and overfitting. However, when multiple decision trees form an ensemble in the random forest algorithm, they produce more accurate predictions, particularly when the individual trees are uncorrelated with each other.

Ensemble methods combine predictions from multiple models to determine the most popular result, with bagging being one of the best-known approaches. In this method, a random sample of the training data is selected with replacement, which means that individual data points can be chosen multiple times. After several data samples are generated, a model is trained independently on each of them, which helps reduce the variance within a noisy dataset.

Finally, we get back to the random forest algorithm, which extends the bagging method: it uses both bagging and feature randomness to create an uncorrelated forest of decision trees. Feature randomness selects a random subset of features, which ensures low correlation among the decision trees. This is the key difference between decision trees and random forests: a single decision tree considers all possible feature splits, whereas the trees in a random forest consider only a subset of the features. Furthermore, random forests have three main hyperparameters, which need to be set before training starts: the node size, the number of trees and the number of features sampled. From there, the random forest can be used to solve both regression and classification problems (IBM, 2024).
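Sampling with replacement and voting are simple to sketch. In the example below, the transaction identifiers, the seed and the helper names are placeholders chosen for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items with replacement, so an item may appear
    several times in one sample while others are left out."""
    return [rng.choice(dataset) for _ in range(len(dataset))]


def majority_vote(predictions):
    """Combine the predictions of several models into the most popular one."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed only to make the example repeatable
transactions = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]

# Each sample would be used to train one model of the ensemble.
samples = [bootstrap_sample(transactions, rng) for _ in range(3)]
for sample in samples:
    print(sorted(sample))

print(majority_vote(["fraud", "legitimate", "fraud"]))
```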
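The combination of bagging and feature randomness can be sketched with a deliberately simplified forest. Each "tree" here is a one-split stump, the split criterion is Gini impurity (a common choice that the text above does not name), and the data, feature meanings and parameter values are all invented for illustration.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    """Gini impurity of a list of 0/1 labels (assumed split criterion)."""
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


def fit_stump(rows, labels, features):
    """Best single split, searching only the given feature indices."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold, majority(left), majority(right))
    if best is None:  # no usable split: always predict the majority class
        label = majority(labels)
        return (features[0], float("inf"), label, label)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, max_features=2, seed=0):
    rng = random.Random(seed)
    n_rows, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n_rows) for _ in range(n_rows)]   # bagging
        features = rng.sample(range(n_features), max_features)    # feature randomness
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], features))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    return majority([predict_stump(stump, row) for stump in forest])


# Invented data: [amount, transactions in the last hour, day of month].
rows = [[20, 1, 3], [35, 2, 17], [50, 1, 8], [40, 1, 25], [25, 2, 12],
        [900, 9, 5], [1200, 12, 21], [1500, 8, 10], [1100, 15, 28], [950, 10, 14]]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
print(predict_forest(forest, [5000, 30, 15]), predict_forest(forest, [10, 1, 15]))
```

The three hyperparameters mentioned above appear here as `n_trees`, `max_features` and, implicitly, the tree size (fixed at a single split). Full implementations grow deeper trees and draw a fresh feature subset at every split.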

Picture 8: Visual representation of random forest algorithm

(Source: geeksforgeeks.org/random-forest-algorithm-in-machine-learning )

2.1.3 Support Vector Machine (SVM)

The next supervised model, the support vector machine (SVM), classifies data by finding an optimal line or hyperplane that maximizes the distance between classes in an n-dimensional space. Commonly used for classification problems, it distinguishes between two classes by finding the hyperplane that maximizes the margin between the closest data points of opposite classes. The dimension of the hyperplane is determined by the number of features. Many hyperplanes may separate the classes, and the algorithm chooses the one with the largest margin as the decision boundary. The data points closest to the optimal hyperplane are known as support vectors, because they determine the maximal margin.

The benefit of SVM is that it can handle both linear and non-linear classification tasks, making it a versatile method. For data that is not linearly separable, kernel functions are used to transform the original features into a higher-dimensional space, where the data becomes linearly separable.
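The geometry of the margin can be illustrated without training anything. In the sketch below, the hyperplane w·x + b = 0 and the four points are chosen by hand so that the closest points of each class satisfy |w·x + b| = 1, the usual SVM scaling convention; nothing here is the output of an actual SVM solver.

```python
import math


def decision_value(w, b, x):
    """Value of w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))


def margin_width(w):
    """Distance between the two margin boundaries w.x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hand-picked hyperplane x1 + x2 - 3 = 0 and invented labelled points.
w, b = (1.0, 1.0), -3.0
points = [((2.0, 2.0), +1), ((4.0, 3.0), +1), ((1.0, 1.0), -1), ((0.0, 1.0), -1)]

# Every point is on the correct side and outside the margin.
all_correct = all(label * decision_value(w, b, x) >= 1 for x, label in points)

# The points lying exactly on the margin boundaries determine the margin.
support_vectors = [x for x, label in points
                   if math.isclose(label * decision_value(w, b, x), 1.0)]

print(all_correct, support_vectors, margin_width(w))
```

Moving or removing the points (4, 3) or (0, 1) would not change the boundary, whereas moving (2, 2) or (1, 1) would, which is why only the closest points matter to the result.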
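A tiny example shows why mapping to a higher-dimensional space helps. The data below is invented: a single feature (deviation from a customer's usual amount) where both very low and very high values are labelled as fraud, so no single threshold separates the classes. The explicit feature map x → (x, x²) is used for clarity; kernel functions achieve the same effect implicitly, without computing the mapped coordinates.

```python
def separable_by_threshold(values, labels):
    """True if one threshold on a single coordinate splits the two classes,
    i.e. the labels change at most once when the values are sorted."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


def feature_map(x):
    """Explicit map into two dimensions: (x, x squared)."""
    return (x, x * x)


# Invented 1-D data: extreme deviations in either direction are fraud (1).
deviation = [-3, -2, -1, 0, 1, 2, 3]
labels = [1, 1, 0, 0, 0, 1, 1]

original_ok = separable_by_threshold(deviation, labels)

mapped = [feature_map(x) for x in deviation]
squared_ok = separable_by_threshold([m[1] for m in mapped], labels)

print(original_ok)  # no threshold on x works
print(squared_ok)   # in the mapped space, the line x^2 = 2.5 separates the classes
```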

When SVM is compared for example to logistic regression, it typically performs better with high-dimensional and unstructured data such as text, images and especially those images that have been tampered with. SVMs are also less vulnerable to overfitting with the benefit of also being a bit easier to interpret. But on the other hand, they are usually much more computationally expensive (IBM, 2024) (SciKit-Learn, 2024).

Picture 9: A visual representation of the concept of SVM

(Source: images.javatpoint.com/tutorial/machine-learning/images/support-vector-machine-algorithm5.png )

As discussed in a journal paper (Kumar, Gunjan, Ansari, & Pathak, 2022), SVM was compared to other ML algorithms (such as previously mentioned random forest and linear regression) in a credit card fraud detection problem. Customer data was pre-processed, categorical values were converted into numerical form, and the dataset was split into 70% training and 30% testing data. In this case, the performance of SVMs was exceptional, likely because they are particularly effective when dealing with imbalanced and skewed data. Overall, the algorithm achieved a 96% accuracy rate in detecting credit card fraud, with precision, recall, and the F1-score also demonstrating high performance levels.

2.2 Unsupervised learning

The second category of machine learning algorithms works with unlabelled datasets to analyse and cluster data. In other words, these algorithms are allowed to discover patterns and insights without any explicit guidance or instructions. Unsupervised algorithms are particularly effective for more complex processing tasks, such as organizing large datasets into clusters. They are also significantly better at identifying previously undetected patterns and can help identify features useful for categorizing.

The algorithm groups the data by similar patterns and while the machine itself does not understand these patterns, a human then can create classes based on their understanding. For instance, our algorithm might group weather data by temperature or similar weather patterns. Our task then would be to determine whether these groups correspond to specific seasons or distinct weather types, such as rain or snow (Google, 2024).

Picture 10: Unsupervised learning process

(Source: geeksforgeeks.org/supervised-unsupervised-learning )

2.2.1 K-means clustering

Many forms of clustering exist, including exclusive, overlapping, hierarchical and probabilistic; the k-means algorithm represents the exclusive (“hard”) method. In this type of grouping, a data point can be assigned to just one cluster. In fraud detection, it can be especially useful for document clustering or image segmentation. It is also widely used in other cluster analysis problems because the algorithm is efficient, effective and relatively simple.

The k-means algorithm uses an iterative process to minimize the sum of squared distances between the data points and their cluster centroids. Each data point is assigned to the cluster whose centroid is nearest according to a mathematical distance measure, typically Euclidean.

The initial step of this algorithm is to choose k, the number of clusters we want to create. A higher k value yields smaller clusters with greater detail, while a lower k value results in larger clusters with less detail. Next comes a two-step iterative process that follows the Expectation-Maximization pattern. The expectation step assigns each data point to its nearest centroid. The maximization step then computes the mean of all the points in each cluster and sets it as the new centroid. This process is repeated until the centroid positions are stable (IBM, 2024).

A related clustering algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Here the data is grouped by density, meaning the proximity of points to each other. In DBSCAN, clusters are dense regions of the data space, separated by regions of lower point density, and points that belong to no dense region are labelled as noise. The main difference from k-means is that instead of the number of clusters, we choose the minimum number of points required to form a cluster. We also choose a second parameter, eps, which defines the neighbourhood around a data point, i.e. the maximum distance between two points for them to be considered neighbours (GeeksForGeeks, 2023).
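The two alternating steps can be sketched in a few lines. The points and the starting centroids below are invented, and the centroids are initialized by hand to keep the example deterministic; practical implementations usually pick them randomly or with a scheme such as k-means++.

```python
import math


def kmeans(points, centroids, max_iterations=100):
    """Plain k-means; k is the number of initial centroids supplied."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        # Expectation step: assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Maximization step: move each centroid to the mean of its points.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids are stable: stop
            break
        centroids = new_centroids
    return centroids, clusters


# Invented 2-D data forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

With k = 2, the algorithm converges after the second pass, with one centroid at the mean of the three low points and the other at the mean of the three high points.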
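A compact sketch of DBSCAN shows how the two parameters interact. The points, the value of `eps` and the minimum point count are invented, and a point's neighbourhood is taken to include the point itself, which is one common convention.

```python
import math

NOISE = -1


def dbscan(points, eps, min_points):
    """Return one cluster id per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster_id += 1                  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)  # j is also a core point: keep expanding
    return labels


# Invented data: two dense groups and one isolated point.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (5, 15)]
labels = dbscan(points, eps=1.5, min_points=3)
print(labels)
```

Unlike k-means, the number of clusters is not supplied: the two groups are discovered from the density alone, and the isolated point is reported as noise rather than being forced into a cluster.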

Picture 11: Comparison of clustering between DBSCAN and k-means

(Source: github.com/NSHipster/DBSCAN )

The k-means clustering method can be applied in almost every domain and industry, and it is usually used for data that has few dimensions, is numeric and can be easily partitioned. In fraud detection, k-means can be used to identify fraudulent transactions based on attributes such as geolocation and device information. It can detect irregular behaviour patterns, which helps identify cases where a fraudster steals credentials and tries to make payments. Moreover, k-means proves effective in anomaly detection, where it identifies the data points farthest from the centroids, in other words, outliers (Signicat, 2024).

2.2.2 Isolation forest

The name of this algorithm may remind us of a supervised one discussed earlier, random forest. In contrast, isolation forest is an unsupervised method, and we will soon find out why.

Isolation forest, best known for its efficiency and simplicity, is an algorithm used primarily for anomaly detection. By isolating anomalies in an unlabelled dataset using binary partitioning, it quickly identifies outliers with minimal computational overhead. Anomaly detection is the technique of identifying rare observations that raise suspicion by being statistically different from the rest. In our context, these anomalies could correspond to fraudulent behaviour such as falsely authorized transactions or tampered documents.

Isolation forest once again uses the concept of trees, but this approach randomly selects features and splits them at random values until individual data points are isolated. This isolating process creates trees, or in other words partitions, that aim to separate anomalies from ordinary observations. Each data point is then assigned an “anomaly score” based on how many splits are needed to isolate it. Anomalies require fewer splits to isolate and are therefore assigned higher anomaly scores.

Picture 12: An example of Isolation forest

(Source: geeksforgeeks.org/what-is-isolation-forest/ )

Let’s now delve deeper into the steps this unsupervised algorithm takes. It begins with random partitioning. A random feature from the dataset is selected, and a random value within the range of that feature’s values is chosen as the splitting threshold. The split divides the data into two parts. This random selection and splitting is repeated recursively until all the data points are isolated into individual partitions or the maximum depth is reached. The next step, the isolation path, measures how isolated a data point is within a tree. As already stated, this is determined by the number of splits needed to isolate that point within the tree. Finally, in the third step, an ensemble of isolation trees is built. The algorithm constructs a specified number of isolation trees independently and evaluates the isolation paths across them to find the anomalies (GeeksForGeeks, 2023).
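The three steps can be sketched from scratch as follows. The data is invented, and two details are assumptions taken from the original Isolation Forest formulation rather than from the description above: the path-length normalization `average_path_length` and the score formula 2^(-mean path / c). For simplicity, every tree is built on the full dataset instead of a random subsample.

```python
import math
import random


def build_tree(points, depth, max_depth, rng):
    """Step 1: recursive random partitioning."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    low = min(p[feature] for p in points)
    high = max(p[feature] for p in points)
    if low == high:
        return {"size": len(points)}
    threshold = rng.uniform(low, high)
    left = [p for p in points if p[feature] < threshold]
    right = [p for p in points if p[feature] >= threshold]
    return {"feature": feature, "threshold": threshold,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def average_path_length(n):
    """Expected path length in a tree of n points (assumed normalization)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def path_length(tree, point, depth=0):
    """Step 2: number of splits needed to reach the point's partition."""
    if "size" in tree:
        return depth + average_path_length(tree["size"])
    branch = "left" if point[tree["feature"]] < tree["threshold"] else "right"
    return path_length(tree[branch], point, depth + 1)


def anomaly_scores(points, n_trees=100, seed=0):
    """Step 3: build an ensemble and average the path lengths."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(points)))
    trees = [build_tree(points, 0, max_depth, rng) for _ in range(n_trees)]
    c = average_path_length(len(points))
    scores = []
    for p in points:
        mean_path = sum(path_length(t, p) for t in trees) / n_trees
        scores.append(2 ** (-mean_path / c))  # short paths give scores near 1
    return scores


# Invented data: a tight 4x4 grid of ordinary points plus one distant point.
points = [(x, y) for x in (1.0, 1.2, 1.4, 1.6) for y in (1.0, 1.2, 1.4, 1.6)]
points.append((10.0, 10.0))

scores = anomaly_scores(points)
most_anomalous = scores.index(max(scores))
print(points[most_anomalous], round(max(scores), 2))
```

Because the distant point is usually separated by the very first random split, its average path is much shorter than that of the grid points, and it receives the highest score.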

3 AI&ML fraud detection in the insurance industry

As is apparent from the introductory section on fraud, scammers operate in many industries. With ongoing digitalisation, nearly every sector is increasingly vulnerable to digital fraudsters. To combat these threats, companies are implementing a range of countermeasures, with machine learning algorithms being among the most prominent. However, it remains a constant cat-and-mouse game, with fraudsters often staying one step ahead. In this section, we will explore one industry that is frequently targeted, the methods fraudsters use to exploit it, and the strategies the insurance industry employs to defend itself against scammers’ threats.

3.1 Insurance fraud

Insurance is an industry where fraud might be as old as the industry itself. Insurance fraud in the US alone is estimated to cost a total of $40 billion annually. This of course has an effect: an insured US family pays between $400 and $700 per year in the form of increased premiums (FBI, 2010).

Insurance companies encounter several types of insurance fraud. While serial fraud exists in almost every industry, in the insurance sector opportunistic fraud tends to be much more common. When an insurance claim is filed, some customers tend to exploit the situation to maximize their financial gain. A common practice among fraudsters is to inflate their claims, exaggerating the value of the loss or including items that were not actually damaged. They can do so by editing digital documents and invoices in a photo editing app, even with minimal experience. Moreover, fraudsters can now take advantage of recently emerged AI tools, forging documents with even greater ease. Similarly common is application fraud, which happens when a customer provides falsified documents at the very start of the insurance process to obtain coverage or benefits they might not otherwise qualify for. If advanced machine learning techniques are applied, the insurance company can often detect this fraud during the application phase. However, it may occasionally go unnoticed and only come to light when an insurance claim is filed, resulting either in its eventual discovery or in an unjustified payout that could have been avoided.

3.2 Real example of insurance fraud

We will now present a real-world example of insurance fraud, obtained from Ondřej Poul, Claim Division Director at Kooperativa (the number two insurance company in Czechia). The customer applied for health insurance, likely without disclosing existing health complications. Some time after obtaining the policy, they filed a claim related to health complications, supported by a medical document. However, a machine learning algorithm applied to the document identified evidence of tampering. When the document’s issuer, probably a general practitioner, was contacted, it was confirmed that the health complications had actually started in 2011, well before the insurance policy was arranged. The algorithm not only classified the document as fraudulent but also pinpointed the exact numbers that had been edited. As Ondřej Poul stated, the company saved 95,000 CZK by detecting this fraud (Poul, 2024).

Picture 13: An example of ML algorithm detecting document tampering on a medical report (Source: linkedin.com/in/poul )

To detect such fraud cases, insurance companies can either develop in-house machine learning solutions or leverage the expertise of external providers. Both approaches are discussed in the following sections, starting with a real example of the external route, again from Kooperativa.

3.3 Kooperativa x Resistant AI: Implementation of machine learning solution for insurance fraud

For this part, I contacted Lucie Paulusová, a Business Analyst at Kooperativa, who has been actively involved in the implementation of Resistant AI’s fraud detection system. While the expertise of Resistant AI will be discussed later, we will now focus on what such an implementation involves.

The process begins with initial discussions to identify and align on business needs. Once these foundational talks are complete, teams from both companies start their collaboration. Resistant AI typically provides a proof of concept (POC) to demonstrate the applicability of its solution to insurance fraud cases. An important phase then follows: determining the formats and types of documents that will be analysed. Some documents are rarely, if ever, falsified and can therefore be excluded from the verification process. Documents such as medical reports or car accident claims, on the other hand, are reviewed thoroughly. Equally important is the selection of key indicators for the fraud analysis, such as logos, font types and other details. Since Resistant AI’s solutions also examine document metadata, many fraudsters can be detected simply by analysing whether and when a document has been edited. For example, if a certain company’s accounting records are typically generated with the Czech accounting software Pohoda, the presence of a different accounting tool or of editing software in the metadata could raise suspicion.
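The metadata check described above can be illustrated with a minimal Python sketch. It is not Resistant AI’s implementation: the field names (`producer`, `created`, `modified`), the issuer key `acme-accounting` and the table of expected tools are all assumptions made purely for illustration.

```python
from datetime import datetime

# Hypothetical table (assumption): software each issuer is expected to use.
EXPECTED_PRODUCERS = {"acme-accounting": {"pohoda"}}


def metadata_flags(issuer: str, metadata: dict) -> list:
    """Return human-readable warnings for suspicious document metadata."""
    flags = []
    producer = metadata.get("producer", "").lower()
    expected = EXPECTED_PRODUCERS.get(issuer, set())
    # Flag a producing tool that this issuer does not normally use.
    if expected and not any(name in producer for name in expected):
        flags.append(f"unexpected producer: {producer or 'missing'}")
    created = metadata.get("created")
    modified = metadata.get("modified")
    # Flag documents that were changed after they were created.
    if created and modified and modified > created:
        flags.append("modified after creation")
    return flags


# Illustrative (made-up) metadata records for two documents.
genuine = {
    "producer": "POHODA",
    "created": datetime(2024, 3, 1, 9, 0),
    "modified": datetime(2024, 3, 1, 9, 0),
}
edited = {
    "producer": "Some Photo Editor",
    "created": datetime(2024, 3, 1, 9, 0),
    "modified": datetime(2024, 3, 4, 17, 30),
}

print(metadata_flags("acme-accounting", genuine))
print(metadata_flags("acme-accounting", edited))
```

In this sketch the first document raises no flags, while the second raises two. A flag is only a reason for closer review, since legitimate documents are sometimes re-saved in other tools.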

Even after the system is fully set up, the two teams continue their collaboration in weekly sprints, resolving issues and improving the effectiveness of the solution. Individual cases requiring clarification are also often discussed at these meetings. The ultimate goal is to catch fraudsters before insurance payouts are made while keeping false positives to a minimum. Usually, fraudulent cases are settled directly with the policyholder, but if the fraudulent sum is excessively high or the fraud shows signs of organized crime, the case can be escalated to the police for further investigation and legal action.

3.4 Development of in-house machine learning solution

The entire workflow requires collaboration across multiple departments, including business, data governance and legal teams. In this section, however, we concentrate exclusively on the technical development.

When starting from scratch, the first step is always Exploratory Data Analysis (EDA). It helps in understanding the data and may reveal hidden issues such as data inconsistency, duplicated rows or missing values. Successful machine learning algorithms depend on an accurate representation of the data. To achieve this, feature engineering, in other words the selection, transformation and creation of relevant variables, is essential. Finally, the model can be built. A common approach is to first develop multiple models of different types, evaluate their performance, and then select the best one for fine-tuning. In the case discussed here (Wipro, 2024), the following models were selected for testing: Logistic Regression, Modified Multi-variate Gaussian, Boosting, and Bagging with Adjusted Random Forest.
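Two of the data issues mentioned for the EDA step, missing values and duplicated rows, can be checked with a few lines of Python. This is only a sketch under stated assumptions: the claim records and column names (`claim_id`, `amount`, `diagnosis_year`) are invented for illustration and do not come from the Wipro study.

```python
from collections import Counter


def basic_data_checks(rows: list) -> dict:
    """Count missing values per column and exact duplicate rows."""
    missing = Counter()
    seen = Counter()
    for row in rows:
        for column, value in row.items():
            # Treat None and empty strings as missing values.
            if value is None or value == "":
                missing[column] += 1
        # Rows are dicts, so freeze each one into a hashable key
        # that does not depend on column order.
        seen[tuple(sorted(row.items(), key=lambda item: item[0]))] += 1
    # Every occurrence beyond the first counts as a duplicate.
    duplicates = sum(count - 1 for count in seen.values())
    return {"missing": dict(missing), "duplicates": duplicates}


# Hypothetical claim records (assumption, for illustration only).
claims = [
    {"claim_id": 1, "amount": 12000, "diagnosis_year": 2011},
    {"claim_id": 2, "amount": None, "diagnosis_year": 2020},
    {"claim_id": 2, "amount": None, "diagnosis_year": 2020},
    {"claim_id": 3, "amount": 5400, "diagnosis_year": ""},
]

report = basic_data_checks(claims)
print(report)
```

In practice this step is usually done with a data analysis library, but the idea is the same: quantify the problems before any feature engineering or model training begins.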

Model performance was again evaluated with a confusion matrix, together with recall (the fraction of actual positive instances that are retrieved) and precision (the fraction of retrieved instances that are actually positive). ROC curves served as an additional evaluation method. The evaluation identified the best-performing model for this particular case: Bagging with Adjusted Random Forest obtained the highest scores. However, the analysis suggests that the final outcome is influenced more by the quality and quantity of the available data than by the choice of model. Some might think this marks the end of the process, but significant work remains before the chosen model is deployed into production. This includes fine-tuning to prevent overfitting, so that the model generalizes well to new, unseen data, and rigorous validation to confirm its reliability in real-world scenarios (Wipro, 2024).
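To make the two metrics concrete, the following Python sketch derives precision and recall from the four cells of a confusion matrix. The labels are made up for illustration (1 = fraud, 0 = legitimate) and are not results from the Wipro study.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 means fraud."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # fraud correctly flagged
        elif predicted == 1 and actual == 0:
            fp += 1  # legitimate case flagged by mistake
        elif predicted == 0 and actual == 1:
            fn += 1  # fraud that slipped through
        else:
            tn += 1  # legitimate case correctly passed
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for eight claims (assumption, for illustration only).
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

print(confusion_counts(y_true, y_pred))
print(precision_recall(y_true, y_pred))
```

Here three of the four flagged claims are truly fraudulent and three of the four fraudulent claims are caught, so both precision and recall equal 0.75. In fraud detection the two usually pull in opposite directions: flagging more claims raises recall but tends to lower precision.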

Picture 14: The process of building an ML solution for fraud detection

(Source: wipro.com/analytics/comparative-analysis-of-machine-learning-techniques-for-detectin )

4 Insight into practice: Resistant AI

Now let us shift our focus to the other side: a company that develops advanced machine learning solutions for digital fraud detection and prevention. These solutions not only help companies reduce financial losses but also contribute to uncovering organized crime. In the following section, we focus on how such a company works and how it manages to detect millions of digital fraud cases. Additionally, I had the opportunity to speak with a data science team lead at Resistant AI, who provides valuable insights into the current challenges and future trends in AI&ML fraud detection.

Picture 15: Resistant AI’s logo and slogan

(Source: resistant.ai )

4.1 The story of Resistant AI

The story begins in academia. Most of Resistant AI’s founders completed PhDs in artificial intelligence, computer engineering or related fields at prestigious European universities. By 2006, their shared interests had brought them together as a team of researchers at the Czech Technical University in Prague. Recognizing the potential of machine learning for securing private and cloud networks against real-time malware threats, they spun off their first company, Cognitive Security, in 2009. Major international players soon took notice, and Cisco acquired the company in 2013. This background matters: a core team of cybersecurity experts who have already built one successful company gives Resistant AI a strong advantage over its current competitors and a credible track record.

With nearly 15 years of top-tier experience in network security and machine learning, the expert team reunited in 2019 to establish Resistant AI. Driven by the vision of making today’s financial systems more resilient to digital fraud, they began developing their solutions. With seed funding secured in 2020 and Series A funding completed in 2021, Resistant AI has grown significantly. By 2023, the company had built a portfolio of more than one hundred clients and expanded abroad, forming both an international team and an international clientele. Today, Resistant AI is a recognized leader in digital fraud detection, with dedicated sales teams in New York and London supporting its global expansion (Resistant AI, 2024b).

4.2 Resistant AI’s expertise

Resistant AI offers two main lines of solutions: transactions and documents.

Starting with documents, the tailored solutions can verify thousands of documents daily, providing companies with automated fraud detection and protection. Resistant AI’s algorithms accept documents in both PDF and image formats and check metadata, internal structure and the fonts used. Overall, Document Forensics checks over 500 parameters for signs of fraudulent behaviour. It can do so without explicitly reading the document contents, which ensures a high level of privacy. In addition, it provides actionable verdicts, marking each document as Trusted, Warning or High Risk. With 1 in 5 onboarding documents tampered with and up to 2% based on reused or generated documents, these techniques can have an immense impact on whether a fraudster succeeds. The solutions are especially effective for onboarding, KYC processes and underwriting (Resistant AI, 2024a).

The second line of solutions, especially useful for the banking industry, prevents financial threats in real time. The software recognizes irregular behaviour patterns, actively finding both threats already present in the system and potential fraudsters. It can also support risk monitoring services that are already in place, creating hyper-granular risk profiles for each customer by segmenting transactional data based on behaviour. These solutions are best suited for detecting money laundering schemes, authorized push payment scams and fraudulent techniques exploiting Buy Now Pay Later (Resistant AI, 2024).
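The idea of judging each payment against the customer’s own behaviour, rather than against a global rule, can be sketched with a simple per-customer z-score. This is a swapped-in textbook technique chosen for illustration, not Resistant AI’s method; the customer names, amounts and the threshold of 3.0 are all assumptions.

```python
from statistics import mean, stdev


def unusual_transactions(history: dict, new_payments: dict,
                         threshold: float = 3.0) -> list:
    """Return customers whose new payment deviates strongly from their own history."""
    flagged = []
    for customer, amount in new_payments.items():
        past = history.get(customer, [])
        if len(past) < 2:
            continue  # too little history to build a baseline
        spread = stdev(past)
        if spread == 0:
            continue  # a constant history gives no usable scale
        # Distance from the customer's own average, in standard deviations.
        z_score = abs(amount - mean(past)) / spread
        if z_score > threshold:
            flagged.append(customer)
    return flagged


# Hypothetical payment histories (assumption, for illustration only).
history = {
    "alice": [40, 55, 38, 60, 47],
    "bob": [900, 1100, 950, 1050, 1000],
}
new_payments = {"alice": 1000, "bob": 1000}

print(unusual_transactions(history, new_payments))
```

The same payment of 1000 is flagged for the first customer but not for the second, because it is only unusual relative to the first customer’s history. Production systems rely on far richer behavioural features, but the principle of a customer-specific baseline is the same.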

For this section, I teamed up with Anežka Lhotáková, Adaptive Decision & Image Operations Team Lead at Resistant AI. We discussed a range of AI&ML fraud detection topics, and many of the ideas from that discussion have been incorporated into this work. She was also more than willing to answer the following questions:

What are the current challenges in AI&ML fraud detection?

“In digital document fraud, a major topic is the current technological availability of powerful machine learning models (or, if you prefer, AI models). The days when criminals were sophisticated individuals like the meticulous Frank Abagnale are long gone. What distinguishes the current era is accessibility—today’s fraudsters need no more than a photoshop tool with AI features that smooth out all imperfections that might alert the human eye about a forgery.

At present, we are at a phase where existing documents or templates are being modified (very convincingly) locally. In the future, however, we must prepare for a level of forgery where documents will be generated from scratch, without any original template.”

What are the future trends of AI&ML fraud detection?

“We must acknowledge that in the more distant future, when it becomes impossible to distinguish an original document from a forgery, the question of alternative methods for identity verification will arise. Some efforts and directions are already visible today, be it methods like face recognition, fingerprinting, cognitive security, or, specifically in the Czech Republic, the relatively ‘simple’ system of ‘datová schránka’ [data box]. However, such methods, like those we use today, must demonstrate a certain reliability while also protecting individuals’ data and privacy.

Would you trust a system that verifies your identity based on your ‘computer’ behaviour: typing speed, keystroke pressure, mouse movements, time spent on a page, voice analysis…? Would you entrust such a system with your confidence, a piece of your privacy, and highly personal data in the interest of verifying your banking identity?”

Conclusion

This semester paper provided a comprehensive overview of how machine learning algorithms are leveraged to fight cunning digital fraudsters.

It began by introducing the most common types of fraud, highlighting the important difference between serial and opportunistic fraud. While both are similarly dangerous, the level of sophistication and premeditation is what distinguishes them. Furthermore, with the emergence of AI tools, creating a fraudulent document is becoming increasingly easy. This poses a real threat for the future, as even scammers without technical skills can now use such software to create fraudulent materials in minutes, if not seconds.

To counter these criminal activities, machine learning solutions can be developed and employed. Both supervised and unsupervised techniques have proven effective, with a diverse portfolio of algorithms to be chosen from. However, there is no one-size-fits-all solution, as each model comes with its own strengths and weaknesses, making it suitable for specific types of fraud. In practice, multiple models are often trained and evaluated to determine which performs best on the given task. Once identified, the best-performing model undergoes fine-tuning to optimize its performance before being prepared for real-world deployment.

The final two sections of this work were devoted to showcasing real applications of AI&ML fraud detection. The first explored a case from the insurance industry, showing an actual attempt to commit fraud by tampering with a medical document. The second presented a world leader in AI&ML fraud detection and discussed its success story. Finally, short insights from a field expert were shared, highlighting the threats posed by newly emerging AI tools in the context of digital fraud. Given the complexity of this topic, future research could dive into neural networks, which now play a significant role in fraud detection and could complement the machine learning algorithms discussed here. While the end of this work focused primarily on the insurance industry, future studies could explore other sectors where fraud is similarly frequent, such as banking, or the gambling industry, where features like friend-invitation systems are now often exploited.

References

ACI Worldwide. (2024). APP fraud explained. ACI Worldwide. https://www.aciworldwide.com/app-fraud

Alenzi, H. Z. (2024). Machine learning for advanced fraud detection and content moderation. Preprints. https://doi.org/10.20944/preprints202411.0352.v1

Babatope, A. (2024, November). Machine learning for advanced fraud detection and content moderation. Preprints. https://doi.org/10.20944/preprints202411.0352.v1

Bank for International Settlements. (2024). Digital fraud and banking. Retrieved from https://www.bis.org

GeeksForGeeks. (2023a). DBSCAN clustering. Retrieved from https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/

GeeksForGeeks. (2023). What is isolation forest? Retrieved from https://www.geeksforgeeks.org/what-is-isolation-forest/

IBM. (2024a). What is k-means clustering? Retrieved from https://www.ibm.com/topics/k-means-clustering

IBM. (2024b). What is logistic regression? Retrieved from https://www.ibm.com/topics/logistic-regression

JPMorgan Chase & Co. (2019). JPMorgan Chase & Co. annual report 2019. Retrieved from https://www.jpmorganchase.com/content/dam/jpmc/jpmorgan-chase-and-co/investor-relations/documents/annualreport-2019.pdf

Kumar, S., Gunjan, V. K., & Babatope, A. (2024). Machine learning for advanced fraud detection and content moderation. Retrieved from https://doi.org/10.20944/preprints202411.0352.v1

Mastercard. (2024). Synthetic identity theft prevention [E-book]. Retrieved from https://ekata.com/wp-content/uploads/2024/04/Synthetic_identity_theft_prevention_ebook.pdf

Organization of American States. (2013). Money laundering. Retrieved from https://www.oas.org/cicaddocs/Document.aspx?Id=3095

Poul, O. (2024). Machine learning applications in insurance fraud detection. Retrieved from https://resistant.ai/blog/threat-intel-doc-juicer#heading-10

Resistant AI. (2023). The threat of serial fraud. Retrieved from https://info.resistant.ai/serialfraud-wp

Resistant AI. (2024a). Document solutions. Retrieved from https://resistant.ai/products/documents

Resistant AI. (2024b). About Resistant AI. Retrieved from https://resistant.ai/about

Scikit-Learn. (2024). DBSCAN clustering. Retrieved from https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/

Signifyd. (2023). 2023 State of omnichannel fraud report. Retrieved from https://www.transunion.ca/fraud-trends/reports/2023-state-of-omnichannel-fraud-report

TransUnion. (2023). 2023 state of omnichannel fraud report. Retrieved from https://www.transunion.ca/fraud-trends/reports/2023-state-of-omnichannel-fraud-report

Visa. (2019, June). Visa prevents approximately $25 billion in fraud using artificial intelligence. Retrieved from https://www.businesswire.com/news/home/20190617005366/en/

Whitrow, C., Hand, D., & Juszczak, P. (2009). Transaction aggregation as a strategy for fraud detection. Data Mining and Knowledge Discovery, 18(1), 30–45. https://doi.org/10.1007/s10618-008-0116-1

Author: Jakub Kabele, Business Analyst @ Kooperativa, Student Data Analytics @ VŠE (2024)
