
BOOSTING FRAUD DETECTION WITH SYNTHETIC DATA

By Richard Ball, February 16, 2023 - 10 minute read

Fraud detection is an increasingly difficult and important problem that most
financial institutions and insurance companies face. One of the key challenges
of detecting fraud through the use of machine learning techniques is the lack of
available training data. 

This is especially true for confirmed fraudulent records, as they are rare in
comparison to non-fraudulent records. The large difference between the number of
non-fraudulent and fraudulent records in the data set leads to a severe class
imbalance. This negatively impacts the ability to train an effective machine
learning model, as the training process is significantly biased towards the
majority, non-fraudulent records.

This blog post will cover the following topics:

 * Comparison of approaches to dealing with class imbalances
 * Augmenting datasets with synthetic fraud transactions
 * Training and evaluating the performance of the fraud detection model with
   SMOTE and synthetic data
 * Results and insights on how to improve the performance


APPROACHES TO DEALING WITH A CLASS IMBALANCE

Several well-known approaches to dealing with a class imbalance have been
proposed. One of those approaches is to upsample the number of records in the
fraudulent minority class using a technique known as SMOTE (Synthetic Minority
Oversampling Technique). 

SMOTE generates synthetic copies of the minority class by interpolating between
random fraudulent samples and their nearest neighbors. For each selected
fraudulent sample, one of its nearest neighbors is chosen at random, and a new
data point is created along the line segment in feature space between the
original sample and that neighbor. While extremely popular because of its
simplicity, the SMOTE approach doesn't preserve the original data distribution
or the relationships between features, which may result in poor model
generalization.
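To make the interpolation step concrete, here is a minimal NumPy/scikit-learn
sketch of how one SMOTE-style synthetic point is produced. The function name and
the choice of k=5 neighbors are illustrative; in practice this is done through a
library such as imbalanced-learn, as shown later in this post.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_like_sample(minority_X, k=5, rng=np.random.default_rng(0)):
        """Create one synthetic point by interpolating between a random minority
        sample and one of its k nearest minority-class neighbors."""
        nn = NearestNeighbors(n_neighbors=k + 1).fit(minority_X)
        i = rng.integers(len(minority_X))
        _, idx = nn.kneighbors(minority_X[i:i + 1])
        j = rng.choice(idx[0][1:])          # skip index 0: the point itself
        gap = rng.random()                  # random position along the line segment
        return minority_X[i] + gap * (minority_X[j] - minority_X[i])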

Alternatively, synthetic data generation is an approach for upsampling the
minority class that maintains both the underlying distribution and relationships
to ultimately improve upon the model generalization capabilities derived from
the SMOTE approach.

In this article, we train an initial machine learning model on an open-source
credit card fraud detection data set. A second model is then trained by
augmenting the original training set with oversampled data generated by SMOTE.
Finally, a third model is trained using synthetic copies of the original
training set generated by the Data Embassy SDK. The performance of the three
models is then evaluated to determine the most effective approach for detecting
fraud.

The data set that we used in this exercise is a credit card fraud detection data
set taken from Kaggle. This data set contains over 284k records, of which 0.172%
are fraudulent. This severe class imbalance between the non-fraudulent and
fraudulent classes makes this data set a perfect candidate to test our
hypothesis on using synthetic data to augment the training of machine learning
models for fraud detection.
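As a minimal sketch, assuming the Kaggle file has been downloaded locally as
creditcard.csv, loading it with pandas and confirming the class imbalance looks
like this:

    import pandas as pd

    df = pd.read_csv("creditcard.csv")   # Kaggle credit card fraud detection data set

    print(df.shape)                                   # (284807, 31)
    print(df["Class"].value_counts(normalize=True))   # Class == 1 (fraud) is ~0.172%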


AUGMENTING DATASETS WITH SYNTHETIC FRAUD TRANSACTIONS

Generating synthetic data using our Data Embassy SDK is straightforward, and the
entire process can be conducted within a Jupyter notebook or your preferred IDE.
The credit card fraud data was first loaded into a Pandas DataFrame in order to
inspect and analyze the data. The fraudulent class was then explicitly split
from the non-fraudulent class, as the intention was to over-sample the
fraudulent data only. There was, therefore, no need to synthetically generate
additional samples of the majority, non-fraudulent class.
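A minimal pandas sketch of that split, assuming the Kaggle column names (Class
== 1 marks fraud):

    # Only the fraudulent minority class is passed to the synthesizer;
    # the majority class is left as-is.
    fraud_df = df[df["Class"] == 1]       # 492 confirmed fraud records
    non_fraud_df = df[df["Class"] == 0]   # ~284k legitimate records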

Following this, the synthesization process was conducted. We generated 45k
samples with the Data Embassy SDK, which aligns with the number of additional
records generated by the SMOTE approach. The number of samples can be changed
depending on the use case. No additional feature engineering was required, as
the data was already pre-processed and scaled accordingly.
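The post does not show the SDK calls themselves, so the snippet below uses a
scikit-learn Gaussian mixture purely as a stand-in generator to illustrate the
fit-then-sample workflow on the fraud class. This is not the Data Embassy SDK
API, and in practice the generator should be fitted on training-split fraud
records only to avoid leakage into the test set.

    from sklearn.mixture import GaussianMixture

    # Stand-in generator only; NOT the Data Embassy SDK API.
    feature_cols = [c for c in fraud_df.columns if c != "Class"]
    generator = GaussianMixture(n_components=8, random_state=0).fit(fraud_df[feature_cols])

    synthetic_X, _ = generator.sample(45_000)        # same count as the SMOTE experiment
    synthetic_fraud_df = pd.DataFrame(synthetic_X, columns=feature_cols)
    synthetic_fraud_df["Class"] = 1                  # all generated rows are fraud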

Once the synthesization process was completed, we saved the resulting data into
its own Pandas DataFrame, where a direct comparison between the original and
synthetic data can be made. 

Figure 1 below shows the distribution of the original data as well as synthetic
data generated for one of the most influential features in detecting fraud, V4.
Feature importance was determined by using the feature importances property of
the trained XGBoost model, where importance was measured by the average gain
across all splits where each feature was used. The illustration shows that the
synthetic data distribution of the minority class is very representative of the
original data, although the count of the synthetic data samples (y-axis) is much
larger.


Figure 1: Original vs. synthetic data for feature V4
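A sketch of how such a comparison can be plotted with matplotlib, assuming the
fraud_df and synthetic_fraud_df frames from the snippets above. The gain-based
importance itself can be read from the trained model via
get_booster().get_score(importance_type="gain") once the baseline classifier
from the next section exists.

    import matplotlib.pyplot as plt

    # Overlay the original and synthetic minority-class distributions for V4
    plt.hist(fraud_df["V4"], bins=50, alpha=0.6, label="original fraud")
    plt.hist(synthetic_fraud_df["V4"], bins=50, alpha=0.6, label="synthetic fraud")
    plt.xlabel("V4")
    plt.ylabel("count")
    plt.legend()
    plt.show()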

The distribution of the original and synthetic data for feature V14 is also
included in Figure 2. V14 is another significant contributor when inspecting the
feature contribution using XGBoost. The same shape is exhibited in the two
distributions, where the count of samples is again much larger for the synthetic
distribution, as expected.


Figure 2: Original vs. synthetic data for feature V14

The SDK also offers reports to assess the privacy and utility of synthetically
generated data. You can read more about evaluating the utility of synthetic data
and the performance of ML models in this blog post. You can also learn more
about our privacy evaluations here.


FRAUD DETECTION MODEL TRAINING AND PERFORMANCE WITH SMOTE AND SYNTHETIC DATA

The first model trained was a baseline XGBoost implementation using the features
derived from the original data set only. No specific hyperparameters were
configured for this model or for any of the subsequent models in order to make
fair comparisons between the approaches. 

The training and test data sets were created with a 90%/10% train-test split on
the original features, using stratified sampling so that the class imbalance was
maintained in both subsets. Of the 255k records in the training set, only 443
were fraudulent. The test set contained 28k records, of which 49 belonged to the
minority fraudulent class.
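A sketch of that split with scikit-learn, stratified on the label so the 0.172%
fraud rate is preserved in both subsets:

    from sklearn.model_selection import train_test_split

    X = df.drop(columns="Class")
    y = df["Class"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=42
    )
    print(y_train.sum(), y_test.sum())   # roughly 443 and 49 fraud records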

The baseline solution performs well out of the box, achieving an AUC (Area Under
the ROC Curve) score of 96.4% (Figure 3). A ROC (receiver operating
characteristic) curve plots the performance of a classification model at all
classification thresholds. AUC measures the entire two-dimensional area beneath
the ROC curve and is a popular classification metric because it captures the
trade-off between the false positive rate and the true positive rate across
thresholds. It gives a good sense of the overall performance of the model in
practice, where the decision threshold is likely to be moved away from 50% to a
level that more effectively discriminates between the classes.


Figure 3: AUC of the baseline model
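A minimal sketch of the baseline, assuming the stratified split above and the
default XGBoost hyperparameters described in the post:

    from xgboost import XGBClassifier
    from sklearn.metrics import roc_auc_score

    xgb_model = XGBClassifier(random_state=42)     # default hyperparameters
    xgb_model.fit(X_train, y_train)

    # AUC is computed on predicted fraud probabilities, not hard labels
    auc = roc_auc_score(y_test, xgb_model.predict_proba(X_test)[:, 1])
    print(f"Baseline AUC: {auc:.3f}")              # the post reports 96.4%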

The second model was trained on an augmented training set that included a number
of up-sampled fraudulent records from the training set only. These records were
generated via SMOTE. 

We were able to increase the number of fraudulent records in the minority class
to over 48k after applying SMOTE, therefore reducing the class imbalance.
Conversely, the majority class was reduced from 255k down to 100k via random
downsampling. 
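A sketch of this resampling with imbalanced-learn; the sampling ratios are
approximations chosen to land near the 48k and 100k record counts mentioned
above:

    from xgboost import XGBClassifier
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler

    # Up-sample the fraud class towards ~48k records ...
    X_sm, y_sm = SMOTE(sampling_strategy=0.19, random_state=42).fit_resample(X_train, y_train)

    # ... then randomly down-sample the legitimate class towards ~100k,
    # the combined strategy recommended by Chawla et al. (2002)
    X_smote, y_smote = RandomUnderSampler(
        sampling_strategy=0.48, random_state=42
    ).fit_resample(X_sm, y_sm)

    smote_model = XGBClassifier(random_state=42).fit(X_smote, y_smote)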

Including SMOTE in the modeling process improved the AUC to 97.5% (Figure 4),
where the model was able to detect additional fraudulent transactions not
detected by the baseline model. 

The number of false positives also increased slightly, which is expected when
applying SMOTE or other over-sampling techniques. According to the authors of
the original SMOTE paper (Chawla et al., 2002), optimal results were achieved
when combining the use of both SMOTE to up-sample the minority class, as well as
the use of random sampling to down-sample the majority class. Therefore, in
order to extract the best possible performance from the SMOTE approach, random
down-sampling of the majority class was also conducted.


Figure 4: AUC for SMOTE model

The distribution of the minority class data following the upsampling via SMOTE
resulted in the graph illustrated in Figure 5. The purple distribution on the
left contains data for the minority class from feature V4, while the gold
distribution to the right represents the upsampled minority class generated
through the implementation of SMOTE. 

By making a direct comparison with the synthetic data generated in Figure 1, we
can see that the SMOTE approach does not approximate the original distribution
as accurately as the SDK does.


Figure 5: Original vs. SMOTE data for feature V4

Figure 6 provides a comparison of the original and SMOTE distributions for
feature V14, again for the fraudulent class only. Comparing Figure 2 and Figure
6 directly, we see that the SMOTE approach yields a much wider distribution that
is not fully representative of the original data from which it was generated.


Figure 6: Original vs SMOTE data for feature V14

The final model in the experiment was the same XGBoost implementation but
included the use of the SDK for synthetic data generation. The SDK was used to
up-sample the fraudulent minority class only, by increasing the number of
fraudulent records in the training set by 45k. Model 3 was able to correctly
detect even more fraudulent transactions compared with the SMOTE model, which
provided an increase in AUC to 98.1%. This comes at the expense of an increase
in false positives which, as with SMOTE, is expected behavior.
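A sketch of the corresponding training-set augmentation, reusing the stand-in
synthetic_fraud_df from earlier together with random down-sampling of the
majority class. The dictionary form of sampling_strategy sets an absolute
target count for the legitimate class.

    import pandas as pd
    from xgboost import XGBClassifier
    from sklearn.metrics import roc_auc_score
    from imblearn.under_sampling import RandomUnderSampler

    # Down-sample the legitimate class of the training set to ~100k records ...
    X_down, y_down = RandomUnderSampler(
        sampling_strategy={0: 100_000}, random_state=42
    ).fit_resample(X_train, y_train)

    # ... then append the 45k synthetic fraud records generated earlier
    X_aug = pd.concat([X_down, synthetic_fraud_df[feature_cols]], ignore_index=True)
    y_aug = pd.concat([y_down, synthetic_fraud_df["Class"]], ignore_index=True)

    synth_model = XGBClassifier(random_state=42).fit(X_aug, y_aug)
    auc = roc_auc_score(y_test, synth_model.predict_proba(X_test)[:, 1])
    print(f"Synthetic-data AUC: {auc:.3f}")   # the post reports 98.1%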

In practice, customers are primarily interested in increasing recall while
limiting the loss in precision. An additional 10% in recall over the baseline
model is worth the drop in precision, as false negatives are significantly more
expensive (the cost of the fraud itself, loss of goodwill with payment
providers) than false positives (operational overhead, poor customer
experience).
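A sketch of how that trade-off can be inspected at the default 0.5 threshold for
the three models from the snippets above; in production the threshold would be
tuned against the cost model discussed in the next section.

    from sklearn.metrics import precision_score, recall_score

    for name, model in [("baseline", xgb_model), ("SMOTE", smote_model), ("synthetic", synth_model)]:
        pred = model.predict(X_test)
        print(f"{name:>9}  recall={recall_score(y_test, pred):.3f}  "
              f"precision={precision_score(y_test, pred):.3f}")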


RESULTS AND HOW TO IMPROVE THE PERFORMANCE

The complete results from the experiment are summarized below:

 * Baseline AUC 96.41%
 * SMOTE with down-sampling AUC 97.51%
 * Synthetic data with down-sampling AUC 98.17%

An illustration of the model performances at different classification thresholds
is presented in Figure 7. The model trained using the synthetic data generated
by the Data Embassy SDK (gold line) achieves a higher AUC score when compared to
the baseline model (purple line), and the model trained using SMOTE (red line).


Figure 7: AUC scores of the three approaches
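A sketch of how such an overlay can be produced with scikit-learn's
RocCurveDisplay, assuming the three fitted models from the snippets above:

    import matplotlib.pyplot as plt
    from sklearn.metrics import RocCurveDisplay

    ax = plt.gca()
    for name, model in [("baseline", xgb_model), ("SMOTE", smote_model), ("synthetic", synth_model)]:
        RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
    ax.set_title("ROC curves of the three approaches")
    plt.show()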

These results are conclusive, and a 2% increase in AUC can have a considerable
operational impact within an organization. In follow-up experiments, it is
important to keep the potential rise in false positives in check. A cost-based
approach should be followed, where the increase in true positives is weighed
against the increase in false positives in terms of the total cost of fraud.
This can be achieved by attaching a monetary cost to each classification
outcome, based on empirical evidence from historical instances of the different
model classifications.
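A sketch of such a cost-based comparison; the per-outcome costs below are
made-up placeholders and would in practice be derived from historical fraud
losses and investigation costs.

    from sklearn.metrics import confusion_matrix

    # Hypothetical per-transaction costs (placeholders, not real figures)
    COST = {"tn": 0.0, "fp": 10.0, "fn": 500.0, "tp": 10.0}

    def total_cost(model, X, y):
        tn, fp, fn, tp = confusion_matrix(y, model.predict(X)).ravel()
        return tn * COST["tn"] + fp * COST["fp"] + fn * COST["fn"] + tp * COST["tp"]

    for name, model in [("baseline", xgb_model), ("SMOTE", smote_model), ("synthetic", synth_model)]:
        print(f"{name:>9}  total cost: {total_cost(model, X_test, y_test):,.0f}")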


CONCLUSION

Imbalanced data is challenging, particularly in fraud detection contexts, where
access to positively labeled data is scarce. This experiment shows that synthetic
data can be used to augment model training in highly imbalanced problems, where
the results exceed those of traditional methods such as SMOTE. In addition to
providing high-quality synthetic data, our SDK also provides features to enhance
the privacy and utility of your data use cases.


REFERENCES

Chawla, N.V., Bowyer, K.W., Hall, L.O., and Kegelmeyer, W.P. (2002). SMOTE:
Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence
Research, 16, 321–357.
