Multi Model Identifies Fraud at Scale - ArangoDB White Paper

White Paper:
Multi-Model Identifies Fraud at Scale
By Arthur Keen (Senior Solution Architect, ArangoDB)

May 2020
Table of Contents

The Significance of Fraud and Graphs
Why Multi-Model for Fraud Detection?
Converting from Relational Source to Multi-model Graph
Fraud Questions
Detect Fraud Rings From a Suspicious Account
Detect All Fraud Rings
Find Orphan Accounts
Find Most Influential Customers and Accounts
What are the top 3 most influential accounts?
Finding Money Laundering Patterns
Detecting Fraud At Scale
Conclusion
Hands-on with Fraud Detection & Anti Money Laundering
Appendix A: Queries

The Significance of Fraud and Graphs
Fraud is an enormous and ever-growing problem affecting all industries and
government services. Global fraud results in over $3.7 trillion in losses annually.
Businesses lose on average 5% of their income to fraud every year. In 2018,
businesses incurred $3.13 in remediation costs for each dollar of fraud [1], dealing
with chargebacks, fees, interest, and labor.

Traditional fraud detection views data through a straw, focusing on discrete data
points including specific accounts, individuals, devices or IP addresses. However,
today’s sophisticated fraudsters escape detection by forming fraud rings
composed of stolen and synthetic identities and circuitous back channels.

To uncover fraud rings, it is essential to look beyond individual data points in
individual data sources to a broader view of the connection patterns that exist
across multiple data modalities¹. Multiple disparate data sources store
individual activities and relationships, and these need to be analysed in concert to
detect complex fraudulent behavior.

ArangoDB’s native multi-model database is ideal for tackling this challenge, because it
supports graphs, documents, key-value stores, and relational models. This provides
streamlined, flexible, and agile harmonization of the relevant multi-modal² user
activity data, delivers the performance and scale to detect complex fraud
patterns, and serves results in the different data models needed by
stakeholders.

¹ Multimodal data: Our experience of the world is multimodal: we see objects, hear sounds, feel
textures, smell odors, and taste flavors. Modality refers to the way in which something
happens or is experienced, and a research problem is characterized as multimodal when it
includes multiple such modalities.
² This juxtaposition of multi-model and multi-modal is deliberate; they are orthogonal terms.

Figure 1: Identify fraud patterns in the network of transactions and relationships.

Why Multi-Model for Fraud Detection?

ArangoDB’s multi-model graph allows you to easily fuse together disparate data
and identify complex fraudulent patterns of connections, such as fraud rings,
using the ArangoDB Query Language (AQL).

The identification of fraud ring patterns requires very deep (multi-hop) traversals
across the graph. The query for detecting a fraud ring can be written in
six lines of AQL that are easy to write and maintain, and ArangoDB can execute
these queries with sub-second response times.

With multi-model, you do not have to convert the entire dataset to a graph to do this.
Use graphs where they are needed for analytics: multi-model graphs allow you to
combine documents, joins, and graphs to solve this problem.

Converting from Relational Source to Multi-model Graph
The source of data for fraud detection would likely be a relational database,
for example, the schema depicted in Figure 3, which describes the foreign
key relationships among the Bank, Branch, Customer, Account, and
Transaction Tables.

Figure 3: Relational Source Schema

How do we convert this to a graph in ArangoDB? Because ArangoDB is a
multi-model database, the tables can be ingested as-is, directly into
ArangoDB as collections, so the Bank table becomes the Bank collection and
so on. Then you can choose whether to convert all or part of it to a graph
model based on the requirements for fraud analytics.

Since we need to do deep link and traversal analytics on the account transactions
and the customers, it makes sense to add graph edges in this part of the
data model. Here we use the Transaction collection as an edge collection and
materialize the CustomerAccount foreign key as the AccountHolder edge.

Figure 4: Multi-Model Schema: Documents, Joins, Graph

We use the convention of converting foreign key relationships to edges that
are directed from the dependent to the independent entity. Resolution
entities (also known as join tables) in the relational model can be used as edges in a
graph model, as we have done with Transaction.
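
As an illustration of this convention, the following is a minimal sketch of how the AccountHolder edges could be materialized from the foreign key with a single AQL statement. It is not taken from the demo itself. The edge collection name accountHolder is an assumption, and the customer_id attribute and the customer collection are inferred from the orphan-account query in Appendix A.

```aql
// Sketch only. Assumptions: an edge collection named accountHolder already
// exists, and every account document stores the key of its owning customer
// in customer_id (as the orphan-account query in Appendix A suggests).
FOR acct IN account
  FILTER acct.customer_id != null
  INSERT {
    _from: acct._id,                             // dependent entity (account)
    _to: CONCAT("customer/", acct.customer_id)   // independent entity (customer)
  } INTO accountHolder
```

The edge points from the account to the customer, following the dependent-to-independent direction described above.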

Fraud Questions
We will describe how to use ArangoDB to answer various questions:

● Are there potential fraud rings connected to a suspicious account?
● Are there any potential fraud rings in my data?
● Are there orphan accounts (those not transacting)?
● Who are the most influential customers/accounts in transactions?
● Are there any money laundering patterns?

The following section shows how these questions can be answered in
ArangoDB on synthetically generated transaction data. The queries are
examples for detecting the patterns on this synthetic data set, meant to
inspire practitioners to develop real-world fraud detection capabilities on
ArangoDB with real data.
Detect Fraud Rings From a Suspicious Account

Fraud rings consist of very long loops of transactions and relationships
among individuals that are used by fraudsters to evade detection. These
long loops are also used in sophisticated cyber crime, where the perpetrators
create long paths of logins across multiple systems to avoid detection. The
reason these long paths are difficult to detect and understand is that they
require deep multi-hop traversals into the graph of transactions and
relationships among the individuals collaborating in the fraud.

In conventional systems, these multi-hop queries require a high number of
joins, which can take a substantial amount of time and consume a large
amount of computing resources. ArangoDB’s graph model supports high
performance multi-hop queries, where for example 10-hop queries on large
datasets can take less than 10 milliseconds depending on the topology of the
graph. For this example, the query finds long loops of transactions starting
from a suspicious account and looping back to the suspicious account over 5
to 10 transaction hops.

Figure 5 depicts the fraud ring detection query, written in the ArangoDB
Query Language (AQL), being developed and executed in the ArangoDB
administrative panel. Note that this sophisticated query is expressed in 6
lines of AQL code and that the compact representation is easy to
understand and maintain. The query results are displayed as a
circuit in the graph visualization and are also available as JSON, so they can be
processed by applications calling this query. Note also that the query is
parameterized by the suspicious account and the number of loops to return.

Figure 5: Finding fraud ring(s) from a suspicious account
Detect All Fraud Rings

In the previous example, we detected fraud rings connected to a suspicious
account. What if we did not have a list of suspicious accounts to analyze yet
and wanted to analyze our graph to detect all of the fraud ring patterns in it?

This is easily accomplished in AQL by adding an outer loop over all accounts
to the fraud ring detector for suspicious accounts. This sophisticated query is
written in only 6 lines of AQL!
The query for finding all fraud loops is depicted in Figure 6.
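
Because the outer loop starts a traversal from every account, a ring can appear several times in the output, once for each account on it. The following variant is a hedged sketch, not part of the original demo, that summarizes the result per account instead of returning every path. It reuses the 'fraud-detection' graph and the traversal from Appendix A; the variable names and the output shape are illustrative assumptions.

```aql
// Sketch only: count the closed loops found from each starting account.
WITH account, transaction
FOR start IN account
  LET rings = (
    FOR acct, tx, path IN 5..10 OUTBOUND start._id GRAPH 'fraud-detection'
      PRUNE tx._to == start._id    // stop extending once the loop is closed
      FILTER tx._to == start._id   // keep only paths that return to the start
      RETURN 1
  )
  LET ringCount = LENGTH(rings)
  FILTER ringCount > 0
  SORT ringCount DESC
  RETURN { account: start._id, rings: ringCount }
```

Accounts at the top of this list take part in the most loops and may be reasonable starting points for the single-account query.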

Figure 6: Find all fraud rings
Find Orphan Accounts

There are many patterns for finding suspicious accounts that may require
further investigation. Most of these patterns are essentially finding
anomalous behavior to flag accounts.

One pattern is the orphan account, where an account is set up to participate
in very specific fraud transaction patterns, but otherwise does not interact in
a ‘normal’ way with other accounts and may be used very infrequently.

Figure 7 depicts a query that finds orphan accounts and reports each
account together with its owner.
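
The query in Appendix A finds accounts with no transactions at all. To also catch accounts that are used very infrequently, a variant along the following lines could be used. This is a sketch under assumptions: the bind parameter @maxTransactions is hypothetical, and the attribute names are taken from the Appendix A query.

```aql
// Sketch only: list accounts with at most @maxTransactions transactions,
// counting both incoming and outgoing edges.
FOR acct IN account
  LET activity = (
    FOR tx IN transaction
      FILTER tx._from == acct._id OR tx._to == acct._id
      RETURN 1
  )
  LET txCount = LENGTH(activity)
  FILTER txCount <= @maxTransactions
  SORT txCount ASC
  RETURN {
    accountID: acct._id,
    customerID: acct.customer_id,
    transactions: txCount
  }
```

Setting @maxTransactions to 0 gives the same set of accounts as the no-transaction query, although the output is shaped differently.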


Figure 7: Find Suspicious “Orphan” Accounts

Find Most Influential Customers and Accounts

We can also use standard graph algorithms such as PageRank to find deeply
coordinated activity, by looking for the most influential customers and
accounts.

The PageRank algorithm scores how important or influential a vertex is
relative to the rest of the network. This is accomplished in ArangoDB by
executing ArangoDB’s PageRank algorithm on the graph via the Pregel
interface and then visualizing the results.
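
For reference, the standard textbook formulation of PageRank is given below. It is the general definition, not a description of the internals of ArangoDB's Pregel implementation, and the symbols are the usual conventions rather than anything defined in this paper: N is the number of vertices, In(v) is the set of vertices with an edge into v, outdeg(u) is the number of outgoing edges of u, and d is the damping factor (commonly 0.85).

```latex
PR(v) = \frac{1 - d}{N} + d \sum_{u \in \mathrm{In}(v)} \frac{PR(u)}{\mathrm{outdeg}(u)}
```

Applied to this graph, an account scores highly when it receives transactions from accounts that themselves score highly.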

Figure 8 depicts a visualization of several clusters of customer/
account/transaction activity, where the size of each vertex is scaled in
proportion to the PageRank computed for that vertex. This visualization
provides visual cues to the relative dominance of customers and accounts in
the network.


Figure 8: Find most influential accounts and customers

What are the top 3 most influential accounts?

Top 3 or top 10 queries are often used to focus attention. In this example,
we use an AQL query to find the top 3 most influential accounts. This query
essentially reads the PageRank value stored by ArangoDB’s PageRank
algorithm, sorts the results in descending order, and limits the output
to three. The query and the results of execution are depicted in Figure 9.
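
Since Figure 9 is an image, the following sketch shows roughly what such a query can look like. The attribute name pagerank is an assumption; the actual name depends on the result field chosen when the PageRank job was run.

```aql
// Sketch only: assumes the PageRank score was written to an attribute
// named pagerank on each account document.
FOR acct IN account
  FILTER acct.pagerank != null
  SORT acct.pagerank DESC
  LIMIT 3
  RETURN { accountID: acct._id, pagerank: acct.pagerank }
```

Changing the LIMIT value turns this into a top 10 query.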

Figure 9: Query for listing top 3 most influential accounts

Finding Money Laundering Patterns

ArangoDB can also be used to find more specific patterns, for example, in
money laundering. In money laundering there is a funds
disaggregation/aggregation pattern, where many small transactions (below
some known triggering threshold) are used to split up a large sum of money,
followed by multiple transaction hops across accounts to further avoid
detection, ultimately followed by a number of transactions that aggregate
the funds back to an account.

This fan-out/fan-in pattern can easily be detected using AQL. The query and
results are depicted in Figure 10.
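
As a hedged illustration of the fan-out half of this pattern, the sketch below lists accounts that send many transactions below a threshold. It is not the query shown in Figure 10. The amount attribute and the bind parameters @reportingThreshold and @minFanOut are assumptions about the data and are not taken from the demo.

```aql
// Sketch only: find accounts that send at least @minFanOut transactions
// with an amount below @reportingThreshold.
FOR tx IN transaction
  FILTER tx.amount < @reportingThreshold
  COLLECT source = tx._from WITH COUNT INTO smallTransfers
  FILTER smallTransfers >= @minFanOut
  SORT smallTransfers DESC
  RETURN { account: source, smallTransfers: smallTransfers }
```

The accounts returned here could then serve as starting points for a multi-hop outbound traversal that looks for the re-aggregation step.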

Figure 10: Finding Money Laundering Patterns

Detecting Fraud At Scale
Real-world financial transactions generate billions of data points and
relationships, which will rapidly overrun the capabilities of a single server.
Providing fraud-detection performance at scale requires the underlying data
systems to be able to scale out data across multiple nodes in a distributed
cluster and to be able to efficiently distribute computation in parallel across
the cluster.

On a distributed database cluster, the limiting factor is network performance,
because network access is roughly two orders of magnitude slower than
memory access, and in a distributed cluster there will always be data and
communication traffic between nodes. For example, the performance of
detecting a fraud ring would suffer if many of the edges
being traversed caused computation to hop back and forth between servers.
Better network performance obviously improves overall performance.
However, there are also data distribution and query optimizations that can
greatly reduce the amount of inter-node communication needed to execute
queries, and therefore improve distributed performance.

Optimizing the layout of data on the cluster can reduce the inter-node
communication needed to perform queries. ArangoDB uses SmartGraphs
to optimize graph distribution across a cluster, SmartJoins to
ensure that joins do not cross servers, and SatelliteCollections to replicate
metadata across servers so that lookups occur locally on each server.

Figure 11: Bad distribution of graph data causes network hops during query execution

The SmartGraphs feature of ArangoDB allows us to handle this problem in a
smarter way. In fraud detection, we might know from past cases that
fraudsters use banks in certain countries or regions to launder their money.
We can use this domain knowledge as a sharding key for our graph data,
placing all financial transactions performed in such a region on DB server 1
and distributing other transactions across the other DB servers. With this
approach, the data that needs to be grouped together is co-located on one
machine, and the query engines on each DB server can execute our
queries in parallel.

Figure 12: Optimized data distribution with ArangoDB SmartGraphs

Conclusion
This paper points the way to using ArangoDB as part of a fraud detection
solution. We encourage users to experiment with our sample data and
sample queries, to learn how to apply ArangoDB to fraud via experimentation
by adding to or modifying the data and queries, and to be inspired and empowered
to apply their knowledge of fraud and use ArangoDB on their own data to
detect fraudulent activity. To get started easily, you can follow the interactive
demo provided on our cloud service ArangoDB Oasis and described below.

Hands-on with Fraud Detection & Anti Money Laundering
Testing ArangoDB and its capabilities for detecting fraud and money
laundering is very simple. Many of the use cases shown in this white paper
are part of an interactive demo available for free on ArangoDB’s cloud
service Oasis. No credit card is needed for a 14-day free trial deployment, and
the examples can be installed with just one click. A detailed guide is provided
so that everyone can follow along easily.

Just sign up for ArangoDB Oasis and follow the few steps below:

1. Create a deployment (a 2-minute video tutorial is available).
2. Install the Fraud Detection example in Oasis (Project -> Deployment
tab -> View your deployment -> Examples tab, or just click “View
Deployment” directly after initiating the deployment creation).
3. After the example is ready (about 1 minute), follow the Fraud Detection
guide provided to run real queries against the demo data you just
installed.


This White Paper was written by Arthur Keen. For any questions about solving
Fraud Detection cases with ArangoDB, feel free to reach out to
arthur@arangodb.com
Appendix A: Queries
/*
Find all suspicious long loops of transactions.
Show the graph and JSON results.
Scroll to the bottom of the graph results and click "GraphViewer" to see the results in the Graph Viewer.
*/

WITH transaction, account
FOR suspicious_account IN account
  // Follow 5 to 10 outgoing transaction hops from each account
  FOR acct, tx, path IN 5..10 OUTBOUND suspicious_account._id GRAPH 'fraud-detection'
    // Stop extending a path once it has returned to the starting account
    PRUNE tx._to == suspicious_account._id
    // Keep only the paths that close the loop
    FILTER tx._to == suspicious_account._id
    RETURN path

/*
Find a given number of suspicious loops from a suspicious account.
Hints:
Try suspiciousAccountID = account/10000032
Rerun the query for a different number of loops returned
*/

WITH account, transaction
LET suspicious_account = DOCUMENT(@suspiciousAccountID)
FOR acct, tx, path IN 5..10 OUTBOUND suspicious_account._id GRAPH 'fraud-detection'
  // Stop extending a path once it has returned to the suspicious account
  PRUNE tx._to == suspicious_account._id
  // Keep only the paths that close the loop
  FILTER tx._to == suspicious_account._id
  LIMIT @numberOfLoopsReturned
  RETURN path

/*
Find orphan accounts.
An orphan account is an account with few or no transactions.
These may be set up in advance of money laundering operations.
This query finds accounts with no transactions.
*/

// Collect every account ID that appears as the source or target of a transaction
LET usedResources = UNION_DISTINCT(
  (FOR relationship IN transaction RETURN relationship._from),
  (FOR relationship IN transaction RETURN relationship._to)
)
FOR resource IN account
  FILTER resource._id NOT IN usedResources
  SORT resource.account_type, resource.customer_id
  RETURN {
    "customerName": DOCUMENT(CONCAT("customer/", resource.customer_id)).Name,
    "customerID": resource.customer_id,
    "accountID": resource._id,
    "type": resource.account_type
  }

/*
Anti Money Laundering Pattern Detection
Find transaction patterns that contain a disaggregation and re-aggregation of funds.
This pattern is characterized by transactions that disaggregate funds from a source
account to multiple accounts in amounts that are below a reporting threshold
(i.e., below $10,000), followed by a series of small transactions into one or more
accounts, followed by re-aggregation of the small transactions into a destination
account.
*/

WITH account, transaction
// Number of outgoing transactions per account
LET accountOutDegree = (
  FOR tx IN transaction
    COLLECT accountOut = tx._from WITH COUNT INTO outDegree
    RETURN { account: accountOut, outDegree: outDegree }
)
// Number of incoming transactions per account
LET accountInDegree = (
  FOR tx IN transaction
    COLLECT accountIn = tx._to WITH COUNT INTO inDegree
    RETURN { account: accountIn, inDegree: inDegree }
)
// Combined in/out degree per account (not used further below)
LET accountDegree = (
  FOR inRecord IN accountInDegree
    FOR outRecord IN accountOutDegree
      FILTER inRecord.account == outRecord.account
      RETURN MERGE(inRecord, outRecord)
)
// Account with the highest number of outgoing transactions
LET maxAccount = (
  FOR maxDegree IN accountOutDegree
    FILTER maxDegree.outDegree == MAX(accountOutDegree[*].outDegree)
    RETURN maxDegree
)[0]
// Follow the transactions up to four hops out from that account
FOR acct, tx IN 1..4 OUTBOUND maxAccount.account transaction
  RETURN tx
