Multi Model Identifies Fraud at Scale - ArangoDB White Paper
The Significance of Fraud and Graphs
Fraud is an enormous and ever-growing problem impacting all industries and
government services. Global fraud results in over $3.7 trillion in losses annually.
Businesses lose on average 5% of their income to fraud every year. In 2018,
businesses incurred $3.13 in remediation costs for each dollar of fraud [1], covering
chargebacks, fees, interest and labor.
Traditional fraud detection views data through a straw, focusing on discrete data
points including specific accounts, individuals, devices or IP addresses. However,
today’s sophisticated fraudsters escape detection by forming fraud rings
composed of stolen and synthetic identities and circuitous back channels.
To uncover fraud rings, it is essential to look beyond individual data points in
individual data sources to a broader view of the connection patterns that exist
across multiple data modalities¹. Multiple disparate data sources store
individual activities and relationships, and these need to be analysed in concert to
detect complex fraudulent behavior.
ArangoDB’s native multi-model approach is ideal for tackling this challenge, because it
supports graphs, documents, key/value stores, and relational models. This provides
streamlined, flexible, and agile harmonization of the relevant multi-modal² user
activity data, provides the performance and scale to detect complex fraud
patterns, and serves results in the different data models needed by
stakeholders.
¹ Multimodal data. Our experience of the world is multimodal: we see objects, hear sounds, feel
textures, smell odors, and taste flavors. Modality refers to the way in which something
happens or is experienced, and a research problem is characterized as multimodal when it
includes multiple such modalities.
² This juxtaposition of multi-model and multi-modal is deliberate; they are orthogonal terms.
Figure 1: Identify fraud patterns in the network of transactions and relationships.
Converting from Relational Source to Multi-model Graph
The source of data for fraud detection would likely be a relational database,
for example, the schema depicted in Figure 3, which describes the foreign
key relationships among the Bank, Branch, Customer, Account, and
Transaction Tables.
Figure 3: Relational Source Schema
How do we convert this to a graph in ArangoDB? Because ArangoDB is a
multi-model database, the tables can be ingested as-is, directly into
ArangoDB as collections, so the Bank table becomes the Bank collection and
so on. Then you can choose whether to convert all or part of it to a graph
model based on the requirements for fraud analytics.
Since we need to do deep link/traversal analytics on the account transactions
and the customers, it makes sense to add graph edges in this part of the
schema. In this transformation we use the Transaction
collection as an edge collection and materialize the CustomerAccount foreign key as
the AccountHolder edge.
Figure 4: Multi-Model Schema: Documents, Joins, Graph
We use the convention of converting foreign key relationships to edges that
are directed from the dependent to the independent entity. Resolution
entities (also known as join tables) in the relational model can be used as edges in a
graph model, as we have done with Transaction.
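The conversion convention above can be sketched in plain Python. This is only an illustration of the idea, not ArangoDB's import tooling: the row layouts and the column names (`account_id`, `customer_id`, `sender_account`, `receiver_account`, `tx_id`) are hypothetical. The one ArangoDB-specific detail it relies on is that edge documents carry `_from` and `_to` attributes holding document IDs of the form `collection/key`.

```python
# Hypothetical relational rows; the column names are assumptions for illustration.
accounts = [
    {"account_id": "A1", "customer_id": "C1"},
    {"account_id": "A2", "customer_id": "C2"},
]
transactions = [
    {"tx_id": "T1", "sender_account": "A1", "receiver_account": "A2", "amount": 250.0},
]


def account_holder_edges(account_rows):
    """Turn the account -> customer foreign key into edge documents.

    The edge points from the dependent entity (the account, which holds the
    foreign key) to the independent entity (the customer).
    """
    return [
        {
            "_from": f"account/{row['account_id']}",
            "_to": f"customer/{row['customer_id']}",
        }
        for row in account_rows
    ]


def transaction_edges(tx_rows):
    """Reuse each join-table row as an edge between two accounts."""
    edges = []
    for row in tx_rows:
        edges.append(
            {
                "_key": row["tx_id"],
                "_from": f"account/{row['sender_account']}",
                "_to": f"account/{row['receiver_account']}",
                "amount": row["amount"],  # payload columns stay on the edge
            }
        )
    return edges


holder_edges = account_holder_edges(accounts)
tx_edges = transaction_edges(transactions)
```

The remaining tables (Bank, Branch, Customer, Account) would stay as ordinary document collections under this scheme.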
Fraud Questions
We will describe how to use ArangoDB to answer various questions:
● Are there potential fraud rings connected to a suspicious account?
● Are there any potential fraud rings in my data?
● Are there orphan accounts (those not transacting)?
● Who are the most influential customers/accounts in transactions?
● Are there any money laundering patterns?
The following section shows how these questions can be answered in
ArangoDB on synthetically generated transaction data. The queries are
examples for detecting the patterns on this synthetic data set, meant to
inspire practitioners to develop real-world fraud detection capabilities on
ArangoDB with real data.
graph. For this example, the query finds long loops of transactions starting
from a suspicious account and looping back to the suspicious account over 5
to 10 transaction hops.
Figure 5 depicts the fraud ring detection query written in the ArangoDB
Query Language (AQL) being developed and executed in the ArangoDB
administrative panel. Note that this sophisticated query is expressed in 6
lines of AQL code and that the compact representation is easy to
understand and maintain. The query results are displayed as a
circuit in the graph visualization and are also available as JSON, so they can be
processed by applications calling this query. Note also that the query is
parameterized by the suspicious account and the number of loops to return.
Figure 5: Finding fraud Ring(s) from a suspicious account
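To make the traversal logic concrete, here is a minimal plain-Python sketch of the same idea: follow outgoing transactions from a suspicious account and keep only paths that return to it within 5 to 10 hops. It is not the AQL query and does not use ArangoDB. The edge list and account names are made up, and as a simplifying assumption the sketch never revisits an account within one path.

```python
def find_rings(edges, start, min_hops=5, max_hops=10):
    """Return paths that leave `start` and come back in min_hops..max_hops edges."""
    outgoing = {}
    for src, dst in edges:
        outgoing.setdefault(src, []).append(dst)

    rings = []

    def walk(path):
        hops = len(path) - 1
        if hops >= max_hops:
            return  # depth limit reached, do not extend this path
        for nxt in outgoing.get(path[-1], []):
            if nxt == start:
                # The loop is closed; keep it only if it is long enough.
                if hops + 1 >= min_hops:
                    rings.append(path + [nxt])
                continue  # never traverse past the starting account
            if nxt in path:
                continue  # simplification: no repeated accounts in a path
            walk(path + [nxt])

    walk([start])
    return rings


# Made-up transactions: one 5-hop ring through "a" and one short 2-hop loop.
example_edges = [
    ("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "a"),
    ("a", "x"), ("x", "a"),
]
rings_from_a = find_rings(example_edges, "a")
```

In this example only the 5-hop ring is reported; the 2-hop loop is shorter than the minimum depth and is dropped.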
This is easily accomplished in AQL by adding an outer loop to the fraud ring
detector for suspicious accounts. This sophisticated query is written in only 6
lines of AQL!
The query for finding all fraud loops is depicted in Figure 6.
Figure 6: Find all fraud rings
Figure 7: Find Suspicious “Orphan” Accounts
Figure 8: Find most influential accounts and customers
Figure 9: Query for listing top 3 most influential accounts
Figure 10: Finding Money Laundering Patterns
Detecting Fraud At Scale
Real-world financial transactions generate billions of data points and
relationships, which will rapidly overrun the capabilities of a single server.
Providing fraud-detection performance at scale requires the underlying data
systems to be able to scale out data across multiple nodes in a distributed
cluster and to be able to efficiently distribute computation in parallel across
the cluster.
On a distributed database cluster, the limiting factor is network performance,
because network access is roughly two orders of magnitude slower than
memory access, and in a distributed cluster there will be data and communication
traffic between nodes. For example, the performance of
detecting a fraud ring would be negatively impacted if many of the edges
being traversed caused computation to hop back and forth between servers.
Better network performance improves overall performance, but
there are also data distribution and query optimizations that can
greatly reduce the amount of inter-node communication needed to execute
queries, and therefore improve distributed performance.
Optimizing the layout of data on the cluster can reduce the inter-node
communication needed to perform queries. ArangoDB uses SmartGraphs
to optimize graph distribution across a cluster, SmartJoins to
ensure that joins do not cross servers, and SatelliteCollections to replicate
metadata across servers so that lookups occur locally on each server.
Figure 11: Bad distribution of graph data causes network hops during query execution
The SmartGraph feature of ArangoDB allows us to handle this problem in a
smarter way. In fraud detection we might know from the past that
fraudsters use banks in certain countries or regions to launder their money.
We can use this domain knowledge as a sharding key for our graph data and
allocate all financial transactions performed in this region to DB server 1,
and distribute other transactions across other DB servers. By using this
approach we can place all data that needs to be grouped together on the same
machine, and use the query engines on each DB server to execute our
queries in parallel.
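The effect of choosing a region-based key can be illustrated with a small plain-Python sketch. It does not show how SmartGraphs are implemented or configured. It only compares two hypothetical placement strategies on made-up data by counting how many transaction edges connect accounts stored on different servers, since each such edge would cost a network hop during a traversal.

```python
# Made-up accounts with a hypothetical "region" attribute.
account_regions = {
    "acc1": "region_x", "acc2": "region_x", "acc3": "region_x",
    "acc4": "region_y", "acc5": "region_y",
}
# Made-up transactions; all of them stay within one region.
transactions = [("acc1", "acc2"), ("acc2", "acc3"), ("acc3", "acc1"), ("acc4", "acc5")]


def place_round_robin(regions_by_account, n_servers):
    """Naive placement that ignores domain knowledge about regions."""
    ordered = sorted(regions_by_account)
    return {acct: i % n_servers for i, acct in enumerate(ordered)}


def place_by_region(regions_by_account, n_servers):
    """Placement that keeps every account of a region on the same server."""
    regions = sorted(set(regions_by_account.values()))
    server_of_region = {region: i % n_servers for i, region in enumerate(regions)}
    return {acct: server_of_region[r] for acct, r in regions_by_account.items()}


def cross_server_edges(edges, placement):
    """Count edges whose endpoints are stored on different servers."""
    return sum(1 for src, dst in edges if placement[src] != placement[dst])


naive_hops = cross_server_edges(transactions, place_round_robin(account_regions, 2))
region_hops = cross_server_edges(transactions, place_by_region(account_regions, 2))
```

On this toy data the naive placement leaves three of the four edges crossing servers, while the region-based placement leaves none. Real data will rarely separate this cleanly, so the numbers only indicate the direction of the effect.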
Conclusion
This paper points the way to using ArangoDB as part of a fraud detection
solution. We encourage users to experiment with our sample data and
sample queries, learn how to apply ArangoDB to fraud via experimentation
by adding/modifying the data and queries, and be inspired and empowered
to apply their knowledge of fraud to use ArangoDB on their own data to
detect fraudulent activity. To get started easily, you can follow the interactive
demo provided on our cloud service, ArangoDB Oasis, and described below.
Hands-on with Fraud Detection & Anti Money Laundering
Testing ArangoDB and its capabilities for detecting fraud and money
laundering is very simple. Many of the use cases shown in this white paper
are part of an interactive demo available for free on ArangoDB’s cloud
service Oasis. No credit card is needed for a 14-day free trial deployment and
the examples can be installed with just one click. A detailed guide is provided
so that anyone can follow along easily.
Just sign up for ArangoDB Oasis and follow the few steps below:
1. Create a Deployment (a 2-minute video tutorial is available)
2. Install the Fraud Detection Example in Oasis (Project -> Deployment
Tab -> View your deployment -> Examples Tab, or just click “view
Deployment” directly after initiating the deployment creation)
3. After the example is ready (~1 minute), follow the Fraud Detection
guide provided to run real queries against the demo data you just
installed
This White Paper was written by Arthur Keen. For any questions about solving
Fraud Detection cases with ArangoDB, feel free to reach out to
arthur@arangodb.com
Appendix A: Queries
/*
Find all suspicious long loops of transactions
Show the graph and JSON results
Scroll to bottom of graph results and click "GraphViewer" to see results in Graph Viewer
*/
WITH transaction, account
FOR suspicious_account IN account
  FOR acct, tx, path IN 5..10 OUTBOUND suspicious_account._id GRAPH 'fraud-detection'
    PRUNE tx._to == suspicious_account._id
    FILTER tx._to == suspicious_account._id
    RETURN path
/*
Find a given number of suspicious loops from a suspicious account
Hints:
Try suspiciousAccountID = account/10000032
Rerun the query for different numbers of loops returned
Show the graph and JSON results
Scroll to bottom of graph results and click "GraphViewer" to see results in Graph Viewer
*/
WITH account, transaction
LET suspicious_account = DOCUMENT(@suspiciousAccountID)
FOR acct, tx, path IN 5..10 OUTBOUND suspicious_account._id GRAPH 'fraud-detection'
  PRUNE tx._to == suspicious_account._id
  FILTER tx._to == suspicious_account._id
  LIMIT @numberOfLoopsReturned
  RETURN path
/*
Find Orphan Accounts
An orphan account is an account with few or no transactions.
These may be set up in advance of money laundering operations.
This query finds accounts with no transactions
*/
// Each subquery is wrapped in its own parentheses because the function takes two arguments
LET usedResources = UNION_DISTINCT(
  (FOR relationship IN transaction RETURN relationship._from),
  (FOR relationship IN transaction RETURN relationship._to)
)
FOR resource IN account
  FILTER resource._id NOT IN usedResources
  SORT resource.account_type, resource.customer_id
  RETURN {
    "customerName": DOCUMENT(CONCAT("customer/", resource.customer_id)).Name,
    "customerID": resource.customer_id,
    "accountID": resource._id,
    "type": resource.account_type
  }
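For readers who want to check the orphan-account logic outside the database, the same set difference can be sketched in plain Python. This is an illustration on made-up IDs, not a replacement for the AQL query: collect every account that appears as a sender or receiver, then report the accounts that never appear.

```python
def orphan_accounts(account_ids, transaction_edges):
    """Return the accounts that appear in no transaction, sorted by ID."""
    used = set()
    for sender, receiver in transaction_edges:
        used.add(sender)
        used.add(receiver)
    return sorted(acct for acct in account_ids if acct not in used)


# Made-up sample data: account/3 neither sends nor receives anything.
sample_accounts = ["account/1", "account/2", "account/3"]
sample_transactions = [("account/1", "account/2")]
orphans = orphan_accounts(sample_accounts, sample_transactions)
```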
/*
Anti Money Laundering Pattern Detection
Find transaction patterns that contain a disaggregation and re-aggregation of funds.
This pattern is characterized by transactions that disaggregate funds from a source
account to multiple accounts in amounts that are below a reporting threshold, i.e.,
below $10,000, followed by a series of small transactions into one or more accounts,
followed by re-aggregation of the small transactions into a destination account.
Show the graph and JSON results
Scroll to bottom of graph results and click "GraphViewer" to see results in Graph Viewer
*/
WITH account, transaction
// Number of outgoing transactions per account
LET accountOutDegree = (
  FOR t IN transaction
    COLLECT accountOut = t._from WITH COUNT INTO outDegree
    RETURN { account: accountOut, outDegree: outDegree }
)
// Number of incoming transactions per account
LET accountInDegree = (
  FOR t IN transaction
    COLLECT accountIn = t._to WITH COUNT INTO inDegree
    RETURN { account: accountIn, inDegree: inDegree }
)
// Combined in/out degree per account
LET accountDegree = (
  FOR inRecord IN accountInDegree
    FOR outRecord IN accountOutDegree
      FILTER inRecord.account == outRecord.account
      RETURN MERGE(inRecord, outRecord)
)
// Account with the highest out-degree
LET maxAccount = (
  FOR maxDegree IN accountOutDegree
    FILTER maxDegree.outDegree == MAX(accountOutDegree[*].outDegree)
    RETURN maxDegree
)[0]
// Loop variables are named acct and tx so they do not shadow the collection names
FOR acct, tx IN 1..4 OUTBOUND maxAccount.account transaction
  RETURN tx
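The two steps of the query above, picking the account with the highest out-degree and then following its outgoing transactions for up to four hops, can be mirrored in a short plain-Python sketch. The data is made up, and as a simplifying assumption the sketch returns each reachable edge once rather than once per path.

```python
from collections import Counter


def highest_fan_out(edges):
    """Return the account with the most outgoing transactions."""
    out_degree = Counter(sender for sender, _receiver, _amount in edges)
    return max(out_degree, key=out_degree.get)


def edges_within_hops(edges, source, max_hops=4):
    """Collect the distinct edges reachable from `source` in at most max_hops steps."""
    outgoing = {}
    for edge in edges:
        outgoing.setdefault(edge[0], []).append(edge)

    found = []
    visited = {source}
    frontier = [source]
    for _ in range(max_hops):
        next_frontier = []
        for acct in frontier:
            for edge in outgoing.get(acct, []):
                if edge not in found:
                    found.append(edge)
                if edge[1] not in visited:
                    visited.add(edge[1])
                    next_frontier.append(edge[1])
        frontier = next_frontier
    return found


# Made-up pattern: "src" splits funds into three sub-threshold transfers,
# which are re-aggregated in "dst"; "other" is unrelated upstream activity.
sample_edges = [
    ("src", "m1", 9000), ("src", "m2", 9500), ("src", "m3", 8000),
    ("m1", "dst", 9000), ("m2", "dst", 9500), ("m3", "dst", 8000),
    ("other", "src", 50000),
]
hub = highest_fan_out(sample_edges)
pattern_edges = edges_within_hops(sample_edges, hub)
```

A real detector would also need to check amounts against the reporting threshold and confirm the re-aggregation step; like the query, the sketch only surfaces the candidate subgraph.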