Correlation Sets 1

Associate Operator
Module:
BigID Correlation Sets 1
Learn how to add a correlation set and find or select correlation
attributes.
*Mandatory prerequisite modules: Operator Essentials 1, Data Sources 1
Correlation Learning Sets
▪ To associate discovered personal information with specific entities, it is
necessary to add and configure at least one Correlation Learning Set
▪ A correlation set is based on a supported, existing data source that holds
identifying entity data
What is an Entity?
An individual or a “thing” for which sensitive information has
been collected during a business process.
What is a Correlation Set?
An exemplary data source table or view containing data which
can identify an entity uniquely across all data sources.
© 2022 BigID. All rights reserved. – 2 –

Correlation Sets Are Important…
Without a correlation set configured:

○ There is no correlation to entities
○ You cannot connect an entity to its data
○ Access requests cannot be fulfilled in full
○ Other personal data finding logic will be limited
We can still discover sensitive ‘non-personal’ data using the various classification-based
discovery forms.

Correlation Learning Sets (Entity Sources)
A correlation set defines the learning set for the correlation – the initial attributes that
are later used to discover linked personal data in the data sources themselves.
A correlation set assigns relevant correlation attributes to special roles:
Unique ID (unique, mandatory)
● An identifier key, such as user ID, customer ID, email, passport number, or any field that is unique for
each individual entity. Used to correlate data from different sources to a single individual entity. Can
be a composite value created from joining multiple fields and a separator.
Display Name (not necessarily unique, optional)
● A field that provides a friendly, human-readable label for this entity, such as full name or email
address.
● If not provided, the Unique ID will be used as the Display Name.
Residency (not necessarily unique, optional but highly recommended)
● The data subject’s country/state of residence, defined as a string or as an ISO 2-letter or 3-letter
country code.
● If not provided, the platform will lack geographic context for the entity.

Adding a Correlation Set DEMO
1. Open the correlation sets view.
2. Review existing correlation sets.
3. Add a new correlation set connection based on a data source.
○ Review the attributes that are returned after testing the connection – which are good entity
attributes? Which are not?
○ For which of the attributes should the sure match be overridden?
○ Show building a composite Unique ID and setting the display name and residency

Add a Correlation Set
Select the Administration tab on the sidebar, then Correlation Sets, then New
Correlation Set:

Add a Correlation Set | Fill Initial Parameters
Populate the respective fields, then select Test Connection:
■ Assign a meaningful name for the correlation set.
■ Select an existing data source connection.
■ Reserved for future use.
■ Enter the name of the table in the data source that holds the relevant entity attributes (the
table must exist), or use the query string.
■ Control sampling options.
■ Select to use a simple unique ID or a composite.
■ When you have finished populating all the parameters, select Test Connection.

DB Table vs Custom Query String
With DB Table, you enter the name of the database table where the entity information is
stored. The corresponding select query is built for you automatically, and shown in the
Query String box below. If all fields required for an entity are in a single table, it is easiest
to use this option.
With Custom Query String, you write your own SQL query. This option can be used to create
a set of entity fields from an arbitrary SQL query, including one that joins multiple tables.
All tables in the query must be in this data source.
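The difference between the two options can be sketched with Python's built-in sqlite3 module. This is only an illustration: the tables, columns, and the SELECT * shape of the auto-built query are assumptions, not the platform's actual behavior.

```python
import sqlite3

# Hypothetical tables and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id TEXT, email TEXT, country TEXT);
    CREATE TABLE profiles (customer_id TEXT, full_name TEXT);
    INSERT INTO customers VALUES ('C1', 'ann@example.com', 'US');
    INSERT INTO customers VALUES ('C2', 'bob@example.com', 'DE');
    INSERT INTO profiles VALUES ('C1', 'Ann Lee');
    INSERT INTO profiles VALUES ('C2', 'Bob Roy');
""")


def table_query(table: str) -> str:
    """Assumed shape of the query that the DB Table option builds for you."""
    return f"SELECT * FROM {table}"


# Custom Query String: entity fields assembled from two tables with a join.
CUSTOM_QUERY = """
    SELECT c.customer_id, c.email, c.country, p.full_name
    FROM customers AS c
    JOIN profiles AS p ON p.customer_id = c.customer_id
    ORDER BY c.customer_id
"""


def column_names(query: str) -> list:
    """Return the column names a query yields (the candidate attributes)."""
    cursor = conn.execute(query)
    return [description[0] for description in cursor.description]


single_table_columns = column_names(table_query("customers"))
joined_columns = column_names(CUSTOM_QUERY)
joined_rows = conn.execute(CUSTOM_QUERY).fetchall()
```

Here the single-table query exposes only the columns of customers, while the custom query also exposes full_name from the second table.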

Add a Correlation Set | Select Correlation Attributes
Run Test Connection to return a list of all columns in the specified table, some sample
values, and an identifiability (a measure of uniqueness):

Identifiability is a key output in this screen and critical in deciding if an attribute can be
used as an identifier. It is calculated based on the uniqueness of the values, using the ratio
(unique records)/(total records).

Any selected attributes with identifiability over 50% will be automatically marked as
sure match (which can be overridden).

Remember that this identifiability shown here is calculated for just the first 1000 records of
this correlation set. The actual identifiability of the attributes for the full sample size isn't
calculated until a scan (and then displayed on the entity correlation page).

Override Sure-Match removes the sure match designation, i.e., if the identifiability is over
50%, this attribute will NOT be used on its own but only in combination with other attributes.
This is based on the identifiability not during test (which is calculated from the first 1000
records only), but during actual scan (which could be different).

A sure match should be overridden if using this field as an entity attribute leads to
false-positive matches, e.g., when the field is an integer identifier which has different
meaning in different systems. Typically, you will override sure matches not on initial correlation
set configuration but later when refining the scan results and noticing a large number of
false positives on a particular attribute.

Screen callouts:
■ Sample values retrieved from respective columns or fields.
■ Select what attributes to use for the entity (to include in the reference set). Your
selection here will set the actual query into the correlation set.
■ Create a composite of any attributes.

Correlation Attributes
■ Selected attributes from correlation sets
■ Define a baseline learning set
■ Used in a data scan to index matching values in a data source
Correlation attributes should be highly identifiable, meaning that they
are unique (or nearly so) for the entity.
Other factors to consider when selecting correlation attributes are
data coverage and data quality.
Good correlation attributes:
username, driver’s license number, passport number, GUID, email, SSN,
policy number, account number.
Poor correlation attributes:
date of birth, first/last name, gender, … or other attributes that are common or
not found across the data.

Choosing Correlation Attributes
The correlation set fields we choose as correlation attributes should have:
✔ High identifiability
✔ Good data coverage
✔ Semantic sense
Poor choices:
○ First name (not highly-identifiable)
○ High-res timestamp (highly-unique but not found elsewhere)
○ Password (not found elsewhere)

Finding More Correlation Attribute Candidates
▪ Using Classification:
○ Using RegExs to search for entity-specific identifiers (like an email or a credit
card) is a helpful way to find attributes that are common and descriptive of
entities
○ Utilizing what is found through classification, these attributes can be added as
correlation attributes to bolster correlation results
▪ Using Cluster Analysis
○ Cluster analysis is a machine learning technique that imposes structure on
unstructured data at scale. Clustering gives qualitative and quantitative
visibility to previously amorphous and unmeasured data.

Residencies, Countries, States
■ The residency data comes from the correlation set
attribute you associate with the Residency dropdown.
■ To show the residency on a map, the system needs
the residency value to follow a predefined list of
countries or states:
○ States are only relevant for the US since it is the only country where
regulations are per state (for now)
○ Some countries and states may conflict (e.g., DE matches both
Germany and Delaware). For such conflicts, use the
RESIDENCY_STATES_FIRST environment variable to
prioritize US states over countries.
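The DE conflict can be illustrated with a small lookup sketch. Only the RESIDENCY_STATES_FIRST name comes from this module; the lookup tables, the function names, and the assumption that the variable is a flag set to "true" are hypothetical.

```python
import os
from typing import Mapping, Optional

# Tiny illustrative subsets; the platform's predefined list is not reproduced here.
COUNTRIES = {"DE": "Germany", "US": "United States", "FR": "France"}
US_STATES = {"DE": "Delaware", "MN": "Minnesota", "CA": "California"}


def resolve_residency(code: str, states_first: bool) -> Optional[str]:
    """Resolve a 2-letter residency code, checking states or countries first."""
    key = code.strip().upper()
    lookups = (US_STATES, COUNTRIES) if states_first else (COUNTRIES, US_STATES)
    for table in lookups:
        if key in table:
            return table[key]
    return None  # value does not follow the predefined list


def states_first_from_env(environ: Mapping = os.environ) -> bool:
    """Assumption: the variable acts as a boolean flag when set to 'true'."""
    return str(environ.get("RESIDENCY_STATES_FIRST", "")).lower() == "true"
```

With states_first set to False, "DE" resolves to Germany; with it set to True, the same value resolves to Delaware. Codes that exist in only one list resolve the same way in both modes.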

Composite Unique IDs and Composite Identifiers
Composite Unique ID - use a composite of columns as the unique ID
■ A composite key is the primary key we use for a table by combining multiple columns, rather than
using only one column
■ Use in access requests as id=key1+delimiter+key2
■ Scan results details will still show each attribute
separately, by design
Composite Identifier – define a foreign key as a unique identifier for other tables
■ Defines a composite key used as the foreign key between one table and another table, in the case
where no direct unique ID correlation is available
■ Will be used when running a full scan
■ Currently not supported in access requests
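The id=key1+delimiter+key2 form of a Composite Unique ID can be sketched as below. The field names, the sample rows, and the "_" delimiter are hypothetical; use whichever separator is configured in the correlation set.

```python
def composite_unique_id(record: dict, fields: list, delimiter: str = "_") -> str:
    """Join the chosen fields, in order, into one composite ID string."""
    return delimiter.join(str(record[field]) for field in fields)


# Hypothetical rows: neither branch nor account is unique on its own,
# but the combination of the two is.
rows = [
    {"branch": "NY01", "account": "778"},
    {"branch": "NY01", "account": "912"},
    {"branch": "LA02", "account": "778"},
]

composite_ids = [composite_unique_id(row, ["branch", "account"]) for row in rows]
branch_is_unique = len({row["branch"] for row in rows}) == len(rows)
account_is_unique = len({row["account"] for row in rows}) == len(rows)
composite_is_unique = len(set(composite_ids)) == len(rows)
```

The same joined string is what would be supplied as the id value in an access request for the first row, here "NY01_778".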

Add a Correlation Set | Finalize
Select values for the Unique ID, Display Name, and Residency choices, then select Save Changes.
Select values for all of these selections.

Sure Match, and Overriding It
▪ When scanning the correlation set, an attribute will be a ‘sure match’ if it has high
identifiability (over 50%)
▪ A ‘sure match’ attribute will be used on its own to declare a match to a record
▪ An attribute that is not a ‘sure match’ will not be used on its own – in order to match,
the record must have at least one other highly-identifiable attribute matching
▪ You can override a sure match if leaving this attribute as a ‘sure match’ may lead to
false-positive matches
▪ Override Sure Match can be used later in the configuration process as we refine scans
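One reading of the rules above, as a minimal sketch: a sure match attribute is enough on its own, and any other attribute counts only when at least one other highly identifiable attribute also matches. The class and function names are hypothetical, and the actual scan logic may weigh attributes differently.

```python
from dataclasses import dataclass
from typing import List

SURE_MATCH_THRESHOLD = 0.5  # per the text: identifiability over 50%


@dataclass(frozen=True)
class MatchedAttribute:
    """A correlation attribute whose value was found in a scanned record."""
    name: str
    identifiability: float
    override_sure_match: bool = False

    @property
    def highly_identifiable(self) -> bool:
        return self.identifiability > SURE_MATCH_THRESHOLD

    @property
    def sure_match(self) -> bool:
        return self.highly_identifiable and not self.override_sure_match


def record_matches(matched: List[MatchedAttribute]) -> bool:
    """Decide whether the matched attributes are enough to declare a match."""
    if any(attribute.sure_match for attribute in matched):
        return True
    # No sure match: require a second attribute, one of them highly identifiable.
    return len(matched) >= 2 and any(a.highly_identifiable for a in matched)


email = MatchedAttribute("email", 0.98)
first_name = MatchedAttribute("first_name", 0.10)
gender = MatchedAttribute("gender", 0.01)
account_no = MatchedAttribute("account_no", 0.90, override_sure_match=True)
```

With these sample attributes, email alone declares a match, account_no alone does not (its sure match is overridden), and account_no together with first_name does.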

Adding a Correlation Set from Correlation Screen
■ Quickly add attributes as entity attributes
■ Useful when you want to upgrade an enrichment attribute to a full correlation attribute
Select a table, and then select Define as Correlation Set in the Action Menu to automatically
set up the selected table as a new correlation set.
You will see a success/fail message and be taken to the Correlation Sets page, where
the new correlation set appears at the bottom of the list with (Entity Correlation) appended
to its name.

Value Matching Logic ENRICHMENT
Logical steps:
1. Break the data field into segments (a segment starts at string start or a delimiter, and ends at
string end or a delimiter).
■ A delimiter is a whitespace or a punctuation character.
■ Segments of 4 characters or less are ignored.
2. Perform case-insensitive match of the (as-is) correlation attribute to each segment.

Notes:
If you run your own tests, remember to turn off sampling to not miss any matches!
This logic is only relevant at the Scanner level where we look for straight matches. These
matches may be later dropped by the Correlator due to low confidence.

Correlation Attribute | Data Field | Outcome | Notes
France | France | Match | Straightforward exact match.
France | france.wilkins@mail.com | Match | Case-insensitive match from string start until the ‘.’ delimiter.
Jane.Smith | design.jane.smith@mac.com | Match | Case-insensitive match from the ‘.’ delimiter to the ‘@’ delimiter.
jane smith | Jane Smith Studio | Match | Case-insensitive match from string start to the second space (first space is within the match).
4873-3911 | 4873-3911 | Match | Straightforward exact match (the dash is within the match).
12345 | 00000 12345 | Match | Exact match from the space delimiter to string end.
12345 | 12345 00000 | Match | Exact match from string start to the space delimiter.
12345 | 00000 12345 67890 | Match | Exact match from the first space delimiter to the second space delimiter.
France | francewilkins | No Match | No character-for-character match till string end.
12345 | 12345678 | No Match | No character-for-character match till string end.
12345 | 012345 | No Match | No character-for-character match from string start.
0123-8945 | 880123-8945 | No Match | No character-for-character match from string start (the dash is part of the string).
90210 | 5409021076141985 | No Match | No character-for-character match from string start.
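The outcomes in the table can be reproduced with a boundary check: the attribute must appear as-is (ignoring case) and be bordered on each side by the string start, the string end, or a delimiter. This is an approximation written for illustration, not the Scanner's implementation; the function names are hypothetical, the delimiter set is assumed to be ASCII punctuation plus whitespace, and the rule about ignoring short segments is not modeled.

```python
import string

# Assumption: "punctuation" means ASCII punctuation; whitespace as defined by Python.
DELIMITERS = set(string.punctuation) | set(string.whitespace)


def is_boundary(text: str, index: int) -> bool:
    """True if index is outside the string or points at a delimiter."""
    return index < 0 or index >= len(text) or text[index] in DELIMITERS


def value_matches(attribute: str, data_field: str) -> bool:
    """Case-insensitive match of the as-is attribute between two boundaries."""
    needle = attribute.lower()
    haystack = data_field.lower()
    if not needle:
        return False
    start = haystack.find(needle)
    while start != -1:
        end = start + len(needle)
        if is_boundary(haystack, start - 1) and is_boundary(haystack, end):
            return True
        start = haystack.find(needle, start + 1)  # try the next occurrence
    return False
```

For example, value_matches("France", "france.wilkins@mail.com") is a match because the ‘.’ closes the segment, while value_matches("France", "francewilkins") is not, because the attribute runs straight into more characters.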

Enrichment ENRICHMENT
Proximity Analysis
Reference set:
User ID | Residency | Display Name | Address
ABC123 | Minnesota | John Smith | 123 Apple St
DEF456 | DEU | Marcus Miller | 456 Pear Cir
GHI789 | USA | Sarah Johnson | 789 Plum Blvd
Data Source (sample):
User ID | Gateway | Tax ID | Gender
ABC123 | 10.2.0.11 | 111-22-3333 | M
XYZ888 | 10.2.0.11 | 789-13-8989 | F
GHI789 | 10.3.3.135 | 987-65-1234 | F
… | … | … | …
identifiability: 100% | 15% | 100% | 0%
sure match: Yes
Diagram labels: reference set, enrichment, sure match attribute, sample

Correlator ENRICHMENT
▪ A Correlator lives within a Docker container and is typically deployed on the
BigID App Server
▪ In larger deployments, additional remote correlators could be deployed to
improve correlation throughput
● Additional correlators introduce overhead, so linear performance gain cannot be expected
▪ The correlation process periodically checks for uncorrelated personal data
findings in MongoDB
▪ If uncorrelated personal data findings are present, the correlator attempts to
connect these findings to entities across BigID’s entity inventory
▪ Once the finding is correlated to at least one entity, it will become part of the
personal data inventory on the UI

Correlator | Confidence Levels ENRICHMENT
▪ Quantify the quality of the correlation between entity attributes and data fields
▪ Based on an ML algorithm that accounts for uniqueness, proximity, frequency, etc.
▪ Calculated using only metadata and not the personal data itself
▪ Emphasizes high recall over precision
○ Customers want to see more matches even if they may be false
▪ Can be adjusted using the confidence level thresholds
○ Decreasing the low-to-medium threshold will increase recall but decrease precision
(a trade-off)
○ If 96% of the findings were correctly labeled by BigID as low (recall), this leaves only
4% to manually find and correct by overriding their confidence level to medium/high
○ If 90% of the findings were correctly labeled by BigID as medium/high (precision),
this leaves only 10% to manually find and correct by overriding their confidence level
to low

Recall
The percentage of true cases that are reported.
A high recall means fewer misses of true records.
Precision
The percentage of reported records that are indeed true.
A high precision means fewer false positives (less noise).
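The recall/precision trade-off can be worked through with a small example. The scores, labels, and threshold values are made up for illustration; how BigID actually computes confidence is not modeled here.

```python
from typing import List, Tuple

# Hypothetical findings: (confidence score, whether the finding is truly correct).
FINDINGS: List[Tuple[float, bool]] = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.4, True), (0.3, False), (0.2, False), (0.1, True),
]


def precision_recall(findings: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Treat findings at or above the threshold as reported matches."""
    true_pos = sum(1 for score, is_true in findings if score >= threshold and is_true)
    false_pos = sum(1 for score, is_true in findings if score >= threshold and not is_true)
    false_neg = sum(1 for score, is_true in findings if score < threshold and is_true)
    reported = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / reported if reported else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


strict_precision, strict_recall = precision_recall(FINDINGS, threshold=0.5)
loose_precision, loose_recall = precision_recall(FINDINGS, threshold=0.25)
```

Lowering the threshold from 0.5 to 0.25 reports two more findings: recall rises from 3/5 to 4/5, while precision drops from 3/4 to 4/6.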

Let’s Recap!
■ A correlation set is a supported data source that contains entity data.
■ Scans can run even if no correlation sets are defined, but will not report on
correlation results.
■ Scan results details can be used to identify false positives.
■ Scan results details can also be used to reduce confidence levels of false
positives so they are disregarded.
■ Enrichment and classification are other mechanisms that could be leveraged
to enhance the data.

Learner Resources
✔ Training team at training@bigid.com
✔ BigID University
✔ BigID Documentation
✔ BigID Support
✔ BigExchange Community
✔ Developer Portal
Continue the Operator track with the next module: Data Scanning 1
