
Associate Operator

Module:
BigID Correlation Sets 1
Learn how to add a correlation set and find or select correlation
attributes.
*Mandatory prerequisite modules: Operator Essentials 1, Data Sources 1
Correlation Learning Sets
▪ To associate discovered personal information with specific entities, it is
necessary to add and configure at least one Correlation Learning Set

What is an Entity?
An individual or a “thing” for which sensitive information has
been collected during a business process.

▪ A correlation set is based on a supported, existing data source that holds
identifying entity data

What is a Correlation Set?
An exemplary data source table or view containing data which
can identify an entity uniquely across all data sources.

© 2022 BigID. All rights reserved. – 2 –


Correlation Sets Are Important…

Without a correlation set configured:


○ There is no correlation to entities
○ You cannot connect an entity to its data
○ Access requests cannot be fulfilled in full
○ Other personal data finding logic will be limited

We can still discover sensitive ‘non-personal’ data using the various classification-based
discovery forms.



Correlation Learning Sets (Entity Sources)
A correlation set defines the learning set for the correlation – the initial attributes that
are later used to discover linked personal data in the data sources themselves.
A correlation set assigns relevant correlation attributes to special roles:
Unique ID (unique, mandatory)
● An identifier key, such as user ID, customer ID, email, passport number, or any field that is unique for
each individual entity. Used to correlate data from different sources to a single individual entity. Can
be a composite value created from joining multiple fields and a separator.
Display Name (not necessarily unique, optional)
● A field that provides a friendly, human-readable label for this entity, such as full name or email
address.
● If not provided, the Unique ID will be used as the Display Name.
Residency (not necessarily unique, optional but highly recommended)
● The data subject’s country/state of residence, defined as a string, ISO 2-letter, or 3-letter country
code.
● If not provided, the platform will lack the geographic context.



Adding a Correlation Set DEMO
1. Open the correlation sets view.
2. Review existing correlation sets.
3. Add a new correlation set connection based on a data source.
○ Review the attributes that are returned after testing the connection – which are good entity
attributes? Which are not?
○ Which of the attributes should have their sure match overridden?
○ Show building a composite Unique ID and setting the display name and residency



Add a Correlation Set
Select the Administration tab on the sidebar, then Correlation Sets, then New
Correlation Set:



Add a Correlation Set | Fill Initial Parameters
Populate the respective fields, then select Test Connection:
▪ Assign a meaningful name for the correlation set.
▪ Select an existing data source connection.
▪ Reserved for future use.
▪ Enter the name of the table in the data source that holds the relevant entity attributes (the table must exist), or use the query string.
▪ Control sampling options.
▪ Select to use a simple unique ID or a composite.
▪ When you have finished populating all the parameters, select Test Connection.

DB Table vs Custom Query String
With DB Table, you enter the name of the database table where the entity information is stored. The corresponding select query is built for you automatically and shown in the Query String box below. If all fields required for an entity are in a single table, it is easiest to use this option.
With Custom Query String, you write your own SQL query. This option can be used to create a set of entity fields from an arbitrary SQL query that joins multiple tables. All tables in the query must be in this data source.
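The difference between the two options can be sketched with Python's built-in sqlite3 module. This is only an illustration: the in-memory database stands in for a data source, the table and column names (customers, addresses, and so on) are hypothetical, and the exact query that the DB Table option generates is assumed here to be a plain select over the table.

```python
import sqlite3

# In-memory database standing in for one data source (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY, full_name TEXT, email TEXT);
    CREATE TABLE addresses (customer_id TEXT, country TEXT);
    INSERT INTO customers VALUES ('ABC123', 'John Smith', 'john.smith@example.com');
    INSERT INTO customers VALUES ('DEF456', 'Marcus Miller', 'marcus.miller@example.com');
    INSERT INTO addresses VALUES ('ABC123', 'USA');
    INSERT INTO addresses VALUES ('DEF456', 'DEU');
""")

# DB Table style: all entity fields live in a single table (assumed query shape).
db_table_query = "SELECT * FROM customers"

# Custom Query String style: entity fields are joined from two tables
# that both belong to the same data source.
custom_query = """
    SELECT c.customer_id, c.full_name, c.email, a.country
    FROM customers AS c
    JOIN addresses AS a ON a.customer_id = c.customer_id
    ORDER BY c.customer_id
"""

def column_names(connection, query):
    """Return the column names a query produces."""
    cursor = connection.execute(query)
    return [description[0] for description in cursor.description]

single_table_columns = column_names(conn, db_table_query)
joined_rows = conn.execute(custom_query).fetchall()
```

The joined query adds a candidate residency column (country) that the single table does not hold, which is the typical reason to prefer a custom query.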



Add a Correlation Set | Select Correlation Attributes
Run Test Connection to return a list of all columns in the specified table, some sample values, and an identifiability (a measure of uniqueness):
■ Sample values retrieved from respective columns or fields.
■ Select what attributes to use for the entity (to include in the reference set). Your selection here will set the actual query into the correlation set.
■ Create a composite of any attributes.

Identifiability is a key output in this screen and critical in deciding if an attribute can be used as an identifier. It is calculated based on the uniqueness of the values, using the ratio (unique records)/(total records).

Any selected attributes with identifiability over 50% will be automatically marked as sure match (which can be overridden).

Remember that the identifiability shown here is calculated for just the first 1000 records of this correlation set. The actual identifiability of the attributes for the full sample size isn't calculated until a scan (and is then displayed on the entity correlation page).

Override Sure-Match removes the sure match designation, i.e., even if the identifiability is over 50%, this attribute will NOT be used on its own but only in combination with other attributes. The identifiability that applies is the one measured during the actual scan, not the one shown during the test (which is calculated from the first 1000 records only and could be different).

A sure match should be overridden if using this field as an entity attribute leads to false-positive matches, e.g., when the field is an integer identifier which has different meaning in different systems. Typically, you will override sure matches not on initial correlation set configuration but later, when refining the scan results and noticing a large number of false positives on a particular attribute.
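The identifiability ratio and the 50% sure-match rule described on this slide can be sketched as follows. The function names are hypothetical, and how empty or null values are counted is not stated in the material, so this sketch simply treats every value as a record.

```python
def identifiability(values, sample_size=1000):
    """Ratio of unique records to total records over the first sample_size records."""
    sample = list(values)[:sample_size]
    if not sample:
        return 0.0
    return len(set(sample)) / len(sample)

def is_sure_match(ratio, override=False, threshold=0.5):
    """An attribute is a sure match when its identifiability is over the
    threshold and the sure match has not been overridden."""
    return ratio > threshold and not override

user_ids = ["ABC123", "DEF456", "GHI789", "JKL012"]   # all values distinct
genders = ["M", "F", "F", "M"]                        # only two distinct values

user_id_ratio = identifiability(user_ids)   # 4 unique / 4 total
gender_ratio = identifiability(genders)     # 2 unique / 4 total
```

With these sample values, user_id_ratio is 1.0 and qualifies as a sure match, while gender_ratio is exactly 0.5 and does not, because the rule requires a value over 50%.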



Correlation Attributes
■ Selected attributes from correlation sets
■ Define a baseline learning set
■ Used in a data scan to index matching values in a data source

Correlation attributes should be highly identifiable, meaning that they are unique (or nearly so) for the entity.
Other factors to consider when selecting correlation attributes are data coverage and data quality.

Good correlation attributes:
username, driver’s license number, passport number, GUID, email, SSN, policy number, account number.

Poor correlation attributes:
date of birth, first/last name, gender, … or other attributes that are common or not found across the data.



Choosing Correlation Attributes
The correlation set fields we choose as correlation attributes should have:

✔ High identifiability
✔ Good data coverage
✔ Semantic sense

Poor choices:
○ First name (not highly-identifiable)
○ High-res timestamp (highly-unique but not found elsewhere)
○ Password (not found elsewhere)



Finding More Correlation Attribute Candidates
▪ Using Classification:
○ Using RegExs to search for entity-specific identifiers (like an email or a credit
card) is a helpful way to find attributes that are common and descriptive of
entities
○ Utilizing what is found through classification, these attributes can be added as
correlation attributes to bolster correlation results
▪ Using Cluster Analysis
○ Cluster analysis is a machine learning technique that imposes structure on
unstructured data at scale. Clustering gives qualitative and quantitative
visibility to previously amorphous and unmeasured data.



Residencies, Countries, States
■ The residency data comes from the correlation set
attribute you associate with the Residency dropdown.
■ To show the residency on a map, the system needs
the residency value to follow a predefined list of
countries or states:
○ States are only relevant for the US since it is the only country where
regulations are per state (for now)

○ Some countries and states may conflict (e.g., DE matches both Germany and Delaware). For such conflicts, use the RESIDENCY_STATES_FIRST environment variable to prioritize US states over countries.
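The DE conflict can be illustrated with a small lookup sketch. This is not BigID's implementation: the lookup tables are abbreviated, the function name is hypothetical, and the states_first parameter only mimics the effect attributed to RESIDENCY_STATES_FIRST. The assumption that countries win by default is also ours.

```python
# Abbreviated, illustrative lookup tables (not the platform's predefined lists).
COUNTRIES = {"DE": "Germany", "US": "United States", "FR": "France"}
US_STATES = {"DE": "Delaware", "MN": "Minnesota", "NY": "New York"}

def resolve_residency(code, states_first=False):
    """Resolve a 2-letter residency code to a display name.

    When states_first is True, US states take priority over countries for
    codes that appear in both lists; otherwise countries take priority
    (assumed default).
    """
    key = code.strip().upper()
    lookups = (US_STATES, COUNTRIES) if states_first else (COUNTRIES, US_STATES)
    for table in lookups:
        if key in table:
            return table[key]
    return None  # code is not in either list

default_de = resolve_residency("DE")
states_first_de = resolve_residency("de", states_first=True)
```

Codes that exist in only one list, such as MN or FR, resolve the same way under either setting.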



Composite Unique IDs and Composite Identifiers
Composite Unique ID - use a composite of columns as the unique ID
■ A composite key is the primary key we use for a table by combining multiple columns, rather than
using only one column
■ Use in access requests as id=key1+delimiter+key2
■ Scan results details will still show each attribute
separately, by design

Composite Identifier – define a foreign key as a unique identifier for other tables
■ Defines a composite key used as the foreign key between one table and another table, in the case
where no direct unique ID correlation is available
■ Will be used when running a full scan
■ Currently not supported in access requests
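A composite Unique ID of the form key1+delimiter+key2 can be sketched as below. The record layout, field names, and the choice of "|" as delimiter are hypothetical, and raising an error on a missing key part is a design choice of this sketch rather than documented behavior.

```python
def build_composite_id(record, key_fields, delimiter="_"):
    """Join the values of key_fields, in order, with the delimiter."""
    parts = []
    for field in key_fields:
        value = record.get(field)
        if value is None or str(value) == "":
            # A composite key with a missing part would not be unique.
            raise ValueError(f"missing composite key field: {field}")
        parts.append(str(value))
    return delimiter.join(parts)

record = {"branch_code": "NYC", "account_no": 10442, "name": "Jane Smith"}
composite_id = build_composite_id(record, ["branch_code", "account_no"], delimiter="|")

# The same value is what an access request would carry as id=key1+delimiter+key2.
access_request_param = f"id={composite_id}"
```

Pick a delimiter that cannot occur inside the key values themselves, otherwise two different key combinations could produce the same composite string.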



Add a Correlation Set | Finalize
Select values for all of the Unique ID, Display Name, and Residency choices, then select Save Changes.



Sure Match, and Overriding It
▪ When scanning the correlation set, an attribute will be a ‘sure match’ if it has high
identifiability (over 50%)
▪ A ‘sure match’ attribute will be used on its own to declare a match to a record
▪ An attribute that is not a ‘sure match’ will not be used on its own – in order to match,
the record must have at least one other highly-identifiable attribute matching
▪ You can override a sure match if leaving this attribute as a ‘sure match’ may lead to
false-positive matches
▪ Override Sure Match can be used later in the configuration process as we refine scans
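The rules above can be condensed into a small decision sketch. It is a simplification under stated assumptions: the function and attribute names are hypothetical, and "in combination with other attributes" is modeled as at least two matched attributes, without weighing how identifiable each one is.

```python
def record_matches(matched_attributes, sure_match_attributes):
    """Decide whether a data record is attributed to an entity.

    matched_attributes: names of correlation attributes whose values were
        found in the record.
    sure_match_attributes: names currently flagged as sure match
        (identifiability over 50% and not overridden).
    """
    matched = set(matched_attributes)
    # A single sure-match attribute is enough on its own.
    if matched & set(sure_match_attributes):
        return True
    # Otherwise an attribute only counts in combination with another one.
    return len(matched) >= 2

sure = {"user_id", "tax_id"}

alone_sure = record_matches({"user_id"}, sure)
alone_weak = record_matches({"zip_code"}, sure)
combined_weak = record_matches({"zip_code", "last_name"}, sure)

# Overriding the sure match on user_id removes it from the sure-match set,
# so user_id no longer matches on its own.
after_override = record_matches({"user_id"}, sure - {"user_id"})
```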



Adding a Correlation Set from Correlation Screen
■ Quickly add attributes as entity attributes
■ Useful when you want to upgrade an enrichment attribute to a full correlation attribute

Select a table, and then select Define as Correlation Set in the Action Menu to automatically set up the selected table as a new correlation set.

You will see a success/fail message and be taken to the Correlation Sets page, where the new correlation set appears at the bottom of the list with (Entity Correlation) appended to its name.



Value Matching Logic ENRICHMENT
Logical steps:
1. Break the data field into segments (a segment starts at string start or a delimiter, and ends at string end or a delimiter).
■ A delimiter is a whitespace or a punctuation character.
■ Segments of 4 characters or less are ignored.
2. Perform case-insensitive match of the (as-is) correlation attribute to each segment.

Notes:
■ If you run your own tests, remember to turn off sampling to not miss any matches!
■ This logic is only relevant at the Scanner level where we look for straight matches. These matches may be later dropped by the Correlator due to low confidence.

| Correlation Attribute | Data Field | Outcome | Notes |
|---|---|---|---|
| France | France | Match | Straightforward exact match. |
| France | france.wilkins@mail.com | Match | Case-insensitive match from string start until the ‘.’ delimiter. |
| Jane.Smith | design.jane.smith@mac.com | Match | Case-insensitive match from the ‘.’ delimiter to the ‘@’ delimiter. |
| jane smith | Jane Smith Studio | Match | Case-insensitive match from string start to the second space (first space is within the match). |
| 4873-3911 | 4873-3911 | Match | Straightforward exact match (the dash is within the match). |
| 12345 | 00000 12345 | Match | Exact match from the space delimiter to string end. |
| 12345 | 12345 00000 | Match | Exact match from string start to the space delimiter. |
| 12345 | 00000 12345 67890 | Match | Exact match from the first space delimiter to the second space delimiter. |
| France | francewilkins | No Match | No character-for-character match till string end. |
| 12345 | 12345678 | No Match | No character-for-character match till string end. |
| 12345 | 012345 | No Match | No character-for-character match from string start. |
| 0123-8945 | 880123-8945 | No Match | No character-for-character match from string start (the dash is part of the string). |
| 90210 | 5409021076141985 | No Match | No character-for-character match from string start. |
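The matching behavior shown in the table can be approximated with the sketch below. It is one interpretation consistent with the listed examples, not the Scanner's actual code: a match is a case-insensitive occurrence of the attribute that starts at the string start or right after a delimiter and ends at the string end or right before a delimiter. The names and the handling of the short-segment rule (attributes of 4 characters or less are skipped) are assumptions.

```python
import string

# A delimiter is a whitespace or a punctuation character.
DELIMITERS = set(string.whitespace) | set(string.punctuation)

def is_boundary(text, index):
    """True if index lies outside the string or points at a delimiter."""
    return index < 0 or index >= len(text) or text[index] in DELIMITERS

def value_matches(attribute, data_field, min_length=5):
    """Case-insensitive, boundary-anchored match of attribute in data_field."""
    needle = attribute.lower()
    haystack = data_field.lower()
    if len(needle) < min_length:
        return False  # assumed reading of "4 characters or less are ignored"
    start = haystack.find(needle)
    while start != -1:
        end = start + len(needle)
        # The occurrence must begin and end on a segment boundary.
        if is_boundary(haystack, start - 1) and is_boundary(haystack, end):
            return True
        start = haystack.find(needle, start + 1)
    return False

# Rows taken from the table: (correlation attribute, data field, expected match).
TABLE_ROWS = [
    ("France", "France", True),
    ("France", "france.wilkins@mail.com", True),
    ("Jane.Smith", "design.jane.smith@mac.com", True),
    ("jane smith", "Jane Smith Studio", True),
    ("4873-3911", "4873-3911", True),
    ("12345", "00000 12345", True),
    ("12345", "12345 00000", True),
    ("12345", "00000 12345 67890", True),
    ("France", "francewilkins", False),
    ("12345", "12345678", False),
    ("12345", "012345", False),
    ("0123-8945", "880123-8945", False),
    ("90210", "5409021076141985", False),
]

results = [value_matches(attr, field) for attr, field, _ in TABLE_ROWS]
```

Delimiters inside the attribute itself (the dot in Jane.Smith, the dash in 4873-3911) are matched literally, which is why those rows match as a whole rather than segment by segment.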



Enrichment ENRICHMENT
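Proximity Analysis

[Diagram labels: reference set, data source, sample, sure match attribute, enrichment]

Reference set:

| User ID | Residency | Display Name | Address |
|---|---|---|---|
| ABC123 | Minnesota | John Smith | 123 Apple St |
| DEF456 | DEU | Marcus Miller | 456 Pear Cir |
| GHI789 | USA | Sarah Johnson | 789 Plum Blvd |

Data source (sample):

| | User ID | Gateway | Tax ID | Gender |
|---|---|---|---|---|
| | ABC123 | 10.2.0.11 | 111-22-3333 | M |
| | XYZ888 | 10.2.0.11 | 789-13-8989 | F |
| | GHI789 | 10.3.3.135 | 987-65-1234 | F |
| | … | … | … | … |
| identifiability | 100% | 15% | 100% | 0% |
| sure match | Yes | | | |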



Correlator ENRICHMENT
▪ A Correlator lives within a Docker container and is typically deployed on the
BigID App Server
▪ In larger deployments, additional remote correlators could be deployed to
improve correlation throughput
● Additional correlators introduce overhead, so linear performance gain cannot be expected

▪ The correlation process periodically checks for uncorrelated personal data
findings in MongoDB
▪ If uncorrelated personal data findings are present, the correlator attempts to
connect these findings to entities across BigID’s entity inventory
▪ Once the finding is correlated to at least one entity, it will become part of the
personal data inventory on the UI



Correlator | Confidence Levels ENRICHMENT
▪ Quantify the quality of the correlation between entity attributes and data fields
▪ Based on an ML algorithm that accounts for uniqueness, proximity, frequency, etc.
▪ Calculated using only metadata and not the personal data itself
▪ Emphasizes high recall over precision
○ Customers want to see more matches even if they may be false
▪ Can be adjusted using the confidence level thresholds
○ Decreasing the low-to-medium threshold will increase recall but decrease precision (a trade-off)
○ If 96% of the findings were correctly labeled by BigID as low (recall), this leaves only 4% to manually find and correct by overriding their confidence level to medium/high
○ If 90% of the findings were correctly labeled by BigID as medium/high (precision), this leaves only 10% to manually find and correct by overriding their confidence level to low

Recall
The percentage of true cases that are reported. A high recall means fewer misses of true records.

Precision
The percentage of reported records that are indeed true. A high precision means fewer false positives (less noise).
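The two definitions can be made concrete with a short calculation. The counts below are invented purely for illustration and do not come from any BigID scan; the second set of counts only shows the direction of the trade-off when more findings are reported.

```python
def precision(true_positives, false_positives):
    """Share of reported records that are indeed true."""
    reported = true_positives + false_positives
    return true_positives / reported if reported else 0.0

def recall(true_positives, false_negatives):
    """Share of true cases that are reported."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0

# Hypothetical counts at a stricter threshold: fewer findings reported.
strict_precision = precision(true_positives=90, false_positives=10)
strict_recall = recall(true_positives=90, false_negatives=30)

# Hypothetical counts after lowering the threshold: more true findings are
# reported (higher recall), along with more false positives (lower precision).
relaxed_precision = precision(true_positives=110, false_positives=40)
relaxed_recall = recall(true_positives=110, false_negatives=10)
```

In this made-up example recall rises from 0.75 to about 0.92 while precision falls from 0.90 to about 0.73.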



Let’s Recap!
■ A correlation set is a supported data source that contains entity data.
■ Scans can run even if no correlation sets are defined, but will not report on
correlation results.
■ Scan results details can be used to identify false positives.
■ Scan results details can also be used to reduce confidence levels of false
positives so they are disregarded.
■ Enrichment and classification are other mechanisms that could be leveraged
to enhance the data.



Learner Resources
✔ Training team at training@bigid.com
✔ BigID University
✔ BigID Documentation
✔ BigID Support
✔ BigExchange Community
✔ Developer Portal

Continue the Operator track with the next module: Data Scanning 1
