Data profiling
Data profiling is the process of examining the data available from an existing information source (e.g. a database or a
file) and collecting statistics or informative summaries about that data.[1] The purpose of these statistics may be to:

1. Find out whether existing data can be easily used for other purposes
2. Improve the ability to search data by tagging it with keywords or descriptions, or by assigning it to a category
3. Assess data quality, including whether the data conforms to particular standards or patterns[2]
4. Assess the risk involved in integrating data in new applications, including the challenges of joins
5. Discover metadata of the source database, including value patterns and distributions, key candidates, foreign-key
candidates, and functional dependencies (a key-candidate check is sketched just after this list)
6. Assess whether known metadata accurately describes the actual values in the source database
7. Understand data challenges early in any data-intensive project, so that late-project surprises are avoided. Finding
data problems late in the project can lead to delays and cost overruns.
8. Have an enterprise view of all data, for uses such as master data management, where key data is needed, or data
governance for improving data quality.
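
As a minimal illustration of point 5 above, a column whose values are all present and distinct is a key candidate. The following Python sketch (the column names and data are hypothetical, not from any particular system) shows the idea; real profilers run such checks against database tables:

    # Key-candidate discovery over hypothetical rows: a column is a
    # candidate key when every value is non-null and unique.
    rows = [
        {"id": 1, "email": "a@example.com", "country": "DE"},
        {"id": 2, "email": "b@example.com", "country": "DE"},
        {"id": 3, "email": "b@example.com", "country": "FR"},
    ]

    def key_candidates(rows):
        candidates = []
        for col in rows[0].keys():
            values = [r[col] for r in rows]
            # Reject the column if any value is missing or repeated.
            if None not in values and len(set(values)) == len(values):
                candidates.append(col)
        return candidates

    print(key_candidates(rows))  # ['id'] -- 'email' and 'country' repeat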

Contents
Introduction
How data profiling is conducted
When data profiling is conducted
Benefits and examples
See also
References

Introduction
Data profiling refers to the analysis of information for use in a data warehouse in order to clarify the structure, content,
relationships, and derivation rules of the data.[3] Profiling helps not only to understand anomalies and to assess data
quality, but also to discover, register, and assess enterprise metadata.[4][5] The result of the analysis is used to determine the
suitability of the candidate source systems, usually giving the basis for an early go/no-go decision, and also to identify
problems for later solution design.[3]

How data profiling is conducted


Data profiling utilizes methods of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard
deviation, frequency, and variation, as well as aggregates such as count and sum, and additional metadata obtained during
profiling, such as data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and
abstract type recognition.[4][6][7] The metadata can then be used to discover problems such as illegal values, misspellings,
missing values, varying value representation, and duplicates.
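
As a minimal sketch of how such statistics might be collected, the following Python function profiles a single column supplied as a plain list of values; the function name and sample data are illustrative only, and a production profiler would read directly from the source tables:

    import statistics
    from collections import Counter

    def profile_column(values):
        """Collect basic descriptive statistics and metadata for one column."""
        non_null = [v for v in values if v is not None]
        numeric = [v for v in non_null if isinstance(v, (int, float))]
        profile = {
            "count": len(values),                      # total rows
            "nulls": values.count(None),               # occurrence of null values
            "distinct": len(set(non_null)),            # discrete values
            "top_values": Counter(non_null).most_common(3),  # frequency
        }
        if numeric:                                    # numeric statistics
            profile["min"] = min(numeric)
            profile["max"] = max(numeric)
            profile["mean"] = statistics.mean(numeric)
            profile["mode"] = statistics.mode(numeric)
            if len(numeric) > 1:
                profile["stdev"] = statistics.stdev(numeric)
        return profile

    print(profile_column([3, 1, 4, 1, 5, None, 2]))

The resulting profile can then be compared against the expected metadata to flag problems such as missing or illegal values.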

Different analyses are performed for different structural levels. For example, single columns could be profiled individually
to gain an understanding of the frequency distribution of different values, and of the type and use of each column.
Embedded value dependencies can be exposed in a cross-column analysis. Finally, overlapping value sets, possibly
representing foreign-key relationships between entities, can be explored in a cross-table analysis.
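
For instance, a cross-table analysis can test whether one column's value set is contained in another's (an inclusion dependency), which marks the first column as a foreign-key candidate. A minimal Python sketch, using hypothetical tables:

    # Cross-table analysis over hypothetical tables: a column whose value
    # set is contained in another table's column is a foreign-key candidate.
    orders = [{"order_id": 10, "cust": 1}, {"order_id": 11, "cust": 2}]
    customers = [{"cust_id": 1}, {"cust_id": 2}, {"cust_id": 3}]

    def inclusion_dependency(dependent, referenced):
        """True if every value in the dependent column appears in the referenced column."""
        return set(dependent) <= set(referenced)

    print(inclusion_dependency(
        [r["cust"] for r in orders],
        [r["cust_id"] for r in customers],
    ))  # True -> orders.cust is a foreign-key candidate for customers.cust_id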