Who Should Be Profiling the Data?
Data profiling is primarily considered part of IT projects, but the most successful efforts involve a blend of IT resources and business users of the data. IT, business users, and data stewards each contribute valuable insights critical to the process:
IT system owners, developers, and project managers
analyze and understand issues of data structure: how complete is the data, how consistent are the formats, are key fields unique, is referential integrity enforced?
Business users and subject matter experts
understand the data content: what the data means, how it is applied in existing business processes, what data is required for new processes, what data is inaccurate or out of context?
Data stewards
understand corporate standards and enterprise data requirements as a whole. They can contribute to the requirements for specific projects as well as those of the corporation.
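The structural questions listed above for IT (completeness, format consistency, key uniqueness, referential integrity) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `customers` and `orders` tables, their column names, and the ISO date pattern are hypothetical and do not come from any particular profiling tool.

```python
import re

# Hypothetical sample data; table and column names are illustrative only.
customers = [
    {"customer_id": 1, "signup_date": "2020-01-15"},
    {"customer_id": 2, "signup_date": "15/02/2020"},
    {"customer_id": 2, "signup_date": None},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]


def completeness(rows, field):
    """Fraction of rows in which the field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows) if rows else 0.0


def format_consistency(rows, field, pattern):
    """Fraction of non-empty values that match the expected pattern."""
    values = [r[field] for r in rows if r.get(field) not in (None, "")]
    if not values:
        return 0.0
    matches = sum(1 for v in values if re.fullmatch(pattern, str(v)))
    return matches / len(values)


def is_unique(rows, field):
    """True if no value of the field occurs more than once."""
    values = [r.get(field) for r in rows]
    return len(values) == len(set(values))


def orphaned_keys(child_rows, child_field, parent_rows, parent_field):
    """Child key values that have no matching parent row."""
    parent_keys = {r.get(parent_field) for r in parent_rows}
    return sorted({r[child_field] for r in child_rows} - parent_keys)


# Assumed date format for this sketch: YYYY-MM-DD.
report = {
    "signup_date_completeness": completeness(customers, "signup_date"),
    "signup_date_iso_format": format_consistency(
        customers, "signup_date", r"\d{4}-\d{2}-\d{2}"
    ),
    "customer_id_unique": is_unique(customers, "customer_id"),
    "orders_without_customer": orphaned_keys(
        orders, "customer_id", customers, "customer_id"
    ),
}
print(report)
```

On this sample, the report flags the duplicated `customer_id` 2 and the order that points at a non-existent customer 3. Whether those findings matter is a question for the business users and data stewards.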
How Do People Profile Data?
The techniques for profiling are either manual or automated via a profiling tool:
Manual techniques
involve people sifting through the data to assess its condition, query by query. Manual profiling is appropriate for small data sets from a single source, with fewer than 50 fields, where the data is relatively simple.
Automated techniques
use software tools to collect summary statistics and analyses. These tools are the most appropriate for projects with hundreds of thousands of records, many fields, multiple sources, and questionable documentation and metadata. Sophisticated data profiling technology was built to handle complex problems, especially for high-profile and mission-critical projects.
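To make "collect summary statistics" concrete, the following minimal sketch gathers a few per-column statistics for every field in a table without anyone writing a separate query for each. It is not how any specific commercial tool works. The function names `profile_column` and `profile_table` and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def profile_column(values, top_n=3):
    """Summary statistics for one column.

    Assumes the non-empty values in a column are mutually comparable
    (for example all strings or all numbers) so min/max are defined.
    """
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "top_values": counts.most_common(top_n),
    }


def profile_table(rows):
    """Profile every field that appears in any row, in first-seen order."""
    fields = []
    for row in rows:
        for name in row:
            if name not in fields:
                fields.append(name)
    return {
        name: profile_column([row.get(name) for row in rows])
        for name in fields
    }


# Hypothetical sample rows; field names are illustrative only.
rows = [
    {"state": "NY", "age": 34},
    {"state": "ny", "age": 51},
    {"state": "NY", "age": None},
    {"state": "", "age": 29},
]

profile = profile_table(rows)
for field, stats in profile.items():
    print(field, stats)
```

Because every field is profiled in the same way, unexpected conditions such as the mixed-case `NY` and `ny` values show up in the distinct and top-value counts even if nobody thought to look for them.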
How Do Data Profiling Tools Differ?
Data profiling tools vary both in the architecture they use to analyze data and in the working environment they provide for the data profiling team.
Architecture option: Query-based profiling
Some profiling technologies involve crafting SQL queries that are run against source systems or against a snapshot copy of the source data. While this generates some good information about the data, it has several limitations:
Performance risks: Queries strain live systems, slowing down operations, sometimes significantly. When additional information is required, or if users want to see the actual data, a second query executes, creating even more strain on the system. Organizations reduce this risk by making a copy of the data, but this requires replicating the entire environment (both hardware and software systems), which can be costly and time-consuming.
Traceability risks: Data in production systems changes constantly. The statistics and metadata captured from query-based profiling risk being out of date immediately.
Completeness risks: It is difficult to gain comprehensive insights using query-based analysis. Queries are based on assumptions, and the purpose is to confirm and quantify expectations about what is wrong and right in the data. Given this, it is easy to overlook problems that you are not already aware of.
Profiling by query is valuable when you want to monitor production data for certain conditions. But it is not the best way to analyze large volumes of data in preparation for large-scale data integrations and migrations.
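A small sketch of query-based profiling may help show both its usefulness and the completeness risk described above. It uses Python's built-in `sqlite3` module with an in-memory database standing in for a snapshot copy. The `customers` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for a snapshot copy of source data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, None),
        (2, "b@example.com"),
        (3, "not-an-email"),
    ],
)

# First query: quantify conditions the analyst already suspects.
# COUNT(email) skips NULLs, so total - with_email gives missing emails.
total, with_email, distinct_ids = conn.execute(
    "SELECT COUNT(*), COUNT(email), COUNT(DISTINCT customer_id) "
    "FROM customers"
).fetchone()

# Second query: drilling into the actual rows means another round trip,
# which is the extra load on the system that the text describes.
missing_email_ids = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM customers WHERE email IS NULL"
    )
]

print("rows:", total)
print("missing emails:", total - with_email)
print("duplicate ids:", total - distinct_ids)
print("ids with missing email:", missing_email_ids)
conn.close()
```

These queries find the missing email and the duplicated ID because they were written to look for them. The malformed value `not-an-email` goes unreported, since no query anticipated it.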