Trillium Software® Solution Guide
Key Considerations for Selecting a Data Profiling Tool
1. Who is profiling: business users, IT, or both
2. Common environment to communicate, review, and interpret results
3. Complexity of analysis, number of sources
4. Security of data
5. Ongoing support and monitoring
What is Data Profiling?
Data profiling is a process for analyzing large data sets. Standard data profiling automatically compiles statistics and other summary information about the data records. It includes analysis by field for minimum and maximum values and other basic statistics, frequency counts for fields, data type and patterns/formats, and conformity to expected values. Other advanced profiling techniques also analyze the relationships between fields, such as dependencies between fields in a single set and between fields in separate data sets.
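As an illustration of the field-level analysis described above, the following is a minimal Python sketch, not the method of any particular profiling tool. The function names, the pattern notation (9 for a digit, A for a letter), and the sample values are assumptions made for this example.

```python
from collections import Counter


def value_pattern(value):
    """Reduce a value to its format: 9 for a digit, A for a letter, other characters kept."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch
        for ch in value
    )


def profile_field(values, expected=None):
    """Compile basic statistics for one field, given as a list of strings."""
    present = [v for v in values if v not in ("", None)]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "frequencies": Counter(present).most_common(3),
        "patterns": Counter(value_pattern(v) for v in present).most_common(3),
    }
    if expected is not None:
        # Conformity to expected values: list anything outside the allowed set.
        profile["nonconforming"] = sorted(set(present) - set(expected))
    return profile


# Hypothetical sample data for a "state" field.
states = ["MA", "MA", "NY", "ma", "", "N.Y.", "CA"]
result = profile_field(states, expected={"MA", "NY", "CA"})
```

Here the pattern counts show that five values have the format AA and one has A.A., and the conformity check flags "N.Y." and "ma" as values outside the expected set.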
Why Do People Profile?
People may want to profile for several reasons, including:
Assessing risks: Can data support the new initiative?
Planning projects: What are realistic timelines and what data, systems, and resources will the project involve?
Scoping projects: Which data and systems will be included based on priority, quality, and level of effort required?
Assessing data quality: How accurate, consistent, and complete is the data within a single system?
Designing new systems: What should the target structures look like? What mappings or transformations need to occur?
Checking/monitoring data: Does the data continue to meet business requirements after systems have gone live and changes and additions occur?
Data Profiling Basics
 
Who Should Be Profiling the Data?
Data profiling is primarily considered part of IT projects, but the most successful efforts involve a blend of IT resources and business users of the data. IT, business users, and data stewards each contribute valuable insights critical to the process:
 
IT system owners, developers, and project managers analyze and understand issues of data structure: how complete is the data, how consistent are the formats, are key fields unique, is referential integrity enforced?
Business users and subject matter experts understand the data content: what the data means, how it is applied in existing business processes, what data is required for new processes, what data is inaccurate or out of context?
Data stewards understand corporate standards and enterprise data requirements as a whole. They can contribute to the requirements of both specific projects and the corporation.
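Two of the structural questions raised above, whether key fields are unique and whether referential integrity holds, can be sketched in Python as follows. This is a minimal example under assumed inputs: the function names and the customer and order records are hypothetical, and records are represented as plain dictionaries.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return key values that occur more than once, meaning the key is not unique."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def orphan_keys(child_rows, foreign_key, parent_rows, parent_key):
    """Return foreign-key values that have no matching parent record."""
    parents = {row[parent_key] for row in parent_rows}
    return sorted({row[foreign_key] for row in child_rows} - parents)


# Hypothetical records: customer id 2 is duplicated, and order 11
# refers to a customer that does not exist.
customers = [{"id": 1}, {"id": 2}, {"id": 2}]
orders = [{"order": 10, "customer_id": 1}, {"order": 11, "customer_id": 3}]

dups = duplicate_keys(customers, "id")
orphans = orphan_keys(orders, "customer_id", customers, "id")
```

An empty result from either function indicates that the corresponding structural rule holds for the data that was examined.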
How Do People Profile Data?
The techniques for profiling are either manual or automated via a profiling tool:
Manual techniques involve people sifting through the data to assess its condition, query by query. Manual profiling is appropriate for small data sets from a single source, with fewer than 50 fields, where the data is relatively simple.
Automated techniques use software tools to collect summary statistics and analyses. These tools are the most appropriate for projects with hundreds of thousands of records, many fields, multiple sources, and questionable documentation and metadata. Sophisticated data profiling technology was built to handle complex problems, especially for high-profile and mission-critical projects.
How Do Data Profiling Tools Differ?
Data profiling tools vary both in the architecture they use to analyze data and in the working environment they provide for the data profiling team.
Architecture option: Query-based profiling
Some profiling technologies involve crafting SQL queries that are run against source systems or against a snapshot copy of the source data. While this generates some good information about the data, it has several limitations:
 
Performance risks: Queries strain live systems, slowing down operations, sometimes significantly. When additional information is required, or if users want to see the actual data, a second query executes, creating even more strain on the system. Organizations reduce this risk by making a copy of the data, but this requires replicating the entire environment, both hardware and software systems, which can be costly and time-consuming.
Traceability risks: Data in production systems changes constantly. The statistics and metadata captured from query-based profiling risk being out of date immediately.
Completeness risks: It is difficult to gain comprehensive insights using query-based analysis. Queries are based on assumptions, and the purpose is to confirm and quantify expectations about what is wrong and right in the data. Given this, it is easy to overlook problems that you are not already aware of.
Profiling by query is valuable when you want to monitor production data for certain conditions. But it is not the best way to analyze large volumes of data in preparation for large-scale data integrations and migrations.
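To make the query-based approach concrete, the sketch below runs two profiling queries from Python using the standard sqlite3 module. The in-memory database, the customer table, and its contents are assumptions standing in for a real source system. Note that seeing the records behind a statistic takes a second query, and that each query only answers the question someone thought to ask.

```python
import sqlite3

# An in-memory database stands in for the source system (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, postal_code TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "01821"), (2, None), (3, "1821"), (4, "01821")],
)

# First query: summary statistics for one field.
summary = conn.execute(
    """
    SELECT COUNT(*),
           COUNT(postal_code),
           COUNT(DISTINCT postal_code),
           MIN(LENGTH(postal_code)),
           MAX(LENGTH(postal_code))
    FROM customer
    """
).fetchone()

# Second query: the actual records behind a suspicious statistic.
# This is a separate round trip against the same source.
short_codes = conn.execute(
    "SELECT id, postal_code FROM customer "
    "WHERE LENGTH(postal_code) < 5 ORDER BY id"
).fetchall()
```

The summary shows four rows, three populated postal codes, two distinct values, and lengths ranging from 4 to 5. The follow-up query returns the single record with the short code.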
 
Architecture option: Data profiling repository
Other profiling technologies profile data as part of a scheduled process and store results in a profiling repository. Stored results can include content such as summary statistics, metadata, patterns, keys, relationships, and data values. Results can then be further analyzed by users or stored for later trending analysis.
Profiling repositories that allow users to drill down on information and see original data values in the context of source records provide the most versatility and stability for non-technical audiences. Independence from operational source systems, coupled with the vast amount of metadata and information derived from a point-in-time profile, gives a cross-functional team of business and IT resources a common, comprehensive view of source system data on which traceable decisions can be based.
Volume considerations: Should tables or files enter into the range of hundreds of millions of records, a profiling repository strategy should be considered. With volumes this large, the best strategy may be a blend of weekend-scheduled profiling processes and focused, non-contentious query-based profiling, closely monitored by IT.
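The repository approach described above can be sketched as follows: metrics are computed once per scheduled run, stored with the run date, and later read back for trending without touching the source system. This is a minimal illustration only. The profile_result table layout, the function names, the source name crm.customer, and the dates are all assumptions, and an in-memory SQLite database stands in for the repository.

```python
import sqlite3

# A separate database stands in for the profiling repository (assumption).
repo = sqlite3.connect(":memory:")
repo.execute(
    "CREATE TABLE profile_result ("
    "run_date TEXT, source TEXT, field TEXT, metric TEXT, value REAL)"
)


def store_profile(run_date, source, field, values):
    """Compute metrics for one field and save them as a point-in-time result."""
    present = [v for v in values if v is not None]
    metrics = {
        "row_count": len(values),
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
    }
    repo.executemany(
        "INSERT INTO profile_result VALUES (?, ?, ?, ?, ?)",
        [(run_date, source, field, name, value) for name, value in metrics.items()],
    )


def trend(source, field, metric):
    """Read one metric across all stored runs, oldest first."""
    rows = repo.execute(
        "SELECT run_date, value FROM profile_result "
        "WHERE source = ? AND field = ? AND metric = ? ORDER BY run_date",
        (source, field, metric),
    )
    return rows.fetchall()


# Two hypothetical weekend runs against the same field.
store_profile("2024-01-06", "crm.customer", "email", ["a@x.com", None, "b@x.com"])
store_profile("2024-01-13", "crm.customer", "email", ["a@x.com", None, None, "b@x.com"])
history = trend("crm.customer", "email", "null_count")
```

Because each result carries its run date, a reviewer can trace any figure back to the profile it came from, and the trend query reads only the repository.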
Work environment: Multi-user workspace
Some profiling tools are designed as desktop solutions for resources to use as a team of one. How many resources will be involved in your data profiling efforts? For large projects, there is generally a cross-functional team involved.
Consider the environment a profiling tool provides, since multiple users with different expertise and varying levels of technical skill all need to be able to access and clearly see the condition of the data. Even if some prospective data profilers are skilled in SQL and database technologies, profiling tools that foster collaboration between business users and IT offer greater value overall. With a common window on the data sets, people with diverse backgrounds can concretely and productively discuss the data, its current state, and what is required to move forward.
Work environment: Graphical Interface
Because users may not be familiar with database structures and technologies, it is important to find a tool that provides an intuitive, easy-to-learn graphical user interface (GUI). Appropriate security features should also be part of the work environment, so that access to restricted fields or records containing sensitive information can be allowed or denied.
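One way to picture the field-level security described above is a per-role grant list that masks restricted fields before a record is displayed. The sketch below is an assumption for illustration, not a description of how any profiling product implements security. The role names, field names, and mask string are hypothetical.

```python
# Hypothetical sensitive fields and per-role grants.
RESTRICTED_FIELDS = {"ssn", "salary"}
ROLE_GRANTS = {
    "steward": {"ssn", "salary"},
    "analyst": {"salary"},
    "business_user": set(),
}


def visible_record(record, role):
    """Return a copy of the record with restricted fields masked unless the role is granted access."""
    allowed = ROLE_GRANTS.get(role, set())  # unknown roles get no grants
    return {
        field: value if field not in RESTRICTED_FIELDS or field in allowed else "***"
        for field, value in record.items()
    }


record = {"name": "Pat Lee", "ssn": "123-45-6789", "salary": 52000}
analyst_view = visible_record(record, "analyst")
business_view = visible_record(record, "business_user")
```

Unrestricted fields pass through unchanged for every role, so all users still share the same view of the non-sensitive data.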
What Follows Data Profiling?
Once the task of data profiling is complete, there is more to do. Keep in mind both the short- and long-term goals driving the need to profile your data. Leverage your investments by understanding what follows and see if there are logical extensions to your profiling efforts that can be executed within the same tool.
ETL projects for data integration or migration use profiling results to design target systems, define how to accurately integrate multiple data sets, and efficiently move data to a new system, taking all data conditions into consideration.
Data quality processes that improve the accuracy, consistency, and completeness of data use results to identify problems or anomalies and then develop rules for automated cleansing and standardization.
Data monitoring initiatives use profiling results to establish automated processes for ongoing assessment of key data elements and acute data conditions in production systems. The profiling repository captures results, sends alerts, and centrally manages data standards.
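As a rough sketch of the monitoring step, the Python example below compares profiled metrics against centrally defined limits and produces alert messages for any that are exceeded. The metric names, limits, and message wording are assumptions for illustration. How alerts are actually delivered is outside the scope of this example.

```python
def check_thresholds(metrics, rules):
    """Compare profiled metrics against their limits and return alert messages."""
    alerts = []
    for name, limit in rules.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name} is {value:.1%}, above the {limit:.1%} limit")
    return alerts


# Hypothetical data standards (maximum acceptable rates) and the
# metrics captured by the latest profiling run.
rules = {"email_null_rate": 0.05, "postal_code_invalid_rate": 0.02}
metrics = {"email_null_rate": 0.08, "postal_code_invalid_rate": 0.01}

alerts = check_thresholds(metrics, rules)
```

With these sample figures, only the email null rate exceeds its limit, so a single alert is produced.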