An Architecture for Data Quality

A Kimball Group White Paper
By Ralph Kimball

Table of Contents
Executive Summary
About the Author
A Growing Awareness for Data Quality
Where Do Data Quality Problems Originate?
Establish a Quality Culture and Reengineer the Processes
The Role of the Data Steward
How Can Technology Address Data Quality?
Importance of Data Profiling
Negotiate Data Quality with the Source Systems
Establish a Feedback Path to the Source
Quality Screens: The Heart of the Architecture
Error Event Schema
Responding to Quality Events
The Audit Dimension
The Audit Dimension Payoff
Advanced Design Issues
Six Sigma Data Quality
Building a Data Quality Architecture
Conclusion

Executive Summary
In this white paper, we propose a comprehensive architecture for capturing data
quality events, as well as measuring and ultimately controlling data quality in the
data warehouse. This scalable architecture can be added to existing data
warehouse and data integration environments with minimal impact and relatively
little upfront investment. Using this architecture, it is even possible to progress
systematically toward a Six Sigma level of quality management. This design is in
response to the current lack of a published, coherent architecture for addressing
data quality issues.

About the Author
Ralph Kimball founded the Kimball Group. Since the mid-1980s, he has been the
data warehouse/business intelligence (DW/BI) industry's thought leader on the
dimensional approach and trained more than 10,000 IT professionals. Prior to
working at Metaphor and founding Red Brick Systems, Ralph co-invented the Star
workstation at Xerox's Palo Alto Research Center (PARC). Ralph has his Ph.D. in
Electrical Engineering from Stanford University.

The Kimball Group is the source for dimensional DW/BI consulting and education,
consistent with our best-selling Toolkit book series, Design Tips, and
award-winning articles. Visit www.kimballgroup.com for more information.
Copyright © 2007 by Kimball Group. All rights reserved.

A Growing Awareness for Data Quality
Three powerful forces have converged to put data quality concerns near the top of
the list for organization executives. First, the long-term cultural trend that says "if
only I could see the data then I could manage my business better" continues to
grow; today, in 2007, most knowledge workers believe instinctively that data is
a crucial requirement to perform their jobs. Second, most organizations understand
that they are profoundly distributed, often around the world, and that effectively
integrating myriad disparate data sources is required. And third, the sharply
increased demands for compliance mean that careless handling of data is not going
to be overlooked or excused.

These powerful converging forces illuminate data quality problems in a harsh light.
Fortunately, the big pressures are coming from the business users, not just from IT.
The business users have become aware that data quality is a serious and
expensive problem. Thus the organization is more likely to support initiatives to
improve data quality. But most business users probably have no idea where data
quality problems originate or what an organization can do to improve data quality.
They may think that data quality is a simple execution problem in IT. In this
environment IT needs to be agile and proactive: data quality cannot be improved by
IT alone. An even more extreme view says that data quality has almost nothing to
do with IT.

Where Do Data Quality Problems Originate?
It is tempting to blame the original source of data for any and all errors that appear
downstream. If only the data entry clerk were more careful, and REALLY cared! We
are only slightly more forgiving of typing-challenged salespeople who enter
customer and product information into their order forms. Perhaps we can fix data
quality problems by imposing better constraints on the data entry user interfaces.
This approach provides a hint of how to think about fixing data quality, but we must
take a much larger view before pouncing on technical solutions. At a large retail
bank we worked with, the social security number fields for customers were often
blank or filled with garbage. Someone came up with the brilliant idea to require input
in the 999-99-9999 format, and to cleverly disallow nonsensical entries such as all
9's. What happened? The data entry clerks were forced to supply valid social
security numbers in order to progress to the next screen, so when they didn't have
the customer's number, they typed in their own!

Michael Hammer, in his revolutionary book, Reengineering the Corporation
(HarperBusiness 1994), struck to the heart of the data quality problem with a
brilliant insight that I have carried with me throughout my career. Paraphrasing
Hammer: "Seemingly small data quality issues are, in reality, important indications
of broken business processes." Not only does this insight correctly focus our
attention on the source of data quality problems, but it shows us the way to the
solution.

Establish a Quality Culture and Reengineer the Processes
Technical attempts to address data quality will not function unless they are part of
an overall quality culture that must come from the very top of an organization. The
famous Japanese car manufacturing quality attitude permeates every level of those
organizations; quality is embraced enthusiastically by all levels, from the CEO down
to the assembly line worker. To cast this in a data context, imagine a company like a
large drugstore chain, where a team of buyers contracts with thousands of
suppliers to provide the drugstore inventory. The buyers have assistants, whose
job it is to enter the detailed descriptions of everything purchased by the buyers.
These descriptions contain dozens of attributes. But the problem is that the
assistants have a deadly job. They are judged on how many items they enter per
hour. The assistants have almost no visibility of who uses their data. Occasionally
the assistants are scolded for obvious errors. But more insidiously, the data given to
the assistants is itself incomplete and unreliable. For example, there are no formal
standards for toxicity ratings, so there is significant variation over time and across
product categories for this attribute. How does the drugstore improve data quality?
Here is a nine-step template, not only for the drugstore, but for any organization
addressing data quality:
1. Declare a high-level commitment to a data quality culture.
2. Drive process reengineering at executive levels.
3. Spend money to improve the data entry environment.
4. Spend money to improve application integration.
5. Spend money to change how processes work.
6. Promote end-to-end team awareness.
7. Promote interdepartmental cooperation.
8. Publicly celebrate data quality excellence.
9. Continuously measure and improve data quality.

At the drugstore, money needs to be spent to improve the data entry system so that
it provides the content and choices needed by the buyers' assistants. The
company's executives need to assure the assistants that their work is very
important and their efforts affect many decision makers in a positive way. Diligent
efforts by the assistants should be publicly praised and rewarded. And end-to-end
team awareness and appreciation of the value of data quality is the final goal.

The Role of the Data Steward
The data steward is responsible for driving organizational agreement on definitions,
business rules, and permissible domain values for the data warehouse data, and
then publishing and reinforcing these definitions and rules. Historically, this role was
referred to as data administration, a function within the IT organization. However,
it's much better if the data steward role is staffed by the subject matter experts from
the business community.

Clearly, this is a politically challenging role. Stewards must be well-respected
leaders, committed to working through the inevitable cross-functional issues, and
supported by senior management, especially when organizational compromise is
required.

Sometimes the data stewards are supported by quality assurance (QA) analysts
who ensure that the data loaded into the warehouse is accurate and complete. They
identify potential data errors and drive them to resolution. The QA analyst is
sometimes also responsible for verifying the business integrity of the business
intelligence (BI) applications. This role is typically staffed from within the business
community, often with resources who straddle the business and IT organizations.
Once a data error has been identified by the QA analyst, it must be
corrected at the source, fixed in the extract, transform, and load (ETL) steps, or
tagged and passed through the ETL steps. Remember, data quality errors are
indicators of broken business processes and often require executive support to
correct. Relatively few data errors can be fixed in the data warehouse.

The QA analyst has a significant workload during the initial data load to ensure that
the ETL system is working properly. And given the need for ongoing data
verification, the QA analyst role does not end once the warehouse is put into
production.

How Can Technology Address Data Quality?
Once the executive support and the organizational framework are ready, then
specific technical solutions are appropriate. The rest of this paper describes how to
marshal technology to support data quality. Our goals for the technology include:
• Early diagnosis and triage of data quality issues.
• Specific demands on source systems and integration efforts to
  supply better data.
• Specific descriptions of data errors expected to be encountered in
  ETL.
• Framework for capturing all data quality errors.
• Framework for precisely measuring data quality metrics over time.
• Quality confidence metrics attached to final data.

Importance of Data Profiling
Data profiling is the systematic analysis of data to describe its content, consistency,
and structure. In some sense, any time you perform a SELECT DISTINCT
investigative query on a database field you are doing data profiling. Today there are
a variety of tools purpose-built to do powerful data profiling. It probably pays to
invest in a tool rather than build your own, because the tools allow many data
relationships to be explored easily with simple user interface gestures. You will be
much more productive in the data profiling stages of your project using a tool rather
than hand coding all your data content questions.
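
To make the hand-coded end of that spectrum concrete, here is a minimal sketch of
a single investigative query, assuming a hypothetical customer table with a state
column held in an in-memory SQLite database. It uses GROUP BY, a close cousin of
SELECT DISTINCT, to list each distinct value with its frequency. The table, column,
and function names are invented for illustration and do not come from any profiling
tool.

```python
import sqlite3

def profile_column(conn, table, column):
    """Return (value, row_count) pairs for one column, most frequent first.

    The table and column names are interpolated directly into the SQL, so
    this sketch assumes they come from a trusted source such as the catalog.
    """
    sql = (
        f"SELECT {column}, COUNT(*) AS n FROM {table} "
        f"GROUP BY {column} ORDER BY n DESC, {column}"
    )
    return conn.execute(sql).fetchall()

# Hypothetical sample data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "CA"), (2, "CA"), (3, "Calif."), (4, None), (5, "NY")],
)

state_profile = profile_column(conn, "customer", "state")
for value, row_count in state_profile:
    print(repr(value), row_count)
```

Even this tiny profile surfaces two typical findings: a missing value and two
spellings of the same state. A purpose-built tool answers many such questions
without a new query for each one.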

Data profiling plays distinct strategic and tactical roles. At the beginning of a data
warehouse project, as soon as a candidate data source is identified, a quick data
profiling assessment should be made to provide a "go/no go" decision about
proceeding with the project. Ideally this strategic assessment should occur within a
day or two of identifying a candidate data source! Early disqualification of a data
source is a responsible step that will earn you respect from the rest of the team,
even if it is bad news. A very late revelation that the data source cannot support the
mission can be a fatal career outcome for you if this realization occurs many months
into a project.

Negotiate Data Quality with the Source Systems
Data profiling is a process of ongoing discovery. As far as the data warehouse is
concerned, the ideal place to address data quality issues is at the source. However,
the data warehouse team needs to proceed cautiously in order to get the best
possible response from the owners of the source systems. Before even starting, a
culture needs to be established by senior management that encourages finding data
quality issues and fixing them all as part of an enterprise-wide team effort. We can
take a page from the experience of car manufacturers who have found it very
successful to give rewards to workers who find quality problems and propose
solutions.

The data warehouse team needs to be sensitive to the impact that process
reengineering may have on the source systems, both at the technical implementation
level and the operational level. Ideally, a single list of showstopper quality problems
can be developed in the initial data profiling effort and dealt with by a joint team
consisting of DW/BI and source system personnel and chaired by a senior executive.
Again, ideally the DW/BI team does not return repeatedly with more demands for
significant changes in the source systems. But we rarely live in an ideal world.

Establish a Feedback Path to the Source
If the enterprise has a master data management (MDM) system that assembles
master copies of important entities such as customer, product, and location, then
presumably source systems around the enterprise are committed to using "gold"
copies of these entities. Perhaps updated customer lists are downloaded as batch
files by a source system from the MDM system, or individual customer profiles are
fetched on request from the MDM system through a service oriented architecture
(SOA) interface. Either of these architectures is a great outcome for the data
warehouse because it means not only that data quality is recognized and valued, but
also that all those tasks are already being handled by someone else!

But it is reasonably likely that the data warehouse itself will gradually morph into an
MDM system, perhaps without the enterprise being aware of the implications. In The
Data Warehouse ETL Toolkit (Wiley 2004), we describe the steps of cleaning,
de-duplicating, surviving and conforming multiple data feeds from different source
systems to develop the same kind of gold masters of entities that MDM systems
produce. If that has happened, the data warehouse team should get senior
management's commitment to take the final step of feeding the cleaned data back to
the source system, to avoid the Sisyphean fate of cleaning the same dirty data over
and over.

Quality Screens: The Heart of the Architecture
The heart of the data warehouse quality architecture is a set of quality screens that
act as diagnostic filters in the back room ETL data flow pipelines. A quality screen is
simply a test, implemented at any point in the ETL or data migration processes. If
the test is successful, nothing happens and the screen has no side effects. But if the
test fails, then every screen has two simple responsibilities:
• Drop an error event record into the error event schema, and
• Choose to halt the process, send the offending data into suspension,
  or merely tag the data.

Although all quality screens are architecturally similar, it is convenient to divide
them into three types, in ascending order of scope. Here we follow the
categorizations of data quality as defined by Jack Olson, in his seminal book, Data
Quality: The Accuracy Dimension (Morgan Kaufmann 2002):
• Column screens
• Structure screens
• Business rule screens

Column screens test the data within a single column. These are usually simple,
somewhat obvious tests. Jack Olson gives a number of useful examples of each
kind of screen, which he calls rules, and which we repeat here with kind permission.
His column screen examples include:
• Value must be not null.
• Value must be one character and from a finite fixed list.
• Value must be within a range.
• Value must fit a specific field pattern such as Z9999.
• Value must either be null or greater than 5 characters of free text.
• Value must not be in a specific exclusion list.
• Value must not fail spell checker.
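
Several of the column screens listed above can be sketched as small predicate
functions that return True when the value passes. This is only an illustration: the
function names, the allowed status codes, and the reading of the pattern Z9999 as
one uppercase letter followed by four digits are all assumptions, not definitions
taken from Olson's book.

```python
import re

ALLOWED_STATUS = {"A", "I", "P"}              # hypothetical finite fixed list
PART_PATTERN = re.compile(r"[A-Z][0-9]{4}")   # assumed reading of Z9999

def not_null(value):
    """Value must be not null."""
    return value is not None

def in_fixed_list(value):
    """Value must be one character and from a finite fixed list."""
    return value in ALLOWED_STATUS

def in_range(value, low, high):
    """Value must be within a range (inclusive bounds assumed)."""
    return value is not None and low <= value <= high

def matches_pattern(value):
    """Value must fit a specific field pattern."""
    return value is not None and PART_PATTERN.fullmatch(value) is not None

def null_or_long_text(value):
    """Value must either be null or greater than 5 characters of free text."""
    return value is None or len(value) > 5

# A failing test is what would trigger an error event record.
sample = {"status": "X", "part": "A1234", "note": "short"}
failures = [
    name
    for name, passed in [
        ("status", in_fixed_list(sample["status"])),
        ("part", matches_pattern(sample["part"])),
        ("note", null_or_long_text(sample["note"])),
    ]
    if not passed
]
```

Keeping each screen as a pure test with no side effects matches the description
above: nothing happens on success, and the caller decides what to record on failure.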

Structure screens test the relationship of data across columns. Two or more fields
may be tested to verify that they implement a hierarchy (e.g., a series of
many-to-one relationships). Structure screens include testing foreign key/primary key
relationships between fields in two tables, and also include testing whole blocks of
fields to verify that they implement valid postal addresses.
Jack Olson's examples include:
• A combination of fields must implement a primary key for the
  surrounding table.
• An inventory history part number must appear in the inventory
  master.
• All inventory parts must have at least one source.
• All suppliers must supply at least one part.
• A supplier may have no orders.
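
The first two structure screens in this list can be sketched as follows. The row
layout (lists of dictionaries), the field names, and the sample data are
hypothetical; in a real ETL system these tests would more likely run as set-based
queries against staging tables.

```python
from collections import Counter

def duplicate_keys(rows, key_fields):
    """Return key tuples that occur more than once (a primary key screen)."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)

def orphan_values(child_rows, child_field, parent_rows, parent_field):
    """Return child values with no matching parent (a foreign key screen)."""
    parents = {row[parent_field] for row in parent_rows}
    return sorted({row[child_field] for row in child_rows} - parents)

# Hypothetical data, invented for illustration only.
inventory_master = [{"part_number": "P1"}, {"part_number": "P2"}]
inventory_history = [
    {"part_number": "P1", "snapshot_date": "2007-01-01"},
    {"part_number": "P1", "snapshot_date": "2007-01-01"},
    {"part_number": "P9", "snapshot_date": "2007-01-01"},
]

dupes = duplicate_keys(inventory_history, ["part_number", "snapshot_date"])
orphans = orphan_values(
    inventory_history, "part_number", inventory_master, "part_number"
)
```

Here the duplicated (P1, 2007-01-01) pair fails the primary key screen, and part
P9 fails the screen requiring every history part number to appear in the master.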

Business rule screens implement more complex tests that do not fit the simpler
column or structure screen categories. For example, a customer profile may be
tested for a complex time-dependent business rule, such as requiring that a Lifetime
Platinum frequent flyer has been a member for at least 5 years and has more than
two million frequent flyer miles. Business rule screens also include aggregate
threshold data quality checks, such as checking to see if a statistically improbable
number of MRI examinations have been ordered for minor diagnoses like sprained
elbow. In this case, the screen only throws an error after a threshold of such MRI
exams is reached.

Jack Olson further subdivides business rule screens into three subcategories:
simple data rules, complex data rules, and value rules.
Simple data rules:
• The quantity on order cannot be less than the minimum order
  quantity.
• If the number of orders placed is zero, the quantity ordered must also
  be zero.
• The date a supplier is established must not be later than the last
  order placed with this supplier. Neither date may be later than the
  current date.
Complex data rules:
• Two sources for the same part number cannot have the same
  priority.
• There should be no outstanding orders for parts that are marked as
  Do not order.
Value rules:
• Compute the number of orders for the same part in each month
  period and verify that no part has more than two orders in a single
  month, because the inventory re-ordering algorithm is supposed to
  prohibit this from happening.
• Verify that the total number of orders for each major category of
  inventory does not vary month-to-month by more than 10%.
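
As an illustration of the last value rule, the sketch below flags consecutive months
whose order totals differ by more than 10%. The function name, the input layout,
and the choice to measure the change relative to the earlier month are assumptions
made for this example.

```python
def month_to_month_violations(monthly_totals, max_change=0.10):
    """Return (previous_month, month, fractional_change) for each jump
    larger than max_change.

    monthly_totals is a list of (month_label, order_count) pairs in
    calendar order for one inventory category.
    """
    violations = []
    for (prev_label, prev), (label, current) in zip(
        monthly_totals, monthly_totals[1:]
    ):
        if prev == 0:
            continue  # change is undefined against an empty month
        change = abs(current - prev) / prev
        if change > max_change:
            violations.append((prev_label, label, change))
    return violations

# Hypothetical order counts, invented for illustration only.
orders = [("2007-01", 200), ("2007-02", 210), ("2007-03", 260)]
flagged = month_to_month_violations(orders)
```

The January to February change of 5% passes, while the February to March change
of roughly 24% would cause the screen to throw an error.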

Error Event Schema
The error event schema is a centralized dimensional schema whose purpose is to
record every error event thrown by a quality screen anywhere in the data
warehouse ETL pipeline. Although we are focusing on data warehouse processing,
this approach obviously can be used in general data integration (DI) applications,
where data is being transferred between legacy applications. The error event
schema is shown in Figure 1.

Figure 1. The Error Event Schema.

Error event fact (grain = screen event)
    Error event key (PK)
    Error event date (FK)
    Screen key (FK)
    Batch key (FK)
    Time of day
    Severity score
Date dimension
    Error event date (PK)
    Date attributes
Batch dimension
    Batch key (PK)
    Batch attributes
Screen dimension
    Screen key (PK)
    Screen type
    ETL module
    Screen processing def
    Exception action
Error event detail (grain = error field in each error event)
    Error event key (FK)
    Table (FK)
    Record identifier (FK)
    Field identifier (FK)
    Error condition

The main table is the error event fact table. Its grain is every screen event: an error
thrown (produced) by a quality screen anywhere in the ETL or data migration
system. Remember that the grain of a fact table is the physical description of why a
fact table record exists. Thus every quality screen error produces exactly one record
in this table, and every record in the table corresponds to an observed error.

The dimensions of the error event fact table include the calendar date of the error,
the batch job in which the error occurred, and the screen which produced the error.
The calendar date is not a minute and second time stamp of the error, but rather
provides a way to constrain and summarize error events by the usual attributes of
the calendar, such as weekday or last day of a fiscal period. The time-of-day fact is
a full relational date-time stamp that specifies precisely when the error occurred.
This format is useful for calculating the time interval between error events because
you can take the difference between two date-time stamps to get the number of
seconds separating events.

The batch dimension is an interesting dimension that contains as much data as
possible describing the specific batch job in which the error occurred. The batch
dimension can be generalized to be an individual processing step in cases where
data is streamed, rather than batched. The information in the batch dimension
comes from the ETL system workflow monitor, and could include:
• Batch schedule date-time stamp
• Actual batch starting and ending date-time stamps
• Total number of records processed in batch run
• Total number of screen tests performed in batch run
• Total number of errors encountered in batch run
• Database, processor, memory, and disk contention
• Maximum error severity score in batch run
The attributes in the batch dimension can behave either like dimensional
attributes subject to constraining and grouping, or like numeric facts subject to
summing and calculating. If they are used as numeric facts, a separate batch
measurement fact table can also be built.

The screen dimension identifies precisely what the screen criterion is and where the
code for the screen resides. It also defines what to do when the screen throws an
error. Screen dimension attributes include:
• Screen type: column, structure, or business rule. This could be
  augmented with a simple diagnostic assessment of the error, such as
  Incorrect, Ambiguous, Inconsistent, or Incomplete. This would allow
  the analyst to aggregate error events into interesting classifications.
• ETL module: one or more fields, probably describing a hierarchical
  location in the overall ETL system, indicating where the quality
  screen is embedded. The exact nature of these fields will depend on
  the architecture of your ETL system and on whether you are using a
  vendor-supplied ETL tool.
• Screen processing definition: A simple and straightforward design
  would include a terse, free text description of the screen test as well
  as a field containing the exact file location of the screen code. In
  some systems, the screen code could actually be fetched at run time
  from this location.
• Exception action: halt the process, send the record to a suspense
  file, or merely tag the data. We discuss the impact of these actions in
  the next section. The exception action can be generalized to
  describe what notifications are sent when a process is halted; where
  the suspense file is located; and how the data is tagged if it is passed
  through the screen. We describe exactly how to tag data in the
  section on the audit dimension.
• Optional default severity score (not shown in the figure): the value of
  the severity score (assigned within a range from 0.0 to 1.0) normally
  used in the fact table when this error event occurs. Special business
  rules could change the severity score in the fact table at certain
  points in time, or if many such errors have been reported.

The main error event fact table contains a single-part primary key shown as the
error event key, a time of day field and a severity score, besides the foreign keys to
the dimensions. The error event key, like the primary key of the dimension tables, is
a surrogate key consisting of a simple integer assigned sequentially as records are
added to the fact table. This key field is necessary in those situations where an
enormous burst of error records is added to the error event fact table all at once.
Hopefully this won't happen to you. This single-field primary key is also helpful for
the DBA who may need to call attention to a single record in this large fact table.

Near the end of this paper we discuss adapting the Six Sigma quality measuring
methodology to data warehouse data. The fundamental goal of reaching the Six
Sigma level is to achieve fewer than 3.4 errors per million opportunities. The batch
dimension described above contains enough information to test for Six Sigma. In
addition, the severity score in the error event fact table allows a more thoughtful
weighting of the severity of the errors if the team is uncomfortable "being penalized"
for all data quality errors in the same way. Since the severity score ranges from 0.0
to 1.0, these numbers can simply be added to get the weighted, as opposed to
absolute, error total in a given batch run. You can consult with local Six Sigma
methodologists to see if this weighted error reporting is controversial!
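
The arithmetic behind this test can be sketched in a few lines. The example assumes
that each screen test performed in a batch run counts as one opportunity, which is
an interpretation rather than something the methodology dictates, and the batch
figures are invented.

```python
SIX_SIGMA_LIMIT = 3.4  # errors per million opportunities

def errors_per_million(error_total, screen_tests):
    """Scale an error total to errors per million opportunities.

    Assumption: one screen test performed equals one opportunity.
    """
    if screen_tests <= 0:
        raise ValueError("screen_tests must be positive")
    return error_total * 1_000_000 / screen_tests

# Hypothetical batch run: severity scores of the error events it produced
# and the total number of screen tests from the batch dimension.
severity_scores = [1.0, 0.5, 0.25, 0.25]
screen_tests = 2_000_000

absolute_rate = errors_per_million(len(severity_scores), screen_tests)
weighted_rate = errors_per_million(sum(severity_scores), screen_tests)
meets_six_sigma = absolute_rate < SIX_SIGMA_LIMIT
```

The absolute rate counts every error event as 1, while the weighted rate sums the
severity scores, so the weighted figure can never exceed the absolute one.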

The error event schema includes a second error event detail fact table at a lower
grain. Each record in this table identifies an individual field in a specific data record
that participated in an error. Thus a complex structure or business rule error that
triggers a single error event record in the higher level error event fact table may
generate many records in this error event detail fact table. The two tables are tied
together by the error event key, which is a foreign key in this lower grain table. In
other words, there is a strict 1-to-many relationship between records in the parent
error event fact table and the child error event detail fact table. The error event
detail table identifies the table, record, field, and precise error condition, and
likewise could optionally inherit the date, screen, and batch dimensions from the
higher grain error event fact table. Thus a complete description of complex
multi-field, multi-record errors is preserved by these tables. The error event detail
table could also contain a precise date-time stamp, to provide a full description of
aggregate threshold error events where many records generate an error condition
over a period of time.

We now appreciate that each quality screen has the responsibility for populating
these tables at the time of an error.
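
A minimal sketch of that responsibility follows, using an in-memory SQLite database.
The column names mirror Figure 1, but the data types, the record_error_event
helper, and the sample keys are assumptions made for this illustration; the
dimension tables are omitted to keep the example short.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE error_event_fact (
    error_event_key  INTEGER PRIMARY KEY,  -- sequential surrogate key
    error_event_date TEXT,                 -- FK to date dimension
    screen_key       INTEGER,              -- FK to screen dimension
    batch_key        INTEGER,              -- FK to batch dimension
    time_of_day      TEXT,                 -- full date-time stamp
    severity_score   REAL
);
CREATE TABLE error_event_detail (
    error_event_key   INTEGER REFERENCES error_event_fact (error_event_key),
    table_name        TEXT,
    record_identifier TEXT,
    field_identifier  TEXT,
    error_condition   TEXT
);
""")

def record_error_event(conn, screen_key, batch_key, severity,
                       offending_fields, now=None):
    """Insert one fact row for the screen failure and one detail row per
    offending field, returning the new error event key."""
    now = now or datetime.now()
    cur = conn.execute(
        "INSERT INTO error_event_fact "
        "(error_event_date, screen_key, batch_key, time_of_day, severity_score) "
        "VALUES (?, ?, ?, ?, ?)",
        (now.date().isoformat(), screen_key, batch_key, now.isoformat(), severity),
    )
    event_key = cur.lastrowid
    conn.executemany(
        "INSERT INTO error_event_detail VALUES (?, ?, ?, ?, ?)",
        [(event_key, t, r, f, cond) for (t, r, f, cond) in offending_fields],
    )
    return event_key

# Hypothetical business rule failure involving two fields of one record.
key = record_error_event(
    conn, screen_key=7, batch_key=42, severity=0.5,
    offending_fields=[
        ("shipments", "row-1001", "discount_dollars", "out of bounds"),
        ("shipments", "row-1001", "gross_dollars", "inconsistent with discount"),
    ],
)
```

One call produces exactly one parent record and as many child records as there
are offending fields, which is the strict 1-to-many relationship described above.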

The error event schema described in this section is at the heart of the ETL pipeline,
and may be subject to intense bursts of record insertions when the pipeline is
processing data. In many ways, these tables are like production tables in a
transaction processing system. Complex queries on these tables should be
prohibited during times when that activity would slow the ETL pipeline. Or a hybrid
architecture might be used where individual screens simply write to log files, and the
error event schema updating is taken off the critical path of ETL processing. Of
course, that architecture might conflict with ETL managers' desires for real-time
reporting of errors!

In a large enterprise spanning many machines, many database instances, and
multiple independent ETL pipelines, you should not try to build a single centralized
error event set of tables. Rather, each ETL environment should have its own error
event schema. This certainly simplifies the immediate collection and management
of errors at a local level, but it adds an acceptable increase in complexity when
drilling across all the error event schema instances to see the enterprise view of
data quality.

Responding to Quality Events
We have already remarked that each quality screen has to decide what happens
when an error is thrown. The choices are: 1) halting the process; 2) sending the
offending record(s) to a suspense file for later processing; and 3) merely tagging the
data and passing it through to the next step in the pipeline. The third choice is by far
the best choice, whenever possible. Halting the process is obviously a pain because
it requires manual intervention to diagnose the problem, restart or resume the job,
or abort completely. Sending records to a suspense file is often a poor solution
because it is not clear when or if these records will be fixed and reintroduced to the
pipeline. Until the records are restored to the data flow, the overall integrity of the
database is questionable because records are missing. We recommend not using
the suspense file for minor data transgressions. The third option of tagging the data
with the error condition often works well. Bad fact table data can be tagged with the
audit dimension described in the next section. Bad dimension data can also be
tagged using the audit dimension or, in the case of missing or garbage data, can be
tagged with unique error values in the field itself.

Missing or obviously corrupt fact data can be handled with a least biased estimator:
an artificially generated value that is a good estimate of what the missing data
should have been. The audit dimension can adequately describe this condition so
that business users are not misled. Most data warehouse people are uncomfortable
with this approach because understandably they don't like making up the data. But
consider the alternatives of dropping the activity record in question or replacing the
corrupt value with NULL, or even worse, zero. In any of these cases, the total
activity measure will be misleading and wrong. Using a least biased estimator
brings the total activity counts and measures much closer to reality.
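
The effect on totals can be seen in a small sketch. The paper does not say which
estimator to use, so the mean of the observed values stands in here purely as an
assumption; the function name and the sample figures are also invented.

```python
from statistics import mean

def fill_with_estimate(values):
    """Replace missing (None) facts with the mean of the observed facts.

    Returns (filled_values, estimated_flags). The flags mark which facts
    were generated rather than measured, so that the audit dimension can
    describe the condition to business users.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to estimate from")
    estimate = mean(observed)
    filled = [estimate if v is None else v for v in values]
    flags = [v is None for v in values]
    return filled, flags

# Hypothetical daily unit counts; the third measurement is missing.
daily_units = [100, 110, None, 90]
filled, estimated = fill_with_estimate(daily_units)

total_with_zero = sum(v or 0 for v in daily_units)   # missing treated as zero
total_with_estimate = sum(filled)                    # missing estimated
```

Treating the missing day as zero understates the period total, while the estimated
value keeps the total in line with the surrounding days, and the flags preserve the
fact that one value was not actually measured.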

The Audit Dimension
The audit dimension is a normal dimension that is assembled in the back room by
the ETL process for each fact table. A sample audit dimension attached to a
shipments fact table is shown in Figure 2.

Figure 2. A Sample Audit Dimension Attached to a Shipments Fact Table.

Shipments Fact            Audit Dimension
Order_date (FK)           Audit_key (PK)
Ship_date (FK)            overall quality
Delivery_date (FK)        completeness
Ship_from (FK)            validation
Ship_to (FK)              out-of-bounds
Product (FK)              screens failed
Promotion (FK)            record modified
Terms (FK)                extract time stamp
Status (FK)               clean time stamp
Audit_key (FK)            conform time stamp
Order_num (DD)            ETL master version
Ship_num (DD)             allocation version
Line_num (DD)             currency version
number_units              error event group (FK)
gross_dollars
discount_dollars
terms_dollars
revenue_dollars
return_dollars
1 record for each         1 record for each
shipment line item        distinct audit condition

In Figure 2, the shipments fact table contains a typically long list of dimensional
foreign keys each labeled with FK, three degenerate dimensions labeled with DD,
and six additive numeric facts. This style of fact table with its keys and other fields
has been described many times in Kimball Group articles and books.

The audit dimension in Figure 2 contains typical metadata context recorded at the
moment when a specific fact table record is created. One might say that we have
elevated metadata to real data! The designer of the data quality system can include
as much or as little metadata as is convenient to record at the time of an error. To
visualize how audit dimension records are created, imagine that this shipments fact
table is updated once per day from a batch file. Suppose that today we have a
perfect run with no errors flagged. In this case we would generate only one audit
dimension record and it would be attached to every fact record loaded today. All of
the error conditions and version numbers would be the same for every record in this
morning's load. Hence only one dimension record would be generated. This audit
dimension record would be created during the final delivery step of the shipment
line item fact table by looking up the required information in the error event schema.

Now let us relax the strong assumption of a perfect run. If we had some fact records
whose discount dollars triggered an out-of-bounds error, then one more audit
dimension record would be needed to handle this condition. In the next few
paragraphs we discuss each of the fields in the audit dimension.
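
The "one record for each distinct audit condition" idea can be sketched as a lookup
that hands out a new audit key only when a new combination of audit attributes
appears. The function name, the row layout, and the two-attribute condition used
here are hypothetical simplifications of the full audit dimension in Figure 2.

```python
def assign_audit_keys(fact_rows, condition_of):
    """Give each distinct audit condition one audit key and attach that key
    to every fact row sharing the condition.

    condition_of maps a fact row to a hashable tuple describing its audit
    context (quality flags, version numbers, and so on).
    """
    audit_dimension = {}  # condition tuple -> audit_key
    for row in fact_rows:
        condition = condition_of(row)
        if condition not in audit_dimension:
            audit_dimension[condition] = len(audit_dimension) + 1
        row["audit_key"] = audit_dimension[condition]
    return audit_dimension

# Hypothetical load: one row tripped an out-of-bounds screen.
todays_rows = [
    {"line_num": 1, "out_of_bounds": False},
    {"line_num": 2, "out_of_bounds": True},
    {"line_num": 3, "out_of_bounds": False},
]
dimension = assign_audit_keys(
    todays_rows,
    lambda row: ("ETL v1", "Abnormal" if row["out_of_bounds"] else "OK"),
)
```

A perfect run would leave the dictionary with a single entry; the one out-of-bounds
row above adds exactly one more audit dimension record.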

The overall quality attribute could be a text field with a small number of discrete
values, such as Zero Defects, Minor, or Major. Additionally, the quality metric could
include a computed numeric field equal to 1 minus the fraction of the number of
defects found in this record compared to the number of screen tests performed.
Remember that all of this information is available from the tables in the error event
schema.
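
As a small worked example of that computed field, the sketch below derives the
numeric score and then maps it to one of the text labels. The cutoff of 0.9 between
Minor and Major is a made-up threshold for illustration; the paper names the labels
but does not define where the boundaries lie.

```python
def quality_score(defects_found, screen_tests_performed):
    """Return 1 minus the fraction of screen tests that found a defect."""
    if screen_tests_performed <= 0:
        raise ValueError("at least one screen test is required")
    return 1 - defects_found / screen_tests_performed

def quality_label(score):
    """Map a score to a discrete label (hypothetical cutoffs)."""
    if score == 1:
        return "Zero Defects"
    return "Minor" if score >= 0.9 else "Major"

# Hypothetical record: 1 defect found across 20 screen tests.
score = quality_score(defects_found=1, screen_tests_performed=20)
label = quality_label(score)
```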
Thecompleteness,validation,andoutofboundsattributesintheauditdimension
allowmoreprecisedescriptionsoftheconfidenceoftheparticularshipmentline
itemfacttablerecord.Thesekindsofmeasuresareparticularlyappealingto
businessusersbecausetheycandevelopconfidenceinthevalidityofthedata.
Completenessreferstowhetherallofthefactsintherecordhavebeendelivered.
Validationreferstowhetheranyfactsintherecordhavetriggeredbusinessrule
violations(asopposedtocolumnorstructureruleviolations).Theoutofbounds
attributedescribeswhetheranyofthevaluesintherecordexceedtheoutofbounds
thresholddefinedbythedataqualityscreen.Amoresophisticatedversionofthis
couldincludethemaximumvariancefromtheexpectedmeanofanyofthefact
valuesintherecord.
Thescreensfailedattributeisasimplecountofallscreensthatindicatedanerror
forthisfactrecord.Therecordmodifiedattributeshowswhetherthisrecordhas
everbeenupdatedsinceitsfirstcreation.Thefacttableusedinthisexampleisan
accumulatingsnapshotgraintable,whichisexpectedtoundergorevisionstoeach
record.lfthefacttablewereofthetransactiongrainorperiodicsnapshotgrain,then
therecordmodifiedattributewouldpotentiallyindicateamuchmoreseriousType1

AnArchitectureforDataQuality÷15
overwrite.lnanyofthesecases,therecordmodifiedattributecouldbeaugmented
witharecordupdatedtimestampthatwouldneedtohaveadefaultvalueforthe
casewherenoupdatehadoccurred.Thethreetypesoffacttablesandthevarious
slowlychangingdimension(SCD)typeshavebeendescribedmanytimesinthe
ToolkitseriesofbooksandKimballGrouparticles.
Theextract,cleanandconformtimestampsrefertowhentheoverallbatch
processingfinishedthatcontainedthefactrecordinquestion.
Finally, the three version numbers are examples of master metadata version
numbers that hopefully you maintain to describe exactly which version of the
processing logic you used when this fact record was created. These version
numbers are enormously useful in financial auditing or compliance tracking
situations. Suppose that the allocation logic that assigned cost data to various
product categories was changed at some point during the fiscal year. The allocation
version number could be used as a row header in a profitability report to show how
much profit was reported under the old and new allocation schemes. See the next
section where we show how to use the audit dimension.
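
To make the allocation example concrete, here is a minimal sketch of using the allocation version number as a row header. The version labels and profit figures are invented for illustration; in practice the version would come from the audit dimension row attached to each fact record.

```python
from collections import defaultdict

# Hypothetical fact rows: (allocation_version, profit).
rows = [
    ("v1", 120.0),
    ("v1", 80.0),
    ("v2", 150.0),
]


def profit_by_allocation_version(fact_rows):
    """Total profit per allocation version, i.e. one report row per version."""
    totals = defaultdict(float)
    for version, profit in fact_rows:
        totals[version] += profit
    return dict(totals)


report = profit_by_allocation_version(rows)
```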
The Audit Dimension Payoff
The power of the audit dimension becomes most apparent when the dimension
attributes are used in a business user report as shown in Figure 3.
Figure 3. Normal and Instrumented Reports Using the Audit Dimension.
The top report is a normal report showing sales of the Axon product in two
geographic regions. The lower report is the same report with the out-of-bounds
indicator added to the set of row headers with a simple user interface command.
This produces an instant data quality assessment of the original report, and shows
that the Axon West sales are suspect.
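
The two reports in Figure 3 differ only by one extra column in the SELECT list and GROUP BY clause. The sketch below reproduces them with an in-memory SQLite database. The table and column names are assumptions made for this example, and product and ship-from are stored directly on the fact rows to keep it short; the row values are chosen so that the totals match Figure 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE audit_dim (audit_key INTEGER PRIMARY KEY, out_of_bounds TEXT);
CREATE TABLE shipment_fact (
    product TEXT, ship_from TEXT, audit_key INTEGER,
    qty_shipped INTEGER, revenue INTEGER
);
INSERT INTO audit_dim VALUES (1, 'OK'), (2, 'Abnormal');
INSERT INTO shipment_fact VALUES
    ('Axon', 'East', 1, 1424, 232650),
    ('Axon', 'East', 2,   14,   2350),
    ('Axon', 'West', 1, 1574, 336000),
    ('Axon', 'West', 2,  675, 144000);
""")

# Normal report: no reference to the audit dimension.
normal_report = conn.execute("""
    SELECT product, ship_from, SUM(qty_shipped), SUM(revenue)
    FROM shipment_fact
    GROUP BY product, ship_from
    ORDER BY product, ship_from
""").fetchall()

# Instrumented report: add the out-of-bounds indicator as a row header.
instrumented_report = conn.execute("""
    SELECT f.product, f.ship_from, a.out_of_bounds,
           SUM(f.qty_shipped), SUM(f.revenue)
    FROM shipment_fact f
    JOIN audit_dim a ON a.audit_key = f.audit_key
    GROUP BY f.product, f.ship_from, a.out_of_bounds
    ORDER BY f.product, f.ship_from, a.out_of_bounds
""").fetchall()
```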
Advanced Design Issues
In this section, we describe further, more technical extensions that leverage the audit
dimension and error event data.
Normal Report
Product   Ship From   Qty Shipped   Revenue
Axon      East        1438          $235,000
Axon      West        2249          $480,000

Instrumented Report (add Out of Bounds Indicator to SELECT)
Product   Ship From   Out of Bounds Indicator   Qty Shipped   Revenue
Axon      East        Abnormal                  14            $2,350
Axon      East        OK                        1424          $232,650
Axon      West        Abnormal                  675           $144,000
Axon      West        OK                        1574          $336,000

Using the Error Event Group
Figure 2 shows a foreign key in the audit dimension referring to an error event
group. Figure 4 illustrates the error event group table for the fact table in our
example.

Figure 4. The Error Event Group Table Corresponding to Figure 2.
The fact table we have been using as an example in this paper has six fact fields
shown at the bottom of the table. In order to separately describe the data quality
status of each field, we need a record in the error event group table for each fact in
each distinct group of quality ratings. Thus in our example where we had a perfect
run, we would need exactly one error event group that would contain six records.
The six records would correspond to each field and a Normal rating. See the top
table in Figure 5 where the surrogate error event group key equals 2973.
Figure 5. Error Event Group Table Records for Two Error Conditions.
If some other shipment fact records contained values where the discount_dollars
triggered an out of bounds data quality warning, then the audit dimension for these
fact records would contain the foreign key 3575 pointing to the records shown in the
lower table in Figure 5.
The power of this event group table becomes apparent if the business user drags
the Fact Name and the Rating into a report as row headers, much like our example
in Figure 3. This would break out separate rows for each quality rating for that fact.
Fields of the error event group table (Figure 4):
  Error Event Group (FK)
  Fact Name
  Quality Rating

Error event group records (Figure 5), top table:
Key    Fact Name          Rating
2973   number_units       Normal
2973   gross_dollars      Normal
2973   discount_dollars   Normal
2973   terms_dollars      Normal
2973   revenue_dollars    Normal
2973   return_dollars     Normal

Error event group records (Figure 5), lower table:
Key    Fact Name          Rating
3575   number_units       Normal
3575   gross_dollars      Normal
3575   discount_dollars   Out of Bounds
3575   terms_dollars      Normal
3575   revenue_dollars    Normal
3575   return_dollars     Normal
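
As a rough sketch of the breakout by Fact Name and Rating, the snippet below counts fact records per (fact name, rating) pair using the two groups from Figure 5. The list of group keys carried by three fact records is invented for illustration; in the real schema each fact record reaches its group through the audit dimension foreign key.

```python
from collections import Counter

FACT_NAMES = ["number_units", "gross_dollars", "discount_dollars",
              "terms_dollars", "revenue_dollars", "return_dollars"]

# Error event group table from Figure 5, keyed by surrogate group key.
error_event_groups = {
    2973: {name: "Normal" for name in FACT_NAMES},
    3575: {**{name: "Normal" for name in FACT_NAMES},
           "discount_dollars": "Out of Bounds"},
}

# Hypothetical: the group key reached from each of three fact records.
fact_group_keys = [2973, 2973, 3575]


def rating_breakout(group_keys, groups):
    """Count fact records for every (fact name, rating) row header pair."""
    counts = Counter()
    for key in group_keys:
        for fact_name, rating in groups[key].items():
            counts[(fact_name, rating)] += 1
    return counts


breakout = rating_breakout(fact_group_keys, error_event_groups)
```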

Propagating Error Tracking Back Through a De-Duplicating Step
A good ETL system will detect duplicate entries in major dimensions, such as
customer, and will combine the activity records for those duplicated entries under a
single key, such as the customer key. This may be supported by the master data
management (MDM) system. This de-duplicating step raises a problem because
information is potentially lost when separate original natural keys are combined into
one "super-natural" key. The solution, in this case, is to maintain back pointers to
the original natural keys from the final customer dimension, for instance, as shown
in Figure 6.

Figure 6. Example Customer Dimension Showing Back Pointers to Original Natural Keys.
In Figure 6, the Customer primary key shown as the first field is the clean
de-duplicated surrogate key that is the end result of the cleaning process. The last four
fields in the table are literal natural keys coming from the original source systems
that interact with the customer. In cases where an error is thrown by a quality
screen testing a final fact table value, the analyst has the option of tracing back to
the original raw input data through one of the back pointers, or conversely if the
error refers to the original source data, the analyst can trace the impact of this error
downstream past the de-duplication step.
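
The two directions of tracing can be sketched as follows. This is an illustrative model only: the class, the snake_case field names (adapted from the last four fields of Figure 6), and the sample key values are assumptions, not part of the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

BACK_POINTER_FIELDS = (
    "prospect_id",
    "call_center_contact_id",
    "mass_mailing_contact_id",
    "credit_partner_contact_id",
)


@dataclass(frozen=True)
class Customer:
    customer_key: int                      # clean de-duplicated surrogate key
    name: str
    prospect_id: Optional[str] = None
    call_center_contact_id: Optional[str] = None
    mass_mailing_contact_id: Optional[str] = None
    credit_partner_contact_id: Optional[str] = None


def source_keys(customer: Customer) -> Dict[str, str]:
    """Trace back: original natural keys behind one de-duplicated customer."""
    return {f: getattr(customer, f) for f in BACK_POINTER_FIELDS
            if getattr(customer, f) is not None}


def customers_for_source_key(customers: List[Customer], field_name: str,
                             natural_key: str) -> List[Customer]:
    """Trace forward: customers affected by an error on a source natural key."""
    if field_name not in BACK_POINTER_FIELDS:
        raise ValueError(f"unknown back pointer field: {field_name}")
    return [c for c in customers if getattr(c, field_name) == natural_key]


acme = Customer(1001, "Acme Corp", prospect_id="P-17",
                call_center_contact_id="CC-4402")
```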
Estimating Correct Measured Values from Prior History
Suppose we have a fact table that tracks daily sales in 600 stores, each of which
has 30 departments. We therefore receive 18,000 sales numbers each day. This
section describes a quick statistical check, based on calculating standard
deviations, that allows us to judge each of the 18,000 incoming numbers for
reasonableness and assign data confidence metrics to all of them in a single pass.
We assume that this test is conducted by a data quality screen.

Customer dimension
  Customer (PK)
  Customer (NK)
  Type
  Name
  Street
  City
  State
  Country
  Postal Code
  Assembled Address
  Date First Contact (FK)
  Prospect ID (FK)
  Call Center Contact ID (FK)
  Mass Mailing Contact ID (FK)
  Credit Partner Contact ID (FK)
Note on the last four fields: if original sources have duplicate references,
then can implement multi-valued dimension with bridge table.

The technique described here lets us quickly update the statistical base of numbers
to get ready for tomorrow's data load. Going back to the statistics course you took in
college, remember that the standard deviation is the square root of the variance.
The variance is the sum of the squares of the differences between each of the
historical data points and the mean of the data points, divided by N-1, where N is
the number of days of data. Unfortunately, this formulation could require us to look
at the entire time history of sales, which, although possible, makes the computation
unattractive in a fast-moving ETL environment. But if we have been keeping track of
SUM_SALES and SUM_SQUARE_SALES, we can write the variance as (1/(N-1)) *
(SUM_SQUARE_SALES - (1/N) * SUM_SALES * SUM_SALES). Check the algebra!
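
One way to check the algebra is numerically. The sketch below computes the variance from the two running sums and compares it with the textbook definition as implemented by Python's statistics.variance. The five sales figures are made up for the check.

```python
import statistics


def variance_from_sums(n: int, sum_sales: float, sum_square_sales: float) -> float:
    """Sample variance from running totals: (1/(N-1)) * (SUMSQ - SUM*SUM/N)."""
    if n < 2:
        raise ValueError("need at least two days of data")
    return (sum_square_sales - sum_sales * sum_sales / n) / (n - 1)


history = [100.0, 110.0, 95.0, 105.0, 90.0]   # hypothetical daily sales
n = len(history)
sum_sales = sum(history)
sum_square_sales = sum(x * x for x in history)

var_from_sums = variance_from_sums(n, sum_sales, sum_square_sales)
var_textbook = statistics.variance(history)   # uses the N-1 definition
```

The two values agree for this data. Because the formula subtracts two large totals, it can lose floating-point precision when the mean is very large compared with the spread, which is worth keeping in mind for long histories.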
So if we abbreviate the above variance formula with VAR, our data validity check for
all anomalous values more than 3 standard deviations from the expected mean
looks like:
SELECT s.storename, p.departmentname, SUM(f.sales)
FROM fact f, store s, product p, time t, accumulatingdept a
WHERE
  -- first, joins between tables...
  f.storekey = s.storekey AND f.productkey = p.productkey AND
  f.timekey = t.timekey AND s.storename = a.storename AND
  p.departmentname = a.departmentname AND
  -- then, constrain the time to today to get the newly loaded data...
  t.full_date = DATE '2007-10-13'
-- group by the row headers and the per-department statistics used below
GROUP BY s.storename, p.departmentname, a.N, a.SUM_SALES, a.SUM_SQUARE_SALES
-- finally, invoke the standard deviation constraint...
HAVING ABS(SUM(f.sales) - (1.0/a.N)*a.SUM_SALES) > 3*SQRT(a.VAR)
We expand VAR as in the previous explanation and use the a. prefix on N,
SUM_SALES and SUM_SQUARE_SALES. We have assumed that departments are
groupings of products and hence are available as a rollup in the product dimension.
Every row returned from this query would provoke an entry in the error event fact
table, where the diagnosis was "Abnormal Value."
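
The same check can be sketched outside the database as a small screen function. This is an illustration under stated assumptions, not the paper's implementation: the DeptStats structure stands in for a row of the accumulating department table, the event dictionary stands in for a row of the error event fact table, and the sample numbers are invented.

```python
import math
from typing import Dict, List, NamedTuple, Tuple


class DeptStats(NamedTuple):
    n: int                   # days of history
    sum_sales: float
    sum_square_sales: float


def abnormal_values(today_sales: Dict[Tuple[str, str], float],
                    stats: Dict[Tuple[str, str], DeptStats],
                    sigmas: float = 3.0) -> List[dict]:
    """Return one error event per (store, department) outside the sigma band."""
    events = []
    for (store, department), sales in today_sales.items():
        s = stats[(store, department)]
        mean = s.sum_sales / s.n
        var = (s.sum_square_sales - s.sum_sales ** 2 / s.n) / (s.n - 1)
        std_dev = math.sqrt(max(var, 0.0))   # guard against rounding below zero
        if abs(sales - mean) > sigmas * std_dev:
            events.append({"store": store, "department": department,
                           "sales": sales, "diagnosis": "Abnormal Value"})
    return events


stats = {("Store 1", "Toys"): DeptStats(5, 500.0, 50250.0),
         ("Store 1", "Garden"): DeptStats(5, 500.0, 50250.0)}
today = {("Store 1", "Toys"): 130.0, ("Store 1", "Garden"): 110.0}
events = abnormal_values(today, stats)
```

With these numbers the mean is 100 and three standard deviations is about 23.7, so only the Toys figure of 130 is reported.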
Embellishments on this scheme could include running two queries: one for the sales
MORE than three standard deviations above the mean and another for sales MORE
than three standard deviations below the mean. Maybe there is a different
explanation for these two situations. This would also get rid of the ABS function if
your SQL doesn't like this in the HAVING clause. If you normally have significant
daily fluctuations in sales (for example, Monday and Tuesday are very slow
compared to Saturday), you could add a DAY_OF_WEEK to the accumulating
department table and constrain to the appropriate day. In this way, you don't mix the
normal daily fluctuations into your standard deviation test.
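
Both embellishments are small changes. In the sketch below, the statistics are keyed by weekday so that Saturdays are only compared with Saturdays, and the test reports the direction of the anomaly instead of using an absolute value. The function names and the "High"/"Low" labels are hypothetical.

```python
import datetime
from typing import Optional, Tuple


def stats_key(store: str, department: str,
              day: datetime.date) -> Tuple[str, str, int]:
    """Key for the accumulating statistics; weekday is 0 (Monday) to 6 (Sunday)."""
    return (store, department, day.weekday())


def direction(sales: float, mean: float, std_dev: float,
              sigmas: float = 3.0) -> Optional[str]:
    """Classify an anomaly as High or Low, or return None if within bounds."""
    if sales > mean + sigmas * std_dev:
        return "High"
    if sales < mean - sigmas * std_dev:
        return "Low"
    return None
```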
When you are done checking the input data with the preceding SELECT statement,
you can update the existing SUM_SALES and SUM_SQUARE_SALES just by adding
today's sales and today's square of the sales, respectively, to these numbers in the
accumulating department table. Then you are ready for tomorrow's data.
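
The daily update is a constant-time operation per department, as this minimal sketch shows. It also increments N, which the text leaves implicit but which the variance formula needs. The function name is hypothetical.

```python
from typing import Tuple


def add_day(n: int, sum_sales: float, sum_square_sales: float,
            today_sales: float) -> Tuple[int, float, float]:
    """Fold today's sales into the running totals for one department."""
    return (n + 1,
            sum_sales + today_sales,
            sum_square_sales + today_sales * today_sales)
```

Whether a value that was just flagged as abnormal should be folded into the totals is a design choice the paper does not address; including it widens tomorrow's tolerance band.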

Conforming Audit Dimensions across Multiple ETL Environments
In a distributed enterprise data warehouse environment, with many separate ETL
environments, there will be many versions of audit dimensions, with potentially
different quality metrics. This is quite understandable since the fact tables across
the enterprise will be very different and have legitimately different criteria for data
quality. However, like any dimension appearing across the enterprise, a significant
effort should be made by the joint ETL teams to define a core set of quality metrics
that have the same meaning for every instance of the audit dimension. This core set
of metrics then allows drilling across fact tables to build enterprise-wide reports
labeled by key quality metrics. From our example audit dimension shown in Figure
2, we would propose the overall quality, completeness, validity, and out of bounds
attributes as subject to the conformed standards. This means that every audit
dimension assigns values to these attributes from the same standard set and with
the same business rule meanings.
Six Sigma Data Quality
The data warehouse community can borrow some useful experience from the
manufacturing community by adopting parts of their quality culture. In the
manufacturing world, the Six Sigma level of quality is achieved when the number of
defects falls below 3.4 defects per million opportunities. The error event fact table is
the perfect foundation for making the same Six Sigma measurement for data
quality. The defects are recorded in the error event schema and the opportunities
are recorded in the organization's workflow monitor tool that records total number of
records processed in each job stream.
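
The measurement itself is simple arithmetic. In this minimal sketch, the defect count would come from the error event schema and the opportunity count from the workflow monitor; the function names are hypothetical.

```python
SIX_SIGMA_DPMO = 3.4   # defects per million opportunities


def defects_per_million(defects: int, opportunities: int) -> float:
    """Scale a defect count to defects per million opportunities."""
    if opportunities <= 0:
        raise ValueError("opportunities must be positive")
    return defects * 1_000_000 / opportunities


def meets_six_sigma(defects: int, opportunities: int) -> bool:
    """True when the defect rate falls below the Six Sigma threshold."""
    return defects_per_million(defects, opportunities) < SIX_SIGMA_DPMO
```

For example, 16 error events against 5,000,000 records processed is 3.2 defects per million, which is below the threshold.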
Building a Data Quality Architecture
The data quality architecture described in this article can be incrementally added to
an existing data warehouse or data integration environment with very little
disruption. Once the error event schema is established, the library of quality screens
can grow indefinitely from a modest start. The screens need only obey the two
simple requirements outlined earlier: logging each error into the error event schema,
and determining the system response to the error condition. Error screens can be
implemented in multiple technologies throughout the ETL pipeline, including
standalone batch jobs as well as data flow modules embedded in professional ETL
tools.
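
The two requirements on a screen can be expressed as a very small contract. The sketch below is one possible shape, not a prescribed design: the Response values, the numeric severity scale, and the rule that the most severe failing screen decides the response are all assumptions for illustration, and a plain list stands in for the error event fact table.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Response(Enum):
    PASS_AND_TAG = "pass_and_tag"   # let the record through, flagged
    SUSPEND = "suspend"             # set the record aside for later handling
    HALT = "halt"                   # stop the job stream


@dataclass
class Screen:
    name: str
    test: Callable[[dict], bool]    # returns True when the record passes
    severity: int
    response: Response


def run_screens(record: dict, screens: List[Screen],
                error_events: List[dict]) -> Optional[Response]:
    """Log every failure, then return the response of the most severe one."""
    worst: Optional[Screen] = None
    for screen in screens:
        if not screen.test(record):
            error_events.append({"screen": screen.name,
                                 "severity": screen.severity,
                                 "record": record})
            if worst is None or screen.severity > worst.severity:
                worst = screen
    return worst.response if worst else None


screens = [
    Screen("qty_not_null", lambda r: r.get("qty") is not None,
           1, Response.PASS_AND_TAG),
    Screen("qty_non_negative",
           lambda r: r.get("qty") is None or r["qty"] >= 0,
           3, Response.HALT),
]
```

New screens are added by appending to the list, which matches the idea of growing the library from a modest start.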
Even though this paper describes a comprehensive and complex architecture for
data quality, it is possible to start simply and grow the capabilities of the system as
the ETL team gains confidence. The types of errors thrown by the screens can be
very simple, perhaps using only two or three levels of severity. Screens can be
added incrementally, whenever the ETL team decides to expand the library of
screens.
Of course, the error event schema provides a quantitative basis for managing data
quality initiatives over time, since it is a time series by definition. The dimensionality
of the error event data allows studying the evolution of data quality by source,
software module, key performance indicator (KPI), and type of error.
Conclusion
The industry has talked about data quality endlessly, but there have been few
unifying architectural principles. This paper describes an easily implemented,
non-disruptive, scalable, and comprehensive foundation for capturing data quality
events, as well as measuring and ultimately controlling data quality in the data
warehouse.
