
What is DataFlux?

So what is DataFlux? Well, a leader in data quality, it's both a company and a product; better stated, DataFlux (the company) provides a suite of tools (often simply called DataFlux) that provide data management capabilities, with a focus on data quality. DataFlux's tools can do a lot of really neat things; I'd say it's a must-have for Sales & Marketing, and it'd benefit most enterprises out there in other ways. To see what all of this pomp is about, let's use an example. Think of these entries in your company's xyz system:

Name                    Address                        City, State, Zip              Phone
Mr. Victor Johnson      1600 Pennsylvania Avenue NW    Washington, DC 20500          202-4561414
Victor Jonson, JD       1600 Pennsylvania Avenue       Washington, DC                456-1414
VICTOR JOHNSON          255 DONNA WAY                  SAN LUIS OBISPO, CA 93405     (805) 5551212
Bill Shares             1050 Monterey St               SLO, CA 93408                 8055444800
Doctor William Shares   1052 Monterrey Ave             San Luis Obispo, California   n/a
william shares, sr      1001 cass street               omaha, nebraska, 68102

In this example, a human could probably figure out pretty easily that the first two Victors are one and the same, and that Bill in SLO and William in San Luis Obispo are also the same person. The other records might be a match, but most of us would agree that we can't be sure based on name alone. Furthermore, it is obvious that some data inconsistencies exist, such as name prefixes and suffixes, inconsistent casing, incomplete address data, etc. DataFlux can't (and shouldn't try to) fix all of these quirks, but it should at least be able to reconcile the differences, and, if we choose, we should be able to do some data cleanup automatically. So let's get started. I'll open up dfPower Studio.

I'll start Architect by clicking on the icon in the top left. In my case, this is where most design takes place. This interface is new in version 8 and helps provide quick access to the functions one would use most often; the change is actually helpful (as opposed to some GUI changes made by companies) because it combines a lot of the settings into a central place. On this note, I guess I should say that Architect is the single most useful product in the suite (in my opinion, anyway), and it's where I'll spend most of my time in this posting.

On the left panel you'll see a few categories. Let me explain what you'll find in each one (skip over this next section if you want):

Data Inputs – Here you'll find nodes allowing you to read from ODBC sources, text files, SAS data sets (DataFlux is a SAS company), and more. I'll cover one other data input later…

Data Outputs – Similar to inputs, you'll find various ways of storing the output of the job.

Utilities – Utilities contain what many would refer to as "transformations", which might be helpful to know if you've worked with Informatica or another ETL (Extract, Transform, Load) tool.

Profiling – Most nodes here help provide a synopsis of the data being processed. Another DataFlux tool is dedicated to profiling – in some ways these nodes are a subset of the other's functionality, but there's one primary difference.

Monitoring – Allows for action to take place on a data trigger, e.g. email John if sales fall under $10K. Here the output of profiling can be linked to other actions.

Quality – Here's where some of DataFlux's real magic takes place, so I'll go through the task of describing each node briefly: Gender Analysis (determine gender based on a name field), parsing (we'll see this), standardization (we'll see one application of this), identification analysis (e.g. is this a person's name or an organization name?), Right Fielding (move data from the "wrong" field to the "right"), Change Case (although generally not too complicated, this gets tricky with certain alphabets), Create Scheme (new in Version 8 – more of an advanced topic), and Dynamic Scheme Application (new in Version 8 – another advanced topic).

Integration – Another area where magic takes place. We'll see this in this post.

Enrichment – As the name suggests, these nodes help enrich data, i.e. they provide data that's missing in the input. This section includes: address verification (we'll see this), geocoding (obtaining demographic and other information based on an address), and some phone number functions (we'll see one example).

Enrichment (Distributed) – Provides the same functionality as I just described, but distributed across servers for performance/reliability gains.

Now that we've gone through a quick overview of Architect's features, let's use them. For my purposes today I'll read from a delimited text file I created with the data I described at the beginning of the article. I'll first drag my data source onto the page and double-click on it to configure its properties. I can use the "Suggest" button to populate the field names based on the header of the text file. What's nice here is I can have auto-preview on (which, by the way, drives me crazy), or I can turn it off and press F5 for a refresh, which shows the data only when asked. Either way, the data will appear in my preview window (instant gratification is one of the great things about Architect).
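Since the input is just a delimited text file, the sample rows can be mocked up with nothing but the standard library. To be clear, the pipe delimiter and the column names below are my own assumptions for illustration; they aren't DataFlux's file format.

```python
import csv
import io

# A mocked-up fragment of the delimited input file; the column names
# and pipe delimiter are assumptions, not what DataFlux requires.
raw = """name|address|city_state_zip|phone
Mr. Victor Johnson|1600 Pennsylvania Avenue NW|Washington, DC 20500|202-4561414
Victor Jonson, JD|1600 Pennsylvania Avenue|Washington, DC|456-1414
"""

# DictReader uses the header row for field names, much like the
# "Suggest" button populates them from the file header.
records = list(csv.DictReader(io.StringIO(raw), delimiter="|"))
for r in records:
    print(r["name"], "|", r["city_state_zip"])
```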

I'll start my data quality work today by verifying these addresses. I do this by dragging on the Address Verification (US/Canada) node. After attaching the node to Text File Input 1 and double-clicking on it, in the input section I map my fields to the ones expected by DataFlux, and in another window I specify which outputs I'm interested in. I've selected a few fields here, but there are many other options available.

Already you can see what a difference we've made. You'll notice here I've passed through only the enriched address fields in the output. I could have also kept the originals side by side, plus I could have added many more fields to the output, but these will suffice for now (it'd be tough to fit on the screen here). At our company we've configured DataFlux to comply with USPS Publication 28, which among other things indicates that addresses should always be uppercased; for this reason you see this here. Having said this, you have the option to propercase the result set if you'd like. I want to point out just two things here: 1. There is one "NOMATCH". This is likely to have happened because too many fields are wrong, and the USPS data verification system is designed not to guess too much… 2. The real address for the courthouse in San Luis Obispo is 1050 Monterey St; 1052 Monterey St is an address I made up, and consequently the Zip+4 could not be determined. So why did we get a US_Result_Code of "OK"? This is because the USPS system recognizes 1052 as an address within a correct range. If I had used the real address, the correct Zip+4 would have been calculated. Pretty neat, eh? I'd also like to point out that the county name was determined because I added this output when I configured the properties. Next, let's clean up the names. Moving on, I reconfigured the USPS properties to allow additional outputs (the original name & phone number). It'd be nice if we could split the names into a first & last name, so I dragged the Parsing node onto the screen and configured its properties to identify what language & country the text was based on (DataFlux supports several locales and in version 8 supports Unicode). After that, I can preview as before. Note how well DataFlux picked out the first, middle and last names, not to mention the prefixes and suffixes.
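To give a feel for the kind of normalization an address verifier performs before matching against USPS data, here's a minimal sketch: uppercase the line, drop punctuation, and abbreviate street suffixes. The tiny suffix table is an illustrative subset of USPS Publication 28, not DataFlux's actual logic, and real verification also matches against the USPS address database, which this sketch does not.

```python
import re

# Illustrative subset of USPS Publication 28 suffix abbreviations;
# a real verifier uses the full table plus address-range matching.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "BOULEVARD": "BLVD"}

def standardize_address(line: str) -> str:
    # Uppercase per Publication 28 and strip punctuation
    line = re.sub(r"[.,]", "", line).upper()
    words = [SUFFIXES.get(w, w) for w in line.split()]
    return " ".join(words)

print(standardize_address("1050 Monterey Street"))  # -> 1050 MONTEREY ST
```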

For simplicity, I'll remove the Parse step I just added and use a Standardize node instead. Here in the properties I'll select a "Definition" for the name and phone inputs. There are many options to choose from, including things like: Address, Business Title, City, Country, Date, Name, Organization, Phone, Postal Code, Zip, and several others. Let's see what this does…
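As a rough illustration of what a "Phone" definition might do, the sketch below reduces any input to its digits and emits one canonical format. This is my own toy version of the idea, not DataFlux's actual standardization rules.

```python
import re

def standardize_phone(raw: str, default_area: str = "") -> str:
    """Toy phone standardizer: digits only, then one canonical layout."""
    digits = re.sub(r"\D", "", raw)          # strip everything but digits
    if len(digits) == 7 and default_area:    # 7-digit local number
        digits = default_area + digits
    if len(digits) != 10:
        return raw                           # leave unparseable values alone
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

print(standardize_phone("8055444800"))   # -> (805) 544-4800
print(standardize_phone("202-4561414"))  # -> (202) 456-1414
```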

You might be wondering how DataFlux does this. DataFlux utilizes several algorithms and known last names, first names, name lookups, etc., to analyze the structure and provide a best "guess." By that I mean that the placement of a comma in a name greatly enhances the parser's ability to determine the location of the last name. For example, if the input name were "Johnson, Victor" would it have correctly standardized the name to "Victor Johnson"? The answer here is yes. Of course this means that with very unusual names the parsing algorithm could make a mistake, especially without the help of a comma; nonetheless, I think that most users would be surprised how good this "guessing" can be. All in all, it's pretty neat stuff, and of course the good part is that it is customizable. This helps if someday you want to write a standardization rule for your company's specific purpose. If you're interested in learning more about this, let me know and perhaps I'll write another blog to go into the details. Let's move on. I'm next going to make "Match Codes." Often times (perhaps most of the time), nothing can be done about data in a system once it is entered. For example, if a name is Rob, we can't assume the real name is Robert, yet we may have a burning desire to do something like that to figure out that one record is a potential duplicate of another… this is where match codes come in. Match codes allow duplicate identification (and resolution). Here's the section of the Match Codes Properties window where we assign the incoming fields to the Definition. This step is important because intelligent parsing occurs based on the data type. Let's preview a match code to see what this does.
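To make the match-code idea concrete, here's a deliberately simple sketch: uppercase the name, drop punctuation and noise words (prefixes/suffixes), expand nicknames from a lookup table, reduce each token to a consonant skeleton, and sort the tokens so word order doesn't matter. Real DataFlux match codes are driven by the Quality Knowledge Base and are far more sophisticated; the tables and the skeleton trick below are my own illustrations.

```python
import re

# Toy lookup tables; a real QKB has vastly larger nickname and
# prefix/suffix vocabularies.
NICKNAMES = {"BILL": "WILLIAM", "BOB": "ROBERT", "VIC": "VICTOR"}
NOISE = {"MR", "MRS", "DR", "DOCTOR", "JR", "SR", "JD"}

def match_code(name: str) -> str:
    tokens = re.sub(r"[^A-Z ]", "", name.upper().replace(",", " ")).split()
    tokens = [NICKNAMES.get(t, t) for t in tokens if t not in NOISE]
    # Keep the first letter plus later consonants of each token,
    # then sort so "Shares, William" and "William Shares" agree.
    skel = ["".join([t[0]] + [c for c in t[1:] if c not in "AEIOU"])
            for t in tokens]
    return "~".join(sorted(skel))

# These three spellings all produce the same code:
print(match_code("Bill Shares"))
print(match_code("Doctor William Shares"))
print(match_code("william shares, sr"))
```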

I couldn't get the whole output to fit on the screen here, but I think the match codes seen in the name and the address will get my point across. Here you can see that match codes ignore minor spelling differences, take into account abbreviations, nicknames, etc. Why is this so significant? We now have an easy way to find duplicates! Match codes could be stored in a database and allow quick checks for duplicates! Let's move on to see more… I'm now going to use Clustering to see how duplicate identification can be done. First, I'll set the clustering rules in the Properties window (note that I use the match code instead of the actual value for the rule). And let's preview…
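The clustering step itself is conceptually simple once match codes exist: records that share the same key get the same cluster number. Here's a minimal sketch of that numbering; the pre-computed match-code values in the sample rows are made up for illustration and stand in for what a real match-code node would emit.

```python
from itertools import count

def assign_clusters(records, key):
    """Give records sharing the same key value the same cluster id."""
    ids, next_id = {}, count(1)
    out = []
    for rec in records:
        cluster = ids.setdefault(key(rec), next(next_id))
        out.append((cluster, rec))
    return out

# Hypothetical rows with pre-computed name match codes.
rows = [
    {"name_mc": "SHRS~WLLM",  "name": "Bill Shares"},
    {"name_mc": "JHNSN~VCTR", "name": "Mr. Victor Johnson"},
    {"name_mc": "SHRS~WLLM",  "name": "Doctor William Shares"},
]

for cluster, rec in assign_clusters(rows, key=lambda r: r["name_mc"]):
    print(cluster, rec["name"])  # Bill and William share a cluster id
```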

Note that the cluster numbers are the same for records that match, based on the clustering conditions I set a moment ago. Pay special attention to the fact that our Bill & William Shares didn't match. Why? Well, because of the clustering conditions I set. We could modify our Quality Knowledge Base (QKB) to indicate that SLO = San Luis Obispo, or I could remove the City as a clustering condition together with lowering the sensitivity on the address match code (sensitivities range from 50 to 95), and the two would match. Let's do this to be sure: There are a lot of really neat things that DataFlux can do. I'll try to post a thing or two out here now and again if I see anyone interested…