DataFlux Tips and Tricks
Chris Martin, Client Services Manager
Gary Townsend, Solutions Consultant
Consuming a directory of input files with a single text file input node
How can I consume 500 text files sitting in a single folder, all having the same field
structure? Using 500 separate input nodes would be tedious and time-consuming.
Creating one job with a macro variable would be more efficient, but would still require
that the job be run 500 times.
Using the DataFlux delimited and fixed-width input nodes, we can point a single input
node at an entire directory of files instead of a single file. DataFlux then appends the
contents of the files together (a data union), leaving the end user with one usable data
set consisting of all records across all input files. See the example below for a
description of how to use this feature.
Output of directory read:
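The directory read behaves like a data union of every file in the folder. As a rough sketch of the same idea, shown in Python rather than DataFlux (the folder path and the .txt filter are assumptions for illustration; DataFlux does this natively when the input node points at a directory):

```python
# Conceptual sketch of a directory read: union the rows of every file in a
# folder that shares the same field structure (first row = field names).
import csv
import os

def read_directory(folder):
    rows = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            with open(os.path.join(folder, name), newline="") as fh:
                rows.extend(csv.DictReader(fh))  # append = data union
    return rows
```

The result is one record set spanning all input files, just as the single input node produces.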
Using macro variables to create dynamic file names
Ever wanted to dynamically create file names or pass table values between
pages in an Architect job? The example below shows how to create macro
variables and make use of their values between pages of an Architect job.
Defining Data: PAGE 1 of JOB
EXPRESSION NODE:
string DateToday
DateToday = today()
DateToday = FormatDate(DateToday, "MMDDYYYY")
Execution_Number = 2239
setvar("EX_NUM", Execution_Number)
setvar("TodaysDT", DateToday)
pushrow()
Retrieving Data: PAGE 2 of JOB
EXPRESSION NODE:
Pre-Expression
string Execution_Number
string TodayDate
Execution_Number = getvar("EX_NUM")
TodayDate = getvar("TodaysDT")
pushrow()
Expression
seteof()
Text File Output:
FileName Property Value: C:\%%Execution_Number%%_%%TodayDate%%.txt
Expected Results:
1 row of data will be written to a file called 2239_TODAYS_DATE.txt (where TODAYS_DATE is
today's date in MMDDYYYY format). Inside it you will see the value of Execution_Number as well
as the value stored in the TodayDate macro variable. The possibilities for using macros in this
manner are endless; experiment with other uses.
Passing Macro Variable Values as a Command Line Argument
We all know DataFlux can utilize macro variables within Architect & Profile jobs, but
what if you do not want to declare them as static values within the architect.cfg file?
By passing them as part of the command line when invoking DataFlux, we can declare
macro variable values dynamically at the time of execution. Below are some examples
of the syntax required on both UNIX and Windows platforms:
UNIX/LINUX DIS PLATFORMS:
INPUT_FILE=/dataflux/input/audit1.txt OUTPUT_FILE=/dataflux/output/audit1_out.txt \
/dfpower/bin/dfexec -log /dataflux/DISjoblob/joblogname.log ../var/dis_arch_job/jobname.dmc
WINDOWS DIS PLATFORM:
set INPUT_FILE=C:\dataflux\input\audit1.txt & set OUTPUT_FILE=C:\dataflux\output\audit1_out.txt &
"C:\Program Files\DataFlux\DIS\8.2\bin\dfexec.bat" -log c:\dataflux\DISjoblob\joblogname.log
c:\dataflux\jobs\jobname.dmc
Architect Node - Advanced Properties
Using the advanced properties of nodes within dfPower Architect can drastically reduce
the time and effort associated with managing a large number of fields.
The following examples show two very practical uses of the advanced properties:
Copy & paste fields from external data provider into job specific data node
Standardizing all fields to the same field name
Alternate Date/Time Extraction Methods
Counting Records in Text File
We all know how simple it is to get a record count from a database table, but
what if you want to determine how many records exist in a text file so you can
increment a counter accordingly?
• Option 1: Open the file in Notepad and count the records one by one
• Option 2: Take a wild guess and hope you are close
• Option 3: Let DataFlux do the counting!
We will stick to Option 3: let DataFlux count your records.
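In a job, the counting amounts to incrementing a counter for every row read from the file. A minimal sketch of that logic, shown in Python rather than a DataFlux expression node (the file name and the header-row handling are assumptions for illustration):

```python
# Conceptual sketch of counting records in a delimited text file,
# optionally skipping a header row.

def count_records(path, has_header=True):
    with open(path, "r", encoding="utf-8") as fh:
        total = sum(1 for line in fh if line.strip())  # skip blank lines
    return total - 1 if has_header and total > 0 else total
```

The returned count can then feed whatever counter the job needs to increment.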
Remove control characters within your data:
Before: After:
ASCII Control Characters (reference table: Char, Oct, Dec, Hex, Control-Key, Control Action)
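The before/after screenshots show these characters being stripped from the data. A minimal sketch of that cleanup, in Python rather than an EEL expression (the character range, decimal 0-31 plus 127, matches the standard ASCII control set):

```python
# Conceptual sketch of removing ASCII control characters from a field value.
import re

CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")

def strip_control_chars(value: str) -> str:
    return CONTROL_CHARS.sub("", value)
```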
Why use Edit_Distance()?
• Introduce an additional layer of "fuzzy" matching
• Determine the difference between two strings / words
• Set "likeness" thresholds for matching
Edit Distance for matching
• How to use Edit_Distance()
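Edit distance is the minimum number of single-character inserts, deletes, and substitutions needed to turn one string into the other. A sketch of the standard Levenshtein computation, in Python as a conceptual stand-in for the Edit_Distance() EEL function:

```python
# Conceptual sketch of Levenshtein edit distance between two strings.

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[len(b)]
```

A "likeness" threshold is then a simple comparison, e.g. treat two values as candidates when `edit_distance(a, b) <= 2`.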
Need to match on a portion of a field?
Ever need to match on subcomponents of a field? Say you want to find all
customers at a specific location code, but positions 8 and 13 of the code mean
nothing, while the rest of the value needs to match exactly. Match codes won't
help, and edit distance may give you the proper results only in some cases.
Instead, let's use the expression node to create a new field built with the
left/right/mid functions.
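A sketch of what the expression node builds, shown in Python (the 14-character code length and the field name Location_Substring are assumptions; the idea is simply to drop positions 8 and 13 and keep the rest):

```python
# Conceptual sketch of building a match key that skips the meaningless
# positions 8 and 13 (1-based) of a location code, keeping the rest exact.

def location_substring(code: str) -> str:
    # left(code, 7) + mid(code, 9, 4) + everything after position 13
    return code[:7] + code[8:12] + code[13:]
```

Clustering on this derived field then matches records that differ only in the two ignored positions.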
Expression node
Clustering node
Without Location_Substring vs. with Location_Substring
Data - Examples
Original Data
Cluster Results (Location Code or Address/City/State/Postal)
Cluster Results (Location Code or Address/City/State/Postal or Location_Substring)
Sort_Words for Matching
• sort_words Function:
- Through its EEL, DataFlux exposes a function called sort_words that sorts
the data within a field, ascending or descending.
- The function can also eliminate a word if it is duplicated in the field.
- This function becomes valuable when a business requirement calls for
matching on a free-form field (for example, material or parts descriptions).
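A conceptual Python equivalent of what sort_words produces (the parameter names for descending order and duplicate elimination are assumptions here; consult the EEL reference for the real signature):

```python
# Conceptual sketch of sort_words: split a free-form field into words,
# optionally drop duplicates, sort, and rejoin, so re-ordered descriptions
# yield identical match keys.

def sort_words(text: str, descending: bool = False, dedupe: bool = False) -> str:
    words = text.split()
    if dedupe:
        words = list(dict.fromkeys(words))  # keep first occurrence of each word
    return " ".join(sorted(words, reverse=descending))
```

For example, "HEX BOLT STEEL" and "BOLT STEEL HEX" both normalize to the same value, so they cluster together.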
Sort_Words (Expression Node)
Move_File after job completion
• move_file Function:
- Through its EEL, DataFlux exposes a function called move_file, which moves
a file from one directory to another.
- This functionality is important when input files have been processed and
should be moved to a secondary location so that they are not processed again.
- This is viable as the last page of a job that runs continuously, 'listening'
for a file to arrive at an input location.
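A conceptual Python equivalent of the move_file pattern, using shutil.move (the "processed" directory name is an assumption for illustration):

```python
# Conceptual sketch of the move_file pattern: once a job has consumed an
# input file, move it to a processed directory so a listening job will not
# pick it up again.
import shutil
from pathlib import Path

def archive_input(src: str, processed_dir: str) -> str:
    dest = Path(processed_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    return str(shutil.move(src, dest / Path(src).name))
```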
move_file function
Before After
Cluster Flagging
At times it may be necessary to identify whether a record was part of a multi-row
cluster or was just a single, non-matched record.
Data after this node:
We sort on the cluster id and the sequence field, with the sequence field in
descending order: knowing whether a cluster has a sequence higher than 1 is
what lets us identify it as multi-row or single-row.
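A sketch of the flagging logic under that sort order, in Python with hypothetical field names (rows arrive sorted by cluster id, sequence descending, so the first row seen for each cluster carries its highest sequence):

```python
# Conceptual sketch of cluster flagging: the first row seen for a cluster has
# its maximum sequence; a sequence above 1 marks a multi-row cluster.

def flag_clusters(rows):
    """rows: iterable of (cluster_id, sequence), sorted by cluster_id with
    sequence descending within each cluster."""
    flags = {}
    for cluster_id, sequence in rows:
        if cluster_id not in flags:  # first (highest-sequence) row of cluster
            flags[cluster_id] = "MULTI" if sequence > 1 else "SINGLE"
    return flags
```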
Questions
Materials for any of the presented topics and/or workflows can be provided.
Please see the instructors after the session to obtain this information.