3rd year, Engineering Cycle
Business Intelligence / Data Warehouse
TP4: PDI
Reading and Writing Files
1. Reading data from files
Data is everywhere; in particular, you will find data in files. Despite being the most
primitive format used to store data, files are still broadly used and they exist in several
formats, such as comma-separated values and spreadsheets. Pentaho Data Integration
(PDI) has the ability to read data from all kinds of files.
Before reading a file, it's important that you observe its format and content: Does the file
have a header? Does it have a footer? What are the data types of the fields? Knowing
these properties about your file is mandatory for reading it properly.
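Before configuring the input step, a quick inspection outside PDI can answer these questions. The following is a minimal sketch using Python's standard csv module; the sample text and its column names are hypothetical and only stand in for the first lines of a real file. The header detection is a heuristic and can guess wrong, so always confirm by looking at the file.

```python
import csv
import io

# Hypothetical sample standing in for the first lines of a delimited file.
sample = (
    "order_id,product,quantity,price\n"
    "1001,Widget,3,9.99\n"
    "1002,Gadget,1,24.50\n"
)

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)          # guess the separator character
has_header = sniffer.has_header(sample)  # heuristic guess, may be wrong

rows = list(csv.reader(io.StringIO(sample), dialect))
print("separator:", repr(dialect.delimiter))
print("header detected:", has_header)
print("first line as fields:", rows[0])
```

For a real file, read the first few kilobytes into `sample` instead of using an inline string.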
ACTION:
1. Start Spoon and create a new Transformation.
2. Expand the Input branch of the Steps tree, and drag and drop to the canvas a Text
file input step.
3. Double-click the Text file input icon and give the step a name.
4. Click on the Browse... button and search for the sales_data.csv file. Select the file
and click on the Add button.
5. Click on the Content tab and replace the semicolon (;) with a comma (,) in the
separator character option. If your file has a Unix format, you have to change the
format in the Format option.
6. Click on the Fields tab. Then click on the Get Fields button. The Get Fields
functionality tries to guess the metadata but might not always get it right, in which
case you can manually overwrite it.
7. Click on the Preview rows button and then click on the Ok button. The previewed
data should look like figure 1.
Y. EL YOUNOUSSI, 2021-2022
Figure 1: Previewing the sales data input file
There is still one more thing that you should do; provide the proper metadata for the
fields.
8. Change the Fields grid as shown in the figure 2:
Figure 2: Configuring the fields metadata
9. Run a new preview.
You should see the same data, but with the proper format.
10. Close the window.
This is all you have to do for reading a simple text file. Once you read it, the data is ready
for further processing.
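To see why the field metadata matters, the sketch below mimics the two stages outside PDI: first the text is split on the separator, then each field is converted to its declared type. The column names, types and sample rows are hypothetical; the actual layout of sales_data.csv and the Fields grid of figure 2 are not reproduced here.

```python
import csv
import io
from datetime import datetime

# Hypothetical rows; the real sales_data.csv layout is not reproduced here.
raw = (
    "order_id,quantity,price,order_date\n"
    "10107,30,95.70,2003-02-24\n"
    "10121,34,81.35,2003-05-07\n"
)

# Field metadata, playing the role of the Fields grid: name -> converter.
FIELDS = {
    "order_id": int,
    "quantity": int,
    "price": float,
    "order_date": lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
}


def read_typed(text, separator=","):
    """Split each line on the separator, then apply the declared types."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    return [
        {name: convert(row[name]) for name, convert in FIELDS.items()}
        for row in reader
    ]


rows = read_typed(raw)
print(rows[0])
```

Without the converters every value would stay a string, which is the situation after the first preview and before the metadata is corrected.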
2. Reading several files at the same time
Suppose you have several files, all with the same structure. Reading those files at the
same time is not much different from what you did.
We will take as a source several files with the same structure as sales_data.csv, but
separated by region: sales_data_APAC.csv, sales_data_EMEA.csv and
sales_data_Japan.csv.
ACTION:
You have two options here. The first way for reading several files at the same time would
be to provide the detailed list of filenames:
1. Open the Transformation that reads the sales_data.csv file and save it under
another name.
2. Fill the grid with every filename, as shown next (figure 3).
Figure 3: Reading several files at the same time
3. Select the Additional output fields tab. In the Short filename field textbox, type
file_name.
4. Run a preview.
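The effect of the Short filename field can be pictured as follows: every row keeps its own columns and gains one extra field telling which file it came from. This is only a sketch in Python, not what PDI runs internally; the column names and the temporary files are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def read_files(paths):
    """Read same-structured CSV files and tag each row with its short filename."""
    rows = []
    for path in paths:
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                row["file_name"] = Path(path).name  # short name, no directory
                rows.append(row)
    return rows


# Demo with throwaway files; the column names are hypothetical.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for region in ("APAC", "EMEA", "Japan"):
        path = Path(folder) / f"sales_data_{region}.csv"
        path.write_text("order_id,quantity\n1,5\n")
        paths.append(path)
    rows = read_files(paths)

for row in rows:
    print(row)
```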
The second way is using Regular Expressions. This option is useful when the names of the
files in your list follow a pattern, or when you don’t know the exact names of the files
beforehand:
1. Double-click on the Input step.
2. Delete the lines with the names of the files.
3. Under the File or Directory option, type or browse for the full path of the input
folder. Then under the RegExp option, type sales_data_.+\.csv. Click on the Show
filename(s)... button. You will see the list of files that match the expression.
4. Click on Preview rows.
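It can help to test the regular expression against a few names before typing it into the step. The sketch below does this with Python's re module, assuming that the expression has to match the whole file name; the last two candidate names are hypothetical and are there only to show what the pattern rejects.

```python
import re

# Same expression as typed in the RegExp option.
pattern = re.compile(r"sales_data_.+\.csv")

candidates = [
    "sales_data_APAC.csv",
    "sales_data_EMEA.csv",
    "sales_data_Japan.csv",
    "sales_data.csv",           # no "_<region>" part: rejected
    "sales_data_APAC.csv.bak",  # extra suffix: rejected by a full match
]

# fullmatch requires the entire name to fit the pattern.
matching = [name for name in candidates if pattern.fullmatch(name)]
print(matching)
```

Note the backslash before the dot: without it, the dot would stand for any character instead of a literal ".".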
Suppose the name of the file to read is specified in an XML file. Let's learn how to handle
this situation with PDI.
ACTION:
First, you have to create a file named configuration.xml and inside it, type the following:
<settings>
  <my_file>sales_data_Japan</my_file>
</settings>
The idea is to dynamically build a string with the full filename and then pass this
information to the Text file input step, as follows:
1. Create a new Transformation.
2. From the Input category of steps, drag to the canvas the Get data from XML step.
3. Double-click the step for editing it. In the File tab, browse for the
configuration.xml file and add it to the grid.
4. Select the Content tab. In the Loop XPath textbox, type /settings/my_file.
5. Finally, select the Fields tab. Fill the first row of the grid as follows: under Name,
type filename; under XPath, type a dot (.); and as Type, select or type String.
6. Click on Ok and preview the data.
7. After this step, add a UDJE (User Defined Java Expression) step. Double-click it and
configure it for creating the full path for the file, as in the figure 4:
Figure 4: Configuring the UDJE step
8. Close the UDJE window.
9. Add a Text file input step.
10. Double-click on it and configure it as shown in the following figure:
[Screenshot: the "Accept filenames from previous steps" section, with the options "Accept filenames from previous step", "Pass through fields from previous step", "Step to read filenames from" and "Field in the input to use as filename".]
Figure 5: Accepting a filename from incoming steps
11. Fill in the Content and Fields tabs just like you did before. It's worth saying that
the Get Fields button will not populate the grid as expected because the filename
is not explicit in the configuration window.
12. Save and Run.
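The logic of this transformation can be summarized in a few lines: read the name from the XML, build the full path, hand it to the reader. The sketch below reproduces the first two stages in Python. It assumes that configuration.xml holds a single my_file element under settings, matching the Loop XPath used above; the folder and the .csv extension are hypothetical, since the actual expression of figure 4 is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Content assumed for configuration.xml (one my_file element under settings).
config_xml = "<settings><my_file>sales_data_Japan</my_file></settings>"

root = ET.fromstring(config_xml)
# ElementTree paths are relative to the root element, so "my_file" here
# corresponds to the absolute XPath /settings/my_file.
names = [node.text for node in root.findall("my_file")]

# Hypothetical folder and extension; in the transformation the UDJE step
# builds this string.
INPUT_FOLDER = "/path/to/input"
full_paths = [f"{INPUT_FOLDER}/{name}.csv" for name in names]
print(full_paths)
```

Each resulting string plays the role of the field that the Text file input step receives as its filename.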
4. Creating a simple file
As well as extracting data from several types of files, PDI is capable of sending data to
different types of output files. All you have to do is redirect the flow of data towards the
proper output step.
In this section, we will learn how to generate a plain text file.
ACTION:
1. Open the last Transformation of TP2 tutorials and save it under a different name.
2. Expand the Output branch of the Steps tree, look for the Text file output step and
drag it to the work area.
3. Create a hop from the UDJE step to this new step.
4. Double-click on the Text file output step icon and give it a name. As Filename, type
the full name for the file to generate.
5. In the Content tab, leave the default values.
6. Select the Fields tab and click on the Get Fields button.
7. Click on Ok, save the transformation and run it.
8. Browse for the new file and open it. It will look as follows:
project_name;start_date;end_date;diff_dates;performance;duration;message
Project A;2016-01-10;2016-01-25;15;Excellent
Project B;2016-04-03;2016-07-21;103;Good
Project D;2015-03-03;2015-12-20;108;Good
Project E;2016-05-11;2016-05-31;20;Excellent
Project F;2011-12-01;2013-11-30;730;Poor
Figure 6: Plain text file generated by the Text file output step
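The same kind of output, a header line followed by semicolon-separated rows with Windows line endings, can be produced outside PDI with a few lines of Python. This is only a sketch: it writes to an in-memory buffer rather than a file, uses a subset of the field names, and takes a single row from the listing above.

```python
import csv
import io

# Subset of the fields, with one row taken from the listing above.
fields = ["project_name", "start_date", "end_date", "diff_dates", "performance"]
rows = [
    {
        "project_name": "Project A",
        "start_date": "2016-01-10",
        "end_date": "2016-01-25",
        "diff_dates": 15,
        "performance": "Excellent",
    },
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=fields,
    delimiter=";",          # same separator as in the generated file
    lineterminator="\r\n",  # Windows (CRLF) line endings
)
writer.writeheader()
writer.writerows(rows)

output = buffer.getvalue()
print(output)
```

To write an actual file, replace the buffer with `open(path, "w", newline="")`.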
Now we want to send the Write to log step fields to a Microsoft Excel file. So, try to find
out the appropriate step to do this work. Then, add it to the canvas area, link it and
configure it.