3rd year, Engineering Cycle
Business Intelligence / Data Warehouse
Y. EL YOUNOUSSI, 2021-2022

TP4: PDI Reading and Writing Files

1. Reading data from files

Data is everywhere; in particular, you will find data in files. Despite being the most primitive format used to store data, files are still widely used, and they exist in several formats, such as comma-separated values and spreadsheets. Pentaho Data Integration (PDI) can read data from all kinds of files.

Before reading a file, it's important to examine its format and content: Does the file have a header? Does it have a footer? What are the data types of the fields? You need to know these properties to read the file properly.

ACTION:

1. Start Spoon and create a new Transformation.
2. Expand the Input branch of the Steps tree, and drag and drop a Text file input step onto the canvas.
3. Double-click the Text file input icon and give the step a name.
4. Click on the Browse... button and search for the sales_data.csv file. Select the file and click on the Add button.
5. Click on the Content tab and replace the semicolon (;) with a comma (,) in the Separator option. If your file has a Unix format, change the format in the Format option.
6. Click on the Fields tab, then click on the Get Fields button. The Get Fields functionality tries to guess the metadata but might not always get it right, in which case you can manually overwrite it.
7. Click on the Preview rows button and then click on the Ok button. The previewed data should look like figure 1.

Figure 1: Previewing the sales data input file

There is still one more thing that you should do: provide the proper metadata for the fields.

8. Change the Fields grid as shown in figure 2:

Figure 2: Configuring the fields metadata

9. Run a new preview.
You should see the same data, but with the proper format.

10. Close the window.

This is all you have to do to read a simple text file. Once you have read it, the data is ready for further processing.

2. Reading several files at the same time

Suppose you have several files, all with the same structure. Reading those files at the same time is not much different from what you just did. We will take as a source several files with the same structure as sales_data.csv, but separated by region: sales_data_APAC.csv, sales_data_EMEA.csv and sales_data_Japan.csv.

ACTION:

You have two options here. The first way to read several files at the same time is to provide the detailed list of filenames:

1. Open the Transformation that reads the sales_data.csv file and save it under another name.
2. Fill the grid with every filename, as shown in figure 3.

Figure 3: Reading several files at the same time

3. Select the Additional output fields tab. In the Short filename field textbox, type file_name.
4. Run a preview.

The second way is to use regular expressions. This option is useful when the names of the files in your list follow a pattern, or when you don't know the exact names of the files beforehand:

1. Double-click on the Input step.
2. Delete the lines with the names of the files.
3. Under the File or directory option, type or browse for the full path of the input folder. Then, under the RegExp option, type sales_data_.+\.csv.
4. Click on the Show filename(s)... button. You will see the list of files that match the expression.
5. Click on Preview rows.

3. Reading a file whose name is specified in an XML file

Suppose now that the name of the file to read is specified in an XML file.
Let's learn how to handle this situation with PDI.

ACTION:

First, you have to create a file named configuration.xml and, inside it, type the following:

    <settings>
      <my_file>sales_data_Japan</my_file>
    </settings>

The idea is to dynamically build a string with the full filename and then pass this information to the Text file input step, as follows:

1. Create a new Transformation.
2. From the Input category of steps, drag the Get data from XML step to the canvas.
3. Double-click the step to edit it. In the File tab, browse for the configuration.xml file and add it to the grid.
4. Select the Content tab. In the Loop XPath textbox, type /settings/my_file.
5. Finally, select the Fields tab. Fill the first row of the grid as follows: under Name, type filename; under XPath, type a dot (.); and as Type, select or type String.
6. Click on Ok and preview the data.
7. After this step, add a UDJE step. Double-click it and configure it to create the full path for the file, as in figure 4:

Figure 4: Configuring the UDJE step

8. Close the UDJE window.
9. Add a Text file input step.
10. Double-click on it and fill in the Accept filenames from previous steps section (Accept filenames from previous step, Pass through fields from previous step, Step to read filenames from, Field in the input to use as filename) as shown in figure 5:

Figure 5: Accepting a filename from incoming steps

11. Fill in the Content and Fields tabs just like you did before. Note that the Get Fields button will not populate the grid as expected, because the filename is not explicit in the configuration window.
12. Save and run the Transformation.

4. Creating a simple file

As well as extracting data from several types of files, PDI is capable of sending data to different types of output files.
All you have to do is redirect the flow of data towards the proper output step. In this section, we will learn how to generate a plain text file.

ACTION:

1. Open the last Transformation of the TP2 tutorials and save it under a different name.
2. Expand the Output branch of the Steps tree, look for the Text file output step, and drag it to the work area.
3. Create a hop from the UDJE step to this new step.
4. Double-click on the Text file output step icon and give it a name. As Filename, type the full name for the file to generate.
5. In the Content tab, leave the default values.
6. Select the Fields tab and click on the Get Fields button.
7. Click on Ok, save the Transformation, and run it.
8. Browse for the new file and open it. It will look as follows:

    project_name;start_date;end_date;diff_dates;performance;duration;message
    Project A;2016-01-10;2016-01-25;15;Excellent
    Project B;2016-04-03;2016-07-21;103;Good
    Project D;2015-03-03;2015-12-20;108;Good
    Project E;2016-05-11;2016-05-31;20;Excellent
    Project F;2011-12-01;2013-11-30;730;Poor

Figure 6: Plain text file generated by the Text file output step

Now we want to send the fields of the Write to log step to a Microsoft Excel file. Try to find the appropriate step for this job. Then add it to the canvas, link it, and configure it.
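
To make the layout of the generated text file more concrete, here is a minimal Python sketch that works outside PDI. It is not how the Text file output step is implemented; it only mimics the layout visible in the sample file of this section: a header line, one line per row, a semicolon as separator, and Windows (CRLF) line endings. The function name to_delimited_text and the two sample rows are hypothetical and chosen for illustration only.

```python
import csv
import io

# Illustrative data only: field names and rows are assumptions
# modeled on the sample file of this section, not PDI output.
FIELDS = ["project_name", "start_date", "end_date", "diff_dates", "performance"]
ROWS = [
    ["Project A", "2016-01-10", "2016-01-25", 15, "Excellent"],
    ["Project F", "2011-12-01", "2013-11-30", 730, "Poor"],
]


def to_delimited_text(fields, rows, separator=";"):
    """Return a header line plus one line per row, joined by `separator`.

    Lines end with CRLF, matching the Windows format shown in the figure.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=separator, lineterminator="\r\n")
    writer.writerow(fields)   # header line
    writer.writerows(rows)    # one line per data row
    return buffer.getvalue()


if __name__ == "__main__":
    # Print the text exactly as it would be stored in the file.
    print(to_delimited_text(FIELDS, ROWS), end="")
```

Changing the separator argument to a comma gives the same layout as the sales_data.csv input file, which is why the Separator option had to be changed when that file was read.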
