Student Notebook
IBM InfoSphere DataStage v11.5
Course code WB815 / 7B815 ERC 1.0
IBM Training
© Copyright IBM Corp. 2005, 2016. Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

Contents

Preface
  Contents
  Course overview
  Additional training resources
  IBM product help

Unit 1 Introduction to the parallel framework architecture
  Unit objectives
  Why study the parallel architecture?
  What we need to master
  DataStage parallel job documentation
  Key parallel concepts
  Scalable hardware environments
  Drawbacks of traditional batch processing
  Pipeline parallelism
  Partition parallelism
  Partitioning illustration
  DataStage combines partitioning and pipelining
  Job design versus execution
  Defining parallelism
  Configuration file
  Example configuration file
  Generating mock data
  Job design for generating mock data
  Specifying the generating algorithm
  Inside the Lookup stage
  Configuration file displayed in job log
  Checkpoint
  Checkpoint solutions
  Demonstration 1: Introduction to the parallel framework architecture
  Unit summary

Unit 2 Compiling and executing jobs
  Unit objectives
  Parallel job compilation
  More about compiling a Transformer
  Generated OSH
  Stage to OSH operator mappings
  Generated OSH primer
  Virtual data sets in the OSH
  DataStage GUI versus OSH terminology
  Configuration file
  Processing nodes (partitions)
  Configuration file format
  Primary elements of a configuration
  Sample configuration file
  Another configuration file example
  Constraining operators to specific nodes
  Configuration editor
  Parallel job startup
  Parallel job run time
  Viewing the job Score
  Example job Score
  Job execution: The orchestra metaphor
  Runtime control and data networks
  Parallel data flow
  Monitoring job startup and execution in the log
  Counting the total number of processes
  Peeking at the data stream
  Peeking at the data stream design
  Using Transformer stage variables
  Checkpoint
  Checkpoint solutions

Unit 3 Partitioning and collecting data
  Interpreting the Score partitioning
  Score partitioning example
  Partition numbers
  Partitioning methods
  Selecting a partitioning method
  Same partitioning algorithm
  Caution regarding Same partitioning
  Round Robin and Random
  Parallel runtime example
  Entire partitioning
  Hash partitioning
  Unequal distribution example
  Modulus partitioning
  Range partitioning
  Using Range partitioning
  Example partitioning icons
  Auto partitioning
  Preserve Partitioning flag
  Partitioning strategy
  Collecting data
  Collectors
  Specifying the collector method
  Collector methods
  Sort Merge example
  Non-deterministic execution
  Choosing a collector method
  Collector method versus Funnel stage
  Parallel number sequences
  Row Generator sequences of numbers
  Generated numbers
  Transformer example using @INROWNUM
  Transformer example using parallel variables
  Header and detail processing
  Job design
  Inside the Transformer
  Examining the Score
  Difficulties with the design
  Examining the Score
  Generating a header detail data file
  Inside the Column Export stage
  Inside the Funnel stage
  Checkpoint
  Checkpoint solutions
  Demonstration 1: Read data with multiple record formats
  Unit summary

Unit 4 Sorting data
  Traditional (sequential) sort
  Parallel sort
  Stages that require sorted data
  Parallel sorting methods
  In-stage sorting
  Sort stage
  Stable sorts
  Resorting on sub-groups
  Don't sort (previously grouped)
  Partitioning and sort order
  Global sorting methods
  Inserted tsorts
  Changing inserted tsort behavior
  Sort key does not match stage requirements
  DataStage legacy behavior
  Multiple input links example
  Fork-join job example
  Fork-join job design
  Examining the Score
  Difficulties with the design
  Score of optimized job
  Checkpoint
  Checkpoint solutions
  Demonstration 1: Optimize a fork-join job
  Unit summary

Unit 5 Buffering in parallel jobs
  Unit objectives
  Introducing the buffer operator
  Identifying buffer operators in the Score
  How buffer operators work
  Buffer flow control
  Buffer tuning - buffer policy
  Buffer tuning - additional buffer settings
  Cautions
  Changing buffer settings in a job stage
  Buffer resource usage
  Buffering for group stages
  Join stage internal buffering
  Avoiding buffer contention in fork-join jobs
  Revisiting the header detail job design
  Buffering solution
  Redesigned header detail processing job
  Checkpoint
  Checkpoint solutions
  Demonstration 1: Optimize a fork-join job
  Unit summary

Unit 6 Parallel framework data types
  Unit objectives
  Data formats
  Data sets
  Example schemas
  Type conversions
  Using Modify stage for type conversions
  Processing external data
  Sequential file import conversion
  COBOL file import conversion
  Oracle automatic conversion
  Standard framework data types
  Complex data types
  Schema with complex types
  Complex types column definitions
  Complex Flat File (CFF) stage
  Sample COBOL copybook
  Importing a COBOL file definition (CFD)
  COBOL table definitions
  COBOL file layout
  Specifying a date mask
  Example data file with multiple formats
  Sample job with CFF stage
  Transformer constraints
  Nullable data
  Null transfer rules
  Nulls and sequential files
  Null field value examples
  Viewing data with Null values
  Lookup stage and nullable columns
  Nullability in lookups
  Outer joins and nullable columns
  Checkpoint
  Checkpoint solutions
  Demonstration 1: Test nullability
  Unit summary

Unit 7 Reusable components
  Unit objectives
  Schema file
  Creating a schema file
  Importing a schema
  Creating a schema from a table definition
  Reading a sequential file using a schema
  Runtime Column Propagation (RCP)
  Enabling Runtime Column Propagation (RCP)
  Enabling RCP at the project level
  Enabling RCP at the job level
  Enabling RCP at stage level
  When RCP is enabled
  Where do RCP columns come from?
  Shared containers
  Creating a shared container
  Inside the shared container
  Inside the shared container Transformer
  Using a shared container in a job
  Mapping input / output links to the container
  Interfacing with the shared container
  Checkpoint
  Checkpoint solutions
  Demonstration 1: Reusable components
  Unit summary

Unit 8 Balanced Optimization
  Unit objectives
  Pushing processing to the data source
  Sort example
  Optimized job
  Source Connector stage changes
  Logs tab in Optimizer window
  Advanced Options tab
  Optimizing Transformer processing
  Logs tab
  Optimizing stage processing to a database target
  Source stage SQL
  Target stage SQL
  Full push down
  Inside the target Connector stage
  Optimizing a job with Big Data File stage
  Checkpoint
  Checkpoint solutions
  Unit summary

Course overview

Preface overview
This course is designed to introduce advanced parallel job development techniques in DataStage v11.5. In this course you will develop a deeper understanding of the DataStage architecture, including a deeper understanding of the DataStage development and runtime environments. This will enable you to design parallel jobs that are robust, less subject to errors, reusable, and optimized for better performance.

Intended audience
Experienced DataStage developers seeking training in more advanced DataStage job techniques and who seek an understanding of the parallel framework architecture.

Topics covered
Topics covered in this course include:
* Introduction to the parallel framework architecture
* Compiling and executing jobs
* Partitioning and collecting data
* Sorting data
* Buffering in parallel jobs
* Parallel framework data types
* Reusable components
* Balanced Optimization

Course prerequisites
Participants should have:
* IBM InfoSphere DataStage Essentials course or equivalent, and at least one year of experience developing parallel jobs using DataStage.

Document conventions
Conventions used in this guide follow Microsoft Windows application standards, where applicable. As well, the following conventions are observed:
* Bold: Bold style is used in demonstration and exercise step-by-step solutions to indicate a user interface element that is actively selected or text that must be typed by the participant.
* Italic: Used to reference book titles.
* CAPITALIZATION: All file names, table names, column names, and folder names appear in this guide exactly as they appear in the application. To keep capitalization consistent with this guide, type text exactly as shown.

Additional training resources
* Visit IBM Analytics Product Training and Certification on the IBM website for details on:
  + Instructor-led training in a classroom or online
  + Self-paced training that fits your needs and schedule
  + Comprehensive curricula and training paths that help you identify the courses that are right for you
  + IBM Analytics Certification program
  + Other resources that will enhance your success with IBM Analytics Software
* For the URL relevant to your training requirements outlined above, bookmark:
  + Information Management portfolio: http://www-01.ibm.com/software/data/education/

IBM product help

Help type: Task-oriented
When to use: You are working in the product and you need specific task-oriented help.

Help type: Books for Printing (.pdf)
When to use: You want to use search engines to find information. You can then print out selected pages, a section, or the whole book. Use Step-by-Step online books (.pdf) if you want to know how to complete a task but prefer to read about it in a book. The Step-by-Step online books contain the same information as the online help, but the method of presentation is different.

You want to access any of the following:
* IBM - Training and Certification
  + http://www-01.ibm.com/software/analytics/training-and-certification/
* Online support
  + http://www-947.ibm.com/support/entry/portal/Overview/Software
* IBM Web site
  + http://www.ibm.com
Unit 1 Introduction to the parallel framework architecture
IBM InfoSphere DataStage v11.5

Unit objectives
* Describe the parallel processing architecture
* Describe pipeline and partition parallelism
* Describe the role of the configuration file
* Design a job that creates robust test data

Why study the parallel architecture?
* DataStage client is a productivity tool
  + GUI design functionality lets you develop jobs quickly
  + Not intended to mirror underlying architecture
* GUI depicts standard ETL process
  + Parallelism is implemented under the covers
  + GUI hides and in some cases distorts things
    + For example, sorts, buffers, partitioning operators
* Sound, scalable designs require an understanding of underlying architecture

Learning DataStage at the GUI job design level is not enough. In order to develop the ability to design sound, scalable jobs, it is necessary to understand the underlying architecture. This is because the DataStage client is primarily a productivity tool. It is not intended to mirror underlying architecture.
The DataStage Designer GUI depicts the standard ETL process. Source stages read the data, processing stages perform various sorts of transformations, and target stages load the data. Much is hidden, however. The GUI hides the fact that the data is being partitioned and processed in multiple, parallel streams. And the GUI hides the insertion of operators for sorting, buffering, partitioning, and collecting.
What we need to master
* How the GUI job design gets executed
  + What is generated from the GUI (OSH)
  + How it is executed in the parallel framework
* How parallelism is implemented
  + Pipeline parallelism
  + Partition parallelism
  + Role of the configuration file
* Development environment
  + How to develop efficient, well-performing GUI job designs
  + How to debug and change the GUI job design based on the generated OSH and Score and messages in the job log

To be able to design robust parallel jobs, we need to get behind and beyond the GUI. We need to understand what gets generated from the GUI design and how this gets executed by the parallel framework. We also need to be able to debug and modify our job designs based on what we see happening at runtime.
The OSH is generated from the GUI design at compile time. The OSH reveals which operators get generated from which GUI stages. At execution time, it is the operators that get executed, not the GUI stages. And there is not a one-to-one correspondence between stages and operators. The Score reveals which operators get executed on which nodes. It also reveals which operators get inserted at runtime for buffering and to satisfy the requirements of other operators.

DataStage parallel job documentation
* Information Server v11.5 documentation is provided in the online IBM Knowledge Center:
  + http://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0
* The Knowledge Center has a short URL as well:
  + All documentation: http://ibm.biz/knowctr
  + Version 11.5: http://ibm.biz/knowctr#SSZJPZ_11.5.0
Key parallel concepts
* Parallel processing:
  + Executing the job on multiple CPUs
* Scalable processing:
  + Add more resources (CPUs and disks) to increase system performance

Parallel processing is the key to building jobs that are highly scalable. "Scalable," here, refers to the ability to increase the performance of a job by adding additional resources (CPU, memory, disk). As the amount of data that requires processing increases, the need for greater performance increases.
For greater scalability and portability, the parallel engine employs the processing node concept. "Standalone processes" rather than "thread technology" is used. Process-based architecture is platform-independent, and allows greater scalability across resources within the processing pool.

Scalable hardware environments
[Figure: hardware environments, from a single CPU to multi-CPU systems with shared memory and disk to clusters and grids]

DataStage parallel jobs are designed to be platform-independent. A single job can run across resources within a single machine (SMP) or multiple machines (cluster, grid, or MPP architectures).
DataStage parallel jobs are also independent of the type of operating system they run on, assuming it is supported. The same GUI design can be compiled and run on a Windows system as well as many flavors of UNIX systems.
While DataStage can run in a single-CPU environment, it is designed to take advantage of parallel platforms.
Older computers (Windows, UNIX) often have a single CPU. Although DataStage parallel jobs can run on a single-CPU system, the resources limit the performance.
Newer computers often have several CPUs. True physical parallelism can be achieved. Each CPU accesses the same memory and disk resources.
A Grid or Cluster environment offers the greatest amount of potential resources the jobs can use. Multiple computers, each potentially with multiple CPUs, are connected together in a high-speed network. The processing power of DataStage parallel jobs can be distributed over multiple CPUs on multiple computers.
In the graphic shown, there are nine computers, each with four CPUs. The CPUs within a computer share that computer's memory, but not the memory of the other computers in the Grid. Potentially, the processing of a job running on the Grid can be distributed over 36 CPUs. And these job processes can take advantage of the memory on all of the systems.

Drawbacks of traditional batch processing
* Poor utilization of resources
  + Lots of idle processing time
  + Lots of disk and I/O for staging
* Complex to manage
  + Lots of small jobs
* Impractical with large data volumes

Traditional batch processing consists of a distinct set of steps, defined by business requirements. Between each step, intermediate results are written to disk.
This processing may exist outside of a database (using flat files for intermediate results) or within a database (using SQL, stored procedures, and temporary tables).
There are several problems with this approach. First, each step must complete and write its entire result set before the next step can begin. Secondly, landing intermediate results incurs a large performance penalty through increased I/O. In this example, a single source incurs seven times the I/O to process (represented by the seven arrows going to or from disk). Thirdly, with increased I/O requirements come increased storage costs.
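The staging cost described above can be sketched in a few lines of Python. This is only an illustration, not DataStage code: the function name `run_staged_batch`, the three stand-in steps, and the row counts are all assumptions, and the counts do not reproduce the seven arrows of the course diagram.

```python
def run_staged_batch(source_rows, steps):
    """Run each step to completion, landing its whole output before the next step starts."""
    staging_io = 0                      # rows written to, plus rows read back from, staging
    current = list(source_rows)
    for index, step in enumerate(steps):
        result = [step(row) for row in current]   # the whole step must finish first
        if index < len(steps) - 1:
            # Every intermediate result is written once and read once by the next step.
            staging_io += 2 * len(result)
        current = result
    return current, staging_io


def transform(row):
    return row * 2


def enrich(row):
    return row + 1


def load(row):
    return row                          # stand-in for writing to the target


output, staging_io = run_staged_batch(range(1000), [transform, enrich, load])
print(staging_io)  # 4000: two staging areas, each written once and read once
```

Even in this toy version, 1,000 source rows cause 4,000 extra row transfers that exist only to hand data from one step to the next.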
Pipeline parallelism
* Transform, enrich, load processes execute simultaneously
* Like a conveyor belt moving rows from process to process
  + Start downstream process while upstream process is running
* Advantages:
  + Reduces disk usage for staging areas
  + Keeps processors busy
* Still has limits on scalability

DataStage takes advantage of two types of parallelism: pipeline parallelism and partition parallelism. This diagram illustrates pipeline parallelism. In this diagram, the arrows represent rows of data flowing through the job. While earlier rows are undergoing the Loading process, later rows are in parallel undergoing the Transform and Enrich processes. In this way a number of rows (seven in this picture) are being processed in parallel.
You can visualize this as a conveyor belt upon which the seven rows are resting. The source adds new rows to the conveyor belt. The target removes rows when they arrive at the end.

Partition parallelism
* Divide the incoming stream of data into subsets to be separately processed by an operation
  + Subsets are called partitions
* Each partition of data is processed by the same operation
  + For example, if the operation is Filter, each partition is filtered in exactly the same way
* Facilitates near-linear scalability
  + 8 times faster on 8 processors
  + 24 times faster on 24 processors
  + This assumes the data is evenly distributed

Partitioning breaks a data set into smaller sets. Each set of data is processed separately in parallel. Suppose the total data set consists of one million rows and it takes X amount of time to process that data in one single stream.
If the data is partitioned into four subsets, each with roughly equal amounts of data, then the data can potentially be processed four times faster, minus the overhead from dividing the data up into the four subsets and initiating processing on each of the four partitions (nodes).
Partition parallelism is a key to scalability. However, the data needs to be evenly distributed across the partitions; otherwise, the benefits of partitioning are reduced.
It is important to note that what is done to each of the four partitions of data is the same. Suppose, for example, that the job contains a Transformer stage. Then four copies of the same transform operator, which is generated in the OSH during compile time, will be running in parallel, one on each of the four partitions. Each transform operator will perform exactly the same processing operations on each set of data.

Partitioning illustration
* The data is partitioned into three subsets
* The same operation is performed on each partition of data separately and in parallel
* If the data is evenly distributed, the data will be processed roughly three times faster

This diagram depicts how partition parallelism is implemented in DataStage. The data is split into multiple data streams which are each processed separately by the same stage operations.
In this example, the data, which is initially all collected in a single location, is written out to three partitions. This data becomes the input to three copies of the same operator (transform operator, sort operator, filter operator, or whatever it happens to be). How the data is distributed to each of these partitions depends on the partitioning algorithm used. Later in this course, we will discuss the different types of partitioning algorithms that can be used. Some distribute the data evenly; others may not.
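To make the effect of uneven distribution concrete, here is a minimal Python sketch. It is not how the parallel engine partitions data: the two partitioner functions, the `cust` column, and the row counts are hypothetical, and the speedup estimate simply assumes that the largest partition decides when the operation finishes.

```python
def round_robin_partition(rows, n_partitions):
    """Deal rows out to the partitions in turn, like dealing cards."""
    partitions = [[] for _ in range(n_partitions)]
    for index, row in enumerate(rows):
        partitions[index % n_partitions].append(row)
    return partitions


def modulus_partition(rows, n_partitions, key):
    """Send every row with the same integer key value to the same partition."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[row[key] % n_partitions].append(row)
    return partitions


def estimated_speedup(partitions):
    """Total rows divided by the size of the largest (slowest) partition."""
    total = sum(len(part) for part in partitions)
    return total / max(len(part) for part in partitions)


# Hypothetical data: 900 rows, 700 of which share the same customer number.
rows = [{"cust": 3} for _ in range(700)] + [{"cust": i % 3} for i in range(200)]

even = round_robin_partition(rows, 3)
skewed = modulus_partition(rows, 3, key="cust")

print([len(p) for p in even], estimated_speedup(even))
print([len(p) for p in skewed], round(estimated_speedup(skewed), 2))
```

With three partitions, the round-robin split gives 300 rows per partition and an estimated speedup of 3.0, while the key-based split puts 767 rows in one partition and the estimate drops to about 1.17.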
If the data is distributed evenly, then this particular operation can yield roughly three times the performance of a single partition.

DataStage combines partitioning and pipelining
* Within DataStage, pipelining, partitioning, and repartitioning are automatic
* Job developer only identifies:
  + Sequential vs. parallel operations (by stage)
  + Method of data partitioning
  + Configuration file (which identifies resources)
  + Advanced stage options (buffer tuning, operator combining, etc.)

DataStage jobs employ both pipeline and partition parallelism to achieve high performance. Neither is explicitly designed into the DataStage job. The DataStage developer specifies the GUI ETL design. The parallelism happens automatically when the job is compiled and run.
This is not to say that nothing in the job design affects parallelism. There are ways in which the DataStage developer can affect the parallelism. Developers can specify partitioning and collecting algorithms for stages in the GUI design. Developers can also specify the mode in which stages run: parallel or sequential mode. One of the purposes of this course is to teach developers how to create job designs that enhance the parallelism, and hence enhance the performance of their jobs.
The configuration file determines the degree of parallelism by specifying the number of partitions. The configuration file used by a particular job is not known until runtime. And the same job can be run using different configuration files on different occasions.
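The combination of the two kinds of parallelism can be sketched with Python generators. This is an analogy under stated assumptions, not the engine's implementation: the stage functions and `run_job` are hypothetical names, and the partitions are processed one after another here, whereas the parallel engine runs them as separate processes. What the sketch does show is that rows flow from stage to stage without being landed, and that the same job definition runs unchanged with a different number of partitions.

```python
def transform(rows):
    for row in rows:
        yield row * 2               # each row moves on as soon as it is ready


def enrich(rows):
    for row in rows:
        yield row + 1


def load(rows, target):
    for row in rows:
        target.append(row)


def run_job(source_rows, n_partitions):
    """Split the source into partitions and run the same pipeline over each one."""
    partitions = [source_rows[p::n_partitions] for p in range(n_partitions)]
    targets = [[] for _ in range(n_partitions)]
    for part, target in zip(partitions, targets):
        load(enrich(transform(part)), target)   # no intermediate result is landed
    return targets


# The job definition does not change; only the degree of parallelism does.
two_way = run_job(list(range(8)), n_partitions=2)
four_way = run_job(list(range(8)), n_partitions=4)
print(two_way)
print(four_way)
```

Both runs produce the same eight output rows, only grouped into a different number of partitions.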
Job design versus execution
* User assembles the flow using DataStage Designer
* At runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)
* No need to modify or recompile the job design!

Much of the parallel processing paradigm is hidden from the designer. The designer simply designates the process flow, as shown in the upper portion of this diagram. The parallel engine, using definitions in a configuration file, will actually execute processes that are partitioned and parallelized, as illustrated in the bottom portion.
A misleading feature of the lower diagram is that it makes it appear as if the data remains in the same partitions through the duration of the job. In fact, partitioning and repartitioning occur on a stage-by-stage basis. There will be times when the data moves from one partition to another.

Defining parallelism
* Execution mode (sequential / parallel) is controlled by stage definition and properties
  + Default is parallel for most stages
  + Can override default in most cases (Advanced Properties tab)
  + By default, Sequential File stage runs in sequential mode
    + Can run in parallel mode when using multiple readers
  + By default, Sort stage (and most other stages) run in parallel mode
* Degree of parallelism is determined by the configuration file the job is running with
  + Total number of logical nodes in the configuration file
    + Assuming these nodes exist in available node pools

Stages run in two possible execution modes: sequential and parallel. The default is parallel for most stages, but there are exceptions. For example, the Sequential File stage runs in sequential mode by default. The Sort stage, and most other stages, run in parallel mode by default. The DataStage developer can override the default in many cases.
For example, although the Aggregator and Sort stages run in parallel mode by default, a developer might change the mode to sequential for some purposes. Running a Sort stage in sequential mode produces a "global" sort of all the data. Running a Sort stage in parallel mode produces sorted data within each partition, but not across partitions.
If a stage runs in sequential mode, it will run on only one of the available nodes specified in the configuration file. If a stage runs in parallel mode, it can use all the available nodes specified in the configuration file.
One thing you can learn from reading the Score, which is discussed in detail later in this course, is whether an operator is running sequentially or in parallel and, if in parallel, the number of nodes it is running on.

Configuration file
* Configuration file separates configuration (hardware / software) from job design
  + Selected per job at runtime by $APT_CONFIG_FILE environment variable
  + Optimizes overall throughput and matches job characteristics to overall hardware resources
  + Lets you change hardware and resources without changing the job design
* Defines number of nodes (logical processing units) with their resources
  + Need not match the number of physical CPUs
  + Resources include dataset, scratch, buffer disk (file systems)
  + Optional resources include database, SAS
  + Resource usage can be optimized using "pools" (named subsets of nodes)
  + Allows runtime constraints on resource usage on a per job basis
* Different configuration files can be used on different job runs
  + Add $APT_CONFIG_FILE as a job parameter

The configuration file determines the degree of parallelism (number of partitions) of jobs that use it. Each job runs under a configuration file. The configuration file is specified at runtime by the $APT_CONFIG_FILE environment variable.
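As a rough illustration of how the number of logical nodes follows from the configuration file, the Python sketch below counts the node entries in a small two-node file. The node names, host name, and paths in `SAMPLE_CONFIG` are made up, and the helper functions are hypothetical conveniences rather than part of any IBM tool; the regular expression is a simplification that ignores comments.

```python
import os
import re

# Hypothetical two-node configuration; the host name and paths are made up.
SAMPLE_CONFIG = '''
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node2" {pools ""}
    resource scratchdisk "/scratch/node2" {pools ""}
  }
}
'''


def node_names(config_text):
    """Return the logical node names declared in a configuration file."""
    return re.findall(r'\bnode\s+"([^"]+)"', config_text)


def load_config_text():
    """Read the file named by APT_CONFIG_FILE, or fall back to the sample text."""
    path = os.environ.get("APT_CONFIG_FILE")
    if path and os.path.isfile(path):
        with open(path) as handle:
            return handle.read()
    return SAMPLE_CONFIG


names = node_names(SAMPLE_CONFIG)
print(len(names), names)  # 2 ['node1', 'node2']
```

Both logical nodes here point at the same host, which reflects the point above that the number of nodes need not match the number of physical CPUs.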
DataStage jobs can use different configuration files on different job runs by making the $APT_CONFIG_FILE environment variable a job parameter. Thus, a job can utilize different hardware architectures without being recompiled.
It might sometimes be advantageous to run the job with a 4-node configuration file on a 2-processor computer, for example when the job is "resource bound." In that case the disk I/O can be spread among a greater number of controllers.

Example configuration file
[Figure: example configuration file. Annotations mark the number of nodes, the resources assigned to each node (their order is significant), and the nameless node pool (""), whose nodes are available to stages by default.]

This example shows a typical configuration file. Pools can be applied to nodes or other resources. The curly braces following some disk resources specify the resource pools associated with that resource. A node pool is simply a collection of nodes. The pools a given node belongs to are listed after the keyword "pools" for the given node. A stage that is constrained to use a particular named pool will run only on the nodes that are in that pool. By default all stages run on the nodes that are in the nameless pool ("").
Following the keyword "node" is the name of the node (logical processing unit).
The order of resources is significant. The first disk is used before the second, and so on.
Keywords, such as "sort" and "bigdata", when used, restrict the signified processes to the use of the resources that are identified. For example, "sort" restricts sorting to node pools and scratch disk resources labeled "sort".
Database resources (not shown here) can also be created that restrict database access to certain nodes.
A common question is whether job operators can be constrained to use specific CPUs.
No. A request is made to the operating system, and the operating system chooses the CPU. A computer can be designated, but not a CPU within a computer.

Generating mock data

* Row Generator stage
  - Define columns in which to generate the data
  - On the Extended Properties page, select the algorithm for generating values
  - Different types have different algorithms available
* Lookups can be used to generate large amounts of robust mock data
  - Lookup tables map integers to values
  - Column Generator columns generate integers to look up
  - Cycling through integer sets can generate all possible combinations

One of the purposes of this course is to introduce you to new types of stages and job designs. Even if you never build a similar job, exploring these designs will show you different possibilities. In this unit we explore the use of the Row Generator stage. Among its many uses, the Row Generator stage can be used to generate mock or test data. When used with Lookup stages in a job, large amounts of robust mock data can be generated.

Job design for generating mock data

In this job design, the Row Generator stage generates integers to look up. For different columns, it cycles through integer sets, generating all possible combinations. The lookup reference files map these integers to specific values. For example, FName maps different integer values to first names. LName maps different integer values to last names. And so on.
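The idea of cycling integer columns and mapping them through lookup tables can be sketched in a few lines of Python. This is only a toy model of the job design, not DataStage code: the lookup contents, the cycle ranges, and the names `cycle_column` and `mock_rows` are hypothetical. The two cycle lengths chosen here (3 and 2) share no common factor, so six rows cover every first-name and last-name pair before the pattern repeats.

```python
def cycle_column(start, limit):
    """Yield start, start + 1, ..., limit, then wrap around to start."""
    value = start
    while True:
        yield value
        value = start if value == limit else value + 1


# Hypothetical lookup tables standing in for the FName and LName
# reference files; each maps an integer key to a value.
FNAME = {0: "Ann", 1: "Bob", 2: "Cho"}
LNAME = {0: "Diaz", 1: "Evans"}


def mock_rows(count):
    """Generate `count` (first name, last name) rows from cycling keys."""
    int1 = cycle_column(0, 2)  # cycles 0, 1, 2, 0, ...
    int2 = cycle_column(0, 1)  # cycles 0, 1, 0, ...
    rows = []
    for _ in range(count):
        rows.append((FNAME[next(int1)], LNAME[next(int2)]))
    return rows


if __name__ == "__main__":
    for row in mock_rows(7):
        print(row)
```

With larger lookup files and more integer columns, the same pattern yields many distinct rows from a handful of small reference tables.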
The result is a robust set of varied data.

Specifying the generating algorithm

The number of values to cycle through should be different for each set of integers, so that all possible combinations will be generated. For example:

000
111
220
301
010
121
200
311

Here the first column cycles through 0-3, the second 0-2, and the third 0-1. In this way, each row shown here has a unique set of integers. At some point, of course, the rows will begin repeating.

Inside the Lookup stage

This shows the inside of the Lookup stage. Notice how integer columns (int1, int2, ...) are specified as keys into the lookup files. The values the keys are mapped to are returned in the output.

Configuration file displayed in job log

The job log contains a lot of valuable information. In this course, we will explore this information and how you can use it. One message you will find in the log displays the configuration file the job is running under. The graphic shown here displays that message. In this example, we can see that there are at least two nodes in the configuration file. Therefore, operators in this job that are running in parallel mode will (unless they are constrained) run in two partitions.

Checkpoint

1. What two main factors determine the number of nodes a stage in a job will run on?
2. What two types of parallelism are implemented in parallel jobs?
3. What stage is often used to generate mock data?

Checkpoint solutions

1. The configuration file determines the total available nodes to the stages in the job. The stage execution mode determines whether the stage can use all the nodes (parallel mode) or just one (sequential mode).
2. Partition parallelism and pipeline parallelism.
3. Row Generator stage.

Demonstration 1
Introduction to the parallel framework architecture

* In this demonstration, you will:
  - Generate mock data
  - Examine the job log

Purpose:
In this demonstration you will design a job that generates mock data. You will also examine the information in the job log.

Windows User/Password: student/student
DataStage Client: Designer
Designer Client User/Password: student/student
Project: EDSERVER/DSProject

Task 1. Log into DataStage Designer.

1. If necessary, log into the Windows client system as student / student.
2. Log in to Designer Client:
   * Host name of the services tier and port number: edserver:8443
   * User name: student
   * Password: student
   * Project: EDSERVER/DSProject

NOTE: If you cannot log in, this may be because Information Server (DataStage) has not started up. It can take over 5 minutes to start up. If it has not started up, examine Windows services. There is a shortcut on the desktop. Verify that DB2 - DB2Copy has started.
If not, select it and then click Start. Then select IBM WebSphere Application Server and then click Restart. DB2 typically starts up automatically, but if it does not, Information Server (DataStage) will not start.

3. Click Cancel to close the New window.
4. In the Repository, right-click DSProject, and then create a new folder named _Training.
5. Within that folder create subfolders named Jobs and Metadata.
6. Create a new parallel job named archGenData. Store it, and all jobs you create in this course, in the _Training>Jobs folder.

Task 2. Design a job that generates a mock data file.

1. Add the stages shown. Add a Row Generator stage (from the Development/Debug folder), followed by a Lookup stage (from the Processing folder), and a Transformer stage (from the Processing folder). Add 6 Sequential File stages, and then add the links as shown.
2. Name your stages and links as shown.
3. Open up the Row Generator stage. On the Properties tab specify that 1000 rows are to be generated (Number of Records = 1000).
4. On the Output>Columns tab, specify the column definitions as shown. You can either specify these manually or load them from the archGenDataSource.dsx file in your C:\CourseData\DSAdv_Files\dsx directory. Next you will see how to import the table definition from the dsx file.
5. Close the Row Generator stage.
6. From the Import menu, click DataStage Components, and then browse to C:\CourseData\DSAdv_Files\dsx\archGenDataSource.dsx.
7. Click Import selected, and then click OK. The result appears as shown.
8. Click Table Definitions, and then click OK.
Now, in the Repository, if you expand the _Training\Metadata folder, the archGenDataSource table definition is available. Now you can go back and load the table definition in the Row Generator stage.

9. Open the Row Generator stage, and on the Columns tab, click Load.

If you imported the table definition as described above, the remainder of this task lets you confirm that the import was successful. If you entered the column names manually in step 4, you will need to specify the extended properties as shown next.

Open up the extended properties window for the CustID column. (Double-click on the number to the left of the column.) Specify that the type of algorithm is cycle with an initial value of 10000.
For the Int2 column, cycle from 1 to 28. It is important that this not start at 0, so that these cycles will not repeat.
For the Int column, cycle from 2 to 29.
For the Int column, cycle from 3 to 29.
For the MiddleInit column, use the alphabet algorithm over a string of characters that might be middle name initials.
For the CustDate column, set the Epoch to 2010-01-15. This is the earliest generated date. Set the Type to cycle, incrementing by 15 days, and set the limit to 20000.
For InsertUpdateFlagInt, select random integers with a limit of 2. This will ensure that values are either 0 (meaning update) or 1 (meaning insert).

20. Close, and then click the View Data button to examine a sampling of the data that will be generated.

Next you want to edit the Sequential File stages on the reference links of the Lookup stage. The sequential files are in your DSAdv_Files directory, named FName.txt, LName.txt, Street1.txt, and Street2.txt. Examine these files to get an idea of the data they contain, and then import the metadata for these files and load it into the stages.

1. Open the FName Sequential File stage and set the File property to C:\CourseDat
On the Columns tab, click Load. The table definition for this sequential file has not been added yet, but you can add it here.
Right-click _Training, click Import Table Definition, and then click Sequential File Definition.
In the Directory box, browse to DSAdv_Files, and then click OK.
5. Click FName.txt, and then in the To Folder box, browse to _Training>Metadata to store your table definitions. The result appears as shown.
These files are UNIX files and use non-default column delimiters. The first line of the txt files contains the column names, and the pipe (|) is the delimiter.
7. Select First line is column names, set Other Delimiter to "|", and then click Preview.
9. Similarly, repeat steps 5 to 8 to import the table definitions for LName.txt, Street1.txt, and Street2.txt. Remember to save your table definitions to the _Training\Metadata folder.
10. On the FName - Sequential File stage, load the table definition on the Format tab as well as the Columns tab. Verify on the Format tab that the record delimiter is UNIX newline, and on the Properties tab that you have set First Line is Column Names to True.
11. Load the table definitions for the other Sequential File stages: LName - Sequential File, Street1 - Sequential File, Street2 - Sequential File, and verify that you can view the data from each of these files.

Task 4. Edit the Lookup stage and the Transformer stage.

1. Edit the Lookup stage. Map the Int columns, respectively, to the Num column of each of the reference links.
2. Define the output columns in the order shown.
4. In the Transformer stage, define the target columns as shown. The added column DateEntered should get the date of the job run (Function>Date & Time>CurrentDate). InsertUpdateFlagInt is replaced by InsertUpdateFlag. The InsertUpdateFlag column, which is now a Char column, should replace 0 by "U" and 1 by "I". For the CustDate column, rows with customer dates later than the current date should get the current date. Otherwise, they retain the date they already have.
8. Open Job Properties and create a new job parameter named TargetPath. Its default value should point to a file named CustomersOut.txt in the output folder under your DSAdv_Files directory.
Edit the target Customers Sequential File stage. Insert your job parameter TargetPath in the File property to create a new comma-delimited file named CustomersOut.txt.
Edit the Rejects Sequential File stage so that rejects are written to a file.
Compile and run your job. Examine the job log for any errors.

Task 5. Examine the job log.

1. In Designer, click View>Job Log to open up the job log window, if it is not already open.
Locate and double-click the row Starting Job archGenData, and examine the values of the job parameters.
Locate and double-click the row Environment variable settings to examine the values of the environment variables that were in effect at the time the job ran.
Locate and examine the message that displays the configuration file used when the job was run. How many nodes are defined in the file?
Locate and examine the message that lists the job's datasets, operators, and number of processes, known as the Score. Note that you will not see the word "Score".
You can identify the message by its first line.
Locate the messages that report the number of records successfully written to the target Sequential File stage and the number that were rejected.

Results:
You examined the information in the job log.

Unit summary

* Describe the parallel processing architecture
* Describe pipeline and partition parallelism
* Describe the role of the configuration file
* Design a job that creates robust test data

Compiling and executing jobs
IBM InfoSphere DataStage v11.5

Unit objectives

* Describe the main parts of the configuration file
* Describe the compile process and the OSH (Orchestrate Shell Script) that the compilation process generates
* Describe the role and the main parts of the Score
* Describe the job execution process

Parallel job compilation

[Slide: Parallel job compilation. Legible points: DataStage generates OSH and C++ code; table definitions are compiled into schemas.]

During the compile process, DataStage generates all the code for the job. The compilation process generates OSH (a scripting language) from the job design and also C++ code for any Transformer stages that are used in a job. The C++ source code is then compiled into custom transform operators. For each Transformer, DataStage builds a C++ operator. This explains why jobs with Transformers often take longer to compile (but not to run). This also explains why a C++ compiler is needed on the DataStage Server system. DataStage only uses this compiler at compile time. At runtime, the C++ compiler is not used.

Stages have requirements. For example, the Sort stage is required to have an input link and an output link. At compile time, DataStage checks that all the stage requirements have been met before it generates the OSH.
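The compile-time check described above can be pictured as a validation pass that runs before any code is generated. The Python sketch below is a toy model, not DataStage's implementation: the `Stage` class, the `REQUIREMENTS` table, and both function names are hypothetical, and only the Sort rule (at least one input link and one output link) comes from the text.

```python
from dataclasses import dataclass

# Hypothetical requirement table: stage type -> (min input links, min output links).
# Only the Sort entry reflects the course text; the table itself is an assumption.
REQUIREMENTS = {"Sort": (1, 1)}


@dataclass
class Stage:
    name: str
    stage_type: str
    inputs: int
    outputs: int


def check_requirements(stages):
    """Return a list of human-readable requirement violations."""
    errors = []
    for stage in stages:
        min_in, min_out = REQUIREMENTS.get(stage.stage_type, (0, 0))
        if stage.inputs < min_in:
            errors.append(f"{stage.name}: {stage.stage_type} stage needs an input link")
        if stage.outputs < min_out:
            errors.append(f"{stage.name}: {stage.stage_type} stage needs an output link")
    return errors


def compile_job(stages):
    """Refuse to 'generate' anything until every stage passes validation."""
    errors = check_requirements(stages)
    if errors:
        raise ValueError("; ".join(errors))
    return "validation passed; code generation would start here"


if __name__ == "__main__":
    print(check_requirements([Stage("SortCust", "Sort", 1, 0)]))
```

Validating first and generating second means a design error surfaces as a compile message rather than as a runtime failure.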
