
DATA STAGE

2/12/2011



1. Explain the DataStage architecture.
DataStage is an ETL tool. It is a client-server technology and an integrated toolset used for designing, running, monitoring and administering "data acquisition" applications, known as jobs. A job is a graphical representation of the data flow from source to target, and it is designed with source definitions, target definitions and transformation rules. The DataStage software consists of client and server components.

[Architecture diagram: the client components (DataStage Designer, DataStage Director, DataStage Manager, DataStage Administrator) communicate over TCP/IP with the server components (DataStage Server and DataStage Repository).]

When the DataStage client software is installed on a PC, four client components appear: DataStage Administrator, DataStage Designer, DataStage Director and DataStage Manager.
DataStage client components:
1) DataStage Administrator: used to create or delete projects, clean up the metadata stored in the repository, and install NLS.


2) DataStage Manager: used to perform the following tasks:
a) Create table definitions.
b) Perform metadata back-up and recovery.
c) Create customized components.
3) DataStage Director: used to validate, schedule, run and monitor DataStage jobs.
4) DataStage Designer: used to create the DataStage application known as a job. The following activities can be performed in the Designer window:
a) Create the source definitions.
b) Create the target definitions.
c) Develop the transformation rules.
d) Design jobs.
DataStage server components:
DataStage Repository: a server-side component which stores the information needed to build the data warehouse.
DataStage Server: the server-side engine which executes the jobs created with the Designer.

2. What is a job? What are the types of jobs?
Ans: A job is an ordered series of individual stages which are linked together to describe the flow of data from source to target. Three types of jobs can be designed:
a) Server jobs
b) Parallel jobs
c) Mainframe jobs

3. Have you worked on parallel jobs or server jobs?
Ans: I have been working with parallel jobs for more than three years.

4. What is the difference between server jobs and parallel jobs?
Ans:
Server jobs:
a) They handle smaller volumes of data.
b) They have fewer components.
c) Data processing is slower.


d) They work purely on SMP (Symmetric Multi-Processing).
e) They rely heavily on the Transformer stage.
Parallel jobs:
a) They handle high volumes of data.
b) They work on parallel-processing concepts.
c) They apply parallelism techniques.
d) They follow MPP (Massively Parallel Processing).
e) They have more components compared to server jobs.
f) They work on the Orchestrate framework.

5. What is a stage? Explain the various stages in DataStage.
Ans: A stage defines a database, file or processing operation; stages define the extraction, transformation and loading of the data. There are two types of stages:
a) Built-in stages
b) Plug-in stages
Built-in stages are all the stages available in the Designer palette. They are further divided into two types:
I. Active stages: stages which transform or filter the data, for example all the processing stages.
II. Passive stages: stages which define only read and write access, for example all the database and file stages.

6. Explain the parallelism techniques.
Ans: Parallelism is the process of performing the ETL tasks in a parallel way while building the data warehouse. Parallel jobs support hardware systems such as SMP and MPP to achieve parallelism. There are two parallelism techniques:
A. Pipeline parallelism: the data flows continuously through the pipeline and all stages in the job operate simultaneously. For example, if the source has 4 records, as soon as the first record starts processing the remaining records start processing simultaneously in the downstream stages.
B. Partition parallelism: the same job is effectively run simultaneously by several processors, and each processor handles a separate subset of the total records. For example, if the source has 100 records and 4 partitions, the data is partitioned equally so that each partition gets 25 records; whenever the first partition starts, the remaining three partitions are processed simultaneously and in parallel.

7. What is a configuration file? What is its use in DataStage?
It is a normal text file, structured as a collection of nodes, which holds the information about the processing and storage resources that are available for use during parallel job execution. The default configuration file contains entries such as:
a) Node: a logical processing unit which performs all the ETL operations.
b) Pools: labels used to group nodes into logical sets.
c) Fastname: the server (host) name; the ETL job is executed on the node through this name.
d) Resource disk: the permanent storage area where the data files used by the jobs are stored.
e) Resource scratch disk: the temporary storage area where staging operations such as sorting are performed.
A sketch of such a file is shown below.
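As an illustrative sketch only (the node names and paths are hypothetical, and the exact layout can vary by installation), a two-node configuration file looks roughly like this:

    {
      node "node1"
      {
        fastname "etl_server"
        pools ""
        resource disk "/data/ds/datasets" {pools ""}
        resource scratchdisk "/data/ds/scratch" {pools ""}
      }
      node "node2"
      {
        fastname "etl_server"
        pools ""
        resource disk "/data/ds/datasets" {pools ""}
        resource scratchdisk "/data/ds/scratch" {pools ""}
      }
    }

The file referenced by the APT_CONFIG_FILE environment variable is read at run time, so adding nodes to it increases the degree of partition parallelism without changing the job design.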

8. What are the data partitioning techniques in DataStage?
In data partitioning the data is split into multiple partitions that are distributed across the processing nodes. The data partitioning techniques are:
a) Auto b) Hash c) Modulus d) Random e) Range f) Round Robin g) Same
The default partitioning technique is Auto.
Auto: DataStage determines the best partitioning method to use depending on the type of stage.
Hash: records with the same value of the hash-key field are sent to the same processing node, so related records end up on one node.
Modulus: the partition is chosen from the key column modulo the number of partitions; for example, with 4 partitions a key value of 10 goes to partition 10 mod 4 = 2.
Random: records are randomly distributed across all processing nodes; this method is useful for creating partitions of roughly equal size.
Range: similar to hash partitioning, except that the range is specified on the key column.
Round Robin: the first record goes to the first processing node, the second record goes to the second processing node, and so on; this is the most common method and it also produces equal-sized partitions.
Same: records stay in the partition they are already in; no repartitioning is performed.

9) Explain the File stages.
Note: all file stages are passive stages, meaning they define only read or write access.

Sequential File stage: a file stage used to read data from a flat file or write data to a flat file. It supports a single input link or a single output link, as well as a reject link.
[Screenshot: Sequential File stage properties.]

Dataset stage: a file stage used to store data in DataStage's internal, operating-system-related format. Because no format conversion is needed, it takes less time to read or write the data.
[Screenshot: Dataset stage properties.]

File Set stage: a file stage used to read or write data in a file set. The file is saved with the extension ".fs", and it operates in parallel.
[Screenshot: File Set stage properties.]

10) What is the exact difference between a Dataset and a File Set?
A Dataset is an internal format of DataStage. The main points to consider about datasets before using them:
1) They store the data in binary, in the internal format of DataStage, so it takes less time to read from or write to a dataset than to any other source or target.
2) They preserve the partitioning scheme, so you don't have to partition the data again.
3) You cannot view the data without DataStage.
About File Sets:
1) They store data in a format similar to a sequential file.
2) The only advantage of using a file set over a sequential file is that it preserves the partitioning scheme.
3) You can view the data, but only in the order defined by the partitioning scheme.

Q) Why do we need datasets rather than sequential files?
When you use a sequential file as a source, at compile time it is converted from ASCII to the native format, whereas for a dataset no conversion is required. Sequential files can accommodate up to 2 GB only, by default they are processed in sequence only, and they do not support NULL values. All of the above limitations can be overcome by using the Dataset stage: a dataset is an intermediate stage, it is loaded in parallel, and it improves performance. However, if you want to capture rejected data you need to use a Sequential File or File Set stage, so the selection ultimately depends on the requirement. In short, the Sequential File stage is used to extract data from and load data into flat files (with a 2 GB limit), while the Dataset stage is used for intermediate storage with parallelism and better performance.

Complex Flat File stage: this stage is used to read data from mainframe files. Using CFF we can read ASCII or EBCDIC (Extended Binary Coded Decimal Interchange Code) data, select only the required columns and omit the remaining ones, flatten the arrays (COBOL files), and collect the rejects (badly formatted records) by setting the Reject property to Save (the other options are Continue and Fail).
[Screenshot: Complex Flat File stage properties.]

11) Explain the various types of processing stages.
Processing Stages

Aggregator stage: a processing stage used to compute summaries for groups of the input data. It supports a single input link, which carries the input data, and a single output link, which carries the aggregated data.
[Screenshot: double-clicking the Aggregator stage opens its properties window.]

Copy stage: a processing stage used simply to copy the input data to a number of output links. It supports a single input link and any number of output links.
[Screenshot: double-clicking the Copy stage opens its properties window.]

Filter stage: a processing stage used to filter the data based on a given condition. It supports a single input link, any number of output links and, optionally, one reject link. An example of the kind of condition it takes is sketched below.
[Screenshot: double-clicking the Filter stage opens its properties window.]
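As a rough illustration (the column names and values are hypothetical), each output link of the Filter stage is given a SQL-like Where Clause, for example:

    Where Clause (output link 1):  DEPTNO = 10 AND SALARY > 50000
    Where Clause (output link 2):  DEPTNO <> 10

Rows that satisfy neither clause can optionally be routed to the reject link.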

Switch stage: a processing stage used to route the input data to different outputs based on the values of a selector column, similar to filtering on given conditions. It supports a single input link and up to 128 output links.
[Screenshot: double-clicking the Switch stage opens its properties window.]

Q) What is the exact difference between the Filter stage and the Switch stage?
Both stages have the same functionality and responsibility; the difference lies in the way they execute. In the Filter stage we can give multiple conditions on multiple columns, but every time a condition is evaluated the data comes from the source system, is filtered and is loaded into the target. In the Switch stage we give multiple conditions on a single column; the data comes from the source only once, all the conditions are checked inside the stage, and the rows are then loaded into the target.

Join stage: a processing stage used to combine two or more input datasets based on a key field. It supports two or more input datasets and one output dataset, and it does not support a reject link. The Join stage can perform inner, left-outer, right-outer and full-outer joins (a small worked example is given below).
Inner join: displays only the matched records from both tables.
Left-outer join: shows the matched records from both sides as well as the unmatched records from the left-side table.
Right-outer join: shows the matched records from both sides as well as the unmatched records from the right-side table.
Full-outer join: shows the matched as well as the unmatched records from both sides.
[Screenshot: double-clicking the Join stage opens its properties window.]
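A small worked example of the four join types (the tables and key values are made up for illustration). Suppose the left input has rows with DEPTNO 10 and 20, the right input has rows with DEPTNO 20 and 30, and DEPTNO is the join key:

    Inner join       -> 20            (only the key present on both sides)
    Left-outer join  -> 10, 20        (20 matched; 10 kept from the left with NULLs for the right-side columns)
    Right-outer join -> 20, 30        (20 matched; 30 kept from the right with NULLs for the left-side columns)
    Full-outer join  -> 10, 20, 30    (matched rows plus the unmatched rows from both sides)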

Merge stage: a processing stage used to merge multiple input datasets. It supports multiple input links; the first input link is called the master input link and the remaining links are called update links. It can perform inner joins and left-outer joins only.

[Screenshot: double-clicking the Merge stage opens its properties window.]

Q) In which case does the Merge stage perform an inner join, and in which case a left-outer join?
The Merge stage has a property called Unmatched Masters Mode. If Unmatched Masters Mode = Keep, it performs a left-outer join; if Unmatched Masters Mode = Drop, it performs an inner join.

Look-up stage: a processing stage used to look up reference data in relational tables. It supports multiple input links, a single output link and a single reject link. This stage can perform inner joins and left-outer joins.
[Screenshot: a simple job illustrating the Look-up stage; double-clicking the stage opens its properties window.]

Q) In which case does the Look-up stage perform an inner join, and in which case a left-outer join?
In the Look-up stage editor there is a Constraints icon; double-clicking that icon opens a window with two options, Condition Not Met and Lookup Failure. If Condition Not Met = Continue, the stage performs a left-outer join; if Condition Not Met = Drop, it performs an inner join. By default Lookup Failure = Fail; when the Reject option is chosen instead, the unmatched rows are sent to the reject link.

Q) What is the main difference between Join, Merge and Lookup?
They mainly differ in:
1. Input requirements
2. Treatment of unmatched records
3. Memory usage
1. Input requirements: The Join stage supports two or more input links and a single output link and does not support a reject link; it supports four types of joins (inner, left-outer, right-outer and full-outer). The Merge stage supports multiple input links, a single output link and reject links (one per update link); it supports inner and left-outer joins only. The Lookup stage supports multiple input links, a single output link and one reject link; it supports inner and left-outer joins only.
2. Treatment of unmatched records: The Join stage does not capture any unmatched records because it does not support a reject link. The Merge stage does not capture unmatched master records on the master link, but each unmatched update record goes to the corresponding update reject link. The Lookup stage captures the unmatched primary (stream) records on its reject link.
3. Memory usage: If the reference dataset is larger than physical memory, the Join stage is recommended for better performance; if the reference dataset is smaller than physical memory, the Lookup stage is recommended.

Funnel stage: an active processing stage used to combine multiple input datasets into a single output dataset.
Note: all the input datasets must have the same structure.
[Screenshot: Funnel stage properties.]

Remove Duplicates stage: a processing stage used to remove duplicate data based on a key field. It supports a single input link and a single output link.
[Screenshot: Remove Duplicates stage properties.]

Sort stage: a processing stage used to sort the data based on a key field, in either ascending or descending order. It supports a single input link and a single output link.
[Screenshot: Sort stage properties.]

Modify stage: a processing stage used for null handling and data-type changes. It is used to change data types, for example when the source column is varchar and the target column is integer, and we can also modify column lengths according to the requirement. A sketch of the specifications it takes is shown below.
[Screenshot: Modify stage properties.]
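As a rough sketch of how the Modify stage is driven (the column names are hypothetical, and the exact conversion function names should be checked against the documentation for your release), its Specification property takes entries such as:

    EMP_ID = EMPNO                      (rename a column)
    SALARY = handle_null(SALARY, 0)     (replace nulls with a default value)
    DROP COMMENTS                       (drop an unwanted column)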

Pivot stage: a processing stage about which many people have the following misconceptions:
1) It converts rows into columns.
2) By using a Pivot stage we can convert 10 rows into 100 columns and 100 columns into 10 rows.
3) You can add more points here!
Let me first tell you that the Pivot stage only CONVERTS COLUMNS INTO ROWS and nothing else, i.e. it doesn't do the reverse! Let's cover how exactly it does that. Some DataStage professionals refer to this as normalization. Another fact about the Pivot stage is that it is irreplaceable, i.e. no other stage has this functionality of converting columns into rows, and that is what makes it unique.
For example, take a file with the following fields: sno, sname, m1, m2, m3. Basically you would use a Pivot stage when you need to convert the three fields m1, m2, m3 into a single field, marks, which contains one value per row. You would need the following output:
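A small made-up illustration of that pivot (the three mark columns are folded into one column, so each input row becomes three output rows):

    Input:   sno=1, sname=Ravi, m1=60, m2=70, m3=80
    Output:  sno=1, sname=Ravi, marks=60
             sno=1, sname=Ravi, marks=70
             sno=1, sname=Ravi, marks=80

In the stage this is expressed by listing the source columns (m1, m2, m3) in the derivation of the output column marks.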

[Screenshot: Pivot stage properties.]

Surrogate Key Generator stage: an important processing stage used to generate sequence numbers, for example while implementing slowly changing dimensions.
[Screenshot: Surrogate Key Generator stage properties.]

Q) What is the difference between a primary key and a surrogate key?
A primary key is a natural identifier for an entity: its values are entered manually, they uniquely identify each row, and there is no repetition of data. A surrogate key is an artificial identifier for an entity: it is a system-generated key, typically used on dimension tables, and its values are generated sequentially by the system.

Transformer stage: an active processing stage which allows filtering of the data based on given conditions and can derive new data definitions by developing expressions. The Transformer stage can perform data-cleansing and data-scrubbing operations. It supports a single input link, any number of output links and also a reject link. In parallel jobs this stage is compiled into C++ code, so a C++ compiler is required on the server.
[Screenshot: Transformer stage properties.]
In this editor there are stage variables, derivations and constraints (a brief example follows):
Stage variable: an intermediate processing variable that retains its value during a read and does not pass the value on to a target column.
Derivation: an expression that specifies the value to be passed on to the target column.
Constraint: a condition that is either true or false and that specifies the flow of data within a link.
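A minimal sketch of how these three pieces are typically filled in (the link and column names lnk_in, lnk_out, SALARY and DEPTNO are hypothetical):

    Stage variable svSalary:           If IsNull(lnk_in.SALARY) Then 0 Else lnk_in.SALARY
    Derivation of lnk_out.ANNUAL_SAL:  svSalary * 12
    Constraint on lnk_out:             lnk_in.DEPTNO = 10

Rows that fail the constraint do not flow down that output link.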

Q) How can I define stage variables in the Transformer stage?
Click the stage properties icon in the Transformer editor; a window opens in which the stage variables can be defined. Clicking the constraints icon opens a window in which the constraints can be defined.

Q) What is the order of execution in the Transformer stage?
The order of execution is:
1) Stage variables
2) Constraints
3) Derivations

Change Capture stage: an active processing stage used to capture the changes between two sources, the before and after datasets. The source in which we are looking for the change is called the before dataset; the source used as the reference to capture the change is called the after dataset. A change code is added to the output dataset, and by this change code the stage indicates whether a record is an insert, a delete or an update.
[Screenshot: Change Capture stage properties.]

15) Explain the Development and Debug stages.
Development and Debug Stages

Row Generator stage: produces a set of data matching the specified metadata. It has no input links and a single output link. It is useful when you want to test your job but have no real data available to process.
[Screenshot: Row Generator stage properties.]

Column Generator stage: adds columns to the incoming data and generates mock data for these columns for each data row processed. It has a single input link and a single output link.
[Screenshot: Column Generator stage properties; sample job with input and output data.]

Head stage: helpful for testing and debugging an application with large datasets. It selects the TOP N rows from the input dataset and copies the selected rows to an output dataset. It has a single input link and a single output link.
[Screenshot: Head stage properties.]

Tail stage: also helpful for testing and debugging an application with large datasets. It selects the BOTTOM N rows from the input dataset and copies the selected rows to an output dataset. It has a single input link and a single output link.
[Screenshot: Tail stage properties.]

Sample stage: has a single input link and any number of output links when operating in percent or period mode.
[Screenshot: Sample stage properties.]

Peek stage: has a single input link and any number of output links. It can be used to print the record column values to the job log view.
[Screenshot: Peek stage properties; mock data.]

21) Can you explain a Type-2 implementation?
SCD Type-2 is one of the common problems in data warehousing: it is used to maintain the history information for a particular organization in the target.
In this implementation the source is, for example, an EMP table with 100 records. The job has two input datasets, the before and after datasets, which are connected to a Change Capture stage; the Change Capture stage is connected to a Transformer stage which has two output links, an insert link and an update link. The insert link is connected to a Stored Procedure stage, which is connected to a Transformer, which is connected to the target insert stage; this Transformer transforms the records from source to target, generating a sequence for the records by using the Stored Procedure stage. The update link of the Transformer is joined, through a Join stage, with the target records, with duplicates removed by a Remove Duplicates stage; the output link of the Join stage is connected to a Transformer stage which passes the update records to the target update stage.
How is the history stored? When the job is first run, the two input datasets are compared; since there is no change in the records, the Change Capture stage gives change code = 1 and the records are initially loaded into the target through the insert path. If any update occurs at the source, the Change Capture stage gives change code = 3, and for every update in the source the updated record is stored on the target update side (TGT_UPDATE). This is how the history is maintained.

17) What are environment variables?
An environment variable is a predefined variable that we can use while creating a DataStage job; we set the properties for these variables while designing the job. There are two types of variables:
1. Local variables
2. Environment variables, also called global variables. We create/declare these variables in the DataStage Administrator.

Local variables: valid only for the particular job in which they are defined.
Environment variables: available in any job throughout the project; there are some default (built-in) variables and we can also define user-defined variables. For example, to connect to a database you need a user id, a password and a schema. These are constant throughout the project, so they are created as environment variables; if the password or schema changes there is no need to worry about all the jobs, because changing the value at the environment-variable level takes care of all of them.
Creating project-specific environment variables:
1. Start DataStage Administrator.
2. Choose the project and click the "Properties" button.
3. On the General tab click the "Environment..." button.
4. Click on the "User Defined" folder to see the list of job-specific environment variables.

18) Explain job parameters.
There is an icon in the toolbar to open Job Parameters (or press Ctrl+J to open the Job Parameters dialog box). Enter a parameter name and a corresponding default value, then use the parameter wherever you want with the #Parameter# notation. This lets you enter the value when you run the job, so it is easy for users to handle jobs using parameters, and it is not necessary to open the job just to change a parameter value. When the job runs through a script, it is enough to give the parameter value on the command line of the script; otherwise you would have to change the value in the job, compile it and then run it from the script.

20) What are the differences between DataStage versions 7.5.2 and 8.0.1?
The main differences between 7.5.2 and 8.0.1:
1. In 7.5.2 both the code and the metadata are stored in a file-based system, whereas in 8.0.1 the code is file-based and the metadata is stored in a database.
2. In 7.5.2 we have the Manager as a separate client; in 8.0.1 there is no separate Manager client, as it is embedded in the Designer client.
3. In 7.5.2 we require only operating-system authentication; in 8.0.1 we require both operating-system and DataStage authentication.
4. In 7.5.2 we don't have range lookup; in 8.0.1 we have range lookup.
5. In 7.5.2 Quality Stage has a separate designer; in 8.0.1 Quality Stage is integrated into the Designer.
6. In 7.5.2, when a developer has a particular job open and another developer wants to open the same job, it cannot be opened; in 8.0.1 the second developer can open the same job as read-only.
7. In 7.5.2 a single Join stage cannot support multiple references, whereas in 8.0.1 it can.

8. In 8.0.1 quick find and advanced find features are available; they are not available in 7.5.2.
9. In 8.0.1 a compare utility is available to compare two jobs, for example one in development and another in production; in 7.5.2 this is not possible.
10. In 7.5.2, the first time a job is run the surrogate key is generated from the initial value to n; the next time the same job is compiled and run, the surrogate key is again generated from the initial value to n, so automatic increment of the surrogate key is not available in 7.5.2. In 8.0.1 the surrogate key is incremented automatically, and a state file is used to store the maximum value of the surrogate key.

Q) Explain MetaStage in DS 8.0.1.
Metadata is data about the data we are handling. These data definitions are stored in the repository and can be accessed with the use of MetaStage. It is used to handle the metadata, which is very useful for data lineage and data analysis later on.

Q) Do you know about the INTEGRITY/QUALITY stage?
QualityStage can be integrated with DataStage. In QualityStage we have many stages such as Investigate, Match and Survivorship; the data-quality-related work is done there, and to integrate it with DataStage we need the QualityStage plug-in.

Q) How did you handle rejected data?
A reject link is defined and the reject data is loaded back into the DWH. Reject data is typically bad data such as duplicate primary keys or null rows where data is expected. A reject link has to be defined on every output link from which you wish to collect rejected data.

Q) What are routines, where/how are they written, and have you written any routine before?
I did not use routines in my project, but I know about them. Routines are stored in the Routines branch of the DataStage repository, where you can create, view or edit them. The different types of routines are:
1) Transform functions
2) Before/After subroutines
3) Job control routines

Q) How can I call a stored procedure in DataStage?
There is a stage named Stored Procedure available in the DataStage palette under the Database category. You can use that stage to call your procedure in DataStage jobs.

Q) What is job control? How is it developed? Explain with steps.
Job control means controlling DataStage jobs from another DataStage job. For example, consider two jobs XXX and YYY: job YYY can be executed from job XXX by using DataStage macros in routines. To execute one job from another, the following steps are performed in the routine:
1. Attach the job using the DSAttachJob function.
2. Run the other job using the DSRunJob function.
3. Stop the job using the DSStopJob function.
A sketch of such a routine is shown below.
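A minimal job-control routine sketch in DataStage BASIC, assuming a controlling job that runs a job named "YYY" (the job name and the parameter are made up, and error handling is reduced to the essentials):

    * Attach to the job we want to control
    hJob = DSAttachJob("YYY", DSJ.ERRFATAL)

    * Optionally set a job parameter before running (parameter name is hypothetical)
    ErrCode = DSSetParam(hJob, "RUN_DATE", "2011-02-12")

    * Request a normal run and wait for it to finish
    ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
    ErrCode = DSWaitForJob(hJob)

    * Check the finishing status and detach
    Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
    ErrCode = DSDetachJob(hJob)

DSStopJob(hJob) can be called instead if the controlled job has to be aborted.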

Q) How do you kill a job in DataStage?
By killing the corresponding process ID.

Q) How do you eliminate duplicate rows?
Duplicates can be eliminated by loading the corresponding data into a hashed file and specifying the columns on which you want to eliminate duplicates as the keys of the hashed file.

ALL THE BEST
