I was extracting some data into sequential files and the job failed with the error message "file full". I replaced the sequential file with a File Set, but now the job has been running for more than 45 minutes and does not stop even after I send multiple stop instructions to it. Any help or advice will be highly appreciated.
A sequential file is an operating system file. Its size limit is determined by the operating system (and your ulimit setting on UNIX). A File Set is a construction of one or more operating system files per processing node, each of which may be no more than 2 GB. You can have up to 1,000 files per processing node, disk space permitting, in a File Set. There is a control file, whose name ends in ".fs", that records where each of the files is. A persistent Data Set is constructed in exactly the same way as a File Set; the only differences are that a File Set stores data in human-readable form, while a Data Set stores data in internal form (binary numbers), and that the control file suffix for a persistent Data Set is ".ds". A virtual Data Set has no visibility as files per processing node in the operating system: it exists entirely in memory except for its control file, which has a ".v" suffix.
* A sequential file can only be accessed on one node. In general it can only be accessed sequentially by a single process, so the concept of parallelism is lost.
* A Data Set preserves partitioning. It stores data on the nodes, so when you read from a Data Set you do not have to re-partition your data.
* We cannot use the Unix cp or rm commands to copy or delete a Data Set, because DataStage represents a single Data Set with multiple files. Using rm simply removes the descriptor file, leaving the much larger data files behind.
* A Unix command can be used to find and delete the whole set of files; the path should be changed to match your environment.
But regarding File Sets: is the maximum 1,000 files per processing node or 10,000 files per processing node?

2.
1) What is the difference between server jobs and parallel jobs?
2) Orchestrate Vs DataStage Parallel Extender?
3) What are Oconv() and Iconv() functions and where are they used?
4) What is the aggregate cache in the Aggregator transformation?
5) What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run a job?
6) How do you rename all of the jobs to support your new file-naming conventions?
7) How do you merge two files in DS?
8) How did you handle an 'Aborted' sequencer?
9) What performance tunings have you done in your last project to increase the performance of slowly running jobs?
10) If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?
11) How can you do an incremental load in DataStage?
12) What is a full load and an incremental (refresh) load?
13) What are the different types of Type 2 dimension mapping?

3.
1) DataStage architecture?
a) It is a client-server architecture, with the client components being Administrator, Designer, Manager and Director, and the server components being the DS server, repository and package installer.
2) How do you create a project?
a) Through DataStage Administrator; location (C:\Ascential\DataStage\Projects\).
3) How many projects can you create at maximum?
a) It depends upon the license keys.
4) How do you create users and give permissions?
a) Through the Administrator. The permissions are: DataStage Operator, DataStage Production Manager, DataStage Developer, None (you should know the roles for each of them).
5) What are the permissions available in Administrator?
a) DataStage Operator: schedule and run the jobs. DataStage Production Manager: full access to DataStage. DataStage Developer: create and modify jobs, debug, run, schedule, import and export; can release locks but cannot create protected projects.
6) Is it possible for an operator to view the full log information?
a) Yes, but only if the administrator gives the operator the permission (i.e. by checking the box "DataStage operator can view full log").
7) Tell me the types of jobs (active or passive, also ODBC and plug-ins).
a) Server jobs, parallel jobs, mainframe jobs and job sequences.
8) How do you look up through a sequential file?
a) One cannot do a lookup through a sequential file.
9) What is a stage variable?
a) A stage variable is a variable which executes locally within the stage.
10) What does a constraint do?
a) A constraint is like a filter condition which is used to limit the records depending upon the business logic.
11) What does a derivation do?
a) A derivation is like an expression which is used to derive some value from the input columns, and also to modify the input columns according to the business needs (explain with an example).
12) Tell me the sequence of execution (stage variable, constraint, derivation).
a) Stage variables, then constraints, then derivations (explain with an example).
13) Why do you use a hash file?
a) Primarily used as a lookup.
14) Difference between a hash file and a sequential file?
a) A hash file is a file with one or more key fields. (What about sequential files — even they can be key based, right? No, a sequential file cannot be used that way.)
15) Name some types of sequential files.
a) Fixed-width and delimited files.
16) What is the size of your hash file?
a) By default the size of the hash file is 128 MB, which is specified in the Administrator under the Tunables tab, and it has a maximum value of 999 MB.
17) How do we calculate the size of our hash files?
a) One way is through the Hash Calculator, based on some equations.
18) What is a hash algorithm?
a) It is a property set for dynamic hash files in order to determine the way in which the hash function is applied to incoming records and spreads the records into multiple groups (SEQ.NUM or GENERAL; read the Hashed File stage documentation from page 25).
19) How many types of hash files are available?
a) Static, of which there are 18 types, whereas in dynamic we have type 30.
20) Which type of hash file do you use, and why?
a) The type 30 file (dynamic), because of incremental data loads.
21) How do you create a hash file?
a) CREATE.FILE <filename> DYNAMIC, or through DS Designer using the create-file option.
22) How do you specify the hash file?
a) We need to define key columns while specifying the hash file, and there is no limit on the number of key columns (check this out).
23) Is it possible to view the records in a hash file through any editor; if yes, which editor?
a) Through the Data Browser in DataStage Designer.
24) What is the extension of a hash file?
a) The .30 extension (do confirm).
25) Is it possible to create a hash file containing all the columns of a normal sequential file (without key columns)?
a) No.
26) Difference between static and dynamic hash files?
a) A dynamic hash file allocates memory (file size) dynamically, whereas a static hash file does not grow beyond the specified size.
27) Tell me the different types of stages.
a) Active and passive stages.
28) Difference between active and passive stages?
a) Active stages are those which do some processing (sort, aggregator, transformer, pivot), whereas passive stages are those which do not do any processing (sequential file, ODBC).
29) Is it possible to check a constraint at active stages; if yes, how?
a) Yes, through a Transformer stage.
30) Where do you define the constraint?
a) In the Transformer (the constraint entry field on the output link).
31) What is a job parameter, and where do you define it?
a) A job parameter is a parameter through which run-time details can be manipulated. It can be defined through the Designer at job level (explain with some examples).
32) What is an environment variable, and where do you define it?
a) Environment variables can be defined in the Administrator (explain with some examples). Environment variables are like global variables which can be used across the project.
33) Difference between a job parameter, an environment variable and a stage variable?
a) An environment variable is one through which one can define project-wide defaults.
A job parameter is one through which one can override any previous defaults, and it applies to the particular job. A stage variable is one which is executed locally within the active stage (explain with some examples).
34) While running a job, is it possible to control another job through a stage (not job control coding); if yes, how, and which stages support it?
a) Before/after subroutines; through job control also (do explain).
35) Have you written job control; what is its use?
a) To control the running of jobs and the status of jobs, and to implement job logic within it (explain with some examples).
36) How do you attach a job in job control?
a) DSAttachJob — it is a utility function (explain with some examples).
37) How do you set a job parameter in job control?
a) DSSetParam(JobHandle, ParameterName, Value) — this allows one to set parameters (explain with some examples; see the sketch below).
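As a rough illustration of questions 35–37, here is a minimal job-control sketch in DataStage BASIC. The job name "LoadCustomers", its parameter and the file path are hypothetical; the DSAttachJob/DSSetParam/DSRunJob calls are the standard job-control interface mentioned above.

    * Attach a job, set a parameter, run it and check the result.
    hJob = DSAttachJob("LoadCustomers", DSJ.ERRFATAL)
    ErrCode = DSSetParam(hJob, "SourceFile", "/data/in/cust.dat")
    ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
    ErrCode = DSWaitForJob(hJob)                  ;* block until the job finishes
    Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
    If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
       Call DSLogWarn("LoadCustomers did not finish cleanly", "JobControl")
    End
    ErrCode = DSDetachJob(hJob)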
38) What is a routine?
a) Routines are pieces of code which can be executed before or after a job to trigger some activities.
39) Different types of routines?
a) Transform routines and before/after routines (explain with some examples).
40) What is the use of a routine?
a) To trigger some activities.
41) Where are the routines stored?
a) Routines are stored in the Routines branch of the DataStage Repository.
42) How many windows are shown in DS Designer; what are they?
a) The Designer window, the repository and the palette.
43) What are the uses of the Transformer, Aggregator, Pivot and Sort stages?
a) The Transformer allows one to transform data in the stage, i.e. to modify incoming records, apply filter rules and transform data. The Aggregator allows one to do internal sorting and also to aggregate records based on groups. The Pivot allows one to change data from vertical to horizontal.
44) What is the use of the Merge stage?
a) Merge allows us to merge two sequential files into one or more output links. A Merge stage is a passive stage that can have no input links and one or more output links.
45) Is it possible to join more than two sequential files using the Merge stage; if no, is there any stage to solve this?
a) It is not possible to join more than two sequential files using the Merge stage, but it is possible through the Link Collector, provided the metadata for the sources are the same. (What if we have to join two or more sequential files with different metadata?)
46) Name all the join types.
a) Inner join, right outer, left outer, full outer join (7 types; see the Merge stage).
47) How do you extract data from a database?
a) Through ODBC and OCI.
48) Name all the update actions.
a) There are 8 update actions (found on the input link of the ODBC stage).
49) In job control, which language is used?
a) BASIC.
50) Is it possible to call a BASIC program which is written externally and use it in DS?
a) We can call a BASIC program, but how to do so was not known here.
51) Is it possible to run a job in DS Designer; if yes, how?
a) Yes: in version 5.1, run through the debugger; in version 7, run directly.
52) Is it possible to look up a lookup hash file? (What are the implications of this — intermediate existence of hash files?)
a) It is not possible to look up a lookup hash file.
53) What does the Director do?
a) Runs the jobs, scheduling, viewing logs, job locks, job resources, job report in XML, status.
54) Have you scheduled a job? How?
a) Yes, through DataStage Director, daily.
55) A job is running, but I would like to stop it. What are the ways to stop the job?
a) Through the Director, through Cleanup Resources in the Director, or with dsjob.
56) What is the use of the log file?
a) To check on the execution of the job and to track down warnings and errors.
57) Describe Cleanup Resources and Clear Status File.
a) Cleanup Resources allows one to remove locks or kill jobs. Clear Status File clears the last run status of the job and resets the job status to "Has been reset".
58) Situations wherein there is a need to clear the status file?
a) …
59) Are these enabled in DS Director; if not, how do you enable them?
a) By default Cleanup Resources and Clear Status File are not enabled in the Director; one can enable them through the Administrator by checking "Enable job administration in Director" on the General tab.
60) Tell me the types of scheduling.
a) Today, tomorrow, next, every, daily.
61) How do you find the number of rows per second in DS Director?
a) In the Designer we can do it by choosing "view performance statistics", but in the Director it is through Tools > New Monitor.
62) How do you know the job status?
a) Through the Director (status).
63) What is the difference between a warning and a fatal message?
a) Warnings do not abort the job, whereas fatal messages abort the job.
64) In the log file, what do the control and info messages show?
a) They show the time when the execution of the job started, information about individual jobs and their status, and warnings and fatal errors.
65) What is a phantom error; how do you resolve it?
a) (Ask Sakthi — encountered while doing routines.)
66) What other errors have you faced?
a) See the error documentation.
67) What is the difference between running and validating a job?
a) Run is to execute the job, whereas validate is to check for errors such as whether files exist and ODBC connections work.
68) What is the use of DS Manager?
a) DS Manager is used to edit and manage the contents of the repository, for example to create or edit routines, and to export and import jobs or the entire project.
69) How do you import/export the project?
a) Through DataStage Manager.
70) What is metadata; where is it stored?
a) Metadata is data about data — table definitions — and it is stored as table definitions in the repository.
71) How do you write a routine?
a) We can write a routine by going to the Routines category in DS Manager and selecting the Create Routine option; routines are written in BASIC (a minimal sketch follows).
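A minimal sketch of a transform-function body, assuming a routine created in DS Manager with a single argument Arg1; the routine's purpose and logic here are hypothetical. The result is returned by assigning to Ans:

    * Return "Y" if the argument is a non-empty string, "N" otherwise.
    If Len(Trim(Arg1)) > 0 Then Ans = "Y" Else Ans = "N"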
72) What is the use of releasing a job?
a) Releasing a job is significant to clean up the resources of a job which is locked or idle.
73) What is the use of table definitions in Manager?
a) Table definitions depict the metadata of a table; we can use DS Manager to edit or create table definitions.
74) What is the difference between a local container and a shared container?
a) A local container can be used within the job itself and does not appear in the repository window, whereas shared containers are available throughout the project and appear in the repository window.
75) What are containers?
a) Containers are a collection of grouped stages and links which can be reused (shared container).
76) Difference between an Annotation and a Description Annotation?
a) Annotations are short or long descriptions; we can have multiple annotations in a job and they can be copied to other jobs, whereas we can have only one Description Annotation per job and it cannot be copied into other jobs.
77) What are the advantages of the Description Annotation?
a) The advantage of the Description Annotation is that it is automatically reflected in the Manager and the Director.
78) What are the various compilation and run-time errors you have faced?
a)
79) Explain the "allow stage write cache" option for hash files and its implications.
a) It caches the hash file in memory. You should not use this option when reading from and writing to the same hash file.
80) What are the caching properties while creating hash files?
81) Where do you specify the size of your hash file?
a) In the Administrator under the Tunables tab; the default size is 128 MB and the maximum is 999 MB.

4.
Just thought of sharing a good collection of DataStage interview questions; it may help someone who is learning. Each question is a kind of discussion thread, so it definitely helps to understand the concept rather than just reading them.
http://www.geekinterview.com/Interview-Questions/DataWarehouse/DataStage
http://www.geekinterview.com/
1. Dimension modelling types along with their significance
Data modelling is broadly classified into 2 types: a) E-R diagrams (entity-relationship); b) dimensional modelling.
2. Dimensional modelling is again subdivided into 2 types
a) Star schema — simple and much faster; denormalized form. b) Snowflake schema — complex, with more granularity; more normalized form.
3. Importance of the surrogate key in data warehousing?
A surrogate key is a primary key for a dimension table. Its main importance is that it is independent of the underlying database, i.e. a surrogate key is not affected by the changes going on in the database.
4. Differentiate database data and data warehouse data
Data in a database is a) detailed or transactional, b) both readable and writable, c) current.
5. What is the flow of loading data into fact and dimension tables?
Fact table: a table with a collection of foreign keys corresponding to the primary keys in the dimension tables; consists of fields with numeric values. Dimension table: a table with a unique primary key.
6. Orchestrate Vs DataStage Parallel Extender?
Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on UNIX platforms. DataStage used Orchestrate with DataStage XE (beta version of 6.0) to incorporate the parallel processing capabilities.
7. Differentiate primary key and partition key
A primary key is a combination of unique and not null. It can be a collection of key values, called a composite primary key. A partition key is just a part of the primary key. There are several methods of partitioning.
8. How do you execute a DataStage job from the command line prompt?
Using the "dsjob" command as follows: dsjob -run -jobstatus projectname jobname
9. What are stage variables, derivations and constants?
Stage variable: an intermediate processing variable that retains its value during a read and does not pass the value into a target column. Derivation: an expression that specifies the value to be passed on to the target column. (A sketch of how these look in a transformer follows below.)
10. What is the default cache size? How do you change the cache size if needed?
The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab and specifying the cache size there.
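As a rough sketch of how the three differ in a server-job transformer (question 9 above); all link, column and variable names here are hypothetical:

    Stage variable svFullName:  Trim(in.FIRST_NAME) : " " : Trim(in.LAST_NAME)
    Constraint (output link):   in.STATUS = "A" And Not(IsNull(in.CUST_ID))
    Derivation (FULL_NAME):     svFullName

The stage variable is evaluated once per row and can be reused in several derivations; the constraint decides whether the row flows down the link at all.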
5.
1) Can we use a shared container as a lookup in DataStage server jobs?
2) If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?
3) If you are running 4-way parallel and you have 10 stages on the canvas, how many processes does DataStage create?
4) Does Enterprise Edition only add the parallel processing for better performance? Are any stages/transformations different?
A) DataStage Standard Edition was previously called DataStage and DataStage Server Edition. DataStage Enterprise Edition was originally called Orchestrate, then renamed to Parallel Extender when purchased by Ascential. Designed originally for Unix, it now supports Windows, Linux and Unix System Services on mainframes. The offerings:
• DataStage Enterprise: server jobs, parallel jobs, sequence jobs.
• DataStage Enterprise MVS: server jobs, parallel jobs, sequence jobs, MVS jobs.
MVS jobs are designed using an alternative set of stages that are generated into COBOL/JCL code and are transferred to a mainframe to be compiled and run; the jobs are developed on a Unix or Windows server. The Enterprise edition offers parallel processing features for scalable, high-volume solutions. The first two versions share the same Designer interface but have a different set of design stages depending on the type of job you are working on. Server jobs only accept server stages; MVS jobs only accept MVS stages; parallel jobs have parallel stages but also accept some server stages via a container. There are some stages that are common to all types (such as aggregation), but they tend to have different fields and options within each stage type.
5) How can you implement complex jobs in DataStage?
A) What do you mean by complex jobs? If you use more than 15 stages in a job, and 10 lookup tables, then you can call it a complex job.
6) Can you join a flat file and a database?
A) Yes, we can do it in an indirect way. First create a job which can populate the data from the database into a sequential file, and name it Seq_First1. Take the flat file and use a Merge stage to join the two files. You have various join types in the Merge stage, like pure inner join, left outer join, right outer join, etc.; you can use whichever one suits your requirements.
7) What is troubleshooting in server jobs? What are the different kinds of errors encountered while running?
8) What is meant by "Try to have the constraints in the 'Selection' criteria of the jobs itself; this will eliminate the unnecessary records even getting in before joins are made"?
A) It probably means that you can put the selection criteria in the WHERE clause, i.e. whatever data you need to filter, filter it out in the SQL rather than carrying it forward and then filtering it out.
9) What are DataStage multi-byte and single-byte file conversions? How do we use those conversions in DataStage?
10) What is the difference between server jobs and parallel jobs?
A) Server jobs: these are available if you have installed DataStage Server. They run on the DataStage server, connecting to other data sources as necessary. Parallel jobs: these are only available if you have installed Enterprise Edition. They run on DataStage servers that are SMP, MPP or cluster systems; they can also run on a separate z/OS (USS) machine if required.
11) What is merge, and how do you use it?
A) Merge is a stage that is available in both parallel and server jobs. The Merge stage is used to join two tables (server/parallel) or two tables/datasets (parallel). Merge is performed on a key field, and the key field is mandatory in the master and update dataset/table. Merge requires that the master table/dataset and the update table/dataset be sorted. In server jobs, the Merge stage is used to merge two flat files.
12) How do we use the NLS function in DataStage? What are the advantages of the NLS function? Where can we use it? Explain briefly.
A) Dear User: as per the manuals and documents, we have different levels of interfaces — can you be more specific? Like the Teradata interface operators, DB2 interface operators, Oracle interface operators and SAS-interface operators. Orchestrate National Language Support (NLS) makes it possible for you to process data in international languages using Unicode character sets. International Components for Unicode (ICU) libraries support NLS functionality in Orchestrate. By using NLS functions we can: use local formats for dates, times and money; sort data according to local rules; process data in a wide range of languages. Operator NLS functionality: Teradata interface operators, switch operator, filter operator, DB2 interface operators, Oracle interface operators, SAS-interface operators, transform operator, modify operator, import and export operators, generator operator. Should you need any further assistance please let me know. You can email me at venkatdba2000@yahoo.com or venkata.veluri@gmail.com. Regards, Venkat
The DataStage client components are: Administrator — administers DataStage projects and conducts housekeeping on the server; Designer — creates DataStage jobs that are compiled into executable programs; Director — used to run and monitor the DataStage jobs; Manager — allows you to view and edit the contents of the repository.
13) What is APT_CONFIG in DataStage?
A) The APT_CONFIG_FILE (not just APT_CONFIG) is the configuration file that defines the nodes (the scratch area, temp area) for the specific project. DataStage understands the architecture of the system through this file: for example, it contains information on node names, disk storage, etc. APT_CONFIG is just an environment variable used to point to the *.apt file that holds the node information and the configuration of the SMP/MPP server; don't confuse the variable with the *.apt file itself (a sample file is sketched below).
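For question 13, this is a minimal sketch of what a one-node configuration file commonly looks like, shown in the *.apt file's own syntax; the node name, host name and paths are hypothetical:

    {
       node "node1" {
          fastname "etl_host"
          pools ""
          resource disk "/u1/datasets" {pools ""}
          resource scratchdisk "/u1/scratch" {pools ""}
       }
    }

Adding further node blocks is what gives a parallel job more partitions to run across.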
14) What is merge and how can it be done? Please explain with a simple example taking 2 tables.
A) Merge is used to join two tables. It takes the key columns and sorts them in ascending or descending order. Let us consider two tables, Emp and Dept. If we want to join these two tables, we have DeptNo as a common key, so we can give that column name as the key, sort DeptNo in ascending order, and join the two tables.
15) What is version control?
A) Version Control i) stores different versions of DS jobs, ii) runs different versions of the same job, iii) reverts to a previous version of a job, iv) views version histories.
16) What are the repository tables in DataStage, and what are they?
A) Dear User: a data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad-hoc, analytical, historical or complex queries. Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions and process/method descriptions; metadata includes things like the name, length, valid values and description of a data element. Metadata is stored in a data dictionary and repository; it insulates the data warehouse from changes in the schema of operational systems. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries and navigation services. In DataStage, under I/O and Transfer (on the interface tab, the input, output and transfer pages have 4 tabs), the last one is Build, under which you can find the TABLE NAME. Regarding NLS: for server jobs, NLS is implemented in the DataStage Server engine; for parallel jobs, NLS is implemented using the ICU library. If NLS is installed, various extra features appear in the product. Should you need any further assistance please revert to venkatdba2000@yahoo.com or venkata.veluri@gmail.com. Regards, Venkat
17) Where does a Unix script of DataStage execute — on the client machine or on the server?
A) DataStage jobs are executed on the server machines only; there is nothing that is stored on the client machine.
18) Default nodes for DataStage parallel edition?
A) The default is always one node. Actually the number of nodes depends on the number of processors in your system: if your system supports two processors, we get two nodes by default.
19) What happens if RCP is disabled?
A) Runtime column propagation (RCP): if RCP is enabled for a job, and specifically for those stages whose output connects to a shared container input, then metadata will be propagated at run time, so there is no need to map it at design time. If RCP is disabled for the job, OSH has to perform import and export every time the job runs, and the processing time of the job is also increased.
20) I want to process 3 files sequentially, one by one; how can I do that? While processing, the job should fetch the files automatically.
21) Scenario-based question: suppose 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). If job 1 has 10,000 rows and, after the run, only 5,000 rows have been loaded into the target table, the rest are not loaded and the job aborts — how can you sort out the problem?
A) Suppose the job sequencer synchronizes or controls the 4 jobs but job 1 has a problem. In this situation you should go to the Director and check what type of problem is showing: a data-type problem, a warning message, job failed, or job aborted. If the job failed, it means a data-type problem or a missing column action. So go to the Run window and click Tracing > Performance, or in your target table, under General > Action, select one of two options: (i) On Fail — Commit, Continue; (ii) On Skip — Commit, Continue. First check how much data has already loaded, then select the On Skip option and continue; for the remaining data that was not loaded, select On Fail. Run the job again and you should definitely get a success message.
22) What is a batch program and how do you generate one?
A) A batch program is a program generated at run time and maintained by DataStage itself, but you can easily change it on the basis of your requirement (Extraction, Transformation, Loading). Batch programs are generated depending on your job's nature, either a simple job or a sequencer job; you can see this program under the job control option.
23) What is the difference between DataStage and Informatica?
A) There are very good articles on these differences, which help to get an idea:
http://www.dmreview.com/article_sub.cfm?articleId=4306
24) Importance of the surrogate key in data warehousing?
A) A surrogate key is a primary key for a dimension table. These are system-generated keys — mainly just sequences of numbers, though they can be alphanumeric values also. The concept of a surrogate key comes into play when there are slowly changing dimensions in a table: in such a condition there is a need for a key by which we can identify the changes made in the dimensions. These slowly changing dimensions can be of three types, namely SCD1, SCD2 and SCD3; the surrogate key is used in order to keep track of changes in the primary key. Its main importance is that it is independent of the underlying database, i.e. a surrogate key is not affected by the changes going on in the database.
25) What's the difference between DataStage developers and DataStage designers? What are the skills required for each?
A) A DataStage developer is one who codes the jobs, while a DataStage designer is one who designs the jobs: he deals with blueprints and designs the stages that are required in developing the code.
26) How do we automate DS jobs?
A) "dsjob" can be automated by using shell scripts on a UNIX system. We can call a DataStage batch job from the command prompt using 'dsjob', and we can also pass all the parameters from the command prompt. Then call this shell script in any of the schedulers available on the market. A second option is to schedule these jobs using DataStage Director.
27) What is DS Manager used for — did you use it?
A) The Manager is a graphical tool that enables you to view and manage the contents of the DataStage Repository. DataStage Manager is used for export and import purposes: the main use of export and import is sharing jobs and projects from one project to another.
. @FALSE The compiler replaces the value with 0. See the Date function. @INROWNUM Input row counter. The Duplicates can be eliminated by loading thecorresponding data in the Hash file. Use "Duplicate Data Removal" stage or 2. 30) What about System variables? A) DataStage provides a set of variables containing useful system information that you can access from a transform or routine.datastage maneger is used to export and import purpose [/B] main use of export and import is sharing the jobs and projects one project to other project. Default Hased file is "Dynamic .Type Random 30 D" 29) How do you eliminate duplicate rows? A) Delete from from table name where rowid not in(select max/min(rowid)from emp group by column name) Data Stage provides us with a stage Remove Duplicates in Enterprise edition. @DATE The internal date when the program started. @MONTH The current extracted from the value in @DATE. Char(254). removal of duplicates done in two ways: 1. @IM An item mark. For use in constrains and derivations in Transformer stages. Specify the columns on which u want to eliminate as the keys of hash. System variables are readonly. For use in derivations in Transformer stages. use group by on all the columns used in select .Sub divided into 17 types based on Primary Key Pattern. b) Dynamic . Using that stage we can eliminate the duplicates based on a key column. @DAY The day of the month extracted from the value in @DATE. @LOGNAME The user login name.sub divided into 2 types i) Generic ii) Specific. duplicates will go away. @FM A field mark. @OUTROWNUM Output row counter (per link). @NULL The null value. 28) What are types of Hashed File? A) Hashed File is classified broadly into 2 types. a) Static . Char(255).
31) What is DS Designer used for — did you use it?
A) You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data source through to the target warehouse. The Designer's graphical interface lets you select stage icons, drop them onto the Designer work area, and add links.
32) What is DS Administrator used for — did you use it?
A) The Administrator enables you to set up DataStage users, control the purging of the Repository and, if National Language Support (NLS) is enabled, install and manage maps and locales.
33) How do you create batches in DataStage from the command prompt?
34) Dimensional modelling is again subdivided into 2 types.
A) a) Star schema — simple and much faster; denormalized form. b) Snowflake schema — complex, with more granularity; more normalized form.
35) How will you call an external function or subroutine from DataStage?
A) There is a DataStage option to call external programs: ExecSH.
36) How do you pass a filename as a parameter to a job?
A) During job development we can create a parameter 'FILE_NAME', and the value can be passed while running the job. 1. Go to DataStage Administrator > Projects > Properties > Environment > User Defined; here you can see a grid where you can enter your parameter name and the corresponding path of the file. 2. Go to the Stage tab of the job, select the NLS tab, click "Use Job Parameter" and select the parameter name which you gave above; keep the project default in the text box. The selected parameter name appears in the text box beside the "Use Job Parameter" button. Copy the parameter name from the text box and use it in your job.
37) How do you handle date conversions in DataStage, e.g. convert mm/dd/yyyy format to yyyy-dd-mm?
A) We use a) the "Iconv" function for internal conversion and b) the "Oconv" function for external conversion. A function to convert mm/dd/yyyy format to yyyy-dd-mm is Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YDM[4,2,2]") (a worked example follows below).
38) What is the difference between an operational data store (ODS) and a data warehouse?
A) A data warehouse is a decision-support database for organisational needs: a subject-oriented, integrated, non-volatile, time-variant collection of data. An ODS (operational data store) is an integrated collection of related information; it contains a maximum of about 90 days of information.
39) When should we use an ODS?
40) How can we create containers?
A) There are two types of containers: 1. local containers, 2. shared containers. A local container is available to that particular job only, whereas shared containers can be used anywhere in the project. Local container: Step 1: select the stages required; Step 2: Edit > Construct Container > Local. Shared container: Step 1: select the stages required; Step 2: Edit > Construct Container > Shared. Shared containers are stored in the SharedContainers branch of the tree structure.
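A small worked example of the conversion pair from question 37 in BASIC; the literal date is hypothetical. Iconv turns an external date into the internal day number, and Oconv formats it back out:

    InternalDay = Iconv("10/14/2005", "D/MDY[2,2,4]")   ;* mm/dd/yyyy -> internal format
    OutDate = Oconv(InternalDay, "D-YDM[4,2,2]")        ;* -> "2005-14-10" (yyyy-dd-mm)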
41) How can we improve the performance of DataStage jobs?
A) Performance and tuning of DS jobs: 1. Establish baselines. 2. Avoid the use of only one flow for tuning/performance testing. 3. Work in increments. 4. Evaluate data skew. 5. Isolate and solve. 6. Distribute file systems to eliminate bottlenecks. 7. Do not involve the RDBMS in initial testing. 8. Understand and evaluate the tuning knobs available.
42) What are job parameters?
A) These parameters are used to provide administrative access and to change run-time values of a job. Under Edit > Job Parameters, in the Parameters tab, we can define the name, prompt, type and value.
43) What is the difference between a routine, a transform and a function?
44) How can we implement a lookup in DataStage server jobs?
A) By using hashed files you can implement the lookup in DataStage; hashed files store data based on a hash algorithm and key values.
45) How can we join one Oracle source and a sequential file?
A) The Join and Lookup stages can be used to join an Oracle source and a sequential file.
46) What are the Iconv and Oconv functions?
47) What is the difference between a hashed file and a sequential file?
A) A hashed file stores data based on a hash algorithm and a key value, and is used as a reference for lookups; a sequential file is just a file with no key column, and cannot be.
48) How do you rename all of the jobs to support your new file-naming conventions?
49) Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a TRUNCATE statement to the DB, or does it do some kind of DELETE logic?
A) There is no TRUNCATE on ODBC stages: it is "Clear table…", and that is a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions (TRUNCATE requires you to have ALTER TABLE permission, whereas DELETE doesn't).
50) The above might raise another question: why do we have to load the dimension tables first, then the fact tables?
A) As we load the dimension tables, the (primary) keys are generated, and these keys are foreign keys in the fact tables.
51) How will you determine the sequence of jobs to load into the data warehouse?
A) First we execute the jobs that load the data into the dimension tables, then the fact tables, then the aggregator tables (if any).
52) What are the command-line functions that import and export the DS jobs?
A) A. dsimport.exe imports the DataStage components. B. dsexport.exe exports the DataStage components.
53) What utility do you use to schedule the jobs on a UNIX server, other than using Ascential Director?
A) Use the crontab utility along with the dsexecute() function, with proper parameters passed. "AUTOSYS": through AutoSys you can automate the job by invoking the shell script written to schedule the DataStage jobs.
54) What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run the job?
A) A. Under Windows: use the 'WaitForFileActivity' stage under the sequencers and then run the job; you could schedule the sequencer around the time the file is expected to arrive. B. Under UNIX: poll for the file; once the file has arrived, start the job or sequencer depending on the file.
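One hedged way to do the UNIX-side polling from question 54 is a small BASIC loop in a before-job subroutine or job-control routine; the path and interval are hypothetical, and OpenSeq's Else clause fires while the file is still absent:

    Found = @FALSE
    Loop Until Found Do
       OpenSeq "/data/in/cust.dat" To FileVar Then
          CloseSeq FileVar
          Found = @TRUE     ;* file has arrived, fall through and run the load
       End Else
          Sleep 60          ;* wait a minute, then try again
       End
    Repeat

In practice you would also want a retry limit so an overnight run cannot wait forever.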
55) Read the string functions in DS.
A) Functions like the substring operator and ':' (the concatenation operator). Syntax: string [ [ start, ] length ] and string [ delimiter, instance, repeats ] (a sketch follows).
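A short BASIC sketch of the substring and concatenation operators from question 55; the values are hypothetical:

    Phone = "3125559876"
    Area  = Phone[1, 3]                   ;* -> "312"     (start, length)
    Local = Phone[4, 7]                   ;* -> "5559876"
    Pretty = "(" : Area : ") " : Local    ;* -> "(312) 5559876"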
56) What are sequencers?
A) Sequencers are job-control programs that execute other jobs with preset job parameters.
57) How did you handle an 'Aborted' sequencer?
A) In almost all cases we have to delete the data inserted by it from the DB manually, fix the job, and then run the job again.
59) What other performance tunings have you done in your last project to increase the performance of slowly running jobs?
A) 1. Staged the data coming from ODBC/OCI/DB2UDB stages, or any database on the server, using hash/sequential files, for optimum performance and also for data recovery in case a job aborts.
2. Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
3. Tuned the 'Project Tunables' in Administrator for better performance.
4. Used sorted data for the Aggregator.
5. Sorted the data as much as possible in the DB and reduced the use of DS-Sort, for better performance of jobs.
6. Removed data not used from the source as early as possible in the job.
7. Worked with the DB admin to create appropriate indexes on tables, for better performance of DS queries; drop indexes before data loading and recreate them after loading data into the tables.
8. Converted some of the complex joins/business logic in DS to stored procedures, for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, use standard logic to run jobs in parallel; if possible, break the input into multiple threads and run multiple instances of the job.
10. Before writing a routine or a transform, make sure the required functionality is not already in one of the standard routines supplied in the sdk or ds utilities categories. Also, check the order of execution of the routines.
11. Constraints are generally CPU-intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal. Try to have the constraints in the 'Selection' criteria of the jobs itself; this will eliminate unnecessary records even getting in before joins are made. Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….
12. Try not to use a Sort stage when you can use an ORDER BY clause in the database.
13. Make every attempt to use the bulk loader for your particular database; bulk loaders are generally faster than using ODBC or OLE.
14. Tuning should occur on a job-by-job basis; use the power of the DBMS.
15. Minimise the usage of Transformers (instead use Copy, Modify, Filter, Row Generator where possible).
16. Use SQL code while extracting the data.
17. Handle the nulls.
18. Minimise the warnings.
19. Reduce the number of lookups in a job design; don't use more than 7 lookups in the same transformer, and introduce new transformers if it exceeds 7. Generally we cannot avoid lookups if our requirements compel us to do them.
20. Use not more than 20 stages in a job. There is no hard limit of 20 or 30 stages, but we can break the job into small jobs and use Dataset stages to store intermediate data.
21. Use an IPC stage between two passive stages — it reduces processing time. (The IPC stage is provided in server jobs, not in parallel jobs.)
22. Check the write cache of the hash file. If the same hash file is used for lookup as well as being the target, disable this option.
23. If the hash file is used only for lookup, then enable 'Preload to memory' — this will improve the performance. Use 'Write to cache' on the hash file input and 'Preload to memory' on the hash file output. Cache the hash files you are reading from and writing into, and make sure your cache is big enough to hold the hash files.
24. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files; this would also minimize overflow on the hash file.
25. Reduce the width of the input record — remove the columns that you would not use.
26. Write into the error tables only after all the transformer stages.
60) How did you handle reject data?
A) Typically a reject link is defined, and the rejected data is loaded back into the data warehouse; a reject link has to be defined on every output link from which you wish to collect rejected data. Rejected data is typically bad data, like duplicates of primary keys or null rows where data is expected.
61) What are routines, and where/how are they written? Have you written any routines before?
A) Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them using the Routine dialog box. The following program components are classified as routines:
• Transform functions. These are functions that you can use when defining custom transforms. DataStage has a number of built-in transform functions, located in the Routines > Examples > Functions branch of the Repository. You can also define your own transform functions in the Routine dialog box.
• Before/after subroutines. When designing a job, you can specify a subroutine to run before or after the job, or before or after an active stage. DataStage has a number of built-in before/after subroutines, located in the Routines > Built-in > Before/After branch of the Repository. You can also define your own before/after subroutines using the Routine dialog box.
• Custom UniVerse functions. These are specialized BASIC functions that have been defined outside DataStage. Using the Routine dialog box, you can get DataStage to create a wrapper that enables you to call these functions from within DataStage. These functions are stored under the Routines branch in the Repository; you specify the category when you create the routine. If NLS is enabled, you should be aware of any mapping requirements when using custom UniVerse functions: if a function uses data in a particular character set, it is your responsibility to map the data to and from Unicode.
• ActiveX (OLE) functions. You can use ActiveX (OLE) functions as programming components within DataStage. Such functions are made accessible to DataStage by importing them; this creates a wrapper that enables you to call the functions. After import, you can view and edit the BASIC wrapper using the Routine dialog box. By default, such functions are located in the Routines > Class name branch in the Repository, but you can specify your own category when importing the functions. When using the Expression Editor, all of these components appear under the DS Routines… command on the Suggest Operand menu.
A special case of routine is the job control routine. Such a routine is used to set up a DataStage job that controls other DataStage jobs, which are stored in the Repository and can be used by other DataStage jobs. Job control routines are specified on the Job control page of the Job Properties dialog box; they are not stored under the Routines branch in the Repository.
Transforms. Transforms are stored in the Transforms branch of the DataStage Repository, where you can create, view or edit them using the Transform dialog box. Transforms specify the type of data transformed, the type it is transformed into, and the expression that performs the transformation. DataStage is supplied with a number of built-in transforms (which you cannot edit), and you can also define your own custom transforms. When using the Expression Editor, the transforms appear under the DS Transform… command on the Suggest Operand menu.
Functions. Functions take arguments and return a value. The word "function" is applied to many components in DataStage:
• BASIC functions. These are one of the fundamental building blocks of the BASIC language. When using the Expression Editor, you can access the BASIC functions via the Function… command on the Suggest Operand menu.
• DataStage BASIC functions. These are special BASIC functions that are specific to DataStage, mostly used in job control routines. DataStage functions begin with DS to distinguish them from general BASIC functions. When using the Expression Editor, you can access the DataStage BASIC functions via the DS Functions… command on the Suggest Operand menu.
The following items, although called "functions", are classified as routines and are described under "Routines" above: transform functions, custom UniVerse functions, ActiveX (OLE) functions.
Expressions. An expression is an element of code that defines a value. The word "expression" is used both as a specific part of BASIC syntax and to describe portions of code that you can enter when defining a job. Areas of DataStage where you can use such expressions are: defining breakpoints in the debugger; defining column derivations, key expressions and constraints in Transformer stages; and defining a custom transform. In each of these cases the DataStage Expression Editor guides you as to what programming elements you can insert into the expression.
62) What are Oconv() and Iconv() functions, and where are they used?
A) Iconv() converts a string to an internal storage format; Oconv() converts an expression to an output format.
63) Explain the differences between Oracle 8i and 9i.
64) What are static hash files and dynamic hash files?
A) As the names themselves suggest what they mean: in general we use type-30 dynamic hash files. The data file has a default size of 2 GB, and the overflow file is used if the data exceeds the 2 GB size.
65) Have you ever been involved in updating DS versions, like DS 5.X? If so, tell us some of the steps you took.
A) Yes. The following are some of the steps I took in doing so:
1) Definitely take a backup of the whole project(s) by exporting the project as a .dsx file.
2) See that you use the same parent folder for the new version, so that your old jobs using hard-coded file paths still work.
3) After installing the new version, import the old project(s); you will have to compile them all again. You can use the 'Compile All' tool for this.
4) Make sure that all your DB DSNs are created with the same names as the old ones. This step is for moving DS from one machine to another.
5) In case you are just upgrading your DB from Oracle 8i to Oracle 9i, there is a tool on the DS CD that can do this for you.
6) Do not stop the 6.0 server before the upgrade; the version 7.0 install process collects project information during the upgrade. There is NO rework (recompilation of existing jobs/routines) needed after the upgrade.
66) Did you parameterize the job, or hard-code the values in the jobs?
A) Always parameterize the job; there is no way you should hard-code parameters into your jobs. Either the values come from Job Properties or from a 'Parameter Manager' — a third-party tool. The often-parameterized variables in a job are: DB DSN name, username, password, and dates w.r.t. the data to be looked up against.
67) Tell me the environment in your last projects.
A) Give the OS of the server and the OS of the client of your most recent project.
68) How do you catch bad rows from an OCI stage?
69) Suppose there are a million records; did you use OCI? If not, what stage do you prefer?
A) Using Orabulk.
70) How do you pass a parameter to the job sequence if the job is running at night?
A) Two ways: 1. Set the default values of the parameters in the job sequencer and map these parameters to the job. 2. Run the job in the sequencer using the dsjob utility, where we can specify the values to be taken for each parameter.
71) What is the order of execution done internally in the transformer, with the stage editor having input links on the left-hand side and output links on the right?
A) Stage variables, constraints, and column derivations or expressions.
72) Differentiate database data and data warehouse data.
A) Data in a database is a) detailed or transactional, b) both readable and writable, c) current.
73) Dimension modelling types along with their significance?
A) Data modelling is broadly classified into 2 types: 1) E-R diagrams (entity-relationship); 2) dimensional modelling — 2a) logical modelling, 2b) physical modelling.
74) What is the flow of loading data into fact and dimension tables?
A) Fact table: a table with a collection of foreign keys corresponding to the primary keys in the dimension tables; consists of fields with numeric values. Dimension table: a table with a unique primary key. Data should first be loaded into the dimension tables; the source can be the source systems or the ODS (Operational Data Store), which contains the transactional data. Based on the primary key values in the dimension tables, the data should then be loaded into the fact table.
75) Orchestrate Vs DataStage Parallel Extender?
A) Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on UNIX platforms. DataStage used Orchestrate with DataStage XE (beta version of 6.0) to incorporate the parallel processing capabilities. Ascential then purchased Orchestrate and integrated it with DataStage XE, releasing a new version, DataStage 6.0, i.e. Parallel Extender.
76) Differentiate primary key and partition key.
A) A primary key is a combination of unique and not null. It can be a collection of key values, called a composite primary key. A partition key is just a part of the primary key. There are several methods of partitioning, like hash, DB2, random, etc.; while using hash partitioning we specify the partition key.
77) How do you execute a DataStage job from the command line prompt?
A) Using the "dsjob" command, as follows: dsjob -run -jobstatus projectname jobname
78) What are stage variables, derivations and constants?
A) Stage variable: an intermediate processing variable that retains its value during a read and does not pass the value into a target column. Derivation: an expression that specifies the value to be passed on to the target column. Constant (constraint): a condition that is either true or false and that specifies the flow of data along a link.
79) What is the default cache size? How do you change the cache size if needed?
A) The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab and specifying the cache size there.
80) Compare and contrast ODBC and plug-in stages.
A) ODBC: a) poor performance; b) can be used for a variety of databases; c) can handle stored procedures. Plug-in: a) good performance; b) database-specific (only one database); c) cannot handle stored procedures.
81) How do you run a shell script within the scope of a DataStage job?
A) By using the "ExecSH" command in the before/after job properties (a sketch using the DSExecute interface follows).
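For question 81, a before/after routine can also shell out with the DSExecute interface routine; a minimal sketch with a hypothetical command:

    * Run a shell command and inspect its exit code and captured output.
    Call DSExecute("UNIX", "ls -l /data/in", Output, SysReturnCode)
    If SysReturnCode <> 0 Then
       Call DSLogWarn("Command failed: " : Output, "ShellDemo")
    End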
82) Types of parallel processing?
A) Parallel processing is broadly classified into 2 types: a) SMP — symmetric multiprocessing; b) MPP — massively parallel processing.
83) What does a config file in Parallel Extender consist of?
A) The config file consists of the following: a) the number of processes or nodes; b) the actual disk storage locations.
84) Functionality of Link Partitioner and Link Collector?
A) The Link Partitioner actually splits data into various partitions or data flows using various partition methods. The Link Collector collects the data coming from the partitions, merges it into a single data flow, and loads it to the target.
85) What are Modulus and Splitting in a dynamic hashed file?
A) In a hashed file, the size of the file keeps changing randomly. If the size of the file increases, it is called "Modulus"; if the size of the file decreases, it is called "Splitting".
86) Types of views in DataStage Director?
A) There are 3 types of views in DataStage Director: a) Job view — dates of jobs compiled; b) Log view — warning messages, event messages, program-generated messages; c) Status view — status of the job's last run.
87) What is 'insert for update' in DataStage?
88) How can we pass parameters to a job by using a file?
A) You can do this by passing parameters from a Unix file and then calling the execution of a DataStage job; the DS job has the parameters defined (which are passed by Unix).
I'm evaluating DataStage from Ascential and PowerCenter from Informatica; they seem to have pretty similar functionality. Can anyone tell me what the main differences between them are?
Ask The Experts, published in DMReview.com, November 20, 2001. Answers by Chuck Kelley, Les Barbusinski and Joyce Bischoff.

Chuck Kelley's Answer: You are right, they have pretty much similar functionality. However, what are the requirements for your ETL tool? Do you have large sequential files (1 million rows, for example) that need to be compared every day against yesterday's? If so, then ask how each vendor would do that. Think about what process they are going to use. Are they requiring you to load yesterday's file into a table and do lookups? If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do. If you are small enough in your data sets, then either would probably be OK.

Les Barbusinski's Answer: Without getting into specifics, here are some differences you may want to explore with each vendor:
• Does the tool use a relational or a proprietary database to store its meta data and scripts? If proprietary, why?
• What add-ons are available for extracting data from industry-standard ERP, Accounting, and CRM packages?
• Can the tool's meta data be integrated with third-party data modeling and/or business intelligence tools? If so, how and with which ones?
• How well does each tool handle complex transformations, and how much external scripting is required?
• What kinds of languages are supported for ETL script extensions?
Almost any ETL tool will look like any other on the surface; the trick is to find out which one will work best in your environment. The best way I've found to make this determination is to ascertain how successful each vendor's clients have been using their product, especially clients who closely resemble your shop in terms of size, industry, in-house skill sets, platforms, source systems, data volumes and transformation complexity. Ask both vendors for a list of their customers with characteristics similar to your own that have used their ETL product for at least a year. Then interview each client (preferably several people at each site) with an eye toward identifying unexpected problems, benefits, or quirkiness with the tool that have been encountered by that customer. Ultimately, ask each customer – if they had it all to do over again – whether or not they'd choose the same tool, and why. You might be surprised at some of the answers.

Joyce Bischoff's Answer: You should do a careful research job when selecting products. First document your requirements, identify all possible products and evaluate each product against the detailed requirements. There are numerous ETL products on the market, and it seems that you are looking at only two of them. If you are unfamiliar with the many products available, you may refer to www.tdan.com, the Data Administration Newsletter, for product lists. If you ask the vendors, they will certainly be able to tell you which of their product's features are stronger than the other product's; ask both vendors and compare the answers, which may or may not be totally accurate. After you are very familiar with the products, call their references and be sure to talk with technical people who are actually using the product. You will not want the vendor to have a representative present when you speak with someone at the reference site. It is also not a good idea to depend upon a high-level manager at the reference site for a reliable opinion of the product; managers may paint a very rosy picture of any selected product so that they do not look like they selected an inferior product.

DataStage PX questions
1. What is difference between server jobs & parallel jobs?
2. What is orchestrate?
3. Orchestrate Vs DataStage Parallel Extender?
4. What are the types of Parallelism?
5. Is Pipeline parallelism in PX the same as what Inter-process does in Server?
6. What are the partitioning methods available in PX?
7. What is Re-Partitioning? When will re-partitioning actually occur?
8. What are OConv() and IConv() functions and where are they used? Can we use these functions in PX?
9. What does a Configuration File in parallel extender consist of?
10. What is the difference between a file set and a data set?
11. Lookup Stage: is it persistent or non-persistent? (What is happening behind the scenes?)
12. How can we maintain the partitioning in the Sort stage?
13. Where do we need partitioning (in processing or somewhere else)?
14. If we use SAME partitioning in the first stage, which partitioning method will it take?
15. What is the symbol we will get when we are using the round robin partitioning method?
16. If we check preserve partitioning in one stage and don't give any partitioning method (Auto) in the next stage, which partition method will it use?
17. Can we give node allocations, i.e. 4 nodes for one stage and 3 nodes for the next stage?
18. What is combinability, non-combinability?
19. What are schema files?
20. Why do we need datasets rather than sequential files?
21. Does the look-up stage return multiple rows or single rows?
22. Why do we need the sort stage, given the sort-merge collecting method and the perform-sort option in the stage's advanced properties?
23. For the surrogate key generator stage, where will the next value be stored?
24. In the surrogate key generator stage, how does it generate the number? (Based on nodes or based on rows?)
25. What is the preserve partitioning flag in the Advanced tab?
26. What is the difference between stages and operators?
27. Why do we need filter, copy and column export stages instead of the transformer stage?
28. Describe the types of Transformers used in DataStage PX for processing, and their uses.
29. What is the aggregate cache in the aggregator transformation?
30. What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run the job?
31. How do you rename all of the jobs to support your new file-naming conventions?
32. How do you merge two files in DS?
33. How did you handle an 'Aborted' sequencer?
34. What performance tunings have you done in your last project to increase the performance of slowly running jobs?
35. If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?
36. What is Full load & Incremental or Refresh load?
37. Describe cleanup resource and clear status file.
38. What is the lookup stage? Can you define derivations in the lookup stage output?
39. What is the copy stage? When do you use it?
40. What is the Change Capture stage? Which execution mode would you use when using it for comparison of data?
41. What is the Dataset Stage?
42. How do you drop a dataset?
43. How do you eliminate duplicates in a dataset?
44. What is the Peek Stage? When do you use it?
45. What are the different join options available in the Join Stage?
46. What is the difference between the Lookup, Merge & Join stages?
47. What is RCP? How is it implemented?
48. What is the row generator? When do you use it?
49. How do you extract data from more than one heterogeneous source?
50. How can we pass parameters to a job by using a file?

1) Lookup Stage: is it persistent or non-persistent? (What is happening behind the scene?)
Ans: The Lookup stage is non-persistent.
2) Is Pipeline parallelism in PX the same as what the Inter-process (IPC) stage does in Server?
Ans: Yes and no. The IPC stage buffers data so that the next process (or next stage in the same process) can pick it up. Pipeline parallelism in parallel jobs is much more complete. Do you understand the relationship between stages and Orchestrate operators? Essentially each stage generates an operator. These (assuming that they don't combine into single processes) can form a pipeline so that, if you examine the generated OSH, it might have the form of a chain of operators connected by pipes.
3) How can we maintain the partitioning in the Sort stage?
4) Where do we need partitioning (in processing or somewhere else)?
5) If we use SAME partitioning in the first stage, which partitioning method will it take?
6) What is the symbol we will get when we are using the round robin partitioning method?
7) If we check preserve partitioning in one stage and don't give any partitioning method in the next stage, which partition method will it use?
8) What is orchestrate?
Ans: Orchestrate was a product from Torrent before being bought by Ascential. Orchestrate provides the OSH framework, which has a UNIX command line interface; the DataStage PX GUI generates OSH (Orchestrate Shell) scripts for the jobs you run. Very slick, very fast. An OSH script is a quoted string which specifies the operators and connections of a single Orchestrate step. In its simplest form it is:
    osh "op < in.ds > out.ds"
where op is an Orchestrate operator, in.ds is the input dataset and out.ds is the output dataset. (A cleaned-up, hedged example follows after question 53 below.)
9) Can we give node allocations, i.e. 4 nodes for one stage and 3 nodes for the next stage?
10) What is combinability, non-combinability?
11) What are schema files?
12) Why do we need datasets rather than sequential files?
Ans: A sequential file as a source or target needs to be repartitioned as it is (as the name suggests) a single sequential stream of data. A dataset can be saved across nodes using the partitioning method selected, so it is always faster when used as a source or target.
13) Does the look-up stage return multiple rows or single rows?
14) Why do we need the sort stage, given the sort-merge collecting method and the perform-sort option in the stage's advanced properties?
15) For the surrogate key generator stage, where will the next value be stored?
16) When will re-partitioning actually occur?
17) In the transformer stage can we give constraints?
Ans: Yes, we can.
18) What is a constraint in the Advanced tab?
19) What is the difference between Range and Range Map partitioning?

DataStage PX questions
51. What is orchestrate?
--. Orchestrate is the old name of the underlying parallel execution engine; Ascential re-named the technology "Parallel Extender".
52. What is the difference between server jobs & parallel jobs?
--. Server jobs generate DataStage BASIC, mainframe jobs generate COBOL and JCL, and parallel jobs generate Orchestrate shell script (osh) and C++. There are many more stage types for parallel than for server or mainframe, and parallel stages correspond to Orchestrate operators. In server and mainframe jobs you tend to do most of the work in the Transformer stage; in parallel jobs you tend to use specific stage types for specific tasks (and the Transformer stage doesn't do lookups). Finally, there is the automatic partitioning and collection of data in the parallel environment, which would have to be managed manually (if at all) in the server environment.
53. Orchestrate Vs DataStage Parallel Extender?
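The osh form quoted in 8) can be tried directly from a UNIX shell on the server. This is a minimal sketch, assuming the standard Orchestrate copy operator; the dataset names and configuration-file path are placeholders:

    # Minimal osh step: copy one persistent dataset to another.
    export APT_CONFIG_FILE=/opt/ds/configs/default.apt   # placeholder path
    osh "copy < input.ds > output.ds"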
54. What are the types of Parallelism?
--. There are 2 types of parallel processing. They are:
a. Pipeline Parallelism – the ability of a downstream stage to begin processing a row as soon as an upstream stage has finished processing that row (rather than processing one row completely through the job before beginning the next row). It means that as soon as data is available between stages (in pipes or links), it can be exchanged between them without waiting for the entire record set to be read. In other words, stages do not have to wait for the entire set of records to be read first and then transferred to the next stage. For example, consider a job (src -> Transformer -> Tgt) running on a system having three processors:
--. The source stage starts running on one processor, reads the data from the source and starts filling a pipeline with the read data.
--. Simultaneously, the Transformer stage starts running on another processor, processes the data in the pipeline and starts filling another pipeline.
--. At the same time, the target stage starts running on another processor and writes data to the target as soon as the data is available.
This enhances the speed at which loading takes place.
b. Partitioning Parallelism – partitioning parallelism means that the entire record set is partitioned into small sets and processed on different nodes. For example, if there are 100 records and 4 logical nodes, then each node would process 25 records. That is, several processors can run the same job simultaneously, each handling a separate subset of the total data.
Note: Link Partitioner and Link Collector stages can be used to achieve a certain degree of partitioning parallelism in server jobs.

55. Is Pipeline parallelism in PX the same as what Inter-process does in Server?
--. YES. The IPC stage is a stage which helps one passive stage read data from another as soon as data is available; in parallel jobs this is managed automatically.

56. What are the partitioning methods available in PX?
--. The partitioning methods available in PX are:
1. Auto: --. It chooses the best partitioning method depending on the mode of execution of the current stage and the preceding stage, and on the number of nodes available in the configuration file.
2. Round robin: --. Here the first record goes to the first processing node, the second to the second processing node, and so on. This method is useful for resizing partitions of an input dataset that are not equal in size into approximately equal-sized partitions. DataStage uses 'Round robin' when it partitions the data initially.
3. Random: --. It distributes the records randomly across all processing nodes and guarantees that each processing node receives approximately equal-sized partitions.
4. Same: --. It implements the same partitioning method as the one used by the preceding stage. The records stay on the same processing node; that is, data is not redistributed or repartitioned. Same is considered the fastest partitioning method, and DataStage uses 'Same' when passing data between stages in a job.
5. Entire: --. It distributes the complete dataset as input to every instance of a stage on every processing node. It is mostly used with stages that create lookup tables for their input.
6. Hash: --. It distributes all the records with identical key values to the same processing node, so as to ensure that related records are in the same partition. This does not necessarily mean that the partitions will be equal in size. When hash partitioning, hashing keys that create a large number of partitions should be selected. Reason: for example, if you hash partition a dataset based on a zip code field where a large percentage of records are from one or two zip codes, it can lead to bottlenecks because some nodes are required to process more records than other nodes.
7. Modulus: --. Partitioning is based on a key column modulo the number of partitions. The modulus partitioner assigns each record of an input dataset to a partition of its output dataset as determined by a specified key field in the input dataset.
8. Range: --. It divides a dataset into approximately equal-sized partitions, each of which contains records with key columns within a specific range. It guarantees that all records with the same partitioning key values are assigned to the same partition. Note: in order to use a Range partitioner, a range map has to be made using the 'Write Range Map' stage.
9. DB2: --. Partitions an input dataset in the same way that DB2 would partition it. For example, if this method is used to partition an input dataset containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then, during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node.
57. What is Re-Partitioning? When will re-partitioning actually occur?
--. Re-Partitioning is the rearranging of data among the partitions. In a job, a Parallel-to-Parallel flow results in re-partitioning. For example, consider EMP data that is initially processed based on SAL, but now you want to process the data grouped by DEPTNO. You will then need to repartition to ensure that all the employees falling under the same DEPTNO are in the same group.

58. What are IConv() and OConv() functions and where are they used? Can we use these functions in PX?
--. 'Iconv()' converts a string to an internal storage format.
Syntax: Iconv(string, code[@VM code]...)
- string evaluates to the string to be converted.
- code indicates the conversion code, which specifies how the string needs to be formatted.
--. 'Oconv()' converts an expression to an output format.
Syntax: Oconv(expression, conversion[@VM conversion]...)
- expression is a string stored in internal format that needs to be converted to output format.
- conversion indicates the conversion code, which specifies how the data needs to be formatted for output.
Example conversion codes: MCA – extracts alphabetic characters from a field. MCN – extracts numeric characters from a field. MCL – converts uppercase letters to lowercase.
These functions cannot be used directly in PX. The only stage which allows the use of Iconv() and Oconv() in PX is the 'BASIC Transformer' stage, which gives access to the functions supported by the DataStage Server engine. Note: the BASIC Transformer can be used only on SMP systems, not on MPP or cluster systems.

59. What does a Configuration File in parallel extender consist of?
• The Configuration File consists of all the processing nodes of a parallel system. It describes every processing node that DataStage uses to run an application. When you run a job, DataStage first reads the Configuration File to determine the available nodes. (A minimal example file is sketched after question 62 below.)
• It can be defined and edited using the DataStage Manager.
• When a system is modified by adding or removing processing nodes, or by reconfiguring nodes, the DataStage jobs need not be altered or even recompiled; editing the Configuration file alone will suffice.
• The Configuration file also gives control over the parallelization of a job during the development cycle. For example, by editing the Configuration file, a job can first be run on a single processing node, then on two nodes, then four, and so on.

60. What is the difference between a file set and a data set?
Dataset: --. Datasets are operating system files, each referred to by a control file, which by convention has the suffix .ds. PX jobs use datasets to manage data within a job; they allow you to store data in persistent form, which can then be used by other jobs. The data in datasets is stored in internal format. A dataset consists of two parts: a descriptor file, which contains the metadata and data location, and the data files, which contain the data. The Dataset stage is used to read data from or write data to a dataset.
Fileset: --. DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose name ends in .fs. The data files and the file that lists them are together called a 'fileset'. A fileset consists of two parts: a descriptor file, which contains the locations of the raw data files and the metadata, and the individual raw data files themselves. The Fileset stage is used to read data from or write data to a fileset. A fileset stores data in human-readable form.
61. Lookup Stage: is it persistent or non-persistent? (What is happening behind the scene?)
--. The Lookup stage is non-persistent.

62. How can we maintain the partitioning in the Sort stage?
--. Partitioning in the Sort stage can be maintained using the 'Same' partitioning method. For example, assume you sort a dataset on a system with four processing nodes and store the results to a Dataset stage; the dataset will therefore have four partitions. You then use that dataset as input to a stage executing on a different number of nodes. DataStage automatically repartitions the dataset to spread it out to all the processing nodes, which destroys the sort order of the data. This can be avoided by specifying the Same partitioning method, so that the original partitions are preserved.
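As an illustration of 59), below is a minimal two-node configuration file in the standard node/fastname/pools/resource syntax. The host name and directory paths are placeholders for a real environment, not values from this document:

    {
        node "node1" {
            fastname "etlserver"
            pools ""
            resource disk "/data/ds/node1" {pools ""}
            resource scratchdisk "/scratch/node1" {pools ""}
        }
        node "node2" {
            fastname "etlserver"
            pools ""
            resource disk "/data/ds/node2" {pools ""}
            resource scratchdisk "/scratch/node2" {pools ""}
        }
    }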
63. Where do we need partitioning (in processing or somewhere else)?
--. Partitioning is needed in processing. It means we need partitioning where we have huge volumes of data to process.

64. If we use SAME partitioning in the first stage, which partitioning method will it take?
--. In this case the partitioning method used by the preceding stage is used.

65. What is the symbol we will get when we are using the round robin partitioning method?
--. BOW TIE. Given below is the list of icons that appear on a link, based on the mode of execution (parallel or sequential) of the current stage and the preceding stage and on the type of partitioning method:
Sequential mode -> Parallel mode : FAN OUT (indicates partitioning)
Parallel mode -> Sequential mode : FAN IN (indicates collecting)
Parallel mode -> Parallel mode : BOX (indicates AUTO method)
Parallel mode -> Parallel mode : BOW TIE (indicates repartitioning)
Parallel mode -> Parallel mode : PARALLEL LINES (indicates SAME partitioning)

66. If we check preserve partitioning in one stage and don't give any partitioning method (Auto) in the next stage, which partition method will it use?
--. As Auto is the default partitioning method, it is taken as the partitioning method.

67. Can we give node allocations, i.e. 4 nodes for one stage and 3 nodes for the next stage?
--. Generally all the processing nodes for a project are defined in the Configuration file, so the node allocation is common for the project as a whole and not for individual stages in a job. It means that node allocation is project specific.

68. What is combinability, non-combinability?
--. Using combinability, DataStage combines the operators that underlie parallel stages so that they run in the same process. It lets the DataStage compiler potentially 'optimize' the number of processes used at runtime by combining operators; this saves a significant amount of data copying and preparation in passing data between operators. It has three options: Auto - use the default combination method; Combinable - combine all possible operators; Don't combine - never combine operators. Usually this setting is left at its default so that DataStage can tune jobs for performance automatically.

69. What are schema files?
--. A 'schema file' is a plain text file in which the meta data for a stage is specified. It is stored outside the DataStage Repository, in a document management system or a source code control system; table definitions, in contrast, are stored in the DataStage Repository and can be loaded into stages as and when required. (To see a schema, import any table definition, load it into a parallel job, then choose "Show Schema".)
--. A schema is an alternative way to specify column definitions for the data used by parallel jobs. By default most parallel job stages take their meta data from the Columns tab; for some stages you can specify a property that causes the stage to take its meta data from the specified schema file instead.
--. A schema consists of a record definition. The schema defines the columns and their characteristics in a data set, and may also contain information about key columns. The following is an example record schema:
record(
  name:string;
  address:nullable string;
  date:date;
)

70. Why do we need datasets rather than sequential files?
--. A sequential file as the source or target needs to be repartitioned as it is (as the name suggests) a single sequential stream of data, while a dataset can be saved across nodes using the partitioning method selected, so it is always faster when used as a source or target. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs. Using datasets wisely can be key to good performance in a set of linked jobs.

21. Does the look-up stage return multiple rows or single rows?
--. The Lookup stage returns the rows related to the key values. It can return multiple rows, depending on the keys you mention for the lookup in the stage.

22. Why do we need the sort stage, given the sort-merge collecting method and the perform-sort option in the stage's advanced properties?
--. The Sort stage is used to perform sort operations more complex than are possible via a stage's Advanced tab properties. Many stages have an optional sort function via the partition tab; this means that if you are partitioning your data in a stage you can define the sort at the same time. The Sort stage is for use when you don't have any stage doing partitioning in your job but you still want to sort your data, when you want to sort your data in descending order, or when you want to use one of the Sort stage options such as "Allow Duplicates" or "Stable Sort". If you are processing very large volumes and need to sort, you will find the Sort stage more flexible than the partition tab sort.

23. For the surrogate key generator stage, where will the next value be stored?

24. In the surrogate key generator stage, how does it generate the number? (Based on nodes or based on rows?)
--. Key values are generated based on the nodes, and the input data partitions should be perfectly balanced across the nodes. This can be achieved using the round robin partitioning method when your starting point is sequential.

25. What is the preserve partitioning flag in the Advanced tab?
--. It indicates whether the stage wants to preserve partitioning at the next stage of the job. There are three options:
1. Set - sets the preserve partitioning flag; this indicates to the next stage in the job that it should preserve existing partitioning if possible.
2. Clear - clears the preserve partitioning flag; this indicates that this stage doesn't care which partitioning method the next stage uses.
3. Propagate - sets the flag to Set or Clear depending on what the previous stage of the job has set (or, if that is set to Propagate, the stage before that, and so on until a preserve partitioning flag setting is encountered).

26. What is the difference between stages and operators?
--. Stages are the generic user interface through which we read and write files and databases and process data. The different types of stages are:
Database - stages that read or write data contained in a database; examples are the Oracle Enterprise and DB2/UDB Enterprise stages.
Development/Debug - stages that help you when you are developing and troubleshooting parallel jobs; examples are the Peek and Row Generator stages.
File - stages that read or write data contained in a file or set of files; examples are the Sequential File and Data Set stages.
Processing - stages that perform some processing on the data passing through them; examples are the Aggregator and Transformer stages.
Real Time - stages that allow parallel jobs to be made available as RTI services; they comprise the RTI Source and RTI Target stages and are part of the optional Web Services package.
Restructure - stages that deal with and manipulate data containing columns of complex data types; examples are the Make Subrecord and Make Vector stages.
--. Operators are the basic functional units of an Orchestrate application; in the Orchestrate framework each DataStage stage generates an Orchestrate operator directly. For example, the operators in an application step might start with an import operator, which reads data from a file and converts it to an Orchestrate data set; subsequent operators in the sequence could perform various processing and analysis tasks. The operators in your Orchestrate application pass data records from one operator to the next in pipeline fashion, and operators execute on all processing nodes in your system. The processing power of Orchestrate derives largely from its ability to execute operators in parallel on multiple processing nodes, and Orchestrate dynamically scales your application up or down in response to system configuration changes, without requiring you to modify your application.

27. Why do we need filter, copy and column export stages instead of the transformer stage?
--. In parallel jobs we have specific stage types for performing specialized tasks. The filter, copy and column export stages are operator stages, mapping directly onto single Orchestrate operators, so using them increases the speed of data processing applications compared with doing the same work in Transformer stages.

28. Describe the types of Transformers used in DataStage PX for processing, and their uses.
--. Transformer and BASIC Transformer.
Transformer: a processing stage. Transformer stages allow you to create transformations to apply to your data; these transformations can be simple or complex, can be applied to individual columns in your data, and are specified using a set of functions. Transformer stages can have a single input and any number of outputs. They can also have a reject link that takes any rows which have not been written to any of the output links, by reason of a write failure or expression evaluation failure.
BASIC Transformer: also a processing stage, similar in appearance and function to the Transformer stage in server jobs. It gives access to BASIC transforms and functions (BASIC is the language supported by
BASIC Transformer stage can have a single input and any number of outputs.Use wait for file activity stage between job activity stages in job sequencer. Sort the data before sending to change capture stage or remove duplicate stage. Write a script.By using check point information we can restart the sequence from failure. So you have to make the necessary changes to these Sequencers. 3. the treatment of rows with unmatched keys. 31. What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run job? --. Key column should be hash partitioned and sorted before aggregate operation. 32. which can do a simple rename of the strings looking up the file. Export the whole project as a dsx. 34. What are Performance tunings you have done in your last project to increase the performance of slowly running jobs? 1. Then import the new dsx file and recompile all jobs.Filter unwanted records in beginning of the job flow itself. How did you handle an 'Aborted' sequencer? --.Create a file with new and old names. What is aggregate cache in aggregator transformation? --. How do you merge two files in DS? --.Use Operator stages like remove duplicate. 30. 5. Filter. and their requirements for data being input of key columns. Use Join stage instead of Lookup stage when the data is huge. The three stages differ mainly in the memory they use. 29. if u enabled the check point information reset the aborted job and run again.Aggregate cache is the memory used for grouping operations by the aggregator stage. Join Stage or Lookup Stage All these merge or join occurs based on the key values. Either go for Merge stage. 6. Be cautious that the name of the jobs has also been changed in your job control jobs or Sequencer jobs.We can merge two files in 3 different ways. Using Dataset stage instead of sequential files wherever necessary. 4. Copy etc instead of transformer stage.the datastage server engine and available in server jobs). . 33. 2. How do you rename all of the jobs to support your new File-naming conventions? --.
It can also perform lookups directly in a DB2 or Oracle database or in a lookup table contained in a Lookup File Set stage. The table lookup is based on the values of a set of lookup key columns. Each record of the output data set contains columns from a source record plus columns from all the corresponding lookup records where corresponding source and lookup records have the same value for the lookup key columns. For each record of the source data set from the primary link. and a single rejects link. the Lookup stage performs a table lookup on each of the lookup tables attached by reference links. You need to ensure that the data being looked up in the lookup table is in the same partition as the input data referencing it. 38. What is Full load & Incremental or Refresh load? 37. Lookups can also be used for validation of a row. Clear Status file command is for resetting the status records associated with all stages in that job. The Lookup stage can have a reference link. 36. If data is partitioned in your job on key 1 and then you aggregate on key 2. One way of doing this is to partition the lookup tables using the Entire method. the row is rejected. The lookup key columns do not have to have the same names in the primary and the reference links. a single output link. it can have multiple reference links (where it is directly looking up a DB2 table or Oracle table. The optional reject link carries source records that do not have a corresponding entry in the input lookup tables. Describe cleanup resource and clear status file? The Cleanup Resources command is to • View and end job processes • View and release the associated locks Cleanup Resources command is available in director under Job menu. which gives the desired result and also eases grouping operation. The most common use for a lookup is to map short codes in the input data set onto expanded information from a lookup table which is then joined to the incoming data and output. What is lookup stage? Can you define derivations in the lookup stage output? Lookup stage is used to perform lookup operations on a data set read into memory from any other Parallel job stage that can output data. . one set for each table. it can only have a single reference link). Depending upon the type and setting of the stage(s) providing the look up information. There are some special partitioning considerations for lookup stages.35. A lot of the setting up of a lookup operation takes place on the stage providing the lookup table. If there is no corresponding entry in a lookup table to the key’s values. what issues could arise? --. The input link carries the data from the source data set and is known as the primary link.It will result in false output even though job runs successfully. In aggregator key value should be hash partitioned so that identical key values will be in the same node. a single input link. The following pictures show some example jobs performing lookups. Another way is to partition it in the same way as the input data (although this implies sorting of the data). Lookup stages do not require data on the input link or reference links to be sorted. The keys are defined on the Lookup stage.
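Following the export/rename/import approach in 31), here is a minimal sketch. The file names (project.dsx, rename_map.txt) are hypothetical; run it on a copy of the export, since job names also appear inside sequencers and job control code.

    #!/bin/sh
    # Hypothetical sketch for 31): bulk-rename job names inside an exported .dsx file.
    # rename_map.txt holds one "OLDNAME NEWNAME" pair per line (placeholder layout).
    cp project.dsx project_renamed.dsx
    while read OLD NEW; do
        # replace every occurrence of the old job name with the new one
        sed "s/$OLD/$NEW/g" project_renamed.dsx > tmp.dsx && mv tmp.dsx project_renamed.dsx
    done < rename_map.txt
    # re-import project_renamed.dsx through DataStage Manager, then recompile all jobs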
39. What is the copy stage? When do you use it?
--. The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop or change the order of columns. The Copy stage is useful when we want to make a backup copy of a data set on disk while performing an operation on another copy. A Copy stage with a single input and a single output needs Force set to TRUE; this prevents DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job.

40. What is the Change Capture stage? Which execution mode would you use when using it for comparison of data?
--. The Change Capture stage takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set. The stage produces a change data set, whose table definition is transferred from the after data set's table definition with the addition of one column: a change code with values encoding the four actions insert, delete, copy, and edit. The preserve-partitioning flag is set on the change data set.
The compare is based on a set of key columns: rows from the two data sets are assumed to be copies of one another if they have the same values in these key columns. You can also optionally specify change values: if two rows have identical key columns, you can compare the value columns in the rows to see if one is an edited copy of the other. The stage assumes that the incoming data is key-partitioned and sorted in ascending order, and the columns the data is hashed on should be the key columns used for the data compare. You can achieve the sorting and partitioning using the Sort stage or by using the built-in sorting and partitioning abilities of the Change Capture stage. We can use both sequential and parallel modes of execution for the Change Capture stage.

41. What is the Dataset Stage?
--. The Data Set stage is a file stage. It allows you to read data from or write data to a data set, and it can be configured to execute in parallel or sequential mode. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. DataStage parallel extender jobs use data sets to manage data within a job: the Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs. Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the DataStage Designer, Manager, or Director.

42. How do you drop a dataset?
--. There are two ways of dropping a data set: first, by using the Data Set Management utility (GUI) located in the Manager, Designer or Director; second, by using the Unix command-line utility orchadmin. (A hedged command-line sketch follows after question 44 below.)

43. How do you eliminate duplicates in a dataset?
--. The simplest way to remove duplicates is by using the Remove Duplicates stage, which takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set.

44. What is the Peek Stage? When do you use it?
--. The Peek stage is a Development/Debug stage. It lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets. Like the Head stage and the Tail stage, the Peek stage can be helpful for monitoring the progress of your application or for diagnosing a bug in your application.
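For 42), a command-line sketch is shown below. It assumes orchadmin is on the server's PATH and that APT_CONFIG_FILE points at the same configuration file the data set was created with; logs.ds is a placeholder name, and the subcommand names are assumptions that should be confirmed against your installation's orchadmin help.

    # Hypothetical sketch: delete a parallel data set safely from the command line.
    # Never use plain 'rm' on the .ds descriptor - the data files on each node would be left behind.
    export APT_CONFIG_FILE=/opt/ds/configs/default.apt   # placeholder path
    orchadmin describe logs.ds    # assumed subcommand: show the data files behind the descriptor
    orchadmin rm logs.ds          # assumed subcommand: remove descriptor and all data files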
45. What are the different join options available in the Join Stage?
--. There are four join options available in the Join stage (the default is inner):
Inner: transfers records from the input data sets whose key columns contain equal values to the output data set. Records whose key columns do not contain equal values are dropped.
Left Outer: transfers all values from the left data set, but transfers values from the right data set and intermediate data sets only where key columns match. The stage drops the key column from the right and intermediate data sets.
Right Outer: transfers all values from the right data set, and transfers values from the left data set and intermediate data sets only where key columns match. The stage drops the key column from the left and intermediate data sets.
Full Outer: transfers records in which the contents of the key columns are equal from the left and right input data sets to the output data set, and also transfers records whose key columns contain unequal values from both input data sets to the output data set. (Full outer joins do not support more than two input links.)

46. What is the difference between the Lookup, Merge & Join stages?
--. These three stages combine two or more input links according to the values of user-designated "key" column(s). They differ mainly in their memory usage, their treatment of rows with unmatched key values, and their input requirements (sorted, de-duplicated).
The main difference between join and lookup is in the way they handle the data and the reject links. Lookup is used if the data being looked up can fit in the available temporary memory; if the volume of data is too huge to fit into memory, it is safe to go for join and avoid lookup, since paging can occur when lookup is used.
Join requires the input datasets to be key partitioned and sorted; lookup does not have this requirement.
Lookup allows reject links; join does not allow reject links, so we cannot get the rejected records directly. The Merge stage allows us to capture failed lookups from each reference input separately.
Merge requires identically sorted and partitioned inputs and, if there is more than one reference input, de-duplicated reference inputs. In the case of the Merge stage, as a pre-processing step, duplicates should be removed from the master dataset, and if there is more than one update dataset then duplicates should be removed from the update datasets as well; this step is not required for the Join and Lookup stages.

47. What is RCP? How is it implemented?
--. DataStage is flexible about meta data: it can cope with the situation where the meta data isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as Runtime Column Propagation (RCP). It can be enabled for a project via the DataStage Administrator, and set for individual links via the Outputs Page Columns tab for most stages, or in the Outputs page General tab for Transformer stages. You should always ensure that runtime column propagation is turned on. RCP is implemented through schema files: a schema file is a plain text file that contains a record (or row) definition.

48. What is the row generator? When do you use it?
--. The Row Generator stage is a Development/Debug stage. It has no input links and a single output link. The Row Generator stage produces a set of mock data fitting the specified meta data. This is useful when we want to test our job but have no real data available to process. Row Generator is also useful when we want processing stages to execute at least once in the absence of data from the source.

49. How do you extract data from more than one heterogeneous source?
--. We can extract data from different sources (i.e. Oracle, DB2, sequential files etc.) in a job. After getting the data we can use a Join, Merge, Aggregator or Lookup stage to unify the incoming data.

50. How can we pass parameters to a job by using a file?
--. This can be done through a shell script, where we read the different parameters from the file and call the dsjob command to execute the job with those interpreted parameters.