The following table reflects all changes to this document.

Date         Author / Contributor   Version   Reason for Change
01-Nov-2004                         1.0       Initial Document
14-Sep-2010                         1.1       Updated Document
Table of Contents

2.1 DEFINITIONS
    Normalization: First Normal Form; Second Normal Form; Third Normal Form; Boyce-Codd Normal Form; Fourth Normal Form
    Oracle Set of Statements: DDL, DML, DQL, DCL, TCL; Syntaxes
    Oracle Joins: Equi Join/Inner Join; Non-Equi Join; Self Join; Natural Join; Cross Join; Outer Join (Left, Right, Full)
    What's the difference between View and Materialized View? View; Materialized View; Inline View
    Indexes; Why are hints required?; Explain Plan; Stored Procedures; Packages; Triggers; Data Files Overview
2.2 IMPORTANT QUERIES
3 DWH CONCEPTS
    What is BI?
4.1 Informatica Overview
4.2 Informatica Scenarios
4.3 Development Guidelines
4.4 Performance Tips
4.5 Unit Test Cases (UTP)
The purpose of this document is to provide detailed information about DWH concepts and Informatica, based on real-time training.
Organizations can store data on various media and in different formats, such as a hard-copy document in a filing cabinet or data stored in electronic spreadsheets or in databases. A database is an organized collection of information. To manage databases, you need a database management system (DBMS). A DBMS is a program that stores, retrieves, and modifies data in the database on request. There are four main types of databases: hierarchical, network, relational, and, more recently, object-relational (ORDBMS).
2.1 DEFINITIONS

NORMALIZATION:
Some Oracle databases were modeled according to the rules of normalization, which are intended to eliminate redundancy. Applying the rules of normalization requires understanding your relationships and functional dependencies.

First Normal Form: A row is in first normal form (1NF) if all underlying domains contain atomic values only.
• Eliminate duplicative columns from the same table.
• Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).
Second Normal Form: An entity is in Second Normal Form (2NF) when it meets the requirement of being in First Normal Form (1NF) and additionally:
• No non-key column depends on only part of the primary key; the key cannot be subdivided into separate logical entities that each determine some of the columns.
• All the non-key columns are functionally dependent on the entire primary key.
• A row is in second normal form if, and only if, it is in first normal form and every non-key attribute is fully dependent on the key.
• 2NF eliminates functional dependencies on a partial key by putting those fields in a separate table from the fields that depend on the whole key. An example is resolving a many-to-many relationship using an intersecting entity.
Third Normal Form: An entity is in Third Normal Form (3NF) when it meets the requirement of being in Second Normal Form (2NF) and additionally:
• Functional dependencies on non-key fields are eliminated by putting them in a separate table. At this level, all non-key fields depend only on the primary key.
• A row is in third normal form if, and only if, it is in second normal form and attributes that do not contribute to a description of the primary key are moved into a separate table. An example is creating look-up tables.
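As a small sketch of the look-up table idea (the table and column names here are assumed for illustration), a repeating department description is moved out of the employee table into its own look-up table:

-- The look-up table owns the descriptive attribute
CREATE TABLE dept_lookup (
    deptno    NUMBER PRIMARY KEY,
    dept_name VARCHAR2(30)
);

-- The employee table keeps only the foreign key
CREATE TABLE emp_3nf (
    empno  NUMBER PRIMARY KEY,
    ename  VARCHAR2(30),
    deptno NUMBER REFERENCES dept_lookup (deptno)
);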
Boyce-Codd Normal Form: Boyce-Codd Normal Form (BCNF) is a further refinement of 3NF. In his later writings Codd refers to BCNF as 3NF. A row is in Boyce-Codd normal form if, and only if, every determinant is a candidate key. Most entities in 3NF are already in BCNF.

Fourth Normal Form: An entity is in Fourth Normal Form (4NF) when it meets the requirement of being in Third Normal Form (3NF) and additionally has no multiple sets of multi-valued dependencies. In other words, 4NF states that no entity can have more than a single one-to-many relationship.

ORACLE SET OF STATEMENTS:

Data Definition Language (DDL):
Create
Alter
Drop
Truncate

Data Manipulation Language (DML):
Insert
Update
Delete

Data Querying Language (DQL):
Select

Data Control Language (DCL):
Grant
Revoke

Transactional Control Language (TCL):
Commit
Rollback
Savepoint
Syntaxes:

CREATE OR REPLACE SYNONYM HZ_PARTIES FOR SCOTT.HZ_PARTIES;

CREATE DATABASE LINK CAASEDW
CONNECT TO ITO_ASA IDENTIFIED BY exact123
USING 'CAASEDW';

Materialized View syntax:
CREATE MATERIALIZED VIEW EBIBDRO.HWMD_MTH_ALL_METRICS_CURR_VIEW
REFRESH COMPLETE
START WITH SYSDATE
NEXT TRUNC(SYSDATE+1) + 4/24
WITH PRIMARY KEY
AS
select * from HWMD_MTH_ALL_METRICS_CURR_VW; Another Method to refresh: DBMS_MVIEW.REFRESH('MV_COMPLEX', 'C');
Case Statement:
Select NAME,
  (CASE WHEN (CLASS_CODE = 'Subscription') THEN ATTRIBUTE_CATEGORY
        ELSE TASK_TYPE END) TASK_TYPE,
  CURRENCY_CODE
From EMP;

Decode():
Select empname,
  Decode(address, 'HYD', 'Hyderabad', 'Bang', 'Bangalore', address) as address
from emp;

Procedure:
CREATE OR REPLACE PROCEDURE Update_bal (
  cust_id_IN IN NUMBER,
  amount_IN  IN NUMBER DEFAULT 1)
AS
BEGIN
  Update account_tbl
  Set amount = amount_IN
  Where cust_id = cust_id_IN;
END;

Trigger:
CREATE OR REPLACE TRIGGER EMP_AUR
AFTER UPDATE ON EMP    -- BEFORE can be used in place of AFTER
REFERENCING NEW AS NEW OLD AS OLD
FOR EACH ROW
DECLARE
BEGIN
  IF (:NEW.last_upd_tmst <> :OLD.last_upd_tmst) THEN
    -- Insert a record into the control table
    Insert into emp_w values ('wrk', sysdate);
  ELSE
    -- Call a procedure
    update_sysdate;
  END IF;
END;
ORACLE JOINS:
• Equi join
• Non-equi join
• Self join
• Natural join
• Cross join
• Outer join
  – Left outer join
  – Right outer join
  – Full outer join

Equi Join/Inner Join:
SQL> select empno, ename, job, dname, loc from emp e, dept d where e.deptno = d.deptno;

USING CLAUSE
SQL> select empno, ename, job, dname, loc from emp e join dept d using (deptno);

ON CLAUSE
SQL> select empno, ename, job, dname, loc from emp e join dept d on (e.deptno = d.deptno);

Non-Equi Join
A join whose join condition contains an operator other than '='.
Ex:
SQL> select empno, ename, job, dname, loc from emp e, dept d where e.deptno > d.deptno;

Self Join
Joining a table to itself is called a self join.
Ex:
SQL> select e1.empno, e2.ename, e1.job, e2.deptno from emp e1, emp e2 where e1.empno = e2.mgr;
Natural Join
Natural join compares all the common columns.
Ex:
SQL> select empno, ename, job, dname, loc from emp natural join dept;

Cross Join
This gives the cross product.
Ex:
SQL> select empno, ename, job, dname, loc from emp cross join dept;

Outer Join
Outer join returns the non-matching records along with the matching records.

Left Outer Join
This displays all matching records, plus the records in the left-hand table that have no match in the right-hand table.
Ex:
SQL> select empno, ename, job, dname, loc from emp e left outer join dept d on (e.deptno = d.deptno);
Or
SQL> select empno, ename, job, dname, loc from emp e, dept d where e.deptno = d.deptno(+);

Right Outer Join
This displays all matching records, plus the records in the right-hand table that have no match in the left-hand table.
Ex:
SQL> select empno, ename, job, dname, loc from emp e right outer join dept d on (e.deptno = d.deptno);
Or
SQL> select empno, ename, job, dname, loc from emp e, dept d where e.deptno(+) = d.deptno;

Full Outer Join
This displays all matching records and the non-matching records from both tables.
Ex:
SQL> select empno, ename, job, dname, loc from emp e full outer join dept d on (e.deptno = d.deptno);
OR
SQL> select p.part_id, s.supplier_name
  2  from part p, supplier s
  3  where p.supplier_id = s.supplier_id (+)
  4  union
  5  select p.part_id, s.supplier_name
  6  from part p, supplier s
  7  where p.supplier_id (+) = s.supplier_id;

What's the difference between View and Materialized View?
View:
Why Use Views?
• To restrict data access
• To make complex queries easy
• To provide data independence

A simple view is one that:
– Derives data from only one table
– Contains no functions or groups of data
– Allows DML operations through the view

A complex view is one that:
– Derives data from many tables
– Contains functions or groups of data
– Does not always allow DML operations through the view

A view has a logical existence, but a materialized view has a physical existence. Moreover, a materialized view can be indexed, analyzed, and so on; everything we can do with a table can also be done with a materialized view. We can keep aggregated data in a materialized view, we can schedule an MV to refresh (a table cannot be scheduled), and an MV can be created based on multiple tables.

Materialized View:
In a DWH, materialized views are essential because report performance would degrade if the aggregate calculations required by the business were done on the reporting side. To improve report performance, rather than doing the calculations and joins at reporting time, we put the same logic in the MV; the reports can then select the data directly from the MV without any joins or aggregations. We can also schedule the MV (Materialized View) to refresh.
Inline view:
A select statement written in the FROM clause is an inline view.
Ex: Get the department-wise max salary along with empname and empno:
Select a.empname, a.empno, b.sal, b.deptno
From EMP a, (Select max(sal) sal, deptno from EMP group by deptno) b
Where a.sal = b.sal and a.deptno = b.deptno;

What is the difference between view and materialized view?

View:
– Has a logical existence; it does not store data.
– We can perform DML operations on a view.
– "select * from view" fetches the data from the base tables.
– A view cannot be scheduled to refresh.

Materialized view:
– Has a physical existence; it is a database object that stores data.
– We cannot perform DML operations on a materialized view.
– "select * from materialized_view" fetches the data stored in the materialized view.
– A materialized view can be scheduled to refresh.
– We can keep aggregated data in a materialized view, and it can be created based on multiple tables.
What is the difference between Delete, Truncate and Drop?

DELETE
The DELETE command is used to remove rows from a table. A WHERE clause can be used to remove only some rows; if no WHERE condition is specified, all rows are removed. After performing a DELETE operation you need to COMMIT or ROLLBACK the transaction to make the change permanent or to undo it.

TRUNCATE
TRUNCATE removes all rows from a table. The operation cannot be rolled back. As such, TRUNCATE is faster and doesn't use as much undo space as a DELETE.

DROP
The DROP command removes a table from the database. All the table's rows, indexes and privileges are also removed. The operation cannot be rolled back.

Difference between Rowid and Rownum?

ROWID
A globally unique identifier for a row in a database. It is created at the time the row is inserted into a table, and destroyed when the row is removed. Its format is 'BBBBBBBB.RRRR.FFFF', where BBBBBBBB is the block number, RRRR is the slot (row) number, and FFFF is the file number.

ROWNUM
For each row returned by a query, the ROWNUM pseudocolumn returns a number indicating the order in which Oracle selects the row from a table or set of joined rows. The first row selected has a ROWNUM of 1, the second has 2, and so on.
You can use ROWNUM to limit the number of rows returned by a query, as in this example:
SELECT * FROM employees WHERE ROWNUM < 10;

Rowid
Rowid is an Oracle-internal ID allocated every time a new record is inserted into a table. This ID is unique, cannot be changed by the user, and is permanent: it is a globally unique identifier for a row, created when the row is inserted into the table and destroyed when the row is removed.

Rownum
Rownum is a row number returned by a select statement, and it is temporary. The ROWNUM pseudocolumn returns a number indicating the order in which Oracle selects the row from a table or set of joined rows.
Order of where and having: SELECT column, group_function FROM table [WHERE condition] [GROUP BY group_by_expression] [HAVING group_condition] [ORDER BY column];
The WHERE clause cannot be used to restrict groups; you use the HAVING clause to restrict groups.
Differences between the WHERE clause and the HAVING clause

Both the WHERE and the HAVING clause can be used to filter data.

Where clause:
– Used to restrict rows; every record is filtered based on the WHERE condition.
– Applies to individual rows.
– GROUP BY is not mandatory with a WHERE clause.

Having clause:
– Used to restrict groups, and must be used with GROUP BY.
– Tests a condition on the group rather than on individual rows.
– Works with aggregate records (group functions).
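For example, on the familiar EMP table, WHERE filters rows before grouping and HAVING filters the groups afterwards:

SQL> select deptno, avg(sal)
     from emp
     where job <> 'PRESIDENT'      -- filters individual rows
     group by deptno
     having avg(sal) > 2000        -- filters the groups
     order by deptno;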
MERGE Statement
You can use the MERGE command to perform insert and update in a single command.
Ex:
Merge into student1 s1
Using (select * from student2) s2
On (s1.no = s2.no)
When matched then
  Update set marks = s2.marks
When not matched then
  Insert (s1.no, s1.name, s1.marks)
  Values (s2.no, s2.name, s2.marks);

What is the difference between a sub-query & a co-related sub-query?
A sub-query is executed once for the parent statement, whereas a correlated sub-query is executed once for each row of the parent query.

Sub Query:
Example:
Select deptno, ename, sal from emp a where sal in
  (select sal from grade where sal_grade = 'A' or sal_grade = 'B');

Co-Related Sub-query:
Example: Find all employees who earn more than the average salary in their department.
SELECT last_name, salary, department_id
FROM employees A
WHERE salary > (SELECT AVG(salary)
                FROM employees B
                WHERE B.department_id = A.department_id
                GROUP BY B.department_id);

EXISTS:
The EXISTS operator tests for the existence of rows in the result set of the subquery.
Select dname from dept where exists
  (select 1 from EMP where dept.deptno = emp.deptno);
Sub-query
– Executed once for the parent query.
Example:
Select * from emp where deptno in (select deptno from dept);

Co-related sub-query
– Executed once for each row of the parent query.
Example:
Select e.* from emp e where sal >= (select avg(sal) from emp a
  where a.deptno = e.deptno group by a.deptno);
Indexes:
Bitmap indexes are most appropriate for columns having few distinct values, such as GENDER, MARITAL_STATUS, and RELATION. This assumption is not completely accurate, however. In reality, a bitmap index is always advisable for systems in which data is not frequently updated by many concurrent sessions. In fact, a bitmap index on a column with 100-percent unique values (a candidate column for a primary key) can be as efficient as a B-tree index.

When to Create an Index
You should create an index if:
• A column contains a wide range of values
• A column contains a large number of null values
• One or more columns are frequently used together in a WHERE clause or a join condition
• The table is large and most queries are expected to retrieve less than 2 to 4 percent of the rows

By default, the index you create is a B-tree index.
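As a small sketch on the EMP table (the GENDER column is an assumption for illustration):

CREATE INDEX emp_deptno_idx ON emp (deptno);         -- default B-tree index
CREATE BITMAP INDEX emp_gender_bix ON emp (gender);  -- bitmap index on a low-cardinality column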
Why are hints required?
It is a perfectly valid question to ask why hints should be used. Oracle comes with an optimizer that promises to optimize a query's execution plan. When this optimizer is really doing a good job, no hints should be required at all. Sometimes, however, the characteristics of the data in the database change rapidly, so that the optimizer (or, more accurately, its statistics) is out of date. In this case, a hint can help. You should first get the explain plan of your SQL and determine what changes can be made so the code operates without hints, if possible. However, hints such as ORDERED, LEADING, INDEX, FULL, and the various AJ and SJ hints can take a wild optimizer and give you optimal performance.

Analyzing and updating tables

Analyze Statement
The ANALYZE statement can be used to gather statistics for a specific table, index or cluster. The statistics can be computed exactly, or estimated based on a specific number of rows or a percentage of rows:
ANALYZE TABLE employees COMPUTE STATISTICS;
ANALYZE TABLE employees ESTIMATE STATISTICS SAMPLE 15 PERCENT;
EXEC DBMS_STATS.gather_table_stats('SCOTT', 'EMPLOYEES'); Automatic Optimizer Statistics Collection By default Oracle 10g automatically gathers optimizer statistics using a scheduled job called GATHER_STATS_JOB. By default this job runs within maintenance windows between 10 P.M. to 6 A.M. week nights and all day on weekends. The job calls the DBMS_STATS.GATHER_DATABASE_STATS_JOB_PROC internal procedure which gathers statistics for tables with either empty
or stale statistics, similar to the DBMS_STATS.GATHER_DATABASE_STATS procedure using the GATHER AUTO option. The main difference is that the internal job prioritizes the work such that tables most urgently requiring statistics updates are processed first. Hint categories: Hints can be categorized as follows:
ALL_ROWS
One of the hints that invokes the cost-based optimizer. ALL_ROWS is usually used for batch processing or data warehousing systems. (/*+ ALL_ROWS */)

FIRST_ROWS
One of the hints that invokes the cost-based optimizer. FIRST_ROWS is usually used for OLTP systems. (/*+ FIRST_ROWS */)

CHOOSE
One of the hints that invokes the cost-based optimizer. This hint lets the server choose between ALL_ROWS and FIRST_ROWS, based on the statistics gathered.

There are also hints for join orders, hints for join operations, and hints for parallel execution, e.g. (/*+ parallel(a,4) */), where the specified degree is typically 2, 4 or 16.

Additional Hints
HASH
Hashes one table (full scan) and creates a hash index for that table, then hashes the other table and uses the hash index to find corresponding records. Therefore it is not suitable for < or > join conditions. (/*+ use_hash */)

Using a hint to force index use:
SELECT /*+ INDEX(TABLE_NAME INDEX_NAME) */ COL1, COL2 FROM TABLE_NAME;

ORDERED – This hint forces tables to be joined in the order specified. If you know table X has fewer rows, then ordering it first may speed execution in a join.

PARALLEL (table, instances) – This specifies that the operation is to be done in parallel. If an index cannot be used, we may go for /*+ parallel(table, 8) */ on selects and updates, for example when the WHERE clause uses operators such as LIKE, NOT IN, >, < or <>.

Explain Plan:
Explain plan tells us whether the query is properly using indexes, and what the cost of accessing the table is, i.e. whether it is doing a full table scan. Based on these statistics we can tune the query. The explain plan process stores data in the PLAN_TABLE. This table can be located in the current schema or a shared schema and is created in SQL*Plus as follows:
SQL> CONN sys/password AS SYSDBA
Connected
SQL> @$ORACLE_HOME/rdbms/admin/utlxplan.sql
SQL> GRANT ALL ON sys.plan_table TO public;
SQL> CREATE PUBLIC SYNONYM plan_table FOR sys.plan_table;

What is your tuning approach if a SQL query takes a long time? Or, how do you tune a SQL query?
If a query is taking a long time, first run the query with EXPLAIN PLAN. The explain plan process stores data in the PLAN_TABLE and gives us the execution plan of the query, showing whether the query is using the relevant indexes on the joining columns or whether indexes to support the query are missing.
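A typical usage sketch (the query itself is just an example):

EXPLAIN PLAN FOR
select e.empno, e.ename, d.dname
from emp e, dept d
where e.deptno = d.deptno;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);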
If the joining columns don't have an index, the query does a full table scan; if it is a full table scan the cost will be higher, so we create indexes on the joining columns and rerun the query, which should give better performance. We also need to analyze the tables if they were last analyzed long ago. The ANALYZE statement can be used to gather statistics for a specific table, index or cluster, e.g. ANALYZE TABLE employees COMPUTE STATISTICS;
If we still have a performance issue, we use HINTS; a hint is nothing but a clue. We can use hints like:
ALL_ROWS
One of the hints that invokes the cost-based optimizer. ALL_ROWS is usually used for batch processing or data warehousing systems. (/*+ ALL_ROWS */)

FIRST_ROWS
One of the hints that invokes the cost-based optimizer. FIRST_ROWS is usually used for OLTP systems. (/*+ FIRST_ROWS */)

CHOOSE
One of the hints that invokes the cost-based optimizer. This hint lets the server choose between ALL_ROWS and FIRST_ROWS, based on the statistics gathered.

HASH
Hashes one table (full scan) and creates a hash index for that table, then hashes the other table and uses the hash index to find corresponding records. Therefore it is not suitable for < or > join conditions. (/*+ use_hash */)

Hints are most useful to optimize query performance.
Stored Procedure:

What are the differences between stored procedures and triggers?
A stored procedure is normally used to perform tasks, whereas a trigger is normally used for tracing and auditing logs. Stored procedures must be called explicitly by the user in order to execute, but a trigger is called implicitly based on the events defined on the table. A stored procedure can run independently, but a trigger has to be part of a DML event on its table. A stored procedure can be executed from a trigger, but a trigger cannot be executed from a stored procedure. Stored procedures can have parameters; a trigger cannot have any parameters.
Stored procedures are compiled collections of programs or SQL statements in the database. Using a stored procedure we can access and modify data present in many tables; a stored procedure is not associated with any particular database object. Triggers, by contrast, are event-driven special procedures attached to a specific database object, say a table. Stored procedures are not run automatically; they have to be called explicitly by the user. Triggers get executed when the particular event associated with them fires.

Packages:
Packages provide a method of encapsulating related procedures, functions, and associated cursors and variables together as a unit in the database.
A package may contain several procedures and functions that process related transactions. A package is a group of related procedures and functions, together with the cursors and variables they use, encapsulated as a unit in the database.
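A minimal package skeleton (all names here are assumed for illustration):

CREATE OR REPLACE PACKAGE emp_pkg AS
  PROCEDURE update_sal (p_empno IN NUMBER, p_sal IN NUMBER);
  FUNCTION get_sal (p_empno IN NUMBER) RETURN NUMBER;
END emp_pkg;
/
CREATE OR REPLACE PACKAGE BODY emp_pkg AS
  PROCEDURE update_sal (p_empno IN NUMBER, p_sal IN NUMBER) IS
  BEGIN
    UPDATE emp SET sal = p_sal WHERE empno = p_empno;
  END;
  FUNCTION get_sal (p_empno IN NUMBER) RETURN NUMBER IS
    v_sal NUMBER;
  BEGIN
    SELECT sal INTO v_sal FROM emp WHERE empno = p_empno;
    RETURN v_sal;
  END;
END emp_pkg;
/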
Triggers:
Oracle lets you define procedures called triggers that run implicitly when an INSERT, UPDATE, or DELETE statement is issued against the associated table. Triggers are similar to stored procedures; a trigger stored in the database can include SQL and PL/SQL.
Types of Triggers
This section describes the different types of triggers:
• Row Triggers and Statement Triggers
• BEFORE and AFTER Triggers
• INSTEAD OF Triggers
• Triggers on System Events and User Events
Row Triggers
A row trigger is fired each time a row in the table is affected by the triggering statement. For example, if an UPDATE statement updates multiple rows of a table, a row trigger is fired once for each row affected by the UPDATE statement. If a triggering statement affects no rows, a row trigger is not run.

BEFORE and AFTER Triggers
When defining a trigger, you can specify the trigger timing: whether the trigger action is to be run before or after the triggering statement. BEFORE and AFTER apply to both statement and row triggers. BEFORE and AFTER triggers fired by DML statements can be defined only on tables, not on views.

Difference between Trigger and Procedure

Triggers:
– No need to execute manually; triggers are fired automatically.
– They run implicitly when an INSERT, UPDATE, or DELETE statement is issued against the associated table.

Stored Procedures:
– Need to be executed manually (called explicitly).

Differences between stored procedures and functions

Stored Procedure:
– May or may not return values; can return more than one value using OUT arguments.
– Used to implement business logic; mainly used to process tasks.
– A precompiled statement, stored as pseudo-code (compiled form) in the database.
– Cannot be invoked from SQL statements, e.g. SELECT.
– Can affect the state of the database using COMMIT.

Function:
– Should return at least one value.
– Mainly used to compute values and perform calculations.
– Not precompiled; parsed and compiled at runtime.
– Can be invoked from SQL statements, e.g. SELECT.
– Cannot affect the state of the database.
Data files Overview: A tablespace in an Oracle database consists of one or more physical datafiles. A datafile can be associated with only one tablespace and only one database. Table Space: Oracle stores data logically in tablespaces and physically in datafiles associated with the corresponding tablespace. A database is divided into one or more logical storage units called tablespaces. Tablespaces are divided into logical units of storage called segments. Control File: A control file contains information about the associated database that is required for access by an instance, both at startup and during normal operation. Control file information can be modified only by Oracle; no database administrator or user can edit a control file.
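As an illustrative sketch of the tablespace/datafile relationship described above (the file path and sizes are assumptions):

CREATE TABLESPACE users_data
  DATAFILE '/u01/oradata/orcl/users_data01.dbf' SIZE 100M
  AUTOEXTEND ON NEXT 10M MAXSIZE 1G;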
2.2 IMPORTANT QUERIES
1. Get duplicate rows from the table: Select empno, count (*) from EMP group by empno having count (*)>1; 2. Remove duplicates in the table:
Delete from EMP where rowid not in (select max(rowid) from EMP group by empno);

3. Below query transposes columns into rows.

Name   No    Add1     Add2
abc    100   hyd      bang
xyz    200   mysore   pune

Select name, no, add1 from A
UNION
Select name, no, add2 from A;
4. Below query transposes rows into columns.
select emp_id,
  max(decode(row_id,0,address)) as address1,
  max(decode(row_id,1,address)) as address2,
  max(decode(row_id,2,address)) as address3
from (select emp_id, address, mod(rownum,3) row_id from temp order by emp_id)
group by emp_id;

Other query:
select emp_id,
  max(decode(rank_id,1,address)) as add1,
  max(decode(rank_id,2,address)) as add2,
  max(decode(rank_id,3,address)) as add3
from (select emp_id, address,
      rank() over (partition by emp_id order by emp_id, address) rank_id
      from temp)
group by emp_id;

5. Rank query:
Select empno, ename, sal, r from
  (select empno, ename, sal, rank() over (order by sal desc) r from EMP);

6. Dense rank query:
The DENSE_RANK function acts like the RANK function except that it assigns consecutive ranks:
Select empno, ename, sal, r from
  (select empno, ename, sal, dense_rank() over (order by sal desc) r from emp);

7. Top 5 salaries by using rank:
Select empno, ename, sal, r from
  (select empno, ename, sal, dense_rank() over (order by sal desc) r from emp)
where r <= 5;
Or
Select * from (select * from EMP order by sal desc) where rownum <= 5;

8. 2nd highest sal:
Select empno, ename, sal, r from
  (select empno, ename, sal, dense_rank() over (order by sal desc) r from EMP)
where r = 2;

9. Top sal:
Select * from EMP where sal = (select max(sal) from EMP);

10. How to display alternate rows in a table?
SQL> select * from emp where (rowid, 0) in (select rowid, mod(rownum, 2) from emp);

11. Hierarchical queries
Starting at the root, walk from the top down, and eliminate employee Higgins from the result, but process the child rows.
SELECT department_id, employee_id, last_name, job_id, salary
FROM employees
WHERE last_name != 'Higgins'
START WITH manager_id IS NULL
CONNECT BY PRIOR employee_id = manager_id;
3 DWH CONCEPTS
What is BI?
Business Intelligence refers to a set of methods and techniques that are used by organizations for tactical and strategic decision making. It leverages methods and technologies that focus on counts, statistics and business objectives to improve business performance. The objective of Business Intelligence is to better understand customers and improve customer service, make the supply and distribution chain more efficient, and identify and address business problems and opportunities quickly.
A warehouse is used for high-level data analysis: predictions, time-series analysis, financial analysis, what-if simulations, and so on. Basically, it is used for better decision making.

What is a Data Warehouse?
A Data Warehouse is a "subject-oriented, integrated, time-variant, non-volatile collection of data in support of decision making". In terms of design, a data warehouse and a data mart are almost the same. In general, a Data Warehouse is used on an enterprise level and a Data Mart is used on a business division/department level.

Subject-Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added, but data is never removed.

What is a Data Mart?
A data mart is usually sponsored at the department level and developed with specific details or a subject in mind; a Data Mart is a subset of a data warehouse with a focused objective.

What is the difference between a data warehouse and a data mart?
In terms of design, a data warehouse and a data mart are almost the same. In general, a Data Warehouse is used on an enterprise level and a Data Mart is used on a business division/department level. A data mart only contains data specific to a particular subject.
Difference between data mart and data warehouse

Data Mart:
– Usually sponsored at the department level and developed with a specific issue or subject in mind; a data mart is a data warehouse with a focused objective.
– Used on a business division/department level.
– A Data Mart is a subset of data from a Data Warehouse; Data Marts are built for specific user groups.
– By providing decision makers with only a subset of data from the Data Warehouse, privacy, performance and clarity objectives can be attained.

Data Warehouse:
– A "subject-oriented, integrated, time-variant, non-volatile collection of data in support of decision making".
– Used on an enterprise level.
– A Data Warehouse is simply an integrated consolidation of data from a variety of sources that is specially designed to support strategic and tactical decision making.
– The main objective of a Data Warehouse is to provide an integrated environment and a coherent picture of the business at a point in time.
What is a factless fact table?
A fact table that contains only primary keys from the dimension tables and does not contain any measures is called a factless fact table.

What is a Schema?
A graphical representation of the data structure.
It is the first phase in the implementation of a Universe.

What are the most important features of a data warehouse?
Drill down, drill across, graphs, pie charts, dashboards, and time handling. Being able to drill down/drill across is the most basic requirement of an end user in a data warehouse. Drilling down most directly addresses the natural end-user need to see more detail in a result. Drill down should be as generic as possible, because there is absolutely no good way to predict a user's drill-down path.

What does the grain of the star schema mean?
In data warehousing, grain refers to the level of detail available in a given fact table, as well as to the level of detail provided by a star schema. It is usually given as the number of records per key within the table. In general, the grain of the fact table is the grain of the star schema.

What is a star schema?
A star schema is a data warehouse schema with only one fact table and many denormalized dimension tables. The fact table contains primary keys from all the dimension tables and other columns of additive, numeric facts.
What is a snowflake schema?
Unlike a star schema, a snowflake schema contains normalized dimension tables in a tree-like structure with many nesting levels. A snowflake schema is easier to maintain, but queries require more joins.
What is the difference between snowflake and star schema?

Star Schema:
– The star schema is the simplest data warehouse schema. Each dimension is represented in a single table, and there should be no hierarchies between dimension tables.
– It contains a fact table surrounded by dimension tables. If the dimensions are de-normalized, we say it is a star schema design.
– Only one join establishes the relationship between the fact table and any one of the dimension tables.
– A star schema optimizes performance by keeping queries simple and providing fast response time; all the information about each level is stored in one row.
– It is called a star schema because the diagram resembles a star.

Snowflake Schema:
– A snowflake schema is a more complex data warehouse model than a star schema. At least one hierarchy exists between dimension tables.
– It contains a fact table surrounded by dimension tables. If a dimension is normalized, we say it is a snowflaked design.
– Since there are relationships between the dimension tables, many joins must be done to fetch the data.
– Snowflake schemas normalize dimensions to eliminate redundancy; the result is more complex queries and reduced query performance.
– It is called a snowflake schema because the diagram resembles a snowflake.
What is Fact and Dimension? A "fact" is a numeric value that a business wishes to count or sum. A "dimension" is essentially an entry point for getting at the facts. Dimensions are things of interest to the business. A set of level properties that describe a specific aspect of a business, used for analyzing the factual measures. What is Fact Table? A Fact Table in a dimensional model consists of one or more numeric facts of importance to a business. Examples of facts are as follows: • the number of products sold • the value of products sold • the number of products produced
• the number of service calls received
What is a Factless Fact Table?
A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. They are often used to record events or coverage information. Common examples of factless fact tables include:
• Identifying product promotion events (to determine promoted products that didn't sell)
• Tracking student attendance or registration events
• Tracking insurance-related accident events
Types of facts? There are three types of facts:
Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table. Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others. Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
What is Granularity?
Principle: create fact tables with the most granular data possible to support analysis of the business process. In data warehousing, grain refers to the level of detail available in a given fact table, as well as to the level of detail provided by a star schema. It is usually given as the number of records per key within the table. In general, the grain of the fact table is the grain of the star schema.

Facts: Facts must be consistent with the grain; all facts are at a uniform grain.
• Watch for facts of mixed granularity, e.g. total sales for a day mixed with monthly totals.

Dimensions: Each dimension associated with a fact table must take on a single value for each fact row.
• Each dimension attribute must take on one value.
• Outriggers are the exception, not the rule.
What is a Slowly Changing Dimension?
Slowly changing dimensions refers to the change in dimensional attributes over time. An example of a slowly changing dimension is a Resource dimension where attributes of a particular employee change over time, such as designation or department changes.

What is a Conformed Dimension?
Conformed Dimensions (CD) are dimensions that are built once in your model and can be reused multiple times with different fact tables. For example, consider a model containing multiple fact tables, representing different data marts. Now look for a dimension that is common to these fact tables. In this example, let's say the product dimension is common; it can then be reused by creating shortcuts and joining the different fact tables. Typical examples are the time dimension, customer dimension and product dimension.

What is a Junk Dimension?
A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes. A good example would be a trade fact in a company that brokers equity trades.
Instead of having hundreds of small dimensions with few records each, cluttering the database with mini "identifier" tables, all records from these small dimension tables are consolidated and loaded into one dimension table, which we call the junk dimension table (since we are storing all the junk in this one table). For example, a company might have a handful of manufacturing plants, a handful of order types, and so on, and we can consolidate them in one table called the junk dimension table. It is a dimension table used to keep junk attributes.

What is a Degenerated Dimension?
An item that is in the fact table but is stripped of its description (because the description belongs in a dimension table) is referred to as a degenerated dimension. Since it looks like a dimension, but is really in the fact table and has been degenerated of its description, it is called a degenerated dimension. In short, a degenerated dimension is a dimension that is located in the fact table.

Dimensional Model: A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimension tables and fact tables. A dimension table
records information on each dimension, and the fact table records all the "facts", or measures.

Data modeling
There are three levels of data modeling: conceptual, logical, and physical. This section explains the differences among the three, the order in which each one is created, and how to go from one level to the other.

Conceptual Data Model
Features of the conceptual data model include:
• Includes the important entities and the relationships among them.
• No attribute is specified.
• No primary key is specified.
At this level, the data modeler attempts to identify the highest-level relationships among the different entities.

Logical Data Model
Features of the logical data model include:
• Includes all entities and relationships among them.
• All attributes for each entity are specified.
• The primary key for each entity is specified.
• Foreign keys (keys identifying the relationship between different entities) are specified.
• Normalization occurs at this level.
At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how it will be physically implemented in the database. In data warehousing, it is common for the conceptual data model and the logical data model to be combined into a single step (deliverable).
The steps for designing the logical data model are as follows:
1. Identify all entities.
2. Specify primary keys for all entities.
3. Find the relationships between different entities.
4. Find all attributes for each entity.
5. Resolve many-to-many relationships.
6. Normalization.

Physical Data Model
Features of the physical data model include:
• Specification of all tables and columns.
• Foreign keys are used to identify relationships between tables.
• Denormalization may occur based on user requirements.
• Physical considerations may cause the physical data model to be quite different from the logical data model.
At this level, the data modeler specifies how the logical data model will be realized in the database schema. The steps for physical data model design are as follows:
1. Convert entities into tables.
2. Convert relationships into foreign keys.
3. Convert attributes into columns.

See: http://www.learndatamodeling.com/dm_standard.htm

Modeling is an efficient and effective way to represent the organization's needs; it provides information in a graphical way to the members of an organization to understand and communicate the business rules and processes. Business Modeling and Data Modeling are the two important types of modeling.
The differences between a logical data model and a physical data model are shown below.

Logical vs Physical Data Modeling

Logical Data Model (represents business information and defines business rules) maps to the Physical Data Model (represents the physical implementation of the model in a database) as follows:
Entity              -> Table
Attribute           -> Column
Primary Key         -> Primary Key Constraint
Alternate Key       -> Unique Constraint or Unique Index
Inversion Key Entry -> Non Unique Index
Rule                -> Check Constraint, Default Value
Relationship        -> Foreign Key
Definition          -> Comment
Below is a simple data model. [diagram]

Below is the source qualifier for the project dimension. [diagram]
EDIII – Logical Design

[ER diagram residue: logical entity definitions, with primary keys and non-key attributes, for ACW_DF_FEES_STG, ACW_DF_FEES_F, ACW_ORGANIZATION_D, EDW_TIME_HIERARCHY, ACW_PCBA_APPROVAL_F, ACW_DF_APPROVAL_F, ACW_PCBA_APPROVAL_STG, ACW_USERS_D, ACW_DF_APPROVAL_STG, ACW_PART_TO_PID_D, ACW_PRODUCTS_D and ACW_SUPPLY_CHANNEL_D; includes the "PID for DF Fees" diagram.]
EDII – Physical Design

[ER diagram residue: physical table definitions (columns with datatypes) for ACW_DF_FEES_STG, ACW_DF_FEES_F, ACW_ORGANIZATION_D, EDW_TIME_HIERARCHY, ACW_PCBA_APPROVAL_F, ACW_PCBA_APPROVAL_STG, ACW_USERS_D, ACW_DF_APPROVAL_STG, ACW_DF_APPROVAL_F, ACW_PART_TO_PID_D, ACW_PRODUCTS_D and ACW_SUPPLY_CHANNEL_D; includes the "PID for DF Fees" diagram.]
Types of SCD Implementation: Type 1 Slowly Changing Dimension In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.
In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

Customer Key   Name        State
1001           Christina   California

Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.

Usage: About 50% of the time.

When to use Type 1: Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
Type 2 Slowly Changing Dimension
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table:

Customer Key   Name        State
1001           Christina   Illinois
1005           Christina   California

Advantages:
- This allows us to accurately keep all historical information.

Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.

Usage: About 50% of the time.

When to use Type 2: Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
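A simplified Type 2 load step in SQL might look like the sketch below (the customer_dim table, its columns and the sequence are assumptions; in practice the ETL tool usually drives this logic):

-- Expire the current version of the changed customer
UPDATE customer_dim
SET current_flag = 'N', end_date = SYSDATE
WHERE customer_id = 'C1001' AND current_flag = 'Y';

-- Insert the new version with a new surrogate key
INSERT INTO customer_dim
  (customer_key, customer_id, name, state, current_flag, start_date)
VALUES
  (customer_dim_seq.NEXTVAL, 'C1001', 'Christina', 'California', 'Y', SYSDATE);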
Type 3 Slowly Changing Dimension
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest: one indicating the original value and one indicating the current value. There will also be a column that indicates when the current value became active.

In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
• Customer Key
• Name
• Original State
• Current State
• Effective Date

After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

Customer Key   Name        Original State   Current State   Effective Date
1001           Christina   Illinois         California      15-JAN-2003

Advantages:
- This does not increase the size of the table, since existing information is updated.
- This allows us to keep some part of history.

Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.

Usage: Type 3 is rarely used in actual practice.

When to use Type 3: Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
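In SQL, the Type 3 change is a simple in-place update (customer_dim and its columns are assumed as above; the Original State column keeps the value from the initial load):

UPDATE customer_dim
SET current_state = 'California',
    effective_date = DATE '2003-01-15'
WHERE customer_key = 1001;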
What is a staging area and why do we need it in DWH?
If the target and source databases are different and the target table volume is high (it contains millions of records), then without a staging table we would have to design the Informatica mapping with a lookup to find out whether each record exists in the target table; since the target has huge volumes, building the lookup cache is costly and will hurt performance. If we create staging tables in the target database, we can simply do an outer join in the source qualifier to determine insert/update; this approach gives good performance and avoids a full table scan on the target to determine inserts/updates (see the sketch below). We can also create indexes on the staging tables; since these tables are designed for a specific application, the indexes will not impact any other schemas/users.
While processing flat files into the data warehouse we can also perform cleansing. Data cleansing, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate. During data cleansing, records are checked for accuracy and consistency.
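Here is the insert/update determination sketch referenced above (table and column names are assumptions):

SELECT s.customer_id, s.name,
       CASE WHEN t.customer_id IS NULL THEN 'INSERT' ELSE 'UPDATE' END AS load_action
FROM stg_customer s
LEFT OUTER JOIN dim_customer t
  ON s.customer_id = t.customer_id;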
• Since it is a one-to-one mapping from ODS to staging, we truncate and reload.
• We can create indexes in the staging area so the source qualifier performs at its best.
• If we have a staging area, there is no need to rely on an Informatica transformation (lookup) to know whether the record exists or not.

Data cleansing
Weeding out unnecessary or unwanted things (characters, spaces, etc.) from incoming data to make it more meaningful and informative.

Data merging
Data can be gathered from heterogeneous systems and put together.

Data scrubbing
Data scrubbing is the process of fixing or eliminating individual pieces of data that are incorrect, incomplete or duplicated before the data is passed to the end user. Data scrubbing is aimed at more than eliminating errors and redundancy; the goal is also to bring consistency to various data sets that may have been created with different, incompatible business rules.
ODS (Operational Data Store):
An ODS is a replica of the OLTP system; the need for it is to reduce the burden on the production system (OLTP) while fetching data for loading targets. Hence it is a mandatory requirement for every warehouse.
So do we transfer data from OLTP to the ODS every day to keep it up to date? OLTP is a sensitive database; it should not be burdened with many concurrent select statements, which may impact performance, and if something goes wrong while fetching data from OLTP into the data warehouse it directly impacts the business. The ODS is the replication of OLTP and is usually refreshed through scheduled Oracle jobs. It enables management to gain a consistent picture of the business.

What is a surrogate key?
A surrogate key is a substitution for the natural primary key. It is a unique identifier or number (normally created by a database sequence generator) for each record of a dimension table that can be used as the primary key of the table. A surrogate key is useful because natural keys may change.

What is the difference between a primary key and a surrogate key?
A primary key is a special constraint on a column or set of columns. A primary key constraint ensures that the column(s) so designated have no NULL values, and that every value is unique. Physically, a primary key is implemented by the database system using a unique index, and all the columns in the primary key must have been declared NOT NULL. A table may have only one primary key, but it may be composite (consist of more than one column).
A surrogate key is any column or set of columns that can be declared as the primary key instead of a "real" or natural key. Sometimes there can be several natural keys that could be declared as the primary key, and these are all called candidate keys. So a surrogate is a candidate key. A table could actually have more than one surrogate key, although this would be unusual. The most common type of surrogate key is an incrementing integer, such as an auto-increment column in MySQL, a sequence in Oracle, or an identity column in SQL Server.
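In Oracle, a surrogate key is typically populated from a sequence; a minimal sketch (the table and sequence names are assumed):

CREATE SEQUENCE customer_dim_seq START WITH 1 INCREMENT BY 1;

INSERT INTO customer_dim (customer_key, customer_id, name)
VALUES (customer_dim_seq.NEXTVAL, 'C1001', 'Christina');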
4.1 Informatica Overview
Informatica is a powerful Extraction, Transformation, and Loading tool and has been deployed at GE Medical Systems for data warehouse development in the Business Intelligence Team. Informatica comes with the following clients and services to perform various tasks.
Designer – used to develop transformations/mappings.
Workflow Manager / Workflow Monitor – replace the old Server Manager; used to create sessions/workflows/worklets and to run, schedule, and monitor mappings for data movement.
Repository Manager – used to maintain folders, users, permissions, locks, and repositories.
Integration Service – the "workhorse" of the domain; the component responsible for the actual work of moving data according to the mappings developed and placed into operation. It contains several distinct parts such as the Load Manager, Data Transformation Manager, Reader, and Writer.
Repository Service – Informatica client tools and the Integration Service connect to the repository database over the network through the Repository Service.
Informatica Transformations:

Mapping: A mapping is the Informatica object which contains a set of transformations, including source and target. It looks like a pipeline.
Mapplet: A mapplet is a set of reusable transformations. We can use a mapplet in any mapping within the folder. A mapplet can be active or passive depending on the transformations in the mapplet: active mapplets contain one or more active transformations; passive mapplets contain only passive transformations. When you add transformations to a mapplet, keep the following restrictions in mind:
If you use a Sequence Generator transformation, you must use a reusable Sequence Generator transformation. If you use a Stored Procedure transformation, you must configure the Stored Procedure Type to be Normal. You cannot include the following objects in a mapplet:
• Normalizer transformations
• COBOL sources
• XML Source Qualifier transformations
• XML sources
• Target definitions
• Other mapplets
A mapplet must contain Input transformations and/or source definitions with at least one port connected to a transformation in the mapplet, and at least one Output transformation with at least one port connected to a transformation in the mapplet.
Input Transformation: Input transformations are used to create a logical interface to a mapplet in order to allow data to pass into the mapplet. Output Transformation: Output transformations are used to
create a logical interface from a mapplet in order to allow data to pass out of the mapplet.

System Variables: $$$SessStartTime returns the initial system date value on the machine hosting the Integration Service when the server initializes a session. $$$SessStartTime returns the session start time as a string value; the format of the string depends on the database you are using.

Session: A session is a set of instructions that tells the Informatica Server how to move data from sources to targets.

Workflow: A workflow is a set of instructions that tells the Informatica Server how to execute tasks such as sessions, email notifications, and commands. In a workflow, multiple sessions can be included to run in parallel or in a sequential manner.

Source Definition: The Source Definition is used to logically represent a database table or flat file.

Target Definition: The Target Definition is used to logically represent a database table or file in the Data Warehouse / Data Mart.

Aggregator: The Aggregator transformation is used to perform aggregate calculations on a group basis.

Expression: The Expression transformation is used to perform arithmetic calculations on a row-by-row basis, and also, for example, to convert a string to an integer or to concatenate two columns.

Filter: The Filter transformation is used to filter the data based on a single condition and pass it on to the next transformation.

Router: The Router transformation is used to route the data based on multiple conditions and pass it on to the next transformations. It has three kinds of groups:
1) Input group
2) User-defined groups
3) Default group

Joiner: The Joiner transformation is used to join two sources residing in different databases or different locations, like a flat file and an Oracle source, or two relational tables existing in different databases.

Source Qualifier: The Source Qualifier transformation is used to describe in SQL the method by which data is to be retrieved from the source application system; it is also used to join two relational sources residing in the same database.

What is Incremental Aggregation?
A. Whenever a session is created for a mapping with an Aggregator transformation, the session option for Incremental Aggregation can be enabled. When PowerCenter performs incremental aggregation, it passes new source data through the mapping and uses historical cache data to perform the aggregation calculations incrementally.

Lookup: The Lookup transformation is used in a mapping to look up data in a flat file or a relational table, view, or synonym. There are two types of lookups:
1) Connected
2) Unconnected
Differences between connected lookup and unconnected lookup

Connected Lookup:
• Connected to the pipeline and receives its input values from the pipeline.
• We cannot use the same lookup more than once in a mapping.
• Can return multiple columns from the same row.
• Can be configured to use a dynamic or static cache.
• Passes multiple output values to another transformation (link lookup/output ports to another transformation).
• Supports user-defined default values.
• The cache includes the lookup source columns in the lookup condition and the lookup source columns that are output ports.

Unconnected Lookup:
• Not connected to the pipeline; receives input values from the result of a :LKP expression in another transformation, via arguments.
• Can be used more than once within the mapping.
• Designate one return port (R); returns one column from each row.
• Cannot be configured to use a dynamic cache (static cache only).
• Passes one output value to another transformation; the lookup/output/return port passes the value to the transformation calling the :LKP expression.
• Does not support user-defined default values.
• The cache includes all lookup/output ports in the lookup condition and the lookup/return port.
Lookup Caches: When configuring a lookup cache, you can specify any of the following options:
• Persistent cache
• Recache from lookup source
• Static cache
• Dynamic cache
Dynamic cache: When you use a dynamic cache, the PowerCenter Server updates the lookup cache as it passes rows to the target. If you configure a Lookup transformation to use a dynamic cache, you can only use the equality operator (=) in the lookup condition. The NewLookupRow port is enabled automatically.

NewLookupRow value and its meaning:
0 – The PowerCenter Server does not update or insert the row in the cache.
1 – The PowerCenter Server inserts the row into the cache.
2 – The PowerCenter Server updates the row in the cache.
Static cache: This is the default cache; the PowerCenter Server does not update the lookup cache as it passes rows to the target.

Persistent cache: If the lookup table does not change between sessions, configure the Lookup transformation to use a persistent lookup cache. The PowerCenter Server then saves and reuses cache files from session to session, eliminating the time required to read the lookup table.

Differences between dynamic lookup and static lookup

Dynamic Lookup Cache:
• The cache memory gets refreshed as soon as a record gets inserted or updated/deleted in the lookup table during the session run.

Static Lookup Cache:
• The cache memory does not get refreshed even though a record is inserted or updated in the lookup table; it refreshes only in the next session run.
• It is the default cache.
When we configure a Lookup transformation to use a dynamic lookup cache, we can only use the equality operator in the lookup condition, and the NewLookupRow port is enabled automatically. The best example of where we need a dynamic cache: suppose the first record and the last record in the source are the same, but there is a change in the address. What the Informatica mapping has to do here is insert the first record and update the target table with the last record.

If we use a static lookup, the first record goes to the lookup and checks the lookup cache; based on the condition it finds no match, so it returns a null value, and the router then sends that record to the insert flow. But this record is still not available in the cache memory, so when the last record comes to the lookup it checks the cache, finds no match, returns a null value, and again goes to the insert flow through the router — although it is supposed to go to the update flow, because the cache didn't get refreshed when the first record was inserted into the target table.
Normalizer: The Normalizer transformation is used to generate multiple records from a single record based on columns (it transposes column data into rows). We can use the Normalizer transformation to process COBOL sources instead of the Source Qualifier.

Rank: The Rank transformation allows you to select only the top or bottom rank of data. You can use a Rank transformation to return the largest or smallest numeric value in a port or group. The Designer automatically creates a RANKINDEX port for each Rank transformation.

Sequence Generator: The Sequence Generator transformation is used to generate numeric key values in sequential order.

Stored Procedure: The Stored Procedure transformation is used to execute externally stored database procedures and functions. It is used to perform database-level operations.

Sorter: The Sorter transformation is used to sort data in ascending or descending order according to a specified sort key. You can also configure the Sorter transformation for case-sensitive sorting, and specify whether the output rows should be distinct. The Sorter transformation is an active transformation. It must be connected to the data flow.
Union Transformation: The Union transformation is a multiple-input-group transformation that you can use to merge data from multiple pipelines or pipeline branches into one pipeline branch. It merges data from multiple sources similar to the UNION ALL SQL statement combining the results of two or more SQL statements. Like the UNION ALL statement, the Union transformation does not remove duplicate rows. Input groups should have a similar structure.
Update Strategy: The Update Strategy transformation is used to indicate the DML statement. We can implement the update strategy at two levels: 1) mapping level, 2) session level. Session-level properties will override the mapping-level properties.
Aggregator Transformation
Transformation type: Active, Connected

The Aggregator transformation performs aggregate calculations, such as averages and sums. The Aggregator transformation is unlike the Expression transformation in that you use the Aggregator transformation to perform calculations on groups; the Expression transformation permits you to perform calculations on a row-by-row basis only.

Components of the Aggregator transformation: The Aggregator is an active transformation, changing the number of rows in the pipeline. The Aggregator transformation has the following components and options:

Aggregate cache: The Integration Service stores data in the aggregate cache until it completes the aggregate calculations. It stores group values in an index cache and row data in the data cache.

Group by port: Indicates how to create groups. The port can be any input, input/output, output, or variable port. When grouping data, the Aggregator transformation outputs the last row of each group unless otherwise specified.

Sorted input: Select this option to improve session performance. To use sorted input, you must pass data to the Aggregator transformation sorted by the group-by port, in ascending or descending order.

Aggregate expressions: The Designer allows aggregate expressions only in the Aggregator transformation. An aggregate expression can include conditional clauses and non-aggregate functions. It can also include one aggregate function nested within another aggregate function, such as MAX(COUNT(ITEM)). The result of an aggregate expression varies depending on the group-by ports used in the transformation.

Aggregate functions: Use the following aggregate functions within an Aggregator transformation; you can nest one aggregate function within another aggregate function. The transformation language includes the following aggregate functions: AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, SUM, VARIANCE, and STDDEV. When you use any of these functions, you must use them in an expression within an Aggregator transformation.

Performance tips in the Aggregator: Use sorted input to increase mapping performance, but we need to sort the data before sending it to the Aggregator transformation. Filter the data before aggregating it: if you use a Filter transformation in the mapping, place it before the Aggregator transformation to reduce unnecessary aggregation.
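Conceptually, an Aggregator with a group-by port behaves like a SQL GROUP BY. A rough equivalent of grouping on a department port and aggregating a sales port (table and column names are hypothetical):

    -- One output row per group, as the Aggregator emits per group-by port.
    SELECT dept_nbr,
           SUM(sales_amt) AS total_sales,
           MAX(sales_amt) AS max_sale
    FROM   sales_fact
    GROUP BY dept_nbr;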
SQL Transformation
Transformation type: Active/Passive, Connected

The SQL transformation processes SQL queries midstream in a pipeline. You can insert, delete, update, and retrieve rows from a database. You can pass the database connection information to the SQL transformation as input data at run time. The transformation processes external SQL scripts or SQL queries that you create in an SQL editor. The SQL transformation processes the query and returns rows and database errors.

For example, you might need to create database tables before adding new transactions. You can create an SQL transformation to create the tables in a workflow. The SQL transformation returns database errors in an output port, and you can configure another workflow to run if the SQL transformation returns no errors.

When you create an SQL transformation, you configure the following options:

Mode. The SQL transformation runs in one of the following modes:
• Script mode. The SQL transformation runs ANSI SQL scripts that are externally located. You pass a script name to the transformation with each input row. The SQL transformation outputs one row for each input row.
• Query mode. The SQL transformation executes a query that you define in a query editor. You can pass strings or parameters to the query to define dynamic queries or change the selection parameters. You can output multiple rows when the query has a SELECT statement.
Database type. The type of database the SQL transformation connects to.
Connection type. Pass database connection information to the SQL transformation or use a connection object.

Script Mode
An SQL transformation configured for script mode has the following default ports:

Port          Type     Description
ScriptName    Input    Receives the name of the script to execute for the current row.
ScriptResult  Output   Returns PASSED if the script execution succeeds for the row; otherwise contains FAILED.
ScriptError   Output   Returns errors that occur when a script fails for a row.
Java Transformation
Transformation type: Active/Passive, Connected

The Java transformation provides a simple native programming interface to define transformation functionality with the Java programming language. You can use the Java transformation to quickly define simple or moderately complex transformation functionality without advanced knowledge of the Java programming language or an external Java development environment. For example, you can define transformation logic to loop through input rows and generate multiple output rows based on a specific condition. You can also use expressions, user-defined
functions, unconnected transformations, and mapping variables in the Java code.

Transaction Control Transformation
Transformation type: Active, Connected

PowerCenter lets you control commit and roll-back transactions based on a set of rows that pass through a Transaction Control transformation. A transaction is the set of rows bound by commit or roll-back rows. You can define a transaction based on a varying number of input rows. You might want to define transactions based on a group of rows ordered on a common key, such as employee ID or order entry date.

In PowerCenter, you define transaction control at the following levels:
• Within a mapping. Within a mapping, you use the Transaction Control transformation to define a transaction. You define transactions using an expression in a Transaction Control transformation. Based on the return value of the expression, you can choose to commit, roll back, or continue without any transaction changes.
• Within a session. When you configure a session, you configure it for user-defined commit. You can choose to commit or roll back a transaction if the Integration Service fails to transform or write any row to the target.

When you run the session, the Integration Service evaluates the expression for each row that enters the transformation. When it evaluates a commit row, it commits all rows in the transaction to the target or targets. When the Integration Service evaluates a roll-back row, it rolls back all rows in the transaction from the target or targets. If the mapping has a flat file target, you can generate an output file each time the Integration Service starts a new transaction, and you can dynamically name each target flat file.
What is the difference between a Joiner and a Lookup?

Joiner:
• On multiple matches it returns all matching records.
• We cannot configure it to use a persistent cache, shared cache, uncached mode, or dynamic cache.
• We cannot override the query in a Joiner.
• We can perform an outer join in the Joiner transformation.
• We cannot use relational operators (i.e. <, >, <=, and so on) in the join condition.

Lookup:
• On multiple matches it returns either the first record, the last record, any value, or an error value.
• We can configure it to use a persistent cache, shared cache, uncached mode, or dynamic cache.
• We can override the query in a Lookup to fetch data from multiple tables.
• We cannot perform an outer join in a Lookup transformation.
• We can use relational operators (i.e. <, >, <=, and so on) in the lookup condition.
What is the difference between a Source Qualifier and a Lookup?

Source Qualifier:
• Pushes all the matching records.
• There is no concept of a cache.
• Used when both the source and the lookup table are in the same database.

Lookup:
• We can restrict whether to return the first value, the last value, or any value.
• Works on the cache concept.
• Needed when the source and the lookup table exist in different databases.
Have you done any performance tuning in Informatica?

1) Yes. One of my mappings was taking 3–4 hours to process 40 million rows into a staging table. There was no transformation inside the mapping; it was a 1-to-1 mapping, so there was nothing to optimize in the mapping itself. I therefore created session partitions using key range on the effective date column. It improved performance a lot: rather than 4 hours, it ran in 30 minutes for the entire 40 million rows. Using partitions, the DTM creates multiple reader and writer threads.

2) There was one more scenario where I got very good performance at the mapping level. Rather than using a Lookup transformation, if we can do an outer join in the source qualifier query override, this gives good performance when both the lookup table and the source are in the same database; if the lookup table has huge volumes, then creating the cache is costly.

3) Optimizing the mapping by using fewer transformations also always gives good performance.

4) If any mapping takes a long time to execute, first we need to look into the source and target statistics in the monitor for the throughput, and then find the bottleneck by looking at the busy percentages in the session log, which tell us which transformation is taking more time. If the source query is the bottleneck, it will show at the end of the session log as "query issued to database" — that means there is a performance issue in the source query, and we need to tune that query.

The Informatica session log shows the busy percentage; based on that we need to find out where the bottleneck is.

***** RUN INFO FOR TGT LOAD ORDER GROUP, CONCURRENT SET *****
Thread [READER_1_1_1] created for [the read stage] of partition point [SQ_ACW_PCBA_APPROVAL_STG] has completed: Total Run Time = [7.193083] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000] Thread [TRANSF_1_1_1] created for [the transformation stage] of partition point [SQ_ACW_PCBA_APPROVAL_STG] has completed. The total run time was insufficient for any meaningful statistics. Thread [WRITER_1_*_1] created for [the write stage] of partition point [ACW_PCBA_APPROVAL_F1, ACW_PCBA_APPROVAL_F] has completed: Total Run Time = [0.806521] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000]
Suppose I have to load 40 lakh records into the target table and the workflow is taking about 10–11 hours to finish. I've already increased the cache size to 128 MB. There are no Joiners, just Lookups and Expression transformations.

Ans: (1) If the lookups have many records, try creating indexes on the columns used in the lookup condition, and try increasing the lookup cache. If that doesn't increase the performance, and the target has any indexes, disable them in the target pre-load and enable them in the target post-load.

(2) Three things you can do:
1. Increase the commit interval (by default it is 10000).
2. Use bulk mode instead of normal mode in case your target doesn't have primary keys, or use pre- and post-session SQL to implement the same (depending on the business requirement).
3. Use key partitioning to load the data faster.

(3) If your target has key constraints and indexes, they slow
the loading of data. To improve session performance in this case, drop the constraints and indexes before you run the session and rebuild them after completion of the session.
What is constraint-based loading in Informatica?
By setting the Constraint Based Loading property at the session level (in the Config Object tab), we can load data into parent and child relational tables (primary key–foreign key). Generally what it does is load the data first into the parent table and then into the child table.

What is the use of Shortcuts in Informatica?
If we copy source definitions, target definitions, or mapplets from a shared folder to any other folder, they become shortcuts. Let's assume we have imported some source and target definitions into a shared folder, and we use those source and target definitions in mappings in other folders as shortcuts. If any modification occurs in the backend (database) structure, like adding new columns or dropping existing columns, either in the source or the target, then if we re-import into the shared folder those changes automatically reflect in all folders/mappings wherever we used those source or target definitions.

Target Update Override
If we don't have a primary key on the target table, we can still perform updates using the Target Update Override option. By default, the Integration Service updates target tables based on key values; however, you can override the default UPDATE statement for each target in a mapping. You might want to update the target based on non-key columns.

Overriding the WHERE Clause
You can override the WHERE clause to include non-key
columns. For example, you might want to update records for employees named Mike Smith only. To do this, you edit the WHERE clause as follows:

UPDATE T_SALES
SET DATE_SHIPPED = :TU.DATE_SHIPPED,
    TOTAL_SALES  = :TU.TOTAL_SALES
WHERE EMP_NAME = :TU.EMP_NAME AND EMP_NAME = 'MIKE SMITH'

If you modify the UPDATE portion of the statement, be sure to use :TU to specify ports.

How do you perform incremental logic (Delta or CDC)?
Incremental means: suppose today we processed 100 records; for tomorrow's run we need to extract whatever records were newly inserted or updated after the previous run, based on the last-updated timestamp (of yesterday's run). This process is called incremental or delta.

Approach_1: Using SetMaxVariable()
1) First create a mapping variable ($$Pre_sess_max_upd) and assign an initial value of an old date (01/01/1940).
2) Then override the source qualifier query to fetch only LAST_UPD_DATE >= $$Pre_sess_max_upd (mapping variable).
3) In the expression, assign the max last_upd_date value to $$Pre_sess_max_upd (mapping variable) using SetMaxVariable.
4) Because it is a variable, it stores the max last_upd_date value in the repository; in the next run our source qualifier query will fetch only the records updated or inserted after the previous run.

Approach_2: Using a parameter file
1) First create a mapping parameter ($$Pre_sess_start_tmst) and assign an initial value of an old date (01/01/1940) in the parameter file.
2) Then override the source qualifier query to fetch only LAST_UPD_DATE >= $$Pre_sess_start_tmst (mapping parameter).
3) Update the mapping parameter ($$Pre_sess_start_tmst)
value in the parameter file using a shell script or another mapping after the first session completes successfully.
4) Because it is a mapping parameter, we need to update the value in the parameter file every time, after completion of the main session.
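In approaches 1 and 2 the source qualifier override ends up looking roughly like the sketch below; the table name and date-format mask are assumptions, and the variable is substituted as plain text when the session runs:

    SELECT *
    FROM   src_orders
    WHERE  last_upd_date >=
           TO_DATE('$$Pre_sess_max_upd', 'MM/DD/YYYY HH24:MI:SS');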
Approach_3: Using Oracle control tables
1) First we need to create two control tables, cont_tbl_1 and cont_tbl_2, with the structure (session_st_time, wf_name).
2) Then insert one record into each table with session_st_time = 1/1/1940 and the workflow name.
3) Create two stored procedures. The first updates cont_tbl_1 with the session start time; set its stored procedure type property to Source Pre-load.
4) In the second stored procedure, set the stored procedure type property to Target Post-load; this procedure updates the session_st_time in cont_tbl_2 from cont_tbl_1.
5) Then override the source qualifier query to fetch only LAST_UPD_DATE >= (SELECT session_st_time FROM cont_tbl_2 WHERE wf_name = 'actual workflow name').

SCD Type-II Effective-Date Approach
• We have a dimension in the current project called the resource dimension, where we maintain history to keep track of SCD changes.
• To maintain the history in this slowly changing dimension we followed the SCD Type-II effective-date approach.
• The resource dimension structure would be eff-start-date, eff-end-date, the surrogate key (s.k), and the source columns.
• Whenever we insert into the dimension we populate the eff-start-date with sysdate, the eff-end-date with a future date, and the s.k with a sequence number.
• If the record is already present in the dimension but there is a change in the source data, then we update the previous record's eff-end-date with sysdate and insert the source data as a new record.

Informatica design to implement the SCD Type-II effective-date approach
• Once we fetch a record from the source qualifier, we send it to a lookup to find out whether the record is present in the target or not, based on the source primary key column.
• Once we find the match in the lookup, we take the SCD columns from the lookup and the source columns from the SQ into an Expression transformation. In the Lookup transformation we need to override the lookup query to fetch only the active records from the dimension while building the cache.
• In the Expression transformation we compare the source data with the lookup return data:
• If the source and target data are the same, we set a flag of 'S'.
• If the source and target data are different, we set a flag of 'U'.
• If the source data does not exist in the target (that is, the lookup returns a null value), we set a flag of 'I'.
• Based on the flag values, in the Router we route the data into the insert and update flows.
• If flag = 'I' or 'U', we pass the record to the insert flow.
• If flag = 'U', we also pass the record to the eff-end-date update flow.
• When we do the insert, we pass the sequence value to the s.k.
• Whenever we do the update, we update the eff-end-date column based on the s.k value returned by the lookup.
Complex Mapping
• We had a requirement for an order file: every day the source system places a file with a timestamp in its name on the Informatica server.
• We have to process the current date's file through Informatica.
• The source file directory contains files older than 30 days, all with timestamps.
• For this requirement, if I hardcode the timestamped source file name, it will process the same file every day.
• So what I did here is create $InputFilename for the source file name.
• Then I use a parameter file to supply the value for the session variable ($InputFilename).
• To update this parameter file I created one more mapping.
• This mapping updates the parameter file with the timestamp appended to the file name.
• I make sure to run this parameter-file-update mapping before my actual mapping.
How to handle errors in Informatica?
• We have a source with numerator and denominator values, and we need to calculate num/deno when populating the target.
• If deno = 0, we should not load that record into the target table.
• We send those records to a flat file; after completion of the first session run, a shell script checks the file size.
• If the file size is greater than zero, the script sends an email notification to the source system POC (point of contact), along with the deno-zero record file and an appropriate email subject and body.
• If the file size <= 0, that means there are no records in the flat file, and the shell script does not send any email notification.
• Or:
• We are expecting a not-null value for one of the source columns.
• If it is null, that means it is an error record.
• We can use the above approach for that error handling as well.

Why do we need a source qualifier?
Simply put, it performs a select statement; a select statement fetches the data in the form of rows. The source qualifier selects the data from the source table and identifies the records from the source.

Parameter file: it supplies the values for session-level variables and mapping-level variables.

Variables are of two types:
• Session-level variables
• Mapping-level variables

Session-level variables are of four types:
• $DBConnection_Source
• $DBConnection_Target
• $InputFile
• $OutputFile

Mapping-level variables are of two types:
• Variable
• Parameter
What is the difference between mapping-level and session-level variables?
Mapping-level variables always start with $$; session-level variables always start with $.

Flat File
A flat file is a collection of data in a file in a specific format. Informatica can support two types of flat files:
• Delimited
• Fixed width
For delimited files we need to specify the separator. For fixed-width files we need to know the format first, i.e., how many characters to read for each column. For delimited files it is also necessary to know the structure, because of the headers: if the file contains a header, then in the definition we need to skip the first row.

List file: If we want to process multiple files with the same structure, we don't need multiple mappings and multiple sessions; we can use one mapping and one session using the list file option. First we need to create the list file for all the files; then we can use this file in the main mapping.

Parameter file format: It is a text file; below is the format for a parameter file. We place this file on the Unix box where our Informatica server is installed.

[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALES_HIST_AUSTRI]
$InputFileName_BAAN_SALE_HIST=/interface/dev/etl/apo/srcfiles/HS_025_20070921
$DBConnection_Target=DMD2_GEMS_ETL
$$CountryCode=AT
$$CustomerNumber=120165
[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALES_HIST_BELUM]
$DBConnection_Source=DEVL1C1_GEMS_ETL
$OutputFileName_BAAN_SALES=/interface/dev/etl/apo/trgfiles/HS_002_20070921
$$CountryCode=BE
$$CustomerNumber=101495
Difference between 7.x and 8.x

PowerCenter 7.X architecture (diagram).
PowerCenter 8.X architecture (diagram).
For example, in PowerCenter:
• The PowerCenter Server has become a service, the Integration Service.
• There is no more Repository Server, but PowerCenter includes a Repository Service.
• Client applications are the same, but work on top of the new services framework.

Below are the differences between 7.1 and 8.1:
1) PowerCenter Connect for SAP NetWeaver BW option
2) SQL transformation added
3) Service-oriented architecture
4) Grid concept as an additional feature
5) Random file names can be generated in the target
6) Command-line programs: new infacmd and infasetup commands were added
7) Java transformation added
8) Concurrent cache creation and faster index building as additional features in the Lookup transformation
9) Caches are automatic; you don't need to allocate them at the transformation level
10) Pushdown optimization techniques
11) We can append data to a flat file target
12) Dynamic file names can be generated in Informatica 8
13) Flat file names can be populated to the target while processing through a list file
14) For flat files, headers and footers can be populated using advanced options at the session level in 8
15) GRID option at the session level
Effective in version 8.0, you create and configure a grid in the Administration Console. You configure a grid to run on multiple nodes, and you configure one Integration Service to run on the grid. The Integration Service runs processes on the nodes in the grid to distribute workflows and sessions. In addition to running a workflow on a grid, you can now run a session on a grid. When you run a session or workflow on a grid, one service process runs on each available node in the grid.
Pictorial Representation of Workflow execution:
1. A PowerCenter client requests the IS to start a workflow.
2. The IS starts the ISP.
3. The ISP consults the LB to select a node.
4. The ISP starts the DTM on the node selected by the LB.

Integration Service (IS)
The key functions of the IS are:
• Interpretation of the workflow and mapping metadata from the repository
• Execution of the instructions in the metadata
• Managing the data from source system to target system within memory and disk
The three main components of the Integration Service which enable data movement are:
• Integration Service Process
• Load Balancer
• Data Transformation Manager
Integration Service Process (ISP)
The Integration Service starts one or more Integration Service processes to run and monitor workflows. When we run a workflow, the ISP starts and locks the workflow, runs the workflow tasks, and starts the process to run sessions. The functions of the Integration Service Process are:
• Locks and reads the workflow
• Manages workflow scheduling, i.e., maintains session dependency
• Reads the workflow parameter file
• Creates the workflow log
• Runs workflow tasks and evaluates the conditional links
• Starts the DTM process to run the session
• Writes historical run information to the repository
• Sends post-session emails

Load Balancer
The Load Balancer dispatches tasks to achieve optimal performance. It dispatches tasks to a single node or across the nodes in a grid after performing a sequence of steps. Before understanding these steps we have to know about Resources, Resource Provision Thresholds, Dispatch mode and Service levels
Resources – we can configure the Integration Service to check the resources available on each node and match them with the resources required to run the task. For example, if a session uses an SAP source, the Load Balancer dispatches the session only to nodes where the SAP client is installed.
Three Resource Provision Thresholds:
• Maximum CPU Run Queue Length – the maximum number of runnable threads waiting for CPU resources on the node.
• Maximum Memory % – the maximum percentage of virtual memory allocated on the node relative to the total physical memory size.
• Maximum Processes – the maximum number of running Session and Command tasks allowed for each Integration Service process running on the node.

Three Dispatch modes:
• Round-Robin: the Load Balancer dispatches tasks to available nodes in a round-robin fashion after checking the "Maximum Processes" threshold.
• Metric-based: checks all three resource provision thresholds and dispatches tasks in round-robin fashion.
• Adaptive: checks all three resource provision thresholds and also ranks nodes according to current CPU availability.

Service Levels establish priority among tasks that are waiting to be dispatched; the three components of a service level are Name, Dispatch Priority, and Maximum Dispatch Wait Time. The maximum dispatch wait time is the amount of time a task can wait in the queue, which ensures that no task waits forever.
A. Dispatching tasks on a node
1. The Load Balancer checks different resource provision thresholds on the node depending on the dispatch mode set. If dispatching the task would cause any threshold to be exceeded, the Load Balancer places the task in the dispatch queue and dispatches it later.
2. The Load Balancer dispatches all tasks to the node that runs the master Integration Service process.

B. Dispatching tasks on a grid
1. The Load Balancer verifies which nodes are currently running and enabled.
2. The Load Balancer identifies nodes that have the PowerCenter resources required by the tasks in the workflow.
3. The Load Balancer verifies that the resource provision thresholds on each candidate node are not exceeded. If dispatching the task would cause a threshold to be exceeded, the Load Balancer places the task in the dispatch queue and dispatches it later.
4. The Load Balancer selects a node based on the dispatch mode.

Data Transformation Manager (DTM) Process
When the workflow reaches a session, the Integration Service Process starts the DTM process. The DTM is the process associated with the session task. The DTM process performs the following tasks:
• Retrieves and validates session information from the repository.
• Validates source and target code pages.
• Verifies connection object permissions.
• Performs pushdown optimization when the session is configured for pushdown optimization.
• Adds partitions to the session when the session is configured for dynamic partitioning.
• Expands the service process variables, session parameters, and mapping variables and parameters.
• Creates the session log.
• Runs pre-session shell commands, stored procedures, and SQL.
• Sends a request to start worker DTM processes on other nodes when the session is configured to run on a grid.
• Creates and runs mapping, reader, writer, and transformation threads to extract, transform, and load data.
• Runs post-session stored procedures, SQL, and shell commands and sends post-session email.
• After the session is complete, reports the execution result to the ISP.
Approach_1: Using SetMaxVariable()
1) First create a mapping variable ($$INCREMENT_TS) and assign an initial value of an old date (01/01/1940).
2) Then override the source qualifier query to fetch only LAST_UPD_DATE >= $$INCREMENT_TS (mapping variable).
3) In the expression, assign the max last_upd_date value to $$INCREMENT_TS (mapping variable) using SetMaxVariable.
4) Because it is a variable, it stores the max last_upd_date value in the repository; in the next run the source qualifier query will fetch only the records updated or inserted after the previous run.
Logic in the mapping variable (screenshot):
Logic in the SQ (screenshot):
In the expression, assign the max last-update-date value to the variable using the SetMaxVariable function.
Logic in the update strategy (screenshot):
Approach_2: Using a parameter file
1) First create a mapping parameter ($$LastUpdateDateTime) and assign an initial value of an old date (01/01/1940) in the parameter file.
2) Then override the source qualifier query to fetch only LAST_UPD_DATE >= $$LastUpdateDateTime (mapping parameter).
3) Update the mapping parameter ($$LastUpdateDateTime) value in the parameter file using a shell script or another mapping after the first session completes successfully.
4) Because it is a mapping parameter, we need to update the value in the parameter file every time, after completion of the main session.

Parameter file:
[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALES_HIST_AUSTRI]
$DBConnection_Source=DMD2_GEMS_ETL
$DBConnection_Target=DMD2_GEMS_ETL
$$LastUpdateDateTime=01/01/1940
Updating the parameter file (screenshot):
Logic in the expression (screenshot):
Main mapping (screenshot):
SQL override in the SQ transformation (screenshot):
4.2 Informatica Scenarios:
1) How to populate the 1st record to the 1st target, the 2nd record to the 2nd target, the 3rd record to the 3rd target, and the 4th record back to the 1st target through Informatica?
We can do it using a Sequence Generator by setting end value = 3 and enabling the cycle option. Then in the Router take 3 groups:
In the 1st group specify the condition as seq next value = 1 and pass those records to the 1st target.
Similarly, in the 2nd group specify the condition as seq next value = 2 and pass those records to the 2nd target.
In the 3rd group specify the condition as seq next value = 3 and pass those records to the 3rd target.
Since we have enabled the cycle option, after reaching the end value the Sequence Generator starts again from 1; for the 4th record the seq next value is 1, so it goes to the 1st target.
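A rough SQL analogue of the same round-robin split (the source table name is hypothetical):

    -- MOD cycles 1,2,3,1,2,3,... just like the Sequence Generator with
    -- end value = 3 and the cycle option enabled.
    SELECT t.*,
           MOD(ROWNUM - 1, 3) + 1 AS target_no
    FROM   src_table t;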
2) How to do dynamic file generation in Informatica?
I want to generate a separate file for every state (as per the state, it should generate a file). It has to generate 2 flat files, and the name of each flat file is the corresponding state name — that is the requirement. Below is my mapping:
Source (Table) -> SQ -> Target (FF)

Source:
State  Transaction  City
AP     2            HYD
AP     1            TPT
KA     5            BANG
KA     7            MYSORE
KA     3            HUBLI
This functionality was added in Informatica 8.5 onwards; in earlier versions it was not there. We can achieve it with the use of a Transaction Control transformation and the special "FileName" port in the target file. In order to generate the target file names from the mapping, we should make use of the special "FileName" port in the target file. You can't create this special port from the usual New Port button; there is a special button with the label "F" on it at the right-most corner of the target flat file when viewed in the Target Designer. When you have different sets of input data with different target files to be created, use the same target instance, but with a Transaction
Control transformation which defines the boundary for the source sets.

In the target flat file there is an option in the Columns tab, i.e., FileName as a column; when you click it, a non-editable column gets created in the metadata of the target. In the Transaction Control transformation, give a condition such as iif(not isnull(emp_no), tc_commit_before, continue) and map the grouping column to the target's FileName column. The mapping will be like this: source -> SQ -> Transaction Control -> target. Run it, and separate files will be created, each named after the value mapped to the FileName port.
3) How to concatenate row data through Informatica?

Source:
Ename   EmpNo
stev    100
methew  100
john    101
tom     101

Target:
Ename        EmpNo
stev methew  100
john tom     101
Approach 1: Using a dynamic lookup on the target table: If the record doesn't exist, do an insert into the target. If it already exists, then get the corresponding Ename value from the lookup, concatenate it with the current Ename value in the expression, and then update the target Ename column using an update strategy.

Approach 2: Using a variable port: Sort the data in the SQ based on the EmpNo column, then use an expression to store the previous record's information using variable ports. After that, use a router: insert the record if it comes for the first time; if it has already been inserted, update Ename with the concatenated value of the previous name and the current name.
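If the source sits in Oracle 11gR2 or later, the same concatenation could also be pushed into the source qualifier query itself; a hedged sketch (the source table name is assumed):

    -- One row per EmpNo with the names space-concatenated.
    SELECT empno,
           LISTAGG(ename, ' ') WITHIN GROUP (ORDER BY ename) AS ename
    FROM   emp_src
    GROUP BY empno;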
4) How to send unique (distinct) records into one target and duplicates into another target?

Source:
Ename   EmpNo
stev    100
stev    100
john    101
Mathew  102

Output:
Target_1:
Ename   EmpNo
stev    100
john    101
Mathew  102

Target_2:
Ename   EmpNo
stev    100
Approach 1: Using a dynamic lookup on the target table: If the record doesn't exist, do an insert into Target_1; if it already exists, then send it to Target_2 using a router.

Approach 2: Using a variable port: Sort the data in the SQ based on the EmpNo column, then use an expression to store the previous record's information using variable ports. After that, use a router to route the data into the targets: if the record comes for the first time, send it to the first target; if it has already been inserted, send it to Target_2.

5) How to process multiple flat files into a single target table through Informatica if all the files have the same structure?
We can process all flat files through one mapping and one session using a list file. First we need to create the list file using a Unix script; the extension of the list file is .LST, and it contains only the flat file names, one per line. At the session level we need to set the source file directory to the list file path, the source file name to the list file name, and the file type to Indirect.
6) How to populate the file name to the target while loading multiple files using the list file concept?
In Informatica 8.6, select the "Add Currently Processed Flat File Name" option in the Properties tab of the source definition after importing the source file definition in the Source Analyzer. It adds a new column, "currently processed file name"; we can map this column to the target to populate the file name.

7) If we want to run 2 workflows one after another (how to set the dependency between workflows):
• If both workflows exist in the same folder, we can create 2 worklets rather than creating 2 workflows.
• Finally we can call these 2 worklets in one workflow.
• There we can set the dependency.
• If the workflows exist in different folders or repositories, then we cannot create worklets.
• We can set the dependency between these two workflows using a shell script, which is one approach.
• The other approach is Event-Wait and Event-Raise.

If the workflows exist in different folders or different repositories, we can use the approaches below.
1) Using a shell script:
• As soon as the first workflow completes, we create a zero-byte file (an indicator file).
• If the indicator file is available in the particular location, we run the second workflow.
• If the indicator file is not available, we wait for 5 minutes and check again for the indicator, repeating this loop for up to 30 minutes.
• After 30 minutes, if the file still does not exist, we send out an email notification.
2) Event-Wait and Event-Raise approach:
We can put an Event-Wait task before the actual session run in the workflow to wait for an indicator file; if the file is available, it runs the session, otherwise the Event-Wait waits indefinitely until the indicator file is available.
8) How to load a cumulative salary into the target?
Solution: Using variable ports in an expression we can load the cumulative salary into the target.
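For comparison, the variable-port running total corresponds to a SQL analytic function; a hedged sketch over a hypothetical emp table:

    SELECT empno,
           sal,
           SUM(sal) OVER (ORDER BY empno
                          ROWS UNBOUNDED PRECEDING) AS cumulative_sal
    FROM   emp;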
4.3 Development Guidelines
General Development Guidelines
The starting point of development is the logical model created by the Data Architect. This logical model forms the foundation for metadata, which will be continuously maintained throughout the Data Warehouse Development Life Cycle (DWDLC). The logical model is formed from the requirements of the project. At the completion of the logical model, technical documentation is produced defining the sources, targets, requisite business-rule transformations, mappings, and filters. This documentation serves as the basis for the creation of the Extraction, Transformation and Loading processes that actually manipulate the data from the application sources into the Data Warehouse/Data Mart.

To start development on any data mart, you should have the following things set up by the Informatica Load Administrator:
• Informatica folder. The development team, in consultation with the BI Support Group, can decide on a three-letter
code for the project, which would be used to create the Informatica folder as well as the Unix directory structure.
• Informatica user IDs for the developers.
• Unix directory structure for the data mart.
• A schema XXXLOAD on the DWDEV database.

Transformation Specifications
Before developing the mappings you need to prepare the specifications document for the mappings you are to develop. A good template is placed in the templates folder; you can use your own template as long as it has as much detail as, or more than, this template. While estimating the time required to develop mappings, the rule of thumb is as follows:
• Simple mapping – 1 person-day
• Medium-complexity mapping – 3 person-days
• Complex mapping – 5 person-days
Usually the mapping for the fact table is the most complex and should be allotted as much time for development as possible.

Data Loading from Flat Files
It's an accepted best practice to always load a flat file into a staging table before any transformations are done on the data in the flat file. Always use the LTRIM and RTRIM functions on string columns before loading data into a stage table. You can also use the UPPER function on string columns, but before using it you need to ensure that the data is not case sensitive (e.g. ABC is different from Abc). If you are loading data from a delimited file, then make sure the delimiter is not a character which could appear in the data itself. Avoid using comma-separated files; tilde (~) is a good delimiter to use.
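As a sketch of those rules applied while loading a stage table (table and column names are hypothetical, and the flat file is assumed to be exposed via an external table):

    INSERT INTO stg_customer (cust_id, cust_name)
    SELECT LTRIM(RTRIM(cust_id)),
           UPPER(LTRIM(RTRIM(cust_name)))  -- UPPER only if the data is not case sensitive
    FROM   ext_customer_file;              -- e.g., an external table over the flat file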
Failure Notification
Once in production, your sessions and batches need to send out notification to the support team when they fail. You can do this by configuring an Email task at the session level.

Naming Conventions and Usage of Transformations
Port Standards:
Input Ports – It will be necessary to change the name of input ports for lookups, expressions, and filters where ports might have the same name. If ports do have the same name, they default to having a number appended to the name; change this default to a prefix of "IN_". This will allow you to keep track of input ports throughout your mappings. Prefixed with: IN_
Variable Ports – Variable ports that are created within an expression Transformation should be prefixed with a “v_”. This will allow the developer to distinguish between input/output and variable ports. For more explanation of Variable Ports see the section “VARIABLES”. Prefixed with: V_
Output Ports – If organic data is created with a transformation that will be mapped to the target, make sure that it has the same name as the target port that it will be mapped to. Prefixed with: O_
Quick Reference

Object Type              Syntax
Folder                   XXX_<Data Mart Name>
Mapping                  m_fXY_ZZZ_<Target Table Name>_x.x
Session                  s_fXY_ZZZ_<Target Table Name>_x.x
Batch                    b_<Meaningful name representing the sessions inside>
Source Definition        <Source Table Name>
Target Definition        <Target Table Name>
Aggregator               AGG_<Purpose>
Expression               EXP_<Purpose>
Filter                   FLT_<Purpose>
Joiner                   JNR_<Names of Joined Tables>
Lookup                   LKP_<Lookup Table Name>
Normalizer               Norm_<Source Name>
Rank                     RNK_<Purpose>
Router                   RTR_<Purpose>
Sequence Generator       SEQ_<Target Column Name>
Source Qualifier         SQ_<Source Table Name>
Stored Procedure         STP_<Database Name>_<Procedure Name>
Update Strategy          UPD_<Target Table Name>_xxx
Mapplet                  MPP_<Purpose>
Input Transformation     INP_<Description of Data being funneled in>
Output Transformation    OUT_<Description of Data being funneled out>
Database Connection      XXX_<Database Name>_<Schema Name>
4.4 Performance Tips
What is performance tuning in Informatica?
The aim of performance tuning is to optimize session performance so that sessions run during the available load window for the Informatica Server. Increase session performance by doing the following.

The performance of the Informatica Server is related to network connections. Data generally moves across a network at less than 1 MB per second, whereas a local disk moves data five to twenty times faster. Thus network connections often affect session performance, so minimize network hops where possible.

1. Cache lookups if the source table is under 500,000 rows and DON'T cache for tables over 500,000 rows.
2. Reduce the number of transformations. Don't use an Expression transformation just to collect fields. Don't use an Update Strategy transformation if you are only inserting; insert mode is the default.
3. If a value is used in multiple ports, calculate the value once (in a variable) and reuse the result instead of recalculating it for multiple ports.
4. Reuse objects where possible.
5. Delete unused ports, particularly in the Source Qualifier and Lookups.
6. Prefer operators in expressions over the use of functions.
7. Avoid using Stored Procedures, and call them only once during the mapping if possible.
8. Remember to turn off verbose logging after you have finished debugging.
9. Use default values where possible instead of using IIF(ISNULL(X),,) in an Expression port.
10. When overriding the Lookup SQL, always ensure you put a valid ORDER BY statement in the SQL. This causes the database to perform the ordering, rather than the Informatica Server, while building the cache.
11. Improve session performance by using sorted data with the Joiner transformation. When the Joiner transformation is configured to use sorted data, the Informatica Server improves performance by minimizing disk input and output.
12. Improve session performance by using sorted input with the Aggregator transformation, since it reduces the amount of data cached during the session.
13. Improve session performance by using a limited number of connected input/output or output ports to reduce the amount of data the Aggregator transformation stores in the data cache.
14. Use a Filter transformation prior to the Aggregator transformation to reduce unnecessary aggregation.
15. Performing a join in the database is faster than performing a join in the session; use the Source Qualifier to perform the join.
16. Define the source with the smaller number of rows as the master source in Joiner transformations, since this reduces the search time and also the cache.
17. When using multiple conditions in a lookup, specify the conditions with the equality operator first.
18. Improve session performance by caching small lookup tables.
19. If the lookup table is in the same database as the source table, then instead of using a Lookup transformation, join the tables in the Source Qualifier transformation itself if possible.
20. If the lookup table does not change between sessions, configure the Lookup transformation to use a persistent lookup cache. The Informatica Server saves and reuses cache files from session to session, eliminating the time required to read the lookup table.
21. Use the :LKP reference qualifier in expressions only when calling unconnected Lookup transformations.
22. The Informatica Server generates an ORDER BY statement for a cached lookup that contains all lookup ports. By providing an override ORDER BY clause with fewer columns, session performance can be improved.
23. Eliminate unnecessary data type conversions from mappings.
24. Reduce the number of rows being cached by using the Lookup SQL Override option to add a WHERE clause to the default SQL statement.
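Tips 22 and 24 combine into a lookup SQL override shaped like the sketch below (the table and columns are hypothetical). The trailing "--" is the commonly used trick to comment out the ORDER BY that the server appends to the override:

    SELECT prod_id, prod_name
    FROM   prod_dim
    WHERE  active_flag = 'Y'   -- tip 24: WHERE clause shrinks the cache
    ORDER BY prod_id --        -- tip 22: fewer ORDER BY columns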
Unit Test Cases (UTP):
The QA life cycle consists of the following types of testing regimens:
1. Unit Testing
2. Functional Testing
3. System Integration Testing
4. User Acceptance Testing
Unit Testing: The testing, by development, of the application modules to verify that each unit (module) itself meets the accepted user requirements and design and development standards.

Functional Testing: The testing of all the application's modules individually to ensure the modules, as released from development to QA, work together as designed and meet the accepted user requirements and system standards.

System Integration Testing: Testing of all of the application modules in the same environment, database instance, network, and inter-related applications, as they would function in production. This includes security, volume, and stress testing.

User Acceptance Testing (UAT): The testing of the entire application by the end users, ensuring the application functions as set forth in the system requirements documents and that the system meets the business needs.
Each test case records the following columns: Step #, SAPCMS Interfaces, Description, Test Conditions, Expected Results, Actual Results, Pass or Fail (P or F), and Tested By.

Step 1
Description: Check that the total count of records fetched from the source tables matches the total record count in the PRCHG table for a particular session timestamp.
Test Conditions:
SOURCE: SELECT COUNT(*) FROM XST_PRCHG_STG
TARGET: SELECT COUNT(*) FROM _PRCHG
Expected Results: The source and target table load record counts should match.
Actual Results: Should be same as the expected.

Step 2
Description: Check whether all the target columns are getting populated correctly with source data.
Test Conditions:
SELECT PRCHG_ID, PRCHG_DESC, DEPT_NBR, EVNT_CTG_CDE, PRCHG_TYP_CDE, PRCHG_ST_CDE FROM T_PRCHG
MINUS
SELECT PRCHG_ID, PRCHG_DESC, DEPT_NBR, EVNT_CTG_CDE, PRCHG_TYP_CDE, PRCHG_ST_CDE FROM PRCHG
Expected Results: The MINUS query should return zero records.
Actual Results: Should be same as the expected.

Step 3
Description: Check the Insert strategy used to load records into the target table.
Test Conditions: Identify one record from the source which is not in the target table, then run the session.
Expected Results: It should insert the record into the target table with the source data.
Actual Results: Should be same as the expected.

Step 4
Description: Check the Update strategy used to load records into the target table.
Test Conditions: Identify one record from the source which is already present in the target table with a different PRCHG_ST_CDE or PRCHG_TYP_CDE value, then run the session.
Expected Results: It should update the existing record in the target table with the source data.
Actual Results: Should be same as the expected.

A shell sketch automating the step-1 count check appears below.
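As a hedged illustration only, the step-1 count comparison can be automated from the shell with SQL*Plus. The connect string below is a hypothetical placeholder, and the target table name is taken from the step-2 query:

#!/bin/ksh
# Illustrative sketch: compare source and target row counts automatically.
CONNECT="dwh_user/dwh_pass@DWHDB"        # hypothetical connect string
SRC_CNT=`sqlplus -s $CONNECT <<EOF
set heading off feedback off pagesize 0
SELECT COUNT(*) FROM XST_PRCHG_STG;
EOF`
TGT_CNT=`sqlplus -s $CONNECT <<EOF
set heading off feedback off pagesize 0
SELECT COUNT(*) FROM T_PRCHG;
EOF`
if [ "$SRC_CNT" -eq "$TGT_CNT" ]; then
    echo "PASS: source and target counts match ($SRC_CNT)."
else
    echo "FAIL: source=$SRC_CNT target=$TGT_CNT"
fi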
How strong are you in UNIX?
1) I have the UNIX shell scripting knowledge that Informatica work requires, such as running workflows from UNIX using pmcmd. Below is a script to run a workflow from UNIX:

cd /pmar/informatica/pc/pmserver/
/pmar/informatica/pc/pmserver/pmcmd startworkflow -u $INFA_USER -p $INFA_PASSWD -s $INFA_SERVER:$INFA_PORT -f $INFA_FOLDER -wait $1 >> $LOG_PATH/$LOG_FILE
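As a hedged follow-up to that script: pmcmd exits with a non-zero status when the workflow fails, so the script can check it and alert. The notification address here is a placeholder:

rc=$?
if [ $rc -ne 0 ]; then
    # pmcmd exited non-zero: the workflow did not complete successfully
    echo "Workflow $1 failed with return code $rc" | mailx -s "Workflow $1 failed" support_team@example.com
fi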
2) If we are supposed to process flat files using Informatica but those files exist on a remote server, then we have to write a script to FTP them to the Informatica server before we start processing those files.
3) File watching means that if an indicator file is available in the specified location, then we start our Informatica jobs; otherwise we send an email notification using the mailx command saying that the previous jobs did not complete successfully. (A minimal sketch of this appears after this list.)
4) Using a shell script, update the parameter file with the session start time and end time.
This is the kind of scripting knowledge I have. If any new UNIX requirement comes up, I can research the solution and implement it.
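A minimal file-watch sketch along those lines. The indicator path, timeout, workflow name, and email address are hypothetical placeholders:

#!/bin/ksh
# Poll for an indicator file before starting the Informatica job.
IND_FILE=/data/inbound/extract_done.ind
WAITED=0
while [ ! -f $IND_FILE ] && [ $WAITED -lt 3600 ]; do
    sleep 60                          # poll once a minute, up to one hour
    WAITED=`expr $WAITED + 60`
done
if [ -f $IND_FILE ]; then
    # indicator found: start the Informatica workflow
    pmcmd startworkflow -u $INFA_USER -p $INFA_PASSWD -s $INFA_SERVER:$INFA_PORT -f $INFA_FOLDER wf_load_stage
else
    # indicator never arrived: the previous job probably did not complete
    echo "Indicator file $IND_FILE not found." | mailx -s "File watch timed out" support_team@example.com
fi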
Basic commands:
cat file1                (cat is the command to create a non-zero-byte file)
cat file1 file2 > all    combines file1 and file2 into "all" (it will create the file if it doesn't exist)
cat file1 >> file2       appends file1 to file2
> redirects output from standard out (the screen) to a file, printer, or whatever you like.
>> filename appends output to the end of a file called filename.
< redirects input to a process or command.
How to create a zero-byte file?
touch filename    (touch is the command to create a zero-byte file)
How to find all processes that are running?
ps -A

Crontab command:
The crontab command is used to schedule jobs. You must be given permission to run this command by the UNIX administrator. Each job is scheduled with five time fields, as follows:
Minute (0-59)
Hour (0-23)
Day of month (1-31)
Month (1-12)
Day of week (0-6, where 0 is Sunday)
An asterisk (*) represents all values. For example, suppose you want to schedule a job that runs the script named backup_jobs in the /usr/local/bin directory on Sunday (day 0) at 22:25 on the 15th of the month. The entry in the crontab file will be:
25 22 15 * 0 /usr/local/bin/backup_jobs
The * here tells the system to run the job in every month. The syntax is: crontab filename. So create a file with the scheduled jobs as above, and then type crontab filename. This will schedule the jobs. A short sketch of this workflow follows.
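A minimal sketch of that workflow (the schedule file name my_jobs is just an example):

echo "25 22 15 * 0 /usr/local/bin/backup_jobs" > my_jobs   # create the schedule file
crontab my_jobs      # install the schedule for the current user
crontab -l           # list installed cron jobs to verify
crontab -r           # remove all cron jobs for the current user, if needed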
The command below gives the total number of users logged in at this time:
echo `who | wc -l` "are the total number of people logged in at this time."

The command below will display only directories:
$ ls -l | grep '^d'

Pipes: the pipe symbol "|" is used to direct the output of one command to the input of another.

Moving, renaming, and copying files:
cp file1 file2        copy a file
mv file1 newname      move or rename a file
mv file1 ~/AAA/       move file1 into sub-directory AAA in your home directory
rm file1 [file2 ...]  remove or delete a file
To display hidden files: ls -a
Viewing and editing files:
cat filename          dump a file to the screen in ASCII
more filename         view the file contents one page at a time
head filename         show the first few lines of a file
head -5 filename      show the first 5 lines of a file
tail filename         show the last few lines of a file
tail -7 filename      show the last 7 lines of a file
Searching for files: the find command
find . -name aaa.txt                      finds all the files named aaa.txt in the current directory or any subdirectory tree
find / -name vimrc                        finds all the files named 'vimrc' anywhere on the system
find /usr/local/games -name "*xpilot*"    finds all files whose names contain the string 'xpilot' within the /usr/local/games directory tree
You can find out what shell you are using by the command: echo $SHELL
If a file exists, then send an email with the file attached:

if [[ -f $your_file ]]; then
    uuencode $your_file $your_file | mailx -s "$your_file exists..." your_email_address
fi
The line below is the first line of the script:
#!/usr/bin/sh
or
#!/bin/ksh
What does #!/bin/sh mean in a shell script? It tells the system which interpreter to use to execute the script. As you know, the bash shell has some specific features that other shells do not have, and vice versa; the same is true for Perl, Python, and other languages. The #! line tells the system which shell (or interpreter) to use when executing the statements in your script. A small example follows.
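A tiny illustration, assuming the Korn shell is installed at /bin/ksh:

#!/bin/ksh
# The first line tells the OS to execute this script with the Korn shell,
# no matter which shell the user launched it from.
print "This script is running under ksh."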
Interactive history: a feature of bash and tcsh (and sometimes other shells) that lets you use the up-arrow key to access your previous commands, edit them, and re-execute them.

Basics of the vi editor
Opening a file:
vi filename

Creating text. Edit modes: these keys enter editing modes so you can type in the text of your document.
i       insert before current cursor position
I       insert at beginning of current line
a       insert (append) after current cursor position
A       append to end of line
r       replace 1 character
R       replace mode
<ESC>   terminate insertion or overwrite mode

Deletion of text:
x       delete single character
dd      delete current line and put it in the buffer

Saving and quitting:
:w                  write the current file
:w new.file         write the file to the name 'new.file'
:w! existing.file   overwrite an existing file with the file currently being edited
:wq                 write the file and quit
:q                  quit
:q!                 quit with no changes