Detailed Design Document
Version 1.0
1 Introduction
1.1 Purpose
The purpose of this document is to provide detailed information about DWH concepts and Informatica, based on real-time training.
2 ORACLE
2.1 DEFINITIONS
Organizations can store data on various media and in different formats, such as a hard-copy document or a spreadsheet file. To manage databases, you need database management systems (DBMS). A DBMS is a program that stores, retrieves, and modifies data in the database on request. There are four main types of databases: hierarchical, network, relational, and object-relational.
NORMALIZATION:
Some Oracle databases were modeled according to the rules of normalization that were intended to eliminate
redundancy.
An understanding of the rules of normalization is required in order to understand your relationships and functional dependencies.
A row is in first normal form (1NF) if all underlying domains contain atomic values only.
An entity is in Second Normal Form (2NF) when it meets the requirement of being in First Normal Form (1NF) and
additionally:
• Does not have a composite primary key, meaning that the primary key cannot be subdivided into separate
logical entities.
• All the non-key columns are functionally dependent on the entire primary key.
• A row is in second normal form if, and only if, it is in first normal form and every non-key attribute is fully
dependent on the key.
• 2NF eliminates functional dependencies on a partial key by putting the fields in a separate table from those
that are dependent on the whole key. An example is resolving a many-to-many relationship using an
intersecting entity.
An entity is in Third Normal Form (3NF) when it meets the requirement of being in Second Normal Form (2NF) and
additionally:
• Functional dependencies on non-key fields are eliminated by putting them in a separate table. At this level,
all non-key fields are dependent on the primary key.
• A row is in third normal form if and only if it is in second normal form and attributes that do not contribute
to a description of the primary key are moved into a separate table. An example is creating look-up tables.
Boyce Codd Normal Form (BCNF) is a further refinement of 3NF. In his later writings Codd refers to BCNF as 3NF. A
row is in Boyce Codd normal form if, and only if, every determinant is a candidate key. Most entities in 3NF are
already in BCNF.
An entity is in Fourth Normal Form (4NF) when it meets the requirement of being in Third Normal Form (3NF) and
additionally:
Has no multiple sets of multi-valued dependencies. In other words, 4NF states that no entity can have more than a
single one-to-many relationship.
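As a sketch of these rules in practice (the table and column names here are illustrative, not from a real schema), an unnormalized order table can be decomposed step by step:

Ex: ORDERS(order_no, cust_no, cust_name, item_no, item_desc, qty)
-- 2NF: item_desc depends only on item_no (part of the key), so move it out:
ORDERS(order_no, cust_no, cust_name)
ORDER_ITEMS(order_no, item_no, qty)
ITEMS(item_no, item_desc)
-- 3NF: cust_name depends on cust_no, a non-key column, so move it out:
ORDERS(order_no, cust_no)
CUSTOMERS(cust_no, cust_name)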
DDL: Create, Alter, Drop, Truncate
DML: Insert, Update, Delete, Select
DCL: Grant, Revoke
TCL: Commit, Rollback, Savepoint
Syntaxes:
CREATE DATABASE LINK CAASEDW CONNECT TO ITO_ASA IDENTIFIED BY exact123 USING 'CAASEDW';
Materialized View syntax (a minimal sketch; the defining query is illustrative):
CREATE MATERIALIZED VIEW MV_COMPLEX
REFRESH COMPLETE
AS
SELECT d.deptno, SUM(e.sal) total_sal FROM emp e, dept d WHERE e.deptno = d.deptno GROUP BY d.deptno;
A complete refresh can also be requested on demand:
EXEC DBMS_MVIEW.REFRESH('MV_COMPLEX', 'C');
Case Statement:
Select NAME,
       CASE WHEN CLASS_CODE = 'Subscription' THEN ATTRIBUTE_CATEGORY
            ELSE TASK_TYPE
       END AS TASK_TYPE,
       CURRENCY_CODE
From EMP;
Decode():
Select empname, Decode(address, 'HYD', 'Hyderabad', 'Bang', 'Bangalore', address) as address from emp;
Procedure (a minimal skeleton; the table and column names in the body are illustrative):
CREATE OR REPLACE PROCEDURE update_cust (cust_id_IN IN NUMBER)
IS
BEGIN
  UPDATE customers SET last_update_date = SYSDATE WHERE cust_id = cust_id_IN;
END;
/
Trigger (a minimal skeleton; the table name and called procedure are illustrative):
CREATE OR REPLACE TRIGGER emp_sal_trg
AFTER UPDATE ON emp
REFERENCING NEW AS NEW OLD AS OLD
FOR EACH ROW
DECLARE
BEGIN
  IF :NEW.sal = :OLD.sal THEN
    NULL;
  ELSE
    -- call a procedure from the trigger body
    update_sysdate();
  END IF;
END;
/
ORACLE JOINS:
Equi join
Non-equi join
Self join
Natural join
Cross join
Outer join
Left outer
Right outer
Full outer
USING CLAUSE
ON CLAUSE
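The USING and ON clauses can be sketched against the standard emp/dept tables:
Ex: SQL> select empno, ename, dname from emp join dept using (deptno);
Ex: SQL> select e.empno, e.ename, d.dname from emp e join dept d on (e.deptno = d.deptno);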
Non-Equi Join
A join that contains an operator other than '=' in the join condition.
Ex: SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno > d.deptno;
Self Join
Ex1: SQL> select e1.empno,e2.ename ,e1.job,e2.deptno from emp e1,emp e2 where e1.mgr=e2.empno;
Ex2 (a sketch using the standard employees table; the original kept only the WHERE clause):
SQL> select worker.last_name, manager.last_name as manager
     from employees worker, employees manager
     WHERE worker.manager_id = manager.employee_id;
Natural Join
Cross Join
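The two join types above, sketched on the standard emp/dept tables:
Natural join Ex: SQL> select empno, ename, dname from emp natural join dept;
Cross join Ex: SQL> select empno, dname from emp cross join dept;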
Outer Join
An outer join gives the non-matching records along with the matching records.
Left outer join:
This will display all matching records, plus the records in the left-hand table that have no match in the right-hand table.
Ex: SQL> select empno,ename,job,dname,loc from emp e left outer join dept d on(e.deptno=d.deptno);
Or, in Oracle (+) syntax:
where e.deptno=d.deptno(+);
Right outer join:
This will display all matching records, plus the records in the right-hand table that have no match in the left-hand table.
Ex: SQL> select empno,ename,job,dname,loc from emp e right outer join dept d on(e.deptno=d.deptno);
Or, in Oracle (+) syntax:
where e.deptno(+)=d.deptno;
Full outer join:
This will display all matching records and the non-matching records from both tables.
Ex: SQL> select empno,ename,job,dname,loc from emp e full outer join dept d on(e.deptno=d.deptno);
Or, in Oracle (+) syntax, as a union of the left and right outer join queries.
What’s the difference between View and Materialized View?
View:
A view has only a logical existence: it stores no data of its own, and every query against it fetches data from the base tables. A view cannot be scheduled to refresh.
Materialized View:
We can keep aggregated data in a materialized view, we can schedule the MV to refresh, and an MV can be created based on multiple tables. In a DWH, materialized views are essential: if we do aggregate calculations on the reporting side as per the business requirement, report performance is degraded. So to improve report performance, rather than doing the calculations and joins on the reporting side, we put the same logic in the MV; then we can select the data directly from the MV without any joins and aggregations.
Inline view:
If we write a select statement in the FROM clause, that is nothing but an inline view.
Ex:
Get dept wise max sal along with empname and emp no.
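A sketch of such an inline-view query, assuming the standard emp table:
Ex: SQL> select e.empno, e.ename, e.sal, e.deptno
     from emp e,
          (select deptno, max(sal) max_sal from emp group by deptno) m
     where e.deptno = m.deptno and e.sal = m.max_sal;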
What is the difference between view and materialized view?
View: has a logical existence; it does not contain data. Materialized view: has a physical existence.
View: we cannot perform DML operations on a view. Materialized view: we can perform DML operations on a materialized view.
View: select * from view fetches the data from the base table. Materialized view: select * from materialized view fetches the data stored in the materialized view itself.
DELETE
The DELETE command is used to remove rows from a table. A WHERE clause can be used to only remove some
rows. If no WHERE condition is specified, all rows will be removed. After performing a DELETE operation you need to
COMMIT or ROLLBACK the transaction to make the change permanent or to undo it.
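A minimal sketch of DELETE together with the transaction commands (assuming the standard emp table):
SQL> delete from emp where deptno = 40;
SQL> rollback;   -- undo the delete
SQL> delete from emp where deptno = 40;
SQL> commit;     -- make the delete permanent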
TRUNCATE
TRUNCATE removes all rows from a table. The operation cannot be rolled back. As such, TRUNCATE is faster and
doesn't use as much undo space as a DELETE.
DROP
The DROP command removes a table from the database. All of the table's rows, indexes, and privileges are also
removed. The operation cannot be rolled back.
ROWID
A globally unique identifier for a row in a database. It is created at the time the row is inserted into a table,
and destroyed when it is removed from a table. Its format is 'BBBBBBBB.RRRR.FFFF', where BBBBBBBB is the block
number, RRRR is the slot (row) number, and FFFF is the file number.
ROWNUM
For each row returned by a query, the ROWNUM pseudo column returns a number indicating the order in
which Oracle selects the row from a table or set of joined rows. The first row selected has a ROWNUM of 1,
the second has 2, and so on.
You can use ROWNUM to limit the number of rows returned by a query, as in this example:
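For instance, to return only the first ten rows that satisfy the query:
Ex: SQL> select * from emp where rownum <= 10;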
Rowid: an Oracle-internal id that is allocated every time a new record is inserted in a table. This id is globally unique and cannot be changed by the user; it is created at the time the row is inserted into the table, and destroyed when the row is removed.
Row-num: a row number returned by a select statement. The ROWNUM pseudocolumn returns a number indicating the order in which Oracle selects the row from a table or set of joined rows.
SELECT column_list
FROM table
[WHERE condition]
[GROUP BY group_by_expression]
[HAVING group_condition]
[ORDER BY column];
The WHERE clause cannot be used to restrict groups; you use the HAVING clause to restrict groups. Both the WHERE and HAVING clauses can be used to filter data:
Where clause: applies to individual rows; every record is filtered based on the WHERE condition; it is not mandatory to use it with GROUP BY; it is used to restrict rows.
Having clause: must be used with GROUP BY; it tests a condition on the group (aggregate functions) rather than on individual rows; it is used to restrict groups.
MERGE Statement
You can use the MERGE command to perform an insert and an update in a single command. A sketch (student1 is an illustrative target table and student2 the source, joined on their no columns):
MERGE INTO student1 s1
USING student2 s2
ON (s1.no = s2.no)
WHEN MATCHED THEN UPDATE SET s1.name = s2.name
WHEN NOT MATCHED THEN INSERT (no, name) VALUES (s2.no, s2.name);
Sub Query:
Example:
Select deptno, ename, sal from emp a where sal in (select sal from Grade where sal_grade='A' or sal_grade='B');
Example: Find all employees who earn more than the average salary in their department (a sketch using the standard employees table):
Select a.last_name, a.salary, a.department_id from employees a
where a.salary > (select avg(b.salary) from employees b
                  where b.department_id = a.department_id
                  Group by B.department_id);
EXISTS:
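A sketch of an EXISTS sub-query against the standard emp/dept tables; the sub-query only checks for the existence of matching rows:
Ex: SQL> select d.deptno, d.dname from dept d
     where exists (select 1 from emp e where e.deptno = d.deptno);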
Sub-query: a sub-query is executed once for the parent query.
Example: Select * from emp where deptno in (select deptno from dept);
Co-related sub-query: a co-related sub-query is executed once for each row of the parent query.
Example: Select e.* from emp e where sal >= (select avg(sal) from emp a where a.deptno=e.deptno group by a.deptno);
Indexes:
• Bitmap indexes are most appropriate for columns having low distinct values, such as GENDER, MARITAL_STATUS, and RELATION. This assumption is not completely accurate, however. In reality, a bitmap index is always advisable for systems in which data is not frequently updated by many concurrent sessions. In fact, a bitmap index on a column with 100-percent unique values (a candidate column for a primary key) can be as efficient as a B-tree index.
• A B-tree index is appropriate when one or more columns are frequently used together in a WHERE clause or a join condition, and when the table is large and most queries are expected to retrieve less than 2 to 4 percent of the rows.
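For illustration, the two index types are created as follows (the GENDER column is hypothetical):
SQL> create bitmap index emp_gender_bix on emp (gender);
SQL> create index emp_deptno_idx on emp (deptno);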
It is a perfectly valid question to ask why hints should be used. Oracle comes with an optimizer that promises to optimize a query's execution plan. When this optimizer is really doing a good job, no hints should be required at all. Sometimes, however, the characteristics of the data in the database change rapidly, so that the optimizer (or, more accurately, its statistics) is out of date. In this case, a hint can help.
You should first get the explain plan of your SQL and determine what changes can be made so the code operates without hints, if possible. However, hints such as ORDERED, LEADING, INDEX, FULL, and the various AJ and SJ hints can take a wild optimizer and give you optimal performance.
The ANALYZE statement can be used to gather statistics for a specific table, index or cluster. The statistics can be
computed exactly, or estimated based on a specific number of rows, or a percentage of rows:
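For example, exact computation versus a sampled estimate:
SQL> analyze table emp compute statistics;
SQL> analyze table emp estimate statistics sample 15 percent;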
Automatic Optimizer Statistics Collection
By default Oracle 10g automatically gathers optimizer statistics using a scheduled job called GATHER_STATS_JOB.
By default this job runs within a maintenance window between 10 P.M. and 6 A.M. on weeknights and all day on
weekends. The job calls the DBMS_STATS.GATHER_DATABASE_STATS_JOB_PROC internal procedure which gathers
statistics for tables with either empty or stale statistics, similar to the DBMS_STATS.GATHER_DATABASE_STATS
procedure using the GATHER AUTO option. The main difference is that the internal job prioritizes the work such that
tables most urgently requiring statistics updates are processed first.
Hint categories:
• ALL_ROWS
One of the hints that 'invokes' the Cost based optimizer
ALL_ROWS is usually used for batch processing or data warehousing systems.
• FIRST_ROWS
One of the hints that 'invokes' the Cost based optimizer
FIRST_ROWS is usually used for OLTP systems.
• CHOOSE
One of the hints that 'invokes' the Cost based optimizer
This hint lets the server choose between ALL_ROWS and FIRST_ROWS, based on the statistics gathered.
• Hints for Parallel Execution, e.g. /*+ parallel(a,4) */; specify a degree such as 2, 4, or 16.
• Additional Hints
• HASH
Hashes one table (full scan) and creates a hash index for that table, then hashes the other table and uses the hash
index to find corresponding records. Therefore it is not suitable for < or > join conditions.
/*+ use_hash */
ORDERED- This hint forces tables to be joined in the order specified. If you know table X has fewer rows, then
ordering it first may speed execution in a join.
If an index cannot be created, we go for /*+ parallel(table, 8) */ on SELECT and UPDATE statements, for example when the WHERE clause uses operators such as LIKE, NOT IN, >, < or <> that prevent index use.
Explain Plan:
Explain plan tells us whether the query is using indexes properly, what the cost of the query is, and whether it is doing a full table scan; based on these statistics we can tune the query.
The explain plan process stores data in the PLAN_TABLE. This table can be located in the current schema or a shared schema, and is created in SQL*Plus as follows:
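A sketch of creating and using the plan table in SQL*Plus:
SQL> @?/rdbms/admin/utlxplan.sql          -- creates PLAN_TABLE
SQL> explain plan for select * from emp where deptno = 10;
SQL> select * from table(dbms_xplan.display);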
What is your tuning approach if a SQL query is taking a long time? Or how do you tune a SQL query?
If a query is taking a long time, first run the query through Explain Plan; the explain plan process stores data in the PLAN_TABLE. It gives us the execution plan of the query, such as whether the query is using the relevant indexes on the joining columns, or whether indexes to support the query are missing.
If the joining columns don't have indexes, the query does a full table scan, and with a full table scan the cost will be higher. We then create indexes on the joining columns and run the query again; it should give better performance. We also need to analyze the tables if they were last analyzed long ago; the ANALYZE statement can be used to gather statistics for a specific table, index or cluster.
If we still have a performance issue, we use HINTS; a hint is nothing but a clue to the optimizer. We can use hints like:
• ALL_ROWS and FIRST_ROWS (cost-based optimizer hints for batch/DWH and OLTP systems respectively), CHOOSE, and HASH (/*+ use_hash */), as described above.
Stored Procedure:
A stored procedure can be executed from a trigger, but a trigger cannot be executed from a stored procedure.
Stored procedures are compiled collections of programs or SQL statements in the database. Using a stored procedure we can access and modify data present in many tables. A stored procedure is not associated with any particular database object.
Triggers, by contrast, are event-driven special procedures attached to a specific database object, say a table. Stored procedures are not run automatically; they have to be called explicitly by the user. Triggers get executed when the event associated with them fires.
Packages:
A package is a group of related procedures and functions, together with the cursors and variables they use. Packages provide a method of encapsulating related procedures, functions, and associated cursors and variables together as a unit in the database; for example, a package can contain several procedures and functions that process transactions related to the same business area.
Triggers:
Oracle lets you define procedures called triggers that run implicitly when an INSERT, UPDATE, or DELETE statement
is issued against the associated table.
Triggers are similar to stored procedures. A trigger stored in the database can include SQL and PL/SQL
Types of Triggers
• Row Triggers
• Statement Triggers
• INSTEAD OF Triggers
A row trigger is fired each time the table is affected by the triggering statement. For example, if an UPDATE
statement updates multiple rows of a table, a row trigger is fired once for each row affected by the UPDATE
statement. If a triggering statement affects no rows, a row trigger is not run.
When defining a trigger, you can specify the trigger timing: whether the trigger action is to be run before or after
the triggering statement. BEFORE and AFTER apply to both statement and row triggers.
BEFORE and AFTER triggers fired by DML statements can be defined only on tables, not on views.
Difference between Trigger and Procedure
Trigger: there is no need to execute it manually; triggers are fired automatically. Procedure: we need to execute it manually.
Difference between Procedure and Function
Stored procedure: may or may not return values, and can return more than one value using OUT arguments. Function: must return at least one value through its return type.
Stored procedure: can be used to implement business logic. Function: is typically used for calculations.
Stored procedure: accepts multiple arguments and is mainly used to process tasks. Function: is mainly used to compute and return a value.
Stored procedure: cannot be invoked from SQL statements such as SELECT, and can affect the state of the database using COMMIT. Function: can be invoked from SQL statements such as SELECT, and must not affect the state of the database.
Table Space:
Oracle stores data logically in tablespaces and physically in datafiles associated with the corresponding tablespace. A database is divided into one or more logical storage units called tablespaces, and tablespaces are divided into logical units of storage called segments. A tablespace consists of one or more physical datafiles; a datafile can be associated with only one tablespace and only one database.
Control File:
A control file contains information about the associated database that is required for access by an instance, both at
startup and during normal operation. Control file information can be modified only by Oracle; no database
administrator or user can edit a control file.
Query to find duplicate rows:
Select empno, count (*) from EMP group by empno having count (*)>1;
Query to delete duplicate rows:
Delete from EMP where rowid not in (select max (rowid) from EMP group by empno);
Query to display one employee's addresses as columns (a sketch; row_id here is assumed to be derived with mod(rownum, 3)):
select
emp_id,
max(decode(row_id,0,address)) as address1,
max(decode(row_id,1,address)) as address2,
max(decode(row_id,2,address)) as address3
from (select emp_id, address, mod(rownum, 3) row_id from temp)
group by emp_id;
Other query:
select
emp_id,
max(decode(rank_id,1,address)) as add1,
max(decode(rank_id,2,address)) as add2,
max(decode(rank_id,3,address)) as add3
from (select emp_id, address, rank() over (partition by emp_id order by emp_id, address) rank_id from temp)
group by emp_id;
Rank query:
Select empno, ename, sal, r from (select empno, ename, sal, rank () over (order by sal desc) r from EMP);
Dense rank query: the DENSE_RANK function acts like the RANK function except that it assigns consecutive ranks:
Select empno, ename, sal, r from (select empno, ename, sal, dense_rank () over (order by sal desc) r from emp);
Top 5 salaries:
Select empno, ename, sal, r from (select empno,ename,sal,dense_rank() over (order by sal desc) r from emp) where
r<=5;
Or
Select * from (select * from EMP order by sal desc) where rownum<=5;
2nd highest Sal:
Select empno, ename, sal, r from (select empno, ename, sal, dense_rank () over (order by sal desc) r from EMP)
where r=2;
Top sal:
Select * from EMP where sal= (select max (sal) from EMP);
Query to display even-numbered rows:
SQL> select * from emp where (rowid, 0) in (select rowid, mod(rownum, 2) from emp);
Hierarchical query: starting at the root, walk from the top down, and eliminate employee Higgins from the result, but not the employees who report to Higgins (a sketch using the standard employees table):
SELECT last_name, employee_id, manager_id
FROM employees
WHERE last_name != 'Higgins'
START WITH manager_id IS NULL
CONNECT BY PRIOR employee_id = manager_id;
3 DWH CONCEPTS
What is BI?
Business Intelligence refers to a set of methods and techniques that are used by organizations for tactical and
strategic decision making. It leverages methods and technologies that focus on counts, statistics and business
objectives to improve business performance.
The objective of Business Intelligence is to better understand customers and improve customer service, make the
supply and distribution chain more efficient, and to identify and address business problems and opportunities quickly.
What is a Data Warehouse?
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of decision making.
Subject Oriented:
Data that gives information about a particular subject instead of about a company's ongoing operations.
Integrated:
Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Time-variant:
All data in the data warehouse is identified with a particular time period.
Non-volatile:
Data is stable in a data warehouse. More data is added but data is never removed.
What is a DataMart?
A datamart is usually sponsored at the department level and developed with a specific detail or subject in mind; a Data Mart is a subset of a data warehouse with a focused objective.
In terms of design data warehouse and data mart are almost the same.
In general a Data Warehouse is used on an enterprise level and a Data Mart is used on a business
division/department level.
Data mart: usually sponsored at the department level and developed with a specific issue or subject in mind; a data mart is a data warehouse with a focused objective.
Data warehouse: a "Subject-Oriented, Integrated, Time-Variant, Nonvolatile collection of data in support of decision making".
Data mart: used on a business division/department level.
Data warehouse: used on an enterprise level.
Data mart: a subset of data from a Data Warehouse, built for specific user groups.
Data warehouse: an integrated consolidation of data from a variety of sources that is specially designed to support strategic and tactical decision making.
Data mart: by providing decision makers with only a subset of data from the Data Warehouse, privacy, performance, and clarity objectives can be attained.
Data warehouse: its main objective is to provide an integrated environment and a coherent picture of the business at a point in time.
A fact table that contains only primary keys from the dimension tables and does not contain any measures is called a factless fact table.
What is a Schema?
A schema is the arrangement of fact and dimension tables in the data warehouse model, such as a star or snowflake schema.
DRILL DOWN, DRILL ACROSS, Graphs, Pie charts, dashboards and TIME HANDLING
To be able to drill down/drill across is the most basic requirement of an end user in a data warehouse. Drilling down
most directly addresses the natural end-user need to see more detail in a result. Drill down should be as generic as
possible, because there is absolutely no good way to predict users' drill-down paths.
Star schema is a data warehouse schema where there is only one "fact table" and many denormalized dimension
tables.
The fact table contains primary keys from all the dimension tables and other columns of additive, numeric facts.
Unlike the star schema, a snowflake schema contains normalized dimension tables in a tree-like structure with many
nesting levels.
What is the difference between snowflake and star schema?
Star schema: the simplest data warehouse schema. Snowflake schema: a more complex data warehouse model than a star schema.
Star schema: each dimension is represented in a single table, and there should be no hierarchies between dimension tables. Snowflake schema: at least one hierarchy exists between dimension tables.
Star schema: contains a fact table surrounded by dimension tables; if the dimensions are de-normalized, we say it is a star schema design. Snowflake schema: if a dimension is normalized, we say it is a snowflaked design.
Star schema: only one join establishes the relationship between the fact table and any one of the dimension tables. Snowflake schema: since there are relationships between the dimension tables, many joins are needed to fetch the data.
Star schema: so called because the diagram resembles a star. Snowflake schema: so called because the diagram resembles a snowflake.
A "fact" is a numeric value that a business wishes to count or sum. A "dimension" is essentially an entry point for
getting at the facts. Dimensions are things of interest to the business.
A set of level properties that describe a specific aspect of a business, used for analyzing the factual measures.
A Fact Table in a dimensional model consists of one or more numeric facts of importance to a business. Examples of
facts are as follows:
• the value of products sold
Factless fact tables capture the many-to-many relationships between dimensions, but contain no numeric or textual
facts. They are often used to record events or coverage information, for example:
• Identifying product promotion events (to determine promoted products that didn't sell)
Types of facts?
• Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
• Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the
fact table, but not the others.
• Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in
the fact table.
What is Granularity?
Principle: create fact tables with the most granular data possible to support analysis of the business process.
In Data warehousing grain refers to the level of detail available in a given fact table as well as to the level of detail
provided by a star schema.
It is usually given as the number of records per key within the table. In general, the grain of the fact table is the
grain of the star schema.
Facts: facts must be consistent with the grain; all facts are at a uniform grain.
Dimensions: each dimension associated with the fact table must take on a single value for each fact row.
Dimensional Model
Slowly changing dimensions refers to the change in dimensional attributes over time.
An example of slowly changing dimension is a Resource dimension where attributes of a particular employee change
over time like their designation changes or dept changes etc.
Conformed Dimensions (CD): these dimensions are built once in your model and can be reused multiple times with different fact tables. For example, consider a model containing multiple fact tables representing different data marts, and look for a dimension that is common to these fact tables. In this example the product dimension is common, and hence can be reused by creating shortcuts and joining the different fact tables. Some examples are the time dimension, customer dimension, and product dimension.
A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to
any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk
attributes. A good example would be a trade fact in a company that brokers equity trades.
When you consolidate lots of small dimensions, instead of having hundreds of small dimensions with few records in them cluttering your database with mini 'identifier' tables, all records from these small dimension tables are loaded into one dimension table, and we call this the junk dimension table (since we are storing all the junk in this one table). For example, a company might have a handful of manufacturing plants, a handful of order types, and so forth, and we can consolidate them in one junk dimension table.
An item that is in the fact table but is stripped of its description (because the description belongs in a dimension table) is referred to as a degenerate dimension. Since it looks like a dimension but actually sits in the fact table, stripped of its description, it is called a degenerate dimension. In short, a dimension located in the fact table is known as a degenerate dimension.
Dimensional Model:
A type of data modeling suited for data warehousing. In a dimensional model there are two types of tables:
dimension tables and fact tables. A dimension table records information on each dimension, and a fact table
records all the facts, or measures.
Data modeling: There are three levels of data modeling: conceptual, logical, and physical. This section explains the difference among the three, the order in which each one is created, and how to go from one level to the other.
Conceptual data model: at this level, the data modeler attempts to identify the highest-level relationships among the different entities. No attribute is specified.
Logical data model: at this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how it will be physically implemented in the database. Foreign keys (keys identifying the relationship between different entities) are specified. In data warehousing, it is common for the conceptual data model and the logical data model to be combined into a single step (deliverable). The steps for designing the logical data model include normalization.
Physical data model: at this level, the data modeler specifies how the logical data model will be realized in the database schema. Physical considerations may cause the physical data model to be quite different from the logical data model.
Modeling is an efficient and effective way to represent the organization's needs; it provides information in a graphical way to the members of an organization to understand and communicate the business rules and processes. Business Modeling and Data Modeling are the two important types of modeling.
The differences between a logical data model and a physical data model are shown below.
Logical: represents business information and defines business rules. Physical: represents the physical implementation of the model in a database.
Logical: Entity. Physical: Table.
Logical: Attribute. Physical: Column.
Logical: Definition. Physical: Comment.
Below is the simple data model.
EDIII – Logical Design
[Entity-relationship diagram: staging, fact, and dimension tables for the ACW data mart (ACW_DF_FEES_STG, ACW_DF_FEES_F, ACW_ORGANIZATION_D, ACW_USERS_D, ACW_PCBA_APPROVAL_STG, ACW_PCBA_APPROVAL_F, ACW_DF_APPROVAL_STG, ACW_DF_APPROVAL_F, ACW_PART_TO_PID_D, ACW_PRODUCTS_D, ACW_SUPPLY_CHANNEL_D, EDW_TIME_HIERARCHY), showing each table's primary key, non-key attributes, and audit columns.]
27
EDII – Physical Design

[Physical data model diagram. Column definitions recoverable from the diagram:
• ACW_PRODUCTS_D: PRODUCT_KEY NUMBER(10) [PK1], PRODUCT_NAME CHAR(30), BUSINESS_UNIT_ID NUMBER(10), BUSINESS_UNIT VARCHAR2(60), PRODUCT_FAMILY_ID NUMBER(10), PRODUCT_FAMILY VARCHAR2(180), ITEM_TYPE CHAR(30), plus audit columns.
• ACW_DF_APPROVAL_STG: INVENTORY_ITEM_ID NUMBER(10), CISCO_PART_NUMBER CHAR(30), LATEST_REV CHAR(10), PCBA_ITEM_FLAG CHAR(1), APPROVAL_FLAG CHAR(1), APPROVAL_DATE DATE, LOCATION_ID NUMBER(10), SUPPLY_CHANNEL CHAR(10), BUYER VARCHAR2(240), BUYER_ID NUMBER(10), RFQ_CREATED CHAR(1), RFQ_RESPONSE CHAR(1), CSS CHAR(10), plus audit columns.
• ACW_DF_APPROVAL_F: DF_APPROVAL_KEY NUMBER(10) [PK1], PART_KEY NUMBER(10), CISCO_PART_NUMBER CHAR(30), SUPPLY_CHANNEL_KEY NUMBER(10), PCBA_ITEM_FLAG CHAR(1), APPROVED CHAR(1), APPROVAL_DATE DATE, BUYER_ID NUMBER(10), RFQ_CREATED CHAR(1), RFQ_RESPONSE CHAR(1), CSS CHAR(10), plus audit columns.
• ACW_PART_TO_PID_D: PART_TO_PID_KEY NUMBER(10) [PK1], PART_KEY NUMBER(10), CISCO_PART_NUMBER CHAR(30), PRODUCT_KEY NUMBER(10), PRODUCT_NAME CHAR(30), LATEST_REVISION CHAR(10), plus audit columns.
• ACW_SUPPLY_CHANNEL_D: SUPPLY_CHANNEL_KEY NUMBER(10) [PK1], SUPPLY_CHANNEL CHAR(60), DESCRIPTION VARCHAR2(240), LAST_UPDATED_BY NUMBER, LAST_UPDATE_DATE DATE, CREATED_BY NUMBER(10), CREATION_DATE DATE, plus audit columns.]
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other
words, no history is kept.
After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of
the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For
example, in this case, the company would not be able to know that Christina lived in Illinois before.
Usage:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of
historical changes.
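The Type 1 overwrite described above can be sketched in SQL; this is a minimal illustration using SQLite, and the table and column names (customer_dim, state) are assumptions, not from the document.

```python
import sqlite3

# Sketch of a Type 1 SCD change: the new value simply overwrites the
# old one in place, so no history survives.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT)")
cur.execute("INSERT INTO customer_dim VALUES (1001, 'Christina', 'Illinois')")

# Christina moves to California: Type 1 updates the existing row.
cur.execute("UPDATE customer_dim SET state = 'California' WHERE name = 'Christina'")
conn.commit()

print(cur.execute("SELECT state FROM customer_dim WHERE name = 'Christina'").fetchone()[0])
```

Note that the row count never grows: after the update there is still exactly one row, and the fact that Christina ever lived in Illinois is gone.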
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information.
Therefore, both the original and the new record will be present. The new record gets its own primary key.
After Christina moved from Illinois to California, we add the new information as a new row into the table:
Advantages:
- This allows us to keep complete history, since the old record is preserved alongside the new one.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to
start with, storage and performance can become a concern.
Usage:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical
changes.
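The Type 2 behavior above can be sketched the same way; again the names are illustrative assumptions. The new state is inserted as a second row with its own surrogate key rather than overwriting the first.

```python
import sqlite3

# Sketch of a Type 2 SCD change: a new row with a new surrogate key is
# added, so both the old and new versions remain queryable.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT, state TEXT)""")
cur.execute("INSERT INTO customer_dim (name, state) VALUES ('Christina', 'Illinois')")

# The move to California becomes a second row with a new key.
cur.execute("INSERT INTO customer_dim (name, state) VALUES ('Christina', 'California')")
conn.commit()

for row in cur.execute("SELECT customer_key, state FROM customer_dim ORDER BY customer_key"):
    print(row)
```

This also shows the disadvantage noted below: every change adds a row, so the table grows with the rate of change.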
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one
indicating the original value, and one indicating the current value. There will also be a column that indicates when
the current value becomes active.
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
• Customer Key
• Name
• Original State
• Current State
• Effective Date
After Christina moved from Illinois to California, the original information gets updated, and we have the following
table (assuming the effective date of change is January 15, 2003):
Advantages:
- This does not increase the size of the table, since the existing information is updated in place.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina
later moves to Texas on December 15, 2003, the California information will be lost.
Usage:
Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track
historical changes, and when such changes will only occur a finite number of times.
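Using the columns listed above (Original State, Current State, Effective Date), the Type 3 behavior and its limitation can be sketched as follows; table and column names are assumptions.

```python
import sqlite3

# Sketch of a Type 3 SCD change: one column holds the original value
# and one holds the current value, so only one level of history is kept.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,
    name TEXT, original_state TEXT, current_state TEXT, effective_date TEXT)""")
cur.execute("INSERT INTO customer_dim VALUES (1001, 'Christina', 'Illinois', 'Illinois', NULL)")

# Move to California on 2003-01-15: only current_state is updated.
cur.execute("""UPDATE customer_dim SET current_state = 'California',
               effective_date = '2003-01-15' WHERE name = 'Christina'""")

# A later move to Texas overwrites California -- that history is lost.
cur.execute("""UPDATE customer_dim SET current_state = 'Texas',
               effective_date = '2003-12-15' WHERE name = 'Christina'""")
conn.commit()
print(cur.execute("SELECT original_state, current_state FROM customer_dim").fetchone())
```

After the second move, the dimension knows Illinois (original) and Texas (current), but the intermediate California value is unrecoverable, which is exactly the disadvantage described above.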
What is a staging area and why do we need it in a DWH?
If the target and source databases are different and the target table volume is high (some millions of records), then without a staging table we would need to design the Informatica mapping with a lookup to find out whether each record already exists in the target table. Since the target has huge volumes, building the lookup cache is costly and hurts performance.
If we create staging tables in the target database, we can simply do an outer join in the source qualifier to determine insert versus update; this approach gives good performance.
• We can create indexes on the staging tables so that the source qualifier query performs at its best.
• With a staging area there is no need to rely on an Informatica lookup transformation to know whether the record exists or not.
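The outer-join approach above can be sketched in SQL; this is a minimal illustration using SQLite, and the table names (stg, tgt) are assumptions. A single set-based join replaces the row-by-row lookup.

```python
import sqlite3

# Sketch of the staging-area technique: land source rows in a staging
# table in the target database, then use one LEFT OUTER JOIN to decide
# insert vs update instead of a row-by-row lookup cache.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE stg (id INTEGER, val TEXT)")
cur.execute("CREATE TABLE tgt (id INTEGER, val TEXT)")
cur.executemany("INSERT INTO stg VALUES (?, ?)", [(1, 'a'), (2, 'b'), (3, 'c')])
cur.executemany("INSERT INTO tgt VALUES (?, ?)", [(1, 'a'), (2, 'old')])

# A NULL on the target side of the join means INSERT, otherwise UPDATE.
rows = cur.execute("""
    SELECT s.id, s.val,
           CASE WHEN t.id IS NULL THEN 'INSERT' ELSE 'UPDATE' END AS action
    FROM stg s LEFT OUTER JOIN tgt t ON s.id = t.id
    ORDER BY s.id""").fetchall()
for r in rows:
    print(r)
```

The same query, placed in a source qualifier override, classifies every incoming row in one pass over the data.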
Data cleansing
Weeding out unnecessary or unwanted things (characters, spaces, etc.) from incoming data to make it more meaningful and informative.
Data merging
Combining data from two or more sources or records into a single, consistent data set.
Data scrubbing
Data scrubbing is the process of fixing or eliminating individual pieces of data that are incorrect, incomplete,
or duplicated before the data is passed to the end user.
Data scrubbing is aimed at more than eliminating errors and redundancy. The goal is also to bring
consistency to various data sets that may have been created with different, incompatible business rules.
My understanding of an ODS is that it is a replica of the OLTP system, and the need for it is to reduce the burden on the
production system (OLTP) while fetching data for loading targets. Hence it is a mandatory requirement for every
warehouse.
OLTP databases are sensitive; they should not be hit with multiple heavy SELECT statements, as that may impact performance. Also,
if something goes wrong while fetching data from OLTP into the data warehouse, it will directly impact the business.
A surrogate key is a substitution for the natural primary key. It is a unique identifier or number (normally created
by a database sequence generator) for each record of a dimension table that can be used as the primary key of the
table.
What is the difference between a primary key and a surrogate key?
A primary key is a special constraint on a column or set of columns. A primary key constraint ensures that the
column(s) so designated have no NULL values, and that every value is unique. Physically, a primary key is
implemented by the database system using a unique index, and all the columns in the primary key must have been
declared NOT NULL. A table may have only one primary key, but it may be composite (consist of more than one
column).
A surrogate key is any column or set of columns that can be declared as the primary key instead of a "real" or
natural key. Sometimes there can be several natural keys that could be declared as the primary key, and these are
all called candidate keys. So a surrogate is a candidate key. A table could actually have more than one surrogate
key, although this would be unusual. The most common type of surrogate key is an incrementing integer, such as an
auto increment column in MySQL, or a sequence in Oracle, or an identity column in SQL Server.
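A sketch of the contrast described above, using SQLite's AUTOINCREMENT as a stand-in for an Oracle sequence or SQL Server identity column; the table and column names are illustrative.

```python
import sqlite3

# Sketch contrasting a natural key with a surrogate key: the surrogate
# is a meaningless incrementing integer generated by the database,
# while the natural key comes from the source system.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_number TEXT UNIQUE,                     -- natural key from the source
    name TEXT)""")
cur.execute("INSERT INTO customer_dim (customer_number, name) VALUES ('C-1001', 'Christina')")
cur.execute("INSERT INTO customer_dim (customer_number, name) VALUES ('C-1002', 'Mike')")
conn.commit()
print(cur.execute("SELECT customer_key, customer_number FROM customer_dim").fetchall())
```

Because the surrogate key carries no business meaning, it stays stable even if the natural key is reissued or reformatted in the source system.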
4 ETL-INFORMATICA
Informatica is a powerful Extraction, Transformation, and Loading tool and has been deployed at GE Medical Systems
for data warehouse development in the Business Intelligence team. Informatica comes with the following clients to
perform various tasks.
Informatica Transformations:
Mapping: A mapping is the Informatica object that contains a set of transformations, including source and target. It
looks like a pipeline.
Mapplet:
Mapplet is a set of reusable transformations. We can use this mapplet in any mapping within the Folder.
A mapplet can be active or passive depending on the transformations in the mapplet. Active mapplets contain one or
more active transformations. Passive mapplets contain only passive transformations.
When you add transformations to a mapplet, keep the following restrictions in mind:
• If you use a Sequence Generator transformation, you must use a reusable Sequence Generator
transformation.
• If you use a Stored Procedure transformation, you must configure the Stored Procedure Type to be Normal.
• You cannot include the following objects in a mapplet:
o Normalizer transformations
o COBOL sources
o XML sources
o Target definitions
o Other mapplets
• The mapplet contains Input transformations and/or source definitions with at least one port connected to a
transformation in the mapplet.
• The mapplet contains at least one Output transformation with at least one port connected to a
transformation in the mapplet.
Input Transformation: Input transformations are used to create a logical interface to a mapplet in order to allow
data to pass into the mapplet.
Output Transformation: Output transformations are used to create a logical interface from a mapplet in order to
allow data to pass out of a mapplet.
System Variables
$$$SessStartTime returns the initial system date value on the machine hosting the Integration Service when the
server initializes a session. $$$SessStartTime returns the session start time as a string value. The format of the
string depends on the database you are using.
Session: A session is a set of instructions that tells the Informatica Server how to move data from sources to targets.
WorkFlow: A workflow is a set of instructions that tells Informatica Server how to execute tasks such as sessions,
email notifications and commands. In a workflow multiple sessions can be included to run in parallel or sequential
manner.
Source Definition: The Source Definition is used to logically represent a database table or flat file.
Target Definition: The Target Definition is used to logically represent a database table or file in the Data
Warehouse / Data Mart.
Aggregator: The Aggregator transformation is used to perform aggregate calculations on a group basis.
Expression: The Expression transformation is used to perform arithmetic calculations on a row-by-row basis, and
also for tasks such as converting strings to integers or concatenating two columns.
Filter: The Filter transformation is used to filter the data based on a single condition and pass it to the next
transformation.
Router: The Router transformation is used to route the data based on multiple conditions and pass it to the next
transformations. A Router transformation has the following groups:
1) Input group
2) User-defined output groups
3) Default group
Joiner: The Joiner transformation is used to join two sources residing in different databases or different locations
like flat file and oracle sources or two relational tables existing in different databases.
Source Qualifier: The Source Qualifier transformation is used to describe in SQL the method by which data is to be
retrieved from a source application system.
A. Whenever a session is created for a mapping with an Aggregator transformation, the session option for Incremental
Aggregation can be enabled. When PowerCenter performs incremental aggregation, it passes new source data
through the mapping and uses historical cache data to perform new aggregation calculations incrementally.
Lookup: Lookup transformation is used in a mapping to look up data in a flat file or a relational table, view, or
synonym.
1) Connected
2) Unconnected
Connected Lookup:
• Is connected to the pipeline and receives input values from the pipeline.
• We cannot use this lookup more than once in a mapping.
• Can return multiple columns from the same row.
• Can be configured to use a dynamic cache.
• Passes multiple output values to another transformation: link lookup/output ports to another transformation.
• Supports user-defined default values.
• The cache includes the lookup source columns in the lookup condition and the lookup source columns that are output ports.

Unconnected Lookup:
• Is not connected to the pipeline; it receives input values from the result of a :LKP expression in another transformation, via arguments.
• Can be used more than once within the mapping.
• Designates one return port (R); returns one column from each row.
• Cannot be configured to use a dynamic cache.
• Passes one output value to another transformation: the lookup/output/return port passes the value to the transformation calling the :LKP expression.
• Does not support user-defined default values.
• The cache includes all lookup/output ports in the lookup condition and the lookup/return port.
Lookup Caches:
When configuring a lookup cache, you can specify any of the following options:
• Persistent cache
• Static cache
• Dynamic cache
• Shared cache
Dynamic cache: When you use a dynamic cache, the PowerCenter Server updates the lookup cache as it passes
rows to the target.
If you configure a Lookup transformation to use a dynamic cache, you can only use the equality operator (=) in the
lookup condition.
The NewLookupRow port is enabled automatically and can return the following values:
0 — The PowerCenter Server does not update or insert the row in the cache.
1 — The PowerCenter Server inserts the row into the cache.
2 — The PowerCenter Server updates the row in the cache.
Static cache: This is the default cache; the PowerCenter Server does not update the lookup cache as it passes rows to
the target.
Persistent cache: If the lookup table does not change between sessions, configure the Lookup transformation to
use a persistent lookup cache. The PowerCenter Server then saves and reuses cache files from session to session,
eliminating the time required to read the lookup table.
In a dynamic lookup, the cache gets refreshed as soon as a record is inserted, updated, or deleted in the lookup table. The best example of where we need a dynamic cache: suppose the first record and the last record in the source are the same, but there is a change in the address. What the Informatica mapping has to do here is insert the first record and update the target table with the last record.

In a static lookup, the cache does not get refreshed even though records are inserted or updated in the lookup table; it refreshes only on the next session run. If we use a static lookup, the first record goes to the lookup and is checked against the lookup cache; based on the condition, it finds no match, so the lookup returns a null value, and the router sends that record to the insert flow. But this record is still not available in the cache memory, so when the last record comes to the lookup, it again checks the cache, finds no match, and returns a null value; it again goes to the insert flow through the router, even though it is supposed to go to the update flow, because the cache did not get refreshed when the first record was inserted into the target table.
Normalizer: The Normalizer transformation is used to generate multiple records from a single record based on
columns (transposing the column data into rows).
We can use the Normalizer transformation to process COBOL sources instead of a Source Qualifier.
Rank: The Rank transformation allows you to select only the top or bottom rank of data. You can use a Rank
transformation to return the largest or smallest numeric value in a port or group.
The Designer automatically creates a RANKINDEX port for each Rank transformation.
Sequence Generator: The Sequence Generator transformation is used to generate numeric key values in
sequential order.
Stored Procedure: The Stored Procedure transformation is used to execute externally stored database procedures
and functions. It is used to perform the database level operations.
Sorter: The Sorter transformation is used to sort data in ascending or descending order according to a specified sort
key. You can also configure the Sorter transformation for case-sensitive sorting, and specify whether the output rows
should be distinct. The Sorter transformation is an active transformation. It must be connected to the data flow.
Union Transformation:
The Union transformation is a multiple input group transformation that you can use to merge data from multiple
pipelines or pipeline branches into one pipeline branch. It merges data from multiple sources similar to the UNION
ALL SQL statement to combine the results from two or more SQL statements. Similar to the UNION ALL statement,
the Union transformation does not remove duplicate rows. Input groups should have a similar structure.
Update Strategy: The Update Strategy transformation is used to indicate the DML statement. The update strategy can be set at:
1) Mapping level
2) Session level.
Aggregator Transformation:
Transformation type:
Active
Connected
The Aggregator transformation performs aggregate calculations, such as averages and sums. The Aggregator
transformation is unlike the Expression transformation, in that you use the Aggregator transformation to perform
calculations on groups. The Expression transformation permits you to perform calculations on a row-by-row basis
only.
The Aggregator is an active transformation, changing the number of rows in the pipeline. The Aggregator
transformation has the following components and options
Aggregate cache: The Integration Service stores data in the aggregate cache until it completes aggregate
calculations. It stores group values in an index cache and row data in the data cache.
Group by port: Indicate how to create groups. The port can be any input, input/output, output, or variable port.
When grouping data, the Aggregator transformation outputs the last row of each group unless otherwise specified.
Sorted input: Select this option to improve session performance. To use sorted input, you must pass data to the
Aggregator transformation sorted by group by port, in ascending or descending order.
Aggregate Expressions:
The Designer allows aggregate expressions only in the Aggregator transformation. An aggregate expression can
include conditional clauses and non-aggregate functions. It can also include one aggregate function nested within
another aggregate function, such as:
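A typical nested aggregate expression of this form (the port name ITEM is illustrative, not from the document) is:

```
MAX( COUNT( ITEM ))
```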
The result of an aggregate expression varies depending on the group by ports used in the transformation.
Aggregate Functions
Use the following aggregate functions within an Aggregator transformation. You can nest one aggregate function
within another aggregate function.
When you use any of these functions, you must use them in an expression within an Aggregator transformation.
Use sorted input to increase the mapping performance but we need to sort the data before sending to aggregator
transformation.
If you use a Filter transformation in the mapping, place the transformation before the Aggregator transformation to
reduce unnecessary aggregation.
SQL Transformation
Transformation type:
Active/Passive
Connected
The SQL transformation processes SQL queries midstream in a pipeline. You can insert, delete, update, and retrieve
rows from a database. You can pass the database connection information to the SQL transformation as input data at
run time. The transformation processes external SQL scripts or SQL queries that you create in an SQL editor. The
SQL transformation processes the query and returns rows and database errors.
For example, you might need to create database tables before adding new transactions. You can create an SQL
transformation to create the tables in a workflow. The SQL transformation returns database errors in an output port.
You can configure another workflow to run if the SQL transformation returns no errors.
When you create an SQL transformation, you configure the following options:
Script mode. The SQL transformation runs ANSI SQL scripts that are externally located. You pass a script name to
the transformation with each input row. The SQL transformation outputs one row for each input row.
Query mode. The SQL transformation executes a query that you define in a query editor. You can pass strings or
parameters to the query to define dynamic queries or change the selection parameters. You can output multiple rows
when the query has a SELECT statement.
Database type. The type of database the SQL transformation connects to.
Connection type. Pass database connection information to the SQL transformation or use a connection object.
Script Mode
An SQL transformation configured for script mode has the following default ports:
• ScriptName (input): receives the name of the script to execute for the current row.
• ScriptResult (output): returns PASSED if the script execution succeeds for the row; otherwise contains FAILED.
• ScriptError (output): returns errors that occur when a script fails for a row.
Java Transformation
Transformation type:
Active/Passive
Connected
The Java transformation provides a simple native programming interface to define transformation functionality with
the Java programming language. You can use the Java transformation to quickly define simple or moderately
complex transformation functionality without advanced knowledge of the Java programming language or an external
Java development environment.
For example, you can define transformation logic to loop through input rows and generate multiple output rows
based on a specific condition. You can also use expressions, user-defined functions, unconnected transformations,
and mapping variables in the Java code.
Transaction Control Transformation
Transformation type:
Active
Connected
PowerCenter lets you control commit and roll back transactions based on a set of rows that pass through a
Transaction Control transformation. A transaction is the set of rows bound by commit or roll back rows. You can
define a transaction based on a varying number of input rows. You might want to define transactions based on a
group of rows ordered on a common key, such as employee ID or order entry date.
Within a mapping. Within a mapping, you use the Transaction Control transformation to define a transaction. You
define transactions using an expression in a Transaction Control transformation. Based on the return value of the
expression, you can choose to commit, roll back, or continue without any transaction changes.
Within a session. When you configure a session, you configure it for user-defined commit. You can choose to
commit or roll back a transaction if the Integration Service fails to transform or write any row to the target.
When you run the session, the Integration Service evaluates the expression for each row that enters the
transformation. When it evaluates a commit row, it commits all rows in the transaction to the target or targets.
When the Integration Service evaluates a roll back row, it rolls back all rows in the transaction from the target or
targets.
If the mapping has a flat file target you can generate an output file each time the Integration Service starts a new
transaction. You can dynamically name each target flat file.
Transaction control expression
Enter the transaction control expression in the Transaction Control Condition field. The transaction control expression
uses the IIF function to test each row against the condition, using the general form IIF(condition, value1, value2).
The expression contains values that represent actions the Integration Service performs based on the return value of
the condition. The Integration Service evaluates the condition on a row-by-row basis. The return value determines
whether the Integration Service commits, rolls back, or makes no transaction changes to the row. When the
Integration Service issues a commit or roll back based on the return value of the expression, it begins a new
transaction. Use the following built-in variables in the Expression Editor when you create a transaction control
expression:
TC_CONTINUE_TRANSACTION. The Integration Service does not perform any transaction change for this row.
This is the default value of the expression.
TC_COMMIT_BEFORE. The Integration Service commits the transaction, begins a new transaction, and writes
the current row to the target. The current row is in the new transaction.
TC_COMMIT_AFTER. The Integration Service writes the current row to the target, commits the transaction, and
begins a new transaction. The current row is in the committed transaction.
TC_ROLLBACK_BEFORE. The Integration Service rolls back the current transaction, begins a new transaction,
and writes the current row to the target. The current row is in the new transaction.
TC_ROLLBACK_AFTER. The Integration Service writes the current row to the target, rolls back the transaction,
and begins a new transaction. The current row is in the rolled back transaction.
For example, in a Transaction Control transformation, create a transaction control expression to commit data when the
Integration Service encounters a new order entry date.
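A sketch of such an expression, assuming a hypothetical NEW_DATE flag port that is set to 1 on the first row of each new order entry date:

```
IIF(NEW_DATE = 1, TC_COMMIT_BEFORE, TC_CONTINUE_TRANSACTION)
```

On rows where the flag is 1, the Integration Service commits the open transaction and starts a new one; on all other rows it makes no transaction change.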
Joiner vs. Lookup:
• On multiple matches, a Joiner returns all matching records, whereas a Lookup returns either the first record, the last record, any value, or an error value.
• In a Joiner we cannot configure persistent cache, shared cache, uncached mode, or dynamic cache, whereas in a Lookup we can.
• We cannot override the query in a Joiner, whereas we can override the query in a Lookup to fetch data from multiple tables.
• We can perform an outer join in a Joiner transformation; we cannot perform an outer join in a Lookup transformation.
• We cannot use relational operators (i.e. <, >, <=, and so on) in a Joiner transformation, whereas in a Lookup we can.
What is the difference between Source Qualifier and Lookup?
• A Source Qualifier pushes all the matching records, whereas in a Lookup we can restrict whether to return the first value, last value, or any value.
• When both the source and lookup tables are in the same database we can use a Source Qualifier; when they exist in different databases we need to use a Lookup.
1) Yes. One of my mappings was taking 3-4 hours to process 40 million rows into a staging table. We did not have
any transformation inside the mapping; it was a 1-to-1 mapping, so there was nothing to optimize in the mapping itself.
I created session partitions using key range on the effective date column. It improved performance a lot:
rather than 4 hours, it ran in 30 minutes for the entire 40 million rows. Using partitions, the DTM creates
multiple reader and writer threads.
2) There was one more scenario where I got very good performance at the mapping level. Rather than using a
Lookup transformation, if we can do an outer join in the source qualifier query override, this gives
good performance when both the lookup table and the source are in the same database. If the lookup table has huge
volumes, creating the cache is costly.
3) Optimizing a mapping to use fewer transformations also always gives good performance.
4) If any mapping takes a long time to execute, first look into the source and target statistics in
the monitor for the throughput, and find out where exactly the bottleneck is by looking at the busy percentage
in the session log; this tells you which transformation is taking more time. If your source query is the
bottleneck, it will show at the end of the session log as "query issued to database", which means there
is a performance issue in the source query and we need to tune it.
Looking at the busy percentages in the session log is how we find where the bottleneck is.
***** RUN INFO FOR TGT LOAD ORDER GROUP [1], CONCURRENT SET [1] ****
Thread [READER_1_1_1] created for [the read stage] of partition point [SQ_ACW_PCBA_APPROVAL_STG] has
completed: Total Run Time = [7.193083] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000]
Thread [TRANSF_1_1_1] created for [the transformation stage] of partition point [SQ_ACW_PCBA_APPROVAL_STG]
has completed. The total run time was insufficient for any meaningful statistics.
Thread [WRITER_1_*_1] created for [the write stage] of partition point [ACW_PCBA_APPROVAL_F1,
ACW_PCBA_APPROVAL_F] has completed: Total Run Time = [0.806521] secs, Total Idle Time = [0.000000] secs,
Busy Percentage = [100.000000]
Suppose I have to load 40 lakh records into the target table and the workflow
is taking about 10-11 hours to finish. I have already increased
the cache size to 128 MB.
There are no Joiners, just Lookup
and Expression transformations.
Ans:
In this case, drop the constraints and indexes before you run the session, and re-create them after the load completes.
By setting the Constraint Based Loading property at the session level in the Configuration tab, we can load data into parent
and child relational tables (primary key-foreign key).
Generally what it does is load the data first into the parent table, then load it into the child table.
If we copy source definitions, target definitions, or mapplets from a Shared folder to any other folder, they
become shortcuts.
Let's assume we have imported some source and target definitions in a shared folder, and we are using those
source and target definitions in other folders as shortcuts in some mappings.
If any modification occurs in the backend (database) structure, like adding new columns or dropping existing columns
in either the source or the target, and we reimport into the shared folder, those changes are automatically reflected in all
folders/mappings wherever we used those source or target definitions.
If we do not have a primary key on the target table, we can perform updates using the Target Update Override option. By
default, the Integration Service updates target tables based on key values. However, you can override the default
UPDATE statement for each target in a mapping. You might want to update the target based on non-key columns.
You can override the WHERE clause to include non-key columns. For example, you might want to update records for
employees named Mike Smith only; to do this, you edit the WHERE clause accordingly.
If you modify the UPDATE portion of the statement, be sure to use :TU to specify ports.
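A sketch of such an override for the Mike Smith example; the target table name T_SALES and its ports are illustrative assumptions, and :TU references the values coming from the target transformation ports:

```sql
UPDATE T_SALES
SET    DATE_SHIPPED = :TU.DATE_SHIPPED,
       TOTAL_SALES  = :TU.TOTAL_SALES
WHERE  EMP_NAME = :TU.EMP_NAME
AND    EMP_NAME = 'MIKE SMITH'
```

The extra literal predicate restricts the update to the Mike Smith rows while the :TU references supply the incoming values.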
How do you perform incremental logic (Delta or CDC)?
Incremental means: suppose today we processed 100 records; for tomorrow's run you need to extract whatever
records were newly inserted or updated after the previous run, based on the last-updated timestamp (of yesterday's run). This
process is called incremental or delta loading.
1) First create a mapping variable ($$Pre_sess_max_upd) and assign an initial value of an old date (01/01/1940).
2) Then override the source qualifier query to fetch only LAT_UPD_DATE >= $$Pre_sess_max_upd (the mapping variable).
3) In the Expression transformation, assign the max last_upd_date value to $$Pre_sess_max_upd (the mapping variable) using
SETMAXVARIABLE.
4) Because it is a variable, it stores the max last_upd_date value in the repository; in the next run, our source
qualifier query will fetch only the records updated or inserted after the previous run.
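The delta logic in the steps above can be sketched as follows; this uses SQLite to simulate the source table, and the table and column names are illustrative. The variable pre_sess_max_upd plays the role of the mapping variable $$Pre_sess_max_upd.

```python
import sqlite3

# Sketch of incremental (delta) extraction: remember the max
# last-update timestamp from the previous run and fetch only rows
# changed on or after it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (id INTEGER, last_upd_date TEXT)")
cur.executemany("INSERT INTO src VALUES (?, ?)",
                [(1, '2024-01-01'), (2, '2024-01-05'), (3, '2024-01-09')])

pre_sess_max_upd = '1940-01-01'   # initial value on the very first run

def delta(since):
    # Equivalent of the SQ override: LAST_UPD_DATE >= $$Pre_sess_max_upd
    return cur.execute(
        "SELECT id, last_upd_date FROM src WHERE last_upd_date >= ? ORDER BY id",
        (since,)).fetchall()

rows = delta(pre_sess_max_upd)               # first run: all 3 rows
pre_sess_max_upd = max(r[1] for r in rows)   # SETMAXVARIABLE equivalent

cur.execute("INSERT INTO src VALUES (4, '2024-02-01')")  # arrives before next run
print(delta(pre_sess_max_upd))               # second run: only recent rows
```

Note that with >= the boundary row is re-extracted on the next run, which is usually harmless in an insert/update load but is worth being aware of.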
1 First create a mapping parameter ($$Pre_sess_start_tmst) and assign an initial value of an old date
(01/01/1940) in the parameter file.
2 Then override the source qualifier query to fetch only records with LAT_UPD_DATE >= $$Pre_sess_start_tmst.
3 Update the mapping parameter ($$Pre_sess_start_tmst) value in the parameter file using a shell script or
another mapping after the first session completes successfully.
4 Because it is a mapping parameter, every time we need to update the value in the parameter file
after completion of the main session.
1 First we need to create two control tables, cont_tbl_1 and cont_tbl_2, with the structure
(session_st_time, wf_name).
2 Then insert one record into each table with session_st_time = 1/1/1940 and the workflow name.
3 Create two stored procedures. The first updates cont_tbl_1 with the session start time; set the stored
procedure type property to Source Pre-load.
4 In the second stored procedure, set the stored procedure type property to Target Post-load. This procedure
updates the session_st_time in cont_tbl_2 from cont_tbl_1.
5 Then override the source qualifier query to fetch only LAT_UPD_DATE >= (SELECT session_st_time FROM
cont_tbl_2 WHERE wf_name = 'actual workflow name').
• One of the dimensions in our current project is the resource dimension, where we maintain history to keep track of SCD changes.
• To maintain history in this slowly changing dimension, we followed the SCD Type-II effective-date approach.
• The resource dimension structure is eff-start-date, eff-end-date, the surrogate key, and the source columns.
• Whenever I insert into the dimension, I populate eff-start-date with SYSDATE, eff-end-date with a far-future date, and the surrogate key from a sequence.
• If the record is already present in the dimension but the source data has changed, then:
• Update the previous record's eff-end-date with SYSDATE and insert the source data as a new record.
• Once we fetch a record from the Source Qualifier, we send it to a Lookup to find out whether the record is present in the target, based on the source primary key column.
• In the Lookup transformation we override the lookup query to fetch only active records from the dimension while building the cache.
• When we find a match in the Lookup, we take the SCD columns from the Lookup and the source columns from the Source Qualifier into an Expression transformation.
• If the source and target data are the same, flag the row 'S'.
• If the source and target data are different, flag the row 'U'.
• If the source data does not exist in the target (the Lookup returns null), flag the row 'I'.
• Based on the flag values, a Router routes the data into the insert and update flows.
• Whenever we update, we set the eff-end-date column on the row identified by the surrogate key value returned from the Lookup.
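The lookup-and-flag step above can be approximated outside Informatica; this sketch uses awk on two tilde-delimited files (target.dat standing in for the cached dimension, source.dat for the new extract; both file names and the key/value layout are assumptions):

```shell
# Flag each source row I/U/S by comparing it to the "lookup cache".
cat > target.dat <<'EOF'
100~stev
101~john
EOF
cat > source.dat <<'EOF'
100~stev
101~johnny
102~mathew
EOF

awk -F'~' '
  NR==FNR { tgt[$1] = $2; next }            # first file: build the cache
  !($1 in tgt)   { print $0 "~I"; next }    # not in target -> insert
  tgt[$1] != $2  { print $0 "~U"; next }    # changed       -> update
                 { print $0 "~S" }          # unchanged     -> same
' target.dat source.dat > flagged.dat

cat flagged.dat
```

A Router working on the I/U/S flag column would then split the rows into the insert and update flows.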
Complex Mapping
• One of our requirements involves an order file: every day the source system places a file, with a timestamp appended to its name, on the Informatica server.
• The source file directory also contains files older than 30 days, each with its own timestamp.
• If I hardcode the timestamp in the source file name, the session will process the same file every day.
• Instead, I use a parameter file to supply the value of the session variable ($InputFilename).
• A separate mapping updates the parameter file with the current timestamp appended to the file name.
• I make sure this parameter-file-update mapping runs before my actual mapping.
• Another source has numerator and denominator values, and we need to calculate num/deno.
• The records where the denominator is zero are sent to a flat file. After the first session completes, a shell script checks the file size.
• If the file size is greater than zero, the script sends an email notification to the source-system POC (point of contact), attaching the denominator-zero record file, with an appropriate subject and body.
• If the file size is zero, there are no such records in the flat file, and the shell script sends no notification.
• Or: we are expecting a not-null value in one of the source columns.
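The file-size check described above can be sketched as a small shell script (the reject-file name is an assumption; the mailx call is left commented out):

```shell
# Post-session check: notify the POC only when the reject file is non-empty.
REJ_FILE=./deno_zero_records.dat
printf '%s\n' 'ord1~10~0' > "$REJ_FILE"    # sample record with denominator 0

if [ -s "$REJ_FILE" ]; then                # -s: file exists and size > 0
    # mailx -s "Denominator zero records" poc@example.com < "$REJ_FILE"
    echo "notify source-system POC"
else
    echo "no records - skip notification"
fi
```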
The Source Qualifier selects the data from the source table.
The parameter file supplies values to session-level variables and mapping-level variables, for example:
• $DBConnection_Source
• $DBConnection_Target
• $InputFile
• $OutputFile
• Variable
• Parameter
What is the difference between mapping-level and session-level variables?
Mapping-level parameters and variables (prefixed $$) are defined within a mapping or mapplet, while session-level variables (prefixed $, such as $DBConnection or $InputFile) are defined at the session level for connections and file names.
Flat File
• Delimiter
• Fixed Width
For delimited files we need to specify the separator.
For fixed-width files we must know the format first, i.e. how many characters to read for each column.
For delimited files it is also necessary to know the structure, in particular whether a header is present.
If the file contains a header, we skip the first row in the source definition.
List file:
If you want to process multiple files with the same structure, we don't need multiple mappings and multiple sessions; we can use one mapping and one session with the list file option.
First we need to create the list file for all the files; then we can use this file in the main mapping.
Parameter file:
It is a text file; below is the format. We place this file on the Unix box where our Informatica server is installed.
[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_
APO_BAAN_SALES_HIST_AUSTRI]
$InputFileName_BAAN_SALE_HIST=/interface/dev/etl/apo/srcfiles/HS_025_20070921
$DBConnection_Target=DMD2_GEMS_ETL
$$CountryCode=AT
$$CustomerNumber=120165
[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_
APO_BAAN_SALES_HIST_BELUM]
$DBConnection_Source=DEVL1C1_GEMS_ETL
$OutputFileName_BAAN_SALES=/interface/dev/etl/apo/trgfiles/HS_002_20070921
$$CountryCode=BE
$$CustomerNumber=101495
Difference between 7.x and 8.x
Developer Changes:
• Client applications are the same, but work on top of the new services framework.
• Command-line programs: new infacmd and infasetup commands were added.
• Concurrent cache creation and faster index building are additional features of the Lookup transformation.
• Flat file names can be populated to the target while processing through a list file.
• For flat files, headers and footers can be populated in 8.x using advanced options at the session level.
Effective in version 8.0, you create and configure a grid in the Administration Console. You configure a grid to run on
multiple nodes, and you configure one Integration Service to run on the grid. The Integration Service runs processes
on the nodes in the grid to distribute workflows and sessions. In addition to running a workflow on a grid, you can
now run a session on a grid. When you run a session or workflow on a grid, one service process runs on each
available node in the grid.
1. A PowerCenter Client requests the Integration Service (IS) to start a workflow.
2. The IS starts an Integration Service Process (ISP).
The Integration Service manages the movement of data from the source system to the target system, in memory and on disk.
The three main components of the Integration Service that enable data movement are the Integration Service Process, the Load Balancer, and the DTM (Data Transformation Manager) process.
Integration Service Process
The Integration Service starts one or more Integration Service Processes to run and monitor workflows. When we run a workflow, the ISP starts and locks the workflow, runs the workflow tasks, and starts the DTM process to run sessions.
Load Balancer
The Load Balancer dispatches tasks to achieve optimal performance. It dispatches tasks to a single node or across
the nodes in a grid after performing a sequence of steps. Before understanding these steps we have to know about
Resources, Resource Provision Thresholds, Dispatch mode and Service levels
Resources – we can configure the Integration Service to check the resources available on each node and
match them with the resources required to run the task. For example, if a session uses an SAP source, the
Load Balancer dispatches the session only to nodes where the SAP client is installed
Three resource provision thresholds: the maximum number of runnable threads waiting for CPU resources on the node, called Maximum CPU Run Queue Length; the maximum percentage of virtual memory allocated on the node relative to the total physical memory size, called Maximum Memory %; and the maximum number of running Session and Command tasks allowed for each Integration Service process running on the node, called Maximum Processes.
Three dispatch modes – Round-robin: the Load Balancer dispatches tasks to available nodes in round-robin fashion after checking the Maximum Processes threshold. Metric-based: checks all three resource provision thresholds and dispatches tasks in round-robin fashion. Adaptive: checks all three resource provision thresholds and also ranks nodes according to current CPU availability.
Service levels establish priority among tasks that are waiting to be dispatched; the three components of a service level are Name, Dispatch Priority and Maximum Dispatch Wait Time. The maximum dispatch wait time is the amount of time a task can wait in the queue, which ensures that no task waits forever.
On a single node:
1. The Load Balancer checks the resource provision thresholds on the node, depending on the dispatch mode set. If dispatching the task would cause any threshold to be exceeded, the Load Balancer places the task in the dispatch queue and dispatches it later.
2. The Load Balancer dispatches all tasks to the node that runs the master Integration Service process.
On a grid:
1. The Load Balancer verifies which nodes are currently running and enabled.
2. The Load Balancer identifies nodes that have the PowerCenter resources required by the tasks in the workflow.
3. The Load Balancer verifies that the resource provision thresholds on each candidate node are not exceeded; if dispatching the task would cause a threshold to be exceeded, the Load Balancer places the task in the dispatch queue and dispatches it later.
When the workflow reaches a session, the Integration Service Process starts the DTM process. The DTM is the
process associated with the session task. The DTM process performs the following tasks:
Performs pushdown optimization when the session is configured for pushdown optimization.
Adds partitions to the session when the session is configured for dynamic partitioning.
Expands the service process variables, session parameters, and mapping variables and parameters.
Sends a request to start worker DTM processes on other nodes when the session is configured to run on a
grid.
Creates and runs mapping, reader, writer, and transformation threads to extract, transform, and load data
Runs post-session stored procedures, SQL, and shell commands and sends post-session email
After the session completes, the DTM reports the execution result to the ISP.
Approach 1: Using a mapping variable
1) Create a mapping variable ($$INCREMENT_TS) and assign an old date (01/01/1940) as its initial value.
2) Override the Source Qualifier query to fetch only rows with LAST_UPD_DATE >= $$INCREMENT_TS.
3) In an Expression transformation, assign the maximum LAST_UPD_DATE value to $$INCREMENT_TS using SETMAXVARIABLE.
4) Because it is a mapping variable, the maximum LAST_UPD_DATE value is stored in the repository, so in the next run the Source Qualifier query fetches only the records updated or inserted after the previous run.
Logic in the mapping variable (screenshot):
Logic in the SQ override (screenshot):
In the Expression, assign the maximum last-update-date value to the variable using the SETMAXVARIABLE function.
Logic in the Update Strategy (screenshot):
Approach 2: Using a parameter file
First create a mapping parameter ($$LastUpdateDateTime) and assign an old date (01/01/1940) as its initial value in the parameter file.
Then override the Source Qualifier query to fetch only rows with LAST_UPD_DATE >= $$LastUpdateDateTime.
Update the parameter's value in the parameter file using a shell script, or with another mapping that runs after the first session completes successfully.
Because it is a mapping parameter, its value must be updated in the parameter file after every completion of the main session.
Parameter file:
[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_
APO_BAAN_SALES_HIST_AUSTRI]
$DBConnection_Source=DMD2_GEMS_ETL
$DBConnection_Target=DMD2_GEMS_ETL
Logic in the expression (screenshot):
Main mapping (screenshot):
SQL override in the SQ transformation (screenshot):
Workflow design (screenshot):
4.2 Informatica Scenarios:
1) How to populate the 1st record to the 1st target, 2nd record to the 2nd target, 3rd record to the 3rd target and the 4th record back to the 1st target through Informatica?
We can do this using a Sequence Generator with End Value = 3 and the Cycle option enabled, then take three groups in a Router:
In the 1st group, specify the condition seq.NEXTVAL = 1 and pass those records to the 1st target; similarly,
in the 2nd group, specify seq.NEXTVAL = 2 and pass those records to the 2nd target;
in the 3rd group, specify seq.NEXTVAL = 3 and pass those records to the 3rd target.
Since the Cycle option is enabled, after reaching the end value the Sequence Generator starts again from 1, so for the 4th record NEXTVAL is 1 and it goes to the 1st target.
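The cycle-enabled Sequence Generator plus Router can be mimicked with a one-line awk sketch: row n gets sequence value ((n-1) mod 3) + 1, so rows 1, 4, 7, ... route to target 1 (the r1..r4 sample rows are made up):

```shell
# Each input row is tagged with the target it would be routed to.
printf '%s\n' r1 r2 r3 r4 |
awk '{ print "target_" ((NR - 1) % 3 + 1) ": " $0 }'
```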
2) I want to generate a separate file for every state (one file per state). With the sample data below it has to generate two flat files, and the name of each flat file is the corresponding state name; that is the requirement.
Below is my mapping.
Source:
AP 2 HYD
AP 1 TPT
KA 5 BANG
KA 7 MYSORE
KA 3 HUBLI
This functionality was added in Informatica 8.5; it was not available in earlier versions.
We can achieve it using a Transaction Control transformation and the special "FileName" port on the target flat file.
To generate target file names from the mapping, we use the special "FileName" port in the target file. You can't create this special port with the usual New Port button; there is a special button labeled "F" at the right-most corner of the target flat file definition when viewed in the Target Designer.
When different sets of input data should produce different target files, use the same target instance, but with a Transaction Control transformation that defines the boundary between the source sets.
(In the target flat file there is an option on the Columns tab, "FileName as column"; when you click it, a non-editable column is added to the target metadata.)
Source -> SQ -> Expression -> Transaction Control -> Target
3) How do you concatenate the Ename values of records that share the same EmpNo?
Source:
Ename EmpNo
stev 100
methew 100
john 101
tom 101
Target:
Ename EmpNo
Approach 1: If the record doesn't exist in the target, insert it. If it already exists, get the corresponding Ename value from a Lookup, concatenate it with the current Ename value in an Expression, and update the target Ename column using an Update Strategy.
Approach 2: Sort the data in the SQ on the EmpNo column, then use an Expression with variable ports to hold the previous record's values. A Router then inserts the record if its key appears for the first time; otherwise it updates the target Ename with the concatenation of the previous and current name values.
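With input already sorted by EmpNo (the job of the SQ in approach 2), the previous-row variable-port logic corresponds to this awk sketch (the file name and "name empno" layout are assumptions):

```shell
# Group consecutive rows by EmpNo and concatenate their names.
cat > emp.dat <<'EOF'
stev 100
methew 100
john 101
tom 101
EOF

awk '{
    if ($2 == prev_no) { name = name " " $1 }   # same key: keep appending
    else { if (NR > 1) print prev_no, name      # key changed: emit the group
           name = $1 }
    prev_no = $2
} END { print prev_no, name }' emp.dat > concat.out

cat concat.out
```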
4) How to send unique (distinct) records into one target and duplicates into another target?
Source:
Ename EmpNo
stev 100
Stev 100
john 101
Mathew 102
Output:
Target_1:
Ename EmpNo
Stev 100
John 101
Mathew 102
Target_2:
Ename EmpNo
Stev 100
Approach 1: If the record doesn't exist in the target, insert it into Target_1; if it already exists, send it to Target_2 using a Router.
Approach 2: Sort the data in the SQ on the EmpNo column, then use an Expression with variable ports to store the previous record's values; a Router then sends first-time records to Target_1 and already-seen records to Target_2.
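The first-occurrence routing of approach 2 can be sketched with awk, writing unique rows to target_1.dat and repeats to target_2.dat (both names assumed):

```shell
# seen[] plays the role of the variable ports: 0 on first sight of an EmpNo
# (row goes to target_1.dat), non-zero on repeats (target_2.dat).
cat > src.dat <<'EOF'
stev 100
stev 100
john 101
mathew 102
EOF

awk '{
    if (seen[$2]++) print $0 > "target_2.dat"   # duplicate EmpNo
    else            print $0 > "target_1.dat"   # first occurrence
}' src.dat

cat target_1.dat target_2.dat
```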
5) How to process multiple flat files into a single target table through Informatica if all files have the same structure?
We can process all the flat files through one mapping and one session using a list file.
First we need to create the list file with a Unix script covering all the flat files; the extension of the list file is .LST.
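Building the .LST file with a Unix script might look like this sketch (directory, file pattern and list-file name are all illustrative):

```shell
# Collect every matching source file into the list file the session reads.
SRC_DIR=./srcfiles
mkdir -p "$SRC_DIR"
touch "$SRC_DIR/HS_025_20070921.dat" "$SRC_DIR/HS_025_20070922.dat"

ls "$SRC_DIR"/HS_025_*.dat > order_files.LST
cat order_files.LST
```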
6) How to populate the file name to the target while loading multiple files using the list file concept?
From Informatica 8.6, select the "Add Currently Processed Flat File Name" option on the Properties tab of the source definition after importing it in the Source Analyzer. This adds a new column, the currently processed file name, which we can map to the target to populate the file name.
7) If we want to run 2 workflows one after another, how do we set the dependency between workflows?
• If both workflows exist in the same folder, we can create two worklets rather than two workflows, and set the dependency there.
• If the workflows exist in different folders or repositories, we cannot create a worklet; one approach is to set the dependency between the two workflows using a shell script.
If the workflows exist in different folders or different repositories, we can use the approaches below.
• As soon as the first workflow completes, it creates a zero-byte file (an indicator file).
• If the indicator file is not available, we wait 5 minutes and check again, continuing this loop for up to 30 minutes.
• If the file still does not exist after 30 minutes, we send an email notification.
Alternatively, we can put an Event Wait task before the actual session in the second workflow to wait for the indicator file: if the file is available the session runs; otherwise the Event Wait waits indefinitely until the indicator file appears.
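The polling loop described above can be sketched in shell; the file name, the number of tries and the interval are illustrative, and the sleep is shortened so the sketch runs quickly:

```shell
# Poll for the indicator file left by the first workflow.
INDICATOR=./wf1_done.ind
touch "$INDICATOR"                 # simulate workflow 1 having finished

found=no
for try in 1 2 3 4 5; do
    if [ -f "$INDICATOR" ]; then
        found=yes
        break
    fi
    sleep 1                        # use 300 for a real 5-minute interval
done

if [ "$found" = yes ]; then
    echo "indicator found - start workflow 2 (e.g. via pmcmd)"
else
    echo "indicator missing - send alert email (e.g. via mailx)"
fi
```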
Solution:
Using variable ports in an Expression transformation, we can load the cumulative salary into the target.
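The variable-port running total corresponds to a one-line awk sketch (the three sample salaries are made up):

```shell
# Each output row carries the salary and the running total so far.
printf '%s\n' 1000 2000 1500 |
awk '{ cum += $1; print $1, cum }'
```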
4.3 Development Guidelines
The starting point of development is the logical model created by the Data Architect. This logical model forms the
foundation of the metadata, which will be maintained continuously throughout the Data Warehouse Development Life
Cycle (DWDLC). The logical model is derived from the requirements of the project. At the completion of the logical
model, technical documentation is produced defining the sources, targets, requisite business-rule transformations, mappings and
filters. This documentation serves as the basis for creating the Extraction, Transformation and Loading processes that
actually move the data from the application sources into the Data Warehouse/Data Mart.
To start development on any data mart you should have the following things set up by the Informatica Load
Administrator
An Informatica folder. The development team, in consultation with the BI Support Group, decides a three-
letter code for the project, which is used to create the Informatica folder as well as the Unix directory
structure.
Informatica Userids for the developers
Unix directory structure for the data mart.
A schema XXXLOAD on DWDEV database.
Transformation Specifications
Before developing the mappings you need to prepare a specifications document for the mappings you will develop. A good template is placed in the templates folder; you can use your own template as long as it has at least as much detail as that one.
While estimating the time required to develop mappings, the following rule of thumb applies.
It is an accepted best practice to always load a flat file into a staging table before any transformations are done on the data in the flat file.
Always use the LTRIM and RTRIM functions on string columns before loading data into a stage table.
You can also use the UPPER function on string columns, but before using it ensure that the data is not case-sensitive (e.g. that ABC and Abc do not denote different values).
If you are loading data from a delimited file, make sure the delimiter is not a character that could appear in the data itself. Avoid comma-separated files; tilde (~) is a good delimiter to use.
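A quick pre-load sanity check on a tilde-delimited feed can be sketched in shell; the column count of 4 and the file name feed.dat are assumptions:

```shell
# Every line of the feed should split into the same number of fields.
printf '%s\n' 'a~b~c~d' 'e~f~g~h' > feed.dat

awk -F'~' 'NF != 4 { bad++ } END { print (bad ? bad : 0) " bad line(s)" }' feed.dat
```

A non-zero count suggests the delimiter also occurs inside the data.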
Failure Notification
Once in production, your sessions and batches need to send notifications to the Support team when they fail. You
can do this by configuring an email task at the session level.
Port Standards:
Input Ports – It will be necessary to rename input ports for Lookups, Expressions and Filters where ports might
otherwise share a name. Ports with the same name are by default suffixed with a number; change this default to a
prefix of "in_" instead. This will allow you to keep track of input ports throughout your mappings.
Prefixed with: IN_
Variable Ports – Variable ports within a transformation should be prefixed with "v_". This will allow the developer to distinguish between input/output and
variable ports. For more explanation of variable ports see the section "VARIABLES".
Prefixed with: V_
Output Ports – If data is derived within a transformation and then mapped to the target, make sure that it
has the same name as the target port that it will be mapped to.
Prefixed with: O_
Quick Reference
Aggregator AGG_<Purpose>
Expression EXP_<Purpose>
Filter FLT_<Purpose>
Rank RNK_<Purpose>
Router RTR_<Purpose>
Mapplet MPP_<Purpose>
1. Cache lookups if the lookup table is under 500,000 rows, and don't cache for tables over 500,000 rows.
2. Reduce the number of transformations. Don't use an Expression transformation merely to collect fields, and don't use
an Update Strategy transformation if you are only inserting; insert mode is the default.
3. If a value is used in multiple ports, calculate the value once (in a variable) and reuse the result instead of
recalculating it for multiple ports.
4. Reuse objects where possible.
7. Avoid using Stored Procedures, and call them only once during the mapping if possible.
8. Remember to turn off Verbose logging after you have finished debugging.
9. Use default values where possible instead of using IIF (ISNULL(X),,) in Expression port.
10. When overriding the Lookup SQL, always ensure to put a valid Order By statement in the SQL. This will
cause the database to perform the order rather than Informatica Server while building the Cache.
11. Improve session performance by using sorted data with the Joiner transformation. When the Joiner
transformation is configured to use sorted data, the Informatica Server improves performance by
minimizing disk input and output.
12. Improve session performance by using sorted input with the Aggregator Transformation since it reduces the
amount of data cached during the session.
13. Improve session performance by using limited number of connected input/output or output ports to reduce
the amount of data the Aggregator transformation stores in the data cache.
14. Use a Filter transformation prior to Aggregator transformation to reduce unnecessary aggregation.
15. Performing a join in the database is faster than performing it in the session, so use the Source Qualifier
to perform the join where possible.
16. In Joiner transformations, define the source with fewer rows as the master source, since this reduces
the search time and also the cache size.
17. When using multiple conditions in a lookup, specify the conditions with the equality operator first.
19. If the lookup table is on the same database as the source table, instead of using a Lookup transformation,
join the tables in the Source Qualifier Transformation itself if possible.
20. If the lookup table does not change between sessions, configure the Lookup transformation to use a
persistent lookup cache. The Informatica Server saves and reuses cache files from session to session,
eliminating the time required to read the lookup table.
21. Use :LKP reference qualifier in expressions only when calling unconnected Lookup Transformations.
22. Informatica Server generates an ORDER BY statement for a cached lookup that contains all lookup ports. By
providing an override ORDER BY clause with fewer columns, session performance can be improved.
24. Reduce the number of rows being cached by using the Lookup SQL Override option to add a WHERE clause
to the default SQL statement.
Testing regimens:
1. Unit Testing
2. Functional Testing
3. System Integration Testing
4. User Acceptance Testing (UAT)
Unit testing: The testing, by development, of the application modules to verify each unit (module) itself meets the
accepted user requirements and design and development standards
Functional Testing: The testing of all the application’s modules individually to ensure the modules, as released
from development to QA, work together as designed and meet the accepted user requirements and system
standards
System Integration Testing: Testing of all of the application modules in the same environment, database
instance, network and inter-related applications, as it would function in production. This includes security, volume
and stress testing.
User Acceptance Testing(UAT): The testing of the entire application by the end-users ensuring the application
functions as set forth in the system requirements documents and that the system meets the business needs.
UTP Template:
(SAP-CMS interfaces. Columns: Step, Description, Test Conditions, Expected Results, Result (P or F), Tested By.)

Step 1
Description: Check that the total count of records fetched from the source tables matches the total records in the PRCHG table for a particular session timestamp.
Test Conditions:
SOURCE: SELECT COUNT(*) FROM XST_PRCHG_STG
TARGET: SELECT COUNT(*) FROM PRCHG
Expected Results: The source and target table load record counts should match.
Result: Pass. Tested By: Stev.

Step 2
Description: Check all the target columns to verify that they are populated correctly with source data.
Test Conditions:
SELECT PRCHG_ID, PRCHG_DESC, DEPT_NBR, EVNT_CTG_CDE, PRCHG_TYP_CDE, PRCHG_ST_CDE FROM T_PRCHG
MINUS
SELECT PRCHG_ID, PRCHG_DESC, DEPT_NBR, EVNT_CTG_CDE, PRCHG_TYP_CDE, PRCHG_ST_CDE FROM PRCHG
Expected Results: The MINUS query should return zero records.
Result: Pass. Tested By: Stev.

Step 3
Description: Check the insert strategy for loading records into the target table.
Test Conditions: Identify one record in the source which is not in the target table, then run the session.
Expected Results: The record should be inserted into the target table with the source data.
Result: Pass. Tested By: Stev.

Step 4
Description: Check the update strategy for loading records into the target table.
Test Conditions: Identify one record in the source which is already present in the target table with a different PRCHG_ST_CDE or PRCHG_TYP_CDE value, then run the session.
Expected Results: The existing target record should be updated with the source data.
Result: Pass. Tested By: Stev.
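Step 1 (the count match) can be automated in shell; the counts are stubbed here, whereas in practice each would come from running the COUNT(*) queries (e.g. via sqlplus):

```shell
# Compare source and target row counts and mark the UTP step Pass/Fail.
SRC_CNT=100   # e.g. SELECT COUNT(*) FROM XST_PRCHG_STG
TGT_CNT=100   # e.g. SELECT COUNT(*) FROM PRCHG

if [ "$SRC_CNT" -eq "$TGT_CNT" ]; then
    echo "Step 1: Pass"
else
    echo "Step 1: Fail"
fi
```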
5 UNIX
cd /pmar/informatica/pc/pmserver/
2) If we are supposed to process flat files with Informatica but the files exist on a remote server, we have to write a script that FTPs them to the Informatica server before processing starts.
3) A file watch means: if the indicator file is available in the specified location, we start our Informatica jobs; otherwise we send an email notification using the mailx command saying that the previous jobs did not complete successfully.
4) We also use shell scripts to update the parameter file with the session start and end times.
This is the kind of scripting knowledge I have; if a new UNIX requirement comes up, I can research it and implement the solution.
Basic Commands:
cat > file1 - create a (non-empty) file from keyboard input
cat file1 file2 > all - combines both files into "all" (creating it if it doesn't exist)
cat file1 >> file2 - appends file1 to file2
> redirects output from standard out (the screen) to a file, printer, or whatever you like.
ps -A - list all running processes
Crontab command.
The crontab command is used to schedule jobs; you must have permission from the Unix administrator to run it.
Jobs are scheduled with five numeric fields, as follows:
Minutes (0-59) Hour (0-23) Day of month (1-31) Month (1-12) Day of week (0-6, 0 is Sunday)
For example, to schedule a job that runs the script backup_jobs in the /usr/local/bin directory on Sunday (day 0) at 22:25 on the 15th of the month, the entry in the crontab file will be as below (* represents all values).
25 22 15 * 0 /usr/local/bin/backup_jobs
who | wc -l - count the users currently logged in
ls -l | grep '^d' - list only directories
Pipes:
The pipe symbol "|" is used to direct the output of one command to the input of another.
mv file1 ~/AAA/ - move file1 into the sub-directory AAA in your home directory
ls -a - list all files, including hidden ones
find command
find . -name aaa.txt - finds all the files named aaa.txt in the current directory and below
sed - the usual sed command for a global string search and replace; to replace 'foo' with the string 'bar' globally in a file:
sed 's/foo/bar/g' oldfile > newfile
find / -name vimrc - find all the files named 'vimrc' anywhere on the system
find / -name '*xpilot*' - find all files whose names contain the string 'xpilot'
You can find out what shell you are using by the command:
echo $SHELL
#!/usr/bin/sh
or
#!/bin/ksh
This first line tells the system which interpreter to run the script with. As you know, the bash shell has some specific features that other
shells do not have, and vice versa; the same goes for perl, python and other languages.
In short, it tells your shell which shell to use when executing the statements in your script.
Interactive History
A feature of bash and tcsh (and sometimes other shells): previously entered commands can be recalled and edited, for example with the arrow keys.
Opening a file
vi filename
Creating text
Edit modes: these keys enter editing modes so you can type the text
of your document.
a Insert (append) after current cursor position
r Replace 1 character
R Replace mode
Deletion of text
:w! existing.file Overwrite an existing file with the file currently being edited.
:q Quit.