


In computing, extract, transform, and load (ETL) refers to a process in database usage, and especially in data warehousing, that:
Extracts data from outside sources
Transforms it to fit operational needs, which can include quality levels
Loads it into the end target (a database; more specifically, an operational data store, data mart, or data warehouse)
ETL systems are commonly used to integrate data from multiple applications, typically developed and
supported by different vendors or hosted on separate computer hardware. The disparate systems
containing the original data are frequently managed and operated by different employees. For
example, a cost accounting system may combine data from payroll, sales, and purchasing.
The first part of an ETL process involves extracting the data from the source systems. In many cases
this is the most challenging aspect of ETL, since extracting the data correctly sets the stage for
everything that follows.
Most data warehousing projects consolidate data from different source systems. Each separate system
may also use a different data organization and/or format. Common data source formats are relational
databases and flat files, but may include non-relational database structures such as Information
Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or
Indexed Sequential Access Method (ISAM), or even fetching from outside sources such as through web
spidering or screen-scraping. Streaming the extracted data and loading it on-the-fly into the
destination database is another way of performing ETL when no intermediate data storage is required.
In general, the goal of the extraction phase is to convert the data into a single format appropriate for
transformation processing.
An intrinsic part of the extraction involves parsing the extracted data to check whether it meets an
expected pattern or structure. If not, the data may be rejected entirely or in part.
The transform stage applies a series of rules or functions to the extracted data from the source to
derive the data for loading into the end target. Some data sources require very little or even no
manipulation of data. In other cases, one or more of the following transformation types may be
required to meet the business and technical needs of the target database:
Selecting only certain columns to load (or selecting null columns not to load). For example, if the
source data has three columns (also called attributes), say roll_no, age, and salary, the
extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all
records where salary is not present (salary = null).
Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the
warehouse stores M for male and F for female)
Encoding free-form values (e.g., mapping "Male" to "M")

Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
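Transformations such as code translation and value derivation can be sketched as a simple row-level function; the column names and the code mapping below are illustrative, not from any particular system:

```python
# Hypothetical row-level transform: translate coded values (1/2 -> M/F)
# and derive sale_amount = qty * unit_price, as described above.
GENDER_CODES = {1: "M", 2: "F"}

def transform_row(row):
    """Return a transformed copy of one extracted record (a dict)."""
    out = dict(row)
    out["gender"] = GENDER_CODES[row["gender"]]          # code translation
    out["sale_amount"] = row["qty"] * row["unit_price"]  # derived value
    return out

print(transform_row({"gender": 1, "qty": 3, "unit_price": 9.5}))
```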

Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
Aggregation (for example, rolling up: summarizing multiple rows of data into total sales for each
store, for each region, etc.)
Generating surrogate-key values
Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
Splitting a column into multiple columns (e.g., converting a comma-separated list, specified as a
string in one column, into individual values in different columns)
Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses
in one record into single addresses in a set of records in a linked address table)
Looking up and validating the relevant data from tables or referential files for slowly changing dimensions
Applying any form of simple or complex data validation. If validation fails, it may result in a full,
partial, or no rejection of the data; thus none, some, or all of the data is handed over to the next
step, depending on the rule design and exception handling. Many of the above transformations can
themselves raise exceptions, for example when a code translation parses an unknown code in the
extracted data.
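The full/partial/no-rejection behavior described above can be sketched as a small routine; the record structure and the validation rule are assumptions for illustration:

```python
def load_with_validation(records, is_valid):
    """Split extracted records into those handed to the next step and
    those rejected, per the supplied validation rule."""
    accepted, rejected = [], []
    for rec in records:
        (accepted if is_valid(rec) else rejected).append(rec)
    return accepted, rejected

# Reject records with a missing salary, as in the selection example above.
rows = [{"roll_no": 1, "salary": 100}, {"roll_no": 2, "salary": None}]
ok, bad = load_with_validation(rows, lambda r: r["salary"] is not None)
print(len(ok), len(bad))
```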
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on
the requirements of the organization, this process varies widely.


The objective of ETL testing is to assure that the data that has been loaded from
source to destination after business transformation is accurate. It also involves the
verification of data at various middle stages that are being used between a source
and destination.
Across tools and databases, the following two documents are the two hands of an ETL tester. It is
important that both documents be complete before ETL testing starts; continuous change in either
will lead to inaccurate testing:
ETL mapping sheets
DB schema of the source, the target, and any middle stage in between

An ETL mapping sheet contains all the information about the source and destination tables, including
each and every column and their lookups in reference tables.
An ETL tester needs to be comfortable with SQL queries, as ETL testing may involve writing big
queries with multiple joins to validate data at any stage of ETL. ETL mapping sheets are a
significant help when writing queries for data verification, and the DB schema should be kept handy
to verify any detail in the mapping sheets.
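The kind of source-versus-target check such queries perform can be sketched with an in-memory database; the table and column names here are hypothetical:

```python
import sqlite3

# "Source minus target" via EXCEPT: any row returned exists in the source
# but not (or not identically) in the target.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src(roll_no INTEGER, salary INTEGER);
    CREATE TABLE tgt(roll_no INTEGER, salary INTEGER);
    INSERT INTO src VALUES (1, 100), (2, 200);
    INSERT INTO tgt VALUES (1, 100), (2, 999);  -- salary mismatch for roll_no 2
""")
diff = con.execute(
    "SELECT roll_no, salary FROM src EXCEPT SELECT roll_no, salary FROM tgt"
).fetchall()
print(diff)
```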

What is DWH

A data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting
and data analysis. It is a central repository of data which is created by integrating data from one or
more disparate sources. Data warehouses store current as well as historical data and are used for
creating trending reports for senior management reporting such as annual and quarterly comparisons.
The data stored in the warehouse are uploaded from the operational systems (such as marketing,
sales etc., shown in the figure to the right). The data may pass through an operational data store for
additional operations before they are used in the DW for reporting.
The typical ETL-based data warehouse uses staging, data integration, and access layers to house its
key functions. The staging layer or staging database stores raw data extracted from each of the
disparate source data systems. The integration layer integrates the disparate data sets by
transforming the data from the staging layer, often storing this transformed data in an operational data
store (ODS) database. The integrated data are then moved to yet another database, often called the
data warehouse database, where the data is arranged into hierarchical groups often called dimensions
and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star
schema. The access layer helps users retrieve data.
A data warehouse constructed from integrated data source systems does not require ETL, staging
databases, or operational data store databases. The integrated data source systems may be
considered part of a distributed operational data store layer. Data federation or data
virtualization methods may be used to access the distributed integrated source data systems to
consolidate and aggregate data directly into the data warehouse database tables. Unlike the
ETL-based data warehouse, the integrated source data systems and the data warehouse are all integrated
since there is no transformation of dimensional or reference data. This integrated data warehouse
architecture supports the drill down from the aggregate data of the data warehouse to the
transactional data of the integrated source data systems.
Benefits of DWH

A data warehouse maintains a copy of information from the source transaction systems. This
architectural complexity provides the opportunity to:

Congregate data from multiple sources into a single database so a single query engine can
be used to present data.
Mitigate the problem of database isolation level lock contention in transaction processing
systems caused by attempts to run large, long running, analysis queries in transaction
processing databases.
Maintain data history, even if the source transaction systems do not.

Integrate data from multiple source systems, enabling a central view across the enterprise.
This benefit is always valuable, but particularly so when the organization has grown by mergers or acquisitions.
Improve data quality, by providing consistent codes and descriptions, flagging or even fixing
bad data.
Present the organization's information consistently.
Provide a single common data model for all data of interest regardless of the data's source.
Restructure the data so that it makes sense to the business users.
Restructure the data so that it delivers excellent query performance, even for complex
analytic queries, without impacting the operational systems.
Add value to operational business applications, notably customer relationship management
(CRM) systems.

DWH Architecture


Dimensional modeling (DM) is the name of a set of techniques and concepts used in data
warehouse design. It is considered to be different from entity-relationship modeling (ER).
Dimensional Modeling does not necessarily involve a relational database. The same modeling
approach, at the logical level, can be used for any physical form, such as multidimensional
database or even flat files. According to data warehousing consultant Ralph Kimball,[1] DM is a
design technique for databases intended to support end-user queries in a data warehouse. It is
oriented around understandability and performance. According to him, although transaction-oriented ER is very useful for transaction capture, it should be avoided for end-user delivery.

Dimensional modeling always uses the concepts of facts (measures), and dimensions (context).
Facts are typically (but not always) numeric values that can be aggregated, and dimensions are
groups of hierarchies and descriptors that define the facts. For example, sales amount is a fact;
timestamp, product, register#, store#, etc. are elements of dimensions. Dimensional models
are built by business process area, e.g. store sales, inventory, claims, etc. Because the different
business process areas share some but not all dimensions, efficiency in design, operation, and
consistency, is achieved using conformed dimensions, i.e. using one copy of the shared
dimension across subject areas. The term "conformed dimensions" was originated by Ralph Kimball.
Dimensional modeling process
The dimensional model is built on a star-like schema, with dimensions surrounding the fact
table. To build the schema, the following design model is used:
Choose the business process
Declare the grain
Identify the dimensions
Identify the fact
Choose the business process
The process of dimensional modeling builds on a 4-step design method that helps to ensure the
usability of the dimensional model and the use of the data warehouse. The basics in the design
build on the actual business process which the data warehouse should cover. Therefore the first
step in the model is to describe the business process which the model builds on. This could for
instance be a sales situation in a retail store. To describe the business process, one can choose
to do this in plain text or use basic Business Process Modeling Notation (BPMN) or other design
guides like the Unified Modeling Language (UML).
Declare the grain
After describing the Business Process, the next step in the design is to declare the grain of the
model. The grain of the model is the exact description of what the dimensional model should be
focusing on. This could for instance be An individual line item on a customer slip from a retail
store. To clarify what the grain means, you should pick the central process and describe it with
one sentence. Furthermore the grain (sentence) is what you are going to build your dimensions
and fact table from. You might find it necessary to go back to this step to alter the grain due to
new information gained on what your model is supposed to be able to deliver.
Identify the dimensions
The third step in the design process is to define the dimensions of the model. The dimensions
must be defined within the grain from the second step of the 4-step process. Dimensions are
the foundation of the fact table, and are where the data for the fact table is collected. Typically
dimensions are nouns like date, store, inventory etc. These dimensions are where all the data is
stored. For example, the date dimension could contain data such as year, month and weekday.

Identify the facts

After defining the dimensions, the next step in the process is to identify the numeric facts that
will populate each fact table row. This step is
closely related to the business users of the system, since this is where they get access to data
stored in the data warehouse. Therefore most of the fact table rows are numerical, additive
figures such as quantity or cost per unit, etc.
Fact Table
In data warehousing, a fact table consists of the measurements, metrics or facts of a business
process. It is located at the center of a star schema or a snowflake schema surrounded by
dimension tables. Where multiple fact tables are used, these are arranged as a fact
constellation schema. A fact table typically has two types of columns: those that contain facts
and those that are foreign keys to dimension tables. The primary key of a fact table is usually a
composite key that is made up of all of its foreign keys. Fact tables contain the content of the
data warehouse and store different types of measures like additive, non-additive, and semi
additive measures.
Fact tables provide the (usually) additive values that act as independent variables by which
dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a
fact table represents the most atomic level by which the facts may be defined. The grain of a
SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in
this fact table is therefore uniquely defined by a day, product and store. Other dimensions
might be members of this fact table (such as location/region) but these add nothing to the
uniqueness of the fact records. These "affiliate dimensions" allow for additional slices of the
independent facts but generally provide insights at a higher level of aggregation (a region
contains many stores).
Types of Measures
Additive - Measures that can be added across any dimension.
Non Additive - Measures that cannot be added across any dimension.
Semi Additive - Measures that can be added across some dimensions.
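The distinction can be illustrated with a hypothetical daily account-balance fact: balances are additive across accounts on a single day, but summing them across days is meaningless, which makes balance a semi-additive measure:

```python
# Illustrative daily account-balance facts (semi-additive measure).
facts = [
    {"day": "2024-01-01", "account": "A", "balance": 100},
    {"day": "2024-01-01", "account": "B", "balance": 50},
    {"day": "2024-01-02", "account": "A", "balance": 120},
]

# Valid: total balance across accounts for one day.
total_day1 = sum(f["balance"] for f in facts if f["day"] == "2024-01-01")
print(total_day1)  # 150

# Invalid in business terms: adding balances across days (100 + 50 + 120)
# does not represent any real quantity, even though the arithmetic works.
```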
Types of Fact Tables
There are basically three fundamental measurement events, which characterize all fact tables.
A transactional table is the most basic and fundamental. The grain associated with a
transactional fact table is usually specified as "one row per line in a transaction", e.g., every line
on a receipt. Typically a transactional fact table holds data of the most detailed level, causing it
to have a great number of dimensions associated with it.
The periodic snapshot, as the name implies, takes a "picture of the moment", where the
moment could be any defined period of time, e.g. a performance summary of a salesman over
the previous month. A periodic snapshot table is dependent on the transactional table, as it

needs the detailed data held in the transactional fact table in order to deliver the chosen
performance output.
The accumulating snapshot, the third type, is used to show the activity of a process that has a well-defined
beginning and end, e.g., the processing of an order. An order moves through specific steps until
it is fully processed. As steps towards fulfilling the order are completed, the associated row in
the fact table is updated. An accumulating snapshot table often has multiple date columns,
each representing a milestone in the process. Therefore, it's important to have an entry in the
associated date dimension that represents an unknown date, as many of the milestone dates
are unknown at the time of the creation of the row.
Dimension Table

In data warehousing, a dimension table is one of the set of companion tables to a fact table. The fact
table contains business facts (or measures), and foreign keys which refer to candidate keys (normally
primary keys) in the dimension tables. Contrary to fact tables, dimension tables contain descriptive
attribute (or fields) that are typically textual fields (or discrete numbers that behave like text).
Dimension table rows are uniquely identified by a single key field. It is recommended that the key field
be a simple integer because a key value is meaningless, used only for joining fields between the fact
and dimension tables.
A dimensional data element is similar to a categorical variable in statistics.
Typically dimensions in a data warehouse are organized internally into one or more hierarchies. "Date"
is a common dimension, with several possible hierarchies:
"Days (are grouped into) Months (which are grouped into) Years",
"Days (are grouped into) Weeks (which are grouped into) Years"
"Days (are grouped into) Months (which are grouped into) Quarters (which are grouped into) Years"
Types of Dimensions
Conformed Dimension
Junk Dimension

A schema is a collection of database objects, including tables, views, indexes, and synonyms.
There is a variety of ways of arranging schema objects in the schema models designed for data
warehousing. The most common data-warehouse schema model is a star schema. However, a
significant but smaller number of data warehouses use third-normal-form (3NF) schemas, or other
schemas which are more highly normalized than star schemas.

Star Schema
The star schema is the simplest data warehouse schema. It is called a star schema because the
diagram of a star schema resembles a star, with points radiating from a center. The center of the star
consists of one or more fact tables and the points of the star are the dimension tables.
A star schema is characterized by one or more very large fact tables that contain the primary
information in the data warehouse and a number of much smaller dimension tables (or lookup tables),
each of which contains information about the entries for a particular attribute in the fact table.
A star query is a join between a fact table and a number of lookup tables. Each lookup table is joined
to the fact table using a primary-key to foreign-key join, but the lookup tables are not joined to each other.
Cost-based optimization recognizes star queries and generates efficient execution plans for them.
(Star queries are not recognized by rule-based optimization.)
A typical fact table contains keys and measures. For example, a simple fact table might contain the
measure Sales, and keys Time, Product, and Market. In this case, there would be corresponding
dimension tables for Time, Product, and Market. The Product dimension table, for example, would
typically contain information about each product number that appears in the fact table. A measure is
typically a numeric or character column, and can be taken from one column in one table or derived
from two columns in one table or two columns in more than one table.
A star join is a primary-key to foreign-key join of the dimension tables to a fact table. The fact table
normally has a concatenated index on the key columns to facilitate this type of join.
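A minimal star query over illustrative tables, showing the primary-key/foreign-key joins described above (each dimension joins only to the fact table, never to another dimension):

```python
import sqlite3

# Tiny star schema: one fact table and two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product(product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_store(store_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales(product_key INTEGER, store_key INTEGER, sales REAL);
    INSERT INTO dim_product VALUES (1, 'Widget');
    INSERT INTO dim_store VALUES (10, 'East');
    INSERT INTO fact_sales VALUES (1, 10, 25.0), (1, 10, 15.0);
""")
rows = con.execute("""
    SELECT p.product_name, s.region, SUM(f.sales)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_store   s ON f.store_key   = s.store_key
    GROUP BY p.product_name, s.region
""").fetchall()
print(rows)
```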
The main advantages of star schemas are that they:

Provide a direct and intuitive mapping between the business entities being analyzed by end
users and the schema design.
Provide highly optimized performance for typical data warehouse queries.

Snowflake Schema
The snowflake schema is a more complex data warehouse model than a star schema, and is a type of
star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake.

Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has
been grouped into multiple tables instead of one large table. For example, a product dimension table
in a star schema might be normalized into a Product table, a Product_Category table, and a
Product_Manufacturer table in a snowflake schema. While this saves space, it increases the number of
dimension tables and requires more foreign key joins. The result is more complex queries and reduced
query performance.

A data mart is the access layer of the data warehouse environment that is used to get data out to the
users. The data mart is a subset of the data warehouse that is usually oriented to a specific business
line or team. Data marts are small slices of the data warehouse. Whereas data warehouses have an
enterprise-wide depth, the information in data marts pertains to a single department. In some
deployments, each department or business unit is considered the owner of its data mart including all
the hardware, software and data. This enables each department to use, manipulate and develop their
data any way they see fit; without altering information inside other data marts or the data warehouse.
In other deployments where conformed dimensions are used, this business unit ownership will not
hold true for shared dimensions like customer, product, etc.
Types of Data Marts
Dependent Data Mart and Independent
There are two basic types of data marts: dependent and independent. The categorization is based
primarily on the data source that feeds the data mart. Dependent data marts draw data from a central
data warehouse that has already been created. Independent data marts, in contrast, are standalone
systems built by drawing data directly from operational or external sources of data, or both.
The main difference between independent and dependent data marts is how you populate the data
mart; that is, how you get data out of the sources and into the data mart. This step, called the
Extraction-Transformation-and Loading (ETL) process, involves moving data from operational systems,
filtering it, and loading it into the data mart.
With dependent data marts, this process is somewhat simplified because formatted and summarized
(clean) data has already been loaded into the central data warehouse. The ETL process for dependent

data marts is mostly a process of identifying the right subset of data relevant to the chosen data mart
subject and moving a copy of it, perhaps in a summarized form.
With independent data marts, however, you must deal with all aspects of the ETL process, much as
you do with a central data warehouse. The number of sources is likely to be fewer and the amount of
data associated with the data mart is less than the warehouse, given your focus on a single subject.
The motivations behind the creation of these two types of data marts are also typically different.
Dependent data marts are usually built to achieve improved performance and availability, better
control, and lower telecommunication costs resulting from local access of data relevant to a specific
department. The creation of independent data marts is often driven by the need to have a solution
within a shorter time.
A Slowly Changing Dimension (SCD) is a well-defined strategy to manage both current and historical
data over time in a data warehouse. You must first decide which type of slowly changing dimension to
use based on your business requirements.





Type 1 (Overwriting): Only one version of the dimension record exists. When a change is made, the
record is overwritten and no historic data is stored.

Type 2 (Creating Another Dimension Record): There are multiple versions of the same dimension
record, and new versions are created while old versions are still kept upon modification.

Type 3 (Creating a Current Value Field): There are two versions of the same dimension record: old
values and current values, and old values are kept upon modification of the current values.

Type 1
This methodology overwrites old with new data, and therefore does not track historical data.
Example of a supplier table:

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  CA


In the above example, Supplier_Code is the natural key and Supplier_Key is a surrogate key.
Technically, the surrogate key is not necessary, since the row will be unique by the natural key
(Supplier_Code). However, to optimize performance on joins, use integer rather than character keys.
If the supplier relocates its headquarters to Illinois, the record is overwritten:

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  IL


The disadvantage of the Type 1 method is that there is no history in the data warehouse. It has the
advantage, however, of being easy to maintain.
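Type 1 in miniature: an UPDATE simply overwrites the old value, so no history survives. The keys and states below follow the supplier example and are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_supplier(supplier_key INTEGER PRIMARY KEY, "
            "supplier_code TEXT, supplier_name TEXT, supplier_state TEXT)")
con.execute("INSERT INTO dim_supplier VALUES (123, 'ABC', 'Acme Supply Co', 'CA')")

# The supplier relocates to Illinois: overwrite in place, losing 'CA'.
con.execute("UPDATE dim_supplier SET supplier_state = 'IL' "
            "WHERE supplier_code = 'ABC'")
row = con.execute("SELECT supplier_state FROM dim_supplier").fetchone()
print(row)
```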
Type 2
This method tracks historical data by creating multiple records for a given natural key in the
dimensional tables with separate surrogate keys and/or different version numbers. Unlimited history is
preserved for each insert.
For example, if the supplier relocates to Illinois, the version numbers are incremented sequentially:

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Version
123           ABC            Acme Supply Co  CA              0
124           ABC            Acme Supply Co  IL              1

Another method is to add 'effective date' columns:

Supplier_Key  Supplier_Code  Supplier_Name  Supplier_State  Start_Date  End_Date
123           ABC            Acme Supply    CA              2000-01-01  2004-12-22
124           ABC            Acme Supply    IL              2004-12-22  9999-12-31
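The expire-and-insert mechanics of Type 2 with effective dates can be sketched as follows; the keys and dates are illustrative, and 9999-12-31 marks the open (current) row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_supplier(supplier_key INTEGER PRIMARY KEY, "
            "supplier_code TEXT, supplier_state TEXT, "
            "eff_date TEXT, end_date TEXT)")
con.execute("INSERT INTO dim_supplier "
            "VALUES (123, 'ABC', 'CA', '2000-01-01', '9999-12-31')")

# Supplier relocates: close the current version, open a new one with its
# own surrogate key. Unlimited history is preserved.
today = '2004-12-22'
con.execute("UPDATE dim_supplier SET end_date = ? "
            "WHERE supplier_code = 'ABC' AND end_date = '9999-12-31'", (today,))
con.execute("INSERT INTO dim_supplier "
            "VALUES (124, 'ABC', 'IL', ?, '9999-12-31')", (today,))

history = con.execute("SELECT supplier_key, supplier_state, end_date "
                      "FROM dim_supplier ORDER BY supplier_key").fetchall()
print(history)
```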

Type 3
This method tracks changes using separate columns and preserves limited history. The Type 3
preserves limited history as it's limited to the number of columns designated for storing historical
data. The original table structure in Type 1 and Type 2 is the same, but Type 3 adds additional
columns. In the following example, an additional column has been added to the table to record the
supplier's original state - only the previous history is stored.

Supplier_Key  Supplier_Code  Supplier_Name  Original_Supplier_State  Current_Supplier_State
123           ABC            Acme Supply    CA                       IL
This record contains a column for the original state and a column for the current state; it cannot
track the changes if the supplier relocates a second time.
One variation of this is to create the field Previous_Supplier_State instead of Original_Supplier_State
which would track only the most recent historical change.
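A Type 3 sketch with the Original_ variant: only the current-state column changes on relocation, while the original-state column keeps the single piece of history this method preserves. Names and values are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_supplier(supplier_code TEXT, "
            "original_supplier_state TEXT, current_supplier_state TEXT)")
con.execute("INSERT INTO dim_supplier VALUES ('ABC', 'CA', 'CA')")

# Relocation: only Current_Supplier_State is updated; a second relocation
# would overwrite it again, which is the limitation noted above.
con.execute("UPDATE dim_supplier SET current_supplier_state = 'IL' "
            "WHERE supplier_code = 'ABC'")
row = con.execute("SELECT original_supplier_state, current_supplier_state "
                  "FROM dim_supplier").fetchone()
print(row)
```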


Test Strategy Document

Test plan document

Mapping specification

Macro and Micro design

Copy books

Schema file

Test data files (COBOL, .csv)

Test Environment:
Data stage (ETL Tool)
DB2 (Database)
Putty (UNIX)

ETL Testing Life cycle

ETL testing techniques

Smoke Test:
A smoke test checks the count of records present in the source and the target. Only if the counts
match on both sides do we proceed further; if they do not, the primary key and attribute validations
will all fail, so all subsequent test cases are blocked. When we start testing, the smoke test
therefore comes first.
For example: if 2,000 records are present on the source side, the same number of records should
move to the target, i.e. all 2,000 records should reach the target.
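The count comparison can be sketched against an in-memory database; the table names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src(id INTEGER);
    CREATE TABLE tgt(id INTEGER);
    INSERT INTO src VALUES (1), (2), (3);
    INSERT INTO tgt SELECT id FROM src;
""")
# Smoke test: proceed to deeper validations only when the counts match.
src_count = con.execute("SELECT COUNT(*) FROM src").fetchone()[0]
tgt_count = con.execute("SELECT COUNT(*) FROM tgt").fetchone()[0]
print(src_count, tgt_count)
```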
Primary Key Validation:
A primary key uniquely identifies a record. After the smoke test passes (i.e. the counts match),
we perform primary key validation: the primary keys in source and target are compared after applying
the transformation and occurs rules. If they match, primary key validation passes; otherwise it fails.
Attribute Validation:

After primary key validation we perform attribute testing, i.e. we check whether each attribute
moves correctly to PP as per the mapping transformation. We add an attribute, along with the primary
key and that transformation logic, and run a source-minus-target query. If everything matches, the
source-minus-target count is 0; otherwise the test fails and an error count is shown.
For example: if the PP transformation is 'convert to char', then the data we get in the target
should be in char format.
Duplicate Checking:
This checks whether duplicates are being moved to the target. The main aim of the duplicate test is
to validate that the same data does not move to the target more than once.
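A duplicate check can be sketched by grouping on the natural key and flagging any key that lands in the target more than once; the table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tgt(roll_no INTEGER, salary INTEGER)")
con.executemany("INSERT INTO tgt VALUES (?, ?)",
                [(1, 100), (2, 200), (2, 200)])  # roll_no 2 loaded twice

# Any row returned here is a duplicated key in the target.
dupes = con.execute("SELECT roll_no, COUNT(*) FROM tgt "
                    "GROUP BY roll_no HAVING COUNT(*) > 1").fetchall()
print(dupes)
```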
TDQ (Technical Data Quality): TDQ is performed before applying the transformation. In TDQ we check
whether nulls, spaces, and invalid dates are handled correctly. There are several scenarios:
If the source field data type is varchar(20) or integer, the source value is null or a space, and
the target field is also varchar, then null should be moved in the PP file.
If the source field data type is varchar(50) or integer, the source value is null or a space, and
the target field is integer, then 0 should be moved in the target.
If the source field is varchar(20) or integer, the source value is null or a space, and the target
field is date, then 1111-11-11 should be moved in the target.
If source field contains any invalid values in source and target field is date then
should be moved in target.


CDC (Change Data Capture):

Change Data Capture is the process of capturing changes made at the data source. It improves the
operational efficiency and ensures data synchronization. It easily identifies the data that has been
changed and makes the data available for further use.
CDC is applied when DML (Data Manipulation Language) operations (Insert, Update, and Delete) are
performed on the data.

Execute SQL scripts to check for active records in every individual table.

Attribute values are compared for two periods, Period 1 and Period 2.

The values are compared for four scenarios, Matching, Unmatching, Insertion and Deletion.

Once the scripts are executed for these four scenarios in both the periods, then the results are
compared manually to check for the correctness of CDC that has been applied.
We are using two types of files: Delta file and Full Snapshot file.

Delta File:

Delta processes compare the last historical file with the current one to identify the changes that have
occurred. The delta processing application will only update the data that has changed in the source
Delta load
The delta load process extracts only that data which has changed since the last time a build was run.
The delta load process is used for extracting data for the operational data store of IBM Rational
Insight data warehouse. This topic is an overview of the delta load implementation.
To run the delta load process, you need to store the date and time of the last successful build of the
ETL (extract, transform, and load) process. The CONFIG.ETL_INFO table in the data warehouse is
defined for this purpose. Every time an ETL job is run, some variables are initialized. For the delta load
process, the following two variables are used:

The MODIFIED_SINCE variable.

The ETL job searches the CONFIG.ETL_INFO table to get the date and time of the last successful
ETL run and sets that value in the MODIFIED_SINCE variable, which later ETL builds use to determine
whether the data has changed since the last run.

The ETL_START_TIME variable

The ETL job gets the system date and time and stores that value to the ETL_START_TIME
variable. After the ETL job is over, the value stored in this variable is used to update the CONFIG.ETL_INFO table.

Whether the delta load process works for a specific product or not depends upon the data service
through which the product data is extracted.

Full Snapshot File:

Full snapshot files contain all active source records. A snapshot captures the information each
period regardless of whether a change occurs.
Matching Scenario: No content change occurs between Period 1 and Period 2. The same records are
present in both periods if Period 2 is a full file. If Period 2 is a delta file, then the records
which have not undergone any update will not be present in Period 2. The records always move from
Period 1 to the target with the effective date as the Period 1 date and the end date as 12/31/9999.
Unmatched Scenario: The content of the records differs between Period 1 and Period 2. The records
which underwent an update in Period 1 should be expired, with the end date set to the effective date
of Period 2. The updated records of Period 1 will be active in Period 2 with the end date
12/31/9999. All the records from Period 1 and Period 2 will be moved to the target.
Insertion Scenario: New records are inserted in Period 2. The effective date of these records is the
same as the effective date of Period 2, and the end date is 12/31/9999. All the records, including the
newly inserted ones, are moved to the target.

Deletion Scenario: The records which expired in Period 1 are not present in Period 2, i.e., those
records were deleted. The deleted records are still moved to the target from Period 1, with both the
effective date and the end date set to the effective date of Period 1.
Default Checking: This is done to check whether the defaulted columns are being moved to the target
correctly.
For example: if the transformation rule is null, we should get null in the target.
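The four scenarios above can be expressed as one comparison routine between two period snapshots. This is a hypothetical sketch: the function name, the dict-of-records representation, and the (key, content, effective_date, end_date) row shape are assumptions for illustration, not part of any described product.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # the 12/31/9999 "active" end date

# Illustrative sketch of the four full-snapshot scenarios above.
# Each period is a dict of {business_key: record_content}.

def compare_snapshots(p1, p1_date, p2, p2_date):
    target = []
    for key, content in p1.items():
        if key not in p2:
            # Deletion: expired record, dated with the Period 1 date.
            target.append((key, content, p1_date, p1_date))
        elif p2[key] != content:
            # Unmatched: old version expires on the Period 2 effective date.
            target.append((key, content, p1_date, p2_date))
        else:
            # Matching: carried forward, still open-ended.
            target.append((key, content, p1_date, OPEN_END))
    for key, content in p2.items():
        if key not in p1:
            # Insertion: new record effective from Period 2.
            target.append((key, content, p2_date, OPEN_END))
        elif p1[key] != content:
            # Unmatched: updated version active in Period 2.
            target.append((key, content, p2_date, OPEN_END))
    return target
```

All records from both periods end up in the target, as the scenarios require; only the effective and end dates distinguish active, expired, and deleted rows.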

A report is a route map that keeps track of every result captured from the different testing
scenarios. Testing is carried out in three cycles, and a report is generated for every cycle. We
maintain a 'Test Result Summary' report which explains the overall status of testing.
Test Case Metrics:

Execution of test cases leads to the creation of metrics, which are then incorporated into various
reports for management as well as test team reporting:
Initial metrics cover the total number of test cases required in each area, the number of test
cases completed, the percentage completed, the start date of test case preparation, and the target
completion date.
Metrics are created once the test cases are put into execution mode in terms of Pass/Fail and
are updated accordingly in Quality Center for tracking purposes.
During execution, defect numbers and the volume of defects generated during each test
cycle are also closely monitored and reported. The defects are reported by each integration area to
measure the data quality or code issues and any repeat defects for a particular area.
Useful Unix Commands

cd dirname --- change directory. You basically 'go' to another directory, and you will
see the files in that directory when you do 'ls'. You always start out in your 'home
directory', and you can get back there by typing 'cd' without arguments.
cd .. will get you one level up from your current position. You don't have to walk
along step by step - you can make big leaps, or avoid walking around, by specifying a
full pathname.
mkdir dirname --- make a new directory
mv - move or rename files or directories
rm - remove files or directories
rmdir - remove a directory

cp filename1 filename2 --- copies a file
diff filename1 filename2 --- compares files, and shows where they differ
wc filename --- tells you how many lines, words, and characters there are in a file

grep - searches files for a specified string or expression

ls - list names of files in a directory

ls -l --- lists your files in 'long format', which contains lots of useful information, e.g.
the exact size of the file, who owns the file and who has the right to look at it, and
when it was last modified.

ls -a --- lists all files, including the ones whose filenames begin with a dot, which you
do not always want to see. There are many more options, for example to list files by
size, by date, recursively, etc.

cat --- The most common use of cat is to read the contents of files, and cat is often the most
convenient program for this purpose. All that is necessary to open a text file for viewing on the display
monitor is to type the word cat followed by a space and the name of the file and then press the ENTER
key. For example, the following will display the contents of a file named file1:
cat file1

Test plan:

A test plan is a document detailing a systematic approach to testing a system such as a machine or
software. The plan typically contains a detailed understanding of what the eventual workflow will be.
Test strategy:

A test strategy is an outline that describes the testing approach of the software development cycle. It
is created to inform project managers, testers, and developers about some key issues of the testing
process. This includes the testing objective, methods of testing new functions, total time and
resources required for the project, and the testing environment.
Test strategies describe how the product risks of the stakeholders are mitigated at the test-level,
which types of test are to be performed, and which entry and exit criteria apply. They are created
based on development design documents. System design documents are primarily used and
occasionally, conceptual design documents may be referred to. Design documents describe the
functionality of the software to be enabled in the upcoming release. For every stage of development
design, a corresponding test strategy should be created to test the new feature sets.

A database is an organized collection of data. The data are typically organized to model relevant
aspects of reality (for example, the availability of rooms in hotels), in a way that supports processes
requiring this information (for example, finding a hotel with vacancies).

Database management systems (DBMSs) are specially designed applications that interact with the
user, other applications, and the database itself to capture and analyze data. A general-purpose
database management system (DBMS) is a software system designed to allow the definition, creation,
querying, update, and administration of databases. Well-known DBMSs include MySQL, PostgreSQL,
SQLite, Microsoft SQL Server, Microsoft Access, Oracle, SAP, dBASE, FoxPro, IBM DB2, LibreOffice Base
and FileMaker Pro. A database is not generally portable across different DBMS, but different DBMSs
can inter-operate by using standards such as SQL and ODBC or JDBC to allow a single application to
work with more than one database.

A key is used to establish the relationship between two database objects.

A relational database management system (RDBMS) is a database management system (DBMS) that
is based on the relational model as introduced by E. F. Codd, of IBM's San Jose Research Laboratory.
Many popular databases currently in use are based on the relational database model.
RDBMSs have become a predominant choice for the storage of information in new databases used for
financial records, manufacturing and logistical information, personnel data, and much more. Relational
databases have often replaced legacy hierarchical databases and network databases because they are
easier to understand and use. However, relational databases have been challenged by object
databases, which were introduced in an attempt to address the object-relational impedance mismatch
in relational databases, and by XML databases.


An object-relational database (ORD), or object-relational database management system (ORDBMS), is

a database management system (DBMS) similar to a relational database, but with an object-oriented
database model: objects, classes and inheritance are directly supported in database schemas and in
the query language. In addition, just as with pure relational systems, it supports extension of the data
model with custom data-types and methods.
An object-relational database can be said to provide a middle ground between relational databases
and object-oriented databases (OODBMS). In object-relational databases, the approach is essentially
that of relational databases: the data resides in the database and is manipulated collectively with
queries in a query language; at the other extreme are OODBMSes in which the database is essentially
a persistent object store for software written in an object-oriented programming language, with a
programming API for storing and retrieving objects, and little or no specific support for querying.
The basic goal for the Object-relational database is to bridge the gap between relational databases
and the object-oriented modeling techniques used in programming languages such as Java, C++,
Visual Basic .NET, or C#. However, a more popular alternative for achieving such a bridge is to use
standard relational database systems with some form of object-relational mapping (ORM) software.
Whereas traditional RDBMS or SQL-DBMS products focused on the efficient management of data
drawn from a limited set of data-types (defined by the relevant language standards), an object-

relational DBMS allows software developers to integrate their own types and the methods that apply to
them into the DBMS.

An operational data store (or "ODS") is a database designed to integrate data from multiple sources
for additional operations on the data. The data is then passed back to operational systems for further
operations and to the data warehouse for reporting.
Because the data originates from multiple sources, the integration often involves cleaning, resolving
redundancy and checking against business rules for integrity. An ODS is usually designed to contain
low-level or atomic (indivisible) data (such as transactions and prices) with limited history that is
captured "real time" or "near real time" as opposed to the much greater volumes of data stored in the
data warehouse generally on a less-frequent basis.

Online transaction processing, or OLTP, is a class of information systems that facilitate and manage
transaction-oriented applications, typically for data entry and retrieval transaction processing. The
term is somewhat ambiguous; some understand a "transaction" in the context of computer or
database transactions, while others (such as the Transaction Processing Performance Council) define it
in terms of business or commercial transactions. OLTP has also been used to refer to processing in
which the system responds immediately to user requests. An automatic teller machine (ATM) for a
bank is an example of a commercial transaction processing application.

Contrasting OLTP and Data Warehousing Environments


In computing, online analytical processing, or OLAP, is an approach to answering multi-dimensional
analytical (MDA) queries swiftly. OLAP is part of the broader category of business intelligence, which
also encompasses relational databases, report writing and data mining. Typical applications of OLAP
include business reporting for sales, marketing, management reporting, business process
management (BPM), budgeting and forecasting, financial reporting and similar areas, with new
applications coming up, such as agriculture. The term OLAP was created as a slight modification of the
traditional database term OLTP (Online Transaction Processing).
OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives.
OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and
dicing. Consolidation involves the aggregation of data that can be accumulated and computed in one
or more dimensions. For example, all sales offices are rolled up to the sales department or sales
division to anticipate sales trends. By contrast, drill-down is a technique that allows users to
navigate through the details. For instance, users can view the sales of the individual products that
make up a region's sales. Slicing and dicing is a feature whereby users can take out (slice) a specific
set of data of the OLAP cube and view (dice) the slices from different viewpoints.
Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and
ad-hoc queries with a rapid execution time. They borrow aspects of navigational databases,
hierarchical databases and relational databases.
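The three operations can be illustrated on a tiny in-memory fact set. The offices, products, and amounts below are invented; real OLAP engines work on pre-aggregated multidimensional cubes rather than row scans, so this is only a conceptual sketch.

```python
from collections import defaultdict

# Toy fact rows: (office, product, month, sales amount). All invented.
facts = [
    ("East office", "widgets", "Jan", 100),
    ("East office", "gadgets", "Jan", 150),
    ("West office", "widgets", "Jan", 200),
    ("West office", "widgets", "Feb", 120),
]

def roll_up(facts, dim):
    # Consolidation (roll-up): aggregate the measure along one dimension,
    # e.g. rolling all offices up to office-level totals.
    totals = defaultdict(int)
    for office, product, month, amount in facts:
        key = {"office": office, "product": product, "month": month}[dim]
        totals[key] += amount
    return dict(totals)

def slice_cube(facts, month):
    # Slicing: fix one dimension value and keep the rest of the cube.
    return [f for f in facts if f[2] == month]
```

Drill-down is simply the reverse of roll-up: starting from the office totals, one navigates back to the per-product rows that make them up.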

Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an
interdisciplinary subfield of computer science, is the computational process of discovering patterns in
large data sets involving methods at the intersection of artificial intelligence, machine learning,
statistics, and database systems. The overall goal of the data mining process is to extract information
from a data set and transform it into an understandable structure for further use. Aside from the raw
analysis step, it involves database and data management aspects, data pre-processing, model and
inference considerations, interestingness metrics, complexity considerations, post-processing of
discovered structures, visualization, and online updating.
Data mining uses information from past data to analyze the outcome of a particular problem or
situation that may arise. Data mining analyzes data stored in data warehouses. That data may come
from all parts of the business, from production to management. Managers also use data mining to
decide upon marketing strategies for their products, and can use the data to compare and contrast
with competitors. Data mining turns its data into real-time analysis that can be used to increase
sales, promote new products, or remove products that add no value to the company.

The waterfall model is a sequential design process, often used in software development processes, in
which progress is seen as flowing steadily downwards (like a waterfall) through the phases of
Conception, Initiation, Analysis, Design, Construction, Testing, Production/Implementation, and
Maintenance.
The waterfall development model originates in the manufacturing and construction industries: highly
structured physical environments in which after-the-fact changes are prohibitively costly, if not
impossible. Since no formal software development methodologies existed at the time, this
hardware-oriented model was simply adapted for software development.


Agile software development is a group of software development methods based on iterative and
incremental development, where requirements and solutions evolve through collaboration between
self-organizing, cross-functional teams. It promotes adaptive planning, evolutionary development and
delivery, a time-boxed iterative approach, and encourages rapid and flexible response to change. It is
a conceptual framework that promotes foreseen interactions throughout the development cycle. The
Agile Manifesto introduced the term in 2001.

The V-model represents a software development process (also applicable to hardware development)
which may be considered an extension of the waterfall model. Instead of moving down in a linear way,
the process steps are bent upwards after the coding phase, to form the typical V shape. The V-Model
demonstrates the relationships between each phase of the development life cycle and its associated
phase of testing. The horizontal and vertical axes represent time or project completeness
(left-to-right) and level of abstraction (coarsest-grain abstraction uppermost), respectively.

The spiral model is a software development process combining elements of both design and
prototyping-in-stages, in an effort to combine advantages of top-down and bottom-up concepts. Also
known as the spiral lifecycle model (or spiral development), it is a systems development method
(SDM) used in information technology (IT). This model of development combines the features of the
prototyping model and the waterfall model. The spiral model is intended for large, expensive, and
complicated projects.
The spiral model combines the idea of iterative development (prototyping) with the systematic,
controlled aspects of the waterfall model. It allows for incremental releases of the product, or
incremental refinement through each pass around the spiral. The spiral model also explicitly includes
risk management within software development. Identifying major risks, both technical and
managerial, and determining how to lessen them helps keep the software development process
under control.
Contrary to popular belief, software testing is not just a single activity. It consists of a series of
activities carried out methodically to help certify your software product. These activities (stages)
constitute the Software Testing Life Cycle (STLC).
The different stages in the Software Test Life Cycle are described below.

Each of these stages has definite entry and exit criteria, and activities and deliverables associated
with it. In an ideal world you would not enter the next stage until the exit criteria for the previous
stage are met, but in practice this is not always possible. For this tutorial, we will focus on the
activities and deliverables of the different stages in the STLC. Let's look into them in detail.
Requirement Analysis
During this phase, the test team studies the requirements from a testing point of view to identify the
testable requirements. The QA team may interact with various stakeholders (client, business analyst,
technical leads, system architects, etc.) to understand the requirements in detail. Requirements can
be either functional (defining what the software must do) or non-functional (defining system
performance, security, availability). Automation feasibility analysis for the given testing project is
also done in this stage.

Activities:

Identify types of tests to be performed.

Gather details about testing priorities and focus.

Prepare Requirement Traceability Matrix (RTM).

Identify test environment details where testing is supposed to be carried out.

Automation feasibility analysis (if required).


Deliverables:

Automation feasibility report (if applicable)

Test Planning
This phase is also called the Test Strategy phase. Typically, in this stage, a senior QA manager will
determine effort and cost estimates for the project and will prepare and finalize the test plan.

Activities:

Preparation of test plan/strategy document for various types of testing

Test tool selection

Test effort estimation

Resource planning and determining roles and responsibilities.

Training requirement


Deliverables:

Test plan/strategy document.

Effort estimation document.

Test Case Development

This phase involves the creation, verification, and rework of test cases and test scripts. Test data is
identified or created, and is reviewed and then reworked as well.

Activities:

Create test cases and automation scripts (if applicable)

Review and baseline test cases and scripts

Create test data (If Test Environment is available)


Deliverables:

Test cases/scripts
Test data

Test Environment Setup

The test environment determines the software and hardware conditions under which a work product
is tested. Test environment set-up is one of the critical aspects of the testing process and can be
done in parallel with the Test Case Development stage. The test team may not be involved in this
activity if the customer or development team provides the test environment, in which case the test
team is required to do a readiness check (smoke testing) of the given environment.

Activities:

Understand the required architecture and environment set-up, and prepare the hardware and
software requirement list for the Test Environment.
Setup test Environment and test data

Perform smoke test on the build


Deliverables:

Environment ready with test data set up

Smoke Test Results.

Test Execution
During this phase the test team will carry out the testing based on the test plans and the test cases
prepared. Bugs will be reported back to the development team for correction, and retesting will be
performed.

Activities:

Execute tests as per plan

Document test results, and log defects for failed cases

Map defects to test cases in RTM

Retest the defect fixes

Track the defects to closure


Deliverables:

Completed RTM with execution status

Test cases updated with results

Defect reports

Test Cycle Closure

The testing team will meet, discuss, and analyze testing artifacts to identify strategies that have to
be implemented in the future, taking lessons from the current test cycle. The idea is to remove
process bottlenecks for future test cycles and to share best practices for similar projects in the
future.

Activities:

Evaluate cycle completion criteria based on time, test coverage, cost, software, critical business
objectives, and quality.
Prepare test metrics based on the above parameters.

Document the learning out of the project

Prepare Test closure report

Qualitative and quantitative reporting of quality of the work product to the customer.

Test result analysis to find out the defect distribution by type and severity.


Deliverables:

Test Closure report

Test metrics


What is a Defect/Bug?
A bug can be defined as abnormal behavior of the software. No software exists without bugs; the
elimination of bugs from the software depends upon the efficiency of the testing done on it. A bug is
a specific concern about the quality of the Application Under Test (AUT).
Bug Life Cycle:
In software development process, the bug has a life cycle. The bug should go through the life cycle to
be closed. A specific life cycle ensures that the process is standardized. The bug attains different
states in the life cycle. The life cycle of the bug can be shown diagrammatically as follows:

The different states of a bug can be summarized as follows:

1. New
2. Open
3. Assign
4. Test
5. Verified
6. Deferred
7. Reopened
8. Duplicate
9. Rejected
10. Closed
Description of Various Stages:
1. New: When the bug is posted for the first time, its state will be NEW. This means that the bug is
not yet approved.
2. Open: After a tester has posted a bug, the lead of the tester approves that the bug is genuine and
changes the state to OPEN.

3. Assign: Once the lead changes the state to OPEN, he assigns the bug to the corresponding
developer or developer team. The state of the bug is now changed to ASSIGN.
4. Test: Once the developer fixes the bug, he has to assign the bug to the testing team for the next
round of testing. Before releasing the software with the bug fixed, he changes the state of the bug to
TEST. This specifies that the bug has been fixed and released to the testing team.
5. Deferred: A bug changed to the deferred state is expected to be fixed in one of the upcoming
releases. There are many reasons for moving a bug to this state: the priority of the bug may be low,
there may be a lack of time for the release, or the bug may not have a major effect on the software.
6. Rejected: If the developer feels that the bug is not genuine, he rejects the bug. Then the state of
the bug is changed to REJECTED.
7. Duplicate: If the bug is reported twice, or two bugs describe the same issue, then one bug's status
is changed to DUPLICATE.
8. Verified: Once the bug is fixed and the status is changed to TEST, the tester retests the bug. If the
bug is no longer present in the software, he approves that the bug is fixed and changes the status to
VERIFIED.
9. Reopened: If the bug still exists even after the bug is fixed by the developer, the tester changes the
status to REOPENED. The bug traverses the life cycle once again.
10. Closed: Once the bug is fixed, it is tested by the tester. If the tester feels that the bug no longer
exists in the software, he changes the status of the bug to CLOSED. This state means that the bug is
fixed, tested and approved.
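The life cycle above can be sketched as an allowed-transition table. The exact set of transitions varies between defect-tracking tools; the mapping below is an assumption pieced together from the stage descriptions, not a standard.

```python
# Bug life cycle as an allowed-transition table. The transitions are an
# illustrative assumption based on the stage descriptions above.

TRANSITIONS = {
    "NEW":      {"OPEN", "REJECTED", "DUPLICATE", "DEFERRED"},
    "OPEN":     {"ASSIGN"},
    "ASSIGN":   {"TEST", "DEFERRED"},
    "TEST":     {"VERIFIED", "REOPENED"},
    "VERIFIED": {"CLOSED"},
    "REOPENED": {"ASSIGN"},   # reopened bugs traverse the cycle again
    "DEFERRED": {"ASSIGN"},   # picked up again in a later release
}

def move(state, new_state):
    # Enforce that the bug only follows the standardized life cycle.
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

Encoding the cycle as data makes the "standardized process" point concrete: any tool can reject an update like NEW -> CLOSED that skips the intermediate states.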
While defect prevention is much more effective and efficient in reducing the number of defects, most
organizations conduct defect discovery and removal. Discovering and removing defects is an
expensive and inefficient process. It is much more efficient for an organization to conduct activities
that prevent defects in the first place.
Guidelines on deciding the Severity of Bug:
Indicate the impact each defect has on testing efforts or users and administrators of the application
under test. This information is used by developers and management as the basis for assigning priority
of work on defects.
A sample guideline for assignment of Priority Levels during the product test phase includes:



Critical / Show Stopper: An item that prevents further testing of the product or function
under test can be classified as a critical bug. No workaround is possible for such bugs. Examples
include a missing menu option or a security permission required to access a function under test.
Major / High: A defect that does not function as expected/designed, or causes other
functionality to fail to meet requirements, can be classified as a major bug. A workaround can
be provided for such bugs. Examples include inaccurate calculations or the wrong field
being updated.
Average / Medium: Defects which do not conform to standards and conventions can be
classified as medium bugs. Easy workarounds exist to achieve functionality objectives.
Examples include matching visual and text links which lead to different end points.

Minor / Low: Cosmetic defects which do not affect the functionality of the system can be
classified as minor bugs.

Guidelines on writing Bug Description:

A bug can be expressed as the result followed by the action; that is, the unexpected behavior
occurring when a particular action takes place is given as the bug description.

Be specific. State the expected behavior which did not occur - such as a pop-up that did not
appear - and the behavior which occurred instead.

Use present tense.

Don't use unnecessary words.

Don't add exclamation points. End sentences with a period.

DON'T USE ALL CAPS. Format words in upper and lower case (mixed case).

Always mention the steps to reproduce the bug.

Keys Related To Dimensional Data Modeling

Business Key:
A business key or natural key is a key that identifies the uniqueness of a row based on columns that
exist naturally in a table according to business rules. Examples of business keys are the customer
code in a customer table, or the composite of sales order header number and sales order item line
number within a sales order details table.
Natural Key:
A natural key is a key that is formed of attributes that already exist in the real world. For example, a
USA citizen's social security number could be used as a natural key.
Surrogate Key:
A surrogate key in a database is a unique identifier for either an entity in the modeled world or an
object in the database. The surrogate key is not derived from application data.
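The distinction between the key types can be shown with a minimal dimension-table sketch: the surrogate key is just a generated sequence number, while the natural/business key (here a hypothetical customer code) comes from the source data. The class and method names are invented for the example.

```python
# Minimal illustration of a surrogate key: a generated identifier with no
# meaning derived from the application data, stored alongside the
# natural/business key that the source system supplies.

class DimensionTable:
    def __init__(self):
        self._next_key = 1
        self._by_natural_key = {}

    def get_or_create(self, natural_key, attributes):
        # The natural key (e.g. a customer code) comes from the source data;
        # the surrogate key is simply the next integer in sequence.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = (self._next_key, attributes)
            self._next_key += 1
        return self._by_natural_key[natural_key][0]
```

Because the surrogate key carries no business meaning, it stays stable even if the natural key's format or the record's attributes later change in the source system.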

Important Links for your reference

For Manual Testing and others:
For DWH: