
1 Display LAST NAME alone from the below list

KARTHI P
BHUVI P
HARI
MOHAN S
VINAY

2 Display the number of values present in the below column

MARKS
12,34,55,98
87,34
23,54,67
12,23,23,24,34

3 Display the below numbers alone from a column

11,22,33,44,55,66,77,88,99

4 Display the employees who got hired in April month

5 How to display employees who joined on leap years

6 How to replace multiple commas into single comma

7 How to fetch 55th to 60th rows in a table based on row_id column

8 Display employees who joined on the same date
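For question 2 above, one possible sketch (assuming a hypothetical table marks_table with a varchar column marks): the number of values equals the number of commas plus one.

-- hypothetical names: marks_table / marks
select marks,
       len(marks) - len(replace(marks, ',', '')) + 1 as value_count
from marks_table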


Why do you need ETL?

It helps companies to analyze their business data for taking critical business decisions.
Transactional databases cannot answer the complex business questions that can be answered by ETL.
A Data Warehouse provides a common data repository
ETL provides a method of moving the data from various sources into a data warehouse.
As data sources change, the Data Warehouse will automatically update.
Well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.
Allows verification of data transformation, aggregation and calculation rules.
ETL process allows sample data comparison between the source and the target system.
ETL process can perform complex transformations and requires the extra area to store the data.
ETL helps to Migrate data into a Data Warehouse. Convert to the various formats and types to adhere to one consistent system
ETL is a predefined process for accessing and manipulating source data into the target database.
ETL in data warehouse offers deep historical context for the business.
It helps to improve productivity because it codifies and reuses without a need for technical skills.

Explain bug life cycle.


When a tester finds a bug, it is assigned the status NEW or OPEN.
The bug is either assigned to Development Project Managers or is given to the Bug Bounty Program. They will check whether it is a valid defect. If it is not valid, the bug is rejected, and its new status is REJECTED.
Now, the tester checks whether a similar defect was raised earlier. If yes, the defect is assigned the status ‘DUPLICATE’.
Once the bug is fixed, the defect is assigned the status ‘FIXED’.
Next, the tester re-tests the code. If the test case passes, the defect is CLOSED.
If the test case fails again, the bug is RE-OPENED and assigned to the developer.

Why do Organizations Need Data Warehouse?


Organizations with organized IT practices are looking forward to creating the next level of technology transformation.
They are now trying to make themselves much more operational with easy-to-interoperate data.
Having said that, data is the most important part of any organization; it may be everyday data or historical data.
Data is the backbone of any report and reports are the baseline on which all vital management decisions are taken.
Most companies are taking a step forward in constructing their data warehouse to store and monitor real-time data as well as historical data.
Crafting an efficient data warehouse is not an easy job. Many organizations have distributed departments with different applications running on distributed technology.
An ETL tool is employed in order to make a flawless integration between different data sources from different departments.
The ETL tool will work as an integrator, extracting data from different sources, transforming it into the preferred format based on the business
transformation rules and loading it into a cohesive DB known as Data Warehouse.
Well planned, well defined and effective testing scope guarantees smooth conversion of the project to production.
A business gains real buoyancy once the ETL processes are verified and validated by an independent group of experts to make sure that the data warehouse is concrete and robust.
ETL or Data Warehouse testing is categorized into four different engagements irrespective of the technology or ETL tools used:
New Data Warehouse Testing: A new DW is built and verified from scratch. Data input is taken from customer requirements and different data sources, and a new data warehouse is built and verified with the help of ETL tools.
Migration Testing: In this type of project, customers will have an existing DW and ETL performing the job, but they are looking to bag new tools in order to improve efficiency.
Change Request: In this type of project, new data is added from different sources to an existing DW. Also, there might be a condition where customers need to change their existing business rules or they might integrate the new rules.
Report Testing: A report is the end result of any Data Warehouse and the basic purpose for which a DW is built. The report must be tested by validating the layout, the data in the report and the calculations.

ETL Testing Techniques


1) Data Transformation Testing: Verify if data is transformed correctly according to various business requirements and rules.

2) Source to Target Count Testing: Make sure that the count of records loaded in the target is matching with the expected count.
3) Source to Target Data Testing: Make sure that all projected data is loaded into the data warehouse without any data loss or truncation.

4) Data Quality Testing: Make sure that the ETL application appropriately rejects, replaces with default values and reports invalid data.

5) Performance Testing: Make sure that data is loaded in the data warehouse within the prescribed and expected time frames to confirm improved performance and scalability.

6) Production Validation Testing: Validate the data in the production system & compare it against the source data.

7) Data Integration Testing: Make sure that the data from various sources has been loaded properly to the target system and all the threshold values are checked.

8) Application Migration Testing: In this testing, ensure that the ETL application is working fine on moving to a new box or platform.

9) Data & constraint Check: The datatype, length, index, constraints, etc. are tested in this case.

10) Duplicate Data Check: Test if there is any duplicate data present in the target system. Duplicate data can lead to incorrect analytical reports.

Apart from the above ETL testing methods, other testing methods like system integration testing, user acceptance testing, incremental testing,
regression testing, retesting and navigation testing are also carried out to make sure that everything is smooth and reliable.

Given below is the list of objects that are treated as essential for validation in this testing:

Verify that data transformation from source to destination works as expected.


Verify that the expected data is added to the target system.
Verify that all DB fields and field data are loaded without any truncation.
Verify data checksum for record count match.
Verify that for rejected data proper error logs are generated with all the details.
Verify NULL value fields
Verify that duplicate data is not loaded.
Verify data integrity
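
One way to cover the record-count check in the list above is a simple count comparison; this is only a sketch and assumes illustrative names source_table and target_table:

select
    (select count(*) from source_table) as source_count,   -- illustrative source table
    (select count(*) from target_table) as target_count    -- counts should match after the load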

ETL Testing Challenges


This testing is quite different from conventional testing. Many challenges are faced while performing data warehouse testing.
Here are a few challenges that I experienced on my project:

Incompatible and duplicate data


Loss of data during ETL process.
Unavailability of the inclusive testbed.
Testers have no privileges to execute ETL jobs on their own.
The volume and complexity of the data is huge.
Fault in business processes and procedures.
Trouble acquiring and building test data
Unstable testing environment
Missing business flow information
Data is important for businesses to make critical business decisions. ETL testing plays a significant role in validating and ensuring that the business information is accurate, consistent and reliable. It also minimizes the hazard of data loss in production.

Test Scenario – Test Cases


Mapping doc validation:
Verify the mapping doc whether the corresponding ETL information is provided or not. A change log should be maintained in every mapping doc.

Validation:
1. Validate the source and target table structure against the corresponding mapping doc.
2. Source data type and target data type should be the same.
3. Length of data types in both source and target should be equal.
4. Verify that data field types and formats are specified.
5. Source data type length should not be less than the target data type length.
6. Validate the names of the columns in the table against the mapping doc.

Constraint validation:
Ensure the constraints are defined for the specific table as expected.

Data consistency issues:
1. The data type and length for a particular attribute may vary in files or tables though the semantic definition is the same.
2. Misuse of integrity constraints.

Completeness issues:
1. Ensure that all expected data is loaded into the target table.
2. Compare record counts between source and target.
3. Check for any rejected records.
4. Check that data is not truncated in the columns of the target tables.
5. Check boundary value analysis.
6. Compare unique values of key fields between the data loaded to the WH and the source data.

Correctness issues:
1. Data that is misspelled or inaccurately recorded.
2. Null, non-unique or out-of-range data.

Transformation:
Verify that the data is transformed correctly according to the business rules in the mapping doc.

Data quality:
1. Number check: need to check and validate numeric fields.
2. Date check: dates have to follow the date format and it should be the same across all records.
3. Precision check.
4. Data check.
5. Null check.

Null validation:
Verify the null values where “Not Null” is specified for a specific column.

Duplicate check:
1. Validate that the unique key, primary key and any other column that should be unique as per the business requirements do not have any duplicate rows.
2. Check if any duplicate values exist in any column which is extracted from multiple columns in the source and combined into one column.
3. As per the client requirements, ensure that there are no duplicates in the combination of multiple columns within the target.

Date validation – date values are used in many areas of ETL development:
1. To know the row creation date.
2. To identify active records from the ETL development perspective.
3. To identify active records from the business requirements perspective.
4. Sometimes updates and inserts are generated based on the date values.

Complete data validation:
1. To validate the complete data set in the source and target tables, a minus query is the best solution.
2. We need to run source minus target and target minus source.
3. If the minus query returns any rows, those should be considered mismatching rows.
4. Match rows between source and target using the intersect statement.
5. The count returned by intersect should match the individual counts of the source and target tables.
6. If the minus query returns rows and the intersect count is less than the source or target count, then duplicate rows exist.

Data cleanness:
Unnecessary columns should be deleted before loading into the staging area.
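
For the complete data validation rows above, a sketch of the two-way minus comparison using EXCEPT (SQL Server's equivalent of MINUS); the table names are placeholders:

-- rows present in source but missing in target
select * from source_table
except
select * from target_table

-- rows present in target but not in source
select * from target_table
except
select * from source_table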
What to do when a defect is found in production but not during the QA phase?

Reproduce the problem on the production and testing environments.
If the problem occurs only on the production environment, then it may be due to a configuration issue.
On the other side, if it occurs on the QA environment, then check the impact of that issue on the application.
Do a clear root cause analysis.
Investigate the issue to find out how long that defect has been around.
Determine the fix for the ticket and list out the areas where it can have more impact.
If the issue is impacting more customers, then go for a hotfix and deploy it on the QA environment.
The testing team should focus on testing all the regression scenarios around the fix.
If the applied fix works fine, it should be deployed to production, and a post-release sanity check should be done so that it does not occur again.
Do a retrospection meeting.

Bug: It is the consequence/outcome of a coding error.
Defect: Testing is the process of identifying defects, where a defect is any variance between actual and expected results, i.e. the application does not actually meet the requirements. An error found by the tester is called a defect; once the defect is accepted by the development team, it is called a bug.
A slowly changing dimension (SCD) is a dimension that is able to handle data attributes which change over time.

For example:

A customer dimension may hold attributes such as name, address, and phone number.

Over time, a customer's details may change (e.g. move addresses, change phone number, etc).

A slowly changing dimension is able to accommodate these changes, with some SCD patterns having the added ability to preserve history.
Deciding on which type of slowly changing dimension pattern to implement will vary based on your business requirements.

SCD Type 0
There are situations where you ignore any changes. For example, when an employee joins an organization, there are joining-related attributes
such as JoinedDesignation and JoinedDate, etc. that should not change over time.

The following is the example for Type 0 of Slowly Changing Dimensions in Data Warehouse.

In the above Customer Dimension, FirstDesignation, JoinedDate and DateFirstPurchase are the attributes that will not be updated which is Type 0 SCD.

SCD Type 1
In the Type 1 SCD, you simply overwrite data in dimensions. There can be situations where you don’t have the entire data
when the record is initiated in the dimension. For example, when the customer record is initiated, you may not get all attributes.
Therefore, when the customer record is initiated at the operational database, there will be empty or null records in the customer records.
Once the ETL is executed, those empty records will be created in the data warehouse. Once these attributes are filled in the operational databases, they have to be updated in the data warehouse.
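
A minimal Type 1 sketch, assuming a Dim_Customer table and an illustrative Phone attribute: the new value simply overwrites the old one, so no history is kept.

-- Type 1: overwrite in place (illustrative column and value)
update Dim_Customer
set Phone = '555-0100'
where CustomerCode = 'AW00011012';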

SCD Type 2
Type 2 Slowly Changing Dimensions in Data warehouse is the most popular dimension that is used in the data warehouse.
As we discussed, a data warehouse is used for data analysis. If you need to analyze data, you need to accommodate historical aspects of data. Let us see how we can implement SCD Type 2.

For the SCD Type 2, we need to include three more attributes such as StartDate, EndDate and IsCurrent as shown below.
Type 2 Slowly Changing Dimensions in Data Warehouse
In the above customer dimension, there are two records. Let us say that the customer whose CustomerCode is AW00011012 has been promoted to Senior Management. However, if you simply update the record with the new value, you will not see the previous records. Therefore, a new record will be created with a new CustomerKey and a new Designation, while the other attributes remain the same.

Implementation of Type 2 Slowly Changing Dimensions in Data Warehouse.


As you can see in the above figure, CustomerCode AW00011012 has a new record with CustomerKey 11013. All the new transactions will be related to CustomerKey 11013 while the previous transactions are related to CustomerKey 11012. This mechanism helps to preserve the historic aspect of the customer, as shown in the below query.
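
A sketch of that expire-and-insert step, assuming the StartDate/EndDate/IsCurrent columns mentioned above; the key values are illustrative.

-- close the current version of the customer
update Dim_Customer
set EndDate = getdate(), IsCurrent = 0
where CustomerCode = 'AW00011012' and IsCurrent = 1;

-- insert the new version with a new surrogate key
insert into Dim_Customer (CustomerKey, CustomerCode, Designation, StartDate, EndDate, IsCurrent)
values (11013, 'AW00011012', 'Senior Management', getdate(), null, 1);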

SELECT C.Designation,SUM(SalesAmount) SalesAmount,SUM(TotalProductCost) TotalProductCost


FROM FactInternetSales F
INNER JOIN Dim_Customer C ON F.CustomerKey = C.CustomerKey
GROUP BY C.Designation
Once the query is executed, the following results will be observed.

Sample dataset for the Type 2 SCD.


As you can see, the Management designation can be seen in the above result, which means that it has covered the historical aspects.
Type 2 SCD is one of the implementations where you cannot avoid surrogate keys in dimensional tables in the data warehouse.

SCD Type 3
Type 3 Slowly Changing Dimension in Data Warehouse is a simple implementation where history will be kept in an additional column.
If we relate the same scenario that we discussed under Type 2 SCD to Type 3 SCD, the customer dimension would look like below.
Type 3 SCD

As you can see, historical aspects of the data are preserved as a different column. However, this method will not be scalable if you want to preserve history.

Typically, this would be better suited to implement name changes of an employee. In some cases, female employees will change their names after their marriage. In such situations, you can use Type 3 SCD since these types of changes will not occur rapidly. Further, this technique allows you to keep only the last version of the history, unlike Type 2 SCD.
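
A Type 3 sketch under the same scenario, assuming a new PreviousDesignation column is added to hold the prior value (run the ALTER first, then the UPDATE):

alter table Dim_Customer add PreviousDesignation varchar(50);

update Dim_Customer
set PreviousDesignation = Designation,      -- keep the old value in the extra column
    Designation = 'Senior Management'       -- overwrite the current value
where CustomerCode = 'AW00011012';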
SCD Type 4
As we discussed in SCD Type 2, we maintain the history by adding a different version of the row to the dimension. However, if the changes are rapid in nature, Type 2 SCD will not be scalable.

For example, let us assume we want to keep the customer risk type depending on his previous payment.
Since this is an attribute related to the customer, it should be stored in the customer dimension. This means every month there will be a new version of the
customer record. If you have 1,000 customers, you are looking at 12,000 records per year. As you can imagine, this Slowly Changing Dimension in the Data Warehouse is not scalable.

SCD Type 4 is introduced in order to fix this issue. In this technique, a rapidly changing column is moved out of the dimension and is moved to a new dimension
table. This new dimension is linked to the fact table as shown in the below diagram.
With the above implementation of Type 4 Slowly Changing Dimensions in Data Warehouse, you are eliminating the unnecessary volume in the main dimension.
However, still you have the capabilities of performing the required analysis.

SCD Type 6
Type 6 Slowly Changing Dimensions in Data Warehouse is a combination of Type 2 and Type 3 SCDs. This means that Type 6 SCD has both columns and rows in its implementation.

With this implementation, you can further improve the analytical capabilities in the data warehouse
Basic insert,create,delete Delete table from ship;
How to validate the data without minus query Use Not exits/Intersect

What will be the output (Number of records) for below


operators.
Table A has one column with values(1,1,2,2,2,3,4,5)
Table B has one column with values (1,1,1,2,2,3)
1. A Union B
2. A Intersection B
3. (A Union B) minus (A Intersection B)
4. A minus B
5. A Left Join B
6. A Right join B
7. A Inner...

what is your testing approach if you get mismatch


records and how do you identify mismatch records

Duplicate records 1st method:count,groupby,having


2nd method:Row_number using
with clause
3rd method:Case and Lag

Remove duplicates: max, row_number/rank, min()/self join


removing duplicates by creating back up table
without dropping orginal table

find the manager name for an employee where empid and managerid are in the same table: self join or left join (left join also gives you data for the employees who don't have a managerid)

use recursive CTE

Add a new column in the table

Second highest salary using max

third highest using top

4 th highest salary using limit


Limit n-1,1

using dense_rank
we can use a WITH clause (with dense_rank inside it).
Don't use row_number: it gives sequential numbers, so if there are
duplicates it is better to use dense_rank.

find max salary for each department


alternative for TOP clause in sql

show a single or same row from a table twice in the results: use union all
find departments that have less than 3 employees: use join, group by, having

ISNULL

convert row to column pivot

case,Max

Custom sorting: order by month — order by month name won't work, so use CASE in the ORDER BY

i/p:
name sales
april 100
jan 200
june 300
dec 150

use orderby Month(date)

i/p:
Date sales
2021-10-01 100
2021-01-01 100
2021-03-01 100
2021-02-01 100
previous quarter sales use lag( )
i/p:
year sales quarter
2021 100 Q1
2021 200 Q2
2021 300 Q3
2020 500 Q1
2020 300 Q2
2020 400 Q3

Lead()
i/p:
year sales quarter
2021 100 Q1
2021 200 Q2
2021 300 Q3
2020 500 Q1
2020 300 Q2
2020 400 Q3

split concatenated string to columns: substr/substring and INSTR


FULLNAME --firstname,lastname
substring_index

left,right,len,charindex

i/p:
id Name
1 cooper,adam

Replace special character use replace and also ASCII char(9),


char(10),char(13)

TOP 50%record
fetch last 5 records
Which is faster in or EXISTS in SQL?

what is diff b/w CHAR and VARCHAR2 datatype


fetch random rows from table

Round the values 240


235.400
235.420
200.000
Add Zeros infront of the phonenumber

get unique values without using distinct

group by
Union since its removes duplicates
ORACLE SQL
Windows functions RANK,LAG,LEAD

How will you convert text into a date


Is it good to have the same subquery multiple times?
order of execution
Why we need surrogate (non-natural)key in dim tables?
Find the employees hired last n month last 3 months

last 30 days

last one year

last two year

find rows which retrieve only numeric data ISNUMERIC


query to find the department with the highest number of employees

to have only the department name, not the count

can we join the table without primary and foreign relation


blocking and deadlocking Blocking
deadlocking

select query to retrieve all the students whose name starts with 'M' without the LIKE operator: charindex, left, substring

people born on same day and month excluding year


people whose birth year is the same (2017, 2018)
. All people who are born yesterday, today, tomorrow, last seven days, and next 7 days

tomorrow's date
since yesterday

last 7 days excluding today

delete parent child rows use casade delete

delete child table then parent


Find the numbers from a string, e.g. Nir12ma0la — separate it into id as the number and name as nirmala: use patindex, stuff

How to replace multiple commas into single comma use replace

Fetch 55 to 60 rows from table use row_number

How to display employees who joined in leap years: either with only one condition, year%4=0,
or use all the leap-year conditions with a column flagging each hire year
as leap or not a leap year

just employee list

Display employees who joined on the same date self join

Display the employees who got hired in the month of April: Month(date)=4


or Datename(month, date)='April'
Display the below numbers alone from a column

11,22,33,44,55,66,77,88,99
insert into ship(id,name) values (1,'sherin');
SELECT * FROM [shipPERS] AS A
WHERE NOT EXISTS ( SELECT NULL FROM ship AS B WHERE A.SHIPPERID=B.SHIPPERID)

select * from a union select * from b


select * from a intersect select * from b
(select * from a union select * from b)Minus(select * from a intersect select * from b)
select * from a minus select * from b
select * from a left join b on a.id=b.id
select * from a right join b on a.id=b.id
select * from a inner join b on a.id=b.id

Use the minus (a-b) and (b-a) operators and analyse the scenarios in the next sheet manually.
Else we can use an inner/full outer join using primary keys where a.PK=b.PK; make sure the
primary keys don't have duplicates.
select emp,salary,count(*) from emp
group by emp,salary
having count(*) >1
WITH CTE as( select *,ROW_NUMBER() over (partition by email order by email) as RN from
customer)
select * from CTE where RN>1
select *,case when email= lag(email) over (order by email) then 'Yes' else 'No' end
duplicate
from customer order by email

delete from ship where shipperid in(


SELECT max(shipperid) from ship group by shippername,phone having count(*)>1)

delete from cars where id in (


select b.id from car a join car b on a.model=b.model and a.brand =b.brand where
a.id<b.id)

with CTE as(


select shippername,phone,RANK() over (partition by shippername,phone order by
shipperid) as rank from ship ) delete from cte where rank>1

with CTE as(


select shippername,phone,row_number() over (partition by shippername,phone order by
shipperid) as rownum from ship) delete from cte where rownum>1
select distinct * from ship
create table bk as select * from ship
select * from bk
truncate table ship
insert into select * from bk
drop table bk

select e1.empid,e1.empname,e2.empid as managerid,e2.empname as managername


from emp e1
join emp e2
on e1.managerid=e2.empid;

alter table emp add salary numeric (50)


Update emp set salary =10000 where empid=1
select * from emp where salary in (select Max(salary) from emp where salary < (
select Max(salary)from emp))

select * from emp where salary in (


select Min(salary) from
(select distinct top 3 salary from emp
order by salary desc) as tb
)
select * from emp where salary in (
select top 1 salary from
(select distinct top 3 salary from emp
order by salary desc) as tb
order by salary)

select salary from emp


order by salary desc
limit 3,1;
select top 1* from (select name,salary,dense_rank() over(order by salary desc) as rnk
from emp) as tmp
where tmp.rnk=3

select deptid,max(salary) from emp group by deptid


set rowcount 3
select * from emp
set rowcount 0
select depname from department where depname='IT' union all
select depname from department where depname='IT'
select e.deptid,d.deptname from emp e join department d
on e.deptid=d.deptid
group by e.deptid,d.deptname
having count(empid)<3

select ISNULL(null,'Nomanager'), it returns 'Nomanager' as the first argument is NULL - o/p is Nomanager


select ISNULL('pragim','nomanager'),when we are not passing null,if its not nll then it
returns the value . o/p is pragim

select id,[NAME],[GENDER],[SALARY] from


(select id,name as ename,value from empl) as source
PIVOT
(max(value)
for
ename in ([NAME],[GENDER],[SALARY])
) as PivotTable

select id,
case when Name='name' then value else ' ' end as name,
case when Name='Gender' then value else ' ' end as Gender,
case when Name='salary' then value else ' ' end as salary
from empl
o/p id name gender salary
1 Adam
1 male
1 50,000

select * from sales order by case when name='jan' then 1
when name='april' then 4
when name='june' then 6
when name='dec' then 12 else NULL end

select datename(month,date) as dd,Month(date) as d ,sales from


sales_detail order by Month(date)

Dd d sales
oct 10 100
jan 1 100
mar 3 100
feb 2 100
select year,quartername,sales ,lag(sales) over (partition by year order
by quarter) as previousquartersales from sales_detail

select year,quartername,sales ,LEAD(sales) over (partition by year order


by quarter desc) as previousquartersales from sales_detail

select name,SUBSTRING(name,1,INSTR(name,',')-1) as lastname,
SUBSTRING(name,INSTR(name,',')+1) as firstname
from emp
SELECT SUBSTRING_INDEX("venugopal,saranya", ",", 1)as lastname
SELECT SUBSTRING_INDEX("venugopal,saranya", ",", 1)as lastname
SELECT SUBSTRING_INDEX("venugopal,saranya", ",", -1)as firstname

select name,left(name,charindex(',',name)-1) as lastname,
right(name,len(name)-charindex(',',name)) as firstname
from emp
note:another one way is string_split(use pivot tedious one)

replace(address,' ',''), replace(address,char(9),'')

select TOP 50 percent * from emp


select * from emp where empid > ((select count(*) from emp)-5)
When the subquery results are large, EXISTS operator provides better performance. In
contrast, when the sub-query results are
small, the IN operator is faster than EXISTS.
the IN clause can't compare anything with NULL values, but the EXISTS clause can compare
everything with NULLs.

NOT IN doesn't return any records when the subquery contains NULLs, whereas NOT EXISTS does.


e.g. emp table (empid, depid): 1-NULL, 2-3, 3-4; department table (depid, depname): 1-S, 3-C, 4-E

The NOT IN query doesn't return the unmatched id because of the NULL, whereas NOT EXISTS will return it.
CHAR is used to store strings of fixed length
VARCHAR2 is used to store strings of variable length
SELECT column FROM table
ORDER BY RAND ( )
LIMIT 1
SELECT ROUND(235.415, -1) AS RoundValue
SELECT ROUND(235.415, 1) AS RoundValue
SELECT ROUND(235.415, 2) AS RoundValue
SELECT ROUND(235.415, -2) AS RoundValue;
SELECT concat('000',substring(phone,6,15)) a,*
FROM Shippers;
select * from (select *, row_number() over (partition by id order by id) as rw from ship) t
where rw=1
select id,name from emp group by id,name
select id,name from emp union select id,name from emp
select id,name from emp union select null,null from dual where 1=2
function which uses values from one or multiple rows to return a value for each row.
(This contrasts with an aggregate function, which returns a single value for multiple rows.)

select CAST('31-01-2022' as datetime)


No,we can use WITH clause
FROM ,JOIN,WHERE,GROUP BY,HAVING,SELECT,ORDER BY,LIMIT
Surrogate keys are necessary to handle changes in dimension table attributes.
select datediff(month,hiredate,getdate()) as diff
from emp
where datediff(month,hiredate,getdate()) between 1 and 3
order by hiredate desc
select datediff(day,hiredate,getdate()) as diff
from emp
where datediff(day,hiredate,getdate()) between 1 and 30
order by hiredate desc
select datediff(year,hiredate,getdate()) as diff
from emp
where datediff(year,hiredate,getdate()) between 1 and 1
order by hiredate desc
select datediff(year,hiredate,getdate()) as diff
from emp
where datediff(year,hiredate,getdate()) between 0 and 2
order by hiredate desc
select value from emp where ISNUMERIC(value)=1
select top 1 d.depname, count(*) as count
from employee e join department d
on e.depid=d.depid
group by d.depname order by count(*) desc

select top 1 d.depname


from employee e join department d
on e.depid=d.depid
group by d.depname order by count(*) desc

yes we can join as long as column values involved in the join can be converted into one datatype
Blocking occurs if a transaction tries to acquire an incompatible lock on a resource that
another transaction has already locked. The blocked transaction remains blocked until the
blocking transaction releases the lock.
Deadlocking occurs when two or more transactions have a resource locked and each transaction
requests a lock on the resource that
the other transaction has already locked. Neither of the transactions can move
forward, as each one is waiting for the other to release the lock.

In this case, SQL Server intervenes and ends the deadlock by cancelling one of the
transactions, so the other
transaction can move forward.
select * from emp where charindex('M',name)=1
select * from emp where Left(name,1)='M'
select * from emp where substring(name,1,1)='M'
select name,cast(dob as date) from emp where day(dob)=9 and month(dob)=10
select name,cast(dob as date) from emp where year(dob)=2017
select dateadd(day,-1,cast(getdate() as date)) -- yesterday's date
select name,cast(dob as date) from emp where cast(dob as date)=dateadd(day,-1,cast(getdate() as date))
select name,cast(dob as date) from emp where cast(dob as date)=dateadd(day,1,cast(getdate() as date))
select name,cast(dob as date) from emp where cast(dob as date) between
dateadd(day,-1,cast(getdate() as date)) and cast(getdate() as date)
select name,cast(dob as date) from emp where cast(dob as date) between
dateadd(day,-7,cast(getdate() as date)) and dateadd(day,-1,cast(getdate() as date))
Alter table Employees
add constraint FK_Dept_Employees_Cascade_Delete
foreign key (DeptId) references Departments(Id) on delete cascade
Begin Try

Begin Tran

Declare @GenderToDelete int = 2

-- Delete first from child tables


Delete from Teachers where GenderId = @GenderToDelete
Delete from Students where GenderId = @GenderToDelete

-- Finally Delete from parent table


Delete from Gender where Id = @GenderToDelete

Commit Tran
End Try

Begin Catch

Rollback Tran

End Catch
Create function UDF_ExtractNumbers
(
@input varchar(255)
)
Returns varchar(255)
As
Begin
Declare @alphabetIndex int = Patindex('%[^0-9]%', @input)
Begin
While @alphabetIndex > 0
Begin
Set @input = Stuff(@input, @alphabetIndex, 1, '' )
Set @alphabetIndex = Patindex('%[^0-9]%', @input )
End
End
Return @input
End

declare @input varchar(50)='saran,,venu,,,geetha,,,,bhuvi,,,,,,'


select replace(@input,',','.,')
select replace(replace(@input,',','.,'),',.','')
select replace(replace(replace(@input,',','.,'),',.',''),'.','')

select * from ( select row_number() over (order by outletnumber) as rn,* from wrap.
dimoutlet) a where rn between 55 and 60
select * from emp where year(hiredate)%4=0
select * from (
select *, case when (year(hiredate)%4=0 and year(hiredate)%100<>0) or year(hiredate)%400=0
then 'Leapyear'
else 'Not a leap year'
end as status
from emp) a
where a.status='Leapyear'

select * from emp where hiredate in (


select case when (year(hiredate)%4=0 and year(hiredate)%100<>0) or year(hiredate)%400=0
then hiredate
end
from emp)

select a.* from emp a join emp b on a.hiredate=b.hiredate


where a.id<>b.id
select * from emp where month(hiredate)=4
select b.rn from (
select *,
row_number() over (order by id) as rn
from emp
)b
where b.rn%11=0 and b.rn<=99
create table ship(id int,name varchar(255));
SELECT * FROM [shipPERS] intersect
SELECT * FROM ship;

1) 1,2,3,4,5
2) 1,2,3
3) 4,5
4) 4,5
5) (cartesian product)
a-1,1,1,1,1,1,2,2,2,2,2,2,3,4,5
b-1,1,1,1,1,1,2,2,2,2,2,2,3,null,null
6) a-1,1,1,1,1,1,2,2,2,2,2,2,3
b-1,1,1,1,1,1,2,2,2,2,2,2,3
7) a-1,1,1,1,1,1,2,2,2,2,2,2,3
b-1,1,1,1,1,1,2,2,2,2,2,2,3

Lag is used to check with previous value

Employees 1 and 5 have the same record; if we


select max(empid), that record alone is deleted.
The min function is used to find multiple duplicates
for one record: delete from cars where id not in (
select min(id) from cars group by
model,brand)
declare @id int;
set @id=7
with cte as( select empid,empname,managerid
from emp
where empid= @id
union all
select e.empid,e.empname,e.managerid from
emp e join cte c on
e.empid=c.managerid)
select * from cte

select Max(salary) from emp where salary <


(select Max(salary) from emp where salary < (
select Max(salary)from emp))

Reset it back with set rowcount 0, else it


will always return 3 rows, so revert it back
--> select your base table
--> use pivot
--> choose which column you want to aggregate

--> give what your columns need to be in the table

select id,
Max(case when Name='name' then value else
' ' end) as name,
Max(case when Name='Gender' then value else ' '
end) as Gender,
Max(case when Name='salary' then value else ' '
end) as salary
from empl
group by id
o/p id name gender salary
1 Adam male 50000

after giving order by month(date)

Dd d sales
jan 1 100
feb 2 100
mar 3 100
oct 10 100
0/p:lag(sales)
year sales quarter previousquarter
2021 100 Q1 NULL
2021 200 Q2 100
2021 300 Q3 200
2020 500 Q1 NULL
2020 300 Q2 100
2020 400 Q3 300

0/p:lead(sales)
year sales quarter previousquarter
2021 300 Q3 NULL
2021 200 Q2 300
2021 100 Q1 200
2020 400 Q3 NULL
2020 300 Q2 400
2020 500 Q1 300

o/p
id name lastname firstname
select country, city1, city2, city3 from
(select *, 'city' + cast(row_number() over (partition by country order
by country) as varchar(10)) as columnsequence) temp
pivot
( max(city)
for columnsequence in (city1, city2, city3)) dd
0/p:lag(sales,2)-gives 2 previous qurter
year sales quarter previousquarter
2021 100 Q1 NULL
2021 200 Q2 NULL
2021 300 Q3 100
2020 500 Q1 NULL
2020 300 Q2 NULL
2020 400 Q3 500
Database:collection of data in the form of tables ex:ODS
An Organized collection of related data which stores data in a tabular format
A database is any collection of data organized for storage, accessibility, and
retrieval.

What is ETL Testing?


ETL testing is done to ensure that the data that has been loaded from a source
to the destination after business transformation is accurate. It also involves the
verification of data at various middle stages that are being used between
source and destination. ETL stands for Extract-Transform-Load.
Difference b/w ETL testing and Database testing:
Data Extraction, Transform and Loading for BI Reporting
It is used for Analytical Reporting, information and forecasting.In ETL
testing,we need to focus on data
Multidimensional.
applied to(OLAP)
ETL testing involves the following operations −
Validation of data movement from the source to the target system.
Verification of data count in the source and the target system.
Verifying data extraction, transformation as per requirement and expectation.
Verifying if table relations − joins and keys − are preserved during the
transformation.
Common ETL testing tools include QuerySurge, Informatica, etc

Difference Between Database and Data Warehouse Testing:


Data warehouse testing is done with a large volume of data involving OLAP
(online analytical processing) databases.
In data warehouse testing, most of the data comes from different kinds of data
sources which are not consistent with one another.
In data warehouse testing we use read-only (Select) operations.
A denormalized DB is used in data warehouse testing.

3-layer architecture:A typical ETL tool-based data warehouse uses staging


area, data integration, and access layers to perform its functions. It’s normally
a 3-layer architecture.
Staging Layer − The staging layer or staging database is used to store the data
extracted from different source data systems.

Data Integration Layer − The integration layer transforms the data from the
staging layer and moves the data to a database, where the data is arranged
into hierarchical groups, often called dimensions, and into facts and aggregate
facts. The combination of facts and dimensions tables in a DW system is called
a schema.

Access Layer − The access layer is used by end-users to retrieve the data for
analytical reporting and information
Data Purging: When data needs to be deleted from the data warehouse, it can
be a very tedious task to delete data in bulk. The term data purging refers to
methods of permanently erasing and removing data from a data
warehouse. When you delete data, you are removing it on a temporary basis, but
when you purge data, you are permanently removing the data and freeing up
memory or storage space. In general, the data that is deleted is usually junk
data such as null values or extra spaces in the rows. Using this approach, users
can delete multiple files at once and maintain both efficiency and speed.
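
As a sketch of the difference, assuming a hypothetical staging table stg_sales: DELETE removes selected junk rows (and can be rolled back), while TRUNCATE permanently empties the table and frees the space.

delete from stg_sales where sales_amount is null;   -- remove junk rows only
truncate table stg_sales;                           -- purge everything and release the storage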

OLAP: the main objective is data analysis; it is used for analysing the business.
Data is denormalised.
Involves historical processing of information.
Useful in analysing the business; we need to know whether the business process
results in gain or loss.
It focuses on information out (the outcome of existing data).
Based on Star, Snowflake schema and Fact Constellation - dimensional.
Contains historical data.
Provides summarized and consolidated data (which means we are taking high-
level info for analysis purposes).
Examples of OLAP:DWH
Annual financial performance
Trends in marketing leads
Features of OLAP:
They manage historical data.
These systems do not make changes to the transactional data.
Their primary objective is data analysis and not data processing.
Data warehouses store data in multidimensional format.
Advantages of OLAP:
Businesses can use a single multipurpose platform for planning, budgeting,
forecasting, and analysing.
Information and calculations are very consistent in OLAP.
Adequate security measures are taken to protect confidential data.
Disadvantages of OLAP:
Traditional tools in this system need complicated modelling procedures.
Therefore, maintenance is dependent on IT professionals.
Collaboration between different departments might not always be possible.
Fact: Facts are the business events that you want to measure.
Quantitative information about the business process, also called
measurements/metrics. It holds numeric data and has the foreign keys of the
dimensions. A fact table can be seen as a table which captures the interaction between
the different dimensions, e.g. quantity, sales amount, profit, turnover, etc.

Dimensional modelling:
a process of arranging data into dimension and facts
dimensions and facts are building block of dimensional model

Data Model: The data models are used to represent the data and how it is
stored in the database and to set the relationship between data items
Data model tells how the logical structure of a database is modelled.
Pictorial representation of tables
represents the relationship between the tables
1)conceptual data model
2)Logical data model
3)physical data model
schemas:it is a logical description of the entire database
a database uses a relational model while a data warehouse uses star, snowflake
and fact constellation (galaxy) schemas

Common dimension is shared across multiple subject areas called as


conformed dimensions
A conformed dimension can refer to multiple tables in multiple data marts
within the same organization
Junk dimension is the way to solve this problem. In a junk dimension, we
combine these indicator fields into a single dimension. This way, we'll only
need to build a single dimension table, and the number of fields in the fact
table, as well as the size of the fact table, can be decreased. The content in the
junk dimension table is the combination of all possible values of the individual
indicator fields.
A slowly changing dimension (SCD) is a dimension that is able to handle data
attributes which change over time.

SCD1:the new information simply overwrites the original information. In other


words, no history is kept.
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem,
since there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace
back in history. For example, in this case, the company would not be able to
know that Christina lived in Illinois before.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for
the data warehouse to keep track of historical changes.
Data Integrity
Data integrity refers to the validity of data in db, meaning data is consistent
and accurate. In the data warehousing field, we frequently hear the term,
"Garbage In, Garbage Out." If there is no data integrity in the data warehouse,
any resulting report and analysis will not be useful..
Integrity constraints enforce the business rules on the data in the db.

In a data warehouse or a data mart, there are three areas of where data
integrity needs to be enforced:

Database level
We can enforce data integrity at the database level. Common ways of
enforcing data integrity include:
Referential integrity
The relationship between the primary key of one table and the foreign key of
another table must always be maintained. For example, a primary key cannot
be deleted if there is still a foreign key that refers to this primary key.
Primary key / Unique constraint
Primary keys and the UNIQUE constraint are used to make sure every row in a
table can be uniquely identified.
Not NULL vs. NULL-able
For columns identified as NOT NULL, they may not have a NULL value.
Valid Values
Only allowed values are permitted in the database. For example, if a column
can only have positive integers, a value of '-1' cannot be allowed.
ETL process
For each step of the ETL process, data integrity checks should be put in place
to ensure that source data is the same as the data in the destination. Most
common checks include record counts or record sums.
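A sketch of such a check using record sums, with illustrative names source_table, target_table and amount:

select
    (select sum(amount) from source_table) as source_sum,
    (select sum(amount) from target_table) as target_sum   -- the two sums should be equal after the load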
Access level
We need to ensure that data is not altered by any unauthorized means either
during the ETL process or in the data warehouse. To do this, there needs to be
safeguards against unauthorized access to data (including physical access to
the servers), as well as logging of all data access history. Data integrity can only be
ensured if there is no unauthorized access to the data.
ETL Process

What do you test for source file?


What do you test in DWH
issues you found in Data Warehouse (DWH) after loading

which file format did you get as source file


what do you test in pre etl

what is the flag in DWH

SQL Commands
DDL-defines the db schemas
DML-manipulates the data in db
DCL-deals with rights, permissions and other controls of db
DQL
TCL-transaction of the db
DBMS: DBMS is a software application that interacts with users, applications and
the database itself to capture and analyse data. Data stored in the
database can be retrieved, modified and deleted, and data can be in any form -
strings, images, numbers.
Types of DBMS: Hierarchical, object-oriented, network and relational.

DBMS is the management of data that should remain integrated


when any changes are done in it. It is because if the integrity of the data is
affected, whole data will get disturbed and corrupted. Therefore, to maintain
the integrity of the data, there are four properties described in the database
management system, which are known as the ACID properties.

Constraints: constraints are used to specify the limit on the data type of the
table. They can be specified while altering or creating the table.

Primary Key: A primary key is used as a unique identifier for each record in the
table. We cannot store NULL values. A table supports only one primary key.
Composite key:A composite key is the key having two or more column that
together can uniquely identify a row in a table.In score table,Primary key is the
composition of two columns(subjectid+studentid)
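
A sketch of the score table described above, where the primary key is the combination of subjectid and studentid:

create table score (
    subjectid int,
    studentid int,
    marks int,
    primary key (subjectid, studentid)   -- composite key
);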

Triggers:
A trigger is a special type of stored procedure that automatically runs when an
event occurs in the database server.
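A minimal T-SQL sketch, assuming the emp table and a hypothetical emp_audit table, of an AFTER INSERT trigger:

create trigger trg_emp_insert
on emp
after insert
as
begin
    -- 'inserted' is the pseudo-table holding the newly inserted rows
    insert into emp_audit (empid, action_date)
    select empid, getdate() from inserted;
end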
1NF: Scalable table design which can be extended. It has four rules:
1) Each column should contain atomic values (single values)
2) Each column must have the same datatype
3) Column names should be unique. The same name leads to confusion during
retrieval
4) The order in which data is saved doesn't matter

BCNF/3.5NF: Prerequisites: it should be in 3NF. If we have more than one
candidate key, BCNF comes into play; it divides the table so that each table has one
candidate key.

Types of Loading:

Initial Load — populating all the Data Warehouse tables


Incremental Load — applying ongoing changes as and when needed periodically (see the sketch after this list).
Full Refresh — erasing the contents of one or more tables and reloading them with
fresh data.
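
A sketch of an incremental load driven by a watermark, assuming a last_modified column on the source and a hypothetical etl_control table that stores the last loaded date:

insert into target_table
select *
from source_table
where last_modified > (select last_load_date from etl_control);   -- pick up only new/changed rows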
Load verification
Ensure that the key field data is neither missing nor null.
Test modeling views based on the target tables.
Check that combined values and calculated measures.
Data checks in dimension table as well as history table.
Check the BI reports on the loaded fact and dimension table.
Parsing is nothing but checking the syntaxes of SQL query.All the syntax of
Query is correct or not is checked by SQL Parser.

There are 2 functions of parser:

1.Syntax analysis

2.Semantic analysis

Mapping : Represents flow of data from source to destination

- Session : A set of instructions which describes the data movement from


source to destination

- Worklet : A set of tasks are represented as A Worklet

- Workflow : A set of instructions that specifies the way to execute the tasks in
Informatica

- Mapplet : Used for creation and configuration of A group of transformations


Data Warehouse: the process of aggregating/combining the data from
multiple sources into one common repository which can be used for
analytical processing.
A Data Warehouse (DWH) is another name for an Enterprise Data
Warehouse (EDW).
A Data Warehouse is a centralised storage location for information 
derived from one or more data sources. The process of extracting 
data from source systems and bringing it into 
the data warehouse is referred to as ETL.The three main types of data
warehouses are enterprise data warehouses (EDW), operational data
stores (ODS), and data marts.

Data Warehouse Testing


Data Warehouse Testing is a testing method in which the data inside a
data warehouse is tested for integrity, reliability, accuracy and
consistency in order to comply with the company’s data framework.
The main purpose of data warehouse testing is to ensure that the
integrated data inside the data warehouse is reliable enough for a
company to make decisions on.
Database testing:Data validation and Integration
It is used to integrate data from multiple applications, Severe impact
In database testing we need to focus on tables,relations,columns and
datas
ER method
applied to(OLTP)
Database Testing
Database testing stresses more on data accuracy, correctness of data
and valid values. It involves the following operations −
Verifying if primary and foreign keys are maintained.
Verifying if the columns in a table have valid data values.
Verifying data accuracy in columns. Example − Number of months
column shouldn’t have a value greater than 12.
Verifying missing data in columns. Check if there are null columns
which actually should have a valid value.
Common database testing tools include Selenium, QTP, etc

Database testing is done using a smaller scale of data normally with


OLTP (Online Transaction Processing) type of databases
In database testing, normally data is consistently injected from uniform
sources
We generally only perform CRUD (Create, read, update and delete)
operations during database testing
Normalized databases are used in DB testing

data mart.
An enterprise data warehouse can be divided into subsets, also called
data marts, which are focused on a particular business unit or
department. Data marts allow selected groups of users to easily access
specific data without having to search through an entire data
warehouse
E.g., Marketing, Sales, HR or finance of an organization
Data mining is considered a process of extracting data from large
data sets. It looks for hidden patterns within the data set and tries to
predict future behavior. Data mining is primarily used to discover and
indicate relationships among the data sets. Data mining aims to enable
business organizations to view business behaviors, trends and relationships
that allow the business to make data-driven decisions. It is also known
as Knowledge Discovery in Databases (KDD).

OLTP: the main objective is data processing; this can be used for running


the business.
Data will be normalised.
Involves day-to-day processing.
Useful in running the business; day-to-day transactions happen here.
It focuses on data in (customers are doing transactions, so "in").
Based on the entity-relationship model (defines the relationship between
multiple tables).
Contains current data.
Provides primitive and highly detailed data (e.g. branch 1 has customers' in-
depth details).
Examples of OLTP:
Credit card activity
online booking tickets,online chatting,ecommerce
Scanning at checkout kiosks in retail stores.
Features of OLTP:
It manages transactions in real time.
It focuses on processing transactions quickly.
Relational Databases store data in this system.
They manage the transactions governed by the ACID properties.
They modify the data in the databases.
Advantages of OLTP:
Day to day transactions in an organisation is easily regulated.
Increases the organisation’s customer base as it simplifies the
individual processes.
Disadvantages of OLTP:
Hardware failures in the system can severely impact online
transactions.
These systems can become complicated as multiple users can access
and modify data at the same time.
Dimension:Dimensions are what gives the measurements more
meaning
Its an object that describes the facts or business number
holds descriptive information for facts
Each dimension is a collection of related dimensional
attributes.dimension is made up of many dimensional attributes that
altogether describes the characteristics of the same dimensions
ex:product,location,time,store

conceptual model:(Highly abstract i.e identifies the highest-level


relationships between the different entities.)
1)important entities and relationship b/w them
2)no attributes is specified
3)no keys
Logical model
1)important entities and relationship b/w them
2)all attributes for each entity are specified
3)define keys (primary key-foreig key relationship)
4)Normalization occurs at this level.
physical model
1)displays all the tables and columns.
2)display keys
3)displays datatypes
Convert entities into tables.
Convert relationships into foreign keys.
Convert attributes into columns.
star schema:
a fact table is surrounded by multiple dimension tables(i.e
denormalized dimension tables)
Each dimension in a star schema is represented with only one
dimension table and the dimension table contains the set of
attributes
In the fact table every column Is a foreign key which Is having a
relationship with dimensions table primary key
If fact is centrally located and surrounded by set of denormalized
dimension tables,then its star schema

DWH strategies :
1)Top down
2)Bottom Up
1) Top down: the enterprise warehouse has to be created initially and later
independent subjects are derived from the enterprise warehouse.

2) Bottom up: independent subjects have to be created initially and later the


subjects are integrated to get the enterprise warehouse.

SCD2: a new record is added to the table to represent the new


information. Therefore, both the original and the new record will be
present. The new record gets its own primary key.
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the
number of rows for the table is very high to start with, storage and
performance can become a concern.
- This necessarily complicates the ETL process.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary
for the data warehouse to track historical changes.
Sql joins : A JOIN clause is used to join rows from two or more tables
based on a common column. Join is a commonly used SQL Server
clause for combining and retrieving data from two or more tables.
Here are the different types of the JOINs in SQL:
1.(INNER) JOIN: Returns records that have matching values in both
tables
2.LEFT (OUTER) JOIN: Returns all records from the left table, and the
matched records from the right table
3.RIGHT (OUTER) JOIN: Returns all records from the right table, and the
matched records from the left table
4.FULL (OUTER) JOIN: Returns all records when there is a match in
either left or right table
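
A small sketch of a LEFT JOIN on the emp and department tables used elsewhere in these notes; unmatched employees show NULL department values:

select e.empname, d.depname
from emp e
left join department d on e.depid = d.depid;   -- all employees, matching departments where they exist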

Index:PERFORMANCE TUNING Indexes are used to retrieve data from


the database more quickly, which means that you can speed up the
query process in SQL Server. An index is a collection of keys derived
from one or more columns in a table or view. These keys are stored in
a structure that allows SQL Server to quickly and efficiently find the
row or rows associated with the key values.
CREATE INDEX index_name
ON table_name (column1, column2, ...);
ETL testing also goes through different phases:
Business and requirement understanding
Test planning&estimation
Designing test cases and preparing test data
test execution with bug reporting and closure
summary report and result analysis
Test closure
Job failures
Data issues
Performance

source data
environment validation

Data definition Language


Data Manipulation Language
Data Control Language
Data Query Language
Transaction Control Language

Relational DB:relational database is a collection of tables to store data.


The tables are usually related to each other by primary and foreign key
constraints, hence the term Relational Database Management System
RDBMS is a system where data is organized in two-dimensional tables
using rows and columns.EX − Oracle Database, MySQL, Microsoft SQL
Server etc.
Network DBMS is a system where the data elements maintain one to
one relationship (1: 1) or many to many relationship (N: N).
It also has a hierarchical structure, but the data is organized like a
graph and it is allowed to have more than one parent for one child
record.

Transactions: It is the logical unit of work happening in the db. It can be


performed by multiple users or applications. Transactions are tools to
achieve the ACID properties.
ATOMICITY:The entire transaction has to be performed/executed else
shouldn't be executed.
CONSISTENCY:the integrity of the data should be maintained,so that
the database remains consistent before and after the transaction. The
data should always be correct.

NOT NULL: ensures that NULL values cannot be stored in a column
DEFAULT: sets a default value for a column when no value is specified
UNIQUE: ensures that all the values in a column are different
CHECK: ensures that values in a column satisfy a specific condition
INDEX: used to create indexes and retrieve data from the database quickly
Unique Key: a unique key is used as a unique identifier for a record when the primary key is not present in the table. It can store NULL values, but it supports only one NULL value in the table. A table can have more than one unique key; duplicates are not allowed.
Surrogate Key: a type of primary key which is generated automatically by the database when a new record is inserted into a table.
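A small sketch showing these constraints together on a hypothetical Students table (SQL Server style; table and column names are illustrative):

CREATE TABLE Students (
    student_id INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key generated by the database
    roll_no    INT NOT NULL UNIQUE,             -- no NULLs, no duplicates
    name       VARCHAR(50) NOT NULL,
    age        INT CHECK (age >= 5),            -- value must satisfy the condition
    city       VARCHAR(50) DEFAULT 'Chennai'    -- used when no value is specified
);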

Before Insert: activated before data is inserted into the table
After Insert: activated after data is inserted into the table
Before Update: activated before data in the table is updated
After Update: activated after data in the table is updated
Before Delete: activated before data is removed from the table
After Delete: activated after data is removed from the table
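For example, a minimal BEFORE INSERT trigger in MySQL syntax (the emp table and hire_date column are assumptions):

DELIMITER //
CREATE TRIGGER trg_before_insert_emp
BEFORE INSERT ON emp
FOR EACH ROW
BEGIN
    -- default the hire date when none is supplied
    IF NEW.hire_date IS NULL THEN
        SET NEW.hire_date = CURDATE();
    END IF;
END //
DELIMITER ;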
2NF: prerequisites: it should be in 1NF and it should not have any partial dependencies. (Example: the subject table and the student table together give the score table. In the score table, the primary key is the composition of two columns (subjectid + studentid); here the teacher's name is a partial dependency.) Our objective is to remove the partial dependency in 2NF, which can be done by moving the teacher info to the subject table or by creating a new teacher table with the teacher's details. Each table should have a single-column primary key.

4NF: prerequisites: it should be in 3NF and it should not have any multivalued dependencies.

SQL is the standard database language. Based on this standard SQL, database vendors like Microsoft, Oracle and many other organizations developed their own database query languages.
T-SQL is a proprietary procedural language for working with the Microsoft SQL Server database.
Similarly, PL/SQL is a proprietary procedural language for working with the Oracle database.
T-SQL and PL/SQL are extensions to standard SQL. This means they have more features and functions than standard SQL.

If a column has many NULL values, there is no need to create an index on it.

A PARTITION BY clause is used to partition the rows of a table into groups. It is useful when we have to perform a calculation on individual rows of a group using other rows of that group.
It is always used inside the OVER() clause.
The partitions formed by the PARTITION BY clause are also known as windows. This clause works with window functions only, like RANK(), LEAD(), LAG() etc.
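A short sketch on the hypothetical emp table (the depid and salary columns are assumptions):

SELECT name,
       depid,
       salary,
       RANK() OVER (PARTITION BY depid ORDER BY salary DESC) AS salary_rank
FROM emp;
-- rows are partitioned per department and ranked by salary within each partition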
A database is designed to record data, whereas a Data
warehouse is designed to analyze data.
A database is an application-oriented collection of data,
whereas Data Warehouse is a subject-oriented collection of
data.
Database uses Online Transactional Processing (OLTP),
whereas Data warehouse uses Online Analytical Processing
(OLAP).
Database tables and joins are complicated because they are
normalized, whereas Data Warehouse tables and joins are
easy because they are denormalized.
ER modeling techniques are used for designing Databases,
whereas data modeling techniques are used for designing
Data Warehouse.

DWH features/characteristics
subject oriented:
Used to track or analyze the data for a particular area. In business terms, it should be built based on the business's functional requirements, especially in regard to a specific area under discussion. Ex: how many transactions are happening per day? (a specific kind of data), customers, sales
Integrated
All the data from diverse sources must undergo the ETL process, which involves cleansing and removing redundancy
Time-Variant:
Historical data is kept in a data warehouse. For example,
one can retrieve data from 3 months, 6 months, 12 months, or
even older data from a data warehouse. This contrasts with a
transactions system, where often only the most recent data is
kept. For example, a transaction system may hold the most
recent address of a customer, where a data warehouse can
hold all addresses associated with a customer.
Non-volatile:
Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered. Once we load the data, the data is static; we won't change/modify it.
Importance of ETL testing?
Ensure data is transformed efficiently and quickly from one
system to another.
Data quality issues during ETL processes, such as duplicate
data or data loss, can also be identified and prevented by ETL
testing.
Assures that the ETL process itself is running smoothly and is
not hampered.
Ensures that all data implemented is in line with client
requirements and provides accurate output.
Ensures that bulk data is moved to the new destination
completely and securely.

Business intelligence: a software suite to transform raw data into actionable information, such as creating reports, dashboards, summaries, maps, graphs, charts etc.
ETL tools used: MS SQL/SSIS, Oracle Warehouse Builder, Informatica PowerCenter, QuerySurge (testing tool), OpenText Integration Center
BI tools: generate the reports and are used for analytical purposes: Tableau, Cognos, Power BI
diff b/w DWH and Datamart: a Data Warehouse is a large repository of data collected from different sources whereas a Data Mart is only a subtype of a data warehouse.
A Data Warehouse is focused on all departments in an organization whereas a Data Mart focuses on a specific group.
The Data Warehouse designing process is complicated whereas the Data Mart process is easy to design.
A Data Warehouse takes a long time for data handling whereas a Data Mart takes a short time for data handling.
A sale is an event that is captured when a customer visits a store, buys a product using a promotion and then pays for it, thus generating a sale.
A road safety manager wants to analyse accidents; then what are the facts and dimensions?
Fact: accidents
Dimensions: road condition, vehicle, driver, location, weather, time, passenger
Analytical questions would be like: show me the number of accidents by state and area (location), show me the number of accidents by vehicle type and brand (vehicle), show me the number of accidents by time.
Here Location is a dimension and its state, area, zipcode, country are its attributes.
Normalization: breaking down a large table into a number of smaller tables (OLTP). Normalization splits up the data into additional tables.
Database normalization is the process of organising data into related tables; it also removes redundancy and increases integrity, which improves query performance. To normalize a database, we divide it into tables and create relationships between the tables. It is the process of organising data to avoid duplication and redundancy.

Denormalisation: denormalization is a database optimization technique where we add redundant data in the database to get rid of complex join operations. This is done to speed up database access. Denormalization is done after normalization to improve the performance of the database. The data from one table is included in another table to reduce the number of joins in the query, which helps in speeding up performance. (OLAP)
Disadvantages of Denormalization
As data redundancy is there, update and insert operations are
more expensive and take more time. Since we are not
performing normalization, so this will result in redundant data.
Data Integrity is not maintained in denormalization. As there is
redundancy so data can be inconsistent.
Snowflake schema: an extension of the star schema.
Some dimension tables in the snowflake schema are normalised.
If the fact table is centrally located and surrounded by a set of normalized dimension tables, then it is a snowflake schema.
Galaxy schema:
multiple fact tables and multiple dimensions tables
If a common dimension is shared across multiple subject areas
through the fact table then the structure
is called as galaxy

SCD Type
Type 0 Ignore any changes and audit the changes.
Type 1 Overwrite the changes.
Type 2 History will be added as a new row.
Type 3 History will be added as a new column.
Type 4 History will be tracked in a separate history (mini-dimension) table.
Type 6 Combination of Type 1, Type 2 and Type 3.

SCD3: there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.
Advantages:
- This does not increase the size of the table, since new
information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute
is changed more than once. For example, if Christina later
moves to Texas on December 15, 2003, the California
information will be lost.
When to use Type 3:
A Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
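A minimal Type 3 sketch, assuming a hypothetical DimCustomer table (column names are illustrative):

ALTER TABLE DimCustomer
ADD original_state VARCHAR(50),
    current_state VARCHAR(50),
    state_effective_date DATE;

-- when Christina moves to California, only the current value changes; the original value is preserved
UPDATE DimCustomer
SET current_state = 'California',
    state_effective_date = '2003-01-15'
WHERE customer_name = 'Christina';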
Clustered index: it is used for easy retrieval of data from the database and it is faster. One table can have only one clustered index. When you apply clustered indexing to a table, it performs sorting in that table only; it sorts the rows based on the column. If you apply a primary key to any column, then it automatically becomes the clustered index.
Create Clustered Index TEMP ON EMP(rollno ASC)

Non-clustered index: it doesn't sort; it creates a separate object inside the table that refers to the original table.
The non-clustered index and table data are both stored in different places. It does not sort (order) the table data; it maintains a logical order of the data. Non-clustered indexing is the same as a book where the content is written in one place and the index is at a different place. MySQL allows a table to store one or more non-clustered indexes. Non-clustered indexing improves the performance of queries which use keys that are not assigned as the primary key.
Create NonClustered Index ABC ON EMP(name ASC, age ASC)
It starts with understanding the business requirements till the generation of a summary report.
Understanding the business requirement.
Validation of the business requirement.
Test estimation is used to provide the estimated time to run test cases and to complete the summary report.
Test planning involves finding the testing technique based on the inputs as per the business requirement.
Creating test scenarios and test cases.
Once the test cases are ready and approved, the next step is to perform a pre-execution check.
Execute all the test cases.
The last step is to generate a complete summary report and file a closure process.

The testing I have done is:
Structure validation of the file. ...
Check for duplicate records.
Select one row from the target file and, for that record alone, run the source query and then compare the source output and the target record in the flat file, manually comparing each field one at a time.
Check for data truncation.
There are three basic levels of testing performed on a data
warehouse −

Unit testing
Integration testing
System testing
Unit Testing
In unit testing, each component is separately tested.
Each module, i.e. procedure, program, SQL script, Unix shell, is tested.
This test is performed by the developer.
Integration Testing
In integration testing, the various modules of the application are brought together and then tested against a number of inputs.
It is performed to test whether the various components work well together after integration.
System Testing
System testing is the form of testing that validates and tests the whole data warehouse application. This type of testing is performed by the technical testing team. This test is conducted after the developer's team performs unit testing, and the main purpose of this testing is to check whether the entire system is working altogether or not.
Conflicting business rules used by various data sources.
The inability to schedule extracts on time, or within the
allotted time interval when updating the DWH.
Inability to capture all changes (ex., inserts, updates, deletes)
in source files.
The absence of an effective and centralized source metadata
repository.
Misinterpretation of slowly changing dimensions (SCDs) in ETL
code.
Errors in the transformation or substitution values for NULL
values in the ETL process.
The absence of automated or effective unit testing facility in
ETL tools.
The dearth of effective error reporting, validation, and
metadata updates in ETL code.
Inappropriate ETL process for data insert/update/delete
functions.
Loss of data during the ETL process (rejected records, dropped
fields).
An inability to restart the ETL process from checkpoints
without losing data.
Lack of automatic and effective data defect and correction
functions in the ETL code.
Inability to incorporate profile, cleansing, and ETL tools to
compare and reconcile data and associated metadata.
Misaligned primary and foreign key strategies for the same
type of entity (e.g., one table stores customer information
using the Social Security Number as the key, another uses the
CustomerID as the key, and still another uses a surrogate key).

Some validations are done during Extraction:
Reconcile records with the source data
Make sure that no spam/unwanted data is loaded
Data type check
Remove all types of duplicate/fragmented data
Check whether all the keys are in place or not

In data warehousing we have an Is Current Flag column in the SCD type 2 dimension table. This flag indicates whether the row is the current version or not. Some DW data modellers believe that on the SQL Server platform this column should be created as bit rather than int or Y/N, because it saves space.

CREATE, ALTER, DROP, TRUNCATE, RENAME, COMMENT
DELETE, UPDATE, INSERT, MERGE, CALL, LOCK TABLE
GRANT, REVOKE
SELECT
COMMIT, ROLLBACK, SET TRANSACTION, SAVEPOINT

Hierarchical Database is a system where the data elements have a one to many relationship (1:N). Here data is organized like a tree.
The hierarchy starts from the root node, connecting all the child nodes to the parent node. It is used in industry on mainframe platforms. EX − IMS (IBM), Windows registry (Microsoft).
Object-oriented DBMS is a system where information or data is in the form of objects, as used in OOP.

ISOLATION: multiple transactions can occur concurrently without leading to inconsistency of the database state. Transactions occur independently; changes occurring in one transaction will not be visible to other transactions until it is committed. This is the responsibility of the concurrency control subsystem.
DURABILITY: once the transaction is committed, the updates and modifications to the database are stored in and written to disk and they persist even after any system failure or crash. Permanent.

SQL: Structured Query Language. SQL is the core of relational databases and is used for managing and accessing the database.
MySQL: it is an open source RDBMS that works on many platforms. It provides multi-user access, supports many storage engines and is backed by Oracle.
Foreign Key: a foreign key maintains referential integrity by enforcing the link between data in two tables. The foreign key in the child table references the primary key of the parent table. Foreign key constraints prevent actions that would break the relationship between child and parent.
CANDIDATE KEY in SQL is a set of attributes that uniquely identify tuples in a table. A candidate key is a super key with no repeated attributes.
3NF: prerequisites: it should be in 2NF and it should not have any transitive dependencies. (Example: there is a column (total_marks) that depends on a non-prime column (exam), not on the prime columns (subid + scoreid).) Our objective is to remove the transitive dependency in 3NF, which can be done by creating a new exam table with the exam details.

In pessimistic locking a record or page is locked immediately when the lock is requested, while in an optimistic lock the record or page is only locked when the changes made to that record are updated.

A factless fact table is a fact table that does not have any
measures, i.e. any numeric fields that can be aggregated.
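A hedged sketch of a factless fact table that records student attendance events (table and column names are assumptions):

CREATE TABLE Fact_Attendance (
    date_key    INT NOT NULL,
    student_key INT NOT NULL,
    class_key   INT NOT NULL
    -- no numeric measures: each row simply records that an attendance event occurred
);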
Table: a collection of data in the form of rows and columns; a table refers to a collection of data organised in the form of rows and columns.
Field refers to the number of columns in a table.
Example: suppose after normalization we have two tables: first, a Student table and second, a Branch table. The student has the attributes Roll_no, Student_name, Age, and Branch_id. The Branch table is related to the Student table with Branch_id as the foreign key in the Student table. If we want the names of students along with the name of the branch, then we need to perform a join operation. The problem here is that if the table is large we need a lot of time to perform the join operation. So, we can add the Branch_name data from the Branch table to the Student table and this will help in reducing the time that would have been spent on the join operation and thus optimize the database.

Advantages of Denormalization
Query execution is fast since we have to join fewer
tables.
Analyze Business Requirements: To perform ETL
Testing effectively, it is crucial to understand and
capture the business requirements through the use of
data models, business flow diagrams, reports, etc.
Identifying and Validating Data Source: To proceed, it is
necessary to identify the source data and perform
preliminary checks such as schema checks, table
counts, and table validations. The purpose of this is to
make sure the ETL process matches the business model
specification.
Design Test Cases and Preparing Test Data: Step three
includes designing ETL mapping scenarios, developing
SQL scripts, and defining transformation rules. Lastly,
verifying the documents against business needs to
make sure they cater to those needs. As soon as all the
test cases have been checked and approved, the pre-
execution check is performed. All three steps of our
ETL processes - namely extracting, transforming, and
loading - are covered by test cases.
Test Execution with Bug Reporting and Closure: This
process continues until the exit criteria (business
requirements) have been met. In the previous step, if
any defects were found, they were sent to the
developer for fixing, after which retesting was
performed. Moreover, regression testing is performed
in order to prevent the introduction of new bugs
during the fix of an earlier bug.
Summary Report and Result Analysis: At this step, a
test report is prepared, which lists the test cases and
their status (passed or failed). As a result of this report,
stakeholders or decision-makers will be able to
properly maintain the delivery threshold by
understanding the bug and the result of the testing
process.
Test Closure: Once everything is completed, the
reports are closed.
3. Name some tools that are used in ETL.
Primary key is a candidate key
Surrogate key is a primary key
Delete (DML): we can either delete all the rows in one go or delete rows one by one. Here we can use the ROLLBACK command to restore the tuples because DELETE does not auto-commit.
Delete from table
Delete from table where condition

Drop (DDL): we can drop (delete) the whole structure in one go, i.e. it removes the named elements of the schema. Here we can't restore the table by using the ROLLBACK command because it auto-commits.
Drop table;

Truncate (DDL): it is used to delete all the rows of a table in one go. We can't delete a single row as the WHERE clause is not used here. By using this command the existence of all the rows of the table is lost. Here we can't restore the tuples of the table by using the ROLLBACK command.
TRUNCATE table;

Union: it can be used to combine the result sets of two different SELECT statements. It removes duplicate rows between the various SELECT statements. The data types should be the same in the result set of each SELECT statement.
Join: it can be used to retrieve matched records between two or more tables. It doesn't remove duplicate data. The result set can have different data types.
Inner Join: it can be used to retrieve only matched records between both tables. It doesn't return anything when a match is not found. Inner Join joins two tables on the basis of the column which is explicitly specified in the ON clause.
Outer Join: it is used to retrieve all matching records as well as the non-matching records of the tables. It returns NULL in the column values where there is no match.
Natural Join: an implicit join clause that combines the tables using common columns and data types in the two tables without giving an "ON" condition; it displays all the attributes in the table.
Cross Join: the Cartesian product of two tables. CROSS JOIN is used to combine each row of the first table with each row of the second table.
Union & Union All: both give all the records from the tables, but UNION ALL allows duplicates whereas UNION doesn't. Every SELECT statement must have the same number of columns.
INTERSECT: returns only the common rows returned by the two SELECT statements, whereas UNION returns all distinct rows from both SELECT statements.
MINUS: this operator returns only the unique records of the first table, not the common records of both tables.
EXCEPT: this operator returns only the unique records of the first table, not the common records of both tables. Both MINUS and EXCEPT serve the same purpose, but MINUS is used in Oracle and EXCEPT is used in PostgreSQL.
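A short sketch of the set operators, assuming two hypothetical tables old_customers and new_customers with the same columns:

SELECT name FROM old_customers
UNION
SELECT name FROM new_customers;        -- all distinct names from both tables

SELECT name FROM old_customers
UNION ALL
SELECT name FROM new_customers;        -- keeps duplicates

SELECT name FROM old_customers
INTERSECT
SELECT name FROM new_customers;        -- names present in both tables

SELECT name FROM old_customers
EXCEPT                                 -- MINUS in Oracle
SELECT name FROM new_customers;        -- names only in old_customers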
Drop and Truncate: both are DDL statements. DROP deletes all the data inside the table as well as the table itself, which means both data and metadata, whereas TRUNCATE removes only the data and the metadata remains the same.

ISNULL: select ISNULL(null, 'Nomanager') returns 'Nomanager' because the first argument is NULL; output is Nomanager.
select ISNULL('pragim', 'nomanager'): when we are not passing NULL, if the first argument is not NULL then it returns that value; output is pragim.
LEN: returns the length of a string.
select LEN('sql') = 3
SUBSTRING_INDEX: returns the substring of a string before a specified number of occurrences of a delimiter.
SELECT SUBSTRING_INDEX("venugopal,saranya", ",", 1) as lastname
SELECT SUBSTRING_INDEX("venugopal,saranya", ",", -1) as firstname

SUBSTRING/SUBSTR: this function extracts some characters from a string.
SUBSTRING(string, start, length)
SELECT SUBSTRING('CustomerName', 1, 5) = Custo
SELECT SUBSTRING('CustomerName', -5, 5) = rName
Note: positions count from left to right (1, 2, 3, 4); counting from right to left we can use negative positions (-4, -3, -2, -1), as supported in MySQL.
INSTR: the INSTR() function returns the position of the first occurrence of a string in another string.
Select INSTR('saran,venu', ',') = 6
CHARINDEX: this function searches for a substring in a string, and returns the position.
select charindex('n', 'saranya') = 5
PATINDEX: in PATINDEX we can use wildcards but in CHARINDEX we cannot. The function returns the position of a pattern in a string. If the pattern is not found, this function returns 0.

STUFF: using the STUFF function we delete a substring of a certain length from a string and replace it with a new string.
STUFF(input string, starting position, number of characters to replace, replacement expression)
REPLICATE: repeats the string the specified number of times.
select replicate('saran', 5) = saransaransaransaransaran
REPLACE: as the function name indicates, the REPLACE function replaces all occurrences of a specific string value with another string.
SELECT REPLACE("this is a", "i", "*") = th*s *s a
LEFT: returns a number of characters from a string (starting from the left).
Select LEFT('SQL', 3) = SQL
RIGHT: returns a number of characters from a string (starting from the right).
Select RIGHT('SQL Query', 5) = Query
LTRIM: trims unwanted characters or spaces from the left.
select LTRIM('_name', '_') from student
RTRIM: trims unwanted characters or spaces from the right.
select RTRIM('name.', '.') from student
select LTRIM(RTRIM('_name.', '.'), '_')
UCASE()/UPPER: returns all the data in uppercase.
SELECT UCASE(CustomerName)

STRCMP(): the STRCMP() function compares two strings.
SELECT STRCMP("SQL Tutorial", "SQL Tutorial");
If string1 = string2, this function returns 0
If string1 < string2, this function returns -1
If string1 > string2, this function returns 1
COALESCE(): this function returns the first non-null value in a list.
SELECT COALESCE(NULL, NULL, NULL, 'saran', NULL, 'E');
TOP 50 percent: select TOP 50 percent * from emp

Count(*), count(1), count(columnname): count(*) and count(1) are both the same and count(*) is fast; count(columnname) doesn't count the NULL values.
Display current date: select getdate();
ROW_NUMBER: returns an increasing unique number for each row even when there are duplicates.
RANK: returns an increasing number for each row, but when there are duplicates the same rank is applied to all the duplicate rows; the next row after the duplicate rows gets the rank it would have been assigned if there had been no duplicates. So the RANK function skips rankings when there are duplicates.
DENSE_RANK: returns an increasing number for each row, but when there are duplicates the same rank is applied to all the duplicate rows and DENSE_RANK does not skip rankings; the ranks stay in sequence.
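A small sketch on the hypothetical emp table (the salary column is an assumption) showing how the three functions differ:

SELECT name,
       salary,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,     -- 1,2,3,4 even with ties
       RANK()       OVER (ORDER BY salary DESC) AS rnk,         -- 1,1,1,4 when the top three salaries tie
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk    -- 1,1,1,2 when the top three salaries tie
FROM emp;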
WHERE & HAVING: the WHERE clause is used to filter records from a table based on a specified condition, while the HAVING clause is used to filter records from the groups based on the specified condition.
GROUP BY & HAVING: the GROUP BY clause summarizes the rows into groups and the HAVING clause applies one or more conditions to these groups.
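A short sketch using the hypothetical emp table (the salary column is an assumption):

SELECT depid, COUNT(*) AS emp_count
FROM emp
WHERE salary > 10000          -- filters individual rows before grouping
GROUP BY depid
HAVING COUNT(*) > 5;          -- filters the groups after aggregation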
SUBQUERY: a query statement placed inside another SELECT statement, also called an inner query. The subquery is executed first and passes its value to the outer query.
CORRELATED SUBQUERY: if the subquery depends on the outer query for its values, then it is called a correlated subquery. A correlated subquery gets executed once for every row that is selected by the outer query.
select name,
(select SUM(sales) from sales where productid = tblproducts.id) as total, *
from tblproducts

Non-correlated subquery: a non-correlated subquery executes independently of the outer query. The subquery executes first, and then passes its results to the outer query.
select * from emp where depid in (select id from dep)
VIEW: a view is a virtual table formed from one or more base tables or views.
A view is never stored; it is only displayed. Data is pulled from the base table.
A view is updated each time the virtual table (view) is used.
When the base table is dropped, the view will not be accessible.

MATERIALIZED VIEW: a materialized view is a physical copy of the base table.
A materialized view is stored on the disk.
A materialized view has to be updated manually or using triggers.
When the base table is dropped, the materialized view will still be accessible.
RECURSIVE CTE: a CTE (common table expression) that references itself; the recursive part executes repeatedly until it returns no more rows.
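A minimal recursive CTE sketch that generates the numbers 1 to 5 (SQL Server syntax):

WITH numbers AS (
    SELECT 1 AS n                  -- anchor member
    UNION ALL
    SELECT n + 1 FROM numbers      -- recursive member, repeats until the condition fails
    WHERE n < 5
)
SELECT n FROM numbers;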

DATENAME:
select datename(month, getdate()) = October
select datename(weekday, getdate()) = Monday
select datename(week, '2022-09-28') = 40
select datename(dayofyear, '2022-09-28') = 271
DATEPART (returns an integer, whereas DATENAME returns a string):
select datepart(year, '2022-09-28') = 2022
select datepart(quarter, '2022-09-28') = 3
select datepart(month, getdate()) = 10
select datepart(dayofyear, '2022-09-28') = 271
select datepart(weekday, getdate()) = 4
select datepart(week, '2022-09-28') = 40
Example outputs when the first three rows are duplicates: ROW_NUMBER gives 1,2,3,4; RANK gives 1,1,1,4; DENSE_RANK gives 1,1,1,2.
A view is technically a virtual logical copy of the table, whereas a materialized view is a snapshot of the original base tables.
