You are on page 1of 45

Data Analysis and Integration

SQL Views (cont.)


Views over views

• Higher schemas

Views
count_emp_dept sum_salary_dept

Views
curr_dept_emp curr_dept_manager curr_salaries curr_titles

Base tables
employees departments dept_emp dept_manager salaries titles

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 2


Views over views

• count_emp_dept(dept_no, count_emp)
– a view to show the number of employees in each department
• sum_salary_dept(dept_no, sum_salary)
– a view to show the sum of salaries by department

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 3


Views over views

• count_emp_dept(dept_no, count_emp)
– a view to show the number of employees in each department

create or replace view count_emp_dept(dept_no, count_emp) as


select dept_no, count(emp_no) as count_emp
from curr_dept_emp
group by dept_no;
+---------+-----------+
select * from count_emp_dept; | dept_no | count_emp |
+---------+-----------+
| d001 | 15 |
| d002 | 18 |
| d003 | 10 |
| d004 | 44 |
| d005 | 62 |
| d006 | 18 |
| d007 | 42 |
| d008 | 14 |
| d009 | 29 |
+---------+-----------+
9 rows in set (0.00 sec)

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 4


Views over views

• sum_salary_dept(dept_no, sum_salary)
– a view to show the Sum of salaries by department

create or replace view sum_salary_dept(dept_no, sum_salary) as


select b.dept_no, sum(a.salary) as sum_salary
from curr_salaries as a, curr_dept_emp as b
where a.emp_no = b.emp_no
group by b.dept_no;
+---------+------------+
select * from sum_salary_dept; | dept_no | sum_salary |
+---------+------------+
| d001 | 1249477 |
| d002 | 1492870 |
| d003 | 643182 |
| d004 | 2928341 |
| d005 | 4434974 |
| d006 | 1212103 |
| d007 | 3715959 |
| d008 | 1064935 |
| d009 | 1914195 |
+---------+------------+
9 rows in set (0.00 sec)

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 5


Query unfolding (curr_salaries)
select emp_no, salary
from curr_salaries
where salary > 80000
limit 10;

create or replace view curr_salaries(emp_no, salary) as


select emp_no, salary
from salaries
where from_date <= current_date and to_date >= current_date;

select emp_no, salary


from (select emp_no, salary
from salaries
where from_date <= current_date
and to_date >= current_date) as a
where salary > 80000
limit 10;

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 6


Query unfolding (count_emp_dept)
select dept_no
from count_emp_dept
where count_emp > 40;

create or replace view count_emp_dept(dept_no, count_emp) as


select dept_no, count(emp_no) as count_emp
from curr_dept_emp
group by dept_no;

select dept_no
from (select dept_no, count(emp_no) as count_emp
from curr_dept_emp
group by dept_no) as a
where count_emp > 40;

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 7


Query unfolding (count_emp_dept)

select dept_no
from (select dept_no, count(emp_no) as count_emp
from curr_dept_emp
group by dept_no) as a
where count_emp > 40;

create or replace view curr_dept_emp(emp_no, dept_no) as


select emp_no, dept_no
from dept_emp
where from_date <= current_date and to_date >= current_date;

select dept_no
from (select dept_no, count(emp_no) as count_emp
from (select emp_no, dept_no
from dept_emp
where from_date <= current_date and to_date >= current_date) as b
group by dept_no) as a
where count_emp > 40;
IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 8
Query unfolding (count_emp_dept)

• Note that
select dept_no
from (select dept_no, count(emp_no) as count_emp
from (select emp_no, dept_no
from dept_emp
where from_date <= current_date and to_date >= current_date) as b
group by dept_no) as a
where count_emp > 40;

could be re-written more simply as:

select dept_no
from dept_emp
where from_date <= current_date and to_date >= current_date
group by dept_no
having count(emp_no) > 40;

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 9


Query unfolding (sum_salary_dept)
select sum(sum_salary)
from sum_salary_dept;

create or replace view sum_salary_dept(dept_no, sum_salary) as


select b.dept_no, sum(a.salary) as sum_salary
from curr_salaries as a, curr_dept_emp as b
where a.emp_no = b.emp_no
group by b.dept_no;

select sum(sum_salary)
from (select b.dept_no, sum(a.salary) as sum_salary
from curr_salaries as a, curr_dept_emp as b
where a.emp_no = b.emp_no
group by b.dept_no) as c;

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 10


Query unfolding (sum_salary_dept)

select sum(sum_salary)
from (select b.dept_no, sum(a.salary) as sum_salary
from curr_salaries as a, curr_dept_emp as b
where a.emp_no = b.emp_no
group by b.dept_no) as c;

create or replace view curr_salaries(emp_no, salary) as


select emp_no, salary
from salaries
where from_date <= current_date and to_date >= current_date;

create or replace view curr_dept_emp(emp_no, dept_no) as


select emp_no, dept_no
from dept_emp
where from_date <= current_date and to_date >= current_date;

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 11


Query unfolding (sum_salary_dept)

select sum(sum_salary)
from (select b.dept_no, sum(a.salary) as sum_salary
from (select emp_no, salary
from salaries
where from_date <= current_date and to_date >= current_date) as a,
(select emp_no, dept_no
from dept_emp
where from_date <= current_date and to_date >= current_date) as b
where a.emp_no = b.emp_no
group by b.dept_no) as c;

+-----------------+
| sum(sum_salary) |
+-----------------+
| 18656036 |
+-----------------+
1 row in set (0.01 sec)

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 12


Data Analysis and Integration
Concepts of data integration
Introduction

• The need for data integration


– company A merges with company B
– A is a company with the employees database
– B is a company with the company database
– provide an integrated view of data from both companies
• e.g. employees, departments, salaries, job titles

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 14


Company A

• The employees database

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 15


Company B

• The company database

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 16


Data sources

• Comparing the two data sources (schema)

data source A (employees database) data source B (company database)

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 17


Schema matching (employees)

• Comparing the two data sources (schema matching)

×
×
×

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 18


Schema matching (titles)

• Comparing the two data sources (schema matching)

×
×

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 19


Schema matching (salaries)

• Comparing the two data sources (schema matching)

×
×

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 20


Schema matching (managers)

• Comparing the two data sources (schema matching)

×
×

×
×

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 21


Schema matching (branches)

• Comparing the two data sources (schema matching)

×
IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 22
Schema matching (departments)

• Comparing the two data sources (schema matching)

×
×

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 23


Mediated schema

• Once company A merges with company B


– we need to access data through a single/uniform access point

• Example of a common schema (mediated schema)


all_employees(emp_no, first_name, last_name, birth_date, report_to)
all_departments(dept_no, dept_name)
all_dept_emp(emp_no, dept_no)
all_salaries(emp_no, salary)
all_titles(emp_no, title)

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 24


Mediated schema

• Common schema (mediated schema)


all_employees(emp_no, first_name, last_name, birth_date, report_to)
all_departments(dept_no, dept_name)
all_dept_emp(emp_no, dept_no)
all_salaries(emp_no, salary)
all_titles(emp_no, title)

– absence of from_date and to_date attributes means that data retrieved


from the employees database will always refer to current date

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 25


Wrappers for data sources
• We have a set of views to retrieve data for the current date

employees(emp_no, birth_date, first_name, last_name, gender, hire_date)


departments(dept_no, dept_name)
curr_dept_emp(emp_no, dept_no)
curr_dept_manager(emp_no, dept_no)
curr_salaries(emp_no, salary)
curr_titles(emp_no, title)

– we will use these views as a wrapper for the employees database


• i.e. a layer through which we access the employees database

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 26


Schema mapping (all_employees)

• Mapping to common schema (schema mapping)


– transformations/queries that populate common schema
all_employees(emp_no, first_name, last_name, birth_date, report_to)

– from employees database


select a.emp_no, a.first_name, a.last_name, a.birth_date, c.emp_no
from employees.employees as a,
employees.curr_dept_emp as b,
employees.curr_dept_manager as c
where a.emp_no = b.emp_no and b.dept_no = c.dept_no;

– from company database


select employeeid, firstname, lastname, dob, reportto
from company.employees;

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 27


Schema mapping (all_employees)

• Mapping to common schema (schema mapping)


all_employees(emp_no, first_name, last_name, birth_date, report_to)

(select a.emp_no, a.first_name, a.last_name, a.birth_date, c.emp_no


from employees.employees as a,
employees.curr_dept_emp as b,
employees.curr_dept_manager as c
where a.emp_no = b.emp_no and b.dept_no = c.dept_no)
union
(select employeeid, firstname, lastname, dob, reportto
from company.employees);
+--------+------------+-----------+------------+--------+
| emp_no | first_name | last_name | birth_date | emp_no |
+--------+------------+-----------+------------+--------+
| 21637 | Yefim | Luby | 1964-04-28 | 110039 |
| 25949 | Owen | Matheson | 1959-08-08 | 110039 |
| ... | ... | ... | ... | ... |
| 1001 | Ravi | Gupta | 1969-12-03 | 1001 |
| 1002 | Ram | charan | 1985-02-20 | 1001 |
| ... | ... | ... | ... | ... |
+--------+------------+-----------+------------+--------+

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 28


Schema mapping (all_employees)

• Mapping to common schema (schema mapping)


all_employees(emp_no, first_name, last_name, birth_date, report_to)

create view all_employees(emp_no, first_name, last_name, birth_date, report_to) as


(select a.emp_no, a.first_name, a.last_name, a.birth_date, c.emp_no
from employees.employees as a,
employees.curr_dept_emp as b,
employees.curr_dept_manager as c
where a.emp_no = b.emp_no and b.dept_no = c.dept_no)
union
(select employeeid, firstname, lastname, dob, reportto
from company.employees);

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 29


Schema mapping (all_departments)

• Mapping to common schema (schema mapping)


all_departments(dept_no, dept_name)

– from employees database


select dept_no, dept_name
from employees.departments

– from company database


select departmentid, departmentname
from company.department

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 30


Schema mapping (all_departments)

• Mapping to common schema (schema mapping)


all_departments(dept_no, dept_name)

+---------+--------------------+
(select dept_no, dept_name | dept_no | dept_name |
from employees.departments) +---------+--------------------+
| d009 | Customer Service |
union | d005 | Development |
| d002 | Finance |
(select departmentid, departmentname | d003 | Human Resources |
from company.department); | d001 | Marketing |
| d004 | Production |
| d006 | Quality Management |
| d008 | Research |
| d007 | Sales |
| 101 | IT |
| 102 | HR |
| 103 | Finance |
| 104 | Sales |
| 105 | marketing |
+---------+--------------------+
14 rows in set (0.00 sec)

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 31


Schema mapping (all_departments)

• Mapping to common schema (schema mapping)


all_departments(dept_no, dept_name)

create view all_departments(dept_no, dept_name) as


(select dept_no, dept_name
from employees.departments)
union
(select departmentid, departmentname
from company.department);

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 32


Schema mapping (all_dept_emp)

• Mapping to common schema (schema mapping)


all_dept_emp(emp_no, dept_no)

– from employees database


select emp_no, dept_no
from employees.curr_dept_emp

– from company database


select employeeid, departmentid
from company.employees

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 33


Schema mapping (all_dept_emp)

• Mapping to common schema (schema mapping)


all_dept_emp(emp_no, dept_no)

(select emp_no, dept_no


from employees.curr_dept_emp)
union
(select employeeid, departmentid
from company.employees);

+--------+---------+
| emp_no | dept_no |
+--------+---------+
| 10721 | d009 |
| 11260 | d009 |
| ... | ... |
| 1008 | 101 |
| 1014 | 101 |
| ... | ... |
+--------+---------+

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 34


Schema mapping (all_dept_emp)

• Mapping to common schema (schema mapping)


all_dept_emp(emp_no, dept_no)

create view all_dept_emp(emp_no, dept_no) as


(select emp_no, dept_no
from employees.curr_dept_emp)
union
(select employeeid, departmentid
from company.employees);

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 35


Schema mapping (all_salaries)

• Mapping to common schema (schema mapping)


all_salaries(emp_no, salary)

– from employees database


select emp_no, salary
from employees.curr_salaries

– from company database


select employeeid, salary
from company.employees

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 36


Schema mapping (all_salaries)

• Mapping to common schema (schema mapping)


all_salaries(emp_no, salary)

(select emp_no, salary


from employees.curr_salaries)
union
(select employeeid, salary
from company.employees);

+--------+-----------+
| emp_no | salary |
+--------+-----------+
| 10721 | 44812.00 |
| 11260 | 52435.00 |
| ... | ... |
| 1001 | 850000.00 |
| 1002 | 650000.00 |
| ... | ... |
+--------+-----------+

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 37


Schema mapping (all_salaries)

• Mapping to common schema (schema mapping)


all_salaries(emp_no, salary)

create view all_salaries(emp_no, salary) as


(select emp_no, salary
from employees.curr_salaries)
union
(select employeeid, salary
from company.employees);

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 38


Schema mapping (all_titles)

• Mapping to common schema (schema mapping)


all_titles(emp_no, title)

– from employees database


select emp_no, title
from employees.curr_titles

– from company database


select employeeid, jobtitle
from company.employees

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 39


Schema mapping (all_titles)

• Mapping to common schema (schema mapping)


all_titles(emp_no, title)
+--------+--------------------+
| emp_no | title |
+--------+--------------------+
| 11371 | Senior Engineer |
(select emp_no, title | 41548 | Staff |
from employees.curr_titles) | 62635 | Engineer |
| 64387 | Senior Staff |
union | 110039 | Manager |
| 204631 | Assistant Engineer |
(select employeeid, jobtitle | 207968 | Technique Leader |
from company.employees); | ... | ... |
| 1001 | CEO |
| 1002 | Director |
| 1003 | President |
| 1004 | Vice President |
| 1005 | Sr. Manager |
| 1007 | Sales Manager |
| 1008 | Reporting Manager |
| 1009 | Team Leader |
| 1010 | Sales Rep |
| 1014 | Software Engineer |
| 1023 | Admin |
| 1024 | Network Engineer |
| ... | ... |
+--------+--------------------+

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 40


Schema mapping (all_titles)

• Mapping to common schema (schema mapping)


all_titles(emp_no, title)

create view all_titles(emp_no, title) as


(select emp_no, title
from employees.curr_titles)
union
(select employeeid, jobtitle
from company.employees);

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 41


Data matching – duplicates
• When comparing data instances from distinct data
sources

+---------+--------------------+ +--------------+----------------+
| dept_no | dept_name | | departmentid | departmentname |
+---------+--------------------+ +--------------+----------------+
| d009 | Customer Service | | 101 | IT |
| d005 | Development | | 102 | HR |
| d002 | Finance | | 103 | Finance |
| d003 | Human Resources | | 104 | Sales |
| d001 | Marketing | | 105 | marketing |
| d004 | Production | +--------------+----------------+
| d006 | Quality Management |
| d008 | Research |
| d007 | Sales |
+---------+--------------------+

– duplicate department names need to be merged

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 42


Data matching – conversion
• When comparing data instances from distinct data
sources
+--------+ +-----------+
| salary | | salary |
+--------+ +-----------+
| 44812 | | 850000.00 |
| 52435 | | 650000.00 |
| 81461 | | 625000.00 |
| 101179 | | 215000.00 |
| 76104 | | 115000.00 |
| 105453 | | 38000.00 |
| 71350 | | 65000.00 |
| 49249 | | 75000.00 |
| 91443 | US dollars Indian rupees | 45000.00 |
| 91836 | | 15000.00 |
| 80046 | | 15000.00 |
| 67677 | | 15000.00 |
| 70313 | | 15000.00 |
| 66406 | | 35000.00 |
| 56851 | | 35000.00 |
| 57499 | | 25000.00 |
| ... | | ... |
+--------+ +-----------+

– salaries need to be converted

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 43


Data matching – approximate duplicates

• When comparing data instances from distinct data


sources
+-------------------+
| jobtitle |
+-------------------+
+--------------------+ | CEO |
| title | | Director |
+--------------------+ | President |
| Senior Engineer | | Vice President |
| Staff | | Sr. Manager |
| Engineer | similar jobs/roles? | Accoutant |
| Senior Staff | | Sales Manager |
| Assistant Engineer | | Reporting Manager |
| Technique Leader | | Team Leader |
| Manager | | Sales Rep |
+--------------------+ | Software Engineer |
| Admin |
| Network Engineer |
+-------------------+

– similar job titles need to be found and merged/consolidated

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 44


Summary of concepts
• Multiple data sources with different schemas
– relational databases, but could be other data sources as well
• Schema matching between data sources
– how attributes in one data source correspond to attributes in another data
source
• Design of a common mediated schema
– subset of attributes from data source schemas
• Wrappers for data sources
– facilitate and simplify access to data sources
• Schema mapping from data sources to mediated schema
– queries to bring data from local schema to global mediated schema
• Data matching between data sources
– find exact/approximate duplicates from different sources that may need to
be merged, converted or consolidated

IST/MEIC Análise e Integração de Dados - 2022/2023 - 1º Sem 45

You might also like