Introduction to Data Warehouse

(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

Introduction to Data Warehousing and Data Mining

1) Data Warehouse Introduction
2) Engineering Conflicts
3) OLTP and DSS
4) Stovepipe vs. Integration
5) Data Warehouse Solution
6) Enterprise Information System
7) Security in a Data Warehouse
8) Moving Data to a Data Warehouse
9) Data Marts
10) Data Mining

Introduction
• Key topics for this course include:
– Data Warehouse
– Data Mart
– Data Mining
• Background and review of relational database systems
• Main focus on data warehouse and data mining

Data Warehouse Introduction
• A data warehouse is a single source for the key corporate information needed to enable business decisions
• A database application is a piece of software that provides a user interface for users to add, delete, query, and update data
• Typically, a database management system does the actual work of adding, deleting, querying, or updating data
[Diagram: Application, Database System, Data]

Engineering Conflicts, Query and Update
• Updating data while long-running queries run against the same data is often an engineering problem
• In some cases, the users who are doing updates must wait for queries to complete
• One way to avoid this is to make a read-only copy of the data
[Diagram: Application, Database System, Data for update, Data for query]

OLTP and DSS Defined
• An application that updates is called an on-line transaction processing (OLTP) application
• An application that issues queries to the read-only database is called a decision support system (DSS)
[Diagram: OLTP Application, DSS Application, Database System, OLTP Data, DSS Data]

Applications in a Typical Enterprise
• Most organizations have several disparate OLTP/DSS applications in several databases
[Diagram: Finance, Inventory, and Sales OLTP applications; Finance, Inventory, and Sales DSS applications; Database System; separate OLTP data and DSS data for Finance, Inventory, and Sales]

Stovepipe vs Integration
• When systems stand by themselves they are often referred to as “stovepipes”
• Systems that easily share data are called “well integrated systems”
[Diagram: Finance OLTP Application, Inventory OLTP Application, Finance DSS Application, Inventory DSS Application]

Problems with Stovepipe Architecture
• Problems:
– Users who wish to access data must query several different DSS to find it
– Data may have fundamental conflicts between DSS
  – a department code table in one DSS may differ in another DSS
  – a measurement may be stored in meters in one DSS and yards in another
• Solution:
– Use a data warehouse, where data is integrated from the several different stovepipe systems
– Data warehouse is really sharing-lite: you don’t have to co-ordinate as much when applications are built and you still reap the benefits of data sharing

Data Warehouse Solution
• A data warehouse is an attempt to integrate separate DSS so that users can query one place to find the answers to their questions
• A data warehouse has the key, corporate data in the organization
• A data warehouse tracks historical data

Data Warehouse -- A Success Story
• Largest data warehouse is Wal-Mart (9 TB)
• Uses for Wal-Mart data warehouse
– Identifies where a new store should be built based on customer demand
– Identifies how stores are performing across the nation
– Contains every “scan” from every purchase
• Benefits Wal-Mart gained from their data warehouse
– Provided competitive advantage over K-Mart
– Reduced excess inventory in individual stores
– Avoided wasted funds in building stores which would fail

Selling the Data Warehouse
• A data warehouse project will fail without corporate sponsorship
– Preferably, the project should be sponsored by the CEO
– The CEO must be sold on the value to the business to improve competitive advantage by deploying a data warehouse
• If an active, corporate sponsor does not exist, data sources will be very difficult to identify
• Only add data to the warehouse that will answer key, corporate questions asked by the corporate sponsor. Otherwise, you will have a data dump

Building a Useful Data Warehouse
• You really need:
– strong executive sponsorship
– good knowledge of the data
– sound software engineering
– stability from source systems
– users who want a success
• A 75 percent failure rate is often cited
• It is WORTH the effort!!!

Enterprise Information System
• An EIS (Enterprise Information System) allows users to query data in a data warehouse
• Users can access key, corporate data in the data warehouse
[Diagram: Enterprise Information System, Data Warehouse]

Users of an Enterprise Information System
• Frequently, multiple EIS are needed to satisfy different types of users
– Some users only want a system that has pre-defined reports so they only need to “click one button” to see data they need. These users want the system to be no harder to use than a “coffee pot”
– Other users want to delve into the data and build their own queries

Users of an Enterprise Information System
• Executives want high-level, summary data and a simple tool
– Must be VERY easy to use, users want to click a few buttons and get data they want
– Results must be graphs
– Users should be able to drill-down into key areas
• Analysts want a flexible, more detailed tool
– Often very knowledgeable about the data
– Willing to do more work to learn about the data
– Sometimes even learn SQL to issue their own ad-hoc queries
• General users want a tool that provides detailed data, but is very easy to use
– Want access to the data warehouse to do routine tasks such as “Find me Hank’s phone number”, etc.
– Simple application, but not so focused on large reports

Data Warehouse / EIS
[Diagram: Finance, Inventory, and Sales OLTP applications; Finance, Inventory, and Sales OLTP data; Data Warehouse with Finance, Inventory, and Sales subject areas; Enterprise Information System]

Need for Data Warehouses
• Data warehouses provide a single place to store key corporate data
– The idea is that users can go to one place to find this key data using an enterprise information system (EIS)
• Data warehouse is also a place to store and access historical data
– Users measure performance goals for their company over a period of time
– Company statistics are available
– Data not stored in the same place is difficult to locate and compare, and is easily lost
– Single query can be used to access key data

Security in Data Warehouse
• Building a data warehouse does increase security risk because key, corporate information is all in one place
• To mitigate that risk, database system components can be used to protect the data warehouse. These include
– Views
– Access control
– Security Administration
– Encryption
– Audit

Moving Data into the Data Warehouse
• Moving data from source OLTP systems to the data warehouse is the hard part of data warehousing
• Updates to the data warehouse are performed periodically
– weekly
– nightly
– monthly
• Occasionally, real-time data is needed in a data warehouse, but this is not very common

Using Middleware to Move Data
• Data can be moved to the warehouse via data migration software
• This is often called “middleware” because it sits between the source OLTP and the data warehouse
[Diagram: Source OLTP System, Data Warehouse Migration Software (“Middleware”), Data Warehouse]

Need for a Data Mart
• A data mart is a subset of the data warehouse that may make it simpler for users to access key corporate data
– Sometimes, users only need a piece of data from the data warehouse
• The data mart is typically fed from the data warehouse
[Diagram: Data Warehouse with Inventory, Finance, and Sales subject areas; New York Data Mart; California Data Mart]

Data Mart in Action
[Diagram: Finance, Inventory, and Sales OLTP applications; Finance, Inventory, and Sales OLTP data; Data Warehouse with Finance, Inventory, and Sales subject areas; California Data Mart; New York Data Mart; Enterprise Information System]

Data Mining Introduction
• Data Mining is done by running software that examines a database and looks for patterns in the data
• A data warehouse by itself will respond to queries from users
– It will not tell users about patterns in data that users may not have thought about
– To find patterns in data, data mining is used to try and mine key information from a data warehouse

Advantages of Data Mining
• Data mining allows companies to collect information that makes them more productive and helps them beat their competition
• Data mining helps identify
– why customers buy certain products
– ideas for very direct marketing
– ideas for shelf placement
– training of employees vs. employee retention
– employee benefits vs. employee retention

Implementing Data Mining
• Apply data mining tools to run data mining algorithms against data
• There are two approaches:
– Copy data from the Data Warehouse and mine it
– Mine the data in the Data Warehouse
• Popular tools use a variety of different data mining algorithms:
– association rules
– genetic algorithms
– decision trees
– neural networks

Data Mining using Separate Data
• You can move data from the data warehouse to data mining tools
– Advantages
  – Data mining tools may organize data so they can run faster
– Disadvantages
  – Could be very expensive to move large amounts of data
[Diagram: Data Warehouse, Data Mining Tool, Copy of data made by the Data Mining Tool]

Data Mining Against the Data Warehouse
• Data mining tools can access data directly in the Data Warehouse
– Advantages
  – No copy of data is needed for data mining
– Disadvantages
  – Data may not be organized in a way that is efficient for the tool
[Diagram: Data Warehouse, Data Mining Tool]

Data Mining: Summary
• Data mining attempts to find patterns in data that we did not know about
• Often data mining is just a new buzzword for statistics
• Data mining differs from statistics in that large volumes of data are used
• Many different data mining algorithms exist and we will discuss them in the course
• Examples
– identify users who are most likely to commit credit card fraud
– identify which attributes of a person most often result in them buying product x

SQL Review

(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410.553.6760, www.carriget.com)

Introduction to SQL
1) Introduction to SQL
2) Data Definition Language (DDL)
3) Data Manipulation Language (DML)
4) SELECT Construct
5) SELECT Operators
6) Wildcard Searches
7) Aggregate Operators
8) Calculated Attributes
9) Sorting Results

Introduction to Structured Query Language
• Structured Query Language (SQL) is the language used to communicate with a relational database
– Industry standard
– Based on set theory
• SQL is composed of two types of constructs:
– Data Definition Language (DDL)
  – Defines the structure of the database
– Data Manipulation Language (DML)
  – Provides the constructs to input and retrieve data

SQL Overview -- DDL
• Data Definition Language (DDL) is used to describe the structure of the database
– Create tables, indexes, etc.
– Typical operations are:
  – CREATE TABLE defines what columns are in the table and establishes the table
  – CREATE INDEX defines an index for the table. Indexes are used to improve database performance

SQL Overview -- DML
• Data Manipulation Language (DML) is used for storing, updating, and retrieving data.
• Typical operations include:
– SELECT is used to retrieve data.
  – Ex: SELECT * FROM PRODUCTS
– INSERT is used to add new rows to the database.
  – INSERT INTO PRODUCTS VALUES ('food', 'hardware', 'housewares')
– UPDATE is used to change rows that already exist in the database.
  – UPDATE PRODUCTS SET PRICE = PRICE + 4
– DELETE is used to eliminate rows of data from the database.
  – DELETE FROM PRODUCTS
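The DDL and DML operations listed above can be tried end to end with a small script. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in database system; the three-column layout of PRODUCTS, the index name, and the sample rows are assumptions made for illustration and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# DDL: define the table structure and an index (column layout is assumed).
conn.execute("CREATE TABLE PRODUCTS (ProductName TEXT, ProductType TEXT, Price REAL)")
conn.execute("CREATE INDEX idx_products_type ON PRODUCTS (ProductType)")

# DML: store, update, retrieve, and delete rows.
conn.executemany(
    "INSERT INTO PRODUCTS VALUES (?, ?, ?)",
    [("Cereal", "Food", 3.50), ("Hammer", "Hardware", 9.00)],
)
conn.execute("UPDATE PRODUCTS SET Price = Price + 4")
rows_after_update = conn.execute(
    "SELECT ProductName, Price FROM PRODUCTS ORDER BY ProductName"
).fetchall()

conn.execute("DELETE FROM PRODUCTS")  # no WHERE clause: removes every row
remaining = conn.execute("SELECT COUNT(*) FROM PRODUCTS").fetchone()[0]
print(rows_after_update, remaining)
```

Note that UPDATE and DELETE without a WHERE clause touch every row in the table, which is why the final count is zero.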

SELECT Overview
• SELECT is used to retrieve records from the database.
• Single table SELECT constructs:
– WHERE
– IN
– BETWEEN
– LIKE
– Aggregate Operators
– DISTINCT
– ORDER BY

SELECT Examples
• Query Purpose: Retrieve names and prices of all products
    SELECT ProductName, Price FROM TinyProducts
• Query Purpose: Retrieve all information for all products from the TinyProducts table
    SELECT * FROM TinyProducts

SELECT with WHERE
• The WHERE clause is used to filter which information is returned from a SELECT
• Query Purpose: Retrieve all information only for product type of “food”
    SELECT * FROM TinyProducts WHERE ProductType = 'Food'

Use of Boolean Operators
• Conditions can be separated by Boolean operators:
– AND, OR, NOT
• Query Purpose: List all information about food products that are either cereal or fruit
    SELECT * FROM TinyProducts WHERE (ProductName = 'Cereal') OR (ProductName = 'Fruit')
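The WHERE clause and the Boolean operators can be exercised against a tiny table. This is a minimal sketch, assuming Python's built-in sqlite3 module; the rows in TinyProducts and the helper function `names` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TinyProducts (ProductName TEXT, ProductType TEXT, Price REAL)")
conn.executemany(
    "INSERT INTO TinyProducts VALUES (?, ?, ?)",
    [("Cereal", "Food", 3.50), ("Fruit", "Food", 1.50), ("Hammer", "Hardware", 9.00)],
)

def names(where_clause):
    """Return product names matching a WHERE clause, sorted for stable output."""
    sql = f"SELECT ProductName FROM TinyProducts WHERE {where_clause} ORDER BY ProductName"
    return [row[0] for row in conn.execute(sql)]

food = names("ProductType = 'Food'")
cheap_fruit = names("Price < 2 AND ProductName = 'Fruit'")
cereal_or_fruit = names("(ProductName = 'Cereal') OR (ProductName = 'Fruit')")
not_food = names("NOT ProductType = 'Food'")
print(food, cheap_fruit, cereal_or_fruit, not_food)
```

String comparison with `=` is case-sensitive in SQLite, so `'Food'` and `'food'` would not match each other; other database systems may behave differently.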

Boolean Operator Example
• Query Purpose: List the type and name of all fruit products whose price is less than $2.00
    SELECT ProductType, ProductName FROM TinyProducts WHERE Price < 2 AND ProductName = 'Fruit'

IN Operator
• The IN operator allows a search for records that match one value in a set of unordered values
• Example questions to use IN:
– 'Find all products whose type is Food, Hardware, or Housewares'
– 'Find all food whose type is Meat, Fish, Vegetables, or Fruit'

IN Example
• Query Purpose: List the name of Housewares that are Cookware, Linens, or Dishes
    SELECT ProductName, ProductType FROM TinyProducts WHERE ProductName IN ('Cookware', 'Linens', 'Dishes')
  instead of:
    SELECT ProductName, ProductType FROM TinyProducts WHERE (ProductName = 'Cookware') OR (ProductName = 'Linens') OR (ProductName = 'Dishes')

BETWEEN Operator
• The BETWEEN operator allows a search for a range of values
• Example Queries:
– 'Find all fruit between Bananas and Grapes'
– 'Find all cereals whose price is between $1.50 and $4.00 a box'
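The IN form and the chain of OR comparisons from the IN example return the same rows. This is a minimal sketch, assuming Python's built-in sqlite3 module; the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TinyProducts (ProductName TEXT, ProductType TEXT)")
conn.executemany(
    "INSERT INTO TinyProducts VALUES (?, ?)",
    [("Cookware", "Housewares"), ("Linens", "Housewares"),
     ("Dishes", "Housewares"), ("Hammer", "Hardware")],
)

# Set membership with IN.
with_in = conn.execute(
    "SELECT ProductName FROM TinyProducts "
    "WHERE ProductName IN ('Cookware', 'Linens', 'Dishes') ORDER BY ProductName"
).fetchall()

# The same filter written as explicit OR comparisons.
with_or = conn.execute(
    "SELECT ProductName FROM TinyProducts "
    "WHERE ProductName = 'Cookware' OR ProductName = 'Linens' OR ProductName = 'Dishes' "
    "ORDER BY ProductName"
).fetchall()
print(with_in == with_or, with_in)
```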

BETWEEN Example
• Query Purpose: Find all products whose price is between $2.00 and $8.00
    SELECT ProductName, Price FROM TinyProducts WHERE Price BETWEEN 2.00 AND 8.00
  instead of:
    SELECT ProductName, Price FROM TinyProducts WHERE (Price >= 2.00) AND (Price <= 8.00)

Wildcard Searches of Strings
• The LIKE operator is used to search parts of a string
• The following wildcard characters are used:
    %  to match any zero or more characters
    _  to match exactly one character
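BETWEEN includes both endpoints and is equivalent to a pair of comparisons joined with AND; joining the same comparisons with OR would match every non-null price. This is a minimal sketch, assuming Python's built-in sqlite3 module; the product names and prices are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TinyProducts (ProductName TEXT, Price REAL)")
conn.executemany(
    "INSERT INTO TinyProducts VALUES (?, ?)",
    [("A", 1.5), ("B", 2.0), ("C", 5.0), ("D", 8.0), ("E", 9.0)],
)

def prices(where_clause):
    """Return the prices matching a WHERE clause in ascending order."""
    sql = f"SELECT Price FROM TinyProducts WHERE {where_clause} ORDER BY Price"
    return [row[0] for row in conn.execute(sql)]

between_form = prices("Price BETWEEN 2.00 AND 8.00")
and_form = prices("(Price >= 2.00) AND (Price <= 8.00)")
or_form = prices("(Price >= 2.00) OR (Price <= 8.00)")  # too broad: matches all rows
print(between_form, and_form, or_form)
```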

Wildcard Search Examples
• Query Purpose: List all products whose name starts with 'C'
    SELECT * FROM TinyProducts WHERE ProductName LIKE 'C%'
• Query Purpose: List all products that have a three-character SKU number ending in '23' when you don't know the first character
    SELECT * FROM TinyProducts WHERE SKUNumber LIKE '_23'

Aggregate Operators
• MIN, MAX, and AVG (average) are used when computing statistics on a range of data
• Query Examples:
– 'What is the highest batting average on the team?'
– 'What is the average number of hits for all the little league teams in the National League?'
– 'What are the names of the players that had the lowest average on the little league team?'
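The two LIKE wildcards behave differently: `%` matches any run of characters, while `_` matches exactly one. This is a minimal sketch, assuming Python's built-in sqlite3 module; the product names and SKU values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TinyProducts (ProductName TEXT, SKUNumber TEXT)")
conn.executemany(
    "INSERT INTO TinyProducts VALUES (?, ?)",
    [("Cereal", "A23"), ("Cookware", "B23"), ("Fruit", "AB23"), ("Dishes", "C10")],
)

# % matches zero or more characters: every name beginning with C.
starts_with_c = [r[0] for r in conn.execute(
    "SELECT ProductName FROM TinyProducts WHERE ProductName LIKE 'C%' ORDER BY ProductName")]

# _ matches exactly one character: 'AB23' is excluded because it has two leading characters.
one_char_then_23 = [r[0] for r in conn.execute(
    "SELECT SKUNumber FROM TinyProducts WHERE SKUNumber LIKE '_23' ORDER BY SKUNumber")]
print(starts_with_c, one_char_then_23)
```

In SQLite, LIKE ignores case for ASCII letters by default; other database systems may treat case differently.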

Aggregate Operators Example
• Query Purpose: Find the minimum, maximum, and average batting average of all players in the National League of Little League
    SELECT MIN(Average), MAX(Average), AVG(Average) FROM PLAYERS WHERE League = 'National'

SUM and COUNT Operators
• Use the SUM operator to total the results of a query (for example, 1+2+3+4)
• COUNT will count the total number of occurrences of an item in a search

SUM And COUNT Examples
• Query Purpose: Find the total number of home runs hit by all players in the American League
    SELECT SUM(HomeRuns) FROM PLAYERS WHERE League = 'American'
• Query Purpose: Count the players that have hit 3 home runs in the National League
    SELECT COUNT(*) FROM PLAYERS WHERE HomeRuns = 3 AND League = 'National'

Calculated Attributes
• A new attribute can be obtained by using arithmetic operators (+, -, *, /) on other numeric attributes
• All operators follow standard precedence:
– Multiplication and division are computed first, left to right
– Addition and subtraction are computed last, left to right
– Use parentheses to override the standard precedence
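The aggregate operators MIN, MAX, AVG, SUM, and COUNT each collapse the selected rows into a single value. This is a minimal sketch, assuming Python's built-in sqlite3 module; the PLAYERS rows are invented for illustration, with averages chosen so the arithmetic is exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PLAYERS (Name TEXT, League TEXT, Average REAL, HomeRuns INTEGER)")
conn.executemany(
    "INSERT INTO PLAYERS VALUES (?, ?, ?, ?)",
    [("Ann", "National", 0.500, 3), ("Bob", "National", 0.250, 1),
     ("Cy", "American", 0.375, 5), ("Di", "American", 0.125, 2)],
)

# One result row holding the three statistics for the National League.
national_stats = conn.execute(
    "SELECT MIN(Average), MAX(Average), AVG(Average) FROM PLAYERS WHERE League = 'National'"
).fetchone()

# SUM totals a column; COUNT(*) counts the rows that pass the filter.
american_home_runs = conn.execute(
    "SELECT SUM(HomeRuns) FROM PLAYERS WHERE League = 'American'").fetchone()[0]
three_hr_players = conn.execute(
    "SELECT COUNT(*) FROM PLAYERS WHERE HomeRuns = 3 AND League = 'National'").fetchone()[0]
print(national_stats, american_home_runs, three_hr_players)
```

COUNT(*) returns how many players matched, not their names; listing the names would need `SELECT Name` with the same WHERE clause.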

Calculated Attributes Example
• Query Purpose: List all players with their hits, at bats, and their batting average
    SELECT Name, Hits, AtBats, (Hits / AtBats) FROM PLAYERS

DISTINCT Operator
• DISTINCT is used to exclude duplicate occurrences in the result of a query
• Query Purpose: List all distinct batting averages
    SELECT DISTINCT(Average) FROM PLAYERS
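One practical caveat for a calculated attribute such as Hits / AtBats: when both columns are integers, some database systems perform integer division and truncate the result. This is a minimal sketch, assuming Python's built-in sqlite3 module (SQLite does truncate); the PLAYERS rows are invented for illustration, and the CAST is one possible workaround rather than the slides' method.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PLAYERS (Name TEXT, Hits INTEGER, AtBats INTEGER)")
conn.executemany("INSERT INTO PLAYERS VALUES (?, ?, ?)", [("Ann", 1, 4), ("Bob", 2, 4)])

# Integer / integer truncates toward zero in SQLite.
truncated = conn.execute(
    "SELECT Name, Hits / AtBats FROM PLAYERS ORDER BY Name").fetchall()

# Casting one operand to REAL forces floating-point division.
exact = conn.execute(
    "SELECT Name, CAST(Hits AS REAL) / AtBats FROM PLAYERS ORDER BY Name").fetchall()
print(truncated, exact)
```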

Sorting Query Results
• The ORDER BY clause is used at the end of the SELECT statement to sort the results of a query
• Use DESC on the end of the ORDER BY clause to sort the data in descending order. Otherwise, the result will be in ascending order

Sorting Example
• Query Purpose: List all players in ascending order of their batting average
    SELECT Name, Average FROM PLAYERS ORDER BY Average
• For descending order add the keyword DESC
    SELECT Name, Average FROM PLAYERS ORDER BY Name DESC

Sorting Calculated Attributes
• To refer to a computed attribute in the ORDER BY, use its position in the list of columns following SELECT
• Query Purpose: List all players in descending order of their batting average (here we assume batting average is computed at the time of the query)
    SELECT Name, Hits, AtBats, Hits / AtBats FROM PLAYERS ORDER BY 4 DESC

More SQL
1) GROUP BY Construct
2) HAVING Filter
3) Multiple Tables
4) Joins
5) Equijoins
6) Cartesian Product
7) Nulls
8) OUTER JOIN
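Positions in ORDER BY count the columns of the SELECT list from 1, so in the list Name, Hits, AtBats, Hits / AtBats the computed batting average is position 4 and AtBats is position 3. This is a minimal sketch, assuming Python's built-in sqlite3 module; the PLAYERS rows are invented, and the CAST is added only so SQLite does not truncate the division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PLAYERS (Name TEXT, Hits INTEGER, AtBats INTEGER)")
conn.executemany(
    "INSERT INTO PLAYERS VALUES (?, ?, ?)",
    [("Ann", 1, 8), ("Bob", 2, 4), ("Cy", 3, 5)],
)

select_list = "SELECT Name, Hits, AtBats, CAST(Hits AS REAL) / AtBats FROM PLAYERS"

# Position 4 is the computed batting average.
names_by_average = [r[0] for r in conn.execute(select_list + " ORDER BY 4 DESC")]

# Position 3 is the AtBats column, which gives a different ordering.
names_by_at_bats = [r[0] for r in conn.execute(select_list + " ORDER BY 3 DESC")]
print(names_by_average, names_by_at_bats)
```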

GROUP BY Clause
• GROUP BY will partition a table into multiple groups of related rows.
• As an example, consider the EMPLOYEE table where Department partitions the EMPLOYEE set into subsets: Engineering, Marketing, Finance, Customer

GROUP BY Example
• Query Purpose: For each department, list the average salary using the EMPLOYEE table
    SELECT Department, AVG(Salary) FROM EMPLOYEE GROUP BY Department

GROUP BY With WHERE
• To filter data further, we can use the WHERE clause with the GROUP BY clause
• Query Purpose: For each department, list the highest salary of their administrative assistants.
    SELECT Department, MAX(Salary) FROM EMPLOYEE WHERE Title = 'administrative assistant' GROUP BY Department

HAVING Construct
• HAVING is used to restrict the output of aggregate functions, such as SUM, MIN, MAX and AVG, to only those groups of rows that meet some condition.
• Query Purpose: List the average salary for all departments that have more than three employees.
    SELECT Department, AVG(Salary) FROM EMPLOYEE GROUP BY Department HAVING COUNT(*) > 3
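WHERE filters rows before they are grouped, while HAVING filters the groups afterwards. This is a minimal sketch, assuming Python's built-in sqlite3 module; the EMPLOYEE rows, including the Title column values, are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Name TEXT, Department TEXT, Title TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
    [("Ann", "Engineering", "engineer", 400),
     ("Bob", "Engineering", "engineer", 300),
     ("Cy", "Engineering", "engineer", 200),
     ("Di", "Engineering", "administrative assistant", 100),
     ("Ed", "Marketing", "administrative assistant", 150),
     ("Flo", "Marketing", "analyst", 250)],
)

# One output row per department.
avg_by_dept = conn.execute(
    "SELECT Department, AVG(Salary) FROM EMPLOYEE "
    "GROUP BY Department ORDER BY Department").fetchall()

# WHERE removes rows first, then the remaining rows are grouped.
max_admin = conn.execute(
    "SELECT Department, MAX(Salary) FROM EMPLOYEE "
    "WHERE Title = 'administrative assistant' "
    "GROUP BY Department ORDER BY Department").fetchall()

# HAVING keeps only the groups with more than three employees.
big_depts = conn.execute(
    "SELECT Department, AVG(Salary) FROM EMPLOYEE "
    "GROUP BY Department HAVING COUNT(*) > 3").fetchall()
print(avg_by_dept, max_admin, big_depts)
```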

Multi-Table SQL
• It is often necessary to combine data from multiple tables.
• A join allows us to combine data from different tables.

EMPLOYEE
EmpID  Name   Salary
1      Fred   200
2      Ethel  300
3      Mike   400
4      David  100

ATTENDS
EmpID  Name
1      Harvard
2      GMU
2      Yale
3      MIT
3      Stanford
3      GMU

Joins
• Joins are the means by which multiple tables can be combined. A join operation is done through the SELECT construct.
• Types of Joins: Equijoin, Outer Join, Inner Join

Equijoin
• Joins only those rows where a foreign key matches the primary key
• Allows information from multiple tables to be linked together in a single query
• Can be used to link as many tables as needed in a single query

Equijoin Query Example
• Query Purpose: List the names of all colleges attended by Ethel
    SELECT b.Name FROM EMPLOYEE a, ATTENDS b WHERE a.EmpID = b.EmpID AND a.Name = 'Ethel'

Equijoin Example

EMPLOYEE
EmpID  Name   Salary
1      Fred   200
2      Ethel  300
3      Mike   400

ATTENDS
EmpID  College  GPA
1      Harvard  2.79
2      GMU      3.45
2      Nova     3.65
3      Yale     2.85
3      Nova     2.65
3      GMU      4.0

Warning about Joining Tables
• A join is really just a subset of a cartesian product. When no fields are 'joined' in the WHERE clause, a cartesian product is produced
– Restated in English: When the linking condition is omitted from the WHERE clause, you get a lot of excess garbage that you probably do not want.
  Sample Query:
    SELECT b.Name FROM EMPLOYEE a, ATTENDS b WHERE a.Name = 'Ethel'
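The effect of omitting the linking condition can be seen by counting rows. This is a minimal sketch, assuming Python's built-in sqlite3 module; it uses the EMPLOYEE and ATTENDS rows from the equijoin example with the college column named College, and leaves out GPA because it is not needed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EmpID INTEGER, Name TEXT, Salary INTEGER)")
conn.execute("CREATE TABLE ATTENDS (EmpID INTEGER, College TEXT)")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
                 [(1, "Fred", 200), (2, "Ethel", 300), (3, "Mike", 400)])
conn.executemany("INSERT INTO ATTENDS VALUES (?, ?)",
                 [(1, "Harvard"), (2, "GMU"), (2, "Nova"),
                  (3, "Yale"), (3, "Nova"), (3, "GMU")])

# Equijoin: the linking condition pairs each employee only with their own rows.
ethel_colleges = [r[0] for r in conn.execute(
    "SELECT b.College FROM EMPLOYEE a, ATTENDS b "
    "WHERE a.EmpID = b.EmpID AND a.Name = 'Ethel' ORDER BY b.College")]

# Linking condition omitted: Ethel is paired with every ATTENDS row.
unlinked = conn.execute(
    "SELECT b.College FROM EMPLOYEE a, ATTENDS b WHERE a.Name = 'Ethel'").fetchall()

# No WHERE clause at all: the full cartesian product, 3 x 6 rows.
product_size = conn.execute("SELECT COUNT(*) FROM EMPLOYEE a, ATTENDS b").fetchone()[0]
print(ethel_colleges, len(unlinked), product_size)
```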

Cartesian Product
• Each row in one table is paired with every row in the other table

a.EmpID  a.Name  a.Salary  b.EmpID  b.GPA
2        Ethel   300       1        3.8
2        Ethel   300       2        3.7
2        Ethel   300       3        3.4
2        Ethel   300       4        2.5
...

Nulls
• An attribute may be defined as null.
• This indicates that the value is unknown and avoids the need for user-defined special indicators.
• To prevent a column from having nulls, specify NOT NULL on the column in the CREATE TABLE statement when setting up the database.

Nulls Examples
• Statement Purpose: Add an employee whose salary is unknown
    INSERT INTO EMPLOYEE VALUES (3, 'Hank', NULL)
• Query Purpose: Find all employees whose salary is unknown (or null)
    SELECT * FROM EMPLOYEE WHERE Salary IS NULL

OUTER JOIN
• An OUTER JOIN is used when the query should return a result row even for rows that do not have corresponding data in one of the tables.
• A LEFT OUTER JOIN returns all rows from the 'left' table.
• Nulls are returned when a row in the 'left' table has no corresponding rows in the right table.

LEFT OUTER JOIN Example
• Query Purpose: List the college GPAs for each employee. Include employees who have not attended any colleges
    SELECT a.Name, b.GPA FROM EMPLOYEE a LEFT OUTER JOIN ATTENDS b ON a.EmpID = b.EmpID

LEFT OUTER JOIN Example
• Result of the outer join
– All employees are listed.
– For an equijoin, only those who attended a college would be listed
– Here, employee number 4 did not attend college, but is still retrieved by the outer join.

Name   GPA
-----  ----
Fred   2.79
Ethel  3.45
Ethel  3.65
Mike   2.85
Mike   2.65
Mike   4.00
David  NULL
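The difference between the outer join and an equijoin shows up as one extra row carrying a null. This is a minimal sketch, assuming Python's built-in sqlite3 module; the rows follow the slide tables as best they can be read from the source, and the database's NULL appears in Python as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EmpID INTEGER, Name TEXT, Salary INTEGER)")
conn.execute("CREATE TABLE ATTENDS (EmpID INTEGER, College TEXT, GPA REAL)")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
                 [(1, "Fred", 200), (2, "Ethel", 300), (3, "Mike", 400), (4, "David", 100)])
conn.executemany("INSERT INTO ATTENDS VALUES (?, ?, ?)",
                 [(1, "Harvard", 2.79), (2, "GMU", 3.45), (2, "Nova", 3.65),
                  (3, "Yale", 2.85), (3, "Nova", 2.65), (3, "GMU", 4.0)])

# Every EMPLOYEE row is kept; unmatched rows get NULL for the ATTENDS columns.
outer_rows = conn.execute(
    "SELECT a.Name, b.GPA FROM EMPLOYEE a "
    "LEFT OUTER JOIN ATTENDS b ON a.EmpID = b.EmpID ORDER BY a.EmpID").fetchall()

# The equijoin drops employees with no matching ATTENDS row.
inner_count = conn.execute(
    "SELECT COUNT(*) FROM EMPLOYEE a, ATTENDS b WHERE a.EmpID = b.EmpID").fetchone()[0]
print(outer_rows[-1], len(outer_rows), inner_count)
```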

Advanced SQL

(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410.553.6760, www.carriget.com)

Advanced SQL
1) Finding the nth element in a list
2) Finding the median
3) Correlated subquery
4) Data Definition Language Constructs

Find the Nth Element
• It is very common to try to find the nth element in a list.
– Examples:
  – Who makes the second highest salary in the marketing department?
  – What is the fifth best product in sales?
– This can be done with a program that uses SQL to access the database: SQL is sent to the database and the program keeps retrieving the result set until the threshold is crossed.
• We show another way of doing this using standard SQL.

Find the Nth Element: Example Table
• Consider a table, called TEST, with just one column, x, with the following values:

X
4
5
8

Find the Nth Element: Step 1
• First join TEST with itself; this yields each element matched with every other element:

4  4
4  5
4  8
5  4
5  5
5  8
8  4
8  5
8  8

Find the Nth Element: Step 2
• Next keep only those rows where the first column is greater than or equal to the second column.

4  4
5  4
5  5
8  4
8  5
8  8

Notice the pattern that just developed: each number on the list now has a certain number of values that match on the right. This number matches the position of this value in the list. For example, 4 has only one match as it is the first number in the list, 5 has two matches, and 8 has three matches.

Find the Nth Element: Step 3
• Now group by the column on the left and identify the size of each group.

Value  Group size
4      1
5      2
8      3

Finding the Nth Element: Example
• The same ideas can be applied to any SELECT statement output.
• Query Purpose: Find the information about the product with the second highest price.
    SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
    FROM TinyProducts a, TinyProducts b
    WHERE a.Price >= b.Price
    GROUP BY a.ProductName, a.ProductType, a.Price, a.SKUNumber
    HAVING COUNT(*) = (SELECT COUNT(*)-1 FROM TinyProducts)
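The three steps (self-join, keep rows where the left value is greater than or equal to the right, group and count) can be run directly on the TEST table. This is a minimal sketch, assuming Python's built-in sqlite3 module; like the slides' example, it assumes the values are distinct, since ties would give equal values the same count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TEST (x INTEGER)")
conn.executemany("INSERT INTO TEST VALUES (?)", [(4,), (5,), (8,)])

# Steps 1-3: self-join, filter a.x >= b.x, then count the matches per value.
# The count is the value's position in ascending order.
positions = conn.execute(
    "SELECT a.x, COUNT(*) FROM TEST a, TEST b "
    "WHERE a.x >= b.x GROUP BY a.x ORDER BY a.x").fetchall()

# The second highest value is the one whose position is (row count - 1).
second_highest = conn.execute(
    "SELECT a.x FROM TEST a, TEST b WHERE a.x >= b.x "
    "GROUP BY a.x HAVING COUNT(*) = (SELECT COUNT(*) - 1 FROM TEST)").fetchall()
print(positions, second_highest)
```

The self-join compares every row with every other row, so the work grows with the square of the table size; that is acceptable for a small list but worth keeping in mind for large tables.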

Finding the Top N Elements: Example
• To ask for the top n values instead of the nth value, specify a range (>=) instead of just an equality (=) in the HAVING.
• Query Purpose: Find information about the products with the two highest prices.
    SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
    FROM TinyProducts a, TinyProducts b
    WHERE a.Price >= b.Price
    GROUP BY a.ProductName, a.ProductType, a.Price, a.SKUNumber
    HAVING COUNT(*) >= (SELECT COUNT(*)-1 FROM TinyProducts)
    ORDER BY a.Price

Finding the Median
• The median is defined as the element in the middle of the list.
• Query Purpose: Find the median price in TinyProducts.
    SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
    FROM TinyProducts a, TinyProducts b
    WHERE a.Price >= b.Price
    GROUP BY a.ProductName, a.ProductType, a.Price, a.SKUNumber
    HAVING COUNT(*) = (SELECT (COUNT(*)/2)+1 FROM TinyProducts)
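Both the top-n query and the median query reuse the same self-join count and change only the HAVING condition. This is a minimal sketch, assuming Python's built-in sqlite3 module; the five products and their distinct prices are invented for illustration, and only two of the four columns are kept to stay short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TinyProducts (ProductName TEXT, Price REAL)")
conn.executemany(
    "INSERT INTO TinyProducts VALUES (?, ?)",
    [("A", 1.0), ("B", 2.0), ("C", 5.0), ("D", 8.0), ("E", 9.0)],
)

ranked = (
    "SELECT a.ProductName, a.Price FROM TinyProducts a, TinyProducts b "
    "WHERE a.Price >= b.Price GROUP BY a.ProductName, a.Price "
)

# Top two: positions at or above (row count - 1).
top_two = conn.execute(
    ranked + "HAVING COUNT(*) >= (SELECT COUNT(*) - 1 FROM TinyProducts) "
    "ORDER BY a.Price").fetchall()

# Median: position (row count / 2) + 1, using integer division on the count.
median = conn.execute(
    ranked + "HAVING COUNT(*) = (SELECT (COUNT(*) / 2) + 1 FROM TinyProducts)").fetchall()
print(top_two, median)
```

With an even number of rows, (COUNT(*)/2)+1 selects the upper of the two middle elements rather than their mean.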

Using Subqueries
• A subquery may be used in the middle of a query.
• Query Purpose: Find the information about the highest priced product, using a simple subquery.
    SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
    FROM TinyProducts a
    WHERE Price = (SELECT MAX(Price) FROM TinyProducts)

Correlated Subquery
• If the subquery references a data element from outside of the subquery, it is called a correlated subquery.
– For each row in the outer part of the query, the correlated subquery is executed.
• The following query will indicate who makes more money than 'Ethel'
    SELECT a.Name, a.Salary
    FROM Employee a
    WHERE EXISTS (SELECT b.Salary
                  FROM Employee b
                  WHERE a.Salary > b.Salary
                  AND b.Name = 'Ethel')
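The correlated subquery can be checked against the four-row EMPLOYEE table used earlier in the deck. This is a minimal sketch, assuming Python's built-in sqlite3 module; the subquery refers to a.Salary from the outer query, so it is evaluated once per outer row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmpID INTEGER, Name TEXT, Salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                 [(1, "Fred", 200), (2, "Ethel", 300), (3, "Mike", 400), (4, "David", 100)])

# EXISTS is true for an outer row when Ethel's row has a lower salary than it.
earns_more_than_ethel = [r[0] for r in conn.execute(
    "SELECT a.Name FROM Employee a "
    "WHERE EXISTS (SELECT b.Salary FROM Employee b "
    "              WHERE a.Salary > b.Salary AND b.Name = 'Ethel') "
    "ORDER BY a.Name")]
print(earns_more_than_ethel)
```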

Other Data Manipulation
• INSERT
– Add rows to a single table
• UPDATE
– Modify rows in a single table
• DELETE
– Remove rows from a single table

INSERT Examples
• Statement Purpose: Add a record for employee #1, 'Fred', with a salary of 200 to the EMPLOYEE table
    INSERT INTO Employee VALUES (1, 'Fred', 200)
• Statement Purpose: Copy all rows in the EMPLOYEE table and place them in NEW_EMPLOYEE
    INSERT INTO New_Employee SELECT * FROM Employee

UPDATE Example
• Statement Purpose: Modify Fred’s salary to 150
    UPDATE Employee SET Salary = 150.00 WHERE Name = 'Fred'
• Statement Purpose: Give all employees a ten percent raise
    UPDATE Employee SET Salary = Salary * 1.10

DELETE Examples
• Statement Purpose: Remove all employees who have a salary higher than 100.
    DELETE FROM Employee WHERE Salary > 100
• To remove all employees:
    DELETE FROM Employee

CREATE TABLE Example
• Statement Purpose: Create a table to store employee information
    CREATE TABLE EMPLOYEE (EmpId SMALLINT, Name CHAR(10), Salary DECIMAL(5,2))
• To drop the EMPLOYEE table
    DROP TABLE EMPLOYEE

Data Warehouse Security

(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410.553.6760, www.carriget.com)

Data Warehouse Security
1) Key Security Services
2) Views
3) Access Control
4) Roles
5) Encryption
6) Audit Trails
7) Security Holes
8) Intrusion Detection
9) Misuse Detection

Introduction
• A key feature provided by database systems is good security services.
– In a database system with good security, applications do not have to worry about problems that arise with security violations.
• A data warehouse also requires good security services because it holds key, corporate data.
[Diagram: EIS, Database System, Security Services]

Key Security Services
• Access Control
– Controls who accesses what data
• Administration of Access Control
– Used to give access to users as well as track who has various accesses and what kind of accesses are given to a user or group of users
• Audit
– Tracks the usage of the data warehouse

Security in a Data Warehouse
• A data warehouse consolidates an organization's key data in one place.
– A data warehouse increases the security risk that unauthorized users will try to obtain this data
• Security aspects of EIS applications must be designed and implemented very thoroughly.
• Access control and audits are two of the critical components of security.

Data Warehouse Security Components
• Database system components that can be used to protect a data warehouse include:
– Views – Allow users to only see certain rows or columns of data
– Access control – Indicate which users have access to what data
– Administration – Used to give access to groups of users and to define the accesses given to either an individual or a group
– Encryption – Protect data from access outside of the DBMS
– Audit – Track what users are doing

Views in Data Warehouse
• A view is a logical view into one or more tables. Users may be given access to the view without access to the base table.
• Views provide some security assistance because they can hide data from users.

EMPLOYEE
Name    Address            Salary
Hank    1 South Street     $50,000
Esther  2 North Street     $80,000
Tom     34 Main Street     $90,000
Sue     45 Easy Street     $28,500
Dave    56 5th Avenue      $35,000
Pete    7 Broadway         $60,000
Kathy   89 Western Avenue  $85,000

View Example
• A view called SAFE_EMPLOYEE may be created as:
    CREATE VIEW SAFE_EMPLOYEE AS (SELECT name, address FROM EMPLOYEE)
• Now users of the view SAFE_EMPLOYEE will not even know that salary exists.

SAFE_EMPLOYEE (view; “Salary” is effectively hidden)
Name    Address
Hank    1 South Street
Esther  2 North Street
Tom     34 Main Street
Sue     45 Easy Street
Dave    56 5th Avenue
Pete    7 Broadway
Kathy   89 Western Avenue
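The column hiding that SAFE_EMPLOYEE provides can be observed directly. This is a minimal sketch, assuming Python's built-in sqlite3 module and a single sample row; SQLite has no user accounts or GRANT/REVOKE, so it demonstrates only that the view does not expose the salary column, not the separate step of denying users access to the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (name TEXT, address TEXT, salary INTEGER)")
conn.execute("INSERT INTO EMPLOYEE VALUES ('Hank', '1 South Street', 50000)")
conn.execute("CREATE VIEW SAFE_EMPLOYEE AS SELECT name, address FROM EMPLOYEE")

# The view exposes only the columns named in its definition.
cursor = conn.execute("SELECT * FROM SAFE_EMPLOYEE")
view_columns = [col[0].lower() for col in cursor.description]
view_rows = cursor.fetchall()

# Asking the view for salary fails because the view has no such column.
try:
    conn.execute("SELECT salary FROM SAFE_EMPLOYEE")
    salary_visible = True
except sqlite3.OperationalError:
    salary_visible = False
print(view_columns, view_rows, salary_visible)
```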

Updating Views
• Restrictions exist on updating views. For the EMPLOYEE table, it is possible to insert into the SAFE_EMPLOYEE view.
– Example: INSERT INTO SAFE_EMPLOYEE VALUES ('Hank', 300)
  This will insert a NULL into the SALARY column of the base table EMPLOYEE.
• Other restrictions to view updates exist:
– Cannot update a view that is defined with an aggregate
– Cannot update a view that is defined with a GROUP BY

Data Warehouse Access Control
• Access control is implemented in a data warehouse with the SQL GRANT and REVOKE commands.
• Syntax
– GRANT <ALL|UPDATE|DELETE|INSERT|SELECT> ON <object-name> TO <user name>
– Example: GRANT SELECT ON EMPLOYEE TO MARY
• Access control is done by DBAs and creators of tables.
• To remove access the REVOKE command is used.
– Example: REVOKE SELECT ON EMPLOYEE FROM MARY

Database Roles
• Roles provide security administration by allowing users to be grouped into roles. Accesses may then be given to a group of users.
– As an example, some roles for a company might be:
– Administrative assistant
– Loan officer
– Salesperson
• Accesses may be assigned based on roles.
– This dramatically simplifies administration.
– If new tables are created, it is not necessary to add thousands of new accesses.
– Examples:
  CREATE ROLE loan_officer AS (Hank, John, Mike)
  GRANT SELECT ON LOAN TO LOAN_OFFICER
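
To see why granting to a role simplifies administration, the idea can be modeled in a few lines. This is a hypothetical in-memory model written in Python, not the API of any database system; the function names grant, revoke, and can are made up for the illustration.

```python
# Hypothetical in-memory model of role-based grants (not a DBMS API).
role_members = {"loan_officer": {"Hank", "John", "Mike"}}
role_grants = {}  # role name -> set of (privilege, table) pairs


def grant(privilege, table, role):
    """Give every member of the role the privilege on the table."""
    role_grants.setdefault(role, set()).add((privilege, table))


def revoke(privilege, table, role):
    """Take the privilege on the table away from the role."""
    role_grants.get(role, set()).discard((privilege, table))


def can(user, privilege, table):
    """True if any role the user belongs to holds the privilege."""
    return any(
        user in members and (privilege, table) in role_grants.get(role, set())
        for role, members in role_members.items()
    )


# One grant covers all three loan officers.
grant("SELECT", "LOAN", "loan_officer")
print(can("Hank", "SELECT", "LOAN"))
print(can("Mary", "SELECT", "LOAN"))
```

A single grant or revoke on the role changes the access of every member, so adding a new table needs one statement per role rather than one per user.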

Example of Application-based Roles
• Consider:
[Diagram: Users, Applications, Database System, Data]
• If the database system controls accesses then it does not matter what the application does; accesses are controlled consistently (same for SALES as MARKETING)
• However, more fine-grained access control can be granted in the application.

Application Roles
• The application can restrict:
– Data entry screens
– Reports
• Care must be taken to restrict users in a consistent fashion so that a user cannot jump to a different application and avoid security set up by another application.

Role Based Security in a Data Warehouse
• Both application and database level security are useful in a data warehouse.
• Database level security is needed so that users are only allowed to see data they need to see.
• Application level security can be used to control access to certain menus so that users do not even know what reports exist.

Encryption
• Encryption is the process of coding data so that it can only be read by users who have the key that allows them to decrypt the data.
– Example: A message “sell 500 shares” would appear as “xyzzy” without the key. Once the key is paired with the encrypted string “xyzzy”, it can then be decrypted.
– The size of the key is a factor in how difficult it is to attack the encryption scheme.
• Three places where encryption might be used in a data warehouse:
– Network
– Data
– Tape backups

Network Encryption
• In a data warehouse application, data and queries are transmitted through a network.
– Attackers might be able to steal network traffic just by breaking into the network medium.
• One way to reduce the risk of this threat is to encrypt traffic on the network.
[Diagram: User, Network, Data Warehouse Application, Database System, Tape Backup]

Network Encryption
• Network encryption is critical because the network connects all of the key components in a data warehouse.
• Without this, it may be possible for the “man in the middle” to masquerade as another user and circumvent existing application and database security.
• Encrypting network traffic mitigates the risk that an attacker could succeed with the “man in the middle” attack.

Data Encryption
• Data encryption refers to encrypting the actual data in the data warehouse.
• If the attackers were to retrieve data from the warehouse, they would have to decrypt it in order to read it.
[Diagram: EIS, Database System, Data Warehouse]

Backup Encryption
• Periodically, databases are copied to some kind of long-term storage (usually tapes).
• If the database is encrypted, but the tapes are not encrypted, the risk exists of someone walking off with the tapes.
[Diagram: EIS, Database System, Data Warehouse, Tape Backup]

Audit Trails
• Audit trails are a means of tracking queries, updates, deletes, and additions of new data to the data warehouse.
– Audit trails are turned on when the DBMS is started and all activity that uses the data warehouse is tracked in the audit trail.
• If a user is suspected of an evil deed, the audit trail can be examined to identify what data has been accessed by users.

Details of DW Audit Trails
• An audit trail of a database system typically includes the following information:
– User ID, Time, Date, Object that has been accessed (table or view), Action that accessed the object (INSERT, DELETE, UPDATE, SELECT)
– For UPDATE, the old value and new value is tracked.
• For data warehouses, the SELECT is often used to track the queries that have been run against the warehouse.
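
The kind of record an audit trail keeps for an UPDATE can be illustrated with a trigger that writes to an audit table. This is a rough sketch only: it uses Python's sqlite3 module, the AUDIT_TRAIL table and its column names are invented for the example, and SQLite has no notion of a user ID, so that field is omitted. A real DBMS provides its own auditing facility.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMPLOYEE (name TEXT PRIMARY KEY, salary INTEGER);

-- Hypothetical audit table: action, object, row key, old/new value, time.
CREATE TABLE AUDIT_TRAIL (
    action TEXT, object TEXT, row_key TEXT,
    old_value INTEGER, new_value INTEGER,
    logged_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Record the old and new salary every time a salary is updated.
CREATE TRIGGER audit_salary_update AFTER UPDATE OF salary ON EMPLOYEE
BEGIN
    INSERT INTO AUDIT_TRAIL (action, object, row_key, old_value, new_value)
    VALUES ('UPDATE', 'EMPLOYEE', OLD.name, OLD.salary, NEW.salary);
END;

INSERT INTO EMPLOYEE VALUES ('Hank', 50000);
UPDATE EMPLOYEE SET salary = 55000 WHERE name = 'Hank';
""")

audit_rows = conn.execute(
    "SELECT action, object, row_key, old_value, new_value FROM AUDIT_TRAIL"
).fetchall()
print(audit_rows)
```

Triggers cannot capture SELECT statements, so tracking queries against the warehouse depends on the audit features of the database system itself.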

Other Uses for DW Audit Trails
• Audit trails can be used to identify the most popular data in the warehouse.
– This information can be used to optimize queries
• An additional use for audit trails is performance tuning of the data warehouse.
– Administrators know where to focus their efforts
– Reduces administrative overhead

Dealing with Known Security Holes
• Commercial database systems and operating systems are often filled with holes that allow users to obtain unauthorized access.
– To reduce the risk of these known holes, vendors often provide “fixes” to their products as soon as these holes become public.
• It is important to constantly keep up with known security holes and apply the latest fixes as soon as they are released.
• One of the key risks surrounding a data warehouse is that privileged users have the “keys to the kingdom”.

The Risk of “Privileged Users”
• "Privileged users" include:
– Data warehouse administrators
– Operating system programmers
– Operators in the computer center
• These users can:
– Modify, delete and query any data in the warehouse
– Modify the audit trail to mask their actions
– Give other users unauthorized access
• Numbers of "privileged users" could be anywhere from 20 to 30 in some organizations.

Reducing the Risk of Privileged Users
• One way to reduce the risk of privileged users is to separate security administration from database administration.
– This would separate the task of giving accesses and managing the audit trail from the task of making sure the data in the warehouse was correct and properly optimized.
[Diagram: Security Services (Access Control, Audit) kept separate from Database Services (Database Tuning, Query Optimization, Backups)]

Information Security Attacks
• Two types of information security attacks on data warehouses are:
– Intrusion – An intrusion occurs when an unauthorized user gains access to the data warehouse. The assumption is the user is external to the environment (e.g., a hacker).
– Misuse – Misuse, often referred to as the insider problem, occurs when a user who has access to the warehouse uses that access for an unauthorized purpose
• Audit Trails can be used to identify either type of attack, but identification of misuse is typically MUCH harder to do than intrusion.

Intrusion Detection
• An intrusion is defined as an unauthorized access to a system.
• To reduce the risk of intrusion, intrusion detection tools are used.
– These tools monitor access to the data warehouse and sound an alarm if unauthorized accesses are detected.
[Diagram: USER, INTRUSION DETECTION SYSTEM, DATA WAREHOUSE]

Misuse Detection
• Unwanted access by a user that has the ability to access data is referred to as misuse.
– This is also known as the insider problem.
– Some estimates have shown that 80% of computer crime is a result of misuse.
• For data warehouses the threat of misuse is high, especially by privileged users.

Summary
• DBMS Security is useful for data warehouses to hide data from users with views and to restrict access to data with GRANT and REVOKE.
• Application Level Security assists EIS that access data warehouses by hiding certain reports from users.
• Encryption can be used to further protect against the risk of someone walking off with the data warehouse.
• Audit Trails are useful for:
– Catching attackers
– Identifying usage trends of the data warehouse

Moving Data to the Data Warehouse

Moving Data to the Data Warehouse
1) Moving Data into the Data Warehouse
2) Updating the Data Warehouse
3) Full Refresh
4) Copy Only the Changes
5) BCP
6) Simple Transformations
7) Complex Transformations
8) Commercial ETL Tools

Moving Data into the Data Warehouse
• Data must be moved to the data warehouse from source systems.
• Some key issues:
– Determine the frequency of data updates -- how often should data be moved from source systems to the data warehouse.
– Various means of updating data in the warehouse exist:
– SQL Commands
– Database system load programs (e.g. SQL Server’s BCP)
– Commercial tools

Updating the Data Warehouse
• OLTP (On-Line Transaction Processing) Systems have to send their updates to the data warehouse.
[Diagram: Finance, Inventory, and Sales OLTP Applications feed the Finance, Inventory, and Sales Subject Areas of the Data Warehouse]

Frequency of Updates to the Data Warehouse
• Updates may occur daily, weekly, monthly, or in real-time.
[Diagram: Finance OLTP Application sends a Daily Update to the Finance Subject Area; Inventory OLTP Application sends a Weekly Update to the Inventory Subject Area; Sales OLTP Application sends a Monthly Update to the Sales Subject Area]

Determining the Frequency of Updates
• Requirements should drive update frequency
• Range of updates runs from real-time to quarterly.
– Real time update – Expensive – Requires update of warehouse while users are querying
– Daily update – Somewhat cheaper than real time, but significant maintenance required if the warehouse has lots of tables.
– Monthly or weekly update – Much more manageable

Updating the Warehouse
• Full Refresh vs. Only the Changes
[Diagram: Finance, Inventory, and Sales OLTP Applications feed the Finance, Inventory, and Sales Subject Areas of the Data Warehouse, with a full refresh of some tables and changes since the last update for other tables]

Full Refresh
• Copy the entire source table in the OLTP system to the destination table in the Data Warehouse.
[Diagram: Source OLTP (Source Table) to Target Data Warehouse (Target Table)]

Copy Only the Changes
• Copy only the changes to the source table in the OLTP system to the destination table in the data warehouse.
[Diagram: Source OLTP (Source Table) sends modified data since the last update to the Target Data Warehouse (Target Table), which also holds data from two updates ago. Historical data no longer in source OLTP.]

Full Refresh vs. Only the Changes
• Full Refresh
– Pros – Much easier to implement – Less chance of messing up your database (good data integrity)
– Cons – Can take a lot longer to actually do -- may “run out of night” – Can lose out on warehouse ability to track historical data.
• Only the Changes (DELTA)
– Pros – Tracks historical data
– Cons – Can be very hard to implement – Can require changes in source applications (more on this later)

Full Refresh Using INSERT-SELECT
• One way to move data from one table to another is via the INSERT-SELECT.
– Syntax: INSERT INTO <target_table> <any sql SELECT statement>
• Example: INSERT INTO DW_EMPLOYEE SELECT * FROM EMPLOYEE

Updating Changes Using INSERT-SELECT
• Changes may be moved by adding a WHERE clause to the INSERT-SELECT.
• Example (copies the rows updated during the current month):
– INSERT INTO DW_EMPLOYEE SELECT * FROM EMPLOYEE WHERE DATEPART(m, DATE_UPDATED) = DATEPART(m, CURRENT_TIMESTAMP)
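
Both loading styles can be tried side by side. The following is a minimal sketch using Python's sqlite3 module; the date_updated column, the sample rows, and the helper names full_refresh and copy_changes are assumptions made for the illustration. Two choices differ from the slide examples and are labeled in the code: the full refresh empties the target first so rows are not duplicated, and the delta filter compares against the date of the last load rather than the current month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMPLOYEE (name TEXT, salary INTEGER, date_updated TEXT);
CREATE TABLE DW_EMPLOYEE (name TEXT, salary INTEGER, date_updated TEXT);
INSERT INTO EMPLOYEE VALUES ('Hank', 50000, '1999-03-10');
INSERT INTO EMPLOYEE VALUES ('Sue', 28500, '1999-04-24');
""")


def full_refresh(conn):
    # Assumption: empty the target first, then copy the entire source table.
    conn.execute("DELETE FROM DW_EMPLOYEE")
    conn.execute("INSERT INTO DW_EMPLOYEE SELECT * FROM EMPLOYEE")


def copy_changes(conn, last_load_date):
    # Assumption: copy only rows modified after the previous load date.
    conn.execute(
        "INSERT INTO DW_EMPLOYEE SELECT * FROM EMPLOYEE WHERE date_updated > ?",
        (last_load_date,),
    )


full_refresh(conn)
refreshed = conn.execute("SELECT COUNT(*) FROM DW_EMPLOYEE").fetchone()[0]

# A new source row arrives after the last load; only that row is copied.
conn.execute("INSERT INTO EMPLOYEE VALUES ('Joe', 50000, '1999-05-02')")
copy_changes(conn, "1999-04-30")
after_delta = conn.execute(
    "SELECT name FROM DW_EMPLOYEE ORDER BY name"
).fetchall()
print(refreshed, after_delta)
```

The delta load depends on the source table carrying a reliable last-updated column, which is one reason copying only the changes can require changes in the source applications.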

Updating Using BCP
• BCP is the bulk copy program that comes with MS SQL Server.
– Bulk copy (BCP) moves data between a flat file and a SQL table.
• Syntax: bcp <table> [in | out] <data file>
[Diagram: Source OLTP (Source Table), Unload, Temporary Flat File, Load, Target Data Warehouse (Target Table)]

BCP Example
• To bulk copy data from the publishers table in the pubs database to the publishers.txt data file in ASCII text format, execute from the command prompt:
bcp pubs..publishers out publishers.txt -c -Sservername -Usa -Ppassword
• To bulk copy data from the publishers.txt file into the pub2 table in the pubs database, execute from the command prompt:
bcp pubs..pub2 in publishers.txt -c -Sservername -Usa -Ppassword

Simple Transformation
• In addition to moving data from OLTP to the warehouse, it is often necessary to transform data.
– Example: System A stores TOTAL_CLOTH in meters and system B stores TOTAL_CLOTH in yards. Before the data is moved from system A, we need to transform the data.
[Diagram: Store 31 (Pattern = 31, Total Cloth = 20 meters) and Store 32 (Pattern = 32, TOTAL_CLOTH = 50 yards) pass through a TRANSFORMATION step into the Data Warehouse, which holds Total Cloth in yards for Pattern = 31 and Pattern = 32 (values shown: 50 yards, 70 yards)]

Complex Transformation
• More complex transformations occur when a value in a source table must be moved to several locations in a data warehouse.
[Diagram: the source value BLUE3484 (Color = Blue, 34 Inches, 84 (long sleeves)) is decoded, 34 in is converted to centimeters (86.36 cm), and the results are put in several Data Warehouse tables: BLUE, 86.36 cm, COLOR, and Long Sleeves across TABLE 1 to TABLE 4]
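
Both kinds of transformation reduce to small functions applied while the data is in transit. The sketch below is written in Python as an illustration; the function names are hypothetical, and the layout assumed for the code "BLUE3484" (letters for the color, two digits for the length in inches, two digits for a style code) is read off the slide's single example rather than from any documented format.

```python
METERS_PER_YARD = 0.9144  # exact by definition
CM_PER_INCH = 2.54        # exact by definition


def meters_to_yards(meters):
    """Simple transformation: one value in, one converted value out."""
    return meters / METERS_PER_YARD


def decode_product_code(code):
    """Complex transformation: one source value feeds several target columns.

    Assumed layout, based only on the example "BLUE3484":
    letters = color, next two digits = inches, last two digits = style code.
    """
    color = code[:-4].capitalize()
    inches = int(code[-4:-2])
    style_code = code[-2:]
    styles = {"84": "Long Sleeves"}  # only the mapping shown on the slide
    return {
        "color": color,
        "length_cm": round(inches * CM_PER_INCH, 2),
        "style": styles.get(style_code, "Unknown"),
    }


yards = round(meters_to_yards(20), 2)
decoded = decode_product_code("BLUE3484")
print(yards)
print(decoded)
```

Each key of the decoded dictionary would be written to a different warehouse table or column, which is what makes the second case harder to maintain than a plain unit conversion.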

Commercial ETL Tools
• Key tools in the marketplace
– Informatica
– Ardent
– DecisionBase (Platinum)
– Microsoft Data Transformation Services
• All provide libraries of common transformations.
• All provide the ability to code complex transformations.

Data Transformation Services

Choose a Source

Choose a Destination

Choose to use a Query for Transfer

Enter SQL Query

Choose Destination TableName

Verify Transformation

Decide When to Run Transformation

Final Verification

Run Transformation

Check Results
select * from orderfact

orderid  orderdate                productid  productname                quantity  unitprice  discount
10248    1996-07-04 00:00:00.000  11         Queso Cabrales             12        14.0000    0.0
10248    1996-07-04 00:00:00.000  42         Singaporean Hokkien Fried  10        9.8000     0.0
10248    1996-07-04 00:00:00.000  72         Mozzarella di Giovanni     5         34.8000    0.0
10249    1996-07-05 00:00:00.000  14         Tofu                       9         18.6000    0.0
10249    1996-07-05 00:00:00.000  51         Manjimup Dried Apples      40        42.4000    0.0

Summary
• ETL is one of the hard parts of building a data warehouse.
• Either full refreshes of data or just the changes may be done.
• Doing full refresh is easy, but historical data is lost and it may take a lot of time.
• Tracking changes is a tough business.
• ETL commercial tools are beginning to mature and can lessen the pain of this task.

More Ways of Moving Data to the Data Warehouse

More Ways of Moving Data to the Data Warehouse
1) Determining What Data Has Changed
2) Recovery Logs
3) Triggers
4) Insert Triggers
5) Delete Triggers
6) Update Triggers
7) Manual Detection

More Ways of Moving Data to the Data Warehouse
• There is a need to move data into the data warehouse from OLTP and DSS applications
• The problem is detecting what data needs to be moved into the data warehouse
• Three methods:
– Recovery Logs
– Triggers
– Manual Techniques

’Sales’.’50000) Mktg 35000 IT 71000 HR 0 Sales 60000 35000 71000 0 60000 55000 110000 152 76 .) • Problem: How to get updates made to multiple sources to the same information in the data warehouse? SOURCE DATA WAREHOUSE A LE TAB “ROW X” Employee UPD A ROWTES X NAME DEPT. Fred Mktg Hank Sales Sue Joe UPDATES IT Sales SALARY 35000 60000 71000 50000 ? ? A LE TAB “ROW X” B LE TAB “ROW X” EmployeeCount DEPT Mktg Sales IT HR COUNT 1 1 2 1 0 SalaryInfo DEPT AVG SAL TOT SAL P OLT Insert into Employee Values (‘Joe’.) Determining What Data Has Changed (cont.Determining What Data Has Changed Determining What Data Has Changed • Problem: How to get updates made to the source to the same information in the data warehouse? SOURCE How to get updates from Source Table A to Data Warehouse Table B DATA WAREHOUSE A LE TAB S TE DA UP P OLT ? B LE TAB 151 Determining What Data Has Changed (cont.

What is the Recovery Log?
• Recovery log is used for transaction processing
– Used to handle errors
– Does contain before and after image.
• Recovery log can be used to identify the data to be updated in the data warehouse.
– Change Data Capture Utility – This scans the database log and identifies all changes that the user is interested in and either writes them to a file or stores them in another table.

Change Data Capture Utility in Action
[Diagram: all changes to the SOURCE OLTP DATA are recorded by the DBMS in the RECOVERY LOG; the CHANGE DATA CAPTURE UTILITY reads the log and writes to the DATA WAREHOUSE]

Example of Using Recovery Log
• Consider an update to the Employee table
– The information is recorded in the log
– The change data capture reconstructs update
– Can then be sent to the data warehouse
[Diagram: UPDATE EMPLOYEE SET Salary=Salary*2.0 Where SSN=10 is recorded in the LOG as TABLE=EMPLOYEE SSN=10 OldSalary=100, NewSalary=200; CHANGE DATA CAPTURE RECONSTRUCTS the DATA WAREHOUSE UPDATE]

Using the Recovery Log
• Recovery logs are usually in proprietary format. Use commercial tools to read the log and identify the changes.
• Commercial tools such as CA’s log analyzer can place the results of their work in a table.

Summary of Change Data Capture
• Pro
– Log exists anyway, might as well use it to find what has changed
• Con
– Some difficult scenarios may occur where it is hard to see what the new update should be in the Data Warehouse.
– Proprietary format, may not be supported in many DBMS and will always lag behind DBMS development.
– Many tables will be in the source that have nothing to do with the data warehouse, but change data capture will process their changes as well.

Triggers
• Triggers allow DBAs to specify that when an “event” such as an INSERT, UPDATE, or DELETE occurs on a table, another event is triggered.
• Triggers can be used to detect the changes and perform data warehouse updates.
– Triggers are used to identify changes that are needed by the warehouse.
– A trigger can be added to a source table and whenever the source table is updated, an update can be placed either directly in the warehouse or in a staging table that tracks all updates.
– A different trigger might be run on key updates so that the data warehouse nightly process would know what data has changed.

Example of a Trigger
[Diagram:
STEP 1: INSERT into TABLE A VALUES (X, Y); values (X, Y) are inserted into TABLE A.
STEP 2: When values are inserted, this sets off the TRIGGER; the TRIGGER inserts values (X, Y) into a “STAGING” area.
STEP 3 and STEP 4: the Nightly Process reads the STAGING area and inserts values (X, Y) into the DATA WAREHOUSE.]

Real-Life Trigger Example
• OLTP/DSS Data - Employee table:
– Employee (ssn, name, salary)
• DW Data - Summary table:
– EmployeeStatistics (total number employees, total salary paid, average salary).
• When a row is inserted in the employee table, we need to update the EmployeeStatistics table.
– Shown on the next page

Insert Trigger Example
CREATE TRIGGER EmployeeInsertTrigger ON Employee FOR INSERT AS
BEGIN
    UPDATE EmployeeStatistics SET NoEmployee = NoEmployee + (SELECT COUNT(*) FROM INSERTED)
    UPDATE EmployeeStatistics SET TotSalary = TotSalary + (SELECT SUM(Salary) FROM INSERTED)
    UPDATE EmployeeStatistics SET AvgSalary = TotSalary / NoEmployee
END

Insert Trigger in Action
COMMANDS                                        RESULTS
INSERT INTO EMPLOYEE VALUES (1, 'John', 300)    (1 ROW(S) AFFECTED)
INSERT INTO EMPLOYEE VALUES (2, 'Mike', 400)    (1 ROW(S) AFFECTED)

SELECT * FROM EMPLOYEE
Employee
EmpId   Name   Salary
------  -----  -------
1       John   300.00
2       Mike   400.00

SELECT * FROM EMPLOYEESTATISTICS
EmployeeStatistics
NoEmployee   TotSalary   AvgSalary
----------   ---------   ---------
2            700.00      350.00
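
The same behavior can be reproduced outside SQL Server. The sketch below is an adaptation for SQLite, run through Python's sqlite3 module, and is not the slide's T-SQL: SQLite triggers fire once per inserted row and expose that row as NEW, so NEW.Salary stands in for the INSERTED table. The starting statistics row of zeros is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (EmpId INTEGER, Name TEXT, Salary REAL);
CREATE TABLE EmployeeStatistics (NoEmployee INTEGER, TotSalary REAL, AvgSalary REAL);

-- Assumed starting point: a single statistics row of zeros.
INSERT INTO EmployeeStatistics VALUES (0, 0.0, 0.0);

-- SQLite fires this once per inserted row; NEW replaces the INSERTED table.
CREATE TRIGGER EmployeeInsertTrigger AFTER INSERT ON Employee
BEGIN
    UPDATE EmployeeStatistics
       SET NoEmployee = NoEmployee + 1,
           TotSalary = TotSalary + NEW.Salary;
    UPDATE EmployeeStatistics SET AvgSalary = TotSalary / NoEmployee;
END;

INSERT INTO Employee VALUES (1, 'John', 300);
INSERT INTO Employee VALUES (2, 'Mike', 400);
""")

stats = conn.execute("SELECT * FROM EmployeeStatistics").fetchone()
print(stats)
```

After the two inserts the statistics row holds 2 employees, a total salary of 700, and an average of 350, matching the results shown for the SQL Server trigger.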

Delete Trigger Example
CREATE TRIGGER EmployeeDeleteTrigger ON Employee FOR DELETE AS
BEGIN
    DECLARE @numberEmployee int
    UPDATE EmployeeStatistics SET NoEmployee = NoEmployee - (SELECT COUNT(*) FROM DELETED)
    UPDATE EmployeeStatistics SET TotSalary = TotSalary - (SELECT SUM(Salary) FROM DELETED)
    SELECT @numberEmployee = NoEmployee FROM EmployeeStatistics
    IF @numberEmployee > 0
    BEGIN
        UPDATE EmployeeStatistics SET AvgSalary = TotSalary / NoEmployee
    END
    ELSE
        UPDATE EmployeeStatistics SET AvgSalary = 0.0
END

Update Trigger Example
CREATE TRIGGER EmployeeUpdateTrigger ON Employee FOR UPDATE AS
BEGIN
    IF UPDATE (Salary)
        UPDATE EmployeeStatistics SET TotSalary = TotSalary - (SELECT SUM(Salary) FROM DELETED) + (SELECT SUM(Salary) FROM INSERTED)
    UPDATE EmployeeStatistics SET AvgSalary = TotSalary / NoEmployee
END

Summary of Using Triggers
• Pro
– Only needed for tables whose data is going to go to the DW
• Con
– Additional work needed to create detailed triggers
– Non-trivial to generate a trigger to implement appropriate action
– May not be acceptable for commercial software on source system

Other Ways to Determine What Has Changed
• There are other manual ways of detecting the change and doing DW updates
– Look at each row of OLTP and the data in the warehouse
– Compare the differences between the two files; if the data is not in the warehouse, add it!
[Diagram: OLTP holds Hank, John, Mike, Sam; the DATA WAREHOUSE holds Hank, John, Mike; COMPARE the two and ADD THE DIFFERENCES]
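
The compare-and-add step in the diagram can be written as a set difference. This is a small sketch under assumptions: it uses Python's sqlite3 module, the table names OLTP_EMPLOYEE and DW_EMPLOYEE are invented, and rows are compared by name only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OLTP_EMPLOYEE (name TEXT);
CREATE TABLE DW_EMPLOYEE (name TEXT);
INSERT INTO OLTP_EMPLOYEE VALUES ('Hank'), ('John'), ('Mike'), ('Sam');
INSERT INTO DW_EMPLOYEE VALUES ('Hank'), ('John'), ('Mike');
""")

# Rows present in the source but missing from the warehouse.
missing = conn.execute(
    "SELECT name FROM OLTP_EMPLOYEE EXCEPT SELECT name FROM DW_EMPLOYEE"
).fetchall()

# Add the differences to the warehouse.
conn.executemany("INSERT INTO DW_EMPLOYEE VALUES (?)", missing)

dw_names = [
    row[0] for row in conn.execute("SELECT name FROM DW_EMPLOYEE ORDER BY name")
]
print(missing, dw_names)
```

The comparison has to read both tables in full on every run, which is why this approach is flexible but expensive on large tables.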

Manually Identifying What Has Changed
• Pro
– Flexible
• Con
– Very expensive
– Could take a long time

Summary
• Recovery Logs
• Triggers
• Manual Detection

Data Warehouse Design

Data Warehouse Design
1) Overview
2) Describing a Design - ER Diagrams
3) Design Normalization
4) Star Schema Design

Overview
• How to describe a design
– Entity Relationship (ER) Diagram
• Types of Designs
– Normalized
– Star Schema
– Snowflake

Describing a Design
• Different techniques exist; the most prevalent is the ER (Entity-Relationship) Diagram
• Entities – Things that occur in the real world, usually nouns, e.g., employee, part, product, etc.
• Relationships – How entities interact, usually verbs. Example: one employee may attend many colleges
– Types of relationships
– 1-1
– 1-Many
– Many-1
– Many-Many

Examples of Relationships
[Diagram: 1-1, 1-MANY, MANY-1, MANY-MANY]

Normalized Design
• Methodology
– All 1-1 relationships are placed in a single table.
– Many-many relationships require two tables that store the single-valued relationships and one linking table that indicates how the entities are related. The relationship is represented in the linking table by referencing keys in the two tables that represent each entity in the relationship.
• Checking the design
– In a Normalized Design, there are many different normalized forms. Each normal form (NF) builds on the previous one so that a table in 2NF is, by definition, in 1NF.
– 1NF
– 2NF
– 3NF

Dealing With Many-Many Relationships
• For Many-Many
– Two 1-1 Tables (SUPPLIER, PARTS)
– One linking table (SP)
– Ex: Suppliers, Parts are the 1-1. SP is the linking table that says who sells what parts.

SUPPLIER
S#   SNAME
1    SEARS
2    OFFICE DEPOT

PARTS
P#   PNAME
1    HAMMERS
2    NAILS

SP
S#   P#
1    1
1    2
2    1
2    2

Normalized Design: Example
• A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.
– Entities: Customer, Product, Store, Vendor
– Relationships:
– Customer buys Product
– Product is located in Store
– Product is supplied by a Vendor
[Diagram: CUSTOMER BUYS PRODUCT; PRODUCT IS-LOCATED-IN STORE; VENDOR supplies PRODUCT]
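
Answering "who sells what parts" from the SUPPLIER, PARTS, and SP tables takes a join through the linking table. The sketch below uses Python's sqlite3 module; the column names s_no and p_no are stand-ins for S# and P#, chosen as an assumption so that no identifier quoting is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SUPPLIER (s_no INTEGER PRIMARY KEY, sname TEXT);
CREATE TABLE PARTS (p_no INTEGER PRIMARY KEY, pname TEXT);

-- Linking table: one row per (supplier, part) pair.
CREATE TABLE SP (
    s_no INTEGER REFERENCES SUPPLIER (s_no),
    p_no INTEGER REFERENCES PARTS (p_no)
);

INSERT INTO SUPPLIER VALUES (1, 'SEARS'), (2, 'OFFICE DEPOT');
INSERT INTO PARTS VALUES (1, 'HAMMERS'), (2, 'NAILS');
INSERT INTO SP VALUES (1, 1), (1, 2), (2, 1), (2, 2);
""")

# Two joins are needed to turn key pairs back into readable names.
who_sells_what = conn.execute("""
    SELECT sname, pname
      FROM SP
      JOIN SUPPLIER USING (s_no)
      JOIN PARTS USING (p_no)
     ORDER BY sname, pname
""").fetchall()
print(who_sells_what)
```

Even this small question needs two joins, which illustrates why queries against a normalized design can involve numerous joins.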

Checking a Normalized Design
• Normalization
– Used to reduce data insertion, delete, and update anomalies caused by bad designs.
– Enables users to quickly check a design and make sure there are no glaring holes in the design.
– 1NF – All “cells” are atomic -- i.e. each entry in a column contains only one value
– 2NF – All non-key values are functionally dependent upon the entire primary key -- i.e. if the primary key changes, all non-key columns are affected.
– 3NF – No transitive dependencies -- i.e. every non-key column depends directly on the primary key and not on another non-key column.

Overview of Normalized Design
• Pro
– Relatively easy to change
• Con
– Queries can involve numerous joins
– The massive number of tables and links between tables makes it hard for customers to build their own queries

Star Schema
• Methodology
– Single fact table in the middle describing a key event (e.g. sale) surrounded by dimension tables (e.g. time, location, employee)
[Diagram: FACT in the center surrounded by dimensions D1, D2, D3, D4, D5 (D = DIMENSIONS)]

Star Schema: Methodology
• Identify a key fact that occurs.
– Usually some event creates a real fact. Selling a product in a store on Wednesday, patient visiting a hospital, etc.
• Identify all the dimensions of the data being used. Think of a dimension as a way to slice the data.
– Ex: by time, by customer, by product, etc.
• Drill down operations are very well supported

Star Schema: Example
• A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.
• Fact
– CustomerPurchase
• Dimensions are
– Customer
– Product
– Time
– Vendor

Star Schema: Example (cont.)
[Diagram: the Sale fact in the center, surrounded by Customer, Time, Store, Product, and Price]

SALE
SALE ID   CUST. ID   STORE ID   PROD. ID   PRICE   TIME
1         3          7          4          $3.00   4/24/99

CUSTOMER
CUST. ID   NAME   PHONE   Buys Apples   Has Big Car
3          FRED   1234    Y             Y

TIME
DAY   MONTH   YEAR   QTR
24    4       99     2Q
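
Slicing the fact table by a dimension comes down to one join and a GROUP BY. The sketch below uses Python's sqlite3 module and is illustrative only: the table and column names (TIME_DIM, time_id, and so on) are assumptions, and apart from the first sale, customer FRED, and the 2Q time row, the data is made up so that there is something to aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the ways to slice the facts.
CREATE TABLE CUSTOMER (cust_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE TIME_DIM (
    time_id INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER, qtr TEXT
);

-- The fact table holds one row per sale plus keys into each dimension.
CREATE TABLE SALE (
    sale_id INTEGER PRIMARY KEY, cust_id INTEGER, time_id INTEGER, price REAL
);

INSERT INTO CUSTOMER VALUES (3, 'FRED');
INSERT INTO TIME_DIM VALUES (1, 24, 4, 99, '2Q');
INSERT INTO TIME_DIM VALUES (2, 2, 7, 99, '3Q');   -- made-up row
INSERT INTO SALE VALUES (1, 3, 1, 3.00);
INSERT INTO SALE VALUES (2, 3, 1, 2.00);           -- made-up row
INSERT INTO SALE VALUES (3, 3, 2, 4.50);           -- made-up row
""")

# Slice the sales fact by the quarter attribute of the time dimension.
sales_by_quarter = conn.execute("""
    SELECT t.qtr, SUM(s.price)
      FROM SALE s
      JOIN TIME_DIM t ON t.time_id = s.time_id
     GROUP BY t.qtr
     ORDER BY t.qtr
""").fetchall()
print(sales_by_quarter)
```

Drilling down from quarter to month or day only changes the column in the SELECT and GROUP BY clauses; the join to the dimension table stays the same.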

Star Schema: Overview
• Pro
– Easy for users to navigate and understand
• Con
– Performance – Can end up with one monster fact table, millions of rows
– Flexibility – Not as easy for customers to change the design

Snowflake Schema
• Several stars can be connected to form a snowflake
[Diagram: three connected stars. MARKETING: Distribution, Ad, Direct Mail, Price. SALES: Marketing, Revenue, Sales, Location. PRODUCT: Parts, Manufacturing, Sale Price, Vendor, Make Chips, Product, Cost, Price, Labor]

Summary
• Two basic types of design
– Star Schema
– Normalized
• Many Data Warehouse vendors sell products built specifically for the star schema
• Some data warehouse designers insist that normalization is the way to build the data warehouse.

Building a Data Warehouse

Building a Data Warehouse
1) Top Down Approaches
2) Enterprise Data Model Approach
3) "Let Data Users Decide"
4) "Let Data Warehouse Builders Decide"
5) "Let Senior Management Decide"
6) Bottom Up Approach

Building the Data Warehouse
• How to decide what data goes into the data warehouse?
• Methods:
– Top Down
  – Using Enterprise Data Models
  – "Let data users decide" approach
  – "Let data warehouse builders decide" approach
  – "Let senior management decide" approach
– Bottom Up
  – Combine data marts into a data warehouse

Using Enterprise Data Models
• Use the Enterprise Data Model to decide what data goes into the data warehouse. This approach says let the business decide.
– Model key processes.
– Identify key data used by these processes in an enterprise data model -- might be a giant Entity-Relationship diagram.
• Put data in the warehouse based on the enterprise data model.

An Enterprise Data Model Example
[Diagram: processes MAKE CHIPS, PUT IN BAGS, SELL CHIPS, COUNT $$, BUY MORE POTATOES; data CHIP SUPPLIERS, CHIP RECIPES, INGREDIENTS]

"Enterprise Data Model" Approach
• Pro
– All inclusive -- no chance of leaving key data out.
• Con
– Very difficult to build an EDM.
– If the business model changes, you may have to rebuild the Enterprise Data Model and the data warehouse.
• Ways of Avoiding the Con
– In some cases you can buy an EDM -- if the business is common enough the packaged EDM might be very close and then you just have to modify it to fit your business.

"Let Data Users Decide"
• Let the users of the data warehouse choose what data will go into the warehouse.
– The data users deciding the data warehouse data and design will pay for it as well.
– Also, you can charge users who query the data as well.
[Diagram: SOURCE, USERS, DATA WAREHOUSE]

"Let Data Users Decide": An Example
[Diagram: MARKETING, HUMAN RESOURCES, and FINANCE each decide what DATA goes into the DATA WAREHOUSE; candidate data includes demographics, Advertising, Ethnic group, education, Age, budget trends, spending, Revenue]

"Let Data Users Decide" Approach
• Pro
– Reduces budget problems
– Users know best!
• Con
– Requires marketing
– Could end up with data in the warehouse that is meaningless to the people who run the place.
– Users may not place important data in the warehouse because their budget is small.
– Users who need the data may not use the DW because of budget concerns.
• Ways of Mitigating the Con
– Do not just take money -- try to determine if data is really corporate.

Pay As You Go Warehouse Analogy
[Image: I-495]

"Let Data Warehouse Builders Decide"
• The technical staff who are building the warehouse decide what data gets put in the warehouse.
[Cartoon: “LET'S PUT INFORMATION ON HOW TO BUILD VIRUSES IN THE DATA WAREHOUSE”]

"Let Data Warehouse Builders Decide" Approach
• Pro
– Very easy to design
– Does not take much time
– Do not have to deal with users
• Con
– Could easily result in data DUMP not data warehouse
• Ways to mitigate the con
– Talk to lots of users to help you guess what should go in the DW

“Let Senior Management Decide”
• The senior management decides what data goes into the warehouse.
• Identify the key questions on senior management’s mind and get the data to answer these questions.
• Asking the senior management is the safest way to build a data warehouse.

“Let Senior Management Decide” Approach
• Pro
– Ensures executive support for the project
• Con
– Senior management does not have much time for this -- you will have to only get a few questions at a time
– This dramatically increases visibility -- if you do not move quickly senior management will become very angry with the DW.
• Ways to mitigate the con
– Do your homework before talking to the senior management -- talk to the aides of senior management to find out what is on their mind.
– Allocate resources so you can plan to move very quickly once you hear from the senior management.

Bottom-Up Approach
• Move data from existing OLTP Applications to data marts.
• Combine data marts into a data warehouse.
[Diagram: three OLTP APPs each feed a DATA MART (holding 25 YARDS, 50 METERS, and 200 CM respectively); the data marts are combined into the DATA WAREHOUSE]

Bottom-Up Approach
• Pro
– Data marts are much easier to build than a full-fledged DW.
• Con
– Could end up with a bunch of stovepipe data marts.
• Ways to mitigate the con
– Develop standards for data when building the data marts so that you can glue data from different data marts together.

Recommendations for an Approach
• "Let senior management decide"
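The mitigation on the Bottom-Up Approach slide (shared data standards so that data marts can be glued together) can be sketched roughly as follows. This is a minimal sketch, assuming the three data marts in the figure report the same kind of length in yards, meters, and centimeters. The mart names, the choice of meters as the standard unit, and the function name `to_standard` are illustrative assumptions, not part of the slides.

```python
# Conversion factors to the assumed shared standard unit (meters).
TO_METERS = {"yards": 0.9144, "meters": 1.0, "cm": 0.01}


def to_standard(value, unit):
    """Convert a length reported by one data mart into meters."""
    return value * TO_METERS[unit]


# Hypothetical rows from three data marts, each using its own unit.
mart_rows = [
    ("mart_a", 25, "yards"),
    ("mart_b", 50, "meters"),
    ("mart_c", 200, "cm"),
]

# Once every value is in the same unit, the marts can be combined and compared.
warehouse = {name: round(to_standard(value, unit), 4)
             for name, value, unit in mart_rows}
print(warehouse)
```

Agreeing on the standard while the data marts are being built is cheaper than converting afterwards, because every mart then loads directly into the warehouse without a per-mart translation step.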

User Interface to the Data Warehouse
(slides in this section are used courtesy of Carrig Emerging Technology Ph: 410-553-6760 www.carriget.com)

User Interface to the Data Warehouse
1) Introduction
2) Types of Users
3) Functions Users Want to Do
4) Approaches to Building a User Interface
5) Hand Built
6) Class Libraries
7) OLAP Tools
8) Types of User Interfaces

Introduction
• A User Interface (UI) is a front-end application designed for the user that presents information in a simplified manner.
– Data in a data warehouse does nothing if users cannot access it
– Users do not want to learn SQL to drive DW applications
[Figure: Finance, Inventory, and Sales OLTP applications and their OLTP data feed the data warehouse, which users reach through the user interface.]

Building User Interfaces
• DW applications have different types of users with different functionality requirements.
– It is critical to identify the key users.
– Once you do this, you need to identify their functional requirements.
• There are three main approaches to building UIs
– Build your own entirely
– Use commercial class libraries
– Use OLAP tools

Types of Users
[Figure: organization chart with the CEO at the top; Marketing, Sales, and Finance each have Executive, Analysts, and Everyone levels.]

Types of Users (cont.)
• Executives
– People who run the place
– Need answers quickly
– May not be very technical
– Expect UI to get them what they want quickly and efficiently without any need for special training

• Analysts
– Have time to really analyze data and think about it
– May have strong statistical and IT background (e.g., power user of Excel)
– Expect UI to have many complex features, and provide the ability to generate new queries and perform statistical analysis of the data.

Types of Users (cont.)
• Regular User
– All other users
– Just need some simple answers to simple questions like “What is Hank’s phone number?”
– Expect UI to be simple, easy to understand, and provide access to basic information.

Subject Matter Experts Expect
• Query data in the data warehouse
• Trend analysis
– “Show me how much money we have spent on computers in the last four years.”
[Figure: trend chart of sales, 1995 to 1999]
• Benchmark to competitors
– “What are all our competitors charging for product X?”
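The trend-analysis request above ("how much money we have spent on computers in the last four years") could be answered by a query along these lines. This is only a sketch, assuming a hypothetical `spending` table with `year`, `category`, and `amount` columns held in an in-memory SQLite database; the table layout, the sample figures, and the function name `spending_trend` are invented for illustration.

```python
import sqlite3

# Hypothetical warehouse table; names and amounts are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (year INTEGER, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO spending VALUES (?, ?, ?)",
    [
        (1996, "computers", 120.0),
        (1997, "computers", 150.0),
        (1997, "furniture", 40.0),
        (1998, "computers", 90.0),
        (1998, "computers", 60.0),
        (1999, "computers", 200.0),
    ],
)


def spending_trend(conn, category, first_year, last_year):
    """Return (year, total amount) pairs for one category, oldest year first."""
    cursor = conn.execute(
        "SELECT year, SUM(amount) FROM spending "
        "WHERE category = ? AND year BETWEEN ? AND ? "
        "GROUP BY year ORDER BY year",
        (category, first_year, last_year),
    )
    return cursor.fetchall()


trend = spending_trend(conn, "computers", 1996, 1999)
for year, total in trend:
    print(year, total)
```

A user interface would hide this query behind a form or a chart, since the slides note that users do not want to learn SQL.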

Subject Matter Experts Expect (cont.)
• Drill Down
– “On that chart you just showed me, I noticed that revenue was down in Region #4. Please drill down and show me the breakdown of each area in Region #4.”
[Figure: drill-down example for Wal-Mart. A chart of revenue by region (regions 1 to 4) drills down into a chart of revenue for MD, DC, and VA in Region 4.]
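Drill down amounts to re-aggregating the same facts at a finer level for one selected group. The sketch below assumes sales rows of the form (region, state, revenue); the revenue figures and the states outside Region 4 are made up, and `revenue_by` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, state, revenue).
sales = [
    (1, "NY", 12.0), (2, "OH", 15.0), (3, "TX", 18.0),
    (4, "MD", 3.0), (4, "DC", 1.5), (4, "VA", 2.5),
]


def revenue_by(rows, key_index, region=None):
    """Total revenue grouped by one column, optionally limited to one region."""
    totals = defaultdict(float)
    for row in rows:
        if region is None or row[0] == region:
            totals[row[key_index]] += row[2]
    return dict(totals)


# Top-level chart: revenue per region.
by_region = revenue_by(sales, 0)
# Drill down: the same revenue broken out by state within Region 4.
region_4 = revenue_by(sales, 1, region=4)
print(by_region)
print(region_4)
```

The drill-down totals always add up to the bar they were drilled from, which is a useful consistency check when building this kind of interface.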

Approaches to Building User Interfaces
• Hand-Built
– Write all of your own code
• Use Class Libraries
– Use an object-oriented approach and buy the class libraries that do all the hard work
• OLAP
– Use an On-Line Analytical Processing package to build user interfaces for you.

Architecture of User Interfaces (cont.)
• Hand Built
[Figure: a hand-built user interface (e.g., Java) sits on top of the data warehouse DBMS.]
• Class Libraries
[Figure: a hand-built user interface calls commercial off-the-shelf (COTS) class libraries, such as a graphics class library and an OLAP class library.]

Architecture of User Interfaces (cont.)
• OLAP
[Figure: a commercial off-the-shelf (COTS) user interface presents a result cube of revenue by year, store, and region, drawn from the DBMS.]
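The result cube in the OLAP figure (revenue by year, store, and region) can be pictured as a table of cells keyed by one value per dimension. The following is a rough sketch under that assumption, not how any particular OLAP product stores its data; the fact rows and the helper names `build_cube` and `roll_up` are hypothetical.

```python
from collections import defaultdict

# Dimension order used for every cell key in this sketch.
DIMENSIONS = ("year", "store", "region")

# Hypothetical fact rows: (year, store, region, revenue).
facts = [
    (1998, "Store A", "East", 10.0),
    (1998, "Store B", "West", 7.0),
    (1999, "Store A", "East", 12.0),
    (1999, "Store A", "East", 3.0),
    (1999, "Store B", "West", 9.0),
]


def build_cube(rows):
    """Sum revenue into one cell per (year, store, region) combination."""
    cube = defaultdict(float)
    for year, store, region, revenue in rows:
        cube[(year, store, region)] += revenue
    return dict(cube)


def roll_up(cube, keep):
    """Aggregate the cube down to the dimensions named in `keep`."""
    indexes = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for cell, revenue in cube.items():
        totals[tuple(cell[i] for i in indexes)] += revenue
    return dict(totals)


cube = build_cube(facts)
by_year = roll_up(cube, ["year"])
print(by_year)
```

Rolling up to fewer dimensions gives the summary charts, and keeping more dimensions gives the detail needed for drill down.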

Hand-Building User Interfaces
• Write all the code yourself
– Requires many design documents, coding and testing for all of the code components.
• Pros
– Very flexible
• Cons
– Could take a long time to develop
– Requires substantial resources
– May need lots of testing and debugging

Using Class Libraries to Build User Interfaces
• Write initial user dialog yourself and call class libraries for the hard part (graphics and data access functionality).
• Pro
– Many class libraries available; avoid doing a lot of coding yourself
• Con
– Not as flexible: if the class library does not do what you want it to do, you have to find a new class library or live without the functionality
– Can take a while to find the class library you need and learn how to interface to it

Using OLAP Tools to Build User Interfaces
• Many different OLAP tools
– Need to survey the available OLAP tools
– Buy an OLAP tool
– Install it
– If it does not match all requirements, some code may be needed to communicate with the OLAP tool.
• Four types of OLAP
– Relational OLAP (ROLAP)
– Multi-dimensional OLAP (MOLAP)
– Hybrid OLAP (HOLAP)
– Distributed OLAP (DOLAP)

Summary of Tools for UI Development of DW
• Tools that may be used include:
– Development of in-house software (do it all yourself)
– Use class libraries
– OLAP: ROLAP, MOLAP, HOLAP, DOLAP
• Different tools or techniques may be useful depending upon what kind of user interface is being developed.
– Executive Information Systems
– Analytical Systems
– Enterprise Information Systems

Types of User Interfaces
• Executive Information System
– Developed for the person who runs the place
• Analytical System
– Developed for business analysts
• Enterprise Information System
– Developed for users throughout the organization
[Figure: the organization chart of CEO, Executives, Analysts (Marketing, Sales, Finance), and Everyone, with the Executive Information System, Analytical System, and Enterprise Information System marked against it.]

Executive Information System
• The Executive IS is developed specifically for people who run the organization.
• Development process:
– No clean life cycle
– Prototype constantly. Usually have to guess at what executives will want to see
– Show executives the prototype and let them come up with ideas for revisions
– Drill down functionality required
• Tools
– Frequently hand-built, but purchasing a class library can help lower the development cost.
– May just want to use tools that allow development of a subscription service in which users may “subscribe” to a few canned reports.
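The subscription idea on the Executive Information System slide can be sketched as a small registry of canned reports that users opt into. Everything here is an assumption made for illustration: the report names, the placeholder report text, and the `subscribe` and `deliver` helpers do not come from the slides or from any specific tool.

```python
from collections import defaultdict

# Hypothetical canned reports: report name -> function that produces its text.
CANNED_REPORTS = {
    "weekly_revenue": lambda: "Revenue by region for the past week",
    "budget_status": lambda: "Budget versus actual spending",
}

# user -> set of report names that user has subscribed to
subscriptions = defaultdict(set)


def subscribe(user, report):
    """Subscribe a user to one of the canned reports."""
    if report not in CANNED_REPORTS:
        raise KeyError(f"unknown report: {report}")
    subscriptions[user].add(report)


def deliver(user):
    """Run every report the user subscribed to and return the results."""
    return {name: CANNED_REPORTS[name]() for name in sorted(subscriptions[user])}


subscribe("ceo", "weekly_revenue")
subscribe("ceo", "budget_status")
print(deliver("ceo"))
```

Because the reports are fixed in advance, the executive never composes a query, which fits the expectation that the interface needs no special training.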

Analytical System
• Analytical systems are user interfaces developed for business analysts in an organization.
• Development process:
– Allow users to drag-and-drop data around to further the analysis of this data.
– More complex interface is acceptable
– Users may be required to know some SQL
• Tools:
– OLAP tools are frequently used to build the interface

Enterprise Information System
• Enterprise IS is written for the general user to retrieve simple, key information.
– Simpler than Executive IS as it does not require drill down functionality.
• Development process:
– Frequently developed in-house
– So many users around that you really cannot pick a few and ask what they need.
• Tools
– Place some simple, key information on a few screens, control access, and then deploy.

Summary of Types of User Interfaces
• Executive Information System
– For the senior executives
– Use in-house development or in-house development augmented by class libraries
• Analytical System
– OLAP may make sense here as the interface is more complicated, but OLAP has drawbacks: data sparseness and no well-accepted query language
• Enterprise Information System
– Much simpler than executive system
– Good candidate for in-house development
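The data sparseness drawback can be made concrete by comparing the cells a cube actually holds with the cells its dimensions allow. This is a small sketch under assumed numbers: a cube with 2 years, 3 stores, and 3 regions in which each store belongs to a single region, so most (store, region) combinations can never hold data. The cell values and the function name `cube_density` are hypothetical.

```python
def cube_density(cells, dimension_sizes):
    """Fraction of possible cells that actually contain data."""
    possible = 1
    for size in dimension_sizes:
        possible *= size
    return len(cells) / possible


# Hypothetical populated cells keyed by (year, store, region).
cells = {
    (1998, "Store A", "East"): 10.0,
    (1998, "Store B", "West"): 7.0,
    (1999, "Store A", "East"): 15.0,
    (1999, "Store B", "West"): 9.0,
    (1999, "Store C", "North"): 4.0,
}

# 2 years x 3 stores x 3 regions = 18 possible cells, of which 5 hold data.
density = cube_density(cells, (2, 3, 3))
print(f"{density:.0%} of the cube is populated")
```

A tool that allocates storage for every possible cell wastes most of it on such a cube, and the waste grows quickly as dimensions are added.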
