
2) Describe the importance of denormalization as part of data warehouse design (150 words).

Denormalization is a data warehousing strategy used to enhance the performance of a database infrastructure. Denormalization adds redundant data to a normalized data warehouse to minimize the running time of specific database queries that unite data from many tables into one. In fact, the concept of denormalization depends on normalization, which is the act of arranging a database into tables by removing repetitions so as to suit a given use case. Remember, a denormalized database should never be mistaken for a database that was never normalized. A data warehouse is a living operational environment. Data extracted from sources enters the warehouse inventory at the point of extraction. Systems dropping files for warehouse ingestion follow warehouse rules for doing so. Data is transformed according to algorithms required by the warehouse, loaded from these sources, and stored according to data warehouse rules. It is reposited in one or more storage locations, also by warehouse rules. No one location or technology is the data warehouse.
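
To make the trade-off concrete, here is a minimal sketch in Python using SQLite. The table and column names (customers, orders, orders_denorm) are hypothetical; the point is that the pre-joined denormalized table answers reporting queries without a join, at the cost of repeating customer attributes on every row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Normalized source tables (no redundancy).
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.execute("INSERT INTO customers VALUES (1, 'Acme', 'West'), (2, 'Globex', 'East')")
cur.execute("INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 99.0), (12, 2, 400.0)")

# Denormalized table: customer attributes are copied onto every order row,
# so reporting queries no longer need the join.
cur.execute("""
    CREATE TABLE orders_denorm AS
    SELECT o.order_id, o.amount, c.name AS customer_name, c.region
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
""")

# A typical warehouse query now reads a single table.
for row in cur.execute("SELECT region, SUM(amount) FROM orders_denorm GROUP BY region ORDER BY region"):
    print(row)  # ('East', 400.0) then ('West', 349.0)
```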

3) Review the key roles involved in the design of a dimensional model such as data modeler,
business analyst, business intelligence application developer, data steward, ETL developer,
database administrator, security manager, data warehouse administrator. Select a role
and define the tasks that this person performs (150 words).
Database administrator: A database administrator's (DBA) primary job is to ensure that data is available, protected from loss and corruption, and easily accessible as needed. A DBA often collaborates on the initial installation and configuration of a new database (Oracle, SQL Server, etc.). The system administrator sets up hardware and deploys the operating system for the database server; the DBA then installs the database software and configures it for use. As updates and patches are required, the DBA handles this ongoing maintenance. Database administrators use specialized software to store and organize data. The role may include capacity planning, installation, configuration, database design, migration, performance monitoring, security, troubleshooting, as well as backup and data recovery.

A database administrator's responsibilities can include the following tasks:
- Installing and upgrading the database server and application tools
- Allocating system storage and planning storage requirements for the database system
- Modifying the database structure, as necessary, from information given by application developers
- Enrolling users and maintaining system security
- Ensuring compliance with the database vendor's license agreement
- Controlling and monitoring user access to the database (a brief sketch of this task follows the list)
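
As a hedged sketch of the access-control task, the Python example below assumes a PostgreSQL server and the psycopg2 driver; the connection settings, role name, and reporting schema are all hypothetical.

```python
import psycopg2

# Hypothetical connection settings for illustration only.
conn = psycopg2.connect(host="localhost", dbname="warehouse",
                        user="dba", password="secret")
conn.autocommit = True  # run role/grant statements immediately
with conn.cursor() as cur:
    # Enroll a new analyst account with login rights but no superuser powers.
    cur.execute("CREATE ROLE analyst_jane LOGIN PASSWORD 'changeme'")
    # Grant read-only access to the reporting schema only.
    cur.execute("GRANT USAGE ON SCHEMA reporting TO analyst_jane")
    cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO analyst_jane")
conn.close()
```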

4) Identify the importance of selecting the Grain of a data warehouse in the Kimball Data
Warehouse Model. Provide examples of grains within a Data Warehouse (150 words).

The grain establishes exactly what a single fact table row represents. The grain declaration becomes a binding contract on the design. The grain must be declared before choosing dimensions or facts because every candidate dimension or fact must be consistent with the grain. This consistency enforces a uniformity on all dimensional designs that is critical to BI application performance and ease of use. Atomic grain refers to the lowest level at which data is captured by a given business process. The most important result of declaring the grain of the fact table is anchoring the discussion of the dimensions, but declaring the grain also lets you be equally clear about the measured numeric facts. The grain of the dimensional model is the finest level of detail that is implied when the fact and dimension tables are joined.

For example, the granularity of a dimensional model that consists of the dimensions Date, Store, and Product is product sold in store by day. The fact and dimension tables each have a granularity associated with them. In dimensional modeling, granularity refers to the level of detail stored in a table. For example, a Date dimension with Year and Quarter hierarchies has granularity at the quarter level but holds no information for individual days or months. Alternatively, a Date dimension table with Year, Quarter, and Month hierarchies has granularity at the month level, but does not contain information at the day level.
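
As a brief illustration, the sketch below (using pandas, with made-up sales data and column names) shows how a declared grain of one row per product, per store, per day acts as a contract: the grain columns must uniquely identify each row, and coarser summaries are derived by aggregation.

```python
import pandas as pd

# Hypothetical fact rows at the declared grain: product sold in store by day.
sales = pd.DataFrame({
    "date":       ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"],
    "store_id":   [1, 1, 2, 1],
    "product_id": [100, 101, 100, 100],
    "units_sold": [3, 5, 2, 4],
    "revenue":    [30.0, 75.0, 20.0, 40.0],
})

# The grain columns together must uniquely identify each fact row;
# every measure (units_sold, revenue) must be consistent with this grain.
grain = ["date", "store_id", "product_id"]
assert not sales.duplicated(subset=grain).any(), "rows violate the declared grain"

# Rolling up to a coarser grain (store by day) is a simple aggregation.
daily_store = sales.groupby(["date", "store_id"], as_index=False)[["units_sold", "revenue"]].sum()
print(daily_store)
```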

7) Compare and contrast the models used for developing a data warehouse for
unstructured data versus structured data (150 words).

The "Dimensional Data Model" otherwise known as the "Star Schema" was
developed by Ralph Kimball in the 1980s to support these business needs. 
This approach has stood the test of time and is the recommended way to
organize data for business query and analysis.
Data modeling in data warehouses is different from data modeling in operational database systems.
The primary function of data warehouses is to support DSS processes. Thus, the objective of data
warehouse modeling is to make the data warehouse efficiently support complex queries on long term
information.

In contrast, data modeling in operational database systems targets efficiently supporting simple
transactions in the database such as retrieving, inserting, deleting, and changing data. Moreover, data
warehouses are designed for the customer with general information knowledge about the enterprise,
whereas operational database systems are more oriented toward use by software specialists for
creating distinct applications.
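
The sketch below illustrates the contrast, assuming hypothetical table and column names: a Kimball-style star schema keeps one central fact table with foreign keys to wide, denormalized dimension tables, so a typical DSS query is a simple aggregate over a few joins rather than the many-table joins of a fully normalized operational schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables: wide, denormalized, one row per member.
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, quarter INTEGER);
    CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

    -- Fact table: foreign keys to each dimension plus numeric measures.
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        store_key   INTEGER REFERENCES dim_store(store_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        units_sold  INTEGER,
        revenue     REAL
    );
""")

# A typical DSS query: aggregate measures, sliced by dimension attributes.
query = """
    SELECT d.year, s.region, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d  ON d.date_key = f.date_key
    JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY d.year, s.region
"""
print(con.execute(query).fetchall())  # empty until facts are loaded
```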

8) Define the importance of ontologies for unstructured data warehouses. Provide an example of unstructured data and the use of an ontology used to manage this data (150 words).

Ontologies provide the means to represent any data format, including unstructured, semi-structured, or structured data, enabling smoother data integration, easier concept and text mining, and data-driven analytics. Since ontologies define the terms used to describe and represent an area of knowledge, they are used in many applications to capture relationships and boost knowledge management. One of the main features of ontologies is that, by having the essential relationships between concepts built into them, they enable automated reasoning about data. Such reasoning is easy to implement in semantic graph databases that use ontologies as their semantic schemata.

What’s more, ontologies function like a ‘brain’: they ‘work and reason’ with concepts and relationships in ways that are close to the way humans perceive interlinked concepts.

In addition to the reasoning feature, ontologies provide more coherent and easier navigation as users move from one concept to another in the ontology structure.

Another valuable feature is that ontologies are easy to extend, since relationships and concept matching are easy to add to existing ontologies. As a result, the model evolves with the growth of data without impacting dependent processes and systems when something needs to be changed. For example, free-text clinical notes are unstructured data; annotating them with concepts from a medical ontology such as SNOMED CT lets a warehouse relate a note that mentions diabetes to broader queries about diseases.

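
As a minimal sketch of that idea, the Python example below uses the rdflib library and a made-up mini ontology (the example.org namespace and note identifiers are hypothetical) to tag unstructured notes with concepts so that a query for a broad concept also finds notes tagged with its subconcepts.

```python
from rdflib import Graph, Namespace, RDFS

EX = Namespace("http://example.org/onto#")
g = Graph()

# Tiny ontology: Diabetes is a kind of Disease.
g.add((EX.Diabetes, RDFS.subClassOf, EX.Disease))

# Annotate two unstructured notes with concepts extracted from their text.
g.add((EX.note1, EX.mentions, EX.Diabetes))
g.add((EX.note2, EX.mentions, EX.Disease))

# "Reasoning" here is a transitive walk down the subclass hierarchy:
# collect Disease and everything beneath it, then find notes mentioning any.
concepts = set(g.transitive_subjects(RDFS.subClassOf, EX.Disease)) | {EX.Disease}
notes = {s for s, _, o in g.triples((None, EX.mentions, None)) if o in concepts}
print(sorted(str(n) for n in notes))  # both note1 and note2 match
```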

9) Define the importance of indexes for unstructured data warehouses. Provide an example
of unstructured data and the indexes that you would define to manage this data (150
words).
Indexes are database objects associated with database tables and created to speed up access to data within the tables. Indexing techniques have existed for decades in OLTP relational database systems, but those techniques cannot handle the large volumes of data and the complex, iterative queries that are common in OLAP applications.

Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. In other words, it is information that doesn't reside in a traditional row-column database.

An example of unstructured data is attachments stored on a server. The indexing process for this content is done by a data import handler, which handles unstructured content using a hybrid of data sources to create indexing information within an existing search indexing framework such as the one used by WebSphere Commerce.

There are different types of indexes necessary to make text analysis efficient. They range from simple indexes to complex combined indexes, which can be made up of any and all of the other kinds. These include the patterned index, homographic index, alternate spelling index, stemmed words index, and clustered index.

The importance of an index is that it locates data very quickly; without one, accessing a document, file, or any other kind of data would be very slow, since the search would have to scan every piece of content.
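
As a toy illustration of one of these, the sketch below builds a stemmed-words inverted index in plain Python over two made-up documents; the suffix-stripping stemmer is deliberately crude, and a real system would use a proper stemmer alongside the other index types listed above.

```python
from collections import defaultdict

def stem(word: str) -> str:
    # Naive stemming: strip a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

docs = {
    "doc1": "The attachments were indexed and stored on the server",
    "doc2": "Indexing attachments speeds up searches",
}

index = defaultdict(set)  # stemmed term -> set of document ids
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[stem(token)].add(doc_id)

# A query for "indexing" now matches "indexed" and "Indexing" alike,
# without scanning every document's full text.
print(sorted(index[stem("indexing")]))  # ['doc1', 'doc2']
```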

10) Identify design decisions that must be made as part of the ETL process. Describe the
importance of these design decisions (150 words).

One of the earliest and most fundamental decisions you must make is whether to hand-code your ETL system or use a vendor-supplied package. Technical issues and license costs aside, you shouldn’t go off in a direction that your employees and managers find unfamiliar without seriously considering the decision’s long-term implications. This decision will have a major impact on the ETL environment, driving staffing decisions, design approaches, metadata strategies, and implementation timelines for a long time.

ETL comes from data warehousing and stands for Extract-Transform-Load. ETL covers the process of how data is loaded from the source system into the data warehouse. The design process and the design decisions that need to be made are as follows:

Extract
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that does not negatively affect the source system in terms of performance, response time, or any kind of locking.

There are several ways to perform the extract:
- Update notification
- Incremental extract (sketched below)
- Full extract
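
As a minimal sketch of the incremental approach, the example below (using SQLite, with a hypothetical source_rows table and last_modified column) pulls only rows changed since the previous run's high-water mark, which keeps the load on the source system low.

```python
import sqlite3

def extract_increment(con: sqlite3.Connection, last_run: str) -> list:
    # Pull only rows modified after the previous extract's high-water mark.
    return con.execute(
        "SELECT id, payload, last_modified FROM source_rows WHERE last_modified > ?",
        (last_run,),
    ).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source_rows (id INTEGER, payload TEXT, last_modified TEXT)")
con.executemany("INSERT INTO source_rows VALUES (?, ?, ?)", [
    (1, "old row", "2024-01-01T09:00:00"),
    (2, "new row", "2024-01-02T11:30:00"),
])

# Only the row changed after the stored high-water mark is extracted.
print(extract_increment(con, "2024-01-01T12:00:00"))  # [(2, 'new row', ...)]
```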
Clean
The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse. Cleaning should perform basic data unification rules, such as the following (a sketch of a few of these rules follows the list):
- Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, and Man/Woman/Not Available are all translated to the standard Male/Female/Unknown)
- Converting null values into a standardized Not Available/Not Provided value
- Converting phone numbers and ZIP codes to a standardized form
- Validating address fields and converting them into proper naming, e.g. Street/St/St./Str./Str
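
A minimal sketch of a few of these rules in Python follows; the input record and field names are assumed for illustration.

```python
import re

# Unification table for the sex-category rule described above.
SEX_MAP = {"m": "Male", "man": "Male", "male": "Male",
           "f": "Female", "woman": "Female", "female": "Female"}

def clean_sex(value):
    # Translate any known variant to the standard Male/Female/Unknown.
    if value is None:
        return "Unknown"
    return SEX_MAP.get(str(value).strip().lower(), "Unknown")

def clean_nullable(value):
    # Convert missing or empty values into a standardized marker.
    return "Not Available" if value in (None, "", "null") else value

def clean_zip(value):
    # Keep digits only and pad to the standard five-digit US form.
    digits = re.sub(r"\D", "", str(value))
    return digits[:5].zfill(5) if digits else "Not Available"

record = {"sex": "M", "phone": None, "zip": "201-7"}  # assumed input shape
cleaned = {"sex": clean_sex(record["sex"]),
           "phone": clean_nullable(record["phone"]),
           "zip": clean_zip(record["zip"])}
print(cleaned)  # {'sex': 'Male', 'phone': 'Not Available', 'zip': '02017'}
```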
