
1. Differentiate between OLTP and Data Warehouse.

Ans: Differences between OLTP and Data Warehouse


Application databases are OLTP (On-Line Transaction Processing) systems where every transaction has to be
recorded as and when it occurs. Consider the scenario where a bank ATM has disbursed cash to a customer but was
unable to record this event in the bank records. If this happened frequently, the bank wouldn't stay in business for
long. So the banking system is designed to make sure that every transaction gets recorded within the time you stand
before the ATM.
A Data Warehouse (DW), on the other hand, is a database (yes, you are right, it's a database) that is designed for
facilitating querying and analysis. Often designed as OLAP (On-Line Analytical Processing) systems, these databases
contain read-only data that can be queried and analyzed far more efficiently as compared to your regular OLTP
application databases. In this sense an OLAP system is designed to be read-optimized.
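To make the contrast concrete, here is a minimal sketch in Python using the built-in sqlite3 module (the table and column names are invented for illustration): an OLTP query is a narrow, predictable point lookup, while an OLAP-style query is a broad, read-only aggregation over history.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE withdrawals_history (account_id INTEGER, branch TEXT,
                                  amount REAL, txn_date TEXT);
INSERT INTO accounts VALUES (1001, 5000.0);
INSERT INTO withdrawals_history VALUES
    (1001, 'North', 200.0, '2023-01-15'),
    (1001, 'North', 150.0, '2023-02-10'),
    (1002, 'South', 300.0, '2023-01-20');
""")

# OLTP: narrow point query about one business object -- a single account.
balance = conn.execute(
    "SELECT balance FROM accounts WHERE account_id = ?", (1001,)).fetchone()

# OLAP: broad, read-only query that scans and aggregates months of history.
trend = conn.execute(
    "SELECT branch, strftime('%Y-%m', txn_date) AS month, SUM(amount) "
    "FROM withdrawals_history GROUP BY branch, month").fetchall()

print(balance, trend)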
Separation from your application database also ensures that your business intelligence solution is scalable (your
bank and ATMs don't go down just because the CFO asked for a report), better documented and managed.
Creation of a DW leads to a direct increase in the quality of analysis, as the table structures are simpler (you keep only
the needed information in simpler tables), standardized (well-documented table structures), and often de-
normalized (to reduce the linkages between tables and the corresponding complexity of queries). A well-designed
DW is the foundation upon which successful BI (Business Intelligence)/Analytics initiatives are built.
Data Warehouses usually store many months or years of data to support historical analysis. OLTP systems
usually store data from only a few weeks or months; an OLTP system keeps only as much historical data as is needed
to meet the requirements of the current transaction.

2. What are the key issues in Planning a Data Warehouse?
Ans: Key Issues in Data Warehouse Planning
Planning for your Data Warehouse begins with a thorough consideration of the key issues. Answers to the key
questions are vital for the proper planning and the successful completion of the project. Therefore, let us consider
the pertinent issues, one by one.
Values and Expectations. Some companies jump into Data Warehousing without assessing the value to be derived
from their proposed Data Warehouse. Of course, first you have to be sure that, given the culture and the current
requirements of your company, a Data Warehouse is the most viable solution. After you have established the
suitability of this solution, only then can you begin to enumerate the benefits and value propositions.
Risk Assessment. Planners generally associate project risks with the cost of the project. If the project fails, how
much money will go down the drain? But the assessment of risks is more than calculating the loss from the project
costs. What are the risks faced by the company without the benefits derivable from a Data Warehouse? What losses
are likely to be incurred? What opportunities are likely to be missed?
Differences between OLTP and Data Warehouse projects
The Data Warehouse and the OLTP database are both relational databases. However, the objectives of both these
databases are different.
The OLTP database records transactions in real time and aims to automate the clerical data entry processes of a
business entity. Addition, modification and deletion of data in the OLTP database are essential, and the semantics of
the front-end application influence the organization of the data in the database.
The Data Warehouse, on the other hand, does not cater to the real-time operational requirements of the enterprise.
It is more a storehouse of current and historical data, and may also contain data extracted from external data sources.
However, the Data Warehouse supports OLTP systems by providing a place for the latter to offload data as it
accumulates, and by providing services that would otherwise degrade the performance of the operational database.

3. Explain Source Data Component and Data Staging Components of Data Warehouse
Architecture.
Ans: Source Data Component
1. Production Data
2. Internal Data
3. Archived Data
4. External Data
Production Data This category of data comes from the various operational systems of the enterprise. Based on the
information requirements in the Data Warehouse, you choose segments of data from the different operational
systems. While dealing with this data, you come across many variations in the data formats. You also notice that the
data resides on different hardware platforms. Further, the data is supported by different database systems and
operating systems. This is the data from many vertical applications.
In operational systems, information queries are narrow. You query an operational system for information about
specific instances of business objects. You may want just the name and address of a single customer. Or, you may
need the orders placed by a single customer in a single week. Or, you may just need to look at a single invoice and
the items billed on that single invoice. In operational systems, you do not have broad queries. You do not query the
operational system in unexpected ways. The queries are all predictable. Again, you do not expect a particular query
to run across different operational systems. What does all this mean? There is no conformance of data among the
various operational systems of an enterprise. A term like "account" may have different meanings in different
systems.
The significant and disturbing characteristic of production data is disparity. Your great challenge is to standardize and
transform the disparate data from the various production systems, convert the data, and integrate the pieces into
useful data for storage in the Data Warehouse.
Internal Data In every organization, users keep their private spreadsheets, documents, customer profiles, and
sometimes even departmental databases. This is the internal data, parts of which could be useful for Data
Warehouse for analysis.
If your organization does business with the customers on a one-to-one basis and the contribution of each customer
to the bottom line is significant, then detailed customer profiles with ample demographics are important in a Data
Warehouse. Profiles of individual customers become very important for consideration. When your account
representatives talk to their assigned customers or when your marketing department wants to make specific
offerings to individual customers, you need the details. Although much of this data may be extracted from
production systems, individuals and departments in their private files hold a lot of it.
You cannot ignore the internal data held in private files in your organization. It is a collective judgment call on how
much of the internal data should be included in the Data Warehouse. The IT department must work with the user
departments to gather the internal data. Internal data adds additional complexity to the process of transforming and
integrating the data before it can be stored in the Data Warehouse. You have to determine strategies for collecting
data from spreadsheets, find ways of taking data from textual documents, and tie into departmental databases to
gather pertinent data from those sources. Again, you may want to schedule the acquisition of internal data. Initially,
you may want to limit yourself to only some significant portions before going live with your first data mart.
Archived Data Operational systems are primarily intended to run the current business. In every operational system,
you periodically take the old data and store it in archived files. The circumstances in your organization dictate how
often and which portions of the operational databases are archived for storage. Some data is archived after a year.
Sometimes data is left in the operational system databases for as long as five years. Many different methods of
archiving exist. There are staged archival methods. At the first stage, recent data is archived to a separate archival
database that may still be online. At the second stage, the older data is archived to flat files on disk storage. At the
next stage, the oldest data is archived to tape cartridges or microfilm and even kept off-site.
As mentioned earlier, a Data Warehouse keeps historical snapshots of data. You essentially need historical data for
analysis over time. For getting historical information, you look into your archived data sets. Depending on your Data
Warehouse requirements, you have to include sufficient historical data. This type of data is useful for detecting
patterns and analyzing trends.
External Data Most executives depend on data from external sources for a high percentage of the information they
use. They use statistics relating to their industry produced by external agencies. They use market share data of
competitors. They use standard values of financial indicators for their business to check on their performance.
For example, the Data Warehouse of a car rental company contains data on the current production schedules of the
leading automobile manufacturers. This external data in the Data Warehouse helps the car rental company plan for
their fleet management. The purposes served by such external data sources cannot be fulfilled by the data available
within your organization
itself. The insights gleaned from your production data and your archived data are somewhat limited. They give you a
picture based on what you are doing or have done in the past. In order to spot industry trends and compare
performance against other organizations, you need data from external sources.
Usually, data from outside sources do not conform to your formats. You have to do conversions of data into your
internal formats and data types. You have to organize the data transmissions from the external sources. Some
sources may provide information at regular, stipulated intervals. Others may give you the data on request. You need
to accommodate the variations.
Data Staging Component
After you have extracted data from various operational systems and from external sources, you have to prepare the
data for storing in the Data Warehouse. The extracted data coming from several disparate sources need to be
changed, converted, and made ready in a format that is suitable to be stored for querying and analysis.
Three major functions need to be performed for getting the data ready. You have to extract the data, transform the
data, and then load the data into the Data Warehouse storage. These three major functions of extraction,
transformation, and preparation for loading take place in a staging area. The data-staging component consists of a
workbench for these functions. Data staging provides a place and an area with a set of functions to clean, change,
combine, convert, deduplicate, and prepare source data for storage and use in the Data Warehouse.
Data Extraction This function has to deal with numerous data sources. You have to employ the appropriate
technique for each data source. Source data may be from different source machines in diverse data formats. Part of
the source data may be in relational database systems. Some data may be on other legacy network and hierarchical
data models. Many data sources may still be in flat files. You may want to include data from spreadsheets and local
departmental data sets. Data extraction may become quite complex.
Tools are available on the market for data extraction. You may want to consider using outside tools suitable for
certain data sources. For the other data sources, you may want to develop in-house programs to do the data
extraction. Purchasing outside tools may entail high initial costs. In-house programs, on the other hand, may mean
ongoing costs for development and maintenance.
After you extract the data, where do you keep the data for further preparation? You may perform the extraction
function in the legacy platform itself if that approach suits your framework. More frequently, Data Warehouse
implementation teams extract the source data into a separate physical environment from which moving the data into the
Data Warehouse would be easier. In the separate environment, you may extract the source data into a group of flat
files, or a data-staging relational database, or a combination of both.
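As a hedged sketch of an in-house extraction program (the database path, table name, and staging file layout are assumptions, not a standard), one common pattern is to extract a source table into a flat file in the staging area:

import csv
import sqlite3

def extract_table_to_staging(db_path, table, staging_file):
    # Connect to the source system and pull the chosen segment of data.
    conn = sqlite3.connect(db_path)
    cursor = conn.execute(f"SELECT * FROM {table}")    # full-table extract
    columns = [d[0] for d in cursor.description]
    # Land the raw, untransformed rows as a flat file for later preparation.
    with open(staging_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(cursor)
    conn.close()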
Data Transformation In every system implementation, data conversion is an important function. For example, when
you implement an operational system such as a magazine subscription application, you have to initially populate
your database with data from the prior system records. You may be converting over from a manual system. Or, you
may be moving from a file-oriented system to a modern system supported with relational database tables. In either
case, you will convert the data from the prior systems. So, what is so different for a Data Warehouse? How is data
transformation for a Data Warehouse more involved than for an operational system?
Again, as you know, data for a Data Warehouse comes from many disparate sources. If data extraction for a Data
Warehouse poses great challenges, data transformation presents even greater challenges. Another factor in the Data
Warehouse is that the data feed is not just an initial load. You will have to continue to pick up the ongoing changes
from the source systems. Any transformation tasks you set up for the initial load will be adapted for the ongoing
revisions as well.
You perform a number of individual tasks as part of data transformation. First, you clean the data extracted from
each source. Cleaning may just be correction of misspellings, or may include resolution of conflicts between state
codes and zip codes in the source data, or may deal with providing default values for missing data elements, or
elimination of duplicates when you bring in the same data from multiple source systems.
Standardization of data elements forms a large part of data transformation. You standardize the data types and field
lengths for same data elements retrieved from the various sources. Semantic standardization is another major task.
You resolve synonyms and homonyms. When two or more terms from different source systems mean the same
thing, you resolve the synonyms. When a single term means different things in different source systems, you
resolve the homonyms.
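A minimal sketch of these cleaning and standardization tasks (the field names, default values, and synonym map are illustrative assumptions):

# Map synonymous terms from different source systems onto one standard term.
SYNONYMS = {"cust": "customer", "client": "customer"}

def clean_record(raw):
    rec = dict(raw)
    # Default values for missing data elements.
    rec["state"] = (rec.get("state") or "UNKNOWN").strip().upper()
    rec["zip"] = (rec.get("zip") or "00000").strip()
    # Semantic standardization: resolve synonyms to one term.
    term = rec.get("entity_type", "").lower()
    rec["entity_type"] = SYNONYMS.get(term, term)
    return rec

def deduplicate(records, key_fields):
    # Eliminate duplicates brought in from multiple source systems.
    seen, unique = set(), []
    for rec in records:
        key = tuple(rec[k] for k in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique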
Data transformation involves many forms of combining pieces of data from the different sources. You combine data
from a single source record or related data elements from many source records. On the other hand, data
transformation also involves purging source data that is not useful and separating out source records into new
combinations. Sorting and merging of data takes place on a large scale in the data staging area.
In many cases, the keys chosen for the operational systems are field values with built-in meanings. For example, the
product key value may be a combination of characters indicating the product category, the code of the warehouse
where the product is stored, and some code to show the production batch. Primary keys in the Data Warehouse
cannot have built-in meanings. Data transformation also includes the assignment of surrogate keys derived from the
source system primary keys.
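A sketch of surrogate key assignment (a simple in-memory mapping; a production warehouse would persist the key lookup table):

class SurrogateKeyGenerator:
    def __init__(self):
        self.mapping = {}        # source primary key -> surrogate key
        self.next_key = 1

    def get_key(self, source_key):
        # Assign a meaningless sequential key the first time a source
        # key is seen; return the same surrogate on every later load.
        if source_key not in self.mapping:
            self.mapping[source_key] = self.next_key
            self.next_key += 1
        return self.mapping[source_key]

keys = SurrogateKeyGenerator()
# 'ELEC-WH03-B417' has category, warehouse and batch built into it;
# the surrogate key carries no meaning at all.
print(keys.get_key("ELEC-WH03-B417"))   # 1
print(keys.get_key("FURN-WH01-B002"))   # 2
print(keys.get_key("ELEC-WH03-B417"))   # 1 again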
A grocery chain point-of-sale operational system keeps the unit sales and revenue amounts by individual
transactions at the checkout counter at each store. But in the Data Warehouse, it may not be necessary to keep the
data at this detailed level. You may want to summarize the totals by product at each store for a given day and keep
the summary totals of the sale units and revenue in the Data Warehouse storage. In such cases, the data
transformation function would include appropriate summarization.
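For the grocery chain example, a summarization step might look like this sketch (the record layout is assumed):

from collections import defaultdict

def summarize_daily_sales(transactions):
    # Roll up checkout-level records to (store, product, day) totals.
    totals = defaultdict(lambda: {"units": 0, "revenue": 0.0})
    for t in transactions:
        key = (t["store"], t["product"], t["date"])
        totals[key]["units"] += t["units"]
        totals[key]["revenue"] += t["revenue"]
    return dict(totals)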
When the data transformation function ends, you have a collection of integrated data that is cleaned, standardized,
and summarized. You now have data ready to load into each data set in your Data Warehouse.
Data Loading Two distinct groups of tasks form the data loading function. When you complete the design and
construction of the Data Warehouse and go live for the first time, you do the initial loading of the data into the Data
Warehouse storage. The initial load moves large volumes of data using up substantial amounts of time. As the Data
Warehouse starts functioning, you continue to extract the changes to the source data, transform the data revisions,
and feed the incremental data revisions on an ongoing basis.
[Figure: common types of data movements from the staging area to the Data Warehouse storage.]
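A rough sketch of the two load modes (the fact table layout is an assumption, and a unique key over the first three columns is presumed so that revisions replace earlier rows):

import sqlite3

def initial_load(dw_conn, staged_rows):
    # One-time bulk movement of large volumes into warehouse storage.
    dw_conn.executemany(
        "INSERT INTO sales_fact (product_key, store_key, day, units, revenue)"
        " VALUES (?, ?, ?, ?, ?)", staged_rows)
    dw_conn.commit()

def incremental_load(dw_conn, changed_rows):
    # Ongoing feed: apply only the data revisions captured since last run.
    dw_conn.executemany(
        "INSERT OR REPLACE INTO sales_fact"
        " (product_key, store_key, day, units, revenue)"
        " VALUES (?, ?, ?, ?, ?)", changed_rows)
    dw_conn.commit()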

4. Discuss the Extraction Methods in Data Warehouses.
Ans: The extraction method you choose depends highly on the source system and also on the business needs in the
target Data Warehouse environment. Very often, it is not possible to add the logic needed for incremental extraction
to the source systems, because of the performance impact or the increased workload on these systems. Sometimes
the customer is not even allowed to add anything to an out-of-the-box application.
The estimated amount of the data to be extracted and the stage in the ETL process (initial load or maintenance of
data) may also impact the decision of how to extract, from a logical and a physical perspective. Basically, you have to
decide how to extract data logically and physically.
Logical Extraction Methods
There are two kinds of logical extraction:
Full Extraction
Incremental Extraction
Full Extraction
The data is extracted completely from the source system. Since this extraction reflects all the data currently available
on the source system, there's no need to keep track of changes to the data source since the last successful
extraction. The source data will be provided as-is and no additional logical information (for example, timestamps) is
necessary on the source site. An example of a full extraction may be an export file of a distinct table or a remote
SQL statement scanning the complete source table.
Incremental Extraction
At a specific point in time, only the data that has changed since a well-defined event back in history will be extracted.
This event may be the last time of extraction or a more complex business event like the last booking day of a fiscal
period. To identify this delta change, there must be a way to identify all the information that has changed since this
specific time event. This information can either be provided by the source data itself, such as an application column
reflecting the last-changed timestamp, or by a change table in which an additional mechanism keeps track of
the changes alongside the originating transactions. In most cases, using the latter method means adding extraction
logic to the source system.
Many Data Warehouses do not use any change-capture techniques as part of the extraction process. Instead, entire
tables from the source systems are extracted to the data warehouse or staging area, and these tables are compared
with a previous extract from the source system to identify the changed data.
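Two hedged sketches of these approaches (the orders table and its last_changed column are assumptions): the first pulls only rows changed since the last successful extraction, the second compares a full extract against the previous one.

def extract_delta(source_conn, last_extract_time):
    # Incremental extraction using a last-changed timestamp column.
    cursor = source_conn.execute(
        "SELECT * FROM orders WHERE last_changed > ?", (last_extract_time,))
    return cursor.fetchall()

def diff_full_extracts(previous_rows, current_rows, key_index=0):
    # Full extraction with change detection by comparing two extracts.
    previous = {row[key_index]: row for row in previous_rows}
    return [row for row in current_rows
            if previous.get(row[key_index]) != row]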
Physical Extraction Methods
Depending on the chosen logical extraction method and the capabilities and restrictions on the source side, the
extracted data can be physically extracted by two mechanisms. The data can either be extracted online from the
source system or from an offline structure. Such an offline structure might already exist or it might be generated by
an extraction routine.
The following are the methods of physical extraction:
Online Extraction
Offline Extraction
Online Extraction
The data is extracted directly from the source system itself. The extraction process can connect directly to the source
system to access the source tables themselves or to an intermediate system that stores the data in a reconfigured
manner (for example, snapshot logs or change tables). Note that the intermediate system is not necessarily
physically different from the source system. With online extractions, you need to consider whether the distributed
transactions are using original source objects or prepared source objects.
Offline Extraction
The data is not extracted directly from the source system but is staged explicitly outside the original source system.
The data already has an existing structure (for example, redo logs, archive logs or transportable tablespaces) or was
created by an extraction routine.
You should consider the following structures:
Flat Files:
Data is in a defined, generic format. Additional information about the source object is necessary for further
processing.
Dump Files:
An Oracle-specific format in which the information about the containing objects is included.
Redo and Archive Logs:
Redo logs comprise files in a proprietary format which log a history of all changes made to the database. Each redo
log file consists of redo records. A redo record (redo entry) holds a group of change vectors, each of which
describes or represents a change made to a single block in the database.
For example, if a user UPDATEs a salary value in an employee table, the DBMS generates a redo record containing
change vectors that describe the changes to the data segment block for the table. If the user then COMMITs the
update, Oracle generates another redo record and assigns the change a "system change number" (SCN).
A single transaction may involve multiple changes to data blocks, so it may have more than one redo record.
A group of redo log files can be copied to one or more offline destinations, known collectively as the archived redo
log, or more simply the archive log. The process of turning redo log files into archived redo log files is called
archiving. This process is only possible if the database is running in ARCHIVELOG mode. You can choose automatic
or manual archiving.

5. Write short notes on (i) RAID 0 (ii)RAID 1
Ans. (i) RAID 0
A RAID 0 (also known as a stripe set or striped volume) splits data evenly across two or more disks
(striped) without parity information for speed. RAID 0 was not one of the original RAID levels and provides
no data redundancy. RAID 0 is normally used to increase performance, although it can also be used as a way
to create a large logical disk out of two or more physical ones.
A RAID 0 can be created with disks of differing sizes, but the storage space added to the array by each disk
is limited to the size of the smallest disk. For example, if a 100 GB disk is striped together with a 350 GB
disk, the size of the array will be 200 GB (100 GB × 2).

[Diagram: data distributed in stripes A1, A2, A3, ... across two disks.] Accessing the stripes in the order
A1, A2, A3, ... provides the illusion of a single larger and faster drive. Once the stripe size is defined at
creation time, it must be maintained at all times.
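A small sketch of the striping arithmetic (round-robin placement, two disks, and a fixed stripe size are assumed):

def locate_stripe(logical_stripe, num_disks=2):
    # RAID 0 address mapping: stripes alternate across the member disks.
    disk = logical_stripe % num_disks       # which disk holds this stripe
    offset = logical_stripe // num_disks    # position of the stripe on it
    return disk, offset

# Stripes A1..A4 land alternately on disk 0 and disk 1:
for i in range(4):
    disk, offset = locate_stripe(i)
    print(f"A{i + 1} -> disk {disk}, offset {offset}")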
Performance
RAID 0 is also used in areas where performance is desired and data integrity is not very important, for
example in some computer gaming systems. However, real-world tests with computer games have shown
that RAID 0 performance gains are minimal, although some desktop applications will benefit. Another
article examined these claims and concluded: "Striping does not always increase performance (in certain
situations it will actually be slower than a non-RAID setup), but in most situations it will yield a significant
improvement in performance."
(ii) RAID 1
RAID 1 maintains an exact copy (or mirror) of a set of data on two disks. This is useful when read performance or
reliability is more important than data storage capacity. Such an array can only be as big as the smallest member
disk. A classic RAID 1 mirrored pair contains two disks.
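A toy sketch of mirroring (Python lists stand in for the two member disks):

class Raid1Mirror:
    def __init__(self, blocks):
        # Usable capacity equals one member disk, not the sum of both.
        self.disks = [[None] * blocks, [None] * blocks]

    def write(self, block, data):
        for disk in self.disks:     # every write is duplicated to both disks
            disk[block] = data

    def read(self, block, from_disk=0):
        # Either copy can serve a read; if one disk fails, use the other.
        return self.disks[from_disk][block]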

6. What is Metadata Management? Explain Integrated Metadata Management with a
block diagram
Ans: The purpose of Metadata management is to support the development and administration of the data warehouse
infrastructure as well as the analysis of the data over time.
Metadata is widely considered a promising driver for improving the effectiveness and efficiency of data warehouse
usage, development, maintenance and administration. Data warehouse usage can be improved because metadata
provides end users with the additional semantics necessary to reconstruct the business context of data stored in the
data warehouse.
Integrated Metadata Management
An integrated Metadata Management system supports all kinds of users who are involved in the data warehouse
development process: end users, developers and administrators can all use and see the Metadata. Developers and
administrators focus mainly on technical Metadata but make use of business Metadata when they need it. Developers and
administrators need metadata to understand transformations of object data and the underlying data flows, as well as
the technical and conceptual system architecture.

Several Metadata management systems are in existence. One such system/tool is the Integrated Metadata Repository
System (IMRS). It is a metadata management tool used to support a corporate data management function and is
intended to provide metadata management services. Thus, the IMRS will support the engineering and configuration
management of data environments incorporating e-business transactions, complex databases, federated data
environments, and data warehouses / data marts. The metadata contained in the IMRS is used to support application
development, data integration, and the system administration functions needed to achieve data element semantic
consistency across a corporate data environment, and to implement integrated or shared data environments.
Metadata management has several subprocesses, much as data warehouse development does.
Some of them are listed below:
Metadata definition
Metadata collection
Metadata control
Metadata publication to the right people at the right time.
Determining what kind of metadata is to be captured.
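As a toy illustration of the kind of technical and business metadata such a repository records (the fields are chosen for illustration and are not taken from the IMRS specification):

from dataclasses import dataclass

@dataclass
class TableMetadata:
    # Technical metadata: where the data lives and how it is structured.
    name: str
    source_system: str
    columns: dict                  # column name -> data type
    # Business metadata: the semantics end users need.
    business_definition: str = ""
    owner: str = ""

repository = {}                    # the integrated repository itself

def register(meta):
    repository[meta.name] = meta   # publication point for all user groups

register(TableMetadata(
    name="sales_fact",
    source_system="POS",
    columns={"product_key": "INTEGER", "units": "INTEGER"},
    business_definition="Daily unit sales by product and store",
    owner="Sales Analytics"))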
