A) Data mining is used to estimate future outcomes. For example, a company
or business organization can use data mining to predict its future revenue,
employee numbers, customers, orders, and so on.
Traditional approaches use simple algorithms to estimate the future, but
they give less accurate results than data mining.
2A) View - stores the SQL statement in the database and lets you use it as a
table. Every time you access the view, the SQL statement executes.
Materialized view - stores the results of the SQL in table form in the
database. The SQL statement executes only once (until the materialized view
is refreshed); after that, every query against it reads the stored result
set. The main advantage is quick query results.
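The difference can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: SQLite has no materialized views, so a snapshot table (mv_total, a made-up name) stands in for one, and the orders table and refresh_snapshot helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO orders (amount) VALUES (10.0), (20.0);

    -- A view stores only the query text; it runs on every access.
    CREATE VIEW v_total AS SELECT SUM(amount) AS total FROM orders;

    -- Stand-in for a materialized view: the result is stored as rows.
    CREATE TABLE mv_total AS SELECT SUM(amount) AS total FROM orders;
""")

# New data arrives after both objects were created.
conn.execute("INSERT INTO orders (amount) VALUES (30.0)")

# The view re-executes its query, so it sees the new row.
view_total = conn.execute("SELECT total FROM v_total").fetchone()[0]
# The snapshot still holds the result computed at creation time.
snapshot_total = conn.execute("SELECT total FROM mv_total").fetchone()[0]


def refresh_snapshot(connection):
    """Recompute the stored result, as a materialized-view refresh would."""
    connection.executescript("""
        DELETE FROM mv_total;
        INSERT INTO mv_total SELECT SUM(amount) FROM orders;
    """)


refresh_snapshot(conn)
refreshed_total = conn.execute("SELECT total FROM mv_total").fetchone()[0]
```

The stored result is fast to read but goes stale until it is refreshed, which is the trade-off against a plain view.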
2B) VIEW: This is a pseudo-table whose rows are not stored in the database;
it is just a stored query.
According to Kimball...
i.e.,
3C) The main difference between the Kimball and Inmon methodologies is...
Kimball: create the data marts first, then combine them to form a data
warehouse.
Columns that are rarely used or not used at all, typically miscellaneous
flags and indicators, are grouped into a single dimension called a junk
dimension.
Degenerate dimension
Example:
But
We take only the columns empno and ename from the EMP table and form a
dimension from them; this is called a degenerate dimension.
6B) Basically, the fact table consists of the index keys of the dimension
(lookup) tables and the measures.
So whenever a table holds those keys, that in itself implies that the table
is in normal form.
7A) Basic diff is E-R modeling will have logical and physical model.
Dimensional model will have only physical model.
7B) E-R modeling revolves around the Entities and their relationships to
capture the overall process of the system.
8B) Conformed facts are allowed to have the same name in separate tables
and can be combined and compared mathematically.
8C) When the relationships between the facts and dimensions are in 3NF and
work with any type of join, the schema is called a conformed schema, and the
members of that schema are called so...
8D) Conformed dimensions are those tables that have a fixed structure.
There will be no need to change the metadata of these tables and they can go
along with any number of facts in that application without any changes
8E) A dimension table which is used by more than one fact table is known as
a conformed dimension.
9B) Most of the time, we use Ralph Kimball's methodologies for data
warehouse design. There are two kinds of schema: star and snowflake.
2. Inmon Model.
2>>Bottom up method
The top-down method loads the data warehouse first and then loads the data
marts.
The bottom-up method loads the data marts first and then loads the data
warehouse.
9E) The top-down approach is data warehouse first, then data marts: the
individual departments' data (data marts) is prepared from the enterprise
data warehouse.
The bottom-up approach first gathers each department's data, cleanses and
transforms it, and then loads all the individual departments' data into the
enterprise data warehouse.
11A) Hierarchies are logical structures that use ordered levels as a means of
organizing data. A hierarchy can be used to define data aggregation. For
example, in a time dimension, a hierarchy might aggregate data from the
month level to the quarter level to the year level. A hierarchy can also be
used to define a navigational drill path and to establish a family structure.
Within a hierarchy, each level is logically connected to the levels above and
below it. Data values at lower levels aggregate into the data values at higher
levels. A dimension can be composed of more than one hierarchy. For
example, in the product dimension, there might be two hierarchies--one for
product categories and one for product suppliers.
Levels
A level represents a position in a hierarchy. For example, a time dimension
might have a hierarchy that represents data at the month, quarter, and year
levels. Levels range from general to specific, with the root level as the
highest or most general level. The levels in a dimension are organized into
one or more hierarchies.
Level Relationships
Level relationships specify top-to-bottom ordering of levels from most
general (the root) to most specific information. They define the parent-child
relationship between the levels in a hierarchy.
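The month, quarter, year aggregation described above can be sketched in a few lines of Python. The sales figures and the roll_up and quarter_of helpers are made up for illustration; a real warehouse would normally define the hierarchy in the dimension itself.

```python
from collections import defaultdict

# Hypothetical monthly sales keyed by (year, month), the lowest level.
monthly_sales = {
    (2001, 1): 100, (2001, 2): 120, (2001, 3): 80,
    (2001, 4): 90, (2001, 12): 110,
    (2002, 1): 130,
}


def quarter_of(month):
    """Map a month number (1-12) to its parent quarter (1-4)."""
    return (month - 1) // 3 + 1


def roll_up(values, parent_of):
    """Aggregate child-level values into their parent level."""
    totals = defaultdict(int)
    for child_key, value in values.items():
        totals[parent_of(child_key)] += value
    return dict(totals)


# Month level -> quarter level -> year level.
quarterly_sales = roll_up(monthly_sales, lambda key: (key[0], quarter_of(key[1])))
yearly_sales = roll_up(quarterly_sales, lambda key: key[0])
```

Each call to roll_up moves one level up the hierarchy, so drilling down is simply reading the lower-level dictionary again.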
12) What are data validation strategies for data mart v...?
12A) Data validation is to make sure that the loaded data is accurate and
meets the business requirements.
13) What are the data types present in BO? And what happens I...
View is nothing but an alias and it can be used to resolve the loops in the
universe.
13B) To my knowledge, these are called object types in Business Objects.
An alias is different from a view in the universe. A view is defined at the
database level, whereas an alias is a different name given to the same table
to resolve loops in the universe.
13C) The data types in Business Objects are:
1. Character
2. Date
3. Long text
4. Number
14A) Surrogate key is the primary key for the Dimensional table.
It is just a unique identifier or number for each row that can be used for the
primary key to the table. The only requirement for a surrogate primary key is
that it is unique for each row in the table.
On the 1st of January 2002, employee 'E1' belongs to business unit 'BU1'
(that's what would be in your employee dimension). This employee has
turnover allocated to him under business unit 'BU1', but on the 2nd of June
employee 'E1' is moved from business unit 'BU1' to business unit 'BU2'. All
new turnover has to belong to the new business unit 'BU2', but the old
turnover should still belong to business unit 'BU1'.
If you used the natural business key 'E1' for your employee within your data
warehouse everything would be allocated to Business Unit 'BU2' even what
actually belongs to 'BU1.'
If you use surrogate keys, you could create on the 2nd of June a new record
for the Employee 'E1' in your Employee Dimension with a new surrogate
key.
This way, in your fact table, you have your old data (before 2nd of June)
with the SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June)
would take the SID of the employee 'E1' + 'BU2.'
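The E1 / BU1 / BU2 example can be sketched as follows. This is a minimal illustration, not a production design: the validity dates, the turnover amounts, and the move_employee and sid_on helpers are all assumptions added here.

```python
from datetime import date

# Employee dimension: one row per version of an employee.
# "sid" is the surrogate key; "emp_id" is the natural business key.
employee_dim = [
    {"sid": 1, "emp_id": "E1", "business_unit": "BU1",
     "valid_from": date(2002, 1, 1), "valid_to": None},
]


def move_employee(dim, emp_id, new_unit, change_date):
    """Close the current row and add a new version with a new surrogate key."""
    current = next(r for r in dim
                   if r["emp_id"] == emp_id and r["valid_to"] is None)
    current["valid_to"] = change_date
    new_row = {"sid": max(r["sid"] for r in dim) + 1, "emp_id": emp_id,
               "business_unit": new_unit,
               "valid_from": change_date, "valid_to": None}
    dim.append(new_row)
    return new_row["sid"]


def sid_on(dim, emp_id, day):
    """Find the surrogate key that was valid for emp_id on the given day."""
    for r in dim:
        if (r["emp_id"] == emp_id and r["valid_from"] <= day
                and (r["valid_to"] is None or day < r["valid_to"])):
            return r["sid"]
    raise LookupError(f"no dimension row for {emp_id} on {day}")


move_employee(employee_dim, "E1", "BU2", date(2002, 6, 2))

# Hypothetical turnover rows: (transaction date, natural key, amount).
source_rows = [(date(2002, 3, 15), "E1", 500), (date(2002, 7, 1), "E1", 700)]
fact_rows = [{"employee_sid": sid_on(employee_dim, emp, day), "turnover": amount}
             for day, emp, amount in source_rows]

# Reporting by business unit goes through the surrogate key.
unit_by_sid = {r["sid"]: r["business_unit"] for r in employee_dim}
turnover_by_unit = {}
for fact in fact_rows:
    unit = unit_by_sid[fact["employee_sid"]]
    turnover_by_unit[unit] = turnover_by_unit.get(unit, 0) + fact["turnover"]
```

Because each fact row carries the surrogate key that was valid on its transaction date, the March turnover stays with BU1 and the July turnover goes to BU2.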
14G) Surrogate keys are the keys that join the dimension tables to the fact
table.
In a linked cube, the data cells can be linked to another analytical
database. If an end user clicks on a data cell, they are actually linking
through to another analytical database.
15B) A linked cube is one in which a subset of the data can be analyzed in
great detail. The linking ensures that the data in the cubes remains
consistent.
16A) Partitioning a cube is mainly used for optimization. For example, you
may have 5 GB of data for a report and specify a cube size of 2 GB; if the
cube exceeds 2 GB, a second cube is automatically created to store the rest
of the data.
17B) Metadata is data about data. A business analyst or data modeler
usually captures information about the data - the source (where and how the
data originated), the nature of the data (char, varchar, nullable,
existence, valid values, etc.) and the behavior of the data (how it is
modified or derived, and its life cycle) - in a data dictionary, a.k.a.
metadata. Metadata is also kept at the data mart level and for subsets,
facts and dimensions, the ODS, etc. For a DW user, metadata provides vital
information for analysis / DSS.
17C) Metadata is data about data; it includes things such as name, location,
and length.
Metadata can be stored in the data warehouse.
18A) Incremental loading means loading the ongoing changes in the OLTP.
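One common way to pick up only the ongoing changes is a "high-water mark" on a last-modified timestamp. The sketch below assumes the OLTP rows carry an updated_at column, which the answer above does not state; the row data and the incremental_load function are hypothetical.

```python
from datetime import datetime

# Hypothetical OLTP rows; "updated_at" is an assumed change-tracking column.
source_rows = [
    {"id": 1, "amount": 10, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "amount": 20, "updated_at": datetime(2024, 1, 1, 10, 0)},
    {"id": 3, "amount": 30, "updated_at": datetime(2024, 1, 2, 8, 0)},
]


def incremental_load(source, target, last_loaded):
    """Copy rows changed after last_loaded; return (new watermark, rows loaded)."""
    changed = [row for row in source if row["updated_at"] > last_loaded]
    for row in changed:
        target[row["id"]] = dict(row)  # insert new keys, overwrite existing ones
    new_watermark = max((row["updated_at"] for row in changed),
                        default=last_loaded)
    return new_watermark, len(changed)


warehouse = {}
# First run: everything is newer than the initial watermark.
watermark, first_count = incremental_load(source_rows, warehouse, datetime.min)

# Simulate ongoing OLTP changes: one update and one new row.
source_rows[0].update(amount=15, updated_at=datetime(2024, 1, 3, 12, 0))
source_rows.append({"id": 4, "amount": 40,
                    "updated_at": datetime(2024, 1, 3, 13, 0)})

# Second run: only the two changed rows are loaded.
watermark, second_count = incremental_load(source_rows, warehouse, watermark)
```

Other change-capture mechanisms (triggers, log-based replication) serve the same purpose; the timestamp approach is just the simplest to show.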
18B) Please learn to spell incremental and cross reference first! Or at least
use a spell check!
19) What are the possible data marts in Retail sales....?
20) What is the main difference between schema in RDBMS and schemas in
Data Warehouse....?
DWH Schema
* Used for OLAP systems
* New generation schema
* De Normalized
* Easy to understand and navigate
* Extract and complex problems can be easily solved
* Very good model
20D) RDBMS-normalized
OLTP Schema:
* Normalized
* More transactions
* Less time for query execution
* More users
* Has insert, delete and update transactions
DWH Schema:
* Denormalized
* Fewer transactions
* Fewer users
* More time for query execution
* Will not have many inserts, deletes and updates
Informatica
Data Stage
Oracle Warehouse Builder
Ab Initio
Data Junction
21E)
Informatica
Data Stage
MS-SQL DTS (Integrated Services 2005)
Abinitio
SQL Loader
Sunopsis
Oracle Warehouse Builder
Data Junction
Data Integrator (Business Objects)
21F) Has anyone come across the ETL tool "Sunopsis"? If not, please check
this URL. It is amazing...
http://www.sunopsis.com
Data modeling is probably the most labor intensive and time consuming part
of the development process. Why bother especially if you are pressed for
time? A common response by practitioners who write on the subject is that
you should no more build a database without a model than you should build
a house without blueprints.
The goal of the data model is to make sure that all the data objects required
by the database are completely and accurately represented. Because the data
model uses easily understood notations and natural language, it can be
reviewed and verified as correct by the end users.
23D) VLDB stands for Very Large Database. Any database that is very large
(normally more than 1 TB) is considered a VLDB.
24B) A real-time data warehouse provides live data for DSS (it may not be
100% up to the moment; some latency will exist). The data warehouse has
access to the OLTP sources, and data is loaded from source to target not
daily or weekly but perhaps every 10 minutes, through replication, log
shipping or something similar. SAP BW provides a real-time DW; with the help
of the extended star schema, source data is shared.
Reduce or eliminate the time taken to get new and changed data out of your
source systems.
Eliminate, or reduce as much as possible, the time required to cleanse,
transform and load your data.
Reduce as much as possible the time required to update your aggregates.
Starting with version 9i, and continuing with the latest 10g release, Oracle
has gradually introduced features into the database to support real-time and
near-real-time, data warehousing. These features include:
25A) When a table is used to check whether some data is present before
loading other data (or the same data) into another table, that table is
called a lookup table.
25B) When the value for a column in the target table is looked up from a
table other than the source tables, that table is called the lookup table.
25C) A lookup is used when we want to get a related value from another table
based on a particular value. Suppose table A has two columns, emp_id and
name, and table B has emp_id and address, and the target table should have
emp_id, name and address. We take table A as the source and table B as the
lookup table; by matching on emp_id we get the three columns emp_id, name
and address.
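The table A / table B example in 25C can be sketched in Python, with a dictionary playing the role of the lookup table. The sample rows are made up; only the column names emp_id, name and address come from the answer.

```python
# Table A (source): emp_id, name. Table B (lookup): emp_id, address.
table_a = [
    {"emp_id": 1, "name": "Ann"},
    {"emp_id": 2, "name": "Bob"},
    {"emp_id": 3, "name": "Cy"},
]
table_b = [
    {"emp_id": 1, "address": "12 Oak St"},
    {"emp_id": 2, "address": "9 Elm Rd"},
]

# Index the lookup table by the matching key for fast access.
address_by_emp_id = {row["emp_id"]: row["address"] for row in table_b}

# Build the target rows; a missing lookup match yields None for address.
target = [
    {"emp_id": row["emp_id"], "name": row["name"],
     "address": address_by_emp_id.get(row["emp_id"])}
    for row in table_a
]
```

How to treat a row with no match (here emp_id 3 gets address None) is a design choice; ETL tools usually let you default, reject, or flag such rows.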
25F) When a table is used to check whether some data is present before
loading other data (or the same data) into another table, that table is
called a lookup table.
Since the index key in the fact table references a particular dimension
table, the dimension table is also called a lookup table.
25I) The lookup table provides detailed information about the attributes.
For example, the lookup table for the quarter attribute would include a list
of all the quarters available in the data warehouse; the first quarter of
2001 may be represented as "Q1 2001" or "2001 Q1".
26A) The general purpose of a scheduling tool is to run jobs such as
cleansing and loading data at a specified time.
27) What type of Indexing mechanism do we need to use for a typical data
warehouse
27B) Function Index, B-tree Index, Partition Index, Hash index etc...
27C) on the fact table it is best to use bitmap indexes. Dimension tables can
use bitmap and/or the other types of clustered/non-clustered, unique/non-
unique indexes.
28) Explain the advantages of RAID 1, 1/0, and 5. What type of RAID setup
would you put your TX logs
28A) RAID 0 - Striping. Makes several physical hard drives look like one
hard drive. No redundancy, but very fast. May be used for temporary space
where loss of the files will not result in loss of committed data.
RAID 1 - Mirroring. Each hard drive in the drive array has a twin. Each twin
holds an exact copy of the other twin's data, so if one hard drive fails,
the other is used to read the data. RAID 1 is about half the speed of RAID 0,
but its read and write performance is still good.
31B) In its simplest definition, data mining is a way to discover new
meaning in data.
32B) An attribute in the fact table that is neither a fact nor a key value.
32C) In simple terms, a column in a fact table that neither maps to any
dimension nor is a measure column.
33C) The time dimension in a DWH must be loaded manually. We load data into
the time dimension using PL/SQL scripts.
33D) Generally we load the time dimension using a sequential file (a passive
stage) as the source stage together with a Transformer stage, in which we
manually write month and year functions to load the time dimension; for the
lowest level, day, there is also a function for loading the time dimension.
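The answers above use PL/SQL scripts or a DataStage job. As a tool-neutral illustration of the same idea, here is a small Python sketch that generates one time dimension row per day; the column names and the build_time_dimension function are assumptions, not taken from any of those tools.

```python
from datetime import date, timedelta


def build_time_dimension(start, end):
    """Generate one row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),  # e.g. 20010131
            "full_date": day,
            "day": day.day,
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
        })
        day += timedelta(days=1)
    return rows


time_dim = build_time_dimension(date(2001, 1, 1), date(2001, 12, 31))
```

The rows are derived purely from the calendar, which is why the time dimension is usually generated by a script rather than extracted from a source system.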
34A) ER stands for entity-relationship. An ER diagram is the first step in
the design of a data model, which later leads to the physical database
design of an OLTP or OLAP database.
Simply stated, the ER model is a conceptual data model that views the real
world as entities and relationships. A basic component of the model is the
entity-relationship diagram, which is used to visually represent data
objects. Since Chen wrote his paper, the model has been extended, and today
it is commonly used for database design. For the database designer, the
utility of the ER model is:
It maps well to the relational model. The constructs used in the ER model
can easily be transformed into relational tables.
It is simple and easy to understand with a minimum of training. Therefore,
the model can be used by the database designer to communicate the design
to the end user.
34D) An E-R diagram (entity-relationship diagram) shows how the different
database tables relate to each other, what their primary and foreign keys
are, and how those keys are related.
34E) The physical and logical arrangement of the database tables and their
relationships is explained by a diagram; that diagram is known as an ER
diagram.
35) Difference between Snow flake and Star Schema. What are situations
where Snow flake Schema is better?
35A) Star schema and snowflake schema both serve the purpose of dimensional
modeling when it comes to data warehouses.
A star schema is a dimensional model with a (large) fact table and a set of
(small) dimension tables. The whole set-up is totally denormalized.
However, when the dimension tables are split into many tables, so that the
schema leans slightly towards normalization (reducing redundancy and
dependency), it becomes a snowflake schema.
The nature and purpose of the data to be fed to the model is the key to
your question as to which is better.
35B) Star schema contains the dimension tables mapped around one or more
fact tables.
It is a denormalised model.
Snowflake means
2) Simply database
It is used to maintain, store the current and up to date information and the
transactions regarding the source databases taken from the OLTP system.
It is directly connected to the source database systems instead of to the
staging area.
Edit by Admin: ODS Stands for Operational Data Store not Online Data
Storage
37C) ODS stands for Operational Data Store. It contains near real time data.
In typical data warehouse architecture, sometimes ODS is used for analytical
reporting as well as source for Data Warehouse.
37D) An Operational Data Store is a hybrid structure that has some aspects of
a data warehouse and other aspects of an operational system.
Contains integrated data.
It can support DSS processing.
It can also support high transaction processing.
Placed in between the warehouse and the web to support web users.
37E) The form that a data warehouse takes in the operational environment.
Operational data stores can be updated, provide rapid and consistent
response times, and contain only a limited amount of historical data.
37F) An Operational Data Store presents a consistent picture of the current
data stored and managed by transaction processing system. As data is
modified in the source system, a copy of the changed data is moved into the
ODS. Existing data in the ODS is updated to reflect the current status of the
source system
Current data means particular data from one date into one date
38A) They are dimension tables in a star schema data mart that adhere to a
common structure, and therefore allow queries to be executed across star
schemas. For example, the Calendar dimension is commonly needed in most
data marts. By making this Calendar dimension adhere to a single structure,
regardless of which data mart in your organization it is used in, you can
query by date/time from one data mart to another.
Consider Cube-1 contains F1, D1, D2, D3 and Cube-2 contains F2,D1, D2,
D4 are the Facts and Dimensions
Here D1, D2 are the Conformed Dimensions
38C) If a table is used as a dimension table for more than one fact table,
then that dimension table is called a conformed dimension.
38E) Dimensions are conformed if they share one or more attributes whose
values are drawn from the same domains.
38F) A dimension that is used by more than one fact table is called a
conformed dimension.
38H) Conformed dimensions are dimensions that are common to two cubes. Say
CUBE-1 contains F1, D1, D2, D3 and CUBE-2 contains F2, D1, D2, D4 as facts
and dimensions; here D1 and D2 are the conformed dimensions.
38I) If a dimension is 100% sharable across the star schemas, then it is
called a conformed dimension.
39B) SCD Type 1: the attribute value is overwritten with the new value,
obliterating the historical attribute values. For example, when the product
roll-up changes for a given product, the roll-up attribute is merely updated
with the current value.
SCD Type 2: a new record with the new attributes is added to the dimension
table. Historical fact table rows continue to reference the old dimension key
with the old roll-up attribute; going forward, the fact table rows will
reference the new surrogate key with the new roll-up, thereby perfectly
partitioning history.
SCD Type 3: attributes are added to the dimension table to support two
simultaneous roll-ups - perhaps the current product roll-up as well as
"current version minus one", or current version and original.
39C) SCD: Dimensions whose values change very rarely are called slowly
changing dimensions.
Here mainly 3
1) Versioning
2) Flag value
The new dimensions will be inserted into the target along with Primary key
Flagvalue: The updated dimensions insert into the target along with 0
40) What is Normalization, First Normal Form, Second Normal Form, and
Third Normal Form?
3NF - the table should be in 2NF, and no non-key attribute should depend on
another non-key attribute ({part}, warehouse name, warehouse addr)
{Primary key}
4, 5 NF - for multi-valued dependencies (essentially to describe many-to-
many relations)
Second normal form: the non-key values must depend on the whole primary key.
41B) ETL is short for Extract, Transform and Load. It is a data integration
function that involves extracting data from outside sources, transforming
it to fit business needs, and ultimately loading it into a data warehouse.
41C) ETL is an abbreviation for "Extract, Transform and Load". This is the
process of extracting data from operational or external data sources,
transforming the data (which includes cleansing, aggregation, summarization
and integration, as well as basic transformation), and loading the data into
some form of data warehouse.
42A) Non-additive facts are facts that cannot be summed up across any of the
dimensions present in the fact table. Examples: temperature, bill number,
etc.
42B) A fact table typically has two types of columns: those that contain
numeric facts (often called measurements), and those that are foreign keys to
dimension tables.
A fact table contains either detail-level facts or facts that have been
aggregated. Fact tables that contain aggregated facts are often called
summary tables. A fact table usually contains facts with the same level of
aggregation.
Though most facts are additive, they can also be semi-additive or non-
additive. Additive facts can be aggregated by simple arithmetical addition. A
common example of this is sales. Non-additive facts cannot be added at all.
42C) If the columns of a fact table cannot be aggregated, they are called
non-additive facts.
44) Why should you put your data warehouse on a different system than
your OLTP system?
OLTP systems are used to store only daily transactions, and they are designed
so that changes have to be made in as few places as possible. OLTP systems
do not hold the historical data of the organization.
44C) A DW is typically used most often for intensive querying. Since the
primary responsibility of an OLTP system is to faithfully record ongoing
transactions (inserts/updates/deletes), these operations would be
considerably slowed down by the heavy querying that the DW is subjected to.
45B) A fact table in a data warehouse describes the transaction data. It
contains characteristics and key figures.
45C) A Fact table is a collection of facts and foreign key relations to the
dimensions.
45E) A fact table contains the transaction data; it typically has many rows
and relatively few columns.
Along with the data, it also includes the foreign keys of the dimension
tables attached to it.
45F) A fact table contains the keys (primary key, foreign keys) of the
related dimension tables, and the measures that are based on those keys.
46) What are Semi-additive and factless facts and in which scenario will you
use such kinds of fact table?
47A) Level of granularity means the level of detail that you put into the
fact table in a data warehouse. For example, based on the design, you can
decide to store the sales data for each transaction. The level of
granularity is then the detail you are willing to store for each
transactional fact: product sales for each individual transaction, or sales
aggregated up to the minute.
47B) It also means that we can have (for example) data aggregated for a year
for a given product as well as the data can be drilled down to Monthly,
weekly and daily basis...the lowest level is known as the grain. Going down
to details is Granularity
48) Which columns go to the fact table and which columns go to the dimension
table?
48A) To add on: foreign key elements, along with business measures such as
sales in $ amount, units (quantity sold), and in some cases a date, are
stored in the fact table. It also depends on the granularity at which the
data is stored.
49A) They are of two types: insert, if the row is not there in the
dimension, and update, if it exists.
50A) Aggregate tables contain redundant data that is summarized from other
data in the warehouse.
50B) these are the tables which contain aggregated / summarized data. E.g.
Yearly, monthly sales information. These tables will be used to reduce the
query execution time.
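A monthly aggregate table can be sketched with Python's built-in sqlite3 module. The sales_fact and sales_monthly_agg tables and their columns are hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Detail-level fact table (hypothetical sample data).
    CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO sales_fact VALUES
        ('2001-01-15', 'widget', 10.0),
        ('2001-01-20', 'widget', 15.0),
        ('2001-02-03', 'widget', 20.0),
        ('2001-02-10', 'gadget', 5.0);

    -- Aggregate table: redundant data summarized from the detail rows.
    CREATE TABLE sales_monthly_agg AS
    SELECT substr(sale_date, 1, 7) AS month,
           product,
           SUM(amount) AS total_amount
    FROM sales_fact
    GROUP BY substr(sale_date, 1, 7), product;
""")

# Monthly reports read the small aggregate table, not the detail rows.
monthly_totals = conn.execute(
    "SELECT month, product, total_amount "
    "FROM sales_monthly_agg ORDER BY month, product"
).fetchall()
```

The aggregate has to be rebuilt or incrementally maintained whenever the detail table is loaded, which is the cost paid for the faster queries.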
51A) A dimension table in a data warehouse is one that contains a primary
key and attributes. We call the primary keys DIMIDs (dimension IDs).
51C) Dimension tables are nothing but master tables, through which you can
extract the actual transactions. A dimension table typically has fewer rows
than a fact table and more descriptive columns.
51D) A dimension table is a table that contains the business dimensions
through which we analyze the business metrics.
52A) Cognos
Business Objects
Micro Strategies
Actuate
52B)
1. MS-Excel
2. Business Objects (Crystal Reports)
3. Cognos (Impromptu, Power Play)
4. Micro strategy
5. MS reporting services
6. Informatica Power Analyzer
7. Actuate
8. Hyperion (BRIO)
9. Oracle Express OLAP
10. Proclarity
52C)
INEA
MS-Excel
Business Objects (Crystal Reports)
Cognos (Impromptu, Power Play)
Micro strategy
MS reporting services
Informatica Power Analyzer
Actuate
Hyperion (BRIO)
Oracle Express OLAP
Proclarity
SAS
53A) OLTP
Current data
Short database transactions
Online update/insert/delete
Normalization is promoted
High volume transactions
Transaction recovery is necessary
OLAP
Current and historical data
Long database transactions
Batch update/insert/delete
Denormalization is promoted
Low volume transactions
Transaction recovery is not necessary
53C)
OLTP: FEW
OLAP: MANY
JOINS
OLTP: MANY
OLAP: FEW
54B) A star schema is a way of organizing the tables so that results can be
retrieved from the database easily and quickly in the warehouse environment.
Usually a star schema consists of one or more dimension tables around a fact
table, which looks like a star; that is how it got its name.
54C) It is a way of organizing the entities so that you can retrieve results
from the database easily and very quickly. Usually a star schema has one or
more dimension tables linked around a fact table and looks like a star,
hence the name.
55) Why are OLTP database designs not generally a good idea for a Data
Warehouse?
55A) OLTP cannot store historical information about the organization. It is
used for storing the details of daily transactions while a data warehouse is a
huge storage of historical information obtained from different datamarts for
making intelligent decisions about the organization.
56A) The star schema is created when all the dimension tables directly link to
the fact table. Since the graphical representation resembles a star, it is
called a star schema. Note that the foreign keys in the fact table link to
the primary keys of the dimension tables. This sample provides the star
schema for a sales_fact table for the year 1998. The dimensions created are
Store, Customer, product_class and time_by_day. The Product table links to
the product_class table through the primary key and indirectly to the fact
table. The fact table contains foreign keys that link to the dimension tables.
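A cut-down star schema can be sketched with Python's built-in sqlite3 module. The two dimension tables, their columns and the sample rows are simplified assumptions; they are not the actual sample schema the answer refers to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one primary key plus descriptive attributes.
    CREATE TABLE store_dim   (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_name TEXT);

    -- Fact table: foreign keys to each dimension plus the measure.
    CREATE TABLE sales_fact (
        store_id   INTEGER REFERENCES store_dim(store_id),
        product_id INTEGER REFERENCES product_dim(product_id),
        units_sold INTEGER
    );

    INSERT INTO store_dim   VALUES (1, 'Austin'), (2, 'Boston');
    INSERT INTO product_dim VALUES (10, 'Tea'), (20, 'Coffee');
    INSERT INTO sales_fact  VALUES (1, 10, 3), (1, 20, 5), (2, 10, 4), (2, 10, 1);
""")

# A typical star query: join the fact table directly to a dimension
# and aggregate the measure by a dimension attribute.
units_by_city = dict(conn.execute("""
    SELECT s.city, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN store_dim AS s ON s.store_id = f.store_id
    GROUP BY s.city
""").fetchall())
```

Every dimension is one join away from the fact table, which is what keeps star queries simple.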
56B) The snowflake schema is a schema in which the fact table is indirectly
linked to a number of dimension tables. The dimension tables are
normalized to remove redundant data and partitioned into a number of
dimension tables for ease of maintenance. An example of the snowflake
schema is the splitting of the Product dimension into the product_category
dimension and the product_manufacturer dimension.
Snowflake schema
56E)
Star schema                           Snowflake schema
-----------                           ----------------
Denormalized                          Normalized
Easy to use and understand            End users may get confused
Needs little maintenance effort       Easy to maintain
Fast execution of queries             More time for execution because of
                                      more joins
57C) The snowflake schema is an extension of the star schema in which each
point of the star explodes into more points. The main advantage of the
snowflake schema is the improvement in query performance due to minimized
disk storage requirements and joins against smaller lookup tables. The main
disadvantage of the snowflake schema is the additional maintenance effort
needed due to the increased number of lookup tables.
Snowflake schema
If the schema has more than one fact table then the schema is said to be
Multiple star
Oracle Designer
Erwin (Entity Relationship for windows)
Informatica (Cubes/Dimensions)
Embarcadero
Power Designer Sybase
59A) Dimensions that change over time are called Slowly Changing
Dimensions. For instance, a product price changes over time; People change
their names for some reason; Country and State names may change over
time. These are a few examples of Slowly Changing Dimensions since some
changes are happening to them over a period of time
59B) If the data in the dimension table changes only rarely, it is called a
slowly changing dimension.
60A) A data mart is a small subset of the data warehouse. It holds data at
the business-division or department level.
60B) a data mart is a focused subset of a data warehouse that deals with a
single area (like different department) of data and is organized for quick
analysis
60C) Data Marts: A subset of data warehouse data used for a specific
business function whose format may be a star schema, hypercube or
statistical sample
60D) A data mart is a subset of the data warehouse; it is used to analyze the
data of one particular department from a particular point of view.
60E) Data Mart: a data mart is a small data warehouse. In general, a data
warehouse is divided into small units according to the business
requirements. For example, if we take the data warehouse of an organization,
then it may be divided into the following individual data marts. Data marts
are used to improve performance during the retrieval of data.
61B) A data warehouse is a relational database designed for analysis rather
than transaction processing. A data warehouse is a subject-oriented,
integrated, time-variant and nonvolatile collection of data in support of
management's decision-making process.
62A) you can disconnect the report from the catalog to which it is attached
by saving the report with a snapshot of the data. However, you must
reconnect to the catalog if you want to refresh the data.
63A) True
64) What is active data warehousing?
66A) Type-1
Type-2(full History)
i) Version Number
ii) Flag
iii) Date
Type-3
66B) SCD
66C) SCD
66D) SCD means that the data in the dimension happens to change very
rarely,
67A) A fact may be a measure, a metric, or a dollar value. Measures and
metrics are non-additive facts.
A dollar value is an additive fact. If we want to find the amount for a
particular place over a particular period of time, we can add the dollar
amounts and come up with the total amount.
As an example of a non-additive fact, take measured height(s) for 'citizens
by geographical location': when we roll up 'city' data to 'state' level, we
should not add the heights of the citizens; rather, we may want to use the
rows to derive a 'count'.
Additive: additive facts are facts that can be summed up through all of the
dimensions in the fact table.
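The city-to-state roll-up can be sketched in Python. The figures and column names are made up; the point is only that sales and citizen counts can be summed, while an average height has to be recomputed rather than added.

```python
# Hypothetical city-level rows for one state.
city_facts = [
    {"state": "TX", "city": "Austin", "sales": 100.0,
     "citizens": 2, "avg_height_cm": 170.0},
    {"state": "TX", "city": "Dallas", "sales": 300.0,
     "citizens": 6, "avg_height_cm": 178.0},
]

# Additive facts: a plain sum is meaningful at the state level.
state_sales = sum(row["sales"] for row in city_facts)
state_citizens = sum(row["citizens"] for row in city_facts)

# Non-additive fact: summing the average heights would be meaningless.
# Recompute it at the state level, weighted by the number of citizens.
state_avg_height_cm = (
    sum(row["avg_height_cm"] * row["citizens"] for row in city_facts)
    / state_citizens
)
```

Storing the additive components (total and count) in the fact table and deriving ratios or averages at query time avoids this problem altogether.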
67C) Factless fact - a fact table with no measures. Its rows record events,
so they can be counted but cannot be measured directly...