Kuvempu University Data Warehousing (5th Semester Notes)
Q1. Explain the meaning of data cleaning and data formatting.
Ans-
Data cleaning
This step complements the previous one. It is also the most time-consuming, because many techniques can be applied to optimize data quality for the later modelling stage. Possible techniques for data cleaning include:
Data normalization. For example, decimal scaling into the range (0, 1), or standard deviation (z-score) normalization.
Data smoothing. Discretization of numeric attributes is one example; this is helpful, or even necessary, for logic-based methods.
Treatment of missing values. There is no simple and safe solution for the cases where some of the attributes have a significant number of missing values. Generally, it is good to experiment with and without these attributes in the modelling phase, in order to find out the importance of the missing values.
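As an illustration of the normalization and missing-value handling mentioned above, here is a minimal sketch in Python; the attribute values are hypothetical, not taken from the text.

# Minimal sketch: decimal scaling, z-score normalization and missing-value handling.
values = [120.0, 85.0, None, 430.0, 250.0]   # hypothetical attribute values, None = missing

# Treatment of missing values: here we simply leave them out of the statistics.
present = [v for v in values if v is not None]

# Decimal scaling: divide by 10^j so that all values fall into (-1, 1).
j = len(str(int(max(abs(v) for v in present))))
decimal_scaled = [v / (10 ** j) for v in present]

# Standard deviation (z-score) normalization: (v - mean) / std.
mean = sum(present) / len(present)
std = (sum((v - mean) ** 2 for v in present) / len(present)) ** 0.5
z_scored = [(v - mean) / std for v in present]

print(decimal_scaled)
print(z_scored)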
Data formatting
This is the final data preparation step. It covers syntactic modifications to the data that do not change its meaning but are required by the particular modelling tool chosen for the DM task. These include:
Reordering of the attributes or records: some modelling tools require reordering of the attributes (or records) in the dataset, such as putting the target attribute at the beginning or at the end, or randomizing the order of records (required by neural networks, for example).
Changes related to the constraints of modelling tools: removing commas, tabs or special characters, trimming strings to the maximum allowed number of characters, or replacing special characters with an allowed set of special characters.
Q2. What is metadata? Explain the various purposes for which metadata is used.
Ans-
Metadata is data about data. Since the data in a data warehouse is both voluminous and dynamic, it needs constant monitoring. This can be done only if a separate set of data about the data is stored; this is the purpose of metadata. Metadata is useful for data transformation and loading, data management and query generation. Metadata, by definition, is "data about data" or "data that describes the data". In simple terms, the data warehouse contains data that describes different situations, but there should also be some data that gives details about the data stored in the warehouse; this data is metadata. Metadata, apart from other things, is used for the following purposes:
1. data transformation and loading
2. data management
3. query generation
Q3. Write the steps in designing of fact tables.
Ans-
DESIGNING OF FACT TABLES
The above listed methods, when iterated repeatedly, will help to finally arrive at a set of entities that go into a fact table. The next question is how big a fact table can be. An answer could be that it should be big enough to store all the facts, while still making the task of collecting data from this table reasonably fast. Obviously, this depends on the hardware architecture as well as the design of the database. A suitable hardware architecture can ensure that the cost of collecting data is reduced by the inherent capability of the hardware; on the other hand, the database design should ensure that whenever data is asked for, the time needed to search for it is minimal. In other words, the designer should be able to balance the value of the information made available by the database against the cost of making that data available to the user. A larger database obviously stores more details and is definitely useful, but the cost of storing a larger database, as well as the cost of searching and evaluating it, becomes higher. Technologically, there is perhaps no limit on the size of the database. How does one optimize the cost-benefit ratio? There are no standard formulae, but the following points can be taken note of (a small sampling sketch follows the list):
i. Understand the significance of the data stored with respect to time. Only data that is still needed for processing needs to be stored. For example, customer details may become irrelevant after a period of time, and salary details paid in the 1980s may be of little use in analyzing the employee cost of the 21st century. As and when the data becomes obsolete, it can be removed.
ii. Find out whether maintaining statistical samples of each of the subsets could be resorted to instead of storing the entire data. For example, instead of storing the sales details of all 200 towns over the last 5 years, one can store details of 10 smaller towns, five metros, 10 bigger cities and 20 villages. After all, data warehousing is most often resorted to for obtaining trends, not exact figures; the individual details can always be extrapolated from these subsets.
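As a small illustration of point (ii), a hedged sketch of keeping a statistical sample instead of the full detail; the records and numbers are hypothetical.

import random

# Hypothetical sales records: (town, amount). In practice these come from the warehouse.
sales = [(f"town_{i}", random.uniform(100, 1000)) for i in range(200)]

# Keep only a random sample of towns instead of the full detail,
# then extrapolate the overall total from the sample mean.
sample = random.sample(sales, 40)
estimated_total = (sum(amount for _, amount in sample) / len(sample)) * len(sales)
print(f"estimated total sales: {estimated_total:.2f}")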
Q3. List and explain the aspects to be looked into while designing the summary tables.
Ans-
ASPECTS TO BE LOOKED INTO WHILE DESIGNING THE SUMMARY TABLES
The main purpose of using summary tables is to cut down the time taken to execute a specific query. The main methodology involves minimizing the volume of data being scanned each time the query is to be answered; in other words, partial answers to the query are already made available. For example, in the above cited example of the mobile market, suppose one expects
i) the citizens above 18 years of age,
ii) with salaries greater than 15,000, and
iii) with professions that involve travelling
to be the potential customers. Then, every time the query is to be processed (maybe every month or every quarter), one will have to look at the entire database to compute these values and then combine them suitably to get the relevant answers. The other method is to prepare summary tables, which hold the values pertaining to each of these sub-queries beforehand, and then combine them as and when the query is raised. It can be noted that the summaries can be prepared in the background (or when the number of queries running is relatively small) and only the aggregation need be done on the fly. Summary tables are designed by following the steps given below (a small aggregation sketch follows the list):
i) Decide the dimensions along which aggregation is to be done.
ii) Determine the aggregation of multiple facts.
iii) Aggregate multiple facts into the summary table.
iv) Determine the level of aggregation and the extent of embedding.
v) Design time into the table.
vi) Index the summary table.
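A minimal sketch of the idea, assuming a hypothetical pandas DataFrame of customer transactions; the column names and data are illustrative, not from the text.

import pandas as pd

# Hypothetical detail table: one row per transaction.
detail = pd.DataFrame({
    "age":        [25, 17, 40, 32, 55],
    "salary":     [20000, 8000, 30000, 16000, 12000],
    "travelling": [True, False, True, True, False],
    "amount":     [120, 60, 300, 90, 45],
})

# Step i)   dimension chosen for aggregation: the "potential customer" flag.
# Step ii)  facts to aggregate: transaction count and total amount.
# Step iii) build the summary table once, in the background.
detail["potential"] = (detail["age"] > 18) & (detail["salary"] > 15000) & detail["travelling"]
summary = detail.groupby("potential")["amount"].agg(["count", "sum"])

# Later queries read the small summary table instead of rescanning the detail table.
print(summary)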
Q4. Explain the role of access control issues in data mart design.
Ans-
ROLE OF ACCESS CONTROL ISSUES IN DATA MART DESIGN
This is one of the major constraints in data mart design. Any data warehouse, with its huge volume of data, is more often than not subject to various access controls as to who can access which part of the data. The easiest case is where the data is partitioned so clearly that a user of each partition cannot access any other data. In such cases, each of these partitions can be put in a data mart and the user of each can access only his data. In the data warehouse, the data pertaining to all these marts is stored, but the partitioning is retained. If a super user wants to get an overall view of the data, suitable aggregations can be generated. However, in certain other cases the demarcation may not be so clear. In such cases, a judicious analysis of the privacy constraints is needed, so that the privacy of each data mart is maintained. Data marts, as described in the previous sections, can be designed based on several splits noticeable either in the data, in the organization or in privacy laws. They may also be designed to suit the user access tools; in the latter case, there is not much choice available for design parameters. In the other cases, it is always desirable to design the data mart to suit the design of the warehouse itself. This helps to maintain maximum control over the database instances, by ensuring that the same design is replicated in each of the data marts. Similarly, the summary information on each data mart can be a smaller replica of the summary of the data warehouse itself.
Q5. List the applications and reasons for the growing popularity of data mining.
Ans-
REASONS FOR THE GROWING POPULARITY OF DATA MINING
a) Growing Data Volume
The main reason for the necessity of automated computer systems for intelligent data analysis is the enormous volume of existing and newly appearing data that requires processing. The amount of data accumulated each day by various business, scientific and governmental organizations around the world is daunting. It becomes impossible for human analysts to cope with such overwhelming amounts of data.
b) Limitations of Human Analysis
Two other problems surface when human analysts process data: the inadequacy of the human brain when searching for complex multifactor dependencies in data, and the lack of objectivity in such an analysis. A human expert is always a hostage of the previous experience of investigating other systems. Sometimes this helps, sometimes it hurts, but it is almost impossible to get rid of this fact.
c) Low Cost of Machine Learning
One additional benefit of using automated data mining systems is that this process has a much lower cost than hiring many highly trained professional statisticians. While data mining does not eliminate human participation in solving the task completely, it significantly simplifies the job and allows an analyst who is not a professional in statistics and programming to manage the process of extracting knowledge from data.
Q6. What is data mining? What kind of data can be mined?
Ans-
There are many definitions of data mining. A few important definitions are given below.
Data mining refers to extracting or mining knowledge from large amounts of data.
Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.
WHAT KIND OF DATA CAN BE MINED?
In principle, data mining is not specific to one type of media or data; it should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data, and indeed the challenges presented by different types of data vary significantly. Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files. Here are some examples in more detail.
Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the research level. Flat files are simple data files in text or binary format, with a structure known to the data mining algorithm to be applied. The data in these files can be transactions, time-series data or scientific measurements.
Relational databases: A relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples.
Q7. Give the top level syntax of the data mining query language DMQL.
Ans-
A data mining query language helps in effective knowledge discovery from data mining systems. Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to mining association rules, data classification and evolution analysis. Each task has different requirements. The design of an effective data mining query language requires a deep understanding of the power, limitations and underlying mechanisms of the various kinds of data mining tasks.
Q8. Explain the meaning of data mining with the Apriori algorithm.
Ans-
The Apriori algorithm discovers items that are frequently associated together. Consider the example of a store that sells DVDs, videos, CDs, books and games. The store owner might want to discover which of these items customers are likely to buy together; this can be used to increase the store's cross-sell and up-sell ratios. Customers in this particular store may like buying a DVD and a game in 10 out of every 100 transactions, while the sale of videos may hardly ever be associated with the sale of a DVD. With this information, the store could strive for better placement of DVDs and games, since the sale of one of them may improve the chances of the sale of the other, frequently associated item. On the other hand, mailing campaigns may be fine-tuned to reflect the fact that offering discount coupons on videos may even negatively impact the sales of DVDs offered in the same campaign; a better decision could be not to offer both DVDs and videos in one campaign. To arrive at these decisions, the store may have had to analyze 10,000 past customer transactions, using calculations that separate frequent, and consequently important, associations from weak and unimportant ones.
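A minimal sketch of the frequent-itemset step of Apriori, assuming a small hypothetical list of transactions rather than the store's actual data.

from itertools import combinations

# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"DVD", "Game"}, {"DVD", "Game", "CD"}, {"Video", "Book"},
    {"DVD", "Game"}, {"CD", "Book"}, {"DVD", "CD"},
]
min_support = 2   # an itemset is "frequent" if it occurs in at least 2 transactions

def support(itemset):
    """Count how many transactions contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Next level: candidate pairs built only from frequent items (the Apriori pruning idea),
# kept if their support also meets the threshold.
candidates = [a | b for a, b in combinations(frequent, 2) if len(a | b) == 2]
frequent_pairs = [c for c in set(candidates) if support(c) >= min_support]

print(frequent_pairs)   # itemsets such as {DVD, Game} that are frequently bought together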
 
Q9. Explain the working principle of decision trees used for data mining.
Ans-
DATA MINING WITH DECISION TREES
Decision trees are powerful and popular tools for classification and prediction. The attractiveness of tree-based methods is due in large part to the fact that they are simple and that decision trees represent rules. Rules can readily be expressed so that humans can understand them, or in a database access language like SQL, so that records falling into a particular category may be retrieved. In some applications, the accuracy of a classification or prediction is the only thing that matters; if a direct mail firm obtains a model that can accurately predict which members of a prospect pool are most likely to respond to a certain solicitation, it may not care how or why the model works.
Decision tree working concept
A decision tree is a classifier in the form of a tree structure where each node is either:
a leaf node, indicating a class of instances, or
a decision node, which specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf node is reached; the leaf node provides the classification of the instance.
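A minimal sketch of classifying an instance by walking such a tree, using a hypothetical hand-built tree; the attributes and class labels are illustrative only.

# A decision node is a dict: {"attribute": ..., "branches": {value: subtree}}.
# A leaf node is just a class label (a string).
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny":    {"attribute": "humidity",
                     "branches": {"high": "dont_play", "normal": "play"}},
        "overcast": "play",
        "rainy":    {"attribute": "windy",
                     "branches": {True: "dont_play", False: "play"}},
    },
}

def classify(node, instance):
    """Start at the root; at each decision node follow the branch matching the
    instance's attribute value, until a leaf (class label) is reached."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))   # -> "play"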
Example: Decision making in the Bombay stock market
Suppose that the major factors affecting the Bombay stock market are:
what it did yesterday;
what the New Delhi market is doing today;
bank interest rate;
unemployment rate;
India's prospects at cricket.
Q10. What is Bayes' theorem? Explain the working procedure of the Bayesian classifier.
Ans-
Bayes Theorem
Let X be a data sample whose class label is unknown. Let H be some hypothesis, such as that the data sample X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose the world of data samples consists of fruits, described by their colour and shape. Suppose that X is red and round, and that H is the hypothesis that X is an apple. Then P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round. In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given data sample is an apple, regardless of how the data sample looks. The posterior probability, P(H|X), is based on more information (such as background knowledge) than the prior probability, P(H), which is independent of X. Similarly, P(X|H) is the posterior probability of X conditioned on H, that is, the probability that X is red and round given that we know that X is an apple. P(X) is the prior probability of X. P(X), P(H) and P(X|H) may be estimated from the given data, as we shall see below. Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X) and P(X|H). Bayes' theorem is:
P(H|X) = P(X|H) P(H) / P(X)
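A minimal worked sketch of the formula, using made-up numbers for the apple example; the probabilities are illustrative, not from the text.

# Hypothetical estimates from a fruit data set:
p_h = 0.30           # P(H):   prior probability that a random sample is an apple
p_x_given_h = 0.80   # P(X|H): probability a fruit is red and round, given it is an apple
p_x = 0.40           # P(X):   prior probability that a random sample is red and round

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(f"P(apple | red and round) = {p_h_given_x:.2f}")   # 0.60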
Q11. Explain how a neural network can be used for data mining.
Ans-
A neural processing element receives inputs from other connected processing elements. These input signals or values pass through weighted connections, which either amplify or diminish the signals. Inside the neural processing element, all of these input signals are summed together to give the total input to the unit. This total input value is then passed through a mathematical function to produce an output or decision value ranging from 0 to 1. Notice that this is a real-valued (analog) output, not a digital 0/1 output. If the input signal matches the connection weights exactly, the output is close to 1; if the input signal totally mismatches the connection weights, the output is close to 0. Varying degrees of similarity are represented by the intermediate values. We can, of course, force the neural processing element to make a binary (1/0) decision, but by using analog values ranging between 0.0 and 1.0 as the outputs we retain more information to pass on to the next layer of neural processing units. In a very real sense, neural networks are analog computers. Each neural processing element acts as a simple pattern recognition machine: it checks the input signals against its memory traces (connection weights) and produces an output signal that corresponds to the degree of match between those patterns. In typical neural networks, there are hundreds of neural processing elements whose pattern recognition and decision making abilities are harnessed together to solve problems.
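A minimal sketch of one such processing element: a weighted sum of the inputs squashed into the (0, 1) range by a sigmoid function; the weights and inputs are hypothetical.

import math

def neuron(inputs, weights, bias):
    """Sum the weighted inputs, then squash the total through a sigmoid
    so the output is an analog value between 0 and 1."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

weights = [0.9, -0.4, 0.7]   # connection weights (memory traces)
print(neuron([1.0, 0.0, 1.0], weights, bias=0.1))   # higher output: inputs match the weights
print(neuron([0.0, 1.0, 0.0], weights, bias=0.1))   # lower output: inputs mismatch the weights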
Backpropagation
Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the "backwards" direction, that is, from the output layer, through each hidden layer, down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights will eventually converge and the learning process stops. Each step of the algorithm is described below.
Initialize the weights: The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it; the biases are similarly initialized to small random numbers. Each training sample, X, is then processed by the steps of the algorithm.
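A minimal sketch of the idea for a single sigmoid unit: weights start as small random numbers and are nudged, sample by sample, in the direction that reduces the squared error. This is a simplification under that assumption, not a full multi-layer backpropagation implementation.

import math, random

random.seed(0)
# Initialize weights and bias to small random numbers in [-0.5, 0.5].
weights = [random.uniform(-0.5, 0.5) for _ in range(2)]
bias = random.uniform(-0.5, 0.5)
rate = 0.5                                      # learning rate

samples = [([0.0, 1.0], 1), ([1.0, 0.0], 0)]    # hypothetical (inputs, class label) pairs

for _ in range(1000):                           # iterate over the training samples repeatedly
    for x, target in samples:
        out = 1 / (1 + math.exp(-(sum(w * xi for w, xi in zip(weights, x)) + bias)))
        # Error gradient for a sigmoid unit with squared error, propagated "backwards".
        delta = (target - out) * out * (1 - out)
        weights = [w + rate * delta * xi for w, xi in zip(weights, x)]
        bias += rate * delta

print(weights, bias)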
 
K-means algorithm
This algorithm takes as input a predefined number of clusters, which is the k from its name. "Means" stands for an average: an average location of all the members of a particular cluster. When dealing with clustering techniques, one has to adopt the notion of a high-dimensional space, a space in which the orthogonal dimensions are all the attributes from the table of data being analyzed. The value of each attribute of an example represents a distance of the example from the origin along the attribute axes. The coordinates of a centroid are the averages of the attribute values of all examples that belong to the cluster. The steps of the K-means algorithm are given below (a small sketch follows the list).
1. Randomly select k points (they can also be examples) to be the seeds for the centroids of the k clusters.
2. Assign each example to the centroid closest to it, forming in this way k exclusive clusters of examples.
3. Calculate new centroids of the clusters. For that purpose, average all attribute values of the examples belonging to the same cluster (centroid).
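A minimal sketch of these steps for one-dimensional data; the values and k are hypothetical.

import random

random.seed(1)
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.9, 9.0, 9.3]   # hypothetical one-attribute examples
k = 3

# Step 1: randomly pick k examples as the initial centroids (seeds).
centroids = random.sample(data, k)

for _ in range(10):                               # repeat assignment and update a few times
    # Step 2: assign each example to its closest centroid, forming k exclusive clusters.
    clusters = [[] for _ in range(k)]
    for x in data:
        nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Step 3: recompute each centroid as the average of the examples in its cluster.
    centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]

print(centroids)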
 
Q12. Explain the star flake schema in detail.
Ans-
STAR FLAKE SCHEMAS
One of the key goals for a database designer is to ensure that the database can answer all types of queries, even those not initially visualized by the developer. To do this, it is essential to understand how the data within the database is used. In a decision support system, which is what a data warehouse is basically supposed to provide, a large number of different questions are asked about the same set of facts. For example, given sales data, questions like:
i) What is the average sales quantum of a particular item?
ii) Which are the most popular brands in the last week?
iii) Which item has the least turnaround time?
iv) How many customers returned to procure the same item within one month?
can be asked. They are all based on the sales data, but the method of viewing the data to answer each question is different. The answers need to be given by rearranging or cross-referencing different facts.
Q13. Explain the method for designing dimension tables.
Ans-
DESIGNING DIMENSION TABLES
After the fact tables have been designed, it is essential to design the dimension tables. However, the design of dimension tables need not be considered a critical activity, though a good design helps in improving performance. It is also desirable to keep the volumes relatively small, so that restructuring costs will be less. Some of the commonly used dimensions are described below.
Star dimension
Star dimensions speed up query performance by denormalising reference information into a single table. They presume that the bulk of the incoming queries analyze the facts by applying a number of constraints to a single dimension of the data. For example, the details of sales from a store can be stored as horizontal rows, and queries select one or a few of the attributes. Suppose a cloth store stores the details of its sales one below the other, and questions like how many white shirts of size 85" were sold in one week are asked. All that the query has to do is apply the relevant constraints to get the information. This technique works well in situations where there are a number of entities, all related to the key dimension entity.
Q14. Explain horizontal partitioning briefly.
Ans-
Needless to say, the data warehouse design process should try to maximize the performance of the system. One of the ways to ensure this is to optimize the database design with respect to a specific hardware architecture. Obviously, the exact details of optimization depend on the hardware platform. Normally the following guidelines are useful:
i. Maximize the processing, disk and I/O operations.
ii. Reduce bottlenecks at the CPU and I/O.
The following mechanisms become handy.
Maximising the processing and avoiding bottlenecks
One of the ways of ensuring faster processing is to split the data query into several parallel queries, convert them into parallel threads and run them in parallel. This method will work only when there is a sufficient number of processors, or sufficient processing power, to ensure that they can actually run in parallel. (Note that to run five threads it is not always necessary to have five processors; even a smaller number of processors can do the job, provided they are fast enough to avoid bottlenecks at the processor.)
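A minimal sketch of the idea using Python threads, where the run_query function is a hypothetical stand-in for one sub-query against the warehouse.

from concurrent.futures import ThreadPoolExecutor

def run_query(partition):
    """Hypothetical stand-in for one sub-query run against one data partition."""
    return sum(range(partition * 1000))   # pretend work

partitions = [1, 2, 3, 4, 5]

# Split the original query into several sub-queries and run them as parallel threads;
# the partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(run_query, partitions))

print(sum(partial_results))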
Normalisation
The usual approach to normalization in database applications is to ensure that the data is divided into two or more tables, such that when the data in one of them is updated it does not lead to anomalies in the data (the student is advised to refer to any book on database management systems for details, if interested). The idea is to ensure that, when combined, the data available is consistent. However, in data warehousing one may even tend to break a large table into several "denormalized" smaller tables. This may lead to a lot of extra space being used, but it helps in an indirect way: it avoids the overhead of joining the data during queries. The sketch below illustrates the idea.
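The table referred to in the source text is not reproduced here; as an illustrative sketch with hypothetical data, the idea of denormalizing by pre-joining reference data is shown below.

import pandas as pd

# Normalized form: sales refer to products through a key; a join is needed at query time.
sales = pd.DataFrame({"product_id": [1, 2, 1], "quantity": [10, 5, 7]})
products = pd.DataFrame({"product_id": [1, 2], "product_name": ["shirt", "trouser"]})

# Denormalized form: the product name is copied into the sales table up front.
# This uses extra space, but later queries avoid the join overhead.
denormalized = sales.merge(products, on="product_id")
print(denormalized)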
Q16. Explain the need for data marts in detail.
Ans-
THE NEED FOR DATA MARTS
In a crude sense, if you consider a data warehouse as a store house of data, a data mart is a retail outlet of data. Searching for any data in a huge store house is difficult, but if the data is available you will definitely be able to get it. On the other hand, in a retail outlet, since the volume to be searched is small, you can access the data fast; but it is possible that the data you are searching for is not available there, in which case you have to go back to your main store house to search for it. Coming back to technical terminology, the following are the reasons for which data marts are created:
i) Since the volume of data scanned is small, they speed up query processing.
ii) Data can be structured in a form suitable for a user access tool.
iii) Data can be segmented or partitioned so that it can be used on different platforms, and different control strategies become applicable.
 
IDENTIFY THE ACCESS TOOL REQUIREMENTS
Data marts are required to support the internal data structures that support the user access tools. Data within those structures is not actually controlled by the warehouse, but the data is to be rearranged and updated by the warehouse. This arrangement (called populating of data) is suited to the existing requirements of data analysis. While the requirements are few and less complicated, any populating method may be suitable, but as the demands increase (as happens over a period of time) the populating methods should match the tools used. As a rule, this rearrangement (or populating) is to be done by the warehouse after acquiring the data from the source. In other words, the data received from the source should not directly be arranged in the structures needed by the access tools. This is because each piece of data is likely to be used by several access tools which need different populating methods; additional requirements may also come up later. Hence each data mart is to be populated from the warehouse based on the access tool requirements of the data warehouse.
Q17. Explain the data warehouse process managers in detail.
Ans-
DATA WAREHOUSE PROCESS MANAGERS
These are responsible for the smooth flow, maintenance and upkeep of data into and out of the database. The main types of process managers are:
Load manager
Warehouse manager
Query manager
We shall look into each of them briefly.
Load manager
This is responsible for any data transformations and for loading data into the database. It should effect the following:
Data source interaction
Data transformation
Data load
The actual complexity of each of these modules depends on the size of the database. The load manager should be able to interact with the source systems to verify the received data. This is a very important aspect, and any improper operation leads to invalid data affecting the entire warehouse. This is normally achieved by making the source and data warehouse systems compatible.
Warehouse manager
The warehouse manager is responsible for maintaining the data of the warehouse. It should also create and maintain a layer of metadata. Some of the responsibilities of the warehouse manager are:
Data movement
Metadata management
Performance monitoring
Archiving
Data movement includes the transfer of data within the warehouse, aggregation, and the creation and maintenance of tables, indexes and other objects of importance. The warehouse manager should be able to create new aggregations as well as remove old ones. Creation of additional rows/columns, keeping track of the aggregation processes and creating metadata are also its functions.
Query manager
We shall look at the last of the managers, but not one of any less importance: the query manager. Its main responsibilities include the control of the following:
User access to data
Query scheduling
Query monitoring
These jobs are varied in nature and have not been automated as yet. The main job of the query manager is to control the user's access to data and also to present the data resulting from query processing in a format suitable to the user. The raw data, often from different sources, needs to be compiled into a format suitable for querying. The query manager has to act as a mediator between the user on one hand and the metadata on the other.
 
Q18. Explain the data warehouse delivery process in detail.
Ans-
THE DATA WAREHOUSE DELIVERY PROCESS
This section deals with the data warehouse from a different viewpoint: how the different components that go into it enable the building of a data warehouse. The study helps us in two ways:
i) to have a clear view of the data warehouse building process.
ii) to understand the working of the data warehouse in the context of its components.
Now we look at the concepts in detail.
i. IT Strategy
The company must have an overall IT strategy, and data warehousing has to be a part of that overall strategy. This would not only ensure that adequate backing in terms of data and investments is available, but would also help in integrating the warehouse into the strategy. In other words, a data warehouse cannot be visualized in isolation.
ii. Business case analysis
This looks like an obvious thing, but is most often misunderstood. An overall understanding of the business, and of the importance of its various components, is a must. This will ensure that one can clearly justify the appropriate level of investment that goes into the data warehouse design, and also the amount of returns accruing. Unfortunately, in many cases the returns from the warehousing activity are not quantifiable. At the end of the year, one cannot make statements such as "I have saved / generated 2.5 crore rupees because of data warehousing". A data warehouse affects the business and strategy plans indirectly, giving scope for undue expectations on one hand and total neglect on the other. Hence, it is essential that the designer has a sound understanding of the overall business and the scope of the concept.
Q19. Briefly explain the system management tools.
Ans-
SYSTEM MANAGEMENT TOOLS
The most important jobs done by this class of managers include the following:
1. Configuration managers
2. Schedule managers
3. Event managers
4. Database managers
5. Backup recovery managers
6. Resource and performance monitors
We shall look into the working of the first five classes, since the last type of manager is less critical in nature.
Configuration manager
This tool is responsible for setting up and configuring the hardware. Since several types of machines are being addressed, several concerns like machine configuration, compatibility etc. are to be taken care of, as is the platform on which the system operates.
Schedule manager
Scheduling is the key to successful warehouse management. Almost all operations in the warehouse need some type of scheduling. Every operating system will have its own scheduler and batch control mechanism, but these schedulers may not be capable of fully meeting the requirements of a data warehouse.
Event manager
An event is defined as a measurable, observable occurrence of a defined action. If this definition seems quite vague, it is because it encompasses a very large set of operations. The event manager is software that continuously monitors the system for the occurrence of the event and then takes any suitable action (note that the event is a "measurable and observable" occurrence). The action to be taken is also normally specific to the event.
Database manager
The database manager will normally also have a separate (and often independent) system manager module. The purpose of these managers is to automate certain processes and simplify the execution of others. Some of the operations are listed below:
Ability to add/remove users
User management
Manipulate user quotas
Assign and de-assign user profiles
 
Q20. What is a schema? Distinguish between facts and dimensions.
Ans-
Schema
A schema, by definition, is a logical arrangement of facts that facilitates ease of storage and retrieval, as described by the end users. The end user is not bothered about the overall arrangement of the data or the fields in it. For example, a sales executive trying to project the sales of a particular item is only interested in the sales details of that item, whereas a tax practitioner looking at the same data will be interested only in the amounts received by the company and the profits made.
Distinguish between facts and dimensions
The star schema looks like a good solution to the problem of warehousing. It simply states that one should identify the facts and store them in the read-only area, and that the dimensions surround this area. Whereas the dimensions are liable to change, the facts are not. But given a set of raw data from the sources, how does one identify the facts and the dimensions? It is not always easy, but the following steps can help in that direction (a small sketch follows the list):
i) Look for the fundamental transactions in the entire business process. These basic entities are the facts.
ii) Find out the important dimensions that apply to each of these facts. They are the candidates for dimension tables.
iii) Ensure that the facts do not include candidates that are actually dimensions, with a set of facts attached to them.
iv) Ensure that the dimensions do not include candidates that are actually facts.
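A small sketch of the outcome of these steps, with hypothetical tables: the sales transactions form the fact table, while the product and store details form dimension tables.

import pandas as pd

# Dimension tables: descriptive reference data, liable to change.
product_dim = pd.DataFrame({"product_id": [1, 2],
                            "name": ["shirt", "shoe"],
                            "category": ["apparel", "footwear"]})
store_dim = pd.DataFrame({"store_id": [10, 11], "city": ["Mysore", "Bangalore"]})

# Fact table: the fundamental business transactions, keyed by the dimensions.
sales_fact = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id":   [10, 10, 11],
    "quantity":   [3, 1, 5],
    "amount":     [900, 2500, 1500],
})

# A typical star-schema query joins the fact table with its dimensions.
report = sales_fact.merge(product_dim, on="product_id").merge(store_dim, on="store_id")
print(report.groupby("category")["amount"].sum())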
Q21. Explain how to categorize data mining systems.
Ans-
CATEGORIZING DATA MINING SYSTEMS
There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or are confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria; among other classifications are the following:
a) Classification according to the type of data source mined: this classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, World Wide Web data, etc.
b) Classification according to the data model drawn on: this classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional database, etc.
c) Classification according to the kind of knowledge discovered: this classification categorizes data mining systems based on the kind of knowledge discovered, or the data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.
Q22. A DATA MINING QUERY LANGUAGE
A data mining query language provides the necessary primitives that allow users to communicate with data mining systems. But novice users may find a data mining query language difficult to use and its syntax difficult to remember. Instead, users may prefer to communicate with data mining systems through a graphical user interface (GUI). In relational database technology, SQL serves as a standard core language for relational systems, on top of which GUIs can easily be designed. Similarly, a data mining query language may serve as a core language for data mining system implementations, providing a basis for the development of GUIs for effective data mining. A data mining GUI may consist of the following functional components:
a) Data collection and data mining query composition: this component allows the user to specify task-relevant data sets and to compose data mining queries. It is similar to the GUIs used for the specification of relational queries.
b) Presentation of discovered patterns: this component allows the display of the discovered patterns in various forms, including tables, graphs, charts, curves and other visualization techniques.
