Q1. Explain the meaning of data cleaning and data formatting.
This step complements the previous one. It is also the most time-consuming, because of the many possible techniques that can be implemented to optimize data quality for the future modelling stage. Possible techniques for data cleaning include:
· Normalization of data: for example, decimal scaling into the range (0, 1), or standard deviation normalization.
· Discretization of numeric attributes: this is helpful, or even necessary, for logic-based methods.
· Treatment of missing values: there is no simple and safe solution for cases where some of the attributes have a significant number of missing values. Generally, it is good to experiment with and without these attributes in the modelling phase, in order to find out the importance of the missing values.
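The techniques above can be sketched in a few lines of Python. This is a minimal illustration only; the attribute values, the mean-imputation strategy and the three-bin discretization are assumptions, not part of the original text:

```python
import math

# Hypothetical numeric attribute with some missing values (None).
ages = [23, 45, None, 31, 52, None, 38, 60, 27, 41]

# Treatment of missing values: one option is to drop them; another is to
# impute the attribute mean so the records remain usable for modelling.
present = [v for v in ages if v is not None]
mean = sum(present) / len(present)
imputed = [v if v is not None else mean for v in ages]

# Decimal scaling: divide by 10^j, where j is the smallest power of 10
# that brings every absolute value below 1.
j = len(str(int(max(abs(v) for v in imputed))))
decimal_scaled = [v / 10 ** j for v in imputed]

# Standard deviation (z-score) normalization: (x - mean) / std.
std = math.sqrt(sum((v - mean) ** 2 for v in imputed) / len(imputed))
z_scored = [(v - mean) / std for v in imputed]

# Discretization into three equal-width bins, as used by logic-based methods.
lo, hi = min(imputed), max(imputed)
width = (hi - lo) / 3
bins = [min(int((v - lo) / width), 2) for v in imputed]  # labels 0, 1, 2
```

In a real project one would, as the text suggests, compare models built with and without the imputed attribute rather than trust any single treatment of the missing values.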
The final data preparation step represents syntactic modifications to the data that do not change its meaning, but are required by the particular modelling tool chosen for the DM task. These include:
· Reordering of the attributes or records: some modelling tools require reordering of the attributes (or records) in the dataset, such as putting the target attribute at the beginning or at the end, or randomizing the order of records (required by neural networks, for example).
· Changes related to the constraints of modelling tools: removing commas, tabs or special characters, trimming strings to the maximum allowed number of characters, or replacing special characters with an allowed set of special characters.
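Both kinds of syntactic modification can be sketched as follows; the record layout, the 10-character limit and the "keep only letters, digits and spaces" rule are illustrative assumptions about a hypothetical modelling tool:

```python
import random
import re

# Hypothetical raw records: values may contain commas, tabs and over-long
# strings that the chosen modelling tool cannot accept.
records = [
    {"name": "Smith,\tJohn  #42", "city": "Bengaluru", "target": "yes"},
    {"name": "OBrien Mary", "city": "Chennai", "target": "no"},
]

MAX_LEN = 10  # assumed tool limit on string length

def sanitize(value: str) -> str:
    """Remove commas/tabs/special characters and trim to the allowed length."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]", "", value)  # keep an allowed set only
    return cleaned[:MAX_LEN].strip()

formatted = [{k: sanitize(v) if isinstance(v, str) else v for k, v in r.items()}
             for r in records]

# Reorder attributes so the target attribute comes last, as some tools require.
ordered = [{**{k: r[k] for k in r if k != "target"}, "target": r["target"]}
           for r in formatted]

# Randomize the record order (e.g. for neural network training).
random.shuffle(ordered)
```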
Q2. What is metadata? Explain the various purposes for which metadata is used.
Ans-
Metadata is data about data. Since data in a data warehouse is both voluminous and dynamic, it needs constant monitoring. This can be done only if a separate set of data about the data is stored; that is the purpose of metadata. Metadata is useful for data transformation and loading, data management and query generation, and this chapter introduces a few of the commonly used metadata functions for each of them. Metadata, by definition, is "data about data" or "data that describes the data". In simple terms, the data warehouse contains data that describes different situations, but there should also be some data that gives details about the data stored in the data warehouse; this data is "metadata". Metadata, apart from other things, is used for the following purposes:
1. data transformation and loading
2. data management
3. query generation
Q3. Write the steps in designing fact tables.
DESIGNING OF FACT TABLES
The above listed methods, when iterated repeatedly, will help to finally arrive at a set of entities that go into a fact table. The next question is: how big can a fact table be? An answer could be that it should be big enough to store all the facts, while still making the task of collecting data from this table reasonably fast. Obviously, this depends on the hardware architecture as well as the design of the database. A suitable hardware architecture can ensure that the cost of collecting data is reduced by the inherent capability of the hardware; on the other hand, the database design should ensure that whenever data is asked for, the time needed to search for it is minimal. In other words, the designer should be able to balance the value of the information made available by the database against the cost of making that data available to the user. A larger database obviously stores more details, and so is definitely useful, but the cost of storing a larger database, as well as the cost of searching and evaluating it, becomes higher. Technologically, there is perhaps no limit on the size of the database. How does one optimize the cost-benefit ratio? There are no standard formulae, but some of the following facts can be taken note of:
i. Understand the significance of the data stored with respect to time. Only data that is still needed for processing needs to be stored. For example, customer details may become irrelevant after a period of time, and salary details paid in the 1980s may be of little use in analyzing the employee cost of the 21st century. As and when data becomes obsolete, it can be removed.
ii. Find out whether maintaining statistical samples of each of the subsets could be resorted to, instead of storing the entire data. For example, instead of storing the sales details of all the 200 towns over the last 5 years, one can store details of 10 smaller towns, five metros, 10 bigger cities and 20 villages.
After all, data warehousing is most often resorted to in order to get trends, and not the actual figures. The subsets of these individual details can always be extrapolated.
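The sampling idea in point ii can be sketched as follows. The location counts and the per-stratum quotas are made-up illustrative figures, not figures from the text:

```python
import random

random.seed(7)  # fixed seed for a reproducible illustration

# Hypothetical locations whose full sales history would otherwise be stored.
locations = {
    "metro":   [f"metro_{i}" for i in range(5)],
    "city":    [f"city_{i}" for i in range(40)],
    "town":    [f"town_{i}" for i in range(200)],
    "village": [f"village_{i}" for i in range(500)],
}

# Instead of keeping data for every location, keep a stratified sample:
# e.g. all 5 metros, 10 cities, 10 towns and 20 villages.
quota = {"metro": 5, "city": 10, "town": 10, "village": 20}
sample = {stratum: random.sample(places, quota[stratum])
          for stratum, places in locations.items()}

# Trends for the full population are then extrapolated from the sample,
# scaling each stratum by its sampling fraction.
scale = {s: len(locations[s]) / quota[s] for s in quota}
```

Here each village in the sample stands in for 25 villages in the full data, which is adequate for trend analysis even though the exact figures are lost.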
Q3. List and explain the aspects to be looked into while designing the summary tables.
ASPECTS TO BE LOOKED INTO WHILE DESIGNING THE SUMMARY TABLES
The main purpose of using summary tables is to cut down the time taken to execute a specific query. The main methodology involves minimizing the volume of data being scanned each time the query is to be answered; in other words, partial answers to the query are already made available. For example, in the above cited example of the mobile market, if one expects
i) citizens above 18 years of age,
ii) with salaries greater than 15,000, and
iii) with professions that involve travelling
to be the potential customers, then every time the query is to be processed (maybe every month or every quarter), one will have to look at the entire database to compute these values and then combine them suitably to get the relevant answers. The other method is to prepare summary tables, which hold the values pertaining to each of these sub-queries beforehand, and then combine them as and when the query is raised. It can be noted that the summaries can be prepared in the background (or when the number of queries running is relatively small), and only the aggregation is done on the fly. Summary tables are designed by following the steps given below:
i) Decide the dimensions along which aggregation is to be done.
ii) Determine the aggregation of multiple facts.
iii) Aggregate the multiple facts into the summary table.
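The three steps above can be sketched with Python's built-in sqlite3 module. The fact table, its dimensions (region, product) and the SUM aggregates are illustrative assumptions:

```python
import sqlite3

# In-memory sketch of a fact table and a pre-aggregated summary table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, qty INT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("north", "dvd", 3, 45.0), ("north", "game", 1, 30.0),
     ("south", "dvd", 2, 30.0), ("south", "dvd", 5, 75.0)],
)

# Steps i-iii: choose the dimensions (region, product), decide the aggregates
# (SUM of qty and amount), and materialize them into a summary table.
con.execute("""
    CREATE TABLE sales_summary AS
    SELECT region, product, SUM(qty) AS total_qty, SUM(amount) AS total_amount
    FROM sales GROUP BY region, product
""")

# A query now scans the small summary table instead of the whole fact table.
row = con.execute(
    "SELECT total_qty, total_amount FROM sales_summary "
    "WHERE region = 'south' AND product = 'dvd'"
).fetchone()
print(row)  # (7, 105.0)
```

The summary table would be refreshed in the background, as the text notes, so that only the final combination of partial answers happens at query time.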
Q4. Explain the role of access control issues in data mart design.
Ans-
ROLE OF ACCESS CONTROL ISSUES IN DATA MART DESIGN
This is one of the major constraints in data mart design. Any data warehouse, with its huge volume of data, is more often than not subject to various access controls as to who can access which part of the data. The easiest case is where the data is partitioned so clearly that a user of each partition cannot access any other data. In such cases, each of these partitions can be put in a data mart, and the user of each can access only his data. In the data warehouse, the data pertaining to all these marts is stored, but the partitioning is retained. If a super user wants an overall view of the data, suitable aggregations can be generated. However, in certain other cases the demarcation may not be so clear. In such cases, a judicious analysis of the privacy constraints is needed, so that the privacy of each data mart is maintained. Data marts, as described in the previous sections, can be designed based on several splits noticeable either in the data, in the organization or in privacy laws. They may also be designed to suit the user access tools. In the latter case, there is not much choice available in the design parameters. In the other cases, it is always desirable to design the data mart to suit the design of the warehouse itself. This helps to maintain maximum control over the database instances, by ensuring that the same design is replicated in each of the data marts. Similarly, the summary information on each of the data marts can be a smaller replica of the summary of the data warehouse itself.
Q5. List the applications of, and reasons for, the growing popularity of data mining.
Ans-
REASONS FOR THE GROWING POPULARITY OF DATA MINING
a) Growing Data Volume
The main reason for the necessity of automated computer systems for intelligent data analysis is the enormous volume of existing and newly appearing data that requires processing. The amount of data accumulated each day by various business, scientific and governmental organizations around the world is daunting. It becomes impossible for human analysts to cope with such overwhelming amounts of data.
b) Limitations of Human Analysis
Two other problems surface when human analysts process data: the inadequacy of the human brain when searching for complex multifactor dependencies in the data, and the lack of objectiveness in such an analysis. A human expert is always a hostage of his previous experience of investigating other systems. Sometimes this helps, sometimes it hurts, but it is almost impossible to get rid of this fact.
c) Low Cost of Machine Learning
One additional benefit of using automated data mining systems is that this process has a much lower cost than hiring many highly trained professional statisticians. While data mining does not completely eliminate human participation in solving the task, it significantly simplifies the job and allows an analyst who is not a professional in statistics and programming to manage the process of extracting knowledge from data.
Q6. What is data mining? What kind of data can be mined?
Ans-
There are many definitions of data mining. A few important ones are given below.
· Data mining refers to extracting or mining knowledge from large amounts of data.
· Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data, in order to discover meaningful patterns and rules.
WHAT KIND OF DATA CAN BE MINED?
In principle, data mining is not specific to one type of media or data; it should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data, and indeed the challenges presented by different types of data vary significantly. Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files. Here are some examples in more detail:
· Flat files: flat files are actually the most common data source for data mining algorithms, especially at the research level. Flat files are simple data files in text or binary format, with a structure known by the data mining algorithm to be applied. The data in these files can be transactions, time-series data or scientific measurements.
· Relational databases: a relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples.
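A flat file of transactions, with a record structure the algorithm knows in advance, can be read in a few lines. The file contents and the comma-separated layout are illustrative assumptions:

```python
import csv
import io
from collections import Counter

# A flat file whose structure (one transaction per line, comma-separated
# items) is known in advance by the mining algorithm.
flat_file = io.StringIO("dvd,game\nvideo\ndvd,game,book\ncd,book\n")

transactions = [set(row) for row in csv.reader(flat_file)]

# A simple scan: how often does each item occur across transactions?
counts = Counter(item for t in transactions for item in t)
print(sorted(transactions[2]))  # ['book', 'dvd', 'game']
```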
Q7. Give the top-level syntax of the data mining query language DMQL.
Ans-
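A commonly cited top-level grammar for DMQL, following Han and Kamber's presentation (exact productions vary between texts, so treat this as a sketch rather than a definitive specification), is:

```text
<DMQL> ::= <DMQL_Statement>; { <DMQL_Statement> }

<DMQL_Statement> ::= <Data_Mining_Statement>
                   | <Concept_Hierarchy_Definition_Statement>
                   | <Visualization_and_Presentation>

<Data_Mining_Statement> ::=
    use database <database_name> | use data warehouse <data_warehouse_name>
    { use hierarchy <hierarchy_name> for <attribute_or_dimension> }
    <Mine_Knowledge_Specification>
    in relevance to <attribute_or_dimension_list>
    from <relation(s)/cube(s)> [ where <condition> ]
    [ order by <order_list> ]
    { with [ <interest_measure_name> ] threshold = <threshold_value>
      [ for <attribute(s)> ] }
```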
A data mining query language helps in effective knowledge discovery from data mining systems. Designing a comprehensive data mining language is challenging, because data mining covers a wide spectrum of tasks, from data characterization to mining association rules, data classification and evolution analysis. Each task has different requirements. The design of an effective data mining query language therefore requires a deep understanding of the power, limitations and underlying mechanisms of the various kinds of data mining tasks.
Q8. Explain the meaning of data mining with the Apriori algorithm.
Ans-
The Apriori algorithm discovers items that are frequently bought together. Let us look at the example of a store that sells DVDs, Videos, CDs, Books and Games. The store owner might want to discover which of these items customers are likely to buy together; this can be used to increase the store's sales. Customers in this particular store may like buying a DVD and a Game in 10 out of every 100 transactions, while the sale of Videos may hardly ever be associated with the sale of a DVD. With the information above, the store could strive for a more optimal placement of DVDs and Games, as the sale of one of them may improve the chances of a sale of the other, frequently associated, item. On the other hand, mailing campaigns may be fine-tuned to reflect the fact that offering discount coupons on Videos may even negatively impact the sales of DVDs offered in the same campaign; a better decision could be not to offer both DVDs and Videos in a campaign. To arrive at these decisions, the store may have had to analyze 10,000 past transactions of customers, using calculations that separate frequent and consequently important associations from weak and unimportant ones.
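A minimal Apriori sketch in plain Python follows. The baskets and the support threshold are made-up illustrative values; a production implementation would count supports in a single pass per level rather than rescanning per itemset:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every itemset appearing in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while frequent:
        result.update({fs: support(fs) for fs in frequent})
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets,
        # keeping only those whose every k-subset is frequent (the Apriori
        # pruning step), then test the survivors against the database.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        frequent = {c for c in candidates if support(c) >= min_support}
        k += 1
    return result

# Hypothetical store transactions.
baskets = [{"dvd", "game"}, {"dvd", "game", "book"},
           {"video"}, {"dvd", "game"}, {"book", "cd"}]
freq = apriori(baskets, min_support=3)
print(freq[frozenset({"dvd", "game"})])  # 3
```

Here {dvd, game} survives because both items, and the pair itself, occur in at least 3 of the 5 baskets, while video never reaches the threshold; this mirrors the store example above, where frequent associations are separated from weak ones.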