
Q1. Explain the meaning of data cleaning and data formatting.

iv) Determine the level of aggregation and the extent of embedding.
v) Design time into the table.
vi) Index the summary table.

Q9. Explain the working principle of the decision tree used for data mining.
Ans- DATA MINING WITH DECISION TREES
Decision trees are powerful and popular tools for classification and prediction. The attractiveness of tree-based methods is due in large part to the fact that decision trees are simple and represent rules. Rules can readily be expressed so that humans can understand them, or in a database access language such as SQL, so that records falling into a particular category may be retrieved. In some applications, the accuracy of a classification or prediction is the only thing that matters; if a direct-mail firm obtains a model that can accurately predict which members of a prospect pool are most likely to respond to a certain solicitation, it may not care how or why the model works.

Decision tree working concept
A decision tree is a classifier in the form of a tree structure where each node is either: a leaf node, indicating a class of instances, or a decision node, specifying some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test. A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.

Example: decision making in the Bombay stock market. Suppose that the major factors affecting the Bombay stock market are: what it did yesterday; what the New Delhi market is doing today; the bank interest rate; the unemployment rate; India's prospects at cricket.

Q10. What is Bayes theorem? Explain the working procedure of the Bayesian classifier.
Ans- Bayes Theorem
Let X be a data sample whose class label is unknown. Let H be some hypothesis, such as that the data sample X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose the world of data samples consists of fruits, described by their color and shape. Suppose that X is red and round, and that H is the hypothesis that X is an apple. Then P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round. In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given data sample is an apple, regardless of how the data sample looks. The posterior probability, P(H|X), is based on more information (such as background knowledge) than the prior probability, P(H), which is independent of X. Similarly, P(X|H) is the posterior probability of X conditioned on H; that is, the probability that X is red and round given that we know X is an apple. P(X) is the prior probability of X. P(X), P(H), and P(X|H) may be estimated from the given data. Bayes theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X), and P(X|H). Bayes theorem is

P(H|X) = P(X|H) P(H) / P(X)

Q11. Explain how a neural network can be used for data mining.
Ans- A neural processing element receives inputs from other connected processing elements. These input signals or values pass through weighted connections, which either amplify or diminish the signals. Inside the neural processing element, all of these input signals are summed together to give the total input to the unit. This total input value is then passed through a mathematical function to produce an output or decision value ranging from 0 to 1. Notice that this is a real-valued (analog) output, not a digital 0/1 output.
If the input signal matches the connection weights exactly, then the output is close to 1. If the input signal totally mismatches the connection weights, then the output is close to 0. Varying degrees of similarity are represented by the intermediate values. We can, of course, force the neural processing element to make a binary (1/0) decision, but by using analog values ranging between 0.0 and 1.0 as the outputs, we retain more information to pass on to the next layer of neural processing units. In a very real sense, neural networks are analog computers.

Each neural processing element acts as a simple pattern recognition machine. It checks the input signals against its memory traces (connection weights) and produces an output signal that corresponds to the degree of match between those patterns. In typical neural networks, there are hundreds of neural processing elements whose pattern recognition and decision-making abilities are harnessed together to solve problems.

Backpropagation
Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the backwards direction, that is, from the output layer, through each hidden layer, down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops. The algorithm is summarized in the figure; each step is described below.

Initialize the weights: The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it. The biases are similarly initialized to small random numbers. Each training sample, X, is then processed by the following steps.
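The initialize-and-update cycle above can be sketched in Python. This is a minimal illustration, not the full multi-layer algorithm: a single sigmoid unit with a bias, trained by gradient descent on a toy AND dataset. The dataset, seed, learning rate, and epoch count are illustrative choices, not taken from the text.

```python
import random
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy AND dataset: input vectors with known class labels (illustrative).
samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

# Initialize the weights and bias to small random numbers, e.g. in [-0.5, 0.5].
random.seed(42)
weights = [random.uniform(-0.5, 0.5) for _ in range(2)]
bias = random.uniform(-0.5, 0.5)
rate = 0.5  # learning rate (illustrative)

for epoch in range(5000):
    for x, target in samples:
        # Forward pass: weighted sum of inputs plus bias, squashed to (0, 1).
        total = sum(w * xi for w, xi in zip(weights, x)) + bias
        out = sigmoid(total)
        # Backward pass: gradient of the squared error for a sigmoid unit.
        delta = (out - target) * out * (1 - out)
        weights = [w - rate * delta * xi for w, xi in zip(weights, x)]
        bias -= rate * delta

# After training, the analog outputs approximate the 0/1 class labels.
for x, target in samples:
    out = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    print(x, round(out))
```

Note the analog output: the unit produces a value between 0 and 1, and only the final rounding forces a binary decision, exactly as described above.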

Q19. Briefly explain the system management tools.
Ans- SYSTEM MANAGEMENT TOOLS
The most important jobs done by this class of managers include the following:
1. Configuration managers
2. Schedule managers
3. Event managers
4. Database managers
5. Backup recovery managers
6. Resource and performance monitors
We shall look into the working of the first five classes, since the last type of manager is less critical in nature.

Configuration manager: This tool is responsible for setting up and configuring the hardware. Since several types of machines are being addressed, concepts like machine configuration, compatibility, etc. have to be taken care of, as does the platform on which the system operates.

Schedule manager: Scheduling is the key to successful warehouse management. Almost all operations in the warehouse need some type of scheduling. Every operating system has its own scheduler and batch control mechanism, but these schedulers may not be capable of fully meeting the requirements of a data warehouse.

Event manager: An event is defined as a measurable, observable occurrence of a defined action. If this definition seems vague, it is because it encompasses a very large set of operations. The event manager is software that continuously monitors the system for the occurrence of events and then takes whatever action is suitable. The action to be taken is normally specific to the event.

Database manager: The database manager normally also has a separate (and often independent) system manager module. The purpose of these managers is to automate certain processes and simplify the execution of others. Some of the operations are listed as follows:
o Ability to add/remove users
o User management
o Manipulate user quotas
o Assign and de-assign user profiles

Q20. What is a schema? Distinguish between facts and dimensions.
Ans- Schema: A schema, by definition, is a logical arrangement of facts that facilitates ease of storage and retrieval, as described by the end users. The end user is not bothered about the overall arrangement of the data or the fields in it. For example, a sales executive trying to project the sales of a particular item is interested only in the sales details of that item, whereas a tax practitioner looking at the same data will be interested only in the amounts received by the company and the profits made.

Distinguish between facts and dimensions
The star schema looks like a good solution to the problem of warehousing. It simply states that one should identify the facts and store them in the read-only area, with the dimensions surrounding them. Whereas the dimensions are liable to change, the facts are not. But given a set of raw data from the sources, how does one identify the facts and the dimensions? It is not always easy, but the following steps can help in that direction:
i) Look for the fundamental transactions in the entire business process. These basic entities are the facts.
ii) Find out the important dimensions that apply to each of these facts. They are the candidates for dimension tables.
iii) Ensure that the facts do not include candidates that are actually dimensions with a set of facts attached to them.
iv) Ensure that the dimensions do not include candidates that are actually facts.

Q21. Explain how to categorize data mining systems.
Ans- CATEGORIZING DATA MINING SYSTEMS
There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive.
Data mining systems can be categorized according to various criteria; among others, the classifications are the following:
a) Classification according to the type of data source mined: this categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, World Wide Web data, etc.
b) Classification according to the data model drawn on: this categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional database, etc.
c) Classification according to the kind of knowledge discovered: this categorizes data mining systems based on the kind of knowledge discovered or the data mining functionalities offered, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive, offering several data mining functionalities together.

Q22. A DATA MINING QUERY LANGUAGE
A data mining query language provides the necessary primitives that allow users to communicate with data mining systems. But novice users may find a data mining query language difficult to use and its syntax difficult to remember. Instead, users may prefer to communicate with data mining systems through a graphical user interface (GUI). In relational database technology, SQL serves as a standard core language for relational systems, on top of which GUIs can easily be designed. Similarly, a data mining query language may serve as a core language for data mining system implementations, providing a basis for the development of GUIs for effective data mining. A data mining GUI may consist of the following functional components:
a) Data collection and data mining query composition: this component allows the user to specify task-relevant data sets and to compose data mining queries. It is similar to GUIs used for the specification of relational queries.
b) Presentation of discovered patterns: this component allows the display of the discovered patterns in various forms, including tables, graphs, charts, curves, and other visualization techniques.
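As an illustration of the primitives such a language exposes, a DMQL-style query might look like the following sketch. This is not from the text; the database, table, and attribute names are hypothetical, and the syntax is only modeled loosely on published DMQL examples:

```
use database sales_db
mine associations as purchase_patterns
in relevance to item, customer_type, amount
from sales_transactions
with support threshold = 0.05
with confidence threshold = 0.7
```

The query names the task-relevant data (the `from` and `in relevance to` clauses), the kind of knowledge to be mined (`mine associations`), and interestingness thresholds — the same components a data mining GUI would let the user compose visually.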

(data warehouse) in the project, so that he can answer the probing questions.

Q24. With the help of a block diagram, explain the typical process flow in a data warehouse.
Ans- TYPICAL PROCESS FLOW IN A DATA WAREHOUSE
Any data warehouse must support the following activities:
i) Populating the warehouse (i.e., inclusion of data)
ii) Day-to-day management of the warehouse
iii) Ability to accommodate changes
The processes that populate the warehouse have to be able to extract the data, clean it up, and make it available to the analysis systems. This is done on a daily/weekly basis depending on the quantum of data to be incorporated. The day-to-day management of the data warehouse is not to be confused with maintenance and management of hardware and software. When large amounts of data are stored and new data are continually added at regular intervals, maintaining the quality of the data becomes an important element. The ability to accommodate changes implies that the system is structured in such a way as to cope with future changes without the entire system being remodeled. Based on these, we can view the processes that a typical data warehouse scheme should support as follows.

Q25. How does naive Bayesian classification work?
Ans- Naive Bayesian Classification
The naive Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Each data sample is represented by an n-dimensional feature vector, X = (x1, x2, ..., xn), depicting n measurements made on the sample from n attributes, respectively, A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given an unknown data sample, X (i.e., having no class label), the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naive Bayesian classifier assigns an unknown sample X to the class Ci if and only if P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i. Thus we maximize P(Ci|X).
The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci) P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = si/s, where si is the number of training samples of class Ci and s is the total number of training samples.

Training Bayesian Belief Networks
In the learning or training of a belief network, a number of scenarios are possible. The network structure may be given in advance or inferred from the data. The network variables may be observable or hidden in all or some of the training samples. The case of hidden data is also referred to as missing values or incomplete data. If the network structure is known and the variables are observable, then training the network is straightforward: it consists of computing the CPT entries, as is similarly done when computing the probabilities involved in naive Bayesian classification.

Neural Network Topologies
The arrangement of neural processing units and their interconnections can have a profound impact on the processing capabilities of a neural network. In general, all neural networks have some set of processing units that receive inputs from the outside world, which we refer to, appropriately, as the input units. Many neural networks also have one or more layers of hidden processing units that receive inputs only from other processing units. A layer or slab of processing units receives a vector of data or the outputs of a previous layer of units and processes them in parallel. The set of processing units that represents the final result of the neural network computation is designated as the output units.
There are three major connection topologies that define how data flows between the input, hidden, and output processing units.

Nonlinear Regression
Polynomial regression can be modeled by adding polynomial terms to the basic linear model. By applying transformations to the variables, we can convert the nonlinear model into a linear one that can then be solved by the method of least squares.

Transformation of a polynomial regression model to a linear regression model: consider a cubic polynomial relationship given by

Y = a + b1 X + b2 X^2 + b3 X^3

To convert this equation to linear form, we define new variables:

X1 = X,  X2 = X^2,  X3 = X^3

The equation can then be converted to linear form by applying the above assignments, resulting in

Y = a + b1 X1 + b2 X2 + b3 X3

which is solvable by the method of least squares.
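The transformation above can be checked numerically. The sketch below (using NumPy, with illustrative coefficients not taken from the text) builds the new variables X1, X2, X3 as columns and solves the resulting linear system by least squares, recovering the cubic's coefficients:

```python
import numpy as np

# Illustrative cubic relationship: Y = 2 + 1*X + 0.5*X^2 - 0.25*X^3
x = np.linspace(-3, 3, 50)
y = 2 + 1 * x + 0.5 * x**2 - 0.25 * x**3

# Define the new variables X1 = X, X2 = X^2, X3 = X^3 so the model is
# linear in the unknown coefficients, then solve by least squares.
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.round(coeffs, 4))  # recovers a, b1, b2, b3
```

Because the synthetic data is noise-free, least squares recovers the coefficients 2, 1, 0.5, -0.25 essentially exactly; with noisy data it would return the best-fitting values instead.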

Q. Enlist the desirable schemes required for a good architecture of data mining systems.
Ans- ARCHITECTURES OF DATA MINING SYSTEMS
A good system architecture will enable the system to make the best use of the software environment; accomplish data mining tasks in an efficient and timely manner; interoperate and exchange information with other information systems; be adaptable to users' different requirements; and evolve with time. To identify the desired architectures for data mining systems, we consider how data mining is coupled with database/data warehousing systems, under the following schemes:
a) no coupling
b) loose coupling
c) semi-tight coupling
d) tight coupling

No coupling
This means that the data mining system does not utilize any function of a database or data warehousing system. Such a system fetches data from a particular source such as a file, processes the data using some data mining algorithms, and then stores the mining results in another file. This approach has disadvantages:
1) A database system provides a great deal of flexibility and efficiency at storing, organizing, accessing, and processing data. Without this, working from a flat file, a data mining system may spend a large amount of time finding, collecting, cleaning, and transforming data.

Q. CLUSTERING IN DATA MINING
Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Representing data by fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification: it represents many data objects by few clusters, and hence models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept.
Therefore, clustering is unsupervised learning of a hidden data concept. Data mining deals with large databases, which impose additional, severe computational requirements on cluster analysis.

Requirements for clustering
Clustering is a challenging and interesting field, and potential applications pose their own special requirements. The following are typical requirements of clustering in data mining.
Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results, so highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

Nominal, Ordinal and Ratio-Scaled Variables
Nominal Variables
A nominal variable is a generalization of the binary variable in that it can take on more than two states. For example, map_color is a nominal variable that may have, say, five states: red, yellow, green, pink, and blue. Nominal variables can be encoded by asymmetric binary variables by creating a new binary variable for each of the M nominal states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0. For example, to encode the variable map_color, a binary variable can be created for each of the five colors listed above. For an object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0.

Ordinal Variables
A discrete ordinal variable resembles a nominal variable, except that the M states of the ordinal variable are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively.
For example, professional ranks are often enumerated in a sequential order, such as assistant, associate, and full. A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but their actual magnitude is not. For example, the relative ranking in a particular sport (e.g., gold, silver, bronze) is often more essential than the actual values of a particular measure.

Ratio-Scaled Variables
A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula Ae^(Bt) or Ae^(-Bt), where A and B are positive constants. Typical examples include the growth of a bacteria population or the decay of a radioactive element. There are three methods to handle ratio-scaled variables for computing the dissimilarity between objects.

Neural Network Topologies
There are three major connection topologies that define how data flows between the input, hidden, and output processing units.

Feed-Forward Networks
Feed-forward networks are used in situations where we can bring all of the information to bear on a problem at once and present it to the neural network. It is like a pop quiz, where the teacher walks in, writes a set of facts on the board, and says, "OK, tell me the answer." You must take the data, process it, and jump to a conclusion. In this type of neural network, the data flows through the network in one direction, and the answer is based solely on the current set of inputs.

Q. Explain the concepts of data warehousing and data mining.
Ans- A data warehouse is a collection of a large amount of data, and these data are pieces of information used to make suitable managerial decisions (a storehouse of data). Examples range from student data to the details of the citizens of a city, the sales of previous years, or the number of patients that came to a hospital with different ailments. Such data becomes a storehouse of information. Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules. The main concept of data mining is using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation.

Q15. Define data mining query in terms of primitives.
Ans:
a) Growing data volume: The main reason automated computer systems are needed for intelligent data analysis is the enormous volume of existing and newly appearing data that requires processing. The amount of data accumulated each day by various business, scientific, and governmental organizations around the world is daunting.
b) Limitations of human analysis: Two other problems that surface when human analysts process data are the inadequacy of the human brain when searching for complex multifactor dependencies in data, and the lack of objectiveness in such an analysis.
c) Low cost of machine learning: While data mining does not eliminate human participation in solving the task completely, it significantly simplifies the job and allows an analyst who is not a professional in statistics and programming to manage the process of extracting knowledge from data.

Q. List various applications of data mining in various fields.