Data Mining & Housing
Data warehousing supports business analysis and decision making by creating an enterprise-wide integrated database of summarized, historical information. It integrates data from multiple incompatible sources. By transforming data into meaningful information, a data warehouse allows the business manager to perform more substantive, accurate and consistent analysis. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought online. When implemented on high-performance client/server or parallel processing computers, data mining tools can analyze massive databases that support querying effectively. A data warehouse is of course a database, but it contains summarized information. Integrating data mining with a data warehouse yields benefits such as a better querying process, performance sharing and more reliable information. In the following sections we present the concepts of data warehousing and data mining.
By
D. Ajith kumar(IVth CSIT) & SANJAY JOSHI(IIIrd ECE)
Contents
1) Introduction
Features
Decision Support Systems
2) Data warehouse schemas
3) Microsoft Data Warehousing Framework
4) Data mining working procedure
Data warehouse with data mining
An approach to client/server data warehousing
Applications
Conclusion
INTRODUCTION:
Modern organizations are under enormous pressure from recent developments in technology. Clearly we need rapid access to all kinds of information. To support this we need to consider the past and identify relevant trends, and to perform any trend analysis we must have a database. In most organizations you will find really large databases in operation for normal daily transactions. These databases are known as operational databases; in most cases they have not been designed to store historical data or to respond to queries, but simply to support all the applications for day-to-day transactions. The second type of database found in organizations is the data warehouse. This is designed for strategic decision support and is largely built up from the databases that make up the operational database. The basic characteristic of a data warehouse is that it contains a vast amount of data, which can mean billions of records. Smaller, local data warehouses are called data marts. A data warehouse is designed especially for decision support queries; therefore only data that is needed for decision support is extracted from the operational data and stored in the data warehouse, along with the time when it was retrieved from the operational databases.
Datawarehousing
FEATURES :
1. Time dependent: That is, containing information collected over time, which implies there must always be a connection between the information in the warehouse and the time when it was entered.
3. Subject oriented: That is, built around all the existing applications of the operational data. The data warehouse is designed specifically for decision support, while the operational databases contain information for day-to-day use.
The requirements of the end-users: Some end-users need specific query tools so that they can build their queries themselves. Others are interested only in a particular part of the information. We can build a specific type of application around this to speed up the query process.
1. Enterprise data warehouse: contains corporate-wide information integrated from multiple operational data sources for consolidated data analysis. Typically it is composed of several subject areas such as customers, products, and sales, and is used for both tactical and strategic decision making.
2. Data mart: a data store that is built for use by an individual department or division of an organization. Unlike the enterprise data warehouse, data marts are often built from the bottom up by departmental resources for a specific support application or group of users. Data marts contain summarized and often detailed data about a subject area.
DATAWAREHOUSE SCHEMAS :
A multidimensional data model identifies the dimensions, their hierarchies, the measure functions, etc., for the design of a data cube. The data cube itself is realized in the design phase, where various schemas are employed.
1. Star schema :
It is a modeling paradigm in which the data warehouse contains a single large fact table and a set of smaller dimension tables, one for each dimension.
Fact table:
It contains detailed summary data. Each tuple consists of a foreign key to each dimension table, and corresponds to only one tuple in each dimension table.
Dimension table:
It consists of columns that correspond to the attributes of the dimension. One tuple in a dimension table may correspond to more than one tuple in the fact table.
A 1:N relationship exists between each dimension table and the fact table. The star schema is easy to understand and makes it easy to define hierarchies. It reduces the number of physical joins and is easy to maintain.
2. Snowflake schema :
It consists of a single fact table and multiple dimension tables. The difference between the star schema and the snowflake schema is that in the star schema the dimension tables are denormalized, whereas in the snowflake schema they are normalized.
[Figure: a fact table linked to the Dimension1, Dimension2 and Dimension3 tables]
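The star schema described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumed names: the tables (dim_product, dim_store, dim_date, fact_sales), their columns and the sample rows are hypothetical and do not come from the paper.

```python
import sqlite3


def build_star_schema():
    """Create a small in-memory star schema: one fact table, three dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, quarter TEXT);
        -- Each fact tuple holds one foreign key per dimension plus the measure.
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            store_id   INTEGER REFERENCES dim_store(store_id),
            date_id    INTEGER REFERENCES dim_date(date_id),
            amount     REAL
        );
        """
    )
    # Hypothetical sample data.
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "pen", "stationery"), (2, "lamp", "furniture")],
    )
    conn.executemany("INSERT INTO dim_store VALUES (?, ?)", [(1, "north"), (2, "south")])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?)", [(1, "Q1"), (2, "Q2")])
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(1, 1, 1, 10.0), (1, 2, 1, 15.0), (2, 1, 2, 40.0), (1, 1, 2, 5.0)],
    )
    return conn


def sales_by_category(conn):
    """Aggregate the measure along one dimension with a single join."""
    rows = conn.execute(
        """
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        """
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    print(sales_by_category(build_star_schema()))
```

In a snowflake variant of this sketch, the category column would move out of dim_product into its own normalized table, at the cost of one more join in the query.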
The Microsoft Data Warehousing Framework describes the relationships between the various components used in the process of building, using and managing a data warehouse.
[Figure: Microsoft Data Warehousing Framework. Labels: Information directory, Building, Using, Operational Sources, End-User Tools, Schema, Transform, Schedule, Repl, Info Publish, OLAP]
The core of the Microsoft framework is a set of enabling technologies comprising the data transport layer and an integrated data repository. Operational data must pass through a cleaning and transformation stage before being placed into the data marts or data warehouse, in order to conform to the decisions laid out during the design stage. End-user tools, including desktop productivity products, specialized analysis products and custom programs, are used to gain access to the information in the data warehouse. Ideally, user access is through a directory facility that enables the user to search for appropriate and relevant data to resolve business questions, and that provides a layer of security between the users and the backend systems. Finally, a variety of tools come into play for the management of the data warehouse environment, such as scheduling repeated tasks and managing multiserver networks.
Microsoft Repository provides the integration point for the metadata shared by the various tools used in the data warehousing process. Shared metadata allows for the transparent integration of multiple tools from a variety of vendors, without the need for specialized interfaces between each of the products.
Datamining
Data mining, or knowledge discovery in databases, is the nontrivial extraction of implicit, previously unknown and potentially useful information from data. Data mining is the search for relationships and global patterns that exist in large databases but are hidden among vast amounts of data.
WORKING PROCEDURE :
Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Four types of relationships are generally sought:
Classes : Stored data is used to locate data in predetermined groups.
Clusters : Data items are grouped according to logical relationships or consumer preferences.
Associations : Data can be mined to identify associations.
Sequential patterns : Data is mined to anticipate behaviour patterns and trends.
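As an illustration of the "associations" relationship, the sketch below counts which items occur together in the same transaction and reports support and confidence for each pair. The function name pair_rules, the min_support parameter and the sample baskets are assumptions made for this example; the paper does not prescribe a particular algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return {(a, b): (support, confidence)} for rules of the form 'a implies b'."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support >= min_support:
            # Confidence of 'x implies y' is count(x and y) / count(x).
            rules[(a, b)] = (support, count / item_counts[a])
            rules[(b, a)] = (support, count / item_counts[b])
    return rules


if __name__ == "__main__":
    baskets = [
        ["bread", "milk"],
        ["bread", "butter"],
        ["bread", "milk", "butter"],
        ["milk"],
    ]
    for rule, (support, confidence) in sorted(pair_rules(baskets).items()):
        print(rule, round(support, 2), round(confidence, 2))
```

Only item pairs are considered here, which keeps the sketch short; larger item sets would need a level-wise search.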
Major Steps :
1. Extract, transform and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology professionals.
4. Analyze the data by application software.
5. Present the data in a useful manner, such as a graph or table.
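The major steps above can be sketched end to end in plain Python. Everything here is a simplified assumption for illustration: the field names (region, product, amount), the list used as a stand-in for the warehouse, and the dictionary keyed by (region, product) used in place of a real multidimensional database.

```python
from collections import defaultdict
from datetime import datetime, timezone


def extract(operational_rows):
    """Keep only the fields needed for decision support."""
    return [
        {"region": r["region"], "product": r["product"], "amount": r["amount"]}
        for r in operational_rows
    ]


def transform(rows):
    """Clean the data: drop rows without an amount and normalize text fields."""
    cleaned = []
    for r in rows:
        if r["amount"] is None:
            continue
        cleaned.append(
            {
                "region": r["region"].strip().title(),
                "product": r["product"].strip().lower(),
                "amount": float(r["amount"]),
            }
        )
    return cleaned


def load(warehouse, rows, loaded_at=None):
    """Append rows to the warehouse together with the time they were loaded."""
    loaded_at = loaded_at or datetime.now(timezone.utc)
    for r in rows:
        warehouse.append({**r, "loaded_at": loaded_at})
    return warehouse


def analyze(warehouse):
    """Summarize the measure over two dimensions (region and product)."""
    totals = defaultdict(float)
    for r in warehouse:
        totals[(r["region"], r["product"])] += r["amount"]
    return dict(totals)


def present(totals):
    """Render the summary as a plain-text table."""
    return "\n".join(
        f"{region:<10}{product:<10}{amount:>10.2f}"
        for (region, product), amount in sorted(totals.items())
    )


if __name__ == "__main__":
    operational = [
        {"region": " north ", "product": "Pen", "amount": "10", "clerk": "x"},
        {"region": "North", "product": "pen", "amount": 5, "clerk": "y"},
        {"region": "south", "product": "Lamp", "amount": None, "clerk": "z"},
    ]
    warehouse = load([], transform(extract(operational)))
    print(present(analyze(warehouse)))
```

Stamping each row with loaded_at reflects the time-dependent feature of a warehouse described earlier in the paper.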
Techniques in DataMining :
1. Artificial Neural Networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
2. Decision Trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
3. Genetic Algorithms: Optimization techniques that use processes such as genetic combination, mutation and natural selection in a design based on the concepts of evolution.
4. Rule Induction: The extraction of useful if-then rules from data based on statistical significance.
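Rule induction, the last technique listed, can be illustrated with a very small sketch that derives single-condition if-then rules from labelled records. Note the assumptions: the function induce_rules and its thresholds are hypothetical, and a minimum count plus a minimum confidence are used here as a crude stand-in for a proper statistical significance test.

```python
from collections import Counter, defaultdict


def induce_rules(records, target, min_count=2, min_confidence=0.8):
    """Return rules (attribute, value, predicted_label, confidence), sorted."""
    # For every attribute/value pair, count how often each target label occurs.
    outcomes = defaultdict(Counter)
    for rec in records:
        for attr, value in rec.items():
            if attr != target:
                outcomes[(attr, value)][rec[target]] += 1

    rules = []
    for (attr, value), counts in outcomes.items():
        total = sum(counts.values())
        label, hits = counts.most_common(1)[0]
        confidence = hits / total
        # Keep a rule only if it covers enough records and is reliable enough.
        if total >= min_count and confidence >= min_confidence:
            rules.append((attr, value, label, confidence))
    return sorted(rules)


if __name__ == "__main__":
    customers = [
        {"income": "high", "owns_home": "yes", "buys": "yes"},
        {"income": "high", "owns_home": "no", "buys": "yes"},
        {"income": "low", "owns_home": "yes", "buys": "no"},
        {"income": "low", "owns_home": "no", "buys": "no"},
        {"income": "high", "owns_home": "yes", "buys": "yes"},
    ]
    for attr, value, label, conf in induce_rules(customers, target="buys"):
        print(f"IF {attr} = {value} THEN buys = {label} (confidence {conf:.2f})")
```

A decision tree generalizes this idea by chaining several such conditions along each branch.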
has been transferred from the operational database to the data warehouse; furthermore, in many cases you can clean the data before commencing data mining.
[Figure labels: Operational data, Data Warehouse, Datamarts, databases]
require alteration. Of all the techniques currently available on the market, client/server represents the best choice for building a data warehouse.
APPLICATIONS:
Datawarehousing:
a. Sales and marketing analysis across many industries.
b. Inventory turn and product tracking in manufacturing.
c. Profitable lane or driver risk analysis in transportation.
d. Claims analysis or fraud detection in insurance.
DataMining:
Retail/Marketing: Identifying buying patterns from customers.
Banking: Detecting patterns of fraudulent credit card use.
Healthcare:
1. Identifying the behaviour of the risky customer.
2. Identifying successful medical therapies for different illnesses.
Conclusion: Getting the right information to the right people at the right time is the key to making the right decisions. To make this possible, the data warehouse serves as the path to data mining.
Contact Address:
D. Ajith kumar , 01711A1201, IV/IV CSIT NARAYANA ENGG. COLLEGE, NELLORE. Mail: ajithatmail@yahoo.co.in