
What is data mining?

The non-trivial extraction of implicit, previously unknown, and potentially useful information from data.

Data mining encompasses a number of different technical approaches, such as:
o clustering,
o data summarization,
o learning classification rules,
o finding dependency networks,
o analysing changes, and
o detecting anomalies

Comparison of Data Mining and DBMS

DBMS - queries based on the data held, e.g.
o last month's sales for each product
o sales grouped by customer age etc.
o list of customers who lapsed their policy

Data Mining - infer knowledge from the data held to answer queries, e.g.
o what characteristics do customers share who lapsed their policies and how do they differ from those who renewed their policies?
o why is the Cleveland division so profitable?
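The contrast above can be illustrated with a small sketch. The records, field names (`age_band`, `pays_by`, `lapsed`) and both functions are hypothetical and chosen only for illustration. The first function behaves like a DBMS query (it returns the rows that match a stated condition), while the second takes a first step towards mining by summarising how the lapse rate differs across the values of an attribute.

```python
from collections import Counter

# Hypothetical policy records; all field names and values are invented.
policies = [
    {"age_band": "18-30", "pays_by": "cash", "lapsed": True},
    {"age_band": "18-30", "pays_by": "cash", "lapsed": True},
    {"age_band": "31-50", "pays_by": "cash", "lapsed": False},
    {"age_band": "31-50", "pays_by": "direct debit", "lapsed": False},
    {"age_band": "51+", "pays_by": "direct debit", "lapsed": False},
    {"age_band": "18-30", "pays_by": "direct debit", "lapsed": True},
]


def lapsed_customers(rows):
    """DBMS-style query: return exactly the rows that satisfy the condition."""
    return [r for r in rows if r["lapsed"]]


def lapse_rate_by(rows, attribute):
    """Mining-style summary: lapse rate for each value of one attribute."""
    totals, lapsed = Counter(), Counter()
    for r in rows:
        totals[r[attribute]] += 1
        lapsed[r[attribute]] += r["lapsed"]  # True counts as 1, False as 0
    return {value: lapsed[value] / totals[value] for value in totals}


print(len(lapsed_customers(policies)))
print(lapse_rate_by(policies, "age_band"))
```

A real data mining tool would search over many attributes and combinations automatically; this sketch only shows the difference in the kind of question being asked.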

Who needs data mining?

Who(ever) has information fastest and uses it wins.

Businesses are looking for new ways to let end users find the data they need to:
o make decisions
o serve customers
o gain the competitive edge

Applications

o Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc.
o Finance - stock market prediction, credit assessment, fraud detection etc.
o Marketing/sales - product analysis, buying patterns, sales prediction, target mailing, identifying `unusual behaviour' etc.
o Knowledge Acquisition
o Scientific discovery - superconductivity research, etc.
o Engineering - automotive diagnostic expert systems, fault detection etc.

Data Mining Goals


Classification

The DM system learns from examples or the data how to partition or classify the data, i.e. it formulates classification rules.

Example - customer database in a bank
o Question - Is a new customer applying for a loan a good investment or not?
o Typical rule formulated -

if STATUS = married and INCOME > 10000 and HOUSE_OWNER = yes then INVESTMENT_TYPE = good
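As a minimal sketch, the rule above can be written as a function applied to a customer record. The `Customer` class and the `"unknown"` fallback are assumptions made for illustration; the notes give only the single rule and do not say what happens when it does not fire.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    status: str        # e.g. "married" or "single"
    income: float      # same units as the threshold in the rule
    house_owner: bool


def investment_type(c: Customer) -> str:
    """Apply the single rule from the notes; other cases are left 'unknown'."""
    if c.status == "married" and c.income > 10000 and c.house_owner:
        return "good"
    return "unknown"  # assumption: the notes do not define this case


applicant = Customer(status="married", income=25000, house_owner=True)
print(investment_type(applicant))
```

In a real system the rule would be induced from historical records by the data mining tool, not written by hand; the function only shows what the learned rule does once formulated.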

Association

Rules that associate one attribute of a relation to another.

Set oriented approaches are the most efficient means of discovering such rules.

Example - supermarket database
o 72% of all the records that contain items A and B also contain item C
o the specific percentage of occurrences, 72, is the confidence factor of the rule
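The confidence factor described above is the fraction of records containing the antecedent items (A and B) that also contain the consequent (C). A minimal sketch follows, using a small invented list of baskets; the function names and data are assumptions for illustration only.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also hold the consequent."""
    base = support_count(transactions, antecedent)
    if base == 0:
        return 0.0  # rule never applies in this data
    both = support_count(transactions, set(antecedent) | set(consequent))
    return both / base


# Invented supermarket baskets.
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B"},
]

print(confidence(baskets, {"A", "B"}, {"C"}))  # 3 of the 4 baskets with A and B
```

With these baskets the rule "A and B imply C" has a confidence of 0.75, the counterpart of the 72% figure in the supermarket example.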

Sequence/Temporal

Sequential pattern functions analyse collections of related records and detect frequently occurring patterns over a period of time.

The difference between sequence rules and other rules is the temporal factor.

Example - retailers database
o Can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven

Example - natural disasters database
o Discovery could be that when there is an earthquake in Los Angeles the next day Mount Kilimanjaro erupts
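The retailer example can be sketched as a count of which items appear before the first purchase of a target item in each customer's time-ordered history. The data and the function name `items_preceding` are hypothetical, and this is a simplification: real sequential pattern algorithms also consider ordered subsequences and time windows.

```python
from collections import Counter


def items_preceding(histories, target):
    """Count, per customer, the distinct items bought before the first `target`."""
    counts = Counter()
    for purchases in histories:  # each history is ordered by time
        if target not in purchases:
            continue
        first = purchases.index(target)
        counts.update(set(purchases[:first]))
    return counts


# Invented purchase histories, oldest purchase first.
histories = [
    ["kettle", "toaster", "microwave oven"],
    ["toaster", "microwave oven", "kettle"],
    ["fridge", "toaster"],
    ["microwave oven"],
]

preceding = items_preceding(histories, "microwave oven")
print(preceding.most_common())
```

The temporal factor is what separates this from an association rule: the kettle bought after the microwave oven in the second history is not counted.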

Techniques

o Set oriented database methods
o Statistics
o Clustering
o Visualisation
o Neural networks
o Rule Induction

Set oriented approaches/Databases
o make use of DBMSs to discover knowledge, SQL is limiting

Statistics
o can be used in several data mining stages
  - data cleansing, i.e. the removal of erroneous or irrelevant data known as outliers
  - EDA, exploratory data analysis, e.g. frequency counts, histograms etc.
  - data selection - sampling facilities, which reduce the scale of computation
  - attribute re-definition, e.g. Body Mass Index, BMI, which is Weight/Height^2
  - data analysis - measures of association and relationships between attributes, interestingness of rules, classification etc.

Visualisation
o enhances EDA, makes patterns more visible, e.g. NETMAP, a commercial data mining tool, uses this technique

Clustering i.e. Cluster Analysis
o Clustering and segmentation is basically partitioning the database so that the members of each partition or group are similar according to some criteria or metric
o Clustering according to similarity is a concept which appears in many disciplines, e.g. in chemistry the clustering of molecules
o Data mining applications make use of clustering according to similarity, e.g. to segment a client/customer base
o It provides sub-groups of a population for further analysis or action - very important when dealing with very large databases
o Can be used for profile generation for target marketing, i.e. previous responses to mailing campaigns can be used to generate a profile of people who responded, and this can be used to predict response and filter mailing lists to achieve the best response
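To make the idea of segmenting a customer base by similarity concrete, here is a minimal k-means sketch in plain Python. The notes do not name a clustering algorithm, so k-means is an assumption chosen for illustration, as are the customer data (age, yearly spend) and the starting centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            # assign each point to the nearest centroid (squared Euclidean distance)
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            groups[dists.index(min(dists))].append(p)
        # move each centroid to the mean of its group; keep it if the group is empty
        centroids = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups


# Invented customers as (age, yearly spend); values are not normalised,
# so spend dominates the distance in this toy example.
customers = [(22, 150), (25, 180), (24, 160), (58, 900), (61, 950), (60, 870)]
centroids, segments = kmeans(customers, centroids=[customers[0], customers[3]])
print(segments)
```

Each resulting segment is a sub-group that could be profiled or targeted separately. In practice the attributes would usually be scaled first so that no single one dominates the similarity metric.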

Knowledge acquisition using data mining

Expert systems are models of real world processes.

Much of the information is available straight from the process, e.g.
o in production systems, data is collected for monitoring the system
o knowledge can be extracted using data mining tools
o experts can verify the knowledge

Multimedia Data Mining in Digital Libraries: Standards and Features


Sanjeevkumar R. Jadhav*, and Praveenkumar Kumbargoudar*

Abstract
A digital library collects, stores, preserves and retrieves digital data. For this purpose, information in different formats such as text, images, video and audio needs to be converted to digital form. Data mining techniques are widely used when converting multimedia files in libraries. The present paper attempts to define the term data mining. It also covers different data mining features and standards. The paper explains the architecture of data mining, which comprises the following stages: (1) domain understanding; (2) data selection; (3) cleaning and preprocessing; (4) discovering patterns; (5) interpretation; and (6) reporting and using discovered knowledge. It emphasizes the need to develop multimedia data mining techniques and standards in libraries for the conversion of multimedia information.

1. INTRODUCTION
Over the past few decades, rapid changes in information technology have drastically changed the functions and activities of libraries. Information and Communication Technology has created a new type of work culture, new forms of information storage, and new means of communicating and disseminating information. The advent of electronic resources and their increased use in libraries has brought about significant changes in the storage and communication of information. As a result, conventional libraries are transforming into digital libraries. The majority of libraries have already computerized and are digitizing their printed collections. In India, the process of digitization is slow compared to developed countries. This is because only 21% of the Indian population is computer literate and only 14% of the Indian population is using the Internet. Owing to developments in digitization, many libraries are digitizing their collections by transforming their printed materials into digital form. A fully developed digital library environment involves the following elements [1]:
1. Initial conversion of content from physical to digital form.
2. The extraction or creation of metadata or indexing information describing the content to facilitate searching and discovery, as well as administrative and structural metadata to assist in object viewing, management and preservation.
3. Storage of digital content and metadata in an appropriate multimedia repository. The repository will include rights management capabilities to enforce Intellectual Property Rights, if required. E-commerce functionality may also be present if needed to handle accounting and billing.
4. Client services for the browser, including repository querying and workflow.
5. Content delivery via file transfer or streaming media.
6. Patron access through a browser or dedicated client.
* Gulbarga University, GULBARGA: 585 106. Karnataka. E-Mail: kumbargoudar@rediffmail.com

7. A private or public network.

2. DIGITIZATION AND DATA MINING
Digitization refers to the conversion of an item, be it printed text, manuscript, image, sound, film or video recording, from one format (usually print or analogue) into digital. The process basically involves taking a physical object and making an electronic photograph of it. An image of the physical object is captured using a scanner or digital camera and converted to a digital format that can be stored electronically and accessed via a computer [2]. Data and information are available in different formats, including text, images, video, audio, pictures and maps. In the case of text, the printed text needs to be scanned and links provided to access it. In the case of multimedia formats such as images, audio, pictures, maps and video, however, conversion and systematic presentation are not easy. Automatic search is also needed for easy accessibility. Easy search and effective, systematic presentation of the data are essential for multimedia information. For this purpose, libraries need to adopt data mining techniques. Data mining techniques draw on logic, multimedia and artificial intelligence techniques.

Data mining is the automatic extraction of patterns of information from historical data, enabling companies to focus on the most important aspects of their business, telling them what they did not know and had not even thought of asking [3]. It has also been described as the process of automating information discovery [4], which improves decision making and gives a company advantages in the market. Another definition is that it is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules [5]. Data mining is an applied discipline that grew out of statistical pattern recognition, machine learning and artificial intelligence, coupled with business decision making in order to optimize and enhance it. Initially, data mining techniques were applied to structured data from databases. Recently two branches of data mining, text data mining and Web data mining, have emerged [6, 7]. They have their own research agendas, communities of researchers, and supporting companies that develop technologies and tools. Multimedia data mining, by contrast, is still at an early stage, and further development is needed to present multimedia information effectively.

There are four types of multimedia data: audio data, which includes sound, speech and music; image data (black-and-white and colour images); video data, which includes time-aligned sequences of images; and electronic or digital ink, which is sequences of time-aligned 2D or 3D coordinates of a stylus, a light pen, data glove sensors, or a similar device. All this data is generated by specific kinds of sensors.

The concept of mining in multimedia is also referred to as automatic annotation or annotation mining. Three main pattern discovery approaches have been used for automatic annotation in multimedia data mining. These approaches differ primarily in how external knowledge is provided to mine concepts. The first approach assigns keywords to the data or classifies it. The second approach works through clustering: multimedia documents are clustered first and the resulting clusters are then assigned keywords by an annotator. The third approach does not rely on a manual annotator and tries to mine concepts from contextual information.

Multimedia Data Mining (MDM) is a part of multimedia technology, which covers the following areas [8]:
o Media compression and storage.
o Delivering streaming media over networks with the required quality of service.
o Media restoration, transformation, and editing.
o Media indexing, summarization, search, and retrieval.
o Creating interactive multimedia systems for learning/training and creative art production.
o Creating multimodal user interfaces.

3. MULTIMEDIA DATA MINING ARCHITECTURE
The data mining process consists of several stages, which are interrelated and interactive. The main stages of the data mining process are (1) domain understanding; (2) data selection; (3) cleaning and preprocessing; (4) discovering patterns; (5) interpretation; and (6) reporting and using discovered knowledge.
The domain understanding stage requires learning how the results of data mining will be used, so as to gather all relevant prior knowledge before mining [9].

Figure: Multimedia Data Mining Architecture

The data selection stage requires the user to target a database or select a subset of fields or data records to be used for data mining. Proper domain understanding at this stage helps in the identification of useful data. This is the most time-consuming stage of the entire data mining process for business applications, since data are never clean or in a form suitable for data mining. For multimedia data mining, this stage is generally not an issue, because the data are not in relational form and there are no subsets of fields to choose from.

The next stage in a typical data mining process is the preprocessing step, which involves integrating data from different sources and making choices about representing or coding certain data fields that serve as inputs to the pattern discovery stage. Such representation choices are needed because certain fields may contain data at levels of detail not considered suitable for the pattern discovery stage. The preprocessing stage is of considerable importance in multimedia data mining, given the unstructured nature of multimedia data.

The pattern discovery stage is the heart of the entire data mining process. It is the stage where the hidden patterns and trends in the data are actually uncovered. There are several approaches to the pattern discovery stage. These include association, classification, clustering, regression, time-series analysis and visualization. Each of these approaches can be implemented through one of several competing methodologies, such as statistical data analysis, machine learning, neural networks and pattern recognition. It is because of the use of methodologies from several disciplines that data mining is often viewed as a multidisciplinary field.

The interpretation stage of the data mining process is used to evaluate the quality of the discovery and its value, in order to determine whether a previous stage should be revisited. Proper domain understanding is crucial at this stage to put a value on discovered patterns.

The final stage of the data mining process consists of reporting and putting to use the discovered knowledge to generate new actions, products, services or marketing strategies, as the case may be.

According to Myatt [10], any exploratory data mining project should include the following steps:
1. Problem Definition: The problem to be solved, along with the projected deliverables (information products), should be clearly defined, an appropriate team should be put together, and a plan generated for executing the analysis.
2. Data Preparation: Prior to starting any data analysis or data mining project, the data should be collected, characterized, cleaned, transformed, and partitioned into an appropriate form for further processing.
3. Implementation of the Analysis: On the basis of the information from steps 1 and 2, appropriate analysis techniques should be selected, and often these methods need to be optimized.
4. Deployment of Results: The results from step 3 should be communicated and/or deployed into a pre-existing process.

4. FEATURES AND STANDARDS FOR MULTIMEDIA DATA MINING
Different image attributes such as colour, edges, shape, and texture are used to extract features for mining. Feature extraction based on these attributes may be performed at the global or local level. For example, colour histograms may be used as features to characterize the overall distribution of colour in an image.
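The colour histogram mentioned above can be sketched as a global feature: each pixel's RGB value is quantised into a small number of bins and the bins are counted. The image representation (rows of `(r, g, b)` tuples with 8-bit channels), the function name and the choice of 4 bins per channel are assumptions made for illustration.

```python
def colour_histogram(pixels, bins_per_channel=4):
    """Global colour histogram: quantise each 8-bit RGB channel, count per bin."""
    hist = [0] * (bins_per_channel ** 3)
    total = 0
    for row in pixels:
        for r, g, b in row:
            # map each 0..255 channel value to a bin index 0..bins_per_channel-1
            rb = r * bins_per_channel // 256
            gb = g * bins_per_channel // 256
            bb = b * bins_per_channel // 256
            hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1
            total += 1
    # normalise so that images of different sizes are comparable
    return [h / total for h in hist] if total else hist


# Invented 2x2 image: three reddish pixels and one blue pixel.
image = [
    [(255, 0, 0), (255, 0, 0)],
    [(0, 0, 255), (250, 10, 5)],
]
features = colour_histogram(image)
print(len(features), max(features))
```

The resulting vector of 64 values is a compact global descriptor. It records how much of each colour is present but not where it occurs, which is why local descriptors are needed for finer patterns.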
Similarly, the shape of a segmented region may be represented as a feature vector of Fourier descriptors to capture the global shape property of the region, or a shape could be described in terms of salient points or segments to provide localized descriptions. Global descriptors are generally easy to compute, provide a compact representation, and are less prone to segmentation errors. However, such descriptors may fail to uncover subtle patterns or changes in shape, because global descriptors tend to integrate the underlying information. Local descriptors, on the other hand, tend to generate a more elaborate representation and can yield useful results even when part of the underlying attribute, for example the shape of a region, is occluded or missing. In the case of video, additional attributes resulting from object and camera motion are used.

In the case of audio, both temporal and spectral domain features have been employed. Examples of the features used include short-time energy, pause rate, zero-crossing rate, normalized harmonicity, fundamental frequency, frequency spectrum, bandwidth, spectral centroid, spectral roll-off frequency and band energy ratio. Many researchers have found cepstral-based features such as Mel-Frequency Cepstral Coefficients (MFCC), as well as Linear Predictive Coefficients (LPC), very useful, especially in mining tasks involving speech recognition.

The MPEG-7 standard provides a good representative set of features for multimedia data. The features are referred to as descriptors in MPEG-7. The MPEG-7 Visual description tools describe visual data such as images and videos, while the Audio description tools account for audio data. The MPEG-7 visual description defines the following main features for color attributes: Color Layout Descriptor, Color Structure Descriptor, Dominant Color Descriptor and Scalable Color Descriptor. The Color Layout Descriptor is a compact and resolution-invariant descriptor that is defined in the YCbCr color space to capture the spatial distribution of color over major image regions. The Color Structure Descriptor captures both color content and information about its spatial arrangement using a structuring element that is moved over the image. The Dominant Color Descriptor characterizes an image or an arbitrarily shaped region by a small number of representative colors. The Scalable Color Descriptor is a color histogram in the HSV color space encoded by a Haar transform to yield a scalable representation. While the above features are defined with respect to an image or a part of it, the feature Group of Frames-Group of Pictures Color (GoFGoPColor) describes the color histogram aggregated over multiple frames of a video [9].

MPEG-7 provides two main shape descriptors; others are based on these and on additional semantic information. The Region Shape Descriptor describes the shape of a region using the Angular Radial Transform (ART). The description is provided in terms of 40 coefficients and is suitable for complex objects consisting of multiple disconnected regions and for simple objects with or without holes. The Contour Shape Descriptor describes the shape of an object based on its outline. The descriptor uses the curvature scale space representation of the contour.

The motion descriptors in MPEG-7 are defined to cover a broad range of applications. The Motion Activity Descriptor captures the intuitive notion of intensity or pace of action in a video clip. The descriptor provides information on the intensity, direction, and spatial and temporal distribution of activity in a video segment. The spatial distribution of activity indicates whether the activity is spatially limited or not. Similarly, the temporal distribution of activity indicates how the level of activity varies over the entire segment. The Camera Motion Descriptor specifies the camera motion types and their quantitative characterization over the entire video segment.
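To illustrate why encoding a histogram with a Haar transform, as the Scalable Color Descriptor does, yields a scalable representation, the following is a minimal sketch of an unnormalised one-dimensional Haar decomposition by repeated pairwise sums and differences. It is a simplified illustration under stated assumptions (a power-of-two number of bins, no quantisation) and not the normative MPEG-7 encoding procedure; the function name and sample histogram are invented.

```python
def haar_transform(values):
    """Unnormalised 1-D Haar decomposition: repeated pairwise sums and differences."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    length = n
    while length > 1:
        half = length // 2
        sums = [out[2 * i] + out[2 * i + 1] for i in range(half)]
        diffs = [out[2 * i] - out[2 * i + 1] for i in range(half)]
        # coarse sums go first and are decomposed again; detail differences follow
        out[:length] = sums + diffs
        length = half
    return out


# Invented 4-bin histogram of pixel counts.
histogram = [4, 2, 5, 5]
coefficients = haar_transform(histogram)
print(coefficients)
```

The first coefficient is the total count and later coefficients add progressively finer detail, so keeping only a prefix of the coefficients gives a coarser but still usable description of the same histogram.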
The Motion Trajectory Descriptor describes the motion trajectory of a moving object based on the spatiotemporal localization of trajectory points. The description provided is at a fairly high level, as each moving object is indicated by one representative point at any time instant. The Parametric Motion Descriptor describes motion, both global and object motion, in a video segment by describing the evolution of arbitrarily shaped regions over time using a two-dimensional geometric transform.

The MPEG-7 Audio standard defines two sets of audio descriptors. The first set consists of low-level features, which are meant for a wide range of applications. The descriptors in this set include Silence, Power, Spectrum, and Harmonicity. The Silence Descriptor simply indicates that there is no significant sound in the audio segment. The Power Descriptor measures temporally smoothed instantaneous signal power. The Spectrum Descriptor captures properties such as the audio spectrum envelope, spectrum centroid, spectrum spread, spectrum flatness, and fundamental frequency. The second set consists of high-level features, which are meant for specific applications. The features in this set include Audio Signature, Timbre, and Melody. The Signature Descriptor is designed to generate a unique identifier for identifying audio content. The Timbre Descriptor captures perceptual features of instrument sound. The Melody Descriptor captures monophonic melodic information and is useful for matching melodies. In addition, the high-level descriptors in MPEG-7 Audio include descriptors for automatic speech recognition, sound classification and indexing.

5. MULTIMEDIA DATA MINING IN DIGITAL LIBRARIES
Quan Liu [11] noted that the standards and guidelines associated with library digitization practices vary from project to project. Over the years, university, public, school, and special libraries have adopted their own policies with regard to digitization.
Some older standards, as well as more recent ones, are widely accepted and practiced in library digitization projects. Metadata standards and image quality standards and guidelines are commonly sought when planning digitization projects. Common metadata standards used to date are Dublin Core, RDF, EAD, TEI, and SGML and its descendants XML and HTML. The MARC standard has been used as the standard interchange format for representing catalog records electronically. In India, only a few university and college libraries have started digitization, and the majority are yet to begin digitizing and converting their collections. Further, experts in library and information science have, to a large extent, provided guidelines only for the conversion of text documents. Hence, there is a need to know about the standards and processes of data mining and about the storage of multimedia data through data mining techniques.

6. CONCLUSION
Multimedia data mining is now an active and growing area of research. Digital library projects need multimedia data mining for the conversion and preservation of multimedia information, and libraries need a data mining strategy for converting their multimedia files. Digital libraries, which are to a large extent accessible through the web, must present multimedia information effectively; only then is their purpose served properly. To this end, a data mining strategy needs to be formed that considers the standards, features and available techniques.
