Chapter 1
Course Structure
• 27 Sessions
– Lectures
– Labs
• Evaluation
– Mid Sem + End Sem: 70 %
– Project: 20%
– Quizzes / Assignments / Presentations: 10%
• Attendance Policy
– 100% attendance compulsory
Course Materials
• Books
– Data Mining: Concepts and Techniques, 3rd ed.
• By Jiawei Han, Micheline Kamber & Jian Pei
– Introduction to Data Mining
• P. N. Tan, M. Steinbach & V. Kumar
– The Art of R Programming
• Norman Matloff
– R for Everyone
• Jared P. Lander
• Handouts and Case Studies
Course Outline
• Module 1 – DM & KD Concepts and Techniques
• Data Mining
– A step in KD process, dealing with identifying
patterns in data
– Application of a specific algorithm based on the
overall goal of the KD process
Knowledge Discovery Process
[Figure: Knowledge Discovery process diagram. Labels: Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]
Steps of KD Process
1. Learning the application domain:
– relevant prior knowledge and goals of
application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing: (may take
60% of effort!)
4. Data reduction and transformation:
– find useful features, dimensionality/variable
reduction, invariant representation
Steps of KD Process (contd.)
5. Choosing functions of data mining
– summarization, classification, regression,
association, clustering.
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant
patterns, etc.
9. Use of discovered knowledge
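The steps of the KD process can be illustrated with a minimal Python sketch. The records, field names, and helper functions are hypothetical and only stand in for a real data source. The mining step is a simple per-group summarization, chosen because it is the simplest of the functions listed in step 5.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"id": 1, "segment": "A", "spend": 120.0, "region": "north"},
    {"id": 2, "segment": "A", "spend": None, "region": "north"},
    {"id": 3, "segment": "B", "spend": 300.0, "region": "south"},
    {"id": 4, "segment": "B", "spend": 500.0, "region": "south"},
    {"id": 5, "segment": "A", "spend": 80.0, "region": "east"},
]


def select(records, fields):
    """Step 2: create the target data set by keeping only the chosen fields."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records, field):
    """Step 3: drop records whose value for `field` is missing."""
    return [r for r in records if r[field] is not None]


def transform(records, field):
    """Step 4: min-max scale `field` to the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]


def mine(records, group_field, value_field):
    """Steps 5-7: summarization, here the mean of `value_field` per group."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_field], []).append(r[value_field])
    return {g: mean(v) for g, v in groups.items()}


def run_pipeline(raw):
    """Chain selection, cleaning, transformation, and mining."""
    target = select(raw, ["segment", "spend"])
    cleaned = clean(target, "spend")
    scaled = transform(cleaned, "spend")
    return mine(scaled, "segment", "spend")


if __name__ == "__main__":
    # Step 8: present the result so that it can be evaluated.
    for segment, avg in sorted(run_pipeline(RAW).items()):
        print(f"segment {segment}: mean scaled spend = {avg:.3f}")
```

Each function maps to one step, so a step can be swapped (for example, imputing missing values instead of dropping them) without touching the others.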
Evolution of DM
1980s
•ERP
1990s
•CRM
2000s
•eCommerce
2010s
•Data Mining / Big Data Analytics
Why has the new age emerged?
• Computing Storm
– Cheaper technology
– Mobile computing
– Social networking
– Cloud computing
• Data Storm
– Volume
– Velocity
– Variety
• Convergence Storm
– Traditional software and hardware technologies
What is Big Data?
• Data becomes large enough that it cannot be
processed using conventional methods
• It isn’t just a description of raw volume
• Real issue is usability / accessibility
• Challenge is to develop cost-effective and reliable
methods for extracting value from large and complex
sets of data in real time
• Big Data analytics vs. Traditional analytics
– Speed
– Scale
– Complexity
Big Data Examples
• Europe's Very Long Baseline Interferometry (VLBI)
has 16 telescopes, each of which produces
1 Gigabit/second of astronomical data over a 25-day
observation session
– storage and analysis a big problem
• AT&T handles billions of calls per day
– so much data that it cannot all be stored; analysis has to be
done “on the fly”, on streaming data
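A back-of-the-envelope calculation shows why storage is a problem in the VLBI example. The sketch below assumes that all 16 telescopes record continuously for the full 25 days and uses decimal units (1 PB = 10^15 bytes). Both are simplifying assumptions, not figures from the observatory.

```python
TELESCOPES = 16
RATE_BITS_PER_SEC = 1e9          # 1 Gigabit/second per telescope
SESSION_DAYS = 25
SECONDS_PER_DAY = 24 * 60 * 60


def session_volume_bytes(telescopes, rate_bits_per_sec, days):
    """Total bytes produced if every telescope streams for the whole session."""
    total_bits = telescopes * rate_bits_per_sec * days * SECONDS_PER_DAY
    return total_bits / 8


if __name__ == "__main__":
    volume = session_volume_bytes(TELESCOPES, RATE_BITS_PER_SEC, SESSION_DAYS)
    print(f"{volume / 1e15:.2f} PB per observation session")
```

Under these assumptions a single session produces roughly 4.3 PB of raw data.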
• Variety
– Assortment of data
– Traditional data, especially operational data, is “structured”
– Recently data has become increasingly “unstructured”
– Data does not have a predefined data model and/or does
not fit well into a relational database
– Text, audio, video, image, geospatial, Internet data (click
streams and log files)
The 3 V’s (contd.)
• Variety
– Unstructured data
– Amount of data is doubling every two years
– Most new data is unstructured (~95%)
– Unstructured data is vastly underutilized
• Velocity
– Speed at which data is created, accumulated, ingested, and
processed
Is Big Data analytics worth the effort?
• Competitive advantage in an ultracompetitive global economy
• Nucleus Research (2011) concluded that analytics pays back
$10.66 for every dollar spent
• Media Math Co. achieved a 212% ROI in five months with an
annual revenue lift of $2.2M
• Drive top-line and simultaneously minimize operational cost
• Big Data analytics isn’t constrained by a predefined set of
questions
• “You don’t know what you don’t know”
• You don’t have to guess
• Fact-based decisions: use data to find answers that are more
specific and significantly more useful
Data Mining in Business Intelligence
[Figure: layered pyramid in which the potential to support business decisions increases toward the top. Recoverable layers, bottom to top: Statistical Summary, Querying, and Reporting; Data Exploration; Decision Making (End User)]
• Task:
– Given customer information for the past N
months, predict who is likely to attrite next month
– Also, estimate customer value and the most cost-effective
offer to make to this customer
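The second part of the task, picking a cost-effective offer, can be sketched as an expected-value comparison. This is a minimal illustration, not a prescribed method: the `Offer` class, the `save_rate` parameter, and all numbers are hypothetical, the churn probability is assumed to come from some separately trained model, and the offer cost is assumed to be paid whether or not the customer stays.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    name: str
    cost: float       # cost of making the offer
    save_rate: float  # assumed chance the offer retains a customer who would leave


def expected_gain(churn_prob, customer_value, offer):
    """Expected customer value retained by the offer, minus the offer's cost."""
    return churn_prob * offer.save_rate * customer_value - offer.cost


def best_offer(churn_prob, customer_value, offers):
    """Return the offer with the highest positive expected gain, or None."""
    best = max(
        offers,
        key=lambda o: expected_gain(churn_prob, customer_value, o),
        default=None,
    )
    if best is None or expected_gain(churn_prob, customer_value, best) <= 0:
        return None
    return best


if __name__ == "__main__":
    offers = [Offer("discount", 20.0, 0.3), Offer("free upgrade", 50.0, 0.5)]
    for churn_prob in (0.40, 0.05):
        choice = best_offer(churn_prob, 600.0, offers)
        print(churn_prob, choice.name if choice else "no offer")
```

With these illustrative numbers, the high-risk customer gets the more expensive offer, while no offer is worth making to the low-risk customer.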
Credit Risk Assessment Case
• Situation: Person applies for a loan
• Task:
– Should a bank approve the loan?
• Note:
– People who have the best credit don’t need the loans, and people
with the worst credit are not likely to repay
– Bank’s best customers are in the middle
• Outlier analysis
– Outlier: a data object that does not comply with the
general behavior of the data
– Noise or exception?
– Methods: by-product of clustering or regression analysis, …
– useful in fraud detection, rare events analysis
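Outlier detection as a by-product of regression can be sketched as follows: fit a least-squares line and flag the points whose residuals are unusually large. The threshold of two standard deviations and the sample data are assumptions made for illustration. Whether a flagged point is noise or a genuine exception still has to be judged in context.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def residual_outliers(xs, ys, threshold=2.0):
    """Indices of points whose residual exceeds `threshold` standard deviations."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []  # perfect fit: nothing deviates from the general behavior
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]


if __name__ == "__main__":
    xs = list(range(10))
    ys = [2 * x + 1 for x in xs]
    ys[5] += 20  # inject one point that does not follow the trend
    print(residual_outliers(xs, ys))
```

A single extreme point also pulls the fitted line toward itself, so robust fitting methods are often preferred when many outliers are expected.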
Data Mining Functionalities
• Trend and evolution analysis
– Trend, time series, and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
• e.g., first buy digital camera, then buy large SD memory cards
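The camera-then-memory-card example can be made concrete by counting how many purchase histories contain a pattern in order. This is a minimal sketch of the support of one sequential pattern, not a full sequential pattern mining algorithm. The purchase histories and item names are made up.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the match, which
    # forces each later pattern item to be found after the earlier one.
    return all(item in remaining for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as a subsequence."""
    if not sequences:
        return 0.0
    hits = sum(contains_in_order(s, pattern) for s in sequences)
    return hits / len(sequences)


# Hypothetical purchase histories, one list per customer, in time order.
HISTORIES = [
    ["camera", "sd_card"],
    ["camera", "tripod", "sd_card"],
    ["sd_card", "camera"],
    ["laptop"],
]

if __name__ == "__main__":
    print(sequence_support(HISTORIES, ["camera", "sd_card"]))
```

The third history contains both items but in the opposite order, so it does not count toward the pattern.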
Are All Patterns Interesting?
• A data mining system has the potential to generate thousands
or even millions of patterns, or rules
• Only a small fraction of the patterns potentially generated
would actually be of interest
• What makes a pattern interesting?
– easily understood
– valid on new or test data with some degree of certainty
– potentially useful, and
– novel
• An interesting pattern represents knowledge
• Measures of pattern interestingness
– Support, confidence, accuracy, coverage, unexpectedness, actionability
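Two of these measures, support and confidence, have simple definitions for a rule A => B: support is the fraction of transactions containing both A and B, and confidence is the fraction of transactions containing A that also contain B. A minimal sketch, using made-up market-basket transactions:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical transactions for illustration only.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

if __name__ == "__main__":
    print("support:", support(TRANSACTIONS, {"bread", "milk"}))
    print("confidence:", confidence(TRANSACTIONS, {"bread"}, {"milk"}))
```

Here the rule bread => milk holds in 3 of 5 transactions (support 0.6) and in 3 of the 4 transactions that contain bread (confidence about 0.75).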
Data Mining: Confluence of Multiple Disciplines
• User Interaction
– Interactive mining
– Incorporation of background knowledge
– Presentation and visualization of data mining results
Major Issues in Data Mining
• Efficiency and Scalability
– Parallel, distributed, stream, and incremental mining
methods
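Incremental methods update a result as each new record arrives instead of rescanning all stored data. One small, well-known instance is Welford's online algorithm for the mean and variance, sketched below. The class name `RunningStats` is an illustrative choice, and the example only shows the incremental idea, not a complete mining method.

```python
class RunningStats:
    """Welford's online algorithm: one pass over the data, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for a data stream
        stats.update(value)
    print(stats.mean, stats.variance)
```

Because no past values are stored, the same update works on streams that are too large to keep, such as the call records mentioned earlier.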