
Data mining

Information Technology Engineering

Submitted By:
Karadkhelkar Kalyani (18)
Kawale Tina (19)
Mhatre Shivani (23)

Usha Mittal Institute of Technology


SNDT WOMEN’S UNIVERSITY
2018-19
Chapter 1
Abstract:
Students often need guidance in choosing suitable courses to complete their academic
degrees. Course recommender systems have been suggested in the literature as a tool to help
students make informed course selections. Although a variety of techniques have been
proposed for these systems, combining data mining with user ratings can improve the
recommendations. In this project we show how an association rule algorithm, the Apriori
algorithm, is useful in a course recommender system.

This project presents a course recommendation system based on association rules, which
incorporates a data mining process together with user ratings in recommendation. Starting
from a history of real data, it discovers significant rules that associate academic courses
followed by former students. These rules are later used to infer recommendations.
Chapter 2

Introduction:
Students pursuing higher education degrees face two challenges: a myriad of courses from
which to choose, and a lack of knowledge about which courses to follow and in what sequence. Most of
them choose and register for their courses according to the recommendations of friends and
colleagues. A recommender system could help students find courses of interest. The proposed system
is based on the same principle: it takes advantage of the collective experience of students who have
finished their studies. Since the volume of data concerning registered students keeps increasing,
applying data mining to interpret this data can reveal hidden relations between courses followed by
students. Once interesting results are discovered, a course recommender system can use them to
predict the most appropriate courses for current students. As users rate the recommendations
provided, system performance can be improved.

A course recommender system aims at predicting the best combination of courses for students
to select. This project focuses on the effectiveness of incorporating data mining in course
recommendation.
Chapter 3
Data Mining:
Definition:

“Data mining is the process of discovering meaningful new correlations, patterns and trends by
sifting through large amounts of data stored in repositories, using pattern recognition
technologies as well as statistical and mathematical techniques.” [Gartner Group, Larose, pp.xi,
2005]

“Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and useful
to the data owner” (Hand et al, 2001)

“A class of database applications that look for hidden patterns in a group of data that can be
used to predict future behavior.” (Webopedia, n.d.)

“Data mining is an interdisciplinary field bringing together techniques from machine learning,
pattern recognition, statistics, databases, and visualization to address the issue of information
extraction from large data bases” (Cabena et al, 1998)

About Data Mining:

Data mining is a process that uses a variety of data analysis tools to discover patterns and
relationships in data, which can in turn be used to make predictions. It discovers hidden,
valuable knowledge by analyzing a large amount of data, which is typically stored in one or
more databases.

Because data mining benefits many industries, such as manufacturing and marketing, there is
a need for a standard data mining process. This process must be reliable, and it should be
repeatable by business people with little or no knowledge of data science.

Why Use Data Mining:

Data in digital form is available everywhere, for example on the Internet, and it can be used
to make predictions. Traditionally a statistical approach is used for this. Data mining extends
traditional data analysis and statistical approaches in that it incorporates analytical
techniques drawn from a range of disciplines. Data mining covers the entire process of data
analysis, including data cleaning and preparation, visualization of the results, and producing
predictions in real time so that specific goals are met.
Applications of Data Mining:

Data Mining Applications in Sales/Marketing:

• Data mining is used for market basket analysis, which shows what product combinations
were purchased together, when they were bought, and in what sequence. This information
helps businesses promote their most profitable products and maximize profit. In
addition, it encourages customers to purchase related products that they may have
missed or overlooked.
• Retail companies use data mining to identify customers’ buying behavior patterns.

Data Mining Applications in Banking / Finance

• Data mining is used to measure customer loyalty by analyzing data on customers’
purchasing activities, such as the frequency of purchases in a period of time, the
total monetary value of all purchases, and the date of the last purchase. After
analyzing those dimensions, a relative score is generated for each customer. The
higher the score, the more loyal the customer is relative to others.
• Data mining is applied to help banks retain credit card customers. By analyzing past
data, data mining can help banks predict which customers are likely to change their
credit card affiliation, so that the banks can plan and launch special offers to retain
those customers.
• Credit card spending by customer groups can be identified by using data mining.
• Hidden correlations between different financial indicators can be discovered by using
data mining.
• From historical market data, data mining makes it possible to identify stock trading
rules.

Data Mining Applications in Health Care and Insurance

• Data mining is applied in claims analysis, for example to identify which medical
procedures are claimed together.
• Data mining makes it possible to forecast which customers will potentially purchase new
policies.
• Data mining allows insurance companies to detect risky customer behavior patterns.
• Data mining helps detect fraudulent behavior.

Data Mining Applications in Transportation

• Data mining helps determine the distribution schedules among warehouses and outlets
and analyze loading patterns.
Chapter 4:
Recommender System:
In a world where the number of choices can be overwhelming, recommender systems help
users find and evaluate items of interest. They connect users with items to “consume”
(purchase, view, listen to, etc.) by associating the content of recommended items or the
opinions of other individuals with the consuming user’s actions or opinions. Such systems have
become powerful tools in domains from electronic commerce to digital libraries and knowledge
management. For example, a customer of just about any major online retailer who expresses
an interest in an item, either through viewing a product description or by placing the item in
a “shopping cart”, will likely receive recommendations for additional products. These
products can be recommended based on the top overall sellers on a site, on the demographics
of the consumer, or on an analysis of the past buying behavior of the consumer as a prediction
for future buying behavior.

Data Mining In Recommender System:

The term data mining refers to a broad spectrum of mathematical modeling techniques and
software tools that are used to find patterns in data and use these to build models. In the
context of recommender applications, the term data mining is used to describe the collection of
analysis techniques used to infer recommendation rules or build recommendation models from
large data sets. Recommender systems that incorporate data mining techniques make their
recommendations using knowledge learned from the actions and attributes of users. These
systems are often based on the development of user profiles that can be persistent (based on
demographic or item “consumption” history data), ephemeral (based on the actions during the
current session), or both.

The data mining techniques used in these systems include:

• Clustering
• Classification techniques
• The generation of association rules
• The production of similarity graphs through techniques such as Horting
Chapter 5:
Data Mining Algorithms:
The main data mining algorithms are described below.

1. Clustering Algorithm:

Clustering means finding groups of objects such that the objects in one group are similar to one another
and different from the objects in other groups. Clustering can be considered the most important
unsupervised learning technique. In educational data mining, clustering has been used to group
students according to their behavior; for example, clustering can distinguish active students from
non-active students according to their performance in activities.

Simple K-means Clustering algorithm


Simple K-means is an unsupervised, iterative algorithm in which items are moved among a
set of clusters until the assignment no longer changes. It partitions the data set into
clusters, provided the number of clusters is given in advance.

Algorithm: Simple K-means clustering algorithm

Input:
    Set of elements or database of transactions D = {t1, t2, t3, …, tn}
    Number of required clusters k
Output:
    Set of k clusters
Method:
    Make initial guesses for the means m1, m2, ..., mk;
    Repeat
        Assign each element ti to the cluster having the closest mean.
        Calculate the new mean for each cluster.
    Until there are no changes in any mean
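
The steps above can be sketched in Java as follows. This is only a minimal illustration, not part of the project code: it assumes one-dimensional numeric data, and the class name KMeans1D and the choice of initial means are hypothetical.

```java
class KMeans1D {
    // Runs simple k-means on one-dimensional data.
    // 'means' holds the initial guesses and is updated in place.
    // Returns, for each element, the index of the cluster it was assigned to.
    static int[] cluster(double[] data, double[] means) {
        int[] assignment = new int[data.length];
        boolean changed = true;
        while (changed) {
            // Assignment step: attach each element to the cluster with the closest mean.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < means.length; c++) {
                    if (Math.abs(data[i] - means[c]) < Math.abs(data[i] - means[best])) {
                        best = c;
                    }
                }
                assignment[i] = best;
            }
            // Update step: recompute each mean from the elements assigned to it.
            changed = false;
            for (int c = 0; c < means.length; c++) {
                double sum = 0;
                int n = 0;
                for (int i = 0; i < data.length; i++) {
                    if (assignment[i] == c) {
                        sum += data[i];
                        n++;
                    }
                }
                if (n > 0 && sum / n != means[c]) {
                    means[c] = sum / n;
                    changed = true;
                }
            }
        }
        return assignment;
    }
}
```

For example, clustering the values 1, 2, 3, 10, 11, 12 with initial means 1 and 12 assigns the first three values to one cluster and the last three to the other, with final means 2 and 11.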

2. Classification:

Classification is a data mining task that maps data into predefined groups or classes. It is also called
supervised learning.

It consists of two steps:

1. Model construction: A set of classes is determined in advance, and each tuple (sample) is assumed to
belong to one of them. The set of tuples used for model construction is the training set. The model is
represented as classification rules, decision trees, or mathematical formulae.

2. Model usage: The model is used for classifying future or unknown objects. To estimate its accuracy,
the known label of each test sample is compared with the result from the model. The accuracy rate is the
percentage of test set samples that are correctly classified by the model. The test set must be
independent of the training set; otherwise the measured accuracy will be over-optimistic, because
over-fitting to the training data would go undetected.
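
The two steps can be illustrated with a deliberately simple classifier. The sketch below is an assumption made for illustration only: it learns a single cut-off value from labelled training samples (model construction) and then measures accuracy on a separate test set (model usage). The class name ThresholdClassifier and the midpoint rule are hypothetical, and the sketch assumes both classes occur in the training set and that the positive class has the larger values.

```java
class ThresholdClassifier {
    private double threshold;

    // Model construction: learn a cut-off from labelled training samples.
    // The cut-off is the midpoint between the mean value of each class.
    void train(double[] x, boolean[] label) {
        double sumPos = 0, sumNeg = 0;
        int nPos = 0, nNeg = 0;
        for (int i = 0; i < x.length; i++) {
            if (label[i]) {
                sumPos += x[i];
                nPos++;
            } else {
                sumNeg += x[i];
                nNeg++;
            }
        }
        threshold = (sumPos / nPos + sumNeg / nNeg) / 2;
    }

    // Classifies a new sample as positive when it reaches the learned cut-off.
    boolean predict(double x) {
        return x >= threshold;
    }

    // Model usage: fraction of held-out test samples that are classified correctly.
    double accuracy(double[] x, boolean[] label) {
        int correct = 0;
        for (int i = 0; i < x.length; i++) {
            if (predict(x[i]) == label[i]) {
                correct++;
            }
        }
        return (double) correct / x.length;
    }
}
```

The accuracy method should be called with samples that were not passed to train, in line with the requirement that the test set be independent of the training set.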

3. Association Rule Algorithm:

Association rules are used to show the relationship between data items. Mining association
rules finds rules of the form "if antecedent then (likely) consequent", where antecedent and
consequent are item sets, i.e. sets of one or more items. Association rule generation consists
of two separate steps: first, a minimum support threshold is applied to find all frequent item
sets in a database. Second, these frequent item sets and a minimum confidence constraint are
used to form rules.
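
Once rules of this form are available, a recommender can apply them to the courses a student has already taken. The following Java sketch shows one possible way to do this; it is an illustrative assumption rather than the project's implementation, and the class name RuleRecommender, the Rule holder class, and the course names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RuleRecommender {
    // A rule "antecedent -> consequent" together with its confidence.
    static class Rule {
        final Set<String> antecedent;
        final String consequent;
        final double confidence;

        Rule(Set<String> antecedent, String consequent, double confidence) {
            this.antecedent = antecedent;
            this.consequent = consequent;
            this.confidence = confidence;
        }
    }

    // Convenience factory so that rules can be written on one line.
    static Rule rule(String consequent, double confidence, String... antecedent) {
        return new Rule(new HashSet<>(Arrays.asList(antecedent)), consequent, confidence);
    }

    // Recommends the consequent of every rule whose antecedent is fully covered by
    // the courses already taken, skipping courses the student has taken, and
    // orders the result by the best confidence seen for each course (highest first).
    static List<String> recommend(List<Rule> rules, Set<String> taken) {
        Map<String, Double> best = new HashMap<>();
        for (Rule r : rules) {
            if (taken.containsAll(r.antecedent) && !taken.contains(r.consequent)) {
                best.merge(r.consequent, r.confidence, (a, b) -> Math.max(a, b));
            }
        }
        List<String> result = new ArrayList<>(best.keySet());
        result.sort((a, b) -> Double.compare(best.get(b), best.get(a)));
        return result;
    }
}
```

With the rules {DBMS} -> Data Mining (0.8), {DBMS, Java} -> Web Technology (0.6) and {OS} -> Networks (0.9), a student who has taken DBMS and Java would be recommended Data Mining first and Web Technology second; the Networks rule does not fire because OS has not been taken.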

Association rule algorithms include the Apriori algorithm.

Apriori algorithm:

The Apriori algorithm operates on database records, particularly transactional records, or
records that include a certain number of fields or items.

The algorithm is named Apriori because it uses prior knowledge of frequent item set properties.
It applies an iterative approach, or level-wise search, in which frequent k-item sets are used
to find (k+1)-item sets.

The credit for introducing this algorithm goes to Rakesh Agrawal and Ramakrishnan Srikant in
1994.

Apriori Algorithm – Pros

• Easy to understand and implement
• Can be used on large item sets

Apriori Algorithm – Cons

• It may generate a large number of candidate item sets, which can become computationally
expensive.
• Calculating support is also expensive, because the calculation has to go through the
entire database.

Apriori Algorithm – Limitations

• The process can be slow, because the database is scanned again for every item set size.

How to Improve the Efficiency of the Apriori Algorithm?

The following methods can improve the efficiency of the Apriori algorithm.

• Transaction reduction: a transaction that does not contain any frequent k-item set is
useless in subsequent scans and can be skipped.
• Hash-based item set counting: a k-item set whose corresponding hash bucket count is
below the threshold cannot be frequent and can be excluded.
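
As an illustration of the first technique, the sketch below filters a transaction list down to the transactions that still contain at least one frequent k-item set. It is a minimal example under assumed names (TransactionReduction, reduce) and is not taken from the project code.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;

class TransactionReduction {
    // Keeps only the transactions that contain at least one frequent k-item set.
    // A transaction without any frequent k-item set cannot contain a frequent
    // (k+1)-item set, so later scans can safely ignore it.
    static List<Set<Integer>> reduce(List<Set<Integer>> database,
                                     Collection<Set<Integer>> frequentK) {
        List<Set<Integer>> kept = new ArrayList<>();
        for (Set<Integer> transaction : database) {
            for (Set<Integer> itemSet : frequentK) {
                if (transaction.containsAll(itemSet)) {
                    kept.add(transaction);
                    break; // one match is enough to keep the transaction
                }
            }
        }
        return kept;
    }
}
```

The reduced list would then replace the full database in the scan that counts support for the (k+1)-item set candidates.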
Chapter 6
Apriori Association Rule and Algorithm:
The Apriori association rule is used to mine the frequent patterns in a database. Support and
confidence are the usual measures of the quality of an association rule. Support for the
association rule X -> Y is the percentage of transactions in the database that contain X ∪ Y.
Confidence for the association rule X -> Y is the ratio of the number of transactions that
contain X ∪ Y to the number of transactions that contain X.
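
The two measures can be computed directly from their definitions. The following Java sketch is an illustration under assumed names (RuleMeasures, items, count); support is returned as a fraction between 0 and 1 rather than a percentage.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RuleMeasures {
    // Convenience helper to build an item set from item names.
    static Set<String> items(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // Number of transactions that contain every item of 'itemSet'.
    static int count(List<Set<String>> database, Set<String> itemSet) {
        int n = 0;
        for (Set<String> transaction : database) {
            if (transaction.containsAll(itemSet)) {
                n++;
            }
        }
        return n;
    }

    // Support of X -> Y: fraction of all transactions that contain X ∪ Y.
    static double support(List<Set<String>> database, Set<String> x, Set<String> y) {
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return (double) count(database, union) / database.size();
    }

    // Confidence of X -> Y: transactions containing X ∪ Y divided by those containing X.
    static double confidence(List<Set<String>> database, Set<String> x, Set<String> y) {
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return (double) count(database, union) / count(database, x);
    }
}
```

For the four transactions {A, B}, {A, B, C}, {A, C} and {B, C}, the rule A -> B has support 2/4 = 0.5, because two transactions contain both A and B, and confidence 2/3, because three transactions contain A.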
The Apriori association rule algorithm is given below:

Algorithm: Apriori Algorithm

Purpose:
    To find the item sets that occur in at least a minimum number (the support threshold) of
    the transactions, and to form association rules from them.

Input:
    Database of transactions D = {t1, t2, …, tn}
    Set of items I = {I1, I2, …, Ik}
    Minimum support
    Minimum confidence

Output:
    Association rules satisfying minimum support and minimum confidence

Method:
    Ck = candidate item sets of size k
    Lk = frequent (large) item sets of size k
    C1 = item sets of size 1 in I
    L1 = frequent item sets of size 1
    k = 1;
    Repeat
        Ck+1 = candidates generated from Lk
        Remove from Ck+1 every candidate that has an infrequent k-item subset, since any
        k-item set that is not frequent cannot be a subset of a frequent (k+1)-item set.
        For each transaction t in database D do
        Begin
            Increase the support count of every candidate in Ck+1 that is contained in t
        End
        Lk+1 = candidates in Ck+1 with at least minimum support
        k = k + 1;
    Until no more large item sets are found
    Return the union of all Lk
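
The candidate generation line of the method (join, then remove candidates with an infrequent subset) can be sketched in Java as follows. This is an illustrative assumption, not the project's code from Chapter 8: the class name CandidateGeneration and its helper methods are hypothetical, and the join simply unions every pair of frequent k-item sets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CandidateGeneration {
    // Convenience helper to build an item set from item numbers.
    static Set<Integer> itemSet(Integer... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Builds the (k+1)-item candidates from the frequent k-item sets.
    static Set<Set<Integer>> generate(Set<Set<Integer>> frequentK, int k) {
        Set<Set<Integer>> candidates = new HashSet<>();
        for (Set<Integer> a : frequentK) {
            for (Set<Integer> b : frequentK) {
                // Join step: the union of two k-item sets that differ in one item
                // has exactly k+1 items.
                Set<Integer> union = new HashSet<>(a);
                union.addAll(b);
                // Prune step: keep the union only if all of its k-item subsets are frequent.
                if (union.size() == k + 1 && allSubsetsFrequent(union, frequentK)) {
                    candidates.add(union);
                }
            }
        }
        return candidates;
    }

    // Checks that removing any single item leaves a frequent k-item set.
    static boolean allSubsetsFrequent(Set<Integer> candidate, Set<Set<Integer>> frequentK) {
        for (Integer item : candidate) {
            Set<Integer> subset = new HashSet<>(candidate);
            subset.remove(item);
            if (!frequentK.contains(subset)) {
                return false;
            }
        }
        return true;
    }
}
```

Given the frequent 2-item sets {1, 2}, {1, 3}, {2, 3} and {2, 4}, the only surviving 3-item candidate is {1, 2, 3}; {1, 2, 4} and {2, 3, 4} are removed because {1, 4} and {3, 4} are not frequent.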
Chapter 7
Design and Implementation:

The code is written in the Java programming language and connects to a MySQL
database.

Packages Used:

A Java package is a group of similar types of classes, interfaces and sub-packages.

Packages in Java fall into two categories: built-in packages and user-defined packages.

There are many built-in packages, such as java.lang, java.awt, javax.swing, java.net, java.io,
java.util and java.sql.

This project uses the built-in packages java.util and java.sql.

Advantage of Java Package

1) Java package is used to categorize the classes and interfaces so that they can be easily
maintained.

2) Java package provides access protection.

3) Java package removes naming collision.


1. java.util.*

2. java.sql.*

The java.util package contains the collections framework, date and time classes, the event
model, internationalization support, and miscellaneous utility classes. Importing this package
gives access to all of these classes and their methods.

The following table lists the main collection interfaces and some of the classes that implement them.

Interface    Description                                              Classes

Collection   Represents a group of objects. It is the root            1. AbstractCollection
             interface of the collection framework.

Set          A dynamic group of unique elements. It does not          1. HashSet
             store duplicate elements.                                2. LinkedHashSet

List         Similar to a set, with the only difference that it       1. Stack
             allows duplicate values.                                 2. Vector
                                                                      3. ArrayList
                                                                      4. LinkedList

Queue        A First-In-First-Out (FIFO) arrangement: the first       1. LinkedList
             element put in the queue is the first element
             taken out of it.

Map          Stores elements in the form of unique key-value          1. HashMap
             pairs.                                                   2. Hashtable
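
The differences between these interfaces can be seen in a short Java example. It is only an illustration; the class name CollectionsDemo and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CollectionsDemo {
    // Builds one collection of each kind from the same values and reports how each behaves.
    static String summary() {
        List<Integer> list = new ArrayList<>(Arrays.asList(3, 1, 3)); // keeps the duplicate 3
        Set<Integer> set = new HashSet<>(list);                       // drops the duplicate 3
        Queue<Integer> queue = new LinkedList<>(list);                // first in, first out
        Map<Integer, String> map = new HashMap<>();
        map.put(1, "DBMS");
        map.put(1, "Data Mining"); // same key, so the earlier value is replaced
        return "list=" + list.size() + " set=" + set.size()
                + " head=" + queue.peek() + " map=" + map.get(1);
    }

    public static void main(String[] args) {
        System.out.println(summary()); // prints: list=3 set=2 head=3 map=Data Mining
    }
}
```

The Apriori code in this project relies on the same properties: a Set is used for item sets so that an item cannot appear twice, and a Map groups the database rows by transaction number.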

The java.sql package provides the JDBC API for accessing a database, described under "What is JDBC?" below.

Classes used:
1. Class Apriori

2. Class Tuple

What is a class

A class is a group of objects which have common properties. It is a template
or blueprint from which objects are created. It is a logical entity. It can't be
physical.

A class in Java can contain:

o Fields
o Methods
o Constructors
o Blocks
o Nested class and interface

Syntax to declare a class:

class <class_name>{
field;
method;}

Instance variable in Java

A variable which is created inside the class but outside the method is known
as an instance variable. Instance variable doesn't get memory at compile
time. It gets memory at runtime when an object or instance is created. That
is why it is known as an instance variable.
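
A small example may make this concrete. The class below is hypothetical (it is not one of the project's classes) and shows instance variables, a constructor and methods; each object created from it gets its own copy of the instance variables at runtime.

```java
class Course {
    // Instance variables: declared inside the class but outside any method.
    private final String code;
    private int enrolled;

    // Constructor: runs when an object is created and sets up its fields.
    Course(String code) {
        this.code = code;
    }

    // Methods that read or change the state of one particular object.
    void enroll() {
        enrolled++;
    }

    int getEnrolled() {
        return enrolled;
    }

    String getCode() {
        return code;
    }
}
```

Creating two Course objects and calling enroll() twice on the first and once on the second leaves their enrolled counts at 2 and 1, because each object holds its own instance variables.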

What is JDBC?
JDBC stands for Java Database Connectivity, which is a standard Java API
for database-independent connectivity between the Java programming
language and a wide range of databases.

The JDBC library includes APIs for each of the tasks mentioned below that
are commonly associated with database usage.

• Making a connection to a database.
• Creating SQL statements.
• Executing SQL queries in the database.
• Viewing and modifying the resulting records.

Fundamentally, JDBC is a specification that provides a complete set of
interfaces that allows for portable access to an underlying database. Java
can be used to write different types of executables, such as:

• Java Applications
• Java Applets
• Java Servlets
• Java ServerPages (JSPs)
• Enterprise JavaBeans (EJBs)

All of these executables are able to use a JDBC driver to access a
database and take advantage of the stored data.

JDBC provides the same capabilities as ODBC, allowing Java programs to
contain database-independent code.

Pre-Requisite
Working with JDBC requires a good understanding of the following two
subjects:

• Core Java programming
• SQL and the MySQL database

JDBC Architecture
The JDBC API supports both two-tier and three-tier processing models for
database access, but in general the JDBC architecture consists of two layers:

• JDBC API: This provides the application-to-JDBC Manager connection.
• JDBC Driver API: This supports the JDBC Manager-to-Driver connection.

The JDBC API uses a driver manager and database-specific drivers to
provide transparent connectivity to heterogeneous databases.

The JDBC driver manager ensures that the correct driver is used to access
each data source. The driver manager is capable of supporting multiple
concurrent drivers connected to multiple heterogeneous databases.

[Figure: JDBC architecture, showing the location of the driver manager with
respect to the JDBC drivers and the Java application]

Common JDBC Components


The JDBC API provides the following interfaces and classes:

• DriverManager: This class manages a list of database drivers. It matches
connection requests from the Java application with the proper database driver
using the communication subprotocol. The first driver that recognizes a certain
subprotocol under JDBC is used to establish a database connection.

• Driver: This interface handles the communication with the database server.
You will rarely interact with Driver objects directly. Instead, you use the
DriverManager class, which manages objects of this type and abstracts the
details of working with them.

• Connection: This interface provides all the methods for contacting a database.
The connection object represents the communication context, i.e., all
communication with the database goes through the connection object.

• Statement: You use objects created from this interface to submit SQL
statements to the database. Some derived interfaces accept parameters in
addition to executing stored procedures.

• ResultSet: These objects hold the data retrieved from a database after you
execute an SQL query using Statement objects. A ResultSet acts as an iterator
that lets you move through its data.

• SQLException: This class handles any errors that occur in a database
application.
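
The way these components work together can be sketched as follows. This is a minimal illustration and not the project's code: the class name TransactionLoader is hypothetical, the connection URL, user name and password are supplied by the caller, and the query assumes a table named Apriori1 whose first two columns hold a transaction number and an item number.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TransactionLoader {
    // Groups (transaction number, item number) rows into one item list per transaction.
    static Map<Integer, List<Integer>> group(int[][] rows) {
        Map<Integer, List<Integer>> byTransaction = new TreeMap<>();
        for (int[] row : rows) {
            byTransaction.computeIfAbsent(row[0], id -> new ArrayList<>()).add(row[1]);
        }
        return byTransaction;
    }

    // DriverManager opens the Connection, the Connection creates a Statement, and
    // the Statement returns a ResultSet. try-with-resources closes all three.
    static Map<Integer, List<Integer>> load(String url, String user, String password)
            throws SQLException {
        List<int[]> rows = new ArrayList<>();
        try (Connection con = DriverManager.getConnection(url, user, password);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM Apriori1")) {
            while (rs.next()) {
                rows.add(new int[] { rs.getInt(1), rs.getInt(2) });
            }
        }
        return group(rows.toArray(new int[0][]));
    }
}
```

Keeping the grouping step separate from the database access means it can be checked without a running database; any SQLException raised while connecting or querying is passed on to the caller.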
Chapter 8
import java.util.*;

import java.sql.*;

// An item set together with its support count (-1 means "not counted yet").
class Tuple {

    Set<Integer> itemset;
    int support;

    Tuple() {
        itemset = new HashSet<>();
        support = -1;
    }

    Tuple(Set<Integer> s) {
        itemset = s;
        support = -1;
    }

    Tuple(Set<Integer> s, int i) {
        itemset = s;
        support = i;
    }
}

class Apriori {

    static Set<Tuple> c;      // candidate item sets of the current size
    static Set<Tuple> l;      // frequent item sets of the current size
    static int d[][];         // d[i] holds the items of transaction i
    static float min_support; // minimum support, as a number of transactions

    public static void main(String args[]) throws Exception {
        getDatabase();
        c = new HashSet<>();
        l = new HashSet<>();
        Scanner scan = new Scanner(System.in);
        System.out.println("Enter the minimum support (as an integer value):");
        min_support = scan.nextFloat();

        // Print every transaction and collect the distinct items.
        Set<Integer> candidate_set = new HashSet<>();
        for(int i=0 ; i < d.length ; i++) {
            System.out.println("Transaction Number: " + (i+1) + ":");
            for(int j=0 ; j < d[i].length ; j++) {
                System.out.print("Item number " + (j+1) + " = ");
                System.out.println(d[i][j]);
                candidate_set.add(d[i][j]);
            }
        }

        // C1: one candidate item set per distinct item.
        for(Integer item : candidate_set) {
            Set<Integer> s = new HashSet<>();
            s.add(item);
            c.add(new Tuple(s, count(s)));
        }

        prune();
        generateFrequentItemsets();
    }

    // Returns the number of transactions that contain every item of s.
    static int count(Set<Integer> s) {
        int support = 0;
        for(int i=0 ; i < d.length ; i++) {
            int count = 0;
            for(int element : s) {
                boolean containsElement = false;
                for(int k=0 ; k < d[i].length ; k++) {
                    if(element == d[i][k]) {
                        containsElement = true;
                        count++;
                        break;
                    }
                }
                if(!containsElement) {
                    break;
                }
            }
            if(count == s.size()) {
                support++;
            }
        }
        return support;
    }

    // Rebuilds l from the candidates in c that reach the minimum support.
    static void prune() {
        l.clear();
        for(Tuple t : c) {
            if(t.support >= min_support) {
                l.add(t);
            }
        }
        System.out.println("-+- L -+-");
        for(Tuple t : l) {
            System.out.println(t.itemset + " : " + t.support);
        }
    }

    // Level-wise search: builds candidates of size+1 from l until at most one
    // frequent item set is left.
    static void generateFrequentItemsets() {
        boolean toBeContinued = true;
        int size = 1;
        Set<Set<Integer>> candidate_set = new HashSet<>();
        while(toBeContinued) {
            candidate_set.clear();
            c.clear();
            // Join step: extend each frequent item set by one item of another.
            for(Tuple t1 : l) {
                for(Tuple t2 : l) {
                    for(int element : t2.itemset) {
                        // Work on a copy so that the sets stored in l are never
                        // modified while they are being iterated.
                        Set<Integer> temp = new HashSet<>(t1.itemset);
                        temp.add(element);
                        if(temp.size() != size) {
                            candidate_set.add(temp);
                        }
                    }
                }
            }
            for(Set<Integer> s : candidate_set) {
                c.add(new Tuple(s, count(s)));
            }
            prune();
            if(l.size() <= 1) {
                toBeContinued = false;
            }
            size++;
        }
        // l now holds the frequent item sets of the largest size reached
        // (it is empty if no candidate of that size was frequent).
        System.out.println("\n=+= FINAL LIST =+=");
        for(Tuple t : l) {
            System.out.println(t.itemset + " : " + t.support);
        }
    }

    // Reads (transaction number, item number) rows from MySQL into d.
    static void getDatabase() throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Map<Integer, List<Integer>> m = new HashMap<>();
        // try-with-resources closes the connection, statement and result set.
        try(Connection con =
                DriverManager.getConnection("jdbc:mysql://localhost:3306/DWM","root","root");
            Statement s = con.createStatement();
            ResultSet rs = s.executeQuery("SELECT * FROM Apriori1;")) {
            while(rs.next()) {
                int list_no = Integer.parseInt(rs.getString(1));
                int object = Integer.parseInt(rs.getString(2));
                List<Integer> temp = m.get(list_no);
                if(temp == null) {
                    temp = new LinkedList<>();
                    m.put(list_no, temp);
                }
                temp.add(object);
            }
        }
        // Copy the grouped rows into the transaction array.
        d = new int[m.size()][];
        int count = 0;
        for(List<Integer> items : m.values()) {
            d[count] = new int[items.size()];
            int i = 0;
            for(int item : items) {
                d[count][i++] = item;
            }
            count++;
        }
    }
}
Chapter 9
Conclusion and future scope:
Chapter 10
References:
