
Seminar Report

on

“DATA MINING”

Submitted to

DR. BABASAHEB AMBEDKAR TECHNOLOGICAL UNIVERSITY,


LONERE

in partial fulfilment of the requirements for the award of S.Y.


Bachelor of Technology degree in Computer Science &
Engineering

Student Name: Patil Umesh Avinash


PRN: 20676212420034

Under the guidance of


Prof. H.S.MAJGAONKAR

Shri. Venkateshwara Shikshan Sanstha’s


NANASAHEB MAHADIK COLLEGE OF ENGINEERING, PETH.
Department of Computer Science & Engineering
(2021-22)

CERTIFICATE

This is to certify that the Seminar entitled “Data Mining” is a bona fide record of the
Seminar work done by Patil Umesh Avinash (PRN: 20676212420034) under my supervision
and guidance, in partial fulfilment of the requirements for the Outcome Based Education
Paradigm in Computer Science & Engineering from NANASAHEB MAHADIK COLLEGE OF
ENGINEERING, PETH for the academic year 2021-2022.

Place: NMCOE, Peth


Date:

Ms. H. S. Majgaonkar        Prof. I. Y. Inamdar        Dr. B. Shriniwasa Varma
Guide                       Head of Department         Principal

External Examiner
ACKNOWLEDGEMENT

I would like to express my sincere gratitude to my guide, Ms. H. S. Majgaonkar, whose
role as project guide was invaluable. I am extremely thankful for the keen interest she
took in advising me, for the books and reference materials she provided, and for the
moral support she extended.
My special thanks also go to our Head of Department, Prof. I. Y. Inamdar, our Principal,
Dr. B. Shriniwasa Varma, and our Executive Director, Prof. M. B. Joshi, who gave me the
golden opportunity to do this project on the topic “Data Mining”. It led me to do a lot
of research and to learn many new things, and I am really thankful to them.
Last but not least, I convey my gratitude to all the teachers for providing the technical
skills that will always remain an asset, and to all the non-teaching staff for the
gracious hospitality they offered.

Patil Umesh Avinash

PRN: 20676212420034
INDEX

Sr. No.   Content

1 Abstract

2 Introduction

3 Literature Survey

4 Objectives

5 Theory

6 Case Study 1

7 Case Study 2

8 Conclusion

9 References
CHECKLIST

Sr. No.   Tool                                               Purpose                                      Status (Y/N)   Rep. Att   Sign of Guide
1         Mendeley                                           For using along with MS Word & references
2         Plagiarism Check                                   For detecting plagiarism
3         Paraphrasing Tool                                  For removing plagiarism
4         Design CAD Software
5         Analysis Software
6         Design of Experiment
7         Selection of Material
8         Correlation/Regression for exp validation
9         Optimization techniques/inventive problem solving
10        Software used, i.e. C++/JAVA/Python, OR Design Codes used for non circuit branches
1. Abstract

Data mining on large databases has been a major concern in the research community, due to
the difficulty of analyzing huge volumes of data using only traditional OLAP tools. This
sort of process requires a lot of computational power, memory and disk I/O, which can
only be provided by parallel computers. We present a discussion of how database
technology can be integrated with data mining techniques.
2. Introduction

“Data mining”, as the name suggests, is the process of mining data. In simple words, it
is defined as finding hidden insights (information) in a database and extracting patterns
from the data.

There are different algorithms for different tasks. The function of these algorithms is
to fit a model to the data and to identify the characteristics of the data. Models are
commonly of two types: predictive and descriptive.

The term "data mining" is a misnomer, because the goal is the extraction of patterns
and knowledge from large amounts of data, not the extraction (mining) of data itself. It
is also a buzzword and is frequently applied to any form of large-scale data
or information processing (collection, extraction, warehousing, analysis, and statistics)
as well as any application of computer decision support systems, including artificial
intelligence (e.g., machine learning) and business intelligence. The book Data Mining:
Practical Machine Learning Tools and Techniques with Java (which covers mostly
machine learning material) was originally to be named just Practical Machine
Learning, and the term data mining was only added for marketing reasons. Often the
more general terms (large scale) data analysis and analytics (or, when referring to
actual methods, artificial intelligence and machine learning) are more appropriate.
3. Literature Survey
1. Shivam Agarwal et al., Data mining is a field at the intersection of computer science
and statistics used to discover patterns in large stores of information. The main aim of
the data mining process is to extract useful information from a body of data and mold it
into an understandable structure for future use. There are different processes and
techniques used to carry out data mining successfully.

2. Sowmya R et al., In the information technology world, the ability to effectively
process massive datasets has become integral to a broad range of scientific and other
academic disciplines. We are living in an era of data deluge and, as a result, the term
“Big Data” is appearing in many contexts, ranging from meteorology, genomics,
complex physics simulations, biological and environmental research, finance and
business to healthcare. Big Data refers to data streams of higher velocity and higher
variety. The infrastructure required to support the acquisition of Big Data must deliver
low, predictable latency both in capturing data and in executing short, simple queries;
be able to handle very high transaction volumes, often in a distributed
environment; and support flexible, dynamic data structures. Data processing is
considerably more challenging than simply locating, identifying, understanding, and
citing data. For effective large-scale analysis, all of this has to happen in a completely
automated manner. This requires differences in data structure and semantics to be
expressed in forms that are computer understandable, and then “robotically”
resolvable. There is a strong body of work in data integration, mapping and
transformations. However, considerable additional work is required to achieve
automated error-free difference resolution. This paper proposes a framework based on
recent research for data mining using Big Data.

3. B. N. Lakshmi et al., Data mining, the non-trivial extraction of novel, implicit, and
actionable knowledge from large data sets, is an evolving technology which is a direct
result of the increasing use of computer databases to store and retrieve
information effectively. It is also known as Knowledge Discovery in Databases
(KDD) and enables data exploration, data analysis, and data visualization of huge
databases at a high level of abstraction, without a specific hypothesis in mind. The
working of data mining is understood through a method called modeling, which is used
to make predictions. Data mining techniques are the result of a long process of research
and product development and include artificial neural networks, decision trees and
genetic algorithms. This paper surveys data mining technology: its definition,
motivation, process and architecture, the kinds of data mined, the functionalities and
classification of data mining, major issues, applications, and directions for further
research.
4. Ramakrishna Hegde et al., This review paper consists of a literature survey on the
prediction of scholarships using Machine Learning (ML) and Data Mining (DM) techniques.
Along with this, it contains a short description of the ML/DM methods used by
researchers. It also describes why data sets are very important in ML/DM methods.
Machine Learning has become very popular in the IT industry. Nowadays, Machine
Learning and Data Mining have become powerful techniques applicable to various
fields such as IT, education, and business. The different types of ML/DM algorithms
are addressed. The algorithms which give more accurate results in detecting the
continuity of every student's scholarship include Naïve Bayes, Decision Tree and k-NN.
Finally, the proposed model will provide a list of candidates who deserve a scholarship,
and the accuracy of each technique used to get a result is discussed.

5. B. R. Chandavarkar et al., Machine Learning and Data Mining for healthcare. There
has been enormous growth in the field of HIT (health information technology) in
recent years. Be it detection of certain diseases, scanning of organs, or finding
tumors, these machine-oriented operations, carried out without human intervention, have
certainly increased the quality of medical attention one can get, and the technology
required has come a long way. Health data tends to be inherently complex, with
exceptions in almost all cases. Data mining is the technique of converting raw data into
a meaningful format. Analysis and prediction on such data, although computationally
and algorithmically complex, is an emerging technology that is a small step towards more
proactive and preventive automated treatment options. There are various data mining
techniques such as classification, clustering, association, regression, prediction, pattern
recognition, etc. [1]. Even the efficiency of certain medicines can be found using
machine learning techniques, which is a life-saving and cost-effective method. In this
paper, we are going to use machine learning as a tool for predictive analysis to predict
chronic kidney diseases based on the Chronic disease dataset taken from the UCI ML
repository. We will be applying machine learning algorithms, specifically decision
trees, to build a classifier to predict if a person has the disease or not. This paper
shows that machine learning algorithms need to be tailored to the specific nature of
medical data.

6. Chitra Jalota et al., Higher education institutions are often very curious to know
about the success rate of their students throughout their studies. For this reason, they
need to use several methods, like physical examination, statistical methods and currently
prevailing data mining techniques, for the prediction of students' performance. An
upcoming area of research which uses data mining techniques is known as
Educational Data Mining. It involves machine learning algorithms and statistical
techniques to help the user interpret students' learning habits, their
academic performance, and further improvement if required. In this paper we will
discuss various data mining techniques which are useful for predicting the performance
level of students. For this we used the Kalboard 360 dataset and analyzed it in Weka to
compare the data mining techniques.

7. Manomita Chakraborty et al., The increase in the rate of technological evolution is
resulting in a reduction in the cost of various storage devices and, as a consequence,
enormous amounts of data are deposited from heterogeneous sources in raw form.
Therefore, efficient data mining techniques are required that can process those
data and retrieve useful information from them. Recently, machine learning algorithms
have become very popular for various data mining tasks. The neural network is one
of them, and it has fascinated many researchers due to its efficacy in
many tasks, especially classification. But the main problem with a neural network is
its black-box nature, i.e., explaining the decision generated by a neural network is a
daunting task. As a solution to this problem, rule extraction has been
proposed, which expresses the knowledge hidden in a learned network in the form of
understandable classification rules. Rule extraction is a well-established technique
and a very rich literature exists on this topic. However, very few papers
focus mainly on surveying the existing literature. So, this work aims to
provide a survey of the existing literature, and to shed light on some of the areas
which need attention to enrich the literature. At the same time, the paper also tries
to create scope for existing and novice researchers to do research in this field.

8. Hoda Zahedi et al., The recent global outbreak of coronavirus disease (COVID-19)
is affecting many countries worldwide. Iran is one of the top 10 most affected
countries. Search engines provide useful data from populations, and these data might
be useful to analyze epidemics. Utilizing data mining methods on electronic resources’
data might provide a better insight into the COVID-19 outbreak to manage the health
crisis in each country and worldwide.

9. Asian J Psychiatr. et al., Data mining is an interdisciplinary process which
incorporates knowledge of computer science and statistics to analyze large
observational datasets. The aim of data mining is to find unsuspected relationships or
patterns from datasets and to summarize the data in novel ways (Hand et al., 2001).
Data mining is a broad concept and encompasses a wide spectrum of analytical
methods. This letter presents two common data mining-based techniques with
empirical examples to prove their merits in assisting mental health research.

10. Gonzalo Gomez-Sanchez et al., For many years, a major question in cancer
genomics has been the identification of those variations that can have a functional role
in cancer, and distinguishing them from the majority of genomic changes that have no
functional consequences. This is particularly challenging when considering complex
chromosomal rearrangements, often composed of multiple DNA breaks, resulting in
difficulties in classifying and interpreting them functionally. Despite recent efforts
towards classifying structural variants (SVs), more robust statistical frames are needed
to better classify these variants and isolate those that derive from specific molecular
mechanisms. We present a new statistical approach to analyze SVs patterns from 2392
tumor samples from the Pan-Cancer Analysis of Whole Genomes (PCAWG)
Consortium and identify significant recurrence, which can inform relevant
mechanisms involved in the biology of tumors. The method is based on recursive
KDE clustering of 152,926 SVs, randomization methods, graph mining techniques and
statistical measures. The proposed methodology was able not only to identify complex
patterns across different cancer types but also to prove them as not random
occurrences. Furthermore, a new class of pattern that was not previously described has
been identified.

4. Objectives
The main objective of data mining is the automatic or semi-automatic analysis of large
amounts of data to extract interesting, previously unknown patterns, such as groups of
data records (cluster analysis), unusual records (anomaly detection), and dependencies
(association rule mining).

This generally involves the use of database techniques such as spatial indexes. These
patterns can then be seen as a kind of summary of the input data, and may be used in
further analysis or, for example, in machine learning and predictive analytics.
For example, the data mining step might identify several groups in the data, which can
then be used by a decision support system to obtain more accurate prediction results.

Neither data collection, data preparation, nor interpretation of results and information
is part of the data mining stage. However, they belong to the entire KDD process as
additional steps.
5. Theory

Data mining includes the utilization of refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets. These tools can
incorporate statistical models, machine learning techniques, and mathematical
algorithms, such as neural networks or decision trees. Thus, data mining incorporates
analysis and prediction.

Drawing on methods and technologies from the intersection of machine learning, database
management, and statistics, data mining professionals work to understand how to process
huge amounts of data and draw conclusions from them. The main methods they use are
described below.

In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.

1. Classification:

Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example multimedia,
spatial data, text data, time-series data, World Wide Web, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example object-oriented
database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge
discovered:
This classification depends on the types of knowledge discovered or data
mining functionalities, for example discrimination, classification, clustering,
characterization, etc. Some frameworks are comprehensive and offer several data
mining functionalities together.
iv. Classification of data mining frameworks according to data mining
techniques used:
This classification is as per the data analysis approach utilized, such as neural
networks, machine learning, genetic algorithms, visualization, statistics, data
warehouse-oriented or database-oriented, etc.
The classification can also take into account the level of user interaction
involved in the data mining procedure, such as query-driven systems,
autonomous systems, or interactive exploratory systems.

2. Clustering:

Clustering is the division of data into groups of similar objects. Describing the data
by a few clusters inevitably loses certain fine details but achieves simplification: the
data is modeled by its clusters. From a historical point of view, this kind of data
modeling is rooted in statistics, mathematics, and numerical analysis. From a machine
learning point of view, clusters correspond to hidden patterns, the search for clusters
is unsupervised learning, and the resulting framework represents a data concept. From a
practical point of view, clustering plays an outstanding role in data mining applications
such as scientific data exploration, text mining, information retrieval, spatial database
applications, CRM, Web analysis, computational biology, medical diagnostics, and many
more.

In other words, clustering analysis is a data mining technique to identify similar data.
This technique helps to recognize the differences and similarities between the data.
Clustering is very similar to classification, but it groups chunks of data together
based on their similarities rather than on predefined classes.
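
The grouping idea can be illustrated with a minimal k-means sketch. This is only an illustration: the report does not name a specific clustering algorithm, so k-means, the function name kmeans, and the sample points are all assumptions chosen for the example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Tiny k-means sketch: points and centroids are lists of (x, y) tuples."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No class labels are supplied anywhere in the sketch; the two groups emerge only from the distances between points, which is what makes clustering unsupervised.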

3. Regression:

Regression analysis is the data mining process used to identify and analyze the
relationship between variables in the presence of other factors. It is used to estimate
the likely value of a specific variable. Regression is primarily a form of planning and
modeling. For example, we might use it to project certain costs, depending on other
factors such as availability, consumer demand, and competition. Essentially, it
estimates the relationship between two or more variables in the given data set.
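
As a rough illustration of projecting a cost from another factor, the sketch below fits a straight line by ordinary least squares. The function name fit_line and the demand/cost figures are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: consumer demand versus cost.
demand = [10, 20, 30, 40, 50]
cost = [25, 45, 65, 85, 105]

slope, intercept = fit_line(demand, cost)
projected_cost = slope * 60 + intercept  # projection for an unseen demand of 60
print(slope, intercept, projected_cost)
```

With real data the points would not lie exactly on a line, and the fitted slope and intercept would only approximate the underlying relationship.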

4. Association Rules:

This data mining technique helps to discover a link between two or more items. It
finds hidden patterns in the data set.
Association rules are if-then statements that help to show the probability of
interactions between data items within large data sets in different types of databases.
Association rule mining has several applications and is commonly used to find sales
correlations in transactional data or in medical data sets.

The algorithm works on transactional data, for example, a list of grocery items that
you have been buying for the last six months. It calculates the percentage of
transactions in which items are purchased together.

These are the three major measurement techniques:

o Support:
This measures how often items A and B are purchased together, compared to the
overall dataset.
                  (Transactions containing both Item A and Item B) / (Entire dataset)
o Confidence:
This measures how often item B is purchased when item A is purchased as well.
                  (Transactions containing both Item A and Item B) / (Transactions containing Item A)
o Lift:
This measures the strength of the confidence relative to how often item B is
purchased overall.
                  (Confidence) / ((Transactions containing Item B) / (Entire dataset))
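
The three measures can be computed directly from a list of transactions. The following is a minimal sketch; the function names and the small grocery dataset are assumptions made up for illustration.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """How often b is bought in transactions where a is bought."""
    return support(transactions, {a, b}) / support(transactions, {a})

def lift(transactions, a, b):
    """Confidence of a -> b divided by the overall frequency of b."""
    return confidence(transactions, a, b) / support(transactions, {b})

# Hypothetical grocery baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, "bread", "milk"))
print(lift(transactions, "bread", "milk"))
```

A lift above 1 suggests that the two items are bought together more often than their individual frequencies would predict, while a lift below 1 suggests the opposite.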

5. Outlier detection:

This type of data mining technique relates to the observation of data items in the data
set which do not match an expected pattern or expected behavior. This technique may
be used in various domains like intrusion detection, fraud detection, etc. It is also
known as outlier analysis or outlier mining. An outlier is a data point that diverges
too much from the rest of the dataset. The majority of real-world datasets contain
outliers. Outlier detection plays a significant role in the data mining field and is
valuable in numerous areas such as network intrusion identification, credit or debit
card fraud detection, and detecting outlying readings in wireless sensor network data.
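
One simple way to flag a point that diverges too much from the rest is a z-score test, sketched below. The report does not prescribe a particular method, so the z-score rule, the threshold of 2, and the sample card amounts are assumptions for illustration only.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical card transaction amounts with one suspicious entry.
amounts = [52, 48, 50, 51, 49, 50, 47, 53, 500]
print(find_outliers(amounts))
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule is only a rough first check; practical systems often use more robust measures.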

6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating
sequential data to discover sequential patterns. It consists of finding interesting
subsequences in a set of sequences, where the value of a sequence can be measured in
terms of different criteria like length, occurrence frequency, etc.

In other words, this technique of data mining helps to discover or recognize similar
patterns in transaction data over some time.
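
A very small sketch of this idea counts how many customer sequences contain one item followed later by another, and keeps the pairs that occur often enough. The function name, the minimum support value, and the purchase histories are hypothetical; full sequential pattern mining algorithms handle longer subsequences as well.

```python
from collections import Counter
from itertools import combinations

def frequent_ordered_pairs(sequences, min_support):
    """Count pairs (a, b) where a occurs before b, once per sequence."""
    counts = Counter()
    for seq in sequences:
        # combinations() keeps the original order, so (a, b) means a came first.
        # The set ensures each pair is counted at most once per sequence.
        counts.update(set(combinations(seq, 2)))
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical purchase histories, each listed in time order.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["laptop", "mouse"],
    ["phone", "case"],
]

patterns = frequent_ordered_pairs(sequences, min_support=2)
print(patterns)
```

Here occurrence frequency is the criterion used to judge whether a subsequence is interesting.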

7. Prediction:

Prediction uses a combination of other data mining techniques such as trends,
clustering, classification, etc. It analyzes past events or instances in the right sequence
to predict a future event.
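
As one simple trend-based illustration, the sketch below forecasts the next value of a series as the average of its most recent observations. The moving-average rule, the function name, and the sales figures are assumptions for the example and stand in for the richer combinations of techniques mentioned above.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("need at least `window` past observations")
    recent = history[-window:]  # the most recent events, in time order
    return sum(recent) / window

# Hypothetical monthly sales, oldest first.
monthly_sales = [100, 110, 120, 130, 140]
forecast = moving_average_forecast(monthly_sales)
print(forecast)
```

The order of the history matters: the same numbers in a different order would generally give a different forecast.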
6. Case Study 1

Data Mining in Healthcare:

Data mining in healthcare has excellent potential to improve the health system. It uses
data and analytics to gain better insights and to identify best practices that will enhance
health care services and reduce costs. Analysts use data mining approaches such as
machine learning, multi-dimensional databases, data visualization, soft computing,
and statistics. Data mining can be used to forecast the number of patients in each
category, and the resulting procedures help ensure that patients get intensive care at
the right place and at the right time. Data mining also enables healthcare insurers to
recognize fraud and abuse.
7. Case Study 2

Data mining in Education:

Educational data mining (EDM) is a newly emerging field concerned with developing
techniques that extract knowledge from the data generated in educational
environments. EDM objectives are recognized as predicting students' future learning
behavior, studying the impact of educational support, and promoting learning science.
An institution can use data mining to make precise decisions and also to predict
students' results. With these results, the institution can concentrate on what to teach
and how to teach.
8. Conclusion

The use of data mining in enrollment management is a fairly new development.
Current data mining is done primarily on simple numeric and categorical data. In the
future, data mining will include more complex data types. In addition, for any model that
has been designed, further refinement is possible by examining other variables and their
relationships. Research in data mining will result in new methods to determine the most
interesting characteristics in the data. As models are developed and implemented, they can
be used as a tool in enrollment management.
9. References

[1] Sankar K. Pal, Varun Talwar, Pabitra Mitra, “Web Mining in Soft Computing
Framework: Relevance, State of the Art and Future Directions”, IEEE
Transactions on Neural Networks, vol. 13, no. 5, September 2002.

[2] Tomoyuki Nanno, Toshiaki Fujiki, “Automatically Collecting, Monitoring,
and Mining Japanese Weblogs”, WWW2004, May 17–22, 2004, New York, New
York, USA. ACM 1-58113-912-8/04/0005.

[3] Ralph Gross, Alessandro Acquisti, “Information Revelation and
Privacy in Online Social Networks (The Facebook case)”, Preproceedings version.
ACM Workshop on Privacy in the Electronic Society (WPES), 2005.

[4] Huan Liu and Lei Yu, “Toward Integrating Feature Selection Algorithms for
Classification and Clustering”, IEEE Transactions on Knowledge and Data
Engineering, vol. 17, no. 4, April 2005.

[5] Andrea Esuli and Fabrizio Sebastiani, “SENTIWORDNET: A Publicly
Available Lexical Resource for Opinion Mining”, Proceedings of the 5th Conference
on Language Resources and Evaluation, 2006.

[6] Marcelo Maia, Jussara Almeida, Virgílio Almeida, “Identifying User Behavior in
Online Social Networks”, SocialNets’08, April 1, 2008, Glasgow, Scotland, UK.
Copyright 2008 ACM ISBN 978-1-60558-124-8/08/04.

[7] http://www.openparenthesis.org/wp-content/uploads/2008/05/bcb3- retweeter.pdf

[8] Ai Ho, Abdou Maiga, Esma Aïmeur, “Privacy Protection Issues in Social
Networking Sites”, 978-1-4244-3806-8/09/$25.00 © 2009 IEEE.

[9] L. Dey and Sk. M. Haque, “Opinion mining from noisy text data,” International
Journal on Document Analysis and Recognition, vol. 12, no. 3, pp. 205–226, 2009.

[10] Mohammad Al-Fayoumi, Soumya Banerjee, Jr., and P. K. Mahanti, “Analysis of
Social Network Using Clever Ant Colony Metaphor”, Proceedings of World
Academy of Science, Engineering and Technology, vol. 41, May 2009,
ISSN: 2070-3740.
