on
“DATA MINING”
Submitted to
CERTIFICATE
External Examiner
ACKNOWLEDGEMENT
PRN: 20676212420034
INDEX
1 Abstract
2 Introduction
3 Literature Survey
4 Objectives
5 Theory
6 Case Study 1
7 Case Study 2
8 Conclusion
9 References
CHECKLIST
Sr. No. | Tool | Purpose | Status (Y/N) | Rep. Att | Sign of Guide
1 | Mendeley | For using along with MS Word & References | | |
2 | Plagiarism Check | For detecting plagiarism | | |
3 | Paraphrasing Tool | For removing plagiarism | | |
4 | Design CAD Software | | | |
5 | Analysis Software | | | |
6 | Design of Experiment | | | |
7 | Selection of Material | | | |
8 | Correlation/Regression for exp validation | | | |
9 | Optimization techniques/inventive problem solving | | | |
10 | Software used i.e. C++/JAVA/Python OR Design Codes used for non circuit branches | | | |
1. Abstract
Data mining on large databases has been a major concern in the research community, due to the
difficulty of analyzing huge volumes of data using only traditional OLAP tools. This sort of
process requires a lot of computational power, memory and disk I/O, which can only be
provided by parallel computers. We present a discussion of how database technology can be
integrated with data mining techniques.
2. Introduction
Data mining, as the name suggests, mines data. In simple words, it is defined as finding hidden
insights (information) in a database by extracting patterns from the data.
There are different algorithms for different tasks. The function of these algorithms is
to fit a model to the data. These algorithms identify the characteristics of the data. There are two types
of models.
The term "data mining" is a misnomer, because the goal is the extraction of patterns
and knowledge from large amounts of data, not the extraction (mining) of data itself. It
is also a buzzword and is frequently applied to any form of large-scale data
or information processing (collection, extraction, warehousing, analysis, and statistics)
as well as any application of computer decision support systems, including artificial
intelligence (e.g., machine learning) and business intelligence. The book Data mining:
Practical machine learning tools and techniques with Java (which covers mostly
machine learning material) was originally to be named just Practical machine
learning, and the term data mining was only added for marketing reasons. Often the
more general terms (large scale) data analysis and analytics—or, when referring to
actual methods, artificial intelligence and machine learning—are more appropriate.
3. Literature Survey
1. Shivam Agarwal et al.: Data mining is a field at the intersection of computer science and
statistics used to discover patterns in the information bank. The main aim of the data
mining process is to extract useful information from a collection of data and mold it
into an understandable structure for future use. Different processes and
techniques are used to carry out data mining successfully.
3. B. N. Lakshmi et al.: Data mining, the non-trivial extraction of novel, implicit, and
actionable knowledge from large data sets, is an evolving technology which is a direct
result of the increasing use of computer databases to store and retrieve
information effectively. It is also known as Knowledge Discovery in Databases
(KDD) and enables data exploration, data analysis, and data visualization of huge
databases at a high level of abstraction, without a specific hypothesis in mind. The
working of data mining is understood through modeling, which is used to
make predictions. Data mining techniques are the result of a long process of research and
product development and include artificial neural networks, decision trees and genetic
algorithms. This paper surveys data mining technology: its definition, motivation,
process and architecture, the kinds of data mined, the functionalities and classification of
data mining, major issues, applications and directions for further research in data
mining technology.
4. Ramakrishna Hegde et al.: This review paper presents a literature survey on the
prediction of scholarships using machine learning (ML) and data mining (DM) techniques.
It also gives a short description of the ML/DM methods used by
researchers and describes why data sets are very important to ML/DM methods. Machine
learning has become very popular in the IT industry. Machine
learning and data mining are now powerful techniques applicable to various
fields such as IT, education, and business. The different types
of ML/DM algorithms are addressed. The algorithms that
give the most accurate results in detecting the continuity of each student's scholarship
include Naïve Bayes, Decision Tree and k-NN. Finally, the proposed model
provides a list of candidates who deserve a scholarship, and the
accuracy of each technique used to obtain the result is discussed.
5. B. R. Chandavarkar et al.: Machine learning and data mining for healthcare. There
has been enormous growth in the field of health information technology (HIT) in
recent years. Be it the detection of certain diseases, the scanning of organs, or the finding of
tumors, these machine-oriented operations without human intervention have certainly
increased the quality of medical attention one can get, and the technology required has
come a long way. Health data tends to be inherently complex, with exceptions in
almost all cases. Data mining is the technique of converting raw data into a
meaningful format. Analysis and prediction on such data, although computationally
and algorithmically complex, is an emerging technology that is a small step towards more
proactive and preventive automated treatment options. There are various data mining
techniques such as classification, clustering, association, regression, prediction, pattern
recognition, etc. [1]. Even the efficiency of certain medicines can be found using
machine learning techniques, which is a life-saving and cost-effective method. In this
paper, we use machine learning as a tool for predictive analysis to predict
chronic kidney disease based on the Chronic disease dataset taken from the UCI ML
repository. We apply machine learning algorithms, specifically decision
trees, to build a classifier that predicts whether a person has the disease or not. This paper
shows that specific machine learning algorithms need to be tailor-made to the
specific nature of medical data.
6. Chitra Jalota et al.: Higher education institutions are often very curious to know
about the success rate of students throughout their studies. For this reason, they need
to use several methods such as physical examination, statistical methods and the currently
prevailing data mining techniques to predict students' performance. An
upcoming area of research which uses data mining techniques is known as
Educational Data Mining. It involves machine learning algorithms and statistical
techniques that help the user interpret students' learning habits, their
academic performance and further improvement if required. In this paper we
discuss various data mining techniques which are useful for predicting the performance
level of students. For this, we used the Kalboard 360 dataset and loaded it into Weka to
analyze the data mining techniques.
8. Hoda Zahedi et al.: The recent global outbreak of coronavirus disease (COVID-19)
is affecting many countries worldwide. Iran is one of the top 10 most affected
countries. Search engines provide useful data from populations, and these data might
be useful to analyze epidemics. Utilizing data mining methods on electronic resources’
data might provide a better insight into the COVID-19 outbreak to manage the health
crisis in each country and worldwide.
10. Gonzalo Gomez-Sanchez et al.: For many years, a major question in cancer
genomics has been the identification of those variations that can have a functional role
in cancer, distinguishing them from the majority of genomic changes that have no
functional consequences. This is particularly challenging when considering complex
chromosomal rearrangements, often composed of multiple DNA breaks, resulting in
difficulties in classifying and interpreting them functionally. Despite recent efforts
towards classifying structural variants (SVs), more robust statistical frames are needed
to better classify these variants and isolate those that derive from specific molecular
mechanisms. We present a new statistical approach to analyze SVs patterns from 2392
tumor samples from the Pan-Cancer Analysis of Whole Genomes (PCAWG)
Consortium and identify significant recurrence, which can inform relevant
mechanisms involved in the biology of tumors. The method is based on recursive
KDE clustering of 152,926 SVs, randomization methods, graph mining techniques and
statistical measures. The proposed methodology was able not only to identify complex
patterns across different cancer types but also to prove them as not random
occurrences. Furthermore, a new class of pattern that was not previously described has
been identified.
4. Objectives
The main objective of data mining is the automatic or semi-automatic analysis of large
amounts of data to extract interesting, previously unknown patterns, such as
groups of data records (cluster analysis), unusual records (anomaly
detection), and dependencies (association rule mining).
This generally involves the use of database techniques such as spatial indexes.
These patterns can then be seen as a kind of summary of the input data, and they
can be used in further analysis or, for example, in machine learning
and predictive analytics.
For example, the data mining step might identify several groups in
the data, which a decision support system can then use to obtain more accurate
predictions.
Neither data collection, data preparation, nor interpretation of results and information
is part of the data mining stage. However, they belong to the entire KDD process as
additional steps.
5. Theory
Data mining includes the utilization of refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets. These tools can
incorporate statistical models, machine learning techniques, and mathematical
algorithms, such as neural networks or decision trees. Thus, data mining incorporates
analysis and prediction.
In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.
2. Clustering:
Clustering analysis is a data mining technique used to
identify similar data. It helps to recognize the differences and similarities
between data items. Clustering is very similar to classification, but it involves
grouping chunks of data together based on their similarities rather than on predefined classes.
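As an illustration of grouping by similarity, the following is a minimal sketch of k-means, one common clustering algorithm (the text above does not name a specific algorithm). The function name k_means, the sample points, and the choice of starting centroids are assumptions made for this example.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def k_means(points, centroids, iterations=10):
    """Group points around the nearest centroid (plain k-means)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach every point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(clusters)
```

The starting centroids are passed in explicitly so that the run is repeatable; practical implementations usually pick them at random and repeat the run several times.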
3. Regression:
Regression analysis is the data mining process used to identify and analyze the
relationship between variables in the presence of other factors. It is used to
estimate the likely value of a specific variable. Regression is primarily a form of
planning and modeling. For example, we might use it to project certain costs,
depending on other factors such as availability, consumer demand, and competition.
It estimates the relationship between two or more variables in the given
data set.
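To make the idea concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares with one predictor. The function name fit_line and the demand and cost figures are invented for illustration and do not come from the report.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: cost rises by 2 for every unit of demand, plus a base of 5.
demand = [10, 20, 30, 40, 50]
cost = [25, 45, 65, 85, 105]
slope, intercept = fit_line(demand, cost)

# Project the cost for a demand level that is not in the data.
projected_cost = slope * 60 + intercept
print(slope, intercept, projected_cost)
```

Real data rarely lies exactly on a line, so the fitted line is an estimate of the relationship and not an exact law.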
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It
finds hidden patterns in the data set.
Association rules are if-then statements that show the probability of
interactions between data items within large data sets in different types of databases.
Association rule mining has several applications and is commonly used to find sales
correlations in transactional data or in medical data sets.
The algorithm works on a collection of data, for example, a list of
grocery items that you have been buying for the last six months. It calculates the
percentage of items being purchased together.
o Lift:
This measure compares the confidence of the rule with how often item B is
purchased overall.
Lift = (Confidence) / ((Transactions containing Item B) / (Entire dataset))
o Support:
This measure shows how often items A and B are purchased together,
compared to the overall dataset.
Support = (Transactions containing both Item A and Item B) / (Entire dataset)
o Confidence:
This measure shows how often item B is purchased when
item A is purchased as well.
Confidence = (Transactions containing both Item A and Item B) / (Transactions containing Item A)
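The three measures can be sketched in a few lines of Python. The basket contents and the helper names support, confidence, and lift are assumptions made for this illustration. Each basket is a set of items bought together.

```python
# Hypothetical shopping baskets; each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(a, b):
    """How often b is bought when a is bought: support(a and b) / support(a)."""
    return support(a | b) / support(a)


def lift(a, b):
    """Confidence of a -> b relative to how often b is bought overall."""
    return confidence(a, b) / support(b)


bread, milk = {"bread"}, {"milk"}
print("support:", support(bread | milk))
print("confidence:", confidence(bread, milk))
print("lift:", lift(bread, milk))
```

A lift above 1 suggests the two items are bought together more often than chance would predict, while a lift below 1 suggests the opposite.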
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data
set which do not match an expected pattern or expected behavior. This technique may
be used in various domains like intrusion detection, fraud detection, etc. It is also
known as Outlier Analysis or Outlier mining. An outlier is a data point that diverges
too much from the rest of the dataset. The majority of real-world
datasets contain outliers. Outlier detection plays a significant role in the data mining
field. It is valuable in numerous fields like network intrusion
identification, credit or debit card fraud detection, detecting outliers in wireless sensor
network data, etc.
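One simple way to flag such points is the z-score rule, sketched below. It is only one of many outlier detection methods, and the function name z_score_outliers, the threshold of 2.0, and the sensor readings are assumptions chosen for this example.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing diverges
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one value far from the rest.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = z_score_outliers(readings)
print(outliers)
```

Because an extreme value also inflates the mean and the standard deviation, robust alternatives based on the median are often preferred for small samples.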
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating
sequential data to discover sequential patterns. It consists of finding interesting
subsequences in a set of sequences, where the interestingness of a subsequence can be measured in
terms of different criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar
patterns in transaction data over a period of time.
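A very small version of this idea is sketched below: it counts in how many sequences one item occurs before another and keeps the ordered pairs that meet a minimum count. The function name frequent_ordered_pairs, the purchase histories, and the restriction to pairs are assumptions for illustration. Full sequential pattern mining algorithms handle longer subsequences.

```python
from collections import Counter
from itertools import combinations


def frequent_ordered_pairs(sequences, min_support):
    """Count sequences in which item a occurs before item b; keep frequent pairs."""
    counts = Counter()
    for seq in sequences:
        # combinations() preserves order, so (a, b) means a came before b.
        # The set ensures each pair is counted once per sequence.
        counts.update(set(combinations(seq, 2)))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical purchase histories, each ordered by time.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone", "charger"],
    ["laptop", "mouse"],
]
patterns = frequent_ordered_pairs(sequences, min_support=2)
print(patterns)
```

Here occurrence frequency is the criterion of interest, which matches one of the criteria named above.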
7. Prediction:
Data mining in healthcare has excellent potential to improve the health system. It uses
data and analytics for better insights and to identify best practices that will enhance
health care services and reduce costs. Analysts use data mining approaches such as
machine learning, multi-dimensional databases, data visualization, soft computing,
and statistics. Data mining can be used to forecast patients in each category. The
procedures ensure that the patients get intensive care at the right place and at the right
time. Data mining also enables healthcare insurers to recognize fraud and abuse.
7. Case Study 2
9. References
[1] Sankar K. Pal, Varun Talwar, Pabitra Mitra, “Web Mining in Soft Computing
Framework: Relevance, State of the Art and Future Directions”, IEEE
Transactions on Neural Networks, vol. 13, no. 5, September
2002.
[3] Ralph Gross, Alessandro Acquisti, “Information Revelation and
Privacy in Online Social Networks (The Facebook case)”, Preproceedings version,
ACM Workshop on Privacy in the Electronic Society (WPES), 2005.
[4] Huan Liu and Lei Yu, “Toward Integrating Feature Selection Algorithms for
Classification and Clustering”, IEEE Transactions on Knowledge and Data
Engineering, vol. 17, no. 4, April 2005.
[6] Marcelo Maia, Jussara Almeida, Virgílio Almeida, “Identifying User Behavior in
Online Social Networks”, SocialNets’08, April 1, 2008, Glasgow, Scotland, UK.
ACM ISBN 978-1-60558-124-8/08/04.
[8] Ai Ho, Abdou Maiga, Esma Aïmeur, “Privacy Protection Issues in Social
Networking Sites”, IEEE, 2009, ISBN 978-1-4244-3806-8.
[9] L. Dey and Sk. M. Haque, “Opinion mining from noisy text data,” International
Journal on Document Analysis and Recognition, vol. 12, no. 3, pp. 205–226, 2009.