Orange Data Mining Tool: Presentation

Group Members:
•Name Registration Number
Why Orange?
● Open source
● Component based
● No programming
● Data visualization
● Platform independent software
● Data mining through visual programming and Python scripting
Introduction
● Orange is component-based visual programming software for data mining, machine learning and data analysis.
● Supports communication between data scientists and domain experts.
● Allows clustering and classification.
You can download the Orange software from this link:
https://orange.biolab.si/getting-started/
Getting Started With ORANGE!!
Dataset: Heart Disease
● Has 303 instances
● 13 attributes
● Categorical class with 2 values (0,1)
● In .csv format
● Source: preloaded datasets of Orange.
ATTRIBUTES
● Narrowing diameter
● Cholesterol
● Chest pain
● Rest ECG
● Fasting blood sugar
● Max HR
● Age, gender and more
Dataset: How do the following factors cause heart disease?
● Age: the risk of heart disease increases after age 65.
● Fatty deposits called plaques collect along the artery walls,
● slow the blood flow from the heart,
● and cause coronary heart disease.
● Gender: heart disease is a leading cause of death for both men and women.
● Angina: chest pain or discomfort caused when the heart muscle doesn't get enough oxygen-rich blood.
● Cholesterol: when there is too much cholesterol in the blood,
● it builds up in the walls of the arteries,
● causing a process called atherosclerosis, which can lead to heart disease.
● Diameter narrowing:
● Heart disease is caused by the narrowing or blockage of the coronary arteries.
● Target attribute (0,1)
Loading data file into data table:
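The same loading step can also be sketched in Orange's Python scripting layer instead of the widgets. This is a minimal sketch, assuming the Orange3 package is installed and that the bundled dataset is reachable under the name heart_disease; the helper describe is hypothetical and not part of Orange.

```python
# Minimal sketch: load the heart disease data through Orange scripting.
# Assumption: Orange3 is installed and ships the dataset as "heart_disease".
try:
    import Orange
except ImportError:  # Orange is optional for this sketch
    Orange = None


def describe(table):
    """Hypothetical helper: return (instances, attributes, class values)."""
    domain = table.domain
    return len(table), len(domain.attributes), list(domain.class_var.values)


if Orange is not None:
    data = Orange.data.Table("heart_disease")
    print(describe(data))
```

If the assumptions hold, the printed tuple should agree with the instance, attribute and class counts listed on the dataset slide.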
EDA: Exploratory data analysis
● Distributions
Selected Algorithm
Algorithms:
● Neural Network
● KNN
● Random Forest
● Naïve Bayes
● Logistic Regression
● Decision Tree
Experimental Setup
This is how we drag and drop the widgets to implement our algorithms.
KNN (k-nearest neighbors)
KNN is a non-parametric method used for classification and regression.
It requires three things:
● The set of stored records.
● A distance metric to compute the distance between records.
● The value of k, the number of nearest neighbors to retrieve for an unknown record.
Math equation (Euclidean distance): $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
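The three ingredients above can be sketched in a few lines of plain Python. This is only an illustration: the function names euclidean and knn_predict and the toy records are made up for the example, and this is not how Orange's kNN widget is implemented.

```python
import math
from collections import Counter


def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(records, labels, unknown, k=3):
    """Classify `unknown` by majority vote among its k nearest stored records."""
    # Sort the stored records by their distance to the unknown record.
    ranked = sorted(zip(records, labels), key=lambda rl: euclidean(rl[0], unknown))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up toy records (two numeric features each), not from the heart disease data.
records = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
labels = [0, 0, 1, 1]
print(knn_predict(records, labels, (1.2, 1.4), k=3))
```

An odd k is a common choice for two-class problems because it avoids tied votes.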
Decision tree
● Used to visually and explicitly represent decisions and decision making.
● One of the predictive modelling approaches used in statistics, data mining and machine learning.
$Entropy(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
Naïve Bayes
● Also known as the Naive Bayes classifier.
● Assumes the attributes are statistically independent of one another.
● In practice there will be some correlation between features for a given class.
● Unlike other classifiers, it explicitly models the features as conditionally independent given the class.
$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$
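To make the rule concrete, here is a small sketch that applies Bayes' theorem with the conditional independence assumption, so P(X|H) becomes a product of per-feature probabilities. The function name, the feature names and all probabilities are made up for illustration and are not estimated from the heart disease data.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """Return P(H | X) for each class H, assuming features are independent given H.

    priors:      {class: P(H)}
    likelihoods: {class: {feature: P(feature present | H)}}
    observed:    list of features that are present in X
    """
    # Numerator of Bayes' theorem: P(X | H) * P(H), with P(X | H) as a product.
    joint = {}
    for h, prior in priors.items():
        p_x_given_h = 1.0
        for feature in observed:
            p_x_given_h *= likelihoods[h][feature]
        joint[h] = p_x_given_h * prior
    # Denominator P(X) is the sum of the numerators over all classes.
    p_x = sum(joint.values())
    return {h: value / p_x for h, value in joint.items()}


# Made-up probabilities for illustration only.
priors = {0: 0.5, 1: 0.5}
likelihoods = {
    0: {"chest_pain": 0.2, "high_chol": 0.4},
    1: {"chest_pain": 0.8, "high_chol": 0.6},
}
posterior = naive_bayes_posterior(priors, likelihoods, ["chest_pain", "high_chol"])
print(posterior)
```

The predicted class is the one with the largest posterior.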
Random Forest
● It is a flexible and simple algorithm.
● The Random Forest algorithm reduces the overfitting problem of a single decision tree.
● Used for identifying the most important features from the training dataset.
● It can be used for both classification and regression tasks.
Logistic Regression
● Used to assign observations to a discrete set of classes.
● Logistic regression can be binomial, ordinal or multinomial.
● Binary (Pass/Fail)
● Multinomial (Cats, Dogs, Sheep)
● Ordinal (Low, Medium, High)
● Can view probability scores underlying the model's classifications.
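The probability scores mentioned above can be illustrated for the binary case. In this sketch a linear score is passed through the sigmoid function and then thresholded; the weights, bias and threshold are hypothetical values, not a model fitted to the heart disease data.

```python
import math


def predict_proba(weights, bias, x):
    """P(class = 1 | x) from a linear score passed through the sigmoid."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def classify(weights, bias, x, threshold=0.5):
    """Binary decision plus the underlying probability score."""
    p = predict_proba(weights, bias, x)
    return (1 if p >= threshold else 0), p


# Hypothetical weights, chosen only for illustration.
label, score = classify([0.8, -0.4], bias=-0.2, x=[1.5, 0.5])
print(label, round(score, 3))
```

Keeping the score next to the label shows how confident the model is in each classification.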
Neural Network
● Neural networks are learning algorithms.
● They interpret sensory data through a kind of machine perception, labeling or clustering raw input.
● They consist of different layers for analyzing and learning from data.
Math equation: $f(X) = b + \sum_{i} w_i x_i$
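The weighted sum in the equation is what a single unit computes, and a layer repeats it once per unit. The sketch below assumes a ReLU activation, which the slide does not specify, and uses made-up weights; the function names are hypothetical.

```python
def weighted_sum(weights, inputs, bias):
    """f(X) = b + sum_i w_i * x_i for a single unit."""
    return bias + sum(w * x for w, x in zip(weights, inputs))


def layer(weight_rows, biases, inputs):
    """One layer: apply the weighted sum once per unit, then a ReLU activation."""
    # ReLU is an assumption here; the slide does not name an activation.
    return [max(0.0, weighted_sum(w, inputs, b)) for w, b in zip(weight_rows, biases)]


# Made-up weights for a layer with two units and three inputs.
print(weighted_sum([0.5, -1.0, 2.0], [1.0, 2.0, 3.0], bias=0.5))
print(layer([[0.5, -1.0, 2.0], [-1.0, 0.0, 0.0]], [0.5, 0.0], [1.0, 2.0, 3.0]))
```

Stacking several such layers gives the multi-layer structure described in the bullets.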
Concluding Results
Table to compare data
Algorithm            Recall  Precision  F-Measure
Neural Network       0.813   0.814      0.814
Logistic Regression  0.848   0.848      0.848
Random Forest        0.807   0.807      0.807
Thanks!
Any questions?