
SENTIMENT ANALYSIS
● process of computationally identifying and categorizing opinions expressed in a piece of text as either positive, negative, or neutral
● unsupervised data (no target labels are provided)

MEANING CLOUD
● does not offer a continuous score
● nominal “score_tag”
● P+, P, NEU, N, N+
● P+ = 1 (STRONG POSITIVE)
● P = 0.75 (POSITIVE)
● NEU = 0 (NEUTRAL)
● N = -0.75 (NEGATIVE)
● N+ = -1 (STRONG NEGATIVE)

PERFORMING SENTIMENT ANALYSIS
● en: text language
● gen_en: sentiment model
● pie graph
    ○ value column: polarity (comments)
    ○ group by: polarity (comments)
    ○ aggregation function: count
----------------------------------------------------------------
LINEAR REGRESSION
● linear approach to modeling the relationship between a quantitative dependent variable (y) and one or more independent variables (x1, x2, ..., xp)

● Subprocess
    ○ prepare the data for analysis

● Set Role
    ○ assign the dependent variable (y)

● Nominal to Numerical
    ○ set/create dummy variables
    ○ categorical predictors (x)

● Select Attributes
    ○ remove dummy variables
    ○ categorical predictors (x)

● Remove Correlated Attributes
    ○ greater than 0.90 as the correlation filter relation
    ○ remove multicollinear variables
    ○ numerical predictors (x)

● Filter Examples
    ○ remove examples with missing values of y
    ○ separate the data set to be used in testing, training, and validation
    ○ deals with missing values in y

● Replace Missing Values
    ○ replace missing values in the data set with the average
    ○ deals with missing values in x

● Split Data
    ○ separate into 80% (build model/train & test) and 20% (validation)
    ○ data set with missing values = example set
    ○ Pareto Principle (80/20)

● Optimize Parameter Grid
    ○ the 80% of the data enters here
    ○ fail on error - won't proceed if it encounters an error in the data set
    ○ validation takes place here
    ○ setting of the optimization parameters

● Cross Validation
    ○ setting of the optimization parameters
    ○ the 80% is inside (training)

● Grid/Range
    ○ the maximum value depends on the number of samples available for model building

● The rule of thumb
    ○ the number of cases divided by the number of folds should be at least 15

(see the sketch below for how these preparation and optimization steps fit together in code)
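A minimal Python sketch (pandas / scikit-learn) of the preparation and optimization steps above: dummy coding, a 0.90 correlation filter, mean imputation, the 80/20 split, and a grid search with cross-validation. The file name data.csv, the target column "y", and the parameter grid are illustrative assumptions, not part of the original process; this is a stand-in for the operators above, not their exact behaviour.

# Sketch only: pandas/scikit-learn stand-ins for the operators listed above.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("data.csv")                       # hypothetical data set with target column "y"

# Filter Examples: keep rows where y is present; rows with missing y form the example set
known = df[df["y"].notna()].copy()
unknown = df[df["y"].isna()].copy()                # predicted later with the final model

# Nominal to Numerical: dummy-code the categorical predictors (x)
X = pd.get_dummies(known.drop(columns=["y"]), dtype=float)
y = known["y"]

# Replace Missing Values: fill missing predictor values with the column average
X = X.fillna(X.mean())

# Remove Correlated Attributes: drop one attribute from each pair with |correlation| > 0.90
corr = X.corr().abs()
cols = list(corr.columns)
to_drop = set()
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if corr.loc[a, b] > 0.90 and b not in to_drop:
            to_drop.add(b)
X = X.drop(columns=sorted(to_drop))

# Split Data: 80% to build the model (train & test), 20% held out for validation (Pareto 80/20)
X_build, X_val, y_build, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

# Optimize Parameter Grid + Cross Validation: grid search over parameters on the 80% portion.
# Rule of thumb from the notes: cases / folds should be at least 15.
folds = max(2, min(10, len(X_build) // 15))
grid = GridSearchCV(LinearRegression(), param_grid={"fit_intercept": [True, False]}, cv=folds)
grid.fit(X_build, y_build)
print(grid.best_params_, grid.best_score_)

The parameter grid over fit_intercept is only a placeholder; plain linear regression has few parameters to tune, so in practice the optimization step matters more for other model types.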
● Apply Model 2
    ○ sets up the evaluation of the regression model

● Apply Model 3
    ○ used in predicting the dependent variable y in an example data set

PERFORMANCE PARAMETER SET
● Root Mean Squared Error (RMSE)
    ○ the lower the error/value, the better
● Absolute Error (MAE)
    ○ the lower the error/value, the better
● Relative Error (MAPE)
    ○ the lower the error/value, the better
● Squared Correlation
    ○ the higher the value between predicted and actual values, the better

* performance on the validation data set should be better than (or close to) performance on the training and testing data set

RESULTS
● Linear Regression Results
    ○ represents the equation of the linear regression model that was created from the training and testing data set
    ○ shows which attributes are significant (p < 0.05) and which are not (p > 0.05)

● Performance Vector Results
    ○ represents the performance of the model on the training and testing data set
    ○ this result is compared to the “Performance Vector (2)” to assess the model

● Performance Vector (2) Results
    ○ represents the performance of the model on the validation data set
    ○ determines how the model will perform on a data set that was not included in model creation
    ○ the result of PerformanceVector (2) should be better than, or almost the same as, PerformanceVector

● Optimized Parameter Grid Result
    ○ lists the different parameter values tried and the criterion used to judge the performance of the model across the iterations of the optimization

● Apply Model 3 Result
    ○ shows the predicted values for the unknown values of the target variable
----------------------------------------------------------------
CLASSIFICATION OF MODELS
● Classification
    ○ process of predicting the class of given data points
    ○ belongs to the category of supervised learning, where the targets are provided together with the input data

● Classification predictive modeling
    ○ task of approximating a mapping function (f) from input variables (X) to discrete output variables (y)

● Measures of Accuracy (higher values are better)
    ○ Accuracy
    ○ Precision
    ○ Sensitivity
    ○ Specificity
    ○ True negative
    ○ True positive

● Measures of error (lower values are better)
    ○ Classification error
    ○ False negative
    ○ False positive

(see the sketch below for how these regression and classification measures are computed)
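A minimal sketch of how the measures listed above could be computed with scikit-learn. The y_true/y_pred and yc_true/yc_pred arrays are made-up placeholders; the point is only to show which formula each measure corresponds to (RMSE, absolute error, relative error, squared correlation, and the confusion-matrix based classification measures).

# Sketch: computing the regression and classification measures listed above.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_absolute_error, mean_squared_error,
                             precision_score)

# Regression measures (lower errors are better; higher squared correlation is better)
y_true = np.array([3.0, 5.0, 7.5, 10.0])             # assumed actual values of y
y_pred = np.array([2.8, 5.4, 7.0, 9.5])              # assumed predictions from the model

rmse = mean_squared_error(y_true, y_pred) ** 0.5     # Root Mean Squared Error
mae = mean_absolute_error(y_true, y_pred)            # Absolute Error (MAE)
mape = np.mean(np.abs((y_true - y_pred) / y_true))   # Relative Error (MAPE)
sq_corr = np.corrcoef(y_true, y_pred)[0, 1] ** 2     # Squared Correlation of predicted vs actual

# Classification measures, taken from the confusion matrix of a binary problem
yc_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])         # assumed actual classes
yc_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])         # assumed predicted classes

tn, fp, fn, tp = confusion_matrix(yc_true, yc_pred).ravel()
accuracy = accuracy_score(yc_true, yc_pred)          # (tp + tn) / all cases
precision = precision_score(yc_true, yc_pred)        # tp / (tp + fp)
sensitivity = tp / (tp + fn)                         # true positive rate (recall)
specificity = tn / (tn + fp)                         # true negative rate
classification_error = 1 - accuracy                  # lower is better

print(rmse, mae, mape, sq_corr)
print(accuracy, precision, sensitivity, specificity, classification_error)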
----------------------------------------------------------------
CLUSTERING
● Supervised
    ○ directed data mining
    ○ the model generalizes the relationship between the input and output variables

● Unsupervised
    ○ undirected data mining
    ○ the objective of this class of data mining techniques is to find patterns in the data based on the relationships between the data points themselves

TYPES OF MODELS
● Predictive Analytics (Linear Regression)
    ○ How much or how many?
    ○ Linear Regression
    ○ Generalized Regression Models
    ○ Non-Linear Regression (e.g. polynomial)
    ○ Neural networks
    ○ SVM

● Classification Models
    ○ Is this A or B?
    ○ Logistic Regression
    ○ Decision Tree
    ○ Random Forest
    ○ Naïve Bayes
    ○ Neural Networks
    ○ SVM

● Clustering Techniques
    ○ How many groups is best? Which ones are grouped with each other?
    ○ K-Means
    ○ K-Medoids
    ○ X-Means
    ○ Hierarchical
    ○ Neural networks
    ○ Support vector clustering

CLUSTERING METHOD
● process of dividing the data points into a number of groups
● clustering is an unsupervised method to split a data set into a finite number of groups
● (the steps below are sketched in code at the end of this section)

● Set Role
    ○ identifies the one that will be grouped

● Select Attributes
    ○ to know the basis (variables) for clustering

● Filter Examples
    ○ remove missing values (rows/examples)

● Normalize
    ○ 1st step before clustering
    ○ z-score = standard values
    ○ turns all values into z-scores (removes the unit of measurement; makes them comparable)

● Clustering
    ○ k-means - one of the most common techniques
    ○ n examples are grouped into different clusters
    ○ the number of clusters is predetermined
    ○ k = no. of clusters
    ○ 1. asks for the no. of groups
    ○ 2. gets points at random (gets the distance from the centroids)

● Multiply
    ○ to see which examples are grouped together

● Cluster Model Visualization
    ○ describes cluster 1 and cluster 0 (why they are grouped together)

* x-means (minimum k = 3): predetermined
* range (e.g. min of 2)
* k-medoids - uses the median (medoid) instead of the mean
* max no. of runs = 15
* use Mixed Measures as the measure type in the parameter tab
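A minimal sketch of the normalize-then-cluster flow described above, using scikit-learn in place of the operators: z-scores first, then k-means with a predetermined k. The file name customers.csv, the id column customer_id, k = 3, and the assumption that the remaining columns are numeric are illustrative, not taken from the original notes.

# Sketch: z-score normalization followed by k-means, mirroring the clustering steps above.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")                  # hypothetical data set: numeric variables + an id
ids = df["customer_id"]                            # Set Role: the id of what is being grouped
X = df.drop(columns=["customer_id"])               # Select Attributes: the basis (variables) for clustering
X = X.dropna()                                     # Filter Examples: remove rows with missing values
ids = ids.loc[X.index]

# Normalize: turn every variable into z-scores so the unit of measurement no longer matters
X_std = StandardScaler().fit_transform(X)

# Clustering: k-means with a predetermined number of clusters k
k = 3                                              # illustrative choice of k
kmeans = KMeans(n_clusters=k, n_init=15, random_state=42)   # n_init echoes the "max no. of runs = 15" note
labels = kmeans.fit_predict(X_std)

# Multiply / Cluster Model Visualization: see which examples fall into each cluster and describe them
result = pd.DataFrame({"id": ids.values, "cluster": labels})
print(result.groupby("cluster").size())            # how many examples per cluster
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=X.columns)
print(centroids)                                   # cluster centers, in z-score units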
