The document discusses various machine learning techniques for sentiment analysis, linear regression, classification, and clustering. It provides descriptions of key steps for preparing data, developing models, evaluating performance, and interpreting results for these different machine learning methods. Parameter optimization, cross validation, and comparing training vs. validation results are also covered.
SENTIMENT ANALYSIS
● process of computationally identifying and categorizing opinions expressed in a piece of text as positive, negative, or neutral
● unsupervised data

MEANING CLOUD
● does not offer a continuous score
● nominal “score_tag”
● P+, P, NEU, N, N+
● P+ = 1 (STRONG POSITIVE)
● P = 0.75 (POSITIVE)
● NEU = 0 (NEUTRAL)
● N = -0.75 (NEGATIVE)
● N+ = -1 (STRONG NEGATIVE)

PERFORMING SENTIMENT ANALYSIS
● en: text language
● gen_en: sentiment model
● pie graph
  ○ value column: polarity (comments)
  ○ group by: polarity (comments)
  ○ aggregation function: count
----------------------------------------------------------------
LINEAR REGRESSION
● linear approach to modeling the relationship between a quantitative dependent variable (y) and one or more independent variables (x1, x2, ..., xp)

● Subprocess
  ○ prepare data for analysis

● Set role
  ○ assigns the dependent variable (y)

● Nominal to Numerical
  ○ set/create dummy variables
  ○ categorical predictors (x)

● Select Attribute
  ○ remove dummy variables
  ○ categorical predictors (x)

● Correlation filter
  ○ filter relation: correlation greater than 0.90
  ○ remove multicollinear variables
  ○ numerical predictors (x)

● Filter examples
  ○ remove examples of y with missing values
  ○ separate data set to be used in testing, training, and validation
  ○ dealing with missing values in y

● Replace Missing Values
  ○ replace missing values in the data set with the average
  ○ dealing with missing values in x

● Split data
  ○ separate into 80% (build model/train & test) and 20% (validation)
  ○ data set with missing values = example
  ○ Pareto Principle - (80/20)

● Optimize Parameter Grid
  ○ 80% of the data will enter
  ○ fail on error - won’t proceed if it encounters an error in the data set
  ○ validation takes place here
  ○ setting optimization parameters

● Cross Validation
  ○ setting optimization parameters
  ○ 80% is inside (training)

● Grid/Range
  ○ the maximum value depends on the number of samples available for model building

● The rule of thumb
  ○ the number of cases divided by the number of folds should be at least 15
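The 80/20 split and the fold rule of thumb can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (split_80_20, max_folds, k_fold_indices) and the 200-row toy data set are hypothetical and do not come from the notes or from any particular tool.

```python
import random


def split_80_20(rows, seed=0):
    """Shuffle the rows and return (modeling, validation) as an 80/20 split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def max_folds(n_cases, min_cases_per_fold=15):
    """Largest fold count that keeps n_cases / folds at or above the minimum."""
    return n_cases // min_cases_per_fold


def k_fold_indices(n_cases, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_cases // k + (1 if i < n_cases % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_cases))
        yield train_idx, test_idx
        start += size


# Hypothetical data set of 200 cases, identified by row number only.
rows = list(range(200))
modeling, validation = split_80_20(rows)   # 160 modeling rows, 40 validation rows
folds = max_folds(len(modeling))           # 160 // 15 = 10 folds at most
```

With 160 modeling cases, 10 folds leave 16 cases per fold, which satisfies the rule of thumb; 11 folds would drop below 15. Cross validation needs at least 2 folds, so a result below 2 suggests the data set is too small for this rule.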
● Apply Model 2
  ○ sets up the evaluation of the regression model

● Apply Model 3
  ○ used in predicting the dependent variable y in an example data set

PERFORMANCE PARAMETER SET
● Root Mean Squared Error (RMSE)
  ○ the lower the error, the better

● Absolute Error (MAE)
  ○ the lower the error, the better

● Relative Error (MAPE)
  ○ the lower the error, the better

● Squared Correlation
  ○ the higher the value between predicted and actual values, the better

* Validation data set results should be better than, or almost the same as, the Training and Testing data set results

RESULTS
● Linear Regression Results
  ○ represents the equation of the linear regression model that was created from the training and testing data set
  ○ represents the attributes that are significant (p<0.05) and those that are not (p>0.05)

● Performance Vector Results
  ○ represents the performance of the model in the training and testing data set
  ○ the result here will be compared to the “Performance Vector (2)” to assess the model

● Performance Vector (2) Results
  ○ represents the performance of the model on the validation data set
  ○ determines how the model will perform on a data set that was not included in model creation
  ○ the result of PerformanceVector(2) should be better than, or almost the same as, that of PerformanceVector

● Optimized Parameter Grid Result
  ○ lists the different criterion parameters used in determining the performance of the model for each iteration of the optimization

● Apply Model 3 Result
  ○ shows the predicted values for the unknown values of the target variable
----------------------------------------------------------------
CLASSIFICATION OF MODELS
● Classification
  ○ process of predicting the class of given data points
  ○ belongs to the category of supervised learning, where the targets are also provided with the input data

● Classification predictive modeling
  ○ task of approximating a mapping function (f) from input variables (X) to discrete output variables (y)

● Measures of Accuracy (higher values are better)
  ○ Accuracy
  ○ Precision
  ○ Sensitivity
  ○ Specificity
  ○ True negative
  ○ True positive

● Measures of error (lower values are better)
  ○ Classification error
  ○ False negative
  ○ False positive
----------------------------------------------------------------
CLUSTERING
● Supervised
  ○ directed data mining
  ○ The model generalizes the relationship between the input and output variables

● Unsupervised
  ○ undirected data mining
  ○ The objective of this class of data mining techniques is to find patterns in data based on the relationship between the data points themselves

TYPES OF MODELS
● Predictive Analytics (Linear Regression)
  ○ How much or How many?
  ○ Linear Regression
  ○ Generalized Regression Models
  ○ Non-Linear Regression (ex. polynomial)
  ○ Neural networks
  ○ SVM

● Classification Models (Is this A or B?)
  ○ Logistic Regression
  ○ Decision Tree
  ○ Random Forest
  ○ Naïve Bayes
  ○ Neural Networks
  ○ SVM

● Clustering Techniques
  ○ How many groups is best? Who is grouped with whom?
  ○ K-Means
  ○ K-Medoids
  ○ X-Means
  ○ Hierarchical
  ○ Neural networks
  ○ Support vector clustering

CLUSTERING METHOD
● process of dividing the data points into a number of groups
● Clustering is an unsupervised method to split a data set into a finite number of groups

● Set Role
  ○ identifies what will be grouped

● Select Attributes
  ○ to know the basis (variables) for clustering

● Filter Examples
  ○ remove missing values (rows/examples)

● Normalize
  ○ 1st method before clustering
  ○ z-score = standard values
  ○ turns all values into z-scores (removes the unit of measurement; makes them comparable)

● Clustering
  ○ k-means - one of the most common techniques
  ○ n examples are grouped into different clusters
  ○ predetermined number
  ○ k = no. of clusters
  ○ 1. asks for no. of groups
  ○ 2. gets points at random (gets distance from centroids)

● Multiply
  ○ to see who are grouped

● Cluster Model Visualization
  ○ describe cluster 1 and 0 (why they are grouped together)

* x-means (minimum k = 3): predetermined range (e.g. min of 2)
* k-medoids - uses a medoid (an actual data point, comparable to a median) as the cluster center
* max no. of runs = 15
* Use Mixed measures as measure type in the parameter tab
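The Normalize and k-means steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used by any particular tool: the function names, the max_iterations cap, the seed, and the six-row age/income data set are all assumptions made for the example. The max_iterations argument bounds the assign/update loop of a single run and is not the "max no. of runs" setting noted above.

```python
import math
import random
import statistics


def z_scores(column):
    """Standardize one column: (value - mean) / standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    return [(value - mean) / sd for value in column]


def normalize(rows):
    """Apply z-score normalization to every column of a list of rows."""
    columns = [z_scores(column) for column in zip(*rows)]
    return [list(point) for point in zip(*columns)]


def k_means(points, k, max_iterations=100, seed=0):
    """Plain k-means: random starting centroids, then assign/update loops."""
    centroids = random.Random(seed).sample(points, k)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed, so the clusters are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [statistics.mean(dim) for dim in zip(*members)]
    return labels, centroids


# Hypothetical examples: (age in years, income) are on very different scales.
rows = [[25, 30000], [27, 32000], [24, 29000],
        [52, 90000], [55, 95000], [50, 88000]]
points = normalize(rows)
labels, centroids = k_means(points, k=2)
```

Without the z-score step, income would dominate every distance because its values are thousands of times larger than age. After normalization both attributes contribute comparably, and the first three rows end up in one cluster and the last three in the other.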