
Cheat Sheet: Machine Learning with KNIME Analytics Platform

SUPERVISED LEARNING

Supervised Learning: A set of machine learning algorithms to predict the value of a target class or variable. They produce a mapping function (model) from the input features to the target class/variable. To estimate the model parameters during the training phase, labeled example data are needed in the training set. Generalization to unseen data is evaluated on the test set via scoring metrics.

CLASSIFICATION

Classification: A type of supervised learning where the target is a class. The model learns to produce a class score and to assign each vector of input features to the class with the highest score. A cost can be introduced to penalize one of the classes during class assignment.
Decision Tree: Follows the C4.5 decision tree algorithm. These algorithms generate a tree-like structure, creating data subsets, aka tree nodes. At each node, the data are split based on one of the input features, generating two or more branches as output. Further splits are made in subsequent nodes until a node is generated where all or almost all of the data belong to the same class. (KNIME node: Decision Tree Learner)

Naive Bayes: Based on Bayes' theorem and assuming statistical independence between input features (thus "naive"), this algorithm estimates the conditional probability of each output class given the vector of input features. The class with the highest conditional probability is assigned to the input data. (KNIME node: Naive Bayes Learner)

Artificial Neural Networks (ANN, NN): Inspired by biological nervous systems, Artificial Neural Networks are based on architectures of interconnected units called artificial neurons. The artificial neurons' parameters and connections are trained via dedicated algorithms, the most popular being the Back-Propagation algorithm. (KNIME node: RProp MLP Learner)

Deep Learning: Deep learning extends the family of ANNs with deeper architectures and additional paradigms, e.g. Recurrent Neural Networks (RNN). The training of such networks has been enabled by recent advances in hardware performance as well as parallel execution. (KNIME node: Keras Dense Layer)

Support Vector Machine (SVM): A supervised algorithm constructing a set of discriminative hyperplanes in high-dimensional space. In addition to linear classification, SVMs can perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces, where the two classes are linearly separable. (KNIME node: SVM Learner)

Logistic Regression: A statistical algorithm that models the relationship between the input features and the categorical output classes by maximizing a likelihood function. Originally developed for binary problems, it has been extended to problems with more than two classes (multinomial logistic regression).

k-Nearest Neighbor (kNN): A non-parametric method that assigns the class of the k most similar points in the training data, based on a pre-defined distance measure. Class attribution can be weighted by the distance to the k-th point and/or by the class probability. (KNIME node: K Nearest Neighbor)
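Outside KNIME, the same model family can be sketched in a few lines of Python; a minimal decision tree sketch with scikit-learn (an illustrative stand-in: scikit-learn grows CART-style trees rather than C4.5, and the iris data and depth limit are arbitrary example choices):

    # Minimal decision tree classifier sketch with scikit-learn (CART-style).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    tree = DecisionTreeClassifier(max_depth=3)    # limit depth to curb overfitting
    tree.fit(X_train, y_train)                    # training phase: learn the splits
    print(tree.predict_proba(X_test[:3]))         # class scores for new feature vectors
    print(tree.score(X_test, y_test))             # accuracy on the test set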
NUMERIC PREDICTION

Numeric Prediction: A type of supervised learning for numeric target variables. The model learns to associate one or more numbers with the vector of input features. Note that numeric prediction models can also be trained to predict class scores, and can therefore be used for classification problems too.

Linear/Polynomial Regression: Linear Regression is a statistical algorithm to model a multivariate linear relationship between the numeric target variable and the input features. Polynomial Regression extends this concept to fitting a polynomial function of a pre-defined degree.

Regression Tree: Builds a decision tree to predict numeric values through a recursive, top-down, greedy approach known as recursive binary splitting. At each step, the algorithm splits the subset represented by each node into two or more new branches using a greedy search for the best split. The average value of the points in a leaf produces the numerical prediction.

Generalized Linear Model (GLM): A statistics-based, flexible generalization of ordinary linear regression, valid also for non-normal distributions of the target variable. GLM uses the linear combination of the input features to model an arbitrary function of the target variable (the link function) rather than the target variable itself. (KNIME node: H2O Generalized Linear Model Learner)
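A minimal linear and polynomial regression sketch in Python with scikit-learn (illustrative only; the toy data and degree 2 are arbitrary example choices):

    # Fit a straight line and a degree-2 polynomial to the same toy data.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(100, 1))
    y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.2, size=100)

    linear = LinearRegression().fit(X, y)                # multivariate linear fit
    poly = make_pipeline(PolynomialFeatures(degree=2),   # expand features, then ...
                         LinearRegression()).fit(X, y)   # ... fit linearly
    print(linear.predict([[1.0]]), poly.predict([[1.0]]))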
TIME SERIES ANALYSIS

Time Series Analysis: A set of numeric prediction methods to analyze and predict time series data. Time series are time-ordered sequences of numeric values. In particular, time series forecasting aims at predicting future values based on previously observed values.

Auto-Regressive Integrated Moving Average (ARIMA): A linear Auto-Regressive (AR) model is constructed on a specified number p of past values; the data are prepared by a degree of differencing d to correct non-stationarity; and a linear combination, named Moving Average (MA), models the q past residual errors. All ARIMA model parameters are estimated concurrently by various algorithms, mostly following the Box–Jenkins approach. (KNIME node: ARIMA Learner)
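A minimal ARIMA sketch in Python with statsmodels (an assumption: statsmodels is installed; the synthetic series and the order (1, 1, 1) are arbitrary example choices, not a recommendation):

    # Fit an ARIMA(p=1, d=1, q=1) model and forecast five steps ahead.
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    series = np.cumsum(rng.normal(size=200))   # synthetic non-stationary series

    model = ARIMA(series, order=(1, 1, 1))     # p AR terms, d differencing, q MA terms
    result = model.fit()                       # estimates all parameters concurrently
    print(result.forecast(steps=5))            # next five predicted values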
ML-based TSA: A numeric prediction model trained on vectors of past values can predict the current numeric value of the time series. (KNIME nodes: Lag Column, Linear Regression Learner)
Long Short Term Memory (LSTM) Units: LSTM units produce a hidden state by processing m x n tensors of input values, where m is the size of the input vector at any time and n is the number of past vectors. The hidden state can then be transformed into the current vector of numeric values. LSTM units are suited for time series prediction, as values from past vectors can be remembered or forgotten through a series of gates.
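A minimal LSTM forecasting sketch in Python with Keras (an assumption: TensorFlow/Keras is installed; the layer size, window length n, and synthetic series are arbitrary example choices):

    # Train a small LSTM on windows of n past values to predict the next value.
    import numpy as np
    from tensorflow import keras

    rng = np.random.default_rng(0)
    series = np.sin(np.linspace(0, 20, 300)).astype("float32")

    n, m = 10, 1                                # n past vectors, each of size m
    X = np.stack([series[i:i + n] for i in range(len(series) - n)])[..., None]
    y = series[n:]                              # the current value to predict

    model = keras.Sequential([
        keras.Input(shape=(n, m)),
        keras.layers.LSTM(16),                  # gates remember or forget past values
        keras.layers.Dense(1),                  # map the hidden state to the forecast
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=5, verbose=0)
    print(model.predict(X[-1:], verbose=0))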

UNSUPERVISED LEARNING

Unsupervised Learning: A set of machine learning algorithms to discover patterns in the data. A labeled dataset is not required, since the data are ultimately organized and/or transformed based on similarity or statistical measures.

CLUSTERING

Clustering: A branch of unsupervised learning algorithms that groups data together based on similarity measures, without the help of labels, classes, or categories.

k-Means: The n data points in the dataset are clustered into k clusters based on the shortest distance from the cluster prototypes. The cluster prototype is taken as the average data point in the cluster.

Hierarchical Clustering: Builds a hierarchy of clusters by either collecting the most similar (agglomerative approach) or separating the most dissimilar (divisive approach) data points and clusters, according to a selected distance measure. The result is a dendrogram clustering the data together bottom-up (agglomerative) or separating the data into different clusters top-down (divisive).

DBSCAN: A density-based non-parametric clustering algorithm. Data points are classified as core, density-reachable, and outlier points. Core and density-reachable points in high-density regions are clustered together, while points with no close neighbors in low-density regions are labeled as outliers.

Self-Organizing Tree Algorithm (SOTA): A special Self-Organizing Map (SOM) neural network. Its cell structure is grown using a binary tree topology. (KNIME node: SOTA Learner)

Fuzzy c-Means: One of the most widely used fuzzy clustering algorithms. It works similarly to the k-Means algorithm, but allows data points to belong to more than one cluster, with different degrees of membership. (KNIME node: Fuzzy c-Means)
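A minimal clustering sketch in Python with scikit-learn, running k-Means and DBSCAN on the same toy blobs (k, eps, and min_samples are arbitrary example choices):

    # Cluster toy data with k-Means (prototype-based) and DBSCAN (density-based).
    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_[:10])          # cluster index per data point
    print(kmeans.cluster_centers_)      # prototypes = average point per cluster

    dbscan = DBSCAN(eps=0.8, min_samples=5).fit(X)
    print(set(dbscan.labels_))          # -1 marks outlier (noise) points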

ENSEMBLE LEARNING

Ensemble Learning: A combination of multiple models from supervised learning algorithms to obtain a more stable and accurate overall model. The most commonly used ensemble techniques are Bagging and Boosting.

Custom Ensemble Model: Combines different supervised models to form a custom ensemble model. The final prediction can be based on a majority vote as well as on the average or other functions of the output results. (KNIME node: Prediction Fusion)

BAGGING

Bagging: A method for training multiple classification/regression models on different randomly drawn subsets of the training data. The final prediction is based on the predictions provided by all the models, thus reducing the chance of overfitting.

Tree Ensemble of Decision/Regression Trees: Ensemble model of multiple decision/regression trees trained on different subsets of data. Data subsets with the same number of rows or fewer, and the same number of columns or fewer, are bootstrapped from the original training set. The final prediction is based on a hard vote (majority rule) or a soft vote (averaging all probabilities or numeric predictions) over all involved trees.
Random Forest of Decision/Regression Trees: Ensemble model of multiple decision/regression trees trained on different subsets of data. Data subsets with the same number of rows are bootstrapped from the original training set. At each node, the split is performed on a subset of sqrt(x) features from the original x input features. The final prediction is based on a hard vote (majority rule) or a soft vote (averaging all probabilities or numeric predictions) over all involved trees.
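A minimal random forest sketch in Python with scikit-learn (illustrative; 100 trees and sqrt-sized feature subsets mirror the description above):

    # Bootstrapped trees with sqrt(x) candidate features per split.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(
        n_estimators=100,       # number of bootstrapped trees
        max_features="sqrt",    # sqrt(x) candidate features at each split
        random_state=0,
    ).fit(X_train, y_train)

    print(forest.predict_proba(X_test[:3]))   # soft vote: averaged probabilities
    print(forest.score(X_test, y_test))       # accuracy on held-out data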
BOOSTING

Boosting: A method for training a set of classification/regression models iteratively. At each step, a new model is trained on the prediction errors and added to the ensemble to improve the results from the previous model state, leading to higher accuracy after each iteration.
Gradient Boosted Regression Trees: Ensemble model combining multiple sequential simple regression trees into a stronger model. The algorithm builds the model stagewise. At each iteration, a simple regression tree is fitted to predict the residuals of the current model, following the gradient of the loss function. This leads to an increasingly accurate and complex overall model. The same regression trees can also be used for classification.

XGBoost: An optimized distributed library for machine learning models in the gradient boosting framework, designed to be highly efficient, flexible, and portable. It features regularization parameters to penalize complex models, effective handling of sparse data for better performance, parallel computation, and more efficient memory usage. (KNIME node: XGBoost Tree Ensemble Learner)
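A minimal gradient boosted regression trees sketch in Python with scikit-learn (illustrative; the tree count, learning rate, and depth are arbitrary example choices; the xgboost library offers an analogous XGBRegressor class):

    # Stagewise fitting of small trees to the current residuals.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    gbt = GradientBoostingRegressor(
        n_estimators=200,     # number of stagewise trees
        learning_rate=0.05,   # shrinks each tree's contribution
        max_depth=3,          # keep the individual trees simple
    ).fit(X_train, y_train)

    print(gbt.score(X_test, y_test))   # R2 on held-out data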
EVALUATION

Evaluation: Various scoring metrics for assessing model quality - in particular, a model's predictive ability or propensity to error.

Confusion Matrix: A representation of a classification task's success through the count of matches and mismatches between the actual and predicted classes, aka true positives, false negatives, false positives, and true negatives. One class is arbitrarily selected as the positive class. (KNIME node: Scorer)

Accuracy Measures: Evaluation metrics for a classification model calculated from the values in the confusion matrix, such as sensitivity and specificity, precision and recall, or overall accuracy.

Numeric Error Measures: Evaluation metrics for numeric prediction models quantifying the error size and direction. Common metrics include RMSE, MAE, or R2. Most of these metrics depend on the range of the target variable. (KNIME node: Numeric Scorer)

ROC Curve: A graphical representation of the performance of a binary classification model, with false positive rates on the x-axis and true positive rates on the y-axis. Multiple points for the curve are obtained for different classification thresholds. The area under the curve is the metric value.
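A minimal evaluation sketch in Python with scikit-learn, computing the confusion matrix, accuracy measures, and the area under the ROC curve for a toy binary classifier (illustrative only):

    # Score a binary classifier with the metrics described above.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_score = clf.predict_proba(X_test)[:, 1]     # scores for the positive class

    print(confusion_matrix(y_test, y_pred))       # TP/FN/FP/TN counts
    print(classification_report(y_test, y_pred))  # precision, recall, accuracy
    print(roc_auc_score(y_test, y_score))         # area under the ROC curve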
Cross-Validation: A model validation technique for assessing how the results of a machine learning model will generalize to an independent dataset. A model is trained and validated N times on different pairs of training set and test set, both extracted from the original dataset. Some basic statistics on the resulting N error or accuracy measures give insights on overfitting and generalization. (KNIME node: Cross Validation)
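A minimal N-fold cross-validation sketch in Python with scikit-learn (illustrative; N=5 folds, with mean and standard deviation summarizing the N accuracy values):

    # Train and validate the same model N times on different train/test pairs.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
    print(scores)                              # one accuracy value per fold
    print(np.mean(scores), np.std(scores))     # stability across the folds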
DEPLOYMENT

[Workflow diagram: a TRAINING workflow (Read Data, Transform, Learner, Predictor, Scorer) with Capture Start and Capture End nodes framing the Transform and Predictor steps; a Model Loader and a Deploy Workflow node then produce the deployment workflow (Data Input, Transform, Apply model, Data Output).]

RECOMMENDATION ENGINES

Recommendation Engines: A set of algorithms that use known information about user preferences to predict items of interest.

Association Rules: Reveals regularities in the co-occurrences of multiple products in large-scale transaction data recorded at points-of-sale. Based on the a-priori algorithm, the most frequent itemsets in the dataset are used to generate recommendation rules. (KNIME node: Association Rule Learner)

Collaborative Filtering: Based on the Alternating Least Squares (ALS) technique, it produces recommendations (filtering) about the interests of a user by comparing their current preferences with those of multiple users (collaborating). (KNIME node: Spark Collaborative Filtering Learner (MLlib))
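A minimal a-priori association rules sketch in Python (an assumption: the third-party mlxtend library is installed; the four toy transactions and the thresholds are arbitrary example choices):

    # Mine frequent itemsets with a-priori, then derive recommendation rules.
    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # One-hot encoded baskets: one row per transaction, one column per product.
    baskets = pd.DataFrame(
        [[1, 1, 0, 1], [1, 1, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]],
        columns=["bread", "milk", "butter", "jam"],
    ).astype(bool)

    frequent = apriori(baskets, min_support=0.5, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
    print(rules[["antecedents", "consequents", "support", "confidence"]])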
RESOURCES

• E-Books: Learn even more with the KNIME books. From basic concepts in "KNIME Beginner's Luck", to advanced concepts in "KNIME Advanced Luck", through to examples of real-world case studies in "Practicing Data Science". Available for purchase at knime.com/knimepress
• KNIME Blog: Engaging topics, challenges, industry news, and knowledge nuggets at knime.com/blog
• KNIME Hub: Search, share, and collaborate on KNIME workflows, nodes, and components with the entire KNIME community at hub.knime.com
• KNIME Forum: Join our global community and engage in conversations at forum.knime.com
• KNIME Server: The enterprise software for team-based collaboration, automation, management, and deployment of data science workflows as analytical applications and services. Visit www.knime.com/knime-server for more information.

KNIME PRESS

Extend your KNIME knowledge with our collection of books from KNIME Press. For beginner and advanced users, through to those interested in specialty topics such as topic detection, data blending, and classic solutions to common use cases using KNIME Analytics Platform - there's something for everyone. Available for download at www.knime.com/knimepress.


© 2021 KNIME AG. All rights reserved. The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME AG under license from KNIME GmbH, and are registered in the United States. KNIME® is also registered in Germany.
