
Monday January 25, 2021 Class: Introduction

1- Difference between supervised and unsupervised models:


2- Supervised: Have known Xs and Ys. There is some starting point.
3- Unsupervised: Are more abstract.
4- Class is designed to be half instruction and new content, half lab.
5- We use Jupyter Notebook to submit labs and other things on Canvas.
6- Midterm and Final will be on pen and paper. But if all the students cannot be in the
classroom at the same time, they will be held remotely.
7- The projects are individual. Students pick a machine learning algorithm and a model,
then apply the model to a problem. One student used COVID data to detect whether
someone is wearing a mask.
8- Do Lab 1
9- Python 3.7 Directory:

!python -m pip install xgboost


Monday February 1, 2021 Class:
Do Lab 2
Do HW1
Monday February 8, 2021 Class:
Chapter 1: Introduction to Machine Learning with Python
Important Points in chapter 1:
1- In machine learning:
a- Row = Sample = Instance = Data point
b- Column = Feature = Attribute
c- Row * Column is the shape
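A minimal sketch of this vocabulary, using NumPy on a made-up 3-sample, 2-feature array
(the numbers here are arbitrary):

import numpy as np

# 3 rows (samples / instances / data points), 2 columns (features / attributes)
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9]])

print(X.shape)  # (3, 2): rows * columns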

2- Train and Test data


a- Training data = Training set
b- Test data = Test set = Hold-out set
c- 75% Training data and 25% Test data is a good rule of thumb
d- train_test_split function
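A minimal sketch of train_test_split, assuming the iris dataset purely as a stand-in;
test_size=0.25 matches the 75%/25% rule of thumb (and is also scikit-learn's default):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Split into a 75% training set and a 25% test (hold-out) set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)  # roughly 75% / 25% of the 150 samples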

3- Input and Output


a- f(x) = y
b- x is the input
c- y is the output
d- data = input
e- target = output

4- Machine Learning Models


a- All machine learning models in scikit-learn are implemented in their own classes,
which are called Estimator classes. The k-nearest neighbors classification algorithm
is implemented in the KNeighborsClassifier class in the neighbors module.

5- Basic Scikit-Learn Methods


a- The fit, predict, and score methods are the common interface to supervised models in
scikit-learn, and with the concepts introduced in this chapter, you can apply these models
to many machine learning tasks.
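A minimal sketch of that interface, using KNeighborsClassifier from the neighbors module;
the iris data and the 3-neighbor setting here are arbitrary choices, not the book's exact
example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)  # an Estimator class
knn.fit(X_train, y_train)                  # learn from the training set
y_pred = knn.predict(X_test)               # predict labels for unseen samples
print(knn.score(X_test, y_test))           # mean accuracy on the test set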

6- Libraries to install
a- $ pip install numpy scipy matplotlib ipython scikit-learn pandas
b- mglearn can be imported as needed

GitHub Codes for the book:


https://github.com/amueller/introduction_to_ml_with_python

Do Lab 3
Do HW2
Monday February 15, 2021 Class:
Chapter 2: Supervised Learning (P.39)
Important Points in chapter 2:
1- Classification and Regression
a- 2 classes: binary classification
b- More than 2 classes: multiclass classification
c- Regression: Continuous Output

2- Generalization, Overfitting, and Underfitting


a- Overfitting: Model is too complex
b- Underfitting: Model is too simple
c- Sweet spot: tradeoff between the above 2

3- K-Nearest Neighbors Model


a- Training dataset accuracy gets better with fewer neighbors.
b- Test dataset accuracy gets better with more neighbors (at least as long as the model
doesn’t become too simple). See the sketch after this list.
c- KNeighborsClassifier is for the classification
d- KNeighborsRegressor is for the regression
e- Strengths: relatively easy to implement, easy to understand, doesn’t require a lot of work
f- Weaknesses: doesn’t perform as well on large datasets (either in number of features or in
number of samples, especially when many features are “0”), and prediction can be very slow.
g- 2 parameters: number of neighbors and how you measure distance between data points
(by default Euclidean distance is used).
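A minimal sketch of points a and b above, assuming the breast cancer dataset as a stand-in;
training accuracy tends to fall and test accuracy tends to rise (up to a point) as the
number of neighbors grows:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in [1, 3, 5, 10]:
    knn = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    print(n, knn.score(X_train, y_train), knn.score(X_test, y_test))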

4- Linear regression (aka ordinary least squares)


a- Similar training and test set scores: sign of underfitting.
b- Quite different training and test set scores: sign of overfitting.

5- Ridge regression
a- Regularization.
b- The training score is higher than the test score for all dataset sizes, for both ridge and
linear regression.
c- With enough training data, regularization becomes less important, and given enough
data, ridge and linear regression will have the same performance.
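A hedged sketch of the ridge vs. linear regression comparison, assuming the bundled
diabetes dataset and the default alpha=1.0 (exact scores depend on the data):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)  # alpha controls regularization strength

# The training score is typically higher than the test score for both models
print("OLS:  ", lr.score(X_train, y_train), lr.score(X_test, y_test))
print("Ridge:", ridge.score(X_train, y_train), ridge.score(X_test, y_test))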

6- Lasso
Comments:
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using
manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p,
minkowski_distance (l_p) is used.
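This comment appears to be quoted from the scikit-learn documentation for the p parameter
of KNeighborsClassifier (it describes the KNN distance measure, not Lasso). A minimal
sketch of setting it:

from sklearn.neighbors import KNeighborsClassifier

knn_euclidean = KNeighborsClassifier(n_neighbors=5, p=2)  # default: Euclidean (l2)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, p=1)  # Manhattan (l1)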

7- Linear models for multiclass classification (P.77 Stopped Here)


Chapter 3: Unsupervised Learning and Preprocessing (P.145)
Important Points in chapter 3:
1- Types of Unsupervised Learning (P.145)
a- Unsupervised transformations
b- Clustering algorithms
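A minimal sketch of both types, assuming synthetic blob data: scaling is one example of an
unsupervised transformation, and k-means is one example of a clustering algorithm:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

X_scaled = StandardScaler().fit_transform(X)  # unsupervised transformation
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X_scaled)  # clustering
print(labels[:10])  # cluster assignment for the first 10 samples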

Non-Negative Matrix Factorization (P.170 Stopped Here)
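A minimal sketch of NMF, assuming a small random non-negative matrix; NMF factors X
approximately into W @ H with all entries non-negative:

import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.RandomState(0).randn(6, 4))  # NMF requires non-negative data

nmf = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = nmf.fit_transform(X)  # samples expressed in the component space
H = nmf.components_       # the components themselves
print(W.shape, H.shape)   # (6, 2) and (2, 4)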


About Lab 4A: Build a wine quality prediction model using K-NN
a- RobustScaler accuracy > StandardScaler accuracy > Original data accuracy.

b- Mean of the differences between the Y actuals (quality column) and the rounded
predictions: RobustScaler < StandardScaler < original data.

c- RobustScaler is generally (though not always) better than StandardScaler.
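A hedged sketch of the scaler comparison; scikit-learn's bundled wine recognition dataset
and the 5-neighbor classifier are stand-ins here, not the lab's actual wine quality CSV or
settings:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import RobustScaler, StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, scaler in [("original", None),
                     ("StandardScaler", StandardScaler()),
                     ("RobustScaler", RobustScaler())]:
    if scaler is not None:
        X_tr = scaler.fit_transform(X_train)  # fit the scaler on training data only
        X_te = scaler.transform(X_test)       # then apply it to the test data
    else:
        X_tr, X_te = X_train, X_test
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_train)
    print(name, knn.score(X_te, y_test))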

About Lab 5: Decision Trees vs Random Forests


a- A rectangular matrix A (m × n) is tall if m > n, wide if m < n, and square if m = n.

b- Random Forest accuracy > Decision Tree accuracy

c- Random Forest error < Decision Tree error

d- Random Forest is generally (though not always) better than Decision Tree.
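A minimal sketch of the comparison, assuming the breast cancer dataset as a stand-in for
whatever data the lab used:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# The forest (an ensemble of trees) usually scores higher on the test set
print("Decision tree:", tree.score(X_test, y_test))
print("Random forest:", forest.score(X_test, y_test))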

About Machine Learning 06 - Statistics


a- Mean

b- Variance

c- Standard Deviation

d- Median

e- Mode:
The most frequently occurring value
• There may be no mode or several modes. "Multimodal" implies multiple peaks in the
histogram.
• Not affected by extreme values (outliers)

f- Percentile and Quartile:


The pth percentile: p% of the values in the data are less than or equal to this value
(0 ≤ p ≤ 100)
Quartile:
1st quartile = 25th percentile
2nd quartile = 50th percentile = median
3rd quartile = 75th percentile
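A minimal sketch of all of these statistics with NumPy, on a small made-up sample:

import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(np.mean(x))    # mean
print(np.var(x))     # variance (population; pass ddof=1 for the sample variance)
print(np.std(x))     # standard deviation = square root of the variance
print(np.median(x))  # median = 2nd quartile = 50th percentile
print(np.percentile(x, [25, 50, 75]))  # 1st, 2nd, and 3rd quartiles

# Mode (most frequent value) via unique counts:
vals, counts = np.unique(x, return_counts=True)
print(vals[np.argmax(counts)])  # 4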
About Lab 6:
a- What do you notice when you plot histograms for data of different sizes (N = 10, N
= 100, N = 1000, N = 100K)?
I notice that as the sample size increases, the histogram becomes smoother and more
closely follows the shape of the underlying distribution.

About HW6:
a- How can we introduce error if we do not sufficiently randomize our sampling?
Take the US as an example: it is a very large country with many different states and
regions. In political polling (e.g., for a presidential election), we want to make sure
that the sample represents the different characteristics of the voting population. If we
neglect how well the sample is randomized, the poll results may be very misleading for the
candidates, and the poll may even predict the wrong candidate to win the election. The
Literary Digest poll is a classic cautionary example.

b- What is the difference between variance and standard deviation?


Variance can be defined as the expectation of the squared deviation of a random variable
from its mean; in other words, it measures how far a set of numbers is spread out from
their average value. Standard deviation is the square root of the variance.

import numpy as np

# Load a comma-separated file into a NumPy array; csv is assumed to hold a file path
A = np.loadtxt(csv, delimiter=',')
