
Linear Algebra for Data Science (DataCamp)

Ch. 1 - Introduction to Linear Algebra


Motivations
[Video]

Creating Vectors in R
# Creating three 3's and four 4's, respectively
rep(3, 3)

## [1] 3 3 3

rep(4, 4)

## [1] 4 4 4 4

# Creating a vector with the first three even numbers and the first three odd numbers
seq(2, 6, by = 2)

## [1] 2 4 6

seq(1, 5, by = 2)

## [1] 1 3 5

# Re-creating the previous four vectors using the 'c' command


c(3, 3, 3)

## [1] 3 3 3

c(4, 4, 4, 4)

## [1] 4 4 4 4

c(2, 4, 6)

## [1] 2 4 6

c(1, 3, 5)

## [1] 1 3 5

The Algebra of Vectors
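
The vectors x, y, and z used below are never created in these notes. Definitions consistent with the printed output would be:

# Hypothetical definitions, inferred from the results that follow
x <- 1:7        # 1 2 3 4 5 6 7
y <- 2 * x      # 2 4 6 8 10 12 14
z <- c(1, 1, 2)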


# Add x to y and print
print(x + y)

## [1] 3 6 9 12 15 18 21

# Multiply z by 2 and print


print(2*z)

## [1] 2 2 4

# Multiply x and y by each other and print


print(x*y)

## [1] 2 8 18 32 50 72 98

# Add x to z, if possible, and print


print(x + z)


## Warning in x + z: longer object length is not a multiple of shorter object
## length

## [1] 2 3 5 5 6 8 8

Creating Matrices in R
# Create a matrix of all 1's and all 2's that are 2 by 3 and 3 by 2, respectively
matrix(1, nrow = 2, ncol = 3)

## [,1] [,2] [,3]


## [1,] 1 1 1
## [2,] 1 1 1

print(matrix(2, nrow = 3, ncol = 2))

## [,1] [,2]
## [1,] 2 2
## [2,] 2 2
## [3,] 2 2

# Create a matrix, changing the byrow designation.


B <- matrix(c(1, 2, 3, 2), nrow = 2, ncol = 2, byrow = FALSE)
B <- matrix(c(1, 2, 3, 2), nrow = 2, ncol = 2, byrow = TRUE)
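
The matrix A added below is also not created in these notes. Given the byrow = TRUE version of B, a definition consistent with the printed sum is the 2 by 2 matrix of all 1's:

# Hypothetical definition, inferred from the sum printed below
A <- matrix(1, nrow = 2, ncol = 2)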

# Add A to the previously-created matrix


A + B

## [,1] [,2]
## [1,] 2 3
## [2,] 4 3

Matrix-Vector Operations
[Video]

Matrix-Vector Compatibility
Consider the matrix A created by the R code:

A = matrix(c(1, 2, 3, -1, 0, 3), nrow = 2, ncol = 3, byrow = TRUE)

Which of the following vectors b can be multiplied by A to create Ab?

[*] b = c(1, 1, -1)


b = c(-2, 2)
b = c(2, -1, 3, 4, 7)
b = c(-1, 2, 1, 3)
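
The rule: Ab is defined only when the length of b equals the number of columns of A. A quick console check of the compatible choice:

A <- matrix(c(1, 2, 3, -1, 0, 3), nrow = 2, ncol = 3, byrow = TRUE)
b <- c(1, 1, -1)   # length 3 matches the 3 columns of A
A %*% b            # a 2 x 1 result: (0, -4)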

Matrix Multiplication as a Transformation


# Multiply A by b
A%*%b

## [,1]
## [1,] 4
## [2,] 1

# Multiply B by b
B%*%b

## [,1]
## [1,] 0.000000
## [2,] 1.666667

Reflections
# Multiply A by b
A%*%b

## [,1]
## [1,] -2
## [2,] 1


# Multiply B by b
B%*%b

## [,1]
## [1,] 2
## [2,] -1

# Multiply C by b
C%*%b

## [,1]
## [1,] -8
## [2,] -2
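
The matrices A, B, and C (and the vector b) are not shown being created. One set of definitions consistent with the printed results, assuming b = c(2, 1), is:

# Hypothetical definitions, consistent with the output above
b <- c(2, 1)
A <- diag(c(-1, 1))    # reflection across the y-axis: (2, 1) -> (-2, 1)
B <- diag(c(1, -1))    # reflection across the x-axis: (2, 1) -> (2, -1)
C <- diag(c(-4, -2))   # reflection through the origin plus scaling: (2, 1) -> (-8, -2)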

Matrix-Matrix Calculations
[Video]

Matrix Multiplication Compatibility


The two matrices generated by the R code below are (small) examples of the weight matrices used in neural network models to weigh datasets for prediction:

A = matrix(c(1, 3, 2, -1, 0, 1), nrow = 2, ncol = 3)

B = matrix(c(-1, 1, 2, -3), nrow = 2, ncol = 2)

Oftentimes these collections of weights are applied iteratively, using successive applications of matrix multiplication.

Are A and B compatible in any way in terms of matrix multiplication? Use A%*%B and B%*%A in the console to check. What are the dimensions of
the resulting matrix?

No, these matrices are not compatible.


[*] Yes, the multiplication BA results in a 2 by 3 matrix.
Yes, the multiplication AB results in a 2 by 3 matrix.
Yes, the multiplication BA results in a 3 by 2 matrix.
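
Checking in the console confirms that only one order is conformable:

A <- matrix(c(1, 3, 2, -1, 0, 1), nrow = 2, ncol = 3)
B <- matrix(c(-1, 1, 2, -3), nrow = 2, ncol = 2)
dim(B %*% A)   # 2 3: a (2 x 2) times a (2 x 3) gives a (2 x 3) matrix
# A %*% B fails: A has 3 columns but B has only 2 rows (non-conformable arguments)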

Matrix Multiplication - Order Matters


# Multiply A by B
A%*%B

## [,1] [,2]
## [1,] 0.7071068 0.7071068
## [2,] 0.7071068 -0.7071068

# Multiply A on the right of B


B%*%A

## [,1] [,2]
## [1,] 0.7071068 -0.7071068
## [2,] -0.7071068 -0.7071068

# Multiply the product of A and B by the vector b


A%*%B%*%b

## [,1]
## [1,] 1.414214
## [2,] 0.000000

# Multiply A on the right of B, and then by the vector b


B%*%A%*%b

## [,1]
## [1,] 0.000000
## [2,] -1.414214
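
A and B are not defined in these notes, but every entry above is ±1/√2 ≈ ±0.7071068, consistent with A being a 45-degree rotation and B a reflection across the x-axis, applied to b = c(1, 1):

# Hypothetical definitions, consistent with the output above
theta <- pi / 4
A <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)  # rotate by 45 degrees
B <- matrix(c(1, 0, 0, -1), nrow = 2)                                      # reflect across the x-axis
b <- c(1, 1)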

Intro to The Matrix Inverse


# Take the inverse of the 2 by 2 identity matrix
solve(diag(2))

## [,1] [,2]
## [1,] 1 0
## [2,] 0 1


# Take the inverse of the matrix A


Ainv <- solve(A)

# Multiply A inverse by A
Ainv%*%A

## [,1] [,2]
## [1,] 1 0
## [2,] 0 1

# Multiply A by its inverse


A%*%Ainv

## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
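
The matrix A inverted above is likewise not shown being created; any invertible square matrix behaves the same way. A minimal sketch with a matrix chosen purely for illustration:

A <- matrix(c(2, 1, 1, 3), nrow = 2)  # invertible: its determinant is 5
Ainv <- solve(A)
round(Ainv %*% A)                     # the identity, up to floating-point rounding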

Ch. 2 - Matrix-Vector Equations


Motivation for Solving Matrix-Vector Equations
[Video]

The Meaning of Ax = b
A great deal of applied mathematics and statistics, as well as data science, ends in a matrix-vector equation of the form:

Ax = b

Which of the following is the most correct way to describe what solving this equation for x is trying to accomplish?

Finding the vector x that, upon some mysterious transformation, makes b.


Finding the vector x that is a linear combination of the elements of b.
[*] To produce b using a linear combination of the columns of A.
To produce b using a linear combination of the rows of A.
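
In other words, Ax is exactly the linear combination x1*a1 + x2*a2 + ... of the columns of A. A small check:

A <- matrix(c(1, 2, 3, 4), nrow = 2)  # columns a1 = (1, 2), a2 = (3, 4)
x <- c(2, -1)
A %*% x                # (-1, 0)
2*A[, 1] - 1*A[, 2]    # the same linear combination of the columns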

Exploring WNBA Data


# Print the Massey Matrix M
print(M)

## Atlanta Chicago Connecticut Dallas Indiana Los.Angeles Minnesota New.York


## 1 33 -4 -2 -3 -3 -3 -3 -3
## 2 -4 33 -3 -3 -3 -3 -2 -3
## 3 -2 -3 34 -3 -3 -3 -3 -4
## 4 -3 -3 -3 34 -3 -4 -3 -3
## 5 -3 -3 -3 -3 33 -3 -3 -3
## 6 -3 -3 -3 -4 -3 41 -8 -3
## 7 -3 -2 -3 -3 -3 -8 41 -3
## 8 -3 -3 -4 -3 -3 -3 -3 34
## 9 -3 -3 -4 -2 -3 -6 -4 -3
## 10 -3 -3 -3 -3 -3 -3 -3 -2
## 11 -3 -3 -3 -3 -2 -2 -3 -3
## 12 -3 -3 -3 -4 -4 -3 -6 -4
## Phoenix San.Antonio Seattle Washington
## 1 -3 -3 -3 -3
## 2 -3 -3 -3 -3
## 3 -4 -3 -3 -3
## 4 -2 -3 -3 -4
## 5 -3 -3 -2 -4
## 6 -6 -3 -2 -3
## 7 -4 -3 -3 -6
## 8 -3 -2 -3 -4
## 9 38 -3 -4 -3
## 10 -3 32 -4 -2
## 11 -4 -4 33 -3
## 12 -3 -2 -3 38

# Print the vector of point differentials f


print(f)


## Differential
## 1 -135
## 2 -171
## 3 152
## 4 -104
## 5 -308
## 6 292
## 7 420
## 8 83
## 9 -4
## 10 -213
## 11 -5
## 12 -7
## 13 0

# Find the sum of the first column of M


sum(M[, 1])

## [1] 0

# Find the sum of the vector f


sum(f)

## [1] 0

Matrix-Vector Equations - Some Theory


[Video]

Why is a Matrix Not Invertible?


For our WNBA Massey Matrix model, some adjustments need to be made for a solution to our rating problem to exist and be unique.

To see this, notice that the following code produces an error:

> print(M)
(the 12-by-12 Massey matrix shown above)

> solve(M)
Error in solve.default(M) : system is computationally singular:
  reciprocal condition number = 3.06615e-17

Which of the conditions does M explicitly violate in this case?

M is not a square matrix.


The determinant of M is zero.
The sum of each of the columns and rows is equal to zero.
[*] M does not have an inverse.

Understanding a Linear System’s Three Outcomes


In two dimensions, the solution structure of a system of two equations in two unknowns can be understood in a straightforward way via pictures,
with the two equations representing lines (this is why it’s called linear algebra) in the x-y (or x1-x2) plane. A solution is any point (x, y)
(or (x1, x2)) where the two lines intersect.

Which of the following three graphs is that of a linear system of two equations with two unknowns that has no solutions?

The first graph.


[*] The second graph.
The third graph.

Understanding the Massey Matrix


For our WNBA Massey Matrix model, some adjustments need to be made for a solution to our rating problem to exist and be unique.

This is because the matrix M (the 12-by-12 Massey matrix printed above) usually does not (computationally) have an inverse, as shown by the
error produced from running solve(M) in a previous exercise.

One way we can change this is to add a row of 1's on the bottom of the matrix M, a column of -1's to the far right of M, and a 0 to the bottom of
the vector of point differentials f.

What does that row of 1's represent in the setting of rating teams? In other words, what does the final equation stipulate?

Each team gets an equal rating.


[*] The ratings for the entire league add to zero.
The sum of the ratings for the entire league is positive.
The sum of the ratings for the entire league is negative.


Adjusting the Massey Matrix


# Add a row of 1's
M_2 <- rbind(M, rep(1, 12))

# Add a column of -1's


M_3 <- cbind(M_2, rep(-1, 13))

# Change the element in the lower-right corner of the matrix


M_3[13, 13] <- 1

# Print M_3
print(M_3)

## Atlanta Chicago Connecticut Dallas Indiana Los.Angeles Minnesota New.York


## 1 33 -4 -2 -3 -3 -3 -3 -3
## 2 -4 33 -3 -3 -3 -3 -2 -3
## 3 -2 -3 34 -3 -3 -3 -3 -4
## 4 -3 -3 -3 34 -3 -4 -3 -3
## 5 -3 -3 -3 -3 33 -3 -3 -3
## 6 -3 -3 -3 -4 -3 41 -8 -3
## 7 -3 -2 -3 -3 -3 -8 41 -3
## 8 -3 -3 -4 -3 -3 -3 -3 34
## 9 -3 -3 -4 -2 -3 -6 -4 -3
## 10 -3 -3 -3 -3 -3 -3 -3 -2
## 11 -3 -3 -3 -3 -2 -2 -3 -3
## 12 -3 -3 -3 -4 -4 -3 -6 -4
## 13 1 1 1 1 1 1 1 1
## Phoenix San.Antonio Seattle Washington rep(-1, 13)
## 1 -3 -3 -3 -3 -1
## 2 -3 -3 -3 -3 -1
## 3 -4 -3 -3 -3 -1
## 4 -2 -3 -3 -4 -1
## 5 -3 -3 -2 -4 -1
## 6 -6 -3 -2 -3 -1
## 7 -4 -3 -3 -6 -1
## 8 -3 -2 -3 -4 -1
## 9 38 -3 -4 -3 -1
## 10 -3 32 -4 -2 -1
## 11 -4 -4 33 -3 -1
## 12 -3 -2 -3 38 -1
## 13 1 1 1 1 1

Inverting the Massey Matrix


# Find the inverse of M
solve(M)


## [,1] [,2] [,3] [,4] [,5]


## Atlanta 0.032449804 0.005402927 0.003876665 0.004630004 0.004629590
## Chicago 0.005402927 0.032446789 0.004608094 0.004626913 0.004628272
## Connecticut 0.003876665 0.004608094 0.031714805 0.004613451 0.004629714
## Dallas 0.004630004 0.004626913 0.004613451 0.031707219 0.004649172
## Indiana 0.004629590 0.004628272 0.004629714 0.004649172 0.032447936
## Los.Angeles 0.004626242 0.004554829 0.004676789 0.005214940 0.004652111
## Minnesota 0.004611109 0.003985203 0.004651940 0.004727810 0.004678479
## New.York 0.004609212 0.004627729 0.005362761 0.004647832 0.004649262
## Phoenix 0.004610546 0.004608018 0.005295038 0.004013187 0.004613089
## San.Antonio 0.004630254 0.004631081 0.004608596 0.004609009 0.004587382
## Seattle 0.004629212 0.004631185 0.004646217 0.004595132 0.003854641
## Washington 0.004627769 0.004582295 0.004649264 0.005298666 0.005313685
## rep(-1, 13) -0.083333333 -0.083333333 -0.083333333 -0.083333333 -0.083333333
## [,6] [,7] [,8] [,9] [,10]
## Atlanta 0.004626242 0.004611109 0.004609212 0.004610546 0.004630254
## Chicago 0.004554829 0.003985203 0.004627729 0.004608018 0.004631081
## Connecticut 0.004676789 0.004651940 0.005362761 0.005295038 0.004608596
## Dallas 0.005214940 0.004727810 0.004647832 0.004013187 0.004609009
## Indiana 0.004652111 0.004678479 0.004649262 0.004613089 0.004587382
## Los.Angeles 0.027807608 0.007319076 0.004637275 0.006363490 0.004606288
## Minnesota 0.007319076 0.027810474 0.004677632 0.005388578 0.004578013
## New.York 0.004637275 0.004677632 0.031716432 0.004648253 0.003835528
## Phoenix 0.006363490 0.005388578 0.004648253 0.029212019 0.004646110
## San.Antonio 0.004606288 0.004578013 0.003835528 0.004646110 0.033267202
## Seattle 0.004032687 0.004573214 0.004607331 0.005265228 0.005427397
## Washington 0.004841998 0.006331805 0.005314087 0.004669776 0.003906474
## rep(-1, 13) -0.083333333 -0.083333333 -0.083333333 -0.083333333 -0.083333333
## [,11] [,12] [,13]
## Atlanta 0.004629212 0.004627769 8.333333e-02
## Chicago 0.004631185 0.004582295 8.333333e-02
## Connecticut 0.004646217 0.004649264 8.333333e-02
## Dallas 0.004595132 0.005298666 8.333333e-02
## Indiana 0.003854641 0.005313685 8.333333e-02
## Los.Angeles 0.004032687 0.004841998 8.333333e-02
## Minnesota 0.004573214 0.006331805 8.333333e-02
## New.York 0.004607331 0.005314087 8.333333e-02
## Phoenix 0.005265228 0.004669776 8.333333e-02
## San.Antonio 0.005427397 0.003906474 8.333333e-02
## Seattle 0.032485332 0.004585756 8.333333e-02
## Washington 0.004585756 0.029211757 8.333333e-02
## rep(-1, 13) -0.083333333 -0.083333333 2.220446e-16

Solving Matrix-Vector Equations


[Video]

An Analogy with Regular Algebra


As we saw in the video, solving matrix-vector equations is as simple as multiplying both sides of the equation by A’s inverse, A⁻¹, should it exist.
The analogy with solving linear equations like 5x = 7 is a good one.

If A⁻¹ doesn’t exist, this does not work. The equivalent analogy for linear equations would be a situation in which the coefficient in front of the x
were 0, which is the only real number that does not have an inverse. Which of the following does NOT analogize in this situation?

Dividing by zero is illegal, and is analogous to trying to invert a matrix with a zero determinant.
[*] All of the elements of a matrix must be zero for it to fail to have an inverse.
The equation 0x=b has zero solutions (if b≠0).
The equation 0x=b has infinitely many solutions (if b=0).
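
To make the analogy concrete, a short sketch (with an invertible matrix chosen purely for illustration):

5^-1 * 7              # solving 5x = 7 by multiplying by the inverse of 5
A <- matrix(c(2, 0, 0, 4), nrow = 2)
b <- c(6, 8)
solve(A) %*% b        # x = A^-1 b
solve(A, b)           # equivalent, and numerically preferable: solve Ax = b directly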

2017 WNBA Ratings!


# Solve for r and rename column
r <- solve(M)%*%f
colnames(r) <- "Rating"

# Print r
print(r)

## Rating
## Atlanta -4.012938e+00
## Chicago -5.156260e+00
## Connecticut 4.309525e+00
## Dallas -2.608129e+00
## Indiana -8.532958e+00
## Los.Angeles 7.850327e+00
## Minnesota 1.061241e+01
## New.York 2.541565e+00
## Phoenix 8.979110e-01
## San.Antonio -6.181574e+00
## Seattle -2.666953e-01
## Washington 5.468121e-01
## WNBA 1.043610e-14


Who Was the Champion?


The dplyr package has been loaded for you, as has the solution to the previous exercise. The arrange() function in dplyr allows you to
re-order rows based on the values of a column.

In the previous exercise, you rated the teams at the end of the 2017 WNBA season using the solution to a matrix-vector equation.

Using the syntax

arrange(r, -Rating)

we can see which team was the best in the WNBA in 2017 (the negative (“-”) sign in front of the ordering variable (“Rating”) puts the values
in descending order, as opposed to the ascending order produced by just “Rating”).

Which team was the best?

# arrange(r, -Rating)

San Antonio
[*] Minnesota
Los Angeles
Phoenix
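
Note that arrange() expects a data frame, while the r created above is a matrix. One way to run the sort (an assumption about how the exercise environment was set up) is to convert first:

library(dplyr)
r_df <- data.frame(Team = rownames(r), Rating = r[, "Rating"])  # matrix -> data frame
arrange(r_df, -Rating)  # Minnesota tops the list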

Other Considerations for Matrix-Vector Equations


[Video]

Other Methods for Matrix-Vector Equations


Which of the following was NOT proposed as a method to solve matrix-vector equations with non-square matrices?

[*] Euler’s method


Least squares
Singular Value Decomposition
Row reduction

Alternatives to the Regular Matrix Inverse


# Print M
print(M)

## Atlanta Chicago Connecticut Dallas Indiana Los.Angeles Minnesota New.York


## [1,] 33 -4 -2 -3 -3 -3 -3 -3
## [2,] -4 33 -3 -3 -3 -3 -2 -3
## [3,] -2 -3 34 -3 -3 -3 -3 -4
## [4,] -3 -3 -3 34 -3 -4 -3 -3
## [5,] -3 -3 -3 -3 33 -3 -3 -3
## [6,] -3 -3 -3 -4 -3 41 -8 -3
## [7,] -3 -2 -3 -3 -3 -8 41 -3
## [8,] -3 -3 -4 -3 -3 -3 -3 34
## [9,] -3 -3 -4 -2 -3 -6 -4 -3
## [10,] -3 -3 -3 -3 -3 -3 -3 -2
## [11,] -3 -3 -3 -3 -2 -2 -3 -3
## [12,] -3 -3 -3 -4 -4 -3 -6 -4
## [13,] 1 1 1 1 1 1 1 1
## Phoenix San.Antonio Seattle Washington WNBA
## [1,] -3 -3 -3 -3 -1
## [2,] -3 -3 -3 -3 -1
## [3,] -4 -3 -3 -3 -1
## [4,] -2 -3 -3 -4 -1
## [5,] -3 -3 -2 -4 -1
## [6,] -6 -3 -2 -3 -1
## [7,] -4 -3 -3 -6 -1
## [8,] -3 -2 -3 -4 -1
## [9,] 38 -3 -4 -3 -1
## [10,] -3 32 -4 -2 -1
## [11,] -4 -4 33 -3 -1
## [12,] -3 -2 -3 38 -1
## [13,] 1 1 1 1 1

# Find the rating vector the conventional way


r <- solve(M)%*%f
colnames(r) <- "Rating"
print(r)


## Rating
## Atlanta -4.012938e+00
## Chicago -5.156260e+00
## Connecticut 4.309525e+00
## Dallas -2.608129e+00
## Indiana -8.532958e+00
## Los.Angeles 7.850327e+00
## Minnesota 1.061241e+01
## New.York 2.541565e+00
## Phoenix 8.979110e-01
## San.Antonio -6.181574e+00
## Seattle -2.666953e-01
## Washington 5.468121e-01
## WNBA 1.043610e-14

# Find the rating vector using ginv


library(MASS) # ginv() is the Moore-Penrose generalized inverse from the MASS package
r <- ginv(M)%*%f
colnames(r) <- "Rating"
print(r)

## Rating
## [1,] -4.012938e+00
## [2,] -5.156260e+00
## [3,] 4.309525e+00
## [4,] -2.608129e+00
## [5,] -8.532958e+00
## [6,] 7.850327e+00
## [7,] 1.061241e+01
## [8,] 2.541565e+00
## [9,] 8.979110e-01
## [10,] -6.181574e+00
## [11,] -2.666953e-01
## [12,] 5.468121e-01
## [13,] 5.773160e-14

Ch. 3 - Eigenvalues and Eigenvectors


Intro to Eigenvalues and Eigenvectors
[Video]

Matrix-Vector Multiplications

Rotations
Reflections
Dilations
Contractions
Projections
Every imaginable combination of these

Scalar Multiplication

A scalar c times a vector x. Notation: cx

Interpreting Scalar Multiplication


Scaling Different Axes
Definition of Eigenvalues and Eigenvectors
Why “Eigen”?
Finding Eigenvalues in R
Scalar Multiplies of Eigenvectors are Eigenvectors
Computing Eigenvalues and Eigenvectors in R
How Many Eigenvalues?
Verifying the Math on Eigenvalues
Computing Eigenvectors in R
Some More on Eigenvalues and Eigenvectors


Eigenvalue Ordering
Markov Models for Allele Frequencies
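
These subsections survive only as headings in these notes. A minimal sketch of the computations they cover, using eigen() on a matrix chosen purely for illustration:

A <- matrix(c(2, 1, 1, 2), nrow = 2)
e <- eigen(A)
e$values                      # eigenvalues, ordered largest first: 3 1
e$vectors                     # columns are the corresponding (unit-length) eigenvectors
# Verify A v = lambda v for the leading pair
A %*% e$vectors[, 1]
e$values[1] * e$vectors[, 1]
# Any scalar multiple of an eigenvector is again an eigenvector
A %*% (5 * e$vectors[, 1])    # equals lambda times the scaled vector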

Ch. 4 - Principal Component Analysis


Intro to the Idea of PCA
[Video]

What Does “Big Data” Mean?


In data science, what is the word “big” in the term “big data” generally referring to?

[*] The number of rows and the number of columns.


The number of rows.
The number of columns.
The number of rows or the number of columns.

Finding Redundancies
# Print the first 6 observations of the dataset
head(combine)

## player position school year height weight forty vertical


## 1 Jaire Alexander CB Louisville 2018 71 192 4.38 35.0
## 2 Brian Allen C Michigan St. 2018 73 298 5.34 26.5
## 3 Mark Andrews TE Oklahoma 2018 77 256 4.67 31.0
## 4 Troy Apke S Penn St. 2018 74 198 4.34 41.0
## 5 Dorance Armstrong EDGE Kansas 2018 76 257 4.87 30.0
## 6 Ade Aruna DE Tulane 2018 78 262 4.60 38.5
## bench broad_jump three_cone shuttle
## 1 14 127 6.71 3.98
## 2 27 99 7.81 4.71
## 3 17 113 7.34 4.38
## 4 16 131 6.56 4.03
## 5 20 118 7.12 4.23
## 6 18 128 7.53 4.48
## drafted
## 1 Green Bay Packers / 1st / 18th pick / 2018
## 2 Los Angeles Rams / 4th / 111th pick / 2018
## 3 Baltimore Ravens / 3rd / 86th pick / 2018
## 4
## 5
## 6 Minnesota Vikings / 6th / 218th pick / 2018

# Find the correlation between variables forty and three_cone


cor(combine$forty, combine$three_cone)

## [1] 0.8315171

# Find the correlation between variables vertical and broad_jump


cor(combine$vertical, combine$broad_jump)

## [1] 0.8163375

Given the results of the previous parts of the exercise, what can you say about the dataset combine at this point?

We have yet to find any redundancy in the dataset.


forty and three_cone are the only redundant variables we’ve found so far.
vertical and broad_jump are the only redundant variables we’ve found so far.
[*] There are at least two sets of redundant variables in this dataset.

The Linear Algebra Behind PCA


[Video]

Covariance Explored
If the covariance between two columns of a matrix is positive and large, what can we say?

The variables are not related.


When one of the variables goes up, the other goes down.
[*] When one of the variables goes up, the other goes up as well.
The variables are related, but we don’t know how.

Standardizing Your Data



# Extract columns 5-12 of combine


A <- combine[, 5:12]

# Make A into a matrix


A <- as.matrix(A)

# Subtract the mean of each column


A[, 1] <- A[, 1] - mean(A[, 1])
A[, 2] <- A[, 2] - mean(A[, 2])
A[, 3] <- A[, 3] - mean(A[, 3])
A[, 4] <- A[, 4] - mean(A[, 4])
A[, 5] <- A[, 5] - mean(A[, 5])
A[, 6] <- A[, 6] - mean(A[, 6])
A[, 7] <- A[, 7] - mean(A[, 7])
A[, 8] <- A[, 8] - mean(A[, 8])
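
The eight subtractions above can be collapsed into a single idiomatic call that centers every column at once:

# Either line centers all columns in one step
A <- sweep(A, 2, colMeans(A))
# A <- scale(A, center = TRUE, scale = FALSE)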

Variance/Covariance Calculations
# Create matrix B from equation in instructions
B <- t(A)%*%A/(nrow(A) - 1)

# Compare 1st element of the 1st column of B to the variance of the first column of A
B[1,1]

## [1] 7.159794

var(A[, 1])

## [1] 7.159794

# Compare the 1st element of the 2nd column of B to the 1st element of the 2nd row of B, and to the covariance between the first two columns of A
B[1, 2]

## [1] 90.78808

B[2, 1]

## [1] 90.78808

cov(A[, 1], A[, 2])

## [1] 90.78808

Eigenanalyses of Combine Data


# Find eigenvalues of B
V <- eigen(B)

# Print eigenvalues
V$values

## [1] 2.187628e+03 4.403246e+01 2.219205e+01 5.267129e+00 2.699702e+00


## [6] 6.317016e-02 1.480866e-02 1.307283e-02

Where’s the Variance?


The eigenvalues of B are, rounded to four digits,

2187.6283 44.0325 22.1921 5.2671 2.6997 0.0632 0.0148 0.0131

Roughly how much of the variability in the dataset can be explained by the first principal component?

About 15 percent.
About 50 percent.
About 75 percent.
[*] About 95 percent.
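
The answer follows from dividing the leading eigenvalue by the sum of all of them:

V$values[1] / sum(V$values)   # roughly 0.967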

Performing PCA in R
[Video]

Scaling Data Before PCA


# Scale columns 5-12 of combine


B <- scale(combine[, 5:12])

# Print the first 6 rows of the data


head(B)

## height weight forty vertical bench broad_jump


## [1,] -1.11844839 -1.30960025 -1.3435337 0.5624657 -1.1089286 1.45502476
## [2,] -0.37100257 1.00066356 1.6449741 -1.4281627 0.9238361 -1.49512459
## [3,] 1.12388907 0.08527601 -0.4407553 -0.3743006 -0.6398290 -0.02004991
## [4,] 0.00272034 -1.17883060 -1.4680548 1.9676151 -0.7961955 1.87647467
## [5,] 0.75016616 0.10707096 0.1818505 -0.6084922 -0.1707295 0.50676247
## [6,] 1.49761199 0.21604566 -0.6586673 1.3821362 -0.4834625 1.56038724
## three_cone shuttle
## [1,] -1.38083506 -1.5879750
## [2,] 1.16888714 1.1170258
## [3,] 0.07946038 -0.1057828
## [4,] -1.72852445 -1.4027010
## [5,] -0.43048406 -0.6616049
## [6,] 0.51986694 0.2647653

# Summarize the principal component analysis


summary(prcomp(B))

## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3679 0.9228 0.78904 0.61348 0.46811 0.37178 0.34834
## Proportion of Variance 0.7009 0.1064 0.07782 0.04704 0.02739 0.01728 0.01517
## Cumulative Proportion 0.7009 0.8073 0.88514 0.93218 0.95957 0.97685 0.99202
## PC8
## Standard deviation 0.25266
## Proportion of Variance 0.00798
## Cumulative Proportion 1.00000

Summarizing PCA in R
# Subset combine only to "WR"
combine_WR <- subset(combine, position == "WR")

# Scale columns 5-12 of combine_WR


B <- scale(combine_WR[, 5:12])

# Print the first 6 rows of the data


head(B)

## height weight forty vertical bench broad_jump


## 7 1.4022982 0.88324903 1.20674474 -0.3430843 -0.3223377 0.07414249
## 17 0.5575402 -0.09700717 -0.80129388 -0.4969965 -0.7938424 -0.95388361
## 18 0.9799192 1.58343202 0.88968601 1.0421255 0.8564239 1.61618163
## 25 0.9799192 1.16332222 1.41811723 -1.5743819 -0.7938424 -1.29655897
## 29 -1.1319757 -1.56739147 -0.80129388 -0.1891721 -0.0865854 -1.29655897
## 46 0.1351613 0.11304773 0.04419607 0.2725645 -1.0295947 0.24548017
## three_cone shuttle
## 7 0.712845019 0.02833449
## 17 -1.098542478 0.84141123
## 18 -1.853287268 -1.46230619
## 25 -1.148858797 0.50262926
## 29 0.008416548 -0.64922946
## 46 0.109049187 0.84141123

# Summarize the principal component analysis


summary(prcomp(B))

## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.5425 1.4255 1.0509 0.9603 0.77542 0.63867 0.59792
## Proportion of Variance 0.2974 0.2540 0.1380 0.1153 0.07516 0.05099 0.04469
## Cumulative Proportion 0.2974 0.5514 0.6894 0.8047 0.87987 0.93085 0.97554
## PC8
## Standard deviation 0.44235
## Proportion of Variance 0.02446
## Cumulative Proportion 1.00000

Does Subsetting Change Things?


In the last exercise, you looked at the PCA analysis of just the wide receivers in the NFL combine data. The summaries of the PCA analysis for
the whole combine dataset and the wide receiver subset are loaded as pca_summary and pca_summary_wr, respectively.

What is true about this data in relation to the dataset as a whole?

With less data, the first PC of the subset data explains more of the variability in the dataset.
The first PC explains similar amounts of variability for both datasets.
[*] It takes the first 3 PCs of the subset data to explain the same amount of variability as the first PC of the whole dataset.

Wrap-Up
[Video]

About Michael Mallari


Michael is a hybrid thinker and doer (https://www.michaelmallari.com/)—a byproduct of being a StrengthsFinder “Learner”
(https://news.gallup.com/businessjournal/694/learner.aspx) over time. With 20+ years of engineering, design, and product experience, he helps
organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with
business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-
profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He
is also a candidate to receive his MS in Applied Analytics from Columbia University.

LinkedIn (https://www.linkedin.com/in/mmallari/) | Twitter (https://twitter.com/MichaelMallari) | www.michaelmallari.com/data


(https://www.michaelmallari.com/data/) | www.columbia.edu/~mm5470 (http://www.columbia.edu/~mm5470/)
