https://quizlet.com/545986967/bida-630-data-analytics-flash-cards/
1. Identify whether the task required is supervised or unsupervised learning: Deciding whether to issue a loan to an
applicant based on demographic and financial data (with reference to a database of similar data on prior customers).
a. This is unsupervised learning
b. This is supervised learning (This is supervised learning, because the database includes whether the loan was
approved or not.)
2. Identify whether the task required is supervised or unsupervised learning: Printing of custom discount coupons at the
conclusion of a grocery store checkout based on what you just bought and what others have bought previously.
a. This is unsupervised learning (This is unsupervised learning, if we assume that we do not know what will be purchased in the future.)
3. Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and
nonbankrupt firms.
a. This is supervised learning (This is supervised learning, because the status of the similar firms is known.)
b. This is unsupervised learning
4. --------------------- is used for assessing the performance of the final chosen model on new data.
a. The test data partition
b. The validation partition
If the test data were used for these purposes, they would play a role in building or selecting the best model, and would no longer provide an unbiased
assessment of the chosen model's performance with completely new data.
5. _____________ of data is used to assess the performance of each supervised learning model so that we can compare
models and pick the best one.
a. The test partition
b. The validation partition
The validation partition is used to assess the performance of each supervised learning model so that we can compare models and pick the best one. In
some algorithms (e.g., classification and regression trees, k-nearest neighbors) the validation partition may be used in automated fashion to tune and
improve the model. This means that the validation data are actually used to help build the model.
6. The test data are used to build models, or to further tweak the model or improve its fit.
a. True
b. FALSE
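The three partitions discussed in questions 4 to 6 can be illustrated with a minimal sketch. The function name, the 50/30/20 proportions, and the seed are assumptions chosen for illustration; they do not come from the course material.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Shuffle records and split them into training, validation, and test sets.

    The fractions are illustrative assumptions; whatever is left after the
    training and validation slices becomes the test partition.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]                   # used to fit each candidate model
    valid = shuffled[n_train:n_train + n_valid]  # used to compare and tune models
    test = shuffled[n_train + n_valid:]          # touched only once, for the final model
    return train, valid, test

train, valid, test = partition(range(100))
```

Because the test slice plays no part in fitting or selecting models, its error estimate is not biased by those choices.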
7. When a model is fit to training data, zero error with those data is not necessarily good. This special case is called ______
a. Overfitting
(Overfitting occurs when the model captures not only the generalizable pattern in the data, but also the error. When we split the data into
training and validation sets, we assume that the same pattern (if there is a pattern) exists in both, and that they differ only in the error that they
contain. An absurd and false model may fit perfectly (on the training data set) if the model has enough complexity. Therefore, we may get zero error
for such a model using the training dataset. Such a model, however, is not likely to give useful results on the validation data set.)
b. Underestimating
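The overfitting explanation in question 7 can be sketched with a small made-up data set. The numbers below are invented for illustration (pattern y = 2x plus alternating noise), and the interpolating polynomial stands in for "a model with enough complexity"; neither comes from the course material.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate at x the polynomial that passes through every (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

def fit_line(xs, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical data: y = 2x with alternating noise of +0.5 / -0.5.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.5, 1.5, 4.5, 5.5, 8.5, 9.5]
valid_x = [0.5, 4.5]
valid_y = [1.2, 8.9]

intercept, slope = fit_line(train_x, train_y)

# The degree-5 polynomial reproduces every training point exactly.
poly_train_mse = mse(train_y, [lagrange_predict(train_x, train_y, x) for x in train_x])
poly_valid_mse = mse(valid_y, [lagrange_predict(train_x, train_y, x) for x in valid_x])

# The straight line misses the training noise but follows the pattern.
line_train_mse = mse(train_y, [intercept + slope * x for x in train_x])
line_valid_mse = mse(valid_y, [intercept + slope * x for x in valid_x])
```

On this data the polynomial has zero training error yet a larger validation error than the line, which is the behavior the explanation describes.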
8. Bar charts are useful for comparing a single statistic (e.g. average, count, percentage) across groups. The height of the
bar represents the value of the statistic, and different bars correspond to different groups.
a. TRUE
b. FALSE
9. Which of the following are the most popular visualization tools in JMP Pro? (3 correct answers)
This study source was downloaded by 100000766496553 from CourseHero.com on 10-10-2022 02:10:08 GMT -05:00
a. Distribution
b. Graph Builder
c. Fit Y by X
d. Graphic wizard
e. Data visualize
10. Scatter plots play an important role in prediction. The next step can be developing a model. Scatter plots provide information
about relationships (linear or non-linear) between variables. The variables in a scatter plot ________.
a. Must be numerical
b. can be nominal
c. can be both numerical and categorical
d. must be ordinal
11. In a box plot, the box includes 50% of the data, the horizontal line represents (i)____________, the top and bottom of the
box represent (ii)________, respectively.
a. (i) the median (50th percentile), (ii) 75th and 25th percentiles
b. (i) the mean, (ii) 75th and 25th percentiles
c. (i) the median (50th percentile), (ii) bounds for outliers
d. (i) the mean, (ii) 10th and 90th percentiles
12. In JMP a diamond is displayed in the box, where the center of the diamond is _________.
a. The halfway between outliers
b. The median
c. The mean
d. The skewness value
13. The density ellipsoid in a scatterplot matrix is a good graphical indicator of the correlation between two variables. The
ellipsoid collapses diagonally as the correlation between the two variables approaches either 1 or –1. The ellipsoid is
more circular if the two variables are more correlated. (TRUE or FALSE?)
a. TRUE
b. FALSE (The ellipsoid is more circular (less diagonally oriented) if the two variables are less correlated.)
15. To obtain an honest estimate of future classification error, we use the classification matrix that is computed from
________.
a. training data
b. test data
c. validation data
16. The classification matrix, also called confusion matrix, gives estimates of the true classification and misclassification
rates.
a. TRUE
b. FALSE
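As a rough illustration of questions 15 and 16, the sketch below tabulates a classification (confusion) matrix and the overall misclassification rate. The labels and records are made up, and the helper names are hypothetical; in practice the records would come from held-out data rather than the training partition.

```python
from collections import Counter

def classification_matrix(actual, predicted):
    """Count records for each (actual class, predicted class) pair."""
    return Counter(zip(actual, predicted))

def misclassification_rate(actual, predicted):
    """Share of records whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Hypothetical held-out records.
actual    = ["weak", "weak",   "strong", "strong", "strong", "weak"]
predicted = ["weak", "strong", "strong", "strong", "weak",   "weak"]

matrix = classification_matrix(actual, predicted)
error_rate = misclassification_rate(actual, predicted)
```

The diagonal cells (actual equals predicted) are the correct classifications; every other cell is a misclassification.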
17. Pairs of variables that have a very strong (positive or negative) correlation contain duplicative information. Therefore, we
want to omit the variables that are strongly correlated to others to avoid multicollinearity (when fitting models).
a. TRUE
b. FALSE
18. How would the correlations change if we normalized the data first?
a. Correlations will not change, since computing correlations already normalizes the data
b. None of the above is true
c. Correlations will change, since the distances change when we normalize the data
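Question 18 can be checked numerically. The sketch below computes a Pearson correlation before and after z-score standardization; the variable names and values are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Hypothetical data on two very different scales.
income = [42.0, 58.0, 61.0, 75.0, 90.0]
spending = [1.1, 1.9, 1.7, 2.6, 3.0]

r_raw = pearson(income, spending)
r_std = pearson(standardize(income), standardize(spending))
```

The two values agree up to floating-point rounding, because the correlation formula already divides out each variable's mean and spread.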
19. Which of the following are true about Principal Component Analysis (PCA)? (2 correct answers)
a. PCA is intended for use with quantitative variables
b. The idea of PCA is to find a linear combination of the two variables that contains most, even if not all, of the
information, so that this new variable can replace the two original variables.
c. PCA is intended for use with categorical variables
d. Normalization (= standardization) is usually performed in PCA when analysis is conducted on covariances
20. Which of the following are the methods that we use for dimension reduction? (4 correct answers)
a. Random selection of variables for model development
b. Multiple Linear Regression
c. Removing one of the variables in pairs that have a very strong correlation
d. Logistic Regression
e. Removing independent variables from the model
f. Principal Component Analysis
24. k-NN is a “lazy learner”: the time-consuming computation is deferred to the time of prediction. For every record to be
predicted, we compute its distances from the entire set of training records only at the time of prediction. This behavior
prohibits using this algorithm for real-time prediction of a large number of records simultaneously.
a. TRUE
b. FALSE
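The "lazy learner" behavior described in question 24 shows up directly in a minimal k-NN sketch: there is no fitting step, and every prediction scans the whole training set. The data, labels, and the choice k = 3 are assumptions for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(training, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training records.

    training is a list of (features, label) pairs. All distance computations
    happen here, at prediction time, once per record to be classified.
    """
    neighbors = sorted(training, key=lambda rec: dist(rec[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training records with two numeric predictors.
training = [
    ((1.0, 1.0), "owner"),
    ((1.2, 0.8), "owner"),
    ((0.9, 1.3), "owner"),
    ((3.0, 3.2), "nonowner"),
    ((3.1, 2.9), "nonowner"),
    ((2.8, 3.0), "nonowner"),
]

first = knn_predict(training, (1.1, 1.0))
second = knn_predict(training, (3.0, 3.0))
```

The cost of each call grows with the number of training records, which is why scoring many new records at once can be slow.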
25. The Naive Bayes platform fits a model to predict the value of a numerical variable as well as the value of a categorical
variable.
a. TRUE
26. Which of the following are characteristics of Naive Bayes Classifier? (2 correct answers)
a. Makes no assumptions about the distribution of the data
b. Data Driven
c. Model Driven
d. Makes assumptions about the distribution of the data
27. The Naive Bayes method relies on the assumption of independence between predictor variables within each class
a. TRUE
b. FALSE
28. Which of the following are advantages of the Naive Bayes method? (3 correct answers)
a. Predicts the value of numerical variable well
b. Works well when a predictor category is not present in training data
c. Requires small number of records
d. Handles purely categorical data well
e. Works well with very large data sets
f. Simple and computationally efficient
29. When a bank that is in poor financial condition is misclassified as financially strong, the misclassification cost is much
higher than when a financially strong bank is misclassified as weak. To minimize the expected cost of misclassification,
the cutoff value for classification (which is currently at 0.5) should be increased (TRUE/FALSE?)
a. TRUE
b. FALSE
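The effect of the cutoff in question 29 can be explored with a small sketch. It assumes that the predicted probability is the probability of being financially weak, which the question does not state; if the probability referred to the strong class, the direction would be reversed. The records and probabilities are made up.

```python
def count_errors(actual_weak, prob_weak, cutoff):
    """Return (weak banks classified as strong, strong banks classified as weak).

    Assumption: prob_weak is the predicted probability of the "weak" class,
    and a bank is classified as weak when that probability is >= cutoff.
    """
    weak_as_strong = 0
    strong_as_weak = 0
    for is_weak, p in zip(actual_weak, prob_weak):
        predicted_weak = p >= cutoff
        if is_weak and not predicted_weak:
            weak_as_strong += 1
        elif predicted_weak and not is_weak:
            strong_as_weak += 1
    return weak_as_strong, strong_as_weak

# Hypothetical banks: True means the bank is actually weak.
actual_weak = [True, True, True, False, False, False, True, False]
prob_weak   = [0.9,  0.45, 0.35, 0.4,   0.1,   0.2,   0.6,  0.55]

errors_at_05 = count_errors(actual_weak, prob_weak, 0.5)
errors_at_03 = count_errors(actual_weak, prob_weak, 0.3)
```

Under this assumption, moving the cutoff from 0.5 to 0.3 trades the costly error (weak classified as strong) for more of the cheaper one.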
30. Naive Bayes and K-Nearest Neighbors - Which platform is relatively faster?
a. Naive Bayes
31. When a bank that is in poor financial condition is misclassified as financially strong, the misclassification cost is much
higher than when a financially strong bank is misclassified as weak.
32. Neural networks is a flexible data-driven method that can be used for classification but not for prediction
a. TRUE
b. FALSE (Neural networks can be used not only for classification but also for prediction tasks)
33. Which of the following is not an activation function used by the hidden layer structure of the Neural platform in JMP Pro?
a. Tanh
b. Linear
c. Gaussian
d. Logit
34. Assume that you are running the Neural platform in JMP Pro. Which penalty method should be chosen if your data set has a
large number of X variables, and you think that a few of them contribute more than others to the predictive ability of the
model?
a. No penalty
b. Logarithmic
c. Absolute
d. Squared
36. Discriminant analysis is a data analytics technique that can be used to predict the value of a continuous numerical
outcome.
a. TRUE
b. FALSE (DA is a classification method that can only be used for the classification of a categorical outcome)
37. Which of the following are the assumptions of Discriminant analysis? (2 correct answers)
a. Assumes that, within a class, the features are independent
b. Assumes correlation among predictors within a class is the same across all classes
c. Assumes the data set has more than 20 records
d. Assumes multivariate normality of predictors
38. Which of the following are the advantages of Discriminant Analysis? (2 correct answers)
a. Provides estimates for single predictor contributions
b. Its predictive performance is better than that of multiple linear regression
c. Sensitive to outliers
d. Computationally simple and useful for small datasets
43. Which of the following should be checked for cluster validation? (3 correct answers)
a. Cluster stability
b. Cluster sizes
c. Number of clusters
d. Cluster shape (circular/elliptical)
e. Cluster interpretability
44. Which of the following are the advantages of Hierarchical Clustering? (3 correct answers)
45. In K-means clustering you must specify the number of clusters, k, or a range of values for k, in advance. However, you can
compare the results of different values of k to select an optimal number of clusters for your data. The best fit is
determined by the highest CCC value.
a. CCC
46. Which of the following are the most popular forecasting methods? (2 correct answers)
a. Classification Based
b. Smoothing Based
c. Cluster Analysis
d. Regression Based
a. TRUE
b. FALSE
50. In order to consider dependence between observations one may use ARIMA but not multiple linear regression
a. TRUE
b. FALSE
51. The lift at portion = 0.2 is roughly 2.2. This means that, in the top ____ % of the data, which have been sorted by
predicted probability of being fraudulent, there are roughly ____ times more actual fraudulent records than you would expect
given the overall proportion of fraudulent records (if you were to draw 20% of the records at random).
a. Top 20% of data
b. 2.2 times more fraudulent records than expected given overall proportion 20% of dat
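The lift figure in question 51 follows from a simple ratio, sketched below. The function name, the rounding rule for the size of the top slice, and the ten records are assumptions for illustration; they are not the data behind the 2.2 figure.

```python
def lift_at(actual, scores, portion):
    """Lift in the top `portion` of records ranked by predicted probability.

    actual holds 1 for fraudulent and 0 otherwise; scores holds the predicted
    probability of fraud. Lift is the fraud rate in the top slice divided by
    the overall fraud rate.
    """
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)]
    top_n = max(1, round(len(ranked) * portion))
    top_rate = sum(ranked[:top_n]) / top_n
    overall_rate = sum(ranked) / len(ranked)
    return top_rate / overall_rate

# Hypothetical records: 3 of 10 are fraudulent, 1 of them in the top 2 by score.
actual = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.80, 0.75, 0.60, 0.50, 0.40, 0.30, 0.20, 0.15, 0.05]

lift = lift_at(actual, scores, 0.2)
```

Here the top 20% contains fraud at a rate of 0.5 against an overall rate of 0.3, so the lift is about 1.67; a lift of 2.2 would be read the same way.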
52. A positive coefficient in the logit model translates into an odds larger than 1 (TRUE/FALSE?)
a. TRUE
b. FALSE
53. In the logit model, the estimated coefficient for the variable "TotLns&Lses/Assets" is 9.1732. The odds are then e^9.1732
= 9635. This means that an increase of a unit in total loans and leases-to-assets is associated with an increase in the odds
of being financially weak by a factor of 9635.
a. True
b. False
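The arithmetic in questions 52 and 53 can be checked with a short sketch. The helper name is hypothetical; the coefficient 9.1732 is the one quoted in question 53.

```python
import math

def odds_factor(coefficient):
    """Multiplicative change in the odds for a one-unit increase in the predictor."""
    return math.exp(coefficient)

factor = odds_factor(9.1732)      # roughly 9635, as stated in question 53
positive = odds_factor(0.5)       # any positive coefficient gives a factor above 1
negative = odds_factor(-0.5)      # any negative coefficient gives a factor below 1
```

A coefficient of exactly zero gives a factor of 1, meaning the odds do not change with the predictor.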