You are on page 1of 94

E20-007

Number: 000-000
Passing Score: 800
Time Limit: 120 min
File Version: 1.0

Vendor: EMC

Exam Code: E20-007

Exam Name: Data Science for Big Data Analytics

Version: 12.39

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Exam A

QUESTION 1
What is an appropriate data visualization to use in a presentation for an analyst audience?

A. Pie chart
B. Area chart
C. Stacked bar chart
D. ROC curve

Correct Answer: D
Section: (none)
Explanation

QUESTION 2
What is an example of a null hypothesis?

A. that a newly created model does not provide better predictions than the currently existing model
B. that a newly created model provides a prediction of a null sample mean
C. that a newly created model provides a prediction of a null population mean
D. that a newly created model provides a prediction that will be well fit to the null distribution

Correct Answer: A
Section: (none)
Explanation

QUESTION 3
You have fit a decision tree classifier using 12 input variables. The resulting tree used 7 of the 12 variables, and is 5 levels deep. Some of the nodes
contain only 3 data points. The AUC of the model is 0.85. What is your evaluation of this model?

A. The tree is probably overfit. Try fitting shallower trees and using an ensemble method.
B. The AUC is high,and the small nodes are all very pure. This is an accurate model.
C. The tree did not split on all the input variables. You need a larger data set to get a more accurate model.
D. The AUC is high,so the overall model is accurate. It is not well-calibrated,because the small nodes will give poor estimates of probability.

Correct Answer: A
Section: (none)
Explanation

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
QUESTION 4
If your intention is to show trends over time, which chart type is the most appropriate way to depict the data?

A. Line chart
B. Bar chart
C. Stacked bar chart
D. Histogram

Correct Answer: A
Section: (none)
Explanation

QUESTION 5
You are analyzing a time series and want to determine its stationarity. You also want to determine the order of autoregressive models.
How are the autocorrelation functions used?

"First Test, First Pass" - www.lead2pass.com 4


EMC E20-007 Exam

A. ACF as an indication of stationarity,and PACF for the correlation between Xt and Xt-k not explained by their mutual correlation with X1 through Xk-1.
B. PACF as an indication of stationarity,and ACF for the correlation between Xt and Xt-k not explained by their mutual correlation with X1 through Xk-1.
C. ACF as an indication of stationarity,and PACF to determine the correlation of X1 through Xk-1.
D. PACF as an indication of stationarity,and ACF to determine the correlation of X1 through Xk-1.

Correct Answer: A
Section: (none)
Explanation

QUESTION 6
Which word or phrase completes the statement? A spreadsheet is to a data island as a centralized database for reporting is to a ________?

A. Data Warehouse
B. Data Repository
C. Analytic Sandbox
D. Data Mart

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Correct Answer: A
Section: (none)
Explanation

QUESTION 7
What is one modeling or descriptive statistical function in MADlib that is typically not provided in a standard relational database?

A. Linear regression
B. Expected value
C. Variance
D. Quantiles

Correct Answer: A
Section: (none)
Explanation

QUESTION 8
In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project?

A. Discovery
B. Data Preparation
C. Model Building
D. Communicate Results

Correct Answer: B
Section: (none)
Explanation

QUESTION 9
You are testing two new weight-gain formulas for puppies. The test gives the results:

- Control group: 1% weight gain


- Formula A. 3% weight gain
- Formula B. 4% weight gain
- A one-way ANOVA returns a p-value = 0.027

"First Test, First Pass" - www.lead2pass.com 5


EMC E20-007 Exam

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
What can you conclude?

A. Either Formula A or Formula B is effective at promoting weight gain.


B. Formula B is more effective at promoting weight gain than Formula A.
C. Formula A and Formula B are both effective at promoting weight gain.
D. Formula A and Formula B are about equally effective at promoting weight gain.

Correct Answer: A
Section: (none)
Explanation

QUESTION 10
Data visualization is used in the final presentation of an analytics project. For what else is this technique commonly used?

A. Data exploration
B. Descriptive statistics
C. ETLT
D. Model selection

Correct Answer: A
Section: (none)
Explanation

QUESTION 11
Which functionality do regular expressions provide?

A. text pattern matching


B. underflow prevention
C. increased numerical precision
D. decreased processing complexity

Correct Answer: A
Section: (none)
Explanation

QUESTION 12

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
When creating a project sponsor presentation, what is the main objective?

A. Show that you met the project goals


B. Show how you met the project goals
C. Show how well the model will meet the SLA (service level agreement)
D. Clearly describe the methods and techniques used

Correct Answer: A
Section: (none)
Explanation

QUESTION 13
The average purchase size from your online sales site is $17, 200. The customer experience team believes a certain adjustment of the website will
increase sales. A pilot study on a few hundred customers showed an increase in average purchase size of $1.47, with a significance level of p=0.1.
The team runs a larger study, of a few thousand customers. The second study shows an increased average purchase size of $0.74, with a significance
level of 0.03. What is your assessment of this study?

"First Test, First Pass" - www.lead2pass.com 6


EMC E20-007 Exam

A. The change in purchase size is not practically important,and the good p-value of the second study is probably a result of the large study size.
B. The change in purchase size is small,but may aggregate up to a large increase in profits over the entire customer base.
C. The difference in the change in purchase size between the two studies is troubling; The team should run another,larger study.
D. The p-value of the second study shows a statistically significant change in purchase size. The new website is an improvement.

Correct Answer: A
Section: (none)
Explanation

QUESTION 14
Which word or phrase completes the statement? Business Intelligence is to monitoring trends as Data Science is to ________ trends.

A. Predicting
B. Discarding
C. Driving
D. Optimizing

Correct Answer: A

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Section: (none)
Explanation

QUESTION 15
Consider a scale that has five (5) values that range from "not important" to "very important". Which data classification best describes this data?

A. Ordinal
B. Nominal
C. Real
D. Ratio

Correct Answer: A
Section: (none)
Explanation

QUESTION 16
Which key role for a successful analytic project can provide business domain expertise with a deep understanding of the data and key performance
indicators?

A. Business Intelligence Analyst


B. Project Manager
C. Project Sponsor
D. Business User

Correct Answer: A
Section: (none)
Explanation

QUESTION 17
On analyzing your time series data you suspect that the data represented as y1, y2, y3, ... , yn-1, yn may have a trend component that is quadratic in
nature. Which pattern of data will indicate that the trend in the time series data is quadratic in nature?

"First Test, First Pass" - www.lead2pass.com 7


EMC E20-007 Exam

A. (y3-y2) - (y2-y1) = .........= (yn-yn-1)-(yn-1-yn-2)


B. (y2-y1) = (y3-y2) = ....... = (yn-yn-1)

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
C. ((y2-y1) /y1 ) * 100% = .......((yn-yn-1)/yn-1) * 100%
D. (y4-y2) - (y3-y1) = .........= (yn-yn-2)-(yn-1-yn-3)

Correct Answer: A
Section: (none)
Explanation

QUESTION 18
Which analytical method is considered unsupervised?

A. K-means clustering
B. Na?ve Bayesian classifier
C. Decision tree
D. Linear regression

Correct Answer: A
Section: (none)
Explanation

QUESTION 19
You have used k-means clustering to classify behavior of 100, 000 customers for a retail store. You decide to use household income, age, gender and
yearly purchase amount as measures. You have chosen to use 8 clusters and notice that 2 clusters only have 3 customers assigned. What should you
do?

A. Decrease the number of clusters


B. Increase the number of clusters
C. Decrease the number of measures used
D. Identify additional measures to add to the analysis

Correct Answer: A
Section: (none)
Explanation

QUESTION 20
What does R code nv <- v[v < 1000] do?

A. Selects the values in vector v that are less than 1000 and assigns them to the vector nv

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
B. Sets nv to TRUE or FALSE depending on whether all elements of vector v are less than 1000
C. Removes elements of vector v less than 1000 and assigns the elements >= 1000 to nv
D. Selects values of vector v less than 1000,modifies v,and makes a copy to nv

Correct Answer: A
Section: (none)
Explanation

QUESTION 21
For which class of problem is MapReduce most suitable?

A. Embarrassingly parallel
B. Minimal result data
C. Simple marginalization tasks
D. Non-overlapping queries

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 8
EMC E20-007 Exam

QUESTION 22
Which activity is performed in the Operationalize phase of the Data Analytics Lifecycle?

A. Define the process to maintain the model


B. Try different analytical techniques
C. Try different variables
D. Transform existing variables

Correct Answer: A
Section: (none)
Explanation

QUESTION 23
Since R factors are categorical variables, they are most closely related to which data classification level?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. nominal
B. ordinal
C. interval
D. ratio

Correct Answer: A
Section: (none)
Explanation

QUESTION 24
In which phase of the analytic lifecycle would you expect to spend most of the project time?

A. Discovery
B. Data preparation
C. Communicate Results
D. Operationalize

Correct Answer: C
Section: (none)
Explanation

QUESTION 25
You are building a logistic regression model to predict whether a tax filer will be audited within the next two years. Your training set population is 1000
filers. The audit rate in your training data is 4.2%. What is the sum of the probabilities that the model assigns to all the filers in your training set that have
been audited?

A. 42.0
B. 4.2
C. 0.42
D. 0.042

Correct Answer: A
Section: (none)
Explanation

QUESTION 26

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Refer to exhibit.You are asked to write a report on how specific variables impact your client's sales using a data set provided to you by the client. The
data includes 15 variables that the client views

"First Test, First Pass" - www.lead2pass.com 9


EMC E20-007 Exam

as directly related to sales, and you are restricted to these variables only. After a preliminary analysis of the data, the following findings were made:

1. Multicollinearity is not an issue among the variables


2. Only three variables--A, B, and C--have significant correlation with sales

You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are
seen in the exhibit. You cannot request additional datA. what is a way that you could try to increase the R2 of the model without artificially inflating it?

A. Create clusters based on the data and use them as model inputs
B. Force all 15 variables into the model as independent variables
C. Create interaction variables based only on variables A,B,and C
D. Break variables A,B,and C into their own univariate models

Correct Answer: A
Section: (none)
Explanation

QUESTION 27
You have two tables of customers in your database. Customers in cust_table_1 were sent an e- mail promotion last year, and customers in cust_table_2
received a newsletter last year. Customers can only be entered in once per table. You want to create a table that includes all customers, and any of the
communications they received last year. Which type of join would you use for this table?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Full outer join
B. Inner join
C. Left outer join
D. Cross join

Correct Answer: A
Section: (none)
Explanation

QUESTION 28
In which lifecycle stage are initial hypotheses formed?

A. Discovery
B. Model planning
C. Model building
D. Data preparation
"First Test, First Pass" - www.lead2pass.com 10
EMC E20-007 Exam

Correct Answer: A
Section: (none)
Explanation

QUESTION 29
You are given 10, 000, 000 user profile pages of an online dating site in XML files, and they are stored in HDFS. You are assigned to divide the users
into groups based on the content of their profiles. You have been instructed to try K-means clustering on this data. How should you proceed?

A. Run MapReduce to transform the data,and find relevant key value pairs.
B. Divide the data into sets of 1,000 user profiles,and run K-means clustering in RHadoop iteratively.
C. Run a Naive Bayes classification as a pre-processing step in HDFS.
D. Partition the data by XML file size,and run K-means clustering in each partition.

Correct Answer: A
Section: (none)
Explanation

QUESTION 30

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
The Marketing department of your company wishes to track opinion on a new product that was recently introduced. Marketing would like to know how
many positive and negative reviews are appearing over a given period and potentially retrieve each review for more in-depth insight. They have identified
several popular product review blogs that historically have published thousands of user reviews of your company's products.
You have been asked to provide the desired analysis. You examine the RSS feeds for each blog and determine which fields are relevant. You then craft
a regular expression to match your new product's name and extract the relevant text from each matching review.
What is the next step you should take?

A. Convert the extracted text into a suitable document representation and index into a review corpus
B. Use the extracted text and your regular expression to perform a sentiment analysis based on mentions of the new product
C. Read the extracted text for each review and manually tabulate the results
D. Group the reviews using Na?ve Bayesian classification

Correct Answer: A
Section: (none)
Explanation

QUESTION 31
Which word or phrase completes the statement? A Data Scientist would consider that a RDBMS is to a Table as R is to a ______________ .

A. Data frame
B. List
C. Matrix
D. Array

Correct Answer: A
Section: (none)
Explanation

QUESTION 32
Which word or phrase completes the statement? Unix is to bash as Hadoop is to:

A. Pig
"First Test, First Pass" - www.lead2pass.com 11
EMC E20-007 Exam
B. HDFS
C. Sqoop
D. NameNode

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Correct Answer: A
Section: (none)
Explanation

QUESTION 33
A call center for a large electronics company handles an average of 35, 000 support calls a day. The head of the call center would like to optimize the
staffing of the call center during the rollout of a new product due to recent customer complaints of long wait times. You have been asked to create a
model to optimize call center costs and customer wait times.
The goals for this project include:

1. Relative to the release of a product, how does the call volume change over time?
2. How to best optimize staffing based on the call volume for the newly released product, relative to old products.
3. Historically, what time of day does the call center need to be most heavily staffed?
4. Determine the frequency of calls by both product type and customer language.

Which goals are suitable to be completed with MapReduce?

A. Goal 2 and 4
B. Goal 1 and 3
C. Goals 1,2,3,4
D. Goals 2,3,4

Correct Answer: A
Section: (none)
Explanation

QUESTION 34
Consider the example of an analysis for fraud detection on credit card usage. You will need to ensure higher-risk transactions that may indicate
fraudulent credit card activity are retained in your data for analysis, and not dropped as outliers during pre-processing. What will be your approach for
loading data into the analytical sandbox for this analysis?

A. ELT
B. ETL
C. EDW
D. OLTP

Correct Answer: A
Section: (none)
Explanation

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
QUESTION 35
Trend, seasonal, and cyclical are components of a time series. What is another component?

A. Irregular
B. Linear
C. Quadratic
D. Exponential

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 12
EMC E20-007 Exam

QUESTION 36
You are studying the behavior of a population, and you are provided with multidimensional data at the individual level. You have identified four specific
individuals who are valuable to your study, and would like to find all users who are most similar to each individual. Which algorithm is the most
appropriate for this study?

A. K-means clustering
B. Linear regression
C. Association rules
D. Decision trees

Correct Answer: A
Section: (none)
Explanation

QUESTION 37
Which R data structure allows elements to have different data types?

A. List
B. Vector
C. Matrix
D. Array

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Correct Answer: A
Section: (none)
Explanation

QUESTION 38
Which key role for a successful analytic project can consult and advise the project team on the value of end results and how these will be used on a day-
to-day basis?

A. Business User
B. Project Manager
C. Data Scientist
D. Business Intelligence Analyst

Correct Answer: A
Section: (none)
Explanation

QUESTION 39
A disk drive manufacturer has a defect rate of less than 1.0% with 98% confidence. A quality assurance team samples 1000 disk drives and finds 14
defective units. Which action should the team recommend?

A. The manufacturing process should be inspected for problems.


B. A larger sample size should be taken to determine if the plant is functioning properly
C. A smaller sample size should be taken to determine if the plant is functioning properly
D. The manufacturing process is functioning properly and no further action is required.

Correct Answer: A
Section: (none)
Explanation

QUESTION 40
"First Test, First Pass" - www.lead2pass.com 13
EMC E20-007 Exam

What is required in a presentation for project sponsors?

A. The "Big Picture" takeaways for executive level stakeholders

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
B. Data warehouse design changes
C. Line by line review of the developed code
D. Detailed statistical basis for the modeling approach used in the project

Correct Answer: A
Section: (none)
Explanation

QUESTION 41
A data scientist wants to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level.
What is the most appropriate method for this project?

A. Logistic regression
B. Linear regression
C. K-means clustering
D. Apriori algorithm

Correct Answer: A
Section: (none)
Explanation

QUESTION 42
What are the characteristics of Big Data?

A. Data volume,processing complexity,and data structure variety.


B. Data volume,business importance,and data structure variety.
C. Data type,processing complexity,and data structure variety.
D. Data volume,processing complexity,and business importance.

Correct Answer: A
Section: (none)
Explanation

QUESTION 43
You are analyzing data in order to build a classifier model. You discover non-linear data and discontinuities that will affect the model. Which analytical
method would you recommend?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Decision Trees
B. Logistic Regression
C. ARIMA
D. Linear Regression

Correct Answer: A
Section: (none)
Explanation

QUESTION 44
What is an appropriate data visualization to use in a presentation for a project sponsor?

A. Bar chart
B. Pie chart
C. Box and Whisker plot
D. Density plot
"First Test, First Pass" - www.lead2pass.com 14
EMC E20-007 Exam

Correct Answer: A
Section: (none)
Explanation

QUESTION 45
You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. All the data currently available to you has
been loaded into your analytics database; revenue data, pricing data, and online transaction data. You find that all the data comes in different levels of
granularity. The transaction data has timestamps (day, hour, minutes, seconds), pricing is stored at the daily level, and revenue data is only reported
monthly. What is your next step?

A. Report back to the business owner that the current data model does not support the business question.
B. Interpolate a daily model for revenue from the monthly revenue data.
C. Aggregate all data to the monthly level in order to create a monthly revenue model.
D. Disregard revenue as a driver in the pricing model,and create a daily model based on pricing and transactions only.

Correct Answer: A
Section: (none)
Explanation

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
QUESTION 46
Which SQL OLAP extension provides all possible grouping combinations?

A. CUBE
B. ROLLUP
C. UNION ALL
D. CROSS JOIN

Correct Answer: A
Section: (none)
Explanation

QUESTION 47
What is the primary bottleneck in text classification?

A. The availablilty of tagged training data.


B. The ability to parse unstructured text data.
C. The high dimensionality of text data.
D. The fact that text corpora are dynamic.

Correct Answer: A
Section: (none)
Explanation

QUESTION 48
Which characteristic applies only to Business Intelligence as opposed to Data Science?

A. Uses only structured data


B. Supports solving "what if" scenarios
C. Uses large data sets
D. Uses predictive modeling techniques

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
"First Test, First Pass" - www.lead2pass.com 15
EMC E20-007 Exam

QUESTION 49
You have been assigned to run a linear regression model for each of 5, 000 distinct districts, and all the data is currently stored in a PostgreSQL
database. Which tool/library would you use to produce these models with the least effort?

A. MADlib
B. Mahout
C. R
D. HBase

Correct Answer: A
Section: (none)
Explanation

QUESTION 50
Your customer provided you with 2, 000 unlabeled records and asked you to separate them into three groups. What is the correct analytical method to
use?

A. K-means clustering
B. Linear regression
C. Naive Bayesian classification
D. Logistic regression

Correct Answer: A
Section: (none)
Explanation

QUESTION 51
You are performing a market basket analysis using the Apriori algorithm. Which measure is a ratio describing the how many more times two items are
present together than would be expected if those two items are statistically independent?

A. Lift
B. Leverage
C. Support
D. Confidence

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Correct Answer: A
Section: (none)
Explanation

QUESTION 52
In which lifecycle stage are appropriate analytical techniques determined?

A. Model planning
B. Model building
C. Data preparation
D. Discovery

Correct Answer: A
Section: (none)
Explanation

QUESTION 53
What is Hadoop?

A. Java classes for HDFS types and MapReduce job management and HDFS "First Test, First Pass" - www.lead2pass.com 16
EMC E20-007 Exam
B. Java classes for HDFS types and MapReduce job management and the MapReduce paradigm
C. MapReduce paradigm and HDFS
D. MapReduce paradigm and massive unstructured data storage on commodity hardware

Correct Answer: A
Section: (none)
Explanation

QUESTION 54
You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex, Height, Weight, Age and Income as measures
and have used 3 clusters. When you create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters.
What should you do?

A. Identify additional measures to add to the analysis


B. Remove one of the measures
C. Decrease the number of clusters

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
D. Increase the number of clusters

Correct Answer: C
Section: (none)
Explanation

QUESTION 55
How does Pig's use of a schema differ from that of a traditional RDBMS?

A. Pig's schema is optional


B. Pig's schema requires that the data is physically present when the schema is defined
C. Pig's schema is required for ETL
D. Pig's schema supports a single data type

Correct Answer: A
Section: (none)
Explanation

QUESTION 56
You are provided four different datasets. Initial analysis on these datasets show that they have identical mean, variance and correlation values. What
should your next step in the analysis be?

A. Visualize the data to further explore the characteristics of each data set
B. Select one of the four datasets and begin planning and building a model
C. Combine the data from all four of the datasets and begin planning and bulding a model
D. Recalculate the descriptive statistics since they are unlikely to be identical for each dataset

Correct Answer: A
Section: (none)
Explanation

QUESTION 57
You are asked to create a model to predict the total number of monthly subscribers for a specific magazine. You are provided with 1 year's worth of
subscription and payment data, user demographic data, and 10 years worth of content of the magazine (articles and pictures). Which algorithm is the
most appropriate for building a predictive model for subscribers?

A. Linear regression

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
B. Logistic regression
C. Decision trees
"First Test, First Pass" - www.lead2pass.com 17
EMC E20-007 Exam
D. TF-IDF

Correct Answer: A
Section: (none)
Explanation

QUESTION 58
Which word or phrase completes the statement? Structured data is to OLAP data as quasi- structured data is to____

A. Clickstream data
B. XML data
C. Text documents
D. Image files

Correct Answer: A
Section: (none)
Explanation

QUESTION 59
What describes a true property of Logistic Regression method?

A. It is robust with redundant variables and correlated variables.


B. It handles missing values well.
C. It works well with discrete variables that have many distinct values.
D. It works well with variables that affect the outcome in a discontinuous way.

Correct Answer: A
Section: (none)
Explanation

QUESTION 60
You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. You have tested all the theoretical models in
the previous model planning stage, and all tests have yielded statistically insignificant results. What is your next step?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Report that the results are insignificant,and reevaluate the original business question.
B. Run all the models again against a larger sample,leveraging more historical data.
C. Move forward on the model with the highest significance scores relative to the others.
D. Modify samples used by the models and iterate until a significant result occurs.

Correct Answer: A
Section: (none)
Explanation

QUESTION 61
What is a core deliverable at the end of the analytic project?

A. An implemented database design


B. A whitepaper describing the project and the implementation
C. A presentation for project sponsors
D. The training materials

Correct Answer: C
Section: (none)
Explanation

QUESTION 62
"First Test, First Pass" - www.lead2pass.com 18
EMC E20-007 Exam

You have been assigned to run a logistic regression model for each of 100 countries, and all the data is currently stored in a PostgreSQL database.
Which tool/library would you use to produce these models with the least effort?

A. MADlib
B. Mahout
C. RStudio
D. HBase

Correct Answer: A
Section: (none)
Explanation

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
QUESTION 63
Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a
coupon. You have been asked to determine if offering a coupon to visitors to your website has any impact on their purchase decision.
Which analysis method should you use?

A. K-means clustering
B. Association rules
C. Student T-test
D. One-way ANOVA

Correct Answer: D
Section: (none)
Explanation

QUESTION 64
Imagine you are trying to hire a Data Scientist for your team. In addition to technical ability and quantitative background, which additional essential trait
would you look for in people applying for this position?

A. Communication skill
B. Scientific background
C. Domain expertise
D. Well Organized

Correct Answer: A
Section: (none)
Explanation

QUESTION 65
What describes the use of UNION clause in a SQL statement?

A. Operates on queries and potentially increases the number of rows


B. Operates on queries and potentially decreases the number of rows
C. Operates on tables and potentially decreases the number of columns
D. Operates on both tables and queries and potentially increases both the number of rows and columns

Correct Answer: A
Section: (none)

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Explanation

QUESTION 66
You have run the association rules algorithm on your data set, and the two rules {banana, apple} => {grape} and {apple, orange}=> {grape} have been
found to be relevant. What else must be true?

"First Test, First Pass" - www.lead2pass.com 19


EMC E20-007 Exam

A. {grape,apple,orange} must be a frequent itemset.


B. {banana,apple,grape,orange} must be a frequent itemset.
C. {grape} => {banana,apple} must be a relevant rule.
D. {banana,apple} => {orange} must be a relevant rule.

Correct Answer: A
Section: (none)
Explanation

QUESTION 67
When would you use a Wilcoxson Rank Sum test?

A. When you cannot make an assumption about the distribution of the populations
B. When the data can easily be sorted
C. When the populations represent the sums of other values
D. When the data cannot easily be sorted

Correct Answer: A
Section: (none)
Explanation

QUESTION 68
In the MapReduce framework, what is the purpose of the Reduce function?

A. It aggregates the results of the Map function and generates processed output
B. It distributes the input to multiple nodes for processing
C. It writes the output of the Map function to storage
D. It breaks the input into smaller components and distributes to other nodes in the cluster

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Correct Answer: A
Section: (none)
Explanation

QUESTION 69
Which of the following is an example of quasi-structured data?

A. OLAP
B. OLTP
C. Customer record table
D. Clickstream data

Correct Answer: A
Section: (none)
Explanation

QUESTION 70
A Data Scientist is assigned to build a model from a reporting data warehouse. The warehouse contains data collected from many sources and
transformed through a complex, multi-stage ETL process. What is a concern the data scientist should have about the data?

A. It is too processed
B. It is not structured
C. It is not normalized
D. It is too centralized

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 20
EMC E20-007 Exam

QUESTION 71
Which word or phrase completes the statement? Emphasis color is to standard color as _______ .

A. Main message is to context

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
B. Main message is to key findings
C. Frequent item set is to item
D. Pie chart is to proportions

Correct Answer: A
Section: (none)
Explanation

QUESTION 72
Which activity might be performed in the Operationalize phase of the Data Analytics Lifecycle?

A. Run a pilot
B. Try different analytical techniques
C. Try different variables
D. Transform existing variables

Correct Answer: A
Section: (none)
Explanation

QUESTION 73
Refer to the exhibit. You are asked to write a report on how specific variables impact your client's sales using a data set provided to you by the client.
The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only. After a preliminary analysis
of the data, the following findings were made:

1. Multicollinearity is not an issue among the variables


2. Only three variables--A, B, and C--have significant correlation with sales

You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are
seen in the exhibit.
Which interpretation is supported by the analysis?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Variables A,B,and C are significantly impacting sales,but are not effectively estimating sales
B. Variables A,B,and C are significantly impacting sales and are effectively estimating sales
C. Due to the R2 of 0.10,the model is not valid ?the linear regression should be re-run with all 15 variables forced into the model to increase the R2
"First Test, First Pass" - www.lead2pass.com 21
EMC E20-007 Exam
D. Due to the R2 of 0.10,the model is not valid ?a different analytical model should be attempted

Correct Answer: A
Section: (none)
Explanation

QUESTION 74
Refer to the Exhibit. In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also shows the values for the output
attribute "class". Which decision tree is valid for the data?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Tree B
B. Tree A
C. Tree C
D. Tree D

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Correct Answer: A
Section: (none)
Explanation

QUESTION 75
Refer to the Exhibit. In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also shows the values for the output
attribute "class". Which decision tree is valid for the data?

"First Test, First Pass" - www.lead2pass.com 22


EMC E20-007 Exam

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Tree B
B. Tree A
C. Tree C
D. Tree D

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Correct Answer: A
Section: (none)
Explanation

QUESTION 76
Refer to the exhibit. You are building a decision tree. In this exhibit, four variables are listed with their respective values of info-gain.
Based on this information, on which attribute would you expect the next split to be in the decision tree?

A. Credit Score
B. Age
C. Income
D. Gender
"First Test, First Pass" - www.lead2pass.com 23
EMC E20-007 Exam

Correct Answer: A
Section: (none)
Explanation

QUESTION 77
Refer to the exhibit. You are assigned to do an end of the year sales analysis of 1, 000 different products, based on the transaction table. Which column
in the end of year report requires the use of a window function?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Total Sales to Date
B. Daily Sales
C. Average Daily Price
D. Maximum Price

Correct Answer: A
Section: (none)
Explanation

QUESTION 78
Refer to the exhibit. After analyzing a dataset, you report findings to your team:

1. Variables A and C are significantly and positively impacting the dependent variable.
2. Variable B is significantly and negatively impacting the dependent variable.
3. Variable D is not significantly impacting the dependent variable.

After seeing your findings, the majority of your team agreed that variable B should be positively impacting the dependent variable.
What is a possible reason the coefficient for variable B was negative and not positive?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Variable B is interacting with another variable due to correlated inputs
B. Variable B needs a quadratic transformation due to its relationship to the dependent variable "First Test, First Pass" - www.lead2pass.com 24
EMC E20-007 Exam
C. The information gain from variable B is already provided by another variable
D. Variable B needs a logarithmic transformation due to its relationship to the dependent variable

Correct Answer: A
Section: (none)
Explanation

QUESTION 79
Refer to the Exhibit. You are working on creating an OLAP query that outputs several rows of with summary rows of subtotals and grand totals in
addition to regular rows that may contain NULL as shown in the exhibit. Which function can you use in your query to distinguish the row from a regular
row to a subtotal row?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. GROUPING
B. RANK
C. GROUP_ID
D. ROLLUP

Correct Answer: A
Section: (none)
Explanation

QUESTION 80
Refer to the exhibit. Click on the calculator icon in the upper left corner. You are given a list of pre-defined association rules:

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. RENTER => BAD CREDIT
B. RENTER => GOOD CREDIT

"First Test, FirstPass" - www.lead2pass.com 25


EMC E20-007 Exam
C. HOME OWNER => BAD CREDIT
D. HOME OWNER => GOOD CREDIT
E. FREE HOUSING => BAD CREDIT
F. FREE HOUSING => GOOD CREDIT

For your next analysis, you must limit your dataset based on rules with confidence greater than 60%.
Which of the rules will be kept in the analysis?

A. Rules B and D
B. Rules A and F
C. Rules C and E
D. Rules D and E

Correct Answer: A
Section: (none)
Explanation

QUESTION 81
Refer to the exhibit. You have run a linear regression model against your data, and have plotted true outcome versus predicted outcome. The R-squared
of your model is 0.75. What is your assessment of the model?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. The R-squared may be biased upwards by the extreme-valued outcomes. Remove them and refit to get a better idea of the model's quality over
typical data.
B. The R-squared is good. The model should perform well.
C. The extreme-valued outliers may negatively affect the model's performance. Remove them to see if the R-squared improves over typical data.
D. The observations seem to come from two different populations,but this model fits them both equally well.

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 26
EMC E20-007 Exam

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
QUESTION 82
Refer to the exhibit. You are using K-means clustering to classify customer behavior for a large retailer. You need to determine the optimum number of
customer groups. You plot the within-sum-of-squares (wss) data as shown in the exhibit. How many customer groups should you specify?

A. 2
B. 3
C. 4
D. 8

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Correct Answer: C
Section: (none)
Explanation

QUESTION 83
Refer to the exhibit. You are using k-means clustering to discover groupings within a data set. You plot within-sum-of- squares (wss) of multiple cluster
sizes. Based on the exhibit, how many clusters should you use in your analysis?

"First Test, First Pass" - www.lead2pass.com 27


EMC E20-007 Exam

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. 4
B. 2
C. 8
D. 10

Correct Answer: A
Section: (none)
Explanation

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
QUESTION 84
Refer to the exhibit Consider the training data set shown in the exhibit. What are the classification (Y = 0 or 1) and the probability of the classification for
the tupleX(0, 0, 1) using Naive Bayesian classifier?

"First Test, First Pass" - www.lead2pass.com 28


EMC E20-007 Exam

A. Classification Y = 1,Probability = 4/54


B. Classification Y = 0,Probability = 1/54
C. Classification Y = 1,Probability = 1/54
D. Classification Y = 0,Probability = 4/54

Correct Answer: A

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Section: (none)
Explanation

QUESTION 85
Refer to the exhibit. In the exhibit, a correlogram is provided based on an autocorrelation analysis of a sample dataset. What can you conclude from only
this exhibit?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. There is significant autocorrelation through lag 3
B. There is no structure left to model in the data
C. Lag 7 has a significant negative autocorrelation
D. Differencing is required before proceeding with any analysis

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 29
EMC E20-007 Exam

QUESTION 86
Refer to the exhibit. Which type of data issue would you suspect based on the exhibit?

A. "Saturated" data,indicating potential issues with data definitions

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
B. Incomplete data,indicating potential issues with data transmission
C. Mis-scaled data,indicating potential issues with data entry
D. The exhibit does not raise any obvious concerns with the data.

Correct Answer: A
Section: (none)
Explanation

QUESTION 87
Which word or phrase completes the statement? A spreadsheet is to a data island as a centralized database for reporting is to a ________?

A. Data Warehouse
B. Data Repository
C. Analytic Sandbox
D. Data Mart

Correct Answer: A
Section: (none)
Explanation

QUESTION 88
Refer to the exhibit. Click on the calculator icon in the upper left corner. An analyst is searching a corpus of documents for the topic "solid state disk". In
the Exhibit, Table A provides the inverse

"First Test, First Pass" - www.lead2pass.com 30


EMC E20-007 Exam

document frequency for each term across the corpus. Table B provides each term's frequency in four documents selected from corpus. Which of the
four documents is most relevant to the analyst's search?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Document C
B. Document A
C. Document B
D. Document D

Correct Answer: A
Section: (none)
Explanation

QUESTION 89

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Which analytical method is considered unsupervised?

A. K-means clustering
B. Na?ve Bayesian classifier
C. Decision tree
D. Linear regression

Correct Answer: A
Section: (none)
Explanation

QUESTION 90
Refer to the exhibit. What provides the decision tree for predicting whether or not someone is a good or bad credit risk. What would be the assigned
probability, p(good), of a single male with no known savings?

"First Test, First Pass" - www.lead2pass.com 31


EMC E20-007 Exam

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. 0.83
B. 0
C. 0.498
D. 0.6

Correct Answer: A
Section: (none)
Explanation

QUESTION 91
Refer to the exhibit. You ran a linear regression, and the final output is seen in the exhibit. Based only on the information in the exhibit and an acceptable
confidence level of 95%, how would you interpret the interaction of variable D with the dependent variable?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. In this model,Variable D is not significantly interacting with the dependent variable
B. For every 1 unit increase in variable D,holding all other variables constant,we can expect the dependent variable to increase by 10.23 units
C. For every 1 unit increase in variable D,holding all other variables constant,we can expect the dependent variable to be multiplied by 10.23 units
D. Variable D is more significant than variables A,B,and C.
"First Test, First Pass" - www.lead2pass.com 32
EMC E20-007 Exam

Correct Answer: A
Section: (none)
Explanation

QUESTION 92
Refer to the exhibit. The exhibit shows four graphs labeled as Fig A thorough Fig D. Which figure represents the entropy function relative to a Boolean
classification and is represented by the formula shown in Exhibit?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Fig-A
B. Fig-B
C. Fig-C
D. Fig-D

Correct Answer: A
Section: (none)
Explanation

QUESTION 93
Trend, seasonal, and cyclical are components of a time series. What is another component?

A. Irregular
B. Linear
C. Quadratic
D. Exponential

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 33
EMC E20-007 Exam

QUESTION 94
Refer to the exhibit. The graph represents an ROC space with four classifiers labelled A through D. Which point in the graph represents a perfect
classification?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. S
B. P
C. Q
D. R

Correct Answer: A
Section: (none)
Explanation

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
QUESTION 95
Refer to the exhibit. Consider the training data set shown in the exhibit. What are the classification (Y = 0 or 1) and the probability of the classification for
the tuple X(1, 0, 0) using Naive Bayesian classifier?

"First Test, First Pass" - www.lead2pass.com 34


EMC E20-007 Exam

A. Classification Y = 0,Probability = 4/54


B. Classification Y = 1,Probability = 4/54
C. Classification Y = 0,Probability = 1/54
D. Classification Y = 1,Probability = 1/54

Correct Answer: A

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Section: (none)
Explanation

QUESTION 96
Refer to the exhibit. You have scored your Naive bayesian classifier model on a hold out test data for cross validation and determined the way the
samples scored and tabluated them as shown in the exhibit. What are the Precision and Recall rate of the model?

A. Precision = 262/277
Recall = 262/288

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
B. Precision =262/288
Recall = 262/277
C. Precision = 277/262
Recall = 288/262
D. Precision = 288/262
Recall = 277/262

Correct Answer: A
Section: (none)
Explanation

QUESTION 97
You are analyzing data in order to build a classifier model. You discover non-linear data and discontinuities that will affect the model. Which analytical
method would you recommend?

"First Test, First Pass" - www.lead2pass.com 35


EMC E20-007 Exam

A. Decision Trees
B. Logistic Regression
C. ARIMA
D. Linear Regression

Correct Answer: A
Section: (none)
Explanation

QUESTION 98
Which ROC curve represents a perfect model fit?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A.
B. "First Test, First Pass" - www.lead2pass.com 36
EMC E20-007 Exam

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
C.
D.

Correct Answer: A
Section: (none)
Explanation

QUESTION 99
You are using MADlib for Linear Regression analysis. Which value does the statement return?

SELECT (linregr(depvar, indepvar)).r2 FROM zeta1;

A. Goodness of fit
B. Coefficients
C. Standard error
D. P-value
"First Test, First Pass" - www.lead2pass.com 37
EMC E20-007 Exam

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Correct Answer: A
Section: (none)
Explanation

QUESTION 100
Refer to the exhibit. You have scored your Naive bayesian classifier model on a hold out test data for cross validation and determined the way the
samples scored and tabulated them as shown in the exhibit.
What are the the False Positive Rate (FPR) and the False Negative Rate (FNR) of the model?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. FPR = 15/262
FNR = 26/288
B. FPR = 26/288
FNR = 15/262
C. FPR = 262/15
FNR = 288/26
D. FPR = 288/26
FNR = 262/15

Correct Answer: A
Section: (none)
Explanation

QUESTION 101
A data scientist plans to classify the sentiment polarity of 10, 000 product reviews collected from the Internet. What is the most appropriate model to
use? Suppose labeled training data is available.

A. Na?ve Bayesian classifier


B. Linear regression
C. Logistic regression
D. K-means clustering
"First Test, First Pass" - www.lead2pass.com 38
EMC E20-007 Exam

Correct Answer: A
Section: (none)
Explanation

QUESTION 102
In which lifecycle stage are test and training data sets created?

A. Model building
B. Model planning
C. Discovery
D. Data preparation

Correct Answer: A
Section: (none)

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Explanation

QUESTION 103
When creating a presentation for a technical audience, what is the main objective?

A. Show that you met the project goals


B. Show how you met the project goals
C. Show if the model will meet the SLA
D. Show the technique to be used in the production environment

Correct Answer: B
Section: (none)
Explanation

QUESTION 104
Your company has 3 different sales teams. Each team's sales manager has developed incentive offers to increase the size of each sales transaction.
Any sales manager whose incentive program can be shown to increase the size of the average sales transaction will receive a bonus. Data are available
for the number and average sale amount for transactions offering one of the incentives as well as transactions offering no incentive. The VP of Sales
has asked you to determine analytically if any of the incentive programs has resulted in a demonstrable increase in the average sale amount. Which
analytical technique would be appropriate in this situation?

A. One-way ANOVA
B. Multi-way ANOVA
C. Student's t-test
D. Wilcoxson Rank Sum Test

Correct Answer: A
Section: (none)
Explanation

QUESTION 105
In data visualization, what is used to focus the audience on a key part of a chart?

A. Emphasis colors
B. Detailed text
C. Pastel colors
D. A data table

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 39
EMC E20-007 Exam

QUESTION 106
When would you use GROUP BY ROLLUP clause in your OLAP query?

A. where all subtotals and grand totals are to be included in the output
B. where only the subtotals are to be included in the output
C. where only the grand totals are to be included in the output
D. where only specific subtotals and grand totals for a combination of variables are to be included in the output

Correct Answer: A
Section: (none)
Explanation

QUESTION 107
Which type of numeric value does a logistic regression model estimate?

A. Probability
B. A p-value
C. Any integer
D. Any real number

Correct Answer: A
Section: (none)
Explanation

QUESTION 108
Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has a strong
background in data flow languages and programming.
Which query interface would you recommend?

A. Pig

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
B. Hive
C. Howl
D. HBase

Correct Answer: A
Section: (none)
Explanation

QUESTION 109
The web analytics team uses Hadoop to process access logs. They now want to correlate this data with structured user data residing in a production
single-instance JDBC database. They collaborate with the production team to import the data into Hadoop. Which tool should they use?

A. Sqoop
B. Pig
C. Chukwa
D. Scribe

Correct Answer: A
Section: (none)
Explanation

QUESTION 110
"First Test, First Pass" - www.lead2pass.com 40
EMC E20-007 Exam

What does the R code

z <- f[1:10, ]

do?

A. Assigns the first 10 rows of f to the vector z


B. Assigns the 1st 10 columns of the 1st row of f to z
C. Assigns a sequence of values from 1 to 10 to z
D. Assigns the 1st 10 columns to z

Correct Answer: A
Section: (none)

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Explanation

QUESTION 111
In R, functions like plot() and hist() are known as what?

A. generic functions
B. virtual methods
C. virtual functions
D. generic methods

Correct Answer: B
Section: (none)
Explanation

QUESTION 112
Review the following code:

SELECT pn, vn, sum(prc*qty)


FROM sale
GROUP BY CUBE(pn, vn)
ORDER BY 1, 2, 3;

Which combination of subtotals do you expect to be returned by the query?

A. (pn,vn)
B. ( (pn,vn),(pn) )
C. ( (pn,vn),(pn),(vn) )
D. ( (pn,vn),(pn),(vn),( ) )

Correct Answer: D
Section: (none)
Explanation

QUESTION 113
In MADlib what does MAD stand for?

A. Magnetic,Agile,Deep
B. Machine Learning,Algorithms for Databases

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
C. Mathematical Algorithms for Databases
D. Modular,Accurate,Dependable

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 41
EMC E20-007 Exam

QUESTION 114
The web analytics team uses Hadoop to process access logs. They now want to correlate this data with structured user data residing in their massively
parallel database. Which tool should they use to export the structured data from Hadoop?

A. Sqoop
B. Pig
C. Chukwa
D. Scribe

Correct Answer: A
Section: (none)
Explanation

QUESTION 115
When would you prefer a Naive Bayes model to a logistic regression model for classification?

A. When you are using several categorical input variables with over 1000 possible values each.
B. When you need to estimate the probability of an outcome,not just which class it is in.
C. When all the input variables are numerical.
D. When some of the input variables might be correlated.

Correct Answer: A
Section: (none)
Explanation

QUESTION 116
Before you build an ARMA model, how can you tell if your time series is weakly stationary?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. There appears to be a constant variance around a constant mean.
B. The mean of the series is close to 0.
C. The series is normally distributed.
D. There appears to be no apparent trend component.

Correct Answer: A
Section: (none)
Explanation

QUESTION 117
In a Student's t-test, what is the meaning of the p-value?

A. it is the area under the appropriate tails of the Student's distribution


B. it is the "power" of the Student's t-test
C. it is the mean of the distribution for the null hypothesis
D. it is the mean of the distribution for the alternate hypothesis

Correct Answer: A
Section: (none)
Explanation

QUESTION 118
In addition to less data movement and the ability to use larger datasets in calculations, what is a benefit of analytical calculations in a database?

A. quicker time to insight


B. more efficient handling of categorical values
"First Test, First Pass" - www.lead2pass.com 42
EMC E20-007 Exam
C. improved connections between disparate data sources
D. full use of data aggregation functionality

Correct Answer: A
Section: (none)
Explanation

QUESTION 119

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. When have you completed the analytics
lifecycle?

A. You have written documentation,and the code has been handed off to the Data Base Administrator and business operations.
B. You have a completely developed model,and the results have shown statistically acceptable results.
C. You have presented the results of the model to both the internal analytics team and the business owner of the project.
D. You have a completely developed model based on both a sample of the data and the entire set of data available.

Correct Answer: A
Section: (none)
Explanation

QUESTION 120
Consider these itemsets:

(hat, scarf, coat)


(hat, scarf, coat, gloves)
(hat, scarf, gloves)
(hat, gloves)
(scarf, coat, gloves)

What is the confidence of the rule (gloves -> hat)?

A. 75%
B. 60%
C. 66%
D. 80%

Correct Answer: A
Section: (none)
Explanation

QUESTION 121
What is holdout data?

A. a subset of the provided data set selected at random and used to validate the model
B. a subset of the provided data set selected at random and used to initially construct the model
C. a subset of the provided data set that is removed by the data scientist because it contains data errors

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
D. a subset of the provided data set that is removed by the data scientist because it contains outliers

Correct Answer: A
Section: (none)
Explanation

QUESTION 122
Which characteristic applies mainly to Data Science as opposed to Business Intelligence?

"First Test, First Pass" - www.lead2pass.com 43


EMC E20-007 Exam

A. Advanced analytical methods


B. Robust reporting
C. Focus on structured data
D. Data dashboards

Correct Answer: A
Section: (none)
Explanation

QUESTION 123
Which word or phrase completes the statement?
Theater actor is to "Artistic and Expressive" as Data Scientist is to ________________

A. "Communicative and Collaborative"


B. "Introverted and Technical"
C. "Logical and Steadfast"
D. "Independent and Intelligent"

Correct Answer: A
Section: (none)
Explanation

QUESTION 124
Which process in text analysis can be used to reduce dimensionality?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Stemming
B. Parsing
C. Digitizing
D. Sorting

Correct Answer: A
Section: (none)
Explanation

QUESTION 125
What is the format of the output from the Map function of MapReduce?

A. Key-value pairs
B. Binary respresentation of keys concatenated with structured data
C. Compressed index
D. Unique key record and separate records of all possible values

Correct Answer: A
Section: (none)
Explanation

QUESTION 126
Which data type value is used for the observed response variable in a logistic regression model?

A. Any positive real number


B. Any integer
C. A binary value
D. Any real number

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 44
EMC E20-007 Exam

QUESTION 127

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A data scientist is given an R data frame, "empdata", with the columns Age, Salary, Occupation, Education, and Gender. The data scientist would like to
examine only the Salary and Occupation columns for ages greater than 40. Which command extracts the appropriate rows and columns from the data
frame?

A. empdata[empdata$Age > 40,c("Salary","Occupation")]


B. empdata[c("Salary","Occupation"),empdata$Age > 40]
C. empdata[Age > 40,("Salary","Occupation")]
D. empdata[,c("Salary","Occupation")]$Age > 40

Correct Answer: A
Section: (none)
Explanation

QUESTION 128
What is required in a presentation for business analysts?

A. Budgetary considerations and requests


B. Operational process changes
C. Detailed statistical explanation of the applicable modeling theory
D. The presentation author's credentials

Correct Answer: B
Section: (none)
Explanation

QUESTION 129
What is LOESS used for?

A. It fits a smoothed curve to scatterplot data,to give a general sense of the data's behavior.
B. It is a significance test for the correlation between two variables.
C. It plots a continuous variable versus a discrete variable,to compare distributions across classes.
D. It is run after a one-way ANOVA,to determine which population has the highest mean value.

Correct Answer: A
Section: (none)
Explanation

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
QUESTION 130
Which word or phrase completes the statement? Mahout is to Hadoop as MADlib is to ____________ .

A. PostgreSQL
B. R
C. Excel
D. SAS

Correct Answer: A
Section: (none)
Explanation

QUESTION 131
In linear regression modeling, which action can be taken to improve the linearity of the relationship between the dependent and independent variables?

A. Apply a transformation to a variable


"First Test, First Pass" - www.lead2pass.com 45
EMC E20-007 Exam
B. Use a different statistical package
C. Calculate the R-Squared value
D. Change the units of measurement on the independent variable

Correct Answer: A
Section: (none)
Explanation

QUESTION 132
Data visualization is used in the final presentation of an analytics project. For what else is this technique commonly used?

A. Assessing data quality


B. Descriptive statistics
C. ETLT
D. Model selection

Correct Answer: A
Section: (none)
Explanation

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
QUESTION 133
Which data asset is an example of semi-structured data?

A. XML data file


B. Database table
C. Webserver log
D. News article

Correct Answer: A
Section: (none)
Explanation

QUESTION 134
Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has previously
worked extensively with SQL and databases.
Which query interface would you recommend?

A. Hive
B. Pig
C. Howl
D. HBase

Correct Answer: A
Section: (none)
Explanation

QUESTION 135
In linear regression, what indicates that an estimated coefficient is significantly different than zero?

A. A small p-value
B. R-squared near 1
C. R-squared near 0
D. The estimated coefficient is greater than 3

Correct Answer: A
Section: (none)
Explanation

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 46
EMC E20-007 Exam

QUESTION 136
Which graphical representation shows the distribution and multiple summary statistics of a continuous variable for each value of a corresponding
discrete variable?

A. box and whisker plot


B. dotplot
C. scatterplot
D. binplot

Correct Answer: A
Section: (none)
Explanation

QUESTION 137
Assume that you have a data frame in R. Which function would you use to display descriptive statistics about this variable?

A. summary
B. str
C. attributes
D. levels

Correct Answer: A
Section: (none)
Explanation

QUESTION 138
What is the mandatory Clause that must be included when using Window functions?

A. OVER
B. RANK
C. PARTITION BY
D. RANK BY

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Correct Answer: C
Section: (none)
Explanation

QUESTION 139
What is the purpose of the process step "parsing" in text analysis?

A. imposes a structure on the unstructured/semi-structured text for downstream analysis


B. performs the search and/or retrieval in finding a specific topic or an entity in a document
C. executes the clustering and classification to organize the contents
D. computes the TF-IDF values for all keywords and indices

Correct Answer: A
Section: (none)
Explanation

QUESTION 140
Which word or phrase completes the statement? A data warehouse is to a centralized database for reporting as an analytic sandbox is to a _______?

A. Collection of data assets for modeling


"First Test, First Pass" - www.lead2pass.com 47
EMC E20-007 Exam
B. Collection of low-volume databases
C. Centralized database of KPIs
D. Collection of data assets for ETL

Correct Answer: A
Section: (none)
Explanation

QUESTION 141
You do a Student's t-test to compare the average test scores of sample groups from populations A and B. Group A averaged 10 points higher than
group B. You find that this difference is significant, with a p-value of 0.03. What does that mean?

A. There is a 3% chance that you have identified a difference between the populations when in reality there is none.
B. The difference in scores between a sample from population A and a sample from population B will tend to be within 3% of 10 points.
C. There is a 3% chance that a sample group from population A will score 10 points higher that a sample group from population B.

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
D. There is a 97% chance that a sample group from population A will score 10 points higher that a sample group from population B.

Correct Answer: A
Section: (none)
Explanation

QUESTION 142
Which word or phrase completes the statement?
Business Intelligence is to ad-hoc reporting and dashboards as Data Science is to ______________ .

A. Optimization and Predictive Modeling


B. Alerts and Queries
C. Structured Data and Data Sources
D. Sales and profit reporting

Correct Answer: A
Section: (none)
Explanation

QUESTION 143
What is a property of window functions in SQL commands?

A. They can be used to calculate moving averages over various intervals.


B. They group rows into a single output row.
C. They can be used between the keywords FROM and WHERE in a SELECT command.
D. They don't require ordering of data within a window.

Correct Answer: A
Section: (none)
Explanation

QUESTION 144
You are attempting to find the Euclidean distance between two centroids:

Centroid A's coordinates: (X = 2, Y = 4)


Centroid B's coordinates (X = 8, Y = 10)

"First Test, FirstPass" - www.lead2pass.com 48

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
EMC E20-007 Exam

Which formula finds the correct Euclidean distance?

A. SQRT((2-8)2+(4-10)2) or 8.49
B. SQRT(((2-8) x 2) + ((4-10) x 2)) or 12.17
C. ((2-8)2+(4-10)2) or 72
D. ((2-8) x 2 + (4-10) x 2) or 148

Correct Answer: A
Section: (none)
Explanation

QUESTION 145
In data visualization, which type of chart is recommended to represent frequency data?

A. Line chart
B. Histogram
C. Q-Q chart
D. Scatterplot

Correct Answer: B
Section: (none)
Explanation

QUESTION 146
A data scientist is asked to implement an article recommendation feature for an on-line magazine. The magazine does not want to use client tracking
technologies such as cookies or reading history. Therefore, only the style and subject matter of the current article is available for making
recommendations. All of the magazine's articles are stored in a database in a format suitable for analytics.
Which method should the data scientist try first?

A. K Means Clustering
B. Naive Bayesian
C. Logistic Regression
D. Association Rules

Correct Answer: A
Section: (none)

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Explanation

QUESTION 147
How are window functions different from regular aggregate functions?

A. Rows retain their separate identities and the window function can access more than the current row.
B. Rows are grouped into an output row and the window function can access more than the current row.
C. Rows retain their separate identities and the window function can only access the current row.
D. Rows are grouped into an output row and the window function can only access the current row.

Correct Answer: A
Section: (none)
Explanation

QUESTION 148
Consider these itemsets:

(hat, scarf, coat)

"First Test, FirstPass" - www.lead2pass.com 49


EMC E20-007 Exam

(hat, scarf, coat, gloves)


(hat, scarf, gloves)
(hat, gloves)
(scarf, coat, gloves)

What is the confidence of the rule (hat, scarf) -> gloves?

A. 66%
B. 40%
C. 50%
D. 60%

Correct Answer: A
Section: (none)
Explanation

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
QUESTION 149
In the MapReduce framework, what is the purpose of the Map Function?

A. It processes the input and generates key-value pairs


B. It collects the output of the Reduce function
C. It sorts the results of the Reduce function
D. It breaks the input into smaller components and distributes to other nodes in the cluster

Correct Answer: A
Section: (none)
Explanation

QUESTION 150
You have completed your model and are handing it off to be deployed in production. What should you deliver to the production team, along with your
commented code?

A. The production team needs to understand how your model will interact with the processes they already support. Give them documentation on
expected model inputs and outputs,and guidance on error-handling.
B. The production team are technical,and they need to understand how the processes that they support work,so give them the same presentation that
you prepared for the analysts.
C. The production team supports the processes that run the organization,and they need context to understand how your model interacts with the
processes they already support. Give them the same presentation that you prepared for the project sponsor.
D. The production team supports the processes that run the organization,and they need context to understand how your model interacts with the
processes they already support. Give them the executive summary.

Correct Answer: A
Section: (none)
Explanation

QUESTION 151
While having a discussion with your colleague, this person mentions that they want to perform K- means clustering on text file data stored in HDFS.
Which tool would you recommend to this colleague?

A. Mahout
B. HBase
C. Scribe
"First Test, First Pass" - www.lead2pass.com 50
EMC E20-007 Exam

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
D. Sqoop

Correct Answer: A
Section: (none)
Explanation

QUESTION 152
Which method is used to solve for coefficients b0, b1, .., bn in your linear regression model :

Y = b0 + b1x1+b2x2+....+bnxn

A. Ordinary Least squares


B. Apriori Algorithm
C. Ridge and Lasso
D. Integer programming

Correct Answer: D
Section: (none)
Explanation

QUESTION 153
What describes a true limitation of Logistic Regression method?

A. It does not handle missing values well.


B. It does not handle redundant variables well.
C. It does not handle correlated variables well.
D. It does not have explanatory values.

Correct Answer: A
Section: (none)
Explanation

QUESTION 154
You submit a MapReduce job to a Hadoop cluster and notice that although the job was successfully submitted, it is not completing. What should you do?

A. Ensure that the TaskTracker is running.


B. Ensure that the JobTracker is running

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
C. Ensure that the NameNode is running
D. Ensure that a DataNode is running

Correct Answer: A
Section: (none)
Explanation

QUESTION 155
A disk drive manufacturer has a defect rate of less than 1.5% with 98% confidence. A quality assurance team samples 1000 disk drives and finds 14
defective units. Which action should the team recommend?

A. The manufacturing process is functioning properly and no further action is required


B. A larger sample size should be taken to determine if the plant is operating correctly
C. A smaller sample size should be taken to determine if the plant is operating correctly
D. There is a flaw in the quality assurance process and the sample should be repeated

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 51
EMC E20-007 Exam

QUESTION 156
Which word or phrase completes the statement? Data-ink ratio is to data visualization as __________ .

A. Confusion matrix is to classifier


B. Data scientist is to big data
C. Seasonality is to ARIMA
D. K-means is to Naive Bayes

Correct Answer: A
Section: (none)
Explanation

QUESTION 157
Consider a database with 4 transactions:

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Transaction 1: {cheese, bread, milk}
Transaction 2: {soda, bread, milk}
Transaction 3: {cheese, bread}
Transaction 4: {cheese, soda, juice}

You decide to run the association rules algorithm where minimum support is 50%. Which rule has a confidence at least 50%?

A. {cheese} => {bread}


B. {juice} => {cheese}
C. {milk} => {soda}
D. {soda} => {milk}

Correct Answer: A
Section: (none)
Explanation

QUESTION 158
You are using the Apriori algorithm to determine the likelihood that a person who owns a home has a good credit score. You have determined that the
confidence for the rules used in the algorithm is > 75%. You calculate lift = 1.011 for the rule, "People with good credit are homeowners". What can you
determine from the lift calculation?

A. Support for the association is low


B. Leverage of the rules is low
C. The rule is coincidental
D. The rule is true

Correct Answer: C
Section: (none)
Explanation

QUESTION 159
Consider a database with 4 transactions:

Transaction 1: {cheese, bread, milk}


Transaction 2: {soda, bread, milk}
Transaction 3: {cheese, bread}
Transaction 4: {cheese, soda, juice}

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
"First Test, First Pass" - www.lead2pass.com 52
EMC E20-007 Exam

The minimum support is 25%. Which rule has a confidence equal to 50%?

A. {bread,milk} => {cheese}


B. {bread} => {milk}
C. {juice} => {soda}
D. {bread} => {cheese}

Correct Answer: D
Section: (none)
Explanation

QUESTION 160
Under which circumstance do you need to implement N-fold cross-validation after creating a regression model?

A. There is not enough data to create a test set.


B. The data is unformatted.
C. There are missing values in the data.
D. There are categorical variables in the model.

Correct Answer: A
Section: (none)
Explanation

QUESTION 161
Refer to the Exhibit. In the Exhibit. For effective visualization, what is the chart's primary flaw?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. The use of 3 dimensions.
B. The slanting of axis labels.
C. The location of the legend.
D. The order of the columns.

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 53

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
EMC E20-007 Exam

QUESTION 162
Refer to the exhibit. You have plotted the distribution of savings account sizes for your bank. How would you proceed, based on this distribution?

A. The data is extremely skewed. Replot the data on a logarithmic scale to get a better sense of it.
B. The data is extremely skewed,but looks bimodal; replot the data in the range 2,500-10,000 to be sure.
C. The accounts of size greater than 2500 are rare,and probably outliers. Eliminate them from your future analysis.
D. The data is extremely skewed. Split your analysis into two cohorts: accounts less than 2500,and accounts greater than 2500

Correct Answer: A
Section: (none)
Explanation

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
QUESTION 163
In the MapReduce framework, what is the purpose of the Reduce function?

A. It aggregates the results of the Map function and generates processed output
B. It distributes the input to multiple nodes for processing
C. It writes the output of the Map function to storage
D. It breaks the input into smaller components and distributes to other nodes in the cluster

Correct Answer: A
Section: (none)
Explanation

QUESTION 164
Refer to the exhibit. In the exhibit, a correlogram is provided based on an autocorrelation analysis of a sample dataset. What can you conclude based
only on this exhibit?

"First Test, First Pass" - www.lead2pass.com 54


EMC E20-007 Exam

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. There appears to be no structure left to model in the data
B. There appears to be a seasonal component in the data

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
C. Lag 1 has a significant autocorrelation
D. There appears to be a cyclical component in the data

Correct Answer: A
Section: (none)
Explanation

QUESTION 165
Which analytical method is considered unsupervised?

A. K-means clustering
B. Na?ve Bayesian classifier
C. Decision tree
D. Linear regression

Correct Answer: A
Section: (none)
Explanation

QUESTION 166
"First Test, First Pass" - www.lead2pass.com 55
EMC E20-007 Exam

Refer to the exhibit. In the exhibit, the x-axis represents the derived probability of a borrower defaulting on a loan. Also in the exhibit, the pink represents
borrowers that are known to have not defaulted on their loan, and the blue represents borrowers that are known to have defaulted on their loan.
Which analytical method could produce the probabilities needed to build this exhibit?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Logistic Regression
B. Linear Regression
C. Discriminant Analysis

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
D. Association Rules

Correct Answer: A
Section: (none)
Explanation

QUESTION 167
Refer to the exhibit. You have created a density plot of purchase amounts from a retail website as shown. What should you do next?

"First Test, First Pass" - www.lead2pass.com 56


EMC E20-007 Exam

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Recreate the plot using the barplot() function
B. Use the rug() function to add elements to the plot
C. Recreate the density plot using a log normal distribution of the purchase amount data
D. Reduce the sample size of the purchase amount data used to create the plot

Correct Answer: C
Section: (none)
Explanation

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
QUESTION 168
You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. When have you completed the analytics
lifecycle?

A. You have written documentation,and the code has been handed off to the Data Base Administrator and business operations.
B. You have a completely developed model,and the results have shown statistically acceptable results.
C. You have presented the results of the model to both the internal analytics team and the business owner of the project.
D. You have a completely developed model based on both a sample of the data and the entire set of data available.

Correct Answer: A
Section: (none)
Explanation

QUESTION 169
Refer to the exhibit. Click on the calculator icon in the upper left corner. An analyst is searching a corpus of documents for the topic "solid state disk". In
the Exhibit, Table A provides the inverse

"First Test, First Pass" - www.lead2pass.com 57


EMC E20-007 Exam

document frequency for each term across the corpus. Table B provides each term's frequency in four documents selected from corpus. Which of the
four documents is most relevant to the analyst's search?

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
A. Document B
B. Document A
C. Document C
D. Document D

Correct Answer: A
Section: (none)
Explanation

QUESTION 170

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
Refer to the exhibit. Click on the calculator icon in the upper left corner. You are going into a meeting where you know your manager will have a question
on your dataset -- specifically relating to customers that are classified as renters with good credit status. In order to prepare for the meeting, you create a
rule: RENTER => GOOD CREDIT. What is the confidence of the rule?

A. 63%
B. 41%
C. 18%
"First Test, First Pass" - www.lead2pass.com 58
EMC E20-007 Exam
D. 73%

Correct Answer: A
Section: (none)
Explanation

QUESTION 171
Which data asset is an example of quasi-structured data?

A. Webserver log
B. XML data file
C. Database table
D. News article

Correct Answer: A
Section: (none)
Explanation

QUESTION 172
What would be considered "Big Data"?

A. An OLAP Cube containing customer demographic information about 100,000,000 customers


B. Daily Log files from a web server that receives 100,000 hits per minute

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn
C. Aggregated statistical data stored in a relational database table
D. Spreadsheets containing monthly sales data for a Global 100 corporation

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
"First Test, First Pass" - www.lead2pass.com 59
About Lead2pass.com

Lead2pass.com was founded in 2006. We provide latest & high quality IT Certification Training Exam Questions, Study Guides, Practice Tests. Lead the
way to help you pass any IT Certification exams, 100% Pass Guaranteed or Full Refund. Especially Cisco, CompTIA, Citrix, EMC, HP, Oracle, VMware,
Juniper, Check Point, LPI, Nortel, EXIN and so on.

Our Slogan: First Test, First Pass.

Help you to pass any IT Certification exams at the first try.

You can reach us at any of the email addresses listed below.

Sales: sales@lead2pass.com

Support: support@lead2pass.com

Technical Assistance Center: technology@lead2pass.com

Any problems about IT certification or our products, you could rely upon us, we will give you satisfactory answers in 24 hours.

Our Official: http://www.Lead2pass.com

www.vceplus.com - Website designed to help IT pros advance their careers - Born to learn

You might also like