Big Data Exercises

Chapter 1
Introduction to Big Data Analytics
1. What are the three characteristics of Big Data, and what are the main considerations in processing Big Data?
Big data is characterized by Volume, Variety, and Velocity, each of which presents its own challenges.
Volume — Growing well beyond terabytes, big data can entail billions of rows and millions of columns. Data of
this size cannot efficiently be accommodated by traditional infrastructure or RDBMS.
Variety — Data that comes in many forms, not just well-structured tables with rows and columns. Some
unstructured data examples include: video files, audio files, XML, and free text. Traditional RDBMS provide little
support for these data types.
Velocity — Data that is collected and analyzed in real time. Often, this type of data is time sensitive and its
value diminishes with time. This type of data may require in-memory data grids to accommodate the real-time
nature of this data.
The main considerations in processing Big Data are how to store and analyze the data cost-effectively and
efficiently. Often new tools and technologies (e.g. Hadoop) are necessary to accomplish these goals.
Chapter 3
Review of Basic Data Analytics Methods Using R
1. How many levels does fdata contain in the following R code?

data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)
fdata = factor(data)
fdata contains three levels: 1 2 3
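The answer can be checked directly. The following is a small sketch using only base R: levels() lists the distinct values and nlevels() counts them.

```r
data <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)
fdata <- factor(data)

levels(fdata)   # the distinct values, stored as character labels
nlevels(fdata)  # the number of levels
table(fdata)    # how many observations fall in each level
```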
2. Two vectors, v1 and v2, are created with the following R code:
v1 <- 1:5
v2 <- 6:2
What are the results of cbind(v1,v2) and rbind(v1,v2)?
cbind() combines the vectors column-wise and rbind() combines them row-wise. In both cases the two
vectors are merged into a matrix.
https://bookdown.org/ndphillips/YaRrr/creating-matrices-and-dataframes.html
cbind(v1, v2)
     v1 v2
[1,]  1  6
[2,]  2  5
[3,]  3  4
[4,]  4  3
[5,]  5  2
rbind(v1, v2)
   [,1] [,2] [,3] [,4] [,5]
v1    1    2    3    4    5
v2    6    5    4    3    2
3. What R command(s) would you use to remove null values from a dataset?
na.omit(data) - returns the object with incomplete cases (rows containing NA) removed
is.na() - provides a test for missing values
na.exclude() - returns the object with incomplete cases removed; unlike na.omit(), residuals and
predictions from a model fitted with it are padded with NA back to the original length
https://statisticsglobe.com/na-omit-r-example/
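As an illustration, here is a minimal sketch of these functions on a made-up data frame. The data frame df and its columns are hypothetical and chosen only to show the behavior.

```r
# Hypothetical data frame with missing values, for illustration only
df <- data.frame(id    = 1:5,
                 score = c(10, NA, 30, 40, NA),
                 group = c("a", "b", NA, "a", "b"))

is.na(df$score)        # logical vector flagging the missing scores
colSums(is.na(df))     # number of missing values per column

clean <- na.omit(df)   # drop every row that contains at least one NA
nrow(clean)

# Drop rows only when one specific column is missing
clean_score <- df[!is.na(df$score), ]
nrow(clean_score)
```

na.omit() is the blunt option, since it discards a row if any column is missing. Filtering with is.na() gives control over which columns matter.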
4. What R command can be used to install an additional R package?
The function install.packages() is used to install an R package. For example, install.packages("ggplot2")
would install the ggplot2 package.
5. What R function is used to encode a vector as a category?

factor() is used to encode a vector as a category, as in question 1:
fdata = factor(data)
6. What is a rug plot used for in a density plot?
The rug() function adds a one-dimensional density plot along the bottom of the graph, drawn as one tick
mark per observation, to emphasize the distribution of the observations.
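A short sketch of a density plot with a rug, using simulated values. The seed, sample size, and distribution parameters are arbitrary assumptions.

```r
set.seed(42)                           # arbitrary seed so the sketch is repeatable
x <- rnorm(200, mean = 50, sd = 10)    # simulated observations (assumption)

d <- density(x)                        # kernel density estimate
plot(d, main = "Density with rug")
rug(x)                                 # one tick per observation along the x-axis
```

Where the ticks are dense the curve is high, so the rug shows how many observations actually support each part of the smoothed curve.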
7. How many sections does a box-and-whisker divide the data into? What are these sections?
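A box-and-whisker plot divides the data into four sections, each holding roughly a quarter of the
observations: the lower whisker, the lower part of the box (first quartile to median), the upper part of
the box (median to third quartile), and the upper whisker. The "box" shows the range that contains the
central 50% of the data, and the line inside the box is the location of the median value. The lower and
upper hinges of the box correspond to the first and third quartiles of the data. The upper whisker extends
from the upper hinge to the highest value that is within 1.5 * IQR of the hinge. The lower whisker extends
from the lower hinge to the lowest value within 1.5 * IQR of the hinge. Points outside the whiskers are
considered possible outliers.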
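The numbers behind the plot can be inspected with boxplot.stats(). The sketch below uses a made-up vector with one extreme value. The data are an assumption for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # made-up values with one extreme point

bs <- boxplot.stats(x, coef = 1.5)
bs$stats   # lower whisker end, lower hinge, median, upper hinge, upper whisker end
bs$out     # points beyond the whiskers (possible outliers)

boxplot(x, horizontal = TRUE)           # the corresponding plot
```

Here the hinges are 3 and 8, so the whiskers may reach at most 1.5 * 5 = 7.5 beyond them. The value 30 lies outside that limit and is reported in bs$out.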
8. What attributes are correlated according to Figure 3-18? How would you describe their relationships?
According to the scatterplot, within certain species, there is a high correlation between:
• sepal.length and sepal.width (setosa)
• sepal.length and petal.length (versicolor and virginica)
• sepal.width and petal.length (versicolor)
• sepal.width and petal.width (versicolor)
• petal.width and petal.length (versicolor and virginica)
The relationship between these attributes is a linear relationship. The correlations can be determined
using the cor() function.
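A sketch of the cor() check, assuming Figure 3-18 is drawn from R's built-in iris data set, where the columns are named Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.

```r
data(iris)   # Fisher's iris data from the datasets package

# Correlation matrix of the four measurements within each species
cor_by_species <- lapply(split(iris[, 1:4], iris$Species), cor)
round(cor_by_species$setosa, 2)
round(cor_by_species$versicolor, 2)

# A single pair: sepal length vs. sepal width for setosa only
r_setosa <- cor(iris$Sepal.Length[iris$Species == "setosa"],
                iris$Sepal.Width[iris$Species == "setosa"])
r_setosa
```

Computing the correlation per species matters here, because pooling all species can hide or distort a relationship that holds within each group.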
9. What function can be used to fit a nonlinear line to the data?

The loess() function, together with the predict() function, can be used to fit a nonlinear curve to data.
Alternatively, nls(formula, data, start) fits a nonlinear least-squares model
(https://www.tutorialspoint.com/r/r_nonlinear_least_square.htm).
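A minimal loess() sketch on simulated data. The sine-shaped signal, the noise level, and span = 0.5 are assumptions chosen for illustration.

```r
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.2)     # simulated noisy nonlinear data

fit <- loess(y ~ x, span = 0.5)        # local regression; span controls smoothness

# Predict at new x values inside the observed range
new_x <- data.frame(x = c(1, 3, 5, 7, 9))
pred <- predict(fit, newdata = new_x)
pred

plot(x, y)
lines(x, predict(fit), col = "red")    # fitted curve over the raw points
```

A smaller span follows the data more closely, while a larger one gives a smoother curve.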
10. If a graph of data is skewed and all the data is positive, what mathematical technique may be used to
help detect structures that might otherwise be overlooked?
If the data is skewed and positive, viewing the logarithm of data can help detect structures that might
otherwise be overlooked in a graph with a non-logarithmic scale.
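The effect can be sketched with simulated log-normal data. The helper skewness() is defined here for illustration and is not a base R function.

```r
set.seed(7)
x <- rlnorm(1000, meanlog = 3, sdlog = 1)   # simulated right-skewed, positive data

# Simple moment-based sample skewness (hypothetical helper)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)        # strongly positive on the raw scale
skew_log <- skewness(log(x))   # much closer to zero after the transform

par(mfrow = c(1, 2))
hist(x, main = "Raw scale")
hist(log(x), main = "Log scale")
```

On the raw scale most values pile up near zero. On the log scale the bulk of the data spreads out and its shape becomes visible.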
11. What is a type I error? What is a type II error? Is one always more serious than the other? Why?
A type I error is the rejection of the null hypothesis when the null hypothesis is true.
A type II error is the failure to reject the null hypothesis when the null hypothesis is false.
Committing one error is not necessarily more serious than the other; which is worse depends on the cost
of each mistake in the application. Given the underlying assumptions, the probability of a type I error
(the significance level) can be set up front before any data is collected. For a given deviation from the
null hypothesis, the probability of a type II error can be reduced to an acceptable level by using a large
enough sample size.
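The link between sample size and type II error can be sketched with power.t.test() from the stats package. The effect size (delta = 0.5), standard deviation, and sample sizes are assumptions for illustration.

```r
alpha <- 0.05   # type I error probability, fixed before collecting data

# Power of a two-sample t-test for two different per-group sample sizes
power_small <- power.t.test(n = 20,  delta = 0.5, sd = 1, sig.level = alpha)$power
power_large <- power.t.test(n = 100, delta = 0.5, sd = 1, sig.level = alpha)$power

beta_small <- 1 - power_small   # type II error probability with n = 20
beta_large <- 1 - power_large   # type II error probability with n = 100

# Per-group sample size needed to reach 90% power for the same deviation
n_needed <- power.t.test(delta = 0.5, sd = 1, sig.level = alpha, power = 0.9)$n
c(beta_small = beta_small, beta_large = beta_large, n_needed = n_needed)
```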
12. Suppose everyone who visits a retail website gets one promotional offer or no promotion at all. We
want to see if making a promotional offer makes a difference. What statistical method would you
recommend for this analysis?
Let's assume that the objective is to compare whether or not a person receiving an offer will spend
more than someone who does not receive an offer. If normality of the purchase amount distribution is
a reasonable assumption, the Student's t test could be used. Otherwise, a non-parametric test such as
the Wilcoxon rank-sum test could be applied.
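Both tests can be sketched on simulated purchase amounts. The group sizes, means, and standard deviation are made up. Note that t.test() runs Welch's unequal-variance version by default.

```r
set.seed(123)
offer    <- rnorm(200, mean = 55, sd = 15)   # simulated spend with an offer (assumption)
no_offer <- rnorm(200, mean = 50, sd = 15)   # simulated spend without an offer (assumption)

# One-sided test: do customers with an offer spend more on average?
tt <- t.test(offer, no_offer, alternative = "greater")

# Non-parametric alternative when normality is doubtful
wt <- wilcox.test(offer, no_offer, alternative = "greater")

c(t_test = tt$p.value, wilcoxon = wt$p.value)
```

Passing var.equal = TRUE to t.test() gives the classical Student's t-test with a pooled variance.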
13. You are analyzing two normally distributed populations, and your null hypothesis is that the mean μ1
of the first population is equal to the mean μ2 of the second. Assume the significance level is set at
0.05. If the observed p-value is 4.33e-05, what will be your decision regarding the null hypothesis?
The p-value of 0.0000433 is less than 0.05. Therefore, the decision is to reject the null hypothesis.
14- A local retailer has a database that stores 10,000 transactions of last summer. After analyzing the
data, a data science team has identified the following statistics:
• {battery} appears in 6,000 transactions.
• {sunscreen} appears in 5,000 transactions.
• {sandals} appears in 4,000 transactions.
• {bowls} appears in 2,000 transactions.
• {battery,sunscreen} appears in 1,500 transactions.
• {battery,sandals} appears in 1,000 transactions.
• {battery,bowls} appears in 250 transactions.
• {battery,sunscreen,sandals} appears in 600 transactions.
Answer the following questions:
1- What are the support values of the preceding itemsets?

2- What are the confidence values of {battery}→{sunscreen} and {battery,sunscreen}→{sandals}?
Which of the two rules is more interesting?
3- List all the candidate rules that can be formed from the statistics. Which rules are considered
interesting at the minimum confidence 0.25? Out of these interesting rules, which rule is
considered the most useful (that is, least coincidental)?
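The arithmetic for the first two questions can be sketched in base R from the counts given above. Support is the itemset count divided by the 10,000 transactions, and confidence of X→Y is support(X and Y) / support(X). Using lift to judge which rule is more interesting is an assumption here, since the exercise does not name a measure. The confidence() helper is hypothetical.

```r
n_transactions <- 10000

# Itemset counts as stated in the exercise
counts <- c("battery" = 6000, "sunscreen" = 5000, "sandals" = 4000, "bowls" = 2000,
            "battery,sunscreen" = 1500, "battery,sandals" = 1000,
            "battery,bowls" = 250, "battery,sunscreen,sandals" = 600)

support <- counts / n_transactions
support

# Confidence of lhs -> rhs, where `both` is the itemset containing lhs and rhs
confidence <- function(lhs, both) support[[both]] / support[[lhs]]

conf_1 <- confidence("battery", "battery,sunscreen")                    # {battery} -> {sunscreen}
conf_2 <- confidence("battery,sunscreen", "battery,sunscreen,sandals")  # {battery,sunscreen} -> {sandals}

# Lift = confidence / support of the right-hand side (assumed interestingness measure)
lift_1 <- conf_1 / support[["sunscreen"]]
lift_2 <- conf_2 / support[["sandals"]]
c(conf_1 = conf_1, conf_2 = conf_2, lift_1 = lift_1, lift_2 = lift_2)
```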
15-
(a) Briefly describe how the K-means clustering algorithm works.
(b) You are to cluster eight points: x1=(2, 10), x2=(2, 5), x3=(8, 4), x4=(5, 8), x5=(7, 5), x6=(6, 4), x7=(1, 2),
and x8=(4, 9). Suppose you assigned x1, x4, and x7 as initial cluster centers for K-means clustering (k=3).
Using K-means, compute the three clusters for each round of the algorithm until convergence.
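The rounds can be traced with a hand-written sketch of the standard assign-then-update iteration (Lloyd's algorithm). It assumes Euclidean distance, which the exercise does not state. The variable names are illustrative.

```r
# The eight points from the exercise, one row per point
pts <- rbind(x1 = c(2, 10), x2 = c(2, 5), x3 = c(8, 4), x4 = c(5, 8),
             x5 = c(7, 5),  x6 = c(6, 4), x7 = c(1, 2), x8 = c(4, 9))

# Initial centers x1, x4, x7 as stated in the exercise
centers <- unname(pts[c("x1", "x4", "x7"), ])
assignment <- rep(0L, nrow(pts))
rounds <- 0

repeat {
  # Assignment step: squared Euclidean distance from every point to every center
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(pts, 2, centers[k, ])^2)
  })
  new_assignment <- unname(apply(d2, 1, which.min))
  if (identical(new_assignment, assignment)) break   # converged: no point moved
  assignment <- new_assignment
  rounds <- rounds + 1

  # Update step: each center becomes the mean of the points assigned to it
  centers <- t(sapply(seq_len(nrow(centers)), function(k) {
    colMeans(pts[assignment == k, , drop = FALSE])
  }))
  cat("Round", rounds, "assignment:", assignment, "\n")
}

split(rownames(pts), assignment)   # final cluster membership
centers                            # final cluster centers
```

For real work the built-in kmeans() function, for example with algorithm = "Lloyd" and the same starting centers, is the usual choice. The explicit loop is used here only to expose each round.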
16-
(a) Write the Apriori algorithm.
(b) Following is a list of five transactions that include items A, B, C, and D:

• T1 : { A,B,C }
• T2 : { A,C }
• T3 : { B,C }
• T4 : { A,D }
• T5 : { A,C,D }
Which itemsets satisfy the minimum support of 0.5? (Hint: An itemset may include more than one
item.)
(c) Trace the results of using the Apriori algorithm on the grocery store example with support threshold
s=33.34% and confidence threshold c=60%. Show the candidate and frequent itemsets for each
database scan. Enumerate all the final frequent itemsets. Also indicate the association rules that are
generated and highlight the strong ones, sorted by confidence.
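For part (b), the level-wise search can be sketched in base R on the five transactions T1 to T5. This is a simplified illustration, not a full Apriori implementation: candidates of size k+1 are built from all combinations of the items that survive level k, which is a superset of what Apriori's join-and-prune step would generate. The support() helper is hypothetical.

```r
# The five transactions from part (b)
transactions <- list(T1 = c("A", "B", "C"), T2 = c("A", "C"), T3 = c("B", "C"),
                     T4 = c("A", "D"),      T5 = c("A", "C", "D"))
min_support <- 0.5

# Fraction of transactions that contain every item of the itemset
support <- function(itemset) {
  mean(sapply(transactions, function(t) all(itemset %in% t)))
}

frequent <- list()
k <- 1
candidates <- as.list(sort(unique(unlist(transactions))))   # all 1-itemsets

while (length(candidates) > 0) {
  sup <- sapply(candidates, support)
  keep <- candidates[sup >= min_support]     # frequent k-itemsets
  if (length(keep) == 0) break
  frequent <- c(frequent, keep)

  # Only items that still appear in a frequent k-itemset can be extended
  survivors <- sort(unique(unlist(keep)))
  k <- k + 1
  if (length(survivors) < k) break
  candidates <- combn(survivors, k, simplify = FALSE)
}

labels <- sapply(frequent, paste, collapse = ",")
data.frame(itemset = labels, support = sapply(frequent, support))
```

In practice the apriori() function from the arules package is the common tool for this. The base R loop is used here only to keep the sketch self-contained.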
