
1. Data on YouTube advertising budgets and the respective sales earnings were collected for 200 observations.

You are asked to analyze whether increasing the advertising budget would increase the sales. The following dataset is given to you:
str(marketing)
'data.frame': 200 obs. of 4 variables:

$ youtube : num 276.1 53.4 20.6 181.8 217 ...

$ sales : num 26.5 12.5 11.2 22.2 15.5 ...


The youtube column shows the advertising budget spent, and the sales column shows the earnings. All these numbers are in thousands of dollars.

The first thing that you need to do is [Answer: determine whether there is a strong relationship between the spending on YouTube advertisement service and the sales earning].

Therefore, you plotted the advertising budget and its respective sales:

After that, you ran a [Answer: Pearson correlation test], and the result is R = 0.7822.

Based on the value of R, you know that [Answer: there is a strong positive relationship between the spending on YouTube advertisement service and the sales earning]. Therefore, you decided to summarize and study the relationship between the advertising budget and the sales by using [Answer: Simple Linear Regression]. The results came out as follows:

Call:
lm(formula = sales ~ youtube, data = marketing)

Residuals:
     Min       1Q   Median       3Q      Max
-10.0632  -2.3454  -0.2295   2.4805   8.6548

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.439112   0.549412   15.36   <2e-16 ***
youtube     0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.91 on 198 degrees of freedom
Multiple R-squared:  0.6119,  Adjusted R-squared:  0.6099
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16


From the model, you could conclude that:

For an advertising budget that equals to zero, a company may expect a sale of USD 8,440.

The budget spent for advertising is not a significant predictor of sales.

For each dollar spent for advertising, a company could expect a sales earning of USD 47.537.

the intercept 8.439 is a strong predictor of the model


For each dollar spent for advertising, a company could expect a sales earning of USD 8,440.

Based on the model, for a company that spent USD 1,000 for YouTube advertising, the company could expect sales earning of USD [Answer: 55976.112]
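
The recorded answer can be traced back to the fitted equation sales = 8.439112 + 0.047537 × youtube. The following is a minimal sketch in R, assuming the budget is entered as youtube = 1000 and the result is then scaled from thousands of dollars to dollars. This reading of the recorded answer is an assumption, and the variable names are illustrative only.

```r
# Coefficients copied from the lm() summary above
intercept <- 8.439112
slope     <- 0.047537

# Assumption: the budget is plugged in as youtube = 1000
youtube <- 1000
predicted_thousands <- intercept + slope * youtube

# Scale from thousands of dollars to dollars
predicted_usd <- predicted_thousands * 1000
print(predicted_usd)
```

Since both columns are stated to be in thousands of dollars, entering the budget as youtube = 1 would instead give 8.439112 + 0.047537 = 8.486649 (thousand dollars), so the units are worth checking before relying on the figure.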

2. Which function(s) is (are) used to store a workspace in R?

Select one or more:

a. write.csv()

b. save.image()

c. save.csv()

d. save()

e. saveRDS()

3. A statistical test that compares observations against their expected (theoretical) frequencies to check how well they fit is the

Select one:

a. ANOVA

b. Wilcoxon Rank Sum Test

c. t-test

d. Test of Independence

e. Goodness-of-Fit

4. You are interested in learning the favorite programming languages of first-year Indonesian Informatics and/or Computer Science undergraduate students. To achieve this mission, you asked your high school classmates who were admitted to the specified program.

a. The data are collected properly and bias is minimized. [Answer: TRUE]
b. Because a variable is a characteristic of each individual on which data is collected, which of the
following are variables that suit well with the research question?

number of students who chose particular programming as their favorite one

chosen programming language

gender

the respondent's final score in algorithm and programming course

c. Which chart or graph would be appropriate to display the concerned variable(s)?

a boxplot

a time plot

a bar graph

a pie chart

5. Which one of the following is best treated as an ordinal variable?

Select one:

a. salary

b. educational degree

c. phone number

d. city of residence

e. hair color

6. A total of 1,341 undergraduate students were surveyed to learn the preferred teaching-and-learning method of all UNSRAT students. There are three teaching-and-learning methods: online, offline, or blended. The answers were then tabulated, and the frequency of each method is presented in the report. Match each item/condition from the example above with the right term!

Statistics: [Answer: frequency]

Population: [Answer: UNSRAT students]

Samples: [Answer: the surveyed 1,341 undergraduate students]

Variable: [Answer: preferred teaching-and-learning method]

7. What can be learned from a histogram and/or a stem-and-leaf display?

Select one or more:

a. Data distribution

b. Central tendency

c. Data symmetricity

d. Distribution gap

e. Outliers

8. You are assigned to study whether there is a relationship between video game platforms and video game genres. You have a dataset with the following structure:

str(vgs)
'data.frame': 11857 obs. of 11 variables:
$ Rank : int 1 3 4 8 11 12 14 15 16 17 ...
$ Name : Factor w/ 8427 levels ".hack: Sekai no Mukou ni + Versus",..: 8048 4013 8049 8046
5006 4012 8042 8043 3598 2681 ...
$ Platform : Factor w/ 10 levels "DS","PC","PS",..: 8 8 8 8 1 1 8 8 9 5 ...
$ Year : Factor w/ 29 levels "1985","1988",..: 16 18 19 16 15 15 17 19 20 23 ...
$ Genre : Factor w/ 10 levels "Action","Adventure",..: 9 5 9 4 8 5 9 9 4 1 ...
$ Publisher : Factor w/ 467 levels "10TACLE Studios",..: 297 297 297 297 297 297 297 297 266
402 ...
$ NA_Sales : num 41.49 15.85 15.75 14.03 9.07 ...
$ EU_Sales : num 29 12.9 11 9.2 11 ...
$ JP_Sales : num 3.77 3.79 3.28 2.93 1.93 4.13 3.6 2.53 0.24 0.97 ...
$ Other_Sales : num 8.46 3.31 2.96 2.85 2.75 1.92 2.15 1.79 1.67 4.14 ...
$ Global_Sales: num 82.7 35.8 33 29 24.8 ...
You explored the data by making a barplot that shows the grouped distribution, and it came out as follows:

To achieve the goal of the study, you create a [Answer: Cross-tabulation] table.

         Genre
Platform Action Adventure Fighting Misc Racing Role-Playing Shooter Simulation
DS          343       240       36  393     67          200      42        285
PC          165        65        6   24     60          104     148        115
PS          157        69      108   76    145           97      96         60
PS2         348       196      150  222    216          187     160         90
PS3         380        74       76  124     92          119     156         31
PS4         122        19       17   15     17           47      34          5
PSP         222       213       74  106     65          192      37         29
Wii         238        84       42  280     94           35      66         87
X360        324        47       65  126    105           76     203         40
XB          155        26       48   46    123           23     132         24

         Genre
Platform Sports Strategy
DS          148       79
PC           49      188
PS          222       70
PS2         400       71
PS3         213       24
PS4          43        5
PSP         135       60
Wii         261       25
X360        220       28
XB          170       21

With a 95% confidence level, you ran a [Answer: Chi-square Test of Independence], and the result came out as follows:

data: [HIDDEN]

X-squared = 2873.1, df = 81, p-value < 2.2e-16


Therefore, according to the result, it can be concluded that:

All platforms share the same number of video game publications in each genre.

There are certain video game genres that are commonly published for specific platforms.

There is a significant relationship between video-game platforms and video-game genres.

There is no significant relationship between genres and the platform used.
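
The reported df = 81 can be checked against the factor levels listed in str(vgs). The following is a small sketch in R using the usual (rows - 1) × (columns - 1) rule for a test of independence; the variable names are illustrative and not part of the original analysis.

```r
# Number of levels copied from the str(vgs) output above
n_platform  <- 10
n_genre     <- 10
n_publisher <- 467

# Degrees of freedom for a test of independence: (rows - 1) * (cols - 1)
df_platform_genre  <- (n_platform - 1) * (n_genre - 1)
df_publisher_genre <- (n_publisher - 1) * (n_genre - 1)

# Critical value at the 5% significance level for the reported df
critical_value <- qchisq(0.95, df = df_platform_genre)

print(df_platform_genre)        # 81, matching the reported df
print(2873.1 > critical_value)  # TRUE, so independence is rejected
```

The 81 degrees of freedom match a 10 × 10 Platform by Genre table; a table built on Publisher (467 levels) would give a much larger value.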

9. The following table contains a subset of the results from a survey about how first-year UNSRAT undergraduate students access e-Learning.

questionnaire_code  program      access_mean
STU001              informatics  personal notebook/PC
STU002              civil        shared notebook/PC
STU003              law          NA
STU004              medical      personal tablet

Match each item/condition from the example above with the right term!

Element: [Answer: STU001]

Observation: [Answer: personal notebook/PC]

Variable: [Answer: access_mean]

10. Background
You are assigned to analyze a dataset that contains the scores of students in a class. There are 3 quizzes given to them. Is there any difference in scores between the quizzes?

Data Exploration
The structure of the dataset is as follows:

str(qrt)
'data.frame': 105 obs. of 3 variables:
$ student: Factor w/ 35 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...

$ score : num 0 23.8 26 20.4 12.1 ...

$ test : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...

You then plotted a [Answer: boxplot], and the result came out as follows:

Since you were comparing scores in three different quizzes with the same participants (students), you need the proper method. Therefore, you need to decide which method to use. So you start with checking the [Answer: distribution normality] of the score in each group by using the [Answer: Shapiro-Wilk Test]. The p-value for each tested group is shown in the following table:

quiz p-value

1 0.001545939

2 0.016061633

3 0.003481896
Statistical Tests

Based on it, you decide to use a [Answer: non-parametric method] to find out whether there is any difference between those quizzes. Due to the nature of the problem, you ran a [Answer: Friedman Test], and the result is as follows:

data: score and test and student

[HIDDEN] chi-squared = 30.778, df = 2, p-value = 2.073e-07


Based on the result, you decide to [Answer: continue with further tests to find in which quiz students tend to achieve better scores]. Therefore, you ran a [Answer: Pairwise Wilcoxon RST], and the results came out as follows:

[NAME of TEST HIDDEN]

data:  score and test

  1       2
2 0.00060 -
3 2.1e-05 0.00046

P value adjustment method: bonferroni


Conclusion
Based on the results of the statistical tests, then you conclude that

There is no significantly different score achievements in all three quizzes

Students tend to achieve higher scores in Test 3, followed with Test 2, yet the differences are not
significant

Scores in Test 1 and 3 are significantly different, but Tests 1 and 2 are not, and so with Tests 2 and 3

Students tend to achieve significantly higher scores in Test 3, followed with Test 2. Test 1 scores are significantly lower than the other two tests.
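
The sequence of tests in this question (normality check, Friedman test, pairwise follow-up) can be sketched in base R as below. The data are simulated for illustration and are not the quiz data, so the numbers will not match the output above. Using paired = TRUE in the pairwise step is an assumption made here because the same students sit all three quizzes; the original call is hidden.

```r
# Hypothetical data: 35 students, 3 quizzes (simulated values, not the quiz data)
set.seed(1)
n <- 35
qrt <- data.frame(
  student = factor(rep(seq_len(n), times = 3)),
  test    = factor(rep(1:3, each = n)),
  score   = c(rexp(n, 1 / 15), rexp(n, 1 / 25), rexp(n, 1 / 35))
)

# 1. Shapiro-Wilk normality check of the scores within each quiz
normality_p <- tapply(qrt$score, qrt$test, function(x) shapiro.test(x)$p.value)

# 2. Friedman test: score compared across quizzes, blocked by student
fr <- friedman.test(qrt$score, qrt$test, qrt$student)

# 3. Pairwise Wilcoxon tests with Bonferroni adjustment
#    (paired = TRUE is an assumption; rows are ordered by student within quiz)
pw <- pairwise.wilcox.test(qrt$score, qrt$test,
                           paired = TRUE,
                           p.adjust.method = "bonferroni")

print(normality_p)
print(fr)
print(pw)
```

A small p-value in step 1 points to a non-parametric method, a small p-value in step 2 justifies the follow-up, and step 3 shows which pairs of quizzes differ.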


11. You are interested in knowing the percentage of each means by which first-year UNSRAT undergraduate students access e-Learning. To estimate the percentage, you survey 500 randomly selected students and determine which means the students use.
Match each item/condition from the example above with the right term!
Population: [Answer: First year UNSRAT undergraduate students]

Statistics: [Answer: Percentage]

Samples: [Answer: 500 randomly selected first year UNSRAT undergraduate students]

Parameter: [Answer: The means used to access e-Learning]
12. Background
You are assigned to analyze a dataset that contains the performance score measures of participants at two time points. The aim of this study is to evaluate the effect of gender and stress on performance scores. Is there any difference in performance between different stress levels? If any, which one yields the best/worst performance score?

Data Exploration
The structure of the dataset is as follows:

str(performance)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 60 obs. of 5 variables:

$ id : int 1 2 3 4 5 6 7 8 9 10 ...

$ gender: Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ...

$ stress: Factor w/ 3 levels "low","moderate",..: 1 1 1 1 1 1 1 1 1 1 ...

$ t1 : num 5.96 5.51 5.63 5.71 5.74 ...

$ t2 : num 5.58 5.82 5.47 5.79 5.72 ...


Boundary
Since the performance is measured twice, in this problem we only focus on the first measurement (the t1
column).

You then plotted a [Answer: boxplot], and the result came out as follows:

Since you were comparing a variable in three different groups, you need the proper method. Therefore, you need to decide which method to use. So you start with checking the [Answer: distribution normality] of the performance score in each group by using the [Answer: Shapiro-Wilk Test].

The p-value for each tested group is shown in the following table:
stress level p-value
low 0.11428304
moderate 0.07023834
high 0.92983350
Statistical Tests

Based on it, you decide to use a [Answer: parametric method] to find out whether there is any difference in performance between the stress levels. Due to the nature of the problem, you ran a [Answer: One-way ANOVA], and the result is as follows:
            Df Sum Sq Mean Sq F value   Pr(>F)
stress       2 0.8235  0.4117    14.5 8.13e-06 ***
Residuals   57 1.6190  0.0284
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Based on the result, you decide to [Answer: continue with further tests to find which stress level has significantly different impact on performance].

Therefore, you ran a [Answer: TukeyHSD], and the results came out as follows:


[NAME of TEST HIDDEN]
    95% family-wise confidence level

Fit: [HIDDEN]

$stress
                    diff         lwr        upr     p adj
moderate-low   0.1052102 -0.02303774  0.2334582 0.1279077
high-low      -0.1786052 -0.30685319 -0.0503573 0.0040329
high-moderate -0.2838155 -0.41206340 -0.1555675 0.0000053

Conclusion
Based on the results of the statistical tests, then you conclude that

employees with a high stress level have significantly lower performance, followed by those with moderate, and then low stress levels

there is no significant performance difference across the stress levels

the experiment is a violation of human rights

employees with a high stress level tend to have significantly lower performance compared to employees with moderate and low stress levels

there is no significant performance difference between employees with moderate and low stress levels

employees with a moderate stress level tend to have significantly higher performance than those with low and/or high stress levels
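
The numbers in the ANOVA table and the Tukey output can be cross-checked with a few lines of R. This is only a sketch that re-derives the F value from the printed sums of squares and flags which Tukey intervals exclude zero; the variable names are illustrative, and all inputs are copied from the output above.

```r
# Values copied from the ANOVA table above
ss_stress <- 0.8235; df_stress <- 2
ss_resid  <- 1.6190; df_resid  <- 57

# Mean squares and F statistic
ms_stress <- ss_stress / df_stress
ms_resid  <- ss_resid / df_resid
f_value   <- ms_stress / ms_resid
p_value   <- pf(f_value, df_stress, df_resid, lower.tail = FALSE)

# Tukey HSD interval bounds copied from the output above
tukey <- data.frame(
  pair = c("moderate-low", "high-low", "high-moderate"),
  lwr  = c(-0.02303774, -0.30685319, -0.41206340),
  upr  = c(0.2334582, -0.0503573, -0.1555675)
)

# A pair differs significantly when its interval excludes zero
tukey$significant <- tukey$lwr > 0 | tukey$upr < 0

print(round(f_value, 1))
print(tukey)
```

Only the moderate-low interval contains zero, which is consistent with its adjusted p-value of 0.1279077 being above 0.05.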

13. You are interested in learning the favorite programming languages of first-year Indonesian Informatics and/or Computer Science undergraduate students. To achieve this mission, you asked your high school classmates who were admitted to the specified program.

a. The data are collected properly and bias is minimized. [Answer: TRUE]
b. Because a variable is a characteristic of each individual on which data is collected, which of the
following are variables that suit well with the research question?

the respondent's final score in algorithm and programming course

chosen programming language

gender

number of students who chose particular programming as their favorite one

14. Which chart or graph would be appropriate to display the concerned variable(s)?

a time plot

a pie chart

a boxplot

a bar graph
15. You are assigned to study whether there is a relationship between the category and the content
rating of selected apps in Google PlayStore. You have a dataset with the following structure:
str(googleplaystore)
'data.frame': 3398 obs. of 13 variables:

$ App : Factor w/ 3088 levels "¡Ay Caramba!",..: 2472 2679 654 2617 580 1701 2595 790 2762
2343 ...

$ Category : Factor w/ 3 levels "FAMILY","GAME",..: 2 2 2 2 2 2 2 2 2 2 ...

$ Rating : num 4.5 4.5 4.4 4.7 4.5 4.2 4.4 4.6 4.3 4.3 ...

$ Reviews : Factor w/ 2379 levels "0","1","10","100",..: 1556 1061 828 969 412 1363 1730 865 2175
68 ...

$ Size : Factor w/ 219 levels "1.0M","1.1M",..: 143 165 163 43 93 45 219 215 136 45 ...

$ Installs : Factor w/ 21 levels "0","0+","1,000,000,000+",..: 10 3 19 7 7 16 10 10 19 19 ...

$ Type : Factor w/ 3 levels "Free","NaN","Paid": 1 1 1 1 1 1 1 1 1 1 ...

$ Price : Factor w/ 38 levels "$0.99","$1.04",..: 38 38 38 38 38 38 38 38 38 38 ...

$ Content.Rating: Factor w/ 4 levels "Everyone","Everyone 10+",..: 2 2 1 1 1 1 1 2 1 1 ...


$ Genres : Factor w/ 85 levels "Action","Action;Action & Adventure",..: 4 7 23 19 23 29 1 77 1 23 ...

$ Last.Updated : Factor w/ 921 levels "April 1, 2017",..: 460 396 465 78 412 25 735 536 465 691 ...

$ Current.Ver : Factor w/ 1094 levels "0.0.1","0.0.2",..: 606 444 186 551 249 348 1093 600 347 332 ...

$ Android.Ver : Factor w/ 24 levels "1.5 and up","1.6 and up",..: 15 15 15 15 14 15 8 15 13 13 ...


You explored the data by making a barplot that shows the grouped distribution, and it came out as follows:

To achieve the goal of the study, you create a [Answer: Cross-tabulation] table.
                   Content.Rating
Category            Everyone Everyone 10+ Mature 17+ Teen
FAMILY                  1529          131         50  261
GAME                     608          131         74  331
NEWS_AND_MAGAZINES       169           66         14   34

With a 95% confidence level, you ran a [Answer: Chi-square Test of Independence], and the result came out as follows:


data: [HIDDEN]

X-squared = 275.51, df = 6, p-value < 2.2e-16


Therefore, according to the result, it can be concluded that:

There is no significant relationship between the category and the content rating of the selected apps from Google PlayStore.

There is a significant relationship between the category and the content rating of the selected apps from Google PlayStore.

The content ratings of Google PlayStore apps are not related to the category.

Most apps categories in Google PlayStore are highly related with the content rating.

16. Qualitative data can be organized in the following ways:

Select one or more:

a. Relative frequency
b. Percentage

c. Frequency distribution table

d. Tally marks

e. Pie chart

f. Raw data

g. Bar plot

h. Boxplot

17. Which of the following are best treated as discrete variables?

Select one or more:

a. Gender

b. Names of your classmates

c. Number of students in a class

d. Average scores in a quiz

e. Number of students in the whole university

f. Grades frequency at the end of a course

g. Number of children in a family

h. Final scores in a course

18. The alternative hypothesis of a _____ t-test has the form of "The mean of x of the A group is higher than ..."

Select one:

a. Paired
b. Two-tail

c. Half-tail

d. Unpaired

e. One-tail ✅ correct answer

19. To gain information about the number of elements in a vector, we use the _____ function.

Select one:

a. length() ✅ correct answer

b. ncol()

c. sizeof()

d. nrow()

e. getLength()
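
A short illustration in R of how these functions behave on a plain vector. The vector v is made up for the example.

```r
# A plain numeric vector (illustrative values)
v <- c(276.1, 53.4, 20.6, 181.8)

n_elements <- length(v)  # number of elements in the vector
rows <- nrow(v)          # NULL: a plain vector has no dim attribute
cols <- ncol(v)          # NULL for the same reason

print(n_elements)
print(is.null(rows))
print(is.null(cols))
```

nrow() and ncol() are meant for objects with dimensions such as matrices and data frames, which is why they return NULL here.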

20. You are assigned to study if there is any connection between the district where a person lives and
his/her hobby. There are 671 randomly selected respondents that were interviewed. Their answers
are collected into a data frame with the following structure:
str(district.hobby)
'data.frame': 671 obs. of 2 variables:

$ district: Factor w/ 4 levels "DISTRICT 1","DISTRICT 2",..: 2 1 4 4 2 2 2 3 1 4 ...

$ hobby : Factor w/ 6 levels "BASKETBALL","FOOTBALL",..: 3 3 2 6 2 3 2 3 5 2 ...

You explored the data by making a barplot that shows the grouped distribution, and it came out as follows:

To achieve the goal of the study, you create a [Answer: Frequency distribution] table.

           hobby
district   BASKETBALL FOOTBALL PAINTING PHOTOGRAPHY SINGING TRAVELING
DISTRICT 1         39       29       19          28      37        29
DISTRICT 2         29       33       29          30      25        32
DISTRICT 3         26       24       30          22      30        19
DISTRICT 4         28       36       23          24      26        24

With a 95% confidence level, you ran a [Answer: Chi-square Test of Independence], and the result came out as follows:

data: [HIDDEN]

X-squared = 13.811, df = 15, p-value = 0.5399

Therefore, according to the result, it can be concluded that:

Some hobbies are significantly preferred in certain districts.

There is a significant relationship between the district where someone lives with his/her hobby.

There is no significant relationship between the district where someone lives with his/her hobby. (correct answer)

Someone's hobby is independent of the district where one lives. (correct answer)
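
The hidden call can be approximated by rebuilding the district by hobby table from the counts printed above and passing it to chisq.test(). This is a sketch, not the original code; the object name district_hobby is illustrative.

```r
# Counts copied from the district x hobby table above
district_hobby <- matrix(
  c(39, 29, 19, 28, 37, 29,
    29, 33, 29, 30, 25, 32,
    26, 24, 30, 22, 30, 19,
    28, 36, 23, 24, 26, 24),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    district = paste("DISTRICT", 1:4),
    hobby = c("BASKETBALL", "FOOTBALL", "PAINTING",
              "PHOTOGRAPHY", "SINGING", "TRAVELING")
  )
)

# Chi-square test of independence on the contingency table
res <- chisq.test(district_hobby)

print(sum(district_hobby))  # 671 respondents, as in str(district.hobby)
print(res)
```

With (4 - 1) × (6 - 1) = 15 degrees of freedom and a p-value well above 0.05, the null hypothesis of independence is not rejected.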

21. X is a means of organizing quantitative data. It shows the sum of the frequencies of a class and all classes below it. What is X?

Select one:

a. Ogive

b. Cumulative Frequency Distribution (correct answer)

c. Histogram

d. Stem-and-leaf display
e. Frequency distribution table

22. Methods that can be used to find out whether the data is normally distributed or not are

Select one or more:

a. Observing histogram and density plot (correct answer)

b. Observing pie chart and bar plot

c. Applying the Shapiro-Wilk test (correct answer)

d. Applying the Kruskal-Wallis test

e. Applying the Kolmogorov-Smirnov test (correct answer)

f. Applying the Wilcoxon Rank Sum test

23. Which of the following are best treated as continuous variables?

Select one or more:

a. The distance between two cities (correct answer)

b. Number of classes in a college building

c. Number of students who achieve pass grades

d. Height (correct answer)

24. A publishing company is currently reviewing proposals from bookstores in several universities.
These bookstores are asking for more programming books to be stocked for each of them. Since the
stock in the company's warehouse is limited, hence the management will decide the allocation
based on historical sales data. Therefore, the management asked you to make the analysis. The data
that they possess contains historical data of the number of students who took programming
courses and the number of programming books sold at the respective university bookstore. Should
the university with more students who took a programming course be allocated more
books?

The dataset that the management gave you is as follows:


str(student.books)

'data.frame': 231 obs. of 2 variables:

$ nstudents : int 204 179 200 177 207 195 166 178 213 130 ...

$ books_sold: int 441 329 467 376 504 396 354 439 461 235 ...
The column nstudents shows the number of students while the column books_sold shows the
number of books sold at a university bookstore with the respective number of students
who took the programming course.

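The first thing that you need to do is [Answer: determine whether there is a strong relationship between the number of students taking a programming course and the number of programming books sold at a particular university bookstore].

Therefore, you plotted the number of students with the respective numbers of books sold:

After that, you ran a [Answer: Pearson correlation test], and the result is R = 0.7680.

Based on the value of R, you know that [Answer: there is a strong positive relationship between the number of students taking a programming course and the number of programming books sold at a particular university bookstore]. Therefore, you decided to summarize and study the relationship between the number of students and the number of books sold by using [Answer: Simple Linear Regression]. The results came out as follows: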

Call:
lm(formula = books_sold ~ nstudents, data = student.books)

Residuals:
    Min      1Q  Median      3Q     Max
-80.265 -37.203  -2.531  38.198  83.988

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.1796    23.6051    0.05     0.96
nstudents     2.1165     0.1166   18.15   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 44.01 on 229 degrees of freedom
Multiple R-squared:  0.5898,  Adjusted R-squared:  0.5881
F-statistic: 329.3 on 1 and 229 DF,  p-value: < 2.2e-16

From the model, you could conclude that:

1.1796 is a strong predictor of the model

For each student taking the programming class, the respective university bookstore could expect a sale of 2.1165 books.

For each student taking the programming class, the respective university bookstore could expect a sale of 1.1796 books.

The number of students taking the programming class is a significant predictor of sales.

For a number of students that equals zero, a bookstore may expect a sale of 1.1796 books; however, since the factor itself is not significant, the bookstore should not cling to that.

25. The frequency of a categorical variable could be visualized with

Select one:

a. Pie chart

b. Ogive

c. Bar plot

d. Line plot

e. Boxplot

Started on Tuesday, 15 December 2020, 23:41

State Finished

Completed on Tuesday, 15 December 2020, 23:44

Time taken 3 min 31 sec

Grade 10.00 out of 10.00 (100%)

Question 1
Correct

Mark 1.00 out of 1.00

The suitable statistical test(s) to compare a variable in 3 or more groups is (are):
Select one or more:

a. Paired samples t-test

b. Kruskal-Wallis

c. ANOVA

d. Kolmogorov-Smirnov

e. Wilcoxon Rank Sum test

f. Unpaired samples t-test


Feedback
The correct answers are: ANOVA, Kruskal-Wallis

Question 2
Correct

Mark 1.00 out of 1.00

Background
You are assigned to analyze a dataset that contains measures of cholesterol concentration in 72
participants treated with three different drugs. The aim is to examine the potential of a new class of
drugs in lowering cholesterol concentration and consequently reducing heart attack. The participants
include 36 males and 36 females. Males and females were further (equally) subdivided into whether
they were at low or high risk of a heart attack. Is there any difference in the impact of each drug on
cholesterol concentration? If any, which one has the highest impact, in terms of the lowest
cholesterol concentration?

Data Exploration
The structure of the dataset is as follows:
str(heartattack)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 72 obs. of 5 variables:

$ gender : Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ...

$ risk : Factor w/ 2 levels "high","low": 2 2 2 2 2 2 2 2 2 2 ...

$ drug : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 2 2 2 2 ...

$ cholesterol: num 5.24 5.08 4.68 5.36 4.96 ...

$ id : int 1 2 3 4 5 6 7 8 9 10 ...

You then plotted a [Answer: boxplot], and the result came out as follows:

Since you were comparing a variable in three different groups, you need the proper method. Therefore, you need to decide which method to use. So you start with checking the [Answer: distribution normality] of the cholesterol concentration in each group by using the [Answer: Shapiro-Wilk Test]. The p-value for each tested group is shown in the following table:

drug p-value

A 0.1537620

B 0.7674545

C 0.5537145

Statistical Tests
Based on it, you decide to use a [Answer: parametric method] to find out whether there is any difference between the drugs in their effect on the cholesterol concentration. Due to the nature of the problem, you ran a [Answer: One-way ANOVA], and the result is as follows:

            Df Sum Sq Mean Sq F value Pr(>F)
drug         2  1.235  0.6177    2.63 0.0793 .
Residuals   69 16.204  0.2348
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Based on the result, you decide to [Answer: draw a final conclusion].

Therefore, you ran a [Answer: FURTHER TEST IS UNNECESSARY], and the results came out as follows:

[NAME of TEST HIDDEN]
    95% family-wise confidence level

Fit: [HIDDEN]

$drug
            diff        lwr        upr     p adj
B-A -0.277327333 -0.6124096 0.05775494 0.1241979
C-A -0.278421280 -0.6135035 0.05666099 0.1222405
C-B -0.001093947 -0.3361762 0.33398832 0.9999663

Conclusion
Based on the results of the statistical tests, then you conclude that

drug B yields a significantly less cholesterol rate than drug A

the drugs gave no significantly different impact on the cholesterol rate

drug C yields a significantly less cholesterol rate than drugs A and B

the experiment is a mess


drug that yields the lowest cholesterol rate is drug C, followed with drug B, and then drug A
Mark 5.00 out of 5.00

The correct answer is:

• the drugs gave no significantly different impact on the cholesterol rate

(Credit: the dataset used in this vignette is based on the heartattack dataset in the datarium package)

Question 3
Correct

Mark 1.00 out of 1.00

To gain information about the number of elements in a vector, we use the _____ function.
Select one:

a. ncol()

b. length()

c. sizeof()

d. nrow()

e. getLength()
Feedback
The correct answer is: length()

Question 4
Correct

Mark 1.00 out of 1.00

You are interested in learning the favorite programming languages of first-year Indonesian Informatics and/or Computer Science undergraduate students. To achieve this mission, you asked your high school classmates who were admitted to the specified program.

a. The data are collected properly and bias is minimized. [Answer: TRUE]

b. Because a variable is a characteristic of each individual on which data is collected, which of the following are variables that suit well with the research question?

gender

chosen programming language

number of students who chose particular programming as their favorite one

the respondent's final score in algorithm and programming course


Mark 1.00 out of 1.00

The correct answer is:

• chosen programming language

c. Which chart or graph would be appropriate to display the concerned variable(s)?

a time plot

a boxplot
a pie chart

a bar graph
Mark 1.00 out of 1.00

The correct answers are:

• a bar graph
• a pie chart

Question 5
Correct

Mark 1.00 out of 1.00

A publishing company is currently reviewing proposals from bookstores in several universities. These bookstores are asking for more programming books to be stocked for each of them. Since the stock in the company's warehouse is limited, the management will decide the allocation based on historical sales data. Therefore, the management asked you to make the analysis. The data that they possess contains historical data of the number of students who took programming courses and the number of programming books sold at the respective university bookstore. Should the university with more students who took a programming course be allocated more books?

The dataset that the management gave you is as follows:


str(student.books)
'data.frame': 231 obs. of 2 variables:

$ nstudents : int 204 179 200 177 207 195 166 178 213 130 ...

$ books_sold: int 441 329 467 376 504 396 354 439 461 235 ...

The column nstudents shows the number of students while the column books_sold shows the
number of books sold at a university bookstore with the respective number of students who
took the programming course.
The first thing that you need to do is [Answer: determine whether there is a strong relationship between the number of students taking a programming course and the number of programming books sold at a particular university bookstore].

Therefore, you plotted the number of students with the respective numbers of books sold:

After that, you ran a [Answer: Pearson correlation test], and the result is R = 0.7680.

Based on the value of R, you know that [Answer: there is a strong positive relationship between the number of students taking a programming course and the number of programming books sold at a particular university bookstore]. Therefore, you decided to summarize and study the relationship between the number of students and the number of books sold by using [Answer: Simple Linear Regression]. The results came out as follows:

Call:
lm(formula = books_sold ~ nstudents, data = student.books)

Residuals:
    Min      1Q  Median      3Q     Max
-80.265 -37.203  -2.531  38.198  83.988

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.1796    23.6051    0.05     0.96
nstudents     2.1165     0.1166   18.15   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 44.01 on 229 degrees of freedom
Multiple R-squared:  0.5898,  Adjusted R-squared:  0.5881
F-statistic: 329.3 on 1 and 229 DF,  p-value: < 2.2e-16

From the model, you could conclude that:


For each student taking the programming class, the respective university bookstore could expect a sale of 2.1165 books.

1.1796 is a strong predictor of the model

The number of students taking the programming class is a significant predictor of sales.

For a number of students that equals zero, a bookstore may expect a sale of 1.1796 books; however, since the factor itself is not significant, the bookstore should not cling to that.

For each student taking the programming class, the respective university bookstore could expect a sale of 1.1796 books.
Mark 3.00 out of 3.00

The correct answers are:

• For a number of students that equals zero, a bookstore may expect a sale of 1.1796 books; however, since the factor itself is not significant, the bookstore should not cling to that.
• The number of students taking the programming class is a significant predictor of sales.
• For each student taking the programming class, the respective university bookstore could expect a sale of 2.1165 books.

Based on the model, for a university with 100 students taking a programming course, the publisher could expect the respective bookstore would sell [Answer: 212.82]
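
The recorded answer follows from the fitted equation books_sold = 1.1796 + 2.1165 × nstudents. A minimal sketch in R, with the coefficients copied from the summary above and illustrative variable names; truncating to two decimals is an assumption about how the recorded figure was obtained.

```r
# Coefficients copied from the lm() summary above
intercept <- 1.1796
slope     <- 2.1165

# Prediction for a university with 100 students
nstudents <- 100
books_predicted <- intercept + slope * nstudents

# Assumption: the recorded answer keeps two decimals without rounding up
books_truncated <- trunc(books_predicted * 100) / 100

print(books_predicted)
print(books_truncated)
```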

Question 6
Correct

Mark 1.00 out of 1.00

What can be learned from a histogram and/or a stem-and-leaf display?
Select one or more:

a. Data symmetricity

b. Outliers

c. Central tendency

d. Distribution gap

e. Data distribution
Feedback
The correct answers are: Data symmetricity, Data distribution, Outliers, Distribution gap, Central tendency

Question 7
Correct

Mark 1.00 out of 1.00

You are assigned to study if there is any connection between the district where a person lives and
his/her preferred social media. There are 1,200 randomly selected respondents that were interviewed.
Their answers are collected into a data frame with the following structure:
str(district.socmed)
'data.frame': 1200 obs. of 2 variables:

$ district: Factor w/ 4 levels "DISTRICT 1","DISTRICT 2",..: 4 1 3 1 1 3 1 2 4 4 ...


$ socmed : Factor w/ 6 levels "FACEBOOK","FRIENDSTER",..: 4 4 4 1 5 6 3 2 1 3 ...

You explored the data by making a barplot that shows the grouped distribution, and it came out as
follows:

To achieve the goal of the study, you create a (Answer: Cross-tabulation) table.

           socmed
district   FACEBOOK FRIENDSTER INSTAGRAM LINKEDIN RESEARCHGATE TWITTER
DISTRICT 1       65         48        40       55           43      57
DISTRICT 2       48         46        51       53           29      58
DISTRICT 3       51         51        33       59           65      42
DISTRICT 4       49         57        57       47           54      42

With a 95% confidence level, you ran a (Answer: Chi-square Test of Independence), and the result
came out as follows:

data: [HIDDEN]
X-squared = 13.811, df = 15, p-value = 0.004893

Therefore, according to the result, it can be concluded that:

Someone's preferred social media is independent of the district where one lives.

There is no significant relationship between the district where someone lives and his/her
preferred social media.

Some social media are significantly preferred in certain districts.

There is a significant relationship between the district where someone lives and his/her
preferred social media.

Mark 2.00 out of 2.00

The correct answer is:

• There is a significant relationship between the district where someone lives and
his/her preferred social media.
• Some social media are significantly preferred in certain districts.
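
As a rough sketch of how such a test can be run in R, the cross-tabulation above can be typed in as a matrix and passed to `chisq.test()`. The object names `counts` and `result` are illustrative assumptions, not names from the original analysis, and the statistic this sketch returns need not match the partly hidden output quoted in the question; the conclusion rests on the p-value being below 0.05.

```r
# Counts copied from the cross-tabulation above
# (rows: districts, columns: social media).
counts <- matrix(
  c(65, 48, 40, 55, 43, 57,
    48, 46, 51, 53, 29, 58,
    51, 51, 33, 59, 65, 42,
    49, 57, 57, 47, 54, 42),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    district = paste("DISTRICT", 1:4),
    socmed = c("FACEBOOK", "FRIENDSTER", "INSTAGRAM",
               "LINKEDIN", "RESEARCHGATE", "TWITTER")
  )
)

# Chi-square test of independence between district and preferred social media.
result <- chisq.test(counts)
print(result)

# Independence is rejected at the 5% significance level when this is TRUE.
print(result$p.value < 0.05)
```

With the raw data frame at hand, the same matrix could presumably be built with `table(district.socmed$district, district.socmed$socmed)` instead of typing the counts.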

Question 8
Correct
Mark 1.00 out of 1.00

Question text
You are interested in knowing the achievement of the present second-year students in your
program in their first semester. It is measured according to the GP achieved. You then collected
the 1st semester GP of 31 randomly selected second-year students and calculated the mean.
Match the item/condition from the example above with the right term!

Statistics → Average
Population → Second-year students
Parameter → GP
Samples → 31 randomly selected second-year students

Feedback
The correct answer is: Statistics → Average, Population → Second-year students, Parameter →
GP, Samples → 31 randomly selected second-year students

Question 9
Correct

Mark 1.00 out of 1.00

Question text
The following table contains a subset of the results from a survey about how the first year UNSRAT
undergraduate students access e-Learning.

questionnaire_code  program      access_mean
STU001              informatics  personal notebook/PC
STU002              civil        shared notebook/PC
STU003              law          NA
STU004              medical      personal tablet

Match the item/condition from the example above with the right term!

Observation → personal notebook/PC
Variable → access_mean
Element → STU001

Feedback
Your answer is correct.
The correct answer is: Observation → personal notebook/PC, Variable → access_mean, Element
→ STU001

Question 10
Correct

Mark 1.00 out of 1.00

Question text
Which of the following are best treated as discrete variables?
Select one or more:

a. Number of students in a class

b. Number of students in the whole university

c. Average scores in a quiz


d. Final scores in a course

e. Names of your classmates

f. Gender

g. Number of children in a family

h. Grades frequency at the end of a course


Feedback
The correct answers are: Number of students in a class, Number of students in the whole
university, Grades frequency at the end of a course, Number of children in a family

SORTED

1. The suitable statistical test(s) to compare a variable to a specific value is(are)


• Wilcoxon Signed Rank test
• One-sample t-test

2.
To achieve the goal of the study, you create a Cross-tabulation
table.
With a 95% confidence level, you ran a Chi-square Test of Independence, and the result came
out as follows:

• There is no significant relationship between the district where someone lives and
his/her hobby.
• Someone's hobby is independent of the district where one lives.
3. The following table contains a subset of the results from a survey about how the first year
UNSRAT undergraduate students access e-Learning.

Match the item/condition from the example above with the right term!
Observation 7
Element STU003
Variable SATISFACTION
4. The frequency of a categorical variable could be visualized with a bar plot.
5. You are interested in learning the favorite programming languages of the first year Indonesian
Informatics and/or Computer Science undergraduate students. To achieve this mission, you
asked your high school classmates who were admitted to the specified program.

A. The data are collected properly and bias is minimized: TRUE

B. Because a variable is a characteristic of each individual on which data is collected, which of
the following are variables that suit well with the research question?
• Chosen programming language
C. Which chart or graph would be appropriate to display the concerned variable(s)?
• a bar graph
• a pie chart
6. Which one is NOT a function used to store data?
• Write.csv()
7. You are interested in knowing the achievement of the present second-year students in your
program in their first semester. It is measured according to the GP achieved. You then collected
the 1st semester GP of 31 randomly selected second-year students and calculated the mean.
Match the item/condition from the example above with the right term!
• Statistics AVERAGE
• Parameter GP
• Population Second-year students
• Samples 31 randomly selected second-year students
8. A publishing company is currently reviewing proposals from bookstores in several universities.
These bookstores are asking for more programming books to be stocked for each of them. Since
the stock in the company's warehouse is limited, the management will decide the
allocation based on historical sales data. Therefore, the management asked you to make the
analysis. The data that they possess contains historical data of the number of students who took
programming courses and the number of programming books sold at the respective university
bookstore. Should a university with more students who took a programming course be
allocated more books?

The dataset that the management gave you is as follows:

The column students shows the number of students, while the column books sold shows the
number of books sold at a university bookstore with the respective number of students who
took the programming course.

The first thing that you need to do is:

Determine whether there is a strong relationship between the number of students taking a
programming course and the number of programming books sold at a particular university
bookstore.
After that, you ran a Pearson correlation test and the result is R = 0.7680.

Based on the value of R, you know that:

There is a strong positive relationship between the number of students taking a programming course
and the number of programming books sold at a particular university bookstore.

Therefore, you decided to summarize and study the relationship between the number of students and
the number of books sold by using Simple Linear Regression. The results came out as
follows:

From the model, you could conclude that:

• For each student taking the programming class, the respective university bookstore could
expect a sale of 2.1165 books.
• The number of students taking the programming class is a significant predictor of sales.
• For a number of students that equals zero, a bookstore may expect a sale of 1.1796 books;
however, since the factor itself is not significant, the bookstore should not cling to
that.
Based on the model, for a university with 100 students taking a programming course, the publisher
could expect the respective bookstore would sell 212.82

9. Which of the following are best treated as continuous variables?


Select one or more:

• The distance between two cities


• Height
10. You are assigned to analyze a dataset that contains the performance score measures of
participants at two time points. The aim of this study is to evaluate the effect of gender and
stress on performance scores. Is there any difference in performance between different stress
levels? If any, which one yields the best/worst performance score?
Boundary

Since the performance is measured twice, in this problem we only focus on the first
measurement (the t1 column).
You then plotted a boxplot and the result came out as follows:

Since you were comparing a variable in three different groups, you need the proper
method. Therefore, you need to decide which method to use. So you start by checking the
distribution normality of the performance score in each group by using the Shapiro-Wilk Test. The
p-value for each tested group is shown in the following table:
Statistical Tests
Based on it, you decide to use a parametric method to find whether there is any difference
between the stress levels in performance. Due to the nature of the problem, you ran a
One-way ANOVA and the result is as
follows:

Based on the result, you decide to
continue with further tests to find which stress level has a significantly different impact on
performance.
Therefore you ran a TukeyHSD and the results came out as follows:

• there is no significant performance difference between employees with moderate and
low stress levels
• employees with a high stress level tend to have significantly lower performance
compared to employees with moderate and low stress levels
Question 1
Correct
Mark 1.00 out of 1.00


Question text
Which of the following are best treated as nominal variables?

Select one or more:


a. Number of children in a family

b. Gender
c. Number of students in a class

d. Phone number
e. Number of students in the whole university

f. Names of your classmates


g. Grades frequency at the end of a course

Feedback
The correct answers are: Gender, Phone number, Names of your classmates

Question 2
Correct
Mark 1.00 out of 1.00


Question text
The suitable statistical test(s) to compare a variable to a specific value is(are)

Select one or more:


a. Wilcoxon Rank Sum test

b. Wilcoxon Signed Rank test


c. ANOVA

d. One-sample t-test
e. Paired samples t-test
f. Kruskal-Wallis test

Feedback
The correct answers are: One-sample t-test, Wilcoxon Signed Rank test
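
A minimal sketch of both tests in base R, assuming a hypothetical sample `scores` and a hypothetical reference value `target` (neither comes from the quiz). `t.test()` with `mu` is the parametric one-sample test, and `wilcox.test()` with `mu` on a single sample is the Wilcoxon signed rank test.

```r
# Hypothetical sample: 20 simulated quiz scores (not data from the original quiz).
set.seed(42)
scores <- rnorm(20, mean = 72, sd = 8)

# Reference value to compare against (an assumed target, chosen for illustration).
target <- 70

# Parametric option: one-sample t-test of H0: mean equals target.
t_result <- t.test(scores, mu = target)

# Non-parametric option: Wilcoxon signed rank test of H0: location equals target.
w_result <- wilcox.test(scores, mu = target)

print(t_result$p.value)
print(w_result$p.value)
```

Which of the two is appropriate usually depends on whether the sample looks approximately normal, for example according to `shapiro.test(scores)`.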

Question 3
Correct
Mark 1.00 out of 1.00

Question text
Qualitative data could be organized in the following ways:

Select one or more:


a. Raw data
b. Pie chart

c. Relative frequency
d. Bar plot

e. Percentage

f. Tally marks

g. Frequency distribution table


h. Boxplot

Feedback
The correct answers are: Frequency distribution table, Tally marks, Relative frequency, Percentage

Question 4
Correct
Mark 1.00 out of 1.00


Question text
1,341 undergraduate students were surveyed to learn the preferred teaching-and-learning method
of all UNSRAT students. There are three teaching-and-learning methods:
online, offline, or blended. The answers were then tabulated and the frequency of each method is
presented in the report. Match the item/condition from the example above with the right term!

Variable → preferred teaching-and-learning method
Statistics → frequency
Population → UNSRAT students
Samples → the surveyed 1,341 undergraduate students
Feedback
The correct answer is: Variable → preferred teaching-and-learning method, Statistics → frequency,
Population → UNSRAT students, Samples → the surveyed 1,341 undergraduate students

Question 5
Correct
Mark 1.00 out of 1.00


Question text
1,341 undergraduate students were surveyed to learn the preferred teaching-and-learning method
of all UNSRAT students. There are three teaching-and-learning methods: online,
offline, or blended. The following table contains a subset of the results:

questionnaire_code preferred_tlm
STU001 blended
STU002 offline
STU003 online
STU004 blended

Match the item/condition from the example above with the right term!

Element → STU003
Variable → preferred_tlm
Observation → blended
Feedback
Your answer is correct.
The correct answer is: Element → STU003, Variable → preferred_tlm, Observation → blended

Question 6
Correct
Mark 1.00 out of 1.00

Question text
A publishing company is currently reviewing proposals from bookstores in several universities. These
bookstores are asking for more programming books to be stocked for each of them. Since the stock in
the company's warehouse is limited, the management will decide the allocation based on
historical sales data. Therefore, the management asked you to make the analysis. The data that they
possess contains historical data of the number of students who took programming courses and the
number of programming books sold at the respective university bookstore. Should a university with
more students who took a programming course be allocated more books?

The dataset that the management gave you is as follows:


str(student.books)
'data.frame': 231 obs. of 2 variables:

$ nstudents : int 204 179 200 177 207 195 166 178 213 130 ...

$ books_sold: int 441 329 467 376 504 396 354 439 461 235 ...

The column nstudents shows the number of students while the column books_sold shows the
number of books sold at a university bookstore with the respective number of students who took
the programming course.

The first thing that you need to do is (Answer: Determine whether there is a strong relationship between the number of students taking a programming course and the number of programming books sold at a particular university bookstore).

Therefore, you plotted the number of students against the respective numbers of books sold:

After that, you ran a (Answer: Pearson correlation test) and the result is R = 0.7680.

Based on the value of R, you know that (Answer: There is a strong positive relationship between the number of students taking a programming course and the number of programming books sold at a particular university bookstore).
Therefore, you decided to summarize and study the relationship between the number of students
and the number of books sold by using (Answer: Simple Linear Regression). The results came out as follows:


Call:
lm(formula = books_sold ~ nstudents, data = student.books)

Residuals:
    Min      1Q  Median      3Q     Max
-80.265 -37.203  -2.531  38.198  83.988

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.1796    23.6051    0.05     0.96
nstudents     2.1165     0.1166   18.15   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 44.01 on 229 degrees of freedom
Multiple R-squared: 0.5898, Adjusted R-squared: 0.5881
F-statistic: 329.3 on 1 and 229 DF, p-value: < 2.2e-16

From the model, you could conclude that:

For a number of students that equals zero, a bookstore may expect a sale of 1.1796 books;
however, since the factor itself is not significant, the bookstore should not cling to
that.

1.1796 is a strong predictor of the model.

For each student taking the programming class, the respective university bookstore could
expect a sale of 2.1165 books.

For each student taking the programming class, the respective university bookstore could
expect a sale of 1.1796 books.

The number of students taking the programming class is a significant predictor of sales.

Mark 3.00 out of 3.00

The correct answer is:

• For a number of students that equals zero, a bookstore may expect a sale
of 1.1796 books; however, since the factor itself is not significant, the
bookstore should not cling to that.
• The number of students taking the programming class is a significant
predictor of sales.
• For each student taking the programming class, the respective university
bookstore could expect a sale of 2.1165 books.

Based on the model, for a university with 100 students taking a programming course, the publisher
could expect the respective bookstore would sell Answer: 212.82
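
The prediction step is plain arithmetic on the fitted line, books_sold = intercept + slope × nstudents. A minimal sketch using the rounded coefficients printed in the summary above; the variable names are illustrative. With these rounded values the result is about 212.83, so the 212.82 recorded in the quiz presumably comes from the unrounded coefficients (an assumption, since the fitted object is not available here).

```r
# Coefficients as printed (rounded) in the lm() summary above.
intercept <- 1.1796   # (Intercept)
slope     <- 2.1165   # nstudents

# Predicted number of books sold for a university with 100 students.
n_students <- 100
predicted_books <- intercept + slope * n_students
print(predicted_books)
```

With the fitted model object at hand, `predict(model, newdata = data.frame(nstudents = 100))` would do the same with unrounded coefficients; `model` is a hypothetical name for the result of the `lm()` call.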

Question 7
Correct
Mark 1.00 out of 1.00


Question text
Suppose that you are interested in the percentage of cellphone brands owned by the students of
UNSRAT. Therefore, on Wednesday, after class, you asked all your classmates about the brands of
their cellphones.

a. Why can collecting data only from your classmates cause bias in the data?
Perhaps some of your classmates do not bring their cellphones on Wednesday.

The subjects were not randomly selected.

It assumes your classmates represent the whole population of UNSRAT students.

You should ask students from the other classes too.

It assumes the percentage of the cellphone brands owned by the first-year students may
represent the whole population of UNSRAT students.

Mark 1.00 out of 1.00

The correct answer is:

• It assumes your classmates represent the whole population of UNSRAT students.
• It assumes the percentage of the cellphone brands owned by the first-year
students may represent the whole population of UNSRAT students.
• The subjects were not randomly selected.

b. Because a variable is a characteristic of each individual on which data is collected, which of the
following are variables in this study?
One of your classmates.
The day the data were collected.

cellphone brand
gender
Mark 1.00 out of 1.00

The correct answer is:

• cellphone brand

c. Which chart or graph would be appropriate to display the proportion of the brands?

pie chart
boxplot
time plot
line plot

bar graph
Mark 1.00 out of 1.00

The correct answer is:

• bar graph
• pie chart
Question 8
Correct
Mark 1.00 out of 1.00


Question text

Background
You are assigned to analyze a dataset that contains the final scores of students in three parallel
classes. Is there any difference in the scores of the students in a different class?

Data Exploration
The structure of the dataset is as follows:
str(quiz.result)
'data.frame': 150 obs. of 2 variables:

$ class: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...

$ score: num 55.2 88.3 91.6 91.4 81.3 ...

You then plotted a (Answer: boxplot), and the result came out as follows:

Since you were comparing scores in three different quizzes with the same participants (students), you
need the proper method. Therefore, you need to decide which method to use. So you start by
checking the (Answer: distribution normality) of the performance score in each group by using
the (Answer: Shapiro-Wilk Test). The p-value for each tested group is shown in the following table:

class p-value
A 0.4505716
B 0.2808105
C 0.2490939

Statistical Tests
Based on it, you decide to use (Answer: non-parametric method) to find whether there is any
difference in those quizzes. Due to the nature of the problem, you ran
a (Answer: Kruskal-Wallis test), and the result is as follows:

data: score by class
[HIDDEN] chi-squared = 55.454, df = 2, p-value = 9.086e-13

Based on the result, you decide to (Answer: continue with further tests to find in which quiz
students tend to achieve better scores).
Therefore you ran a (Answer: Dunn Test), and the results came out as follows:


Comparison of x by group
(Bonferroni)
Col Mean-|
Row Mean |          A          B
---------+----------------------
       B |   2.193561
         |     0.0848
         |
       C |   7.259698   5.066137
         |    0.0000*    0.0000*

alpha = 0.05
Reject Ho if p <= alpha

Conclusion
Based on the results of the statistical tests, you conclude that:

There are no significantly different score achievements in all three classes.

Students tend to achieve significantly higher scores in class A, followed by those in class B, and
the students in class C, yet the differences are not significant.

Students tend to achieve significantly higher scores in class A, followed by those in class B, and
the students in class C.

Students in classes A and B tend to have higher scores than students in class C. The scores are
not significantly different between students in classes A and B.


Mark 5.00 out of 5.00

The correct answer is:

• Students in classes A and B tend to have higher scores than students in class
C. The scores are not significantly different between students in classes A
and B
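
The sequence of steps in this question (normality check per group, omnibus test, post-hoc comparison) can be sketched in base R as follows. The data frame `quiz_sim` is simulated and purely hypothetical, so its numbers will not reproduce the output above. The Dunn test itself comes from an add-on package, so this sketch swaps in base R's `pairwise.wilcox.test()` with Bonferroni adjustment as a stand-in for the post-hoc step.

```r
# Hypothetical data with the same shape as quiz.result
# (simulated, not the original scores).
set.seed(1)
quiz_sim <- data.frame(
  class = factor(rep(c("A", "B", "C"), each = 50)),
  score = c(rnorm(50, 85, 6), rnorm(50, 82, 6), rnorm(50, 70, 6))
)

# Shapiro-Wilk normality test per class; one p-value per group.
normality_p <- tapply(quiz_sim$score, quiz_sim$class,
                      function(x) shapiro.test(x)$p.value)
print(normality_p)

# Kruskal-Wallis rank sum test for a difference between the three classes.
kw <- kruskal.test(score ~ class, data = quiz_sim)
print(kw)

# Post-hoc pairwise comparisons with Bonferroni correction.
# Base R stand-in for the Dunn test, which is not part of base R.
posthoc <- pairwise.wilcox.test(quiz_sim$score, quiz_sim$class,
                                p.adjust.method = "bonferroni")
print(posthoc)
```

In the post-hoc table, a pair of classes is read as significantly different when its adjusted p-value is at or below 0.05, mirroring the "Reject Ho if p <= alpha" rule in the output above.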

Question 9
Correct
Mark 1.00 out of 1.00

Question text

You are assigned to study whether there is a relationship between video game publishers and the video
game genres. You have a dataset with the following structure:
str(vgs)
'data.frame': 11857 obs. of 11 variables:
$ Rank : int 1 3 4 8 11 12 14 15 16 17 ...
$ Name : Factor w/ 8427 levels ".hack: Sekai no Mukou ni + Versus",..: 8048 4013
8049 8046 5006 4012 8042 8043 3598 2681 ...
$ Platform : Factor w/ 10 levels "DS","PC","PS",..: 8 8 8 8 1 1 8 8 9 5 ...
$ Year : Factor w/ 29 levels "1985","1988",..: 16 18 19 16 15 15 17 19 20 23 ...
$ Genre : Factor w/ 10 levels "Action","Adventure",..: 9 5 9 4 8 5 9 9 4 1 ...
$ Publisher : Factor w/ 467 levels "10TACLE Studios",..: 297 297 297 297 297 297 297
297 266 402 ...
$ NA_Sales : num 41.49 15.85 15.75 14.03 9.07 ...
$ EU_Sales : num 29 12.9 11 9.2 11 ...
$ JP_Sales : num 3.77 3.79 3.28 2.93 1.93 4.13 3.6 2.53 0.24 0.97 ...
$ Other_Sales : num 8.46 3.31 2.96 2.85 2.75 1.92 2.15 1.79 1.67 4.14 ...
$ Global_Sales: num 82.7 35.8 33 29 24.8 ...

You explored the data by making a barplot that shows the grouped distribution, and it came out as follows:

To achieve the goal of the study, you create a (Answer: Cross-tabulation) table.
         Genre
Platform Action Adventure Fighting Misc Racing Role-Playing Shooter Simulation
DS          343       240       36  393     67          200      42        285
PC          165        65        6   24     60          104     148        115
PS          157        69      108   76    145           97      96         60
PS2         348       196      150  222    216          187     160         90
PS3         380        74       76  124     92          119     156         31
PS4         122        19       17   15     17           47      34          5
PSP         222       213       74  106     65          192      37         29
Wii         238        84       42  280     94           35      66         87
X360        324        47       65  126    105           76     203         40
XB          155        26       48   46    123           23     132         24

         Genre
Platform Sports Strategy
DS          148       79
PC           49      188
PS          222       70
PS2         400       71
PS3         213       24
PS4          43        5
PSP         135       60
Wii         261       25
X360        220       28
XB          170       21

With a 95% confidence level, you ran a (Answer: Chi-square Test of Independence), and the result
came out as follows:

data: [HIDDEN]
X-squared = 2873.1, df = 81, p-value < 2.2e-16

Therefore, according to the result, it can be concluded that:

There is a significant relationship between video-game platform and the video-game genres.

All platforms share the same amount of video game publications in each genre.

There is no significant relationship between genres and the platform used.

There are certain video game genres that are commonly published for specific platforms.

Mark 2.00 out of 2.00

The correct answer is:

• There is a significant relationship between video-game platform and the
video-game genres.
• There are certain video game genres that are commonly published for specific
platforms.

(Credit: the dataset was taken from "Video Game Sales Analyze sales data from more than 16,500 games"
by Gregory Smith, available on Kaggle.)

Question 10
Correct
Mark 1.00 out of 1.00


Question text
To gain information about the number of elements in a vector, we use the _____ function.

Select one:
a. ncol()

b. length()
c. sizeof()
d. nrow()
e. getLength()
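
A small sketch of the difference between the base R candidates in this list; the vector `v` is made up for illustration.

```r
# A small numeric vector used only for illustration.
v <- c(10, 20, 30, 40)

# length() counts the elements of a vector.
print(length(v))

# nrow() and ncol() are meant for matrices and data frames;
# on a plain vector they return NULL.
print(nrow(v))
print(ncol(v))
```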

Question 1
Which one of the following is best treated as an ordinal variable?

Select one:

a. educational degree
b. hair color

c. phone number

d. city of residence

e. salary

Feedback

The correct answer is: educational degree

Question 2

Question text

You are interested in knowing the percentage of how the first year UNSRAT undergraduate students
access e-Learning. To estimate the percentage, you surveyed 500 randomly selected students and
determined which means the students use.
Match the item/condition from the example above with the right term!

Parameter → Answer 1
Samples → Answer 2
Population → Answer 3
Statistics → Answer 4

Feedback
The correct answer is: Parameter → The means used to access e-Learning, Samples → 500 randomly
selected first year UNSRAT undergraduate students, Population → First year UNSRAT undergraduate
students, Statistics → Percentage

Question 3

Question text

X is a means of organizing quantitative data. It shows the sum of a class and all classes below it. What is X?

Select one:

a. Stem-and-leaf display

b. Histogram

c. Cumulative Frequency Distribution

d. Frequency distribution table

e. Ogive

Feedback

The correct answer is: Cumulative Frequency Distribution
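
As an illustration of the idea, a cumulative frequency distribution can be built in base R from a frequency table with `cumsum()`. The scores and the class width of 10 below are made-up assumptions.

```r
# Hypothetical quiz scores, made up for illustration.
scores <- c(52, 58, 61, 64, 67, 70, 71, 75, 78, 83, 88, 94)

# Group the scores into classes of width 10: [50,60), [60,70), ...
classes <- cut(scores, breaks = seq(50, 100, by = 10), right = FALSE)

# Frequency of each class, then the running total over the class
# and all classes below it.
freq <- table(classes)
cum_freq <- cumsum(freq)

print(data.frame(class = names(freq),
                 frequency = as.vector(freq),
                 cumulative = as.vector(cum_freq)))
```

The last cumulative value always equals the number of observations, which is a quick check that no score fell outside the chosen classes.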

Question 4

Question text

A publishing company is currently reviewing proposals from bookstores in several universities.
These bookstores are asking for more programming books to be stocked for each of them. Since
the stock in the company's warehouse is limited, the management will decide the
allocation based on historical sales data. Therefore, the management asked you to make the
analysis. The data that they possess contains historical data of the number of students who took
programming courses and the number of programming books sold at the respective university
bookstore. Should a university with more students who took a programming course be
allocated more books?

The dataset that the management gave you is as follows:


str(student.books)
'data.frame': 231 obs. of 2 variables:
$ nstudents : int 204 179 200 177 207 195 166 178 213 130 ...
$ books_sold: int 441 329 467 376 504 396 354 439 461 235 ...
The column nstudents shows the number of students while the column books_sold shows the number
of books sold at a university bookstore with the respective number of students who took the
programming course.

The first thing that you need to do is (Answer).

Therefore, you plotted the number of students against the respective numbers of books sold:

After that, you ran a (Answer) and the result is R = 0.7680.

Based on the value of R, you know that (Answer). Therefore, you decided to summarize and study
the relationship between the number of students and
the number of books sold by using (Answer). The results came out as follows:


Call:
lm(formula = books_sold ~ nstudents, data = student.books)

Residuals:
Min 1Q Median 3Q Max
-80.265 -37.203 -2.531 38.198 83.988

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1796 23.6051 0.05 0.96
nstudents 2.1165 0.1166 18.15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 44.01 on 229 degrees of freedom


Multiple R-squared: 0.5898, Adjusted R-squared: 0.5881
F-statistic: 329.3 on 1 and 229 DF, p-value: < 2.2e-16
From the model, you could conclude that:

For each student taking the programming class, the respective university bookstore could expect a
sale of 2.1165 books.

For a number of students that equals zero, a bookstore may expect a sale of 1.1796 books;
however, since the factor itself is not significant, the bookstore should not cling to that.

1.1796 is a strong predictor of the model.

The number of students taking the programming class is a significant predictor of sales.

For each student taking the programming class, the respective university bookstore could expect a
sale of 1.1796 books.

Mark 3.00 out of 3.00

The correct answer is:

• For a number of students that equals zero, a bookstore may expect a sale of 1.1796 books;
however, since the factor itself is not significant, the bookstore should not cling to
that.
• The number of students taking the programming class is a significant predictor of sales.
• For each student taking the programming class, the respective university bookstore could expect
a sale of 2.1165 books.

Based on the model, for a university with 100 students taking a programming course, the
publisher could expect the respective bookstore would sell Answer: 212.82
Question 5

Question text

You are assigned to study whether there is a relationship between the category and the content
rating of selected apps in Google PlayStore. You have a dataset with the following structure:

str(googleplaystore)
'data.frame': 3398 obs. of 13 variables:
$ App : Factor w/ 3088 levels "¡Ay Caramba!",..: 2472 2679 654
2617 580 1701 2595 790 2762 2343 ...
$ Category : Factor w/ 3 levels "FAMILY","GAME",..: 2 2 2 2 2 2 2 2 2 2
...
$ Rating : num 4.5 4.5 4.4 4.7 4.5 4.2 4.4 4.6 4.3 4.3 ...
$ Reviews : Factor w/ 2379 levels "0","1","10","100",..: 1556 1061 828
969 412 1363 1730 865 2175 68 ...
$ Size : Factor w/ 219 levels "1.0M","1.1M",..: 143 165 163 43 93
45 219 215 136 45 ...
$ Installs : Factor w/ 21 levels "0","0+","1,000,000,000+",..: 10 3 19
7 7 16 10 10 19 19 ...
$ Type : Factor w/ 3 levels "Free","NaN","Paid": 1 1 1 1 1 1 1 1 1
1 ...
$ Price : Factor w/ 38 levels "$0.99","$1.04",..: 38 38 38 38 38 38
38 38 38 38 ...
$ Content.Rating: Factor w/ 4 levels "Everyone","Everyone 10+",..: 2 2 1 1 1
1 1 2 1 1 ...
$ Genres : Factor w/ 85 levels "Action","Action;Action &
Adventure",..: 4 7 23 19 23 29 1 77 1 23 ...
$ Last.Updated : Factor w/ 921 levels "April 1, 2017",..: 460 396 465 78
412 25 735 536 465 691 ...
$ Current.Ver : Factor w/ 1094 levels "0.0.1","0.0.2",..: 606 444 186 551
249 348 1093 600 347 332 ...
$ Android.Ver : Factor w/ 24 levels "1.5 and up","1.6 and up",..: 15 15 15
15 14 15 8 15 13 13 ...
You explored the data by making a barplot that shows the grouped distribution, and it came out as follows:

To achieve the goal of the study, you create a (Answer) table.

                   Content.Rating
Category           Everyone Everyone 10+ Mature 17+ Teen
FAMILY                 1529          131         50  261
GAME                    608          131         74  331
NEWS_AND_MAGAZINES      169           66         14   34

With a 95% confidence level, you ran a (Answer), and the result came out as follows:

data: [HIDDEN]
X-squared = 275.51, df = 6, p-value < 2.2e-16

Therefore, according to the result, it can be concluded that:

The content ratings of Google PlayStore apps are not related to the category.

There is no significant relationship between the category and the content rating of the selected
apps from Google PlayStore.

Most app categories in Google PlayStore are highly related to the content rating.

There is a significant relationship between the category and the content rating of the selected apps
from Google PlayStore.

Mark 0.00 out of 2.00

The correct answer is:

• There is a significant relationship between the category and the content rating of the selected
apps from Google PlayStore.
• Most app categories in Google PlayStore are highly related to the content rating.

(Credit: the dataset was taken and subsetted from "Google Play Store Apps Web scraped data of
10k Play Store apps for analysing the Android market" by Lavanya Gupta, available on Kaggle.)
Question 6

Question text

You are interested in learning the favorite programming languages of the first year Indonesian
Informatics and/or Computer Science undergraduate students. To achieve this mission, you
asked your high school classmates who were admitted to the specified program.

a. The data are collected properly and bias is minimized (Answer)

b. Because a variable is a characteristic of each individual on which data is collected, which of the
following are variables that suit well with the research question?

chosen programming language

number of students who chose particular programming as their favorite one

the respondent's final score in algorithm and programming course

gender

Mark 1.00 out of 1.00

The correct answer is:

• chosen programming language

c. Which chart or graph would be appropriate to display the concerned variable(s)?

a time plot

a pie chart
a boxplot

a bar graph

Mark 1.00 out of 1.00

The correct answer is:

• a bar graph
• a pie chart

Question 7
Incorrect

Mark 0.00 out of 1.00

Question text

Which function(s) is (are) used to store a workspace in R?

Select one or more:

a. write.csv()

b. saveRDS()

c. save()

d. save.image()

e. save.csv()

Feedback

The correct answers are: save.image(), save()
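
A minimal sketch of the two functions named in the feedback, writing to temporary files so nothing is left behind. The objects `x` and `y` and the file names are illustrative assumptions.

```r
# Two illustrative objects in the workspace.
x <- 1:5
y <- c(a = 1.5, b = 2.5)

# save() writes the named objects to an .RData file.
objects_file <- tempfile(fileext = ".RData")
save(x, y, file = objects_file)

# save.image() writes every object in the current workspace
# (the global environment).
workspace_file <- tempfile(fileext = ".RData")
save.image(file = workspace_file)

# load() restores saved objects under their original names;
# here into a fresh environment to keep the example tidy.
restored <- new.env()
load(objects_file, envir = restored)
print(ls(restored))
```

By contrast, `saveRDS()` stores a single object without its name and `write.csv()` exports a data frame as text, which is one way to read why the feedback does not count them as storing a workspace.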


Question 8
Partially correct

Mark 0.73 out of 1.00

Question text

Background

You are assigned to analyze a dataset that contains measures of cholesterol concentration in 72
participants treated with three different drugs. The aim is to examine the potential of a new class
of drugs in lowering cholesterol concentration and consequently reducing heart attack. The
participants include 36 males and 36 females. Males and females were further (equally)
subdivided into whether they were at low or high risk of a heart attack. Is there any difference
in the impact of each drug on cholesterol concentration? If any, which one has the highest
impact, in terms of the lowest cholesterol concentration?

Data Exploration

The structure of the dataset is as follows:

str(heartattack)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 72 obs. of 5 variables:
$ gender : Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ...
$ risk : Factor w/ 2 levels "high","low": 2 2 2 2 2 2 2 2 2 2 ...
$ drug : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 2 2 2 2 ...
$ cholesterol: num 5.24 5.08 4.68 5.36 4.96 ...
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
You then plotted a [Answer], and the result came out as follows:

Since you were comparing a variable across three different groups, you needed to choose a proper
method. You started by checking the [Answer] of the cholesterol concentration in each group by
using the [Answer]. The p-value for each tested group is shown in the following table:

drug p-value

A 0.1537620

B 0.7674545

C 0.5537145

Statistical Tests

Based on these results, you decided to use [Answer] to find out whether there is any difference in
cholesterol concentration between the drugs. Given the nature of the problem, you ran a [Answer],
and the result is as follows:

            Df Sum Sq Mean Sq F value Pr(>F)
drug         2  1.235  0.6177    2.63 0.0793 .
Residuals   69 16.204  0.2348
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Based on the result, you decided to [Answer].

Therefore, you ran a [Answer], and the results came out as follows:

[NAME of TEST HIDDEN]
95% family-wise confidence level

Fit: [HIDDEN]

$drug
            diff        lwr        upr     p adj
B-A -0.277327333 -0.6124096 0.05775494 0.1241979
C-A -0.278421280 -0.6135035 0.05666099 0.1222405
C-B -0.001093947 -0.3361762 0.33398832 0.9999663

Conclusion

Based on the results of the statistical tests, then you conclude that

the drug that yields the lowest cholesterol rate is drug C, followed by drug B, and then drug A

drug C yields a significantly lower cholesterol rate than drugs A and B

drug B yields a significantly lower cholesterol rate than drug A

the drugs gave no significantly different impact on the cholesterol rate

the experiment is a mess

Mark 5.00 out of 5.00

The correct answer is:

• the drugs gave no significantly different impact on the cholesterol rate

(Credit: the dataset used in this vignette is based on the heartattack dataset in the datarium
package)
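
The sequence of steps in this question (normality check per group, one-way ANOVA, then a post-hoc test only if the ANOVA is significant) can be sketched as follows. Because the heartattack data may not be installed, this sketch swaps in the PlantGrowth dataset that ships with base R, which also has one numeric response measured in three groups; that substitution is an assumption made for illustration and does not reproduce the numbers above.

```r
# PlantGrowth ships with base R: plant weight measured in three groups
data(PlantGrowth)

# 1. Normality check per group with the Shapiro-Wilk test
shapiro_p <- tapply(PlantGrowth$weight, PlantGrowth$group,
                    function(x) shapiro.test(x)$p.value)
print(shapiro_p)

# 2. One-way ANOVA across the three groups
fit <- aov(weight ~ group, data = PlantGrowth)
print(summary(fit))
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

# 3. Post-hoc pairwise comparison, run only when the ANOVA is significant
if (anova_p < 0.05) {
  print(TukeyHSD(fit))
} else {
  message("ANOVA not significant at 0.05: no post-hoc test needed")
}
```

In the cholesterol example the ANOVA p-value is 0.0793, so at the 0.05 level the else branch would apply and the analysis would stop at the final conclusion.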

Question 9
Incorrect

Mark 0.00 out of 1.00

Question text
The alternative hypothesis of a _____ t-test has the form of "The mean of x of the A group is higher than
..."

Select one:

a. Unpaired

b. Two-tail

c. Paired

d. Half-tail

e. One-tail

Feedback

The correct answer is: One-tail
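
To make the distinction concrete, here is a minimal sketch using simulated data; the group names, sample sizes, and means are hypothetical. In R's t.test(), the direction of the alternative hypothesis is set with the alternative argument.

```r
# Simulated scores for two hypothetical groups
set.seed(1)
group_a <- rnorm(30, mean = 5.5, sd = 1)
group_b <- rnorm(30, mean = 5.0, sd = 1)

# One-tail: H1 is "the mean of group A is higher than the mean of group B"
one_tail <- t.test(group_a, group_b, alternative = "greater")

# Two-tail: H1 is "the means of the two groups differ"
two_tail <- t.test(group_a, group_b, alternative = "two.sided")

print(one_tail$alternative)
print(two_tail$alternative)
```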

Question 10
Incorrect

Mark 0.00 out of 1.00

Question text

The following table contains a subset of the results from a survey about how the first year
UNSRAT undergraduate students access e-Learning.

questionnaire_code  program      satisfaction
STU001              informatics  4
STU002              civil        7
STU003              law          3
STU004              medical      5

Match the item/condition from the example above with the right term!

Element Answer 1

Observation Answer 2

Variable Answer 3

Feedback

Your answer is incorrect.

The correct answer is: Element → STU003, Observation → 7, Variable → satisfaction


1,341 undergraduate students were surveyed to gain knowledge about the preferred teaching-
and-learning method of the whole UNSRAT student body. There are three teaching-and-learning
methods: online, offline, or blended. The answers were then tabulated and the frequency of each
method is presented in the report. Match the item/condition from the example above with the
right term!

Population → UNSRAT students
Variable → preferred teaching-and-learning method
Statistics → frequency
Samples → the surveyed 1,341 undergraduate students

The correct answer is: Population → UNSRAT students, Variable → preferred teaching-and-learning
method, Statistics → frequency, Samples → the surveyed 1,341 undergraduate students

You are assigned to study whether there is a relationship between the category and the content rating of
selected apps in Google PlayStore. You have a dataset with the following structure:
str(googleplaystore)
'data.frame': 3398 obs. of 13 variables:
$ App           : Factor w/ 3088 levels "¡Ay Caramba!",..: 2472 2679 654 2617 580 1701 2595 790 2762 2343 ...
$ Category      : Factor w/ 3 levels "FAMILY","GAME",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Rating        : num 4.5 4.5 4.4 4.7 4.5 4.2 4.4 4.6 4.3 4.3 ...
$ Reviews       : Factor w/ 2379 levels "0","1","10","100",..: 1556 1061 828 969 412 1363 1730 865 2175 68 ...
$ Size          : Factor w/ 219 levels "1.0M","1.1M",..: 143 165 163 43 93 45 219 215 136 45 ...
$ Installs      : Factor w/ 21 levels "0","0+","1,000,000,000+",..: 10 3 19 7 7 16 10 10 19 19 ...
$ Type          : Factor w/ 3 levels "Free","NaN","Paid": 1 1 1 1 1 1 1 1 1 1 ...
$ Price         : Factor w/ 38 levels "$0.99","$1.04",..: 38 38 38 38 38 38 38 38 38 38 ...
$ Content.Rating: Factor w/ 4 levels "Everyone","Everyone 10+",..: 2 2 1 1 1 1 1 2 1 1 ...
$ Genres        : Factor w/ 85 levels "Action","Action;Action & Adventure",..: 4 7 23 19 23 29 1 77 1 23 ...
$ Last.Updated  : Factor w/ 921 levels "April 1, 2017",..: 460 396 465 78 412 25 735 536 465 691 ...
$ Current.Ver   : Factor w/ 1094 levels "0.0.1","0.0.2",..: 606 444 186 551 249 348 1093 600 347 332 ...
$ Android.Ver   : Factor w/ 24 levels "1.5 and up","1.6 and up",..: 15 15 15 15 14 15 8 15 13 13 ...

You explored the data by making a barplot that shows the grouped distribution.

To achieve the goal of the study, you created a cross-tabulation table.

                     Content.Rating
Category              Everyone Everyone 10+ Mature 17+ Teen
  FAMILY                  1529          131         50  261
  GAME                     608          131         74  331
  NEWS_AND_MAGAZINES       169           66         14   34

At a 95% confidence level, you ran a Chi-square Test of Independence, and the result came out as
follows:

data: [HIDDEN]

X-squared = 275.51, df = 6, p-value < 2.2e-16

Therefore, according to the result, then it can be concluded that

There is no significant relationship between the category and the content rating of the
selected apps from Google PlayStore.

The content ratings of Google PlayStore apps are not related to the category.

Most apps categories in Google PlayStore are highly related with the content rating.

There is a significant relationship between the category and the content rating of the
selected apps from Google PlayStore.
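
The reported statistic can be checked directly from the cross-tabulation. The sketch below re-enters the counts by hand into a matrix (the object names are assumptions); with the original data frame one would more likely pass table(googleplaystore$Category, googleplaystore$Content.Rating) to the same function.

```r
# Counts copied from the cross-tabulation: rows = Category, columns = Content.Rating
counts <- matrix(
  c(1529, 131, 50, 261,
     608, 131, 74, 331,
     169,  66, 14,  34),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Category = c("FAMILY", "GAME", "NEWS_AND_MAGAZINES"),
    Content.Rating = c("Everyone", "Everyone 10+", "Mature 17+", "Teen")
  )
)

# Chi-square test of independence; df = (3 - 1) * (4 - 1) = 6
result <- chisq.test(counts)
print(result)

# A p-value below 0.05 means independence is rejected at the 95% confidence level
significant <- result$p.value < 0.05
print(significant)
```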

A statistical test that is conducted to determine whether there is an association between two
categorical variables is

The correct answer is: Test of Independence

Which of the following are best treated as discrete variables?

The correct answers are: Number of students in a class, Number of students in the whole
university, Grades frequency at the end of a course, Number of children in a family

Background
You are assigned to analyze a dataset that contains the performance score measures of participants at
two-time points. The aim of this study is to evaluate the effect of gender and stress on performance
scores. Is there any difference in performance between different stress levels? If any, which one
yields the best/worst performance score?
Data Exploration
The structure of the dataset is as follows:
str(performance)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 60 obs. of 5 variables:
$ id    : int 1 2 3 4 5 6 7 8 9 10 ...
$ gender: Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ...
$ stress: Factor w/ 3 levels "low","moderate",..: 1 1 1 1 1 1 1 1 1 1 ...
$ t1    : num 5.96 5.51 5.63 5.71 5.74 ...
$ t2    : num 5.58 5.82 5.47 5.79 5.72 ...

Boundary
Since the performance is measured twice, in this problem we only focus on the first measurement (the
t1 column).

You then plotted a boxplot, and the result came out as follows:

Since you were comparing a variable across three different groups, you needed to choose a proper
method. You started by checking the distribution normality of the performance score in each group
by using the Shapiro-Wilk Test. The p-value for each tested group is shown in the following table:

stress level  p-value
low           0.11428304
moderate      0.07023834
high          0.92983350

Statistical Tests
Based on these results, you decided to use a parametric method to find out whether there is any
difference in performance between the stress levels. Given the nature of the problem, you ran a
One-way ANOVA, and the result is as follows:

            Df Sum Sq Mean Sq F value   Pr(>F)
stress       2 0.8235  0.4117    14.5 8.13e-06 ***
Residuals   57 1.6190  0.0284
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Based on the result, you decided to continue with further tests to find which stress level has a
significantly different impact on performance. Therefore, you ran a TukeyHSD test, and the results
came out as follows:

[NAME of TEST HIDDEN]
95% family-wise confidence level

Fit: [HIDDEN]

$stress
                    diff         lwr        upr     p adj
moderate-low   0.1052102 -0.02303774  0.2334582 0.1279077
high-low      -0.1786052 -0.30685319 -0.0503573 0.0040329
high-moderate -0.2838155 -0.41206340 -0.1555675 0.0000053

Conclusion
Based on the results of the statistical tests, then you conclude that

employees with a moderate stress level tend to have significantly higher performance than
those with low and/or high stress levels

the experiment is a violation of human rights

there is no significant performance difference across the stress levels

there is no significant performance difference between employees with moderate and low
stress levels

employees with a high stress level tend to have significantly lower performance compared
to employees with moderate and low stress levels

employees with a high stress level have significantly lower performance, followed by those
with a moderate, and then a low stress level

Suppose that you are interested in the percentage of cellphone brands owned by the students
of UNSRAT. Therefore, on Wednesday, after class, you asked all your classmates about the
brands of their cellphones.

a. Why can collecting data only from your classmates cause bias in the data?
It assumes the percentage of the cellphone brands owned by the first-year students may
represent the whole population of UNSRAT students.

You should ask students from the other classes too.

The subjects were not randomly selected.

Perhaps some of your classmates do not bring their cellphones on Wednesday.

It assumes your classmates represent the whole population of UNSRAT students.


Mark 1.00 out of 1.00

The correct answers are:

• It assumes your classmates represent the whole population of UNSRAT students.
• It assumes the percentage of the cellphone brands owned by the first-year students may
represent the whole population of UNSRAT students.
• The subjects were not randomly selected.

b. Because a variable is a characteristic of each individual on which data is collected, which of
the following are variables in this study?
One of your classmates.
gender

cellphone brand
The day the data collected.
Mark 1.00 out of 1.00
The correct answer is:

• cellphone brand

c. Which chart or graph would be appropriate to display the proportion of the brands?
line plot
boxplot
time plot

pie chart

bar graph
Mark 1.00 out of 1.00

The correct answer is:

• bar graph
• pie chart

200 records of YouTube advertising budgets and the respective sales earnings were collected. You are
asked to analyze whether increasing the advertising budget would increase the sales. The
following dataset is given to you:
str(marketing)
'data.frame': 200 obs. of 4 variables:
$ youtube : num 276.1 53.4 20.6 181.8 217 ...
$ sales   : num 26.5 12.5 11.2 22.2 15.5 ...

The youtube column shows the advertising budget spending, and the sales column shows the
earnings. All these numbers are in thousands of dollars.

The first thing that you need to do is determine whether there is a strong relationship between
the spending on YouTube advertisement service and the sales earning.

Therefore, you plotted the advertising budget and its respective sales.

After that, you ran a Pearson correlation test and the result is R = 0.7822.

Based on the value of R, you know that there is a strong positive relationship between the spending
on YouTube advertisement service and the sales earning. Therefore, you decided to summarize and
study the relationship between the advertising budget and the sales by using Simple Linear
Regression. The results came out as follows:


Call:
lm(formula = sales ~ youtube, data = marketing)

Residuals:
     Min       1Q   Median       3Q      Max
-10.0632  -2.3454  -0.2295   2.4805   8.6548

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.439112   0.549412   15.36   <2e-16 ***
youtube     0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.91 on 198 degrees of freedom
Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16

From the model, you could conclude that

The correct answer is:

• For an advertising budget that equals to zero, a company may expect a sale of USD
8,440.
• For each dollar spent for advertising, a company could expect a sales earning of USD
47.537.
• the intercept 8.439 is a strong predictor of the model

Based on the model, for a company that spent USD 1,000 on YouTube advertising, the company could
expect sales earnings of USD 55,976.112.
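
The figure above can be reproduced from the reported coefficients by plugging youtube = 1000 into the fitted line and reading the result in thousands of dollars. The sketch below does exactly that; the variable names are assumptions. Note that if the youtube column is read strictly in thousands of dollars, as the question states, a USD 1,000 budget would correspond to youtube = 1 rather than 1000, which is shown for comparison.

```r
# Coefficients copied from the regression summary
intercept <- 8.439112
slope <- 0.047537

# Recorded answer: youtube = 1000 is plugged in, sales read in thousands of USD
pred_thousands <- intercept + slope * 1000
pred_usd <- pred_thousands * 1000
print(pred_usd)

# For comparison: youtube = 1, i.e. USD 1,000 when the column is in thousands
pred_usd_unit <- (intercept + slope * 1) * 1000
print(pred_usd_unit)
```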

Qualitative data could be organized in the following ways:

Select one or more:


The correct answers are: Frequency distribution table, Tally marks, Relative frequency,
Percentage

10

To gain information about the number of elements in a vector, we use the _____ function.

The correct answer is: length()
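
A one-line illustration, using a made-up character vector:

```r
# Hypothetical vector of cellphone brand labels
brands <- c("A", "B", "C", "A")

# length() returns the number of elements in the vector
n_elements <- length(brands)
print(n_elements)
```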


1. 200 records of YouTube advertising budgets and the respective sales
earnings were collected. You are asked to analyze whether increasing the
advertising budget would increase the sales. The following dataset is given to
you:
str(marketing)
'data.frame': 200 obs. of 4 variables:
$ youtube : num 276.1 53.4 20.6 181.8 217 ...
$ sales   : num 26.5 12.5 11.2 22.2 15.5 ...

The youtube column shows the advertising budget spending, and the sales
column shows the earnings. All these numbers are in thousands of dollars.

The first thing that you need to do is

The correct answer: Determine whether there is a strong relationship between the
spending on YouTube advertisement service and the sales earning

After that, you ran a Pearson correlation test and the result is R = 0.7822.
Based on the value of R, you know that there is a strong positive relationship between the
spending on YouTube advertisement service and the sales earning.
Therefore, you decided to summarize and study the relationship between the
advertising budget and the sales by using Simple Linear Regression. The results
came out as follows:

Call:
lm(formula = sales ~ youtube, data = marketing)

Residuals:
     Min       1Q   Median       3Q      Max
-10.0632  -2.3454  -0.2295   2.4805   8.6548

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.439112   0.549412   15.36   <2e-16 ***
youtube     0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.91 on 198 degrees of freedom
Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16

From the model, you could conclude that:


• For each dollar spent for advertising, a company could expect a sales
earning of USD 47.537.
• the intercept 8.439 is a strong predictor of the model
• For an advertising budget that equals to zero, a company may expect a sale of USD 8,440
Based on the model, for a company that spent USD 1,000 on YouTube advertising,
the company could expect sales earnings of USD 55,976.112.

2. You are interested in learning the favorite programming languages of first-
year Indonesian Informatics and/or Computer Science undergraduate
students. To achieve this mission, you asked your high school classmates who
were admitted to the specified program.

a. The data are collected properly and bias is minimized TRUE


b. Because a variable is a characteristic of each individual on which data is
collected, which of the following are variables that suit well with the research
question?
• chosen programming language
c. Which chart or graph would be appropriate to display the concerned
variable(s)?
• a pie chart
• a bar graph

3. You are assigned to study if there is any connection between the district
where a person lives and his/her hobby. There are 671 randomly selected
respondents that were interviewed. Their answers are collected into a data
frame with the following structure:
str(district.hobby)
'data.frame': 671 obs. of 2 variables:
$ district: Factor w/ 4 levels "DISTRICT 1","DISTRICT 2",..: 2 1 4 4 2 2 2 3 1 4 ...
$ hobby   : Factor w/ 6 levels "BASKETBALL","FOOTBALL",..: 3 3 2 6 2 3 2 3 5 2 ...

You explored the data by making a barplot that shows the grouped
distribution.
To achieve the goal of the study, you created a cross-tabulation table:

            hobby
district     BASKETBALL FOOTBALL PAINTING PHOTOGRAPHY SINGING TRAVELING
  DISTRICT 1         39       29       19          28      37        29
  DISTRICT 2         29       33       29          30      25        32
  DISTRICT 3         26       24       30          22      30        19
  DISTRICT 4         28       36       23          24      26        24

At a 95% confidence level, you ran a Chi-square Test of Independence, and the result
came out as follows:
data: [HIDDEN]

X-squared = 13.811, df = 15, p-value = 0.5399

Therefore, according to the result, then it can be concluded that


• Someone's hobby is independent of the district where one lives
• There is no significant relationship between the district where
someone lives with his/her hobby.

4. To gain information about the number of elements in a vector, we use the
_____ function.
Select one:
• length()
5. 1,341 undergraduate students were surveyed, to gain knowledge about
the preferred teaching-and-learning method of the whole UNSRAT
students. There are three teaching-and-learning methods: online,
offline, or blended. The answers were then tabulated and the frequency of
each method is presented in the report. Match the item/condition from
the example above with the right term!
Population → UNSRAT students
Statistics → Frequency
Samples → The surveyed 1,341 undergraduate students
Variable → Preferred teaching and learning method

6. Which of the following are best treated as ordinal variables?


Grades
7. A statistical test that compares or tests the suitability of observations
against expectations or their theoretical frequencies is
Goodness-of-Fit
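
As a small sketch of a goodness-of-fit test in R, the example below compares made-up observed counts against equal expected proportions; the counts and category names are hypothetical. Passing expected proportions through the p argument makes chisq.test() run a goodness-of-fit test rather than a test of independence.

```r
# Hypothetical observed counts for three teaching-and-learning methods
observed <- c(online = 18, offline = 22, blended = 20)

# Theoretical proportions: each method equally preferred
expected_share <- c(1/3, 1/3, 1/3)

# Chi-square goodness-of-fit test; df = number of categories - 1
gof <- chisq.test(observed, p = expected_share)
print(gof)
```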
8. Background
You are assigned to analyze a dataset that contains measures of cholesterol
concentration in 72 participants treated with three different drugs. The aim is to
examine the potential of a new class of drugs in lowering cholesterol concentration
and consequently reducing heart attack. The participants include 36 males and 36
females. Males and females were further (equally) subdivided into whether they
were at low or high risk of a heart attack. Is there any difference in the impact of
each drug on cholesterol concentration? If any, which one has the highest
impact, in terms of the lowest cholesterol concentration?

Data Exploration
The structure of the dataset is as follows:
str(heartattack)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 72 obs. of 5 variables:
$ gender     : Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ...
$ risk       : Factor w/ 2 levels "high","low": 2 2 2 2 2 2 2 2 2 2 ...
$ drug       : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 2 2 2 2 ...
$ cholesterol: num 5.24 5.08 4.68 5.36 4.96 ...
$ id         : int 1 2 3 4 5 6 7 8 9 10 ...

You then plotted a boxplot, and the result came out as follows:

Since you were comparing a variable across three different groups, you needed to choose a
proper method. You started by checking the distribution normality of the cholesterol
concentration in each group by using the Shapiro-Wilk Test. The p-value for each tested
group is shown in the following table:
drug p-value
A 0.1537620
B 0.7674545
C 0.5537145

Statistical Tests
Based on these results, you decided to use a parametric method to find out whether there is
any difference in cholesterol concentration between the drugs. Given the nature of the
problem, you ran a One-way ANOVA, and the result is as follows:

            Df Sum Sq Mean Sq F value Pr(>F)
drug         2  1.235  0.6177    2.63 0.0793 .
Residuals   69 16.204  0.2348
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Based on the result, you decided to draw a final conclusion.
Therefore, you ran a [FURTHER TEST IS UNNECESSARY], and the results came out as follows:

[NAME of TEST HIDDEN]
95% family-wise confidence level

Fit: [HIDDEN]

$drug
            diff        lwr        upr     p adj
B-A -0.277327333 -0.6124096 0.05775494 0.1241979
C-A -0.278421280 -0.6135035 0.05666099 0.1222405
C-B -0.001093947 -0.3361762 0.33398832 0.9999663

Conclusion
Based on the results of the statistical tests, then you conclude that
the drugs gave no significantly different impact on the cholesterol rate

9. The following table contains a subset of the results from a survey about how the
first year UNSRAT undergraduate students access e-Learning.
questionnaire_code  program      access_mean
STU001              informatics  personal notebook/PC
STU002              civil        shared notebook/PC
STU003              law          NA
STU004              medical      personal tablet

Match the item/condition from the example above with the right term!

Observation personal notebook/PC


Element STU001

Variable access_mean

10. Quantitative data visualization which separates the first digit and
the other digits is
Select one:
• Stem-and-leaf display
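
For a quick look at such a display, base R provides stem(). The values below are hypothetical GP scores used only to produce some output.

```r
# Hypothetical first-semester GP values
gp <- c(2.78, 3.20, 3.88, 3.92, 3.10, 3.45, 2.95, 3.60)

# stem() prints a stem-and-leaf display: leading digits form the stem,
# the following digit of each value forms a leaf
stem(gp)

# Keep the printed lines as text, e.g. to include them in a report
stem_lines <- capture.output(stem(gp))
```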

Question 1
Partially correct

Mark 0.67 out of 1.00

Question text

Background
You are assigned to analyze a dataset that contains measures of cholesterol concentration in 72 participants
treated with three different drugs. The aim is to examine the potential of a new class of drugs in lowering
cholesterol concentration and consequently reducing heart attack. The participants include 36 males and 36
females. Males and females were further (equally) subdivided into whether they were at low or high risk of
a heart attack. Is there any difference in the impact of each drug on cholesterol concentration? If any,
which one has the highest impact, in terms of the lowest cholesterol concentration?

Data Exploration
The structure of the dataset is as follows:
str(heartattack)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 72 obs. of 5 variables:
$ gender     : Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ...
$ risk       : Factor w/ 2 levels "high","low": 2 2 2 2 2 2 2 2 2 2 ...
$ drug       : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 2 2 2 2 ...
$ cholesterol: num 5.24 5.08 4.68 5.36 4.96 ...
$ id         : int 1 2 3 4 5 6 7 8 9 10 ...

You then plotted a boxplot, and the result came out as follows:

Since you were comparing a variable across three different groups, you needed to choose a proper
method. You started by checking the distribution normality of the cholesterol concentration in each
group by using the Shapiro-Wilk Test. The p-value for each tested group is shown in the following
table:

drug p-value

A 0.1537620

B 0.7674545

C 0.5537145

Statistical Tests
Based on these results, you decided to use a parametric method to find out whether there is any
difference in cholesterol concentration between the drugs. Given the nature of the problem, you
ran a One-way ANOVA, and the result is as follows:

            Df Sum Sq Mean Sq F value Pr(>F)
drug         2  1.235  0.6177    2.63 0.0793 .
Residuals   69 16.204  0.2348
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Based on the result, you decided to draw a final conclusion.

[FURTHER TEST IS UNNECESSARY]

Therefore, you ran a [Answer], and the results came out as follows:

[NAME of TEST HIDDEN]
95% family-wise confidence level

Fit: [HIDDEN]

$drug
            diff        lwr        upr     p adj
B-A -0.277327333 -0.6124096 0.05775494 0.1241979
C-A -0.278421280 -0.6135035 0.05666099 0.1222405
C-B -0.001093947 -0.3361762 0.33398832 0.9999663

Conclusion
Based on the results of the statistical tests, then you conclude that

the drugs gave no significantly different impact on the cholesterol rate

drug B yields a significantly lower cholesterol rate than drug A

the drug that yields the lowest cholesterol rate is drug C, followed by drug B, and then drug A

the experiment is a mess

drug C yields a significantly lower cholesterol rate than drugs A and B

Mark -15.00 out of 5.00

The correct answer is:

• the drugs gave no significantly different impact on the cholesterol rate

(Credit: the dataset used in this vignette is based on the heartattack dataset in the datarium package)

Question 2
Correct

Mark 1.00 out of 1.00

Question text
The following table contains a subset of the results from a survey about the achievement of the present
second-year students in your program in their first semester.
questionnaire_code 1st_GP

STU001 3.92

STU002 3.88

STU003 3.2

STU004 2.78

Match the item/condition from the example above with the right term!

Element → STU001
Variable → 1st_GP
Observation → 3.92

Feedback
Your answer is correct.
The correct answer is: Element → STU001, Variable → 1st_GP, Observation → 3.92

Question 3
Correct

Mark 1.00 out of 1.00

Question text
A publishing company is currently reviewing proposals from bookstores in several universities. These
bookstores are each asking for more programming books to be stocked. Since the stock in the
company's warehouse is limited, the management will decide the allocation based on historical sales
data and has asked you to make the analysis. The data that they possess contains
historical data of the number of students who took programming courses and the number of programming
books sold at the respective university bookstore. Should a university with more students who took a
programming course be allocated more books?

The dataset that the management gave you is as follows:


str(student.books)
'data.frame': 231 obs. of 2 variables:

$ nstudents : int 204 179 200 177 207 195 166 178 213 130 ...

$ books_sold: int 441 329 467 376 504 396 354 439 461 235 ...

The column nstudents shows the number of students while the column books_sold shows the
number of books sold at a university bookstore with the respective number of students who took
the programming course.

The first thing that you need to do is determine whether there is a strong relationship between the
number of students taking a programming course and the number of programming books sold at a
particular university bookstore.

Therefore, you plotted the number of students with the respective numbers of books sold.

After that, you ran a Pearson correlation test and the result is R = 0.7680.

Based on the value of R, you know that there is a strong positive relationship between the number of
students taking a programming course and the number of programming books sold at a particular
university bookstore. Therefore, you decided to summarize and study the relationship between the
number of students and the number of books sold by using Simple Linear Regression. The results came
out as follows:

Call:
lm(formula = books_sold ~ nstudents, data = student.books)

Residuals:
    Min      1Q  Median      3Q     Max
-80.265 -37.203  -2.531  38.198  83.988

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.1796    23.6051    0.05     0.96
nstudents     2.1165     0.1166   18.15   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 44.01 on 229 degrees of freedom
Multiple R-squared: 0.5898, Adjusted R-squared: 0.5881
F-statistic: 329.3 on 1 and 229 DF, p-value: < 2.2e-16

From the model, you could conclude that:

For each student taking the programming class, the respective university bookstore could
expect a sale of 2.1165 books.

For a number of students that equals zero, a bookstore may expect a sale of 1.1796 books;
however, since the factor itself is not significant, the bookstore should not cling to that.

1.1796 is a strong predictor of the model

For each student taking the programming class, the respective university bookstore could
expect a sale of 1.1796 books.

The number of students taking the programming class is a significant predictor of sales.

Mark 3.00 out of 3.00

The correct answers are:

• For a number of students that equals zero, a bookstore may expect a sale of 1.1796 books;
however, since the factor itself is not significant, the bookstore should not cling to that.
• The number of students taking the programming class is a significant predictor of sales.
• For each student taking the programming class, the respective university bookstore could
expect a sale of 2.1165 books.

Based on the model, for a university with 100 students taking a programming course, the publisher could
expect the respective bookstore to sell 212.82 books.
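
The prediction for 100 students follows from the fitted line, books_sold = intercept + slope × nstudents. A minimal sketch using the rounded coefficients printed in the summary (the variable names are assumptions): this yields about 212.83, so the recorded 212.82 presumably comes from truncation or from the unrounded coefficients, which is a guess rather than something the output confirms.

```r
# Coefficients copied (rounded) from the regression summary
intercept <- 1.1796
slope <- 2.1165

# Expected number of books sold for 100 students
books_expected <- intercept + slope * 100
print(books_expected)
```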

Question 4
Correct

Mark 1.00 out of 1.00

Question text
You are assigned to study if there is any connection between the district where a person lives and his/her
hobby. There are 671 randomly selected respondents that were interviewed. Their answers are collected
into a data frame with the following structure:
str(district.hobby)
'data.frame': 671 obs. of 2 variables:
$ district: Factor w/ 4 levels "DISTRICT 1","DISTRICT 2",..: 2 1 4 4 2 2 2 3 1 4 ...
$ hobby   : Factor w/ 6 levels "BASKETBALL","FOOTBALL",..: 3 3 2 6 2 3 2 3 5 2 ...

You explored the data by making a barplot that shows the grouped distribution.

To achieve the goal of the study, you created a cross-tabulation table.

            hobby
district     BASKETBALL FOOTBALL PAINTING PHOTOGRAPHY SINGING TRAVELING
  DISTRICT 1         39       29       19          28      37        29
  DISTRICT 2         29       33       29          30      25        32
  DISTRICT 3         26       24       30          22      30        19
  DISTRICT 4         28       36       23          24      26        24

At a 95% confidence level, you ran a Chi-square Test of Independence, and the result came out as
follows:

data: [HIDDEN]

X-squared = 13.811, df = 15, p-value = 0.5399

Therefore, according to the result, then it can be concluded that

There is a significant relationship between the district where someone lives with his/her
hobby.

There is no significant relationship between the district where someone lives with his/her
hobby.

Some hobbies are significantly preferred in certain districts.

Someone's hobby is independent of the district where one lives.

Mark 2.00 out of 2.00

The correct answers are:

• There is no significant relationship between the district where someone lives with his/her
hobby.
• Someone's hobby is independent of the district where one lives.

Question 5
Correct

Mark 1.00 out of 1.00

Question text
Methods that can be used to find out whether the data is normally distributed or not are
Select one or more:

a. Applying the Kolmogorov-Smirnov test

b. Applying the Kruskal-Wallis test

c. Applying the Wilcoxon Rank Sum test

d. Observing pie chart and bar plot

e. Applying the Shapiro-Wilk test

f. Observing histogram and density plot


Feedback
The best choices are: Observing histogram and density plot, Applying the Shapiro-Wilk
test, Applying the Kolmogorov-Smirnov test
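
The three accepted methods can be tried in R as follows. This is only a sketch on simulated data: the vector `x` and its mean and standard deviation are made up for illustration. Note that the Kolmogorov-Smirnov p-value is only approximate when the mean and standard deviation are estimated from the same sample.

```r
set.seed(1)                           # reproducible simulated data
x <- rnorm(100, mean = 50, sd = 10)   # hypothetical sample

# Shapiro-Wilk test: H0 = the data come from a normal distribution.
sw <- shapiro.test(x)

# Kolmogorov-Smirnov test against a normal distribution that uses the
# sample's own mean and standard deviation.
ks <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

print(sw)
print(ks)

# Visual checks: histogram with the estimated density curve on top.
if (interactive()) {
  hist(x, freq = FALSE, main = "Histogram and density of x")
  lines(density(x))
}
```

In both tests a small p-value (for example below 0.05) is evidence against normality; a large p-value does not prove normality, so the plots remain useful.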

Question 6
Correct
Points 1.00 out of 1.00

Question text
You are interested in knowing the achievement of the present second-year students in your
program in their first semester. It is measured according to the GP achieved. You then collected
the 1st semester GP of 31 randomly selected second-year students and calculated the mean.
Match the item/condition from the example above with the right term!

Answer 1: Population → Second-year students

Answer 2: Parameter → GP

Answer 3: Samples → 31 randomly selected second-year students

Answer 4: Statistics → Average

Feedback
The correct answer is: Population → Second-year students, Parameter → GP, Samples → 31
randomly selected second-year students, Statistics → Average
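
The distinction between the population, the sample, and the statistic can be illustrated with a short R sketch. All numbers here are simulated: the population size of 120 and the GP range of 2.0 to 4.0 are assumptions made only for the example.

```r
set.seed(42)

# Hypothetical population: first-semester GP of all 120 second-year students.
population_gp <- round(runif(120, min = 2.0, max = 4.0), 2)

# Sample: 31 students drawn at random, without replacement.
sample_gp <- sample(population_gp, size = 31)

# Statistic: a number computed from the sample (here, the average).
sample_mean <- mean(sample_gp)

# The same summary over the whole population, normally unknown in practice.
population_mean <- mean(population_gp)

cat("Sample mean GP:    ", round(sample_mean, 2), "\n")
cat("Population mean GP:", round(population_mean, 2), "\n")
```

The two averages usually differ a little; that gap is the sampling error that the random selection of 31 students is meant to keep unbiased.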

Question 7
Correct

Points 1.00 out of 1.00

Question text
Qualitative data can be visualized in the following ways:
Select one or more:

a. Raw data

b. Bar plot

c. Pie chart
d. Boxplot

e. Percentage

f. Relative frequency

g. Frequency distribution table

h. Tally marks

Feedback
The best choices are: Bar plot, Pie chart

Question 8
Correct

Points 1.00 out of 1.00

Question text
Suppose that you are interested in the percentage of cellphone brands owned by the students of
UNSRAT. Therefore, on Wednesday, after class, you asked all your classmates about the brands
of their cellphones.

a. Why can collecting data only from your classmates cause bias in the data?

The subjects were not randomly selected.

Perhaps some of your classmates do not bring their cellphones on Wednesday.

It assumes the percentage of the cellphone brands owned by the first-year students may
represent the whole population of UNSRAT students.

It assumes your classmates represent the whole population of UNSRAT students.

You should ask students from the other classes too.

Points 1.00 out of 1.00

The best choices are:

• It assumes your classmates represent the whole population of UNSRAT students.
• It assumes the percentage of the cellphone brands owned by the first-year students may
represent the whole population of UNSRAT students.
• The subjects were not randomly selected.

b. Because a variable is a characteristic of each individual on which data is collected, which of
the following are variables in this study?

The day the data were collected.

gender

cellphone brand

One of your classmates.


Points 1.00 out of 1.00

The best choices are:

• cellphone brand

c. Which chart or graph would be appropriate to display the proportion of the brands?

pie chart

time plot

boxplot

line plot

bar graph

Points 1.00 out of 1.00

The best choices are:

• bar graph
• pie chart
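
A bar graph or pie chart of brand proportions can be drawn in R from a frequency table. The sketch below uses made-up answers; the vector `brands` and the placeholder names "Brand A" to "Brand C" are assumptions, not data from the study.

```r
# Hypothetical answers from classmates; the brand names are placeholders.
brands <- c("Brand A", "Brand B", "Brand A", "Brand C", "Brand B",
            "Brand A", "Brand C", "Brand A", "Brand B", "Brand A")

freq <- table(brands)       # frequency of each brand
prop <- prop.table(freq)    # proportion of each brand
print(round(prop, 2))

# Draw the charts only in an interactive session.
if (interactive()) {
  barplot(prop, ylab = "Proportion", main = "Cellphone brands")
  pie(freq, main = "Cellphone brands")
}
```

Both charts work because cellphone brand is a qualitative variable; a boxplot, time plot, or line plot would need numeric or time-ordered data.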

Question 9
Correct
Points 1.00 out of 1.00

Question text
Which command(s) correctly create a sequence of numbers in R?
Select one or more:

a. seq(1, 10)

b. seq(10, 1)

c. 1:10

d. seq(10)

e. seq(10, 1, 1)

f. seq(1, 10, -1)

g. seq(10, 1, -1)

Feedback
The best choices are: 1:10, seq(1, 10), seq(10, 1), seq(10, 1, -1), seq(10)
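
The options can be checked directly in R. The sketch below runs the five accepted commands and wraps the two rejected ones in `try()` so the script keeps going when they fail; the variable names are illustrative.

```r
# Valid ways to build a sequence.
a   <- 1:10             # 1, 2, ..., 10
b   <- seq(1, 10)       # same values as 1:10
c10 <- seq(10)          # a single argument counts from 1 up to 10
d   <- seq(10, 1)       # counts down: 10, 9, ..., 1
e   <- seq(10, 1, -1)   # explicit step of -1, also counts down

# Invalid: the sign of the step must match the direction from start to end.
bad1 <- try(seq(10, 1, 1), silent = TRUE)
bad2 <- try(seq(1, 10, -1), silent = TRUE)

print(a)
print(d)
print(inherits(bad1, "try-error"))   # TRUE: seq(10, 1, 1) raised an error
print(inherits(bad2, "try-error"))   # TRUE: seq(1, 10, -1) raised an error
```

Options e and f are rejected because a positive step cannot go from 10 down to 1, and a negative step cannot go from 1 up to 10.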

Question 10
Correct

Points 1.00 out of 1.00

Question text
Which of the following are best treated as discrete variables?
Select one or more:

a. Number of students who achieve pass grades


b. The distance between two cities

c. Height

d. Number of classes in a college building


Feedback
The best choices are: Number of classes in a college building, Number of students who
achieve pass grades
