2. Working with Samples, Frequency Analysis, Installing and Using Packages
IPD Spring 2024
Knirsch@htw-berlin.de
Data Science
Recap
1 Vector
Sometimes one needs a random sample of data. With such an artificial set of data one can test code or
functions and learn how R operates or reacts. There are many cases when a random sample comes in handy.
?sample()
sample()
The first two arguments are x (the vector to draw from, here die) and size (the number of elements to draw).
sample(die,size=480)
[Figure: drawings from the die with argument replace=FALSE]
This is where the argument replace comes in. It is set to FALSE by default, so each element of the vector can be
drawn only once. The sample can therefore have at most as many elements as the vector (the population), and a
call that asks for more elements than that fails with an error.
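A minimal sketch of this behaviour, assuming the object die holds the six faces as `die <- 1:6` (the slides do not show how die was created). The call `tryCatch()` is used here only to capture the error message so that the script keeps running.

```r
die <- 1:6                      # assumption: the population is the six faces

# Without replacement (the default) each face can be drawn only once,
# so a sample of size 6 is just a random ordering of the faces.
shuffled <- sample(die, size = 6)
print(shuffled)

# Asking for more elements than the population holds raises an error.
# tryCatch() turns the error into a character string instead of stopping.
result <- tryCatch(
  sample(die, size = 480),
  error = function(e) conditionMessage(e)
)
print(result)
```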
[Figure: drawings from the die with argument replace=TRUE]
If the sample should have more elements than the vector (which necessarily produces duplicates), the argument
replace has to be set to TRUE:
sample(die,size=500,replace=TRUE)
Each row of the printed result starts with a number in brackets []. The brackets give the position of the first
element in that row: row one starts with the element at position [1], row two starts with the element at the
position shown in its brackets, and so on.
When throwing a (fair) die, we expect each number to come up with a probability of 1/6. A large
number of throws should give a result in which each number shows up approximately one sixth of the time.
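The expectation above can be checked with a short sketch. It assumes `die <- 1:6`, a sample size of 6000 and a fixed seed; none of these values come from the slides.

```r
set.seed(2024)                  # assumption: fixed seed so the run is repeatable
die <- 1:6
throws <- sample(die, size = 6000, replace = TRUE)

abs_freq <- table(throws)                 # absolute frequency of each face
rel_freq <- abs_freq / length(throws)     # relative frequency of each face

print(round(rel_freq, 3))
print(round(1 / 6, 3))                    # the value each face should be close to
```

With a small sample the relative frequencies can deviate noticeably from 1/6; the deviations shrink as the number of throws grows.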
A. color=
B. cols =
C. colors =
D. col =
First Script
We write our first script, a sequence of code lines that can be saved as an .R file.
Why?
• you clutter your environment.
• you do not know when and how you created or manipulated the objects in the environment.
• scripts that run in your session will not run on other machines (other sessions) because other sessions do
not have your objects.
→ Good Solution: Make your work reproducible!
→ Create the objects that you need for a task in a script and save the script.
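A minimal sketch of such a self-contained script. The file name, the seed and the object names are assumptions chosen for illustration. The script creates every object it uses, and `set.seed()` makes the random sample repeatable on any machine.

```r
# first_script.R  (hypothetical file name)
# Everything the task needs is created here, so a fresh session can run it.
set.seed(123)                   # fixes the random numbers for reproducibility
die <- 1:6
throws <- sample(die, size = 60, replace = TRUE)

# Re-running with the same seed reproduces exactly the same sample.
set.seed(123)
throws_again <- sample(die, size = 60, replace = TRUE)
print(identical(throws, throws_again))
```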
Frequency Analysis
Frequency analysis is the most fundamental and arguably the most important analysis for almost any kind of data.
It is very often the starting point of an analysis.
Example: lark-owl biorhythm. Larks are early risers, whereas owls like to stay up late.
We ask a sample of 100 students and create a vector out of their answers. In our example we create a sample:
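A minimal sketch of such a sample, assuming the two answers are coded as "lark" and "owl" and that both are equally likely. The object names and the seed are hypothetical.

```r
set.seed(1)                               # assumption: fixed seed
chronotype <- c("lark", "owl")            # the two possible answers

# Simulate the answers of 100 students.
answers <- sample(chronotype, size = 100, replace = TRUE)

# Frequency analysis: count how often each answer occurs.
freq <- table(answers)
print(freq)

# A bar chart is the usual chart type for categorical data.
barplot(freq, main = "Larks and owls among 100 students")
```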
We only have 1 variable for analysis here. What could be a second interesting variable?
weightPlants <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14,
                  4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69,
                  6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26)
table(weightPlants)
barplot(table(weightPlants))
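For numerical data like these weights, `table()` counts almost every value separately, so the bar chart says little. One common alternative, sketched here under the assumption that a class width of 0.5 is acceptable, is to group the values into classes with `cut()` or to draw a histogram with `hist()`.

```r
weightPlants <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14,
                  4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69,
                  6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26)

# Assumption: classes of width 0.5 that cover the whole range of the data.
breaks <- seq(3.5, 6.5, by = 0.5)

# Group each weight into its class and count the classes.
classes <- cut(weightPlants, breaks = breaks)
class_freq <- table(classes)
print(class_freq)

# hist() does the grouping and the plotting in one step.
hist(weightPlants, breaks = breaks, main = "Plant weights", xlab = "Weight")
```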
Data Types
Data types are either categorical or numerical.
knirsch@kiu.edu.ge
Frequency Analysis:
Chart Type:
Function:
R Extension Packages
Recommendations:
- use as few extension packages as possible.
- only use packages from a trustworthy developer community.
- only use well maintained and well documented packages.
• Packages always contain a set of additional functions. Everybody is allowed to write R functions,
collect the functions in a package and place the package for everybody to download and use at
the CRAN website. There are packages with suitable functions for practically every special task.
That is the good news.
• The bad news is that the grammar of the packages differs. There are different programmers at
work. Some packages are maintained and updated well, others might not. Not all of the
packages are compatible. Not all packages work well.
• When you update R, packages are not updated automatically. They may or may not be
compatible with the latest version of R depending on how well maintained they are.
?ggplot()
?mean()
?filter()
The help page tells us right at the top what package the function comes with.
• Remember:
Packages have to be installed once but loaded for every session.
• Best Practice:
Make it a habit to start each script by loading the libraries you need:
library(tidyverse)
library(mosaic)
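The install-once, load-every-session rule can be sketched as follows. The sketch only checks which packages are present and loads those; the actual installation is left as a commented line because it needs internet access and should be run only once. The helper names pkgs and installed are hypothetical.

```r
pkgs <- c("tidyverse", "mosaic")

# Which of the packages are already installed on this machine?
# requireNamespace() returns TRUE or FALSE instead of stopping with an error.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Install the missing ones ONCE (needs internet access); run by hand:
# install.packages(pkgs[!installed])

# Load the installed ones at the top of EVERY script or session.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```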
R is picky:
• R distinguishes between uppercase and lowercase letters, i.e. revenue and Revenue are
two different objects for R.
• R uses the “.” as a decimal separator (not the “,” as in many European countries).
• Missing values are indicated in R by NA (not available).
• Comments are introduced with the pound sign (hashtag) “#”; R then ignores the rest of
the line.
• Variable names in R should begin with a letter; after that, letters, digits, periods (.) and
underscores (_) are allowed. You should avoid spaces.
• To read the content of a variable, simply enter the name of the object in the console (or
send the command from the script window to the console).
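A few of these rules in a short sketch. The object names and values are hypothetical.

```r
# Case sensitivity: these are two different objects.
revenue <- 1250.75              # "." is the decimal separator
Revenue <- 999
print(identical(revenue, Revenue))

# Missing values are written as NA and propagate through calculations.
temps <- c(21.5, NA, 19.0)
mean_with_na <- mean(temps)                    # NA, because one value is missing
mean_without_na <- mean(temps, na.rm = TRUE)   # 20.25, the NA is dropped first
print(mean_with_na)
print(mean_without_na)

# Typing the name of an object prints its content.
revenue
```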
Assignments
library(tidyverse)
Attaching packages ------------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.3.2 v purrr 0.3.4
v tibble 3.0.4 v dplyr 1.0.2
v tidyr 1.1.2 v stringr 1.4.0
v readr 1.4.0 v forcats 0.5.0
-- Conflicts ----------------------------------- -- tidyverse_conflicts()
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
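The Conflicts lines mean that two loaded packages offer a function with the same name, and the one loaded last wins. A hedged sketch of how to stay unambiguous is to name the package explicitly with `::`. The vector x and the threshold are hypothetical, and the dplyr part runs only if dplyr is installed.

```r
x <- c(2, 4, 6, 8, 10)

# stats::filter() applies a linear filter to a series;
# here it computes a centred 3-point moving average.
smoothed <- stats::filter(x, rep(1 / 3, 3))
print(as.numeric(smoothed))

# dplyr::filter() does something entirely different: it keeps the rows
# of a data frame that satisfy a condition.
if (requireNamespace("dplyr", quietly = TRUE)) {
  big <- dplyr::filter(data.frame(x = x), x > 5)
  print(big)
}
```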
First Script
We write our first script, a sequence of code lines that can be saved as an .R file.
1. Create a random sample with 1000 elements about hobbies of people. Store the sample in a vector.
The hobbies are: swimming, football, singing, hiking and collecting stamps.
2. Use weighted probabilities, using the argument prob of the sample() function.
- We assume that the hobbies singing and football are more common than the other hobbies. This means that
in order to get a more realistic random sample we need to weight the elements: Singing and football have to
show up in our sample more often than swimming, hiking and collecting stamps.
- The argument prob is itself a vector. A valid prob vector should sum to 1.
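How prob works can be sketched with a different example, a loaded die, so that the task itself stays open. The weights, the seed and the object names are assumptions made for illustration.

```r
set.seed(7)                               # assumption: fixed seed
die <- 1:6

# One weight per element of die, in the same order; face 6 is favoured.
weights <- c(0.1, 0.1, 0.1, 0.1, 0.1, 0.5)
print(sum(weights))                       # the weights add up to 1

throws <- sample(die, size = 1000, replace = TRUE, prob = weights)

# Face 6 should now make up roughly half of all throws.
rel <- table(throws) / length(throws)
print(round(rel, 2))
```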
4. Calculate the relative frequency (percentage) of the hobbies. Use the functions table() and length()
to do this.
6. Sort the bars in the barplot: The highest bar (=highest percentage) left, the lowest bar (lowest
percentage) right.
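The two techniques, relative frequencies from `table()` and `length()` and sorted bars, can be sketched on a small hand-made vector. The fruit data are hypothetical and are not part of the task.

```r
fruit <- c("apple", "pear", "apple", "plum", "apple", "pear")

# Relative frequency: counts divided by the number of observations.
rel_freq <- table(fruit) / length(fruit)

# Convert to percent and sort from the highest to the lowest value.
pct <- sort(rel_freq * 100, decreasing = TRUE)
print(round(pct, 1))

# barplot() draws the bars in the order of the sorted table.
barplot(pct, ylab = "Percent")
```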
Temperature Measurement