
IPD Spring 2024

2. Working with Samples, Frequency Analysis, Installing and Using Packages


The big picture:


What does R (Data Science) have to do with
business management or economics?


Why not Excel?

Knirsch@htw-berlin.de

Data Science

What is most important in Data Science?


Data Science with R

- Code and data are kept separate.


- Changing data means coding the change.
→ It is NOT possible to manipulate data in your data set by hand without leaving a trace: every change appears
in the script.
- Every step is in the script:
- data source (loading the data),
- tidying of data,
- transforming and computing,
- analysing and visualizing.
- Everyone can follow each step and re-produce the results.

So, will Excel or other spreadsheet applications disappear?


No, of course not. Excel is easy and good to use for many purposes, just not for data analysis when you want to
be able to reproduce results.


Recap

• We know vectors and know how to work with them.

• We can subset vectors.

• We know functions, can call functions and pass parameters.

• We know a couple of operators.


Important R Data Objects

Dimension | One Data Type | Multiple Data Types
1 | Vector |


Random Sample of Data


- What is a population?
- What is a sample?

Random sample of data:


Sometimes one needs a random sample of data. With such an artificial data set one can test code or
functions and learn how R operates or reacts.

In R, the function sample() creates a random (artificial) sample from the values of a vector.

?sample()

What are the first two arguments of sample()?


What are the first two arguments of sample()?

a) Data object and size.

b) Size and replace.

c) There is only one argument.


sample()
The first two arguments are:

x # vector. The sample is drawn from the elements of this vector.

size # non-negative integer giving the number of elements in the sample.

We already created the vector

die

which we use now. We simulate throwing the die 4 times.


sample(die,size=4)

Do we all get the same result?


What do we do if we want to save the result?
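
The two questions above can be sketched as follows. This is only a sketch: it assumes the vector die was created earlier as the integers 1 to 6 (the slides do not show its definition), and the name throws is made up for illustration. set.seed() is used here only to make the random draw repeatable.

```r
# Assumption: die holds the six faces of a die.
die <- 1:6

# Without a seed, every call (and every person) gets a different sample.
set.seed(2024)                   # fix the state of the random number generator
throws <- sample(die, size = 4)  # store the result with the assignment operator
throws                           # print the stored sample

# Re-running with the same seed reproduces exactly the same sample.
set.seed(2024)
identical(throws, sample(die, size = 4))
```

Storing the sample in an object matters because every new call to sample() draws again.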


Working with a Random Sample


Throwing the die only 4 times does not allow us to tell anything about the
throwing-behavior of the die and probabilities.
We need a much higher number. Let’s try 480.
What do we expect of the behavior of the die and the probabilities of the thrown
numbers?

sample(die,size=480)

What is the result?


Working with a Random Sample

[Figure: 4 drawings (6, 3, 1, 5) with argument replace=FALSE]

This is where the argument replace comes in. The argument replace is set to FALSE by default, so each element of
the vector can be drawn only once. This means that the sample cannot have more elements than the vector
(population).


Working with a Random Sample


[Figure: 4 drawings (6, 3, 1, 5) with argument replace=TRUE]

If the sample should have more elements than the vector (so that values repeat), the argument replace has to
be set to TRUE:
sample(die,size=500,replace=TRUE)
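
The effect of replace can be checked directly. The sketch below again assumes die is 1:6. It uses tryCatch() to capture the error instead of stopping the script; the names result and throws are made up for illustration.

```r
die <- 1:6  # assumption: the die vector from the earlier slides

# replace = FALSE (the default): 480 draws from 6 values are impossible.
result <- tryCatch(
  sample(die, size = 480),
  error = function(e) conditionMessage(e)
)
result  # an error message instead of a sample

# replace = TRUE: each face can be drawn again, so any size works.
throws <- sample(die, size = 480, replace = TRUE)
length(throws)
```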

Working with Random Samples

The result rows start with a number in brackets []. The number gives the position of the first element in that
row: Row one starts with the element at position [1], row two starts with the element in [ ] , …

When throwing a (fair) die, we expect each number to come up with a probability of 1/6. In a large
number of throws, each number should show up in approximately one sixth of the throws.

→ Count the absolute frequency of the numbers: table()


What does the (nested) command look like?

→ Plot a barplot out of the result: barplot()


What does the nested command look like?

→ Add some color to the barplot.


What is the argument that colors our barplot? How can you find out?


What argument do you need in order to color the bars?

A. color=
B. cols =
C. colors =
D. col =


First Script

We write our first script – a sequence of code lines that can be saved as an .R file.

1. Create a virtual die,


2. simulate rolling the die 500 times and store result in 2nd vector
3. count absolute frequencies of thrown numbers
4. barplot absolute frequencies.

• We comment each line of code.

• After testing the script we save it as .R file.

Always save your scripts.
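
One possible version of such a script is sketched below. The object names throws and frequencies and the color "steelblue" are arbitrary choices, not part of the task.

```r
# 1. Create a virtual die with the faces 1 to 6.
die <- 1:6

# 2. Simulate rolling the die 500 times and store the result in a second vector.
throws <- sample(die, size = 500, replace = TRUE)

# 3. Count the absolute frequency of each thrown number.
frequencies <- table(throws)
frequencies

# 4. Draw a bar plot of the absolute frequencies; col sets the bar color.
barplot(frequencies, col = "steelblue",
        xlab = "Number thrown", ylab = "Absolute frequency")
```

Because the script creates every object it uses, it runs in a fresh session on any machine.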


Clean Up the Environment

The environment shows your active data objects.


→ When exiting, R offers to save the objects in an .RData file.
→ Bad Solution! Do not save your data objects.
→ Instead, set: Tools → Global Options → "Save workspace to .RData on exit" to
“Ask” or “Never”.

Why?
• you clutter your environment.
• you do not know when and how you created or manipulated the objects in the environment.
• scripts that run in your session will not run on other machines (other sessions) because other sessions do
not have your objects.
→ Good Solution: Make your work reproducible!
→ Create the objects that you need for a task in a script and save the script.


Frequency Analysis


Frequency Analysis – One Variable

• descriptive statistical method


• that counts the number of occurrences of a value in a given set (population or sample).

Frequency analysis is the most fundamental and arguably the most important analysis for almost any kind of data.
It very often is the starting point for analysis.

Example: lark-owl-biorhythm: Larks are early risers whereas owls like to stay up late.
We ask a sample of 100 students and create a vector out of their answers. In our example we create a sample:

biorhythm <- sample( .....

What if we have more owl students than lark students?


With the argument 'prob = .... ' we can assign a probability to each value.

biorhythm <- sample( .....
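
To illustrate how prob works, here is a sketch with made-up numbers. The two answer labels and the 30/70 split are assumptions chosen only for this example; they are not the values from the lecture.

```r
# Assumption: the two possible answers and the 30/70 split are made up.
answers <- c("lark", "owl")

set.seed(1)  # only to make the sketch repeatable
biorhythm <- sample(answers, size = 100, replace = TRUE,
                    prob = c(0.3, 0.7))  # one weight per answer: lark, owl

table(biorhythm)  # absolute frequency of each answer
```

The weights in prob are matched to the elements of the first argument by position, so their order must follow the order of the values.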


Frequency Analysis – One Variable

What command do you need for the frequency count?

We only have 1 variable for analysis here. What could be a second interesting variable?


Frequency Analysis – One Variable

Example: Weight analysis of plants:

weightPlants <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14,
                  4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69,
                  6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26)
table(weightPlants)
barplot(table(weightPlants))

What is wrong with this analysis? It is not meaningful at all. Why?


Data Types

Data Types
- Categorical: Nominal, Ordinal
- Numerical: Discrete, Continuous

knirsch@kiu.edu.ge

Frequency Analysis with Different Data Types

Data Types
- Categorical: Nominal, Ordinal
- Numerical: Discrete, Continuous

Frequency Analysis:
Chart Type:


Data Types – Frequency Analysis 1 Variable

Data Types
- Categorical: Nominal, Ordinal
- Numerical: Discrete, Continuous

Chart type by data type:
- Nominal: bar chart
- Ordinal: bar chart
- Discrete: bar chart
- Continuous: histogram

Function:
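
For continuous data, the difference between counting raw values and counting intervals can be sketched with the plant weights from the earlier example. The breakpoints (class width 0.5 between 3.5 and 6.5) are one possible choice, not a prescribed one, and the name h is made up.

```r
# Plant weights from the example above (continuous data).
weightPlants <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14,
                  4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69,
                  6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26)

# table() treats every distinct value as its own category,
# so almost every count is 1 and the bar plot shows no pattern.
length(table(weightPlants))

# hist() first groups the values into intervals (classes) and then counts.
# Assumption: breakpoints every 0.5 from 3.5 to 6.5.
h <- hist(weightPlants, breaks = seq(3.5, 6.5, by = 0.5),
          col = "steelblue", xlab = "Weight", main = "Plant weights")
h$counts  # number of plants per interval
```

Changing the breaks argument changes the shape of the histogram, so it is worth trying more than one class width.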


R Extension Packages

Good News
• many additional functions
• packages with specific functions for practically every special task
• A package is called a library because it holds functions like a library holds books.

Bad News
• packages have different grammar, different syntax
• packages may not be compatible
• not all packages are well maintained
• no automatic update

Recommendations:
- use as few extension packages as possible.
- only use packages from a trustworthy developer community.
- only use well maintained and well documented packages.

R Extension Packages

• Packages always contain a set of additional functions. Everybody is allowed to write R functions,
collect the functions in a package and place the package for everybody to download and use at
the CRAN website. There are packages with suitable functions for practically every special task.
That is the good news.

• The bad news is that the grammar of the packages differs. There are different programmers at
work. Some packages are maintained and updated well, others might not. Not all of the
packages are compatible. Not all packages work well.

• When you update R, packages are not updated automatically. They may or may not be
compatible with the latest version of R depending on how well maintained they are.


R Extension Packages

Install Once
• Install package tidyverse.
This is actually a collection of packages with the same grammar and syntax.
• Install package mosaic.
It contains statistical functions.
• Install via the button “Install” on the packages page.

Load for each session / script
• In the first line of a script
• Load the packages you need with the command library(packagename)
• example: library(tidyverse)

How can you see which package a function belongs to?

?ggplot()
?mean()
?filter()

The help page tells us right at the top what package the function comes with.
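
Besides the help page, the package can also be queried in code. The sketch below uses only functions that ship with R. It works for ordinary functions; a few built-in primitives such as sum() have no environment to inspect.

```r
# environment() returns the namespace a function was defined in;
# environmentName() turns that namespace into the package name.
environmentName(environment(sample))   # sample() comes with base
environmentName(environment(barplot))  # barplot() comes with graphics

# find() lists every attached package that provides a given name.
find("filter")
```

If two attached packages provide the same name, find() lists both, which makes masking visible.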

R Extension Packages

• Remember:
Packages have to be installed once but loaded for every session.

• The command to load a package is:


library(name_of_package).

• Best Practice:
Make it a habit to start each script with loading the libraries you need:
library(tidyverse)
library(mosaic)
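
The "install once, load every session" rule can be checked in code. This is a sketch, not a required part of a script: the names needed, is_installed and missing_pkgs are made up, and the install and load calls are left commented out because they need an internet connection or installed packages.

```r
# Packages the script needs.
needed <- c("tidyverse", "mosaic")

# requireNamespace() reports whether a package is installed without attaching it.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]
missing_pkgs  # packages that still have to be installed once

# One-time installation (needs an internet connection), left commented out:
# install.packages(missing_pkgs)

# Loading happens in every session, at the top of the script:
# library(tidyverse)
# library(mosaic)
```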


R is picky:
• R distinguishes between uppercase and lowercase letters, i.e. revenue and Revenue are
two different objects for R.
• R uses the “.” as a decimal separator (not the “,” as in many European countries).
• Missing values are indicated in R by NA (not available).
• Comments are introduced with the pound sign (hashtag) “#”; R then ignores the rest of
the line.
• Variable names in R should begin with a letter; after that, only letters, numbers, dots (.) and underscores
(_) are allowed. You should avoid spaces.
• To read the content of a variable, simply enter the name of the object in the console (or
send the command from the script window to the console).
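
A few of these rules can be tried out in the console. The object names revenue and sales and their values are made up for illustration.

```r
revenue <- 1250.75         # "." is the decimal separator
exists("revenue")          # TRUE: this object was just created
exists("Revenue")          # FALSE in a fresh session: capital R is a different name

sales <- c(10.5, NA, 7.5)  # NA marks a missing value
mean(sales)                # NA: the missing value propagates
mean(sales, na.rm = TRUE)  # NA is dropped before averaging

revenue                    # entering the name prints the content
```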


Assignments

R Extension Packages: Install Once – Load Many Times

• Install Packages tidyverse and mosaic.


• Load packages tidyverse and mosaic in your session.
• When loading package tidyverse we get conflict warnings. What do they mean?

library(tidyverse)
Attaching packages ------------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.3.2 v purrr 0.3.4
v tibble 3.0.4 v dplyr 1.0.2
v tidyr 1.1.2 v stringr 1.4.0
v readr 1.4.0 v forcats 0.5.0
-- Conflicts ----------------------------------- -- tidyverse_conflicts()
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
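
As background for the conflict messages: two packages provide a function with the same name, and the one loaded later hides ("masks") the earlier one. The package::function notation always reaches a specific version. The sketch below uses only stats::filter(), which ships with R; the name smoothed is made up.

```r
# stats::filter() applies a linear filter to a numeric series,
# here a moving average over three neighbouring values.
smoothed <- stats::filter(1:5, rep(1/3, 3))
as.numeric(smoothed)  # first and last value are NA: they lack a neighbour

# After library(tidyverse), plain filter() means dplyr::filter(),
# which keeps the rows of a data frame that match a condition.
# The stats version stays reachable through the stats:: prefix.
```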


First Script

We write our first script – a sequence of code lines that can be saved as an .R file.

1. Create a virtual die,


2. simulate rolling the die 500 times and store result in 2nd vector
3. count absolute frequencies of thrown numbers
4. barplot absolute frequencies.

• Comment each line of code.


• Test your script
• Save your script as .R file


Frequency Analysis of Hobbies

Write the following script:

1. Create a random sample with 1000 elements about hobbies of people. Store the sample in a vector.
The hobbies are: swimming, football, singing, hiking and collecting stamps.

2. Calculate weighted probabilities, using the argument prob of the sample() function.

- We assume that the hobbies singing and football are more common than the other hobbies. This means that
in order to get a more realistic random sample we need to weight the elements: Singing and football have to
show up in our sample more often than swimming, hiking and collecting stamps.

- The argument prob is itself a vector with one weight per element. sample() rescales the weights internally,
but it is clearest to let them sum to 1.


Frequency Analysis of Hobbies

3. Calculate the absolute frequency (occurrences) of the hobbies.

4. Calculate the relative frequency (percentage) of the hobbies. Use functions table() and length()
to do this.

5. Plot the percentages with a barchart.

6. Sort the bars in the barplot: The highest bar (=highest percentage) left, the lowest bar (lowest
percentage) right.

7. Comment your script and save it.
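
The techniques for steps 3 to 6 can be sketched on a small toy vector. The data below are hypothetical and deliberately not the hobbies from the assignment; the object names are made up as well.

```r
# Hypothetical toy data, deliberately not the assignment's hobbies.
pets <- c("cat", "dog", "dog", "fish", "dog", "cat", "dog", "cat", "dog", "fish")

absolute <- table(pets)                        # absolute frequencies
relative <- absolute / length(pets) * 100      # relative frequencies in percent
sorted   <- sort(relative, decreasing = TRUE)  # highest percentage first
sorted

# The bars follow the order of the sorted table: highest left, lowest right.
barplot(sorted, col = "steelblue", ylab = "Percent")
```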


Student Height Measurement


Frequency Analysis

1. Collect the height of all students present in lab in a vector.

2. Do a frequency analysis and plot.

3. Set fitting breakpoints.


Temperature Measurement

Given are the following temperature measurements:


{10.5839, 13.3985, -5.2386, -7.3046, 15.0012, 17.3475, -2.6572}

Write a short R script:


1. Create a vector temp with the above values,
2. Display mean value of all temperature values, round result to 2 digits,
3. Display mean value of all positive temperature values, round result to 2 digits,
4. Display mean value of all negative temperature values, round result to 2 digits,
5. Display the number of elements with values above 0,
6. Display the number of elements with values below 0
7. Comment and save your script


Temperature Measurement

Given are the following temperature measurements:


{10.5839, 13.3985, -5.2386, -7.3046, 15.0012, 17.3475, -2.6572}

What are the correct commands?

1. Read out temperature values at positions 4 and 6

2. Read out temperature values from positions 1 to 4 and position 7

3. Display all temperature values that are higher than 6.0
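
The subsetting techniques behind these tasks can be sketched on a different vector. The values below are hypothetical and deliberately not the measurements from the assignment; the name v is made up.

```r
# Hypothetical values, deliberately not the assignment's measurements.
v <- c(4.256, -1.5, 8.75, -3.2, 0.5)

v[c(2, 4)]                # elements at positions 2 and 4
v[c(1:3, 5)]              # positions 1 to 3 and position 5
v[v > 1]                  # all values greater than 1
sum(v > 0)                # number of positive values (each TRUE counts as 1)
round(mean(v[v < 0]), 2)  # mean of the negative values, rounded to 2 digits
```

A comparison such as v > 0 returns a logical vector, which can be used both to select elements and to count them.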
