Data Analysis with R

First name Last name

Department name, Institution name

Course: ALY 6000: Data Analysis with R



R Practice - Analysis Report

Introduction

The goal of Project 1 was to develop core R programming skills, such as creating vectors, manipulating data, and computing basic statistics. The analysis consisted of solving several problems, building visualizations, and manipulating a dataset containing salary data. This report summarizes the analysis and its main results.

Key Findings

Mathematical Operations

The analysis started with simple mathematical operations, such as multiplication, exponentiation, modulo, division, and logical operations. Expressions such as 123 * 453 and TRUE | FALSE were straightforward to evaluate, since each involves a single arithmetic or logical operator in R. The results for Problem 1 are shown below.

Problem 1 Results:
> cat("123 * 453 =", result1, "\n")
123 * 453 = 55719
> cat("5^2 * 40 =", result2, "\n")
5^2 * 40 = 1000
> cat("TRUE & FALSE =", result3, "\n")
TRUE & FALSE = FALSE
> cat("TRUE | FALSE =", result4, "\n")
TRUE | FALSE = TRUE
> cat("75 %% 10 =", result5, "\n")
75 %% 10 = 5
> cat("75 / 10 =", result6, "\n\n")
75 / 10 = 7.5
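
The printed values can be reproduced with a short script. This is a minimal sketch: the names result1 through result6 follow the cat() calls above, but the assignments themselves are assumed, since the report does not show them.

```r
# Assumed assignments behind the cat() calls in the report
result1 <- 123 * 453      # multiplication
result2 <- 5^2 * 40       # exponentiation is evaluated before multiplication
result3 <- TRUE & FALSE   # logical AND
result4 <- TRUE | FALSE   # logical OR
result5 <- 75 %% 10       # remainder after division (modulo)
result6 <- 75 / 10        # ordinary division

cat("123 * 453 =", result1, "\n")
cat("75 %% 10 =", result5, "\n")
```

Operator precedence matters for the second expression: 5^2 is evaluated first, so the result is 25 * 40 rather than 5^(2 * 40).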

Vector Creation and Manipulation

Vector construction was an integral part of the analysis. Functions such as c(), seq(), and rep() were used to build vectors with specific patterns and sequences. For example, the seq() function made it easy to create vectors of even numbers within specified bounds. The code snippet below shows how vectors can be manipulated in R, and the comments explain which operations were performed on the vectors.

# Problem 14
> # Adding 20 to each element of the 'second_vector'.
> second_vector + 20
[1] 30 32 34 36 38 40 42 44 46 48 50
>
> # Multiplying each element of the 'second_vector' by 20.
> second_vector * 20
[1] 200 240 280 320 360 400 440 480 520 560 600
>
> # Creating a logical vector indicating whether each element of
> # 'second_vector' is greater than or equal to 20.
> second_vector >= 20
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
>
> # Creating a logical vector indicating whether each element of
> # 'second_vector' is not equal to 20.
> second_vector != 20
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
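
The construction functions named in this section can be sketched as follows. The definition of second_vector is an assumption reconstructed from the printed output (10 to 30 in steps of 2); the other names and values are hypothetical.

```r
# Assumed definition, inferred from the printed results in the report
second_vector <- seq(from = 10, to = 30, by = 2)

# c() combines individual values into one vector
combined <- c(1, 5, 9)

# rep() repeats values: 'times' repeats the whole vector,
# 'each' repeats every element in turn
repeated_whole <- rep(c(1, 2), times = 3)   # 1 2 1 2 1 2
repeated_each  <- rep(c(1, 2), each = 3)    # 1 1 1 2 2 2

# Arithmetic and comparisons are applied element by element
shifted     <- second_vector + 20
at_least_20 <- second_vector >= 20

print(shifted)
print(at_least_20)
```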

In the snippet below, a logical vector inside brackets was used to extract elements from first_vector, with the results shown:

# Problem 23
> vector_from_boolean_brackets <- first_vector[c(FALSE, TRUE, FALSE, TRUE)]
> cat("Problem 23 Result:\n")
Problem 23 Result:
> print(vector_from_boolean_brackets)
[1] 12 5
> # Comment: Elements at positions where the corresponding logical values
> # are TRUE are extracted from first_vector.
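
Logical indexing can be illustrated on a small stand-in vector. The report does not show the contents of first_vector, so example_vector below is hypothetical, chosen only so that positions 2 and 4 hold 12 and 5.

```r
# Hypothetical four-element vector (not the report's first_vector)
example_vector <- c(3, 12, 8, 5)

# TRUE keeps the element at that position, FALSE drops it
picked <- example_vector[c(FALSE, TRUE, FALSE, TRUE)]
print(picked)

# The same elements selected by position instead of by a logical mask
picked_by_position <- example_vector[c(2, 4)]

# A logical mask shorter than the vector is recycled along its length
longer <- 1:8
every_other <- longer[c(FALSE, TRUE)]
print(every_other)
```

Recycling is worth keeping in mind: if the indexed vector is longer than the mask, the mask repeats, so more elements may be returned than the mask length suggests.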

In Problems 24, 25, and 26, we examined the given pieces of code and wrote a one-sentence comment explaining what each was doing, as shown in the code snippets below:

> # Problem 24
> cat(second_vector >= 20)
FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
> # Returns a logical vector indicating whether each element of second_vector is
> # greater than or equal to 20.

> # Problem 25
> cat(ages_vector <- seq(from = 10, to = 30, by = 2))
10 12 14 16 18 20 22 24 26 28 30
> cat("ages_vector is a sequence of numbers from 10 to 30 with a step of 2.\n\n")
ages_vector is a sequence of numbers from 10 to 30 with a step of 2.

> # Problem 26
> subset_ages_vector <- ages_vector[ages_vector >= 20]
> cat("A subset of ages_vector containing only elements greater than or equal to
+ 20 is created.\n")
A subset of ages_vector containing only elements greater than or equal to
20 is created.

In Problem 30, we wrote the code and explained in a comment what we thought it was doing, as shown below:

> # Problem 30
> set.seed(5)
> random_vector <- runif(n = 10, min = 0, max = 1000)
> cat("A vector of 10 random numbers between 0 and 1000 is generated with a fixed
+ seed of 5.\n")
A vector of 10 random numbers between 0 and 1000 is generated with a fixed
seed of 5.

For Problem 37, we explained in a comment what we thought the code was meant to do; the answer is shown below:

# Problem 37
> set.seed(5)
> random_vector <- rnorm(n = 1000, mean = 50, sd = 15)
> cat("A vector of 1000 random numbers from a normal distribution with mean 50 and
+ standard deviation 15 is generated with a fixed seed of 5.\n\n")
A vector of 1000 random numbers from a normal distribution with mean 50 and
standard deviation 15 is generated with a fixed seed of 5.

Summary Statistics

Descriptive statistics, including the sum, mean, median, maximum, and minimum, were computed for the vectors. These built-in R functions made it possible to extract useful information from the data with a single call each.
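
As a brief illustration, the statistics listed above can be computed on the ages_vector sequence from Problem 25. This is a sketch; the report does not state which vectors the statistics were actually computed on, so that choice is an assumption.

```r
# Sequence from Problem 25: 10, 12, ..., 30
ages_vector <- seq(from = 10, to = 30, by = 2)

total    <- sum(ages_vector)
average  <- mean(ages_vector)
middle   <- median(ages_vector)
largest  <- max(ages_vector)
smallest <- min(ages_vector)

cat("sum =", total, " mean =", average, " median =", middle, "\n")

# summary() reports the minimum, quartiles, mean, and maximum in one call
print(summary(ages_vector))
```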

For Problem 38, we passed random_vector to the hist() function. The histogram is depicted below, and the explanation follows:



The hist() function creates a histogram of the values in random_vector. The histogram visually represents the distribution of the random numbers, which were generated from a normal distribution with mean 50 and standard deviation 15.
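
A minimal sketch of the histogram step follows, reusing the seed and rnorm() call from Problem 37. The title, axis label, and fill colour are illustrative choices, not taken from the report.

```r
set.seed(5)
random_vector <- rnorm(n = 1000, mean = 50, sd = 15)

# plot = FALSE returns the bin boundaries and counts without drawing
bins <- hist(random_vector, plot = FALSE)
print(bins$breaks)
print(bins$counts)

# Draw the histogram with descriptive labels (labels are illustrative)
hist(random_vector,
     main = "Histogram of random_vector",
     xlab = "Value",
     col = "lightblue")
```

Inspecting the counts alongside the plot is a quick way to confirm that every observation falls into some bin and that the bulk of the values sit near the mean.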

Conditional Subsetting and Filtering

Conditional subsetting and filtering were used prominently in the analysis. Operations such as extracting specific elements from vectors based on logical conditions, and filtering the dataset to keep only high-salary rows, exercised these data manipulation skills.
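
Conditional subsetting of a vector can be sketched as follows. The salary values are made up for illustration; the 80000 threshold matches the one used in the Problem 42 filter.

```r
# Hypothetical salaries in USD (values made up for illustration)
salaries_usd <- c(65000, 120000, 150000, 95000, 40000)

# A single condition yields a logical mask, which then selects elements
high_pay <- salaries_usd[salaries_usd > 80000]

# Conditions combine element-wise with & (and) and | (or)
mid_range <- salaries_usd[salaries_usd >= 60000 & salaries_usd <= 100000]

# which() returns the positions where the condition holds
high_pay_positions <- which(salaries_usd > 80000)

print(high_pay)
print(mid_range)
print(high_pay_positions)
```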

Random Number Generation

Random numbers were generated with the runif() and rnorm() functions, with set.seed() fixing the seed so that the same values are produced on every run and can be reused in later analysis.
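
The effect of set.seed() can be checked directly. The sketch below reuses the runif() call from Problem 30; the variable names are hypothetical.

```r
# Same seed, same call: the two draws are identical
set.seed(5)
first_draw <- runif(n = 10, min = 0, max = 1000)

set.seed(5)
second_draw <- runif(n = 10, min = 0, max = 1000)

# Without resetting the seed, the generator continues and yields new values
third_draw <- runif(n = 10, min = 0, max = 1000)

cat("Same seed, same values:", identical(first_draw, second_draw), "\n")
cat("Next draw differs:", !identical(second_draw, third_draw), "\n")
```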

Data Exploration and Visualization

The analysis also covered data exploration and visualization using the ggplot2 package. The provided dataset (ds_salaries.csv) was loaded into R and examined, which involved reading the data into a data frame and extracting useful information from it.
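
Loading and inspecting a CSV file can be sketched as below. Because ds_salaries.csv is not available here, the example writes a small stand-in file first; its rows are made up, and only the two column names are taken from the report.

```r
# Stand-in file: a temporary CSV with the two columns the report uses.
# With the real data this would be read.csv("ds_salaries.csv"), path assumed.
csv_path <- tempfile(fileext = ".csv")
sample_rows <- data.frame(
  job_title     = c("Data Analyst", "Data Scientist", "ML Engineer"),
  salary_in_usd = c(65000, 120000, 150000)
)
write.csv(sample_rows, csv_path, row.names = FALSE)

# Read the file into a data frame and inspect it
first_dataframe <- read.csv(csv_path)

print(head(first_dataframe, n = 2))  # first two rows
print(names(first_dataframe))        # column names
str(first_dataframe)                 # column types and a preview of values
```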

In Problem 42, we provided explanations for the given chunk of code; the chart produced at the end of the code is attached below:

# Problem 42
>
> # Display the first 6 rows of the dataframe
> cat("head(first_dataframe):\n")
head(first_dataframe):
>
> # Display the first 7 rows of the dataframe
> cat("head(first_dataframe, n = 7):\n")
head(first_dataframe, n = 7):
>
> # Display the column names of the dataframe
> cat("names(first_dataframe):\n")
names(first_dataframe):
>
> # Select only the 'job_title' and 'salary_in_usd' columns and store in a
> # new dataframe
> smaller_dataframe <- select(first_dataframe, job_title, salary_in_usd)
> cat("smaller_dataframe:\n")
smaller_dataframe:
>
> # Arrange the smaller dataframe by descending order of 'salary_in_usd'
> better_smaller_dataframe <- arrange(smaller_dataframe, desc(salary_in_usd))
> cat("better_smaller_dataframe (sorted by salary_in_usd):\n")
better_smaller_dataframe (sorted by salary_in_usd):
>
> # Filter the dataframe to include only rows where 'salary_in_usd' > 80000
> better_smaller_dataframe <- filter(smaller_dataframe, salary_in_usd > 80000)
> cat("better_smaller_dataframe (filtered by salary_in_usd > 80000):\n")
better_smaller_dataframe (filtered by salary_in_usd > 80000):
>
> # Add a new column 'salary_in_euros' calculated from 'salary_in_usd'
> better_smaller_dataframe <- mutate(smaller_dataframe, salary_in_euros = salary_in_usd * 0.94)
> cat("better_smaller_dataframe (added salary_in_euros):\n")
better_smaller_dataframe (added salary_in_euros):
>
> # Select specific rows from the dataframe using the slice function
> better_smaller_dataframe <- slice(smaller_dataframe, 1, 1, 2, 3, 4, 10, 1)
> cat("better_smaller_dataframe (selected specific rows):\n")
better_smaller_dataframe (selected specific rows):
>
> # Create a bar plot using ggplot for job titles vs. salaries in USD
> ggplot(better_smaller_dataframe) +
+   geom_col(mapping = aes(x = job_title, y = salary_in_usd), fill = "blue") +
+   xlab("Job Title") +
+   ylab("Salary in US Dollars") +
+   labs(title = "Comparison of Jobs ") +
+   scale_y_continuous(labels = scales::dollar) +
+   theme(axis.text.x = element_text(angle = 50, hjust = 1))
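
For comparison, the dplyr verbs used in Problem 42 have close base R counterparts. The sketch below is not the method used in the report; it swaps in base R so that it runs without extra packages. The data frame is hypothetical, and only its column names come from the report.

```r
# Hypothetical data; column names follow the report, values are made up
smaller_dataframe <- data.frame(
  job_title     = c("Data Analyst", "Data Scientist", "ML Engineer", "Data Engineer"),
  salary_in_usd = c(65000, 120000, 150000, 95000)
)

# Like arrange(..., desc(salary_in_usd)): sort rows by descending salary
sorted_df <- smaller_dataframe[order(-smaller_dataframe$salary_in_usd), ]

# Like filter(..., salary_in_usd > 80000): keep rows that meet the condition
filtered_df <- subset(smaller_dataframe, salary_in_usd > 80000)

# Like mutate(..., salary_in_euros = ...): add a derived column
with_euros <- transform(smaller_dataframe, salary_in_euros = salary_in_usd * 0.94)

# Like slice(..., 1, 1, 2): pick rows by position; in base R a repeated
# position repeats the row
sliced_df <- smaller_dataframe[c(1, 1, 2), ]

print(sorted_df)
print(filtered_df)
print(with_euros)
print(sliced_df)
```

Note that each step in Problem 42 starts again from smaller_dataframe, so the plotted data frame reflects only the final slice() call rather than a chain of all the transformations.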

Recommendations and Conclusion

The analysis provided extensive practice in essential R programming skills. Competence in vector construction, modification, and data visualization reflects a developing ability to analyze data with R. Moreover, the ability to compute summary statistics and apply conditional subsetting underscores practical knowledge of basic statistical topics. To improve the quality of subsequent analyses, it is advisable to explore state-of-the-art data visualization techniques, such as interactive plots or dashboards. In addition, working with datasets that better represent real-life situations, with more complex structures, would provide more realistic practice.


