Analysis Report

Data Analysis with R
First name Last name
Department name, Institution name
Course: ALY 6000: Data Analysis with R

R Practice - Analysis Report
Introduction
The goal of Project 1 was to develop core R programming skills, such as
creating vectors, manipulating data, and computing basic statistics. The analysis consisted of
solving several problems, building visualizations, and manipulating a dataset containing salary
data. This report summarizes the analysis and its main results.
Key Findings
Mathematical Operations
The analysis started with simple mathematical operations, such as multiplication,
exponentiation, and logical operations. Expressions such as 123 * 453 and TRUE | FALSE
evaluate directly with R's built-in arithmetic and logical operators. The results for
Problem 1 are shown below.
Problem 1 Results:
cat("123 * 453 =", result1, "\n")
123 * 453 = 55719
cat("5^2 * 40 =", result2, "\n")
5^2 * 40 = 1000
cat("TRUE & FALSE =", result3, "\n")
TRUE & FALSE = FALSE
cat("TRUE | FALSE =", result4, "\n")
TRUE | FALSE = TRUE
cat("75 %% 10 =", result5, "\n")
75 %% 10 = 5
cat("75 / 10 =", result6, "\n\n")
75 / 10 = 7.5
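The cat() calls above print variables result1 through result6, whose definitions are not shown in the report. A minimal sketch of how they might be assigned, assuming each variable simply holds the expression named in its label:

```r
# Assumed definitions: each result holds the expression named in its label.
result1 <- 123 * 453      # multiplication
result2 <- 5^2 * 40       # ^ is evaluated before *
result3 <- TRUE & FALSE   # logical AND
result4 <- TRUE | FALSE   # logical OR
result5 <- 75 %% 10       # modulo (remainder after division)
result6 <- 75 / 10        # ordinary division

cat("123 * 453 =", result1, "\n")
cat("75 %% 10 =", result5, "\n")
```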
Vector Creation and Manipulation
Vector construction was an integral part of the analysis. Functions such as c(), seq(), and
rep() were used to build vectors with specific patterns and sequences. For example, the seq()
function made it easy to create vectors of even numbers within specified bounds.
The code snippet below shows how vectors can be manipulated in R; the comments
explain which operation is performed on the vector at each step.
# Problem 14
> # Adding 20 to each element of the 'second_vector'.
> second_vector + 20
[1] 30 32 34 36 38 40 42 44 46 48 50
>
> # Multiplying each element of the 'second_vector' by 20.
> second_vector * 20
[1] 200 240 280 320 360 400 440 480 520 560 600
>
> # Creating a logical vector indicating whether each element of
> # 'second_vector' is greater than or equal to 20.
> second_vector >= 20
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
>
> # Creating a logical vector indicating whether each element of
> # 'second_vector' is not equal to 20.
> second_vector != 20
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
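The outputs above are consistent with second_vector holding the even numbers from 10 to 30, although its definition is not shown in this report. A sketch of how seq(), c(), and rep() build such vectors, treating that definition as an assumption:

```r
# Assumption: second_vector is the even numbers from 10 to 30,
# which matches the outputs printed above.
second_vector <- seq(from = 10, to = 30, by = 2)

# c() combines individual values; rep() repeats a value or pattern.
# These two vectors are illustrative only and do not come from the report.
combined_vector <- c(5, 12, 13, 20)
repeated_vector <- rep(c(1, 2), times = 3)   # 1 2 1 2 1 2

# Arithmetic and comparisons are applied element by element.
print(second_vector + 20)
print(second_vector >= 20)
```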
In the snippet below, brackets were used to extract elements from the first_vector, and the
results were as shown:
# Problem 23
> vector_from_boolean_brackets <- first_vector[c(FALSE, TRUE, FALSE, TRUE)]
> cat("Problem 23 Result:\n")
Problem 23 Result:
> print(vector_from_boolean_brackets)
[1] 12 5
> # Comment: Elements at positions where the corresponding logical values
> # are TRUE are extracted from first_vector.
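Logical indexing keeps the elements whose matching index value is TRUE. The sketch below uses a hypothetical four-element first_vector (the report does not show its real contents), chosen so that the result matches the output above:

```r
# Hypothetical contents; the real first_vector is not shown in the report.
first_vector <- c(4, 12, 9, 5)

# Positions 2 and 4 are TRUE, so only those elements are kept.
keep <- c(FALSE, TRUE, FALSE, TRUE)
print(first_vector[keep])              # 12 5

# If the logical index is shorter than the vector, R recycles it.
longer_vector <- c(4, 12, 9, 5, 7, 8)
print(longer_vector[c(FALSE, TRUE)])   # 12 5 8
```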
In Problems 24, 25, and 26, we examined each piece of code and wrote a one-sentence
comment explaining what it does, as shown in the code snippets below:
> # Problem 24
> cat(second_vector >= 20)
FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
> # Returns a logical vector indicating whether each element of second_vector
> # is greater than or equal to 20.
> # Problem 25
> cat(ages_vector <- seq(from = 10, to = 30, by = 2))
10 12 14 16 18 20 22 24 26 28 30
> cat("ages_vector is a sequence of numbers from 10 to 30 with a step of 2.\n\n")
ages_vector is a sequence of numbers from 10 to 30 with a step of 2.
> # Problem 26
> subset_ages_vector <- ages_vector[ages_vector >= 20]
> cat("A subset of ages_vector containing only elements greater than or equal to
+ 20 is created.\n")
A subset of ages_vector containing only elements greater than or equal to
20 is created.
In Problem 30, we wrote the code and explained in a comment what we thought it was doing,
as shown below:
> # Problem 30
> set.seed(5)
> random_vector <- runif(n = 10, min = 0, max = 1000)
> cat("A vector of 10 random numbers between 0 and 1000 is generated with a fixed
+ seed of 5.\n")
A vector of 10 random numbers between 0 and 1000 is generated with a fixed
seed of 5.
For Problem 37, we explained in a comment what we thought the code was meant to do. The
answer is shown below:
# Problem 37
> set.seed(5)
> random_vector <- rnorm(n = 1000, mean = 50, sd = 15)
> cat("A vector of 1000 random numbers from a normal distribution with mean 50 and
+ standard deviation 15 is generated with a fixed seed of 5.\n\n")
A vector of 1000 random numbers from a normal distribution with mean 50
and standard deviation 15 is generated with a fixed seed of 5.
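One way to sanity-check a simulated sample like this is to compare its sample mean and standard deviation with the requested parameters. A minimal sketch, reusing the call from Problem 37 (the check itself is not part of the original assignment):

```r
set.seed(5)
random_vector <- rnorm(n = 1000, mean = 50, sd = 15)

# With 1000 draws, the sample statistics should land near 50 and 15,
# though not exactly on them.
cat("sample mean:", mean(random_vector), "\n")
cat("sample sd:  ", sd(random_vector), "\n")
```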
Summary Statistics
Descriptive statistics, including the sum, mean, median, maximum, and minimum, were computed
for vectors. This part of the analysis practiced using R's built-in functions to summarize
data.
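As an illustration, the summary functions named above can be applied to the ages_vector from Problem 25. This is a sketch; the report does not state which vectors were actually summarized.

```r
# ages_vector as defined in Problem 25.
ages_vector <- seq(from = 10, to = 30, by = 2)

sum(ages_vector)      # 220
mean(ages_vector)     # 20
median(ages_vector)   # 20
max(ages_vector)      # 30
min(ages_vector)      # 10
```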
For Problem 38, we passed random_vector to the hist() function. The
histogram is depicted below, and the explanation follows:
[Figure: histogram of random_vector]
The hist() function creates a histogram of the values in random_vector. The
histogram shows the distribution of the 1000 random numbers drawn from a normal distribution
with mean 50 and standard deviation 15.
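A minimal sketch of the call, regenerating random_vector as in Problem 37. The title and axis label are assumptions added for readability, not taken from the report.

```r
set.seed(5)
random_vector <- rnorm(n = 1000, mean = 50, sd = 15)

# hist() draws the plot and invisibly returns the bin breaks and counts.
histogram <- hist(random_vector,
                  main = "Histogram of random_vector",   # assumed title
                  xlab = "Value")                        # assumed label

# The bin counts account for every observation.
sum(histogram$counts)
```

Capturing the return value is optional; it is useful when the bin counts are needed for a follow-up check.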
Conditional Subsetting and Filtering
Conditional subsetting and filtering were used throughout the analysis. Examples include
extracting specific elements from vectors based on logical conditions and filtering the salary
dataset to keep only high-paying rows (salary_in_usd greater than 80000).
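The same idea carries over from vectors to data frames in base R. A sketch using a few hypothetical rows (the real ds_salaries.csv data is not reproduced in this report):

```r
# Hypothetical rows; column names follow those used in Problem 42.
salaries <- data.frame(
  job_title = c("Data Analyst", "Data Scientist", "ML Engineer"),
  salary_in_usd = c(65000, 120000, 150000)
)

# A logical condition in the row position of [ , ] keeps only matching rows.
high_pay <- salaries[salaries$salary_in_usd > 80000, ]
print(high_pay)
```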
Random Number Generation
Random number generation with set.seed(), runif(), and rnorm() showed how to
produce reproducible random values that can be reused in later analysis.
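The role of set.seed() can be seen by repeating a draw with and without resetting the seed. A short sketch using the same runif() call as Problem 30:

```r
set.seed(5)
first_draw <- runif(n = 10, min = 0, max = 1000)

set.seed(5)
second_draw <- runif(n = 10, min = 0, max = 1000)

# Same seed, same call: the two vectors are identical.
identical(first_draw, second_draw)   # TRUE

# Without resetting the seed, the generator continues and the draw differs.
third_draw <- runif(n = 10, min = 0, max = 1000)
identical(first_draw, third_draw)    # FALSE
```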
Data Exploration and Visualization
The analysis also covered data exploration and visualization using the ggplot2 package. The
provided dataset (ds_salaries.csv) was loaded into R and examined, which practiced both
reading data into R and extracting useful information from it.
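A sketch of the load-and-inspect step. So that it runs on its own, it writes a tiny hypothetical stand-in file first; with the real data, the path would simply be "ds_salaries.csv", and read.csv() is an assumption about how the file was read.

```r
# Hypothetical stand-in for ds_salaries.csv so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "job_title,salary_in_usd",
  "Data Analyst,65000",
  "Data Scientist,120000"
), csv_path)

first_dataframe <- read.csv(csv_path)

head(first_dataframe)    # first rows (up to 6 by default)
names(first_dataframe)   # column names
str(first_dataframe)     # column types at a glance
```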
In Problem 42, explanations were provided for the given chunk of code, and the chart
produced at the end of the code is attached below:
# Problem 42
>
> # Display the first 6 rows of the dataframe
> cat("head(first_dataframe):\n")
head(first_dataframe):
> # Display the first 7 rows of the dataframe

> cat("head(first_dataframe, n = 7):\n")
head(first_dataframe, n = 7):
>
>
> # Display the column names of the dataframe
> cat("names(first_dataframe):\n")
names(first_dataframe):
>
>
> # Select only the 'job_title' and 'salary_in_usd' columns and store in a
> # new dataframe
> smaller_dataframe <- select(first_dataframe, job_title, salary_in_usd)
> cat("smaller_dataframe:\n")
smaller_dataframe:
>
> # Arrange the smaller dataframe by descending order of 'salary_in_usd'
> better_smaller_dataframe <- arrange(smaller_dataframe, desc(salary_in_usd))
> cat("better_smaller_dataframe (sorted by salary_in_usd):\n")
better_smaller_dataframe (sorted by salary_in_usd):
>
>
> # Filter the dataframe to include only rows where 'salary_in_usd' > 80000
> better_smaller_dataframe <- filter(smaller_dataframe, salary_in_usd > 80000)
> cat("better_smaller_dataframe (filtered by salary_in_usd > 80000):\n")
better_smaller_dataframe (filtered by salary_in_usd > 80000):
>
> # Add a new column 'salary_in_euros' calculated from 'salary_in_usd'
> better_smaller_dataframe <- mutate(smaller_dataframe, salary_in_euros = salary_in_usd * 0.94)
> cat("better_smaller_dataframe (added salary_in_euros):\n")
better_smaller_dataframe (added salary_in_euros):
>
> # Select specific rows from the dataframe using the slice function
> better_smaller_dataframe <- slice(smaller_dataframe, 1, 1, 2, 3, 4, 10, 1)
> cat("better_smaller_dataframe (selected specific rows):\n")
better_smaller_dataframe (selected specific rows):
>
> # Create a bar plot using ggplot for job titles vs. salaries in USD
> ggplot(better_smaller_dataframe) +
+ geom_col(mapping = aes(x = job_title, y = salary_in_usd), fill = "blue") +
+ xlab("Job Title") +
+ ylab("Salary in US Dollars") +
+ labs(title = "Comparison of Jobs ") +
+ scale_y_continuous(labels = scales::dollar) +
+ theme(axis.text.x = element_text(angle = 50, hjust = 1))
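In the transcript above, each step starts again from smaller_dataframe, so only the final slice() result reaches the plot. If the intent were instead to apply the steps cumulatively, they could be chained with the pipe. A sketch under that assumption, using hypothetical rows in place of ds_salaries.csv and requiring the dplyr package:

```r
library(dplyr)

# Hypothetical rows standing in for ds_salaries.csv;
# remote_ratio is an illustrative extra column.
first_dataframe <- data.frame(
  job_title = c("Data Analyst", "Data Scientist", "ML Engineer", "Data Engineer"),
  salary_in_usd = c(65000, 120000, 150000, 95000),
  remote_ratio = c(0, 50, 100, 100)
)

# Each verb receives the result of the previous one.
better_smaller_dataframe <- first_dataframe %>%
  select(job_title, salary_in_usd) %>%               # keep two columns
  filter(salary_in_usd > 80000) %>%                  # high-paying rows only
  arrange(desc(salary_in_usd)) %>%                   # highest salary first
  mutate(salary_in_euros = salary_in_usd * 0.94) %>% # rate used in the report
  slice(1:2)                                         # top two rows

print(better_smaller_dataframe)
```

Chaining keeps one result that reflects every step, whereas reassigning from smaller_dataframe each time keeps only the last step.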
Recommendations and Conclusion
The analysis was an extensive exercise in essential R programming skills. Competence in
vector construction, modification, and data visualization shows progress toward analyzing
data with R. Moreover, computing summary statistics and applying conditional subsetting
underscore practical knowledge of basic statistical techniques. To improve the
quality of subsequent analyses, it is advisable to explore more advanced data
visualization techniques, such as interactive plots or dashboards. In addition, working with
datasets that have more complex structures and better represent real-life situations would
provide stronger practice.
