CT127-3-2-PFDA
PROGRAMMING FOR DATA ANALYSIS
WEIGHTAGE : 50%
INSTRUCTIONS TO CANDIDATES:
2 Students are advised to underpin their answers with the use of references
(cited using the Harvard Name System of Referencing).
In an age where data is crucial to every business, it is important to learn how to analyze data
and extract insights from it. It is said that only 0.5% of all data is analyzed, because of the
vast amount of data produced every day and the limited time available to process it. Although
that is only half a percent of the total data globally, it still represents an enormous amount.
More people therefore need to be trained in data analysis so that more business-boosting
information can be retrieved (Datapine, 2022). This documentation discusses data analysis
techniques such as data exploration, manipulation, visualization and transformation. A dataset
about degree students is used to find out which factors affect the students’ performance. This
gives a better understanding of what a student needs in order to perform well during their
university education. Examples of each technique are showcased with figures and detailed
explanations.
Before the analysis starts, some assumptions can be made so that we know what to expect. First,
the dataset contains numeric values that are hard to read and interpret, so they need to be
converted into factors to obtain more readable data. In addition, the data might contain N/A
values that are irrelevant to the analysis, so data cleaning will be essential. Once the dataset
is clean and its values are readable, we can proceed with the visualization of the data, which
is assumed to use the “ggplot()” function most of the time.
Before visualizing the data, the dataset “student.csv” needs to be imported into R so that it
can be explored. I have used the read.csv() function to read the file into the ‘data’ variable
for easy access, as shown in Figure 2.1.1. Following that, the view() function is used to
display the imported data frame in a separate tab. The data frame output is shown in Figures
2.1.2 to 2.1.4.
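The import step can be sketched as follows. Because the real “student.csv” is not reproduced in this documentation, the sketch first writes a tiny stand-in file to a temporary folder; its columns and values are assumptions made for illustration only.

```r
# Hypothetical stand-in for "student.csv" (columns and values are made up)
csv_path <- file.path(tempdir(), "student.csv")
writeLines(c("index,school,age,G3",
             "1,GP,18,6",
             "2,GP,17,10",
             "3,MS,19,0"), csv_path)

# Import the file into a data frame called `data`
data <- read.csv(csv_path)

# Inspect the result in the console; in RStudio, View(data) opens it in a separate tab
str(data)
head(data)
```

With the real file, only the path passed to read.csv() would change.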
After the data set is imported, it is important to view the data from different perspectives,
explore its attributes and understand the relationships between some of them. Exploring a data
set reveals patterns and points of interest, which leads to greater insight and a better
analysis (HEAVY.AI, n.d.). I have used the names() function to show all the column headers of
the data set so that I am aware of which attributes are being handled. In addition, I have used
the summary(), describe() and skim() functions to generate summaries of the dataset, from
simple to detailed.
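A minimal sketch of this exploration step, using base R only. The data frame below is a made-up sample; describe() and skim() come from add-on packages and are left out, so two base R one-liners stand in for the unique-value and missing-value counts that such summaries report.

```r
# Hypothetical sample data (values are made up for illustration)
data <- data.frame(
  index  = 1:6,
  school = c("GP", "GP", "MS", "GP", "MS", "GP"),
  age    = c(18, 17, 19, 18, NA, 20),
  stringsAsFactors = FALSE
)

names(data)    # column headers
summary(data)  # quartiles, mean, etc. for numeric columns; length and class for character ones

# Base R approximations of the extra detail a richer summary provides
n_unique  <- sapply(data, function(col) length(unique(na.omit(col))))
n_missing <- sapply(data, function(col) sum(is.na(col)))
n_unique
n_missing
```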
Figure 2.2.2 shows the summary of the first 5 attributes in the dataset, with a simple summary
of each attribute. For example, numeric attributes such as ‘index’ and ‘age’ are summarized
with several statistics (e.g. first quartile, mean, maximum, median) that give a better
understanding of the corresponding data. The character attribute ‘school’, however, is shown
with its length only, which led to the use of the describe() function.
The output of the describe() function on the data set gives more detailed information for the
numeric attributes. Most importantly, the character attribute ‘school’ is shown with more
useful details. In Figure 2.2.3, ‘school’ is identified as having 2 unique values, “GP” and
“MS”. This tells me that the dataset only contains students from the schools “GP” and “MS”,
which is valuable information to have before I come up with questions for the data analysis.
Furthermore, this function helps to identify whether there are missing values in any of the
attributes.
We can clearly see that the missing values arise because some of the students have 0 marks in
their third year, or even 0 marks from their second year onwards.
3.1 Question 1: How does a student’s personal life affect their final grade?
In an article posted by College Raptor, 10 factors that can affect the academic performance of
students are listed. One of them is physical activity, which is said to be positively related
to grades because active students are less stressed and have increased oxygen flow to the
brain, promoting neuronal differentiation. A student’s interest is also crucial to their
academic performance, as a person tends to learn faster and better when they are curious and
enthusiastic about a topic. On the other hand, the environment may distract students from
their studies, and overindulging in personal interests (e.g. consuming alcohol, video games,
binge-watching TV shows) is also among the main reasons why students find it hard to focus on
their studies (College Raptor, 2020).
I will therefore analyze how various factors (e.g. daily alcohol consumption, absence rate,
activities after school) affect the students’ final grade. To fully understand a factor,
several analyses are done for it so that further insights can be spotted.
In the code, the ggplot() function is used to plot a graph with the “data” variable as the
dataset. The x-axis is the weekend alcohol consumption, while the bars are filled according to
the students’ final grade. geom_bar() is added, so a bar chart is produced. Next, the graph is
given appropriate labels using the labs() function. Lastly, the theme_bw() function is used to
produce a clean-looking graph.
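The plotting steps described above might look like the following sketch. The sample values, the column name Walc for weekend alcohol consumption and the label texts are assumptions; only ggplot(), geom_bar(), labs(), theme_bw() and the ‘grade3’ fill come from the description.

```r
library(ggplot2)

# Hypothetical sample; Walc is an assumed name for the weekend consumption column
data <- data.frame(
  Walc   = c(1, 1, 2, 2, 3, 3, 4, 5),
  grade3 = factor(c("PASS", "CREDIT", "FAIL", "DISTINCTION",
                    "PASS", "CREDIT", "FAIL", "PASS"),
                  levels = c("FAIL", "PASS", "CREDIT", "DISTINCTION"))
)

p <- ggplot(data, aes(x = Walc, fill = grade3)) +
  geom_bar() +                       # one stacked bar per consumption rate
  labs(title = "Weekend alcohol consumption rate vs final grade",
       x = "Weekend alcohol consumption rate",
       y = "Number of students",
       fill = "Final grade") +
  theme_bw()                         # plain white background

print(p)
```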
Looking at the bar chart, the proportions of the grades are almost the same for each alcohol
consumption rate. However, no student who consumes alcohol at a rate of 4 scores a distinction
in their final grade. This does not mean that a student who consumes alcohol at a rate of 4
will never get a distinction, since the student count at that rate is very low.
Analysis 1-2: What is the relationship between daily alcohol consumption and final grade?
This analysis is done because the effect of workday alcohol consumption on the students’ final
grade needs to be observed now that the weekend alcohol consumption rate has been analysed.
“Dalc” and “grade3” are visualized in this analysis.
The code is almost the same as the code for analysis 1-1, except for a few labels and the
attribute “Dalc”. “Dalc” contains the students’ workday (Monday to Friday) alcohol consumption
rate, while grade3 contains the students’ final grade. The bar chart is labelled “Workday
alcohol consumption rate vs final grade”, which differs from analysis 1-1.
Although the student count differs hugely between the alcohol consumption ratings, we can see
that the proportion of credit and distinction grades gradually decreases along the graph. For
ratings 3 to 5 there is no distinction grade at all, and students with a consumption rate of 4
or 5 only obtain a pass or a fail in their final grade. This suggests that students who
regularly consume alcohol on workdays, when lectures are conducted, perform much worse than
those who consume alcohol on weekends only.
Analysis 1-3: Effect of students using free time to hang out with friends on final grade
This analysis focuses on finding out whether students perform better or worse if they use some
of their free time to hang out with friends.
Using the code shown in Figure 3.1.5, a bar chart is plotted with free time as the x-axis and
the students’ final grade as the fill of the bars. A black outline is added to the bars to show
a clearer separation between grade types. Furthermore, the chart is split into 5 panels, one
for each rate of hanging out with friends, by using the facet_wrap() function. Lastly, all the
necessary labels are added to the graph using the labs() function.
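A hedged sketch of the faceted bar chart described above. The column names freetime and goout, the sample values and the label texts are assumptions; the black outline and facet_wrap() follow the description.

```r
library(ggplot2)

# Hypothetical sample; freetime and goout are assumed column names
data <- data.frame(
  freetime = c(1, 2, 3, 3, 4, 5, 2, 3, 4, 5),
  goout    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  grade3   = c("FAIL", "PASS", "PASS", "CREDIT", "DISTINCTION",
               "CREDIT", "PASS", "FAIL", "PASS", "FAIL")
)

p <- ggplot(data, aes(x = freetime, fill = grade3)) +
  geom_bar(color = "black") +   # black outline separates the grade segments
  facet_wrap(~ goout) +         # one panel per going-out rate
  labs(title = "Free time vs final grade, by rate of going out with friends",
       x = "Free time after school",
       y = "Number of students",
       fill = "Final grade")

print(p)
```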
After observing the bar charts, we can see that students who do not hang out with friends at
all do not score a distinction in their final grade. In contrast, students who hang out with
friends during their free time have a better overall performance. Notably, students who hang
out with their friends moderately have the best performance in their final grade.
Analysis 1-4: How does students’ weekly study time affect their final grade?
This analysis helps determine whether studying more results in a better performance for the
given set of degree students.
For this analysis, a density graph is used to compare the proportion and concentration of
students’ grades across different weekly study times. In the code, ‘geom_density’ creates the
density plot. The ‘alpha’ argument inside the brackets makes the area of each grade partly
transparent, so that each grade’s density remains clearly visible where the areas overlap.
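The density plot could be sketched as below. The column name studytime, the randomly generated sample and the alpha value of 0.5 are assumptions made for illustration; the report does not state the value used.

```r
library(ggplot2)

# Hypothetical sample: 10 students per grade with random study-time levels 1 to 4
set.seed(1)
data <- data.frame(
  studytime = sample(1:4, 40, replace = TRUE),
  grade3    = rep(c("FAIL", "PASS", "CREDIT", "DISTINCTION"), each = 10)
)

p <- ggplot(data, aes(x = studytime, fill = grade3)) +
  geom_density(alpha = 0.5) +   # alpha < 1 keeps overlapping densities visible
  labs(title = "Weekly study time vs final grade",
       x = "Weekly study time",
       y = "Density",
       fill = "Final grade")

print(p)
```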
In Figure 3.1.8, the densities of the grades below distinction are very much balanced. Thus,
grades below distinction (e.g. credit, pass, fail) do not appear to be affected by the
students’ study time. Surprisingly, the density of the distinction grade decreases with higher
study time.
Analysis 1-5: How does having a romantic relationship affect a student’s final grade?
Since degree students are mostly in their 20s and some of them enter adulthood before
graduating, they are likely to get into a relationship during this time. I would like to
analyze this because a large part of a student’s personal life is taken up by dating when they
are in a relationship.
For this analysis, I have used the geom_bar() function to plot a bar chart with the ‘dodge’
position, because the ‘romantic’ attribute has only 2 levels. The ‘alpha’ parameter is set to
0.8 so that the bars for the different grades are easier to distinguish. The legend of the plot
is placed at the top of the graph using theme(legend.position = “top”). The rest of the code is
just labels for the plot.
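As a rough illustration of the dodged bar chart, assuming made-up sample values and label texts (the ‘romantic’ and ‘grade3’ names, alpha = 0.8 and the legend position come from the description):

```r
library(ggplot2)

# Hypothetical sample of relationship status and final grade
data <- data.frame(
  romantic = c("no", "no", "no", "no", "yes", "yes", "yes", "no"),
  grade3   = c("FAIL", "PASS", "CREDIT", "DISTINCTION",
               "FAIL", "PASS", "CREDIT", "PASS")
)

p <- ggplot(data, aes(x = romantic, fill = grade3)) +
  geom_bar(position = "dodge", alpha = 0.8) +  # grades side by side instead of stacked
  labs(title = "Romantic relationship vs final grade",
       x = "In a romantic relationship",
       y = "Number of students",
       fill = "Final grade") +
  theme(legend.position = "top")               # legend above the plot area

print(p)
```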
A bar chart is used for this analysis because the romantic relationship attribute contains only
2 levels, YES and NO. Looking at the bar chart, there is roughly a 2:1 ratio between single
students and students in a relationship. For grades lower than distinction, there are
approximately twice as many single students as students in a relationship, which matches the
overall ratio, so these grades are distributed similarly in both groups. However, there are
clearly far more students with a distinction among those who are not in a relationship. In
conclusion, having a romantic relationship can affect a student’s performance.
In short, the results suggest that a student should adopt a well-balanced lifestyle in order to
score good grades. For example, moderate entertainment such as consuming alcohol on weekends
and hanging out with friends may benefit a student because it can relieve their stress and
boost productivity.
3.2 Question 2: Do family relationships and status affect a student’s grade?
Family plays a very important role in a student’s life and has a great impact on various
aspects of it. This question is asked because the effect of a family’s current situation and
relationships on the students’ learning performance needs to be investigated. Family
background is said to have a very strong influence on a student’s learning behavior
(Li & Qiu, 2018).
I will therefore visualize most of the family-related attributes in the data set to observe
their relation to the students’ grades.
In the code shown in Figure 3.2.1, I have used geom_bar() to plot a bar chart with family size
as the x-axis and the students’ final grade as the bar fill. The bar chart uses the ‘dodge’
position because family size has only 2 levels, which makes the differences easier to observe
and compare.
After comparing the bars for students with a family size greater than 3 (GT3) and less than or
equal to 3 (LE3), no significant difference is found. The proportion of each type of grade is
approximately the same for both groups. Thus, we may conclude that family size does not affect
a student’s performance significantly.
The ggplot() function is used to plot a graph based on the ‘data’ data frame, with the mother’s
education level (Medu) as the x-axis and the father’s education level (Fedu) as the y-axis. A
combination of point and jitter plots is used, via geom_point() and geom_jitter() respectively,
so that we can observe the concentration of the different student grades. A subtitle is also
added to show which education corresponds to each level number.
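A sketch of the combined point and jitter plot follows. The sample values are made up, and the subtitle is a placeholder because the meaning of each education level is not listed in this documentation; mapping ‘grade3’ to color is also an assumption.

```r
library(ggplot2)

# Hypothetical sample; Medu and Fedu follow the attribute names in the text
data <- data.frame(
  Medu   = c(4, 4, 4, 3, 2, 1, 4, 2),
  Fedu   = c(4, 4, 3, 3, 2, 1, 1, 3),
  grade3 = c("DISTINCTION", "CREDIT", "DISTINCTION", "PASS",
             "PASS", "FAIL", "FAIL", "CREDIT")
)

p <- ggplot(data, aes(x = Medu, y = Fedu, color = grade3)) +
  geom_point() +    # exact position of each education-level combination
  geom_jitter() +   # same records with small random offsets, so overlaps spread out
  labs(title = "Parents' education level vs final grade",
       subtitle = "Placeholder: describe what each education level number means",
       x = "Mother's education level (Medu)",
       y = "Father's education level (Fedu)",
       color = "Final grade")

print(p)
```

Since geom_jitter() adds random noise, the cloud around each grid point looks slightly different on every run unless a seed is set.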
In the plot, each cluster of points represents one combination of the parents’ education
levels. Most clusters contain several types of grade, except for a few that contain only a
single type. At coordinate (4, 1) there are only students who fail their final grade. However,
this does not prove that every student whose parents have these education levels will fail,
because the student count is too low. We can also observe that distinctions are concentrated
at the top corner of the plot. The highest concentrations of distinctions are at coordinates
(4, 4) and (4, 3), which suggests that the parents’ education level can have a positive impact
on a student’s performance even at degree level.
In the code, the father’s job (Fjob) is the x-axis of the plot while the mother’s job (Mjob) is
the y-axis. A combination of point and jitter plots is also used in this analysis, but with
slightly transparent points using ‘alpha = 0.7’. After that, appropriate labels are added to
produce a clean and easy-to-understand graph.
The plot in Figure 3.2.6 shows some signs of the parents’ jobs affecting the student’s grade
both positively and negatively. In the top right corner, students whose parents both work as
teachers have a higher average performance in their final grade, with approximately zero
failures. On the other hand, the students with a father who is a teacher and a stay-at-home
mother all fail. This indicates that the parents’ work has an impact on the students’ grade.
For this analysis, a box plot is drawn using the geom_boxplot() function from the ‘ggplot2’
package. The aesthetics of the graph have “Pstatus” as the x-axis and the student’s final grade
(‘G3’) as the y-axis. The color of the box plot is set to follow ‘grade3’ so that the different
types of grade are easier to compare.
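The box plot might be built as in this sketch. The sample marks and grade labels are made up; the table() call at the end is an extra check, not part of the described code, that shows how many students fall into each status before the boxes are interpreted.

```r
library(ggplot2)

# Hypothetical sample; Pstatus is "A" (apart) or "T" (together)
data <- data.frame(
  Pstatus = c("A", "A", "A", "T", "T", "T", "T", "T"),
  G3      = c(5, 10, 17, 4, 7, 11, 14, 18),
  grade3  = c("FAIL", "PASS", "DISTINCTION", "FAIL",
              "FAIL", "PASS", "CREDIT", "DISTINCTION")
)

p <- ggplot(data, aes(x = Pstatus, y = G3, color = grade3)) +
  geom_boxplot() +   # one box per status and grade combination
  labs(title = "Parents' cohabitation status vs final grade",
       x = "Parents' cohabitation status (Pstatus)",
       y = "Final grade (G3)",
       color = "Grade type")

print(p)

# Group sizes: very unequal groups make the boxes hard to compare
status_counts <- table(data$Pstatus)
status_counts
```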
In the box plot, there is no significant difference for grades higher than ‘FAIL’, as they have
almost the same maximum and minimum. However, the boxes for the ‘FAIL’ grade differ in size.
The median of students with parents living together is higher in the “FAIL” grade, which is
odd.
Because the box plot looked odd, I used the describe() function to count the students for each
parents’ cohabitation status. There is a big difference between the ‘A’ and ‘T’ values, with 92
and 744 students respectively, so the box plot is not relevant for the analysis.
In this analysis, the ‘guardian’ attribute is plotted against the student’s final grade in a
violin plot using geom_violin(). The color of the plot is set according to the ‘grade3’
attribute so that the four grades can easily be told apart. The rest of the code is just labels
that tidy up the plot and make it easier to analyse.
Only one violin, for students with their mother as guardian, shows the distinction grade. At
the same time, students with their mother as guardian also have the most failures. The violins
do not differ significantly, so the analysis is not relevant.
To conclude, we can say that family relationships and family status influence a student’s
grade. This may be because students adopt their learning habits from their parents, who may
have influenced them since they were young.
3.3 Question 3: How do a student’s situation and environment affect their grade?
Other than family influence, another big factor determining a student’s educational performance
is their learning environment. According to Ella Hendrix, a positive learning environment is
crucial for a student because it helps to boost their motivation, interest and focus. Students
are more engaged in an environment that has fewer distractions (Ella, 2019). Moreover, students
perform at their best when they are healthy and have the best resources.
I will therefore analyze attributes that are related to a student’s environment and situation
(e.g. health, resources, obstacles). This will allow me to understand more about how a
student’s hardships and available resources influence their grade.
Analysis 3-1: Relationship between travel time to school and student final grade
In this analysis, the final grade of the students is compared with the ‘traveltime’ attribute,
to find out whether students who need longer to get to school obtain lower grades.
In the code, I have used geom_jitter() to plot a graph with the student’s final grade as the
x-axis and the travel time needed to reach school as the y-axis. I have also used facet_wrap()
to split the students into female and male, to observe whether gender has any influence. At the
end, appropriate labels are added to the plot.
The jitter plot in Figure 3.3.2 shows that the longer students travel to school, the lower
their grades are. The number of students scoring credit and distinction gets close to zero when
the student needs 2 hours or more to reach school. Furthermore, the plot shows no difference
between the 2 genders, as they have approximately the same patterns. Thus, the time needed to
travel to school appears to affect the student’s final grade.
Analysis 3-2: How does extra educational support affect student final grade?
This analysis is done to find out whether students perform better if they have additional
educational help from either school or home. We therefore plot graphs about students having
extra educational support using the ‘schoolsup’ and ‘famsup’ attributes.
For this analysis, 2 plots are made to investigate the 2 kinds of educational support. In the
first ggplot() call, the x-axis is extra school educational support; in the second, it is
family educational support. Both plots are bar charts with the ‘dodge’ position, in which every
type of grade is represented by its own bar. At the end of the code, the labs() function is
used to add all the appropriate labels so that the graphs are clear and easy to understand.
For the first plot of the analysis, nothing can be concluded because of the large difference
between the number of students with school support and the number without it. Thus, this
visualization is irrelevant and cannot be included in the analysis.
In the second plot, the students who have family educational support have a higher passing
rate. Overall, however, the difference is not significant, as the number of students with
support is approximately twice the number of students without it. As a result, I would say
that extra educational support may affect a student’s grade slightly, but not significantly.
Analysis 3-3: Does having internet access at home affect final grade?
This analysis finds out whether students with internet access at home perform better or worse
than students without it.
In this analysis, a bar chart is again used because the attribute being analyzed is categorical
and non-numeric. The geom_bar function is used with alpha = 0.7 as the parameter so that the
bars are slightly transparent, which looks nicer. Furthermore, coord_flip is used to swap the x
and y axes so that the bars are drawn horizontally, giving a bigger and clearer view. The
legend is placed at the top so that the graph does not have to shrink to fit it at the side.
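A small sketch of the flipped bar chart, assuming a column named internet with made-up yes/no values and illustrative label texts:

```r
library(ggplot2)

# Hypothetical sample; `internet` is an assumed column name
data <- data.frame(
  internet = c("yes", "yes", "yes", "yes", "no", "no"),
  grade3   = c("FAIL", "PASS", "CREDIT", "DISTINCTION", "PASS", "DISTINCTION")
)

p <- ggplot(data, aes(x = internet, fill = grade3)) +
  geom_bar(alpha = 0.7) +            # slightly transparent bars
  coord_flip() +                     # swap x and y so the bars run horizontally
  theme(legend.position = "top") +   # keep the full width for the bars
  labs(title = "Internet access at home vs final grade",
       x = "Internet access at home",
       y = "Number of students",
       fill = "Final grade")

print(p)
```

Note that after coord_flip() the x aesthetic is drawn on the vertical axis, so the labels given to labs() still refer to the original x and y.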
the bar charts. However, we can observe that there are still students who score a distinction
even without internet access at home. Thus, students do not necessarily need internet access at
home to perform well.
In this analysis, a bar chart is still used, but it is set to the “fill” position in the
geom_bar() function so that we can compare the proportions between the 2 bars. Appropriate
labels are also added to the plot so that a clear graph can be displayed.
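To make the idea behind the “fill” position concrete, the sketch below computes the same within-bar proportions by hand in base R. The sample data and the column name nursery are assumptions; the point is only that each bar is rescaled so its grade shares add up to 1.

```r
# Hypothetical sample; `nursery` is an assumed column name
data <- data.frame(
  nursery = c("yes", "yes", "yes", "yes", "no", "no"),
  grade3  = c("FAIL", "PASS", "PASS", "CREDIT", "FAIL", "PASS")
)

# Raw counts per group and grade (what a stacked bar chart shows)
counts <- table(data$nursery, data$grade3)

# Row-wise shares (what position = "fill" shows): every row sums to 1
props <- prop.table(counts, margin = 1)
props
```

Comparing rows of `props` is the numeric equivalent of comparing the two filled bars, which is why unequal group sizes no longer dominate the picture.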
Comparing the proportion of each type of grade in both bars, we are unable to spot any big
difference or pattern. The proportion of every grade is almost the same for both bars, so we
can conclude that attending a nursery school does not affect a student’s final grade.
In this analysis, I have used a jitter plot combined with a line plot, since the health status
attribute is numerical. The color of the points is set to follow the corresponding grade type
so that comparison is easier. Lastly, suitable labels are added to tidy up the visualization.
From the overall view of the plot, students with very bad to very good health can score any
grade. The plot in Figure 3.3.11 therefore shows little impact of a student’s health status on
their final grade.
In this line of code, the rm() and ls() functions are combined to remove all the variables that
are present in the global environment (the workspace, which is distinct from the working
directory). ls() returns the names of all the variables in the global environment. These names
are then passed to the ‘list’ argument of the remove function, rm(), which removes them from
the environment.
As we can see from the output after running rm(list = ls()), the initial variables (data,
install, package) in the global environment are all erased. This line of code is very useful
because it clears all the variables before we start a specific R script, as leftover variables
may be irrelevant. It also helps to keep the global environment tidy and easy to work with.
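The mechanism can be tried safely with the sketch below. It uses a scratch environment (an assumption made here so the example does not wipe the reader's own workspace); at the console, the global-environment form is simply rm(list = ls()).

```r
# Scratch environment standing in for the global environment
scratch <- new.env()
assign("data", data.frame(x = 1:3), envir = scratch)
assign("install", function() NULL, envir = scratch)

before <- ls(envir = scratch)   # character vector of object names

# Pass the names to rm() through its `list` argument to remove them all
rm(list = ls(envir = scratch), envir = scratch)

after <- ls(envir = scratch)    # empty once everything is removed
before
after
```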
The function() command is used to create customized functions whose body runs when the function
is called by name. In Figure 4.2.1, I have created 2 functions called “install” and “package”.
The “install” function installs all the packages that are listed in it, while the “package”
function loads all the packages that need to be used.
After a customized function is created, it can be called using the function name given. For
example, if I run the second-last line of code shown in Figure 4.2.1, the install() function is
executed and installs all the packages listed in it. This is especially useful if I work on a
different device and need to install all the packages and load them all at once.
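One possible shape for such helper functions is sketched below. It is a variation rather than the code from Figure 4.2.1: here the package names are passed in as a character vector (an assumption), and install() skips packages that are already present.

```r
# Install only those packages in `pkgs` that are not installed yet
install <- function(pkgs) {
  missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  invisible(missing_pkgs)
}

# Attach every package named in `pkgs`
package <- function(pkgs) {
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Usage on a new machine (not run here, as it needs internet access):
# install(c("ggplot2"))

# Loading works the same way; these two packages ship with every R installation
package(c("stats", "utils"))
```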
In the figure, the cut() function is used to categorize the numeric values of the student
grades into 4 levels: “FAIL”, “PASS”, “CREDIT” and “DISTINCTION”. The breaks parameter of the
function indicates the range of each type of grade.
After the code has been executed, 3 new columns, “grade1”, “grade2” and “grade3”, are added to
the data frame. Using the first row as reference, we can see that the student has “FAIL” grades
in the new columns, as the values in G1 to G3 are in the range of 1-8. This function is used to
preprocess the data into a more readable form so that the analysis can be done smoothly and
efficiently.
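The binning step can be sketched as follows. Only the 1-8 range for “FAIL” is stated in this documentation; the other break points (12, 16, 20) and the sample marks are assumptions for illustration.

```r
# Hypothetical marks for four students
data <- data.frame(
  G1 = c(5, 10, 14, 18),
  G2 = c(6, 9, 13, 0),
  G3 = c(6, 11, 15, 0)
)

grade_breaks <- c(0, 8, 12, 16, 20)   # assumed cut points; only 1-8 = FAIL is given
grade_labels <- c("FAIL", "PASS", "CREDIT", "DISTINCTION")

# Create grade1..grade3 from G1..G3; intervals are (0,8], (8,12], (12,16], (16,20]
for (i in 1:3) {
  data[[paste0("grade", i)]] <- cut(data[[paste0("G", i)]],
                                    breaks = grade_breaks,
                                    labels = grade_labels)
}

data
```

Because the first interval excludes its lower bound, a mark of 0 falls outside every bin and becomes NA, which is one way zero marks can turn into missing values in the grade columns.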
The describe() function gives a summary of each attribute, with information such as the number
of unique values and the number of missing values. The skim() function displays another summary
that shows the column types in the data and how many columns there are of each type.
With these additional summaries, I have a better understanding of the dataset that I am going
to analyze. Thus, these 2 functions are worth adding to the R script.
The na.omit() function is used to remove all the rows that contain N/A values so that the data
set is cleaned. The colSums() function, applied to the N/A indicators, shows every attribute
with its total number of N/A values. Using colSums() in Figure 4.5.2, we can identify 27 and 86
N/A values for grade2 and grade3 respectively. Thus, na.omit() is used to remove them, as can
be seen in Figure 4.5.3.
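A minimal sketch of the cleaning step on a made-up three-row data frame (the values are assumptions; the real counts are those reported in Figure 4.5.2):

```r
# Hypothetical sample with missing grade categories
data <- data.frame(
  G2     = c(6, 0, 13),
  G3     = c(6, 0, 0),
  grade2 = factor(c("FAIL", NA, "CREDIT")),
  grade3 = factor(c("FAIL", NA, NA))
)

# Number of N/A values per column
na_counts <- colSums(is.na(data))
na_counts

# Drop every row that contains at least one N/A value
cleaned <- na.omit(data)
cleaned
```

na.omit() works row by row, so a student is dropped as soon as any one of their columns is missing.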
With this extra feature, the legend of the graph is positioned at the top so that the whole
plot can expand to the right. Another benefit of putting the legend at the top is that it is
less distracting when we are analyzing the plot.
When bar charts are too long to be viewed vertically, the efficiency and accuracy of the
analysis can suffer. Thus, coord_flip has been applied to the plot shown in Figure 4.8.2. This
function flips the Cartesian coordinates so that x and y are swapped and the bars are displayed
horizontally.
5.0 CONCLUSION
Most of the data from the dataset ‘student’ has been analyzed using the R programming
language. The data is visualized as density plots, box plots, bar charts, point graphs and
more. The fundamentals of the R programming language and data analysis concepts are applied so
that the initial data is manipulated to be more useful and easier to visualize graphically.
In the analysis, many aspects of the students are considered, and useful information is
retrieved from the dataset. The aspects mainly focused on are a student’s family, environment
and personal interests.
All in all, there are still many improvements that can be made to the analysis in the near
future so that insights that have not yet been found can be retrieved from the dataset. Beyond
the educational field, other industries (manufacturing, entertainment, construction etc.) can
also be analyzed using the R programming language. Without a doubt, the skill of analyzing
data and turning raw data into useful information will be useful in any field. Thus, it is
very important that we master it.
6.0 REFERENCES
HEAVY.AI. (n.d.). Data Exploration - a complete introduction. Retrieved April 29, 2022, from
https://www.heavy.ai/learn/data-exploration#:~:text=Data%20exploration%20definition%3A%20Data%20exploration,the%20nature%20of%20the%20data.
Tableau. (n.d.). Guide to data cleaning: Definition, benefits, components, and how to clean your
data. Retrieved April 29, 2022, from
https://www.tableau.com/learn/articles/what-is-data-cleaning#:~:text=Having%20clean%20data%20will%20ultimately,clients%20and%20less%2Dfrustrated%20employees.
College Raptor. (2020, June 15). Physical activity and others affecting academic performance.
Retrieved May 1, 2022, from
https://www.collegeraptor.com/find-colleges/articles/tips-tools-advice/trick-cheat-10-things-affecting-students-academic-performance/
Datapine. (2022, March 9). What is data analysis? methods, techniques, types & how-to.
Retrieved May 10, 2022, from
https://www.datapine.com/blog/data-analysis-methods-and-techniques/
Li, Z., & Qiu, Z. (2018, October 2). How does family background affect children's educational
achievement? evidence from contemporary China - the journal of Chinese sociology.
SpringerOpen. Retrieved May 7, 2022, from
https://journalofchinesesociology.springeropen.com/articles/10.1186/s40711-018-0083-8#:~:text=Family%20background%20and%20children's%20learning%20behavior&text=The%20higher%20the%20family's%20socioeconomic,effect%20on%20children's%20learning%20behavior.
Ella, H. (2019, December 19). How your surroundings affect The way you study. UCAS.
Retrieved May 8, 2022, from
https://www.ucas.com/connect/blogs/how-your-surroundings-affect-way-you-study
GeeksforGeeks. (2020, June 17). List all the objects present in the current working directory in
R programming - LS() function. Retrieved May 9, 2022, from
https://www.geeksforgeeks.org/list-all-the-objects-present-in-the-current-working-directory-in-r-programming-ls-function/