Heng Kang Han - tp063427

INDIVIDUAL ASSIGNMENT
TECHNOLOGY PARK MALAYSIA
CT127-3-2-PFDA
PROGRAMMING FOR DATA ANALYSIS
INTAKE CODE : APD2F2202CS (DA)
HAND OUT DATE : 28 MARCH 2022
HAND IN DATE : 13 MAY 2022
WEIGHTAGE : 50%
STUDENT NAME : HENG KANG HAN (TP063427)
INSTRUCTIONS TO CANDIDATES:
1 Submit your assignment at the administrative counter.
2 Students are advised to underpin their answers with the use of references
(cited using the Harvard Name System of Referencing).
3 Late submission will be awarded zero (0) unless Extenuating Circumstances
(EC) are upheld.
4 Cases of plagiarism will be penalized.
5 The assignment should be bound in an appropriate style (comb bound or
stapled).
6 Where the assignment should be submitted in both hardcopy and softcopy,
the softcopy of the written assignment and source code (where appropriate)
should be on a CD in an envelope / CD cover and attached to the hardcopy.
7 You must obtain 50% overall to pass this module.

Table of Contents
1.0 INTRODUCTION AND ASSUMPTIONS
2.0 DATASET EXPLORATION/ CLEANING/ TRANSFORMATION
2.1 Data Import
2.2 Data Exploration
2.3 Data Cleaning & Pre-processing
3.0 QUESTIONS AND ANALYSIS
3.1 Question 1: How will a student personal life affect their final grade ?
Analysis 1-1: relationship of weekend alcohol consumption and final performance ?
Analysis 1-2: What is the relationship of daily alcohol consumption with final grade ?
Analysis 1-3: effect of students using free time to hang out with friends on final grade
Analysis 1-4: how does student weekly study time affect their final grade ?
Analysis 1-5: how does having a romantic relationship affect a students final grade ?
Analysis Result: Question 1
3.2 Question 2: Does family relationship and status affect student’s grade
Analysis 2-1: family size effect on students final grade
Analysis 2-2: parent’s education level effect on student’s grade
Analysis 2-3: effect of parent’s job on student’s final grade
Analysis 2-4: Effect of parent’s cohabitation status on student’s grade
Analysis 2-5: Student’s guardian influence on their grade
Analysis result: Question 2
3.3 Question 3: How does a student’s situation and environment affect their grade ?
Analysis 3-1: Relationship of travel time to school and student final grade
Analysis 3-2: How extra educational support affect student final grade
Analysis 3-3: Does having internet access at home affect final grade
Analysis 3-4: Will attending nursery school influence student’s grade
Analysis 3-5: How will student’s health affect their grade
Analysis result: Question 3
4.0 EXTRA FEATURES
4.1 Extra feature 1: rm(list = ls())
4.2 Extra Feature 2: function()
LEVEL 2 Asia Pacific University of Technology and Innovation 2022

4.3 Extra Feature 3: Cut()
4.4 Extra feature 4: skim() and describe()
4.5 Extra feature 5: na.omit() and colSums()
4.6 Extra feature 6: facet_wrap()
4.7 Extra Function 7: theme(legend.position = “top”)
4.8 Extra Feature 8: coord_flip()
5.0 CONCLUSION
6.0 REFERENCES

1.0 INTRODUCTION AND ASSUMPTIONS
In an age where data is crucial to every business, it is important to learn how to analyze
data and extract insights from it. It is said that only 0.5% of all data is ever analyzed, because
far more data is produced every day than there is time to examine. Although that is only half a
percent of the data produced globally, it still represents an enormous amount. More people
therefore need to be trained in data analysis so that more business-boosting information can be
retrieved (Datapine, 2022). This documentation discusses data analysis techniques such as data
exploration, manipulation, visualization and transformation. A dataset about degree students is
used to find out which factors affect the students’ performance. Doing so gives a better
understanding of what a student needs in order to perform well in their educational journey at
university. Examples of each technique are showcased with figures and detailed explanations.
Before the analysis starts, some assumptions are made so that we know what to expect.
First, the dataset contains numeric values that are hard to analyze and understand, so they
need to be converted into factors to obtain more readable data. In addition, the data might
contain N/A values that are irrelevant to the analysis, so data cleaning will be essential. Once
the dataset is clean and its values are readable, we can proceed with visualizing the data. The
visualizations are assumed to use the “ggplot()” function most of the time.

2.0 DATASET EXPLORATION/ CLEANING/ TRANSFORMATION
2.1 Data Import
Before visualizing the data, the dataset “student.csv” needs to be imported into R so that
all of the data can be explored. I used the read.csv() function to read the file into the ‘data’
variable for easy access, as shown in Figure 2.1.1 below. Following that, the view() function is
used to display the imported data frame in another tab. The data frame output is shown in
Figure 2.1.2 to Figure 2.1.4 below.
Figure 2.1.1 code snippet of data set import
Figure 2.1.2 data frame (left side)

Figure 2.1.3 data frame (middle)
Figure 2.1.4 data frame (right side)
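As a minimal sketch of the import step, the snippet below writes a tiny stand-in file instead of reading the real “student.csv”. The rows are made up for illustration, and only the column names used elsewhere in this document are included. Note that base R spells the viewer function View() with a capital V.

```r
# Illustrative sketch: a tiny stand-in CSV so the example runs without the
# real "student.csv". The rows below are invented, not taken from the dataset.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "index,school,sex,age,G3",
  "1,GP,F,18,6",
  "2,GP,M,17,15",
  "3,MS,F,19,0"
), csv_path)

# read.csv() parses the file into a data frame.
data <- read.csv(csv_path)

# View() opens a spreadsheet-style tab in RStudio; the guard lets the script
# run in a non-interactive session as well.
if (interactive()) View(data)

print(dim(data))
```

With the real file, the same call would simply be read.csv("student.csv"), assuming the file sits in the working directory.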
2.2 Data Exploration
Figure 2.2.1 code snippet of data exploration
After the data set is imported, it is important to view the data from different perspectives,
explore its attributes and understand the relationships between some of them. Exploring a
data set reveals patterns and points of interest, which leads to greater insight and a better
analysis (HEAVY.AI, n.d.). I used the names() function to list all the column headers so that I
am aware of what kind of attributes are being handled. I also used the summary(), describe()
and skim() functions to generate summaries of the dataset, from simple to detailed.
Figure 2.2.2 small part of summary() outputs
Figure 2.2.2 above shows the summary of the first 5 attributes in the dataset. A simple summary
of each attribute is given. For example, numeric attributes such as ‘index’ and ‘age’ are
summarized with several statistics (e.g. first quartile, mean, maximum, median) that give a
better understanding of the corresponding data. The character attribute ‘school’, however, is
shown with its length only, which led to the use of the describe() function.
Figure 2.2.3 small part of describe() outputs
The output of describe() gives more detailed information for the numeric attributes. Most
importantly, the character attribute ‘school’ is shown with more useful details. In Figure 2.2.3,
‘school’ is identified as having 2 unique values, “GP” and “MS”. This tells me that the dataset
only contains students from the schools “GP” and “MS”, which is valuable information to have
before I come up with questions for the data analysis. Furthermore, this function helps to
identify whether any attribute has missing values.
Figure 2.2.4 upper part of skim() outputs
Figure 2.2.5 lower part of skim() outputs

At the end of data exploration, the skim() function is used to obtain a more structured
summary. Figures 2.2.4 and 2.2.5 show that this summary is tidier and more compact than the
output of the previous two functions. The attributes are split into two parts by data class, 17
numeric and 17 character, which makes the data class of each attribute easy to see.
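The exploration steps can be sketched with base R alone, assuming a small made-up data frame in place of the real dataset. The text does not state which package supplied describe(), so the sketch uses base-R equivalents for the details it reports (unique values, counts per data class) and only mentions the package functions in a comment.

```r
# Made-up rows for illustration; the real dataset has many more columns.
data <- data.frame(
  school = c("GP", "GP", "MS", "MS"),
  age    = c(18, 17, 19, 16),
  G3     = c(6, 15, 0, 11),
  stringsAsFactors = FALSE
)

print(names(data))    # column headers
print(summary(data))  # quartiles and mean for numeric columns, length for character ones

# Base-R equivalent of the "unique values" detail reported for 'school'.
school_values <- unique(data$school)
print(school_values)

# Number of columns per data class, similar to the grouping in the skim() summary.
class_counts <- table(sapply(data, class))
print(class_counts)

# Assumption: skim() comes from the 'skimr' package; describe() exists in
# several packages (for example 'Hmisc' and 'psych'), and the source does not
# say which one was used.
```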
2.3 Data Cleaning & Pre-processing

Figure 2.3.2 ‘index’ attribute column and built in index

Moving on, data cleaning is required because it makes the analysis run more smoothly and
yields higher-quality information, which in turn improves decision-making (Tableau, n.d.).
During the earlier data exploration, the ‘index’ attribute was found to be redundant because
RStudio already displays a built-in row index, as shown in Figure 2.3.2. The attribute is also
not needed anywhere in the analysis, so it is best to remove the column.
Figure 2.3.1 code snippet of removing ‘index’ column

Furthermore, the grades of all the degree students are recorded on a scale of 0-20, which is not
well suited to the analysis and visualization discussed later in this document. Thus, the marks
recorded in “G1”, “G2” and “G3” are converted into 4 factor levels (“FAIL”, “PASS”, “CREDIT”,
“EXCELLENT”). The newly defined data is stored in the “grade1”, “grade2” and “grade3”
attributes respectively. This makes the data more readable and the analysis more precise when
it is visualized.
Figure 2.3.3 code snippet of categorizing

The code snippet above is explained in the extra features part of the documentation. However,
it is important to highlight that missing (N/A) values are found in the new attributes when the
code in Figure 2.3.4 is executed. Figure 2.3.5 clearly shows that the missing values arise
because some of the students have 0 marks in their third year, or even from their second year
onwards.
Figure 2.3.4 number of missing values shown when colSums() is executed
Figure 2.3.5 missing value in ‘grade2’ and ‘grade3’
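The link between zero marks and missing values can be illustrated with a short sketch. The break points below are hypothetical, since the original ones are only visible in Figure 2.3.3, but they show one way a mark of 0 ends up as N/A: with cut()’s defaults the lowest break is excluded from the first interval.

```r
# Made-up marks for illustration only.
data <- data.frame(
  index = 1:5,
  G3    = c(0, 7, 11, 14, 18)
)

# Drop the redundant 'index' column.
data$index <- NULL

# Hypothetical cut points (assumption): (0,9] FAIL, (9,13] PASS,
# (13,16] CREDIT, (16,20] EXCELLENT.
# With the defaults right = TRUE and include.lowest = FALSE, a mark of 0 lies
# outside the first interval, so cut() returns NA for it.
data$grade3 <- cut(
  data$G3,
  breaks = c(0, 9, 13, 16, 20),
  labels = c("FAIL", "PASS", "CREDIT", "EXCELLENT")
)
print(data)

# Count the missing values in every column.
na_counts <- colSums(is.na(data))
print(na_counts)
```

Passing include.lowest = TRUE would instead place a mark of 0 in the first interval, so whether zeros become N/A is a choice made by the cut() call rather than a property of the data.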

Since the focus of this analysis is to identify the relationship and the effect of the various
attributes on the grade of the students, we assume that students who scored zero during their
degree may have deferred their studies. Thus, the missing values carry no useful information
and need to be cleaned from the data frame. To do that, the students with zero marks are
removed using na.omit(), as shown in Figure 2.3.6 below. A total of 86 observations are
removed, leaving 836 rows in the data frame. Removing the 86 rows is acceptable because they
make up only about 9% of the original 922 rows.

Figure 2.3.6 code snippet of na.omit() and results
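A minimal sketch of the removal step, using made-up rows, shows how the number of dropped observations can be checked before and after na.omit():

```r
# Made-up rows for illustration; NA marks the students with zero marks.
data <- data.frame(
  G3     = c(0, 7, 11, 14, 18, 0),
  grade3 = factor(c(NA, "FAIL", "PASS", "CREDIT", "EXCELLENT", NA))
)

rows_before <- nrow(data)

# na.omit() drops every row that contains at least one NA.
data <- na.omit(data)

rows_removed <- rows_before - nrow(data)
cat("Removed", rows_removed, "of", rows_before, "rows\n")
```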
3.0 QUESTIONS AND ANALYSIS
3.1 Question 1: How will a student personal life affect their final grade ?
An article posted by College Raptor lists 10 factors that can affect the academic
performance of students. One of them is physical activity, which is said to be positively related
to grades because active students are less stressed and have increased oxygen flow to the
brain, causing neuronal differentiation. A student’s interest is also crucial in determining
academic performance, as people tend to learn faster and better when they are curious and
enthusiastic about a topic. On the other hand, a distracting environment and overindulgence in
personal interests (e.g. consuming alcohol, video games, binge-watching TV shows) are among
the main reasons why students find it hard to focus on their studies (College Raptor, 2020).
Thus, I will analyse how various factors (e.g. daily alcohol consumption, absence rate,
activities after school) affect the students’ final grade. To fully understand a factor,
several analyses are carried out on it so that further insights can be spotted.
Analysis 1-1: relationship of weekend alcohol consumption and final performance ?

This analysis will use “Walc” and “grade3” to generate a graph to visualize their relationship.

Figure 3.1.1 code snippet of analysis 1-1
In the code, the ggplot() function is used to plot a graph with the “data” variable as the
dataset. The x-axis is weekend alcohol consumption, while the bars are filled according to the
students’ final grade. Adding geom_bar() produces a bar chart. Next, the graph is given
appropriate labels using the labs() function. Lastly, the theme_bw() function is used to produce
a clean-looking graph.
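The description above can be sketched as follows, assuming the ggplot2 package is installed. The rows and the label texts are made up for illustration; the actual code is the one in Figure 3.1.1.

```r
library(ggplot2)

# Made-up rows for illustration only.
data <- data.frame(
  Walc   = c(1, 1, 1, 2, 2, 3, 4, 5),
  grade3 = factor(
    c("FAIL", "PASS", "CREDIT", "PASS", "EXCELLENT", "FAIL", "PASS", "FAIL"),
    levels = c("FAIL", "PASS", "CREDIT", "EXCELLENT")
  )
)

# Stacked bar chart: one bar per consumption level, filled by final grade.
p <- ggplot(data, aes(x = Walc, fill = grade3)) +
  geom_bar() +
  labs(
    title = "Weekend alcohol consumption rate vs final grade",  # hypothetical label
    x     = "Weekend alcohol consumption (1 = very low, 5 = very high)",
    y     = "Student count",
    fill  = "Final grade"
  ) +
  theme_bw()
```

Calling print(p) renders the chart. Adding position = "fill" inside geom_bar() would rescale every bar to proportions, which may make levels with very different student counts easier to compare.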
Figure 3.1.2: bar chart of analysis 1-1
Looking at the bar chart, we can see that the proportions of the grades are almost the same for
each level of alcohol consumption. However, no student who consumes alcohol at a rate of 4
scores a distinction in their final grade. This does not mean that a student who consumes
alcohol at a rate of 4 will never get a distinction, since the student count at that level is
very low.

Analysis 1-2: What is the relationship of daily alcohol consumption with final grade ?
This analysis is done because, now that weekend alcohol consumption has been analysed, the
effect of workday alcohol consumption on the students’ final grade also needs to be observed.
“Dalc” and “grade3” are used in this visualization.
Figure 3.1.3: code snippet of analysis 1-2
The code shown above is almost the same as the code of analysis 1-1 except for a few labels and
the attribute “Dalc”. The bar chart uses “Dalc”, which contains the students’ workday (Monday
to Friday) alcohol consumption rate, while grade3 contains the students’ final grade. The bar
chart is labelled “Workday alcohol consumption rate vs final grade”, which differs from
analysis 1-1.

Figure 3.1.4: bar chart of analysis 1-2
Although the student count differs hugely between the alcohol consumption ratings, we can see
that the proportion of credit and distinction grades gradually decreases along the graph. At
ratings 3 to 5 there is no distinction grade at all. Students with an alcohol consumption rate
of 4 or 5 only obtain a pass or a fail as their final grade. This suggests that students who
regularly consume alcohol on workdays, which is when lectures are conducted, perform much worse
than those who consume alcohol on weekends only.
Analysis 1-3: effect of students using free time to hang out with friends on final grade
This analysis focuses on finding out whether students perform better or worse if they use some
of their free time to hang out with friends.
Using the code shown in Figure 3.1.5, a bar chart is plotted with free time as the x-axis and
the students’ final grade as the fill of the bars. A black outline is added to the bars to show
a clearer separation between grade types. Furthermore, the chart is split into 5 panels, one
for each rate of hanging out with friends, using the facet_wrap() function. Lastly, all the
necessary labels are added using the labs() function.
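A hedged sketch of this faceted chart is given below. The column names ‘freetime’ and ‘goout’ are assumptions, since the text does not name the attributes used, and the rows are made up.

```r
library(ggplot2)

# Made-up rows; 'freetime' and 'goout' are assumed column names.
data <- data.frame(
  freetime = c(1, 2, 3, 3, 4, 5, 2, 4),
  goout    = c(1, 1, 2, 3, 3, 4, 5, 5),
  grade3   = factor(
    c("FAIL", "PASS", "PASS", "CREDIT", "EXCELLENT", "PASS", "FAIL", "CREDIT"),
    levels = c("FAIL", "PASS", "CREDIT", "EXCELLENT")
  )
)

p <- ggplot(data, aes(x = freetime, fill = grade3)) +
  geom_bar(color = "black") +   # black outline separates the grade segments
  facet_wrap(~ goout) +         # one panel per level of going out with friends
  labs(
    x    = "Free time after school",
    y    = "Student count",
    fill = "Final grade"
  )
```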

Figure 3.1.6: visualization of analysis 1-3
After observing the different bar charts, we can see that the students who do not hang out with
friends at all do not score a distinction in their final grade. In contrast, students who hang
out with friends during their free time have a better overall performance. Notably, students
who hang out with their friends moderately have the best performance in their final grade.
Analysis 1-4: how does student weekly study time affect their final grade ?
This analysis is set to help determine whether studying more will result in a better performance
for the given set of degree students.

For this analysis, a density plot is used to compare the proportion and concentration of
students’ grades across different weekly study times. In the code, ‘geom_density’ creates the
density plot. The ‘alpha’ argument makes the fill of each grade partly transparent so that the
density of every grade remains clearly visible where the curves overlap.
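A minimal sketch of such a density plot follows. The column name ‘studytime’, the alpha value and the rows are assumptions made for illustration.

```r
library(ggplot2)

# Made-up rows; every grade needs at least two observations for a density curve.
data <- data.frame(
  studytime = c(1, 2, 2, 3, 1, 1, 2, 4),
  grade3    = factor(rep(c("FAIL", "EXCELLENT"), each = 4),
                     levels = c("FAIL", "EXCELLENT"))
)

# One density curve per grade; alpha below 1 keeps overlapping curves visible.
p <- ggplot(data, aes(x = studytime, fill = grade3)) +
  geom_density(alpha = 0.5) +
  labs(x = "Weekly study time", y = "Density", fill = "Final grade")
```

Because each curve is normalized to an area of 1, the plot compares the shape of the distribution within each grade rather than the number of students in it.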
In Figure 3.1.8, the densities of the grades below distinction are very much balanced. Thus,
grades below distinction (e.g. credit, pass, fail) do not appear to be affected by the
students’ study time. Surprisingly, the density of the distinction grade decreases with higher
study time.
Analysis 1-5: how does having a romantic relationship affect a students final grade ?
Since degree students are mostly in their 20s and some of them enter adulthood before
graduating, they are likely to get into relationships during this time. I would therefore like
to analyse this, because a student’s personal life is largely taken up by dating when they are
in a relationship.

Figure 3.1.9: code snippet for analysis 1-5
For this analysis, I used the geom_bar() function to plot a bar chart with the ‘dodge’ position
because there are only 2 levels in the ‘romantic’ attribute. The ‘alpha’ parameter is set to 0.8
to make the bars slightly transparent so that the different grades are easier to tell apart. In
addition, the legend is placed at the top of the graph using theme(legend.position = “top”).
The rest of the code just adds labels to the plot.
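The dodged bar chart described above might look like the following sketch. The rows and label texts are made up, and the “yes”/“no” coding of ‘romantic’ is an assumption.

```r
library(ggplot2)

# Made-up rows for illustration only.
data <- data.frame(
  romantic = c("no", "no", "no", "no", "yes", "yes"),
  grade3   = factor(
    c("FAIL", "PASS", "CREDIT", "EXCELLENT", "FAIL", "PASS"),
    levels = c("FAIL", "PASS", "CREDIT", "EXCELLENT")
  )
)

p <- ggplot(data, aes(x = romantic, fill = grade3)) +
  geom_bar(position = "dodge", alpha = 0.8) +  # side-by-side bars, one per grade
  labs(x = "In a romantic relationship", y = "Student count", fill = "Final grade") +
  theme(legend.position = "top")               # legend above the plot area
```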
Figure 3.1.10: visualization for analysis 1-5
A bar chart is used for this analysis because the romantic relationship attribute only contains
2 levels, YES and NO. Looking at the bar charts, there is roughly a 2:1 ratio between students
who are single and students in a relationship. For grades lower than distinction, there are
approximately twice as many single students as students in a relationship, which matches this
ratio, so the two groups are mostly the same. However, the number of students with a
distinction is much higher among those who are not in a relationship. In conclusion, having a
romantic relationship may affect a student’s performance.
Analysis Result: Question 1

To answer question 1, “how will a student’s personal life affect their final grade”,
multiple analyses were done by plotting attributes relevant to a student’s personal life
against the ‘grade3’ attribute. The first two analyses show that alcohol consumption lowers a
student’s final grade if the student drinks on most workdays. This may be because such students
are unable to focus on their studies while the effects of alcohol persist through most of their
school days. We have also found that students who never or seldom socialize and hang out with
friends have lower grades than students who keep a moderate balance between studies and social
life. Surprisingly, students who have the most study time tend to have a lower chance of
obtaining a distinction than those who study less. Finally, we tested whether a romantic
relationship affects a student’s final grade. Students who were in a relationship during their
degree studies have a much lower chance of scoring a distinction. This is plausible because a
student in a relationship may focus more on their partner than on their studies, while also
having less time to hang out with friends.
In short, a student should adopt a well-balanced lifestyle in order to score good grades. For
example, moderate entertainment such as consuming alcohol on weekends and hanging out with
friends may benefit a student because it can relieve stress and boost productivity.

3.2 Question 2: Does family relationship and status affect student’s grade
Family plays a very important role in a student’s life and has a great impact on various
aspects of it. This question is asked because the effect of a student’s current family
situation and relationships on their learning performance needs to be investigated. Family
background is said to have a very strong influence on a student’s learning behavior
(Li & Qiu, 2018).
I will therefore visualize most of the family-related attributes in the data set to observe
their relation to the students’ grades.
Analysis 2-1: family size effect on students final grade

This analysis is to find out whether family size affects the student’s final grade. I assume
that a bigger family brings factors such as a noisy environment or distraction caused by
siblings for students who live at home, so this is the first analysis.
In the code shown in Figure 3.2.1, I used geom_bar() to plot a bar chart with family size as
the x-axis and the students’ final grade as the bar fill. The bars are positioned with ‘dodge’
because there are only 2 levels of family size, and dodged bars make the differences easier to
observe and compare.

After comparing the bars of students with a family size greater than 3 (GT3) and less than or
equal to 3 (LE3), no significant difference is found. The proportion of each type of grade is
approximately the same in both groups. Thus, we may conclude that family size does not affect a
student’s performance significantly.
Analysis 2-2: parent’s education level effect on student’s grade

In this analysis, the objective is to find out whether the education level (e.g. primary, secondary,
tertiary, higher education) of parents will influence the grade of students.

The ggplot() function is used to plot a graph based on the ‘data’ data frame. The graph has the
mother’s education level (Medu) as the x-axis and the father’s education level (Fedu) as the
y-axis. A combination of point and jitter plots is produced with geom_point() and geom_jitter()
respectively so that the concentration of each grade can be observed. A subtitle is also added
to show which education level each number represents.
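The plot can be sketched as below with made-up rows. Since geom_jitter() is itself a point layer with small random offsets, this sketch uses it alone instead of combining it with geom_point(), which avoids drawing every student twice. The subtitle text is a placeholder, because the mapping of numbers to education levels is only given in the original figure.

```r
library(ggplot2)

set.seed(1)  # makes the random jitter reproducible

# Made-up rows for illustration only.
data <- data.frame(
  Medu   = c(4, 4, 4, 3, 2, 1, 4, 2),
  Fedu   = c(4, 4, 3, 3, 2, 1, 1, 4),
  grade3 = factor(
    c("EXCELLENT", "CREDIT", "EXCELLENT", "PASS", "PASS", "FAIL", "FAIL", "CREDIT"),
    levels = c("FAIL", "PASS", "CREDIT", "EXCELLENT")
  )
)

p <- ggplot(data, aes(x = Medu, y = Fedu, color = grade3)) +
  geom_jitter(width = 0.2, height = 0.2, size = 2) +  # spreads overlapping points
  labs(
    x        = "Mother's education level",
    y        = "Father's education level",
    color    = "Final grade",
    subtitle = "Education level legend goes here (placeholder)"
  )
```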
In the plot, each cluster of points represents one combination of the parents’ education
levels. Most clusters contain several types of grade, except for a few that contain only a
single type. At coordinate (4, 1) there are only students who failed their final grade.
However, this does not prove that every student whose parents have this combination of
education levels will fail, because the student count is too low. We can also observe that
distinctions are concentrated at the top corner of the plot. The highest concentrations of
distinctions are located at coordinates (4, 4) and (4, 3), which suggests that parents’
education level can have a positive impact on a student’s performance even at degree level.

Analysis 2-3: effect of parent’s job on student’s final grade

This analysis plots both parents’ job attributes against the student’s ‘grade3’ result in order
to search for hidden insights.
In the code, the father’s job (Fjob) is the x-axis of the plot while the mother’s job (Mjob) is
the y-axis. A combination of point and jitter plots is again used, this time with slightly
transparent points using ‘alpha = 0.7’. After that, appropriate labels are added to produce a
clean and easy-to-understand graph.

The plot in Figure 3.2.6 shows some signs of parents’ jobs affecting students’ grades both
positively and negatively. In the top right corner, students whose parents both work as
teachers have a higher average performance in their final grade, with almost no failures. On
the other hand, the students whose father is a teacher and whose mother stays at home all
failed. This suggests that parents’ work has an impact on the students’ grades.
Analysis 2-4: Effect of parent’s cohabitation status on student’s grade

This analysis makes use of the “Pstatus” attribute, which represents the parents’ cohabitation
status. The parents are recorded as either apart (“A”) or together (“T”). This attribute is
compared to the students’ final grade to find out whether it affects a student’s performance.
For this analysis, a box plot is drawn using the geom_boxplot() function from the ‘ggplot2’
package. The plot has “Pstatus” as the x-axis and the student’s final grade (‘G3’) as the
y-axis. The color of the boxes is set to follow ‘grade3’ so that it is easier to compare the
different types of grade.
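A minimal sketch of this box plot, using made-up rows with only two grade levels to keep it short:

```r
library(ggplot2)

# Made-up rows for illustration only.
data <- data.frame(
  Pstatus = rep(c("A", "T"), each = 6),
  G3      = c(5, 7, 8, 10, 11, 13, 4, 6, 9, 10, 12, 13),
  grade3  = factor(rep(rep(c("FAIL", "PASS"), each = 3), times = 2),
                   levels = c("FAIL", "PASS"))
)

# One box per combination of cohabitation status and grade category.
p <- ggplot(data, aes(x = Pstatus, y = G3, color = grade3)) +
  geom_boxplot() +
  labs(x = "Parents' cohabitation status", y = "Final grade (G3)", color = "Final grade")
```

Because ‘grade3’ is derived from ‘G3’, the boxes of each grade category are confined to that category’s mark range by construction, which limits how much they can differ between the two groups.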

In the box plot shown above, there is no significant difference for grades higher than ‘FAIL’,
as they have almost the same maximum and minimum. However, the boxes for the ‘FAIL’ grade
differ in size. The median of students with parents living together is higher at the “FAIL”
grade, which is odd.
Figure 3.2.9: Parent cohabitation status with student count
Because the box plot looked odd, I used the describe() function to count the students in each
cohabitation status. The result shows that the box plot is not reliable for the analysis, as
the groups are very unbalanced: 92 students for ‘A’ and 744 for ‘T’.
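When group sizes differ this much, comparing the share of each grade within a group can be more informative than comparing raw counts. A minimal base-R sketch, using made-up rows rather than the real 92 and 744 students:

```r
# Made-up rows for illustration only.
data <- data.frame(
  Pstatus = c("A", "A", "T", "T", "T", "T", "T", "T"),
  grade3  = factor(
    c("FAIL", "PASS", "FAIL", "PASS", "PASS", "CREDIT", "CREDIT", "EXCELLENT"),
    levels = c("FAIL", "PASS", "CREDIT", "EXCELLENT")
  )
)

# Group sizes: shows how unbalanced the two statuses are.
group_sizes <- table(data$Pstatus)
print(group_sizes)

# Share of each grade within every status (each row sums to 1).
grade_share <- prop.table(table(data$Pstatus, data$grade3), margin = 1)
print(round(grade_share, 2))
```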

Analysis 2-5: Student’s guardian influence on their grade

In this analysis, we find out the relationship between a student’s grade and the student’s
guardian. The ‘guardian’ attribute is plotted against the student’s final grade as a violin
plot using geom_violin(). The color of the plot is set according to the ‘grade3’ attribute so
that the four different grades can be easily distinguished. The rest of the code just adds
labels that tidy up the plot and make it easier to analyse.
Only one violin shows the distinction grade, and it belongs to students who have their mother
as their guardian. At the same time, students with their mother as guardian also have the most
failures. The violins do not differ significantly, so this analysis is not conclusive.
Analysis result: Question 2

After plotting multiple graphs and analyzing the attributes relevant to family status and
relationship, we have found that some aspects of family do affect a student’s grade. This is
especially true of the parents: the education level and the job of both parents have a great
impact on the student’s performance and learning behavior. On the flip side, family size does
not affect the student’s grade significantly. The cohabitation status of the parents and the
student’s guardian may affect the grade, but far more balanced data is required for further
analysis.
To conclude, family relationship and family status influence a student’s grade. This may be
because students adopt their learning habits from their parents, who influence them from a
young age.

3.3 Question 3: How does a student’s situation and environment affect their grade ?
Other than family influence, another big factor determining a student’s educational performance
is their learning environment. According to Ella Hendrix, a positive learning environment is
crucial for a student because it helps to boost their motivation, interest and focus. Students
are more engaged in an environment that has fewer distractions (Ella, 2019). Moreover, students
perform at their best when they are healthy and have the best resources.
I will therefore analyse the attributes related to a student’s environment and situation (e.g.
health, resources, obstacles). This will allow me to understand more about how a student’s
hardships and available resources influence their grade.
Analysis 3-1: Relationship of travel time to school and student final grade
In this analysis, the final grade of the students is compared to the ‘traveltime’ attribute to
find out whether students who take longer to get to school obtain lower grades.
In the code I used geom_jitter() to plot a graph with the student’s final grade on the x-axis
and the travel time needed to reach school on the y-axis. I also used facet_wrap() to split
the students into female and male panels to observe whether gender has any influence. Finally,
appropriate labels are added to the plot.
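A hedged sketch of this faceted jitter plot is shown below. The column name ‘sex’ and its “F”/“M” coding are assumptions, and the rows are made up.

```r
library(ggplot2)

set.seed(1)  # makes the random jitter reproducible

# Made-up rows; 'sex' is an assumed column name.
data <- data.frame(
  sex        = c("F", "F", "F", "M", "M", "M"),
  traveltime = c(1, 2, 4, 1, 3, 4),
  grade3     = factor(
    c("EXCELLENT", "PASS", "FAIL", "CREDIT", "PASS", "FAIL"),
    levels = c("FAIL", "PASS", "CREDIT", "EXCELLENT")
  )
)

p <- ggplot(data, aes(x = grade3, y = traveltime)) +
  geom_jitter(width = 0.2, height = 0.1) +  # small offsets so equal values do not overlap
  facet_wrap(~ sex) +                       # one panel per gender
  labs(x = "Final grade", y = "Travel time to school")
```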

The jitter plot in Figure 3.3.2 shows that the longer students travel to school, the lower
their grades are. The number of students scoring credit or distinction approaches zero when the
student needs 2 hours or more to reach school. Furthermore, the plot shows no difference
between the 2 genders, which follow approximately the same pattern. Thus, the time needed to
travel to school appears to affect the student’s final grade.
Analysis 3-2: How extra educational support affect student final grade
This analysis is done to find out whether students perform better if they have additional
educational help from either school or home. Thus, we plot graphs of students having extra
educational support using the ‘schoolsup’ and ‘famsup’ attributes.

For this analysis, 2 plots are drawn to investigate the 2 kinds of educational support. The
first ggplot() call uses extra school educational support as the x-axis and the second uses
family educational support. Both plots are bar charts with the ‘dodge’ position, in which every
type of grade is represented by its own bar. At the end of the code, the labs() function is
used to add all the appropriate labels so that the graphs are clear and easy to understand.
Figure 3.3.4: first visualization of analysis 3-2
For the first plot of the analysis, nothing can be concluded because of the large difference
between the number of students with school support and the number of students without it. Thus,
this visualization is treated as inconclusive and is not included in the analysis.

Figure 3.3.5: second visualization of analysis 3-2
By observing this plot, we can see that the students who have family educational support have a
higher passing rate. Overall, however, the difference is not significant, as the number of students
with support is approximately twice the number of students who do not have educational
support. As a result, I would say extra educational support may affect a student’s grade a little,
but it will not necessarily affect a student significantly.
Analysis 3-3: Does having internet access at home affect final grade
This analysis will find out whether a student with internet access at home performs better or
worse than a student who does not have internet access at home.

In this analysis, a bar chart is also used because the attribute being analyzed is
categorical and non-numeric. The geom_bar() function is used with alpha = 0.7 as a parameter so
that the bars are slightly transparent, which looks nicer. Furthermore, coord_flip() is used to
swap the x and y axes so that the bars run horizontally and a better and bigger view can be
obtained. The legend is also placed at the top so that the graph does not have to shrink to fit it
at the side.
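
A minimal sketch of the three settings mentioned above, assuming a data frame `students` with a
yes/no column named `internet` and a grade column named `grade3` (both names and all values are
hypothetical):

```r
library(ggplot2)

# Hypothetical stand-in data; all values are made up.
students <- data.frame(
  grade3   = factor(c("FAIL", "PASS", "CREDIT", "DISTINCTION", "PASS", "DISTINCTION"),
                    levels = c("FAIL", "PASS", "CREDIT", "DISTINCTION")),
  internet = c("yes", "yes", "yes", "yes", "no", "no")
)

p_internet <- ggplot(students, aes(x = internet, fill = grade3)) +
  geom_bar(position = "dodge", alpha = 0.7) +  # alpha < 1 makes the bars semi-transparent
  coord_flip() +                               # swap the axes so the bars run horizontally
  theme(legend.position = "top") +             # legend above the panel instead of beside it
  labs(x = "Internet access at home", y = "Number of students", fill = "Final grade")
```

The "dodge" position here is an assumption carried over from the previous analysis; the text does
not say which position was used for this plot.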

By just glancing over the graph, we can see that there is a big difference in the student count
between the two groups. However, we can observe that there are still students who score distinction
even without internet access at home. Thus, students do not necessarily need internet access at
home to perform well.
Analysis 3-4: Will attending nursery school influence student’s grade

This analysis is about finding out whether students who attended nursery school obtain
different results from those who did not attend nursery school.

In this analysis, a bar chart is still used, but it is set to the “fill” position in the geom_bar()
function so that we are able to compare the proportions between the 2 bars. Appropriate labels are
also added to the plot so that a clear graph can be displayed.
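
The “fill” position can be illustrated with the sketch below, which also computes the same
proportions as a table so they can be read as numbers. The data frame `students`, its values, and
the column names `nursery` and `grade3` are assumptions made for the example.

```r
library(ggplot2)

# Hypothetical stand-in data; all values are made up.
students <- data.frame(
  grade3  = factor(c("FAIL", "PASS", "PASS", "CREDIT", "FAIL", "PASS"),
                   levels = c("FAIL", "PASS", "CREDIT", "DISTINCTION")),
  nursery = c("yes", "yes", "yes", "yes", "no", "no")
)

# position = "fill" stretches every bar to height 1, so segments show proportions.
p_nursery <- ggplot(students, aes(x = nursery, fill = grade3)) +
  geom_bar(position = "fill") +
  labs(x = "Attended nursery school", y = "Proportion of students", fill = "Final grade")

# The same proportions as numbers: each row (nursery group) sums to 1.
grade_props <- prop.table(table(students$nursery, students$grade3), margin = 1)
```

Working with proportions rather than counts keeps the comparison fair when one group is much
larger than the other.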
By comparing the proportion of each grade category in both bars, we are unable to spot any big
difference or pattern. The proportion of every grade is almost the same for both bars, so we can
conclude that attending a nursery school does not noticeably affect a student’s final grade.

Analysis 3-5: How will student’s health affect their grade

When we discuss a student’s situation, we will mostly think about their health status. So,
in this analysis we will find the relationship between a student’s health status and their
corresponding grade.
In this analysis, I have used a jitter plot combined with a line plot to visualize the data, since
the health status attribute is numerical. The color is mapped to the student’s final grade
category so that comparison is easier. Lastly, suitable labels are added to improve the
visualization.
From the overall view of the plot, we can see that students with very bad health through to very
good health can score any grade. From the plot given in Figure 3.3.11, we can conclude that a
student’s health status does not have much impact on their final grade.

Analysis result: Question 3

After analyzing all the attributes related to a student’s environment and their current situation,
it is surprising that most of them do not have a very significant or apparent influence on a
student’s final grade. At most, we can say that extra educational support from family can
improve a student’s educational performance slightly, but not significantly.

4.0 EXTRA FEATURES
4.1 Extra Feature 1: rm(list = ls())
Figure 4.1.1: code snippet of extra feature 1
In this line of code, the rm() and ls() functions are combined in order to remove all the
variables that are present in the global environment (the workspace, which is not the same thing as
the working directory). The ls() function returns the names of all the objects in the global
environment as a character vector. This vector is then passed to the list argument of the remove
function, rm(), which deletes those objects from the environment.
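
The mechanism can be tried safely with the sketch below. It uses a scratch environment (an
assumption of this example, named `scratch`) instead of the global environment, so running it does
not wipe the reader's own workspace; the object names mirror the ones mentioned in this section.

```r
# A scratch environment stands in for the global environment.
scratch <- new.env()
assign("data", data.frame(x = 1:3), envir = scratch)
assign("install", function() NULL, envir = scratch)
assign("package", function() NULL, envir = scratch)

# ls() returns the object names as a sorted character vector.
before <- ls(scratch)

# rm() accepts that character vector through its `list` argument.
rm(list = ls(scratch), envir = scratch)

# After the call, the environment is empty.
after <- ls(scratch)
```

In a script, the equivalent call on the global environment is simply rm(list = ls()).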
Figure 4.1.2: initial global environment
Figure 4.1.3: global environment after running the code
As we can see from the output after running rm(list = ls()), the initial objects (data, install,
package) that were in the global environment are all erased. This line of code is very useful
because it clears all the variables before we start a specific R script, as leftover variables may
be irrelevant. It also helps to keep the global environment tidy and easy to work with.

4.2 Extra Feature 2: function()
The function() command is used to create customized functions that run the code in their body
when the function is called by name. In Figure 4.2.1, I have created 2 functions
called “install” and “package”. The “install” function installs all the packages that are listed
in the arguments, while the “package” function loads all the packages that are needed.
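
A sketch of what two such helper functions might look like is given below. It is not the code in
Figure 4.2.1: the package vector `pkgs` and the check for already installed packages are
assumptions added for illustration.

```r
# Hypothetical package list; the real script may use a different set.
pkgs <- c("ggplot2")

# Installs any listed package that is not yet available on this machine.
install <- function() {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!available]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  invisible(missing_pkgs)
}

# Loads every listed package; character.only = TRUE lets library() take a string.
package <- function() {
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```

Nothing is installed or loaded until the functions are called, for example with install() followed
by package().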
Figure 4.2.2: customized function created after running the code
After the customized functions are created, they can be called using the function names given. For
example, if I run the second last line of code shown in Figure 4.2.1, the install() function will
be executed and install all the packages that are listed in the function’s argument. This is
especially useful if I work on a different device and need to install all the packages and load
them all at once.

4.3 Extra Feature 3: cut()
In the figure shown above, the cut() function is used to categorize the numeric values of the
student grades into factor levels, which are “FAIL”, “PASS”, “CREDIT” and “DISTINCTION”. The
breaks parameter of the function indicates the range of each grade category.
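
The behaviour of cut() can be illustrated with the small sketch below. The break points are
assumptions chosen for the example (only the 1-8 range for “FAIL” is mentioned in this report), and
the vector `g3` holds made-up marks.

```r
# Made-up marks on a 0-20 scale.
g3 <- c(0, 5, 9, 12, 15, 20)

# Hypothetical break points; by default each interval is open on the left
# and closed on the right: (0,8], (8,11], (11,15], (15,20].
grade_breaks <- c(0, 8, 11, 15, 20)
grade_labels <- c("FAIL", "PASS", "CREDIT", "DISTINCTION")

grade3 <- cut(g3, breaks = grade_breaks, labels = grade_labels)

# A value that falls outside every interval (here the mark 0) becomes NA.
as.character(grade3)
```

The result is a factor, which is why it can be used directly as a grouping or fill variable in the
plots.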
Figure 4.3.2: output of cut() function
After the code has been executed, 3 new columns, “grade1”, “grade2” and “grade3”, have
been added to the data frame. Using the first row as reference, we can see that the student has
“FAIL” grades in the new columns, as the values in G1 to G3 are in the range of 1-8. This function
is used to preprocess the data into a more readable form so that the analysis can be done
smoothly and efficiently.

4.4 Extra Feature 4: skim() and describe()
Figure 4.4.1: code snippet of function describe()
Figure 4.4.2: code snippet of function skim()
The describe() function is used to give a summary of each attribute with information like the
number of unique values, the number of missing values and more. The skim() function is used to
display another summary that shows how many columns there are of each data type.
Figure 4.4.3: first 2 lines of output from describe() function
Figure 4.4.4: output of skim() function
With the additional summaries given, I am able to have a better understanding of the dataset that
I am going to analyze. Thus, these 2 functions are worth adding to the R script.

4.5 Extra Feature 5: na.omit() and colSums()
Figure 4.5.2: output of colSums()
Figure 4.5.3: output of colSums() after omitting the N/A values
The na.omit() function is used to remove all the rows that have N/A values so that the data set
can be cleaned. The colSums() function is used to show all the attributes and their corresponding
total number of N/A values. Using colSums() in Figure 4.5.2, we are able to identify that there
are 27 and 86 N/A values in grade2 and grade3 respectively. Thus, na.omit() is used to remove
them, as can be seen in Figure 4.5.3.
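
The two steps can be sketched on a tiny made-up data frame as follows. The example assumes that
colSums() is applied to the result of is.na(), which is the usual way to count missing values per
column; the data frame `students` and its values are hypothetical.

```r
# Made-up data with missing values in the two grade columns.
students <- data.frame(
  G1     = c(5, 12, 9, 16),
  grade2 = c("FAIL", NA, "PASS", "DISTINCTION"),
  grade3 = c(NA, NA, "PASS", "DISTINCTION")
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(students))

# na.omit() drops every row that contains at least one missing value.
cleaned <- na.omit(students)
```

Here `na_counts` reports 0, 1 and 2 missing values for G1, grade2 and grade3, and `cleaned` keeps
only the two complete rows.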

4.6 Extra Feature 6: facet_wrap()
Figure 4.6.2: output of facet_wrap()

This function generates an individual graph for each value of an attribute. This is especially
useful when a single variable with multiple values needs to be displayed in a tidy and easy to
read manner. In Figure 4.6.2, the graphs are split by the “goout” attribute using
facet_wrap(~ goout), as shown in Figure 4.6.1. If a third attribute is used in an analysis, the
facet_wrap() function can also be used to plot a graph for each value of that attribute.

4.7 Extra Feature 7: theme(legend.position = “top”)
Figure 4.7.1: output of extra feature 7
By implementing this extra feature, we are able to position the legend of the graph at the top so
that the whole plot can expand to the right. Another benefit of putting the
legend at the top is that it is less distracting when we are analyzing the plot that is displayed.

4.8 Extra Feature 8: coord_flip()
Figure 4.8.2: output of coord_flip()
When the bars are too long when viewed vertically, it can lower the efficiency and
accuracy of the analysis. Thus, coord_flip() has been implemented in the plot shown in
Figure 4.8.2. This function flips the Cartesian coordinates so that x and y are swapped and the
bars are displayed horizontally.

5.0 CONCLUSION
Most of the data from the dataset ‘student’ has been analyzed using the R programming
language. The data is visualized as density plots, box plots, bar charts, point graphs and more.
The fundamentals of the R programming language and data analysis concepts are applied in the
analysis so that the initial data is manipulated to be more useful and easier to visualize
graphically.
In the analysis, many aspects of the students are considered, and useful information is
retrieved from the dataset. The aspects mainly focused on in this analysis are a student’s family,
environment, and personal interests.
All in all, there are still many improvements that can be made to the analysis in the near future
so that the hidden insights that have not yet been found can be retrieved from the dataset. Other
than the educational field, other industries (manufacturing, entertainment, construction etc.) can
also be analyzed using the R programming language. Without a doubt, the skill of analyzing data and
turning raw data into useful information will be useful in any field. Thus, it is very important
that we master it.

6.0 REFERENCES
HEAVY.AI. (n.d.). Data Exploration - a complete introduction. Retrieved April 29, 2022, from
https://www.heavy.ai/learn/data-exploration#:~:text=Data%20exploration%20definition%3A%20Data%20exploration,the%20nature%20of%20the%20data.
Tableau. (n.d.). Guide to data cleaning: Definition, benefits, components, and how to clean your
data. Retrieved April 29, 2022, from
https://www.tableau.com/learn/articles/what-is-data-cleaning#:~:text=Having%20clean%20data%20will%20ultimately,clients%20and%20less%2Dfrustrated%20employees.
College Raptor. (2020, June 15). Physical activity and others affecting academic performance.
Retrieved May 1, 2022, from
https://www.collegeraptor.com/find-colleges/articles/tips-tools-advice/trick-cheat-10-things-affecting-students-academic-performance/
Datapine. (2022, March 9). What is data analysis? methods, techniques, types & how-to.
Retrieved May 10, 2022, from
https://www.datapine.com/blog/data-analysis-methods-and-techniques/
Li, Z., & Qiu, Z. (2018, October 2). How does family background affect children's educational
achievement? evidence from contemporary China - the journal of Chinese sociology.
SpringerOpen. Retrieved May 7, 2022, from
https://journalofchinesesociology.springeropen.com/articles/10.1186/s40711-018-0083-8#:~:text=Family%20background%20and%20children's%20learning%20behavior&text=The%20higher%20the%20family's%20socioeconomic,effect%20on%20children's%20learning%20behavior.
Ella, H. (2019, December 19). How your surroundings affect The way you study. UCAS.
Retrieved May 8, 2022, from
https://www.ucas.com/connect/blogs/how-your-surroundings-affect-way-you-study
GeeksforGeeks. (2020, June 17). List all the objects present in the current working directory in
R programming - LS() function. Retrieved May 9, 2022, from
https://www.geeksforgeeks.org/list-all-the-objects-present-in-the-current-working-directory-in-r-programming-ls-function/

LEVEL 2 Asia Pacific University of Technology and Innovation 2022