FULLTEXT01 AcademicPerformance ML Review

Degree Project in Machine Learning
Second cycle, 30 credits
Machine Learning to predict

student performance based on
well-being data: a technical and
ethical discussion
LUCY MCCARREN
Stockholm, Sweden, 2023

Machine Learning to predict
student performance based on
well-being data: a technical and
ethical discussion
LUCY MCCARREN
Master’s Programme, Machine Learning, 120 credits

Date: August 31, 2023
Supervisors: Hedvig Kjellström, Barbro Fröding, Jalal Nouri

Examiner: Olov Engwall
Host company: EdAider (EdAI Technologies AB)

Swedish title: Maskininlärning för att förutsäga elevers prestationer baserat på
data om mående: en teknisk och etisk diskussion
© 2023 Lucy McCarren
Abstract | i
Abstract
The data provided by educational platforms and digital tools offers new ways
of analysing students’ learning strategies. One such digital tool is the well-
being platform created by EdAider, which consists of an interface where
students can answer questions about their well-being, and a dashboard where
teachers and schools can see insights into the well-being of individual students
and groups of students. Both students and teachers can see the development
of student well-being on a weekly basis.
This thesis project investigates how Machine Learning (ML) can be used along
side Learning Analytics (LA) to understand and improve students’ well-being.
Real-world data generated by students at Swedish schools using EdAider’s
well-being platform is analysed to generate data insights. In addition ML
methods are implemented in order to build a model to predict whether students
are at risk of failing based from their well-being data, with the goal to inform
data-driven improvements of students’ education.
This thesis has three primary goals which are to:
1. Generate data insights to further understand patterns in the student well-

being data.
2. Design a classification model using ML methods to predict student
performance based on well-being data, and validate the model against
actual performance data provided by the schools.
3. Carry out an ethical evaluation of the data analysis and grade prediction
model.
The results showed that males report higher well-being on average than
females across most well-being factors, with the exception of relationships
where females report higher well-being than males. Students identifying as
non-binary gender report a considerably lower level of well-being compared
with males and females across all 8 well-being factors. However, the amount
of data for non-binary students was limited. Primary schools report higher
well-being than the older secondary school students. Students reported
anxiety/depression as the most closely correlated dimensions, followed by
engagement/accomplishment and positive emotion/depression.
Logistic regression and random forest models were used to build a

ii | Abstract
performance prediction model, which aims to predict whether a student is at

risk of performing poorly based on their reported well-being data. The model
achieved accuracy of 80-85 percent. Various methods of feature importance
including regularization, recursive feature selection, and impurity decrease for
random forest were investigated to examine which well-being factors have the
most effect on performance. All methods of examining feature importance
consistently identified three features as important: ”accomplishment,”
”depression,” and ”number of surveys answered.”
The benefits, risks and ethical value conflicts of the data analysis and
prediction model were carefully considered and discussed using a Value
Sensitive Design approach. Ethical practices for mitigating risks are discussed.
Keywords
Machine Learning, Data Science, Learning Analytics
Sammanfattning | iii
Sammanfattning
Den data som tillhandahålls av utbildningsplattformar och digitala verktyg
erbjuder nya sätt att analysera studenters inlärningsstrategier. Ett sådant
digitalt verktyg är mående plattformen skapad av EdAider, som består av ett
gränssnitt där elever kan svara på frågor om deras mående, och en dashboard
där lärare och skolor kan se insikter om individuella elevers och grupper av
elevers mående. Både elever och lärare kan se utvecklingen av elevers mående
på veckobasis.
Detta examensarbete undersöker hur Maskininlärning (ML) kan användas

tillsammans med Inlärningsanalys (LA) för att förstå och förbättra elevers
mående. Verkliga data genererade av elever vid svenska skolor med hjälp
av EdAiders måendeplattform analyseras för att skapa insikter om data.
Dessutom implementeras ML-metoder för att bygga en modell för att förutsäga
om elever riskerar att misslyckas baserat på deras mående-data, med målet att
informera data-drivna förbättringar av elevers utbildning.
Detta examensarbete har tre primära mål:
1. Skapa datainsikter för att ytterligare förstå mönster i data om elevers

mående.
2. Utforma en modell med hjälp av ML-metoder för att förutsäga
elevprestationer baserat på mående-data, och validera modellen mot
faktiska prestationsdata som tillhandahålls av skolorna.
3. Utföra en etisk utvärdering av dataanalysen och modellen för betygspre-
diktion.
Resultaten visade att pojkar i genomsnitt rapporterar högre mående än

flickor inom de flesta måendefaktorer, med undantag för relationer där
flickor rapporterar högre mående än pojkar. Elever som identifierar sig
som icke-binära rapporterar en betydligt lägre nivå av mående jämfört
med pojkar och flickor över alla 8 måendefaktorer. Men mängden data för
icke-binära elever var begränsad. Grundskolor rapporterar högre mående än
äldre gymnasieelever. Elever rapporterade ångest/depression som de mest
nära korrelerade dimensionerna, följt av engagemang/prestation och positivt
känsloläge/depression.
Logistisk regression och random forest-modeller användes för att bygga en

prestationsprediktionmodell, med en noggrannhet på 80-85 procent uppnådd.
iv | Sammanfattning
Olika metoder för feature selection undersöktes, inklusive regularisering,

recursive feature selection och impurity decrease för random forest. Alla
metoder för undersökning av feature selection identifierade konsekvent
tre funktioner som viktiga: ”prestation,” ”depression,” och ”antal svarade
enkäter.”
Fördelarna, riskerna och etiska värdekonflikterna i dataanalysen och predik-

tionsmodellen beaktades noggrant och diskuterades med hjälp av en Value
Sensitive Design-ansats.
Nyckelord
Maskininlärning, Data Science, Inlärningsanalys
Acknowledgments | v
Acknowledgments
Firstly, I would like to thank my supervisors Barbro Fröding and Hedvig
Kjellström for their valuable feedback and support. Hedvig has been open-
minded and solution-orientated whenever I encountered problems, and Barbro
has inspired and encouraged me on my journey to research further within the
field of technology ethics.
Thanks to Jalal Nouri for providing me with the opportunity to work with
EdAider’s data, and to Kirill Maltsev and Knut Sørli for answering my
questions in a timely and thorough manner.
I would also like to thank my uncle Andrew for his support and encouragement
throughout my academic journey, and for his honest and thorough feedback on
my research.
Lastly, I would like to thank my dear friends and partner for the emotional
support, laughs and encouragement during the two years of my masters degree.
Stockholm, August 2023

Lucy McCarren
vi | Acknowledgments
Contents | vii
Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Research Goals . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 UN’s Global Goals . . . . . . . . . . . . . . . . . . . 3
1.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.1 Limited data . . . . . . . . . . . . . . . . . . . . . . 3
1.4.2 Self-reported data . . . . . . . . . . . . . . . . . . . . 4
1.5 Structure of the thesis . . . . . . . . . . . . . . . . . . . . . . 4
2 Literature Review 5
2.1 Learning Analytics . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Prediction of Academic Performance . . . . . . . . . . . . . . 6
2.3 Student well-being . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Ethical Learning Analytics . . . . . . . . . . . . . . . . . . . 8
2.4.1 Conflicting values . . . . . . . . . . . . . . . . . . . 8
2.5 Summary of literature . . . . . . . . . . . . . . . . . . . . . . 9
3 Data 10
3.1 Description of data . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Imbalanced datasets . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Ethical data concerns . . . . . . . . . . . . . . . . . . . . . . 13
4 Machine Learning Theory 14

4.1 Classification models . . . . . . . . . . . . . . . . . . . . . . 15
viii | Contents
4.1.1 Logistic regression . . . . . . . . . . . . . . . . . . . 15

4.1.2 Decision trees . . . . . . . . . . . . . . . . . . . . . . 16
4.1.3 Random forest . . . . . . . . . . . . . . . . . . . . . 17
4.2 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Time series feature extraction . . . . . . . . . . . . . . . . . . 19
4.4 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4.1 Embedded methods . . . . . . . . . . . . . . . . . . . 20
4.4.2 Wrapper methods . . . . . . . . . . . . . . . . . . . . 21
4.4.3 Dimensionality reduction techniques . . . . . . . . . . 21
5 Methods 22
5.1 Data pre-processing . . . . . . . . . . . . . . . . . . . . . . . 22
5.2 Data analysis and visualisation . . . . . . . . . . . . . . . . . 23
5.3 Model for performance prediction . . . . . . . . . . . . . . . 23
5.3.1 Feature extraction . . . . . . . . . . . . . . . . . . . . 23
5.3.1.1 Tsfresh features . . . . . . . . . . . . . . . 24
5.3.1.2 Custom features . . . . . . . . . . . . . . . 25
5.3.2 Synthetic Minority Oversampling Technique (SMOTE) 26
5.3.3 Model selection . . . . . . . . . . . . . . . . . . . . . 26
5.3.4 Evaluation metrics . . . . . . . . . . . . . . . . . . . 28
5.3.4.1 Accuracy . . . . . . . . . . . . . . . . . . . 28
5.3.4.2 Confusion matrix . . . . . . . . . . . . . . 29
5.3.4.3 Precision and Recall . . . . . . . . . . . . . 29
5.3.4.4 Area under the ROC curve (AUC) . . . . . . 30
5.4 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.5 Value Sensitive Design . . . . . . . . . . . . . . . . . . . . . 31
6 Results 33
6.1 Major data insights . . . . . . . . . . . . . . . . . . . . . . . 33
6.1.1 The impact of gender and age on reported well-being . 33
6.1.2 Relationship between well-being categories . . . . . . 35
6.1.3 Relationship between well-being and performance . . 37
6.2 Performance prediction model . . . . . . . . . . . . . . . . . 37
6.2.1 Logistic regression . . . . . . . . . . . . . . . . . . . 37
Contents | ix
6.2.2 Random Forest . . . . . . . . . . . . . . . . . . . . . 40

6.2.3 Feature importance . . . . . . . . . . . . . . . . . . . 41
6.3 Value Sensitive Design . . . . . . . . . . . . . . . . . . . . . 44
6.3.1 Benefits for the stakeholders . . . . . . . . . . . . . . 44
6.3.2 Risks for the stakeholders . . . . . . . . . . . . . . . 45
6.3.3 Value conflicts . . . . . . . . . . . . . . . . . . . . . 46
7 Discussion 49
7.1 Data insights . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2 Performance prediction model . . . . . . . . . . . . . . . . . 51
7.3 Ethical discussion . . . . . . . . . . . . . . . . . . . . . . . . 55
8 Conclusion 57
8.1 Answering research questions . . . . . . . . . . . . . . . . . . 57
8.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
References 61
x | Contents
Introduction | 1
Chapter 1
Introduction
This chapter introduces the importance of student well-being in primary

and secondary education, the connection between well-being and academic
performance, and discusses how Learning Analytics (LA) and Machine
Learning (ML) can be used to support student well-being. The purpose, goals
and delimitations of the thesis are discussed.
1.1 Background
Education is fundamental to human development, providing individuals with
the knowledge and skills necessary to navigate the world and achieve their
goals. With accelerated digitalisation and increased quantities of data in
educational settings, there is growing interest in understanding how these
tools can be used to measure and promote student well-being, and support
individualized learning experiences. Educational institutions have started
to pay attention to the promises of big data and data mining techniques
to support learning and teaching in more efficient and effective ways [1].
Learning Analytics (LA) is the name given to this field that uses data analytics
to measure, analyse, and understand the learning processes and results in
educational settings [2]. The main opportunities of LA are to “predict learner
outcomes, trigger interventions or curricular adaptations, and even prescribe
new pathways or strategies to improve student success” [3].
According to Skolverket, the Swedish National Agency for Education, student

health services in Sweden predominantly concentrate on individual-level
interventions for students who are experiencing ill health [4]. While there are
2 | Introduction
definitions of health promotion and prevention work, student health services

in practice tend to be largely reactive and aimed at addressing problems as
they arise. Consequently, the focus on promoting student well-being is often
overshadowed by a student health service that needs to prioritise students who
are feeling unwell. The study highlights the need for a more systemic approach
to health promotion and prevention of mental health issues in schools. By
shifting the focus from reactive measures to preventative measures, educators
and health professionals may be better able to create a supportive system that
promotes student well-being [4].
EdAider [5] is a Swedish EdTech startup founded by researchers at Stockholm

University that are aiming to address this problem. The company develops AI
and Analytics technology for educational purposes, aiming to support data-
driven decision-making in schools and to provide adaptive learning support
that intelligently addresses individual student needs. One tool designed by
EdAider is a well-being platform, which is designed to give insights into
student well-being so that schools and teachers can intervene in a timely
manner, monitor the development of students’ well-being over time, and see
the connection between student well-being and performance.
1.2 Purpose
The purpose of this study is to analyse the student well-being and performance
data collected by EdAider, with the following objectives:
1. To create insights and visualisations using the student well-being data.
This will help EdAider to improve the well-being dashboard, which
will in turn enable teachers and educational institutions to make better
pedagogical decisions by visualising the well-being data in a clear and
concise manner. A well-designed dashboard will enable teachers to
intervene promptly when required.
2. To investigate the relationship between well-being and performance.
This can help educators to identify students who are at risk of
performing poorly or who require special attention or counselling by
analysing their well-being data. This is a crucial issue in education,
affecting students at all levels and in schools and universities worldwide.
3. To identify and discuss the ethical implications of using student well-
being data and ML methods for gaining knowledge about learners’
Introduction | 3
behavior. While the use of student data can benefit students, teachers,
and institutions by enhancing understanding and constructing didactical
interventions, it also raises significant ethical concerns.
1.3 Research Goals

This thesis has three primary goals which are to:

being data.
2. Design a classification model using ML methods to predict student
grades based on well-being data, and validate the model against actual
performance data provided by the schools.
model using Value Sensitive Design [6].
1.3.1 UN’s Global Goals

In 2015, the United Nations Member States adopted the 2030 Agenda for
Sustainable Development [7], which is a plan of action for people, planet and
prosperity. This thesis primarily addresses two of these global goals; good
health and well-being, and quality education. One subgoal of the Agenda
states: ”By 2030, ensure that all girls and boys complete free, equitable and
quality primary and secondary education leading to relevant and effective
learning outcomes” [7]. This thesis aims to gain a better understanding of
student behaviour and well-being, and to identify connections between well-
being and academic performance. The resulting insights will allow schools
and educational management to improve the quality of educational practices,
thus addressing the UN’s goals of improving both well-being and education.
1.4 Limitations
1.4.1 Limited data

This study focuses on well-being data provided by eight schools in Sweden
(3 primary level schools and 5 secondary level schools). Performance and
attendance data was provided for one secondary school only. Due to the limited
4 | Introduction
size of this dataset, the results can therefore not be considered as general and
applicable for every school within the Swedish school system. The results
should be seen only as a supporting tool to teachers.
1.4.2 Self-reported data

The well-being data used for this thesis is self-reported by students. Some
challenges with self-reported data should be considered.
1. The students may not always provide accurate information. This can
occur due to a variety of reasons, such as social desirability bias [8],
where students may provide answers that they believe are more socially
acceptable or pleasing to their teacher. This risk is heightened by
the fact that the well-being data is not anonymised for teachers or
schools. Students may not want to report low well-being for fear of
being stigmatized, or on the other hand they may exaggerate symptoms
of depression or anxiety in order to receive more attention.
2. Some students may have difficulty with understanding the questions,
or difficulty communicating their experiences, which can lead to
inaccurate responses. This may be due to individual factors such as
literacy, language proficiency and cognitive function.
3. Self-reported data can be subject to recall bias [9]. EdAider’s well-
being survey asks students about their well-being over the past 7 days.
Students may therefore have difficulty to remember their experiences,
or difficulties to communicate well-being which is episodic or fluctuates
over time.
1.5 Structure of the thesis

Chapter 2 is a literature review that introduces definitions and previous
research about Learning Analytics, academic performance prediction, student
well-being and ethics in Learning Analytics. Chapter 3 presents the dataset
used for the thesis. Chapter 4 describes the theoretical Machine Learning
approaches used. Chapter 5 presents the method used to answer the research
questions. Results of the data analysis and prediction model are presented in
Chapter 6. Finally, a discussion of the results is presented in Chapter 7, and a
conclusion in Chapter 8.
Literature Review | 5
Chapter 2
Literature Review
This chapter introduces definitions and previous research about Learning

Analytics 2.1, academic performance prediction 2.2, and student well-being
2.3. In addition, research about ethics in Learning Analytics is discussed 2.4.
2.1 Learning Analytics

The Society for Learning Analytics Research (SOLAR) defines Learning
Analytics as the measurement, collection, analysis and reporting of data about
learners and their contexts, for purposes of understanding and optimising
learning and the environments in which it occurs [10]. It is therefore an
interdisciplinary area, including but not limited to data analytics, psychology,
machine learning and education [11]. The purposes of Learning Analytics
may vary, but according to SOLAR [10], some common applications are:
1. Descriptive and Diagnostic analytics

2. Prediction of student academic success
3. Identification of students that are at risk of failing a course
4. Provision of personalised feedback to students
5. Development of student awareness by supporting self-reflection
This thesis will focus predominantly on the first two applications listed, namely
descriptive and diagnostic analytics, and the prediction of student academic
success based on their well-being data.
A survey of educational data mining and learning analytics by Viberg et al.,

reviewed current research on LA in higher education and indicated that the
6 | Literature Review
most common methods of data analysis for LA are prediction, clustering and
relationship mining [12]. One of the most common tasks tackled by LA
research is the prediction of student performance, for which the most common
methods include regression and classification. López-Zambrano et al., [13]
mention that there is less data available for primary and secondary education
compared to tertiary levels, where tertiary refers to university and college
level education; 86.6% of published papers included in the review were done
at tertiary level, and only 7.3% for secondary level students, with none for
primary level students. A possible reason for this is the accessibility of the
data, as university education is much more often digitalised than primary and
secondary education. This highlights that further LA research at primary and
secondary education level is important.
2.2 Prediction of Academic Performance

The prediction of academic performance is one of the most popular subjects
in the field of Learning Analytics [14][15]. Academic performance can be
defined as the score obtained by students from an evaluation at the end of a
learning period. The term ”Early Warning System” is often used to describe
a system that uses data variables gathered from students to make predictions
about their performance or about their risk of dropping out [13].
A longitudinal study conducted in the US [16] between 2009 and 2012 used
Machine Learning to detect students that were prone to dropping out of upper
secondary school. In the study, the most important variables for the model
were identified to be the student’s GPA, age, math test scores, expulsion
record and attendance [16]. A systematic review of studies regarding Early
Prediction of Student Learning Performance [13] outlines a variety of Machine
Learning techniques and variables used for prediction. The studies included
in the review achieved varying prediction accuracy, and the accuracy was
highly dependent on the number and types of variables included. The
variables and student attributes used for prediction varied depending on the
educational environment. In general, the variables could be grouped into
student demographics, student activities and student interactions with an e-
learning platform. It is worth noting that none of the aforementioned studies
take student well-being into account as a variable in their predictive model,
meaning that this area requires further research.
2.3 Student well-being

Student well-being can be viewed as a population-based term targeting positive
feelings about oneself and indicating an ability to deal with the pressures
and challenges of student life and learning. This is in contrast to mental
health issues, which apply to a subgroup of the student population where
specific issues such as depression, worry, anxiety or isolation are having
a negative impact on the person, and who do not feel they have the inner
capacity to address these experiences [17]. Unsurprisingly, earlier research
shows that poor well-being and mental health frequently co-occur with low
academic achievement [18][19]. Furthermore, successful interventions to
improve student well-being have been shown to improve academic results,
suggesting that academic achievement may improve when mental health issues
are addressed [20][21].
A study of mental health of Canadian secondary school students [22]

investigated whether symptoms of mental illness (depression and anxiety)
and mental well-being are associated with self-reported grades and education
behaviours. This was done using simple statistical measures such as direction,
magnitude and significance. The results showed that lower depression
and higher well-being were consistently associated with better academic
performance and education behaviours. Some key findings from other research
highlight that students’ well-being varies substantially between schools, and
that schools which have practices to promote well-being have better overall
well-being levels [23].
Similar results are observed in the context of university-level education.

In a UK study [24], the authors explored the interrelationships between
health awareness, health behavior, subjective health status, satisfaction with
educational experience, and three measures of educational achievement. This
study considered factors related to well-being, such as sufficient income,
sleep, and physical health. The results indicate a reciprocal association
between health, health behaviors, and educational achievement. Another
study conducted in Romania [25] investigated the link between academic
performance, student engagement, and burnout. This study considered both
possible directions: grades as a consequence of student well-being, and
student well-being as a consequence of academic performance (grades). The
interesting result of this research is that high academic grades were found to
8 | Literature Review
be an antecedent for high student engagement and low student burnout. On

the other hand, they found that burnout has no significant effect on future
performance.
2.4 Ethical Learning Analytics

The data provided by educational platforms and digital tools offer new ways
of analysing students’ learning strategies. This information gives new insight
into how teaching and learning might be enhanced [26], and can optimise the
learner’s likelihood of success [27]. However, the focus on technology and on
how to process data in LA is far ahead of the ethics focus. The vast majority of
research in LA directs attention towards techniques and methods for analysing
complex data, rather than on the ethics related to it [26]. According to a
literature review from 2018 on learning analytics by Viberg et al., [12], only
18% of the articles mention or/and consider ethics and privacy in relation to
the conducted research, to a various degree.
2.4.1 Conflicting values

The stakeholders of educational technology, namely educational institutions,
teachers, students and EdTech companies, can have conflicting values and
interests. Educational institutions have legal and moral obligations to
demonstrate care for the well-being and growth of students, leading them to
success in their education [28]. In this sense, LA can be an aid to schools and to
teachers in the classroom, but it has also been argued that various aspects, such
as well-being campaigns, can place existing resources under strain [29]. On
the other hand we have EdTech companies that are often short on funding, and
are motivated to gather as much data as possible as fast as possible in order to
improve their products and algorithms. With new technological breakthroughs
happening quickly and a lot of money at stake, developers and researchers
of LA technology are often scrambling to release new technology before
their competitors. Ethical and societal considerations are therefore often an
afterthought. When we consider students, we cannot discount the impact
of the asymmetrical power-relationship between students and educational
institutions. Institutions have access to increasing amounts and variety of
student data, and therefore have an immense responsibility to ensure the ethical
collection, usage and storage of student data. In 2018, the EU initiated General
Data Processing Regulation (GDPR) [30]. The regulations include a statement
of how user data is allowed to be processed and applied to organisations such

as school institutions. Murchan and Siddiq (2021) [26] conclude that research
within Learning Analytics is not up to the standards of addressing issues of
ethics, privacy and regulation compared to GDPR statements. For students
to trust in educational assessments is especially important where future life
opportunities are determined by assessment scores and grades.
In order to address these conflicts of interest, Murchan and Siddiq [26] suggest
that LA research should be carried out using an ethics framework. One
example is the Sclater (2016) framework [31] which addresses 8 variables
(responsibility, transparency, consent, privacy, validity, access, minimising
adverse impacts and stewardship of data) that should be analysed. Another
approach which can be used for designing ethical LA technology is Value
Sensitive Design (VSD) [6] [32] which is discussed in detail in section 5.5.
2.5 Summary of literature

Previous studies have primarily focused on Learning Analytics and academic
performance prediction in tertiary level education, with limited research
available for primary and secondary school students. Thus further research
is required in this area. Earlier research shows an established link between
student well-being and academic performance. It is also important to consider
the causal relationship between the two; poor well-being can contribute to poor
academic performance and behaviours, whilst poor academic performance
can also contribute to poor well-being. The focus on building technology
and processing data in Learning Analytics is far ahead of the ethics focus.
The stakeholders of LA technology can have conflicting values and interests,
and one way of addressing these value conflicts is by using an ethical design
framework such as Value Sensitive Design.
10 | Data
Chapter 3
Data
3.1 Description of data

Students from 8 schools (3 primary level schools and 5 secondary schools)
in Sweden were included in the dataset. The students used EdAider’s tool to
answer questions on their well-being on a weekly basis, for varying periods
of time. The well-being questionnaire consists of 8 questions which are
indicators of 8 categories of well-being according to the PERMA model
[33]. The PERMA model of flourishing was introduced by Seligman [34],
in which psychological well-being is defined in terms of five domains:
positive emotions (P), engagement (E), relationships (R), meaning (M), and
accomplishment (A). Each of the questions is weighted as either a positive
or negative indicator of overall well-being, according to whether the question
indicates well-being or ill-being, as illustrated in table 3.1.
Question Category Weight

1 How often have you felt happy in the last 7 days? Positive emotion Positive
2 Have you felt motivated in school the last 7 days? Engagement Positive
3 My classmates have been supportive the last 7 days. Relationships peers Positive
4 My teacher has been supportive the last 7 days. Relationships teacher Positive
5 I’m satisfied with my effort in school the last 7 days. Accomplishment Positive
6 I have had a stressful workload the last 7 days. Workload Negative
7 How often have you felt bad in the last 7 days? Depression Negative
8 How often have you been worried in the last 7 days? Anxiety Negative
Table 3.1: The 8 well-being survey questions which are indicators of 8

categories of well-being according to PERMA model.
Data | 11
The questions have 5 possible answers, ranging from ”not at all” to ”very
often”, and the answers are assigned ”semantic points” according to whether
they are weighted positively or negatively.
Answer Weight Semantic points

1 Not at all Positive 0
2 Not often Positive 25
3 Sometimes Positive 50
4 Often Positive 75
5 Very often Positive 100
6 Not at all Negative 100
7 Not often Negative 75
8 Sometimes Negative 50
9 Often Negative 25
10 Very often Negative 0
Table 3.2: The answer choices given for each well-being question, and
semantic points assigned by positive/negative weight.
The resulting dataset is a time series of well-being indicators, which is grouped

by school, school type (primary or secondary) and school group (class). Not
all students answered the survey each week, resulting in a varying amount of
data points for each student. The gender of the student is also indicated. A
summary of the data points is shown below in table 3.3.
Number of Number of Number of

School
survey weeks students data points
Primary 1 8 89 392
Primary 2 16 97 1015
Primary 3 14 378 1249
Secondary 1 13 354 868
Secondary 2 29 226 3223
Secondary 3 20 395 1542
Secondary 4 21 631 2878
Secondary 5 6 55 74
Table 3.3: A summary of the number of surveys answered, students, and data
points per school.
Alongside the well-being data, performance indicators for students from one
secondary school are included in the dataset. The indicator is a binary flag set
12 | Data
by the teacher, where 1 indicates a student who is at risk of failing a course.

The performance indicator is available for a total of 226 students, where 47 of
those students were at risk of failing.
3.2 Imbalanced datasets

Imbalanced data occurs when the classification dataset class has a skewed
proportion, meaning that the labels of classes are not represented equally. This
is the case for the performance indicator data presented in this paper, since the
number of students who are at risk of failing is much lower than the number of
students who are not, as shown in table 3.4. The number of students identifying
with gender ”other” is also much lower than the number of students identifying
as male or female. The learning process of most classification algorithms are
often biased toward the majority class examples, so that minority ones are not
well modeled into the final system [35].
Performance indicator Gender Num. students Total

Male 79
Pass Female 97 179
Other 3
Male 26
Fail Female 19 47
Other 2
Table 3.4: The number of students by gender who are at risk of failing (”Fail”)
versus those who are not (”Pass”).
This bias problem can be addressed by changing the distribution of the data
with under-sampling or over-sampling. Under-sampling means that data from
the major classifier is randomly deleted until the number of data points in
each class are matching. However, deleting data implies fewer data points to
feed the algorithm with. Instead oversampling (increasing the number of data
points for the minority classifier) is preferred. An approach to oversampling is
discussed in 5.3.2. The accuracy of a classifier is the total number of correct
predictions by the classifier divided by the total number of predictions. This
may be good enough for a well-balanced class but not ideal for the imbalanced
class problem. Choosing a proper evaluation metric is therefore important as
is discussed in 5.3.4.
Data | 13
3.3 Ethical data concerns

In order to protect students’ privacy, the well-being data was anonymised by
EdAider before receiving the raw data, so that no names or identity numbers
were included. Despite this, the gathering, processing, storage and analysis of
student data entails making a series of several decisions that involve ethical,
legal and moral considerations [36]. Thus the data ethics approach for the
thesis is based on the article by Sclater (2016) [31] and guidelines according
to GDPR [30]. In the article by Sclater, eight areas of ethics concerns are
mentioned regarding data. They are responsibility, transparency, consent,
privacy, validity, access, minimising adverse impacts and stewardship. These
values were considered during the research and are further evaluated in section
7.3.
14 | Machine Learning Theory
Chapter 4
Machine Learning Theory
Figure 4.1 is a flow chart which illustrates the method of the thesis. The
steps contained within the red rectangle correspond to the second research
question outlined in section 1.3; to design a classification model using ML
methods to predict student grades based on well-being data, and to validate
the model against actual performance data provided by the schools. In order
to implement this model, theoretical ML techniques were used. In this chapter,
the theoretical techniques are presented and explained in more detail.
Figure 4.1: Flow chart of data collection process and method of the thesis.
The ML techniques required to complete the steps inside the red rectangle are
explained in this chapter.
Machine Learning Theory | 15
4.1 Classification models

To build the performance prediction model, a ML technique called
classification was used. Classification is defined as a process of recognition
and grouping of objects into predetermined categories or classes [37]. This
involves taking input data (in this case the well-being data) and feeding it into
an algorithm. The output from the algorithm is the predicted label, which in
this case is whether or not a student is at risk of failing. The basis of the
performance measure of a classification algorithm is to analyse how many
classifications are done correctly. According to a literature review by López-
Zambrano et al., (2021) [13], the best performing classification algorithms
for predicting academic performance were Decision Trees, Random Forest,
Support Vector Machine and Naive Bayes. However, the accuracy is highly
dependent on the size of the dataset and the features used for prediction.
4.1.1 Logistic regression

A suitable model in the classification setting is the Logistic regression model,
which is widely preferred among researchers because of its simplicity and
transparency [38]. Logistic regression was chosen for this project due to the
importance of building transparent models so that teachers and schools can
understand why a student might be at risk of failing. In the often-present case
where there is more than one explanatory variable, the model is called multiple
logistic regression. James et al., [37] give the multiple logistic regression
model by:
eβ0 +β1 X1 +...+βp Xp
p(Yi |Xi , ..., Xp ) = (4.1)
1 + eβ0 +β1 X1 +...+βp Xp
where X = (X1 , ..., Xp ) are p explanatory variables used to predict the
response variable Y. This expression can also be rewritten as
p(Yi |Xi , ..., Xp )

log = β0 + β1 X1 + ... + βp Xp (4.2)
1 − p(Yi |Xi , ..., Xp )
where the left hand side of the equation is called the logit function of p. From
the logit function we see that logistic regression is based on a linear model, and
therefore can be explained in terms of the coefficients of the input variables.
This feature is useful to understand the relationship between the input variables
and the output variable.
The question is then to decide at what probability an observation should be

classified as positive or negative in order to correctly classify new observations
which the model has not been trained on. The number of mis-classifications
on test data, called the test error rate, is minimized on average by assigning
each observation to its most likely class, conditioned on the values of the
explanatory variables [38]. This classifier is called Bayes classifier. In
other words, a test observation with explanatory vectors (x1 , ..., xp ) should
be assigned to the class j for which p(Y = j|X1 = xi , ..., Xp = xp ) is the
largest.
Because of the linear, noncomplex decision boundaries, logistic regression is
known to be less prone to overfitting than other classifiers, especially when the
number of input variables is relatively small [39]. This means that it can be
more robust when dealing with small datasets, or noisy or incomplete data.
Regularization techniques, such as L1 and L2 regularization mentioned in
4.4.1, can help to improve the generalization performance of the model.
4.1.2 Decision trees

A decision tree classifier is a tree-based technique in which any path beginning
from the root is described by a data separating sequence until a Boolean
outcome at the final node is achieved [40]. It is a hierarchical representation
of knowledge relationships that contain nodes and connections. The most
significant feature of decision trees is their ability to change the complicated
decision making problems into simple processes, thus finding a solution which
is understandable and easier to visualize and interpret [41].
The decision tree is an iterative process that sorts data points into different
categories. The process is binary, starting at the primary (parent) node with the
value of an attribute as a threshold and splits the data points into child nodes.
The method for evaluating the classifier strength of a node (n) is the Gini index
(G). Gini index determines the purity of a specific class after splitting along
a particular attribute. The best split increases the purity of the sets resulting
from the split. If L is a dataset with j different class labels, the Gini measure
is defined [42] as
∑j
G(L) = 1 − pi 2 (4.3)
1
where pi is the probability of event i in L. In this way the algorithm decides

on which splits that are most efficient to use [42].
Using decision tree models as a classification method has the following

advantages [43]:
• Simplifies complex relationships between input variables and target

variables by dividing original input variables into significant subgroups.
• Easy to understand and interpret.
• Non-parametric approach without distributional assumptions. This
can be an advantage when dealing with complex or unknown data
distributions.
• Easy to handle missing values without needing to resort to imputation.
• Easy to handle heavy skewed data without needing to resort to data
transformation.
• Robust to outliers and noise in the data.
One big advantage of decision tree learning over other learning methods such
as logistic regression is that it can capture more complex decision boundaries.
Decision tree learning is suitable for datasets that are not linearly separable—
there exists no hyperplane that separates out examples of two different classes
[39]. The main limitation of decision tree models is that they can be subject
to overfitting and underfitting, particularly when using a small data set [43].
4.1.3 Random forest

Random Forest is an algorithm built upon the understanding of Decision
Trees. The strength of Random Forest in comparison with Decision Tree
is that a Random Forest uses multiple trees to make a prediction, therefore
reducing overfitting [44]. The idea of constructing multiple trees is based on
broader umbrella of machine learning techniques known as bagging. Bagging
techniques are especially geared for tackling overfitting. A new feature vector
is classified differently by different decision trees in the forest [39]. These
individual classifications are aggregated to output the final classification as
illustrated in algorithm 1 from Hastie et al., [44].
4.2 Resampling
Machine learning models for classification problems are built to handle
problems with a relative equal number of observations in each class [45].
Algorithm 1 Random Forest algorithm
1. For b = 1 to B:
(a) Draw a bootstrap sample Z∗ of size N from the training data.
(b) Grow a random-forest tree Tb to the bootstrapped data, by
recursively repeating the following steps for each terminal node
of the tree, until the minimum node size nmin is reached.
i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two daughter nodes.
2. Output the ensemble of trees {Tb }B
1 variables
To make a prediction at a∑ new point x:

ˆ
Regression: frf (x) = B B
B 1
b=1 Tb (x)
Classification: Let Ĉb (x) be the class prediction of the bth random forest tree.
B
Then Ĉrf (x) = majority vote {Ĉb (x)}B 1
As explained in section 3.2, the dataset used for this thesis is imbalanced.
When the observations in each class are imbalanced, the problem can be
tackled using a resampling method. Two basic techniques are oversampling
and undersampling. Oversampling duplicates samples from the minority
class to increase the number of minority class samples while keeping the
majority class untouched. Under-sampling works in the opposite way, samples
from the majority class are removed from the dataset while the minority
class is untouched. Both methods changes the prior knowledge of the class
distributions during the training phase while the original distributions are kept
in the testing phase. Oversampling can help improve the classification of the
minority class by providing more instances for the model to learn from [46].
However, it can also introduce the risk of overfitting if synthetic instances are
not properly generated or if the minority class is excessively over-represented.
Undersampling can help reduce the dominance of the majority class and allow
the model to pay more attention to the minority class. However, it may result
in the loss of potentially useful information from the majority class.
Due to the small size of the dataset used for this thesis, oversampling is the
most suitable method. The oversampling methods considered were:
1. Random Oversampling: This method involves randomly duplicating

instances from the minority class until it is balanced with the majority
class. While simple to implement, it can potentially lead to overfitting
due to the exact replication of data.
2. SMOTE (Synthetic Minority Over-sampling Technique): SMOTE

creates synthetic instances for the minority class by interpolating
between existing instances. It selects two or more similar instances and
generates new instances along the line connecting them in the feature
space. This helps to introduce more diversity into the minority class
and mitigate overfitting.
3. ADASYN (Adaptive Synthetic Sampling): ADASYN is an extension of

SMOTE that introduces more synthetic samples in regions of the feature
space where the class imbalance is higher. This helps to address the
issue of generating too many synthetic samples in dense regions of the
minority class.
4.3 Time series feature extraction

A time series is a sequence of observations taken sequentially in time [47].
When modelling time series for classification, it is usually assumed that we
have observations recorded at regular time intervals. In this case, however,
the dataset consists of an irregular time-series since the students answered the
survey at irregular time intervals. Feature extraction controls selecting the
important and useful features, by eliminating redundant features and noise
from the system, to yield the best predicted output. For classification and
regression tasks the significance of extracted features is of high importance,
because too many irrelevant features will impair the ability of the algorithm to
generalize beyond the train set and result in overfitting.
In order to use time series data as input to supervised learning problems, one
can choose a set of significant data points from each time series as elements of
feature vector. However, it is much more efficient and effective to characterize
the time series with respect to the distribution of data points, correlation
properties, entropy, min, max, average, percentile and other mathematical
derivations [48]. An empirical evaluation of time-series feature sets [49]
indicates that the three most popular python libraries for time-series feature
extraction are tsfresh [50], TSFEL [51], and Kats [52]. The study indicates
that tsfresh contains many unique time-series features that are not present in
the other feature sets.
4.4 Feature selection

As well as predicting the academic performance of a student, it is important to
know which of well-being factors are most influential to the result. This can
be done by using feature selection, which is a process that involves selecting
a subset of relevant features from the original set of features in a dataset. The
goal is to identify and retain the most informative and influential features while
discarding irrelevant or redundant ones. Feature selection helps to improve
model performance, reduce overfitting, enhance interpretability, and speed up
the training process by reducing the dimensionality of the data. Various feature
selection methods were chosen based on the models described in section 4.1.
4.4.1 Embedded methods

These methods incorporate feature selection within the model training process
itself [53]. They automatically select relevant features during model training.
L1 and L2 regularization are methods which can be embedded within the the
logistic regression model.
1. LASSO (Least Absolute Shrinkage and Selection Operator): L1

regularization (LASSO) adds the sum of the absolute values of the
coefficients as a penalty term to the loss function. The effect of L1
regularization is that it encourages sparsity in the coefficient values,
effectively shrinking some coefficients to zero. As a result, L1
regularization can drive the model to select a subset of the most relevant
features, discarding the less important ones. Features with non-zero
coefficients after regularization are considered selected.
2. Ridge Regression: L2 regularization adds the sum of the squared values
of the coefficients as a penalty term to the loss function. It penalizes
large coefficients, but doesn’t directly force them to zero. Instead, it
shrinks the coefficients towards zero while still keeping all features in
the model. L2 regularization can help reduce the impact of irrelevant or
correlated features by reducing their coefficient values.
4.4.2 Wrapper methods

Wrapper methods evaluate feature subsets by training and testing a specific
machine learning algorithm that we are trying to fit on a given dataset. They
use the performance of the machine learning model on different feature subsets
to determine which features are most informative. These methods can be
computationally expensive since they involve training multiple models on
different feature combinations.
Recursive Feature Elimination (RFE) is an automated feature selection method

that belongs to the group of wrapper methods. RFE recursively eliminates
variables based on their importance to the model. Starting with the full feature
space, the model is trained and evaluated on the data set and a ranking criterion
is calculated for each feature. The feature with the smallest ranking is then
removed and the process proceeds until the model only includes a predefined
number of features. The importance criterion is dependant on the external
estimator used by RFE, and for instance when using logistic regression the
score will depend on the importance of the coefficient estimates [54].
4.4.3 Dimensionality reduction techniques

These methods transform the original features into a lower-dimensional space
while preserving the essential information. One example of dimensionality
reduction explored in this paper is Principal Component Analysis (PCA),
which involves projecting the data onto orthogonal components to capture
the maximum variance. PCA is a method aiming to find a low-dimensional
representation of the data set that contains as much of the variation in the data
as possible [55]. To do this, PCA find the principal components of the data,
that is the directions along which the observations carry the most variance.
Linear Discriminant Analysis (LDA) is another dimensionality reduction

technique which finds linear combinations of features that maximize class
separability [56].
22 | Methods
Chapter 5
Methods
This chapter describes the steps for data pre-processing in 5.1 and for data
analysis in 5.2. The method for building the performance prediction model is
outlined in 5.3 and the evaluation metrics for the model are in 5.3.4. Finally,
the method for Value Sensitive Design is presented in 5.5. Figure 5.2 is a
flow chart which illustrates the method of the thesis. The green rectangles
correspond to the three research questions outlined in section 1.3.
Figure 5.1: Flow chart of data collection process and method of the thesis.
5.1 Data pre-processing

The data from EdAider was collected in a PostgreSQL database dump. The
first major step involved restoring the database and finding the tables and fields
relevant to this project. Data generated by test accounts was removed from
Methods | 23
the dataset. The data was reshaped to have each 8 well-being factors as a
dimension. A dimension called ”average well-being” was created which is
the mean of the semantic points of the 8 well-being factors. In total EdAider
had data from 17 schools, but many of them had very data few data points, so
schools which had carried out less than 10 well-being surveys were excluded
in the data pre-processing step. This resulted in a dataset with 8 schools (3
primary and 5 secondary), as described in section 3.1.
5.2 Data analysis and visualisation

As per section 1.3, one of the primary research goals of this thesis was to
generate data insights to further understand patterns in the student well-being
data. In order to do this, the stakeholders (EdAider and teachers) provided
some relevant business questions:
1. How does reported well-being differ by gender?

2. How does reported well-being differ amongst primary/secondary school
students?
3. How do the various well-being factors relate to each other?
In order to answer these questions, Pandas and Seaborn python libraries were
used in order to create visualisations of the well-being data; the results of
which are presented in section 6.1.
5.3 Model for performance prediction

The second research goal was to design a ML model to predict whether a
student is at risk of failing based on well-being data, and to validate the model
against actual performance data provided by the schools.
5.3.1 Feature extraction

The first step in building the model was to extract features from the time series
data. In order to characterise the time series – we can use information about
the series of answers for each student such as min, max, average and so on.
When modelling time series for classification, it is usually assumed that we
have observations recorded at regular time intervals. In this case, however,
the dataset consists of an irregular time-series. This occurs since the students
24 | Methods
answered the well-being surveys at irregular intervals. Students from certain

schools began using the well-being platform earlier than others, and answered
the survey more regularly, resulting in varying amount of data points.
Figure 5.2: Frequency of surveys answered by a subset of students, illustrating

the irregularity of the dataset.
Two approaches used for feature extraction from the time series in this paper
are (1) Tsfresh and (2) Custom features.
5.3.1.1 Tsfresh features
The Python package tsfresh (Time Series FeatuRe Extraction on basis of

Scalable Hypothesis tests) is specifically designed to automate the process
of feature extraction from time series data. Tsfresh accelerates the feature
extraction process by combining 63 time series characterization methods,
which by default compute a total of 794 time series features [50]. The features
capture various aspects of the time series, including statistical measures and
patterns.
Tsfresh works as follows:
• Pre-processing: time series data is pre-processed to make it suitable

for feature extraction. Common pre-processing techniques include
smoothing the data and removing outliers.
• Feature Extraction: tsfresh uses a set of feature extraction functions to
extract a large number of features from the time series data. These
features can be statistical measures such as mean, standard deviation,
Methods | 25
and skewness, or more complex features such as the number of peaks in

the time series or the frequency of specific patterns.
• Feature Filtering: the extracted features are filtered to remove irrelevant
or redundant features. This is done using a hypothesis testing framework
that tests whether each feature is statistically significant and adds it to the
final set of features only if it passes the significance tests. The hypothesis
testing framework used in tsfresh is based on the Kolmogorov-Smirnov
test.
• Finally, the features are scaled to have zero mean and unit variance to
ensure that they are all on a comparable scale.
Care needs be taken when using tsfresh with highly irregular time series.
Tsfresh uses timestamps only to order observations, while many features are
interval-agnostic (e.g., number of peaks) and can be determined for any series,
some other features (e.g., linear trend) assume equal spacing in time, and
should be used with care when this assumption is not met [57]. The results
showed that tsfresh was not suitable for this dataset as the dataset is small,
and also irregular. The number of extracted features from tsfresh was huge
and exceeded the number of data points which causes the model to overfit.
In cases such as this one where the dataset is small and the number of time
series samples is irregular, manual feature engineering techniques are more
suitable. For this reason, a separate dataset with custom features to cater for
the irregularity of the time series was also created. This approach allows for
more control over the feature selection process, focusing only on the most
relevant and informative aspects of the time series data.
5.3.1.2 Custom features
The following features were created for each of the 8 well-being factors:
1. Mean: the mean semantic points of each well-being factor (e.g.

workload) for each student for the whole period that surveys were
answered.
2. Median: the median semantic points of each well-being factor (e.g.
workload) for each student for the whole period that surveys were
answered.
3. Min: the minimum semantic points answered for a given well-being
factor under the whole period.
4. Max: the maximum semantic points answered for a given well-being
factor under the whole period.
26 | Methods
5. Standard deviation: the standard deviation of answers from the mean

answer for a given well-being factor.
In addition, features were created for the number of surveys answered by each
student and the number of days between the first and last survey answered.
This resulted in a dataset with 42 custom features.
To summarise, before the feature extraction process the dataset consisted of

a time series of data points for each student. After feature extraction, the
dataset consists of one row (data point) per student with the well-being data
characterised into features and a performance indicator which is the label. The
total number of data points (rows) included after feature extraction is 226, ,
whereby 47 rows are students who are at risk of failing, and 179 are those who
are not at risk of failing.
5.3.2 Synthetic Minority Oversampling Technique (SMOTE)

The next step in building the model was to deal with the imbalanced data
problem, since the dataset includes few data samples for those students who are
at risk of failing. Both SMOTE and ADASYN were tested for this purpose.
SMOTE is straightforward and worked well in this case when the minority
class forms dense clusters. ADASYN is more advantageous when the class
imbalance is variable across the feature space or when dealing with sparse
distributions and noisy data. SMOTE was proposed by Chawla et al., [58] in
2002 to generate synthetic observations for the minority class of imbalanced
datasets. The algorithm works by utilizing a k-nearest neighbour algorithm to
create synthetic data. SMOTE first starts by choosing random data from the
minority class, then k-nearest neighbors from the data are set. Synthetic data
would then be made between the random data and the randomly selected k-
nearest neighbor. The number of generated synthetic observations is specified
prior to the procedure and should reflect the degree of imbalance. SMOTE was
implemented using the imblearn.over_sampling package from python. Table
5.1 shows the distribution of data points (one per student) before and after
applying the SMOTE algorithm.
5.3.3 Model selection

Prior to modelling, the data set is split into train, test and validation subsets
using scikit-learn model_selection.train_test_split, where the
Methods | 27
Number of data points Number of data points

Performance indicator
(before SMOTE) (after SMOTE)
Fail 47 179
Pass 179 179
Total 226 358
Table 5.1: The number of data points for students at risk of failing (”Fail”)
versus those who are not (”Pass”) before and after applying SMOTE.
training data includes 80% of the original data and the test and validation data
both include 10% each. The models are then trained on the training data,
classification threshold is decided on the test data and lastly the models are
validated on the validation data.
Logistic regression was chosen to model the data due to its high interpretability
and since the features easily can be analyzed, where the produced regression
coefficient estimates indicate if the relation between the predictor and
target variable is positive or negative. Random forest was chosen as a
comparative model. The ensemble nature of Random Forests helps to
average out individual decision tree biases and reduces overfitting. By
combining multiple trees, the ensemble model is better equipped to capture
complex relationships in the data while avoiding the pitfalls of memorising
the training data. In the process of selecting models, several models
were evaluated besides logistic regression and random forest, such as
decision tree, support vector machine (SVM), K-Nearest Neighbors (KNN)
and Gradient Boosting. However, these models were disregarded since
they did not improve the performance of the model. The models were
implemented using scikit-learn ensemble.RandomForestClassifier
and linear_model.LogisticRegression.
Feature selection is made with the purpose of creating an improved predictive

model using only significant features, as well as to increase the interpretability
of the model and reduce overfitting. Recursive feature elimination(RFE),
implemented using scikit-learn, was tested for both logistic regression and
random forest models using the custom features dataset. L1 and L2
regularization were also implemented for the logistic regression model. Due to
the large number of features generated by Tsfresh (7047 features) - the chosen
feature selection model for this dataset was PCA. Both RFE and regularization
are feature selection methods, that is, they select and exclude features from
28 | Methods
the model without changing them. As opposed to this, principal component

analysis (PCA) is a dimensionality reduction method that transforms features
into a lower dimension, and hence interpretability of single features is lost
when using PCA. In this case, the interpretability of the features is important
in order to ensure transparancy for teachers and students. However, PCA is
investigated nonetheless in order to investigate if the method can increase
predictive power of the model.
The evaluation metrics scored for each model are precision, recall and the
area under ROC. In general, a frequently used metric for binary classification
problems is accuracy, given by the proportion of true results to all results.
Considering that the data is imbalanced, accuracy will not be presented for the
models since it can produce misleading results. For instance, if only 5% of
the data set belongs the target class (students at risk of failing) while 95% of
the data set belongs to the other class (students who are not a risk of failing),
a naive approach to classify every sample to the majority class would yield
an accuracy on 95%, which would be considered very good, yet not accurate.
Accuracy places a larger weight on the majority class, making it harder to
produce good prediction accuracy on the minority class, which is why other
metrics are evaluated instead.
5.3.4 Evaluation metrics

Prediction accuracy of a classification model is its ability to correctly predict
the class label of unseen data. The common metrics for measuring the
accuracy of classification models are confusion matrix, overall accuracy, AUC
and ROC curves, per-class accuracy, recall and precision [59]. In this section,
the metrics used to evaluate the performance prediction model outputs are
presented. The evaluation metrics have been selected to reflect the ability
of a classifier to classify the minority observations correctly as well as the
classifiers overall classification performance.
5.3.4.1 Accuracy
Accuracy simply measures how often the classifier makes the correct
prediction. It’s the ratio between the number of correct performance
predictions and the total number of predictions (the number of data points
Methods | 29
(students) in the test set):
# correct predictions
accuracy = (5.1)
# total data points
5.3.4.2 Confusion matrix
Accuracy does not make a distinction between classes (at risk of failing/not at
risk of failing). This can be problematic when the risks of misclassification
differ for the two classes. In this case, it may be considered more important
to correctly classify students who are at risk of failing, compared with those
who are not. A confusion matrix [60] shows a more detailed breakdown of
correct and incorrect classifications for each class, as shown in table 5.2. The
students who are at risk of failing are labelled ”positive”, and the class label
of students who are not at risk of failing is ”negative”.
Actual values
Positive Negative
Positive True Positive (TP) False Negative (FN)
Predicted values
Negative False Positive (FP) True Negative (TN)
Table 5.2: Confusion matrix for the binary classification problem.
The groups outlined by the confusion matrix are calculated as follows [60]:
1. True Positive (TP) - the positive examples (students at risk of failing)
classified correctly by the model
2. True Negative (TN) - the negative examples (students not at risk of
failing) classified correctly by the model.
3. False Positive (FP) - the negative values classified as positive. This
scenario is known as a Type 1 Error
4. False Negative (FN) - the positive values classified as negative. This
scenario is well known as a Type 2 Error
5.3.4.3 Precision and Recall
Precision is the ratio of correct positive predictions to the overall number of

positive predictions:
TP
P recision = (5.2)
TP + FP
30 | Methods
Recall (also known as sensitivity) is the ratio of correct positive predictions to

the overall number of positive examples in the test set:
TP
Recall = (5.3)
TP + FN
Using recall as an evaluation metric is suitable when dealing with imbalanced

data, since it measures how well the model can correctly classify observations
belonging to the minority group [61]. While recall measures the degree
to which a model correctly classifies all true positives correctly, precision
measures the degree to which a model correctly classifies positives in relation
to all observations it classifies as positives.
5.3.4.4 Area under the ROC curve (AUC)
The ROC curve is another commonly used method to assess the performance of
classification models [60]. The ROC curve is a graph that visualizes the trade-
off between True Positive Rate and False Positive Rate. For each threshold,
the True Positive Rate and False Positive Rate are calculated and plotted on
one graph. The higher the True Positive Rate and the lower the False Positive
Rate for each threshold, the better. Better classifiers have more curves on the
left. The area below the ROC curve is called the ROC AUC score, a number
that determines how good the ROC curve is. The ROC AUC Score shows how
many correct positive classifications can be gained as you allow for more and
more false positives. . A higher ROC AUC score (closer to 1) indicates that
the model has better predictive power and can effectively separate the classes
5.4 Cross-validation
When the dataset is small, removing a part of it for validation poses a problem
of underfitting. By reducing the size of the training data, there is a risk
of removing important patterns in the dataset. To tackle this problem, in
classification it is common practice to use k-fold cross-validation [62]. K
was set to 20 for model validation in this project. The dataset is randomly
partitioned into k subsets of approximately equal size. These subsets are often
referred to as ”folds.” The cross-validation process then involves k iterations.
In each iteration, one of the k folds is used as the validation set, while the
remaining k-1 folds are combined to form the training set. The model is
Methods | 31
trained on the training set and evaluated on the validation set. This process
is repeated k times, with each fold serving as the validation set once. At the
end of the k iterations, the evaluation metrics described in section 5.3.4 are
computed for each iteration. The performance metrics from all k iterations are
then aggregated to provide an overall assessment of the model’s performance.
Averaging the errors yields an overall error measure that typically will be more
robust than single measures [62].
5.5 Value Sensitive Design

In order to ethically evaluate the results of the data analysis and predictive
model, the suggested framework is Value Sensitive Design. Value Sensitive
Design (VSD) is a methodology for ethical design from the field of Human-
Computer Interaction. VSD has been shown to be an applicable approach in
the design and development of LA, for example by Chen and Zu [32] who
introduced two cases of applying VSD to support ethical considerations in LA
design. The researchers found that the following values supported by the LA
tool can be in tension with other values: autonomy, utility, ease of information
seeking, student success, accountability, engagement, usability, privacy, social
well-being (in the sense of belonging and social inclusion), cognitive overload,
pedagogical decisions, freedom from bias, fairness, self-image, and sense of
community. An approach to VSD suggested by Friedman et al [6] which can
be applied to LA technology design includes the following steps:
1. Identify direct and indirect stakeholders of the technology

2. Identify benefits and harms for each stakeholder
3. Map the benefits and harms onto their corresponding values
4. Identify potential value conflicts
5. Formulate suggestions for design improvements
Sclater [31] developed a Code of Practice for Learning Analytics, which aims
to set out the responsibilities of educational institutions to ensure that LA
is carried out responsibly, appropriately and effectively, addressing the key
legal, ethical and logistical issues which are likely to arise. They grouped
legal and ethical concerns for LA into eight headings: 1) responsibility,
2) transparency and consent, 3) privacy, 4) validity, 5) access, 6) enabling
positive interventions, 7) minimizing adverse impacts, and 8) stewardship of
data. These values are relevant to EdAider’s well-being platform and will be
discussed in the context of Value Sensitive Design in the results section 6.3.
32 | Methods
Results | 33
Chapter 6
Results
In this chapter, the results of the data analysis in 6.1 are presented in section
6.1, and the results of the performance prediction model are presented in
section 6.2. The results of the ethical evaluation using Value Sensitive
Design are outlined in 6.3. These three sections of the results chapter are
corresponding to the three research questions.
6.1 Major data insights
6.1.1 The impact of gender and age on reported well-

being
This section answers the question: how does reported well-being differ
between genders and ages? Both overall well-being, and individual well-being
factors are considered. Table 6.1 shows the results for primary school students
and table 6.2 shows the results for secondary school students.
34 | Results
Male Female Other

Number of students 279 281 4
Accomplishment 73.6 71.1 27.9
Engagement 66.8 63.6 30.8
Positive emotion 75.5 70.9 33.7
Relationships teacher 75.6 73 47.1
Relationships peers 71.4 67.2 35.6
Anxiety 82.9 78.4 36.5
Depression 81.1 75.9 32.7
Workload 70.7 69.1 29.8
Overall well-being 74.7 71.1 34.3
Table 6.1: Primary school: Number of students and mean well-being factor
by gender.
Male Female Other
Number of students 1042 539 80
Accomplishment 53.4 52.1 41.4
Engagement 49.0 45.7 38.3
Relationships teacher 60.8 62.9 54.0
Relationships peers 65.0 68.7 59.3
Anxiety 65.9 60.7 49.5
Depression 71.4 66.7 52.6
Workload 52.0 49.0 49.3
Overall well-being 60.2 58.4 49.5
Table 6.2: Secondary school: Number of students and mean well-being factor
by gender.
Figure 6.1: Average well-being by school and gender. The black line indicates
standard deviation.
Results | 35
The data shows that on average, male primary school students report higher
well-being than females across all well-being factors. This applies also to
secondary school students, with the exception of relationships where females
report higher well-being than males. Students with gender ”other” report
a considerably lower level of well-being across all 8 well-being factors.
However, it is difficult to deduce a meaningful conclusion about primary
school students with gender ”other” due to the limited number of data points.
Broken down by school, we see that there are 3 schools out of 8 where females
tended to report slightly higher well-being than males.
On average, primary schools report higher well-being than the older secondary
school students. This is with the exception of one primary school, which
reports lower well-being than all other schools. This school had fewer data
points than the other primary schools so did not have a significant effect on the
average well-being for primary schools. One possible reason for the difference
is that the primary school with lower well-being has older students than the
other 2 primary schools (grade 7-9 vs. grade 4-6).
6.1.2 Relationship between well-being categories

This section discusses how the various well-being dimensions relate to each
other. Correlation between the well-being dimensions was calculated using
pandas.DataFrame.corr and represented in a heatmap. As shown
in figure 6.2 primary school students reported anxiety/depression as the
most closely correlated dimensions, followed by engagement/accomplishment
and positive emotion/depression. This applies also to secondary school
students as shown in figure 6.3, where an even higher correlation between
anxiety/depression can be seen. The least correlated well-being dimensions
are reported workload and relationships to peers and teachers. Relationships
to peers are also weakly correlated to anxiety. This applies to both primary
and secondary school students.
36 | Results
Figure 6.2: Primary school: correlation between well-being variables
Figure 6.3: Secondary school: correlation between well-being variables

Results | 37
6.1.3 Relationship between well-being and perfor-

mance
Table 6.3 shows the the mean well-being for students who are at risk of
failing versus those who are not at risk of failing. We see that the ”at-risk”
students report lower mean well-being for all well-being variables. The most
statistically significant factors are accomplishment, relationship to peers, and
anxiety.
Pass Fail P-value

Number of students 179 47
Accomplishment 57.2 50.5 <0.001
Engagement 69.2 65.3 0.008
Relationships teacher 66.4 66.0 0.6
Relationships peers 71.2 67.2 <0.001
Anxiety 69.2 65.3 <0.001
Depression 74.9 72.0 0.008
Workload 51.3 50.7 0.6
Wellbeing 63.6 60.7 <0.001
Table 6.3: Mean well-being for students who are at risk of failing (Fail) vs
those who are not at risk of failing (Pass).
The next section show the results from the model created to predict whether a
student is at risk of failing based on the feedback from their well-being surveys.
6.2 Performance prediction model

Results from the logistic regression are presented in subsection 6.2.1 and
random forest models are presented in subsection 6.2.2. Both models were
tested with features extracted from Tsfresh and with custom features. RFE
and regularization were tested as a feature selection methods for the custom
features, and PCA was used as a feature selection method for the Tsfresh
features.
6.2.1 Logistic regression

This section includes all results for logistic regression. The tables include the
number of features after feature selection, along with the precision, recall and
38 | Results
ROC AUC scores. The mean accuracy and standard deviation from the mean
accuracy after 20-fold cross-validation are also included.
As described in Section 5.3.2, SMOTE was the favoured algorithm for

handling the imbalanced data problem. Table 6.4 shows the results of the
logistic regression model on the custom features dataset before and after
applying SMOTE. Precision and recall are greatly improved after over-
sampling for the minority class (students at risk of failing).
Precision Recall ROC AUC Accuracy Std

Before SMOTE 0.333 0.149 0.51 0.756 0.099
After SMOTE 0.747 0.799 0.8 0.765 0.084
Table 6.4: Logistic regression model trained on custom features
The remainder of the results presented have therefore had SMOTE oversam-
pling applied on the dataset before using it as input to the model. Table
6.5 shows the results of the logistic regression model trained on the custom
features dataset with where each row shows a different method of feature
selection. The first row shows the model with no feature selection, the second
and third show L1 and L2 regularization, and the fourth shows Recursive
Feature Elimination.
Num feat. Precision Recall ROC AUC Acc. Std

No feat. sel. 42 0.747 0.799 0.8 0.765 0.084
L1 reg. 33 0.716 0.799 0.83 0.742 0.115
L2 reg. 17 0.722 0.793 0.83 0.745 0.096
RFE 20 0.765 0.805 0.85 0.780 0.076
Table 6.5: Logistic regression model trained on custom features
Table 6.6 shows the results in the form of a confusion matrix for the logistic
regression model trained on the custom features dataset.
Since the number of features generated by Tsfresh is so large (7047) compared

to the sample size (358), the model is certain to be subject to overfitting before
feature selection. The model has been reduced according to PCA as explained
in Section 4.4.3. In table 6.7 ,the results from a logistic regression using 106
principal components are presented. With this number of features, 90% of
Results | 39
Actual values
Positive Negative
Positive 133 41
Predicted values
Negative 48 126
Table 6.6: Confusion matrix for logistic regression model trained on custom
features dataset
feature variance is preserved. The 5 principal components explaining most

variance are presented in table 6.8, and presented in a scree plot in figure
6.4. After PCA, the dimensionality is reduced to have 106 features, however
prediction accuracy and interpretability is weakened.
Num feat. Precision Recall ROC Acc. Std

No feat. sel. 7047 0.871 0.983 0.97 0.919 0.0699
PCA 106 0.814 0.955 0.89 0.869 0.098
Table 6.7: Logistic regression model trained on Tsfresh features
Principal component Explained variance

1 0.1169
2 0.0775
3 0.0426
4 0.0297
5 0.0271
Table 6.8: Explained variance ratio for first five principal compenents (logistic
regression
40 | Results
Figure 6.4: Scree plot showing proportion of variance explained by each

principal component (logistic regression)
6.2.2 Random Forest

This section includes all results for the random forest model. The table
6.9 includes the number of features after feature selection, along with the
precision, recall and ROC AUC scores. The mean accuracy and standard
deviation from the mean accuracy after 20-fold cross-validation are also
included.
Num features Precision Recall ROC AUC Accuracy Std

No feat. selection 42 0.9 0.879 0.85 0.907 0.092
RFE 20 0.887 0.948 0.86 0.895 0.074
Table 6.9: Random forest model trained on custom features
Table 6.10 shows the results in the form of a confusion matrix for the random
forest model trained on the custom features dataset.
Actual values
Positive Negative
Positive 153 21
Predicted values
Negative 17 157
Table 6.10: Confusion matrix for random forest model trained on custom
features dataset
As presented in Section 6.2.1, PCA was also applied on the random forest
Results | 41
model to reduce dimensionality. Better results are seen for the random
forest, with precision, recall and accuracy scores improving slightly after
dimensionality reduction.
Num features Precision Recall ROC Accuracy Std

No feat. selection 7047 0.944 0.938 0.93 0.947 0.065
PCA 106 0.977 0.939 0.8 0.958 0.063
Table 6.11: Random forest model trained on tsfresh features
6.2.3 Feature importance

In this section, feature importances logistic regression and random forest
models are discussed. Feature importances are analysed only for the custom
feature dataset. The large number of tsfresh features make the interpretation
of the features difficult.
Feature importance for the logistic regression model was measured using both
RFE and the coefficients from the regression as explained in Section 4.4. A
positive coefficient implies that an increase in the value of that predictor is
associated with an increase in the log-odds of the outcome variable taking
on the ”at risk of failing” category. This indicates a positive correlation
between the predictor and the outcome. Conversely, a negative coefficient for
a predictor variable suggests that an increase in the value of that predictor is
associated with a decrease in the log-odds of the outcome variable being in the
”at risk of failing” category. The 5 features with the largest coefficients from
the logistic regression model (shown in figure 6.5) were: anxiety, depression,
accomplishment, relationship to teacher, and positive emotion.
42 | Results
Figure 6.5: Feature importance from logistic regression coefficients (custom

features dataset)
The top 10 features selected by recursive feature elimination on the logistic

regression model were related to number of days between first and last
survey, number of surveys answered, accomplishment, engagement, anxiety.
The top 10 features selected by recursive feature elimination on the random
forest model were related to number of surveys answered, accomplishment,
relationships to peers and teachers, anxiety, depression and workload.
Figure 6.6 shows the feature importances from the Random Forest model
calculated using scikit-learn feature
ensemble.RandomForestClassifier.feature_importances_.
These importances are computed as the mean and standard deviation of
accumulation of the impurity decrease within each tree. For each decision
tree in the random forest, when a feature is used to split a node, the decrease
in impurity (measured by Gini impurity or entropy) is recorded. The more the
impurity decreases due to a split using a particular feature, the more important
that feature is considered to be. The importance scores of features are then
aggregated across all the decision trees in the random forest and normalized
for interpretation. The most important features were: accomplishment,
engagement, workload, number of surveys answered, relationship to peers,
depression, relationship to teacher.
Results | 43
Figure 6.6: Feature importance from random forest model using mean
impurity decrease (custom features dataset)
Table 6.12 shows a comparison of the features selected by each of the

experiments, where Y shows that the feature was selected as important and
N shows that it was not selected as important. The 4 feature selection
methods were 1) Coefficients for logistic regression model 2) Recursive
feature elimination of logistic regression model 3) Mean impurity decrease
for random forest model and 4) Recursive feature elimination on random forest
model. Each of the 4 feature selection methods identified accomplishment and
depression as important features.
Log. reg. Log. reg. Ran.forest Ran.forest

Feature
Coefficients RFE MID RFE
Accomplishment Y Y Y Y
Engagement N Y N N
Positive emotion Y N N N
Relationship teachers Y N Y Y
Relationship peers N N Y Y
Depression Y Y Y Y
Anxiety Y Y N Y
Workload N N Y Y
Num surveys N Y Y Y
Num days N Y N N
Table 6.12: Comparison of features selection methods

44 | Results
6.3 Value Sensitive Design

The identified stakeholders of the data analysis and prediction model are 1) the
students providing the well-being data, 2) schools, 3) teachers and 4) EdAider.
This section uses a Value Sensitive Design approach to outline the benefits and
risks of using Machine Learning to understand students’ well-being for these
stakeholders.
6.3.1 Benefits for the stakeholders

Firstly, the data analysis identified particular groups of students who tend
to have lower well-being. We saw for example that students who identify
by gender ”other” tend to rate their well-being as lower, compared to other
students. We also saw that secondary school students show lower reported
well-being than younger students at primary school. By highlighting these
differences, transparency is created which allows teachers and schools to
intervene in order to improve student well-being. Educators can use the
data insights to develop personalized interventions, strategies, or resources
to address individual well-being needs effectively. This aligns with the ethical
concerns highlighted in Sclater’s Code of Practice [31]; namely transparency
and enabling positive interventions. By monitoring well-being on a continuous
basis, teachers can identify students who need help before it becomes critical,
allowing them to support students proactively rather than re-actively. This
minimizes adverse impacts in the long term.
Secondly, the results of the prediction model show that well-being has a
significant impact on academic performance. By monitoring students’ well-
being, educators can identify factors that may be hindering their academic
progress and implement strategies to improve student engagement, motivation,
and overall performance. Monitoring well-being also goes beyond academic
performance. It encompasses various aspects of students’ lives, including
emotional, social, and physical well-being. By tracking these areas, educators
can support students in their personal growth and development, fostering
resilience, self-esteem, and a positive self-image.
Finally, educational institutions often face resource constraints such as limited

time and staff, which can make it challenging to monitor and support students’
well-being effectively. By implementing an intuitive and user-friendly tool,
Results | 45
teachers can efficiently get insights into student well-being. Data analysis of
well-being data therefore empowers teachers to efficiently allocate their time
and resources, and empowers students by providing access to support from
their teachers when they need it.
6.3.2 Risks for the stakeholders

While monitoring students’ well-being might be considered part of a teacher’s
responsibility and a pedagogical tool to intervene in certain cases, such actions
can also be seen as surveillance systems which are problematic in terms
of threatening students’ privacy. Constant monitoring and assessment may
lead to increased stress or pressure on students, especially if they feel judged
or compared based on their well-being data. Inadequate privacy measures
may result in the exposure of sensitive information, potentially leading to
emotional distress, stigma, or breaches of personal boundaries. Teachers
are also susceptible to surveillance and privacy risks in this context, as they
often feel accountable for the well-being of their students. Consequently, they
may experience increased pressure to deliver positive outcomes, especially if
schools or parents scrutinize and hold them responsible for any declining well-
being trends within a particular class.
There is also a risk that relying solely on well-being data to keep track of
student well-being may lead to a reduction in human interaction and personal
connection with students. Automation complacency occurs when automation
output receives insufficient monitoring and attention, usually because that
output is viewed as reliable [63]. This could lead to a risk that teachers
and schools become over-reliant on the results of the technology and perhaps
even de-skilling in the long-term. The tool should therefore not be seen as
a replacement for resources such as counselling and one-to-one check-ins at
school, but rather as a complementary tool. It is important for educators to
provide holistic support beyond the tool and to be aware of the limitations of
technology in addressing complex well-being issues.
As discussed further in the limitations section, the well-being data was self-
reported, and the grades data was an assessment by the students’ teacher
rather than an official state exam grade, which raises questions about the
validity of the data. The amount of data used for the grade prediction model
was also limited to one secondary school, meaning that the results cannot
46 | Results
be generalised for other schools. When the data analysis and prediction
model uses unreliable or invalid data and measures, it may lead to inaccurate
representations of student well-being, potentially resulting in inappropriate
interventions or missed opportunities to support students in need.
Under-representation of certain groups of students in the well-being data and

limited amounts of data means that the model has very limited capacity when
it comes to grade prediction. For example, there were only grades available for
2 students identifying with gender ”other”, meaning that it was not possible to
make any conclusive remarks about grade predictions for this group. There
is also a small group of students who perform well despite reporting low
levels of well-being. Predictive models may inherit biases from the training
data, potentially perpetuating existing prejudices, disparities or inequalities
in educational opportunities. Careful attention must be given to address
and mitigate biases to ensure fairness and equity in the prediction process.
Reliance on predictive models alone may result in discriminatory actions or
decisions against students from marginalized groups or with unique needs. It
is also possible that drawing conclusions from data based on gender or other
factors may reinforce gender norms, stereotypes, and societal expectations,
and impact how different genders experience and express well-being.
Failure to address these risks of invalid data and algorithmic bias can
undermine the effectiveness of EdAider’s technology, and reduce trust in
the technology in the long-term. Additionally, since the technology requires
regular monitoring, teachers may face an increased workload in terms of data
analysis, interpreting results, and implementing appropriate interventions.
This puts pressure on already strained educational resources.
6.3.3 Value conflicts

The various stakeholders of EdAider’s technology may have conflicting
interests when it comes to using well-being data. An important step in the
Value Sensitive Design process is to identify these potential value conflicts.
When it comes to privacy conflicts, all stakeholders are affected. Conflicts

can occur regarding the ethical boundaries of data collection, storage, and
usage. Ethical concerns may arise if students or parents feel that their privacy
or autonomy rights are being compromised by the extensive monitoring and
Results | 47
analysis of well-being data. Students and parents may be concerned about data
privacy, demanding strict privacy measures and control over how the data is
used. EdAider, on the other hand, may argue that data sharing is necessary
for analysis and improvement of their products. Schools may expect EdAider
to take full responsibility for data security and privacy, while the company
may place some responsibility on schools for proper implementation and data
handling. Schools and teachers may have different expectations regarding how
the collected data is stored, secured, and used. Conflicts may arise when there
are divergent perspectives on data governance and the extent of control schools
should have over the data.
Value conflicts can also arise regarding the validity of the data analysis
and prediction model results. Teachers and schools may have their own
insights and professional judgement when interpreting the analysis or
recommendations generated by the technology, which may conflict with or
challenge the technology’s findings. Teachers may question the validity and
reliability of the technology’s analysis, leading to conflicts regarding the
appropriate weight given to automated assessments versus their professional
judgment. Teachers may prioritize their professional judgement and autonomy
in assessing and supporting student well-being, while EdAider may emphasize
the use of data-driven algorithms and standardized approaches, potentially
limiting teachers’ discretion.
48 | Results
Discussion | 49
Chapter 7
Discussion
7.1 Data insights

The key insights from the data analysis in section 6.1.1 indicate that age and
gender can have a significant impact on reported well-being of students in
Sweden. This finding is in line with other international studies showing similar
results. A 2017 study by Kaye et al. [64] used data from an international survey
of child well-being across countries, to examine differences between 12 year
old boys and girls who reported low subjective well-being. Findings revealed
that girls reported well-being was lower than boys’, and that different domains
of well-being vary in their importance for boys and girls. Specifically, girls’
well-being appears to be more driven by relational factors, whilst boys’ is more
driven by perceived academic achievement. The results from the data analysis
carried out in this paper also showed that secondary school girls reported lower
accomplishment/engagement than males, but reported better relationships to
teachers and peers. Another paper from Yoon et al. [65] also reinforces
the results shown regarding gender and age. This paper investigated gender
difference in the change of adolescents’ mental health and subjective well-
being over 3 year period using multi-level regression. The results showed that
young people are at increased risk of mental health problems between the ages
of 11 and 14, particularly girls. The overall difficulty levels reported by girls
were significantly higher than boys across a range of mental health problems
and subjective well-being. The results also showed that young people show
clear signs of mental distress as they get older, an escalation which was
particularly evident among girls. However, both of the cited papers failed to
consider the well-being of students identifying as non-binary, possibly due to
50 | Discussion
a lack of available data for this group.
Another important finding was the relationship between certain well-being di-
mensions oulined in section 6.1.2. Students reported anxiety/depression as the
most closely correlated dimensions, followed by engagement/accomplishment
and positive emotion/depression. These results align with existing literature
[66] that highlights the interconnected nature of well-being dimensions.
Anxiety and depression often co-occur and can have a reciprocal relationship,
where higher levels of anxiety can contribute to increased levels of depression
and vice versa. The correlation between engagement and accomplishment
suggests that students who feel more engaged in class tend to have a better
sense of accomplishment with their studies. It is important to note that the
correlation analysis provides insights into the statistical relationship between
well-being dimensions but does not establish causal relationships. Further
research is needed to explore the underlying mechanisms and dynamics
between these dimensions to gain a deeper understanding of their interactions.
The correlation analysis of well-being dimensions contributes to a more
comprehensive understanding of student well-being and provides valuable
information for developing targeted interventions and support systems in
educational settings. By understanding the interplay between different well-
being dimensions, educators and school administrators can target specific
areas for improvement.
Lastly, the data analysis resulted in a comparison of mean well-being scores

between students who are at risk of failing (Fail) and those who are not
at risk of failing (Pass) in section 6.1.3. The data shows that students
classified as ”at-risk” report lower mean well-being scores across all well-
being variables compared to their peers who are not at risk of failing. This
finding suggests that there is a clear association between students’ well-being
and their academic performance, particularly when it comes to the risk of
failing in their studies. Among the well-being dimensions, three factors stand
out as being most statistically significant in differentiating between the two
groups of students. These significant factors are accomplishment, relationship
to peers, and anxiety. For each of these factors, the p-values are below the
threshold of 0.001, indicating a strong statistical significance.
1. Accomplishment: Students who are at risk of failing report significantly

lower levels of accomplishment compared to their counterparts who are
not at risk. This suggests that students who struggle academically may
Discussion | 51
also experience a diminished sense of achievement and competence in

their schoolwork.
2. Relationship to Peers: The data reveals that students at risk of failing
have lower well-being scores concerning their relationships with peers.
This finding highlights the importance of social connections and
positive peer interactions in fostering well-being and academic success.
3. Anxiety: Students classified as ”at-risk” report higher levels of anxiety
compared to their peers who are not at risk of failing. Previous research
shows mixed results regarding this phenomenon, depending on how the
anxiety is defined. It is indeed possible that high levels of anxiety
can also be linked with high-performing students and maladaptive
perfectionism [67].
7.2 Performance prediction model

SMOTE
Before applying SMOTE, the logistic regression model exhibited limited

predictive performance for identifying students at risk of failing. The
precision, recall, and ROC AUC scores were quite low, indicating that the
model had difficulty correctly classifying the minority class (students at risk
of failing). Specifically, the precision was only 0.333, indicating that out of
all the instances predicted as ”at risk,” only 33.3% were correctly classified.
The recall, also known as sensitivity or true positive rate, was 0.149, implying
that the model was only able to correctly identify 14.9% of all the actual ”at-
risk” students. The ROC AUC score, which measures the model’s ability to
distinguish between the two classes, was 0.51, suggesting that the model’s
discriminative ability was relatively weak.
However, after applying the SMOTE algorithm to balance the dataset,

significant improvements in the model’s predictive performance were
observed. The precision and recall scores substantially increased, reaching
0.747 and 0.799, respectively. These improvements indicate that the model
became much more effective in correctly identifying and classifying students
at risk of failing. The ROC AUC score also increased to 0.8, indicating a
notable enhancement in the model’s ability to distinguish between the two
classes. Additionally, the standard deviation from the mean accuracy was also
reduced after applying SMOTE, indicating increased stability in the model’s
52 | Discussion
performance.
These results demonstrate the significant impact of the SMOTE algorithm on

the performance of the logistic regression model in dealing with the class
imbalance problem. By generating synthetic samples of the minority class,
SMOTE effectively increased the representation of the ”at-risk” students in
the dataset. This process improved the model’s ability to learn patterns and
relationships from the minority class, leading to enhanced predictive accuracy
for identifying students who may be at risk of failing.
Tsfresh vs custom features
TsFresh is a powerful tool that automates the generation of a large number

of features based on statistical tests and algorithms applied to the time series.
While tsfresh can be highly effective in extracting relevant and informative
features from time series data, there are situations in which its application
may lead to overfitting and less reliable models. This situation is particularly
prevalent in small datasets, where the risk of overfitting is generally higher due
to the limited amount of information available for the model to learn from.
If the number of extracted features becomes comparable to or exceeds the
number of data points, the model may become too complex for the available
information, leading to poor generalization performance.
In cases such as this one where the dataset is small and the number of time
series samples is limited and irregular, simpler feature extraction techniques
or manual feature engineering are more suitable. This approach allows
for more control over the feature selection process, focusing only on the
most relevant and informative aspects of the time series data. Manual
feature extraction techniques in this case included basic statistical measures
(e.g., mean, standard deviation) and domain-specific measures that are less
computationally intensive compared to tsfresh.
PCA
The results from the principal component analysis transformed the Tsfresh
features into a lower dimension. While PCA can be effective in reducing the
dimensionality of the data and capturing its variability, it comes with a trade-
off - the interpretability of single features is lost in the process. This loss of
Discussion | 53
interpretability can be a drawback, especially when transparency is essential

for teachers and students to understand the model’s predictions. However,
PCA was investigated nonetheless in order to investigate if the method could
increase predictive power of the model. The results showed that for the
logistic regression model, performing PCA slightly worsened the predictive
power of the model. This could be due to the loss of information during
the dimensionality reduction process, which may have removed some crucial
details relevant to the prediction task. Consequently, the model’s performance
suffered. On the other hand, the random forest model did not exhibit a
significant difference in predictive power before and after applying PCA.
Random forests are robust and can handle high-dimensional data, making them
less sensitive to the dimensionality reduction effect of PCA. Therefore, the
impact of PCA on the random forest model was not as pronounced as with the
logistic regression model.
Prediction accuracy
For the custom features dataset, the logistic regression model achieved a
ROC AUC score between 0.8-0.85. This means it could distinguish between
students who were at risk of failing and students who were not at risk of
failing in 80-85% of cases. The random forest model achieved a ROC AUC
score of 0.85-0.86. Both models also achieved high precision and recall
scores, indicating their ability to correctly predict positive instances (students
at risk of failing). The ROC AUC scores also show good discrimination
ability of the models. These results mean that reported well-being can be a
reasonably good predictor of whether a student is at risk of failing. However,
several limitations of the model must be taken into consideration. Firstly, it is
difficult to establish the causal relationship between well-being and academic
performance. This model uses well-being data as a predictor of performance,
but it is also very possible that poor performance can negatively impact student
well-being. Secondly, although the model has a good classification accuracy,
the model still wrongly classifies students as being at risk of failing in 15-
20% of cases. The negative impacts of incorrectly classifying a student using
this model must be considered carefully. These risks is discussed further in
Section 7.3.
For the tsfresh features, both the logistic regression and random forest
models achieved higher levels of prediction accuracy compared with the
custom features. However, it is essential to strike a balance between
54 | Discussion
dimensionality, predictive power and interpretability. Comparing the models

trained on custom features and Tsfresh features, we see that custom features
may lead to more interpretable models, while Tsfresh features offer higher
predictive performance but can be challenging to interpret due to their high
dimensionality. This may lead to a loss of interpretability and overfitting in
cases of limited data and extremely high feature dimensionality.
Different feature selection techniques had varying impacts on model

performance. For the logistic regression model trained on the custom features
dataset, RFE achieved the best results. For the random forest model, RFE did
not result in any significant difference. This could be explained by random
forest’s ensemble approach and ability to handle non-linear relationships,
making it more robust to feature selection changes and resulting in a smaller
impact on its performance.
Feature importance
All four methods of examining feature importance consistently identified

three features as important: ”accomplishment,” ”depression,” and ”number
of surveys answered.” This consistency in feature selection across different
techniques adds credibility to the significance of these specific features
in predicting the target variable. These features likely contain valuable
information that directly impacts the academic performance. The fact that the
”number of surveys answered” has high importance indicates that the more
surveys a student has answered, the more information the model gets about
the student, and the predictive power improves. On the other hand, students
who have answered fewer surveys have fewer data points, and there is a higher
risk that that student would be misclassified by the model.
Understanding the influence of these features on student performance can help

in devising effective interventions, developing personalized support systems,
and gaining insights into students’ well-being and academic performance.
However, it is important to keep in mind that while these features consistently
stood out in this study, it is essential to validate the findings in other datasets
and possibly across different educational settings to ensure generalizability.
Feature importance may vary depending on the context and the specific dataset
being used. Nonetheless, this initial finding provides a starting point for
understanding the relationships between the well-being features and student
performance.
Discussion | 55
7.3 Ethical discussion

The results of the data analysis, prediction model, and Value Sensitive Design
study have highlighted several ethical risks that should be carefully addressed
to ensure the responsible and ethical use of such models in educational
settings. This section outlines suggestions for actions addressing these risks,
which are be summarised according to Sclater’s Code of Practice [31]. The 6
points below correspond respectively to the ethical responsibilities highlighted
by Sclater which are privacy, validity, transparency and consent, minimising
adverse impacts, responsibility and enabling positive interventions.
• Privacy and Informed Consent: When collecting and analyzing student

well-being data, it is essential to prioritize privacy and obtain informed
consent from the students or their legal guardians. Students should be
informed about the purpose of data collection, how their data will be
used, and the potential consequences of participating in the study.
• Bias and Fairness: Care should be taken to ensure that the data used
in the analysis is representative and does not perpetuate or reinforce
existing biases or inequalities. Biases in the data could lead to
biased predictions and exacerbate existing disparities in educational
opportunities. It is essential to consider and include non-binary
gender students and other underrepresented groups in data analysis and
predictive models to ensure that educational interventions are inclusive
and equitable for all students.
• Interpretability and Transparency: The use of complex models like

tsfresh and PCA can result in models that are challenging to interpret.
In educational settings, it is crucial to maintain transparency in model
predictions so that teachers, students, and stakeholders can understand
how the models arrive at their conclusions. A balance must be
struck between model complexity and interpretability to ensure that
the model’s decisions are explainable and understandable to all parties
involved.
• Impact of Misclassification: The logistic regression model, even after

SMOTE, still has a misclassification rate of approximately 15-20%.
Misclassifying students as at risk of failing when they are not and vice
versa can have significant consequences. False positives (incorrectly
classifying a student as at risk) could lead to unnecessary interventions,
56 | Discussion
misallocation of resources, reduced trust in the model, and additional

stress for students. False negatives (failing to identify students at risk)
may result in students not receiving the support they need to improve
their academic performance.
• Responsible Use of Predictive Models: The predictive models

developed in this study should be used as tools to complement teachers’
and educators’ expertise, rather than replacing their judgment entirely.
Predictive models can provide valuable insights, but they should not
be the sole basis for making decisions about students’ well-being or
academic performance. Human judgment, empathy, and understanding
are critical elements that should be integrated into the decision-making
process.
• Continuous Evaluation and Improvement: The models should be

subject to continuous evaluation and improvement as new data becomes
available and as the educational context evolves. Regular monitoring
of the models’ performance and impact on students’ well-being and
academic outcomes is necessary to ensure that they remain effective
and ethically sound.
Conclusion | 57
Chapter 8
Conclusion
In this chapter the research questions are answered along with some
reflections, and suggestions for future work are presented.
8.1 Answering research questions

This thesis had three primary goals which were:

being data.
This goal was achieved by generating data insights described in Section
6.1. The key data insights related to the impact of gender and age on
reported well-being, and the relationships between various well-being
dimensions.
2. Design a model using ML methods to predict student grades based on

well-being data, and validate the model against actual performance data
provided by the schools.
The predictive model was built as outlined in 6.2 using both logistic
regression and random forest. This led to a predictive model with
reasonably high predictive accuracy of 80-85 percent. The limitations
of this model were also considered and discussed.
model using Value Sensitive Design.
A Value Sensitive Design approach was used to discuss the benefits and
risks to the stakeholders of EdAider’s technology in Section 6.3. The
58 | Conclusion
stakeholders’ conflicting values were also discussed. This resulted in

a suggestion of ethical considerations for EdAider and its users to take
into account in 7.3, including transparency, bias and fairness, privacy
and consent, and responsible use.
The scientific contribution of this thesis is to provide information for educators

to understand the factors influencing student performance and well-being.
Implementing the predictive model in schools may help identify students who
are at risk of failing, allowing for timely interventions and support. However,
it is crucial to consider the social implications of using a technology-driven
approach for academic performance prediction. The model’s accuracy and
potential misclassifications must be balanced with the ethical considerations
and potential impacts on students’ well-being and educational experiences.
This thesis also highlights these ethical considerations.
With new AI and machine learning breakthroughs every few months, and a lot
of money at stake, developers of educational technology are often scrambling
to release new products before their competitors, meaning that ethical
considerations are often an afterthought. This can have harmful implications
for many stakeholders, resulting in conflicts and negative consequences for the
company itself if their reputation becomes damaged by engaging in unethical
practices. Learning how to incorporate ethical values into Learning Analytics
technology using techniques such as Value Sensitive Design is therefore
crucial.
8.2 Future work

The section will briefly cover what the results of the thesis could be
investigated further, or areas of research that independently can be examined.
In this paper, the student data was limited to include school and gender.
Further exploring the influence of external factors such as family support,
socio-economic status, and school environment could help contextualize the
relationship between well-being and academic performance. Understanding
how these factors interact with well-being dimensions may provide additional
insights into student success.
The academic performance data was also limited to one indicator given by the
teacher near the end of the academic year. The teacher’s assessment may be
Conclusion | 59
subject to bias because of student behaviour or previous results. For this reason
it would be beneficial to investigate official state examinations along with the
teacher’s assessment. It would also be beneficial to investigate the assessments
at more regular time intervals, as students performance may change over
time. Following students over an extended period would allow researchers to
see how changes in well-being dimensions relate to fluctuations in academic
performance and vice versa. Longitudinal data can help identify critical
periods of vulnerability or resilience in students’ well-being and academic
trajectories.
While these models investigated in this paper demonstrated promising results,

there are several other machine learning models that warrant investigation in
future research. One such approach which could be suitable for this purpose
is mixed models, also known as hierarchical linear models or multilevel
models. Well-being data collected over time creates a temporal aspect that
may influence academic performance. Mixed models can incorporate time-
series components to capture the longitudinal patterns in students’ well-being
and performance. This temporal modeling can yield insights into how changes
in well-being dimensions affect academic outcomes over time. Missing data
is a common issue in longitudinal studies, as students may not respond to all
surveys or may drop out over time. Mixed models are well-suited to handle
missing data through methods like maximum likelihood estimation, ensuring
that valuable information is not lost during analysis. Mixed models also offer
a more interpretable framework compared to some complex machine learning
models like deep neural networks.
Further research should focus on establishing causal relationships between

well-being dimensions and academic performance. Experimental designs
such as intervention studies, can shed light on whether specific well-being
interventions or changes in academic circumstances lead to changes in well-
being and academic outcomes.
60 | Conclusion
References | 61
References
[1] C. Dziuban, C. Graham, P. Moskal, A. Norberg, and N. Sicilia, “Blended

learning: the new normal and emerging technologies,” International
Journal of Educational Technology in Higher Education volume,
Feb. 2018. doi: 10.1186/s41239-017-0087-5. [Online]. Available:
https://doi.org/10.1186/s41239-017-0087-5 [Page 1.]
[2] V. Kovanovic, C. Mazziotti, and J. Lodge, “Learning analytics for

primary and secondary schools,” Journal of Learning Analytics, vol. 8,
no. 2, pp. 1–5, 2021. [Page 1.]
[3] S. A. Becker, M. Cummins, A. Davis, A. Freeman, C. G. Hall, and

V. Ananthanarayanan, “Nmc horizon report: 2017 higher education
edition,” The New Media Consortium, Tech. Rep., 2017. [Page 1.]
[4] M. Törnsén, Rektor, elevhälsan och elevers lärande och utveckling.

Skolverket Stockholm, 2014. [Pages 1 and 2.]
[5] (2023) Edaider wellbeing platform. [Online]. Available: https:

//www.edaider.com/valmaende [Page 2.]
[6] B. Friedman, P. Kahn, A. Borning, P. Zhang, and D. Galletta, Value

Sensitive Design and Information Systems, 01 2006. ISBN 978-94-007-
7843-6 [Pages 3, 9, and 31.]
[7] (2015) The 2030 agenda for sustainable development. [Online].

Available: https://sdgs.un.org/2030agenda [Page 3.]
[8] P. Grimm, “Social desirability bias,” Wiley international encyclopedia

of marketing, 2010. [Page 4.]
[9] K. Raphael, “Recall bias: a proposal for assessment and control,”

International journal of epidemiology, vol. 16, no. 2, pp. 167–170, 1987.
[Page 4.]
62 | References
[10] (2023) Society for learning analytics research. [Online]. Available: https:
//www.solaresearch.org/about/what-is-learning-analytics/ [Page 5.]
[11] C. Romero and S. Ventura, “Educational data mining and learning

analytics: An updated survey,” Jan. 2020. doi: 10.1002/widm.1355.
[Online]. Available: https://doi.org/10.1002/widm.1355 [Page 5.]
[12] O. Viberg, M. Hatakka, O. Bälter, and A. Mavroudi, “The

current landscape of learning analytics in higher education,”
Computers in Human Behavior, vol. 89, pp. 98–110, 2018.
doi: https://doi.org/10.1016/j.chb.2018.07.027. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0747563218303492
[Pages 6 and 8.]
[13] J. López-Zambrano, J. Lara-Torralbo, and C. Romero, “Early

prediction of student learning performance through data mining:
A systematic review,” Psicothema 2021, vol. 33, pp. 456–
465, 2021. doi: 10.7334/psicothema2021.62. [Online]. Available:
https://www.psicothema.com/pdf/4692.pdf [Pages 6 and 15.]
[14] M. Chatti, A. Dyckhoff, U. Schroeder, and H. Thüs, “A reference

model for learning analytics,” International Journal of Technology
Enhanced Learning, vol. 4, pp. 318–331, 01 2012. doi: 10.1504/IJ-
TEL.2012.051815 [Page 6.]
[15] A. Peña-Ayala, “Review: Educational data mining: A survey and

a data mining-based analysis of recent works,” Expert Systems with
Applications: An International Journal, vol. 41, pp. 1432–1462, 03
2014. doi: 10.1016/j.eswa.2013.08.042 [Page 6.]
[16] D. Sansone, “Beyond early warning indicators: High school dropout and
machine learning,” Oxford Bulletin of Economics and Statistics, vol. 81,
no. 2, pp. 456–485, 2019. doi: https://doi.org/10.1111/obes.12277.
[Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/ob
es.12277 [Page 6.]
[17] M. Barkham, E. Broglia, G. Dufour, M. Fudge, L. Knowles, A. Percy,

A. Turner, and C. Williams, “Towards an evidence-base for student
wellbeing and mental health: Definitions, developmental transitions and
data sets,” Counselling and Psychotherapy Research, vol. 19, no. 4,
pp. 351–357, 2019. doi: https://doi.org/10.1002/capr.12227. [Online].
References | 63
Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/capr.12227
[Page 7.]
[18] C. Valdez, S. Lambert, and N. Ialongo, “Identifying patterns of early risk

for mental health and academic problems in adolescence: A longitudinal
study of urban youth,” Child Psychiatry Human Development, vol. 42,
p. 521–538, Oct. 2011. doi: 10.1007/s10578-011-0230-9. [Online].
Available: https://doi.org/10.1007/s10578-011-0230-9 [Page 7.]
[19] C. Bradshaw, C. Schaeffer, and H. Petras, “Predicting negative

life outcomes from early aggressive–disruptive behavior trajectories:
Gender differences in maladaptation across life domains,” J Youth
Adolescence, vol. 39, pp. 953–966, Aug. 2010. doi: 10.1007/s10964-
009-9442-8. [Online]. Available: https://doi.org/10.1007/s10964-009-9
442-8 [Page 7.]
[20] K. Becker, N. Brandt, S. Hoover, and B. Chorpita, “A review

of educational outcomes in the children’s mental health treatment
literature,” Advances in School Mental Health Promotion, vol. 7, pp. 5–
23, dec 2013. doi: 10.1080/1754730X.2013.851980 [Page 7.]
[21] J. Durlak, R. Weissberg, A. Dymnicki, R. Taylor, and K. Schellinger,

“The impact of enhancing students’ social and emotional learning:
a meta-analysis of school-based universal interventions,” Child Dev,
vol. 82, pp. 405–32, jan 2011. doi: 10.1111/j.1467-8624.2010.01564.x.
[Page 7.]
[22] M. Duncan, K. Patte, and S. Leatherdale, “Mental health

associations with academic performance and education behaviors
in canadian secondary school students,” Canadian Journal of
School Psychology, vol. 36, p. 082957352199731, 02 2021. doi:
10.1177/0829573521997311 [Page 7.]
[23] E. Lawes and S. Boyd, “Making a difference to student wellbeing-a data

exploration,” Tech. Rep., 03 2018. [Page 7.]
[24] W. El Ansari and C. Stock, “Is the health and wellbeing of

university students associated with their academic performance? cross
sectional findings from the united kingdom,” International journal of
environmental research and public health, vol. 7, pp. 509–27, 02 2010.
doi: 10.3390/ijerph7020509 [Page 7.]
64 | References
[25] R. Palos, L. Maricutoiu, and I. Costea, “Relations between academic

performance, student engagement and student burnout: A cross-lagged
analysis of a two-wave study,” Studies In Educational Evaluation,
vol. 60, pp. 199–204, 03 2019. doi: 10.1016/j.stueduc.2019.01.005
[Page 7.]
[26] D. Murchan and F. Siddiq, “A call to action: a systematic review

of ethical and regulatory issues in using process data in educational
assessment,” Large-scale Assessments in Education, vol. 9, 12 2021. doi:
10.1186/s40536-021-00115-3 [Pages 8 and 9.]
[27] J. Reyes, “The skinny on big data in education: Learning analytics

simplified,” vol. 59, pp. 75–80, 02 2021. [Page 8.]
[28] H. Drachsler and W. Greller, “Privacy and analytics – it’s a delicate

issue. a checklist for trusted learning analytics.” 04 2016. doi:
10.1145/2883851.2883893 [Page 8.]
[29] S. Arie, “Simon wessely: “every time we have a mental health

awareness week my spirits sink”,” BMJ, 09 2017. [Online]. Available:
https://doi.org/10.1136/bmj.j4305 [Page 8.]
[30] (2023) General data protection regulation (gdpr). [Online]. Available:

https://gdpr.eu/tag/gdpr/ [Pages 8 and 13.]
[31] N. Sclater, “Developing a code of practice for learning analytics,”

Journal of Learning Analytics, vol. 3, pp. 16–42, 04 2016. doi:
10.18608/jla.2016.31.3 [Pages 9, 13, 31, 44, and 55.]
[32] B. Chen and H. Zhu, “Towards value-sensitive learning analytics

design,” 12 2018. [Pages 9 and 31.]
[33] M. L. Kern, L. E. Waters, A. Adler, and M. A. White, “A multidimen-

sional approach to measuring well-being in students: Application of the
perma framework,” The journal of positive psychology, vol. 10, no. 3,
pp. 262–271, 2015. [Page 10.]
[34] M. E. Seligman, Flourish: A visionary new understanding of happiness

and well-being. Simon and Schuster, 2011. [Page 10.]
[35] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and

F. Herrera, Learning from imbalanced data sets. Springer, 2018,
vol. 10. [Page 12.]
References | 65
[36] Big Data and Learning Analytics in Higher Education. ISBN 978-3-
319-06519-9. [Online]. Available: https://doi.org/10.1007/978-3-319-0
6520-5 [Page 13.]
[37] G. James, D. Witten, T. Hastie, and R. Tibshirani, An introduction to

statistical learning. Springer, 2013, vol. 112. [Page 15.]
[38] J. F. Hair, W. C. Black, B. J. Babin, and R. E. Anderson, Multivariate

data analysis: Pearson new international edition PDF eBook. Pearson
Higher Ed, 2013. [Pages 15 and 16.]
[39] V. N. Gudivada, M. Irfan, E. Fathi, and D. Rao, “Cognitive analytics:

Going beyond big data analytics and machine learning,” in Handbook of
statistics. Elsevier, 2016, vol. 35, pp. 169–205. [Pages 16 and 17.]
[40] F.-J. Yang, “An extended idea about decision trees,” in 2019
International Conference on Computational Science and Computational
Intelligence (CSCI). IEEE, 2019, pp. 349–354. [Page 16.]
[41] Priyanka and D. Kumar, “Decision tree classifier: a detailed survey,”

International Journal of Information and Decision Sciences, vol. 12,
no. 3, pp. 246–269, 2020. [Page 16.]
[42] S. Tangirala, “Evaluating the impact of gini index and information gain
on classification using decision tree classifier algorithm,” International
Journal of Advanced Computer Science and Applications, vol. 11, no. 2,
pp. 612–619, 2020. [Page 16.]
[43] Y.-Y. Song and L. Ying, “Decision tree methods: applications for
classification and prediction,” Shanghai archives of psychiatry, vol. 27,
no. 2, p. 130, 2015. [Page 17.]
[44] T. Hastie, R. Tibshirani, J. Friedman, T. Hastie, R. Tibshirani, and

J. Friedman, “Random forests,” The elements of statistical learning:
Data mining, inference, and prediction, pp. 587–604, 2009. [Page 17.]
[45] C. M. Bishop and N. M. Nasrabadi, Pattern recognition and machine

learning. Springer, 2006, vol. 4, no. 4. [Page 17.]
[46] R. Ghorbani and R. Ghousi, “Comparing different resampling methods

in predicting students’ performance using machine learning techniques,”
IEEE Access, vol. 8, pp. 67 899–67 911, 2020. [Page 18.]
66 | References
[47] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series

analysis: forecasting and control. John Wiley & Sons, 2015. [Page 19.]
[48] B. D. Fulcher, “Feature-based time-series analysis,” arXiv preprint

arXiv:1709.08055, 2017. [Page 19.]
[49] T. Henderson and B. D. Fulcher, “An empirical evaluation of time-

series feature sets,” in 2021 International Conference on Data Mining
Workshops (ICDMW), 2021. doi: 10.1109/ICDMW53433.2021.00134
pp. 1032–1038. [Page 19.]
[50] M. Christ, N. Braun, J. Neuffer, and A. W. Kempa-Liehr, “Time series

feature extraction on basis of scalable hypothesis tests (tsfresh–a python
package),” Neurocomputing, vol. 307, pp. 72–77, 2018. [Pages 19
and 24.]
[51] M. Barandas, D. Folgado, L. Fernandes, S. Santos, M. Abreu, P. Bota,

H. Liu, T. Schultz, and H. Gamboa, “Tsfel: Time series feature extraction
library,” SoftwareX, vol. 11, p. 100456, 2020. [Page 19.]
[52] (2023) Kats. [Online]. Available: https://facebookresearch.github.io/K

ats/ [Page 19.]
[53] J. Miao and L. Niu, “A survey on feature selection,” Procedia computer

science, vol. 91, pp. 919–926, 2016. [Page 20.]
[54] T. E. Mathew, “A logistic regression with recursive feature elimination

model for breast cancer diagnosis,” International Journal on Emerging
Technologies, vol. 10, no. 3, pp. 55–63, 2019. [Page 21.]
[55] A. J. Milewska, D. Jankowska, D. Citko, T. Więsak, B. Acacio, and

R. Milewski, “The use of principal component analysis and logistic
regression in prediction of infertility treatment outcome,” Studies in
Logic, GRAMMAR and rhetoric, vol. 39, no. 1, pp. 7–23, 2014.
[Page 21.]
[56] P. Xanthopoulos, P. M. Pardalos, T. B. Trafalis, P. Xanthopoulos, P. M.

Pardalos, and T. B. Trafalis, “Linear discriminant analysis,” Robust data
mining, pp. 27–33, 2013. [Page 21.]
[57] (2023) tsfresh documentation. [Online]. Available: https://tsfresh.read

thedocs.io/en/latest/text/faq.html [Page 25.]
References | 67
[58] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer,

“Smote: synthetic minority over-sampling technique,” Journal of
artificial intelligence research, vol. 16, pp. 321–357, 2002. [Page 26.]
[59] A. Zheng, Evaluating Machine Learning Models, 1st ed. O’Reilly

Media, Inc., 2015. ISBN 1-4920-4875-5 [Page 28.]
[60] Ž. Vujović et al., “Classification model evaluation metrics,” Interna-

tional Journal of Advanced Computer Science and Applications, vol. 12,
no. 6, pp. 599–606, 2021. [Pages 29 and 30.]
[61] H. Kaur, H. S. Pannu, and A. K. Malhi, “A systematic review on

imbalanced data challenges in machine learning: Applications and
solutions,” ACM Computing Surveys (CSUR), vol. 52, no. 4, pp. 1–36,
2019. [Page 30.]
[62] C. Bergmeir and J. M. Benítez, “On the use of cross-validation for time
series predictor evaluation,” Information Sciences, vol. 191, pp. 192–
213, 2012. [Pages 30 and 31.]
[63] R. Parasuraman, R. Molloy, and I. L. Singh, “Performance consequences

of automation-induced’complacency’,” The International Journal of
Aviation Psychology, vol. 3, no. 1, pp. 1–23, 1993. [Page 45.]
[64] A. Kaye-Tzadok, S. S. Kim, and G. Main, “Children’s subjective

well-being in relation to gender—what can we learn from dissatisfied
children?” Children and Youth Services Review, vol. 80, pp. 96–104,
2017. [Page 49.]
[65] Y. Yoon, M. Eisenstadt, S. T. Lereya, and J. Deighton, “Gender

difference in the change of adolescents’ mental health and subjective
wellbeing trajectories,” European Child & Adolescent Psychiatry, pp.
1–10, 2022. [Page 49.]
[66] M. Hajduk, A. Heretik Jr, B. Vaseckova, L. Forgacova, and J. Pecenak,

“Prevalence and correlations of depression and anxiety among slovak
college students.” Bratislavske lekarske listy, vol. 120, no. 9, pp. 695–
698, 2019. [Page 50.]
[67] K. Eum and K. G. Rice, “Test anxiety, perfectionism, goal orientation,

and academic performance,” Anxiety, Stress, & Coping, vol. 24, no. 2,
pp. 167–178, 2011. [Page 51.]
68 | References
If you do not have an appendix, do not include the \cleardoublepage

command below; otherwise, the last page number in the metadata will
be one too large.
TRITA-EECS-EX-2023:670
www.kth.se

FULLTEXT01 AcademicPerformance ML Review

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

FULLTEXT01 AcademicPerformance ML Review

Uploaded by

Copyright:

Available Formats

Degree Project in Machine Learning

Second cycle, 30 credits

Machine Learning to predict

Stockholm, Sweden, 2023

Master’s Programme, Machine Learning, 120 credits

Supervisors: Hedvig Kjellström, Barbro Fröding, Jalal Nouri

Host company: EdAider (EdAI Technologies AB)

This thesis has three primary goals which are to:

1. Generate data insights to further understand patterns in the student well-

Logistic regression and random forest models were used to build a

performance prediction model, which aims to predict whether a student is at

Detta examensarbete undersöker hur Maskininlärning (ML) kan användas

1. Skapa datainsikter för att ytterligare förstå mönster i data om elevers

Resultaten visade att pojkar i genomsnitt rapporterar högre mående än

Logistisk regression och random forest-modeller användes för att bygga en

Olika metoder för feature selection undersöktes, inklusive regularisering,

Fördelarna, riskerna och etiska värdekonflikterna i dataanalysen och predik-

Stockholm, August 2023

4 Machine Learning Theory 14

4.1.1 Logistic regression . . . . . . . . . . . . . . . . . . . 15

6.2.2 Random Forest . . . . . . . . . . . . . . . . . . . . . 40

This chapter introduces the importance of student well-being in primary

According to Skolverket, the Swedish National Agency for Education, student

definitions of health promotion and prevention work, student health services

EdAider [5] is a Swedish EdTech startup founded by researchers at Stockholm

1.3 Research Goals

1. Generate data insights to further understand patterns in the student well-

1.3.1 UN’s Global Goals

1.4.1 Limited data

1.4.2 Self-reported data

1.5 Structure of the thesis

This chapter introduces definitions and previous research about Learning

2.1 Learning Analytics

1. Descriptive and Diagnostic analytics

A survey of educational data mining and learning analytics by Viberg et al.,

2.2 Prediction of Academic Performance

2.3 Student well-being

A study of mental health of Canadian secondary school students [22]

Similar results are observed in the context of university-level education.

be an antecedent for high student engagement and low student burnout. On

2.4 Ethical Learning Analytics

2.4.1 Conflicting values

of how user data is allowed to be processed and applied to organisations such

2.5 Summary of literature

3.1 Description of data

Question Category Weight

Table 3.1: The 8 well-being survey questions which are indicators of 8

Answer Weight Semantic points

The resulting dataset is a time series of well-being indicators, which is grouped

Number of Number of Number of

by the teacher, where 1 indicates a student who is at risk of failing a course.

3.2 Imbalanced datasets

Performance indicator Gender Num. students Total

3.3 Ethical data concerns

Machine Learning Theory

4.1 Classification models

4.1.1 Logistic regression

p(Yi |Xi , ..., Xp )

The question is then to decide at what probability an observation should be

4.1.2 Decision trees

where pi is the probability of event i in L. In this way the algorithm decides

Using decision tree models as a classification method has the following

• Simplifies complex relationships between input variables and target