LUCY MCCARREN
Abstract
The data provided by educational platforms and digital tools offers new ways
of analysing students’ learning strategies. One such digital tool is the well-
being platform created by EdAider, which consists of an interface where
students can answer questions about their well-being, and a dashboard where
teachers and schools can see insights into the well-being of individual students
and groups of students. Both students and teachers can see the development
of student well-being on a weekly basis.
This thesis project investigates how Machine Learning (ML) can be used alongside
Learning Analytics (LA) to understand and improve students' well-being.
Real-world data generated by students at Swedish schools using EdAider's
well-being platform is analysed to generate data insights. In addition, ML
methods are implemented in order to build a model to predict whether students
are at risk of failing based on their well-being data, with the goal of informing
data-driven improvements to students' education.
The results showed that males report higher well-being on average than
females across most well-being factors, with the exception of relationships
where females report higher well-being than males. Students identifying as
non-binary gender report a considerably lower level of well-being compared
with males and females across all 8 well-being factors. However, the amount
of data for non-binary students was limited. Primary school students report
higher well-being than the older secondary school students. Students reported
anxiety/depression as the most closely correlated dimensions, followed by
engagement/accomplishment and positive emotion/depression.
The benefits, risks and ethical value conflicts of the data analysis and
prediction model were carefully considered and discussed using a Value
Sensitive Design approach. Ethical practices for mitigating risks are discussed.
Keywords
Machine Learning, Data Science, Learning Analytics
Sammanfattning
The data provided by educational platforms and digital tools offers new ways
of analysing students' learning strategies. One such digital tool is the well-being
platform created by EdAider, which consists of an interface where students can
answer questions about their well-being, and a dashboard where teachers and
schools can see insights into the well-being of individual students and groups
of students. Both students and teachers can see the development of student
well-being on a weekly basis.
Keywords
Machine Learning, Data Science, Learning Analytics
Acknowledgments
Firstly, I would like to thank my supervisors Barbro Fröding and Hedvig
Kjellström for their valuable feedback and support. Hedvig has been open-
minded and solution-orientated whenever I encountered problems, and Barbro
has inspired and encouraged me on my journey to research further within the
field of technology ethics.
Thanks to Jalal Nouri for providing me with the opportunity to work with
EdAider’s data, and to Kirill Maltsev and Knut Sørli for answering my
questions in a timely and thorough manner.
I would also like to thank my uncle Andrew for his support and encouragement
throughout my academic journey, and for his honest and thorough feedback on
my research.
Lastly, I would like to thank my dear friends and partner for the emotional
support, laughs and encouragement during the two years of my master's degree.
Contents
1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Research Goals
    1.3.1 UN's Global Goals
  1.4 Limitations
    1.4.1 Limited data
    1.4.2 Self-reported data
  1.5 Structure of the thesis
2 Literature Review
  2.1 Learning Analytics
  2.2 Prediction of Academic Performance
  2.3 Student well-being
  2.4 Ethical Learning Analytics
    2.4.1 Conflicting values
  2.5 Summary of literature
3 Data
  3.1 Description of data
  3.2 Imbalanced datasets
  3.3 Ethical data concerns
4 Machine Learning Theory
5 Methods
  5.1 Data pre-processing
  5.2 Data analysis and visualisation
  5.3 Model for performance prediction
    5.3.1 Feature extraction
      5.3.1.1 Tsfresh features
      5.3.1.2 Custom features
    5.3.2 Synthetic Minority Oversampling Technique (SMOTE)
    5.3.3 Model selection
    5.3.4 Evaluation metrics
      5.3.4.1 Accuracy
      5.3.4.2 Confusion matrix
      5.3.4.3 Precision and Recall
      5.3.4.4 Area under the ROC curve (AUC)
  5.4 Cross-validation
  5.5 Value Sensitive Design
6 Results
  6.1 Major data insights
    6.1.1 The impact of gender and age on reported well-being
    6.1.2 Relationship between well-being categories
    6.1.3 Relationship between well-being and performance
  6.2 Performance prediction model
    6.2.1 Logistic regression
7 Discussion
  7.1 Data insights
  7.2 Performance prediction model
  7.3 Ethical discussion
8 Conclusion
  8.1 Answering research questions
  8.2 Future work
References
Chapter 1
Introduction
1.1 Background
Education is fundamental to human development, providing individuals with
the knowledge and skills necessary to navigate the world and achieve their
goals. With accelerated digitalisation and increased quantities of data in
educational settings, there is growing interest in understanding how these
tools can be used to measure and promote student well-being, and support
individualized learning experiences. Educational institutions have started
to pay attention to the promises of big data and data mining techniques
to support learning and teaching in more efficient and effective ways [1].
Learning Analytics (LA) is the name given to this field that uses data analytics
to measure, analyse, and understand the learning processes and results in
educational settings [2]. The main opportunities of LA are to “predict learner
outcomes, trigger interventions or curricular adaptations, and even prescribe
new pathways or strategies to improve student success” [3].
1.2 Purpose
The purpose of this study is to analyse the student well-being and performance
data collected by EdAider, with the following objectives:
1. To create insights and visualisations using the student well-being data.
This will help EdAider to improve the well-being dashboard, which
will in turn enable teachers and educational institutions to make better
pedagogical decisions by visualising the well-being data in a clear and
concise manner. A well-designed dashboard will enable teachers to
intervene promptly when required.
2. To investigate the relationship between well-being and performance.
This can help educators to identify students who are at risk of
performing poorly or who require special attention or counselling by
analysing their well-being data. This is a crucial issue in education,
affecting students at all levels and in schools and universities worldwide.
3. To identify and discuss the ethical implications of using student well-
being data and ML methods for gaining knowledge about learners’
behavior. While the use of student data can benefit students, teachers,
and institutions by enhancing understanding and constructing didactical
interventions, it also raises significant ethical concerns.
1.4 Limitations
Due to the limited size of this dataset, the results cannot be considered general
and applicable to every school within the Swedish school system. The results
should be seen only as a supporting tool for teachers.
1. The students may not always provide accurate information. This can
occur due to a variety of reasons, such as social desirability bias [8],
where students may provide answers that they believe are more socially
acceptable or pleasing to their teacher. This risk is heightened by
the fact that the well-being data is not anonymised for teachers or
schools. Students may not want to report low well-being for fear of
being stigmatized, or on the other hand they may exaggerate symptoms
of depression or anxiety in order to receive more attention.
2. Some students may have difficulty with understanding the questions,
or difficulty communicating their experiences, which can lead to
inaccurate responses. This may be due to individual factors such as
literacy, language proficiency and cognitive function.
3. Self-reported data can be subject to recall bias [9]. EdAider’s well-
being survey asks students about their well-being over the past 7 days.
Students may therefore have difficulty remembering their experiences, or
communicating well-being that is episodic or fluctuates over time.
over time.
Chapter 2
Literature Review
This thesis will focus predominantly on the first two applications listed, namely
descriptive and diagnostic analytics, and the prediction of student academic
success based on their well-being data.
The most common methods of data analysis for LA are prediction, clustering and
relationship mining [12]. One of the most common tasks tackled by LA
research is the prediction of student performance, for which the most common
methods include regression and classification. López-Zambrano et al., [13]
mention that there is less data available for primary and secondary education
compared to tertiary levels, where tertiary refers to university and college
level education; 86.6% of published papers included in the review were done
at tertiary level, and only 7.3% for secondary level students, with none for
primary level students. A possible reason for this is the accessibility of the
data, as university education is much more often digitalised than primary and
secondary education. This highlights that further LA research at primary and
secondary education level is important.
A longitudinal study conducted in the US [16] between 2009 and 2012 used
Machine Learning to detect students who were prone to dropping out of upper
secondary school. In the study, the most important variables for the model
were identified to be the student’s GPA, age, math test scores, expulsion
record and attendance [16]. A systematic review of studies regarding Early
Prediction of Student Learning Performance [13] outlines a variety of Machine
Learning techniques and variables used for prediction. The studies included
in the review achieved varying prediction accuracy, and the accuracy was
highly dependent on the number and types of variables included. The
variables and student attributes used for prediction varied depending on the
educational environment. In general, the variables could be grouped into
student demographics, student activities and student interactions with an e-
learning platform. It is worth noting that none of the aforementioned studies
take student well-being into account as a variable in their predictive model,
meaning that this area requires further research.
In order to address these conflicts of interest, Murchan and Siddiq [26] suggest
that LA research should be carried out using an ethics framework. One
example is the Sclater (2016) framework [31] which addresses 8 variables
(responsibility, transparency, consent, privacy, validity, access, minimising
adverse impacts and stewardship of data) that should be analysed. Another
approach which can be used for designing ethical LA technology is Value
Sensitive Design (VSD) [6] [32] which is discussed in detail in section 5.5.
Chapter 3
Data
The questions have 5 possible answers, ranging from "not at all" to "very
often", and the answers are assigned "semantic points" according to whether
they are weighted positively or negatively.
Table 3.2: The answer choices given for each well-being question, and
semantic points assigned by positive/negative weight.
Table 3.3: A summary of the number of surveys answered, students, and data
points per school.
Alongside the well-being data, performance indicators for students from one
secondary school are included in the dataset. The indicator is a binary flag set
by the student's teacher, indicating whether the student is at risk of failing or not.
Table 3.4: The number of students by gender who are at risk of failing ("Fail")
versus those who are not ("Pass").
This bias problem can be addressed by changing the distribution of the data
with under-sampling or over-sampling. Under-sampling means that data from
the majority class is randomly deleted until the number of data points in
each class matches. However, deleting data leaves fewer data points to
feed the algorithm with. Instead, oversampling (increasing the number of data
points for the minority class) is preferred. An approach to oversampling is
discussed in 5.3.2. The accuracy of a classifier is the total number of correct
predictions by the classifier divided by the total number of predictions. This
may be good enough for a well-balanced dataset but is not ideal for the imbalanced
class problem. Choosing a proper evaluation metric is therefore important, as
is discussed in 5.3.4.
Chapter 4
Machine Learning Theory
Figure 4.1 is a flow chart which illustrates the method of the thesis. The
steps contained within the red rectangle correspond to the second research
question outlined in section 1.3: to design a classification model using ML
methods to predict student grades based on well-being data, and to validate
the model against actual performance data provided by the schools. In order
to implement this model, theoretical ML techniques were used. In this chapter,
the theoretical techniques are presented and explained in more detail.
Figure 4.1: Flow chart of data collection process and method of the thesis.
The ML techniques required to complete the steps inside the red rectangle are
explained in this chapter.
log( p / (1 − p) ) = β_0 + β_1 x_1 + · · · + β_k x_k

where the left-hand side of the equation is called the logit function of p. From
the logit function we see that logistic regression is based on a linear model, and
therefore can be explained in terms of the coefficients of the input variables.
This feature is useful to understand the relationship between the input variables
and the output variable.
The decision tree is an iterative process that sorts data points into different
categories. The process is binary: starting at the root (parent) node, the value
of an attribute is used as a threshold to split the data points into child nodes.
The method for evaluating the classifier strength of a node (n) is the Gini index
(G). Gini index determines the purity of a specific class after splitting along
a particular attribute. The best split increases the purity of the sets resulting
from the split. If L is a dataset with j different class labels, the Gini measure
is defined [42] as
G(L) = 1 − Σ_{i=1}^{j} p_i^2    (4.3)

where p_i is the proportion of data points in L belonging to class i.
One big advantage of decision tree learning over other learning methods such
as logistic regression is that it can capture more complex decision boundaries.
Decision tree learning is suitable for datasets that are not linearly separable—
there exists no hyperplane that separates out examples of two different classes
[39]. The main limitation of decision tree models is that they can be subject
to overfitting and underfitting, particularly when using a small data set [43].
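To make the Gini measure concrete, the following minimal Python sketch computes equation (4.3) from a list of class labels (the function name and example data are illustrative, not from the thesis):

    from collections import Counter

    def gini_index(labels):
        # G(L) = 1 - sum_i p_i^2, where p_i is the proportion of class i in L.
        n = len(labels)
        return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

    print(gini_index(["pass", "pass", "pass"]))          # 0.0: a pure node
    print(gini_index(["pass", "fail", "pass", "fail"]))  # 0.5: maximally impure binary node

A split is evaluated by comparing the weighted Gini impurity of the child nodes with that of the parent node; the split yielding the largest decrease in impurity is chosen.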
4.2 Resampling
Machine learning models for classification problems are generally built to handle
problems with a relatively equal number of observations in each class [45].
1. For b = 1 to B:
(a) Draw a bootstrap sample Z∗ of size N from the training data.
(b) Grow a random-forest tree Tb to the bootstrapped data, by
recursively repeating the following steps for each terminal node
of the tree, until the minimum node size nmin is reached.
i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two daughter nodes.
2. Output the ensemble of trees {T_b}_{b=1}^{B}.
As explained in section 3.2, the dataset used for this thesis is imbalanced.
When the observations in each class are imbalanced, the problem can be
tackled using a resampling method. Two basic techniques are oversampling
and undersampling. Oversampling duplicates samples from the minority
class to increase the number of minority class samples while keeping the
majority class untouched. Under-sampling works in the opposite way, samples
from the majority class are removed from the dataset while the minority
class is untouched. Both methods change the prior knowledge of the class
distributions during the training phase while the original distributions are kept
in the testing phase. Oversampling can help improve the classification of the
minority class by providing more instances for the model to learn from [46].
However, it can also introduce the risk of overfitting if synthetic instances are
not properly generated or if the minority class is excessively over-represented.
Undersampling can help reduce the dominance of the majority class and allow
the model to pay more attention to the minority class. However, it may result
in the loss of potentially useful information from the majority class.
Due to the small size of the dataset used for this thesis, oversampling is the
most suitable method. The oversampling methods considered were:
• Random oversampling: randomly duplicating instances from the minority
class until it is balanced with the majority class. While simple to implement,
it can potentially lead to overfitting due to the exact replication of data.
• Synthetic Minority Oversampling Technique (SMOTE): generating synthetic
minority-class samples rather than duplicating existing ones, described
further in section 5.3.2.
In order to use time series data as input to supervised learning problems, one
can choose a set of significant data points from each time series as elements of a
feature vector. However, it is much more efficient and effective to characterize
the time series with respect to the distribution of data points, correlation
properties, entropy, min, max, average, percentile and other mathematical
derivations [48]. An empirical evaluation of time-series feature sets [49]
indicates that the three most popular Python libraries for time-series feature
extraction are tsfresh [50], TSFEL [51], and Kats [52]. The study indicates
that tsfresh contains many unique time-series features that are not present in
the other feature sets.
Chapter 5
Methods
This chapter describes the steps for data pre-processing in 5.1 and for data
analysis in 5.2. The method for building the performance prediction model is
outlined in 5.3 and the evaluation metrics for the model are in 5.3.4. Finally,
the method for Value Sensitive Design is presented in 5.5. Figure 5.1 is a
flow chart which illustrates the method of the thesis. The green rectangles
correspond to the three research questions outlined in section 1.3.
Figure 5.1: Flow chart of data collection process and method of the thesis.
the dataset. The data was reshaped to have each of the 8 well-being factors as a
dimension. A dimension called "average well-being" was created, which is
the mean of the semantic points of the 8 well-being factors. In total, EdAider
had data from 17 schools, but many of them had very few data points, so
schools which had carried out fewer than 10 well-being surveys were excluded
in the data pre-processing step. This resulted in a dataset with 8 schools (3
primary and 5 secondary), as described in section 3.1.
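This pre-processing can be sketched with Pandas as follows; the file name and column names are assumptions for illustration, since the thesis does not show the raw EdAider schema:

    import pandas as pd

    df = pd.read_csv("wellbeing_surveys.csv", parse_dates=["date"])  # hypothetical export

    # Reshape so that each of the 8 well-being factors becomes a dimension (column).
    wide = df.pivot_table(index=["school_id", "student_id", "date"],
                          columns="factor", values="semantic_points").reset_index()

    # "Average well-being": the mean of the semantic points of the 8 factors.
    factors = ["accomplishment", "engagement", "positive_emotion",
               "relationships_teacher", "relationships_peers",
               "anxiety", "depression", "workload"]
    wide["average_wellbeing"] = wide[factors].mean(axis=1)

    # Exclude schools which carried out fewer than 10 well-being surveys.
    counts = wide.groupby("school_id")["student_id"].count()
    wide = wide[wide["school_id"].isin(counts[counts >= 10].index)]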
In order to answer these questions, the Pandas and Seaborn Python libraries
were used to create visualisations of the well-being data; the results are
presented in section 6.1.
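As an example of the kind of plot used, a grouped bar chart of average well-being by school and gender (in the spirit of Figure 6.1) could be produced as follows with a recent Seaborn version; column names follow the earlier sketch and are assumptions:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Bars grouped by school and gender; the error bars show standard deviation.
    sns.barplot(data=wide, x="school_id", y="average_wellbeing",
                hue="gender", errorbar="sd")
    plt.ylabel("Average well-being (semantic points)")
    plt.show()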
Two approaches were used for feature extraction from the time series in this
thesis: (1) Tsfresh features and (2) custom features.
Care needs to be taken when using tsfresh with highly irregular time series.
Tsfresh uses timestamps only to order observations. While many features are
interval-agnostic (e.g., number of peaks) and can be determined for any series,
other features (e.g., linear trend) assume equal spacing in time and should
be used with care when this assumption is not met [57]. The results
showed that tsfresh was not suitable for this dataset as the dataset is small,
and also irregular. The number of features extracted by tsfresh was very large
and exceeded the number of data points, which caused the model to overfit.
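For reference, a minimal tsfresh extraction on data in long format looks like the following; extract_features and select_features are the library's actual entry points, while the variable and column names (and the label vector y) are illustrative assumptions:

    from tsfresh import extract_features, select_features
    from tsfresh.utilities.dataframe_functions import impute

    # long_df: one row per observation, with columns student_id, timestamp, semantic_points.
    features = extract_features(long_df, column_id="student_id",
                                column_sort="timestamp",
                                column_value="semantic_points")
    impute(features)                         # replace NaN/inf produced by some features
    relevant = select_features(features, y)  # keep features relevant to the Fail/Pass label

Even after relevance filtering, the number of features can remain large relative to a dataset of this size, which is consistent with the overfitting observed here.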
In cases such as this one where the dataset is small and the number of time
series samples is irregular, manual feature engineering techniques are more
suitable. For this reason, a separate dataset with custom features to cater for
the irregularity of the time series was also created. This approach allows for
more control over the feature selection process, focusing only on the most
relevant and informative aspects of the time series data.
Statistical summary features were created for each of the 8 well-being factors.
In addition, features were created for the number of surveys answered by each
student and the number of days between the first and last survey answered.
This resulted in a dataset with 42 custom features.
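A sketch of this manual feature engineering is given below. The exact 42 features are not reproduced here; the per-factor statistics shown are illustrative, while the two global features (number of surveys answered and the span in days between the first and last survey) come from the description above. Column names follow the earlier pre-processing sketch:

    import pandas as pd

    def student_features(g: pd.DataFrame) -> pd.Series:
        out = {}
        for factor in factors:  # the 8 well-being factors
            s = g[factor].dropna()
            # Illustrative summary statistics per factor.
            out[f"{factor}_mean"] = s.mean()
            out[f"{factor}_std"] = s.std()
            out[f"{factor}_last"] = s.iloc[-1] if len(s) else float("nan")
        # Features that cater for the irregular sampling of the time series.
        out["n_surveys"] = len(g)
        out["days_span"] = (g["date"].max() - g["date"].min()).days
        return pd.Series(out)

    X = wide.sort_values("date").groupby("student_id").apply(student_features)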
Table 5.1: The number of data points for students at risk of failing ("Fail")
versus those who are not ("Pass") before and after applying SMOTE.
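A minimal sketch of applying SMOTE with the imbalanced-learn library is shown below; the thesis does not state its exact implementation, so the library choice, variable names and random seed are assumptions:

    from imblearn.over_sampling import SMOTE

    # Oversample only the training split, so that the test and validation sets
    # keep the original, imbalanced class distribution.
    X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

Unlike random oversampling, SMOTE generates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours, rather than duplicating existing rows.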
The dataset was split so that the training data includes 80% of the original data
and the test and validation sets each include 10%. The models are then trained
on the training data, the classification threshold is decided on the test data, and
lastly the models are validated on the validation data.
Logistic regression was chosen to model the data due to its high interpretability:
the features can easily be analysed, since the regression coefficient estimates
indicate whether the relation between a predictor and the target variable is
positive or negative. Random forest was chosen as a
comparative model. The ensemble nature of Random Forests helps to
average out individual decision tree biases and reduces overfitting. By
combining multiple trees, the ensemble model is better equipped to capture
complex relationships in the data while avoiding the pitfalls of memorising
the training data. In the process of selecting models, several models
were evaluated besides logistic regression and random forest, such as
decision tree, support vector machine (SVM), K-Nearest Neighbors (KNN)
and Gradient Boosting. However, these models were disregarded since
they did not improve the performance of the model. The models were
implemented using scikit-learn ensemble.RandomForestClassifier
and linear_model.LogisticRegression.
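A sketch of the split and model fitting with scikit-learn follows; the 80/10/10 proportions are from the text above, while the stratification, random seed and hyperparameters are illustrative assumptions:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # 80/10/10 split: hold out 20%, then divide it equally into test and validation.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    X_test, X_val, y_test, y_val = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

    log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)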
The evaluation metrics scored for each model are precision, recall and the
area under ROC. In general, a frequently used metric for binary classification
problems is accuracy, given by the proportion of true results to all results.
Considering that the data is imbalanced, accuracy will not be presented for the
models since it can produce misleading results. For instance, if only 5% of
the data set belongs to the target class (students at risk of failing) while 95% of
the data set belongs to the other class (students who are not at risk of failing),
a naive approach that classifies every sample into the majority class would yield
an accuracy of 95%, which would appear very good while being useless as a classifier.
Accuracy places a larger weight on the majority class, making it harder to
produce good prediction accuracy on the minority class, which is why other
metrics are evaluated instead.
5.3.4.1 Accuracy
Accuracy simply measures how often the classifier makes the correct
prediction. It is the ratio between the number of correct performance
predictions and the total number of predictions (the total number of data points):

accuracy = (# correct predictions) / (# total data points)    (5.1)
Accuracy does not make a distinction between classes (at risk of failing/not at
risk of failing). This can be problematic when the risks of misclassification
differ for the two classes. In this case, it may be considered more important
to correctly classify students who are at risk of failing, compared with those
who are not. A confusion matrix [60] shows a more detailed breakdown of
correct and incorrect classifications for each class, as shown in table 5.2. The
students who are at risk of failing are labelled "positive", and the class label
of students who are not at risk of failing is "negative".

                     Actual Positive      Actual Negative
Predicted Positive   True Positive (TP)   False Positive (FP)
Predicted Negative   False Negative (FN)  True Negative (TN)
The groups outlined by the confusion matrix are calculated as follows [60]:
1. True Positive (TP) - the positive examples (students at risk of failing)
classified correctly by the model.
2. True Negative (TN) - the negative examples (students not at risk of
failing) classified correctly by the model.
3. False Positive (FP) - the negative values classified as positive. This
scenario is known as a Type 1 Error.
4. False Negative (FN) - the positive values classified as negative. This
scenario is known as a Type 2 Error.
Precision = TP / (TP + FP)    (5.2)

Recall = TP / (TP + FN)    (5.3)
The ROC curve is another commonly used method to assess the performance of
classification models [60]. The ROC curve is a graph that visualizes the trade-
off between True Positive Rate and False Positive Rate. For each threshold,
the True Positive Rate and False Positive Rate are calculated and plotted on
one graph. The higher the True Positive Rate and the lower the False Positive
Rate for each threshold, the better; a better classifier has a curve that lies
closer to the top-left corner of the plot. The area below the ROC curve is
called the ROC AUC score, a number that summarises how good the ROC curve is.
The ROC AUC score shows how many correct positive classifications can be
gained as more and more false positives are allowed. A higher ROC AUC score
(closer to 1) indicates that the model has better predictive power and can
effectively separate the classes.
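These metrics can be computed with scikit-learn as sketched below; variable names follow the earlier sketches and are assumptions. Note that scikit-learn's confusion_matrix places actual values in rows and predictions in columns:

    from sklearn.metrics import (precision_score, recall_score,
                                 roc_auc_score, confusion_matrix)

    y_pred = log_reg.predict(X_val)
    y_score = log_reg.predict_proba(X_val)[:, 1]  # probability of "at risk of failing"

    print(precision_score(y_val, y_pred))   # TP / (TP + FP)
    print(recall_score(y_val, y_pred))      # TP / (TP + FN)
    print(roc_auc_score(y_val, y_score))    # AUC is computed from scores, not hard labels
    print(confusion_matrix(y_val, y_pred))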
5.4 Cross-validation
When the dataset is small, removing a part of it for validation poses a problem
of underfitting. By reducing the size of the training data, there is a risk
of removing important patterns in the dataset. To tackle this problem, in
classification it is common practice to use k-fold cross-validation [62]. K
was set to 20 for model validation in this project. The dataset is randomly
partitioned into k subsets of approximately equal size. These subsets are often
referred to as "folds". The cross-validation process then involves k iterations.
In each iteration, one of the k folds is used as the validation set, while the
remaining k-1 folds are combined to form the training set. The model is
trained on the training set and evaluated on the validation set. This process
is repeated k times, with each fold serving as the validation set once. At the
end of the k iterations, the evaluation metrics described in section 5.3.4 are
computed for each iteration. The performance metrics from all k iterations are
then aggregated to provide an overall assessment of the model’s performance.
Averaging the errors yields an overall error measure that typically will be more
robust than single measures [62].
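A sketch of 20-fold cross-validation with scikit-learn is given below. One caveat worth noting: if SMOTE is used, it should be applied inside each training fold (for example via imbalanced-learn's Pipeline) rather than to the whole dataset, so that synthetic samples do not leak into the validation folds:

    from sklearn.model_selection import cross_val_score

    # 20 folds, scored with ROC AUC as described in section 5.3.4.
    scores = cross_val_score(log_reg, X, y, cv=20, scoring="roc_auc")
    print(scores.mean(), scores.std())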
Sclater [31] developed a Code of Practice for Learning Analytics, which aims
to set out the responsibilities of educational institutions to ensure that LA
is carried out responsibly, appropriately and effectively, addressing the key
legal, ethical and logistical issues which are likely to arise. They grouped
legal and ethical concerns for LA into eight headings: 1) responsibility,
2) transparency and consent, 3) privacy, 4) validity, 5) access, 6) enabling
positive interventions, 7) minimizing adverse impacts, and 8) stewardship of
data. These values are relevant to EdAider’s well-being platform and will be
discussed in the context of Value Sensitive Design in the results section 6.3.
Chapter 6
Results
In this chapter, the results of the data analysis are presented in section 6.1,
the results of the performance prediction model are presented in section 6.2,
and the results of the ethical evaluation using Value Sensitive Design are
outlined in section 6.3. These three sections of the results chapter correspond
to the three research questions.
Table 6.1: Primary school: Number of students and mean well-being factor
by gender.

                        Male    Female   Other
Number of students      1042    539      80
Accomplishment          53.4    52.1     41.4
Engagement              49.0    45.7     38.3
Positive emotion        64.2    61.8     51.4
Relationships teacher   60.8    62.9     54.0
Relationships peers     65.0    68.7     59.3
Anxiety                 65.9    60.7     49.5
Depression              71.4    66.7     52.6
Workload                52.0    49.0     49.3
Overall well-being      60.2    58.4     49.5
Table 6.2: Secondary school: Number of students and mean well-being factor
by gender.
Figure 6.1: Average well-being by school and gender. The black line indicates
standard deviation.
The data shows that on average, male primary school students report higher
well-being than females across all well-being factors. This applies also to
secondary school students, with the exception of relationships where females
report higher well-being than males. Students with gender "other" report
a considerably lower level of well-being across all 8 well-being factors.
However, it is difficult to deduce a meaningful conclusion about primary
school students with gender "other" due to the limited number of data points.
Broken down by school, we see that there are 3 schools out of 8 where females
tended to report slightly higher well-being than males.
On average, primary school students report higher well-being than the older
secondary school students. This is with the exception of one primary school, which
reports lower well-being than all other schools. This school had fewer data
points than the other primary schools so did not have a significant effect on the
average well-being for primary schools. One possible reason for the difference
is that the primary school with lower well-being has older students than the
other 2 primary schools (grade 7-9 vs. grade 4-6).
Table 6.3: Mean well-being for students who are at risk of failing (Fail) vs
those who are not at risk of failing (Pass).
The next section shows the results from the model created to predict whether a
student is at risk of failing based on the feedback from their well-being surveys.
The models are evaluated using precision, recall and ROC AUC scores. The mean
accuracy and standard deviation from the mean accuracy after 20-fold
cross-validation are also included.
The remainder of the results presented therefore have SMOTE oversampling
applied to the dataset before it is used as input to the model. Table
6.5 shows the results of the logistic regression model trained on the custom
features dataset, where each row shows a different method of feature
selection. The first row shows the model with no feature selection, the second
and third show L1 and L2 regularization, and the fourth shows Recursive
Feature Elimination.
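The three feature-selection variants can be sketched as follows; the regularization settings and the number of features retained by RFE are assumptions, as the thesis does not report them:

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    # L1 regularization drives some coefficients exactly to zero (implicit selection).
    l1_model = LogisticRegression(penalty="l1", solver="liblinear").fit(X_train_res, y_train_res)
    # L2 regularization shrinks coefficients without zeroing them.
    l2_model = LogisticRegression(penalty="l2", max_iter=1000).fit(X_train_res, y_train_res)

    # Recursive Feature Elimination: repeatedly drop the weakest feature.
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
    rfe.fit(X_train_res, y_train_res)
    selected = rfe.get_support()  # boolean mask of the retained features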
Table 6.6 shows the results in the form of a confusion matrix for the logistic
regression model trained on the custom features dataset.
                     Actual Positive   Actual Negative
Predicted Positive   133               41
Predicted Negative   48                126

Table 6.6: Confusion matrix for the logistic regression model trained on the
custom features dataset.
Table 6.8: Explained variance ratio for the first five principal components
(logistic regression).
Table 6.10 shows the results in the form of a confusion matrix for the random
forest model trained on the custom features dataset.
                     Actual Positive   Actual Negative
Predicted Positive   153               21
Predicted Negative   17                157

Table 6.10: Confusion matrix for the random forest model trained on the
custom features dataset.
As presented in Section 6.2.1, PCA was also applied on the random forest
model to reduce dimensionality. Better results are seen for the random
forest, with precision, recall and accuracy scores improving slightly after
dimensionality reduction.
Feature importance for the logistic regression model was measured using both
RFE and the coefficients from the regression as explained in Section 4.4. A
positive coefficient implies that an increase in the value of that predictor is
associated with an increase in the log-odds of the outcome variable taking
on the ”at risk of failing” category. This indicates a positive correlation
between the predictor and the outcome. Conversely, a negative coefficient for
a predictor variable suggests that an increase in the value of that predictor is
associated with a decrease in the log-odds of the outcome variable being in the
”at risk of failing” category. The 5 features with the largest coefficients from
the logistic regression model (shown in figure 6.5) were: anxiety, depression,
accomplishment, relationship to teacher, and positive emotion.
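Extracting and ranking the coefficients is straightforward; a minimal sketch with assumed variable names:

    import pandas as pd

    coefs = pd.Series(log_reg.coef_[0], index=feature_names)
    # The five predictors with the largest absolute coefficients.
    print(coefs.sort_values(key=abs, ascending=False).head(5))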
Figure 6.6 shows the feature importances from the Random Forest model,
calculated using scikit-learn's
ensemble.RandomForestClassifier.feature_importances_ attribute.
These importances are computed as the mean and standard deviation of the
accumulated impurity decrease within each tree. For each decision
tree in the random forest, when a feature is used to split a node, the decrease
in impurity (measured by Gini impurity or entropy) is recorded. The more the
impurity decreases due to a split using a particular feature, the more important
that feature is considered to be. The importance scores of features are then
aggregated across all the decision trees in the random forest and normalized
for interpretation. The most important features were: accomplishment,
engagement, workload, number of surveys answered, relationship to peers,
depression, relationship to teacher.
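A sketch of how these importances (and their spread across trees) can be obtained, with assumed variable names:

    import numpy as np
    import pandas as pd

    importances = pd.Series(forest.feature_importances_, index=feature_names)
    # Standard deviation of the impurity-based importances across the individual trees.
    std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
    print(importances.sort_values(ascending=False).head(7))

A known caveat of impurity-based importance is that it can favour high-cardinality features; permutation importance (sklearn.inspection.permutation_importance) is a common complement.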
Figure 6.6: Feature importance from random forest model using mean
impurity decrease (custom features dataset)
Secondly, the results of the prediction model show that well-being has a
significant impact on academic performance. By monitoring students’ well-
being, educators can identify factors that may be hindering their academic
progress and implement strategies to improve student engagement, motivation,
and overall performance. Monitoring well-being also goes beyond academic
performance. It encompasses various aspects of students’ lives, including
emotional, social, and physical well-being. By tracking these areas, educators
can support students in their personal growth and development, fostering
resilience, self-esteem, and a positive self-image.
Through the dashboard, teachers can efficiently get insights into student
well-being. Data analysis of
and resources, and empowers students by providing access to support from
their teachers when they need it.
There is also a risk that relying solely on well-being data to keep track of
student well-being may lead to a reduction in human interaction and personal
connection with students. Automation complacency occurs when automation
output receives insufficient monitoring and attention, usually because that
output is viewed as reliable [63]. This could lead to a risk that teachers
and schools become over-reliant on the results of the technology, and perhaps
even to de-skilling in the long term. The tool should therefore not be seen as
a replacement for resources such as counselling and one-to-one check-ins at
school, but rather as a complementary tool. It is important for educators to
provide holistic support beyond the tool and to be aware of the limitations of
technology in addressing complex well-being issues.
As discussed further in the limitations section, the well-being data was self-
reported, and the grades data was an assessment by the students’ teacher
rather than an official state exam grade, which raises questions about the
validity of the data. The amount of data used for the grade prediction model
was also limited to one secondary school, meaning that the results cannot
be generalised for other schools. When the data analysis and prediction
model uses unreliable or invalid data and measures, it may lead to inaccurate
representations of student well-being, potentially resulting in inappropriate
interventions or missed opportunities to support students in need.
Failure to address these risks of invalid data and algorithmic bias can
undermine the effectiveness of EdAider’s technology, and reduce trust in
the technology in the long-term. Additionally, since the technology requires
regular monitoring, teachers may face an increased workload in terms of data
analysis, interpreting results, and implementing appropriate interventions.
This puts pressure on already strained educational resources.
Value conflicts may arise around the collection and analysis of well-being data.
Students and parents may be concerned about data
privacy, demanding strict privacy measures and control over how the data is
used. EdAider, on the other hand, may argue that data sharing is necessary
for analysis and improvement of their products. Schools may expect EdAider
to take full responsibility for data security and privacy, while the company
may place some responsibility on schools for proper implementation and data
handling. Schools and teachers may have different expectations regarding how
the collected data is stored, secured, and used. Conflicts may arise when there
are divergent perspectives on data governance and the extent of control schools
should have over the data.
Value conflicts can also arise regarding the validity of the data analysis
and prediction model results. Teachers and schools may have their own
insights and professional judgement when interpreting the analysis or
recommendations generated by the technology, which may conflict with or
challenge the technology’s findings. Teachers may question the validity and
reliability of the technology’s analysis, leading to conflicts regarding the
appropriate weight given to automated assessments versus their professional
judgement. Teachers may prioritize their professional judgement and autonomy
in assessing and supporting student well-being, while EdAider may emphasize
the use of data-driven algorithms and standardized approaches, potentially
limiting teachers’ discretion.
Chapter 7
Discussion
Another important finding was the relationship between certain well-being
dimensions outlined in section 6.1.2. Students reported anxiety/depression as the
most closely correlated dimensions, followed by engagement/accomplishment
and positive emotion/depression. These results align with existing literature
[66] that highlights the interconnected nature of well-being dimensions.
Anxiety and depression often co-occur and can have a reciprocal relationship,
where higher levels of anxiety can contribute to increased levels of depression
and vice versa. The correlation between engagement and accomplishment
suggests that students who feel more engaged in class tend to have a better
sense of accomplishment with their studies. It is important to note that the
correlation analysis provides insights into the statistical relationship between
well-being dimensions but does not establish causal relationships. Further
research is needed to explore the underlying mechanisms and dynamics
between these dimensions to gain a deeper understanding of their interactions.
The correlation analysis of well-being dimensions contributes to a more
comprehensive understanding of student well-being and provides valuable
information for developing targeted interventions and support systems in
educational settings. By understanding the interplay between different well-
being dimensions, educators and school administrators can target specific
areas for improvement.
In cases such as this one where the dataset is small and the number of time
series samples is limited and irregular, simpler feature extraction techniques
or manual feature engineering are more suitable. This approach allows
for more control over the feature selection process, focusing only on the
most relevant and informative aspects of the time series data. Manual
feature extraction techniques in this case included basic statistical measures
(e.g., mean, standard deviation) and domain-specific measures that are less
computationally intensive compared to tsfresh.
PCA
The principal component analysis transformed the Tsfresh features into a
lower-dimensional space. While PCA can be effective in reducing the
dimensionality of the data and capturing its variability, it comes with a
trade-off: the interpretability of single features is lost in the process, making
it harder to relate the model's predictions back to the original well-being features.
Prediction accuracy
For the custom features dataset, the logistic regression model achieved a
ROC AUC score between 0.8-0.85. This means it could distinguish between
students who were at risk of failing and students who were not at risk of
failing in 80-85% of cases. The random forest model achieved a ROC AUC
score of 0.85-0.86. Both models also achieved high precision and recall
scores, indicating their ability to correctly predict positive instances (students
at risk of failing). The ROC AUC scores also show good discrimination
ability of the models. These results mean that reported well-being can be a
reasonably good predictor of whether a student is at risk of failing. However,
several limitations of the model must be taken into consideration. Firstly, it is
difficult to establish the causal relationship between well-being and academic
performance. This model uses well-being data as a predictor of performance,
but it is also very possible that poor performance can negatively impact student
well-being. Secondly, although the model has a good classification accuracy,
the model still wrongly classifies students as being at risk of failing in 15-
20% of cases. The negative impacts of incorrectly classifying a student using
this model must be considered carefully. These risks are discussed further in
Section 7.3.
For the tsfresh features, both the logistic regression and random forest
models achieved higher levels of prediction accuracy compared with the
custom features. However, it is essential to strike a balance between
prediction accuracy and the interpretability of the features.
Feature importance
• Bias and Fairness: Care should be taken to ensure that the data used
in the analysis is representative and does not perpetuate or reinforce
existing biases or inequalities. Biases in the data could lead to
biased predictions and exacerbate existing disparities in educational
opportunities. It is essential to consider and include non-binary
gender students and other underrepresented groups in data analysis and
predictive models to ensure that educational interventions are inclusive
and equitable for all students.
Chapter 8
Conclusion
In this chapter the research questions are answered along with some
reflections, and suggestions for future work are presented.
3. Carry out an ethical evaluation of the data analysis and grade prediction
model using Value Sensitive Design.
A Value Sensitive Design approach was used to discuss the benefits and
risks to the stakeholders of EdAider's technology in Section 6.3.
With new AI and machine learning breakthroughs every few months, and a lot
of money at stake, developers of educational technology are often scrambling
to release new products before their competitors, meaning that ethical
considerations are often an afterthought. This can have harmful implications
for many stakeholders, resulting in conflicts and negative consequences for the
company itself if their reputation becomes damaged by engaging in unethical
practices. Learning how to incorporate ethical values into Learning Analytics
technology using techniques such as Value Sensitive Design is therefore
crucial.
The academic performance data was also limited to one indicator given by the
teacher near the end of the academic year. The teacher’s assessment may be
subject to bias because of student behaviour or previous results. For this reason
it would be beneficial to investigate official state examinations along with the
teacher’s assessment. It would also be beneficial to investigate the assessments
at more regular time intervals, as students' performance may change over
time. Following students over an extended period would allow researchers to
see how changes in well-being dimensions relate to fluctuations in academic
performance and vice versa. Longitudinal data can help identify critical
periods of vulnerability or resilience in students’ well-being and academic
trajectories.
References
[10] (2023) Society for Learning Analytics Research. [Online]. Available:
https://www.solaresearch.org/about/what-is-learning-analytics/

[16] D. Sansone, "Beyond early warning indicators: High school dropout and
machine learning," Oxford Bulletin of Economics and Statistics, vol. 81,
no. 2, pp. 456–485, 2019. doi: 10.1111/obes.12277. [Online]. Available:
https://onlinelibrary.wiley.com/doi/abs/10.1111/obes.12277

Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/capr.12227

[36] Big Data and Learning Analytics in Higher Education. ISBN 978-3-319-06519-9.
[Online]. Available: https://doi.org/10.1007/978-3-319-06520-5

[40] F.-J. Yang, "An extended idea about decision trees," in 2019
International Conference on Computational Science and Computational
Intelligence (CSCI). IEEE, 2019, pp. 349–354.

[42] S. Tangirala, "Evaluating the impact of gini index and information gain
on classification using decision tree classifier algorithm," International
Journal of Advanced Computer Science and Applications, vol. 11, no. 2,
pp. 612–619, 2020.

[43] Y.-Y. Song and L. Ying, "Decision tree methods: applications for
classification and prediction," Shanghai Archives of Psychiatry, vol. 27,
no. 2, p. 130, 2015.

[62] C. Bergmeir and J. M. Benítez, "On the use of cross-validation for time
series predictor evaluation," Information Sciences, vol. 191, pp. 192–213, 2012.