
Quality issues in Machine Learning Software Systems
Pierre-Olivier Côté, Amin Nikanjam, Rached Bouchoucha, Foutse Khomh
SWAT Lab., Polytechnique Montréal, Québec, Canada
{pierre-olivier.cote,amin.nikanjam,rached.bouchoucha,foutse.khomh}@polymtl.ca

Abstract—Context: An increasing demand is observed in various domains to employ Machine Learning (ML) for solving complex problems. ML models are implemented as software components and deployed in Machine Learning Software Systems (MLSSs). Problem: There is a strong need for ensuring the serving quality of MLSSs. False or poor decisions of such systems can lead to malfunction of other systems, significant financial losses, or even threats to human life. The quality assurance of MLSSs is considered a challenging task and is currently a hot research topic. Moreover, it is important to cover all the various aspects of quality in MLSSs. Objective: This paper aims to investigate the characteristics of real quality issues in MLSSs from the viewpoint of practitioners. This empirical study aims to identify a catalog of bad practices related to poor quality in MLSSs. Method: We plan to conduct a set of interviews with practitioners/experts, believing that interviews are the best method to retrieve their experience and practices when dealing with quality issues. We expect that the catalog of issues developed at this step will also help us later to identify the severity, root causes, and possible remedies for quality issues of MLSSs, allowing us to develop efficient quality assurance tools for ML models and MLSSs.

Index Terms—Machine Learning based Software Systems, Quality Assurance, Quality Issues, Interview.
I. INTRODUCTION

Nowadays, Machine Learning Software Systems (MLSSs) have become a part of our daily life (e.g., recommendation systems, speech recognition, face detection). An increasing demand is observed in various companies to employ Machine Learning (ML) for solving problems in their business. Typically, an MLSS receives data as input and employs ML models to make intelligent decisions automatically, based on patterns, associations, and knowledge learned from data [1]. Therefore, ML models are implemented as software components integrated with other subsystems in MLSSs, and like other software systems, they require quality assurance. Given the growing importance of MLSSs in today's world, there is a strong need for ensuring their serving quality. False or poor decisions of such systems can lead to malfunction of other systems, significant financial losses, or even threats to human life [2].

The quality assessment of MLSSs is regarded as a challenging task [3] and is currently a hot research topic [4], [5]. Recently, some research works on the quality of MLSSs have been proposed to cover all the different aspects of quality [6], [7], i.e., not only prediction accuracy. In this paper, together with our industrial partner, we plan to investigate the characteristics of real-world quality issues in MLSSs from the viewpoint of practitioners, identifying a list of bad practices related to poor system/model quality. This is a prerequisite for a comprehensive quality assessment of MLSSs: Zhang et al. already acknowledged the lack of such empirical studies and asserted that studying the prevalence of poor models among deployed ML models would be valuable [8]. This study will cover all relevant quality factors, such as performance (accuracy), robustness, explainability, scalability, hardware demand, and model complexity. We plan to conduct a set of interviews with practitioners/experts, believing that interviews are the best method to retrieve their experience and practices when dealing with quality issues. We expect that the catalog of issues developed at this step will also help us later to identify the severity, root causes, and possible remedies for quality issues of MLSSs, allowing us to develop efficient quality assurance tools for ML models and MLSSs. We present in the following the proposed methodology to achieve our objectives.
II. RELATED WORK

In 2015, Sculley et al. shared, through a seminal work, the challenges that Google faced while building, deploying, and maintaining ML models [9]. Following this work, many other studies tried to characterize quality issues in MLSSs [10]–[17]. Van Oort et al. mined open-source GitHub repositories to aggregate a list of the most common maintenance-related modifications in Deep Learning (DL) projects [15]. From this list, they extracted 5 code smells in DL systems. Furthermore, using a survey, they measured how prevalent and problematic these code smells are from the point of view of practitioners. Similarly, Dilhara et al. [18] mined 26 open-source MLSSs and identified 14 refactorings and 7 new technical debt (TD) categories specific to ML. Instead of finding new quality issues, Alahdab et al. shared with the scientific community how TD types appear in the early phases of an industrial DL project [11]. Facing the increasing number of papers regarding technical debt and code smells in MLSSs, Bogner et al. attempted to aggregate the knowledge on these topics by performing a systematic mapping study, which presented 4 new types of technical debt and 72 antipatterns along with 46 solutions [12]. A similar work has been done by Washizaki et al., but consulting grey literature as well [10]. Other works indirectly contributed toward increasing the knowledge on quality issues in MLSSs, such as the study of Nahar et al. [17], which looked at collaboration challenges while building ML systems. In reaction to the growing concern about quality issues in MLSSs, researchers developed tools such as the Data Linter [19] or the data validation component of TFX [5] to automate quality assurance processes. Another group of researchers adopted a different approach and instead shared a checklist of tests to assess the production readiness of MLSSs [7]. In a similar fashion, Studer et al. proposed a process model for the development of ML applications with a quality assurance methodology [6]. While there is a growing number of studies on quality issues in MLSSs, a significant portion of them is produced by large software enterprises such as Google or Meta, which limits the generalization of the findings [12]. We believe that there is a need for a study presenting the quality issues of MLSSs encountered by practitioners from different backgrounds and company sizes.
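To make concrete the kind of automated check performed by tools such as the Data Linter, consider the following minimal sketch. It is our own illustration, not code from the Data Linter or TFX; it assumes pandas, and the thresholds and checks are hypothetical choices.

import pandas as pd

def basic_data_checks(df: pd.DataFrame) -> list:
    """Return human-readable warnings about common data quality problems."""
    warnings = []
    for col in df.columns:
        # Missing values are a frequent source of silent training failures.
        missing = df[col].isna().mean()
        if missing > 0.05:  # hypothetical threshold
            warnings.append(f"{col}: {missing:.0%} missing values")
        # A constant column carries no signal and often indicates a broken pipeline.
        if df[col].nunique(dropna=True) <= 1:
            warnings.append(f"{col}: constant column")
    # Duplicate rows can leak between training and evaluation splits.
    duplicated = df.duplicated().mean()
    if duplicated > 0:
        warnings.append(f"{duplicated:.0%} duplicated rows")
    return warnings

# Example usage with a hypothetical training set:
# print("\n".join(basic_data_checks(pd.read_csv("train.csv"))))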
III. STUDY DESIGN

A. Objectives and Research Questions

The goal of this study is to provide a detailed analysis of quality issues (including data and model issues) in MLSSs. We believe that interviewing people who have actually experienced these issues is an effective way to gain that knowledge. Thus, we will proceed this way, and we define the following Research Questions (RQs):

RQ1: What are the quality issues encountered by practitioners when building MLSSs, and which ones are the most prevalent? For future works to solve quality issues, they must know which issues exist. While there is some literature covering some quality issues [6], [9], [17], we believe that a lot is still unknown. In this study, we aim to share with the research community the quality challenges encountered by practitioners when building ML systems. This includes issues related to data, models, and other components in MLSSs.

RQ2: What are the root causes, symptoms, and consequences of quality issues? To understand quality issues in ML systems, it is not sufficient to know that they exist; we must also know their root causes, symptoms, and potential consequences. We hope that the answer to this question will guide future work towards solving the most pressing issues.

RQ3: How are the quality issues currently handled by practitioners? Once quality issues have been detected (e.g., in data or models), we expect practitioners to have put in place mechanisms to mitigate or at least attenuate their consequences. We are interested in understanding the mitigation approaches currently implemented in the industry.

RQ4: In the case of data, which data types and collection processes are the most challenging in terms of data quality? The training data ingested by ML models comes in many forms. Face recognition systems use images, while stock prediction applications can use numbers/amounts (like time-series data) or even text. Each data type comes with its own data quality challenges when training ML models. By answering this RQ, future research will be guided towards the most pressing data quality challenges faced by practitioners. The authors of [20] mention that data collection processes, i.e., the processes of gathering data, may affect the quality of data. For example, one could choose to train a model on public datasets, or manually acquire more data with data collectors. Each of these processes has different challenges, which may lead to different data quality issues. In this study, we want to identify the data collection processes that are most prone to data quality issues.

RQ5: What are the challenges of data quality assurance during model evolution? Many MLSSs face rapidly changing/non-stationary data, adversarial input, or differences in data distribution (concept drift), e.g., content recommendation systems and financial ML applications. Hence, the quality of the deployed model may decrease over time and consequently affect the performance of the whole system. Therefore, the robustness and the accuracy of the model's predictions must be assessed frequently in production. Actively monitoring the quality of the deployed model in production is crucial to detect performance degradation and model staleness.
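To illustrate the kind of monitoring RQ5 refers to, the sketch below tracks a deployed model's accuracy over a sliding window of labeled production samples and flags staleness when it drops below the validation baseline. This is our own illustrative sketch, not a tool used in this study; the window size, tolerance, and feedback loop are hypothetical.

from collections import deque

class StalenessMonitor:
    """Track accuracy on a sliding window of labeled production samples."""

    def __init__(self, baseline_accuracy, window=500, tolerance=0.05):
        self.baseline = baseline_accuracy      # accuracy observed at validation time
        self.outcomes = deque(maxlen=window)   # 1 = correct prediction, 0 = wrong
        self.tolerance = tolerance             # tolerated drop before raising an alert

    def record(self, prediction, label):
        self.outcomes.append(1 if prediction == label else 0)

    def is_stale(self):
        # Only judge once the window is full, to avoid noisy early alerts.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        current = sum(self.outcomes) / len(self.outcomes)
        return current < self.baseline - self.tolerance

# Hypothetical usage, assuming a feedback loop that yields (prediction, label):
# monitor = StalenessMonitor(baseline_accuracy=0.91)
# for prediction, label in production_stream():
#     monitor.record(prediction, label)
#     if monitor.is_stale():
#         trigger_retraining()  # hypothetical remediation hook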
B. Industrial Partner

MoovAI (https://moov.ai/en/) is a Montreal-based company in Canada that develops AI/ML quality assurance solutions to address practical needs in various businesses. MoovAI's experts guide their customers (i.e., companies) in taking advantage of these cutting-edge technologies, regardless of their level of maturity in data science. Nowadays, various sectors of industry are and will be developing ML models in their systems, e.g., the energy sector, the financial sector, supply chain recommendations, and medical diagnosis and treatment. As ML-based technologies become more widespread, the quality demands on ML become more important. Poor-quality models are a potential barrier to exploiting ML in real-world applications, and more and more companies are asking for qualified ML systems [21]. A study showed that 87% of ML proofs of concept never make it to production [22], usually due to a lack of quality.

Currently, MoovAI is using a model validation tool to validate ML models prior to deployment in the real world [23]. This tool includes preliminary validation methods to assess the overall quality of an ML model; it evaluates the accuracy, stability, biases, and sensitivity of ML models. Now, MoovAI aims to push this tool forward, to develop advanced methods for ensuring the quality of not only ML models but also ML-based systems (i.e., software systems containing ML components), and to develop a stand-alone model validation platform. Experts at MoovAI are now looking for a comprehensive quality evaluation tool to assess and monitor the quality of ML models during their whole life cycle: from data collection to development, deployment, and maintenance.

C. Participants

We plan to interview at least 50 participants, which is more than previous similar studies [24], [25]. The interviews will occur in two rounds. In the first round of the study, we will interview the employees of our industrial partner (i.e., MoovAI). They are Data Scientists, Machine Learning Engineers, Data Engineers, and Project Managers of ML projects. They have worked on many different ML projects for different clients; thus, we think that they are able to provide a global view of the quality issues encountered in the industry. We expect to be able to conduct at least 25 interviews with MoovAI's practitioners. In the second phase of the study, we plan to interview practitioners from other companies. We will adopt 4 strategies to recruit participants.

• Personal contacts: We will start recruiting people for interviews through our personal contacts, since this usually has a higher response rate than cold emails. We will contact industrial partners in our network. For example, we plan to reach out to companies that are partners in Software Engineering for Artificial Intelligence (https://se4ai.org/), a training program co-created by our lab, which includes companies such as IBM, Ericsson, and Cisco. We will also use LinkedIn (https://www.linkedin.com/) to find qualified experts who may have relevant expertise for this project. Based on the results of a similar study [24], we expect at least 5 positive responses.

• Q&A websites: Questions and Answers websites are platforms on which a lot of knowledge is shared. Following similar previous work [24], we will search for practitioners with meaningful ML experience who are willing to be interviewed on quality issues in ML systems. We chose to search on Stack Overflow (https://stackoverflow.com/) and Data Science Stack Exchange (https://datascience.stackexchange.com/), because both of these websites are heavily used by the ML community for question answering. We will reach out to the top askers and answerers ("askers" and "answerers" are part of the terminology used by the Q&A websites to describe people asking and answering questions), since they have shown significant involvement in the ML community and most likely have expertise. In order to find them, we will proceed differently on the two platforms: 1) on Data Science Stack Exchange, we will simply look for the top answerers and askers of the platform on any topic, and 2) on Stack Overflow, a different strategy must be adopted, since the website also hosts questions regarding Software Engineering in general. Thus, we will search for the top askers and answerers of topics related to quality issues in ML systems, using tags. Similar to [24], we selected the tags data-cleaning, dataset, machine-learning, and artificial-intelligence. We will search for practitioners on Data Science Stack Exchange and on Stack Overflow using these tags (1 overall ranking for Data Science Stack Exchange + 4 tags on Stack Overflow, 5 sources in total; see the sketch after this list). Top users are ranked in two categories, 'Last 30 Days' and 'All Time'; we will pick the first 10 users from both. In total, we will send 100 emails. Based on the results of a similar study [24], we expect to meet 10 interviewees.

• Social networks: Social networks are platforms on which users may engage in conversations on a wide range of topics. Some of them host discussions on ML and Artificial Intelligence in general. We believe users with substantial ML expertise can be found on these websites. We are planning to post an invitation for interviews in the deep learning and ML communities of Reddit (https://www.reddit.com/), similar to our previous works [26]. We expect at least 5 positive responses.

• Freelance platforms: Following previous work [24], we plan to use freelance platforms to find practitioners with ML expertise if all the other techniques have been used and we have not yet reached theoretical saturation [27]. We will follow the methodology of similar studies such as [24] to select candidates for interviews.
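As an illustration of the Q&A selection procedure above, the sketch below queries the public Stack Exchange API for the top answerers of a tag. This is our own sketch: the endpoint and parameter names should be checked against the API documentation at https://api.stackexchange.com/docs, and API keys, rate limiting, and the corresponding top-askers query are omitted.

import requests

API = "https://api.stackexchange.com/2.3"

def top_answerers(tag, site, period="all_time", n=10):
    """Return the display names of the top answerers for a tag.

    period is 'all_time' or 'month' (the API's name for 'Last 30 Days').
    """
    url = f"{API}/tags/{tag}/top-answerers/{period}"
    resp = requests.get(url, params={"site": site, "pagesize": n})
    resp.raise_for_status()
    return [item["user"]["display_name"] for item in resp.json()["items"]]

# Candidates for the Stack Overflow tags used in this study:
# for tag in ["data-cleaning", "dataset", "machine-learning", "artificial-intelligence"]:
#     print(tag, top_answerers(tag, site="stackoverflow"))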
D. Interview Process

Since the subject of quality issues is not mature and still has room for research work, we will follow a research procedure suited for exploratory work: Straussian Grounded Theory [28]. It is a research method in which data collection and data analysis are executed in an iterative manner until a new theory emerges from the data. As opposed to many deductive approaches, where a theory is first conceptualized and then tested through experiments, Grounded Theory goes the opposite way, inductively generating a theory from the data. The knowledge gathered from the data using open and axial coding should guide the sampling process, a concept named theoretical sampling. Data collection stops when theoretical saturation is met: when the understanding of the subject is complete and new data does not invalidate the emerging theory. Grounded Theory is often used when little is known about a phenomenon, because of its flexibility and its aptitude for the discovery of unknown concepts. It has been used in similar studies, such as [17].

Because we are doing exploratory work, the interviews have to be structured in a way that allows the interviewee to share knowledge we might not be aware of. For this reason, we will be conducting semi-structured interviews, similar to [24]. We devised an interview guide to help the interviewer cover every relevant topic. It is composed mostly of open-ended questions, to allow the interviewee to direct us towards interesting information. It is the responsibility of the interviewer to ask follow-up questions when the respondent touches upon a subject relevant to the study. To avoid the interviewer being overwhelmed with their tasks and having difficulty asking the right follow-up questions, each interview will be conducted by two persons. This follows the recommendation of previous works, which show that interviewees share more information when two interviewers are present rather than one [29]. The second interviewer will be tasked with helping the primary one ask follow-up questions and with transcribing the interview with the support of the tool Descript (https://www.descript.com/transcription), an automated speech recognition tool that can transform the audio stream of an online meeting into text. The role of the second interviewer will simply be to correct the mistakes in the tool's transcription.

The interview guide will be subject to change as our knowledge of the topic grows, since we are following Grounded Theory. During its writing, we considered quality issues elicited by other studies [7], [9], [10], [12], [25]. We also took care to ask questions that have the potential to cover the different quality dimensions defined by [30]. We will complete our interview guide with questions drawn from a similar study [24]. In order to assess the quality of our interview guide, we will conduct a pilot of our study. We will purposefully select 5 participants with diverse experience (i.e., Data Scientists, Data/ML Engineers, and AI project managers). Doing so, we will be able to verify that our questions are unambiguous, precise, and able to answer our research questions effectively. A few days prior to the interview, we will share a summary of the interview content with the participants so they can be familiar with the objectives of the study.

Interviews will take place in English or French, depending on the preference of the interviewee. We will start by giving a brief overview of the project, followed by a quick round of introductions. Then we will ask participants for some background information regarding their experience in ML. The interview will officially start with a general question: "What are the main quality issues you have encountered with your data/model so far?" By asking an open-ended question, we allow the interviewee to share the experiences they are most confident talking about. Then we will probe the interviewee's experience in an attempt to discover quality issues. We will cover each phase of an ML workflow as described by [13], in order to exhaustively search for situations where quality issues might occur. As a difference, we do not include the model requirement phase in our study, and we chose to merge data cleaning with feature engineering and data labeling with data collection, for conciseness.

• Data collection: We ask for experiences with different data collection processes: data collectors, automatically generated data, public datasets, external services (e.g., a weather API), or predictions of another model (effectively creating cascading models). For each of them, we search for issues the interviewee may have experienced while collecting the data, along with the solutions they put in place.

• Data preparation: Notably, we ask about pain points when preparing data for ML and about tools to automate the process. We consider any challenge related to the cleaning and transformation of the data.

• Model evaluation: We probe for potential problems that the interviewee experienced when evaluating the quality of their model. For example, we ask if the respondent ever faced a situation where the model performed poorly for some group of people, potentially leading to fairness issues.

• Model deployment: We gather general information about the process by which models are put into production. Then, we search for issues encountered at this step. For example, one question is: "Did you ever deploy a model that performed well locally but poorly once deployed?"

• Model maintenance: We ask the interviewee how they ensure that the quality of their models remains the same after deployment. We specifically ask about past instances of model staleness and how they were handled.

At the end of every section, we ask for any other issue that the interviewee may have encountered at this step of the workflow, in case we missed something. We will conclude the interview with the open question: "In your opinion, what is the most pressing quality issue researchers should work on in an attempt to solve the problem?" The answer to that question might provide interesting directions for future work, and this practice follows Harvard's best practices for qualitative interviews (https://sociology.fas.harvard.edu/files/sociology/files/interview_strategies.pdf). The interview guide is available online (https://github.com/poclecoqq/quality_issues_in_MLSSs).

Prior to starting the interview, we will ask the respondent for permission to record and share the transcription. In order to follow ethical guidelines, we validated our research project with the Research Ethics Committee at Polytechnique Montréal (https://www.polymtl.ca/recherche/la-recherche-polytechnique/exigences-deontologiques) and got their approval. We will anonymize the respondents of the interviews in the transcripts in order to respect their privacy.

E. Questionnaire

After the interviews have been conducted and analyzed, we expect to have a set of potential quality issues. In order to validate the quality of our findings, we will investigate their prevalence in real MLSSs. Owners of such systems will be contacted and asked to answer a questionnaire in which the quality issues are presented. The questionnaire will be built using Google Forms (https://www.google.ca/forms/about/). We will reach out to owners of MLSSs by contacting MoovAI's clients. We expect to have answers from at least 20 respondents from 4 different companies. Because we are aware that they might not have a profound understanding of ML, the form will describe the symptoms of the potential quality issues along with their technical description. The respondents will evaluate on a Likert scale [31] how often this kind of problem happens in their experience. In the case of a positive response, they will be invited to share more details about the issues they experienced and to describe their consequences in MLSSs.
F. Analysis Plan

For all the research questions except RQ4, we will use the coding techniques from Straussian Grounded Theory [32] to extract knowledge from the data. When an interview transcript is coded for the first time, we will use open coding [33] to break it into discrete parts. These codes will then be reorganized and grouped into categories through a step of axial coding. This process allows the researchers to analyze their data at a higher level and thus gain a better understanding of it. In the final rounds of coding, the central theme around which all categories relate is established, in a procedure called selective coding. While the current document is written as if data collection and analysis were separate, it is important to understand that these two steps will happen in an iterative manner. We expect to write memos of preliminary categories throughout the process of data collection and analysis. They will contribute to the emergence of our theory through phases of memo sorting.

For RQ4, because the question is less open-ended, we will adopt a simpler coding strategy. The codes will be the dataset quality dimensions defined in [30] and the data collection processes we encounter in our interviews. We will sum the codes throughout the interviews, taking duplicates into account: e.g., when two interviewees are on the same team and share the same problem.

In order to measure the most prevalent quality issues and effectively answer RQ1, we will average the results (which are on a Likert scale) from the questionnaire.
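As a concrete illustration of this aggregation, the following minimal sketch assumes pandas and the usual 1-5 Likert encoding; the issue names and scores are invented for illustration only.

import pandas as pd

# Hypothetical questionnaire export: one row per respondent,
# one column per quality issue, values on a 1-5 Likert scale
# (1 = never happens, 5 = happens very often).
responses = pd.DataFrame({
    "data_drift":      [4, 5, 3, 4],
    "label_noise":     [2, 3, 2, 1],
    "model_staleness": [5, 4, 4, 5],
})

# Average score per issue, sorted so the most prevalent issues come first.
prevalence = responses.mean().sort_values(ascending=False)
print(prevalence)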
To help us code the transcripts, we will use the Delve qualitative analysis tool (https://delvetool.com/), because the researchers are familiar with it and it is easy to use. Delve is a computer-assisted qualitative data analysis software (CAQDAS) that provides simple interfaces to code and analyze data. In order to ensure the quality of the analysis, each document will be coded by two researchers. In case of a disagreement on codes, a third researcher will play the role of moderator and will select the final code for a text segment. This process will be helpful for the construction of a shared understanding of the data.
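The protocol above does not prescribe an agreement statistic, but one common complement to this double-coding procedure is to quantify inter-rater agreement, e.g., with Cohen's kappa. A minimal sketch, assuming scikit-learn is available; the codes shown are hypothetical.

from sklearn.metrics import cohen_kappa_score

# Codes assigned to the same ten transcript segments by two researchers.
coder_a = ["data", "model", "data", "process", "data",
           "model", "process", "data", "model", "data"]
coder_b = ["data", "model", "data", "data", "data",
           "model", "process", "data", "model", "model"]

# Kappa corrects raw agreement for the agreement expected by chance.
kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above ~0.6 are often read as substantial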
On completion of the study, we will share a replication package with the public. This package will contain the interview guide, the anonymized transcriptions of the interviews, and any discussion or analysis between the contributors that could help replicate the study.

G. Threats to validity

Using some of the validity threats described in [34], we divide the analysis of the limitations of the study into four categories: internal validity, external validity, construct validity, and conclusion limitations. The first refers to the degree of confidence that the findings are trustworthy. For example, confounding factors or a lack of scientific rigor could hinder the quality of the results. In our case, we identified four potential threats to internal validity. First, there is a risk of some quality issues being over-represented. In fact, we are conducting semi-structured interviews; some of the conversations with the interviewee will be improvised. Thus, it is possible that the findings are tainted by the preconceptions of the interviewer about the potential quality issues in MLSSs. For example, there could be more emphasis on explainability issues in the interview, which may lead to an over-representation of these issues in our findings. Second, it is possible that the questions asked when the interviews are in English are not exactly the same as the ones asked when the interviews are in French. While our primary interviewers have a good understanding of both languages, their choice of words might convey slightly different meanings, leading to bias. Third, the coding of the interviews may be prone to researcher bias, because it partially relies on subjective interpretation. To mitigate this bias, each interview will be coded by two researchers, and inconsistencies in the codes will be resolved by a third researcher. Fourth, there is a risk that our findings for RQ2 are inaccurate because of our data collection method. Asking practitioners about the root cause of quality issues may lead to misunderstandings or even misinformation, in case they do not want to admit their mistakes. To mitigate this issue, we will cross-check our findings with MoovAI's knowledge of similar cases.

For external validity, we identified two potential ways in which our sample of practitioners could be unrepresentative of the industry. First, we expect a majority of the interviewees to be practitioners from MoovAI. Their clients are companies of medium to large size. Thus, our results may not reflect the reality of very small companies, such as startups, or of larger software enterprises, such as Google, Microsoft, and the like. Second, because the study follows Grounded Theory and we are doing theoretical sampling [27], it is possible that we end up focusing on some slice of the population and leaving others understudied.

Regarding construct validity, limitations of our approach could come from the keywords (tags) used for finding candidates on Q&A websites and social media. The keywords we used, data-cleaning, dataset, machine-learning, and artificial-intelligence, are sufficiently general to match many users and are therefore more likely to generate false positives (users that we would end up ignoring) than false negatives.

Conclusion limitations concern potentially misunderstood issues, missing issues, and the replicability of the study. We believe that our sources of information are sound and varied enough to be representative, and will not mislead us in our conclusions. Finally, we provide a replication package (https://github.com/poclecoqq/quality_issues_in_MLSSs) to allow for the reproducibility of our results, and also for other researchers to build on our study.

ACKNOWLEDGMENT

This work is partly funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), PROMPT, and Les Technologies MoovAI Inc.


REFERENCES

[1] D. Marijan, A. Gotlieb, and M. K. Ahuja, "Challenges of testing machine learning based systems," in 2019 IEEE International Conference on Artificial Intelligence Testing (AITest), pp. 101–102, IEEE, 2019.
[2] H. Foidl and M. Felderer, "Risk-based data validation in machine learning-based software systems," in Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation, pp. 13–18, 2019.
[3] H. B. Braiek and F. Khomh, "On testing machine learning programs," Journal of Systems and Software, vol. 164, p. 110542, 2020.
[4] F. Khomh, B. Adams, J. Cheng, M. Fokaefs, and G. Antoniol, "Software engineering for machine-learning applications: The road ahead," IEEE Software, vol. 35, no. 5, pp. 81–84, 2018.
[5] E. Breck, N. Polyzotis, S. Roy, S. Whang, and M. Zinkevich, "Data validation for machine learning," in MLSys, 2019.
[6] S. Studer, T. B. Bui, C. Drescher, A. Hanuschkin, L. Winkler, S. Peters, and K.-R. Müller, "Towards CRISP-ML(Q): A machine learning process model with quality assurance methodology," Machine Learning and Knowledge Extraction, vol. 3, no. 2, pp. 392–413, 2021.
[7] E. Breck, S. Cai, E. Nielsen, M. Salib, and D. Sculley, "The ML test score: A rubric for ML production readiness and technical debt reduction," in 2017 IEEE International Conference on Big Data (Big Data), pp. 1123–1132, IEEE, 2017.
[8] J. M. Zhang, M. Harman, L. Ma, and Y. Liu, "Machine learning testing: Survey, landscapes and horizons," IEEE Transactions on Software Engineering, 2020.
[9] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison, "Hidden technical debt in machine learning systems," in Advances in Neural Information Processing Systems (C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, eds.), vol. 28, Curran Associates, Inc., 2015.
[10] H. Washizaki, H. Uchida, F. Khomh, and Y.-G. Guéhéneuc, "Studying software engineering patterns for designing machine learning systems," in 2019 10th International Workshop on Empirical Software Engineering in Practice (IWESEP), pp. 49–495, IEEE, 2019.
[11] M. Alahdab and G. Çalıklı, "Empirical analysis of hidden technical debt patterns in machine learning software," in International Conference on Product-Focused Software Process Improvement, pp. 195–202, Springer, 2019.
[12] J. Bogner, R. Verdecchia, and I. Gerostathopoulos, "Characterizing technical debt and antipatterns in AI-based systems: A systematic mapping study," in 2021 IEEE/ACM International Conference on Technical Debt (TechDebt), pp. 64–73, IEEE, 2021.
[13] S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi, and T. Zimmermann, "Software engineering for machine learning: A case study," in 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 291–300, IEEE, 2019.
[14] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, and M. Young, "Machine learning: The high interest credit card of technical debt," 2014.
[15] B. van Oort, L. Cruz, M. Aniche, and A. van Deursen, "The prevalence of code smells in machine learning projects," in 2021 IEEE/ACM 1st Workshop on AI Engineering – Software Engineering for AI (WAIN), pp. 1–8, IEEE, 2021.
[16] Y. Tang, R. Khatchadourian, M. Bagherzadeh, R. Singh, A. Stewart, and A. Raja, "An empirical study of refactorings and technical debt in machine learning systems," in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pp. 238–250, IEEE, 2021.
[17] N. Nahar, S. Zhou, G. Lewis, and C. Kästner, "Collaboration challenges in building ML-enabled systems: Communication, documentation, engineering, and process," Organization, vol. 1, no. 2, p. 3, 2022.
[18] M. Dilhara, A. Ketkar, and D. Dig, "Understanding Software-2.0: A study of machine learning library usage and evolution," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 30, no. 4, pp. 1–42, 2021.
[19] N. Hynes, D. Sculley, and M. Terry, "The data linter: Lightweight, automated sanity checking for ML data sets," in NIPS MLSys Workshop, vol. 1, 2017.
[20] S. E. Whang, Y. Roh, H. Song, and J.-G. Lee, "Data collection and quality challenges in deep learning: A data-centric AI perspective," arXiv preprint arXiv:2112.06409, 2021.
[21] D. Sato, A. Wider, and C. Windheuser, "Continuous delivery for machine learning," 2019.
[22] S. Azimi and C. Pahl, "Root cause analysis and remediation for quality and value improvement in machine learning driven information models," in ICEIS (1), pp. 656–665, 2020.
[23] O. Blais, "Validate and monitor your machine learning models," 2020.
[24] N. Humbatova, G. Jahangirova, G. Bavota, V. Riccio, A. Stocco, and P. Tonella, "Taxonomy of real faults in deep learning systems," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 1110–1121, 2020.
[25] A. Serban and J. Visser, "An empirical study of software architecture for machine learning," arXiv preprint arXiv:2105.12422, 2021.
[26] A. Nikanjam and F. Khomh, "Design smells in deep learning programs: An empirical study," in 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 332–342, IEEE, 2021.
[27] K.-J. Stol, P. Ralph, and B. Fitzgerald, "Grounded theory in software engineering research: A critical review and guidelines," in Proceedings of the 38th International Conference on Software Engineering, pp. 120–131, 2016.
[28] A. Strauss and J. Corbin, Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory. SAGE Publications, 1998.
[29] S. Hove and B. Anda, "Experiences from conducting semi-structured interviews in empirical software engineering research," in 11th IEEE International Software Metrics Symposium (METRICS'05), 2005.
[30] C. Cappi, C. Chapdelaine, L. Gardes, E. Jenn, B. Lefevre, S. Picard, and T. Soumarmon, "Dataset definition standard (DDS)," arXiv preprint arXiv:2101.03020, 2021.
[31] A. Joshi, S. Kale, S. Chandel, and D. K. Pal, "Likert scale: Explored and explained," British Journal of Applied Science & Technology, vol. 7, no. 4, p. 396, 2015.
[32] A. Strauss and J. Corbin, "Grounded theory methodology: An overview," 1994.
[33] C. B. Seaman, "Qualitative methods in empirical studies of software engineering," IEEE Transactions on Software Engineering, vol. 25, no. 4, pp. 557–572, 1999.
[34] R. Feldt and A. Magazinius, "Validity threats in empirical software engineering research – an initial survey," in SEKE, pp. 374–379, 2010.