Measuring Code Maintainability With Deep Neural Networks: Yamin HU (, Hao Jiang, Zongyao HU

Front.
Comput. Sci., 2023, 17(6): 176214
https://doi.org/10.1007/s11704-022-2313-0
RESEARCH ARTICLE
Measuring code maintainability with deep neural networks
Yamin HU ( ✉)1, Hao JIANG2, Zongyao HU1

1 School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
2 School of Artificial Intelligence, Anhui University, Hefei 230601, China
Higher Education Press 2023

Abstract The maintainability of source code is a key quality inevitable because of the complexity of software and the
characteristic for software quality. Many approaches have been variability of requirements. The cost of software maintenance
proposed to quantitatively measure code maintainability. Such is high and up to 67% of the total cost during the software
approaches rely heavily on code metrics, e.g., the number of development life cycle [4]. Consequently, software
Lines of Code and McCabe’s Cyclomatic Complexity. The maintenance is essential for continuous evolution of software.
employed code metrics are essentially statistics regarding code Source code is the major artifact to modify during software
elements, e.g., the numbers of tokens, lines, references, and maintenance process [5]. In the international standard
branch statements. However, natural language in source code, ISO/IEC 5055:2021 for automated source code quality
especially identifiers, is rarely exploited by such approaches. measures [6], code maintainability is defined as:
As a result, replacing meaningful identifiers with nonsense The capability of a product to be modified by the intended
tokens would not significantly influence their outputs, although maintainers with effectiveness and efficiency.
the replacement should have significantly reduced code Code maintainability is a key concern for maintainers. To
maintainability. To this end, in this paper, we propose a novel make the modifications, the maintainers must first read
approach (called DeepM) to measure code maintainability by (readability) and understand (understandability) the source
exploiting the lexical semantics of text in source code. DeepM code, then analyze (analyzability) the source code to find the
leverages deep learning techniques (e.g., LSTM and attention modification positions, and finally may reuse (reusability) the
mechanism) to exploit these lexical semantics in measuring existing components to complete the modifications
code maintainability. Another key rationale of DeepM is that (modifiability) [7]. If source code can be modified by the
measuring code maintainability is complex and often far intended maintainers with fewer efforts to successfully
beyond the capabilities of statistics or simple heuristics. complete software maintenance activities such as fixing bugs,
Consequently, DeepM leverages deep learning techniques to the code is of high maintainability. Code maintainability is
automatically select useful features from complex and lengthy critical for measuring software maintainability and predicting
inputs and to construct a complex mapping (rather than simple software maintenance cost. Low-maintainability code would
heuristics) from the input to the output (code maintainability result in that future maintainers (who may be not familiar with
index). DeepM is evaluated on a manually-assessed dataset. the considered code) would spend much more efforts to
The evaluation results suggest that DeepM is accurate, and it understand the code than the original developers, and thus the
generates the same rankings of code maintainability as those of cost of software maintenance will rise significantly [8].
experienced programmers on 87.5% of manually ranked pairs Notably, code quality covers multiple aspects. The first and
of Java classes. foremost aspect is the functionality and performance of the
implementation (source code) with regard to its corresponding
Keywords code maintainability, lexical semantics, deep requirements, i.e., the degree to which the source code works
learning, neural networks as expected [9]. Another key aspect of code quality is the
effectiveness and efficiency with which the maintainers
1 Introduction modify the source code, i.e., code maintainability [10]. The
Software maintenance is the modification of a software second aspect (concerning human interpretation and
product after delivery to fix bugs, to improve performance or modification of the source code) is often more subjective than
other attributes, or to adapt the involved software to a new the first aspect (concerning the functionality and performance
environment [1,2]. Among these maintenance activities, 21% of the source code); thus, the second aspect is more difficult to
are corrective, 50% are perfective, 25% are adaptive, and 4% quantify.
are
others [3]. Notably, software maintenance activities are Many approaches have been proposed to quantitatively
measure the maintainability of source code. These approaches
Received May 25, 2022; accepted October 28, 2022
usually rely heavily on code metrics, e.g., Lines of Code
E-mail: ymhu@bit.edu.cn (LOC) and McCabe’s Cyclomatic Complexity (MCC) [8,11].
2 Front. Comput. Sci., 2023, 17(6): 176214
The widely used code metric is Maintainability Index (MI)

[12], which computes a maintainability index rigorously
following a well-defined formula whose inputs include LOC,
MCC, Halstead Volume, and the percent of lines of comments
(optional). Heitlager et al. [13] proposed a hierarchical model
to measure a system-level maintainability index based on the
quality model defined in the international standard ISO/IEC
9126-1 for software quality models [14]. The maintainability
index is computed by taking a weighted average of the scores
of four quality subcharacteristics of maintainability, i.e.,

analyzability, changeability, stability, and testability; the score
of each subcharacteristic is computed by aggregating the

scores of related code properties (such as volume and
complexity); the score of each code property is computed by
aggregating one or more code metrics, e.g., LOC, MCC, and
the number of assert statements. Al Dallal [8] learned a
logistic regression classifier that predicts whether a Java class
is modified heavily in terms of the number of revised lines of
code during software evolution. The input of the classifier is

19 code metrics (such as LOC and the number of methods) for
three code properties: size, cohesion, and coupling. terms in a dictionary) of source code were proposed to
Although existing approaches to measuring code measure code readability [20], these features are still
maintainability have significantly facilitated the quantitative essentially statistics regarding text in source code.
measurement of code maintainability [8,11,15], the resulting To this end, in this paper, we propose a novel approach
maintainability indexes may diverge from human perception called DeepM to quantitatively measure the maintainability of
[16–18]. Existing approaches to measuring code source code. The key rationale of DeepM is that the lexical
maintainability rely heavily on code metrics. The employed semantics of identifiers should be exploited to measure code
code metrics are essentially statistics regarding code elements, maintainability. Identifiers account for approximately 70% of
e.g., the numbers of tokens, lines, references, and branch source code (in terms of characters) [21] and serve as the
statements. For example, LOC counts source code lines, MCC major source of code comprehension [22]. However, they
counts the branch statements, CBO (Coupling Between have not yet been exploited by existing approaches for
Objects) counts classes that are coupled to a given class, and measuring code maintainability. In this paper, we leverage
RFC (Response For a Class) counts unique methods invoked deep learning techniques (i.e., word embedding [23], LSTM
in a class. Such code metrics and the approaches built on them [24], tree-LSTM [25], attention mechanisms [26], and dense
do not consider the semantics of the natural language, networks [27]) to exploit the lexical semantics of identifiers
especially identifiers. The natural language contains rich and to merge the resulting semantics, the structural features of
semantic information, which is an important source for the source code, and the statistical features of the source code
programmers to understand the business logic of programs into a single maintainability index.
[19]. Consequently, such natural language may significantly In summary, our paper makes the following contributions:
influence the quality of source code, especially its
maintainability. For example, replacing meaningful identifiers ● A deep learning-based approach to the quantitative
with nonsense tokens would significantly reduce the measurement of code maintainability. To the best of our
maintainability of the source code. However, the replacement knowledge, this is the first approach that exploits the
may not significantly influence the outputs of existing semantics of identifiers for code maintainability
approaches because the replacement does not significantly measurement.
change the underlying code metrics, e.g., the MCC, CBO, or ● A public implementation of the proposed approach and
RFC. An example is presented in Listing 1−2. Listing 1 two publicly available datasets for the training and
presents method getTimeImpl from class java.util.Date in JDK evaluation of code maintainability models. The
1.8, which normalizes the date of a Date object if necessary implementation and datasets are publicly available at
and returns the corresponding milliseconds. The method is github/DeepQuality/Deep_Maintainability.
transformed into the one in Listing 2 by replacing its ● An initial evaluation of the proposed approach. The
identifiers with nonsense tokens. The two codes have the same evaluation results suggest that the proposed approach
numbers of tokens, lines, and branch statements. However, the outperforms the state-of-the-art approaches, improving
latter is much harder to understand and maintain than the the accuracy of the results from 75.0% and 57.5% to
former because of the nonsense tokens. As a result, the 87.5%.
maintainability indexes produced by such approaches
significantly diverge from human perception. Notably, The rest of the paper is structured as follows. Section 2
although some textual features (e.g., the number of identifier introduces related work, Section 3 proposes the new approach,
Yamin HU et al. Measuring code maintainability with deep neural networks 3
Section 4 evaluates the approach, Section 5 discusses related computed by aggregating one or more code metrics.

issues, and Section 6 concludes the paper. These heuristic-based approaches employ well-defined
formulas as the mappings between the input source code and
2 Related work the maintainability indexes, and the parameters in the
In this section, we present the related work on code mappings are determined manually [13] or calculated through
maintainability measurement. The existing approaches for numerical analysis methods, such as eigenvalue extraction
measuring code maintainability usually rely heavily on code [34]. However, such mappings result in significant limitations
metrics, which are essentially statistics regarding code for code maintainability measurement. Therefore, we propose
elements. In the systematic mapping study on code metrics DeepM, which leverages deep neural networks to establish the
[28], existing code metrics are divided into five categories: mappings.
size, complexity, inheritance, coupling, and cohesion. These
2.2 Machine learning-based approaches
code metrics are frequently used to measure different aspects

To establish complex mappings between the input source code
of code maintainability of object-oriented programs. For
and maintainability indexes, many approaches based on
example, CBO measure modularity [29] and reusability [30];
machine learning techniques have been proposed [8,15,35].
RFC measures testability [31], LCOM measures modularity Malhotra et al. [35] conducted an empirical study for class
[29], WMC measures reusability [30] and modifiability [32]. maintainability prediction models. They compared the
In addition, some code metrics were specifically proposed to performance of 28 techniques for constructing mappings
measure code maintainability. For example, Alzahrani et al. between extracted code metrics from Java classes and code
[33] defined a client-based class cohesion metric to measure maintainability (i.e., high or low maintainability), e.g.,
class maintainability; the metric considers the ways that other AdaBoost, k-Nearest Neighbor, Logistic Regression with
classes use the considered class. ridge estimator (noted as LRR), and CHC adaptive search for
Single metric only measures one aspect of source code such instance selection-based k-nearest neighbors (noted as CHC).
as complexity and thus has limitations for measuring code In their experiments, LRR and CHC obtain the best
maintainability. Consequently, existing approaches leverage performance in terms of two different evaluation metrics. Al
multiple code metrics to measure code maintainability. Dallal [8] predicted class maintainability using various code
According to the way to construct mappings between code metrics. The author first collected source files from three
metrics and maintainability indexes, these approaches can be open-source Java projects, including 1,099 Java classes, then,
divided into two categories: heuristic-based approaches and constructed a dataset that contained the code metrics of the
machine learning-based approaches. source files and the corresponding labels indicating class
maintainability. There were 19 code metrics in total, which
2.1 Heuristic-based approaches

covered three kinds of code properties: size, cohesion, and

Heuristic-based approaches are implemented based on well- coupling. The labels were two kinds of indirect maintainability
defined formulas, whose inputs are dozens of code metrics and indicators: the number of revisions in which the class was
output is maintainability indexes. Notably, the forms and changed, and the number of revised (add, delete, and change)
parameters of such formulas are usually determined with the lines of code; the smaller the two numbers are, the more
help of experts. Oman et al. [12] proposed the metric of maintainable the Java classes are. Based on this dataset, the
Maintainability Index (MI), which computes maintainability authors trained a multivariate logistic regression model to
indexes based on LOC, MCC, Halstead Volume, and the predict the maintainability of the considered class.
percent of lines of comments (optional). The higher the values These machine learning-based approaches and heuristic-
of MI are, the more maintainable the corresponding source based approaches rely heavily on code metrics. However, code
code should be. Maintainability Index is easy to compute, and metrics may significantly affect code maintainability
thus it has been widely used in industry. However, it still has measurement for the following reasons. 1) Code metrics may
several problems. For example, it is easy to construct a low- diverge from the human perception of code maintainability
maintainability source code, which has the same values of the [16–18]. For example, decomposing a complex method into
three or four code metrics with another source code with high multiple smaller ones could improve code maintainability,
maintainability; thus, both the code have the same value of because it makes the resulting methods have small and clear
MI, but they have significantly different maintainability. responsibilities [17]. However, such decomposition may
In addition, some approaches employ software quality improve the value of code metric WMC (Weighted Methods
models to construct the formulas of computing code per Class) of the enclosing class (a larger WMC means higher
maintainability indexes. For example, Heitlager et al. [13] complexity), which is the sum of the MCCs of its methods
employed the software quality model defined in ISO/IEC [17]. 2) Metrics-based approaches usually rely on metric
9126-1 (a previous standard for software quality models) to thresholds to assess code maintainability; however, the
compute a system-level maintainability score. The thresholds are hard to determine [36–38]. 3) It is challenging
maintainability score is computed by taking a weighted to measure code maintainability, which is often far beyond the
average of the scores of the four quality subcharacteristics of abilities of simple code metrics. That is, such metrics may not
maintainability (i.e., analyzability, changeability, stability, and be sufficient for capturing the essence of code maintainability
testability); the score of each subcharacteristic is computed by because of the complexity of the input code. For example, the
aggregating the scores of related code properties (such as composition of source code is complex: source code may
volume and complexity); the score of each code property is contain various code elements, such as comments, methods,
4 Front. Comput. Sci., 2023, 17(6): 176214
and statements, and there are complex relationships among in JDK 1.8. It is composed of two methods and two fields only

these elements. Most importantly, the existing approaches do due to space limitations. The class Date represents a specific
not consider the semantics of text in source code. However, instant in time. The first method Date is a constructor, which
high-quality identifiers are critical for code maintainability. initializes a Date object using a number representing the
DeepM leverages deep learning techniques (rather than milliseconds since January 1, 1970, 00:00:00 GMT. The
software quality models and traditional machine learning second method normalizes the date of a Date object if
techniques) to model source code for measuring code necessary and returns the corresponding milliseconds.
maintainability and simultaneously exploits the lexical
semantics of identifiers in source code. 3.2 Lexical features

As suggested by Fig. 1, the lexical features are decomposed

3 Approach into three categories: identifiers, field declarations, and
3.1 Overview

method headers. In the following subsections (Sections 3.2.1−
An overview of the proposed approach (DeepM) is presented 3.2.3), we explain in detail what they are, why they are useful,
in Fig. 1. DeepM takes a Java class as input and generates a and how they are extracted automatically.
maintainability index for the given class. DeepM is composed
of two steps: feature extraction and deep learning-based 3.2.1 Identifiers

prediction. Feature extraction automatically extracts as many Identifiers account for approximately 70% of the source code

features as possible related to code maintainability from a Java in terms of characters [21] and contain rich semantics [41].
class. The extracted features include lexical features (i.e., The semantics are crucial for code understandability [42].
identifiers, field declarations, and method headers), structural Nonsense identifiers could significantly reduce code
features (i.e., simplified Abstract Syntax Trees (ASTs) and understandability; in this case, programmers would spend
structural format), and traditional code metrics. Lexical much more time and effort understanding the code before
features are exploited because the quality of natural language modifying it. For example, if the running example is
in source code (especially in identifiers) represents how easily transformed by replacing its identifiers with nonsense ones (as
programmers can understand the code, i.e., code shown in Listing 4), the transformed version becomes hard to
understandability (a crucial aspect of code maintainability) understand and modify, finally resulting in low
[22,39]. Structural features represent how well the source code maintainability.
elements are formed and structured and thus represent code
readability and understandability [40]. Traditional code

metrics are statistics concerning different quality aspects, e.g.,
coupling, complexity, and cohesion. Such statistics have been
proven useful in measuring code maintainability [8].
Deep learning-based prediction involves estimating the
maintainability of a given Java class according to the features
extracted by the feature extraction step. One of the technical
challenges is that the features are diverse, including long word
sequences, short word sequences, large trees, floats, and
integers. To handle such diverse inputs, we specifically design
different neural networks to preprocess each category of
inputs before they are finally merged and fed into a dense
network. The output of the final dense network is a float
varying from zero to one to indicate the maintainability of the
given Java class, where one indicates the highest
maintainability and zero indicates the lowest maintainability.
In the following subsections (Sections 3.2−3.5), we leverage
a running example to illustrate how the proposed approach
works. The running example is presented in Listing 3. The
class in Listing 3 is a simplified version of class java.util.Date

Fig. 1 Overview of DeepM
To this end, identifiers within the source code are taken by name. containingNumbers , containingUnderscores, and

DeepM as the first lexical feature for maintainability containingDollars indicate whether the identifier contains
measurement. Identifiers are extracted and preprocessed as numbers, underscores (_), and dollar signs (＄), respectively.
follows. Notably, following Jiang’s work [44], we split an identifier
First, we retrieve all identifiers from the given Java class by into soft words based on capital letters and division markers,
enumerating all SimpleName nodes with the well-known including numbers, underscores, and dollar signs.
JavaParser [43]. Such identifiers are represented as a We represent the identifier f astT ime on Line 2 in Listing 3
sequence according to their corresponding locations within the as follows:
document:
f astT ime = ⟨2, [ f ast, T ime], “Field Name”,

identi f iers = ⟨identi f ier1 , . . . , identi f iern ⟩, (1) f alse, f alse, f alse⟩. (4)
where identi f ieri is an identifier in the Java class. The expression suggests that the identifier is composed of two
For the running example in Listing 3, we extract the soft words (i.e., f ast and T ime), it is a field name, and it does
following identifiers: not contain any numbers, underscores, or dollar signs.

Third, we further refine each of the soft words as follows:
identi f iers = ⟨Date, f astT ime,
BaseCalendar, Date, cdate, Date, so f tWordi = ⟨text, position, isInDic,

date, f astT ime, date, getT imeImpl, f ormat⟩, (5)
cdate, cdate, isNormalized, where text is a sequence of characters within the soft word.
normalize, f astT ime⟩. (2) position indicates the sequence number of the soft word within
We notice that the same identifier may appear multiple times the enclosing identifier. isInDic indicates whether the soft
within the identifier sequence. For example, the identifier date word is an English dictionary word. f ormat specifies the form
appears twice within the running example (Lines 5 and 6). We of the soft word, i.e., whether all letters within the word are
keep such repetitive instances because the frequencies of lowercase (format=1), all but the first letter are lowercase
identifiers are useful clues regarding the importance (weights) (format=2), all letters are capitalized (format=3), or none of
of the identifiers: The more popular the identifier is, the the preceding forms are correct (format=4).
greater influence it will have on the code maintainability of We demonstrate the rationale of extracting the above
the whole document. features as follows: Regarding isInDic, identifiers composed
Second, each of the identifiers is further represented as of dictionary words (instead of rare character sequences)
follows: are usually more readable and understandable [45,46].
Regarding containingNumbers , containingUnderscores,
identi f ieri = ⟨k, [so f tWord1 , so f tWord2 , containingDollars, and f ormat, naming conventions make
. . . , so f tWordk ], idenT ype, programs more understandable through placing constraints on
containingNumbers, the use of characters (i.e., numbers, underscores, dollar signs,
containingUnderscores, uppercase letters, and lowercase letters) [47]. The forms of
containingDollars⟩, (3) well-named identifiers can reflect the functions of identifiers;
where k is the number of soft words split from the identifier. for example, the forms can tell whether an identifier is a
so f tWordi is the ith soft word. idenT ype is the type of constant, which is helpful for understanding source code [47].
identifier, i.e., whether it is a class name, a field name, According to the general Java naming convention, class names
a method name, a parameter name, or a variable should be in mixed case with the first letter of each token

capitalized; programmers should not use dollar signs in the
identifiers of their own code; the names of class constants
should be all uppercase with tokens separated by underscores
[47]. Thus, the features of containingNumbers ,
containingUnderscores, containingDollars, and f ormat
could indicate whether identifiers are compliant with the
general Java naming convention; if so, the identifiers are
usually more readable and maintainable.
For the identifier f astT ime on Line 2 in Listing 3, its final
representation is as follows:

f astT ime = ⟨2, [⟨“fast”, 1, true, 1⟩,

⟨“Time”, 2, true, 2⟩], “Field Name”,
f alse, f alse, f alse⟩
= [⟨2, “fast”, 1, true, 1, “Field Name”,
f alse, f alse, f alse⟩,
⟨2, “Time”, 2, true, 2, “Field Name”,
f alse, f alse, f alse⟩]. (6)

6 Front. Comput. Sci., 2023, 17(6): 176214
3.2.2 Field declarations
feature for code maintainability measurement.
Fields of classes represent the data/statues of the enclosing First, we retrieve all method headers from a Java class:
classes, and thus field names are critical for the interpretation
methodHeaders = ⟨methodHeader1 , . . . ,
of object statuses. Well-formed fields could help maintainers
methodHeadern ⟩. (10)
quickly understand the usages of the field variables. For
example, a field representing a constant value (declared by Each of the resulting method headers is further refined as
static and final) should be named in uppercase [48]. A field follows:

declaration specifies the names, access modifiers, and data methodHeaderi = ⟨methodName, k,
types of the fields. [parameter1 , parameter2 ,
First, we retrieve all field declarations from a Java class: . . . , parameterk ],

f ieldDeclarations = ⟨ f ieldDeclaration1 , . . . , modi f ier, isS tatic,

f ieldDeclarationn ⟩. (7) hasComment⟩. (11)
Each of the resulting field declarations is further refined as: where k is the number of parameters in the method header.

methodName and parameteri are the method name and ith

f ieldDeclarationi =⟨k, [ f ield1 , f ield2 , . . . ,
parameter, respectively; they are further refined using Eq. (3).
f ieldk ], modi f ier,
modifier, isStatic, and hasComment are the same as those in
isS tatic, hasComment⟩, (8) Eq. (8) for field declarations.
where k is the number of fields in the field declaration.
Notably, a field declaration may declare more than one field. 3.3 Structural features

For example, the following statement “public int a,b;” declares Compared with natural language, source code is highly

two fields (i.e., a and b) simultaneously. modi f ier specifies structured, and the structure should be well designed [49].
the accessibility or scope of a field, i.e., whether the field is Poorly formatted code (represented by the feature of structural
public, protected, default, or private. isS tatic indicates format) would significantly reduce code readability [50]; in
whether the field declaration contains the keyword “static”. addition, poorly designed source code (represented by the
hasComment indicates whether the field declaration is feature of simplified ASTs) often results in strange structures,
associated with comments. e.g., deep nests and numerous repetitive switches, which
We further refine each of the declared fields in Eq. (8) as would decrease code maintainability, especially
follows: understandability [42]. Notably, code readability and code

understandability are two crucial aspects for code

f ieldi = ⟨ f ieldName, isAssigned⟩, (9)
maintainability. To this end, we take the structures of Java
where f ieldName is the name of the field and isAssigned classes as important features for maintainability measurement.
indicates whether the field is assigned explicitly. f ieldName is
essentially an identifier, and thus, it is further refined 3.3.1 Simplified ASTs

according to Eq. (3). An AST is a tree representation of the syntactic structure of
For the running example in Listing 3, the field declarations source code. ASTs are widely used to represent the structures
are represented in Fig. 2. of source code [51]. Consequently, we retrieve the AST for
each method within the given class. Each node of the AST tree
3.2.3 Method headers

represents a source code element, e.g., a method declaration, a
Methods specify the behaviors of the associated class. variable, or a literal.
Although methods are composed of method headers and Notably, we leverage only the structure of the AST tree and
method bodies, maintainers often heavily rely on method the types of tree nodes. Other information embedded in the
headers to determine the behaviors of a class because lengthy tree, e.g., the identifiers and the values of literals, is ignored.
method bodies are often much more difficult to read than well- The resulting AST tree (excluding ignored information) is
formed method headers [49]. Consequently, the quality of called a simplified AST.
method headers may significantly influence code
maintainability (especially code reusability) of the whole 3.3.2 Structural format

class. To this end, we take method headers as an important Although ASTs are widely used to represent the structures of

Fig. 2 Representation of the field declarations of running example
source code, they ignore the coding styles and formats of 3.5 Deep learning-based quality prediction

tokens. For example, indentation, successive white spaces, and Deep learning techniques have been successfully applied to

blank lines are ignored by ASTs. However, such formatting resolve various software engineering tasks, e.g., code
information is critical for code maintainability because well- generation [56], code completion [57], code smell detection
designed source code with a poor format could result in low [19], and software testing [58]. One of the significant
readability [52]. To this end, we leverage the structural format advantages of deep learning techniques is the ability to exploit
of the source code as an important feature for maintainability the deep semantics embedded in lengthy text [59]. To this end,
measurement. we leverage deep learning techniques to exploit the lexical
To represent the formatting of the input source code, we features of source code. Another key advantage of deep
take the source code as plain text and keep all special learning techniques is that they are good at simulating
characters, e.g., white spaces and line breaks. Notably, we complex mappings from inputs to outputs [60]. Code
replace all identifiers, comments, and literals within the maintainability is a complex concept concerning diverse
source code with constant strings “<IDENTIFIER>”, aspects of the input code, where the relations among such
“<COMMENT>”, and “<LITERAL>”, respectively. The aspects are hard to quantify. Consequently, the mappings from
replacement significantly reduces the vocabulary of the the code features to maintainability indexes should be rather
resulting features, i.e., the number of unique tokens appearing complex. To this end, we leverage deep learning techniques to
in the extracted features. The significantly reduced automatically recover the complex mappings based on a large
vocabulary, in turn, facilitates the following task (neural set of training data.
network-based machine learning) because smaller Figure 3 presents the overview of the deep neural network
vocabularies often result in shorter and more effective digital for code maintainability measurement proposed in this paper.
representations of text [53]. The network takes all features extracted in the preceding
sections (Sections 3.2−3.4). The output maintainability index
3.4 Code metrics

is a single float varying from zero to one, which indicates the
Traditional code metrics have been widely used to measure probability of the given Java class being maintainable (output
code maintainability [8,11,35]. To describe different concerns by the final sigmoid activation function) [61]. We specifically
regarding code maintainability, various code metrics have design deep neural networks to preprocess each category of
been proposed [28], and the final overall maintainability index input features. The features are diverse, containing long word
is often a synthesis of a large number of diverse metrics sequences, short word sequences, large trees, floats, and
[8,11,35]. According to a systematic mapping study of source integers, making it challenging to preprocess them with a
code metrics [28], existing code metrics cover the following universal network.
concerns: size, complexity, inheritance, coupling, and For text inputs, we employ the well-known word embedding
cohesion. DeepM leverages metrics covering all of the above technique [23] to transform textual tokens into fixed-length
concerns except for inheritance by reusing all code metrics numerical vectors. Because the length of the structural format
from CK [54,55] except for those related to inheritance. An (extracted in Section 3.3.2) varies significantly from class to
overview of the employed metrics is presented in Table 1. class, we leverage LSTM [24] to turn such variable-length
Notably, DeepM is fed with a single Java class document, and inputs into a fixed-length vector. LSTM is a special kind of
thus, it is difficult to discover its parents without additional recurrent neural network that accepts variable-length inputs.
inputs (e.g., the enclosing Java project). Consequently, DeepM For variable-length inputs where the order of elements can
does not leverage the inheritance of classes. be ignored (e.g., the order of different field declarations), we

Table 1 Employed code metrics
Category Code metrics
Coupling Coupling between objects (CBO) response for a class (RFC)
Cohesion Lack of cohesion of methods (LCOM)
Complexity Weighted methods per class (WMC)
Size Lines of code (LOC)
Numbers of {anonymous, inner} classes
Numbers of {total, abstract, default, final, private, protected, public, static, synchronized, visible} methods
Numbers of {total, default, final, private, protected, public, static, synchronized} fields
Number of assignments
Number of log statements
Number of loops
Number of returns
Number of try/catches
Basic counting Number of comparisons
Number of lambdas
Number of math operations
Maximum number of nested blocks
Number of numbers
Number of parenthesized expressions
Number of string literals
Number of unique words

Number of variables
8 Front. Comput. Sci., 2023, 17(6): 176214

Fig. 3 Deep learning-based model for code maintainability measurement
leverage attention-based dense networks [26] to transform an attention-based dense network to generate the vector
them into fixed-length numeric vectors. Attention-based dense representing identifiers. Similarly, we can get the vector
networks have a simplified attention mechanism, which is representing field declarations and the vector representing
proposed to solve the “addition” and “multiplication” method headers. For simplified ASTs, the nodes of each
problems of long-term memory [26]. Notably, attention-based simplified AST are embedded and fed into a tree-LSTM
dense networks request a customized learnable function to network; then the outputs of the tree-LSTM network resulting
produce a probability vector. To this end, we employ a dense from multiple simplified ASTs are fed into an attention-based
network to simulate such functions (as presented in Fig. 4). neural network. For structural format, the internal texts are
For the simplified ASTs (extracted in Section 3.3.1), we embedded and fed into an LSTM network. For code metrics,
employ a variant of LSTM, i.e., the child-sum tree-LSTM the values of all metrics are fed into a dense network. Finally,
[25], to learn the embeddings of the nodes. The standard the outputs of six subnetworks are merged and fed into a
LSTM handles sequential data. In contrast, tree-LSTM is dense network with the activation function of sigmoid, and the
specially designed to handle tree-structured data such as output of the dense network is the final maintainability index.
ASTs. In the tree-LSTM, a unit is updated according to the
current input, its cell state, and the hidden states of its 4 Evaluation

children. In a bottom-up manner, the hidden state of the root 4.1 Research questions

unit is finally updated and serves as the embedding of the Our evaluation investigates the following research questions.
given AST. ● RQ1: Is DeepM accurate, and to what extent can DeepM
The deep learning-based model works as follows. First, each outperform the state-of-the-art approaches in measuring
feature is fed into the corresponding subnetwork. For code maintainability?
identifiers, their features are extracted according to Eq. (6); for ● RQ2: How do the various features influence the
the features of each soft word in identifiers (corresponding to performance of DeepM?
an element of the array in Eq. (6)), the text of the soft word is ● RQ3: Is DeepM efficient and scalable?
embedded and concatenated to other corresponding numeric
features; and then the resulting vectors (the number of the RQ1 investigates the performance of DeepM in measuring
vectors is equal to the number of the soft words) are fed into code maintainability. To answer RQ1, according to the

Fig. 4 Attention-based dense network
empirical study on class maintainability prediction [35], we maintainability; we repeat this step until agreement is
compare DeepM with two baseline approaches: logistic reached.
regression with ridge estimator (noted as LRR) and CHC
adaptive search for instance selection-based k-nearest 4.2.2 Training data

neighbors (k-NN) algorithm (noted as CHC) [35]. According Ideally, we should have constructed training data in the same

to the empirical study conducted by Malhotra et al. [35], they way as we construct the testing data to guarantee the quality of
are the best in predicting the maintainability of Java classes. the resulting data. However, manual constructing as we do in
RQ2 investigates the importance of the selected features Section 4.2.1 is time-consuming, making it challenging to
with regard to the performance of DeepM. By answering RQ2, construct large-scale training data. Notably, the proposed
we may know which features are more important than others. approach leverages a deep neural network that requests a large
RQ3 concerns the time complexity of DeepM, since deep amount of training data. Consequently, we automatically
construct a large-scale training dataset as follows:
learning-based approaches often require time-consuming
training phases. Investigating RQ3 would reveal the ● First, we select high-quality Java projects from GitHub.
quantitative time required by the training phase and its Each of the selected projects should 1) have more than
quantitative relationship with the size of the training data. 500 stars; 2) have more than 30 contributors; 3) have
Answering RQ3 would also reveal whether DeepM can be more than 300 closed pull requests; 4) have a
efficiently applied to large-scale training data. significant part (no less than 8%) of the closed pull
requests already reviewed. Notably, the criteria are
4.2 Dataset

empirically designed according to our experience in

To evaluate DeepM, we construct two datasets (training data
open source applications [62,63]: 1) These projects are
and testing data) as presented in Table 2. The second column
continuously maintained by active communities in
presents the sizes of the datasets, i.e., the numbers of classes
which many experienced programmers contribute to the
involved in the datasets. The last column specifies how the
projects. 2) These projects have specific code quality
datasets are constructed.
criteria demonstrated in the document of contribution
4.2.1 Testing data

guidelines. 3) The committed code is usually reviewed
The testing data is constructed manually to guarantee the before merging them into the code repositories.
quality of the data. The construction is done as follows: Consequently, the source code in these projects is
usually of high maintainability. The selection results in
● First, we select five Java projects from GitHub: Tomcat, 378 projects noted as high-quality projects.
Log4j 2, Apollo, Druid, and Sentinel. The projects are ● Second, we select low-quality Java projects from
selected for the following reasons. First, all of them are GitHub. Each of the selected projects should 1) have no
well-known, high-quality projects with long and star; 2) have fewer than 5 contributors; 3) have fewer
successful evolution histories. Second, these projects than 5 closed pull requests; 4) have no closed pull
cover different domains: Tomcat is a web server, Log4j requests that have already been reviewed. The selection
2 is a logging utility, Apollo is a configuration results in 54,882 projects, noted as low-quality projects.
management system, Druid is a database connection ● Third, we retrieve the Java classes from the selected
pool, and Sentinel is a flow control component. projects. We also label all classes from the high-quality
● Second, from each of the five projects, we randomly projects as good (of high maintainability) and those
select 80 classes and distribute them into 40 groups from the low-quality projects as bad (of low
(pairs), where each group is composed of exactly two maintainability).
classes. Three programmers (with at least five years of ● Finally, we apply simple random sampling to exclude
experience) rank the maintainability of each pair potential confounding variables. Notably, the high-
according to their maintenance experience and the maintainability classes collected in the preceding steps
requirements for code maintainability, especially code are usually lengthier than the poor classes.
readability, understandability, and modifiability; Consequently, if we take all of the collected classes as
notably, the programmers have learned the definitions training data to train the proposed model, the sizes of
of code maintainability and related quality such classes could serve as a significant confounding
characteristics of code maintainability before labeling variable [64]. Thus, for each possible LOC, we separate
the data. If they reach agreement, we keep the pair in the classes with the LOC into a high-maintainability set
the testing data; otherwise, the pair is dropped. If the and a low-maintainability set and randomly drop classes
latter occurs, we randomly reselect two different in the larger set to make the two sets equal in size. The
classes, and the three programmers rank their resulting dataset is composed of 697,257 high-

maintainability classes and 697,257 low-maintainability
Table 2 Datasets for evaluation classes.
Dataset # Classes Construction approach 4.3 Process

Testing data 400 Manual Construction To answer RQ1, we evaluate DeepM, and two baseline
Training data

1,394,514 Automated Construction approaches, i.e., LRR and CHC, through the following steps:
10 Front. Comput. Sci., 2023, 17(6): 176214
● First, the training data is partitioned randomly into the #class pairs ranked correctly
RankingAccuracy = . (12)
training part (96%), and the validation part (4%). #generated class pairs
● Second, DeepM and the baseline approaches are trained To investigate the importance of different features (RQ2),
independently on the training part and fine-tuned with we conduct an ablation study [66], which is a frequently used
the validation part. method to understand the reasons for the success of neural
● Third, DeepM and the baseline approaches are evaluated network-based models. Each time, we disable a single feature
with the testing data. and repeat the evaluation to investigate how the performance
Suitable hyper-parameter configurations are crucial for of DeepM changes.
DeepM to measure code maintainability. However, the 4.4 RQ1: DeepM Outperforms the Baseline Approaches

training of DeepM requires lots of time (~62 hours), which To answer RQ1, we compare DeepM with two baseline
results in that we cannot test many combinations of hyper- approaches in terms of measuring code maintainability. We
parameters. Consequently, we first identify several typical present the evaluation results in Table 3, and make the
configurations [65] to train DeepM, and observe the following observations:
performance on the validation data. Subsequently, we try other
configurations according to the heuristics in [65] to find better ● First, DeepM is accurate with respect to ranking class
hyper-parameter configurations. pairs. In most cases (87.5%), DeepM generates the
We demonstrate the final hyper-parameter configurations as correct rankings. The accuracies on the five projects
follows. We train DeepM using the Adam optimizer for 3 vary from 80.0% to 92.5%, which demonstrates the
epochs with the learning rate of 0.001, the batch size of 32, effectiveness of DeepM.
and the loss function of binary cross entropy. Regarding the ● Second, DeepM significantly outperforms the two
structure of our network, the attention-based dense networks baseline approaches, i.e., LRR and CHC. On all the five
are 3-layer networks (16-8-1) with the ELU activation projects, DeepM generates higher accuracies than both
function. The dense network for the feature of code metrics is the two baseline approaches. It improves the average
also a 3-layer network (32-16-8) with the ELU activation RankingAccuracy from 75.0% for LRR and 57.5% for
function. The LSTM is a single-layer network with 32 units CHC to 87.5%, where the improvements are 12.5 and
and 8-dimensional embedding of the input text. The tree- 30.0 percentage points, respectively.
LSTM is a single-layer network with 8-dimensional ● Finally, it is challenging for LRR and CHC for
embedding of both the simplified AST nodes and the final measuring code maintainability: The average
representations. The final dense network is a 4-layer network RankingAccuracy are low, i.e., 75.0% for LRR and
(32-16-8-1) with the ELU activation function for the front 57.5% for CHC. This demonstrates that measuring code
layers and the sigmoid activation function for the last layer. maintainability is far beyond the abilities of simple code
The sizes of vocabularies for soft words, simplified AST metrics.
nodes, and texts in structural format are 56949 (covering 99%
of all soft words), 591 (all simplified AST nodes), and 110 (all To further reveal the reasons for the success of DeepM, we
texts in structural format) respectively. retrieve all the 27 (27/200 = 13.5%) class pairs of the testing
For the baseline approach CHC, the original implementation data for which the two baseline approaches’ rankings are
(in KEEL) employs 1-nearest neighbor (1-NN) algorithm to incorrect but DeepM’s rankings are correct, and then analyze
classify unseen instances based on selected instances. To the resulting classes from multiple aspects such as identifier
compute maintainability indexes, we modify the 1-NN to 7- semantics, logical complexity, and code style. We find that
NN (optimal in terms of the accuracy regarding RQ1). For one of the major reasons for the success of DeepM is that it
other parameters of the two baseline approaches, we take the takes full advantage of the natural language in identifiers.
same configurations as [35]. DeepM correctly ranks 25 (25/27 = 92.6%) class pairs because
During the testing process, the compared approaches are it exploits the feature of identifiers: forms (length, abbreviated
employed to rank each class pair in the testing data that is or full form, and letter case) and semantics. An example from
composed of a low-maintainability class and a high- Apollo project is Java class ConfigChangeEvent. The metrics
maintainability class. If the maintainability index generated of this Java class, e.g., CBO, WMC, RFC, and LCOM, are
for the high-maintainability class is greater than that of the reasonable and comparable to those of the high-
low-maintainability class, the ranking of the two classes is maintainability classes within the training data. Consequently,
deemed correct. Otherwise, the ranking is incorrect. The the two baseline approaches predict this class as higher
performance of the evaluated approaches in ranking the maintainability (than the other class in the same class pair).
classes is computed as follows: However, the identifiers within this class are poorly designed.

Table 3 Ranking accuracy on the testing data
Approach Tomcat Log4j 2 Apollo Druid Sentinel Average
DeepM 90.0% 80.0% 90.0% 85.0% 92.5% 87.5%
LRR 80.0% 70.0% 77.5% 75.0% 72.5% 75.0%
CHC 65.0% 55.0% 70.0% 45.0% 52.5% 57.5%

The first issue with the identifiers within this class is the form ● Third, disabling field declarations results in less

of the method name changedKeys. Programmers usually reduction than that yielded by disabling other features
leverage verb phrases to name methods [67], and noun phrases with respect to the average RankingAccuracy, i.e., a
(such as the name changedKeys) are rarely employed as reduction rate of 2.9%. The reasons for this are
method names. The second issue is the low readability of the probably as follows: 1) The features of field
identifiers because of containing nonsense tokens and hard-to- declarations contain only static information (without
interpret abbreviations such as “m_changes”. Thus, DeepM dynamic behaviors) about the enclosing classes. 2) The
suggests that it is a lower-maintainability class. semantics contained in field declarations are covered by
Seven (3.5% = 7/200) class pairs are incorrectly ranked by the features of identifiers.
DeepM but correctly ranked by both the two baseline
approaches. We analyze the seven classes (like we do above) 4.6 RQ3: DeepM is efficient

and find that the reasons for misprediction might be that many To investigate the efficiency of DeepM, during the evaluation
identifiers within these classes are very descriptive. An specified in Section 4.3, we record the time consumed by its
example from Tomcat project is Java class JspPropertyGroup, training and testing processes. Notably, the training and
which is of lower maintainability in the enclosing class pair; testing procedures are conducted on a workstation with a GPU
however, DeepM suggests that the class is of higher whose settings are listed as follows: 128 GB of RAM, Intel(R)
maintainability. Many issues result in the class being of lower Xeon(R) E5-2620 v4 CPU, and NVIDIA TITAN RTX GPU.
maintainability. For example, the boolean variable The evaluation results are shown in Table 5.
deferredSyntax should be named as isDeferredSyatax instead From the table, we observe that the training phase can be
of the noun phrase [68]; the field urlPattern should be named accomplished within 62 hours (3 epochs) on a single
as urlPatterns because it is a collection variable with the type workstation, although the training data (containing 85,560,826
of LinkedHashSet<String> [68]. From the example, we can lines of code) is massive. We also notice that DeepM is
conclude that measuring code maintainability is complex, and efficient in generating maintainability indexes. It takes DeepM
the word embedding technique may be not sufficient enough only 0.076 seconds on average to predict the maintainability
for exploiting the semantics of identifiers to measure code index for a single class. Notably, “Training Time”, “Testing
maintainability. In the future, we will leverage other Time”, and “Testing Time per Class” include the time
techniques to model the semantics in source code. required for feature extraction.
Figure 5 further investigates how the size of the training
4.5 RQ2: All exploited features are helpful
dataset influences the training time. From this figure, we
To investigate the importance of the exploited features, we observe that there is a linear relationship between the training
conduct an ablation study [66] on the testing data. The time and the size of the training dataset. Consequently, DeepM
evaluation results are presented in Table 4. From the table, we can be efficiently applied to large-scale training data.
make the following observations:
4.7 Threats to validity

● First, all exploited features are helpful. Disabling any A threat to constructive validity is that the maintainability
feature results in large reductions with respect to the labels and maintainability rankings of the classes in the
average RankingAccuracy. The reduction rates vary training data and the testing data may be inaccurate. To collect
from 2.9% to 15.4% (last column in Table 4). labeled data for training and testing, we collect a large number
● Second, identifiers contribute most to the ability of of Java projects from GitHub and label the maintainability
DeepM to measure code maintainability. Disabling indexes of the Java classes within such projects automatically
identifiers results in the largest reduction with respect to according to some project-level metrics, e.g., the number of
the average RankingAccuracy, i.e., a reduction rate of stars associated with the projects. However, there is no strong
up to 15.4%. In addition, for each of the five and quantitative evidence that such employed metrics can
applications, disabling identifiers has the largest impact accurately distinguish high-quality projects from low-quality
on the performance of DeepM. Consequently, we can projects. Furthermore, not all classes from high-quality (low-
conclude that identifiers are important for code quality) projects are of high (low) maintainability. To reduce
maintainability measurement. the threat level, three programmers label 100 randomly-

Table 4 Effects of the exploited features on the testing data (ranking accuracy)
Setting Tomcat Log4j 2 Apollo Druid Sentinel Average ↓ Rate
Default 90.0% 80.0% 90.0% 85.0% 92.5% 87.5%
Disabling identifiers 75.0% 60.0% 80.0% 72.5% 82.5% 74.0% 15.4%
Disabling field declarations 80.0% 82.5% 90.0% 80.0% 92.5% 85.0% 2.9%
Disabling method headers 80.0% 75.0% 90.0% 85.0% 90.0% 84.0% 4.0%
Disabling simplified ASTs 85.0% 70.0% 87.5% 82.5% 92.5% 83.5% 4.6%
Disabling structural format 82.5% 67.5% 90.0% 80.0% 92.5% 82.5% 5.7%
Disabling code metrics 80.0% 67.5% 87.5% 80.0% 92.5% 81.5% 6.9%

12 Front. Comput. Sci., 2023, 17(6): 176214

Table 5 Performance of DeepM regression with ridge estimator (noted as LRR) and CHC
Metric Value adaptive search for instance selection-based k-nearest
Size of training data 85,560,826 LOC neighbors algorithm (noted as CHC), cannot output a float
Training time (3 epoch) ~62 hours number indicating code maintainability; thus, we modify them
Size of testing data 29,434 LOC for evaluation purposes. In the empirical study on class
Testing time ~30 seconds maintainability [35], Malhotra et al. constructed class
Testing time per class ~0.076 seconds

maintainability prediction models using the Knowledge

Extraction based on Evolutionary Learning (KEEL) tool [69].
Consequently, we modify the two implementations in KEEL
to obtain code maintainability indexes, i.e., a float number
varying from zero to one: 1) For LRR, the implementation
itself (in KEEL) has a variable whose values vary from zero to
one indicating code maintainability; thus, we output the values
to result files for evaluation purposes. 2) For CHC, the
implementation (in KEEL) employs 1-nearest neighbor (1-
NN) algorithm to classify unseen instances based on selected
instances; we modify the 1-NN to k-NN, thus we can compute
maintainability indexes. The experimental results in RQ1 are
obtained using the 7-NN model, which is optimal in terms of
accuracy. In addition, CHC is so slow that it cannot be trained
on the entire training data (1,338,733 Java classes) even if we
Fig. 5 Training time versus size of training dataset optimize it through parallel streams in Java and run it on a 32-
core workstation. Consequently, we train CHC on randomly-

sampled, high-maintainability classes and 100 low- sampled 100,000 Java classes from the training data, and
maintainability classes according to their maintenance report corresponding experimental results in RQ1. Notably,
experiences; the programmers have at least five years of the size of the sampled dataset (i.e, 100,000) is enough for
experience developing or maintaining large commercial training CHC, because the difference between the accuracy
applications using Java language. The results suggest that 189 obtained on 100,000 Java classes and that obtained on 50,000
out of the 200 classes are labeled correctly (The programmers Java classes is less than 1%. However, the reimplementation
reach agreement and the labels are correct). In addition, we could be buggy, which may influence the conclusions drawn
manually create another dataset (called testing data), where the about the baseline approaches. We also make the modification
classes are ranked manually by the three experienced publicly available as a part of the replication package
programmers. Notably, manual labeling is not guaranteed to (available at github/DeepQuality/Deep_Maintainability).
be correct because maintainability is essentially subjective. A threat to external validity is that the datasets, especially
The three experienced programmers rank class pairs from the the manually constructed dataset, are small. Some special
same high-quality projects together. The class pairs are put in characteristics of such data may bias the evaluation, and the
the testing data only when the programmers reach agreement. obtained conclusions may not hold on other data. To reduce
Another threat related to the datasets is that DeepM is the threat level, we sample Java classes from diversified
trained with the datasets with the labels of high or low projects that come from different domains and were developed
maintainability instead of the labels of floating numbers by different developers/companies.
between 0 and 1 indicating code maintainability. Ideally, a
large dataset with fine-grained labels (i.e., floating numbers 5 Limitations
between 0 and 1) should be constructed to train DeepM. The proposed approach is currently confined to a single source
However, it is hard to obtain fine-grained maintainability file, i.e., a Java class. It is challenging to extend our approach
indexes of lots of class files: (1) We have tried our best to to project-level code maintainability measurement because of
automatically construct such a dataset according to the the inefficiency of deep neural networks in handling lengthy
information on refactoring and code reviews in code text. A project in the industry usually contains thousands of
repositories, but we cannot obtain fine-grained maintainability Java classes where such classes are connected with complex
indexes especially for lots of class files; (2) Manually labeling semantic and syntactic correlations. Feeding all such tokens
the maintainability indexes of lots of class files is infeasible. and their relations into a single network could result in
Consequently, we construct the training data through the extremely large neural work containing billions of parameters.
heuristic strategy to train our neural network-based model. To However, tuning such parameters may request numerous
reduce the threat level, we use the manually labeled testing labeled data that we cannot offer. Notably, the average
data to evaluate the effectiveness of DeepM; ranking the class maintainability index of the classes within a project (called
pairs of the testing data is more challenging than classifying a average index for short) alone is insufficient to reveal the
class file as high or low maintainability. maintainability of the whole project because it ignores some
A threat to internal validity is that the original correlations (e.g., inheritance) among the classes.
implementations of the baseline approaches, i.e., logistic It is also difficult to apply DeepM to code commits. A code
commit represents changes to a project for a specific of the Maintenance of Computer Application Software in 487 Data
programming task, e.g., fixing a bug. Although code commits Processing Organizations. Reading: Addison-Wesley, 1980
4. Yau S S, Collofello J S. Some stability measures for software
are often small, they could reach multiple Java files. As a

maintenance. IEEE Transactions on Software Engineering, 1980, SE-

result, DeepM could not be applied to such code commits. 6(6): 545–552
Furthermore, code commits focus on the local design of 5. Reiss S P. Incremental maintenance of software artifacts. IEEE

source code (e.g., a single line of source code) whereas Transactions on Software Engineering, 2006, 32(9): 682–697
DeepM focuses on the overall maintainability of classes. 6. ISO. ISO/IEC 5055: 2021 Information technology — Software

Consequently, we cannot apply DeepM to code commits measurement — Software quality measurement — Automated source
code quality measures. Geneva: ISO, 2021
without significant adaption.
7. Grubb P, Takang A A. Software Maintenance: Concepts and Practice.

Due to the nature of deep learning-based models, the outputs 2nd ed. London: World Scientific Publishing, 2003
of DeepM cannot directly help programmers improve the 8. Al Dallal J. Object-oriented class maintainability prediction using

maintainability of their code. However, DeepM has other internal quality attributes. Information and Software Technology, 2013,

usages. For example, when automatic refactoring is applied to 55(11): 2028–2048
improve code maintainability, DeepM can check whether the 9. Myers G J, Badgett T, Thomas T M, Sandler C. The Art of Software

Testing. 2nd ed. Hoboken: John Wiley & Sons, 2004
post-refactoring class is of high maintainability automatically,
10. Mari M, Eila N. The impact of maintainability on component-based
and then accept or reject the refactoring. In addition, DeepM

software systems. In: Proceedings of the 29th Euromicro Conference.

could check whether each changed class in commits satisfies 2003, 25−32
maintainability requirements before merging them, and thus 11. Li W, Henry S. Object-oriented metrics that predict maintainability.

make the repository in a maintainable state. Journal of Systems and Software, 1993, 23(2): 111–122
Finally, the current implementation of the proposed 12. Oman P, Hagemeister J. Construction and testing of polynomials

predicting software maintainability. Journal of Systems and Software,

approach is confined to Java. Java is selected because of its
1994, 24(3): 251–266
popularity [70]. However, our approach does not make use of 13. Heitlager I, Kuipers T, Visser J. A practical model for measuring

any Java-specific features. Consequently, it may have the maintainability. In: Proceedings of the 6th International Conference on

ability to be extended to other object-oriented programming the Quality of Information and Communications Technology (QUATIC
languages, such as C++. In the future, we should investigate 2007). 2007, 30−39
how well it works on other programming languages. 14. ISO. ISO/IEC 9126-1: 2001 Software engineering—Product quality —

Part 1: Quality model. Geneva: ISO, 2001
15. Zhou Y, Leung H. Predicting object-oriented software maintainability
6 Conclusions
using multivariate adaptive regression splines. Journal of Systems and

The maintainability of source code is one of the key concerns Software, 2007, 80(8): 1349–1361
in software maintenance. However, the existing approaches 16. Bavota G, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A.

for measuring code maintainability rely heavily on code An empirical study on the developers’ perception of software coupling.

metrics, and the quality of the natural language within source In: Proceedings of the 35th International Conference on Software
code is ignored. Consequently, in this paper, we propose a Engineering (ICSE). 2013, 692−701
17. Pantiuchina J, Lanza M, Bavota G. Improving code: the (Mis)
deep learning-based approach called DeepM to quantitatively

perception of quality metrics. In: Proceedings of 2018 IEEE

measure the maintainability of Java classes. DeepM leverages International Conference on Software Maintenance and Evolution
deep learning techniques to exploit the lexical semantics of the (ICSME). 2018, 80−91
natural language (especially identifiers) in source code and to 18. Lenhard J, Blom M, Herold S. Exploring the suitability of source code

build mappings from the resulting lexical semantics, as well as metrics for indicating architectural inconsistencies. Software Quality

structural features and traditional code metrics, to Journal, 2019, 27(1): 241–274
19. Liu H, Xu Z, Zou Y. Deep learning based feature envy detection. In:
maintainability indexes. DeepM is evaluated on manually-

Proceedings of the 33rd ACM/IEEE International Conference on

assessed source code, and our evaluation results suggest that Automated Software Engineering. 2018, 385−396
DeepM is accurate and generates the same rankings of code 20. Scalabrino S, Linares-Vasquez M, Poshyvanyk D, Oliveto R. Improving

maintainability as those of experienced programmers on code readability models with textual features. In: Proceedings of the
87.5% of manually ranked pairs of Java classes. 24th IEEE International Conference on Program Comprehension. 2016,
In the future, we would like to extend the proposed 1−10
21. Kim S, Kim D. Automatic identifier inconsistency detection using code
approach to other object-oriented programming languages,

dictionary. Empirical Software Engineering, 2016, 21(2): 565–604
such as C++. It would also be interesting to investigate the 22. Schankin A, Berger A, Holt D V, Hofmeister J C, Riedel T, Beigl M.

potential of the proposed approach regarding the evaluation of Descriptive compound identifier names improve source code

larger elements, e.g., packages, modules, subsystems, and comprehension. In: Proceedings of the 26th IEEE/ACM International
even whole systems. Conference on Program Comprehension (ICPC). 2018, 31−40
23. Lai S, Liu K, He S, Zhao J. How to generate a good word embedding.

References IEEE Intelligent Systems, 2016, 31(6): 5–14
24. Greff K, Srivastava R K, Koutník J, Steunebrink B R, Schmidhuber J.

1. Schneidewind N F. The state of software maintenance. IEEE

LSTM: a search space odyssey. IEEE Transactions on Neural Networks
Transactions on Software Engineering, 1987, SE-13(3): 303–310 and Learning Systems, 2017, 28(10): 2222–2232
2. Bennett K H, Rajlich V T. Software maintenance and evolution: a

25. Tai K S, Socher R, Manning C D. Improved semantic representations

roadmap. In: Proceedings of the Conference on the Future of Software from tree-structured long short-term memory networks. In: Proceedings
Engineering. 2000, 73−87 of the 53rd Annual Meeting of the Association for Computational
3. Lientz B P, Swanson E B. Software Maintenance Management: A Study
Linguistics and the 7th International Joint Conference on Natural
14 Front. Comput. Sci., 2023, 17(6): 176214
Language Processing (Volume 1: Long Papers). 2015, 1556−1566 47. ORACLE. Naming Conventions. See oracle-base/articles/misc/naming-

26. Raffel C, Ellis D P. Feed-forward networks with attention can solve

conventions website, 2017
some long-term memory problems. 2016, arXiv preprint arXiv: 48. Olsson M. Constants. In: Olsson M, ed. Java 17 Quick Syntax

1512.08756 Reference. 3rd ed. Berkele: Apress Berkele, 2022, 85−87
27. McCulloch W S, Pitts W. A logical calculus of the ideas immanent in

49. Fowler M. Refactoring: Improving the Design of Existing Code. 2nd ed.

nervous activity. The Bulletin of Mathematical Biophysics, 1943, 5(4): Reading: Addison-Wesley Professional, 2018
115–133 50. Mi Q, Xiao Y, Cai Z, Jia X. The effectiveness of data augmentation in

28. Nuñez-Varela A S, Pérez-Gonzalez H G, Martínez-Perez F E,

code readability classification. Information and Software Technology,

Soubervielle-Montalvo C. Source code metrics: a systematic mapping 2021, 129: 106378
study. Journal of Systems and Software, 2017, 128: 164–197 51. Alon U, Zilberstein M, Levy O, Yahav E. Code2vec: learning

29. Przybyłek A. Where the truth lies: AOP and its impact on software

distributed representations of code. Proceedings of the ACM on

modularity. In: Proceedings of the 14th International Conference on Programming Languages, 2019, 3(POPL): 40
Fundamental Approaches to Software Engineering. 2011, 447−461 52. Lee T, Lee J B, In H P. A study of different coding styles affecting code

30. Goel B M, Bhatia P K. Analysis of reusability of object-oriented

readability. International Journal of Software Engineering and its

systems using object-oriented metrics. ACM SIGSOFT Software Applications, 2013, 7(5): 413–422
Engineering Notes, 2013, 38(4): 1–5 53. Sun H, Wang R, Chen K, Utiyama M, Sumita E, Zhao T. Unsupervised

31. Bruntink M, Van Deursen A. An empirical study into class testability.

bilingual word embedding agreement for unsupervised neural machine

Journal of Systems and Software, 2006, 79(9): 1219–1232 translation. In: Proceedings of the 57th Annual Meeting of the
32. Poort E R, Martens N, Van De Weerd I, Van Vliet H. How architects

Association for Computational Linguistics. 2019, 1235−1245
see non-functional requirements: beware of modifiability. In: 54. Chidamber S R, Kemerer C F. A metrics suite for object oriented

Proceedings of the 18th International Conference on Requirements design. IEEE Transactions on Software Engineering, 1994, 20(6):
Engineering: Foundation for Software Quality. 2012, 37−51 476–493
33. Alzahrani M, Alqithami S, Melton A. Using client-based class cohesion

55. Aniche M. Java code metrics calculator (CK), 2015

metrics to predict class maintainability. In: Proceedings of the 43rd 56. Svyatkovskiy A, Deng S K, Fu S, Sundaresan N. IntelliCode compose:

IEEE Annual Computer Software and Applications Conference code generation using transformer. In: Proceedings of the 28th ACM
(COMPSAC). 2019, 72−80 Joint Meeting on European Software Engineering Conference and
34. Kanellopoulos Y, Antonellis P, Antoniou D, Makris C, Theodoridis E,

Symposium on the Foundations of Software Engineering. 2020,

Tjortjis C, Tsirakis N. Code quality evaluation methodology using the 1433−1443
ISO/IEC 9126 standard. International Journal of Software Engineering 57. Svyatkovskiy A, Zhao Y, Fu S, Sundaresan N. Pythia: AI-assisted code

& Applications, 2010, 1(3): 17–36 completion system. In: Proceedings of the 25th ACM SIGKDD
35. Malhotra R, Lata K. An empirical study to investigate the impact of data

International Conference on Knowledge Discovery & Data Mining.

resampling techniques on the performance of class maintainability 2019, 2727−2735
prediction models. Neurocomputing, 2021, 459: 432–453 58. Li Y, Yang Z, Guo Y, Chen X. Humanoid: a deep learning-based

36. Padhy N, Panigrahi R, Neeraja K. Threshold estimation from software

approach to automated black-box android app testing. In: Proceedings

metrics by using evolutionary techniques and its proposed algorithms, of the 34th IEEE/ACM International Conference on Automated
models. Evolutionary Intelligence, 2019, 14(2): 315–329 Software Engineering (ASE). 2019, 1070−1073
37. Shatnawi R. Comparison of threshold identification techniques for

59. Gu X, Zhang H, Zhang D, Kim S. Deep API learning. In: Proceedings

object-oriented software metrics. IET Software, 2020, 14(6): 727–738 of the 24th ACM SIGSOFT International Symposium on Foundations

38. Ferreira K A M, Bigonha M A S, Bigonha R S, Mendes L F O, Almeida

of Software Engineering. 2016, 631−642
H C. Identifying thresholds for object-oriented software metrics. Journal 60. Goodfellow I, Bengio Y, Courville A. Deep Learning. Cambridge: MIT

of Systems and Software, 2012, 85(2): 244–257 Press, 2016
39. Hofmeister J, Siegmund J, Holt D V. Shorter identifier names take

61. Lever J, Krzywinski M, Altman N. Logistic regression. Nature

longer to comprehend. In: Proceedings of the 24th IEEE International Methods, 2016, 13(7): 541–543
Conference on Software Analysis, Evolution and Reengineering 62. Zhang Y, Zhou M, Mockus A, Jin Z. Companies’ participation in OSS

(SANER). 2017, 217−227 development−an empirical study of OpenStack. IEEE Transactions on

40. Morales R, Khomh F, Antoniol G. RePOR: mimicking humans on

Software Engineering, 2021, 47(10): 2242–2259
refactoring tasks. Are we there yet? Empirical Software Engineering, 63. Amreen S, Mockus A, Zaretzki R, Bogart C, Zhang Y X. ALFAA:

2020, 25(4): 2960–2996 active learning fingerprint based anti-aliasing for correcting developer

41. Hussain Y, Huang Z, Zhou Y. Improving source code suggestion with

identity errors in version control systems. Empirical Software

code embedding and enhanced convolutional long short-term memory. Engineering, 2020, 25(2): 1136–1167
IET Software, 2021, 15(3): 199–213 64. El Emam K, Benlarbi S, Goel N, Rai S N. The confounding effect of

42. Scalabrino S, Bavota G, Vendome C, Linares-Vasquez M, Poshyvanyk

class size on the validity of object-oriented metrics. IEEE Transactions
D, Oliveto R. Automatically assessing code understandability. IEEE on Software Engineering, 2001, 27(7): 630–650
Transactions on Software Engineering, 2021, 47(3): 595–613 65. Bengio Y. Practical recommendations for gradient-based training of

43. Smith N, Van Bruggen D, Tomassetti F. JavaParser: Visited. Victoria:

deep architectures. In: Montavon G, Orr G B, Müller K R, eds. Neural
Leanpub, 2021 Networks: Tricks of the Trade. Berlin: Springer, 2012, 437−478
44. Jiang Y, Liu H, Zhu J, Zhang L. Automatic and accurate expansion of

66. Wang W, Li G, Shen S, Xia X, Jin Z. Modular tree network for source

abbreviations in parameters. IEEE Transactions on Software code representation learning. ACM Transactions on Software
Engineering, 2020, 46(7): 732–747 Engineering and Methodology, 2020, 29(4): 31
45. Butler S. The effect of identifier naming on source code readability and

67. Allamanis M, Barr E T, Bird C, Sutton C. Suggesting accurate method

quality. In: Proceedings of the Doctoral Symposium for ESEC/FSE on and class names. In: Proceedings of the 10th Joint Meeting on
Doctoral Symposium. 2009, 33−34 Foundations of Software Engineering. 2015, 38−49
46. Lawrie D, Feild H, Binkley D. Quantifying identifier quality: an

68. Baranasuriya N. Java Coding Standard. See se-education/guides/

analysis of trends. Empirical Software Engineering, 2007, 12(4): conventions/java/index website, 2022

359–388 69. Triguero I, González S, Moyano J M, García S, Alcalá-Fdez J, Luengo

J, Fernández A, Del Jesús M J, Sánchez L, Herrera F. KEEL 3.0: an Hao Jiang received a BSc degree from the
open source software for multi-stage analysis in data mining. Department of Computer Science and
International Journal of Computational Intelligence Systems, 2017, Technology, North China Electric Power
10(1): 1238–1249 University, China in 2014, and PhD degree from
70. Zerouali A, Mens T. Analyzing the evolution of testing library usage in
the School of Computer Science and Technology,

open source java projects. In: Proceedings of the 24th IEEE
International Conference on Software Analysis, Evolution and University of Science and Technology of China,
Reengineering (SANER). 2017, 417−421 China. He is currently an Associate Professor with
the School of Artificial Intelligence, Anhui University, China. His
Yamin Hu received a BSc degree from the current research interests include complex network analysis,
College of Information Engineering, Northwest combinatorial optimization, and computational intelligence.
A&F University, China in 2016, and MSc degree
from the School of Computer Science and Zongyao Hu is currently pursuing the PhD degree
Technology, University of Science and with the School of Computer Science and
Technology of China, China in 2019. He is Technology, Beijing Institute of Technology,
currently working toward a PhD degree at the China. His current research interests include deep
School of Computer Science and Technology, Beijing Institute of learning and image processing.
Technology, China. His current research interests include software
engineering and database systems.

Measuring Code Maintainability With Deep Neural Networks: Yamin HU (, Hao Jiang, Zongyao HU

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Measuring Code Maintainability With Deep Neural Networks: Yamin HU (, Hao Jiang, Zongyao HU

Uploaded by

Copyright:

Available Formats

Front.

Measuring code maintainability with deep neural networks

Yamin HU ( ✉)1, Hao JIANG2, Zongyao HU1

The widely used code metric is Maintainability Index (MI)

of each subcharacteristic is computed by aggregating the

Section 4 evaluates the approach, Section 5 discusses related computed by aggregating one or more code metrics.

covered three kinds of code properties: size, cohesion, and

and statements, and there are complex relationships among in JDK 1.8. It is composed of two methods and two fields only

As suggested by Fig. 1, the lexical features are decomposed

prediction. Feature extraction automatically extracts as many Identifiers account for approximately 70% of the source code

readability and understandability [40]. Traditional code

To this end, identifiers within the source code are taken by name. containingNumbers , containingUnderscores, and

f astT ime = ⟨2, [ f ast, T ime], “Field Name”,

BaseCalendar, Date, cdate, Date, so f tWordi = ⟨text, position, isInDic,

f astT ime = ⟨2, [⟨“fast”, 1, true, 1⟩,

f ieldDeclarations = ⟨ f ieldDeclaration1 , . . . , modi f ier, isS tatic,

methodName and parameteri are the method name and ith

For example, the following statement “public int a,b;” declares Compared with natural language, source code is highly

understandability are two crucial aspects for code

tokens. For example, indentation, successive white spaces, and Deep learning techniques have been successfully applied to

neighbors (k-NN) algorithm (noted as CHC) [35]. According Ideally, we should have constructed training data in the same

empirically designed according to our experience in

The first issue with the identifiers within this class is the form ● Third, disabling field declarations results in less

maintainability prediction models using the Knowledge

maintenance. IEEE Transactions on Software Engineering, 1980, SE-

maintainability of their code. However, DeepM has other internal quality attributes. Information and Software Technology, 2013,

software systems. In: Proceedings of the 29th Euromicro Conference.

predicting software maintainability. Journal of Systems and Software,

any Java-specific features. Consequently, it may have the maintainability. In: Proceedings of the 6th International Conference on

using multivariate adaptive regression splines. Journal of Systems and

for measuring code maintainability rely heavily on code An empirical study on the developers’ perception of software coupling.

perception of quality metrics. In: Proceedings of 2018 IEEE

build mappings from the resulting lexical semantics, as well as metrics for indicating architectural inconsistencies. Software Quality

Proceedings of the 33rd ACM/IEEE International Conference on

potential of the proposed approach regarding the evaluation of Descriptive compound identifier names improve source code

1. Schneidewind N F. The state of software maintenance. IEEE

Language Processing (Volume 1: Long Papers). 2015, 1556−1566 47. ORACLE. Naming Conventions. See oracle-base/articles/misc/naming-

28. Nuñez-Varela A S, Pérez-Gonzalez H G, Martínez-Perez F E,

code readability classification. Information and Software Technology,

distributed representations of code. Proceedings of the ACM on

readability. International Journal of Software Engineering and its

bilingual word embedding agreement for unsupervised neural machine

Symposium on the Foundations of Software Engineering. 2020,

International Conference on Knowledge Discovery & Data Mining.

approach to automated black-box android app testing. In: Proceedings

object-oriented software metrics. IET Software, 2020, 14(6): 727–738 of the 24th ACM SIGSOFT International Symposium on Foundations

(SANER). 2017, 217−227 development−an empirical study of OpenStack. IEEE Transactions on

2020, 25(4): 2960–2996 active learning fingerprint based anti-aliasing for correcting developer

identity errors in version control systems. Empirical Software

68. Baranasuriya N. Java Coding Standard. See se-education/guides/

analysis of trends. Empirical Software Engineering, 2007, 12(4): conventions/java/index website, 2022

You might also like