Front. Comput. Sci., 2023, 17(6): 176214
https://doi.org/10.1007/s11704-022-2313-0
RESEARCH ARTICLE
Higher Education Press 2023
Abstract  The maintainability of source code is a key quality characteristic for software quality. Many approaches have been proposed to quantitatively measure code maintainability. Such approaches rely heavily on code metrics, e.g., the number of Lines of Code and McCabe's Cyclomatic Complexity. The employed code metrics are essentially statistics regarding code elements, e.g., the numbers of tokens, lines, references, and branch statements. However, natural language in source code, especially identifiers, is rarely exploited by such approaches. As a result, replacing meaningful identifiers with nonsense tokens would not significantly influence their outputs, although the replacement should significantly reduce code maintainability. To this end, in this paper, we propose a novel approach (called DeepM) to measure code maintainability by exploiting the lexical semantics of text in source code. DeepM leverages deep learning techniques (e.g., LSTM and the attention mechanism) to exploit these lexical semantics in measuring code maintainability. Another key rationale of DeepM is that measuring code maintainability is complex and often far beyond the capabilities of statistics or simple heuristics. Consequently, DeepM leverages deep learning techniques to automatically select useful features from complex and lengthy inputs and to construct a complex mapping (rather than simple heuristics) from the input to the output (the code maintainability index). DeepM is evaluated on a manually-assessed dataset. The evaluation results suggest that DeepM is accurate: it generates the same rankings of code maintainability as those of experienced programmers on 87.5% of manually ranked pairs of Java classes.

Keywords  code maintainability, lexical semantics, deep learning, neural networks

Received May 25, 2022; accepted October 28, 2022
E-mail: ymhu@bit.edu.cn

1 Introduction

Software maintenance is the modification of a software product after delivery to fix bugs, to improve performance or other attributes, or to adapt the involved software to a new environment [1,2]. Among these maintenance activities, 21% are corrective, 50% are perfective, 25% are adaptive, and 4% are others [3]. Notably, software maintenance activities are inevitable because of the complexity of software and the variability of requirements. The cost of software maintenance is high, up to 67% of the total cost of the software development life cycle [4]. Consequently, software maintenance is essential for the continuous evolution of software. Source code is the major artifact modified during the software maintenance process [5]. The international standard ISO/IEC 5055:2021 for automated source code quality measures [6] defines code maintainability as:

The capability of a product to be modified by the intended maintainers with effectiveness and efficiency.

Code maintainability is a key concern for maintainers. To make the modifications, the maintainers must first read (readability) and understand (understandability) the source code, then analyze (analyzability) the source code to find the modification positions, and finally may reuse (reusability) the existing components to complete the modifications (modifiability) [7]. If source code can be modified by the intended maintainers with little effort to successfully complete software maintenance activities such as fixing bugs, the code is of high maintainability. Code maintainability is critical for measuring software maintainability and predicting software maintenance cost. Low-maintainability code forces future maintainers (who may not be familiar with the considered code) to spend much more effort understanding the code than the original developers did, and thus the cost of software maintenance rises significantly [8].

Notably, code quality covers multiple aspects. The first and foremost aspect is the functionality and performance of the implementation (source code) with regard to its corresponding requirements, i.e., the degree to which the source code works as expected [9]. Another key aspect of code quality is the effectiveness and efficiency with which the maintainers modify the source code, i.e., code maintainability [10]. The second aspect (concerning human interpretation and modification of the source code) is often more subjective than the first aspect (concerning the functionality and performance of the source code); thus, the second aspect is more difficult to quantify.

Many approaches have been proposed to quantitatively measure the maintainability of source code. These approaches usually rely heavily on code metrics, e.g., Lines of Code (LOC) and McCabe's Cyclomatic Complexity (MCC) [8,11].
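As a toy illustration of how such metric-based measurement reduces to counting, the sketch below computes a simplified LOC and an approximate MCC for a small Java snippet. The snippet is paraphrased from java.util.Date, and the branch-keyword heuristic is ours, not the exact metric definitions:

```python
import re

def lines_of_code(src: str) -> int:
    """Count non-blank source lines (a simplified LOC)."""
    return sum(1 for line in src.splitlines() if line.strip())

def cyclomatic_complexity(src: str) -> int:
    """Approximate MCC as 1 + number of branch points (simplified)."""
    branch_pattern = r"\b(?:if|for|while|case|catch)\b|&&|\|\|"
    return 1 + len(re.findall(branch_pattern, src))

java_src = """
long getTimeImpl() {
    if (cdate != null && !cdate.isNormalized()) {
        normalize();
    }
    return fastTime;
}
"""

print(lines_of_code(java_src))          # 6
print(cyclomatic_complexity(java_src))  # 3 (one "if", one "&&")
```

Real metric tools compute these counts over a parsed syntax tree rather than with regular expressions; the point here is only that the inputs to such metrics are surface statistics.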
analyzability, changeability, stability, and testability; the score three code properties: size, cohesion, and coupling.

Although existing approaches to measuring code maintainability have significantly facilitated the quantitative measurement of code maintainability [8,11,15], the resulting maintainability indexes may diverge from human perception [16–18]. Existing approaches to measuring code maintainability rely heavily on code metrics. The employed code metrics are essentially statistics regarding code elements, e.g., the numbers of tokens, lines, references, and branch statements. For example, LOC counts source code lines, MCC counts the branch statements, CBO (Coupling Between Objects) counts the classes that are coupled to a given class, and RFC (Response For a Class) counts the unique methods invoked in a class. Such code metrics, and the approaches built on them, do not consider the semantics of the natural language in source code, especially identifiers. The natural language contains rich semantic information, which is an important source for programmers to understand the business logic of programs [19]. Consequently, such natural language may significantly influence the quality of source code, especially its maintainability. For example, replacing meaningful identifiers with nonsense tokens would significantly reduce the maintainability of the source code. However, the replacement may not significantly influence the outputs of existing approaches because the replacement does not significantly change the underlying code metrics, e.g., the MCC, CBO, or RFC. An example is presented in Listings 1−2. Listing 1 presents the method getTimeImpl from class java.util.Date in JDK 1.8, which normalizes the date of a Date object if necessary and returns the corresponding milliseconds. The method is transformed into the one in Listing 2 by replacing its identifiers with nonsense tokens. The two versions have the same numbers of tokens, lines, and branch statements. However, the latter is much harder to understand and maintain than the former because of the nonsense tokens. As a result, the maintainability indexes produced by such approaches significantly diverge from human perception. Notably, although some textual features (e.g., the number of identifier terms in a dictionary) of source code were proposed to measure code readability [20], these features are still essentially statistics regarding text in source code.

To this end, in this paper, we propose a novel approach called DeepM to quantitatively measure the maintainability of source code. The key rationale of DeepM is that the lexical semantics of identifiers should be exploited to measure code maintainability. Identifiers account for approximately 70% of source code (in terms of characters) [21] and serve as the major source of code comprehension [22]. However, they have not yet been exploited by existing approaches for measuring code maintainability. In this paper, we leverage deep learning techniques (i.e., word embedding [23], LSTM [24], tree-LSTM [25], attention mechanisms [26], and dense networks [27]) to exploit the lexical semantics of identifiers and to merge the resulting semantics, the structural features of the source code, and the statistical features of the source code into a single maintainability index.

In summary, our paper makes the following contributions:

● A deep learning-based approach to the quantitative measurement of code maintainability. To the best of our knowledge, this is the first approach that exploits the semantics of identifiers for code maintainability measurement.
● A public implementation of the proposed approach and two publicly available datasets for the training and evaluation of code maintainability models. The implementation and datasets are publicly available at github/DeepQuality/Deep_Maintainability.
● An initial evaluation of the proposed approach. The evaluation results suggest that the proposed approach outperforms the state-of-the-art approaches, improving the accuracy of the results from 75.0% and 57.5% to 87.5%.

The rest of the paper is structured as follows. Section 2 introduces related work, Section 3 proposes the new approach,
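The replacement argument can be made concrete: the surface statistics that classic metrics are built from are identical for a meaningful and an obfuscated version of the same method. A minimal sketch (the tokenizer and the one-line snippets are illustrative, not the paper's Listings 1 and 2):

```python
import re

def surface_stats(src: str):
    """Token, line, and branch-statement counts: the kind of
    statistics that classic maintainability metrics are built from."""
    tokens = re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*|\S", src)
    lines = sum(1 for l in src.splitlines() if l.strip())
    branches = len(re.findall(r"\b(?:if|for|while|case|catch)\b", src))
    return len(tokens), lines, branches

original = "long getTimeImpl() { if (cdate != null) { normalize(); } return fastTime; }"
# Same structure, but identifiers replaced with nonsense tokens:
obfuscated = "long aaa() { if (bbb != null) { ccc(); } return ddd; }"

print(surface_stats(original) == surface_stats(obfuscated))  # True
```

Both versions yield identical token, line, and branch counts, so any metric computed purely from these statistics cannot tell them apart, even though only the first is readable.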
of code maintainability of object-oriented programs. For example, CBO measures modularity [29] and reusability [30]; RFC measures testability [31]; LCOM measures modularity [29]; WMC measures reusability [30] and modifiability [32]. In addition, some code metrics were specifically proposed to measure code maintainability. For example, Alzahrani et al. [33] defined a client-based class cohesion metric to measure class maintainability; the metric considers the ways in which other classes use the considered class.

A single metric measures only one aspect of source code, such as complexity, and thus has limitations for measuring code maintainability. Consequently, existing approaches leverage multiple code metrics to measure code maintainability. According to the way mappings between code metrics and maintainability indexes are constructed, these approaches can be divided into two categories: heuristic-based approaches and machine learning-based approaches.

2.1 Heuristic-based approaches

To establish complex mappings between the input source code and maintainability indexes, many approaches based on machine learning techniques have been proposed [8,15,35]. Malhotra et al. [35] conducted an empirical study of class maintainability prediction models. They compared the performance of 28 techniques for constructing mappings between code metrics extracted from Java classes and code maintainability (i.e., high or low maintainability), e.g., AdaBoost, k-Nearest Neighbor, Logistic Regression with ridge estimator (noted as LRR), and CHC adaptive search for instance selection-based k-nearest neighbors (noted as CHC). In their experiments, LRR and CHC obtain the best performance in terms of two different evaluation metrics. Al Dallal [8] predicted class maintainability using various code metrics. The author first collected source files from three open-source Java projects, including 1,099 Java classes, and then constructed a dataset that contained the code metrics of the source files and the corresponding labels indicating class maintainability. There were 19 code metrics in total, which
Fig. 1 Overview of DeepM
identifiers = ⟨identifier_1, . . . , identifier_n⟩,   (1)

where identifier_i is an identifier in the Java class.
For the running example in Listing 3, we extract the following identifiers:

identifiers = ⟨Date, fastTime,

false, false, false⟩.   (4)

The expression suggests that the identifier is composed of two soft words (i.e., fast and Time), it is a field name, and it does not contain any numbers, underscores, or dollar signs.
Third, we further refine each of the soft words as follows:

3.2.2 Field declarations

Fields of classes represent the data/statuses of the enclosing classes, and thus field names are critical for the interpretation of object statuses. Well-formed fields could help maintainers quickly understand the usages of the field variables. For example, a field representing a constant value (declared by static and final) should be named in uppercase [48]. A field declaration specifies the names, access modifiers, and data types of the fields.
First, we retrieve all field declarations from a Java class:
according to Eq. (3).
For the running example in Listing 3, the field declarations are represented in Fig. 2.

3.2.3 Method headers

Methods specify the behaviors of the associated class. Although methods are composed of method headers and method bodies, maintainers often heavily rely on method headers to determine the behaviors of a class because lengthy method bodies are often much more difficult to read than well-formed method headers [49]. Consequently, the quality of method headers may significantly influence the code maintainability (especially the code reusability) of the whole class. To this end, we take method headers as an important feature for code maintainability measurement.
First, we retrieve all method headers from a Java class:

methodHeaders = ⟨methodHeader_1, . . . , methodHeader_n⟩.   (10)

Each of the resulting method headers is further refined as follows:

methodHeader_i = ⟨methodName, k, [parameter_1, parameter_2, . . . , parameter_k],

An AST is a tree representation of the syntactic structure of source code. ASTs are widely used to represent the structures of source code [51]. Consequently, we retrieve the AST for each method within the given class. Each node of the AST represents a source code element, e.g., a method declaration, a variable, or a literal.
Notably, we leverage only the structure of the AST and the types of its nodes. Other information embedded in the tree, e.g., the identifiers and the values of literals, is ignored. The resulting AST (excluding the ignored information) is called a simplified AST.

3.3.2 Structural format

Although ASTs are widely used to represent the structures of
Fig. 2 Representation of the field declarations of the running example
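The identifier representation discussed around Eq. (4) can be sketched as follows. The splitting regex and the exact tuple layout (soft words, field flag, and digit/underscore/dollar flags) are our reading of the text, not the paper's implementation:

```python
import re

def identifier_features(name: str, is_field: bool):
    """Split an identifier into soft words (camelCase and underscore
    boundaries) and record simple lexical flags, mirroring the
    fastTime example around Eq. (4)."""
    parts = re.split(r"[_$]+", name)
    soft_words = []
    for part in parts:
        # Acronym runs, Capitalized words, lowercase runs, digit runs.
        soft_words += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", part)
    return (soft_words,
            is_field,
            any(c.isdigit() for c in name),
            "_" in name,
            "$" in name)

print(identifier_features("fastTime", is_field=True))
# → (['fast', 'Time'], True, False, False, False)
```

For the field fastTime this reproduces the reading given in the text: two soft words, a field name, and no digits, underscores, or dollar signs.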
source code, they ignore the coding styles and formats of

3.5 Deep learning-based quality prediction
Table 1 Employed code metrics

Category         Code metrics
Coupling         Coupling between objects (CBO); response for a class (RFC)
Cohesion         Lack of cohesion of methods (LCOM)
Complexity       Weighted methods per class (WMC)
Size             Lines of code (LOC)
Basic counting   Numbers of {anonymous, inner} classes;
                 numbers of {total, abstract, default, final, private, protected, public, static, synchronized, visible} methods;
                 numbers of {total, default, final, private, protected, public, static, synchronized} fields;
                 numbers of assignments, log statements, loops, returns, try/catches, comparisons, lambdas,
                 math operations, numbers, parenthesized expressions, string literals, unique words, and variables;
                 maximum number of nested blocks
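A toy extractor for a few of the "basic counting" metrics in Table 1. A real implementation would count over a parsed AST; the regex heuristics and the snippet below are ours:

```python
import re

def basic_counts(src: str) -> dict:
    """Approximate a handful of Table 1's basic-counting metrics
    with simple pattern matching (illustrative only)."""
    return {
        "loops": len(re.findall(r"\b(?:for|while)\b", src)),
        "returns": len(re.findall(r"\breturn\b", src)),
        "comparisons": len(re.findall(r"==|!=|<=|>=", src)),
        "string_literals": len(re.findall(r'"[^"\n]*"', src)),
    }

java_src = 'for (int i = 0; i != n; i++) { if (s == null) return ""; }'
print(basic_counts(java_src))
# → {'loops': 1, 'returns': 1, 'comparisons': 2, 'string_literals': 1}
```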
Fig. 3 Deep learning-based model for code maintainability measurement
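Our reading of the attention-based pooling used in the model: a small learnable scorer assigns each input vector a score, a softmax turns the scores into probabilities, and the probability-weighted sum yields a fixed-length vector regardless of how many inputs there are. A pure-Python sketch with made-up weights, not trained parameters:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(vectors, score_weights):
    """Collapse a variable-length list of equal-sized vectors into one
    fixed-length vector: score each vector with a linear scorer,
    softmax the scores, and return the weighted sum."""
    scores = [sum(w * v for w, v in zip(score_weights, vec)) for vec in vectors]
    probs = softmax(scores)
    dim = len(vectors[0])
    return [sum(p * vec[i] for p, vec in zip(probs, vectors)) for i in range(dim)]

# Three 4-dimensional "soft word" embeddings (illustrative values):
vecs = [[0.1, 0.2, 0.0, 0.5], [0.4, 0.1, 0.3, 0.2], [0.0, 0.0, 0.1, 0.9]]
pooled = attention_pool(vecs, score_weights=[1.0, 0.5, -0.2, 0.3])
print(len(pooled))  # 4, fixed length regardless of how many inputs
```

Because the output is a convex combination of the inputs, each coordinate of the pooled vector stays within the range spanned by the corresponding input coordinates.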
leverage attention-based dense networks [26] to transform them into fixed-length numeric vectors. Attention-based dense networks have a simplified attention mechanism, which is proposed to solve the "addition" and "multiplication" problems of long-term memory [26]. Notably, attention-based dense networks require a customized learnable function to produce a probability vector. To this end, we employ a dense network to simulate such functions (as presented in Fig. 4).

For the simplified ASTs (extracted in Section 3.3.1), we employ a variant of LSTM, i.e., the child-sum tree-LSTM [25], to learn the embeddings of the nodes. The standard LSTM handles sequential data. In contrast, the tree-LSTM is specially designed to handle tree-structured data such as ASTs. In the tree-LSTM, a unit is updated according to the current input, its cell state, and the hidden states of its children. In a bottom-up manner, the hidden state of the root unit is finally updated and serves as the embedding of the given AST.

The deep learning-based model works as follows. First, each feature is fed into the corresponding subnetwork. For identifiers, their features are extracted according to Eq. (6); for the features of each soft word in identifiers (corresponding to an element of the array in Eq. (6)), the text of the soft word is embedded and concatenated with the other corresponding numeric features; then the resulting vectors (the number of vectors equals the number of soft words) are fed into an attention-based dense network to generate the vector representing identifiers. Similarly, we obtain the vector representing field declarations and the vector representing method headers. For simplified ASTs, the nodes of each simplified AST are embedded and fed into a tree-LSTM network; then the outputs of the tree-LSTM network resulting from the multiple simplified ASTs are fed into an attention-based neural network. For the structural format, the internal texts are embedded and fed into an LSTM network. For code metrics, the values of all metrics are fed into a dense network. Finally, the outputs of the six subnetworks are merged and fed into a dense network with the sigmoid activation function, and the output of this dense network is the final maintainability index.

4 Evaluation

4.1 Research questions

Our evaluation investigates the following research questions.

● RQ1: Is DeepM accurate, and to what extent can DeepM outperform the state-of-the-art approaches in measuring code maintainability?
● RQ2: How do the various features influence the performance of DeepM?
● RQ3: Is DeepM efficient and scalable?

RQ1 investigates the performance of DeepM in measuring code maintainability. To answer RQ1, according to the
Fig. 4 Attention-based dense network
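The child-sum tree-LSTM update described above can be sketched in pure Python. The weights are toy values and the two-leaf "simplified AST" is illustrative, not DeepM's actual network:

```python
import math

def sig(x): return 1.0 / (1.0 + math.exp(-x))
def vadd(*vs): return [sum(t) for t in zip(*vs)]
def matvec(W, x): return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def tree_lstm_unit(x, children, W, U, b):
    """One child-sum tree-LSTM update (Tai et al.): the input, output,
    and update gates depend on the node input x and the SUM of the
    children's hidden states; each child gets its own forget gate.
    Returns (cell state, hidden state)."""
    d = len(x)
    h_tilde = [sum(h[k] for _, h in children) for k in range(d)] if children else [0.0] * d
    i = [sig(v) for v in vadd(matvec(W["i"], x), matvec(U["i"], h_tilde), b["i"])]
    o = [sig(v) for v in vadd(matvec(W["o"], x), matvec(U["o"], h_tilde), b["o"])]
    u = [math.tanh(v) for v in vadd(matvec(W["u"], x), matvec(U["u"], h_tilde), b["u"])]
    c = [ik * uk for ik, uk in zip(i, u)]
    for c_child, h_child in children:
        f = [sig(v) for v in vadd(matvec(W["f"], x), matvec(U["f"], h_child), b["f"])]
        c = [ck + fk * cck for ck, fk, cck in zip(c, f, c_child)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return c, h

d = 2
eye = [[0.5, 0.0], [0.0, 0.5]]                 # toy shared weights
W = {g: eye for g in "ifou"}
U = {g: eye for g in "ifou"}
b = {g: [0.0] * d for g in "ifou"}

# A tiny tree: two leaf nodes feeding one root node, bottom-up.
leaf1 = tree_lstm_unit([1.0, 0.0], [], W, U, b)
leaf2 = tree_lstm_unit([0.0, 1.0], [], W, U, b)
root_c, root_h = tree_lstm_unit([0.5, 0.5], [leaf1, leaf2], W, U, b)
print(len(root_h))  # the root hidden state serves as the tree embedding
```

The bottom-up evaluation order mirrors the description in the text: leaves are updated first, and the root's hidden state becomes the embedding of the whole (simplified) AST.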
empirical study on class maintainability prediction [35], we compare DeepM with two baseline approaches: logistic regression with ridge estimator (noted as LRR) and CHC adaptive search for instance selection-based k-nearest

maintainability; we repeat this step until agreement is reached.

4.2.2 Training data

Training data: 1,394,514 (automated construction)
Testing data: 400 (manual construction)

To answer RQ1, we evaluate DeepM and the two baseline approaches, i.e., LRR and CHC, through the following steps:
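The CHC baseline is adapted from KEEL's 1-NN classifier to 7-NN so that it can emit a continuous maintainability index rather than a binary label. One natural construction (our guess at the adaptation, not the paper's exact code) is the fraction of high-maintainability classes among the seven nearest neighbors in code-metric space:

```python
def knn_index(query, instances, k=7):
    """Maintainability index as the fraction of the k nearest selected
    instances labeled high-maintainability (1) rather than low (0).
    Distance is plain Euclidean over code-metric vectors."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(instances, key=lambda inst: dist(inst[0], query))[:k]
    return sum(label for _, label in nearest) / k

# Toy metric vectors (e.g., [LOC, WMC]) with 0/1 maintainability labels:
data = [([10, 2], 1), ([12, 3], 1), ([11, 2], 1), ([90, 30], 0),
        ([85, 28], 0), ([14, 4], 1), ([80, 25], 0), ([13, 2], 1)]
print(knn_index([15, 3], data, k=7))  # 5/7: five of the 7 neighbors are high
```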
● First, the training data is partitioned randomly into the training part (96%) and the validation part (4%).
● Second, DeepM and the baseline approaches are trained independently on the training part and fine-tuned with the validation part.
● Third, DeepM and the baseline approaches are evaluated with the testing data.

Suitable hyper-parameter configurations are crucial for DeepM to measure code maintainability. However, the training of DeepM requires substantial time (~62 hours), which means that we cannot test many combinations of hyper-parameters. Consequently, we first identify several typical configurations [65] to train DeepM and observe the performance on the validation data. Subsequently, we try other configurations according to the heuristics in [65] to find better hyper-parameter configurations.

We demonstrate the final hyper-parameter configurations as follows. We train DeepM using the Adam optimizer for 3 epochs with a learning rate of 0.001, a batch size of 32, and the binary cross-entropy loss function. Regarding the structure of our network, the attention-based dense networks are 3-layer networks (16-8-1) with the ELU activation function. The dense network for the code-metric features is also a 3-layer network (32-16-8) with the ELU activation function. The LSTM is a single-layer network with 32 units and 8-dimensional embedding of the input text. The tree-LSTM is a single-layer network with 8-dimensional embedding of both the simplified AST nodes and the final representations. The final dense network is a 4-layer network (32-16-8-1) with the ELU activation function for the front layers and the sigmoid activation function for the last layer. The sizes of the vocabularies for soft words, simplified AST nodes, and texts in structural format are 56,949 (covering 99% of all soft words), 591 (all simplified AST nodes), and 110 (all texts in structural format), respectively.

For the baseline approach CHC, the original implementation (in KEEL) employs the 1-nearest neighbor (1-NN) algorithm to classify unseen instances based on selected instances. To compute maintainability indexes, we modify the 1-NN to 7-NN (optimal in terms of the accuracy regarding RQ1). For the other parameters of the two baseline approaches, we take the same configurations as [35].

During the testing process, the compared approaches are employed to rank each class pair in the testing data, which is composed of a low-maintainability class and a high-maintainability class. If the maintainability index generated for the high-maintainability class is greater than that of the low-maintainability class, the ranking of the two classes is deemed correct. Otherwise, the ranking is incorrect. The performance of the evaluated approaches in ranking the classes is computed as follows:

RankingAccuracy = (#class pairs ranked correctly) / (#generated class pairs).   (12)

To investigate the importance of different features (RQ2), we conduct an ablation study [66], which is a frequently used method for understanding the reasons for the success of neural network-based models. Each time, we disable a single feature and repeat the evaluation to investigate how the performance of DeepM changes.

4.4 RQ1: DeepM outperforms the baseline approaches

To answer RQ1, we compare DeepM with the two baseline approaches in terms of measuring code maintainability. We present the evaluation results in Table 3 and make the following observations:

● First, DeepM is accurate with respect to ranking class pairs. In most cases (87.5%), DeepM generates the correct rankings. The accuracies on the five projects vary from 80.0% to 92.5%, which demonstrates the effectiveness of DeepM.
● Second, DeepM significantly outperforms the two baseline approaches, i.e., LRR and CHC. On all five projects, DeepM generates higher accuracies than both baseline approaches. It improves the average RankingAccuracy from 75.0% for LRR and 57.5% for CHC to 87.5%, where the improvements are 12.5 and 30.0 percentage points, respectively.
● Finally, measuring code maintainability is challenging for LRR and CHC: their average RankingAccuracy values are low, i.e., 75.0% for LRR and 57.5% for CHC. This demonstrates that measuring code maintainability is far beyond the abilities of simple code metrics.

To further reveal the reasons for the success of DeepM, we retrieve all 27 (27/200 = 13.5%) class pairs of the testing data for which the two baseline approaches' rankings are incorrect but DeepM's rankings are correct, and then analyze the resulting classes from multiple aspects such as identifier semantics, logical complexity, and code style. We find that one of the major reasons for the success of DeepM is that it takes full advantage of the natural language in identifiers. DeepM correctly ranks 25 (25/27 = 92.6%) of these class pairs because it exploits the feature of identifiers: forms (length, abbreviated or full form, and letter case) and semantics. An example from the Apollo project is the Java class ConfigChangeEvent. The metrics of this Java class, e.g., CBO, WMC, RFC, and LCOM, are reasonable and comparable to those of the high-maintainability classes within the training data. Consequently, the two baseline approaches predict this class as of higher maintainability (than the other class in the same class pair). However, the identifiers within this class are poorly designed.
Table 3 Ranking accuracy on the testing data
Approach Tomcat Log4j 2 Apollo Druid Sentinel Average
DeepM 90.0% 80.0% 90.0% 85.0% 92.5% 87.5%
LRR 80.0% 70.0% 77.5% 75.0% 72.5% 75.0%
CHC 65.0% 55.0% 70.0% 45.0% 52.5% 57.5%
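The RankingAccuracy metric of Eq. (12) can be sketched directly; the toy index function and class-pair data below are illustrative only:

```python
def ranking_accuracy(pairs, index):
    """Eq. (12): fraction of (low, high) class pairs for which the
    index assigned to the high-maintainability class is the larger."""
    correct = sum(1 for low, high in pairs if index(high) > index(low))
    return correct / len(pairs)

# Toy index: pretend longer names mean higher maintainability.
toy_index = len
pairs = [("x", "configChangeEvent"), ("tmp", "urlPatterns"), ("parser", "p")]
print(ranking_accuracy(pairs, toy_index))  # 2/3: the last pair is mis-ranked
```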
and find that the reasons for misprediction might be that many identifiers within these classes are very descriptive. An example from the Tomcat project is the Java class JspPropertyGroup, which is of lower maintainability in the enclosing class pair; however, DeepM suggests that the class is of higher maintainability. Many issues result in the class being of lower maintainability. For example, the boolean variable deferredSyntax should be named isDeferredSyntax instead of a noun phrase [68]; the field urlPattern should be named urlPatterns because it is a collection variable of type LinkedHashSet<String> [68]. From the example, we can conclude that measuring code maintainability is complex, and the word embedding technique may not be sufficient for exploiting the semantics of identifiers to measure code maintainability. In the future, we will leverage other techniques to model the semantics in source code.

4.5 RQ2: All exploited features are helpful

To investigate the importance of the exploited features, we conduct an ablation study [66] on the testing data. The evaluation results are presented in Table 4. From the table, we make the following observations:

● First, all exploited features are helpful. Disabling any feature results in large reductions with respect to the average RankingAccuracy. The reduction rates vary from 2.9% to 15.4% (last column in Table 4).
● Second, identifiers contribute most to the ability of DeepM to measure code maintainability. Disabling identifiers results in the largest reduction with respect to the average RankingAccuracy, i.e., a reduction rate of up to 15.4%. In addition, for each of the five applications, disabling identifiers has the largest impact on the performance of DeepM. Consequently, we can conclude that identifiers are important for code maintainability measurement.

4.6 RQ3: DeepM is efficient and scalable

To investigate the efficiency of DeepM, during the evaluation specified in Section 4.3, we record the time consumed by its training and testing processes. Notably, the training and testing procedures are conducted on a workstation whose settings are as follows: 128 GB of RAM, an Intel(R) Xeon(R) E5-2620 v4 CPU, and an NVIDIA TITAN RTX GPU. The evaluation results are shown in Table 5.

From the table, we observe that the training phase can be accomplished within 62 hours (3 epochs) on a single workstation, although the training data (containing 85,560,826 lines of code) is massive. We also notice that DeepM is efficient in generating maintainability indexes. It takes DeepM only 0.076 seconds on average to predict the maintainability index for a single class. Notably, "Training Time", "Testing Time", and "Testing Time per Class" include the time required for feature extraction.

Figure 5 further investigates how the size of the training dataset influences the training time. From this figure, we observe that there is a linear relationship between the training time and the size of the training dataset. Consequently, DeepM can be efficiently applied to large-scale training data.

4.7 Threats to validity

A threat to construct validity is that the maintainability labels and maintainability rankings of the classes in the training data and the testing data may be inaccurate. To collect labeled data for training and testing, we collect a large number of Java projects from GitHub and label the maintainability indexes of the Java classes within such projects automatically according to some project-level metrics, e.g., the number of stars associated with the projects. However, there is no strong and quantitative evidence that the employed metrics can accurately distinguish high-quality projects from low-quality projects. Furthermore, not all classes from high-quality (low-quality) projects are of high (low) maintainability. To reduce the threat level, three programmers label 100 randomly-
Table 4 Effects of the exploited features on the testing data (ranking accuracy)
Setting Tomcat Log4j 2 Apollo Druid Sentinel Average ↓ Rate
Default 90.0% 80.0% 90.0% 85.0% 92.5% 87.5%
Disabling identifiers 75.0% 60.0% 80.0% 72.5% 82.5% 74.0% 15.4%
Disabling field declarations 80.0% 82.5% 90.0% 80.0% 92.5% 85.0% 2.9%
Disabling method headers 80.0% 75.0% 90.0% 85.0% 90.0% 84.0% 4.0%
Disabling simplified ASTs 85.0% 70.0% 87.5% 82.5% 92.5% 83.5% 4.6%
Disabling structural format 82.5% 67.5% 90.0% 80.0% 92.5% 82.5% 5.7%
Disabling code metrics 80.0% 67.5% 87.5% 80.0% 92.5% 81.5% 6.9%
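The "↓ Rate" column of Table 4 can be reproduced from the Default accuracy and each ablated accuracy:

```python
# Reduction rate used in the last column of Table 4:
# (default accuracy - ablated accuracy) / default accuracy, in percent.
def reduction_rate(default: float, ablated: float) -> float:
    return (default - ablated) / default * 100.0

print(round(reduction_rate(87.5, 74.0), 1))  # 15.4 (disabling identifiers)
print(round(reduction_rate(87.5, 85.0), 1))  # 2.9  (disabling field declarations)
```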
Table 5 Performance of DeepM regression with ridge estimator (noted as LRR) and CHC
Metric Value adaptive search for instance selection-based k-nearest
Size of training data 85,560,826 LOC neighbors algorithm (noted as CHC), cannot output a float
Training time (3 epoch) ~62 hours number indicating code maintainability; thus, we modify them
Size of testing data 29,434 LOC for evaluation purposes. In the empirical study on class
Testing time ~30 seconds maintainability [35], Malhotra et al. constructed class
Testing time per class ~0.076 seconds
sampled, high-maintainability classes and 100 low- sampled 100,000 Java classes from the training data, and
maintainability classes according to their maintenance report corresponding experimental results in RQ1. Notably,
experiences; the programmers have at least five years of the size of the sampled dataset (i.e, 100,000) is enough for
experience developing or maintaining large commercial training CHC, because the difference between the accuracy
applications using Java language. The results suggest that 189 obtained on 100,000 Java classes and that obtained on 50,000
out of the 200 classes are labeled correctly (The programmers Java classes is less than 1%. However, the reimplementation
reach agreement and the labels are correct). In addition, we manually create another dataset (called testing data), where the classes are ranked manually by the three experienced programmers. Notably, manual labeling is not guaranteed to be correct because maintainability is essentially subjective. The three experienced programmers rank class pairs from the same high-quality projects together. The class pairs are put in the testing data only when the programmers reach agreement.

Another threat related to the datasets is that DeepM is trained on datasets labeled as high or low maintainability rather than with floating-point labels between 0 and 1 indicating the degree of code maintainability. Ideally, a large dataset with fine-grained labels (i.e., floating-point numbers between 0 and 1) should be constructed to train DeepM. However, it is hard to obtain fine-grained maintainability indexes for large numbers of class files: (1) we tried to construct such a dataset automatically from the refactoring and code-review information in code repositories, but could not obtain fine-grained maintainability indexes for large numbers of class files; (2) manually labeling the maintainability indexes of large numbers of class files is infeasible. Consequently, we construct the training data through the heuristic strategy to train our neural network-based model. To reduce the threat level, we use the manually labeled testing data to evaluate the effectiveness of DeepM; ranking the class pairs of the testing data is more challenging than classifying a class file as having high or low maintainability.

A threat to internal validity is that the original implementations of the baseline approaches, i.e., logistic regression-based models, had to be modified for our evaluation, and the modified implementations could be buggy, which may influence the conclusions drawn about the baseline approaches. We also make the modification publicly available as a part of the replication package (available at github/DeepQuality/Deep_Maintainability).

A threat to external validity is that the datasets, especially the manually constructed dataset, are small. Some special characteristics of such data may bias the evaluation, and the obtained conclusions may not hold on other data. To reduce the threat level, we sample Java classes from diversified projects that come from different domains and were developed by different developers/companies.

5 Limitations
The proposed approach is currently confined to a single source file, i.e., a Java class. It is challenging to extend our approach to project-level code maintainability measurement because of the inefficiency of deep neural networks in handling lengthy text. An industrial project usually contains thousands of Java classes that are connected by complex semantic and syntactic correlations. Feeding all such tokens and their relations into a single network could result in an extremely large neural network containing billions of parameters, and tuning such parameters may require more labeled data than we can offer. Notably, the average maintainability index of the classes within a project (called average index for short) alone is insufficient to reveal the maintainability of the whole project because it ignores some correlations (e.g., inheritance) among the classes.

It is also difficult to apply DeepM to code commits.
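Since the manually labeled testing data consists of ranked class pairs, evaluating any scorer against it reduces to a pairwise agreement check. A minimal sketch, where `pairwise_agreement` and the token-counting `toy_score` are invented stand-ins rather than DeepM's actual interface:

```python
# Sketch: agreement between a maintainability scorer and human-ranked
# class pairs. `toy_score` is an invented stand-in, not DeepM itself.

def pairwise_agreement(score, ranked_pairs):
    """Fraction of (better, worse) pairs whose ordering the scorer preserves."""
    agreed = sum(1 for better, worse in ranked_pairs
                 if score(better) > score(worse))
    return agreed / len(ranked_pairs)

def toy_score(source):
    # Invented heuristic: classes with fewer tokens score higher.
    return 1.0 / (1 + len(source.split()))

ranked_pairs = [
    # (higher-maintainability class, lower-maintainability class)
    ("class Counter { int count ; }",
     "class C1 { int a ; int b ; int c ; int d ; }"),
    ("class Loader { void load ( ) { } }",
     "class L { void l ( ) { } void m ( ) { } }"),
]
print(pairwise_agreement(toy_score, ranked_pairs))  # 1.0 on this toy data
```

Ranking demands a correct ordering for every pair, which is why it is a stricter test than a binary high/low classification.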
Yamin HU et al. Measuring code maintainability with deep neural networks 13

A code commit represents changes to a project for a specific programming task, e.g., fixing a bug. Although code commits are often small, they could span multiple Java files. As a result, measuring a commit concerns fine-grained fragments of source code (e.g., a single line of source code), whereas DeepM focuses on the overall maintainability of classes. Consequently, we cannot apply DeepM to code commits without significant adaptation.

Due to the nature of deep learning-based models, the outputs of DeepM cannot directly help programmers improve the maintainability of their code. Programmers may, however, check whether a post-refactoring class is of high maintainability automatically, and then accept or reject the refactoring. In addition, DeepM [...] make the repository in a maintainable state.

Finally, the current implementation of the proposed approach [...]

6 Conclusions
In this paper, we proposed DeepM, a novel approach to measuring code maintainability that exploits the lexical semantics of text in source code. DeepM makes the same judgments of code maintainability as those of experienced programmers on 87.5% of manually ranked pairs of Java classes.

In the future, we would like to extend the proposed approach to other object-oriented programming languages, such as C++. It would also be interesting to investigate [...]
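As noted in the limitations, averaging class-level scores across a project discards correlations such as inheritance. A toy illustration with invented scores:

```python
# Toy illustration: two projects with identical per-class scores (invented
# numbers) get the same average index, although inheritance couples the
# classes of the second project and may affect its real maintainability.

def average_index(class_scores):
    """Plain mean of class-level maintainability scores."""
    return sum(class_scores.values()) / len(class_scores)

project_flat = {"Base": 0.9, "Util": 0.5}    # unrelated classes
project_coupled = {"Base": 0.9, "Sub": 0.5}  # Sub extends Base

# The average alone cannot tell the two projects apart:
print(average_index(project_flat) == average_index(project_coupled))  # True
```

A project-level measure would therefore need to account for relations among classes, not just the per-class scores.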
Yamin Hu received a BSc degree from the College of Information Engineering, Northwest A&F University, China in 2016, and an MSc degree from the School of Computer Science and Technology, University of Science and Technology of China, China in 2019. He is currently working toward a PhD degree at the School of Computer Science and Technology, Beijing Institute of Technology, China. His current research interests include software engineering and database systems.

Hao Jiang received a BSc degree from the Department of Computer Science and Technology, North China Electric Power University, China in 2014, and a PhD degree from the School of Computer Science and Technology, University of Science and Technology of China, China. He is currently an Associate Professor with the School of Artificial Intelligence, Anhui University, China. His current research interests include complex network analysis, combinatorial optimization, and computational intelligence.

Zongyao Hu is currently pursuing the PhD degree with the School of Computer Science and Technology, Beijing Institute of Technology, China. His current research interests include deep learning and image processing.