Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 176 (2020) 1621–1625
www.elsevier.com/locate/procedia

a Recruit Sumai Company Ltd., Minato, Tokyo 105-0023, Japan
b Nomura Research Institute, Ltd., Chiyoda, Tokyo 100-0004, Japan
c Graduate School of Business Science, University of Tsukuba, Bunkyo, Tokyo 112-0012, Japan
Abstract

In recent years, it has become common to automatically distribute content suited to each user by letting AI learn the user's behavior patterns from the user's web access log. On the other hand, web access logs also include browsing by bots, and there are malicious bots whose purpose is DDoS attacks or the illegal mass extraction of content. Furthermore, it is not uncommon for bots to disguise their attributes as if they were users. In this study, we propose a method to discriminate between user and bot web access logs so that the bot's access log can be excluded from the analysis target.
© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the KES International.
1. Introduction

Today's websites are accessed by bots such as search engine crawlers like Google [1] and crawlers that collect site listings [2]. The former commonly access websites with user agent [3] information that shows they are non-human accesses. The latter, on the other hand, often access with a fake user agent string in order to bypass access restrictions [4]. Bots with a fake user agent string are difficult to distinguish from human accesses. However, bot access logs are noise when analyzing user behavior and extracting knowledge from website access logs. In this study, we propose a system that distinguishes human accesses from bot accesses.
2. Related Works
There are many challenges for bot detection in web logs. Zhang et al. [5] and Kheir et al. [6] analyze user agent strings, which we also use for bot detection. Monroy et al. [7] take an approach based on contrast pattern mining and focus on the class imbalance problem of bot detection. Stassopoulou et al. [8] analyze web server access logs and construct a Bayesian network that automatically classifies access log sessions. Mitterhofer et al. [9] analyze web server logs of online game player activities. Musud et al. [10] use a multiple log-file based temporal correlation technique and analyze the correlation between two host-based log files.
In this study, we build a model that distinguishes human accesses from bots using the web access logs of real estate advertising websites. The access logs capture user behaviors such as the hours at which the website was visited, the number of pages checked, and so on. We expect that analyzing the patterns of fake user agent strings and fake user behaviors is useful for bot discrimination.
The unit to be distinguished by the model is one site visit from the same cookie [11]. Site visits are defined in terms of sessions [12]. A session starts from the cookie's first page view, and page views within 30 minutes of the previous page view are treated as part of the same session. A page view occurring more than 30 minutes after the previous one is treated as the start of a separate session.
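The 30-minute sessionization rule above can be sketched as follows; the function name and toy timestamps are illustrative, not from the paper.

```python
from datetime import datetime, timedelta

# Split one cookie's page-view timestamps (assumed sorted) into
# sessions: a gap of more than 30 minutes starts a new session.
def sessionize(timestamps, timeout=timedelta(minutes=30)):
    sessions = []
    current = []
    for ts in timestamps:
        if current and ts - current[-1] > timeout:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

views = [datetime(2020, 1, 1, 10, 0),
         datetime(2020, 1, 1, 10, 10),  # 10-minute gap: same session
         datetime(2020, 1, 1, 11, 0)]   # 50-minute gap: new session
print(len(sessionize(views)))  # 2
```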
As the label for distinguishing bots, we adopt information on whether or not the JavaScript [13] deployed on a web page executes when the page is accessed. On modern websites, including the site that provided our experimental data, JavaScript usually drives the primary page behavior. Furthermore, many bots deliberately do not execute the JavaScript implemented on a web page in order to collect site information at high speed. It should be noted that acquisition of this JavaScript-execution signal is not complete, because failures occur depending on the communication status of the client accessing the website and the timing of the transition to the next page. There are also bots that do execute JavaScript; however, since their number is relatively small, they are excluded from this study.
The user agent indicating the access source of a session often includes the name of a common web browser when the access comes from a user. In the case of access by a bot, on the other hand, it often includes the name of an OS or a programming language, which can be used as an effective feature for bot determination. However, the user agent is text, and some conversion is required before it can be used as a model feature. In this research, we adopt the bag-of-words representation [14], a standard conversion for text information.
The experimental data used in this study contained 4,930 user agents, from which 691 words were extracted. Note that the numerical values and symbols included in a user agent are excluded, and all letters are converted to lower case. The following is an example of a user agent before and after conversion.
Takamasa TANAKA et al. / Procedia Computer Science 176 (2020) 1621–1625 1623
Author name / Procedia Computer Science 00 (2020) 000–000 3
Initial State:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1
Safari/605.1.15
Transformed:
mozilla / macintosh / intel / mac / os / x / applewebkit / khtml / like / gecko / version / safari
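The conversion above (drop numbers and symbols, lowercase the rest) can be reproduced with a one-line tokenizer; this sketch is ours, not the paper's code.

```python
import re

# Extract the alphabetic runs from a user agent string and lowercase
# them, matching the before/after example in the text.
def tokenize_user_agent(ua):
    return [w.lower() for w in re.findall(r"[A-Za-z]+", ua)]

ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) "
      "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15")
print(" / ".join(tokenize_user_agent(ua)))
# mozilla / macintosh / intel / mac / os / x / applewebkit / khtml / like / gecko / version / safari
```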
In addition, as another set of effective features, information on behavior within the site is examined. For example, users are more likely to visit a site during the day, while bots access more often from late night to early morning. Also, the referrer, the page visited immediately before arriving at the site, is often a search site or a web advertisement page for a user, whereas for a bot the referrer often cannot be acquired at all. In total, we created 14 features related to in-site behavior that characterize bots, such as an unusually large number of page views, the time intervals between page views, and the types of pages viewed.
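A few of these behavioral features can be sketched as below; the field names and the exact feature set are illustrative assumptions, since the paper does not list all 14 features.

```python
import statistics
from datetime import datetime

# Compute a handful of per-session behavior features: page-view count,
# mean and variance of the inter-view intervals (seconds), visit hour
# (bots skew toward night hours), and referrer presence (bots often
# arrive with none).
def session_features(view_times, referrer):
    gaps = [(b - a).total_seconds() for a, b in zip(view_times, view_times[1:])]
    return {
        "n_page_views": len(view_times),
        "gap_mean": statistics.mean(gaps) if gaps else 0.0,
        "gap_var": statistics.pvariance(gaps) if gaps else 0.0,
        "visit_hour": view_times[0].hour,
        "has_referrer": int(bool(referrer)),
    }

# A 3 a.m. session with rapid, evenly spaced views and no referrer.
feats = session_features(
    [datetime(2020, 1, 1, 3, 0, 0), datetime(2020, 1, 1, 3, 0, 2),
     datetime(2020, 1, 1, 3, 0, 4)],
    referrer="")
print(feats)
```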
In order to evaluate how well user agents alone explain bot accesses, a logistic regression model [15] using only user agents is constructed. We introduce features for the 691 words and apply L1 regularization [16] to narrow down the words that are effective for discrimination. The coefficient of the regularization term is determined while checking the AUC [17] of the model and the number of features whose partial regression coefficients are non-zero.
Next, we identify bots using both the user agent and in-site behavior. LightGBM [18], a tree-structured model, is used to combine the word features from the user agents with the 14 behavioral features. Two such models are constructed: one that takes all 691 words as input, and one narrowed down to the words confirmed to be effective by L1 regularization.
We divide the experimental data into 80% training data and 20% verification data. The model generated from the training data is applied to the verification data and evaluated using four indicators: AUC, Accuracy, Precision, and Recall [19]. The evaluation results are shown in Table 1.
First, although model 1 has the lowest AUC and Accuracy of the three, its discrimination accuracy exceeds 90%, so the bag-of-words representation of the user agent can be said to function effectively. In addition, models 2 and 3, which introduce in-site behavior as features, are confirmed to have higher discrimination accuracy than the model without it, so effective features have been created. Model 2 takes as input all 691 words included in the user agents, and
model 3 uses as features the 17 words confirmed by model 1 to be effective for discrimination. The difference in accuracy between models 2 and 3 was binomial tested, confirming that there is no statistically significant difference. Since the same performance was obtained with a small number of words, model 3 can be said to be the best model.
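One way to carry out such a binomial test is an exact sign test on the verification items where the two models disagree: under the null hypothesis that the models are equally accurate, each is equally likely to be the correct one on a disagreement. The counts below are illustrative, not the paper's.

```python
from math import comb

# Two-sided exact binomial test with p = 0.5 on disagreement counts:
# wins_a = items model A got right and B wrong, wins_b the reverse.
def binomial_sign_test(wins_a, wins_b):
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 12 vs 9 disagreements won: p ~ 0.66, no significant difference.
print(binomial_sign_test(12, 9))
```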
In model 1, L1 regularization enabled us to narrow the words with non-zero partial regression coefficients down from 691 to 17. An excerpt of these words is shown in Fig. 2. When regularization was performed, three regularization coefficients were tried. Fig. 3 shows the process by which the words with non-zero partial regression coefficients are narrowed down as the regularization coefficient increases. Table 2 shows the number of words and the AUC for each regularization coefficient. Since the AUCs are similar, the 17 words were considered important.
Fig. 3. Relation between coefficient α of regularization term and partial regression coefficient
Table 2. Relation between coefficient of regularization term and number of valid words and AUC
In addition, among the in-site behavior features introduced in models 2 and 3, the number of page views, the mean and variance of the page-view intervals, and the time zone in which the site was visited were confirmed to be highly important.
5. Conclusion
In this study, we proposed a method to discriminate accesses from bots, which cause noise in web access log analysis. As future work, countermeasures against bots that execute JavaScript will be implemented using outlier detection techniques based on distribution estimation [20].
References