
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 176 (2020) 1621–1625

www.elsevier.com/locate/procedia

24th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems
Bot Detection Model using User Agent and User Behavior for Web Log Analysis

Takamasa TANAKA a,*, Hidekazu NIIBORI a, Shiyingxue LI a, Shimpei NOMURA a,
Hiroki KAWASHIMA b, Kazuhiko TSUDA c
a Recruit Sumai Company Ltd., Minato, Tokyo, 105-0023, Japan
b Nomura Research Institute, Ltd., Chiyoda, Tokyo, 100-0004, Japan
c Graduate School of Business Science, University of Tsukuba, Bunkyo, Tokyo, 112-0012, Japan

Abstract
In recent years, it has become common to automatically distribute content suited to each user by letting AI learn the user's behavior pattern from the user's web access log. On the other hand, browsing by bots is also included in the web access log. There are malicious bots whose purpose is DDoS attacks or illegal mass extraction of content. Furthermore, it is not uncommon for bots to disguise themselves as if they had the attributes of a user. In this study, we propose a method to discriminate between user and bot web access logs in order to exclude the bot's web access log from the analysis target.
© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the KES International.
Keywords: bot detection; user agent; user behavior; web log analysis

1. Introduction
Today's websites are accessed by bots, such as search engine crawlers such as Google[1] and crawlers that collect site listings[2]. The former bots commonly access websites with a user agent[3] that shows they are non-human accesses. On the other hand, the latter bots often access with a fake user agent string in order to pass access restrictions[4]. Bots with a fake user agent string are difficult to distinguish from human accesses.
However, bot access logs are noise when analyzing user behavior and obtaining knowledge from website access logs. In this study, we propose a system that distinguishes human accesses from bot accesses.


2. Related Works

There are many studies on bot detection in web logs. Zhang et al. [5] and Kheir et al. [6] analyze user agent strings, which we also use for bot detection. Monroy et al. [7] take an approach based on contrast pattern mining and focus on the class imbalance problem of bot detection. Stassopoulou et al. [8] analyze web server access logs and construct a Bayesian network that automatically classifies access log sessions. Mitterhofer et al. [9] analyze server-side logs of online game players' activities. Masud et al. [10] use a multiple log-file based temporal correlation technique and analyze the correlation between two host-based log files.

3. Construction of Bot Detection Models

3.1. Access Logs in the Experiment

In this study, we build a model that distinguishes human accesses from bot accesses using the web access logs of a real estate advertising website. The access logs contain user behavior information such as the time of the visit to the website, the number of pages viewed, and so on. We expect that analyzing the patterns of fake user agent strings and fake user behaviors is useful for bot discrimination.

3.2. Unit of Access Logs for Discriminating Humans and Bots

The unit to be distinguished by the model is one site visit from the same cookie[11]. Site visits are defined in terms of sessions[12]. A session starts from the first page view of the cookie, and page views within 30 minutes of the previous page view are treated as the same session. If a page view occurs more than 30 minutes after the previous one, it is treated as the start of a separate session.
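
As a concrete illustration, the following is a minimal sketch of this sessionization rule in Python; the log field names (cookie_id, timestamp, url) are hypothetical and would depend on the actual log schema.

from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def split_into_sessions(page_views):
    """Group the time-ordered page views of one cookie into sessions.

    `page_views` is a list of dicts with hypothetical fields
    'cookie_id', 'timestamp' (datetime) and 'url', sorted by time.
    A gap of more than 30 minutes between consecutive page views
    starts a new session.
    """
    sessions = []
    current = []
    for pv in page_views:
        if current and pv["timestamp"] - current[-1]["timestamp"] > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append(pv)
    if current:
        sessions.append(current)
    return sessions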

3.3. Discrimination Labeling

As the label for distinguishing bots, we adopt information on whether or not the JavaScript[13] deployed on the web page executes when the page is accessed. On modern websites, including the site that provides our experimental data, JavaScript usually drives most page behavior. Furthermore, many bots deliberately do not execute the JavaScript implemented on a web page in order to collect site information at high speed. It should be noted that this information is not complete: depending on the communication conditions on the client side and the timing of the transition to the next page, the JavaScript execution signal can be lost. Also, there are bots that do execute JavaScript; however, since their number is relatively small, they are excluded from this study.
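
A minimal sketch of this labeling rule, assuming each page view records whether a JavaScript beacon fired (the field name js_fired is hypothetical and not specified in the paper):

def label_session(session):
    """Label a session as bot (1) or human (0).

    A session in which the page's JavaScript never executed
    (no beacon event observed on any page view) is treated as a bot;
    sessions with at least one JavaScript event are treated as human.
    """
    js_fired = any(pv.get("js_fired", False) for pv in session)
    return 0 if js_fired else 1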

3.4. Feature Extraction

The user agent indicating the access source of a session often includes the name of a common web browser when the access comes from a user. In the case of access by a bot, on the other hand, the name of an OS or a programming language is often included, and this can be used as an effective feature for bot determination. However, the user agent is text, so some conversion process is required to use it as a model feature. In this research, we adopt the bag-of-words representation [14], a standard conversion of text information.
The experimental data used in this study contained 4,930 distinct user agents, from which 691 words were extracted. Numerical values and symbols included in the user agent are excluded, and all letters are converted to lower case. The following is an example of a user agent before and after conversion.

Initial State:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1
Safari/605.1.15

Transformed:
mozilla / macintosh / intel / mac / os / x / applewebkit / khtml / like / gecko / version / safari

Fig. 1. Example of transforming a user agent
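
A minimal sketch of this conversion, assuming the tokenization splits on non-alphabetic characters, lower-cases the tokens, and thereby drops numbers and symbols (the exact tokenizer is not specified in the paper):

import re

def user_agent_to_words(user_agent):
    """Convert a user agent string into bag-of-words tokens:
    split on runs of non-alphabetic characters and lower-case,
    which removes numerical values and symbols."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", user_agent) if t]

ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) "
      "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15")
print(user_agent_to_words(ua))
# ['mozilla', 'macintosh', 'intel', 'mac', 'os', 'x', 'applewebkit',
#  'khtml', 'like', 'gecko', 'version', 'safari']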

In addition, information on in-site behavior is examined as another group of effective features. For example, users are more likely to visit a site during the day, while bots access it more from late night to early morning. The referrer, i.e., the page visited immediately before entering the site, is often a search site or a web advertisement page for a user, whereas for a bot the referrer often cannot be acquired at all. Based on such differences, we created 14 features related to in-site behavior, such as an unusually large number of page views, the time interval between page views, and the type of page viewed.
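
The following sketch derives a few of these behavioral features from a session. The full list of the 14 features is not given in the paper, so this is only an indicative subset with hypothetical field names (timestamp, referrer).

import statistics

def behavior_features(session):
    """Compute a few illustrative in-site behavior features for a session.

    `session` is a time-ordered list of page views with hypothetical
    fields 'timestamp' (datetime) and 'referrer' (str or None).
    """
    gaps = [
        (b["timestamp"] - a["timestamp"]).total_seconds()
        for a, b in zip(session, session[1:])
    ]
    return {
        "num_page_views": len(session),
        "mean_interval_sec": statistics.mean(gaps) if gaps else 0.0,
        "var_interval_sec": statistics.pvariance(gaps) if gaps else 0.0,
        "start_hour": session[0]["timestamp"].hour,              # daytime vs. late night
        "has_referrer": int(bool(session[0].get("referrer"))),   # bots often have none
    }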

3.5. Logistic Regression Model with User Agents

In order to evaluate how well user agents alone can explain bot accesses, a logistic regression model[15] using only user agents is constructed. Features are introduced for the 691 words, and L1 regularization[16] is applied to narrow down the words effective for discrimination. The coefficient of the regularization term is determined while checking the AUC[17] of the model and the number of features whose partial regression coefficients are non-zero.
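
A minimal sketch of this step, assuming scikit-learn for both the bag-of-words features and the L1-regularized logistic regression (the paper does not specify the library, and the user agents and labels below are toy stand-ins for the real data):

import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def user_agent_to_words(ua):
    # same tokenization as in Section 3.4: alphabetic tokens, lower-cased
    return [t.lower() for t in re.split(r"[^A-Za-z]+", ua) if t]

# toy stand-ins for the real data (one user agent and one label per session)
user_agents = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 Safari/605.1.15",
    "Go-http-client/1.1",
    "python-requests/2.22.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/68.0",
]
labels = [0, 1, 1, 0]  # 1 = bot, 0 = human

vectorizer = CountVectorizer(analyzer=user_agent_to_words)
X = vectorizer.fit_transform(user_agents)

# The L1 penalty drives most word coefficients to exactly zero;
# in scikit-learn, C is the inverse of the regularization coefficient.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, labels)

# words whose partial regression coefficients remain non-zero
selected_words = [w for w, c in zip(vectorizer.get_feature_names_out(), clf.coef_[0]) if c != 0]
print(selected_words)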

3.6. Tree Model with User Agent and User Behaviors

Next, bots are identified using both the user agent and in-site behavior. LightGBM[18], a tree-based gradient boosting model, is used to combine the word features from the user agents with the 14 behavioral features. Two models are constructed: a model that takes all 691 words as input, and a model that is narrowed down to the words confirmed to be effective by L1 regularization.
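
A minimal sketch of combining the word features and behavioral features with LightGBM; the feature matrices below are random stand-ins for the real data, and the hyperparameters shown are illustrative defaults, not the settings used in the paper.

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# toy stand-ins: 691 bag-of-words columns + 14 behavioral columns per session
n_sessions = 200
X_words = rng.integers(0, 2, size=(n_sessions, 691))
X_behavior = rng.random((n_sessions, 14))
X = np.hstack([X_words, X_behavior])
labels = rng.integers(0, 2, size=n_sessions)  # 1 = bot, 0 = human

train_set = lgb.Dataset(X, label=labels)
params = {"objective": "binary", "metric": "auc", "learning_rate": 0.1}
booster = lgb.train(params, train_set, num_boost_round=100)

scores = booster.predict(X)             # predicted probability of being a bot
preds = (scores >= 0.5).astype(int)     # threshold for Accuracy/Precision/Recall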

4. Assessment of Bot Detection Models

4.1. Discrimination Performance

We split the experimental data into 80% training data and 20% verification data. Each model trained on the training data is evaluated on the verification data using four indicators: AUC, Accuracy, Precision, and Recall[19]. The evaluation results are shown in Table 1.

Table 1. Comparison of Performance

Model                    Features                                   AUC     Accuracy  Precision  Recall
1. Logistic regression   User Agent (691 words)                     0.933   0.902     0.997      0.813
2. LightGBM              User Agent (691 words) + User Behavior     0.990   0.965     0.963      0.969
3. LightGBM              User Agent (17 words) + User Behavior      0.989   0.964     0.963      0.968

First, although the AUC and Accuracy of model 1 are the lowest among the three, discrimination accuracy exceeding 90% was confirmed, so the bag-of-words representation of the user agent can be said to function effectively. In addition, models 2 and 3, which introduce in-site behavior as features, were confirmed to have higher discrimination accuracy than the model without it, so the behavioral features can be said to be effective. Model 2 takes all 691 words contained in the user agents as features, and model 3 uses the 17 words confirmed to be effective for discrimination by model 1. The difference in accuracy between models 2 and 3 was examined with a binomial test, which confirmed that there is no statistically significant difference. Since the same performance was obtained with a much smaller number of words, model 3 can be said to be the best model.
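
A minimal sketch of the 80/20 split and the four evaluation indicators, assuming scikit-learn's metric functions and a synthetic stand-in dataset and classifier (the split ratio and indicators follow the paper, but the library choice and data are assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)
X = rng.random((500, 20))                                    # stand-in feature matrix
y = (X[:, 0] + rng.normal(0, 0.3, 500) > 0.5).astype(int)    # stand-in bot labels

# 80% training data, 20% verification data, as in the experiment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)             # any of the three models
scores = clf.predict_proba(X_test)[:, 1]
preds = (scores >= 0.5).astype(int)

print("AUC      :", roc_auc_score(y_test, scores))
print("Accuracy :", accuracy_score(y_test, preds))
print("Precision:", precision_score(y_test, preds))
print("Recall   :", recall_score(y_test, preds))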

4.2. Utilization of User Agent

In model 1, L1 regularization enabled us to narrow down the number of words with non-zero partial regression coefficients from 691 to 17. An excerpt of these words is shown in Fig. 2. Three regularization coefficients were tried; Fig. 3 shows how the words with non-zero partial regression coefficients are narrowed down as the regularization coefficient increases, and Table 2 shows the number of remaining words and the AUC for each regularization coefficient. Since the AUC values are nearly identical, the 17 words were considered sufficient for discrimination.

Words included in bot user agents: ubuntu / apple / go / linux

Words not included in bot user agents: like / android / win / gecko

Fig. 2. Difference of words between bot and non-bot user agents

Fig. 3. Relation between coefficient α of regularization term and partial regression coefficient

Table 2. Relation between coefficient of regularization term, number of valid words, and AUC

Regularization term coefficient                                   10^-5   10^-4   10^-3
Number of words whose partial regression coefficient is not 0     151     35      17
AUC                                                               0.934   0.934   0.933
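
A minimal sketch of how the relation in Fig. 3 and Table 2 could be traced, reusing the toy X, labels and vectorizer from the Section 3.5 sketch; the mapping from the regularization coefficient alpha to scikit-learn's C = 1/alpha is an assumption about the parameterization, not something stated in the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# X, labels: bag-of-words matrix and bot labels from the Section 3.5 sketch
for alpha in (1e-5, 1e-4, 1e-3):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / alpha)
    clf.fit(X, labels)
    n_nonzero = int(np.count_nonzero(clf.coef_[0]))
    auc = roc_auc_score(labels, clf.predict_proba(X)[:, 1])
    print(f"alpha={alpha:g}  non-zero words={n_nonzero}  AUC={auc:.3f}")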

In addition, among the in-site behavioral features introduced in models 2 and 3, the number of page views, the mean and variance of the page view intervals, and the time of day of the visit were confirmed to be highly important.

5. Conclusion

In this study, we proposed a method to discriminate accesses from bots, which are a source of noise in web access log analysis. As future work, countermeasures against bots that run JavaScript will be implemented using outlier detection techniques based on distribution estimation [20].

References

[1] Najork, Marc. "Web Crawler Architecture." (2009): 3462-3465.


[2] Massimino, Brett. "Accessing online data: Web‐crawling and information‐scraping techniques to automate the assembly of research data."
Journal of Business Logistics 37.1 (2016): 34-42.
[3] Zhu, Ling, et al. "Method and system for providing a user agent string database." U.S. Patent No. 10,025,847. 17 Jul. 2018.
[4] Kheir, Nizar. "Analyzing http user agent anomalies for malware detection." Data Privacy Management and Autonomous Spontaneous
Security. Springer, Berlin, Heidelberg, 2012. 187-200.
[5] Zhang, Yang, et al. "Detecting malicious activities with user‐agent‐based profiles." International Journal of Network Management 25.5
(2015): 306-319.
[6] Kheir, Nizar. "Analyzing http user agent anomalies for malware detection." Data Privacy Management and Autonomous Spontaneous
Security. Springer, Berlin, Heidelberg, 2012. 187-200.
[7] Loyola-González O., Monroy R., Medina-Pérez M.A., Cervantes B., Grimaldo-Tijerina J.E. (2018) An Approach Based on Contrast Patterns
for Bot Detection on Web Log Files. In: Batyrshin I., Martínez-Villaseñor M., Ponce Espinosa H. (eds) Advances in Soft Computing. MICAI
2018. Lecture Notes in Computer Science, vol 11288. Springer, Cham
[8] Stassopoulou, Athena, and Marios D. Dikaiakos. "Web robot detection: A probabilistic reasoning approach." Computer Networks 53.3
(2009): 265-278.
[9] Mitterhofer, Stefan, et al. "Server-side bot detection in massively multiplayer online games." IEEE Security & Privacy 7.3 (2009): 29-36.
[10] Masud, Mohammad M., et al. "Flow-based identification of botnet traffic by mining multiple log files." 2008 First International Conference
on Distributed Framework and Applications. IEEE, 2008.
[11] Field, Kimen Catherine. "Browser cookie analysis and targeted content delivery." U.S. Patent No. 9,210,222. 8 Dec. 2015.
[12] Motukuru, Vamsi, Vikas Pooven Chathoth, and Vipin Anaparakkal Koottayi. "Cookie based session management." U.S. Patent No.
9,866,640. 9 Jan. 2018.
[13] Duckett, Jon. Javascript and jquery: Interactive front-end web development. Wiley Publishing, 2014.
[14] Zhang, Yin, Rong Jin, and Zhi-Hua Zhou. "Understanding bag-of-words model: a statistical framework." International Journal of Machine
Learning and Cybernetics 1.1-4 (2010): 43-52.
[15] Kleinbaum, David G., et al. Logistic regression. New York: Springer-Verlag, 2002.
[16] Park, Mee Young, and Trevor Hastie. "L1‐regularization path algorithm for generalized linear models." Journal of the Royal Statistical
Society: Series B (Statistical Methodology) 69.4 (2007): 659-677.
[17] Huang, Jin, and Charles X. Ling. "Using AUC and accuracy in evaluating learning algorithms." IEEE Transactions on knowledge and Data
Engineering 17.3 (2005): 299-310.
[18] Ke, Guolin, et al. "Lightgbm: A highly efficient gradient boosting decision tree." Advances in neural information processing systems. 2017.
[19] Powers, David Martin. "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation." (2011).
[20] Erfani, Sarah M., et al. "High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning." Pattern
Recognition 58 (2016): 121-134.
