Copyright © 2011 MECS I.J. Computer Network and Information Security, 2011, 3, 41-47
42 Design and Implementation for Malicious Links Detection System Based On
Security Relevance of Webpage Script Text
A. Related Concepts
1) UL and SL
By analysing the correlation between web links and
web contents, the model determines the probability that
a webpage link is malicious, so we define the following
concepts:
Unsafe Links: Web links that have a probability
of being malicious links, denoted by
$UL = \{ul_1, ul_2, ul_3, \ldots, ul_m\}$
Safe Links: Web links that have no probability of
being malicious links, denoted by
$SL = \{sl_1, sl_2, sl_3, \ldots, sl_n\}$
2) Effective Keywords
Unsafe links and safe links cannot be used directly
to calculate correlation, so we define the following
concept:
$$
\begin{pmatrix}
b_{11} & \cdots & b_{1k} \\
\vdots & \ddots & \vdots \\
b_{m1} & \cdots & b_{mk}
\end{pmatrix}
\tag{1}
$$
Each matrix element takes the value 1 or 0.
The sum of each row is:
$$
(b_{ij})_{m \times k}
\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}
=
\begin{pmatrix} t_1 \\ \vdots \\ t_m \end{pmatrix}
\tag{2}
$$
Define $\{t_1, t_2, t_3, \ldots, t_m\}$ as the CMAI of the unsafe links $\{ul_1, ul_2, ul_3, \ldots, ul_m\}$.
2) Calculation of CMNAI
Define the relationship between KU and KS as a set of pairs $\langle \{ku\}, \{ks\} \rangle$.
Expressed in matrix form:
$$
\begin{pmatrix}
a_{11} & \cdots & a_{1n} \\
\vdots & \ddots & \vdots \\
a_{m1} & \cdots & a_{mn}
\end{pmatrix}
\tag{3}
$$
Each matrix element takes the value 1 or 0.
The sum of each row is:
$$
(a_{ij})_{m \times n}
\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}
=
\begin{pmatrix} l_1 \\ \vdots \\ l_m \end{pmatrix}
\tag{4}
$$
Define $\{l_1, l_2, l_3, \ldots, l_m\}$ as the CMNAI of the unsafe links $\{ul_1, ul_2, ul_3, \ldots, ul_m\}$.
The closer the correlation between a web link and the web contents is to zero, the greater the probability that the link is malicious, and vice versa.
VI MODEL DESCRIPTION OF ANALYSIS MODULE
The model describes the relationship between the elements (unsafe links, safe links, authoritative information, non-authoritative information and webpage content). It is defined as a quintuple:
A (Algorithm) = {UL, SL, AI, NAI, WC}
-UL (Unsafe Links): Web links that have a probability of being determined to be malicious links
-SL (Safe Links): Web links that have no probability of being malicious links
-AI (Authoritative Information): Key information that represents the webpage significance
-NAI (Non-Authoritative Information): Information that cannot represent the webpage significance but is correlated with the webpage contents. In this model, safe links are the non-authoritative information.
-WC (Webpage Script Text Content): Information in the text contents of the webpage script code
$UL = \{L_1, L_2, L_3, \ldots, L_m\}$
$L(\text{links}) = \{CMAI, CMNAI, CMW\}$
---CMAI: Correlation modulus for authoritative information
---CMNAI: Correlation modulus for non-authoritative information
---CMW: Correlation modulus for web contents
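The row-sum calculations of formulas (2) and (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not spell out when a matrix element is 1, so the sketch assumes an element is 1 when the two keyword sets share at least one keyword. The function names `relation_matrix` and `row_sums` are hypothetical, and the keyword values are borrowed from the KU and KAI examples given in the experiment.

```python
def relation_matrix(row_keywords, col_keywords):
    """Build a 0/1 matrix with one row per unsafe link.

    Assumption (not stated in the paper): element (i, j) is 1 when
    keyword set i and keyword set j share at least one keyword.
    """
    return [[1 if row & col else 0 for col in col_keywords]
            for row in row_keywords]


def row_sums(matrix):
    """Multiply the matrix by a column of ones, i.e. sum each row."""
    return [sum(row) for row in matrix]


# Effective keywords of three unsafe links (KU) and of the
# authoritative information (KAI), one single-keyword set per column.
ku = [{"newspaper", "dufe"}, {"80xiazai"}, {"zhaosfpk"}]
kai = [{"colors"}, {"dufe"}]

cmai = row_sums(relation_matrix(ku, kai))
print(cmai)
```

Under these assumptions the first link shares the keyword "dufe" with the authoritative information, while the other two links have a row sum of zero, which the model reads as a higher probability of being malicious. The same two functions would apply to KS when calculating CMNAI.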
Copyright © 2011 MECS I.J. Computer Network and Information Security, 2011, 3, 41-47
46 Design and Implementation for Malicious Links Detection System Based On
Security Relevance of Webpage Script Text
VII DESIGN OF ANALYSIS MODULE
Based on the above discussion, the analysis module is
designed as follows. The flow chart of the core code for
calculating CMNAI is shown in Fig. 6. The module process
is divided into four steps:
Step 1. Information Extraction
Extract the unsafe links, non-authoritative
information and authoritative information from the
webpage. In this model, safe links are the non-
authoritative information.
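As a rough illustration of Step 1, the sketch below collects link hostnames and the page title from HTML text with Python's standard `html.parser` module. It is not the authors' extraction code: the class name `LinkCollector`, the choice of `href`/`src` attributes, and the use of the page title as a stand-in for authoritative information are all assumptions made for the example. Deciding which collected links count as unsafe or safe is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect link hostnames and the page title (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hosts = []         # hostnames found in href/src attributes
        self.title = ""         # assumed stand-in for authoritative information
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        for name, value in attrs:
            if name in ("href", "src") and value:
                host = urlparse(value).hostname  # None for relative URLs
                if host and host not in self.hosts:
                    self.hosts.append(host)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def extract(html_text):
    """Return (list of link hostnames, page title) for one page."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.hosts, collector.title


page = (
    "<html><head><title>Example Page</title></head><body>"
    '<a href="http://www.example.com/a">a</a>'
    '<script src="http://ads.example.net/x.js"></script>'
    '<a href="/local">b</a>'
    "</body></html>"
)
print(extract(page))
```

Relative links such as `/local` carry no hostname and are skipped, so only links that point to a named host are passed on to the pre-processing step.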
Step 2. Information Pre-processing
Remove interference words, such as www, cn, us, de,
and so on, from the extracted information so that the
remaining words can be used to calculate correlation.
For example, if the extracted information is
“www.google.com.hk”, the words “www” and “hk” and the
dots are the interference words.
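A minimal sketch of this pre-processing step is given below, assuming the extracted information is a hostname-like string. The function name `effective_keywords` and the default word list are hypothetical; the list contains only the interference words named in the text (www, cn, us, de, hk), and a real list would presumably be longer.

```python
import re

# Interference words named in the text; the full list is an assumption.
INTERFERENCE_WORDS = frozenset({"www", "cn", "us", "de", "hk"})


def effective_keywords(text, interference=INTERFERENCE_WORDS):
    """Split text on non-alphanumeric characters (dots included) and
    drop empty tokens and interference words."""
    tokens = re.split(r"[^0-9a-z]+", text.lower())
    return {t for t in tokens if t and t not in interference}


print(effective_keywords("www.google.com.hk"))
```

With the default list, "www.google.com.hk" keeps the words "google" and "com". A caller who also treats "com" as an interference word can pass an extended set, for example `INTERFERENCE_WORDS | {"com"}`, which leaves only "google".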
Step 3. Calculation of Correlation Modulus
Calculate CMAI according to the correlation
between unsafe links and authoritative information, and
calculate CMNAI according to the correlation between
unsafe links and non-authoritative information. The model
is specified as follows:
if $L_i \in AI$ then $L_i.CMAI = Max$,
Figure 5. Flow chart of core code for calculating CMAI
A. Information Extraction
Extract the unsafe links, non-authoritative information and authoritative information. We select ...
KU = {{newspaper, dufe}, {80xiazai}, {zhaosfpk}, …}
Effective keywords of safe links are as follows:
Effective keywords of authoritative information are as follows:
KAI = {colors, dufe}
C. Calculation of Correlation Modulus
According to formula (1) and formula (2), calculate CMAI with the effective keywords of unsafe links and authoritative information. According to formula (3) and formula (4), calculate CMNAI with the effective keywords of unsafe links and non-authoritative information.
D. Correlation Analysis
According to formula (5) and formula (6), calculate CMW from CMAI and CMNAI. CMW gives the correlation between unsafe links and web contents.
Some of the correlations between unsafe links and web contents are shown in Table II and Fig. 8:
[Chart: CMW (vertical axis, ticks from 0 to 0.4) for the links newspaper.dufe.ed…, www.80xiazai.c…, www.zhaosfpk.c…, ...]
Figure 9. Unsafe links in the webpage text
These links are all related to online games or Internet advertisements. They were embedded into the webpage by hackers. These links are malicious. So the system in this paper can improve the efficiency of malicious links detection.
IX CONCLUSION
In this paper, we propose a system for malicious links detection based on the security relevance of webpage script text. Experimental results showed that the correlation calculated by the model reflects the relationship between web links and webpage contents. Since the model does not consider the text features of web links, it can improve the efficiency and accuracy of malicious links detection.
In addition, a lot of information remains available for text mining in the future. We will work further to optimize the model, including incorporating IP addresses and Chinese information factors into the text features of web links, to improve the efficiency and accuracy of malicious links detection and reduce the false positive rate.