
Efficient multistage phishing website detection model

By César Barreto

Phishing is today the favorite method of cybercriminals to commit data theft,
and this cyberthreat keeps evolving, as reports of privacy leaks and financial
losses caused by this type of cyberattack continue. It is worth noting that the
existing ODS Phishing detection method does not fully analyze the
characteristics of Phishing, and that the performance and efficiency of the
detection models hold only for certain limited data sets and need to be improved
before they can be applied to the real web environment. Therefore, users demand
new methods that detect phishing websites quickly and accurately. To this end,
the principles of social engineering provide aspects of interest that allow the
development of effective methods for detecting Phishing sites in several stages,
especially in the real web environment.

Advance of Phishing

Keep in mind that phishing is a typical social engineering attack; that is,
cyber attackers exploit the instincts of users, along with their curiosity,
trust, fear and greed, to commit crimes. Phishing increased 350% during the
COVID-19 quarantine, according to cybersecurity research reports. It is
estimated that the cost of Phishing is now 1/4 of the cost of traditional
cyberattacks, but the revenue is double what it was in the past. Midsize
businesses paid an average of $1.6 million to deal with phishing attacks, since
a business can lose customers faster than it gains them because of this
cyberthreat.

Phishing attacks take many forms and typically involve a variety of
communication channels, such as email, messaging, and social media.
Regardless of the channel used, attackers often spoof well-known banks, credit
card companies, or famous e-commerce websites to intimidate or urge users to
log into the Phishing website and do things they will later regret. For
example, a user might receive an instant message indicating a problem with
their bank account and be directed to a web link that is very similar to a link
used by the bank. Without hesitation, the user enters their username and
password in the fields provided by the criminal, who records that information
and then uses it to access the user's session.

What should be improved in Antiphishing methods?

To answer this question we must understand the typical Phishing process,
which is shown in the following figure.

[Figure: the typical Phishing process. Recoverable labels: "Receiving deceptive
information", "Click through", "Visiting visually similar spoofing websites",
"Steal", "Leaking privacy and losing property".]

Machine learning-based Phishing website detection is the current mainstream
Antiphishing method. This online detection mode is based on statistical
learning, but its robustness and efficiency in complex web environments need to
be improved. The main problems of Antiphishing methods based on machine
learning are summarized below:

• An increasing number of features are being extracted by Antiphishing methods,
but it is not clear why these features are being extracted. Existing features do
not reflect well the nature of Phishing, which steals sensitive information
through spoofing. As a result, the features are only valid in some limited and
specific scenarios, such as specific data sets or a browser plugin.

• Existing algorithms treat all websites in the same way, which leads to the
inefficiency of the statistical model. In other words, the models are not suitable
for the real web environment, which contains a large number of complex web
pages.

• Most data sets do not contain enough samples and sample diversity is not
considered; furthermore, the proportion of positive and negative samples is
unrealistic. In general, models trained on such data sets overfit heavily, and
their robustness needs further improvement.

What has been the advance in Antiphishing methods?

In recent years, efforts have been made to develop a large-scale, robust and
efficient Antiphishing method for the real web environment, based on
statistical machine learning algorithms. Its innovations rest on the following
points:

• In the extraction of Antiphishing statistical features, through an in-depth
analysis of the pattern of Phishing attacks, current models extract
comprehensive and interpretable four-part features, called CASE, which comprise
"Counterfeit", "Affiliation", "Stealing" and "Evaluation" features. This model
reflects the social engineering characteristics of Phishing attacks, as well as
the relevance and quality of web content. The CASE model covers the feature
space that reflects the spoofing nature of Phishing, ensures feature
discrimination and generalization, and provides feature-level support for
effective Phishing detection.

• Considering the unbalanced reality of legitimate and Phishing websites,
current detection models are based on a multi-stage security system. The idea
of multi-stage detection is to ensure "fast filtering + accurate recognition".
In operational terms, legitimate websites are excluded during the fast filtering
stage; accurate supervised recognition is then performed by learning specific
positive and negative samples in a smaller range. This detection philosophy
ensures high performance with a shorter detection time, which makes it more
applicable to the real web environment.

• The new Antiphishing models are built on data sets that are as similar as
possible to the real web environment, with different languages, content
qualities and brands. Furthermore, since Phishing detection is a class
imbalance problem, the data sets now include a large proportion of positive and
negative samples that are easily confused with each other and therefore very
difficult to detect. All these characteristics increase the difficulty of
detection, with the aim of developing robust, effective and practical
Antiphishing detection models for the real web environment.
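
The effect of class imbalance mentioned in the points above can be made concrete
with a little arithmetic. The following is a minimal sketch, not taken from any
of the models discussed: the detector rates (99% of Phishing pages caught, 1% of
legitimate pages wrongly flagged) and the sample counts are hypothetical values
chosen only for illustration.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def expected_counts(n_phish, n_legit, tpr, fpr):
    """Expected confusion counts for a detector with the given rates."""
    tp = n_phish * tpr   # Phishing pages correctly flagged
    fn = n_phish - tp    # Phishing pages missed
    fp = n_legit * fpr   # legitimate pages wrongly flagged
    return tp, fp, fn


# Hypothetical detector: true positive rate 0.99, false positive rate 0.01.
balanced = precision_recall(*expected_counts(1_000, 1_000, 0.99, 0.01))
skewed = precision_recall(*expected_counts(1_000, 1_000_000, 0.99, 0.01))

print(f"balanced 1:1    precision={balanced[0]:.3f} recall={balanced[1]:.3f}")
print(f"skewed   1:1000 precision={skewed[0]:.3f} recall={skewed[1]:.3f}")
```

With the same detector, recall is unchanged, but precision falls from about 0.99
on the balanced set to about 0.09 on the skewed one, because false alarms on the
much larger legitimate class outnumber the true detections. This is one reason a
model tuned on a balanced data set may disappoint in the real web environment.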

Many Antiphishing models work with URLs, titles, hyperlinks, login boxes,
copyright information, sensitive terms and search engine information, and even
with comparisons of the brand logos of websites, and it has been shown that
these can be used to identify phishing websites. In addition, visual spoofing
features and evaluation features have received more attention in recent years.
However, they have not been robust enough to determine whether a website is a
Phishing website.

APWG statistics show that "more than 98% of Phishing websites use fake
domain names." Researchers use information from the URL string to create
Antiphishing models, but extracting the underlying information behind the
domain name, such as domain registration and resolution, is also very
important for phishing detection. This information can often indicate whether a
domain name is entitled to provide services related to a brand. Therefore,
extracting effective features is the main task of Antiphishing researchers, who
analyze social engineering attacks and propose a comprehensive and
interpretable feature framework that covers not only all aspects of Phishing
attacks but also the quality and relevance of web content.
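
As an illustration of features taken from the URL string alone, the sketch below
uses only the Python standard library. The function name `url_features`, the
brand name and the list of official domains are hypothetical, and the selection
of features is an assumption made for this example, not the feature set of any
particular model. Domain registration and resolution data would need external
lookups, which are left out here.

```python
import ipaddress
from urllib.parse import urlsplit


def url_features(url, brand, official_domains):
    """Extract a few URL-string features for one hypothetical target brand."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # An IP literal in place of a domain name is a common warning sign.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    # The host counts as official if it equals, or is a subdomain of,
    # one of the domains listed for the brand.
    is_official = any(
        host == d or host.endswith("." + d) for d in official_domains
    )

    return {
        "host": host,
        "host_is_ip": host_is_ip,
        "num_labels": 0 if host_is_ip or not host else host.count(".") + 1,
        "uses_https": parts.scheme == "https",
        # Brand name appears in the host, yet the host is not official.
        "brand_in_unofficial_host": brand.lower() in host and not is_official,
    }


features = url_features(
    "http://examplebank.secure-login.example.net/verify",
    "examplebank",
    {"examplebank.com"},
)
print(features)
```

A fake domain that embeds the brand name, as in this made-up URL, sets
`brand_in_unofficial_host`, while the brand's own domain does not. The suffix
comparison is deliberately naive; a production system would more likely rely on
the Public Suffix List to find the registered domain.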

Multistage phishing detection

Multistage detection consists of three stages:

Stage 1. Whitelist filtering. This stage separates real web pages from
suspicious ones based on the domain name of the target brand's website.

Stage 2. Fast counterfeit filtering. This stage extracts the counterfeit title
feature, the counterfeit text feature and the visual counterfeit feature, and
applies a detection model to them.

Stage 3. Accurate recognition. The last stage extracts and fuses the
counterfeit, stealing, affiliation and evaluation features, that is, the CASE
features, and uses them for model training and detection.
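
The control flow of the stages described above can be sketched as follows. This
is a simplified illustration under stated assumptions, not the implementation of
the model: the function names, the page dictionary with "title" and "text" keys,
and the keyword-based stand-in for the final classifier are all hypothetical,
and the visual counterfeit feature is omitted.

```python
from urllib.parse import urlsplit


def on_whitelist(url, whitelist):
    """Stage 1: is the host a whitelisted domain or one of its subdomains?"""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in whitelist)


def looks_counterfeit(page, brand_terms):
    """Stage 2: cheap check for brand terms in the title or body text."""
    text = (page.get("title", "") + " " + page.get("text", "")).lower()
    return any(term.lower() in text for term in brand_terms)


def detect(url, page, whitelist, brand_terms, classifier):
    """Run the stages in order and stop at the first decisive one."""
    if on_whitelist(url, whitelist):
        return "legitimate (stage 1: whitelist)"
    if not looks_counterfeit(page, brand_terms):
        return "legitimate (stage 2: no counterfeit signal)"
    # Stage 3: only the remaining suspicious pages reach the costly model.
    is_phish = classifier(url, page)
    return "phishing (stage 3)" if is_phish else "legitimate (stage 3)"


def stub_classifier(url, page):
    """Stand-in for a trained model: flags pages that ask for a password."""
    return "password" in page.get("text", "").lower()


verdict = detect(
    "http://examplebank.example.net/",
    {"title": "ExampleBank login", "text": "Enter your password"},
    {"examplebank.com"},
    ["ExampleBank"],
    stub_classifier,
)
print(verdict)
```

Because most pages leave the pipeline in the first two stages, the expensive
classifier only runs on a small, suspicious subset, which is the point of the
"fast filtering + accurate recognition" idea.
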

Conclusion

With this Antiphishing philosophy based on multistage detection, during 2022,
883 phishing websites targeting China Mobile, 86 targeting Bank of China, 19
targeting Facebook and 13 targeting Apple were detected, demonstrating that the
CASE model covers the feature space that reflects the spoofing nature of
Phishing, ensures feature discrimination and generalization, and provides
feature-level support for effective Phishing detection.
