
Real-Time Segregation of Encrypted Data Using Entropy

Gowtham Akshaya Kumaran P1 and Amritha P.P.2

TIFAC-CORE in Cyber Security12


Amrita School of Engineering, Coimbatore
Amrita Vishwa Vidyapeetham, India
gakshay47@gmail.com
pp amritha@cb.amrita.edu

Abstract. Encryption translates data into another form that can be read only
with the correct keys. Encrypted data is known as ciphertext, whereas
unencrypted data is known as plaintext. Encrypting a file with a key makes it
accessible only to those who hold the key to decrypt it; the main idea is to
prevent unauthorized parties from accessing the files. These days, one must
protect information stored on computers or communicated over the internet
against cyberattacks. Cryptographic methods come in many varieties, and the
choice of method is mainly determined by application requirements such as
response speed, bandwidth, integrity and confidentiality. However, each
cryptographic algorithm has its own set of strengths and weaknesses. Here we
segregate encrypted data using entropy as a measure. The encryption algorithms
taken for analysis are 3DES, AES, RC4 and Blowfish.

Keywords: Encryption Algorithms · Encrypted Traffic Classification ·
Cipher Text · File Encryption

1 INTRODUCTION

Encryption is the process of encoding information or data to prevent
unauthorised access. It is one of the most widely used methods for ensuring
the security of sensitive data. An encryption algorithm changes plaintext (the
original message before encryption) into ciphertext (the scrambled message
after encryption) by performing numerous substitutions and transformations. In
the field of information security, many encryption techniques are widely
available and employed. Encryption algorithms can be divided into
asymmetric-key and symmetric-key encryption.
This paper is an experimental investigation into how the encryption algorithm
of a file can be identified from its entropy value. Our idea is to analyze
encrypted traffic and encrypted files to find the algorithm used. In this
paper, twenty different files, each with more than 100 words, are used for the
analysis and encrypted using Advanced Encryption Standard (AES), Rivest Cipher
4 (RC4), Triple Data
2 Gowtham Akshaya Kumaran P and Amritha P.P.

Encryption Algorithm (3DES), and Blowfish. All the encrypted files are then
sent over a network and captured with a network sniffing tool, and their
entropy values are computed. From the resulting ranges of values we can
identify the encryption algorithms.
It is assumed that Alice and Bob are sharing messages over a public network,
with one portion of the message encrypted and the other unencrypted.
Further, another user, Eve (i.e. the adversary), can passively intercept Alice
and Bob’s communication. This method takes the attacker’s point of view.
When an attacker eavesdrops on a conversation between two valid users, the
attacker can record the files in the conversation and attempt to decipher the
message by analysing the data. If the attacker can infer the encryption
technique from the ciphertext format and its byte values, the attacker can
then attempt to decrypt the files to acquire the data. Twenty different files
of the same size, each with more than 100 words of data, are encrypted with
four different encryption algorithms: AES, 3DES, RC4 and Blowfish. The
encryption is done in Python. Here, the main motive is to analyze the
encrypted files and classify the encryption algorithms.
Entropy is a metric for measuring randomness. This notion originated in
thermodynamics, but it was adapted to digital communications by Claude E.
Shannon in 1948 [1]. Shannon was interested in figuring out how much a digital
file could theoretically be compressed. Using entropy values, traffic can be
identified as encrypted or non-encrypted [18].
A file is compressed simply by replacing longer patterns of bits with
shorter patterns of bits. As a result, the more entropy a data file has, the
less it can be compressed. Determining a file’s entropy [8] can also help to
figure out whether it is likely to be encrypted. There are formal proofs in
cryptography indicating that if an adversary can distinguish an encrypted file
from a genuinely random file with probability better than 50%, then the
adversary has “the advantage” and can exploit it to crack the encryption. The
mathematical analysis of encryption schemes uses this concept of advantage. In
the real world, however, files containing purely random data serve no purpose
in a file system. As a result, files with high entropy are very likely to be
encrypted or compressed.
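The link between entropy and compressibility described above can be
illustrated with a minimal sketch. It is not part of the original experiment:
the sample text, the fixed random seed, and the helper name compression_ratio
are assumptions made for illustration, and pseudo-random bytes stand in for
ciphertext.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower = more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)


# Low-entropy input: an ordinary English sentence repeated many times.
text = b"Encryption translates data into another form. " * 200

# High-entropy input: pseudo-random bytes standing in for ciphertext (assumption).
rng = random.Random(0)  # fixed seed so the sketch is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

print(f"text : {compression_ratio(text):.3f}")
print(f"noise: {compression_ratio(noise):.3f}")
```

The repetitive text shrinks to a small fraction of its size, while the
random-looking bytes barely compress at all, which is the behaviour expected
of encrypted data.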
Many research papers use machine learning techniques [11] to classify traffic
[21], including classification by application usage [19] and by user actions
[20]. All these works are related to identifying encrypted traffic or
encryption algorithms, but with different methods.

2 RELATED WORKS

Numerous research efforts are currently underway to identify the encryption
algorithms used to protect files in encrypted and unencrypted traffic.
Researchers have employed a variety of machine learning and deep learning
techniques. In [2], the entropy values of the files are used to classify
traffic as encrypted or unencrypted. The authors distinguished encrypted and
compressed
Analysis on Encrypted Files to Classify Encryption Algorithms Using Entropy 3

traffic using a new classification scheme called High Entropy DistinGuishEr
(HEDGE). It is based on evaluating the randomness of data streams and can be
applied to individual packets without access to the entire stream.

Barbosa et al. [3] analyse encrypted text files in order to identify the
algorithm that encoded them. Plain texts were encoded using a variety of
cryptographic algorithms, metadata were extracted from the resulting
ciphertexts, and the algorithm was then identified using data mining
techniques such as PART, FT, J48, Multi-layer Perceptron, and Complement
Naive Bayes.

Tan and Ji [4] presented a method for detecting the cryptographic algorithm
using only the ciphertext. They first described the full implementation
architecture of the identification system and then applied it to five popular
block ciphers: AES, Blowfish, 3DES, RC5, and DES. Their experiments showed
that when the keys for the training and testing ciphertexts are identical, the
identification rate can reach approximately 90%. When different keys are used
for training and testing, AES can still be distinguished from each of the
other four algorithms with a high identification rate in one-to-one
identification.

Machine learning techniques have also been demonstrated for algorithm
identification: under ECB mode, seven cryptographic algorithms were used to
encrypt plain text files written in seven different languages, and metadata
files were created by transforming the resulting cryptograms.

Dynamic key aggregation for cloud storage is discussed in [10]. The authors
developed a model that prevents unauthorised access to modified files in the
cloud by dynamically generating the aggregate key. Because dynamic key
aggregation is used, once a user has used a key to access a set of files, that
key cannot be used by another user.

The primary objective of [12] is to classify and characterise encrypted
traffic such as Secure Shell (SSH) [9] using various deep and shallow
networks. The analysis is conducted by estimating a statistical feature set
from a variety of private and public security organisations.

A lightweight cryptographic hash function is discussed in [13]. It is a simple
hash function with low hardware complexity that is capable of achieving
conventional security. It is based on a sponge design with a permutation
function that updates two non-linear feedback shift registers. As a result, it
offers at least 80-bit protection against generic attacks, which is sufficient
for the time being.

The work in [5] discusses the procedure of extracting features and training
machine learning models for detecting and classifying cryptographic algorithms
in compiled code. Using four distinct learning methods, three different types
of models were evaluated on four different feature sets.

Noting that existing entropy-based approaches for identifying encrypted
communication are ineffective, the study in [6] presents an encrypted traffic
identification method based on n-gram entropy and cumulative sum. This method
examines the n-gram entropy properties of text, picture, compressed file, and
encrypted network traffic. Additionally, a cumulative sum analysis is
conducted in order to better distinguish compressed file traffic from
encrypted data.

Using a C4.5 decision tree, the research in [7] proposes a method for
identifying the encryption algorithm. Eight different features were extracted
from the ciphertexts produced by various known encryption algorithms in order
to construct a training model using a C4.5 decision tree, which was then used
to identify the encryption algorithm.

3 METHODOLOGY

Fig. 1: Proposed Method Flow Diagram

At first, twenty different files containing plain text data are created in
text file format. The files are then encrypted using four encryption
algorithms. All the encryption is done in Python. An entropy model for
symmetric-key cryptography algorithms is presented in [15].

3.1 AES Encryption in Python


The Python AES script uses the pyAesCrypt library and its encryptFile
function. This function accepts three parameters: the plain text file, the
ciphertext output file path, and the key that the user gives for encryption.
Internally, the data passes through several rounds of encryption, and the
ciphertext is then produced as a file, since the input is a plain text file.

3.2 3DES Encryption in Python


This script uses two Python libraries, Crypto.Cipher and hashlib. It imports
only the DES3 module from Crypto.Cipher and the md5 function from hashlib,
which is the hashing algorithm used in this script. First, the key retrieved
from the user is hashed using the md5 hash function, and the cipher object is
obtained by passing the resulting hash to the DES3 module as the key. The
cipher object’s encrypt function then handles the encryption of the input,
which is a plain-text file.
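The key-hashing step described above can be sketched with the standard
library alone. This is an illustrative sketch, not the authors' script: the
function name derive_3des_key and the passphrase are hypothetical, and the
3DES cipher call itself is left out because it depends on a third-party
library. An MD5 digest is 16 bytes long, which matches the key length of
two-key 3DES.

```python
import hashlib


def derive_3des_key(passphrase: str) -> bytes:
    """Hash a user-supplied passphrase into a 16-byte key using MD5."""
    return hashlib.md5(passphrase.encode("utf-8")).digest()


# Hypothetical passphrase, used only to show the shape of the result.
key = derive_3des_key("example passphrase")
print(len(key), key.hex())
```

Hashing the passphrase gives a fixed-length key regardless of what the user
types, which is presumably why the script hashes the key before creating the
cipher object.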

3.3 RC4 Encryption in Python


In this script, we first implement the Key Scheduling and Pseudo-Random
Generation parts, which together produce the keystream. Key Scheduling uses
the secret key to rearrange a state array, which is then used to generate the
pseudo-random numbers. The keystream is XORed, byte by byte, with the plain
text file given by the user, which yields the encrypted text.
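The two stages described above can be sketched in plain Python as follows.
This is a minimal textbook RC4, not the authors' script; the function names
rc4_keystream and rc4 and the sample key and message are chosen here for
illustration.

```python
def rc4_keystream(key: bytes):
    """Yield RC4 keystream bytes for the given key."""
    # Key Scheduling: permute the 256-entry state array using the secret key.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]

    # Pseudo-Random Generation: keep swapping and emit one byte per step.
    i = j = 0
    while True:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        yield state[(state[i] + state[j]) % 256]


def rc4(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; the same call encrypts and decrypts."""
    return bytes(b ^ k for b, k in zip(data, rc4_keystream(key)))


ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())
print(rc4(b"Key", ciphertext))  # applying RC4 again restores the plaintext
```

Because encryption is a plain XOR with the keystream, running the same
function over the ciphertext with the same key recovers the original bytes.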

3.4 Blowfish Encryption in Python


In this script, we first implement the Key Scheduling and Pseudo-Random
Generation parts, which together produce the keystream. Key Scheduling uses a
secret key to rearrange an array, which is then used to generate the
pseudo-random numbers. The keystream is XORed, byte by byte, with the plain
text file given by the user, which yields the encrypted text.

Finally, all twenty files are encrypted with all four encryption methods, so
that twenty encrypted files are obtained for each algorithm. The proposed
methodology flow diagram is shown in Figure 1. After collecting all of the
encrypted files, the files are sent over a network that also carries other
shared traffic. The transmitted files are then captured using a network
sniffing tool; here Wireshark is used (Figure 2). However, the capture does
not yield the actual file: the captured data contains modifications and is not
identical to the original file, with a few extra characters added to the
original data.

3.5 Entropy Calculation


In cryptography, entropy describes the amount of unpredictability collected by
a system or application for use in algorithms that require random data.
Entropy deficiency can negatively impact performance and security: a
cryptosystem with insufficient entropy can become insecure and unable to
encrypt data securely. The Shannon entropy is computed for all the encrypted
files. Using the entropy values, we classify the encryption

Fig. 2: sniffed data classification

Table 1: Entropy Values for encrypted files 1 to 5

Encryption Algorithm   File 1   File 2   File 3   File 4   File 5
AES                    7.739    7.682    7.747    7.671    7.736
RC4                    6.275    6.248    6.238    6.240    6.274
Blowfish               7.881    7.881    7.904    7.859    7.908
3DES                   7.900    7.871    7.900    7.881    7.910

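algorithms, as each encryption type yields different values. The Shannon
entropy values of all the files are calculated and listed in Tables 1, 2, 3
and 4. Shannon entropy can be calculated using the formula below, where p(x)
is the probability of symbol x in X. With the base-2 logarithm, the entropy of
byte data lies between 0 and 8 bits per byte.

\[ H(X) = -\sum_{x} p(x) \log_2 p(x) \]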

The tables show that files encrypted with the same algorithm fall within the
same range of entropy values. Shannon’s entropy is a measure of how much
information is stored in data. It measures how widely the data are dispersed
across all potential values. Thus, a higher entropy value means that the data
is spread out more widely, whereas a lower entropy value implies that the
information is concentrated on only a few values. Entropy estimation for
real-time traffic is discussed in [16].
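The entropy calculation can be sketched in a few lines of Python. This is a
minimal sketch rather than the authors' script: it assumes that the symbols
are the bytes of the file and that the logarithm is taken to base 2, which is
consistent with reported values close to 8. The function name shannon_entropy
and the sample inputs are chosen here for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # number of occurrences of each byte value
    # Sum p(x) * log2(1 / p(x)) over every byte value that occurs.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy(b"aaaaaaaa"))        # a single repeated symbol
print(shannon_entropy(b"abababab"))        # two equally likely symbols
print(shannon_entropy(bytes(range(256))))  # every byte value exactly once
```

For a captured file, the same function can be applied to the bytes read from
disk, for example shannon_entropy(open(path, "rb").read()) for some
hypothetical path.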

4 RESULT AND DISCUSSION

In this paper, real-time segregation of encrypted data was done using entropy
as a measure. Twenty different files were created and encrypted using Python

Table 2: Entropy Values for encrypted files 6 to 10

Encryption Algorithm   File 6   File 7   File 8   File 9   File 10
AES                    7.738    7.739    7.737    7.766    7.696
RC4                    6.243    6.272    6.285    6.257    6.258
Blowfish               7.902    7.893    7.898    7.908    7.850
3DES                   7.907    7.907    7.892    7.909    7.877

Table 3: Entropy Values for encrypted files 11 to 15

Encryption Algorithm   File 11   File 12   File 13   File 14   File 15
AES                    7.743     7.724     7.747     7.709     7.670
RC4                    6.258     6.241     6.258     6.237     6.235
Blowfish               7.908     7.885     7.894     7.889     7.882
3DES                   7.892     7.892     7.913     7.913     7.878

programming. Identifying the algorithm from the locally created files alone
would have little relevance, so the files are sent over a network where common
traffic is shared. Using the packet sniffing tool Wireshark, all twenty
encrypted files are captured, and their Shannon entropy values are calculated;
Python is used for this calculation. Files encrypted with the same algorithm
are found to have entropy values that are almost similar or fall in the same
range. We have assigned each algorithm a particular range of values; these
ranges are given in Table 5. Encryption algorithms can also be classified
using pattern recognition [17].
From these calculated entropy values, it is found that each encryption
algorithm has its own range of values. Hence, if a file’s entropy is in the
7.66 - 7.74 range, the file is attributed to AES encryption. The other
encryption techniques are categorised according to their ranges in the same
way. The Shannon entropy of the files was also measured before encryption; the
values all ranged between 4.3 and 4.8 (Tables 6 and 7). This clearly shows
that the entropy values of encrypted files are higher than those of files that
are not encrypted. Usually, the entropy value of an encrypted file is near 8;
here it lies between 6 and 8.

Table 4: Entropy Values for encrypted files 16 to 20

Encryption Algorithm   File 16   File 17   File 18   File 19   File 20
AES                    7.687     7.682     7.606     7.713     7.714
RC4                    6.251     6.239     6.205     6.262     6.251
Blowfish               7.867     7.864     7.859     7.871     7.894
3DES                   7.897     7.879     7.867     7.902     7.890

Table 5: Range of Entropy Values of Encrypted Files

Encryption Algorithm   Entropy Range
AES                    7.66 - 7.74
RC4                    6.23 - 6.30
Blowfish               7.70 - 7.88
3DES                   7.89 - 7.95
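
The range lookup implied by Table 5 can be sketched as follows. This is an
illustrative sketch, not the authors' implementation: the names
ENTROPY_RANGES and candidate_algorithms are hypothetical, and the function
returns every algorithm whose range contains the measured value, since the
ranges in Table 5 are not all disjoint.

```python
from typing import List

# Entropy ranges copied from Table 5 (bits per byte).
ENTROPY_RANGES = {
    "AES": (7.66, 7.74),
    "RC4": (6.23, 6.30),
    "Blowfish": (7.70, 7.88),
    "3DES": (7.89, 7.95),
}


def candidate_algorithms(entropy: float) -> List[str]:
    """Return the algorithms whose Table 5 range contains the entropy value."""
    return [
        name
        for name, (low, high) in ENTROPY_RANGES.items()
        if low <= entropy <= high
    ]


for value in (6.25, 7.72, 7.90, 4.50):
    print(value, candidate_algorithms(value))
```

A value that falls outside every range, such as the plain text entropies of
about 4.3 to 4.8, produces an empty list, which a caller could treat as "not
one of the four algorithms studied".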

Table 6: Entropy Values of plain text files from 1 - 10

Plain Text Files (Sample-1)   Entropy Values
File 1                        4.81736
File 2                        4.33283
File 3                        4.41631
File 4                        4.39035
File 5                        4.48682
File 6                        4.43098
File 7                        4.41376
File 8                        4.54127
File 9                        4.64932
File 10                       4.45465

Table 7: Entropy Values of plain text files 11 - 20

Plain Text Files (Sample-2)   Entropy Values
File 11                       4.74707
File 12                       4.56419
File 13                       4.47308
File 14                       4.4274
File 15                       4.53229
File 16                       4.55865
File 17                       4.34072
File 18                       4.47571
File 19                       4.50534
File 20                       4.42811

5 CONCLUSION AND FUTURE WORK

Finding the encryption algorithm of an encrypted file is a complicated task.
In this paper, we have segregated encrypted data based on entropy value.
Based on the analysis, if a file has been encrypted with some algorithm x,
then the calculated entropy value gives some information about the encryption
algorithm applied to the file. This is an experimental method for classifying
encryption algorithms, and only text files are used here. Beyond text files,
the method could be applied to images, PDFs, videos, and MP3 files of various
sizes. Other measures, such as n-grams and the incidence of co-occurrence, can
also be used to segregate encrypted data; here, however, only entropy values
are used to identify encryption algorithms. In future, this can be implemented
in Next-Generation Firewalls (NGFW), where the incoming traffic contents can
be segregated and labelled at the buffer level.

References

1. Shannon, Claude Elwood. “A mathematical theory of communication”. The Bell
System Technical Journal 27.3 (1948): 379-423.
2. Casino, Fran, Kim-Kwang Raymond Choo, and Constantinos Patsakis. “Hedge: Effi-
cient traffic classification of encrypted and compressed packets”, IEEE Transactions
on Information Forensics and Security 14.11 (2019): 2916-2926.
3. Barbosa, Flávio, Arthur Vidal, and Flávio Mello. “Machine learning for crypto-
graphic algorithm identification”. Journal of Information Security and Cryptogra-
phy (Enigma) 3.1 (2016): 3-8.
4. Tan, Cheng, and Qingbing Ji. “An approach to identifying cryptographic algorithm
from ciphertext”. 2016 8th IEEE International Conference on Communication
Software and Networks (ICCSN). IEEE, 2016.
5. Hosfelt, Diane Duros. “Automated detection and classification of cryptographic
algorithms in binary programs through machine learning”. arXiv preprint
arXiv:1503.01186 (2015).
6. Cheng, Guang, and Ying Hu. “Encrypted traffic identification based on n-gram
entropy and cumulative sum test”. Proceedings of the 13th International Conference
on Future Internet Technologies. 2018.
7. Manjula, R., and R. Anitha. “Identification of encryption algorithm using decision
tree”. International Conference on Computer Science and Information Technology.
Springer, Berlin, Heidelberg, 2011.
8. Mamun, Mohammad Saiful Islam, Ali A. Ghorbani, and Natalia Stakhanova. “An
entropy based encrypted traffic classifier”. International Conference on Information
and Communications Security. Springer, Cham, 2015.
9. Alshammari, Riyad, et al. “Classifying ssh encrypted traffic with minimum packet
header features using genetic programming”. Proceedings of the 11th Annual Con-
ference Companion on Genetic and Evolutionary Computation Conference: Late
Breaking Papers. 2009.
10. James, Maria, et al. “Decrypting Shared Encrypted Data Files Stored in a Cloud
Using Dynamic Key Aggregation”. Computational Intelligence, Cyber Security and
Computational Models. Springer, Singapore, 2016. 385-392.

11. Cha, Seunghun, and Hyoungshick Kim. “Detecting encrypted traffic: a machine
learning approach”. International Workshop on Information Security Applications.
Springer, Cham, 2016.
12. Vinayakumar, R., K. P. Soman, and Prabaharan Poornachandran. “Secure shell
(ssh) traffic analysis with flow based features using shallow and deep networks”.
2017 International Conference on Advances in Computing, Communications and
Informatics (ICACCI). IEEE, 2017.
13. Mukundan, Puliparambil Megha, et al. “Hash-One: a lightweight cryptographic
hash function”. IET Information Security 10.5 (2016): 225-231.
14. Krishnan, Lekshmi R., M. Sindhu, and Chungath Srinivasan. “Analysis of sponge
function based authenticated encryption schemes”. 2017 4th International Confer-
ence on Advanced Computing and Communication Systems (ICACCS). IEEE, 2017.
15. Othman, Hiba, Youssef Hassoun, and Michel Owayjan. “Entropy model for
symmetric key cryptography algorithms based on numerical methods”. 2015
International Conference on Applied Research in Computer Science and
Engineering (ICAR). IEEE, 2015.
16. Dorfinger, Peter, Georg Panholzer, and Wolfgang John. “Entropy estimation for
real-time encrypted traffic identification”. International workshop on traffic moni-
toring and analysis. Springer, Berlin, Heidelberg, 2011.
17. Sharif, Suhaila O., L. I. Kuncheva, and S. P. Mansoor. “Classifying encryption
algorithms using pattern recognition techniques”. 2010 IEEE International
Conference on Information Theory and Information Security. IEEE, 2010.
18. Tang, Zhengzhi, Xuewen Zeng, and Yiqiang Sheng. “Entropy-based feature extrac-
tion algorithm for encrypted and non-encrypted compressed traffic classification”.
International Journal of ICIC 15.3 (2019): 845.
19. Peter Dorfinger, Georg Panholzer, Brian Trammell, and Teresa Pepe. “Entropy-
based traffic filtering to support real-time skype detection”. In Proceedings of the
6th International Wireless Communications and Mobile Computing Conference,
pages 747–751. ACM, 2010.
20. M. Conti, L. V. Mancini, R. Spolaor, and N. V. Verde. “Analyzing android en-
crypted network traffic to identify user actions”. IEEE Transactions on Information
Forensics and Security, 11(1):114–125, Jan 2016.
21. Jeffrey Erman, Anirban Mahanti, Martin Arlitt, Ira Cohen, and Carey
Williamson. “Offline/realtime traffic classification using semi-supervised learning”.
Perform. Eval., 64(9-12):1194–1213, October 2007.
