
1. Abstract
2. Introduction
3. Dataset Source and Preparation
4. System Architecture
5. Technologies Overview
6. Voice-over-Internet Protocol (VoIP)
7. TOR
8. VPN
9. Machine Learning
10. Machine Learning Steps
11. Deep Learning
12. Deep Learning Models:
    1) MLP
    2) 1D-CNN
    3) LSTM
13. Python
14. Keras
15. UML
Abstract:

Network management faces a great challenge in analyzing and identifying encrypted
network traffic associated with specific applications and protocols. A significant number of
network users apply different encryption techniques to network applications and
services to hide the true nature of their network communication. These challenges
prompt the network community to improve network security and enhance network
service quality. Network managers need novel techniques to cope with the failures
and shortcomings of port-based and payload-based classification methods for
encrypted network traffic caused by emergent security technologies. The most
prominent network hopping mechanisms used to make network traffic unknown and
anonymous are the VPN (virtual private network) and TOR (The Onion Router). This paper
presents a novel scheme to unveil encrypted network traffic and easily identify
tunneled and anonymous network traffic. The proposed identification scheme uses
deep learning techniques to easily and efficiently identify
anonymous network traffic and extract the Voice over IP (VoIP) and non-VoIP flows
within encrypted traffic. Finally, the captured traffic is classified into four
different categories, i.e., VPN VoIP, VPN Non-VoIP, TOR VoIP, and TOR Non-VoIP.
The experimental results show that our identification engine is extremely robust on
VPN and TOR network traffic.

Introduction:

In today's world, the Internet is the fastest-growing technology industry and has become an essential
facility for human beings in widespread fields of life. Network traffic comprises Internet activities in the shape of data
encapsulated in network packets. Network traffic needs accurate analysis methods, such as identification and
classification, for associating network traffic flows with a specific application class according to network planning and
network management. Monitoring encrypted network traffic has become a challenging issue for many network tasks,
including firewall enforcement, quality of service (QoS) implementation, traffic engineering, and network security.
Due to the exponential proliferation of numerous network applications, network traffic identification techniques
need to keep pace with many real-world developments. Classifying encrypted network flows and anonymous
communications by their application types is the fundamental process of many crucial network traffic flow monitoring and
controlling tasks and of the forensic investigation of cybercrime. Generally, it mainly focuses on accurate identification and
detailed classification. Thus, encrypted and anonymous network traffic loses its unique characteristics.

The following are the main research contributions of this paper:

(1) The main objective of this paper is to present the Flow Spatio-Temporal Features (FSTFs) for
distinguishing VPN and TOR traffic. The FSTF attribute set is composed of packet-length and timing
components, which are well suited for characterizing VPN and TOR network traffic as VoIP and Non-VoIP
flows.

(2) A prolific dataset is generated via the FSTFs, mature enough to train a classifier based on a
deep neural network and accurately identify the VoIP traffic flows in VPN and TOR network traffic.

(3) The lightweight proposed identification method is validated via three state-of-the-art deep learning algorithms:
the multi-layer perceptron (MLP), the convolutional neural network (CNN), and long short-term memory (LSTM).
The neural network models are trained with a training set, validated with a validation set, and finally tested with
20% unseen data of the total dataset. In consideration of practical implementation efficiency,
only these three deep learning techniques are employed and tested in this project.
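The train/validation/test protocol in (3) can be sketched with plain NumPy. The flow count and the validation fraction below are illustrative assumptions; only the 20% held-out test share comes from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_flows = 1000                       # hypothetical number of flow records
idx = rng.permutation(n_flows)       # shuffle before splitting

n_test = int(0.2 * n_flows)          # 20% unseen test data, as in (3)
n_val = int(0.1 * n_flows)           # illustrative validation fraction

test_idx = idx[:n_test]
val_idx = idx[n_test:n_test + n_val]
train_idx = idx[n_test + n_val:]
```

Shuffling before slicing ensures the three sets are disjoint and drawn from the same distribution.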
Dataset Source and Preparation:

1. VPN-nonVPN dataset (ISCXVPN2016)

The Canadian Institute for Cybersecurity / University of New Brunswick has provided data for network
security researchers. To generate a representative dataset of real-world traffic in ISCX we defined a
set of tasks, assuring that our dataset is rich enough in diversity and quantity. We created accounts
for users Alice and Bob in order to use services like Skype, Facebook, etc. Below we provide the
complete list of different types of traffic and applications considered in our dataset for each traffic
type (VoIP, P2P, etc.)

We captured a regular session and a session over VPN, therefore we have a total of 14 traffic
categories: VOIP, VPN-VOIP, P2P, VPN-P2P, etc. We also give a detailed description of the different
types of traffic generated:

Browsing: Under this label we have HTTPS traffic generated by users while browsing or performing
any task that includes the use of a browser. For instance, when we captured voice calls using
Hangouts, even though browsing was not the main activity, we captured several browsing flows.

Email: The traffic samples were generated using a Thunderbird client and the Alice and Bob Gmail accounts.
The clients were configured to deliver mail through SMTP/S, and receive it using POP3/SSL in one
client and IMAP/SSL in the other.

Chat: The chat label identifies instant-messaging applications. Under this label we have Facebook
and Hangouts via web browsers, Skype, and AIM and ICQ using an application called Pidgin [14].

Streaming: The streaming label identifies multimedia applications that require a continuous and
steady stream of data. We captured traffic from Youtube (HTML5 and flash versions) and Vimeo
services using Chrome and Firefox.

File Transfer: This label identifies traffic applications whose main purpose is to send or receive files
and documents. For our dataset we captured Skype file transfers, FTP over SSH (SFTP) and FTP over
SSL (FTPS) traffic sessions.
VoIP: The Voice over IP label groups all traffic generated by voice applications. Within this label we
captured voice calls using Facebook, Hangouts and Skype.

P2P: This label is used to identify file-sharing protocols like BitTorrent. To generate this traffic we
downloaded different .torrent files from a public repository and captured traffic sessions using the
uTorrent and Transmission applications.

The traffic was captured using Wireshark and tcpdump, generating a total amount of 28GB of data.
For the VPN, we used an external VPN service provider and connected to it using OpenVPN (UDP
mode). To generate SFTP and FTPS traffic we also used an external service provider and Filezilla as a
client.
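For reference, a minimal OpenVPN client profile for UDP mode looks roughly like the sketch below. The server address, port, and certificate file names are placeholders, not the actual external provider used for this capture:

```text
# client.ovpn — hypothetical minimal client profile for UDP mode
client
dev tun
proto udp
remote vpn.example.com 1194   # placeholder server and port
nobind
persist-key
persist-tun
ca ca.crt                     # placeholder certificate and key files
cert client.crt
key client.key
verb 3
```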

To facilitate the labeling process, all unnecessary services and applications were closed when
capturing the traffic. (The only application executed was the objective of the capture, e.g., Skype voice-call,
SFTP file transfer, etc.) We used a filter to capture only the packets whose source or destination IP was
the address of the local client (Alice or Bob).

ISCXFlowMeter, written in Java, reads the pcap files and creates a csv file based on
the selected features. The UNB ISCX Network Traffic (VPN-nonVPN) dataset consists of labeled network
traffic, including the full packets in pcap format and the csv flow records generated by ISCXFlowMeter,
both publicly available for researchers.

UNB ISCX Network Traffic Dataset content

Traffic: Content

Web Browsing: Firefox and Chrome

Email: SMTPS, POP3S and IMAPS

Chat: ICQ, AIM, Skype, Facebook and Hangouts

Streaming: Vimeo and Youtube

File Transfer: Skype, FTPS and SFTP using Filezilla and an external service

VoIP: Facebook, Skype and Hangouts voice calls (1h duration)

P2P: uTorrent and Transmission (Bittorrent)

2. Sample data:
@RELATION <ISCXFlowMeter-generated-flows>

@ATTRIBUTE duration NUMERIC
@ATTRIBUTE total_fiat NUMERIC
@ATTRIBUTE total_biat NUMERIC
@ATTRIBUTE min_fiat NUMERIC
@ATTRIBUTE min_biat NUMERIC
@ATTRIBUTE max_fiat NUMERIC
@ATTRIBUTE max_biat NUMERIC
@ATTRIBUTE mean_fiat NUMERIC
@ATTRIBUTE mean_biat NUMERIC
@ATTRIBUTE flowPktsPerSecond NUMERIC
@ATTRIBUTE flowBytesPerSecond NUMERIC
@ATTRIBUTE min_flowiat NUMERIC
@ATTRIBUTE max_flowiat NUMERIC
@ATTRIBUTE mean_flowiat NUMERIC
@ATTRIBUTE std_flowiat NUMERIC
@ATTRIBUTE min_active NUMERIC
@ATTRIBUTE mean_active NUMERIC
@ATTRIBUTE max_active NUMERIC
@ATTRIBUTE std_active NUMERIC
@ATTRIBUTE min_idle NUMERIC
@ATTRIBUTE mean_idle NUMERIC
@ATTRIBUTE max_idle NUMERIC
@ATTRIBUTE std_idle NUMERIC
@ATTRIBUTE class1 {VPN-BROWSING,VPN-CHAT,VPN-STREAMING,VPN-MAIL,VPN-VOIP,VPN-P2P,VPN-FT}

@DATA
2218398,2191528,2192676,9,17,60217,86113,601.0773450357,370.0094498819,4315.7269344816,3789735.65609057,0,60217,231.7348793482,1815.7413356553,-1,0,-1,0,-1,0,-1,0,VPN-BROWSING
3517813,3517813,3304404,6,7,324059,217426,1413.911977492,998.9129383313,1648.1831183181,1322206.43905745,1,217299,606.8333620838,8971.7215188467,-1,0,-1,0,-1,0,-1,0,VPN-BROWSING
13567657,13394098,13437531,0,0,3906748,3908013,10604.9865399842,5204.3109992254,283.5419556966,257925.299850962,-9,3777807,3527.7319292772,65256.4421527067,5816127,5816127,5816127,0,3777807,3777807,3777807,0,VPN-BROWSING
14697122,14567358,14567724,4,14,2447126,2406746,15799.737527115,6497.6467439786,215.4163243661,198402.925416282,4,2275153,4643.6404423381,51735.3826721319,5068736,5435233.5,5801731,518305.735074869,1141447,1708300,2275153,801651.200471252,VPN-BROWSING
270207,203681,244006,15,10,27147,66572,593.8221574344,363.1041666667,3763.7811011558,3479554.56372337,3,66526,265.9517716535,2648.2935975742,-1,0,-1,0,-1,0,-1,0,VPN-BROWSING
386584,319144,360487,23,10,37637,44652,1039.5570032573,527.7994143485,2566.0658485607,2427443.97077996,0,44652,390.0948536831,2772.2421345564,-1,0,-1,0,-1,0,-1,0,VPN-BROWSING

3. The class labels VPN-BROWSING, VPN-CHAT, VPN-STREAMING, VPN-MAIL, VPN-P2P, and VPN-FT
are replaced with VPN-NON-VOIP.
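This relabelling step can be done with a short pandas sketch. The tiny DataFrame here is illustrative; in practice the ISCXFlowMeter csv would be loaded instead:

```python
import pandas as pd

# Illustrative flows; in practice: df = pd.read_csv("vpn_flows.csv")
df = pd.DataFrame({"class1": ["VPN-CHAT", "VPN-VOIP", "VPN-FT", "VPN-MAIL"]})

# Every non-VoIP class collapses into a single VPN-NON-VOIP label.
non_voip = ["VPN-BROWSING", "VPN-CHAT", "VPN-STREAMING",
            "VPN-MAIL", "VPN-P2P", "VPN-FT"]
df["class1"] = df["class1"].replace(non_voip, "VPN-NON-VOIP")
```

The same approach relabels the Tor flow classes in step 6.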
4. Tor-nonTor dataset (ISCXTor2016)

To be sure about the quantity and diversity of this dataset in CIC, we defined a set of tasks to
generate a representative dataset of real-world traffic. We created three users for the browser
traffic collection and two users for the communication parts such as chat, mail, FTP, and p2p. For the
non-Tor traffic we used previous benign traffic from the VPN project, and for the Tor traffic we used 7
traffic categories:

Browsing: Under this label we have HTTP and HTTPS traffic generated by users while browsing
(Firefox and Chrome).

Email: Traffic samples generated using a Thunderbird client, and Alice and Bob Gmail accounts. The
clients were configured to deliver mail through SMTP/S, and receive it using POP3/SSL in one client
and IMAP/SSL in the other.

Chat: The chat label identifies instant-messaging applications. Under this label we have Facebook
and Hangouts via web browser, Skype, and AIM and ICQ using an application called Pidgin.

Audio-Streaming: The streaming label identifies audio applications that require a continuous and
steady stream of data. We captured traffic from Spotify.

Video-Streaming: The streaming label identifies video applications that require a continuous and
steady stream of data. We captured traffic from YouTube (HTML5 and flash versions) and Vimeo
services using Chrome and Firefox.

FTP: This label identifies traffic applications whose main purpose is to send or receive files and
documents. For our dataset we captured Skype file transfers, FTP over SSH (SFTP) and FTP over SSL
(FTPS) traffic sessions.

VoIP: The Voice over IP label groups all traffic generated by voice applications. Within this label we
captured voice-calls using Facebook, Hangouts and Skype.

P2P: This label is used to identify file-sharing protocols like BitTorrent. To generate this traffic we
downloaded different .torrent files from the Kali Linux distribution and captured traffic sessions using
the Vuze application. We also used different combinations of upload and download speeds.

The traffic was captured using Wireshark and tcpdump, generating a total of 22GB of data. To
facilitate the labeling process, as we explained in the related published paper, we captured the
outgoing traffic at the workstation and the gateway simultaneously, collecting a set of pairs of .pcap
files: one regular traffic pcap (workstation) and one Tor traffic pcap (gateway) file.

Later, we labelled the captured traffic in two steps. First, we processed the .pcap files captured at
the workstation: we extracted the flows, and we confirmed that the majority of traffic flows were
generated by application X (Skype, ftps, etc.), the object of the traffic capture. Then, we labelled all
flows from the Tor .pcap file as X.
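The two-step labelling can be sketched as follows: find the majority application among the workstation flows, then assign that label to every flow in the paired Tor pcap. The flow lists below are hypothetical stand-ins for what would be extracted from the pcap pair:

```python
from collections import Counter

# Hypothetical application guesses for flows from the workstation pcap.
workstation_flows = ["skype", "skype", "dns", "skype", "ntp", "skype"]

# Step 1: confirm the majority application X among workstation flows.
majority_app, votes = Counter(workstation_flows).most_common(1)[0]

# Step 2: label all flows from the paired Tor pcap as X.
tor_flows = ["flow1", "flow2", "flow3"]            # hypothetical flow ids
tor_labels = {flow: majority_app for flow in tor_flows}
```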

ISCXFlowMeter, written in Java, reads the pcap files and creates a csv file based on
the selected features. The UNB CIC Network Traffic (Tor-nonTor) dataset consists of labeled network
traffic, including the full packets in pcap format and the csv flow records generated by ISCXFlowMeter,
both publicly available for researchers.

5. Sample data:

@RELATION <ISCXFlowMeter-generated-flows>

@ATTRIBUTE duration NUMERIC
@ATTRIBUTE total_fiat NUMERIC
@ATTRIBUTE total_biat NUMERIC
@ATTRIBUTE min_fiat NUMERIC
@ATTRIBUTE min_biat NUMERIC
@ATTRIBUTE max_fiat NUMERIC
@ATTRIBUTE max_biat NUMERIC
@ATTRIBUTE mean_fiat NUMERIC
@ATTRIBUTE mean_biat NUMERIC
@ATTRIBUTE flowPktsPerSecond NUMERIC
@ATTRIBUTE flowBytesPerSecond NUMERIC
@ATTRIBUTE min_flowiat NUMERIC
@ATTRIBUTE max_flowiat NUMERIC
@ATTRIBUTE mean_flowiat NUMERIC
@ATTRIBUTE std_flowiat NUMERIC
@ATTRIBUTE min_active NUMERIC
@ATTRIBUTE mean_active NUMERIC
@ATTRIBUTE max_active NUMERIC
@ATTRIBUTE std_active NUMERIC
@ATTRIBUTE min_idle NUMERIC
@ATTRIBUTE mean_idle NUMERIC
@ATTRIBUTE max_idle NUMERIC
@ATTRIBUTE std_idle NUMERIC
@ATTRIBUTE class1 {VOIP,AUDIO-STREAMING,BROWSING,CHAT,EMAIL,FILE-TRANSFER,P2P,VIDEO-STREAMING}

@DATA

10345300,10345257,10345174,60,52,5871778,5870638,517262.85,470235.181818182,4.2531391066,2260.4467729307,13,5870638,240588.372093023,1022684.72516946,4092108,4981436,5870764,1257699.71899774,3435979,4653308.5,5870638,1721563.88877657,CHAT
14966353,14966200,14966053,7,0,635321,635242,10502.5964912281,5615.7797373358,273.4133024926,227450.334760913,0,635042,3658.3605475434,21347.5751363246,-1,0,-1,0,-1,0,-1,0,VIDEO-STREAMING
272867,233627,272820,20303,19353,213324,253467,116813.5,136410,21.9887344384,8733.1923611137,47,213324,54573.4,90181.7371536499,-1,0,-1,0,-1,0,-1,0,CHAT
14999391,14998887,14999391,0,0,149034,146556,9062.7716012085,4758.6900380711,320.6130168885,277900.949445214,0,146556,3119.6736688852,9995.7823251853,-1,0,-1,0,-1,0,-1,0,VIDEO-STREAMING
7190597,7189947,7190597,116,492,6555090,6594711,898743.375,898824.625,2.5032692,1137.1795693737,12,6555090,422976.294117647,1582350.33441528,7051063,7051063,7051063,0,6555090,6555090,6555090,0,CHAT
14990289,14990249,14990288,0,0,335322,335264,10482.6916083916,5736.8113279755,269.8413619644,223305.43460503,0,335264,3706.7974777448,13031.1786320245,-1,0,-1,0,-1,0,-1,0,VIDEO-STREAMING
4894345,4894050,4894345,8600,354,1857545,1857548,444913.636363636,407862.083333333,5.1079357912,2363.9526841692,24,1857504,203931.041666667,463960.72947095,1796398,2139641,2482884,485418.905788598,1438583,1648043.5,1857504,296221.879879762,CHAT
13843963,13842978,13843963,9,0,2192110,2192320,16077.7909407666,8943.1285529716,174.1553339893,145403.523543078,0,2191964,5744.3829875519,52777.3975342276,11840857,11800000,11840857,0,2191964,2191964,2191964,0,VIDEO-STREAMING
3334635,3334345,3334447,123,61,1028396,1028363,123494.259259259,133377.88,16.1936763694,6876.6146819667,16,1028116,62917.641509434,160897.891051173,2322320,2322320,2322320,0,1028116,1028116,1028116,0,CHAT

6. The class labels AUDIO-STREAMING, BROWSING, CHAT, EMAIL, FILE-TRANSFER, P2P, and
VIDEO-STREAMING are replaced with TOR-NON-VOIP, and VOIP with TOR-VOIP.
7. To convert .arff to .csv, the following Python script is used:

# arff2csv.py

# Importing library
import os

# Getting all the arff files from the current directory
files = [arff for arff in os.listdir('.') if arff.endswith(".arff")]

# Function for converting arff content to csv content
def toCsv(text):
    data = False
    header = ""
    new_content = []
    for line in text:
        if not data:
            if "@ATTRIBUTE" in line or "@attribute" in line:
                attributes = line.split()
                if "@attribute" in line:
                    attri_case = "@attribute"
                else:
                    attri_case = "@ATTRIBUTE"
                # The attribute name follows the @ATTRIBUTE keyword
                column_name = attributes[attributes.index(attri_case) + 1]
                header = header + column_name + ","
            elif "@DATA" in line or "@data" in line:
                data = True
                header = header[:-1]  # drop the trailing comma
                header += '\n'
                new_content.append(header)
        else:
            new_content.append(line)
    return new_content

# Main loop for reading and writing files
for file in files:
    with open(file, "r") as inFile:
        content = inFile.readlines()
        name, ext = os.path.splitext(inFile.name)
        new = toCsv(content)
        with open(name + ".csv", "w") as outFile:
            outFile.writelines(new)

8. The two datasets are then merged
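The merge itself can be a simple row-wise concatenation of the two relabelled flow tables with pandas. The tiny frames below are illustrative; in practice the two csv files would be read from disk:

```python
import pandas as pd

# Illustrative stand-ins for the relabelled VPN and Tor flow tables.
vpn = pd.DataFrame({"duration": [2218398, 3517813],
                    "class1": ["VPN-NON-VOIP", "VPN-VOIP"]})
tor = pd.DataFrame({"duration": [10345300],
                    "class1": ["TOR-VOIP"]})

# Row-wise concatenation; both tables share the same feature columns.
merged = pd.concat([vpn, tor], ignore_index=True)
```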

To improve performance and for further machine learning operations, the dataset is split into train
and validation sets and converted to NumPy arrays.

The Python script used is as follows:

# csv2npy.py

import pandas as pd
import numpy as np

df = pd.read_csv('merged2.csv')

# Randomly assign ~70% of the rows to the training set
msk = np.random.rand(len(df)) <= 0.7
train = df[msk]
val = df[~msk]

# All columns except the last are features; the last column is the class label
reg_x_train = train.iloc[:, :-1]
reg_y_train = train.iloc[:, -1:]
reg_x_val = val.iloc[:, :-1]
reg_y_val = val.iloc[:, -1:]

np.save('reg_x_train.npy', reg_x_train.to_numpy())
np.save('reg_y_train.npy', reg_y_train.to_numpy())
np.save('reg_x_val.npy', reg_x_val.to_numpy())
np.save('reg_y_val.npy', reg_y_val.to_numpy())
System Architecture

Technologies overview:

Voice-over-Internet protocol (VoIP)

Voice-over-Internet protocol (VoIP) is a technology that lets users make calls using a broadband
Internet connection instead of a standard phone line.

VoIP technology converts the voice signal used in traditional phone calls into a digital signal that
travels via the Internet rather than analog phone lines.

Because calls are being made over the Internet, they are essentially free when made wherever the
Internet is available.

The traditional telephone industry was hit hard by the VoIP boom, with many users abandoning it as
some of its services have become nearly obsolete.

Understanding Voice-Over-Internet Protocol (VoIP)

Voice-over-Internet-Protocol (VoIP) technology allows users to make "telephone calls" through


Internet connections instead of through analog telephone lines, which renders these calls effectively
free wherever the Internet is available. VoIP changed the telecommunications industry by making
traditional phone lines and services nearly obsolete and reducing demand for them significantly.

As access to the Internet has become more widely available, VoIP has become ubiquitous both for
personal use and for business use.
How Voice-Over-Internet Protocol (VoIP) Works

VoIP works by converting voice audio into packets of data that then travel through the Internet like
any other type of data such as text or pictures. These packets of sound data travel almost instantly
through public and private Internet networks to route from the origination to the destination. Any
landline or mobile phone that is connected to the Internet can place and receive VoIP calls. VoIP calls
can also be conducted on computers through the computer microphone and speakers or headsets.

Because VoIP calls travel through the Internet instead of through analog telephone lines, they are
subject to the same lags and delays as other data traveling the Internet when bandwidth is
compromised or overwhelmed.
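As a back-of-the-envelope illustration of this packetization, a G.711 call with a typical 20 ms packetization interval works out as follows; the 40-byte overhead figure assumes plain IPv4 + UDP + RTP headers with no tunneling:

```python
codec_bitrate_bps = 64_000                     # G.711 codec bit rate
frame_ms = 20                                  # packetization interval
packets_per_second = 1000 // frame_ms          # 50 packets each second

payload_bytes = codec_bitrate_bps // 8 // packets_per_second   # voice bytes per packet
header_bytes = 20 + 8 + 12                     # IPv4 + UDP + RTP headers

# Total one-way bandwidth including header overhead, in kbit/s.
total_kbps = (payload_bytes + header_bytes) * packets_per_second * 8 / 1000
```

These steady packet sizes and inter-arrival times are precisely the kind of flow statistics the FSTFs exploit.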

Advantages and Disadvantages of Voice-Over-Internet Protocol (VoIP)

VoIP technology reduces the cost of voice communication to almost nothing for personal and
commercial use. Many Internet providers throw in VoIP telephone service for free as an incentive to
buy a broadband or higher-speed Internet connection and Internet cable television channels. Since it
costs the Internet provider little extra to provide this service and it costs the customer nothing
extra, it is a win-win for everyone involved in the transaction.

VoIP service has also enabled video calls, conference calls, and webinars for commercial and
personal use at prices that are affordable or free. Previously, video conferencing and web
conferencing were expensive and only available to companies large enough to justify the expense,
but VoIP allows companies of all sizes, including solo practitioners and freelancers, to afford it.

The main con of VoIP services is that they can lag or clump. Because the sound goes in packets, it is
slightly delayed. Under normal circumstances, untrained listeners won't be able to tell the difference
between VoIP and analog calls. But when there is high bandwidth usage on the Internet, the packets
may cluster or be delayed, which can cause a jerky, clumped sound on VoIP calls.

Some VoIP services can't work during power outages if the user or provider does not have backup
power. Some 9-1-1 services do not have the ability to recognize locations of VoIP calls.
TOR

Tor, short for The Onion Router, is free and open-source software for enabling anonymous
communication.[8] It directs Internet traffic through a free, worldwide, volunteer overlay network,
consisting of more than six thousand relays,[9] to conceal a user's location and usage from anyone
performing network surveillance or traffic analysis.[10] Using Tor makes it more difficult to trace a
user's Internet activity. Tor's intended use is to protect the personal privacy of its users, as well as
their freedom and ability to communicate confidentially through IP address anonymity using Tor exit
nodes.

Tor enables its users to surf the Internet, chat and send instant messages anonymously, and is used
by a wide variety of people for both licit and illicit purposes.[24] Tor has, for example, been used by
criminal enterprises, hacktivism groups, and law enforcement agencies at cross purposes, sometimes
simultaneously;[25][26] likewise, agencies within the U.S. government variously fund Tor (the U.S.
State Department, the National Science Foundation, and – through the Broadcasting Board of
Governors, which itself partially funded Tor until October 2012 – Radio Free Asia) and seek to
subvert it.[27][12]

Tor is not meant to completely solve the issue of anonymity on the web. Tor is not designed to
completely erase tracking but instead to reduce the likelihood for sites to trace actions and data
back to the user.

Tor is used for both legitimate and illegal activities. These can include privacy protection or censorship
circumvention,[29] as well as distribution of child abuse content, drug sales, or malware distribution.
[30] According to one estimate, "overall, on an average country/day, ∼6.7% of Tor
network users connect to Onion/Hidden Services that are disproportionately used for illicit
purposes."

Operation

Tor aims to conceal its users' identities and their online activity from surveillance and traffic analysis
by separating identification and routing. It is an implementation of onion routing, which encrypts
and then randomly bounces communications through a network of relays run by volunteers around
the globe. These onion routers employ encryption in a multi-layered manner (hence the onion
metaphor) to ensure perfect forward secrecy between relays, thereby providing users with
anonymity in a network location. That anonymity extends to the hosting of censorship-resistant
content by Tor's anonymous onion service feature.[57] Furthermore, by keeping some of the entry
relays (bridge relays) secret, users can evade Internet censorship that relies upon blocking public Tor
relays.[58]
Because the IP address of the sender and the recipient are not both in cleartext at any hop along the
way, anyone eavesdropping at any point along the communication channel cannot directly identify
both ends. Furthermore, to the recipient, it appears that the last Tor node (called the exit node),
rather than the sender, is the originator of the communication.
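The layer-by-layer "peeling" can be illustrated with a toy sketch. The XOR "cipher" below is a stand-in for illustration only; real Tor uses proper symmetric encryption (AES) with per-hop keys negotiated during circuit setup:

```python
def xor_layer(data: bytes, key: bytes) -> bytes:
    # Toy stand-in for a real symmetric cipher (illustration only).
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def wrap(message: bytes, relay_keys: list) -> bytes:
    # The sender adds layers innermost-first, so the guard's layer is outermost.
    for key in reversed(relay_keys):
        message = xor_layer(message, key)
    return message

def unwrap(cell: bytes, relay_keys: list) -> bytes:
    # Each relay in path order peels exactly one layer.
    for key in relay_keys:
        cell = xor_layer(cell, key)
    return cell

keys = [b"guard-key", b"middle-key", b"exit-key"]   # hypothetical per-hop keys
cell = wrap(b"GET /index.html", keys)               # what leaves the sender
```

Only after the exit relay removes the last layer does the original payload appear, which is why the exit node looks like the originator to the recipient.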

VPN

A virtual private network (VPN) extends a private network across a public network and enables users
to send and receive data across shared or public networks as if their computing devices were directly
connected to the private network.[1] The benefits of a VPN include increases in functionality,
security, and management of the private network. It provides access to resources that are
inaccessible on the public network and is typically used for remote workers. Encryption is common,
although not an inherent part of a VPN connection.[2]

A VPN is created by establishing a virtual point-to-point connection through the use of dedicated
circuits or with tunneling protocols over existing networks. A VPN available from the public Internet
can provide some of the benefits of a wide area network (WAN). From a user perspective, the
resources available within the private network can be accessed remotely.

VPNs cannot make online connections completely anonymous, but they can increase privacy and
security. To prevent disclosure of private information or data sniffing, VPNs typically allow only
authenticated remote access using tunneling protocols and secure encryption techniques.

The VPN security model provides:

Confidentiality such that even if the network traffic is sniffed at the packet level (see network
sniffer or deep packet inspection), an attacker would see only encrypted data, not the raw data.

Sender authentication to prevent unauthorized users from accessing the VPN.

Message integrity to detect and reject any instances of tampering with transmitted messages.

Secure VPN protocols include the following:


Internet Protocol Security (IPsec) was initially developed by the Internet Engineering Task Force
(IETF) for IPv6, which was required in all standards-compliant implementations of IPv6 before RFC
6434 made it only a recommendation.[6] This standards-based security protocol is also widely used
with IPv4 and the Layer 2 Tunneling Protocol. Its design meets most security goals: availability,
integrity, and confidentiality. IPsec uses encryption, encapsulating an IP packet inside an IPsec
packet. De-encapsulation happens at the end of the tunnel, where the original IP packet is decrypted
and forwarded to its intended destination.

Transport Layer Security (SSL/TLS) can tunnel an entire network's traffic (as it does in the
OpenVPN project and SoftEther VPN project[7]) or secure an individual connection. A number of
vendors provide remote-access VPN capabilities through SSL. An SSL VPN can connect from locations
where IPsec runs into trouble with Network Address Translation and firewall rules.

Machine learning

Machine learning is the process of making systems that learn and improve by themselves, without being
explicitly programmed.

The ultimate goal of machine learning is to design algorithms that automatically help a system
gather data and use that data to learn more. Systems are expected to look for patterns in the data
collected and use them to make vital decisions for themselves.

In general, machine learning is getting systems to think and act like humans, show human-like
intelligence, and give them a brain. In the real world, there are existing machine learning models
capable of tasks like :

Separating spam from actual emails, as seen in Gmail

Correcting grammar and spelling mistakes, as seen in autocorrect

Thanks to machine learning, the world has also seen systems designed to exhibit uncanny
human-like thinking, performing tasks like:

Object and image recognition

Detecting fake news


Understanding written or spoken words

Bots on websites that interact with humans, like humans

Self-driven cars


Machine Learning Steps

The task of imparting intelligence to machines seems daunting and impossible. But it becomes
tractable when broken down into 7 major steps:

1. Collecting Data:

As you know, machines initially learn from the data that you give them. It is of the utmost
importance to collect reliable data so that your machine learning model can find the correct
patterns. The quality of the data that you feed to the machine will determine how accurate your
model is. If you have incorrect or outdated data, you will have wrong outcomes or predictions which
are not relevant.
Make sure you use data from a reliable source, as it will directly affect the outcome of your model.
Good data is relevant, contains very few missing and repeated values, and has a good representation
of the various subcategories/classes present.

2. Preparing the Data:

After you have your data, you have to prepare it. You can do this by:

Putting together all the data you have and randomizing it. This helps make sure that data is evenly
distributed, and the ordering does not affect the learning process.

Cleaning the data to remove unwanted data, missing values, rows, and columns, duplicate values,
data type conversion, etc. You might even have to restructure the dataset and change the rows and
columns or index of rows and columns.

Visualizing the data to understand how it is structured and the relationships between the
various variables and classes present.

Splitting the cleaned data into two sets - a training set and a testing set. The training set is the set
your model learns from. A testing set is used to check the accuracy of your model after training.

3. Choosing a Model:

A machine learning model determines the output you get after running a machine learning algorithm
on the collected data. It is important to choose a model which is relevant to the task at hand. Over
the years, scientists and engineers developed various models suited for different tasks like speech
recognition, image recognition, prediction, etc. Apart from this, you also have to see if your model is
suited for numerical or categorical data and choose accordingly.

4. Training the Model:

Training is the most important step in machine learning. In training, you pass the prepared data to
your machine learning model to find patterns and make predictions. It results in the model learning
from the data so that it can accomplish the task set. Over time, with training, the model gets better
at predicting.

5. Evaluating the Model:

After training your model, you have to check how it is performing. This is done by testing the
performance of the model on previously unseen data. The unseen data used is the testing set that
you split your data into earlier. If testing were done on the same data used for training, you
would not get an accurate measure, as the model is already used to the data and finds the same
patterns in it as it did before. This would give you disproportionately high accuracy.
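Taken together, these steps can be sketched end to end. The following is a minimal illustration using only the Python standard library; the synthetic two-class data and the tiny nearest-centroid "model" are stand-ins chosen for brevity, not a method used in this work.

```python
import random

# 1. Collect data: two synthetic classes of 1-D points (stand-in for real data).
random.seed(0)
data = [(random.gauss(0.0, 1.0), "low") for _ in range(100)] + \
       [(random.gauss(5.0, 1.0), "high") for _ in range(100)]

# 2. Prepare data: shuffle so ordering does not affect learning,
#    then split into a training set and a testing set.
random.shuffle(data)
train, test = data[:160], data[160:]

# 3./4. Choose and train a model: a trivial nearest-centroid classifier.
def fit(samples):
    centroids = {}
    for label in ("low", "high"):
        values = [x for x, y in samples if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids

def predict(centroids, x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))

model = fit(train)

# 5. Evaluate on the unseen test set, never on the training data.
correct = sum(predict(model, x) == y for x, y in test)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

Because the two classes are well separated, the evaluated accuracy on the held-out set is close to 1.0; with overlapping real-world classes the gap between training and testing accuracy is what reveals overfitting.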

Deep learning

Deep learning is a subset of machine learning, which is essentially a neural network with three or
more layers. These neural networks attempt to simulate the behavior of the human brain—albeit far
from matching its ability—allowing it to “learn” from large amounts of data. While a neural network
with a single layer can still make approximate predictions, additional hidden layers can help to
optimize and refine for accuracy.

Deep learning drives many artificial intelligence (AI) applications and services that improve
automation, performing analytical and physical tasks without human intervention. Deep learning
technology lies behind everyday products and services (such as digital assistants, voice-enabled TV
remotes, and credit card fraud detection) as well as emerging technologies (such as self-driving cars).

Deep Learning Models:

1) MLP

The multilayer perceptron (MLP) is a feed-forward neural network model consisting of multiple layers: an input
layer, one or more hidden layers, and an output layer [65]. The dimensions of these layers vary from model to
model according to the nature of the problem. The neurons of each layer are fully connected to the neurons of the
subsequent layer.

2) 1D-CNN

The convolutional neural network (CNN) is a widely used deep learning model; initially, it was preferred for image
recognition problems. Researchers later applied it in various fields and achieved state-of-the-art accuracies in tasks
such as object detection, image classification, and network traffic classification (TC). CNN has a strong ability to
automatically extract critical features by chaining convolutional layers. Each layer comprises a set of filters (or
kernels) that are convolved with the input units to extract spatial features of a certain input region. Another
important feature of the CNN architecture is the pooling layer. It is located between successive convolutional
layers and downsamples the feature maps to reduce complexity and the number of parameters, and to reduce
overfitting [28]. The output layer, commonly called the softmax layer, contains the activation function that
accomplishes the classification task. It outputs an N-dimensional probability distribution vector over [0, 1], where
N is the number of output classes; each real value represents an output class score. The architecture of the CNN
can be 1D, 2D, or 3D, depending on the nature of the specific problem.
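The three building blocks just described (convolution, pooling, and softmax) can be sketched in plain Python for the 1D case. The signal and kernel values below are arbitrary illustrations, not learned filters.

```python
import math

def conv1d(signal, kernel):
    # Slide the filter over the input and take dot products (valid padding).
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(feature_map, size=2):
    # Downsample: keep the maximum of each non-overlapping window.
    return [max(feature_map[i:i + size])
            for i in range(0, len(feature_map) - size + 1, size)]

def softmax(scores):
    # Turn raw class scores into a probability distribution over [0, 1].
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0]
kernel = [1.0, 0.0, -1.0]            # a hypothetical learned filter

features = conv1d(signal, kernel)    # 6 local spatial features
pooled = max_pool1d(features)        # 3 values after downsampling
probs = softmax(pooled)              # N-dimensional probability vector
print(probs)
```

A real 1D-CNN chains many such convolution/pooling stages with learned filters before the final softmax layer; the shapes shrink at each pooling step exactly as in this toy pipeline.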

3) LSTM
The RNN is an ineffective technique for long-sequence modeling due to the vanishing gradient problem. To
overcome this shortcoming of the standard RNN, Hochreiter et al. [66] introduced a developed form of RNN called
the long short-term memory (LSTM) model, which is able to model long-term dependencies. LSTM works on
sequential data and has been used in a variety of fields such as speech recognition, handwriting recognition, and
natural language processing tasks such as machine translation, constituency parsing, and language modeling.
LSTM contains complex memory units instead of the neurons of general neural networks. LSTM units have the
ability to store information for longer time periods in the shape of a state vector. The memory units contain
several gates, such as the input gate, forget gate, and output gate, to control the information passing along a
sequence.
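A single LSTM memory unit can be sketched with scalar values to show how the gates control the state vector. The weights below are hypothetical placeholders; a trained LSTM learns separate weight matrices and biases for each gate.

```python
import math

def sigmoid(x):
    # Squashes a value into (0, 1): a "how much to let through" factor.
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # Each gate looks at the current input and the previous hidden state.
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev)    # input gate
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev)    # forget gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev)    # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev)  # candidate memory
    c = f * c_prev + i * g   # state: old memory kept by f, new memory admitted by i
    h = o * math.tanh(c)     # hidden output, gated by o
    return h, c

# Hypothetical per-gate weights (input weight, recurrent weight).
weights = {"i": (0.5, 0.1), "f": (0.6, 0.2), "o": (0.4, 0.3), "g": (0.8, 0.1)}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3]:           # process a sequence one step at a time
    h, c = lstm_step(x, h, c, weights)
print(f"final hidden state: {h:.4f}")
```

Because the forget gate multiplies the previous state rather than repeatedly squashing it, information can persist across many steps, which is what lets the LSTM avoid the vanishing gradients that cripple the plain RNN.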

Python:

Python is a high-level, interpreted, general-purpose programming language. Its design philosophy
emphasizes code readability with the use of significant indentation.

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms,
including structured (particularly procedural), object-oriented, and functional programming. It is
often described as a "batteries included" language due to its comprehensive standard library.

Guido van Rossum began working on Python in the late 1980s as a successor to the ABC
programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000
and introduced new features such as list comprehensions, cycle-detecting garbage collection
(supplementing reference counting), and Unicode support. Python 3.0, released in 2008, was a major revision that is
not completely backward-compatible with earlier versions. Python 2 was discontinued with version
2.7.18 in 2020.

Python consistently ranks as one of the most popular programming languages.

Keras:

Keras is a Python deep learning API designed with best practices for reducing cognitive load:

it offers consistent & simple APIs,

it minimizes the number of user actions required for common use cases,

and it provides clear & actionable error messages.

The Keras functional API is a way to create models that are more flexible than the tf.keras.Sequential
API. The functional API can handle models with non-linear topology, shared layers, and even multiple
inputs or outputs.
The main idea is that a deep learning model is usually a directed acyclic graph (DAG) of layers. So the
functional API is a way to build graphs of layers.

Consider the following model:

(input: 784-dimensional vectors)

[Dense (64 units, relu activation)]

[Dense (64 units, relu activation)]

[Dense (10 units, softmax activation)]

(output: logits of a probability distribution over 10 classes)

This is a basic graph with three layers. To build this model using the functional API, start by creating
an input node:

inputs = keras.Input(shape=(784,))

The shape of the data is set as a 784-dimensional vector. The batch size is always omitted since only
the shape of each sample is specified.

If, for example, you have an image input with a shape of (32, 32, 3), you would use:

# Just for demonstration purposes.

img_inputs = keras.Input(shape=(32, 32, 3))

The inputs object that is returned contains information about the shape and dtype of the input data
that you feed to your model. Here's the shape:

inputs.shape
TensorShape([None, 784])

Here's the dtype:

inputs.dtype

tf.float32

You create a new node in the graph of layers by calling a layer on this inputs object:

dense = layers.Dense(64, activation="relu")

x = dense(inputs)

The "layer call" action is like drawing an arrow from "inputs" to this layer you created. You're
"passing" the inputs to the dense layer, and you get x as the output.

Let's add a few more layers to the graph of layers:

x = layers.Dense(64, activation="relu")(x)

outputs = layers.Dense(10)(x)

At this point, you can create a Model by specifying its inputs and outputs in the graph of layers:

model = keras.Model(inputs=inputs, outputs=outputs, name="mnist_model")

Let's check out what the model summary looks like:

model.summary()

Model: "mnist_model"

_________________________________________________________________

Layer (type) Output Shape Param #

=================================================================

input_1 (InputLayer) [(None, 784)] 0

_________________________________________________________________
dense (Dense) (None, 64) 50240

_________________________________________________________________

dense_1 (Dense) (None, 64) 4160

_________________________________________________________________

dense_2 (Dense) (None, 10) 650

=================================================================

Total params: 55,050

Trainable params: 55,050

Non-trainable params: 0

A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input
tensor and one output tensor.

Schematically, the following Sequential model:

# Define Sequential model with 3 layers

model = keras.Sequential(
    [
        layers.Dense(2, activation="relu", name="layer1"),
        layers.Dense(3, activation="relu", name="layer2"),
        layers.Dense(4, name="layer3"),
    ]
)

# Call model on a test input

x = tf.ones((3, 3))

y = model(x)

is equivalent to this function:


# Create 3 layers

layer1 = layers.Dense(2, activation="relu", name="layer1")

layer2 = layers.Dense(3, activation="relu", name="layer2")

layer3 = layers.Dense(4, name="layer3")

# Call layers on a test input

x = tf.ones((3, 3))

y = layer3(layer2(layer1(x)))

A Sequential model is not appropriate when:

Your model has multiple inputs or multiple outputs

Any of your layers has multiple inputs or multiple outputs

You need to do layer sharing

You want non-linear topology (e.g. a residual connection, a multi-branch model)

Creating a Sequential model

You can create a Sequential model by passing a list of layers to the Sequential constructor:

model = keras.Sequential(
    [
        layers.Dense(2, activation="relu"),
        layers.Dense(3, activation="relu"),
        layers.Dense(4),
    ]
)

Its layers are accessible via the layers attribute:

model.layers
[<tensorflow.python.keras.layers.core.Dense at 0x7fbd5f285a00>,

<tensorflow.python.keras.layers.core.Dense at 0x7fbd5f285c70>,

<tensorflow.python.keras.layers.core.Dense at 0x7fbd5f285ee0>]

You can also create a Sequential model incrementally via the add() method:

model = keras.Sequential()

model.add(layers.Dense(2, activation="relu"))

model.add(layers.Dense(3, activation="relu"))

model.add(layers.Dense(4))

Note that there's also a corresponding pop() method to remove layers: a Sequential model behaves
very much like a list of layers.

model.pop()

print(len(model.layers)) # 2

Also note that the Sequential constructor accepts a name argument, just like any layer or model in
Keras. This is useful to annotate TensorBoard graphs with semantically meaningful names.

model = keras.Sequential(name="my_sequential")

model.add(layers.Dense(2, activation="relu", name="layer1"))

model.add(layers.Dense(3, activation="relu", name="layer2"))

model.add(layers.Dense(4, name="layer3"))

Specifying the input shape in advance

Generally, all layers in Keras need to know the shape of their inputs in order to be able to create
their weights. So when you create a layer like this, initially, it has no weights:

layer = layers.Dense(3)

layer.weights # Empty

[]
It creates its weights the first time it is called on an input, since the shape of the weights depends on
the shape of the inputs:

# Call layer on a test input

x = tf.ones((1, 4))

y = layer(x)

layer.weights # Now it has weights, of shape (4, 3) and (3,)

[<tf.Variable 'dense_6/kernel:0' shape=(4, 3) dtype=float32, numpy=

array([[-0.5312456 , -0.02559239, -0.77284306],

[-0.18156391, 0.7774476 , -0.05044252],

[-0.3559971 , 0.43751895, 0.3434813 ],

[-0.25133908, 0.8889308 , -0.6510118 ]], dtype=float32)>,

<tf.Variable 'dense_6/bias:0' shape=(3,) dtype=float32, numpy=array([0., 0., 0.], dtype=float32)>]

Naturally, this also applies to Sequential models. When you instantiate a Sequential model without
an input shape, it isn't "built": it has no weights (and calling model.weights results in an error stating
just this). The weights are created when the model first sees some input data:

model = keras.Sequential(
    [
        layers.Dense(2, activation="relu"),
        layers.Dense(3, activation="relu"),
        layers.Dense(4),
    ]
)  # No weights at this stage!

# At this point, you can't do this:

# model.weights

# You also can't do this:


# model.summary()

# Call the model on a test input

x = tf.ones((1, 4))

y = model(x)

print("Number of weights after calling the model:", len(model.weights)) # 6

Number of weights after calling the model: 6

Once a model is "built", you can call its summary() method to display its contents:

model.summary()

Model: "sequential_3"

_________________________________________________________________

Layer (type) Output Shape Param #

=================================================================

dense_7 (Dense) (1, 2) 10

_________________________________________________________________

dense_8 (Dense) (1, 3) 9

_________________________________________________________________

dense_9 (Dense) (1, 4) 16

=================================================================

Total params: 35

Trainable params: 35

Non-trainable params: 0
UML

Use Case:
