Data Overview
In [1]: import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]: import pandas as pd
data = pd.read_csv('C://Users/ditama/Downloads/Unicauca-dataset-April-June-2019-Networ
data.head()
Out[2]:
flow_key src_ip_numeric src_ip src_port dst_ip dst_port
5 rows × 50 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2704839 entries, 0 to 2704838
Data columns (total 50 columns):
# Column Dtype
--- ------ -----
0 flow_key object
1 src_ip_numeric int64
2 src_ip object
3 src_port int64
4 dst_ip object
5 dst_port int64
6 proto int64
7 pktTotalCount int64
8 octetTotalCount int64
9 min_ps int64
10 max_ps int64
11 avg_ps float64
12 std_dev_ps float64
13 flowStart float64
14 flowEnd float64
15 flowDuration float64
16 min_piat float64
17 max_piat float64
18 avg_piat float64
19 std_dev_piat float64
20 f_pktTotalCount int64
21 f_octetTotalCount int64
22 f_min_ps int64
23 f_max_ps int64
24 f_avg_ps float64
25 f_std_dev_ps float64
26 f_flowStart float64
27 f_flowEnd float64
28 f_flowDuration float64
29 f_min_piat float64
30 f_max_piat float64
31 f_avg_piat float64
32 f_std_dev_piat float64
33 b_pktTotalCount int64
34 b_octetTotalCount int64
35 b_min_ps int64
36 b_max_ps int64
37 b_avg_ps float64
38 b_std_dev_ps float64
39 b_flowStart float64
40 b_flowEnd float64
41 b_flowDuration float64
42 b_min_piat float64
43 b_max_piat float64
44 b_avg_piat float64
45 b_std_dev_piat float64
46 flowEndReason int64
47 category object
48 application_protocol object
49 web_service object
dtypes: float64(27), int64(17), object(6)
memory usage: 1.0+ GB
HTTP : 5.65%
Facebook : 4.47%
Amazon : 3.24%
GoogleServices : 3.23%
BitTorrent : 2.62%
YouTube : 2.06%
Messenger : 1.67%
HTTP_Proxy : 1.25%
Others : 14.04%
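The percentages above read like the share of each label in a categorical column such as `web_service`, with the long tail folded into a single "Others" row. The cell that produced them is not shown, so the following is only a minimal sketch under that assumption; the helper `share_with_others`, the toy labels, and the `top_n` cut-off are hypothetical.

```python
import pandas as pd

def share_with_others(labels: pd.Series, top_n: int = 3) -> pd.Series:
    """Percentage share of the top_n most frequent labels,
    with every remaining label summed into one 'Others' entry."""
    shares = labels.value_counts(normalize=True) * 100  # sorted, largest first
    top = shares.iloc[:top_n]
    others = shares.iloc[top_n:].sum()
    if others > 0:
        top = pd.concat([top, pd.Series({"Others": others})])
    return top.round(2)

# Toy labels standing in for a column like data['web_service'] (hypothetical values).
toy = pd.Series(["HTTP"] * 5 + ["DNS"] * 3 + ["TLS"] * 1 + ["NTP"] * 1)
result = share_with_others(toy, top_n=2)
for name, pct in result.items():
    print(f"{name} : {pct:.2f}%")
```

Folding the tail into one row keeps the listing short when the column has many rare labels.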
In [11]: data[num_cols].describe()
Out[11]:
b_avg_ps f_pktTotalCount f_max_piat f_std_dev_piat b_avg_piat flowEnd
8 rows × 44 columns
Out[12]: []
For columns with at most 50 unique values, we plot histograms; for the
rest, we only list the distribution of the most frequent values, as is
done for the categorical columns.
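The rule described above can be sketched as a small helper that splits columns by cardinality. This is an illustration rather than the notebook's own cell: `summarize_columns`, `top_k`, and the toy frame are hypothetical, and only the 50-value threshold comes from the text.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame, max_unique: int = 50, top_k: int = 5):
    """Split columns by cardinality.

    Columns with at most max_unique distinct values are returned as
    histogram candidates; for the others, the top_k most frequent values
    and their relative frequencies are returned instead.
    """
    to_plot, top_values = [], {}
    for col in df.columns:
        if df[col].nunique() <= max_unique:
            to_plot.append(col)
        else:
            top_values[col] = df[col].value_counts(normalize=True).head(top_k)
    return to_plot, top_values

# Toy frame (hypothetical values): 'proto' has few distinct values, 'dst_port' has many.
toy = pd.DataFrame({
    "proto": [6, 17, 6, 6, 17, 1] * 20,
    "dst_port": list(range(100)) + [443] * 20,
})
to_plot, top_values = summarize_columns(toy, max_unique=50, top_k=3)
print("histogram columns:", to_plot)
print(top_values["dst_port"])
```

The histogram step itself could then be something like `toy[to_plot].hist()`, which requires matplotlib.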
Correlation Matrix
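The correlation cell and its plot did not survive the export. As a rough sketch of how such a matrix is often used before feature selection, the snippet below lists column pairs whose absolute Pearson correlation exceeds a threshold. The helper `high_corr_pairs`, the 0.9 threshold, and the toy columns are assumptions, not the notebook's actual procedure.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Masked-out entries are NaN (or already dropped) and fail this comparison.
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy frame (hypothetical values): 'flowEnd' is a linear function of 'flowStart'.
a = np.arange(10.0)
toy = pd.DataFrame({
    "flowStart": a,
    "flowEnd": 2 * a + 1,
    "min_ps": [1.0, -1.0] * 5,
})
pairs = high_corr_pairs(toy, threshold=0.9)
print(pairs)
```

One column from each reported pair is a candidate for removal, since it carries nearly the same information as the other.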
Preprocessing
Feature Selection
In [26]: ipdata_num.drop(['f_flowStart','flowEnd','octetTotalCount','b_octetTotalCount',
Final Features
In [31]: df.head()
Out[31]:
b_std_dev_piat proto min_ps dst_port flowDuration b_max_piat f_std_dev_ps min_piat src_ip_num
In [32]: df.shape
In [37]: # normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# reuse the min/max learned from the training set instead of refitting on the test set
X_test_scaled = scaler.transform(X_test)
In [38]: X_train_scaled
In [39]: X_test_scaled
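A point worth checking in a scaling step like this one: the scaler is normally fitted on the training split only and then applied to the test split with `transform()`. The toy arrays below are hypothetical and only illustrate how refitting on the test split changes the scale.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature splits (hypothetical values).
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[2.0], [4.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min=0, max=10
X_test_scaled = scaler.transform(X_test)        # reuses the training min/max

# Refitting on the test split maps its own min and max to 0 and 1,
# so the same raw value lands on a different scale than in training.
X_test_refit = MinMaxScaler().fit_transform(X_test)

print(X_test_scaled.ravel())
print(X_test_refit.ravel())
```

With the shared scaler, a raw value of 2 means the same thing in both splits; with a refit it does not, which can distort the test score.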
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
Y_train_encode = label_encoder.fit_transform(y_train)
In [43]: Y_test_encode
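The cell that builds `Y_test_encode` is not shown. A minimal sketch, assuming the test labels are encoded with the encoder fitted on the training labels, might look like this; the toy label lists are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical values).
y_train = ["DNS", "Google", "TLS", "DNS"]
y_test = ["TLS", "DNS"]

label_encoder = LabelEncoder()
Y_train_encode = label_encoder.fit_transform(y_train)  # classes are sorted alphabetically
Y_test_encode = label_encoder.transform(y_test)        # same mapping as training

print(list(label_encoder.classes_))
print(Y_train_encode, Y_test_encode)

# A label that never appeared in training cannot be encoded.
try:
    label_encoder.transform(["YouTube"])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print("unseen label accepted:", unseen_ok)
```

With many rare classes, the test split may contain labels absent from the training split, so this failure mode is worth checking before scoring.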
tree_train_accuracy = clf_gini.score(X_train_scaled,Y_train_encode)
tree_accuracy = clf_gini.score(X_test_scaled,Y_test_encode)
cnt = 1
# split() generates the indices that divide the data into training and test sets.
for train_index, test_index in kf.split(X_train_scaled, Y_train_encode):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt += 1
# Note:
# when 'cv' is an integer and the estimator is a classifier, cross_val_score()
# uses StratifiedKFold splitting by default, so the step above can be skipped
# by passing cv=5 to cross_val_score().
C:\Users\ditama\anaconda3\lib\site-packages\sklearn\model_selection\_split.py:684: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
C:\Users\ditama\anaconda3\lib\site-packages\sklearn\model_selection\_split.py:684: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
Scores for each fold are: [0.80910815 0.80830958 0.80759604 0.80804709 0.80902241]
Average score: 0.81
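The note in the cell above says that an integer `cv` already implies stratified folds for a classifier. The sketch below checks this on synthetic data. It assumes `clf_gini` is a `DecisionTreeClassifier` with the Gini criterion, which the name suggests but the export does not show, and uses `make_classification` output in place of the real features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled features and encoded labels.
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
clf = DecisionTreeClassifier(criterion="gini", random_state=0)

# Explicit splitter: unshuffled stratified 5-fold.
skf = StratifiedKFold(n_splits=5)
scores_explicit = cross_val_score(clf, X, y, cv=skf)

# Integer cv: for a classifier with multiclass y, scikit-learn builds the same splitter.
scores_int = cross_val_score(clf, X, y, cv=5)

print("Scores for each fold are:", scores_int)
print("Same as explicit splitter:", np.allclose(scores_explicit, scores_int))
print(f"Average score: {scores_int.mean():.2f}")
```

An explicit splitter is still useful when shuffling or a fixed `random_state` for the folds is wanted.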
cnt = 1
# split() generates the indices that divide the data into training and test sets.
for train_index, test_index in kf2.split(X_test_scaled, Y_test_encode):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt += 1
# Note:
# when 'cv' is an integer and the estimator is a classifier, cross_val_score()
# uses StratifiedKFold splitting by default, so the step above can be skipped
# by passing cv=5 to cross_val_score().
C:\Users\ditama\anaconda3\lib\site-packages\sklearn\model_selection\_split.py:684: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
C:\Users\ditama\anaconda3\lib\site-packages\sklearn\model_selection\_split.py:684: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
Scores for each fold are: [0.80932628 0.80940392 0.80885305 0.80842789 0.80936695]
Average score: 0.81
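The `UserWarning` in both runs means at least one class has fewer samples than the number of folds, so it cannot appear in every fold. One possible way to handle this, shown here only as a sketch and not as the notebook's approach, is to drop classes with fewer than `n_splits` samples before cross-validation. The helper `drop_rare_classes` and the toy data are hypothetical.

```python
import pandas as pd

def drop_rare_classes(X: pd.DataFrame, y: pd.Series, min_count: int = 5):
    """Keep only rows whose class has at least min_count samples."""
    counts = y.value_counts()
    keep = y.map(counts) >= min_count
    return X[keep], y[keep]

# Toy data (hypothetical values): 'Rare' occurs once, below the 5-fold minimum.
X = pd.DataFrame({"min_ps": range(12)})
y = pd.Series(["DNS"] * 6 + ["TLS"] * 5 + ["Rare"])
X_kept, y_kept = drop_rare_classes(X, y, min_count=5)
print(y_kept.value_counts().to_dict())
```

Merging rare labels into a shared bucket is an alternative when those samples should not be discarded.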
Naive Bayes
GaussianNB()
KNN Model
Out[59]: ▾ MLPClassifier
MLPClassifier(hidden_layer_sizes=(3, 2))
Random Forest
Out[62]: ▾ RandomForestClassifier
RandomForestClassifier(n_estimators=1)
Evaluation With DT
In [65]: y = label_encoder.inverse_transform([28,17,102,116])
y
Out[65]: array(['Google', 'DNS', 'TLS', 'Unknown'], dtype=object)
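`inverse_transform` maps integer codes back to the original labels by indexing into the encoder's sorted `classes_` array. The toy labels below are hypothetical, so the codes differ from the 28, 17, 102, and 116 used above, which depend on the full label set.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(["TLS", "DNS", "Google", "Unknown", "DNS"])

# classes_ holds the sorted labels; each code is an index into that array.
print(list(label_encoder.classes_))
print(list(codes))

decoded = label_encoder.inverse_transform([1, 0, 2, 3])
print(list(decoded))
```

The same call turns a model's integer predictions back into readable service names.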