(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, 2011
with http as being the primary characteristic. See that paper for the entire graphic and an explanation of it.

Fig. 2 shows an 11x3 sample cutout of an 18x14 SOM from Hoglund [11] in a hexagonal format representing user behaviors such as CPU times, characters transmitted, and blocks read in a U-matrix display, meaning that every other hexagon is a node (marked in Fig. 2 with either a dot or a numerical label) and that the intervening hexagons are in a grey scale indicating the distances between the neighboring nodes, darker meaning a larger distance and lighter meaning a closer distance. The labels indicate a user number and the number of Best Matching Node (BMN) hits in a node for that user. For example, 127_8 means that User 127 had 8 BMN hits on the node with that label. A single user as reported in this paper can have hits on nodes in numerous areas of the map. The hexagons provide better representation than a standard 2D layout, but the rectangular layout of the hexagons limits this potential. The U-matrix display is an advantage in that it visually highlights clusters of nodes. The researchers on this project probably have a good idea of the characteristics of various clusters, but these characteristics are not readily apparent from the displayed map. See that paper for the entire graphic and explanation of it. Rectangular U-matrix hexagonal maps were also used by Cho [12] in 2002.

Fig. 3 is a sample cutout of a SOM from Kayacik [13] in 2003 based on network traffic where each hexagon is a node and the amount of filling in the hexagon represents how many BMN hits the corresponding node has (the more hits, the larger the filling). This can create different patterns for different types of traffic, attack vs. normal traffic, for example. See that paper for the entire graphic and explanation of it. A similar histogram map was used by Yeloglu [14] in 2007. This type of map produces useful visual patterns, but does not indicate distances between nodes nor characteristics of nodes.

Fig. 4 is a sample cutout of a U-matrix SOM from Kayacik [15] in 2006 which has been labeled with acronyms and with boundaries drawn to enclose clusters. MHP, for example, stands for multihop, and is in a region in this cutout called host-based attack group. See that paper for the full graphic and an explanation of it. This type of map provides more information than previous ones, but is still somewhat cryptic.

III. BACKGROUND
This research evolved from a one dimensional SOM, now called 1D ANNaBell, reported by Langin [16 and 17]. This 1D ANNaBell has discovered numerous real life instances of malicious network traffic, being the first self-trained computational intelligence to find feral malware, as far as the authors know, on March 29, 2008, and is still in production after more than two years.

The 1D ANNaBell was only conceptually a map because there was very little visually, just a jagged line, to observe. A particular jag in the line represented the BMN only for the IP addresses of two local computers infected with a certain kind of bot. This node was used to find additional infected computers. The SOM "hound dog" had the scent, but it was not clear to technicians what the scent was; thus, the "black box" effect.

So 1D ANNaBell was redesigned, using the same data, as a hexagonal map with the intent of producing something visual which would aid technicians in understanding the SOM process. Some of the methodology, using grey scale, for this hexagonal ANNaBell was described by Langin [18 and 19]. This paper continues the methodology by showing how colorization influenced the map, and this paper also shows the map as a 3D "island". Look ahead to Fig. 14 for the full color map and Fig. 23 for the 3D island to see where this is leading.

The source data for ANNaBell Island is from firewall logs and is in the form of a six dimensional vector for each local IP address. These are the pertinent features, given here as a reference for the rest of this paper:

1. tot_norm: Total normalized. The total number of log entries in a 24 hour period, normalized. The lowest number of entries in the source data for a local IP address was 0 and the highest number was 2,020,349. These counts were normalized to a range of 0 to 1.
2. src_rat: Source ratio. The ratio of unique source (external) IP addresses to the total number of log entries.
3. port_rat: Port ratio. The ratio of unique destination (local) ports to the total number of log entries.
4. lo_norm: Lowest port normalized. The lowest attempted destination (local) port, normalized from 0 to 1, with the lowest possible port being 0 and the highest possible port being 65,535.
5. hi_norm: Highest port normalized. The highest attempted destination (local) port, normalized from 0 to 1, with the lowest possible port being 0 and the highest possible port being 65,535.
6. udp_rat: UDP ratio. The ratio of UDP network traffic to all network traffic.

For example, a local IP address with 1,548 log entries in a 24-hour period, from 139 external IP addresses, directed at 58 local ports, from Port 22 to Port 61,123, with 1,345 of the log entries being for UDP traffic would have a vector of 0.000766204, 0.089793282, 0.0374677, 0.000335698,
Figure 2, U-matrix
Figure 3, Histogram
Figure 4, Acronyms
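The six feature definitions given in Section III can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the class `DailyCounts`, its field names, and the function `feature_vector` are hypothetical, udp_rat is assumed to be the number of UDP log entries divided by the total number of log entries (which is how the worked example is phrased), and the text does not say how an address with zero log entries is handled, so the sketch simply rejects that case.

```python
from dataclasses import dataclass
from typing import List

# Constants taken from the feature definitions in the text.
MAX_LOG_ENTRIES = 2_020_349  # highest 24-hour log-entry count in the source data
MAX_PORT = 65_535            # highest possible port number


@dataclass
class DailyCounts:
    """Hypothetical 24-hour firewall-log summary for one local IP address."""
    total_entries: int   # total number of log entries
    unique_sources: int  # unique source (external) IP addresses
    unique_ports: int    # unique destination (local) ports
    lowest_port: int     # lowest attempted destination port
    highest_port: int    # highest attempted destination port
    udp_entries: int     # log entries for UDP traffic (assumed basis for udp_rat)


def feature_vector(c: DailyCounts) -> List[float]:
    """Return [tot_norm, src_rat, port_rat, lo_norm, hi_norm, udp_rat]."""
    if c.total_entries <= 0:
        # Assumption: the ratios are undefined without log entries,
        # and the text does not describe this case.
        raise ValueError("total_entries must be positive")
    return [
        c.total_entries / MAX_LOG_ENTRIES,   # tot_norm (minimum count is 0)
        c.unique_sources / c.total_entries,  # src_rat
        c.unique_ports / c.total_entries,    # port_rat
        c.lowest_port / MAX_PORT,            # lo_norm
        c.highest_port / MAX_PORT,           # hi_norm
        c.udp_entries / c.total_entries,     # udp_rat
    ]


# The worked example from the text: 1,548 entries, 139 external addresses,
# 58 local ports from Port 22 to Port 61,123, and 1,345 UDP entries.
example = DailyCounts(1548, 139, 58, 22, 61123, 1345)
print([round(x, 9) for x in feature_vector(example)])
```

Under these assumptions, the first four components agree with the values quoted in the text to the precision shown there.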
http://sites.google.com/site/ijcsis/ ISSN 1947-5500