Flow-based Anomaly Detection Using Access Behavior Profiling and Time-sequenced Relation Mining

  • Liu, Weixin (Information Security Center, Beijing University of Posts and Telecommunications) ;
  • Zheng, Kangfeng (Information Security Center, Beijing University of Posts and Telecommunications) ;
  • Wu, Bin (Information Security Center, Beijing University of Posts and Telecommunications) ;
  • Wu, Chunhua (Information Security Center, Beijing University of Posts and Telecommunications) ;
  • Niu, Xinxin (Information Security Center, Beijing University of Posts and Telecommunications)
  • Received : 2015.12.20
  • Accepted : 2016.05.16
  • Published : 2016.06.30

Abstract

Emerging attacks, such as Operation Aurora and Operation Shady RAT, aim to access proprietary assets and steal data for business or political motives. Skilled intruders are likely to remove their traces on targeted hosts, but they cannot easily erase their network movements, which are continuously recorded by network devices. However, without complete knowledge of inbound/outbound and internal traffic, it is difficult for a security team to unveil the hidden traces of intruders. In this paper, we propose an autonomous anomaly detection system based on behavior profiling and relation mining. The single-hop access profiling model employs a novel linear grouping algorithm, PSOLGA, to create behavior profiles for each individual server application discovered automatically in historical flow analysis. In addition, the double-hop access relation model uses an in-memory graph to mine time-sequenced access relations between different server applications. Using the behavior profiles and relation rules, this approach is able to detect possible anomalies and violations in real time. Finally, the experimental results demonstrate that the designed models are promising in terms of accuracy and computational efficiency.

1. Introduction

Company networks and their complexity have evolved rapidly over the past decade, and emerging threats raise global concern about security. Security analysts are uncovering more and more sophisticated, multi-stage attacks, such as Operation Aurora[1] and Operation Shady RAT[2], which are also known as Advanced Persistent Threats (APT). Intruders use zero-day vulnerabilities and social engineering techniques to carry out a series of stealthy and continuous hacking processes. However, traditional security systems, which mostly depend on static signatures, struggle to keep up with the changing threat landscape and dynamic environment. In addition, because they lack knowledge of internal network behavior patterns, intrusion detection systems are incapable of distinguishing malicious insiders from benign ones.

Therefore, it is necessary for intrusion detection systems to keep track of overall network activities and create profiles for network applications. Network flow analysis, which by nature works on aggregated information, is a scalable approach to passive network monitoring and behavior analysis in high-speed networks [3]. Flow-based behavior profiling is a promising approach to detect traffic anomalies and protect networks from unknown exploits [4]. Thus, this research proposes an autonomous flow-based network monitoring system capable of identifying the normal behavior of network applications and detecting anomalies in an enterprise network.

This research is carried out entirely through the analysis of flow features. The proposed system is divided into three parts: autonomous discovery of server applications, access behavior profiling and access relation mining. In the first phase, application discovery, we extract source and destination ip/port to distinguish clients from server applications, and convert flows into access flows towards server applications. Based on access flows, six quantitative features (bits and packets in two directions, flow duration, and number of flows between client and server application in the analysis interval) and two tag features (flow direction and occurrence period) are selected to create an access behavior profile for each individual application. Many experts today believe that the disregarded relationships between applications are the major weak spot abused by attackers to compromise systems[5]; thus we generate relation rules from frequently related access behavior of different server applications. Based on the behavior profiles and relation rules, this approach is able to detect deviation from historical network behavior.

The key contributions of this article are as follows. First, we provide an applicable method to discover active server applications without pre-defined knowledge, and create behavior profiles for each server application by applying a novel linear grouping algorithm, PSOLGA. PSOLGA is a PSO-based [6] clustering algorithm with better grouping stability and time complexity than LGA [7]. In addition, we use an in-memory graph model to establish anomaly detection rules from time-dependent access flows, such as clients→web server→database. To evaluate the proposed system, a variety of tests are performed using simulation data and real-world data from an enterprise network.

The remainder of this paper is organized as follows: Section 2 reviews prior literature related to flow analysis and anomaly detection. Methodology and major algorithms are given in Section 3. Section 4 explains the two access models and the corresponding anomaly detection approaches. The results of the evaluation are presented in Section 5. Finally, Section 6 concludes this article.

 

2. Related Work

Flow analysis is broadly used in large-scale network behavior profiling[8-10], QoS optimization[11-13] and intrusion detection[14,15]. As shown in [16], netflow is utilized to profile block-level network activities, as well as to track and quantify changes in blocks. Gilberto[8] introduces a profile-based anomaly detection system, PCADS-AD, which is able to detect DDoS and Flash Crowds by using PCA. In [10], the authors developed a flow-based anomaly detector by using an ANN-based classifier and selective sampling.

Clustering is often used as an unsupervised technique to discover traffic patterns [17]. A network access control mechanism based on X-means[18] and majority voting was studied by Frias-Martinez. As an important evolutionary computation technique, PSO[6] is applied to many subject areas such as clustering optimization and multiobjective optimization. Li et al. [19] uncovered the host members of a botnet in an organizational network by using a combination of PSO and K-means. Although the K-means algorithm may be deemed the most important flat clustering algorithm due to its simplicity, it has several drawbacks, such as the inability to deal with non-spherical data. In our practical experience, the network behavior of many server applications is distributed in linear structures, thus K-means may not be the best choice of clustering algorithm. The Linear Grouping Algorithm (LGA), first introduced by Van Aelst[7] in 2006, can be useful for investigating subsets that follow different linear relationships in data sets. Garcia[20,21] introduced Robust LGA in 2009 to obtain better grouping results in the presence of outliers, by optimizing LGA through a trimming methodology. In our approach, we build behavior profiles from historical normal access flows, where no observation should be considered an outlier, thus trimming is not applicable.

Regarding event correlation and self-learning system models, Friedberg et al.[22] aimed to detect anomalies by generating rules from log information. As the authors assume no prior knowledge about the structure of event logs, all different combinations of log atoms have to be extracted to cover the potential hypotheses H, and additional refinement is required to drop the trivial ones.

The proposed approach differs from the abovementioned literature in its automatic discovery of server applications. In addition, while previous research focused on detecting major changes and anomalies in overall network behavior, such as DDoS, scan and Trojan activities[8,9,17], our approach is able to build profiles for different server applications through passive monitoring and to detect deviation from the historical behavior of a specific server application, such as illegal data dumps and malicious insider activities. Moreover, this paper contributes by using PSOLGA to find the best groups of access behavior towards an individual server application. We are the first to apply PSO in a linear grouping algorithm, improving the grouping stability and time complexity. We then use the grouping result to generate anomaly alarms in real-time detection.

In addition, compared to [22], our rule generation for correlated access flows avoids additional rule refinement thanks to the clear structure of flow attributes and the application of the in-memory graph model. Furthermore, in Friedberg's approach, only the situation in which the condition event does not trigger the implication event is considered an anomaly. That approach does not consider the situation in which the implication event occurs without the preceding condition event, which should be included in anomaly detection for completeness. Thus, we introduce two different conditional probabilities to evaluate both of these situations.

 

3. Methodology and Background

As shown in Fig. 1, we collect netflow data from all switches and routers of a small enterprise network, which is divided into a DMZ zone and an intranet zone. Two web servers, F and G, are deployed in the DMZ zone and are exposed to external users. In the intranet zone, there are two different databases (Mysql and Redis) for the web servers (F and G), two HDFS (Hadoop Distributed File System) nodes for storing flow records, and other internal servers. Two groups of external users are used for evaluation, one of which is labeled as normal users (A, B, C) and the other as attackers (D and E). The flow dataset was recorded over 140 hours; the first 136 hours are used for the training phase, during which no attacks are injected. Attackers launch their attacks from the 137th hour. Our system model is presented in Fig. 2. Historical flow traces are automatically converted to access flows and ingested by the two models: the single-hop access profiling model and the double-hop access relation model. The single-hop model extracts the access behavior profile of each specific server application from historical access flows, while the double-hop model generates relation rules between access flows towards different server applications. Both the behavior profiles and the relation rules are applied in real-time anomaly detection afterwards.

Fig. 1. Evaluation environment

Fig. 2. System model

3.1 Automatic server discovery and access flows

Few studies have paid attention to how to identify servers in flows, although routers and switches mostly export unidirectional flows, such as Netflow. Some efforts have been made to identify client/server roles based on packet-level analysis[23], or to convert unidirectional flows to bidirectional flows[24]. In order to adapt to any flow export protocol, we develop a methodology for converting input flows to bidirectional flows towards the server side, without packet-level attributes or pre-defined server ports. As different configurations of the flow caching timeout may break up long-lived flows into fragments of different duration, we merge input raw flows into flow_feature_setT to ensure that source/destination keys are unique in each analysis interval T. flow_feature_setT is the aggregated form of input flows, which can be derived from any type of flow protocol. pkt_cnt_in, byte_cnt_in, pkt_cnt_out and byte_cnt_out are the sums of the corresponding attributes over all associated packets with the same specific key (proto, ip_src, port_src, ip_dst, port_dst) in the two directions. start_time is the minimum start time and end_time is the maximum end time for packets in a flow.

Definition 1. flow_feature_setT = {key: (proto, ip_src, port_src, ip_dst, port_dst), value: (port_dst, pkt_cnt_in, byte_cnt_in, pkt_cnt_out, byte_cnt_out, start_time, end_time)}

Definition 2. OED[ip, port] = #(unique pairs of ip_src_i and port_src_i) + #(unique pairs of ip_dst_j and port_dst_j), where i, j ∈ N, ip = ip_dst_i = ip_src_j, port = port_dst_i = port_src_j, and N is the total count of flow_feature_setT records.

OED[ip,port] denotes the opposite-end divergence of a specific pair of ip and port. Under the assumption that "clients always appear with multiple IPs and random source ports, while servers mostly use a unique set of IPs and listening ports", the opposite-end divergence of a server-side ip/port pair is likely to be larger than that of a client-side pair, which is also confirmed by the real-world network traces we captured. Thus, we are able to distinguish servers from clients in historical flow data by comparing the opposite-end divergence of the ip_src/port_src pair with that of the ip_dst/port_dst pair.
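The comparison of opposite-end divergence can be illustrated with a minimal Python sketch. It assumes flows are given as plain (ip_src, port_src, ip_dst, port_dst) tuples; the function names and the tie-breaking rule (prefer the destination side) are illustrative assumptions, not part of the original method description.

```python
from collections import defaultdict

def opposite_end_divergence(flows):
    """Compute OED for every (ip, port) pair, following Definition 2.

    flows: iterable of (ip_src, port_src, ip_dst, port_dst) tuples.
    """
    seen_as_dst = defaultdict(set)  # endpoint -> distinct sources that contacted it
    seen_as_src = defaultdict(set)  # endpoint -> distinct destinations it contacted
    for ip_src, port_src, ip_dst, port_dst in flows:
        seen_as_dst[(ip_dst, port_dst)].add((ip_src, port_src))
        seen_as_src[(ip_src, port_src)].add((ip_dst, port_dst))
    endpoints = set(seen_as_dst) | set(seen_as_src)
    return {
        e: len(seen_as_dst.get(e, ())) + len(seen_as_src.get(e, ()))
        for e in endpoints
    }

def server_side(flow, oed):
    """Return the endpoint of a flow with the larger OED.

    Ties go to the destination side (an assumption made for this sketch).
    """
    ip_src, port_src, ip_dst, port_dst = flow
    src, dst = (ip_src, port_src), (ip_dst, port_dst)
    return src if oed[src] > oed[dst] else dst
```

With three clients contacting one listening port, the listening endpoint collects the largest divergence and is picked as the server side even for a reply flow in which it appears as the source.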

Definition 3. access_flow_feature_setT= {key: (proto, ip_server, port_server, ip_client, tinterval), value: (pkt_cnt_to, byte_cnt_to, pkt_cnt_from, byte_cnt_from, flowscount, start_timemin, end_timemax)}

Based on the auto-discovered servers, we aggregate flow_feature_setT into bidirectional access flows access_flow_feature_setT from client towards server within a certain interval T. We merge flows with the same key (proto, ip_server, port_server, ip_client, tinterval), where tinterval = start_time/T denotes the time slot in which the flow occurs. pkt_cnt_to and byte_cnt_to are #packets and #bytes towards the server side of a specific key, and pkt_cnt_from and byte_cnt_from are those for the opposite direction. start_timemin and end_timemax are the minimum start_time and the maximum end_time within tinterval. flowscount is #flows from the client end to the server end within tinterval.
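The aggregation into access flows can be sketched as below. This is a simplified illustration that assumes each input record is a unidirectional flow stored as a dict with the hypothetical fields proto, ip_src, port_src, ip_dst, port_dst, pkts, bytes, start_time and end_time, and that the set of server-side (ip, port) pairs has already been discovered.

```python
def to_access_flows(flows, servers, T):
    """Merge unidirectional flows into access flows keyed as in Definition 3.

    flows:   iterable of dicts (field names are assumptions of this sketch)
    servers: set of (ip, port) pairs identified as server-side
    T:       length of the analysis interval, in the unit of start_time
    """
    access = {}
    for f in flows:
        src = (f["ip_src"], f["port_src"])
        dst = (f["ip_dst"], f["port_dst"])
        if dst in servers:          # client -> server
            server, ip_client, to_server = dst, f["ip_src"], True
        elif src in servers:        # server -> client (reply direction)
            server, ip_client, to_server = src, f["ip_dst"], False
        else:
            continue                # no known server on either end
        t_interval = int(f["start_time"] // T)
        key = (f["proto"], server[0], server[1], ip_client, t_interval)
        rec = access.setdefault(key, {
            "pkt_cnt_to": 0, "byte_cnt_to": 0,
            "pkt_cnt_from": 0, "byte_cnt_from": 0,
            "flowscount": 0,
            "start_time_min": f["start_time"],
            "end_time_max": f["end_time"],
        })
        side = "to" if to_server else "from"
        rec["pkt_cnt_" + side] += f["pkts"]
        rec["byte_cnt_" + side] += f["bytes"]
        if to_server:
            rec["flowscount"] += 1  # only client-to-server flows are counted
        rec["start_time_min"] = min(rec["start_time_min"], f["start_time"])
        rec["end_time_max"] = max(rec["end_time_max"], f["end_time"])
    return access
```

Client ports do not appear in the key, so several connections from the same client ip within one interval collapse into a single access flow.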

Euclidean distance is the most common choice in clustering flow behavior[19,25,26], as authors usually assume that data is distributed in spherical structures. However, we have discovered that most server-side applications constrain the size or type of the content they return for different requests, so that linear structures dominate our feature space in flow monitoring. For example, web servers return limited textual content, hypertext or multimedia content when normal clients request online articles, web links or personal photos. Clients commonly establish a limited number of flows towards servers within a certain interval. Fig. 3 shows the multi-dimensional visualization of the normal access behavior of two different server applications (HDFS control message exchange and web server) as a scatterplot matrix over each pair of dimensions. The access flows clearly follow a certain set of linear grouping structures. Thus we are inspired to use a linear grouping algorithm to cluster the behavioral features of access profiles.

Fig. 3. Scatter plots of HDFS control exchange (left) and web server (right)

3.2 PSOLGA algorithm

LGA combines ideas from principal components, clustering methods and resampling algorithms, with the objective of finding the grouping result with the minimal sum of squared regression residuals (ROSS). A squared regression residual is the squared distance between a point and its associated hyperplane, measuring how far the point lies from this hyperplane. Resampling is the key for LGA to search for the best grouping result, and it needs to take enough starting samples. Van Aelst offered a function to calculate m[7] as the minimal number of starting values, which is sometimes insufficient to guarantee the fittest result.
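To make the fitness measure concrete, the following minimal sketch computes the sum of squared orthogonal residuals for 2-dimensional groups. It assumes each group is fitted by its own orthogonal-regression line, for which the residual sum equals the smallest eigenvalue of the group's scatter matrix; the function names are illustrative and the restriction to two dimensions is a simplification.

```python
import math

def orthogonal_rss(points):
    """Sum of squared orthogonal distances from 2-D points to their best-fit line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Smallest eigenvalue of the 2x2 scatter matrix [[sxx, sxy], [sxy, syy]].
    return (sxx + syy - math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)) / 2

def ross(groups):
    """Fitness of a grouping: total squared orthogonal residual over all groups."""
    return sum(orthogonal_rss(group) for group in groups)
```

A grouping that places each linear structure in its own group yields a smaller value than one that mixes points from different lines, which is what the resampling step searches for.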

Particle swarm optimization (PSO)[6] is a computational method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. It is broadly used in optimizing classification[27] and clustering approaches[28,29]. In this section, we propose a novel linear grouping method, PSOLGA, which combines PSO and LGA and is able to optimize the resampling process in LGA and output more stable grouping results.

The fundamental expression of PSO is as follows:
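For orientation, the canonical PSO update is commonly written in the inertia-weight form below. The symbols $\omega$ (inertia weight), $c_1, c_2$ (acceleration coefficients), $r_1, r_2$ (uniform random numbers in $[0,1]$), $P_i$ (best position found by particle $i$) and $P_g$ (best position found by the swarm) follow the usual convention and are assumptions here; the paper's own Eq.(1-2) may use a different parameterization.

```latex
\begin{aligned}
V_i(t+1) &= \omega V_i(t) + c_1 r_1 \bigl(P_i - X_i(t)\bigr) + c_2 r_2 \bigl(P_g - X_i(t)\bigr) \\
X_i(t+1) &= X_i(t) + V_i(t+1)
\end{aligned}
```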

LGA has five major steps: scaling, generation of the starting values, initialization of the groups, iterative refinement and resampling[7]. Considering a data set of size n in d dimensions, LGA needs to generate several starting sample groups, each of which contains k×d points (k is the desired number of clusters). We consider each starting sample group as a particle in k×d×d dimensions. Xi(t) denotes the starting position in each swarm iteration. The Location_LGA_Iteration function is introduced into fundamental PSO, which means that particles fly in both the location iteration and the global swarm iteration. For each particle in Location_LGA_Iteration, initialization of the groups and iterative refinement are first completed as in LGA, and then new samples of d-subsets are taken from each of the local final k groups and assembled into XEi(t), the ending position in each location iteration. Taking samples of d-subsets from a specific output group speeds up convergence towards the fittest hyperplane, as these points have already been assigned to the same linear group. We modify Eq.(1-2) into Eq.(3-5):

As shown in Fig. 4, PSOLGA proceeds as follows:

Fig. 4. PSOLGA workflow

Step 1. Scaling of the variables. Let xj be one of the total n observations, and let xj[l] denote the value of xj in the l-th dimension, where j∈[1,n], l∈[1,d]. The scaled value of xj[l] is calculated as follows:

x'j[l] = (xj[l] - xj[l]min) / (xj[l]max - xj[l]min)

Step 2. Swarm initialization. The initial swarm consists of m' particles, each of which is generated by randomly selecting k exclusive subsets of d points (d-subsets[7]). Each particle is a vector in k×d×d dimensions.

Xi(t) = (z(t)i1 , z(t)i2 ,..., z(t)ik), t ∈ [1, Iterationmax], i ∈ [1, m']

z(t)im = (x1, x2,..., xd), m ∈ [1, k]

Step 3. Global PSO iteration. For each particle in each global iteration, the Location_Iteration function is first applied to get the grouping result and the ending position XEi(t) of the specific particle. We take the ROSS of this grouping result as the fitness value of the current iteration and update the new starting position of the particle according to Eq.(3-5). Global iterations continue until the best fitness value has stayed unchanged for Stagemax iterations or the total number of iterations has exceeded Iterationmax. If the position of a particle exceeds the range [Xmin,Xmax], its position is reassigned to Xmin or Xmax. Similarly, if the velocity of a particle exceeds the range [Vmin,Vmax], its velocity is reassigned to Vmin or Vmax. Xmin, Xmax, Vmin and Vmax are derived from the data scope, while Stagemax and Iterationmax are set to limit the computational time.

Step 4. Output grouping result. Output the grouping result with the best fitness value, along with the hyperplane coefficients.
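The position and velocity clamping of Step 3 can be sketched as follows. Because the modified update equations Eq.(3-5) are not reproduced here, the sketch falls back on the standard inertia-weight PSO update as a stand-in; the parameters w, c1, c2, r1, r2 and the function names are assumptions, and the random factors are passed in explicitly so that the step is reproducible.

```python
def clamp(value, low, high):
    """Reassign an out-of-range value to the nearest bound."""
    return max(low, min(high, value))

def pso_step(x, v, pbest, gbest, w, c1, c2, r1, r2, x_bounds, v_bounds):
    """One standard PSO update for a single particle, with clamping.

    x, v, pbest, gbest: lists of equal length (one entry per dimension)
    x_bounds, v_bounds: (min, max) tuples, i.e. (Xmin, Xmax) and (Vmin, Vmax)
    """
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        vel = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)
        vel = clamp(vel, *v_bounds)       # velocity is clamped first
        pos = clamp(xi + vel, *x_bounds)  # then the resulting position
        new_v.append(vel)
        new_x.append(pos)
    return new_x, new_v
```

In a full run this step would be repeated for every particle until the best fitness value stalls for Stagemax iterations or Iterationmax is reached.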

 

4. Access Models and Anomaly Detection

Anomalies in access flows can be divided into two types: deviation from historical behavior and violation of the access time sequence. The single-hop access profiling model offers a method to model direct access towards server applications and to detect anomalies through outlier-based and tag-based approaches. However, intruders may act like normal clients without any deviation in flow-level attributes. For example, in a common portal environment, the web server normally accesses the database server after a client requests personal information or uploads some specific data, which can be expressed as the time-sequenced pattern (clients→web server→database). Either the web server accessing the database server with no previous client request for its service, or no following access to the database server from the web server when clients request their personal data, should be considered abnormal. Thus, the double-hop access relation model is a reasonable complement to the single-hop access profiling model. Both models and their detection approaches are detailed in the following sub-sections.

4.1 Single-Hop Access Profiling Model

We formulate six features from access flows: BYTE_CNT_TO, PKT_CNT_TO, BYTE_CNT_FROM, PKT_CNT_FROM, FLOWSCOUNT and DURATION. The first five features are directly derived from the corresponding attributes of the access_flow_feature_setT records with the same key in a specific interval tinterval. DURATION = end_timemax - start_timemin is the time difference between start_timemin and end_timemax. We disregard client ports when analysing access flows, as they are essentially random and carry no useful information; thus a client is denoted only by its ip address, while a server is denoted by its listening ip and listening port. We remove ip_client and tinterval from the key of access_flow_feature_setT, as we profile the access behavior towards a server application for all clients. The form of merged access profiles is shown in Definition 4.

Definition 4. merged_access_profile = {key: (proto, ip_server, port_server), value: {Features: (BYTE_CNT_TO, PKT_CNT_TO, BYTE_CNT_FROM, PKT_CNT_FROM, FLOWSCOUNT, DURATION) , Tags :(DIRECTION, TIME)}}

DIRECTION is a 4-bit binary value, each bit of which denotes a different direction of access flows: external→intra (0001), intra→external (0010), external→external (0100) and intra→intra (1000). An intra ip belongs to the intranet or to the public ip addresses owned by the corporation, while an external ip is any other address. TIME is a t-bit binary value, each bit of which denotes a different time slot of the 24-hour day. For example, if we split 24 hours into 4 time slots (t=4), then 0001 means that the specific profile occurs in 0:00-6:00.

After feature formulation of the training flow data, the merged_access_profiles are stored, together with the OED map holding the opposite-end divergence of ip/port pairs, which will be used in later real-time identification. Each unique key of merged_access_profiles is considered a specific server application, while the value shows the flow behavior of a certain client request in the analysis interval T. Clients of different server applications may exhibit dissimilar flow patterns. Thus, we employ PSOLGA to obtain the access flow behavior of distinct server applications, by analysing the observations of merged_access_profile for each distinct key (proto, ip_server, port_server).

After PSOLGA clustering, results for each key (proto, ip_server, port_server) are obtained, as shown in Definition 5. cluster[key] consists of k clusters, each of which is described by four elements and two tags. hyperplanei is the orthogonal hyperplane of the ith linear group. residuali is the maximal absolute value of the orthogonal residuals of the ith cluster. centeri is the average of the observations associated with the ith cluster. radiusi is the maximal Euclidean distance between the intra-cluster points pic and centeri.

Definition 5. Clustering result: cluster[key] = {(hyperplanei, residuali, centeri, radiusi, Directioni, Timei) | i ∈ [1, k]}. hyperplanei = {(wji, ei) | j ∈ [1, d]}, where d is the dimension of features, ei is the orthogonal residual for the ith hyperplane and wji is the jth coefficient. residuali = max(|wi·pic + ei|), where pic are the scaled observations assigned to this cluster and c ∈ [1, number of observations associated with the ith cluster of cluster[key]]. centeri = {avg(pic[j]) | j ∈ [1, d]}. radiusi = max(distanceEuclidean(pic, centeri)). Directioni and Timei are tags of the ith cluster of key, which describe the direction and time of occurrence of the specific cluster.

Both Directioni and Timei are derived by applying OR operations to the corresponding tags of every single observation associated with the specific cluster. For example, Directioni is 1001 when an intra server application is accessed by both external and intra clients. Similarly, if we split 24 hours into 4 time slots, then 0101 means that the behavior in the specific cluster occurs in both 0:00-6:00 and 12:00-18:00.
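The tag encoding and the OR-merge can be illustrated with a short sketch. The bit values follow the DIRECTION and TIME encodings described above; the function names and the zone labels "external" and "intra" are illustrative assumptions.

```python
# Bit assigned to each (client zone, server zone) direction of an access flow.
DIRECTION_BITS = {
    ("external", "intra"): 0b0001,
    ("intra", "external"): 0b0010,
    ("external", "external"): 0b0100,
    ("intra", "intra"): 0b1000,
}

def direction_tag(client_zone, server_zone):
    """DIRECTION tag of a single observation."""
    return DIRECTION_BITS[(client_zone, server_zone)]

def time_tag(hour, slots=4):
    """TIME tag of a single observation: bit 0 is the first slot of the day."""
    slot = int(hour) * slots // 24
    return 1 << slot

def merge_tags(tags):
    """Cluster-level tag: bitwise OR over the tags of all member observations."""
    merged = 0
    for tag in tags:
        merged |= tag
    return merged
```

For instance, merging the direction tags of an external client and an intra client accessing the same intra server gives 1001, and merging observations at 03:00 and 13:00 with four slots gives 0101.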

4.2 Single-hop Access Anomaly Detection

Four steps are taken to analyse incoming flow batches, using the clustering results from the single-hop access profiling model:

Step 1. Flow merging. Servers and clients are identified by using the previously stored OED map. Incoming flow batches are then merged into the form of merged_access_profileT, along with the Direction tag and the Time tag.

Step 2. Normalization. Based on the maximal and minimal values in each dimension of the training data, incoming merged_access_profiles are projected onto the training feature space. yj is the jth observation in the incoming merged_access_profiles, xtraining[l]min is the minimal value of the training data in the l-th dimension, and xtraining[l]max is the maximal value.

y'j[l] = (yj[l] - xtraining[l]min) / (xtraining[l]max - xtraining[l]min)

Step 3. Outlier-based anomaly detection. Two different distances are used to determine whether the incoming merged_access_profile is anomalous. The incoming profile yj is assigned to the closest cluster m of cluster[key of yj], that is, the cluster whose hyperplanem has the minimal orthogonal distance to yj. The incoming profile yj is considered anomalous if either the orthogonal residual of yj with respect to hyperplanem is larger than the associated residualm, or the Euclidean distance to centerm is larger than the associated radiusm.

Step 4. Tag-based anomaly detection. We derive the DIRECTION and TIME tags of the incoming profile from the ownership of the associated ip addresses and the occurrence time of the access flows. Each tag is then combined by an OR operation with Directioni or Timei, where clusteri is the cluster that the profile is assigned to. If the result differs from the corresponding tag of clusteri, the incoming profile is also regarded as abnormal. For example, suppose the Directionk tag of clusterk is 0001, which means that the historical clients accessing the specific server application in this pattern are external users. If an incoming profile assigned to clusterk carries the Direction tag 1000, it will be considered an anomaly, which may be caused by internal fake-ip attacks or modifications of firewall rules.
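Steps 2 to 4 can be sketched together as below. The sketch assumes that each cluster is stored as a dict with a unit-norm coefficient vector "w", an offset "e", and the "residual", "center", "radius", "direction" and "time" values of Definition 5; this layout, the function names and the handling of constant dimensions are assumptions made for illustration.

```python
import math

def project(y, lows, highs):
    """Step 2: min-max scale an observation with the training minima/maxima."""
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0  # constant dimension: assumed 0.0
        for v, lo, hi in zip(y, lows, highs)
    ]

def orthogonal_distance(w, e, y):
    """Distance from y to the hyperplane w.y + e = 0 (w assumed unit-norm)."""
    return abs(sum(wj * yj for wj, yj in zip(w, y)) + e)

def detect(y, direction, time, clusters, lows, highs):
    """Return the reasons an incoming profile is anomalous (empty list if normal)."""
    y = project(y, lows, highs)
    # Step 3: assign to the cluster whose hyperplane is closest.
    c = min(clusters, key=lambda cl: orthogonal_distance(cl["w"], cl["e"], y))
    reasons = []
    if orthogonal_distance(c["w"], c["e"], y) > c["residual"]:
        reasons.append("residual")
    if math.dist(y, c["center"]) > c["radius"]:
        reasons.append("radius")
    # Step 4: a tag is violated when OR-ing it in changes the cluster tag.
    if (direction | c["direction"]) != c["direction"]:
        reasons.append("direction")
    if (time | c["time"]) != c["time"]:
        reasons.append("time")
    return reasons
```

Returning the list of violated checks rather than a single boolean is a design choice of this sketch; it keeps the outlier-based and tag-based alarms distinguishable.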

4.3 Double-Hop Access Relation Model

In this section, we propose an in-memory graph model to extract time-sequenced correlations from access flows without repeatedly scanning the training dataset. The in-memory graph model M and the double-hop access correlation rules DHA_RULE are formalized in Definitions 6-8. F is the result of merging access_flow_feature_setT records with the same protocol, server ip address, server port, client ip address and occurrence time slot. tiinterval denotes the ith time slot [i×T : (i+1)×T] in which F occurs. start_timeimin and start_timeimax are the minimal and maximal start times of access flows in tiinterval. F is later used in rule generation. Server = (proto, ip_server, port_server). Serverpre is the connected server application in pre order, denoted by its protocol, listening ip address and listening port. Serverpost is the connected server application in post order. The precedent access flow fpre is the access flow towards Serverpre from any client ip address in a certain interval tinterval. The posterior access flow fpost is the access flow towards Serverpost from ip_serverpre in the same tinterval.

Definition 6. F= {key: (proto, ip_server, port_server, ip_client), value: (tiinterval, start_timemin, start_timemax)}, fpre, fpost ∈F

Definition 7. In-memory graph model M={Vip, Vserver, Etimespan}

• Vip={Vcip, Vsip| vcip=(ip, type), vsip=(ip, type)} are the vertices representing either a client ip address or a server ip address. type=(Client| Server| Client and Server) denotes the role of the corresponding ip address in access flows.

• Vserver={vserver| vserver=(proto, ip_server, port_server, tr_table)} represents server applications. tr_table={(tinterval, start_timemin, start_timemax)} collects all access time records towards the specific server application.

• Etimespan={Es, Ec| es = (vserver→ vsip), ec=(vcip→ vserver, tr_table)} are the edges connecting Vip and Vserver. Es connects Vserver and Vsip with the same ip_server, and no value is attached. Ec connects Vcip and Vserver, representing a unique access pair between a server application (proto, ip_server, port_server) and a specific client ip address (ip_client); its tr_table collects all time records of vcip accessing vserver.

Definition 8. DHA_RULE= {(Serverpre→Serverpost, Probpre, Probpost, Cntpre, Cntpost, Cntco) } is the form of double-hop access rules.

• Serverpre is the server application in fpre. Serverpost is the server application in fpost.

• Cntpre is the distinct count of tinterval, during each of which fpre occurs.

• Cntpost is the distinct count of tinterval, during each of which fpost occurs.

• Cntco is the distinct count of tinterval, during each of which Serverpost is accessed by ip_serverpre after Serverpre has been accessed by any client.

• Probpre= Cntco / Cntpre, is the probability that fpost occurs after the first fpre in the same analysis interval tinterval.

• Probpost= Cntco / Cntpost, is the probability that fpost occurs and at least one fpre occurs before fpost in the same analysis interval tinterval.

• Probpre, Probpost>THprob, Cntpre, Cntpost>THcnt. (6)

• THprob and THcnt are filtering thresholds, used to retain only rules of strong confidence.

Three major steps to generate DHA_RULEs are described as follows:

Step 1. Merging. Merge the access_flow_feature_setT records within all unique analysis intervals tinterval into F. The resulting records F are then used to initialize M.

Step 2. Initialization of model M and graph computing. Insert all unique ip addresses in the keys of F into Vip, and update the type of each Vip based on whether it is a client ip, a server ip, or both. Insert every unique triple (proto, ip_server, port_server) into Vserver, and insert or update the tr_table of the specific vserver. Similarly, Es and Ec are inserted into M. After initialization, connections and time records are used to extract rules. The in-edges of a vsip connect it to the vservers that represent its own server applications, while its out-edges connect it to the vservers from which vsip has requested service during the whole training period.

Step 3. Rule extraction. An inner join is applied to the associated time records of the precedent server vertex (vserver of Serverpre) and the posterior client edge (ec towards Serverpost) to calculate the number of co-occurrences. Only rules satisfying Eq.(6) are output.
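The counting behind Definition 8 can be sketched with plain dictionaries. This is not the graph-based algorithm of Table 1: it uses a naive nested loop instead of the in-memory graph, and it assumes that "fpost occurs after fpre" means that the latest fpost start time in an interval is not earlier than the earliest fpre start time in that interval. The record layout and function name are also assumptions.

```python
from collections import defaultdict

def extract_rules(records, th_prob, th_cnt):
    """Derive DHA_RULE-like tuples from merged access records.

    records: iterable of (server, ip_client, t_interval, start_min, start_max),
             where server = (proto, ip_server, port_server).
    Returns tuples (pre, post, prob_pre, prob_post, cnt_pre, cnt_post, cnt_co).
    """
    pre_times = defaultdict(dict)   # server -> {interval: earliest start, any client}
    post_times = defaultdict(dict)  # (ip_client, server) -> {interval: latest start}
    for server, ip_client, t, start_min, start_max in records:
        by_server = pre_times[server]
        by_server[t] = min(by_server.get(t, start_min), start_min)
        by_edge = post_times[(ip_client, server)]
        by_edge[t] = max(by_edge.get(t, start_max), start_max)

    rules = []
    for (ip_client, post), post_tr in post_times.items():
        for pre, pre_tr in pre_times.items():
            # The client of fpost must be the server ip of Serverpre.
            if pre[1] != ip_client or pre == post:
                continue
            cnt_pre, cnt_post = len(pre_tr), len(post_tr)
            # Inner join on the interval, keeping time-ordered co-occurrences.
            cnt_co = sum(
                1 for t, latest in post_tr.items()
                if t in pre_tr and latest >= pre_tr[t]
            )
            prob_pre, prob_post = cnt_co / cnt_pre, cnt_co / cnt_post
            if min(prob_pre, prob_post) > th_prob and min(cnt_pre, cnt_post) > th_cnt:
                rules.append((pre, post, prob_pre, prob_post, cnt_pre, cnt_post, cnt_co))
    return rules
```

With a web server that is accessed in three intervals and queries its database in two of them, the sketch yields one rule web→database with Probpre = 2/3 and Probpost = 1.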

The workflow for rule generation is shown in detail in Fig. 5, and Table 1 explains the DHA_RULE Generation Algorithm further. This algorithm benefits from the compact structure of graph models and from hash methods, which reduce its computational complexity to O(N+I×s×n×m). N is the record count of the training access flows. s is the number of vsip with type=(Client and Server). n is the maximum number of server ports associated with a server ip address. m is the maximum count of out-connected server ports for a specific client ip address. I is the distinct count of tinterval in the input access flows. As s, n and m are far less than N, the computational complexity can be approximated as O(N+I) in practice.

Fig. 5. Workflow for rule generation

Table 1. DHA_RULE Generation Algorithm

4.4 Double-hop access anomaly detection

The extracted DHA_RULEs are used in anomaly detection. Each rule has two evaluation streams: EVRpre and EVRpost. EVRpre is treated as a binomial trial of size SL in discrete time with the given probability Probpre, and EVRpost likewise with Probpost, as shown in Definition 9. SL is the length of both evaluation streams.

Definition 9. EVRpre~b(Probpre,SL), EVRpost~b(Probpost,SL)

When real-time access flows matching a specific DHA_RULE show up, a new evaluation value is appended to the corresponding evaluation stream, and only the latest SL values are kept in the stream. Evaluation values for the different situations are listed in Table 2. The evaluation values of both streams are set to 1 only when fpre and fpost both occur in the same time slot tinterval and fpost occurs after at least one fpre. If no fpost occurs after fpre, only evpre is set to 0 and no value is inserted into EVRpost. Similarly, only evpost is set to 0 when fpost occurs without any fpre. If neither fpost nor fpre shows up, no value is inserted into the evaluation streams.
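The stream update can be sketched with bounded deques. The class and parameter names are illustrative. One case is not spelled out in the text, namely fpre and fpost both occurring in a slot but fpost only before fpre; the sketch assumes that both streams then receive 0.

```python
from collections import deque

class EvaluationStreams:
    """Two bounded evaluation streams (EVRpre, EVRpost) for one DHA_RULE."""

    def __init__(self, stream_length):
        # deque(maxlen=SL) silently drops the oldest value when full.
        self.pre = deque(maxlen=stream_length)
        self.post = deque(maxlen=stream_length)

    def observe(self, pre_seen, post_seen, post_after_pre=False):
        """Record the outcome of one time slot.

        pre_seen / post_seen: whether fpre / fpost occurred in the slot
        post_after_pre: whether some fpost followed at least one fpre
        """
        if pre_seen and post_seen and post_after_pre:
            self.pre.append(1)
            self.post.append(1)
            return
        if pre_seen:
            self.pre.append(0)   # fpre was not followed by fpost
        if post_seen:
            self.post.append(0)  # fpost had no preceding fpre
```

Slots in which neither flow appears leave both streams untouched, so the streams advance only on relevant activity.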

Table 2. Evaluation values
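The update rules described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and parameter names are hypothetical, and the case where fpost appears only before fpre within a slot is treated like "no fpost after fpre", which is an assumption. A bounded `deque` keeps only the latest SL values.

```python
from collections import deque

SL = 10  # stream length; the value used later in Section 5.3


def new_streams(sl=SL):
    """Create empty EVRpre and EVRpost streams that keep only the latest sl values."""
    return deque(maxlen=sl), deque(maxlen=sl)


def update_streams(evr_pre, evr_post, pre_seen, post_seen, post_follows_pre=False):
    """Append the evaluation values for one time slot.

    pre_seen / post_seen: whether fpre / fpost occurred in the slot.
    post_follows_pre: whether some fpost occurred after at least one fpre.
    """
    if pre_seen and post_seen and post_follows_pre:
        evr_pre.append(1)   # rule observed as expected
        evr_post.append(1)
    elif pre_seen:
        evr_pre.append(0)   # fpre without a following fpost
    elif post_seen:
        evr_post.append(0)  # fpost without any fpre
    # neither flow seen: nothing is inserted


if __name__ == "__main__":
    evr_pre, evr_post = new_streams()
    update_streams(evr_pre, evr_post, True, True, True)
    update_streams(evr_pre, evr_post, False, True)
    print(list(evr_pre), list(evr_post))
```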

The anomaly value is defined in Definition 10. The cumulative probability distribution is used to decide whether an evaluation stream should be considered anomalous. AValue is the probability that an anomaly has happened. A stream is reported as a valid anomaly only when AValue reaches the threshold α. R is the specific DHA_RULE under evaluation. n is the count of positive evaluations in an evaluation stream. p is Probpre for EVRpre and Probpost for EVRpost. α is the anomaly detection threshold.

Definition 10. AValue(R) is obtained from the cumulative binomial distribution with parameters n, p and SL; AGap = AValue − α; isAnomalous(R) = (AGap ≥ 0).
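Definition 10 fixes AGap and the decision rule. The sketch below shows how such a test could be computed, under the assumption that AValue is the probability that a b(p, SL) variable exceeds the observed count n of positive evaluations, so that an unusually low n pushes AValue towards 1. That upper-tail form is an assumption made for illustration, not the paper's stated formula, and the function names are hypothetical.

```python
from math import comb


def binom_cdf(n, p, sl):
    """P(X <= n) for X ~ b(p, sl)."""
    return sum(comb(sl, k) * p**k * (1 - p) ** (sl - k) for k in range(n + 1))


def a_value(n, p, sl):
    """Assumed anomaly value: probability of seeing more than n positives."""
    return 1.0 - binom_cdf(n, p, sl)


def is_anomalous(n, p, sl, alpha=0.99):
    """Decision rule of Definition 10: AGap = AValue - alpha >= 0."""
    a_gap = a_value(n, p, sl) - alpha
    return a_gap >= 0


if __name__ == "__main__":
    # p = 0.73, SL = 10 and alpha = 0.99 are the values reported in Section 5.3
    for n in range(11):
        print(n, round(a_value(n, 0.73, 10), 4), is_anomalous(n, 0.73, 10))
```

Under this reading, a stream with few positive evaluations is flagged while a stream close to the expected count is not.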

 

5. Experiment Evaluation

5.1 Evaluation of PSOLGA algorithm

We use a synthetic data sample consisting of 4 groups of linearly distributed 2-dimensional points and the hockey data set (nhl194) to evaluate the improvement of PSOLGA over LGA. nhl194 contains information on the performance of players in the Canadian National Hockey League for the 94–95 competition [7]. Four features (PTS, P/M, PIM, PP) of nhl194 are considered, and the best group number for nhl194 is 3, as mentioned in [7]. Hence, we divide the synthetic data into 4 groups and nhl194 into 3 groups by applying both LGA and PSOLGA. Both grouping results are listed and discussed below.

Fig. 6 contrasts the results of LGA and PSOLGA on the two datasets mentioned above. The left panels are scatter plots of the grouping results, and the right panels show the minimum ROSS of both grouping algorithms over 20 resampling rounds. With the minimal starting value m suggested in [7], we notice that LGA cannot always find the best ROSS, while PSOLGA shows good stability towards the fittest result. Over the 20 resampling tests, LGA achieves the best ROSS 16 times on the synthetic data and 9 times on nhl194, while PSOLGA achieves 100% success on both test datasets.

Fig. 6. Grouping results of synthetic data (#starting value = 77) and nhl194 (#starting value = 44)

The computational efficiency of both algorithms is listed in Table 3. #starting hyperplane is the starting value of LGA and the swarm size in PSOLGA. resample is the number of times the corresponding algorithm is repeated. calc is the number of times the algorithm scans the dataset. ROSSmin is the best fitness value found. With the same starting value, PSOLGA requires more calculation than LGA, as PSOLGA has to iterate grouping for multiple rounds until the best fitness value stays unchanged for a certain number of rounds or the upper limit on total optimization rounds is exceeded. There is a tradeoff between computational complexity and clustering stability. Thanks to the strategic random search mechanism brought in by PSO, PSOLGA is able to reduce the computational load by choosing a smaller starting value while still finding the best clustering result.

Table 3. Comparison of computational complexity

As shown in Table 3, PSOLGA can still find the minimum ROSS value after 7 iterations when the swarm size is reduced by half to 20. Compared with PSOLGA using 44 particles, the computational effort is reduced by more than 50%. However, PSOLGA cannot guarantee finding the best result when the swarm size is too small, which is 10 for the nhl194 dataset. Experience shows that a swarm half the size of the starting groups suggested by LGA is sufficient to find the best grouping result.

Fig. 7 shows the status of PSOLGA in each iteration for both datasets. Blue dots denote the error (objective value) of the particles in a given iteration, and the red triangles mark the best (minimum) error in each iteration. We can see the particles flying toward the best location until the best grouping result is obtained.

Fig. 7. Iteration status of PSOLGA

5.2 Evaluation of single-hop access anomaly detection

We collect normal access flows from normal users A, B and C, as well as attacking access flows from attackers D and E. Web server F is their target, which offers 3 kinds of services: log-in, insert and query of personal data. The attackers use the tool sqlmap for SQL injection attempts, such as guessing authentication information and acquiring the database version and table structure. Raw flows are first converted to access flows. The analysis interval T is set to 10 seconds.

To get a glimpse of the feature space of the training dataset, we take 200 random samples from it, of which the first 100 observations are normal access flows and the other 100 are abnormal access flows originating from attackers. The similarity matrix in Fig. 8 is a heat map of the similarity between pairs of observations in Euclidean distance. The darker the color, the more similar the two observations are. We can see that abnormal flows are scattered over a wide area, as the attackers launched multiple different attacks that differ in their flow features. For example, attackers can acquire the database version with a single connection to the web server, but multiple flows need to be initiated to get the column names. Besides that, some abnormal access flows are similar to benign ones in Euclidean distance.

Fig. 8. Similarity matrix

We split the training dataset into two equal subsets for cross-validation. We compare PSOLGA and Kmeans in detecting abnormal access flows. As we assume no knowledge about the actual groups of normal access flows, we test grouping with 1 to 7 clusters and choose the cluster number with the best detection rate. Accuracy rate is the count of correctly classified samples divided by the overall sample count. As shown in Fig. 9, PSOLGA achieves its best detection rate of 98.45% when #clusters=3, while the best result of the Kmeans approach appears at #clusters=6. True negative rate and true positive rate are shown in Fig. 10 and Fig. 11. In Fig. 12, Gap analysis [7] is used to estimate the number of linear groups, and it also suggests 3 clusters. This means that Kmeans tends to overestimate the number of groups, while PSOLGA achieves its best accuracy rate when clustering data with the correct group number. From Fig. 13 and Fig. 14, we can see that Kmeans is insufficient for grouping data distributed in linear structures, while PSOLGA is able to fit the data into 3 groups, shown in different colors.

Fig. 9. Accuracy rate (%)

Fig. 10. True negative rate (%)

Fig. 11. True positive rate (%)

Fig. 12. Gap analysis

Fig. 13. Groups of normal access flows (Kmeans, #cluster = 6)

Fig. 14. Groups of normal access flows (PSOLGA, #cluster = 3)
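The three rates plotted in Figs. 9 to 11 can be computed as in the following minimal sketch. It assumes that an abnormal access flow is the positive class, which the text does not state explicitly, and the function name `rates` is hypothetical.

```python
def rates(y_true, y_pred):
    """Return (accuracy, true positive rate, true negative rate).

    y_true / y_pred: equal-length sequences of booleans, True = abnormal flow
    (assumed to be the positive class).
    """
    pairs = [(bool(t), bool(p)) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    # Accuracy: correctly classified samples over all samples
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, tpr, tnr


if __name__ == "__main__":
    truth = [True, True, False, False]
    guess = [True, False, False, False]
    print(rates(truth, guess))
```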

Our approach uses both orthogonal distance and Euclidean distance to detect outliers, so it is able to distinguish abnormal flows that are similar to normal ones in Euclidean distance (as shown in Fig. 8) but deviate from the corresponding linear grouping structure. However, the Kmeans approach uses only Euclidean distance, so it is incapable of detecting these abnormal flows. When PSOLGA tries to group data into more clusters than the correct cluster number (3, as suggested by Gap analysis), overfitting occurs. Overfitting brings down the overall accuracy rate and causes some unpredictable small fluctuations in the curves. For example, fluctuations in accuracy rate and true positive rate occur when the cluster number is increased from 3 to 4 and from 4 to 5, as shown in Fig. 9 and Fig. 11, due to the dataset distribution and wrong clustering groups, but the downward trend of the accuracy rate is not affected.
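The difference between the two distances can be illustrated in two dimensions. The sketch below fits a line to one group by orthogonal regression (the principal axis of the group's scatter) and then measures a point's orthogonal distance to that line and its Euclidean distance to the group centroid. It is a toy illustration with hypothetical function names and made-up points, not the profiling model or the outlier thresholds used in the paper.

```python
from math import atan2, cos, hypot, sin


def fit_line(points):
    """Orthogonal regression in 2-D: return (centroid, unit direction of the line)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    theta = 0.5 * atan2(2 * sxy, sxx - syy)  # angle of the principal axis
    return (cx, cy), (cos(theta), sin(theta))


def orthogonal_distance(point, centroid, direction):
    """Distance from point to the fitted line (length of the perpendicular)."""
    dx, dy = point[0] - centroid[0], point[1] - centroid[1]
    return abs(dx * direction[1] - dy * direction[0])


def euclidean_distance(point, centroid):
    """Straight-line distance from point to the group centroid."""
    return hypot(point[0] - centroid[0], point[1] - centroid[1])


if __name__ == "__main__":
    group = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]  # toy linear group
    centroid, direction = fit_line(group)
    for q in [(3, 3), (2.5, 1.5)]:
        print(q, euclidean_distance(q, centroid), orthogonal_distance(q, centroid, direction))
```

In this toy group the point (2.5, 1.5) lies closer to the centroid than the member (3, 3) does, yet it sits off the fitted line, which is the kind of deviation a purely Euclidean criterion cannot see.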

5.3 Evaluation of double-hop access anomaly detection

We generate 58 rules from the training dataset without knowledge of predefined server applications. Five typical rules are listed in Table 4. Rules webmysql and webredis are rules for web servers and their corresponding databases. hdfsctrl, hdfsctrl and hdfsdbctrl are rules for HDFS nodes. 50010 is the listening port of HDFS nodes for control messages. 9000 is the web service port of the HDFS master node. We conduct two different attacks, illegal database dump and directory traversal/path traversal, to show the use of the two evaluation streams EVRpre and EVRpost. The first attack results in the anomaly “Illegal Database Access”, and the second one leads to the anomaly “Abnormal Web Access”. We set the size of the evaluation stream to SL=10 and the anomaly detection threshold to α=0.99.

Table 4. Double-hop access rules

Situation “Illegal Database Access”: The attacker obtained root control of web server F and the authentication information for the mysql database through the previous SQL injection. After logging onto F, the attacker tries to dump data from the database from 13:55:00 to 14:01:40. As shown in rule webmysql, probpost is 0.73, meaning that within a given interval (10 s in this article), fpost towards the mysql database follows fpre towards web server F with a probability of 73%. We can see from the second trend chart of Fig. 15 that AValuepost starts to rise right after the database dump starts. When AValuepost reaches the anomaly detection threshold α (0.99) and AGappost is no longer negative, a valid anomaly is reported.

Fig. 15. Evaluation streams

Situation “Abnormal Web Access”: The attacker succeeds in locating a directory traversal/path traversal vulnerability on web server F and launches attacks to access files on F and execute system commands. During the attack (14:36:00-14:41:40), F does not need to access the mysql database, so no access flows between F and the mysql database appear, which deviates from the behavior of normal users. Negative values start to be appended to EVRpre, and AValuepre rises afterwards. Once AValuepre reaches the anomaly detection threshold α (0.99), a valid anomaly is triggered.

 

6. Conclusion

Advanced persistent threats and insider threats remain a serious concern to organisations. The lack of appropriate methods for keeping track of overall network activity makes it difficult for security teams to uncover unknown exploits and malicious insiders. Thus, it is necessary to arm network administrators with an autonomous inventory of network assets and with behavior analysis techniques.

In this paper, we investigate autonomous flow-based anomaly detection in enterprise networks. Compared with existing anomaly detection methods, this work differs in the following ways. First, we propose a methodology for discovering server applications in the targeted network without prior knowledge and for merging flows into access flows towards server applications. Second, we introduce a novel linear grouping algorithm, PSOLGA, for mining the significant linear structures in access flows, which are then used to build behavior profiles for each individual server application. PSOLGA achieves better grouping stability and computational efficiency than traditional LGA. In addition, we use an in-memory graph model to search for highly dependent access flows in time series and to reduce the overall computational workload. These dependent flow sequences are formulated into rules for detecting violations of access relations. Finally, we conduct experiments with both simulation data and a real-world flow dataset. The results indicate that the performance and accuracy of our model are promising.

References

  1. Binde, Beth, Russ McRee, and Terrence J. O’Connor, "Assessing outbound traffic to uncover advanced persistent threat," SANS Institute, 2011.
  2. Alperovitch, Dmitri, "Revealed: operation shady RAT," McAfee, vol. 3, 2011.
  3. Claise, Benoit, Brian Trammell, and Paul Aitken, "Specification of the IP Flow Information Export (IPFIX) protocol for the exchange of flow information," draft-ietf-ipfix-protocol-rfc5101bis-08 (work in progress), 2013.
  4. Sheikhan, Mansour, and Zahra Jadidi, "Flow-based anomaly detection in high-speed links using modified GSA-optimized neural network," Neural Computing and Applications, vol. 24, no. 3-4, pp. 599-611, 2014. https://doi.org/10.1007/s00521-012-1263-0
  5. Animesh Patcha and Jung-Min Park, "An overview of anomaly detection techniques: Existing solutions and latest technological trends," Computer Networks, vol. 51, no. 12, pp. 3448-3470, 2007. https://doi.org/10.1016/j.comnet.2007.02.001
  6. Riccardo Poli, James Kennedy and Tim Blackwell, "Particle swarm optimization," Swarm Intelligence, vol. 1, no. 1, pp. 33-57, June 2007. https://doi.org/10.1007/s11721-007-0002-0
  7. Stefan Van Aelst, Xiaogang Steven Wang, Ruben H. Zamar and Rong Zhu, "Linear grouping using orthogonal regression," Computational Statistics & Data Analysis, vol. 50, no. 5, pp. 1287-1312, 2006. https://doi.org/10.1016/j.csda.2004.11.011
  8. Gilberto Fernandes, Joel J. P. C. Rodrigues and Mario Lemes Proença, "Autonomous profile-based anomaly detection system using principal component analysis and flow analysis," Applied Soft Computing, vol. 34, pp. 513-525, 2015. https://doi.org/10.1016/j.asoc.2015.05.019
  9. Gilberto Fernandes Jr., Luiz F. Carvalho, Joel J. P. C. Rodrigues and Mario Lemes Proença Jr., "Network anomaly detection using IP flows with Principal Component Analysis and Ant Colony Optimization," Journal of Network and Computer Applications, vol. 64, pp. 1-11, 2016. https://doi.org/10.1016/j.jnca.2015.11.024
  10. Zahra Jadidi, Vallipuram Muthukkumarasamy, Elankayer Sithirasenan and Kalvinder Singh, "Performance of Flow-based Anomaly Detection in Sampled Traffic," Journal of Networks, vol. 10, no. 9, 2016. https://doi.org/10.4304/jnw.10.9.512-520
  11. M. Shojafar, N. Cordeschi and E. Baccarelli, "Energy-efficient Adaptive Resource Management for Real-time Vehicular Cloud Services," IEEE Transactions on Cloud Computing, vol. PP, no. 99, pp. 1-1, 2016.
  12. Quan Guo, Jia Jia, Guangyao Shen et al., "Learning robust uniform features for cross-media social data by using cross autoencoders," Knowledge-Based Systems, vol. 102, pp. 64-75, June 15, 2016. https://doi.org/10.1016/j.knosys.2016.03.028
  13. S. Beheshti, F. Alajaji and T. Linder, "Optimal Joint Decoding of Correlated Data Over Orthogonal Multiple Access Channels with Memory," IEEE Transactions on Vehicular Technology, vol. PP, no. 99, pp. 1-1, Mar. 17, 2016. https://doi.org/10.1109/TVT.2016.2543221
  14. Bingdong Li, Jeff Springer, George Bebis and Mehmet Hadi Gunes, "A survey of network flow applications," Journal of Network and Computer Applications, vol. 36, no. 2, pp. 567-581, March 2013. https://doi.org/10.1016/j.jnca.2012.12.020
  15. Minho Jo, Longzhe Han, Dohoon Kim and Hoh Peter In, "Selfish attacks and detection in cognitive radio Ad-Hoc networks," IEEE Network, vol. 27, no. 3, pp. 46-50, 2013. https://doi.org/10.1109/MNET.2013.6523808
  16. E. Sharafuddin, N. Jiang, Y. Jin and Z. L. Zhang, "Know Your Enemy, Know Yourself: Block-Level Network Behavior Profiling and Tracking," in Global Telecommunications Conference (GLOBECOM 2010), pp. 1-6, Dec. 2010.
  17. Zhang Xiaochen, Liu Shengli, Meng Lei and Shi Yunfang, "Trojan Detection Based on Network Flow Clustering," Multimedia Information Networking and Security (MINES), pp. 947-950, 2012.
  18. Pelleg, Dan, and Andrew W. Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters," ICML, vol. 1, 2000.
  19. Shing-Han Li, Yu-Cheng Kao, Zong-Cyuan Zhang, Ying-Ping Chuang and David C. Yen, "A network behavior-based botnet detection mechanism using PSO and K-means," ACM Transactions on Management Information Systems, vol. 6, no. 1, 2015. https://doi.org/10.1145/2676869
  20. L. A. García-Escudero, A. Gordaliza, R. San Martín, S. Van Aelst and R. Zamar, "Robust linear clustering," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 71, no. 1, pp. 301-318, 2009. https://doi.org/10.1111/j.1467-9868.2008.00682.x
  21. L. A. García-Escudero, A. Gordaliza, A. Mayo-Iscar and R. San Martín, "Robust clusterwise linear regression through trimming," Computational Statistics & Data Analysis, vol. 54, no. 12, pp. 3057-3069, 2010. https://doi.org/10.1016/j.csda.2009.07.002
  22. Ivo Friedberg, Florian Skopik, Giuseppe Settanni and Roman Fiedler, "Combating advanced persistent threats: From network event correlation to incident detection," Computers & Security, vol. 48, pp. 35-57, 2015. https://doi.org/10.1016/j.cose.2014.09.006
  23. Barsamian, Alexander V., "Network characterization for botnet detection using statistical-behavioral methods," Doctoral dissertation, Dartmouth College, 2009.
  24. Haag, Peter, "Watch your Flows with NfSen and NFDUMP," in 50th RIPE Meeting, 2005.
  25. Peng Bichen, Guo Wei, Liu Daiping and Fu Jianming, "Dynamic application flow cluster based on traffic behavior distance," in Proc. of IEEE Computer Society, vol. 1, pp. 1291-1296, 2010.
  26. V. Frias-Martinez, J. Sherrick, S. J. Stolfo and A. D. Keromytis, "A Network Access Control Mechanism Based on Behavior Profiles," in Proc. of Computer Security Applications Conference, pp. 3-12, 2009.
  27. Gintautas Garsva and Paulius Danenas, "Particle swarm optimization for linear support vector machines based classifier selection," Nonlinear Analysis-Modelling and Control, vol. 19, no. 1, pp. 26-42, 2014.
  28. Ying Tan, Yuhui Shi, Fernando Buarque et al., "A Population-Based Clustering Technique Using Particle Swarm Optimization and K-Means," Advances in Swarm and Computational Intelligence, pp. 145-152, 2015.
  29. Z. p. Yan, C. Deng, J. j. Zhou and D. n. Chi, "A novel two-subpopulation particle swarm optimization," in Proc. of 10th World Congress on Intelligent Control and Automation (WCICA), pp. 4113-4117, 2012.

Cited by

  1. A Method for Detecting Anomalous Traffic Generated by Pre-infiltrated Agents, vol.28, pp.5, 2016, https://doi.org/10.13089/jkiisc.2018.28.5.1169