A method for monitoring online security threats comprising a machine-learning service that receives data related to a plurality of features of internet traffic metrics. The service processes said data by performing operations selected from among: an operation of ranking at least one feature, an operation of classifying at least one feature, an operation of predicting at least one feature, and an operation of clustering at least one feature. As a result, the machine-learning service outputs metrics that aid in the detection, identification, and prediction of an attack.
1. A method for monitoring online security threats comprising:
receiving data related to a plurality of features by a machine-learning service, wherein the received data comprises internet traffic metrics;
generating an output by the machine-learning service performing a machine-learning operation on at least one feature of the plurality of features, wherein the machine-learning operation is selected from among: an operation of ranking at least one feature, an operation of classifying at least one feature, an operation of predicting at least one feature, and an operation of clustering at least one feature, wherein the output comprises detection, identification, and prediction of an attack; and
sending the output from the machine-learning service.
2. The method of
3. The method of
4. The method of
5. The method of the multiple threat feeds are aggregated and processed by the machine learning service, the machine learning service's output is categorized into groups based on the threat feed sent by each respective client user, and the grouped output is sent to the respective client user.
6. The method of
extracting a set of meta information from the threat data;
extracting a set of features from the meta information;
clustering the set of features based on network size; and
building a model of cluster classifiers.
7. The method of
8. The method of
9. The method of
10. The method of
11. The method of
The present disclosure claims the benefit of U.S. Provisional Application Ser. No. 62/028,197, filed on Jul. 23, 2014. The present invention relates generally to online security and the prevention of malicious attacks. Specifically, the present invention relates to an automated computer system implemented for the identification, detection, prediction, and correlation of online security threats. The term “big data” is used ubiquitously these days as a synonym for the management of data sets that are well beyond the capability of an individual person. In the arena of internet security, for example, security experts are tasked with handling increasingly large amounts of threat feeds and logs (“big data”) that need to be analyzed and cross-referenced in order to find patterns to detect potential online threats to companies, institutions, agencies, and internet users worldwide. Currently the industry is so overwhelmed by the vast amounts of information that there is a shortage of experts in the field of big data and machine learning who can tackle these challenges. In order to make effective use of all this security data, there is also a rising demand for “Security Data Scientists”. These scientists are not only highly trained data scientists, who can apply machine learning and data mining approaches to handle big data and detect patterns in them, but they are also security researchers who understand the online threat landscape and are experts in identifying and detecting Internet threats. However, finding such talent nowadays is proving extremely difficult due to the dual set of expertise that is required. Indeed, it would take an individual an entire career to become an expert in just one of these fields. Additionally, due to the exponential growth and complexity of the Internet, it is proving increasingly difficult for organizations to find and retain talented security data scientists who can help track and monitor all of the detectable and potential threats online. 
Thus what is needed is a method for monitoring online threats that can scale with the growth of the Internet. The present invention overcomes these human limitations through a plug-and-play platform that enables security researchers and analysts to apply big data and machine learning approaches to security problems at the click of a mouse. The present invention further utilizes machine learning techniques to harness the information provided by the platform's users and partners in order to implement a scalable computer platform for dealing with online threats. The platform and its machine learning capabilities culminate in a machine learning service that may be trained to automatically recognize suspicious patterns in internet traffic and internet registry data and to alert the appropriate users and client systems. The machine learning service of the present invention comprises at least four novel components: 1) a threat plug and play platform, 2) a threat identification and detection engine, 3) a threat prediction engine, and 4) a threat correlation engine. Each of these components is described in detail in the accompanying illustrations and respective descriptions. For the purpose of promoting an understanding of the principles of the present disclosure, reference will now be made to the embodiments illustrated in the drawings, and specific language will be used to describe the same. It will, nevertheless, be understood that no limitation of the scope of the disclosure is thereby intended; any alterations and further modifications of the described or illustrated embodiments, and any further applications of the principles of the disclosure as illustrated therein, are contemplated as would normally occur to one skilled in the art to which the disclosure relates. All limitations of scope should be determined in accordance with and as expressed in the claims. 
Applying data mining and machine learning algorithms usually requires scripting and coding as well as basic knowledge of the theory behind these algorithms. The present invention, however, may be used by internet security researchers and analysts who need no prior theoretical knowledge of these algorithms, or even any scripting or coding skills. Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model based on inputs and using those models to make predictions or decisions, rather than only following explicitly programmed instructions. The present invention makes use of these types of algorithms by building models of “healthy” and “unhealthy” internet networks and traffic, based on past internet data, and then comparing those models against contemporaneous internet data to estimate the level of risk that is present. The present invention is enabled by its capability to receive various sources of data inputs of different formats. In one embodiment the automated threat detection service 120 receives a package of data 110 from a user that comprises any of multiple sources of internet traffic data 100, such as threat feeds. The threat detection service 120 may also accept an entire host of internet activity logs including, but not limited to, DNS logs. In addition, the system 120 may accept malware binary files and, similarly, pre-generated malware sandbox output files. The system 120 may also accept packet capture files (PCAPs) and, finally, regional Internet registry (RIR) data. In one embodiment, when prompted, the automated service 120 may produce a detailed output listing Malice Scores and Malicious Components 180 as well as Network Risk Reports 190. Malice Scores may be numbers ranging from 0, indicating benign traffic, to 1, indicating malicious traffic. 
Malicious Components may include IP addresses, domain names, network blocks, and URLs. The service may also include a reason why such traffic was classified as malicious. Network Risk Reports may include an updated list of IPs, domains, and CIDRs that have high threat scores. To offer an example, consider the following code which makes an API call to a fictitious threat detection service hosted at “XYZsecurity.com.” If a user wanted to investigate the IP address 91.220.62.190 the user may issue the following command:
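Such a command can be sketched in Python as follows. This is a hypothetical illustration only: the endpoint path and query-parameter name are assumptions, while the host XYZsecurity.com and the IP address 91.220.62.190 come from the example above.

```python
# Hypothetical sketch of the API call described above. The endpoint path
# ("/api/v1/ip") and the "ip" query parameter are assumptions.
from urllib.parse import urlencode

BASE_URL = "https://XYZsecurity.com/api/v1/ip"  # assumed endpoint path

def build_lookup_url(ip: str) -> str:
    """Build the lookup URL for a single IP address."""
    return f"{BASE_URL}?{urlencode({'ip': ip})}"

url = build_lookup_url("91.220.62.190")
# The request itself could then be issued with urllib.request.urlopen(url).
```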
The Meta Information Extraction 210 stage comprises constructing structured meta information from the unstructured threat data. The categorizer 130 may extract six core pieces of evidence from threat data 110: IP addresses, Timestamps of Attacks, URLs, Domain Names, Attack Category, and the Threat Feed that reported the attack. Once the IP address is extracted it is fed to a Border Gateway Protocol (BGP) extraction engine to find the network prefix (CIDR) and the Autonomous System (AS) that the IP maps to. In addition, this extraction provides the geo-location of the IP and the RIR that the IP belongs to. Next, for every IP the categorizer 130 constructs a time series comprising all timestamps at which an attack was reported on that IP. This log of timestamps is beneficial in extracting the queueing-based features in the Feature Extraction 220 stage. The attacks may be grouped into five main attack categories:
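The per-IP time series construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the record field names are assumptions.

```python
# Minimal sketch of the per-IP time series described above: each raw
# threat record contributes a timestamp to the queue of the IP it
# reports on. Field names ("ip", "timestamp") are assumptions.
from collections import defaultdict

def build_ip_time_series(records):
    """Group attack timestamps by reported IP, sorted chronologically."""
    series = defaultdict(list)
    for rec in records:
        series[rec["ip"]].append(rec["timestamp"])
    return {ip: sorted(ts) for ip, ts in series.items()}

records = [
    {"ip": "91.220.62.190", "timestamp": 1405900800},
    {"ip": "91.220.62.190", "timestamp": 1405814400},
    {"ip": "198.51.100.7",  "timestamp": 1405900800},
]
ts = build_ip_time_series(records)
```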
The end product of the Meta Information Extraction 210 stage is a set of Meta Data 211 which is then used for Feature Extraction 220. In the Feature Extraction 220 stage the categorizer 130 extracts four categories of features: Queueing-Based Features, URL-Based Features, Domain-Based Features, and BGP-Based Features. Queueing-Based Features are modeled on five components of a network: i) IP address, ii) Network block, also known as CIDR, iii) Autonomous System (AS), which is a group of CIDRs that share the same routing policy, iv) Country, which is the geolocation of the IP, and v) Regional Internet Registry (RIR), which is the region the IP resides in. Each of these components may be considered a “queue.” The rate at which attacks arrive at the network is considered the “infection rate.” The rate at which they get taken down is considered the “departure rate.” The duration for which an attack stays on a network is the “service rate.” The difference between the arrival rate and the departure rate is the “network utilization.” It is assumed that attacks (infections) arrive at the queue, stay in the queue during the infection period, and finally get taken down, which is simply when the infection is cleaned. Thus there are five important properties of the queue:
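The queueing quantities named above (infection rate, departure rate, service duration, and network utilization) can be sketched as follows, assuming each infection is represented as an (arrival, departure) timestamp pair for one network component. The function and field names are illustrative; utilization is computed as the arrival/departure rate difference the text defines.

```python
# Hedged sketch of the queueing quantities for one network component
# (IP, CIDR, AS, country, or RIR). A departure of None means the
# infection has not yet been taken down.
def queue_features(infections, window_seconds):
    """infections: list of (arrival_ts, departure_ts) pairs."""
    n = len(infections)
    arrival_rate = n / window_seconds                      # "infection rate"
    departures = [d for _, d in infections if d is not None]
    departure_rate = len(departures) / window_seconds      # take-down rate
    durations = [d - a for a, d in infections if d is not None]
    mean_duration = sum(durations) / len(durations) if durations else 0.0
    utilization = arrival_rate - departure_rate            # as defined in the text
    return {
        "arrival_rate": arrival_rate,
        "departure_rate": departure_rate,
        "mean_infection_duration": mean_duration,
        "utilization": utilization,
    }

feats = queue_features([(0, 3600), (7200, None)], window_seconds=86400)
```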
URL-Based Features are extracted statistical features that capture the following patterns in URLs:
Domain-based Features are extractions of the following attributes and aspects of domains:
BGP-based Features are features that are related to CIDRs and ASNs. The following features may be extracted per IP:
The end product of the Feature Extraction 220 stage is a set of Features 221, which are then clustered according to network size 230. In the Network Size Clustering 230 stage the categorizer 130 may use the k-means clustering algorithm to cluster the data (namely IPs, domains, and URLs) and features into four clusters depending on their CIDR size. This clustering step is necessary because larger networks cannot be modeled the same way as small networks, and thus the models need to be trained and classified independently. The clusters are determined as follows:
In practice it is often the case that the characteristics of Cluster 3 and Cluster 4 are similar enough that they may be combined into a single cluster [/16, /0) to save processing time, storage, and other computing resources. In the Model Building 240 stage the categorizer 130 trains a Random Forest classifier for each of the clusters that were created in the previous section. To train the classifier we construct a training set that comprises a positive set and a negative set. The positive set contains malicious samples whose patterns the classifier needs to learn. The negative set contains benign samples that the classifier needs to discriminate against. The data in the dataset corresponds to the features that were discussed in the previous sections. Some features are represented based on whether or not they exist. For example, one feature can be whether an IP belongs to a particular cluster. This feature is represented as 1 if the IP belongs to that cluster and 0 if it does not. Other features are represented as numerical values. For example, one feature can be the total number of IPs in a network. Eventually the dataset can be thought of as a table in which the rows are the sample points, in our case IP addresses, and the columns are the features extracted for these IP addresses. Since we are dealing with a classification problem (i.e., classifying traffic into benign and malicious), the dataset must contain a column that shows the label of the data, which is simply whether the IP is benign (0) or malicious (1). To evaluate the classifier built in the previous step, the labeled dataset may be divided into two sets: a training set and a test set. The test set is used to evaluate the performance of the classifier (performance in terms of detection, not speed). 
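The dataset layout and train/test split described above can be sketched as follows. The rows, column choices, and split ratio here are illustrative, not values from the patent.

```python
# Sketch of the dataset table described above: one row per IP, with a
# binary feature, a numeric feature, and a 0/1 label column, split into
# a training set and a held-out test set.
import random

rows = [
    # [in_given_cluster (0/1), total_ips_in_network, label (0 benign / 1 malicious)]
    [1, 256, 1],
    [0, 65536, 0],
    [1, 256, 1],
    [0, 1024, 0],
]

def split_dataset(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows and split them into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(rows)
```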
Based on the detection of the sample data in the test set, one can evaluate the accuracy of the classifier, the error rate, the false positive rate, and the false negative rate. The end product of the Model Building 240 stage is a set of trained Threat Classifiers 241 for each Cluster Grouping 231. These classifiers exist as trained models 160 that may later be compared against future internet data 110 in order to identify and detect potential threats. Once models 160 are trained by the network feature trainer 200 they may be used by an attack ID and detection engine 300 to analyze potential threats. The IP Features 311 category may further comprise IP Stats, which include the number of threat feeds that list the IP and the number of attack categories the IP falls under. The CIDR Features 312 category may further comprise CIDR Stats, which include the CIDR size, the number of infected IPs within the CIDR, and the cluster ID. The ASN Features 313 category may comprise ASN Stats, including the number of CIDRs within the ASN, the number of infected IPs within each CIDR, and thus the number of infected CIDRs. The CC Features 314 category may further comprise CC Stats, including the number of infected IPs and thus the number of infected CIDRs and ASNs. The RIR Features 315 category may comprise RIR Stats, including the number of infected IPs and thus the number of infected CIDRs, ASNs, and CCs. Since the listed feature categories 311-315 follow a hierarchy (i.e., IPs reside on CIDRs, which reside within ASNs, which further reside within CCs, which finally exist within RIRs), the aggregated averages 316 of some of the features in one feature category may be used to estimate the stats in another feature category. 
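The classifier evaluation described at the start of this passage (accuracy, error rate, false positive rate, and false negative rate) can be sketched by comparing predicted labels against the true labels of the test set:

```python
# Sketch of the evaluation step: derive the four metrics named above
# from a confusion matrix over true and predicted 0/1 labels.
def evaluate(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = len(y_true)
    return {
        "accuracy": (tp + tn) / n,
        "error_rate": (fp + fn) / n,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }

metrics = evaluate([1, 0, 1, 0, 1], [1, 1, 1, 0, 0])
```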
The Information Extraction 410 stage comprises first receiving daily information about IP 401 and network assignments, reassignments, allocations, and reallocations 402, as well as newly registered domain names from top level domain (TLD) zone files 403. This information is acquired through the five RIRs, i.e., ARIN, RIPE, APNIC, AFRINIC, and LACNIC. At this stage the categorizer 130 identifies the individuals or groups that the IPs or network blocks were assigned to 411. In the Feature Extraction 420 stage the categorizer 130 extracts two categories of features: Contact Entropy Based Features and pDNS Based Features. Contact Entropy Based Features are features used to detect network blocks that will be used by threat actors. Threat actors use anonymous or private registration information when they register for reassigned network blocks. Thus, in order to identify these malicious actors, the entropy of the registration information for newly assigned network blocks is aggregated and correlated as a set of features. Suspicious networks will likely have higher entropy. The system further finds passive DNS (pDNS) evidence on the IPs that were identified in the registration information from the previous feature. The system then calculates pDNS features on the IPs and domains retrieved in the previous step. Then the system correlates the domains and IPs with a malware DB to find which IPs and domains were associated with malware in the past. Finally, the system calculates maliciousness scores for all IPs and domains that it gets from the pDNS evidence. In a manner analogous to how the threat detection models 160 were generated in the earlier example, these datasets are also grouped into clusters 430 depending on the CIDR size and analyzed by their respective cluster classifiers 440. The end product of the Model Building 440 stage is a set of trained Threat Classifiers 441 for each Cluster Grouping 431. 
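The malware-DB correlation and scoring step can be sketched as follows. The scoring rule used here, the fraction of a domain's pDNS-resolved IPs found in the malware database, is an assumption for illustration, not the patent's formula.

```python
# Hedged sketch of correlating pDNS evidence with a malware DB and
# assigning a simple maliciousness score per domain. The scoring rule
# (fraction of resolved IPs with past malware associations) is assumed.
def maliciousness_scores(pdns_map, malware_db):
    """pdns_map: {domain: [resolved IPs]}; malware_db: set of known-bad IPs."""
    return {
        domain: (sum(ip in malware_db for ip in ips) / len(ips) if ips else 0.0)
        for domain, ips in pdns_map.items()
    }

scores = maliciousness_scores(
    {"bad.example": ["198.51.100.7", "203.0.113.9"], "ok.example": ["192.0.2.1"]},
    malware_db={"198.51.100.7"},
)
```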
These classifiers exist as trained models 170 that may later be compared against future internet data 110 in order to predict potential threats. Once models 170 are trained by the network feature trainer 400 they may be used by an attack prediction engine 500 to analyze potential threats. The Contact-Based Features 511 category may further comprise sub-categories including Shannon Diversity Index of Registration Information, Shannon Entropy of Registry Information, Shannon Diversity Index of Registrants Addresses, and Shannon Entropy Index of Registrants Addresses. The pDNS-Based Features 512 category may further comprise sub-categories including Average Shannon Entropy for Domain Names, Statistical Features for Domain Name Entropy (e.g. min, max, standard deviation of entropy), Shannon Diversity Index of Top Level Domains, and Statistical Features for Top Level Domains Entropy (e.g. min, max, mean, standard deviation of entropy). In order to build a model for correlation, the system must be fed with the same six entities that are extracted in the network features trainer 200. The correlation classifier 600 then periodizes the timestamps of the various attacks in the different feeds. This is done by grouping and aggregating 610 IP address, CIDR, ASN, CC, and RIR information in all of these feeds. Then on every IP, CIDR, ASN, CC, and RIR the classifier 600 groups the attacks by their categories. The classifier 600 then periodizes attacks on each of these entities by their attack category. Next the classifier 600 extracts features 620 on all six entities, similar to the steps followed in the network features trainer 200. The classifier 600 then performs the familiar clustering step 630 and then builds models 640 for the four clusters. 
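The entropy-based features named above can all be computed from a frequency distribution. In the minimal sketch below, applying Shannon's formula to the characters of a registration string gives its entropy, while applying it to a collection of registrant addresses gives a diversity measure; the exact feature definitions in the patent may differ.

```python
# Shannon entropy H = -sum(p_i * log2(p_i)) over the frequencies of the
# observed values. Used here for both the contact-string entropy and a
# diversity measure over registrant records (usage is an assumption).
import math
from collections import Counter

def shannon_entropy(values):
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Entropy of a registration string's characters (string is hypothetical):
h = shannon_entropy("anonymous-registrant-001")
# Diversity over a set of registrant addresses (higher = more diverse):
d = shannon_entropy(["PO Box 1", "PO Box 1", "12 Main St"])
```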
The models that were generated in this process are then used to score all the threat data from all threat feeds 100. For example, in one embodiment, clients 700 each submit their own threat feeds to the service. This method of aggregating input data 110 allows the machine learning service's output to be grouped by the feed each respective client sent and returned to that client. Naturally, in order to process the requisite amount of data, the present invention requires significant hardware resources to operate. To deploy an instance of the analytics engine that supports up to 100 active users, the automated threat detection service 120 requires, at a minimum, the following hardware specifications: a CPU with 8 cores, each operating at 2.0 GHz; 32 GB of RAM; and 50 GB of free HDD space. While the system 120 at times may be functional with fewer resources, the requirements specified above should be scaled according to the number of users for effective performance. As with any automated service, the platform 120 will sometimes misclassify legitimate traffic as malicious (also known as false positives) or classify malicious traffic as legitimate (also known as false negatives). While these incidents should be rare, the present invention may require a mechanism to feed these misclassifications back to the platform 120 so that its classifiers can be retrained. This may be done simply by pushing a configuration file update to the platform 120. This configuration file may contain amendments to a classifier model or alternatively include an entirely new retrained model.
CROSS-REFERENCE TO RELATED APPLICATIONS
TECHNICAL FIELD
BRIEF BACKGROUND
SUMMARY OF THE INVENTION
BRIEF DESCRIPTION OF THE FIGURES
DETAILED DESCRIPTION OF THE INVENTION
Platform Overview
Machine Learning
Data Input
Data Output
Once entered, the IP address 91.220.62.190 is sent to the server 120. Subsequently the automated threat detection service 120 will respond 180, and the following information is displayed to the user:
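The response can be illustrated as a record with the parameters described below. All score values, the ASN number, and the ASN description in this sketch are invented placeholders; only the IP, the /24 network block, the country code, and the RIR abbreviation come from the example in the text.

```python
# Illustrative shape of the service's response 180. Scores, ASN, and
# ASN_DESC are placeholders; IP, CIDR, CC, and RIR follow the example.
example_response = {
    "IP": "91.220.62.190",
    "IP_SCORE": 0.97,       # placeholder, 0 = benign, 1 = malicious
    "CIDR": "91.220.62.0/24",
    "CIDR_SCORE": 0.88,     # placeholder
    "ASN": 64500,           # placeholder (documentation-range ASN)
    "ASN_SCORE": 0.42,      # placeholder
    "ASN_DESC": "Example Network Ltd",  # placeholder description
    "CC": "RU",
    "RIR": "RI",            # abbreviation for RIPE
}
```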
The following is a brief description of each displayed parameter:
“CIDR_SCORE” represents the malicious score of the CIDR between 0 and 1 with 0 being benign and 1 being malicious.
“ASN_SCORE” represents the malicious score of the ASN between 0 and 1 with 0 being benign and 1 being malicious.
“IP_SCORE” represents the malicious score of the IP address and can range between 0 and 1 with 0 being benign and 1 being malicious.
“RIR” identifies the Regional Internet Registry the IP belongs to. In this example “RI” is the abbreviation for RIPE which is the European registry.
“ASN” is the Autonomous System Number that the IP belongs to.
“ASN_DESC” is a textual description of the owner of the ASN.
“CIDR” is the network block that the IP belongs to. In this example the network block is “/24.”
“CC” is the Country Code of where the IP resides. In this example “RU” stands for Russia.
“IP” is the IP address in question. In this example the address is of course “91.220.62.190.”
Threat Identification and Detection Models
Meta Information Extraction
Feature Construction
Queueing-Based Features
URL-Based Features
Domain-Based Features
BGP-Based Features
The following features are extracted per domain name:
Clustering
Model Building and Training
Threat Identification and Detection Engine
Threat Prediction Models
Information Extraction
Feature Construction
Contact Entropy Based Features
pDNS Based Features
Clustering and Model Building
Threat Prediction Engine
Threat Correlation Classifier
Hardware Limitations
Updates
GLOSSARY