July 14, 2014
Our antivirus formula.
Every system is based on a unique algorithm; without the algorithm there’s no system. It doesn’t really matter what kind of algorithm the system follows – linear, hierarchical, deterministic, stochastic or whatever. What’s important is that, to reach the best result, the system needs to follow certain rules.
We’re often asked about our products’ algorithms – especially how they help us detect future threats better than the competition.
Well, for obvious reasons I can’t divulge the details of our magic formulae; however, what I will be doing in this tech-post (perhaps the techiest post ever on this blog) is crack open the door to our technological kitchen – to give you a glimpse of what goes on inside. And if you still want more info, please fire away with your questions in the comments below.
We use a deductive method to detect unknown malware – from the general to the particular. All malware performs, say, actions x, y and z. A certain file also performs x, y and z; therefore, that file would appear to be malicious. However, in practice things aren’t so simple.
First of all, it’s impossible to say that a specific action unequivocally confirms the maliciousness of an object. A classic example of this is access to the master boot record (MBR): you can’t say that everything that uses this command is malicious, since there are many applications that can use it for peaceful ends. The same goes for all other actions; after all, every command was first created to do useful stuff.
In other words, simply separating the wheat from the chaff here is of no use. However, trying to work out the proportions and composition of both the wheat and the chaff is of use. And that’s exactly what gets done: finding out what’s going on in one’s own granary, plus what’s going on in the neighboring granaries, analyzing the results, and then making a substantiated decision on the overall wheat/chaff situation – how much ‘friend’, how much ‘foe’ – and the corresponding follow-up.
To do this we use technology that unpretentiously goes by the name of SR. Not to be confused with old-school toothpaste, this stands for Security Rating. It’s basically a branchy, self-teaching system of weights, which helps us better understand the true nature of an object during its formal evaluation and emulation.
SR analyzes the composition and density of events generated by an object, as well as its outward attributes (name, size, location, compression, etc.). Based on a complex set of rules, each such attribute gets a danger rating (0–100%). The first set of rules (there are now more than 500) was the result of a manual study of more than 80,000 unique malicious programs of different families. Nowadays rules are developed mostly automatically, leaving human experts to just fine-tune the self-teaching algorithms.
To make testing and maintenance more manageable, rules are divided into groups (for example, ‘Internet’, ‘Passwords’, ‘Registry’, etc.), and if an object matches one or several of them, the corresponding sanctions are applied to it.
Examples of the simplest of rules:
‘Loading driver via low level API ntdll.dll’ rule
API function: NtLoadDriver
Argument 1: *
Argument 2: *
Argument 3…N: *
Assessment: Single operation – 40%, 2-3 operations – 50%, >3 operations – 60%
Harmful: No
‘Analysis of kernel machine code (taking hooks)’ rule
API function: CreateFile
Argument 1: Contains ‘ntoskrnl.exe’ entry
Argument 2: *
Argument 3…N: *
Assessment: Single operation – 100%, 2-3 operations – 100%, >3 operations – 100%
Harmful: Yes
The total rating of an object is the sum of all the individual ratings after a check using the whole rule database. In other words, it’s a typical artificial neural network, which collects signals from a multitude of sensors, analyzes their qualitative and quantitative characteristics, explores the connections, and gives its verdict.
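To make that arithmetic a bit more tangible, here’s a minimal sketch in Python – purely illustrative, not our actual engine – of how rules like the two above could be matched against emulated events and their weights summed into a total rating (the Rule structure, the event format and the sample weights are all assumptions for the example):

from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass
class Rule:
    api: str          # API function name the rule watches
    arg_masks: list   # wildcard masks for the arguments ('*' matches anything)
    weights: tuple    # rating for 1 operation, 2-3 operations, >3 operations

def matches(rule, event):
    if event["api"] != rule.api:
        return False
    return all(fnmatch(str(arg), mask)
               for arg, mask in zip(event["args"], rule.arg_masks))

def security_rating(events, rules):
    total = 0
    for rule in rules:
        hits = sum(1 for e in events if matches(rule, e))
        if hits == 1:
            total += rule.weights[0]
        elif 2 <= hits <= 3:
            total += rule.weights[1]
        elif hits > 3:
            total += rule.weights[2]
    return total

rules = [
    Rule("NtLoadDriver", ["*", "*", "*"], (40, 50, 60)),                # driver-loading rule
    Rule("CreateFile", ["*ntoskrnl.exe*", "*", "*"], (100, 100, 100)),  # kernel-hooking rule
]

events = [{"api": "CreateFile",
           "args": ["C:/Windows/System32/ntoskrnl.exe", "GENERIC_READ", 0]}]

print(security_rating(events, rules))   # -> 100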
That’s how SR started out in 2007 (patent US7530106), and we’ve been improving the tech ever since. As if you couldn’t guess that!
Problem number one early on was that an analyzed file can generate a huge number of insignificant events, and these could lead to the file being incorrectly flagged as malicious. For example, a Delphi application generates up to 500 such events when launched. They’ll be identical across any application written in that language, and carry zero useful information about the real intentions of the file. This ‘noise’ not only uses up the computer’s resources, it also makes the analysis more difficult.
So we made a filter for sifting out all that noise. Unlike regular rules, here a Boolean attribute is sufficient, which greatly simplifies the rules and speeds up their processing. As a result, the filter rules contain only the name of the API function and the masks for its arguments.
For example:
SysAllocString (*,-NULL-,-NULL-)
Here, if the first argument of the function carries any value while the rest carry none, the event is deemed insignificant.
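Here’s a rough sketch of how such a Boolean filter could work before the rating kicks in – the data layout and helper names are assumptions for illustration, not the real filter format:

# Filters describe events that carry no useful signal; anything matching one is dropped.
filters = [("SysAllocString", ["*", "-NULL-", "-NULL-"])]

def is_insignificant(event, filters):
    for api, masks in filters:
        if event["api"] != api:
            continue
        ok = True
        for arg, mask in zip(event["args"], masks):
            if mask == "-NULL-":      # the argument must carry no value
                ok = ok and not arg
            elif mask == "*":         # the argument may carry any value
                ok = ok and bool(arg)
            else:                     # otherwise: exact match
                ok = ok and arg == mask
        if ok:
            return True
    return False

events = [
    {"api": "SysAllocString", "args": ["Hello World", None, None]},  # Delphi-style noise
    {"api": "NtLoadDriver",   "args": ["mydriver", None, None]},     # worth rating
]
events = [e for e in events if not is_insignificant(e, filters)]
print([e["api"] for e in events])   # -> ['NtLoadDriver']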
To automatically generate filtration rules for insignificant events, we used three methods:
The first is the drosophila method. We prepare a simple application that displays ‘Hello World’ using development tool X, making use of its most popular DLLs as far as possible. We feed the compiled application into the emulator, and all the events this drosophila generates are entered into the ‘insignificant’ list.
The second is the packed drosophila method. It’s just like the first method, except that here we’re interested in the behavioral events of the packer/protector. For this we process a dummy written in assembler with all sorts of packers and protectors, feed it to the emulator, and… well, the rest you can guess. If not :)… we filter out the insignificant events.
The third method is the statistical one. We analyze a large quantity of both legitimate and malicious files, and mark out the API calls that are often observed in the behavior of both types of files. This method supplements the first two and is effective when there’s no possibility of creating the kinds of drosophila mentioned above. An illustrative example of this method is the marking of insignificant events generated by GUI and memory-allocation functions.
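A toy illustration of the statistical idea (the threshold and the behavior logs below are made up): API calls that show up in roughly the same large share of clean and malicious files carry little signal, so they can be added to the ‘insignificant’ list automatically.

from collections import Counter

def frequent_apis(behaviour_logs, threshold=0.8):
    """Return API calls seen in at least `threshold` of the analyzed files."""
    counts = Counter()
    for apis in behaviour_logs:
        counts.update(set(apis))
    return {api for api, c in counts.items() if c / len(behaviour_logs) >= threshold}

# Tiny made-up behavior logs: one list of API calls per analyzed file.
clean_logs = [["GetDC", "HeapAlloc", "TextOut"], ["GetDC", "HeapAlloc"]]
malware_logs = [["HeapAlloc", "GetDC", "NtLoadDriver"], ["HeapAlloc", "CreateRemoteThread"]]

# Calls frequent in BOTH populations are treated as noise and filtered out.
noise = frequent_apis(clean_logs) & frequent_apis(malware_logs)
print(noise)   # -> {'HeapAlloc'} for these toy logs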
But that (automatic generation of filtration rules for unimportant events) was just one of the easiest challenges. Further on things got more interesting…
The first version of SR worked on a single protected computer, practically in isolation. We didn’t have the global picture; we didn’t know which rules were being triggered – or how often or how accurately – and we couldn’t quickly change their ratings. As a result, a lot of potential for increased effectiveness went unused…
…Enter our cloud-based KSN, which was developing at full steam ahead, and to which we’d already added the Astraea expert system (for analyzing the colossal volumes of signals from protected computers and issuing reasonable conclusions about the cyber-epidemiological situation in the world).
Then, in 2009, we were happy to report the release of the next version of SR – SR2 (US8640245), which had merged with KSN and Astraea.
This gave us big data with good drill-down opportunities, which in the security industry is a magic recipe for success!
In essence, we’d gotten ourselves the ability to (i) zap dead (ineffective) rules, (ii) temporarily turn off or test rules, and (iii) correct rule ratings practically in real time using special coefficients. Moreover, the size of the coefficient database was silly small – measured in kilobytes – and even back in 2009 updating it hardly affected the Internet connection of the protected computer.
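As a hypothetical illustration (the rule IDs, weights and coefficients are invented), such a coefficient table could adjust rule ratings without shipping a new rule database – and a coefficient of zero effectively switches a rule off:

# Base weights ship with the rule database; coefficients arrive from the cloud
# as a tiny table and can be updated far more often.
base_weights = {"rule_101": 40, "rule_102": 60, "rule_103": 100}
coefficients = {"rule_101": 1.0, "rule_102": 0.5, "rule_103": 0.0}  # 0.0 = rule switched off

def effective_weight(rule_id):
    return base_weights[rule_id] * coefficients.get(rule_id, 1.0)

print({r: effective_weight(r) for r in base_weights})
# -> {'rule_101': 40.0, 'rule_102': 30.0, 'rule_103': 0.0}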
Astraea also widened the statistical base for calculating ratings: the calculations used signals not only from the various emulators, but also from lots of other sensors connected to KSN. Besides, the product could get a previously issued verdict from the cloud (KSN), thus skipping the process of emulation altogether. And there’s one more pleasant bonus: we can reliably pick out of the stream unknown ‘species’ we don’t yet have much data on – but which nevertheless act suspiciously – for manual analysis.
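To picture the emulation-skipping shortcut mentioned above, here’s a simplified sketch – the callback names are assumptions, not our real interfaces:

import hashlib

def scan(path, ksn_verdict, emulate_and_rate):
    # ksn_verdict and emulate_and_rate are assumed callbacks, not real product APIs.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    verdict = ksn_verdict(digest)      # ask the cloud for a previously issued verdict
    if verdict is not None:
        return verdict                 # known object: local emulation skipped entirely
    return emulate_and_rate(path)      # unknown object: full local SR analysis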
What’s really important is that Astraea corrects rules automatically; a human expert is only needed for regularly evaluating the effectiveness of the mathematical model applied and optimizing it (patent application US20140096184).
Getting our mitts on global big data immediately saw us come up with new ideas for solving old problems: first of all, the problem of false positives.
We’d been experimenting with using SR in the ongoing fight against false positives ever since its first appearance in our products. But it was in 2011 that things really got going in this respect: we rolled out several new features for minimizing false positives in one fell swoop.
There are many operations executed by legitimate software with fully peaceful aims. For example, installers delete files in the System32 folder. So auto-regulation of this operation’s rating leads to its groundless degradation, and we start to miss real maliciousness. A compromise is needed, because you can’t have your cake and eat it too. So we decided to divide the rating-calculation mechanism into three parts:
First: the calculation described above – the more dangerous the behavior and the more often it’s encountered, the higher the rating.
Second: a sort of whitelist of rules, which revoke or correct the effects of the usual rules in specific situations or for specific files.
Third: detection rules for legitimate applications, which lower the danger rating when typical legitimate behavior is found, and which can even produce a safety rating.
Example:
‘Creation of an autorun registry key’ rule
API function: Registry: setting a parameter value (RegSetValueEx)
Argument 1: Contains entry
‘Registry\Machine\Software\Classes\*\shellex\ContextMenuHandlers\Notepad++’
Argument 2: *
Argument 3…N: *
Assessment: Single operation – 1%, 2-3 operations – 1%, >3 operations – 1%
Harmful: No
Here we can clearly see that the registry key is being accessed; however, it’s only Notepad++ registering its DLL. The rule’s argument removes false positives, while the main rule remains intact: upon other attempts to change the key, it will work as it’s meant to.
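To show how the layers might interact, here’s a minimal, purely illustrative sketch: an ordinary rule adds to the danger rating, while a whitelist rule cancels the known benign Notepad++ case without weakening the rule itself (the paths and weights are invented; a legitimate-application rule from the third layer would work the same way, only pushing the rating further down):

def rate_registry_event(key_path):
    rating = 0
    # Ordinary rule: writing a shell-extension / autorun key is suspicious.
    if "ContextMenuHandlers" in key_path:
        rating += 40
    # Whitelist rule: Notepad++ registering its own context-menu DLL is a known
    # benign case, so the rating is forced down to a nominal 1%.
    if key_path.endswith(r"ContextMenuHandlers\Notepad++"):
        rating = 1
    return rating

print(rate_registry_event(r"\Registry\Machine\Software\Classes\*\shellex\ContextMenuHandlers\Notepad++"))  # -> 1
print(rate_registry_event(r"\Registry\Machine\Software\Classes\*\shellex\ContextMenuHandlers\EvilExt"))    # -> 40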
Later in 2011 we introduced yet another (we don’t mess about, you know) useful feature.
As mentioned above, rules in SR worked independently of one another; therefore, we couldn’t study complex interdependencies like ‘load a file – save it to disk – add it to an autorun key’. But if we were able to track such interdependencies, it’d be possible to give ratings that are more than just the sums of the ratings of separate events. Or less :). So we decided to enable correlation of events in SR2 for more precise detection of unknown malware.
We did this in two ways.
First, we created bit masks, which combine groups of rules or separate rules with OR and AND. The basis here is a bit index of behavior classifications. Initially this was thought up for clustering malware by the specifics of its behavior, but a similar approach can also be applied to refining rating assessments. Indeed, with the help of masks we can implement functions like (RULE76 or RULE151) and (RULE270 or RULE540). The good thing about such masks is their compactness and speed; the limitation is their inflexibility.
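Here’s a compact sketch of the bit-mask idea (the rule numbers are taken from the expression above; everything else is invented): every triggered rule sets one bit, and a combination is a list of masks – OR within a mask, AND between masks.

# One bit per rule; Python integers can hold arbitrarily many bits.
RULE76, RULE151, RULE270, RULE540 = (1 << 76), (1 << 151), (1 << 270), (1 << 540)

def combination_triggered(triggered_bits, masks):
    # AND between masks: every mask must share at least one bit (OR) with the triggered set.
    return all(triggered_bits & mask for mask in masks)

# (RULE76 or RULE151) and (RULE270 or RULE540)
combo = [RULE76 | RULE151, RULE270 | RULE540]

print(combination_triggered(RULE151 | RULE540, combo))   # -> True
print(combination_triggered(RULE76 | RULE151, combo))    # -> False (second group not met)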
Second, we developed special scripts to carry out global analysis after SR’s calculations (patent US8607349). The scripts can be launched independently one after another, or whenever a particular rule is triggered. Each of them has access to the database of accumulated statistics of earlier triggered rules and groups of rules. As a consequence, we got the ability (i) to use complex logic – conditions, calculations, loops, and calls to subroutines; (ii) to use neural networks to the max; and (iii) to use scripts not only to refine SR ratings, but also to derive new knowledge that can be applied by subsequent scripts.
For example, on the basis of analysis of a dozen rules, the first script may decide that the ‘application tries to get the passwords of other programs’. A second script decides that the ‘application transfers something onto the Internet’. While a third script decides that ‘if the application shows an interest in passwords and transfers something onto the Internet, then it gets a +100% rating’.
Besides, scripts can be attached to any rule, with that rule effectively becoming a trigger for some kind of algorithm.
An example of a script:
var
  S : string;
begin
  if copy(ExtractFilePath(APIArg[1]), 2, 100) = ':\' then
  begin
    AddSR(20);
    S := LowerCase(ExtractFileExt(APIArg[1]));
    if S = '.exe' then AddSR(40);
    if S = '.com' then AddSR(50);
    if S = '.pif' then AddSR(60);
    if S = '.zip' then AddSR(-20);
  end;
end.
In this example the script assesses the operation of creating a file. We check whether the file was created in the root of a disk, and if so we add 20% to the SR rating. Then, depending on the file extension, we add an extra rating with a ‘+’ or ‘–’ sign.
This example highlights the main advantage of scripts: the ability to carry out a complex, differentiated assessment of a function’s arguments, assigning an individual SR rating based on the results of different checks. What’s more, some checks can increase the rating while others lower it, which makes it possible to run complex checks and analysis aimed directly at further suppressing false negatives.
And now a little about the future…
We’ve already started rolling out our 2015 personal product line. We thought long and hard… and finally decided to give up on local SR and to instead fully transfer the calculation of ratings to the cloud.
Such an approach instantly gives us many advantages: the quality of analysis doesn’t suffer, while the resources required on a protected computer are reduced – since all the computing is done in the cloud. As to the delay in the delivery of verdicts, it amounts to… well, practically nothing – fractions of a millisecond, noticeable only to special software; our dear users sure won’t notice!
So there you have it. A very brief look at our Coca-Cola-like ‘secret’ magic formula in a little over 2000 words :). But of course it’s just the tip of the iceberg: a more detailed description of the technology would take several days. All the same, if you do want more detail, let me hear from you – in the comments below!