October 18, 2011
The Holy Grail of AV Testing, and Why It Will Never Be Found
So, my expectations were fulfilled. My recent post on an AV performance test caused more than a bit of a stir. But that stir was not so much on the blog as in and around the anti-malware industry.
In short, it worked – since the facts of the matter are now out in the open and being actively discussed. But that’s not all: let’s hope it won’t just stimulate discussion, but also bring about the much-needed change in the way AV tests are done – a change that’s years overdue, and one I’ve been “campaigning” for for years.
So, how should AV be tested?
Well, first, to avoid insults, overreaction and misplaced criticism, let me just say that I’m not here to tell testers how to do their job in a particular way so that our products come out on top – to have them use some special recipe we know we’re better at than everyone else. No, I’m not doing that; and anyway, it’s rare that we don’t figure in the top three in the various tests, so, like, why would I want to?
Second – what I’ll be talking about here isn’t something I’ve made up, but based on the established industry standards – those of AMTSO (the Anti-Malware Testing Standards Organization), on the board of which sit representatives of practically all the leading AV vendors and various authoritative experts.
Alas, the general public – and even the progressive, IT-savvy world community – mostly doesn’t know about AMTSO. But even the majority of testers themselves appear to take no notice of AMTSO, still preferring to do their tests the same old way they’ve always done them. That’s understandable – the old way’s cheap and familiar, and the user, allegedly, is interested in just one simple thing: the resulting ratings – who takes first, second and third place, and who gets the raspberry award.
It would appear everyone’s happy with the status quo (I personally can’t stand In the Army Now, but it seems I’m in the minority on this one). But no one would be happy if they only knew the facts. Old-school testing really distorts the true picture. As a result, it’s not the best software that takes the gold, and the other rankings bear practically no relation to the real levels of protection provided. In short, it’s all just a lot of misleading nonsense, with the consumer getting conned.
Why am I getting so steamed up about this?
It just seems a pity that AV firms’ time and resources are spent not on getting their products to do their jobs properly, but on chasing the best ranking in the same old BS testing – on getting a result no worse than those who hone their products not to achieve real quality, but only to pass tests better than the rest.
Right. Now let’s turn from the borscht starter to the pelmeny main course.
How NOT to Test
The classic, bad old, on-demand test.
This is the most conventional, standard and familiar test. Once, long ago, before the mass-Internet era, it really was the best type of test. Incidentally, in 1994 our AVP made its international premiere on the back of just such a test – one carried out by the University of Hamburg. It was the first time we’d taken part in testing, yet we still swept aside the competition :)
The on-demand testing method goes like this: you take a hard disk and stuff it full of malware – the more, and the more varied, the better; indeed, anything and everything you can get your hands on goes in. Then you set different anti-virus scanners upon that disk and measure the number of detections. Cheap and easy. And nasty. And for ten years now utterly meaningless!
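To make it concrete, the whole methodology boils down to something like this rough sketch (the sample folder and scanner command lines are hypothetical – every product has its own interface, and I’m assuming here that a scanner signals a detection via a non-zero exit code):

```python
# A minimal sketch of an on-demand test harness (scanner CLIs are hypothetical).
import subprocess
from pathlib import Path

SAMPLES = Path("/test/malware_collection")   # the disk stuffed full of malware

def detection_rate(scanner_cmd: list[str]) -> float:
    """Run a scanner over every sample and return the share of files it flags."""
    samples = [p for p in SAMPLES.rglob("*") if p.is_file()]
    detected = 0
    for sample in samples:
        # Assumption: the scanner exits with a non-zero code when it detects something.
        result = subprocess.run(scanner_cmd + [str(sample)], capture_output=True)
        if result.returncode != 0:
            detected += 1
    return detected / len(samples) if samples else 0.0

for name, cmd in {"scanner_a": ["scanner_a", "--scan"],
                  "scanner_b": ["scanner_b", "--scan"]}.items():
    print(f"{name}: {detection_rate(cmd):.1%}")
```

Count the detections, rank the scanners, publish the table – that really is all there is to it.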
Why? Because anti-virus signature, heuristic and other “scanning” engines make up only a small part of the whole complex of technologies used in real-life protection. What’s more, the relevance of these engines to the overall level of protection is falling fast as time goes by and the AV task mutates. Besides, a scanner these days mostly works as a means of last resort for purely surgical work: for example, our System Watcher first tracks a Trojan, builds up the whole picture of the infection, and only then gives the scanner the task of weeding it out.
Another shortcoming of on-demand testing relates to the malware collection used for scanning. Here there are two extremes – both of them unsound. There can be too few malware files, making the test irrelevant. Or too many – also making the test irrelevant: in a mega-collection there’s just too much garbage (corrupt files, raw data files, kinda-malware – scripts that use malware, for example – and so on), and cleaning a collection to get rid of such rubbish is one heck of a difficult and thankless task. Not to mention poorly paid.
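To give a feel for why that cleanup is such hard graft, here’s a very rough sketch of the kind of automated triage a tester might attempt (the checks and path are illustrative only – real collection hygiene goes far deeper):

```python
# Very rough sketch of mega-collection triage (illustrative checks only).
import hashlib
from pathlib import Path

def triage(collection: Path):
    """Weed out the most obvious garbage: empty files, duplicates, random data."""
    seen, kept, dropped = set(), [], []
    for path in (p for p in collection.rglob("*") if p.is_file()):
        data = path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        if not data:                                   # corrupt / truncated files
            dropped.append(path)
        elif digest in seen:                           # exact duplicates
            dropped.append(path)
        elif not (data.startswith(b"MZ") or path.suffix.lower() in {".js", ".vbs", ".html"}):
            dropped.append(path)                       # neither a Windows executable nor a script
        else:
            seen.add(digest)
            kept.append(path)
    return kept, dropped
```

And even then the real headache – telling genuine malware from kinda-malware – still needs a human analyst, not a script.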
And finally, the most significant defect of on-demand testing is that it’s possible to fine-tune a product specifically for such tests in order to achieve excellent results. And tuning a product for a test is elementary, dear Watson – all that needs to be done is to detect the particular files used in the test. You get my drift?
To show a near-100% result in “scanning tests”, an AV firm doesn’t need to go to all the trouble of actually raising the quality of its technology. No sir. All it needs to do is detect everything that shows up in the tests. It’s almost like the joke about the hunters being chased by the bear:
“Yikes! The bear’s running faster than us!” says the first hunter.
“No worries – I don’t need to run faster than the bear, I just need to run faster than you!” says the second.
To win these tests you don’t have to run faster than the bear. Instead you just plug into the malware sources used by the best-known testers (and these sources are well known – VirusTotal, Jotti, and the malware-swapping between AV companies), and then detect everything that all the others detect; that is, if a file is detected by competitors, simply detect it by its MD5 hash or something similar. No in-depth research or superior technologies for combating real-life attacks are needed.
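To illustrate just how little technology that trick requires, here’s a schematic sketch of “detection” by hash alone (the hash values and the borrowed list are, of course, made up for the example):

```python
# Schematic sketch of "detection" by hash alone – no analysis, no real technology.
import hashlib
from pathlib import Path

# Imagine these MD5s were harvested from a shared sample feed and from whatever
# the competitors already detect (placeholder values, obviously).
KNOWN_BAD_MD5 = {
    "00000000000000000000000000000000",
    "11111111111111111111111111111111",
}

def detect(path: Path) -> bool:
    """Flag a file purely because its hash is on the borrowed list."""
    return hashlib.md5(path.read_bytes()).hexdigest() in KNOWN_BAD_MD5
```

A “scanner” like that will ace the test collection while knowing precisely nothing about real-life attacks.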
To demonstrate this I’d be willing to construct a scanner from scratch – with the help of a pair of developers – and have it reach a 100% detection rate within a couple of months. Why would I need a pair of developers? Just in case one of them falls sick.
In short, on-demand tests are bogus: they show nothing real, it’s easy to tune a product to them, and they’re extremely difficult to win honestly.
How Tests CAN Be Done, but Only for Specialist Audiences
There’s also a whole bunch of niche tests that measure the quality of anti-virus in very specific ways.
In principle, I guess they have a right to exist, and they can be extremely useful for comparing the quality of specific features. However, they should come with caveats, stated in large bold print, about this narrow focus and the fact that such tests do NOT take into account the full functionality of products. These tests are not much good at all to the general public, i.e., non-specialists. They are strictly industry tests, yielding data useful only to IT security experts and other such geeks.
Some examples of these specialist-only tests:
- a disinfection test (how a product copes with cleaning up a system already infected with specific malware);
- a false positives test (how many erroneous infection notifications a product gives on a clean system);
- a proactive test (how a product catches malware without a signature – that is, using only proactive technologies);
- an on-access test (measuring the quality of the on-access scanner during real-time operations with malware);
- performance and ease of use of the interface; etc.
The above tests only measure specific things. It’s like with cars – time taken to go from 0 to 60 mph, braking distance, gas consumption – they’re all individual figures, with little interconnection among them. Besides, the best result in one of them may be entirely unusable in a real-life product, a bit like using a Formula 1 racing car to pop to the shops :)
Finally, How Tests SHOULD Be Done – the Ones to Take Notice of…
First, as Ali G would say, let’s “rewind”. What are tests needed for anyway? First and foremost, to demonstrate the quality of protection. Usability and efficiency are of course also important, but ultimately, they’re only chaff, and certainly need separating from the wheat.
So how do you test quality of protection?
It stands to reason that it should be done in an environment that mirrors reality as closely as possible. The methodology of a good test must be based on the most common and widespread scenarios of what users face in real life. Everything else is incidental stuff that just doesn’t matter.
And so we get dynamic or real world tests.
The idea’s simple: you take a typical computer, install anti-virus software on it with its settings at default, and try by all means available to launch current malware. That’s it – simple! The product works at full steam, and the environment is as near to real-life conditions as possible. Though simple, this gives the user the most accurate measurement there is of the quality of an anti-virus product, and thus the relevant information to make a rational purchasing choice. Then, to account for system resources used, the size of updates and a product’s price, each category of testing is given a weighting applied to the marks earned – and the results are ready.
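As a back-of-the-envelope illustration of that last weighting step (the categories, marks and weights here are invented purely for the example):

```python
# Back-of-the-envelope weighting of test categories (numbers invented for the example).
scores = {                 # marks earned in each category, say on a 0-100 scale
    "protection": 92,
    "performance": 80,
    "update_size": 70,
    "price": 60,
}
weights = {                # protection should dominate; the rest counts for far less
    "protection": 0.6,
    "performance": 0.2,
    "update_size": 0.1,
    "price": 0.1,
}
overall = sum(scores[c] * weights[c] for c in scores)
print(f"overall: {overall:.1f}")   # 92*0.6 + 80*0.2 + 70*0.1 + 60*0.1 = 84.2
```

How the weights are chosen is, of course, a matter for the tester – the point is simply that protection has to carry the bulk of the weight.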
It’s that simple, right? Alas – no.
Besides the difficulty of getting the malware selection right for a test, there is the greater difficulty of recreating conditions that are as close to reality as possible, since they are extremely hard to automate. That is, such tests demand a great deal of laborious manual work.
And it’s for this very reason that such tests are conducted extremely rarely and in a rather truncated manner. In my previous post on performance testing I listed a few of them. But even they have their own little quirks and nuances.
In summary: what proper testing should look like is clear and obvious to me. But where do you find the crazy folk willing to take it upon themselves to conduct such tests – and for free? I don’t know. Thus, sadly, proper results from proper tests are these days simply nowhere to be found, despite their Holy Grail status.
And on that somber minor chord, we conclude this gloomy piece of music.