The Holy Grail of AV Testing, and Why It Will Never Be Found

So, my expectations were fulfilled. My recent post on an AV performance test caused more than a bit of a stir. But that stir was not so much on the blog as in and around the anti-malware industry.

In short, it worked – since the facts of the matter are now out in the open and being actively discussed. But that’s not all: let’s hope it won’t just stimulate discussion, but also bring about the much-needed change in the way AV tests are done – a change that’s years overdue, and one I’ve been “campaigning” for for years.

So, how should AV be tested?

Well, first, to avoid insults, overreaction and misplaced criticism, let me just say that I’m not here to tell testers how to do their job in a certain way so that our products come out top – to have them use our special recipe, the one we know we’re better at than everyone else. No, I’m not doing that; besides, it’s rare for us not to figure in the top three of the various tests anyway, so, like, why would I want to?

Second – what I’ll be talking about here isn’t something I’ve made up; it’s based on established industry standards – those of AMTSO (the Anti-Malware Testing Standards Organization), on whose board sit representatives of practically all the leading AV vendors and various authoritative experts.

Alas, the general public, and even the progressive IT-savvy world community, mostly don’t know about AMTSO. But the majority of testers themselves appear to take no notice of AMTSO either – still preferring to do their tests the same old way they’ve always done them. That’s understandable, though – the old way’s cheap and familiar, and the user, allegedly, is interested in just one simple thing – the resultant ratings: who takes first, second and third place, and who gets the raspberry award.

It would appear everyone’s happy with the status quo (I personally can’t stand In the Army Now, but it seems I’m in the minority on this one). But no one would be happy if they only knew the facts. Old-school testing really distorts the true picture. As a result, it’s not the best software that takes the gold, and the other rankings hardly correspond to the real levels of protection provided either. In short, it’s all just a lot of misleading nonsense, with the consumer getting conned.

Why am I getting so steamed up about this?

It just seems a pity that AV firms’ time and resources are spent not on getting their products to do their jobs properly, but on chasing the best ranking in the same old BS-testing – on scoring no worse than those who hone their products not for real quality, but only to pass the tests better than the rest.

Right. Now let’s turn from the borscht starter to the pelmeny main course.

How NOT to Test

The classic, bad old, on-demand test.

This is the most conventional, standard and familiar test. Once, long ago, before the mass-Internet era, it used to be the best type of test. Incidentally, in 1994 our AVP made its international premiere on the back of such a test – one carried out by the University of Hamburg. It was the first time we’d taken part in testing, but we still swept the competition :)

The on-demand testing method goes like this: you take a hard disc and stuff it full of malware – the more, and the more varied, the better; indeed, anything and everything you can get your hands on goes in. Then you set different anti-virus scanners upon that disc and measure the number of detections. Cheap and easy. And nasty. And, for ten years now, utterly meaningless!
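
Scoring such a test amounts to almost nothing, which is part of the problem. Here’s a minimal sketch of the whole procedure; the scanner command line and its exit-code convention are hypothetical, since every product has its own CLI:

```python
# A sketch of classic on-demand scoring; the scanner command and its
# exit-code convention are assumptions - every product has its own CLI.
import subprocess
from pathlib import Path

def scanner_flags(scanner_cmd: list[str], sample: Path) -> bool:
    """Assume the scanner exits non-zero when it detects something."""
    result = subprocess.run(scanner_cmd + [str(sample)], capture_output=True)
    return result.returncode != 0

def on_demand_score(scanner_cmd: list[str], collection_dir: str) -> float:
    """Detection rate over a directory of samples - the only number such tests report."""
    samples = [p for p in Path(collection_dir).iterdir() if p.is_file()]
    detected = sum(scanner_flags(scanner_cmd, s) for s in samples)
    return detected / len(samples)
```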

Why? Because anti-virus signature, heuristic and other “scanning” engines make up only a small part of the whole complex of technologies used in real-life protection. What’s more, the relevance of these engines to the overall level of protection is falling fast as time goes by and the AV task mutates. Besides, a scanner mostly works as the means of last resort for purely surgical work: for example, our System Watcher first tracks Trojans and builds up the whole picture of the infection, and only then gives the scanner the task of weeding it out.

Another shortcoming of on-demand testing relates to the malware collection used for scanning. Here there are two extremes – both of them unsound. There can be too few malware files – making the test irrelevant. Or too many – also making the test irrelevant: in a mega-collection there’s just too much garbage (corrupt files, raw data files, kinda-malware such as scripts that merely call malware, etc.), and cleaning a collection of such rubbish is one heck of a difficult and thankless task. Not to mention poorly paid.

And finally, the most significant defect of on-demand testing is that it’s possible to fine-tune a product specifically for such tests in order to achieve excellent results. And product tuning for a test is elementary, dear Watson – all that needs to be done is to detect the particular files used in the test. You get my drift?

In order to show a near-100% result in “scanning tests”, an AV firm doesn’t need to go to all the trouble of actually raising the quality of its technology. No sir. All it need do is detect everything that shows up in the tests. It’s almost like the joke about the hunters being chased by the bear:

“Yikes! The bear’s running faster than us!” says the first hunter.

“No worries – I don’t need to run faster than the bear, I just need to run faster than you!” says the second. 

To win in these tests you don’t have to run faster than the bear. Instead, you just suck up to the sources of the malware used by the best-known testers (and these sources are well known – VirusTotal, Jotti, and the malware-swapping between different AV companies), and then detect everything that everyone else detects; that is, if a file is detected by competitors, simply add detection for it by MD5 hash or something similar. No in-depth research and no superior technologies to combat real-life attacks are needed.
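
To make the “run faster than the other hunter” trick concrete, here’s a toy sketch of what such “detection” amounts to – an MD5 lookup against hashes harvested from the same shared feeds the testers draw on, with no analysis of behaviour at all. The hash values and function names are purely illustrative:

```python
# A toy illustration of hash-based "detection": no behavioural analysis, no research -
# just a lookup against MD5s harvested from shared sample feeds.
# The hash values below are placeholders, not real samples.
import hashlib
from pathlib import Path

HARVESTED_MD5S = {
    "0123456789abcdef0123456789abcdef",  # placeholder entries standing in for
    "fedcba9876543210fedcba9876543210",  # hashes scraped from shared feeds
}

def md5_of(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def cheap_detect(path: Path) -> bool:
    # Scores near 100% on a test set built from those same feeds,
    # yet says nothing about protection against a genuinely new attack.
    return md5_of(path) in HARVESTED_MD5S
```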

To demonstrate this I’d be willing to construct a scanner from scratch – with the help of a pair of developers – and have it reach a 100% detection rate within a couple of months. Why would I need a pair of developers? Just in case one of them falls sick.

In short, on-demand tests are bogus: they show nothing real, it’s easy to tune a product to pass them, and they’re extremely difficult to win honestly.

How Tests CAN Be Done, but Only for Specialist Audiences

There’s also a whole bunch of niche tests that measure the quality of anti-virus in very specific ways.

In principle, I guess they have a right to exist, and they can be extremely useful for quality comparisons of specific features. However, they should come with caveats stated in large bold print about their narrow focus and the fact that they do NOT take into account the full functionality of products. These tests are not much good at all to the general public, i.e., non-specialists. They are strictly industry tests, yielding data useful only to IT security experts and other such geeks.

Some examples of these specialist-only tests:

  • a disinfection test (how a product copes with cleaning up a system infected with specific malware);
  • a false positives test (how many erroneous infection-notifications a product gives on a clean system);
  • a proactive test (how a product catches malware without a signature – that is, using only proactive technologies);
  • an on-access test (measuring quality of an on-access scanner in real-time operations with malware);
  • performance and interface ease-of-use; etc.

The above tests only measure specific things. It’s like with cars – time to go from 0 to 60 mph, braking distance, gas consumption – they’re all individual metrics, with little interconnection among them. Besides, the best result in one of them may be entirely unusable in a real-life product – a bit like using a Formula 1 racing car to pop to the shops :)

Finally, How Tests SHOULD Be Done – the Ones to Take Notice of…

First, as Ali G would say, let’s “rewind”. What are tests needed for anyway? First and foremost, to demonstrate the quality of protection. Usability and efficiency are of course also important, but ultimately they’re only the chaff, which certainly needs separating from the wheat.

So how do you test quality of protection?

It stands to reason that it should be done in an environment that mirrors reality as closely as possible. The methodology of a good test must be based on the most common and widespread scenarios of what users face in real life. Everything else is incidental stuff that just doesn’t matter.

And so we get dynamic or real world tests.

The idea’s simple: you take a typical computer, install anti-virus software on it with its settings at default, and try by all means available to launch current malware. That’s it – simple! The product works at full steam, and the environment is as near to real-life conditions as possible. Though simple, this gives the user the most accurate measurement of the quality of an anti-virus product there is, and thus the relevant information for making a rational purchasing decision. Then, to account for system resources used, the size of updates and the product’s price, each category of testing is given a weighting applied to the earned marks – and the results are ready.
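
As a rough sketch of that final weighting step – the category names, marks and weights below are purely illustrative, not any lab’s actual methodology:

```python
# Combining per-category marks into a final result, assuming marks on a common 0-100 scale.
# Category names and weights are illustrative, not any lab's actual methodology.
def weighted_result(marks: dict[str, float], weights: dict[str, float]) -> float:
    total_weight = sum(weights.values())
    return sum(marks[c] * weights[c] for c in weights) / total_weight

marks   = {"protection": 92.0, "performance": 80.0, "updates": 70.0, "price": 60.0}
weights = {"protection": 0.6,  "performance": 0.2,  "updates": 0.1,  "price": 0.1}
print(weighted_result(marks, weights))  # protection should dominate the final score
```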

It’s that simple, right? Alas – no.

Besides the difficulty of getting the malware selection right for a test, there’s the greater difficulty of creating conditions as close to reality as possible, since these are extremely hard to automate. That is, such tests demand a great deal of manual, mechanical work.

And it’s for this very reason that such tests are conducted extremely rarely and in a rather truncated manner. In my previous post on performance testing I listed a few of them. But even they have their own little quirks and nuances.

In summary: what proper testing should look like is clear and obvious to me. But where do you find the crazy folk who would take it upon themselves to conduct such tests – and for free? I don’t know. Thus, sadly, proper results from proper tests are these days simply nowhere to be found, despite their Holy Grail status.

And on that somber minor chord, we conclude this gloomy piece of music.

Comments

    Simon Edwards

    It is right that such tests exist. AV Test, AV Comparatives and Dennis Technology Labs* all perform these labour-intensive tasks. The problem is that these testers must be paid by someone, and if that someone is a vendor there is always the suspicion of bias or worse from other vendors and from the consumer**.

    My serious suggestion is that all vendors in a test pay the tester the same amount of money before the results are delivered. The tester then needs to communicate well with those vendors whose products do not perform very well – in an effort to show where the problems lie.

    There is a downside to this, though… When testers and vendors work this closely the consumer may believe that there is an industry conspiracy!

    So I’m glad to learn that my (DTL*) tests are the holy grail. Now we need to work on them being paid for by everyone :)

    * http://www.dennistechnologylabs.com
    ** Let’s not forget that some consumers are fans of certain vendors and products, so they can be biased too!

    Timur Tsoriev

    I like this idea. But it seems that to ensure the success and credibility of such tests it needs to be an industry-level project, with the majority of the largest vendors involved. Maybe AMTSO could act as the framework for such a project? Let’s see if we can escalate this idea and start a discussion with other vendors and AMTSO.

    Mac

    It’s funny how you silently put your terribly biased “Dennis tests” alongside those two well-known AV labs. Sorry sir, but Norton, the “forever winner” in your own tests, is one of the products to stay away from.

    Keith P

    Good post Eugene! Long overdue and dead on…

    Cliff

    Eugene, you are perfectly right. However, keep in mind that managers/CEOs with AV experience are a minority in the industry :) All others can’t tell the difference between a sandbox engine that generically detects 90% of a sample set and a simple CRC32-based engine that uses signatures to detect 100% of that sample set. I can safely estimate that most of them will prefer the CRC32 engine because it’s probably faster (and has better detection).
    Since most CEOs don’t understand technology at all, they set “quantifiable” objectives for their teams (e.g. 98+% in Av-Test, etc.). Thus, the technical teams are sometimes forced to drop complex and innovative detection technologies in favour of “automatic signature grabbers”. Sad, but true. That’s one of the reasons why some engines have 3 million signatures, while others have 30+ million signatures :)

    The solution is obvious: better testing procedures that favour “true” detection technologies and true innovation.

    Ed

    So Gene,

    Is this the same reasoning why, while I sit here and watch “Ghost Hunters”, they don’t find the ghost? ’Cause it’s entertaining.

    As an “old” dude I was taught… “if it ain’t broke – don’t fix it. But if you have to fix it… do it right.”

    And what makes Kaspersky’s fix right… I firmly believe that with 25 years of computers something is wrong with this laptop, but I can’t find the problem. It’s as small as a mistyped key, or an internet connection that fails, or a quick blip on the screen.

    You should know that with MS-DOS Solitaire my wife put a black card on a black card and kept playing. The problem is fundamental, but do these people who solve the AV problem… create the AV problem to solve it?

    Ed

