Here it is, the original Spam! Hmmm, yummy… but healthy? Is anything in a tin? Ok, will leave off the foodie lecturing just for today…
// It’ll be interesting to see if this post with the above pic in it will get through the anti-spam filters of those who subscribe to my mail-outs.
So here we are once again on a subject that it seems will never go away – spam, this time about a particular kind thereof – “image spam” – and the protective technologies that fight it.
I’ll start with a brief bit of historical background.
The first time image spam was released into the world and what it looked like are unknown. What is known is that it had become a mass phenomenon by 2003. It shot to fame and became mega-popular the world over, but then – much like, say, Frankie Goes to Hollywood – its popularity fizzled out fairly rapidly. Within just a few years its share of total spam soared up to 40%, only to fall back down to today’s 6-7%. One hit wonder, indeed.
If one thinks about it, putting advertising texts into graphics was not only a logical move for the cyber wrong ‘uns to make, it also fitted into a certain philosophical view of the world and the eternal conflict between attackers and the attacked, or weapons and armor. Let me explain this…
By 2003 only the very lazy or ignorant hadn’t installed a security solution that filtered plain text spam. Providers used (mostly free) server solutions, while anti-spam modules in home antivirus products were becoming commonplace. The quality of these technologies (especially compared to today’s) was never much to write home about, but at the same time we can’t really diss them too much either since after all they did in fact ‘work’ – they achieved the required result: the effectiveness of simple text spam decreased greatly. But dodgy cyber business is dodgy cyber business, so spammers looked for another way to foul up e-inboxes. And a way was found. You’ve guessed it – with image spam.
Image spam around 2003-2004 was reaaaally basic. Oh, how we’d run rings round this simpleton if it appeared these days. Back then spammers just put their text into GIFs or JPEGs and mass-sent them out through the usual channels. And though the then-anti-spam solutions were primitive and provided very little real protection from such mailings, they quickly got the better of this enemy by simply detecting the inserted images using meta-data (e.g., by the hashes of different characteristics) and using other proven ways like checking sender reputations and public blacklists (which still worked well back then).
A while later a kind of arms race began. Spammers realized that image spam had a bright future – if it would only get past the teething stages to become more sophisticated – since at that time there was still no anti-spam software capable of recognizing the contents of images.
First, spammers made special robots that automatically slightly changed each and every picture sent out (pixel added -> hash changed -> spam not detected -> new signature needed).
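To see why that one-pixel trick was so effective against exact-match signatures, here's a tiny sketch. The 8x8 "image", the choice of SHA-256 and the names `fingerprint` and `is_known_spam` are all made up for illustration; real filters of the time used their own fingerprints.

```python
import hashlib

def fingerprint(pixels: bytes) -> str:
    """Exact-match fingerprint of raw pixel data (SHA-256 chosen only for the demo)."""
    return hashlib.sha256(pixels).hexdigest()

# Hypothetical 8x8 grayscale spam image: 64 bytes, one per pixel.
original = bytes([255] * 64)

# The spammer's robot nudges a single pixel by one shade.
mutated = bytearray(original)
mutated[10] = 254
mutated = bytes(mutated)

# The filter only knows the fingerprint of the image it has already seen.
blacklist = {fingerprint(original)}

def is_known_spam(pixels: bytes) -> bool:
    return fingerprint(pixels) in blacklist

print(is_known_spam(original))  # True
print(is_known_spam(mutated))   # False
```

To the human eye the two pictures are identical, but to the filter the second one is brand new.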
Then, when the first technologies arrived to filter out this neat trick, the spammers came up with all sorts of new and fantastic methods of adding noise, distortion and disguises to image spam. They used colorful backgrounds and exotic, uneven fonts; twisted their drawings and divided them up into segments; added “jumping” letters; mixed text up with graphics; sent out animated GIFs; and so on.
And all to complicate anti-spam’s task of extracting content from graphics in order to analyze it and if necessary ban it. But the anti-spam developers gave as good as they got in response too; they thought of everything! Some even went so far as to stuff full-featured OCR into spam filters… but that never really took off. (First, OCR isn’t designed for extreme noise and distortion, so it needs to be worked on further, and that’s not easy. Second, it puts a huge drain on resources, rendering it impractical – especially for server-based solutions.)
But the content had to be extracted somehow to be able to beat these damn spammers. So how else could the contents of a picture be drawn out?
What was needed was some kind of compromise between resource usage and accuracy. And so – da-daaa! – this compromise was delivered by our anti-spam guys in 2003 – in the form of GSG technology.
Let me tell you a little about the magic of GSG – how it finds spam images to fish out spam emails. I can’t tell you everything, because then I’d give away all that precious know-how (spammers also read blogs). But I can tell you a few things.
First off, GSG has an object extractor. Its task is to clean up an image to remove the noise and spammer garbage, and then to extract from the cleaned-up image the outlines of discernible objects, and to send them off for analysis.
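Purely as an illustration of the general idea (and definitely not how GSG itself does it), cleaning up and extracting objects could look something like this toy sketch. It assumes the image has already been reduced to a black-and-white grid; the functions `remove_specks` and `extract_objects` and the sample picture are invented for the example, and bounding boxes stand in for real outlines.

```python
from collections import deque

def neighbours(y, x, h, w):
    """Yield the coordinates of the up-to-8 pixels surrounding (y, x)."""
    for ny in range(max(0, y - 1), min(h, y + 2)):
        for nx in range(max(0, x - 1), min(w, x + 2)):
            if (ny, nx) != (y, x):
                yield ny, nx

def remove_specks(grid):
    """Crude noise removal: drop 'on' pixels that have no 'on' neighbour."""
    h, w = len(grid), len(grid[0])
    return [
        [1 if grid[y][x] and any(grid[ny][nx] for ny, nx in neighbours(y, x, h, w)) else 0
         for x in range(w)]
        for y in range(h)
    ]

def extract_objects(grid):
    """Return a bounding box (x0, y0, x1, y1) for each 8-connected blob."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not grid[y][x] or seen[y][x]:
                continue
            # Flood-fill the blob, tracking its extent as we go.
            seen[y][x] = True
            queue = deque([(y, x)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in neighbours(cy, cx, h, w):
                    if grid[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes

# Hypothetical 7x5 picture: two blobs plus one stray noise pixel at x=5, y=1.
rows = [
    "1100000",
    "1100010",
    "0000000",
    "0001110",
    "0001110",
]
image = [[int(c) for c in row] for row in rows]
boxes = extract_objects(remove_specks(image))
print(boxes)  # [(0, 0, 1, 1), (3, 3, 5, 4)]
```

The stray pixel is thrown away as garbage, and the two real objects come out the other end ready for analysis.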
Then we go to work on these outlines. First, a heuristic text detector has a bash at them. At this point the system tries to figure out whether there’s any text in the image. It examines sizes, positioning, and other characteristics of the objects. Here we use a super intelligent algorithm that’s highly resistant to gimmicks such as jumping letters, different types of distortions, and graphical noise. As a result the detector builds a signature (data about the location and volume of any text found) and passes this signature to the next level.
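For the curious, here's a deliberately naive sketch of what a check on "sizes and positioning" might look like. It's nothing like the real detector: it assumes a single line of text, takes object bounding boxes as input, and the function name `detect_text`, its tolerances and the signature layout are all invented for this example.

```python
from statistics import median

def detect_text(boxes, min_glyphs=3, height_tol=0.5, jitter_tol=0.5):
    """Guess whether boxes (x0, y0, x1, y1) form a line of text.

    Objects count as glyphs if their height is close to the median height and
    their vertical centre is within a tolerance of the median centre, so
    modestly "jumping" letters still line up. Returns a signature or None.
    """
    if len(boxes) < min_glyphs:
        return None
    heights = [y1 - y0 + 1 for _, y0, _, y1 in boxes]
    centres = [(y0 + y1) / 2 for _, y0, _, y1 in boxes]
    med_h, med_c = median(heights), median(centres)
    glyphs = [
        box for box, h, c in zip(boxes, heights, centres)
        if abs(h - med_h) <= height_tol * med_h and abs(c - med_c) <= jitter_tol * med_h
    ]
    if len(glyphs) < min_glyphs:
        return None
    # Signature: where the text sits and how much of it there is.
    region = (
        min(b[0] for b in glyphs), min(b[1] for b in glyphs),
        max(b[2] for b in glyphs), max(b[3] for b in glyphs),
    )
    return {"region": region, "glyph_count": len(glyphs)}

# Four letter-sized objects whose baselines jump up and down by a few pixels.
jumping_letters = [(0, 10, 7, 21), (10, 12, 17, 23), (20, 9, 27, 20), (30, 11, 37, 22)]
print(detect_text(jumping_letters))
# {'region': (0, 9, 37, 23), 'glyph_count': 4}

# Objects of wildly different sizes, as in a photo: no text reported.
photo_like = [(0, 0, 99, 99), (0, 0, 1, 1), (50, 50, 60, 90)]
print(detect_text(photo_like))  # None
```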
Even if text isn’t found, for us this is no reason to give up. We let loose onto the picture an OCR-like tool, which very quickly searches for familiar objects in the outlines. Here some incredibly complex math goes down, such as the geometry of convex sets – of which I’ll spare you the details here. Briefly – geometrical characteristics (tangent angles, etc.) are pulled from the outlines, arcs and segments are made out, and the signature is formed.
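To give a feel for what "tangent angles, arcs and segments" can mean in practice, here's a schoolbook-level sketch; the real math is far hairier. It assumes an outline arrives as a list of (x, y) points, and the function `outline_signature`, its 5-degree tolerance and the run-length output format are inventions for this example.

```python
import math

def tangent_angles(points):
    """Direction of each edge of the outline, in radians."""
    return [math.atan2(y2 - y1, x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]

def turning_angles(angles):
    """Change of direction between consecutive edges, wrapped to [-pi, pi)."""
    return [(b - a + math.pi) % (2 * math.pi) - math.pi
            for a, b in zip(angles, angles[1:])]

def outline_signature(points, straight_tol=math.radians(5)):
    """Run-length encode the outline: 'S' for straight stretches, 'A' for curved ones."""
    signature = []
    for turn in turning_angles(tangent_angles(points)):
        label = "S" if abs(turn) <= straight_tol else "A"
        if signature and signature[-1][0] == label:
            signature[-1] = (label, signature[-1][1] + 1)
        else:
            signature.append((label, 1))
    return signature

# Hypothetical outline: a straight stroke that then bends through a quarter circle.
outline = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)] + [
    (2 + math.cos(math.radians(t)), 1 + math.sin(math.radians(t)))
    for t in (-60, -30, 0)
]
print(outline_signature(outline))  # [('S', 1), ('A', 3)]
```

A description like "one straight bit, then three curved bits" doesn't care about colors, backgrounds or a pixel added here and there, which is the whole point.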
The third stage is the most tangible and understandable, though by no means the easiest in terms of implementation. The obtained signatures are compared with samples of spam messages in the database. That bit’s straightforward enough. The not very easy bit: this database is maintained 24/7 by our anti-spam lab – a dedicated team of around 30 dear (in all senses of the word) analysts. Oh, and the remainder of this stage is also easy to comprehend – the system analyzes the weights of the developed signatures and takes a decision as to whether the image is spam or regular stuff such as graphics relating to sales or samples of new products.
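In spirit, the comparing-and-weighing part could be sketched like this. I stress "in spirit": the feature sets, the Jaccard similarity measure, both thresholds and the function `classify` are assumptions made up for the example, not the real database format or decision logic.

```python
def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two signatures treated as feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def classify(signatures, database, match_threshold=0.75, spam_threshold=1.0):
    """Add up the weights of signatures that resemble a known spam sample.

    signatures: list of (feature_set, weight) pairs built from one image.
    database:   list of feature sets taken from known spam images.
    Returns (is_spam, score).
    """
    score = 0.0
    for features, weight in signatures:
        best = max((similarity(features, sample) for sample in database), default=0.0)
        if best >= match_threshold:
            score += weight
    return score >= spam_threshold, score

# Hypothetical database with one text-layout sample and one outline sample.
database = [
    frozenset({"text:3-lines", "text:top-half"}),
    frozenset({"S3", "A2", "S1", "A4"}),
]

# A spam image: exact text-layout match plus a slightly distorted outline.
spam_image = [
    (frozenset({"text:3-lines", "text:top-half"}), 0.5),
    (frozenset({"S3", "A2", "S1", "A4", "S9"}), 0.7),
]
# A product photo from a legitimate newsletter: nothing resembles the samples.
newsletter_image = [(frozenset({"S8", "A1"}), 0.7)]

print(classify(spam_image, database)[0])        # True
print(classify(newsletter_image, database)[0])  # False
```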
And now for a little interactive test! Read the description of GSG again from start to finish and record how long it takes. How many seconds? Minutes? Hmm, well, GSG is a bit sprightlier than you – it spends just 10-40 milliseconds analyzing one picture!
And this little whiz-kid called GSG can be found in each of our (your) server and personal products.
And finally… a bit of Darwinism.
If image spam is so difficult to detect (as might appear), why has its share fallen from 40% to 6-7% over the last six years? When the effectiveness of anti-text-spam filters is higher than that of their anti-image-spam filter brothers, surely it would be worthwhile for spammers to use image spam, no? Nope – actually not. They’ve all but given up on it. But why?
I think there are two reasons, and they both boil down to one thing: sending spam has become easier and quicker.
First, the ecosystem of spam mailing has changed. In the early 2000s, spammers used both legal and semi-legal providers (and they had to change providers frequently). Now, 90% of spam uses botnets – and just try banning them! Second, bandwidth has grown real fast, as has broadband penetration.
Conditions seem to be just perfect for the creation and proliferation of “heavy” image spam. But no! Simple dumb spam remains way out in front. Why? It’s all quite simple – why go to all that trouble when it’s just not worth the effort? It’s easier and more effective to send more of the quick, dumb text spam than conjure up sophisticated image spam that’s real tricky to pull off. As a result, in the best tradition of Darwinism, the biggest and the strongest that made it through evolution here were not the most intelligent but the most prolific.
But spammers also have market relations. Image spam is still out there – with a market niche of its own, which caters to just a few select discerning customers.
So long, folks!