Filtering spam email is all about sifting through email content, to find dodgy words or phrases, right? Well... no. While that was certainly true in the email Dark Ages (ten years ago), today’s filters are completely different.
Spam filters sift through your incoming email and automatically decide which messages are spam and which are legitimate (or ham). It’s tremendously difficult to get right; to reach today’s levels of filter accuracy, anti-spam technologists have invested huge amounts of financial and intellectual capital over the years.
To decode the mysteries, read on...
The first truth to understand is that spam filters have a huge arsenal of techniques available to them. In the jargon of computer performance, some techniques are more expensive than others — that is, they need more computer horsepower.
Typically, a filter starts with the least expensive tests and performs more and more expensive tests until it’s sure whether the message is spam or ham. It does that by combining the results of the tests — sometimes known as a cocktail spam score. Tests that analyse the message content tend to be the most expensive, so they’re actually rarely used.
So, despite what many people believe, looking for naughty words and phrases is no longer the most common test for spam.
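To make the "cheapest first" idea concrete, here is a minimal Python sketch of a cascading score. The test functions, weights, thresholds and the 203.0.113.7 address are all invented for illustration; real filters tune these values from data.

```python
from typing import Callable, List, Tuple

# A test returns a score contribution: positive suggests spam, negative suggests ham.
Test = Callable[[dict], float]

def classify(message: dict, tests: List[Tuple[str, Test]],
             spam_threshold: float = 5.0, ham_threshold: float = -5.0):
    """Run tests in the given order (cheapest first); stop once the score is decisive."""
    score = 0.0
    tests_run = []
    for name, test in tests:
        score += test(message)
        tests_run.append(name)
        if score >= spam_threshold:
            return "spam", score, tests_run
        if score <= ham_threshold:
            return "ham", score, tests_run
    return "unsure", score, tests_run

# Hypothetical tests with made-up weights.
def ip_reputation(msg: dict) -> float:
    return 6.0 if msg["ip"] in {"203.0.113.7"} else 0.0

def content_scan(msg: dict) -> float:
    return 3.0 if "fake pills" in msg["body"].lower() else -1.0

TESTS = [("ip_reputation", ip_reputation), ("content_scan", content_scan)]
```

With these made-up numbers, a message from the listed address is classed as spam after the first test alone, so the expensive content scan never runs.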
Let’s look at the more popular and useful tests, in order from least expensive to most. Although not all spam filters include all these tests, that doesn’t mean they’re bad filters: some tests have large overlaps in the spam they catch.
Making a Connection
Connection analysis techniques evaluate the incoming connection during the earliest stages of receiving a message. State-of-the-art spam filters identify and reject at least 75% of spam early on. The filters often do so without even receiving any of the message, saving server resources and Internet bandwidth.
Ideally, a spam filter can decide whether a message is spam or ham as soon as the Internet connection is made. For this reason, the best place to site an organization’s spam filter is at the boundary between your network and the Internet. This is where the filter has access to the most information about the sending email server. Just as you often can tell that a postal mail item is junk by looking at the unopened envelope, a spam filter can recognise that a message is spam by watching the delivery process.
The spam filter should be the email server to which the rest of the world talks (the so-called MX, or Mail eXchanger). It is itself an SMTP server (SMTP is the standard Internet protocol for transferring email). Such boundary or DMZ filters receive your email, remove the spam, and immediately redirect the ham to you.
You can even extend this boundary to the cloud, thus neatly outsourcing your spam filtering task to a service run by experts.
Sender Reputation: Today’s spam filters look first at the IP address of the incoming connection. They check to see if anything is known about the address’s reputation. They do this by searching for it in one or more reputation services, to see if the address is known to send ham or spam.
If at this point the address is conclusively shown to have a history of sending spam, the connection can be rejected without further processing. In their simplest form, such services are IP address blacklists and whitelists — or blocklists and allowlists, if you prefer — but modern reputation services can offer more subtlety. They can represent “shades of gray” such as senders of a mixture of ham and spam, or IP addresses that have only recently started sending email.
Filters can also use more sophisticated tests to calculate the reputation of the sender’s Internet domain, but those tests require the message to be received, as we’ll see later.
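In its simplest, DNS-based form, a blocklist lookup might look like the Python sketch below. The convention of reversing the IPv4 octets and appending the list's zone is standard for DNS-based lists; the zone name dnsbl.example.org is a placeholder, not a real service, and the injectable resolve parameter is just there to make the sketch testable.

```python
import ipaddress
import socket

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNS-based list."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError for anything but IPv4
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the list publishes an A record for this address.

    Any resolver failure (including a timeout) is treated as "not listed",
    which is a simplification.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

For example, dnsbl_query_name("192.0.2.99", "dnsbl.example.org") gives "99.2.0.192.dnsbl.example.org". The richer "shades of gray" services described above return more than a yes/no answer, which this sketch doesn't attempt.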
DNS Standards Compliance: Legitimate email servers usually have their connections configured in a way that is strictly correct and adheres to published standards. RFC 1912 section 2.1 describes Forward-confirmed reverse DNS: setting up DNS records so that spam filters can not only look up the server’s IP address by its name, but also check that the reverse lookup gives the same information.
The idea is to make your PTR and A records consistent. Most spammers aren’t so careful.
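A forward-confirmed reverse DNS check can be sketched in a few lines of Python using the standard socket module. This version only handles IPv4, and the reverse and forward parameters are my own addition so the function can be exercised without real DNS.

```python
import socket

def forward_confirmed_rdns(ip: str,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex) -> bool:
    """Return True if the PTR name for `ip` resolves back to the same address."""
    try:
        hostname, _aliases, _addrs = reverse(ip)           # PTR lookup
        _name, _aliases, addresses = forward(hostname)     # A lookup on that name
    except (socket.herror, socket.gaierror):
        return False                                       # no PTR, or name doesn't resolve
    return ip in addresses
```

A failure here is a hint rather than proof; it would normally feed into the overall score instead of rejecting the connection outright.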
Nolisting: The recipient domain administrator publishes fake, unreachable primary and tertiary MX records (DNS records that tell a sending email server where the receiving server is for a domain). This leaves only the secondary record referring to the true mail server — or rather, the spam filter.
Legitimate senders attempt to contact the fake primary server, fail, then fall back to the true secondary. Many spammers either only try the primary or try the tertiary first. Either way, it’s a spam sign.
Email Server Profiling: By carefully probing the sending email server, a filter can work out which operating system is running on that server. This technique potentially can mop up a large proportion of spam, because most spam is sent by botnets (networks of malware-compromised PCs).
For example, it is hardly likely that legitimate email would be sent directly from Grannie’s PC running Windows ME; she’d send it via a legitimate email server.
Rewarding Good Behaviour
As I mentioned earlier, a good proportion of the spam has already been identified before the email transfer protocol starts, let alone before any email is sent across the connection. But if the filter hasn’t yet decided whether the current connection is coming from a spammer, it moves on to techniques that examine the behaviour of the sending email server, during the initial stages of the SMTP conversation. It’s important to use these techniques selectively, because they can delay delivery.
Greylisting: SMTP allows a receiving system to interrupt a transfer, while reporting a temporary error. For example, "451 4.7.1 Please try again later". If a spam filter is suspicious of this connection, it could deliberately send one of these temporary error messages. A legitimate sender disconnects and retries the transfer later. But many spam senders don’t bother retrying, in which case the filter didn’t even need to decide whether the connection was spammy; the spammer does our filtering work for us.
A potential downside is that some legitimate email servers don’t retry in a timely fashion, causing unnecessary delay. However, those legitimate servers are usually large, well-known bulk email senders and should have already received a free pass in the reputation stage.
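Here is a minimal Python sketch of the greylisting bookkeeping. It assumes the filter remembers each (client IP, sender, recipient) combination and accepts a retry only after a minimum delay; that keying scheme and the 300-second delay are common choices, but they are my assumptions rather than anything prescribed above. Entries never expire in this sketch, which a real filter would need to handle.

```python
import time

TEMP_FAILURE = "451 4.7.1 Please try again later"

class Greylist:
    def __init__(self, min_delay: float = 300.0, clock=time.time):
        self.min_delay = min_delay   # seconds a sender must wait before retrying
        self.clock = clock           # injectable so the sketch can be tested
        self.first_seen = {}         # (ip, sender, recipient) -> time first seen

    def should_accept(self, client_ip: str, mail_from: str, rcpt_to: str) -> bool:
        """Return True to accept; False means reply with TEMP_FAILURE."""
        key = (client_ip, mail_from.lower(), rcpt_to.lower())
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        return now - first >= self.min_delay
```

The first attempt always gets the temporary error; a sender that comes back after the delay is let through, and one that never retries has filtered itself.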
Greetpause: A receiving mail server is expected to reply to an incoming connection with an SMTP greeting, for example, "220 mail.example.com ESMTP Service ready". The sending server must wait until it’s received this greeting; it mustn’t start sending information beforehand.
If the receiving spam filter deliberately delays the greeting, but the sender doesn’t wait, the sender’s probably a spammer.
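The greetpause check itself is tiny. In this Python sketch, the filter waits before greeting and watches whether the client sends anything in the meantime; the five-second default is an arbitrary choice of mine, and the function name is invented.

```python
import select
import socket

def talks_before_greeting(conn: socket.socket, pause_seconds: float = 5.0) -> bool:
    """Wait `pause_seconds` before greeting; True if the client didn't wait.

    Note that the socket also becomes readable if the client hangs up
    during the pause, so this treats an impatient disconnect as suspicious too.
    """
    readable, _, _ = select.select([conn], [], [], pause_seconds)
    return bool(readable)
```

If the function returns False, the filter goes on to send its 220 greeting as normal.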
Throttling: The spam filter could deliberately slow down a suspicious connection. Many spammers simply give up when sending to unusually slow mail servers; again, the spammer does the filtering work for us.
Some anti-spam technologists philosophically like this idea because it wastes spammers’ resources; in this context, the technique is often known as a tarpit, or teergrube.
Finally! Scanning the Content...
By this point, our spam filter has rejected the vast majority of spam, with little or no data actually transferred across the Internet. Also, it has recognised much of the ham, so only a little uncertain email is left. So, at last, it’s time to look at the message content — both email body and headers.
These tests are typically the most expensive, in terms of both server horsepower and network bandwidth. That’s why we want to leave them until last, and only use them on the few messages that we can’t confidently categorise as spam or ham.
Domain Reputation: Email headers and other metadata contain domain names that belong to the sender; the spam filter can look these up using a more sophisticated version of the reputation service that I talked about earlier.
Email sender domain information can be forged, but the filter is often able to verify the domain using sender-authentication techniques, such as SPF or DKIM (Sender Policy Framework, DomainKeys Identified Mail). In DKIM, the stronger of the two standards, legitimate email servers digitally sign all their outgoing email with a private key and publish a public key, with which recipients can verify the signatures. For example, if the message purports to come from paypal.com but the spam filter fails to authenticate the message sender, it’s probably a phishing message.
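DKIM verification needs real cryptography, so as an illustration here is a deliberately simplified Python sketch of the SPF side instead. It assumes the domain's SPF record has already been fetched from DNS as a string, and it only understands the ip4, ip6 and all mechanisms; real records often use include, a, mx and redirect, which this sketch silently ignores, so treat it as a teaching aid rather than a validator. The sample record is invented.

```python
import ipaddress

QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}

def spf_ip_check(record: str, client_ip: str) -> str:
    """Evaluate a (simplified) SPF record against the connecting IP address."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"                      # not an SPF record at all
    ip = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        qualifier = "+"                    # the default qualifier is "pass"
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        if name == "all":
            return QUALIFIERS[qualifier]
        if name in ("ip4", "ip6") and value:
            network = ipaddress.ip_network(value, strict=False)
            if network.version == ip.version and ip in network:
                return QUALIFIERS[qualifier]
        # Any other mechanism or modifier is ignored in this sketch.
    return "neutral"                       # nothing matched
```

For the made-up record "v=spf1 ip4:192.0.2.0/24 -all", a connection from 192.0.2.55 passes, while one from 203.0.113.9 fails.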
Call-to-Action Reputation: Spammers usually need to include some way for you to take the next step (for example, a Web site where you can buy their fake pills, an email address, or a phone number). Reputation databases can store reputation opinions about the text used in these calls to action.
Incorrect Formatting: Sloppily designed email headers or HTML body text are possible spam signs. Examples include a Message-ID: header with no domain, a Date: header set several days in the future or past, an HTML body with no plain-text alternative version, or an HTML <P> tag with no matching </P> tag. These are errors that wouldn’t normally stop ham from being read, but indicate that the message wasn’t generated by a legitimate email service.
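Several of these checks are easy to sketch with Python's standard email package. The function name, the warning strings and the two-day tolerance below are my own choices for illustration; the sketch doesn't attempt the unbalanced-HTML check.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

def formatting_warnings(raw_message: str, now=None, max_skew=timedelta(days=2)):
    """Return a list of formatting oddities found in a raw message."""
    now = now or datetime.now(timezone.utc)
    msg = message_from_string(raw_message)
    warnings = []

    # A well-formed Message-ID looks like <local-part@domain>.
    if "@" not in msg.get("Message-ID", ""):
        warnings.append("Message-ID has no domain")

    try:
        sent = parsedate_to_datetime(msg.get("Date", ""))
        if sent.tzinfo is None:            # a "-0000" zone yields a naive datetime
            sent = sent.replace(tzinfo=timezone.utc)
        if abs(now - sent) > max_skew:
            warnings.append("Date is far from the current time")
    except (TypeError, ValueError):        # which one is raised depends on the Python version
        warnings.append("Date is missing or unparseable")

    types = {part.get_content_type() for part in msg.walk()}
    if "text/html" in types and "text/plain" not in types:
        warnings.append("HTML body with no plain-text alternative")
    return warnings
```

Each warning would typically add a little to the message's spam score rather than condemn it outright, since plenty of ham is sloppy too.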
Other Heuristics: Spam filters often use rules of thumb about content features that distinguish spam from ham. Examples include the proportion of capitals to lowercase letters, hidden Shakespearian text, or evidence that a spammer has misconfigured the spam-sending software (for example, mistakenly included text such as
Conversation Tracking: If the spam filter identifies a message as a genuine reply to a local user, it’s almost certainly ham.
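One plausible way to do this, sketched in Python below, is to remember the Message-ID of every outgoing message and then look for those IDs in the In-Reply-To: and References: headers of incoming mail. That mechanism is my assumption; a filter could track conversations in other ways, and the class and method names are invented.

```python
from email import message_from_string

class ConversationTracker:
    def __init__(self):
        self.sent_ids = set()   # Message-IDs of mail sent by local users

    def record_outgoing(self, raw_message: str) -> None:
        message_id = message_from_string(raw_message).get("Message-ID")
        if message_id:
            self.sent_ids.add(message_id.strip())

    def is_reply_to_local_user(self, raw_message: str) -> bool:
        msg = message_from_string(raw_message)
        # Both headers hold whitespace-separated Message-IDs.
        referenced = (msg.get("In-Reply-To", "") + " " + msg.get("References", "")).split()
        return any(ref in self.sent_ids for ref in referenced)
```

A spammer would have to guess the Message-ID of a message your user really sent, which is why a match is such a strong ham signal.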
Statistical Content Analysis: By performing a full-on, statistical analysis of the pattern of words in the message, a spam filter can learn to distinguish between spam and ham — using techniques like naïve Bayesian classification or Markovian discrimination.
These are the most expensive, “brute force” techniques. They were hailed as revolutionary in 2003; but over time, as spam volumes increased, it became unrealistic for most organisations to scan all email content.
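For the curious, here is a toy naïve Bayesian classifier in Python. It uses plain word counts with add-one smoothing, which is a textbook simplification and not what any particular product does; the tokenizer and class name are my own. It needs at least one training message of each kind before it can score anything.

```python
import math
import re
from collections import Counter

class NaiveBayesFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text: str):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text: str, label: str) -> None:
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text: str) -> float:
        """Estimate P(spam | words), assuming words are independent given the class."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            log_p = math.log(self.message_counts[label] / total_messages)  # class prior
            for word in self.tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing so an unseen word doesn't zero the product.
                log_p += math.log((count + 1) / (total_words + len(vocabulary)))
            log_scores[label] = log_p
        return 1.0 / (1.0 + math.exp(log_scores["ham"] - log_scores["spam"]))
```

Even this toy version has to tokenize and score every word of every message, which hints at why content analysis is the most expensive stage.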
Just about anybody with an email address knows about spam filters — or they think they do. But it’s a complex subject, with more to it than meets the eye.
Richi Jennings is an independent analyst, specialising in blogging, email, spam, security, and other technology topics. His writing has won ASBPE and Neal awards. You can follow him as @richi on Twitter, pretend to be his friend at Facebook.com/richij or just use boring old email: firstname.lastname@example.org.