Everything You Know about Spam Filters is Wrong
Filtering spam email is all about looking in the content of the message, trying to find certain dodgy words or phrases, right? Well... no. Not so much. While that was certainly true in the email Dark Ages of ten years ago, today’s filters are completely different.
Spam filters need to sift through your incoming email stream and automatically decide which messages are spam and which are legitimate (known as ham). It’s a tremendously difficult job to get right; to reach today’s levels of filter accuracy, anti-spam technologists have invested huge amounts of financial and intellectual capital over the years.
To decode the mysteries, read on.
Cocktail Hour
The first truth to understand is that spam filters have a huge arsenal of techniques available to them. In the jargon of computer performance, some techniques are more expensive than others — that is, they need vastly more computer horsepower.
Typically, a filter starts with the least expensive tests and performs progressively more expensive ones until it’s sure whether the message is spam or ham. It decides by combining the results of the tests into a single score, sometimes known as a cocktail spam score. Tests that analyze the message content tend to be the most expensive, so they’re actually rarely used.
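To make the cheapest-first idea concrete, here is a minimal sketch in Python. The function name classify, the thresholds, and the idea of each test returning a signed score are illustrative assumptions, not a description of any particular product.

```python
from typing import Callable, Dict, List, Tuple

# A test is a (name, function) pair. The function returns a score contribution:
# positive values lean toward spam, negative values lean toward ham.
Test = Tuple[str, Callable[[Dict[str, str]], float]]


def classify(message: Dict[str, str], tests: List[Test],
             spam_at: float = 5.0, ham_at: float = -5.0) -> Tuple[str, List[str]]:
    """Run tests in the order given (cheapest first) and stop once the score is decisive.

    Returns the verdict plus the names of the tests that actually ran.
    The thresholds are hypothetical values chosen for illustration.
    """
    score = 0.0
    ran: List[str] = []
    for name, test in tests:
        score += test(message)
        ran.append(name)
        if score >= spam_at:
            return "spam", ran
        if score <= ham_at:
            return "ham", ran
    # No test was decisive, so the message stays uncertain.
    return "unsure", ran
```

With tests ordered by cost, a sender with a terrible reputation is rejected after the first test, and the expensive content scan never runs.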
So, despite what many people believe, looking for naughty words and phrases is no longer the most common test for spam.
Let’s look at the more popular and useful tests, in order from least expensive to most. Although not all spam filters include all these tests, that doesn’t mean they’re bad filters: some tests have large overlaps in the spam they catch.
Making a Connection
Connection analysis techniques evaluate the incoming connection during the early stages of receiving a message. State-of-the-art spam filters identify and reject at least 75% of spam early on. The filters often do so without even receiving any of the message, saving server resources and Internet bandwidth.
Ideally, a spam filter can decide whether a message is spam or ham as soon as the connection is made. For this reason, the best place to site your organization’s spam filter is at the boundary between your network and the Internet. This is where the filter has access to the most information about the sending email server. Just as you often can tell that a postal mail item is junk by looking at the envelope, an email server can recognize that a message is spam by watching the delivery process.
The spam filter should be the email server to which the rest of the world talks (the so-called MX, or Mail eXchanger). It is itself an SMTP server (SMTP is the standard Internet protocol for transferring email). Such boundary or DMZ filters receive your email, remove the spam, and immediately redirect the ham to you. You can easily extend this boundary to the cloud, thus neatly outsourcing your spam filtering task to a service run by experts.
Sender Reputation: Today’s spam filters look first at the IP address of the incoming connection. They check to see if anything is known about the address’s reputation. They search for it in one or more reputation services, to see if the address is known to send ham or spam. If at this point the address is conclusively shown to have a history of sending spam, the connection can be rejected without further processing. In their simplest form, such services are IP address blacklists and whitelists — or blocklists and allowlists, if you prefer — but modern reputation services can offer more subtlety. They can represent “shades of gray” such as senders of a mixture of ham and spam, or IP addresses that have only recently started sending email. (Filters can also use more sophisticated tests to calculate the reputation of the sender’s Internet domain, but those tests require the message to be received, as we’ll see later.)
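Many IP reputation lists are published over DNS: the filter reverses the octets of the connecting IPv4 address, appends the list's zone, and looks the name up. An answer means the address is listed; a "no such name" error means it is not. The sketch below assumes that convention. The zone dnsbl.example is a placeholder, and the function names are hypothetical.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a DNS-based list."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the list publishes any answer for this address.

    `resolve` is injectable so the logic can be exercised without network access.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # The name does not resolve, so the address is not on the list.
        return False
```

A real service may encode "shades of gray" in the returned address or in an accompanying TXT record; this sketch only distinguishes listed from not listed.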
DNS Standards Compliance: Legitimate email servers usually have their connections configured in a way that is strictly correct and adheres to published standards. For example, RFC 1912 section 2.1 describes Forward-confirmed reverse DNS: setting up DNS records so that spam filters can not only look up the server’s IP address by its name, but also check that the reverse lookup gives the same information. The idea is to make your PTR and A records consistent. Most spammers aren’t so careful.
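The forward-confirmed reverse DNS check can be sketched with Python's standard socket module: look up the PTR name for the connecting address, resolve that name forward again, and see whether the original address is among the results. This is an IPv4-only illustration, and the function name is an assumption.

```python
import socket


def forward_confirmed_rdns(ip: str,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex) -> bool:
    """Return True if the PTR name for `ip` resolves back to `ip`.

    `reverse` and `forward` default to the standard resolver calls and are
    injectable so the check can be tested offline.
    """
    try:
        hostname, _aliases, _addresses = reverse(ip)            # PTR lookup
        _name, _aliases, addresses = forward(hostname)          # A lookup
    except (socket.herror, socket.gaierror):
        # Missing PTR record or a name that does not resolve: not confirmed.
        return False
    return ip in addresses
```

A failure here is best treated as one more ingredient in the overall score rather than proof of spam, since some legitimate servers are misconfigured too.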
Nolisting: The recipient domain administrator publishes fake, unreachable primary and tertiary MX records (DNS records that tell a sending email server where the receiving server is for a domain). This leaves only the secondary record referring to the true mail server—or rather, the spam filter. Legitimate senders attempt to contact the fake primary server, fail, then fall back to the true secondary. Many spammers either only try the primary or try the fake tertiary first. Either way, it’s a spam sign.
Email Server Profiling: By carefully probing the sending email server, a filter can work out which operating system is running on that server. This technique potentially can mop up a large proportion of spam, because most spam is sent by botnets (networks of malware-compromised PCs). For example, it is hardly likely that legitimate email would be sent directly from Grandma’s PC running Windows ME; she’d send it via a legitimate email server.
Rewarding Good Behavior
As we mentioned earlier, a good proportion of the spam has already been identified at this stage—before the email transfer protocol starts, let alone before any email is sent across the connection. But if the filter hasn’t yet decided whether the current connection is coming from a spammer, it moves on to techniques that examine the behavior of the sending email server, during the initial stages of the SMTP conversation. It’s important to use these techniques selectively, because they can delay delivery.
Greylisting: SMTP allows a receiving system to interrupt a transfer while reporting a temporary error, for example 451 4.7.1 Please try again later. If a spam filter is suspicious of a connection, it can deliberately send one of these temporary error messages. A legitimate sender disconnects and retries the transfer later. Many spam senders don’t bother retrying, in which case the filter doesn’t even need to decide whether the connection was spammy; the spammer does our filtering work for us. A potential downside is that some legitimate email servers don’t retry in a timely fashion either, causing unnecessary delay. However, those legitimate servers are usually large, well-known bulk email senders and should have already received a free pass in the reputation stage.
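One common way to implement greylisting is to remember when each combination of client IP, envelope sender, and envelope recipient was first seen, and to accept only once a minimum delay has passed. That keying scheme, the five-minute delay, and the class name below are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class Greylist:
    """Temporarily reject unknown (ip, sender, recipient) triplets."""

    def __init__(self, min_delay: float = 300.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.min_delay = min_delay      # seconds a sender must wait before retrying
        self.clock = clock              # injectable for testing
        self.first_seen: Dict[Tuple[str, str, str], float] = {}

    def check(self, client_ip: str, mail_from: str, rcpt_to: str) -> str:
        """Return the SMTP reply to give for this delivery attempt."""
        key = (client_ip, mail_from.lower(), rcpt_to.lower())
        now = self.clock()
        # Record the first attempt; later attempts reuse the stored timestamp.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "250 2.1.5 OK"
        return "451 4.7.1 Please try again later"
```

A production version would also expire old entries and skip the delay for senders that already passed the reputation stage.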
Greetpause: A receiving mail server is expected to reply to an incoming connection with an SMTP greeting—for example, 220 mail.example.com ESMTP Service ready. The sending server must wait until it’s received this greeting; it mustn’t start sending information beforehand. If the receiving spam filter deliberately delays the greeting, and the sender doesn’t wait, the sender’s probably a spammer.
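The greeting delay can be sketched with a plain socket and select: wait for a few seconds before sending the 220 line, and flag the client if anything arrives in the meantime. The function name and the five-second default are assumptions.

```python
import select
import socket


def is_early_talker(conn: socket.socket, pause_seconds: float = 5.0) -> bool:
    """Hold back the SMTP greeting and report whether the client spoke first.

    select() returns as soon as the socket becomes readable, which happens
    if the client sends data or closes the connection during the pause.
    Either way the client did not wait for the greeting.
    """
    readable, _, _ = select.select([conn], [], [], pause_seconds)
    return bool(readable)
```

If the function returns False, the server would go on to send its normal greeting, such as 220 mail.example.com ESMTP Service ready, and continue the conversation.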
Throttling: The spam filter could deliberately slow down a suspicious connection. Many spammers simply give up when sending to unusually slow mail servers; again, the spammer does the filtering work for us. (Some anti-spam technologists also like this idea because it wastes spammers’ resources; in this context, the technique is often known as a tarpit, or teergrube.)
Finally! Scanning the Content...
By this point, our spam filter has rejected the vast majority of spam, with little or no data actually transferred across the Internet. Also, it’s recognized much of the ham, so only a little uncertain email is left. So, at last, it’s time to look at the message content — both email body and headers.
These tests are typically the most expensive, in terms of both server horsepower and network bandwidth. That’s why we want to leave them until last, and only use them on the few messages that we can’t confidently categorize as spam or ham.
Domain Reputation: Email headers and other metadata contain domain names that belong to the sender; the spam filter can look these up using a more sophisticated version of the reputation service that we talked about earlier. Email sender domain information can be forged, but the filter may be able to verify the domain using sender-authentication techniques, such as SPF or DKIM (Sender Policy Framework or DomainKeys Identified Mail). In DKIM, the stronger of the two standards, a legitimate sender digitally signs all its outgoing email with a private key, and publishes a public key, with which recipients can verify a signature. For example, if the message purports to come from paypal.com but the spam filter fails to authenticate the message sender, it’s probably a phishing message.
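As a small illustration of the DKIM lookup step: the signature header names a signing domain (the d= tag) and a selector (the s= tag), and the public key is published as a DNS TXT record at selector._domainkey.domain. The sketch below only works out where to look; it does not verify the signature, and the function names are hypothetical.

```python
from typing import Dict


def parse_dkim_tags(header_value: str) -> Dict[str, str]:
    """Split a DKIM-Signature header value into its tag=value pairs."""
    tags: Dict[str, str] = {}
    for part in header_value.split(";"):
        if "=" in part:
            name, value = part.split("=", 1)
            # Header folding can insert whitespace inside values; drop it.
            tags[name.strip()] = "".join(value.split())
    return tags


def dkim_key_record(header_value: str) -> str:
    """Return the DNS name where the signer's public key should be published."""
    tags = parse_dkim_tags(header_value)
    return f"{tags['s']}._domainkey.{tags['d']}"
```

A filter would then fetch that TXT record, verify the signature against the message, and feed the verified domain into its reputation lookup.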
Call-to-Action Reputation: Spammers usually need to include some way for you to take the next step (for example, a Web site where you can buy their fake pharmaceuticals, an email address, or a phone number). Reputation databases can store reputation opinions about the text used in these calls to action.
Incorrect Formatting: Sloppily constructed email headers or HTML body text are possible spam signs. Examples include a Message-ID: header with no domain, a Date: header set several days in the future, an HTML body with no plain-text equivalent, or an HTML <P> tag with no matching </P>. These are errors that wouldn’t normally stop ham from being read, but they indicate that the message wasn’t generated by a legitimate email service.
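Three of these formatting checks can be sketched with Python's standard email package. The flag names and the two-day tolerance for future dates are assumptions, and the unclosed-tag check is left out to keep the example short.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime
from typing import List


def formatting_flags(raw: str, now: datetime) -> List[str]:
    """Return a list of formatting problems found in a raw message.

    `now` must be a timezone-aware datetime.
    """
    msg = message_from_string(raw)
    flags: List[str] = []

    # A well-formed Message-ID looks like <local-part@domain>.
    message_id = (msg.get("Message-ID") or "").strip().strip("<>")
    _local, at, domain = message_id.partition("@")
    if not at or not domain:
        flags.append("message-id-no-domain")

    try:
        sent = parsedate_to_datetime(msg.get("Date") or "")
        if sent.tzinfo is None:          # a "-0000" zone parses as naive; treat as UTC
            sent = sent.replace(tzinfo=timezone.utc)
        if sent - now > timedelta(days=2):
            flags.append("date-in-future")
    except (TypeError, ValueError):      # the exception type varies by Python version
        flags.append("date-unparseable")

    # Collect the content types of every part, including the message itself.
    types = {part.get_content_type() for part in msg.walk()}
    if "text/html" in types and "text/plain" not in types:
        flags.append("html-without-plain-text")
    return flags
```

Each flag would normally add a little to the overall spam score rather than condemn the message on its own.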
Other Heuristics: Spam filters often use rules of thumb about content features that distinguish spam from ham. Examples include the proportion of capitals to lowercase letters, hidden Shakespearian text, or evidence that a spammer has misconfigured the spam-sending software (for example, mistakenly included text such as %RANDOM_WORD).
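Two of these rules of thumb are easy to sketch. The first measures capitals as a share of all letters, which is a slight variation on the capitals-to-lowercase proportion mentioned above. The second looks for leftover template placeholders; the pattern used is an assumption about what such placeholders look like.

```python
import re
from typing import List

# A percent sign followed by at least four capitals, digits, or underscores,
# starting with a capital, e.g. %RANDOM_WORD. Short runs such as "%OFF" are ignored.
TEMPLATE_TOKEN = re.compile(r"%[A-Z][A-Z0-9_]{3,}")


def capital_ratio(text: str) -> float:
    """Return the fraction of letters in `text` that are uppercase."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def leftover_template_tokens(text: str) -> List[str]:
    """Return placeholder-like tokens the sending software failed to fill in."""
    return TEMPLATE_TOKEN.findall(text)
```

Like all heuristics, these produce false positives, so a filter would weigh them alongside other evidence.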
Conversation Tracking: If the spam filter identifies a message as a genuine reply to a local user, it’s almost certainly ham.
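The article doesn't say how a filter recognizes a genuine reply. One plausible approach, assumed here purely for illustration, is to remember the Message-ID of each outgoing message and check whether an incoming message cites one of them in its In-Reply-To or References header.

```python
from typing import Set


class ConversationTracker:
    """Remember outgoing Message-IDs and recognize replies to them."""

    def __init__(self) -> None:
        self.sent_ids: Set[str] = set()

    def record_outgoing(self, message_id: str) -> None:
        """Store the Message-ID of a message sent by a local user."""
        self.sent_ids.add(message_id.strip())

    def is_reply(self, in_reply_to: str = "", references: str = "") -> bool:
        """Return True if either header cites a Message-ID we sent."""
        # Both headers hold whitespace-separated Message-IDs.
        cited = (in_reply_to + " " + references).split()
        return any(message_id in self.sent_ids for message_id in cited)
```

A real implementation would expire old entries so that the store doesn't grow without bound.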
Statistical Content Analysis: By performing a full-on statistical analysis of the pattern of words in the message, a spam filter can learn to distinguish between spam and ham, using techniques like naïve Bayesian classification or Markovian discrimination. These are the most expensive, “brute force” techniques. They were hailed as revolutionary in 2003, but over time, as spam volumes increased, it became unrealistic for most organizations to scan all email content.
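For readers curious about what naïve Bayesian classification involves, here is a toy sketch. It splits text on whitespace, uses add-one smoothing, and needs at least one training message of each kind before scoring. The class and method names are hypothetical, and real filters use far more careful tokenization.

```python
import math
from collections import Counter
from typing import Dict


class NaiveBayes:
    """A toy word-count classifier for spam versus ham."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = {"spam": Counter(), "ham": Counter()}
        self.message_counts: Dict[str, int] = {"spam": 0, "ham": 0}

    def train(self, label: str, text: str) -> None:
        """Add one labeled message ("spam" or "ham") to the training data."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_log_odds(self, text: str) -> float:
        """Return log P(spam | text) - log P(ham | text); positive leans spam."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_prob: Dict[str, float] = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            # Add-one (Laplace) smoothing keeps unseen words from zeroing the product.
            denominator = sum(counts.values()) + len(vocabulary)
            total = math.log(self.message_counts[label] / total_messages)
            for word in text.lower().split():
                total += math.log((counts[word] + 1) / denominator)
            log_prob[label] = total
        return log_prob["spam"] - log_prob["ham"]
```

Every word of every message has to be counted and compared, which hints at why this kind of analysis is costly at high volumes.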
Just about anybody with an email address knows about spam filters — or they think they do. But it’s a complex subject, with more to it than meets the eye.
Richi Jennings is an independent analyst, specializing in blogging, email, spam, security, and other technology topics. His writing has won American Society of Business Publication Editors and Jesse H. Neal awards. You can follow him as @richi on Twitter, pretend to be his friend at Facebook.com/richij or just use boring old email: io@richij.com.