Filter-foiling Gibberish Becoming A Spam Staple
hcg50a writes "Wired has a story about the random words which have recently been appearing in spam. Antispam experts agreed that this isn't a brand-new technique, but said the addition of potentially filter-foiling gibberish is rapidly becoming a common component of spam."
Bayes filters deal with it fine (Score:5, Informative)
Re:W@.n7 A B37t.er J0b.? millions (Score:2, Informative)
Re:Sometimes it isn't random words (Score:3, Informative)
Who would have thought Project Gutenberg [gutenberg.net]'s biggest use would be for hawking herbal remedies?
Re:why not filter out 1337 sp3@k? (Score:5, Informative)
Why bother? A decently trained Bayesian filter will be able to recognize a spam that contains a misspelled word or two, or one that contains substitutions of similar characters. Then it will learn that those modified forms are a very strong indicator of spam. As Paul Graham [paulgraham.com] (the main early advocate of Bayesian Filters) has pointed out [paulgraham.com], there are legitimate reasons why you might see a mention of "Viagra" in your email, but no legitimate reason that you would see "V1agra", "\/iagra", "Vi@gra", or the like. Instead of slipping by my Bayesian filter, those variants actually stand out as particularly strong spam indicators.
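To make the point above concrete, here is a minimal sketch of a per-token spam probability, loosely in the spirit of the Bayesian filters being discussed. It is not the formula from any particular filter; the function name, the training counts, the 0.4 prior for unseen tokens, and the 0.01/0.99 clamps are all assumptions chosen for illustration.

```python
from collections import Counter

def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) from per-corpus message counts, clamped to [0.01, 0.99]."""
    spam_rate = spam_counts[token] / n_spam
    ham_rate = ham_counts[token] / n_ham
    if spam_rate + ham_rate == 0:
        return 0.4  # never-seen token: assumed neutral-ish prior
    p = spam_rate / (spam_rate + ham_rate)
    return min(0.99, max(0.01, p))

# Hypothetical training data: how many messages in each corpus contain the token.
spam_counts = Counter({"viagra": 40, "v1agra": 25, "vi@gra": 10})
ham_counts = Counter({"viagra": 5})  # the plain spelling does show up in some real mail
n_spam, n_ham = 100, 100

for tok in ("viagra", "v1agra", "vi@gra"):
    p = token_spam_probability(tok, spam_counts, ham_counts, n_spam, n_ham)
    print(tok, round(p, 2))
```

With these made-up counts the plain spelling scores lower than the obfuscated ones, because the obfuscated forms never appear in legitimate mail and so hit the upper clamp.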
Re:The problem with this technique (Score:3, Informative)
I'd love to -- in fact, I've even got my own website registered for it -- neuralnw.com [neuralnw.com] -- but development has stalled recently, and you'll find no trace of the program on the website. The filter, or at least a rudimentary version of it, is available if you know where to look for it. We published a paper at USENIX back in June covering this program. Since then, I haven't done much development, because frankly, there are better ways to spend my time than reading spam and trying to devise methods to filter it out.
However, comments such as yours are very encouraging. With enough positive encouragement I might be persuaded to take up the development once again :-) The code base hasn't changed since last February, but I do regularly re-train my filter.
One day, when it becomes automated and easier to use, I will release it as a serious product. I've got too much other shit on my plate right now, though.
Thanks for your interest.
Re:The problem with this technique (Score:4, Informative)
Re:What I don't understand (Score:5, Informative)
I was very irritated by that, too, until one day I was testing the HTML viewer of an e-mail client.
I use that method (Score:3, Informative)
Mercury Mail's session logs record each closed connection, which shows where e-mails begin and end, but if you're using something else there's a RinetD mod with source which logs e-mails in a way that makes ripping through them easy.
My filter is all of 23KB and I get virtually no spam. I update every once in a while when a spam gets through.
I also have a couple sub-domains that point to a spamcan on my home connection which I use to bait spammers so I can preemptively filter them out without paying for the bandwidth.
Ben
Re:The problem with this technique (Score:3, Informative)
Re:Bayes filters hubert balloons c6as6g89y9aigah98 (Score:3, Informative)
Re:Spamkiller doesn't care (Score:3, Informative)
*Someone* does, but not the parent to this. SA *does* "incorporate Bayesian analysis techniques," and some of its rules are about handling the results. You can score those rules to 0 for non-Bayesian filtering, or score everything else to 0 for pure Bayesian.
Re:Why? (Score:3, Informative)
Random strings of text are used to get through the internal checks that large ISPs run on their message traffic.
Yahoo, Hotmail, etc. have "bulk email" type folders. In addition to using SpamAssassin-type techniques, the filter scripts that put messages in these folders will check to see if the same message is being sent to multiple addresses. If so, it raises a flag and someone checks to see if it's a genuine mailing list. If it is, the list gets whitelisted internally. If it is spam, it gets moved into all the users' bulk mail folders and gets used to improve the bulk mail folders' automatic filters.
Random strings of text in messages get around this because the filter has a harder time detecting these mass spams, since each individual message will show up as being slightly different.
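A rough sketch of why this works, assuming the duplicate check is an exact fingerprint of the message body. That is an assumption for illustration only; the actual checks at any ISP are not described here, and the helper names and sample pitch are hypothetical.

```python
import hashlib
import random
import string

def fingerprint(body):
    """Exact-match fingerprint: SHA-256 of the lowercased, whitespace-normalized body."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def random_words(rng, count=5, length=8):
    """Generate a few random lowercase 'words' to pad a message with."""
    return " ".join(
        "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        for _ in range(count)
    )

pitch = "Buy herbal remedies now at a low price"
rng = random.Random(0)  # seeded so the run is repeatable

identical = [pitch for _ in range(1000)]
padded = [pitch + " " + random_words(rng) for _ in range(1000)]

# Count distinct fingerprints seen across each batch of 1000 messages.
print(len({fingerprint(m) for m in identical}))
print(len({fingerprint(m) for m in padded}))
```

The identical batch collapses to a single fingerprint, which is easy to flag as bulk mail, while the padded batch produces many different fingerprints even though the sales pitch is the same in every message.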
Re:What I don't understand (Score:5, Informative)
Don't ever do that; all spam has forged headers. You're just making life hard on someone who had their address sold.
I work for a big company, an icon of the computer business. Our mail servers get spammed a lot. We often have typical user names grafted onto the From or Reply lines. Since my user name is pretty damn common, and some of my work mail aliases are TLAs, I look at a lot of spam. When I read the headers (in a text file, not in easily spoofed mail software), the sender's domain is almost never even close to the domain of the spamming machine. Go put the IP addresses into dnsstuff.com, and compare that to the hostname. These turds hack the sendmail.cf file of the spamming machine. "SallySmith@aol.com" probably did not send spam-mail from a ".kr" ISP.
Re:Gibberish, or code? (Score:2, Informative)
Re:What I don't understand (Score:3, Informative)
Use SPF! (Score:3, Informative)
That's what SPF [pobox.com] is for. It allows the owner of a domain (foo.com) to publish a specification of the IP addresses which are allowed to use that domain name. If somebody claiming to be pete@foo.com attempts to send mail to an SPF-enabled receiver, the mail is rejected because the sender's IP is not in the foo.com approved set.
Rejection happens immediately on submission, so the mail stays on the fraudulent server.
"SallySmith@aol.com" probably did not send spam-mail from a ".kr" ISP.
Nor would that mail be accepted by an SPF-enabled sendmail. Indeed, AOL is one of the first major ISPs to have published SPF records [slashdot.org].
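The check described above can be sketched roughly as follows. This is a simplified illustration, not a real SPF implementation: real SPF policies are published in DNS and support more mechanisms than plain address ranges. The `PUBLISHED` table, the function name, and the IP addresses (taken from the reserved documentation ranges) are all hypothetical.

```python
import ipaddress

# Hypothetical published policies, standing in for what a receiver would look up in DNS.
PUBLISHED = {"foo.com": ["192.0.2.0/24"]}

def sender_permitted(envelope_from, client_ip, published=PUBLISHED):
    """Return True/False if the domain publishes a policy, or None if it publishes nothing."""
    domain = envelope_from.rsplit("@", 1)[-1].lower()
    networks = published.get(domain)
    if networks is None:
        return None  # no policy published, so this check cannot decide
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(net) for net in networks)

print(sender_permitted("pete@foo.com", "192.0.2.25"))   # inside the approved range
print(sender_permitted("pete@foo.com", "203.0.113.9"))  # outside it, would be rejected
print(sender_permitted("pete@bar.org", "203.0.113.9"))  # domain publishes no policy
```

Returning None for domains without a policy keeps "not permitted" separate from "unknown", since a receiver cannot reject mail on these grounds when the domain owner has published nothing.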
Re:It's SO gibberish (Score:3, Informative)
I use POPFile [sourceforge.net], which is a Perl Bayesian filter. It works quite well even with spam which includes garbled words. I haven't tried playing with it yet, but it seems like it would be relatively straightforward to check for the number of words which are not already in its dictionary. After the initial training, an email with more than a few new words is highly likely to be garbled spam (or from someone who received a new Thesaurus for Christmas.)
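A minimal sketch of that unknown-word check, written in Python rather than against POPFile itself. Nothing here is POPFile's API; the function names, the tiny dictionary, and the threshold of three unknown words are assumptions for illustration.

```python
import re

def unknown_words(text, known_words):
    """Return the words in text that are not in the trained dictionary."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in known_words]

def looks_garbled(text, known_words, max_unknown=3):
    """Flag a message once it has more than max_unknown never-seen words."""
    return len(unknown_words(text, known_words)) > max_unknown

# Hypothetical dictionary built up during initial training.
known = {"the", "meeting", "is", "at", "noon", "tomorrow"}

print(looks_garbled("The meeting is at noon tomorrow", known))
print(looks_garbled("meeting hubert balloons xqzvk plorf wug", known))
```

The first message contains no new words and is not flagged; the second contains five and is. In practice the threshold would need tuning, since a long legitimate email can easily introduce a handful of new words.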