Filter-foiling Gibberish Becoming A Spam Staple 606

Posted by timothy on Tuesday January 13, 2004 @10:16PM from the re:-claire-yum-donut-manhattan-regrets-cute dept.

hcg50a writes "Wired has a story about the random words which have recently been appearing in spam. Antispam experts agreed that this isn't a brand-new technique, but said the addition of potentially filter-foiling gibberish is rapidly becoming a common component of spam."

This discussion has been archived. No new comments can be posted.

Bayes filters deal with it fine (Score:5, Informative)

by sidney ( 95068 ) writes: on Tuesday January 13, 2004 @10:26PM (#7969296) Homepage

Paul Graham mentions the technique in this article [paulgraham.com], pointing out that the Bayesian filters look for words that commonly appear just in spam or just in non-spam. The random words are common in neither, so are simply ignored by the filters. As a technique, the random words would get past a filter that looks for some spammy to non-spammy word ratio. But that's not how the spam filters work.
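
A rough sketch of why neutral words barely matter, assuming a filter that scores only the tokens farthest from 0.5 and combines them naive-Bayes style. The table values, the 0.4 default for unseen tokens, and the names `combine` and `score` are illustrative assumptions, not taken from any particular filter.

```python
from math import prod

UNKNOWN = 0.4  # assumed score for a token the filter has never seen


def combine(probs):
    """Naive-Bayes combination of per-token spam probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def score(tokens, table, n=15):
    """Score a message using only its n most 'interesting' tokens."""
    probs = [table.get(t, UNKNOWN) for t in set(tokens)]
    # "Interesting" means far from the neutral 0.5 mark.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combine(probs[:n])


# Hypothetical per-token probabilities learned from earlier mail.
table = {"viagra": 0.99, "unsubscribe": 0.97}
plain = score(["viagra", "unsubscribe"], table)
padded = score(["viagra", "unsubscribe", "hubert", "balloons", "regrets"], table)
```

With a small `n`, the padding words never make the cut at all; with a larger `n` they are included but sit so close to neutral that the strong spam tokens still dominate.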

Re:W@.n7 A B37t.er J0b.? millions (Score:2, Informative)

by jbplou ( 732414 ) writes: on Tuesday January 13, 2004 @10:26PM (#7969301)

Some moron buys something. It only takes one sale for every million emails to make it worth it for them, since they can send out millions per day and we know there is a sucker born every minute.

Re:Sometimes it isn't random words (Score:3, Informative)

by srcosmo ( 73503 ) writes: <`moc.liamg' `ta' `nortagemartlu'> on Tuesday January 13, 2004 @10:26PM (#7969305) Journal

I also recently received some Alice in Wonderland citations with my spam.
Who would have thought Project Gutenberg [gutenberg.net]'s biggest use would be for hawking herbal remedies?

Re:why not filter out 1337 sp3@k? (Score:5, Informative)

by rgmoore ( 133276 ) * writes: <glandauer@charter.net> on Tuesday January 13, 2004 @10:44PM (#7969469) Homepage

Why bother? A decently trained Bayesian filter will be able to recognize a spam that contains a misspelled word or two, or one that contains substitutions of similar characters. Then it will learn that those modified forms are a very strong indicator of spam. As Paul Graham [paulgraham.com] (the main early advocate of Bayesian Filters) has pointed out [paulgraham.com], there are legitimate reasons why you might see a mention of "Viagra" in your email, but no legitimate reason that you would see "V1agra", "\/iagra", "Vi@gra", or the like. Instead of slipping by my Bayesian filter, those variants actually stand out as particularly strong spam indicators.
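
To make that concrete, here is a minimal sketch of a per-token probability in the style of Graham's published approach, as I understand it. The doubling of legitimate counts, the 0.01/0.99 clamps, the 0.4 fallback, and all the corpus numbers are assumptions for illustration.

```python
def token_probability(good, bad, ngood, nbad):
    """Spam probability for one token, given its corpus counts.

    good/bad: occurrences in legitimate and spam mail.
    ngood/nbad: number of legitimate and spam messages (both > 0).
    """
    g = 2 * good  # legitimate occurrences weighted double (assumption)
    b = bad
    if g == 0 and b == 0:
        return 0.4  # assumed fallback for a token never seen before
    good_rate = min(1.0, g / ngood)
    bad_rate = min(1.0, b / nbad)
    p = bad_rate / (good_rate + bad_rate)
    # Clamp so that no single token is treated as absolute proof.
    return max(0.01, min(0.99, p))


# Hypothetical corpus of 1000 legitimate and 1000 spam messages.
p_viagra = token_probability(good=5, bad=200, ngood=1000, nbad=1000)
p_v1agra = token_probability(good=0, bad=50, ngood=1000, nbad=1000)
```

Because "V1agra" never shows up in legitimate mail, it hits the upper clamp, while plain "Viagra" lands somewhat lower.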

Re:The problem with this technique (Score:3, Informative)

by pclminion ( 145572 ) writes: on Tuesday January 13, 2004 @10:44PM (#7969471)

> Well what are you standing around talking for? Hook us up!
I'd love to -- in fact, I've even got my own website registered for it -- neuralnw.com [neuralnw.com] -- but development has stalled recently, and you'll find no trace of the program on the website. The filter, or at least a rudimentary version of it, is available if you know where to look for it. We published a paper at USENIX back in June covering this program. Since then, I haven't done much development, because frankly, there are better ways to spend my time than reading spam and trying to devise methods to filter it out.
However, comments such as yours are very encouraging. With enough positive encouragement I might be persuaded to take up the development once again :-) The code base hasn't changed since last February, but I do regularly re-train my filter.
One day, when it becomes automated and easier to use, I will release it as a serious product. I've got too much other shit on my plate right now, though.
Thanks for your interest.

Re:The problem with this technique (Score:4, Informative)

by sketerpot ( 454020 ) writes: <.moc.liamg. .ta. .topreteks.> on Tuesday January 13, 2004 @10:58PM (#7969589)

In most adaptive filters, only words that have been used a certain number of times are taken into consideration. For example, the original Plan for Spam algorithm ignores any word that doesn't appear over 5 times in the corpus.
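
A minimal sketch of that kind of cutoff, assuming plain occurrence counts per token. The name `eligible_tokens`, the inclusive threshold, and the sample counts are hypothetical; a real filter may weight legitimate and spam counts differently.

```python
from collections import Counter


def eligible_tokens(good_counts, bad_counts, min_count=5):
    """Return the tokens seen at least min_count times across the corpus."""
    tokens = set(good_counts) | set(bad_counts)
    return {
        t for t in tokens
        if good_counts.get(t, 0) + bad_counts.get(t, 0) >= min_count
    }


# Hypothetical corpus counts: a one-off gibberish token never qualifies.
good = Counter({"meeting": 10})
bad = Counter({"viagra": 7, "c6as6g89y9aigah98": 1})
kept = eligible_tokens(good, bad)
```

Freshly generated gibberish is, by construction, rare in the corpus, so it tends to fall below the cutoff and contributes nothing to the score.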

Re:What I don't understand (Score:5, Informative)

by he-sk ( 103163 ) writes: on Tuesday January 13, 2004 @11:04PM (#7969635)

That's the text/plain part you see. The "advertisement" is in the text/html part.

I was very irritated by that, too, until one day I was testing the HTML viewer of an e-mail client.
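
For anyone who wants to see the two parts side by side, here is a small sketch using Python's standard `email` package. The message contents are made up; only the multipart/alternative layout is the point.

```python
from email.message import EmailMessage


def build_two_part_message():
    """Build a multipart/alternative message with differing parts."""
    msg = EmailMessage()
    msg["Subject"] = "re: claire yum donut"
    # The text/plain part: what a text-only mail reader displays.
    msg.set_content("claire yum donut manhattan regrets cute")
    # The text/html part: what an HTML-capable reader displays instead.
    msg.add_alternative(
        "<html><body><b>Buy herbal remedies now</b></body></html>",
        subtype="html",
    )
    return msg


def parts_by_type(msg):
    """Map each leaf part's content type to its decoded text."""
    return {
        part.get_content_type(): part.get_content().strip()
        for part in msg.walk()
        if not part.is_multipart()
    }


parts = parts_by_type(build_two_part_message())
```

A filter or a curious reader can walk the parts this way to see that the gibberish and the pitch live in different alternatives of the same message.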

I use that method (Score:3, Informative)

by KalvinB ( 205500 ) writes: on Tuesday January 13, 2004 @11:08PM (#7969667) Homepage

includes sourcecode [icarusindie.com]

Mercury Mail's session logs use a closed connection to mark where e-mails begin and end, but if you're using something else, there's a RinetD mod with source which logs e-mails in a way that makes ripping through them easy.

My filter is all of 23KB and I get virtually no spam. I update every once in a while when a spam gets through.

I also have a couple sub-domains that point to a spamcan on my home connection which I use to bait spammers so I can preemptively filter them out without paying for the bandwidth.

Ben

Re:The problem with this technique (Score:3, Informative)

by anthony_baxter ( 48233 ) writes: on Tuesday January 13, 2004 @11:23PM (#7969777)

I've actually observed this problem. The issue is "overtraining", that is, training on everything. I recently threw away my training database and now only train on messages that don't score 0.0 or 1.0 ("non-edge" training). This produces a much smaller database, and is far more deadly against the random-word spam attempts.
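
A minimal sketch of non-edge training, assuming the filter exposes a score in [0, 1] and some training callback. The names `should_train` and `train_non_edge`, the tolerance `eps`, and the sample messages are hypothetical.

```python
def should_train(score, eps=1e-9):
    """True when the score is not pinned at either extreme."""
    return eps < score < 1.0 - eps


def train_non_edge(scored_messages, train):
    """Call train(message, is_spam) only for messages the filter was unsure of.

    scored_messages: iterable of (message, score, is_spam) tuples.
    Returns how many messages were actually used for training.
    """
    used = 0
    for message, score, is_spam in scored_messages:
        if should_train(score):
            train(message, is_spam)
            used += 1
    return used


# Hypothetical batch: two confident verdicts and one uncertain one.
batch = [
    ("obvious spam", 1.0, True),
    ("obvious ham", 0.0, False),
    ("borderline message", 0.63, True),
]
trained_on = []
count = train_non_edge(batch, lambda m, s: trained_on.append(m))
```

Messages the filter already classifies with full confidence add nothing new, so skipping them keeps the database small and keeps one-off gibberish tokens out of it.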

Re:Bayes filters hubert balloons c6as6g89y9aigah98 (Score:3, Informative)

by mabhatter654 ( 561290 ) writes: on Tuesday January 13, 2004 @11:45PM (#7969971)

To clarify: say you report a spam to Yahoo. They are most likely getting 10,000 copies with the same subject from similar IPs, so they just drop the connection after the subject is entered [that is an elementary feature of even the oldest email servers]... it never gets sent through the system or to your spam filter. But now they have to run the spam filter on every single email, which costs more time than simply dropping it because of its subject. Remember, they deal with 10,000 copies of the same spam in a day... except now it doesn't look the same every time.

Re:Spamkiller doesn't care (Score:3, Informative)

by M. Silver ( 141590 ) writes: <silver@phoCOUGARenyx.net minus cat> on Wednesday January 14, 2004 @01:07AM (#7970581) Homepage Journal

> Umm. SpamAssassin isn't Bayesian, it's rule-based. Someone needs better research

*Someone* does, but not the parent to this. SA *does* "incorporate Bayesian analysis techniques," and some of its rules are about handling the results. You can score those rules to 0 for non-Bayesian filtering, or score everything else to 0 for pure Bayesian.

Re:Why? (Score:3, Informative)

by Gherald ( 682277 ) writes: on Wednesday January 14, 2004 @02:08AM (#7970875) Journal

Yes, ISPs do not use Bayesian filters. Those are rare and spammers do not care about them.

Random strings of text are used to get through the internal checks that large ISPs run on their message traffic.

Yahoo, Hotmail, etc. have "bulk email" type folders. In addition to using SpamAssassin-type techniques, the filter scripts that put messages in these folders will check to see if the same message is being sent to multiple addresses. If so, it raises a flag and someone checks to see if it's a genuine mailing list. If it is, the list gets whitelisted internally. If it is spam, it gets moved into all the users' bulk mail folders and gets used to improve the bulk mail folders' automatic filters.

Random strings of text in messages get around this because the filter has a harder time detecting these mass spams, since each individual message will show up as being slightly different.
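
A toy illustration of that effect, assuming the duplicate check is an exact hash of the message body. That is a simplification for the sake of the example; I have no inside knowledge of what any large provider actually runs, and the function names are made up.

```python
import hashlib
from collections import Counter


def fingerprint(body):
    """Exact-match fingerprint of a message body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def largest_batch(bodies):
    """Size of the biggest group of byte-identical messages."""
    counts = Counter(fingerprint(b) for b in bodies)
    return max(counts.values(), default=0)


pitch = "Cheap herbal remedies, click here"
identical = [pitch] * 3
# Same pitch, but each copy carries different random padding.
padded = [
    f"{pitch}\n{junk}"
    for junk in ("hubert balloons", "claire yum donut", "c6as6g89y9aigah98")
]
```

Here `largest_batch(identical)` sees one group of three, while `largest_batch(padded)` sees three unrelated messages, which is exactly what the spammer wants a bulk detector to conclude.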

Re:What I don't understand (Score:5, Informative)

by ElectricRook ( 264648 ) writes: on Wednesday January 14, 2004 @02:38AM (#7970995)

> I hope to hell they're fishing for non-bouncing addresses, because at the moment any email which SpamAssassin says is spam, I bounce.
Don't ever do that, all spam has forged headers. You're just making life hard on someone who had their address sold.
I work for a big company, an icon of the computer business. Our mail servers get spammed a lot. We often have typical user names grafted onto the From or Reply lines. Since my user name is pretty damn common, and some of my work mail aliases are TLAs, I look at a lot of spam. When I read the headers (in a text file, not in easily spoofed mail software), the sender's domain is almost never even close to the domain of the spamming machine. Go put the IP addresses into dnsstuff.com, and compare that to the hostname. These turds hack the sendmail.cf file of the spamming machine. "SallySmith@aol.com" probably did not send spam-mail from a ".kr" ISP.

Re:Gibberish, or code? (Score:2, Informative)

by ckolar ( 43016 ) writes: <(chris) (at) (kolar.org)> on Wednesday January 14, 2004 @03:47AM (#7971224) Homepage Journal

This really exists, www.spammimic.com [spammimic.com]. I'd swear that /. did a story on it when it came out. --ck

Re:What I don't understand (Score:3, Informative)

by funky womble ( 518255 ) writes: on Wednesday January 14, 2004 @07:16AM (#7971847)

Bouncing high scoring mail works pretty well, as long as you do it right [duncanthrax.net].

Use SPF! (Score:3, Informative)

by TheMidget ( 512188 ) writes: on Wednesday January 14, 2004 @07:30AM (#7971889)

> Don't ever do that, all spam has forged headers. You're just making life hard on someone who had their address sold.
That's what SPF [pobox.com] is for. It allows the owner of a domain to publish a specification of IP addresses which are allowed to use that domain name (foo.com). If somebody who claims to be pete@foo.com now attempts to send a mail to an SPF-enabled receiver, his mail is rejected, because his IP is not in the foo.com approved set.
Rejection happens immediately on submission, so the mail stays on the fraudulent server.
> "SallySmith@aol.com" probably did not send spam-mail from a ".kr" ISP.
Nor would that mail be accepted by an SPF-enabled sendmail. Indeed, AOL is one of the first major ISPs to have published SPF records [slashdot.org].
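
The core check can be sketched in a few lines, assuming the receiver has already fetched the sender domain's published address ranges. Here a hard-coded dictionary stands in for the DNS lookup, and the ranges, the `PUBLISHED` table, and the function name are all hypothetical; real SPF records have a richer syntax than a plain list of networks.

```python
import ipaddress

# Stand-in for what a domain owner would publish in DNS (hypothetical data).
PUBLISHED = {
    "foo.com": ["192.0.2.0/24", "198.51.100.17/32"],
}


def sender_permitted(envelope_from, client_ip, published=PUBLISHED):
    """Check whether client_ip may send mail claiming to be envelope_from.

    Returns True or False when the domain publishes a policy, and None
    when it publishes nothing (no verdict either way).
    """
    domain = envelope_from.rsplit("@", 1)[-1].lower()
    networks = published.get(domain)
    if networks is None:
        return None
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(net) for net in networks)


ok = sender_permitted("pete@foo.com", "192.0.2.10")
forged = sender_permitted("pete@foo.com", "203.0.113.5")
```

A receiver that gets `False` can refuse the message during the SMTP session, which is what keeps the bounce from ever landing on the forged address.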

Re:It's SO gibberish (Score:3, Informative)

by B'Trey ( 111263 ) writes: on Wednesday January 14, 2004 @09:50AM (#7972598)

Certainly it is. And for those who use high-ASCII or Unicode, it isn't a valid technique. That doesn't mean that it isn't a valid technique for the millions of people who don't use anything outside the normal ASCII characters.

I use POPFile [sourceforge.net], which is a Perl Bayesian filter. It works quite well even with spam which includes garbled words. I haven't tried playing with it yet, but it seems like it would be relatively straightforward to check for the number of words which are not already in its dictionary. After the initial training, an email with more than a few new words is highly likely to be garbled spam (or from someone who received a new Thesaurus for Christmas.)
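
A rough sketch of that idea in Python rather than Perl, and not based on POPFile's actual code. The tokenizer, the `max_new` cutoff of 3, and the tiny dictionary are assumptions picked only to show the shape of the check.

```python
import re


def count_unknown_words(text, known):
    """Count distinct words in text that are absent from the known set."""
    words = set(re.findall(r"[a-z0-9@']+", text.lower()))
    return len(words - known)


def looks_garbled(text, known, max_new=3):
    """Flag a message that contains more than max_new never-seen words."""
    return count_unknown_words(text, known) > max_new


# Hypothetical dictionary built up during initial training.
known = {"the", "meeting", "is", "at", "noon", "buy", "now", "cheap"}
normal = "The meeting is at noon"
spammy = "Buy cheap now hubert balloons c6as6g89y9aigah98 regrets donut"
```

The cutoff would need tuning per user, since a correspondent with an unusual vocabulary trips the same wire as a spammer's word salad.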
