
Filter-foiling Gibberish Becoming A Spam Staple

Posted by timothy
from the re:-claire-yum-donut-manhattan-regrets-cute dept.
hcg50a writes "Wired has a story about the random words which have recently been appearing in spam. Antispam experts agreed that this isn't a brand-new technique, but said the addition of potentially filter-foiling gibberish is rapidly becoming a common component of spam."

  • by sidney (95068) on Tuesday January 13, 2004 @10:26PM (#7969296) Homepage
    Paul Graham mentions the technique in this article [paulgraham.com], pointing out that the Bayesian filters look for words that commonly appear just in spam or just in non-spam. The random words are common in neither, so are simply ignored by the filters. As a technique, the random words would get past a filter that looks for some spammy to non-spammy word ratio. But that's not how the spam filters work.
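
As a rough illustration of the point above, here is a minimal sketch of the combining step described in "A Plan for Spam": only the tokens whose probability is farthest from a neutral 0.5 are combined. The token table and the sample message are made up for the example, and the function names are hypothetical rather than taken from any real filter.

```python
from math import prod

def interesting_tokens(token_probs, message_tokens, unknown=0.4, limit=15):
    """Return the spam probabilities farthest from neutral (0.5).

    Tokens missing from the table get the `unknown` value, so padding
    words sit close to neutral and carry little weight.
    """
    probs = [token_probs.get(t, unknown) for t in set(message_tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return probs[:limit]

def spam_score(token_probs, message_tokens, unknown=0.4, limit=15):
    """Combine the most interesting token probabilities into one score."""
    probs = interesting_tokens(token_probs, message_tokens, unknown, limit)
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Hypothetical trained table: probability that a token indicates spam.
table = {"viagra": 0.99, "free": 0.90, "meeting": 0.05}

plain = ["viagra", "free"]
padded = plain + ["claire", "yum", "donut", "manhattan", "regrets"]

print(spam_score(table, plain))
print(spam_score(table, padded))
```

In this sketch the padded message still scores well above 0.9, because the five unknown words each contribute only a near-neutral 0.4 while the two trained tokens dominate.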
  • by jbplou (732414) on Tuesday January 13, 2004 @10:26PM (#7969301)
    Some moron buys something. It only takes one sale for every million emails to make it worth it for them, since they can send out millions per day and we know there is a sucker born every minute.
  • by srcosmo (73503) <ultramegatron&gmail,com> on Tuesday January 13, 2004 @10:26PM (#7969305) Journal
    I also recently received some Alice in Wonderland citations with my spam.
    Who would have thought Project Gutenberg [gutenberg.net]'s biggest use would be for hawking herbal remedies?
  • by rgmoore (133276) * <glandauer@charter.net> on Tuesday January 13, 2004 @10:44PM (#7969469) Homepage

    Why bother? A decently trained Bayesian filter will be able to recognize a spam that contains a misspelled word or two, or one that contains substitutions of similar characters. Then it will learn that those modified forms are a very strong indicator of spam. As Paul Graham [paulgraham.com] (the main early advocate of Bayesian Filters) has pointed out [paulgraham.com], there are legitimate reasons why you might see a mention of "Viagra" in your email, but no legitimate reason that you would see "V1agra", "\/iagra", "Vi@gra", or the like. Instead of slipping by my Bayesian filter, those variants actually stand out as particularly strong spam indicators.

  • by pclminion (145572) on Tuesday January 13, 2004 @10:44PM (#7969471)
    Well what are you standing around talking for? Hook us up!

    I'd love to -- in fact, I've even got my own website registered for it -- neuralnw.com [neuralnw.com] -- but development has stalled recently, and you'll find no trace of the program on the website. The filter, or at least a rudimentary version of it, is available if you know where to look for it. We published a paper at USENIX back in June covering this program. Since then, I haven't done much development, because frankly, there are better ways to spend my time than reading spam and trying to devise methods to filter it out.

    However, comments such as yours are very encouraging. With enough positive encouragement I might be persuaded to take up the development once again :-) The code base hasn't changed since last February, but I do regularly re-train my filter.

    One day, when it becomes automated and easier to use, I will release it as a serious product. I've got too much other shit on my plate right now, though.

    Thanks for your interest.

  • by sketerpot (454020) <sketerpot.gmail@com> on Tuesday January 13, 2004 @10:58PM (#7969589)
    In most adaptive filters, only words that have been used a certain number of times are taken into consideration. For example, the original Plan for Spam algorithm ignores any word that doesn't appear over 5 times in the corpus.
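
The occurrence threshold mentioned above can be sketched as follows. This follows the per-token formula as I understand it from "A Plan for Spam" (good counts doubled, result clamped to the range 0.01 to 0.99); the function name, the argument names, and the exact cut-off are assumptions for illustration.

```python
def token_spam_probability(good_count, bad_count, ngood, nbad, min_occurrences=5):
    """Estimate how strongly one token indicates spam.

    good_count / bad_count: times the token appeared in ham / spam.
    ngood / nbad: number of ham / spam messages in the corpus (both > 0).
    Returns None when the token is too rare to be trusted.
    """
    g = 2 * good_count          # ham counts are doubled to bias against false positives
    b = bad_count
    if g + b < min_occurrences:
        return None             # rare token: ignored by the filter
    good_freq = min(1.0, g / ngood)
    bad_freq = min(1.0, b / nbad)
    # Clamp so that no single token is treated as absolute proof.
    return max(0.01, min(0.99, bad_freq / (good_freq + bad_freq)))

# A random word seen twice in spam is skipped entirely.
print(token_spam_probability(0, 2, 1000, 1000))
# An obfuscated spelling seen 20 times, only in spam, becomes a strong indicator.
print(token_spam_probability(0, 20, 1000, 1000))
```

A one-off random word never reaches the threshold, while a spelling that keeps recurring only in spam is pushed to the top of the clamped range.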
  • by he-sk (103163) on Tuesday January 13, 2004 @11:04PM (#7969635)
    That's the text/plain part you see. The "advertisement" is in the text/html part.

    I was very irritated by that, too, until one day I was testing the HTML viewer of an e-mail client.
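
To make the text/plain versus text/html split concrete, here is a small sketch using Python's standard `email` package. The message content is invented for the example; the point is only that one message can carry two alternative bodies, and a plain-text reader shows the gibberish while an HTML viewer shows the pitch.

```python
from email.message import EmailMessage

# Build a multipart/alternative message: gibberish for plain-text readers,
# the actual advertisement for HTML-capable clients.
msg = EmailMessage()
msg["Subject"] = "re: claire yum donut"
msg.set_content("claire yum donut manhattan regrets cute")
msg.add_alternative(
    "<html><body><p>Buy now!</p></body></html>", subtype="html"
)

def parts_by_type(message):
    """Map each leaf part's content type to its decoded text."""
    return {
        part.get_content_type(): part.get_content()
        for part in message.walk()
        if not part.is_multipart()
    }

for content_type, body in parts_by_type(msg).items():
    print(content_type, "->", body.strip())
```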
  • I use that method (Score:3, Informative)

    by KalvinB (205500) on Tuesday January 13, 2004 @11:08PM (#7969667) Homepage
    includes sourcecode [icarusindie.com]

    Mercury Mail's session logs record each closed connection, which shows where e-mails begin and end. If you're using something else, there's a RinetD mod with source that logs e-mails in a way that makes ripping through them easy.

    My filter is all of 23KB and I get virtually no spam. I update every once in a while when a spam gets through.

    I also have a couple sub-domains that point to a spamcan on my home connection which I use to bait spammers so I can preemptively filter them out without paying for the bandwidth.

    Ben
  • by anthony_baxter (48233) on Tuesday January 13, 2004 @11:23PM (#7969777)
    I've actually observed this problem - the issue is "overtraining", that is, training on everything. I recently threw away my training database and now only train on messages that don't score 0.0 or 1.0 ("non-edge" training). This produces a much smaller database, and is far more deadly against the random-word spam attempts.
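
A minimal sketch of the "non-edge" training idea, assuming a classifier that returns a score between 0.0 and 1.0. The `classify` and `train` callables are hypothetical stand-ins for whatever filter is in use; they are not the API of any particular tool.

```python
def should_train(score, low=0.0, high=1.0):
    """Train only when the filter was not already certain."""
    return low < score < high

def train_non_edge(messages, classify, train):
    """Feed back only the messages that scored strictly between the edges.

    messages: iterable of (text, is_spam) pairs.
    classify: callable returning a score in [0.0, 1.0] (assumed interface).
    train:    callable taking (text, is_spam) (assumed interface).
    Returns the number of messages actually trained on.
    """
    trained = 0
    for text, is_spam in messages:
        if should_train(classify(text)):
            train(text, is_spam)
            trained += 1
    return trained

# Stub classifier and trainer, for illustration only.
scores = {"obvious spam": 1.0, "obvious ham": 0.0, "borderline": 0.6}
learned = []
count = train_non_edge(
    [("obvious spam", True), ("obvious ham", False), ("borderline", True)],
    classify=scores.get,
    train=lambda text, is_spam: learned.append((text, is_spam)),
)
print(count, learned)
```

Only the borderline message reaches the database here, which is why this style of training keeps the database small.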

  • by mabhatter654 (561290) on Tuesday January 13, 2004 @11:45PM (#7969971)
    To clarify: say you report a spam to Yahoo. They are most likely getting 10,000 messages with the same subject from similar IPs, so they just drop the connection after the subject is entered (that is an elementary feature of even the oldest email servers), and it never gets sent through the system or to your spam filter. But now they have to run the spam filter on every single email, which costs more time than simply dropping it because of the subject. Remember, they deal with 10,000 copies of the same spam in a day, except now it doesn't look the same every time.
  • by M. Silver (141590) <silver@phoe[ ].net ['nyx' in gap]> on Wednesday January 14, 2004 @01:07AM (#7970581) Homepage Journal
    Umm. SpamAssassin isn't Bayesian, it's rule-based. Someone needs better research

    *Someone* does, but not the parent to this. SA *does* "incorporate Bayesian analysis techniques," and some of its rules are about handling the results. You can score those rules to 0 for non-Bayesian filtering, or score everything else to 0 for pure Bayesian.
  • Re:Why? (Score:3, Informative)

    by Gherald (682277) on Wednesday January 14, 2004 @02:08AM (#7970875) Journal
    Yes, ISPs do not use Bayesian filters. Those are rare and spammers do not care about them.

    Random strings of text are used to get through the internal checks that large ISPs run on their message traffic.

    Yahoo, Hotmail, etc. have "bulk email" type folders. In addition to using SpamAssassin-type techniques, the filter scripts that put messages in these folders will check to see if the same message is being sent to multiple addresses. If this is so, it raises a flag and someone checks to see if it's a genuine mailing list. If it is, the list gets whitelisted internally. If it is spam, it gets moved into all the users' bulk mail folders and gets used to improve the bulk mail folder's automatic filters.

    Random strings of text in messages get around this because the filter has a harder time detecting these mass spams, since each individual message will show up as being slightly different.
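
The effect described above can be shown with a deliberately naive duplicate detector. This is only a sketch that assumes the check is an exact hash of the message body; the real checks at large providers are not documented in this thread and are presumably more sophisticated.

```python
import hashlib
from collections import Counter

def fingerprint(body):
    """Exact-match fingerprint of a message body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def bulk_fingerprints(bodies, threshold=3):
    """Return fingerprints seen at least `threshold` times."""
    counts = Counter(fingerprint(body) for body in bodies)
    return {fp for fp, n in counts.items() if n >= threshold}

identical = ["Buy pills now"] * 3
padded = [f"Buy pills now {word}" for word in ("claire", "donut", "manhattan")]

print(len(bulk_fingerprints(identical)))  # the repeated message is flagged
print(len(bulk_fingerprints(padded)))     # each padded copy looks unique
```

One random word per copy is enough to defeat an exact-match check, which is why a detector along these lines would need some form of fuzzy matching instead.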
  • by ElectricRook (264648) on Wednesday January 14, 2004 @02:38AM (#7970995)
    I hope to hell they're fishing for non-bouncing addresses, because at the moment any email which SpamAssassin says is spam, I bounce.

    Don't ever do that, all spam has forged headers. You're just making life hard on someone who had their address sold.

    I work for a big company, an icon of the computer business. Our mail servers get spammed a lot. We often have typical user names grafted onto the From or Reply lines. Since my user name is pretty damn common, and some of my work mail aliases are TLAs, I look at a lot of spam. When I read the headers (in a text file, not in easily spoofed mail software), almost always the sender's domain is not even close to the domain of the spamming machine. Go put the IP addresses into dnsstuff.com, and compare that to the hostname. These turds hack the sendmail.cf file of the spamming machine. "SallySmith@aol.com" probably did not send spam-mail from a ".kr" ISP.

  • by ckolar (43016) <chrisNO@SPAMkolar.org> on Wednesday January 14, 2004 @03:47AM (#7971224) Homepage Journal
    This really exists, www.spammimic.com [spammimic.com]. I'd swear that /. did a story on it when it came out. --ck
  • by funky womble (518255) on Wednesday January 14, 2004 @07:16AM (#7971847)
    Bouncing high scoring mail works pretty well, as long as you do it right [duncanthrax.net].
  • Use SPF! (Score:3, Informative)

    by TheMidget (512188) on Wednesday January 14, 2004 @07:30AM (#7971889)
    Don't ever do that, all spam has forged headers. You're just making life hard on someone who had their address sold.

    That's what SPF [pobox.com] is for. It allows the owner of a domain to publish a specification of the IP addresses that are allowed to use that domain name (foo.com). If somebody who claims to be pete@foo.com now attempts to send mail to an SPF-enabled receiver, the mail is rejected, because the sending IP is not in the foo.com approved set.

    Rejection happens immediately on submission, so the mail stays on the fraudulent server.

    "SallySmith@aol.com" probably did not send spam-mail from a ".kr" ISP.

    Nor would that mail be accepted by an SPF-enabled sendmail. Indeed, AOL is one of the first major ISPs to have published SPF records [slashdot.org].
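
The check described above boils down to "is the connecting IP inside a network the domain owner approved?". The sketch below hard-codes a made-up policy table instead of looking up and parsing real SPF records from DNS, so the table, the addresses (taken from documentation-only ranges), and the function name are all assumptions for illustration.

```python
import ipaddress

# Hypothetical published policy: domain -> networks allowed to send for it.
APPROVED = {"foo.com": ["192.0.2.0/24"]}

def sender_permitted(domain, client_ip, approved=APPROVED):
    """Return True/False for a known domain, None if it publishes no policy."""
    networks = approved.get(domain)
    if networks is None:
        return None  # nothing published, so this check cannot decide
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(net) for net in networks)

print(sender_permitted("foo.com", "192.0.2.10"))      # inside the approved set
print(sender_permitted("foo.com", "203.0.113.5"))     # forged sender, wrong network
print(sender_permitted("bar.example", "192.0.2.10"))  # no policy published
```

A receiver doing this during the SMTP session can refuse the message before accepting it, which matches the point above that the mail stays on the fraudulent server.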

  • Re:It's SO gibberish (Score:3, Informative)

    by B'Trey (111263) on Wednesday January 14, 2004 @09:50AM (#7972598)
    Certainly it is. And for those who use high-ASCII or UNICODE, it isn't a valid technique. That doesn't mean that it isn't a valid technique for the millions of people who don't use anything outside the normal ASCII characters.

    I use POPFile [sourceforge.net], which is a Perl Bayesian filter. It works quite well even with spam which includes garbled words. I haven't tried playing with it yet, but it seems like it would be relatively straightforward to check for the number of words which are not already in its dictionary. After the initial training, an email with more than a few new words is highly likely to be garbled spam (or from someone who received a new thesaurus for Christmas.)
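
The dictionary check suggested above might look something like this. It is a standalone sketch, not POPFile code or its API; the tokenizer, the `known_words` set, and the cut-off of three new words are assumptions chosen for the example.

```python
import re

def new_words(text, known_words):
    """Return the set of words in `text` that the filter has never seen."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in known_words}

def looks_garbled(text, known_words, max_new=3):
    """Flag a message that contains more than `max_new` unseen words."""
    return len(new_words(text, known_words)) > max_new

# Hypothetical dictionary built up during training.
known = {"meeting", "tomorrow", "at", "noon", "buy", "now"}

print(looks_garbled("Meeting tomorrow at noon", known))
print(looks_garbled("Buy now claire yum donut manhattan regrets cute", known))
```

The first message contains no new words and passes; the second contains six and is flagged.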
