
Filter-foiling Gibberish Becoming A Spam Staple 606

Posted by timothy
from the re:-claire-yum-donut-manhattan-regrets-cute dept.
hcg50a writes "Wired has a story about the random words which have recently been appearing in spam. Antispam experts agreed that this isn't a brand-new technique, but said the addition of potentially filter-foiling gibberish is rapidly becoming a common component of spam."

  • Well... (Score:4, Interesting)

    by i_am_syco (694486) on Tuesday January 13, 2004 @10:19PM (#7969205)
    A lot of the time that "random gibberish" comes in the form of a story or something. Hell, a while ago I got a spam that contained a few excerpts from The Raven by Edgar Allan Poe. I got a laugh out of that one.
  • by Frisky070802 (591229) * on Tuesday January 13, 2004 @10:19PM (#7969207) Journal
    My McAfee Spamkiller ignores the white noise, and simply nukes all the mail containing viagra, etc.
  • by phr1 (211689) on Tuesday January 13, 2004 @10:20PM (#7969226)
    They are sending sekrit instructions to al-spamda about where to hide the weaponz of mass distraction. Or who knows. Any government efforts to control steganography (like reported just yesterday [slashdot.org]) better go after spammers first, or we have to wonder what they're really up to.
  • Simple Solution... (Score:3, Interesting)

    by tunabomber (259585) on Tuesday January 13, 2004 @10:21PM (#7969240) Homepage
    We just need a lameness filter for spam that looks for non-sequiturs and other crap like O.,b|f-u.s,c;a,t.e,d W,.o.r.d.s.
  • by dswensen (252552) * on Tuesday January 13, 2004 @10:21PM (#7969246) Homepage
    ...is knowing how successful this spam becomes. I get a lot of it, and I have to think that you'd have to be beyond merely dim or technically inept to take it seriously -- you'd have to be insane or have some sort of debilitating head injury. (Granted, that still may leave a lot of the Internet covered, but still).

    Spammers seem to have a lot of success when they're emulating more legitimate sources like Ebay, Microsoft, etc., but I get spam now that can't even seem to decide what it's selling. The subject line says "get rid of mortgage payments" and the body is selling "V.I.A.G.01331.A." I'm not even sure what I'd be getting if I were dull enough to actually click on anything in the message. Heck, I'm not sure if even the SPAMMERS know.

    I'd be interested to know if these spams are as successful as past efforts have been.
  • by Len (89493) on Tuesday January 13, 2004 @10:21PM (#7969248)
    This doesn't seem to be a very effective spam technique. It works pretty well at fooling my "bayesian" spam filter, but the spam messages have gibberish subject lines! Who's going to read a message titled "deprecatory parrot bizarre dessert"? (an actual example)
  • There is so much crap flooding my inbox these days that the spam filter is slowly becoming a whitelist of my coworkers and a few external customers. Hardly anything else that comes in is worth the time to look at.

    I know that whitelists aren't the answer, but then nothing short of immediate execution of spammers is.
  • The Grammar Filter (Score:3, Interesting)

    by Esteanil (710082) on Tuesday January 13, 2004 @10:25PM (#7969287) Homepage Journal
    Let's see... There is translation software out there that has some basic understanding of grammar.
    Should we add a grammar-filter to the list of things we look for in spam?
    A large amount of incorrect grammar would increase the chances of the file being caught in the spam filter.
    Of course, this would lock out most of AOL users from writing email... But is that really so bad? :P
  • by pclminion (145572) on Tuesday January 13, 2004 @10:27PM (#7969314)
    The problem with this technique for foiling spam filters is that Bayesian filters only examine words which occur in the dictionary of commonly used words. A Bayesian filter is individually trained on your personal mail. If the "red herring" words in the spam don't occur in your personal dictionary, they will be ignored by the filter and have no impact on its decision.

    For example, take the word "Byzantine." This is a very non-spammish word. However, if you've never received a legitimate email containing the word "Byzantine," your Bayesian filter will not have it in its dictionary, and the word will be ineffective in "tricking" the filter. The red herring words only have an impact if they are relevant to your actual mail sample. Since everybody's email communication is different (some of us are programmers, some of us are literature majors, etc.), this is a real sledgehammer approach to defeating the filters -- and it's extremely ineffective.

    This technique just proves that spammers don't understand the theoretical underpinnings of current Bayesian anti-spam methods. Otherwise, they'd be using much more common words as red herrings, instead of these extremely rare, and therefore insignificant, words.

    I personally use a spam filter of my own design which is based on information-theoretic and neural network techniques. It kicks the shit out of spam, even the messages that include these stupid red herring words. The spammers once again prove that they are morons, incapable of understanding how anti-spam technology actually works.
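
    The "unknown words are ignored" behavior described above can be sketched roughly as follows. This is a minimal illustration, not the filter mentioned in this comment: the function names `train` and `spam_score`, the whitespace tokenizer, the equal priors, and the add-one smoothing are all assumptions made for the example.

```python
import math

def train(messages):
    """Count how many ham and spam messages each token appears in.

    `messages` is an iterable of (text, is_spam) pairs.
    """
    counts = {}      # token -> [ham_count, spam_count]
    totals = [0, 0]  # [ham messages, spam messages]
    for text, is_spam in messages:
        totals[int(is_spam)] += 1
        for token in set(text.lower().split()):
            counts.setdefault(token, [0, 0])[int(is_spam)] += 1
    return counts, totals

def spam_score(text, counts, totals):
    """Return an estimate of P(spam) using only tokens seen in training."""
    log_odds = 0.0  # equal priors assumed
    for token in set(text.lower().split()):
        if token not in counts:
            continue  # never seen in training: no evidence either way
        ham, spam = counts[token]
        # Add-one smoothing so a zero count never produces log(0).
        p_spam = (spam + 1) / (totals[1] + 2)
        p_ham = (ham + 1) / (totals[0] + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))
```

    With a personal corpus that never contained "Byzantine", appending that word to a message leaves the score unchanged, which is the point being made. A red herring word that does appear in your own ham would lower the score.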

  • by KalvinB (205500) on Tuesday January 13, 2004 @10:32PM (#7969376) Homepage
    randomly grab a paragraph from a book and include it with the spam.

    It would also help spammers to write better pitches. Use real words, actual English, but put it in a narrative, real-world scenario format. So it reads like someone you know telling you how they use such and such a product.

    "I went up the cabin last week with my girlfriend and tried out those new pills I heard about while I was there."

    There's pretty much nothing in there that would be filtered. And then a slight plug of the product name with a link and you're done. It's also Marketing 101 that the less of an ad sounds like an ad the more effective it is.

    But none of that thwarts my method which is to filter based on the URLs of links found in spams.

    I get virtually no spam with a Mercury rule file that's all of 23KB and grows very slowly as spammers use new domains to host their product pages.

    Ben
  • Different Techniques (Score:5, Interesting)

    by kalidasa (577403) * on Tuesday January 13, 2004 @10:33PM (#7969381) Journal

    The article doesn't do a good enough job of explaining the different techniques in use.

    First, hash busters. Yes, spammers are loading a random jumble of meaningful words in meaningless sequences into their spam, usually in the plaintext message body of a message with HTML content (i.e., you get hash buster - html message with spam content - hash buster). So HTML-aware clients (the main clients targeted I'm sure are AOL and Outlook Express) show the spam message, but not the hash buster. I'm guessing that this is specifically targeting bayesian filtering tools at AOL (anyone know if AOL is using a bayesian filter?); it works by introducing words that would not be found in a spam corpus in greater numbers than those that would.

    Second, noisy spelling, like v1@gr@. Obviously this is also intended to defeat regex-based filters like spamassassin. If you vary your cliches enough, and you introduce very strange, but easy-for-a-human-reader-to-recognize spelling variants, you make it much more difficult for filter writers to write effective regexes.
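
    One common counter to noisy spelling is to fold look-alike characters back to letters before matching, so that a single pattern covers many variants. The sketch below is only an assumption about how that might be done; the `LOOKALIKES` table and the `normalize` name are invented for illustration and are not taken from SpamAssassin or any other tool.

```python
import re

# Hypothetical look-alike table; a real one would need to be much larger.
LOOKALIKES = str.maketrans({
    "1": "i", "!": "i", "@": "a", "0": "o", "3": "e", "$": "s",
})

def normalize(word):
    """Fold look-alike characters to letters, then drop anything non-alphabetic."""
    folded = word.lower().translate(LOOKALIKES)
    return re.sub(r"[^a-z]", "", folded)
```

    For example, `normalize("v1@gr@")` and `normalize("V.I.A.G.R.A")` both give `"viagra"`. The trade-off is that folding also mangles legitimate tokens containing digits (such as "mp3"), so it is probably best applied only when checking against a short list of known spam terms.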

  • by Jerf (17166) on Tuesday January 13, 2004 @10:33PM (#7969384) Journal
    The real problem will be when the spammers finally figure out how to deliberately poison the Bayesian filters. So far they're using more-or-less random words, but that won't really work against Bayesian; it can tolerate that.

    However, what constitutes "non-spam" is not as unique as most people think, as I've examined here [jerf.org]. If they figure out how to deliberately put in hammy words, Bayesian will fall.

    I feel OK posting this because I freely admit to this point I've overestimated them; I'm sure spammers have read that piece, and to date they have been too stupid to figure out what I said in plain English. But sooner or later one of them is going to figure it out.

    There's a strong core of "ham" that is "ham" for everybody, and sooner or later they're going to start abusing that.

    And if I may forestall one objection... "But you don't understand Bayesian, it's [awesome for some reason and can't be beat ever, by anybody]" - I'll listen when you've actually written a program to examine filters yourself, OK? I understand it pretty damn well. It'll take more than bald assertions to convince me I'm wrong; I've done actual research, in the original sense of the word.
  • by HeelToe (615905) on Tuesday January 13, 2004 @10:33PM (#7969386) Homepage
    I thought about this after seeing my inbox spam increase to about 80 a day (the box that contains what is filtered usually gets 10 per hour - my address has been valid for just short of 10 years).

    Why not check the subject or first few lines of plain (not html) text and see if 80% of it is in /usr/share/dict/words? I thought about trying this out, but have been too busy to get off my ass and do it.
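
    A rough sketch of that check, for anyone who wants to try it. The function names are made up for this example, and whether a high or a low ratio should count against a message is left open, since subject lines built from real dictionary words (like "deprecatory parrot bizarre dessert") would score 100%.

```python
import re

def load_words(path="/usr/share/dict/words"):
    """Load a word list (one word per line) into a lowercase set."""
    with open(path) as f:
        return {line.strip().lower() for line in f}

def dictionary_ratio(text, dictionary):
    """Return the fraction of alphabetic words in `text` found in `dictionary`."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)
```

    Usage would be something like `dictionary_ratio(subject, load_words()) >= 0.8`, with the 80% threshold taken from the suggestion above.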
  • by mjprobst (95305) on Tuesday January 13, 2004 @10:34PM (#7969391) Homepage Journal
    I saw one just yesterday that contained a list of important key sentences and phrases from the literature of common charities and political activism organizations.

    In other words, if your Bayesian filter accepts those, based on your past decisions, it will detect the spam. If you reject the spam, you reject these communications as well.

    Good filtering practice would dictate that one reads the junk box carefully enough to find both false positives and negatives. But the sheer bulk of mail that ends up in the junk box makes this unfeasible for many.

    I have started letting these particular kinds of spam through, manually categorizing them (many words of random strings, dictionary vocabulary attack, positive phrase attack) in the hopes that filtering technology will soon advance to the point where these can be used as inputs to a more intelligent system.

    Of course overhauling the mail system is a prerequisite to solving any of this long-term. For once I don't mind D. J. Bernstein's Internet Mail 2000 proposals. Of course there are other proposed systems, none of which has enough momentum to start a slow steady change. The end result of any non-consensus system will be to fragment the worldwide network of Email into competing, noncompatible systems that need to communicate through some kind of loophole or gateway. Back to FIDO-net days.
  • I see this too (Score:5, Interesting)

    by rockwood (141675) on Tuesday January 13, 2004 @10:37PM (#7969418) Homepage Journal
    I've been using "SpamBayes Outlook Plugin" since a previous /. article talked about it.

    Agreeing with this article, over the past week or two I have seen an excessive amount of spam being missed by SpamBayes. Even after marking them as spam to improve the filter, they continue to hit the inbox, whereas previously absolutely no spam made my inbox. Additionally, there may have only been 2 or 3 emails marked as possible spam when they were not, and zero items marked as definite spam that were not.

    SpamBayes has worked great previously, but now even it is falling short.

    I feel that as the spammers manipulate the contents/context of the spam, it will eventually become impossible to determine the difference without physically looking at 500+ emails daily.
    My primary use of email is business and not personal, therefore I cannot risk missing a client email, payment, question, etc... I've also seen a progression of clients having MY emails deleted or caught in spam filters due to the business aspect and requests for payments. I feel this is primarily due to the comparison of too-often-common phrases that a spam email and a business email contain. Such things as Click here to submit payment, or Buy these Products, Overdue etc... Even though all clients I email are only clients that contact me. I never cold-email anyone.

    More spammers are using this random text as the only text in the subject and body, and using an image as the content of their email, which makes scanning even more complicated, if not impossible.

    Having been on the net since before it was what it is today (going on 20 years), I often wonder how much control spam actually has over the net in several aspects:

    • If spam were to disappear, would overhead costs decrease enough for ISPs to pass along greater savings to the consumer?
    • If Spam were to disappear completely, how much faster would the Internet be?
    Has anyone ever done a study to determine how much effect spam has on degrading the net, and what would it be like if all spam was gone tomorrow?
  • by YU Nicks NE Way (129084) on Tuesday January 13, 2004 @10:38PM (#7969429)
    Actually, the attack is more subtle than you think. The value of a random-words attack lies in the long-term damage it does to adaptive filters, not in how well or poorly it does with fixed filters.

    When an adaptive filter sees a rare word in a spam, it is likely to assign that word high spamminess. Problem is, the next time you see that word is likely to be in a piece of ham, resulting in a false categorization of a piece of ham as spam. The user cost of such an assignment is very high, and so users will be forced to look at their junk mail...which is, after all, what the spammers want.
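
    To make the mechanism concrete, here is a simplified per-token estimate of the kind an adaptive filter might keep. It is a sketch under assumptions: the function name, the clamping bounds, and the `min_hits` parameter are invented for illustration and do not describe any specific product.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham, min_hits=0):
    """Estimate P(spam | token) from per-class message counts.

    Tokens seen fewer than `min_hits` times in total are treated as neutral.
    """
    total = spam_hits + ham_hits
    if total == 0 or total < min_hits:
        return 0.5  # no usable evidence
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    p = spam_rate / (spam_rate + ham_rate)
    return min(0.99, max(0.01, p))  # clamp so one token is never decisive
```

    A rare word seen once, in spam only, lands at the upper clamp, which is exactly the exposure described above. Requiring a minimum number of sightings before a token counts is one possible way to blunt the effect, at the cost of learning new spam words more slowly.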
  • by McDutchie (151611) on Tuesday January 13, 2004 @10:40PM (#7969446) Homepage
    Why bother to market to them at all?

    In addition to living in their own criminally delusional world, spammers often don't spam for themselves but work for others. They get paid by their, er, client for each message sent, it doesn't matter to them whether it's wanted or not.

    Plus, there's always that .001% of suckers to keep the biz going if the cost of sending is close to zero.

  • by phutureboy (70690) on Tuesday January 13, 2004 @10:49PM (#7969517) Homepage
    Yeah, really.

    What I don't get is the spam which advertises a product, but gives you no way to follow through and purchase it. I've even looked at the message source and there is no brand name, 800 number, URL, or contact info. Just one paragraph which reads along the lines of "Our Cable Descrambler is the best on the market. It descrambles stuff better than the others. Purchase one today!"

    Not that I would actually purchase something; it just makes me wonder WTF the point was of sending the message in the first place. It seems like a 100% waste of time and bandwidth for everyone.
  • by Trejkaz (615352) on Tuesday January 13, 2004 @10:50PM (#7969520) Homepage

    What I don't understand about this type of spam is that often it doesn't contain any actual advertisement, just three or four lines of random words, and the end of the email right there.

    I don't get it. If you're not selling a product, what is the spam for?

    Mind you since TMDA, I haven't been seeing any spam anyway.

  • by crazyphilman (609923) on Tuesday January 13, 2004 @10:51PM (#7969535) Journal
    It's old fashioned, and some of you will probably make fun of me for using it, but hey, I'm old school. FYI, here's my method:

    1. Create manual spam filters (NOT Bayesian filters) in your inbox called "Friends and Family", "Work", "Services", "logfiles", and any others you find you need. Each category applies to a broad type of email address you'll receive email from. Then create a subdirectory in your inbox for each of these filters (named the same way, naturally).

    2. For each filter, build a list of people who are allowed to email you. For example, your ISP, your bank, and your phone company would probably be added to services. Just add the email address they send their messages from to the list.

    3. For each filter, have the filter move messages matching the filter (From equals ) to the correct subdirectory for the filter. Then stop processing for that message, so it doesn't get interpreted by other filters. Think of this as an analogy for ipfilter or ipfw in your firewall setup -- only you're filtering emails instead of packets.

    4. Finally, DELETE EVERYTHING ELSE in the very last filter.
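
    The four steps above amount to a first-match rule chain with a default-deny rule at the end. A minimal sketch, with hypothetical addresses and a made-up `route` function (the folder names come from step 1):

```python
# Hypothetical allow-lists, for illustration only.
RULES = [
    ("Friends and Family", {"mom@example.com"}),
    ("Work", {"boss@example.com"}),
    ("Services", {"billing@isp.example"}),
]

def route(sender, rules=RULES):
    """Return the folder of the first rule that lists `sender`."""
    address = sender.strip().lower()
    for folder, allowed in rules:
        if address in allowed:
            return folder  # stop processing: first match wins
    return "Deleted Items"  # the catch-all last rule
```

    As with firewall rules, order matters only when an address appears in more than one list. Note that the From address is trivially forged, so this sorts mail rather than authenticating it.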

    You USE this approach by doing a quick scan of the deleted items folder to see if anything is interesting. If not, just clean out those deleted items. It's a one step operation, much easier than selectively deleting a hundred emails one at a time.

    Then, you scan each of the folders you set up, IF the folder has picked up an email, focusing only on your REAL email.

    This approach has saved me a HUGE amount of work lately. My life is a whole lot easier, and it's way easier than trying to train a Bayesian filter. If I don't know you, you can't get too much of my attention.

    It's all about being on the list, sort of like getting into a nightclub... ;)

  • by tomstdenis (446163) <tomstdenis AT gmail DOT com> on Tuesday January 13, 2004 @10:52PM (#7969540) Homepage
    Just block the domain name/ip of the hosted images. Most spams I get come from random IPs but usually have common IP/domain name for the hosted images e.g.

    hostz300001.com/ads/viagra.jpg

    Or whatever. I've cut down from 50 spams to about 3 or so a day by doing that.
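
    A sketch of that kind of host-based check. The blocked host is the example given above; the function names and the URL regex are assumptions for illustration, and the regex is deliberately crude.

```python
import re
from urllib.parse import urlparse

BLOCKED_HOSTS = {"hostz300001.com"}  # example host from this comment

def linked_hosts(body):
    """Return the set of hostnames found in http(s) URLs within `body`."""
    urls = re.findall(r"https?://[^\s\"'<>]+", body, flags=re.IGNORECASE)
    hosts = set()
    for url in urls:
        host = urlparse(url).hostname  # lowercased by urlparse
        if host:
            hosts.add(host)
    return hosts

def is_blocked(body, blocked=BLOCKED_HOSTS):
    """True if any URL in the message points at a blocked host."""
    return any(host in blocked for host in linked_hosts(body))
```

    This matches exact hostnames only, so a spammer moving to a subdomain would get through until that name is added as well.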

    I bet a bayesian filter would work nicer but unfortunately I'm too lazy to mod the mail setup [that isn't mine] to get one installed..

    Tom
  • by mrpuffypants (444598) * <`mrpuffypants' `at' `gmail.com'> on Tuesday January 13, 2004 @10:54PM (#7969559)
    The solution to randomness is to spell check and grammar check incoming e-mail

    Apparently you've never gotten emails from either a:

    1) 14-year old girl
    2) Gamer
    3) UNIX sysadmin describing a sendmail .cf file

    Yikes.
  • by the_mad_poster (640772) <shattoc@adelphia.com> on Tuesday January 13, 2004 @10:55PM (#7969570) Homepage Journal

    1337 speak isn't a big deal. It's definitely filterable.

    I've begun seeing chunks of text appearing in messages that are like legitimate mini-messages in and of themselves. Sort of like a counter weight. I don't think the aim is to pound Spam through the filters now, because what's happening is spam is getting slightly lower ratings each time while legitimate messages are getting slightly higher ratings.

    In other words, the spam probably won't ever be legitimate, but it's making me lower my threshold for what is spam more and more. Eventually, I'll get to the point where some legit messages will cross over into being labeled as spam and spam will go through legit because the thresholds will be so close together as to practically overlap. It's also killing my ability to keep a spam trap that I can use to quickly train filters.

    Whether this scene will actually play out and the "plot" will be successful or not remains to be seen, however.

  • by The I Shing (700142) * on Tuesday January 13, 2004 @10:58PM (#7969592) Journal
    I keep praying for that silver bullet that will end spam forever.

    The thing that seems so insane about spam is that it's gotten to the point where apparently all spammers care about is getting past your filters. They must know that you're going to delete the message the moment you physically set eyes on the word "\/1A6RA," but it's as if they don't care. They just want to induce you to look at the word, and force you to hit the Junk Mail button or Delete key. They just want to waste your time filling your Inbox with their insane crap.

    It's like they're nasty little demons spitting up madness from the bowels of hell for the pleasure of their horned master. I can't picture a spammer as a human being at all... I always imagine hooves and a pointy tail, a slimy, crooked red finger pushing its sharp, black, malevolent fingernail into an eagerly pulsating "SEND" button.

    Read any interviews with these people? My god, they really are monstrous. The arrogance, the pomposity, and the self-justification spewing from each of their mouths combine to form a portrait of a person so utterly bereft of morals, ethics, or humanity that I just want to clip the spammer's photo out of the magazine, scan it, and send it to X-Wipes to be made into toilet paper. I'll let you imagine the rest.

    I've said it before and I'll say it again... spammers have done more than their share in turning the wonderful information highway into a sleazy backalley of filth, perversion, and fraud. Every day as I wait for my email client to download and process the two hundred or so spam messages that are clogging up my inbox, I sit in silent hope, praying that someone will find a way to end the madness at the source, and cut the spammers out of our lives forever and ever, amen.
  • by fermion (181285) on Tuesday January 13, 2004 @11:02PM (#7969616) Homepage Journal
    This is another subtle feature of modern email that allows spam to propagate: the HTML/RTF mail. Many mailers now default to the HTML setting. This is to allow lusers to put in obnoxious color schemes and use every font on their computer. It reminds me of 15 years ago when we were first doing desktop publishing.

    The real benefit is to the spammers. They can put inline images that make the email look like it came from a legitimate company, they can have the text version look random, but the HTML rendered version human readable. Almost all spam is going to be HTML, and my experience is that 95% of HTML mail is spam.

    Which means that if we filtered HTML most spam would go away overnight, and the bandwidth wasted by the remainder would be significantly reduced. We would also significantly reduce the security risks. Unfortunately the lusers that use services such as Yahoo! would also be filtered. I wonder if the decision to default to HTML is purely to satisfy the general customer, or a feature targeted directly to facilitate advertising.

  • Word Salad (Score:3, Interesting)

    by JohnGrahamCumming (684871) * <slashdotNO@SPAMjgc.org> on Tuesday January 13, 2004 @11:03PM (#7969623) Homepage Journal
    Weird. I am talking about this at the MIT Spam Conference [spamconference.org] on Friday and on a technique that can break a Bayesian spam filter.

    John.
  • How I deal with spam (Score:3, Interesting)

    by mabu (178417) on Tuesday January 13, 2004 @11:08PM (#7969658)
    I have had my main e-mail published and unchanged since 1995. It's probably on 99% of all spam mailing lists. One of my servers handles about 600 POP3 accounts. My stats currently indicate that now more than 80% of our SMTP traffic is confirmed spam.

    I don't believe in content-based filtering. We have a strict policy of not examining in any way, shape, or form, the content of any e-mail on our network.

    We deal with spam by implementing an array of fully-tested, fairly conservative relay blacklists which block the inbound SMTP connection before the junk mail is even transmitted.

    In more than two years of operation, we've only confirmed about six legitimate e-mails that were blocked, and we handle tremendous mail volume. It's an easy matter to "whitelist" anyone who might end up getting RBL'd to make sure the client can communicate with who they want. In EVERY case where a legitimate source was blacklisted, it was shown their ISP was irresponsible and the listing was valid.

    In addition to using RBLs, we also have an array of hard-coded IP blocks that our server will not accept mail from. This covers a good bit of the rogue Asia-Pacific ISPs that are the largest source of open relays. Something as simple as blocking major portions of 61.* has been shown to reduce spam by 30+%. Anyone legitimately in China that needs to communicate with our network can be quickly whitelisted. Ironically, most of the ISP SMTP relays are not near the same broadband IP ranges - they obviously know how effective this technique is.
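
    For readers unfamiliar with the mechanics, the two checks described here can be sketched like this. Everything named below is hypothetical: the network list only mirrors the "61.*" example, the whitelisted address and the `dnsbl.example.org` zone are placeholders, and no actual DNS lookup is performed.

```python
import ipaddress

# Hypothetical configuration, for illustration only.
BLOCKED_NETWORKS = [ipaddress.ip_network("61.0.0.0/8")]
WHITELIST = {ipaddress.ip_address("61.1.2.3")}

def accept_connection(client_ip):
    """Decide at connect time, before any message content is transmitted."""
    ip = ipaddress.ip_address(client_ip)
    if ip in WHITELIST:
        return True  # whitelist overrides the hard-coded blocks
    return not any(ip in net for net in BLOCKED_NETWORKS)

def dnsbl_query_name(client_ip, zone="dnsbl.example.org"):
    """Build the DNS name a blacklist lookup resolves: reversed octets plus zone."""
    octets = client_ip.split(".")
    return ".".join(reversed(octets)) + "." + zone
```

    By convention a blacklist answers with an address record when the IP is listed and with no record otherwise, so the server only needs one DNS query per inbound connection.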

    With RBLs and hard-coded blocks in effect, instead of 200 spams a day, I might get 3-5. As soon as I get new spam, I report it to Spamcop, and I notice a quick reduction in future spam of that nature immediately.

    We're now getting near the point of blacklisting the entire 24.* IP block as well - which encompasses, among other things, a large portion of Comcast IP blocks that Comcast can't or won't control.

    I'd like to see more ISPs simply refuse to accept mail from rogue networks. Then these networks would have to be more responsible.

    Let me preface all this by saying our policy is to whitelist anyone who complains they have legitimate mail being blocked. For some strange reason, we don't hear any spammers making these requests. That's a shame because I'd be happy to visit them personally to make sure their situation is resolved in a mutually-deserving manner.
  • Spam Poetry (Score:2, Interesting)

    by GoogolPlexPlex (412555) on Tuesday January 13, 2004 @11:13PM (#7969694)
    I get a lot of spams which contain 3 random words in the subject. Currently, I collect the subject lines in a text file and arrange them to make poetry. A few sample verses:

    i'll take this
    open window into
    imflammatory tales about
    pieces of herring

    shooting caused panic
    that surely only
    constituted a prelude
    or else maybe
    had ever happened

  • Re:gibberish... (Score:3, Interesting)

    by Ophidian P. Jones (466787) on Tuesday January 13, 2004 @11:19PM (#7969754)
    Worse yet, they keep spamming, Someone keeps buying from spam.

    Why was this marked Redundant?

    Maybe I missed someone else pointing this out, but it's a very important point. The spammers will only stay in business until it's no longer profitable. The technological solutions beat the legislative ones right now, but getting the word out to people that buying from spammers only encourages spam would really help too.
  • by dragonman97 (185927) * on Tuesday January 13, 2004 @11:26PM (#7969801)
    Yeah, I've noticed this pattern as well - and I've just been studying a mess of spam today to try and train a crappy spam filter. In my dept., we're speculating that some of this meaningless crap spam is actually an attack of some sort, designed to slow down e-mail systems, and/or crush them (think really small offices). There cannot be any real purpose to some of the spam out there - you would have to be brain dead to respond to some of the absolutely crappy messages that are being sent. It is entirely possible that some of these pointless spams might actually serve one other purpose - validating e-mail addresses through IMG message-tracking tags. (As such, I've been very carefully examining e-mails inside my favorite MUA - mutt :-).)
  • Some ideas (Score:2, Interesting)

    by Boyceterous (596732) on Tuesday January 13, 2004 @11:38PM (#7969910)
    1 - I've posted about this before; since I can look at just the subject, sender, and recipient fields and figure out if an email is spam, then I should be able to get/write a program to do that also, and therefore not have to even download the entire garbage content. I'm using my own email header spam-scoring system that gets about the same results as more sophisticated filters that examine email content.

    2- Most of the solutions to spam have involved ideas where senders pay or trying to swamp spammers with so much return junk that they get annoyed or driven out of business. Is it feasible to use an email system where the email content does not hop from one server to another? Just send the headers and where to get the content. In other words, when an email is sent, it would sit on the SMTP server provided by the sender's ISP(s). That way recipients have to go and get it (just like web pages, right?). It seems to me that would cut way down on traffic, could provide accountability, and alleviate the ridiculous burden on the recipient's ISP to provide storage for every idiot that wants to send their trash to my e-doorstep. ISPs would be pressured to charge for holding millions of emails until they're read, and at the same time would quickly get blacklisted if they allow spammers to operate from their servers - and the sender ISPs know who they are, which might make it possible to get the actual spammers more directly. Seems like such a system might at least direct more of the cost towards the sender side rather than the recipient side.

  • by adrianbaugh (696007) on Tuesday January 13, 2004 @11:39PM (#7969917) Homepage Journal
    It seems to me it would be much harder to poison a filter that did Bayes by splitting email into word pairs or triplets and assigning ham and spam probabilities for each. That way the bad grammar and random word lists would be extra-bad. I suspect longer sequences would become harder and harder to foil. They might require extra training of the database, but if you're getting lots of spam that isn't really a problem. Perhaps the word sequence length could be configurable.
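
    The tokenizing step for this idea is small. A sketch, with a made-up function name and plain whitespace splitting assumed; the resulting strings would be fed to the filter in place of (or alongside) single words:

```python
def ngram_tokens(text, n=2):
    """Return overlapping n-word sequences from `text` as space-joined strings."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

    For the subject line quoted elsewhere in this discussion, `ngram_tokens("deprecatory parrot bizarre dessert", 2)` yields "deprecatory parrot", "parrot bizarre", and "bizarre dessert", none of which is likely to appear in anyone's ham. The `n` parameter is the configurable sequence length; larger values mean a bigger, sparser database.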
  • Re:I see this too (Score:2, Interesting)

    by Wild Wizard (309461) on Tuesday January 13, 2004 @11:57PM (#7970058) Journal
    We managed a score of 42.8 recently with SpamAssassin

    http://spamhalloffame.abnormalpenguin.com/ [abnormalpenguin.com]

    Only a few slip through at a level of 5 for us, haven't yet got to piping the high level ones directly to /dev/null yet
  • by vacuum_tuber (707626) * on Tuesday January 13, 2004 @11:59PM (#7970076) Journal

    mabu wrote:

    We're now getting near the point of blacklisting the entire 24.* IP block as well - which encompasses, among other things, a large portion of Comcast IP blocks that Comcast can't or won't control.

    That's the real problem with blocking by IP ranges. I'm in 24.* because it's the only high-speed Internet I can get. It's not Comcast but I see tons of probes from infected machines local to me in my area of 24.*. But I'm not the only legitimate business living in a broadband network that contains tons of clueless residential subscribers. What would you have us do, get T1 lines and $3,500/mo ISP feeds? Go back to dialup? What's wrong with this picture?

    I have a static IP, my own domains, and run my own Web and email servers. My site is business, has tons of information on a niche IT subject, has forums, and some growing e-commerce for parts and equipment in my niche.

    If and when you block 24.*, either your users won't be able to write to me or I won't be able to reply to them, and if you follow the pattern of a lot of clueless admins out there you will also block to postmaster, so it will be impossible to let you know that you're blocking legitimate traffic.

    Anyone legitimately in China that needs to communicate with our network can be quickly whitelisted.

    Aside from the amusing notion of "Anyone legitimately in China" (what's the alternative -- being an illegal immigrant?), just how would a sender of legitimate email from China to a user in your network let you know that you are blocking their email? How would they let the person who can't receive their mail know that the block is preventing them from communicating?

    Most of my business contacts are initiated by the OP by email, from all over the world. If someone can't reach me because I block more than I should, that person will likely never reach me and I will never get any business from them. From my business perspective that would be exceptionally stupid network management.

    I filter inbound spam by whitelist and then content. I get zero false negatives in my New Mail folder at the price of having to pick up some new correspondents from the SPAM folder and whitelist them. At least that way, though, I have a folder of truly confirmed spam to send to SpamCop by script, and thanks to the recent trend of gibberish tacked onto the Subject and other highly human-recognizable signals in From and Subject visible in the folder list, I no longer have to actually open any messages to confirm they are spam. Even when I do, though, my mail client doesn't retrieve any graphics from any servers.

    Not retrieving graphics doesn't save me from confirming I am here, though, because as soon as I pass the confirmed spam to one of my servers the spam is first sent to SpamCop, then all the URLs are parsed out, spammer's email addresses are substituted for all occurrences of my email address in the URLs, spammer domains are substituted for any occurrences of my domain, and scripts then download the entire spam sites, once for each URL they have sent me.

    That still leaves encoded values in the URLs, which I presume contain at least a cross reference to the email address the spam was sent to, but I don't care. "Send me spam and get your site downloaded. More spam -- more downloads." Most spam is, after all, an explicit invitation to visit a spamvertised Website.

  • by letxa2000 (215841) on Wednesday January 14, 2004 @12:03AM (#7970112)
    You're completely right. I love it that spammers try to conceal their mail with weird combinations of words.

    Examples from my corpus:

    VIAGRA: 99.797%
    V!AGRA: 99.9999%
    AGRA: 99.9999% (from things like VI.AGRA)
    IAGRA: 99.9999%

    PORN: 98.573%
    P0RN: 99.9999%
    PR0N: 99.9999%

    Plus, the trick is looking for things that give away spam that aren't just words. I call them "characteristics." For example:

    Various pharmacy related terms: 99.9999%
    HTML using % escape sequences: 98.789%
    Http:// references that don't use www: 85.538%
    =?ISO- in Subject: 99.9999%
    Suspicious domains (BIZ, BR, PRO, etc.): 99.174%
    1 "Adult Term": 70.8%
    2 "Adult Terms": 85.7%
    5+ "Adult Terms": 99.9999%
    5+ HTML Comments: 92.0%
    10+ HTML Comments: 98.3%
    30+ HTML Comments: 99.9999%

    In short, there are so many aspects of a message you can analyze and make "Characteristics" that my Bayesian filter can often make a decision entirely based on the characteristics without even looking at some of the terms used within the message. But if the characteristics aren't damning enough, the content virtually always is.
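
    A minimal sketch of how such "characteristics" could be turned into extra tokens for a Bayesian filter. The token names, the regular expressions, and the choice to scan the whole body for % escapes are assumptions made for illustration; the post does not say how its filter actually detects them.

```python
import re

def characteristics(subject: str, body: str) -> list[str]:
    """Return synthetic 'characteristic' tokens (hypothetical names) for a message."""
    tokens = []
    # An encoded-word marker such as =?ISO-8859-1?...?= in the Subject header.
    if "=?ISO-" in subject.upper():
        tokens.append("CHAR:ISO_SUBJECT")
    # Percent escapes such as %41 anywhere in the body (a rough approximation).
    if re.search(r"%[0-9A-Fa-f]{2}", body):
        tokens.append("CHAR:PERCENT_ESCAPE")
    # Bucket the number of HTML comments the way the list above does (5+, 10+, 30+).
    comments = len(re.findall(r"<!--.*?-->", body, flags=re.DOTALL))
    for threshold in (30, 10, 5):
        if comments >= threshold:
            tokens.append(f"CHAR:HTML_COMMENTS_{threshold}+")
            break
    return tokens
```

    Each returned token would then be trained and scored like any ordinary word, which is presumably how a characteristic ends up with its own spam percentage.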

  • by K-Man (4117) on Wednesday January 14, 2004 @12:09AM (#7970163)
    Let's see:

    Gen3r@c v|agar@
    Gener@c v|agar@
    Generic v|agar@
    Generic viagar@
    Generic viagr@
    Generic viagra

    That's an edit distance of 5, pretty large, but still findable with a little approximate matching, especially if it's weighted, to recognize the similarity between @ and a, or i and |.
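
    A rough sketch of the weighted approximate matching described above: an ordinary edit distance in which swapping look-alike characters is cheap. The look-alike table and the 0.2 cost are assumptions picked for illustration, not values from the post.

```python
# Look-alike character pairs and their (assumed) cheap substitution cost.
LOOKALIKES = {
    frozenset("a@"): 0.2, frozenset("i|"): 0.2, frozenset("i!"): 0.2,
    frozenset("i1"): 0.2, frozenset("e3"): 0.2, frozenset("o0"): 0.2,
}

def sub_cost(a: str, b: str) -> float:
    """Cost of substituting character a with b."""
    if a == b:
        return 0.0
    return LOOKALIKES.get(frozenset((a, b)), 1.0)

def weighted_distance(s: str, t: str) -> float:
    """Edit distance with unit insert/delete cost and weighted substitutions."""
    s, t = s.lower(), t.lower()
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, 1):
        cur = [float(i)]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0,                   # delete cs
                           cur[j - 1] + 1.0,                # insert ct
                           prev[j - 1] + sub_cost(cs, ct)))  # substitute
        prev = cur
    return prev[-1]
```

    With these weights "v|agr@" sits much closer to "viagra" than a plain character count suggests, so a modest threshold on the weighted distance could flag it without matching unrelated words.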

    Most spam contains repeated phrases 40+ characters long. The mistake is to use word-counting techniques which ignore phraseology.

    For instance, here are some phrases from spam, circa one year ago:

    Please fill out the form below for more information
    To unsubscribe
    To remove your
    in the Marshall Islands
    Please allow 48-72 hours for removal
    to this email with REMOVE in the
    the Northern Ratak
    the information
    thousands of dollars
    that you will
    this list, please
    this advertisement
    this email in error
    this message, you may email our
    this transaction
    of thousands of
    of EnenKio and
    of Eneen-Kio Atoll
    of His Majesty
    our mailing list
    out 5,000 e-mails each for a
    opportunity to make
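
    One way to look for repeated phrases like these, sketched under the assumption that a "phrase" is a run of n consecutive words (the post does not say how its list was produced). The function names and defaults are hypothetical.

```python
from collections import Counter

def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of n-word phrases occurring in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def repeated_phrases(messages: list[str], n: int = 3, min_messages: int = 2) -> set[str]:
    """Return phrases that appear in at least min_messages different messages."""
    counts = Counter()
    for msg in messages:
        # shingles() returns a set, so each phrase counts once per message.
        counts.update(shingles(msg, n))
    return {phrase for phrase, c in counts.items() if c >= min_messages}
```

    Run over a spam corpus, the surviving phrases could be fed to a filter as single tokens, so that "Please allow 48-72 hours for removal" counts as one piece of evidence rather than six unremarkable words.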

  • by Kurt Wall (677000) on Wednesday January 14, 2004 @12:12AM (#7970179) Homepage

    So, the spammer sub-life forms start inserting filter-foiling gibberish, which has various effects:

    1. Foils anti-spam filters - obviously, this sucks
    2. Makes it easy to detect visually - this bites if you don't even want to see spam
    3. Makes the spam itself hard to read - and the downside of this is?
    4. [insert favorite misfeature here]

    It occurs to me, though, that if spam gets hard to read, no one reads it. If no one reads it, spam ceases to work. If spam ceases to work, spammers are out of work (sniff -- not!).

    So when spam becomes so convoluted to get past anti-spam systems, it will become too convoluted to work. We can only hope.

  • by letxa2000 (215841) on Wednesday January 14, 2004 @12:16AM (#7970205)
    The encoding V*I*A*G*R*A would break out to the letters V I A G R and A.

    V: 76.9% Spam score
    I: 47.2% spam score
    A: 68.8% spam score
    G: 72.2% spam score
    R: 72.2% spam score

    On balance, if I get a message with the individual "words" of V, I, A, G, R, and A, that's going to be leaning towards spam.

    That's the beauty of Bayesian. Anything the spammers do will eventually come back and bite them in the butt. Even some of the "random words" they are starting to use are getting high spam scores:

    WHEREUPON: 99.9999%
    NEOCONSERVATIVE: 99.9999%
    LIBERAL: 74.3%
    LIBERTY: 84.0%
    MEGATON: 99.9999%
    METHANE: 99.9999%

    These are just a few of the "random words" I found in recent spams and, interestingly, the random words they are using are actually INCREASING their spam probability.

    Statistically, it's a lost cause for the spammers, they just don't realize it yet.
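
    To see why six individually unimpressive letter scores still lean towards spam, here is a sketch that combines them with the naive Bayes rule from Paul Graham's "A Plan for Spam". Whether the poster's filter combines scores this way is an assumption; the second A reuses the score listed for A.

```python
from math import prod

def combined_spam_probability(probs: list[float]) -> float:
    """Combine per-token spam probabilities, treating tokens as independent."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Scores quoted above for the "words" V, I, A, G, R, A.
letters = [0.769, 0.472, 0.688, 0.722, 0.722, 0.688]
print(round(combined_spam_probability(letters), 3))
```

    Under that rule the combined score comes out near 0.99, even though no single letter scores above 77%.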

  • Spell checking works (Score:1, Interesting)

    by Anonymous Coward on Wednesday January 14, 2004 @12:19AM (#7970221)
    I've written a Bayesian filter into my email client [memecode.com]. And it was this added piece of functionality that made a big difference: it spell checks each word in the incoming email that isn't in either corpus of mail. If the word is misspelt, it weights the word in the spam direction.

    The upshot is that it makes using nonsense words pointless.

  • by Jerf (17166) on Wednesday January 14, 2004 @12:31AM (#7970319) Journal
    Do you have evidence to back that assertion? In my case (I know it's just me), ham basically means either referring to my open-source projects or written in French (even then spambayes does a good job at rejecting French spam).

    Language is often a big indicator; since spam is aimed at a particular language group I don't consider it much. The fact my filter marks Japanese or Korean messages as spam is almost irrelevant, in a way, since I can't read it anyhow and it's easily dismissed.

    But there's this common misconception that inside the spam filter it just looks for the three or four key words that mark "your" ham to the exclusion of all else. In reality there are big cues that are independent of "personalization"; see the Interesting Results [jerf.org] section. Would you have guessed that "I'm" is such a non-spam indicator?

    There's a strong core of hamminess that will be common to nearly everybody. (Also clarifies your point 1.)

    2) Lack of "training data" for them We have lots of data from which we can learn how to avoid spam, but they have very little data which they can use to "train" anti-filter techniques.

    Well, I sure didn't have any trouble finding ham for my training! Collecting 20,000 ham messages took me about 15 minutes; it took me longer to process them than to find them. If I were a dedicated spammer I could collect a million in a couple of days, depending on how diverse a selection I want to acquire. One "weakness" of my experiment is the limited selection I acquired, but that's easily fixed and I think based on my experience it's already plenty diverse.

    3) They have to get the main message through. Eventually, if you can detect all forms (that remains to be seen) of the word "Viagra", they simply can't use that word in their email anymore (assuming I've got no ham containing that word).

    Yes and no. I already acknowledged in my post that without "cheating", you can't really get a sex spam through. (Though you'll have a hard time getting a real sex email through, too, if that is a normal email for you.)

    But I "played fair"... spammers don't have to. They can craft a highly hammy message and append it to their spam. Even if your filter stops it, it poisons the filter. The filter writers can then take countermeasures against that, but you're back to an arms race and that's not a gain over what we had before the Bayesian filters.

    4) Because each spam message is different, they have to find a cost-effective way to make each of them immune to filters. That's not easy either.

    Well, creating a highly hammy message and appending any short spam to it they want ought to work. That's not too expensive.

    Even so, you're sending a lot of people the same message for so little money it boggles the mind. Raising the bar for writing a message a little won't stop the flow, because it amortizes across all copies of the message sent too well. You need to raise cost per message or a number of other approaches.

    I don't think spammers are that dumb either.

    I used to not think so, and I had bet that Bayesian would already be useless by now. But I now realize that I have overestimated them by a significant margin. Like I said, I know some of them have read that piece. I get hits for "bypassing Bayesian filters" nearly every week from Google. I've gotten several requests for source code to my program, and I wager not all of them were legitimately academic. (Fortunately, I've lost it through a hard drive crash, but I consider my results still scientifically valid as at least in my opinion, I've given enough information to replicate my results.)

    But they still haven't progressed past stupid o.b.f.u.s.c.a.t.i.o.n techniques (no, that won't get past Bayesian) and purely random words (neither will that) very far. (Remember, which a lot of people seem to miss when they read my piece, I respect Bayesian
  • by berzerke (319205) on Wednesday January 14, 2004 @12:33AM (#7970334) Homepage

    [What I don't understand about this type of spam is that often it doesn't contain any actual advertisement, just three or four lines of random words, and the end of the email right there.] Actually I was viewing the source of the whole email, not the text part.

    I too see this sometimes. You're not crazy (at least with regards to this). I've looked at the full source, but still can't figure out what the goal is. My best guess is either they are fishing for bounces (ok, these are bad addresses; the ones that don't bounce may be good addresses), or the spamming software has a problem (bug or is misconfigured).

  • Habeas SWE in spam (Score:3, Interesting)

    by YetAnotherDave (159442) on Wednesday January 14, 2004 @12:42AM (#7970418)
    Has anyone else seen a spurt of Habeas SWE headers in spam?

    I'd never seen any until this week, and suddenly I've got like 5/day.

    I forwarded them to the good folks at habeas, hopefully the spammer will get sued into oblivion, but it's forced me to re-score SWE with a much lower bonus in spamassassin...

    http://habeas.com/servicesHowSWEWorks.html for those who don't know what I'm talking about, btw
  • by mabu (178417) on Wednesday January 14, 2004 @01:00AM (#7970529)
    That's the real problem with blocking by IP ranges. I'm in 24.* because it's the only high-speed Internet I can get. It's not Comcast but I see tons of probes from infected machines local to me in my area of 24.*. But I'm not the only legitimate business living in a broadband network that contains tons of clueless residential subscribers. What would you have us do, get T1 lines and $3,500/mo ISP feeds? Go back to dialup? What's wrong with this picture?


    We're not blocking all of 24.* right now because there are some people like you on that block, but if Comcast and other ISPs that are in that class A don't get their act together, you guys are likely to have problems, because I'm sure I'm not the only person that notices that net block is a never-ending source of problems.

    I am also of the belief that many of these large blocks are DULs. If you have legitimate permission from your ISP to run your own servers, I'd hope they would separate you in the IP space from the DUL RBLs. If not, that's an issue your ISP should consider.

    I don't have much sympathy for Comcast however. They are proving to be THE worst American ISP in terms of controlling spam.

    Let me also say something.. the 2+ tier backbone providers in most cases don't have the performance of someone like Worldcom (as much as I'd like to not admit it). You can get by with less bandwidth on a higher-performing network that doesn't go through a bunch of goofy networks that don't have their act together. Shop around if you find yourself serviced by an ISP that is indiscriminate about who they do business with. There are always options.

    just how would a sender of legitimate email from China to a user in your network let you know that you are blocking their email?

    All relay-blacklisted e-mail is returned to the sender with an error message that redirects them to a web page with an e-mail form they can use to contact us. The only downside to this is that we have to expire the deferred mail cache more quickly than we would normally prefer, but since the server in question is just for inbound and not outbound relaying, it's not a problem.

    Spamcop-RBL'd mail similarly echoes an error message to the user with a URL they can click on to actually show the spam history of the smtp relay in question. It works very well, and best of all, it dramatically cuts down on the bandwidth that spammers consume.

    Thanks for reporting to Spamcop. I really like their service too. The problem is, there are so many Asia-pacific and Comcast IPs, Spamcop isn't as effective when spammers have such a diverse array of IPs to hijack, so we've had to resort to some additional block blacklisting. It has proven to be very effective and we never leave legitimate users in the dark. If you had a mail relay in the block and tried to send me mail, you'd get a message and a quick way to contact me to have yourself authorized.
  • by La Camiseta (59684) <me@nathanclayton.com> on Wednesday January 14, 2004 @01:05AM (#7970572) Homepage Journal
    Because of this, my Bayesian spam filter is gathering statistics as to what words/letters together create legible paragraphs, sentences, words, etc. I.e., it filters out paragraphs that aren't realistic or don't make sense.

    That makes me wonder if all of this statistical data would be of use when it comes to some sort of Natural Language Processing.
  • Free pi||$ !!!1!! (Score:2, Interesting)

    by Flingles (698457) on Wednesday January 14, 2004 @01:31AM (#7970693) Journal
    Is it just me or do many of the spams lead nowhere? I actually tried going to a few of them in my junk mail folder, and half of them are broken links! They must just like to annoy people, because they are getting 0 sales off a broken link (as opposed to a 0.0001% response).

    Also, it seems to me we need a pay per email system fast. There are a few holes to patch though. Imagine, a person presses send, and pays their ISP say 5c. Already there are several holes: every ISP in the world would have to comply to stop spam. So change it round: a person presses send, and the destination ISP says "wait, you need to pay"; unless 5c is given to the receiver's ISP the email is never sent. Any ISP who doesn't have the software to pay the other providers will obviously lose their whole customer base, thus forcing them to use pay per email. Another hole is that legitimate newsgroups would operate at huge costs and businesses with many employees would be paying hundreds per day. So, make a deposit system: a person sends an email, 5c is paid to the receiver's ISP, and when they read it a button is displayed to give their 5c back. If not, the ISP gets to keep a whole lot of 5c's (hopefully lowering prices).

    If this were possible, spammers would operate at a huge loss, because no one would send back their deposit.
  • by Anonymous Coward on Wednesday January 14, 2004 @01:37AM (#7970729)
    Why don't we simply add a 'correctness' metric to our spam filters that runs a check of each word against a hash of all known words, such as that found in the parts-of-speech.txt file found at http://aspell.sourceforge.net/wl/ ... This would allow spam filters to detect 'garbage' most of the time, and flag for closer inspection.

    Of course... This would also encourage people to spell-check their emails! Wooo!
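
    A small sketch of such a 'correctness' metric, assuming the word list has already been loaded into an in-memory set (the `known_words` argument is hypothetical, and loading the aspell list is left out). Tokens are split on whitespace with surrounding punctuation stripped, so obfuscations like "v|agra" stay intact and fail the lookup.

```python
import string

def unknown_word_fraction(text: str, known_words: set[str]) -> float:
    """Return the fraction of tokens in text that are not in known_words."""
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in known_words)
    return unknown / len(tokens)
```

    A message could then be flagged for closer inspection when the fraction passes some threshold; where to put that threshold would have to be tuned against real mail, since names and jargon also miss the dictionary.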

  • Re:gibberish... (Score:3, Interesting)

    by Mr Z (6791) on Wednesday January 14, 2004 @01:40AM (#7970746) Homepage Journal

    Actually, I avoid deleting my spam. I have an archive now of over 270MB of spam that I can use for a training set for whatever filter I might intend to deploy.

    That archive has more than just spam, mind you. It also has all the virus/worm email I've received over the years as well, such as the "Internet Email System" informing me of an undeliverable message, or "Microsoft Corporation" providing me a convenient, easy to click "December 2003 Internet Update" or whatever.

    *sigh*

    --Joe
  • by Mr Z (6791) on Wednesday January 14, 2004 @01:56AM (#7970831) Homepage Journal

    So far as I can tell, most mainstream mailreaders (in their default configuration) will show you only the HTML component, if both variants are provided.

    Thus, the spammer puts their filter-fooling gibberish in the text/plain component, and their ad in the text/html component. The recipient is none the wiser about the gibberish.

    Since I use mutt, and I don't have an HTML filter configured, I'm immune to the ads in most spam. Since spam advertisements like to have tracker images and so on (to measure how often people actually open spam), I seem to get relatively little spam that lacks an HTML component. Further, most spam lacks a meaningful text/plain component.

    The only annoyance with this arrangement is the fact that one or two of my coworkers insist on sending HTML-only email. *sigh* (Since one of them is the father of JTAG, [findarticles.com] I don't bother trying to bend his ways.)

    --Joe
  • Gibberish, or code? (Score:5, Interesting)

    by cr0sh (43134) on Wednesday January 14, 2004 @02:16AM (#7970904) Homepage
    I, too, have noticed these seemingly random words that seemed to have nothing to do with the main text of the spam. I have also noticed the "gibberish words". One of my thoughts was that it was for defeating or bypassing bayesian filters - and likely, that is the case. But my thoughts turned to another possible use...

    What if spam, and the spammers' software, was actually being used by a third party in a surreptitious manner to send/receive messages? Kinda like plaintext stego. Maybe the software used by spammers is backdoored by this third party - he sends instructions to the machine(s), maybe via a virus or something simpler, the spammers send their messages, but "unknown" to them the spams have this garbage at the end. The spammer doesn't really care, maybe he bitches at whatever passes as tech support for the spam software. Most people who receive the spam see the stuff as garbage, or filter busters. But a certain group of the third party's friends - they have special email software that downloads these spams, and strips the garbage out, decodes it, and reassembles it into the real message. Maybe each spam only contains the equivalent of a couple of characters after decoding (maybe the garbage is actually packets telling order in the sequence, and other info to reconstruct the message) - but over a week or so, an entire message could be sent...

    What is the possibility of that? Occam's Razor suggests otherwise, and filter busters are probably what the stuff is - but...what if...?

  • Sorry (Score:2, Interesting)

    by Douglas Simmons (628988) on Wednesday January 14, 2004 @02:21AM (#7970934) Homepage
    "Getting the word out" to stop patronizing spammers will not curb spamming because spamming is a free, quick and easy method to reach however many people you want. Once you find yourself a list of harvested email addresses and an open relay, sending an advertisement to hundreds of thousands of people with a few clicks for zero dollars is something you would not be deterred from doing because of diminished hit rates caused by a campaign you're suggesting.

    As time passes, more people figure out how to spam and more email addresses get snagged by harvesting. This will keep the flow of spam increasing exponentially no matter what curbs we come up with. At least it's creating a market for anti-spam products, as well as offering the larger ISPs something to claim they know how to defeat in their advertisements. Good for the economy.

    Now what we do have a shot at getting rid of is real-life leafletting. Nothing pisses me off more than these Bush-approved illegals obstructing my path on the sidewalk to shove some piece of paper advertising cheap suits in my face. Maybe this is only something that bothers fellow New Yorkers though...

  • by Trejkaz (615352) on Wednesday January 14, 2004 @03:31AM (#7971170) Homepage

    Whereas it might be true that all "spam" has forged headers, not all email which passes the 5.0 threshold has forged headers.

    Also aren't other mail servers supposed to check that the envelope sender matches the host it's being sent from?

  • by Anonymous Coward on Wednesday January 14, 2004 @06:34AM (#7971708)
    Spammers typically are looking for two responses to email.

    1. Go to a website
    2. reply to their email

    The answer to 1 is to simply dump all embedded html. Problems solved. Nobody I know ever needs to send me email disguised as a web page. And yes mom, that means you have to lose that gawd awful floral background in all of your 'how are you son' emails.

    The answer to number 2: (the other number 2)

    What we need is not a better filter, we need a better response mechanism.

    Spammers rely on the fact that smart readers who are non customers will not respond to their ads. This reduces the responses they receive to legitimate customers, people who are simply verifying their email addresses by asking not to be spammed, and of course spam from other spammers.

    What if instead of never responding to spam, everyone automatically responded to every spam with 'canned ham'.. seemingly sincere messages filled with info culled from the spam itself.

    Yes, please make my P3nis larger. 13" is no longer interesting now that everyone is growing beyond belief thanks to your wonderful products. Please send your wonderful pen!s enlarging kit to me right away. Do you need a credit card? Tell you what, don't use this email adress, use my hotmail address. areallybigone@hotmail.com

    With everyone responding to all the spam but with no intent of following up on the correspondence, the spammers' inboxes will be flooded with responses from legitimate accounts, all of which will seem like willing customers but will in fact be completely useless.

    A few things. One is that it will render every mail list completely useless. It will give spammers a taste of their own medicine. It will vastly increase the amount of mail traffic for a very short amount of time, causing the ISPs to take notice and perhaps fscking do something about the spam problem. It will be mildly humorous in the short term to watch all the spammers drown in a sea of BS email.

  • Re:gibberish... (Score:4, Interesting)

    by 1u3hr (530656) on Wednesday January 14, 2004 @07:34AM (#7971905)
    Someone keeps buying from spam.

    Not necessarily. I'm sure most of those people (had to backspace over a few epithets) who spam Make Money Fast either lose money or get into legal trouble. But the damage is done (to me) before they learn that it won't make money. I think the driving force is selling spam services to gullible clients like these. (Not including the industrious Nigerians who seem to take a more personalised DIY approach.) Even if someone DID want penis-enlarging cream, I think by now they'd have a source of supply, that market must be pretty saturated by now.

  • by G4from128k (686170) on Wednesday January 14, 2004 @08:31AM (#7972058)
    I'm surprised that spam filtering software doesn't just run a quick spellchecker on the email. So much spam tries to evade literal word filtering by clever spellings of p3nis and \/iagra. But if we filter out emails with too many spelling errors (and punctuation-addled non-words) in the subject and body, then all those clever ploys are for nought. (As a side benefit, more people would be careful about spelling in legitimate e-mails).

    Filtering out misspelled emails puts spammers in a real quandary -- spell words correctly (and get filtered) or misspell (and get filtered).
  • Threshold? Bah! (Score:3, Interesting)

    by glpierce (731733) on Wednesday January 14, 2004 @09:34AM (#7972445) Homepage
    I'm worried about spammers realizing that they can effectively negate the usefulness of filters without breaking a sweat (spammers, please don't read the following). If they switched from super-short fake messages to mock-real messages (a paragraph or two long, a legit-sounding subject, etc.) and they all sent out millions a day, everyone would be forced to turn off their filters. There would be no effective way to distinguish those fake messages from real messages for most people (without a whitelist/blacklist system, which does more harm than good for most).

    In such a situation, email would grind to a halt. Anyone who kept trying to train their filters would just end up blocking most legit emails, and those who don't train for it or turn off would be flooded with real and fake messages they can't distinguish between. The messages would even be profitable, so long as your "friend" included a link to some "cool website" that happens to sell [fill in spam product here]. Go ahead and train your filter to block emails containing URLs. Hah! Maybe if you don't have a job, friends, or buy things over the internet you can, but for most it's just not going to work.
  • by geoff lane (93738) on Wednesday January 14, 2004 @10:44AM (#7973097)
    When SCO, sorry CoS, were spamming ARS a couple of years ago it was possible to kill 99% of the spam just by computing the average word length in the spam. Ordinary humans generated messages with an average word length of 4.5 letters; CoS random word spam had an average word length of 5.5 letters.

    I was surprised that such a simple test worked so well.

    One day I must re-implement the test for email spam and see if it works as well.
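
    For anyone who wants to try it, a minimal sketch of that average-word-length test. The 5.0 cutoff is an assumption placed between the 4.5 and 5.5 figures quoted above; the original test's actual threshold isn't given.

```python
import string

def average_word_length(text: str) -> float:
    """Mean length of whitespace-separated words, ignoring surrounding punctuation."""
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    return sum(len(w) for w in words) / len(words) if words else 0.0

def looks_generated(text: str, threshold: float = 5.0) -> bool:
    """Flag text whose average word length exceeds the (assumed) threshold."""
    return average_word_length(text) > threshold
```

    Very short messages will swing wildly on a statistic like this, so it probably only makes sense above some minimum word count.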
  • by Brian Ristuccia (2238) <brianr-slashdotspam@osiris.978.org> on Wednesday January 14, 2004 @12:51PM (#7974422) Homepage
    I hope to hell they're fishing for non-bouncing addresses, because at the moment any email which SpamAssassin says is spam, I bounce.
    Don't ever do that, all spam has forged headers. You're just making life hard on someone who had their address sold.

    Returning suspected spam might have a small adverse effect on the legitimate holders of forged addresses, but silently deleting suspected spam adversely affects everyone by causing misclassified messages to be silently lost. The practice of bouncing spam doesn't increase collateral damage, it prevents it. Automated processes must cause mail to either reach its destination or be returned to its purported sender. Otherwise legitimate mail will get silently lost. That's collateral damage.

    This balance of burdens is fair too. Fake bounces are much easier to filter than ordinary spam. Even if the bouncing MTA engages in the unfortunate practice of sending bounces that don't contain the original message you can still filter all fake bounces with 100% reliability. Simply send each of your outgoing messages with a unique tagged, timestamped envelope sender address. Bounces which arrive at other addresses are always in response to forgeries and can be safely discarded.

  • by letxa2000 (215841) on Wednesday January 14, 2004 @02:35PM (#7975777)
    I get the same statistics as you with my SA install, most of it is given a BAYES_99 score. Unfortunately, many don't train their own filters, and this is rather effective against them.

    True. Although an obvious caveat of using Bayesian to filter is that you HAVE to train it. In the anti-spam service I use (see tagline) it defaults to NOT using Bayesian. If you turn Bayesian on it specifically sends you an email reminding you that you MUST train it or things will actually get worse.

    But you're right, a misused Bayesian filter might actually be worse than no Bayesian filter at all. But that's the case whether or not spammers insert random words.

    There are ways to poison Bayes-filters that are better than this, and that may well be effective. If you sit down and think about it, I'm sure you can think of something too. I'm not going to write them, because it will be too easy for spammers to implement. Fortunately, spammers are stupid, and that buys us some time, but we still need more options.

    Let's talk about them. We're not going to come up with anything that spammers can't come up with so I don't think we're going to make things any easier for them or give away the farm by discussing it publicly.

    I personally have thought about it and I'm unaware of how they could poison Bayesian statistics. I only see two approaches, theoretically. 1) Make your spam get a lower Bayesian score so it gets through. 2) Make non-spam get a higher Bayesian score so it gets caught as a false positive.

    Approach #1: Short of going to the "spam of the future" predicted by Paul Graham, I don't see any way for spammers to really get a lower spam score. I've seen entire sections of the Constitution embedded in spam that still got a 98% spam score. The only way spammers are going to get a lower spam score is by doing things like using the names of my friends, using words related to topics I often discuss, etc. And that's just not possible. Like I said, they might get an occasional lucky shot but what gets through to me most probably won't get through to you. I just don't see any way for them to reliably get past a significant number of Bayesian filters.

    Approach #2: Poison the Bayesian stats such that non-spam mail gets tagged as spam. I'm pretty convinced this isn't possible, either. Again, they'd have to heavily use words that are specifically non-spam for the receiver such that the spam rating for those words increases so high that it is considered spam. But if the words are heavily used in both spam (trying to poison the stats) and non-spam, it's going to float to a middle position, like the word "THE" which has a 53.2% chance of being spam (and that's only because 92% of my mail is spam so a neutral word is usually slightly over 50%). But neutral words are completely ignored by Bayesian--only the "most interesting" are considered, those that are 99% spam or 1%--THOSE are the words that define whether or not the message gets scored as spam or not. Plus if they knew which words to poison, those are the same words they could use to get their spam past the filter to start with... so poisoning the filters is pointless anyway.
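
    The "only the most interesting words are considered" step can be sketched like this. The limit of 15 is borrowed from Paul Graham's essay and is an assumption about this particular filter; the function name is hypothetical.

```python
def most_interesting(token_probs: dict[str, float], limit: int = 15,
                     neutral: float = 0.5) -> list[tuple[str, float]]:
    """Return the tokens whose spam probability lies furthest from neutral."""
    ranked = sorted(token_probs.items(),
                    key=lambda item: abs(item[1] - neutral),
                    reverse=True)
    return ranked[:limit]
```

    A word like THE at 53.2% has a distance of only 0.032 from neutral, so it is crowded out by anything near 1% or 99%. That is why a poisoned word drifting towards the middle simply stops counting rather than flipping the verdict.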

    I really don't see how they can get around it. I'd be interested in your views. If you really think it's dangerous to talk about it in public then let me know and I'll email you at your mangled address above. Is that your correct address?

  • by steveha (103154) on Wednesday January 14, 2004 @05:34PM (#7978281) Homepage
    I read your article, but I am not as worried as you are.

    First, my credentials: I haven't run an organized study of spam, as you have, but I did set up a Bayesian filter, SpamProbe, on my mail server (and I wrote an article [linuxjournal.com] about it). I get about 150 spam messages per day, and I only see the ones that get past my Bayesian filter. So I have looked over dozens of spams to see why they fooled my filter. (My filter is about 95% effective, and once I had it trained, I haven't observed any false positives.)

    Yes, if a spammer works carefully, he can craft a message that will have a better chance of slipping past a Bayesian filter. But my Bayesian filter is not 100% effective anyway; as long as I only have to manually handle 5-15 messages per day, I'd say the filter is working. So the question is not whether the spammers can ever slip a message past the filter; the question is whether the spammers can completely destroy the usefulness of Bayesian filters, as you fear.

    Bayesian filters look at the whole message, and they can learn to recognize spam in unexpected ways. For example, HTML font tags that set large red letters are a good spam indicator. HTML font tags that set white-on-white text are another. So Bayesian filters will force spammers to change the format of their spam.

    Most spammers want you to call a phone number or view a URL. Since the Bayesian filter will learn the phone numbers and URLs are spam flags, Bayesian filters will force spammers to keep setting up new phone numbers and servers.

    The "from" addresses of my friends will quickly become good ham indicators, and that will be difficult for the spammers to exploit (since everyone has different friends).
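
    As a rough illustration of how a filter can pick up these non-word signals, here is a hedged sketch of a tokenizer that emits sender addresses, URLs, font tags, and phone numbers as tokens alongside ordinary words. The tokenize() function, the token prefixes, and the regular expressions are all assumptions made for this example; they do not describe how SpamProbe or any other real filter tokenizes mail.

```python
import re

URL_RE = re.compile(r"https?://[^\s\"'<>]+")
FONT_TAG_RE = re.compile(r"<font[^>]*>", re.IGNORECASE)
PHONE_RE = re.compile(r"(?:\d-)?\d{3}-\d{3}-\d{4}")  # simplistic US-style pattern
ANY_TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[a-z']+")


def tokenize(headers, body):
    """Turn a message into a list of tokens a Bayesian filter could count.

    headers is a plain dict of header name -> value (hypothetical input shape).
    """
    tokens = []

    sender = headers.get("From", "").strip().lower()
    if sender:
        tokens.append("from:" + sender)

    for url in URL_RE.findall(body):
        tokens.append("url:" + url.lower())

    for tag in FONT_TAG_RE.findall(body):
        # Normalize whitespace so the same tag always yields the same token.
        tokens.append("html:" + " ".join(tag.lower().split()))

    for phone in PHONE_RE.findall(body):
        tokens.append("phone:" + phone)

    # Ordinary words, taken from the body with markup stripped out.
    plain = ANY_TAG_RE.sub(" ", body).lower()
    tokens.extend(WORD_RE.findall(plain))
    return tokens
```

    Once tokens like these are counted the same way as words, a reused URL or phone number becomes a spam flag after a few messages, and a friend's address becomes a ham flag.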

    Also, my understanding is that you cannot really "poison" a word for Bayesian filtering; all you can do is lessen its usefulness as a spam/ham indicator. If spammers use different hammy words for each spam, the poison's dosage will be diluted; while if they use the same hammy words for each spam, those words will then be a legitimate spam flag.
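
    A back-of-the-envelope sketch of that dilution argument, with made-up corpus sizes (1000 spams, 100 hams, and a word that appears in 20 of the hams). The function name and every number here are assumptions chosen only to show the shape of the effect.

```python
def spam_probability(spam_with_word, ham_with_word, n_spam, n_ham):
    """P(spam | word), with counts normalized by the size of each corpus."""
    spam_freq = spam_with_word / n_spam
    ham_freq = ham_with_word / n_ham
    return spam_freq / (spam_freq + ham_freq)


N_SPAM = 1000        # hypothetical number of spams in the training corpus
N_HAM = 100          # hypothetical number of hams
HAM_WITH_WORD = 20   # the "hammy" word appears in 20% of ham

# Strategy A: the spammer puts the same hammy word in every spam.
# The word now appears in all 1000 spams and becomes a spam flag.
same_word = spam_probability(N_SPAM, HAM_WITH_WORD, N_SPAM, N_HAM)

# Strategy B: the spammer rotates through 500 different hammy words,
# one per spam, so each word lands in only 1000 / 500 = 2 spams.
WORDS_IN_ROTATION = 500
rotated_word = spam_probability(
    N_SPAM // WORDS_IN_ROTATION, HAM_WITH_WORD, N_SPAM, N_HAM
)

print(f"same word every time: {same_word:.3f}")
print(f"rotated word:         {rotated_word:.4f}")
```

    With these numbers, the reused word climbs well above 0.5 and starts counting against the spammer, while each rotated word barely moves from its original hammy score, so the receiver's legitimate mail is not pushed toward the spam side.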

    There probably are a few refinements that could be made to spam filters. I'd like to see a spam filter that, if there is both an HTML part and a plain text part, only checks the HTML. That way the spammers can include ham in the text part and it won't affect the filtering.
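
    Something like the following sketch would do it, using Python's standard email package. The helper names text_to_score() and build_example() are made up for this example, and falling back to the plain text part when there is no HTML part is an assumption about what a filter would want.

```python
from email.message import EmailMessage


def text_to_score(msg: EmailMessage) -> str:
    """Return the text a filter should score: the HTML part if present,
    otherwise the plain text part, otherwise an empty string."""
    body = msg.get_body(preferencelist=("html", "plain"))
    return body.get_content() if body is not None else ""


def build_example() -> EmailMessage:
    """Build a two-part message: hammy plain text plus a spammy HTML part."""
    msg = EmailMessage()
    msg["Subject"] = "hello"
    msg.set_content("Notes from the budget meeting on Tuesday.")
    msg.add_alternative(
        "<html><body><font color='red' size='7'>Buy pills now</font></body></html>",
        subtype="html",
    )
    return msg
```

    Scoring text_to_score(build_example()) sees only the red "Buy pills now" markup; the innocent-looking meeting notes in the text part never reach the filter.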

    In summary, I am reasonably hopeful that there is no way for spammers to completely defeat Bayesian filtering. The best they can hope to do is to sneak some mildly-phrased messages by the filters.

    P.S. I agree with you that the ultimate anti-spam measure would be a "for-pay" mail system. I envision a mail protocol that allows you to specify how much it costs to send you an email: you put your friends on the free list, and otherwise it costs 5 cents or whatever. If you are really famous you might raise the cost to reduce the volume of email you receive. There should be a mechanism in place to quickly refund the costs, and friends should be identified with a digital signature, not by an easily forged string. Spam only works because it's so cheap to send many messages, so a 0.001% response rate is enough. At even 5 cents per message, spam wouldn't be cost-effective anymore. You would still get ads in the mail, but they would be less obnoxious and more carefully targeted. Send me an ad for Mexican Viagra and you won't get your 5 cents back, but send me an ad for something I actually want and I'll consider it.
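
    The arithmetic behind that claim, as a small sketch. The 5 cents and the 0.001% response rate (one response per 100,000 messages) come from the paragraph above; the function name is made up, and the sketch assumes no refunds and no other sending costs.

```python
def break_even_cents_per_response(cost_cents_per_message, messages_per_response):
    """Revenue each response must bring in just to cover the sending fees.

    Integer cents are used so the result is exact.
    """
    return cost_cents_per_message * messages_per_response


# 0.001% response rate means 1 response per 100,000 messages sent.
MESSAGES_PER_RESPONSE = 100_000
COST_CENTS = 5

needed = break_even_cents_per_response(COST_CENTS, MESSAGES_PER_RESPONSE)
print(f"Each response must earn ${needed / 100:,.2f} to break even")
```

    At that response rate, every single sale would have to bring in thousands of dollars before the mailing paid for itself.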

    steveha
