Using gzip As A Spam Filter

Posted by timothy on Mon Jan 27, 2003 08:15 AM
from the showing-some-adaptability dept.
captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."
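As a rough illustration of the quoted idea, the sketch below measures how well a text compresses. It uses Python's zlib module, which implements the same DEFLATE algorithm as gzip; the function name and both sample texts are made up for the example, not taken from the article.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size divided by original size (lower = more repetitive)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot compute a ratio for empty text")
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical sample texts, chosen only to show the contrast.
repetitive = "Buy now! Limited offer! " * 40
varied = (
    "Loosely speaking, the LZ and gzip compression algorithms look for "
    "repeated strings within a text, and replace each repeat with a "
    "reference to the first occurrence. The ratio therefore reflects how "
    "many fragments, words or phrases occur more than once."
)

print(f"repetitive: {compression_ratio(repetitive):.3f}")
print(f"varied:     {compression_ratio(varied):.3f}")
```

The repetitive text should come out with a much smaller ratio than the ordinary prose, which is the signal the article proposes to use.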
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Grep it instead! by WestieDog (Score:2) Monday January 27 2003, @08:20AM
  • Raw data (Score:5, Informative)

    by gazbo (517111) on Monday January 27 2003, @08:20AM (#5166798)
    This article will make much more sense if you look at the raw data [willets.org] in tabular form.
    • Not that different (Score:5, Interesting)

      by Synonymous Soured (627748) on Monday January 27 2003, @08:32AM (#5166857)
      A Bayesian spam filter uses an underlying order-0 Markov model of email messages. gzip uses an underlying order-1 Markov model.

      A Bayesian filter uses words as "symbols." gzip uses bytes as symbols.

      The right thing to do would be to combine them. Take a gzip-style Markov model, using bytes as symbols and conditional probabilities, and plug it into a Bayesian filter. That would (1) make the filter more powerful and (2) make the filter applicable to any sort of data, arbitrary binary or readable text. Negligible computational overhead, sharper discrimination.
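One possible reading of that suggestion, as a minimal sketch: train one order-1 byte model per class and compare them with Bayes' rule. The class name, the add-one smoothing, the equal prior and the toy training strings are all assumptions made for illustration.

```python
import math
from collections import Counter


class ByteBigramModel:
    """Order-1 Markov model over bytes with add-one smoothing."""

    def __init__(self) -> None:
        self.pair_counts = Counter()     # (previous byte, byte) -> count
        self.context_counts = Counter()  # previous byte -> count

    def train(self, data: bytes) -> None:
        for prev, cur in zip(data, data[1:]):
            self.pair_counts[(prev, cur)] += 1
            self.context_counts[prev] += 1

    def log_likelihood(self, data: bytes) -> float:
        total = 0.0
        for prev, cur in zip(data, data[1:]):
            num = self.pair_counts[(prev, cur)] + 1
            den = self.context_counts[prev] + 256  # 256 possible next bytes
            total += math.log(num / den)
        return total


def spam_log_odds(msg: bytes, spam: ByteBigramModel, ham: ByteBigramModel,
                  prior_spam: float = 0.5) -> float:
    """Bayes' rule in log form: a positive result means 'more likely spam'."""
    prior = math.log(prior_spam / (1.0 - prior_spam))
    return prior + spam.log_likelihood(msg) - ham.log_likelihood(msg)


# Hypothetical toy training data; a real filter needs far more.
spam_model, ham_model = ByteBigramModel(), ByteBigramModel()
for sample in (b"FREE PRIZE!!! CLICK NOW!!!", b"WIN CASH NOW!!! FREE OFFER!!!"):
    spam_model.train(sample)
for sample in (b"see you at the meeting tomorrow", b"thanks for the patch, looks good"):
    ham_model.train(sample)

print(spam_log_odds(b"FREE CASH NOW!!!", spam_model, ham_model))
print(spam_log_odds(b"meeting notes for tomorrow", spam_model, ham_model))
```

Because the symbols are bytes, the same code runs unchanged on binary attachments, which is the second advantage claimed above.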

      • Sorry, that's not right (Score:5, Interesting)

        by martin-boundary (547041) on Monday January 27 2003, @09:35AM (#5167122)
        Only naive bayesian models are 0-order Markov. The "naive" refers precisely to the zero order independence assumption. You can have 1-order, 2-order, n-th order bayesian models if you like. Those are called n-gram models. After that, you can have bayesian phrase based models if you like, or paragraph based also.

        Bayesian only refers to how you use the probabilities.

        Now gzip uses DEFLATE, an LZ77-based scheme in the same Lempel-Ziv family as LZW, which matches variable-length repeated strings. That is quite different from a 1-order Markov model. For example, an order-1 Markov model is not allowed to depend on more than the current and immediately preceding symbol.

      • Where to read about Markov models etc by A55M0NKEY (Score:1) Monday January 27 2003, @10:08AM
    • bzip2 results (Score:5, Informative)

      by K-Man (4117) on Monday January 27 2003, @02:31PM (#5168756)
      Several knowledgeable people pointed out that the first try was limited by gzip's 32k window size, so I did a quick run with bzip2, which uses a 900k block, and put the results here [willets.org]. Somewhat different, but still a spread between spam/ham.

      And, of course, do try this at home.
  • It's all spam (Score:4, Funny)

    by amigaluvr (644269) on Monday January 27 2003, @08:21AM (#5166800) Journal
    Hey if you compress all of your mail with gzip then it all looks like foreign spam anyway!
    • Re:It's all spam (Score:5, Interesting)

      by greenjinjo (580285) on Monday January 27 2003, @08:39AM (#5166885)

      You know, I noticed something peculiar. If you're from a non-English speaking country, like I am, you can filter the spam by looking at the language of the subject. In my case, if it is English it is almost certainly spam.

      Do English-speaking people receive spam in foreign languages?

  • Maybe I am missing something here by Anonymous Coward (Score:1) Monday January 27 2003, @08:22AM
  • Slashdot filter (Score:5, Interesting)

    by fredrikj (629833) on Monday January 27 2003, @08:22AM (#5166809) Homepage
    Sounds very much like that lameness filter on Slashdot that refuses to accept a post if its contents can be compressed easily... of course, it's quite simplistic compared to gzip.
  • this is nice by teejie (Score:1) Monday January 27 2003, @08:24AM
  • Meet the Bayesian Filtering Algorithm (Score:5, Informative)

    by dpete4552 (310481) <slashdotNO@SPAMtuxcontact.com> on Monday January 27 2003, @08:25AM (#5166822) Homepage
    http://www.paulgraham.com/spam.html
    • Re:Meet the Bayesian Filtering Algorithm by dilute (Score:3) Monday January 27 2003, @08:55AM
      • by coyul (119455) on Monday January 27 2003, @10:34AM (#5167480)

        OTOH, it seems to me that some other model, such as a scheme that gives legitimate senders explicit advance AUTHORIZATION to send you email, might be what's needed.

        I understand what you're saying, but there are a couple of problems with this, depending on how you implement it. If you allow potential correspondents to request authorization by email, you'll still have to process at least one message per originating address. That obviously won't work to eliminate spam (or even cut it down to size...) The other option is to force potential correspondents to request authorization over another channel (phone, fax, whatever), but this neatly destroys a lot of the convenience of email. It also eliminates the impersonal nature of email (by forcing a personal contact) when it is partly this impersonality that distinguishes it in the first place (and encourages some first time correspondents to make contact at all...)

        May not be the ultimate filter (and I doubt it could be), but it's real interesting, I think, that this appears to have considerably greater than zero accuracy.

        Actually, the Bayesian filter implemented by POPFile [sourceforge.net] is remarkably accurate. A friend of mine has been using it since it debuted on slashdot in November [slashdot.org] and it has correctly classified all of the spam he's received since (76% of his email in total, unfortunately...)

        You can also set up POPFile to process the headers of your messages as well as the body, so it will effectively learn the email addresses of people you're willing to receive email from anyway. Depending on how you define words (what you use as token separators), you could attempt to make it generalize to domains as well.
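To make the point about token separators concrete, here is a small sketch. It is not POPFile's actual tokenizer; the separator set and the function name are assumptions. Treating '@' as a separator turns the sender's domain into a token of its own, so a word-based filter could learn about whole domains.

```python
import re

# Hypothetical separator set: splitting on '@' as well as whitespace and
# punctuation makes the domain a token of its own.
SEPARATORS = re.compile(r'[\s<>,;:"()@]+')


def tokenize(header_value: str) -> list[str]:
    """Split a header line into lowercase tokens, dropping empty strings."""
    return [tok.lower() for tok in SEPARATORS.split(header_value) if tok]


print(tokenize('From: "Alice Example" <alice@example.org>'))
# prints ['from', 'alice', 'example', 'alice', 'example.org']
```

Leaving '@' out of the separator set would instead keep the full address as a single token, which learns individual correspondents but not their domains.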

  • Compression detection of spam by Alcohol Fueled (Score:1) Monday January 27 2003, @08:25AM
  • Right tool for right job by WPIDalamar (Score:2) Monday January 27 2003, @08:25AM
    • RTFA! by jotaeleemeese (Score:1) Tuesday January 28 2003, @07:11AM
  • HTML (Score:5, Interesting)

    by Pilferer (311795) on Monday January 27 2003, @08:26AM (#5166827)
    That's because most spam includes large amounts of HTML.

    My friends do not use HTML in email. Ads for "Crimescene Cocksuckers" does.
    • Re:HTML by ^BR (Score:1) Monday January 27 2003, @08:43AM
    • Re:HTML by UberLord (Score:1) Monday January 27 2003, @10:21AM
    • Re:HTML by phrantic (Score:2) Monday January 27 2003, @10:42AM
      • Re:HTML by Lussarn (Score:2) Monday January 27 2003, @12:42PM
        • Re:HTML by Lord_Breetai (Score:1) Monday January 27 2003, @07:26PM
          • Re:HTML by T-Ranger (Score:2) Monday January 27 2003, @10:23PM
      • Re:HTML by PetWolverine (Score:1) Monday January 27 2003, @03:01PM
  • Great by FungiSpunk (Score:1) Monday January 27 2003, @08:27AM
  • Excellent (Score:5, Funny)

    by Phosphor3k (542747) on Monday January 27 2003, @08:28AM (#5166835)
    Slashdot can use it to filter out duplicate stories.
    • Re:Excellent by oktokie (Score:1) Monday January 27 2003, @01:45PM
  • It won't work for businesses (Score:5, Funny)

    by autocracy (192714) <slashdot2007&storyinmemo,com> on Monday January 27 2003, @08:29AM (#5166840) Homepage
    Anything from mid-level management or the marketing department would immediately be marked as spam and trashed. Maybe not very important in the first place, but you'd at least need to be able to say "yeah, I saw the memo on the TPS reports."
  • Spam Conference talk (Score:5, Interesting)

    by Matts (1628) on Monday January 27 2003, @08:30AM (#5166845) Homepage
    Jason Rennie gave an extremely interesting talk about this at the MIT Spam Conference this month, although he wasn't using quite as direct a method; instead he was looking at MDL (Minimum Description Length). This is a technique to discover features in corpora that allow you to describe the classification of a corpus in the minimum number of details.

    Basically it's a way to discover features in emails using compression techniques, so rather than having us SpamAssassin developers carefully and manually examine emails to see what's new and interesting about them, MDL techniques can automatically detect these features.

    Jason Rennie's web page (talk and paper available) about this is here [mit.edu]. Please do read it as it's extremely interesting.

    The one downside is that Jason said at the end of his talk that it's extremely slow at doing the feature detection. When asked how slow, he said that on a reasonably small corpus it took 4 months (although he said it was written in Perl, so a C port is probably a good plan).

    In comparison to Bayesian techniques the MDL technique presents a great deal of interest - primarily because I work for a company doing spam filtering at the internet level [messagelabs.com], and so we can't feasibly do personal training which is what makes Bayesian techniques so great (see the talk I gave at the MIT spam conference). Without the personal training Bayes is only about 90-95% effective, so it should be interesting to see where these techniques lead us.
    • Re:Spam Conference talk by ajs (Score:3) Monday January 27 2003, @09:17AM
    • Re:Spam Conference talk (Score:5, Insightful)

      by archeopterix (594938) on Monday January 27 2003, @10:39AM (#5167503) Journal
      MDL, gzip, neural networks, Bayesian filtering and probably a bunch of other spam-filtering methods are all based on the following scheme: get a (big) number of spam messages, a number of non-spam messages (preferably specific to the current user of the filter) and use a learning algorithm on these to produce an automatic classifier.

      What bothers me about this method is that you can never be 100% sure what the learning algorithm will actually learn. My friends seldom send me HTML mail. Most of my spam is HTML. A learning algorithm will probably learn that HTML mail is spam, especially if it never gets HTML "ham" during its training period. Then if one of my clueless friends sends me an HTML message, it will not go through and this is clearly bad.

      I will never trust an automatic filter so as to delete a message marked as "spam" without reading, but I think it can still be useful for ranking messages, so that spam gets read less often and deleted faster.

  • Quantitative, not qualitative (Score:5, Interesting)

    by psplay (572886) <J AT psplay DOT com> on Monday January 27 2003, @08:30AM (#5166846)
    It's not simply the words that are used in a mail, but the way they are used (the order) that gives a sentence its meaning.

    For example, two emails:

    1 (ham) "You have won a brand new Convertible, from the competition you entered."

    and

    2 (spam) "A brand new convertible to be won, have you entered?"

    Ham would match about 80% with spam.

    Word matching is a blunt instrument, as mentioned. The English language is far too complex for simple calculations, and this should be taken into consideration when applying a 'Spam Likelihood' rating to emails.
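The 80% figure can be checked with a few lines, assuming "match" means the share of the spam's distinct words that also appear in the ham, ignoring case and punctuation. That definition is an assumption; the comment does not say how the match was computed.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercase the text and return its distinct words, ignoring punctuation."""
    return set(re.findall(r"[a-z']+", text.lower()))


ham = "You have won a brand new Convertible, from the competition you entered."
spam = "A brand new convertible to be won, have you entered?"

shared = word_set(ham) & word_set(spam)
overlap = len(shared) / len(word_set(spam))
print(f"{len(shared)} of {len(word_set(spam))} spam words also occur in the ham: {overlap:.0%}")
# prints: 8 of 10 spam words also occur in the ham: 80%
```

A bag-of-words comparison like this throws away word order entirely, which is exactly the weakness being described.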
  • Good, now go to level 2... by zanderredux (Score:1) Monday January 27 2003, @08:31AM
  • Don't compress (Score:3, Funny)

    by Fuzzums (250400) on Monday January 27 2003, @08:31AM (#5166854) Homepage
    Usually I don't compress my spam.

    I delete it.

    This will save me a lot more space ;-)

  • Same old problem... (Score:5, Insightful)

    by artemis67 (93453) on Monday January 27 2003, @08:34AM (#5166864) Homepage
    Filtering is not a true spam solution. All it takes is one false positive, a Really Important Email accidentally deleted, to totally destroy the value of any filtering system.

    Given that, the alternative to having tagged emails automatically deleted is to collect them in a folder and scan the message senders and subject lines. If you're doing that, then the spammer is getting a pitch through to you in the subject line. This therefore does not lessen the incentive for the spammer, but simply causes him to change tactics and put his best pitch in his subject line.

    Right now, I get 60-80 spams a day. What happens when I start getting 600-800 a day? Again, filtering starts to break down, because I have SO MANY messages to scan everyday that the possibility of me missing a legitimate one is very high.
    • Re:Same old problem... by isorox (Score:3) Monday January 27 2003, @08:46AM
    • Re:Same old problem... by timlewis_atlanta (Score:1) Monday January 27 2003, @08:47AM
    • Re:Same old problem... (Score:5, Interesting)

      by djmurdoch (306849) on Monday January 27 2003, @08:49AM (#5166939)
      Filtering is not a true spam solution. All it takes is one false positive, a Really Important Email accidentally deleted, to totally destroy the value of any filtering system.

      One of the side effects of spam is that there are no "Really Important Emails" any more. Spam and spam filters have degraded the reliability of email to such an extent that you'd have to be crazy to send anything Really Important by email.

      Right now, I get 60-80 spams a day. What happens when I start getting 600-800 a day?

      That's a good point. The solution is to get less spam. You can do that by changing email addresses frequently (a really inconvenient solution that I don't recommend), or by getting spammers shut down (or yourself listwashed by the spammers).

      Let the spammers know that if they send something to you, they'll lose money, and they won't send you so much spam. SpamCop [spamcop.net] reporting makes this easy. If you want to be listwashed, don't munge your address when you send reports. (This is an option with SpamCop.)

      Some people claim that you'll get more spam or get listbombed or something if you send complaints without munging; that's not my experience. I get 20-30 spams per day, total, at all of my 4 publicly available email addresses. (Ninety to 95 percent of them get caught by the SpamCop filters, which have almost never caught valid email.)
    • Re:Same old problem... by johnburton (Score:2) Monday January 27 2003, @08:50AM
    • Re:Same old problem... by Kragg (Score:2) Monday January 27 2003, @08:55AM
    • Re:Same old problem... by ch-chuck (Score:2) Monday January 27 2003, @08:59AM
    • Medical community figured this out years ago. by Anonymous Coward (Score:1) Monday January 27 2003, @09:04AM
    • Re:Same old problem... by squiggleslash (Score:2) Monday January 27 2003, @09:34AM
    • Re:Same old problem... by kirkjobsluder (Score:2) Monday January 27 2003, @11:46AM
    • Risk Analysis by po8 (Score:2) Monday January 27 2003, @01:19PM
    • Re:Same old problem... by DaCool42 (Score:1) Monday January 27 2003, @06:11PM
    • Re:Same old problem... by artemis67 (Score:2) Monday January 27 2003, @10:46AM
  • Spammers will adjust their tactics (Score:5, Interesting)

    by ultrabot (200914) on Monday January 27 2003, @08:34AM (#5166866)
    Obviously it wouldn't be a big problem for the spammers to run their creative gems through gzip, and alter the content until it compresses less well. Even including a bunch of garbage after the message might do the trick. I believe equivalent analysis can be done cheaper with non-gzip tools...
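A quick way to see the evasion at work, as a sketch only: the pitch text, the fixed seed and the amount of padding are invented for the demonstration, and zlib stands in for gzip since both use DEFLATE.

```python
import random
import zlib


def ratio(data: bytes) -> float:
    """Compressed size over original size, using DEFLATE at maximum effort."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
pitch = b"Act now! Amazing offer! Act now! " * 30
garbage = bytes(rng.getrandbits(8) for _ in range(len(pitch)))

print(f"pitch alone:        {ratio(pitch):.2f}")
print(f"pitch plus garbage: {ratio(pitch + garbage):.2f}")
```

Appending as many random bytes as there is pitch should push the ratio of a highly repetitive message up towards one half, so a filter that only looks at a whole-message ratio would be easy to mislead.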
  • Alternative (Score:5, Interesting)

    by Dexter77 (442723) on Monday January 27 2003, @08:35AM (#5166869)
    When the spam is filtered at user-account level, you can only do it by parsing a single mail in some way and determining whether it's spam or not. It's like trying to tell whether a movie is bad by looking at one picture. If the spam could be filtered at the server level, by comparing mails that are received in different accounts, you could really tell which ones are part of a mass-mail (spam).

    One problem with this is the right to open other people's mail. But you could use some basic scrambling (rot-13) to make sure that no one sees the inside. It wouldn't make a difference to the comparing script.

    Mailing lists might cause a problem too but wouldn't it be easier to allow the mailing lists you belong to than trying to block the ones you don't belong to?
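A server-side comparison along these lines might look like the sketch below. Note one swap: it compares SHA-256 digests of the bodies rather than rot-13 text, so the comparing code never handles readable content. The function names, the normalisation and the threshold of three accounts are assumptions.

```python
import hashlib
from collections import defaultdict


def body_digest(body: str) -> str:
    """Hash a whitespace-normalised, lowercased body so only digests are compared."""
    normalised = " ".join(body.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_bulk(deliveries: list[tuple[str, str]], min_recipients: int = 3) -> set[str]:
    """Return digests of bodies delivered to at least `min_recipients` accounts."""
    recipients = defaultdict(set)
    for account, body in deliveries:
        recipients[body_digest(body)].add(account)
    return {d for d, accounts in recipients.items() if len(accounts) >= min_recipients}


# Hypothetical deliveries: (receiving account, message body).
deliveries = [
    ("alice", "Cheap pills!  Order today."),
    ("bob", "cheap pills! order today."),
    ("carol", "Cheap pills! Order   today."),
    ("alice", "Lunch on Friday?"),
]
bulk = find_bulk(deliveries)
print(len(bulk))  # prints 1
```

Exact digests only catch identical bodies, so a sender who varies each copy slightly would slip through; subscribed mailing lists would also need an allow list, as noted above.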
  • Sequitur Most Likely Superior (Score:5, Interesting)

    by Baldrson (78598) on Monday January 27 2003, @08:36AM (#5166872) Homepage Journal
    The statistics generated by Sequitur [rutgers.edu] are most likely superior to those from gzip.

    As an example of how Sequitur works, the string 'abcabdabcabd' produces the following grammar rules:

    1. 2 c 2 d
    2. a b
    Representing the original string then is the sequence:

    1 1

    The usage counts of the rules are available as output options.
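The grammar in this example can be checked by expanding it back into the original string. The sketch below does only that expansion and the usage counting; it does not implement Sequitur's rule inference, and the data layout is an assumption.

```python
from collections import Counter

# Grammar copied from the example above; rule numbers are the keys.
RULES = {1: [2, "c", 2, "d"], 2: ["a", "b"]}
START = [1, 1]


def expand(symbols: list) -> str:
    """Recursively replace rule numbers with their right-hand sides."""
    return "".join(expand(RULES[s]) if isinstance(s, int) else s for s in symbols)


def usage_counts() -> Counter:
    """Count how often each rule is referenced in the start sequence and rule bodies."""
    counts = Counter()
    for body in [START, *RULES.values()]:
        counts.update(s for s in body if isinstance(s, int))
    return counts


print(expand(START))   # prints abcabdabcabd
print(usage_counts())
```

Both rules are referenced twice here, and counts like these are the kind of statistic a classifier could take as input.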

  • Dupes by BESTouff (Score:1) Monday January 27 2003, @08:37AM
  • Yay! (Score:5, Funny)

    by Anonymous Coward on Monday January 27 2003, @08:42AM (#5166900)
    What an idea!

    I could use this to avoid those people who keep saying the same thing all the time, over and over again...

    Now, how can I convince my mother to use e-mail?
  • My spam compression approach by joshv (Score:2) Monday January 27 2003, @08:43AM
  • What is spam, though? (Score:5, Funny)

    by Big Mark (575945) <m_t_douglas&hotmail,com> on Monday January 27 2003, @08:43AM (#5166903)
    The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text.
    Ah. I thought to detect really useless, annoying, pointless, bandwidth-sapping and time-consuming email all you had to do was look for "fwd:" in the subject line.

    -Mark
  • How to stop spam.... (Score:4, Informative)

    by oliverthered (187439) <oliverthered@NOSPAm.hotmail.com> on Monday January 27 2003, @08:44AM (#5166906)
    1: Get an email account with unlimited addresses.
    2: when registering use a unique address e.g. slashdot@mydomain.com
    3: Make sure you check/uncheck the "give my email address to mailing lists" box.
    4: If ever you get spam to that account get litigious.

    Use something like mailinglists@mydomain.com, and block anything that doesn't come from mailing lists you've subscribed to.
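The bookkeeping behind steps 1 and 2 is small enough to sketch. The helper names and the site list are hypothetical, and mydomain.com is just the placeholder domain from the example above.

```python
from typing import Optional

DOMAIN = "mydomain.com"  # placeholder domain taken from the example above


def address_for(site: str) -> str:
    """Build the unique address handed to one site at registration."""
    return f"{site.lower()}@{DOMAIN}"


def leaked_by(recipient: str, issued: dict[str, str]) -> Optional[str]:
    """Return the site an address was issued to, or None if it was never issued."""
    return issued.get(recipient.lower())


# Hypothetical registrations: address -> site it was given to.
issued = {address_for(site): site for site in ("slashdot", "bookshop")}

print(leaked_by("Slashdot@mydomain.com", issued))  # prints slashdot
print(leaked_by("random@mydomain.com", issued))    # prints None
```

When spam arrives at one of these addresses, the lookup names the site that passed it on, which is the evidence step 4 relies on.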
    • Re:How to stop spam.... by Jugalator (Score:3) Monday January 27 2003, @08:49AM
    • Re:How to stop spam.... by BenV666 (Score:1) Monday January 27 2003, @09:16AM
    • Re:How to stop spam.... by NoseyNick (Score:2) Monday January 27 2003, @09:18AM
    • Re:How to stop spam.... (Score:5, Interesting)

      by DeadSea (69598) on Monday January 27 2003, @10:13AM (#5167317) Homepage Journal

      You need to expand on your step 4.

      When I started getting spam, I wanted it to stop. I realized I couldn't just disable the email address because there might be somebody out there counting on it to contact me. I could disable it and send an autoreply with my current email address, but then spammers would just be able to look at the reply. I needed some solution where people could send me email even if the address they used had been disabled, but spammers wouldn't be able to get my current address. I decided to put a contact form on my website. Now I autorespond to a disabled email address with the contact form url. In addition, I was able to remove email addresses from my website which was a large source of spam.

      Not being able to find a contact form that was secure, I ended up writing my own and releasing it under the GPL. You can get it at http://ostermiller.org/contactform/ [ostermiller.org].

      I also realized that no matter how hard you try, your email address will leak to spammers. Even giving an email address only to your closest friends and family will not prevent it from leaking out. Between the klez virus, gift certificates, invitation, greeting card, and crushlink websites, even my most personal email address leaked to spammers. You can't be afraid to disable an email address and send your friends the new one. Now even if I missed a friend, they can still get a message to me.

    • E-mail address id-ing by fulldecent (Score:1) Monday January 27 2003, @10:23AM
    • Re:How to stop spam.... by deisher (Score:1) Monday January 27 2003, @10:29AM
    • Litiogeoususizing (sp) by A55M0NKEY (Score:1) Monday January 27 2003, @10:53AM
    • Re:How to stop spam.... by Phroggy (Score:1) Monday January 27 2003, @06:36PM
  • by Domini (103836) <marius@e.co.za> on Monday January 27 2003, @08:46AM (#5166920) Homepage Journal
    It's inefficient to have so much memory overhead.

    Besides, if I were a spammer, I could pad it with high entropy data at the end to make up for my warbling.
  • Compression algorithms as filters... (Score:5, Insightful)

    by Jugalator (259273) on Monday January 27 2003, @08:46AM (#5166925) Journal
    .. sounds like a poor idea to me. Yes, you can measure the amount of redundancy in a message, but:

    a) Spammers might not always use messages redundant enough to be distinguishable from regular text.

    b) If I happened to use some words a little too often, especially when writing mails discussing technical stuff or posting computer code fragments, would that be classified as spam?

    I think this is a nice filter when sorting out more or less repetitive mails (spam or not) from novels, but a filter based on a spam database sounds better to me.
  • I can't figure this out... (Score:4, Interesting)

    by shivianzealot (621339) on Monday January 27 2003, @08:52AM (#5166950)

    A couple of posts above state that spammers will "just adjust their tactics." Talk like this always puzzles me; on the spammer's side, does this not help him? If I'm selling a combination weight loss drug/mail order bride/penis enlarger/cable descrambler for only three payments of $49.99 in such a manner that every spam blocker in the world filters me, logically I'm only being filtered by people who know better than to buy my "product," thus not irritating them, in effect helping to slow regulation, and I don't lose touch with any significant chunk of my target demographic. Of course, this applies with the exception of corporate environments or similar situations where Joe Insecure has someone else managing spam.

    Can anyone share some +5 Insight on the matter?

    • Re:I can't figure this out... (Score:5, Insightful)

      by Motherfucking Shit (636021) on Monday January 27 2003, @10:01AM (#5167249) Homepage Journal
      If I'm selling a combination weight loss drug/mail order bride/penis enlarger/cable descrambler for only three payments of $49.99 in such a manner that every spam blocker in the world filters me, logically I'm only being filtered by people who know better than to buy my "product," thus not irritating them, in effect helping to slow regulation, and I don't lose touch with any significant chunk of my target demographic.
      This would make sense if the only people implementing spam filters were end users. Unfortunately, the logic breaks down when you consider that some ISPs do the filtering on behalf of their customers. It breaks down further when you factor in the number of situations in which a) the customer might not even know that the filtering is happening, or b) the customer blindly trusts the ISP's filtering system.

      Take Yahoo, for example. They're a popular webmail service and they also do spam filtering to some extent on inbound email. I would say that, in general, people who use Yahoo mail are not necessarily the type of people who "know better" than to buy spamvertised products. That's not a slam on Yahoo, nor on the people who use Yahoo mail, it's just the way the demographics work out. The ratio of ripe targets to clued-in antispammers is simply better at Yahoo than it is on other domains.

      To that end, Yahoo's spam filters aren't helping the spammers any. A spammer's goal is to get his ad in front of as many potential targets as possible, and Yahoo is full of potential targets. But if Yahoo's filters catch the spammer's message and route it straight to everyone's Bulk Mail folder, there's (thousands|millions) of "targets" who will never see the message.

      So no, I can't agree that filtering helps the spammers any, at least not the big spammers who are after volume. There's probably a bit of "collateral assistance" in that people who would report the spam may never see it, but I'd say that benefit is cancelled out by the number of possible targets lost to filters.
    • Re:I can't figure this out... by stilwebm (Score:3) Monday January 27 2003, @10:38AM
    • Re:I can't figure this out... by buss_error (Score:2) Monday January 27 2003, @11:42AM
  • Stopping Spam (Score:5, Insightful)

    by Inflatable Hippo (202606) <inflatable_hippo@yahoo. c o .uk> on Monday January 27 2003, @08:55AM (#5166961) Journal
    > stupid filtering isnt gonna get you rid of spam... go complain at spammers upstream providers...

    Filters only work to a limited extent, and so might shutting down the spammers, if it were possible.

    But neither is going to solve this problem.

    The only solution I can think of is wide-spread adoption of PGP (or equivalent) aware mailers and certification of mail.

    The problem with mail addresses is that you have no control over their spread. If I give one to a company it'll usually leak out in the end and it's open season on my inbox.

    However if "genuine" mail is certified and mailers use certification validity as a filtering criterion then it simplifies the game hugely.

    Your mailer can spot the people you've genuinely given your address to, and naturally "distrust" uncertified (effectively anonymous) mail or mail whose certificate has been revoked or is unknown to you.

    The "only" things standing in the way of this are:

    1. Slow adoption of certification/encryption in mass market mailers. Usually poor or missing.
    2. Cost/difficulty of getting a valid certificate (e.g. with Verisign).
    3. The pain of typing a password every time you send a mail.
    4. It only works if everyone joins in.

    But nothing's for free and this strikes at the heart of email's usability.

    I'm continually surprised by the lack of certification use at least by large corporations and governments, but I suppose it removes plausible deniability :-)
  • Spam compression testing by Alien Being (Score:1) Monday January 27 2003, @08:57AM
  • Email to my girlfriend (Score:5, Funny)

    by FroBugg (24957) <.frobugg. .at. .bellsouth.net.> on Monday January 27 2003, @09:13AM (#5167031) Homepage
    Unfortunately, using this my girlfriend would never get any of my emails.

    "I'm sorry. Really, really, really, really sorry. I'm so very, very, very sorry. I'm sorry..."

  • Better spam filtering.... by JollyFinn (Score:1) Monday January 27 2003, @09:13AM
  • by SystematicPsycho (456042) on Monday January 27 2003, @09:18AM (#5167053)
    I received a nice piece of spam the other day. I didn't read it but I usually scroll to the bottom to see if they have the mandatory (in some places mandatory I think) unsubscribe method. This method sure gets me mad -

    To unsubscribe by postal mail, please send your request to:
    P.O Box 272521
    Boca Raton, FL 33427
    Ref # XXXXXX -- scd

    (XXXX.. replaced real reference number)

    It seems that the unsubscription method doesn't have to be by email - just as long as it's by something and it's there. They mustn't be specific in the law. Of course, no one is going to go write a letter by snail mail to unsubscribe to spam, although sending them some dog shit through the mail is tempting. I forgot the site that provides that service. Hrmm I should change my sig.
  • 32k Window... (Score:4, Informative)

    by pridkett (2666) <(ten.mortsgaw) (ta) (todhsals)> on Monday January 27 2003, @09:29AM (#5167094) Homepage Journal
    The fact is, that unless your SPAM corpus and HAM corpus are both under 32k, this won't work. Gzip is fast because it only has a 32k sliding window, meaning that it only searches for like strings in a 32k window around what you're currently compressing. Hate to break it to you, but 32k is not enough for a corpus. I think Bzip2 uses something larger (900k?), but I forget what it is.
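The window limit is easy to observe with Python's zlib module, which uses the same DEFLATE format and 32k window as gzip. This is only a sketch: the block sizes and seeds are arbitrary, and bz2 is printed purely for comparison, since its block transform is not restricted to a 32k look-back.

```python
import bz2
import random
import zlib


def random_block(size: int, seed: int) -> bytes:
    """Incompressible filler, generated from a fixed seed for repeatability."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


def zlib_ratio(data: bytes) -> float:
    """Compressed size over original size at zlib's highest level."""
    return len(zlib.compress(data, 9)) / len(data)


near = random_block(20_000, seed=1) * 2  # the copy starts 20,000 bytes back: inside the window
far = random_block(40_000, seed=2) * 2   # the copy starts 40,000 bytes back: outside it

print(f"zlib, repeat within 32k: {zlib_ratio(near):.2f}")
print(f"zlib, repeat beyond 32k: {zlib_ratio(far):.2f}")
print(f"bz2,  repeat beyond 32k: {len(bz2.compress(far, 9)) / len(far):.2f}")
```

zlib should roughly halve the first input and leave the second essentially uncompressed, even though both are a random block followed by an exact copy of itself. A corpus much larger than 32k would hide most of its repeats from gzip in the same way.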

    I'll be happy with spam assassin [spamassassin.org] until I get CRM114 [sourceforge.net] (and mailfilter) trained and working.
  • Similar article on heise was published a year ago by hanzwurst (Score:2) Monday January 27 2003, @09:30AM
  • Even Better by HereAllNight (Score:2) Monday January 27 2003, @09:37AM
  • Yawn -- read your papers (Score:4, Informative)

    by Anonymous Coward on Monday January 27 2003, @10:01AM (#5167247)
    There was a paper published in PRL a couple of years ago that wanted to identify languages using gzip (Benedetto et al: Language Trees and Zipping [sissa.it]). It sure sounded cool, but was quickly forgotten when Joshua Goodman took a closer look [microsoft.com] (link is down at the moment, probably IIS, Text version in Google Cache [google.com]).
  • Correction by misof (Score:2) Monday January 27 2003, @10:09AM
    • Re:Correction by kirkjobsluder (Score:2) Monday January 27 2003, @02:50PM
  • Repost? (Score:4, Interesting)

    by fulldecent (598482) on Monday January 27 2003, @10:16AM (#5167334) Homepage
    This post looks like it came from my previous reply [slashdot.org] on a way to detect entropy (non-repetitious content) in P2P files.

    Here is a code snippet from the comment:

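    #!/bin/bash
    # Entropic analysis by Full Decent
    # Usage: pass the file to analyse as the first argument.
    # Prints the gzip-compressed size as a percentage of the original size.
    SIZE=$(wc -c < "$1")
    CSIZE=$(gzip -c --best "$1" | wc -c)
    ENTROPY=$(echo "scale=4; $CSIZE / $SIZE * 100" | bc)
    echo "$1 is ${ENTROPY}% entropic"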
    • Re: Repost? by Omniscient Ferret (Score:1) Monday January 27 2003, @08:11PM
  • How about.... (Score:3, Interesting)

    by slummerx86 (642287) on Monday January 27 2003, @10:22AM (#5167381)
    if all the email clients had a little button saying "This is Spam" and if you click it the mail gets sent to some nice spam black list agency. They'd wait for about 10 people to do this, then verify it for the spam it is and then A) black list the spammer and B) send anti-spam email (subject: spam sender here ) nice and easy :)
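The collection side of such a button could be as simple as the sketch below. The class and method names are hypothetical; the threshold of 10 comes from the suggestion above, and flagged senders are only queued for human verification, not blacklisted automatically.

```python
from collections import defaultdict

REPORT_THRESHOLD = 10  # "about 10 people", as suggested above


class ReportDesk:
    """Collects 'This is Spam' clicks and flags senders for manual review."""

    def __init__(self, threshold: int = REPORT_THRESHOLD) -> None:
        self.threshold = threshold
        self.reporters = defaultdict(set)  # sender -> distinct reporting users

    def report(self, sender: str, reporter: str) -> bool:
        """Record one click; return True once the sender needs verification."""
        self.reporters[sender].add(reporter)
        return len(self.reporters[sender]) >= self.threshold


desk = ReportDesk()
flags = [desk.report("bulk@example.com", f"user{i}") for i in range(10)]
print(flags.index(True) + 1)  # prints 10
```

Counting distinct reporters rather than raw clicks keeps one annoyed user from flagging a sender single-handedly.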
  • A similar idea (no pun intended) by Ed Avis (Score:2) Monday January 27 2003, @10:24AM
  • hotmail by Koatdus (Score:1) Monday January 27 2003, @10:45AM
    • Re:hotmail by julesh (Score:1) Monday January 27 2003, @11:39AM
    • Brazil by jjga (Score:1) Monday January 27 2003, @12:55PM
      • Re:Brazil by Koatdus (Score:1) Monday January 27 2003, @07:14PM
  • Bayesian Filters by tacocat (Score:2) Monday January 27 2003, @10:52AM
  • by Adam9 (93947) on Monday January 27 2003, @11:01AM (#5167590) Journal
    Don't use this filtering if you're a high school teacher or something else that involves getting messages from teenagers..

    [E-mail from skittles9333@some.email marked as spam and deleted] So like, I was like sick, and like, I didn't go to school today. So like, I was told like, that Jim like said, that like you might like, have some homework due like tomorrow. Could you like, tell me what like that homework would like be?
  • Nope (Score:3, Insightful)

    by I Am The Owl (531076) on Monday January 27 2003, @11:14AM (#5167686) Homepage Journal
    Doesn't work for the Lameness Filter, won't work for spam.
  • Zip on DNA & Different Languages. by wilgamesh (Score:2) Monday January 27 2003, @12:26PM
  • My ruleset for Sendmail by doorbot.com (Score:1) Monday January 27 2003, @01:36PM
  • how to get rid of spam: those "99" values by kipple (Score:2) Monday January 27 2003, @02:29PM
  • Puts a dent in the old essay idea... by shepd (Score:1) Monday January 27 2003, @02:59PM
  • Similar techniques used to out author using alias by bahco (Score:1) Monday January 27 2003, @05:45PM
  • Putting it to the test. by kirkjobsluder (Score:2) Monday January 27 2003, @07:18PM
  • Who wrote this? by Puppet Master (Score:1) Tuesday January 28 2003, @09:06AM
  • Re:Text of the full article (Score:5, Insightful)

    by Anonymous Coward on Monday January 27 2003, @08:24AM (#5166818)
    > The current fad among spam filters is word-counting, with various statistical heuristics applied to the results.

    The current fad is in fact Bayesian filtering, sophisticated statistical analysis.

    gzip used this way can be viewed as a very poor Bayesian analysis with substantially lower effectiveness. Let's just skip the half-assed attempt and go straight to the real thing.
    • Re:Text of the full article by compling (Score:1) Monday January 27 2003, @09:04AM
    • Re:Text of the full article (Score:5, Informative)

      by Hal-9001 (43188) on Monday January 27 2003, @09:14AM (#5167038) Homepage Journal
      The scheme described in the article is not Bayesian at all. It's more like a very crude hash comparison. If two similar messages are concatenated, they should compress very well. If two dissimilar messages are concatenated, they will not compress as well.
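Read this way, the scheme might be sketched as follows: append the message to each corpus and see which one it adds fewer compressed bytes to. The function names and the tiny corpora are assumptions for illustration, and zlib stands in for gzip.

```python
import zlib


def added_bytes(corpus: bytes, message: bytes) -> int:
    """Extra compressed bytes needed when `message` is appended to `corpus`."""
    return len(zlib.compress(corpus + message, 9)) - len(zlib.compress(corpus, 9))


def looks_like_spam(message: bytes, spam_corpus: bytes, ham_corpus: bytes) -> bool:
    """True when the message compresses better against the spam corpus."""
    return added_bytes(spam_corpus, message) < added_bytes(ham_corpus, message)


# Hypothetical toy corpora; real ones would be far larger.
SPAM = (b"Click here to claim your free prize now! Limited time offer, act today.\n"
        b"Lose weight fast with this amazing offer. Order now and save money.\n")
HAM = (b"The build is failing on the main branch, can you take a look this afternoon?\n"
       b"Minutes from the project meeting are attached, comments welcome by Friday.\n")

print(looks_like_spam(b"Claim your free prize now! Limited time offer.\n", SPAM, HAM))
print(looks_like_spam(b"Can you take a look at the minutes from the project meeting?\n", SPAM, HAM))
```

With these toy inputs the first message should be judged spam and the second ham, because each one reuses long phrases from only one of the two corpora.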

      An actual Bayesian filter would perform a statistical analysis of an existing body of spam and non-spam messages, identify key words or phrases that identify a message as spam or non-spam, and calculate the probability for every key word that a message containing that word is spam. Then every new message is classified as spam or non-spam by running a statistical analysis on its content, and the statistics of that message update and improve the probability model.
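For contrast, a bare-bones version of the Bayesian approach just described could look like this. It is a sketch under simplifying assumptions (equal class priors, add-one smoothing, every word treated as independent), and the class name and training messages are invented.

```python
import math
import re
from collections import Counter


def words(text: str) -> list[str]:
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


class WordFilter:
    """Per-word spam probabilities combined under the naive independence assumption."""

    def __init__(self) -> None:
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Update the model with one labelled message ('spam' or 'ham')."""
        self.counts[label].update(set(words(text)))
        self.messages[label] += 1

    def word_spam_probability(self, word: str) -> float:
        """P(spam | word), with add-one smoothing and equal class priors."""
        p_word_spam = (self.counts["spam"][word] + 1) / (self.messages["spam"] + 2)
        p_word_ham = (self.counts["ham"][word] + 1) / (self.messages["ham"] + 2)
        return p_word_spam / (p_word_spam + p_word_ham)

    def spam_probability(self, text: str) -> float:
        """Combine the per-word probabilities in log-odds space."""
        log_odds = 0.0
        for word in set(words(text)):
            p = self.word_spam_probability(word)
            log_odds += math.log(p) - math.log(1.0 - p)
        return 1.0 / (1.0 + math.exp(-log_odds))


f = WordFilter()
f.train("Claim your free prize now", "spam")
f.train("Free offer, order now", "spam")
f.train("Meeting moved to Friday", "ham")
f.train("Patch attached, please review", "ham")

print(round(f.spam_probability("free prize inside"), 3))
print(round(f.spam_probability("please review the meeting notes"), 3))
```

Calling train() again on each newly classified message is what lets the probability model keep improving, as the last sentence above describes.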
    • Re:Text of the full article by NoseyNick (Score:2) Monday January 27 2003, @09:32AM
    • Re:Text of the full article by timeOday (Score:2) Monday January 27 2003, @10:38AM
    • GZIP used this way ... by fygment (Score:3) Monday January 27 2003, @12:59PM
  • Re:Text of the full article by Anonymous Coward (Score:2) Monday January 27 2003, @08:51AM
  • Re:Legislation (Score:3, Funny)

    by liquidsin (398151) on Monday January 27 2003, @09:05AM (#5166998) Homepage
    That's pretty harsh. Once the death sentence has been carried out, I see no reason not to parole them. Have some compassion.

  • RBL (Score:5, Interesting)

    by Penguinoflight (517245) on Monday January 27 2003, @09:11AM (#5167022) Homepage Journal
    RBL blocks a lot of stuff that isn't spam. It's probably a better idea to go with bayesian filtering. You can read up on it here: http://www.paulgraham.com/better.html [paulgraham.com]
  • Re:WTF? by CheeseburgerBlue (Score:1) Monday January 27 2003, @09:33AM