Using gzip As A Spam Filter

Posted by timothy on Monday January 27, 2003 @09:15AM from the showing-some-adaptability dept.

captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."
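
As a rough illustration of the quoted idea, the sketch below uses Python's standard zlib module (which implements the same DEFLATE algorithm that gzip uses) to compare how well a repetitive text and a more varied text compress. The sample strings and the helper name compression_ratio are made up for this example; they are not from the Kuro5hin article.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size divided by original size (lower = more repetition)."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical samples: one highly repetitive, one with little repetition.
repetitive = "BUY NOW! LIMITED OFFER! " * 40
varied = (
    "Thanks for the notes from Tuesday. I checked the build logs and the "
    "failure only shows up when the cache is cold, so I will try a clean "
    "checkout tomorrow and report back with whatever I find."
)

print(f"repetitive: {compression_ratio(repetitive):.2f}")
print(f"varied:     {compression_ratio(varied):.2f}")
```

The repeated phrase collapses into back-references, so the first ratio comes out far smaller than the second.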

This discussion has been archived. No new comments can be posted.

Grep it instead! (Score:2, Funny)

by WestieDog ( 592175 ) writes:

Forget about gzip all the 'cool' geeks use grep! :)
- Re:Grep it instead! (Score:2)
  
  by dubbayu_d_40 ( 622643 ) writes:
  
  How exactly would you know what to grep for? I believe the value is in pattern identification and grep can't do that. Then again maybe I'm just not 'cool' ;-)
  - Re:Grep it instead! (Score:5, Funny)
    
    by Walterk ( 124748 ) writes: <slashdot.dublet@org> on Monday January 27, 2003 @10:19AM (#5167060) Homepage Journal
    
    Just egrep for '(penis|enlarge|money|auction|cash|advance|fortune )'. And hope no hot babes email you complimenting your penis, or mention they want their breasts enlarged, offer you money, auction off your award winning lego collection or anything like that.
    
Raw data (Score:5, Informative)

by gazbo ( 517111 ) writes: on Monday January 27, 2003 @09:20AM (#5166798)

This article will make much more sense if you look at the raw data [willets.org] in tabular form.

- Not that different (Score:5, Interesting)
  
  by Synonymous Soured ( 627748 ) writes: on Monday January 27, 2003 @09:32AM (#5166857)
  
  A Bayesian spam filter uses an underlying order-0 Markov model of email messages. gzip uses an underlying order-1 Markov model.
  
  A Bayesian filter uses words as "symbols." gzip uses bytes as symbols.
  
  The right thing to do would be to combine them. Take a gzip-style Markov model, using bytes as symbols and conditional probabilities, and plug it into a Bayesian filter. That would (1) make the filter more powerful and (2) make the filter applicable to any sort of data, arbitrary binary or readable text. Negligible computational overhead, sharper discrimination.
  
  - Sorry, that's not right (Score:5, Interesting)
    
    by martin-boundary ( 547041 ) writes: on Monday January 27, 2003 @10:35AM (#5167122)
    
    Only naive bayesian models are 0-order Markov. The "naive" refers precisely to the zero order independence assumption. You can have 1-order, 2-order, n-th order bayesian models if you like. Those are called n-gram models. After that, you can have bayesian phrase based models if you like, or paragraph based also.
    Bayesian only refers to how you use the probabilities.
    Now gzip implements similar ideas to LZW compression, which uses variable sized prefixes, which is quite different from a 1-order Markov model. For example, an order-1 Markov model is not allowed to depend on more than the current and immediately preceding symbol.
    
- bzip2 results (Score:5, Informative)
  
  by K-Man ( 4117 ) writes: on Monday January 27, 2003 @03:31PM (#5168756)
  
  Several knowledgeable people pointed out that the first try was limited by gzip's 32k window size, so I did a quick run with bzip2, which uses a 900k block, and put the results here [willets.org]. Somewhat different, but still a spread between spam/ham.
  
  And, of course, do try this at home.
  
It's all spam (Score:4, Funny)

by amigaluvr ( 644269 ) writes: on Monday January 27, 2003 @09:21AM (#5166800) Journal

Hey if you compress all of your mail with gzip then it all looks like foreign spam anyway!

- Re:It's all spam (Score:5, Interesting)
  
  by greenjinjo ( 580285 ) writes: on Monday January 27, 2003 @09:39AM (#5166885)
  
  You know, I noticed something peculiar. If you're from a non-English speaking country, like I am, you can filter the spam by looking at the language of the subject. In my case, if it is English it is almost certainly spam.
  Do English-speaking people receive spam in foreign languages?
  
  - I receive a lot of Russian spam (Score:2)
    
    by Mustang Matt ( 133426 ) writes:
    
    I'm mostly guessing it's Russian. I don't recognize it as any other language and it usually comes from an unmasked .ru domain.
    
    In Soviet Russia...
  - Korean (Score:2)
    
    by ONOIML8 ( 23262 ) writes:
    
    I have no idea why but I receive a lot of spam in korean.
  - Foreign Language Spam (Score:2)
    
    by phorm ( 591458 ) writes:
    
    Not really very often, although since I have an email account on a German provider I have gotten a slight bit of German spam. I think a lot of it comes from "sign up" sites, unless you have a strongly public-visible website with your email address on the main page (damn trafficmagnet ads) - most companies in other countries probably aren't going to bother to pick up your email address if they don't expect you to understand the language.
    
    Since a large portion of popular sites online are in English, it stands to reason that when you enter your email address on an English site, it gets added to an English spam list. Since I don't sign up on any Korean/Swiss/etc. sites, they haven't gotten my email address yet (or don't care about it).
    
    That being said, people in N. America and english speaking countries do get a lot of spam in english from foreign servers - which is where IP range blocklists and spamassassin come in handy.
  - Re:It's all spam (Score:2)
    
    by Jester99 ( 23135 ) writes:
    
    I've received thousands of spam messages (20-30 per day...) and perhaps with a couple exceptions that I'm forgetting, they've *all* been in English.
    
    Of course, an insane number of those spam messages seem to be duplicates of themselves sent day after day, but still. Everything's in English in my account. :\
Slashdot filter (Score:5, Interesting)

by fredrikj ( 629833 ) writes: on Monday January 27, 2003 @09:22AM (#5166809) Homepage

Sounds very much like that lameness filter on Slashdot that refuses to accept a post if its contents can be compressed easily... of course, it's quite simplistic compared to gzip.

- Re:Slashdot filter (Score:4, Informative)
  
  by pudge ( 3605 ) writes: <slashdot AT pudge DOT net> on Monday January 27, 2003 @10:20AM (#5167061) Homepage Journal
  
  Um, except that Slash uses gzip for its compression. So, no. :-)
  
  What is different, as has been pointed out, is that Slash compresses a particular post and looks at how well it compresses, but does not compress/compare with other posts.
  
  - Re:Slashdot filter (Score:3, Funny)
    
    by fredrikj ( 629833 ) writes:
    
    Oops. Well, my experience from my troll accounts is that the filter does a lousy job, I could never have guessed that something that sophisticated was behind it ;)
    
    Err, ignore the troll account part, I never said that.
- Re:Slashdot filter (Score:2)
  
  by McCart42 ( 207315 ) writes:
  
  In other words, if your post is more noise than signal, you're on the right track. ;)
Meet the Bayesian Filtering Algorithm (Score:5, Informative)

by dpete4552 ( 310481 ) writes: <slashdot&tuxcontact,com> on Monday January 27, 2003 @09:25AM (#5166822) Homepage

http://www.paulgraham.com/spam.html

- Re:Meet the Bayesian Filtering Algorithm (Score:3, Informative)
  
  by dilute ( 74234 ) writes:
  
  Bayesian filtering looks at word occurrence statistics. This is saying just compare the bulk redundancies of a message as compared to a collection of test messages of a known type, without even looking at the "words". May not be the ultimate filter (and I doubt it could be), but it's real interesting, I think, that this appears to have considerably greater than zero accuracy.
  
  OTOH, it seems to me that some other model, such as a scheme that gives legitimate senders explicit advance AUTHORIZATION to send you email, might be what's needed. How to implement that is, well, left as "an exercise for the reader" -- actually, this has been discussed on /.
  - Re:Meet the Bayesian Filtering Algorithm (Score:5, Informative)
    
    by coyul ( 119455 ) writes: on Monday January 27, 2003 @11:34AM (#5167480)
    
    OTOH, it seems to me that some other model, such as a scheme that gives legitimate senders explicit advance AUTHORIZATION to send you email, might be what's needed.
    
    I understand what you're saying, but there are a couple of problems with this, depending on how you implement it. If you allow potential correspondents to request authorization by email, you'll still have to process at least one message per originating address. That obviously won't work to eliminate spam (or even cut it down to size...) The other option is to force potential correspondents to request authorization over another channel (phone, fax, whatever), but this neatly destroys a lot of the convenience of email. It also eliminates the impersonal nature of email (by forcing a personal contact) when it is partly this impersonality that distinguishes it in the first place (and encourages some first time correspondents to make contact at all...)
    
    May not be the ultimate filter (and I doubt it could be), but it's real interesting, I think, that this appears to have considerably greater than zero accuracy.
    
    Actually, the Bayesian filter implemented by POPFile [sourceforge.net] is remarkably accurate. A friend of mine has been using it since it debuted on slashdot in November [slashdot.org] and it has correctly classified all of the spam he's received since (76% of his email in total, unfortunately...)
    
    You can also set up POPFile to process the headers of your messages as well as the body, so it will effectively learn the email addresses of people you're willing to receive email from anyway. Depending on how you define words (what you use as token separators), you could attempt to make it generalize to domains as well.
    
    - Re:Meet the Bayesian Filtering Algorithm (Score:2)
      
      by dilute ( 74234 ) writes:
      
      You could put their mail in a putative spam folder and send them an explanatory message with a link to a web page where they can get "authorized" and put on a "buddy list". On that page you could do a variety of things, depending on whether you just wanted to screen out automated mailers or really wanted to pre-qualify the sender.
      
      I'm skeptical about heuristic filters, because of the possibility of the occasional false positive, which could be an embarrassment (or worse).
      
      However, the filtering technology is very much of interest to me, for other reasons... I will take a look at POPFile for sure.
Right tool for right job (Score:2, Interesting)

by WPIDalamar ( 122110 ) writes:

Sure, this sounds like a nice academic activity, but really ... In the real world, use the right tool for the right job. I tend to think word redundancy does not correlate directly to spaminess.
HTML (Score:5, Interesting)

by Pilferer ( 311795 ) writes: on Monday January 27, 2003 @09:26AM (#5166827)

That's because most spam includes large amounts of HTML.

My friends do not use HTML in email. Ads for "Crimescene Cocksuckers" do.

- Re:HTML (Score:2, Informative)
  
  by phrantic ( 630202 ) writes:
  
  Another problem with HTML is that, given some level of sophistication on the part of the spammer, they can embed a file (a GIF or JPG) in the HTML whose unique name is associated with your email address. You open the mail and the file is requested (it doesn't even have to exist); the 404 error or the GET request is logged on the server, and then it is a simple matter of matching the requested files to email addresses to get a list of good email addresses. This is a really useful technique for "closed loop marketing", which is the corporate edition of spam.
  - Re:HTML (Score:2)
    
    by Lussarn ( 105276 ) writes:
    
    You are quite right, except that you don't have to embed anything. Just put an image tag in the mail.
Excellent (Score:5, Funny)

by Phosphor3k ( 542747 ) writes: on Monday January 27, 2003 @09:28AM (#5166835)

Slashdot can use it to filter out duplicate stories.

It won't work for businesses (Score:5, Funny)

by autocracy ( 192714 ) writes: <slashdot2007@stor y i n m e m o .com> on Monday January 27, 2003 @09:29AM (#5166840) Homepage

Anything from mid-level management or the marketing department would immediately be marked as spam and trashed. Maybe not very important in the first place, but you'd at least need to be able to say "yeah, I saw the memo on the TPS reports."

- Yes please! (Score:2)
  
  by CoolVibe ( 11466 ) writes:
  
  If the marketing goons would have to write properly punctuated, nicely formatted mails to reach me, instead of that all UPPER CASE or all lower case overhyped brainless repeated dribble they usually pelt me with, I say sign me up!
  ;-)
Spam Conference talk (Score:5, Interesting)

by Matts ( 1628 ) writes: on Monday January 27, 2003 @09:30AM (#5166845) Homepage

Jason Rennie gave an extremely interesting talk about this at the MIT Spam Conference this month, although he wasn't using quite as direct a method; instead he was looking at MDL (Minimum Description Length). This is a technique to discover features in corpora that allow you to describe the classification of a corpus in the minimum number of details.

Basically it's a way to discover features in emails using compression techniques, so rather than having us SpamAssassin developers have to carefully and manually examine emails to see what's new and interesting about them, MDL techniques can automatically detect these features.

Jason Rennie's web page (talk and paper available) about this is here [mit.edu]. Please do read it as it's extremely interesting.

The one downside of it is that Jason said at the end of his talk that it's extremely slow at doing the feature detection. When asked how slow he said that on a reasonably small corpus it took 4 months (although he said it was written in Perl, so a C port is probably a good plan).

In comparison to Bayesian techniques the MDL technique presents a great deal of interest - primarily because I work for a company doing spam filtering at the internet level [messagelabs.com], and so we can't feasibly do personal training which is what makes Bayesian techniques so great (see the talk I gave at the MIT spam conference). Without the personal training Bayes is only about 90-95% effective, so it should be interesting to see where these techniques lead us.

- Re:Spam Conference talk (Score:3, Interesting)
  
  by ajs ( 35943 ) writes:
  
  I think, at the Internet level, RBLs (mirrored by you, obviously for speed's sake) and such are your best weapon. The more of the net you have by the short patch-cables, the more significant you make each RBL that you listen to.
  
  At the personal level, each of these newly "discovered" techniques (I remember a /. article about using gzip for analysis of other document structures years ago) will make a fine addition to statistical systems like SpamAssassin, which uses them to build a very accurate model of a piece of mail's "spamishness".
  - Re:Spam Conference talk (Score:3, Insightful)
    
    by Matts ( 1628 ) writes:
    
    Actually it's the other way around. DNSBLs (not RBLs - that's a specific term for MAPS' list) are fine for personal users, and even for some businesses, but generally they have way too high a false positive rate for any kind of generic filtering. The SpamAssassin team has done lots of research into this, see for example the slide at the very end of my talk.
    
    No, for a large scale service you need much lower rates of false positives than any of the DNSBLs provide right now. They're fine as inputs into heuristic or statistical systems, but on their own they are just not accurate enough.
    - Re:Spam Conference talk (Score:3, Interesting)
      
      by ajs ( 35943 ) writes:
      
      But, aren't those "false positives" (usually so-called innocent open relays and people sharing netblocks with spammers) what you want?
      
      In the case of open relays, yes, a whole company can be hosed mail-wise when they get on a list, but if multiple BLs agree, then they've got a problem that needs to be fixed.
      
      For the case of people who share a spammer's address range, I feel for them, but... do I really want to take the pressure off of them in favor of flooding the world with spam? I'd personally be pissed at my ISP for allowing such spammers to screw over MY reputation among the BLs. ISPs should behave accordingly, but right now why would they? They get far more money from spammers than from people who will leave because a few folks listening to the BLs get mail from your customers.
      
      Spam is an ugly thing, and combating it is hard. Casualties are going to arise. The question is: how do you minimize that list of casualties and make sure that people know the safety dance ahead of time?
      - Re:Spam Conference talk (Score:2)
        
        by Matts ( 1628 ) writes:
        
        No. False positives are always bad. A false positive means you blocked a legitimate mail. A mail that was not spam. A mail that was not from a spammer, but from a person trying to contact you.
        
        Frankly it's the spammers that should suffer, not the legitimate users. False positives in the fight against spam cause nothing but animosity. We've had DNSBLs for a long time now, and I see nothing but an increase in the level of spam. Are DNSBLs working for you? Maybe. Is the collateral damage model reducing the amount of spam the world sees? Nope. Not remotely.
        
        Time to move on, try something else. Time to stop more spam and hit them in the pocket. We've no evidence that will work either, but at least we're trying something.
- Re:Spam Conference talk (Score:5, Insightful)
  
  by archeopterix ( 594938 ) writes: on Monday January 27, 2003 @11:39AM (#5167503) Journal
  
  MDL, gzip, neural networks, bayesian filtering and probably a bunch of other spam-filtering methods are all based on the following scheme: get a (big) number of spam messages, a number of non-spam messages (preferably specific to the current user of the filter) and use a learning algorithm on these to produce an automatic classifier.
  What bothers me about this method is that you can never be 100% sure what the learning algorithm will actually learn. My friends seldom send me HTML mail. Most of my spam is HTML. A learning algorithm will probably learn that HTML mail is spam, especially if it never gets HTML "ham" during its training period. Then if one of my clueless friends sends me an HTML message, it will not go through and this is clearly bad.
  I will never trust an automatic filter so as to delete a message marked as "spam" without reading, but I think it can still be useful for ranking messages, so that spam gets read less often and deleted faster.
  
Quantitative, not qualitative (Score:5, Interesting)

by psplay ( 572886 ) writes: <J@@@psplay...com> on Monday January 27, 2003 @09:30AM (#5166846)

It's not simply the words that are used in a mail, but the way they are used (the order), that gives a sentence its meaning.

For example, two emails:

1 (ham) "You have won a brand new Convertible, from the competition you entered."

and

2 (spam) "A brand new convertible to be won, have you entered?"

Ham would match about 80% with spam.

Word matching is a blunt instrument, as mentioned. The English language is far too complex for simple calculations, and this should be taken into consideration when applying a 'Spam Likelihood' rating to emails.
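
To make the "about 80%" figure concrete, here is one possible way of counting, sketched in Python. The post does not say how the number was derived, so the measure used here (the share of the spam message's distinct words that also occur in the ham message) is an assumption, as is the helper name words.

```python
import re


def words(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))


ham = "You have won a brand new Convertible, from the competition you entered."
spam = "A brand new convertible to be won, have you entered?"

shared = words(ham) & words(spam)
# Fraction of the spam message's distinct words that also appear in the ham.
overlap = len(shared) / len(words(spam))

print(sorted(shared))
print(f"{overlap:.0%}")  # 80%
```

Because sets ignore order, the two sentences look nearly identical to this measure even though they read very differently.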

- Re:Quantitative, not qualitative (Score:5, Interesting)
  
  by iapetus ( 24050 ) writes: on Monday January 27, 2003 @09:32AM (#5166856) Homepage
  
  If I see either of those in my inbox, it's almost certainly spam. You don't think you really filled in all of those 'feedback' forms about sex toys that you keep getting responses from, do you?
  
- This happened to a friend of mine (Score:2)
  
  by pommiekiwifruit ( 570416 ) writes:
  
  He actually won a nice car from a contest he entered with an internet company! I saw his picture sitting in the car in a dead-tree magazine.
  The only slight problem was that he doesn't drive :-)
Don't compress (Score:3, Funny)

by Fuzzums ( 250400 ) writes: on Monday January 27, 2003 @09:31AM (#5166854) Homepage

Usually I don't compress my spam.

I delete it.

This will save me a lot more space ;-)

Same old problem... (Score:5, Insightful)

by artemis67 ( 93453 ) writes: on Monday January 27, 2003 @09:34AM (#5166864)

Filtering is not a true spam solution. All it takes is one false positive, a Really Important Email that gets accidentally deleted, to totally destroy the value of any filtering system.

Given that, the alternative to having tagged emails automatically deleted is to collect them in a folder and scan the message senders and subject lines. If you're doing that, then the spammer is getting a pitch through to you in the subject line. This therefore does not lessen the incentive for the spammer, but simply causes him to change tactics and put his best pitch in his subject line.

Right now, I get 60-80 spams a day. What happens when I start getting 600-800 a day? Again, filtering starts to break down, because I have SO MANY messages to scan every day that the possibility of me missing a legitimate one is very high.

- Re:Same old problem... (Score:3, Informative)
  
  by isorox ( 205688 ) writes:
  
  I usually cope by having a couple of folders in kmail I flush spam into
  
  BODY contains "The following message was sent to you as an opt-in subscriber to RB Express."
  FROM contains Trivia
  TO or CC contains "johnsmith@isorox.co.ku"
  FROM contains theracingpost.com
  TO or CC contains "spam" (I use sitespam@isorox to sign up to sites)
  BODY contains "to receive" AND "more of these offers"
  Move to a Spam folder
  
  If TO or CC doesn't contain
  isorox.co.ku
  exeter.ac.ku
  ex.ac.ku
  
  Move to possible Spam
  
  That gets about 80-90% of my spam.
  
  I skim Possible Spam when I get time, usually once or twice a day. I skim Spam about once every 2 days. I've got a couple of rules that just delete the spam straight off (known junk addresses that I'll never need, certain subjects, etc.). I keep all my spam too, and check it when I get time, just in case.
- Re:Same old problem... (Score:5, Interesting)
  
  by djmurdoch ( 306849 ) writes: on Monday January 27, 2003 @09:49AM (#5166939)
  
  Filtering is not a true spam solution. All it takes is one false positive, a Really Important Email that gets accidentally deleted, to totally destroy the value of any filtering system.
  
  One of the side effects of spam is that there are no "Really Important Emails" any more. Spam and spam filters have degraded the reliability of email to such an extent that you'd have to be crazy to send anything Really Important by email.
  
  Right now, I get 60-80 spams a day. What happens when I start getting 600-800 a day?
  
  That's a good point. The solution is to get less spam. You can do that by changing email addresses frequently (a really inconvenient solution that I don't recommend), or by getting spammers shut down (or yourself listwashed by the spammers).
  
  Let the spammers know that if they send something to you, they'll lose money, and they won't send you so much spam. SpamCop [spamcop.net] reporting makes this easy. If you want to be listwashed, don't munge your address when you send reports. (This is an option with SpamCop.)
  
  Some people claim that you'll get more spam or get listbombed or something if you send complaints without munging; that's not my experience. I get 20-30 spams per day, total, at all of my 4 publicly available email addresses. (Ninety to 95 percent of them get caught by the SpamCop filters, which have almost never caught valid email.)
  
- Re:Same old problem... (Score:2)
  
  by johnburton ( 21870 ) writes:
  
  Frankly I doubt that very many people are so important that losing a single email is that important. And if it is then email is not the appropriate way to send the information as it's not 100% reliable anyway.
- Re:Same old problem... (Score:2)
  
  by Kragg ( 300602 ) writes:
  
  False positives don't destroy the value of filtering at all. I find it massively helpful not to get irritated by alerts 50 times a day when I receive another bloody spam message.
  And I don't miss the false positives because I scan my spam. But the key point is I don't interrupt what I'm doing in order to respond to spam anymore. Well, less often anyway.
  
  Spam is bad, but spam is life. Filtering is not perfect, but it is helpful.
- Re:Same old problem... (Score:2, Interesting)
  
  by ch-chuck ( 9622 ) writes:
  
  If you're looking for the mathematically perfect zero fault spam solution in a world full of Msft and human beings, forget it.
  
  What happens when I start getting 600-800 a day?
  
  Start another account and don't give it to strangers who might sell it. Only give it to the person or persons who are going to send that really important email message. Throw in a few random numbers so if one gets leaked to spammers you can track the source (i.e., I gave my employment agency (obviously an important contact) chuck369, and nobody else. Now if chuck369 starts getting spam we know employment agency leaked it). Use 'throw away' accounts for untrusted contacts who might leak it to spammers.
- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
- Re:Same old problem... (Score:2)
  
  by kirkjobsluder ( 520465 ) writes:
  
  Filtering is not a true spam solution. All it takes is one false positive, a Really Important Email that gets accidentally deleted, to totally destroy the value of any filtering system.
  
  Given that, the alternative to having tagged emails automatically deleted is to collect them in a folder and scan the message senders and subject lines. If you're doing that, then the spammer is getting a pitch through to you in the subject line. This therefore does not lessen the incentive for the spammer, but simply causes him to change tactics and put his best pitch in his subject line.
  
  I guess that this is an interesting question. I keep hearing this argument that filtering is a bad thing because of the risk of false positives. But how is the risk of false positives reduced by removing the filter? Spam filtering for me is a valuable cognitive aid. (One modification to spam assassin would be to put the spam score on the subject line.) I can live with skimming subject lines because many spam models are based on the number of hits from users who buy or click on links in spam.
  
  I also think that it argues a straw man. I don't read very many comments from people who believe that filtering is "the solution". However, content-based filtering is one valuable tool for sorting through large numbers of messages. By all means we should pursue transport-based and source-based strategies for fighting spam as well. But these have their own problems.
  
  Finally, if someone wants to cold-call me out of the blue with a Really Important Message, don't they have a responsibility to compose their message without much of the hype, and html text that gets flagged as spam? It would seem that such a cold-call would have no problems getting through as long as they don't make excessive use of all caps, font tags, embedded images, base-64 encoded text, and references to my penis. If it was really important enough to be worth my time, then it probably is not going to have enough spam features to be flagged as spam.
- - Re:Same old problem... (Score:2)
    
    by artemis67 ( 93453 ) writes:
    
    The problem is not the 9,999 messages that you know are going to come from good senders; the problem is the one message that may be coming from an unexpected source that is going to cause you to sift through 50,000 spam emails looking for it.
    
    That is why filtering fails as a solution.
    
    You know that email from the headhunter that wanted to double your current pay rate and cut your hours by a third? No you don't, because it got flagged as spam and accidentally deleted.
Spammers will adjust their tactics (Score:5, Interesting)

by ultrabot ( 200914 ) writes: on Monday January 27, 2003 @09:34AM (#5166866)

Obviously it wouldn't be a big problem for the spammers to run their creative gems through gzip, and alter the content until they achieve lower compression ratio. Even including a bunch of garbage after the message might do the trick. I believe equivalent analysis can be done cheaper with non-gzip tools...

- - Re:Moron (Score:5, Interesting)
    
    by ultrabot ( 200914 ) writes: on Monday January 27, 2003 @09:59AM (#5166968)
    
    Another moron that didn't read the article.
    
    I actually read the article.
    
    The proposal is not to see how compressible is the message but to use a compression tool to see how lookalike the message is to a corpus of spam.
    
    Yes, and compression ratio is used to determine this.
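
A minimal sketch of the "how lookalike is the message to a corpus" idea, assuming the comparison is done by appending the message to each corpus and seeing how many extra compressed bytes it costs. It uses Python's zlib (DEFLATE, as in gzip); the function names and the two tiny corpora are invented for illustration and are not taken from the article.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def extra_bytes(corpus: bytes, message: bytes) -> int:
    """Bytes the message adds to the compressed corpus when appended to it."""
    return compressed_size(corpus + message) - compressed_size(corpus)


def classify(message: bytes, spam_corpus: bytes, ham_corpus: bytes) -> str:
    """Label the message by whichever corpus absorbs it more cheaply."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Tiny made-up corpora; a real test would use many saved messages.
SPAM = b"Click here to claim your free prize now. Act today, limited offer.\n"
HAM = b"The meeting is moved to Thursday. Please send the quarterly report.\n"

print(classify(b"Claim your free prize now, limited offer.", SPAM, HAM))
print(classify(b"Please send the quarterly report before the meeting.", SPAM, HAM))
```

Note that DEFLATE only looks back 32 KB, so with a large corpus only its tail can contribute matches, which is the window limitation raised elsewhere in this discussion.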
    
Alternative (Score:5, Interesting)

by Dexter77 ( 442723 ) writes: on Monday January 27, 2003 @09:35AM (#5166869)

When the spam is filtered at user-account level, you can only do it by parsing a single mail in some way and determining if it's spam or not. It's like trying to tell whether a movie is bad by looking at one picture. If the spam could be filtered at the server level, by comparing mails that are received into different accounts, you could really tell which ones are part of a mass-mail (spam).

One problem with this is the right to open other people's mail. But you could use some basic scrambling (rot-13) to make sure that no one sees the inside. It wouldn't make a difference to the comparing script.

Mailing lists might cause a problem too but wouldn't it be easier to allow the mailing lists you belong to than trying to block the ones you don't belong to?
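
The server-level comparison could be sketched roughly as follows. This is only an illustration under stated assumptions: it swaps the rot-13 scrambling for a SHA-256 digest (rot-13 is trivially reversible, while a digest still lets identical bodies be matched), it only catches copies that are identical after whitespace and case normalization, and the names body_digest, find_bulk and min_accounts are hypothetical.

```python
import hashlib
from collections import defaultdict


def body_digest(body: str) -> str:
    """Hash a lowercased, whitespace-normalized body so the text itself is not kept."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_bulk(deliveries, min_accounts=3):
    """Return digests of bodies delivered to at least min_accounts distinct accounts.

    deliveries is an iterable of (account, body) pairs.
    """
    accounts_by_digest = defaultdict(set)
    for account, body in deliveries:
        accounts_by_digest[body_digest(body)].add(account)
    return {d for d, accts in accounts_by_digest.items() if len(accts) >= min_accounts}


# Made-up deliveries: the same pitch reaches three accounts, one note reaches one.
deliveries = [
    ("alice", "Cheap pills NOW"),
    ("bob", "cheap  pills now"),
    ("carol", "Cheap pills\nnow"),
    ("alice", "Lunch on Friday?"),
]
bulk = find_bulk(deliveries)
print(len(bulk))  # 1
```

Allowing subscribed mailing lists, as suggested above, would mean skipping digests whose sender is on a per-user allowlist before flagging; that step is left out here.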

Sequitur Most Likely Superior (Score:5, Interesting)

by Baldrson ( 78598 ) writes: on Monday January 27, 2003 @09:36AM (#5166872) Homepage Journal
The statistics generated by Sequitur [rutgers.edu] are most likely superior to Gzip.
As an example of how Sequitur works, the string 'abcabdabcabd' produces the following grammar rules:
1. 2 c 2 d
2. a b
Representing the original string then is the sequence:
1 1
The usage counts of the rules are available as output options.
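
The grammar above can be checked with a short sketch. This does not implement Sequitur itself; it only expands the rules given in the comment back into the original string and counts how often each rule is referenced. Treating "usage count" as the number of references on right-hand sides is an assumption, as are the rule number 0 for the top-level sequence and the function names.

```python
from collections import Counter

# Grammar from the comment above: rule 0 is the top-level sequence "1 1".
# Integers refer to rules; strings are literal characters.
grammar = {
    0: [1, 1],
    1: [2, "c", 2, "d"],
    2: ["a", "b"],
}


def expand(symbol, rules):
    """Recursively expand a rule number into the string it stands for."""
    if isinstance(symbol, str):
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])


def usage_counts(rules):
    """Count how many times each rule is referenced on a right-hand side."""
    counts = Counter()
    for body in rules.values():
        counts.update(s for s in body if isinstance(s, int))
    return counts


print(expand(0, grammar))           # abcabdabcabd
print(dict(usage_counts(grammar)))  # {1: 2, 2: 2}
```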
- - Re:Sequitur Most Likely Superior (Score:2, Insightful)
    
    by A55M0NKEY ( 554964 ) writes:
    
    But your rule list is now getting big and still has to be stored. Compression is about minimizing the amount of stuff that has to be stored to recreate the original. It would be nice to have a few simple, very reusable rules that you can use to generate the original with a very few commands.
Yay! (Score:5, Funny)

by Anonymous Coward writes: on Monday January 27, 2003 @09:42AM (#5166900)

What an idea!

I could use this to avoid those people who keep saying the same thing all the time, over and over again...

Now, how can I convince my mother to use e-mail?

My spam compression approach (Score:2)

by joshv ( 13017 ) writes:

I just use one of those newfangled file compression utilities that you can apply recursively to the compressed output, resulting in any arbitrary degree of compression one desires.

After at most 10 applications of said compression utility, all emails look like this:
"1"

I never see any spam.

-josh
What is spam, though? (Score:5, Funny)

by Big Mark ( 575945 ) writes: on Monday January 27, 2003 @09:43AM (#5166903)

The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text.

Ah. I thought to detect really useless, annoying, pointless, bandwidth-sapping and time-consuming email all you had to do was look for "fwd:" in the subject line.

-Mark

How to stop spam.... (Score:4, Informative)

by oliverthered ( 187439 ) writes: <oliverthered&hotmail,com> on Monday January 27, 2003 @09:44AM (#5166906) Journal

1: Get an email account with unlimited addresses.
2: When registering use a unique address e.g. slashdot@mydomain.com
3: Make sure you check/uncheck the "give my email address to mailing lists" box.
4: If ever you get spam to that account get litigious.

Use something like mailinglists@mydomain.com, and block anything that doesn't come from mailing lists you've subscribed to.

- Re:How to stop spam.... (Score:3, Insightful)
  
  by Jugalator ( 259273 ) writes:
  
  Still, you use hotmail (aka "spammer's heaven") here on Slashdot. But thanks for the tip, perhaps we should start trying it out? :-)
  - Re:How to stop spam.... (Score:2)
    
    by Matey-O ( 518004 ) writes:
    
    An 'airtight' hotmail account (One signed up that's not advertised nor given out on USENET or the web) STAYS just as spam free as one from aol or earthlink.
    
    I've got two hotmail accounts that have been relatively spam free for years.
    
    I say relatively because you'll still receive spam if they guess [commonfirstname][commonMiddleName][CommonLastname]@msn.com
    
    Heck, one of 'em's the email I signed up on slashdot with!
- Re:How to stop spam.... (Score:2, Informative)
  
  by NoseyNick ( 19946 ) writes:
  
  I've been doing this for years, and in practice, it just means I get 12 copies of most spams, because they got my address from 12 different places, usually web archives of the mailing-lists.
  
  You can't refuse mail from non-lists to mailinglists@your.domain, because then nobody can contact you saying "I saw your post on foo-list and was wondering if I could get a copy of foo-prog and if you could tell me how you made it foo bar baz".
  - filtering across multiple accounts (Score:2)
    
    by klparrot ( 549422 ) writes:
    
    it just means I get 12 copies of most spams
    What about having a filter check all your accounts at once? If you're receiving the same email on more than one account, chances are it's spam.
- Re:How to stop spam.... (Score:5, Interesting)
  
  by DeadSea ( 69598 ) writes: on Monday January 27, 2003 @11:13AM (#5167317) Homepage Journal
  
  You need to expand on your step 4.
  When I started getting spam, I wanted it to stop. I realized I couldn't just disable the email address because there might be somebody out there counting on it to contact me. I could disable it and send an autoreply with my current email address, but then spammers would just be able to look at the reply. I needed some solution where people could send me email even if the address they used had been disabled, but spammers wouldn't be able to get my current address. I decided to put a contact form on my website. Now I autorespond to a disabled email address with the contact form url. In addition, I was able to remove email addresses from my website which was a large source of spam.
  Not being able to find a contact form that was secure, I ended up writing my own and releasing it under the GPL. You can get it at http://ostermiller.org/contactform/ [ostermiller.org].
  I also realized that no matter how hard you try, your email address will leak to spammers. Even giving an email address only to your closest friends and family will not prevent it from leaking out. Between the klez virus, gift certificates, invitation, greeting card, and crushlink websites, even my most personal email address leaked to spammers. You can't be afraid to disable an email address and send your friends the new one. Now even if I missed a friend, they can still get a message to me.
  
Just use a string entropy calculation algorithm... (Score:4, Interesting)

by Domini ( 103836 ) writes: on Monday January 27, 2003 @09:46AM (#5166920) Journal

It's inefficient to have so much memory overhead.

Besides, if I were a spammer, I could pad it with high entropy data at the end to make up for my warbling.

- Re:Just use a string entropy calculation algorithm (Score:2)
  
  by a2800276 ( 50374 ) writes:
  
  If I were a spammer, I couldn't care less if some nerd using string entropy calculation filters out my spam, because said nerd using weird home grown filtering is also more likely to a.) not reply anyway b.) submit my open relays to blackhole lists c.) complain to my ISP etc. etc.
  
  If I were a spammer I'd concentrate more on trying to get average users to open my mail even though they've learned that Cindy's "Haven't seen you in ages, JOE23" Emails aren't real. And how to circumvent whatever anti-spam measures come installed in JOE23's AOL software.
  
  Anyways, some geek in his dorm room is not likely to have enough money to buy penis prosthetics anyway and can also figure out how to jerk off to free thumbnail-pics.
  
  If spammers started padding their mail with high entropy data I would set up a filter that filters out mails based on how close the character recognition is to standard English HTML-formatted mails, and discards random junk.
  
  But then spammers would start not just using high entropy material from /dev/srandom (really nerdy spammers themselves, who know not to trust /dev/random) but generating random characters with similar characteristics as English.
  
  Then the antispammer would have to use fuzzy-logic spell-checking and the spammer would have to start using random words out of the dictionary and finally spammers would be left with no other option than to send me really nice personalized eCards that say "Happy Birthday!" with a little singing chicken, because I haven't found a way to filter those yet. I can only filter spam with mammals
  - Re:Just use a string entropy calculation algorithm (Score:2)
    
    by Domini ( 103836 ) writes:
    
    Agreed that this is not the best way to filter spam... it is fraught with peril.
    
    What I was suggesting is that ISPs actually employ these methods... thus the average user will not even know they were spammed. (Most ISPs employ a troop of Geeks who know where to do:
    "strings /dev/random")
    
    Personally I prefer an active approach (such as ASK), and preferably the one with the features that has a minimal impact on legitimate users. I still receive about 30 spam mails a day, but with a combination between my ISP's anti-spam system, and my active spam protection, I see about 1 every month only.
    - Re:Just use a string entropy calculation algorithm (Score:2)
      
      by a2800276 ( 50374 ) writes:
      
      Just to keep on bickering (sorry, bad habit): strings /dev/random wouldn't work cause my super duper filter checks for the proper distribution of letters, i.e. more e's than q's and, cause it's spam, lots of html thingies.
      
      You're right on the money as far as filtering at the ISP is concerned, that's where the most benefit would be for the end-user. I see two problems, though.
      
      First, the ISP has to pay bandwidth for the incoming email, spend money on filtering but then isn't rewarded with more time/bandwidth consumed by their clients. Secondly, I think they'd be deathly afraid of inadvertently filtering out some false positives and being sued.
      
      Think what would happen if some marketing department tries to send their customer the rough draft of a mailing and it keeps getting eaten by the ISP's spam filter.
  - - Re:Just use a string entropy calculation algorithm (Score:3, Interesting)
      
      by a2800276 ( 50374 ) writes:
      
      d0rk! Ignoring the fact that I was being sarcastic and artistic license would have permitted me to specify /dev/my_ass let me just say this: before you make statements trying to make people look stupid you should probably have a clue what you're talking about.
      
      While true that your measly Linux machine has no /dev/srandom, this device is the source for _s_ecure random data on OpenBSD and it's probably available some other places as well. Some random trivia (pun intended), checking around I noticed: AIX and Solaris both don't typically have /dev/random at all.
      
      But anyway, back to your question: if you're sad you don't have /dev/srandom you could try the following:
      
      ln -s /dev/zero /dev/srandom
Compression algorithms as filters... (Score:5, Insightful)

by Jugalator ( 259273 ) writes: on Monday January 27, 2003 @09:46AM (#5166925) Journal

.. sounds like a poor idea to me. Yes, you can measure the amount of redundancy in a message, but:

a) Spammers might not always use messages redundant enough to be detectable from regular text.

b) If I happened to use some words a little too often, especially when writing mails discussing technical stuff or posting computer code fragments, would that be classified as spam?

I think this is a nice filter when sorting out more or less repetitive mails (spam or not) from novels, but a filter based on a spam database sounds better to me.

I can't figure this out... (Score:4, Interesting)

by shivianzealot ( 621339 ) writes: on Monday January 27, 2003 @09:52AM (#5166950)

A couple of posts above state that spammers will "just adjust their tactics." Talk like this always puzzles me; on the spammer's side, does this not help him? If I'm selling a combination weight loss drug/mail order bride/penis enlarger/cable descrambler for only three payments of $49.99 in such a manner that every spam blocker in the world filters me, logically I'm only being filtered by people who know better than to buy my "product," thus not irritating them, in effect helping to slow regulation, and I don't lose touch with any significant chunk of my target demographic. Of course, this applies with the exception of corporate environments or similar situations where Joe Insecure has someone else managing spam.

Can anyone share some +5 Insight on the matter?

- Re:I can't figure this out... (Score:5, Insightful)
  
  by Motherfucking Shit ( 636021 ) writes: on Monday January 27, 2003 @11:01AM (#5167249) Journal
  
  If I'm selling a combination weight loss drug/mail order bride/penis enlarger/cable descrambler for only three payments of $49.99 in such a manner that every spam blocker in the world filters me, logically I'm only being filtered by people who know better than to buy my "product," thus not irritating them, in effect helping to slow regulation, and I don't lose touch with any significant chunk of my target demographic.
  This would make sense if the only people implementing spam filters were end users. Unfortunately, the logic breaks down when you consider that some ISPs do the filtering on behalf of their customers. It breaks down further when you factor in the number of situations in which a) the customer might not even know that the filtering is happening, or b) the customer blindly trusts the ISP's filtering system.
  
  Take Yahoo, for example. They're a popular webmail service and they also do spam filtering to some extent on inbound email. I would say that, in general, people who use Yahoo mail are not necessarily the type of people who "know better" than to buy spamvertised products. That's not a slam on Yahoo, nor on the people who use Yahoo mail, it's just the way the demographics work out. The ratio of ripe targets to clued-in antispammers is simply better at Yahoo than it is on other domains.
  
  To that end, Yahoo's spam filters aren't helping the spammers any. A spammer's goal is to get his ad in front of as many potential targets as possible, and Yahoo is full of potential targets. But if Yahoo's filters catch the spammer's message and route it straight to everyone's Bulk Mail folder, there's (thousands|millions) of "targets" who will never see the message.
  
  So no, I can't agree that filtering helps the spammers any, at least not the big spammers who are after volume. There's probably a bit of "collateral assistance" in that people who would report the spam may never see it, but I'd say that benefit is cancelled out by the number of possible targets lost to filters.
  
- Re:I can't figure this out... (Score:3, Insightful)
  
  by stilwebm ( 129567 ) writes:
  
  It's true that the sellers want that. However, you may have noticed spammers are not always the sellers. The seller is looking for someone to do some "email marketing" for them. They are looking for wide coverage. They want to see things like "your email can be sent to 30 million unique email addresses," which means a few million that might get through, a few thousand that will actually get read, and maybe a few purchases. Spammers are just creepy marketers who want to make it sound like emailing as many people as possible is better, and should cost the seller more. Since they use open relays and random forged "From" email addresses, they never see what email gets blocked. Using images in HTML email they can get an idea of how many emails were read (this is why you should turn off images in email). While the spammer makes a commission on every sold item, they also make money selling lists and marketing services.
  
  The numbers are part of their pissing contest, and the pool is your inbox. Spammers are not that bright, but their customers are much, much more stupid.
- Re:I can't figure this out... (Score:2)
  
  by buss_error ( 142273 ) writes:
  
  Can anyone share some +5 Insight on the matter?
  Rule 1: Spammers lie.
  Rule 2: Spammers are stupid. Not to say they are not cunning, but stupid.
  Rule 3: If you think a spammer is telling you the truth, see Rule 1.
  Rule 4: Spammers will stop when they can't make money fast! spamming.
Stopping Spam (Score:5, Insightful)

by Inflatable Hippo ( 202606 ) writes: <{ku.oc.oohay} {ta} {oppih_elbatalfni}> on Monday January 27, 2003 @09:55AM (#5166961) Journal

> stupid filtering isnt gonna get you rid of spam... go complain at spammers upstream providers...

Filters only work to a limited extent, and so might shutting down the spammers, if it were possible.

But neither is going to solve this problem.

The only solution I can think of is wide-spread adoption of PGP (or equivalent) aware mailers and certification of mail.

The problem with mail addresses is that you have no control over their spread. If I give one to a company it'll usually leak out in the end and it's open season on my inbox.

However if "genuine" mail is certified and mailers use certification validity as a filtering criterion then it simplifies the game hugely.

Your mailer can spot the people you've genuinely given your address to, and naturally "distrust" uncertified (effectively anonymous) mail or mail whose certificate has been revoked or is unknown to you.

The "only" things standing in the way of this are:

1. Slow adoption of certification/encryption in mass market mailers. Usually poor or missing.
2. Cost/difficulty of getting a valid certificate (e.g. with Verisign).
3. The pain of typing a password every time you send a mail.
4. It only works if everyone joins in.

But nothing's for free and this strikes at the heart of email's usability.

I'm continually surprised by the lack of certification use at least by large corporations and governments, but I suppose it removes plausible deniability :-)

- Re:Stopping Spam (Score:2, Insightful)
  
  by iamchris ( 311218 ) writes:
  
  Think about this: Why do I get 1000's of spam emails per month and I get 10's of pieces of junk snail mail/month? Simple: It costs nearly nothing to send millions of spam messages, while it costs a bundle to send junk snail mail.
  
  A simple solution would be to find a way to charge per email...
  
  Now, I certainly wouldn't pay per email. But, I shouldn't complain when someone abuses a messaging system that allows millions of messages to be sent out for nearly no cost. I use that system too, on a much smaller scale, for personal and legitimate business use.
  
  All I can do is ignore as much of the mail as I can, and BOYCOTT anything that is sold via spam.
  
  Ag.
- Disposable Email addresses (Score:2)
  
  by KMitchell ( 223623 ) writes:
  
  The problem with mail addresses is that you have no control over their spread. If I give one to a company it'll usually leak out in the end and it's open season on my inbox.
  
  I came to this realization driving home from work one night. My immediate follow-up thought was, why not make email addresses disposable, with a nice automated interface to control which ones will fwd to your "real" mailbox? I had worked out a rough framework for how I'd implement this at a site-wide level by the time I got home, only to discover that I wasn't the first one to come up with the idea. A quick google search on "disposable email address" found about half a dozen services that do (more or less) what I'd hashed out.
  
  Doesn't solve everything, but it does give you a lot more control when choosing what to put in the "email" form when you buy something online :)
- Re:Stopping Spam (Score:2, Interesting)
  
  by misof ( 617420 ) writes:
  
  The only solution I can think of is wide-spread adoption of PGP (or equivalent) aware mailers and certification of mail.
  
  I have to discourage your optimism a bit. IF the public-key encryption ever finds its way to the general public (I hope and think so), there are two possibilities:
  a) Your public key will be available for the general public -- this is how it will probably work. If someone wants to send you an e-mail, he obtains your public key in a trusted way (e.g. from a trusted key server), encrypts the message and sends it. If the spammer wants to send you spam, once he gets your e-mail address, he does exactly the same. Obtains your public key, encrypts the spam and sends it. The only difference with today's situation: it will be impossible to filter spam on the server side (only to block some spamming IP addresses, no server-side spam filters).
  b) You give your public key only to your friends you trust. This is exactly the approach "everything coming from an address, that's not in my address book, has to be spam." and even contradicts the basic idea: it's your public key...
Email to my girlfriend (Score:5, Funny)

by FroBugg ( 24957 ) writes: on Monday January 27, 2003 @10:13AM (#5167031) Homepage

Unfortunately, using this my girlfriend would never get any of my emails.

"I'm sorry. Really, really, really, really sorry. I'm so very, very, very sorry. I'm sorry..."

Spammers just found another loophole.. (Score:5, Interesting)

by SystematicPsycho ( 456042 ) writes: on Monday January 27, 2003 @10:18AM (#5167053)

I received a nice piece of spam the other day. I didn't read it but I usually scroll to the bottom to see if they have the mandatory (in some places mandatory I think) unsubscribe method. This method sure gets me mad -

To unsubscribe by postal mail, please send your request to:
P.O Box 272521
Boca Raton, FL 33427
Ref # XXXXXX -- scd

(XXXX.. replaced real reference number)

It seems that the unsubscription method doesn't have to be by email - just as long as it's by something and it's there. They mustn't be specific in the law. Of course, no one is going to go write a letter by snail mail to unsubscribe to spam, although sending them some dog shit through the mail is tempting. I forgot the site that provides that service. Hrmm I should change my sig.

32k Window... (Score:4, Informative)

by pridkett ( 2666 ) writes: on Monday January 27, 2003 @10:29AM (#5167094) Homepage Journal

The fact is, that unless your SPAM corpus and HAM corpus are both under 32k, this won't work. Gzip is fast because it only has a 32k sliding window, meaning that it only searches for like strings in a 32k window around what you're currently compressing. Hate to break it to you, but 32k is not enough for a corpus. I think Bzip2 uses something larger (900k?), but I forget what it is.

I'll be happy with spam assassin [spamassassin.org] until I get CRM114 [sourceforge.net] (and mailfilter) trained and working.
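
The window limit can be seen with Python's `zlib` module, which writes the same deflate format as gzip. This is a minimal sketch: the block size, the gap sizes and the helper names are assumptions chosen for illustration, and random bytes stand in for mail text so that the only thing left to compress is the repeated block.

```python
import random
import zlib

WINDOW = 32 * 1024  # deflate's maximum back-reference distance


def random_bytes(n, seed):
    """Deterministic, essentially incompressible filler."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


def cost_of_repeat(gap_size, block_size=1000):
    """Extra compressed bytes needed to append a second copy of a block
    that is separated from its first copy by gap_size bytes of filler."""
    block = random_bytes(block_size, seed=1)
    gap = random_bytes(gap_size, seed=2)
    without_repeat = len(zlib.compress(block + gap, 9))
    with_repeat = len(zlib.compress(block + gap + block, 9))
    return with_repeat - without_repeat
```

Comparing `cost_of_repeat(1024)` with `cost_of_repeat(2 * WINDOW)` shows the effect: once the first copy has slid out of the window, the second copy costs about as much as brand-new data.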

Similar article on heise was published a year ago (Score:2, Informative)

by hanzwurst ( 547741 ) writes:

German newsticker heise [heise.de] had a similar article a year ago, although it does not cover spam explicitly.
The article has a link to another article published in "Physical Review Letters" which deals with the topic of identifying content/author by applying compression algorithms.
The underlying idea is that the size of LZ77-compressed data is close to the entropy of the message.
Even Better (Score:2, Informative)

by HereAllNight ( 645064 ) writes:

Who needs all of these complicated schemes? I just filter the sending domains as they come. Filter every sender containing "specials", "optin", "offer", "special", "deal", "email", "reward", "value", "promotion" and "super", and all subject lines starting with "friend", and 85% is taken care of right away. So far my formula has had no false positives.
Yawn -- read your papers (Score:4, Informative)

by Anonymous Coward writes: on Monday January 27, 2003 @11:01AM (#5167247)

There was a paper published in PRL a couple of years ago that wanted to identify languages using gzip (Benedetto et al: Language Trees and Zipping [sissa.it]). It sure sounded cool, but was quickly forgotten when Joshua Goodman took a closer look [microsoft.com] (link is down at the moment, probably IIS, Text version in Google Cache [google.com]).

Correction (Score:2, Insightful)

by misof ( 617420 ) writes:

The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text.

There is a minor problem with this sentence. And with this whole gzip business. It is misleading. Words, phrases? You cannot force gzip to match words; gzip tries to exploit every similarity it finds, even at the character level. E.g., if your "spam dictionary" contains the words sex and pants, mail about sextants will have a good compression ratio. And there is no way to prevent this. That's why the Bayesian filters (operating on words) outperform gzip by a league. That's one of several reasons why I think this article belongs not on /. but in a wastebin instead. It simply presents a worse approach to do something. Interesting idea, yes, but that's all.

(Just FYI: it has been proven that the bzip2 algorithm due to Burrows and Wheeler exploits all such repetitions in the input file nearly optimally -- within some small ratio. Hence, it is even worse to use it as a spam filter :-)
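
The sextants case can be reproduced with a toy matcher in Python. This is a simplified sketch, not gzip's actual algorithm: it only looks for matches in the dictionary (real deflate also matches against earlier parts of the message itself and uses lazy matching), and `greedy_matches` is a made-up name. The 3-byte minimum match length is deflate's real limit.

```python
MIN_MATCH = 3  # deflate does not emit matches shorter than 3 bytes


def greedy_matches(dictionary, message, min_match=MIN_MATCH):
    """Greedily cover message with the longest substrings found in dictionary."""
    matches = []
    i = 0
    while i < len(message):
        length = 0
        # Extend the candidate match one character at a time.
        while i + length < len(message) and message[i:i + length + 1] in dictionary:
            length += 1
        if length >= min_match:
            matches.append(message[i:i + length])
            i += length
        else:
            i += 1  # too short to reference, would be stored as a literal
    return matches
```

`greedy_matches("sex pants", "sextants")` returns `['sex', 'ants']`: seven of the eight characters are covered by fragments that ignore word boundaries.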
Repost? (Score:4, Interesting)

by fulldecent ( 598482 ) writes: on Monday January 27, 2003 @11:16AM (#5167334) Homepage

This post looks like it came from my previous reply [slashdot.org] on a way to detect entropy (non-repetitious content) in P2P files
Here is a code snippet from the comment:

#!/bin/bash
# Entropic analysis by Full Decent
SIZE=$(cat "$1" | wc -c).0
CSIZE=$(gzip -c --best "$1" | wc -c).0
ENTROPY=$(echo "scale=4; $CSIZE / $SIZE * 100" | bc)
echo "$1 is ${ENTROPY}% entropic"

How about.... (Score:3, Interesting)

by slummerx86 ( 642287 ) writes: on Monday January 27, 2003 @11:22AM (#5167381)

if all the email clients had a little button saying "This is Spam" and if you click it the mail gets sent to some nice spam black list agency. They'd wait for about 10 people to do this, then verify it for the spam it is and then A) black list the spammer and B) send anti-spam email (subject: spam sender here ) nice and easy :)

A similar idea (no pun intended) (Score:2)

by Ed Avis ( 5917 ) writes:

The other day I hacked together a script similarity [membled.com] which uses gzip compression to work out how similar two files are. I find this useful when searching for almost-duplicate files.
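
One common way to turn compressed sizes into a similarity score is the normalized compression distance. The sketch below uses it with Python's `zlib`; it is not the linked script, whose method is not described here, and the function names are assumptions.

```python
import zlib


def csize(data):
    """Compressed size in bytes at the highest zlib level."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance between two byte strings:
    close to 0 for near-duplicates, close to 1 for unrelated data."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

For two files it could be called as `ncd(open(a, "rb").read(), open(b, "rb").read())`. Because deflate only looks back 32 KB, the score is mainly meaningful for small files.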
Bayesian Filters (Score:2)

by tacocat ( 527354 ) writes:

Sorry, but I don't see how this is anything different from just another spin on Bayesian Statistical filtering of spam that everyone's been playing with.

It's hardly patentable. But it is interesting to see. But, once you look at it, not surprising.
Messages from teenagers would be spam (Score:5, Funny)

by Adam9 ( 93947 ) writes: on Monday January 27, 2003 @12:01PM (#5167590) Journal

Don't use this filtering if you're a high school teacher or something else that involves getting messages from teenagers..

[E-mail from skittles9333@some.email marked as spam and deleted] So like, I was like sick, and like, I didn't go to school today. So like, I was told like, that Jim like said, that like you might like, have some homework due like tomorrow. Could you like, tell me what like that homework would like be?

Nope (Score:3, Insightful)

by I Am The Owl ( 531076 ) writes: on Monday January 27, 2003 @12:14PM (#5167686) Homepage Journal

Doesn't work for the Lameness Filter, won't work for spam.

Zip on DNA & Different Languages. (Score:2, Interesting)

by wilgamesh ( 308197 ) writes:

This reminds me that about a year ago, three Italian scientists came up with a way to find species relatedness by using the zip algorithm. One takes the sequence of bacteria 1, and then attaches a little bit of bacteria X sequence to the end of that. Again, one attaches a bit of bacteria X sequence to the end of bacteria 2. And then zipping is done on this concatenation. The final compression size of just the bacteria X part ended up telling us the homology (or relatedness) of bacteria X to bacteria 1 or 2.

But from reading all these posts, perhaps a Bayesian method would work just as well. There seems to be no inherent advantage to using zip. One still needs a reference piece of work (non-spam email, or bacteria 1) for comparing entropies or probabilities. Of interest also is that the researchers applied their method to generating an accurate language tree of Indo-European languages (grouped by relatedness of course.)

The ref & abstract of above paper is here:

Phys. Rev. Lett. 88, 048702 (2002)
Dario Benedetto,1 Emanuele Caglioti,1 and Vittorio Loreto2,3

In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We present the implementation of the method to linguistic motivated problems, featuring highly accurate results for language recognition, authorship attribution, and language classification. ©2002 The American Physical Society
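
The append-and-zip procedure described in this comment can be sketched in Python as follows. This is not the authors' code: `zlib` stands in for whatever zipper they used, the function names are made up, and deflate's 32 KB window means only the tail of a long reference actually influences the result.

```python
import zlib


def added_cost(reference, sample):
    """Compressed bytes the sample adds when appended to the reference."""
    with_sample = len(zlib.compress(reference + sample, 9))
    alone = len(zlib.compress(reference, 9))
    return with_sample - alone


def closest(references, sample):
    """Name of the reference after which the sample compresses best.

    references maps a label (a language, a species, "spam"/"ham") to bytes.
    """
    return min(references, key=lambda name: added_cost(references[name], sample))
```

With `references = {"spam": spam_corpus, "ham": ham_corpus}` the same two functions give the spam test discussed in the article, under the same window caveat.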
- Re:Text of the full article (Score:5, Insightful)
  
  by Anonymous Coward writes: on Monday January 27, 2003 @09:24AM (#5166818)
  
  > The current fad among spam filters is word-counting, with various statistical heuristics applied to the results.
  
  The current fad is in fact Bayesian filtering, sophisticated statistical analysis.
  
  gzip used this way can be viewed as a very poor Bayesian analysis with substantially lower effectiveness. Let's just skip the half-assed attempt and go straight to the real thing.
  
  - Re:Text of the full article (Score:5, Informative)
    
    by Hal-9001 ( 43188 ) writes: on Monday January 27, 2003 @10:14AM (#5167038) Homepage Journal
    
    The scheme described in the article is not Bayesian at all. It's more like a very crude hash comparison. If two similar messages are concatenated, they should compress very well. If two dissimilar messages are concatenated, they will not compress as well.
    
    An actual Bayesian filter would perform a statistical analysis of an existing body of spam and non-spam messages, identify key words or phrases that identify a message as spam or non-spam, and calculate the probability for every key word that a message containing that word is spam. Then every new message is classified as spam or non-spam by running a statistical analysis on its content, and the statistics of that message update and improve the probability model.
    
  - Re:Text of the full article (Score:2, Informative)
    
    by NoseyNick ( 19946 ) writes:
    
    > The current fad is in fact Bayesian filtering, sophisticated statistical analysis.
    
    Bayesian filtering IS word-counting with (not very sophisticated) statistical heuristics applied to the results.
    - Re:Text of the full article (Score:3, Informative)
      
      by Arkham ( 10779 ) writes:
      
      Bayesian filtering IS word-counting with (not very sophisticated) statistical heuristics applied to the results
      
      This may be the case, but most of the newer filters available now are not really Bayesian filtering by this definition. I use spambayes [sourceforge.net], and it has some very sophisticated algorithms to determine the statistical probability of the "spamminess" of a ham/spam.
      
      Some of these fancier algorithms were developed by Gary Robinson and are discussed in some detail here [weblogs.com]. You can see the results of these different classification techniques (gary combining, chi-squared) in some nice graphs here [sourceforge.net].
      
      On a related note, spambayes is VERY accurate in catching spam for me. Amazingly so in fact. It does a far better job than SpamAssassin or the Bayesian filter in Mail.app in my personal experience.
  - Re:Text of the full article (Score:2)
    
    by timeOday ( 582209 ) writes:
    
    Guess what, Bayesian filtering IS a statistical heuristic applied to word counts.
    First you count the occurrences of each word in spam and nonspam. This gives you the probability that spam contains the word, and that nonspam contains the word. Then you use Bayes' theorem to compute the reverse - the probability that, given a message contains a word, it is spam or nonspam. You take the product of this value for all words in the message. Then you normalize so the sum of probability of spam and nonspam equals 1. (This is a so-called "naive Bayesian classifier". Somebody might be using a Bayesian network with a more complicated structure, but it would still be based on WORD COUNTING as the first step)
  - GZIP used this way ... (Score:3, Interesting)
    
    by fygment ( 444210 ) writes:
    
    ... can be universal. The principles used actually have their roots in the theories put forward by R. Solomonoff and Kolmogorov (links below). Any given string of bits can be assigned a "complexity" which is proportional to the length of the shortest program that can generate that string. It isn't usually computable BUT the size of the output file of a compression algorithm can be shown to be a reasonable if crude approximation. The beauty is that this approach (minimum description length or MDL) is clustering email in a very fundamental way without the bias' that can be introduced with assumptions required by Bayesian techniques and arguably making use of all the information (vice a subset chosen by the Bayesian user) contained in the email. Yes, the answers can be the same but the MDL approach is universal and the same classifier without modification could be used for broader clustering tasks i.e beyond binary classification of junk/not_junk to multi-class classification junk/best friend/mom/dad/wife/work/etc.
    
    As an aside, since it could be fully automated, it would be interesting to run such an algorithm with a graphical display, say a 2D plot of compression size vs time of day, just to see what shakes out.
    
    By the way, the problematic portion for bioinformatics apps is the compression. DNA sequences often exhibit _expansion_ when put through the common compression schemes. Li has come up with a compression scheme that is more optimal called GenCompress.
    
    Kolmogorov Complexity - http://www.idsia.ch/~marcus/kolmo.htm
    Minimum Description Length - http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/
    Bioinformatics app - http://www.cs.ucsb.edu/~mli/sam.ps
    GeneCompression Program - http://www.cs.cityu.edu.hk/~cssamk/gencomp/GenCompress1.htm
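The idea of using compressed size as a crude stand-in for complexity, and of extending it past junk/not_junk to several classes, might look something like this. It is only a sketch under stated assumptions: `zlib` stands in for gzip, the helper names and the tiny corpora are hypothetical, and a message is assigned to whichever corpus it adds the fewest compressed bytes to.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size of the zlib-compressed data: a rough proxy for its complexity."""
    return len(zlib.compress(data, 9))


def extra_bytes(corpus: bytes, message: bytes) -> int:
    """How many more compressed bytes the message costs when appended to a corpus."""
    return compressed_size(corpus + message) - compressed_size(corpus)


def classify(message: bytes, corpora: dict) -> str:
    """Pick the label whose corpus 'explains' the message most cheaply."""
    return min(corpora, key=lambda label: extra_bytes(corpora[label], message))


# Hypothetical toy corpora; real ones would hold many messages per class.
corpora = {
    "junk": b"buy cheap pills now, cheap pills online, order cheap pills today. ",
    "work": b"the quarterly report is due friday, please review the budget spreadsheet. ",
    "family": b"hi honey, dinner on sunday at grandma's house, bring the kids. ",
}

print(classify(b"order cheap pills online today", corpora))
print(classify(b"please review the quarterly budget report", corpora))
```

Nothing in `classify` is specific to two classes, which is the point made above about universality. Keep in mind that deflate only looks back 32 KB, so with large corpora only the tail of each corpus would influence the result.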
- Re:Text of the full article (Score:2, Interesting)
  
  by Anonymous Coward writes:
  
  Reminds me of a program I helped with for a short time in college called "Siff" (ftp://ftp.cs.arizona.edu/reports/1993/TR93-33.ps), which would find similar files by taking small fingerprints (32-bit hashes) of 50-byte sequences and finding groups of files that shared a lot of them. It worked surprisingly well, even when the files were modified extensively.
  I've often thought since that large mailhubs (yahoo, hotmail, etc) could automatically filter junk mail efficiently by a similar method, perhaps by limiting the delivery rate per fingerprint or just flagging high-occurrence hashes as suspect. Each mail would then be rated by how many of its fingerprints are among this group; too many without an ADV: or bulk-mail tag would cause it to be marked as SPAM.
  I wonder if it'd be possible to have a network of smaller hubs accomplish the same thing, perhaps even using an encrypting checksum instead of a simple hash so that individuals could contribute without anyone being able to recreate their original messages?
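The fingerprinting idea can be illustrated with a short sketch. This is not Siff's actual algorithm: it simply hashes every 50-byte window with CRC-32 (a 32-bit hash chosen here for convenience) and measures how many fingerprints two messages share. The function names are made up for the example.

```python
import zlib

WINDOW = 50  # bytes per fingerprinted sequence, as in the description above


def fingerprints(data: bytes, window: int = WINDOW) -> set:
    """Return the set of 32-bit hashes of every window-sized slice of data."""
    if len(data) < window:
        return {zlib.crc32(data)}
    return {zlib.crc32(data[i:i + window]) for i in range(len(data) - window + 1)}


def shared_fraction(a: bytes, b: bytes) -> float:
    """Fraction of the smaller fingerprint set that also appears in the other."""
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / min(len(fa), len(fb))


original = (b"Limited time offer: refinance your mortgage today at the "
            b"lowest rates available anywhere in the country.")
variant = b"Dear friend, " + original + b" Click here to unsubscribe."
unrelated = (b"Minutes of the planning meeting held on Tuesday, including "
             b"the action items assigned to each attendee.")

print(shared_fraction(original, variant))
print(shared_fraction(original, unrelated))
```

A mail hub following the suggestion above would keep a running count per fingerprint across all incoming mail and treat frequently seen values as suspect. Hashing every window is expensive, so a practical system would sample only some of them.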
- Re:this is nice (Score:3, Informative)
  
  by gazbo ( 517111 ) writes:
  
  No, the lameness filter does nothing like this. The lameness filter (strictly the postercomment compression filter) just sees how well the isolated text compresses. Too high a compression ratio implies too much repetition (hence likely repeatedly copy+pasted junk); too low a ratio implies random chars, since real English contains plenty of redundancy.
  This, on the other hand, talks about gzipping the mail in the context of corpora of known spam or known ham. Thus it serves as a classification of types of English text, whereas the Slashdot system only tries to classify whether or not it is actually English text at all.
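A compression check of the first kind, on isolated text, can be sketched in a few lines. This is not Slashdot's code, and the thresholds are invented for illustration: the sketch just flags text whose compressed size is suspiciously small or suspiciously large relative to the original.

```python
import os
import zlib


def compression_ratio(text: bytes) -> float:
    """Compressed size divided by original size (lower means more repetitive)."""
    return len(zlib.compress(text, 9)) / len(text)


def looks_lame(text: bytes, low: float = 0.2, high: float = 0.95) -> bool:
    """Flag text that compresses far too well or hardly at all.

    The low/high thresholds are assumptions made for this example.
    """
    if not text:
        return False
    ratio = compression_ratio(text)
    return ratio < low or ratio > high


repeated = b"first post! " * 100
noise = os.urandom(2000)
english = (b"The lameness filter just sees how well the isolated text "
           b"compresses. Too much repetition suggests pasted junk, while "
           b"text that barely compresses at all is probably random "
           b"characters rather than ordinary written English.")

print(looks_lame(repeated), looks_lame(noise), looks_lame(english))
```

Note that this says nothing about whether the text is spam, only whether it looks like ordinary prose, which is the distinction drawn above.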
- Re:Maybe I am missing something here (Score:4, Funny)
  
  by 6Yankee ( 597075 ) writes: on Monday January 27, 2003 @09:48AM (#5166936)
  
  the text in each is quite varied; e.g. longer xxx
  
  The text in each of my spams seems to have more XXX...
  
- Re:Legislation (Score:3, Funny)
  
  by liquidsin ( 398151 ) writes:
  
  That's pretty harsh. Once the death sentence has been carried out, I see no reason not to parole them. Have some compassion.
- RBL (Score:5, Interesting)
  
  by Penguinoflight ( 517245 ) writes: on Monday January 27, 2003 @10:11AM (#5167022) Journal
  
  RBL blocks a lot of stuff that isn't spam. It's probably a better idea to go with Bayesian filtering. You can read up on it here: http://www.paulgraham.com/better.html [paulgraham.com]
  
