Spam The Internet Your Rights Online

Two Spam Filters 10 Times As Accurate As Humans

Nuclear Elephant writes "The authors of two spam filters, CRM114 and DSPAM, announced recently that their filters have achieved accuracy rates ten times better than a human is capable of. Based on a study by Bill Yerazunis of CRM114, the average human is only 99.84% accurate. Both filters are reported to have reached accuracy levels between 99.983% and 99.984% (1 misclassification in 6250 messages) using completely different approaches (CRM114 touts a Markovian approach, while DSPAM implements a Dolby-type noise reduction algorithm called Dobly). If you're looking for a way to rid your inbox of spam, roll on over to one of these authors' websites."
This discussion has been archived. No new comments can be posted.

  • IM Spam (Score:5, Interesting)

    by jeffskyrunner ( 701044 ) on Monday February 23, 2004 @09:15PM (#8368806)
    Once Email Spam is eliminated, then IM spam will begin...
  • by Chuck Bucket ( 142633 ) on Monday February 23, 2004 @09:16PM (#8368818) Homepage Journal
    Can this be used with SpamAssassin, or is it a standalone program? Does it need something like Amavis to run?

    CB
  • Huh? (Score:1, Interesting)

    by MBCook ( 132727 ) <foobarsoft@foobarsoft.com> on Monday February 23, 2004 @09:17PM (#8368829) Homepage
    OK, I am the one who DEFINES what spam is for me, hence everything I say is spam is, and everything I say isn't, isn't. I'm 100% accurate by the fact that as the person who defines what spam is for me, I know exactly what spam is.

    Would someone like to explain how a program (even if it's right 99.something% of the time) is more accurate than me (100%)?

  • Is this possible? (Score:1, Interesting)

    by Knetzar ( 698216 ) on Monday February 23, 2004 @09:17PM (#8368833)
    How does one test a program like this that's more accurate than humans?
  • Better (Score:5, Interesting)

    by gid13 ( 620803 ) on Monday February 23, 2004 @09:17PM (#8368840)
    Well, it certainly sounds better than the pay-per-email "postage" idea. If postage hasn't stopped snail spam, why would it stop e-mail spam?
  • by hatrisc ( 555862 ) on Monday February 23, 2004 @09:17PM (#8368841) Homepage
    but can you identify spam before opening it 100% of the time? Now, I realize that the mail program is looking at the actual data as well, which gives it an advantage, but on the other hand, how else can IT detect spam?
  • Combined accuracy? (Score:2, Interesting)

    by LagDemon ( 521810 ) on Monday February 23, 2004 @09:21PM (#8368893) Homepage
    Does this mean that if I use the 2 together, I get a 99.99999728% accuracy? Awesome! That means it would take months for me to see a single error!
  • by isaac338 ( 705434 ) on Monday February 23, 2004 @09:21PM (#8368894)
    1 in 6250?

    Who wants to bet that they only sent two 'spam' and one of them was disguised well? ;)
  • by sisukapalli1 ( 471175 ) on Monday February 23, 2004 @09:21PM (#8368898)
    I reached the conclusion of "two filters better than humans" by using two sequential filters:
    server side spamassassin, and a couple of simple procmail recipes. They have kept almost all the SPAM away.

    However, it is good to see such good techniques becoming available, and we can hope to see them as straightforward, usable tools.

    So, when will mozilla/TB (or your favourite server side or client side filter) get them?

    S
  • knowspam.net (Score:2, Interesting)

    by flyingrobots ( 704155 ) on Monday February 23, 2004 @09:22PM (#8368910)
    I still think it is the best 'filter' available, since filtering is a lookup into a database of 'good senders' http://www.knowspam.net [knowspam.net]
  • Re:Spamassassin (Score:3, Interesting)

    by pclminion ( 145572 ) on Monday February 23, 2004 @09:24PM (#8368931)
    It's hard to believe that a single approach like this is better than SpamAssassin.

    SpamAssassin is a single approach. It looks at a bunch of features, then combines them linearly and compares the result against a threshold function. It's a relatively simplistic method, compared to these two. Not hard to see how more sophisticated methods could do better.
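As a rough illustration of the "features combined linearly, then compared against a threshold" scheme described in the comment above, here is a minimal Python sketch. The rule names, patterns, weights, and threshold are all invented for illustration; they are not SpamAssassin's actual rules or scores.

```python
import re

# Hypothetical rules: (name, pattern, weight). Not SpamAssassin's real rule set.
RULES = [
    ("mentions_viagra", re.compile(r"viagra", re.IGNORECASE), 2.5),
    ("shouting_subject", re.compile(r"^Subject: [^a-z\n]*$", re.MULTILINE), 1.5),
    ("free_offer", re.compile(r"\bfree\b", re.IGNORECASE), 1.0),
    ("mailing_list_header", re.compile(r"^List-Id:", re.MULTILINE), -2.0),
]
THRESHOLD = 3.0  # assumed cutoff, chosen only for this example


def linear_score(message: str) -> float:
    """Sum the weights of every rule whose pattern matches the message."""
    return sum(weight for _, pattern, weight in RULES if pattern.search(message))


def is_spam(message: str) -> bool:
    """Flag the message when the linear score reaches the fixed threshold."""
    return linear_score(message) >= THRESHOLD
```

The weights and rules here are fixed in advance, which is the contrast the comment draws with filters that learn from the mail they see.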

  • Here's the real test (Score:3, Interesting)

    by Otter ( 3800 ) on Monday February 23, 2004 @09:28PM (#8368972) Journal
    I'm very happy with POPFile but there's one thing it just can't handle -- bounces from spam with my domain forged in the header when the original text isn't included. And how could it know? The response is the same whether it's to my mail or to spam. The domain is a clue, I guess, but otherwise it seems like an impossible task. I just let them be sorted into my inbox and delete them manually.

    If these filters can hit 99.99% with those, I'd be quite impressed.

  • Re:wait, WTF? (Score:3, Interesting)

    by LBArrettAnderson ( 655246 ) on Monday February 23, 2004 @09:29PM (#8368990)
    look at it this way... you've just tuned in to your favorite radio station and you hear your favorite DJ talking about something. Sometimes you can't tell whether what he's saying is an advertisement or something he's discussing for the sake of discussing.

    i'm sure there's spam out there that makes it seem like it's one of your friends talking to you (sending with "nick" or "john" as the sender name) and talks to you in a friendly manner about how great this product is.

    i've got a few of those, but luckily all my friends have weird names.
  • by Elwood P Dowd ( 16933 ) <judgmentalist@gmail.com> on Monday February 23, 2004 @09:31PM (#8369003) Journal
    No, humans are not 100%.

    If you see a strange name in your inbox with an odd title, that might be a Nigerian businessman, or it might be your long lost Nigerian brother.

    I recently tried to order a t-shirt from this guy for a band he used to be in. I found his band because we have the same (semi-uncommon) name. So, he got an email From: himself. I had to send him two emails because he deleted the first one assuming it was spam.

    I ordered some RAM for my dad a while back. He gets 200 spam emails a day (email addy in resume & web page), and he deleted the confirmation email from the RAM vendor. The RAM never shipped, and it took us a week to figure out that there was a problem.

    People make mistakes all the time. Why is this an unexpected result? People are jackasses. This should be obvious.
  • by heldlikesound ( 132717 ) on Monday February 23, 2004 @09:32PM (#8369013) Homepage
    I order all kinds of stuff online, wouldn't the receipt emails look like spam? My current spam solution is very simple:

    1. display my email online as little as possible

    2. use a number of addresses that all filter into one account, then filter by the sent-to address... this has turned up some VERY interesting results. For instance, I used dellorders@mydomain.com for an order from Dell, and NEVER used it or even typed it anywhere again, and started getting spam about 6 months later, and I mean the nasty stuff, not just innocent stuff from Dell resellers...

    3. i built a rudimentary filter that looks for viagra, free, debt, enlarge, etc... if the sender is not in my address book, and the email contains these words, it is sent to a "check these out" folder...

    How might a spam filter help me out without zapping confirmation type emails?
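The keyword-plus-address-book filter from step 3 above might look roughly like the following Python sketch. The word list comes from the comment; the function name, the folder names other than "check these out", and the way the address book is represented are assumptions made for illustration.

```python
SUSPECT_WORDS = {"viagra", "free", "debt", "enlarge"}  # word list from the comment


def route(sender: str, body: str, address_book: set[str]) -> str:
    """Return the folder a message should be filed into.

    address_book is assumed to hold lower-cased sender addresses.
    """
    if sender.lower() in address_book:
        return "inbox"  # known senders bypass the keyword check entirely
    # Strip common punctuation so "DEBT!" still counts as "debt".
    words = {w.strip(".,!?:;\"'()").lower() for w in body.split()}
    if words & SUSPECT_WORDS:
        return "check these out"
    return "inbox"
```

Under this scheme, one way to keep confirmation emails out of the review folder is to add a vendor's sending address to the address book when placing the order.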

  • by ptolemu ( 322917 ) * <pateym@NospAM.mcmaster.ca> on Monday February 23, 2004 @09:32PM (#8369015) Homepage Journal
    I think these guys are trying to put the focus on the server side of things where they emphasize greater speed and efficiency in eliminating spam from a large number of accounts as opposed to a single one. Just out of curiosity, do Thunderbird and iMail use similar filtering techniques with their junk mail controls?
  • by Dulimano ( 686806 ) on Monday February 23, 2004 @09:33PM (#8369018)
    No, imaginary humans with infinite time and dedication are 100%. But real humans are not. The percentage drops continuously as the time and dedication a person gives each message drop, so I really don't understand what this 99.84% means.
  • by canajin56 ( 660655 ) on Monday February 23, 2004 @09:33PM (#8369023)
    No, that only works if the probability of system X being wrong is independent of the particular message it is checking. (This also means that their figures are dependent on the makeup of the e-mail you are getting) Also, you couldn't really combine them usefully. If one says yes and the other says no, what do you do? You could either accept in these cases, or reject. But either way you could increase the error over just using one or the other.
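The arithmetic behind this exchange can be checked with a short Python sketch. It assumes the two reported accuracies (99.983% and 99.984%) translate to error rates of 0.00017 and 0.00016; the function names are made up for the example.

```python
E_A = 0.00017  # error rate for 99.983% accuracy
E_B = 0.00016  # error rate for 99.984% accuracy


def both_wrong_if_independent(e_a: float, e_b: float) -> float:
    """Probability both filters misclassify a message, assuming independence."""
    return e_a * e_b


def disagreement_rate(e_a: float, e_b: float, p_both_wrong: float) -> float:
    """Probability exactly one filter is wrong, i.e. the two verdicts differ."""
    return e_a + e_b - 2 * p_both_wrong


def combined_error_bounds(e_a: float, e_b: float, p_both_wrong: float):
    """(best, worst) error of a combined system.

    Best case: every disagreement is resolved correctly.
    Worst case: every disagreement is resolved incorrectly.
    """
    return p_both_wrong, e_a + e_b - p_both_wrong


independent = both_wrong_if_independent(E_A, E_B)
print(f"both wrong, if independent: {independent:.3e}")
print(f"implied accuracy: {100 * (1 - independent):.8f}%")
print(f"verdicts differ: {disagreement_rate(E_A, E_B, independent):.4%}")
print("bounds:", combined_error_bounds(E_A, E_B, independent))
```

The independence assumption is what yields the 99.99999728% figure. If the filters tend to fail on the same messages, the both-wrong probability can be as high as the smaller of the two error rates, and a fixed tiebreak rule for disagreements lands somewhere between the two bounds, possibly above either filter's own error rate.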
  • by dbarclay10 ( 70443 ) on Monday February 23, 2004 @09:37PM (#8369052)
    How can a spam filter be more accurate than humans? Humans are always the last step in spam filtering.. i use popfile and it catches 99% but it still needs me.. because im the only one capable of identifying spam 100% of the time.

    And if the study posted about is accurate, of that 1% that is left, you will (if you're a perfectly average person) accidentally delete 0.16% of good messages. Surely you've deleted a valid message by accident before? I do it regularly; deleting 25 spam messages with a single good one embedded in them, when I've just woken up and haven't had my coffee, is not a good thing ;)

    At the very least, if you were given the same data as these tests, that would be true. Consider if you *didn't* use popfile - how many spams would you be deleting every day, and how many good messages would be accidentally deleted? I know that if I had to manually delete the two or three hundred spams interspersed with good messages, my false-positive rate (the percentage of good mail I accidentally deleted) would skyrocket.

    So just be glad you've got popfile. Not only do you not have to go through as much spam, but you're also more accurate while going through the little you must.

  • by Anonymous Coward on Monday February 23, 2004 @09:38PM (#8369065)
    *Most* spammers have some very recognizable patterns, because they're classic advertising patterns. They use BIG PRINT, they offer a very limited set of popular and fraudulent products (such as free prizes and Viagra) and now use various tricks to avoid other spam filters. Normal on-line business traffic should not trigger this: if it does, you should be able to notice it and create a whitelist for that sender.

    Those classic spam patterns are detectable, but writing the detection rules as a static list is a bitch and a half. And as soon as you publish *static* rules, your rules will be circumvented.

    The Bayesian/Markovian style learning of these tools helps randomize the rules so there is no magic bullet to get past them.
  • by use_compress ( 627082 ) on Monday February 23, 2004 @09:42PM (#8369102) Journal
    I find it interesting that an algorithm that was originally for image noise reduction [ee.ubc.ca] found its way to Machine Learning through a company whose purpose is to implement noise reduction in audio. From my Googling, I think this is the first time anyone has used Bayesian Noise Reduction in Machine Learning. Does anyone know otherwise?
  • Re:wait, WTF? (Score:4, Interesting)

    by HeelToe ( 615905 ) on Monday February 23, 2004 @09:42PM (#8369109) Homepage
    6000 over what period?

    This represents 8 days worth of spam for me. Yes, ~800 per day.

    My address has been valid for 10 years. Why should I change it? Bogofilter is currently letting 2-3 per day into my inbox. I generally check for false-positives, but as the training has progressed, I am finding none anymore.

    I plan to implement a single-shot, one-try notification sender. I.e., if the mail gets classified as spam: look up the MX record for the envelope return address; if it's nonexistent, look up the A record. Make a connection and try to deliver a message indicating their message (include subject reference) was identified as spam, and include a way for them to reliably get a message through to me. If any of the SMTP exchange or address lookup fails, just forget it; they're probably not real anyway.
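The lookup order in that plan (MX first, then A, give up on any failure) could be sketched in Python as below. The resolver callables `lookup_mx` and `lookup_a` are hypothetical and supplied by the caller; the standard library has no MX lookup, so this sketch deliberately leaves the DNS part abstract and only shows the decision logic.

```python
from typing import Callable, Optional, Sequence


def notification_host(
    envelope_from: str,
    lookup_mx: Callable[[str], Sequence[str]],
    lookup_a: Callable[[str], Sequence[str]],
) -> Optional[str]:
    """Pick the host to contact for a one-shot "classified as spam" notice.

    lookup_mx and lookup_a are hypothetical resolver callables: each returns
    a possibly empty sequence of hosts, best candidate first, and may raise
    OSError when the lookup itself fails.
    """
    if "@" not in envelope_from:
        return None  # null or malformed sender: nobody to notify
    domain = envelope_from.rsplit("@", 1)[1]
    if not domain:
        return None
    try:
        mx_hosts = lookup_mx(domain)
        if mx_hosts:
            return mx_hosts[0]
        a_hosts = lookup_a(domain)  # no MX record: fall back to the A record
        return a_hosts[0] if a_hosts else None
    except OSError:
        return None  # single shot: any failure means give up, no retry
```

Since envelope senders on spam are often forged, as the comment about forged-domain bounces elsewhere in this thread points out, notices sent this way can end up at uninvolved third parties.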
  • You joke, but... (Score:2, Interesting)

    by Ancient Devices King ( 469802 ) on Monday February 23, 2004 @09:47PM (#8369156)
    I know a guy who has a Korean grad student who doesn't speak English very well. He manages to produce subject lines for the messages he sends that get him blocked by spam filters nearly all the time. Not his fault really, but it happens.
  • by kfg ( 145172 ) on Monday February 23, 2004 @09:49PM (#8369171)
    People are jackasses.

    Hence we have spam in the first place.

    KFG
  • by Anonymous Coward on Monday February 23, 2004 @09:50PM (#8369188)
    Statistical/Probabilistic filters are adaptive, and are capable of learning new characteristics of spam. This is the biggest difference between SpamAssassin (which has a set of predefined "rules") and these two filters. These filters break down each message into tokens and statistically weigh the tokens based on prior learning. If one of them makes a mistake, you can teach it. AFAIK these have been around for at least a couple years, and have only increased in accuracy over time.
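To make "break each message into tokens and weigh the tokens based on prior learning" concrete, here is a minimal Python sketch of a trainable token filter. It uses plain naive Bayes with add-one smoothing as a stand-in; it is not the algorithm used by CRM114 or DSPAM, and the class and method names are invented for the example.

```python
import math
import re
from collections import Counter


class TokenFilter:
    """Tiny trainable token classifier (naive Bayes, add-one smoothing)."""

    def __init__(self) -> None:
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text: str) -> list[str]:
        return re.findall(r"[a-z0-9$'!-]+", text.lower())

    def train(self, text: str, label: str) -> None:
        """Learn from a message the user has labelled 'spam' or 'ham'."""
        self.tokens[label].update(self.tokenize(text))
        self.messages[label] += 1

    def spam_log_odds(self, text: str) -> float:
        """Log of P(spam)/P(ham) for the message; positive leans spam."""
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"]))
        totals = {k: sum(c.values()) for k, c in self.tokens.items()}
        # Smoothed prior so an untrained class never produces log(0).
        score = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in self.tokenize(text):
            p_spam = (self.tokens["spam"][tok] + 1) / (totals["spam"] + vocab + 1)
            p_ham = (self.tokens["ham"][tok] + 1) / (totals["ham"] + vocab + 1)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text: str) -> bool:
        return self.spam_log_odds(text) > 0
```

The "if it makes a mistake, you can teach it" part corresponds to calling `train` again with the misclassified message and its correct label, which shifts the token weights for later mail.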
  • by Trejkaz ( 615352 ) on Monday February 23, 2004 @09:52PM (#8369205) Homepage

    That actually makes humans much more accurate. We can eliminate many of the messages just by looking at the subject.

    The further question is, if humans aren't as accurate as the computer, how are they measuring the accuracy at all? That is, how do they know that the 1 in 6250 messages is wrong, if a human, known to be inaccurate, was testing for accuracy?

  • by gvc ( 167165 ) on Monday February 23, 2004 @10:01PM (#8369272)
    Last week I ran a spam filter on all the email I received for the last several months. The filter came up with a dozen 'false positives' - messages that I had not flagged as spam when I manually classified them. 11 of them were clearly errors I made in my original classification. The 12th was a solicitation from the alumni association of my alma mater ....

    Before I used a spam filter, I once missed a very important message whose subject line was something to the effect of "URGENT - DON'T REBOOT THIS MORNING." That was a bad one to miss.

    Of course humans make mistakes, and it is entirely possible for an automated or semi-automated system to be more accurate than a human alone.

  • Thats a problem. (Score:3, Interesting)

    by geekoid ( 135745 ) <dadinportlandNO@SPAMyahoo.com> on Monday February 23, 2004 @10:13PM (#8369383) Homepage Journal
    If there is no universal bottom line of what Spam is, we can never manage it.

    I think 'unsolicited request for money from a for-profit organization' will fit into everybody's base definition. Some people will expand on it, but we need a defined place to start.
  • Help setting this up (Score:1, Interesting)

    by ModernGeek ( 601932 ) on Monday February 23, 2004 @10:41PM (#8369649)
    I would love to RTFM, but I want a fairly simple answer to this: how can I do a 30-minute job of integrating this into the Mozilla mail client, or does it have to be tied into the server itself? I was wondering if this is a quick, easy fix or an all-weekend type of project. While I'm on the subject of mail, what is a good all-in-one mail bundle with a web-based interface for PHP/Apache under Unix that isn't OpenGroupware or MS Exchange?
  • by Elwood P Dowd ( 16933 ) <judgmentalist@gmail.com> on Monday February 23, 2004 @11:07PM (#8369818) Journal
    Or your dad is an idiot who doesn't know how to route his email.

    But I was only contesting the great-grandparent poster, who said that humans are by definition 100% accurate.

    While my dad may be an idiot, he is also human. I am correct, great-grandparent poster is incorrect, and you are off topic. As far as I can tell, I've never deleted an email I meant to keep either. But you and I aren't the only people worth discussing.
  • by QuantumFTL ( 197300 ) * on Monday February 23, 2004 @11:12PM (#8369859)
    The further question is, if humans aren't as accurate as the computer, how are they measuring the accuracy at all? That is, how do they know that the 1 in 6250 messages is wrong, if a human, known to be inaccurate, was testing for accuracy?

    I believe that humans can be 100% accurate (or thereabouts) if they read the *ENTIRE* message, however that's exactly the point - if you have to read an entire message to tell that it's spam, the spam has succeeded.

    Their number probably concerns how people can tell without reading the entire message whether or not the message is spam. My brother accidentally deleted a few messages I had sent to him, however if he had read them fully he would have known they were legit.

    Cheers,
    Justin
  • Current Spam filters (Score:2, Interesting)

    by Anonymous Coward on Monday February 23, 2004 @11:19PM (#8369923)

    Current spam filters may be "10x" better than humans, but current spam filters may be terrible on future spam.

    Filters beating spam and spam beating filters is a continuous arms race. In the limit, optimal spam filtering is equivalent to solving NLP (natural language processing): unless you build a filter that can fully understand the text (syntax, semantics, pragmatics, world knowledge, the whole shebang), an adversary can always construct spam to defeat your filter.
  • by mabu ( 178417 ) on Tuesday February 24, 2004 @12:15AM (#8370315)
    As an ISP that has to try to do my best to provide my clients with "spam free" e-mail, I have to pass these costs onto the clients, whether they're in the form of charges for additional bandwidth or ineffective server-side filtering systems.

    When you filter e-mail at the client or server side based on content, the spammers have no idea that their efforts are truly ineffective. At least RBLs send them a message. Content-based filtering is TOTALLY, TOTALLY ineffective. Yeah, it makes the spam go away for a short period, but it adds the burden of having to deal with legitimate mail being blocked, and you still have to waste 70+% of resources you wouldn't normally need to handle legitimate e-mail. When you're not managing systems that are constantly under attack, you might not realize what a complete fucking mess it is.

    On any given day, I have at least 20-30 probes and attempts to DOS my open ports into breaking down and giving these spammers some form of access. I'm having to build new systems to handle the existing load, not because my clients need more resources, but because the spammers progressively eat up more and more system resources. E-mail IS an almost-instantaneous communication medium. BUT, because of spammers, it no longer is in many cases, especially with larger ISPs. The spammers, because the authorities won't shut them down, are screwing everything up, and content-based filtering is something they LOVE because it's completely ineffective in the long run.
  • by omeomi ( 675045 ) on Tuesday February 24, 2004 @12:41AM (#8370471) Homepage
    Dolby noise reduction works by filtering a spectrum into a bunch of bands, each of which is compressed (in an audio sense, not in a digital sense), and recorded to tape. On playback, they go through an expander...how does that concept translate to spam filtering? It can't be "dolby-type", that doesn't make any sense...
  • by Trejkaz ( 615352 ) on Tuesday February 24, 2004 @01:40AM (#8370794) Homepage
    I dunno. I'm running CRM114 now, and it's taking something like 1.5 seconds to identify emails. I am on a slow machine though, which used to do SpamAssassin at around 4 seconds, and inaccurately to boot. CRM114 is a big improvement, and if it trains well after the first fortnight I'll kiss TMDA goodbye.
  • Share the luxury (Score:5, Interesting)

    by bigberk ( 547360 ) <bigberk@users.pc9.org> on Tuesday February 24, 2004 @02:40AM (#8371106)

    Having such a powerful statistical spam filter is definitely a luxury. I have no difficulty believing the accuracy values presented here. I have had experience with spamprobe, CRM114, bogofilter, spambayes, and spamassassin and all of these do an amazing job to the point where spam no longer exists (for you).

    Which leads me to plug a little project called WPBL [pc9.org] that uses exactly these types of statistical spam filters to spot spam sources in a distributed fashion. Each project member uploads hourly the IPs they see relaying spam and non-spam, where the 'decision' is made by these extremely reliable filters. This effectively converts your regular mail account into an intelligent spam-trap that feeds a central blocklist.

    The more members we get, the better we can identify active spam sources around the world. This information is then used by some sites for quite large-scale blocking [dnsbl.net.au]. Since you're doing all this filtering processing anyway, why not also share "what you learn" (the IPs that are spamming you)?

    If this grabs your interest, read up on the reporting scripts [pc9.org] or alternatively, the open WPBL data upload protocol [pc9.org] if you want to code your own report generator. Bandwidth usage is minimal.

  • Re:Not the best idea (Score:2, Interesting)

    by warrax_666 ( 144623 ) on Tuesday February 24, 2004 @04:50AM (#8371531)
    I don't think SMTP allows for a "reject" after getting to the DATA portion of the SMTP transaction. That prevents most (effective) spam filters from working at SMTP time. If it were possible, wouldn't everybody be doing this?

    Hmm... maybe it's time to update SMTP to allow for this? (Sure, bandwidth is still being consumed, but at least legitimate senders would know that their message didn't get through because of "spamminess")
  • Overkill (Score:3, Interesting)

    by mdfst13 ( 664665 ) on Tuesday February 24, 2004 @09:22AM (#8372418)
    We don't need to trust the *person* sending the mail. It would be sufficient to trust the machine that is doing so.

    Look at http://spf.pobox.com/ which is sufficient. With SPF, you know that if you are getting SPAM saying it is from @ultraviolet.org, then it really is from @ultraviolet.org (or at least someone who ultraviolet.org trusts).

    Your solution requires a certain level of technical proficiency (setting up and managing the key) of *all* participants. SPF's solution only requires technical proficiency from those who manage DNS settings and those who manage email servers (in particular the person who manages your email server).

    Also, what about *stolen* keys? And who handles key checking? SSL certificates are restricted to a few root signers, but you don't want a central certificate authority. PGP/GPG work well because they only involve small numbers of people. In general, you know the person directly. Occasionally it will be a friend of a friend message. What do you do when the chain is 10 or a 100 or a 1000 keys long? How long will it take for you to find out that 978 has since revoked their signature for 977 (counting in steps from you, i.e. you are 0 and 1000 is the original signer of this chain)? Or how long will it take you to verify all 1000 keys if you try to do it real time (i.e. when you get the message)?
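The "trust the machine, not the person" idea behind SPF can be illustrated with a deliberately simplified Python sketch. It checks a connecting IP against only the `ip4:` mechanisms of an SPF-style TXT record; real SPF also defines mechanisms such as `a`, `mx`, and `include`, plus qualifiers, none of which are evaluated here. The record and addresses in the usage note are made-up documentation values.

```python
import ipaddress


def ip4_permitted(spf_record: str, sender_ip: str) -> bool:
    """Return True if sender_ip falls inside any ip4: range in the record.

    Simplification: only ip4 mechanisms are considered; everything else in
    the record is ignored, so this is not a complete SPF evaluation.
    """
    ip = ipaddress.ip_address(sender_ip)
    for term in spf_record.split():
        if term.startswith(("ip4:", "+ip4:")):
            network = ipaddress.ip_network(term.split(":", 1)[1], strict=False)
            if ip in network:
                return True
    return False
```

For a hypothetical record `"v=spf1 ip4:192.0.2.0/24 -all"`, a connection from 192.0.2.25 would pass this check and one from 198.51.100.7 would not. Only whoever manages the domain's DNS has to publish the record, which is the point the comment makes about who needs technical proficiency.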
  • Re:Not the best idea (Score:2, Interesting)

    by Continental Drift ( 262986 ) <slashdot@[ ]ghte ... t ['bri' in gap]> on Tuesday February 24, 2004 @10:50AM (#8373152) Homepage
    I disagree. I think that a white list with challenge auto-replies, as I use [wunderland.com], is clearly effective and adds just a little to mail traffic. I encourage others to use such a system, which would eliminate problems from having the spam reply-to being a real address. Since applying this scheme, I've gotten exactly one spam message in my inbox. That's an excellent percentage.
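A white list with challenge auto-replies can be sketched as a small state machine in Python. This is a guess at the general shape of such a system, not the setup linked in the comment; the class name, method names, and return values are all invented for illustration.

```python
class ChallengeWhitelist:
    """Hold mail from unknown senders until they answer a one-time challenge."""

    def __init__(self, trusted=()):
        self.trusted = {addr.lower() for addr in trusted}
        self.held = {}  # sender -> messages waiting for confirmation

    def receive(self, sender, message):
        """Return 'deliver', 'challenge' (first contact) or 'hold'."""
        sender = sender.lower()
        if sender in self.trusted:
            return "deliver"
        first_contact = sender not in self.held
        self.held.setdefault(sender, []).append(message)
        # Only the first message triggers an auto-reply, which keeps the
        # extra mail traffic to one challenge per unknown sender.
        return "challenge" if first_contact else "hold"

    def confirm(self, sender):
        """The sender answered the challenge: trust them, release held mail."""
        sender = sender.lower()
        self.trusted.add(sender)
        return self.held.pop(sender, [])
```

Sending at most one challenge per unknown address is a design choice in this sketch; it limits the added traffic but does nothing about challenges that go to forged addresses.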
