Two Spam Filters 10 Times As Accurate As Humans
Nuclear Elephant writes "The authors of two spam filters, CRM114 and DSPAM, announced recently that their filters have achieved accuracy rates ten times better than a human is capable of. Based on a study by Bill Yerazunis of CRM114, the average human is only 99.84% accurate. Both filters are reported to have reached accuracy levels between 99.983% and 99.984% (1 misclassification in 6250 messages) using completely different approaches (CRM114 touts Markovian, while DSPAM implements a Dolby-type noise reduction algorithm called Dobly). If you're looking for a way to rid your inbox of spam, roll on over to one of these authors' websites."
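The "ten times" claim is easier to check as error rates than as accuracy percentages. A quick sketch of the arithmetic, using only the figures quoted in the summary (the variable names are just for illustration):

```python
from fractions import Fraction

# Figures quoted in the summary above.
human_error = 1 - Fraction("0.9984")   # average human: 99.84% accurate
filter_error = Fraction(1, 6250)       # filters: 1 misclassification in 6250

# How many messages a human handles per mistake, on average.
messages_per_human_error = 1 / human_error

# How many times lower the filters' error rate is.
error_ratio = human_error / filter_error

# The filters' accuracy as a percentage.
filter_accuracy_pct = float((1 - filter_error) * 100)

print(messages_per_human_error, error_ratio, filter_accuracy_pct)
```

So the headline compares error rates: roughly one human mistake per 625 messages against one filter mistake per 6250.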
IM Spam (Score:5, Interesting)
can it be used with SA? (Score:5, Interesting)
CB
Huh? (Score:1, Interesting)
Would someone like to explain how a program (even if it's right 99.something% of the time) is more accurate than me (100%)?
Is this possible? (Score:1, Interesting)
Better (Score:5, Interesting)
Re:Huh? Aren't humans 100%? (Score:2, Interesting)
Combined accuracy? (Score:2, Interesting)
how to lie with statistics.. (Score:2, Interesting)
Who wants to bet that they only sent two 'spam' and one of them was disguised well?
Obligatory Q... When will mozilla/TB have them? (Score:5, Interesting)
server side spamassassin, and a couple of simple procmail recipes. They have kept almost all the SPAM away.
However, it is good to see such good techniques becoming available, and we can hope to see them as straightforward, usable tools.
So, when will mozilla/TB (or your favourite server side or client side filter) get them?
S
knowspam.net (Score:2, Interesting)
Re:Spamassassin (Score:3, Interesting)
SpamAssassin is a single approach. It looks at a bunch of features, then combines them linearly and compares the result against a threshold. It's a relatively simplistic method compared to these two. Not hard to see how more sophisticated methods could do better.
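For anyone wondering what "combines them linearly and compares against a threshold" looks like, here is a minimal sketch. The rule names, weights, and threshold are made up for illustration; they are not SpamAssassin's actual rules or scores.

```python
# Hypothetical rule weights (assumptions, not real SpamAssassin scores).
RULE_WEIGHTS = {
    "subject_all_caps": 1.5,
    "mentions_viagra": 3.0,
    "html_only_body": 1.0,
    "sender_in_address_book": -2.5,  # negative weight: evidence of good mail
}
THRESHOLD = 5.0  # hypothetical cutoff


def linear_score(features):
    """Sum the weights of every rule that fired on the message."""
    return sum(RULE_WEIGHTS[name] for name in features if name in RULE_WEIGHTS)


def is_spam(features, threshold=THRESHOLD):
    """Flag the message when the summed score reaches the threshold."""
    return linear_score(features) >= threshold
```

The weakness the comment points at is visible here: each rule contributes independently, so the score cannot capture combinations of features that only look spammy together.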
Here's the real test (Score:3, Interesting)
If these filters can hit 99.99% with those, I'd be quite impressed.
Re:wait, WTF? (Score:3, Interesting)
i'm sure there's spam out there that makes it seem like it's one of your friends talking to you (sending with "nick" or "john" as the sender name) and talks to you in a friendly manner about how great this product is.
i've got a few of those, but luckily all my friends have weird names.
Re:Huh? Aren't humans 100%? (Score:5, Interesting)
If you see a strange name in your inbox with an odd title, that might be a Nigerian businessman, or it might be your long lost Nigerian brother.
I recently tried to order a t-shirt from this guy for a band he used to be in. I found his band because we have the same (semi-uncommon) name. So, he got an email From: himself. I had to send him two emails because he deleted the first one assuming it was spam.
I ordered some RAM for my dad a while back. He gets 200 spam emails a day (email addy in resume & web page), and he deleted the confirmation email from the RAM vendor. The RAM never shipped, and it took us a week to figure out that there was a problem.
People make mistakes all the time. Why is this an unexpected result? People are jackasses. This should be obvious.
Could somebody explain this to me... (Score:5, Interesting)
1. display my email online as little as possible
2. use a number of addresses that all filter into one account, then filter by the sent-to address... this has turned up some VERY interesting results, for instance. I used dellorders@mydomain.com for an order from Dell, and NEVER used it or even typed it anywhere again, and started getting spam about 6 months later, and I mean the nasty stuff, not just innocent stuff from Dell resellers...
3. i built a rudimentary filter that looks for viagra,free,debt,enlarge, etc... if the sender is not in my address book, and the email contains these words, it is sent to a "check these out" folder...
How might a spam filter help me out without zapping confirmation type emails?
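The rudimentary filter in item 3 could be sketched roughly as follows. The function name, folder names, and word list handling are assumptions for illustration, not the poster's actual code.

```python
import re

# Words taken from the post; the poster's real list is longer ("etc...").
SUSPECT_WORDS = {"viagra", "free", "debt", "enlarge"}


def route_message(sender, body, address_book):
    """Return the folder a message should land in.

    address_book is a set of lowercase sender addresses (an assumption).
    """
    if sender.lower() in address_book:
        return "inbox"
    words = set(re.findall(r"[a-z]+", body.lower()))
    if words & SUSPECT_WORDS:
        return "check these out"
    return "inbox"
```

This also shows why the closing question matters: a confirmation email from an unknown vendor that mentions "free delivery" gets diverted just like real spam, because a fixed word list has no notion of how strongly a word indicates spam in your own mail.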
Operating on a different scale... (Score:2, Interesting)
Re:Huh? Aren't humans 100%? (Score:2, Interesting)
Re:Combined accuracy? (Score:3, Interesting)
Re:Huh? Aren't humans 100%? (Score:5, Interesting)
And if the study posted about is accurate, of those 1% that are left, you will (if you're a perfectly average person) accidentally delete 0.16% of good messages. Surely you've deleted a valid message by accident before? I do it regularly; deleting 25 spam messages with a single good one embedded in them, when I've just woken up and haven't had my coffee, is not a good thing ;)
At the very least, if you were given the same data as these tests, that would be true. Consider if you *didn't* use popfile - how many spams would you be deleting every day, and how many good messages would be accidentally deleted? I know that if I had to manually delete the two or three hundred spams interspersed with good messages, my false-positive rate (the percentage of good mail I accidentally deleted) would skyrocket.
So just be glad you've got popfile. Not only do you not have to go through as much spam, but you're also more accurate while going through the little you must.
Re:Could somebody explain this to me... (Score:1, Interesting)
Those classic spam patterns are detectable, but writing the detection rules as a static list is a bitch and a half. And as soon as you publish *static* rules, your rules will be circumvented.
The Bayesian/Markovian style learning of these tools helps randomize the rules so there is no magic bullet to get past them.
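To make the "learning" part concrete, here is a toy naive Bayes scorer. It is a generic textbook sketch, not the algorithm used by CRM114 or DSPAM, and the class and method names are invented for this example. The point is that the "rules" are just token counts from your own mail, so there is no published list for a spammer to route around.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word-like tokens."""
    return re.findall(r"[a-z0-9$']+", text.lower())


class TinyBayes:
    """Toy naive Bayes spam scorer with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the tokens of one message labelled 'spam' or 'ham'."""
        self.counts[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_log_odds(self, text):
        """Positive means the message looks more like spam than ham."""
        # Vocabulary size, plus one slot reserved for unseen tokens.
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) + 1
        totals = {label: sum(c.values()) for label, c in self.counts.items()}
        # Prior odds from how many messages of each kind were seen.
        odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in tokenize(text):
            p_spam = (self.counts["spam"][tok] + 1) / (totals["spam"] + vocab)
            p_ham = (self.counts["ham"][tok] + 1) / (totals["ham"] + vocab)
            odds += math.log(p_spam / p_ham)
        return odds
```

Typical use would be to call `train()` every time you correct the filter and treat a positive `spam_log_odds()` as spam; real filters add much more (better tokenizing, token selection, decay), which this sketch leaves out.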
Image Noise Reduction and Machine Learning (Score:4, Interesting)
Re:wait, WTF? (Score:4, Interesting)
This represents 8 days worth of spam for me. Yes, ~800 per day.
My address has been valid for 10 years. Why should I change it? Bogofilter is currently letting 2-3 per day into my inbox. I generally check for false-positives, but as the training has progressed, I am finding none anymore.
I plan to implement a single-shot, one-try notification sender. That is, if the mail gets classified as spam: look up the MX record for the envelope return address; if it's nonexistent, look up the A record. Make a connection and try to deliver a message indicating that their message (include subject reference) was identified as spam, and include a way for them to reliably get a message through to me. If any of the SMTP exchange or address lookup fails, just forget it; they're probably not real anyway.
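The lookup order described above (MX first, A record as fallback, give up on any failure) might look something like this. The two resolver callables are hypothetical stand-ins for whatever DNS library you use; nothing here performs real DNS or SMTP.

```python
def pick_delivery_host(domain, lookup_mx, lookup_a):
    """Choose the host to contact for a one-shot notification.

    lookup_mx(domain) -> list of (preference, hostname) tuples  (hypothetical)
    lookup_a(domain)  -> list of IP address strings             (hypothetical)
    Either callable may raise LookupError when the lookup fails.
    Returns a hostname, or None if we should just forget it.
    """
    try:
        mx_records = lookup_mx(domain)
    except LookupError:
        mx_records = []
    if mx_records:
        # Lowest preference value is the most preferred exchanger.
        return min(mx_records)[1]
    try:
        addresses = lookup_a(domain)
    except LookupError:
        return None
    return domain if addresses else None


def notification_target(return_address, lookup_mx, lookup_a):
    """Map an envelope return address to a host, or None to skip."""
    if "@" not in return_address:
        return None  # null or malformed sender: nothing to notify
    domain = return_address.rpartition("@")[2]
    return pick_delivery_host(domain, lookup_mx, lookup_a)
```

Only a single delivery attempt to the returned host would follow; a `None` result means the message is dropped silently, as the post suggests.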
You joke, but... (Score:2, Interesting)
Re:Huh? Aren't humans 100%? (Score:3, Interesting)
Hence we have spam in the first place.
KFG
Re:The true test of a spam filter... (Score:2, Interesting)
Re:Huh? Aren't humans 100%? (Score:5, Interesting)
That actually makes humans much more accurate. We can eliminate many of the messages just by looking at the subject.
The further question is, if humans aren't as accurate as the computer, how are they measuring the accuracy at all? That is, how do they know that the 1 in 6250 messages is wrong, if a human, known to be inaccurate, was testing for accuracy?
Re:Huh? Aren't humans 100%? (Score:4, Interesting)
Before I used a spam filter, I once missed a very important message whose subject line was something to the effect of "URGENT - DON'T REBOOT THIS MORNING." That was a bad one to miss.
Of course humans make mistakes, and it is entirely possible for an automated or semi-automated system to be more accurate than a human alone.
Thats a problem. (Score:3, Interesting)
I think 'unsolicited request for money from a for-profit organization' will fit into everybody's base definition. Some people will expand on it, but we need a defined place to start.
Help setting this up (Score:1, Interesting)
Re:Huh? Aren't humans 100%? (Score:3, Interesting)
But I was only contesting the great-grandparent poster, who said that humans are by definition 100% accurate.
While my dad may be an idiot, he is also human. I am correct, great-grandparent poster is incorrect, and you are off topic. As far as I can tell, I've never deleted an email I meant to keep either. But you and I aren't the only people worth discussing.
Re:Huh? Aren't humans 100%? (Score:5, Interesting)
I believe that humans can be 100% accurate (or thereabouts) if they read the *ENTIRE* message, however that's exactly the point - if you have to read an entire message to tell that it's spam, the spam has succeeded.
Their number probably concerns how people can tell without reading the entire message whether or not the message is spam. My brother accidentally deleted a few messages I had sent to him, however if he had read them fully he would have known they were legit.
Cheers,
Justin
Current Spam filters (Score:2, Interesting)
Current spam filters may be "10x" better than humans, but they may be terrible on future spam.
Filters beating spam and spam beating filters is a continuous arms race. In the limit, optimal spam filtering is equivalent to solving NLP (natural language processing). Unless you build a filter that can fully understand the text (syntax, semantics, pragmatics, world knowledge, the whole shebang), an adversary can always construct spam to defeat your filter.
Re:Let's get this straight people! (Score:3, Interesting)
When you filter e-mail at the client or server side based on content, the spammers have no idea that their efforts are truly ineffective. At least RBLs send them a message. Content-based filtering is TOTALLY, TOTALLY ineffective. Yea, it makes the spam go away for a short period, but adds the burden of having to deal with legitimate mail being blocked and you still have to waste 70+% of resources you wouldn't normally need to handle legitimate e-mail. When you're not managing systems that are constantly under attack, you might not realize what a complete fucking mess it is.
On any given day, I have at least 20-30 probes and attempts to DoS my open ports into breaking down and giving these spammers some form of access. I'm having to build new systems to handle the existing load, not because my clients need more resources, but because the spammers progressively eat up more and more system resources. E-mail IS an almost-instantaneous communication medium. BUT, because of spammers, it no longer is in many cases, especially with larger ISPs. The spammers, because the authorities won't shut them down, are screwing everything up, and content-based filtering is something they LOVE because it's completely ineffective in the long run.
Dolby-type noise reduction algorithm called Dobly? (Score:4, Interesting)
Re:Huh? Aren't humans 100%? (Score:4, Interesting)
Share the luxury (Score:5, Interesting)
Having such a powerful statistical spam filter is definitely a luxury. I have no difficulty believing the accuracy values presented here. I have had experience with spamprobe, CRM114, bogofilter, spambayes, and spamassassin and all of these do an amazing job to the point where spam no longer exists (for you).
Which leads me to plug a little project called WPBL [pc9.org] that uses exactly these types of statistical spam filters to spot spam sources in a distributed fashion. Each project member uploads hourly the IPs they see relaying spam and non-spam, where the 'decision' is made by these extremely reliable filters. This effectively converts your regular mail account into an intelligent spam-trap that feeds a central blocklist.
The more members we get, the better we can identify active spam sources around the world. This information is then used by some sites for quite large-scale blocking [dnsbl.net.au]. Since you're doing all this filtering processing anyway, why not also share "what you learn" (the IPs that are spamming you)?
If this grabs your interest, read up on the reporting scripts [pc9.org] or alternatively, the open WPBL data upload protocol [pc9.org] if you want to code your own report generator. Bandwidth usage is minimal.
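As a rough idea of what a member would collect before each hourly upload, here is a sketch that tallies filter verdicts per relaying IP. It does not follow the actual WPBL upload protocol or report format (see the linked documents for those); the function name and data shapes are assumptions.

```python
from collections import defaultdict


def tally_relays(classified):
    """Count spam and non-spam messages seen from each relaying IP.

    classified is an iterable of (relay_ip, verdict) pairs, where verdict
    is "spam" or "ham" as decided by your statistical filter.
    """
    counts = defaultdict(lambda: {"spam": 0, "ham": 0})
    for relay_ip, verdict in classified:
        counts[relay_ip][verdict] += 1
    return dict(counts)
```

Reporting non-spam counts alongside spam counts is what lets a central list avoid blocking a relay that mostly carries legitimate mail.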
Re:Not the best idea (Score:2, Interesting)
Hmm... maybe it's time to update SMTP to allow for this? (Sure, bandwidth is still being consumed, but at least legitimate senders would know that their message didn't get through because of "spamminess")
Overkill (Score:3, Interesting)
Look at http://spf.pobox.com/ which is sufficient. With SPF, you know that if you are getting SPAM saying it is from @ultraviolet.org, then it really is from @ultraviolet.org (or at least someone who ultraviolet.org trusts).
Your solution requires a certain level of technical proficiency (setting up and managing the key) of *all* participants. SPF's solution only requires technical proficiency from those who manage DNS settings and those who manage email servers (in particular the person who manages your email server).
Also, what about *stolen* keys? And who handles key checking? SSL certificates are restricted to a few root signers, but you don't want a central certificate authority. PGP/GPG work well because they only involve small numbers of people. In general, you know the person directly. Occasionally it will be a friend-of-a-friend message. What do you do when the chain is 10 or 100 or 1000 keys long? How long will it take for you to find out that 978 has since revoked their signature for 977 (counting in steps from you, i.e. you are 0 and 1000 is the original signer of this chain)? Or how long will it take you to verify all 1000 keys if you try to do it in real time (i.e. when you get the message)?
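For comparison, the SPF side of the argument comes down to a check the receiving mail server can do on its own: does the connecting IP match what the sender's domain publishes in DNS? A very reduced sketch, assuming the record text has already been fetched and handling only the `ip4:` and `all` mechanisms (real SPF has more mechanisms and lookup rules than this):

```python
import ipaddress

# SPF qualifiers and the result each one yields when its mechanism matches.
RESULTS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def spf_ip4_check(record, sender_ip):
    """Evaluate a simplified SPF record against a connecting IP address."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"  # not an SPF record at all
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        qualifier = "+"  # default qualifier when none is written
        if term[0] in RESULTS:
            qualifier, term = term[0], term[1:]
        mechanism = term.lower()
        if mechanism.startswith("ip4:"):
            network = ipaddress.ip_network(term[4:], strict=False)
            if ip in network:
                return RESULTS[qualifier]
        elif mechanism == "all":
            return RESULTS[qualifier]
        # Other mechanisms (a, mx, include, ...) are ignored in this sketch.
    return "neutral"
```

Nothing here requires end users to manage keys, which is the point of the post: only the DNS administrator publishes the record and only the mail server evaluates it.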
Re:Not the best idea (Score:2, Interesting)