Filter-foiling Gibberish Becoming A Spam Staple
hcg50a writes "Wired has a story about the random words which have recently been appearing in spam. Antispam experts agreed that this isn't a brand-new technique, but said the addition of potentially filter-foiling gibberish is rapidly becoming a common component of spam."
Gibberish (Score:1, Insightful)
Hasn't spam always been gibberish?
I don't get it, really (Score:5, Insightful)
I really don't understand this part: going after people who are taking active measures against your enterprise due to their disinterest. Why bother to market to them at all? Is the rate of return worth all the ill will, DoS attacks and legislation?
Why? (Score:3, Insightful)
Granted, this may get them past the filters, but if somebody's gone through the effort of setting up a Bayesian filter, they're not going to buy your product even if you get into their inbox. It seems like a waste of everybody's effort, and I mean including the spammers.
We already have tools to stop this (Score:3, Insightful)
It'll take more processing power, and lead to spammers following proper grammar in their pseudo-nonsense, but it's the way to raise the bar against this attack (making those spammers that can't clear the bar out of luck).
Reminds me of a Dr. Seuss book...
RD
Grammar Check and Spell Check... (Score:5, Insightful)
Sure, a few strange words might be a name that's not in the filter yet, but pure gibberish should be a red flag that either somebody's cat walked on the keyboard, or there's spam going on here. Heavy use of "non-spam" words can override to indicate it's good mail... but a poorly composed mail that doesn't use language seen in friendly mail is highly likely to be spam....
Parent post is not offtopic (steganography) (Score:5, Insightful)
Spam is a perfect carrier for steganographic data since it's broadcast to millions of people and nobody can fall under suspicion merely by receiving it. When the government wants to monitor people's communications to search for steganography but doesn't do anything about spam, the purpose of the monitoring is probably not the stated one.
As if spam wasn't a big enough waste of bandwidth (Score:3, Insightful)
--
Still looking for an email replacement...
Re:Spamkiller doesn't care (Score:5, Insightful)
What good is that when somebody spams you for Gen3r@c v|agar@?
Re:I don't get it, really (Score:5, Insightful)
Look for it. (Score:1, Insightful)
Re:gibberish... (Score:5, Insightful)
The next attempt (Score:3, Insightful)
Insert four or five lines of valid extra text -- lines from books, selections from recent USENET postings, etc, etc -- into the spam. Make the selection semi-random. Now do it 100 times and send 100 copies to each person on the mailing list.
One of them will get through. And the spammers will continue to work.
Re:I don't get it, really (Score:2, Insightful)
Re:I don't get it, really (Score:5, Insightful)
On every spam thread on Slashdot, there's someone complaining that technical measures won't solve the problem, and another saying legal measures won't solve the problem. The answer is that you need both: technical measures to assure the identity of the sender -- both spammer and sponsor -- as well as legal measures to provide for punishment.
Re:Why? (Score:3, Insightful)
* valid sender domain
* html links to external images etc, or large amounts of html in general.
* blacklisted servers/relays
Re:I don't get it, really (Score:5, Insightful)
Re:I don't get it, really (Score:3, Insightful)
Re:Should be easy to block (Score:5, Insightful)
Most of them are using random word sequences; the random strings like xdwexe are not usually an important percentage of the overall text, no more than names might be. Besides, how large a corpus of "valid" words do you want to use? The OED weighs in at almost 0.5M; and then with another 0.5M uncatalogued scientific terms and neologisms, plus common misspellings and typos and jargon and dialect orthography (like our color, meter, checker, jail etc. for the Brits' colour, metre, chequer, gaol) ...
If you don't want to keep the entire corpus of "valid" words in your code, you're going to have to make some compromises. Maybe you'll want to exclude words like "thou," "hauberk," and "coney." Not so good if you're subscribing to an Early Modern Literature listserv.
So you're going to need some logic to determine whether or not a "valid" word that occurs in a message is meaningful. Here's how one rather well known discussion [paulgraham.com] of Bayesian filtering deals with this issue (of unknown words); this is precisely the logic that spammers with random meaningful words are exploiting:
One question that arises in practice is what probability to assign to a word you've never seen, i.e. one that doesn't occur in the hash table of word probabilities. I've found, again by trial and error, that .4 is a good number to use. If you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.
So, what if all the words are valid, but the sentences aren't? Grammar checkers involve a lot more logic than spellcheckers do, and are consequently a lot less accurate. Fact is, you can also fool a grammar checker filter: just pad with random quotations from novels, etc. instead of padding with random words or random misspelled strings.
So the Bayesian approach of identifying spam and ham words is a pretty effective one, given the limitations.
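For anyone who wants to see the unknown-word rule in action, here's a rough sketch of how a filter along the lines of the quoted essay might combine per-word probabilities. The 0.4 default comes from the quote above; the function name, the 15-token cutoff and the sample probabilities are my own assumptions for illustration, not anybody's shipping code. It assumes every stored probability is strictly between 0 and 1.

```python
from math import prod

UNKNOWN_WORD_PROB = 0.4  # figure quoted above for never-seen words
MAX_INTERESTING = 15     # assumption: how many tokens get combined


def spam_score(tokens, word_probs):
    """Combine per-word spam probabilities into one message score.

    word_probs maps a lowercase word to its spam probability,
    assumed to lie strictly between 0 and 1.
    """
    # Count each distinct word once; sort first so ties break the same way.
    words = sorted({t.lower() for t in tokens})
    probs = [word_probs.get(w, UNKNOWN_WORD_PROB) for w in words]
    # Keep the tokens whose probability is farthest from neutral (0.5).
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:MAX_INTERESTING]
    if not top:
        return UNKNOWN_WORD_PROB
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)
```

With a made-up table like {"viagra": 0.99}, a message containing only "viagra" scores about 0.99, and tacking on a couple of never-seen strings pulls the score down a bit because each one contributes 0.4. How much padding it takes to matter depends on the cutoff and on how many strongly spammy words are still in the message.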
Re:Simple Solution... (Score:4, Insightful)
Proposed Solution (Score:1, Insightful)
Another input I wish Mozilla (or other Bayesian filtering systems) would include is a dictionary look-up on words, with the resulting statistics for the message fed in as an input. For instance, a message where more than 60% of the words don't match my English dictionary (so fewer than 40% do) is most likely spam in my mailbox. This additional stat would give those filters more power.
So I wonder... Would adding these things to existing Bayesian filtering systems solve this issue to some degree? My gut instinct is that it would.
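Roughly what I have in mind, as a sketch only: the function names, the tokenizing regex and the tiny word list in the example are all made up for illustration, and the 60% threshold is just the number from above. A real filter would load a proper dictionary and feed the ratio in as one more input rather than using it as a hard yes/no.

```python
import re


def unknown_word_ratio(text, dictionary):
    """Return the fraction of words in text that are not in dictionary."""
    # Assumption: a "word" is a run of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)


def looks_like_gibberish(text, dictionary, threshold=0.6):
    """Flag a message when more than threshold of its words are unknown."""
    return unknown_word_ratio(text, dictionary) > threshold
```

For example, with a dictionary of {"please", "review", "the", "attached", "report"}, the text "xdwexe qpzm the wkrrt" has three unknown words out of four, so it gets flagged at the 0.6 threshold.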
Feature added (Score:3, Insightful)
Nowadays, however, ISPs (most notably Earthlink and MSN) advertise spam blocking as a feature.
If people wanted this stuff you'd think non-filtering ISPs would advertise "You get ALL your e-mail".
But back to the original point. Spammers have used misleading topics in e-mail if only to make sure you don't delete the message. That and creating spam lists based on people who DO NOT like spam or of people who have manually opted out of spam lists.
The people who actually make money with spam don't care about selling products via spam, as they sell spam services. The people who sell stuff via spam aren't making money because they are reaching markets that are wholly uninterested in buying stuff from them.
Re:The real problem will be deliberate poisoning (Score:3, Insightful)
Besides, if they could guess what your ham looked like, then they wouldn't be spammers... they'd be advertising folks pulling in 7 figures.
Re:Spamkiller doesn't care (Score:4, Insightful)
I'm pretty sure that the big worry is about third party filtering. If I install a spam filter, that means that I don't want to see spam and am unlikely to buy something advertised therein. If my ISP installs a spam filter, it removes spam for everyone, including the idiots who might actually buy something from a spammer. Since my ISP theoretically might be using the same technology in their filter that I'm using in mine, it would still make sense for the spammer to work on defeating my filter.
Re:I don't get it, really (Score:3, Insightful)
It's possible, if not likely, that some of the spamware authors are doing it for the challenge. Some of those guys are allegedly pretty good programmers, and I suspect that many of them are essentially hackers with no sense of morals. I could easily imagine somebody like that trying to figure out how to bypass spam filters just because it was a challenge, not because he actually expected any particular rewards for it. It's like trying to break into the computers in the Pentagon; it's stupid and illegal but a big enough challenge that some people with more brains than common sense will try it anyway.
Re:The real problem will be deliberate poisoning (Score:3, Insightful)
But how would you sell more inches on your male member enhanced with V*@gra to make money fast watching celeb teenie nymphos doing it on the farm while only using ordinary non-spammy words?
There are only so many ways to get someone to click here to get all the hot action and a long boring story full of erudite euphemisms is not one of them.
It would be interesting to see if your method of disguising spam can work on a wider range of topics.
Re:why not filter out 1337 sp3@k? (Score:1, Insightful)
FragHARD
Re:I keep praying for that silver bullet (Score:3, Insightful)
What it will take is the enforcement of existing computer-cracking laws. Spammers will then have a choice between 5-10 year sentences or sending spam with no munged words, forged headers, misleading subject lines, etc.
Re:A method for removing spam from your life. (Score:3, Insightful)
Twice in this thread, I see you talking about training the Bayesian filter. You seem to think this is something of a burden, like training a big dog...
I think you misunderstand how easily one trains the current Mozilla email client's Bayesian filter.
Day 1:
1: the mail comes in, spam included.
2: one of the inbox columns is a blue 'recycle' lookin' symbol. It is a toggle that acts like the 'new' indicator column, and a click on it turns state on or off.
3: glancing through the list, one clicks on the obvious spam, on this column. If there are chunks or patterns that help, you sort them via whatever column is useful, then highlight a group, and hit a 'junk' button up in the toolbar. The messages marked as junk disappear (into a 'junk' folder), where they are automatically parsed by the Bayes filter. This is, I guess, what you'd mean by training the filter. For me, it took about 4 minutes the first day, for over 100 messages at a 90% spam ratio. No disrespect, but I doubt you could write your whole stack of filters in 4 minutes.
Day 2:
Most of the junk mail gets caught. I'd say well over 3/4ths of the spam goes away on day 2. You see it come into your inbox, and then a second later all the junk items get the little blue icon turned on, then flash away to the junk folder. A few missed items or new junky things surface.
Days 3 and on: same thing, only better. By the 4th day, my 100 messages a day had fallen back to the dozen nonspams, plus one or two bogus items. It's an automatic 'In, ZZAP! Junk!' Every few days, I glance at the junk folder as you mention, and so far in the last 4 months I've had 5 misfiled messages declared as junk. 3 of them were atypically 'spammy' messages on usually-clean lists.
Now, compared to your way, I have:
Oh, and people I could never expect to set/maintain filters can intuitively 'click' the spam away. That's my favorite advantage to my way.
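If you're curious what "training" amounts to behind the junk button, here's a rough sketch. To be clear, this is NOT Mozilla's actual code; the class and method names are invented, the 0.4 default for never-seen words is the figure quoted elsewhere in this thread, and the clamping to 0.01/0.99 is my own assumption to keep any single word from being treated as a certainty.

```python
from collections import Counter


class JunkTrainer:
    """Toy word counter updated each time a message is marked junk or not."""

    def __init__(self):
        self.spam_counts = Counter()  # messages marked junk containing a word
        self.ham_counts = Counter()   # messages marked good containing a word
        self.n_spam = 0
        self.n_ham = 0

    def mark(self, text, is_junk):
        """Record one message; each distinct word counts once per message."""
        tokens = set(text.lower().split())
        if is_junk:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def word_prob(self, word, unknown=0.4):
        """Estimate how spammy a word is from the counts seen so far."""
        word = word.lower()
        b = self.spam_counts[word]
        g = self.ham_counts[word]
        if b + g == 0 or self.n_spam == 0 or self.n_ham == 0:
            return unknown  # nothing to go on yet
        spam_rate = b / self.n_spam
        ham_rate = g / self.n_ham
        p = spam_rate / (spam_rate + ham_rate)
        return min(0.99, max(0.01, p))  # assumption: clamp away from 0 and 1
```

The point is that every click just bumps a few counters, which is why a few minutes of clicking on day 1 is enough to get going.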
Re:Spamkiller doesn't care (Score:3, Insightful)
What makes you think they have any sales (of the advertised product)? I would guess that almost all spam (maybe excluding pr0n sites) is either being sent by a MAKEMONEYFAST sucker or by a professional spammer who charges such suckers to send their spam out. The first set never make any sales, disappear and are replaced by the next moron; the latter get their money, sales or not.
But then again, Joe Sixpack and Jane Astrology aren't all that smart.
And you think Sam Slashdot is? How many pieces of dead end technology do you think you could find in the average /.er's home? `Early Adoption' is geek herbal viagra.
Re:Grammar Check and Spell Check... (Score:3, Insightful)
So yes, as far as I'm concerned, a good filter should throw that kind of message away anyway. I don't care if the l33t spelled part was "|-|3rb@1 \/1@gr@" or "Ph34r my 1337 D34thm4tch ski11z", I just don't want to receive it. They're both garbage.
That said... I can somewhat see your point.
Having once written a walkthrough for a game, I have had the dubious honour of receiving tons of mail from people who were both 1 and 2. I.e., 14 year old _and_ gamers.
Ooer. Stuff like "u sux & ur walkthru sux becuz u never sed which of teh terminal 2 klik on & y duzent ne1 make maps" were more common than I would have thought. (The above sequence was about a small level with 3 blinking terminals. You'd think someone could just try all 3 of them if it isn't clear enough.)
But... I don't think it's fair to blame it on the "gamer" part. Some people are simply retards. Plain and simple. Completely coincidental, some of them also play games. But even without the "gamer" part, they'd still be retards. And they'd still write like total analphabets.