You've Got Mail -- Tons Of It
Daniel Goldman writes "The Baltimore Sun has an article about the City of Baltimore's email problem." A snippet: "Millions of old e-mail messages are clogging Baltimore's municipal computers, so the city is going to start automatically deleting any messages older than 90 days.
A common practice in private business, the move raises questions when made by a municipality, which has a responsibility to retain certain public records." Goldman points out "Just think about all the potential law suits; 'if it's not there, they can't subpoena it.'"
Simple... (Score:5, Funny)
Re:Simple... (Score:4, Funny)
yeah, and if the budget's looking a bit bad for that year, they could always put a few of the email accounts up on ebay.
-- james
Comment removed (Score:4, Insightful)
Re:Simple... (Score:4, Insightful)
Re:Simple... (not) (Score:4, Informative)
Re:Simple... (not) (Score:3, Insightful)
Re:Simple... (not) (Score:3, Interesting)
Re:Simple... (Score:3, Insightful)
Sort all users' e-mail received by size for a given year.
Delete 5% of the largest e-mails. These will probably account for around 90% of all disk usage. They probably represent file attachments which should have been stored on a server instead of in an e-mail account anyway.
Just think, when you mail a 2MB attachment to 3,000 people in a division, that could use quite a bit of disk space.
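The "largest 5% account for most of the disk" claim is easy to check against real numbers. A minimal sketch, assuming you already have a list of message sizes in bytes; the function name and the sample mailbox below are made up for illustration, not taken from Baltimore's system:

```python
def largest_share(sizes, fraction=0.05):
    """Return the share of total bytes held by the largest `fraction` of messages."""
    if not sizes:
        return 0.0
    ordered = sorted(sizes, reverse=True)
    # Always look at at least one message, even in a tiny mailbox.
    count = max(1, int(len(ordered) * fraction))
    return sum(ordered[:count]) / sum(ordered)


# Hypothetical mailbox: 95 plain-text messages plus 5 with 2 MB attachments.
sample_sizes = [10_000] * 95 + [2_000_000] * 5
print(f"{largest_share(sample_sizes):.0%} of disk usage sits in the largest 5% of messages")
```

If the share comes out much lower than expected on a real mailbox, deleting by size will not free much space and a date-based policy is probably the better lever.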
Re:Simple... (Score:2, Funny)
Re:Simple... (Score:4, Funny)
I can beat that. A few years ago this bitch at work clogged up the mail system with a 50 mb zip file containing pictures from the corp. picnic. She sent it to every employee in the company.
Stupid bitch
Re:Simple... (Score:5, Insightful)
*shrugs*
Outsourcing garbage collection... (Score:4, Funny)
Beowulf cluster (Score:2, Insightful)
Just load them into Google or the Archive (Score:3, Funny)
And there are no privacy concerns? (Score:2)
I'd prefer that people who are familiar with the actual data being stored make the determination if it should be publicly available.
Re:And there are no privacy concerns? (Score:3, Funny)
They are public records. So, yes it should all be public.
Simple, no?
Bayesian Filter to Identify Official Mail (Score:5, Interesting)
Re:Bayesian Filter to Identify Official Mail (Score:5, Insightful)
Re:Bayesian Filter to Identify Official Mail (Score:2, Insightful)
Re:Bayesian Filter to Identify Official Mail (Score:3, Interesting)
Re:Bayesian Filter to Identify Official Mail (Score:3, Insightful)
You should probably go take a class on probability. When you're dealing with millions of email, there are going to be some false positives.
What's the alternative, hand sort them?
Yeah, that's a good idea right? But with bayesian filtering, you can do a lot of refining when you're dealing with millions of email.
And who says that you need to use the same filters for the health dept and the transport dept.
Jesus christ, there are lots of companies that
Re:Bayesian Filter to Identify Official Mail (Score:2)
Since they need to delete tons of old messages spam included, but want to save official email, why don't they train a Bayesian Filter to sort through and save as much as possible
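For anyone wondering what "train a Bayesian filter" amounts to, here is a rough sketch of a two-class naive Bayes classifier with add-one smoothing. The class name, the labels "official" and "junk", and the training sentences are all invented for illustration; a real deployment would need far more training data, proper tokenizing, and a human review step for borderline scores.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny two-class naive Bayes text classifier (illustrative only)."""

    LABELS = ("official", "junk")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, label, text):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def score(self, label, text):
        """Log-probability of `text` under `label`, using add-one smoothing."""
        counts = self.word_counts[label]
        vocab = set()
        for other in self.LABELS:
            vocab.update(self.word_counts[other])
        total_words = sum(counts.values())
        total_docs = sum(self.doc_counts.values())
        # Smoothed prior, so an untrained label does not produce log(0).
        logp = math.log((self.doc_counts[label] + 1) / (total_docs + len(self.LABELS)))
        for word in text.lower().split():
            # The extra +1 in the denominator reserves mass for unseen words.
            logp += math.log((counts[word] + 1) / (total_words + len(vocab) + 1))
        return logp

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.score(label, text))


nb = NaiveBayes()
nb.train("official", "zoning permit hearing scheduled for council review")
nb.train("official", "budget report for the health department")
nb.train("junk", "cheap pills buy now")
nb.train("junk", "win money now click here")
print(nb.classify("council budget hearing"))
```

Nothing stops you from keeping one trained instance per department, which is the point made above about the health and transport departments needing different filters.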
Mostly, corporations do not want competent customer support (unless it's a big client).
Why?
- Truly good customer support costs money and requires that the CS people know what they're talking about. That costs money.
- Corporations want script-trained "shooters" that can deflect responsibility.
- Corporations think that they
Saving to local drives? (Score:5, Informative)
The answer to the problem in the article is quotas. *nix has them, Novell has them and even Windows has them. Our email quota works as follows
Limit 1 - email user once per day marked high importance that they are getting close.
Limit 2 - disable sending and continue with (2k) warning message.
Limit 3 - disable receiving apart from one final message saying that it would all start working again when the user clears some space
When they can't send, they get a dialogue box reminding them whenever they try; when they can't receive, the sender gets a message.
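The three limits boil down to a small lookup. A sketch of the logic only, not of any particular mail server's quota feature; the names and the megabyte thresholds are placeholders:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaState:
    warn: bool         # Limit 1 reached: send the daily high-importance warning
    can_send: bool     # False once Limit 2 is reached
    can_receive: bool  # False once Limit 3 is reached


def quota_state(used_mb, limit1_mb, limit2_mb, limit3_mb):
    """Map a mailbox's current usage onto the three-limit scheme."""
    return QuotaState(
        warn=used_mb >= limit1_mb,
        can_send=used_mb < limit2_mb,
        can_receive=used_mb < limit3_mb,
    )


# Hypothetical thresholds: warn at 80 MB, block sending at 90 MB, block receiving at 100 MB.
print(quota_state(95, 80, 90, 100))
```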
This does make for support calls like...
"Why does my computer tell me that the email is full up and I can't send any more?"
"Because your email is full up. You have a message explaining this to you."
"X tried to send me an email and it bounced saying that my mailbox was full up. Why?"
"Because your mailbox is full up."
Re:Saving to local drives? (Score:4, Informative)
Must be nice.
I field 250-320 emails a week.
All replies are "reply with history", often with screenshots, as company policy and due to the complexity of the job. (3rd level insurance support w/ story problems galore)
I have a personal storage quota of 75 megs,
the mailbox I save to has a personal storage quota
of 75 megs. (personal space? about 7% and holding, corporate box? 90-100% at all times)
They cannot share, or transfer any storage quota from one user or resource to another.
They will not buy *any* new drive space.
They will not examine *any* redistribution of present drive shares. (like oh I dunno, *USAGE*)
And the first warning we get that the drive is
filling (about 3-4X a week) is the cannot write to
drive warning.
We delete somewhere in the neighborhood of 90 megs a week of potentially subpoenable documentation and there are no plans in place, or even spoken of
to correct this. (don't ask, don't tell)
Save it to a local drive? No that would violate security protocols.
Grell
Re:Bayesian Filter to Identify Official Mail (Score:5, Insightful)
Why not... (Score:5, Insightful)
Re:Why not... (Score:5, Funny)
so? buy some storage, stick them in there. (Score:4, Insightful)
If the average message is 10 KB (10,000 bytes, to make the math easier), and compresses down to 3 KB (probably even better if you compress a bunch together), then you'd need roughly 30 GB to store 10 million of them. Can you even buy hard drives that small any more?
Add some search index, throw a crappy web interface on it, and call it a day. Never delete an email again!
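The back-of-the-envelope math, spelled out so you can plug in your own numbers. The function and parameter names are made up; `index_ratio` is an extra knob for the search index, expressed as a fraction of the uncompressed text, and defaults to zero because the estimate above does not count it.

```python
def archive_size_gb(messages, avg_bytes=10_000, compression_ratio=0.3, index_ratio=0.0):
    """Estimate archive size in GB (1 GB = 10**9 bytes).

    compression_ratio: compressed size / original size.
    index_ratio: search index size as a fraction of the uncompressed text.
    """
    raw_bytes = messages * avg_bytes
    return (raw_bytes * compression_ratio + raw_bytes * index_ratio) / 1e9


# 10 million messages, 10,000 bytes each, compressed to 30% of original size.
print(archive_size_gb(10_000_000))
```

Attachments are the obvious thing that could blow this estimate up, since already-compressed files will not shrink to 30%.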
Re:so? buy some storage, stick them in there. (Score:2)
With gzip -9, you get either 20% or 10% of the original size of a text document (this is what the Internet Archive used for its archive of HTML when I was there).
An index into a text collection is on the order of 10% - 100% of the size of the original collection, depending on what features you want to offer at speed. 10-50% is a reasonable size.
So for 10M messages at 10k each, assuming the compression ratio above (which might not hold for MS Word attachments - a big caveat) you
Re:so? buy some storage, stick them in there. (Score:4, Insightful)
Re:so? buy some storage, stick them in there. (Score:2, Interesting)
something about breaking down, but is that real?
then there's dvds and magneto-optical (my personal favourite)
Re:so? buy some storage, stick them in there. (Score:2)
The discs are prone to damage when handled on a daily basis, but are much less so when recorded and stored. Of course, in ten years there will be the cost associated with the next generation of reliable st
Better than a 90-day maximum lifetime (Score:2)
Also, there's the issue of centralized vs. distributed archiving. If you're centralizing, DVDs are obviously the better choice, because you can store 6 times as much data on each, and if you're doing one mailbox at a time, you're less likely to need multiple disks. For distributed use, though, CDs may win, because
Re:so? buy some storage, stick them in there. (Score:3, Insightful)
Probably Running Exchange (Score:2)
IMHO this sounds perfectly reasonable (Score:5, Insightful)
There are always going to be things like replies to an original question and subsequent follow up questions going back and forth, so normally hanging onto the latest/final reply would be sufficient (providing it had the previous history - clearly showed the conclusion).
Now if they were to use this as an excuse to accidentally lose records that would be a different matter. This however is where auditors should be playing a role to ensure that they are keeping the right records and discarding the rubbish.
not all people quote the entire post... (Score:2)
I'm old-school when it comes to email (probably because I've been a BBS sysop who had to worry about bandwidth consumption), but you've touched on one of the two big problems with most corporate email cultures:
Re:not all people quote the entire post... (Score:2)
Re:HTML in email? Forbidden! (Score:5, Insightful)
Word attachments are acceptable when they are just a means of moving files around, and not the entire content of the email. What is not acceptable is expecting me to load a large word processor just so you can use the company letterhead. In my experience the latter type is far more common. Besides the security implications (macro viruses, etc), I do not have a GUI on the computer I read my email on. Nor should I need one.
As for HTML email, I'm simply not going to render strange IMG tags. They could lead to goatse, or back to a spammer's site, and now they know my email is active. HTML email generally looks like it was designed by an 8 year old with downs syndrome anyway. Plain text is just more readable for nearly every email. Check out HTML email is STILL evil!!! [georgedillon.com] for more.
Try 30 days (Score:2, Informative)
incremental backup (Score:5, Insightful)
What?!? What's wrong with an incremental backup? Surely all those millions of messages aren't *changing* every day?!?
Think of all the children that will suffer from this!!!
Re:incremental backup (Score:4, Insightful)
What?!? What's wrong with an incremental backup? Surely all those millions of messages aren't *changing* every day?!?
That depends on how their email system works. If it stores each user in a single file, then that file is changing every day. If they're using a file-based backup system...
Re:incremental backup (Score:3, Insightful)
With Postfix, use always_bcc to forward all outgoing mail to a user called outlog, then use procmail to save all outgoing mail to a log file.
Likewise, procmail can save all incoming mail - after the crap filte
Re:incremental backup (Score:3, Funny)
Blame On-Line Storage (Score:5, Interesting)
1. On-line storage. There's no reason to keep all of everyone's mail on-line on the server (a la IMAP or proprietary MS Exchange) instead of offline on their PC's (a la POP, most often seen with Eudora for non-techies). With offline storage, the servers don't clog, and you can keep as much mail as you like.
The biggest rap against off-line storage is that you can't control what people do with their mail or how they store it. My old job had a neat solution for this: Eudora downloaded your mail, but stored it on a file server. Each employee had 100 GB or something very large. It worked great; the SMTP/POP servers were never full, and everyone could keep their email.
2. Ridiculous stupid bullshit HTML rich-text mail crap. Can you tell I have a bias here? Aside from being annoying, HTML mail can take up to ten times the size of plain old text. Some of the HTML generated by common email programs is just terrible; filled with repeating tags for every line, and just wasting an incredible amount of space for absolutely zero benefit. (Outlook is bad, but there are others that are just as bad.)
There's no excuse for not fixing these problems. Someday someone's going to tell a court they had to delete mail for these reasons, and someone else is going to explain exactly why they're wrong. Until then, people who want to delete mail for legal reasons will hide behind false technical reasons.
Re:Blame On-Line Storage (Score:4, Informative)
Actually, storing the messages on local computers in an organization is about the worst thing to do. Most/all user computers are not backed up the way the servers are.
For legal requirements for some organizations, various backups must be maintained. Just because the active mailstore does not maintain messages older than X days in it does not mean that the data is lost forever (and thus, subpoena-able).
To do this right, first, the City needs to create a policy that establishes that active e-mail messages will not be retained in the "inboxes" more than 30 days. They should also set up mailstores for everyone in a different area on the same or different server (but NOT on user PCs; they need to define a policy against this as well, because user computers can be subpoenaed, so if a user has been retaining e-mail messages on their own computer, this could undermine the overriding policy, aka "Smoking Gun").
HTML/Rich-text e-mail messages
No argument there!
It is LEGAL to not retain e-mail messages past a reasonable amount of time as long as there is an organization-wide POLICY in place and reasonably applied over the entire organization, but the policy has to be in place first.
There is lots of information on the net about this already. I would maybe google for "email retention policy"...
Re:Blame On-Line Storage (Score:2)
Re:Blame On-Line Storage (Score:2)
Why couldn't you have a 100GB account for each employee on the mail server instead? - what is the bloody point to get mail from one network server and move it to another one?
Lots of reasons. First, mail servers just don't work very well when storing large quantities of mail for large quantities of people. I've never seen one that works well. If I'm wrong, please tell me. Second, the file server model is much more flexible: you can spread the accounts out across lots of file servers, but still have o
Re:Blame On-Line Storage (Score:3, Informative)
Check out project Cyrus [cmu.edu]. I haven't used it for large projects, but I notice it does support distributing mailboxes across multiple backend servers (The Murder stuff).
Re:Blame On-Line Storage (Score:2, Informative)
wrong approach (Score:5, Insightful)
They've taken a simple problem of old or improperly specced equipment and turned it into a manual labor solution instead. That's an insane waste of time and salary. They should just upgrade their network and storage. If I can build a 4 terabyte RAIDed PC for a few thousand dollars, they can centralize their mailserver and back it up for say a hundred thousand, even with extra redundancy and inefficiencies and admin costs.
By contrast, forcing every current employee to perform a task that would eat up weeks of time per employee per year, in a city of Baltimore's size, will cost tens of millions of dollars.
Dumb, dumb, dumb.
--Pat / zippy@cs.brandeis.edu
Re:wrong approach (Score:2)
Moving older stuff into folders (that are still on the server) would probably make more sense.
Google to the rescue! (Score:2, Redundant)
Another Solution for Baltimore (Score:2)
Temporary Fix (Score:5, Insightful)
Old e-mail - it's a resource (Score:2, Funny)
We see old e-mails as a resource to be harnessed and turned into profit. Thanks to old e-mails we can ensure that no employee leaves with a spotless record since everyone always e-mails something incriminating sooner or later from the company e-mail address.
We also find that the e-mails are great for data repositories; we fill all of our databases with text and when our clients come in, we tell them that those data warehouses
Millions of old spam, most likely. (Score:5, Interesting)
I've noticed an annoying trend lately that e-mail sent to businesses is frequently getting just ignored. Certainly it seems much more frequent this year than in the past. I've wondered if this is simply because so many e-mail boxes are getting filled up as fast as the spammers can send.
I'd suspect that the city of Baltimore wouldn't be having any problems if spam weren't such a problem. If the number of messages they had to deal with dropped by 5 to 20 times (depending on which estimates of current spam levels you believe), they could probably just leave the mail where it is.
This is all something I've been struggling with, being a small business owner doing business on the net. My company of 5 people gets between 4,000 and 20,000 borderline spams per day. By borderline, I mean that we throw away obvious viruses and things which score above a certain score in SpamAssassin (I think it's 9). So, that doesn't count the super spammy messages.
If it weren't for our fairly strict and complicated spam blocker setup, and a very powerful machine, we couldn't get the few hundred messages per day that are of interest to us. Spam is killing e-mail. I'm not sure why more people aren't treating it as an attack, but it's really hard to get anyone's interest to take some action. Canceling accounts doesn't even begin to solve the problem.
In the mean time, the City of Baltimore is suffering...
Sean
Re:Millions of old spam, most likely. (Score:2)
Hmm... that's not a very good solution when you run your own mail server (which is very reasonable for a small company), and have to invest in more server hardware, and more bandwidth just to accommodate the spam. Secondly, not all spam is "easily recognizable by the subject line". Spammers are starting to get clever, and there are a lot of messages where I suspect the message is spam, based on the subject line, but the possibility that it's not
Re:Millions of old spam, most likely. (Score:3, Insightful)
Mine's full of:
hi
how are you?
Please Complete and Return
I miss you
Fwd: I need your help
Re: Your Account
etc... etc...
Any one of these could be legitimate (occasionally you get a headline that's so innocuous I think the spam filter has got it wrong... until I actually read the email).
Dude, you'd think they would have the sense... (Score:3, Interesting)
Oh, wait, let me guess, they aren't using tape backups...
Re:Dude, you'd think they would have the sense... (Score:2)
to dump it off to tape and then just store the tapes instead of just deleting it. Though they are probably running an Exchange server so offloading data stores wouldn't be the easiest thing to do. If they were using something with a simple mbox store, they could easily just parse it through a date filter and dump the older than 90 day stuff to tape.
It's a lot tougher on backup systems to deal with mbox systems, because every time a flag is changed in a mail or a mail is added, the entire mailbox is ad
Re:Dude, you'd think they would have the sense... (Score:2)
If they were using something with a simple mbox store, they could easily just parse it through a date filter and dump the older than 90 day stuff to tape.
If the creation date is older than 90 days, off load to tape and delete original. End of story.
Sure, means they can't go through stuff older than 90 days, but if they need it, restore from tape. Geez.
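The "date filter" really is about this simple for an mbox store. A sketch using Python's standard `mailbox` module; it goes by each message's Date header rather than a file creation date (an assumption on my part, since mbox keeps everything in one file), and the function names are made up. Messages with a missing or unparseable date are left where they are.

```python
import mailbox
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def message_date(msg):
    """Return the Date header as an aware datetime, or None if it is unusable."""
    raw = msg["Date"]
    if raw is None:
        return None
    try:
        when = parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        return None
    if when is None:
        return None
    if when.tzinfo is None:
        # Treat dates without a zone as UTC so they can be compared.
        when = when.replace(tzinfo=timezone.utc)
    return when


def archive_old_mail(live_path, archive_path, days=90, now=None):
    """Move messages older than `days` from one mbox to another; return the count."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    live = mailbox.mbox(live_path)
    archive = mailbox.mbox(archive_path)
    live.lock()
    try:
        old_keys = []
        for key in live.keys():
            when = message_date(live[key])
            if when is not None and when < cutoff:
                old_keys.append(key)
        for key in old_keys:
            archive.add(live[key])
        archive.flush()  # the archive is written before anything is deleted
        for key in old_keys:
            live.remove(key)
        live.flush()
    finally:
        archive.close()
        live.close()  # also releases the lock
    return len(old_keys)
```

Point `archive_path` at a file on whatever volume gets dumped to tape, and the restore step is just reading that file back.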
Archive it (Score:2)
Then it's not clogging anything anymore, and also it's there if you ever need it.
Deleting older than 90 days common? (Score:4, Interesting)
90 days seems both unrealistic to implement and way too much reliance on
Do they understand the value of data? (Score:2)
Data is valuable, and Sysadmins know it. (It has value when combating a lawsuit, as the poster suggests, or for trend analysis, contact information, or other historical purposes.)
That said, hard drive space is inexpensive and archiving to optical medium is even LESS expensive. When 47 GB of DVD media can be had at Target for less than $10, it makes NO sense to destroy this data.
Nothing new to see, move on.... (Score:2)
Some gov't org do this on purpose (Score:2, Interesting)
We don't want someone to be able to request something from backups that the user thinks is gone.
This way it's up to the user to decide if they want their data archived. And the onus is on the user to comply with however long the data is supposed to be kept before being destroyed.
Problem with email (Score:3, Insightful)
Re:Problem with email (Score:2)
Email == official documents ? (Score:2, Insightful)
Some info on record keeping in Maryland (Score:2)
"8. Has any public records legislation/administrative regulation been proposed calling for "permanent public access" to electronic public records? _x__ Yes ___ No a. If "Yes," cite to and briefly discuss the legislation/proposed regulation; what was the outcome? Arguably, Maryland has such a provision in MD. REGS. CODE tit 14.18.04. Certain electronic records may be considered "permanent electronic records" in they
Did they even Look for offline solutions? (Score:5, Informative)
Mind you, we are only a 700 user shop. But nothing gets deleted. If it gets by the spam filter it gets saved.
This issue isn't limited to the City of Baltimore (Score:5, Interesting)
As far as I've been able to figure out, this arose from a lawsuit against the county where an e-mail retrieved from two years previous proved a county commissioner to be taking bribes in a zoning issue.
Rather than fix the corruption, just ensure that it's covered up more efficiently. Gotta love local governments.
Re:This issue isn't limited to the City of Baltimo (Score:2)
I'm surprised that there aren't any state laws that would override that local limit.
A mark or procedure for official business (Score:5, Insightful)
Better procedures and training go a long way here. These same folks have no problems with snail mail.
Re:A mark or procedure for official business (Score:2)
Require mail from the public be encrypted (Score:2)
Actually, it doesn't have to be encrypted--any hoop that you can make people jump through to mail you is fine, as long as it isn't something that spammers will be able to automate. For example, you could also use a randomly generated email address that changes frequently, and provide a website w
Re:Require mail from the public be encrypted (Score:2)
I once tried using X509 to everyone, but Outlook Express just refuses to display the message and puts up a huge warning about a corrupt email (all other mailers handled it fine - OE just doesn't support X509 correctly), so I'd just get an email back that said 'your mail was corrupted and I couldn't read it'.
PGP is worse. It isn't supported by *any* widely used mailer (installing an extra 'plugin' does not count - most of the pe
Re:Require mail from the public be encrypted (Score:2)
Why is there a problem with retention? (Score:3, Interesting)
Removing old messages isn't the best option (Score:5, Insightful)
A better option would be to archive old messages rather than remove them entirely. From the article it sounds like they are keeping ALL messages active all the time. For example:
"They say the system is so overburdened that creating a daily backup has become impossible; there is so much data that it takes more than 24 hours to copy it."
So, it seems like the solution would be to periodically lop off old messages to offline storage (tape, spare drives, whatever). In the event of a lawsuit the old messages could be reasonably recovered and the cost for such a system would be extremely minimal.
Complying with Public Records Acts (Score:4, Insightful)
In response to the poster's question about all those subpoenas: welcome to the world of civil litigation, where the first one to destroy the evidence wins!
Re:Complying with Public Records Acts (Score:3, Insightful)
Storing older emails is a rather trivial issue of collecting, compressing and copying to an inexpensive tape or hard drive which can be archived. A 250GB IDE drive is quite inexpensive and could probably archive several hundred million emails, many more than the city is claiming it will delete.
In a time when the government is fading furthe
Re:Complying with Public Records Acts (Score:2)
There is a c
Inept IT (Score:2)
I find that rather hard to believe. They only need to back up the new emails, then they can delete them at any time without actually losing them. I doubt they see many terabytes of new email every day. Nine times out of ten, any IT tech who says something is "impossible" is just lazy and/or incompetent.
Information Lifecycle Management (Score:4, Insightful)
You can't oversee growing data storage without a parallel increase in administration costs. Instead, the idea is to build automatic archiving into your storage architecture.
In practice this means you build tiers of storage/archive methods. Tier 1 is a high-ticket Shark SAN etc, Tier 2 is lower priced SATA RAID and Tier 3 is a DAS Tape Library. Build retention guidelines into the storage management platform (Tivoli etc). Older items are automatically moved to the Tier corresponding to that retention/access policy. Really old items "live" on Tape. Frequently accessed data lives on the high speed boxes near to the users/application. You snapshot updates to a DR replica offsite or burn periodic Tape sets etc. It's a good idea to team this with storage virtualization (virtual LUNs/metadata directory servers) and you can add/rotate/modify the storage tiers when necessary without any downtime.
From a user perspective, you click on the link and if applicable, get notified the item is being retrieved from media x (it's mostly transparent). Worst case - access times are in the minutes.
Of course, all this comes with a high price. Enterprise Storage systems are not cheap. Recent legislated policy (Sarbanes Oxley etc) enforces the retention of some media (e.g. email). You cannot rely on end users to enforce data retention. This lets you mandate tiers of protection and is highly configurable to support per application monitoring.
Nothing is foolproof. It's still being finessed but if you can afford it - it's truly a thing of beauty.
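Stripped of the vendor gear, the placement rule is just "how long since anyone touched this item". A toy sketch of that rule; the tier names and the 90-day and 365-day cutoffs are placeholders, not anything a real storage management product prescribes:

```python
from datetime import date

# Hypothetical retention policy: (maximum age in days, tier name), checked in order.
TIERS = [
    (90, "tier1-san"),
    (365, "tier2-sata-raid"),
]
ARCHIVE_TIER = "tier3-tape"  # everything older than the last cutoff


def tier_for(last_accessed, today):
    """Pick the storage tier for an item based on days since it was last accessed."""
    age_days = (today - last_accessed).days
    for max_age_days, tier_name in TIERS:
        if age_days <= max_age_days:
            return tier_name
    return ARCHIVE_TIER


print(tier_for(date(2003, 1, 15), date(2004, 5, 10)))
```

A real policy engine would also look at record type and legal hold status, not age alone.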
Screw the Lawyers (Score:4, Interesting)
Re:Screw the Lawyers (Score:3, Interesting)
Then one day, in a meeting with VPs, a manager tried to put me on the spot, and use me as a scapegoat with some bald-faced lies.
I'll never forget the look on his face when I produced hard copies of our email exchange...
ahh, memories.
I also got the VP to change the email policy.
Re:Great way to ignore your customers (Score:3)
Re:Great way to ignore your customers (Score:5, Insightful)
I don't know what business you work in, but if they haven't read it in 3 days, they've lost my business.
Re:Great way to ignore your customers (Score:3, Insightful)
Let me guess.... you're emigrating a lot, yes? Otherwise you might have to have "business" with the government. Good luck getting a reply in three days there.
Kjella
Cheap solution (Score:2)
Re:Great way to ignore your customers (Score:3, Insightful)
There are no laws I know of that tell me I have to pay Company X for products. If I don't want any products from Company X, I won't buy anything from them. I'm not going to be breaking any laws because of it. However, if I don't pay my taxes I'll get hounded to death with the possibility of being tossed in jail.
See the difference?
Re:This shouldn't be a problem. (Score:2)
If they do daily backups of everything, seems like that's one hell of a RAID array to store all the bureaucrats' PowerPoint bloat on.
Re:Overload probably not the only reason (Score:2)
Well for some government employees, penis enlarging Vigralisis pills ARE official business.
Re:Overload probably not the only reason (Score:2)
I'm going on my 4th year here now, and it does really look like Baltimore is turning around. Of course there are some problems, like this email thing as well as financial incompetency problems w/ the public schools. But many other problems seem to be finally coming around.
For the first time in decades the population in Baltimore is actually increasing, and many formerly bad or sketchy areas are actually quite nice now.
There's still a bunch of problem
Re:Client-side storage is not a good solution (Score:4, Insightful)
No file system will save you from multiple HDD failures; they should save old (>12 months) data to DVD burners and/or tapes or cheap SATA storage. One can buy 1TB of external SATA space for couple thousand dollars.
>One or two XServe G5s could do the trick quite well.
What do XServe boxes have to do with a generic application like email? Besides, they're more expensive than comparable Intel+Linux servers (especially considering the fact that CPU performance is unimportant for most mail servers).
Re:That's what a personal folder is for... (Score:2)
Re:That's what a personal folder is for... (Score:2, Interesting)
Any of these so-called "important government documents" shouldn't be stored in an email archive anyway. They should be on a network drive getting backed up.
My point is that a better so
No, they should use Lotus Notes/Domino... (Score:2)