Ask Slashdot: Best Practices For Collecting and Storing User Information?
New submitter isaaccs writes "I'm a mobile developer at a startup. My experience is in building user-facing applications, but in this case, a component of an app I'm building involves observing and collecting certain pieces of user information and then storing them in a web service. This is for purposes of analysis and ultimately functionality, not persistence. This would include some obvious items like names and e-mail addresses, and some less obvious items involving user behavior. We aim to be completely transparent and honest about what it is we're collecting by way of our privacy disclosure. I'm an experienced developer, and I'm aware of a handful of considerations (e.g., the need to hash personal identifiers stored remotely), but I've seen quite a few startups caught with their pants down on security/privacy of what they've collected — and I'd like to avoid it to the degree reasonably possible given we can't afford to hire an expert on the topic. I'm seeking input from the community on best-practices for data collection and the remote storage of personal (not social security numbers, but names and birthdays) information. How would you like information collected about you to be stored? If you could write your own privacy policy, what would it contain? To be clear, I'm not requesting stack or infrastructural recommendations."
Just don't do it (Score:5, Insightful)
Re:Just don't do it (Score:5, Insightful)
If you really feel the need to collect personal data and you *truly* care about the privacy concerns and needs of your customers, then don't go burying such disclosures in a privacy statement that the average user is unlikely to ever see let alone read.
If you truly care about privacy, then either require the user to *opt-in* to such sharing or prominently display the lack of such privacy on the initial splash screen.
Burying the collection of personal data in the middle of some lawyerly gobbledygook privacy statement is like mortgage lenders burying key terms in the middle of hundreds of pages of documentation. Yeah, it's legally there but no one is actually going to read or understand it.
Re: (Score:3, Insightful)
Alternately, people could simply take responsibility for themselves and choose to avoid services which require agreement to miles of terms. Given your attitude on the topic, you probably haven't even bothered to read the terms of service for anything you're using right now. It seems you're trying to divert responsibility for yourself onto the backs of the service organizations you choose to deal with. Again, note the word "choose."
You've also managed to miss the opportunity to discuss where data goes and ho
Re:Just don't do it (Score:5, Funny)
Yes, just store the data in plaintext, in a mysql database connected directly to the internet.
Bonus points if you create mysql users for each unique user and use their username/password to authenticate connections to the database.
Re: (Score:1)
Unfortunately that would mean having no internet access (good luck finding an internet provider without a big list of terms and requirements).
Re:Just don't do it (Score:5, Interesting)
>Burying the collection of personal data in the middle of some lawyerly gobbledygook privacy statement is like mortgage lenders burying key terms in the middle of hundreds of pages of documentation. Yeah, it's legally there but no one is actually going to read or understand it.
When I bought my house, I spent about 3 hours at the title company reading and signing the mountain of paperwork. I would never commit myself to 30 years of anything without knowing and understanding the details. I will say that the notary was pissed. After 30 minutes she said "Are you really going to read the entire thing?" And later "I have an appointment, you're going to make me late." My responses were "Yes, I'd be stupid not to." and "You scheduled this entire block with me, it's not my fault you double-booked yourself, you'll have to cancel your other appointment."
Re: (Score:2)
That's great and all, but what happens when you read page 200 and say "uh, I don't agree with this"?
Response: "Sorry, no house for you!"
Re: (Score:2)
That's great and all, but what happens when you read page 200 and say "uh, I don't agree with this"?
Response: "Sorry, no house for you!"
Well, if the document does not match the preview doc they sent you, or match the terms and rates that they promised you (you get that in writing before you get the contract), then they have to update the contract. There are some crooks out there that will tell you one interest rate and slip another into the docs. You really need to trust your mortgage broker. I used a friend's dad, thankfully. He was very helpful, and I knew him to be honest. He even used his commission on my loan to buy me some points
Reading it. (Score:3)
Here in Quebec, the notary actually reads the entire document to you and asks you enough questions that he is sure you've understood it.
Re: (Score:2, Insightful)
Best practice from my perspective: do not collect the data at all.
Exactly: "Put the Database down now, and step away from the Internet."
Sorry, but my interest in giving the question the benefit of the doubt about its sincerity was lost when I read the part about the unoriginal solution for ensuring honesty and transparency -- the solution being hidden in (the lawyer make-work terms of) "our privacy disclosure".
Re: (Score:3)
So, Slashdot made a mistake in allowing you to create an account?
Re: (Score:3)
Best practice from my perspective: do not collect the data at all.
More detailed:
Rule 1. Don't do it.
Rule 2. If for some reason rule 1 cannot be followed, collect the data but discard it immediately.
Rule 3. If for some reason the previous two rules cannot be obeyed, put the data on WORN storage after collection (that is, "Write Only, Read Never" media).
Re: (Score:1)
I hate to IANAL here but here goes:
In your country of origin there is legislation you will have to prove compliance with should the relevant government body find out that you are collecting user information. Personally identifiable stuff (name, address, phone number, e-mail) is considered sensitive; personally identifiable sensitive stuff (social insurance, health records, employment history, criminal records, etc., ad nauseam) comes with hefty legislation like HIPAA for each type of stuff. Again the parent h
Re: (Score:2)
Mod Up +1
No matter what you want to do with PI, you must first check that it is legal: first in your own jurisdiction (state or province), then in the country or countries where you expect your customers to reside.
It doesn't matter what good intentions you have, it might not be enough to keep you out of trouble.
For example, if your jurisdiction forbids you from keeping DOB then make sure you are clean.
Re: (Score:1)
Someone with mod points.
Because you only get the option to moderate if you (a) are logged in (you cannot moderate as Anonymous Coward), (b) have enough Karma (which basically means your posts have been moderated up often enough, and certainly more often than down), and (c) happen to have some mod points (even if your Karma is high enough, you'll only get mod points every now and then, and if you don't use them, they'll expire in a few days)
risk vs. investment tradeoffs (Score:4, Informative)
Re:risk vs. investment tradeoffs (Score:4, Insightful)
Finally, if possible, associate all data collection events with time (timestamp) and location (gps).
It started getting a little creepy there at the end, bud. ;)
Re: (Score:2)
He wants to analyze users "back ends"!!!
Don't store the data. (Score:1)
Just don't.
When you get the expertise to store the data securely then consider it.
Once you get into the habit of justifying everything that you store, you will be less prone to the whoops! plain-text password/username/real-name/credit-card table being found by intruders.
Re: (Score:2)
I didn't see the GP post as condescending; I think he is trying to make the point of how serious storing private information is.
I did cringe at his reference to putting password/username/CC in one table, even encrypted. I suggest replacing those real values with hash values and masking CC numbers, so that even if the encryption is broken a hacker would not be able to identify the person.
Doing this doesn't limit your innovation in any way. It's actually a burden we all have to deal with to avoid a legal bomb landing
Let me have a login? (Score:1)
Let me have a login for the benefit of having my data saved?
If I don't log in then don't store my details.
As for the rest, whatever. Hash + salt or whatever?
If no one can reach or use the data for anything, then maybe store just an e-mail address or something similar as an identifier.
Re: (Score:2)
Well, if you are looking for developer/legal opinions there are better forums, but if you want legal, developer and user opinions (and a discussion based on them), Slashdot is not bad. Besides, you don't really know that the OP has not also posted in a better developer/legal-oriented forum. I also find it strange that you mention that you wouldn't post on Slashdot but fail to mention the forum that is appropriate for this question (unless you yourself were just trolling).
Re: (Score:1)
I don't know the appropriate forum, as I am not an experienced web developer, nor would I expect any serious answer from Slashdot when I do need one. I develop electronics; I don't post on Slashdot asking which FET has the best ESD damage resistance, nor would I expect anything but random opinion from it.
When you're serious, you get the data from people who have been down that road and test it yourself; you don't post to some news recycler and hope for the best.
Re:I'm an experienced developer (Score:4, Insightful)
I'd give him the benefit of the doubt, and assume this isn't the only place he's looking for best practices.
Meanwhile, "I'm an experienced developer, I'm familiar with all the general rules for securing customer data, but I'd like to hear of any 'gotchas' that you know about"? That seems like a reasonable thing to ask.
Again, assuming this isn't the one-and-only source. So instead of grabbing our pitchforks, maybe someone has some examples of what he asked about?
Re: (Score:1)
There's the blatantly obvious stuff: keep the data heavily encrypted on a back-end d/b or file store, on a server nowhere near a public-facing interface (or DMZ); obfuscate and/or consolidate the individual, personal data as soon as you gather it, assuming you don't need specific per user info to be retained. Needless to say, keep all your OS/software/services/apps/etc patched with latest security on a weekly, if not daily basis, FFS!
Also, invite some wannabe hack-meisters you can kind-of-trust to try &
Re: (Score:2)
Isn't your first bit of advice right there a classic gotcha?
Encryption doesn't mean anything unless the access routes to that encrypted data are well defined and understood, since at some point it has to be decrypted to be used. So who's doing the decrypting, who holds the keys, etc.?
Re: (Score:2, Interesting)
Agreed. People mistake this for a technical forum.
Don't (Score:4, Informative)
Honestly... try not to store it.
You need to examine why you actually need the data, and if you can't think of a good reason (except it might be valuable in the future), then don't store it.
If you do need it for analysis, machine learning apps, etc, try to anonymize it as early as possible, and not to keep raw data longer than you need it. (say raw data for 3 months, then just store aggregate info).
Also, for behavior, you don't need years of information. Studies have shown people change, so make sure the things people did recently count for more and the old stuff gradually decays.
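A minimal sketch of the two ideas above (a fixed raw-data window, and older behavior counting for less), assuming events arrive as (timestamp, action) pairs. The 90-day window comes from the "3 months" suggestion; the 30-day half-life and all function names are illustrative assumptions, not anything the poster specified.

```python
from datetime import datetime, timedelta

RAW_RETENTION = timedelta(days=90)   # "raw data for 3 months" from the post
HALF_LIFE_DAYS = 30.0                # assumed: an event loses half its weight every 30 days


def decay_weight(age_days: float, half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Exponential decay: 1.0 for a brand-new event, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def drop_expired(events, now):
    """Keep only (timestamp, action) pairs still inside the raw-retention window."""
    return [(ts, action) for ts, action in events if now - ts <= RAW_RETENTION]


def weighted_counts(events, now):
    """Sum decayed weights per action so that recent behavior dominates."""
    totals = {}
    for ts, action in events:
        age_days = (now - ts).total_seconds() / 86400
        totals[action] = totals.get(action, 0.0) + decay_weight(age_days)
    return totals
```

Running drop_expired before weighted_counts means anything older than the window never reaches the analysis at all; the decay only shapes what is left.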
Re: (Score:2)
As a counterpoint, Don't process all that data.
At my company, we store everything. Every click, every bit of data, nightly snapshots of all data, etc. Forever. This results in stupid amounts of data about our users and we pretty much don't bother to try to correlate the data, we just provide it upon request of the customer.
Why try to correlate it, when our customers are eager to pay us to do other things with it? Just because you have the data, doesn't mean you have to be devious with it. Save everything re
Re: (Score:2)
If your purpose is really just for "analysis and ultimately functionality, not persistence" then there is really no reason to keep an email or a name. Just assign a unique identifier, and then you're done.
So if for some reason, the user wants to get in touch with you to file a bug report, or what not, then assign a unique identifier for the device to the bug report (in case you get other bug reports coming from the same source), but don't ask for his/her contact information unless the user ticks a box askin
Start reading about PII (Score:3, Informative)
Wikipedia (http://en.wikipedia.org/wiki/Personally_identifiable_information) is a good start.
Break the association (Score:5, Insightful)
If at all possible, stay away from personally identifiable data. If your aim is to use identity as an index, work out a way in which you can translate an identity into an index or hash value (i.e. one way). This is not going to be perfect (there will be about a million "John Smith"s out there), but if you have a consistent pair such as name and phone number, turn that into a hash and use it as the data index.
That means you can still do correlations, but a leak will not result in exposure of personal data.
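A rough sketch of that name-plus-phone index, with one deliberate substitution: it uses a keyed hash (HMAC-SHA256) rather than a bare hash, on the assumption that names and phone numbers are guessable enough for a plain hash to be brute-forced. PSEUDONYM_KEY and pseudonym() are hypothetical names, and where the key is kept is left open.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret; in practice it would be loaded from somewhere
# outside the data store rather than generated at import time.
PSEUDONYM_KEY = secrets.token_bytes(32)


def pseudonym(name: str, phone: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Map a (name, phone) pair to a stable, one-way index value."""
    # Normalise so "John Smith " and "john smith" produce the same index.
    norm_name = " ".join(name.lower().split())
    norm_phone = "".join(ch for ch in phone if ch.isdigit())
    message = f"{norm_name}\x00{norm_phone}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()
```

The same person always maps to the same index, so correlations still work, but the stored value does not contain the name or the number.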
However, first of all, look at what you're holding on personal data and simply assume you got hacked and it's "out there" - plan for that crisis first because there is one question you need to answer:
If you cannot afford to pay for security advice, can you afford to pay for the inevitable consequences?
Re: (Score:1)
If you cannot afford to pay for security advice, can you afford to pay for the inevitable consequences?
put another way: if you can't afford to do it right, how will you afford to do it again?
Re: (Score:2)
Or keep personally identifiable information separate from everything else. Ensure that you cannot get to one data set from the other and vice versa. Use login information as a hash into the identity database and the behaviors database. If you must store any time stamps on database records, make sure you do so in a way that prevents using them to easily correlate the two data sets (e.g. update the time stamp on the personal info record only when the user changes his/her password, address, or whatever, ra
Re: (Score:2)
If your aim is to use identity as an index, work out a way in which you can translate an identity into an index or hash value (i.e. one way). This is not going to be perfect (there will be about a million "John Smith"s out there), but if you have a consistent pair such as name and phone number, turn that into a hash and use it as the data index.
Bad idea when you get a hash collision. Account numbers do not have to be seen by the user, but there aren't (m)any useful ways of avoiding their use internally.
If OP is storing data for analysis and not for immediate reuse, there are some often overlooked but stupidly easy things to do like making sure that the user-facing machines collecting the data only have append/insert access to the data (no read, no modify). Analysing the data would be done from another machine/subnet/database account whatever.
Re: (Score:3)
He said he had little money available, so I figured I gave him something that was easy vs. perfect. The key question is whether the delta introduced by the odd hash collision is actually significant in the volume of data he is planning to process. If it isn't, I would not try to develop perfection - he can use his little funding better elsewhere.
In other words, in theory you're absolutely right, in practice I suspect there is little difference. But my favourite way of avoiding issues with personal data is si
Re: (Score:1)
Great idea for some cases. If you need "telemetry" data to understand how people are using your application, assign each session a unique ID and don't store which user did it. It also works for some other statistical data. The argument against is that you may need the correlation between sessions later.
Depending on the application, you could have a hierarchical system of databases where the lowest level contains session information, the next contains persistent user information but not personally identifiab
Re: (Score:1)
Regarding my second paragraph, an important part was not obvious: Each session in the session database has a unique ID, and each anonymised user in the middle database has a list of sessions, and each user in the top database points to an anonymised user.
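The three-level layout described here could be sketched roughly as follows. In-memory dicts stand in for what would be three separate databases, and every name (identities, anon_users, sessions, register, and so on) is made up for illustration.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class AnonUser:
    # Middle tier: persistent, but holds nothing personally identifiable.
    session_ids: list = field(default_factory=list)


identities = {}   # top tier: e-mail -> anonymised user ID (the only PII link)
anon_users = {}   # middle tier: anonymised user ID -> AnonUser
sessions = {}     # bottom tier: session ID -> list of recorded actions


def register(email: str) -> str:
    """Create the top-tier pointer and the anonymised user it points to."""
    anon_id = secrets.token_hex(16)
    identities[email] = anon_id
    anon_users[anon_id] = AnonUser()
    return anon_id


def start_session(anon_id: str) -> str:
    """Open a session with its own random ID and list it under the anonymised user."""
    session_id = secrets.token_hex(16)
    sessions[session_id] = []
    anon_users[anon_id].session_ids.append(session_id)
    return session_id


def forget_identity(email: str) -> None:
    """Drop the top-tier row; the behavior data stays but no longer points at a person."""
    identities.pop(email, None)
```

In this sketch the only path from a person to their sessions runs through the top tier, so that is the tier to lock down hardest or to delete first.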
Collect as little as possible, throw it away... (Score:5, Interesting)
I have been toying with a site idea. Your account name is your public key fingerprint. Your public nickname is whatever you use in the message. Your login is validated because everything you send is signed with the key that matches the fingerprint (and encrypted with my public key for transmission). Input to the user form is constrained and validated within those constraints (to prevent padding attacks).
I would then have a database "key x","paid through date y".
Sure, I couldn't sell any farmed data à la Facebook, but subpoena requests would be a breeze... "here's your hex dump..."
P.S. (Score:2)
Return email will be sent, if necessary, to whatever address(es) are registered in the public key database for that fingerprint, encrypted with that key.
Obviously I have no control over your passphrase and can do nothing to help you "recover your password" or whatever. Please see your GPG or PGP documentation for a better explanation.
Your account will not be "renewed" past the key expiration date.
Re: (Score:2)
I have been toying with a site idea. Your account name is your public key fingerprint. Your public nickname is whatever you use in the message. Your login is validated because everything you send is signed with the key that matches the fingerprint (and encrypted with my public key for transmission). Input to the user form is constrained and validated within those constraints (to prevent padding attacks).
I would then have a database "key x","paid through date y".
Sure, I couldn't sell any farmed data à la Facebook, but subpoena requests would be a breeze... "here's your hex dump..."
If you accept payments, wouldn't those keys still be linked to contact information and/or payment transactions?
Payment Receipts (Score:2)
Not for any longer than necessary. Likely I would make that opt-in.
I would have a payment history (bob paid x dollars for y time) as an atomic event. Bob could check a box to say "remember this for me", or not at the time of payment.
At the time of payment I would also send Bob a receipt. That receipt would say "Bob paid for a service". The receipt would also contain a dot-splash (e.g. a QR code or a linear or 2D barcode, depending on how much info space I turn out to need) that was the "proper join record for the da
The little niceties (Score:2)
There would be other little niceties.
Aggressive use of POST instead of GET messages on all forms, so that pen-trap requirements, if levied, would be largely moot, as in: user XXXXXX did POST to "/" at this site on these dates and times. [POST data is not legal to collect in pen traps in the USA, as I understand the law.]
Services a site could sell? POST the URL you want as part of the encrypted blob you send to this site; we will retrieve it, scrub it and send its content back to you encrypted with your key.
Pa
Give me control and earn my trust (Score:4, Insightful)
The short requirements:
1) Explain what you're collecting in real-time at the moment when you give me the option whether or not to permit you to collect it. Tell me what you will use it for, when you will delete it and the consequences if I don't give it to you. People don't read privacy disclosures. Give notice and ask permission at the moment of proposed collection. Make it opt-in, not opt-out.
2) Only request the information required to perform the service I've requested. Use the information I provide only to provide the service I've requested. Only share the information I provide with third parties to the limited extent necessary to provide the services I've requested. Obtain contractual commitments from those third parties that cause them to protect my information and delete it as soon as they've done what's required to provide the service I've requested. Keep information only as long as necessary to provide the service I've requested and delete it after you've done what's required to provide the service I've requested.
3) Protect my information. Encrypt in transit and at rest. Delete thoroughly and don't give in to the urge to collect and keep information just because it might be useful some time in the future. You can't lose what you don't have.
You say the collection "... is for purposes of analysis and ultimately functionality, not persistence." That seems inconsistent with the collection of name and email address. I can't think of too many use cases where you're collecting my name and email address and don't plan to keep it (and use it for marketing or otherwise share it in some way). If you need to contact me or I need to create a user-id that is my email address, you don't need my name.
Your privacy policy is your contract with your user. It is an operational document that must be consistent with your practices. The privacy policy should be consistent with your policies and procedures. If the information you collect, or the way you handle it changes, you must change your privacy policy.
Support OpenID (Score:2)
You can't afford it, by your own admission. (Score:4, Insightful)
If you can't afford the expert then you can't afford to collect such data. Move away from this project to something you have the ability to do.
Re: (Score:3)
If you can't afford the expert then you can't afford to collect such data. Move away from this project to something you have the ability to do.
I'm surprised it took this long for someone to say that. The people who will exploit your system and extract something valuable from it can afford those experts.
OWASP (Score:5, Informative)
OWASP has guidance; for instance, here: https://www.owasp.org/index.php/IOS_Developer_Cheat_Sheet#Insecure_Data_Storage_.28M1.29 [owasp.org]
From https://www.owasp.org/images/5/5e/Mobile_Security_-_Android_and_iOS_-_OWASP_NY_-_Final.pdf [owasp.org]
2. Insecure data storage
Solution
Avoid local storage inside the device for sensitive information
If local storage is “required” encrypt data securely and then store
Use the Crypto APIs provided by Apple and Google
Avoid writing custom crypto code – prone to vulnerability
Re: (Score:1)
Avoid local storage inside the device for sensitive information
That does make sense, but it still feels like I've fallen into opposite land.
Avoid writing custom crypto code – prone to vulnerability
Yes! I'll repeat it a couple of times
Avoid writing custom crypto code – prone to vulnerability
Avoid writing custom crypto code – prone to vulnerability
Book of best practices (Score:5, Insightful)
In the US, we have the National Electrical Code [wikipedia.org] which explains in clear detail how house wiring is constructed.
Following the code is a legal requirement in many (most?) states, but from the point of view of an electrician it's a "book of best practices". Use this gauge wire for this current, staple the wire within 6" of the box, and so on. The code gets revised and added to over time as questions crop up, new technologies get added and people get more experience.
There's a reason for everything. For example, the light in a bathroom should be on a separate breaker from the outlet next to the sink. It makes sense in retrospect, but this is not something that is obvious beforehand.
It's very detailed, but also very clear. Homeowners routinely understand the instructions and are able to make simple repairs and modifications to their home wiring which conform to the code.
We throw a lot of "best practices" around here as if they were simple and obvious at the outset, but maybe they're not. Hash your passwords, salt the hash, sanitize the form inputs, don't keep CC info... lots of best practices which in hindsight make sense but which aren't necessarily obvious beforehand.
Most web apps have common requirements for login, identity management, privacy, various forms of functionality, and so on.
Should we have a "book of best practices"?
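As one example of what an entry in such a book might look like, here is a small sketch of "hash your passwords, salt the hash" using only Python's standard library. PBKDF2-HMAC-SHA256 is one reasonable choice among several; the iteration count and the function names are assumptions, not a recommendation from the thread.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumed work factor; tune to your own hardware


def hash_password(password: str, iterations: int = ITERATIONS):
    """Return (salt, derived_key), using a fresh random salt for every password."""
    salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, key


def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = ITERATIONS) -> bool:
    """Re-derive the key with the stored salt and compare in constant time."""
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(key, expected)
```

Only the salt and the derived key are stored; the password itself never is. The per-user salt means two users with the same password end up with different stored values.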
Re: (Score:2)
I suspect that the big problem with that analogy is that data collection (unlike electrical wiring) is a substantially adversarial field.
There is a certain amount of tension, (fast, cheap, good, pick any two, and the usual buyer/seller desire to not leave money on the table); but the buyer and the seller both share roughly the same ideal, though they may deviate from it out of laziness, cheapness, or incompetence.
With data collection, the purely security/architectural aspects are somewhat similar; but there
Aggregate Data (Score:2)
Aggregate the data as quickly as possible to anonymize it.
Collect "Mary did X, Y but not Z", but aggregate it to "three people did X, two Y and twelve Z" and drop Mary from the data. You don't need to know Mary did anything.
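The Mary example can be sketched in a few lines, assuming the raw events are (user, action) pairs; the aggregate() name is made up for illustration.

```python
from collections import Counter


def aggregate(events):
    """Count each action across (user, action) pairs, discarding the user."""
    return Counter(action for _user, action in events)
```

Once the totals are stored, the raw pairs can be thrown away. Very small counts can still point at an individual, so this is a starting point rather than a guarantee of anonymity.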
Re: (Score:2)
Also there are a lot of laws around the world regarding things like this which can and cannot be tracked *at all* that no amount of legal disclosure will make lawful in some places. Seriously, just avoid any form of identifying data (pre
What is it for? (Score:1)
You say you aren't interested in persistence, so I don't see any reason why the data needs to be personally identifiable. Whether your index is John Smith in Albany, NY or User #71829382 doesn't matter for usage analytics. Even demographic information can at least be stripped of things like name and phone number.
If you REALLY need to tie this information to a particular instance, then use a hardware key from the mobile device and not a user's information. A hacked phone is easier to deal with than identity t
Also consider TLDR-TOS (Score:3)
On a need to know basis only (Score:2)
My car insurance company needs to be able to pull my DMV records, perhaps even periodically. They could retain *none* of that information and ask me to visit a web site periodically where the info gets entered so they can do the query (and then forget the information required to perform the query). Most customers wouldn't mind them holding that information; but if I'm *that* security minded and they make it clear to me that I'll have to hit their site once a month to maintain my insurance... well... There a
Use (Score:2)
Read "Translucent Databases" by Peter Wayner (Score:2)
Collecting Personally Identifiable Information (Score:3)
On passwords, I liked Jeff Atwood's article, `You're Probably Storing Passwords Incorrectly' [codinghorror.com].
For Personally Identifiable Information (PII) [wikipedia.org], I liked Brian Danger Graham's article, `What's in a name database?' [blogspot.com].
Policies, Procedures, Standards, Trust all Useless (Score:2)
If your company goes bankrupt, or is sold to another, all its assets become the property of someone else. That someone cannot be constrained to respect anything you have promised. You may not even have the opportunity to wipe disks or change passwords.
For example, a hospital failed to pay the rent on a warehouse storing patient records. The landlord seized and sold those records as scrap. None of the hospital's patient privacy obligations transfer to the landlord, or to the scrap dealer.
Heed th
Keep it on the user's computer, not in the cloud (Score:1)
Google Mobile Analytics (Score:2)
Although you state you're not looking for stack or infrastructure recommendations, I'd still recommend having a look at Google Mobile Analytics [google.com]. They have an SDK for Android and iOS that makes it very easy to integrate in your apps.
It Is a Matter of How to Encrypt (Score:2)
Best practice is to encrypt each record with a unique key. This key could be generated by some unique identifiers per user like Visible User ID (maybe E-mail address) and Password and Hidden User ID (different from
Don't. (Score:2)
Analyze data on a nightly basis. Store the results. Scrub database after results are stored. The asshole MBA that your startup hires because it isn't making enough money then has nothing to turn around and sell for a quick buck.
If you have to store *anything at all*, hire the expert. Can't hire the expert? Your startup is inadequately funded.
Some advice (Score:2)
Disclaimer: I work in the field, but do not have nearly enough information on your particular situation, jurisdiction, etc to provide detailed recommendations. What follows is basic best practice stuff based on my jurisdiction and market sector.
* First, any sensitive information you are collecting, ask if you really REALLY REALLY need it. This stuff is toxic waste. Your first and best defense is not to store it if you don't need it.
* A hash of something like an SSN, telephone number, etc. is worthless in t