angryrice tips news that Adobe seems to be campaigning for the inclusion of Flash and PDF in the Obama administration's efforts at increasing government transparency and openness. A post from the Sunlight Labs blog is critical of Adobe's undertaking, in part since PDF is often "non-parsable by software, unfindable by search engines, and unreliable if text is extracted." They also say government's priority should be to publish datasets and the APIs to interact with them, rather than choosing how they're displayed in fancy graphs and charts.
"non-parsable by software, unfindable by search engines, and unreliable if text is extracted."
I don't believe this is true - I find PDF documents in search results all the time. The consistency and reliability of PDF for forms creation has no real competition. If you hate Adobe, ok, but don't hate PDF 'cause it's beautiful...
I have no problem with PDFs; there are a number of free and commercial applications out there that can work with them.
Flash on the other hand is absolutely an abomination that must be wiped from the net. They still haven't released a proper version for *BSD and they commonly don't bother with less popular OSes. If they want it to be used for this sort of purpose then they need to get their act together and make it available for all operating environments on an equal basis. Which I don't think they have the resources to do.
PDF remains difficult to manage. Like MS Word documents, an incredible amount of resources is wasted on display information rather than actual text or graphical content. Unlike MS Word documents, PDFs are parseable, but unfortunately like MS Word, the commercial vendor-sold document creation tool (Adobe Acrobat) generates unstable and unreliable content that interacts very badly with other tools. Oddly, ghostscript-created PDF remains very stable and legible, and tools like "PDFCreator", which uses ghostscript, create long-term viable PDF printouts of other document formats. I use it for complex MS Word documents that cannot be handled by other software, even different versions of MS Word.
Adobe can actually do better with this, and I hope that they will in the future. But it's not stable enough to be reliably indexed or viewable even 5 years in the future, much less 10 or 20 or 100 such as may be needed for legal or historical documents.
Flash, you're quite right. Unless they open up the source, it has no business as yet another document format.
but unfortunately like MS Word, the commercial vendor-sold document creation tool (Adobe Acrobat) generates unstable and unreliable content that interacts very badly with other tools
Can you be more specific as to what problems you have had using files from Acrobat in other tools?
Printing documents created in other language versions of Acrobat. In particular, Adobe Acrobat for German created documents that were not only unviewable in a normal Acrobat viewer, but when used to "print PDF" for MS Word documents, created documents that actually crashed Windows computers. Acrobat for Hebrew didn't crash Windows with its printed documents, but they were filled with layout errors when rendered even by Acrobat Reader, errors that didn't show up in the Adobe Acrobat tool. Much of this may have been fixed with the latest release, but I'm not spending, nor suggesting that my peers overseas spend, all the money needed to upgrade.
Getting our colleagues to stop using Acrobat and use _anything else_ to generate their documents, and use PDFCreator to print them as PDF, stabilized the situation enough for us to generate the documents we needed. It didn't provide PDF forms for people to fill out, which was its only flaw.
Just recently I had to look at, and print a few pages from, a PDF document. Knowing where it came from, a corporation that is only very slowly dipping a toe in the water of software other than the big names, I'm sure it was done with Adobe.
Now I don't even have the Adobe Acrobat reader on my system; when I try to install it, the install crashes. But Fedora comes with several other PDF readers, and the default is set to "Evince", which works fine MOST of the time.
But I got this PDF, and one page was a picture of a tax form, and when I tried to print it, the tax form came out as a big black blob - man, does that waste ink! Obviously I killed the print job to try something else. (Just VIEWING this tax form was fine, only printing messed up.)
I remembered using "Xpdf" a while ago, so I tried that, and voila, the tax form printed perfectly. Since I knew there were more tax forms in there, I used Xpdf for the rest of the job.
So here is a case where two different PDF viewers reacted differently to the same PDF file. I think what we need is an OPEN DEFINITION for PDF files, probably a subset of Adobe's definition, that any OSS viewer can follow and get the proper results - and ask the user what to do with files that don't follow it.
And tell Adobe they can either follow the open definition, or stuff it where the sun don't shine!
I think what we need is an OPEN DEFINITION for PDF files, probably a subset of Adobe's definition, that any OSS viewer can follow and get the proper results - and ask the user what to do with files that don't follow it.
There is such; Adobe publishes it and makes it freely available on its web site. It's possible your file didn't follow it, but it's more likely your reader wasn't 100% compliant; it's a very complicated specification.
Unlike micros~1 word documents, there are freely available specifications and a reasonable number of quite reasonable third party implementations that can either display or generate PDF, or even both. That is to say, you can very well ``do PDF'' without ever using adobe software. Part of its success is that it's a dumbed-down version of PostScript, also open and arguably the right way to talk to printers. That's a whole sight better than micros~1's ooxml abomination, that once standardized turned out to hav
I agree that open government docs should stay away from Flash. I don't agree that Flash is an abomination because Adobe does not bother with less popular OSes. Why should they implement Flash on less popular OSes? That costs Adobe real money and then only a handful of users would benefit. If you were in charge of the engineering budget at Adobe, would you spend $ on a feature for Mac and Windows that 100 million people would use or would you use that same $ to port Flash to a less popular OS with 10,000
Adobe ships Flash/PDF readers/plugins to: Windows, OS X, Symbian (in some form), Linux, *BSD and various, uncountable tiny platforms. iPhone/iPod does not count because of obvious reasons.
Let's see what MS Silverlight ships to: Windows/Intel Mac. The damn thing is so tied to Windows that they couldn't even convert/ship V2 for PPC Macs, or they simply abandoned them. (like we cared!)
MS's XPS format and viewer is the answer to PDF, which some people who don't use Windows have never, ever heard of. It is that Win
Because Flash is now a crucial part of the internet. Until HTML 5 comes out with video standards and the like, Flash is about the only way you can embed videos in sites without ruining the layout of the site with a third-party media player and without your users searching for codecs.
If Adobe would simply release the source to the Flash player, they could -save- money, have full platform compatibility and perhaps make more money with the Flash creation products. Think of it this way, if there was a fast
A PDF file produced by the LiveCycle suite is actually an XML document with a thin PDF wrapper. The XML conforms to the XFA standard which is owned by Adobe but is a published standard (http://partners.adobe.com/public/developer/en/xml/xfa_spec_2_4.pdf).
PDFs are only searchable if the document contains text. Half the time PDFs contain text-as-image, which is about as useful to a search engine as a captcha image. Google doesn't run OCR on PDFs, AFAIK. Although, come to think of it, that sounds like something they'd get sued by a random company for doing for "violating copyright proprietary information".
I work with PDFs a lot, especially on OS X. I am telling you this from an OS on which you can have 60 KB 1080p screenshots in PDF in some circumstances: Whoever did that "text as image" trick, he is a complete moron.
One of the reasons PDF took off is precisely that it embeds the fonts used in a document, so the document appears pixel-perfect on client machines.
As a last resort (and a good practice), you can embed unformatted pure text of the entire PDF in your PDF file. PDF, like Quicktime Mov is one of the formats where peopl
Whoever did that "text as image" trick, he is a complete moron.
Generally, text-as-image PDFs are the result of people scanning in paper documents without having access to, or without using, OCR programs to convert the raw image coming in from the scanner into text.
You're missing the point. PDFs do not store text. Text is a stream of characters. PDFs store glyphs and their locations. It is more or less possible to convert glyphs into characters, although things like ligatures and the fact that spaces are not really represented make this difficult. In the metadata, some PDFs also store the text of the document, allowing it to be extracted. Given that the PDF is created automatically from the text in most cases, the text is more useful. You can create the PDF fro
It depends on how the PDF was created. If the PDF had the source text embedded in the metadata then it will work fine. Now try it with a PDF that's generated by printing to PostScript and then distilling to PDF (as a lot of PDFs are). It won't work.
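To make the glyphs-versus-characters point concrete, here is a minimal Python sketch. It is not how any real extractor works: the glyph records, the ligature table, and the gap threshold are all invented for illustration, and real tools have to work from the font metrics and encodings stored in the file.

```python
# Hypothetical positioned glyphs, roughly as a page description places them:
# (glyph name, x position, advance width). Note there is no "space" glyph;
# the gap between words is just empty room on the page.
GLYPHS = [
    ("o", 0.0, 5.0), ("f", 5.0, 3.0), ("fi", 8.0, 5.5),
    ("c", 13.5, 4.5), ("e", 18.0, 4.5),
    ("f", 27.0, 3.0), ("o", 30.0, 5.0), ("r", 35.0, 3.5), ("m", 38.5, 8.0),
]

# Assumed mapping from ligature glyphs back to the characters they stand for.
LIGATURES = {"fi": "fi", "fl": "fl", "ffi": "ffi"}

# Assumed threshold: a horizontal gap wider than this is guessed to be a space.
SPACE_GAP = 2.0


def glyphs_to_text(glyphs, space_gap=SPACE_GAP):
    """Guess a character stream from positioned glyphs on a single line."""
    out = []
    prev_end = None
    for name, x, width in glyphs:
        # A space is inferred from geometry, never read from the data.
        if prev_end is not None and x - prev_end > space_gap:
            out.append(" ")
        # Expand ligature glyphs into their component characters.
        out.append(LIGATURES.get(name, name))
        prev_end = x + width
    return "".join(out)


print(glyphs_to_text(GLYPHS))
print(glyphs_to_text(GLYPHS, space_gap=10.0))
```

With the default threshold the words come out separated; with a threshold that is too large they run together. That kind of guesswork is why text extracted this way can be unreliable.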
The summary does not do a good job of reflecting the original blog post's point. The point was that the government should make data available in a machine-parseable and generic format. PDF is a great format for storing typeset pages, but it is a terrible format for publishing data. It's easy to generate beautiful PDFs from well-structured data but it's much harder to go the other way. Would you rather have budget figures (for example) as a CSV file in a well-defined format or as a PDF of tables and graphs? If the data is available in the former format, it's easy for you or a third party to produce the latter format. If it's only available in the PDF form then it's much harder to create the CSV.
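As a rough illustration of the easy direction, a few lines of Python turn CSV into an aligned table for human readers. This is only a sketch: the agency names and figures are invented sample data, and `render_table` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
import csv
import io

# Invented sample data; the column names and values are placeholders.
SAMPLE_CSV = """agency,year,budget
AgencyA,2008,100
AgencyA,2009,120
AgencyB,2009,250
"""


def render_table(csv_text):
    """Render CSV text as a fixed-width table with a header rule."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for n, row in enumerate(rows):
        cells = (cell.ljust(w) for cell, w in zip(row, widths))
        lines.append("  ".join(cells).rstrip())
        if n == 0:
            # Underline the header row.
            lines.append("  ".join("-" * w for w in widths))
    return "\n".join(lines)


print(render_table(SAMPLE_CSV))
```

Going the other way, from a rendered page back to rows and columns, means guessing where the cell boundaries were, which is the hard direction the post is complaining about.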
If the goal is to make the data available, then even CSV would be a better option than PDF. PDF, while pretty, is a terminal format and is the digital equivalent of a mayfly. It's paper that hasn't happened yet and when it does it will exist for a few short hours before finding its way to the circular file.
Much of the government's data consists of tables and tables of data. Gzipped CSV would be readable by anyone, and so would ODF. Adobe appears to be looking for a handout at the expense of creating a usef
CSV is kinda evil (see my post above), but it's better for tabular data than JSON or XML. Again, a tabular serialization format such as Avro, Thrift, or Protocol Buffers might well be far better than CSV for tabular data. JSON has quite a bit of format bloat, and would need some standardized way to explain the data's schema for further analysis. XML is the king of format bloat, but at least has standard schema representations. XML is far better for semi-structured or unstructured data than tables.
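The "format bloat" claim is easy to check on a toy table. The sketch below serializes the same rows as CSV, JSON, and XML using only the Python standard library and compares the sizes. The rows are invented, the XML layout (one element per field) is just one assumed convention, and the exact ratios will vary with real data; binary formats such as Avro or Protocol Buffers need third-party libraries and are left out.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Invented sample rows; names and numbers are placeholders.
ROWS = [
    {"agency": "AgencyA", "year": 2008, "budget": 100},
    {"agency": "AgencyA", "year": 2009, "budget": 120},
    {"agency": "AgencyB", "year": 2009, "budget": 250},
]


def to_csv(rows):
    """Column names appear once, in the header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Column names are repeated as keys in every row."""
    return json.dumps(rows)


def to_xml(rows):
    """Column names are repeated twice per field: opening and closing tag."""
    root = ET.Element("rows")
    for row in rows:
        el = ET.SubElement(root, "row")
        for key, value in row.items():
            ET.SubElement(el, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


sizes = {
    "csv": len(to_csv(ROWS)),
    "json": len(to_json(ROWS)),
    "xml": len(to_xml(ROWS)),
}
print(sizes)
```

On this sample CSV comes out smallest and XML largest, mostly because the field names are repeated per row in JSON and twice per field in XML. Compression narrows the gap considerably, so this says more about readability and parsing cost than about bandwidth.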
Many implementations of PDF converters merely print a document to images and then embed the images into a PDF. Those are non-searchable, and no text can be extracted from them with the existing tools. I once created a documentation website which relied on these embedded-image types of PDF documents. I had to implement an OCR solution in order to extract the text to make my client's documentation searchable. It was ugly and a real pain in the ass.
Certainly, PDF can be beautiful, but it is often not implemented that
Nobody likes Flash, and they probably shouldn't use it for anything. But there's not much wrong with PDF, if it's done right. When publishing something, one could offer "source" (some sane, machine-readable format) and PDF (autogenerated from the source, and prettified for easier reading).
PDF shouldn't be used as a way to encapsulate scanned JPEGs and pretend they're a real electronic document.
I would also note that many of the complaints about PDF as a format in TFA are really complaints about Adobe's abysmal PDF reading software. For example, the concern about the visually impaired: KDE's Okular does speech synthesis and has a high-contrast mode.
They also say government's priority should be to publish datasets and the APIs to interact with them, rather than choosing how they're displayed in fancy graphs and charts.
I felt a great disturbance in the Force, as if millions of IT workers suddenly cried out in terror, and were suddenly silenced.
GP is right. Government should focus on doing what government is needed for, such as determining standards for formats that everyone can use, with input from academia and industry. For example, a human-readable, parsable format that one could embed in a web page for semantic metadata. Or funding open source software to make it easy (cross-platform) to input such data (I am thinking of information about cited papers or books). Typeset information is nice, but we are already drowning in information - how many pages of Google results do you usually look at? And we need help before generating 10 times as much.
Why PDF is bad:
- It is a portable typeset document package. Not a data sharing package that could be pulled apart easily with tools automatically.
- PDF is extremely hard to parse, and using current free software does not always give good results.
- You destroy useful document structure, or in the case of ASCII text parsability and small size, when you convert to PDF. You can't just convert back to the original.
- It takes significant processing power and commercial software to display well and reliably, as far as I can see. Having just gotten the latest Mac I feel like I'm in a dauntless battleship, but I have had much trouble with different Unix tools in the past.
- Scientists publish PDF too but then also use other formats for data. For example on arxiv, one scientist recently published animations inside a zip but it was hard to find the link
- It is difficult to manage bibliographic information automatically.
- It is proprietary
- It requires a huge amount of data, and arcane knowledge, just to build a parser that works most of the time (such as for Asian languages especially).
by Anonymous Coward
on Saturday October 31, @09:07AM (#29934227)
- It is proprietary
FAIL.
PDF is an ISO standard. See: ISO 32000-1, Document management – Portable document format – Part 1: PDF 1.7
This doesn't change the fact that it is a portable typesetting document format though. It's good for read only documents from your word processor but it shouldn't be (ab)used to store tables or graphs or whatever other crap people use it for.
As for Flash, let's not even go there. Flash is passable as a streaming video container, for animated cartoons like Homestar Runner, or as a platform for small web games, but outside those use cases, you're using it wrong.
- Scientists publish PDF too but then also use other formats for data. For example on arxiv, one scientist recently published animations inside a zip but it was hard to find the link
Err... also? I've never seen a scientist using PDF to publish data. We use PDF (and PS and DVI) to publish typeset papers. The actual data is in a lot of formats, dependent on the field and application. I've seen CSV, MATLAB's .mat, XML, JPEG, TIFF, proprietary crap, etc.
PDF is often "non-parsable by software, unfindable by search engines, and unreliable if text is extracted."
Have these people not heard of Google? Just because YOU can't write software to parse PDF files doesn't mean that nobody else can and that it doesn't already exist.
If you look around, every single Apple computer and device (iPod/iPhone) is actively and instantly indexing every single PDF thrown at it, and keeps a database of them.
It is the famous "Spotlight" technology. They don't even need to look at Google; some of them have the same kind of indexing technology (minus relation) running on their laptops.
One should check TFA's relations with MS. I am sure something will come up.
If you are publishing a document that can be printed then PDF is a good format. If you expect people to extract data from the document then you should look for a different format. It depends on the purpose of posting the document on the web.
I am OK with PDF. I would RATHER see documents in plain HTML, but there are times when formatting is important. In those cases, if it is to be read/print-only, PDF is the way to go. Otherwise, the gov should use ODF.
But Flash? Are you kidding? The last thing on earth we need is more Flash.
* Does not work on all devices
* Slow and/or consumes tons of CPU
* Consumes tons of RAM
* Consumes more bandwidth
* Makes it difficult or impossible to cut and paste
* Impossible to "search/find"
* Violates the native UI look and feel
* Fonts and font sizes are uncontrollable by the end user
* Can't scroll correctly much of the time
* Almost completely proprietary
* Rarely adjusts to screen size
* Often introduces extremely irritating animation
* Doesn't allow text to be "seen" by the browser (or OS), making other plugins (like a screen reader) 100% useless
At least that SilverDark stuff isn't even on the radar- thank God for little favors.
Most of what you say is implementation-related rather than format-related. It's like saying that C sucks because there are so many crappy programs. I know about feeding the trolls, but for all those who don't know better, here we go:
Nothing "just works" on all devices, and in this area Flash fares better than most other technologies. I agree it is slow; I don't really agree on RAM usage.
Flash uses less bandwidth than alternatives; it's quite well optimized. Sure, someone can stuff some 10 min. mp3s encoded at 25
but specially html5+js+canvas+svg+ogg vorbis/theora for rich web content.
Who has announced authoring tools for this stack that are anywhere near as capable as even Flash 3, let alone Flash CS4? Say I want to make an animated SVG like the Flash animations I see on Newgrounds. What package should I start with?
Yeah, and you can hex edit an SWF file too. But change a letter, refresh, change a letter, refresh, is not the kind of editing that graphic designers prefer to do. If that's what SVG has to offer, the market will choose SWF. I can only hope your comment was sarcasm.
On top of that, [HTML 5 video] requires the browser to implement basic navigation controls; producers are going to want to keep their own in-house player controls.
That's still doable. JavaScript running in an HTML 5 page can disable the browser's built-in controls in a <video> element and control the video itself.
What are you talking about? The PDF specification has been available as a free download from Adobe with no royalties payable by implementors since PDF was first created. More recently, the PDF/X family of specifications was approved by ISO. These define subsets of the PDF 1.4 specification for different uses (see ISO 15930). There are at least three open source PDF readers that I know of as well as several commercial viewers (Adobe Reader, FoxIt, Apple's Preview, and so on) and numerous tools can generate PDFs.
Yes, and then they SUED Microsoft for putting PDF support in Office. It's only "open" as long as you're not big enough to compete with Acrobat. If you even get within a mile of stepping on Adobe's business, you're sued up the wazzoo.
It's either an open standard, meaning anybody can use it-- ANY BODY-- or it's not. There's no such classification as "it's an open standard, except we don't let companies we don't like use it because they have a big marketshare, but other than that it's an open standard believe me!"
By your argument, Microsoft should also be prevented from parsing HTML files in IE because they're a monopoly. Does that make sense? No. Does your argument make sense? No.
PostScript is also a free specification, but NeXT was using the Display PostScript implementation licensed from Adobe. They switched to something closer to PDF because, it turned out, no one actually cared about the nicer features in PS. With DPS, you could write view objects entirely in PostScript and have them run on the display server. This was quite slow and had all sorts of problems in that the PS programs could (potentially) run forever. Most people just used the drawing subset of PS, which is als
PDF/A is already open. However, that doesn't mean that anyone knows how to produce it, especially some R.O.A.D. staffer or random hourly GS1.
Open or not, PDF/A is a display format and, in most cases, useless for information retrieval or automated data processing. PDF/A is a useful alternative to paper [digitalpreservation.gov]. However, the open government initiative is not talking about paper. It's about 'born digital [wired.com]', machine readable data.
Re: (Score:3, Interesting)
PDFs are only searchable if the document contains text. Half the time PDFs contain text-as-image, which is about as useful to a search engine as a captcha image. Google doesn't run OCR on PDFs, AFAIK. Although, come to think of it, that sounds like something they'd get sued by a random company for doing for "violating copyright proprietary information".
What do you want? (Score:2)
Perhaps you know of a document format where the text in images IS searchable?
Re: (Score:3, Interesting)
A document format shouldn't store text as an image. That's why it's called text.
Re: (Score:3, Insightful)
That is not really a format issue, though; in any format that supports images I can insert an image containing text.
Re: (Score:2)
Can I hate all the multimedia/hyperlink/scripting/vulnerabilities they've added to PDF?
I'll back this so long as it's PDF light - text and graphics only (OK, maybe I'll allow hyperlinks...).
Nobody likes flash (Score:5, Insightful)
PDF shouldn't be used as a way to encapsulate scanned JPEGs and pretend they're a real electronic document.
I would also note that many of the complaints about PDF as a format in TFA are really complaints about Adobe's abysmal PDF reading software. For example, the concern about the visually impaired: KDE's Okular does speech synthesis and has a high-contrast mode.
Re: (Score:3, Insightful)
But there's not much wrong with PDF, if it's done right.
I'm sure they won't fuck this up, after all it is the US government.
Tremor (Score:3, Funny)
They also say government's priority should be to publish datasets and the APIs to interact with them, rather than choosing how they're displayed in fancy graphs and charts.
I felt a great disturbance in the Force, as if millions of IT workers suddenly cried out in terror, and were suddenly silenced.
PDF bad. Work on microformats please. (Score:4, Interesting)
GP is right. Government should focus on the things government is actually needed for, such as setting standards for formats that everyone can use, with input from academia and industry. For example, a human-readable, parsable format that one could embed in a web page for semantic metadata. Or funding open source software to make it easy (cross-platform) to input such data (I am thinking of information about cited papers or books). Typeset information is nice, but we are already drowning in information: how many pages of Google results do you usually look at? And we need help before generating 10 times as much.
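As a sketch of what embedded, parsable citation metadata could look like, here is a small Python example using only the standard library. The class names `citation`, `author`, `title`, and `year` are invented for this example and are not a published microformat; the cited paper is a placeholder too.

```python
from html.parser import HTMLParser

# Hypothetical markup: the class names "citation", "title", "author" and "year"
# are invented for this sketch and are not a published microformat.
PAGE = """
<p>See <span class="citation">
  <span class="author">A. Author</span>,
  <span class="title">An Example Paper</span>
  (<span class="year">2009</span>)
</span> for details.</p>
"""

FIELDS = {"title", "author", "year"}

class CitationParser(HTMLParser):
    """Collect text from spans whose class is one of FIELDS."""

    def __init__(self):
        super().__init__()
        self.citations = []   # one dict per citation found
        self._field = None    # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls == "citation":
            self.citations.append({})
        elif cls in FIELDS and self.citations:
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self.citations[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        # Any closing tag ends the current field; fields here hold plain text only.
        self._field = None

parser = CitationParser()
parser.feed(PAGE)
parser.close()
print(parser.citations)
```

A person reading the page sees an ordinary sentence with a reference in it, while a tool gets the author, title, and year as separate fields without any guessing about layout.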
Why PDF is bad:
- It is a potable typeset document package. Not a data sharing package that could be pulled apart easily with tools automatically.
- PDF is extremely hard to parse, and using current free software does not always give good results.
- You destroy useful document structure (or, in the case of ASCII text, parsability and small size) when you convert to PDF. You can't just convert back to the original.
- It takes significant processing power and commercial software to display well and reliably, as far as I can see. Having just gotten the latest Mac I feel like I'm in a dauntless battleship, but I have had a lot of trouble with different Unix tools in the past.
- Scientists publish PDF too, but then also use other formats for data. For example, on arXiv one scientist recently published animations inside a zip, but it was hard to find the link
- It is difficult to manage bibliographic information automatically.
- It is proprietary
- It requires a huge amount of data, and arcane knowledge, just to build a parser that works most of the time (especially for Asian languages).
Re:PDF bad. Work on microformats please. (Score:5, Informative)
FAIL.
PDF is an ISO standard. See: ISO 32000-1, Document management – Portable document format – Part 1: PDF 1.7
This doesn't change the fact that it is a portable typesetting document format though. It's good for read only documents from your word processor but it shouldn't be (ab)used to store tables or graphs or whatever other crap people use it for.
---
As for Flash, let's not even go there. Flash is passable as a streaming video container, for animated cartoons like Homestar Runner, or as a platform for small web games, but outside those use cases, you're using it wrong.
Re: (Score:2)
- Scientists publish PDF too, but then also use other formats for data. For example, on arXiv one scientist recently published animations inside a zip, but it was hard to find the link
Err... also? I've never seen a scientist using PDF to publish data. We use PDF (and PS and DVI) to publish typeset papers. The actual data is in a lot of formats, depending on the field and application. I've seen CSV, MATLAB's .mat, XML, JPEG, TIFF, proprietary crap, etc.
Re:PDF bad. Work on microformats please. (Score:5, Funny)
So you can drink a PDF?!
WTF? (Score:2)
Have these people not heard of Google? Just because YOU can't write software to parse PDF files doesn't mean that nobody else can and that it doesn't already exist.
Forget Google, every single Apple device does it (Score:2)
If you look around, every single Apple computer and device (iPod/iPhone) actively indexes every single PDF thrown at it, instantly, and keeps a database of it.
It is the famous "Spotlight" technology. They don't even need to look at Google; some of them have the same kind of indexing technology (minus relation) running on their laptops.
One should check TFA's relations with MS. I am sure something will come up.
Depends on the purpose (Score:2)
PDF Yes, Flash No (Score:5, Insightful)
I am OK with PDF. I would RATHER see documents in plain HTML, but there are times when formatting is important. In those cases, if it is to be read/print-only, PDF is the way to go. Otherwise, the gov should use ODF.
But Flash? Are you kidding? The last thing on earth we need is more Flash.
* Does not work on all devices
* Slow and/or consumes tons of CPU
* Consumes tons of RAM
* Consumes more bandwidth
* Makes it difficult or impossible to cut and paste
* Impossible to "search/find"
* Violates the native UI look and feel
* Fonts and font sizes are uncontrollable by the end user
* Can't scroll correctly much of the time
* Almost completely proprietary
* Rarely adjusts to screen size
* Often introduces extremely irritating animation.
* Doesn't allow text to be "seen" by the browser (or OS), making other plugins (like a screen reader) 100% useless
At least that SilverDark stuff isn't even on the radar- thank God for little favors.
Re: (Score:3, Insightful)
Most of what you say is implementation-related rather than format-related. It's like saying that C sucks because there are so many crappy programs. I know about feeding the trolls, but for all those who don't know better, here we go:
Nothing "just works" on all devices, and in this area Flash fares better than most other technologies; I agree it is slow; I don't really agree on RAM usage.
Flash uses less bandwidth than the alternatives; it's quite well optimized. Sure, someone can stuff some 10 min. mp3s encoded at 25
Re: (Score:3, Insightful)
Those are very lame reasons. We are talking about an open government initiative here, not a "standard for web games" initiative. Flash is:
Not portable: Many platforms lack proper support. Flash can't be legally redistributed, and the alternatives are poor. It is not an open format in any way.
Bad for accessibility.
Not a web standard or anything close to it.
Re: (Score:3, Interesting)
Re:The future is ODF and html5 (Score:4, Insightful)
but specially html5+js+canvas+svg+ogg vorbis/theora for rich web content.
Who has announced authoring tools for this stack that are anywhere near as capable as even Flash 3, let alone Flash CS4? Say I want to make an animated SVG like the Flash animations I see on Newgrounds. What package should I start with?
Re:The future is ODF and html5 (Score:4, Funny)
Re: (Score:3, Interesting)
Re: (Score:3, Insightful)
In order to read a document, what I really need to replace the heavyweight Adobe Reader is a bloated modern browser!
Re: (Score:2)
On top of that, [HTML 5 video] requires the browser to implement basic navigation controls; producers are going to want to keep their own in-house player controls.
That's still doable. JavaScript running in an HTML 5 page can disable the browser's built-in controls in a <video> element and control the video itself.
Re:Tell Adobe to open-license PDF (Score:5, Informative)
Re: (Score:3, Insightful)
Yes, and then they SUED Microsoft for putting PDF support in Office. It's only "open" as long as you're not big enough to compete with Acrobat. If you even get within a mile of stepping on Adobe's business, you're sued up the wazzoo.
"Free and open" my ass.
Re: (Score:3, Insightful)
Bullshit.
It's either an open standard, meaning anybody can use it-- ANY BODY-- or it's not. There's no such classification as "it's an open standard, except we don't let companies we don't like use it because they have a big marketshare, but other than that it's an open standard believe me!"
By your argument, Microsoft should also be prevented from parsing HTML files in IE because they're a monopoly. Does that make sense? No. Does your argument make sense? No.
Re: (Score:3, Interesting)
Digital Stewardship : PDF vs PDF/A (Score:3, Insightful)
PDF/A is already open. However, that doesn't mean that anyone knows how to produce it, especially some R.O.A.D. staffer or random hourly GS1.
Open or not, PDF/A is a display format and, in most cases, useless for information retrieval or automated data processing. PDF/A is a useful alternative to paper [digitalpreservation.gov]. However, the open government initiative is not talking about paper. It's about 'born digital [wired.com]', machine readable data.