Adobe Pushing For Flash and PDF In Open Government Initiative
angryrice tips news that Adobe seems to be campaigning for the inclusion of Flash and PDF in the Obama administration's efforts at increasing government transparency and openness. A post from the Sunlight Labs blog is critical of Adobe's undertaking, in part since PDF is often "non-parsable by software, unfindable by search engines, and unreliable if text is extracted." They also say government's priority should be to publish datasets and the APIs to interact with them, rather than choosing how they're displayed in fancy graphs and charts.
Re:don't hate PDF 'cause it's beautiful (Score:3, Interesting)
PDFs are only searchable if the document contains text. Half the time PDFs contain text-as-image, which is about as useful to a search engine as a captcha image. Google doesn't run OCR on PDFs, AFAIK. Although, come to think of it, that sounds like something they'd get sued by a random company for doing for "violating copyright proprietary information".
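To illustrate the text-versus-image distinction above, here is a rough sketch in Python that guesses whether a PDF has any real text in it. It is a heuristic, not a parser: it assumes streams are either uncompressed or Flate (zlib) compressed, and it only looks for the PDF text-painting operators `Tj` and `TJ`. The function name `looks_like_it_has_text` is made up for this example, and encrypted files or other stream filters will fool it.

```python
import re
import zlib

# Body of each PDF stream object, i.e. everything between "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Tj / TJ paint text; their operand ends in ")" (string), ">" (hex string) or "]" (array).
TEXT_OP_RE = re.compile(rb"[\)\]>]\s*T[jJ]\b")


def _inflate(data: bytes) -> bytes:
    """Try to Flate-decode a stream body; return b"" if it is not zlib data."""
    try:
        # decompressobj tolerates the trailing newline left before "endstream".
        return zlib.decompressobj().decompress(data)
    except zlib.error:
        return b""


def looks_like_it_has_text(pdf_bytes: bytes) -> bool:
    """Heuristic (assumption-laden): True if any stream contains a text operator."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        if TEXT_OP_RE.search(raw) or TEXT_OP_RE.search(_inflate(raw)):
            return True
    return False
```

A scanned page usually shows up as a stream that only places an image (something like `/Im0 Do`) with no `Tj`/`TJ` at all, which is why a search engine has nothing to index without OCR.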
Re:What do you want? (Score:3, Interesting)
A document format shouldn't store text as an image. That's why it's called text.
PDF bad. Work on microformats please. (Score:4, Interesting)
GP is right. Government should focus on the things government is actually needed for, such as setting standards for formats that everyone can use, with input from academia and industry. For example, a human-readable, parsable format that one could embed in a web page for semantic metadata. Or funding open source software to make it easy (cross-platform) to input such data (I am thinking of information about cited papers or books). Typeset information is nice, but we are already drowning in information: how many pages of Google results do you usually look at? And we need help before generating 10 times as much.
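As a sketch of what "a human-readable, parsable format embedded in a web page" could look like for citation data, here is a small Python example. The class names (`citation`, `cite-author`, `cite-title`, `cite-year`) are invented for illustration and are not taken from any published microformat, and the citation itself is a placeholder.

```python
from html.parser import HTMLParser

# Hypothetical markup: the class names below are assumptions, not a real standard.
SAMPLE_HTML = """
<p class="citation">
  <span class="cite-author">A. Author</span>,
  <span class="cite-title">An Example Paper</span>
  (<span class="cite-year">2009</span>).
</p>
"""


class CitationParser(HTMLParser):
    """Collects one dict per element marked class="citation"."""

    FIELDS = {"cite-author": "author", "cite-title": "title", "cite-year": "year"}

    def __init__(self):
        super().__init__()
        self.citations = []
        self._current = None  # dict for the citation being read
        self._field = None    # field name the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "citation" in classes:
            self._current = {}
            self.citations.append(self._current)
        for cls in classes:
            if cls in self.FIELDS and self._current is not None:
                self._field = self.FIELDS[cls]

    def handle_data(self, data):
        if self._field and self._current is not None:
            self._current[self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        self._field = None  # an empty field element captures nothing


parser = CitationParser()
parser.feed(SAMPLE_HTML)
print(parser.citations)
```

The same page still reads normally in a browser, but a tool can pull the bibliographic fields out without guessing at layout.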
Why PDF is bad:
- It is a portable typeset document package, not a data sharing package that could be pulled apart easily and automatically with tools.
- PDF is extremely hard to parse, and using current free software does not always give good results.
- You destroy useful document structure (or, in the case of ASCII text, parsability and small size) when you convert to PDF. You can't just convert back to the original.
- It takes significant processing power and commercial software to display well and reliably, as far as I can see. Having just gotten the latest Mac I feel like I'm in a dauntless battleship, but I have had a lot of trouble with different Unix tools in the past.
- Scientists publish PDFs too, but then also use other formats for data. For example, on arXiv one scientist recently published animations inside a zip, but it was hard to find the link.
- It is difficult to manage bibliographic information automatically.
- It is proprietary.
- It requires a huge amount of data, and arcane knowledge, just to build a parser that works most of the time (such as for Asian languages especially).
Re:don't hate PDF 'cause it's beautiful (Score:4, Interesting)
Printing documents created in other language versions of Acrobat. In particular, Adobe Acrobat for German created documents that were not only unviewable in a normal Acrobat viewer, but when used to "print PDF" for MS Word documents, created documents that actually crashed Windows computers. Acrobat for Hebrew didn't crash Windows with the printed documents, but its output was filled with layout errors when rendered even by Acrobat Reader, errors that didn't show up in the Adobe Acrobat tool. Much of this may have been fixed with the latest release, but I'm not spending, nor suggesting that my peers overseas spend, all the money needed to upgrade.
Getting our colleagues to stop using Acrobat and use _anything else_ to generate their documents, and use PDFCreator to print them as PDF, stabilized the situation enough for us to generate the documents we needed. It didn't provide PDF forms for people to fill out, which was its only flaw.
Re:don't hate PDF 'cause it's beautiful (Score:4, Interesting)
Just recently I had to look at, and print a few pages from, a PDF document. Knowing where it came from, a corporation that is only very slowly dipping a toe in the water of software other than the big names, I'm sure it was done with Adobe.
Now, I don't even have the Adobe Acrobat reader on my system; when I try to install it, the install crashes. But Fedora comes with several other PDF readers, and the default is set to "Evince", which works fine MOST of the time.
But I got this PDF, and one page was a picture of a tax form, and when I tried to print it, the tax form came out as a big black blob - man, does that waste ink! Obviously I killed the print job to try something else. (Just VIEWING this tax form was fine, only printing messed up.)
I remembered using "Xpdf" a while ago, so I tried that, and voila, the tax form printed perfectly. Since I knew there were more tax forms in there, I used Xpdf for the rest of the job.
So here is a case where two different PDF viewers reacted differently to the same PDF file. I think what we need is an OPEN DEFINITION for PDF files, probably a subset of Adobe's definition, that any OSS viewer can follow and get the proper results - and ask the user what to do with files that don't follow it.
And tell Adobe they can either follow the open definition, or stuff it where the sun don't shine!
PDF and Flash are massively multiplatform (Score:3, Interesting)
Adobe ships Flash/PDF readers/plugins to: Windows, OS X, Symbian (in some form), Linux, *BSD and various, uncountable tiny platforms. iPhone/iPod does not count because of obvious reasons.
Let's see what MS Silverlight ships to: Windows/Intel Mac. The damn thing is so tied to Windows that they couldn't even convert/ship V2 for PPC Macs, or they simply abandoned them. (like we cared!)
MS's XPS format and viewer is the answer to PDF, which some people who don't use Windows have never, ever heard of. It is that Windows-centric. Despite all the rude attempts by MS (adding XPS printer without etc), it has never, ever taken off.
What we need is something that combines ODF and PDF. You can add a binary file to a PDF document like some layer. ROM LogicWare, the lesser-known office suite (Papyrus) developer, does it right now. The files are both PDF and their own edit format, transparent to PDF readers and NOT a hack.
Of course, people will spend time on "omg flash, pdf, Adobe is slow" flaming rather than finding a solution to a real problem. Asking government to use Flash is really absurd, but the real ones to blame here are MS and the large open source based companies. If they have no alternative, Adobe will suggest PDF, of course. What else should they use? MS XPS?
Re:don't hate PDF 'cause it's beautiful (Score:3, Interesting)
CSV is kinda evil (see my post above), but it's better for tabular data than JSON or XML. Again, a tabular serialization format such as Avro, Thrift, or Protocol Buffers might well be far better than CSV for tabular data. JSON has quite a bit of format bloat, and would need some standardized way to explain the data's schema for further analysis. XML is the king of format bloat, but at least has standard schema representations. XML is far better suited to semi-structured or unstructured data than to tables.
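The "format bloat" point is easy to check on a toy table. The sketch below, using only the Python standard library, serializes the same rows as CSV, JSON, and XML and compares the character counts. The rows are invented for illustration, and the XML layout (one element per field) is just one plausible encoding, so treat the numbers as indicative rather than a benchmark.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Made-up example rows; not real budget data.
rows = [
    {"agency": "Example Agency A", "year": 2008, "budget": 1200},
    {"agency": "Example Agency B", "year": 2008, "budget": 3400},
    {"agency": "Example Agency C", "year": 2009, "budget": 5600},
]


def to_csv(rows):
    """Column names appear once, in the header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Column names are repeated as keys in every row."""
    return json.dumps(rows)


def to_xml(rows):
    """Column names are repeated twice per field (open and close tag)."""
    root = ET.Element("rows")
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for key, value in row.items():
            ET.SubElement(row_el, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


sizes = {
    "csv": len(to_csv(rows)),
    "json": len(to_json(rows)),
    "xml": len(to_xml(rows)),
}
for name, size in sizes.items():
    print(f"{name}: {size} characters")
```

On this table CSV comes out smallest and XML largest, because JSON repeats every key per row and XML repeats it twice. None of the three carries a schema by itself here, which is the other half of the complaint.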