
Adobe Pushing For Flash and PDF In Open Government Initiative

angryrice tips news that Adobe seems to be campaigning for the inclusion of Flash and PDF in the Obama administration's efforts at increasing government transparency and openness. A post from the Sunlight Labs blog is critical of Adobe's undertaking, in part since PDF is often "non-parsable by software, unfindable by search engines, and unreliable if text is extracted." They also say government's priority should be to publish datasets and the APIs to interact with them, rather than choosing how they're displayed in fancy graphs and charts.

  • by Bacon Bits ( 926911 ) on Saturday October 31, 2009 @09:35AM (#29934009)

    PDFs are only searchable if the document contains text. Half the time PDFs contain text-as-image, which is about as useful to a search engine as a captcha image. Google doesn't run OCR on PDFs, AFAIK. Although, come to think of it, that sounds like something they'd get sued by a random company for doing for "violating copyright proprietary information".

  • Re:What do you want? (Score:3, Interesting)

    by Bacon Bits ( 926911 ) on Saturday October 31, 2009 @09:46AM (#29934077)

    A document format shouldn't store text as an image. That's why it's called text.

  • Re:What do you want? (Score:0, Interesting)

    by trickyD1ck ( 1313117 ) on Saturday October 31, 2009 @09:48AM (#29934089)
    Actually, OneNote can search in images: "Powerful search capabilities can help you locate information from text within pictures or from spoken words in audio and video recordings."
  • by mattr ( 78516 ) on Saturday October 31, 2009 @09:51AM (#29934101) Homepage Journal

    GP is right. Government should focus on the things government is actually needed for, such as determining standards for formats that everyone can use, with input from academia and industry. For example, a human-readable, parsable format that one could embed in a web page for semantic metadata. Or funding open source software to make it easy (cross platform) to input such data (I am thinking of information about cited papers or books). Typeset information is nice but we are already drowning in information - how many pages of Google results do you usually look at? And we need help before generating 10 times as much.

    Why PDF is bad:
    - It is a portable typeset document package. Not a data sharing package that could be pulled apart easily and automatically with tools.
    - PDF is extremely hard to parse, and using current free software does not always give good results.
    - You destroy useful document structure, or, in the case of ASCII text, parsability and small size, when you convert to PDF. You can't just convert back to the original.
    - It takes significant processing power and commercial software to display well and reliably, as far as I can see. Having just gotten the latest Mac I feel like I'm in a dauntless battleship, but I have had a lot of trouble with different Unix tools in the past.
    - Scientists publish PDF too but then also use other formats for data. For example on arXiv, one scientist recently published animations inside a zip but it was hard to find the link.
    - It is difficult to manage bibliographic information automatically.
    - It is proprietary
    - It requires a huge amount of data, and arcane knowledge, just to build a parser that works most of the time (such as for Asian languages especially).

  • Re:What do you want? (Score:3, Interesting)

    by TheRaven64 ( 641858 ) on Saturday October 31, 2009 @10:07AM (#29934229) Journal
    You're missing the point. PDFs do not store text. Text is a stream of characters. PDFs store glyphs and their locations. It is more or less possible to convert glyphs into characters, although things like ligatures and the fact that spaces are not really represented make this difficult. In the metadata, some PDFs also store the text of the document, allowing it to be extracted. Given that the PDF is created automatically from the text in most cases, the text is more useful. You can create the PDF from the text easily, but creating the text from the PDF is much harder.
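    To make the glyphs-versus-characters point concrete, here is a minimal Python sketch. It is not how any real PDF library works: the Glyph record, the ligature table, and the gap_ratio threshold are all assumptions made up for illustration. It only shows why the reverse direction is guesswork: spaces have to be inferred from gaps and ligatures have to be expanded back into characters.

```python
from dataclasses import dataclass

# Hypothetical glyph record: a page description keeps a shape and a position,
# not a stream of characters.
@dataclass
class Glyph:
    shape: str    # the drawn glyph, possibly a ligature such as U+FB01
    x: float      # left edge on the baseline
    width: float  # advance width

# A few Latin ligatures and the characters they stand for.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff"}

def glyphs_to_text(glyphs, gap_ratio=0.3):
    """Guess the text for one line of glyphs.

    A space is inferred wherever the gap to the previous glyph exceeds
    gap_ratio times the average glyph width. This is a heuristic and can
    be wrong for tightly or loosely set type.
    """
    glyphs = sorted(glyphs, key=lambda g: g.x)
    if not glyphs:
        return ""
    avg_width = sum(g.width for g in glyphs) / len(glyphs)
    out = []
    prev_end = None
    for g in glyphs:
        if prev_end is not None and g.x - prev_end > gap_ratio * avg_width:
            out.append(" ")  # no space glyph was stored, so we guess one
        out.append(LIGATURES.get(g.shape, g.shape))
        prev_end = g.x + g.width
    return "".join(out)

line = [
    Glyph("\ufb01", 0.0, 5.0),  # single "fi" ligature glyph
    Glyph("l", 5.0, 2.0),
    Glyph("e", 7.0, 4.0),
    Glyph("s", 14.0, 4.0),      # preceded by a 3-unit gap, no space glyph
    Glyph("o", 18.0, 4.0),
]
print(glyphs_to_text(line))  # prints "file so"
```

    Going from the text to positioned glyphs is deterministic once you pick a font and a layout; going back depends on thresholds like the one above.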
  • by Antique Geekmeister ( 740220 ) on Saturday October 31, 2009 @10:19AM (#29934331)

    Printing documents created in other language versions of Acrobat. In particular, the Adobe Acrobat for German created documents that were not only unviewable in a normal Acrobat viewer, but when used to "print PDF" for MS Word documents, created documents that actually crashed Windows computers. The Acrobat for Hebrew didn't crash Windows with the printed documents, but was filled with layout errors when rendered even by Acrobat Reader, errors that didn't show up in the Adobe Acrobat tool. Much of this may have been fixed with the latest release, but I'm not spending nor suggesting that my peers overseas spend all the money needed to upgrade.

    Getting our colleagues to stop using Acrobat and use _anything else_ to generate their documents, and use PDFCreator to print them as PDF, stabilized the situation enough for us to generate the documents we needed. It didn't provide PDF forms for people to fill out, which was its only flaw.

  • by tepples ( 727027 ) <tepples@gmail.BOHRcom minus physicist> on Saturday October 31, 2009 @10:33AM (#29934451) Homepage Journal
    Yeah, and you can hex edit an SWF file too. But change a letter, refresh, change a letter, refresh, is not the kind of editing that graphic designers prefer to do. If that's what SVG has to offer, the market will choose SWF. I can only hope your comment was sarcasm.
  • by xjimhb ( 234034 ) on Saturday October 31, 2009 @11:01AM (#29934647) Homepage

    Just recently I had to look at, and print a few pages from, a PDF document. Knowing where it came from, a corporation that is only very slowly dipping a toe in the water of software other than the big names, I'm sure it was done with Adobe.

    Now I don't even have the Adobe Acrobat reader on my system; when I try to install it, the install crashes. But Fedora comes with several other PDF readers, and the default is set to "Evince" which works fine MOST of the time.

    But I got this PDF, and one page was a picture of a tax form, and when I tried to print it, the tax form came out as a big black blob - man, does that waste ink! Obviously I killed the print job to try something else. (Just VIEWING this tax form was fine, only printing messed up.)

    I remembered using "Xpdf" a while ago, so I tried that, and voila, the tax form printed perfectly. Since I knew there were more tax forms in there, I used Xpdf for the rest of the job.

    So here is a case where two different PDF viewers reacted differently to the same PDF file. I think what we need is an OPEN DEFINITION for PDF files, probably a subset of Adobe's definition, that any OSS viewer can follow and get the proper results - and ask the user what to do with files that don't follow it.

    And tell Adobe they can either follow the open definition, or stuff it where the sun don't shine!

  • by Ilgaz ( 86384 ) on Saturday October 31, 2009 @11:04AM (#29934665) Homepage

    Adobe ships Flash/PDF readers/plugins to: Windows, OS X, Symbian (in some form), Linux, *BSD and various, uncountable tiny platforms. iPhone/iPod does not count because of obvious reasons.

    Let's see what MS Silverlight ships to: Windows/Intel Mac. The damn thing is so tied to Windows that they couldn't even convert/ship V2 for PPC Macs, or they simply abandoned them. (like we cared!)

    The MS XPS format and viewer is their answer to PDF, which some people who don't use Windows have never, ever heard of. It is that Windows-centric. Despite all the rude attempts by MS (adding XPS printer without etc), it has never, ever taken off.

    What we need is something that combines ODF and PDF. You can add a binary file to a PDF document like some layer. ROM LogicWare, the lesser-known Office (Papyrus) developer, does it right now. The files are both PDF and their own edit format, transparent to PDF readers and NOT a hack.

    Of course, people will spend time on "omg flash, pdf, Adobe is slow" flaming rather than finding a solution to a real problem. Asking government to use Flash is really absurd, but the real ones to blame here are MS and the large open source based companies. If they have no alternative, Adobe will suggest PDF of course. What else should they use? MS XPS?

  • by markdavis ( 642305 ) on Saturday October 31, 2009 @11:56AM (#29934999)
    So there is a partial option for MS-Windows only. Great. Not exactly platform agnostic and open. I suppose it is better than nothing, though.
  • by John Whitley ( 6067 ) on Saturday October 31, 2009 @12:27PM (#29935181) Homepage

    CSV is kinda evil (see my post above), but it's better for tabular data than JSON or XML. Again, a tabular serialization format such as Avro, Thrift, or Protocol Buffers might well be far better than CSV for tabular data. JSON has quite a bit of format bloat, and would need some standardized way to explain the data's schema for further analysis. XML is the king of format bloat, but at least has standard schema representations. XML is far better for semi-structured or unstructured data than tables.
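    For a rough feel of the format bloat being described, here is a small Python sketch that serializes the same tiny table as CSV, JSON, and XML using only the standard library and compares the byte counts. The column names and values are made-up placeholders, and the exact numbers depend on the data and on how each format is written (compact JSON separators, element-per-field XML), so read it as an illustration rather than a benchmark.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Made-up placeholder rows, for illustration only.
FIELDS = ["agency", "year", "amount"]
ROWS = [
    {"agency": "Agency A", "year": "2008", "amount": "100"},
    {"agency": "Agency B", "year": "2008", "amount": "250"},
    {"agency": "Agency C", "year": "2009", "amount": "75"},
]

def to_csv(rows):
    # Column names appear once, in the header line.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_json(rows):
    # Column names are repeated as keys in every row.
    return json.dumps(rows, separators=(",", ":"))

def to_xml(rows):
    # Column names are repeated twice per value (open and close tag).
    root = ET.Element("rows")
    for r in rows:
        row = ET.SubElement(root, "row")
        for f in FIELDS:
            ET.SubElement(row, f).text = r[f]
    return ET.tostring(root, encoding="unicode")

sizes = {
    name: len(fn(ROWS).encode("utf-8"))
    for name, fn in [("csv", to_csv), ("json", to_json), ("xml", to_xml)]
}
print(sizes)
```

    On this sample the ordering is CSV smallest, then JSON, then XML, because CSV states the column names once while the other two repeat them for every row. None of the three outputs carries type information, which is the schema gap mentioned above.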

  • by TheRaven64 ( 641858 ) on Saturday October 31, 2009 @01:54PM (#29935723) Journal
    PostScript is also a free specification, but NeXT was using the Display PostScript implementation licensed from Adobe. They switched to something closer to PDF because, it turned out, no one actually cared about the nicer features in PS. With DPS, you could write view objects entirely in PostScript and have them run on the display server. This was quite slow and had all sorts of problems in that the PS programs could (potentially) run forever. Most people just used the drawing subset of PS, which is also available in PDF, and none of the flow control stuff.
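    The difference between "drawing subset" and "full language" can be sketched with a toy interpreter. This is a hedged illustration in Python, not real PostScript or PDF: the three operators, the DRAWING_OPS table, and run_drawing_program are assumptions invented for the example. The point is only that a language with no loops, procedures, or jumps is processed in a single pass over its tokens, so it cannot run forever.

```python
# Toy interpreter for a PostScript-like drawing subset (not real PostScript).
DRAWING_OPS = {"moveto": 2, "lineto": 2, "stroke": 0}  # operator -> operand count

def run_drawing_program(source):
    """Execute tokens in one left-to-right pass and return the stroked paths.

    There are no loops, procedures, or jumps, so the amount of work is
    bounded by the number of tokens in the program.
    """
    stack = []    # operand stack of numbers
    path = []     # current path as a list of (x, y) points
    stroked = []  # finished paths
    for token in source.split():
        if token in DRAWING_OPS:
            argc = DRAWING_OPS[token]
            if len(stack) < argc:
                raise ValueError(f"{token}: not enough operands")
            # Pop the last argc operands (explicit index avoids the [-0:] pitfall).
            args = stack[len(stack) - argc:]
            del stack[len(stack) - argc:]
            if token == "moveto":
                path = [tuple(args)]
            elif token == "lineto":
                if not path:
                    raise ValueError("lineto: no current point")
                path.append(tuple(args))
            else:  # stroke
                stroked.append(path)
                path = []
        else:
            try:
                stack.append(float(token))
            except ValueError:
                # Anything else, including flow control, is rejected outright.
                raise ValueError(f"unsupported operator: {token}") from None
    return stroked

print(run_drawing_program("0 0 moveto 10 0 lineto 10 10 lineto stroke"))
# prints [[(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]]
```

    Feeding it something like "{ } loop", which in real PostScript is an endless loop, raises ValueError at the first unknown token instead of hanging the display server.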
