Problems with using Acrobat to create tagged PDF files from scans

(I didn’t take screenshots when I tested on Mac OS X yesterday, so there are none for now. This will be fixed later, but seriously, Adobe needs to recommit to releasing Acrobat and/or Reader on Unix, or at least make sure Reader DC’s installer works under Wine. The status quo is unacceptable.)

Since Acrobat CC and Reader DC are really the only tools for creating tagged PDF files from a scan, yesterday I decided to see how well they actually work. (I used Acrobat CC on a recent Mac OS X, with VoiceOver for testing. The scan was of low quality, but sometimes this is what people have to work with.)

Many obvious, major problems surfaced almost right away:

  • The relevant panels and toolbars were difficult to find. You have to google to figure out where to even look.
  • Creating a tag tree triggers OCR. The result contained lots of errors, but there’s no obvious way to fix any of them.
  • Acrobat made many reasonable guesses at the document structure, but also many bad ones. Some could have been easily fixed by removing container elements but keeping their contents in place, but there’s no obvious way to do this.
  • Structural errors (including the above) can often be fixed by rearranging tags, but drag-and-drop is the only way to reorder tags, and your tags more often than not end up being dropped at unpredictable locations – even if you’re very careful. In other words, Adobe’s official workflow doesn’t actually work.
  • The document object model doesn’t seem to be consistent; you can type over the errors in the Content (“TURO”) view, but the errors remain uncorrected in the Tags view (and when you read the actual PDF with an actual screen reader).
  • When you finally find out how you’re supposed to correct OCR errors, you realize that the automatic OCR performed by the Tags editor did not create any layers, so you have to redo the OCR even though it has already been done.
  • You’re given only one chance to correct “possible” OCR errors, meaning those that Acrobat automatically identifies as “suspects”. If you make an error while correcting, too bad: you have to start the OCR process all over again. There is no Undo.
  • There’s no way to correct OCR errors Acrobat fails to even identify as “suspects”.
  • The contents fields in a tag produce no observable effects; they don’t override anything.
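
To make the container complaint above concrete: removing a wrongly guessed container while keeping its contents in place is, in tree terms, a simple "unwrap". Here is a minimal Python sketch of that operation. The Tag class and the unwrap and outline functions are hypothetical names made up for illustration; they are not Acrobat's internals or any Adobe scripting API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tag:
    """Hypothetical stand-in for a node in a PDF tag tree (not an Acrobat API)."""
    name: str
    children: List["Tag"] = field(default_factory=list)
    text: Optional[str] = None


def unwrap(parent: Tag, index: int) -> None:
    """Remove parent.children[index] but keep its children in the same place."""
    container = parent.children[index]
    # Slice assignment splices the container's children in where the
    # container used to be, so the reading order is preserved.
    parent.children[index:index + 1] = container.children


def outline(tag: Tag, depth: int = 0) -> List[str]:
    """Return an indented, one-line-per-tag outline of the tree."""
    lines = ["  " * depth + tag.name]
    for child in tag.children:
        lines.extend(outline(child, depth + 1))
    return lines


# A bad structural guess: two paragraphs wrongly wrapped in a Table container.
doc = Tag("Document", [
    Tag("H1", text="Title"),
    Tag("Table", [
        Tag("P", text="First paragraph"),
        Tag("P", text="Second paragraph"),
    ]),
    Tag("P", text="Third paragraph"),
])

unwrap(doc, 1)  # drop the Table wrapper, keep both paragraphs where they were
print("\n".join(outline(doc)))
```

After the call, the two paragraphs sit directly under Document, between the heading and the third paragraph, in their original order. That is all the fix amounts to, which is why it is frustrating that the Tags view offers no obvious way to do it.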

I’ll elaborate on these later, but I’m shocked that this is the kind of tool we are forced to work with if we need to “remediate” a PDF file: pretty much unusable, if you ask me.

(MM told me that when he had to do PDF remediation he ended up re-creating almost everything, and AW said my complaints are her daily complaints. I can totally empathize. If I end up doing this kind of work, InDesign will likely end up being a major part of my workflow as well.)