How to annotate scans for NLP

In short: OCR into searchable PDF & upload to tagtog

🍃tagtog
Apr 26, 2021

Your problem: you have a bunch of scanned images or PDFs, but you cannot make use of them because you cannot even select the text.

Your solution: OCR your scans, save the results as text-embedded PDFs, and upload those to tagtog. From there on, you can annotate by simply highlighting text and export your valuable data as machine-readable JSON. This works thanks to tagtog’s built-in Native PDF annotation.

Annotated (Native) PDFs look like this on tagtog:

Any text-embedded PDF can be annotated in tagtog, and it’s as easy as selecting text.

All your text selections can then be downloaded in JSON, which you can use to train your ML models!

Your visual text selections on tagtog Native PDFs can be exported into JSON.
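As a rough illustration of what you can do with the export, the sketch below reads one annotation file and turns it into (label, text, start offset) tuples. The field names (`entities`, `classId`, `part`, `offsets`) reflect our understanding of the exported annotation JSON and should be treated as assumptions; check them against a file exported from your own project. The sample document and the class IDs `e_1` and `e_2` are made up.

```python
import json

# Hypothetical export, shaped like what we assume the annotation JSON looks like.
SAMPLE_ANN_JSON = """
{
  "anncomplete": true,
  "entities": [
    {"classId": "e_1", "part": "s1v1",
     "offsets": [{"start": 9, "text": "ACME Corp"}]},
    {"classId": "e_2", "part": "s1v1",
     "offsets": [{"start": 27, "text": "2021-04-26"}]}
  ]
}
"""


def extract_entities(ann_json_text):
    """Return a list of (class_id, text, start) tuples from one annotation file."""
    doc = json.loads(ann_json_text)
    rows = []
    for entity in doc.get("entities", []):
        for offset in entity.get("offsets", []):
            rows.append((entity["classId"], offset["text"], offset["start"]))
    return rows


if __name__ == "__main__":
    for class_id, text, start in extract_entities(SAMPLE_ANN_JSON):
        print(f"{class_id}\t{start}\t{text}")
```

From tuples like these it is a short step to whatever input format your ML framework expects.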

And… that’s pretty much it!

Which OCR software should I use?

Your options are many: Google Cloud’s Vision API, Amazon Textract, Tesseract, ABBYY FineReader, Kadmos, …

You should first decide whether you want to call an external API or run the program on your own server. This narrows down your search quite a bit. Then, you should most likely test the quality of the different providers with some of your own real scans.
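One simple way to compare providers is to transcribe a few scans by hand and measure each engine’s character error rate (CER) against that reference: the edit distance between the two strings divided by the length of the reference. The sketch below is a plain-Python illustration; the provider names and sample strings are invented, and you would replace them with the real output of each OCR engine.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert/delete/substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance normalised by the reference length (lower is better)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    reference = "Invoice number 12345"
    # Hypothetical outputs from two OCR engines for the same scan.
    outputs = {
        "provider_a": "Invoice number 12345",
        "provider_b": "lnvoice nurnber 1234S",
    }
    for name, text in sorted(outputs.items()):
        print(name, round(character_error_rate(reference, text), 3))
```

A handful of representative scans is usually enough to see clear differences; averaging the CER over them gives one number per provider.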

As far as tagtog is concerned, you can use any OCR software you want. The only requirement is that you upload the result of the OCR as a PDF document. Some OCR solutions (e.g. Tesseract or FineReader) can export the results directly to PDF. For the other OCR solutions, which export the results as JSON (including all the graphical coordinates), you can build the final PDFs programmatically. In Java, this is at least possible with the excellent Apache PDFBox library.
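A key step when building the PDF yourself is mapping OCR coordinates onto the page. Some OCR outputs (Amazon Textract’s, for example) describe each box as fractions of the page measured from the top-left corner, while PDF user space has its origin at the bottom-left and is measured in points. Below is a minimal sketch of that conversion, assuming boxes are given as `left`, `top`, `width`, `height` values between 0 and 1; the function name and the A4 page size are illustrative choices. Drawing the invisible text layer itself is left to a PDF library such as PDFBox.

```python
from typing import NamedTuple


class PdfRect(NamedTuple):
    """Rectangle in PDF user space: origin at bottom-left, units in points."""
    x: float
    y: float  # bottom edge of the rectangle
    width: float
    height: float


def to_pdf_rect(left, top, width, height, page_width_pt, page_height_pt):
    """Convert a normalised, top-left-origin box into PDF points."""
    x = left * page_width_pt
    box_height = height * page_height_pt
    # Flip the vertical axis: OCR measures down from the top,
    # PDF measures up from the bottom.
    y = page_height_pt - top * page_height_pt - box_height
    return PdfRect(x, y, width * page_width_pt, box_height)


if __name__ == "__main__":
    A4_WIDTH_PT, A4_HEIGHT_PT = 595.0, 842.0  # approximate A4 size in points
    # Hypothetical word box: 10% from the left, 5% from the top.
    print(to_pdf_rect(0.10, 0.05, 0.30, 0.02, A4_WIDTH_PT, A4_HEIGHT_PT))
```

If your OCR engine reports pixel coordinates instead, divide by the image width and height first to get the fractions assumed here.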

Example: OCR’ing with Amazon Textract

We prepared this GitHub repository, which contains fully functioning Java code to:

  1. OCR scans by calling the Amazon Textract API
  2. Upload the resulting PDFs to tagtog
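The repository itself is written in Java. Purely to give a feel for the first step, here is a small Python sketch that pulls the recognized lines out of a Textract text-detection response. It assumes the response has already been fetched and parsed into a dictionary (for example via an AWS SDK) and relies on the `Blocks`, `BlockType`, `Text`, and `Geometry.BoundingBox` fields of that response. The sample response is hand-written and heavily trimmed, and this is not the code from the repository.

```python
def extract_lines(textract_response):
    """Return (text, bounding_box) pairs for every LINE block, in response order."""
    lines = []
    for block in textract_response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        lines.append((block["Text"], box))
    return lines


# Hand-written, trimmed-down sample in the shape of a Textract response.
SAMPLE_RESPONSE = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Hello tagtog",
         "Geometry": {"BoundingBox":
                      {"Left": 0.1, "Top": 0.05, "Width": 0.3, "Height": 0.02}}},
        {"BlockType": "WORD", "Text": "Hello",
         "Geometry": {"BoundingBox":
                      {"Left": 0.1, "Top": 0.05, "Width": 0.12, "Height": 0.02}}},
    ]
}

if __name__ == "__main__":
    for text, box in extract_lines(SAMPLE_RESPONSE):
        print(text, box["Left"], box["Top"])
```

Each (text, box) pair is what you would then place onto the PDF page as a text layer before uploading.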

Example: FUNSD dataset OCR’ed & uploaded to tagtog

Once we have the code to OCR and convert to PDF, it’s easy to continue the labeling work in tagtog.

As an example, we created this sample tagtog public project, containing the whole FUNSD dataset (whose original files are noisy scanned images in .png format).

And here are some sample annotations over the OCR’ed PDFs:

ā¤ļø Liked this post? Please šŸ‘šŸ‘šŸ‘ Clap & Re-Clap to share it with others!

--

--