How to annotate scans for NLP

In short: OCR into searchable PDF & upload to tagtog

By Dr. Juan Miguel Cejuela & the tagtog team

Your problem: you have a bunch of scanned images or PDFs but cannot make use of them because you cannot even select the text.

Your solution: you just need to OCR your scans (saving the results into text-embedded PDFs) and then upload them to tagtog. From there, you can annotate (by just highlighting the text) and export your valuable data as machine-readable JSON. This is thanks to the Native PDF annotation built into tagtog.

Annotated (Native) PDFs look like this on tagtog:

Any text-embedded PDF can be annotated in tagtog and it’s as easy as selecting text.

All your text selections can then be downloaded in JSON, which you can use to train your ML models!

Your visual text selections on tagtog Native PDFs can be exported into JSON.

And… that’s pretty much it!

Which OCR software should I use?

Your options are many: Google Cloud’s Vision API, Amazon Textract (AWS), Tesseract, ABBYY FineReader, Kadmos, …

You should first decide whether you want to call an external API or run the software on your own servers. This narrows down your search quite a bit. Then, you should most likely test the quality of the different providers with some of your own, real scans.

As far as tagtog is concerned, you can use any OCR software you want. The only requirement is that you upload the result of the OCR into tagtog as a PDF document. Some OCR solutions (e.g. Tesseract or FineReader) can export the results directly into PDF. For OCR solutions that export the results as JSON (including all the graphical coordinates), you can build the final PDFs programmatically. In Java, this is possible with the excellent Apache PDFBox library.
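The usual trick for building such a searchable PDF is to draw the original scan as the page background and then overlay each recognized word as invisible (non-rendered) text, so the text is selectable exactly where it appears in the image. Here is a minimal sketch with PDFBox 2.x — `OcrWord` and its coordinate fields are placeholders for whatever your OCR engine actually returns:

```java
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.pdmodel.graphics.state.RenderingMode;

public class OcrToPdf {

    // Placeholder for one recognized word; adapt it to your OCR engine's JSON.
    // x/y are pixels from the image's top-left corner; fontSize approximates word height.
    static class OcrWord {
        final String text; final float x, y, fontSize;
        OcrWord(String text, float x, float y, float fontSize) {
            this.text = text; this.x = x; this.y = y; this.fontSize = fontSize;
        }
    }

    static void writeSearchablePdf(String scanPath, List<OcrWord> words, String outPath)
            throws IOException {
        try (PDDocument doc = new PDDocument()) {
            PDImageXObject scan = PDImageXObject.createFromFile(scanPath, doc);
            // One page, sized exactly like the scan (1 pixel = 1 PDF unit for simplicity).
            PDPage page = new PDPage(new PDRectangle(scan.getWidth(), scan.getHeight()));
            doc.addPage(page);
            try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                // The scan stays visible as the page background...
                cs.drawImage(scan, 0, 0, scan.getWidth(), scan.getHeight());
                // ...while the OCR text is drawn invisibly on top, making it selectable.
                for (OcrWord w : words) {
                    cs.beginText();
                    cs.setFont(PDType1Font.HELVETICA, w.fontSize);
                    cs.setRenderingMode(RenderingMode.NEITHER); // invisible text
                    // PDF's origin is bottom-left; OCR coordinates are usually top-left.
                    cs.newLineAtOffset(w.x, scan.getHeight() - w.y - w.fontSize);
                    cs.showText(w.text);
                    cs.endText();
                }
            }
            doc.save(outPath);
        }
    }
}
```

In a real pipeline you would also scale the OCR pixel coordinates to the PDF's point units and pick a font size per word from its bounding-box height, but the invisible-text-over-image idea stays the same.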

Example: OCR’ing with Amazon Textract

We prepared this GitHub repository, which contains fully functioning Java code to:

  1. OCR scans by calling the Amazon Textract API
  2. Upload the resulting PDFs into tagtog
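The second step — pushing the OCR'ed PDF into tagtog — needs nothing beyond Java 11's built-in HttpClient. The sketch below hand-builds the multipart body; the endpoint URL, the `owner`/`project` query parameters, and the `files` form field follow tagtog's documents API as we understand it, so treat them as assumptions and double-check against the current API docs (the project name `demo` is just a placeholder):

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class TagtogUpload {

    // Build a single-file multipart/form-data body by hand (no extra libraries needed).
    static byte[] buildMultipart(byte[] fileBytes, String filename, String boundary) {
        var out = new ByteArrayOutputStream();
        out.writeBytes(("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"files\"; filename=\"" + filename + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n").getBytes(StandardCharsets.UTF_8));
        out.writeBytes(fileBytes);
        out.writeBytes(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: TagtogUpload <user> <password> <pdfPath> (dry run: nothing uploaded)");
            return;
        }
        String user = args[0], password = args[1];
        Path pdf = Path.of(args[2]);
        String boundary = "----tagtog" + System.nanoTime();
        byte[] body = buildMultipart(Files.readAllBytes(pdf), pdf.getFileName().toString(), boundary);

        // Assumed endpoint shape: owner = your username, project = your project name.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.tagtog.net/-api/documents/v1?owner=" + user + "&project=demo"))
                .header("Authorization", "Basic " + Base64.getEncoder()
                        .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8)))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The repository linked above does the same thing end to end; this sketch only shows that the upload itself is a plain authenticated multipart POST.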

Example: FUNSD dataset OCR’ed & uploaded to tagtog

Once we have the code to OCR the scans and convert them to PDF, it’s easy to continue the labeling work in tagtog.

As an example, we created this sample tagtog public project, containing the whole FUNSD dataset (which originally consists of noisy scanned images encoded as .png).


And here are some sample annotations done in tagtog over the OCR’ed PDFs:


🤖 Need training data for #NLP? Find & create it for free on: 🍃tagtog.

🐦 Twitter user? Follow @tagtog_net.

❤️ Liked this post? Please 👏👏👏 Clap & Re-Clap to share it with others!
