How to annotate scans for NLP

By Dr. Juan Miguel Cejuela

Your problem: you have a bunch of scanned images or PDFs but cannot make use of them because you cannot even select the text.

Your solution: OCR your scans, save the results as text-embedded PDFs, and upload those to tagtog. From there, you can annotate by simply highlighting the text and export your valuable data as machine-readable JSON. This works thanks to tagtog’s built-in Native PDF annotation.

Annotated (Native) PDFs look like this on tagtog:

Any text-embedded PDF can be annotated in tagtog and it’s as easy as selecting text.

All your text selections can then be downloaded in JSON, which you can use to train your ML models!

Your visual text selections on tagtog Native PDFs can be exported into JSON.
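As a rough illustration of what you can do with the export, the sketch below flattens the entities of one annotation file into spans ready for a training pipeline. It assumes the exported JSON has an `entities` list whose items carry `classId`, `part`, and `offsets` (each with `start` and `text`). The sample document, the class ids, and the helper name `entity_spans` are made up for the example, so check them against a real export from your own project.

```python
import json

# Hypothetical sample, shaped like an exported annotation file (assumed layout).
SAMPLE_ANN_JSON = """
{
  "anncomplete": false,
  "entities": [
    {"classId": "e_1", "part": "s1v1",
     "offsets": [{"start": 10, "text": "Invoice 4711"}]},
    {"classId": "e_2", "part": "s1v1",
     "offsets": [{"start": 40, "text": "2019-06-01"}]}
  ]
}
"""


def entity_spans(ann):
    """Flatten annotations into (class_id, part, start, end, text) tuples."""
    spans = []
    for entity in ann.get("entities", []):
        for offset in entity.get("offsets", []):
            start = offset["start"]
            text = offset["text"]
            # The end offset is derived from the selected text's length.
            spans.append(
                (entity["classId"], entity["part"], start, start + len(text), text)
            )
    return spans


spans = entity_spans(json.loads(SAMPLE_ANN_JSON))
for span in spans:
    print(span)
```

Spans in this form (label, start, end) map directly onto the input that most NER training tools expect.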

And… that’s pretty much it!

Which OCR software should I use?

Your options are many: Google Cloud’s Vision API, Amazon Textract (AWS), Tesseract, ABBYY FineReader, Kadmos, …

You should first decide whether you want to call an external API or run the software on your own server. This narrows down your search quite a bit. Then, you should most likely test the quality of the different providers with some of your own real scans.
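One simple way to compare providers, offered here as a suggestion rather than a prescribed method, is to type out the text of a few scans by hand and compute the character error rate of each OCR output: the edit distance to the reference divided by the reference length. The function names and sample strings below are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance normalized by the length of the hand-typed reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Two typical OCR confusions: "l" read as "1", and "1" read as "l".
print(character_error_rate("Total: 100", "Tota1: l00"))  # 0.2
```

A lower rate is better. Averaging it over a handful of representative scans gives a quick, if coarse, ranking of the candidates.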

As far as tagtog is concerned, you can use any OCR software you want. The only requirement is that you upload the OCR result as a PDF document. Some OCR solutions (e.g. Tesseract or FineReader) can export their results directly to PDF. Others export the results as JSON (including all the graphical coordinates), and from that output you can build the final PDFs programmatically. In Java, for example, this is possible with the excellent Apache PDFBox library.
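The core of building a PDF from OCR coordinates is a change of coordinate system. The sketch below assumes the OCR engine reports boxes the way Amazon Textract does, as `Left`, `Top`, `Width`, and `Height` ratios of the page with the origin at the top-left corner. It converts them to PDF user space, which is measured in points from the bottom-left corner. The function name and the sample box are illustrative, and the arithmetic is the same in any language; it is written in Python here only for brevity.

```python
def bbox_to_pdf_rect(bbox, page_width_pt, page_height_pt):
    """Convert a normalized, top-left-origin box to PDF points.

    Returns (x, y, width, height), where (x, y) is the lower-left corner
    of the box in a coordinate system whose origin is the page's
    bottom-left corner.
    """
    x = bbox["Left"] * page_width_pt
    width = bbox["Width"] * page_width_pt
    height = bbox["Height"] * page_height_pt
    # Flip the vertical axis: the bottom edge of the box sits at
    # (Top + Height) from the top, so subtract that from the page height.
    y = page_height_pt - (bbox["Top"] + bbox["Height"]) * page_height_pt
    return (x, y, width, height)


# US Letter page: 612 x 792 points.
sample_box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
print(bbox_to_pdf_rect(sample_box, 612, 792))  # (153.0, 297.0, 306.0, 99.0)
```

The resulting rectangle is what you would hand to a PDF library when placing the recognized text over the scanned image.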

Example: OCR’ing with Amazon Textract

We prepared this GitHub repository, which contains fully functioning Java code to:

  1. OCR scans by calling the Amazon Textract API
  2. Upload the resulting PDFs to tagtog
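To give a feel for what the first step produces (this is not the repository's code, which is in Java), the sketch below pulls the recognized lines and their bounding boxes out of a response shaped like the one Textract's text-detection call returns: a `Blocks` list whose `LINE` items carry `Text`, `Confidence`, and `Geometry.BoundingBox`. The sample response is hand-written and abbreviated, and `extract_lines` is a hypothetical helper name.

```python
# Hand-written, abbreviated sample in the shape of a Textract response.
SAMPLE_RESPONSE = {
    "Blocks": [
        {"BlockType": "PAGE",
         "Geometry": {"BoundingBox": {"Left": 0.0, "Top": 0.0,
                                      "Width": 1.0, "Height": 1.0}}},
        {"BlockType": "LINE", "Text": "DATE: 06/01/1999", "Confidence": 98.7,
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.05,
                                      "Width": 0.3, "Height": 0.02}}},
        {"BlockType": "LINE", "Text": "T0: J. Smlth", "Confidence": 61.2,
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.09,
                                      "Width": 0.25, "Height": 0.02}}},
        {"BlockType": "WORD", "Text": "DATE:", "Confidence": 99.1,
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.05,
                                      "Width": 0.08, "Height": 0.02}}},
    ]
}


def extract_lines(response, min_confidence=0.0):
    """Return (text, bounding_box) pairs for every LINE block.

    PAGE and WORD blocks are skipped; lines whose confidence is below
    min_confidence (on Textract's 0-100 scale) are dropped.
    """
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        if block.get("Confidence", 0.0) < min_confidence:
            continue
        lines.append((block["Text"], block["Geometry"]["BoundingBox"]))
    return lines


for text, box in extract_lines(SAMPLE_RESPONSE):
    print(text, box)
```

Each pair holds the text to embed in the PDF and the position at which to place it.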

Example: FUNSD dataset OCR’ed & uploaded to tagtog

Once we have the code to OCR and convert to PDF, it is easy to continue the labeling work in tagtog.

As an example, we created this sample public tagtog project, which contains the whole FUNSD dataset (originally a collection of noisy scanned images in PNG format).

And here are some sample annotations over the OCR’ed PDFs:





The text annotation platform to train #NLP. Easy.
