How to annotate scans for NLP
Your problem: you have a bunch of scanned images or PDFs, but you cannot make use of them because you cannot even select the text.
Your solution: OCR your scans, saving the results as text-embedded PDFs, and then upload those PDFs into tagtog. From there, you can annotate simply by highlighting the text, and export your valuable data as machine-readable JSON. This is possible thanks to the Native PDF annotation built into tagtog.
Annotated (Native) PDFs look like this on tagtog:
All your text selections can then be downloaded in JSON, which you can use to train your ML models!
And… that’s pretty much it!
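If you are curious what consuming the exported JSON might look like, here is a minimal Python sketch. It assumes the export follows the shape of tagtog's ann.json format, with an "entities" list whose items carry a "classId", a "part" and a list of "offsets" (each with "start" and "text"). The sample document and its values (e_1, s1p1, the texts) are hand-written for illustration, not real output, and extract_entities is a hypothetical helper name.

```python
import json

# Hand-written sample, shaped like a tagtog ann.json export (assumed format).
# The class ids, part ids and texts below are made up for illustration.
SAMPLE_ANN_JSON = """
{
  "anncomplete": true,
  "entities": [
    {"classId": "e_1", "part": "s1p1",
     "offsets": [{"start": 0, "text": "ACME Corp"}]},
    {"classId": "e_2", "part": "s1p1",
     "offsets": [{"start": 21, "text": "2019-05-02"}]}
  ],
  "relations": []
}
"""


def extract_entities(ann_json_text):
    """Return one (class_id, part, start, text) tuple per annotated span."""
    doc = json.loads(ann_json_text)
    spans = []
    for entity in doc.get("entities", []):
        # An entity can list several offsets; emit one tuple for each.
        for offset in entity.get("offsets", []):
            spans.append(
                (entity["classId"], entity["part"], offset["start"], offset["text"])
            )
    return spans


if __name__ == "__main__":
    for span in extract_entities(SAMPLE_ANN_JSON):
        print(span)
```

Flat tuples like these are a convenient starting point for building training examples; how you map class ids to label names depends on your own project settings.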
Which OCR software should I use?
You have many options: Google Cloud’s Vision API, Amazon Textract, Tesseract, ABBYY FineReader, Kadmos, …
You should first decide whether you want to call an external API or run the program on your own server. This narrows down your search quite a bit. Then, you should most likely test the quality of the different candidates on some of your own real scans.
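One simple way to compare candidates on your own scans is to transcribe a few pages by hand and measure how far each OCR output is from that reference. The sketch below uses the character error rate (edit distance divided by reference length), which is a common choice but only one of several possible metrics. The function names are ours, and the strings in the usage lines are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance normalized by the length of the manual transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    reference = "Invoice no. 1024"
    print(character_error_rate(reference, "Invoice no. 1024"))
    print(character_error_rate(reference, "lnvoice no. 1O24"))
```

A lower rate means the OCR output is closer to your manual transcription. Averaging it over a handful of representative scans gives a rough ranking of the candidates.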
As far as tagtog is concerned, you can use any OCR software you want. The only requirement is that you upload the OCR result into tagtog as a PDF document. Some OCR solutions (e.g. Tesseract or FineReader) can export their results directly to PDF. Other OCR solutions export their results as JSON (including all the graphical coordinates); for these, you can build the final PDFs programmatically. In Java, at least, this is possible with the excellent Apache PDFBox library.
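When you build the PDF yourself from OCR coordinates, the step that is easiest to get wrong is the coordinate conversion. The Python sketch below assumes the OCR output gives each text line a bounding box as ratios of the page size, measured from the top-left corner, with keys named "Left", "Top", "Width" and "Height" (the naming Amazon Textract uses; other engines differ). PDF user space, by default, has its origin at the bottom-left and is measured in points (1/72 inch), so the vertical axis must be flipped. PlacedText and to_pdf_space are hypothetical names, and the actual drawing of the text (with PDFBox or any other PDF library) is left out.

```python
from dataclasses import dataclass


@dataclass
class PlacedText:
    text: str
    x: float       # left edge, in points from the left of the page
    y: float       # bottom edge, in points from the bottom of the page
    width: float   # box width in points
    height: float  # box height in points


def to_pdf_space(text, box, page_width, page_height):
    """Map a normalized, top-left-origin box to PDF points (bottom-left origin).

    `box` is assumed to hold "Left", "Top", "Width" and "Height" as ratios
    of the page size, each between 0 and 1.
    """
    x = box["Left"] * page_width
    width = box["Width"] * page_width
    height = box["Height"] * page_height
    # Flip the vertical axis: the distance from the top becomes a distance
    # from the bottom, measured at the bottom edge of the box.
    y = page_height - box["Top"] * page_height - height
    return PlacedText(text, x, y, width, height)


if __name__ == "__main__":
    # US Letter page: 612 x 792 points. The box values are made up.
    box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
    print(to_pdf_space("Total amount due", box, 612.0, 792.0))
```

With the position in points, a PDF library can place each OCR'ed line (typically as invisible text over the scanned image) so that the text becomes selectable exactly where it appears on the page.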
Example: OCR’ing with Amazon Textract
We prepared this GitHub repository, which contains fully functioning Java code to:
- OCR scans by calling the Amazon Textract APIs
- Upload the resulting PDFs into tagtog
Example: FUNSD dataset OCR’ed & uploaded to tagtog
Once we have the code to OCR the scans and convert the results to PDF, it is easy to continue the labeling work in tagtog.
As an example, we created this sample public tagtog project, which contains the whole FUNSD dataset (which originally consists of noisy scanned images in .png format).
And here are some sample annotations done in tagtog over the OCR’ed PDFs: