DocTR: Document Text Recognition
Optical Character Recognition made seamless & accessible to anyone, powered by TensorFlow 2
What you can expect from this repository:
- efficient ways to parse textual information (localize and identify each word) from your documents
- guidance on how to integrate this in your current architecture
Getting your pretrained model
End-to-End OCR is achieved in DocTR using a two-stage approach: text detection (localizing words), then text recognition (identifying the characters in each word). You can therefore select the architecture used for text detection and the one used for text recognition from the list of available implementations.
```python
from doctr.models import ocr_predictor

model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)
```
Documents can be interpreted from PDF or images:
```python
from doctr.documents import DocumentFile

# PDF
pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf").as_images()
# Image
single_img_doc = DocumentFile.from_images("path/to/your/img.jpg")
# Webpage
webpage_doc = DocumentFile.from_url("https://www.yoursite.com").as_images()
# Multiple page images
multi_img_doc = DocumentFile.from_images(["path/to/page1.jpg", "path/to/page2.jpg"])
```
Putting it together
Let's use the default pretrained model for an example:
```python
from doctr.documents import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf").as_images()
# Analyze
result = model(doc)
```
To make sense of your model's predictions, you can either visualize them interactively or export them to JSON format (to get a better understanding of our document model, check our documentation):
```python
json_output = result.export()
```
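The export follows the nested pages → blocks → lines → words hierarchy of DocTR's document model, so plain dictionary traversal is enough to pull out the recognized text. A minimal sketch is shown below; note that `sample_export` is a hand-written stand-in for the dict returned by `result.export()`, not real model output, and real exports carry additional fields (such as geometry and page dimensions):

```python
# Walk a DocTR-style export dict and collect every recognized word.
# `sample_export` is a simplified, hand-written stand-in for the output
# of `result.export()` (assumption: pages -> blocks -> lines -> words).
sample_export = {
    "pages": [
        {
            "page_idx": 0,
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {"value": "Hello", "confidence": 0.99},
                                {"value": "world", "confidence": 0.98},
                            ]
                        }
                    ]
                }
            ],
        }
    ]
}

def iter_words(export):
    """Yield (page_idx, value, confidence) for every word in the export."""
    for page in export["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    yield page["page_idx"], word["value"], word["confidence"]

words = list(iter_words(sample_export))
print(words)  # [(0, 'Hello', 0.99), (0, 'world', 0.98)]
```

The same loop works on any export of this shape, e.g. to filter words below a confidence threshold before downstream processing.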
Python 3.6 (or higher) and pip are required to install DocTR.
You can install the latest release of the package from PyPI as follows:
```shell
pip install python-doctr
```
Or you can install it from source:
```shell
git clone https://github.com/mindee/doctr.git
pip install -e doctr/.
```
Credit where it's due: this repository implements, among others, architectures from published research papers.
- Real-time Scene Text Detection with Differentiable Binarization.
- LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation.
- An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition.
- Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition.
The full package documentation is available here for detailed specifications.
A minimal demo app is provided for you to play with the text detection model!
You will need an extra dependency (Streamlit) for the app to run:
```shell
pip install -r demo/requirements.txt
```
You can then easily run your app in your default browser by running:
```shell
streamlit run demo/app.py
```
If you wish to deploy containerized environments, you can use the provided Dockerfile to build a Docker image:
```shell
docker build . -t <YOUR_IMAGE_TAG>
```
An example script is provided for a simple document analysis of a PDF or image file:
```shell
python scripts/analyze.py path/to/your/doc.pdf
```
All script arguments can be checked using:

```shell
python scripts/analyze.py --help
```
If you scrolled down to this section, you most likely appreciate open source. Do you feel like extending the range of our supported characters? Or perhaps submitting a paper implementation? Or contributing in any other way?
You're in luck: we compiled a short guide (cf. CONTRIBUTING) for you to easily do so!
Distributed under the Apache 2.0 License. See LICENSE for more information.