Description:
This code implements a document question-answering system built on the LayoutLM architecture. Two OCR engines, PaddleOCR and Tesseract, handle text extraction and bounding-box detection on the input image. The pre-trained impira/layoutlm-document-qa model and its tokenizer encode the question together with the OCR-extracted words and their spatial coordinates. The model predicts the answer span by computing start and end logits over the document tokens, and the tokenizer decodes the selected span into the answer text. A Gradio interface handles user interaction: image upload, question input, OCR-engine selection, and display of the answer, its confidence score, and the raw OCR results. The system runs on CPU and normalizes bounding boxes to a fixed coordinate range for consistent model input.
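
The flow above can be sketched as follows. This is a minimal, hypothetical illustration, not the actual implementation: `normalize_box` shows the bounding-box normalization into LayoutLM's 0-1000 coordinate grid, and `answer_question` assumes the Hugging Face `transformers` document-question-answering pipeline with the impira/layoutlm-document-qa checkpoint named above (requires `pip install transformers pillow` plus an OCR backend).

```python
def normalize_box(box, width, height):
    """Scale a pixel-space box (x0, y0, x1, y1) into the 0-1000
    coordinate grid that LayoutLM expects, independent of image size."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


def answer_question(image, question):
    """Hypothetical wiring of the QA step: the pipeline runs OCR,
    encodes question + words + boxes, picks the span with the highest
    start/end logits, and decodes it back to text."""
    from transformers import pipeline

    qa = pipeline(
        "document-question-answering",
        model="impira/layoutlm-document-qa",
        device=-1,  # CPU, as in the description
    )
    result = qa(image=image, question=question)[0]
    return result["answer"], result["score"]
```

For example, a 400x300-pixel box in an 800x600 image maps to `[0, 0, 500, 500]` after normalization, so boxes from images of any resolution land in the same coordinate range.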
