
Table Outliner Textract

Overview

Table Outliner Textract is a model that detects and extracts tables from document images using the AWS Textract service. It processes each page of a document, using Textract's OCR and table detection capabilities to identify table boundaries, rows, columns, and cells, and then structures the extracted information so that it can be processed for further analysis.
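For orientation, the Textract request for a single page image can be sketched as follows. This is a minimal sketch, not this project's code: analyze_page and page.png are hypothetical names, and the example assumes boto3 is installed with AWS credentials configured. The only service call used is Textract's analyze_document with the TABLES feature type.

```python
def analyze_page(client, image_bytes):
    """Send one page image to Textract and request table analysis.

    `client` is expected to be a boto3 Textract client. It is passed in
    (rather than created here) so the function can be exercised with a
    stand-in object and without network access.
    """
    return client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["TABLES"],
    )


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   with open("page.png", "rb") as f:   # hypothetical input file
#       response = analyze_page(boto3.client("textract"), f.read())
#   print(len(response["Blocks"]), "blocks returned")
```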

How It Works

  1. Image Preprocessing: For each page of the document:

    • The image is prepared for processing, ensuring it meets the input requirements of AWS Textract.
    • Basic image preprocessing steps such as resizing, binarization, and noise reduction may be applied to enhance detection accuracy.
  2. Textract Integration: The processed images are sent to AWS Textract, which:

    • Detects and extracts text from tables within the document.
    • Identifies the structure of tables, including rows, columns, and cells.
  3. Line and Table Structure Identification: AWS Textract returns detailed information about the position and content of detected tables, which includes:

    • Coordinates of table boundaries.
    • Locations of individual cells, rows, and columns.
  4. Data Extraction and Post-Processing: The extracted data is further processed by:

    • Converting the positional data into a structured format.
    • Organizing the detected tables according to their spatial arrangement on the page.
  5. Validation and Filtering: The model ensures that:

    • Only valid table structures are retained.
    • Any extraneous or misidentified lines are filtered out to improve the accuracy of the table extraction.
  6. Final Output: The tables, along with their contents, are written out as structured text in which each table is represented as JSON, ready for further analysis or export to other applications.
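Steps 3 to 6 can be illustrated with a small sketch that turns a Textract AnalyzeDocument response into JSON. It is an illustration under stated assumptions, not the model's actual implementation: tables_from_response and tables_to_json are hypothetical helpers, the output layout (a "bbox" plus a row-major "rows" grid) is an assumed format, and the only validation shown is dropping tables that contain no cells. The block fields it reads (BlockType, Relationships, RowIndex, ColumnIndex, Geometry.BoundingBox) are those of the Textract response.

```python
import json


def tables_from_response(response):
    """Convert a Textract AnalyzeDocument response into a list of tables.

    Each table is a dict holding its bounding box and a row-major grid of
    cell texts. Tables are ordered top-to-bottom, then left-to-right.
    """
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def child_ids(block):
        # Yield the IDs of all CHILD blocks referenced by this block.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    def cell_text(cell):
        # Join the WORD children of a cell into a single string.
        words = [blocks[i]["Text"] for i in child_ids(cell)
                 if blocks[i]["BlockType"] == "WORD"]
        return " ".join(words)

    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = [blocks[i] for i in child_ids(block)
                 if blocks[i]["BlockType"] == "CELL"]
        if not cells:
            # Assumed validation rule: a table without cells is discarded.
            continue
        n_rows = max(c["RowIndex"] for c in cells)
        n_cols = max(c["ColumnIndex"] for c in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in cells:
            # Textract row and column indices are 1-based.
            grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c)
        tables.append({"bbox": block["Geometry"]["BoundingBox"], "rows": grid})

    # Order tables by their position on the page.
    tables.sort(key=lambda t: (t["bbox"]["Top"], t["bbox"]["Left"]))
    return tables


def tables_to_json(response):
    """Serialize the extracted tables as a JSON string."""
    return json.dumps(tables_from_response(response), indent=2)
```

Sorting on the bounding box's Top and then Left values is one simple way to reflect spatial arrangement; multi-column page layouts may need a more careful ordering.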

Installation

Prerequisites

Ensure you have the following before starting the installation:

  • Git installed on your machine
  • Python and pip installed
  • FURY authentication token

Installation Steps

  1. Clone the repository

    git clone --branch prod https://github.com/cognaize/ai-core-table-outliner-textract
  2. Navigate to the project directory

    cd ai-core-table-outliner-textract
  3. Install the required Python packages

    FURY_AUTH=${FURY_AUTH} pip install -r requirements.txt

Usage

Here's a quick-start example of using Table Outliner Textract. It assumes TableTextFormatter has already been imported from the installed package:

# Create an instance of TableTextFormatter with the specified PDF path
model = TableTextFormatter(pdf_path='/path_to_your_document.pdf')

# Extract the tables as structured text
tables_str = model.to_structured_text_table()

# Print the tables
print(tables_str)