Table Outliner
Overview
Table Outliner is a model designed to detect and extract tables from document images. The model processes each page of a document to identify the boundaries of tables using vertical and horizontal line detection. It groups these lines into rows and columns, matches them to form table structures, and filters out any extraneous lines that don't align with the detected table boundaries.
How It Works
- Line Detection: For each page, the model:
  - Converts the image to grayscale.
  - Applies thresholding and inversion to create a binary image.
  - Uses morphological operations to detect vertical and horizontal lines separately.
- Extracting Line Coordinates: The detected lines are converted to coordinates representing the start and end points of each line.
- Grouping Lines into Rows and Columns: The lines are grouped into rows and columns using a clustering algorithm (DBSCAN). Groups with too few lines are filtered out.
- Matching Rows and Columns: The model matches row groups with column groups to form candidate table boundaries. It validates each match by checking for intersections and applying distance thresholds.
- Filtering Extraneous Lines: Horizontal lines that extend beyond the vertical boundaries are filtered out to refine the table structure.
- Constructing Tables: The matched and filtered lines are used to construct tables. The model identifies cell boundaries within the tables and extracts the text contained in each cell.
- Sorting and Returning Tables: The extracted tables are saved and sorted by their position on the page. The results are returned as text for further use.
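The detection, grouping, and filtering steps above can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: a run-length scan stands in for the morphological line detection, and a simple gap-based 1-D grouping stands in for DBSCAN. All function names, parameters (`min_len`, `eps`, `min_lines`), and the choice of clustering key are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Line = Tuple[int, int, int, int]  # (x1, y1, x2, y2), inclusive pixel coordinates


def horizontal_runs(binary: List[List[int]], min_len: int) -> List[Line]:
    """Find horizontal runs of at least min_len foreground (1) pixels."""
    lines: List[Line] = []
    for y, row in enumerate(binary):
        start = None
        for x, value in enumerate(list(row) + [0]):  # sentinel closes a trailing run
            if value and start is None:
                start = x
            elif not value and start is not None:
                if x - start >= min_len:
                    lines.append((start, y, x - 1, y))
                start = None
    return lines


def vertical_runs(binary: List[List[int]], min_len: int) -> List[Line]:
    """Find vertical runs by scanning the transposed image, then swap axes back."""
    transposed = [list(col) for col in zip(*binary)]
    return [(y1, x1, y2, x2) for (x1, y1, x2, y2) in horizontal_runs(transposed, min_len)]


def group_lines(lines: List[Line], key: Callable[[Line], int],
                eps: int, min_lines: int) -> List[List[Line]]:
    """Gap-based 1-D grouping (a simplified stand-in for DBSCAN).

    Lines whose sorted key values are at most eps apart share a group;
    groups with fewer than min_lines members are dropped.
    """
    groups: List[List[Line]] = []
    current: List[Line] = []
    for line in sorted(lines, key=key):
        if current and key(line) - key(current[-1]) > eps:
            groups.append(current)
            current = []
        current.append(line)
    if current:
        groups.append(current)
    return [g for g in groups if len(g) >= min_lines]


def filter_horizontals(h_lines: List[Line], v_lines: List[Line]) -> List[Line]:
    """Keep horizontal lines that stay within the outermost vertical lines."""
    left = min(line[0] for line in v_lines)
    right = max(line[2] for line in v_lines)
    return [line for line in h_lines if line[0] >= left and line[2] <= right]


# Toy binary image: a 2-row table plus one stray rule wider than the table.
HEIGHT, WIDTH = 9, 12
image = [[0] * WIDTH for _ in range(HEIGHT)]
for y in (1, 3, 5):                 # table's horizontal rules
    for x in range(1, 8):
        image[y][x] = 1
for x in (1, 7):                    # table's vertical rules
    for y in range(1, 6):
        image[y][x] = 1
for x in range(WIDTH):              # stray full-width rule below the table
    image[7][x] = 1

h_lines = horizontal_runs(image, min_len=4)
v_lines = vertical_runs(image, min_len=4)
row_groups = group_lines(h_lines, key=lambda l: l[1], eps=2, min_lines=2)
col_groups = group_lines(v_lines, key=lambda l: l[0], eps=6, min_lines=2)
table_rows = filter_horizontals(row_groups[0], col_groups[0])
print(table_rows)
```

In this toy image the stray rule is grouped with the table's horizontal lines because it lies close to them vertically, and it is only removed by the boundary filter. A real pipeline would more likely use an image-processing library for detection and a full DBSCAN implementation for grouping.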
Installation
Prerequisites
Ensure you have the following before starting the installation:
- Git installed on your machine
- Python and pip installed
- FURY authentication token (used as the FURY_AUTH environment variable during installation)
Installation Steps
- Clone the repository
  git clone --branch prod https://github.com/cognaize/ai-core-table-outliner
- Navigate to the project directory
  cd ai-core-table-outliner
- Install the required Python packages
  FURY_AUTH=${FURY_AUTH} pip install -r requirements.txt
Usage
Here's a quick-start example of using Table Outliner:
# Create an instance of TableTextFormatter with the specified PDF path
model = TableTextFormatter(pdf_path='/path_to_your_document.pdf')
# Extract the tables as structured text
tables_str = model.to_structured_text_table()
# Print the tables
print(tables_str)