Table Outliner Textract Model Service

Overview

TableOutlinerTextract detects and extracts tables from document images using the AWS Textract service. The model processes each page of a document, using Azure OCR and table detection capabilities to identify table boundaries, rows, columns, and cells. It then structures the extracted information into a format suited to further analysis.

Installation

Install the required dependencies with pip, passing the FURY_AUTH authentication token as an environment variable:

FURY_AUTH=${FURY_AUTH} pip install --use-deprecated=legacy-resolver -r requirements.txt

Running the Service Locally

Launching the Server:

Ensure you have the following before starting the server:

  • AWS credentials configured for access to Textract
  • The AZURE_OCR_API_KEY and AZURE_OCR_ENDPOINT environment variables set to appropriate values
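
The Azure prerequisites above can be checked before launching the server. The following is a minimal sketch, not part of the service itself: the helper name missing_env_vars is hypothetical, and it only inspects the two Azure variables named above. AWS credentials are not checked, since they can come from several sources.

```python
import os

# Variable names taken from the prerequisites listed above.
REQUIRED_VARS = ("AZURE_OCR_API_KEY", "AZURE_OCR_ENDPOINT")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the process environment; a plain dict can be
    passed instead, which makes the check easy to try out.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_env_vars() with no arguments returns an empty list when both variables are set, and the names of the missing ones otherwise.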

Start the server by running:

python app/server.py

[Optional] Monitor Service Progress in Real-Time via WebSocket:

  • In Postman, set the request type to WebSocket
  • Enter the WebSocket URL: ws://localhost:8000/table-outliner-textract/ws
  • Click Connect
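
The same connection can be opened from a script instead of Postman. The sketch below assumes the third-party websockets package is installed (pip install websockets), which this document does not list as a dependency. The function name watch_progress is hypothetical, and because the format of the progress messages is not documented here, each message is printed unchanged.

```python
import asyncio

# WebSocket URL from the steps above.
WS_URL = "ws://localhost:8000/table-outliner-textract/ws"


async def watch_progress(url=WS_URL):
    """Print every message the service sends until the connection closes."""
    # Assumption: the `websockets` package is available. It is imported
    # here so that the module loads even when the package is missing.
    import websockets

    async with websockets.connect(url) as ws:
        async for message in ws:
            # Message format is not documented, so print it as received.
            print(message)


# To run against a local server: asyncio.run(watch_progress())
```

Start the watcher before uploading a file so that no early updates are missed.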

Uploading a PDF File to Process:

  • Open Postman
  • Set the request type to HTTP
  • Set the method to POST
  • Use the URL http://localhost:8000/table-outliner-textract
  • In the request body (form-data), add a key named file and set its type to File
  • Select the PDF file from your computer that you wish to upload
  • Click Send to upload the file and start the table detection process
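
The Postman steps above amount to a multipart/form-data POST with a single file field named file. A minimal sketch of the same request using only the Python standard library follows. The helper names build_multipart and upload_pdf are hypothetical, and the application/pdf content type is an assumption about what the server accepts.

```python
import os
import urllib.request
import uuid

# Upload URL from the steps above.
UPLOAD_URL = "http://localhost:8000/table-outliner-textract"


def build_multipart(field, filename, data, boundary=None):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"  # assumed content type
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(path, url=UPLOAD_URL):
    """POST the PDF under the form key `file` and return the response text."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server running locally, upload_pdf("uploaded_file.pdf") should return the JSON response as text.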

Viewing the Output:

  • Response Format: The response is a JSON object with the following format:
{
"result": "formatted tables",
"filename": "uploaded_file.pdf"
}
  • Response Example: Here is an example of response output:
{
"result": "\n\n--------------------------------------------\n\n{\"0\":{\"0\":\"\",\"1\":\"Net Sales\",\"2\":\"Cost of Sales\",\"3\":\"Gross Profit\",\"4\":\"Selling and Administrative Expenses (1)\",\"5\":\"Operating Income\",\"6\":\"Interest and Other Non-Operating Expense, net (2)\",\"7\":\"Income Before Taxes\",\"8\":\"Provision for Income Taxes (3)\",\"9\":\"Net Income\"},\"1\":{\"0\":\"Year Ended December 31, 2016\",\"1\":\"$ 235,611,012\",\"2\":\"168,123,270\",\"3\":\"67,487,742\",\"4\":\"54,440,205\",\"5\":\"13,047,537\",\"6\":\"612,464\",\"7\":\"12,435,073\",\"8\":\"4,787,503\",\"9\":\"$ 7,647,570\"}}\n\n--------------------------------------------\n\n{\"0\":{\"0\":\"\",\"1\":\"Assets\",\"2\":\"Current Assets:\",\"3\":\"Cash and Cash Equivalents\",\"4\":\"Receivables, net\",\"5\":\"Product Inventories, net\",\"6\":\"Deferred Income Taxes\",\"7\":\"Prepaid Expenses\",\"8\":\"Total Current Assets\",\"9\":\"Property and Equipment, net\",\"10\":\"Goodwill\",\"11\":\"Other Intangible Assets, net\",\"12\":\"Other Assets\",\"13\":\"Intercompany, net (1)\",\"14\":\"Total Assets\",\"15\":\"Liabilities and Equity\",\"16\":\"Current Liabilities:\",\"17\":\"Accounts Payable\",\"18\":\"Accrued and Other Current Liabilities\",\"19\":\"Current Portion of Other Long-Term Liabilities\",\"20\":\"Total Current Liabilities\",\"21\":\"Long-Term Debt\",\"22\":\"Other Long Term Liabilities\",\"23\":\"Total Liabilities\",\"24\":\"Equity:\",\"25\":\"Capital Stock\",\"26\":\"Paid in Capital (1)\",\"27\":\"Retained Earnings\",\"28\":\"Total Equity\",\"29\":\"Total Liabilities & Equity\"},\"1\":{\"0\":\"December 31, 2016\",\"1\":\"\",\"2\":\"\",\"3\":\"$ 608,212\",\"4\":\"22,424,863\",\"5\":\"43,958,524\",\"6\":\"-\",\"7\":\"626,909\",\"8\":\"67,618,508\",\"9\":\"10,946,826\",\"10\":\"60,830,612\",\"11\":\"10,524,998\",\"12\":\"158,249\",\"13\":\"9,368,532\",\"14\":\"$ 159,447,725\",\"15\":\"\",\"16\":\"\",\"17\":\"$ 15,506,405\",\"18\":\"4,694,330\",\"19\":\"-\",\"20\":\"20,200,735\",\"21\":\"-\",\"22\":\"759,141\",\"23\":\"20,959,876\",\"24\":\"\",\"25\":\"-\",\"26\":\"87,127,267\",\"27\":\"51,360,582\",\"28\":\"138,487,849\",\"29\":\"$ 159,447,725\"}}",
"filename": "horizon_distributors_inc_2016.pdf"
}
  • WebSocket Updates: If connected via WebSocket, you will receive real-time updates on the progress of the table detection process.
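
In the response example above, the result field is a single string in which each table is a JSON object preceded by a line of dashes. Each object maps a column index to a mapping of row index to cell text. The sketch below splits and decodes that string. It is inferred from the example alone, so it assumes that separators are lines made only of dashes and that all keys are numeric strings. The function names parse_tables and table_to_rows are hypothetical.

```python
import json
import re

# A separator is a line consisting only of dashes, as in the example response.
SEPARATOR = re.compile(r"^-{3,}$", re.MULTILINE)


def parse_tables(result):
    """Split the `result` string on separator lines and decode each table."""
    chunks = (chunk.strip() for chunk in SEPARATOR.split(result))
    return [json.loads(chunk) for chunk in chunks if chunk]


def table_to_rows(table):
    """Turn {column: {row: cell}} into a list of rows.

    Columns and rows are ordered by their numeric index; cells missing
    from a column become empty strings.
    """
    columns = sorted(table, key=int)
    row_keys = sorted({key for col in table.values() for key in col}, key=int)
    return [[table[c].get(r, "") for c in columns] for r in row_keys]


# Shortened sample in the same shape as the example response.
sample = '\n\n----\n\n{"0":{"0":"","1":"Net Sales"},"1":{"0":"2016","1":"$ 235,611,012"}}'
rows = table_to_rows(parse_tables(sample)[0])
```

Here rows holds one list per table row, with the label column first and the value column second.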