Table Outliner Textract Model Service
Overview
TableOutlinerTextract is designed to detect and extract tables from document images using the AWS Textract service. The model processes each page of a document, leveraging Azure OCR and table detection capabilities to identify table boundaries, rows, columns, and cells. It then structures the extracted information into a format that can be easily processed for further analysis.
Installation
The required dependencies can be installed using pip along with the FURY_AUTH authentication token:
FURY_AUTH=${FURY_AUTH} pip install --use-deprecated=legacy-resolver -r requirements.txt
Running the Service Locally
Launching the Server:
Ensure you have the following before starting the server:
- AWS credentials configured for access to Textract
- The AZURE_OCR_API_KEY and AZURE_OCR_ENDPOINT environment variables set to appropriate values
Start the server by running:
python app/server.py
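The environment-variable prerequisite listed above can be checked with a short script before launching the server. This is a minimal sketch: the helper name `missing_env_vars` is hypothetical and not part of the service, and AWS credentials are deliberately not checked because they can come from several sources (environment variables, a shared credentials file, or an instance role).

```python
import os

# Variable names taken from the prerequisites listed above.
REQUIRED_VARS = ("AZURE_OCR_API_KEY", "AZURE_OCR_ENDPOINT")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the real process environment; passing a dict
    makes the check easy to try out without touching the shell.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Azure OCR environment variables are set.")
```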
[Optional] Monitor Service Progress in Real-Time via WebSocket:
- In Postman, set the request type to WebSocket
- Enter the WebSocket URL: ws://localhost:8000/table-outliner-textract/ws
- Click Connect
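The same WebSocket endpoint can also be watched from a script. The sketch below assumes the third-party `websockets` package is installed (it is not necessarily listed in requirements.txt) and that the service sends text messages; the message format is not documented here, so each message is printed as received. The function name `watch_progress` is hypothetical.

```python
import asyncio

# URL taken from the steps above.
WS_URL = "ws://localhost:8000/table-outliner-textract/ws"


async def watch_progress(url=WS_URL):
    """Print every message the service sends until the connection closes."""
    # Imported here so this module still loads when `websockets` is absent.
    import websockets

    async with websockets.connect(url) as ws:
        async for message in ws:
            print(message)


# With the server running, start listening before uploading a file:
#     asyncio.run(watch_progress())
```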
Uploading PDF File to Process:
- Open Postman
- Set the request type to HTTP
- Set the method to POST
- Use the URL http://localhost:8000/table-outliner-textract
- Add a form-data key with the type File and name it file
- Select the PDF file from your computer that you wish to upload
- Click Send to upload the file and start the table detection process
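The Postman steps above amount to a multipart/form-data POST with the PDF under the field name `file`. As a rough equivalent using only the Python standard library, the sketch below builds that request by hand. The helper names `build_multipart` and `upload_pdf` are hypothetical, and the `application/pdf` part type is an assumption about what the service accepts.

```python
import json
import uuid
from pathlib import Path
from urllib import request

# URL taken from the steps above.
SERVICE_URL = "http://localhost:8000/table-outliner-textract"


def build_multipart(field, filename, payload, content_type="application/pdf"):
    """Encode one file as multipart/form-data.

    Returns (body_bytes, value_for_the_Content-Type_header).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(path, url=SERVICE_URL):
    """POST the PDF under the form field `file` and return the decoded JSON reply."""
    pdf = Path(path)
    body, content_type = build_multipart("file", pdf.name, pdf.read_bytes())
    req = request.Request(
        url, data=body, method="POST", headers={"Content-Type": content_type}
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# With the server running (the path below is a placeholder):
#     reply = upload_pdf("uploaded_file.pdf")
#     print(reply["filename"])
```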
Viewing the Output:
- Response Format: The response is returned as JSON in the following format:
{
"result": "formatted tables",
"filename": "uploaded_file.pdf"
}
- Response Example: Here is an example response:
{
"result": "\n\n--------------------------------------------\n\n{\"0\":{\"0\":\"\",\"1\":\"Net Sales\",\"2\":\"Cost of Sales\",\"3\":\"Gross Profit\",\"4\":\"Selling and Administrative Expenses (1)\",\"5\":\"Operating Income\",\"6\":\"Interest and Other Non-Operating Expense, net (2)\",\"7\":\"Income Before Taxes\",\"8\":\"Provision for Income Taxes (3)\",\"9\":\"Net Income\"},\"1\":{\"0\":\"Year Ended December 31, 2016\",\"1\":\"$ 235,611,012\",\"2\":\"168,123,270\",\"3\":\"67,487,742\",\"4\":\"54,440,205\",\"5\":\"13,047,537\",\"6\":\"612,464\",\"7\":\"12,435,073\",\"8\":\"4,787,503\",\"9\":\"$ 7,647,570\"}}\n\n--------------------------------------------\n\n{\"0\":{\"0\":\"\",\"1\":\"Assets\",\"2\":\"Current Assets:\",\"3\":\"Cash and Cash Equivalents\",\"4\":\"Receivables, net\",\"5\":\"Product Inventories, net\",\"6\":\"Deferred Income Taxes\",\"7\":\"Prepaid Expenses\",\"8\":\"Total Current Assets\",\"9\":\"Property and Equipment, net\",\"10\":\"Goodwill\",\"11\":\"Other Intangible Assets, net\",\"12\":\"Other Assets\",\"13\":\"Intercompany, net (1)\",\"14\":\"Total Assets\",\"15\":\"Liabilities and Equity\",\"16\":\"Current Liabilities:\",\"17\":\"Accounts Payable\",\"18\":\"Accrued and Other Current Liabilities\",\"19\":\"Current Portion of Other Long-Term Liabilities\",\"20\":\"Total Current Liabilities\",\"21\":\"Long-Term Debt\",\"22\":\"Other Long Term Liabilities\",\"23\":\"Total Liabilities\",\"24\":\"Equity:\",\"25\":\"Capital Stock\",\"26\":\"Paid in Capital (1)\",\"27\":\"Retained Earnings\",\"28\":\"Total Equity\",\"29\":\"Total Liabilities & Equity\"},\"1\":{\"0\":\"December 31, 2016\",\"1\":\"\",\"2\":\"\",\"3\":\"$ 608,212\",\"4\":\"22,424,863\",\"5\":\"43,958,524\",\"6\":\"-\",\"7\":\"626,909\",\"8\":\"67,618,508\",\"9\":\"10,946,826\",\"10\":\"60,830,612\",\"11\":\"10,524,998\",\"12\":\"158,249\",\"13\":\"9,368,532\",\"14\":\"$ 159,447,725\",\"15\":\"\",\"16\":\"\",\"17\":\"$ 15,506,405\",\"18\":\"4,694,330\",\"19\":\"-\",\"20\":\"20,200,735\",\"21\":\"-\",\"22\":\"759,141\",\"23\":\"20,959,876\",\"24\":\"\",\"25\":\"-\",\"26\":\"87,127,267\",\"27\":\"51,360,582\",\"28\":\"138,487,849\",\"29\":\"$ 159,447,725\"}}",
"filename": "horizon_distributors_inc_2016.pdf"
}
- WebSocket Updates: If connected via WebSocket, you'll receive real-time updates on the progress of the table detection process.
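The `result` string in the response example packs several tables into one value: each table is a JSON object preceded by a line of dashes. Judging from that example alone, the outer keys look like column indices and the inner keys like row indices; this layout is an inference, not a documented contract. Under that assumption, a minimal sketch for turning `result` back into rows could look like this (the names `parse_tables` and `to_rows` are hypothetical):

```python
import json
import re

# A line consisting only of dashes separates tables in the example response.
DIVIDER = re.compile(r"\n-{5,}\n")


def parse_tables(result):
    """Split `result` on divider lines and decode each table as JSON."""
    tables = []
    for chunk in DIVIDER.split(result):
        chunk = chunk.strip()
        if chunk:  # skip the empty piece before the first divider
            tables.append(json.loads(chunk))
    return tables


def to_rows(table):
    """Convert {column: {row: cell}} into a list of rows in numeric order."""
    columns = sorted(table, key=int)
    row_keys = sorted(table[columns[0]], key=int) if columns else []
    return [[table[c].get(r, "") for c in columns] for r in row_keys]
```

For a parsed response `reply`, `[to_rows(t) for t in parse_tables(reply["result"])]` yields one list of rows per detected table.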