
PDF to HTML Converter Service

Overview 📖

ai-core-pdf-to-html is a service that uses the cognaize_pdf_to_html library to convert PDF documents into HTML. The service uses aiohttp to handle HTTP and WebSocket connections. For more details about cognaize_pdf_to_html, refer to its README.md.

Installation 🚀

Install the requirements with pip, passing the FURY_AUTH authentication token as an environment variable:

FURY_AUTH=${FURY_AUTH} pip install -r requirements.txt

Running the Service Locally 🛠️

The following step-by-step guide shows how to use the service with Postman.

Launch the Server:

python app/server.py

[Optional] To monitor the service's progress in real-time via WebSocket:

  • Set the request type to WebSocket
  • Enter the WebSocket URL: ws://localhost:8000/ws
  • Click Connect
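
The same monitoring can be done from a script instead of Postman. The following is a minimal sketch using the aiohttp client, assuming the server sends its progress updates as plain text frames; the message format is not documented here, and watch_progress is a hypothetical helper name, not part of the service.

```python
import asyncio
from typing import List, Optional

import aiohttp

WS_URL = "ws://localhost:8000/ws"  # WebSocket URL from the steps above


async def watch_progress(url: str = WS_URL,
                         max_messages: Optional[int] = None) -> List[str]:
    """Print and collect text messages until the server closes the socket.

    Assumption: progress updates arrive as text frames. Other frame
    types (binary, ping, ...) are ignored in this sketch.
    """
    received: List[str] = []
    async with aiohttp.ClientSession() as session:
        async with session.ws_connect(url) as ws:
            # The loop ends on its own when the connection is closed.
            async for msg in ws:
                if msg.type == aiohttp.WSMsgType.TEXT:
                    print(msg.data)
                    received.append(msg.data)
                    if max_messages is not None and len(received) >= max_messages:
                        break
                elif msg.type == aiohttp.WSMsgType.ERROR:
                    break
    return received
```

With the server running, asyncio.run(watch_progress()) prints each message as it arrives and returns the collected list once the connection closes.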

Uploading PDF File to Process:

In Postman,

  • Set the request type to HTTP
  • Set the method to POST
  • Use the URL http://localhost:8000/pdf-to-html
  • In the Body tab, select form-data, add a key named file, and set its type to File
  • Select the PDF file from your computer that you wish to upload
  • Click Send to upload the file and start the HTML Conversion process
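
The Postman steps above amount to a multipart/form-data POST with a single field named file. The following is a minimal sketch of the same request using only the Python standard library. The helper names encode_pdf_upload and upload_pdf are hypothetical, and because the response format is not documented here, the sketch returns the raw response bytes.

```python
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple

UPLOAD_URL = "http://localhost:8000/pdf-to-html"  # URL from the steps above


def encode_pdf_upload(pdf_name: str, pdf_bytes: bytes,
                      field: str = "file") -> Tuple[bytes, str]:
    """Return (body, content_type) for a one-file multipart/form-data request.

    Assumption: pdf_name contains no double quotes or line breaks.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{pdf_name}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + pdf_bytes + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(path: str, url: str = UPLOAD_URL) -> bytes:
    """POST the PDF at `path` and return the raw response body.

    urlopen raises urllib.error.HTTPError for non-2xx responses.
    """
    pdf = Path(path)
    body, content_type = encode_pdf_upload(pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the server running, upload_pdf("report.pdf") sends the file and returns whatever the service responds with; report.pdf is a placeholder path.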

Parameters:

  • charLimit (int, optional): Maximum number of characters per chunk. Default is 12000. Keep chunks small enough that the output generated for each one stays within the model's maximum of 4096 output tokens.
  • modelName (ModelType, optional): Claude model to use for processing. Default is ModelType.SONNET3; other options include "SONNET3_5".
  • ocrType (OcrType, optional): OCR engine to use for text extraction. Default is OcrType.DOCTR; other options include "AZURE".

Pass these parameters in the query string of the request URL, for example: http://localhost:8000/pdf-to-html?ocrType=AZURE&modelName=SONNET3_5&charLimit=15000
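
When calling the service from a script, the query string can be assembled rather than typed by hand. The following is a small sketch using urllib.parse.urlencode. The function name build_request_url is hypothetical; only the parameter names and example values come from the list above, and omitted parameters are left out so the service applies its defaults.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000/pdf-to-html"


def build_request_url(ocr_type: Optional[str] = None,
                      model_name: Optional[str] = None,
                      char_limit: Optional[int] = None,
                      base_url: str = BASE_URL) -> str:
    """Build the request URL, including only the parameters that were given."""
    params = {}
    if ocr_type is not None:
        params["ocrType"] = ocr_type
    if model_name is not None:
        params["modelName"] = model_name
    if char_limit is not None:
        params["charLimit"] = char_limit
    if not params:
        return base_url
    return f"{base_url}?{urlencode(params)}"


print(build_request_url("AZURE", "SONNET3_5", 15000))
# http://localhost:8000/pdf-to-html?ocrType=AZURE&modelName=SONNET3_5&charLimit=15000
```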