PDF to HTML Converter Package

Overview 📖

cognaize_pdf_to_html is a Python library that converts PDF documents into HTML. The HTML output gives a well-structured text representation that supports data retrieval, with tables, lists, and headers identified.

The PDF document is first converted into spatial text by cognaize_spatial_text, Cognaize's proprietary AI core service, which extracts text from PDFs while preserving the document's original spatial layout. Please refer to Spatial Text Package for more information.

The spatial text is then divided into batches of chunks. Batching allows large PDFs to be handled with multi-processing, while individual chunks are sized so as not to exceed Claude's maximum generation length. Claude converts each text chunk to HTML. Making use of Claude's large context window, the tail end of the previously generated HTML is passed as context when generating the next chunk's HTML, so that HTML structures continue correctly across chunk boundaries. Finally, the HTML chunks from all batches are concatenated into a single, complete HTML file.
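
The continuation mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: convert_chunks, fake_to_html, and the tail_chars parameter are hypothetical names, and the Claude call is replaced by a stand-in function.

```python
from typing import Callable, List


def convert_chunks(
    chunks: List[str],
    to_html: Callable[[str, str], str],
    tail_chars: int = 500,  # hypothetical size of the context tail
) -> str:
    """Convert chunks in order, passing the tail of the previous HTML as context."""
    parts: List[str] = []
    previous_tail = ""
    for chunk in chunks:
        html = to_html(chunk, previous_tail)
        parts.append(html)
        # Keep only the end of the latest HTML as context for the next chunk.
        previous_tail = html[-tail_chars:] if tail_chars > 0 else ""
    # The final document is the plain concatenation of all chunk outputs.
    return "".join(parts)


def fake_to_html(chunk: str, previous_tail: str) -> str:
    """Stand-in for the LLM call: wraps each non-empty line in <p> tags."""
    return "".join(f"<p>{line}</p>" for line in chunk.splitlines() if line.strip())


if __name__ == "__main__":
    print(convert_chunks(["first line\nsecond line", "third line"], fake_to_html))
```

In a real pipeline the to_html callable would place previous_tail in the prompt so the model can continue an open table or list instead of starting a new one.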

Features ✨

🔄 PDF to Spatial Text Conversion

  • Converts PDF files into spatial text format, preserving original layout information
  • Handles both local PDF files and PDFs from URLs
  • Supports 3 OCR Types

🧩 Intelligent Text Chunking

  • Breaks down spatial text into manageable chunks at smart break-off points
  • Optimizes chunk size for efficient processing and AI generation
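
As a rough illustration of chunking at break-off points, the sketch below splits text into chunks no longer than a character limit, preferring to cut at a blank line and falling back to a line end. The function name chunk_text and the choice of break points are assumptions for illustration; the package's actual break-off logic is not documented here.

```python
from typing import List


def chunk_text(text: str, char_limit: int = 12000) -> List[str]:
    """Split text into chunks of at most char_limit characters.

    Prefers to cut after a blank line, then after any line end, and only
    cuts mid-line when the window contains no newline at all.
    """
    if char_limit <= 0:
        raise ValueError("char_limit must be positive")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + char_limit, len(text))
        if end < len(text):
            window = text[start:end]
            cut = window.rfind("\n\n")
            if cut > 0:
                end = start + cut + 2  # keep the blank line with the earlier chunk
            else:
                cut = window.rfind("\n")
                if cut > 0:
                    end = start + cut + 1  # keep the newline with the earlier chunk
        chunks.append(text[start:end])
        start = end
    return chunks
```

Because every character ends up in exactly one chunk, joining the chunks reproduces the original text.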

🤖 AI-Powered HTML Generation

  • Utilizes Claude AI for intelligent conversion of text chunks into structured HTML
  • Maintains document continuity and structure across chunks

🧠 Advanced Prompt Engineering

  • Provides careful instructions to Claude AI to:
    • Avoid unwanted content generation
    • Handle continuation portions effectively
  • Guides the AI with detailed reasoning in choosing appropriate HTML elements:
    • Headings (h3)
    • Paragraphs (p) for body text
    • Tables (table, tr, td) for tabular data
    • Lists (ul, ol, li) for ordered and unordered sequences
    • Line breaks (br) for maintaining spacing and layout

Efficient Processing

  • Employs multi-processing for handling large PDFs
  • Optimizes performance for documents of varying sizes
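
One way to picture batch-level parallelism is the sketch below, which groups chunks into batches and maps a worker over them with Python's standard concurrent.futures executors. The names make_batches, join_batch, and run_batches are hypothetical, and the worker is a stand-in; how the package actually schedules its batches is an assumption here.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def make_batches(chunks: Sequence[str], batch_size: int) -> List[List[str]]:
    """Group consecutive chunks into batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(chunks[i:i + batch_size]) for i in range(0, len(chunks), batch_size)]


def join_batch(batch: List[str]) -> str:
    """Stand-in worker: a real worker would convert each chunk to HTML."""
    return "".join(batch)


def run_batches(
    batches: List[List[str]],
    worker: Callable[[List[str]], str],
    use_processes: bool = True,
    max_workers: Optional[int] = None,
) -> List[str]:
    """Run worker on every batch in parallel and return results in input order."""
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with executor_cls(max_workers=max_workers) as executor:
        # Executor.map yields results in input order, so outputs can be
        # concatenated directly afterwards.
        return list(executor.map(worker, batches))
```

With use_processes=True the worker must be a module-level function so that it can be pickled, and on platforms that spawn new interpreters the call should sit under an if __name__ == "__main__": guard.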

📊 Real-Time Conversion Logging

  • Provides detailed, real-time updates on all aspects of the conversion process
  • Tracks progress for PDF loading, text extraction, chunking, and HTML generation

📄 Comprehensive HTML Output

  • Generates a single, cohesive HTML file from the concatenation of processed chunks
  • Produces well-structured HTML with proper element hierarchy

📁 Local Output Storage

  • Stores the HTML output locally as an HTML file for easy access and further processing

Installation 🚀

The package can be installed with pip, using a FURY_AUTH authentication token:

pip install --index-url "https://${FURY_AUTH}:@pypi.fury.io/cognaize/" --extra-index-url "https://pypi.org/simple/" cognaize_ai_core

Usage 🛠️

Here's a quick start example for cognaize_pdf_to_html:

from cognaize_pdf_to_html.pdf_to_html import ConvertPDFToHTML
model = ConvertPDFToHTML(pdf_source="")  # local path or URL of the PDF
model.convert()  # run the conversion
model.save_html_output(output_file="")  # path of the HTML file to write

Advanced Usage 🎓

The ConvertPDFToHTML class offers several customizable parameters for advanced use cases:

from cognaize_pdf_to_html.pdf_to_html import ConvertPDFToHTML
from cognaize_pdf_to_html.tools.config import ModelType
from cognaize_spatial_text.config import OcrType

model = ConvertPDFToHTML(
    pdf_source="path/to/your/pdf.pdf",
    char_limit=16000,  # larger chunk size, to evaluate chunk continuity
    model=ModelType.SONNET3_5,  # stronger model
    ocr_type=OcrType.AZURE,
    batch_size=10,  # makes for quicker parallel processing
    callback=None,
)

model.convert()
model.save_html_output(output_file="")

Parameters 🌌

  • model (ModelType, optional): Claude model to use for processing. Default is ModelType.SONNET3. (Other: SONNET3_5)
  • ocr_type (OcrType, optional): OCR engine to use for text extraction. Default is OcrType.AZURE. (Other: DOCTR)
  • char_limit (int, optional): Maximum number of characters per chunk. Default is 12000. (Other: any int between 7000 and 16000)
  • batch_size (int, optional): Number of text chunks converted to HTML per batch. Lower values are quicker, because batches are processed in parallel.
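
To see how char_limit and batch_size interact, the following sketch estimates how many chunks and batches a document of a given length would produce. The helper estimate_work is hypothetical and not part of the package; it assumes chunks fill the limit exactly, so the chunk count is a lower bound, and its range check simply mirrors the 7000 to 16000 range documented above.

```python
import math
from typing import Tuple


def estimate_work(total_chars: int, batch_size: int, char_limit: int = 12000) -> Tuple[int, int]:
    """Return (minimum number of chunks, number of batches) for a document."""
    if not 7000 <= char_limit <= 16000:
        raise ValueError("char_limit should be between 7000 and 16000")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if total_chars < 0:
        raise ValueError("total_chars must be non-negative")
    # Chunks cut at break-off points are usually shorter than the limit,
    # so the real chunk count can be higher than this estimate.
    n_chunks = math.ceil(total_chars / char_limit)
    n_batches = math.ceil(n_chunks / batch_size)
    return n_chunks, n_batches


if __name__ == "__main__":
    # 100,000 characters at the default limit: at least 9 chunks, 3 batches of up to 4.
    print(estimate_work(100_000, batch_size=4))
```

For the same document, a smaller batch_size yields more batches, which gives parallel workers more units to share.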

Step-by-Step Conversion 🪜

from cognaize_pdf_to_html.pdf_to_html import ConvertPDFToHTML

model = ConvertPDFToHTML("path/to/your/pdf.pdf")

# Convert PDF to spatial text
model.convert_to_spatial_text()

# Chunk the spatial text
model.chunk_spatial_text()

# Process chunks and generate HTML files
model.process_chunks()

# Combine HTML files
model.combine_html_files()

# Save the final HTML output
model.save_html_output()

Output Examples 🎥

The following shows the conversion PDF.pdf -> SpatialText.txt -> HTML.html. All three files are located in the resources folder.

PDF

View PDF.png or PDF.pdf in the resources folder.

Spatial Text

Consolidated Statements of Changes in Member's Interest

For the Nine Months Ended September 30, 2019 and Year Ended December 31, 2018
(Unaudited)
(In Thousands)

Accumulated
Other
Contributed Comprehensive Accumulated Noncontrolling Total Member's
Capital Income Deficit Member's Interest Interest Deficit
Balance at December 31, 2017 $ 3,826,940 $ 27,472 $ (5,221,217) $ (1,366,805) $ 124,176 $ (1,242,629)
Cumulative-flect adjustment (5,963) 5,963
Adjusted balance at January 1, 2018 3,826,940 21,509 (5,215,254) (1,366,805) 124,176 (1,242,629)
Net income (93,235) (93,235) 205,675 112,440
Total other comprehensive income (loss) (12,549) (12,549) 2,979 (9,570)
Distributions paid to Parent 46,049 (4,145,270) (4,099,221) (4,099,221)
Distributions to noncontrolling interest, net (191,084) (191,084)
Capital returned to Parent (25,877) (25,877) (25,877)
Parent stock-based compensation 200,651 200,651 200,651
Balance at December 31, 2018 4,047,763 8,960 (9,453,759) (5,397,036) 141,746 (5,255,290)
Cumulative-effect adjustment 2,971 2,971 2,971
Adjusted balance at January 1, 2019 4,047,763 8,960 (9,450,788) (5,394,065) 141,746 (5,252,319)
Net income 616,033 616,033 155,629 771,662
Total other comprehensive income (loss) (28,107) (28,107) 4,407 (23,700)
Distributions paid to Parent (47,625) (47,625) (47,625)
Distributions to noncontrolling interest, net (150,342) (150,342)
Purchase ofs shares by noncontrolling interest 80 80 96 176
Parent stock-based compensation 100,158 100,158 100,158
Balance at September 30, 2019 $ 4,148,002 $ (19,147) $ (8,882,380) $ (4,753,525) $ 151,536 $ (4,601,989)

Cumulative-effect adjustments relate to adoption of the following new accounting guidance:
1
ASU No. 2017-12, Targeted Improvements to Accounting. for Hedging Activities (Topic 815) & ASUNO. 2018-02, Reclassification of Certain Tax Effects from Accumulated Other
Comprehensive. Income (Topic 220) require the elimination of previously recognized ineffectiveness in the Statement of Operations to AOCI and permits reclassification of the income tax
effects of 2017 tax reform on items within AOCI to retained earnings through a cumulative-effect adjustment as oft the beginning of the period of adoption.
2
ASU No. 2014-09, Revenue from Contracts with Customers (Topic 606) The core principle requires an entity to recognize revenue. to depict the transfer of goods or services to customers
in an amount that reflects the consideration that it expects to be entitled to in exchange for those goods or services. The Company will complete its assessment of the standard during 2019 and adopt it effective as of January 1, 2019 on a modified retrospective basis which results in the recognition of a cumulative effect of adoption as an adjustment to beginning retained earnings, rather than retrospectively adjusting prior periods. This is the cumulative effect of adoption identified to date.

HTML

Consolidated Statements of Changes in Member's Interest

For the Nine Months Ended September 30, 2019 and Year Ended December 31, 2018

(Unaudited)

(In Thousands)

| | Contributed Capital | Accumulated Other Comprehensive Income | Accumulated Deficit | Member's Interest | Noncontrolling Interest | Total Member's Deficit |
| --- | --- | --- | --- | --- | --- | --- |
| Balance at December 31, 2017 | $ 3,826,940 | $ 27,472 | $ (5,221,217) | $ (1,366,805) | $ 124,176 | $ (1,242,629) |
| Cumulative-effect adjustment | | (5,963) | 5,963 | | | |
| Adjusted balance at January 1, 2018 | 3,826,940 | 21,509 | (5,215,254) | (1,366,805) | 124,176 | (1,242,629) |
| Net income | | | (93,235) | (93,235) | 205,675 | 112,440 |
| Total other comprehensive income (loss) | | (12,549) | | (12,549) | 2,979 | (9,570) |
| Distributions paid to Parent | 46,049 | | (4,145,270) | (4,099,221) | | (4,099,221) |
| Distributions to noncontrolling interest, net | | | | | (191,084) | (191,084) |
| Capital returned to Parent | (25,877) | | | (25,877) | | (25,877) |
| Parent stock-based compensation | 200,651 | | | 200,651 | | 200,651 |
| Balance at December 31, 2018 | 4,047,763 | 8,960 | (9,453,759) | (5,397,036) | 141,746 | (5,255,290) |
| Cumulative-effect adjustment | | | 2,971 | 2,971 | | 2,971 |
| Adjusted balance at January 1, 2019 | 4,047,763 | 8,960 | (9,450,788) | (5,394,065) | 141,746 | (5,252,319) |
| Net income | | | 616,033 | 616,033 | 155,629 | 771,662 |
| Total other comprehensive income (loss) | | (28,107) | | (28,107) | 4,407 | (23,700) |
| Distributions paid to Parent | | | (47,625) | (47,625) | | (47,625) |
| Distributions to noncontrolling interest, net | | | | | (150,342) | (150,342) |
| Purchase of shares by noncontrolling interest | 80 | | | 80 | 96 | 176 |
| Parent stock-based compensation | 100,158 | | | 100,158 | | 100,158 |
| Balance at September 30, 2019 | $ 4,148,002 | $ (19,147) | $ (8,882,380) | $ (4,753,525) | $ 151,536 | $ (4,601,989) |

Cumulative-effect adjustments relate to adoption of the following new accounting guidance:

  1. ASU No. 2017-12, Targeted Improvements to Accounting for Hedging Activities (Topic 815) & ASU No. 2018-02, Reclassification of Certain Tax Effects from Accumulated Other Comprehensive Income (Topic 220) require the elimination of previously recognized ineffectiveness in the Statement of Operations to AOCI and permits reclassification of the income tax effects of 2017 tax reform on items within AOCI to retained earnings through a cumulative-effect adjustment as of the beginning of the period of adoption.

  2. ASU No. 2014-09, Revenue from Contracts with Customers (Topic 606) The core principle requires an entity to recognize revenue to depict the transfer of goods or services to customers in an amount that reflects the consideration that it expects to be entitled to in exchange for those goods or services. The Company will complete its assessment of the standard during 2019 and adopt it effective as of January 1, 2019 on a modified retrospective basis which results in the recognition of a cumulative effect of adoption as an adjustment to beginning retained earnings, rather than retrospectively adjusting prior periods. This is the cumulative effect of adoption identified to date.

HTML Visualized

View HTML.html or HTML.png in the resources folder for a clearer representation.
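
Because the output uses table, tr, and td elements for tabular data, rows can be read back with Python's standard html.parser module. The sketch below is one possible approach, not part of the package: the TableRows class and extract_rows function are hypothetical names, and the sample markup is an assumed shape for one row of the statement above, since the exact markup of HTML.html is not reproduced here.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRows(HTMLParser):
    """Collects the text of every td/th cell, grouped by tr row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html: str) -> List[List[str]]:
    """Return all table rows in the HTML as lists of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Assumed markup for illustration; values are taken from the example above.
SAMPLE = (
    "<table>"
    "<tr><th></th><th>Contributed Capital</th><th>Total Member's Deficit</th></tr>"
    "<tr><td>Balance at December 31, 2018</td><td>4,047,763</td><td>(5,255,290)</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

Empty cells come back as empty strings, so column positions are preserved for rows that leave some columns blank.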