Spatial Text Package

Overview

cognaize_spatial_text is a Python library that extracts text from PDF documents with Optical Character Recognition (OCR) while preserving each page's spatial layout. It computes the spacing between words and lines from the width and height of the characters on each page and from the size of the page image, so the output closely reflects the format of the original document. The library can also report progress in real time through a callback function, which you can connect to a GUI for user notifications or to a socket for process updates. In the output, each new page is marked with a line of dashes that shows where one page ends and the next begins.
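
To make the spacing idea concrete, here is a minimal sketch of how word boxes could be mapped onto a character grid. It is an illustration only, not the library's implementation: the WordBox tuple, render_page, and the char_width and line_height parameters are hypothetical names, and the real package derives its measurements from the OCR output.

```python
from typing import Dict, List, Tuple

# Hypothetical word box: (text, x_left_px, y_top_px). Not the library's data model.
WordBox = Tuple[str, float, float]


def render_page(words: List[WordBox], char_width: float, line_height: float) -> str:
    """Place each word on a character grid derived from its pixel position."""
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in words:
        row = int(y // line_height)  # vertical position -> output line index
        col = int(x // char_width)   # horizontal position -> output column
        rows.setdefault(row, []).append((col, text))

    lines = []
    for row in range(max(rows) + 1 if rows else 0):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            # If the target column is already occupied, keep one space between words.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + text
        lines.append(line)  # rows with no words become empty lines
    return "\n".join(lines)


sample = [("Total", 0, 0), ("42", 200, 0), ("Net", 0, 40), ("7", 200, 40)]
print(render_page(sample, char_width=10, line_height=20))
```

In this sketch, horizontal gaps become runs of spaces and vertical gaps become empty lines, which is why a two-column row keeps its alignment in plain text.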

Features

  • Spatial Text Extraction: Utilizes the Cognaize Doctr OCR model to extract text and create spatial text, preserving the original layout of the document. The necessary weight files for the OCR model are automatically downloaded.
  • Real-time Progress Updates: Provides real-time progress updates via callback functions, facilitating effective monitoring and management of the process.
  • Adaptive Processing: Lets you specify the number of workers and the number of images processed simultaneously, so you can balance speed against resource use.
  • Multi-Platform Compatibility: Designed to operate seamlessly across various platforms, ensuring robust and flexible integration into existing workflows.
  • Local Output Storage: Saves the spatial text output locally as a text file for easy access and further processing.

Installation

Install the package with pip. The FURY_AUTH variable must hold your authentication token:

pip install --index-url "https://${FURY_AUTH}:@pypi.fury.io/cognaize/" --extra-index-url "https://pypi.org/simple/" cognaize_spatial_text

Usage

Here is a quick-start example for cognaize_spatial_text:

from cognaize_spatial_text.spatial_text_creator import SpatialTextCreator

# Create an instance of SpatialTextCreator with the specified PDF path
model = SpatialTextCreator(pdf_path='/path_to_your_document.pdf')

# Create the spatial text using the default parameter values
model.create_spatial_text()

# Print the generated spatial text
print(model.spatial_text)

# Save the output
model.save_to_txt(output_txt_path='/path_to_save.txt')
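
Because each page in the output is marked with a line of dashes, the text can be split back into pages after extraction. The sketch below assumes the separator is a line made up only of dashes (five or more); the exact separator length is not stated here, and split_pages is a hypothetical helper, not part of the package. A table rule made only of dashes in the source document would also match, so treat this as a starting point.

```python
import re
from typing import List

# Assumption: a page separator is a line containing only dashes (five or more).
PAGE_BREAK = re.compile(r"^\s*-{5,}\s*$")


def split_pages(spatial_text: str) -> List[str]:
    """Split spatial text into per-page strings at dash-only separator lines."""
    pages: List[str] = []
    current: List[str] = []
    for line in spatial_text.splitlines():
        if PAGE_BREAK.match(line):
            # A separator before any content is ignored; later ones close a page.
            if current or pages:
                pages.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        pages.append("\n".join(current))
    return pages


sample = "page one\n  text\n----------\npage two\n----------\n"
for number, page in enumerate(split_pages(sample), start=1):
    print(f"[page {number}]")
    print(page)
```

With the quick-start code above, the same helper could be applied to model.spatial_text once create_spatial_text() has finished.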

Advanced Usage

The create_spatial_text method accepts several optional parameters that you can adjust to your needs:

  • callback (function, optional): A function called with progress messages while the spatial text is being created. Use it to log messages or update a user interface. This is particularly useful for long-running tasks (large PDFs) or GUI applications where users need real-time updates.
  • num_workers (int, optional): The number of worker processes used while creating the spatial text. Increasing this number can significantly speed up the process, because multiple pages are processed in parallel. Defaults to os.cpu_count() - 1.
  • image_number_to_process (int, optional): The number of images loaded into RAM simultaneously. Lower it to reduce memory usage. Defaults to 17.
  • stick_coords (bool, optional): Set to True to use the minimal area of each OCR word box.

from cognaize_spatial_text.spatial_text_creator import SpatialTextCreator

def progress_callback(message):
    print(f"Update: {message}")

model = SpatialTextCreator(pdf_path='example-1.pdf')
model.create_spatial_text(
    stick_coords=True,
    callback=progress_callback,
    num_workers=4,
    image_number_to_process=15
)
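
A callback does not have to print. As a rough sketch, the helper below forwards each message to a logger and a queue that a GUI thread could poll, and a second helper mirrors the documented os.cpu_count() - 1 default. Both helpers are hypothetical and not part of the package. The sketch assumes the callback is invoked in the main process with a single message argument, as in the example above; if the library calls it from worker processes, a multiprocessing-safe queue would be needed instead. Whether the library itself guards against a worker count below 1 is not stated, so the clamp here is an assumption.

```python
import logging
import os
import queue
from typing import Any, Callable


def default_num_workers() -> int:
    """Mirror the documented default (os.cpu_count() - 1), clamped to at least 1."""
    return max(1, (os.cpu_count() or 2) - 1)


def make_queue_callback(q: queue.Queue) -> Callable[[Any], None]:
    """Build a callback that logs each progress message and enqueues it."""
    logger = logging.getLogger("spatial_text")

    def callback(message: Any) -> None:
        logger.info("%s", message)
        q.put(str(message))  # a GUI thread could poll this queue

    return callback


progress: queue.Queue = queue.Queue()
on_progress = make_queue_callback(progress)

# Stand-in for a call made by the library during processing.
on_progress("page 1 processed")
print(progress.get_nowait(), "| workers:", default_num_workers())
```

The resulting on_progress function could then be passed as callback=on_progress, and default_num_workers() as num_workers, in the create_spatial_text call shown above.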

Output Examples

Here are the outputs for example-1.pdf and example-2.pdf, both located in the resources folder.

Example 1

Abbott Pakistan

Financial liabilities at amortised cost

After initial recognition, interest-bearing loans and borrowings are subsequently measured at amortised cost using
the EIR method. Gains and losses are recognised in profit or loss when the liabilities are derecognised as well as
through the EIR amortisation process.

Amortised cost is calculated by taking into account any discount or premium on acquisition and fees or costs that
are an integral part of the EIR. The EIR amortisation is included as finance costs in the statement of profit or loss.

This category generally applies to interest-bearing loans and borrowings.

The Company has not designated any financial liability as at fair value through profit or loss.

Derecognition

A financial liability is derecognised when the obligation under the liability is discharged or cancelled or expires.
When an existing financial liability is replaced by another from the same lender on substantially different terms,
or the terms of an existing liability are substantially modified, such an exchange or modification is treated as the
derecognition of the original liability and the recognition of a new liability. The difference in the respective carrying
amounts is recognised in the statement of profit or loss.

c) Offsetting of financial instruments

Financial assets and financial liabilities are offset and the net amount is reported in the statement of financial posi-
tion if there is currently an enforceable legal right to offset the recognised amounts and there is an intention to settle
on a net basis, to realise the assets and settle the liabilities simultaneously.

2.2.11 Contract balances

a) Contract liabilities

A contract liability is recognised if a payment is received from a customer before the Company transfers the related
goods. Contract liabilities are recognised as revenue when the Company transfers control of the related goods to
the customer.

b) Trade debts

Ai trade debt is recognized if an amount of consideration that is unconditional is due from the customer (i.e., only the
passage of time is required before payment of the consideration is due).

2.2.12 Segment reporting

Segment reporting is based on the operating (business) segments of the Company. An operating segment is an iden-
tifiable component of the Company that engages in business activities from which it may earn revenues and incur
expenses, including revenues and expenses that relate to transactions with any of the Company's other components
and for which discrete financial information is available. An operating segment's operating results are reviewed regularly
by the Chief Operating Decision Maker (CODM) to make decisions about resources to be allocated to the segment and
assess its performance. The Company reports segment information separately that meets the quantitative thresholds
as defined under IFRS 8, i.e. 10 percent or more. of the combined revenue, profit or loss or assets.

Segment results that are reported to the CODM include items directly attributable to a segment as well as those that can
be allocated on a reasonable basis. Unallocated items comprise mainly corporate assets, income tax assets liabilities
and related income and expenditure. Segment capital expenditure is the total cost incurred during the year to acquire
property, plant and equipment.

The business segments are engaged in providing products which are subject to risks and rewards which differ from the
risk and rewards of other segments. Segments reported are as follows:

Pharmaceutical

The Pharmaceutical segment is engaged in the manufacture, import and marketing of branded generic pharmaceutical
products registered with the Drug Regulatory Authority of Pakistan.

142


Example 2

  1. OPERATING PROFIT AND NON-RECURRING ITEMS

The following non-recurring items are included in operating profit for the period:

(a) Gain on disposal of available for sale financial assets of USS154.4million. The Group realised a gain on disposal of
shares held in Equinox Minerals Limited previously held as available for sale financial assets of US$152.1 million;
and

(b) Write back of business acquisition expenses of US$63.8 million. The Group has benefited from the write back of
business acquisition costs accrued in 2010 in respect of the acquisition of MMG.

  1. INCOME TAX EXPENSE

No provision for Hong Kong profits tax has been made, as the Group has tax losses brought forward to offset the
assessable profit generated in Hong Kong for the period (2010: US$Nil). Taxation on profits arising from other
jurisdictions has been calculated on the estimated assessable profits for the year at the rates prevailing in the relevant
jurisdictions.
Six months ended 30 June
2011 2010
(Unaudited and
(Unaudited) restated)
US$ million US$ million
Current income tax expense
Overseas income tax (120.2) (52.1)
Deferred income tax (24.0) 28.8
Income tax expense (144.2) (23.3)

36


Resources

The examples presented above are extracted from publicly accessible financial documents found online.