Snapshot Converter Package

Overview

snapshot_converter is a Python package that converts snapshot documents into CSV, JSON, Excel, or ODD (an XML format built on TEI elements) output. It extracts and processes the document data and saves it while preserving the document's structure and attributes. Progress is reported through a callback function, so the converter can be integrated with systems that need to monitor conversion in real time.

Features

  • Document Conversion: Converts snapshot documents into CSV, JSON, Excel, and ODD.
  • Real-time Progress Updates: Reports progress through a user-supplied callback function.
  • Configurable Output: Lets you choose the output format and toggle which document attributes (group names, group IDs, page numbers, tag coordinates, and others) are included.
  • Multiprocessing Support: Uses multiple processes to speed up the conversion of large datasets.
  • Local Output Storage: Saves the processed output to a local directory for further processing.

Installation

Install the package with pip. The FURY_AUTH environment variable must hold your authentication token:

pip install --index-url "https://${FURY_AUTH}:@pypi.fury.io/cognaize/" --extra-index-url "https://pypi.org/simple/" snapshot_converter

Usage

Here is a quick-start example:

from snapshot_converter.convert import SnapshotConverter

# Initialize the SnapshotConverter with the specified snapshot path
converter = SnapshotConverter(
    snapshot_path='/path_to_your_snapshot',
    output_dir='/path_to_output_directory',
    output_format='csv'
)

# Process the snapshot and generate the output
converter.process_snapshot()

# The output file will be saved in the specified output directory

Advanced Usage

Parameters

The SnapshotConverter class is highly customizable with several parameters that can be adjusted according to your needs:

  • snapshot_path (str): Path to the snapshot file.
  • output_dir (str): Directory to save the output files.
  • output_format (str): Format of the output files ('csv', 'json', 'xlsx', 'odd').
  • preferable_spreads (List[str], optional): List of preferred spreads to include.
  • include_trs (bool, optional): Include text representations. Default is True.
  • include_python_names (bool, optional): Include Python names for each field. Default is True.
  • include_spread_values_parsed (bool, optional): Include parsed spread values. Default is True.
  • include_spread_values_raw (bool, optional): Include raw spread values. Default is True.
  • include_group_name (bool, optional): Include group name for each field. Default is True.
  • include_group_id (bool, optional): Include group IDs for each field. Default is True.
  • include_tag_coordinates (bool, optional): Include coordinates information. Default is True.
  • include_matching_table_index (bool, optional): Include matching table index. Default is True.
  • include_cell_location_in_table (bool, optional): Include cell location in table. Default is True.
  • include_page (bool, optional): Include page number. Default is True.
  • n_processes (int, optional): Number of processes for multiprocessing. Default is 10.
  • callback (function, optional): Callback function for real-time progress updates.

The following example sets every parameter explicitly:

from snapshot_converter.convert import SnapshotConverter

def progress_callback(message):
    # Called with each progress update during conversion
    print(f"Update: {message}")

# Initialize the SnapshotConverter with advanced parameters
converter = SnapshotConverter(
    snapshot_path='example-snapshot',
    output_dir='/path_to_output_directory',
    output_format='json',
    preferable_spreads=['spread1', 'spread2'],
    include_trs=True,
    include_python_names=True,
    include_spread_values_parsed=True,
    include_spread_values_raw=True,
    include_group_name=True,
    include_group_id=True,
    include_tag_coordinates=True,
    include_matching_table_index=True,
    include_cell_location_in_table=True,
    include_page=True,
    n_processes=4,
    callback=progress_callback
)

# Process the snapshot and generate the output
converter.process_snapshot()
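
The callback above only prints each update. A minimal sketch of a callback that also logs the elapsed time and keeps a history of messages is shown below. It assumes the callback receives a single message argument, as in the example above; the factory name make_progress_callback and the logger name are hypothetical, and the two direct calls stand in for updates the converter would send.

```python
import logging
import time


def make_progress_callback(logger_name="snapshot_converter.progress"):
    """Return a callback that logs each update with the elapsed time.

    Assumption: the converter calls the callback with one message argument.
    """
    logger = logging.getLogger(logger_name)
    started = time.monotonic()
    history = []  # every message received, in order

    def callback(message):
        elapsed = time.monotonic() - started
        history.append(str(message))
        logger.info("[%6.1fs] %s", elapsed, message)

    # Expose the history so it can be inspected after the run
    callback.history = history
    return callback


logging.basicConfig(level=logging.INFO)
progress_callback = make_progress_callback()

# Stand-in calls; in real use, pass callback=progress_callback to SnapshotConverter
progress_callback("Processing document 1 of 3")
progress_callback("Processing document 2 of 3")
```

If the callback turns out to be invoked from worker processes rather than the parent process, the in-memory history would not be shared between them, so treat that part as an assumption to verify.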

Output Examples

The following examples show the output produced for a snapshot in each format.

Example 1: CSV Output


Document ID,Group name,Group ID,Python name,Text Representation,Value Parsed,OCR Value,Page,Matching Table Index,Cell Location in Table,Tag Coordinates
643d8e728d6d4900107b5048,Cash,38e39462-8882-11ea-b84a-0242ac130007__643d8e728d6d4900107b5048,mmas_current_assets_v_1_4_1__cash__current_value,Caja y cobranzas a depositar,46204,46.204,9,"(9, 0)","(1, 4)","{'left': 41.3779, 'top': 15.9852, 'height': 1.0907, 'width': 13.0366}"
643d8e728d6d4900107b5048,Cash,38e39462-8882-11ea-b84a-0242ac130007__643d8e728d6d4900107b5048,mmas_current_assets_v_1_4_1__cash__text_representation,Caja y cobranzas a depositar,Caja y cobranzas a depositar,Caja y cobranzas a depositar,9,"(9, 0)","(0, 4)","{'left': 11.0, 'top': 15.9852, 'height': 1.0907, 'width': 30.377899999999997}"
643d8e728d6d4900107b5048,Cash,bb8b8798-3498-416d-8f73-90b42124634c__643d8e728d6d4900107b5048,mmas_current_assets_v_1_4_1__cash__current_value,Bancos,532816,532.816,9,"(9, 0)","(1, 5)","{'left': 41.3779, 'top': 17.0759, 'height': 0.9468999999999994, 'width': 13.0366}"
643d8e728d6d4900107b5048,Cash,bb8b8798-3498-416d-8f73-90b42124634c__643d8e728d6d4900107b5048,mmas_current_assets_v_1_4_1__cash__text_representation,Bancos,Bancos,Bancos,9,"(9, 0)","(0, 5)","{'left': 11.0, 'top': 17.0759, 'height': 0.9468999999999994, 'width': 30.377899999999997}"
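
In the CSV output, the Matching Table Index, Cell Location in Table, and Tag Coordinates cells hold Python-style literals (tuples and a dict) stored as text. The sketch below shows one way to turn them back into Python objects with ast.literal_eval. The sample uses a subset of the columns, and it assumes a comma delimiter with double-quoted cells; check the actual file before relying on either. The helper name parse_structured_cells is hypothetical.

```python
import ast
import csv
import io

# Columns whose cells contain Python literals stored as text
STRUCTURED_COLUMNS = ("Matching Table Index", "Cell Location in Table", "Tag Coordinates")

# Assumption: comma-delimited with double-quoted cells; only some columns are shown
SAMPLE_CSV = '''Text Representation,Page,Matching Table Index,Cell Location in Table,Tag Coordinates
Caja y cobranzas a depositar,9,"(9, 0)","(1, 4)","{'left': 41.3779, 'top': 15.9852, 'height': 1.0907, 'width': 13.0366}"
'''


def parse_structured_cells(row):
    """Return a copy of the row with the structured cells converted to Python objects."""
    parsed = dict(row)
    for column in STRUCTURED_COLUMNS:
        if parsed.get(column):
            # literal_eval only accepts literals, so it is safe on untrusted text
            parsed[column] = ast.literal_eval(parsed[column])
    return parsed


rows = [parse_structured_cells(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
first = rows[0]
print(first["Matching Table Index"], first["Tag Coordinates"]["width"])
```

Other columns, such as Page and OCR Value, are left as strings so that values like 46.204 are not silently reinterpreted as floats.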


Example 2: JSON Output

               
{
    "64397dfc8db69c0011f797ca": [],
    "643d8e728d6d4900107b5048": [
        {
            "Group name": "Cash",
            "Group ID": "38e39462-8882-11ea-b84a-0242ac130007__643d8e728d6d4900107b5048",
            "Python name": "mmas_current_assets_v_1_4_1__cash__current_value",
            "Text Representation": "Caja y cobranzas a depositar",
            "Value Parsed": "46204",
            "OCR Value": "46.204",
            "Page": 9,
            "Matching Table Index": [
                9,
                0
            ],
            "Cell Location in Table": [
                1,
                4
            ],
            "Tag Coordinates": {
                "left": 41.3779,
                "top": 15.9852,
                "height": 1.0907,
                "width": 13.0366
            }
        },
        {
            "Group name": "Cash",
            "Group ID": "38e39462-8882-11ea-b84a-0242ac130007__643d8e728d6d4900107b5048",
            "Python name": "mmas_current_assets_v_1_4_1__cash__text_representation",
            "Text Representation": "Caja y cobranzas a depositar",
            "Value Parsed": "Caja y cobranzas a depositar",
            "OCR Value": "Caja y cobranzas a depositar",
            "Page": 9,
            "Matching Table Index": [
                9,
                0
            ],
            "Cell Location in Table": [
                0,
                4
            ],
            "Tag Coordinates": {
                "left": 11.0,
                "top": 15.9852,
                "height": 1.0907,
                "width": 30.377899999999997
            }
        },
        {
            "Group name": "Cash",
            "Group ID": "bb8b8798-3498-416d-8f73-90b42124634c__643d8e728d6d4900107b5048",
            "Python name": "mmas_current_assets_v_1_4_1__cash__current_value",
            "Text Representation": "Bancos",
            "Value Parsed": "532816",
            "OCR Value": "532.816",
            "Page": 9,
            "Matching Table Index": [
                9,
                0
            ],
            "Cell Location in Table": [
                1,
                5
            ],
            "Tag Coordinates": {
                "left": 41.3779,
                "top": 17.0759,
                "height": 0.9468999999999994,
                "width": 13.0366
            }
        },
        {
            "Group name": "Cash",
            "Group ID": "bb8b8798-3498-416d-8f73-90b42124634c__643d8e728d6d4900107b5048",
            "Python name": "mmas_current_assets_v_1_4_1__cash__text_representation",
            "Text Representation": "Bancos",
            "Value Parsed": "Bancos",
            "OCR Value": "Bancos",
            "Page": 9,
            "Matching Table Index": [
                9,
                0
            ],
            "Cell Location in Table": [
                0,
                5
            ],
            "Tag Coordinates": {
                "left": 11.0,
                "top": 17.0759,
                "height": 0.9468999999999994,
                "width": 30.377899999999997
            }
        }
    ]
}
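
In the JSON output, each document ID maps to a list of records, and a value record and its text label share the same Group ID. The sketch below, which uses a trimmed copy of the example above, pairs them into a label-to-value mapping per document. It assumes that Python names end in __current_value or __text_representation, as they do in this example, and that parsed values are integer strings. The helper name labelled_values is hypothetical.

```python
import json

# Trimmed copy of the example output; only the keys used below are kept
SAMPLE_JSON = """
{
  "64397dfc8db69c0011f797ca": [],
  "643d8e728d6d4900107b5048": [
    {"Group ID": "38e39462-8882-11ea-b84a-0242ac130007__643d8e728d6d4900107b5048",
     "Python name": "mmas_current_assets_v_1_4_1__cash__current_value",
     "Value Parsed": "46204"},
    {"Group ID": "38e39462-8882-11ea-b84a-0242ac130007__643d8e728d6d4900107b5048",
     "Python name": "mmas_current_assets_v_1_4_1__cash__text_representation",
     "Value Parsed": "Caja y cobranzas a depositar"},
    {"Group ID": "bb8b8798-3498-416d-8f73-90b42124634c__643d8e728d6d4900107b5048",
     "Python name": "mmas_current_assets_v_1_4_1__cash__current_value",
     "Value Parsed": "532816"},
    {"Group ID": "bb8b8798-3498-416d-8f73-90b42124634c__643d8e728d6d4900107b5048",
     "Python name": "mmas_current_assets_v_1_4_1__cash__text_representation",
     "Value Parsed": "Bancos"}
  ]
}
"""


def labelled_values(records):
    """Map each group's text label to its parsed value, pairing records by Group ID."""
    labels, values = {}, {}
    for record in records:
        group_id = record["Group ID"]
        if record["Python name"].endswith("__text_representation"):
            labels[group_id] = record["Value Parsed"]
        elif record["Python name"].endswith("__current_value"):
            # Assumption: parsed values are integer strings, as in this example
            values[group_id] = int(record["Value Parsed"])
    return {labels[g]: values[g] for g in values if g in labels}


documents = json.loads(SAMPLE_JSON)
summary = {doc_id: labelled_values(records) for doc_id, records in documents.items()}
print(summary)
```

Documents with no extracted fields, such as the first ID here, simply produce an empty mapping.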


Example 3: ODD Output

<?xml version='1.0' encoding='UTF-8'?>
<schemaSpec xmlns:ns0="http://www.tei-c.org/ns/1.0" xmlns:ns1="http://relaxng.org/ns/structure/1.0" xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:rng="http://relaxng.org/ns/structure/1.0" ident="snapshotConverter"><ns0:teiHeader><ns0:fileDesc><ns0:titleStmt><ns0:title>Snapshot Converter Output</ns0:title><ns0:author>Snapshot Converter</ns0:author></ns0:titleStmt><ns0:publicationStmt><ns0:p>Generated by Snapshot Converter</ns0:p></ns0:publicationStmt><ns0:sourceDesc><ns0:p>Converted document snapshot data</ns0:p></ns0:sourceDesc></ns0:fileDesc></ns0:teiHeader><ns0:moduleRef key="tei" /><ns0:elementSpec ident="64355f0dff56ff0010deeb6e" /><ns0:elementSpec ident="64397dfc8db69c0011f797ca" /><ns0:elementSpec ident="643d8e728d6d4900107b5048"><ns0:content><ns1:ref name="macro.Group name">Cash</ns1:ref><ns1:ref name="macro.Group ID">38e39462-8882-11ea-b84a-0242ac130007__643d8e728d6d4900107b5048</ns1:ref><ns1:ref name="macro.Python name">mmas_current_assets_v_1_4_1__cash__current_value</ns1:ref><ns1:ref name="macro.Text Representation">Caja y cobranzas a depositar</ns1:ref><ns1:ref name="macro.Value Parsed">46204</ns1:ref><ns1:ref name="macro.OCR Value">46.204</ns1:ref><ns1:ref name="macro.Page">9</ns1:ref><ns1:ref name="macro.Matching Table Index">(9, 0)</ns1:ref><ns1:ref name="macro.Cell Location in Table">(1, 4)</ns1:ref><ns1:ref name="macro.Tag Coordinates">{"left": 41.3779, "top": 15.9852, "height": 1.0907, "width": 13.0366}</ns1:ref></ns0:content><ns0:content><ns1:ref name="macro.Group name">Cash</ns1:ref><ns1:ref name="macro.Group ID">38e39462-8882-11ea-b84a-0242ac130007__643d8e728d6d4900107b5048</ns1:ref><ns1:ref name="macro.Python name">mmas_current_assets_v_1_4_1__cash__text_representation</ns1:ref><ns1:ref name="macro.Text Representation">Caja y cobranzas a depositar</ns1:ref><ns1:ref name="macro.Value Parsed">Caja y cobranzas a depositar</ns1:ref><ns1:ref name="macro.OCR Value">Caja y cobranzas a depositar</ns1:ref><ns1:ref name="macro.Page">9</ns1:ref><ns1:ref 
name="macro.Matching Table Index">(9, 0)</ns1:ref><ns1:ref name="macro.Cell Location in Table">(0, 4)</ns1:ref><ns1:ref name="macro.Tag Coordinates">{"left": 11.0, "top": 15.9852, "height": 1.0907, "width": 30.377899999999997}</ns1:ref></ns0:content><ns0:content><ns1:ref name="macro.Group name">Cash</ns1:ref><ns1:ref name="macro.Group ID">bb8b8798-3498-416d-8f73-90b42124634c__643d8e728d6d4900107b5048</ns1:ref><ns1:ref name="macro.Python name">mmas_current_assets_v_1_4_1__cash__current_value</ns1:ref><ns1:ref name="macro.Text Representation">Bancos</ns1:ref><ns1:ref name="macro.Value Parsed">532816</ns1:ref><ns1:ref name="macro.OCR Value">532.816</ns1:ref><ns1:ref name="macro.Page">9</ns1:ref><ns1:ref name="macro.Matching Table Index">(9, 0)</ns1:ref><ns1:ref name="macro.Cell Location in Table">(1, 5)</ns1:ref><ns1:ref name="macro.Tag Coordinates">{"left": 41.3779, "top": 17.0759, "height": 0.9468999999999994, "width": 13.0366}</ns1:ref></ns0:content><ns0:content><ns1:ref name="macro.Group name">Cash</ns1:ref><ns1:ref name="macro.Group ID">bb8b8798-3498-416d-8f73-90b42124634c__643d8e728d6d4900107b5048</ns1:ref><ns1:ref name="macro.Python name">mmas_current_assets_v_1_4_1__cash__text_representation</ns1:ref><ns1:ref name="macro.Text Representation">Bancos</ns1:ref><ns1:ref name="macro.Value Parsed">Bancos</ns1:ref><ns1:ref name="macro.OCR Value">Bancos</ns1:ref><ns1:ref name="macro.Page">9</ns1:ref><ns1:ref name="macro.Matching Table Index">(9, 0)</ns1:ref><ns1:ref name="macro.Cell Location in Table">(0, 5)</ns1:ref><ns1:ref name="macro.Tag Coordinates">{"left": 11.0, "top": 17.0759, "height": 0.9468999999999994, "width": 30.377899999999997}</ns1:ref></ns0:content><ns0:content><ns1:ref name="macro.Group name">Total Accts/Rec-Net</ns1:ref><ns1:ref name="macro.Group ID">be0e9964-5611-4390-9526-02a1d1ab9cc8__643d8e728d6d4900107b5048</ns1:ref><ns1:ref name="macro.Python name">mmas_current_assets_v_1_4_1__total_accts_rec_net__current_value</ns1:ref><ns1:ref 
name="macro.Text Representation">Deudores por Exportaciones</ns1:ref><ns1:ref name="macro.Value Parsed">4714142</ns1:ref><ns1:ref name="macro.OCR Value">4,714.142</ns1:ref><ns1:ref name="macro.Page">32</ns1:ref><ns1:ref name="macro.Matching Table Index">(32, 1)</ns1:ref><ns1:ref name="macro.Cell Location in Table">(1, 4)</ns1:ref><ns1:ref name="macro.Tag Coordinates">{"left": 36.8593, "top": 52.1594, "height": 2.821800000000003, "width": 10.835799999999999}</ns1:ref></ns0:content></ns0:elementSpec></schemaSpec>
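
The ODD output stores one elementSpec per document, one content element per record, and one ref element per field, with the field name prefixed by "macro.". The sketch below reads that structure back into plain dictionaries with the standard library's xml.etree.ElementTree. The sample is a shortened, hand-indented copy of the example above, and the function name read_odd is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the example output; the prefixes here are arbitrary
NS = {
    "tei": "http://www.tei-c.org/ns/1.0",
    "rng": "http://relaxng.org/ns/structure/1.0",
}

# Shortened copy of the example output, indented for readability
SAMPLE_ODD = """<?xml version='1.0' encoding='UTF-8'?>
<schemaSpec xmlns:ns0="http://www.tei-c.org/ns/1.0" xmlns:ns1="http://relaxng.org/ns/structure/1.0" ident="snapshotConverter">
  <ns0:elementSpec ident="64397dfc8db69c0011f797ca" />
  <ns0:elementSpec ident="643d8e728d6d4900107b5048">
    <ns0:content>
      <ns1:ref name="macro.Group name">Cash</ns1:ref>
      <ns1:ref name="macro.Text Representation">Bancos</ns1:ref>
      <ns1:ref name="macro.Value Parsed">532816</ns1:ref>
    </ns0:content>
  </ns0:elementSpec>
</schemaSpec>
"""

PREFIX = "macro."


def read_odd(xml_bytes):
    """Return {document ID: [record dict, ...]} from ODD output."""
    root = ET.fromstring(xml_bytes)
    documents = {}
    for spec in root.findall("tei:elementSpec", NS):
        records = []
        for content in spec.findall("tei:content", NS):
            record = {}
            for ref in content.findall("rng:ref", NS):
                name = ref.get("name", "")
                # Drop the "macro." prefix to recover the column name
                key = name[len(PREFIX):] if name.startswith(PREFIX) else name
                record[key] = ref.text
            records.append(record)
        documents[spec.get("ident")] = records
    return documents


odd_documents = read_odd(SAMPLE_ODD.encode("utf-8"))
print(odd_documents)
```

All field values come back as strings, so numeric fields such as Page or Value Parsed need converting separately.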


Resources

The examples presented above are extracted from publicly accessible financial documents found online.