Tax Form Package
Overview
The Tax Form Package is a specialized suite for extracting data from tax form documents. It uses the Cognaize Tax Form Extractor and CheckboxDetection models to identify and extract relevant information from various types of tax forms, and to detect and process checkboxes within those documents. Typical applications range from financial analysis to academic research.
Key Highlights
- Advanced Detection: Employs the Cognaize Tax Form Extractor and CheckboxDetection models to accurately extract data from tax form documents.
- Real-time Progress Updates: Supports callback functions that report processing status in real time, making long runs easier to monitor and control.
- Versatile Applications: Applicable to a wide range of use cases, including financial document analysis and academic research.
- Seamless Integration: Easily integrates with other data processing pipelines and tools, enhancing your document workflow.
How It Works
The package processes documents according to their form type, such as 1065, 1120, and 1120S. The workflow involves:
- Document Type and Page Classification: Identifying the type of tax form and classifying the pages containing the main information.
- Tax Form Extraction: Using the Tax Form Extractor model to extract relevant data.
- Checkbox Detection: Detecting checkboxes on the pages that contain tax form data and classifying each one as marked or unmarked.
Features
- Structure Recognition: Trains models to recognize and classify different types of documents and their specific sections.
- Evaluation: Assesses model performance and exports results to a spreadsheet for further analysis.
- Inference: Applies trained models to new documents to extract tax form data.
Installation
Install the package with pip. The command reads your authentication token from the FURY_AUTH variable:
pip install --index-url "https://${FURY_AUTH}:@pypi.fury.io/cognaize/" --extra-index-url "https://pypi.org/simple/" cognaize_tax_form
Usage
Here's a quick-start example using cognaize_tax_form:
from cognaize_tax_form.tax_form_extractor import TaxFormExtractor
# Create an instance of TaxFormExtractor with the specified PDF path
model = TaxFormExtractor(pdf_path='/path_to_your_document.pdf', callback='')
# Extract the tax form data, leaving all other attributes at their default values
extracted_data = model.tax_form_extractor(
    page_classifier_weight_file_path='/path_to_your_weight_file'
)
# Print the generated tax form data
print(extracted_data)
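If the extracted data needs to be kept for later analysis, one option is to write it to a JSON file. The following is a minimal sketch, assuming the returned object is a JSON-serializable dictionary like the one in the output example; `save_result` and `load_result` are hypothetical helpers and are not part of cognaize_tax_form.

```python
import json
import tempfile
from pathlib import Path


def save_result(extracted_data, output_path):
    """Write the extraction result to a JSON file and return its path.

    Hypothetical helper: assumes `extracted_data` is JSON-serializable.
    """
    path = Path(output_path)
    path.write_text(
        json.dumps(extracted_data, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return path


def load_result(input_path):
    """Read a previously saved extraction result back into a dictionary."""
    return json.loads(Path(input_path).read_text(encoding="utf-8"))


# Round-trip demonstration with a small stand-in for real extractor output
sample = {"result": {"document_type": ["1065"]}, "filename": "example.pdf"}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_result(sample, Path(tmp_dir) / "example.json")
    restored = load_result(saved_path)
```

In real use, `sample` would be replaced by the value returned from `model.tax_form_extractor(...)`.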
Parameters
- callback (function, optional): A function called during extraction to log messages or update a user interface with progress reports. This is particularly useful for long-running tasks (large PDFs) or GUI applications where users need real-time updates.
- page_classifier_weight_file_path: Path to the trained model weights that enable the TaxFormExtractor to classify pages in the PDF document accurately. This classification is a critical step for extracting structured tax form data from the document.
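As an illustration of the callback parameter, here is a minimal sketch of a progress callback that both logs and records each report. The exact arguments the extractor passes to the callback are not documented here, so this sketch assumes a single positional message; `make_progress_callback` is a hypothetical helper, not part of the package.

```python
import logging

logger = logging.getLogger("tax_form_progress")


def make_progress_callback(history):
    """Return a callback that logs each progress message and stores it in `history`."""

    def on_progress(message):
        # Assumption: the extractor calls the callback with one positional argument.
        text = str(message)
        history.append(text)
        logger.info("tax form extraction: %s", text)

    return on_progress


progress_messages = []
progress_callback = make_progress_callback(progress_messages)

# Hypothetical wiring, following the quick-start call:
# model = TaxFormExtractor(pdf_path='/path_to_your_document.pdf',
#                          callback=progress_callback)

# Simulate two progress reports to show what the callback records
progress_callback("classifying pages")
progress_callback("detecting checkboxes")
```

Keeping the messages in a list makes it straightforward to display them in a GUI or inspect them after a long run. If the real callback signature differs, only `on_progress` needs to change.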
Output Example
Here is the output for the PDF example.pdf, located in the resources folder.
Example
{
"result": {
"m1_m2_page_number": [
"5"
],
"bs_page_number": [
"5"
],
"is_page_number": [
"1"
],
"document_type": [
"1065"
],
"is_continuation_page_number": [
"4"
],
"schedule_k_1065__schedule_k_text_representation_0": "Net rental real estate income (loss)",
"schedule_k_1065__schedule_k_value_0": "115,340.",
"schedule_l_1065__schedule_l_text_representation_0": "Less accumulated depreciation",
"schedule_l_1065__schedule_l_value_0": "235,819.",
"schedule_l_1065__schedule_l_text_representation_1": "Less accumulated depreciation",
"schedule_l_1065__schedule_l_value_1": "1,155,402.",
"schedule_l_1065__schedule_l_text_representation_2": "Less accumulated depreciation",
"schedule_l_1065__schedule_l_value_2": "271,360.",
"schedule_l_1065__schedule_l_text_representation_3": "Less accumulated depreciation",
"schedule_l_1065__schedule_l_value_3": "1,119,861.",
"schedule_l_1065__schedule_l_text_representation_4": "Less accumulated amortization",
"schedule_l_1065__schedule_l_value_4": "4,208.",
"schedule_l_1065__schedule_l_text_representation_5": "Less accumulated amortization",
"schedule_m_1_m_2_1065__schedule_m_1_m_2_text_representation_0": "Depreciation expense",
"schedule_m_1_m_2_1065__schedule_m_1_m_2_value_0": "235,819.",
"schedule_m_1_m_2_1065__schedule_m_1_m_2_text_representation_1": "Add lines 6 and 7",
"schedule_m_1_m_2_1065__schedule_m_1_m_2_value_1": "71,000.",
"schedule_m_1_m_2_1065__schedule_m_1_m_2_text_representation_2": "Balance at the beginning of year",
"schedule_m_1_m_2_1065__schedule_m_1_m_2_value_2": "406,183.",
"schedule_m_1_m_2_1065__schedule_m_1_m_2_text_representation_3": "Balance at end of year",
"schedule_m_1_m_2_1065__schedule_m_1_m_2_value_3": "450,523.",
"checkbox__checkbox_text_representation_0": "Initial return",
"checkbox__checkbox_value_0": "Unmarked",
"checkbox__checkbox_text_representation_1": "Final return",
"checkbox__checkbox_value_1": "Unmarked",
"checkbox__checkbox_text_representation_2": "Name change",
"checkbox__checkbox_value_2": "Unmarked",
"checkbox__checkbox_text_representation_3": "Address change",
"checkbox__checkbox_value_3": "Unmarked",
"checkbox__checkbox_text_representation_4": "Amended return",
"checkbox__checkbox_value_4": "Unmarked",
"checkbox__checkbox_text_representation_5": "Cash",
"checkbox__checkbox_value_5": "Marked",
"checkbox__checkbox_text_representation_6": "Accrual",
"checkbox__checkbox_value_6": "Unmarked",
"checkbox__checkbox_text_representation_7": "Other (specify)",
"checkbox__checkbox_value_7": "Unmarked",
"checkbox__checkbox_text_representation_8": "Check if Schedules C and M-3 are attached",
"checkbox__checkbox_value_8": "Unmarked",
"checkbox__checkbox_text_representation_9": "Aggregated activities for section 465 at-risk purposes",
"checkbox__checkbox_value_9": "Marked",
"checkbox__checkbox_text_representation_10": "Grouped activities for section 469 passive activity purposes",
"checkbox__checkbox_value_10": "Marked",
"checkbox__checkbox_text_representation_11": "Yes",
"checkbox__checkbox_value_11": "Unmarked"
},
"filename": "example.pdf"
}
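The result uses flattened keys in which each label (`..._text_representation_N`) is matched to a value (`..._value_N`) by its index. The sketch below groups those pairs by section and converts amount strings to numbers. The key pattern is inferred from this example output only, and `group_entries` and `parse_amount` are hypothetical helpers, not part of the package.

```python
import re
from collections import defaultdict
from decimal import Decimal

# Key layout inferred from the example output:
#   <section>__<field>_text_representation_<index>
#   <section>__<field>_value_<index>
_KEY_PATTERN = re.compile(
    r"^(?P<section>.+?)__.+_(?P<kind>text_representation|value)_(?P<index>\d+)$"
)


def group_entries(result):
    """Group flattened keys into {section: [{"label": ..., "value": ...}, ...]}."""
    grouped = defaultdict(dict)
    for key, raw in result.items():
        match = _KEY_PATTERN.match(key)
        if match is None:
            continue  # metadata such as "document_type" or "bs_page_number"
        slot = "label" if match["kind"] == "text_representation" else "value"
        rows = grouped[match["section"]]
        rows.setdefault(int(match["index"]), {"label": None, "value": None})[slot] = raw
    # Order each section's rows by their numeric index
    return {section: [rows[i] for i in sorted(rows)] for section, rows in grouped.items()}


def parse_amount(text):
    """Convert an amount string such as "115,340." to a Decimal."""
    # Assumption: amounts use commas as thousands separators and may end with "."
    return Decimal(text.replace(",", "").rstrip("."))


# A few entries copied from the example output above
sample_result = {
    "document_type": ["1065"],
    "schedule_k_1065__schedule_k_text_representation_0": "Net rental real estate income (loss)",
    "schedule_k_1065__schedule_k_value_0": "115,340.",
    "schedule_l_1065__schedule_l_text_representation_5": "Less accumulated amortization",
    "checkbox__checkbox_text_representation_5": "Cash",
    "checkbox__checkbox_value_5": "Marked",
    "checkbox__checkbox_text_representation_6": "Accrual",
    "checkbox__checkbox_value_6": "Unmarked",
}

grouped = group_entries(sample_result)
marked = [row["label"] for row in grouped["checkbox"] if row["value"] == "Marked"]
```

A label without a matching value, such as `schedule_l_1065__schedule_l_text_representation_5` in the example, is kept with its value set to `None` rather than dropped. With the full output, `group_entries` would be applied to the dictionary stored under the `"result"` key.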