Financial Statement Analysis Pipeline
Case Study
Financial Document Analysis with GenAI
1. Introduction
Financial data extraction is a critical task in banking and finance, where accurate and structured information is needed for decision-making. This project focuses on automating the extraction of financial metrics from PDF documents using GenAI, allowing for structured tabular outputs and high-level financial summaries.
2. Objective
The goal of this project is to:
- Extract key financial metrics (Revenue, Expenses, Net Income, etc.) from structured PDF financial reports
- Process and store the extracted information in structured CSV format (under /data)
- Generate an AI-powered financial summary report that provides insights into the company's financial health
3. Methodology & Thought Process
Given the context of the problem, I treated these principles as most important:
- Getting the figures right is paramount: a lower-end model may be accurate a fair amount of the time, but that does not offset the risk of it eventually hallucinating a number
- Not all LLMs are equal
3.1 Extracting Text & Preprocessing Data
Thought Process: I first did some basic research into whether this task had been solved before, and whether OCR tools, a PDF text extractor, or an LLM would be the best solution (weighing accuracy above all).
After some reading I came across various OCR tools (AWS Textract, marker) for extracting financial data from PDFs, and the open-source gmft library had good reviews. However, while gmft posts strong benchmark metrics, it performed very poorly on this PDF type.
Final Approach:
- PDF Parsing: Extract raw text using pdfplumber. While noticeably slower than PyPDF2, it dropped fewer characters, a better tradeoff when accuracy matters
- Cleaning: Removed headers, footers, and page numbers to ensure clean data (and fewer tokens for the LLM to consume)
- Segmentation: Used regex-based splitting to break the document into individual financial statements
- Validation: Verified that all CSV values parse as floats (numbers, not stray 'O's in place of zeros), and manually cross-checked the CSVs against the PDF for inaccuracies (a sketch of these steps follows this list)
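A minimal sketch of the extraction and segmentation steps. pdfplumber is the library the project uses; the header/footer filter and the statement headings in the regex are hypothetical stand-ins that would be tuned to the actual documents.

```python
import re
import pdfplumber

def extract_clean_text(pdf_path: str) -> str:
    """Pull raw text from each page and drop page-number-only lines."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            # Keep every line that is not a bare page number.
            lines = [l for l in text.splitlines()
                     if not re.fullmatch(r"\s*\d+\s*", l)]
            pages.append("\n".join(lines))
    return "\n".join(pages)

# Hypothetical statement headings; the lookahead keeps each heading
# attached to the segment it starts.
STATEMENT_PATTERN = re.compile(
    r"(?=^(?:Income Statement|Balance Sheet|Cash Flow Statement))",
    re.MULTILINE | re.IGNORECASE,
)

def segment_statements(text: str) -> list[str]:
    """Split the cleaned document into one chunk per financial statement."""
    return [s for s in STATEMENT_PATTERN.split(text) if s.strip()]
```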
3.2 Financial Data Extraction Using GenAI
Thought Process: Manual Parsing vs. AI-Based Extraction: I initially experimented with writing custom regex and heuristics to extract structured data from tables. However, the variation in formatting made this approach unsatisfactory.
Selecting The Correct GenAI Model:
- Local Models: Initially tried DeepSeek-R1 (1.5B) served via ollama serve, but it performed poorly, with repeated values and hallucinations
- Online APIs: Switched to the Gemini 2.0 Flash API because of:
- Free-tier availability
- Large context window (1M tokens)
- Better handling of complex tables with minimal hallucinations
Final Approach:
- Sent each segmented table to Gemini with a structured prompt
- Extracted structured JSON with table_name and csv_data
- Saved CSV files for downstream processing
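A minimal sketch of the per-table extraction call, assuming the google-generativeai client; the prompt wording and the save helper are illustrative, not the exact ones used in the project.

```python
import json
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

# Illustrative structured prompt asking for the two expected JSON keys.
PROMPT = (
    "Extract the financial table below as JSON with keys 'table_name' and "
    "'csv_data' (the table as CSV text). Preserve negative signs and output "
    "plain numbers only.\n\n{table}"
)

def extract_table(segment: str) -> dict:
    """Send one segmented statement to Gemini and parse the JSON reply."""
    response = model.generate_content(
        PROMPT.format(table=segment),
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)  # {"table_name": ..., "csv_data": ...}

def save_csv(result: dict, out_dir: str = "data") -> None:
    """Write the extracted CSV text under /data for downstream steps."""
    path = os.path.join(out_dir, f"{result['table_name']}.csv")
    with open(path, "w") as f:
        f.write(result["csv_data"])
```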
3.3 Generating the Summary Report
Final Approach:
- Read extracted CSVs from previous step
- Sent them to Gemini 2.0 to generate a markdown-based financial summary
- Converted the markdown summary into a well-structured PDF report using markdownpdf
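A minimal sketch of the summary step, assuming the same Gemini client as above and the markdown-pdf Python package's MarkdownPdf/Section API; both the prompt and the exact PDF library are assumptions.

```python
import os
from pathlib import Path
import google.generativeai as genai
from markdown_pdf import MarkdownPdf, Section  # assumed PDF backend

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

def generate_summary(csv_dir: str = "data") -> str:
    """Concatenate the extracted CSVs and ask Gemini for a markdown summary."""
    tables = "\n\n".join(p.read_text() for p in sorted(Path(csv_dir).glob("*.csv")))
    prompt = (
        "You are a financial analyst. Using the tables below, summarise the "
        "company's financial health (revenue, expenses, trends) in markdown:\n\n"
        + tables
    )
    return model.generate_content(prompt).text

def write_pdf(markdown_text: str, out_path: str = "summary_report.pdf") -> None:
    """Render the markdown summary as a structured PDF report."""
    pdf = MarkdownPdf()
    pdf.add_section(Section(markdown_text))
    pdf.save(out_path)
```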
3.4 Considering Scale
For scaling preprocessing: The pipeline currently uses a mix of manually curated table names and regex patterns, but more varied datasets may need NLP-based methods (like spaCy).
Retrieval-Augmented Generation (RAG): To improve reliability and efficiency at scale, a RAG-based system would offer:
- Pre-processed Database: Storing extracted financial data eliminates repeated PDF processing
- Contextual Retrieval: Retrieving only the relevant document segments keeps responses accurate and grounded
- Efficient Processing: Scales to large document collections without reprocessing (a minimal retrieval sketch follows this list)
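A minimal sketch of the contextual-retrieval idea. The embed() function here is a hypothetical placeholder for a real embedding model; the point is only to show pre-processed segments being embedded once and queried many times.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in a real embedding model (e.g. an embeddings API)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

class SegmentStore:
    """Pre-processed database: segments are embedded once at ingest time."""

    def __init__(self, segments: list[str]):
        self.segments = segments
        self.vectors = np.stack([embed(s) for s in segments])

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Cosine similarity; vectors are unit-normalised, so a dot product suffices.
        scores = self.vectors @ embed(query)
        return [self.segments[i] for i in np.argsort(scores)[::-1][:k]]
```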
Cloud Infrastructure Considerations:
- AWS Lambda Functions: Automating PDF processing on upload
- Database Integration: Storing processed financial data for quick retrieval
- Scalable Processing Pipeline: Handling multiple documents concurrently
4. Challenges & Solutions
4.1 Extracting Tabular Data from PDFs
Challenge: Financial statements often contain complex tabular structures with multi-line headers. Solution: Used regex-based pattern matching and structured parsing methods.
4.2 Handling Variations in Financial Statements
Challenge: Different companies use different formats for their financial reports. Solution: Implemented flexible regex patterns and prompt engineering for structured AI extraction.
4.3 Ensuring Accuracy in AI-Extracted Data
Challenge: GenAI may hallucinate or misinterpret figures, particularly negative values. Solution: Applied post-processing validation with explicit instructions to preserve negative signs and format numbers correctly.
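A minimal sketch of the post-processing validation, assuming a hypothetical CSV layout (metric name in the first column, numeric values after). It flags cells that fail to parse as floats, such as an 'O' where a 0 should be, while tolerating accounting-style negatives like (1,234).

```python
import csv

def validate_csv(path: str) -> list[str]:
    """Return a list of cells that do not parse as numbers."""
    errors = []
    with open(path, newline="") as f:
        for row_num, row in enumerate(csv.reader(f), start=1):
            if row_num == 1:
                continue  # skip the header row
            for cell in row[1:]:
                if not cell.strip():
                    continue  # tolerate genuinely empty cells
                # Normalise thousands separators and (…) negatives, then parse.
                normalised = cell.replace(",", "").replace("(", "-").rstrip(")")
                try:
                    float(normalised)
                except ValueError:
                    errors.append(f"row {row_num}: non-numeric value {cell!r}")
    return errors
```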
5. Results
- Extracted Financial Tables: CSV files containing structured financial metrics
- AI-Generated Summary: PDF report summarizing revenue, expenses, and financial trends
- Scalability Considerations: Pipeline designed for future expansion into RAG-based system
6. Future Improvements
- Enhance data validation by integrating rule-based financial integrity checks (a sketch of such a check follows this list)
- Explore Local AI Models: Larger models (70B+) could deliver the needed accuracy while keeping financial data private
- Optimize Preprocessing: NLP-based classification could improve robustness across varied datasets
- Leverage large-scale data science tooling (DuckDB, Tableau, etc.) for analysis over the accumulated extractions
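A minimal sketch of one rule-based integrity check, assuming hypothetical column names; the idea is cross-checking extracted figures against an accounting identity such as Revenue - Expenses = Net Income.

```python
import pandas as pd

def check_income_identity(csv_path: str, tol: float = 0.5) -> pd.DataFrame:
    """Return rows whose extracted figures break the income identity."""
    df = pd.read_csv(csv_path)  # assumes Revenue/Expenses/Net Income columns
    mismatch = (df["Revenue"] - df["Expenses"] - df["Net Income"]).abs() > tol
    return df[mismatch]
```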
7. Conclusion
This project successfully demonstrates how AI can be leveraged for financial data extraction and analysis. By automating the pipeline, we improve efficiency and accuracy in handling large-scale financial documents. Considerations for future scalability, including cloud-based deployment and retrieval-augmented generation (RAG), offer opportunities to make financial data extraction even more robust and dynamic.