Overview
The Document Processing Service extracts structured data from various document formats, including PDF, DOCX, PPTX, and scanned images, using Docling, EasyOCR, and AI analysis.
Primary Files:
- scripts/enhanced_extraction_service.py - Main extraction service
- scripts/batched_cim_service.py - Batch CIM processing
- scripts/cim_analysis_service.py - CIM analysis with AI
Supported Document Types
PDF Documents
Financial statements, CIMs, pitch decks, contracts
Word Documents
DOCX files with text, tables, and images
PowerPoint
PPTX presentations with slides and content
Scanned Images
OCR for scanned documents and images
HTML Pages
Web pages and HTML documents
Excel Spreadsheets
XLSX with tables and financial data
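One way to route the supported types listed above is by file extension. The following is a minimal sketch, not the service's actual logic: `detect_document_type` and `EXTENSION_TO_TYPE` are hypothetical names, and the specific image and HTML extensions are assumptions.

```python
from pathlib import Path

# Hypothetical extension table; the categories mirror the supported types above.
# The specific image and HTML extensions are assumptions.
EXTENSION_TO_TYPE = {
    ".pdf": "pdf",
    ".docx": "word",
    ".pptx": "powerpoint",
    ".xlsx": "excel",
    ".html": "html",
    ".htm": "html",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".tiff": "image",
}


def detect_document_type(path: str) -> str:
    """Return a coarse document type for a path, or raise if unsupported."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_TO_TYPE[suffix]
    except KeyError:
        raise ValueError(f"Unsupported document type: {suffix or path!r}") from None
```

Matching on the lowercased suffix keeps the check cheap; a stricter variant could also inspect file contents, since extensions can be wrong.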
Technology Stack
Docling
Primary document processing engine:
- Layout Analysis: Understands document structure
- Table Extraction: Preserves table relationships
- Image Extraction: Extracts embedded images
- Metadata Extraction: Document properties
- Multi-format Support: PDF, DOCX, PPTX, HTML
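As an illustration of the Docling step, the sketch below converts one document and exports it as Markdown using Docling's `DocumentConverter`. The wrapper function `convert_to_markdown` and its `converter` parameter are hypothetical and only there to keep the example testable; how the service itself calls Docling is not shown in this document.

```python
from typing import Any, Optional


def convert_to_markdown(path: str, converter: Optional[Any] = None) -> str:
    """Convert a document with Docling and return its content as Markdown.

    `converter` is any object with a Docling-style `convert(path)` method; when
    omitted, a default `DocumentConverter` is created (requires `pip install docling`).
    """
    if converter is None:
        # Imported lazily so this module loads even when Docling is not installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    result = converter.convert(path)
    # `result.document` is the parsed document; Markdown export keeps headings and tables.
    return result.document.export_to_markdown()
```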
EasyOCR
Optical Character Recognition for scanned documents:
- 80+ Languages: Multi-language support
- GPU Acceleration: Fast processing
- High Accuracy: Good for printed text
- Batch Processing: Process multiple images
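A hedged sketch of the OCR step: EasyOCR's `Reader.readtext` returns `(bounding_box, text, confidence)` tuples by default, so low-confidence fragments can be filtered before further processing. The helper names and the `0.5` threshold are assumptions for illustration, not values taken from the service.

```python
from typing import List, Sequence, Tuple

# EasyOCR's readtext() returns (bounding_box, text, confidence) tuples by default.
OcrResult = Tuple[Sequence, str, float]


def keep_confident_text(
    results: Sequence[OcrResult], min_confidence: float = 0.5
) -> List[str]:
    """Return recognized text fragments whose confidence meets the threshold."""
    return [text for _bbox, text, conf in results if conf >= min_confidence]


def ocr_image(image_path: str, languages=("en",), use_gpu: bool = True) -> List[str]:
    """Run EasyOCR on one image and keep the confident fragments."""
    import easyocr  # lazy import: requires `pip install easyocr`

    # Creating a Reader loads the models, so reuse it when processing many images.
    reader = easyocr.Reader(list(languages), gpu=use_gpu)
    return keep_confident_text(reader.readtext(image_path))
```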
PDFPlumber
Fallback for PDF processing:
- Text Extraction: Pure text from PDFs
- Table Extraction: Better table detection than PyPDF2
- Layout Preservation: Maintains spatial relationships
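The fallback role described above can be sketched as an ordered chain of extractors, where the next one runs only if the previous one raises or returns no text. `extract_with_fallback` and the `Extractor` alias are hypothetical names; the pdfplumber calls (`pdfplumber.open`, `page.extract_text`) are the library's standard API.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger(__name__)

Extractor = Callable[[str], str]


def extract_with_pdfplumber(path: str) -> str:
    """Extract plain text page by page using pdfplumber."""
    import pdfplumber  # lazy import: requires `pip install pdfplumber`

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_fallback(path: str, extractors: Sequence[Extractor]) -> str:
    """Try each extractor in order and return the first non-empty result."""
    errors = []
    for extractor in extractors:
        name = getattr(extractor, "__name__", repr(extractor))
        try:
            text = extractor(path)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            logger.warning("%s failed for %s: %s", name, path, exc)
            errors.append(exc)
            continue
        if text.strip():
            return text
    raise RuntimeError(
        f"No extractor produced text for {path} ({len(errors)} raised errors)"
    )
```

A caller would pass the primary extractor first and `extract_with_pdfplumber` last, so the fallback only runs when needed.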
Claude AI
Post-processing and analysis:
- Content Summarization: Extract key points
- Entity Extraction: Companies, people, dates, numbers
- Classification: Document type detection
- Data Structuring: Convert to JSON schema
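For the data-structuring step, a common pattern is to ask the model for a single JSON object and validate the reply before using it. The sketch below assumes a hypothetical field list (`REQUIRED_FIELDS`) and leaves the model call as an injected `complete` callable, since this document does not show the client, model, or schema the service uses.

```python
import json
from typing import Callable, Dict, Iterable

# Hypothetical field list; the real schema is not shown in this document.
REQUIRED_FIELDS = ("document_type", "summary", "entities")


def build_prompt(text: str, fields: Iterable[str] = REQUIRED_FIELDS) -> str:
    """Ask for a single JSON object containing the named keys."""
    keys = ", ".join(fields)
    return (
        "Analyze the document below and reply with one JSON object "
        f"with the keys: {keys}. Reply with JSON only.\n\n{text}"
    )


def parse_json_reply(reply: str, fields: Iterable[str] = REQUIRED_FIELDS) -> Dict:
    """Parse a model reply, tolerating extra prose around the JSON object."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in reply")
    data = json.loads(reply[start : end + 1])
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError(f"Reply is missing fields: {missing}")
    return data


def analyze(text: str, complete: Callable[[str], str]) -> Dict:
    """`complete` sends a prompt to the model and returns its text reply."""
    return parse_json_reply(complete(build_prompt(text)))
```

Validating required keys up front means a malformed reply fails loudly at the parsing step instead of surfacing later as a missing field.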
Processing Pipeline
Enhanced Extraction Service
Main Service
File: scripts/enhanced_extraction_service.py
Extraction Methods
PDF Extraction
OCR Processing
AI Analysis
CIM Processing
Confidential Information Memorandum
File: scripts/batched_cim_service.py
Specialized processing for CIMs:
Key Sections Extracted
- Company Overview
  - Company name, industry, location
  - Business description
  - Products/services
- Financial Metrics
  - Revenue, EBITDA, margins
  - Growth rates
  - Historical performance
- Market Analysis
  - Market size, growth
  - Competitive landscape
  - Market positioning
- Management Team
  - Key executives
  - Experience and background
- Deal Structure
  - Valuation, deal size
  - Terms and conditions
  - Timeline
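The five sections above could be held in a small container that also reports which sections came back empty. This is a sketch under assumptions: `CimExtraction`, `CIM_SECTIONS`, and the snake_case section keys are hypothetical names derived from the headings above, not the schema used by `batched_cim_service.py`.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

# Hypothetical section keys, derived from the five headings above.
CIM_SECTIONS = (
    "company_overview",
    "financial_metrics",
    "market_analysis",
    "management_team",
    "deal_structure",
)


@dataclass
class CimExtraction:
    """Container for one CIM; each section holds whatever fields were extracted."""

    company_overview: Dict[str, Any] = field(default_factory=dict)
    financial_metrics: Dict[str, Any] = field(default_factory=dict)
    market_analysis: Dict[str, Any] = field(default_factory=dict)
    management_team: List[Dict[str, Any]] = field(default_factory=list)
    deal_structure: Dict[str, Any] = field(default_factory=dict)

    def missing_sections(self) -> List[str]:
        """Names of sections that came back empty, for review or re-extraction."""
        return [name for name in CIM_SECTIONS if not getattr(self, name)]

    def to_dict(self) -> Dict[str, Any]:
        """Plain-dict form, suitable for JSON serialization."""
        return asdict(self)
```

Keeping each section as a free-form dictionary avoids guessing exact field names while still making gaps in an extraction easy to detect.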
