Overview
The File Processing Pipeline handles document upload, extraction, OCR, AI analysis, and structured data storage.
Pipeline Stages
Processing Stages
1. Upload & Validation
2. Docling Extraction
3. OCR (Conditional)
4. AI Analysis
5. Storage
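As a rough illustration of how the five stages might chain together, here is a minimal sketch in plain Python. Every name in it (`Document`, `run_pipeline`, the stage functions) is hypothetical, and the stage bodies are placeholders rather than real Docling, OCR, or AI calls. It also assumes that OCR is triggered when extraction yields no text; the actual condition is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical container passed from stage to stage."""
    name: str
    data: bytes
    text: str = ""
    needs_ocr: bool = False
    analysis: dict = field(default_factory=dict)
    log: list = field(default_factory=list)


def upload_and_validate(doc: Document) -> Document:
    # Stage 1: reject obviously invalid uploads.
    if not doc.data:
        raise ValueError(f"{doc.name}: empty upload")
    doc.log.append("upload_validation")
    return doc


def extract(doc: Document) -> Document:
    # Stage 2: placeholder for the Docling extraction step.
    doc.text = doc.data.decode("utf-8", errors="ignore").strip()
    # Assumption: OCR is needed when extraction finds no text.
    doc.needs_ocr = not doc.text
    doc.log.append("extraction")
    return doc


def ocr(doc: Document) -> Document:
    # Stage 3 (conditional): placeholder for a real OCR engine.
    doc.text = "<text recovered by OCR>"
    doc.log.append("ocr")
    return doc


def analyze(doc: Document) -> Document:
    # Stage 4: placeholder for the AI analysis call.
    doc.analysis = {"word_count": len(doc.text.split())}
    doc.log.append("ai_analysis")
    return doc


def store(doc: Document, db: dict) -> Document:
    # Stage 5: persist structured output (a dict stands in for the database).
    db[doc.name] = {"text": doc.text, "analysis": doc.analysis}
    doc.log.append("storage")
    return doc


def run_pipeline(doc: Document, db: dict) -> Document:
    doc = upload_and_validate(doc)
    doc = extract(doc)
    if doc.needs_ocr:  # OCR runs only when required
        doc = ocr(doc)
    doc = analyze(doc)
    return store(doc, db)
```

Calling `run_pipeline(Document("notes.txt", b"hello world"), {})` runs four stages and skips OCR, while a document with no extractable text passes through all five.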
Supported Formats
| Format | Extension | Max Size | Processing |
|---|---|---|---|
| PDF | .pdf | 100 MB | Docling + OCR |
| Word | .docx | 50 MB | Docling |
| PowerPoint | .pptx | 50 MB | Docling |
| Excel | .xlsx | 50 MB | Pandas |
| Images | .jpg, .png | 20 MB | OCR |
| HTML | .html | 10 MB | Docling |
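The limits in this table could be enforced during upload validation along the following lines. This is only a sketch: `FORMAT_LIMITS` and `validate_upload` are hypothetical names, and it assumes 1 MB means 1024 × 1024 bytes, which the table does not state.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: binary megabytes

# extension -> (max size in bytes, processing route), copied from the table above
FORMAT_LIMITS = {
    ".pdf": (100 * MB, "Docling + OCR"),
    ".docx": (50 * MB, "Docling"),
    ".pptx": (50 * MB, "Docling"),
    ".xlsx": (50 * MB, "Pandas"),
    ".jpg": (20 * MB, "OCR"),
    ".png": (20 * MB, "OCR"),
    ".html": (10 * MB, "Docling"),
}


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return the processing route for a file, or raise ValueError."""
    ext = Path(filename).suffix.lower()
    if ext not in FORMAT_LIMITS:
        raise ValueError(f"unsupported format: {ext or filename}")
    max_bytes, route = FORMAT_LIMITS[ext]
    if size_bytes > max_bytes:
        raise ValueError(
            f"{filename} is {size_bytes} bytes; limit for {ext} is {max_bytes}"
        )
    return route
```

For example, `validate_upload("report.PDF", 5 * MB)` returns `"Docling + OCR"`, while a 60 MB `.docx` file or a `.txt` file raises `ValueError`.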
Performance
Typical Processing Times
| Document Type | Size | Time |
|---|---|---|
| Simple PDF | 1 MB | 5-10s |
| Complex PDF with tables | 5 MB | 15-30s |
| Scanned document (OCR) | 10 MB | 30-60s |
| CIM (100 pages) | 20 MB | 45-90s |
Optimization
- Parallel processing: Multiple files are processed concurrently
- Batch operations: Calls to AI services are batched
- Caching: Extracted data is cached so re-processing can reuse it
- GPU acceleration: OCR uses a GPU when one is available
