Text Extraction
Note
View the complete implementation in Google Colab: Document Layout Analysis and Text Extraction
Process Overview
Our approach combines LayoutParser with Tesseract OCR to achieve accurate text extraction from complex document layouts. The workflow is as follows:
Text Extraction Approach
La Démarche (Approach)
Layout Detection: - LayoutParser to identify tables and figures - Apply different confidence thresholds for tables (0.1) and figures (0.9) - Tesseract when configured is good at detecting text layout (double layout & unstructured layout)
Element Extraction: - Save detected tables and figures as separate images - Store their coordinates and paths for later reference - Handle overlapping detections using IoU (Intersection over Union)
Text Extraction: - Create masks for detected tables and figures - Apply Tesseract OCR only to non-masked regions - Use page segmentation mode 1 for automatic layout analysis
Some Implementation Components
1. Layout Detection
# Initialize Layout Parser model
model = lp.Detectron2LayoutModel(
'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", min(table_threshold, figure_threshold)],
label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
)
2. Mask Creation
def create_mask_for_regions(image_size, regions):
"""Create a boolean mask for regions to ignore"""
mask = np.zeros(image_size[::-1], dtype=bool)
for coords in regions:
# Add padding around regions
x1, y1, x2, y2 = [int(c) for c in coords]
x1 = max(0, x1 - 5)
y1 = max(0, y1 - 5)
x2 = min(image_size[0], x2 + 5)
y2 = min(image_size[1], y2 + 5)
mask[y1:y2, x1:x2] = True
return mask
3. Page Processing
def process_page(image_path, layout_model, output_folder, page_num,
table_threshold=0.1, figure_threshold=0.9):
"""Process a single page focusing only on tables and figures"""
# Load and process image
image = Image.open(image_path).convert('RGB')
# Detect layout
layout = layout_model.detect(image)
# Process tables and figures
processed_regions = []
regions_to_mask = []
# Process and mask elements
# Apply Tesseract OCR
# Return structured results
Configuration and Usage
Used Parameters:
- table_threshold = 0.1: Lower threshold for tables
- figure_threshold = 0.9: Higher threshold for figures
- custom_config = '--psm 1': Tesseract page segmentation mode
Basic usage:
# Process an entire article
article_path = Path("path/to/article/images")
result = process_article(article_path)
Output Structure
The process generates an organized directory structure for each processed article:
article_name_processed/
├── tables/
│ ├── table_1.png
│ ├── table_2.png
│ └── ...
├── figures/
│ ├── figure_1.png
│ ├── figure_2.png
│ └── ...
└── result.json
tables/: Contains extracted table imagesfigures/: Contains extracted figure imagesresult.json: Contains the complete analysis including text content, element coordinates, and file paths