Preprocessing Data for Base Knowledge Graph
Note
View the complete implementation in Google Colab: Open Notebook Book Processing Implementation
Initial Approach
Our initial approach focused on using regular expressions to identify chapter and section boundaries. While this worked well for chapters, it proved problematic for sections.
Chapter Processing
The chapter identification pattern successfully matched chapter starts:
self.chapter_pattern = re.compile(
    r'^Chapter\s+(\d+)\s*$\s*([^\n]+)',
    re.MULTILINE
)
This pattern worked because:
- Chapter starts were consistent (“Chapter X”)
- Chapter titles were on the next line
- Format was uniform throughout the book
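As a minimal sketch of how this pattern behaves, the snippet below runs it over a short made-up sample (the sample text and the standalone `chapter_pattern` name are assumptions for illustration, not the actual extracted book text). The `$` after the chapter number is what rejects body lines that merely begin with "Chapter 3 ...".

```python
import re

# Same pattern as above, defined standalone for the example.
chapter_pattern = re.compile(
    r'^Chapter\s+(\d+)\s*$\s*([^\n]+)',
    re.MULTILINE,
)

# Hypothetical sample text, not taken from the real extraction.
sample = (
    "Chapter 1\n"
    "Introduction\n"
    "Some body text.\n"
    "Chapter 3 covers this in detail.\n"  # body line, must not match
    "Chapter 2\n"
    "Multi-armed Bandits\n"
)

# Collect (chapter number, title) pairs from every match.
chapters = [
    (int(m.group(1)), m.group(2).strip())
    for m in chapter_pattern.finditer(sample)
]
print(chapters)
```

Only the two lines that consist of "Chapter X" alone are treated as chapter starts; the title is read from the line that follows.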
Section Processing Challenges
The initial section processing attempted to use decimal notation:
self.section_pattern = re.compile(
    r'^(\d+\.\d+)\s+([^\n]+)'
)
This approach failed because:
- Decimal notation also appeared in equations
- Section numbers were referenced mid-text
- PDF extraction introduced inconsistent spacing
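The failure modes listed above can be reproduced on a small made-up sample. This is a sketch only: the sample lines are hypothetical, and `re.MULTILINE` is added here as an assumption so that `^` is tested at the start of every line.

```python
import re

# Decimal-notation pattern, with re.MULTILINE assumed for per-line matching.
section_pattern = re.compile(r'^(\d+\.\d+)\s+([^\n]+)', re.MULTILINE)

# Hypothetical sample imitating wrapped PDF text.
sample = "\n".join([
    "1.1 Reinforcement Learning",    # real heading
    "The value is updated by",
    "0.9 times the old estimate.",   # number from a formula, wrapped to line start
    "as we will see in Section",
    "1.2 when we look at examples.", # mid-text reference, wrapped to line start
    "1.2 Examples",                  # real heading
    "1.3Limitations and Scope",      # real heading, space lost in extraction
])

matches = [(m.group(1), m.group(2)) for m in section_pattern.finditer(sample)]
numbers = [num for num, _ in matches]
print(numbers)
```

In this sample the pattern reports two false headings (`0.9` and the wrapped reference to `1.2`) and misses `1.3` entirely because the separating space is gone.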
Unicode and Text Cleaning
Before tackling section extraction, we implemented comprehensive text cleaning:
class TextCleaner:
    def __init__(self):
        # Unicode replacements
        self.unicode_map = {
            '\ufb01': 'fi',  # fi ligature
            '\ufb02': 'fl',  # fl ligature
            '\u21b5': 'ff',  # ↵ to ff
            # ... many more mappings
        }
        self.text_replacements = {
            'NUL': 'ffi',  # Common NUL replacement
            '  ': ' ',     # Double spaces
            # ... other patterns
        }
The cleaning process:
1. Unicode character normalization
2. Common pattern replacement
3. Whitespace normalization
4. Special character handling
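The four steps above could be chained roughly as follows. This is a sketch under stated assumptions, not the notebook's implementation: `clean_text`, `UNICODE_MAP`, and `TEXT_REPLACEMENTS` are hypothetical names, the hyphen-rejoining rule in step 2 is an invented example pattern, and step 4 is interpreted here as dropping control characters.

```python
import re
import unicodedata

# Subset of the ligature mappings shown above.
UNICODE_MAP = {"\ufb01": "fi", "\ufb02": "fl"}

# Hypothetical pattern: rejoin words hyphenated across a line break.
TEXT_REPLACEMENTS = {"-\n": ""}


def clean_text(text: str) -> str:
    # 1. Unicode character normalization
    for src, dst in UNICODE_MAP.items():
        text = text.replace(src, dst)
    # 2. Common pattern replacement
    for src, dst in TEXT_REPLACEMENTS.items():
        text = text.replace(src, dst)
    # 3. Whitespace normalization: collapse runs of spaces/tabs and blank lines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # 4. Special character handling (assumed: drop control characters, keep newlines)
    text = "".join(
        ch for ch in text
        if ch == "\n" or unicodedata.category(ch) != "Cc"
    )
    return text.strip()


raw = "The \ufb01rst \ufb02oor   of\tthe of-\nfice\x00 building\n\n\n\nNext"
cleaned = clean_text(raw)
print(repr(cleaned))
```

Tabs are converted to spaces in step 3 before step 4 runs, so they are not lost as control characters.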
Improved Section Processing
We revised our approach to use section titles as anchors:
import re
from typing import Dict, Tuple

def find_section_boundaries(content: str, sections: Dict[str, str]) -> Dict[str, Tuple[int, int]]:
    boundaries = {}
    # Sort numerically (keys assumed to look like "8.10") so that
    # "8.10" follows "8.9" instead of landing right after "8.1"
    ordered_sections = sorted(
        sections.keys(),
        key=lambda num: tuple(int(part) for part in num.split("."))
    )
    for i, section_num in enumerate(ordered_sections):
        section_title = sections[section_num]
        # Escape regex metacharacters in the number and the title
        safe_title = re.escape(section_title)
        pattern = rf"{re.escape(section_num)}\s*{safe_title}"
        match = re.search(pattern, content)
        if match:
            start_pos = match.start()
            # Default to the end of the chapter (used for the last section)
            end_pos = len(content)
            if i < len(ordered_sections) - 1:
                next_section = ordered_sections[i + 1]
                next_title = sections[next_section]
                next_pattern = rf"{re.escape(next_section)}\s*{re.escape(next_title)}"
                next_match = re.search(next_pattern, content)
                if next_match:
                    end_pos = next_match.start()
            boundaries[section_num] = (start_pos, end_pos)
    return boundaries
Improvements made:
- Use of metadata to identify correct section titles
- Escaped special characters in titles
- Sequential processing using the next section as the boundary
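Because each section's end is taken from the start of the next section in the ordering, the order of the section keys matters. The sketch below (with a hypothetical helper name, `section_sort_key`) shows how plain string sorting and numeric sorting differ for chapters with ten or more sections, such as a chapter with sections up to `8.13`.

```python
def section_sort_key(section_num: str) -> tuple:
    # "8.10" -> (8, 10), so comparison is numeric rather than character by character
    return tuple(int(part) for part in section_num.split("."))


keys = ["8.1", "8.2", "8.10", "8.13", "8.3"]

lexicographic = sorted(keys)
numeric = sorted(keys, key=section_sort_key)

print(lexicographic)
print(numeric)
```

With string sorting, `8.10` is placed directly after `8.1`, so the "next section" used as an end boundary would be the wrong one; the numeric key keeps the sections in reading order.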
1. Initial PDF Text Extraction
def process_raw_chapters(base_dir: str = "./") -> None:
    cleaner = TextCleaner()
    for chapter_num in range(1, 17):
        # Read and clean chapter text
        cleaned_text = cleaner.clean(text)
        # Save as JSON with metadata
2. Section Boundary Detection
def process_sections(base_dir: str = "./") -> None:
    for chapter_num in range(1, 17):
        # Load chapter content and metadata
        section_boundaries = find_section_boundaries(
            content,
            metadata["sections"]
        )
        # Extract and save sections
Output:
Reading PDF…
…
Processed Chapter 01
Title: Introduction
Sections found:
1.1: 10761 characters
1.2: 4220 characters
1.3: 4874 characters
1.4: 3451 characters
1.5: 15546 characters
1.6: 1410 characters
1.7: 33870 characters
…
Processed Chapter 01: Introduction
Found 7 sections
Processed Chapter 02: Multi-armed Bandits
Found 9 sections
Processed Chapter 03: Finite Markov Decision Processes
Found 6 sections
Processed Chapter 04: Dynamic Programming
Found 7 sections
Processed Chapter 05: Monte Carlo Methods
Found 7 sections
Processed Chapter 06: Temporal-Difference Learning
Found 8 sections
Processed Chapter 07: n-step Bootstrapping
Found 5 sections
Processed Chapter 08: Planning and Learning with Tabular Methods
Found 13 sections
Processed Chapter 09: On-policy Prediction with Approximation
Found 11 sections
Processed Chapter 10: On-policy Control with Approximation
Found 5 sections
Processed Chapter 11: *Off-policy Methods with Approximation
Found 9 sections
Processed Chapter 12: Policy Gradient Methods
Found 7 sections
Processed Chapter 13: Psychology
…
Processed Chapter 15: Applications and Case Studies
Found 6 sections
Processed Chapter 16: Frontiers
Found 5 sections
Output Structure
The final processing creates three versions of each chapter:
1. Raw Text (chapter_XX.txt)
   - Original PDF extraction
2. Cleaned Text (chapter_XX_raw.json)
   - Unicode normalized
   - Pattern replacements
   - Whitespace cleaned
3. Processed Sections (chapter_XX_sections.json)
   - Title and metadata
   - Individual section content
   - Properly bounded sections
Example Output
{
    "title": "Introduction",
    "sections": {
        "1.1": {
            "title": "Reinforcement Learning",
            "content": "..."
        },
        "1.2": {
            "title": "Examples",
            "content": "..."
        }
    }
}
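A structure like the one above can be assembled from a chapter's text and its section boundaries by slicing the content at each `(start, end)` pair. The following is a sketch, assuming boundaries in the form returned by `find_section_boundaries`; the helper name `build_sections_payload` and the sample content are hypothetical.

```python
import json
from typing import Dict, Tuple


def build_sections_payload(
    title: str,
    content: str,
    sections: Dict[str, str],
    boundaries: Dict[str, Tuple[int, int]],
) -> dict:
    # Slice the chapter text at each boundary and attach the section title.
    payload = {"title": title, "sections": {}}
    for num, (start, end) in boundaries.items():
        payload["sections"][num] = {
            "title": sections[num],
            "content": content[start:end].strip(),
        }
    return payload


# Hypothetical miniature chapter for illustration.
content = (
    "1.1 Reinforcement Learning\nLearning what to do.\n"
    "1.2 Examples\nA chess player makes a move.\n"
)
sections = {"1.1": "Reinforcement Learning", "1.2": "Examples"}
split_at = content.index("1.2 Examples")
boundaries = {"1.1": (0, split_at), "1.2": (split_at, len(content))}

payload = build_sections_payload("Introduction", content, sections, boundaries)
print(json.dumps(payload, indent=4))
```

Since each boundary starts at the matched heading, the sliced content in this sketch includes the heading line itself.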
Note
The section processing approach achieved its core objective, though equation handling and automation leave room for future refinement. The current implementation is sufficient for the next phase.