Preprocessing Data for Base Knowledge Graph

Note

View the complete implementation in Google Colab: Open Notebook Book Processing Implementation

Initial Approach

Our initial approach focused on using regular expressions to identify chapter and section boundaries. While this worked well for chapters, it proved problematic for sections.

Chapter Processing

The chapter identification pattern successfully matched chapter starts:

self.chapter_pattern = re.compile(
    r'^Chapter\s+(\d+)\s*$\s*([^\n]+)',
    re.MULTILINE
)

This pattern worked because:

- Chapter starts were consistent (“Chapter X”)
- Chapter titles were on the next line
- The format was uniform throughout the book
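As a rough illustration of how this pattern behaves, the sketch below runs it over a small made-up sample. The sample text and the variable names `sample` and `chapters` are assumptions for the example and are not taken from the notebook.

```python
import re

# Same pattern as above, bound to a module-level name for this example.
chapter_pattern = re.compile(
    r'^Chapter\s+(\d+)\s*$\s*([^\n]+)',
    re.MULTILINE
)

# Hypothetical extracted text, not copied from the book's PDF.
sample = (
    "Chapter 1\n"
    "Introduction\n"
    "Some body text that mentions Chapter 2 in passing.\n"
    "Chapter 2\n"
    "Multi-armed Bandits\n"
)

# Each match yields the chapter number and the title on the following line.
chapters = [(int(m.group(1)), m.group(2).strip())
            for m in chapter_pattern.finditer(sample)]
print(chapters)
```

The mid-sentence mention of "Chapter 2" is skipped because `^` with `re.MULTILINE` only matches at the start of a line.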

Section Processing Challenges

The initial section processing attempted to use decimal notation:

self.section_pattern = re.compile(
    r'^(\d+\.\d+)\s+([^\n]+)'
)

This approach failed because:

- Decimal notation also appeared in equations
- Section numbers were referenced mid-text
- PDF extraction introduced inconsistent spacing
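The kind of false positive described here can be reproduced with a small sketch. The sample text is invented for illustration, and `re.MULTILINE` is added as an assumption so that `^` can match at the start of any line.

```python
import re

# The decimal-notation pattern, with re.MULTILINE added (an assumption)
# so that ^ matches at every line start rather than only at position 0.
section_pattern = re.compile(r'^(\d+\.\d+)\s+([^\n]+)', re.MULTILINE)

# Hypothetical extracted text: one real heading, one section reference that
# wrapped onto a new line, and one number from an equation at a line start.
sample = (
    "1.1 Reinforcement Learning\n"
    "Reinforcement learning is discussed further in Section\n"
    "1.4 and again later.\n"
    "0.9 times the previous estimate\n"
)

matches = [m.group(1) for m in section_pattern.finditer(sample)]
print(matches)
```

Only the first match is a real heading; the other two look identical to the pattern, which is why the number alone is not a reliable anchor.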

Unicode and Text Cleaning

Before tackling section extraction, we implemented comprehensive text cleaning:

class TextCleaner:
    def __init__(self):
        # Unicode replacements
        self.unicode_map = {
            '\ufb01': 'fi',  # fi ligature
            '\ufb02': 'fl',  # fl ligature
            '\u21b5': 'ff',  # ↵ to ff
            # ... many more mappings
        }

        self.text_replacements = {
            'NUL': 'ffi',  # Common NUL replacement
            '  ': ' ',     # Double spaces
            # ... other patterns
        }

The cleaning process:

1. Unicode character normalization
2. Common pattern replacement
3. Whitespace normalization
4. Special character handling
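The four steps could be chained roughly as follows. This is a minimal sketch rather than the notebook's `clean` method: the function name `clean_text` is hypothetical, the mapping tables are small subsets copied from above, and treating "special character handling" as dropping leftover control characters is an assumption.

```python
import re
import unicodedata

# Illustrative subsets of the mappings shown above.
UNICODE_MAP = {"\ufb01": "fi", "\ufb02": "fl", "\u21b5": "ff"}
TEXT_REPLACEMENTS = {"NUL": "ffi"}

def clean_text(text: str) -> str:
    # 1. Unicode character normalization (ligatures and mis-mapped glyphs)
    for src, dst in UNICODE_MAP.items():
        text = text.replace(src, dst)
    # 2. Common pattern replacement
    for src, dst in TEXT_REPLACEMENTS.items():
        text = text.replace(src, dst)
    # 3. Whitespace normalization: collapse runs of spaces and tabs, keep newlines
    text = re.sub(r"[ \t]+", " ", text)
    # 4. Special character handling (assumed here to mean dropping any
    #    remaining control characters other than newlines)
    text = "".join(
        ch for ch in text
        if ch == "\n" or unicodedata.category(ch) != "Cc"
    )
    return text.strip()

print(clean_text("The \ufb01rst  e\u21b5ect\x07 "))
```

Collapsing whitespace with a regular expression handles runs of any length in one pass, whereas a single replacement of two spaces by one would leave longer runs only partly collapsed.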

Improved Section Processing

We revised our approach to use section titles as anchors:

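import re
from typing import Dict, Tuple

def find_section_boundaries(content: str, sections: Dict[str, str]) -> Dict[str, Tuple[int, int]]:
    """Map each section number to its (start, end) character offsets in content."""
    boundaries = {}
    # Sort numerically so that "8.10" follows "8.9" rather than "8.1"
    ordered_sections = sorted(
        sections.keys(),
        key=lambda num: tuple(int(part) for part in num.split("."))
    )

    for i, section_num in enumerate(ordered_sections):
        section_title = sections[section_num]
        safe_title = re.escape(section_title)
        # Anchor on "<number> <title>" rather than on the number alone
        pattern = rf"{re.escape(section_num)}\s*{safe_title}"
        match = re.search(pattern, content)

        if match:
            start_pos = match.start()
            # Default to the end of the chapter (used for the last section)
            end_pos = len(content)

            if i < len(ordered_sections) - 1:
                next_section = ordered_sections[i + 1]
                next_title = sections[next_section]
                next_pattern = rf"{re.escape(next_section)}\s*{re.escape(next_title)}"
                next_match = re.search(next_pattern, content)
                if next_match:
                    end_pos = next_match.start()

            boundaries[section_num] = (start_pos, end_pos)

    return boundaries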

Improvements made:

- Use of metadata to identify the correct section titles
- Escaped special characters in titles
- Sequential processing, using the next section as the boundary
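Because each section ends where the next one begins, the order in which section numbers are visited matters. The sketch below illustrates one caveat, assuming section numbers are strings such as "8.10": plain string sorting compares character by character, so a numeric sort key is safer for chapters with ten or more sections. The helper name `section_sort_key` is hypothetical.

```python
def section_sort_key(section_num: str) -> tuple:
    # "8.10" -> (8, 10), so parts are compared as integers, not as text
    return tuple(int(part) for part in section_num.split("."))

section_nums = ["8.1", "8.10", "8.2", "8.13", "8.3"]

lexicographic = sorted(section_nums)
numeric = sorted(section_nums, key=section_sort_key)

print(lexicographic)  # string order puts 8.10 and 8.13 before 8.2
print(numeric)        # numeric order keeps 8.10 and 8.13 at the end
```

With string order, the section that follows 8.1 would be taken to be 8.10, so the boundary for 8.1 would swallow sections 8.2 onward.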

1. Initial PDF Text Extraction

def process_raw_chapters(base_dir: str = "./") -> None:
    cleaner = TextCleaner()
    for chapter_num in range(1, 17):
        # Read the raw chapter text into `text` (loading code omitted here)
        cleaned_text = cleaner.clean(text)
        # Save as JSON with metadata (omitted here)

2. Section Boundary Detection

def process_sections(base_dir: str = "./") -> None:
    for chapter_num in range(1, 17):
        # Load the chapter's cleaned text into `content` and its metadata
        # into `metadata` (loading code omitted here)
        section_boundaries = find_section_boundaries(
            content,
            metadata["sections"]
        )
        # Extract and save sections (omitted here)

Output:

Reading PDF…

…

Processed Chapter 01
Title: Introduction
Sections found:
  1.1: 10761 characters
  1.2: 4220 characters
  1.3: 4874 characters
  1.4: 3451 characters
  1.5: 15546 characters
  1.6: 1410 characters
  1.7: 33870 characters

…

Processed Chapter 01: Introduction
Found 7 sections
Processed Chapter 02: Multi-armed Bandits
Found 9 sections
Processed Chapter 03: Finite Markov Decision Processes
Found 6 sections
Processed Chapter 04: Dynamic Programming
Found 7 sections
Processed Chapter 05: Monte Carlo Methods
Found 7 sections
Processed Chapter 06: Temporal-Difference Learning
Found 8 sections
Processed Chapter 07: n-step Bootstrapping
Found 5 sections
Processed Chapter 08: Planning and Learning with Tabular Methods
Found 13 sections
Processed Chapter 09: On-policy Prediction with Approximation
Found 11 sections
Processed Chapter 10: On-policy Control with Approximation
Found 5 sections
Processed Chapter 11: *Off-policy Methods with Approximation
Found 9 sections
Processed Chapter 12: Policy Gradient Methods
Found 7 sections
Processed Chapter 13: Psychology
…
Processed Chapter 15: Applications and Case Studies
Found 6 sections
Processed Chapter 16: Frontiers
Found 5 sections

Output Structure

The final processing creates three versions of each chapter:

- Raw Text (chapter_XX.txt)
  - Original PDF extraction
- Cleaned Text (chapter_XX_raw.json)
  - Unicode normalized
  - Pattern replacements
  - Whitespace cleaned
- Processed Sections (chapter_XX_sections.json)
  - Title and metadata
  - Individual section content
  - Properly bounded sections

Example Output

{
  "title": "Introduction",
  "sections": {
    "1.1": {
      "title": "Reinforcement Learning",
      "content": "..."
    },
    "1.2": {
      "title": "Examples",
      "content": "..."
    }
  }
}
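A record of this shape could be assembled from the section boundaries roughly as follows. This is a sketch under assumptions: the helper name `extract_sections` and the miniature chapter are hypothetical, and the notebook's own extract-and-save step may differ.

```python
from typing import Dict, Tuple

def extract_sections(
    content: str,
    title: str,
    sections: Dict[str, str],
    boundaries: Dict[str, Tuple[int, int]],
) -> dict:
    """Slice content at each (start, end) pair and build the output record."""
    return {
        "title": title,
        "sections": {
            num: {
                "title": sections[num],
                "content": content[start:end].strip(),
            }
            for num, (start, end) in boundaries.items()
        },
    }

# Hypothetical miniature chapter, not taken from the book.
content = "1.1 First Topic\nAlpha text.\n1.2 Second Topic\nBeta text.\n"
sections = {"1.1": "First Topic", "1.2": "Second Topic"}
split = content.index("1.2 Second Topic")
boundaries = {"1.1": (0, split), "1.2": (split, len(content))}

record = extract_sections(content, "Introduction", sections, boundaries)
for num, sec in record["sections"].items():
    print(f"{num}: {len(sec['content'])} characters")
```

Sections missing from `boundaries` are simply left out of the record, which matches the behaviour of only storing a boundary when a title match is found.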

Note

The section processing approach achieved its core objective, though equation handling and automation leave room for future refinement. The current implementation is sufficient for the next phase.