Reinforcement Learning Base Knowledge Graph
===========================================

.. note::

    View the complete implementation in Google Colab: 
    
    - `Entities and relationships creation Notebook <https://colab.research.google.com/github/MasrourTawfik/Textra_research_v1/blob/main/docs/notebooks/base_entities_&relationships.ipynb>`_
    
    - `Base knowledge graph creation Notebook <https://colab.research.google.com/github/MasrourTawfik/Textra_research_v1/blob/main/docs/notebooks/neoj4_gdb.ipynb>`_

Prerequisites
==============

Software Requirements
---------------------

- Python 3.8+
- Neo4j Community Edition 5.26.0 (or higher)

Python Dependencies
-------------------

::

    pip install neo4j openai

The ``typing`` and ``pathlib`` modules are part of the Python 3.8+ standard library and do not need to be installed.

Neo4j Setup
------------

Installation
~~~~~~~~~~~~

Download Neo4j Community Edition 5.26.0 for Windows and extract it to ``C:\Program Files``.

Starting Neo4j Server
~~~~~~~~~~~~~~~~~~~~

1. Open Command Prompt as Administrator
2. Navigate to Neo4j installation directory::

    cd "C:\Program Files\neo4j-community-5.26.0-windows\neo4j-community-5.26.0"

3. Start Neo4j server::

    .\bin\neo4j console

4. Access Neo4j Browser interface:

   Open your web browser and navigate to: http://localhost:7474/browser/

Initial Setup
~~~~~~~~~~~~~

When accessing Neo4j Browser for the first time, use the default connection settings:

- Connect URL: neo4j://localhost:7687
- Database: neo4j

Default credentials:

- Username: neo4j
- Password: neo4j (you will be prompted to change it on first login)

Knowledge Graph Construction
============================

Entity Extraction Process
--------------------------

The initial phase involves extracting reinforcement learning concepts from textbook content. This process is implemented through the ``RLEntityExtractor`` class.

Entity Extraction Flow
~~~~~~~~~~~~~~~~~~~~~~

.. figure:: ../Images/base_ent.png
    :align: center
    :alt: Entity Extraction Process

    Entity Extraction Process

Core Implementation
~~~~~~~~~~~~~~~~~~~

1. **Initialization**

The extractor is initialized with API configuration and tracking structures:

.. code-block:: python

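    from collections import defaultdict
    from typing import Optional
    from openai import OpenAI

    def __init__(self, api_key: Optional[str] = None):
        self.client = OpenAI(
            base_url="https://integrate.api.nvidia.com/v1",
            api_key=api_key or "your_api_key"  # placeholder; pass a real key
        )
        # entity id -> chapters in which the entity appears
        self.entity_appearances = defaultdict(set)
        # domain name -> ids of entities tagged with that domain
        self.domain_connections = defaultdict(set)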

2. **Prompt Engineering**

The prompt template is crucial for consistent entity extraction:

.. code-block:: python

    def create_extract_prompt(self, section_text: str, chapter: str, section: str) -> str:
        return f"""Extract key RL entities and their relationships from this text section. 
        Focus on core concepts, domains, and clear relationships. Format as JSON:

        {{
            "entities": [
                {{
                    "id": "unique_snake_case_id",
                    "name": "Full Concept Name",
                    "type": "concept|algorithm|method|principle|domain",
                    "definition": "Clear, precise definition under 50 words",
                    "domains": ["domain1", "domain2"],
                    "properties": [
                        {{
                            "name": "property_name",
                            "value": "property_value",
                            "type": "characteristic|parameter|constraint|requirement"
                        }}
                    ],
                    "source": {{
                        "chapter": "{chapter}",
                        "section": "{section}",
                        "context": "Brief context"
                    }}
                }}
            ]
        }}

        Text to analyze:
        {section_text}"""

3. **Section Processing**

Individual sections are processed using the LLM:

.. code-block:: python

    def process_section(self, section_text: str, chapter: str, section: str) -> Dict:
        try:
            completion = self.client.chat.completions.create(
                model="nvidia/llama-3.1-nemotron-70b-instruct",
                messages=[{
                    "role": "user", 
                    "content": self.create_extract_prompt(section_text, chapter, section)
                }],
                temperature=0.3,
                max_tokens=2048
            )
            
            if completion.choices:
                response_text = completion.choices[0].message.content
                extracted = self.clean_json_response(response_text)
                
                if 'entities' in extracted:
                    self.update_cross_references(extracted['entities'], chapter)
                
                return extracted
            
            return {}
        except Exception as e:
            print(f"Error processing section: {e}")
            return {}
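
``clean_json_response`` is called above but its body is not shown. A minimal sketch of such a helper, assuming the model may wrap its JSON in a Markdown fence or surround it with prose, could look like this (an illustration, not necessarily the notebook's implementation):

.. code-block:: python

    import json
    import re
    from typing import Dict


    def clean_json_response(response_text: str) -> Dict:
        """Best-effort parse of a JSON object embedded in an LLM reply."""
        text = response_text.strip()
        # Drop a surrounding Markdown fence such as ```json ... ``` if present.
        fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
        if fenced:
            text = fenced.group(1).strip()
        # Keep only the outermost {...} span to skip leading or trailing prose.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return {}
        try:
            parsed = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return {}
        # Callers expect a dict, so anything else counts as a failed parse.
        return parsed if isinstance(parsed, dict) else {}

Returning an empty dict on failure matches how ``process_section`` treats a missing or unusable response.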

4. **Cross-Reference Management**

Tracking entity appearances and domain connections:

.. code-block:: python

    def update_cross_references(self, entities: List[Dict], chapter: str) -> None:
        for entity in entities:
            entity_id = entity['id']
            self.entity_appearances[entity_id].add(chapter)
            
            if 'domains' in entity:
                for domain in entity['domains']:
                    self.domain_connections[domain].add(entity_id)
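
To illustrate what these two mappings make possible, the sketch below repeats the same bookkeeping as a standalone function and then lists the entities that recur across chapters. The sample entities and chapter names are hypothetical:

.. code-block:: python

    from collections import defaultdict
    from typing import Dict, List, Set

    entity_appearances: Dict[str, Set[str]] = defaultdict(set)
    domain_connections: Dict[str, Set[str]] = defaultdict(set)


    def update_cross_references(entities: List[Dict], chapter: str) -> None:
        """Record where each entity appears and which domains it belongs to."""
        for entity in entities:
            entity_appearances[entity["id"]].add(chapter)
            for domain in entity.get("domains", []):
                domain_connections[domain].add(entity["id"])


    # Hypothetical sample data: "policy" is extracted from two chapters.
    update_cross_references(
        [{"id": "policy", "domains": ["machine_learning"]}], "chapter_1"
    )
    update_cross_references(
        [
            {"id": "policy", "domains": ["control_theory"]},
            {"id": "reward", "domains": ["machine_learning"]},
        ],
        "chapter_3",
    )

    # Entities seen in more than one chapter.
    recurring = sorted(
        eid for eid, chapters in entity_appearances.items() if len(chapters) > 1
    )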

5. **Chapter Processing**

Complete chapter processing workflow:

.. code-block:: python

    def process_chapter_file(self, file_path: Path) -> Dict:
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                sections = json.load(f)
            
            chapter_data = {
                'chapter_id': file_path.stem,
                'entities': [],
                'relationships': [],
                'domains': set(),
            }
            
            for section_id, content in sections.items():
                print(f"Processing {file_path.stem} - {section_id}")
                section_data = self.process_section(
                    content, 
                    chapter=file_path.stem, 
                    section=section_id
                )
                
                if section_data:
                    chapter_data['entities'].extend(section_data.get('entities', []))
                    chapter_data['relationships'].extend(section_data.get('relationships', []))
                    chapter_data['domains'].update(section_data.get('domains_discussed', []))
            
            chapter_data['domains'] = list(chapter_data['domains'])
            return chapter_data
            
        except Exception as e:
            print(f"Error processing chapter file {file_path}: {e}")
            return {}

6. **Output Format and Structure**

The entity extraction process produces structured JSON. In the example below, entities are keyed by ``id`` and ``source`` is a list, since the same entity can be extracted from more than one section:

.. code-block:: json

    {
        "entities": {
            "reinforcement_learning": {
                "id": "reinforcement_learning",
                "name": "Reinforcement Learning",
                "type": "domain",
                "definition": "A computational approach to understanding and automating goal-directed learning and decision making.",
                "domains": [
                    "artificial_intelligence",
                    "machine_learning",
                    "psychology",
                    "neuroscience"
                ],
                "properties": [
                    {
                        "name": "characteristics",
                        "value": "trial-and-error search, delayed reward, emphasis on learning from interaction with environment",
                        "type": "characteristic"
                    }
                ],
                "source": [
                    {
                        "chapter": "1",
                        "section": "1.1",
                        "context": "Introduction to Reinforcement Learning"
                    },
                    {
                        "chapter": "Introduction to Machine Learning",
                        "section": "Subfields of ML",
                        "context": "RL as a part of ML"
                    }
                ]
            }
        }
    }

Notes about the output:

1. **Entity Structure**:

   - Unique identifier (snake_case)
   - Descriptive name
   - Entity type classification
   - Clear, concise definition
   - Associated domains
   - Characteristic properties
   - Source references

2. **Source Tracking**:

   - Multiple appearances across chapters
   - Section-level granularity
   - Contextual information
   - Hierarchical organization

3. **Domain Classification**:

   - Cross-domain relationships
   - Multiple domain associations
   - Domain hierarchy preservation

4. **Property Format**:

   - Named characteristics
   - Typed attributes
   - Value descriptions
   - Property categorization
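
The per-section prompt asks for a single ``source`` object, while the example output stores a list of sources per entity. One way to get from one to the other is to merge entities by ``id``. The helper name ``merge_entities`` and its merge rules (first definition wins, domains and sources are accumulated) are assumptions made for this sketch:

.. code-block:: python

    from typing import Dict, List


    def merge_entities(chapters: List[Dict]) -> Dict[str, Dict]:
        """Merge per-chapter entity lists into one dict keyed by entity id."""
        merged: Dict[str, Dict] = {}
        for chapter in chapters:
            for entity in chapter.get("entities", []):
                source = entity.get("source")
                # Normalise a single source object (or a missing one) to a list.
                sources = [source] if isinstance(source, dict) else list(source or [])
                existing = merged.get(entity["id"])
                if existing is None:
                    merged[entity["id"]] = {
                        **entity,
                        "domains": list(entity.get("domains", [])),
                        "source": sources,
                    }
                    continue
                # Later mentions only contribute new domains and new sources.
                for domain in entity.get("domains", []):
                    if domain not in existing["domains"]:
                        existing["domains"].append(domain)
                for src in sources:
                    if src not in existing["source"]:
                        existing["source"].append(src)
        return merged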

Relationship Extraction Process
-------------------------------

The second phase focuses on extracting meaningful relationships between entities using a layered approach, implemented through the ``LayeredRelationshipExtractor`` class.

Implementation Details
~~~~~~~~~~~~~~~~~~~~~~~

1. **Layer Classification**

Each entity is classified into one of four layers based on its type:

.. code-block:: python

    def determine_layer(self, entity_data: Dict) -> str:
        if 'type' in entity_data:
            entity_type = entity_data['type'].lower()
            
            # Mathematical and theoretical concepts
            if entity_type in ['theorem', 'equation', 'principle', 'proof', 
                             'definition', 'framework', 'concept']:
                return 'foundation_layer'
            
            # Methods and approaches
            elif entity_type in ['value_based', 'policy_based', 'model_based', 
                               'hybrid', 'method']:
                return 'method_layer'
            
            # Algorithms and implementations
            elif entity_type in ['algorithm', 'base_algorithm', 'variant']:
                return 'algorithm_layer'
            
            # Applications and domains
            elif entity_type in ['field', 'benchmark', 'use_case', 'domain']:
                return 'application_layer'
        
        return 'foundation_layer'

2. **Relationship Prompt Engineering**

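The prompt builder starts by grouping every other entity by layer, so that layer-specific relationships can be considered (excerpt; the prompt text itself is not shown):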

.. code-block:: python

    def create_relationship_prompt(self, entity_id: str, entity: Dict, 
                                 all_entities: Dict) -> str:
        source_layer = self.determine_layer(entity)
        entities_by_layer = {
            'foundation_layer': [],
            'method_layer': [],
            'algorithm_layer': [],
            'application_layer': []
        }
        
        for eid, e in all_entities.items():
            if eid != entity_id:
                layer = self.determine_layer(e)
                entities_by_layer[layer].append({
                    'id': eid,
                    'name': e['name'],
                    'type': e.get('type', '')
                })

3. **Relationship Types**

Relationships are categorized by direction:

- **up**: Connections to higher layers
- **down**: Connections to lower layers
- **same**: Within-layer relationships
- **across**: Cross-layer non-hierarchical relationships

Common relationship patterns::

    Foundation → Method: "enables", "provides basis for"
    Method → Algorithm: "is implemented by", "guides"
    Algorithm → Application: "is applied to", "solves"
    Same layer: "relates to", "extends", "similar to"
    Cross-layer: "inspired by", "analogous to"
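
For the three hierarchical directions, the label follows from the relative position of the two layers. The sketch below assumes the ordering foundation, method, algorithm, application, which matches the patterns listed above. The function name ``infer_direction`` is hypothetical, and ``across`` cannot be derived from layer positions alone, so it is left to the extractor:

.. code-block:: python

    # Assumed ordering, from lowest to highest layer.
    LAYER_ORDER = [
        "foundation_layer",
        "method_layer",
        "algorithm_layer",
        "application_layer",
    ]


    def infer_direction(source_layer: str, target_layer: str) -> str:
        """Return 'up', 'down' or 'same' from the two layer positions.

        Raises ValueError if either layer name is unknown.
        """
        src = LAYER_ORDER.index(source_layer)
        tgt = LAYER_ORDER.index(target_layer)
        if tgt > src:
            return "up"
        if tgt < src:
            return "down"
        return "same"

A check like this can be used to flag relationships whose reported ``direction`` contradicts their layers.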

4. **Statistics Tracking**

The system maintains detailed statistics about layer connections:

.. code-block:: python

    layer_statistics = {
        'foundation_layer': {'total': 0, 'connected': 0},
        'method_layer': {'total': 0, 'connected': 0},
        'algorithm_layer': {'total': 0, 'connected': 0},
        'application_layer': {'total': 0, 'connected': 0}
    }

    layer_connections = {
        'up': 0,
        'down': 0,
        'same': 0,
        'across': 0
    }

5. **Relationship Structure**

Each extracted relationship follows this format:

.. code-block:: json

    {
        "source": "entity_id",
        "source_layer": "layer_name",
        "target": "target_entity_id",
        "target_layer": "layer_name",
        "type": "descriptive_relationship_type",
        "direction": "up|down|same|across",
        "evidence": {
            "text": "exact text snippet showing relationship",
            "location": "definition|property|source"
        }
    }
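
Since the relationships are generated by an LLM, a record may omit a field or point at an entity that does not exist. A small validation step, sketched here under the assumption that all seven top-level keys are required (the function name ``is_valid_relationship`` is hypothetical), can filter such records before they reach the graph:

.. code-block:: python

    from typing import Dict, Set

    REQUIRED_KEYS = {
        "source", "source_layer", "target", "target_layer",
        "type", "direction", "evidence",
    }
    VALID_DIRECTIONS = {"up", "down", "same", "across"}


    def is_valid_relationship(rel: Dict, known_ids: Set[str]) -> bool:
        """Check that a relationship record is complete and refers to known entities."""
        if not isinstance(rel, dict) or not REQUIRED_KEYS.issubset(rel):
            return False
        if rel["direction"] not in VALID_DIRECTIONS:
            return False
        # Both endpoints must be ids of extracted entities.
        return rel["source"] in known_ids and rel["target"] in known_ids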

6. **Output Generation**

The final output includes relationships and comprehensive statistics:

.. code-block:: python

    output = {
        "relationships": unique_relationships,
        "metadata": {
            "total_relationships": len(unique_relationships),
            "relationship_types": sorted(list(set(rel['type'] 
                                     for rel in unique_relationships))),
            "total_entities_involved": len(connected_entities),
            "layer_statistics": layer_statistics,
            "layer_connections": layer_connections
        }
    }
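
``unique_relationships`` and ``connected_entities`` are used above without being defined. One way they might be derived is sketched below, assuming two relationships are duplicates when source, target, and case-insensitive type all match. Both function names are hypothetical:

.. code-block:: python

    from typing import Dict, List, Set


    def deduplicate_relationships(relationships: List[Dict]) -> List[Dict]:
        """Keep the first occurrence of each (source, target, type) triple."""
        seen = set()
        unique = []
        for rel in relationships:
            key = (rel["source"], rel["target"], rel["type"].strip().lower())
            if key not in seen:
                seen.add(key)
                unique.append(rel)
        return unique


    def connected_entity_ids(relationships: List[Dict]) -> Set[str]:
        """Collect every entity id that is an endpoint of some relationship."""
        return {rel[end] for rel in relationships for end in ("source", "target")}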

7. **Relationship Examples**

Here are examples of different types of relationships extracted:

Same-Layer Relationship (Foundation):

.. code-block:: json

    {
      "source": "dopamine",
      "source_layer": "foundation_layer",
      "target": "reward_signals",
      "target_layer": "foundation_layer",
      "type": "relates to",
      "direction": "same",
      "evidence": {
        "text": "A neurotransmitter involved in reward processing ... in the mammalian brain.",
        "location": "definition"
      }
    }

Up-Direction Relationship:

.. code-block:: json

    {
      "source": "associative_search",
      "source_layer": "foundation_layer",
      "target": "temporal_difference_learning",
      "target_layer": "method_layer",
      "type": "enables",
      "direction": "up",
      "evidence": {
        "text": "Associative Search involves trial-and-error learning, a key aspect of Temporal-Difference Learning.",
        "location": "definition"
      }
    }

These examples demonstrate:

- Different types of layer interactions
- Various relationship types
- Evidence-based connections
- Directional relationships
- Domain-specific associations

Knowledge Graph Building
------------------------

With ``entities.json`` and ``relationships.json`` in hand, the next step is to build the base knowledge graph in Neo4j, converting the extracted entities and relationships into a queryable graph database.

Core Implementation
~~~~~~~~~~~~~~~~~~~~

1. **Database Connection**

Configuration of Neo4j connection with proper authentication:

.. code-block:: python

    from neo4j import GraphDatabase

    def __init__(self, uri="bolt://localhost:7687", user="neo4j", password="password"):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

2. **Node Creation**

Domain entities receive the ``Domain`` label; every other entity receives ``Concept`` plus a label derived from its type:

.. code-block:: python

    def create_node(self, tx, entity_id, entity_data):
        # Convert properties to string array
        properties_list = []
        if entity_data.get('properties'):
            for prop in entity_data['properties']:
                prop_str = f"{prop['name']}: {prop['value']}"
                properties_list.append(prop_str)

        # Node properties structure
        node_props = {
            'id': entity_id,
            'name': entity_data['name'],
            'type': entity_data['type'],
            'definition': entity_data['definition'],
            'domains': entity_data.get('domains', []),
            'properties': properties_list
        }

        # Dynamic label creation
        type_label = ''.join(c for c in entity_data['type'].title() 
                            if c.isalnum())
        
        # Different handling for domain nodes
        if entity_data['type'].lower() == 'domain':
            query = """
            MERGE (n:Domain {id: $id})
            SET n = $node_props
            """
        else:
            query = f"""
            MERGE (n:Concept:{type_label} {{id: $id}})
            SET n = $node_props
            """

        # Run the query inside the caller's transaction
        tx.run(query, id=entity_id, node_props=node_props)

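Because the type label is interpolated into the query string, it helps to guarantee that it is a non-empty identifier starting with a letter. The sketch below wraps the same title-casing logic with a fallback; the function name ``type_to_label`` and the fallback label ``Entity`` are assumptions:

.. code-block:: python

    def type_to_label(entity_type: str, fallback: str = "Entity") -> str:
        """Turn an entity type such as 'base_algorithm' into a label like 'BaseAlgorithm'."""
        # Same transformation as above: title-case, then keep alphanumerics only.
        label = "".join(c for c in entity_type.title() if c.isalnum())
        # An empty label or one starting with a digit would break the query.
        if not label or not label[0].isalpha():
            return fallback
        return label
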
3. **Relationship Creation**

Establishing connections between nodes:

.. code-block:: python

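    import re

    def create_relationships(self, tx, relationships_data):
        relationships = relationships_data.get('relationships', [])
        for rel in relationships:
            # Clean relationship type for Neo4j: uppercase, with each run of
            # non-alphanumeric characters collapsed into one underscore
            rel_type = re.sub(r'[^A-Z0-9]+', '_', rel['type'].upper()).strip('_')
            if not rel_type:
                continue  # nothing usable left after cleaning

            query = f"""
            MATCH (source)
            WHERE source.id = $source
            MATCH (target)
            WHERE target.id = $target
            MERGE (source)-[r:{rel_type}]->(target)
            SET r.source_layer = $source_layer
            SET r.target_layer = $target_layer
            SET r.direction = $direction
            """

            # Run the query inside the caller's transaction
            tx.run(
                query,
                source=rel['source'],
                target=rel['target'],
                source_layer=rel.get('source_layer'),
                target_layer=rel.get('target_layer'),
                direction=rel.get('direction'),
            )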

4. **Index Creation**

Optimizing graph performance with indices:

.. code-block:: python

    def create_indices(self, tx):
        queries = [
            "CREATE INDEX concept_type_idx IF NOT EXISTS FOR (n:Concept) ON (n.type)",
            "CREATE INDEX concept_name_idx IF NOT EXISTS FOR (n:Concept) ON (n.name)",
            "CREATE INDEX concept_id_idx IF NOT EXISTS FOR (n:Concept) ON (n.id)",
            "CREATE INDEX domain_id_idx IF NOT EXISTS FOR (n:Domain) ON (n.id)",
            "CREATE INDEX domain_name_idx IF NOT EXISTS FOR (n:Domain) ON (n.name)"
        ]
        
        for query in queries:
            tx.run(query)

5. **Metadata Addition**

Storing each node's degree, in-degree, and out-degree as node properties:

.. code-block:: python

    def add_metadata(self, tx):
        queries = [
            # Degree centrality
            """
            MATCH (n)
            WHERE n:Concept OR n:Domain
            SET n.degree = COUNT {(n)--()}
            """,
            # In-degree
            """
            MATCH (n)
            WHERE n:Concept OR n:Domain
            SET n.in_degree = COUNT {(n)<--()}
            """,
            # Out-degree
            """
            MATCH (n)
            WHERE n:Concept OR n:Domain
            SET n.out_degree = COUNT {(n)-->()}
            """
        ]
        
        for query in queries:
            tx.run(query)

Usage Example
~~~~~~~~~~~~~

Building the complete knowledge graph:

.. code-block:: python

    def main():
        ENTITIES_FILE = "entities.json"
        RELATIONSHIPS_FILE = "relationships.json"
        
        graph = RLKnowledgeGraph()
        try:
            graph.build_graph(ENTITIES_FILE, RELATIONSHIPS_FILE)
        finally:
            graph.close()

    if __name__ == "__main__":
        main()
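
``build_graph`` itself is not shown in this section. A minimal sketch of how it might tie the previous steps together follows, written as a free function that receives the graph object. It assumes that ``entities.json`` has a top-level ``entities`` mapping, that the graph object exposes ``driver`` and the transaction functions shown earlier, and that the installed driver provides ``Session.execute_write`` (neo4j Python driver 5.x):

.. code-block:: python

    import json


    def load_json(path):
        """Read a UTF-8 JSON file into Python objects."""
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)


    def build_graph(graph, entities_file, relationships_file):
        """Load both JSON files and write them to Neo4j through ``graph``."""
        entities = load_json(entities_file).get("entities", {})
        relationships = load_json(relationships_file)

        with graph.driver.session() as session:
            # Index creation runs in its own transaction, apart from data writes.
            session.execute_write(graph.create_indices)
            # One transaction per node: simple, though not the fastest option.
            for entity_id, entity_data in entities.items():
                session.execute_write(graph.create_node, entity_id, entity_data)
            # Relationships come last so that both endpoints already exist.
            session.execute_write(graph.create_relationships, relationships)
            # Degree properties are computed once the graph is complete.
            session.execute_write(graph.add_metadata)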