arXiv Papers Downloader
=======================

This module implements an arXiv paper downloader focused on reinforcement learning papers. For each year from 2017 onwards, it downloads up to 100 of the most relevant papers.

Dependencies
------------

.. code-block:: python

    import arxiv  # Interface with the arXiv API
    import json  # Handle metadata storage
    import os  # File operations
    from datetime import datetime
    import time  # Implement delays between requests
    from pathlib import Path  # Modern path manipulation
    import requests  # Handle HTTP requests
    from typing import Dict, List, Optional  # Type hints
    import logging  # Track operations and errors
    from urllib.parse import urlparse
    import traceback  # Detailed error tracking

Notes:

- ``arxiv``: Python wrapper for the arXiv API (install with ``pip install arxiv``)
- ``pathlib``: Provides an object-oriented interface to filesystem paths
- ``requests``: Used to download the PDFs directly

Class Implementation
--------------------

The downloader is implemented as a single class, ``ArxivDownloader``. The methods in the following sections all belong to this class. Its constructor sets the paths, creates the directories, and configures logging:

.. code-block:: python

    class ArxivDownloader:
        def __init__(self):
            """Initialize the arXiv downloader with fixed paths"""
            # Set your paths here
            self.base_dir = Path("path/to/your/project")  # Replace with your project path
            self.base_path = self.base_dir / "data"
            self.pdfs_dir = self.base_path / "pdfs"
            self.metadata_dir = self.base_path / "metadata"
            self._setup_directories()
            self._setup_logging()

Directory Setup
~~~~~~~~~~~~~~~

This method creates the directory structure, with one subdirectory per year for PDFs and for metadata:

.. code-block:: python

    def _setup_directories(self):
        """Create necessary directory structure"""
        self.pdfs_dir.mkdir(parents=True, exist_ok=True)
        self.metadata_dir.mkdir(parents=True, exist_ok=True)

        # Create year subdirectories
        current_year = datetime.now().year
        for year in range(2017, current_year + 1):
            (self.pdfs_dir / str(year)).mkdir(exist_ok=True)
            (self.metadata_dir / str(year)).mkdir(exist_ok=True)

Logging Configuration
~~~~~~~~~~~~~~~~~~~~~

This method sets up logging to both a file and the console, so that operations and errors can be tracked:

.. code-block:: python

    def _setup_logging(self):
        """Setup logging configuration"""
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler(self.base_path / "download_log.txt"),
                logging.StreamHandler()
            ]
        )
        self.logger = logging.getLogger(__name__)

PDF Download Handler
~~~~~~~~~~~~~~~~~~~~

This method downloads a single PDF. On any failure it logs the error and returns ``False`` instead of raising:

.. code-block:: python

    def _safe_download_pdf(self, paper, pdf_path: Path) -> bool:
        """Safely download PDF with proper path handling"""
        try:
            pdf_path.parent.mkdir(parents=True, exist_ok=True)

            # Pick the first link that points at the PDF
            pdf_url = next(str(link) for link in paper.links if 'pdf' in str(link))
            # A timeout keeps a stalled connection from blocking the whole run
            response = requests.get(pdf_url, allow_redirects=True, timeout=30)

            if response.status_code == 200:
                with open(pdf_path, 'wb') as f:
                    f.write(response.content)
                return True

            self.logger.error(f"Failed to download PDF, status code: {response.status_code}")
            return False
        except Exception as e:
            self.logger.error(f"Error downloading PDF: {str(e)}")
            return False

Papers Download
~~~~~~~~~~~~~~~

The main method downloads the PDFs and metadata for a single year:

.. code-block:: python

    def download_papers(self, year: int) -> int:
        """Download up to 100 papers for the specified year.

        Returns the number of papers whose metadata was saved.
        """
        self.logger.info(f"\nStarting download for year {year}")

        # Configure the API client (page size, rate limiting, retries)
        client = arxiv.Client(
            page_size=100,
            delay_seconds=3.0,
            num_retries=5
        )

        # Create search with max_results=100
        search = arxiv.Search(
            query=self._construct_query(year),
            max_results=100  # Limit to 100 papers
        )

        # Initialize metadata collection and PDF counter
        year_metadata = []
        pdfs_downloaded = 0

        try:
            # Collect at most 100 papers
            papers = list(client.results(search))
            total_papers = len(papers)
            print(f"\nFound {total_papers} papers for {year}")

            if not papers:
                self.logger.warning(f"No papers found for year {year}")
                return 0

            # Process and download papers
            print(f"\nDownloading PDFs for year {year}:")
            for idx, paper in enumerate(papers, 1):
                paper_id = paper.entry_id.split('/')[-1].split('v')[0]  # Remove version number
                try:
                    # Extract and save metadata
                    metadata = self._extract_metadata(paper)
                    year_metadata.append(metadata)

                    # Download PDF unless it is already on disk
                    pdf_path = self.pdfs_dir / str(year) / f"{paper_id}.pdf"
                    if not pdf_path.exists():
                        if self._safe_download_pdf(paper, pdf_path):
                            pdfs_downloaded += 1
                            print(f"[{idx}/{total_papers}] Downloaded: {paper.title[:50]}...")
                        else:
                            print(f"[{idx}/{total_papers}] Failed to download: {paper.title[:50]}...")
                    else:
                        print(f"[{idx}/{total_papers}] Already exists: {paper.title[:50]}...")

                    time.sleep(1)  # Be nice to arXiv servers
                except Exception as e:
                    self.logger.error(f"Error processing paper {paper_id}: {str(e)}")
                    continue

            # Save metadata for the year
            metadata_path = self.metadata_dir / str(year) / "metadata.json"
            with open(metadata_path, 'w', encoding='utf-8') as f:
                json.dump(year_metadata, f, indent=2, ensure_ascii=False)

            print(f"\nYear {year} complete:")
            print(f"- Papers processed: {len(year_metadata)}")
            print(f"- PDFs downloaded: {pdfs_downloaded}")
            print(f"- Metadata saved to: {metadata_path}")
            return len(year_metadata)
        except Exception as e:
            self.logger.error(f"Error downloading papers for year {year}: {str(e)}")
            return 0

Query Construction
~~~~~~~~~~~~~~~~~~

This method builds the arXiv search query. It restricts results to the ``cs.AI`` and ``cs.LG`` categories, requires a reinforcement learning phrase in the abstract, and limits the submission date to the given year:

.. code-block:: python

    def _construct_query(self, year: int) -> str:
        """Construct arXiv query for RL papers from specific year"""
        return (f'(cat:cs.AI OR cat:cs.LG) AND '
                f'(abs:"reinforcement learning" OR abs:"deep reinforcement learning") AND '
                f'submittedDate:[{year}0101 TO {year}1231]')

Metadata Extraction
~~~~~~~~~~~~~~~~~~~

This method converts a search result into a JSON-serializable dictionary:

.. code-block:: python

    def _extract_metadata(self, paper) -> Dict:
        """Extract relevant metadata from arXiv paper"""
        return {
            'title': paper.title,
            'authors': [str(author) for author in paper.authors],
            'abstract': paper.summary,
            'categories': paper.categories,
            'published': paper.published.strftime('%Y-%m-%d'),
            'updated': paper.updated.strftime('%Y-%m-%d'),
            'arxiv_id': paper.entry_id.split('/')[-1],
            'primary_category': paper.primary_category,
            'doi': paper.doi,
            'links': [str(link) for link in paper.links]
        }

Running the Downloader
----------------------

To use the downloader:

.. code-block:: python

    # Initialize the downloader
    downloader = ArxivDownloader()

    # Start downloading papers from 2017 to current year
    downloader.download_all_years(start_year=2017)

Experimentation Tips
--------------------

1. To experiment with different years:

   - Modify the ``start_year`` parameter when calling ``download_all_years()``
   - You can also download papers for a single year using ``download_papers(year)``

2. To modify the search criteria:

   - Adjust the ``_construct_query()`` method to change search terms or categories
   - Add or remove arXiv categories (e.g., add ``cs.NE`` for Neural and Evolutionary Computing)

3. To customize the metadata:

   - Modify the ``_extract_metadata()`` method to add or remove fields
   - You can add custom fields such as citation count if you integrate with other APIs

4. To adjust download behavior:

   - Modify the delay between downloads in ``download_papers()`` (currently 1 second)
   - Change the delay between years in ``download_all_years()`` (currently 5 seconds)
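
The ``download_all_years()`` method called in "Running the Downloader" is not listed in this document. The sketch below shows one way it might look, written as a standalone function so it can be tried without network access. The ``end_year`` and ``delay_between_years`` parameters and the returned dictionary are assumptions made for illustration; only ``start_year`` and the 5-second pause between years come from the text above.

.. code-block:: python

    import time
    from datetime import datetime
    from typing import Dict, Optional


    def download_all_years(downloader, start_year: int = 2017,
                           end_year: Optional[int] = None,
                           delay_between_years: float = 5.0) -> Dict[int, int]:
        """Call downloader.download_papers(year) for every year in the range.

        Hypothetical helper: `downloader` is any object with a
        download_papers(year) method that returns a count.
        """
        if end_year is None:
            end_year = datetime.now().year  # Default: up to the current year

        results: Dict[int, int] = {}
        for year in range(start_year, end_year + 1):
            results[year] = downloader.download_papers(year)
            # Pause between years, but not after the last one
            if year < end_year and delay_between_years > 0:
                time.sleep(delay_between_years)
        return results

To use this as a method of ``ArxivDownloader``, rename ``downloader`` to ``self`` and indent the function into the class body. Returning the per-year counts makes it easy to spot years where the search came back empty.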
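
For the tip about changing search criteria, the query string can be built from lists instead of being hard-coded. The following is a minimal sketch, not part of the original module: ``build_query``, ``categories``, and ``phrases`` are hypothetical names, and the defaults reproduce the string returned by ``_construct_query()``.

.. code-block:: python

    from typing import Sequence


    def build_query(year: int,
                    categories: Sequence[str] = ("cs.AI", "cs.LG"),
                    phrases: Sequence[str] = ("reinforcement learning",
                                              "deep reinforcement learning")) -> str:
        """Build an arXiv query string for one year (hypothetical helper)."""
        if not categories or not phrases:
            raise ValueError("at least one category and one phrase are required")

        # OR together the categories and the abstract phrases
        cat_clause = " OR ".join(f"cat:{c}" for c in categories)
        abs_clause = " OR ".join(f'abs:"{p}"' for p in phrases)

        # Same date range format as _construct_query()
        return (f"({cat_clause}) AND ({abs_clause}) AND "
                f"submittedDate:[{year}0101 TO {year}1231]")


    # Example: also search Neural and Evolutionary Computing
    query = build_query(2017, categories=("cs.AI", "cs.LG", "cs.NE"))

``_construct_query()`` could then delegate to such a helper, so that adding or removing a category becomes a one-argument change.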