Configuration

Customize Nancy Brain's behavior through configuration files and environment variables.

Configuration Files

Nancy Brain uses YAML configuration files to define repositories, search weights, and processing options.

Repository Configuration (`config/repositories.yml`)

Define which GitHub repositories to index:

# Group repositories by category
ml_frameworks:
  - name: "scikit-learn"
    url: "https://github.com/scikit-learn/scikit-learn"
    description: "Machine learning library for Python"
    is_notebook: false

  - name: "pytorch"  
    url: "https://github.com/pytorch/pytorch"
    description: "Deep learning framework"
    is_notebook: false

documentation:
  - name: "python-docs"
    url: "https://github.com/python/cpython"
    description: "Python source and documentation" 
    is_notebook: false

jupyter_examples:
  - name: "data-science-notebooks"
    url: "https://github.com/your-org/notebooks"
    description: "Data science tutorial notebooks"
    is_notebook: true  # Special handling for notebooks

Fields:

- `name`: Unique identifier within the category
- `url`: Git repository URL (HTTPS or SSH)
- `description`: Human-readable description
- `is_notebook`: Set to `true` for notebook-heavy repositories

Categories become path prefixes in search results (e.g., ml_frameworks/scikit-learn/...).
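
The field rules above can be checked before a build. The following is a minimal sketch, not Nancy Brain's own validator: it assumes the YAML file has already been parsed into a dictionary that maps each category to a list of entries. The helper names `validate_repositories` and `path_prefix` are hypothetical, and treating `name` and `url` as required is an assumption.

```python
from typing import Any

# Assumption: these two fields are mandatory; the text above only describes them.
REQUIRED_FIELDS = ("name", "url")


def validate_repositories(config: dict[str, list[dict[str, Any]]]) -> list[str]:
    """Return a list of problems found; an empty list means the config looks fine."""
    problems: list[str] = []
    for category, repos in config.items():
        seen_names: set[str] = set()
        for index, repo in enumerate(repos):
            label = f"{category}[{index}]"
            for field in REQUIRED_FIELDS:
                if not repo.get(field):
                    problems.append(f"{label}: missing '{field}'")
            name = repo.get("name")
            if name:
                # Names must be unique within their category.
                if name in seen_names:
                    problems.append(f"{label}: duplicate name '{name}'")
                seen_names.add(name)
            if not isinstance(repo.get("is_notebook", False), bool):
                problems.append(f"{label}: 'is_notebook' must be true or false")
    return problems


def path_prefix(category: str, name: str) -> str:
    """Build the search-result path prefix for a repository."""
    return f"{category}/{name}/"


config = {
    "ml_frameworks": [
        {
            "name": "scikit-learn",
            "url": "https://github.com/scikit-learn/scikit-learn",
            "description": "Machine learning library for Python",
            "is_notebook": False,
        }
    ]
}
print(validate_repositories(config))                 # []
print(path_prefix("ml_frameworks", "scikit-learn"))  # ml_frameworks/scikit-learn/
```

Returning a list of messages instead of raising on the first problem lets every issue in a file be reported in one pass.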

Search Weights (`config/weights.yaml`)

Control which files get higher priority in search results:

# Base multipliers by file extension
extensions:
  .md: 1.5        # Boost markdown documentation
  .rst: 1.5       # Boost reStructuredText docs  
  .txt: 1.2       # Boost plain text files
  .py: 1.0        # Standard weight for Python code
  .js: 0.9        # Slightly lower for JavaScript
  .json: 0.7      # Lower weight for config files
  .log: 0.3       # Very low for log files

# Additional multipliers for path patterns
path_includes:
  "README": 2.0       # Heavily boost README files
  "tutorial": 1.8     # Boost tutorial content
  "example": 1.5      # Boost example code
  "guide": 1.5        # Boost guide documentation
  "docs/": 1.3        # Boost files in docs directories
  "test": 0.8         # Lower weight for test files
  "__pycache__": 0.1  # Very low for cache files

How it works:

1. Base weight starts at 1.0
2. Extension multiplier is applied
3. Each matching path pattern multiplier is applied
4. Final weight = base × extension × path₁ × path₂ × ...
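
The calculation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the function name `compute_weight` is hypothetical, path patterns are treated as plain case-sensitive substrings, and extensions without an entry are given a neutral multiplier of 1.0.

```python
import os


def compute_weight(path, extensions, path_includes, base=1.0):
    """Multiply the base weight by the extension and every matching path multiplier."""
    weight = base
    extension = os.path.splitext(path)[1].lower()
    # Assumption: extensions without an entry keep a neutral multiplier of 1.0.
    weight *= extensions.get(extension, 1.0)
    for pattern, multiplier in path_includes.items():
        # Assumption: patterns are plain, case-sensitive substrings of the path.
        if pattern in path:
            weight *= multiplier
    return weight


extensions = {".md": 1.5, ".py": 1.0, ".json": 0.7}
path_includes = {"README": 2.0, "docs/": 1.3, "test": 0.8}

# docs/README.md matches ".md", "README" and "docs/": 1.0 x 1.5 x 2.0 x 1.3, about 3.9
print(compute_weight("docs/README.md", extensions, path_includes))

# tests/test_utils.py matches ".py" and "test": 1.0 x 1.0 x 0.8
print(compute_weight("tests/test_utils.py", extensions, path_includes))
```

Because the multipliers compound, a file that matches several boosting patterns can end up far above any single configured value, which is worth keeping in mind when choosing large multipliers.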

PDF Articles (`config/articles.yml`)

Optionally include PDF research papers and documentation:

research_papers:
  - name: "attention_is_all_you_need"
    url: "https://arxiv.org/pdf/1706.03762.pdf"
    description: "Transformer architecture paper (Vaswani et al.)"

  - name: "bert_paper"
    url: "https://arxiv.org/pdf/1810.04805.pdf" 
    description: "BERT: Pre-training of Deep Bidirectional Transformers"

documentation:
  - name: "python_tutorial"
    url: "https://docs.python.org/3/tutorial/tutorial.pdf"
    description: "Official Python Tutorial (PDF)"

PDF Processing Requirements:

- Java (for Apache Tika): `brew install openjdk`
- Fallback libraries: `pip install PyPDF2 pdfplumber pymupdf`
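
To see which of these prerequisites are present on a machine, a small standalone check such as the one below can help. It is a sketch and not part of Nancy Brain: the function name `check_pdf_prerequisites` is hypothetical, and it only tests whether a `java` executable is on the `PATH` and whether the fallback libraries can be imported.

```python
import importlib.util
import shutil

# Package name -> import name. PyMuPDF has traditionally been imported as "fitz".
FALLBACK_MODULES = {"PyPDF2": "PyPDF2", "pdfplumber": "pdfplumber", "pymupdf": "fitz"}


def check_pdf_prerequisites() -> dict[str, bool]:
    """Report whether Java and each fallback PDF library can be found."""
    status = {"java": shutil.which("java") is not None}
    for package, module in FALLBACK_MODULES.items():
        status[package] = importlib.util.find_spec(module) is not None
    return status


for name, available in check_pdf_prerequisites().items():
    print(f"{name}: {'found' if available else 'missing'}")
```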

Environment Variables

Core Settings

| Variable | Default | Description |
|----------|---------|-------------|
| `USE_DUAL_EMBEDDING` | `true` | Enable separate code and text embedding models |
| `CODE_EMBEDDING_MODEL` | `microsoft/codebert-base` | Model for code embeddings |
| `KMP_DUPLICATE_LIB_OK` | `TRUE` | Fix OpenMP conflicts on macOS |

PDF Processing

| Variable | Default | Description |
|----------|---------|-------------|
| `JAVA_HOME` | Auto-detected | Java installation path |
| `TIKA_SERVER_TIMEOUT` | `60` | Tika server timeout (seconds) |
| `SKIP_PDF_PROCESSING` | `false` | Skip PDF processing entirely |

Development

| Variable | Default | Description |
|----------|---------|-------------|
| `NANCY_LOG_LEVEL` | `INFO` | Logging level (DEBUG, INFO, WARNING, ERROR) |
| `NANCY_CACHE_SIZE` | `1000` | Search result cache size |
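
The variables and defaults listed in this section can be mirrored in a script that needs to respect the same settings. The sketch below is illustrative only: `load_settings` and its helpers are hypothetical names, and the set of strings accepted as "true" is an assumption, not documented Nancy Brain behavior.

```python
import os
from typing import Mapping, Optional

# Assumption: these spellings count as "true"; anything else counts as "false".
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() in TRUE_VALUES


def env_int(env: Mapping[str, str], name: str, default: int) -> int:
    raw = env.get(name)
    try:
        return default if raw is None else int(raw)
    except ValueError:
        # Fall back to the default rather than crash on a malformed value.
        return default


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    return {
        "use_dual_embedding": env_bool(env, "USE_DUAL_EMBEDDING", True),
        "code_embedding_model": env.get("CODE_EMBEDDING_MODEL", "microsoft/codebert-base"),
        "tika_server_timeout": env_int(env, "TIKA_SERVER_TIMEOUT", 60),
        "skip_pdf_processing": env_bool(env, "SKIP_PDF_PROCESSING", False),
        "log_level": env.get("NANCY_LOG_LEVEL", "INFO").upper(),
        "cache_size": env_int(env, "NANCY_CACHE_SIZE", 1000),
    }


print(load_settings({"NANCY_LOG_LEVEL": "debug", "NANCY_CACHE_SIZE": "2000"}))
```

Passing a mapping instead of reading `os.environ` directly keeps the function easy to test without touching the real environment.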

Runtime Configuration

Dynamic Weights

Adjust document importance during runtime:

**Check knowledge base content:**

```bash
nancy-brain search "test" --limit 10
```

### Search Filters

Apply filters during search:

```bash
# Filter by category
nancy-brain search "machine learning" --category ml_frameworks

# Filter by file type
nancy-brain search "tutorial" --filetype .md

# Combine filters
nancy-brain search "example" --category documentation --filetype .py
```

Advanced Configuration

Custom Embedding Models

Override the default embedding models:

export USE_DUAL_EMBEDDING=true
export CODE_EMBEDDING_MODEL="microsoft/graphcodebert-base"
export TEXT_EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"

Build Options

Customize the knowledge base build process:

# Build with custom paths
nancy-brain build \
  --config custom-repos.yml \
  --embeddings-path custom/embeddings \
  --force-update

# Build specific categories only
nancy-brain build --category ml_frameworks

# Keep raw files for debugging
nancy-brain build --dirty

Performance Tuning

For large knowledge bases:

# config/performance.yml
build:
  batch_size: 100        # Documents per batch
  max_workers: 4         # Parallel processing threads
  chunk_size: 512        # Text chunk size for embedding

search:
  cache_size: 2000       # Larger cache for frequent searches
  timeout: 30            # Search timeout seconds
  max_results: 100       # Maximum results to rank

Configuration Examples

Academic Research Setup

Suited to indexing research papers alongside their code:

# config/repositories.yml
research_papers:
  - name: "paper-code"
    url: "https://github.com/author/paper-implementation"
    description: "Implementation code for our paper"
    is_notebook: true

reference_implementations:
  - name: "baseline-methods"
    url: "https://github.com/org/baseline-methods"
    description: "Baseline implementation for comparison"
    is_notebook: false

# config/weights.yaml
extensions:
  .md: 2.0      # Heavy boost for documentation
  .ipynb: 1.8   # Boost notebooks
  .py: 1.5      # Boost Python code
  .pdf: 2.5     # Heavy boost for PDFs

path_includes:
  "paper": 3.0      # Heavily boost paper-related content
  "results": 2.0    # Boost results and figures
  "notebook": 1.8   # Boost notebook directories

Software Development Setup

Focused on code documentation and examples:

# config/repositories.yml
frameworks:
  - name: "main-framework"
    url: "https://github.com/your-org/framework"
    description: "Main framework repository"
    is_notebook: false

  - name: "examples"
    url: "https://github.com/your-org/examples"
    description: "Usage examples and tutorials"
    is_notebook: true

dependencies:
  - name: "core-library"
    url: "https://github.com/org/core-lib"
    description: "Core dependency library"
    is_notebook: false

# config/weights.yaml
extensions:
  .py: 1.5      # Boost Python code
  .md: 1.8      # Boost documentation
  .yaml: 1.2    # Boost config files
  .json: 0.8    # Lower weight for data files

path_includes:
  "README": 3.0     # Critical for understanding
  "docs/": 2.0      # Documentation directory
  "examples/": 1.8  # Example code
  "api/": 1.5       # API documentation
  "test": 0.7       # Lower weight for tests

Troubleshooting Configuration

Validation

Check your configuration files:

# Validate repository config
nancy-brain validate-config config/repositories.yml

# Test search weights
nancy-brain test-weights config/weights.yaml

# Check PDF article URLs
nancy-brain validate-articles config/articles.yml

Common Issues

Repository clone failures:

# Test repository access
git ls-remote https://github.com/org/repo.git

# Check SSH key setup for private repos
ssh -T git@github.com

PDF download failures:

# Test PDF URLs manually
curl -I https://arxiv.org/pdf/1706.03762.pdf

# Check Java installation for Tika
java -version
echo $JAVA_HOME

Search weights not taking effect:

# Check weight calculation
nancy-brain explain-weights "path/to/document.md"

# Test specific patterns
nancy-brain test-pattern "README" --weights config/weights.yaml

Next Steps

- Quick Start: Build your first knowledge base
- VS Code Integration: Use in your development workflow
- Python API: Programmatic access to configuration