Configuration
Customize Nancy Brain's behavior through configuration files and environment variables.
Configuration Files
Nancy Brain uses YAML configuration files to define repositories, search weights, and processing options.
Repository Configuration (config/repositories.yml)
Define which GitHub repositories to index:
# Group repositories by category
ml_frameworks:
  - name: "scikit-learn"
    url: "https://github.com/scikit-learn/scikit-learn"
    description: "Machine learning library for Python"
    is_notebook: false
  - name: "pytorch"
    url: "https://github.com/pytorch/pytorch"
    description: "Deep learning framework"
    is_notebook: false
documentation:
  - name: "python-docs"
    url: "https://github.com/python/cpython"
    description: "Python source and documentation"
    is_notebook: false
jupyter_examples:
  - name: "data-science-notebooks"
    url: "https://github.com/your-org/notebooks"
    description: "Data science tutorial notebooks"
    is_notebook: true  # Special handling for notebooks
Fields:
- name: Unique identifier within the category
- url: Git repository URL (HTTPS or SSH)
- description: Human-readable description
- is_notebook: Set to true for notebook-heavy repositories
Categories become path prefixes in search results (e.g., ml_frameworks/scikit-learn/...).
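As a rough illustration of how these fields and the category-to-prefix rule fit together, the sketch below checks an already-parsed repositories config and derives the prefix for each repository. It assumes the YAML has been loaded into a plain dict, and it assumes that name and url are mandatory, which the documentation does not state. The names REQUIRED_FIELDS, validate_repositories, and path_prefixes are hypothetical and not part of Nancy Brain.

```python
# Hypothetical helpers, not part of Nancy Brain. The dict below mirrors what
# a YAML parser would produce from config/repositories.yml.
REQUIRED_FIELDS = ("name", "url")  # assumption: these two are mandatory


def validate_repositories(config):
    """Return a list of human-readable problems found in the config."""
    problems = []
    for category, repos in config.items():
        seen = set()
        for repo in repos:
            for field in REQUIRED_FIELDS:
                if not repo.get(field):
                    problems.append(f"{category}: entry missing '{field}'")
            name = repo.get("name")
            if name:
                # Names must be unique within a category.
                if name in seen:
                    problems.append(f"{category}: duplicate name '{name}'")
                seen.add(name)
            if "is_notebook" in repo and not isinstance(repo["is_notebook"], bool):
                problems.append(f"{category}: is_notebook must be true or false")
    return problems


def path_prefixes(config):
    """Build the category/name/ prefix that search results would carry."""
    return [
        f"{category}/{repo['name']}/"
        for category, repos in config.items()
        for repo in repos
        if repo.get("name")
    ]


config = {
    "ml_frameworks": [
        {
            "name": "scikit-learn",
            "url": "https://github.com/scikit-learn/scikit-learn",
            "description": "Machine learning library for Python",
            "is_notebook": False,
        }
    ]
}
print(validate_repositories(config))
print(path_prefixes(config))
```

Checking the config in a separate step like this can surface duplicate names or missing URLs before a long build starts.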
Search Weights (config/weights.yaml)
Control which files get higher priority in search results:
# Base multipliers by file extension
extensions:
  .md: 1.5    # Boost markdown documentation
  .rst: 1.5   # Boost reStructuredText docs
  .txt: 1.2   # Boost plain text files
  .py: 1.0    # Standard weight for Python code
  .js: 0.9    # Slightly lower for JavaScript
  .json: 0.7  # Lower weight for config files
  .log: 0.3   # Very low for log files
# Additional multipliers for path patterns
path_includes:
  "README": 2.0       # Heavily boost README files
  "tutorial": 1.8     # Boost tutorial content
  "example": 1.5      # Boost example code
  "guide": 1.5        # Boost guide documentation
  "docs/": 1.3        # Boost files in docs directories
  "test": 0.8         # Lower weight for test files
  "__pycache__": 0.1  # Very low for cache files
How it works:
1. Base weight starts at 1.0
2. Extension multiplier is applied
3. Each matching path pattern multiplier is applied
4. Final weight = base × extension × path₁ × path₂ × ...
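The four steps above can be sketched in a few lines of Python. This is an illustration rather than Nancy Brain's actual implementation: the function name final_weight is hypothetical, and the sketch assumes that path patterns are matched as case-sensitive substrings and that each pattern applies at most once per path, neither of which is spelled out here.

```python
from pathlib import PurePosixPath

# A subset of the multipliers from config/weights.yaml above.
EXTENSIONS = {".md": 1.5, ".py": 1.0, ".json": 0.7}
PATH_INCLUDES = {"README": 2.0, "docs/": 1.3, "test": 0.8}


def final_weight(path, extensions=EXTENSIONS, path_includes=PATH_INCLUDES):
    """Multiply the base weight by the extension and path multipliers."""
    weight = 1.0  # step 1: base weight
    suffix = PurePosixPath(path).suffix
    weight *= extensions.get(suffix, 1.0)  # step 2: unknown extensions stay at 1.0
    for pattern, multiplier in path_includes.items():
        if pattern in path:  # step 3: assumed substring match
            weight *= multiplier
    return weight  # step 4: product of all applied multipliers


# docs/README.md picks up the .md, "README", and "docs/" multipliers.
print(final_weight("docs/README.md"))
# tests/test_api.py picks up only the .py and "test" multipliers.
print(final_weight("tests/test_api.py"))
```

Because the multipliers compound, a file that matches several boosting patterns can end up weighted far above a file that matches one, which is worth keeping in mind when choosing values above 2.0.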
PDF Articles (config/articles.yml)
Optionally include PDF research papers and documentation:
research_papers:
  - name: "attention_is_all_you_need"
    url: "https://arxiv.org/pdf/1706.03762.pdf"
    description: "Transformer architecture paper (Vaswani et al.)"
  - name: "bert_paper"
    url: "https://arxiv.org/pdf/1810.04805.pdf"
    description: "BERT: Pre-training of Deep Bidirectional Transformers"
documentation:
  - name: "python_tutorial"
    url: "https://docs.python.org/3/tutorial/tutorial.pdf"
    description: "Official Python Tutorial (PDF)"
PDF Processing Requirements:
- Java (for Apache Tika); on macOS with Homebrew: brew install openjdk
- Fallback libraries: pip install PyPDF2 pdfplumber pymupdf
Environment Variables
Core Settings
Variable | Default | Description |
---|---|---|
USE_DUAL_EMBEDDING | true | Enable separate code and text embedding models |
CODE_EMBEDDING_MODEL | microsoft/codebert-base | Model for code embeddings |
KMP_DUPLICATE_LIB_OK | TRUE | Fix OpenMP conflicts on macOS |
PDF Processing
Variable | Default | Description |
---|---|---|
JAVA_HOME | Auto-detected | Java installation path |
TIKA_SERVER_TIMEOUT | 60 | Tika server timeout (seconds) |
SKIP_PDF_PROCESSING | false | Skip PDF processing entirely |
Development
Variable | Default | Description |
---|---|---|
NANCY_LOG_LEVEL | INFO | Logging level (DEBUG, INFO, WARNING, ERROR) |
NANCY_CACHE_SIZE | 1000 | Search result cache size |
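To make the defaults in the tables above concrete, here is a minimal sketch of reading these variables from the environment and converting them to typed values. The names DEFAULTS, load_settings, and _as_bool are hypothetical, and the set of strings accepted as "true" is an assumption; how Nancy Brain itself parses these values is not documented here.

```python
import os

# Defaults copied from the tables above.
DEFAULTS = {
    "USE_DUAL_EMBEDDING": "true",
    "CODE_EMBEDDING_MODEL": "microsoft/codebert-base",
    "TIKA_SERVER_TIMEOUT": "60",
    "SKIP_PDF_PROCESSING": "false",
    "NANCY_LOG_LEVEL": "INFO",
    "NANCY_CACHE_SIZE": "1000",
}


def _as_bool(value):
    # Assumption: these spellings count as true; anything else is false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "use_dual_embedding": _as_bool(raw["USE_DUAL_EMBEDDING"]),
        "code_embedding_model": raw["CODE_EMBEDDING_MODEL"],
        "tika_server_timeout": int(raw["TIKA_SERVER_TIMEOUT"]),
        "skip_pdf_processing": _as_bool(raw["SKIP_PDF_PROCESSING"]),
        "log_level": raw["NANCY_LOG_LEVEL"].upper(),
        "cache_size": int(raw["NANCY_CACHE_SIZE"]),
    }


print(load_settings({"NANCY_LOG_LEVEL": "debug"}))
```

Passing a plain dict instead of os.environ keeps the function easy to exercise in tests without touching the real environment.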
Runtime Configuration
Dynamic Weights
Adjust document importance during runtime:
Search Filters
Apply filters during search:
# Filter by category
nancy-brain search "machine learning" --category ml_frameworks
# Filter by file type
nancy-brain search "tutorial" --filetype .md
# Combine filters
nancy-brain search "example" --category documentation --filetype .py
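Since categories appear as path prefixes in search results, the effect of the two filters can be pictured as simple checks on each result path. The sketch below is only a model of that behavior under this assumption; matches_filters and filter_results are hypothetical names, and the CLI may implement filtering differently.

```python
def matches_filters(path, category=None, filetype=None):
    """Return True when a result path passes the optional filters."""
    # Assumption: a category filter is a match on the leading path segment.
    if category is not None and not path.startswith(category.rstrip("/") + "/"):
        return False
    # Assumption: a file type filter is a match on the file extension.
    if filetype is not None and not path.endswith(filetype):
        return False
    return True


def filter_results(paths, category=None, filetype=None):
    """Keep only the paths that satisfy every filter that was supplied."""
    return [p for p in paths if matches_filters(p, category, filetype)]


results = [
    "ml_frameworks/scikit-learn/README.md",
    "ml_frameworks/pytorch/torch/nn/modules/linear.py",
    "documentation/python-docs/Doc/tutorial/index.rst",
    "documentation/python-docs/Lib/json/encoder.py",
]
print(filter_results(results, category="documentation", filetype=".py"))
```

Combining filters narrows the result set: a path must satisfy both the category and the file type to be kept.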
Advanced Configuration
Custom Embedding Models
Override the default embedding models:
export USE_DUAL_EMBEDDING=true
export CODE_EMBEDDING_MODEL="microsoft/graphcodebert-base"
export TEXT_EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"
Build Options
Customize the knowledge base build process:
# Build with custom paths
nancy-brain build \
  --config custom-repos.yml \
  --embeddings-path custom/embeddings \
  --force-update
# Build specific categories only
nancy-brain build --category ml_frameworks
# Keep raw files for debugging
nancy-brain build --dirty
Performance Tuning
For large knowledge bases:
# config/performance.yml
build:
  batch_size: 100   # Documents per batch
  max_workers: 4    # Parallel processing threads
  chunk_size: 512   # Text chunk size for embedding
search:
  cache_size: 2000  # Larger cache for frequent searches
  timeout: 30       # Search timeout seconds
  max_results: 100  # Maximum results to rank
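To show how the three build settings relate, the following sketch splits text into chunks, groups the chunks into batches, and processes the batches on a thread pool. It is a conceptual model only: the unit of chunk_size is not stated above, so whitespace-separated words are used as a stand-in, and chunk_words, batched, and process_batches are hypothetical names rather than Nancy Brain functions.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_words(text, chunk_size=512):
    """Split text into chunks of at most chunk_size words (assumed unit)."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_batches(batches, worker, max_workers=4):
    """Run worker on each batch in parallel, preserving batch order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, batches))


if __name__ == "__main__":
    chunks = chunk_words("word " * 1200, chunk_size=512)
    batches = list(batched(chunks, batch_size=100))
    # len is a stand-in for a real embedding call on each batch.
    print(process_batches(batches, worker=len, max_workers=4))
```

Larger batches and more workers generally trade memory for throughput, so it is reasonable to raise them gradually while watching resource usage.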
Configuration Examples
Academic Research Setup
Suited to indexing research papers alongside their code:
# config/repositories.yml
research_papers:
  - name: "paper-code"
    url: "https://github.com/author/paper-implementation"
    description: "Implementation code for our paper"
    is_notebook: true
reference_implementations:
  - name: "baseline-methods"
    url: "https://github.com/org/baseline-methods"
    description: "Baseline implementation for comparison"
    is_notebook: false

# config/weights.yaml
extensions:
  .md: 2.0     # Heavy boost for documentation
  .ipynb: 1.8  # Boost notebooks
  .py: 1.5     # Boost Python code
  .pdf: 2.5    # Heavy boost for PDFs
path_includes:
  "paper": 3.0     # Heavily boost paper-related content
  "results": 2.0   # Boost results and figures
  "notebook": 1.8  # Boost notebook directories
Software Development Setup
Focused on code documentation and examples:
# config/repositories.yml
frameworks:
  - name: "main-framework"
    url: "https://github.com/your-org/framework"
    description: "Main framework repository"
    is_notebook: false
  - name: "examples"
    url: "https://github.com/your-org/examples"
    description: "Usage examples and tutorials"
    is_notebook: true
dependencies:
  - name: "core-library"
    url: "https://github.com/org/core-lib"
    description: "Core dependency library"
    is_notebook: false

# config/weights.yaml
extensions:
  .py: 1.5    # Boost Python code
  .md: 1.8    # Boost documentation
  .yaml: 1.2  # Boost config files
  .json: 0.8  # Lower weight for data files
path_includes:
  "README": 3.0     # Critical for understanding
  "docs/": 2.0      # Documentation directory
  "examples/": 1.8  # Example code
  "api/": 1.5       # API documentation
  "test": 0.7       # Lower weight for tests
Troubleshooting Configuration
Validation
Check your configuration files:
# Validate repository config
nancy-brain validate-config config/repositories.yml
# Test search weights
nancy-brain test-weights config/weights.yaml
# Check PDF article URLs
nancy-brain validate-articles config/articles.yml
Common Issues
Repository clone failures:
# Test repository access
git ls-remote https://github.com/org/repo.git
# Check SSH key setup for private repos
ssh -T git@github.com
PDF download failures:
# Test PDF URLs manually
curl -I https://arxiv.org/pdf/1706.03762.pdf
# Check Java installation for Tika
java -version
echo $JAVA_HOME
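The same Java checks can be scripted, which may be convenient in CI or a setup script. This is a small sketch using only the Python standard library; pdf_preflight is a hypothetical helper and not a Nancy Brain command.

```python
import os
import shutil


def pdf_preflight(env=None):
    """Report whether Java looks available for Tika-based PDF processing."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    return {
        # True when a java executable is found on PATH.
        "java_on_path": shutil.which("java") is not None,
        # True when JAVA_HOME is set to a non-empty value.
        "java_home_set": bool(java_home),
        # True when JAVA_HOME points at an existing directory.
        "java_home_exists": bool(java_home) and os.path.isdir(java_home),
    }


print(pdf_preflight())
```

If none of the three values is true, setting SKIP_PDF_PROCESSING=true or relying on the fallback PDF libraries listed earlier are the options this page describes.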
Search weights not taking effect:
# Check weight calculation
nancy-brain explain-weights "path/to/document.md"
# Test specific patterns
nancy-brain test-pattern "README" --weights config/weights.yaml
Next Steps
- Quick Start - Build your first knowledge base
- VS Code Integration - Use in your development workflow
- Python API - Programmatic access to configuration