🏛️ Build a Searchable AI Knowledge Base from the Brockhaus & Efron Encyclopedia
The Brockhaus & Efron Encyclopedic Dictionary (Энциклопедический словарь Брокгауза и Ефрона) is one of the great reference works of the late Tsarist period. Published between 1890 and 1907, it spans 86 volumes and contains around 121,000 articles covering science, history, culture, literature, and technology.
You can download all 86 volumes as DjVu files from Runivers. But scanned images are hard to search. This guide walks you through turning those DjVu files into a fully queryable, AI-powered knowledge base that runs entirely on your Mac — no cloud, no data leaving your machine.
The stack: CocoIndex for chunking and embedding, PostgreSQL + pgvector for storage and semantic search, and any LLM (Claude, Gemini, OpenAI, or a local Ollama model) for answering questions.
1. How It All Fits Together
The pipeline has five phases:
- DjVu → PDF — ddjvu converts scanned images to PDF
- PDF → Text — Tesseract OCR (or Apple Vision, see the appendix) extracts readable Russian text
- Text → Chunks → Embeddings — CocoIndex splits text and generates vector embeddings stored in PostgreSQL
- Obsidian export — Every chunk becomes a Markdown file with wikilinks, browsable in Obsidian
- Query — Ask questions in any language, get grounded answers with volume citations
The magic of CocoIndex is that it’s incremental. After your first run, it tracks file hashes. If you re-OCR one volume, only that volume gets re-chunked and re-embedded on the next run — everything else is skipped.
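CocoIndex's change tracking lives inside its Rust engine, but the idea is easy to picture: hash each source file's content and reprocess only the files whose hash changed. A toy sketch of that mechanism (the function names here are illustrative, not CocoIndex's API):

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """Hash the file's bytes; the digest changes only when the content changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def files_to_reprocess(paths: list[Path], seen: dict[str, str]) -> list[Path]:
    """Return only the files whose current hash differs from the recorded one."""
    changed = []
    for p in paths:
        h = content_hash(p)
        if seen.get(p.name) != h:
            changed.append(p)
            seen[p.name] = h  # remember the new hash for the next run
    return changed
```

On the first run every file counts as changed; afterwards, only a re-OCR'd volume comes back.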
2. Install System Dependencies
Step 2.1 — Install Homebrew
Homebrew is the standard macOS package manager. If you already have it, skip to Step 2.2.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Apple Silicon (M1/M2/M3/M4) only: the installer will print two commands at the end to add Homebrew to your PATH. Run them — they look like this:
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"
Verify:
brew --version # → Homebrew 4.x.x
Step 2.2 — Install system tools
# python@3.11 — CocoIndex supports Python 3.11–3.13
# tesseract — OCR engine
# tesseract-lang — All language packs, including Russian (rus)
# poppler — Provides pdftoppm, used to rasterise PDF pages for OCR
# djvulibre — Provides ddjvu for DjVu → PDF conversion
# postgresql@16 — Database server (pinned to v16 for stability)
brew install python@3.11 tesseract tesseract-lang poppler djvulibre postgresql@16
Verify:
python3.11 --version # → Python 3.11.x
tesseract --version # → tesseract 5.x.x
tesseract --list-langs | grep rus # → rus (must appear)
ddjvu --version # → DjVuLibre ...
pdftoppm -v 2>&1 | head -1 # → pdftoppm version ...
psql --version # → psql (PostgreSQL) 16.x
3. Set Up PostgreSQL and pgvector
PostgreSQL stores everything in this project: CocoIndex’s internal bookkeeping state, your raw text chunks, and their vector embeddings — all in one local database.
Step 3.1 — Add Postgres to your PATH
echo 'export PATH="/opt/homebrew/opt/postgresql@16/bin:$PATH"' >> ~/.zprofile
source ~/.zprofile
Step 3.2 — Start Postgres as a background service
# This starts Postgres and configures it to restart automatically at login.
brew services start postgresql@16
# Confirm it's running. Look for "started" in the output.
brew services list | grep postgresql
# Test the connection (Homebrew creates a superuser role matching your macOS username).
psql postgres -c "SELECT version();"
# → PostgreSQL 16.x ...
Step 3.3 — Create the database, user, and grant permissions
psql postgres -c "CREATE DATABASE cocoindex;"
psql postgres -c "CREATE USER cocoindex WITH PASSWORD 'yourpassword';"
psql postgres -c "GRANT ALL PRIVILEGES ON DATABASE cocoindex TO cocoindex;"
# In PostgreSQL 15+, the public schema is restricted by default.
# This grant allows the cocoindex user to create tables in it.
psql cocoindex -c "GRANT ALL ON SCHEMA public TO cocoindex;"
Step 3.4 — Install pgvector and enable it
pgvector is the open-source Postgres extension that adds a VECTOR data type and approximate nearest-neighbour (ANN) index methods. It needs to be installed as a separate Homebrew formula, and Postgres must be restarted before the extension becomes available.
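A note on how clients talk to it: a vector travels over the wire as a bracketed text literal and is cast with ::vector on the server side, which is how query.py formats its query embedding later in this guide. A minimal sketch of that formatting:

```python
def to_pgvector_literal(vec: list[float]) -> str:
    """Format a Python list as a pgvector text literal, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"

# The literal is then passed as a bound query parameter and cast server-side:
#   SELECT embedding <=> %s::vector FROM ... ORDER BY embedding <=> %s::vector
print(to_pgvector_literal([0.25, -1.0, 3.0]))  # → [0.25,-1.0,3.0]
```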
brew install pgvector
# Restart Postgres so it can find the newly installed pgvector shared library.
brew services restart postgresql@16
# Enable the vector extension inside the cocoindex database.
psql cocoindex -c "CREATE EXTENSION IF NOT EXISTS vector;"
# Verify: should print 'vector' and its version number.
psql cocoindex -c "SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';"
Step 3.5 — Record your connection URL
Everything in this project connects to Postgres using this URL. Keep it handy — you’ll set it as an environment variable in the next step.
postgresql://cocoindex:yourpassword@localhost:5432/cocoindex
Test it:
psql "postgresql://cocoindex:yourpassword@localhost:5432/cocoindex" -c "SELECT 'connection ok';"
# → connection ok
4. Set Up the Python Project
Step 4.1 — Create the project directory and virtual environment
mkdir -p ~/brockhaus-rag
cd ~/brockhaus-rag
# Create and activate an isolated Python environment for this project.
python3.11 -m venv venv
source venv/bin/activate
pip install --upgrade pip
Step 4.2 — Install Python packages
pip install \
"cocoindex[embeddings]" \
pdf2image pytesseract pillow \
tqdm rich python-dotenv pyyaml \
psycopg2-binary python-frontmatter \
anthropic openai google-generativeaiStep 4.3 — Create the folder structure
mkdir -p data/djvu data/pdf data/text obsidian_vault/Encyclopedia logs
Your project tree:
~/brockhaus-rag/
├── venv/ ← Python environment (never edit manually)
├── data/
│ ├── djvu/ ← Put your Volume_*.djvu files here
│ ├── pdf/ ← Phase 1 output: converted PDFs
│ └── text/ ← Phase 2 output: OCR raw text files
├── obsidian_vault/
│ └── Encyclopedia/ ← Phase 5 output: one .md per article chunk
├── logs/ ← OCR error logs
├── .env ← API keys (never commit to git)
├── config.yaml ← All settings in one place
├── 01_convert_djvu.sh
├── 02_ocr.py
├── cocoindex_flow.py ← Phase 3 (named without numeric prefix — see note)
├── 04_export_obsidian.py
└── query.py
Why is the CocoIndex file not called 03_cocoindex_flow.py? Python's import statement cannot reference a module whose name starts with a digit. The query.py script imports the shared text_to_embedding function directly from cocoindex_flow.py, so the file must have a valid Python module name.
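You can check this restriction directly; the import statement fails at parse time, before Python even looks for a file:

```python
# "03_cocoindex_flow" is not a valid Python identifier, so the import
# statement is rejected by the parser itself.
try:
    compile("import 03_cocoindex_flow", "<demo>", "exec")
except SyntaxError:
    print("invalid module name")  # → invalid module name
```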
Step 4.4 — Create config.yaml
paths:
djvu_dir: ~/brockhaus-rag/data/djvu
pdf_dir: ~/brockhaus-rag/data/pdf
text_dir: ~/brockhaus-rag/data/text
obsidian_dir: ~/brockhaus-rag/obsidian_vault/Encyclopedia
log_dir: ~/brockhaus-rag/logs
database:
# Update 'yourpassword' to match what you chose in Step 3.3.
url: postgresql://cocoindex:yourpassword@localhost:5432/cocoindex
ocr:
language: rus # Tesseract language code. 'rus' = Russian.
dpi: 300 # Render PDF pages at 300 DPI before OCR.
psm: 1 # Tesseract page segmentation mode 1 = auto with OSD.
cocoindex:
# HuggingFace model for embedding. Supports Russian natively.
# First run downloads ~560 MB; cached locally after that.
embedding_model: intfloat/multilingual-e5-large
chunk_size: 800 # Target characters per chunk.
chunk_overlap: 100 # Characters of overlap between consecutive chunks.
collection_name: encyclopedia_embeddings # Postgres table name.
llm:
default_provider: anthropic # Options: anthropic, openai, google, ollama
anthropic_model: claude-opus-4-6
openai_model: gpt-4o
google_model: gemini-2.0-flash
ollama_model: qwen2.5:7b # Best for Russian text among local 7B models.
top_k: 8 # Number of chunks retrieved per query.
obsidian:
create_index: true # Generate a master Brockhaus Index.md file.
add_backlinks: true # Add [[Volume N]] wikilinks in every article file.
Step 4.5 — Create .env
# .env — secret API keys. Never commit this file to git.
# Remove or comment out any providers you don't plan to use.
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=AI...
# CocoIndex reads its database connection from this variable.
COCOINDEX_DATABASE_URL=postgresql://cocoindex:yourpassword@localhost:5432/cocoindex
To avoid exporting COCOINDEX_DATABASE_URL manually in every new terminal, add it permanently:
echo 'export COCOINDEX_DATABASE_URL="postgresql://cocoindex:yourpassword@localhost:5432/cocoindex"' >> ~/.zprofile
source ~/.zprofile
5. Download the Encyclopedia
# Move your downloaded Volume_*.djvu files into the djvu folder,
# or symlink the folder where you already saved them:
ln -s /path/to/your/djvu/files ~/brockhaus-rag/data/djvu
# Alternatively, use the included download script to fetch all 86 volumes.
# It resumes interrupted downloads automatically.
chmod +x download_all_volumes.sh
./download_all_volumes.sh
Expected total size: ~2–3 GB. Each volume is ~25–40 MB.
6. Phase 1 — DjVu → PDF
Create 01_convert_djvu.sh:
#!/bin/zsh
# Convert all DjVu volumes to PDF.
# Skips files where the PDF already exists.
# Run from ~/brockhaus-rag with the venv active.
set -euo pipefail
# Read paths from config.yaml using Python (zsh can't parse YAML natively).
DJVU_DIR=$(python3 -c "import yaml,os; c=yaml.safe_load(open('config.yaml')); print(os.path.expanduser(c['paths']['djvu_dir']))")
PDF_DIR=$(python3 -c "import yaml,os; c=yaml.safe_load(open('config.yaml')); print(os.path.expanduser(c['paths']['pdf_dir']))")
mkdir -p "$PDF_DIR"
SUCCESS=0; SKIP=0; FAIL=0
# Count volumes with a null glob (N) so an empty directory doesn't abort the
# script under `set -euo pipefail`.
typeset -a DJVU_FILES
DJVU_FILES=("$DJVU_DIR"/*.djvu(N))
TOTAL=${#DJVU_FILES[@]}
if (( TOTAL == 0 )); then
  echo "No .djvu files found in $DJVU_DIR" >&2
  exit 1
fi
COUNT=0
echo "Converting $TOTAL DjVu volumes to PDF..."
for INPUT in "$DJVU_DIR"/*.djvu; do
COUNT=$((COUNT + 1))
BASENAME=$(basename "$INPUT" .djvu)
OUTPUT="$PDF_DIR/${BASENAME}.pdf"
if [[ -f "$OUTPUT" ]]; then
echo "[$COUNT/$TOTAL] Skipping (exists): $BASENAME.pdf"
SKIP=$((SKIP + 1))
continue
fi
echo "[$COUNT/$TOTAL] Converting: $BASENAME.djvu..."
# -format=pdf → output format
# -scale=300 → render at 300 DPI for better OCR quality
if ddjvu -format=pdf -scale=300 "$INPUT" "$OUTPUT"; then
SIZE=$(du -sh "$OUTPUT" | awk '{print $1}')
echo " ✓ $BASENAME.pdf ($SIZE)"
SUCCESS=$((SUCCESS + 1))
else
echo " ✗ Failed: $BASENAME"
rm -f "$OUTPUT"
FAIL=$((FAIL + 1))
fi
done
echo ""
echo "Done. ✓ $SUCCESS converted ⟳ $SKIP skipped ✗ $FAIL failed"
chmod +x 01_convert_djvu.sh
./01_convert_djvu.sh
Expect 5–20 minutes and ~4–6 GB of output PDFs.
7. Phase 2 — OCR: Extract Text
Each PDF page is rendered to an image and passed to Tesseract for Russian text recognition. Output is one UTF-8 text file per volume, with --- PAGE N --- markers.
Tip: Apple Vision (macOS built-in) often achieves 20–30% better accuracy than Tesseract on pre-revolutionary Russian script. See the appendix at the end of this article for the drop-in replacement function.
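The page markers written by the OCR script make the raw files easy to post-process later. This helper is not part of the pipeline, just a sketch of parsing the format back into per-page text (useful if you ever want page-level citations):

```python
import re

def split_pages(raw: str) -> dict[int, str]:
    """Split an OCR output file into {page_number: text} using the page markers."""
    parts = re.split(r"\n*--- PAGE (\d+) ---(?: \[OCR ERROR\])?\n*", raw)
    # With one capture group, re.split yields [preamble, num, text, num, text, ...]
    it = iter(parts[1:])
    return {int(num): text for num, text in zip(it, it)}

pages = split_pages("\n\n--- PAGE 1 ---\n\nПервая\n\n--- PAGE 2 ---\n\nВторая")
print(sorted(pages))  # → [1, 2]
```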
Create 02_ocr.py:
#!/usr/bin/env python3
"""
OCR all PDF volumes into plain text files.
Usage:
python 02_ocr.py # process all volumes
python 02_ocr.py --volume 1 # process a single volume
python 02_ocr.py --volume 1 --force # re-OCR even if output already exists
"""
import os
import sys
import time
import argparse
import logging
from pathlib import Path
import yaml
import pytesseract
from pdf2image import convert_from_path
from tqdm import tqdm
from rich.console import Console
from rich.table import Table
console = Console()
def load_config() -> dict:
with open("config.yaml") as f:
cfg = yaml.safe_load(f)
for k in cfg["paths"]:
cfg["paths"][k] = os.path.expanduser(cfg["paths"][k])
return cfg
def setup_logging(log_dir: str) -> logging.Logger:
os.makedirs(log_dir, exist_ok=True)
log_path = os.path.join(log_dir, "ocr_errors.log")
logging.basicConfig(
filename=log_path,
level=logging.ERROR,
format="%(asctime)s %(levelname)s %(message)s",
)
return logging.getLogger("ocr")
def ocr_volume(pdf_path: Path, output_path: Path, config: dict, logger: logging.Logger) -> dict:
"""Render each page of a PDF to an image and run Tesseract OCR on it."""
start = time.time()
lang = config["ocr"]["language"]
dpi = config["ocr"]["dpi"]
tess_config = f"--psm {config['ocr']['psm']}"
stats = {"pages": 0, "chars": 0, "errors": 0}
try:
# pdf2image uses poppler's pdftoppm under the hood.
# thread_count=4 renders multiple pages in parallel.
images = convert_from_path(str(pdf_path), dpi=dpi, thread_count=4)
except Exception as e:
logger.error(f"Could not render {pdf_path.name}: {e}")
return stats
stats["pages"] = len(images)
lines = []
for page_num, image in enumerate(tqdm(images, desc=f" {pdf_path.stem}", leave=False), 1):
try:
text = pytesseract.image_to_string(image, lang=lang, config=tess_config)
lines.append(f"\n\n--- PAGE {page_num} ---\n\n{text}")
stats["chars"] += len(text)
except Exception as e:
logger.error(f"{pdf_path.name} page {page_num}: {e}")
lines.append(f"\n\n--- PAGE {page_num} --- [OCR ERROR]\n\n")
stats["errors"] += 1
output_path.write_text("".join(lines), encoding="utf-8")
stats["duration"] = round(time.time() - start, 1)
return stats
def main():
parser = argparse.ArgumentParser(description="OCR PDF volumes to text files.")
parser.add_argument("--volume", type=int, help="Process only this volume number.")
parser.add_argument("--force", action="store_true", help="Re-OCR even if output exists.")
args = parser.parse_args()
cfg = load_config()
logger = setup_logging(cfg["paths"]["log_dir"])
pdf_dir = Path(cfg["paths"]["pdf_dir"])
text_dir = Path(cfg["paths"]["text_dir"])
text_dir.mkdir(parents=True, exist_ok=True)
if args.volume:
pdf_files = sorted(pdf_dir.glob(f"Volume_{args.volume}.pdf"))
else:
pdf_files = sorted(pdf_dir.glob("Volume_*.pdf"))
if not pdf_files:
console.print("[red]No PDF files found. Run 01_convert_djvu.sh first.[/red]")
sys.exit(1)
summary_rows = []
for pdf_path in pdf_files:
output = text_dir / f"{pdf_path.stem}_raw.txt"
console.rule(f"[bold]{pdf_path.stem}[/bold]")
if output.exists() and not args.force:
kb = output.stat().st_size // 1024
console.print(f" [yellow]Skipping[/yellow] ({kb} KB on disk) — use --force to re-OCR")
summary_rows.append((pdf_path.stem, "—", "—", f"{kb} KB", "skipped"))
continue
stats = ocr_volume(pdf_path, output, cfg, logger)
kb = output.stat().st_size // 1024 if output.exists() else 0
summary_rows.append((
pdf_path.stem,
str(stats.get("pages", "?")),
f"{stats.get('chars', 0):,}",
f"{kb} KB",
f"{stats.get('duration', '?')}s",
))
table = Table(title="OCR Summary")
for col in ("Volume", "Pages", "Characters", "File size", "Duration"):
table.add_column(col)
for row in summary_rows:
table.add_row(*row)
console.print(table)
if __name__ == "__main__":
main()
Run it:
# Always test on a single volume first.
python 02_ocr.py --volume 1
# Check the output — you should see Russian text and page markers.
head -60 data/text/Volume_1_raw.txt
# Run all 86 volumes (run this overnight).
python 02_ocr.py
Expect 2–8 hours for all volumes.
8. Phase 3 — CocoIndex: Chunk, Embed, Store
This single file replaces what would otherwise be two separate scripts (chunking and embedding). CocoIndex handles everything in a declarative dataflow — you describe the transformations; the Rust-core engine executes them efficiently and tracks what has changed.
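SplitRecursively is smarter than a fixed window (it prefers paragraph and sentence boundaries), but what chunk_size and chunk_overlap mean is easiest to see with plain character windows. A toy sketch, not CocoIndex's actual algorithm:

```python
def char_windows(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size character windows; consecutive chunks share `overlap` characters."""
    step = size - overlap  # each new chunk starts 700 characters after the previous
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = char_windows("x" * 2000)
print([len(c) for c in chunks])  # → [800, 800, 600]
```

The overlap means a sentence that falls on a chunk boundary still appears whole in at least one chunk.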
How the text_to_embedding transform flow works
The @cocoindex.transform_flow() decorator is the key design decision here. It creates a shared embedding function that is used in two places:
- At index time — cocoindex update calls it on every chunk
- At query time — query.py imports it and calls .eval() on your search string
This guarantees that query vectors and document vectors are always produced by the exact same model and preprocessing logic. If they diverge, similarity search produces meaningless scores.
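Those similarity scores are plain cosine similarity between the two vectors. A toy illustration of what pgvector computes per stored chunk (its <=> operator returns cosine distance, which is 1 minus this value):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # → 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # → -1.0  (<=> distance = 2)
```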
The query: / passage: prefix convention
The intfloat/multilingual-e5-large model is trained to expect a prefix on every input:
- Documents are prefixed with "passage: " — CocoIndex handles this internally
- Queries must be prefixed with "query: " — we do this explicitly in query.py
Without the correct prefix, retrieval quality drops significantly.
Create cocoindex_flow.py:
#!/usr/bin/env python3
"""
CocoIndex flow: read OCR text files → chunk → embed → store in PostgreSQL.
Usage (CocoIndex CLI — recommended):
export COCOINDEX_DATABASE_URL="postgresql://cocoindex:yourpassword@localhost:5432/cocoindex"
cocoindex update cocoindex_flow.py # first run: indexes everything
cocoindex update cocoindex_flow.py # subsequent runs: only changed files
Usage (Python directly):
python cocoindex_flow.py
To drop all CocoIndex-managed tables and rebuild from scratch:
cocoindex drop cocoindex_flow.py
cocoindex update cocoindex_flow.py
"""
import os
import asyncio
import yaml
import cocoindex
from dotenv import load_dotenv
load_dotenv()
def load_config() -> dict:
with open("config.yaml") as f:
cfg = yaml.safe_load(f)
for k in cfg["paths"]:
cfg["paths"][k] = os.path.expanduser(cfg["paths"][k])
return cfg
config = load_config()
# ---------------------------------------------------------------------------
# Shared transform flow — used both during indexing and at query time.
#
# Decorating with @cocoindex.transform_flow() adds an .eval() method so
# query.py can call text_to_embedding.eval("query: your question") and get
# back the same vector space used during indexing.
# ---------------------------------------------------------------------------
@cocoindex.transform_flow()
def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice:
"""Embed a text string using SentenceTransformerEmbed (multilingual-e5-large)."""
return text.transform(
cocoindex.functions.SentenceTransformerEmbed(
model=config["cocoindex"]["embedding_model"]
)
)
# ---------------------------------------------------------------------------
# Indexing flow — the main pipeline declaration.
# ---------------------------------------------------------------------------
@cocoindex.flow_def(name="BrockhausEncyclopedia")
def brockhaus_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope) -> None:
"""
Define the Brockhaus encyclopedia indexing flow.
Data path:
LocalFile (data/text/*.txt)
→ SplitRecursively (800 chars, 100 overlap)
→ SentenceTransformerEmbed (multilingual-e5-large, 1024 dims)
→ Postgres (encyclopedia_embeddings table)
"""
# SOURCE: read every *_raw.txt file produced by 02_ocr.py.
# CocoIndex hashes each file's content. On re-run, only files whose
# hash changed are re-processed downstream — unchanged volumes are skipped.
data_scope["documents"] = flow_builder.add_source(
cocoindex.sources.LocalFile(
path=config["paths"]["text_dir"],
included_patterns=["*_raw.txt"],
)
)
doc_embeddings = data_scope.add_collector()
with data_scope["documents"].row() as doc:
# CHUNK: split on paragraph boundaries, fall back to sentence boundaries.
# Never splits mid-sentence. Produces overlapping windows.
doc["chunks"] = doc["content"].transform(
cocoindex.functions.SplitRecursively(),
language="plain_text",
chunk_size=config["cocoindex"]["chunk_size"],
chunk_overlap=config["cocoindex"]["chunk_overlap"],
)
with doc["chunks"].row() as chunk:
# EMBED: call the shared transform flow.
chunk["embedding"] = chunk["text"].call(text_to_embedding)
doc_embeddings.collect(
filename=doc["filename"], # e.g. ".../Volume_1_raw.txt"
location=chunk["location"], # character offset, e.g. "0:800"
text=chunk["text"],
embedding=chunk["embedding"],
)
# EXPORT: CocoIndex creates the table and vector index automatically.
# primary_key_fields ensures upserts — re-running never duplicates rows.
doc_embeddings.export(
config["cocoindex"]["collection_name"],
cocoindex.storages.Postgres(),
primary_key_fields=["filename", "location"],
vector_indexes=[
cocoindex.VectorIndexDef(
field_name="embedding",
metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
)
],
)
async def main_async():
await cocoindex.setup()
await brockhaus_flow.run()
print("Indexing complete.")
if __name__ == "__main__":
cocoindex.init()
asyncio.run(main_async())
Run CocoIndex
# Make sure the database URL is exported (or sourced from .env).
export COCOINDEX_DATABASE_URL="postgresql://cocoindex:yourpassword@localhost:5432/cocoindex"
# First run — downloads the embedding model (~560 MB, one-time) then
# processes all volumes. Expect 30–60 minutes.
cocoindex update cocoindex_flow.py
What the first run prints:
BrockhausEncyclopedia: setting up...
BrockhausEncyclopedia: processing documents...
documents: 86 added, 0 removed, 0 updated
encyclopedia_embeddings: 183,492 added, 0 removed, 0 updated
BrockhausEncyclopedia: done in 47m 12s
What a re-run prints after you edit Volume 5’s text file:
BrockhausEncyclopedia: processing documents...
documents: 0 added, 0 removed, 1 updated
encyclopedia_embeddings: 2,341 added, 0 removed, 2,389 updated
BrockhausEncyclopedia: done in 41s
Only Volume 5’s chunks are re-embedded. Every other volume is skipped.
Verify the data in Postgres
psql "postgresql://cocoindex:yourpassword@localhost:5432/cocoindex"
-- Total chunks ingested.
SELECT COUNT(*) FROM encyclopedia_embeddings;
-- → 183492 (or similar)
-- Chunks and average size per volume.
SELECT
substring(filename FROM 'Volume_(\d+)') AS vol,
COUNT(*) AS chunks,
AVG(length(text))::int AS avg_chars
FROM encyclopedia_embeddings
GROUP BY 1
ORDER BY vol::int;
-- Confirm vector dimensions (should be 1024 for multilingual-e5-large).
SELECT vector_dims(embedding) AS dims
FROM encyclopedia_embeddings
LIMIT 1;
-- → 1024
-- Confirm the vector index exists.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'encyclopedia_embeddings';
\q
9. Phase 4 — Export to Obsidian
This phase exports every chunk as a Markdown file, organised by volume, into an Obsidian vault. It reads directly from the Postgres table so it always reflects exactly what CocoIndex indexed.
Create 04_export_obsidian.py:
#!/usr/bin/env python3
"""
Export encyclopedia chunks from Postgres to an Obsidian vault.
For each chunk:
- Creates a Markdown file with YAML frontmatter (title, volume, tags, source)
- Organises files into per-volume subdirectories
- Adds Obsidian wikilinks back to the volume overview and master index
Also generates:
- obsidian_vault/Encyclopedia/Brockhaus Index.md (master article list)
- obsidian_vault/Encyclopedia/Volume_N.md (one per volume)
Usage:
python 04_export_obsidian.py
python 04_export_obsidian.py --volume 1 # export one volume only
"""
import os
import re
import argparse
from pathlib import Path
import yaml
import psycopg2
import frontmatter
from tqdm import tqdm
from dotenv import load_dotenv
from rich.console import Console
load_dotenv()
console = Console()
def load_config() -> dict:
with open("config.yaml") as f:
cfg = yaml.safe_load(f)
for k in cfg["paths"]:
cfg["paths"][k] = os.path.expanduser(cfg["paths"][k])
return cfg
def extract_volume(filename: str) -> int:
m = re.search(r"Volume_(\d+)", filename)
return int(m.group(1)) if m else 0
def safe_filename(text: str, max_len: int = 60) -> str:
"""Strip filesystem-illegal characters and truncate."""
cleaned = re.sub(r'[/\\:*?"<>|]', "", text).strip()
return cleaned[:max_len].replace(" ", "_") or "chunk"
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--volume", type=int)
args = parser.parse_args()
cfg = load_config()
obsidian_dir = Path(cfg["paths"]["obsidian_dir"])
obsidian_dir.mkdir(parents=True, exist_ok=True)
db_url = os.environ.get("COCOINDEX_DATABASE_URL", cfg["database"]["url"])
table = cfg["cocoindex"]["collection_name"]
conn = psycopg2.connect(db_url)
cur = conn.cursor()
if args.volume:
cur.execute(
f"SELECT filename, location, text FROM {table} WHERE filename LIKE %s ORDER BY filename, location",
(f"%Volume_{args.volume}%",),
)
else:
cur.execute(f"SELECT filename, location, text FROM {table} ORDER BY filename, location")
rows = cur.fetchall()
conn.close()
console.print(f"Fetched {len(rows):,} chunks from Postgres.")
all_files: list[tuple[int, Path]] = []
vol_files: dict[int, list[Path]] = {}
for filename, location, text in tqdm(rows, desc="Exporting"):
vol = extract_volume(filename)
vol_dir = obsidian_dir / f"Volume_{vol:02d}"
vol_dir.mkdir(exist_ok=True)
# Use the first non-empty line as a title proxy.
lines = [l.strip() for l in text.split("\n") if l.strip()]
title = lines[0][:60] if lines else f"Volume {vol} chunk"
idx = len(list(vol_dir.glob("*.md")))  # running index; delete the vault dir before re-exporting to avoid duplicates
md_filename = f"Vol{vol:02d}_{safe_filename(title)}_{idx}.md"
filepath = vol_dir / md_filename
body = (
f"# {title}\n\n{text}\n\n---\n\n"
f"*Source: Brockhaus & Efron Encyclopedic Dictionary, [[Volume {vol:02d}]]* \n"
f"*[[Brockhaus Index]]*\n"
)
post = frontmatter.Post(
body,
title=title,
volume=vol,
location=location,
tags=["brockhaus", f"vol-{vol}"],
source="Энциклопедический словарь Брокгауза и Ефрона",
)
filepath.write_text(frontmatter.dumps(post), encoding="utf-8")
all_files.append((vol, filepath))
vol_files.setdefault(vol, []).append(filepath)
# Write per-volume overview files.
for vol, files in vol_files.items():
overview = obsidian_dir / f"Volume_{vol:02d}.md"
links = "\n".join(f"- [[{f.stem}]]" for f in files)
overview.write_text(
f"# Volume {vol}\n\n**Chunks:** {len(files)}\n\n## Contents\n\n{links}\n\n---\n*[[Brockhaus Index]]*\n",
encoding="utf-8",
)
# Write the master index.
if cfg.get("obsidian", {}).get("create_index", True):
index_path = obsidian_dir / "Brockhaus Index.md"
rows_md = "\n".join(
f"| [[{p.stem}]] | [[Volume {v:02d}]] |"
for v, p in sorted(all_files, key=lambda x: (x[0], x[1].name))
)
index_path.write_text(
"# Brockhaus & Efron Encyclopedic Dictionary\n\n"
"| Article | Volume |\n|---------|--------|\n"
+ rows_md + "\n",
encoding="utf-8",
)
console.print(f"[green]Index written:[/green] {index_path}")
console.print(f"\n[bold green]Done.[/bold green] {len(all_files):,} files → {obsidian_dir}")
if __name__ == "__main__":
main()
python 04_export_obsidian.py
Open ~/brockhaus-rag/obsidian_vault/ in Obsidian (free download). Press Cmd+G to open the knowledge graph across all 86 volumes. Press Cmd+Shift+F to full-text search all article files.
Recommended Obsidian plugins:
- Dataview — SQL-like queries across your vault: TABLE title, volume FROM "Encyclopedia" WHERE contains(tags, "brockhaus")
- Omnisearch — fast full-text search with ranking
- Graph Analysis — improves graph view performance for large vaults
10. Phase 5 — Query the Knowledge Base
query.py imports text_to_embedding directly from cocoindex_flow.py — the same function used during indexing. This guarantees that your question vector and the stored document vectors live in the same embedding space, which is required for meaningful similarity scores.
When you type a question, the script:
- Prefixes it with "query: " (required by multilingual-e5) and embeds it
- Runs a pgvector cosine similarity search against all stored chunks
- Assembles a RAG prompt with the retrieved excerpts
- Calls your chosen LLM and streams the answer to the terminal
Create query.py:
#!/usr/bin/env python3
"""
Interactive RAG query CLI for the Brockhaus & Efron encyclopedia.
Usage:
python query.py # interactive, default provider
python query.py --provider openai # use GPT-4o
python query.py --provider google # use Gemini
python query.py --provider ollama # fully local, no API key
python query.py --query "Tell me about Tolstoy" # single non-interactive query
python query.py --top-k 15 # retrieve more context
Interactive commands:
/quit — exit
/help — show commands
/provider NAME — switch LLM provider mid-session
/topk N — change number of retrieved chunks mid-session
"""
import os
import sys
import re
import argparse
import textwrap
from pathlib import Path
import yaml
import psycopg2
import cocoindex
from dotenv import load_dotenv
from rich.console import Console
from rich.panel import Panel # Panel lives in rich.panel, not rich.console
from rich.markdown import Markdown
from rich.table import Table
load_dotenv()
cocoindex.init()
# Import the shared embedding function from the indexing flow.
# This guarantees query vectors use the exact same model and preprocessing
# as the document vectors stored in Postgres.
sys.path.insert(0, str(Path(__file__).parent))
from cocoindex_flow import text_to_embedding # noqa: E402
console = Console()
SYSTEM_PROMPT = textwrap.dedent("""
You are a scholarly research assistant specialising in 19th-century Russian
history, science, culture, and language. Answer using ONLY the encyclopedia
excerpts provided. Always cite the volume number(s) your answer draws from.
If the answer is not in the excerpts, say so clearly — do not invent facts.
Where helpful, translate Russian terms into English but keep key terms in Russian too.
""").strip()
def load_config() -> dict:
return yaml.safe_load(open("config.yaml"))
# ---------------------------------------------------------------------------
# Retrieval
# ---------------------------------------------------------------------------
def retrieve(query: str, top_k: int, config: dict) -> list[dict]:
"""
Embed the query and find the top-K most similar chunks in Postgres.
The query is prefixed with "query: " as required by multilingual-e5 models.
The <=> operator (pgvector cosine distance) returns 0 for identical vectors
and 2 for opposite vectors, so ORDER BY ASC gives most similar first.
Similarity score = 1 - distance, reported in results for transparency.
"""
# Prefix is critical: multilingual-e5 expects "query: " on search strings
# and "passage: " on documents (CocoIndex handles the document prefix internally).
query_vec = text_to_embedding.eval(f"query: {query}")
db_url = os.environ.get("COCOINDEX_DATABASE_URL", config["database"]["url"])
table = config["cocoindex"]["collection_name"]
vec_str = "[" + ",".join(str(float(x)) for x in query_vec) + "]"
conn = psycopg2.connect(db_url)
cur = conn.cursor()
cur.execute(f"""
SELECT
filename,
location,
text,
1 - (embedding <=> %s::vector) AS similarity
FROM {table}
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (vec_str, vec_str, top_k))
rows = cur.fetchall()
conn.close()
results = []
for filename, location, text, similarity in rows:
m = re.search(r"Volume_(\d+)", filename)
vol = int(m.group(1)) if m else 0
results.append({
"volume": vol,
"location": location,
"text": text,
"similarity": round(similarity, 4),
})
return results
# ---------------------------------------------------------------------------
# Prompt builder
# ---------------------------------------------------------------------------
def build_prompt(query: str, results: list[dict]) -> str:
excerpts = "\n\n---\n\n".join(
f"[Excerpt {i} — Volume {r['volume']}, position {r['location']} "
f"(similarity: {r['similarity']})]\n{r['text']}"
for i, r in enumerate(results, 1)
)
return (
"The following are excerpts from the Brockhaus & Efron "
"Encyclopedic Dictionary (1890–1907):\n\n"
f"{excerpts}\n\nQuestion: {query}"
)
# ---------------------------------------------------------------------------
# LLM backends — all four providers implemented
# ---------------------------------------------------------------------------
def call_anthropic(system: str, user: str, model: str) -> str:
    from anthropic import Anthropic
    client = Anthropic()
    with console.status("Generating answer (Claude)..."):
        r = client.messages.create(
            model=model, max_tokens=2048,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
    return r.content[0].text

def call_openai(system: str, user: str, model: str) -> str:
    from openai import OpenAI
    client = OpenAI()
    with console.status("Generating answer (OpenAI)..."):
        r = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
        )
    return r.choices[0].message.content

def call_google(system: str, user: str, model: str) -> str:
    import google.generativeai as genai
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    m = genai.GenerativeModel(model)
    with console.status("Generating answer (Gemini)..."):
        r = m.generate_content(f"{system}\n\n{user}")
    return r.text

def call_ollama(system: str, user: str, model: str) -> str:
    import requests
    with console.status(f"Running {model} locally via Ollama..."):
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": f"{system}\n\n{user}", "stream": False},
            timeout=300,
        )
    r.raise_for_status()
    return r.json()["response"]

def call_llm(provider: str, system: str, user: str, config: dict) -> str:
    llm = config["llm"]
    if provider == "anthropic":
        return call_anthropic(system, user, llm["anthropic_model"])
    elif provider == "openai":
        return call_openai(system, user, llm["openai_model"])
    elif provider == "google":
        return call_google(system, user, llm["google_model"])
    elif provider == "ollama":
        return call_ollama(system, user, llm["ollama_model"])
    else:
        raise ValueError(f"Unknown provider: {provider!r}. Choose: anthropic, openai, google, ollama")
# ---------------------------------------------------------------------------
# Query runner
# ---------------------------------------------------------------------------
def run_query(query: str, provider: str, top_k: int, config: dict) -> None:
    with console.status("Searching encyclopedia..."):
        results = retrieve(query, top_k, config)
    if not results:
        console.print("[red]No results found. Is the database populated? Run cocoindex update first.[/red]")
        return
    answer = call_llm(provider, SYSTEM_PROMPT, build_prompt(query, results), config)
    console.print(Panel(Markdown(answer), title="Answer", border_style="green"))
    table = Table(title="Sources retrieved", show_header=True)
    table.add_column("Vol", style="cyan", width=4)
    table.add_column("Similarity", width=10)
    table.add_column("Preview", style="dim")
    for r in results:
        table.add_row(
            str(r["volume"]),
            str(r["similarity"]),
            r["text"][:90].replace("\n", " ") + "…",
        )
    console.print(table)
# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------
def main():
    parser = argparse.ArgumentParser(description="Query the Brockhaus encyclopedia with AI.")
    parser.add_argument("--provider", choices=["anthropic", "openai", "google", "ollama"])
    parser.add_argument("--top-k", type=int)
    parser.add_argument("--query", type=str, help="Run a single query non-interactively.")
    args = parser.parse_args()
    config = load_config()
    provider = args.provider or config["llm"]["default_provider"]
    top_k = args.top_k or config["llm"]["top_k"]
    console.print(Panel(
        f"[bold]Brockhaus & Efron Encyclopedia[/bold]\n"
        f"Provider: [cyan]{provider}[/cyan] | "
        f"Top-K: [cyan]{top_k}[/cyan] | "
        f"Type [bold]/help[/bold] for commands",
        border_style="blue",
    ))
    if args.query:
        run_query(args.query, provider, top_k, config)
        return
    # Interactive loop
    while True:
        try:
            q = input("\n[You]: ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if not q:
            continue
        if q == "/quit":
            break
        if q == "/help":
            console.print(
                "[bold]Commands:[/bold]\n"
                "  /quit — exit\n"
                "  /provider NAME — switch provider (anthropic, openai, google, ollama)\n"
                "  /topk N — change number of retrieved chunks\n"
                "  anything else — search the encyclopedia"
            )
            continue
        if q.startswith("/provider "):
            name = q.split(None, 1)[1].strip()
            # Validate before switching so a typo doesn't crash the next query.
            if name not in ("anthropic", "openai", "google", "ollama"):
                console.print(f"[red]Unknown provider: {name}[/red]")
                continue
            provider = name
            console.print(f"Switched to: [cyan]{provider}[/cyan]")
            continue
        if q.startswith("/topk "):
            try:
                top_k = int(q.split()[1])
            except ValueError:
                console.print("[red]Usage: /topk N (N must be an integer)[/red]")
                continue
            console.print(f"Top-K set to: [cyan]{top_k}[/cyan]")
            continue
        run_query(q, provider, top_k, config)
if __name__ == "__main__":
    main()

Run it:
# Single non-interactive query
python query.py --query "What does the encyclopedia say about the Trans-Siberian railway?"
# Interactive mode with Claude (default)
python query.py
# Switch to a different provider at launch
python query.py --provider google
python query.py --provider ollama # fully local, no API key
# Or switch providers mid-session using the /provider command:
# [You]: /provider openai
# [You]: /topk 15

11. Bonus: Fully Local with Ollama
Run the entire pipeline — including the query LLM — without any API keys or cloud services.
# Install Ollama
brew install ollama
# Start the Ollama server (must be running for query.py --provider ollama)
ollama serve &
# Pull models — choose based on your available RAM:
ollama pull llama3.2 # 3B params, ~2 GB RAM — very fast
ollama pull mistral # 7B params, ~5 GB RAM — good multilingual
ollama pull qwen2.5:7b # 7B params, ~5 GB RAM — BEST for Russian text
ollama pull llama3.1:8b # 8B params, ~6 GB RAM — strong general quality
# Confirm a model works with Russian
ollama run qwen2.5:7b "Что такое энциклопедия?"   # "What is an encyclopedia?"

Set default_provider: ollama and ollama_model: qwen2.5:7b in config.yaml and you’re done — no keys, no cloud.
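For reference, these settings live in the llm block that query.py reads from config.yaml. A local setup might look like this (the top_k value shown is illustrative; keep whatever yours is already set to):

```yaml
# config.yaml: the llm block query.py reads
llm:
  default_provider: ollama   # used when --provider is not passed
  ollama_model: qwen2.5:7b   # strongest of the pulled models for Russian
  top_k: 10                  # number of chunks retrieved per question
```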
12. Maintaining and Updating the Index
The incremental re-processing is what makes CocoIndex particularly well-suited for a long-lived project like this. If you eventually get better OCR quality (e.g., by switching to Apple Vision), re-OCR a volume and CocoIndex handles the rest.
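The idea behind that change detection is a content-hash comparison between runs. Here is a minimal illustration of the principle (this is not CocoIndex's actual internals, just the concept):

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Digests recorded after the previous indexing run:
previous = {"vol42.txt": digest(b"old OCR of volume 42"),
            "vol43.txt": digest(b"OCR of volume 43")}

# Current state on disk: volume 42 was re-OCRed, volume 43 is untouched.
current = {"vol42.txt": digest(b"improved OCR of volume 42"),
           "vol43.txt": digest(b"OCR of volume 43")}

# Only files whose hash changed need re-chunking and re-embedding.
changed = [name for name, h in current.items() if previous.get(name) != h]
print(changed)  # ['vol42.txt']
```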
Re-OCR a single volume and update the index
# Force re-OCR of Volume 42 (overwrites the existing text file).
python 02_ocr.py --volume 42 --force
# Re-run CocoIndex — it detects the changed file hash and re-processes only Volume 42.
# All 85 other volumes are skipped entirely.
cocoindex update cocoindex_flow.py

Rebuild the entire index from scratch
# Drop all CocoIndex-managed tables (internal state + embedding table).
cocoindex drop cocoindex_flow.py
# Re-index everything.
cocoindex update cocoindex_flow.py

Manage Postgres
brew services start postgresql@16 # start
brew services stop postgresql@16 # stop
brew services restart postgresql@16 # restart
# View Postgres logs if something goes wrong
tail -f /opt/homebrew/var/log/postgresql@16.log

13. Troubleshooting
psql: error: connection to server on socket ... failed
Postgres isn’t running.
brew services start postgresql@16
# Wait 5 seconds, then:
psql postgres -c "SELECT 1;"

ERROR: could not open extension control file ... vector.control
pgvector isn’t installed, or Postgres hasn’t been restarted since it was installed.
brew install pgvector
brew services restart postgresql@16
psql cocoindex -c "CREATE EXTENSION IF NOT EXISTS vector;"

permission denied for schema public
This happens on Postgres 15+ because the public schema is restricted by default.
psql cocoindex -c "GRANT ALL ON SCHEMA public TO cocoindex;"

COCOINDEX_DATABASE_URL not set
# For the current terminal session:
export COCOINDEX_DATABASE_URL="postgresql://cocoindex:yourpassword@localhost:5432/cocoindex"
# Permanently — add to ~/.zprofile so every new terminal has it:
echo 'export COCOINDEX_DATABASE_URL="postgresql://cocoindex:yourpassword@localhost:5432/cocoindex"' >> ~/.zprofile
source ~/.zprofile

cocoindex: command not found
The Python virtual environment isn’t active.
source ~/brockhaus-rag/venv/bin/activate

ModuleNotFoundError: No module named 'cocoindex_flow'
Make sure the indexing file is named cocoindex_flow.py (not 03_cocoindex_flow.py). Python cannot import modules whose names start with a digit.
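The rule is that any name in an import statement must be a valid Python identifier, which you can check with str.isidentifier:

```python
# A module name must be a valid identifier to appear in an `import`
# statement; a leading digit rules it out.
print("03_cocoindex_flow".isidentifier())  # False
print("cocoindex_flow".isidentifier())     # True
```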
OCR output is empty or mostly noise
Confirm Tesseract has the Russian language pack installed:
tesseract --list-langs | grep rus
# Must print: rus

If it’s missing:

brew reinstall tesseract-lang

Also try increasing DPI in config.yaml from 300 to 400 for better recognition of small print.
Similarity search returns irrelevant results
- The text_to_embedding import in query.py guarantees the same model is used at index time and query time. If you changed the embedding_model in config.yaml after indexing, you must rebuild: cocoindex drop, then cocoindex update.
- Confirm the "query: " prefix is applied to search strings (multilingual-e5 requires it).
- Try --top-k 20 and let the LLM filter.
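The e5 prefix convention amounts to the following. The helper name e5_prefix is hypothetical, shown only to make the convention concrete; in this project the shared text_to_embedding function handles it:

```python
def e5_prefix(text: str, *, is_query: bool) -> str:
    # multilingual-e5 convention: "passage: " on indexed chunks,
    # "query: " on search strings.
    return ("query: " if is_query else "passage: ") + text

print(e5_prefix("Транссибирская магистраль", is_query=True))
# → query: Транссибирская магистраль
```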
ollama: connection refused
The Ollama server isn’t running.
ollama serve &
# Wait a few seconds, then retry.

Summary
| Phase | Script | Expected time |
|---|---|---|
| Install dependencies | brew install ... + pip install ... | 5–10 min |
| PostgreSQL setup | psql commands | 2 min |
| DjVu → PDF | ./01_convert_djvu.sh | 5–20 min |
| OCR all 86 volumes | python 02_ocr.py | 2–8 hours |
| Chunk + embed + store | cocoindex update cocoindex_flow.py | 30–60 min |
| Obsidian export | python 04_export_obsidian.py | 1–5 min |
| Query | python query.py | seconds per question |
| Incremental re-index (1 volume) | cocoindex update cocoindex_flow.py | ~1 min |
The Brockhaus & Efron encyclopedia spans 121,000 articles across 86 volumes — now searchable by question, grounded in source text, and running entirely on your machine.
Crepi il lupo! 🐺
Appendix: Apple Vision OCR (Higher Accuracy on macOS)
Apple’s built-in Vision framework often achieves 20–30% better accuracy than Tesseract on pre-revolutionary Russian script, at no extra cost (it uses the Neural Engine on Apple Silicon). It requires macOS 12+ and the pyobjc bindings:
pip install pyobjc-framework-Vision pyobjc-framework-CoreML

Replace the pytesseract.image_to_string(...) call in 02_ocr.py with this function:
from Foundation import NSURL
import Vision
def apple_vision_ocr(image_path: str) -> str:
    """
    Run Apple Vision text recognition on a PNG/JPEG file path.
    Observations are sorted top-to-bottom by bounding box Y coordinate
    to preserve reading order. Requires macOS 12+.
    """
    url = NSURL.fileURLWithPath_(image_path)
    handler = Vision.VNImageRequestHandler.alloc().initWithURL_options_(url, None)
    request = Vision.VNRecognizeTextRequest.alloc().init()
    request.setRecognitionLanguages_(["ru", "ru-RU"])
    request.setRecognitionLevel_(Vision.VNRequestTextRecognitionLevelAccurate)
    request.setUsesLanguageCorrection_(True)
    handler.performRequests_error_([request], None)
    observations = request.results() or []
    # Sort top-to-bottom: higher Y in Vision's coordinate system = higher on page.
    sorted_obs = sorted(observations, key=lambda o: -o.boundingBox().origin.y)
    return "\n".join(
        o.topCandidates_(1)[0].string()
        for o in sorted_obs
        if o.topCandidates_(1)
    )

You’ll need to save each PDF page as a temporary PNG file first (pdf2image can do this), then pass the path to apple_vision_ocr() instead of calling pytesseract.image_to_string().