🏛️ Build a Searchable AI Knowledge Base from the Brockhaus & Efron Encyclopedia

The Brockhaus & Efron Encyclopedic Dictionary (Энциклопедический словарь Брокгауза и Ефрона) is one of the great reference works of the late Tsarist period. Published between 1890 and 1907, it spans 86 volumes and contains around 121,000 articles covering science, history, culture, literature, and technology.

You can download all 86 volumes as DjVu files from Runivers. But scanned images are hard to search. This guide walks you through turning those DjVu files into a fully queryable, AI-powered knowledge base. Indexing, storage, and semantic search run entirely on your Mac; pair them with a local Ollama model for answering and no data ever leaves your machine.

The stack: CocoIndex for chunking and embedding, PostgreSQL + pgvector for storage and semantic search, and any LLM (Claude, Gemini, OpenAI, or a local Ollama model) for answering questions.


1. How It All Fits Together

The pipeline has five phases:

  1. DjVu → PDF — ddjvu converts scanned images to PDF
  2. PDF → Text — Tesseract OCR (or Apple Vision, see the appendix) extracts readable Russian text
  3. Text → Chunks → Embeddings — CocoIndex splits text and generates vector embeddings stored in PostgreSQL
  4. Obsidian export — Every chunk becomes a Markdown file with wikilinks, browsable in Obsidian
  5. Query — Ask questions in any language, get grounded answers with volume citations

The magic of CocoIndex is that it’s incremental. After your first run, it tracks file hashes. If you re-OCR one volume, only that volume gets re-chunked and re-embedded on the next run — everything else is skipped.


2. Install System Dependencies

Step 2.1 — Install Homebrew

Homebrew is the standard macOS package manager. If you already have it, skip to Step 2.2.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Apple Silicon (M1/M2/M3/M4) only: the installer will print two commands at the end to add Homebrew to your PATH. Run them — they look like this:

echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"

Verify:

brew --version   # → Homebrew 4.x.x

Step 2.2 — Install system tools

# python@3.11   — CocoIndex supports Python 3.11–3.13
# tesseract     — OCR engine
# tesseract-lang — All language packs, including Russian (rus)
# poppler       — Provides pdftoppm, used to rasterise PDF pages for OCR
# djvulibre     — Provides ddjvu for DjVu → PDF conversion
# postgresql@16 — Database server (pinned to v16 for stability)
brew install python@3.11 tesseract tesseract-lang poppler djvulibre postgresql@16

Verify:

python3.11 --version                    # → Python 3.11.x
tesseract --version                     # → tesseract 5.x.x
tesseract --list-langs | grep rus       # → rus  (must appear)
ddjvu --version                         # → DjVuLibre ...
pdftoppm -v 2>&1 | head -1             # → pdftoppm version ...
psql --version                          # → psql (PostgreSQL) 16.x

3. Set Up PostgreSQL and pgvector

PostgreSQL stores everything in this project: CocoIndex’s internal bookkeeping state, your raw text chunks, and their vector embeddings — all in one local database.

Step 3.1 — Add Postgres to your PATH

echo 'export PATH="/opt/homebrew/opt/postgresql@16/bin:$PATH"' >> ~/.zprofile
source ~/.zprofile

Step 3.2 — Start Postgres as a background service

# This starts Postgres and configures it to restart automatically at login.
brew services start postgresql@16

# Confirm it's running. Look for "started" in the output.
brew services list | grep postgresql

# Test the connection (Homebrew creates a superuser role matching your macOS username).
psql postgres -c "SELECT version();"
# → PostgreSQL 16.x ...

Step 3.3 — Create the database, user, and grant permissions

psql postgres -c "CREATE DATABASE cocoindex;"
psql postgres -c "CREATE USER cocoindex WITH PASSWORD 'yourpassword';"
psql postgres -c "GRANT ALL PRIVILEGES ON DATABASE cocoindex TO cocoindex;"

# In PostgreSQL 15+, the public schema is restricted by default.
# This grant allows the cocoindex user to create tables in it.
psql cocoindex -c "GRANT ALL ON SCHEMA public TO cocoindex;"

Step 3.4 — Install pgvector and enable it

pgvector is the open-source Postgres extension that adds a VECTOR data type and approximate nearest-neighbour (ANN) index methods. It needs to be installed as a separate Homebrew formula, and Postgres must be restarted before the extension becomes available.

brew install pgvector

# Restart Postgres so it can find the newly installed pgvector shared library.
brew services restart postgresql@16

# Enable the vector extension inside the cocoindex database.
psql cocoindex -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Verify: should print 'vector' and its version number.
psql cocoindex -c "SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';"

Step 3.5 — Record your connection URL

Everything in this project connects to Postgres using this URL. Keep it handy — you’ll set it as an environment variable in the next step.

postgresql://cocoindex:yourpassword@localhost:5432/cocoindex

Test it:

psql "postgresql://cocoindex:yourpassword@localhost:5432/cocoindex" -c "SELECT 'connection ok';"
# → connection ok

4. Set Up the Python Project

Step 4.1 — Create the project directory and virtual environment

mkdir -p ~/brockhaus-rag
cd ~/brockhaus-rag

# Create and activate an isolated Python environment for this project.
python3.11 -m venv venv
source venv/bin/activate
pip install --upgrade pip

Step 4.2 — Install Python packages

pip install \
    "cocoindex[embeddings]" \
    pdf2image pytesseract pillow \
    tqdm rich python-dotenv pyyaml \
    psycopg2-binary python-frontmatter \
    anthropic openai google-generativeai

Step 4.3 — Create the folder structure

mkdir -p data/djvu data/pdf data/text obsidian_vault/Encyclopedia logs

Your project tree:

~/brockhaus-rag/
├── venv/                        ← Python environment (never edit manually)
├── data/
│   ├── djvu/                    ← Put your Volume_*.djvu files here
│   ├── pdf/                     ← Phase 1 output: converted PDFs
│   └── text/                    ← Phase 2 output: OCR raw text files
├── obsidian_vault/
│   └── Encyclopedia/            ← Phase 4 output: one .md per article chunk
├── logs/                        ← OCR error logs
├── .env                         ← API keys (never commit to git)
├── config.yaml                  ← All settings in one place
├── 01_convert_djvu.sh
├── 02_ocr.py
├── cocoindex_flow.py            ← Phase 3 (named without numeric prefix — see note)
├── 04_export_obsidian.py
└── query.py

Why is the CocoIndex file not called 03_cocoindex_flow.py? Python cannot import files whose names start with a digit. The query.py script imports the shared text_to_embedding function directly from cocoindex_flow.py, so the file must have a valid Python module name.
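
You can verify the restriction in a Python REPL (the exact error text varies by Python version):

>>> import 03_cocoindex_flow
SyntaxError: ...    # rejected at parse time — module names must be valid identifiers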

Step 4.4 — Create config.yaml

paths:
  djvu_dir: ~/brockhaus-rag/data/djvu
  pdf_dir: ~/brockhaus-rag/data/pdf
  text_dir: ~/brockhaus-rag/data/text
  obsidian_dir: ~/brockhaus-rag/obsidian_vault/Encyclopedia
  log_dir: ~/brockhaus-rag/logs

database:
  # Update 'yourpassword' to match what you chose in Step 3.3.
  url: postgresql://cocoindex:yourpassword@localhost:5432/cocoindex

ocr:
  language: rus # Tesseract language code. 'rus' = Russian.
  dpi: 300 # Render PDF pages at 300 DPI before OCR.
  psm: 1 # Tesseract page segmentation mode 1 = auto with OSD.

cocoindex:
  # HuggingFace model for embedding. Supports Russian natively.
  # First run downloads ~560 MB; cached locally after that.
  embedding_model: intfloat/multilingual-e5-large
  chunk_size: 800 # Target characters per chunk.
  chunk_overlap: 100 # Characters of overlap between consecutive chunks.
  collection_name: encyclopedia_embeddings # Postgres table name.

llm:
  default_provider: anthropic # Options: anthropic, openai, google, ollama
  anthropic_model: claude-opus-4-6
  openai_model: gpt-4o
  google_model: gemini-2.0-flash
  ollama_model: qwen2.5:7b # Best for Russian text among local 7B models.
  top_k: 8 # Number of chunks retrieved per query.

obsidian:
  create_index: true # Generate a master Brockhaus Index.md file.
  add_backlinks: true # Add [[Volume N]] wikilinks in every article file.

Step 4.5 — Create .env

# .env — secret API keys. Never commit this file to git.
# Remove or comment out any providers you don't plan to use.

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=AI...

# CocoIndex reads its database connection from this variable.
COCOINDEX_DATABASE_URL=postgresql://cocoindex:yourpassword@localhost:5432/cocoindex

To avoid exporting the COCOINDEX_DATABASE_URL manually in every new terminal, add it permanently:

echo 'export COCOINDEX_DATABASE_URL="postgresql://cocoindex:yourpassword@localhost:5432/cocoindex"' >> ~/.zprofile
source ~/.zprofile

5. Download the Encyclopedia

# Move your downloaded Volume_*.djvu files into the djvu folder,
# or replace the empty djvu folder with a symlink to wherever you saved them.
# (Running ln -s against the existing directory would nest the link inside it.)
rmdir ~/brockhaus-rag/data/djvu
ln -s /path/to/your/djvu/files ~/brockhaus-rag/data/djvu

# Alternatively, use the included download script to fetch all 86 volumes.
# It resumes interrupted downloads automatically.
chmod +x download_all_volumes.sh
./download_all_volumes.sh

Expected total size: ~2–3 GB. Each volume is ~25–40 MB.


6. Phase 1 — DjVu → PDF

Create 01_convert_djvu.sh:

#!/bin/zsh
# Convert all DjVu volumes to PDF.
# Skips files where the PDF already exists.
# Run from ~/brockhaus-rag with the venv active.

set -euo pipefail

# Read paths from config.yaml using Python (zsh can't parse YAML natively).
DJVU_DIR=$(python3 -c "import yaml,os; c=yaml.safe_load(open('config.yaml')); print(os.path.expanduser(c['paths']['djvu_dir']))")
PDF_DIR=$(python3  -c "import yaml,os; c=yaml.safe_load(open('config.yaml')); print(os.path.expanduser(c['paths']['pdf_dir']))")

mkdir -p "$PDF_DIR"
SUCCESS=0; SKIP=0; FAIL=0
TOTAL=$(ls "$DJVU_DIR"/*.djvu 2>/dev/null | wc -l | tr -d ' ')
COUNT=0

echo "Converting $TOTAL DjVu volumes to PDF..."

for INPUT in "$DJVU_DIR"/*.djvu; do
    COUNT=$((COUNT + 1))
    BASENAME=$(basename "$INPUT" .djvu)
    OUTPUT="$PDF_DIR/${BASENAME}.pdf"

    if [[ -f "$OUTPUT" ]]; then
        echo "[$COUNT/$TOTAL] Skipping (exists): $BASENAME.pdf"
        SKIP=$((SKIP + 1))
        continue
    fi

    echo "[$COUNT/$TOTAL] Converting: $BASENAME.djvu..."

    # -format=pdf → output format
    # -scale=300  → render at 300 DPI for better OCR quality
    if ddjvu -format=pdf -scale=300 "$INPUT" "$OUTPUT"; then
        SIZE=$(du -sh "$OUTPUT" | awk '{print $1}')
        echo "  ✓ $BASENAME.pdf ($SIZE)"
        SUCCESS=$((SUCCESS + 1))
    else
        echo "  ✗ Failed: $BASENAME"
        rm -f "$OUTPUT"
        FAIL=$((FAIL + 1))
    fi
done

echo ""
echo "Done. ✓ $SUCCESS converted  ⟳ $SKIP skipped  ✗ $FAIL failed"

Run it:

chmod +x 01_convert_djvu.sh
./01_convert_djvu.sh

Expect 5–20 minutes and ~4–6 GB of output PDFs.


7. Phase 2 — OCR: Extract Text

Each PDF page is rendered to an image and passed to Tesseract for Russian text recognition. Output is one UTF-8 text file per volume, with --- PAGE N --- markers.
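
Before committing to a multi-hour run, it's worth smoke-testing the stack on a single page. A minimal sketch, assuming Volume_1.pdf exists from Phase 1:

# One-page OCR smoke test (sketch — assumes data/pdf/Volume_1.pdf exists).
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("data/pdf/Volume_1.pdf", dpi=300, first_page=1, last_page=1)
print(pytesseract.image_to_string(pages[0], lang="rus", config="--psm 1")[:500])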

Tip: Apple Vision (macOS built-in) often achieves 20–30% better accuracy than Tesseract on pre-revolutionary Russian script. See the appendix at the end of this article for the drop-in replacement function.

Create 02_ocr.py:

#!/usr/bin/env python3
"""
OCR all PDF volumes into plain text files.

Usage:
    python 02_ocr.py                 # process all volumes
    python 02_ocr.py --volume 1      # process a single volume
    python 02_ocr.py --volume 1 --force  # re-OCR even if output already exists
"""

import os
import sys
import time
import argparse
import logging
from pathlib import Path

import yaml
import pytesseract
from pdf2image import convert_from_path
from tqdm import tqdm
from rich.console import Console
from rich.table import Table

console = Console()


def load_config() -> dict:
    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)
    for k in cfg["paths"]:
        cfg["paths"][k] = os.path.expanduser(cfg["paths"][k])
    return cfg


def setup_logging(log_dir: str) -> logging.Logger:
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, "ocr_errors.log")
    logging.basicConfig(
        filename=log_path,
        level=logging.ERROR,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    return logging.getLogger("ocr")


def ocr_volume(pdf_path: Path, output_path: Path, config: dict, logger: logging.Logger) -> dict:
    """Render each page of a PDF to an image and run Tesseract OCR on it."""
    start = time.time()
    lang = config["ocr"]["language"]
    dpi = config["ocr"]["dpi"]
    tess_config = f"--psm {config['ocr']['psm']}"
    stats = {"pages": 0, "chars": 0, "errors": 0}

    try:
        # pdf2image uses poppler's pdftoppm under the hood.
        # thread_count=4 renders multiple pages in parallel.
        images = convert_from_path(str(pdf_path), dpi=dpi, thread_count=4)
    except Exception as e:
        logger.error(f"Could not render {pdf_path.name}: {e}")
        return stats

    stats["pages"] = len(images)
    lines = []

    for page_num, image in enumerate(tqdm(images, desc=f"  {pdf_path.stem}", leave=False), 1):
        try:
            text = pytesseract.image_to_string(image, lang=lang, config=tess_config)
            lines.append(f"\n\n--- PAGE {page_num} ---\n\n{text}")
            stats["chars"] += len(text)
        except Exception as e:
            logger.error(f"{pdf_path.name} page {page_num}: {e}")
            lines.append(f"\n\n--- PAGE {page_num} --- [OCR ERROR]\n\n")
            stats["errors"] += 1

    output_path.write_text("".join(lines), encoding="utf-8")
    stats["duration"] = round(time.time() - start, 1)
    return stats


def main():
    parser = argparse.ArgumentParser(description="OCR PDF volumes to text files.")
    parser.add_argument("--volume", type=int, help="Process only this volume number.")
    parser.add_argument("--force", action="store_true", help="Re-OCR even if output exists.")
    args = parser.parse_args()

    cfg = load_config()
    logger = setup_logging(cfg["paths"]["log_dir"])
    pdf_dir = Path(cfg["paths"]["pdf_dir"])
    text_dir = Path(cfg["paths"]["text_dir"])
    text_dir.mkdir(parents=True, exist_ok=True)

    if args.volume:
        pdf_files = sorted(pdf_dir.glob(f"Volume_{args.volume}.pdf"))
    else:
        pdf_files = sorted(pdf_dir.glob("Volume_*.pdf"))

    if not pdf_files:
        console.print("[red]No PDF files found. Run 01_convert_djvu.sh first.[/red]")
        sys.exit(1)

    summary_rows = []

    for pdf_path in pdf_files:
        output = text_dir / f"{pdf_path.stem}_raw.txt"
        console.rule(f"[bold]{pdf_path.stem}[/bold]")

        if output.exists() and not args.force:
            kb = output.stat().st_size // 1024
            console.print(f"  [yellow]Skipping[/yellow] ({kb} KB on disk) — use --force to re-OCR")
            summary_rows.append((pdf_path.stem, "—", "—", f"{kb} KB", "skipped"))
            continue

        stats = ocr_volume(pdf_path, output, cfg, logger)
        kb = output.stat().st_size // 1024 if output.exists() else 0
        summary_rows.append((
            pdf_path.stem,
            str(stats.get("pages", "?")),
            f"{stats.get('chars', 0):,}",
            f"{kb} KB",
            f"{stats.get('duration', '?')}s",
        ))

    table = Table(title="OCR Summary")
    for col in ("Volume", "Pages", "Characters", "File size", "Duration"):
        table.add_column(col)
    for row in summary_rows:
        table.add_row(*row)
    console.print(table)


if __name__ == "__main__":
    main()

Run it:

# Always test on a single volume first.
python 02_ocr.py --volume 1

# Check the output — you should see Russian text and page markers.
head -60 data/text/Volume_1_raw.txt

# Run all 86 volumes (run this overnight).
python 02_ocr.py

Expect 2–8 hours for all volumes.


8. Phase 3 — CocoIndex: Chunk, Embed, Store

This single file replaces what would otherwise be two separate scripts (chunking and embedding). CocoIndex handles everything in a declarative dataflow — you describe the transformations; the Rust-core engine executes them efficiently and tracks what has changed.

How the text_to_embedding transform flow works

The @cocoindex.transform_flow() decorator is the key design decision here. It creates a shared embedding function that is used in two places:

  • At index time — cocoindex update calls it on every chunk
  • At query time — query.py imports it and calls .eval() on your search string

This guarantees that query vectors and document vectors are always produced by the exact same model and preprocessing logic. If they diverge, similarity search produces meaningless scores.
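
Concretely, the two call sites look like this (both appear in full in the files below; the Russian question is just an example):

# Index time — inside the flow definition, applied to every chunk:
chunk["embedding"] = chunk["text"].call(text_to_embedding)

# Query time — in query.py, applied to the user's question:
query_vec = text_to_embedding.eval("query: Кто построил Транссибирскую магистраль?")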

The query: / passage: prefix convention

The intfloat/multilingual-e5-large model is trained to expect a prefix on every input:

  • Documents are prefixed with "passage: " — CocoIndex handles this internally
  • Queries must be prefixed with "query: " — we do this explicitly in query.py

Without the correct prefix, retrieval quality drops significantly.
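
To make the convention concrete, here is a minimal sketch calling sentence-transformers directly — it is installed as part of cocoindex[embeddings] in Step 4.2, and the example strings are illustrative:

# Sketch of the e5 prefix convention (first run downloads the model; cached after).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Documents: "passage: " prefix — CocoIndex adds this for you during indexing.
doc_vec = model.encode("passage: Толстой (граф Лев Николаевич) — знаменитый писатель...")

# Queries: "query: " prefix — query.py adds this explicitly.
qry_vec = model.encode("query: Who was Leo Tolstoy?")
print(doc_vec.shape)  # → (1024,)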

Create cocoindex_flow.py:

#!/usr/bin/env python3
"""
CocoIndex flow: read OCR text files → chunk → embed → store in PostgreSQL.

Usage (CocoIndex CLI — recommended):
    export COCOINDEX_DATABASE_URL="postgresql://cocoindex:yourpassword@localhost:5432/cocoindex"
    cocoindex update cocoindex_flow.py    # first run: indexes everything
    cocoindex update cocoindex_flow.py    # subsequent runs: only changed files

Usage (Python directly):
    python cocoindex_flow.py

To drop all CocoIndex-managed tables and rebuild from scratch:
    cocoindex drop cocoindex_flow.py
    cocoindex update cocoindex_flow.py
"""

import os
import asyncio
import yaml
import cocoindex
from dotenv import load_dotenv

load_dotenv()


def load_config() -> dict:
    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)
    for k in cfg["paths"]:
        cfg["paths"][k] = os.path.expanduser(cfg["paths"][k])
    return cfg


config = load_config()


# ---------------------------------------------------------------------------
# Shared transform flow — used both during indexing and at query time.
#
# Decorating with @cocoindex.transform_flow() adds an .eval() method so
# query.py can call text_to_embedding.eval("query: your question") and get
# back the same vector space used during indexing.
# ---------------------------------------------------------------------------

@cocoindex.transform_flow()
def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice:
    """Embed a text string using SentenceTransformerEmbed (multilingual-e5-large)."""
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model=config["cocoindex"]["embedding_model"]
        )
    )


# ---------------------------------------------------------------------------
# Indexing flow — the main pipeline declaration.
# ---------------------------------------------------------------------------

@cocoindex.flow_def(name="BrockhausEncyclopedia")
def brockhaus_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope) -> None:
    """
    Define the Brockhaus encyclopedia indexing flow.

    Data path:
        LocalFile (data/text/*.txt)
          → SplitRecursively (800 chars, 100 overlap)
            → SentenceTransformerEmbed (multilingual-e5-large, 1024 dims)
              → Postgres (encyclopedia_embeddings table)
    """

    # SOURCE: read every *_raw.txt file produced by 02_ocr.py.
    # CocoIndex hashes each file's content. On re-run, only files whose
    # hash changed are re-processed downstream — unchanged volumes are skipped.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path=config["paths"]["text_dir"],
            included_patterns=["*_raw.txt"],
        )
    )

    doc_embeddings = data_scope.add_collector()

    with data_scope["documents"].row() as doc:

        # CHUNK: split on paragraph boundaries, fall back to sentence boundaries.
        # Never splits mid-sentence. Produces overlapping windows.
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="plain_text",
            chunk_size=config["cocoindex"]["chunk_size"],
            chunk_overlap=config["cocoindex"]["chunk_overlap"],
        )

        with doc["chunks"].row() as chunk:

            # EMBED: call the shared transform flow.
            chunk["embedding"] = chunk["text"].call(text_to_embedding)

            doc_embeddings.collect(
                filename=doc["filename"],    # e.g. ".../Volume_1_raw.txt"
                location=chunk["location"],  # character offset, e.g. "0:800"
                text=chunk["text"],
                embedding=chunk["embedding"],
            )

    # EXPORT: CocoIndex creates the table and vector index automatically.
    # primary_key_fields ensures upserts — re-running never duplicates rows.
    doc_embeddings.export(
        config["cocoindex"]["collection_name"],
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
        ],
    )


async def main_async():
    await cocoindex.setup()
    await brockhaus_flow.run()
    print("Indexing complete.")


if __name__ == "__main__":
    cocoindex.init()
    asyncio.run(main_async())

Run CocoIndex

# Make sure the database URL is exported (or sourced from .env).
export COCOINDEX_DATABASE_URL="postgresql://cocoindex:yourpassword@localhost:5432/cocoindex"

# First run — downloads the embedding model (~560 MB, one-time) then
# processes all volumes. Expect 30–60 minutes.
cocoindex update cocoindex_flow.py

What the first run prints:

BrockhausEncyclopedia: setting up...
BrockhausEncyclopedia: processing documents...
  documents: 86 added, 0 removed, 0 updated
  encyclopedia_embeddings: 183,492 added, 0 removed, 0 updated
BrockhausEncyclopedia: done in 47m 12s

What a re-run prints after you edit Volume 5’s text file:

BrockhausEncyclopedia: processing documents...
  documents: 0 added, 0 removed, 1 updated
  encyclopedia_embeddings: 2,341 added, 0 removed, 2,389 updated
BrockhausEncyclopedia: done in 41s

Only Volume 5’s chunks are re-embedded. Every other volume is skipped.

Verify the data in Postgres

psql "postgresql://cocoindex:yourpassword@localhost:5432/cocoindex"
-- Total chunks ingested.
SELECT COUNT(*) FROM encyclopedia_embeddings;
-- → 183492 (or similar)

-- Chunks and average size per volume.
SELECT
    substring(filename FROM 'Volume_(\d+)') AS vol,
    COUNT(*)                                  AS chunks,
    AVG(length(text))::int                    AS avg_chars
FROM encyclopedia_embeddings
GROUP BY 1
ORDER BY vol::int;

-- Confirm vector dimensions (should be 1024 for multilingual-e5-large).
SELECT vector_dims(embedding) AS dims
FROM encyclopedia_embeddings
LIMIT 1;
-- → 1024

-- Confirm the vector index exists.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'encyclopedia_embeddings';

\q

9. Phase 4 — Export to Obsidian

This phase exports every chunk as a Markdown file, organised by volume, into an Obsidian vault. It reads directly from the Postgres table so it always reflects exactly what CocoIndex indexed.

Create 04_export_obsidian.py:

#!/usr/bin/env python3
"""
Export encyclopedia chunks from Postgres to an Obsidian vault.

For each chunk:
  - Creates a Markdown file with YAML frontmatter (title, volume, tags, source)
  - Organises files into per-volume subdirectories
  - Adds Obsidian wikilinks back to the volume overview and master index

Also generates:
  - obsidian_vault/Encyclopedia/Brockhaus Index.md  (master article list)
  - obsidian_vault/Encyclopedia/Volume_N.md          (one per volume)

Usage:
    python 04_export_obsidian.py
    python 04_export_obsidian.py --volume 1   # export one volume only
"""

import os
import re
import argparse
from pathlib import Path

import yaml
import psycopg2
import frontmatter
from tqdm import tqdm
from dotenv import load_dotenv
from rich.console import Console

load_dotenv()
console = Console()


def load_config() -> dict:
    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)
    for k in cfg["paths"]:
        cfg["paths"][k] = os.path.expanduser(cfg["paths"][k])
    return cfg


def extract_volume(filename: str) -> int:
    m = re.search(r"Volume_(\d+)", filename)
    return int(m.group(1)) if m else 0


def safe_filename(text: str, max_len: int = 60) -> str:
    """Strip filesystem-illegal characters and truncate."""
    cleaned = re.sub(r'[/\\:*?"<>|]', "", text).strip()
    return cleaned[:max_len].replace(" ", "_") or "chunk"


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--volume", type=int)
    args = parser.parse_args()

    cfg = load_config()
    obsidian_dir = Path(cfg["paths"]["obsidian_dir"])
    obsidian_dir.mkdir(parents=True, exist_ok=True)
    db_url = os.environ.get("COCOINDEX_DATABASE_URL", cfg["database"]["url"])
    table = cfg["cocoindex"]["collection_name"]

    conn = psycopg2.connect(db_url)
    cur = conn.cursor()

    if args.volume:
        cur.execute(
            f"SELECT filename, location, text FROM {table} WHERE filename LIKE %s ORDER BY filename, location",
            (f"%Volume_{args.volume}%",),
        )
    else:
        cur.execute(f"SELECT filename, location, text FROM {table} ORDER BY filename, location")

    rows = cur.fetchall()
    conn.close()
    console.print(f"Fetched {len(rows):,} chunks from Postgres.")

    all_files: list[tuple[int, Path]] = []
    vol_files: dict[int, list[Path]] = {}

    for filename, location, text in tqdm(rows, desc="Exporting"):
        vol = extract_volume(filename)
        vol_dir = obsidian_dir / f"Volume_{vol:02d}"
        vol_dir.mkdir(exist_ok=True)

        # Use the first non-empty line as a title proxy.
        lines = [l.strip() for l in text.split("\n") if l.strip()]
        title = lines[0][:60] if lines else f"Volume {vol} chunk"
        idx = len(list(vol_dir.glob("*.md")))
        md_filename = f"Vol{vol:02d}_{safe_filename(title)}_{idx}.md"
        filepath = vol_dir / md_filename

        body = (
            f"# {title}\n\n{text}\n\n---\n\n"
            f"*Source: Brockhaus & Efron Encyclopedic Dictionary, [[Volume {vol:02d}]]*  \n"
            f"*[[Brockhaus Index]]*\n"
        )
        post = frontmatter.Post(
            body,
            title=title,
            volume=vol,
            location=location,
            tags=["brockhaus", f"vol-{vol}"],
            source="Энциклопедический словарь Брокгауза и Ефрона",
        )
        filepath.write_text(frontmatter.dumps(post), encoding="utf-8")
        all_files.append((vol, filepath))
        vol_files.setdefault(vol, []).append(filepath)

    # Write per-volume overview files.
    for vol, files in vol_files.items():
        overview = obsidian_dir / f"Volume_{vol:02d}.md"
        links = "\n".join(f"- [[{f.stem}]]" for f in files)
        overview.write_text(
            f"# Volume {vol}\n\n**Chunks:** {len(files)}\n\n## Contents\n\n{links}\n\n---\n*[[Brockhaus Index]]*\n",
            encoding="utf-8",
        )

    # Write the master index.
    if cfg.get("obsidian", {}).get("create_index", True):
        index_path = obsidian_dir / "Brockhaus Index.md"
        rows_md = "\n".join(
            f"| [[{p.stem}]] | [[Volume {v:02d}]] |"
            for v, p in sorted(all_files, key=lambda x: (x[0], x[1].name))
        )
        index_path.write_text(
            "# Brockhaus & Efron Encyclopedic Dictionary\n\n"
            "| Article | Volume |\n|---------|--------|\n"
            + rows_md + "\n",
            encoding="utf-8",
        )
        console.print(f"[green]Index written:[/green] {index_path}")

    console.print(f"\n[bold green]Done.[/bold green] {len(all_files):,} files → {obsidian_dir}")


if __name__ == "__main__":
    main()

Run it:

python 04_export_obsidian.py

Open ~/brockhaus-rag/obsidian_vault/ in Obsidian (free download). Press Cmd+G for the knowledge graph across all 86 volumes. Press Cmd+Shift+F to full-text search all article files.

Recommended Obsidian plugins:

  • Dataview — SQL-like queries across your vault: TABLE title, volume FROM "Encyclopedia" WHERE contains(tags, "brockhaus")
  • Omnisearch — fast full-text search with ranking
  • Graph Analysis — improves graph view performance for large vaults

10. Phase 5 — Query the Knowledge Base

query.py imports text_to_embedding directly from cocoindex_flow.py — the same function used during indexing. This guarantees that your question vector and the stored document vectors live in the same embedding space, which is required for meaningful similarity scores.

When you type a question, the script:

  1. Prefixes it with "query: " (required by multilingual-e5) and embeds it
  2. Runs a pgvector cosine similarity search against all stored chunks
  3. Assembles a RAG prompt with the retrieved excerpts
  4. Calls your chosen LLM and streams the answer to the terminal

Create query.py:

#!/usr/bin/env python3
"""
Interactive RAG query CLI for the Brockhaus & Efron encyclopedia.

Usage:
    python query.py                                    # interactive, default provider
    python query.py --provider openai                  # use GPT-4o
    python query.py --provider google                  # use Gemini
    python query.py --provider ollama                  # fully local, no API key
    python query.py --query "Tell me about Tolstoy"    # single non-interactive query
    python query.py --top-k 15                         # retrieve more context

Interactive commands:
    /quit            — exit
    /help            — show commands
    /provider NAME   — switch LLM provider mid-session
    /topk N          — change number of retrieved chunks mid-session
"""

import os
import sys
import re
import argparse
import textwrap
from pathlib import Path

import yaml
import psycopg2
import cocoindex
from dotenv import load_dotenv
from rich.console import Console
from rich.panel import Panel          # Panel lives in rich.panel, not rich.console
from rich.markdown import Markdown
from rich.table import Table

load_dotenv()
cocoindex.init()

# Import the shared embedding function from the indexing flow.
# This guarantees query vectors use the exact same model and preprocessing
# as the document vectors stored in Postgres.
sys.path.insert(0, str(Path(__file__).parent))
from cocoindex_flow import text_to_embedding  # noqa: E402

console = Console()

SYSTEM_PROMPT = textwrap.dedent("""
    You are a scholarly research assistant specialising in 19th-century Russian
    history, science, culture, and language. Answer using ONLY the encyclopedia
    excerpts provided. Always cite the volume number(s) your answer draws from.
    If the answer is not in the excerpts, say so clearly — do not invent facts.
    Where helpful, translate Russian terms into English but keep key terms in Russian too.
""").strip()


def load_config() -> dict:
    return yaml.safe_load(open("config.yaml"))


# ---------------------------------------------------------------------------
# Retrieval
# ---------------------------------------------------------------------------

def retrieve(query: str, top_k: int, config: dict) -> list[dict]:
    """
    Embed the query and find the top-K most similar chunks in Postgres.

    The query is prefixed with "query: " as required by multilingual-e5 models.
    The <=> operator (pgvector cosine distance) returns 0 for identical vectors
    and 2 for opposite vectors, so ORDER BY ASC gives most similar first.
    Similarity score = 1 - distance, reported in results for transparency.
    """
    # Prefix is critical: multilingual-e5 expects "query: " on search strings
    # and "passage: " on documents (CocoIndex handles the document prefix internally).
    query_vec = text_to_embedding.eval(f"query: {query}")

    db_url = os.environ.get("COCOINDEX_DATABASE_URL", config["database"]["url"])
    table = config["cocoindex"]["collection_name"]
    vec_str = "[" + ",".join(str(float(x)) for x in query_vec) + "]"

    conn = psycopg2.connect(db_url)
    cur = conn.cursor()
    cur.execute(f"""
        SELECT
            filename,
            location,
            text,
            1 - (embedding <=> %s::vector) AS similarity
        FROM {table}
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (vec_str, vec_str, top_k))
    rows = cur.fetchall()
    conn.close()

    results = []
    for filename, location, text, similarity in rows:
        m = re.search(r"Volume_(\d+)", filename)
        vol = int(m.group(1)) if m else 0
        results.append({
            "volume": vol,
            "location": location,
            "text": text,
            "similarity": round(similarity, 4),
        })
    return results


# ---------------------------------------------------------------------------
# Prompt builder
# ---------------------------------------------------------------------------

def build_prompt(query: str, results: list[dict]) -> str:
    excerpts = "\n\n---\n\n".join(
        f"[Excerpt {i} — Volume {r['volume']}, position {r['location']} "
        f"(similarity: {r['similarity']})]\n{r['text']}"
        for i, r in enumerate(results, 1)
    )
    return (
        "The following are excerpts from the Brockhaus & Efron "
        "Encyclopedic Dictionary (1890–1907):\n\n"
        f"{excerpts}\n\nQuestion: {query}"
    )


# ---------------------------------------------------------------------------
# LLM backends — all four providers implemented
# ---------------------------------------------------------------------------

def call_anthropic(system: str, user: str, model: str) -> str:
    from anthropic import Anthropic
    client = Anthropic()
    with console.status("Generating answer (Claude)..."):
        r = client.messages.create(
            model=model, max_tokens=2048,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
    return r.content[0].text


def call_openai(system: str, user: str, model: str) -> str:
    from openai import OpenAI
    client = OpenAI()
    with console.status("Generating answer (OpenAI)..."):
        r = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
        )
    return r.choices[0].message.content


def call_google(system: str, user: str, model: str) -> str:
    import google.generativeai as genai
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    m = genai.GenerativeModel(model)
    with console.status("Generating answer (Gemini)..."):
        r = m.generate_content(f"{system}\n\n{user}")
    return r.text


def call_ollama(system: str, user: str, model: str) -> str:
    import requests
    with console.status(f"Running {model} locally via Ollama..."):
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": f"{system}\n\n{user}", "stream": False},
            timeout=300,
        )
        r.raise_for_status()
    return r.json()["response"]


def call_llm(provider: str, system: str, user: str, config: dict) -> str:
    llm = config["llm"]
    if provider == "anthropic":
        return call_anthropic(system, user, llm["anthropic_model"])
    elif provider == "openai":
        return call_openai(system, user, llm["openai_model"])
    elif provider == "google":
        return call_google(system, user, llm["google_model"])
    elif provider == "ollama":
        return call_ollama(system, user, llm["ollama_model"])
    else:
        raise ValueError(f"Unknown provider: {provider!r}. Choose: anthropic, openai, google, ollama")


# ---------------------------------------------------------------------------
# Query runner
# ---------------------------------------------------------------------------

def run_query(query: str, provider: str, top_k: int, config: dict) -> None:
    with console.status("Searching encyclopedia..."):
        results = retrieve(query, top_k, config)

    if not results:
        console.print("[red]No results found. Is the database populated? Run cocoindex update first.[/red]")
        return

    answer = call_llm(provider, SYSTEM_PROMPT, build_prompt(query, results), config)
    console.print(Panel(Markdown(answer), title="Answer", border_style="green"))

    table = Table(title="Sources retrieved", show_header=True)
    table.add_column("Vol", style="cyan", width=4)
    table.add_column("Similarity", width=10)
    table.add_column("Preview", style="dim")
    for r in results:
        table.add_row(
            str(r["volume"]),
            str(r["similarity"]),
            r["text"][:90].replace("\n", " ") + "…",
        )
    console.print(table)


# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------

def main():
    parser = argparse.ArgumentParser(description="Query the Brockhaus encyclopedia with AI.")
    parser.add_argument("--provider", choices=["anthropic", "openai", "google", "ollama"])
    parser.add_argument("--top-k", type=int)
    parser.add_argument("--query", type=str, help="Run a single query non-interactively.")
    args = parser.parse_args()

    config = load_config()
    provider = args.provider or config["llm"]["default_provider"]
    top_k = args.top_k or config["llm"]["top_k"]

    console.print(Panel(
        f"[bold]Brockhaus & Efron Encyclopedia[/bold]\n"
        f"Provider: [cyan]{provider}[/cyan]  |  "
        f"Top-K: [cyan]{top_k}[/cyan]  |  "
        f"Type [bold]/help[/bold] for commands",
        border_style="blue",
    ))

    if args.query:
        run_query(args.query, provider, top_k, config)
        return

    # Interactive loop
    while True:
        try:
            q = input("\n[You]: ").strip()
        except (EOFError, KeyboardInterrupt):
            break

        if not q:
            continue
        if q == "/quit":
            break
        if q == "/help":
            console.print(
                "[bold]Commands:[/bold]\n"
                "  /quit            — exit\n"
                "  /provider NAME   — switch provider (anthropic, openai, google, ollama)\n"
                "  /topk N          — change number of retrieved chunks\n"
                "  anything else    — search the encyclopedia"
            )
            continue
        if q.startswith("/provider "):
            provider = q.split(None, 1)[1].strip()
            console.print(f"Switched to: [cyan]{provider}[/cyan]")
            continue
        if q.startswith("/topk "):
            top_k = int(q.split()[1])
            console.print(f"Top-K set to: [cyan]{top_k}[/cyan]")
            continue

        run_query(q, provider, top_k, config)


if __name__ == "__main__":
    main()

Run it:

# Single non-interactive query
python query.py --query "What does the encyclopedia say about the Trans-Siberian railway?"

# Interactive mode with Claude (default)
python query.py

# Switch to a different provider at launch
python query.py --provider google
python query.py --provider ollama    # fully local, no API key

# Or switch providers mid-session using the /provider command:
# [You]: /provider openai
# [You]: /topk 15

11. Bonus: Fully Local with Ollama

Run the entire pipeline — including the query LLM — without any API keys or cloud services.

# Install Ollama
brew install ollama

# Start the Ollama server (must be running for query.py --provider ollama)
ollama serve &

# Pull models — choose based on your available RAM:
ollama pull llama3.2       # 3B params, ~2 GB RAM — very fast
ollama pull mistral        # 7B params, ~5 GB RAM — good multilingual
ollama pull qwen2.5:7b     # 7B params, ~5 GB RAM — BEST for Russian text
ollama pull llama3.1:8b    # 8B params, ~6 GB RAM — strong general quality

# Confirm a model works with Russian
ollama run qwen2.5:7b "Что такое энциклопедия?"

Set default_provider: ollama and ollama_model: qwen2.5:7b in config.yaml and you’re done — no keys, no cloud.


12. Maintaining and Updating the Index

The incremental re-processing is what makes CocoIndex particularly well-suited for a long-lived project like this. If you eventually get better OCR quality (e.g., by switching to Apple Vision), re-OCR a volume and CocoIndex handles the rest.

Re-OCR a single volume and update the index

# Force re-OCR of Volume 42 (overwrites the existing text file).
python 02_ocr.py --volume 42 --force

# Re-run CocoIndex — it detects the changed file hash and re-processes only Volume 42.
# All 85 other volumes are skipped entirely.
cocoindex update cocoindex_flow.py

Rebuild the entire index from scratch

# Drop all CocoIndex-managed tables (internal state + embedding table).
cocoindex drop cocoindex_flow.py

# Re-index everything.
cocoindex update cocoindex_flow.py

Manage Postgres

brew services start postgresql@16     # start
brew services stop postgresql@16      # stop
brew services restart postgresql@16   # restart

# View Postgres logs if something goes wrong
tail -f /opt/homebrew/var/log/postgresql@16.log

13. Troubleshooting

psql: error: connection to server on socket ... failed

Postgres isn’t running.

brew services start postgresql@16
# Wait 5 seconds, then:
psql postgres -c "SELECT 1;"

ERROR: could not open extension control file ... vector.control

pgvector isn’t installed, or Postgres hasn’t been restarted since it was installed.

brew install pgvector
brew services restart postgresql@16
psql cocoindex -c "CREATE EXTENSION IF NOT EXISTS vector;"

permission denied for schema public

This happens on Postgres 15+ because the public schema is restricted by default.

psql cocoindex -c "GRANT ALL ON SCHEMA public TO cocoindex;"

COCOINDEX_DATABASE_URL not set

# For the current terminal session:
export COCOINDEX_DATABASE_URL="postgresql://cocoindex:yourpassword@localhost:5432/cocoindex"

# Permanently — add to ~/.zprofile so every new terminal has it:
echo 'export COCOINDEX_DATABASE_URL="postgresql://cocoindex:yourpassword@localhost:5432/cocoindex"' >> ~/.zprofile
source ~/.zprofile

cocoindex: command not found

The Python virtual environment isn’t active.

source ~/brockhaus-rag/venv/bin/activate

ModuleNotFoundError: No module named 'cocoindex_flow'

Make sure the indexing file is named cocoindex_flow.py (not 03_cocoindex_flow.py). Python cannot import modules whose names start with a digit.

OCR output is empty or mostly noise

Confirm Tesseract has the Russian language pack installed:

tesseract --list-langs | grep rus
# Must print: rus

If it’s missing:

brew reinstall tesseract-lang

Also try increasing DPI in config.yaml from 300 to 400 for better recognition of small print.

Similarity search returns irrelevant results

  • The text_to_embedding import in query.py guarantees the same model is used at index time and query time. If you changed the embedding_model in config.yaml after indexing, you must rebuild: cocoindex drop then cocoindex update.
  • Confirm the "query: " prefix is applied to search strings (multilingual-e5 requires it).
  • Try --top-k 20 and let the LLM filter.

ollama: connection refused

The Ollama server isn’t running.

ollama serve &
# Wait a few seconds, then retry.

Summary

Phase                             Script                                Expected time
Install dependencies              brew install ... + pip install ...    5–10 min
PostgreSQL setup                  psql commands                          2 min
DjVu → PDF                        ./01_convert_djvu.sh                   5–20 min
OCR all 86 volumes                python 02_ocr.py                       2–8 hours
Chunk + embed + store             cocoindex update cocoindex_flow.py     30–60 min
Obsidian export                   python 04_export_obsidian.py           1–5 min
Query                             python query.py                        seconds per question
Incremental re-index (1 volume)   cocoindex update cocoindex_flow.py     ~1 min

The Brockhaus & Efron encyclopedia spans 121,000 articles across 86 volumes — now searchable by question, grounded in source text, and running entirely on your machine.

Crepi il lupo! 🐺


Appendix: Apple Vision OCR (Higher Accuracy on macOS)

Apple’s built-in Vision framework often achieves 20–30% better accuracy than Tesseract on pre-revolutionary Russian script, at no extra cost (it uses the Neural Engine on Apple Silicon). It requires macOS 12+ and the pyobjc bindings:

pip install pyobjc-framework-Vision pyobjc-framework-CoreML

Replace the pytesseract.image_to_string(...) call in 02_ocr.py with this function:

from Foundation import NSURL
import Vision

def apple_vision_ocr(image_path: str) -> str:
    """
    Run Apple Vision text recognition on a PNG/JPEG file path.

    Observations are sorted top-to-bottom by bounding box Y coordinate
    to preserve reading order. Requires macOS 12+.
    """
    url = NSURL.fileURLWithPath_(image_path)
    handler = Vision.VNImageRequestHandler.alloc().initWithURL_options_(url, None)

    request = Vision.VNRecognizeTextRequest.alloc().init()
    request.setRecognitionLanguages_(["ru", "ru-RU"])
    request.setRecognitionLevel_(Vision.VNRequestTextRecognitionLevelAccurate)
    request.setUsesLanguageCorrection_(True)

    handler.performRequests_error_([request], None)

    observations = request.results() or []
    # Sort top-to-bottom: higher Y in Vision's coordinate system = higher on page.
    sorted_obs = sorted(observations, key=lambda o: -o.boundingBox().origin.y)
    return "\n".join(
        o.topCandidates_(1)[0].string()
        for o in sorted_obs
        if o.topCandidates_(1)
    )

You’ll need to save each PDF page as a temporary PNG file first (pdf2image can do this), then pass the path to apple_vision_ocr() instead of calling pytesseract.image_to_string().
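
One way to wire that in — a sketch for the page loop in 02_ocr.py, assuming image is the PIL page object produced by convert_from_path():

# Sketch: replaces the pytesseract call inside 02_ocr.py's page loop.
import os
import tempfile

with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    image.save(tmp.name, "PNG")      # `image` is the PIL page from convert_from_path()
try:
    text = apple_vision_ocr(tmp.name)
finally:
    os.unlink(tmp.name)              # clean up the temporary PNG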