How to Extract Your AI Coding Assistant Chats


Credit: This guide is based on ai-data-extraction by 0xSero.

Why Extract Your Chats?

Your AI coding assistant conversations are valuable:

  • Training data for fine-tuning models on your coding style
  • Personal backup before switching tools or resetting
  • Analytics on how you use AI assistance
  • Migration between tools

Supported Tools

| Tool | Script | Data |
|------|--------|------|
| Claude Code | `extract_claude_code.py` | Messages, tool use, file context |
| Cursor | `extract_cursor.py` | Chat, Composer, Agent sessions |
| Windsurf | `extract_windsurf.py` | Chat, flow conversations |
| Trae | `extract_trae.py` | Chat, agent data |
| Codex | `extract_codex.py` | User/agent messages, diffs |
| Continue | `extract_continue.py` | Sessions, reasoning blocks |
| Gemini CLI | `extract_gemini.py` | Messages, thoughts, token usage |
| OpenCode | `extract_opencode.py` | Full hierarchy, tool calls |

Quick Start

```shell
# Clone the repo
git clone https://github.com/0xSero/ai-data-extraction
cd ai-data-extraction

# Extract from one tool
python3 extract_claude_code.py

# Extract from all tools
./extract_all.sh
```

Output Format

All scripts write to `extracted_data/`, creating timestamped JSONL files:

```
extracted_data/
├── claude_code_conversations_20260324_143022.jsonl
├── cursor_complete_20260324_143045.jsonl
└── ...
```

Each line is a complete conversation:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "How do I fix this error?",
      "code_context": [
        {
          "file": "/project/src/app.ts",
          "code": "const x = undefined;",
          "range": {"selectionStartLineNumber": 10}
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The error is because...",
      "suggested_diffs": [...]
    }
  ],
  "source": "cursor-composer",
  "created_at": 1705414222000
}
```
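Since each line is standalone JSON, a few lines of Python are enough to load an extracted file and sanity-check it before doing anything else. A minimal sketch (the helper names `iter_conversations` and `summarize` are illustrative, not part of the toolkit):

```python
import json

def iter_conversations(path):
    """Yield one conversation dict per JSONL line, skipping blank or corrupt lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a truncated or malformed line

def summarize(path):
    """Return (conversation_count, message_count) for one extracted file."""
    convs = msgs = 0
    for conv in iter_conversations(path):
        convs += 1
        msgs += len(conv.get("messages", []))
    return convs, msgs
```

Skipping unparseable lines (rather than crashing) is deliberate: an interrupted extraction can leave a partial final line.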

Filtering Extracted Data

Find conversations with code changes:

```python
import glob
import json

# Output files are timestamped, so match them with a glob.
for path in glob.glob('extracted_data/cursor_complete_*.jsonl'):
    with open(path) as f:
        for line in f:
            conv = json.loads(line)
            # Keep conversations where the assistant proposed code edits.
            if any('suggested_diffs' in m for m in conv['messages']):
                print(json.dumps(conv))
```

Filter by date:

```python
import glob
import json
from datetime import datetime

# created_at is a Unix timestamp in milliseconds.
cutoff = datetime(2024, 1, 1).timestamp() * 1000
for path in glob.glob('extracted_data/claude_code_*.jsonl'):
    with open(path) as f:
        for line in f:
            conv = json.loads(line)
            if conv.get('created_at', 0) > cutoff:
                print(conv['messages'][0]['content'][:100])
```
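For the analytics use case mentioned earlier, the `source` field (shown in the sample record above) makes it easy to tally conversations per tool. A small sketch, assuming that field is present; `count_by_source` is an illustrative name:

```python
import glob
import json
from collections import Counter

def count_by_source(pattern):
    """Tally conversations per 'source' tag across all matching JSONL files."""
    counts = Counter()
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    conv = json.loads(line)
                    counts[conv.get("source", "unknown")] += 1
    return counts

# e.g. count_by_source('extracted_data/*.jsonl')
```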

Prepare for Training

Merge all files:

```shell
cat extracted_data/*.jsonl > all_conversations.jsonl
wc -l all_conversations.jsonl
```
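A merged file can contain exact duplicates (for example, the same chat extracted twice). One stdlib-only way to drop them before training is to hash each conversation's `messages` payload and keep only the first copy; `dedupe_jsonl` below is an illustrative helper, not part of the toolkit:

```python
import hashlib
import json

def dedupe_jsonl(in_path, out_path):
    """Copy in_path to out_path, keeping only the first occurrence of each
    conversation (keyed by a hash of its 'messages' list). Returns lines kept."""
    seen = set()
    kept = 0
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            conv = json.loads(line)
            # sort_keys makes the hash stable regardless of key order
            key = hashlib.sha256(
                json.dumps(conv.get("messages", []), sort_keys=True).encode()
            ).hexdigest()
            if key not in seen:
                seen.add(key)
                dst.write(json.dumps(conv) + "\n")
                kept += 1
    return kept
```

This only catches byte-identical message lists; near-duplicates (re-run sessions with minor edits) need fuzzier matching.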

Use with Unsloth for fine-tuning:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/qwen2.5-coder-7b-instruct",
    max_seq_length=4096,
    load_in_4bit=True,
)

def format_chat(example):
    # Render one conversation into the model's chat template;
    # map this over your dataset before training.
    return {
        'text': tokenizer.apply_chat_template(
            example['messages'],
            tokenize=False
        )
    }
```

Privacy First

Before using extracted data:

```shell
# Scan for secrets
pip install detect-secrets
detect-secrets scan extracted_data/*.jsonl
```

  • Check for API keys, passwords, and tokens
  • Verify no proprietary code is exposed
  • Sanitize file paths if needed
  • Don’t commit extracted data to public repos
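detect-secrets is the right tool for a thorough scan; as a quick first pass, a few regexes catch the most recognizable token formats. The patterns below are illustrative and far from exhaustive:

```python
import re

# Illustrative patterns only -- a quick pre-filter, not a detect-secrets replacement.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                 # OpenAI-style API keys
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key IDs
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                 # GitHub personal access tokens
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private keys
]

def find_secrets(text):
    """Return all substrings matching a known secret pattern."""
    hits = []
    for pat in SECRET_PATTERNS:
        hits.extend(pat.findall(text))
    return hits
```

An empty result here does not mean the data is clean; plain passwords and proprietary snippets match no pattern, so manual review still matters.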

Troubleshooting

| Problem | Solution |
|---------|----------|
| “No installations found” | Check that the tool is installed; verify the path manually |
| Empty output | Close the AI tool before running; check the database location |
| Database locked | Close the tool; use read-only mode if the lock persists |
| Permission denied | Run with correct permissions; copy the databases first |
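For the “database locked” case: many of these tools store chat history in SQLite, and Python can open a SQLite file read-only via a URI, which avoids taking a write lock. A sketch (the example path is hypothetical):

```python
import sqlite3

def open_readonly(db_path):
    """Open a SQLite database without acquiring a write lock."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

# e.g. con = open_readonly("/path/to/state.vscdb")
```

Any attempt to write through this connection raises `sqlite3.OperationalError`, which is exactly what you want when poking at another application's live database.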

Note: This toolkit extracts YOUR data from locally installed tools. You are responsible for handling sensitive information appropriately.

Crepi il lupo! (Italian for “may the wolf die” — i.e., good luck!) 🐺