How to Extract Your AI Coding Assistant Chats


Credit: This guide is based on ai-data-extraction by 0xSero.

Why Extract Your Chats?

Your AI coding assistant conversations are valuable:

  • Training data for fine-tuning models on your coding style
  • Personal backup before switching tools or resetting
  • Analytics on how you use AI assistance
  • Migration between tools

Supported Tools

| Tool | Script | Data |
|------|--------|------|
| Claude Code | `extract_claude_code.py` | Messages, tool use, file context |
| Cursor | `extract_cursor.py` | Chat, Composer, Agent sessions |
| Windsurf | `extract_windsurf.py` | Chat, flow conversations |
| Trae | `extract_trae.py` | Chat, agent data |
| Codex | `extract_codex.py` | User/agent messages, diffs |
| Continue | `extract_continue.py` | Sessions, reasoning blocks |
| Gemini CLI | `extract_gemini.py` | Messages, thoughts, token usage |
| OpenCode | `extract_opencode.py` | Full hierarchy, tool calls |

Quick Start

```shell
# Clone the repo
git clone https://github.com/0xSero/ai-data-extraction
cd ai-data-extraction

# Extract from one tool
python3 extract_claude_code.py

# Extract from all tools
./extract_all.sh
```

Output Format

All scripts write to `extracted_data/`, creating timestamped JSONL files:

```
extracted_data/
├── claude_code_conversations_20260324_143022.jsonl
├── cursor_complete_20260324_143045.jsonl
└── ...
```

Each line is a complete conversation:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "How do I fix this error?",
      "code_context": [
        {
          "file": "/project/src/app.ts",
          "code": "const x = undefined;",
          "range": {"selectionStartLineNumber": 10}
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The error is because...",
      "suggested_diffs": [...]
    }
  ],
  "source": "cursor-composer",
  "created_at": 1705414222000
}
```
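Since each line is standalone JSON, a few lines of Python are enough to load an extracted file and sanity-check it before doing anything else. A minimal sketch (the helper names `iter_conversations` and `summarize` are illustrative, not part of the toolkit):

```python
import json

def iter_conversations(path):
    """Yield one conversation dict per JSONL line, skipping blank or corrupt lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a truncated or malformed line

def summarize(path):
    """Return (conversation_count, message_count) for one extracted file."""
    convs = msgs = 0
    for conv in iter_conversations(path):
        convs += 1
        msgs += len(conv.get("messages", []))
    return convs, msgs
```

Skipping unparseable lines (rather than crashing) is deliberate: an interrupted extraction can leave a partial final line.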

Filtering Extracted Data

Find conversations with code changes:

```python
import glob
import json

# Output files are timestamped, so match them with a glob.
for path in glob.glob('extracted_data/cursor_complete_*.jsonl'):
    with open(path) as f:
        for line in f:
            conv = json.loads(line)
            # Keep conversations where the assistant proposed code edits.
            if any('suggested_diffs' in m for m in conv['messages']):
                print(json.dumps(conv))
```

Filter by date:

```python
import glob
import json
from datetime import datetime

# created_at is a Unix timestamp in milliseconds.
cutoff = datetime(2024, 1, 1).timestamp() * 1000
for path in glob.glob('extracted_data/claude_code_*.jsonl'):
    with open(path) as f:
        for line in f:
            conv = json.loads(line)
            if conv.get('created_at', 0) > cutoff:
                print(conv['messages'][0]['content'][:100])
```
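For the analytics use case mentioned earlier, the `source` field (shown in the sample record above) makes it easy to tally conversations per tool. A small sketch, assuming that field is present; `count_by_source` is an illustrative name:

```python
import glob
import json
from collections import Counter

def count_by_source(pattern):
    """Tally conversations per 'source' tag across all matching JSONL files."""
    counts = Counter()
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    conv = json.loads(line)
                    counts[conv.get("source", "unknown")] += 1
    return counts

# e.g. count_by_source('extracted_data/*.jsonl')
```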

Prepare for Training

Merge all files:

```shell
cat extracted_data/*.jsonl > all_conversations.jsonl
wc -l all_conversations.jsonl
```
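A merged file can contain exact duplicates (for example, the same chat extracted twice). One stdlib-only way to drop them before training is to hash each conversation's `messages` payload and keep only the first copy; `dedupe_jsonl` below is an illustrative helper, not part of the toolkit:

```python
import hashlib
import json

def dedupe_jsonl(in_path, out_path):
    """Copy in_path to out_path, keeping only the first occurrence of each
    conversation (keyed by a hash of its 'messages' list). Returns lines kept."""
    seen = set()
    kept = 0
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            conv = json.loads(line)
            # sort_keys makes the hash stable regardless of key order
            key = hashlib.sha256(
                json.dumps(conv.get("messages", []), sort_keys=True).encode()
            ).hexdigest()
            if key not in seen:
                seen.add(key)
                dst.write(json.dumps(conv) + "\n")
                kept += 1
    return kept
```

This only catches byte-identical message lists; near-duplicates (re-run sessions with minor edits) need fuzzier matching.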

Use with Unsloth for fine-tuning:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/qwen2.5-coder-7b-instruct",
    max_seq_length=4096,
    load_in_4bit=True,
)

def format_chat(example):
    # Render one conversation into the model's chat template;
    # map this over your dataset before training.
    return {
        'text': tokenizer.apply_chat_template(
            example['messages'],
            tokenize=False
        )
    }
```

Privacy First

Before using extracted data:

```shell
# Scan for secrets
pip install detect-secrets
detect-secrets scan extracted_data/*.jsonl
```

  • Check for API keys, passwords, and tokens
  • Verify no proprietary code is exposed
  • Sanitize file paths if needed
  • Don’t commit extracted data to public repos
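detect-secrets is the right tool for a thorough scan; as a quick first pass, a few regexes catch the most recognizable token formats. The patterns below are illustrative and far from exhaustive:

```python
import re

# Illustrative patterns only -- a quick pre-filter, not a detect-secrets replacement.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                 # OpenAI-style API keys
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key IDs
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                 # GitHub personal access tokens
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private keys
]

def find_secrets(text):
    """Return all substrings matching a known secret pattern."""
    hits = []
    for pat in SECRET_PATTERNS:
        hits.extend(pat.findall(text))
    return hits
```

An empty result here does not mean the data is clean; plain passwords and proprietary snippets match no pattern, so manual review still matters.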

Troubleshooting

| Problem | Solution |
|---------|----------|
| “No installations found” | Check that the tool is installed; verify the path manually |
| Empty output | Close the AI tool before running; check the database location |
| Database locked | Close the tool; use read-only mode if the lock persists |
| Permission denied | Run with correct permissions; copy the databases first |
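For the “database locked” case: many of these tools store chat history in SQLite, and Python can open a SQLite file read-only via a URI, which avoids taking a write lock. A sketch (the example path is hypothetical):

```python
import sqlite3

def open_readonly(db_path):
    """Open a SQLite database without acquiring a write lock."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

# e.g. con = open_readonly("/path/to/state.vscdb")
```

Any attempt to write through this connection raises `sqlite3.OperationalError`, which is exactly what you want when poking at another application's live database.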

Note: This toolkit extracts YOUR data from locally installed tools. You are responsible for handling sensitive information appropriately.

Crepi il lupo! (Italian for “may the wolf die” — i.e., good luck!) 🐺