# How to Extract Your AI Coding Assistant Chats

Credit: This guide is based on [ai-data-extraction](https://github.com/0xSero/ai-data-extraction) by 0xSero.
## Why Extract Your Chats?

Your AI coding assistant conversations are valuable:
- Training data for fine-tuning models on your coding style
- Personal backup before switching tools or resetting
- Analytics on how you use AI assistance
- Migration between tools
## Supported Tools

| Tool | Script | Data |
|---|---|---|
| Claude Code | `extract_claude_code.py` | Messages, tool use, file context |
| Cursor | `extract_cursor.py` | Chat, Composer, Agent sessions |
| Windsurf | `extract_windsurf.py` | Chat, flow conversations |
| Trae | `extract_trae.py` | Chat, agent data |
| Codex | `extract_codex.py` | User/agent messages, diffs |
| Continue | `extract_continue.py` | Sessions, reasoning blocks |
| Gemini CLI | `extract_gemini.py` | Messages, thoughts, token usage |
| OpenCode | `extract_opencode.py` | Full hierarchy, tool calls |
## Quick Start

```shell
# Clone the repo
git clone https://github.com/0xSero/ai-data-extraction
cd ai-data-extraction

# Extract from one tool
python3 extract_claude_code.py

# Extract from all tools
./extract_all.sh
```

## Output Format
All scripts write timestamped JSONL files to `extracted_data/`:

```
extracted_data/
├── claude_code_conversations_20260324_143022.jsonl
├── cursor_complete_20260324_143045.jsonl
└── ...
```

Each line is a complete conversation:
```json
{
  "messages": [
    {
      "role": "user",
      "content": "How do I fix this error?",
      "code_context": [
        {
          "file": "/project/src/app.ts",
          "code": "const x = undefined;",
          "range": {"selectionStartLineNumber": 10}
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The error is because...",
      "suggested_diffs": [...]
    }
  ],
  "source": "cursor-composer",
  "created_at": 1705414222000
}
```

## Filtering Extracted Data
Find conversations with code changes:

```python
import json

with open('extracted_data/cursor_complete.jsonl') as f:
    for line in f:
        conv = json.loads(line)
        if any('suggested_diffs' in m for m in conv['messages']):
            print(json.dumps(conv))
```

Filter by date:
```python
import json
from datetime import datetime

cutoff = datetime(2024, 1, 1).timestamp() * 1000  # created_at is in milliseconds

with open('extracted_data/claude_code.jsonl') as f:
    for line in f:
        conv = json.loads(line)
        if conv.get('created_at', 0) > cutoff:
            print(conv['messages'][0]['content'][:100])
```

## Prepare for Training
Merge all files:

```shell
cat extracted_data/*.jsonl > all_conversations.jsonl
wc -l all_conversations.jsonl
```

Use with Unsloth for fine-tuning:
```python
from datasets import load_dataset
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/qwen2.5-coder-7b-instruct",
    max_seq_length=4096,
    load_in_4bit=True,
)

def format_chat(example):
    # Flatten each conversation into a single training string.
    # Note: some chat templates reject extra keys like code_context;
    # strip messages down to role/content first if yours does.
    return {
        'text': tokenizer.apply_chat_template(
            example['messages'],
            tokenize=False,
        )
    }

dataset = load_dataset('json', data_files='all_conversations.jsonl', split='train')
dataset = dataset.map(format_chat)
```

## Privacy First
Before using extracted data:

```shell
# Scan for secrets
pip install detect-secrets
detect-secrets scan extracted_data/*.jsonl
```

- Check for API keys, passwords, and tokens
- Verify no proprietary code exposed
- Sanitize file paths if needed
- Don’t commit to public repos
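Path sanitization can be scripted against the conversation format shown above. A minimal sketch; `sanitize_paths` and its default regex are illustrative helpers, not part of the toolkit:

```python
import json
import re

def sanitize_paths(conv, user_dir_pattern=r'/(?:home|Users)/[^/"]+'):
    # Hypothetical helper: replace user home directories in code_context
    # file paths with a placeholder. The regex is an illustrative default.
    for msg in conv.get('messages', []):
        for ctx in msg.get('code_context', []):
            if 'file' in ctx:
                ctx['file'] = re.sub(user_dir_pattern, '/REDACTED', ctx['file'])
    return conv

# Stream each conversation through the sanitizer before sharing:
# for line in open('extracted_data/cursor_complete.jsonl'):
#     print(json.dumps(sanitize_paths(json.loads(line))))
conv = {'messages': [{'role': 'user',
                      'code_context': [{'file': '/Users/alice/app.ts'}]}]}
print(sanitize_paths(conv)['messages'][0]['code_context'][0]['file'])
# → /REDACTED/app.ts
```

Sanitize into a new file rather than in place, so the originals survive if the pattern over- or under-matches.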
## Troubleshooting
| Problem | Solution |
|---|---|
| “No installations found” | Check tool is installed; verify path manually |
| Empty output | Close the AI tool before running; check database location |
| Database locked | Close the tool; use read-only mode if persistent |
| Permission denied | Run with correct permissions; copy databases first |
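For the "Database locked" and "Permission denied" rows, one workaround is to copy the database and extract from the copy. A sketch; `DB_PATH` assumes Cursor's global storage location on macOS, and every tool keeps its database somewhere different, so adjust it for your tool and OS:

```shell
# Copy the (possibly locked) database to a scratch dir and work on the copy.
# DB_PATH is an example; adjust per tool and OS.
DB_PATH="$HOME/Library/Application Support/Cursor/User/globalStorage/state.vscdb"
SAFE_COPY="/tmp/ai-extract/$(basename "$DB_PATH")"

mkdir -p /tmp/ai-extract
cp "$DB_PATH" "$SAFE_COPY" 2>/dev/null || echo "copy failed: close the tool or check DB_PATH"

# Point the extraction script at the copy instead of the live database.
echo "$SAFE_COPY"
```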
**Note:** This toolkit extracts *your* data from locally installed tools. You are responsible for handling sensitive information appropriately.
Good luck! 🐺