Curating Training Data for LLMs
2026-04-13 · llm · training-data
I published a reference guide on training data curation for LLMs. It covers the full pipeline: pre-training corpora, fine-tuning pairs, preference rankings, RL trajectories, and safety data.
Framework
The guide organizes training data along three axes:
- Stage: what phase of training the data is for (pre-training, fine-tuning, preference learning, RL, safety)
- Format: the shape of each example (document chunks, prompt-response pairs, ranked answers, action trajectories)
- Behavior: what the model should learn (knowledge, helpfulness, safety, tool use, planning, self-correction)
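One way to picture the three axes is as tags attached to each dataset. The sketch below is only an illustration under my own assumptions: the names `Stage`, `Format`, `Behavior`, `DatasetTag`, and `select`, and the two example datasets, are hypothetical and are not taken from the guide. The enum values mirror the lists above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, List, Optional


class Stage(Enum):
    PRE_TRAINING = "pre-training"
    FINE_TUNING = "fine-tuning"
    PREFERENCE_LEARNING = "preference learning"
    RL = "RL"
    SAFETY = "safety"


class Format(Enum):
    DOCUMENT_CHUNKS = "document chunks"
    PROMPT_RESPONSE_PAIRS = "prompt-response pairs"
    RANKED_ANSWERS = "ranked answers"
    ACTION_TRAJECTORIES = "action trajectories"


class Behavior(Enum):
    KNOWLEDGE = "knowledge"
    HELPFULNESS = "helpfulness"
    SAFETY = "safety"
    TOOL_USE = "tool use"
    PLANNING = "planning"
    SELF_CORRECTION = "self-correction"


@dataclass(frozen=True)
class DatasetTag:
    """Hypothetical record placing one dataset on the three axes."""
    name: str
    stage: Stage
    format: Format
    behaviors: FrozenSet[Behavior]


def select(
    tags: Iterable[DatasetTag],
    stage: Optional[Stage] = None,
    behavior: Optional[Behavior] = None,
) -> List[DatasetTag]:
    """Return the tags matching the given stage and/or behavior filters."""
    return [
        t for t in tags
        if (stage is None or t.stage is stage)
        and (behavior is None or behavior in t.behaviors)
    ]


# Illustrative entries only; these dataset names are made up.
EXAMPLE_TAGS = [
    DatasetTag("web_text_sample", Stage.PRE_TRAINING,
               Format.DOCUMENT_CHUNKS, frozenset({Behavior.KNOWLEDGE})),
    DatasetTag("tool_call_demos", Stage.FINE_TUNING,
               Format.PROMPT_RESPONSE_PAIRS,
               frozenset({Behavior.TOOL_USE, Behavior.HELPFULNESS})),
]
```

With tags like these, `select(EXAMPLE_TAGS, behavior=Behavior.TOOL_USE)` would return only the `tool_call_demos` entry, which is the kind of lookup a behavior-to-dataset mapping supports.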
What's in it
The guide has 11 sections covering the full lifecycle, including:
- Practical guidance for getting started
- The training pipeline and how stages connect
- Deep dives into pre-training, fine-tuning, preference/RL, and safety data
- A behavior-to-dataset mapping
- Reference tables