← blogs · yasir bugraratags

Curating Training Data for LLMs

2026-04-13 · llm · training-data

I published a reference guide on training data curation for LLMs. It covers the full pipeline: pre-training corpora, fine-tuning pairs, preference rankings, RL trajectories, and safety data.

data-guide.usagentix.com

Framework

The guide organizes training data along three axes:

What's in it

11 sections covering the full lifecycle: