Synthetic Data

Artificially generated data used to train or evaluate AI models, often created by other models or simulations.

Synthetic data is data generated artificially, by a model or a simulation, rather than collected from the real world. It is commonly used to augment training sets, fill coverage gaps such as rare classes or edge cases, generate hard negatives, and create data for tasks where real labels are expensive to obtain.

In modern LLM training, outputs from a larger model are often used as synthetic data to fine-tune smaller ones. Techniques such as Self-Instruct and Constitutional AI rely heavily on model-generated data.
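As a rough illustration of the teacher-to-student flow described above, the following sketch builds a small fine-tuning set from seed prompts. It is a minimal sketch under stated assumptions, not how any particular method works: `teacher_generate` is a hypothetical stand-in for a call to a larger model, and the record fields, the deduplication rule, and the `min_words` filter are all invented for this example.

```python
import json
from typing import Callable


def teacher_generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to a larger 'teacher' model."""
    return f"Answer to: {prompt}"


def build_synthetic_dataset(
    seed_prompts: list[str],
    generate: Callable[[str], str],
    min_words: int = 3,
) -> list[dict]:
    """Generate prompt/response pairs, skipping duplicates and short outputs."""
    seen: set[str] = set()
    records: list[dict] = []
    for prompt in seed_prompts:
        # Normalize case and whitespace so trivially repeated prompts are dropped.
        key = " ".join(prompt.lower().split())
        if key in seen:
            continue
        seen.add(key)
        response = generate(prompt)
        # Crude quality filter (an assumption): discard very short responses.
        if len(response.split()) < min_words:
            continue
        records.append(
            {"prompt": prompt, "response": response, "source": "synthetic"}
        )
    return records


seeds = [
    "Explain overfitting in one sentence.",
    "explain  overfitting in one sentence.",  # duplicate after normalization
    "Define recall.",
]
dataset = build_synthetic_dataset(seeds, teacher_generate)

# One JSON object per line, a common layout for fine-tuning data.
jsonl = "\n".join(json.dumps(record) for record in dataset)
print(jsonl)
```

Tagging each record with a `source` field is a design choice that makes it possible to track, and later cap, the share of synthetic examples in a training mix.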

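Risks: synthetic data can inherit or amplify the biases of the model that generated it, and over-reliance can cause "model collapse", a progressive loss of quality and diversity that occurs when models are trained repeatedly on their own outputs or those of earlier models.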
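The collapse effect can be illustrated with a toy simulation. This is only an analogy, assuming a "model" that is nothing more than a Gaussian fitted to its training data: each generation fits a mean and standard deviation to the previous generation's samples, then draws a fresh training set from that fit. With small samples, estimation error compounds and the fitted spread tends to shrink over many generations. The function name, sample size, and generation count are arbitrary choices for this sketch.

```python
import random
import statistics


def simulate_collapse(
    generations: int = 500,
    sample_size: int = 10,
    seed: int = 0,
) -> list[float]:
    """Return the fitted standard deviation at each generation."""
    rng = random.Random(seed)
    # Generation 0 trains on "real" data drawn from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    stds = []
    for _ in range(generations):
        # "Train": fit a Gaussian to the current data.
        mean = statistics.fmean(data)
        std = statistics.stdev(data)
        stds.append(std)
        # "Generate": the next generation sees only this model's outputs.
        data = [rng.gauss(mean, std) for _ in range(sample_size)]
    stds.append(statistics.stdev(data))
    return stds


stds = simulate_collapse()
print(f"fitted std at generation 0:   {stds[0]:.4g}")
print(f"fitted std at generation {len(stds) - 1}: {stds[-1]:.4g}")
```

Mixing some real data back into each generation, rather than training only on generated samples, is one way to slow this drift in the toy setting.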

Despite these risks, synthetic data has become a core part of modern AI development. It lets teams train high-quality models while reducing the cost and some of the ethical complications of curating massive human-labeled datasets.
