Dev Notes

Welcome to NeMo Data Designer Dev Notes! Here you'll find in-depth guides, tutorials, and insights about synthetic data generation.

Async All the Way Down

Every Data Designer pipeline carries a map of what can run in parallel. Consider a pipeline that generates a topic, writes a summary and a trivia fact from that topic, then produces an analysis of the summary. summary and trivia both depend on topic, so they could run alongside each other. analysis depends on summary, so it has to wait — but only on the same row's summary, not the entire column. These references form a per-cell dependency graph. The previous engine used that graph to order columns, but it ran each column to completion before starting the next. A row's analysis couldn't start until every row of summary had finished, even though it only needed its own.

We rebuilt the execution layer to schedule at the cell level. As soon as a cell's specific upstream dependencies complete, it dispatches — regardless of what other rows or columns are still in flight. Completion flows diagonally across the grid: early rows finish all their columns while later rows are still generating their first. For multi-model workflows, this means every endpoint stays saturated — a judge model starts processing rows the moment the first generator results land, rather than waiting for all generation to finish. The result is significantly faster pipelines with no changes to your config.
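
The cell-level scheduling idea can be sketched with plain asyncio. This is a hedged illustration, not Data Designer's actual engine: the `DEPS` graph, `generate` stub, and `run_pipeline` helper are hypothetical names. Each cell becomes a task that awaits only its upstream cells in the same row, so an `analysis` cell for row 0 dispatches as soon as row 0's `summary` finishes, regardless of later rows.

```python
import asyncio

# Hypothetical column dependency graph matching the example above:
# summary and trivia depend on topic; analysis depends on summary.
DEPS = {"topic": [], "summary": ["topic"], "trivia": ["topic"], "analysis": ["summary"]}

async def generate(column: str, row: int, upstream: dict) -> str:
    # Stand-in for a model call; a real cell would hit an LLM endpoint.
    await asyncio.sleep(0.01)
    return f"{column}[{row}] from {sorted(upstream)}"

async def run_pipeline(num_rows: int) -> dict:
    # One task per cell; a cell dispatches as soon as ITS upstream
    # cells in the SAME row complete, not the whole upstream column.
    cells: dict = {}

    async def cell(column: str, row: int) -> str:
        upstream = {dep: await cells[(dep, row)] for dep in DEPS[column]}
        return await generate(column, row, upstream)

    for row in range(num_rows):
        for column in DEPS:  # insertion order respects dependencies
            cells[(column, row)] = asyncio.create_task(cell(column, row))
    results = await asyncio.gather(*cells.values())
    return dict(zip(cells, results))
```

Because every cell is its own task, completion flows diagonally across the grid exactly as described: early rows finish all their columns while later rows are still on their first.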

Owning the Model Stack: Adaptive Concurrency FTW!

Picture this: you're generating a million-record dataset. Thirty-two concurrent requests per model, three models in the pipeline, two providers. Everything hums along for the first ten minutes, then one provider starts returning 429s, your retry logic kicks in, and suddenly you're in a feedback loop where retries cause more 429s. The run stalls. You restart with lower concurrency, waste throughput for hours, and wonder if there's a better way.

There is. This post is about the native model client layer we built with adaptive throttling, a system that discovers provider capacity at runtime, replacing our dependency on LiteLLM along the way.
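
One common shape for adaptive throttling is additive-increase / multiplicative-decrease (AIMD), the same idea TCP uses for congestion control. The sketch below is an assumption about the general technique, not Data Designer's actual client code; the class name and methods are hypothetical. The limit creeps up while requests succeed and is halved on a 429, so the client converges on provider capacity instead of retry-storming it.

```python
class AdaptiveLimiter:
    """AIMD concurrency limit: additive increase, multiplicative decrease.

    Hypothetical sketch of adaptive throttling, not the actual
    Data Designer implementation.
    """

    def __init__(self, initial: int = 8, floor: int = 1, ceiling: int = 64):
        self.limit = initial      # current allowed in-flight requests
        self.floor = floor        # never throttle below this
        self.ceiling = ceiling    # never grow beyond this

    def on_success(self) -> None:
        # Additive increase: probe for more capacity, one slot at a time.
        self.limit = min(self.ceiling, self.limit + 1)

    def on_throttle(self) -> None:
        # Multiplicative decrease on a 429: back off hard to break
        # the retry feedback loop, but keep at least `floor` in flight.
        self.limit = max(self.floor, self.limit // 2)
```

A request loop would check `limit` before dispatching and call `on_success` or `on_throttle` per response; the halving on throttle is what prevents the 429 feedback loop described above.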

Data Designer Got Skills

Lessons from building an agent-first CLI and skill for Data Designer

We just published the data-designer skill, which leverages agent-focused CLI commands in Data Designer to efficiently generate datasets. Just describe the dataset you want and your agent will craft the Data Designer configuration for you — schema design, validation, preview, generation — interactively or on full autopilot (just tell the agent to "be opinionated" or "surprise me").

Search Agent SFT Data: Teaching LLMs to Browse the Web

Training search agents requires trajectory data: the full multi-turn interaction showing how a model searches, reads, reasons, and answers. We built a four-stage pipeline that generates synthetic search trajectories from Wikidata knowledge graph paths, converts them into BrowseComp-style riddles using NeMo Data Designer, generates multi-step search rollouts with live web search via Tavily, and post-processes the results into SFT-ready training data.

Structured Outputs for Nemotron: Teaching Models to Produce Valid JSON, YAML, and XML

Using NeMo Data Designer, an orchestration framework for generating high-quality synthetic data at scale, we built an iterative pipeline that generates diverse, schema-constrained structured outputs across JSON, YAML, and XML. Through multiple rounds of prompt refinement, rejection sampling, and programmatic validation, we produced a 9,949-sample dataset of verified structured output training data.

Designing Data Designer: Why SDG Is a Systems Problem

Synthetic data generation is more than a single prompt to a large language model. In this post, we walk through the design principles behind NeMo Data Designer and explain why we built it as a composable orchestration framework: treating SDG as a system of specialized stages rather than a monolithic generation task.