Dataset manifest (kind: Dataset)¶
Package: mas-lab-bench · Schema: dataset.schema.yaml · apiVersion: lab/v1
A dataset manifest lists benchmark inputs: prompts, multi-turn dialogue, memory
seeds, and HITL fixtures. Each experiment pairs a dataset with scenarios and
run.n_runs; every item is executed for every scenario.
Terms: glossary.md · Experiment wiring: experiment.md
run = (mas_config, flavour, memory_state, user_query, turns)
The dataset declares memory_state, user_query, and turns per item.
mas_config and flavour come from the experiment YAML.
Table of contents¶
- Manifest format
- Item fields reference
- Memory seeds — the memory state slot
- Multi-turn and HITL
- Session continuity
- Experiment-level seeds
- Seed merge order
- Path references
- Full example
1. Manifest format¶
A dataset file is a YAML document. Two formats are accepted:
Declarative manifest (preferred):
apiVersion: lab/v1
kind: Dataset
metadata:
name: my-dataset
version: "1.0"
description: "Trip planning evaluation items."
spec:
items:
- id: "001"
prompt: "Plan a three-day trip to Paris."
Flat dict (shorthand, no kind/apiVersion required):
name: my-dataset
items:
- id: "001"
prompt: "Plan a three-day trip to Paris."
The name field defaults to the file stem when omitted.
2. Item fields reference¶
| Field | Type | Required | Description |
|---|---|---|---|
id |
string | yes | Unique identifier within the dataset. Used in output paths, CSV rows, and the content-addressed run hash. |
prompt |
string | yes | The initial user message sent to the MAS entry agent. |
expected_answer |
string | no | Reference answer used by evaluation metrics (LLM judge, contains, regex_match, etc.). |
category |
string | no | Logical grouping for filtering with dataset.filter or dataset.group. |
turns |
list | no | Additional conversation turns after the initial prompt. See §4. |
session_id |
string | no | Fixed conversation identifier. See §5. |
memory_seeds |
list or path | no | Initial memory state injected before the run. See §3. |
metadata |
any extra keys | no | All other keys are collected into metadata and available for filter expressions. |
3. Memory seeds — the memory state slot¶
What is a seed?¶
A seed is a document pre-loaded into one or more agent memory backends before the first token is generated. It models the prior state of the world from the agents' perspective: things they would already know before the conversation starts.
This is conceptually the second slot of the input tuple:
run_input = (user_query, memory_state)
Without seeds the memory backend starts empty on every run (the default for reproducible benchmarks). Seeds let you test agents against specific pre-existing knowledge — a user's history, a shared catalogue, a background fact — as a first-class, versioned, reproducible input.
Seed fields¶
Each seed is a dict with:
| Field | Type | Required | Description |
|---|---|---|---|
source |
string | yes | A logical label for the document's origin (e.g. "user_history", "product_catalog", "policy"). Used for logging, deduplication, and content-addressing the run hash. See note on shadowing below. |
content |
string | yes | The text indexed into the memory backend. The backend embeds this text and makes it retrievable by semantic search. |
target_agent |
string | no | ID of the agent whose memory backend receives this seed. When absent the seed is delivered to all agents. |
metadata |
dict | no | Arbitrary key-value pairs stored alongside the document. Passed verbatim to backend.index_document(). Useful for post-retrieval filtering (e.g. {category: "preference", user: "alice"}). |
What does source mean?¶
source is not a file path. It is a human-readable provenance label that
identifies where a piece of knowledge comes from. Examples:
source: "user_history" # knowledge retrieved from a user activity log
source: "product_catalog" # knowledge from the product database
source: "travel_policy" # company travel rules
source: "operator_briefing" # pre-run context injected by an operator
The source label is stored in the memory document's metadata and surfaced
in traces, so you can reason about why an agent retrieved a specific piece
of knowledge.
Targeting: per-agent vs. global¶
memory_seeds:
# Per-agent: only the concierge's memory receives this
- source: "user_preferences"
content: "Alice prefers window seats and vegetarian meals."
target_agent: "concierge_agent"
# Global: ALL agents receive this (shared knowledge)
- source: "shared_policy"
content: "Company travel policy: economy class for flights under 3h."
Use target_agent when the knowledge is private to one agent (its own
episodic memory, its own user profile, etc.). Omit it for facts every agent
should know — a shared catalogue, a company policy, a world-state snapshot.
Inline list vs. path reference¶
Seeds can be declared inline or loaded from a separate YAML file:
# Inline
memory_seeds:
- source: "user_history"
content: "Alice visited Paris in March."
# Path reference (resolved relative to the dataset file)
memory_seeds: "./seeds/user_alice.yaml"
The seed file may be a bare list or a dict with a seeds, items, or
memory_seeds key:
# seeds/user_alice.yaml — bare list
- source: "user_history"
content: "Alice visited Paris in March."
- source: "user_preferences"
content: "Alice prefers boutique hotels."
Path references are useful when:
- Multiple dataset items share a large seed file (keep it DRY).
- Seeds are generated by a pipeline step and written to a file.
- The seed corpus is large enough to deserve its own version history.
4. Multi-turn and HITL¶
When a dataset item has a turns list, the benchmark runner keeps the same
MasRuntime instance alive across all turns, preserving session context
(conversation history, memory state).
- id: "multi-turn-001"
prompt: "Book a flight to Tokyo." # turn 0 — sent via run_once()
turns:
- role: user
content: "Actually, make it business class." # turn 1
- role: hitl
content: "Operator: budget approved up to €3,000." # HITL injection
- role: user
content: "Add a hotel in Shinjuku for 5 nights." # turn 3
Role semantics¶
| Role | Behaviour |
|---|---|
user |
Sent as a regular user prompt via rt.run(). The agent processes it and the LLM is invoked. |
hitl |
Human-In-The-Loop injection. Simulates an operator stepping in. The content is injected into the session and the agent responds to it — useful for testing approval flows, corrections, and escalation paths. |
Both roles advance the conversation; the distinction is semantic and surfaced in traces so you can distinguish agent-driven turns from operator-driven ones.
Memory + multi-turn¶
Memory seeds are injected before turn 0 (the initial prompt). The
agents' memory backends are populated, then the conversation starts. Any
memory writes that happen during the conversation accumulate on top of the
seeds — this is the expected behaviour for testing stateful agents.
5. Session continuity¶
By default the runner generates a fresh UUID session_id for every run. Set
session_id explicitly to:
- Replay a known conversation in exactly the same session slot.
- Test session-aware memory (e.g.
FileSessionStore,MemoryContextPluginkeyed onconversation_id) with deterministic inputs.
- id: "session-replay-001"
prompt: "Continue our last conversation."
session_id: "abc-1234-deterministic"
Note: a fixed session_id does not affect the content-addressed run hash —
the hash covers the inputs to the MAS, not internal session bookkeeping.
6. Experiment-level seeds¶
Seeds declared in the experiment YAML under memory_seeds are applied to
every run in the benchmark, regardless of the dataset item. They model
background knowledge shared across all test cases — a product catalogue, a
company policy, a world model.
# experiment.yaml
name: trip_planner_eval
memory_seeds:
- source: "product_catalog"
content: "Arborian Network schedule: Paris–London 08:00, 14:00, 20:00."
target_agent: "transport_agent"
- source: "company_policy"
content: "All bookings require manager approval above €2,000."
# no target_agent → all agents
dataset:
path: "./datasets/trip_queries.yaml"
The same inline-or-path syntax is supported:
memory_seeds: "./seeds/baseline_context.yaml"
7. Seed merge order¶
When both experiment-level and item-level seeds are present they are merged into a single list before injection:
effective_seeds = experiment_seeds + item_seeds
Experiment seeds come first. This means item seeds are indexed into the memory backend after experiment seeds. If a semantic search is run immediately after seeding, item-specific seeds appear later in the indexing order and will surface at higher relevance when their content is more specific (standard embedding distance behaviour).
Why not deduplicate by source? Deduplication by source would require
choosing one entry to keep, which implies a precedence rule that is opaque in
the YAML. Instead the merge is intentionally additive: both documents are
indexed. If you need to replace an experiment-level seed for a specific
item, use a different source name in the item seed (e.g. "policy_override"
instead of "policy").
The merged seed list is included in the content-addressed run hash. Two runs that differ only in their seeds produce different hashes and are cached independently.
8. Path references¶
All relative paths in a dataset file are resolved relative to the dataset file itself, not the experiment YAML or the working directory. This makes dataset files portable: you can move an experiment directory without breaking relative seed paths.
labs/my-experiment/
experiment.yaml
datasets/
trip_queries.yaml ← relative paths resolved from here
seeds/
user_alice.yaml
user_bob.yaml
In trip_queries.yaml:
- id: "alice"
prompt: "Plan my trip."
memory_seeds: "./seeds/user_alice.yaml" # relative to datasets/
Experiment-level seeds in experiment.yaml are resolved relative to the
experiment YAML file (i.e. labs/my-experiment/).
9. Full example¶
# labs/trip-planner-eval/datasets/trip_queries.yaml
apiVersion: lab/v1
kind: Dataset
metadata:
name: trip-planner-queries
version: "1.0"
description: >
Mixed single-turn, multi-turn, and HITL scenarios for the trip planner MAS.
spec:
items:
# ── Single-turn, no memory context ────────────────────────────────
- id: "cold-start-001"
prompt: "What are the cheapest flights from Paris to London this week?"
category: cold-start
expected_answer: "economy"
# ── Single-turn with inline memory seeds ─────────────────────────
- id: "warm-alice-001"
prompt: "Book my usual route."
category: personalised
memory_seeds:
- source: "user_preferences"
content: "Alice always travels Paris→London, prefers 08:00 departure."
target_agent: "concierge_agent"
metadata: {user: "alice", type: "preference"}
- source: "loyalty_status"
content: "Alice: Gold tier, eligible for lounge access."
# no target_agent → all agents receive this
# ── Single-turn with seed file reference ─────────────────────────
- id: "warm-bob-001"
prompt: "Book my usual route."
category: personalised
memory_seeds: "./seeds/user_bob.yaml"
# ── Multi-turn conversation ───────────────────────────────────────
- id: "multi-turn-001"
prompt: "I need to go to Tokyo next month."
category: multi-turn
turns:
- role: user
content: "Make it business class."
- role: user
content: "Add a hotel in Shinjuku for 5 nights."
- role: user
content: "What is the total cost?"
# ── Multi-turn with HITL and memory seeds ────────────────────────
- id: "hitl-approval-001"
prompt: "Book a flight to Singapore for the team offsite."
category: hitl
memory_seeds:
- source: "team_roster"
content: "Team: Alice (Paris), Bob (London), Carol (Amsterdam). 5 people total."
target_agent: "booking_agent"
- source: "budget_policy"
content: "Team offsites: approved up to €500 per person."
turns:
- role: user
content: "We also need hotel rooms for 3 nights."
- role: hitl
content: "Operator: budget exception approved — up to €800 per person for this trip."
- role: user
content: "Great, proceed with the booking."
session_id: "offsite-2026-q3" # deterministic session for replay
# ── Fixed session for memory continuity testing ──────────────────
- id: "session-recall-001"
prompt: "What did we discuss last time?"
category: session-memory
session_id: "test-session-recall-fixed"
Companion experiment YAML:
# labs/trip-planner-eval/experiment.yaml
name: trip-planner-eval
mas:
manifest: app: trip-planner # resolves mas.yaml via mas.apps
# Seeds shared by every run — background world knowledge
memory_seeds:
- source: "arborian_schedule"
content: "Paris–London: 08:00, 14:00, 20:00. Paris–Tokyo: 11:30, 23:00."
target_agent: "transport_agent"
- source: "pricing_baseline"
content: "Economy fares: Paris–London €89, Paris–Tokyo €640."
dataset:
path: "./datasets/trip_queries.yaml"
scenarios:
- id: baseline
overlay: overlays/baseline.yaml
- id: with-memory
overlay: overlays/with-memory.yaml
Related documentation¶
- experiment.md — how datasets attach to experiments
- overlay.md — scenario-level memory seeds and params
- Tutorial 3 — hands-on benchmarks