Abstract
Authorizing Large Language Model-driven agents to dynamically invoke tools and access protected resources introduces significant risks: current methods for delegating authorization grant overly broad permissions, giving agents access to tools that allow them to operate beyond the intended task scope. In our paper, we introduce and assess a delegated authorization model that enables authorization servers to semantically inspect access requests to protected resources and issue access tokens constrained to the minimal set of scopes necessary for the agents' assigned tasks.
Given the lack of datasets centered on delegated authorization flows, particularly ones that include both semantically appropriate and inappropriate scope requests for a given task, we introduce ASTRA (Authorization with Semantic Task-based Restricted Access), a dataset for benchmarking semantic matching between tasks and scopes. Our experiments on this dataset show both the potential and the current limitations of model-based matching, particularly as the number of scopes needed for task completion increases. The results highlight the need for further research into semantic matching techniques that enable intent-aware authorization for multi-agent and tool-augmented applications, including fine-grained controls such as Task-Based Access Control (TBAC).
The ASTRA data repository contains an open-source dataset for task-tool matching in the context of delegated authorization flows, as described in our paper. The core data resides in the `data/` directory, which is organized by task complexity: `01_tool`, `02_tools`, and `03_tools` contain datasets for tasks requiring one, two, or three tools, respectively. Each of these directories is further split into `ASTRA` (our generated data) and `TOUCAN` (processed TOUCAN data), with files for generated tasks, validation, and test splits, or processed tasks and test data, respectively. The `mcp_servers/` folder holds the MCP Server configuration files used in data generation, separated into `ASTRA` and `TOUCAN` sources, with a JSON file for each server.
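Given the layout above, a small helper can build the directory path for a given task complexity and source. This is a sketch only: the directory names (`01_tool`, `02_tools`, `03_tools`, `ASTRA`, `TOUCAN`) follow the description above, but the filenames inside each subdirectory are not specified here and may differ.

```python
from pathlib import Path

def split_dir(root: str, n_tools: int, source: str) -> Path:
    """Build the directory path for a given task complexity and source.

    `source` is "ASTRA" or "TOUCAN"; directory names follow the
    repository layout described above (01_tool, 02_tools, 03_tools).
    """
    suffix = "tool" if n_tools == 1 else "tools"
    return Path(root) / "data" / f"{n_tools:02d}_{suffix}" / source

print(split_dir(".", 1, "ASTRA"))   # e.g. data/01_tool/ASTRA on POSIX
print(split_dir(".", 3, "TOUCAN"))
```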
Key Features
- **Synthetic Multi-Tool Tasks**: Agentic tasks are generated using real-world MCP Servers (e.g., Wikipedia, GitHub) with sets of N tools (N in [1, 2, 3]), ensuring semantic coherence and realism.
- **Simulated Tool Matching**: Includes both correct and simulated incorrect tool matches:
  - Wrong matches: tools from the same MCP Server
  - Null matches: tools from different MCP Servers
- **TOUCAN Data Integration**: Curated and pre-processed subset of the TOUCAN dataset for direct comparison, with consistent formatting and quality controls.
- **Comprehensive Metadata**: All tool names, descriptions (with arguments removed), and server metadata are included.
Data Overview
- **Enterprise MCP Servers**: 12 high-quality, English-only servers, each covering between 10 and 90 tools.
- **Synthetic Tasks**: 352 × 3 tasks per N in [1, 2, 3] for our dataset; 1,056 processed tasks per N for TOUCAN.
- **Validation Ready**: Processed, de-duplicated, and filtered for high data quality.
Evaluation
We evaluated two task-tool matching approaches on the ASTRA dataset: the Semantic Similarity Matcher (SemSimM) and the LLM Reasoning Matcher (LLM-ResM). SemSimM uses language model embeddings to compare an idealized tool description, generated for the task, with the descriptions of available tools, selecting the most semantically similar option if it exceeds a similarity threshold. While effective, this method can struggle with large tool registries and with tasks needing multiple tools, as it assesses each tool in isolation. In contrast, LLM-ResM employs a language model to directly reason about the suitability of a requested tool for a given task, using only the task context and the tool's name and description. This reasoning-based approach is more scalable and adaptable, as it does not depend on the complete set of available tools and can capture finer contextual nuances through targeted prompting.
Figure: Semantic task-to-scope matching using SemSimM and LLM-ResM in the AuthZ server.
Figure: Trade-off between under-scoping and over-scoping for tasks requiring one, two, or three tools, across both our dataset and the TOUCAN dataset.
Figure: Proxied delegated authorization enabling trusted semantic matching between task and scope requests.
Results
For single-tool tasks, LLM-ResM consistently outperformed SemSimM on both the generated and public TOUCAN datasets, achieving higher accuracy, recall, and F1 scores. SemSimM, while precise, exhibited low recall, often failing to recognize valid tool requests.
In multi-tool scenarios, only LLM-ResM was evaluated, matching each tool request within a task independently. The results showed that as the number of required tools increased, the challenge of correct authorization also grew, primarily due to a rise in false negatives (under-scoping), especially for three-tool tasks. Notably, recall was higher on the TOUCAN dataset in complex tasks, likely due to more explicit tool usage patterns compared to the implicit cues in the generated data.
Overall, while both approaches demonstrated strengths, LLM-ResM proved more robust across varying task complexities, with the main challenge being the trade-off between minimizing over-scoping (granting unnecessary access) and under-scoping (insufficient access for task completion) as tasks became more complex.
Collaboration and Context
ASTRA is part of Cisco’s broader research on Zero Trust Agency (ZTA): fine-grained, intent-aware delegated authorization for agentic applications, developed within Outshift by Cisco, the company’s incubation and innovation arm.
This work also draws inspiration from, and involves collaboration with, the Linux Foundation AGNTCY project, which is building open infrastructure for the “Internet of Agents,” including identity services, verifiable credentials, and Tool-Based Access Control (TBAC).
Key contributions from these collaborations include:
- Identity and Verifiable Credential Frameworks for agent authentication.
- An open-source reference implementation of TBAC (Tool-Based Access Control) – serving as a precursor to Task-Based Access Control – available via the Linux Foundation AGNTCY GitHub.
- Real-world MCP Server configurations sourced and maintained through industry and research partnerships.
We hope that the ASTRA dataset will serve as a valuable resource for future research in semantic task-tool matching, particularly in the context of delegated authorization. If you make use of this dataset in your research, please cite our paper:
BibTeX
@misc{helou2025delegatedauthorizationagentsconstrained,
      title={Delegated Authorization for Agents Constrained to Semantic Task-to-Scope Matching}, 
      author={Majed El Helou and Chiara Troiani and Benjamin Ryder and Jean Diaconu and Hervé Muyal and Marcelo Yannuzzi},
      year={2025},
      eprint={2510.26702},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.26702}, 
}