ASTRA — Authorization with Semantic Task-based Restricted Access

Outshift by Cisco
* Equal Contribution, Shared Senior Authorship

Delegated authorization with semantic task-to-scope matching counters attempts to access additional protected resources.

Abstract

Authorizing Large Language Model-driven agents to dynamically invoke tools and access protected resources introduces significant risks: current methods for delegating authorization grant overly broad permissions and give access to tools that allow agents to operate beyond the intended task scope. In our paper, we introduced and assessed a delegated authorization model that enables authorization servers to semantically inspect access requests to protected resources and to issue access tokens constrained to the minimal set of scopes necessary for the agents' assigned tasks.

Given the unavailability of datasets centered on delegated authorization flows, in particular datasets that include both semantically appropriate and inappropriate scope requests for a given task, we introduce ASTRA (Authorization with Semantic Task-based Restricted Access), a dataset for benchmarking semantic matching between tasks and scopes. Using this dataset, our experiments show both the potential and the current limitations of model-based matching, particularly as the number of scopes needed for task completion increases. Our results highlight the need for further research into semantic matching techniques that enable intent-aware authorization for multi-agent and tool-augmented applications, including fine-grained control such as Task-Based Access Control (TBAC).

The ASTRA data repository contains an open-source dataset for task-tool matching in the context of delegated authorization flows, as described in our paper. The core data resides in the data/ directory, which is organized by task complexity: 01_tool, 02_tools, and 03_tools contain datasets for tasks requiring one, two, or three tools, respectively. Each of these directories is further split into ASTRA (our generated data) and TOUCAN (processed TOUCAN data). The ASTRA folders hold files for the generated tasks and for the validation and test splits; the TOUCAN folders hold the processed tasks and test data. The mcp_servers/ folder holds the MCP Server configuration files used in data generation, separated into ASTRA and TOUCAN sources, with one JSON file per server.
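The layout described above can be walked programmatically. The following is a minimal sketch, assuming only the directory names given in the text (data/, 01_tool, 02_tools, 03_tools, ASTRA, TOUCAN). The individual file names are not stated here, so the helper simply enumerates whatever files it finds; the function name `list_dataset_files` is hypothetical.

```python
from pathlib import Path

# Directory names taken from the description above. File names inside each
# source folder are not specified there, so they are only enumerated.
COMPLEXITY_DIRS = {1: "01_tool", 2: "02_tools", 3: "03_tools"}
SOURCES = ("ASTRA", "TOUCAN")


def list_dataset_files(data_root):
    """Map (n_tools, source) to the sorted file names found in that folder.

    Folders that do not exist under data_root are skipped.
    """
    root = Path(data_root)
    index = {}
    for n_tools, dirname in COMPLEXITY_DIRS.items():
        for source in SOURCES:
            folder = root / dirname / source
            if folder.is_dir():
                index[(n_tools, source)] = sorted(
                    p.name for p in folder.iterdir() if p.is_file()
                )
    return index
```

Calling `list_dataset_files("data")` from the repository root would then return one entry per complexity level and source that is present on disk.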

Key Features

  • Synthetic Multi-Tool Tasks: Agentic tasks are generated using real-world MCP Servers (e.g., Wikipedia, GitHub) with sets of N tools (N in [1, 2, 3]), ensuring semantic coherence and realism.
  • Simulated Tool Matching: Includes both correct and simulated incorrect tool matches:
    • Wrong matches: Tools from the same MCP Server
    • Null matches: Tools from different MCP Servers
  • TOUCAN Data Integration: Curated and pre-processed subset of the TOUCAN dataset for direct comparison, with consistent formatting and quality controls.
  • Comprehensive Metadata: All tool names, descriptions (with arguments removed), and server metadata are included.
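The distinction between wrong and null matches can be illustrated with a short sketch. It assumes a registry shaped as a dictionary from server name to tool names; that shape, the function name `sample_matches`, and the tool names in the usage note are illustrative and are not taken from the dataset's generation code.

```python
import random


def sample_matches(registry, server, correct_tool, rng):
    """Return one correct, one wrong, and one null candidate for a task.

    registry: dict mapping server name -> list of tool names (assumed shape).
    A wrong match is a different tool from the same server; a null match
    is a tool from a different server. Either is None if no candidate exists.
    """
    wrong_pool = [t for t in registry[server] if t != correct_tool]
    null_pool = [
        (other, tool)
        for other, tools in registry.items()
        if other != server
        for tool in tools
    ]
    return {
        "correct": (server, correct_tool),
        "wrong": (server, rng.choice(wrong_pool)) if wrong_pool else None,
        "null": rng.choice(null_pool) if null_pool else None,
    }
```

For example, with `registry = {"wikipedia": ["search", "get_page"], "github": ["create_issue"]}`, calling `sample_matches(registry, "wikipedia", "search", random.Random(0))` yields a wrong match drawn from the remaining Wikipedia tools and a null match drawn from the GitHub tools.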

Data Overview

  • Enterprise MCP Servers: 12 high-quality, English-only servers, each providing between 10 and 90 tools.
  • Synthetic Tasks: 352 × 3 = 1,056 tasks per N in [1, 2, 3] for our dataset; 1,056 processed tasks per N for TOUCAN.
  • Validation Ready: Processed, de-duplicated, and filtered for high data quality.

Evaluation

We evaluated two task-tool matching approaches on the ASTRA dataset: the Semantic Similarity Matcher (SemSimM) and the LLM Reasoning Matcher (LLM-ResM). SemSimM uses language model embeddings to compare an idealized tool description, generated for the task, with the descriptions of available tools, selecting the most semantically similar option if it exceeds a similarity threshold. While effective, this method can struggle with large tool registries and tasks needing multiple tools, as it assesses each tool in isolation. In contrast, LLM-ResM employs a language model to directly reason about the suitability of a requested tool for a given task, using only the task context and the tool's name and description. This reasoning-based approach is more scalable and adaptable, as it does not depend on the complete set of available tools and can capture finer contextual nuances through targeted prompting. We tested both methods on the ASTRA dataset to assess their effectiveness and limitations.
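The thresholded selection step of a similarity-based matcher can be sketched as follows. This is not the paper's implementation: the bag-of-words `embed` function is a toy stand-in for language-model embeddings, and the function names, the default threshold of 0.5, and the example descriptions are all assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a language-model embedding: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semsim_match(ideal_description, tools, threshold=0.5):
    """Return the most similar tool name, or None if no score exceeds threshold.

    ideal_description: idealized tool description generated for the task.
    tools: dict mapping tool name -> tool description.
    Each tool is scored in isolation against the idealized description.
    """
    ideal = embed(ideal_description)
    best_name, best_score = None, -1.0
    for name, description in tools.items():
        score = cosine(ideal, embed(description))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score > threshold else None
```

Because every tool is scored separately and only the single best candidate is returned, a matcher of this shape has no natural way to approve several tools for one task, which is consistent with the limitation noted above.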

Results

For single-tool tasks, LLM-ResM consistently outperformed SemSimM on both the generated and public TOUCAN datasets, achieving higher accuracy, recall, and F1 scores. SemSimM, while precise, exhibited low recall, often failing to recognize valid tool requests.

In multi-tool scenarios, only LLM-ResM was evaluated, matching each tool request within a task independently. The results showed that as the number of required tools increased, the challenge of correct authorization also grew, primarily due to a rise in false negatives (under-scoping), especially for three-tool tasks. Notably, recall was higher on the TOUCAN dataset in complex tasks, likely due to more explicit tool usage patterns compared to the implicit cues in the generated data.
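Independent per-request matching can be sketched as a loop over the requested tools, where each decision sees only the task and one tool's name and description. In the sketch below, `judge` is a hypothetical callable standing in for the language-model call, and `keyword_judge` is a deliberately naive placeholder used only to make the example runnable; neither reflects the prompts or models used in the paper.

```python
def authorize_requests(task, requested_tools, judge):
    """Decide each tool request independently and return the granted names.

    requested_tools: list of dicts with "name" and "description" keys.
    judge: callable (task, name, description) -> bool, a stand-in for an
    LLM-based suitability check (hypothetical interface).
    """
    granted = []
    for tool in requested_tools:
        if judge(task, tool["name"], tool["description"]):
            granted.append(tool["name"])
    return granted


def keyword_judge(task, name, description):
    """Placeholder judge: approve if a longer description word is in the task."""
    task_words = set(task.lower().split())
    return any(
        word in task_words
        for word in description.lower().split()
        if len(word) > 3
    )
```

Since each request is judged on its own, a single rejected tool in a three-tool task is enough to leave the task under-scoped, which is one way the rise in false negatives described above can arise.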

Overall, while both approaches demonstrated strengths, LLM-ResM proved more robust across varying task complexities, with the main challenge being the trade-off between minimizing over-scoping (granting unnecessary access) and under-scoping (insufficient access for task completion) as tasks became more complex.
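The trade-off between over-scoping and under-scoping maps directly onto false positives and false negatives. The following sketch computes the metrics named above from a list of per-request decisions; the function name and the input format are assumptions, not part of the released evaluation code.

```python
def scoping_metrics(decisions):
    """Compute accuracy, precision, recall, and F1 from per-request decisions.

    decisions: iterable of (should_grant, was_granted) boolean pairs.
    False positives correspond to over-scoping (unnecessary access granted);
    false negatives correspond to under-scoping (needed access denied).
    """
    tp = fp = fn = tn = 0
    for should_grant, was_granted in decisions:
        if should_grant and was_granted:
            tp += 1
        elif was_granted:
            fp += 1  # over-scoping
        elif should_grant:
            fn += 1  # under-scoping
        else:
            tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "over_scoped": fp,
        "under_scoped": fn,
    }
```

A matcher that is precise but has low recall, as reported for SemSimM on single-tool tasks, would show few over-scoped requests and many under-scoped ones in this breakdown.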

Collaboration and Context

ASTRA is part of Cisco’s broader research on Zero Trust Agency (ZTA), that is, fine-grained, intent-aware delegated authorization for agentic applications. The work is developed within Outshift by Cisco, the company’s incubation and innovation arm.

This work also draws inspiration from, and involves collaboration with, the Linux Foundation AGNTCY project, which is building open infrastructure for the “Internet of Agents,” including identity services, verifiable credentials, and Tool-Based Access Control (TBAC).

Key contributions from these collaborations include:

  • Identity and Verifiable Credential Frameworks for agent authentication.
  • An open-source reference implementation of TBAC (Tool-Based Access Control) – serving as a precursor to Task-Based Access Control – available via the Linux Foundation AGNTCY GitHub.
  • Real-world MCP Server configurations sourced and maintained through industry and research partnerships.

We hope that the ASTRA dataset will serve as a valuable resource for future research in semantic task-tool matching, particularly in the context of delegated authorization. If you make use of this dataset in your research, please cite our paper:

BibTeX

@misc{helou2025delegatedauthorizationagentsconstrained,
      title={Delegated Authorization for Agents Constrained to Semantic Task-to-Scope Matching}, 
      author={Majed El Helou and Chiara Troiani and Benjamin Ryder and Jean Diaconu and Hervé Muyal and Marcelo Yannuzzi},
      year={2025},
      eprint={2510.26702},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.26702}, 
}