
Context Model for Physical AI Agents

Physical AI Agents generate code based on the context available in the conversation and the Telekinesis Skill Library. This context determines what the agent understands about the task and directly influences the quality and correctness of the generated output.

The agent operates using a combination of:

  • Your natural language instruction (prompt)
  • Recent conversation history (within the current session only)
  • User context (persistent across sessions)
  • The Telekinesis Skill Library (available Skills)

The agent does not access file systems, external workspaces, or runtime environments. All reasoning and generation are constrained to the conversation, your user context, and the Skill Library.
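To make the difference between session-scoped and persistent context concrete, here is a minimal sketch. It is not how the agent is implemented: the `AgentContext` class, its field names, and the section labels are all hypothetical, chosen only to mirror the four sources listed above.

```python
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    """Hypothetical container for the four context sources listed above."""
    user_context: str                                  # persistent across sessions
    skills: list[str]                                  # Skill Library entries
    history: list[str] = field(default_factory=list)  # current session only

    def assemble(self, prompt: str) -> str:
        """Concatenate everything the agent can see for one request."""
        parts = [
            "## User context", self.user_context,
            "## Skills", *self.skills,
            "## History", *self.history,
            "## Prompt", prompt,
        ]
        return "\n".join(parts)

    def new_session(self) -> "AgentContext":
        """Start a fresh session: history is dropped, the rest carries over."""
        return AgentContext(self.user_context, self.skills)


# Illustrative values only.
ctx = AgentContext("Robot: UR10e", ["detect_objects(...)"])
ctx.history.append("user: find the bolts")

fresh = ctx.new_session()
print(fresh.history)       # history does not survive the session boundary
print(fresh.user_context)  # user context does
```

Nothing outside these four inputs appears in `assemble`, which is the point of the sketch: if a fact is not in one of them, the agent cannot use it.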

Skill Context

Physical AI Agents operate with access to the Telekinesis Skill Library. Each Skill provides structured information including:

  • Function name
  • Input parameters and types
  • Output parameters and types
  • Description of behavior
  • Usage examples

Skills define the action space of the agent and serve as the interface between natural language reasoning and executable robotics code.
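As an illustration, the structured information a Skill provides could be modeled as a small record. This is a sketch under assumptions, not the actual Skill Library schema: the `SkillSpec` class, its field names, the `render_for_prompt` helper, and the `detect_objects` Skill are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SkillSpec:
    """Hypothetical record mirroring the fields a Skill exposes to the agent."""
    name: str                       # function name
    inputs: dict[str, str]          # input parameter name -> type
    outputs: dict[str, str]         # output name -> type
    description: str                # description of behavior
    examples: tuple[str, ...] = ()  # usage examples


def render_for_prompt(skill: SkillSpec) -> str:
    """Format a Skill as text that could be placed in the agent's context."""
    args = ", ".join(f"{n}: {t}" for n, t in skill.inputs.items())
    rets = ", ".join(f"{n}: {t}" for n, t in skill.outputs.items())
    lines = [f"{skill.name}({args}) -> ({rets})", f"  {skill.description}"]
    lines += [f"  example: {ex}" for ex in skill.examples]
    return "\n".join(lines)


# Hypothetical Skill, invented for illustration only.
detect = SkillSpec(
    name="detect_objects",
    inputs={"image": "Image", "labels": "list[str]"},
    outputs={"boxes": "list[BoundingBox]"},
    description="Detect the requested object classes in an image.",
    examples=('boxes = detect_objects(image, ["bolt"])',),
)

print(render_for_prompt(detect))
```

Typed inputs and outputs are what let generated code chain Skills together: the output type of one call has to match the input type of the next.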

User Context

How to configure user context in Tzara

User context is persistent memory that is automatically included in every session. Configure it once and the agent will always have standing knowledge of your environment, preferences, and task constraints with no need to re-supply it each time.

User context can include:

  • Robot or hardware configuration (e.g. camera model, mount position, coordinate frames)
  • Workspace layout (e.g. table dimensions, object classes present, conveyor orientation)
  • Preferred output formats or coding conventions
  • Domain-specific constraints (e.g. speed limits, safe zones, preferred pick strategies)

How to Provide User Context

Pass user context as part of your initial prompt or as a preamble before your task instruction. Here is a real example:

My Hardware Setup:
- Robot: UR10e at 192.168.1.2
- Gripper: OnRobot RG2 at 192.168.1.1
- Camera: RealSense D435if (eye-in-hand)

My Calibration:
- flange_T_cam (camera extrinsics relative to flange):
  | 0.0325  -0.9995  -0.0018  0.0759 |
  | 0.9989   0.0326  -0.0334  0.0075 |
  | 0.0334  -0.0007   0.9994 -0.0338 |
  | 0.0000   0.0000   0.0000  1.0000 |
  Note: calibrated to flange so gripper can be swapped. Always use TCP for grasping

- tcp_T_flange (TCP offset, 160 mm along flange Z):
  | -1.0   0.0   0.0   0.000 |
  |  0.0  -1.0   0.0   0.000 |
  |  0.0   0.0   1.0  -0.160 |
  |  0.0   0.0   0.0   1.000 |

My Workspace:
- Observation pose [x, y, z, rx, ry, rz°]: [0.0, 0.65, 0.85, 180.0, 0.0, 90.0]

My Preferences:
- For vision: prefer specific models first, then open-vocabulary (e.g. Grounding DINO), then classical
- Always place hardcoded/tunable constants at the top of the generated file

The agent uses this context when selecting Skills, generating code, and making assumptions about the environment. Providing clear, structured user context is one of the most effective ways to improve output quality.
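The calibration matrices in the example are the kind of detail generated code depends on. The sketch below chains them to express a camera-frame point in the TCP frame. It assumes the common convention that `a_T_b` maps points expressed in frame `b` into frame `a`, and that translations are in meters; the helper functions and the sample point are hypothetical, and plain Python lists are used to avoid extra dependencies.

```python
# Matrices copied from the calibration example above (translations assumed in meters).
FLANGE_T_CAM = [
    [0.0325, -0.9995, -0.0018, 0.0759],
    [0.9989, 0.0326, -0.0334, 0.0075],
    [0.0334, -0.0007, 0.9994, -0.0338],
    [0.0, 0.0, 0.0, 1.0],
]
TCP_T_FLANGE = [
    [-1.0, 0.0, 0.0, 0.0],
    [0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -0.160],
    [0.0, 0.0, 0.0, 1.0],
]


def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform_point(t, p):
    """Apply homogeneous transform t to a 3D point p = (x, y, z)."""
    x, y, z = p
    v = (x, y, z, 1.0)
    return tuple(sum(t[i][k] * v[k] for k in range(4)) for i in range(3))


# Chain the frames: tcp <- flange <- camera.
TCP_T_CAM = matmul(TCP_T_FLANGE, FLANGE_T_CAM)

# Hypothetical detection: a point 0.30 m along the camera Z axis.
point_in_cam = (0.0, 0.0, 0.30)
point_in_tcp = transform_point(TCP_T_CAM, point_in_cam)
print([round(c, 4) for c in point_in_tcp])
```

Because the camera is calibrated to the flange rather than to the TCP, swapping the gripper only requires a new `tcp_T_flange`; the camera calibration stays valid.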

Scene Graph Context

Coming Soon

Scene graph support as user context is currently in development. It will allow you to describe your workspace as a structured scene graph that captures spatial relationships, object hierarchies, and environmental constraints, so Tzara can reason directly about the physical layout of your scene without it being re-described each session. Planned formats include USD/OpenUSD, JSON, RDF/OWL, and glTF.

Context Limitations

The agent operates within a bounded context window (the context length). When this limit is reached, older or less relevant information may be truncated.
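The effect of a bounded window can be sketched as follows. The actual truncation policy is not documented here, so this is only one plausible strategy: the `fit_to_budget` function is hypothetical, it drops the oldest messages first, and it uses a word count as a stand-in for a real token count.

```python
def fit_to_budget(messages: list[str], max_words: int) -> list[str]:
    """Keep the most recent messages whose combined word count fits max_words.

    Word count stands in for a real tokenizer; oldest messages are dropped first.
    """
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = len(msg.split())
        if used + cost > max_words:
            break                   # everything older is dropped as well
        kept.append(msg)
        used += cost
    kept.reverse()                  # restore chronological order
    return kept


# Illustrative session history.
history = [
    "user: the camera is mounted eye-in-hand",
    "agent: understood",
    "user: pick the red bolt from the tray",
]
print(fit_to_budget(history, max_words=10))
```

In this sketch the earliest message, which carries a hardware detail, is the first to go. If a long session depends on a fact stated early on, it may be worth restating it.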

The following are not available:

  • File system or workspace access
  • External runtime state
  • Hidden environment variables
  • Conversation history from previous sessions

Why the Context Window Matters

The agent’s outputs are determined entirely by explicit context (prompt, conversation history, user context, and Skills). This ensures predictable, inspectable behavior within a defined reasoning space.