
Physical AI Agents: VLM/LLM-Powered Systems for Robot Planning and Code Generation

SUMMARY

In the Telekinesis ecosystem, Physical AI Agents are VLM/LLM-powered systems that convert natural language instructions into executable code for robotics and physical AI systems using the Telekinesis Skill Library.

They leverage the prompt, chat history, and available skills as context to synthesize code-as-policy that governs perception, decision-making, and real-world actions.

What are Physical AI Agents?

Physical AI Agents are systems that:

  • Interpret natural language instructions
  • Reason about tasks using Vision-Language Models (VLMs) or Large Language Models (LLMs)
  • Generate executable Python code that leverages the Telekinesis Skill Library

They do not directly control robots. Instead, they synthesize code that can be inspected, validated, and executed via the Telekinesis Agentic Skill Library.

This paradigm is known as Code-as-Policy.

In this approach, the agent expresses its decisions and behaviors as executable code rather than issuing direct low-level commands. This makes the system more transparent, auditable, and adaptable, as the generated code can be reviewed, modified, and safely executed before affecting real-world systems.
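
The review-before-execute loop described above can be sketched in a few lines. Everything here is illustrative: `generate_code` stands in for the LLM/VLM call and `review` for human or automated validation; neither is part of the Telekinesis API.

```python
def generate_code(instruction: str) -> str:
    # Stand-in for the LLM/VLM call; returns a canned program here.
    return 'print("Moving gripper to home pose")'

def review(code: str) -> bool:
    # Stand-in for human or automated validation of generated code,
    # e.g. rejecting programs that touch the file system.
    return "import os" not in code

code = generate_code("Return the arm to its home pose")
if review(code):
    exec(code)  # executed only after the policy has been inspected
else:
    raise ValueError("Generated code failed validation")
```

Because the policy is an ordinary program, the validation step can be as simple as a string check or as thorough as a human code review.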

Mental Model

A Physical AI Agent can be understood as a pipeline:

User Instruction → LLM / VLM Reasoning → Skill Selection & Planning → Generated Python Code → Telekinesis Skills Execution

This design shifts robotics development from end-to-end learned control policies to a code-as-policy execution model, where behavior is explicitly represented and inspectable.
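
The stages of this pipeline can be sketched as a chain of explicit functions. All function names and return values below are illustrative stand-ins, not part of the Telekinesis API.

```python
def interpret(instruction: str) -> dict:
    # LLM/VLM reasoning: turn free-form text into a task description.
    return {"task": "capture_image", "device": "webcam"}

def plan(task: dict) -> list[str]:
    # Skill selection & planning: pick skills from the library.
    return ["connect_webcam", "capture_frame", "disconnect_webcam"]

def synthesize(skills: list[str]) -> str:
    # Code generation: emit a Python program that calls the skills.
    return "\n".join(f"{skill}()" for skill in skills)

program = synthesize(plan(interpret("Capture an image with the webcam")))
print(program)
```

The point of the sketch is that each stage produces an inspectable artifact, with the final one being plain Python source rather than opaque motor commands.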

Design Principles

This separation introduces a more transparent and production-oriented paradigm for robotics systems, with the following key properties:

  1. Safety: Generated code is reviewed and validated before execution, enabling human oversight and reducing risks associated with opaque end-to-end policies.
  2. Debuggability: System behavior is represented as explicit programs, making failures traceable at the logic level rather than requiring interpretation of internal model states.
  3. Composability: Complex behaviors are built by combining reusable Telekinesis Skills, enabling modular system design instead of monolithic policies.
  4. Controlled Generalization: Vision-Language-Action (VLA) models can be integrated as Skills, preserving their generalization capabilities while enabling higher-level orchestration and constraints.
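
Composability (principle 3) is worth illustrating: a higher-level behavior is just a function over lower-level skills. The skill functions below are hypothetical placeholders, not real Telekinesis Skills.

```python
def detect_object(name: str) -> dict:
    # Placeholder perception skill: returns a detected pose.
    return {"name": name, "pose": (0.3, 0.1, 0.02)}

def pick(pose: tuple) -> str:
    # Placeholder action skill.
    return f"picked at {pose}"

def place(pose: tuple) -> str:
    # Placeholder action skill.
    return f"placed at {pose}"

def pick_and_place(obj_name: str, target_pose: tuple) -> str:
    # A composed skill: explicit, inspectable control flow built
    # from reusable primitives rather than a monolithic policy.
    obj = detect_object(obj_name)
    pick(obj["pose"])
    return place(target_pose)

result = pick_and_place("cup", (0.5, 0.0, 0.02))
```

Because composition happens in ordinary code, the composed behavior inherits the safety and debuggability properties of its parts.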

Example

Input:

Capture an image with the webcam

Output:

```python
import cv2

from telekinesis.medulla import cameras

# Step 1: Instantiate the Webcam with a unique name and default camera ID (0)
webcam = cameras.webcam.Webcam(name="webcam_0", camera_id=0)

# Step 2: Connect to the webcam
connected = webcam.connect()
if not connected:
    raise RuntimeError("Failed to connect to the webcam.")

# Step 3: Capture a single color frame (returns an RGB numpy array)
frame = webcam.capture_single_color_frame()

if frame is None:
    raise RuntimeError("Failed to capture a frame from the webcam.")

# Step 4: Disconnect from the webcam
webcam.disconnect()

# Step 5: Save the captured image to disk (convert RGB -> BGR for OpenCV)
output_path = "captured_image.png"
frame_bgr = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)
cv2.imwrite(output_path, frame_bgr)

print(f"Image captured and saved to: {output_path}")
print(f"Image shape: {frame.shape}, dtype: {frame.dtype}")
```

Tzara: The First Physical AI Agent

The first Physical AI Agent in Telekinesis is Tzara. Tzara is a general-purpose coding agent that translates natural language instructions into executable Telekinesis Skill code.

It supports two modes:

Chat Mode

  • Interactive coding assistant
  • Iterative refinement of instructions
  • Debugging and explanation of generated code

Code Generation Mode

  • Direct conversion of prompts into Python code
  • Designed for fast pipeline generation

Tzara is available through a native Visual Studio Code extension, enabling direct integration into development workflows.

How Physical AI Agents Fit in the Telekinesis Ecosystem

Physical AI Agents sit on top of the Telekinesis Skills:

  • Skills: low-level perception and action primitives
  • Agents (Tzara): code generation layer over Skills

They do not replace Skills — they compose them into usable programs.

Key Capabilities

Physical AI Agents can:

  • Generate robotics control code from natural language
  • Compose multiple Telekinesis Skills into pipelines
  • Adapt generated code to different environments
  • Support iterative refinement through chat-based interaction

Limitations

Physical AI Agents:

  • Do not control robots directly (currently, for safety reasons)
  • Require human validation of generated code
  • Depend on available Skills in the Telekinesis Agentic Skill Library
  • May produce incorrect or incomplete code without proper context

Physical AI Agents vs. Vision-Language-Action (VLA) Models (Optional)

Tzara aligns with recent research by Cap-X showing that today's off-the-shelf Language Models (LMs) have strong generalization, reasoning, and planning capabilities and can outperform Vision-Language-Action (VLA) models. In particular, the authors find that:

  1. Frontier models achieve meaningful zero-shot success on robotic manipulation: Without any task-specific training, today's best frontier models can directly generate executable robot control code and achieve over 30% average success, a sharp contrast to the prior belief that only specially trained models (VLAs) can perform manipulation. Yet a 56-point gap to human performance remains, marking this as one of AI's most important open challenges.
  2. Training-free agents outperform state-of-the-art VLAs on perturbed tasks: On LIBERO-PRO (30 manipulation tasks with position and instruction perturbations), state-of-the-art Vision-Language-Action models (OpenVLA, π0) score 0% across the board. Even the best VLA (π0.5) reaches only 13% average success. In contrast, a training-free coding agent achieves 18% without any task-specific training, demonstrating that code-generation agents generalize where end-to-end learned policies break down.
  3. Higher abstraction boosts all models and dramatically closes the gap for smaller ones: As API abstraction increases from raw primitives (S4) to high-level pick-and-place (S1), all models improve substantially, but the gains are most pronounced for weaker and open-source models, whose compilation rates collapse at low abstraction levels. This suggests a promising path: pair a lightweight LM for high-level planning with a visual-motor policy (e.g., a VLA) that handles low-level control, letting even smaller models achieve strong task performance through the right division of labor.
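
The division of labor in point 3 can be sketched by wrapping a low-level visuomotor policy behind a high-level skill interface. The class and method names below are illustrative assumptions, not a real Telekinesis or VLA API.

```python
class VLASkill:
    """Expose an end-to-end learned policy behind a high-level interface."""

    def __init__(self, policy):
        # `policy` is the learned visuomotor model; here it is any
        # callable taking an instruction string.
        self.policy = policy

    def pick_and_place(self, object_name: str, target: str) -> bool:
        # The LM planner only issues this high-level call; the wrapped
        # policy handles the low-level control loop internally.
        return self.policy(f"pick up the {object_name} and place it on the {target}")

# Usage with a dummy policy standing in for a real model:
skill = VLASkill(policy=lambda instruction: True)
ok = skill.pick_and_place("cup", "tray")
```

In this arrangement the LM operates at the high-abstraction level (S1 in Cap-X's terms) where even smaller models compile reliably, while generalization at the motor level is delegated to the wrapped policy.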

Next Steps

To begin using Physical AI Agents:

  • Install and explore Tzara
  • Learn how to generate code from natural language instructions
  • Build your first robotics pipeline using Telekinesis Skills