
Physical AI Agents: VLM/LLM-Powered Systems for Robot Planning and Code Generation

SUMMARY

In the Telekinesis ecosystem, Physical AI Agents are VLM/LLM-powered systems that convert natural language instructions into executable code for robotics and physical AI systems using the Telekinesis Skill Library.

They leverage the prompt, chat history, and available skills as context to synthesize code-as-policy that governs perception, decision-making, and real-world actions.

What are Physical AI Agents?

Physical AI Agents are systems that:

  • Interpret natural language instructions
  • Reason about tasks using Vision-Language Models (VLMs) or Large Language Models (LLMs)
  • Generate executable Python code that leverages the Telekinesis Skill Library

They do not directly control robots. Instead, they synthesize code that can be inspected, validated, and executed via the Telekinesis Agentic Skill Library.

This paradigm is known as Code-as-Policy.

In this approach, the agent expresses its decisions and behaviors as executable code rather than issuing direct low-level commands. This makes the system more transparent, auditable, and adaptable, as the generated code can be reviewed, modified, and safely executed before affecting real-world systems.
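
The review-before-execute loop described above can be sketched in a few lines. Everything here is illustrative: `generate_code` stands in for the LLM/VLM call and `review` for human or automated validation; neither is part of the Telekinesis API.

```python
def generate_code(instruction: str) -> str:
    # Stand-in for the LLM/VLM call; returns a canned program here.
    return 'print("Moving gripper to home pose")'

def review(code: str) -> bool:
    # Stand-in for human or automated validation of generated code,
    # e.g. rejecting programs that touch the file system.
    return "import os" not in code

code = generate_code("Return the arm to its home pose")
if review(code):
    exec(code)  # executed only after the policy has been inspected
else:
    raise ValueError("Generated code failed validation")
```

Because the policy is an ordinary program, the validation step can be as simple as a string check or as thorough as a human code review.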

Mental Model

A Physical AI Agent can be understood as a pipeline:

User Instruction → LLM / VLM Reasoning → Skill Selection & Planning → Generated Python Code → Telekinesis Skills Execution

This design shifts robotics development from end-to-end learned control policies to a code-as-policy execution model, where behavior is explicitly represented and inspectable.
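
The stages of this pipeline can be sketched as a chain of explicit functions. All function names and return values below are illustrative stand-ins, not part of the Telekinesis API.

```python
def interpret(instruction: str) -> dict:
    # LLM/VLM reasoning: turn free-form text into a task description.
    return {"task": "capture_image", "device": "webcam"}

def plan(task: dict) -> list[str]:
    # Skill selection & planning: pick skills from the library.
    return ["connect_webcam", "capture_frame", "disconnect_webcam"]

def synthesize(skills: list[str]) -> str:
    # Code generation: emit a Python program that calls the skills.
    return "\n".join(f"{skill}()" for skill in skills)

program = synthesize(plan(interpret("Capture an image with the webcam")))
print(program)
```

The point of the sketch is that each stage produces an inspectable artifact, with the final one being plain Python source rather than opaque motor commands.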

Design Principles

This separation introduces a more transparent and production-oriented paradigm for robotics systems, with the following key properties:

  1. Safety: Generated code is reviewed and validated before execution, enabling human oversight and reducing risks associated with opaque end-to-end policies.
  2. Debuggability: System behavior is represented as explicit programs, making failures traceable at the logic level rather than requiring interpretation of internal model states.
  3. Composability: Complex behaviors are built by combining reusable Telekinesis Skills, enabling modular system design instead of monolithic policies.
  4. Controlled Generalization: Vision-Language-Action (VLA) models can be integrated as Skills, preserving their generalization capabilities while enabling higher-level orchestration and constraints.
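
Composability (principle 3) is worth illustrating: a higher-level behavior is just a function over lower-level skills. The skill functions below are hypothetical placeholders, not real Telekinesis Skills.

```python
def detect_object(name: str) -> dict:
    # Placeholder perception skill: returns a detected pose.
    return {"name": name, "pose": (0.3, 0.1, 0.02)}

def pick(pose: tuple) -> str:
    # Placeholder action skill.
    return f"picked at {pose}"

def place(pose: tuple) -> str:
    # Placeholder action skill.
    return f"placed at {pose}"

def pick_and_place(obj_name: str, target_pose: tuple) -> str:
    # A composed skill: explicit, inspectable control flow built
    # from reusable primitives rather than a monolithic policy.
    obj = detect_object(obj_name)
    pick(obj["pose"])
    return place(target_pose)

result = pick_and_place("cup", (0.5, 0.0, 0.02))
```

Because composition happens in ordinary code, the composed behavior inherits the safety and debuggability properties of its parts.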

Example

Input:

Capture an image with the webcam

Output:

```python
import cv2

from telekinesis.medulla import cameras

# Step 1: Instantiate the Webcam with a unique name and default camera ID (0)
webcam = cameras.webcam.Webcam(name="webcam_0", camera_id=0)

# Step 2: Connect to the webcam
connected = webcam.connect()
if not connected:
    raise RuntimeError("Failed to connect to the webcam.")

# Step 3: Capture a single color frame (returns an RGB numpy array)
frame = webcam.capture_single_color_frame()

if frame is None:
    raise RuntimeError("Failed to capture a frame from the webcam.")

# Step 4: Disconnect from the webcam
webcam.disconnect()

# Step 5: Save the captured image to disk (convert RGB -> BGR for OpenCV)
output_path = "captured_image.png"
frame_bgr = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)
cv2.imwrite(output_path, frame_bgr)

print(f"Image captured and saved to: {output_path}")
print(f"Image shape: {frame.shape}, dtype: {frame.dtype}")
```

Tzara: The First Physical AI Agent

The first Physical AI Agent in Telekinesis is Tzara. Tzara is a general-purpose coding agent that translates natural language instructions into executable Telekinesis Skill code.

It supports two modes:

Chat Mode

  • Interactive coding assistant
  • Iterative refinement of instructions
  • Debugging and explanation of generated code

Code Generation Mode

  • Direct conversion of prompts into Python code
  • Designed for fast pipeline generation

Tzara is available through a native Visual Studio Code extension, enabling direct integration into development workflows.

How Physical AI Agents Fit in the Telekinesis Ecosystem

Physical AI Agents sit on top of the Telekinesis Skills:

  • Skills: low-level perception and action primitives
  • Agents (Tzara): code generation layer over Skills

They do not replace Skills — they compose them into usable programs.

Key Capabilities

Physical AI Agents can:

  • Generate robotics control code from natural language
  • Compose multiple Telekinesis Skills into pipelines
  • Adapt generated code to different environments
  • Support iterative refinement through chat-based interaction

Limitations

Physical AI Agents:

  • Do not control robots directly (currently, for safety reasons)
  • Require human validation of generated code
  • Depend on available Skills in the Telekinesis Agentic Skill Library
  • May produce incorrect or incomplete code without proper context

Physical AI Agents vs. Vision-Language-Action (VLA) Models (Optional)

Tzara aligns with recent research by Cap-X showing that today's off-the-shelf Language Models (LMs) have strong generalization, reasoning, and planning capabilities and can outperform Vision-Language-Action (VLA) models. In particular, the authors find that:

  1. Frontier models achieve meaningful zero-shot success on robotic manipulation: Without any task-specific training, today's best frontier models can directly generate executable robot control code and achieve over 30% average success, a sharp contrast to the prior belief that only specially trained models (VLAs) can perform manipulation. Yet a 56-point gap to human performance remains, marking this as one of AI's most important open challenges.
  2. Training-free agents outperform state-of-the-art VLAs on perturbed tasks: On LIBERO-PRO (30 manipulation tasks with position and instruction perturbations), state-of-the-art Vision-Language-Action models (OpenVLA, π0) score 0% across the board. Even the best VLA (π0.5) reaches only 13% average success. In contrast, a training-free coding agent achieves 18% without any task-specific training, demonstrating that code-generation agents generalize where end-to-end learned policies break down.
  3. Higher abstraction boosts all models and dramatically closes the gap for smaller ones: As API abstraction increases from raw primitives (S4) to high-level pick-and-place (S1), all models improve substantially, but the gains are most pronounced for weaker and open-source models, whose compilation rates collapse at low abstraction levels. This suggests a promising path: pair a lightweight LM for high-level planning with a visual-motor policy (e.g., a VLA) that handles low-level control, letting even smaller models achieve strong task performance through the right division of labor.
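
The division of labor in point 3 can be sketched by wrapping a low-level visuomotor policy behind a high-level skill interface. The class and method names below are illustrative assumptions, not a real Telekinesis or VLA API.

```python
class VLASkill:
    """Expose an end-to-end learned policy behind a high-level interface."""

    def __init__(self, policy):
        # `policy` is the learned visuomotor model; here it is any
        # callable taking an instruction string.
        self.policy = policy

    def pick_and_place(self, object_name: str, target: str) -> bool:
        # The LM planner only issues this high-level call; the wrapped
        # policy handles the low-level control loop internally.
        return self.policy(f"pick up the {object_name} and place it on the {target}")

# Usage with a dummy policy standing in for a real model:
skill = VLASkill(policy=lambda instruction: True)
ok = skill.pick_and_place("cup", "tray")
```

In this arrangement the LM operates at the high-abstraction level (S1 in Cap-X's terms) where even smaller models compile reliably, while generalization at the motor level is delegated to the wrapped policy.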

Next Steps

To begin using Physical AI Agents:

  • Install and explore Tzara
  • Learn how to generate code from natural language instructions
  • Build your first robotics pipeline using Telekinesis Skills