2D Pose Estimation Using Qwen and SAM
SUMMARY
Estimate 2D object poses (centroid and extent) in an RGB image using a text prompt. Objects are detected with Qwen, segmented with SAM, and converted into centroid and bounding box estimates.
Overview
Many robotics and vision systems require a fast way to localize objects in an image before performing tasks such as grasping, tracking, or pose estimation.
This example demonstrates how to estimate 2D object poses (centroid and extent) from an RGB image using a natural language prompt. Given a prompt such as "purple tray" or "bottle", the pipeline detects matching objects and computes their centroid and bounding box in image coordinates.
The resulting outputs provide a lightweight representation of object location and size that can be used directly for downstream tasks or as an initialization step for 3D pose estimation.
Inputs
- RGB image path
- Text prompt describing the objects to detect
Required Telekinesis Skills
- Retina — text-prompt object detection using Qwen
- Cornea — instance segmentation using SAM
- Pupil — mask postprocessing (morphological operations)
Optional: Rerun for visualization.
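The Pupil skill handles the mask cleanup step in the pipeline below. Its effect can be approximated with a plain NumPy morphological open (erode, then dilate); this is an illustrative stand-in, not Pupil's actual implementation:

```python
import numpy as np

def _erode(mask: np.ndarray, k: int) -> np.ndarray:
    # Minimum over each k x k window; zero padding shrinks foreground at borders.
    padded = np.pad(mask, k // 2)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return windows.min(axis=(2, 3))

def _dilate(mask: np.ndarray, k: int) -> np.ndarray:
    # Maximum over each k x k window; grows the surviving foreground back.
    padded = np.pad(mask, k // 2)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return windows.max(axis=(2, 3))

def binary_open(mask: np.ndarray, k: int = 3) -> np.ndarray:
    """Morphological open: removes speckle smaller than the k x k kernel
    (k odd) while leaving larger regions essentially intact."""
    return _dilate(_erode(mask, k), k)
```

Opening with a 3x3 kernel deletes isolated noise pixels but restores any blob large enough to survive the erosion.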
Use Cases
Typical applications of the estimated 2D poses (centroid + bounding box) include:
Robotic Grasping
Use the centroid as a grasp target and the bounding box to estimate the object's spatial extent.
Object Tracking Initialization
Initialize tracking algorithms using the detected centroid and segmentation mask.
3D Pose Estimation Pipelines
Combine the segmentation mask with depth or multi-view data to estimate a full 3D pose.
Scene Understanding
Detect and localize objects in a scene using natural language queries.
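For the grasping case, a minimal sketch of turning one pose entry into a grasp point and a rough pixel-space extent. The pose layout here is assumed to match the dictionaries produced by the pipeline below:

```python
import numpy as np

def grasp_target_from_pose(pose: dict) -> tuple:
    """Pixel-space grasp point plus a rough extent radius from one pose entry.

    Assumes the pose layout produced by get_2d_pose():
    {"centroid": (cx, cy), "bbox_xywh": [x, y, w, h], ...}.
    """
    point = np.asarray(pose["centroid"], dtype=np.float32)
    _, _, w, h = pose["bbox_xywh"]
    # Half the shorter bbox side approximates how far the object extends
    # around the grasp point, in pixels.
    return point, 0.5 * float(min(w, h))

point, extent = grasp_target_from_pose(
    {"centroid": (320.0, 240.0), "bbox_xywh": [300, 220, 40, 60]}
)
```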
Input-Output


The Pipeline
The pipeline combines text-prompt object detection, instance segmentation, and mask postprocessing to estimate 2D object poses (centroid and extent).
Each detected object is processed independently to compute its centroid and bounding box from the segmentation mask.
Load RGB Image
↓
Detect Objects Using Qwen (Text Prompt)
↓
Extract Bounding Boxes
↓
Segment Image Using SAM
↓
Postprocess Masks (Morphological Open)
↓
Extract Bounding Boxes & Centroids
↓
Estimated 2D Poses
- Detect Objects Using Qwen — Text-prompt object detection; outputs bounding boxes per detected object.
- Segment Image Using SAM — Instance segmentation from bounding box prompts; outputs masks (and bboxes) per instance.
- Filter Image Using Morphological Open — Cleans masks (removes small holes/noise) before computing centroids.
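The final step reduces each cleaned mask to a centroid and an XYWH box. A NumPy-only sketch, which for binary masks is equivalent to the cv2.moments computation used in the full listing:

```python
import numpy as np

def pose_from_mask(mask: np.ndarray):
    """Centroid (cx, cy) and XYWH bounding box of a binary (H, W) mask.

    For a binary mask, the mean of the foreground pixel coordinates equals
    the image-moment centroid (m10/m00, m01/m00). Returns None if empty.
    """
    ys, xs = np.nonzero(mask > 0)
    if xs.size == 0:
        return None
    centroid = (float(xs.mean()), float(ys.mean()))
    x, y = int(xs.min()), int(ys.min())
    w = int(xs.max()) - x + 1  # inclusive pixel extents
    h = int(ys.max()) - y + 1
    return {"centroid": centroid, "bbox_xywh": [x, y, w, h]}
```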
The output poses (centroid + bounding box per instance) can drive pick-and-place, tracking, or be lifted to 3D with depth or multi-view data.
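As a sketch of that 3D lifting, a centroid can be back-projected to a camera-frame point with a standard pinhole model, given aligned depth and camera intrinsics. The names fx, fy, cx0, cy0 and the depth input are assumptions for illustration, not part of this pipeline:

```python
import numpy as np

def lift_centroid_to_3d(centroid, depth_m, fx, fy, cx0, cy0) -> np.ndarray:
    """Back-project a pixel (u, v) at a given depth (metres) to camera coords."""
    u, v = centroid
    return np.array([
        (u - cx0) * depth_m / fx,  # X: horizontal offset scaled by depth
        (v - cy0) * depth_m / fy,  # Y: vertical offset scaled by depth
        depth_m,                   # Z: depth along the optical axis
    ])

# A centroid at the principal point lies on the optical axis (X = Y = 0).
p = lift_centroid_to_3d((320.0, 240.0), 0.8, 600.0, 600.0, 320.0, 240.0)
```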
The Code
All visualization functions are defined at the top. The pipeline lives in get_2d_pose(): each Telekinesis Skill execution is followed by the relevant visualization. main() is called at the end with no arguments (image path and text prompt are fixed).
import numpy as np
import pathlib
from typing import Optional, Tuple, List, Dict, Any
from pycocotools import mask as mask_utils
import cv2
import rerun as rr
import rerun.blueprint as rrb
from loguru import logger
from datatypes import datatypes
from datatypes import io
from telekinesis import cornea, retina, pupil
ROOT_DIR = pathlib.Path(__file__).parent
DATA_DIR = ROOT_DIR / "data"
IMAGE_PATH = DATA_DIR / "images/test_image_001.png"
TEXT_PROMPT = "purple tray ."
DETECTION_COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (255, 0, 255), (0, 255, 255)]
# -----------------------------------------------------------------------------
# Visualization (all at the top)
# -----------------------------------------------------------------------------
def visualize_detections(image: np.ndarray, bboxes_xywh: List) -> None:
"""Log image and detection boxes to Rerun (Original | Detection)."""
if not bboxes_xywh:
return
rr.init("qwen_detection", spawn=True)
rr.send_blueprint(
rrb.Blueprint(
rrb.Grid(
rrb.Spatial2DView(name="Original", origin="image"),
rrb.Spatial2DView(name="Detection", origin="detection"),
),
rrb.SelectionPanel(),
rrb.TimePanel(),
),
make_active=True,
)
rr.log("image", rr.Image(image))
rr.log("detection", rr.Image(image))
rr.log(
"detection/bboxes",
rr.Boxes2D(
array=np.array(bboxes_xywh, dtype=np.float32),
array_format=rr.Box2DFormat.XYWH,
colors=np.array([DETECTION_COLORS[i % len(DETECTION_COLORS)] for i in range(len(bboxes_xywh))], dtype=np.uint8),
labels=[str(i) for i in range(len(bboxes_xywh))],
radii=[2] * len(bboxes_xywh),
),
)
def visualize_pose_estimation(image: np.ndarray, poses: List[Dict], input_bboxes_xywh: List) -> None:
"""Log input, bboxes, and segmentation (masks, boxes, centroids) to Rerun."""
rr.init("pose_estimation", spawn=False)
try:
rr.connect()
except Exception:
rr.spawn()
rr.send_blueprint(
rrb.Blueprint(
rrb.Horizontal(
rrb.Spatial2DView(name="Input", origin="input"),
rrb.Spatial2DView(name="Bboxes & Segments", origin="segmented"),
),
rrb.SelectionPanel(),
rrb.TimePanel(),
),
make_active=True,
)
rr.log("input/image", rr.Image(image))
rr.log(
"input/bboxes",
rr.Boxes2D(
array=np.array(input_bboxes_xywh, dtype=np.float32),
array_format=rr.Box2DFormat.XYWH,
colors=[[0, 255, 0]],
labels=[f"Box {i+1}" for i in range(len(input_bboxes_xywh))],
),
)
rr.log("segmented/image", rr.Image(image))
h, w = image.shape[:2]
seg_img = np.zeros((h, w), dtype=np.uint16)
for p in poses:
seg_img[p["mask"] > 0] = p["label"]
rr.log("segmented/masks", rr.SegmentationImage(seg_img))
bboxes = [p["bbox_xywh"] for p in poses if p.get("bbox_xywh") is not None]
if bboxes:
rr.log(
"segmented/boxes",
rr.Boxes2D(
array=np.asarray(bboxes, dtype=np.float32),
array_format=rr.Box2DFormat.XYWH,
class_ids=np.array([p["label"] for p in poses if p.get("bbox_xywh") is not None], dtype=np.int32),
),
)
centroids = [p["centroid"] for p in poses if p.get("centroid") is not None]
if centroids:
rr.log(
"segmented/centroids",
rr.Points2D(
np.array(centroids, dtype=np.float32),
radii=5,
colors=[[255, 0, 0]],
labels=[f"C{p['label']}" for p in poses if p.get("centroid") is not None],
),
)
# -----------------------------------------------------------------------------
# Helpers (centroid, mask decoding)
# -----------------------------------------------------------------------------
def compute_mask_centroid(mask: np.ndarray) -> Optional[Tuple[float, float]]:
"""Centroid (cx, cy) of a binary mask, or None if empty."""
if mask.ndim != 2:
raise ValueError("Mask must be (H, W).")
mask = (mask > 0).astype(np.uint8)
M = cv2.moments(mask)
if M["m00"] == 0:
return None
return float(M["m10"] / M["m00"]), float(M["m01"] / M["m00"])
def _annotation_to_mask(ann: Dict[str, Any], h: int, w: int) -> Optional[np.ndarray]:
"""Binary (H, W) mask from one SAM annotation, or None."""
mask = np.zeros((h, w), dtype=np.uint8)
if "mask" in ann and isinstance(ann["mask"], np.ndarray):
m = ann["mask"]
mask = (m > 0.5).astype(np.uint8) if m.dtype.kind in ("f", "b") else (m > 0).astype(np.uint8)
elif "segmentation" in ann and ann["segmentation"]:
seg = ann["segmentation"]
if isinstance(seg, dict):
mask_dec = mask_utils.decode(seg)
if mask_dec.ndim == 3:
mask_dec = mask_dec[:, :, 0]
mask = (mask_dec > 0).astype(np.uint8)
elif isinstance(seg, list) and len(seg) > 0:
polys = seg if isinstance(seg[0], list) else [seg]
for poly in polys:
pts = np.array(poly).reshape(-1, 2).astype(np.int32)
cv2.fillPoly(mask, [pts], 1)
if mask.sum() == 0:
return None
return mask
# -----------------------------------------------------------------------------
# Pipeline: one function, Telekinesis call then viz under each
# -----------------------------------------------------------------------------
def get_2d_pose() -> Tuple[np.ndarray, List[Dict], List]:
"""
Run 2D pose estimation: load image → Qwen detect → SAM segment → postprocess masks.
After each Telekinesis step, the corresponding visualization is called.
Returns (image_np, poses, detection_bboxes_xywh).
"""
# 1. Load image (Telekinesis: io)
image = io.load_image(str(IMAGE_PATH))
logger.info(f"Loaded image shape {image.to_numpy().shape}")
image_np = image.to_numpy()
if image_np.shape[2] == 4:
image_np = cv2.cvtColor(image_np, cv2.COLOR_BGRA2BGR)
image = datatypes.Image(image_np)
image_np = image.to_numpy()
h, w = image_np.shape[:2]
# 2. Detect objects (Telekinesis: Retina Qwen)
detections = retina.detect_objects_using_qwen(
image=image,
objects_to_detect=TEXT_PROMPT,
model_name="Qwen/Qwen3-VL-4B-Instruct",
)
detections = detections.to_list()
logger.success(f"Detected {len(detections)} objects.")
detection_bboxes = [ann["bbox"] for ann in detections]
# Visualization: detection boxes
visualize_detections(image_np, detection_bboxes)
if not detection_bboxes:
return image_np, [], detection_bboxes
# 3. Segment with SAM (Telekinesis: Cornea)
bboxes_xyxy = [[float(x), float(y), float(x + bw), float(y + bh)] for x, y, bw, bh in detection_bboxes]
sam_result = cornea.segment_image_using_sam(image=image, bboxes=bboxes_xyxy)
annotations = sam_result.to_list()
# 4. Postprocess masks and build poses (Telekinesis: Pupil morphological open per mask)
poses = []
for idx, ann in enumerate(annotations):
label = idx + 1
mask_i = _annotation_to_mask(ann, h, w)
if mask_i is None:
continue
filtered = pupil.filter_image_using_morphological_open(
image=datatypes.Image(mask_i),
kernel_size=600,
kernel_shape="rectangle",
iterations=1,
border_type="default",
)
mask_np = filtered.to_numpy()
if mask_np.shape != (h, w):
mask_np = cv2.resize(mask_np.astype(np.uint8), (w, h), interpolation=cv2.INTER_NEAREST)
centroid = compute_mask_centroid(mask_np)
bbox = ann.get("bbox")
poses.append({
"label": label,
"mask": mask_np,
"centroid": centroid,
"bbox_xywh": list(bbox) if bbox else None,
})
# Visualization: segmentation and poses (masks, boxes, centroids)
visualize_pose_estimation(image_np, poses, detection_bboxes)
return image_np, poses, detection_bboxes
def main() -> None:
get_2d_pose()
if __name__ == "__main__":
main()
