2D Pose Estimation Using Qwen and SAM

SUMMARY

Estimate 2D object poses (centroid and extent) in an RGB image using a text prompt. Objects are detected with Qwen, segmented with SAM, and converted into centroid and bounding box estimates.

Overview

Many robotics and vision systems require a fast way to localize objects in an image before performing tasks such as grasping, tracking, or pose estimation.

This example demonstrates how to estimate 2D object poses (centroid and extent) from an RGB image using a natural language prompt. Given a prompt such as "purple tray" or "bottle", the pipeline detects matching objects and computes their centroid and bounding box in image coordinates.

The resulting outputs provide a lightweight representation of object location and size that can be used directly for downstream tasks or as an initialization step for 3D pose estimation.

Inputs

  • RGB image path
  • Text prompt describing the objects to detect

Required Telekinesis Skills

  • Retina — text-prompt object detection using Qwen
  • Cornea — instance segmentation using SAM
  • Pupil — mask postprocessing (morphological operations)

Optional: Rerun for visualization.

Use Cases

This pipeline estimates 2D object poses (centroid + bounding box) from an RGB image using a text prompt.

Typical applications include:

  • Robotic Grasping
    Use the centroid as a grasp target and the bounding box to estimate the object's spatial extent.

  • Object Tracking Initialization
    Initialize tracking algorithms using the detected centroid and segmentation mask.

  • 3D Pose Estimation Pipelines
    Combine the segmentation mask with depth or multi-view data to estimate a full 3D pose.

  • Scene Understanding
    Detect and localize objects in a scene using natural language queries.
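As a sketch of the 3D lifting mentioned above: given a depth value at the centroid and pinhole camera intrinsics, the 2D centroid can be back-projected into a camera-frame 3D point. The intrinsics below are illustrative values, not calibration from this pipeline.

```python
import numpy as np


def backproject_centroid(centroid, depth_m, fx, fy, cx, cy):
    """Lift a 2D centroid (u, v) to a 3D camera-frame point (meters)
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    u, v = centroid
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])


# With the principal point at the centroid, the point lies on the optical axis:
print(backproject_centroid((320.0, 240.0), 1.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0))
# [0.  0.  1.5]
```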

Input-Output

Raw Sensor Input: raw image of a warehouse scene with a forklift.
Segmentation and Boxes: segmented image with the forklift's mask and bounding box.

The Pipeline

The pipeline combines text-prompt object detection, instance segmentation, and mask postprocessing to estimate 2D object poses (centroid and extent).

Each detected object is processed independently to compute its centroid and bounding box from the segmentation mask.

text
Load RGB Image
      ↓
Detect Objects Using Qwen (Text Prompt)
      ↓
Extract Bounding Boxes
      ↓
Segment Image Using SAM
      ↓
Postprocess Masks (Morphological Open)
      ↓
Extract Bounding Boxes & Centroids
      ↓
Estimated 2D Poses

The output poses (one centroid and bounding box per instance) can drive pick-and-place or tracking, or be lifted to 3D with depth or multi-view data.

The Code

All visualization functions are defined at the top. The pipeline lives in get_2d_pose(): each Telekinesis Skill execution is followed by the relevant visualization. main() is called at the end with no arguments (image path and text prompt are fixed).

python
import numpy as np
import pathlib
from typing import Optional, Tuple, List, Dict, Any
from pycocotools import mask as mask_utils
import cv2
import rerun as rr
import rerun.blueprint as rrb
from loguru import logger

from datatypes import datatypes
from datatypes import io
from telekinesis import cornea, retina, pupil

ROOT_DIR = pathlib.Path(__file__).parent
DATA_DIR = ROOT_DIR / "data"
IMAGE_PATH = DATA_DIR / "images/test_image_001.png"
TEXT_PROMPT = "purple tray ."

DETECTION_COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (255, 0, 255), (0, 255, 255)]


# -----------------------------------------------------------------------------
# Visualization (all at the top)
# -----------------------------------------------------------------------------

def visualize_detections(image: np.ndarray, bboxes_xywh: List) -> None:
    """Log image and detection boxes to Rerun (Original | Detection)."""
    if not bboxes_xywh:
        return
    rr.init("qwen_detection", spawn=True)
    rr.send_blueprint(
        rrb.Blueprint(
            rrb.Grid(
                rrb.Spatial2DView(name="Original", origin="image"),
                rrb.Spatial2DView(name="Detection", origin="detection"),
            ),
            rrb.SelectionPanel(),
            rrb.TimePanel(),
        ),
        make_active=True,
    )
    rr.log("image", rr.Image(image))
    rr.log("detection", rr.Image(image))
    rr.log(
        "detection/bboxes",
        rr.Boxes2D(
            array=np.array(bboxes_xywh, dtype=np.float32),
            array_format=rr.Box2DFormat.XYWH,
            colors=np.array([DETECTION_COLORS[i % len(DETECTION_COLORS)] for i in range(len(bboxes_xywh))], dtype=np.uint8),
            labels=[str(i) for i in range(len(bboxes_xywh))],
            radii=[2] * len(bboxes_xywh),
        ),
    )


def visualize_pose_estimation(image: np.ndarray, poses: List[Dict], input_bboxes_xywh: List) -> None:
    """Log input, bboxes, and segmentation (masks, boxes, centroids) to Rerun."""
    rr.init("pose_estimation", spawn=False)
    try:
        rr.connect()
    except Exception:
        rr.spawn()
    rr.send_blueprint(
        rrb.Blueprint(
            rrb.Horizontal(
                rrb.Spatial2DView(name="Input", origin="input"),
                rrb.Spatial2DView(name="Bboxes & Segments", origin="segmented"),
            ),
            rrb.SelectionPanel(),
            rrb.TimePanel(),
        ),
        make_active=True,
    )
    rr.log("input/image", rr.Image(image))
    rr.log(
        "input/bboxes",
        rr.Boxes2D(
            array=np.array(input_bboxes_xywh, dtype=np.float32),
            array_format=rr.Box2DFormat.XYWH,
            colors=[[0, 255, 0]],
            labels=[f"Box {i+1}" for i in range(len(input_bboxes_xywh))],
        ),
    )
    rr.log("segmented/image", rr.Image(image))
    h, w = image.shape[:2]
    seg_img = np.zeros((h, w), dtype=np.uint16)
    for p in poses:
        seg_img[p["mask"] > 0] = p["label"]
    rr.log("segmented/masks", rr.SegmentationImage(seg_img))
    bboxes = [p["bbox_xywh"] for p in poses if p.get("bbox_xywh") is not None]
    if bboxes:
        rr.log(
            "segmented/boxes",
            rr.Boxes2D(
                array=np.asarray(bboxes, dtype=np.float32),
                array_format=rr.Box2DFormat.XYWH,
                class_ids=np.array([p["label"] for p in poses if p.get("bbox_xywh") is not None], dtype=np.int32),
            ),
        )
    centroids = [p["centroid"] for p in poses if p.get("centroid") is not None]
    if centroids:
        rr.log(
            "segmented/centroids",
            rr.Points2D(
                np.array(centroids, dtype=np.float32),
                radii=5,
                colors=[[255, 0, 0]],
                labels=[f"C{p['label']}" for p in poses if p.get("centroid") is not None],
            ),
        )


# -----------------------------------------------------------------------------
# Helpers (centroid, mask decoding)
# -----------------------------------------------------------------------------

def compute_mask_centroid(mask: np.ndarray) -> Optional[Tuple[float, float]]:
    """Centroid (cx, cy) of a binary mask, or None if empty."""
    if mask.ndim != 2:
        raise ValueError("Mask must be (H, W).")
    mask = (mask > 0).astype(np.uint8)
    M = cv2.moments(mask)
    if M["m00"] == 0:
        return None
    return float(M["m10"] / M["m00"]), float(M["m01"] / M["m00"])


def _annotation_to_mask(ann: Dict[str, Any], h: int, w: int) -> Optional[np.ndarray]:
    """Binary (H, W) mask from one SAM annotation, or None."""
    mask = np.zeros((h, w), dtype=np.uint8)
    if "mask" in ann and isinstance(ann["mask"], np.ndarray):
        m = ann["mask"]
        mask = (m > 0.5).astype(np.uint8) if m.dtype.kind in ("f", "b") else (m > 0).astype(np.uint8)
    elif "segmentation" in ann and ann["segmentation"]:
        seg = ann["segmentation"]
        if isinstance(seg, dict):
            mask_dec = mask_utils.decode(seg)
            if mask_dec.ndim == 3:
                mask_dec = mask_dec[:, :, 0]
            mask = (mask_dec > 0).astype(np.uint8)
        elif isinstance(seg, list) and len(seg) > 0:
            polys = seg if isinstance(seg[0], list) else [seg]
            for poly in polys:
                pts = np.array(poly).reshape(-1, 2).astype(np.int32)
                cv2.fillPoly(mask, [pts], 1)
    if mask.sum() == 0:
        return None
    return mask


# -----------------------------------------------------------------------------
# Pipeline: one function, Telekinesis call then viz under each
# -----------------------------------------------------------------------------

def get_2d_pose() -> Tuple[np.ndarray, List[Dict], List]:
    """
    Run 2D pose estimation: load image → Qwen detect → SAM segment → postprocess masks.
    After each Telekinesis step, the corresponding visualization is called.
    Returns (image_np, poses, detection_bboxes_xywh).
    """
    # 1. Load image (Telekinesis: io)
    image = io.load_image(str(IMAGE_PATH))
    logger.info(f"Loaded image shape {image.to_numpy().shape}")
    image_np = image.to_numpy()
    if image_np.ndim == 3 and image_np.shape[2] == 4:
        # Drop the alpha channel so downstream skills see a 3-channel image.
        image_np = cv2.cvtColor(image_np, cv2.COLOR_BGRA2BGR)
        image = datatypes.Image(image_np)
    h, w = image_np.shape[:2]

    # 2. Detect objects (Telekinesis: Retina Qwen)
    detections = retina.detect_objects_using_qwen(
        image=image,
        objects_to_detect=TEXT_PROMPT,
        model_name="Qwen/Qwen3-VL-4B-Instruct",
    )
    detections = detections.to_list()
    logger.success(f"Detected {len(detections)} objects.")
    detection_bboxes = [ann["bbox"] for ann in detections]

    # Visualization: detection boxes
    visualize_detections(image_np, detection_bboxes)

    if not detection_bboxes:
        return image_np, [], detection_bboxes

    # 3. Segment with SAM (Telekinesis: Cornea)
    bboxes_xyxy = [[float(x), float(y), float(x + bw), float(y + bh)] for x, y, bw, bh in detection_bboxes]
    sam_result = cornea.segment_image_using_sam(image=image, bboxes=bboxes_xyxy)
    annotations = sam_result.to_list()

    # 4. Postprocess masks and build poses (Telekinesis: Pupil morphological open per mask)
    poses = []
    for idx, ann in enumerate(annotations):
        label = idx + 1
        mask_i = _annotation_to_mask(ann, h, w)
        if mask_i is None:
            continue
        filtered = pupil.filter_image_using_morphological_open(
            image=datatypes.Image(mask_i),
            kernel_size=600,
            kernel_shape="rectangle",
            iterations=1,
            border_type="default",
        )
        mask_np = filtered.to_numpy()
        if mask_np.shape != (h, w):
            mask_np = cv2.resize(mask_np.astype(np.uint8), (w, h), interpolation=cv2.INTER_NEAREST)
        centroid = compute_mask_centroid(mask_np)
        bbox = ann.get("bbox")
        poses.append({
            "label": label,
            "mask": mask_np,
            "centroid": centroid,
            "bbox_xywh": list(bbox) if bbox is not None else None,
        })

    # Visualization: segmentation and poses (masks, boxes, centroids)
    visualize_pose_estimation(image_np, poses, detection_bboxes)

    return image_np, poses, detection_bboxes


def main() -> None:
    get_2d_pose()


if __name__ == "__main__":
    main()