
Detect Objects Using QWEN

SUMMARY

Detect Objects Using QWEN detects objects in images using the QWEN Vision Language Model (VLM).

QWEN is a powerful vision-language model that can understand natural language descriptions and detect objects in images based on text prompts. It uses advanced multimodal AI to interpret both visual and textual inputs, making it ideal for flexible, prompt-driven object detection.

Use this Skill when you want to detect objects using natural language descriptions or when you need flexible, prompt-driven object detection.

The Skill

python
from telekinesis import retina

annotations = retina.detect_objects_using_qwen(
    image=image,
    objects_to_detect="buttons",
    model_name="Qwen/Qwen2.5-VL-7B-Instruct",
)

API Reference

Example

Input Image


Original image for QWEN object detection

Detected Objects


Detected objects with bounding boxes using QWEN

The Code

python
from telekinesis import retina
from datatypes import io
import pathlib

# Optional for logging
from loguru import logger

DATA_DIR = pathlib.Path("path/to/telekinesis-data")

# Load image
filepath = str(DATA_DIR / "images" / "warehouse_1.jpg")
image = io.load_image(filepath=filepath)
logger.success(f"Loaded image from {filepath}")

# Detect objects using QWEN
annotations = retina.detect_objects_using_qwen(
    image=image,
    objects_to_detect="person",
    model_name="Qwen/Qwen3-VL-4B-Instruct",
)

# Access results
list_annotations = annotations.to_list()

# Extract object information
for annotation in list_annotations:
    bbox = annotation["bbox"]  # [x, y, width, height]
    description = annotation.get("description", "")
    area = annotation.get("area", 0)
    logger.info(f"Detected object: {description}, area={area}, bbox={bbox}")

The Explanation of the Code

This example demonstrates how to use the detect_objects_using_qwen Skill to detect objects in an image using natural language descriptions. After importing the necessary modules and setting up optional logging, the image is loaded from a file.

python
from telekinesis import retina
from datatypes import io
import pathlib

# Optional for logging
from loguru import logger

DATA_DIR = pathlib.Path("path/to/telekinesis-data")

# Load image
filepath = str(DATA_DIR / "images" / "warehouse_1.jpg")
image = io.load_image(filepath=filepath)
logger.success(f"Loaded image from {filepath}")

The Skill detects objects using the QWEN Vision Language Model, which understands both visual content and textual prompts. The objects_to_detect parameter accepts natural language descriptions (e.g., "buttons", "all screws", "person, car"), and model_name specifies the HuggingFace model to use.

python
# Detect objects using QWEN
annotations = retina.detect_objects_using_qwen(
    image=image,
    objects_to_detect="person",
    model_name="Qwen/Qwen3-VL-4B-Instruct",
)

The function returns an ObjectDetectionAnnotations object in COCO-like format. Call .to_list() to get the list of detections. Each annotation includes a bounding box, description, and area. Annotations are sorted by area (descending).

python
# Access results
list_annotations = annotations.to_list()

# Extract object information
for annotation in list_annotations:
    bbox = annotation["bbox"]  # [x, y, width, height]
    description = annotation.get("description", "")
    area = annotation.get("area", 0)
    logger.info(f"Detected object: {description}, area={area}, bbox={bbox}")

This Skill is particularly useful in robotics pipelines for prompt-driven object detection, flexible visual search, and manipulation planning, where detecting objects based on natural language descriptions aids in task planning and execution.
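Because annotations are sorted by area in descending order, a manipulation pipeline can take the first entry as the most prominent detection. A minimal sketch operating on the COCO-like annotation dicts described above (the helper name `largest_detection` is illustrative, not part of the library):

```python
def largest_detection(list_annotations):
    """Return the annotation with the largest area, or None if empty.

    The Skill already returns annotations sorted by area (descending),
    so the first entry is the largest; using max() makes the helper
    safe for annotation lists from other sources as well.
    """
    if not list_annotations:
        return None
    return max(list_annotations, key=lambda a: a.get("area", 0))

# Example with hand-written annotations in the documented format
detections = [
    {"bbox": [10, 20, 30, 40], "description": "person", "area": 1200},
    {"bbox": [5, 5, 10, 10], "description": "person", "area": 100},
]
print(largest_detection(detections)["description"], largest_detection(detections)["area"])
```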

Running the Example

Runnable examples are available in the Telekinesis examples repository. Follow the README in that repository to set up the environment. Once set up, you can run this specific example with:

bash
cd telekinesis-examples
python examples/retina_examples.py --example detect_objects_using_qwen

How to Tune the Parameters

The detect_objects_using_qwen Skill has two parameters:

objects_to_detect (required):

  • Natural language description of objects to detect
  • Format: String (e.g., "buttons", "person, car", "all screws")
  • Use specific terms for better accuracy
  • Use comma-separated list for multiple object types
  • Use descriptive phrases for complex queries (e.g., "all circular objects")
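The formats above are plain strings, so a multi-object prompt can be assembled from a list at runtime. A minimal sketch (variable names are illustrative):

```python
# The objects_to_detect parameter is a plain string; these examples
# mirror the formats listed above.

# Single object type
prompt_single = "buttons"

# Multiple object types: join with commas
object_types = ["person", "car", "forklift"]
prompt_multi = ", ".join(object_types)

# Descriptive phrase for a complex query
prompt_descriptive = "all circular objects"

print(prompt_multi)  # → person, car, forklift
```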

model_name (required):

  • HuggingFace model name (default: "Qwen/Qwen3-VL-4B-Instruct")
  • Format: String

TIP

Best practice: Use clear, specific descriptions in objects_to_detect. The model understands natural language, so you can describe objects in plain English. For better results, be specific about what you're looking for.

Where to Use the Skill in a Pipeline

Detect Objects Using QWEN is commonly used in the following pipelines:

  • Prompt-driven object detection - Detecting objects based on natural language descriptions
  • Flexible visual search - Finding objects without predefined classes
  • Multi-object detection - Detecting multiple object types in a single pass
  • Robotic manipulation - Identifying objects for pick-and-place operations

A typical pipeline for prompt-driven object detection looks as follows:

python
from telekinesis import retina
from datatypes import io

# 1. Load image
image = io.load_image(filepath=...)

# 2. Detect objects using QWEN with natural language prompt
annotations = retina.detect_objects_using_qwen(
    image=image,
    objects_to_detect="screws and bolts",
    model_name="Qwen/Qwen2.5-VL-7B-Instruct",
)

# 3. Extract annotations
list_annotations = annotations.to_list()

# 4. Process each detection
for annotation in list_annotations:
    bbox = annotation["bbox"]
    description = annotation.get("description", "")
    # Use bbox and description for downstream tasks

# 5. Optional: Use detections for manipulation, tracking, or further processing
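For step 5, pick-and-place planning typically needs a grasp point rather than a box. The center of a COCO-style [x, y, width, height] bounding box can be computed directly (the helper name `bbox_center` is illustrative, not part of the library):

```python
def bbox_center(bbox):
    """Center point of a COCO-style [x, y, width, height] bounding box."""
    x, y, w, h = bbox
    return (x + w / 2.0, y + h / 2.0)

# Example: center of a 30x40 box whose top-left corner is at (10, 20)
print(bbox_center([10, 20, 30, 40]))  # → (25.0, 40.0)
```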


Alternative Skills

  • detect_objects_using_grounding_dino: Grounding DINO does zero-shot detection with text prompts. Use it for similar flexibility; QWEN uses a VLM for natural language understanding.
  • detect_objects_using_rfdetr: RF-DETR uses predefined COCO classes. Use it for transformer-based fixed-class detection; QWEN for prompt-driven detection.
  • detect_objects_using_yolox: YOLOX uses predefined COCO classes and is fast. Use it for real-time fixed-class detection; QWEN for flexible prompts.

When Not to Use the Skill

Do not use Detect Objects Using QWEN when:

  • You need real-time performance (QWEN requires GPU and can be slow)
  • You have predefined object classes (Use RF-DETR or YOLOX instead)
  • You need instance segmentation (QWEN provides bounding boxes, not masks)
  • You're working with very small objects (QWEN may miss small details)

TIP

QWEN is excellent for flexible, prompt-driven detection but may be slower than specialized detectors. Use it when you need the flexibility of natural language descriptions.