(I write daily for my 112K+ AI-pro audience, with 4.5M+ weekly views. Noise-free, actionable, applied-AI developments only.)
Qwen2.5-VL was released just last week. It is Qwen's new flagship vision-language model and a significant leap forward.
It's a vision-language model available in 3 sizes (3B, 7B, 72B), with groundbreaking abilities in video comprehension, document parsing, and object recognition. It significantly enhances visual understanding and multimodal integration, supporting agentic tool use and structured outputs. Apache 2.0 licensed (except the 72B variant), it's available on Hugging Face and ModelScope. It outperforms competitors like GPT-4o-mini, especially in document understanding and agent-based tasks.
Its agentic capabilities allow it to reason about and control external tools like phones and computers, enabling real-world applications without task-specific fine-tuning. That means Qwen2.5-VL 72B should perform well as a base model for an OpenAI Operator-like agent.
In today's tutorial, I will take an example from Qwen2.5-VL's official cookbook and use the model for computer use. You can also directly check out the Google Colab here.
This project takes a desktop screenshot and a user query, sends both to a vision-language model, and returns a coordinate or region that corresponds to the user's request. It can do this by calling a remote API or by running a local instance of the model.
The main flow starts with reading the image, preparing text prompts, and then generating a special JSON-like response that includes the coordinate. The code finally draws a mark on the screenshot to highlight the model's predicted UI element.
Image Annotation Logic: draw_point
Below is the function that draws an overlay on the screenshot:
The draw_point function uses PIL to open the screenshot and create a separate transparent overlay. It then draws a circular highlight around the model's chosen coordinate. The circle's center is the coordinate returned by the model, and a partially translucent fill color helps distinguish the highlight from the rest of the screenshot.
The overlay is merged back onto the original image, leaving a visible circle that indicates where the model expects a user action (such as a click) to happen. This approach clarifies the exact spot identified by the model and makes it easy to confirm whether the predicted location aligns with the intended user query.
This matters because, when the model predicts a click coordinate, it is just a pair of x and y numbers in the image's coordinate space. Without a visual highlight, there is no convenient way to check whether those numbers correspond to the correct on-screen element or area. Drawing the circle makes the model's output visually interpretable and confirms that the chosen spot actually matches the user's intended command.
from PIL import Image, ImageColor, ImageDraw

def draw_point(image: Image.Image, point: list, color=None):
    # Resolve the highlight color: accept a color name and make it translucent,
    # otherwise fall back to semi-transparent red.
    if isinstance(color, str):
        try:
            color = ImageColor.getrgb(color)
            color = color + (128,)
        except ValueError:
            color = (255, 0, 0, 128)
    else:
        color = (255, 0, 0, 128)

    # Draw on a fully transparent RGBA overlay so the original pixels are preserved.
    overlay = Image.new('RGBA', image.size, (255, 255, 255, 0))
    overlay_draw = ImageDraw.Draw(overlay)

    # Outer circle, sized relative to the smaller image dimension.
    radius = min(image.size) * 0.05
    x, y = point
    overlay_draw.ellipse(
        [(x - radius, y - radius), (x + radius, y + radius)],
        fill=color
    )

    # Small solid green dot at the exact predicted coordinate.
    center_radius = radius * 0.1
    overlay_draw.ellipse(
        [(x - center_radius, y - center_radius),
         (x + center_radius, y + center_radius)],
        fill=(0, 255, 0, 255)
    )

    # Composite the overlay onto the screenshot and return a standard RGB image.
    image = image.convert('RGBA')
    combined = Image.alpha_composite(image, overlay)
    return combined.convert('RGB')
The essential steps are:
An overlay in RGBA mode is created, preserving transparency.
An ellipse is drawn (a larger outer circle) around the coordinate to highlight it.
A smaller circle is drawn in green at the exact center.
The overlay is combined with the original image so the highlight is visible without permanently changing the original screenshot.
Note again why this function is crucial: the Qwen2.5-VL output typically includes a coordinate, and visualizing that coordinate (by highlighting it) clarifies which area on the screen the model is referring to.
The radius is set proportional to the minimum dimension of the image (min(image.size) * 0.05) so that the circle stays in proportion regardless of whether the image is very large or very small.
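As a quick sanity check, you can call draw_point directly on any screenshot with a made-up coordinate (the path and coordinate below are placeholders, not values produced by the model):

from PIL import Image

# Placeholder path and coordinate, purely to verify the overlay rendering.
screenshot = Image.open("assets/computer_use/computer_use2.jpeg")
annotated = draw_point(screenshot, [640, 360], color="green")
annotated.save("annotated_screenshot.png")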
Using an API-based Inference Approach
This section demonstrates how to run Qwen2.5-VL through a remote API. This is helpful if your local environment does not have enough resources to host a large model, or if you want to use Alibaba's DashScope service.
Environment Variable for API Key
import os
os.environ['DASHSCOPE_API_KEY'] = "your api key"
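The remote path also relies on an encode_image helper that is defined in the cookbook notebook but not shown here. A minimal sketch, assuming it simply base64-encodes the raw file bytes, could look like this:

import base64

def encode_image(image_path: str) -> str:
    # Read the screenshot bytes and return a base64 string that can be
    # embedded as a data URI inside the chat message.
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")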
The Core Function for Remote Inference
The function perform_gui_grounding_with_api loads the local screenshot from disk and opens it with PIL. The function then base64-encodes the raw image data so it can be sent through an HTTP request to the remote Qwen2.5-VL inference service.
It constructs a message structure that includes a system-level prompt telling the model it is a helpful assistant, plus a user prompt containing both the base64-encoded screenshot (as a data URI) and the text query (e.g., "open the third issue"). That message structure is passed to the client.chat.completions.create call, which sends everything to the Qwen2.5-VL endpoint. The response is plain text that contains a <tool_call> block with JSON indicating a coordinate or bounding region on the screenshot. The function parses this JSON snippet to extract the coordinate.
Finally, the function resizes the original image (if necessary) and uses draw_point to overlay a circle at the predicted coordinate. It returns the raw text output for inspection (showing how the model arrived at the coordinate) and a newly annotated image that visually highlights the target location.
import json
import os

from openai import OpenAI
from PIL import Image

# smart_resize, ComputerUse, and encode_image are utilities from the
# Qwen2.5-VL cookbook (ComputerUse lives in utils/agent_function_call).

def perform_gui_grounding_with_api(
    screenshot_path,
    user_query,
    model_id,
    min_pixels=3136,
    max_pixels=12845056
):
    # image reading
    input_image = Image.open(screenshot_path)
    base64_image = encode_image(screenshot_path)

    client = OpenAI(
        api_key=os.getenv('DASHSCOPE_API_KEY'),
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )

    # keep the image within the model's supported resolution range
    resized_height, resized_width = smart_resize(
        input_image.height,
        input_image.width,
        min_pixels=min_pixels,
        max_pixels=max_pixels,
    )

    # function call initialization
    computer_use = ComputerUse(
        cfg={
            "display_width_px": resized_width,
            "display_height_px": resized_height
        }
    )

    # building system & user messages
    ...

    completion = client.chat.completions.create(
        model=model_id,
        messages=messages,
    )

    # parse the <tool_call> JSON block out of the raw model response
    output_text = completion.choices[0].message.content
    action = json.loads(
        output_text.split('<tool_call>\n')[1].split('\n</tool_call>')[0]
    )

    # annotate the resized screenshot at the predicted coordinate
    display_image = input_image.resize((resized_width, resized_height))
    display_image = draw_point(display_image, action['arguments']['coordinate'], color='green')

    return output_text, display_image
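The message-building step is elided above. As a purely hypothetical sketch (not the cookbook's actual code), the messages could follow the standard OpenAI-compatible multimodal format, embedding the screenshot as a base64 data URI:

# Hypothetical sketch only; the cookbook's real message construction may differ,
# e.g. it may route through NousFnCallPrompt as the local path does.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
        {"type": "text", "text": user_query},
    ]},
]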
The steps:
Base64 encoding: encode_image(screenshot_path) is used so the image can be sent through the API.
Smart resize: smart_resize is a function provided by Qwen to keep the image within a certain resolution range. min_pixels ensures the image is not so small that important details are lost, while max_pixels prevents extremely large images from exceeding GPU memory or inference constraints.
ComputerUse: The ComputerUse class is defined in utils/agent_function_call. It holds metadata describing the screen size after resizing, so the model is consistent about where the coordinate is.
Messages: Qwen2.5-VL uses a chat-like approach, so building a JSON structure that includes the user query and the embedded screenshot is necessary.
Model inference: The model responds with a chunk of text that includes a <tool_call> JSON block. This block is parsed via json.loads(...).
Coordinate extraction: The code pulls the coordinate from action['arguments']['coordinate'] and calls draw_point(...) to annotate the screenshot.
Once the function returns output_text and display_image, the model's raw response can be inspected in the text, and the coordinate is highlighted on the image.
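To make the parsing step concrete, here is a small example with a made-up model response; the coordinate and the "name"/"action" fields are illustrative assumptions, but the <tool_call> delimiters and the arguments['coordinate'] access match the code above:

import json

# Made-up response for illustration; only the delimiters and the
# arguments/coordinate structure are taken from the code above.
output_text = (
    "<tool_call>\n"
    '{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [512, 384]}}'
    "\n</tool_call>"
)

action = json.loads(
    output_text.split('<tool_call>\n')[1].split('\n</tool_call>')[0]
)
print(action['arguments']['coordinate'])  # [512, 384]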
Example Usage
An example invocation is shown at the bottom of that section:
screenshot = "assets/computer_use/computer_use2.jpeg"
user_query = 'open the third issue'
model_id = "qwen2.5-vl-7b-instruct"
output_text, display_image = perform_gui_grounding_with_api(
screenshot,
user_query,
model_id
)
print(output_text)
display(display_image)
Here, the function receives a path to a screenshot, a user query, and the name of the Qwen2.5-VL model to use. The output is printed and the annotated image is displayed.
Running the Model Locally
import torch
from transformers import Qwen2_5_VLProcessor, Qwen2_5_VLForConditionalGeneration

model_path = "Qwen/Qwen2.5-VL-7B-Instruct"

processor = Qwen2_5_VLProcessor.from_pretrained(model_path)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
In this instance, Qwen2_5_VLProcessor is responsible for preparing the image, tokenizing text, and building the combined input expected by the model.
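To make that concrete, here is a hedged mini-example of what the processor builds for a single screenshot and prompt (the message format follows the standard Hugging Face multimodal chat convention; the listed output keys are typical but worth verifying on your install):

from PIL import Image

# Hedged mini-example: inspect what the processor produces for one image + prompt.
image = Image.open("assets/computer_use/computer_use2.jpeg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this screenshot."},
    ]},
]

# Render the chat template to a prompt string, then encode text and image together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")

# Typically includes input_ids, attention_mask, pixel_values, image_grid_thw.
print(inputs.keys())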
The Local Inference Function
This perform_gui_grounding function is the local counterpart to the remote approach. Instead of sending requests to a distant API, it directly runs inference on your machine using the loaded Qwen2.5-VL model. It first opens the screenshot with PIL to ensure the image is in memory.
It then calculates a suitable size for the model by calling smart_resize, which ensures the screenshot isn't too small or too large for the model's constraints. The function then constructs a specialized function call prompt, embedding both the user query text and the path to the local screenshot.
The Qwen2_5_VLProcessor processes these inputs by tokenizing text and extracting relevant patches from the image so the model can handle language and vision together. When model.generate is invoked, it runs a forward pass, returning text that encodes the final coordinate or bounding region in a JSON object contained within a <tool_call> ... </tool_call> section.
The function parses out that JSON snippet to get the numeric coordinate. It then calls draw_point to place a highlight on the screenshot. Finally, it returns both the raw text output (so you can inspect the model's overall reasoning) and the annotated screenshot that shows exactly where the model suggests a user interaction should occur.
def perform_gui_grounding(screenshot_path, user_query, model, processor):
    input_image = Image.open(screenshot_path)

    # keep the image within the processor's supported resolution range,
    # rounded to the vision patch size
    resized_height, resized_width = smart_resize(
        input_image.height,
        input_image.width,
        factor=processor.image_processor.patch_size
               * processor.image_processor.merge_size,
        min_pixels=processor.image_processor.min_pixels,
        max_pixels=processor.image_processor.max_pixels,
    )

    # describe the (resized) screen to the function-calling tool
    computer_use = ComputerUse(
        cfg={"display_width_px": resized_width,
             "display_height_px": resized_height}
    )

    # build the function-call prompt with the query and the screenshot path
    message = NousFnCallPrompt.preprocess_fncall_messages(
        messages=[
            Message(role="system", content=[ContentItem(text="You are a helpful assistant.")]),
            Message(role="user", content=[
                ContentItem(text=user_query),
                ContentItem(image=f"file://{screenshot_path}")
            ]),
        ],
        functions=[computer_use.function],
        lang=None,
    )

    # tokenize the prompt and image patches together
    text = processor.apply_chat_template(
        message, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text], images=[input_image],
        padding=True, return_tensors="pt"
    ).to('cuda')

    # generate and keep only the newly generated tokens
    output_ids = model.generate(**inputs, max_new_tokens=2048)
    generated_ids = [output_ids[len(input_ids):] for
                     input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(
        generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True
    )[0]

    # parse the <tool_call> JSON block and annotate the predicted coordinate
    action = json.loads(
        output_text.split('<tool_call>\n')[1].split('\n</tool_call>')[0]
    )
    display_image = input_image.resize((resized_width, resized_height))
    display_image = draw_point(display_image, action['arguments']['coordinate'], color='green')

    return output_text, display_image
Some pointers on the Coding Choices
The code uses <tool_call>\n and \n</tool_call> as known delimiters because Qwen2.5-VL encloses its function-call results in a specific JSON block. Parsing by splitting on these tags makes sure the JSON snippet stays intact without accidentally cutting off extra text. This technique ensures the coordinate data remains valid JSON.
The smart_resize function keeps images within a range of pixel counts that Qwen2.5-VL can handle efficiently. It prevents oversized images that might exceed GPU memory and undersized ones that are too small to give meaningful detail. These limits speed up processing and maintain a stable environment for the model (a simplified sketch of the idea follows this list).
The draw_point approach uses a transparent overlay to highlight the predicted coordinate. The overlay method ensures the final image remains visually recognizable while adding a translucent circle on top. This design preserves the original pixels underneath and renders the highlight in an RGBA layer, which is then composited back into a standard 24-bit color image.
The NousFnCallPrompt module structures the prompt so Qwen2.5-VL sees text and images in the correct format. It packages them together in a way that aligns with how the model was trained to process multimodal input, reducing the chance of a malformed prompt.
Maintaining both a remote (DashScope-based) and a local inference path supports different hardware setups. If you have a powerful GPU, run everything locally for greater control. If you lack the hardware or want a managed solution, the hosted service handles inference externally. Either way, the model predicts a coordinate or bounding box that can drive downstream automation.
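To make the resizing idea concrete, here is a simplified, hedged sketch of what a smart_resize-style function does; the real implementation shipped with Qwen handles rounding and edge cases more carefully, so treat this as a conceptual approximation only:

import math

def simple_smart_resize(height: int, width: int, factor: int = 28,
                        min_pixels: int = 3136, max_pixels: int = 12845056):
    # Conceptual approximation: keep the pixel count within
    # [min_pixels, max_pixels] and make both sides multiples of `factor`
    # (the vision patch size times the merge size).
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        scale = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w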
Minimal Example Code
Below is a minimal excerpt showing how to perform a single run with the local model:
import json
import torch
from PIL import Image
from transformers import Qwen2_5_VLProcessor, \
    Qwen2_5_VLForConditionalGeneration
from qwen_agent.llm.fncall_prompts.nous_fncall_prompt import \
    NousFnCallPrompt, Message, ContentItem
from utils.agent_function_call import ComputerUse

processor = Qwen2_5_VLProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
screenshot = "assets/computer_use/computer_use2.jpeg"
user_query = "open the third issue"
# The rest of perform_gui_grounding logic is the same
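Assuming the perform_gui_grounding function defined earlier is in scope, the actual run is then a single call:

output_text, display_image = perform_gui_grounding(screenshot, user_query, model, processor)
print(output_text)
display(display_image)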
When this entire pipeline is executed, a user can provide any desktop screenshot and a command describing what needs to be clicked or opened.
Qwen2.5-VL interprets the screenshot content, returns a coordinate (or, in principle, multiple UI elements), and the code draws an indicator on that location. The user then knows exactly where the model "thinks" that action should happen.
Do check out the official Qwen repo for more examples.
See you tomorrow.