Qwen-Image-Edit 20B Model released with Apache 2.0 License
Open-source Qwen-Image-Edit 20B model released, and why OpenAI and Perplexity are going after browsers
Read time: 10 min
Browse past editions here.
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
In today's Edition (19-Aug-2025):
Qwen-Image-Edit 20B Model released with Apache 2.0 License
Agents want your tabs: why Chrome and Edge still hold the high ground, and why OpenAI and Perplexity are going after browsers
Qwen-Image-Edit 20B Model released with Apache 2.0 License
Qwen released Qwen-Image-Edit, a 20B image editor that can change an image's meaning and also make pixel-level fixes. It also brings bilingual, font-preserving text editing, and reports state-of-the-art results.
The standout is the two-path control design inside a 20B editor, pairing semantic control via Qwen2.5-VL with appearance control via a VAE encoder. The first path understands the scene and preserves identity, the second path edits only the exact pixels you target, so untouched regions stay visually the same.
So the model routes the same input image through Qwen2.5-VL for semantic control and a VAE Encoder for appearance control. Semantic control means the model understands the scene and the identity of objects, so it can rotate an object, keep who it is, and redraw the view.
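The dual-path routing can be pictured with a toy sketch. This is not Qwen's actual code; the two encoders below are hypothetical stand-ins for Qwen2.5-VL (one global semantic vector) and the VAE encoder (a spatial latent grid), combined into a single conditioning input.

```python
import numpy as np

def semantic_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for Qwen2.5-VL: compress the image to one global vector."""
    return image.mean(axis=(0, 1))  # shape (channels,)

def vae_encoder(image: np.ndarray, factor: int = 8) -> np.ndarray:
    """Stand-in for the VAE: downsample to a spatial latent grid."""
    h, w, c = image.shape
    return image[:h - h % factor, :w - w % factor].reshape(
        h // factor, factor, w // factor, factor, c
    ).mean(axis=(1, 3))  # shape (h // factor, w // factor, channels)

def build_conditioning(image: np.ndarray) -> dict:
    """Route the same image through both paths, as the article describes."""
    return {
        "semantic": semantic_encoder(image),  # identity / scene understanding
        "appearance": vae_encoder(image),     # pixel-faithful spatial detail
    }

image = np.random.rand(64, 64, 3)
cond = build_conditioning(image)
print(cond["semantic"].shape, cond["appearance"].shape)  # (3,) (8, 8, 3)
```

The point of the split: a global vector can say "this is the same mascot from a new angle," while the spatial latent pins down exactly how the untouched pixels look.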
And the appearance control touches only the pixels a user targets, keeping everything else untouched. Semantic edits cover intellectual property creation and style changes. Think of making a brand character or product that stays the same across many images. The tool can take a single reference and produce new poses, outfits, backgrounds, and camera angles that still look like the same character or item. That is useful when a team needs a consistent mascot, product hero shots, or a cast of characters.
It can re-render a subject into a new art style, or rotate objects by 90 or 180 degrees to show front or back, while keeping identity consistent. That is hard because the model must reason about 3D structure from a single image, then synthesize the hidden side in a way that still matches the original subject.
Appearance edits cover practical fixes. It can add or remove small objects, clean flyaway hair, swap backgrounds, or change clothing, without spilling changes outside the selected area. This relies on tight masking and consistency checks so untouched regions remain visually identical.
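The guarantee that edits never spill outside the selection can be stated as a simple contract: blend the edited output into the original only inside the user's mask. This is a minimal sketch of that contract, not Qwen's internal masking machinery.

```python
import numpy as np

def apply_masked_edit(original: np.ndarray, edited: np.ndarray,
                      mask: np.ndarray) -> np.ndarray:
    """mask is boolean (H, W); True marks the pixels the user targeted.
    Everything outside the mask stays bit-identical to the original."""
    out = original.copy()
    out[mask] = edited[mask]
    return out

original = np.zeros((4, 4, 3))
edited = np.ones((4, 4, 3))
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True  # user draws a box over the center

result = apply_masked_edit(original, edited, mask)
assert np.array_equal(result[~mask], original[~mask])  # untouched regions identical
assert np.array_equal(result[mask], edited[mask])      # targeted pixels changed
```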
Text editing is a standout. It can add, delete, or modify English and Chinese text, keeping the original font, size, and style, and even adjust tiny characters or a single letter's color.
Workflows can chain steps. Users draw boxes over problem areas, apply targeted edits, then repeat until every detail is correct, which is how they fix tricky calligraphy characters.
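That chained workflow reduces to a loop: apply one targeted edit per drawn box, in sequence, on the running result. A sketch, where `model_edit` is a hypothetical stand-in for the real editor call:

```python
import numpy as np

def model_edit(region: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the editor: here it just brightens the region."""
    return np.clip(region + 0.5, 0.0, 1.0)

def chained_edits(image: np.ndarray, boxes) -> np.ndarray:
    """Apply one targeted edit per (y0, y1, x0, x1) box, in sequence,
    so each pass refines the previous result."""
    out = image.copy()
    for y0, y1, x0, x1 in boxes:
        out[y0:y1, x0:x1] = model_edit(out[y0:y1, x0:x1])
    return out

image = np.zeros((8, 8))
fixed = chained_edits(image, boxes=[(0, 2, 0, 2), (4, 6, 4, 6)])
print(fixed[0, 0], fixed[7, 7])  # 0.5 0.0 (edited box vs untouched pixel)
```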
Agents want your tabs: why Chrome and Edge still hold the high ground, and why OpenAI and Perplexity are going after browsers
The real question
Venture capital folks used to ask: is your product a business, or just a feature? That same pressure is now hitting AI. Model companies want to sit in the center, then turn everything else into a feature hanging off the model, like editing, coding, and even the browser itself. OpenAI is reportedly preparing a full browser that leans on ChatGPT and agent skills, challenging Chrome directly, and aimed at reshaping how people navigate the web.
Perplexity already shipped a browser called Comet, and made it accessible right now for Perplexity Max at $200/month, with a waitlist for everyone else. At the same time OpenAI is pushing into collaborative docs and chat inside ChatGPT, moving toward a Workspace competitor. Anthropic is taking a different swing, turning Claude into a platform where anyone can build and share interactive "artifacts", which are lightweight apps that run right inside Claude.
I see 3 separate bets here. Coding as a near hands-off workflow. Docs as a model-first productivity layer. Browsers as an agent shell. They look similar at a glance, but the odds are different for each one.
Coding, where models have a path to 99%
Coding has 2 properties that fit models well. First, code is flexible. There are many correct ways to implement a function, which means the model can pick any valid path. Second, code is verifiable. The program runs or it fails. That gives fast feedback to train on, which keeps improving the model and the scaffolding around it.
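Both properties show up in the simplest possible verification loop: sample candidate programs, run them against tests, keep the first one that passes. A sketch, with hardcoded strings standing in for model samples:

```python
def passes_tests(src: str) -> bool:
    """Run a candidate implementation of add(a, b) against fixed tests.
    Passing or failing is the automatic, fast feedback signal."""
    scope: dict = {}
    try:
        exec(src, scope)
        fn = scope["add"]
        return fn(2, 3) == 5 and fn(-1, 1) == 0
    except Exception:
        return False

candidates = [
    "def add(a, b): return a - b",   # wrong path, fails verification
    "def add(a, b): return a + b",   # one of many valid implementations
]

# Keep sampling until a candidate verifies, mimicking the model's feedback loop.
winner = next(src for src in candidates if passes_tests(src))
print(winner)
```

Flexibility means many candidates can win; verifiability means the loop needs no human judge. That combination is what makes coding unusually trainable.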
Anthropic is leaning into this by turning Claude into an app surface where the model writes, runs, and iterates inside the same canvas. That removes some of the glue code and tool juggling that humans had to do. The release makes it much easier to build small tools, share them, and iterate, and it is already live across Team and Enterprise plans. This push makes sense because the gap between model output and shippable code can shrink fast once you wire tests, run loops, and environment setup into the flow.
The practical upshot is simple. For coding, getting to 95% done by the model feels realistic in more cases, with the human stepping in on architecture choices, tricky debugging, and integration edges. That favors a chat-orchestrated workspace where the "IDE" fades into the background, because the main event is the model's tool loop rather than manual line editing.
Docs, where the editor still matters a lot
OpenAI wants ChatGPT to be the starting point for docs and presentations, with real-time co-editing and chat living side by side. The idea is clear: the hardest part is the blank page, so let the model draft and structure immediately, then let people tweak in context.
The challenge, and the reason this will be slower to flip, is that editing remains highly manual. People rearrange sections, fix tables, and massage wording with quick, fine-grained moves. Edits through conversation work, but they can still feel slower than direct manipulation when you are making lots of small changes. Meanwhile, the incumbents are not standing still. Chrome is getting native Gemini assistance that can read the page and help in place, starting with Google AI Pro and Ultra subscribers in the US. Edge is adding Copilot Mode that works across tabs and helps act on what you are viewing, right inside the browser chrome, not just a sidebar.
So, yes, ChatGPT can solve the blank page and give you strong drafts. But the minute you need heavy editing, the best experience often comes from tight editor integration, and that is exactly where Google Workspace and Microsoft 365 have home-field advantage. The model is important, but the editor and file plumbing still decide daily comfort.
Browsers, the hardest hill to climb
Perplexity's Comet and OpenAI's incoming browser share a thesis: let the agent browse for you, execute steps, and compress the session into a conversation. The pitch sounds great: fewer clicks, more outcomes. The friction shows up in 3 places.
First, reliability. Real web tasks are messy. Success rates are improving, but they are not "trust it blindly" yet. Recent summaries of WebArena-style evaluations put end-to-end success around ~60% for complex multi-step web tasks, which is progress, but still leaves a lot of retries and handoffs to the human. Security researchers are also documenting fresh attack surfaces for browsing agents, which adds guardrail work that slows things down.
Second, incumbents already embed assistants natively. Chrome's Gemini and Edge's Copilot Mode live where the user already spends time and can see the page, tabs, and history with fewer permission hoops. That lowers latency and cuts context loss, which are 2 big usability wins.
Third, websites themselves are shipping model-powered flows. A Reddit thread can be summarized inside Reddit. A maps route can be composed inside Maps. A generic browser agent that tries to control every site from the outside will keep getting outclassed by task-specific UIs and APIs that the site owners tune for their data and their latency. That does not mean agents are useless, it means the best path often goes through site-level integrations, not around them.
Why build a browser anyway
Training agents needs real attempts, not just sandbox runs. You want to see where the model stalls, where people interrupt, and which tools it reaches for.
Owning the browser gives that signal stream. OpenAI's recent push on "agent" behavior inside ChatGPT makes the goal plain: the assistant should plan, choose tools, and carry out multi-step work on its own computer. Perplexity benefits too, because agent logic and browsing telemetry are a moat when you depend on other people's foundation models. The user experience may be rough early on, but the data helps the agent mature.
Perplexity and OpenAI want 3 things, control of distribution, control of data, and control of integration. A browser gives all 3 at once. It is the front door to search traffic, a live sensor for how users actually work on the web, and the cleanest place to wire an agent that can read a page, click, type, buy, and then learn from what happened. That is why OpenAI is building a browser and why Perplexity shipped Comet.
1) Distribution power, not just another app
Owning the browser means you decide the default assistant, the default search, and the default start surfaces. Chrome has ~68% global share, so whoever controls a mainstream browser controls a giant chunk of daily intent. This is why default placements are worth real money: Google reportedly paid Apple $20B in 2022 to be Safari's default search. If OpenAI or Perplexity owns the surface, they can route more of that intent to their own answer engines by default.
2) First-party data to train agents
Agents improve when you see where they fail, where people intervene, and which steps actually lead to success. A browser gives that stream, page events, clicks, scrolls, form edits, tab switches, and final outcomes. Reuters reported the OpenAI browser aim plainly, more direct access to user data. That data feeds model training and evaluation loops for agent skills.
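The training-relevant slice of that stream is the moments where a human steps in. A sketch of what such a first-party event record might look like; the schema and field names are illustrative, not any vendor's actual format.

```python
from dataclasses import dataclass

@dataclass
class BrowserEvent:
    """One illustrative telemetry record: page events, clicks, scrolls,
    form edits, tab switches, and final outcomes, as the article lists."""
    session_id: str
    kind: str           # "click" | "scroll" | "form_edit" | "tab_switch" | "outcome"
    target: str         # selector or URL the event touched
    agent_action: bool  # True if the agent, not the human, did it

def human_interventions(events: list) -> list:
    """Where a person stepped in mid-task: the signal agent training wants."""
    return [e for e in events if not e.agent_action]

log = [
    BrowserEvent("s1", "click", "button#checkout", agent_action=True),
    BrowserEvent("s1", "form_edit", "input#address", agent_action=False),  # human fix
    BrowserEvent("s1", "outcome", "https://shop.example/confirm", agent_action=True),
]
print(len(human_interventions(log)))  # 1
```

One intervention in this toy log marks exactly the step where the agent's plan was wrong, which is the correction data a first-party browser can collect.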
3) The right place to wire an agent that can "use a computer"
OpenAIās own roadmap is to let ChatGPT plan multi-step work, pick tools, and operate a computer to finish tasks. A browser is the universal tool where bookings, purchases, dashboards, consoles, and docs already live. Running the agent close to the tabs cuts context loss and gives predictable control over popups, frames, and permissions. Microsoft and Google are doing the same play from inside their browsers with Copilot Mode in Edge and Gemini in Chrome. If OpenAI and Perplexity do not own a browser, they are playing inside someone elseās shell, with fewer hooks and slower loops.
4) A training ground you can scale, not just a demo lab
Agentic browsing is getting better but still misses a lot. Recent reports and papers put end-to-end success on realistic multi-step web tasks around ~60% in the best cases, and near 0-5% on broad info-seeking that spans many sites. That gap will not close without high-volume, real interactions and corrections, exactly the kind of telemetry a first-party browser can collect safely with consent prompts.
5) Direct monetization levers
Browsers monetize in 3 obvious ways. First, default search and answers: you keep the query and the ad or affiliate upside, instead of sending it to Google or Bing. Second, shopping rails: when an agent buys something for you, the browser can capture affiliate fees. Third, premium access: Perplexity tied Comet to a $200/month plan at launch for priority compute and features, a hint that real agentic browsing is still GPU-heavy and worth pricing.
6) Platform risk and policy headroom
On iOS, third-party browsers have long been constrained by WebKit. The EU's DMA cracked that door open a bit by allowing alternative engines in the EU, although developers argue there are still roadblocks. If you depend entirely on Chrome, Safari, or Edge, one policy change can break your assistant. Owning a browser gives you more room to innovate on your schedule and hedge OS policy risk, even if mobile distribution stays hard.
7) Competitive pressure from Chrome and Edge
Google is shipping Gemini into Chrome, from page-aware summaries to multi-tab help. Microsoft flipped on Copilot Mode in Edge, turning the frame of the browser into an active helper. If the incumbents bake assistants into the chrome itself, independent labs risk being sidelined to a sidebar. A first-party browser is how OpenAI and Perplexity try to meet that integration head-on.
8) Why now, not 2 years ago
Two shifts converged in 2025. First, model routing and agent stacks got good enough to actually operate apps and websites for several minutes without hand-holding. OpenAI's "computer-using agent" and the new ChatGPT Agent formalize this: plan, browse, fill forms, run tools, and retry. Second, publishers and infra vendors started pushing back on bots, which makes passive crawling less reliable. If you put the agent inside the user's browser session, many of those blocks disappear because the user is the one visiting the page.
So why does this feel controversial or risky?
Because the hard parts are very real.
Agents still fail often. Even the best single-agent systems stall on long, messy tasks, and newer broad info-seeking benchmarks show success near 0-5%. You also inherit new security problems like prompt injection in the wild. That is a big deal when an agent has cookies, autofill, and purchasing power.
Websites are fighting scraping and generic agents. Cloudflare publicly accused Perplexity of evading no-crawl rules with stealth crawlers, and Reddit is closing off another backdoor by restricting the Internet Archive after seeing AI companies siphon content via snapshots. Even if you run the agent through a userās browser, many sites will keep adding defenses that raise friction and cost.
Distribution is a wall. Chrome holds ~68% share and Safari controls iOS defaults. Shifting user habit at that scale is slow and expensive, which is exactly why default search deals trade at $20B levels.
Cost matters. Running a real browsing agent is not just a chat call; it is a long session with tool use, vision, planning, retries, and state. Perplexity's choice to gate Comet behind a $200/month tier at launch is a loud tell that serving costs and tuning are still high.
That's a wrap for today, see you all tomorrow.