Field notes from a macOS AI agent

A new LLM model shipped. Is it better than the one you run today?

2026 has been the fastest run of model releases yet. A new frontier or open-weight model lands most weeks. Every guide that ranks for this topic is a release calendar or a benchmark board: it tells you a model exists and how it scored on a shared test set. None of them answers the question you actually have the morning a model drops, which is whether it is better for your work. This guide is about that question, and the one-click way Fazm lets you settle it.

forkSession(fromKey:toKey:) · per-window selectedModel · macOS 14+ · verified 2026-05-15
Matthew Diakonov
9 min read

The short answer

New LLMs have shipped at a near-weekly pace through 2026, from Anthropic, Google, OpenAI, Alibaba, Meta and smaller labs. The practical move when one ships is not to read its benchmark card. It is to run a task you actually do on the new model and on your current one, then compare the output. In Fazm, a native macOS app, that is a one-click fork: branch a live conversation, point the fork at the new release, and both models answer the identical task with identical context.

Want the running list of what shipped? A static page goes stale fast; a live registry does not. The models hub at huggingface.co/models stays current. This page is about what you do next.

Why the leaderboard cannot answer your question

A public benchmark measures a model against a fixed, shared task set. That is genuinely useful for ranking models against each other in the abstract. It is close to useless for the decision in front of you, which is narrower: should I switch the model I run every day? Your work is a different distribution from any benchmark. It has your codebase conventions, your prompt habits, your MCP setup, the particular class of bug you keep hitting, the documents you keep reformatting. A model can top a coding board and still be worse at the one refactor you do every Tuesday.

The honest test has three properties. It runs on a task you care about, not a synthetic one. It holds the context constant, so the only thing that changes between runs is the model. And it puts the two outputs next to each other so you can read the difference rather than infer it from a score. That is hard to do by hand: you would re-create a project, re-paste the context, re-explain what you were doing, and even then the two runs would drift. The rest of this guide is a way to get all three properties from one button.

The workflow, in four beats

Fork and compare

A new model ships

It appears in Fazm's model dropdown on its own. The list is reported live by the agent, not compiled into the app, so there is nothing to download first.

The fork-and-compare procedure

Five steps. None of them involve setting up a project or copying context by hand. The whole point is that the comparison starts from work you already did.

1. The new model is already in your dropdown

Fazm does not bake its model list into the app binary. The list arrives live from the agent: ChatProvider registers setModelsAvailableHandler, the bridge reports the current models, and the Swift side ingests them. A Claude tier that went generally available this morning shows up in the per-window model picker on the next session, with no app update. The 2.4.0 changelog states it plainly: available AI models populate dynamically from the agent.

2. Open a real task you already finished

Find a pop-out chat that already holds a conversation, a piece of work you actually completed on your current model. That conversation is your baseline. The point of the test is a task that matters to you, not a synthetic prompt copied off a benchmark page.

3. Fork it with one click

Click the fork button in the pop-out, the small branch glyph with the tooltip Fork chat. Fazm calls forkSession(fromKey:toKey:): the conversation is branched server-side at its last message, and a new pop-out window opens with the full prior history copied in. The original window is left exactly as it was, still bound to its own session.

4. Point the fork at the new model

Every pop-out window keeps its own selectedModel inside WorkspaceSettingsState. Open the fork window's model dropdown and pick the new release. The original window stays on your current model. You now have two windows holding the identical conversation and diverging only on which model answers next.

5. Send the same prompt in both, then read the output

Type the identical next message in each window and send. Same task, same context, two models. Compare the actual answers, the diff each one produces, the tool calls it chooses, the things it gets wrong. That is evidence about your work. A leaderboard score is not.
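The fork in step 3 can be pictured with a small in-memory model. This is a minimal sketch, not Fazm's implementation: the guide only names forkSession(fromKey:toKey:) and says the branch happens server-side, so the SessionStore class, the Message type, the Bool return value and the session keys below are all assumptions made for illustration.

```swift
// Hypothetical message type; the real session payload is not shown in this guide.
struct Message: Equatable {
    let role: String
    let text: String
}

// In-memory stand-in for the session store. In Fazm the branch happens
// server-side; this only models the behavior the guide describes.
final class SessionStore {
    private(set) var sessions: [String: [Message]] = [:]

    func append(_ message: Message, to key: String) {
        sessions[key, default: []].append(message)
    }

    // Branches `fromKey` at its last message under a fresh `toKey`.
    // Returns false if the source is missing or the target key is taken.
    @discardableResult
    func forkSession(fromKey: String, toKey: String) -> Bool {
        guard let history = sessions[fromKey], sessions[toKey] == nil else {
            return false
        }
        sessions[toKey] = history  // arrays are value types, so this is an independent copy
        return true
    }
}

let store = SessionStore()
store.append(Message(role: "user", text: "Refactor the auth module"), to: "baseline")
store.append(Message(role: "assistant", text: "Done, see the diff"), to: "baseline")

let forked = store.forkSession(fromKey: "baseline", toKey: "fork")

// The same next prompt goes to both branches; the replies then diverge per model.
let nextPrompt = Message(role: "user", text: "Now add rate limiting")
store.append(nextPrompt, to: "baseline")
store.append(nextPrompt, to: "fork")
store.append(Message(role: "assistant", text: "reply from current model"), to: "baseline")
store.append(Message(role: "assistant", text: "reply from new model"), to: "fork")

let baselineHistory = store.sessions["baseline"] ?? []
let forkHistory = store.sessions["fork"] ?? []
print(forked, baselineHistory.count, forkHistory.count)
```

The property worth noticing is that both histories share an identical prefix up to and including the new prompt, and only the final reply differs. That is the controlled comparison the five steps set up.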

One window, one model: the part that makes it work

The reason the comparison stays clean is a small design decision in the source. A detached pop-out window in Fazm is not bound to a single global model setting. Each window owns its own selectedModel field inside WorkspaceSettingsState, and the window's model dropdown is bound straight to that per-window value. When you fork a chat, the new window inherits the workspace but keeps its model independently settable. Switch the fork to the new release and the original window does not move.
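A rough sketch of why per-window state keeps the two runs independent. Only the names WorkspaceSettingsState and selectedModel come from Fazm; the PopOutWindow class, its fork() method, the assumption that a fork starts from a copy of the source window's settings, and the "new-release" model id are hypothetical.

```swift
// Hypothetical shape; only the type name and the selectedModel field
// are taken from the guide.
struct WorkspaceSettingsState {
    var workspace: String
    var selectedModel: String
}

// Stand-in for a detached pop-out window that owns its own settings value.
final class PopOutWindow {
    let title: String
    var settings: WorkspaceSettingsState

    init(title: String, settings: WorkspaceSettingsState) {
        self.title = title
        self.settings = settings
    }

    // Assumption: the fork starts from a copy of the source settings.
    // Because the settings are a struct, the copy is independent, so a
    // later change in one window never reaches the other.
    func fork() -> PopOutWindow {
        PopOutWindow(title: title + " (fork)", settings: settings)
    }
}

let baselineWindow = PopOutWindow(
    title: "Refactor auth module",
    settings: WorkspaceSettingsState(workspace: "~/projects/api", selectedModel: "sonnet")
)
let forkWindow = baselineWindow.fork()
forkWindow.settings.selectedModel = "new-release"  // placeholder model id

print(baselineWindow.settings.selectedModel, forkWindow.settings.selectedModel)
```

If the model lived in one shared global setting instead, the last assignment would silently switch the baseline too and the comparison would be lost.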

The two cards below are what Fazm's own popOutsSummary() reports for a baseline window and its fork, mid-comparison. Notice the diff: same task title root, same workspace, same history count, because the fork branched at the same message. Apart from the (fork) suffix on the title, one line differs: selectedModel.

Window A: baseline

    {
      "title": "Refactor auth module",
      "workspace": "~/projects/api",
      "selectedModel": "sonnet",
      "chatHistoryCount": 14,
      "isAILoading": false
    }

Window B: fork on the new release

    {
      "title": "Refactor auth module (fork)",
      "workspace": "~/projects/api",
      "selectedModel": "default",
      "chatHistoryCount": 14,
      "isAILoading": false
    }

"Snapshot of every open pop-out, safe to serialize and emit. Used by the listPopOuts control command for external testing and regression A/B harnesses."

Source comment above popOutsSummary() in Desktop/Sources/FloatingControlBar/DetachedChatWindow.swift. The per-window model is not an afterthought; the code treats a set of pop-outs as an A/B harness on purpose.
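If you capture those snapshots from an external harness, checking that only the model changed can be automated. The following is a sketch under assumptions: the PopOutSummary struct simply mirrors the five fields shown in the cards, and decodeSummary and differingFields are hypothetical helpers, not part of Fazm.

```swift
import Foundation

// Mirrors the five fields shown in the two cards; the real type may differ.
struct PopOutSummary: Codable, Equatable {
    let title: String
    let workspace: String
    let selectedModel: String
    let chatHistoryCount: Int
    let isAILoading: Bool
}

// Hypothetical helper: decode one snapshot, or nil if the JSON does not match.
func decodeSummary(_ json: String) -> PopOutSummary? {
    try? JSONDecoder().decode(PopOutSummary.self, from: Data(json.utf8))
}

// Hypothetical helper: names of the fields whose values differ.
func differingFields(_ a: PopOutSummary, _ b: PopOutSummary) -> [String] {
    var fields: [String] = []
    if a.title != b.title { fields.append("title") }
    if a.workspace != b.workspace { fields.append("workspace") }
    if a.selectedModel != b.selectedModel { fields.append("selectedModel") }
    if a.chatHistoryCount != b.chatHistoryCount { fields.append("chatHistoryCount") }
    if a.isAILoading != b.isAILoading { fields.append("isAILoading") }
    return fields
}

let windowAJSON = """
{"title": "Refactor auth module", "workspace": "~/projects/api",
 "selectedModel": "sonnet", "chatHistoryCount": 14, "isAILoading": false}
"""
let windowBJSON = """
{"title": "Refactor auth module (fork)", "workspace": "~/projects/api",
 "selectedModel": "default", "chatHistoryCount": 14, "isAILoading": false}
"""

var differences: [String] = []
if let a = decodeSummary(windowAJSON), let b = decodeSummary(windowBJSON) {
    differences = differingFields(a, b)
}
print(differences)
```

A clean fork should report only the title suffix and selectedModel. A differing workspace or chatHistoryCount would mean the two windows no longer share the same context.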

That same per-window value is part of the window registry Fazm writes to disk. On the next launch every pop-out is restored on the model it carried. So a fork-and-compare you start in the morning is still intact after a restart: the baseline window on your current model, the fork on the new release, both exactly where you left them. The longer-form version of that persistence story is in the persistent sessions guide.
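The persistence idea is an ordinary serialize-and-restore round trip. Here is a minimal sketch assuming a JSON file and a simplified record: WindowRecord, saveRegistry, loadRegistry, the file name and the "new-release" model id are illustrative names, and the on-disk format Fazm actually uses is not described in this guide.

```swift
import Foundation

// Simplified, hypothetical registry entry: one record per pop-out window.
struct WindowRecord: Codable, Equatable {
    let title: String
    let workspace: String
    let selectedModel: String
}

func saveRegistry(_ records: [WindowRecord], to url: URL) throws {
    let data = try JSONEncoder().encode(records)
    try data.write(to: url, options: .atomic)
}

func loadRegistry(from url: URL) throws -> [WindowRecord] {
    let data = try Data(contentsOf: url)
    return try JSONDecoder().decode([WindowRecord].self, from: data)
}

// A throwaway file in the temporary directory stands in for the real registry.
let registryURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("popout-registry-\(UUID().uuidString).json")

let openWindows = [
    WindowRecord(title: "Refactor auth module",
                 workspace: "~/projects/api", selectedModel: "sonnet"),
    WindowRecord(title: "Refactor auth module (fork)",
                 workspace: "~/projects/api", selectedModel: "new-release"),
]

var restoredWindows: [WindowRecord] = []
do {
    try saveRegistry(openWindows, to: registryURL)           // before quitting
    restoredWindows = try loadRegistry(from: registryURL)    // on the next launch
    try FileManager.default.removeItem(at: registryURL)      // clean up the demo file
} catch {
    print("registry round trip failed: \(error)")
}
```

Because the model is stored per record rather than once per app, each restored window comes back on its own model.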

What you can actually put in that dropdown

A fork is only a useful test if the new model is reachable. Fazm feeds its picker from three lanes, and a release in any of them becomes a fork target without a reinstall.

Models a fork can switch to

  • Claude tiers shown as the Scary, Fast and Smart pills (Haiku, Sonnet, Opus)
  • GPT-family models via the bundled Codex backend, using your ChatGPT subscription
  • Any model behind a custom API endpoint: a local runtime like LM Studio, a corporate proxy, an Anthropic-compatible gateway
  • New Claude models the moment the agent bridge reports them, with no app update
  • A separate model per pop-out window, each choice persisted in the window registry across a restart

The Claude lane is the one that updates with zero friction: the picker is populated from a models-available frame the agent bridge emits, so a newly released Claude tier appears on the next session. The mechanics of that, including what happens to a stored preference when an alias is renamed underneath it, are traced in the companion April 2026 model release guide.
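The handler-driven picker can be sketched as follows. The names setModelsAvailableHandler and updateModels appear in this guide, but their real signatures do not, so the AgentBridge and ModelPicker classes, the ModelInfo type, the emitModelsAvailable method and the model ids are assumptions for illustration only.

```swift
// Hypothetical model descriptor; the real frame payload is not shown in this guide.
struct ModelInfo: Equatable {
    let id: String
    let label: String
}

// Stand-in for the agent bridge: it stores one callback and invokes it
// whenever the agent reports its current models.
final class AgentBridge {
    private var modelsAvailableHandler: (([ModelInfo]) -> Void)?

    func setModelsAvailableHandler(_ handler: @escaping ([ModelInfo]) -> Void) {
        modelsAvailableHandler = handler
    }

    // Simulates a models-available frame arriving from the agent.
    func emitModelsAvailable(_ models: [ModelInfo]) {
        modelsAvailableHandler?(models)
    }
}

// Stand-in for the Swift side that feeds the dropdown.
final class ModelPicker {
    private(set) var models: [ModelInfo] = []

    func updateModels(_ fresh: [ModelInfo]) {
        models = fresh  // replace with the reported list; nothing is compiled in
    }
}

let bridge = AgentBridge()
let picker = ModelPicker()
bridge.setModelsAvailableHandler { picker.updateModels($0) }

// First session: the agent reports two models.
bridge.emitModelsAvailable([
    ModelInfo(id: "haiku", label: "Haiku"),
    ModelInfo(id: "sonnet", label: "Sonnet"),
])
let countBeforeRelease = picker.models.count

// A later session, after a release: the same code path picks up the new entry.
bridge.emitModelsAvailable([
    ModelInfo(id: "haiku", label: "Haiku"),
    ModelInfo(id: "sonnet", label: "Sonnet"),
    ModelInfo(id: "new-tier", label: "New tier"),
])
let countAfterRelease = picker.models.count
```

The design point is that the picker holds no list of its own. Whatever the bridge reports is what the dropdown shows, which is why a release needs no reinstall.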

Where this comparison is honest, and where it is not

The fork-and-compare is a real test, but it is a specific kind of test, and it helps to be clear about its edges. It is a qualitative read on one task, not a statistical result. Run it on a single prompt and you have an anecdote; run it on five representative tasks and you have a pattern worth acting on. Pick the tasks deliberately, because a new model's strengths will not show up on a problem that was already easy for the old one.

Three more caveats. First, you are spending tokens twice, once per window, so this is a test you run on the handful of tasks that matter, not on everything. Second, a new model can behave differently in ways a single turn hides: different tool-use habits, different verbosity, different latency under load. Forking holds the context constant, which is the hard part, but you still have to watch more than the final answer. Third, Fazm runs only on macOS 14 or newer, so the workflow itself is Mac-bound. Within those limits, running your own task twice is still a sharper signal than any score someone else published.

Questions people ask about new model releases

What new LLM models were released in 2026?

2026 has run the fastest model-release cadence to date. New frontier and open-weight models have landed most weeks across Anthropic, Google, OpenAI, Alibaba, Meta and several smaller labs, with multimodal input and large context windows becoming standard rather than premium. Because the list moves week to week, a static article is the wrong place to track it. Public registries and trending lists, such as Hugging Face's models hub at https://huggingface.co/models, stay current. The durable question this guide answers is not which models exist but what to do the day one of them ships: how to find out whether it is actually better for the work you do.

Does a benchmark score tell me if a new model is better for my work?

Not reliably. A public benchmark measures a model against a fixed, shared task set. Your work is a different distribution: your codebase conventions, your prompt style, your tool setup, the specific bugs you hit. A model can top a coding leaderboard and still be worse at the particular refactor you do every week, or better on reasoning yet slower in a way that breaks your flow. The only test that maps to your decision is your own task, run on the new model and on your current one, with the output compared directly. Everything else is a proxy.

How do I test a new model release without redoing my whole setup?

Inside Fazm you fork an existing conversation instead of starting over. Open a pop-out chat that already holds a finished task, click the fork button, and Fazm branches the conversation server-side and opens a new window with the full history copied in. Switch that fork's model dropdown to the new release, leave the original on your current model, and send the same next prompt in both. You are comparing two models on identical context in under a minute, with no new project, no re-pasting, and no re-explaining what you were doing.

Where is the fork button in Fazm and what does it do exactly?

The fork button lives in the pop-out chat header, rendered as ForkChatButton with the SF Symbol arrow.triangle.branch and the tooltip Fork chat. It is only shown once the conversation has history. Clicking it calls forkSession(fromKey:toKey:) in ChatProvider, which asks the bridge to run session/fork on the live session. The bridge branches the conversation at the last message and registers the new branch under a fresh session key, while the source key stays live. A new detached window opens bound to that new key, with the source window's chat history snapshot copied in. Neither branch is destroyed; the source session id stays reachable in Conversation History.

Can two Fazm windows run different models at the same time?

Yes. Each detached pop-out window owns its own selectedModel field inside WorkspaceSettingsState. The window's model dropdown is bound directly to that per-window value, not to a single global setting. So one window can run your current model while a forked sibling runs a new release, both answering at the same time. There is even an applyModelToAllWindows helper for the opposite case, pushing one model to every open window at once, which only makes sense because per-window is the default.
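For contrast, the broadcast case can be sketched in a few lines. The guide names an applyModelToAllWindows helper but does not show it, so the signature below, the PopOut class and the "new-release" model id are hypothetical.

```swift
// Hypothetical stand-in for a pop-out window with its own model setting.
final class PopOut {
    let title: String
    var selectedModel: String

    init(title: String, selectedModel: String) {
        self.title = title
        self.selectedModel = selectedModel
    }
}

// Hypothetical signature: push one model to every open window.
func applyModelToAllWindows(_ model: String, windows: [PopOut]) {
    for window in windows {
        window.selectedModel = model
    }
}

let windows = [
    PopOut(title: "Baseline", selectedModel: "sonnet"),
    PopOut(title: "Fork", selectedModel: "new-release"),
]

let modelsBefore = windows.map { $0.selectedModel }
applyModelToAllWindows("sonnet", windows: windows)
let modelsAfter = windows.map { $0.selectedModel }
print(modelsBefore, modelsAfter)
```

A helper like this is an explicit opt-in. Without it, each window keeps whatever model it was last given, which is the behavior the comparison relies on.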

Does a newly released model show up in Fazm without an app update?

For Claude tiers, yes. Fazm's model picker is populated from a models-available frame the agent bridge emits, not from a static array compiled into the app. When the agent SDK starts reporting a new Claude model, the Swift updateModels handler ingests the fresh list and the picker updates on the next session. The 2.4.0 release notes describe exactly this: available AI models now populate dynamically from the agent, so newly released Claude models appear without an app update. GPT-family models surface the same way once the Codex backend toggle is on.

What models can I actually pick in Fazm today?

Three sources feed the dropdown. First, the Claude tiers, exposed as the Scary, Fast and Smart pills that map to Haiku, Sonnet and Opus. Second, the GPT family through the bundled Codex backend, which routes those models through OpenAI's Codex CLI using your ChatGPT subscription and now gets the same MCP servers and system prompt as Claude. Third, any model behind a custom API endpoint, which lets you point Fazm at a local runtime like LM Studio, a corporate proxy, or any Anthropic-compatible gateway. A new release in any of those three lanes becomes a fork target.

Do my model choices survive a Mac restart?

Yes. The per-window selectedModel is part of the window registry that Fazm serializes to disk, alongside each pop-out's workspace directory and frame. On launch the app restores every window, and each one comes back on the model you set it to. So a fork-and-compare you start in the morning is still intact after a restart: the baseline window on your current model, the fork on the new release, both where you left them. This is the same persistence layer that keeps full conversation history alive across restarts.
