The Best Open-Source LLMs in 2026: 8 Open-Weight Models Ranked
The best open-weight LLMs of 2026, ranked. GLM-5.2, DeepSeek V4, Kimi and more, with real benchmarks, the license trap, and the cost math.

The best open-weight language model you can run in 2026 is no longer American, and it is no longer a fallback. GLM-5.2 scores higher on SWE-bench Pro than GPT-5.5, ships under an MIT license, and costs a flat $18 a month. The frontier you used to rent is now a download.
The verdict
If you want one open model for serious agentic and coding work, run GLM-5.2. If you want the cheapest capable model for high-volume work, run DeepSeek V4 Flash. If you need to keep everything on your own hardware on a single GPU, run gpt-oss-120b. If license cleanliness is the thing your lawyers care about, Qwen3 or Mistral under Apache 2.0 is the safe default.
An open-weight model is one whose trained parameters you can download and run yourself, instead of only renting it through an API. That single fact changes the economics of building on AI: the model becomes a cost you control rather than a meter that runs while you sleep. Here is the field, ranked by what they actually deliver.
A note on the prices in that table: for a model you download, the per-token number is whatever the cheapest host charges, because the weights themselves are free. The figures above are each vendor's own first-party API rate, captured this week. Self-host instead and you trade that meter for a GPU bill.
Why open weights stopped being a compromise
For two years, "open model" meant "good enough if you cannot afford the real thing." That sentence is now false, and one benchmark says it cleanly. On SWE-bench Pro, a test that scores whether a model can resolve real software-engineering tickets, GLM-5.2 scores 62.1. GPT-5.5 scores 58.6. Gemini 3.1 Pro scores 54.2. Only Claude Opus 4.8, at 69.2, sits clearly ahead. An open, MIT-licensed model from a Tsinghua spinout now beats two of the three Western frontier labs on the benchmark that matters most for engineering work.
This is the part that decides whether you can actually use any of it, and almost no comparison spells it out: "open weight" and "open source" are not the same promise. Open weight means you can download the parameters. Open source, in the sense that matters to a business, means the license also lets you deploy, modify, and sell what you build without asking permission. DeepSeek, GLM, and Qwen ship under MIT or Apache 2.0, which are genuinely permissive: do what you want, including commercially. Llama and MiniMax ship "community licenses" with restrictions buried in the terms. The difference is not academic. It is the difference between shipping a product on the model and getting a letter from legal three months in.
GLM-5.2 is the open model to beat
GLM-5.2, from the Chinese lab Z.ai (formerly Zhipu), is the strongest open-weight model in the world for long-horizon engineering work, and it is the one to reach for first. It shipped in mid-June 2026 with a genuinely usable 1M-token context window, up from 200K in the previous version, and it holds quality across that window rather than just accepting the tokens. A context window is how much text the model can hold in its working memory at once, and a 1M window means it can keep an entire mid-sized codebase, its tests, and its conventions in view while it works.

On Terminal-Bench 2.1, which measures whether a model can drive a real terminal to finish a task, GLM-5.2 scores 81.0, within a few points of Claude Opus 4.8 at 85.0 and ahead of Gemini 3.1 Pro at 74.0. Z.ai's own benchmarks put it as the highest-ranked open-source model on every long-horizon test they ran. It is MIT-licensed, so a 12-person startup can self-host it, fine-tune it on their own code, and ship it inside a paid product with no permission and no royalty.
The catch is hardware. This is a roughly 745-billion-parameter model, so self-hosting it well means a serious multi-GPU box. Most teams will not run it themselves; they will use the API, where the published rate is around $1.40 per million input tokens and $4.40 per million output, or the flat GLM Coding Plan at about $18 a month. For a founder who has been paying frontier API prices, that plan is the headline: roughly the cost of two coffees for a month of frontier-class coding, with the option to pull the weights in-house later if data control becomes a requirement.
DeepSeek V4 is the cost floor nobody else can match
DeepSeek V4 is the model that makes the math impossible to ignore, because it delivers near-frontier quality at a price that rounds to nothing. Released in April 2026 by the Chinese lab DeepSeek, it comes in two sizes, both MIT-licensed with a 1M-token context. V4-Pro is a 1.6-trillion-parameter mixture-of-experts model; V4-Flash is a leaner 284-billion-parameter version. A mixture-of-experts model only activates a small slice of its parameters for any given token, which is how a model this large stays cheap to serve.

The prices are the story. V4-Flash runs $0.14 per million input tokens and $0.28 per million output, on DeepSeek's own API. V4-Pro, the heavyweight, is still only $0.435 in and $0.87 out. To feel the gap: a workload that costs a few hundred dollars a month on a closed frontier model often lands in the low tens of dollars on V4-Flash, for the same job, with output quality most users cannot distinguish on everyday tasks. Both support a thinking mode for harder reasoning and tool calls for agent work.
The honest limit: DeepSeek is excellent on reasoning, math, and code, but it is not the model I would pick for the most demanding multimodal work or for the longest agentic coding runs, where GLM-5.2 and Kimi pull ahead. Use V4-Flash as your default for anything high-volume, classification, extraction, summarization, drafting, and reserve a pricier model for the 5% of tasks that genuinely need it.
Kimi K2.7 Code is built for agents that write code
Kimi K2.7 Code, from Moonshot AI, is the open model purpose-built for coding agents that run for hours, and it landed in June 2026. It is a 1-trillion-parameter mixture-of-experts model with native multimodal input, meaning it reads images and video alongside text, and a 256K-token context window. Where GLM-5.2 is the generalist that happens to be great at code, Kimi is tuned specifically for the long, multi-step agent loops that modern coding tools run.

On Moonshot's reported numbers, the earlier K2.6 hit 80.2% on SWE-bench Verified, in the same band as the closed frontier. K2.7 Code costs $0.95 per million input tokens and $4.00 per million output on the official API, with a much cheaper $0.19 rate when your prompt is cached, which matters a lot for agents that re-send the same large context on every step. It ships under a modified MIT license, so commercial self-hosting is on the table.
The trade-off is focus. Kimi is sharp on coding and agentic execution and less of an all-rounder for general writing or analysis. If your product is a coding agent or an internal dev tool, it is a top pick. If you want one model to do everything, GLM-5.2 or DeepSeek is the more balanced choice. For how these models slot into the tools that actually run them, see the best AI coding agents.
Qwen3 is the most permissive serious open model
Qwen3, from Alibaba, is the open model to choose when license cleanliness and language coverage matter more than topping a single benchmark. The open flagship, Qwen3-235B-A22B, is a 235-billion-parameter mixture-of-experts model with 22 billion active parameters, a 256K-token context, and support for 119 languages. Crucially, it is Apache 2.0, the most permissive license in this entire list, which makes it the default pick for any enterprise whose legal team flinches at custom terms.

It supports switching between a thinking mode for hard reasoning and a faster non-thinking mode for everyday chat, in the same model, so you are not paying for deliberation you do not need. The weights are free to download, and you can run them through Alibaba Cloud or any host that carries them.
One thing to get right, because the marketing blurs it: Alibaba's newest flagship, Qwen3.7-Max, is closed and API-only. The genuinely open models are the published mid-tier and 235B weights, not the Max. For a non-technical reader deciding procurement, the rule is simple: if you want to own the model, pick a named open Qwen weight, not the Max tier. The open 235B is a touch behind GLM-5.2 and DeepSeek V4 on the hardest coding benchmarks, but its permissive license and multilingual reach make it the safest sovereign default for a global product.
MiniMax M3 brings multimodal to the open frontier
MiniMax M3 is the open model to watch when your work mixes text, images, and video and needs a huge context, and it is the freshest of the bunch, released on June 1, 2026. It is a 428-billion-parameter mixture-of-experts model with 23 billion active parameters, a 1M-token context, and native multimodal input. It uses an attention design the lab calls MiniMax Sparse Attention to keep long-context inference affordable.

On the cost side it is aggressive: MiniMax's own API lists M3 at $0.30 per million input tokens and $1.20 per million output, under a standing 50% discount, with benchmark scores like 93.0 on GPQA Diamond that put it among the strongest reasoners, open or closed.
The asterisk is the license. M3's weights are downloadable, but commercial use of the model or its derivatives requires a separate agreement beyond the default MiniMax Community License. In plain terms: great to experiment with, but get the commercial terms in writing before you build a business on it. That single clause is why it sits below the MIT and Apache models for anyone shipping a product.
gpt-oss is the one that runs on your own GPU
gpt-oss, OpenAI's open-weight family, is the model to pick when the hard requirement is running on hardware you control without a cluster. The larger gpt-oss-120b has 117 billion parameters but only 5.1 billion active, and it is engineered to run on a single 80GB GPU such as an NVIDIA H100. The smaller gpt-oss-20b runs in 16GB of memory, which means a well-specced workstation. Both are Apache 2.0.

This is the pragmatic on-premise pick. For a regulated business, a hospital, a bank, a defense contractor, that cannot send data to any external API, "fits on one H100" is the whole decision. It offers configurable reasoning effort, so you can dial the model between fast-and-cheap and slow-and-careful, and full chain-of-thought access for debugging. It is genuinely agentic, with native function calling, web browsing, and Python execution.
The honest framing: gpt-oss is not chasing the top of the benchmark charts the way GLM-5.2 or DeepSeek V4 are, and it is the older release in this group. You pick it for the deployment story, a capable model that fits on hardware you already own under a clean license, not because it is the single smartest option.
Mistral is the European open option
Mistral, the French lab, is the open model to choose when European data residency and a familiar Western vendor matter to your buyers. Its open lineup includes Mistral Small 4, a hybrid model that unifies instruction-following, reasoning, and coding in one efficient package, and Mistral Medium 3.5, a frontier-class multimodal model, plus Devstral 2 for code-agent work.

For a European company selling to European customers, the procurement conversation is easier with a Paris-based vendor than with a Beijing-based one, regardless of what the weights actually do. Mistral's models are smaller and lighter than the giant Chinese MoEs, which is a feature if you want efficiency, and a limit if you need the absolute top of the coding charts. It is the comfortable, compliant choice rather than the benchmark leader.
Llama 4 is the ecosystem, not the frontier
Llama 4, from Meta, is no longer the model you pick to win benchmarks, but it is still the one with the deepest tooling and the widest community. The current generation, released in 2025, includes Scout and Maverick, both natively multimodal mixture-of-experts models with 17 billion active parameters; Maverick carries 400 billion total parameters across 128 experts. The largest model, Behemoth, remains unreleased.

Llama's real asset in 2026 is gravity: more fine-tunes, more quantizations, more deployment guides, and more hiring-market familiarity than any other open model. If your team already runs Llama and your tasks are well-served, there is no urgency to switch. But be precise about the license. Llama ships under Meta's community license, not Apache or MIT, and that license carries commercial restrictions that a true open-source license does not. For a new build where licensing is open, the newer Chinese MIT and Apache models out-deliver it on capability and out-clean it on terms.
Who should pick what
The right model is a function of your constraint, not a single winner. Match your situation to the row.
The cost math, honestly
The reason any of this matters to a business is the bill, so here is the real arithmetic without the hype. There are three ways to pay for an open model, and they suit different scales.
Rent the API (smallest commitment)
You call the model through the vendor's API and pay per token. At DeepSeek V4-Flash prices, a million tokens of output costs 28 cents. A team running a few hundred million tokens a month, a real workload for a busy product, pays tens of dollars, not thousands. This is where almost everyone should start.
Buy a flat plan (predictable spend)
For heavy coding use, a subscription like the GLM Coding Plan at roughly $18 a month removes the per-token anxiety entirely. You pay one fixed price and stop counting tokens. This is the sweet spot for a small engineering team that codes with AI all day.
Self-host the weights (largest commitment, lowest marginal cost)
You download the MIT or Apache weights and run them on your own GPUs. The cost moves from a per-token meter to a fixed GPU bill plus engineering time. It only pays off at high, steady volume, or when data control is non-negotiable. gpt-oss-120b on a single H100 is the realistic entry point; the trillion-parameter MoEs need a cluster.
The widely repeated claim that a team spending $10,000 a month on a closed model could spend $1,000 to $2,000 on an open one is directionally right and easy to check against the prices above: GLM-5.2 and DeepSeek V4 undercut closed-frontier API rates by roughly five to ten times for comparable work. The savings are real. What the claim hides is the engineering cost of running your own inference well, which is why the API and flat-plan routes beat self-hosting for most teams until volume is large. For how these open models compare head-to-head with the closed Claude, GPT, and Gemini models on coding specifically, see the [best AI model for coding](/blog/best-ai-model-for-coding-2026).
The decision rule
One constraint flips the entire choice. If you can use an API, your decision is cost versus capability, and the answer is GLM-5.2 for hard agentic work or DeepSeek V4 Flash for everything high-volume. If you cannot use an API, because of data residency, regulation, or air-gapped infrastructure, then the only question that matters is what runs on the hardware you have, and the answer is gpt-oss-120b on one GPU or Qwen3 on a slightly larger box. Decide the API question first. Everything else follows from it.
What is the best open-source LLM in 2026?
For serious agentic and coding work, GLM-5.2: it leads every long-horizon open benchmark, ships under an MIT license, and is available on a flat $18-a-month plan. For cheap, high-volume tasks, DeepSeek V4 Flash at $0.14 input and $0.28 output per million tokens is unbeatable on price.
Is open weight the same as open source?
No, and the difference can cost you. Open weight means you can download the model. Open source, for a business, means the license also lets you deploy and sell what you build. DeepSeek, GLM, Qwen, and gpt-oss use permissive MIT or Apache 2.0 licenses; Llama and MiniMax ship restricted community licenses that need a lawyer's read before commercial use.
Can I run these models on my own hardware?
Some, easily. gpt-oss-120b fits on a single 80GB H100, and gpt-oss-20b runs in 16GB. The giant mixture-of-experts models, DeepSeek V4, GLM-5.2, and Kimi K2.7, need a multi-GPU cluster to self-host well, so most teams rent them through an API instead.
Are Chinese open models safe for a US company to use?
The downloaded weights run entirely on your infrastructure and send data nowhere, so self-hosting sidesteps the data-residency concern completely. The question only arises if you call a vendor's hosted API, in which case your data goes to that host. If that matters, self-host the open weights or use a Western host that carries them.
What is the best open LLM I can run locally on a normal machine?
gpt-oss-20b, which runs in 16GB of memory, or a mid-tier Qwen3 weight. Both give you a capable, permissively licensed model on a single workstation without renting anything.
Want the full landscape mapped to your business, not just the models but the agents, tools, and stacks that turn them into products? Get the AI tools map for business owners and a short weekly read on what actually shipped and what to do about it.
Jun 29, 2026







