Why behind AI: 100 trillion tokens for Christmas
The OpenRouter metrics
Merry Christmas everybody! This week we will focus on the usage patterns of AI as seen at OpenRouter.
The past year has marked a turning point in the evolution and real-world use of large language models (LLMs). With the release of the first widely adopted reasoning model, o1, on December 5th, 2024, the field shifted from single-pass pattern generation to multi-step deliberative inference, accelerating deployment, experimentation, and new classes of applications.
As this shift unfolded at a rapid pace, our empirical understanding of how these models have actually been used in practice has lagged behind. In this work, we leverage the OpenRouter platform, an AI inference gateway that routes requests across a wide variety of LLMs and providers, to analyze over 100 trillion tokens of real-world LLM interactions across tasks, geographies, and time.
In our empirical study, we observe substantial adoption of open-weight models, the outsized popularity of creative roleplay and coding assistance (beyond the productivity tasks many assume dominate), and the rise of agentic inference.
Furthermore, our retention analysis identifies foundational cohorts: early users whose engagement persists far longer than later cohorts. We term this phenomenon the Cinderella "Glass Slipper" effect. These findings underscore that the way developers and end-users engage with LLMs "in the wild" is complex and multifaceted. We discuss implications for model builders, AI developers, and infrastructure providers, and outline how a data-driven understanding of usage can inform better design and deployment of LLM systems.
OpenRouter is one of the most widely used platforms for API access to a variety of providers and models. It has become the default for inference with open-weight models and for smaller providers that need the visibility, as well as an often better-performing access path to some of the primary frontier models.
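To make that setup concrete, here is a minimal sketch of how developers typically reach many models through a single integration, assuming OpenRouter's OpenAI-compatible chat completions endpoint; the model slug and prompt are illustrative, not a recommendation.

```python
# Minimal sketch: one client, many models, assuming OpenRouter's
# OpenAI-compatible endpoint. The model slug below is illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter gateway
    api_key="sk-or-...",                      # your OpenRouter API key
)

response = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # swap the slug to switch provider or model
    messages=[{"role": "user", "content": "Explain KV-cache reuse in two sentences."}],
)
print(response.choices[0].message.content)
```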
In this paper, we draw inspiration from prior empirical studies of AI adoption, including Anthropic’s economic impact and usage analyses [1] and OpenAI’s report How People Use ChatGPT [2], aiming for a neutral, evidence-driven discussion. We first describe our dataset and methodology, including how we categorize tasks and models. We then delve into a series of analyses that illuminate different facets of usage:
Open vs. Closed Source Models: We examine the adoption patterns of open source models relative to proprietary models, identifying trends and key players in the open source ecosystem.
Agentic Inference: We investigate the emergence of multi-step, tool-assisted inference patterns, capturing how users increasingly employ models as components in larger automated systems rather than for single-turn interactions.
Category Taxonomy: We break down usage by task category (such as programming, roleplay, translation, etc.), revealing which application domains drive the most activity and how these distributions differ by model provider.
Geography: We analyze global usage patterns, comparing LLM uptake across continents and drilling into intra-US usage. This highlights how regional factors and local model offerings shape overall demand.
Effective Cost vs. Usage Dynamics: We assess how usage corresponds to effective costs, capturing the economic sensitivity of LLM adoption in practice. The metric is based on average input plus output tokens and accounts for caching effects (see the sketch after this list).
Retention Patterns: We analyze long-term retention for the most widely used models, identifying foundational cohorts that define persistent, stickier behaviors. We define this to be a Cinderella “Glass Slipper” effect, where early alignment between user needs and model characteristics creates a lasting fit that sustains engagement over time.
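As a rough illustration of the effective-cost idea above, the sketch below blends fresh input tokens, cached input tokens, and output tokens into a single per-request cost. The function name, the flat cache discount, and the example prices are assumptions for illustration, not OpenRouter's exact accounting.

```python
# Hedged sketch of an "effective cost" per request: fresh input tokens at full
# price, cached input tokens at a discount, output tokens at the output price.
# The cache_discount value and the prices used below are illustrative assumptions.
def effective_cost_usd(input_tokens: int, cached_tokens: int, output_tokens: int,
                       price_in_per_m: float, price_out_per_m: float,
                       cache_discount: float = 0.5) -> float:
    fresh_in = input_tokens - cached_tokens
    return (fresh_in * price_in_per_m
            + cached_tokens * price_in_per_m * cache_discount
            + output_tokens * price_out_per_m) / 1_000_000

# Example: a 12k-token prompt with 8k tokens served from cache and a 1.5k-token
# completion, at $3 / $15 per million input / output tokens.
print(round(effective_cost_usd(12_000, 8_000, 1_500, 3.0, 15.0), 4))
```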
Most of the "enterprise usage" studies focus on anecdotal discussions with senior leaders who oversee projects. Since the hyperscalers closely guard their inference metrics, it's actually quite difficult to understand how models are used on a daily basis and which are preferred by the wider group of developers.

The year was dominated by closed-source models, even on the most popular platform for hosting open-weight inference. The rise of Chinese models was the big theme of the year, as a duopoly of US frontier labs competed against the (semi-private) Chinese labs.
A significant share of this growth has come from Chinese-developed models. Starting from a negligible base in late 2024 (weekly share as low as 1.2%), Chinese OSS models steadily gained traction, reaching nearly 30% of total usage among all models in some weeks. Over the one-year window, they averaged approximately 13.0% of weekly token volume, with strong growth concentrated in the second half of 2025. For comparison, rest-of-world (RoW) OSS models averaged 13.7%, while proprietary RoW models retained the largest share (70% on average). The expansion of Chinese OSS reflects not only competitive quality, but also rapid iteration and dense release cycles. Model families like Qwen and DeepSeek maintained regular releases that enabled fast adaptation to emerging workloads. This pattern has materially reshaped the open source segment and intensified global competition across the LLM landscape.
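For readers who want to reproduce this kind of breakdown on their own request logs, here is a minimal sketch of the weekly-share calculation. It assumes a log with one row per request and hypothetical column names (timestamp, origin, tokens); this is not the actual OpenRouter schema.

```python
# Sketch: weekly token share by model origin (e.g. Chinese OSS vs RoW OSS vs
# proprietary), from a hypothetical request log. Column names are illustrative.
import pandas as pd

logs = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-01-06", "2025-01-07", "2025-01-09"]),
    "origin": ["chinese_oss", "proprietary", "row_oss"],
    "tokens": [1_200_000, 9_500_000, 1_400_000],
})

weekly = (logs
          .assign(week=logs["timestamp"].dt.to_period("W"))
          .groupby(["week", "origin"])["tokens"].sum()
          .unstack(fill_value=0))
share = weekly.div(weekly.sum(axis=1), axis=0)  # each week's shares sum to 1.0
print(share.round(3))
```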
These trends indicate a durable dual structure in the LLM ecosystem. Proprietary systems continue to define the upper bound of reliability and performance, particularly for regulated or enterprise workloads. OSS models, by contrast, offer cost efficiency, transparency, and customization, making them an attractive option for certain workloads. The equilibrium currently sits at roughly a 30% OSS share of total token volume. The two camps are not mutually exclusive; rather, they complement each other within a multi-model stack that developers and infrastructure providers increasingly favor.
The big story in the chart is the number of tokens processed by the Chinese labs, but I think the sheer explosion of inference is worth mentioning. Some of it is obviously connected to the growing adoption of the OpenRouter platform itself, but it's not difficult to see this as a downstream effect of "the rising wave of AI" rather than as an exception to it.
Overall, the open source model ecosystem is now highly dynamic. Key insights include:
Top-tier diversity: Where one family (DeepSeek) once dominated OSS usage, we now increasingly see half a dozen models each sustaining meaningful share. No single open model holds more than ≈20–25% of OSS tokens consistently.
Rapid scaling of new entrants: Capable new open models can capture significant usage within weeks. For example, MoonshotAI’s models quickly grew to rival older OSS leaders, and even a newcomer like MiniMax went from zero to substantial traffic in a single quarter. This indicates low switching friction and a user base eager to experiment.
Iterative advantage: The longevity of DeepSeek’s presence at the top underscores that continuous improvement is critical. DeepSeek’s successive releases (Chat-V3, R1, etc.) kept it competitive even as challengers emerged. OSS models that stagnate in development tend to lose share to those with frequent updates at the frontier or domain-specific finetunes.
I've previously covered the relative decline in DeepSeek's mindshare over the year, and I think this is consistent with what appears to be a much more academic and AGI-oriented approach by the team. Alibaba has emerged as the key player in China in terms of productizing AI, both with the Qwen models and with its funding of rising labs like Moonshot.
Infra Play #115: The Alibaba vision
A deeper look at the models driving these trends reveals distinct market dynamics:
The “Small” Market: Overall Decline in Usage. Despite a steady supply of new models, the small model category as a whole is seeing its share of usage decline. This category is characterized by high fragmentation. No single model holds a dominant position for long, and it sees a constant churn of new entrants from a diverse set of providers like Meta, Google, Mistral, and DeepSeek. For example, Google Gemma 3 12B (released August 2025) saw rapid adoption but competes in a crowded field where users continually seek the next best alternative.
The “Medium” Market: Finding “Model-Market Fit.” The medium model category tells a clear story of market creation. The segment itself was negligible until the release of Qwen2.5 Coder 32B in November 2024, which effectively established this category. This segment then matured into a competitive ecosystem with the arrival of other strong contenders like Mistral Small 3 (January 2025) and GPT-OSS 20B (August 2025), which carved out user mindshare. This segment demonstrates that users are seeking a balance of capability and efficiency.
The “Large” Model Segment: A Pluralistic Landscape. The “flight to quality” has not led to consolidation but to diversification. The large model category now features a range of high-performing contenders, from Qwen3 235B A22B Instruct (released in July 2025) and Z.AI GLM 4.5 Air to OpenAI's GPT-OSS-120B (August 5th), each capturing meaningful and sustained usage. This pluralism suggests users are actively benchmarking across multiple open large models rather than converging on a single standard.
The era of small models dominating the open source ecosystem might be behind us. The market is now bifurcating, with users either gravitating toward a new, robust class of medium models, or consolidating their workloads onto the single most capable large model.
The desire for smaller models was driven by enthusiasts' interest in running their own inference and getting high tokens-per-second interactions. With consumer hardware exploding in price, small models failing to improve significantly, and multiple large models now offered at very high tokens-per-second inference speeds, that interest has faded.
While we will likely still see small models dominate "on-device AI" for many use cases, as well as the bad AI implementations at legacy SaaS vendors who want high-margin products, practically speaking the industry has moved toward getting high-speed intelligence at fair prices via API.
Programming has become the most consistently expanding category across all models. The share of programming-related requests has grown steadily through 2025, paralleling the rise of LLM-assisted development environments and tool integrations. As shown in the figure above, programming queries accounted for roughly 11% of total token volume in early 2025 and exceeded 50% in recent weeks. This trend reflects a shift from exploratory or conversational use toward applied tasks such as code generation, debugging, and data scripting. As LLMs become embedded in developer workflows, their role as programming tools is being normalized. This evolution has implications for model development, including increased emphasis on code-centric training data, improved reasoning depth for multi-step programming tasks, and tighter feedback loops between models and integrated development environments.
This growing demand for programming support is reshaping competitive dynamics across model providers. As shown in the figure below, Anthropic’s Claude series has consistently dominated the category, accounting for more than 60% of programming-related spend for most of the observed period. The landscape has nevertheless evolved meaningfully. During the week of November 17, Anthropic’s share fell below the 60% threshold for the first time. Since July, OpenAI has expanded its share from roughly 2% to about 8% in recent weeks, likely reflecting a renewed emphasis on developer-centric workloads. Over the same interval, Google’s share has remained stable at approximately 15%. The mid-tier segment is also in motion. Open source providers including Z.AI, Qwen, and Mistral AI are steadily gaining mindshare. MiniMax, in particular, has emerged as a fast-rising entrant, showing notable gains in recent weeks.
It's good to remember that LLMs are, at the end of the day, large language models. It's not surprising that, particularly with the cost-sensitive user base of OpenRouter, a lot of the usage historically has been for "community roleplays." The more interesting usage, which also correlates with the explosion of enterprise revenue, is the significant adoption of models specifically for coding and agentic engineering workflows.
If we zoom in just on the programming category, we observe that proprietary models still handle the bulk of coding assistance overall (the gray region), reflecting strong offerings like Anthropic's Claude. However, within the OSS portion, there was a notable transition: in mid-2025, Chinese OSS models (blue) delivered the majority of open source coding help (driven by early successes like Qwen3 Coder). By Q4 2025, Western OSS models (orange) such as Meta's Code Llama and OpenAI's GPT-OSS series had surged, but decreased in overall share in recent weeks. This oscillation suggests a very competitive environment. The practical takeaway is that open source code assistant usage is dynamic and highly responsive to new model quality: developers are open to whichever OSS model currently provides the best coding support. As a limitation, this figure doesn't show absolute volumes: open source coding usage grew overall, so a shrinking blue band doesn't mean Chinese OSS lost users, only relative share.
Almost all of this usage is dominated by closed-source models. This comes back to the trend I highlighted earlier, where users want the best intelligence possible, ideally with fast inference speeds and easy management via API, which means that they will pick the closed-source models if they perform better.
Fundamentally, the best intelligence on the market comes from models that can service reasoning workloads. In my recent The economics of AI article, I covered the clear relationship between the type of compute we are deploying and the relative lack of performance jumps in the actual base models. The only frontier lab that delivered obviously significant jumps consistent with the scaling laws this year for the base model was Google with Gemini 3, predominantly due to training it on their new TPU architecture. OpenAI still delivered what are the best reasoning models in the industry with their Pro family, but that was mostly done through significant compute and post-training techniques.
Why behind AI: The economics of AI
From a customer perspective, however, what most users cared about was "does the thing do the thing" rather than the behind-the-scenes details.
Consistently over the year, the most widely used models for programming were from Anthropic. This is not surprising, as their revenue exploded, with internal projections putting them at $9B ARR by the end of the year (after they finished 2024 at $1B).
Together, these trends (rising reasoning share, expanded tool use, longer sequences, and programming’s outsize complexity) suggest that the center of gravity in LLM usage has shifted. The median LLM request is no longer a simple question or isolated instruction. Instead, it is part of a structured, agent-like loop, invoking external tools, reasoning over state, and persisting across longer contexts.
For model providers, this raises the bar for default capabilities. Latency, tool handling, context support, and robustness to malformed or adversarial tool chains are increasingly critical. For infra operators, inference platforms must now manage not just stateless requests but long-running conversations, execution traces, and permission-sensitive tool integrations. Soon enough, if not already, agentic inference will account for the majority of all inference.
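To ground what such an agent-like loop looks like in practice, here is a minimal sketch using the common OpenAI-compatible tool-calling convention; the tool, its schema, the model slug, and the test-running stub are all illustrative assumptions rather than any provider's reference implementation.

```python
# Sketch of a minimal agentic loop: call the model, execute any tool calls it
# requests, append the results, and repeat until it answers directly.
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

def run_tests(path: str) -> str:
    # Stand-in for a real side effect (running a test suite, editing files, ...).
    return f"ran tests in {path}: 2 passed, 1 failed"

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and report the results.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]},
    },
}]

messages = [{"role": "user", "content": "Fix the failing test in ./services/api."}]
while True:
    reply = client.chat.completions.create(
        model="anthropic/claude-sonnet-4",  # illustrative slug
        tools=tools,
        messages=messages,
    )
    msg = reply.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:            # no tool request: the loop is done
        print(msg.content)
        break
    for call in msg.tool_calls:       # execute each tool call and feed results back
        args = json.loads(call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": run_tests(**args)})
```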
Not only do users want the smartest (reasoning) models, they want them to operate independently with some oversight. Agentic workflows for programming are becoming the default, as many have realized that the sweet spot for delivering outcomes lies in the middle between vibe coding (where you have no clue about or interest in the code and ask the model to one-shot a product) and tab-complete (where you always initiate each line of code yourself and the model predicts how to finish it).
The figures above break down LLM usage across the twelve most common content categories, revealing the internal sub-topic structure of each. A key takeaway is that most categories are not evenly distributed: they are dominated by one or two recurring use patterns, often reflecting concentrated user intent or alignment with LLM strengths.
Among the highest-volume categories, roleplay stands out for its consistency and specialization. Nearly 60% of roleplay tokens fall under Games/Roleplaying Games, suggesting that users treat LLMs less as casual chatbots and more as structured roleplaying or character engines. This is further reinforced by the presence of Writers Resources (15.6%) and Adult content (15.4%), pointing to a blend of interactive fiction, scenario generation, and personal fantasy. Contrary to assumptions that roleplay is mostly informal dialogue, the data show a well-defined and replicable genre-based use case.
Programming is similarly skewed, with over two-thirds of traffic labeled as Programming/Other. This signals the broad and general-purpose nature of code-related prompts: users are not narrowly focused on specific tools or languages but are asking LLMs for everything from logic debugging to script drafting. That said, Development Tools (26.4%) and small shares from scripting languages indicate emerging specialization. This fragmentation highlights an opportunity for model builders to improve tagging or training around structured programming workflows.
Beyond the dominant categories of roleplay and programming, the remaining domains represent a diverse but lower-volume tail of LLM usage. While individually smaller, they reveal important patterns about how users interact with models across specialized and emerging tasks. For example, translation, science, and health show relatively flat internal structure. In translation, usage is nearly evenly split between Foreign Language Resources (51.1%) and Other, suggesting diffuse needs: multilingual lookup, rephrasing, light code-switching, rather than sustained document-level translation. Science is dominated by a single tag, Machine Learning & AI (80.4%), indicating that most scientific queries are meta-AI questions rather than general STEM topics like physics or biology. This reflects either user interest or model strengths skewed toward self-referential inquiry.
Health, in contrast, is the most fragmented of the top categories, with no sub-tag exceeding 25%. Tokens are spread across medical research, counseling services, treatment guidance, and diagnostic lookups. This diversity highlights the domain’s complexity, but also the challenge of modeling it safely: LLMs must span high variance user intent, often in sensitive contexts, without clear concentration in a single use case.
What links these long-tail categories is their broadness: users turn to LLMs for exploratory, lightly structured, or assistance-seeking interactions, but without the focused workflows seen in programming or personal assistants. Taken together, these secondary categories may not dominate volume, but they hint at latent demand. They signal that LLMs are being used at the fringes of many fields, from translation to medical guidance to AI introspection, and that as models improve in domain robustness and tooling integration, we may see these scattered intents converge into clearer, higher-volume applications.
By contrast, finance, academia, and legal are much more diffuse. Finance spreads its volume across foreign exchange, socially responsible investing, and audit/accounting: no single tag breaks 20%. Legal shows similar entropy, with usage split between Government/Other (43.0%) and Legal/Other (17.8%). This fragmentation may reflect the complexity of these domains, or simply the lack of targeted LLM workflows for them compared to more mature categories like coding and chat.
The data suggest that real-world LLM usage is not uniformly exploratory: it clusters tightly around a small set of repeatable, high-volume tasks. Roleplay, programming, and personal assistance each exhibit clear structure and dominant tags. Science, health, and legal domains, by contrast, are more diffuse and likely under-optimized. These internal distributions can guide model design, domain-specific fine-tuning, and application-level interfaces, particularly in tailoring LLMs to user goals.
This looks like a fair representation of the broad usage of AI today from early adopters. The interesting split comes when looking at the differences across significant frontier model labs:
The most important trends here are the rise in adoption of OpenAI reasoning models for coding (particularly with the Codex family being introduced), xAI's recent jump in usage across more integrated use cases (although some of it is likely due to them starting to charge for Grok Code Fast), and Google's base models remaining a staple for those building a variety of applications.
DeepSeek and Qwen exhibit usage patterns that diverge considerably from the other model families discussed earlier. DeepSeek’s token distribution is dominated by roleplay, casual chat, and entertainment-oriented interaction, often accounting for more than two thirds of its total usage. Only a small fraction of activity falls into structured tasks such as programming or science. This pattern reflects DeepSeek’s strong consumer orientation and its positioning as a high-engagement conversational model. Notably, DeepSeek displays a modest but steady increase in programming-related usage toward late summer, suggesting incremental adoption in lightweight development workflows.
Qwen, by contrast, presents an almost inverted profile. Across the entire period shown, programming consistently represents 40-60 percent of all tokens, signaling a clear emphasis on technical and developer tasks. Compared with Anthropic’s more stable engineering-heavy composition, Qwen demonstrates higher volatility across adjacent categories such as science, technology, and roleplay. These week-to-week shifts imply a heterogeneous user base and rapid iteration in applied use cases. A noticeable rise in roleplay usage during September and October, followed by a contraction in November, hints at evolving user behavior or adjustments in downstream application routing.
It’s difficult not to call out that if the majority of inference for DeepSeek is coming out of the roleplay category, then the “productive” usage of Chinese frontier labs is essentially limited to their roughly 10% share of the programming tokens.
This puts their effective adoption into significant question, at least in the context of OpenRouter usage.
Which brings us to the obvious question: who is paying for inference?
With most of the Chinese usage likely served on local infrastructure, it is not surprising that inference is dominated by the most technologically advanced economies: the USA, Singapore, Germany, South Korea, the Netherlands, the United Kingdom, Canada, and Japan.
The rising economic players, particularly India, Brazil, Turkey, and much of Africa, do not appear to be engaging with AI in a meaningful way. The outliers in usage here are Middle Eastern players like the UAE and Saudi Arabia, who have been major investors in both frontier labs and neoclouds; they are also deploying significant capacity in local private clouds, typically managed by a hyperscaler.
We will end the article with a little cloud-infrastructure-software-themed Christmas special:
Consider this your crash-course introduction to designing data-intensive applications.




















