Published: October 22nd, 2025
When machines trade against each other, the results can be entertaining, alarming, or both. In this week's “Alpha Arena” contest, where large language models (LLMs) manage crypto portfolios without human intervention, a few unlikely champions have emerged.
At the time of writing, Elon Musk's Grok and DeepSeek were up by more than 25% each, while Google's Gemini 2.5 Pro had plunged over 28%. The spectacle, part experiment and part financial theatre, has enthralled both Wall Street and Silicon Valley.
The premise is simple enough. Each AI model starts with $10,000 to trade crypto perpetuals on the Hyperliquid exchange, betting on the likes of Bitcoin, Solana, and Dogecoin. The rules demand full autonomy: the models have to come up with strategies, time entries, size positions, and manage risk entirely on their own. Every decision and trade is published for scrutiny.
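Nof1 has not published its trading harness, but the rules imply a loop along these lines. The sketch below is a guess in Python: get_market_snapshot, query_llm, and place_order are hypothetical stand-ins, not real Hyperliquid or model-provider calls, and the prompt and cadence are invented for illustration.

```python
import json
import time

# Hypothetical stand-ins: Nof1's harness is not public, and these are not
# real Hyperliquid or model-provider APIs.
def get_market_snapshot(symbols: list[str]) -> dict: ...   # prices, funding, open positions
def query_llm(prompt: str) -> str: ...                     # any chat-completion API
def place_order(symbol: str, side: str, size_usd: float) -> None: ...

def trading_loop(symbols: list[str]) -> None:
    """Poll the market, ask the model for a decision, execute it, repeat."""
    while True:
        snapshot = get_market_snapshot(symbols)
        prompt = (
            "You manage a $10,000 crypto perpetuals account. Given this "
            "market snapshot, choose your next action as JSON: "
            '{"action": "buy"|"sell"|"hold", "symbol": str, "size_usd": float}\n\n'
            + json.dumps(snapshot)
        )
        decision = json.loads(query_llm(prompt))
        if decision["action"] != "hold":
            place_order(decision["symbol"], decision["action"], decision["size_usd"])
        time.sleep(600)  # re-decide every ten minutes; the real cadence is unknown
```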
The competition, hosted by Nof1, an AI research lab, began on October 17th and runs until early November. In practice, it has turned into an ongoing referendum on the financial competence of the world's most advanced chatbots.
As in any market, rankings are volatile. Jay Azhong, Nof1's founder, told Bloomberg that the early standings were no surprise: “It usually ends up between Grok and DeepSeek,” he said, though he noted that GPT and Gemini sometimes shine.
Not this time. OpenAI's GPT-5 had dropped around 30% at the time of writing, hurt less by reckless bets than by an excess of caution: it made only a handful of small trades, a restraint that kept it out of the leaders' big swings but not out of the red. Anthropic's Claude Sonnet 4.5 sits mid-table, comfortably positive yet outpaced by the leaders.
For Wall Street, the outcomes raise unsettling questions. If an irreverent chatbot can outperform an AI trained on terabytes of financial data, perhaps the edge lies not in datasets but in adaptability. Or perhaps this is merely a reminder that a few weeks of trading is no measure of intelligence, human or artificial.
The exercise has revived an old debate: can AI be trusted with real money? Advocates argue that LLMs can digest oceans of unstructured data (tweets, headlines, even blockchain chatter) faster than any analyst, extracting market sentiment in real time. They foresee an era in which autonomous agents uncover new “alpha” and democratize strategies once confined to hedge funds.
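What that might look like in practice: a minimal sketch of headline sentiment extraction, using the OpenAI Python SDK as one example provider. The model name, prompt, and scoring scheme are illustrative choices, not anything Nof1 or the advocates have specified.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sentiment_score(headlines: list[str]) -> float:
    """Ask a model to rate aggregate sentiment from -1 (bearish) to +1 (bullish)."""
    prompt = (
        "Rate the overall crypto-market sentiment of these headlines from "
        "-1.0 (very bearish) to 1.0 (very bullish). "
        'Reply with JSON: {"score": <float>}\n\n' + "\n".join(headlines)
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat model would do
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return float(json.loads(reply.choices[0].message.content)["score"])
```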
Skeptics counter that the very opacity that makes these systems powerful also makes them perilous. A model's reasoning is often inscrutable even to its creators. When it wins, no one knows exactly why; when it loses, explanations arrive only after the fact. For compliance departments and regulators, that opacity is intolerable. Trusting a black box to manage client capital would violate the first rule of risk management: know your exposure.
The issue goes beyond transparency to reliability. LLMs are prone to what engineers politely call “hallucinations,” the confident invention of plausible-sounding nonsense. In conversation, that is an embarrassment; in trading, a hallucination can become a position.
The Alpha Arena already offers a taste of that danger. Gemini's erratic performance, characterized by rapid reversals from bullish to bearish stances, traced exactly that failure mode: confident calls, swiftly abandoned, replaced by equally confident opposites, each reversal compounding the losses of the last.
For all the fanfare, few professionals expect these AIs to manage pension funds anytime soon. Financial institutions are dabbling with generative models but mostly for mundane, low-risk tasks such as summarizing filings or drafting compliance memos. A report from law firm Gilbert + Tobin predicts broader adoption within two years, but notes that current deployments remain firmly “human-in-the-loop.”
Despite the concerns, the experiment is turning up useful insights. It shows that off-the-shelf models can trade autonomously, and at their best even prudently. It also shows how unstable their decision-making can be. Grok's exuberance, DeepSeek's discipline, and Gemini's confusion each reflect distinct configurations of model architecture, training data, and prompt engineering. Strip away the branding, and they behave not unlike human traders: prone to conviction, bias, and panic in unequal measure.
For the AI community, Alpha Arena is a valuable laboratory. It shifts evaluation from static benchmarks to the messy, adversarial world of financial markets. There, decisions have measurable consequences: gains, losses, liquidations. Observers can trace each trade on-chain and inspect not only prediction accuracy but also capital management, leverage use, and drawdowns. In short, it forces models to confront reality.
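Drawdown, for instance, is simple to audit from a published equity curve: the deepest peak-to-trough fall shows how much pain a model tolerated on the way to its final number. A minimal sketch, with illustrative values rather than actual Alpha Arena data:

```python
def max_drawdown(equity: list[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

# Hypothetical equity curve for a $10,000 account; not real contest data.
equity_curve = [10_000, 10_400, 9_800, 11_200, 10_100, 12_600]
print(f"max drawdown: {max_drawdown(equity_curve):.1%}")  # -> 9.8%
```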
Nof1's creation is less a contest than a controlled stress test. By giving identical capital to competing models and exposing them to identical volatility, it isolates differences in reasoning and risk appetite. Some, like DeepSeek, behave like seasoned quants; others, like Gemini, resemble over-caffeinated interns.
The transparency of public leaderboards and open trade logs has drawn applause from both AI researchers and market spectators. The spectacle of machines making and losing money in real time captures something of the present moment: exuberant innovation colliding head-on with financial consequence.
For Wall Street, the message is mixed. On one hand, the promise of autonomous AI traders points to a world where computers harvest sentiment, analyze charts, and execute orders at machine speed. On the other, the Gemini debacle and GPT-5's underperformance show it's still early days. As long as regulators demand transparency and investors demand accountability, the bots will stay on the sidelines.
Yet even the skeptics acknowledge a shift underway. Each iteration of the Alpha Arena brings AIs a little closer to surviving the chaos they create. The contest's name is apt: it is less about victory than evolution. In a few years, when financial algorithms whisper market moves to one another in real time, this odd little crypto tournament may be remembered as the moment artificial traders first stepped into the ring, and learned just how hard the markets can hit.