AI models lose their shirts on Premier League bets - FT中文网

AI models lose their shirts on Premier League bets

Systems from Google, OpenAI, Anthropic and xAI struggle when asked to predict scores over football season

AI models from Google, OpenAI and Anthropic lost money betting on football matches over a Premier League season, in a new study suggesting even the most advanced systems struggle to analyse the real world over long periods of time.

The “KellyBench” report released this week by AI start-up General Reasoning highlights the gap between AI’s rapidly advancing capabilities in certain tasks, such as writing software, and its shortcomings in other kinds of human problems.

London-based General Reasoning tested eight top AI systems in a virtual recreation of the 2023-24 Premier League season, providing them with detailed historical data and statistics about each team and previous games. The AIs were instructed to build models that would maximise returns and manage risk.

The AI “agents” then placed bets on the outcomes of matches and the number of goals scored to test how they could adapt to new events and updated player data as the season progressed.

The AIs could not access the internet to retrieve results and each was given three attempts to turn a profit.

Anthropic’s Claude Opus 4.6 fared best, with an average loss of 11 per cent and nearly breaking even on one attempt.

xAI’s Grok 4.20 went bankrupt once and failed to complete the other two tries. Google’s Gemini 3.1 Pro managed to turn a 34 per cent profit on one go but went bankrupt on another.

“Every frontier model we evaluated lost money over the season and many experienced ruin,” the authors of the paper concluded, with the AI “systematically underperforming humans” in this scenario.

AI model                      Mean ROI   Best try   Worst try   Mean final bankroll
Anthropic Claude Opus 4.6     −11.0%     −0.2%      −18.8%      £89,035
OpenAI GPT-5.4                −13.6%     −4.1%      −31.6%      £86,365
Google Gemini 3.1 Pro         −43.3%     +33.7%     −100%       £56,715
Google Gemini Flash 3.1 LP    −58.4%     +24.7%     −100%       £41,605
Z.AI GLM-5                    −58.8%     −14.3%     −100%       £41,221
Moonshot Kimi K2.5            −68.3%     −27.0%     −100%       £7,420
xAI Grok 4.20                 −100%      −100%      −100%       £0
Arcee Trinity                 −100%      −100%      −100%       £0

Each model began with a £100,000 normalised bankroll. Return on investment and final bankroll are averaged across three tries. Grok and Trinity did not complete every attempt.

The results offer some comfort to white-collar professionals and businesses who are fretting that AI could take their jobs, as it roils the shares of industries from finance to marketing.

Ross Taylor, one of the study’s authors and General Reasoning’s chief executive, said: “There is so much hype about AI automation but there’s not a lot of measurement of putting AI into a long time horizon setting.”

He added that many of the benchmarks typically used to test AI are flawed because they are set in “very static environments” that bear little resemblance to the chaos and complexity of the real world.

General Reasoning’s paper, which has not yet been peer reviewed, provides a counterweight to growing excitement in Silicon Valley about the huge recent leaps in AI’s ability to complete computer programming tasks with little to no human intervention.

Taylor, a former Meta AI researcher, said: “If you . . . try AI on some real-world tasks, it does really badly . . . Yes, software engineering is very important and economically valuable, but there are lots of other activities with longer time horizons that are important to look at.”

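As a rough reading aid for the results table, the sketch below converts each model's mean ROI into the bankroll it would imply from the £100,000 starting point and sets that beside the reported mean final bankroll. It assumes ROI is defined as (final − start) / start, which the article does not spell out. The names `START`, `ROWS`, `implied_bankroll` and `compare` are illustrative and do not come from the study. Reported figures need not match the implied ones exactly, since ROI is rounded to one decimal place and the paper's normalisation is not described here.

```python
# Figures copied from the results table in the article.
# START is the normalised bankroll stated in the table note.
START = 100_000

# (model, mean ROI in per cent, reported mean final bankroll in pounds)
ROWS = [
    ("Anthropic Claude Opus 4.6", -11.0, 89_035),
    ("OpenAI GPT-5.4", -13.6, 86_365),
    ("Google Gemini 3.1 Pro", -43.3, 56_715),
    ("Google Gemini Flash 3.1 LP", -58.4, 41_605),
    ("Z.AI GLM-5", -58.8, 41_221),
    ("Moonshot Kimi K2.5", -68.3, 7_420),
    ("xAI Grok 4.20", -100.0, 0),
    ("Arcee Trinity", -100.0, 0),
]


def implied_bankroll(roi_pct: float, start: float = START) -> float:
    """Bankroll implied by an ROI, ASSUMING ROI = (final - start) / start."""
    return round(start * (1 + roi_pct / 100), 2)


def compare(rows=ROWS):
    """Return (model, implied, reported, reported - implied) for each row."""
    out = []
    for name, roi_pct, reported in rows:
        implied = implied_bankroll(roi_pct)
        out.append((name, implied, reported, round(reported - implied, 2)))
    return out


if __name__ == "__main__":
    for name, implied, reported, gap in compare():
        print(f"{name:<28} implied £{implied:>10,.0f}  "
              f"reported £{reported:>7,}  gap £{gap:>+10,.0f}")
```

Running the script prints one line per model; the gap column shows how far each reported bankroll sits from the simple relation assumed above.
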
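Copyright notice: The copyright of this article belongs to FT中文网. No organisation or individual may reproduce, copy or otherwise use all or part of this article without permission. Infringements will be pursued.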
