Klarna's CEO announced an AI customer service assistant was doing "the work of 700 agents", then walked it back a year later

By Ellie Harris · Filed 27 February 2024

Alleged: Klarna Bank AB, OpenAI developed or deployed the AI system implicated in this incident. Details are drawn from public reports; parties are presumed innocent of any wrongdoing not established by an official finding.

What happened

On 27 February 2024, the Swedish buy-now-pay-later company Klarna issued a press release announcing that an AI customer service assistant, built in partnership with OpenAI, had handled 2.3 million customer conversations in its first month of operation. The release framed the achievement in a striking comparison: the bot was doing “the work of 700 full-time agents.” Klarna projected the deployment would drive a US$40 million improvement in profit in 2024. Customer satisfaction scores were comparable to those of human agents. Resolution times had dropped by an average of 9 minutes per chat. Repeat inquiry rates had fallen 25 percent.

OpenAI published a case study. The financial press picked it up as the breakthrough proof point for AI in customer service: a real company, at real scale, with real metrics, demonstrating that AI could replace customer service agents wholesale. CEO Sebastian Siemiatkowski did several podcast interviews and a substantial press tour. The “work of 700 agents” framing became the most-cited single statistic in the AI-replacing-humans discourse of 2024.

Underneath the framing, the actual figure was more nuanced than the press tour acknowledged. Between 2022 and 2024, Klarna had reduced its global headcount from roughly 5,527 employees to 3,422, a reduction of about 38 percent. The “700 agents” comparison did not refer to seven hundred specific people who had been laid off. It referred to the additional agents Klarna would have needed to hire to handle the conversation volume during a growth phase if humans had been doing the work. Klarna had avoided hiring rather than laying agents off. That distinction was lost in a headline framing that implied AI had directly displaced workers. In net effect it had, just not through the specific causal chain the framing implied.

Through 2024, complaints accumulated. Customers reported that the bot handled simple queries well (refund status, account balance, basic transaction questions) but degraded on anything that required nuance, judgement, or sustained context across multiple messages. Cases involving disputes, fraud, hardship, or complex multi-product interactions were particularly poorly served. Customer satisfaction in the cohort of complex cases declined materially. The bot occasionally hallucinated, in the customer-service idiom: it invented refund policies, claimed account features that did not exist, and gave timelines for actions it could not actually take.

In May 2025, in a Bloomberg interview, Siemiatkowski acknowledged that the company had over-deployed. “We focused too much on cost,” he said. “The result was lower quality.” In the same cycle he made a more striking point: “From a brand perspective, a company perspective, I just think it’s so critical that you are clear to your customer that there will always be a human if you want.” Klarna announced a new approach: human agents would handle complex cases, AI would handle routine ones, and the company would explicitly hire for an Uber-style flexible workforce of remote agents, including students and parents, on flexible schedules. The “work of 700 agents” framing was retired. The pivot was reported widely as the most visible public reversal of an AI-replaces-humans deployment to date.

What an auditable version would have shown

The 2024 metrics Klarna published were aggregate. Chats handled. Average resolution time. Customer satisfaction. Each is informative. None of them, on its own, distinguishes between the bot being good at customer service and the bot being good at the easy half of customer service.

An auditable version of the metric stack would slice every quality measure by case complexity. Klarna’s case management system already classified cases by type: refund status, balance inquiry, transaction dispute, fraud claim, hardship request, multi-product query. The aggregate CSAT score across all cases looked acceptable through 2024. The CSAT score on the dispute, fraud and hardship segments declined materially during the same period. The aggregate number, the one Klarna published and the press repeated, concealed a degrading distribution within it.

A structured per-case record would have made the pattern visible at the operational level long before the public reversal. Each case: type, route (AI-handled, human-handled, AI-then-escalated), resolution outcome, customer satisfaction, complaint trail. Aggregated by case type, the data would have shown, and almost certainly did show, in Klarna’s internal systems, that the bot’s value-add was concentrated in the easy cases and that the hard cases were being handled worse, not better, than under the previous human-only regime.
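A minimal sketch of such a per-case record, and of the slicing it makes possible, might look like the following. The field names, the 1-to-5 CSAT scale and the sample data are all assumptions made for illustration; none of them comes from Klarna's systems or from the public reports.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CaseRecord:
    case_type: str   # e.g. "refund_status", "dispute" (hypothetical labels)
    route: str       # "ai", "human", or "ai_then_escalated"
    resolved: bool   # resolution outcome
    csat: int        # customer satisfaction, assumed scale 1 (worst) to 5 (best)
    complaint: bool  # whether a complaint followed the case


def csat_by_segment(records):
    """Return mean CSAT keyed by (case_type, route)."""
    buckets = defaultdict(list)
    for record in records:
        buckets[(record.case_type, record.route)].append(record.csat)
    return {segment: mean(scores) for segment, scores in buckets.items()}


# Invented illustrative data: many easy cases, few hard ones.
records = (
    [CaseRecord("refund_status", "ai", True, 5, False)] * 90
    + [CaseRecord("dispute", "ai", True, 2, True)] * 10
)

overall = mean(r.csat for r in records)   # 4.7: looks healthy
segments = csat_by_segment(records)       # dispute/ai segment averages 2

print(f"aggregate CSAT: {overall}")
for segment, score in sorted(segments.items()):
    print(segment, score)
```

With these invented numbers the aggregate sits at 4.7 while the dispute segment sits at 2. The easy cases outnumber the hard ones nine to one, so the headline figure barely registers the failure.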

The question is not whether the bot was succeeding overall. It clearly was on volume and on routine cases. The question is whether the bot was succeeding in the cases where customer trust is built or destroyed: the dispute, the fraud claim, the hardship request. Trust is a low-frequency, high-stakes signal. It does not move the aggregate CSAT line for months, and by the time it does, the reversal is what makes the news, not the deployment.

Where the gap was

The gap was in segmentation discipline, not in the underlying technology.

The OpenAI-powered assistant was capable of handling the routine 70 percent of customer-service queries to a standard customers found acceptable. That part of the deployment worked. What was missing was a structural control that prevented the AI from being deployed against the cases it was not yet capable of handling well, and a clear escalation path that routed complex cases to humans early and unambiguously, rather than after the customer had spent forty minutes failing to make progress with the bot.

The publicly available accounts of the deployment suggest the bot was given a broad mandate: handle any customer interaction it could, escalate when it could not. The escalation criterion was the bot’s own confidence, a function the bot could be confidently wrong about. Customers in dispute or fraud cases routinely encountered a bot that scored itself as able to help, kept trying, and resolved the case in a way that resolved it for the metrics but did not resolve it for the customer.

A typed-case routing system, where case types known to require human judgement bypass the AI entirely, and case types where the AI is competent are handled by the AI with explicit human-escalation tripwires on negative-sentiment signals, would have surfaced the bot’s strengths without exposing the customers in the hardest cases to its weaknesses. Klarna’s pivot in 2025 was, in essence, the implementation of this control after the fact.

What governance should have looked like

For AI deployed in customer service, the governance question is not how much volume the AI can absorb. It is how the institution segments the work and how it captures the structured signal that tells it the segmentation is wrong.

The first move is to classify each case at first contact and route it based on what kind of case it is, not on how confident the bot thinks it is. Refund status, balance inquiry, shipment tracking: these go to the AI. Dispute, fraud claim, hardship request: these go to a human, by design, with no AI in the loop. The AI’s mandate is bounded explicitly. The bot does not get the chance to attempt cases it is not yet competent at, and customers in the hardest moments do not have to bounce off it before reaching someone who can help.
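As a rough sketch of routing on case type, assuming a classifier upstream has already labelled the case, the rule can be a simple allow-list. The label strings and the names Handler, AI_ELIGIBLE and route_case are hypothetical. The bot's confidence is deliberately not an input.

```python
from enum import Enum


class Handler(Enum):
    AI = "ai"
    HUMAN = "human"


# Hypothetical allow-list: only case types named here may reach the AI.
AI_ELIGIBLE = frozenset({"refund_status", "balance_inquiry", "shipment_tracking"})


def route_case(case_type: str) -> Handler:
    """Route on case type alone; anything not allow-listed goes to a human."""
    return Handler.AI if case_type in AI_ELIGIBLE else Handler.HUMAN


print(route_case("balance_inquiry"))   # Handler.AI
print(route_case("fraud_claim"))       # Handler.HUMAN
print(route_case("unrecognised_type")) # Handler.HUMAN
```

Using an allow-list rather than a block-list is a design choice in this sketch: a case type nobody anticipated fails towards a human rather than towards the bot.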

The second move is to log every case with its type, its handler, its outcome and its customer-side signals, independent of whether the AI or a human took it. The aggregate metrics across all of those records are useful. The segment metrics, sliced by case type and handler, are the operational signal. Aggregate CSAT will look fine for a long time even as the dispute segment quietly degrades. If the segment numbers are the ones in front of the operator, the segment problem surfaces before the brand problem does.
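One way to put the segment numbers in front of the operator is to compare each segment against its own baseline and flag the ones that have slipped. The sketch below assumes per-segment mean CSAT has already been computed for two periods. The function name, the 0.3-point threshold and every figure in it are invented for illustration.

```python
def degraded_segments(baseline, current, max_drop=0.3):
    """Return segments whose mean CSAT fell more than max_drop below baseline."""
    flagged = {}
    for segment, base_score in baseline.items():
        now_score = current.get(segment)
        if now_score is not None and base_score - now_score > max_drop:
            flagged[segment] = round(base_score - now_score, 2)
    return flagged


# Invented numbers for illustration only.
baseline = {"refund_status": 4.6, "dispute": 4.1, "fraud_claim": 4.0}
current = {"refund_status": 4.7, "dispute": 3.2, "fraud_claim": 3.9}

flagged = degraded_segments(baseline, current)
print(flagged)  # {'dispute': 0.9}
```

Here the routine segment improved and the fraud segment moved within tolerance, so only the dispute segment is flagged. An aggregate over the same three segments would blur that signal.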

The third move is to make the escalation criterion the customer’s experience, not the bot’s self-assessment. Negative-sentiment patterns across messages, explicit customer requests for a human, case duration past a sensible threshold: each is a recorded signal that pulls a human agent in automatically. The bot does not decide when it has failed. The customer’s behaviour decides, and the system listens.
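A sketch of those tripwires follows, assuming a sentiment score per customer message is available from some upstream component. The thresholds and every name here are hypothetical and would need tuning against real case data. The function returns the reasons it fired, so that each escalation is itself a recorded signal.

```python
from dataclasses import dataclass


@dataclass
class ConversationState:
    negative_streak: int     # consecutive customer messages scored negative
    asked_for_human: bool    # customer explicitly requested a person
    minutes_elapsed: float   # time since the case was opened


# Hypothetical thresholds, not taken from any real deployment.
MAX_NEGATIVE_STREAK = 2
MAX_MINUTES = 10.0


def escalation_reasons(state: ConversationState) -> list[str]:
    """List every customer-side tripwire that has fired; empty means none."""
    reasons = []
    if state.asked_for_human:
        reasons.append("customer_requested_human")
    if state.negative_streak >= MAX_NEGATIVE_STREAK:
        reasons.append("negative_sentiment")
    if state.minutes_elapsed > MAX_MINUTES:
        reasons.append("duration_exceeded")
    return reasons


def should_escalate(state: ConversationState) -> bool:
    # The bot's own confidence is deliberately not consulted here.
    return bool(escalation_reasons(state))


print(escalation_reasons(ConversationState(3, False, 12.5)))
# ['negative_sentiment', 'duration_exceeded']
```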

Customer-service work isn’t all the same. Saying “AI does the work of 700 agents” treats every request as one interchangeable task, but a quick balance check and a fraud dispute are nothing alike. AI handles the easy requests well; the hard ones, where a customer’s trust is won or lost, are exactly where it struggles. Those harder cases, roughly 30 percent of the total, are also the most expensive to get wrong: deploying AI widely across them costs the brand more than it saves on payroll. Klarna’s reversal is the cleanest public statement of that, but it will not be the last. Every customer-service deployment of this size will eventually meet the same wall, and the operators with the segment-level records will see it coming first.

The reference implementation of ConstraintGate and ConductRecord is open source. It lives at github.com/saffronandindia/headlights-oss, Apache 2.0 licensed and free to install. Anyone can read every line and verify the signatures. The repository is public now.

Sources

Klarna AI assistant handles two-thirds of customer service chats in its first month (Klarna press release, February 2024)
Klarna’s AI assistant does the work of 700 full-time agents (OpenAI case study)
Klarna Reverses AI Customer Service Replacement (Tech.co)
Klarna reverses AI push, hires customer service agents (eMarketer)
Klarna Dials Back its AI Customer Service Strategy

The record

An auditable system would have produced a signed, tamper-evident record the moment this happened: what the system did, the version that did it, the basis it acted on, and the action taken. Klarna Bank AB and OpenAI could then have produced it on demand.

The system as deployed did not produce such a record in a signed, auditable form.

What this teaches

Capture what happened when it happens

What the system did, the version that did it, the basis it acted on, and the action taken, recorded at the moment, not reconstructed after.

Sign it, so no one has to trust the record-keeper

A tamper-evident entry. Edit it later and the signature breaks. The record does not ask for the benefit of the doubt.

Make it verifiable by anyone

A court, a regulator, a customer's lawyer can check the record themselves, without taking the company, or us, at our word.
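The first two points can be sketched with the Python standard library alone. This is an illustrative toy, not the ConstraintGate or ConductRecord implementation mentioned earlier, and the names append_record and verify_log are hypothetical. Each entry is chained to the hash of the previous one, so editing any entry breaks verification from that point on. The signature here is an HMAC, which needs a shared secret to verify. Making the record verifiable by anyone would instead need a public-key signature scheme, which this sketch does not attempt.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def _canonical(body: dict) -> bytes:
    # Sorted keys and fixed separators give identical bytes for identical content.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def append_record(log: list, key: bytes, what: str, version: str,
                  basis: str, action: str) -> dict:
    """Append an entry captured now, chained to the previous entry's hash."""
    body = {
        "what": what,
        "version": version,
        "basis": basis,
        "action": action,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "prev_hash": log[-1]["hash"] if log else "0" * 64,
    }
    payload = _canonical(body)
    entry = dict(
        body,
        hash=hashlib.sha256(payload).hexdigest(),
        signature=hmac.new(key, payload, hashlib.sha256).hexdigest(),
    )
    log.append(entry)
    return entry


def verify_log(log: list, key: bytes) -> bool:
    """Recompute every hash and signature; any edit or reordering fails."""
    prev_hash = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k not in ("hash", "signature")}
        payload = _canonical(body)
        expected_sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
        if (body["prev_hash"] != prev_hash
                or hashlib.sha256(payload).hexdigest() != entry["hash"]
                or not hmac.compare_digest(expected_sig, entry["signature"])):
            return False
        prev_hash = entry["hash"]
    return True
```

A log built with append_record verifies as long as nothing is touched. Changing a single field in an earlier entry afterwards makes verify_log return False for the whole log.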

Headlights summarises publicly reported AI incidents. All summaries are independently written, attributed to their original sources, and intended for research and educational purposes. Allegations are identified as such until established through official findings.

Last reviewed June 2026. This report is based on the sources listed above and reflects information available at the time of review; later developments may not be captured. Where a person is described as charged with or alleged to have done something, that allegation is unproven unless a conviction or a court or regulatory finding is stated. Headlights publishes journalism and commentary, not legal advice.
