HD-INC-017

Housing & real estate · United States · 2021 · Automated-decision harm

Zillow's home-pricing algorithm overpaid for thousands of houses and ended the iBuyer business in a single quarter, with write-downs topping $500 million

By Ellie Harris · Filed 1 July 2021

Alleged: Zillow Group developed or deployed the AI system implicated in this incident. Details are drawn from public reports; parties are presumed innocent of any wrongdoing not established by an official finding.

What happened

Zillow Offers was the iBuyer arm of Zillow Group: the part of the company that, beginning in 2018, made cash offers on residential homes, bought them, did light renovations, and sold them back into the market. The pitch was that an algorithm trained on Zillow’s vast home-pricing dataset (the same data behind the Zestimate, the consumer-facing home-value estimate Zillow had been refining for years) could outperform a human flipper at scale. Buy at the algorithm’s offer price, hold briefly, sell at the algorithm’s predicted exit price, capture the margin.

Through the first half of 2021, Zillow Offers operated in a US housing market that was, in retrospect, peaking. Prices were rising fast. The pricing model was tuned for an environment in which homes appreciated rapidly between purchase and resale; the margin came partly from the appreciation. Zillow’s bidding became more aggressive. Offers crept above what comparable homes were transacting for, on the theory that by the time the resale closed, the market would have caught up.

In the second half of 2021, the market turned. The turn was not dramatic, and the bubble did not burst, but the rate of appreciation slowed, supply-chain disruptions extended the renovation cycle, and labour shortages pushed contractor schedules out. Homes Zillow had bought at peak-aggressiveness pricing began sitting on the books longer than the model assumed. The exit prices the model had projected stopped materialising. By late summer, Zillow had a multi-thousand-home inventory it could not move at the prices it had paid.

On 17 October 2021, Zillow paused new home purchases through Offers, citing capacity constraints. On 2 November 2021, in its Q3 earnings release, the company announced it was winding down Zillow Offers entirely. The Q3 release disclosed an inventory write-down of approximately US$304 million; the total cost of the wind-down, including write-downs across Q3 and the expected Q4 impairment, ran to more than US$540 million. Approximately 25 percent of Zillow’s workforce, around 2,000 employees, would be laid off. The stock fell about 11 percent in late trading on the announcement, then dropped roughly 25 percent the following day, 3 November 2021, and continued falling in the days that followed.

Co-founder and CEO Rich Barton, in the shareholder letter, framed the decision in terms of risk to the broader business: the company had “determined the unpredictability in forecasting home prices far exceeds what we anticipated,” and continuing to scale Zillow Offers “would result in too much earnings and balance-sheet volatility.” A securities class action followed in the Western District of Washington, alleging that Zillow had misled investors about the model’s performance through 2021. It remains in active litigation, and the courts have allowed it to proceed as a class action.

The Zillow Offers shutdown is the foundational case study in a particular failure mode: an algorithmic system that performs well in the conditions it was trained on, fails when the conditions change, and fails most expensively precisely when its operator has scaled the system up most aggressively.

What an auditable version would have shown

Zillow had structured records. It was, after all, a data company. What Zillow lacked was a structured record of what the model predicted about the future at the moment each purchase was authorised, and whether that prediction was being independently challenged.

For each home Zillow Offers bought, the relevant question is not just “what did the model offer” but “what did the model expect to sell this for, in how many days, and with what confidence interval around each estimate.” For a portfolio of 7,000 homes, that is 7,000 forecasts. Aggregated, the forecasts have a distribution: median expected hold time, variance, confidence intervals, sensitivity to underlying market assumptions. The question that mattered in the third quarter of 2021 was whether the model’s view of the future had become systematically wrong because the regime had shifted. That question should have been answerable from the structured record of forecasts versus actual outcomes, in close to real time.

Zillow’s internal reporting was sufficient to know that homes were sitting longer than expected. It was not, in the public record at least, structured in a way that surfaced the deeper question: was the forecast distribution itself drifting? Were closed sales coming in systematically below model predictions on a pattern that signalled a regime change rather than ordinary noise?

An auditable version would log, for each purchase decision, the model’s expectations: predicted resale price, predicted hold time, confidence interval, key assumptions about appreciation and renovation costs. Each closed sale would be matched to its purchase-time forecast, with the gap computed. The aggregate gap, sliced by region, by purchase month, by model version, would be a continuous quality signal. A widening gap between forecast and actual is the early sign that the model’s view of the world has stopped matching the world’s actual behaviour. Acting on that signal weeks earlier, rather than in a Q3 earnings release, would have meant smaller inventory, smaller write-downs, and a different company.
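The logging and matching described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of Zillow’s systems: the class names, field names, and the choice of the median as the aggregate are all hypothetical, and the example figures are invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class PurchaseForecast:
    """The model's view of the future, captured when the purchase is authorised."""
    home_id: str
    region: str
    purchase_month: str       # e.g. "2021-07"
    model_version: str
    predicted_resale: float   # expected exit price
    predicted_hold_days: int
    interval_low: float       # lower bound of the model's price interval
    interval_high: float      # upper bound of the model's price interval


@dataclass(frozen=True)
class ClosedSale:
    home_id: str
    actual_resale: float
    actual_hold_days: int


def relative_gap(forecast, sale):
    """Signed gap as a fraction of the forecast; negative means sold below forecast."""
    return (sale.actual_resale - forecast.predicted_resale) / forecast.predicted_resale


def median_gap_by(forecasts, sales, key):
    """Match each closed sale to its purchase-time forecast and group the gaps."""
    by_home = {f.home_id: f for f in forecasts}
    groups = defaultdict(list)
    for sale in sales:
        forecast = by_home.get(sale.home_id)
        if forecast is None:
            continue  # no logged forecast, so this sale cannot be reconciled
        groups[key(forecast)].append(relative_gap(forecast, sale))
    return {name: median(gaps) for name, gaps in groups.items()}


# Invented example data, not Zillow figures.
forecasts = [
    PurchaseForecast("h1", "phoenix", "2021-05", "v1", 400_000, 60, 380_000, 420_000),
    PurchaseForecast("h2", "phoenix", "2021-07", "v1", 500_000, 60, 470_000, 530_000),
    PurchaseForecast("h3", "atlanta", "2021-07", "v1", 300_000, 60, 285_000, 315_000),
]
sales = [
    ClosedSale("h1", 404_000, 58),
    ClosedSale("h2", 475_000, 95),
    ClosedSale("h3", 291_000, 88),
]

gaps_by_month = median_gap_by(forecasts, sales, key=lambda f: f.purchase_month)
```

The same function slices by region or by model version by changing the `key` argument. In this invented data the May purchases close slightly above forecast and the July purchases close several percent below it, which is the shape of signal the text describes.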

Where the gap was

The gap was in the feedback loop between the model and the operating decisions, not in the model itself.

The Zillow Offers pricing model was capable. It was working with one of the larger residential-real-estate datasets in the world. It had been refined over years against the Zestimate’s predictions of standing home values. Its problem was not technical incapacity. Its problem was that the operating decisions (how aggressively to bid, how many homes to buy, in which markets) were being made on the basis of recent performance rather than on the basis of how confident the model was in its forward forecasts.

When the model is right consistently, the operator scales up. The operator scales up by giving the model more aggressive bidding parameters and pushing it into more markets. The model’s outputs at the scaled-up bidding parameters are still confident; confidence does not necessarily go down just because the bidding got more aggressive. But the model’s sensitivity to assumption changes does, mechanically, increase. Aggressive bidding means a smaller buffer between purchase price and projected exit. A market turn that would have eaten the buffer at conservative bidding eats well past it at aggressive bidding.
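The buffer arithmetic is easy to make concrete. The sketch below uses a deliberately simple margin formula and invented ratios; none of the numbers are Zillow’s, and the function name and parameters are hypothetical.

```python
def realised_margin(offer_ratio, market_move, cost_ratio):
    """Margin per dollar of fair value at purchase time.

    offer_ratio: purchase price divided by fair value at purchase
    market_move: change in the price level between purchase and resale
    cost_ratio:  renovation, holding and selling costs as a share of fair value
    """
    return (1 + market_move) - offer_ratio - cost_ratio


COSTS = 0.03          # assumed costs, 3% of fair value
CONSERVATIVE = 0.97   # assumed offer at 97% of fair value
AGGRESSIVE = 1.02     # assumed offer at 102% of fair value

# Rising market (+8%): both bidding postures make money.
rising_conservative = realised_margin(CONSERVATIVE, 0.08, COSTS)   # about +0.08
rising_aggressive = realised_margin(AGGRESSIVE, 0.08, COSTS)       # about +0.03

# Flat market (0%): the conservative buffer is used up, the aggressive one is overrun.
flat_conservative = realised_margin(CONSERVATIVE, 0.0, COSTS)      # about 0.00
flat_aggressive = realised_margin(AGGRESSIVE, 0.0, COSTS)          # about -0.05
```

Under these assumptions the same market move that leaves the conservative bidder at break-even puts the aggressive bidder five points under water on every home.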

The control that was missing was a forecast-skill metric, watched continuously, with an explicit threshold at which bidding parameters revert. Forecast-skill metrics are standard in weather forecasting, in economics, and increasingly in trading systems. They were not, on the available evidence, the basis on which Zillow Offers’ bidding aggressiveness was governed. The bidding aggressiveness was governed by what looked, period-on-period, like rising performance. The performance was a function of a market regime that was about to change.
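One common form of forecast-skill metric, borrowed from weather forecasting, compares the model’s squared error with that of a naive reference forecast. The sketch below assumes that form; the choice of reference forecast, the threshold of zero, and the function names are illustrative assumptions, and the figures (in thousands of dollars) are invented.

```python
def mean_squared_error(predicted, actual):
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def skill_score(model_pred, baseline_pred, actual):
    """1.0 is a perfect forecast, 0.0 matches the baseline, below 0.0 is worse than it."""
    baseline_mse = mean_squared_error(baseline_pred, actual)
    if baseline_mse == 0:
        raise ValueError("baseline is perfect; skill score is undefined")
    return 1 - mean_squared_error(model_pred, actual) / baseline_mse


def should_revert(skill, floor=0.0):
    """True when skill has fallen below the pre-agreed floor."""
    return skill < floor


actual_resale = [100, 200, 300]   # realised sale prices
baseline = [120, 180, 320]        # hypothetical naive forecast
model_before = [110, 190, 310]    # model forecasts while the regime holds
model_after = [130, 230, 330]     # model forecasts after the regime shifts

skill_before = skill_score(model_before, baseline, actual_resale)
skill_after = skill_score(model_after, baseline, actual_resale)
```

With these numbers the score falls from 0.75 to -1.25, so `should_revert` flips from False to True. In practice the score would be computed over a rolling window of recently closed sales, so that the threshold reacts to the current regime rather than to the model’s lifetime average.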

What governance should have looked like

For any algorithmic system that takes operating decisions, the governance question is not whether the model is accurate today. It is whether you have a continuous, structured read on how the model’s predictions are matching reality, and a written, signed, pre-agreed escalation path when that match degrades.

Capturing the prediction at the moment of the decision is the part most operators skip. It is easy to log the offer price. It is harder to log the model’s full view of the future at the time the offer went in: the predicted resale, the predicted hold time, the confidence interval, the key assumptions about appreciation and renovation. Without that, there is no way to tell, after the fact, whether a loss was bad luck or model failure. The forecast is the record.

The second piece is reconciliation. Every closed sale gets matched back to the forecast that authorised the purchase. The gap is the data point. Aggregated across the portfolio, the gaps tell you whether the model is staying calibrated or whether its view of the world is drifting away from what the world is actually doing. A small, stable gap means the model knows what it does not know. A widening gap, especially one that skews in a single direction, is the early sign that the market regime has shifted and the model has not.
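Two simple checks capture the calibration and skew described above: how often actual sale prices land inside the forecast interval, and how often they land below the point forecast. The sketch below is an illustration only; the record layout is hypothetical and the figures (in thousands of dollars) are invented.

```python
def interval_coverage(records):
    """Share of closed sales whose actual price fell inside the forecast interval."""
    inside = sum(1 for r in records if r["low"] <= r["actual"] <= r["high"])
    return inside / len(records)


def share_below_forecast(records):
    """Share of closed sales that came in under the point forecast."""
    below = sum(1 for r in records if r["actual"] < r["predicted"])
    return below / len(records)


# Each record pairs a purchase-time forecast with the matching closed sale.
reconciled = [
    {"predicted": 400, "low": 380, "high": 420, "actual": 405},
    {"predicted": 500, "low": 470, "high": 530, "actual": 465},
    {"predicted": 300, "low": 285, "high": 315, "actual": 290},
    {"predicted": 250, "low": 235, "high": 265, "actual": 230},
]

coverage = interval_coverage(reconciled)        # 0.5
skew = share_below_forecast(reconciled)         # 0.75
```

If the intervals were meant to hold, say, 80 percent of outcomes, coverage of 0.5 says the model is overconfident. A below-forecast share well above one half says the misses are not random: they lean in one direction.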

The third piece is the response. A widening gap is not, on its own, useful unless something happens because of it. Zillow’s bidding aggressiveness was governed by recent profit, not by forecast skill. If the bidding parameters had stepped down automatically when the population-level gap crossed a threshold, and a clearly named human had been notified the moment that happened, the conversation Zillow had with the market on 2 November 2021 would have been a conversation Zillow had with itself sometime in August. Lower drama, lower numbers, ongoing business.
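A response of that kind can be very small. The sketch below assumes a single population-level signal (the median forecast gap), a single step-down level, and a notification callback; every name and threshold in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BiddingGovernor:
    owner: str                              # the named human who gets the alert
    notify: Callable[[str, str], None]      # e.g. a pager or email hook
    gap_floor: float = -0.02                # step down if the median gap falls below this
    normal_offer_ratio: float = 1.0
    reduced_offer_ratio: float = 0.95
    stepped_down: bool = False

    @property
    def offer_ratio(self) -> float:
        return self.reduced_offer_ratio if self.stepped_down else self.normal_offer_ratio

    def observe(self, median_gap: float) -> float:
        """Feed in the latest population-level gap; return the offer ratio to use."""
        if median_gap < self.gap_floor and not self.stepped_down:
            self.stepped_down = True
            self.notify(
                self.owner,
                f"median forecast gap {median_gap:.1%} is below floor {self.gap_floor:.1%}; "
                "bidding stepped down",
            )
        return self.offer_ratio


sent = []


def record_notification(owner, message):
    sent.append((owner, message))


governor = BiddingGovernor(owner="head-of-pricing", notify=record_notification)
ratios = [governor.observe(gap) for gap in (-0.01, -0.03, -0.04, 0.0)]
```

The step-down latches: once triggered, the governor stays at the reduced ratio even if the gap recovers, and it notifies the owner once rather than on every reading. Restoring normal bidding is left as a deliberate human decision, which is one possible design choice rather than the only one.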

What broke Zillow Offers was not the model. The model was capable. What broke Zillow Offers was the absence of an instrumented feedback loop between what the model predicted and what the world actually did, watched in close to real time, with a written response when the two diverged. The next regime change, whether in housing, insurance pricing, algorithmic underwriting, or hiring, will find whichever operator has not yet built that loop.

The reference implementation of MetricRecord and ConductRecord is open source. It lives at github.com/saffronandindia/headlights-oss, Apache 2.0 licensed and free to install. Anyone can read every line and verify the signatures. The repository is public now.

Sources

The record

An auditable system would have produced a signed, tamper-evident record the moment this happened: what the system did, the version that did it, the basis it acted on, and the action taken, and Zillow Group could have produced it on demand.

This is the record the system as deployed did not produce in a signed, auditable form.
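To make “signed” and “tamper-evident” concrete, here is a minimal sketch using only the Python standard library. It is not the MetricRecord or ConductRecord format, whose details are not described here; the field names and functions are hypothetical. It uses an HMAC, which needs a shared secret key, so only a key holder can verify it. A record that anyone can check would use an asymmetric signature instead.

```python
import hashlib
import hmac
import json


def canonical(body: dict) -> bytes:
    """Serialise deterministically so the same content always hashes the same way."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_record(payload: dict, prev_digest: str, key: bytes) -> dict:
    """Chain the record to its predecessor, hash it, and tag the hash with the key."""
    body = {"payload": payload, "prev": prev_digest}
    digest = hashlib.sha256(canonical(body)).hexdigest()
    tag = hmac.new(key, digest.encode("ascii"), hashlib.sha256).hexdigest()
    return {**body, "digest": digest, "tag": tag}


def verify_record(record: dict, key: bytes) -> bool:
    """Recompute the hash and tag; any edit to the payload or chain link fails."""
    body = {"payload": record["payload"], "prev": record["prev"]}
    digest = hashlib.sha256(canonical(body)).hexdigest()
    expected_tag = hmac.new(key, digest.encode("ascii"), hashlib.sha256).hexdigest()
    return (hmac.compare_digest(digest, record["digest"])
            and hmac.compare_digest(expected_tag, record["tag"]))


key = b"demo-key-not-for-production"
record = sign_record(
    {
        "system_did": "authorised purchase offer",
        "model_version": "v1",
        "basis": {"predicted_resale": 500000, "predicted_hold_days": 60},
        "action": "offer sent",
    },
    prev_digest="0" * 64,   # placeholder for the first record in a chain
    key=key,
)

# Editing the payload after the fact breaks verification.
tampered = {**record, "payload": {**record["payload"], "action": "offer withheld"}}
```

Here `verify_record(record, key)` returns True and `verify_record(tampered, key)` returns False. Because each record carries the digest of the one before it, deleting or reordering entries also breaks the chain.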

What this teaches

Capture what happened when it happens

What the system did, the version that did it, the basis it acted on, and the action taken, recorded at the moment, not reconstructed after.

Sign it, so no one has to trust the record-keeper

A tamper-evident entry. Edit it later and the signature breaks. The record does not ask for the benefit of the doubt.

Make it verifiable by anyone

A court, a regulator, a customer's lawyer can check the record themselves, without taking the company, or us, at our word.

Also in the library

HD-INC-007 · UnitedHealth allegedly used an algorithm with a 90% error rate to deny post-acute care to elderly Medicare Advantage patients · Healthcare · 2022

HD-INC-009 · Robodebt, Australia's automated welfare-debt scheme raised $1.76 billion in unlawful debts against 443,000 people · Government · 2016

HD-INC-042 · The Dutch tax office used a risk algorithm that flagged families by nationality, and wrongly branded tens of thousands as benefit fraudsters · Government · 2021

Headlights summarises publicly reported AI incidents. All summaries are independently written, attributed to their original sources, and intended for research and educational purposes. Allegations are identified as such until established through official findings.

Last reviewed June 2026. This report is based on the sources listed above and reflects information available at the time of review; later developments may not be captured. Where a person is described as charged with or alleged to have done something, that allegation is unproven unless a conviction or a court or regulatory finding is stated. Headlights publishes journalism and commentary, not legal advice.
