HD-INC-006

Government · United States · 2023 · Hallucination & fabrication

New York City's official business chatbot told small businesses they could break the law for over a year, and stayed online

By Ellie Harris · Filed 1 October 2023

Alleged: City of New York, Microsoft (Azure / OpenAI integration) developed or deployed the AI system implicated in this incident. Details are drawn from public reports; parties are presumed innocent of any wrongdoing not established by an official finding.

What happened

In October 2023, Mayor Eric Adams launched MyCity, a chatbot built on Microsoft Azure and OpenAI’s models, as the public face of the City of New York’s small-business support services. The mayor framed it as the future of how citizens would interact with city government: a single conversational interface that could answer questions about permits, taxes, hiring, housing, and the dense thicket of city regulations small businesses are expected to navigate. The program had cost the city in the order of half a million dollars to develop. It was, in Adams’ framing, a flagship example of how generative AI could make government accessible.

In March 2024, The Markup and The City published an investigation showing what MyCity was actually telling small businesses. The findings landed hard.

Asked whether a landlord could refuse a prospective tenant who paid with a Section 8 voucher, MyCity said yes. Source-of-income discrimination has been illegal in New York City since 2008.

Asked whether an employer could keep a portion of workers’ tips, MyCity said yes. Section 196-d of the New York Labor Law explicitly prohibits an employer from retaining any part of a worker’s tips.

Asked whether an employer could fire a worker for complaining about sexual harassment, MyCity said yes. Retaliation against employees for reporting harassment is illegal under federal, state and city law.

Asked whether a business could refuse to accept cash, MyCity said yes. New York City Local Law 34 of 2020 (Administrative Code § 20-840) requires businesses to accept cash.

The Markup repeated the queries and got contradictory answers from different sessions. The investigators were careful: they tested variations of the same question, documented the responses, ran the queries through different user accounts, and verified each response against the relevant statute. The pattern held. The chatbot was not occasionally wrong on edge cases. It was confidently wrong on basic compliance questions that small businesses were being told they could trust it on.

The Adams administration’s response was striking for what it did not do. The mayor acknowledged at a press conference that the bot’s answers were “wrong in some areas.” He did not take it offline. The city’s communications team quietly updated the MyCity site to label the bot as a “beta product” that may provide “inaccurate or incomplete” information. The bot remained the public-facing recommended option for small-business owners with questions about city regulations.

The chatbot continued operating, with various small fixes and continued public criticism, for nearly two years. In late January 2026, Mayor Zohran Mamdani, who had taken office on 1 January 2026 after campaigning in part on the failures of the Adams administration’s AI procurement, announced that the MyCity chatbot would be discontinued. The Markup reported the decision under a headline that captured the shape of the whole episode: “Mamdani to kill the NYC AI chatbot we caught telling businesses to break the law.”

What an auditable version would have shown

The core failure was not that the chatbot occasionally produced wrong answers. Generative models will hallucinate. The failure was that the chatbot was deployed as the city’s authoritative front door for small-business compliance questions, with no record-keeping discipline that would have let anyone (the city, the operators, the public) see what it was telling people.

There is no public log of MyCity’s interactions. The city did not publish, and as far as is known did not retain, structured records of the questions asked and the responses produced, classified by topic and scored against ground-truth answers. The Markup investigation had to recreate the questions and document responses from scratch. The city’s own analytics, to the extent they existed, were not the basis on which the city decided whether the bot was performing acceptably. The decision was political and reputational, not evidentiary.

An auditable version would have produced, for every interaction, a signed record: the question asked, the answer given, the model version, the retrieval sources (if any), the topic category, and a confidence score. The records would be aggregated for population-level analytics: how often does the bot answer questions about source-of-income discrimination, what’s the variance of those answers across sessions, what’s the rate at which the same question gets contradictory answers. Periodic adversarial testing, running known-correct compliance questions through the bot at scale, would generate a continuously updated error-rate per topic, visible to the public, surfaced to the procurement office. The bot’s continued operation would be a question with an evidence base, not a press-conference answer.
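A minimal sketch of what one such per-interaction record could look like. This is an illustration under stated assumptions, not a description of anything MyCity or its vendors ran: the field names, the model identifier and the confidence value are hypothetical, and HMAC-SHA256 with a shared key stands in for a real signature scheme. A shared key only lets the key-holder verify; a public-facing system would more plausibly use an asymmetric signature so that outsiders can check records without holding a secret.

```python
import hashlib
import hmac
import json

# Hypothetical key for illustration only. A real deployment would more likely
# use an asymmetric signature so third parties can verify without the secret.
SIGNING_KEY = b"demo-key-not-for-production"


def _canonical(record: dict) -> bytes:
    """Serialise deterministically so the same record always signs the same way."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_record(record: dict, key: bytes = SIGNING_KEY) -> dict:
    """Return a copy of the record with an HMAC-SHA256 signature attached."""
    signature = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    return {**record, "signature": signature}


def verify_record(signed: dict, key: bytes = SIGNING_KEY) -> bool:
    """Recompute the signature over every field except the signature itself."""
    body = {k: v for k, v in signed.items() if k != "signature"}
    expected = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed.get("signature", ""))


record = sign_record({
    "question": "Can a landlord refuse a Section 8 voucher?",
    "answer": "Yes.",
    "model_version": "example-model-v1",  # hypothetical identifier
    "retrieval_sources": [],              # nothing grounded this answer
    "topic": "source_of_income_discrimination",
    "confidence": 0.91,                   # illustrative value, not a real score
})

# Editing any field after signing makes verification fail.
tampered = {**record, "answer": "No."}
```

Here `verify_record(record)` succeeds and `verify_record(tampered)` does not, which is the property that lets a record be checked rather than taken on trust.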

Without those records, the only signal the city had was journalism. The Markup investigation was the audit the city should have been running on itself, conducted instead by reporters with a media platform. Not every city has a Markup investigation on its docket. Most do not.

Where the gap was

The gap was in procurement and governance, not in the model.

The model, GPT-4-class at the time of launch, was capable of producing accurate answers to most compliance questions when given the right grounding documents and a sensible system prompt. The city had access to the canonical statutory text for every regulation MyCity covered. The right architecture was retrieval-augmented generation against that statutory text, with refusal patterns for any question the bot could not ground in a verifiable source. What was deployed was a thinner system: a chat interface over a generic model, with insufficient grounding and insufficient refusal, applied to a category of questions where confident wrong answers had legal consequences for the people asking.

The city’s procurement of the chatbot did not require, as a condition of acceptance, structured logging of interactions in a form suitable for audit. It did not require adversarial testing against the corpus of city regulations before deployment. It did not require a continuous-monitoring dashboard exposed to the City Council or to the public. The chatbot was procured as if it were a website redesign. It functioned as legal advice to thousands of people who had no other obvious recourse.

When The Markup’s findings landed, the city’s options were narrow because the records were not there. The administration could not say, with evidence, how often the bot answered each category of question correctly. It could not say which model version was active during The Markup’s testing. It could not produce a population-level error rate per topic. It could acknowledge that the answers cited in the investigation were wrong, and it did. It could not say that the broader pattern was different from what The Markup had documented, because nothing in the city’s own records supported a different account.

What governance should have looked like

A government-services chatbot deployed for compliance questions needs three things the MyCity deployment did not have.

The chatbot, first, has to be willing to say it does not know. A bot that refuses to answer a Section 8 question when it cannot ground the answer in a current statute is a far better government service than a bot that answers confidently and wrongly. “I can’t confirm this against current city law; please call 311” is a complete answer. Refusal rates also become a useful signal in their own right: a topic the bot can never answer is a topic the city should either improve its source coverage on or stop offering through the chatbot.
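The answer-or-refuse behaviour can be sketched roughly as follows. Everything here is an assumption made for illustration: the two-entry corpus holds paraphrases taken from this report rather than statutory text, the keyword-overlap scoring is a toy stand-in for real retrieval, and the function and field names are hypothetical. A production system would condition a model's answer on the retrieved source; this sketch simply returns the source summary so the gating logic stays visible.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    citation: str
    summary: str  # paraphrase from this report, not the statutory text


# Hypothetical two-entry corpus; a real one would hold the current statutes.
CORPUS = [
    Source("N.Y. Labor Law § 196-d",
           "An employer may not retain any part of a worker's tips."),
    Source("NYC Administrative Code § 20-840",
           "Businesses are required to accept cash."),
]

REFUSAL = "I can't confirm this against current city law; please call 311."
STOPWORDS = {"a", "an", "of", "to", "the", "can", "may", "not", "any", "are"}


def keywords(text: str) -> set[str]:
    """Lower-cased alphabetic tokens with common filler words removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def answer(question: str, min_overlap: int = 2) -> dict:
    """Answer only when a source shares enough keywords; otherwise refuse."""
    q = keywords(question)
    best: Optional[Source] = None
    best_score = 0
    for src in CORPUS:
        score = len(q & keywords(src.summary))
        if score > best_score:
            best, best_score = src, score
    if best is None or best_score < min_overlap:
        return {"grounded": False, "text": REFUSAL, "citation": None}
    return {"grounded": True, "text": best.summary, "citation": best.citation}


tips = answer("Can an employer keep a portion of workers' tips?")
voucher = answer("Can a landlord refuse a Section 8 voucher?")
```

With this toy corpus the tips question is answered with a citation, while the voucher question, which nothing in the corpus covers, gets the refusal text instead of a guess.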

The bot then has to be continuously tested. A small test harness runs hundreds of known compliance questions against it every day. Errors per topic are scored against the underlying statute and published as a dashboard the procurement office, the City Council and the public can all see. When a topic’s error rate crosses a threshold, the bot routes that topic straight to fallback until the underlying issue is fixed. The administration does not need to wait for The Markup to publish.
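A rough sketch of such a harness, under assumptions that are not in the public reporting: the test cases reduce each question to a yes/no expectation based on the laws cited in this report, the 5% threshold is arbitrary, and `stub_bot` is a hypothetical stand-in that mimics the reported failure mode rather than the real system.

```python
from collections import defaultdict
from typing import Callable, Iterable

# Hypothetical test cases: (topic, question, expected yes/no answer).
# The expected answers follow the laws cited in this report.
TEST_CASES = [
    ("tips", "Can an employer keep a portion of workers' tips?", "no"),
    ("cash", "Can a business refuse to accept cash?", "no"),
    ("housing", "Can a landlord refuse a Section 8 voucher?", "no"),
    ("retaliation", "Can an employer fire a worker for reporting harassment?", "no"),
]


def error_rates(bot: Callable[[str], str],
                cases: Iterable[tuple[str, str, str]]) -> dict[str, float]:
    """Run every case through the bot and return the error rate per topic."""
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for topic, question, expected in cases:
        totals[topic] += 1
        if bot(question).strip().lower() != expected:
            errors[topic] += 1
    return {topic: errors[topic] / totals[topic] for topic in totals}


def topics_to_fallback(rates: dict[str, float], threshold: float = 0.05) -> set[str]:
    """Topics whose error rate exceeds the (assumed) threshold get routed away."""
    return {topic for topic, rate in rates.items() if rate > threshold}


def stub_bot(question: str) -> str:
    """Stand-in bot: right about tips, confidently wrong about everything else."""
    return "no" if "tips" in question else "yes"


rates = error_rates(stub_bot, TEST_CASES)
fallback = topics_to_fallback(rates)
```

Run daily against a few hundred questions rather than four, the same two functions would yield both the published per-topic error rates and the list of topics to hand off to a human channel.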

Finally, the records themselves should be public in aggregate. That means questions asked, topics, grounded-answer rates and refusal rates, published daily and anonymised on a city dashboard. Journalists then do not have to reverse-engineer what the bot is telling people. The city is the source of truth on what it is telling its own citizens.
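One way the daily aggregate could be computed, as a sketch only. The log rows, field names and topic labels below are invented for illustration; the point is that the published summary carries counts and rates per topic, including how many questions drew contradictory answers, and no user identifiers.

```python
from collections import defaultdict

# Hypothetical one-day interaction log. No user identifiers are stored;
# question_id groups repeats of the same question across sessions.
LOG = [
    {"topic": "tips", "question_id": "q1", "answer": "no", "grounded": True, "refused": False},
    {"topic": "tips", "question_id": "q1", "answer": "yes", "grounded": False, "refused": False},
    {"topic": "cash", "question_id": "q2", "answer": "no", "grounded": True, "refused": False},
    {"topic": "housing", "question_id": "q3", "answer": None, "grounded": False, "refused": True},
]


def daily_summary(log: list[dict]) -> dict[str, dict]:
    """Aggregate one day's interactions into per-topic dashboard figures."""
    by_topic: dict[str, list[dict]] = defaultdict(list)
    for row in log:
        by_topic[row["topic"]].append(row)

    summary = {}
    for topic, rows in by_topic.items():
        n = len(rows)
        # Collect the distinct answers each question received, ignoring refusals.
        answers: dict[str, set] = defaultdict(set)
        for row in rows:
            if not row["refused"]:
                answers[row["question_id"]].add(row["answer"])
        summary[topic] = {
            "interactions": n,
            "grounded_rate": sum(row["grounded"] for row in rows) / n,
            "refusal_rate": sum(row["refused"] for row in rows) / n,
            "contradicted_questions": sum(1 for a in answers.values() if len(a) > 1),
        }
    return summary


summary = daily_summary(LOG)
```

In this invented log the tips topic shows one question answered two different ways, which is the kind of session-to-session contradiction the investigation had to uncover by hand.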

MyCity is not the last municipal chatbot. Almost every major US city and county now has AI procurement underway. The procurement standards being written this year (what the city must contractually require of the vendor, what records the vendor must keep, what the vendor must be willing to disclose) are the standards by which every program of this shape will be judged. The cities that write them well will not need their own version of The Markup’s investigation. The ones that don’t will read about themselves on a Friday.

The reference implementation of VerificationGate and ConductRecord is open source. It lives at github.com/saffronandindia/headlights-oss, Apache 2.0 licensed and free to install. Anyone can read every line and verify the signatures. The repository is public now.

Sources

The record

An auditable system would have produced a signed, tamper-evident record the moment this happened: what the system did, the version that did it, the basis it acted on, and the action taken. The City of New York and Microsoft (Azure / OpenAI integration) could then have produced it on demand.

This is the record the system as deployed did not produce in a signed, auditable form.

What this teaches

Capture what happened when it happens

What the system did, the version that did it, the basis it acted on, and the action taken, recorded at the moment, not reconstructed after.

Sign it, so no one has to trust the record-keeper

A tamper-evident entry. Edit it later and the signature breaks. The record does not ask for the benefit of the doubt.

Make it verifiable by anyone

A court, a regulator, a customer's lawyer can check the record themselves, without taking the company, or us, at our word.

Headlights summarises publicly reported AI incidents. All summaries are independently written, attributed to their original sources, and intended for research and educational purposes. Allegations are identified as such until established through official findings.

Last reviewed June 2026. This report is based on the sources listed above and reflects information available at the time of review; later developments may not be captured. Where a person is described as charged with or alleged to have done something, that allegation is unproven unless a conviction or a court or regulatory finding is stated. Headlights publishes journalism and commentary, not legal advice.
