HD-INC-001 · Aviation · Hallucination

Air Canada chatbot promised a bereavement refund policy that did not exist

An airline's chatbot invented a refund policy. A tribunal made the airline pay anyway.

What happened

On 11 November 2022, Jake Moffatt's grandmother died. He went to the Air Canada website to book flights from Vancouver to Toronto for the funeral. Before booking, he opened the support chatbot to ask about bereavement fares. The chatbot told him he could book at the regular price and apply for a partial refund within ninety days of the ticket being issued.

He booked. He applied for the refund. Air Canada refused.

The airline's actual policy required bereavement fares to be requested before booking, not after. The chatbot had described a process that did not exist. Moffatt produced a screenshot of the conversation as evidence and took the matter to the British Columbia Civil Resolution Tribunal. Air Canada's defence, in substance, was that the chatbot was a separate entity from the airline and that the airline could not be held responsible for what its chatbot said.

On 14 February 2024, the Tribunal rejected the argument. Tribunal Member Christopher C. Rivers, summarising the airline's position, wrote: "In effect, Air Canada suggests the chatbot is a separate legal entity that is responsible for its own actions. This is a remarkable submission." He ruled that a chatbot, however interactive, is still a part of the company's website. The company is responsible for what its website tells a customer. The airline was ordered to pay CAD 812.02 in total, covering the fare difference, pre-judgment interest, and tribunal fees.

The damages were modest. Moffatt v. Air Canada has since become a standard reference in legal analyses of AI agent liability in Canada, the UK, and Australia.

What an auditable version would have shown

The case turned on a single piece of evidence: Moffatt's screenshot of the chatbot conversation. Air Canada did not contradict the screenshot. It did not produce its own log of the conversation. It did not produce the underlying prompt template, the policy document the chatbot was trained on, or the model version active on the day of the conversation. It made a legal argument about agency instead.

An auditable conduct record would have produced something different on demand. The conversation captured server-side and signed cryptographically the moment it happened. The model version active on 11 November 2022. The retrieval sources the model pulled from when it composed the bereavement-refund claim. The policy version current at the time the question was asked. None of that existed.

With that record, the airline could have argued from evidence. Maybe the bot pulled from a stale knowledge base. Maybe the bot hallucinated against a correct one. Maybe the policy had just changed. Each is a different problem with a different fix, and a signed record would have shown which one applied to Moffatt's conversation. Without the record, Air Canada fell back on the argument that the chatbot was a separate legal entity.
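The three hypotheses above can be told apart mechanically once the right fields are on record. The following is a minimal sketch, not a description of any real system: the RecordFacts fields, the diagnose function, the seven-day "recent change" window, and all example values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RecordFacts:
    """Hypothetical facts an investigator would read off a conduct record."""
    retrieved_policy_version: str        # version of the policy text the bot retrieved
    published_policy_version: str        # version officially in force at reply time
    published_on: date                   # date the in-force version took effect
    replied_on: date                     # date of the bot's reply
    reply_matches_retrieved_text: bool   # reviewer's finding on the reply vs. the retrieved text


def diagnose(facts: RecordFacts, recent_change_days: int = 7) -> str:
    """Classify an incident into one of the three failure modes (or none)."""
    if facts.retrieved_policy_version != facts.published_policy_version:
        # The bot read an outdated document: either the policy changed very
        # recently, or the knowledge base has been stale for a while.
        age = (facts.replied_on - facts.published_on).days
        if age <= recent_change_days:
            return "policy changed recently"
        return "stale knowledge base"
    if not facts.reply_matches_retrieved_text:
        # The source was current, but the reply did not follow it.
        return "hallucination against a correct source"
    return "reply consistent with policy"


# Made-up example values, for illustration only
stale = diagnose(RecordFacts("v1", "v2", date(2022, 8, 1), date(2022, 11, 11), True))
recent = diagnose(RecordFacts("v1", "v2", date(2022, 11, 9), date(2022, 11, 11), True))
invented = diagnose(RecordFacts("v2", "v2", date(2022, 8, 1), date(2022, 11, 11), False))
```

Each outcome points to a different owner: a stale knowledge base is a content-sync problem, a recent change is a release-process problem, and a hallucination against a correct source is a model or grounding problem.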

Where the gap was

The gap was not specific to Air Canada. It was, and remains, the default state of almost every AI agent currently deployed by a non-trivial company.

The default deployment writes logs into a customer service database with thirty-to-ninety-day retention, no signing, no model version pinning, no retrieval traces, and no snapshot of the prompt. When the incident arrives, the company has logs, which are not the same thing as evidence. A customer's screenshot, presented to a court or tribunal, carries roughly the same weight as an unsigned database row, because both are unverifiable claims.

The Tribunal in Moffatt did not have to grapple with this because Air Canada chose not to introduce its own logs. The next case will be different. Companies will produce logs and customers will point out that the logs are unsigned, the model version was not captured, the system prompt has been changed seven times since the incident, and there is no way to tell whether the bot the company is now describing is the same bot the customer talked to. Without a signed, version-pinned conduct record, the logs carry weight comparable to the screenshot.

What governance should have looked like

Every chatbot reply gets written to a signed, hash-chained record at the moment it happens. The record captures the model version, the policy version that was active that day, the documents the model retrieved from, a hash of the system prompt, and the conversation itself. The signature is verifiable by any third party, including the customer's lawyer, without needing to trust the company.

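from datetime import datetime, timezone
from hashlib import sha256

from headlights import ConductRecord, sign, chain

# At the moment the bot replies, build a structured record.
# system_prompt, retrieved_docs, user_message, bot_reply, tool_call_log and
# last_record come from the request handler; field values are illustrative.
record = ConductRecord(
    agent_id="customer-support-bot-v2",
    model_provider="openai",
    model_version="gpt-4-1106-preview",
    timestamp=datetime.now(timezone.utc),
    policy_version="bereavement-fares-2022-08",
    # sha256 takes bytes, so encode the prompt and store the hex digest
    system_prompt_hash=sha256(system_prompt.encode("utf-8")).hexdigest(),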
    retrieved_docs=[doc.id for doc in retrieved_docs],
    conversation=[
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": bot_reply},
    ],
    tool_calls=tool_call_log,
    previous_record_hash=last_record.hash(),
)

# Sign with the company's ECDSA P-256 private key
signed = sign(record, key=company_private_key)

# Append to the hash-chained audit log
chain.append(signed)

Two years later, when the court asks "what did the bot actually say?", the company produces the signed record. Anyone with a few lines of code can verify the signature: the customer, the customer's lawyer, the court itself. If the record was edited after the fact, the signature breaks. If the model version was different that day, the record shows it. The argument shifts from "trust us" to "check it yourself."
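The hash-chain half of that claim can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not the headlights implementation: record_hash, append_record, and first_broken_link are hypothetical helper names, records are plain dictionaries, and signature verification (the ECDSA step) is deliberately left out.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Hash a record's canonical JSON form (sorted keys, compact separators)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_record(chain: list, payload: dict) -> None:
    """Append a payload, linking it to the hash of the previous record."""
    previous = record_hash(chain[-1]) if chain else None
    chain.append({"payload": payload, "previous_record_hash": previous})


def first_broken_link(chain: list):
    """Return the index of the first record whose back-link fails, or None if intact."""
    for i in range(1, len(chain)):
        if chain[i]["previous_record_hash"] != record_hash(chain[i - 1]):
            return i
    return None


# Placeholder conversation text, for illustration only
log = []
append_record(log, {"role": "user", "content": "Do you offer bereavement fares?"})
append_record(log, {"role": "assistant", "content": "Example reply text."})
append_record(log, {"role": "user", "content": "Thanks."})

intact = first_broken_link(log)

# Simulate an after-the-fact edit of the second record
log[1]["payload"]["content"] = "Edited reply text."
tampered = first_broken_link(log)
```

Editing the second record changes its hash, so the third record's back-link no longer matches and the break is reported there. A chain alone cannot detect an edit to the most recent record, which is one reason each record is also signed.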

The signed conduct record is one layer. Air Canada had several others available. A retrieval-grounding policy that confined the bot to repeating verified policy text rather than improvising would have caught the bereavement hallucination at the source. A refusal pattern for sensitive categories, defaulting to "let me connect you with a person" for bereavement, refunds, and legal questions, would have removed the bot from the decision entirely. Adversarial testing that included bereavement scenarios before deployment would have surfaced the problem in QA rather than before a tribunal. Policy version pinning with a freshness check would have flagged any answer drawing on a policy not re-verified in the last thirty days. None of these is exotic. They are documented practice in any mature AI governance framework. The cumulative cost of implementing all four is plausibly less than the cost of one hearing.
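Two of those layers, the refusal pattern and the freshness check, are simple enough to sketch. The example below is illustrative only: the keyword lists, the route function, and its return labels are hypothetical, and keyword matching stands in for whatever intent classifier a production system would use. The thirty-day window is the one named above.

```python
from datetime import date, timedelta

# Hypothetical keyword lists; a real deployment would likely use an intent classifier
SENSITIVE_KEYWORDS = {
    "bereavement": ("bereavement", "funeral", "died", "death"),
    "refunds": ("refund", "reimburse", "compensation"),
    "legal": ("lawsuit", "lawyer", "tribunal", "liability"),
}
MAX_POLICY_AGE = timedelta(days=30)


def sensitive_category(message: str):
    """Return the first sensitive category the message touches, or None."""
    text = message.lower()
    for category, keywords in SENSITIVE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return None


def policy_is_fresh(last_verified: date, today: date) -> bool:
    """True if the policy was re-verified within the allowed window."""
    return today - last_verified <= MAX_POLICY_AGE


def route(message: str, policy_last_verified: date, today: date) -> str:
    """Decide whether the bot may answer, must hand off, or must flag the policy."""
    if sensitive_category(message) is not None:
        return "handoff_to_human"
    if not policy_is_fresh(policy_last_verified, today):
        return "flag_for_policy_review"
    return "answer_from_verified_policy"


# Made-up messages and dates, for illustration only
today = date(2022, 11, 11)
sensitive = route("Can I get a refund after a funeral?", date(2022, 11, 1), today)
outdated = route("What is the carry-on size limit?", date(2022, 9, 1), today)
routine = route("What is the carry-on size limit?", date(2022, 11, 1), today)
```

The order of the checks is a design choice: sensitive categories are handed off before the freshness check runs, so the bot never answers a bereavement or refund question even when the policy text is current.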

The reference implementation of this pattern is open source. It will live at github.com/saffronandindia/headlights-oss, Apache 2.0 licensed, 226 tests passing, free for any company to install. Anyone can read every line. Anyone can verify the signatures. No vendor lock-in. No proprietary auditor in the loop. The repository goes public alongside the launch of this Incident Library.

This entry is an educational analysis based on the publicly reported sources listed below. It does not constitute legal advice. Facts are stated to the best of our knowledge as of the date of publication; corrections will be issued promptly on request. Contact: ellie@useheadlights.com.