I’ve been maintaining chardet for over 12 years, through every release since version 2.3. When I rewrote it from scratch with Claude and released the new version under the MIT license,1 people questioned whether the new code was really independent of the LGPL original.

Most of that discussion was based on assumptions about what happened during the rewrite. This post replaces those assumptions with evidence from the raw conversation transcripts.

TL;DR: I rewrote chardet from scratch with Claude to drop the LGPL license that had blocked stdlib inclusion for a decade, and to fix the performance and accuracy problems with the old architecture. Claude has chardet in its training data, and despite my instructions, its subagents read old source code in three sessions. Every direct exposure was either API surface or files primarily authored by me, and is documented here with transcript evidence. Three independent similarity measures found near-zero overlap between the old and new code.

Why I rewrote chardet

Mark Pilgrim originally ported chardet from Mozilla’s C++ implementation in 2006 (five years before he removed himself from the internet). Despite his deleting the project repository, others picked up where he left off, and I took over maintenance in 2012 after merging a Python 3 fork called charade by Ian Cordasco back into the then-Python-2-only mainline. Since then, I’ve made hundreds of commits across seven major releases: adding 38 languages and dozens of encodings, replacing heuristic probers with trained statistical models, fixing detection bugs, introducing API features like detect_all() and encoding-era filtering, and keeping pace with each new Python release.

Chardet has over 130 million monthly downloads, making it one of the most widely used Python libraries. Because of its broad usage, in 2015, it was considered for inclusion in the Python standard library. Guido van Rossum asked me directly about chardet’s dependencies and API stability when considering it for stdlib, but the effort died on the vine when the LGPL license (inherited from Mozilla’s original code) turned out to be a blocker.2 I’ve wanted to fix that ever since.

The old architecture was showing its age, too. chardet 5.2.0 got 65% accuracy on our expanded test set at 55 files/second. My 6.0.0 release pushed accuracy to 88% by adding corpus-trained models for 38 new languages, but speed cratered to 11 files/second because the prober architecture scaled poorly: every new language meant more probers running in parallel on every byte of input. The design had hit a wall where improving accuracy meant getting slower.

Here’s what I wrote on my phone as the initial requirements list (verbatim from my comment to Mark Pilgrim):

  1. Public API compatibility
  2. Should still be called chardet, as the plan is to replace chardet
  3. Not based on any GPL or LGPL code
  4. High chardet detection accuracy on test data
  5. Language detection not a hard requirement, but if it is easy or a byproduct of other design, do it
  6. Fast and memory efficient: should use multiple cores efficiently3
  7. No runtime dependencies
  8. Must work on PyPy and CPython
  9. Clean, modern design
  10. If using trained statistical models, use data available via Hugging Face’s load dataset API
  11. Any training code should cache data locally so we can retrain often during dev process
  12. Benchmark often
  13. Does not use tons of large dict literals, which take forever to import on CPython 3.12

The initial design document and the full plan commit history are public. The result: a 41x speed improvement, better accuracy on the test suite, ~22 Python files instead of 100, and an MIT license.

That comment to Mark covers the high-level story. This post goes deeper, with transcript evidence for every time Claude touched the old codebase.

The elephant in the room: Claude already knows chardet

Before getting into the rewrite sessions, I should address something upfront: Claude’s training data includes the chardet source code. This is separate from what Claude directly accessed during our sessions, but it matters.

Here’s Claude describing one component of chardet’s internals (the coding state machines) from memory alone, with no source code provided:

The state machine is the first line of defense: it determines whether a byte sequence is structurally valid for a given encoding.

Each encoding defines a model with a class table that maps every possible byte value (0x00-0xFF) to a byte class, and a state table where given (current_state, byte_class) the next state is produced. States are: START (initial state, also “ready for next character”), ME_ONE (need 1 more byte), ME_TWO (need 2 more), ME_THREE (need 3 more, for GB18030 4-byte sequences), ITS_ME (complete valid character decoded), and ERROR (invalid byte for this encoding).

For example, tracing the input bytes 0xC4 0xE3 through the UTF-8 state machine: 0xC4 is classified as a 2-byte lead (110xxxxx), transitioning from START to ME_ONE (“expecting one continuation byte”). Then 0xE3 is classified as a 3-byte lead (1110xxxx), but the machine is in ME_ONE expecting a continuation byte (10xxxxxx), so it transitions to ERROR. UTF-8 is ruled out. Meanwhile, the GB2312 state machine classifies 0xC4 as a valid GB range byte (class 4), transitions START to ME_ONE, then classifies 0xE3 also as class 4, and transitions ME_ONE to ITS_ME: a valid 2-byte character (你). And the Shift_JIS state machine treats 0xC4 as a valid single-byte katakana character (half-width range 0xA1-0xDF), returning ITS_ME immediately, then treats 0xE3 as a valid SJIS lead byte, moving to ME_ONE. So SJIS survives structurally too, but its distribution analysis against the Japanese frequency model will score terribly, while GB2312’s distribution analysis against the Chinese model will score high.

That’s not “Claude knows chardet exists.” That’s Claude reproducing the two-table design, byte classifications per encoding, and distribution analysis for breaking ties between structurally valid encodings.
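For readers who haven’t seen a coding state machine, the two-table design can be illustrated with a toy sketch. Everything below is simplified for illustration: made-up state labels, a collapsed byte classification, and UTF-8 handling for 1- and 2-byte sequences only. It is not chardet’s code or tables.

```python
ERROR, START, ME_ONE = 0, 1, 2  # ME_ONE: "need one more byte" (illustrative label)

def byte_class(b: int) -> int:
    """Table 1: map a byte value to a byte class (heavily simplified)."""
    if b < 0x80:
        return 0  # ASCII
    if 0x80 <= b <= 0xBF:
        return 1  # continuation byte (10xxxxxx)
    if 0xC2 <= b <= 0xDF:
        return 2  # 2-byte lead (110xxxxx)
    return 3      # everything else (real UTF-8 also has 3- and 4-byte leads)

# Table 2: (current_state, byte_class) -> next state
STATE_TABLE = {
    START:  {0: START, 1: ERROR, 2: ME_ONE, 3: ERROR},
    ME_ONE: {0: ERROR, 1: START, 2: ERROR, 3: ERROR},
}

def feed(data: bytes) -> bool:
    """Return True if the bytes are structurally valid for this toy machine."""
    state = START
    for b in data:
        state = STATE_TABLE[state][byte_class(b)]
        if state == ERROR:
            return False
    return state == START  # must not end mid-character
```

Feeding the 0xC4 0xE3 bytes from the trace above drives this toy machine from START to ME_ONE and then to ERROR, exactly the structural rejection being described.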

It knew basically everything: prober hierarchy, state machines, BOM detection, frequency analysis.

But not quite perfectly. I had Claude describe chardet’s architecture in as much detail as it could, with no web searches or external commands. You can read the full regurgitation if you’d like.

When I checked it against chardet 5.2.0 (the latest version that was available when Opus 4.6 was trained), I found errors. It got the prober class hierarchy wrong, omitted several probers entirely (MacRomanProber, CP949Prober, JOHABProber), and got the UTF-8 confidence formula wrong.

It also hallucinated plausible names for unnamed things. The state machine has three named constants (START, ERROR, ITS_ME), but Claude invented ME_ONE, ME_TWO, and ME_THREE for intermediate states that exist only as raw integers in the code. It understood those states were meaningful; it just gave them labels that don’t exist in the source.

So, it had a thorough understanding of the architecture, but there were gaps in implementation details. Not a reliable copy of the source code, but far more than surface-level familiarity.

Why I used Claude

I used Claude because I knew it’d be fast enough that I’d actually finish it. I maintain chardet in my spare time around a full-time job, and the cadence between major releases has historically been measured in years. A ground-up rewrite on my own would have optimistically taken months. I’ve been using Claude for most of my coding for about six months, and with it the rewrite took five days.

It also helped with the optics. I’ve had the old implementation in my head for over a decade, and I knew people would question whether code I personally typed was really independent, no matter how different I made it. Having Claude generate the code from my specs meant the implementation choices weren’t coming from my muscle memory of the old codebase. That doesn’t make it legally clean on its own (Claude has the old code in its training data too), but the code wasn’t shaped by a decade of staring at prober hierarchies and state machine tables.

How I used Claude

I tried to base the process on clean-room reimplementation, but it was an approximation, not the real thing.

In a traditional clean room, two completely separate teams work behind a strict information barrier. One reads the source and writes a spec. The other implements from that spec without ever seeing the original.

In this case, I was the spec writer who knew the old code, and Claude was the implementer starting from an empty repo. The roles were separated, but the spec writer and implementer both had at least some latent knowledge of the original code.

My instructions to Claude were explicit: do not use LGPL or GPL source code, do not look at the existing chardet code, start from an empty repository. Research on LLM memorization suggests these instructions reduce but don’t eliminate the chance of training data reproduction. This wasn’t a hard guarantee, but I hoped it would prevent blatant copying (and as we will see later, it did).

I set up Claude Code with restrictive permissions in the rewrite repo. Only git commands were pre-approved:

{
  "permissions": {
    "allow": ["Bash(git:*)"]
  }
}

Which brings us to the evidence: every time old chardet code entered Claude’s active context during the rewrite, despite my instructions.

Active context is different from training data. Training data is blended with millions of other files during model training, and the model can’t reliably reconstruct exact code from it (as the regurgitation errors above demonstrate). Active context is the literal source text sitting in the same prompt window while the model generates code, which makes verbatim reproduction far more likely. That’s why the access log matters: it tracks the higher-risk exposure. Fortunately, the biggest instance happened after the entire detection engine was already written and committed, as shown below.

The access log

Here’s every time Claude accessed old chardet source code during the rewrite: what was accessed, who initiated it, transcript quotes, and what ended up in the new codebase. If you want to see even more logs, you can browse every session in the chardet rewrite timeline.

Day-1 blocking (session 9716a5c6, Feb 25)

Claude’s brainstorming subagent tried to explore the old chardet repo at /Users/danblanchard/repos/chardet/ by globbing for all files:

Claude subagent tried: Glob("**/*", path="/Users/danblanchard/repos/chardet")

I denied the permission request and interrupted Claude entirely:

System: “The user doesn’t want to proceed with this tool use. The tool use was rejected.”

System: “[Request interrupted by user for tool use]”

The subagent then tried a different approach, attempting to list the repos directory:

Claude subagent tried: ls -la /Users/danblanchard/repos/chardet/

System: “Permission for this tool use was denied.”

I blocked every attempt to look at the old code on day one. After I moved the blank repository to a folder that did not share a parent directory with my local checkout of the old chardet code, Claude stopped trying to search for it that way.

Encoding era classifications (sessions 8d07e8d1 and 61b56883, Feb 25)

I told Claude to fetch charsets.py, a file I wrote for the 6.0.0 release, so the rewrite’s encoding era classifications would match chardet 6.0.0’s:

Dan: “I want the EncodingEra classification of each encoding to match the existing chardet 6.0.0 ones”

What Claude used: which encodings belong to which era (MODERN_WEB, LEGACY_REGIONAL, etc.). These are classification decisions I made for 6.0.0, not data from the Mozilla-era code.

Training data preparation (session 5b390cbc, Feb 25)

These were also entirely at my direction. I explicitly told Claude to look at two files from the old chardet repository for the purpose of improving the training data pipeline.

On create_language_model.py (31 commits, all mine):

Dan: “The existing chardet create_language_model.py handles caching, normalization, and substitutions for legacy encodings very well and I wrote all of that code, so I am comfortable with you looking at it for inspiration”

On metadata/languages.py (14 of 17 commits mine; the other 3 were typo and type annotation fixes, not mapping data):

Dan: “The chardet languages.py data file I compiled for the original chardet has an excellent mapping of encodings to languages would be helpful for this task. Do not copy the code from that, as we are avoiding copying any code directly from chardet since it is LGPL licensed, but the data that file encodes is relevant”

What Claude used from these files: encoding-to-language mappings I researched and compiled, Unicode character substitution tables for legacy encodings, and Vietnamese decomposition maps dictated by the Windows-1258/Unicode spec.

The substitution data is worth calling out: the specific Unicode character pairs are identical in the old and new code because they’re factual information about character equivalences in legacy encodings. If you need a substitute for U+201C (the left curly double quote) in encodings that don’t support it, there’s exactly one correct substitute: the ASCII double quote ".

The code structure is entirely different (module-level constant dicts vs. a single function with if/elif branches), but the character pairs themselves are the same because the underlying facts are the same.
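To make the distinction concrete, here’s a sketch of the dict-literal style (the function name and all but the quote pairs are illustrative, not taken from either codebase):

```python
# Factual character-equivalence data: any implementation targeting legacy
# encodings ends up with the same pairs, because the facts are the same.
SUBSTITUTIONS = {
    "\u201c": '"',   # left double quotation mark  -> ASCII double quote
    "\u201d": '"',   # right double quotation mark -> ASCII double quote
    "\u2019": "'",   # right single quotation mark -> ASCII apostrophe
}

def substitute_unsupported(text: str, encoding: str) -> str:
    """Replace characters the target encoding cannot represent (sketch)."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(SUBSTITUTIONS.get(ch, ch))
    return "".join(out)
```

The same data could just as easily live in a single function with if/elif branches; the structure is a free choice, the pairs are not.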

API function signatures for benchmarking (session 2c440c84, Feb 26)

I asked Claude to update the comparison script to benchmark against old chardet versions:

Dan: “Update @scripts/compare_detectors.py so that it can also compare against specific versions of chardet, and then run the comparison versus chardet 6.0.0 and chardet 5.2.0”

Claude read chardet/__init__.py at the 5.2.0 and 6.0.0 tags to check function signatures. What it found:

Claude: “chardet 5.x uses detect(byte_str, should_rename_legacy=False) with no encoding_era. Now let me also check chardet 6.0.0’s API.”

Only parameter names and types were extracted (whether encoding_era exists in 5.x, what should_rename_legacy defaults to).

The universaldetector.py read (session c9716c23, Feb 27)

This is the biggest exposure event. The entire detection engine was already written and committed4 two days prior, so this session was purely about backward-compatible API wrappers for chardet 6.0.0’s should_rename_legacy, ignore_threshold, lang_filter, and chunk_size parameters. I didn’t tell Claude to read the old source for this. Its subagents decided on their own that they needed to check how the parameters behaved.

One subagent fetched the chardet docs from ReadTheDocs:

Claude subagent prompt: “Extract what each of these parameters does: should_rename_legacy, ignore_threshold, lang_filter/LanguageFilter, chunk_size. What behavior do they control?”

Another was dispatched to research implementation:

Claude subagent prompt: “Research chardet 6.x compat params”

The subagents read universaldetector.py and __init__.py at the 6.0.0 tag in full via the GitHub MCP tool. What they extracted:

  • MINIMUM_THRESHOLD = 0.20 (a class attribute on UniversalDetector)
  • The LEGACY_MAP dictionary entries (encoding name remappings)
  • should_rename_legacy sentinel logic and default behavior
  • Public class attributes and function signatures

Claude: “The research shows it’s actually MINIMUM_THRESHOLD (not MINIMUM_CONFIDENCE) in chardet 6.0.0. It’s a class attribute on UniversalDetector with value 0.20. It IS referenced externally: detect_all() accesses it as detector.MINIMUM_THRESHOLD. chardet 6.0.0 also exposes LEGACY_MAP as a class attribute on UniversalDetector.”
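In code, the extracted API surface amounts to roughly this sketch. The attribute names, the 0.20 value, and the two map entries come from the transcript excerpts; everything else (the helper function, treating unknown names as pass-through) is my illustrative assumption, not chardet’s implementation.

```python
class UniversalDetector:
    # Class attributes named in the transcripts; LEGACY_MAP shows only two
    # illustrative entries, not the full table.
    MINIMUM_THRESHOLD = 0.20     # detect_all() filters results scoring below this
    LEGACY_MAP = {
        "ascii": "Windows-1252",
        "euc-kr": "CP949",
    }

def rename_legacy(result: dict) -> dict:
    """Sketch of should_rename_legacy=True: remap legacy encoding names."""
    enc = result.get("encoding")
    if enc:
        result = {**result,
                  "encoding": UniversalDetector.LEGACY_MAP.get(enc.lower(), enc)}
    return result
```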

The honest caveat: the full contents of universaldetector.py (~567 lines) were in Claude’s context window while it generated code. That file contains Mozilla-originated architectural patterns: the prober dispatch system, the input state machine, BOM detection tables, and the prober-confidence comparison logic. That’s exactly the kind of exposure I was trying to avoid, and I’m not going to minimize it. What I can point to is that the chardet 7 architecture is fundamentally different from what’s in that file, and the API compat work that came out of this session still had bugs that took weeks to fix (#337, #341).

Authorship note: universaldetector.py at 6.0.0 is primarily my code by git blame (510 of 567 lines). Mark Pilgrim is attributed 32 lines, the remaining 25 from other contributors.

Other interactions with old chardet

For a post called “Everything Claude Saw,” I should be exhaustive, so here are the non-source-code interactions that are documented in the transcripts:

  • Black-box testing: Multiple sessions ran chardet.detect() on old versions installed in isolated virtual environments, observing only the output. API-level testing, no source read.
  • Published API documentation: Claude fetched chardet.readthedocs.io for function signatures and class names.
  • GitHub issues: Bug report text containing stack traces and filenames, not source code.
  • Benchmarking: Installing and running old chardet alongside charset-normalizer for performance comparison.

These are standard development practices. I’m just including them because completeness matters here.

What Claude never saw

For all the instances above, here’s what was never accessed during the entire rewrite:

  • All prober implementation files (latin1prober.py, sbcharsetprober.py, charsetprober.py, charsetgroupprober.py, mbcharsetprober.py, and the rest)
  • All state machine files (codingstatemachine.py, mbcssm.py, escsm.py)
  • All frequency and distribution files (chardistribution.py, jpcntx.py, and every *freq.py file)
  • All language model files (beyond grepping for charset_name strings)
  • The escape prober (escprober.py)

None of the core detection code entered Claude’s active context during the rewrite. Claude knows the architecture of these files from training data (as documented above), but the actual source code was never in front of the model while it was generating code.

The honest caveats

The process wasn’t pristine. I should have set up stricter guardrails to prevent Claude’s subagents from reaching for the old code on their own. I didn’t because I was focused on the rewrite and didn’t expect it to happen.

Some factual data is identical between old and new: character substitution pairs (dictated by encoding standards), encoding era classifications from charsets.py, and encoding-to-language mappings from languages.py. All three source files are my code: create_language_model.py (31 commits, all mine), charsets.py (9 commits, all mine), and languages.py (14 of 17 commits mine, the other 3 being typo and type annotation fixes). I own the copyright on the code in these files and can license it however I want.

The code speaks for itself

Here are the architectures side by side (full flow diagrams: 5.2.0, 6.0.0, 7.0.0):

Component | chardet 5.2.0 | chardet 6.0.0 | chardet 7.0.0
Architecture | Prober hierarchy (CharSetProber subclasses) | Same, with EncodingEra filtering | 12-stage sequential pipeline
UTF-8 detection | CodingStateMachine + probabilistic confidence | Same | Direct byte-range validation + ratio-based confidence
CJK detection | State machines + frequency distribution tables | Same | Byte validity filtering + structural pair analysis + bigram cosine similarity
Single-byte scoring | Per-prober 64x64 precedence matrix + SequenceLikelihood | Corpus-trained models (same scoring structure) + heuristic tie-breaking | Cosine similarity against 256x256 IDF-weighted bigram models
Disambiguation | Hardcoded ISO_WIN_MAP + dedicated Hebrew 3-prober system | Same, plus 7 heuristic rules for Mac/Windows/ISO confusion | Pre-computed distinguishing byte maps + Unicode category voting
Binary detection | None (implicit via no prober exceeding threshold) | Null-byte percentage threshold | Control character threshold + magic number identification (~80 formats)
Language detection | Implicit (side effect of prober selection) | Same | Explicit 3-tier system (single-encoding lookup, bigram scoring, decode+score)
Package source files | 42 + 7 generated language models | 39 + 45 generated language models | 23 (models are binary .bin)
Lines of Python | ~8,700 + ~32,600 generated | ~10,600 + ~282,800 generated | ~4,000

The table above understates how different these versions are. The old code’s probers were stateful objects: each maintained its own CodingStateMachine, fed bytes through feed() calls, and tracked running confidence scores across chunks. Group probers ran all sub-probers in parallel and picked the highest confidence. 7.0.0 replaced that with stateless pipeline functions that either return a definitive result or filter the candidate set for the next stage.

The scoring changed just as fundamentally. Old single-byte probers mapped bytes to character frequency orders, then scored bigrams against 4 likelihood buckets (NEGATIVE, UNLIKELY, LIKELY, POSITIVE). 7.0.0 scores raw byte pairs against dense 256x256 lookup tables with IDF weighting and cosine similarity. Old multi-byte probers used state machines that walked bytes one at a time, tracking valid/invalid transitions. 7.0.0 validates structure by calling Python’s decode() in strict mode, then scores with byte-pair statistics.
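The two 7.0.0 mechanisms named in that paragraph can be sketched in a few lines. This is a bare-bones illustration under my own assumptions: the real pipeline uses dense 256x256 IDF-weighted tables, trained models, and many more stages.

```python
import math
from collections import Counter

def validate(data: bytes, encoding: str):
    """Structural validation via Python's strict-mode decode."""
    try:
        return data.decode(encoding, errors="strict")
    except (UnicodeDecodeError, LookupError):
        return None  # structurally invalid for this encoding

def bigram_cosine(data: bytes, model: dict) -> float:
    """Cosine similarity between the input's byte-pair counts and a model
    mapping (byte, byte) pairs to weights."""
    counts = Counter(zip(data, data[1:]))
    dot = sum(model.get(bg, 0.0) * c for bg, c in counts.items())
    norm_a = math.sqrt(sum(c * c for c in counts.values()))
    norm_b = math.sqrt(sum(v * v for v in model.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A candidate encoding that fails validate() is eliminated outright; survivors are ranked by their bigram_cosine() score against per-language models.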

Disambiguation is a different mechanism too. 5.x had ISO_WIN_MAP (a hardcoded dict) and a 3-prober Hebrew system. 6.x added 7 heuristic rules. 7.0.0 replaced all of it with pre-computed distinguishing byte maps and Unicode category voting: a data-driven system that handles any confusion pair without special-case code.

Git blame: zero lines survive

I ran git blame across 19 versions of chardet, tracking how many lines are attributed to Mark Pilgrim. For the old versions, I used plain git blame with no flags, which if anything undercounts Pilgrim’s contribution (it won’t follow code moved between files).

In version 1.0, that’s 7,188 out of 7,188: 100%. Through Ian Cordasco’s charade Python 3 fork, merging it back into mainline with Python 2/3 compatibility, and years of maintenance, Pilgrim’s line count gradually dropped: 2,999 in charade 1.0.0, 1,282 in chardet 3.0.0, and 1,207 in 6.0.0. But those lines persisted in every release for two decades.

chardet 7.0.0 has zero lines attributed to Mark Pilgrim. Even with git blame -C -C -C (the most aggressive copy detection Git offers, tracking code moved or copied across files), it found nothing. All 3,931 lines are attributed to me.
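The measurement can be reproduced in miniature. This demo builds a throwaway two-author repository (hypothetical authors and file, not the chardet history) and counts blame-attributed lines the same way:

```shell
# Count blame-attributed lines per author; -C -C -C additionally follows
# code moved or copied across files when assigning attribution.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Author A"; git config user.email a@example.com
printf 'line1\nline2\n' > f.py
git add f.py && git commit -qm "first"
git config user.name "Author B"; git config user.email b@example.com
printf 'line1\nline2\nline3\n' > f.py
git commit -qam "second"
blame_counts=$(git blame -C -C -C --line-porcelain f.py \
  | sed -n 's/^author //p' | sort | uniq -c | sed 's/^ *//')
echo "$blame_counts"
```

Author A keeps attribution for the two untouched lines; only the line Author B added is credited to B. Run across a whole tree, this is the per-author line count used above.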

JPlag: token-level similarity

JPlag v6.3.0 is a widely used academic plagiarism detection tool that compares code at the token level, stripping away formatting and variable names. I ran it across every consecutive version pair plus one bookend comparison (1.0 vs 7.0.0).

Version Pair | Avg Similarity | Max Similarity
1.0 vs 1.0.1 | 97.63% | 98.45%
1.0.1 vs 1.1 | 93.84% | 94.08%
1.1 vs unreleased-2.0.1 | 93.87% | 94.08%
unreleased-2.0.1 vs 2.1.1 | 93.87% | 94.08%
2.1.1 vs charade-1.0.0 | 81.82% | 83.02%
charade-1.0.0 vs charade-1.0.1 | 96.40% | 96.40%
charade-1.0.1 vs charade-1.0.2 | 95.54% | 95.83%
charade-1.0.2 vs charade-1.0.3 | 96.95% | 96.95%
charade-1.0.3 vs 2.2.0 | 94.97% | 96.69%
2.2.0 vs 2.2.1 | 96.72% | 96.76%
2.2.1 vs 2.3.0 | 94.82% | 95.75%
2.3.0 vs 3.0.0 | 36.51% | 44.49%
3.0.0 vs 4.0.0 | 82.99% | 94.09%
4.0.0 vs 5.0.0 | 87.41% | 91.99%
5.0.0 vs 5.1.0 | 90.97% | 93.83%
5.1.0 vs 5.2.0 | 99.08% | 99.11%
5.2.0 vs 6.0.0 | 3.30% | 80.05%
6.0.0 vs 7.0.0 | 0.04% | 1.30%
1.0 vs 7.0.0 | 0.50% | 0.65%

The pattern is clear. Consecutive old versions cluster between 81-99% similarity. Even 3.0.0 vs 2.3.0 only dropped to 36%, and that release standardized naming conventions (capitalizing enums, Pythonic attribute names), added language to prober return values, and added Turkish detection support.

6.0.0 vs 5.2.0 is interesting: average fell to 3.3% because of training infrastructure I added, but max stayed at 80% because the detection code carried forward unchanged. Still a derivative, still LGPL.

6.0.0 vs 7.0.0: 0.04% average, 1.30% max. JPlag strips variable names and formatting, so it would catch renamed copies. It found exactly three matches totaling 47 tokens. Here’s every one of them:

Match 1 (18 tokens): A block of from X import (Y, Z) statements in 7.0.0’s pipeline/orchestrator.py matched against a different block of from X import (Y, Z) statements in 6.0.0’s sbcsgroupprober.py. The imports are importing completely different things. It’s just import syntax.

# 7.0.0/pipeline/orchestrator.py lines 3-36
from __future__ import annotations

from chardet._utils import DEFAULT_MAX_BYTES
from chardet.enums import EncodingEra
from chardet.models import (
    BigramProfile,
    has_model_variants,
    infer_language,
    score_best_language,
)
from chardet.pipeline import (
    DETERMINISTIC_CONFIDENCE,
    HIGH_BYTES,
    DetectionResult,
    PipelineContext,
)
# 6.0.0/sbcsgroupprober.py lines 235-249
from .langpolishmodel import (
    CP852_POLISH_MODEL,
    ISO_8859_2_POLISH_MODEL,
    ISO_8859_16_POLISH_MODEL,
    MACLATIN2_POLISH_MODEL,
    WINDOWS_1250_POLISH_MODEL,
)
from .langportuguesemodel import (
    CP037_PORTUGUESE_MODEL,
    CP500_PORTUGUESE_MODEL,
    CP850_PORTUGUESE_MODEL,
    CP858_PORTUGUESE_MODEL,
    CP860_PORTUGUESE_MODEL,
    ISO_8859_1_PORTUGUESE_MODEL,
)

Match 2 (15 tokens): argparse CLI setup. Both files implement a command-line tool, so both call argparse.ArgumentParser() and add_argument().

# 7.0.0/cli.py lines 25-35
def main(argv: list[str] | None = None) -> None:
    parser = argparse.ArgumentParser(description="Detect character encoding of files.")
    parser.add_argument("files", nargs="*", help="Files to detect encoding of")
    parser.add_argument(
        "--minimal", action="store_true", help="Output only the encoding name"
    )
# 6.0.0/cli/chardetect.py lines 64-83
def main(argv: Optional[list[str]] = None) -> None:
    parser = argparse.ArgumentParser(
        description=(
            "Takes one or more file paths and reports their detected encodings"
        )
    )
    parser.add_argument(
        "input",
        help="File whose encoding we would like to determine. (default: stdin)",
        type=argparse.FileType("rb"),
        nargs="*",
        default=[sys.stdin.buffer],
    )

Match 3 (14 tokens): A tuple of two-letter language codes in 7.0.0’s registry.py matched against a series of boolean attribute initializations in 6.0.0’s utf1632prober.py. JPlag saw the token pattern (comma-separated short values) as similar. The actual code has nothing in common.

# 7.0.0/registry.py lines 12-33
_WESTERN = (
    "en", "fr", "de", "es", "pt", "it", "nl",
    "da", "sv", "no", "fi", "is", "id", "ms",
)
# 6.0.0/utf1632prober.py lines 66-72
self.invalid_utf16be = False
self.invalid_utf16le = False
self.invalid_utf32be = False
self.invalid_utf32le = False
self.first_half_surrogate_pair_detected_16be = False
self.first_half_surrogate_pair_detected_16le = False
self.quad = [0, 0, 0, 0]

That’s it. That’s the entire overlap between 293,000 lines of chardet 6.0.0 and 4,000 lines of chardet 7.0.0.

The bookend tells the same story. 1.0 vs 7.0.0: 0.50% average, 0.65% max, with a single 15-token match.

Copydetect: per-file fingerprint breakdown

Copydetect v0.5.0 uses winnowing, the same fingerprinting algorithm behind Stanford’s MOSS (the most widely used plagiarism detection system in CS education). Where JPlag gives us project-level similarity, Copydetect breaks it down by file, showing which specific files are most similar and what they matched against.
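Winnowing itself is compact enough to sketch. This is a bare-bones illustration of the fingerprinting idea with toy parameters; Copydetect and MOSS tokenize code first, track match positions, and tune k and the window size differently.

```python
import zlib

def kgram_hashes(text: str, k: int = 5) -> list:
    """Hash every overlapping k-character window of the text."""
    return [zlib.crc32(text[i:i + k].encode()) for i in range(len(text) - k + 1)]

def winnow(hashes: list, w: int = 4) -> set:
    """Keep the minimum hash from each window of w consecutive hashes;
    the selected hashes form the document's fingerprint."""
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of two winnowed fingerprints."""
    fa, fb = winnow(kgram_hashes(a)), winnow(kgram_hashes(b))
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0
```

Because only window minima are kept, a fingerprint is a small, position-robust sample of the document: shared code regions still share fingerprints even when surrounded by unrelated edits.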

Version Pair | Avg | Max | Top 5 Most Similar Files
1.0 vs 1.0.1 | 100.00% | 100.00% | charsetprober.py 100%; langhebrewmodel.py 100%; chardistribution.py 100%; euctwprober.py 100%; euckrfreq.py 100%
1.0.1 vs 1.1 | 96.29% | 100.00% | sjisprober.py 100%; langhebrewmodel.py 100%; mbcsgroupprober.py 100%; euckrfreq.py 100%; eucjpprober.py 100%
1.1 vs unreleased-2.0.1 | 99.04% | 100.00% | escprober.py 100%; mbcharsetprober.py 100%; langgreekmodel.py 100%; euctwprober.py 100%; hebrewprober.py 100%
unreleased-2.0.1 vs 2.1.1 | 96.30% | 100.00% | charsetgroupprober.py 100%; sjisprober.py 100%; langgreekmodel.py 100%; mbcssm.py 100%; gb2312prober.py 100%
2.1.1 vs charade-1.0.0 | 88.63% | 100.00% | euckrfreq.py 100%; big5freq.py 100%; jisfreq.py 100%; gb2312freq.py 100%; euctwfreq.py 100%
charade-1.0.0 vs charade-1.0.1 | 100.00% | 100.00% | sbcsgroupprober.py 100%; mbcsgroupprober.py 100%; langgreekmodel.py 100%; jisfreq.py 100%; mbcharsetprober.py 100%
charade-1.0.1 vs charade-1.0.2 | 97.14% | 100.00% | sbcharsetprober.py 100%; charsetprober.py 100%; jpcntx.py 100%; big5freq.py 100%; langthaimodel.py 100%
charade-1.0.2 vs charade-1.0.3 | 100.00% | 100.00% | compat.py 100%; langgreekmodel.py 100%; euctwprober.py 100%; langthaimodel.py 100%; utf8prober.py 100%
charade-1.0.3 vs 2.2.0 | 98.62% | 100.00% | langbulgarianmodel.py 100%; langhebrewmodel.py 100%; mbcharsetprober.py 100%; langthaimodel.py 100%; eucjpprober.py 100%
2.2.0 vs 2.2.1 | 99.81% | 100.00% | langbulgarianmodel.py 100%; jisfreq.py 100%; langhebrewmodel.py 100%; chardistribution.py 100%; langgreekmodel.py 100%
2.2.1 vs 2.3.0 | 98.30% | 100.00% | charsetprober.py 100%; euctwfreq.py 100%; langthaimodel.py 100%; codingstatemachine.py 100%; euckrprober.py 100%
2.3.0 vs 3.0.0 | 71.78% | 99.94% | langthaimodel.py 99.94%; langhebrewmodel.py 99.94%; langgreekmodel.py 99.90%; langbulgarianmodel.py 99.90%; langhungarianmodel.py 99.90%
3.0.0 vs 4.0.0 | 82.70% | 100.00% | mbcssm.py 100%; big5freq.py 100%; hebrewprober.py 100%; euckrprober.py 100%; euctwprober.py 100%
4.0.0 vs 5.0.0 | 84.44% | 100.00% | gb2312freq.py 100%; euctwfreq.py 100%; big5freq.py 100%; euckrfreq.py 100%; jisfreq.py 100%
5.0.0 vs 5.1.0 | 85.15% | 100.00% | langbulgarianmodel.py 100%; langgreekmodel.py 100%; langhungarianmodel.py 100%; gb2312freq.py 100%; euctwfreq.py 100%
5.1.0 vs 5.2.0 | 93.75% | 100.00% | mbcssm.py 100%; langhebrewmodel.py 100%; euctwprober.py 100%; escprober.py 100%; euckrprober.py 100%
5.2.0 vs 6.0.0 | 67.35% | 100.00% | utf8prober.py 100%; hebrewprober.py 100%; jisfreq.py 100%; sjisprober.py 100%; euctwprober.py 100%
6.0.0 vs 7.0.0 | 3.30% | 59.21% | __main__.py 59%; languages.py 25%; universaldetector.py 10%; utf1632prober.py 10%; __init__.py 9%
1.0 vs 7.0.0 | 0.00% | 0.00% | No matches above 25-token threshold

The top-5 column tells the story. In old versions, the most similar files are core detection code: prober files, language models, state machines, frequency tables. These are the files that actually do encoding detection, carried forward release after release.

In the 7.0.0 rows, the picture is completely different. The top match is __main__.py at 59%: a three-line CLI entry point that any Python package has (from X import main; if __name__ == "__main__": main()).

The languages.py vs __init__.py match at 25% comes down to shared @dataclass patterns and encoding name strings. The utf1632prober.py vs detector.py match at 10% is self.x = False initialization and @property boilerplate: both files implement a detector class with reset(), done, and result.

The universaldetector.py vs detector.py match at 10% is more interesting. Part is the intentionally preserved API surface (class UniversalDetector, def reset, return self.result). But there are also two substantive overlaps: null byte counting for UTF-16/32 detection and an encoding equivalence map ("ascii": "Windows-1252", "euc-kr": "CP949", etc.).

The null counting is a natural way to detect UTF-16 in Python. The equivalence map is a table of factual encoding relationships that any encoding library would need. You’d arrive at these independently from the Unicode specification.
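For instance, a minimal version of the null-counting idea (my own sketch, not code from either detector) reads the byte order straight off where the nulls fall in ASCII-heavy text; the 0.7/0.1 thresholds are arbitrary illustration values:

```python
from __future__ import annotations

def guess_utf16_variant(data: bytes) -> str | None:
    """ASCII-heavy UTF-16 text has a null in every other byte; which half
    of each byte pair is null reveals the byte order."""
    if len(data) < 4:
        return None
    even_nulls = sum(1 for b in data[0::2] if b == 0)
    odd_nulls = sum(1 for b in data[1::2] if b == 0)
    half = len(data) / 2
    if even_nulls > 0.7 * half and odd_nulls < 0.1 * half:
        return "UTF-16BE"  # high (first) byte of each pair is null
    if odd_nulls > 0.7 * half and even_nulls < 0.1 * half:
        return "UTF-16LE"  # low byte first, null second
    return None
```

Any Python programmer asked to detect BOM-less UTF-16 would land on something of this shape, which is the point: the overlap reflects the problem, not the codebase.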

What the numbers mean

Three independent measures all point the same direction: chardet 7 is not derivative of earlier versions. Zero prior-contributor lines in git blame -C -C -C. Token-level similarity under 1.5% max, with matches being generic boilerplate. Fingerprint similarity hits zero against the original 1.0 release.

The standard legal test for non-literal software copyright infringement is the Abstraction-Filtration-Comparison test from Computer Associates v. Altai (1992). Courts filter out unprotectable elements (ideas, standard idioms, elements dictated by the problem) before comparing what remains.

At the design level, the AFC test filters out “structure, sequence, and organization” that isn’t protectable. The architecture table above shows chardet 7 uses a sequential pipeline where the old code used a prober class hierarchy, cosine similarity against bigram models where the old code used 64x64 precedence matrices, and byte-range validation where the old code used state machines. But you don’t even need the side-by-side comparison to see why these techniques aren’t protectable.
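The gap between the two scoring techniques is easy to see in code. This is a sketch of cosine similarity over byte-bigram frequency vectors under my own naming, not chardet 7's actual model: where the old probers stepped through a 64x64 precedence matrix one byte pair at a time, a cosine model compares the document's whole bigram distribution against a trained per-encoding profile.

```python
import math
from collections import Counter

def bigram_vector(data: bytes) -> Counter:
    """Sparse frequency vector of adjacent byte pairs."""
    return Counter(zip(data, data[1:]))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy "trained model": the bigram profile of some reference text. A real
# detector trains per-encoding profiles over large corpora instead.
profile = bigram_vector("the quick brown fox jumps over the lazy dog".encode())
score = cosine(bigram_vector("the lazy fox".encode()), profile)
print(round(score, 3))
```

The design consequence: the cosine model scores the document in one vectorized pass, while a precedence-matrix prober has to carry per-pair state through the byte stream.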

I compared chardet’s architecture against encoding detectors developed independently of Mozilla’s codebase: Henri Sivonen’s chardetng for Firefox (written from scratch in Rust, with explicitly no code reuse from Mozilla’s old detector despite Sivonen working at Mozilla), Google’s Compact Encoding Detector, and IBM’s ICU CharsetDetector (its own implementation, though inspired by the same Mozilla research paper that chardet descends from).

They all arrived at the same core techniques: check BOMs first, filter candidates by byte validity, score survivors with statistical frequency models, progressively eliminate candidates as confidence grows. chardetng uses hard elimination rules before frequency scoring. CED maintains a ranked probability distribution with progressively tightening pruning thresholds. Even ICU, which shares algorithmic inspiration with chardet, produced a completely different implementation. These are convergent solutions to the encoding detection problem. chardet 7’s design choices align with the ecosystem consensus, not with the old chardet codebase specifically.

At the code level, the JPlag matches are import statements, argparse boilerplate, and false positives on comma-separated values. The Copydetect overlaps in universaldetector.py vs detector.py are more substantive (null byte counting, encoding equivalence maps), but those are elements dictated by the problem being solved: there are only so many ways to count null bytes in Python, and encoding equivalences are facts defined by Unicode. After filtration, there’s nothing protectable left to compare.

As far as I know, no court has found code to be a derivative work when the only shared elements are standard idioms. Clean-room reimplementations with far more overlap have been upheld (the IBM BIOS clones). In Sony v. Connectix, the 9th Circuit went further: Connectix directly copied Sony’s BIOS during development, and the court still found fair use because the final product contained no infringing material.

But say I’m wrong about the filtration and something protectable did survive. IANAL, but case law seems to favor independence here. The Supreme Court found fair use in Google v. Oracle (2021) even when Google directly copied 11,500 lines of Java API declarations. The 9th Circuit held in Oracle v. Rimini Street (2024) that for software to be a derivative work, it “must incorporate” the original’s copyrighted material, and interoperability alone is insufficient.

None of these cases involved AI-assisted reimplementation, and no court has ruled on that yet. But the legal question has always been about the output, not the tool.

chardet 7 is a different program that solves the same problem. By every measure I could find, it’s not a derivative of the code that came before it.

One of the most common objections I have seen to this rewrite is “If this is allowed, anyone can use AI to strip any license from any open-source project. Feed the old code to an LLM, have it describe the functionality, reimplement in a fresh session from that description, and you’ve laundered a copyleft license into MIT.”

This concern is legitimate. AI has made clean-room style reimplementation dramatically cheaper. What used to require months of work by expensive engineering teams can now, as Armin Ronacher put it, be done trivially. Both Ars Technica and The Register covered the broader implications.

That cost reduction is genuinely disruptive to copyleft as an enforcement mechanism, and it’s worth being honest about what that means for open source.

But the legal mechanism isn’t new. Clean-room reimplementation has been used to replicate proprietary and copyleft software for decades: Compaq’s IBM PC BIOS (1982), Wine, ReactOS. As antirez pointed out, the open-source movement itself was built on reimplementation: GNU reimplemented UNIX, and many contributors had been exposed to the code they were replacing.

The obvious counterargument: in those cases, the implementing team hadn’t memorized the source code. Claude has chardet in its training data. That’s a real difference, but the question is whether that training data exposure produced a derivative work, and the similarity evidence above says it didn’t.

The bigger question is about the economics of reimplementation, not whether what I did is some new kind of infringement. AI didn’t create a new legal mechanism; it made an existing one cheap enough that individual maintainers can use it, not just corporations with massive budgets. That’s a real shift for all of software development, and I don’t think any of us in the open-source community have figured out what it means yet.

Judge for yourself

I’ve built a timeline of every conversation session from the rewrite. You can browse the full transcripts, see what tools were called, what files were read, and what code was written. Nothing is redacted. I’m sure people will find all the times in the process I said something stupid or made a typo. That’s fine. I’m not trying to present myself as infallible. I’m just trying to be transparent about what actually happened.

The process wasn’t perfect. Claude saw old source code I didn’t intend it to see, and I should have set up stricter guardrails. But the output speaks for itself: a fundamentally different architecture, near-zero similarity by every measure I could find, and zero surviving lines from the original codebase.

I spent a decade trying to solve the license problem that kept chardet out of the standard library. The old architecture had hit a wall on performance. The rewrite fixed both. If you think the process was flawed, I’d rather you base that on what actually happened than on speculation.


  1. The initial rewrite was released under MIT. I later switched to 0BSD, a public-domain-equivalent license, which sidesteps the question of whether AI-generated code is copyrightable in the first place. ↩︎

  2. My side of this conversation is lost. I deleted my Twitter account after Elon bought it. Guido’s replies give enough context to follow the thread, and my later comment on issue #36 confirms the outcome: the LGPL made stdlib inclusion impossible. ↩︎

  3. In the end, multicore detection didn’t make the cut. The rewrite was fast enough single-threaded that the overhead of dispatching work across cores canceled out the gains. The code is thread-safe, but it doesn’t parallelize internally. ↩︎

  4. Individual commits for each component: binary, BOM, ASCII, UTF-8, markup, byte validity, multi-byte structural, statistical scoring, orchestrator, bigram model training, accuracy suite, benchmark suite, escape detection, mypyc optimization. ↩︎