Overview

chardet is a Python library for automatic character encoding detection. It is a port of Mozilla’s Universal Charset Detector, originally written in C++ for Mozilla Firefox. The original Python port was done by Mark Pilgrim (of “Dive Into Python” fame), and has been maintained by Dan Blanchard for 12+ years. It is one of the most widely depended-upon packages on PyPI, being a transitive dependency for a huge chunk of the Python ecosystem (via requests and others).

Detection Approach

The detection approach uses several techniques in parallel:

  • BOM detection — checks for byte order marks at the start of the data
  • Escape-based detection — for encodings like ISO-2022-JP that use escape sequences
  • Multi-byte probers — for UTF-8 and the CJK encodings (SJIS, EUC-JP, GB2312, Big5, etc.), using state machines
  • Single-byte probers — for encodings like Latin-1, Windows-1252, KOI8-R, etc., using character frequency analysis and sequence analysis
  • UniversalDetector — the orchestrator class that feeds data through all the probers and picks the best result

API

chardet.detect(byte_string)

The simple, all-in-one convenience function. Takes a bytes object and returns a dict:

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

Under the hood, it is a thin wrapper around UniversalDetector:

  1. Creates a UniversalDetector instance
  2. Feeds the entire byte string into it in one shot
  3. Calls close()
  4. Returns the result dict

chardet.detect_all(byte_str, ignore_threshold=False)

Returns a list of possible encodings ranked by confidence, rather than just the top one:

[
    {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'},
    {'encoding': 'Big5', 'confidence': 0.12, 'language': 'Chinese'},
    {'encoding': 'EUC-KR', 'confidence': 0.08, 'language': 'Korean'},
]

Key difference from detect(): instead of just picking the best prober result, it reaches inside the group probers to collect confidence from every individual child prober, sorts them by confidence descending, and returns the full list.

The ignore_threshold parameter controls whether results below the minimum confidence threshold (0.2) are included: they are filtered out by default, and kept when ignore_threshold=True.
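The ranking and threshold logic can be sketched with plain dicts (a toy illustration: rank_candidates and MINIMUM_THRESHOLD are names invented here, not chardet internals):

```python
# Toy sketch of detect_all's ranking: gather candidate results,
# optionally drop those under the minimum threshold, sort descending.
MINIMUM_THRESHOLD = 0.2  # the documented 0.2 floor

def rank_candidates(candidates, ignore_threshold=False):
    """candidates: list of {'encoding', 'confidence', 'language'} dicts."""
    if not ignore_threshold:
        candidates = [c for c in candidates
                      if c['confidence'] >= MINIMUM_THRESHOLD]
    return sorted(candidates, key=lambda c: c['confidence'], reverse=True)

candidates = [
    {'encoding': 'Big5',   'confidence': 0.12, 'language': 'Chinese'},
    {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'},
]
ranked = rank_candidates(candidates)
# Only GB2312 survives the threshold; ignore_threshold=True keeps both.
```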

UniversalDetector Class

The core engine for incremental/streaming encoding detection.

Constructor:

  • UniversalDetector(lang_filter=LanguageFilter.ALL) — optionally pass LanguageFilter flags to limit which encodings are considered

Methods:

  • feed(data) — feed a chunk of bytes into the detector. Can be called multiple times for streaming use. Short-circuits if detection reaches high confidence (sets self.done = True)
  • close() — signal that you’re done feeding data. Triggers final analysis and picks the best result from the probers
  • reset() — reset the detector state so you can reuse the instance for a new detection

Properties / Attributes:

  • result — a dict with encoding, confidence (0.0-1.0), and language after close() is called
  • done — boolean, True if the detector has reached a confident conclusion early

Typical Usage Pattern:

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for line in binary_file:
    detector.feed(line)
    if detector.done:
        break
detector.close()
print(detector.result)
# e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

Data Flow: detect() Through UniversalDetector

Example: 'Hello, this is tést'.encode('utf-8')

Input bytes: b'Hello, this is t\xc3\xa9st'

chardet.detect(byte_string)
│
├─ Creates UniversalDetector instance
├─ Calls detector.feed(byte_string)
│   │
│   ├─ Step 1: BOM Check (first call only)
│   │   └─ Checks first 2-4 bytes for BOM (UTF-8 BOM, UTF-16 LE/BE, UTF-32 LE/BE)
│   │   └─ "Hell" has no BOM → continue
│   │
│   ├─ Step 2: Scan bytes to classify input
│   │   └─ Iterates through each byte
│   │   └─ Sees 0xC3 and 0xA9 → these are high bytes (> 0x7F)
│   │   └─ Sets input_state = HIGH_BYTE
│   │   └─ (If it saw escape sequences, would set ESCAPE instead)
│   │
│   ├─ Step 3: Based on input_state, activate probers
│   │   │
│   │   │  input_state == HIGH_BYTE, so activates:
│   │   │
│   │   ├─ MBCSGroupProber (multi-byte group)
│   │   │   │
│   │   │   ├─ UTF8Prober
│   │   │   │   └─ Has a CodingStateMachine with UTF-8 transitions
│   │   │   │   └─ Feeds bytes through state machine
│   │   │   │   └─ 0xC3 0xA9 is a VALID 2-byte UTF-8 sequence
│   │   │   │   └─ Returns state: FOUND_IT or high confidence
│   │   │   │
│   │   │   ├─ SJISProber
│   │   │   │   └─ Feeds bytes through Shift_JIS state machine
│   │   │   │   └─ 0xC3 0xA9 could be valid SJIS, but context is wrong
│   │   │   │   └─ Returns low confidence
│   │   │   │
│   │   │   ├─ EUCJPProber, EUCKRProber, GB2312Prober, Big5Prober, etc.
│   │   │   │   └─ Similar: feed through respective state machines
│   │   │   │   └─ Most will reject or return low confidence
│   │   │   │
│   │   │   └─ Returns best confidence among its children
│   │   │
│   │   └─ SBCSGroupProber (single-byte group)
│   │       │
│   │       ├─ Latin1Prober
│   │       │   └─ Analyzes byte frequency patterns
│   │       │   └─ 0xC3 and 0xA9 are valid Latin-1 chars
│   │       │   └─ Returns moderate confidence
│   │       │
│   │       ├─ Windows1252Prober, ISO8859_2Prober, etc.
│   │       │   └─ Each uses character frequency models for its
│   │       │      target language to score the input
│   │       │
│   │       └─ Returns best confidence among its children
│   │
│   └─ Step 4: Check if any prober hit FOUND_IT threshold
│       └─ If so, set self.done = True (short-circuit)
│
├─ Calls detector.close()
│   │
│   ├─ If input was pure ASCII → return {'encoding': 'ascii', ...}
│   ├─ If BOM was found → return that encoding
│   └─ Otherwise:
│       ├─ Collect confidence from all active probers
│       ├─ Find the prober with highest confidence
│       ├─ For this input: UTF8Prober wins
│       │   └─ The 0xC3 0xA9 sequence is valid UTF-8
│       │   └─ All ASCII bytes are valid UTF-8
│       │   └─ High confidence
│       └─ Populate self.result
│
└─ Returns detector.result
    └─ {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

Why UTF-8 wins: The key bytes are 0xC3 0xA9. In UTF-8, 0xC3 signals “start of a 2-byte sequence” and 0xA9 is a valid continuation byte (10xxxxxx pattern). The UTF-8 state machine sees a perfectly valid transition. Meanwhile, the single-byte probers see the same bytes as valid characters in their encodings (e.g., Latin-1 reads them as “Ã©”), but their frequency analysis scores lower because “Ã” followed by “©” is an unusual combination in natural language.
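The ambiguity is easy to reproduce with Python's built-in codecs:

```python
raw = b'\xc3\xa9'

# Both decodes succeed, so structure alone cannot decide:
as_utf8 = raw.decode('utf-8')      # one 2-byte character: 'é'
as_latin1 = raw.decode('latin-1')  # two 1-byte characters: 'Ã©'

# 0xC3 has the 110xxxxx lead-byte pattern and 0xA9 the 10xxxxxx
# continuation pattern, so the UTF-8 state machine accepts the pair.
assert raw[0] >> 5 == 0b110
assert raw[1] >> 6 == 0b10
```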

Example: GB18030 Chinese — '你好世界'.encode('gb18030')

Input bytes: b'\xc4\xe3\xba\xc3\xca\xc0\xbd\xe7'

Each character is 2 bytes: 你=0xC4 0xE3, 好=0xBA 0xC3, 世=0xCA 0xC0, 界=0xBD 0xE7

chardet.detect(byte_string)
│
├─ Creates UniversalDetector
├─ Calls detector.feed(byte_string)
│   │
│   ├─ Step 1: BOM Check
│   │   └─ 0xC4 0xE3... → no BOM match → continue
│   │
│   ├─ Step 2: Byte Classification
│   │   └─ Every byte here is > 0x7F (all high bytes)
│   │   └─ No ASCII at all, no escape sequences
│   │   └─ input_state = HIGH_BYTE
│   │
│   ├─ Step 3: Activate probers
│   │   │
│   │   ├─ MBCSGroupProber
│   │   │   │
│   │   │   ├─ UTF8Prober
│   │   │   │   └─ 0xC4 = start of 2-byte UTF-8 (110xxxxx)
│   │   │   │   └─ 0xE3 = start of 3-byte UTF-8 (1110xxxx), NOT valid continuation
│   │   │   │   └─ State machine → ERROR state
│   │   │   │   └─ Prober state: NOT_ME (ruled out)
│   │   │   │
│   │   │   ├─ GB2312Prober / GB18030Prober
│   │   │   │   └─ CodingStateMachine: all 4 two-byte pairs valid
│   │   │   │   └─ Distribution analysis: checks character frequencies
│   │   │   │       against Chinese character frequency model
│   │   │   │   └─ 你好世界 are all common Chinese characters
│   │   │   │   └─ High frequency match → HIGH confidence
│   │   │   │
│   │   │   ├─ SJISProber
│   │   │   │   └─ State machine may accept some byte pairs
│   │   │   │   └─ Distribution analysis against Japanese model → poor match
│   │   │   │   └─ Returns LOW confidence
│   │   │   │
│   │   │   ├─ EUCKRProber
│   │   │   │   └─ Byte ranges overlap with EUC-KR
│   │   │   │   └─ Distribution analysis against Korean model → poor match
│   │   │   │   └─ Returns LOW confidence
│   │   │   │
│   │   │   ├─ EUCJPProber, Big5Prober, etc.
│   │   │   │   └─ Distribution analysis doesn't match respective models
│   │   │   │   └─ Return LOW confidence
│   │   │   │
│   │   │   └─ Best child: GB2312/GB18030 prober wins
│   │   │
│   │   └─ SBCSGroupProber
│   │       └─ All bytes are high bytes, frequency patterns
│   │           don't match any European language models
│   │       └─ Returns LOW confidence across the board
│   │
│   └─ Step 4: Short-circuit check
│       └─ GB prober confidence may trigger FOUND_IT → done = True
│
├─ Calls detector.close()
│   │
│   └─ Collects all prober results:
│       ├─ UTF8Prober:    NOT_ME (ruled out by state machine)
│       ├─ GB2312Prober:  ~0.99 confidence ← WINNER
│       ├─ SJISProber:    ~0.1-0.3
│       ├─ EUCKRProber:   ~0.1-0.3
│       ├─ Big5Prober:    ~0.1-0.3
│       ├─ Latin1 etc:    very low
│       └─ Selects GB2312 prober as winner
│
└─ Returns detector.result
    └─ {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}

Key distinction from UTF-8: With CJK encodings, the byte ranges heavily overlap (GB2312, EUC-KR, Big5, SJIS all use similar high-byte ranges). The state machines alone can’t differentiate them. That’s where distribution analysis becomes critical — checking whether the decoded characters are commonly used in that language.

Note on GB2312 vs GB18030: chardet often reports GB2312 even when the encoding is technically GB18030, since GB18030 is a superset of GB2312. If all the characters fall within the GB2312 range, the detector can’t distinguish between them. This is a known quirk.
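The overlap can be verified directly with Python's built-in codecs:

```python
text = '你好世界'
raw = text.encode('gb18030')

# For characters inside the GB2312 repertoire, GB18030 produces
# byte-for-byte identical output, so the detector has nothing to go on.
assert raw == text.encode('gb2312')
assert raw == b'\xc4\xe3\xba\xc3\xca\xc0\xbd\xe7'
assert raw.decode('gb2312') == text
```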


CodingStateMachine: Deep Dive

The state machine is the first line of defense — it determines whether a byte sequence is structurally valid for a given encoding.

Architecture

CodingStateMachine
├─ model: StateMachineModel (encoding-specific transition table)
├─ current_state: starts at START
└─ next_state(byte) → feeds one byte, returns new state

States:
┌─────────┐
│  START   │ ← initial state, also "ready for next character"
├─────────┤
│  ME_ONE  │ ← in the middle of a multi-byte sequence (need 1 more)
├─────────┤
│  ME_TWO  │ ← need 2 more bytes
├─────────┤
│  ME_THREE│ ← need 3 more bytes (GB18030 4-byte sequences)
├─────────┤
│  ITS_ME  │ ← complete valid character decoded (terminal, resets)
├─────────┤
│  ERROR   │ ← invalid byte for this encoding (terminal)
└─────────┘

Transition Table Structure

Each encoding defines a model with:

  • Class table: maps every possible byte value (0x00-0xFF) to a byte class
  • State table: given (current_state, byte_class) → next_state

Example: GB2312 byte classification (conceptual):

Byte range       → Class
0x00-0x20        → 0 (control chars)
0x21-0x7E        → 1 (ASCII printable)
0x7F             → 2 (DEL)
0x80-0xA0        → 3 (undefined range)
0xA1-0xFE        → 4 (valid GB2312 first/second byte)
0xFF             → 5 (invalid)

State transitions for GB2312:

              Class 0  Class 1  Class 2  Class 3  Class 4  Class 5
START      →  ERROR    ITS_ME   ERROR    ERROR    ME_ONE   ERROR
ME_ONE     →  ERROR    ERROR    ERROR    ERROR    ITS_ME   ERROR

Tracing: GB18030 Byte-by-Byte Through Multiple State Machines

Input: 0xC4 0xE3 0xBA 0xC3 0xCA 0xC0 0xBD 0xE7

GB2312 State Machine

Byte 0xC4:
  └─ class_table[0xC4] → class 4 (valid GB range)
  └─ state_table[START][class 4] → ME_ONE
  └─ "I need one more byte to complete this character"

Byte 0xE3:
  └─ class_table[0xE3] → class 4 (valid GB range)
  └─ state_table[ME_ONE][class 4] → ITS_ME
  └─ "Valid 2-byte character complete!" → reset to START
  └─ Character decoded: 你

Byte 0xBA:
  └─ class_table[0xBA] → class 4
  └─ state_table[START][class 4] → ME_ONE

Byte 0xC3:
  └─ class_table[0xC3] → class 4
  └─ state_table[ME_ONE][class 4] → ITS_ME
  └─ Character decoded: 好

...same pattern for 0xCA 0xC0 (世) and 0xBD 0xE7 (界)

Result: 4 valid characters, 0 errors → structurally VALID
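The class-table/state-table mechanics can be sketched with the conceptual GB2312 tables above (a toy model: the real chardet tables enumerate every state and byte class as flat arrays):

```python
START, ME_ONE, ITS_ME, ERROR = 'START', 'ME_ONE', 'ITS_ME', 'ERROR'

def gb_byte_class(b):
    # Conceptual GB2312 byte classes from the table above.
    if b <= 0x20: return 0   # control
    if b <= 0x7E: return 1   # ASCII printable
    if b == 0x7F: return 2   # DEL
    if b <= 0xA0: return 3   # undefined range
    if b <= 0xFE: return 4   # valid GB2312 first/second byte
    return 5                 # 0xFF invalid

STATE_TABLE = {              # (state, byte class) -> next state
    (START, 1):  ITS_ME,     # ASCII: complete single-byte char
    (START, 4):  ME_ONE,     # high byte: need one more
    (ME_ONE, 4): ITS_ME,     # second high byte: char complete
}                            # every other combination -> ERROR

def trace(data):
    state, chars = START, 0
    for b in data:
        state = STATE_TABLE.get((state, gb_byte_class(b)), ERROR)
        if state == ERROR:
            return ERROR, chars
        if state == ITS_ME:  # character complete: reset for the next
            chars += 1
            state = START
    return state, chars

# The GB18030 sample: four valid two-byte characters, zero errors.
```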

UTF-8 State Machine

Byte 0xC4:
  └─ class_table[0xC4] → class for 110xxxxx (2-byte start)
  └─ state_table[START][two_byte_start] → ME_ONE
  └─ "Expecting one continuation byte (10xxxxxx)"

Byte 0xE3:
  └─ class_table[0xE3] → class for 1110xxxx (3-byte start!)
  └─ state_table[ME_ONE][three_byte_start] → ERROR
  └─ "Expected continuation byte, got a new sequence start"
  └─ *** UTF-8 RULED OUT ***

Shift_JIS State Machine

Byte 0xC4:
  └─ class_table[0xC4] → Katakana half-width range (0xA1-0xDF)
  └─ Valid single-byte character in SJIS
  └─ state_table[START][katakana] → ITS_ME

Byte 0xE3:
  └─ class_table[0xE3] → valid SJIS lead byte (0xE0-0xEF)
  └─ state_table[START][sjis_lead] → ME_ONE

Byte 0xBA:
  └─ SJIS second byte range is 0x40-0x7E, 0x80-0xFC
  └─ 0xBA is valid → ITS_ME

...may survive structurally, but decoded characters are nonsense
Result: structurally VALID but distribution will be terrible

Distribution Analysis: Deep Dive

When multiple encodings pass the state machine (as GB2312, EUC-KR, and SJIS often do for Chinese input), distribution analysis breaks the tie.

Architecture

CharDistributionAnalysis
├─ char_to_order_table: maps decoded character codes → frequency rank
├─ typical_distribution_ratio: expected ratio for this language
├─ freq_chars: count of frequently-used characters seen
├─ total_chars: count of total characters analyzed
└─ get_confidence() → float

Frequency Table Construction (offline, from corpora)

For Chinese (GB2312):
  Analyze a large corpus of Chinese text
  Rank characters by frequency of occurrence

  Order 0-511:   "most frequent" bucket
                  e.g., 的 一 是 不 了 人 我 在 有 他 这 ...
  Order 512+:    "less frequent"
  Order -1:      "not seen in training corpus"

  typical_distribution_ratio = expected_freq_chars / expected_total_chars

Tracing: Chinese Characters Through Distribution Analysis

Input characters (decoded as GB2312): 你 好 世 界

Character: 你
  └─ GB2312 code point → look up in char_to_order_table
  └─ 你 is extremely common in Chinese
  └─ order = low number (say ~50) → falls in "frequent" bucket
  └─ freq_chars += 1, total_chars += 1

Character: 好
  └─ Also extremely common
  └─ order = low number (say ~80) → "frequent" bucket
  └─ freq_chars += 1, total_chars += 1

Character: 世
  └─ Common character
  └─ order = moderate (say ~200) → still in "frequent" bucket
  └─ freq_chars += 1, total_chars += 1

Character: 界
  └─ Fairly common
  └─ order = moderate (say ~300) → still in "frequent" bucket
  └─ freq_chars += 1, total_chars += 1

Result: freq_chars=4, total_chars=4
  └─ ratio = 4/4 = 1.0
  └─ Compare to typical_distribution_ratio
  └─ Very high match → HIGH confidence (~0.99)

Same Bytes Through Korean Distribution Analysis

Same byte pairs decoded as EUC-KR:

0xC4 0xE3 → some Korean character (or invalid)
0xBA 0xC3 → some Korean character
0xCA 0xC0 → some Korean character
0xBD 0xE7 → some Korean character

Look up each in Korean char_to_order_table:
  └─ These map to uncommon or meaningless Korean characters
  └─ Most get high order numbers or -1 (not in corpus)
  └─ freq_chars ≈ 0, total_chars = 4
  └─ ratio = 0/4 = 0.0
  └─ Far below typical_distribution_ratio
  └─ LOW confidence (~0.01)
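The frequency-bucket idea fits in a small class (a minimal sketch: the class name, table values, and cutoff are illustrative, not chardet's real models):

```python
class ToyDistributionAnalysis:
    """Sketch of chardet's CharDistributionAnalysis idea: count how
    many observed characters fall in the language's frequent bucket."""

    FREQUENT_CUTOFF = 512    # ranks below this count as "frequent"

    def __init__(self, char_to_order):
        self.char_to_order = char_to_order   # char -> frequency rank
        self.freq_chars = 0
        self.total_chars = 0

    def feed(self, char):
        self.total_chars += 1
        order = self.char_to_order.get(char, -1)   # -1: not in corpus
        if 0 <= order < self.FREQUENT_CUTOFF:
            self.freq_chars += 1

    def get_confidence(self):
        if self.total_chars == 0:
            return 0.0
        return self.freq_chars / self.total_chars

# Chinese model: the sample characters all have high-frequency ranks.
zh = ToyDistributionAnalysis({'你': 50, '好': 80, '世': 200, '界': 300})
# Korean model: the same characters never appeared in its corpus.
ko = ToyDistributionAnalysis({'한': 10, '국': 20})
for ch in '你好世界':
    zh.feed(ch)
    ko.feed(ch)
# zh ratio: 4/4 = 1.0; ko ratio: 0/4 = 0.0
```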

Confidence Flow Back Up to UniversalDetector

detector.close() called
│
├─ Collect results from all prober groups:
│
│   MBCSGroupProber.get_confidence()
│   │
│   │  Iterates through child probers, returns best:
│   │
│   ├─ UTF8Prober:     state=NOT_ME      → confidence = 0.0
│   ├─ GB2312Prober:   state=DETECTING
│   │   ├─ coding_sm: no errors (structurally valid)
│   │   └─ distribution_analyzer.get_confidence() → 0.99
│   │   └─ combined confidence → 0.99         ← WINNER
│   ├─ SJISProber:     state=DETECTING
│   │   ├─ coding_sm: no errors (structurally valid)
│   │   └─ distribution_analyzer.get_confidence() → 0.05
│   │   └─ combined confidence → 0.05
│   ├─ EUCKRProber:    state=DETECTING
│   │   └─ distribution confidence → 0.01
│   ├─ Big5Prober:     state=DETECTING
│   │   └─ distribution confidence → 0.08
│   ├─ EUCJPProber:    state=DETECTING
│   │   └─ distribution confidence → 0.02
│   │
│   └─ Returns: GB2312, confidence 0.99
│
│   SBCSGroupProber.get_confidence()
│   │
│   │  All single-byte probers see only high bytes
│   │  No ASCII context to anchor frequency analysis
│   │  Best confidence maybe ~0.1-0.2
│   │
│   └─ Returns: some Latin variant, confidence ~0.15
│
├─ Compare group winners:
│   GB2312 @ 0.99  vs  Latin-something @ 0.15
│
├─ Winner: GB2312 @ 0.99
│
├─ Apply minimum threshold (typically 0.2)
│   └─ 0.99 > 0.2 → passes
│
└─ self.result = {
       'encoding': 'GB2312',
       'confidence': 0.99,
       'language': 'Chinese'
   }

BOM Detection

BOM (Byte Order Mark) detection is the very first thing that happens in UniversalDetector.feed(), and it only runs once (on the first call to feed()).

BOM Detection Table

┌─────────────────────┬──────────────────────┬─────────┐
│ Byte Sequence       │ Encoding             │ Length  │
├─────────────────────┼──────────────────────┼─────────┤
│ 0xEF 0xBB 0xBF      │ UTF-8-SIG            │ 3 bytes │
│ 0xFF 0xFE 0x00 0x00 │ UTF-32-LE            │ 4 bytes │
│ 0x00 0x00 0xFE 0xFF │ UTF-32-BE            │ 4 bytes │
│ 0xFF 0xFE            │ UTF-16-LE            │ 2 bytes │
│ 0xFE 0xFF            │ UTF-16-BE            │ 2 bytes │
└─────────────────────┴──────────────────────┴─────────┘

Order matters! UTF-32-LE starts with 0xFF 0xFE (same as UTF-16-LE)
so the 4-byte checks must come before 2-byte checks.
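The ordered check can be sketched in a few lines (check_bom is a name invented here; chardet does this inline in feed()):

```python
# Longest marks first: the UTF-32 BOMs begin with the UTF-16 BOMs,
# so the 4-byte entries must be tried before the 2-byte entries.
BOMS = [
    (b'\xff\xfe\x00\x00', 'UTF-32-LE'),
    (b'\x00\x00\xfe\xff', 'UTF-32-BE'),
    (b'\xef\xbb\xbf',     'UTF-8-SIG'),
    (b'\xff\xfe',         'UTF-16-LE'),
    (b'\xfe\xff',         'UTF-16-BE'),
]

def check_bom(data):
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None

assert check_bom(b'\xef\xbb\xbfHello') == 'UTF-8-SIG'
assert check_bom(b'\xff\xfe\x00\x00\x60\x4f\x00\x00') == 'UTF-32-LE'
assert check_bom(b'\xff\xfe`O}Y') == 'UTF-16-LE'
assert check_bom(b'Hello, world') is None
```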

Tracing: UTF-8 with BOM

Input: b'\xef\xbb\xbfHello'
       ^^^^^^^^^ BOM    ^^^^^ content

detector.feed(b'\xef\xbb\xbfHello')
│
├─ _got_data = False (first call)
│   └─ Set _got_data = True
│   └─ Enter BOM detection
│
├─ Check 4-byte BOMs first (need at least 4 bytes, we have 8):
│   ├─ data[:4] == b'\xff\xfe\x00\x00'?  → No (UTF-32-LE)
│   ├─ data[:4] == b'\x00\x00\xfe\xff'?  → No (UTF-32-BE)
│   └─ No 4-byte BOM match
│
├─ Check 3-byte BOMs:
│   ├─ data[:3] == b'\xef\xbb\xbf'?      → YES! UTF-8-SIG
│   └─ Set self._detected_encoding = 'UTF-8-SIG'
│   └─ Set self.done = True
│   └─ Return immediately (no probers needed!)
│
└─ detector.close()
    └─ BOM was detected → result comes from _detected_encoding
    └─ self.result = {
           'encoding': 'UTF-8-SIG',
           'confidence': 1.0,      ← BOM detection is always 100%
           'language': ''
       }

Tracing: UTF-16-LE with BOM

Input: '你好'.encode('utf-16')
     = b'\xff\xfe`O}Y'
       UTF-16-LE BOM, then content

detector.feed(data)
│
├─ First call, enter BOM detection
│
├─ Check 4-byte BOMs:
│   ├─ data[:4] = b'\xff\xfe\x60\x4f'
│   ├─ == b'\xff\xfe\x00\x00'?  → No (3rd byte is 0x60, not 0x00)
│   └─ No 4-byte match
│
├─ Check 3-byte BOMs:
│   └─ No 3-byte match
│
├─ Check 2-byte BOMs:
│   ├─ data[:2] == b'\xff\xfe'?  → YES! UTF-16-LE
│   └─ Set self._detected_encoding = 'UTF-16-LE'
│   └─ Set self.done = True
│   └─ Return immediately
│
└─ result = {'encoding': 'UTF-16-LE', 'confidence': 1.0, 'language': ''}

No BOM Found

Input: b'Hello, world'

detector.feed(b'Hello, world')
│
├─ Check 4-byte BOMs: b'Hell' → No match
├─ Check 3-byte BOMs: b'Hel' → No match
├─ Check 2-byte BOMs: b'He' → No match
│
├─ No BOM found
│   └─ Fall through to byte scanning and prober logic

Escape-Based Detection

Escape-based probers handle encodings that use escape sequences to switch between character sets. These are primarily ISO-2022 family encodings.

Background: How ISO-2022 Works

Normal ASCII text here → ESC $ B → 日本語のテキスト → ESC ( B → back to ASCII
                         ^^^^^^^^                     ^^^^^^^^
                         "switch to JIS X 0208"       "switch back to ASCII"

Common escape sequences:

┌──────────────────────┬─────────────────────────┬───────────────┐
│ Escape Sequence      │ Meaning                 │ Encoding      │
├──────────────────────┼─────────────────────────┼───────────────┤
│ ESC ( B              │ Switch to ASCII          │ (all)         │
│ ESC $ B              │ Switch to JIS X 0208     │ ISO-2022-JP   │
│ ESC $ @              │ Switch to JIS C 6226     │ ISO-2022-JP   │
│ ESC $ ) C            │ Switch to KSC 5601      │ ISO-2022-KR   │
│ ESC $ ( D            │ Switch to JIS X 0212     │ ISO-2022-JP   │
│ ESC $ ) A            │ Switch to GB 2312        │ ISO-2022-CN   │
│ ESC $ ) G            │ Switch to CNS 11643-1    │ ISO-2022-CN   │
│ ESC $ * H            │ Switch to CNS 11643-2    │ ISO-2022-CN   │
│ ESC ( J              │ Switch to JIS X 0201     │ ISO-2022-JP   │
└──────────────────────┴─────────────────────────┴───────────────┘

Key: ESC = 0x1B

How Escape Detection is Triggered

detector.feed(byte_string)
│
├─ BOM check (no match, continue)
│
├─ Byte scanning loop:
│   for each byte in input:
│   │
│   │  if byte == 0x1B (ESC) or byte == 0x7E (~):
│   │      input_state = ESC_ASCII
│   │      ─── this activates the escape prober path ───
│   │
│   │  elif byte >= 0x80:
│   │      input_state = HIGH_BYTE
│   │
│   │  else:
│   │      (stays as PURE_ASCII if no high bytes seen yet)

EscCharSetProber Architecture

EscCharSetProber
│
├─ Contains a list of CodingStateMachines, one per escape encoding:
│   ├─ CodingStateMachine(HZ_SM_MODEL)         → HZ-GB-2312
│   ├─ CodingStateMachine(ISO2022CN_SM_MODEL)   → ISO-2022-CN
│   ├─ CodingStateMachine(ISO2022JP_SM_MODEL)   → ISO-2022-JP
│   └─ CodingStateMachine(ISO2022KR_SM_MODEL)   → ISO-2022-KR
│
├─ active: bool (starts True, False if all machines hit ERROR)
│
└─ feed(data):
    └─ for each byte:
        └─ feed byte to each active state machine
        └─ if any returns ITS_ME → we found the encoding!
        └─ if any returns ERROR → deactivate that machine
        └─ if all hit ERROR → prober state = NOT_ME

Tracing: ISO-2022-JP Detection

Input: b'Hello \x1b$B$3$s$K$A$O\x1b(B world'
        ^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^
        ASCII  ESC$B=JIS  こんにちは  ESC(B=ASCII

detector.feed(data)
│
├─ BOM check → no match
│
├─ Byte scan:
│   'H','e','l','l','o',' ' → all < 0x80, still PURE_ASCII
│   0x1B (ESC) → input_state = ESC_ASCII!
│       └─ Activate EscCharSetProber
│
├─ EscCharSetProber.feed(data):
│   │
│   │  Byte 0x1B (ESC):
│   │  ├─ HZ_SM:       ESC is not part of HZ protocol (HZ uses ~{ and ~})
│   │  ├─ ISO2022CN:   ESC → intermediate "saw escape" state
│   │  ├─ ISO2022JP:   ESC → intermediate "saw escape" state
│   │  └─ ISO2022KR:   ESC → intermediate "saw escape" state
│   │
│   │  Byte '$' (0x24):
│   │  ├─ HZ_SM:       ESC $ is not HZ → ERROR (deactivated)
│   │  ├─ ISO2022CN:   ESC $ → intermediate state (valid so far)
│   │  ├─ ISO2022JP:   ESC $ → intermediate state (valid so far)
│   │  └─ ISO2022KR:   ESC $ → intermediate state (valid so far)
│   │
│   │  Byte 'B' (0x42):
│   │  ├─ ISO2022CN:   ESC $ B is not a valid CN sequence → ERROR
│   │  ├─ ISO2022JP:   ESC $ B = "switch to JIS X 0208"
│   │  │               → ITS_ME!!! MATCH FOUND
│   │  └─ ISO2022KR:   ESC $ B is not valid KR → ERROR
│   │
│   │  ISO2022JP returned ITS_ME!
│   │  └─ Prober state = FOUND_IT
│   │  └─ detected_charset = 'ISO-2022-JP'
│   │  └─ Stop processing
│   │
│   └─ Return FOUND_IT
│
├─ detector.done = True (short-circuit)
│
└─ result = {'encoding': 'ISO-2022-JP', 'confidence': 0.99, 'language': 'Japanese'}

Tracing: HZ-GB-2312 Detection

HZ doesn’t use ESC — it uses ~ (tilde) as its escape character.

HZ-GB-2312 protocol:
  ~{ = switch to GB2312 mode (two-byte characters)
  ~} = switch back to ASCII mode
  ~~ = literal tilde

Input: b'Hello ~{\xc4\xe3\xba\xc3~} world'

detector.feed(data)
│
├─ Byte scan:
│   'H','e','l','l','o',' ' → PURE_ASCII
│   '~' (0x7E) → input_state = ESC_ASCII!
│       └─ The tilde triggers escape detection
│
├─ EscCharSetProber.feed(data):
│   │
│   │  Byte '~' (0x7E):
│   │  ├─ HZ_SM:       ~ → intermediate "saw tilde" state
│   │  ├─ ISO2022JP:   ~ has no meaning → ERROR (deactivated)
│   │  ├─ ISO2022CN:   ~ has no meaning → ERROR (deactivated)
│   │  └─ ISO2022KR:   ~ has no meaning → ERROR (deactivated)
│   │
│   │  Byte '{' (0x7B):
│   │  ├─ HZ_SM:       ~{ = "enter GB mode" → ITS_ME!!!
│   │
│   │  detected_charset = 'HZ-GB-2312'
│   │  FOUND_IT
│   │
│   └─ Return FOUND_IT
│
└─ result = {'encoding': 'HZ-GB-2312', 'confidence': 0.99, 'language': 'Chinese'}

Tracing: ISO-2022-KR Detection

ISO-2022-KR uses:
  ESC $ ) C = designate KSC 5601 (Korean) character set
  SO (0x0E) = shift out (switch to Korean)
  SI (0x0F) = shift in (switch back to ASCII)

Input: b'\x1b$)C\x0e Korean chars \x0f ASCII'

EscCharSetProber.feed(data):
│
│  Byte 0x1B (ESC): all ISO-2022 machines → "saw escape" state
│  Byte '$' (0x24): all → valid intermediate
│  Byte ')' (0x29): all → valid intermediate
│
│  Byte 'C' (0x43):
│  ├─ ISO2022JP:   ESC $ ) C → not valid JP (expects D) → ERROR
│  ├─ ISO2022CN:   ESC $ ) C → not valid CN (expects A or G) → ERROR
│  └─ ISO2022KR:   ESC $ ) C → "designate KSC 5601" → ITS_ME!!!
│
│  detected_charset = 'ISO-2022-KR'
│
└─ result = {'encoding': 'ISO-2022-KR', 'confidence': 0.99, 'language': 'Korean'}

Specialty Probers

UTF1632Prober

Detects UTF-16 and UTF-32 without a BOM by analyzing statistical patterns of null bytes.

Key Insight

Most text is in the Basic Multilingual Plane and uses common characters:

In UTF-16-LE, ASCII text looks like:
  'H' = 0x48 0x00
  'e' = 0x65 0x00
  → Every other byte is 0x00

In UTF-16-BE, ASCII text looks like:
  'H' = 0x00 0x48
  'e' = 0x00 0x65
  → Every other byte is 0x00, offset by 1

In UTF-32-LE, ASCII text:
  'H' = 0x48 0x00 0x00 0x00
  → 3 out of every 4 bytes are 0x00

In UTF-32-BE, ASCII text:
  'H' = 0x00 0x00 0x00 0x48
  → 3 out of every 4 bytes are 0x00

Detection Strategy

UTF1632Prober.feed(data)
│
├─ Track null byte counts at each position modulo 4:
│   position[0]: count of null bytes at index 0, 4, 8, 12...
│   position[1]: count of null bytes at index 1, 5, 9, 13...
│   position[2]: count of null bytes at index 2, 6, 10, 14...
│   position[3]: count of null bytes at index 3, 7, 11, 15...
│
├─ On get_confidence():
│
│  UTF-32-LE: positions 1,2,3 almost all null, position 0 rarely null
│  UTF-32-BE: positions 0,1,2 almost all null, position 3 rarely null
│  UTF-16-LE: odd positions mostly null (must rule out UTF-32 first)
│  UTF-16-BE: even positions mostly null (must rule out UTF-32 first)

Tracing: BOM-less UTF-16-LE

Input: 'Hello'.encode('utf-16-le')
     = b'\x48\x00\x65\x00\x6c\x00\x6c\x00\x6f\x00'

UTF1632Prober.feed(data):
│
│  Index 0: 0x48 (H)  → position 0 mod 4 = 0 → not null
│  Index 1: 0x00       → position 1 mod 4 = 1 → NULL
│  Index 2: 0x65 (e)  → position 2 mod 4 = 2 → not null
│  Index 3: 0x00       → position 3 mod 4 = 3 → NULL
│  Index 4: 0x6C (l)  → position 0 mod 4 = 0 → not null
│  Index 5: 0x00       → position 1 mod 4 = 1 → NULL
│  ...
│
│  Null distribution:
│    Position 0: 0% nulls
│    Position 1: 100% nulls
│    Position 2: 0% nulls
│    Position 3: 100% nulls
│
│  Analysis:
│    Position 2 is NOT null → rules out UTF-32-LE
│    Odd positions (1,3) are all null → UTF-16-LE pattern
│    → High confidence UTF-16-LE (~0.95+)
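The null-position bookkeeping can be sketched as follows (toy code: the 0.9 thresholds are illustrative, not chardet's tuned values):

```python
def null_profile(data):
    """Fraction of 0x00 bytes at each index modulo 4."""
    nulls, totals = [0] * 4, [0] * 4
    for i, b in enumerate(data):
        totals[i % 4] += 1
        nulls[i % 4] += (b == 0)
    return [n / t if t else 0.0 for n, t in zip(nulls, totals)]

def guess_wide_encoding(data):
    p = null_profile(data)
    # UTF-32 first: its null pattern also satisfies the UTF-16 tests.
    if p[1] > 0.9 and p[2] > 0.9 and p[3] > 0.9: return 'UTF-32-LE'
    if p[0] > 0.9 and p[1] > 0.9 and p[2] > 0.9: return 'UTF-32-BE'
    if p[1] > 0.9 and p[3] > 0.9: return 'UTF-16-LE'
    if p[0] > 0.9 and p[2] > 0.9: return 'UTF-16-BE'
    return None

assert guess_wide_encoding('Hello'.encode('utf-16-le')) == 'UTF-16-LE'
assert guess_wide_encoding('Hello'.encode('utf-32-le')) == 'UTF-32-LE'
assert guess_wide_encoding(b'Hello') is None
```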

Latin1Prober

Latin-1 (ISO-8859-1) gets special treatment because it’s the most common single-byte encoding for Western European languages, and it’s a frequent “default guess.”

Why Latin-1 is Special-Cased

Latin-1 maps ALL byte values 0x00-0xFF to some character.
There are NO invalid byte sequences.
Every possible byte string is "valid" Latin-1.

This means:
- A state machine approach is useless (never errors)
- Standard frequency analysis is unreliable
- It can't be ruled out by structure alone

So instead of asking "is this valid Latin-1?" (always yes),
the prober asks "does this LOOK LIKE natural text in Latin-1?"

Byte Classification Scheme

┌────────────────┬──────────┬───────────────────────────────┐
│ Byte Range     │ Class    │ Meaning                       │
├────────────────┼──────────┼───────────────────────────────┤
│ 0x00-0x1F      │ CONTROL  │ Control characters            │
│ 0x20-0x7F      │ ASCII    │ Standard ASCII                │
│ 0x80-0x9F      │ CONTROL  │ C1 control chars              │
│                │          │ (undefined in Latin-1,        │
│                │          │  used in Windows-1252)        │
│ 0xA0-0xBF      │ COMMON   │ Common accented range         │
│ 0xC0-0xDF      │ UPPER    │ Uppercase accented            │
│ 0xE0-0xFF      │ LOWER    │ Lowercase accented            │
└────────────────┴──────────┴───────────────────────────────┘

Confidence Heuristic

Natural text in Latin-1 has expected patterns:

  • Mostly ASCII
  • Lowercase accented > uppercase accented (most text is lowercase)
  • Few or no C1 control characters (if many C1 bytes → probably Windows-1252)

The prober intentionally returns moderate confidence (not high) so that more specific probers win when they have a strong match. Latin-1 is the “if nothing else works well, it’s probably this” fallback.

Tracing: “café résumé” in Latin-1

Input: 'café résumé'.encode('latin-1')
     = b'caf\xe9 r\xe9sum\xe9'

Latin1Prober.feed(data):
│
├─ 'c' (0x63) → ASCII
├─ 'a' (0x61) → ASCII
├─ 'f' (0x66) → ASCII
├─ 'é' (0xE9) → LOWER (lowercase accented)
├─ ' ' (0x20) → ASCII
├─ 'r' (0x72) → ASCII
├─ 'é' (0xE9) → LOWER
├─ 's' (0x73) → ASCII
├─ 'u' (0x75) → ASCII
├─ 'm' (0x6D) → ASCII
├─ 'é' (0xE9) → LOWER
│
├─ Tally: ASCII=8, LOWER=3, UPPER=0, COMMON=0, CONTROL=0
│
├─ get_confidence():
│   ├─ Has accented characters → prober is relevant
│   ├─ lowercase accented > uppercase accented (natural pattern)
│   ├─ No C1 control chars (not Windows-1252 junk)
│   └─ Confidence ≈ 0.5-0.7

Latin-1 vs Windows-1252

Input: b'smart \x93quotes\x94'
       (Windows-1252 "smart quotes": 0x93 = “, 0x94 = ”)

Latin1Prober.feed(data):
│
├─ 0x93 → CONTROL (C1 range: 0x80-0x9F)
├─ 0x94 → CONTROL (C1 range)
│
├─ get_confidence():
│   ├─ Natural Latin-1 text almost never has C1 controls
│   ├─ Suggests Windows-1252 instead
│   └─ Returns LOW confidence (defers to Windows-1252 prober)
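A crude version of the byte classification and tallying (toy code mirroring the table above, not chardet's Latin1Prober):

```python
def latin1_profile(data):
    """Tally each byte into the conceptual Latin-1 classes."""
    tally = {'ASCII': 0, 'CONTROL': 0, 'C1': 0,
             'COMMON': 0, 'UPPER': 0, 'LOWER': 0}
    for b in data:
        if b < 0x20:   tally['CONTROL'] += 1
        elif b < 0x80: tally['ASCII'] += 1
        elif b < 0xA0: tally['C1'] += 1      # undefined in Latin-1
        elif b < 0xC0: tally['COMMON'] += 1
        elif b < 0xE0: tally['UPPER'] += 1
        else:          tally['LOWER'] += 1
    return tally

natural = latin1_profile('café résumé'.encode('latin-1'))
# ASCII-heavy, lowercase accents, zero C1 bytes: plausible Latin-1.
assert natural == {'ASCII': 8, 'CONTROL': 0, 'C1': 0,
                   'COMMON': 0, 'UPPER': 0, 'LOWER': 3}

smart = latin1_profile(b'smart \x93quotes\x94')
# C1 bytes present: probably Windows-1252 rather than Latin-1.
assert smart['C1'] == 2
```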

UTF8Prober

UTF-8 has its own prober because UTF-8 detection is unique — it doesn’t need distribution analysis.

Why No Distribution Analysis

Unlike GB2312 vs EUC-KR vs SJIS (which share byte ranges),
UTF-8 has a very distinctive structure:

1. Strict byte patterns: lead bytes and continuation bytes
   have non-overlapping bit patterns
2. Self-synchronizing: you can identify character boundaries
   from any position in the stream
3. No valid random byte stream: random bytes have only ~1/8
   chance of being valid continuation bytes after a lead byte

UTF-8 Encoding Rules

┌────────────────────┬────────────────────────────────────┐
│ Code point range   │ Byte pattern                       │
├────────────────────┼────────────────────────────────────┤
│ U+0000 - U+007F    │ 0xxxxxxx                           │
│ U+0080 - U+07FF    │ 110xxxxx 10xxxxxx                  │
│ U+0800 - U+FFFF    │ 1110xxxx 10xxxxxx 10xxxxxx         │
│ U+10000 - U+10FFFF │ 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx│
└────────────────────┴────────────────────────────────────┘

Byte classes:
  0x00-0x7F → ASCII (single byte, complete character)
  0x80-0xBF → Continuation byte (10xxxxxx)
  0xC0-0xC1 → Invalid (overlong encoding)
  0xC2-0xDF → 2-byte lead (110xxxxx)
  0xE0-0xEF → 3-byte lead (1110xxxx)
  0xF0-0xF4 → 4-byte lead (11110xxx)
  0xF5-0xFF → Invalid (would encode > U+10FFFF)

State Machine

                     ASCII  CONT   INVLD  2-LEAD  3-LEAD  4-LEAD
  START          →  ITS_ME  ERROR  ERROR  ME_ONE  ME_TWO  ME_THREE
  ME_ONE (need 1)→  ERROR  ITS_ME  ERROR  ERROR   ERROR   ERROR
  ME_TWO (need 2)→  ERROR  ME_ONE  ERROR  ERROR   ERROR   ERROR
  ME_THREE(need3)→  ERROR  ME_TWO  ERROR  ERROR   ERROR   ERROR

Confidence Calculation

UTF8Prober
├─ coding_sm: CodingStateMachine
├─ num_mb_chars: count of multi-byte characters found
│
└─ get_confidence():
    ├─ If state machine ever hit ERROR → 0.0
    ├─ Otherwise:
    │   ONE_CHAR_PROB = 0.5
    │   confidence = 1.0 - (ONE_CHAR_PROB ^ num_mb_chars)
    │
    │   1 mb char:  1 - 0.5   = 0.50
    │   2 mb chars: 1 - 0.25  = 0.75
    │   3 mb chars: 1 - 0.125 = 0.875
    │   5 mb chars: 1 - 0.031 = 0.969
    │   10 mb chars: 1 - 0.001 = 0.999
    │
    └─ Represents: 1 minus the probability that n multi-byte
       sequences were all structurally valid UTF-8 purely by coincidence

Tracing: Mixed ASCII and Multi-byte

Input: 'Hello 世界'.encode('utf-8')
     = b'Hello \xe4\xb8\x96\xe7\x95\x8c'

UTF8Prober.feed(data):
│
├─ 'H','e','l','l','o',' ' → ASCII → ITS_ME (single byte chars)
│
├─ 0xE4: 3-byte lead → START→ME_TWO
├─ 0xB8: continuation → ME_TWO→ME_ONE
├─ 0x96: continuation → ME_ONE→ITS_ME (complete!) → num_mb_chars=1
│
├─ 0xE7: 3-byte lead → START→ME_TWO
├─ 0x95: continuation → ME_TWO→ME_ONE
├─ 0x8C: continuation → ME_ONE→ITS_ME (complete!) → num_mb_chars=2
│
├─ No errors → structurally valid
│
└─ confidence = 1.0 - (0.5 ^ 2) = 0.75
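
The trace above can be reproduced with a small pure-Python sketch of the prober's logic (illustrative only: chardet's real prober is table-driven, and this sketch skips finer checks such as overlong and surrogate ranges):

```python
ONE_CHAR_PROB = 0.5

def sketch_utf8_confidence(data: bytes) -> float:
    """Validate UTF-8 structure and count multi-byte characters, chardet-style."""
    num_mb_chars = 0
    need = 0  # continuation bytes still expected for the current character
    for b in data:
        if need:
            if 0x80 <= b <= 0xBF:        # valid continuation byte
                need -= 1
                if need == 0:
                    num_mb_chars += 1    # completed a multi-byte character
            else:
                return 0.0               # structural error -> NOT_ME
        elif b <= 0x7F:
            continue                     # ASCII: complete on its own
        elif 0xC2 <= b <= 0xDF:
            need = 1                     # 2-byte lead
        elif 0xE0 <= b <= 0xEF:
            need = 2                     # 3-byte lead
        elif 0xF0 <= b <= 0xF4:
            need = 3                     # 4-byte lead
        else:
            return 0.0                   # stray continuation, 0xC0/0xC1, or > 0xF4

    return 1.0 - ONE_CHAR_PROB ** num_mb_chars

print(sketch_utf8_confidence('Hello 世界'.encode('utf-8')))  # 2 mb chars -> 0.75
```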

HebrewProber

Hebrew gets special handling because of bidirectional text complexity.

HebrewProber
│
├─ Problem: Hebrew can be stored in two byte orders:
│   "Logical" order: characters in reading order (right-to-left)
│   "Visual" order:  characters in display order (left-to-right)
│   Both use the same encoding but the byte sequence is REVERSED
│
├─ Strategy: Works with TWO SingleByteCharSetProbers:
│   1. Win-1255 Logical Hebrew model
│   2. ISO-8859-8 Visual Hebrew model
│
├─ HebrewProber acts as a "meta-prober":
│   │
│   ├─ Looks at word-final vs word-non-final letter forms
│   │
│   │   Hebrew has 5 letters with special final forms:
│   │   ך (final kaf)    vs כ (non-final kaf)
│   │   ם (final mem)    vs מ (non-final mem)
│   │   ן (final nun)    vs נ (non-final nun)
│   │   ף (final pe)     vs פ (non-final pe)
│   │   ץ (final tsade)  vs צ (non-final tsade)
│   │
│   │   In correctly ordered text:
│   │     Final forms appear before spaces/punctuation
│   │     Non-final forms appear before other letters
│   │
│   │   In reverse-ordered text:
│   │     Final forms appear after spaces (wrong position)
│   │
│   ├─ Counts: final_in_correct_position, final_in_wrong_position
│   │
│   └─ Uses counts to determine logical vs visual
│       If correct > wrong → logical order
│       If wrong > correct → visual order
│
└─ get_confidence(): returns confidence from the winning sub-prober
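
The final-form heuristic can be sketched in a few lines of Python (a simplified illustration, not chardet's actual HebrewProber, which operates on Win-1255 byte values rather than decoded text):

```python
# The five Hebrew letters that take a distinct word-final form
FINAL_LETTERS = set("ךםןףץ")

def guess_hebrew_order(text: str) -> str:
    """Guess logical vs visual ordering from final-letter positions.

    A final form followed by a space (or end of text) is in the 'correct'
    position; followed by another character, it is in the 'wrong' position.
    """
    correct = wrong = 0
    for prev, cur in zip(text, text[1:] + " "):  # sentinel space covers end-of-text
        if prev in FINAL_LETTERS:
            if cur == " ":
                correct += 1
            else:
                wrong += 1
    # Ties (including no final letters at all) default to logical order
    return "logical" if correct >= wrong else "visual"

print(guess_hebrew_order("שלום עולם"))        # reading-order text -> 'logical'
print(guess_hebrew_order("שלום עולם"[::-1]))  # byte-reversed text -> 'visual'
```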

Japanese Context Analysis (SJIS/EUC-JP)

SJIS and EUC-JP get additional analysis beyond standard distribution.

SJISContextAnalysis / EUCJPContextAnalysis
│
├─ Problem: SJIS and EUC-JP often have similar confidence
│   from distribution analysis alone
│
├─ Extra signal: Hiragana character usage patterns
│   Japanese text frequently uses hiragana particles
│   (は、が、の、に、を、etc.)
│
│   SJIS and EUC-JP encode hiragana at different byte positions:
│   'の' in SJIS:   0x82 0xCC
│   'の' in EUC-JP: 0xA4 0xCE
│
│   The context analyzer checks whether the decoded hiragana
│   makes sense in Japanese text
│
└─ Provides additional confidence boost for SJIS vs EUC-JP
   disambiguation
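
The byte-position difference is easy to verify with Python's built-in codecs:

```python
# The same hiragana occupies different byte ranges in each encoding
no = "の"
print(no.encode("shift_jis"))  # b'\x82\xcc'
print(no.encode("euc_jp"))     # b'\xa4\xce'
```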

SingleByteCharSetProber: In Depth

Model Structure

SingleByteCharSetModel
│
├─ char_to_order_table[256]:
│   Maps each byte value to a frequency order
│   ┌──────────────────────────────────────────────────┐
│   │ Order 0-63:    Most frequent 64 characters       │
│   │                (covers ~90%+ of typical text)    │
│   │ Order 64-253:  Less frequent characters          │
│   │ Order 254:     Character not expected in encoding│
│   │ Order 255:     Undefined/unused byte             │
│   └──────────────────────────────────────────────────┘
│
│   Example: Windows-1251 Russian
│   'о' (0xEE) → order 0  (most frequent Russian letter)
│   'е' (0xE5) → order 1
│   'а' (0xE0) → order 2
│   'и' (0xE8) → order 3
│   'н' (0xED) → order 4
│   'т' (0xF2) → order 5
│   ...
│   'ъ' (0xFA) → order 31 (rare hard sign)
│   'ё' (0xB8) → order 32 (rare yo)
│
├─ precedence_matrix[64][64]:
│   Bigram frequency categories for the top 64 characters
│
│   For each pair (char_a, char_b):
│   ┌───────────────────────────────────────────┐
│   │ 0 = NEGATIVE  : pair almost never occurs  │
│   │ 1 = UNLIKELY  : pair is uncommon          │
│   │ 2 = LIKELY    : pair occurs sometimes     │
│   │ 3 = POSITIVE  : pair is very natural      │
│   └───────────────────────────────────────────┘
│
│   Example: Russian bigrams
│   'с' → 'т' : POSITIVE  (ст is very common)
│   'т' → 'о' : POSITIVE  (то is common)
│   'ъ' → 'ъ' : NEGATIVE  (double hard sign never happens)
│
└─ typical_positive_ratio: float
    Expected ratio of POSITIVE pairs in natural text
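
The sequence-counting core can be sketched with a toy matrix (the matrix here is made up for illustration; chardet's real matrices are generated from large language corpora):

```python
# Bigram categories used by the precedence matrix
NEGATIVE, UNLIKELY, LIKELY, POSITIVE = 0, 1, 2, 3

def positive_ratio(orders, precedence):
    """Tally bigram categories over a stream of character orders and return
    the fraction of POSITIVE pairs (the raw signal behind get_confidence)."""
    counters = [0, 0, 0, 0]
    for a, b in zip(orders, orders[1:]):
        if a < 64 and b < 64:                  # only the top-64 chars are modeled
            counters[precedence[a][b]] += 1
    total = sum(counters)
    return counters[POSITIVE] / total if total else 0.0

# Toy 64x64 matrix: every pair UNLIKELY except order 0 -> order 1
matrix = [[UNLIKELY] * 64 for _ in range(64)]
matrix[0][1] = POSITIVE

print(positive_ratio([0, 1, 0, 1], matrix))    # 2 of 3 pairs are POSITIVE
```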

Tracing: “Привет” (Russian) in Windows-1251

Input: 'Привет'.encode('windows-1251')
     = b'\xcf\xf0\xe8\xe2\xe5\xf2'

SingleByteCharSetProber (Win-1251 / Russian).feed(data):
│
├─ Byte 0xCF (П): order ~20, first char, no bigram yet
├─ Byte 0xF0 (р): order ~8
│   precedence_matrix[20][8] → POSITIVE (Пр is natural)
│   seq_counters[POSITIVE] += 1
├─ Byte 0xE8 (и): order ~3
│   precedence_matrix[8][3] → POSITIVE (ри is common)
│   seq_counters[POSITIVE] += 1
├─ Byte 0xE2 (в): order ~7
│   precedence_matrix[3][7] → POSITIVE (ив is common)
│   seq_counters[POSITIVE] += 1
├─ Byte 0xE5 (е): order ~1
│   precedence_matrix[7][1] → POSITIVE (ве is natural)
│   seq_counters[POSITIVE] += 1
├─ Byte 0xF2 (т): order ~5
│   precedence_matrix[1][5] → POSITIVE (ет is natural)
│   seq_counters[POSITIVE] += 1
│
├─ Results: 5 sequences, all POSITIVE
│
└─ get_confidence():
    positive_ratio = 5/5 = 1.0
    typical_positive_ratio ≈ 0.976
    confidence ≈ 0.5 (scaled conservatively)

Same Bytes Through KOI8-R

Same bytes decoded as KOI8-R produce different letters:
  0xCF = о, 0xF0 = П, 0xE8 = Х, 0xE2 = Б, 0xE5 = Е, 0xF2 = Р
  Reads as "оПХБЕР" (nonsense, with capitals mid-word)

KOI8-R SingleByteCharSetProber:
│
├─ Bigram analysis (categories illustrative):
│   о→П: UNLIKELY  (stray capital mid-word)
│   П→Х: NEGATIVE  (пх is rare)
│   Х→Б: NEGATIVE  (хб almost never occurs)
│   Б→Е: LIKELY
│   Е→Р: POSITIVE
│
├─ NEGATIVE and UNLIKELY pairs drag the ratio down
│
└─ confidence ≈ 0.05

WINNER: Windows-1251 at ~0.5 beats KOI8-R at ~0.05
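
The cross-decoding above can be verified with Python's built-in codecs:

```python
# The same six bytes through both Cyrillic encodings
data = "Привет".encode("windows-1251")
print(data)                          # b'\xcf\xf0\xe8\xe2\xe5\xf2'
print(data.decode("windows-1251"))   # Привет
print(data.decode("koi8_r"))         # оПХБЕР (gibberish)
```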

SBCSGroupProber: Prober List

The SBCSGroupProber manages many probers. Note that the same encoding can appear multiple times with different language models:

  • Windows-1251 / Russian
  • Windows-1251 / Bulgarian
  • Windows-1251 / Macedonian
  • KOI8-R / Russian
  • ISO-8859-5 / Russian
  • ISO-8859-5 / Bulgarian
  • MacCyrillic / Russian
  • IBM866 / Russian
  • IBM855 / Russian
  • ISO-8859-7 / Greek
  • Windows-1253 / Greek
  • Windows-1256 / Arabic
  • Windows-1255 / Hebrew (logical)
  • ISO-8859-8 / Hebrew (visual)
  • TIS-620 / Thai
  • ISO-8859-9 / Turkish
  • And more…

MBCSGroupProber: Full Child List

MBCSGroupProber
├─ UTF8Prober         [state machine + count]
├─ SJISProber         [state machine + distribution + context]
├─ EUCJPProber        [state machine + distribution + context]
├─ GB2312Prober       [state machine + distribution]
├─ EUCKRProber        [state machine + distribution]
├─ Big5Prober         [state machine + distribution]
├─ EUCTWProber        [state machine + distribution]
└─ UTF1632Prober      [null-byte pattern analysis]

LanguageFilter System

Bit flags that constrain which encodings the detector considers:

┌─────────────────────┬───────┬──────────────────────────────┐
│ Flag                │ Value │ Encodings it enables         │
├─────────────────────┼───────┼──────────────────────────────┤
│ CHINESE_SIMPLIFIED  │ 0x01  │ GB2312, GB18030, HZ-GB-2312, │
│                     │       │ ISO-2022-CN                  │
│ CHINESE_TRADITIONAL │ 0x02  │ Big5, EUC-TW                 │
│ JAPANESE            │ 0x04  │ SJIS, EUC-JP, ISO-2022-JP    │
│ KOREAN              │ 0x08  │ EUC-KR, ISO-2022-KR          │
│ NON_CJK             │ 0x10  │ All single-byte encodings    │
│ ALL                 │ 0x1F  │ Everything (default)         │
│ CHINESE             │ 0x03  │ SIMPLIFIED | TRADITIONAL     │
│ CJK                 │ 0x0F  │ CHINESE | JAPANESE | KOREAN  │
└─────────────────────┴───────┴──────────────────────────────┘
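
The flag arithmetic can be sketched with a standard-library IntFlag (values copied from the table above; chardet.enums defines similar constants, though exact integer values may vary by version):

```python
from enum import IntFlag

class LanguageFilter(IntFlag):
    """Sketch of the bit flags in the table above (values assumed)."""
    NONE = 0x00
    CHINESE_SIMPLIFIED = 0x01
    CHINESE_TRADITIONAL = 0x02
    JAPANESE = 0x04
    KOREAN = 0x08
    NON_CJK = 0x10
    ALL = 0x1F
    # Composite flags are just bitwise ORs of the basic ones
    CHINESE = CHINESE_SIMPLIFIED | CHINESE_TRADITIONAL
    CJK = CHINESE | JAPANESE | KOREAN

jp_ko = LanguageFilter.JAPANESE | LanguageFilter.KOREAN
print(bool(jp_ko & LanguageFilter.JAPANESE))   # True
print(bool(jp_ko & LanguageFilter.CHINESE))    # False
```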

How Filtering Works

UniversalDetector(lang_filter=LanguageFilter.JAPANESE)
│
├─ MBCSGroupProber creation:
│   ├─ UTF8Prober    → always included (language-neutral)
│   ├─ SJISProber    → JAPANESE & 0x04 = ✓ INCLUDE
│   ├─ EUCJPProber   → JAPANESE & 0x04 = ✓ INCLUDE
│   ├─ GB2312Prober  → CHINESE_SIMPLIFIED & 0x04 = 0 ✗ SKIP
│   ├─ EUCKRProber   → KOREAN & 0x04 = 0 ✗ SKIP
│   ├─ Big5Prober    → CHINESE_TRADITIONAL & 0x04 = 0 ✗ SKIP
│   └─ UTF1632Prober → always included (language-neutral)
│
├─ SBCSGroupProber: NON_CJK & 0x04 = 0 ✗ SKIP ALL
│
└─ EscCharSetProber:
    ├─ ISO2022JP → ✓ INCLUDE
    ├─ ISO2022CN → ✗ SKIP
    ├─ ISO2022KR → ✗ SKIP
    └─ HZ        → ✗ SKIP

Usage

from chardet.universaldetector import UniversalDetector
from chardet.enums import LanguageFilter

# Japanese encodings only (language-neutral UTF probers stay enabled)
detector = UniversalDetector(lang_filter=LanguageFilter.JAPANESE)

# Single-byte (non-CJK) encodings only
detector = UniversalDetector(lang_filter=LanguageFilter.NON_CJK)

# Flags combine with bitwise OR
detector = UniversalDetector(
    lang_filter=LanguageFilter.JAPANESE | LanguageFilter.KOREAN
)

CLI Tool: chardetect

chardet ships with a command-line tool for detecting file encodings.

Invocation

$ chardetect           # as console_scripts entry point
$ python -m chardet    # as module

Architecture

chardetect CLI
│
├─ Entry point: chardet.cli.chardetect module
│   └─ main() function, registered as console_scripts
│
├─ Argument parsing (argparse):
│   ├─ positional: files (one or more paths, or stdin if none)
│   ├─ --version: show chardet version
│   └─ -m / --minimal: output just the encoding name
│
└─ Processing flow:
    ├─ For each input file (or stdin):
    │   ├─ Open in binary mode ('rb')
    │   ├─ Create UniversalDetector instance
    │   ├─ Read line by line:
    │   │   for line in file:
    │   │       detector.feed(line)
    │   │       if detector.done:
    │   │           break  ← early exit, doesn't read entire file
    │   ├─ detector.close()
    │   └─ Output result
    │
    └─ Output format:
        Normal:  filename: encoding with confidence value
        Minimal: encoding

Usage Examples

$ chardetect somefile.txt
somefile.txt: utf-8 with confidence 0.99

$ chardetect file1.txt file2.csv file3.html
file1.txt: utf-8 with confidence 0.99
file2.csv: Windows-1252 with confidence 0.73
file3.html: ascii with confidence 1.0

$ chardetect --minimal somefile.txt
utf-8

$ echo -n "some bytes" | chardetect
stdin: ascii with confidence 1.0

Full CLI Trace

$ chardetect mixed_chinese.txt

main()
│
├─ Parse args: files=['mixed_chinese.txt'], minimal=False
├─ Open 'mixed_chinese.txt' in binary mode
├─ detector = UniversalDetector()
│
├─ Line 1: b'\xc4\xe3\xba\xc3\n'  (你好\n in GB2312)
│   ├─ detector.feed(line)
│   │   ├─ BOM check: no BOM
│   │   ├─ All bytes > 0x80 → HIGH_BYTE
│   │   ├─ Create probers, feed to all
│   │   └─ Only 4 chars, not enough for FOUND_IT
│   └─ done = False → continue
│
├─ Line 2: b'\xca\xc0\xbd\xe7\xc4\xe3\xba\xc3\n'  (世界你好\n)
│   ├─ detector.feed(line)
│   │   ├─ UTF8Prober: ERROR → NOT_ME
│   │   ├─ GB2312Prober: confidence ~0.8
│   │   └─ Not at FOUND_IT threshold yet
│   └─ done = False → continue
│
├─ Line 3: b'\xd6\xd0\xbb\xaa\xc8\xcb\xc3\xf1...\n' (中华人民共和国\n)
│   ├─ detector.feed(line)
│   │   ├─ GB2312Prober: confidence > 0.95
│   │   └─ Hits FOUND_IT → done = True
│   └─ done = True → break (remaining lines NOT read)
│
├─ detector.close()
│   └─ result = {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
│
└─ Print: "mixed_chinese.txt: GB2312 with confidence 0.99"

detect_all() Internals

How It Differs From detect()

detect() flow through GroupProber:
│  MBCSGroupProber.get_confidence():
│  └─ Returns ONLY the best child's confidence → 0.99 (GB2312)
│  Caller sees one number

detect_all() flow:
│  Reaches INSIDE the group probers:
│  └─ MBCSGroupProber.probers → list of child probers
│     ├─ prober[0] (UTF8): 0.0 (NOT_ME)
│     ├─ prober[1] (GB2312): 0.99
│     ├─ prober[2] (SJIS): 0.05
│     ├─ prober[3] (EUCKR): 0.08
│     └─ ...each one becomes a separate entry in results
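
The collect-and-rank step can be sketched as follows (field names and the filtering behavior of ignore_threshold are assumptions for illustration):

```python
MINIMUM_THRESHOLD = 0.20  # value per the constants table below

def rank_results(child_results, ignore_threshold=False):
    """Collect per-prober results and sort by confidence, best first."""
    kept = [r for r in child_results
            if ignore_threshold or r["confidence"] > MINIMUM_THRESHOLD]
    return sorted(kept, key=lambda r: r["confidence"], reverse=True)

children = [
    {"encoding": "utf-8",  "confidence": 0.0},
    {"encoding": "GB2312", "confidence": 0.99},
    {"encoding": "SJIS",   "confidence": 0.05},
]
print(rank_results(children))  # [{'encoding': 'GB2312', 'confidence': 0.99}]
```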

When It’s Useful

Input: b'\xc0\xd1'  (only 2 bytes)

detect() returns:
    {'encoding': 'EUC-KR', 'confidence': 0.35, 'language': 'Korean'}

detect_all() returns:
    [
        {'encoding': 'EUC-KR',    'confidence': 0.35, 'language': 'Korean'},
        {'encoding': 'GB2312',    'confidence': 0.33, 'language': 'Chinese'},
        {'encoding': 'Big5',      'confidence': 0.30, 'language': 'Chinese'},
        {'encoding': 'ISO-8859-1','confidence': 0.25, 'language': ''},
    ]

Now you can see all candidates are close — the detector is guessing.

Internal Constants and Thresholds

┌────────────────────────────────┬───────┬──────────────────────────────┐
│ Constant                       │ Value │ Purpose                      │
├────────────────────────────────┼───────┼──────────────────────────────┤
│ MINIMUM_THRESHOLD              │ 0.20  │ Below this, result = None    │
│ SHORTCUT_THRESHOLD             │ 0.95  │ Above this, stop early       │
│                                │       │ (done=True)                  │
│ ENOUGH_DATA_THRESHOLD          │ 1024  │ Bytes needed for reliable    │
│                                │       │ distribution analysis        │
│ SURE_YES                       │ 0.99  │ Returned for BOM detections  │
│ SURE_NO                        │ 0.01  │ Minimum non-zero confidence  │
└────────────────────────────────┴───────┴──────────────────────────────┘

Prober States

┌──────────────┬───────┬─────────────────────────────────────────────┐
│ State        │ Value │ Meaning                                     │
├──────────────┼───────┼─────────────────────────────────────────────┤
│ DETECTING    │ 0     │ Still gathering data, no conclusion yet     │
│ FOUND_IT     │ 1     │ Confident match (triggers early termination)│
│ NOT_ME       │ 2     │ Ruled out (state machine error or similar)  │
└──────────────┴───────┴─────────────────────────────────────────────┘

How Thresholds Interact

feed() loop:
│  if prober.state == FOUND_IT:
│     └─ detector.done = True → stop feeding
│  if prober.state == NOT_ME:
│     └─ Prober deactivated, never fed again

close():
│  best = max(prober.get_confidence() for active probers)
│  if best > MINIMUM_THRESHOLD (0.20):
│      result = that encoding
│  else:
│      result = {'encoding': None, 'confidence': 0.0, 'language': ''}
│      (chardet is saying "I genuinely don't know")
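
The close()-time selection can be sketched as a simplified stand-in (not chardet's actual implementation, which also tracks prober state and BOM results):

```python
MINIMUM_THRESHOLD = 0.20

def pick_result(candidates):
    """Mimic close(): take the best (encoding, confidence) pair,
    or report None when nothing clears the floor."""
    encoding, confidence = max(candidates, key=lambda c: c[1],
                               default=(None, 0.0))
    if confidence > MINIMUM_THRESHOLD:
        return {"encoding": encoding, "confidence": confidence}
    return {"encoding": None, "confidence": 0.0}

print(pick_result([("windows-1251", 0.5), ("KOI8-R", 0.05)]))
# {'encoding': 'windows-1251', 'confidence': 0.5}
```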

Complete Priority Flow

detector.feed(data)
│
│  ┌─────────────────────────────────────────────────┐
│  │ PRIORITY 1: BOM Detection (first feed() only)   │
│  │   Confidence: 1.0 (absolute certainty)           │
│  │   If found → done, skip everything else          │
│  └──────────────────────┬──────────────────────────┘
│                         │ no BOM
│  ┌──────────────────────▼──────────────────────────┐
│  │ CLASSIFY: Scan bytes to determine input type     │
│  │   PURE_ASCII: all bytes < 0x80                   │
│  │   ESC_ASCII:  saw 0x1B or 0x7E                   │
│  │   HIGH_BYTE:  saw bytes >= 0x80                  │
│  └──────┬────────────┬────────────┬────────────────┘
│         │            │            │
│    PURE_ASCII    ESC_ASCII    HIGH_BYTE
│         │            │            │
│  ┌──────▼──────┐ ┌───▼────────┐ ┌▼───────────────────┐
│  │ Do nothing  │ │ PRIORITY 2 │ │ PRIORITY 3         │
│  │ Wait for    │ │ EscCharSet │ │ MBCS + SBCS        │
│  │ more data   │ │ Prober     │ │ GroupProbers       │
│  │ or close()  │ │            │ │                    │
│  └──────┬──────┘ └───┬────────┘ └┬───────────────────┘
│         │            │            │
│  ┌──────▼──────────────▼───────────▼──────────────────┐
│  │ detector.close()                                    │
│  │                                                     │
│  │ if PURE_ASCII → result = 'ascii', confidence 1.0    │
│  │ if BOM found  → result = BOM encoding, conf 1.0     │
│  │ if prober FOUND_IT → result from that prober        │
│  │ else → compare all prober confidences               │
│  │        pick highest above minimum threshold (0.2)   │
│  │        if none above threshold → result = None      │
│  └─────────────────────────────────────────────────────┘

Complete Prober Hierarchy

UniversalDetector
│
├─ BOM Detection (hardcoded, confidence 1.0)
│   └─ UTF-8-SIG, UTF-16-LE, UTF-16-BE, UTF-32-LE, UTF-32-BE
│
├─ EscCharSetProber (triggered by ESC/~ bytes)
│   ├─ CodingStateMachine(ISO2022JP)  → ISO-2022-JP
│   ├─ CodingStateMachine(ISO2022CN)  → ISO-2022-CN
│   ├─ CodingStateMachine(ISO2022KR)  → ISO-2022-KR
│   └─ CodingStateMachine(HZ)        → HZ-GB-2312
│
├─ MBCSGroupProber (triggered by high bytes)
│   ├─ UTF8Prober         [state machine + count]
│   ├─ SJISProber         [state machine + distribution + context]
│   ├─ EUCJPProber        [state machine + distribution + context]
│   ├─ GB2312Prober       [state machine + distribution]
│   ├─ EUCKRProber        [state machine + distribution]
│   ├─ Big5Prober         [state machine + distribution]
│   ├─ EUCTWProber        [state machine + distribution]
│   └─ UTF1632Prober      [null-byte pattern analysis]
│
├─ SBCSGroupProber (triggered by high bytes)
│   ├─ Latin1Prober               [byte class heuristic]
│   ├─ HebrewProber               [final-form meta-analysis]
│   │   ├─ Win-1255 Logical model
│   │   └─ ISO-8859-8 Visual model
│   └─ Many SingleByteCharSetProbers [bigram + frequency]:
│       ├─ Win-1251 / Russian
│       ├─ KOI8-R / Russian
│       ├─ ISO-8859-5 / Russian
│       ├─ MacCyrillic / Russian
│       ├─ IBM866 / Russian
│       ├─ IBM855 / Russian
│       ├─ Win-1251 / Bulgarian
│       ├─ ISO-8859-5 / Bulgarian
│       ├─ Win-1251 / Macedonian
│       ├─ Win-1253 / Greek
│       ├─ ISO-8859-7 / Greek
│       ├─ Win-1256 / Arabic
│       ├─ TIS-620 / Thai
│       ├─ ISO-8859-9 / Turkish
│       ├─ Win-1255 / Hebrew (logical)
│       ├─ ISO-8859-8 / Hebrew (visual)
│       └─ ... and more
│
└─ Fallback: PURE_ASCII → 'ascii', confidence 1.0
             No winner above threshold → None

Edge Cases

Empty Input

chardet.detect(b'')
→ {'encoding': None, 'confidence': 0.0, 'language': ''}

Pure ASCII

chardet.detect(b'Hello world')
→ {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
No probers are ever created (optimization).

Single High Byte

chardet.detect(b'\xe9')
Most probers don't have enough data. Likely returns None or very low
confidence Latin-1.

Reusing Detector with reset()

detector = UniversalDetector()

detector.feed(chinese_data)
detector.close()
result1 = detector.result  # GB2312

detector.reset()  # Must reset before reuse!
# Clears all state: result, done, _got_data, _input_state,
# _last_char, all probers destroyed/reset

detector.feed(korean_data)
detector.close()
result2 = detector.result  # EUC-KR

Incremental Feeding with Small Chunks

detector = UniversalDetector()
for byte in data:
    detector.feed(bytes([byte]))
    if detector.done:
        break
detector.close()

This works because state machines and distribution analyzers maintain state between feed() calls. However, BOM detection is less reliable with tiny first chunks — it only checks the first feed’s data. Recommendation: feed at least 4 bytes in the first call.

All Null Bytes

chardet.detect(b'\x00\x00\x00\x00\x00\x00\x00\x00')
UTF1632Prober may trigger but pattern doesn't match cleanly.
Likely result: None (binary data, not text).

Design Philosophy

The whole design is an elegant cascade from certainty to statistical guesswork:

  1. BOMs are unambiguous (100% confidence)
  2. Escape sequences are nearly unambiguous (specific sequence identifies exactly one encoding)
  3. State machines cheaply eliminate structurally impossible encodings
  4. Distribution analysis disambiguates encodings with overlapping byte ranges using character frequency models
  5. Bigram/sequence analysis provides fine-grained disambiguation for single-byte encodings

Each layer trades specificity for coverage, and the system as a whole handles a remarkably wide range of real-world encoding scenarios.


About This Document

Date written: March 14, 2026

Author: Claude Opus 4.6 (Anthropic), in conversation with Dan Blanchard (chardet maintainer)

Source of knowledge: This document was written entirely from the model’s training data memory, with no web searches or file reads of the actual chardet source code performed during the session. The knowledge likely reflects a composite of chardet versions 4.x through 5.x (roughly 2021-2024), based on the features described:

  • detect_all() was introduced around chardet 4.0
  • UTF1632Prober (BOM-less UTF-16/32 detection) appears to be a 4.x+ addition
  • LanguageFilter as a formalized enum and the lang_filter parameter on UniversalDetector are from a similar era

Because training data doesn’t carry clean version tags, some details may reflect one version while others reflect a different one. Specific values like threshold constants, byte class ranges, enum integer values, the exact list of single-byte probers, and confidence calculation formulas may be inaccurate. Consult the actual chardet source code for authoritative information.