Overview
chardet is a Python library for automatic character encoding detection. It is a port of Mozilla’s Universal Charset Detector, originally written in C++ for Mozilla Firefox. The original Python port was done by Mark Pilgrim (of “Dive Into Python” fame), and has been maintained by Dan Blanchard for 12+ years. It is one of the most widely depended-upon packages on PyPI, being a transitive dependency for a huge chunk of the Python ecosystem (via requests and others).
Detection Approach
The detection approach uses several techniques in parallel:
- BOM detection — checks for byte order marks at the start of the data
- Escape-based detection — for encodings like ISO-2022-JP that use escape sequences
- Multi-byte probers — for CJK encodings (UTF-8, SJIS, EUC-JP, GB2312, Big5, etc.) using state machines
- Single-byte probers — for encodings like Latin-1, Windows-1252, KOI8-R, etc., using character frequency analysis and sequence analysis
- UniversalDetector — the orchestrator class that feeds data through all the probers and picks the best result
API
chardet.detect(byte_string)
The simple, all-in-one convenience function. Takes a bytes object and returns a dict:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
Under the hood, it is a thin wrapper around UniversalDetector:
- Creates a UniversalDetector instance
- Feeds the entire byte string into it in one shot
- Calls close()
- Returns the result dict
chardet.detect_all(byte_str, ignore_threshold=False)
Returns a list of possible encodings ranked by confidence, rather than just the top one:
[
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'},
{'encoding': 'Big5', 'confidence': 0.12, 'language': 'Chinese'},
{'encoding': 'EUC-KR', 'confidence': 0.08, 'language': 'Korean'},
]
Key difference from detect(): instead of just picking the best prober result, it reaches inside the group probers to collect confidence from every individual child prober, sorts them by confidence descending, and returns the full list.
The ignore_threshold parameter controls whether probers below the minimum threshold (0.2) are included.
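The ranking-and-filtering step can be sketched in a few lines. This is an illustrative reconstruction, not chardet's code; the result dicts are made-up values standing in for real prober output:

```python
# Sketch of detect_all's ranking/filtering step (illustrative values,
# not real prober output).
MINIMUM_THRESHOLD = 0.2

results = [
    {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'},
    {'encoding': 'EUC-KR', 'confidence': 0.08, 'language': 'Korean'},
    {'encoding': 'Big5',   'confidence': 0.12, 'language': 'Chinese'},
]

def rank(results, ignore_threshold=False):
    # ignore_threshold=True keeps even the probers below the cutoff
    kept = [r for r in results
            if ignore_threshold or r['confidence'] >= MINIMUM_THRESHOLD]
    return sorted(kept, key=lambda r: r['confidence'], reverse=True)

print([r['encoding'] for r in rank(results)])
print([r['encoding'] for r in rank(results, ignore_threshold=True)])
```

With the default `ignore_threshold=False`, only GB2312 survives the 0.2 cutoff here; with `True`, all three come back, sorted by confidence.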
UniversalDetector Class
The core engine for incremental/streaming encoding detection.
Constructor:
- UniversalDetector(lang_filter=LanguageFilter.ALL) — optionally pass LanguageFilter flags (from chardet.enums) to limit which encodings are considered
Methods:
- feed(data) — feed a chunk of bytes into the detector. Can be called multiple times for streaming use. Short-circuits if detection reaches high confidence (sets self.done = True)
- close() — signal that you're done feeding data. Triggers final analysis and picks the best result from the probers
- reset() — reset the detector state so you can reuse the instance for a new detection
Properties / Attributes:
- result — a dict with encoding, confidence (0.0-1.0), and language, populated after close() is called
- done — boolean, True if the detector has reached a confident conclusion early
Typical Usage Pattern:
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for line in binary_file:
    detector.feed(line)
    if detector.done:
        break
detector.close()
print(detector.result)
# e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
Data Flow: detect() Through UniversalDetector
Example: 'Hello, this is tést'.encode('utf-8')
Input bytes: b'Hello, this is t\xc3\xa9st'
chardet.detect(byte_string)
│
├─ Creates UniversalDetector instance
├─ Calls detector.feed(byte_string)
│ │
│ ├─ Step 1: BOM Check (first call only)
│ │ └─ Checks first 2-4 bytes for BOM (UTF-8 BOM, UTF-16 LE/BE, UTF-32 LE/BE)
│ │ └─ "Hell" has no BOM → continue
│ │
│ ├─ Step 2: Scan bytes to classify input
│ │ └─ Iterates through each byte
│ │ └─ Sees 0xC3 and 0xA9 → these are high bytes (> 0x7F)
│ │ └─ Sets input_state = HIGH_BYTE
│ │ └─ (If it saw escape sequences, would set ESCAPE instead)
│ │
│ ├─ Step 3: Based on input_state, activate probers
│ │ │
│ │ │ input_state == HIGH_BYTE, so activates:
│ │ │
│ │ ├─ MBCSGroupProber (multi-byte group)
│ │ │ │
│ │ │ ├─ UTF8Prober
│ │ │ │ └─ Has a CodingStateMachine with UTF-8 transitions
│ │ │ │ └─ Feeds bytes through state machine
│ │ │ │ └─ 0xC3 0xA9 is a VALID 2-byte UTF-8 sequence
│ │ │ │ └─ Returns state: FOUND_IT or high confidence
│ │ │ │
│ │ │ ├─ SJISProber
│ │ │ │ └─ Feeds bytes through Shift_JIS state machine
│ │ │ │ └─ 0xC3 0xA9 could be valid SJIS, but context is wrong
│ │ │ │ └─ Returns low confidence
│ │ │ │
│ │ │ ├─ EUCJPProber, EUCKRProber, GB2312Prober, Big5Prober, etc.
│ │ │ │ └─ Similar: feed through respective state machines
│ │ │ │ └─ Most will reject or return low confidence
│ │ │ │
│ │ │ └─ Returns best confidence among its children
│ │ │
│ │ └─ SBCSGroupProber (single-byte group)
│ │ │
│ │ ├─ Latin1Prober
│ │ │ └─ Analyzes byte frequency patterns
│ │ │ └─ 0xC3 and 0xA9 are valid Latin-1 chars
│ │ │ └─ Returns moderate confidence
│ │ │
│ │ ├─ Windows1252Prober, ISO8859_2Prober, etc.
│ │ │ └─ Each uses character frequency models for its
│ │ │ target language to score the input
│ │ │
│ │ └─ Returns best confidence among its children
│ │
│ └─ Step 4: Check if any prober hit FOUND_IT threshold
│ └─ If so, set self.done = True (short-circuit)
│
├─ Calls detector.close()
│ │
│ ├─ If input was pure ASCII → return {'encoding': 'ascii', ...}
│ ├─ If BOM was found → return that encoding
│ └─ Otherwise:
│ ├─ Collect confidence from all active probers
│ ├─ Find the prober with highest confidence
│ ├─ For this input: UTF8Prober wins
│ │ └─ The 0xC3 0xA9 sequence is valid UTF-8
│ │ └─ All ASCII bytes are valid UTF-8
│ │ └─ High confidence
│ └─ Populate self.result
│
└─ Returns detector.result
└─ {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
Why UTF-8 wins: The key bytes are 0xC3 0xA9. In UTF-8, 0xC3 signals “start of a 2-byte sequence” and 0xA9 is a valid continuation byte (10xxxxxx pattern). The UTF-8 state machine sees a perfectly valid transition. Meanwhile, the single-byte probers see the same bytes as valid characters in their encodings (e.g., Latin-1 reads them as two characters, “Ã©”), but their frequency analysis scores lower because “Ã” followed by “©” is an unusual combination in natural language.
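The contrast is easy to reproduce with Python's built-in codecs: both decodings succeed without error, but only UTF-8 treats 0xC3 0xA9 as a single character:

```python
data = b'Hello, this is t\xc3\xa9st'

# UTF-8 consumes 0xC3 0xA9 as one two-byte character:
print(data.decode('utf-8'))    # Hello, this is tést

# Latin-1 maps each byte to its own character:
print(data.decode('latin-1'))  # Hello, this is tÃ©st
```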
Example: GB18030 Chinese — '你好世界'.encode('gb18030')
Input bytes: b'\xc4\xe3\xba\xc3\xca\xc0\xbd\xe7'
Each character is 2 bytes: 你=0xC4 0xE3, 好=0xBA 0xC3, 世=0xCA 0xC0, 界=0xBD 0xE7
chardet.detect(byte_string)
│
├─ Creates UniversalDetector
├─ Calls detector.feed(byte_string)
│ │
│ ├─ Step 1: BOM Check
│ │ └─ 0xC4 0xE3... → no BOM match → continue
│ │
│ ├─ Step 2: Byte Classification
│ │ └─ Every byte here is > 0x7F (all high bytes)
│ │ └─ No ASCII at all, no escape sequences
│ │ └─ input_state = HIGH_BYTE
│ │
│ ├─ Step 3: Activate probers
│ │ │
│ │ ├─ MBCSGroupProber
│ │ │ │
│ │ │ ├─ UTF8Prober
│ │ │ │ └─ 0xC4 = start of 2-byte UTF-8 (110xxxxx)
│ │ │ │ └─ 0xE3 = start of 3-byte UTF-8 (1110xxxx), NOT valid continuation
│ │ │ │ └─ State machine → ERROR state
│ │ │ │ └─ Prober state: NOT_ME (ruled out)
│ │ │ │
│ │ │ ├─ GB2312Prober / GB18030Prober
│ │ │ │ └─ CodingStateMachine: all 4 two-byte pairs valid
│ │ │ │ └─ Distribution analysis: checks character frequencies
│ │ │ │ against Chinese character frequency model
│ │ │ │ └─ 你好世界 are all common Chinese characters
│ │ │ │ └─ High frequency match → HIGH confidence
│ │ │ │
│ │ │ ├─ SJISProber
│ │ │ │ └─ State machine may accept some byte pairs
│ │ │ │ └─ Distribution analysis against Japanese model → poor match
│ │ │ │ └─ Returns LOW confidence
│ │ │ │
│ │ │ ├─ EUCKRProber
│ │ │ │ └─ Byte ranges overlap with EUC-KR
│ │ │ │ └─ Distribution analysis against Korean model → poor match
│ │ │ │ └─ Returns LOW confidence
│ │ │ │
│ │ │ ├─ EUCJPProber, Big5Prober, etc.
│ │ │ │ └─ Distribution analysis doesn't match respective models
│ │ │ │ └─ Return LOW confidence
│ │ │ │
│ │ │ └─ Best child: GB2312/GB18030 prober wins
│ │ │
│ │ └─ SBCSGroupProber
│ │ └─ All bytes are high bytes, frequency patterns
│ │ don't match any European language models
│ │ └─ Returns LOW confidence across the board
│ │
│ └─ Step 4: Short-circuit check
│ └─ GB prober confidence may trigger FOUND_IT → done = True
│
├─ Calls detector.close()
│ │
│ └─ Collects all prober results:
│ ├─ UTF8Prober: NOT_ME (ruled out by state machine)
│ ├─ GB2312Prober: ~0.99 confidence ← WINNER
│ ├─ SJISProber: ~0.1-0.3
│ ├─ EUCKRProber: ~0.1-0.3
│ ├─ Big5Prober: ~0.1-0.3
│ ├─ Latin1 etc: very low
│ └─ Selects GB2312 prober as winner
│
└─ Returns detector.result
└─ {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
Key distinction from UTF-8: With CJK encodings, the byte ranges heavily overlap (GB2312, EUC-KR, Big5, SJIS all use similar high-byte ranges). The state machines alone can’t differentiate them. That’s where distribution analysis becomes critical — checking whether the decoded characters are commonly used in that language.
Note on GB2312 vs GB18030: chardet often reports GB2312 even when the encoding is technically GB18030, since GB18030 is a superset of GB2312. If all the characters fall within the GB2312 range, the detector can’t distinguish between them. This is a known quirk.
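The quirk is easy to verify with Python's codecs: for characters inside the GB2312 repertoire, GB18030 produces byte-for-byte identical output, so the two really are indistinguishable from the bytes alone:

```python
text = '你好世界'

# GB18030 keeps GB2312's mappings, so common characters encode identically
gb18030_bytes = text.encode('gb18030')
gb2312_bytes = text.encode('gb2312')

assert gb18030_bytes == gb2312_bytes == b'\xc4\xe3\xba\xc3\xca\xc0\xbd\xe7'
```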
CodingStateMachine: Deep Dive
The state machine is the first line of defense — it determines whether a byte sequence is structurally valid for a given encoding.
Architecture
CodingStateMachine
├─ model: StateMachineModel (encoding-specific transition table)
├─ current_state: starts at START
└─ next_state(byte) → feeds one byte, returns new state
States:
┌─────────┐
│ START │ ← initial state, also "ready for next character"
├─────────┤
│ ME_ONE │ ← in the middle of a multi-byte sequence (need 1 more)
├─────────┤
│ ME_TWO │ ← need 2 more bytes
├─────────┤
│ ME_THREE│ ← need 3 more bytes (GB18030 4-byte sequences)
├─────────┤
│ ITS_ME │ ← complete valid character decoded (terminal, resets)
├─────────┤
│ ERROR │ ← invalid byte for this encoding (terminal)
└─────────┘
Transition Table Structure
Each encoding defines a model with:
- Class table: maps every possible byte value (0x00-0xFF) to a byte class
- State table: given (current_state, byte_class) → next_state
Example: GB2312 byte classification (conceptual):
Byte range → Class
0x00-0x20 → 0 (control chars)
0x21-0x7E → 1 (ASCII printable)
0x7F → 2 (DEL)
0x80-0xA0 → 3 (undefined range)
0xA1-0xFE → 4 (valid GB2312 first/second byte)
0xFF → 5 (invalid)
State transitions for GB2312:
Class 0 Class 1 Class 2 Class 3 Class 4 Class 5
START → ERROR ITS_ME ERROR ERROR ME_ONE ERROR
ME_ONE → ERROR ERROR ERROR ERROR ITS_ME ERROR
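The two-table design above can be sketched as a tiny standalone state machine. This uses the *conceptual* classes and transitions from the tables above, not chardet's real models (which are flat integer arrays rather than dicts):

```python
# Minimal two-table state machine in the spirit of CodingStateMachine,
# using the conceptual GB2312 class/state tables above.
START, ME_ONE, ITS_ME, ERROR = 'START', 'ME_ONE', 'ITS_ME', 'ERROR'

def byte_class(b):
    # Class table: byte value -> byte class (see table above)
    if b <= 0x20:
        return 0
    if b <= 0x7E:
        return 1
    if b == 0x7F:
        return 2
    if b <= 0xA0:
        return 3
    if b <= 0xFE:
        return 4
    return 5

# State table: (state, class) -> next state; anything missing is ERROR
TRANSITIONS = {
    (START, 1): ITS_ME,   # ASCII printable: complete single-byte char
    (START, 4): ME_ONE,   # high byte: need one more
    (ME_ONE, 4): ITS_ME,  # valid second byte: character complete
}

def next_state(state, b):
    if state == ITS_ME:   # a completed character resets the machine
        state = START
    return TRANSITIONS.get((state, byte_class(b)), ERROR)

state = START
for b in b'\xc4\xe3\xba\xc3':  # 你好 in GB2312
    state = next_state(state, b)

assert state == ITS_ME  # two complete characters, no errors
```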
Tracing: GB18030 Byte-by-Byte Through Multiple State Machines
Input: 0xC4 0xE3 0xBA 0xC3 0xCA 0xC0 0xBD 0xE7
GB2312 State Machine
Byte 0xC4:
└─ class_table[0xC4] → class 4 (valid GB range)
└─ state_table[START][class 4] → ME_ONE
└─ "I need one more byte to complete this character"
Byte 0xE3:
└─ class_table[0xE3] → class 4 (valid GB range)
└─ state_table[ME_ONE][class 4] → ITS_ME
└─ "Valid 2-byte character complete!" → reset to START
└─ Character decoded: 你
Byte 0xBA:
└─ class_table[0xBA] → class 4
└─ state_table[START][class 4] → ME_ONE
Byte 0xC3:
└─ class_table[0xC3] → class 4
└─ state_table[ME_ONE][class 4] → ITS_ME
└─ Character decoded: 好
...same pattern for 0xCA 0xC0 (世) and 0xBD 0xE7 (界)
Result: 4 valid characters, 0 errors → structurally VALID
UTF-8 State Machine
Byte 0xC4:
└─ class_table[0xC4] → class for 110xxxxx (2-byte start)
└─ state_table[START][two_byte_start] → ME_ONE
└─ "Expecting one continuation byte (10xxxxxx)"
Byte 0xE3:
└─ class_table[0xE3] → class for 1110xxxx (3-byte start!)
└─ state_table[ME_ONE][three_byte_start] → ERROR
└─ "Expected continuation byte, got a new sequence start"
└─ *** UTF-8 RULED OUT ***
Shift_JIS State Machine
Byte 0xC4:
└─ class_table[0xC4] → Katakana half-width range (0xA1-0xDF)
└─ Valid single-byte character in SJIS
└─ state_table[START][katakana] → ITS_ME
Byte 0xE3:
└─ class_table[0xE3] → valid SJIS lead byte (0xE0-0xEF)
└─ state_table[START][sjis_lead] → ME_ONE
Byte 0xBA:
└─ SJIS second byte range is 0x40-0x7E, 0x80-0xFC
└─ 0xBA is valid → ITS_ME
...may survive structurally, but decoded characters are nonsense
Result: structurally VALID but distribution will be terrible
Distribution Analysis: Deep Dive
When multiple encodings pass the state machine (as GB2312, EUC-KR, and SJIS often do for Chinese input), distribution analysis breaks the tie.
Architecture
CharDistributionAnalysis
├─ char_to_order_table: maps decoded character codes → frequency rank
├─ typical_distribution_ratio: expected ratio for this language
├─ freq_chars: count of frequently-used characters seen
├─ total_chars: count of total characters analyzed
└─ get_confidence() → float
Frequency Table Construction (offline, from corpora)
For Chinese (GB2312):
Analyze a large corpus of Chinese text
Rank characters by frequency of occurrence
Order 0-511: "most frequent" bucket
e.g., 的 一 是 不 了 人 我 在 有 他 这 ...
Order 512+: "less frequent"
Order -1: "not seen in training corpus"
typical_distribution_ratio = expected_freq_chars / expected_total_chars
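The confidence computation can be sketched as follows. The ratio logic mirrors the description above, but the character ranks and the 0.75 ratio are invented for illustration, not taken from chardet's real tables:

```python
# Toy version of distribution-based confidence scoring.
FREQ_CUTOFF = 512   # orders below this count as "frequent"
SURE_YES = 0.99
MIN_CHARS = 3       # too little data -> no judgment

def distribution_confidence(orders, typical_ratio):
    total = len(orders)
    freq = sum(1 for order in orders if 0 <= order < FREQ_CUTOFF)
    if total < MIN_CHARS:
        return 0.01
    r = freq / total / typical_ratio
    return min(r, SURE_YES)

# 你 好 世 界 under a Chinese model: all four hypothetical ranks are low
assert distribution_confidence([50, 80, 200, 300], 0.75) == 0.99

# Same bytes under a Korean model: characters absent from the corpus (-1)
assert distribution_confidence([-1, -1, -1, -1], 0.75) < 0.02
```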
Tracing: Chinese Characters Through Distribution Analysis
Input characters (decoded as GB2312): 你 好 世 界
Character: 你
└─ GB2312 code point → look up in char_to_order_table
└─ 你 is extremely common in Chinese
└─ order = low number (say ~50) → falls in "frequent" bucket
└─ freq_chars += 1, total_chars += 1
Character: 好
└─ Also extremely common
└─ order = low number (say ~80) → "frequent" bucket
└─ freq_chars += 1, total_chars += 1
Character: 世
└─ Common character
└─ order = moderate (say ~200) → still in "frequent" bucket
└─ freq_chars += 1, total_chars += 1
Character: 界
└─ Fairly common
└─ order = moderate (say ~300) → still in "frequent" bucket
└─ freq_chars += 1, total_chars += 1
Result: freq_chars=4, total_chars=4
└─ ratio = 4/4 = 1.0
└─ Compare to typical_distribution_ratio
└─ Very high match → HIGH confidence (~0.99)
Same Bytes Through Korean Distribution Analysis
Same byte pairs decoded as EUC-KR:
0xC4 0xE3 → some Korean character (or invalid)
0xBA 0xC3 → some Korean character
0xCA 0xC0 → some Korean character
0xBD 0xE7 → some Korean character
Look up each in Korean char_to_order_table:
└─ These map to uncommon or meaningless Korean characters
└─ Most get high order numbers or -1 (not in corpus)
└─ freq_chars ≈ 0, total_chars = 4
└─ ratio = 0/4 = 0.0
└─ Far below typical_distribution_ratio
└─ LOW confidence (~0.01)
Confidence Flow Back Up to UniversalDetector
detector.close() called
│
├─ Collect results from all prober groups:
│
│ MBCSGroupProber.get_confidence()
│ │
│ │ Iterates through child probers, returns best:
│ │
│ ├─ UTF8Prober: state=NOT_ME → confidence = 0.0
│ ├─ GB2312Prober: state=DETECTING
│ │ ├─ coding_sm: no errors (structurally valid)
│ │ └─ distribution_analyzer.get_confidence() → 0.99
│ │ └─ combined confidence → 0.99 ← WINNER
│ ├─ SJISProber: state=DETECTING
│ │ ├─ coding_sm: no errors (structurally valid)
│ │ └─ distribution_analyzer.get_confidence() → 0.05
│ │ └─ combined confidence → 0.05
│ ├─ EUCKRProber: state=DETECTING
│ │ └─ distribution confidence → 0.01
│ ├─ Big5Prober: state=DETECTING
│ │ └─ distribution confidence → 0.08
│ ├─ EUCJPProber: state=DETECTING
│ │ └─ distribution confidence → 0.02
│ │
│ └─ Returns: GB2312, confidence 0.99
│
│ SBCSGroupProber.get_confidence()
│ │
│ │ All single-byte probers see only high bytes
│ │ No ASCII context to anchor frequency analysis
│ │ Best confidence maybe ~0.1-0.2
│ │
│ └─ Returns: some Latin variant, confidence ~0.15
│
├─ Compare group winners:
│ GB2312 @ 0.99 vs Latin-something @ 0.15
│
├─ Winner: GB2312 @ 0.99
│
├─ Apply minimum threshold (typically 0.2)
│ └─ 0.99 > 0.2 → passes
│
└─ self.result = {
'encoding': 'GB2312',
'confidence': 0.99,
'language': 'Chinese'
}
BOM Detection
BOM (Byte Order Mark) detection is the very first thing that happens in UniversalDetector.feed(), and it only runs once (on the first call to feed()).
BOM Detection Table
┌─────────────────────┬──────────────────────┬─────────┐
│ Byte Sequence │ Encoding │ Length │
├─────────────────────┼──────────────────────┼─────────┤
│ 0xEF 0xBB 0xBF │ UTF-8-SIG │ 3 bytes │
│ 0xFF 0xFE 0x00 0x00 │ UTF-32-LE │ 4 bytes │
│ 0x00 0x00 0xFE 0xFF │ UTF-32-BE │ 4 bytes │
│ 0xFF 0xFE │ UTF-16-LE │ 2 bytes │
│ 0xFE 0xFF │ UTF-16-BE │ 2 bytes │
└─────────────────────┴──────────────────────┴─────────┘
Order matters! UTF-32-LE starts with 0xFF 0xFE (same as UTF-16-LE)
so the 4-byte checks must come before 2-byte checks.
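A minimal standalone sketch of this check (not chardet's actual code) just tries the table's prefixes longest-first:

```python
# BOM sniffing: longest prefixes first, so UTF-32-LE is tested
# before its UTF-16-LE prefix can match.
BOMS = [
    (b'\xff\xfe\x00\x00', 'UTF-32-LE'),
    (b'\x00\x00\xfe\xff', 'UTF-32-BE'),
    (b'\xef\xbb\xbf',     'UTF-8-SIG'),
    (b'\xff\xfe',         'UTF-16-LE'),
    (b'\xfe\xff',         'UTF-16-BE'),
]

def sniff_bom(data):
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None

assert sniff_bom(b'\xef\xbb\xbfHello') == 'UTF-8-SIG'
assert sniff_bom('你好'.encode('utf-16')) == 'UTF-16-LE'
assert sniff_bom(b'Hello, world') is None
```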
Tracing: UTF-8 with BOM
Input: b'\xef\xbb\xbfHello'
^^^^^^^^^ BOM ^^^^^ content
detector.feed(b'\xef\xbb\xbfHello')
│
├─ _got_data = False (first call)
│ └─ Set _got_data = True
│ └─ Enter BOM detection
│
├─ Check 4-byte BOMs first (need at least 4 bytes, we have 8):
│ ├─ data[:4] == b'\xff\xfe\x00\x00'? → No (UTF-32-LE)
│ ├─ data[:4] == b'\x00\x00\xfe\xff'? → No (UTF-32-BE)
│ └─ No 4-byte BOM match
│
├─ Check 3-byte BOMs:
│ ├─ data[:3] == b'\xef\xbb\xbf'? → YES! UTF-8-SIG
│ └─ Set self._detected_encoding = 'UTF-8-SIG'
│ └─ Set self.done = True
│ └─ Return immediately (no probers needed!)
│
└─ detector.close()
└─ BOM was detected → result comes from _detected_encoding
└─ self.result = {
'encoding': 'UTF-8-SIG',
'confidence': 1.0, ← BOM detection is always 100%
'language': ''
}
Tracing: UTF-16-LE with BOM
Input: '你好'.encode('utf-16')
= b'\xff\xfe`O}Y'
UTF-16-LE BOM, then content
detector.feed(data)
│
├─ First call, enter BOM detection
│
├─ Check 4-byte BOMs:
│ ├─ data[:4] = b'\xff\xfe\x60\x4f'
│ ├─ == b'\xff\xfe\x00\x00'? → No (3rd byte is 0x60, not 0x00)
│ └─ No 4-byte match
│
├─ Check 3-byte BOMs:
│ └─ No 3-byte match
│
├─ Check 2-byte BOMs:
│ ├─ data[:2] == b'\xff\xfe'? → YES! UTF-16-LE
│ └─ Set self._detected_encoding = 'UTF-16-LE'
│ └─ Set self.done = True
│ └─ Return immediately
│
└─ result = {'encoding': 'UTF-16-LE', 'confidence': 1.0, 'language': ''}
No BOM Found
Input: b'Hello, world'
detector.feed(b'Hello, world')
│
├─ Check 4-byte BOMs: b'Hell' → No match
├─ Check 3-byte BOMs: b'Hel' → No match
├─ Check 2-byte BOMs: b'He' → No match
│
├─ No BOM found
│ └─ Fall through to byte scanning and prober logic
Escape-Based Detection
Escape-based probers handle encodings that use escape sequences to switch between character sets. These are primarily ISO-2022 family encodings.
Background: How ISO-2022 Works
Normal ASCII text here → ESC $ B → 日本語のテキスト → ESC ( B → back to ASCII
^^^^^^^^ ^^^^^^^^
"switch to JIS X 0208" "switch back to ASCII"
Common escape sequences:
┌──────────────────────┬─────────────────────────┬───────────────┐
│ Escape Sequence │ Meaning │ Encoding │
├──────────────────────┼─────────────────────────┼───────────────┤
│ ESC ( B │ Switch to ASCII │ (all) │
│ ESC $ B │ Switch to JIS X 0208 │ ISO-2022-JP │
│ ESC $ @ │ Switch to JIS C 6226 │ ISO-2022-JP │
│ ESC $ ) C           │ Switch to KSC 5601      │ ISO-2022-KR   │
│ ESC $ ( D │ Switch to JIS X 0212 │ ISO-2022-JP │
│ ESC $ ) A │ Switch to GB 2312 │ ISO-2022-CN │
│ ESC $ ) G │ Switch to CNS 11643-1 │ ISO-2022-CN │
│ ESC $ * H │ Switch to CNS 11643-2 │ ISO-2022-CN │
│ ESC ( J │ Switch to JIS X 0201 │ ISO-2022-JP │
└──────────────────────┴─────────────────────────┴───────────────┘
Key: ESC = 0x1B
How Escape Detection is Triggered
detector.feed(byte_string)
│
├─ BOM check (no match, continue)
│
├─ Byte scanning loop:
│ for each byte in input:
│ │
│ │ if byte == 0x1B (ESC) or byte == 0x7E (~):
│ │ input_state = ESC_ASCII
│ │ ─── this activates the escape prober path ───
│ │
│ │ elif byte >= 0x80:
│ │ input_state = HIGH_BYTE
│ │
│ │ else:
│ │ (stays as PURE_ASCII if no high bytes seen yet)
EscCharSetProber Architecture
EscCharSetProber
│
├─ Contains a list of CodingStateMachines, one per escape encoding:
│ ├─ CodingStateMachine(HZ_SM_MODEL) → HZ-GB-2312
│ ├─ CodingStateMachine(ISO2022CN_SM_MODEL) → ISO-2022-CN
│ ├─ CodingStateMachine(ISO2022JP_SM_MODEL) → ISO-2022-JP
│ └─ CodingStateMachine(ISO2022KR_SM_MODEL) → ISO-2022-KR
│
├─ active: bool (starts True, False if all machines hit ERROR)
│
└─ feed(data):
└─ for each byte:
└─ feed byte to each active state machine
└─ if any returns ITS_ME → we found the encoding!
└─ if any returns ERROR → deactivate that machine
└─ if all hit ERROR → prober state = NOT_ME
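A heavily simplified reduction of this idea — scanning for the escape prefixes from the table above instead of running per-byte state machines — looks like this (a toy, not chardet's implementation):

```python
# Toy escape-sequence detection over the prefixes listed above.
CANDIDATES = {
    'ISO-2022-JP': [b'\x1b(B', b'\x1b$B', b'\x1b$@', b'\x1b(J'],
    'ISO-2022-KR': [b'\x1b$)C'],
    'HZ-GB-2312':  [b'~{', b'~}'],
}

def detect_escape(data):
    for encoding, sequences in CANDIDATES.items():
        if any(seq in data for seq in sequences):
            return encoding
    return None

assert detect_escape(b'Hello \x1b$B$3$s$K$A$O\x1b(B world') == 'ISO-2022-JP'
assert detect_escape(b'Hello ~{\xc4\xe3\xba\xc3~} world') == 'HZ-GB-2312'
assert detect_escape(b'plain ASCII') is None
```

The real prober is incremental and eliminates candidates byte by byte; this sketch only captures the "first distinctive escape sequence wins" behavior.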
Tracing: ISO-2022-JP Detection
Input: b'Hello \x1b$B$3$s$K$A$O\x1b(B world'
^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^
ASCII ESC$B=JIS こんにちは ESC(B=ASCII
detector.feed(data)
│
├─ BOM check → no match
│
├─ Byte scan:
│ 'H','e','l','l','o',' ' → all < 0x80, still PURE_ASCII
│ 0x1B (ESC) → input_state = ESC_ASCII!
│ └─ Activate EscCharSetProber
│
├─ EscCharSetProber.feed(data):
│ │
│ │ Byte 0x1B (ESC):
│ │ ├─ HZ_SM: ESC is not part of HZ protocol (HZ uses ~{ and ~})
│ │ ├─ ISO2022CN: ESC → intermediate "saw escape" state
│ │ ├─ ISO2022JP: ESC → intermediate "saw escape" state
│ │ └─ ISO2022KR: ESC → intermediate "saw escape" state
│ │
│ │ Byte '$' (0x24):
│ │ ├─ HZ_SM: ESC $ is not HZ → ERROR (deactivated)
│ │ ├─ ISO2022CN: ESC $ → intermediate state (valid so far)
│ │ ├─ ISO2022JP: ESC $ → intermediate state (valid so far)
│ │ └─ ISO2022KR: ESC $ → intermediate state (valid so far)
│ │
│ │ Byte 'B' (0x42):
│ │ ├─ ISO2022CN: ESC $ B is not a valid CN sequence → ERROR
│ │ ├─ ISO2022JP: ESC $ B = "switch to JIS X 0208"
│ │ │ → ITS_ME!!! MATCH FOUND
│ │ └─ ISO2022KR: ESC $ B is not valid KR → ERROR
│ │
│ │ ISO2022JP returned ITS_ME!
│ │ └─ Prober state = FOUND_IT
│ │ └─ detected_charset = 'ISO-2022-JP'
│ │ └─ Stop processing
│ │
│ └─ Return FOUND_IT
│
├─ detector.done = True (short-circuit)
│
└─ result = {'encoding': 'ISO-2022-JP', 'confidence': 0.99, 'language': 'Japanese'}
Tracing: HZ-GB-2312 Detection
HZ doesn’t use ESC — it uses ~ (tilde) as its escape character.
HZ-GB-2312 protocol:
~{ = switch to GB2312 mode (two-byte characters)
~} = switch back to ASCII mode
~~ = literal tilde
Input: b'Hello ~{\xc4\xe3\xba\xc3~} world'
detector.feed(data)
│
├─ Byte scan:
│ 'H','e','l','l','o',' ' → PURE_ASCII
│ '~' (0x7E) → input_state = ESC_ASCII!
│ └─ The tilde triggers escape detection
│
├─ EscCharSetProber.feed(data):
│ │
│ │ Byte '~' (0x7E):
│ │ ├─ HZ_SM: ~ → intermediate "saw tilde" state
│ │ ├─ ISO2022JP: ~ has no meaning → ERROR (deactivated)
│ │ ├─ ISO2022CN: ~ has no meaning → ERROR (deactivated)
│ │ └─ ISO2022KR: ~ has no meaning → ERROR (deactivated)
│ │
│ │ Byte '{' (0x7B):
│ │ ├─ HZ_SM: ~{ = "enter GB mode" → ITS_ME!!!
│ │
│ │ detected_charset = 'HZ-GB-2312'
│ │ FOUND_IT
│ │
│ └─ Return FOUND_IT
│
└─ result = {'encoding': 'HZ-GB-2312', 'confidence': 0.99, 'language': 'Chinese'}
Tracing: ISO-2022-KR Detection
ISO-2022-KR uses:
ESC $ ) C = designate KSC 5601 (Korean) character set
SO (0x0E) = shift out (switch to Korean)
SI (0x0F) = shift in (switch back to ASCII)
Input: b'\x1b$)C\x0e Korean chars \x0f ASCII'
EscCharSetProber.feed(data):
│
│ Byte 0x1B (ESC): all ISO-2022 machines → "saw escape" state
│ Byte '$' (0x24): all → valid intermediate
│ Byte ')' (0x29): all → valid intermediate
│
│ Byte 'C' (0x43):
│ ├─ ISO2022JP: ESC $ ) C → not valid JP (expects D) → ERROR
│ ├─ ISO2022CN: ESC $ ) C → not valid CN (expects A or G) → ERROR
│ └─ ISO2022KR: ESC $ ) C → "designate KSC 5601" → ITS_ME!!!
│
│ detected_charset = 'ISO-2022-KR'
│
└─ result = {'encoding': 'ISO-2022-KR', 'confidence': 0.99, 'language': 'Korean'}
Specialty Probers
UTF1632Prober
Detects UTF-16 and UTF-32 without a BOM by analyzing statistical patterns of null bytes.
Key Insight
Most text is in the Basic Multilingual Plane and uses common characters:
In UTF-16-LE, ASCII text looks like:
'H' = 0x48 0x00
'e' = 0x65 0x00
→ Every other byte is 0x00
In UTF-16-BE, ASCII text looks like:
'H' = 0x00 0x48
'e' = 0x00 0x65
→ Every other byte is 0x00, offset by 1
In UTF-32-LE, ASCII text:
'H' = 0x48 0x00 0x00 0x00
→ 3 out of every 4 bytes are 0x00
In UTF-32-BE, ASCII text:
'H' = 0x00 0x00 0x00 0x48
→ 3 out of every 4 bytes are 0x00
Detection Strategy
UTF1632Prober.feed(data)
│
├─ Track null byte counts at each position modulo 4:
│ position[0]: count of null bytes at index 0, 4, 8, 12...
│ position[1]: count of null bytes at index 1, 5, 9, 13...
│ position[2]: count of null bytes at index 2, 6, 10, 14...
│ position[3]: count of null bytes at index 3, 7, 11, 15...
│
├─ On get_confidence():
│
│ UTF-32-LE: positions 1,2,3 almost all null, position 0 rarely null
│ UTF-32-BE: positions 0,1,2 almost all null, position 3 rarely null
│ UTF-16-LE: odd positions mostly null (must rule out UTF-32 first)
│ UTF-16-BE: even positions mostly null (must rule out UTF-32 first)
Tracing: BOM-less UTF-16-LE
Input: 'Hello'.encode('utf-16-le')
= b'\x48\x00\x65\x00\x6c\x00\x6c\x00\x6f\x00'
UTF1632Prober.feed(data):
│
│ Index 0: 0x48 (H) → position 0 mod 4 = 0 → not null
│ Index 1: 0x00 → position 1 mod 4 = 1 → NULL
│ Index 2: 0x65 (e) → position 2 mod 4 = 2 → not null
│ Index 3: 0x00 → position 3 mod 4 = 3 → NULL
│ Index 4: 0x6C (l) → position 0 mod 4 = 0 → not null
│ Index 5: 0x00 → position 1 mod 4 = 1 → NULL
│ ...
│
│ Null distribution:
│ Position 0: 0% nulls
│ Position 1: 100% nulls
│ Position 2: 0% nulls
│ Position 3: 100% nulls
│
│ Analysis:
│ Position 2 is NOT null → rules out UTF-32-LE
│ Odd positions (1,3) are all null → UTF-16-LE pattern
│ → High confidence UTF-16-LE (~0.95+)
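The null-position bookkeeping is simple enough to reproduce directly (a sketch of the idea, not chardet's UTF1632Prober):

```python
def null_positions(data):
    # Fraction of null bytes at each index modulo 4
    counts = [0, 0, 0, 0]
    totals = [0, 0, 0, 0]
    for i, b in enumerate(data):
        totals[i % 4] += 1
        if b == 0:
            counts[i % 4] += 1
    return [c / t if t else 0.0 for c, t in zip(counts, totals)]

# UTF-16-LE ASCII: odd positions are all null
assert null_positions('Hello'.encode('utf-16-le')) == [0.0, 1.0, 0.0, 1.0]

# UTF-32-LE ASCII: positions 1, 2, 3 are all null
assert null_positions('Hi'.encode('utf-32-le')) == [0.0, 1.0, 1.0, 1.0]
```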
Latin1Prober
Latin-1 (ISO-8859-1) gets special treatment because it’s the most common single-byte encoding for Western European languages, and it’s a frequent “default guess.”
Why Latin-1 is Special-Cased
Latin-1 maps ALL byte values 0x00-0xFF to some character.
There are NO invalid byte sequences.
Every possible byte string is "valid" Latin-1.
This means:
- A state machine approach is useless (never errors)
- Standard frequency analysis is unreliable
- It can't be ruled out by structure alone
So instead of asking "is this valid Latin-1?" (always yes),
the prober asks "does this LOOK LIKE natural text in Latin-1?"
Byte Classification Scheme
┌────────────────┬──────────┬───────────────────────────────┐
│ Byte Range │ Class │ Meaning │
├────────────────┼──────────┼───────────────────────────────┤
│ 0x00-0x1F │ CONTROL │ Control characters │
│ 0x20-0x7F │ ASCII │ Standard ASCII │
│ 0x80-0x9F │ CONTROL │ C1 control chars │
│ │ │ (undefined in Latin-1, │
│ │ │ used in Windows-1252) │
│ 0xA0-0xBF │ COMMON │ Common accented range │
│ 0xC0-0xDF │ UPPER │ Uppercase accented │
│ 0xE0-0xFF │ LOWER │ Lowercase accented │
└────────────────┴──────────┴───────────────────────────────┘
Confidence Heuristic
Natural text in Latin-1 has expected patterns:
- Mostly ASCII
- Lowercase accented > uppercase accented (most text is lowercase)
- Few or no C1 control characters (if many C1 bytes → probably Windows-1252)
The prober intentionally returns moderate confidence (not high) so that more specific probers win when they have a strong match. Latin-1 is the “if nothing else works well, it’s probably this” fallback.
Tracing: “café résumé” in Latin-1
Input: 'café résumé'.encode('latin-1')
= b'caf\xe9 r\xe9sum\xe9'
Latin1Prober.feed(data):
│
├─ 'c' (0x63) → ASCII
├─ 'a' (0x61) → ASCII
├─ 'f' (0x66) → ASCII
├─ 'é' (0xE9) → LOWER (lowercase accented)
├─ ' ' (0x20) → ASCII
├─ 'r' (0x72) → ASCII
├─ 'é' (0xE9) → LOWER
├─ 's' (0x73) → ASCII
├─ 'u' (0x75) → ASCII
├─ 'm' (0x6D) → ASCII
├─ 'é' (0xE9) → LOWER
│
├─ Tally: ASCII=8, LOWER=3, UPPER=0, COMMON=0, CONTROL=0
│
├─ get_confidence():
│ ├─ Has accented characters → prober is relevant
│ ├─ lowercase accented > uppercase accented (natural pattern)
│ ├─ No C1 control chars (not Windows-1252 junk)
│ └─ Confidence ≈ 0.5-0.7
Latin-1 vs Windows-1252
Input: b'smart \x93quotes\x94'
(Windows-1252 smart quotes: 0x93 = “ left double quote, 0x94 = ” right double quote)
Latin1Prober.feed(data):
│
├─ 0x93 → CONTROL (C1 range: 0x80-0x9F)
├─ 0x94 → CONTROL (C1 range)
│
├─ get_confidence():
│ ├─ Natural Latin-1 text almost never has C1 controls
│ ├─ Suggests Windows-1252 instead
│ └─ Returns LOW confidence (defers to Windows-1252 prober)
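Python's codecs make the ambiguity concrete: the same bytes decode either way, but only Windows-1252 yields the intended punctuation, while Latin-1 produces invisible C1 control characters:

```python
data = b'smart \x93quotes\x94'

# Windows-1252 maps 0x93/0x94 to curly quotation marks:
assert data.decode('windows-1252') == 'smart \u201cquotes\u201d'

# Latin-1 maps them to C1 control characters (U+0093/U+0094):
assert data.decode('latin-1') == 'smart \x93quotes\x94'
```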
UTF8Prober
UTF-8 has its own prober because UTF-8 detection is unique — it doesn’t need distribution analysis.
Why No Distribution Analysis
Unlike GB2312 vs EUC-KR vs SJIS (which share byte ranges),
UTF-8 has a very distinctive structure:
1. Strict byte patterns: lead bytes and continuation bytes
have non-overlapping bit patterns
2. Self-synchronizing: you can identify character boundaries
from any position in the stream
3. No valid random byte stream: random bytes have only ~1/8
chance of being valid continuation bytes after a lead byte
UTF-8 Encoding Rules
┌────────────────────┬────────────────────────────────────┐
│ Code point range │ Byte pattern │
├────────────────────┼────────────────────────────────────┤
│ U+0000 - U+007F │ 0xxxxxxx │
│ U+0080 - U+07FF │ 110xxxxx 10xxxxxx │
│ U+0800 - U+FFFF │ 1110xxxx 10xxxxxx 10xxxxxx │
│ U+10000 - U+10FFFF │ 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx│
└────────────────────┴────────────────────────────────────┘
Byte classes:
0x00-0x7F → ASCII (single byte, complete character)
0x80-0xBF → Continuation byte (10xxxxxx)
0xC0-0xC1 → Invalid (overlong encoding)
0xC2-0xDF → 2-byte lead (110xxxxx)
0xE0-0xEF → 3-byte lead (1110xxxx)
0xF0-0xF4 → 4-byte lead (11110xxx)
0xF5-0xFF → Invalid (would encode > U+10FFFF)
State Machine
ASCII CONT INVLD 2-LEAD 3-LEAD 4-LEAD
START → ITS_ME ERROR ERROR ME_ONE ME_TWO ME_THREE
ME_ONE (need 1)→ ERROR ITS_ME ERROR ERROR ERROR ERROR
ME_TWO (need 2)→ ERROR ME_ONE ERROR ERROR ERROR ERROR
ME_THREE(need3)→ ERROR ME_TWO ERROR ERROR ERROR ERROR
Confidence Calculation
UTF8Prober
├─ coding_sm: CodingStateMachine
├─ num_mb_chars: count of multi-byte characters found
│
└─ get_confidence():
├─ If state machine ever hit ERROR → 0.0
├─ Otherwise:
│ ONE_CHAR_PROB = 0.5
│ confidence = 1.0 - (ONE_CHAR_PROB ^ num_mb_chars)
│
│ 1 mb char: 1 - 0.5 = 0.50
│ 2 mb chars: 1 - 0.25 = 0.75
│ 3 mb chars: 1 - 0.125 = 0.875
│ 5 mb chars: 1 - 0.031 = 0.969
│ 10 mb chars: 1 - 0.001 = 0.999
│
└─ Represents: "probability that n valid multi-byte sequences
all happened to be valid UTF-8 by coincidence"
Tracing: Mixed ASCII and Multi-byte
Input: 'Hello 世界'.encode('utf-8')
= b'Hello \xe4\xb8\x96\xe7\x95\x8c'
UTF8Prober.feed(data):
│
├─ 'H','e','l','l','o',' ' → ASCII → ITS_ME (single byte chars)
│
├─ 0xE4: 3-byte lead → START→ME_TWO
├─ 0xB8: continuation → ME_TWO→ME_ONE
├─ 0x96: continuation → ME_ONE→ITS_ME (complete!) → num_mb_chars=1
│
├─ 0xE7: 3-byte lead → START→ME_TWO
├─ 0x95: continuation → ME_TWO→ME_ONE
├─ 0x8C: continuation → ME_ONE→ITS_ME (complete!) → num_mb_chars=2
│
├─ No errors → structurally valid
│
└─ confidence = 1.0 - (0.5 ^ 2) = 0.75
HebrewProber
Hebrew gets special handling because of bidirectional text complexity.
HebrewProber
│
├─ Problem: Hebrew can be stored in two byte orders:
│ "Logical" order: characters in reading order (right-to-left)
│ "Visual" order: characters in display order (left-to-right)
│ Both use the same encoding but the byte sequence is REVERSED
│
├─ Strategy: Works with TWO SingleByteCharSetProbers:
│ 1. Win-1255 Logical Hebrew model
│ 2. ISO-8859-8 Visual Hebrew model
│
├─ HebrewProber acts as a "meta-prober":
│ │
│ ├─ Looks at word-final vs word-non-final letter forms
│ │
│ │ Hebrew has 5 letters with special final forms:
│ │ ך (final kaf) vs כ (non-final kaf)
│ │ ם (final mem) vs מ (non-final mem)
│ │ ן (final nun) vs נ (non-final nun)
│ │ ף (final pe) vs פ (non-final pe)
│ │ ץ (final tsade) vs צ (non-final tsade)
│ │
│ │ In correctly ordered text:
│ │ Final forms appear before spaces/punctuation
│ │ Non-final forms appear before other letters
│ │
│ │ In reverse-ordered text:
│ │ Final forms appear after spaces (wrong position)
│ │
│ ├─ Counts: final_in_correct_position, final_in_wrong_position
│ │
│ └─ Uses counts to determine logical vs visual
│ If correct > wrong → logical order
│ If wrong > correct → visual order
│
└─ get_confidence(): returns confidence from the winning sub-prober
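The final-letter heuristic can be sketched on decoded text (chardet's real HebrewProber works on undecoded Windows-1255/ISO-8859-8 bytes, so this is only an analogy):

```python
# Toy logical-vs-visual check using Hebrew final letter forms.
FINAL_FORMS = set('ךםןףץ')

def looks_logical(text):
    correct = wrong = 0
    for i, ch in enumerate(text):
        if ch in FINAL_FORMS:
            # In logical order, a final form should end its word
            at_word_end = i + 1 == len(text) or not text[i + 1].isalpha()
            if at_word_end:
                correct += 1
            else:
                wrong += 1
    return correct >= wrong   # ties default to logical

logical = 'שלום עולם'   # "hello world" in reading order
visual = logical[::-1]   # same characters stored in display order

assert looks_logical(logical)
assert not looks_logical(visual)
```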
Japanese Context Analysis (SJIS/EUC-JP)
SJIS and EUC-JP get additional analysis beyond standard distribution.
SJISContextAnalysis / EUCJPContextAnalysis
│
├─ Problem: SJIS and EUC-JP often have similar confidence
│ from distribution analysis alone
│
├─ Extra signal: Hiragana character usage patterns
│ Japanese text frequently uses hiragana particles
│ (は、が、の、に、を、etc.)
│
│ SJIS and EUC-JP encode hiragana at different byte positions:
│ 'の' in SJIS: 0x82 0xCC
│ 'の' in EUC-JP: 0xA4 0xCE
│
│ The context analyzer checks whether the decoded hiragana
│ makes sense in Japanese text
│
└─ Provides additional confidence boost for SJIS vs EUC-JP disambiguation
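The byte-position claim is easy to verify with Python's built-in codecs:

```python
# The same hiragana occupy different byte ranges in the two encodings,
# which is the extra signal the context analyzers exploit.
no = "の"   # the common genitive particle

print(no.encode("shift_jis"))   # b'\x82\xcc' -- SJIS hiragana lead byte 0x82
print(no.encode("euc_jp"))      # b'\xa4\xce' -- EUC-JP hiragana lead byte 0xA4

# Particle-heavy Japanese text therefore shows many 0xA4 xx pairs when
# it is EUC-JP, and many 0x82 xx pairs when it is SJIS.
particles = "はがのにを"
print([ch.encode("euc_jp")[0] for ch in particles])
```

A stream rich in one lead-byte pattern but not the other is strong evidence for the corresponding encoding, even when raw byte distribution looks similar.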
SingleByteCharSetProber: In Depth
Model Structure
SingleByteCharSetModel
│
├─ char_to_order_table[256]:
│ Maps each byte value to a frequency order
│ ┌──────────────────────────────────────────────────┐
│ │ Order 0-63: Most frequent 64 characters │
│ │ (covers ~90%+ of typical text) │
│ │ Order 64-253: Less frequent characters │
│ │ Order 254: Character not expected in encoding │
│ │ Order 255: Undefined/unused byte │
│ └──────────────────────────────────────────────────┘
│
│ Example: Windows-1251 Russian
│ 'о' (0xEE) → order 0 (most frequent Russian letter)
│ 'е' (0xE5) → order 1
│ 'а' (0xE0) → order 2
│ 'и' (0xE8) → order 3
│ 'н' (0xED) → order 4
│ 'т' (0xF2) → order 5
│ ...
│ 'ъ' (0xFA) → order 31 (rare hard sign)
│ 'ё' (0xB8) → order 32 (rare yo)
│
├─ precedence_matrix[64][64]:
│ Bigram frequency categories for the top 64 characters
│
│ For each pair (char_a, char_b):
│ ┌───────────────────────────────────────────┐
│ │ 0 = NEGATIVE : pair almost never occurs │
│ │ 1 = UNLIKELY : pair is uncommon │
│ │ 2 = LIKELY : pair occurs sometimes │
│ │ 3 = POSITIVE : pair is very natural │
│ └───────────────────────────────────────────┘
│
│ Example: Russian bigrams
│ 'с' → 'т' : POSITIVE (ст is very common)
│ 'т' → 'о' : POSITIVE (то is common)
│ 'ъ' → 'ъ' : NEGATIVE (double hard sign never happens)
│
└─ typical_positive_ratio: float
Expected ratio of POSITIVE pairs in natural text
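The structure above can be sketched as a small class. Field names are approximations of chardet's, not authoritative definitions; the real models are large generated tables:

```python
from dataclasses import dataclass

# Hypothetical mirror of the model layout described above.
LIKELY_ORDERS = 64                 # only the top 64 chars get bigram stats
NEGATIVE, UNLIKELY, LIKELY, POSITIVE = 0, 1, 2, 3

@dataclass
class SBCSModel:
    charset_name: str
    language: str
    char_to_order: list[int]       # 256 entries: byte -> frequency order
    precedence: list[list[int]]    # 64x64 bigram categories
    typical_positive_ratio: float

    def bigram_category(self, byte_a: int, byte_b: int) -> int:
        a = self.char_to_order[byte_a]
        b = self.char_to_order[byte_b]
        if a >= LIKELY_ORDERS or b >= LIKELY_ORDERS:
            return NEGATIVE        # simplification: chardet handles these pairs differently
        return self.precedence[a][b]

# Tiny toy model: pretend 0xEE ('о') and 0xE5 ('е') are orders 0 and 1,
# and that the pair order-0 -> order-1 is POSITIVE.
orders = [255] * 256
orders[0xEE], orders[0xE5] = 0, 1
matrix = [[NEGATIVE] * 64 for _ in range(64)]
matrix[0][1] = POSITIVE
toy = SBCSModel("windows-1251", "Russian", orders, matrix, 0.95)
print(toy.bigram_category(0xEE, 0xE5))   # 3 (POSITIVE)
```

Feeding text then reduces to walking byte pairs through `bigram_category` and tallying the four counters, which is what the trace below does by hand.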
Tracing: “Привет” (Russian) in Windows-1251
Input: 'Привет'.encode('windows-1251')
= b'\xcf\xf0\xe8\xe2\xe5\xf2'
SingleByteCharSetProber (Win-1251 / Russian).feed(data):
│
├─ Byte 0xCF (П): order ~20, first char, no bigram yet
├─ Byte 0xF0 (р): order ~8
│ precedence_matrix[20][8] → POSITIVE (Пр is natural)
│ seq_counters[POSITIVE] += 1
├─ Byte 0xE8 (и): order ~3
│ precedence_matrix[8][3] → POSITIVE (ри is common)
│ seq_counters[POSITIVE] += 1
├─ Byte 0xE2 (в): order ~7
│ precedence_matrix[3][7] → POSITIVE (ив is common)
│ seq_counters[POSITIVE] += 1
├─ Byte 0xE5 (е): order ~1
│ precedence_matrix[7][1] → POSITIVE (ве is natural)
│ seq_counters[POSITIVE] += 1
├─ Byte 0xF2 (т): order ~5
│ precedence_matrix[1][5] → POSITIVE (ет is natural)
│ seq_counters[POSITIVE] += 1
│
├─ Results: 5 sequences, all POSITIVE
│
└─ get_confidence():
positive_ratio = 5/5 = 1.0
typical_positive_ratio ≈ 0.976
confidence ≈ 0.5 (scaled conservatively)
Same Bytes Through KOI8-R
Same bytes decoded as KOI8-R produce different letters (KOI8-R puts
lowercase Cyrillic in 0xC0-0xDF and uppercase in 0xE0-0xFF, roughly the
reverse of Windows-1251):
0xCF = о, 0xF0 = П, 0xE8 = Х, 0xE2 = Б, 0xE5 = Е, 0xF2 = Р
Reads as "оПХБЕР" (case-jumbled nonsense)
KOI8-R SingleByteCharSetProber:
│
├─ Bigram analysis (case-folded; categories illustrative):
│     п→х : UNLIKELY
│     х→б : NEGATIVE
│     ...
│
├─ Improbable pairs → few or no POSITIVE sequences
│
└─ confidence ≈ 0.05
WINNER: Windows-1251 at ~0.5 beats KOI8-R at ~0.05
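Both decodings can be checked with Python's built-in codecs; chardet's job is essentially to decide which decoding reads like real Russian:

```python
# Same six bytes, two decodings. Windows-1251 yields a real word;
# KOI8-R (whose lowercase and uppercase halves sit in swapped ranges
# relative to Windows-1251) yields case-jumbled gibberish.
data = b"\xcf\xf0\xe8\xe2\xe5\xf2"

print(data.decode("cp1251"))    # Привет ("Hello")
print(data.decode("koi8_r"))    # оПХБЕР (nonsense)
```

The bigram model formalizes the intuition that "Привет" is plausible Russian and "оПХБЕР" is not.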
SBCSGroupProber: Prober List
The SBCSGroupProber manages many probers. Note that the same encoding can appear multiple times with different language models:
- Windows-1251 / Russian
- Windows-1251 / Bulgarian
- Windows-1251 / Macedonian
- KOI8-R / Russian
- ISO-8859-5 / Russian
- ISO-8859-5 / Bulgarian
- MacCyrillic / Russian
- IBM866 / Russian
- IBM855 / Russian
- ISO-8859-7 / Greek
- Windows-1253 / Greek
- Windows-1256 / Arabic
- Windows-1255 / Hebrew (logical)
- Windows-1255 / Hebrew (visual)
- TIS-620 / Thai
- ISO-8859-9 / Turkish
- And more…
MBCSGroupProber: Full Child List
MBCSGroupProber
├─ UTF8Prober [state machine + count]
├─ SJISProber [state machine + distribution + context]
├─ EUCJPProber [state machine + distribution + context]
├─ GB2312Prober [state machine + distribution]
├─ EUCKRProber [state machine + distribution]
├─ Big5Prober [state machine + distribution]
├─ EUCTWProber [state machine + distribution]
└─ UTF1632Prober [null-byte pattern analysis]
LanguageFilter System
Bit flags that constrain which encodings the detector considers:
┌─────────────────────┬───────┬──────────────────────────────┐
│ Flag │ Value │ Encodings it enables │
├─────────────────────┼───────┼──────────────────────────────┤
│ CHINESE_SIMPLIFIED │ 0x01 │ GB2312, GB18030, HZ-GB-2312, │
│ │ │ ISO-2022-CN │
│ CHINESE_TRADITIONAL │ 0x02 │ Big5, EUC-TW │
│ JAPANESE │ 0x04 │ SJIS, EUC-JP, ISO-2022-JP │
│ KOREAN │ 0x08 │ EUC-KR, ISO-2022-KR │
│ NON_CJK │ 0x10 │ All single-byte encodings │
│ ALL │ 0x1F │ Everything (default) │
│ CHINESE │ 0x03 │ SIMPLIFIED | TRADITIONAL │
│ CJK │ 0x0F │ CHINESE | JAPANESE | KOREAN │
└─────────────────────┴───────┴──────────────────────────────┘
How Filtering Works
UniversalDetector(lang_filter=LanguageFilter.JAPANESE)
│
├─ MBCSGroupProber creation:
│ ├─ UTF8Prober → always included (language-neutral)
│ ├─ SJISProber → JAPANESE & 0x04 = ✓ INCLUDE
│ ├─ EUCJPProber → JAPANESE & 0x04 = ✓ INCLUDE
│ ├─ GB2312Prober → CHINESE_SIMPLIFIED & 0x04 = 0 ✗ SKIP
│ ├─ EUCKRProber → KOREAN & 0x04 = 0 ✗ SKIP
│ ├─ Big5Prober → CHINESE_TRADITIONAL & 0x04 = 0 ✗ SKIP
│ └─ UTF1632Prober → always included (language-neutral)
│
├─ SBCSGroupProber: NON_CJK & 0x04 = 0 ✗ SKIP ALL
│
└─ EscCharSetProber:
├─ ISO2022JP → ✓ INCLUDE
├─ ISO2022CN → ✗ SKIP
├─ ISO2022KR → ✗ SKIP
└─ HZ → ✗ SKIP
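The include/skip decisions above are plain bit tests. A self-contained sketch using the flag values from the table (treat the values themselves as assumptions):

```python
from enum import IntFlag

# Flag values as listed in the table above; they are assumptions here,
# not verified against chardet.enums.
class LanguageFilter(IntFlag):
    CHINESE_SIMPLIFIED = 0x01
    CHINESE_TRADITIONAL = 0x02
    JAPANESE = 0x04
    KOREAN = 0x08
    NON_CJK = 0x10
    ALL = 0x1F
    CHINESE = CHINESE_SIMPLIFIED | CHINESE_TRADITIONAL
    CJK = CHINESE | JAPANESE | KOREAN

def include_prober(prober_filter: LanguageFilter,
                   lang_filter: LanguageFilter) -> bool:
    """A prober is created only if its filter overlaps the requested one."""
    return bool(prober_filter & lang_filter)

lf = LanguageFilter.JAPANESE
print(include_prober(LanguageFilter.JAPANESE, lf))            # True  (SJIS/EUC-JP)
print(include_prober(LanguageFilter.CHINESE_SIMPLIFIED, lf))  # False (GB2312)
print(include_prober(LanguageFilter.NON_CJK, lf))             # False (all SBCS)
```

The composite flags fall out of the bit layout for free: CHINESE is 0x03 and CJK is 0x0F, exactly as the table shows.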
Usage
from chardet.universaldetector import UniversalDetector
from chardet.enums import LanguageFilter
detector = UniversalDetector(lang_filter=LanguageFilter.JAPANESE)
detector = UniversalDetector(lang_filter=LanguageFilter.NON_CJK)
detector = UniversalDetector(
lang_filter=LanguageFilter.JAPANESE | LanguageFilter.KOREAN
)
CLI Tool: chardetect
chardet ships with a command-line tool for detecting file encodings.
Invocation
$ chardetect # as console_scripts entry point
$ python -m chardet # as module
Architecture
chardetect CLI
│
├─ Entry point: chardet.cli.chardetect module
│ └─ main() function, registered as console_scripts
│
├─ Argument parsing (argparse):
│ ├─ positional: files (one or more paths, or stdin if none)
│ ├─ --version: show chardet version
│ └─ -m / --minimal: output just the encoding name
│
└─ Processing flow:
├─ For each input file (or stdin):
│ ├─ Open in binary mode ('rb')
│ ├─ Create UniversalDetector instance
│ ├─ Read line by line:
│ │ for line in file:
│ │ detector.feed(line)
│ │ if detector.done:
│ │ break ← early exit, doesn't read entire file
│ ├─ detector.close()
│ └─ Output result
│
└─ Output format:
Normal: filename: encoding with confidence value
Minimal: encoding
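The processing flow reduces to a short loop over UniversalDetector. A minimal sketch, not the actual chardetect source; the describe() helper is invented:

```python
import io
from chardet.universaldetector import UniversalDetector

def describe(stream, name: str, minimal: bool = False) -> str:
    """Detect a binary stream's encoding, stopping early when possible."""
    detector = UniversalDetector()
    for line in stream:          # read line by line, not all at once
        detector.feed(line)
        if detector.done:        # a prober hit FOUND_IT: stop reading
            break
    detector.close()
    result = detector.result
    if minimal:
        return str(result["encoding"])
    return f"{name}: {result['encoding']} with confidence {result['confidence']}"

print(describe(io.BytesIO(b"Hello, plain ASCII\n"), "demo.txt"))
```

The early `break` is what lets chardetect skip the tail of large files once a prober is confident.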
Usage Examples
$ chardetect somefile.txt
somefile.txt: utf-8 with confidence 0.99
$ chardetect file1.txt file2.csv file3.html
file1.txt: utf-8 with confidence 0.99
file2.csv: Windows-1252 with confidence 0.73
file3.html: ascii with confidence 1.0
$ chardetect --minimal somefile.txt
utf-8
$ echo -n "some bytes" | chardetect
<stdin>: ascii with confidence 1.0
Full CLI Trace
$ chardetect mixed_chinese.txt
main()
│
├─ Parse args: files=['mixed_chinese.txt'], minimal=False
├─ Open 'mixed_chinese.txt' in binary mode
├─ detector = UniversalDetector()
│
├─ Line 1: b'\xc4\xe3\xba\xc3\n' (你好\n in GB2312)
│ ├─ detector.feed(line)
│ │ ├─ BOM check: no BOM
│ │ ├─ All bytes > 0x80 → HIGH_BYTE
│ │ ├─ Create probers, feed to all
│  │  └─ Only 2 multi-byte chars (4 bytes), not enough for FOUND_IT
│ └─ done = False → continue
│
├─ Line 2: b'\xca\xc0\xbd\xe7\xc4\xe3\xba\xc3\n' (世界你好\n)
│ ├─ detector.feed(line)
│ │ ├─ UTF8Prober: ERROR → NOT_ME
│ │ ├─ GB2312Prober: confidence ~0.8
│ │ └─ Not at FOUND_IT threshold yet
│ └─ done = False → continue
│
├─ Line 3: b'\xd6\xd0\xbb\xaa\xc8\xcb\xc3\xf1...\n' (中华人民共和国\n)
│ ├─ detector.feed(line)
│ │ ├─ GB2312Prober: confidence > 0.95
│ │ └─ Hits FOUND_IT → done = True
│ └─ done = True → break (remaining lines NOT read)
│
├─ detector.close()
│ └─ result = {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
│
└─ Print: "mixed_chinese.txt: GB2312 with confidence 0.99"
detect_all() Internals
How It Differs From detect()
detect() flow through GroupProber:
│ MBCSGroupProber.get_confidence():
│ └─ Returns ONLY the best child's confidence → 0.99 (GB2312)
│ Caller sees one number
detect_all() flow:
│ Reaches INSIDE the group probers:
│ └─ MBCSGroupProber.probers → list of child probers
│ ├─ prober[0] (UTF8): 0.0 (NOT_ME)
│ ├─ prober[1] (GB2312): 0.99
│ ├─ prober[2] (SJIS): 0.05
│ ├─ prober[3] (EUCKR): 0.08
│ └─ ...each one becomes a separate entry in results
When It’s Useful
Input: b'\xc0\xd1' (only 2 bytes)
detect() returns:
{'encoding': 'EUC-KR', 'confidence': 0.35, 'language': 'Korean'}
detect_all() returns:
[
{'encoding': 'EUC-KR', 'confidence': 0.35, 'language': 'Korean'},
{'encoding': 'GB2312', 'confidence': 0.33, 'language': 'Chinese'},
{'encoding': 'Big5', 'confidence': 0.30, 'language': 'Chinese'},
{'encoding': 'ISO-8859-1','confidence': 0.25, 'language': ''},
]
Now you can see all candidates are close — the detector is guessing.
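A quick way to see the difference in practice (exact candidate lists and confidence values vary by chardet version and input):

```python
import chardet

data = "你好世界".encode("gb2312")    # 8 bytes of GB2312 Chinese

best = chardet.detect(data)          # one winner
ranked = chardet.detect_all(data)    # every candidate, best first

print(best["encoding"])
for cand in ranked:
    print(cand["encoding"], round(cand["confidence"], 2))
```

When the top two entries of `ranked` are within a few hundredths of each other, treat the winner as a coin flip rather than a verdict.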
Internal Constants and Thresholds
┌────────────────────────────────┬───────┬──────────────────────────────┐
│ Constant │ Value │ Purpose │
├────────────────────────────────┼───────┼──────────────────────────────┤
│ MINIMUM_THRESHOLD │ 0.20 │ Below this, result = None │
│ SHORTCUT_THRESHOLD │ 0.95 │ Above this, stop early │
│ │ │ (done=True) │
│ ENOUGH_DATA_THRESHOLD │ 1024 │ Bytes needed for reliable │
│ │ │ distribution analysis │
│ SURE_YES │ 0.99 │ Returned for BOM detections │
│ SURE_NO │ 0.01 │ Minimum non-zero confidence │
└────────────────────────────────┴───────┴──────────────────────────────┘
Prober States
┌──────────────┬───────┬─────────────────────────────────────────────┐
│ State │ Value │ Meaning │
├──────────────┼───────┼─────────────────────────────────────────────┤
│ DETECTING │ 0 │ Still gathering data, no conclusion yet │
│ FOUND_IT │ 1 │ Confident match (triggers early termination)│
│ NOT_ME │ 2 │ Ruled out (state machine error or similar) │
└──────────────┴───────┴─────────────────────────────────────────────┘
How Thresholds Interact
feed() loop:
│ if prober.state == FOUND_IT:
│ └─ detector.done = True → stop feeding
│ if prober.state == NOT_ME:
│ └─ Prober deactivated, never fed again
close():
│ best = max(prober.get_confidence() for active probers)
│ if best > MINIMUM_THRESHOLD (0.20):
│ result = that encoding
│ else:
│ result = {'encoding': None, 'confidence': 0.0, 'language': ''}
│ (chardet is saying "I genuinely don't know")
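Stripped of prober management, the close()-time decision is a thresholded argmax. A toy sketch using the constants from the table above (values as remembered, not verified):

```python
# Toy version of the final decision in close(). MINIMUM_THRESHOLD is
# the value from the constants table above.
MINIMUM_THRESHOLD = 0.20

def pick_result(candidates: dict[str, float]) -> dict:
    """candidates maps encoding name -> confidence from its prober."""
    if candidates:
        encoding, confidence = max(candidates.items(), key=lambda kv: kv[1])
        if confidence > MINIMUM_THRESHOLD:
            return {"encoding": encoding, "confidence": confidence}
    return {"encoding": None, "confidence": 0.0}   # "I genuinely don't know"

print(pick_result({"GB2312": 0.99, "Big5": 0.12}))
print(pick_result({"EUC-KR": 0.11, "Big5": 0.08}))  # all below 0.20 -> None
```

The None result is deliberate: a low-confidence wrong answer is worse for callers than an honest "unknown".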
Complete Priority Flow
detector.feed(data)
│
│ ┌─────────────────────────────────────────────────┐
│ │ PRIORITY 1: BOM Detection (first feed() only) │
│ │ Confidence: 1.0 (absolute certainty) │
│ │ If found → done, skip everything else │
│ └──────────────────────┬──────────────────────────┘
│ │ no BOM
│ ┌──────────────────────▼──────────────────────────┐
│ │ CLASSIFY: Scan bytes to determine input type │
│ │ PURE_ASCII: all bytes < 0x80 │
│  │     ESC_ASCII:  saw 0x1B or '~{'                │
│ │ HIGH_BYTE: saw bytes >= 0x80 │
│ └──────┬────────────┬────────────┬────────────────┘
│ │ │ │
│ PURE_ASCII ESC_ASCII HIGH_BYTE
│ │ │ │
│ ┌──────▼──────┐ ┌───▼────────┐ ┌▼──────────────────┐
│ │ Do nothing │ │ PRIORITY 2 │ │ PRIORITY 3 │
│ │ Wait for │ │ EscCharSet │ │ MBCS + SBCS │
│ │ more data │ │ Prober │ │ GroupProbers │
│ │ or close() │ │ │ │ │
│ └──────┬──────┘ └───┬────────┘ └┬──────────────────┘
│ │ │ │
│ ┌──────▼──────────────▼───────────▼──────────────────┐
│ │ detector.close() │
│ │ │
│ │ if PURE_ASCII → result = 'ascii', confidence 1.0 │
│ │ if BOM found → result = BOM encoding, conf 1.0 │
│ │ if prober FOUND_IT → result from that prober │
│ │ else → compare all prober confidences │
│ │ pick highest above minimum threshold (0.2) │
│ │ if none above threshold → result = None │
│ └─────────────────────────────────────────────────────┘
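The CLASSIFY step is a simple byte scan. A sketch of the three-way classification (the `b"(\033|~{)"` pattern matches ESC or the HZ-GB-2312 opener; treat the details as approximate):

```python
import re

PURE_ASCII, ESC_ASCII, HIGH_BYTE = "PURE_ASCII", "ESC_ASCII", "HIGH_BYTE"
ESC_DETECTOR = re.compile(b"(\033|~{)")   # escape sequences or HZ marker

def classify(data: bytes) -> str:
    if any(b >= 0x80 for b in data):
        return HIGH_BYTE        # hand off to MBCS/SBCS group probers
    if ESC_DETECTOR.search(data):
        return ESC_ASCII        # hand off to EscCharSetProber
    return PURE_ASCII           # keep waiting, or fall back to ascii

print(classify(b"Hello world"))           # PURE_ASCII
print(classify(b"\x1b$B"))                # ESC_ASCII
print(classify("中".encode("utf-8")))     # HIGH_BYTE
```

Note the precedence: once any high byte appears, the input is HIGH_BYTE regardless of escape sequences, matching the one-way state transitions in the diagram.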
Complete Prober Hierarchy
UniversalDetector
│
├─ BOM Detection (hardcoded, confidence 1.0)
│ └─ UTF-8-SIG, UTF-16-LE, UTF-16-BE, UTF-32-LE, UTF-32-BE
│
├─ EscCharSetProber (triggered by ESC/~ bytes)
│ ├─ CodingStateMachine(ISO2022JP) → ISO-2022-JP
│ ├─ CodingStateMachine(ISO2022CN) → ISO-2022-CN
│ ├─ CodingStateMachine(ISO2022KR) → ISO-2022-KR
│ └─ CodingStateMachine(HZ) → HZ-GB-2312
│
├─ MBCSGroupProber (triggered by high bytes)
│ ├─ UTF8Prober [state machine + count]
│ ├─ SJISProber [state machine + distribution + context]
│ ├─ EUCJPProber [state machine + distribution + context]
│ ├─ GB2312Prober [state machine + distribution]
│ ├─ EUCKRProber [state machine + distribution]
│ ├─ Big5Prober [state machine + distribution]
│ ├─ EUCTWProber [state machine + distribution]
│ └─ UTF1632Prober [null-byte pattern analysis]
│
├─ SBCSGroupProber (triggered by high bytes)
│ ├─ Latin1Prober [byte class heuristic]
│ ├─ HebrewProber [final-form meta-analysis]
│ │ ├─ Win-1255 Logical model
│ │ └─ ISO-8859-8 Visual model
│ └─ Many SingleByteCharSetProbers [bigram + frequency]:
│ ├─ Win-1251 / Russian
│ ├─ KOI8-R / Russian
│ ├─ ISO-8859-5 / Russian
│ ├─ MacCyrillic / Russian
│ ├─ IBM866 / Russian
│ ├─ IBM855 / Russian
│ ├─ Win-1251 / Bulgarian
│ ├─ ISO-8859-5 / Bulgarian
│ ├─ Win-1251 / Macedonian
│ ├─ Win-1253 / Greek
│ ├─ ISO-8859-7 / Greek
│ ├─ Win-1256 / Arabic
│ ├─ TIS-620 / Thai
│ ├─ ISO-8859-9 / Turkish
│ ├─ Win-1255 / Hebrew (logical)
│ ├─ Win-1255 / Hebrew (visual)
│ └─ ... and more
│
└─ Fallback: PURE_ASCII → 'ascii', confidence 1.0
No winner above threshold → None
Edge Cases
Empty Input
chardet.detect(b'')
→ {'encoding': None, 'confidence': 0.0, 'language': ''}
Pure ASCII
chardet.detect(b'Hello world')
→ {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
No probers are ever created (optimization).
Single High Byte
chardet.detect(b'\xe9')
Most probers don't have enough data. Likely returns None or very low
confidence Latin-1.
Reusing Detector with reset()
detector = UniversalDetector()
detector.feed(chinese_data)
detector.close()
result1 = detector.result # GB2312
detector.reset() # Must reset before reuse!
# Clears all state: result, done, _got_data, _input_state,
# _last_char, all probers destroyed/reset
detector.feed(korean_data)
detector.close()
result2 = detector.result # EUC-KR
Incremental Feeding with Small Chunks
detector = UniversalDetector()
for byte in data:
detector.feed(bytes([byte]))
if detector.done:
break
detector.close()
This works because state machines and distribution analyzers maintain state between feed() calls. However, BOM detection is less reliable with tiny first chunks — it only checks the first feed’s data. Recommendation: feed at least 4 bytes in the first call.
All Null Bytes
chardet.detect(b'\x00\x00\x00\x00\x00\x00\x00\x00')
UTF1632Prober may trigger but pattern doesn't match cleanly.
Likely result: None (binary data, not text).
Design Philosophy
The whole design is an elegant cascade from certainty to statistical guesswork:
- BOMs are unambiguous (100% confidence)
- Escape sequences are nearly unambiguous (specific sequence identifies exactly one encoding)
- State machines cheaply eliminate structurally impossible encodings
- Distribution analysis disambiguates encodings with overlapping byte ranges using character frequency models
- Bigram/sequence analysis provides fine-grained disambiguation for single-byte encodings
Each layer trades specificity for coverage, and the system as a whole handles a remarkably wide range of real-world encoding scenarios.
About This Document
Date written: March 14, 2026
Author: Claude Opus 4.6 (Anthropic), in conversation with Dan Blanchard (chardet maintainer)
Source of knowledge: This document was written entirely from the model’s training data memory, with no web searches or file reads of the actual chardet source code performed during the session. The knowledge likely reflects a composite of chardet versions 4.x through 5.x (roughly 2021-2024), based on the features described:
- detect_all() was introduced around chardet 4.0
- UTF1632Prober (BOM-less UTF-16/32 detection) appears to be a 4.x+ addition
- LanguageFilter as a formalized enum, and the lang_filter parameter on UniversalDetector, are from a similar era
Because training data doesn’t carry clean version tags, some details may reflect one version while others reflect a different one. Specific values like threshold constants, byte class ranges, enum integer values, the exact list of single-byte probers, and confidence calculation formulas may be inaccurate. Consult the actual chardet source code for authoritative information.