Bad Encoding Scenarios
These pages simulate a common real-world issue: the HTTP Content-Type header
declares one charset but the actual content is encoded differently. Your scraper must detect
the mismatch and correctly decode the content.
Cyrillic (Latin-1 header, UTF-8 body)
Russian text encoded as UTF-8, but the header declares charset=latin-1.
Header: charset=latin-1 | Actual: utf-8
German (ASCII header, UTF-8 body)
German text with umlauts encoded as UTF-8, but the header declares charset=ascii.
Header: charset=ascii | Actual: utf-8
Mixed Multilingual (CP1252 header, UTF-8 body)
Chinese, Arabic, French, and Japanese text encoded as UTF-8, but the header declares charset=windows-1252.
Header: charset=windows-1252 | Actual: utf-8
CP1252 Smart Quotes (UTF-8 header, CP1252 body)
Raw CP1252 bytes with smart quotes and special symbols, but the header declares charset=utf-8.
Header: charset=utf-8 | Actual: cp1252
Latin-1 Accents (UTF-8 header, ISO-8859-1 body)
French and Spanish accented text as raw Latin-1 bytes, but the header declares charset=utf-8.
Header: charset=utf-8 | Actual: iso-8859-1
Invalid UTF-8 Sequences (UTF-8 header, broken bytes)
Content with truncated or otherwise invalid UTF-8 sequences: broken multi-byte characters, lone continuation bytes, and overlong encodings.
Header: charset=utf-8 | Actual: utf-8
What to test
- Cyrillic/German/Mixed: Content is valid UTF-8 but header lies about charset. Scraper should ignore the header and use the actual encoding.
- CP1252/Latin-1: Content is raw legacy bytes but header claims UTF-8. Scraper must detect invalid UTF-8 and re-decode with the correct encoding.
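For the Invalid UTF-8 Sequences scenario, no re-decode can recover the broken bytes; the practical goal is to keep the surrounding valid text instead of crashing. A sketch using Python's `errors="replace"` mode (the sample bytes are my own, chosen to mimic the page's truncated sequences and lone continuation bytes):

```python
# Assumed sample: a truncated three-byte sequence (\xe2\x82, the start of "€")
# and a lone continuation byte (\x85).
broken = b"price \xe2\x82 total \x85 end"

# A strict decode raises UnicodeDecodeError; "replace" substitutes U+FFFD
# for each invalid sequence so the scraper keeps the readable text.
text = broken.decode("utf-8", errors="replace")
print(text)
```

`errors="ignore"` drops the bad bytes entirely instead; which is preferable depends on whether you want the output to flag where data was lost.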