Bad Encoding Scenarios
These pages simulate a common real-world issue: the HTTP Content-Type header
declares one charset but the actual content is encoded differently. Your scraper must detect
the mismatch and correctly decode the content.
Cyrillic (Latin-1 header, UTF-8 body)
Russian text encoded as UTF-8 but header declares charset=latin-1
Header: charset=latin-1 | Actual: utf-8
German (ASCII header, UTF-8 body)
German text with umlauts encoded as UTF-8 but header declares charset=ascii
Header: charset=ascii | Actual: utf-8
Mixed Multilingual (CP1252 header, UTF-8 body)
Chinese, Arabic, French, Japanese text as UTF-8 but header declares charset=windows-1252
Header: charset=windows-1252 | Actual: utf-8
CP1252 Smart Quotes (UTF-8 header, CP1252 body)
Raw CP1252 bytes with smart quotes and special symbols but header declares charset=utf-8
Header: charset=utf-8 | Actual: cp1252
Latin-1 Accents (UTF-8 header, ISO-8859-1 body)
French/Spanish accented text as raw Latin-1 bytes but header declares charset=utf-8
Header: charset=utf-8 | Actual: iso-8859-1
Invalid UTF-8 Sequences (UTF-8 header, broken bytes)
Content with truncated/invalid UTF-8 sequences: broken multi-byte chars, lone continuation bytes, overlong encodings
Header: charset=utf-8 | Actual: utf-8
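The three kinds of malformed input named for the invalid UTF-8 scenario can be reproduced with a few hand-written byte strings. The sketch below is only an illustration: the sample bytes and the helper names `is_valid_utf8` and `decode_lossy` are assumptions, not part of the test pages, and the page does not state what output it expects for this scenario. It shows that a strict decode rejects all three inputs, and that decoding with `errors="replace"` keeps the valid text and substitutes U+FFFD for the broken bytes.

```python
# Hypothetical sample inputs, one per kind of breakage listed above.
SAMPLES = {
    "lone continuation byte": b"ok \x80 end",
    "truncated multi-byte char": b"price \xe2\x82",  # first 2 of the 3 euro-sign bytes
    "overlong encoding of '/'": b"a\xc0\xafb",
}


def is_valid_utf8(raw: bytes) -> bool:
    """Return True only if every byte sequence in raw is well-formed UTF-8."""
    try:
        raw.decode("utf-8")  # strict mode raises on the first bad sequence
        return True
    except UnicodeDecodeError:
        return False


def decode_lossy(raw: bytes) -> str:
    """Decode as UTF-8, replacing each malformed sequence with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


for label, raw in SAMPLES.items():
    print(f"{label}: valid={is_valid_utf8(raw)} text={decode_lossy(raw)!r}")
```

Whether a scraper should keep the replacement characters or try another encoding for such content is a policy choice that this sketch does not settle.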
What to test
- Cyrillic/German/Mixed: Content is valid UTF-8 but header lies about charset. Scraper should ignore the header and use the actual encoding.
- CP1252/Latin-1: Content is raw legacy bytes but header claims UTF-8. Scraper must detect invalid UTF-8 and re-decode with the correct encoding.
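One way to handle both groups is to ignore the declared charset and let the bytes decide. The following is a minimal sketch under that assumption: the function name `decode_body` and the fallback order (UTF-8, then CP1252, then ISO-8859-1) are illustrative choices, not something the test pages prescribe. Strict UTF-8 goes first because legacy single-byte text containing non-ASCII characters is very rarely valid UTF-8 by accident.

```python
from typing import Tuple


def decode_body(raw: bytes) -> Tuple[str, str]:
    """Decode raw response bytes and return (text, encoding_used).

    The Content-Type charset is deliberately not consulted here.
    """
    # 1. Valid UTF-8 wins, whatever the header claimed.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 2. CP1252 maps 0x80-0x9F to smart quotes, dashes, and similar symbols.
    #    Python's codec leaves five bytes undefined (0x81, 0x8D, 0x8F, 0x90,
    #    0x9D), so this step can still fail.
    try:
        return raw.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        pass

    # 3. ISO-8859-1 maps every byte to a code point, so it never fails.
    return raw.decode("latin-1"), "latin-1"


print(decode_body("Привет".encode("utf-8")))   # UTF-8 body, header ignored
print(decode_body(b"They don\x92t"))           # CP1252 right single quote
print(decode_body(b"caf\xe9"))                 # accented Latin-1 byte
```

CP1252 and ISO-8859-1 agree on bytes 0xA0 to 0xFF, so for accented Latin-1 text this sketch reports `cp1252` while producing the same characters. A statistical detector could replace steps 2 and 3 when more legacy encodings need to be told apart.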
Non-UTF-8 HTTP Headers
These endpoints serve files with non-ASCII bytes in the Content-Disposition header
(e.g. CP1252 smart quotes in filenames). The scraper must detect the encoding of raw header bytes
and recover the original characters instead of replacing them with U+FFFD (�).
CP1252 Smart Quotes in Filename
Content-Disposition header with CP1252 byte 0x92 (right single quote) in the filename. The scraper should recover U+2019 (’) not U+FFFD (�).
Disposition: inline | Type: application/pdf
Expected filename: They Don’t Produce Income If They Dont Reproduce.pdf
CP1252 Attachment Filename
Same as above but with Content-Disposition: attachment. Byte 0x92 (right single quote) and 0x96 (en dash) in filename.
Disposition: attachment | Type: application/pdf
Expected filename: They Don’t Know – A Study.pdf
Latin-1 Accented Filename
Content-Disposition with ISO-8859-1 bytes: 0xE9 (e-acute), 0xE8 (e-grave), 0xFC (u-umlaut). Should recover é, è, ü.
Disposition: inline | Type: application/pdf
Expected filename: Résumé für Café Crème.pdf
Multiple CP1252 Special Characters
Filename with several CP1252-specific bytes: 0x93/0x94 (smart double quotes), 0x85 (ellipsis), 0x97 (em dash). Should recover “ ” … —.
Disposition: inline | Type: text/plain
Expected filename: “Document” Title… Part — Final.txt
What to test
- inline: Chrome displays the PDF in its built-in viewer. The Content-Disposition filename should be recovered with correct characters.
- attachment: Chrome triggers a download. The filename in browser_data.attachments should have correct characters.
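Recovering the filename comes down to decoding the raw header bytes before parsing them. The sketch below assumes the scraper has access to those bytes. The helper names `decode_header_bytes` and `filename_from_disposition` are hypothetical, the UTF-8 then CP1252 then ISO-8859-1 order is an assumed heuristic, and the regular expression only handles the plain `filename=` parameter (not the RFC 6266 `filename*=` form).

```python
import re
from typing import Optional

# Matches filename="quoted value" or filename=token (illustrative, not a full parser).
_FILENAME_RE = re.compile(r'filename\s*=\s*(?:"([^"]*)"|([^;]*))', re.IGNORECASE)


def decode_header_bytes(raw: bytes) -> str:
    """Decode raw header bytes, preferring UTF-8, then CP1252, then Latin-1."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")  # never fails: every byte maps to a code point


def filename_from_disposition(raw: bytes) -> Optional[str]:
    """Extract the filename parameter from a raw Content-Disposition value."""
    match = _FILENAME_RE.search(decode_header_bytes(raw))
    if match is None:
        return None
    quoted, token = match.group(1), match.group(2)
    return (quoted if quoted is not None else token).strip()


raw = b'attachment; filename="They Don\x92t Know \x96 A Study.pdf"'
print(filename_from_disposition(raw))
```

If the HTTP client has already decoded the header as ISO-8859-1 (Python's `http.client` does this), calling `value.encode("latin-1")` on the resulting string should give back the original bytes, which can then be passed to `filename_from_disposition`.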