Bad Encoding Scenarios
These pages simulate a common real-world issue: the HTTP Content-Type header
declares one charset but the actual content is encoded differently. Your scraper must detect
the mismatch and correctly decode the content.
Cyrillic (Latin-1 header, UTF-8 body)
Russian text encoded as UTF-8, but the header declares charset=latin-1.
Header: charset=latin-1 | Actual: utf-8
German (ASCII header, UTF-8 body)
German text with umlauts encoded as UTF-8, but the header declares charset=ascii.
Header: charset=ascii | Actual: utf-8
Mixed Multilingual (CP1252 header, UTF-8 body)
Chinese, Arabic, French, and Japanese text encoded as UTF-8, but the header declares charset=windows-1252.
Header: charset=windows-1252 | Actual: utf-8
CP1252 Smart Quotes (UTF-8 header, CP1252 body)
Raw CP1252 bytes with smart quotes and special symbols, but the header declares charset=utf-8.
Header: charset=utf-8 | Actual: cp1252
Latin-1 Accents (UTF-8 header, ISO-8859-1 body)
French and Spanish accented text as raw Latin-1 bytes, but the header declares charset=utf-8.
Header: charset=utf-8 | Actual: iso-8859-1
Invalid UTF-8 Sequences (UTF-8 header, broken bytes)
Content with truncated or otherwise invalid UTF-8 sequences: broken multi-byte characters, lone continuation bytes, and overlong encodings.
Header: charset=utf-8 | Actual: utf-8
What to test
- Cyrillic/German/Mixed: Content is valid UTF-8 but header lies about charset. Scraper should ignore the header and use the actual encoding.
- CP1252/Latin-1: Content is raw legacy bytes but header claims UTF-8. Scraper must detect invalid UTF-8 and re-decode with the correct encoding.
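For the Invalid UTF-8 Sequences scenario, no re-decode can recover the broken bytes; the practical goal is to keep the surrounding valid text instead of crashing. A sketch using Python's `errors="replace"` mode (the sample bytes are my own, chosen to mimic the page's truncated sequences and lone continuation bytes):

```python
# Assumed sample: a truncated three-byte sequence (\xe2\x82, the start of "€")
# and a lone continuation byte (\x85).
broken = b"price \xe2\x82 total \x85 end"

# A strict decode raises UnicodeDecodeError; "replace" substitutes U+FFFD
# for each invalid sequence so the scraper keeps the readable text.
text = broken.decode("utf-8", errors="replace")
print(text)
```

`errors="ignore"` drops the bad bytes entirely instead; which is preferable depends on whether you want the output to flag where data was lost.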