Bad Encoding Scenarios

These pages simulate a common real-world issue: the HTTP Content-Type header declares one charset but the actual content is encoded differently. Your scraper must detect the mismatch and correctly decode the content.
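One way to detect such a mismatch, sketched below with only the Python standard library: try the declared charset, then cross-check against UTF-8. The `detect_mismatch` helper and the sample text are illustrative, not part of these pages; a real scraper would take `declared` from the Content-Type header and `body` from the raw response bytes.

```python
def detect_mismatch(body: bytes, declared: str) -> bool:
    """Return True if the body appears not to match the declared charset."""
    try:
        body.decode(declared)
    except (UnicodeDecodeError, LookupError):
        return True  # body cannot even be decoded as declared
    # A legacy single-byte charset decodes almost any byte sequence, so a
    # successful decode proves little. If the body is also valid UTF-8 and
    # the two decodings differ, the header is probably lying.
    if declared.lower() not in ("utf-8", "utf8"):
        try:
            utf8_text = body.decode("utf-8")
        except UnicodeDecodeError:
            return False  # not UTF-8 either; no evidence against the header
        return utf8_text != body.decode(declared)
    return False

body = "Привет".encode("utf-8")          # Russian text, actually UTF-8
print(detect_mismatch(body, "latin-1"))  # header lies -> True
print(detect_mismatch(b"plain ascii", "ascii"))  # consistent -> False
```

The cross-check relies on the fact that non-ASCII UTF-8 and latin-1/cp1252 decodings of the same bytes virtually never agree, while pure ASCII decodes identically under all of them.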

Cyrillic (Latin-1 header, UTF-8 body)

Russian text encoded as UTF-8, but the header declares charset=latin-1

Header: charset=latin-1 | Actual: utf-8

German (ASCII header, UTF-8 body)

German text with umlauts encoded as UTF-8, but the header declares charset=ascii

Header: charset=ascii | Actual: utf-8

Mixed Multilingual (CP1252 header, UTF-8 body)

Chinese, Arabic, French, and Japanese text as UTF-8, but the header declares charset=windows-1252

Header: charset=windows-1252 | Actual: utf-8

CP1252 Smart Quotes (UTF-8 header, CP1252 body)

Raw CP1252 bytes with smart quotes and special symbols, but the header declares charset=utf-8

Header: charset=utf-8 | Actual: cp1252

Latin-1 Accents (UTF-8 header, ISO-8859-1 body)

French/Spanish accented text as raw Latin-1 bytes, but the header declares charset=utf-8

Header: charset=utf-8 | Actual: iso-8859-1

Invalid UTF-8 Sequences (UTF-8 header, broken bytes)

Content with truncated/invalid UTF-8 sequences: broken multi-byte chars, lone continuation bytes, overlong encodings

Header: charset=utf-8 | Actual: utf-8 (intentionally malformed)

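For this last scenario no re-decode can succeed, because the bytes are damaged UTF-8 rather than a different charset. A minimal way to degrade gracefully in Python is `errors="replace"`, which keeps the valid text and substitutes U+FFFD for each damaged span; the truncated sample below is illustrative:

```python
# "café" in UTF-8 with the final continuation byte truncated:
# \xc3 starts a two-byte sequence that never completes.
broken = b"caf\xc3"
text = broken.decode("utf-8", errors="replace")
print(text)  # 'caf\ufffd' -> the replacement character marks the damage
```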
What to test
  • Cyrillic/German/Mixed: Content is valid UTF-8 but the header lies about the charset. The scraper should ignore the header and use the actual encoding.
  • CP1252/Latin-1: Content is raw legacy bytes but the header claims UTF-8. The scraper must detect the invalid UTF-8 and re-decode with the correct encoding.
  • Invalid UTF-8: Content is genuinely malformed, so no charset decodes it cleanly. The scraper should degrade gracefully (e.g. replace the broken bytes) rather than crash.