The Content-Type header declares the wrong charset.
Your scraper must detect the actual encoding and decode the content correctly.
| Property | Value |
|---|---|
| Declared charset (header) | utf-8 |
| Actual encoding (body) | utf-8 |
| Scenario | Content with truncated or invalid UTF-8 sequences: broken multi-byte characters, lone continuation bytes, overlong encodings |
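One way to compare the declared charset with the body is to read the `charset` parameter from the Content-Type header and then attempt a strict decode. The following is a minimal sketch, not this page's reference solution; the helper names `declared_charset` and `body_matches_charset` and the `utf-8` default are assumptions made for illustration.

```python
from email.message import Message


def declared_charset(content_type, default="utf-8"):
    """Return the lowercased charset parameter of a Content-Type value.

    Falls back to `default` (an assumption of this sketch) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def body_matches_charset(body, charset):
    """Return True if the raw bytes decode cleanly under `charset`."""
    try:
        body.decode(charset, errors="strict")
        return True
    except (UnicodeDecodeError, LookupError):
        # LookupError covers charset labels Python does not recognise.
        return False
```

A strict decode can only show that the body is inconsistent with the declared charset; it cannot by itself say which encoding was actually used.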
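Truncated 2-byte sequence: "caf" is followed by a lead byte whose second byte is missing.
Truncated 3-byte sequence: the price is missing the last byte of the euro sign.
Lone continuation byte: a stray continuation byte sits between "hello" and "world".
Mixed valid/invalid: "Straße in München", with ß and ü as single CP1252 bytes inside a UTF-8 stream.
Overlong slash: "path" and "file" are separated by an overlong encoding of "/".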
This paragraph is completely valid UTF-8 and should survive intact.
Illegal bytes: bytes that are not valid in UTF-8 sit between "data" and "end".
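The cases above can be reproduced with a lossy decode that substitutes U+FFFD for each undecodable sequence while leaving valid text untouched. This is a sketch under stated assumptions: the byte strings in `SAMPLES` are hypothetical stand-ins for the page's real bytes, and `decode_lossy` and `first_invalid_offset` are illustrative helper names.

```python
# Hypothetical byte strings standing in for the cases listed above;
# the bytes actually served by the page may differ.
SAMPLES = {
    "truncated_2byte": b"caf\xc3 is missing the second byte",
    "truncated_3byte": b"price \xe2\x82",
    "lone_continuation": b"hello\x80world",
    "cp1252_in_utf8": b"Stra\xdfe in M\xfcnchen",
    "overlong_slash": b"path\xc0\xaffile",
    "valid": "This paragraph is completely valid UTF-8 "
             "and should survive intact.".encode("utf-8"),
    "illegal_bytes": b"data \xfe\xff end",
}


def decode_lossy(raw):
    """Decode as UTF-8, replacing undecodable sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def first_invalid_offset(raw):
    """Return the byte offset of the first invalid sequence, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


for name, raw in SAMPLES.items():
    offset = first_invalid_offset(raw)
    print(f"{name}: offset={offset} text={decode_lossy(raw)!r}")
```

Python's decoder rejects the overlong form rather than turning it into "/", which matters when decoded text is later used in path checks.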
This iframe is served as ISO-8859-1 bytes with the header charset=utf-8, a different mismatch than the main page.
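For a mismatch like this one, a common approach is to try the declared charset strictly and fall back to a single-byte encoding only when that fails. The sketch below assumes ISO-8859-1 as the fallback because that is what this iframe is stated to use; `decode_with_fallback` is a hypothetical helper name.

```python
def decode_with_fallback(raw, declared="utf-8", fallback="iso-8859-1"):
    """Decode with the declared charset, else with the fallback.

    Returns (text, charset_used). ISO-8859-1 maps every byte to a
    character, so the fallback never raises; it also cannot confirm
    that the guess is right.
    """
    try:
        return raw.decode(declared), declared
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


# Bytes encoded as ISO-8859-1 but labelled utf-8 trigger the fallback.
text, used = decode_with_fallback("Straße in München".encode("iso-8859-1"))
print(used, text)
```

Pure-ASCII bodies decode identically under both charsets, so the fallback is only reached when a non-ASCII byte breaks the strict UTF-8 decode.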