web-scraping.dev

A realistic e-commerce testing platform for web scraping developers

Practice web scraping with 19 realistic scenarios covering pagination, authentication, GraphQL APIs, CSRF protection, and more. Safe, legal, and designed for learning.

Realistic Scenarios · Challenging · CI Testing Suite

Scraping Scenarios (48 Scenarios)

Real-world web patterns you'll encounter in production scraping projects. Each scenario is tagged with difficulty level and includes working code examples.

| Scenario | Path | Level | Description |
| --- | --- | --- | --- |
| Static Paging | https://web-scraping.dev/products | Beginner | HTML-based server-side item paging where each page is its own URL. |
| Endless Scroll Paging | https://web-scraping.dev/testimonials | Intermediate | Dynamic client side paging where new items appear as user scrolls. |
| Secret API Token | https://web-scraping.dev/testimonials | Intermediate | X-Secret-Token used to lock access to hidden APIs. |
| Endless Button Paging | https://web-scraping.dev/reviews | Intermediate | Dynamic paging where new items appear on Load More click. |
| GraphQL Background Requests | https://web-scraping.dev/reviews | Advanced | Data loaded via JavaScript through a backend GraphQL API. |
| Forced New Tab Links | https://web-scraping.dev/reviews | Beginner | Links forced to open in new tabs via different techniques. |
| Product HTML Markup | https://web-scraping.dev/product/1 | Beginner | Basic e-commerce product structure and CSS class based markup. |
| Hidden Web Data | https://web-scraping.dev/product/1 | Intermediate | Review data hidden in HTML as JSON, loaded on page load. |
| Local Storage | https://web-scraping.dev/product/1 | Intermediate | Cart system powered by localStorage for client-side state. |
| CSRF Token Locks | https://web-scraping.dev/product/1 | Advanced | Load More reviews uses X-CSRF-Token header to block cross-site access. |
| Cookies Based Login | https://web-scraping.dev/login | Intermediate | User authentication based on form request and cookies. |
| Iframe Login | https://web-scraping.dev/login/iframe | Intermediate | Login form embedded in an iframe (SSO widget pattern). |
| PDF Downloads | https://web-scraping.dev/login | Intermediate | Link and JS based file download triggers. |
| Cookie Popup | https://web-scraping.dev/login?cookies= | Beginner | Cookie info modal popup that blocks the entire screen. |
| Example Block Page | https://web-scraping.dev/blocked | Beginner | Valid 200-status response that redirects to block notification. |
| Blocking Redirect (Invalid Referer) | https://web-scraping.dev/credentials | Advanced | Page requires valid Referer header, otherwise redirects to blocked. |
| Persistent Cookie Blocking | https://web-scraping.dev/blocked?persist= | Advanced | Using cookies to mark blocked clients for persistent blocking. |
| Form File Attachment Download | https://web-scraping.dev/file-download | Intermediate | Form submission triggers file download with Content-Disposition header. |
| AI Content Obfuscation | https://web-scraping.dev/ai-content-obfuscation | Intermediate | Extract clean text from AI-obfuscated content using invisible Unicode. |
| Challenge + File Download | https://web-scraping.dev/challenge-download | Advanced | Challenge bypass + file download with 403 status. |
| Bad Encoding | https://web-scraping.dev/bad-encoding | Intermediate | Pages with mismatched Content-Type charset headers. |
| Antibot Challenge | https://web-scraping.dev/antibot/easy | Beginner | Simple antibot protection that blocks direct access. |
| Robots.txt Compliance Trap | /robots-disallowed | Crawler | Disallowed URL that returns 200. Conforming crawlers must not fetch. |
| WebMCP Tools | /mcp-tools | Intermediate | Page with registered MCP tools. Test native browser MCP with AI agents via Cloud Browser. |
| Crawler Test Report Endpoint | /crawler-test-report | Crawler | Central JSON assertion endpoint. POST /reset to clear between runs. |
| Rate Limit & Politeness | /rate-limited | Crawler | 429 with Retry-After: 5 after 3 req/10s. |
| Crawl Delay Enforcement | /slow-section/page/1 | Crawler | robots.txt Crawl-delay: 2 for /slow-section/. |
| Meta Robots Directives | /meta-noindex | Crawler | Pages with meta noindex/nofollow. |
| X-Robots-Tag Header | /header-noindex | Crawler | Responses with X-Robots-Tag noindex/nofollow. |
| Canonical URL Handling | /canonical-tracking?utm_source=foo&session=bar | Crawler | Tracking params with rel=canonical. |
| Session ID Vortex (Trap) | /session-vortex | Crawler | Every page produces 10 new sid-parameterized links. |
| Infinite Calendar (Trap) | /calendar/2024/01 | Crawler | Calendar with prev/next links across 2000-2100. |
| Redirect Chain & Loops | /redirect-chain/1 | Crawler | Chain of 10 redirects + loop detection. |
| Fragment & URL Normalization | /fragment-collapse | Crawler | Fragment-only and normalization variants. |
| Content Deduplication | /dup-a | Crawler | Identical HTML with canonical. Tests dedup. |
| JS-only Link Discovery | /js-links | Crawler | Anchors injected by JS after DOMContentLoaded. |
| JSON-LD Link Discovery | /linked-data | Crawler | Link only present inside a JSON-LD script tag. |
| data-href Link Discovery | /data-href | Crawler | Link attribute on a non-anchor element. |
| HTTP Link Header Discovery | /header-link | Crawler | Link: rel=next HTTP header points to target. |
| HTML base Tag | /base-tag | Crawler | Relative links resolved against base href. |
| Huge Response Body | /huge-page | Crawler | Streams ~50MB of HTML. |
| Slow Drip Response | /slow-drip | Crawler | 1 byte/sec for 60s. Tests timeout behavior. |
| Wrong Content-Length | /wrong-content-length | Crawler | Header claims 100 bytes, body is 1000. |
| External Redirect | /redirect-external | Crawler | 302 to example.com. |
| Mixed HTTP/HTTPS Content | /mixed-content | Crawler | Same URL under http and https. |
| Sitemap Index | /sitemap-index.xml | Crawler | 50 child sitemaps x 100 URLs each. |
| Gzipped Sitemap | /sitemap.xml.gz | Crawler | sitemap.xml as application/gzip. |
| Basic Auth Wall | /private/secret | Crawler | 401 + WWW-Authenticate: Basic. |
| Cookie Consent Flow | /needs-consent | Crawler | Multi-step cookie consent redirect. |
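
The Canonical URL Handling, Session ID Vortex, and Fragment & URL Normalization scenarios all depend on a crawler recognizing that differently spelled URLs point at the same page. The sketch below is one possible normalization step using only the Python standard library. The helper name `normalize_url` is hypothetical, and so is the list of ignorable parameters, which is inferred from the `utm_source`, `session`, and `sid` names in the table. The `page` and `category` parameters in the usage note are also illustrative. A real crawler would likely also honor rel=canonical rather than rely on this heuristic alone.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query parameters never change page content.
IGNORED_PARAMS = {"session", "sid"}
IGNORED_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Return a comparison key for a URL (hypothetical helper).

    Drops the fragment, removes tracking/session parameters, sorts the
    remaining query parameters, and lowercases the host.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
        and not key.lower().startswith(IGNORED_PREFIXES)
    ]
    kept.sort()  # stable ordering so ?a=1&b=2 equals ?b=2&a=1
    path = parts.path or "/"
    # The empty last element discards the fragment.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, urlencode(kept), ""))


def is_new(url: str, seen: set) -> bool:
    """Record the normalized URL and report whether it was unseen."""
    key = normalize_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this sketch, `https://web-scraping.dev/session-vortex?sid=abc#top` and `https://web-scraping.dev/session-vortex?sid=xyz` reduce to the same key, so a frontier guarded by `is_new` would enqueue the page once instead of following every sid variant.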
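
For the Robots.txt Compliance Trap and Crawl Delay Enforcement scenarios, a crawler has to consult robots.txt before fetching. The following is a minimal sketch built on Python's standard `urllib.robotparser`. The robots.txt text is a hypothetical stand-in (the file the site actually serves may differ, including how it scopes the delay to /slow-section/), and `check_url` and the `my-crawler` user agent are illustrative names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, loosely based on the scenario table.
ROBOTS_TXT = """\
User-agent: *
Disallow: /robots-disallowed
Crawl-delay: 2
"""


def check_url(robots_txt: str, user_agent: str, url: str):
    """Return (allowed, crawl_delay) for a URL (hypothetical helper).

    crawl_delay is None when robots.txt sets no delay for this agent.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


if __name__ == "__main__":
    for path in ("/robots-disallowed", "/slow-section/page/1"):
        allowed, delay = check_url(
            ROBOTS_TXT, "my-crawler", "https://web-scraping.dev" + path
        )
        print(path, "allowed" if allowed else "skip", "delay:", delay)
```

Note that `RobotFileParser.crawl_delay` reports the delay per user-agent group, so a crawler that wants a per-path delay would need its own bookkeeping on top of this.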
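
The Rate Limit & Politeness scenario answers with 429 and `Retry-After: 5`. One way a client could turn that response into a wait time is sketched below with the standard library only. The function name `retry_after_seconds` and the one-second fallback are assumptions made for illustration. The HTTP-date branch covers the other form the header may take in general, not something the table says this site sends.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(status, retry_after, default=1.0, now=None):
    """Return seconds to wait before retrying (hypothetical helper).

    status:      HTTP status code of the response.
    retry_after: raw Retry-After header value, or None if absent.
    default:     fallback wait when the header is missing or unparseable.
    now:         current time (timezone-aware); injectable for testing.
    """
    if status != 429:
        return 0.0  # not rate limited, no wait needed
    if retry_after is None:
        return default
    value = retry_after.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "5"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A fetch loop could call `time.sleep(retry_after_seconds(resp_status, resp_header))` before retrying, where `resp_status` and `resp_header` stand for whatever the HTTP client in use exposes.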