web-scraping.dev
A realistic e-commerce testing platform for web scraping developers
Practice web scraping with 19 realistic scenarios covering pagination, authentication, GraphQL APIs, CSRF protection, and more. Safe, legal, and designed for learning.
Realistic Scenarios
Challenging
CI Testing Suite
Scraping Scenarios (48 scenarios)
Real-world web patterns you'll encounter in production scraping projects. Each scenario is tagged with difficulty level and includes working code examples.
| Scenario | Path | Level | Description |
|---|---|---|---|
| Static Paging | https://web-scraping.dev/products | Beginner | HTML-based server-side item paging where each page is its own URL. |
| Endless Scroll Paging | https://web-scraping.dev/testimonials | Intermediate | Dynamic client-side paging where new items appear as the user scrolls. |
| Secret API Token | https://web-scraping.dev/testimonials | Intermediate | X-Secret-Token used to lock access to hidden APIs. |
| Endless Button Paging | https://web-scraping.dev/reviews | Intermediate | Dynamic paging where new items appear on Load More click. |
| GraphQL Background Requests | https://web-scraping.dev/reviews | Advanced | Data loaded via JavaScript through a backend GraphQL API. |
| Forced New Tab Links | https://web-scraping.dev/reviews | Beginner | Links forced to open in new tabs via different techniques. |
| Product HTML Markup | https://web-scraping.dev/product/1 | Beginner | Basic e-commerce product structure and CSS class-based markup. |
| Hidden Web Data | https://web-scraping.dev/product/1 | Intermediate | Review data hidden in HTML as JSON, loaded on page load. |
| Local Storage | https://web-scraping.dev/product/1 | Intermediate | Cart system powered by localStorage for client-side state. |
| CSRF Token Locks | https://web-scraping.dev/product/1 | Advanced | Load More reviews uses the X-CSRF-Token header to block cross-site access. |
| Cookies Based Login | https://web-scraping.dev/login | Intermediate | User authentication based on a form request and cookies. |
| Iframe Login | https://web-scraping.dev/login/iframe | Intermediate | Login form embedded in an iframe (SSO widget pattern). |
| PDF Downloads | https://web-scraping.dev/login | Intermediate | Link- and JS-based file download triggers. |
| Cookie Popup | https://web-scraping.dev/login?cookies= | Beginner | Cookie info modal popup that blocks the entire screen. |
| Example Block Page | https://web-scraping.dev/blocked | Beginner | Valid 200-status response that redirects to a block notification. |
| Blocking Redirect (Invalid Referer) | https://web-scraping.dev/credentials | Advanced | Page requires a valid Referer header; otherwise it redirects to the blocked page. |
| Persistent Cookie Blocking | https://web-scraping.dev/blocked?persist= | Advanced | Using cookies to mark blocked clients for persistent blocking. |
| Form File Attachment Download | https://web-scraping.dev/file-download | Intermediate | Form submission triggers file download with Content-Disposition header. |
| AI Content Obfuscation | https://web-scraping.dev/ai-content-obfuscation | Intermediate | Extract clean text from AI-obfuscated content using invisible Unicode. |
| Challenge + File Download | https://web-scraping.dev/challenge-download | Advanced | Challenge bypass + file download with 403 status. |
| Bad Encoding | https://web-scraping.dev/bad-encoding | Intermediate | Pages with mismatched Content-Type charset headers. |
| Antibot Challenge | https://web-scraping.dev/antibot/easy | Beginner | Simple antibot protection that blocks direct access. |
| Robots.txt Compliance Trap | /robots-disallowed | Crawler | Disallowed URL that returns 200. Conforming crawlers must not fetch. |
| WebMCP Tools | /mcp-tools | Intermediate | Page with registered MCP tools. Test native browser MCP with AI agents via Cloud Browser. |
| Crawler Test Report Endpoint | /crawler-test-report | Crawler | Central JSON assertion endpoint. POST /reset to clear between runs. |
| Rate Limit & Politeness | /rate-limited | Crawler | 429 with Retry-After: 5 after 3 req/10s. |
| Crawl Delay Enforcement | /slow-section/page/1 | Crawler | robots.txt Crawl-delay: 2 for /slow-section/. |
| Meta Robots Directives | /meta-noindex | Crawler | Pages with meta noindex/nofollow. |
| X-Robots-Tag Header | /header-noindex | Crawler | Responses with X-Robots-Tag noindex/nofollow. |
| Canonical URL Handling | /canonical-tracking?utm_source=foo&session=bar | Crawler | Tracking params with rel=canonical. |
| Session ID Vortex (Trap) | /session-vortex | Crawler | Every page produces 10 new sid-parameterized links. |
| Infinite Calendar (Trap) | /calendar/2024/01 | Crawler | Calendar with prev/next links across 2000-2100. |
| Redirect Chain & Loops | /redirect-chain/1 | Crawler | Chain of 10 redirects + loop detection. |
| Fragment & URL Normalization | /fragment-collapse | Crawler | Fragment-only and normalization variants. |
| Content Deduplication | /dup-a | Crawler | Identical HTML with canonical. Tests dedup. |
| JS-only Link Discovery | /js-links | Crawler | Anchors injected by JS after DOMContentLoaded. |
| JSON-LD Link Discovery | /linked-data | Crawler | Link only present inside a JSON-LD script tag. |
| data-href Link Discovery | /data-href | Crawler | Link attribute on a non-anchor element. |
| HTTP Link Header Discovery | /header-link | Crawler | Link: rel=next HTTP header points to target. |
| HTML base Tag | /base-tag | Crawler | Relative links resolved against base href. |
| Huge Response Body | /huge-page | Crawler | Streams ~50MB of HTML. |
| Slow Drip Response | /slow-drip | Crawler | 1 byte/sec for 60s. Tests timeout behavior. |
| Wrong Content-Length | /wrong-content-length | Crawler | Header claims 100 bytes, body is 1000. |
| External Redirect | /redirect-external | Crawler | 302 to example.com. |
| Mixed HTTP/HTTPS Content | /mixed-content | Crawler | Same URL under http and https. |
| Sitemap Index | /sitemap-index.xml | Crawler | 50 child sitemaps x 100 URLs each. |
| Gzipped Sitemap | /sitemap.xml.gz | Crawler | sitemap.xml as application/gzip. |
| Basic Auth Wall | /private/secret | Crawler | 401 + WWW-Authenticate: Basic. |
| Cookie Consent Flow | /needs-consent | Crawler | Multi-step cookie consent redirect. |
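
Several of the crawler scenarios above (Canonical URL Handling, Session ID Vortex, Fragment & URL Normalization) come down to deciding when two URLs point at the same page. The sketch below shows one possible way to build a visited-set key in Python using only the standard library. The names `normalize_url`, `should_visit`, and `IGNORED_PARAMS` are hypothetical, and the list of ignored parameters is an assumption based on the `utm_source`, `session`, and `sid` names mentioned in the table; it is not the site's reference solution.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed set of tracking/session parameters to ignore (hypothetical;
# adjust to whatever the target site actually uses).
IGNORED_PARAMS = {"sid", "session", "utm_source", "utm_medium", "utm_campaign"}

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a normalized form of `url` suitable as a visited-set key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"

    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"

    # Drop ignored parameters and sort the rest so ordering does not matter.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    query = urlencode(sorted(kept))

    # The fragment is never sent to the server, so it is discarded.
    return urlunsplit((scheme, host, path, query, ""))


def should_visit(url: str, seen: set) -> bool:
    """Record `url` in `seen` and report whether it was new."""
    key = normalize_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True


if __name__ == "__main__":
    seen = set()
    for candidate in [
        "https://web-scraping.dev/session-vortex?sid=aaa",
        "https://web-scraping.dev/session-vortex?sid=bbb#top",
        "https://web-scraping.dev/products?page=2",
    ]:
        print(should_visit(candidate, seen), normalize_url(candidate))
```

With a key like this, the sid-parameterized links from the session trap collapse to a single entry. It does not help with the Infinite Calendar trap, where every URL is genuinely distinct; that case usually needs a depth or page-count limit instead.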