web-scraping.dev

A realistic e-commerce testing platform for web scraping developers

Practice web scraping with 19 realistic scenarios covering pagination, authentication, GraphQL APIs, CSRF protection, and more. Safe, legal, and designed for learning.

Realistic Scenarios · Challenging · CI Testing Suite

Scraping Scenarios (47 challenges)

Real-world web patterns you'll encounter in production scraping projects. Each scenario is tagged with difficulty level and includes working code examples.

Robots.txt Compliance Trap (Crawler)

/robots-disallowed

This URL is explicitly Disallow'd in /robots.txt for all user-agents, but returns a valid HTTP 200 response. A conforming crawler MUST NOT fetch it. If your crawler hits this page, check the X-Robots-Trap: violated response header and the ROBOTS_TXT_VIOLATION body token — both are unambiguous assertion signals for your test suite.
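A compliance check for this trap can be sketched with Python's standard-library robots.txt parser. The robots.txt excerpt below is a hypothetical reconstruction from the description above; fetch the real file from https://web-scraping.dev/robots.txt before relying on it.

```python
from urllib import robotparser

# Hypothetical excerpt of the site's robots.txt (the real file may
# contain additional groups and directives).
ROBOTS_TXT = """\
User-agent: *
Disallow: /robots-disallowed
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A conforming crawler checks can_fetch() before requesting any URL.
print(rp.can_fetch("my-crawler/1.0", "https://web-scraping.dev/robots-disallowed"))  # False
print(rp.can_fetch("my-crawler/1.0", "https://web-scraping.dev/products"))           # True
```

Gating every fetch on `can_fetch()` is what keeps a crawler from ever hitting this page and tripping the `X-Robots-Trap` assertion.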

Changelog

v1.5.0
  • Add comprehensive crawler test suite (25 new scenarios + central report endpoint) under the "Crawler" badge. Scenarios cover:
      - politeness: rate limiting, crawl-delay, meta robots, X-Robots-Tag, canonical
      - crawler traps: session vortex, infinite calendar, redirect chain, redirect loops, fragment collapse, URL normalization, canonical dedup
      - link discovery: JS-only, JSON-LD, data-href, HTTP Link header, base tag
      - content-type edge cases: 50MB huge page, slow-drip streaming, wrong Content-Length, external redirect, mixed HTTP/HTTPS
      - sitemaps: sitemap index with 50 children + dead link + gzipped variant
      - auth/state flows: Basic auth wall, cookie consent
  • Add /crawler-test-report, a central JSON assertion endpoint aggregating all trap hits. Use POST /crawler-test-report/reset to clear between runs; supports a ?trap=<name> filter. See app/web/CRAWLER_TEST_SUITE.md for per-scenario details.
  • Add Crawl-delay: 2 group for /slow-section/ in robots.txt.
v1.4.0
  • Add /robots-disallowed robots.txt compliance trap for crawler test suites — URL is linked from homepage scenarios but Disallow'd in /robots.txt. Asserts via X-Robots-Trap header and ROBOTS_TXT_VIOLATION body token. Also rewrote robots.txt to valid RFC 9309 syntax and added a Sitemap: directive.
  • Add /challenge-download page for testing challenge-bypass and file-download scenarios (a Cloudflare Turnstile-style challenge served with a 403 status that leads to an attachment download)
  • Add /challenge-download/interactive variant for interactive challenge (requires click)
  • Add /challenge-download/file direct download endpoint with configurable status code and file type
v1.3.1
  • Add relative_url=true query param to render pages with relative URLs instead of absolute ones
  • Add vertical/horizontal tables to /product/n pages
  • Add breadcrumb navigation to /product/n pages where URLs are always relative
v1.3.0
  • Change /login page to no longer prefill credentials or show the cookie popup by default; both behaviors remain available via the prefill and cookies URL flags
  • Add testimonial summary widget to /testimonials
  • Add similar products widget to /product/n pages
  • Add /sitemap.xml and /robots.txt endpoints
  • Add PDF download link and js powered button to the /login page
  • Add /blocked page which emulates a redirect to a block page served with a 200 status. This endpoint also supports a ?persist URL parameter flag that persists blocking via a blocked=true cookie
  • Add /credentials page (linked on /login) which redirects to /blocked unless the Referer header is set to https://web-scraping.dev/login
  • Add GraphQL endpoint at /api/graphql
  • Add product review objects and Relay-style pagination to the GraphQL API
  • Add /reviews page which uses GraphQL Relay-style pagination
  • Add data-testid markup to /reviews to simulate common automated-web-test markup that is ideal for scraper parsing
  • Add target=_blank links and window.open(url, "_blank") URLs to /reviews to simulate the common pattern of forcing links to open in a new page
v1.2.0
  • Change the header requirement for /api/reviews to require only the x-csrf-token header (secret-csrf-token-123)
  • Change the header requirement for /api/testimonials to require only the referer header (https://web-scraping.dev/testimonials)
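The header-locked endpoints above can be exercised with requests like the following sketch; the header names and values come straight from the changelog entries, while everything else (function names, use of urllib) is illustrative.

```python
import urllib.request

def reviews_request() -> urllib.request.Request:
    """Request for /api/reviews, which is gated on the x-csrf-token header."""
    return urllib.request.Request(
        "https://web-scraping.dev/api/reviews",
        headers={"x-csrf-token": "secret-csrf-token-123"},
    )

def testimonials_request() -> urllib.request.Request:
    """Request for /api/testimonials, which is gated on the referer header."""
    return urllib.request.Request(
        "https://web-scraping.dev/api/testimonials",
        headers={"Referer": "https://web-scraping.dev/testimonials"},
    )
```

Omitting either header should produce the endpoint's blocked response, which is the behavior these scenarios exist to test.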
v1.1.0
  • Add cookie popup modal to /login
  • Add cart system: see the cart preview button at the top and the /cart endpoint; enable the add-to-cart button on products. Carts are implemented purely in client-side JS and demo Local Storage
  • Add header requirements to /api/reviews for Referer and X-Csrf-Token to demo header locking
  • Add multi-product request API via POST to /api/products with multiple id values, e.g. {"id": [1,2,3,4]}
  • Improve styling, especially for mobile
  • Improve openapi docs with examples, default values and more info (/docs)
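The multi-product API from v1.1.0 takes a JSON body listing several product ids at once; a request for it might be built like this (the exact response shape is not specified here, so only the request side is sketched):

```python
import json
import urllib.request

def products_request(ids: list[int]) -> urllib.request.Request:
    """POST /api/products with multiple product ids in one call,
    using the {"id": [...]} body shape from the changelog."""
    return urllib.request.Request(
        "https://web-scraping.dev/api/products",
        data=json.dumps({"id": ids}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Batching ids this way trades one round-trip per product for a single request, which is the pattern this scenario demonstrates.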