web-scraping.dev

web-scraping.dev is a mock website for testing and learning web scraping. It covers popular web patterns encountered in web scraping; see the Scenarios section for details.

Scenarios

web-scraping.dev implements many web patterns encountered in modern web scraping:

Static Paging

/products

HTML-based server-side item paging where each page is its own URL.
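
As a rough sketch of how static paging can be enumerated, the snippet below builds one URL per listing page. The `page` query parameter name and the page count are assumptions for illustration; the description above only says that each page has its own URL.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://web-scraping.dev/"


def product_page_urls(last_page: int, param: str = "page") -> list[str]:
    """Build one URL per listing page, from page 1 through last_page.

    The `page` parameter name is an assumption, not confirmed by the text.
    """
    listing = urljoin(BASE_URL, "products")
    return [f"{listing}?{urlencode({param: n})}" for n in range(1, last_page + 1)]


urls = product_page_urls(3)
```

Each URL can then be fetched and parsed independently, which is what makes this paging style the simplest one to scrape.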

Endless Scroll Paging

/testimonials

Dynamic client-side paging where new items appear as the user scrolls.

Secret API token

/testimonials

The testimonial paging shows how X-Secret-Token is used to lock access to hidden APIs.

Endless Button Paging

/reviews

Dynamic client-side paging where new items appear when the user presses the Load More button.

GraphQL Background Requests

/reviews

Data loaded using JavaScript through a backend GraphQL API.
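
The changelog notes that the GraphQL reviews use Relay-type paging. A minimal sketch of a Relay-style cursor loop might look like the following. The query text and field names (`reviews`, `id`, `text`) are hypothetical and follow the general Relay connection convention rather than a confirmed schema of this site. The `fetch` callable is injected so the loop can be exercised without a network.

```python
from typing import Callable, Iterator, Optional

# Hypothetical query: field names follow the general Relay connection
# convention, not a confirmed schema of this site.
REVIEWS_QUERY = """
query Reviews($first: Int, $after: String) {
  reviews(first: $first, after: $after) {
    edges { node { id text } }
    pageInfo { hasNextPage endCursor }
  }
}
"""


def iter_relay_nodes(
    fetch: Callable[[str, dict], dict],
    query: str = REVIEWS_QUERY,
    field: str = "reviews",
    page_size: int = 10,
) -> Iterator[dict]:
    """Yield every node by following endCursor until hasNextPage is false.

    `fetch(query, variables)` must return the decoded "data" object of a
    GraphQL response.
    """
    cursor: Optional[str] = None
    while True:
        data = fetch(query, {"first": page_size, "after": cursor})
        connection = data[field]
        for edge in connection["edges"]:
            yield edge["node"]
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            break
        cursor = page_info["endCursor"]
```

In real use, `fetch` would POST the query and variables as JSON to the GraphQL endpoint and return the `data` key of the response.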

Forced New Tab Links

/reviews

The review policy and code of conduct links use different techniques to force opening links in a new tab.
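
One way to collect such links without a browser is to look for both patterns in the markup. The sketch below uses only the standard library. The sample HTML and its paths are hypothetical and only illustrate the two techniques named in the changelog (`target=_blank` and `window.open`).

```python
import re
from html.parser import HTMLParser

# Matches the first string argument of a window.open(...) call.
WINDOW_OPEN_RE = re.compile(r"""window\.open\(\s*['"]([^'"]+)['"]""")


class NewTabLinkParser(HTMLParser):
    """Collect URLs that open in a new tab via target=_blank or window.open."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("target") == "_blank" and attr.get("href"):
            self.urls.append(attr["href"])
        # Inline handlers such as onclick may call window.open directly.
        match = WINDOW_OPEN_RE.search(attr.get("onclick") or "")
        if match:
            self.urls.append(match.group(1))


# Hypothetical markup for illustration; the real page may differ.
SAMPLE_HTML = """
<a href="/review-policy" target="_blank">Review policy</a>
<a href="#" onclick="window.open('/code-of-conduct', '_blank')">Code of conduct</a>
"""

parser = NewTabLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.urls)
```

Extracting the target URL directly avoids having to manage extra tabs in a browser automation tool.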

Product HTML Markup

/product/1

Basic e-commerce product structure and CSS class-based markup.

Hidden Web Data

/product/1

Product review data is embedded in the HTML as JSON, then rendered into the page on load.
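
A minimal sketch of recovering such embedded data, assuming it sits in a `<script type="application/json">` element (the element type, the `reviews-data` id, and the record fields are assumptions for illustration):

```python
import json
from html.parser import HTMLParser


class JsonScriptParser(HTMLParser):
    """Collect the decoded contents of <script type="application/json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_script = False
        self._buffer: list[str] = []
        self.payloads: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self.payloads.append(json.loads("".join(self._buffer)))
            self._in_json_script = False


# Hypothetical markup for illustration; the id and fields are assumptions.
SAMPLE_HTML = """
<div id="reviews"></div>
<script type="application/json" id="reviews-data">[{"id": "r1", "rating": 5}]</script>
"""

parser = JsonScriptParser()
parser.feed(SAMPLE_HTML)
reviews = parser.payloads[0]
```

Reading the JSON directly is usually more robust than parsing the rendered review elements, since the data arrives already structured.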

Local Storage

/product/1

The cart system is powered by local storage to demo client-side state storage (an alternative to cookies).

CSRF Token Locks

/product/1

The Load More reviews action uses an X-CSRF-Token header to block cross-site access.

Cookie-Based Login
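
As an illustration, a request carrying this header could be built as follows. The endpoint and token value are the ones listed in the v1.2.0 changelog entry; the `product_id` and `page` query parameter names are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://web-scraping.dev/api/reviews"
CSRF_TOKEN = "secret-csrf-token-123"  # value listed in the v1.2.0 changelog entry


def build_reviews_request(product_id: int, page: int) -> Request:
    """Build (but do not send) a GET request carrying the CSRF header."""
    # product_id and page are assumed parameter names, used for illustration.
    query = urlencode({"product_id": product_id, "page": page})
    return Request(f"{API_URL}?{query}", headers={"X-CSRF-Token": CSRF_TOKEN})


request = build_reviews_request(product_id=1, page=2)
```

The request can be sent with `urllib.request.urlopen(request)`. Building it separately keeps the header logic easy to inspect.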

/login

User authentication based on a form request and cookies.

PDF Downloads

/login

The login page features link- and JS-based file download triggers.

Cookie Popup

/login?cookies

Cookie info modal popup that blocks the entire screen.

Example Block Page

/blocked

A redirect that ends on a block notification page served with a valid 200 status.

Blocking Redirect for Invalid Referer

/credentials

The credentials page can only be accessed with a valid Referer header; otherwise it redirects to /blocked. So this page can only be reached by following the link on /login.

Persistent Cookie-Based Blocking
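
A short sketch of how a scraper might satisfy this check: set the Referer header to the login URL named in the changelog, then inspect the final URL after redirects. The helper names are illustrative, not part of the site.

```python
from urllib.parse import urlsplit
from urllib.request import Request

LOGIN_URL = "https://web-scraping.dev/login"
CREDENTIALS_URL = "https://web-scraping.dev/credentials"


def build_credentials_request() -> Request:
    """Build (but do not send) a request that claims to come from /login."""
    return Request(CREDENTIALS_URL, headers={"Referer": LOGIN_URL})


def was_blocked(final_url: str) -> bool:
    """Return True if the response ended up on the /blocked page."""
    return urlsplit(final_url).path.rstrip("/") == "/blocked"


request = build_credentials_request()
```

After sending the request with `urllib.request.urlopen`, passing the response's `geturl()` value to `was_blocked` indicates whether the redirect happened. This matters because the block page itself returns a 200 status.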

/blocked

Using cookies to mark blocked clients for persistent blocking.
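
The changelog describes this mark as a `blocked=true` cookie. A minimal sketch for detecting it in a Cookie or Set-Cookie header value, using only the standard library (the function name is illustrative):

```python
from http.cookies import SimpleCookie


def is_marked_blocked(cookie_header: str) -> bool:
    """Return True if the header value carries a blocked=true cookie."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("blocked")
    return morsel is not None and morsel.value == "true"


marked = is_marked_blocked("blocked=true; Path=/")
```

A scraper that sees this cookie would typically discard its cookie jar or session before retrying, since the mark persists across requests.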

Changelog

v1.3.1

Add query parameter relative_url=true to render pages with relative URLs instead of absolute ones
Add vertical/horizontal table on /product/n pages
Add breadcrumb navigation on /product/n pages where URLs are always relative
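
Relative URLs have to be resolved against the page they were found on before they can be requested. A small sketch using the standard library (the sample hrefs are made up for illustration):

```python
from urllib.parse import urljoin


def absolutize(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each href against the URL of the page it was scraped from."""
    return [urljoin(page_url, href) for href in hrefs]


links = absolutize(
    "https://web-scraping.dev/product/1?relative_url=true",
    ["/products", "2", "https://example.com/x"],
)
```

Root-relative paths replace the whole path, bare segments replace the last path segment, and absolute URLs pass through unchanged.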

v1.3.0

Change /login page to no longer be prefilled or show the cookie popup by default; both behaviors remain available through the URL flags cookies and prefill
Add testimonial summary widget to /testimonials
Add similar products widget to /product/n pages
Add /sitemap.xml and /robots.txt endpoints
Add PDF download link and JS-powered button to the /login page
Add /blocked page which emulates a redirect to a 200-status block page. This endpoint also supports the ?persist URL parameter flag for persisting blocking through a blocked=true cookie.
Add /credentials page (linked on /login) which redirects to /blocked if Referer header is not set to https://web-scraping.dev/login
Add GraphQL endpoint at /api/graphql
Add product review objects and Relay-type paging to GraphQL
Add /reviews page which uses GraphQL Relay-type paging
Add data-testid markup to /reviews to simulate common automated-test markup that is ideal for parsing when scraping
Add target=_blank links and window.open(url, "_blank") URLs to /reviews that simulate a common pattern of forcing links to open in a new tab

v1.2.0

Change header requirement for /api/reviews to require only x-csrf-token header (secret-csrf-token-123)
Change header requirement for /api/testimonials to require only referer header (https://web-scraping.dev/testimonials)

v1.1.0

Add cookie popup modal to /login
Add cart system: see cart preview button at the top and the /cart endpoint; enable add to cart button on products. Carts are purely JS and are used to demo Local Storage
Add header requirements to /api/reviews for Referer and X-Csrf-Token to demo header locking
Add multi-product request API via POST to /api/products with multiple id values, e.g. {"id": [1,2,3,4]}
Improve styling, especially for mobile
Improve OpenAPI docs with examples, default values and more info (/docs)
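
As an illustration of the multi-product request mentioned in this release, the sketch below builds a POST to /api/products with the body shape shown in the changelog. Sending the body as JSON with an `application/json` Content-Type is an assumption; the /docs page would be the place to confirm the exact contract.

```python
import json
from urllib.request import Request

PRODUCTS_API = "https://web-scraping.dev/api/products"


def build_products_request(ids: list[int]) -> Request:
    """Build (but do not send) a POST request asking for several products."""
    body = json.dumps({"id": ids}).encode("utf-8")
    return Request(
        PRODUCTS_API,
        data=body,
        # Assumed content type for a JSON body; not stated in the changelog.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_products_request([1, 2, 3, 4])
```

The request can be sent with `urllib.request.urlopen(request)`.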