What to do when websites change and your spider doesn't know

May 11, 2026 · Originally published by Dev.to

Empty-field-rate monitoring catches selectors that return nothing. It does not catch selectors that return something wrong. The most damaging form of schema drift is the kind where a selector keeps producing values, the values are syntactically reasonable, and they are no longer the values you wanted. A price selector that quietly starts returning the financing instalment instead of the sticker price will pass every non-empty check while corrupting your data for as long as the drift goes unnoticed. That is the failure mode this post is about.

Prices change

Drift comes in several flavours

People talk about "schema drift" as if it were one thing, but in scraping practice there are several kinds of drift, each of which fails differently and demands a different defence.

Drift type          | Meaning                                  | Example
DOM/layout drift    | The page structure changes               | Product cards move from table rows to grid cards
Data contract drift | The meaning or format of a field changes | Price changes from numeric text to "Contact us"
Navigation drift    | Discovery paths change                   | Pagination links disappear, replaced by infinite scroll
Output schema drift | The spider output changes shape          | A field is renamed or removed in the item definition

The first kind is the most familiar. The second is the most dangerous. When extraction returns nothing, you can catch it with a simple non-empty assertion. When extraction returns something plausible but wrong, the validation has to be semantic, and most pipelines have nothing in place to do that work.

Consider the financing-price example. Before the redesign, your selector .product-price matched a <span> containing the value $129.99. After the redesign, the same class name is reused for a marketing element that displays $11/mo with affirm. Your extractor still returns a string. The string still contains a dollar sign and a number. A naive validator looks at it, decides it is a price, and accepts it. The data is wrong, but nothing in the pipeline knows that.
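To make that concrete, here is a minimal sketch (both HTML fragments are invented for illustration) showing that a naive "has a dollar sign and a digit" check accepts the drifted value just as readily as the real one:

```python
from lxml import html

# Invented fragments for illustration: same class name, different meaning.
BEFORE = '<div><span class="product-price">$129.99</span></div>'
AFTER = '<div><span class="product-price">$11/mo with affirm</span></div>'

def extract_price(page):
    return html.fromstring(page).xpath("//span[@class='product-price']/text()")[0]

def naive_validate(value):
    # "contains a dollar sign and a digit": the kind of check that misses drift
    return "$" in value and any(c.isdigit() for c in value)

print(naive_validate(extract_price(BEFORE)))  # True
print(naive_validate(extract_price(AFTER)))   # True: wrong value, still accepted
```

Both fetches extract cleanly and both pass validation; only the second one poisons the dataset.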

The dangerous failure is not always when extraction returns nothing. It is when extraction returns something plausible but wrong.

The empty-field-rate metric from the previous post in this series will catch DOM drift that produces blanks. It will not catch data contract drift that produces something that just looks like a real value. For that, you need an extra layer of defence.

Structural fingerprints as smoke alarms

One way to catch a site change before the data goes wrong is to monitor the page structure itself, separately from the data you extract. The basic idea is simple: hash a fragment of the page that should remain stable, store the hash as a baseline, and compare future fetches against it. If the hash changes, something about the page changed, and you have an early warning.

The naive implementation, hashing the raw HTML of the page or the product container, is too noisy to be useful. Modern pages contain rotating ads, A/B test variants, randomised CSS class names from build tools, recommendation widgets, inventory banners, and inline analytics scripts, all of which change between requests without anything meaningful changing on the page. A raw hash will fire constantly and you will learn to ignore it.

The better pattern is a normalised structural fingerprint. The goal is to capture the shape of the page, the hierarchy of tags and the semantic attributes, while discarding everything that varies cosmetically.

from hashlib import sha256
from lxml import html
from copy import deepcopy

VOLATILE_TAGS = {"script", "style", "noscript", "iframe"}
VOLATILE_ATTRS_PREFIX = ("data-track", "data-analytics", "data-test-id-")

def normalize_subtree(element):
    """Return a string representation of structure only, not content or noise."""
    el = deepcopy(element)

    # remove volatile tags entirely (materialise the iterator first:
    # removing nodes while iterating over them can skip siblings)
    for tag in VOLATILE_TAGS:
        for node in list(el.iter(tag)):
            parent = node.getparent()
            if parent is not None:
                parent.remove(node)

    parts = []
    for node in el.iter():
        if not isinstance(node.tag, str):
            continue  # skip comments and processing instructions
        # keep the tag plus stable semantic attributes only
        attrs = []
        for k, v in sorted(node.attrib.items()):
            if k.startswith("aria-") or k in {"role", "itemprop", "itemtype"}:
                attrs.append(f"{k}={v}")
            elif k.startswith("data-") and not k.startswith(VOLATILE_ATTRS_PREFIX):
                # intentional data-* attributes are kept; tracking ones are not
                attrs.append(f"{k}={v}")
        parts.append(f"<{node.tag} {' '.join(attrs)}>")
    return "".join(parts)

def fingerprint(html_str, container_xpath):
    tree = html.fromstring(html_str)
    container = tree.xpath(container_xpath)
    if not container:
        return None
    return sha256(normalize_subtree(container[0]).encode()).hexdigest()

The principle behind the normalisation is to keep the things that should be stable across requests (tag hierarchy, ARIA roles, microdata attributes, intentional data-* attributes) and drop the things that are not (text content, generated class names, scripts, ads, tracking IDs). What remains is a structural fingerprint that changes when the developer of the target site changes the page, and is mostly stable otherwise.
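Wiring the fingerprint into a baseline comparison can be as small as the sketch below; the JSON file location and the `on_change` review hook are assumptions, not part of any particular framework:

```python
import json
from pathlib import Path

BASELINE_FILE = Path("fingerprints.json")  # hypothetical baseline store

def check_fingerprint(site_id, current_hash, on_change):
    """Compare a fingerprint hash against the stored baseline for site_id."""
    baselines = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    previous = baselines.get(site_id)

    if previous is not None and current_hash != previous:
        # smoke alarm, not a verdict: file a review task, keep crawling
        on_change(site_id, previous, current_hash)

    baselines[site_id] = current_hash
    BASELINE_FILE.write_text(json.dumps(baselines, indent=2))
```

This version auto-updates the baseline after flagging a change; a stricter variant would update it only after a human confirms the new structure is expected, which avoids a drifted page silently becoming the new normal.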

A note on A/B testing: even with normalisation, a single hash mismatch is not a reliable signal of a real change. The site might be serving you a different test variant than the one you fingerprinted last week, and the difference is genuine without being a redesign. The right pattern is to sample more than one fetch before concluding that drift has occurred, and to treat a single mismatch as a prompt for review rather than an automatic alert.

Use fingerprints as smoke alarms, not verdicts. When the hash changes, fire a review task. Do not abort the crawl, do not roll back the deployment, and do not page anyone in the middle of the night. The fingerprint is telling you to look at the page; it is not telling you the page is broken.
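The sampling rule from the A/B caveat can be encoded directly. Here `fetch_hash` is any zero-argument callable that fetches the page and fingerprints it, and the three-way outcome is an illustrative convention rather than an established API:

```python
def confirmed_drift(baseline_hash, fetch_hash, samples=3):
    """Classify drift by sampling several fresh fetches against the baseline.

    A single mismatch could be an A/B variant; only unanimous disagreement
    is treated as a confirmed structural change.
    """
    mismatches = sum(1 for _ in range(samples) if fetch_hash() != baseline_hash)
    if mismatches == 0:
        return "stable"
    if mismatches < samples:
        return "review"   # mixed results: likely A/B variants, have a human look
    return "drifted"      # every sample disagrees: the change is real
```

"review" and "drifted" can both feed the same review queue; the distinction is only about how loudly to flag the task.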

Live canary checks before production runs

The fingerprint catches changes after they happen. The canary check catches them before they cost you a full crawl of bad data. The pattern is straightforward: pick a small, stable set of representative URLs, fetch them, run your current extraction logic against them, and assert that the critical fields come back with plausible values.

import pytest
import requests
from myproject.extractors import extract_product

CANARY_URLS = [
    "https://example.com/product/12345",
    "https://example.com/product/67890",
]

@pytest.mark.parametrize("url", CANARY_URLS)
def test_extraction_canary(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    item = extract_product(response.text)

    assert item["title"], f"empty title for {url}"
    assert item["price"], f"empty price for {url}"
    assert _looks_like_price(item["price"]), (
        f"price {item['price']!r} for {url} does not look like a price"
    )
    assert item["availability"] in {"in_stock", "out_of_stock", "preorder"}, (
        f"unexpected availability {item['availability']!r} for {url}"
    )

def _looks_like_price(value):
    import re
    # accepts "$129.99", "1.299,00", and "129,99 €"; rejects "$11/mo with affirm"
    # (the trailing \s?[^\d\s]? allows an optional space plus one currency symbol,
    # but not a suffix like "/mo")
    return bool(re.fullmatch(
        r"[^\d\s]?\s?\d{1,3}([.,]\d{3})*([.,]\d{2})?\s?[^\d\s]?",
        value.strip(),
    ))

The semantic checks are what make this useful. Asserting that the title is non-empty is fine, but asserting that the price actually looks like a price is what catches the financing-string failure mode. The check on availability against a known set rejects values that are syntactically valid strings but no longer in the contract.

Wiring this into CI is a question of cadence. Running canary checks on every commit will produce noise from transient network issues and rate limiting. Running them on a schedule (every few hours, or before each production deployment) gives you a useful signal without the false-positive churn. Failed runs should store the fetched HTML, the extracted item, and the assertion that failed, all as artifacts you can inspect later. A canary that fires and discards the evidence is a canary that wastes your time when you go to investigate.
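One lightweight way to keep that evidence is a helper the canary test calls before its assertions. The directory matches the artifact path uploaded by the workflow below; the filename scheme is an arbitrary choice:

```python
import json
from pathlib import Path

ARTIFACT_DIR = Path("tests/canary/artifacts")  # same path CI uploads on failure

def save_canary_evidence(url, html_text, item):
    """Persist what the canary saw so a failure can be investigated later."""
    ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    (ARTIFACT_DIR / f"{slug}.html").write_text(html_text)
    (ARTIFACT_DIR / f"{slug}.json").write_text(json.dumps(item, indent=2, default=str))
```

Calling `save_canary_evidence(url, response.text, item)` right after extraction means a failing assertion leaves behind exactly the page and item that produced it; on success the files are cheap to overwrite.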

# .github/workflows/canary.yml
name: Extraction canary
on:
  schedule:
    - cron: "0 */4 * * *"
  workflow_dispatch:

jobs:
  canary:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest tests/canary -v
      - if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: canary-failures
          path: tests/canary/artifacts/

The same A/B caveat applies here. If a canary fails on a single fetch, retry on a fresh request before alerting. If it fails consistently across multiple fetches, the change is real.
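That retry-before-alerting rule is easy to factor out; the attempt count and backoff are placeholder values:

```python
import time

def canary_with_retry(check, attempts=3, backoff_seconds=60):
    """Run a canary check; re-raise only if it fails on every fresh attempt.

    `check` is any zero-argument callable that raises AssertionError on failure.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            check()
            return True  # one clean pass is enough
        except AssertionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(backoff_seconds)  # fresh request, maybe a different variant
    raise last_error
```

A check that passes on the second attempt was probably an A/B variant or a transient block; a check that fails all three times is worth an alert.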

Where this connects to the rest of the stack

If your spider is part of a system that pulls a lot of data from a small set of high-value sites, an alternative to maintaining selectors and canaries is to skip the selector-based approach for some content types entirely. Zyte API's pageContent data type, released in late 2025, is one example of a route around the problem: it returns the cleaned main content of a page without you having to maintain selectors at all, which means there is no selector to drift against. That trade-off is not right for every project, especially when you need fine-grained structured fields, but it is worth knowing about when the maintenance cost of a selector-based pipeline starts to dominate.

For pipelines that stay selector-based, the combination of structural fingerprints and canary checks is the strongest defence available. Fingerprints flag that the page changed; canaries verify that your extraction still works against the changed page. Neither is sufficient on its own, and both together still rely on the metrics from the previous post to catch the failure modes they miss.

What to do next

Pick the three or four most valuable URLs in your crawl and write canary checks for them with semantic assertions, not just non-empty checks. Add a normalised structural fingerprint for the same URLs and store the baseline. Run both on a schedule before your next production deployment. That alone will catch most of the silent-failure cases that empty-field-rate monitoring lets through.

In the final post of this series, we will look at the third leg of production-ready scraping: making sure that when something does go wrong mid-run, you can restart the crawl without duplicating data or corrupting state.
