HTML Parsing and Scraping Orientation

HTML parsing means reading an HTML document as a tree instead of treating it as a plain string. Scraping means fetching pages from somewhere else and extracting data from them.

These are different skills. Parsing is a technical task. Scraping adds legal, ethical, operational, and reliability concerns on top of it, so prefer an official API or export when one exists.

Parse HTML with DOMDocument

Use an HTML parser instead of regex for document structure.

PHP example

<?php

declare(strict_types=1);

$html = '<main><h1>PHP From Zero</h1><p>Learn PHP step by step.</p></main>';

$document = new DOMDocument();
$document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

$heading = $document->getElementsByTagName('h1')->item(0)?->textContent;

echo $heading . PHP_EOL;

// Prints:
// PHP From Zero

HTML found in the wild is often imperfect. DOMDocument can still parse many documents, but your extraction code must handle missing nodes.
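The LIBXML_NOERROR and LIBXML_NOWARNING flags silence the parser's messages entirely. As a possible alternative, the sketch below collects those messages with libxml_use_internal_errors() so that imperfect markup can be logged instead of ignored. The sample markup and variable names are illustrative. The number and wording of the collected messages depend on the installed libxml2 version, so no exact output is shown.

```php
<?php

declare(strict_types=1);

// Illustrative markup: the <p> element is never closed.
$html = '<main><h1>PHP From Zero</h1><p>Unclosed paragraph</main>';

// Collect parser messages in memory instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html);

$problems = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

$heading = $document->getElementsByTagName('h1')->item(0)?->textContent;

echo ($heading ?? 'heading missing') . PHP_EOL;
echo count($problems) . ' parser message(s) collected' . PHP_EOL;

foreach ($problems as $problem) {
    // Each entry is a LibXMLError object with line and message properties.
    echo 'line ' . $problem->line . ': ' . trim($problem->message) . PHP_EOL;
}
```

Either approach still leaves the extraction code responsible for checking that the node it wants exists.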

Use XPath for targeted extraction

XPath is useful when you need specific elements.

PHP example

<?php

declare(strict_types=1);

$html = '<article><h2>Arrays</h2><a class="lesson" href="/arrays">Open</a></article>';
$document = new DOMDocument();
$document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

$xpath = new DOMXPath($document);
$link = $xpath->query('//a[contains(@class, "lesson")]')->item(0);

echo $link instanceof DOMElement ? $link->getAttribute('href') : 'missing';
echo PHP_EOL;

// Prints:
// /arrays

Use selectors that match stable structure, not incidental text or layout details.
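One detail that affects selector stability is how a class is matched. A minimal sketch, assuming an illustrative fragment where one link has the class lesson and another has lesson-archive: a plain substring test picks up both, while padding the class list with spaces and matching the whole token picks up only the first.

```php
<?php

declare(strict_types=1);

// Illustrative fragment: only the first link is a real "lesson" link.
$html = '<nav>'
    . '<a class="lesson featured" href="/arrays">Arrays</a>'
    . '<a class="lesson-archive" href="/old">Old lessons</a>'
    . '</nav>';

$document = new DOMDocument();
$document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$xpath = new DOMXPath($document);

// Substring match: also catches "lesson-archive".
$loose = $xpath->query('//a[contains(@class, "lesson")]');

// Token match: pads the class list with spaces and looks for " lesson ".
$strict = $xpath->query('//a[contains(concat(" ", normalize-space(@class), " "), " lesson ")]');

$looseCount = $loose instanceof DOMNodeList ? $loose->length : 0;
$strictCount = $strict instanceof DOMNodeList ? $strict->length : 0;

echo 'substring match: ' . $looseCount . PHP_EOL;
echo 'token match: ' . $strictCount . PHP_EOL;

// Prints:
// substring match: 2
// token match: 1
```

The token form is longer, but it keeps working when the site adds a similarly named class.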

Extract safely when nodes are missing

Scraped HTML changes. Missing data should produce a controlled result, not a fatal error.

PHP example

<?php

declare(strict_types=1);

function firstHeadingText(string $html): ?string
{
    $document = new DOMDocument();
    $document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

    $heading = $document->getElementsByTagName('h1')->item(0);

    return $heading ? trim($heading->textContent) : null;
}

echo firstHeadingText('<main><p>No heading</p></main>') ?? 'heading missing';
echo PHP_EOL;

// Prints:
// heading missing

This is the difference between a scraper that crashes on every small layout change and one that keeps running and reports which data was missing.
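One way to make missing data reportable is to return the problems alongside the values. The sketch below is an assumption about how that could look, not a fixed pattern: extractLesson(), its array shape, and the example.com URL are all hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns the extracted fields plus a list of problems.
 *
 * @return array{title: ?string, href: ?string, errors: list<string>}
 */
function extractLesson(string $html, string $sourceUrl): array
{
    $document = new DOMDocument();
    $document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

    $heading = $document->getElementsByTagName('h2')->item(0);
    $link = $document->getElementsByTagName('a')->item(0);

    $title = $heading instanceof DOMElement ? trim($heading->textContent) : null;
    $href = $link instanceof DOMElement ? trim($link->getAttribute('href')) : null;

    $errors = [];

    if ($title === null || $title === '') {
        $errors[] = 'no <h2> title found in ' . $sourceUrl;
    }

    if ($href === null || $href === '') {
        $errors[] = 'no link href found in ' . $sourceUrl;
    }

    return ['title' => $title, 'href' => $href, 'errors' => $errors];
}

// The fragment has a title but no link.
$result = extractLesson('<article><h2>Arrays</h2></article>', 'https://example.com/lessons');

echo ($result['title'] ?? 'title missing') . PHP_EOL;
echo implode(PHP_EOL, $result['errors']) . PHP_EOL;

// Prints:
// Arrays
// no link href found in https://example.com/lessons
```

Including the source URL in each message makes it easier to find the page whose layout changed.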

Normalise extracted values

HTML text often contains extra spaces and line breaks.

PHP example

<?php

declare(strict_types=1);

function normaliseHtmlText(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text) ?? '');
}

echo normaliseHtmlText("  Price:\n  £24.99  ") . PHP_EOL;

// Prints:
// Price: £24.99

Normalise after extraction, then validate the value against the application rule.
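As a sketch of that order of operations, the example below normalises the text first and then checks it against a rule. The rule itself (a price must read "Price: £" followed by pounds and two pence digits) and the parsePriceInPence() helper are assumptions made for illustration. The file is assumed to be saved as UTF-8.

```php
<?php

declare(strict_types=1);

function normaliseHtmlText(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text) ?? '');
}

/**
 * Hypothetical rule: the text must look like "Price: £24.99".
 * Returns the amount in pence, or null when the text does not match.
 */
function parsePriceInPence(string $rawText): ?int
{
    $text = normaliseHtmlText($rawText);

    if (preg_match('/^Price: £([0-9]+)\.([0-9]{2})$/u', $text, $matches) !== 1) {
        return null;
    }

    return ((int) $matches[1] * 100) + (int) $matches[2];
}

var_export(parsePriceInPence("  Price:\n  £24.99  "));
echo PHP_EOL;
var_export(parsePriceInPence('Price: call us'));
echo PHP_EOL;

// Prints:
// 2499
// NULL
```

Returning null for text that fails the rule keeps invalid values out of stored data while still letting the caller decide how to report them.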

Scraping needs operational rules

Before scraping a site, check whether it is allowed, whether an API exists, and how often you can request pages.

PHP example

<?php

declare(strict_types=1);

$scrapingPolicy = [
    'usesOfficialApiWhenAvailable' => true,
    'hasRateLimit' => true,
    'hasUserAgent' => true,
    'storesSourceUrl' => true,
];

echo $scrapingPolicy['hasRateLimit'] ? 'rate limited' : 'unsafe';
echo PHP_EOL;

// Prints:
// rate limited

Real scraping code should set a clear user agent, respect rate limits, follow robots.txt and the site's terms where applicable, cache responsibly, and log source URLs for debugging.
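A few of these rules can be sketched with PHP's built-in HTTP stream wrapper. Everything named here is an assumption for illustration: the user agent string, the two-second interval, and the secondsToWait() and fetchPage() helpers. The example does not cover robots.txt, caching, or retries, and it only exercises the waiting rule so that it runs without network access.

```php
<?php

declare(strict_types=1);

// Hypothetical values: replace with your own bot name, contact URL, and interval.
const USER_AGENT = 'ExampleLessonBot/1.0 (+https://example.com/bot-info)';
const MIN_SECONDS_BETWEEN_REQUESTS = 2.0;

/** Seconds still to wait before the next request is allowed. */
function secondsToWait(float $lastRequestAt, float $now, float $minInterval): float
{
    return max(0.0, $minInterval - ($now - $lastRequestAt));
}

/** Fetches one page with a delay and a user agent; returns null on failure. */
function fetchPage(string $url, float &$lastRequestAt): ?string
{
    $wait = secondsToWait($lastRequestAt, microtime(true), MIN_SECONDS_BETWEEN_REQUESTS);

    if ($wait > 0.0) {
        usleep((int) round($wait * 1_000_000));
    }

    $context = stream_context_create([
        'http' => ['user_agent' => USER_AGENT, 'timeout' => 10],
    ]);

    $lastRequestAt = microtime(true);
    $body = @file_get_contents($url, false, $context);

    if ($body === false) {
        error_log('fetch failed: ' . $url); // keep the source URL for debugging
        return null;
    }

    return $body;
}

// The waiting rule can be checked without touching the network.
echo secondsToWait(10.0, 10.5, 2.0) . PHP_EOL;
echo secondsToWait(10.0, 13.0, 2.0) . PHP_EOL;

// Prints:
// 1.5
// 0
```

Keeping the timing rule in its own small function makes it testable separately from the network call.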

Do not trust scraped content

Scraped content is external input. Escape it before displaying it and validate it before storing it as structured data.

PHP example

<?php

declare(strict_types=1);

$scrapedTitle = '<script>alert("x")</script>';

echo htmlspecialchars($scrapedTitle, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . PHP_EOL;

// Prints:
// &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;

Parsing HTML does not make the text safe for your own HTML output.
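The sketch below illustrates why, using an assumed fragment in which the page author escaped some markup. The parser decodes the entities, so the extracted text contains real angle brackets again and has to be escaped before it goes into your own HTML.

```php
<?php

declare(strict_types=1);

// The source page escaped the markup, so a browser shows it as plain text.
$html = '<h1>&lt;script&gt;alert(1)&lt;/script&gt;</h1>';

$document = new DOMDocument();
$document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

// textContent returns the decoded characters, not the escaped source.
$title = $document->getElementsByTagName('h1')->item(0)?->textContent ?? '';

echo $title . PHP_EOL;
echo htmlspecialchars($title, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . PHP_EOL;

// Prints:
// <script>alert(1)</script>
// &lt;script&gt;alert(1)&lt;/script&gt;
```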

What to remember

Use DOM parsing for structure, XPath for targeted extraction, handle missing nodes, normalise and validate extracted values, and treat scraping as an external integration with permission, rate limits, logging, and failure handling.

Practice

Task: Extract lesson links from HTML

Write a small parser for lesson links in an HTML fragment.

Requirements

Use declare(strict_types=1);.
Parse the HTML with DOMDocument.
Use DOMXPath to find links with class lesson.
Extract each link's text and href.
Trim and normalise the link text.
Skip links with an empty href.
Print the extracted links.
Show a missing-link case that returns an empty list.
Include the expected output as comments in the same PHP code block.

Do not use regex to parse the HTML structure.

Solution

PHP example

<?php

declare(strict_types=1);

function normaliseHtmlText(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text) ?? '');
}

function extractLessonLinks(string $html): array
{
    $document = new DOMDocument();
    $document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
    $xpath = new DOMXPath($document);
    $links = [];

    foreach ($xpath->query('//a[contains(concat(" ", normalize-space(@class), " "), " lesson ")]') as $node) {
        if (!$node instanceof DOMElement) {
            continue;
        }

        $href = trim($node->getAttribute('href'));

        if ($href === '') {
            continue;
        }

        $links[] = [
            'title' => normaliseHtmlText($node->textContent),
            'href' => $href,
        ];
    }

    return $links;
}

$links = extractLessonLinks('<nav><a class="lesson" href="/arrays"> Arrays </a><a class="lesson" href="">Broken</a></nav>');

foreach ($links as $link) {
    echo $link['title'] . ' -> ' . $link['href'] . PHP_EOL;
}

echo count(extractLessonLinks('<main><p>No links</p></main>')) . ' links' . PHP_EOL;

// Prints:
// Arrays -> /arrays
// 0 links

The solution uses DOM and XPath for structure, normalises text after extraction, skips unusable links, and handles the empty result without an error.