HTML Parsing and Scraping Orientation
HTML parsing means reading an HTML document as a tree of elements instead of treating it as a plain string. Scraping means fetching pages from a site you do not control and extracting data from them.
These are different skills. Parsing is a technical problem; scraping also raises legal, ethical, operational, and reliability concerns. Prefer an official API or export when one exists.
Parse HTML with DOMDocument
Use an HTML parser instead of regex for document structure.
<?php
declare(strict_types=1);
$html = '<main><h1>PHP From Zero</h1><p>Learn PHP step by step.</p></main>';
$document = new DOMDocument();
$document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$heading = $document->getElementsByTagName('h1')->item(0)?->textContent;
echo $heading . PHP_EOL;
// Prints:
// PHP From Zero
HTML found in the wild is often imperfect. DOMDocument can still parse many documents, but your extraction code must handle missing nodes.
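The examples in this lesson silence parser messages with LIBXML_NOERROR | LIBXML_NOWARNING. When you want to see what the parser complained about, one option is to collect the messages instead. The sketch below uses the standard libxml_use_internal_errors(), libxml_get_errors(), and libxml_clear_errors() functions; the helper name parseHtmlWithDiagnostics is hypothetical, and the number and wording of messages depend on the installed libxml2 version, so no count is shown.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: parse HTML and return the document together with
// the parser's messages instead of discarding them.
function parseHtmlWithDiagnostics(string $html): array
{
    // Buffer libxml messages internally rather than emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
    }

    // Empty the buffer and restore the previous setting for other code.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$document, $messages];
}

// The <p> element is never closed, but the heading is still reachable.
[$document, $messages] = parseHtmlWithDiagnostics(
    '<main><h1>PHP From Zero</h1><p>Unclosed paragraph</main>'
);

echo $document->getElementsByTagName('h1')->item(0)?->textContent ?? 'missing';
echo PHP_EOL;
echo count($messages) . ' parser message(s) collected' . PHP_EOL;
```

Logging these messages next to the source URL can help explain why an extraction started returning nothing.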
Use XPath for targeted extraction
XPath is useful when you need specific elements, selected by tag, attribute, or position in the tree.
<?php
declare(strict_types=1);
$html = '<article><h2>Arrays</h2><a class="lesson" href="/arrays">Open</a></article>';
$document = new DOMDocument();
$document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$xpath = new DOMXPath($document);
$link = $xpath->query('//a[contains(@class, "lesson")]')->item(0);
echo $link instanceof DOMElement ? $link->getAttribute('href') : 'missing';
echo PHP_EOL;
// Prints:
// /arrays
Use selectors that match stable structure, not incidental text or layout details.
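One detail worth checking is how a class is matched. contains(@class, "lesson") is a substring test, so it also matches class names such as lessons-archive. The sketch below compares it with a whole-token test built from the XPath 1.0 functions concat() and normalize-space(). The helper hrefsMatching and the sample markup are made up for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: run an XPath expression and collect href values.
function hrefsMatching(DOMXPath $xpath, string $expression): array
{
    $hrefs = [];
    $nodes = $xpath->query($expression);
    if ($nodes === false) {
        return $hrefs; // the expression itself was invalid
    }
    foreach ($nodes as $node) {
        if ($node instanceof DOMElement) {
            $hrefs[] = $node->getAttribute('href');
        }
    }
    return $hrefs;
}

$html = '<nav><a class="lesson featured" href="/arrays">Arrays</a>'
    . '<a class="lessons-archive" href="/archive">Archive</a></nav>';

$document = new DOMDocument();
$document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$xpath = new DOMXPath($document);

// Substring test: any class attribute containing the letters "lesson".
$substring = hrefsMatching($xpath, '//a[contains(@class, "lesson")]');

// Whole-token test: pad the class list with spaces, then look for " lesson ".
$wholeToken = hrefsMatching(
    $xpath,
    '//a[contains(concat(" ", normalize-space(@class), " "), " lesson ")]'
);

echo implode(', ', $substring) . PHP_EOL;
echo implode(', ', $wholeToken) . PHP_EOL;
// Prints:
// /arrays, /archive
// /arrays
```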
Extract safely when nodes are missing
Scraped HTML changes. Missing data should produce a controlled result, not a fatal error.
<?php
declare(strict_types=1);
function firstHeadingText(string $html): ?string
{
    $document = new DOMDocument();
    $document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
    $heading = $document->getElementsByTagName('h1')->item(0);
    return $heading ? trim($heading->textContent) : null;
}
echo firstHeadingText('<main><p>No heading</p></main>') ?? 'heading missing';
echo PHP_EOL;
// Prints:
// heading missing
This is the difference between a scraper that crashes on every small layout change and one that reports a useful error.
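When a page has several fields, it can help to report which ones were missing instead of returning a single null. The following is a minimal sketch of that idea; the function name extractLessonSummary, the field names, and the result shape are assumptions made for this example.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: extract two fields and list the ones that were absent.
function extractLessonSummary(string $html): array
{
    $document = new DOMDocument();
    $document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

    $heading = $document->getElementsByTagName('h2')->item(0);
    $link = $document->getElementsByTagName('a')->item(0);

    $fields = [
        'title' => $heading ? trim($heading->textContent) : null,
        'href' => $link instanceof DOMElement && $link->hasAttribute('href')
            ? $link->getAttribute('href')
            : null,
    ];

    // Keep the names of fields that are null or empty.
    $missing = array_keys(array_filter(
        $fields,
        static fn (?string $value): bool => $value === null || $value === ''
    ));

    return ['fields' => $fields, 'missing' => $missing];
}

$result = extractLessonSummary('<article><h2>Arrays</h2><p>No link yet</p></article>');

echo $result['missing'] === [] ? 'complete' : 'missing: ' . implode(', ', $result['missing']);
echo PHP_EOL;
// Prints:
// missing: href
```

A caller can log the missing field names together with the source URL, which makes a layout change much easier to diagnose.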
Normalise extracted values
HTML text often contains extra spaces and line breaks.
<?php
declare(strict_types=1);
function normaliseHtmlText(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text) ?? '');
}
echo normaliseHtmlText(" Price:\n £24.99 ") . PHP_EOL;
// Prints:
// Price: £24.99
Normalise after extraction, then validate the value against your application's rules.
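As an illustration of that second step, the sketch below turns the normalised price text into an integer number of pence and rejects anything that does not fit the expected shape. The rule (a pound sign, digits, and exactly two decimal places) and the name parsePriceInPence are assumptions for this example, not a general price parser.

```php
<?php
declare(strict_types=1);

function normaliseHtmlText(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text) ?? '');
}

// Hypothetical application rule: a price looks like £<digits>.<two digits>.
// Returns the amount in pence, or null when the text does not match.
function parsePriceInPence(string $text): ?int
{
    if (preg_match('/£(\d+)\.(\d{2})\b/u', $text, $matches) !== 1) {
        return null;
    }

    return (int) $matches[1] * 100 + (int) $matches[2];
}

$price = parsePriceInPence(normaliseHtmlText(" Price:\n £24.99 "));
echo ($price ?? 'invalid price') . PHP_EOL;

$unknown = parsePriceInPence(normaliseHtmlText(' Price: TBC '));
echo ($unknown ?? 'invalid price') . PHP_EOL;
// Prints:
// 2499
// invalid price
```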
Scraping needs operational rules
Before scraping a site, check whether it is allowed, whether an API exists, and how often you can request pages.
<?php
declare(strict_types=1);
$scrapingPolicy = [
    'usesOfficialApiWhenAvailable' => true,
    'hasRateLimit' => true,
    'hasUserAgent' => true,
    'storesSourceUrl' => true,
];
echo $scrapingPolicy['hasRateLimit'] ? 'rate limited' : 'unsafe';
echo PHP_EOL;
// Prints:
// rate limited
Real scraping code should set a clear user agent, respect rate limits, honour robots.txt and the site's terms where applicable, cache responsibly, and log source URLs for debugging.
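Two of those rules, a clear user agent and a minimum delay between requests, can be sketched with PHP's built-in HTTP stream context. The helper names, the bot name, the example.com URL, and the two-second interval are all placeholders. The example makes no network request; the actual fetch is shown only as a comment.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: how long to pause before the next request so that
// requests are at least $minIntervalSeconds apart.
function secondsToWait(float $lastRequestAt, float $now, float $minIntervalSeconds): float
{
    return max(0.0, $minIntervalSeconds - ($now - $lastRequestAt));
}

// Hypothetical helper: an HTTP stream context that identifies the client
// and gives up after a fixed timeout.
function politeHttpContext(string $userAgent, float $timeoutSeconds = 10.0)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout' => $timeoutSeconds,
        ],
    ]);
}

$context = politeHttpContext('ExampleLessonBot/1.0 (+https://example.com/bot)');
$options = stream_context_get_options($context);

echo $options['http']['user_agent'] . PHP_EOL;
echo secondsToWait(100.0, 100.5, 2.0) . ' seconds to wait' . PHP_EOL;
// Prints:
// ExampleLessonBot/1.0 (+https://example.com/bot)
// 1.5 seconds to wait

// A real fetch would then look roughly like this:
// usleep((int) round(secondsToWait($lastRequestAt, microtime(true), 2.0) * 1_000_000));
// $html = file_get_contents($url, false, $context);
```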
Do not trust scraped content
Scraped content is external input. Escape it before displaying it and validate it before storing it as structured data.
<?php
declare(strict_types=1);
$scrapedTitle = '<script>alert("x")</script>';
echo htmlspecialchars($scrapedTitle, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . PHP_EOL;
// Prints:
// &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;
Parsing HTML does not make the text safe for your own HTML output.
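Escaping covers display. For values stored as structured data, such as link targets, a validation step is also worth having. The sketch below accepts only http, https, and site-relative paths. That policy and the name isAllowedLinkTarget are assumptions for this example; it is deliberately strict and expects an already trimmed value.

```php
<?php
declare(strict_types=1);

// Hypothetical policy: keep http(s) URLs and paths starting with a single "/".
function isAllowedLinkTarget(string $href): bool
{
    $scheme = parse_url($href, PHP_URL_SCHEME);

    if ($scheme === false) {
        return false; // parse_url() could not parse the value at all
    }

    if ($scheme === null) {
        // No scheme: accept "/path" but not protocol-relative "//host/path".
        return str_starts_with($href, '/') && !str_starts_with($href, '//');
    }

    return in_array(strtolower($scheme), ['http', 'https'], true);
}

foreach (['/arrays', 'https://example.com/arrays', 'javascript:alert(1)'] as $href) {
    echo $href . ' => ' . (isAllowedLinkTarget($href) ? 'keep' : 'reject') . PHP_EOL;
}
// Prints:
// /arrays => keep
// https://example.com/arrays => keep
// javascript:alert(1) => reject
```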
What to remember
Use DOM parsing for structure, XPath for targeted extraction, handle missing nodes, normalise and validate extracted values, and treat scraping as an external integration with permission, rate limits, logging, and failure handling.
Practice
Task: Extract lesson links from HTML
Write a small parser for lesson links in an HTML fragment.
Requirements
- Use declare(strict_types=1);.
- Parse the HTML with DOMDocument.
- Use DOMXPath to find links with class lesson.
- Extract each link's text and href.
- Trim and normalise the link text.
- Skip links with an empty href.
- Print the extracted links.
- Show a missing-link case that returns an empty list.
- Include the expected output as comments in the same PHP code block.
Do not use regex to parse the HTML structure.
Solution
<?php
declare(strict_types=1);
function normaliseHtmlText(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text) ?? '');
}
function extractLessonLinks(string $html): array
{
    $document = new DOMDocument();
    $document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
    $xpath = new DOMXPath($document);
    $links = [];
    foreach ($xpath->query('//a[contains(concat(" ", normalize-space(@class), " "), " lesson ")]') as $node) {
        if (!$node instanceof DOMElement) {
            continue;
        }
        $href = trim($node->getAttribute('href'));
        if ($href === '') {
            continue;
        }
        $links[] = [
            'title' => normaliseHtmlText($node->textContent),
            'href' => $href,
        ];
    }
    return $links;
}
$links = extractLessonLinks('<nav><a class="lesson" href="/arrays"> Arrays </a><a class="lesson" href="">Broken</a></nav>');
foreach ($links as $link) {
    echo $link['title'] . ' -> ' . $link['href'] . PHP_EOL;
}
echo count(extractLessonLinks('<main><p>No links</p></main>')) . ' links' . PHP_EOL;
// Prints:
// Arrays -> /arrays
// 0 links
The solution uses DOM and XPath for structure, normalises text after extraction, skips unusable links, and handles the empty result without an error.