Unicode and Multibyte Strings
Modern PHP applications handle names, addresses, product titles, search terms, comments, translations, and imported data from many languages. Treating every character as one byte leads to broken validation and mangled output.
PHP strings are byte sequences. UTF-8 encodes each character in one to four bytes, so byte-oriented functions such as strlen() and substr() can give the wrong answer for user-facing text. Use the mbstring extension's mb_* functions when the unit is a character rather than a byte.
Bytes and characters are different
strlen() counts bytes. mb_strlen() counts characters for a given encoding.
<?php
declare(strict_types=1);
$name = 'Élodie';
echo strlen($name) . PHP_EOL;
echo mb_strlen($name, 'UTF-8') . PHP_EOL;
// Prints:
// 7
// 6
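The first character, É, is encoded as two bytes in UTF-8 (assuming the precomposed form, U+00C9), so the byte count is one higher than the character count. That difference matters for display limits, initials, excerpts, and validation messages.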
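To see where the extra byte comes from, it can help to look at the raw bytes. The following is a minimal sketch; measureText() is a hypothetical helper name, and the "\u{C9}" escape is used so the result does not depend on how the source file was saved.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: reports both units for the same string.
function measureText(string $text): array
{
    return [
        'bytes' => strlen($text),                  // storage size in bytes
        'characters' => mb_strlen($text, 'UTF-8'), // code points
        'hex' => bin2hex($text),                   // raw bytes, for inspection
    ];
}

$sizes = measureText("\u{C9}"); // precomposed É
echo $sizes['bytes'] . ' byte(s), ' . $sizes['characters'] . ' character(s), hex ' . $sizes['hex'] . PHP_EOL;

// Prints:
// 2 byte(s), 1 character(s), hex c389
```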
Take substrings safely
Use mb_substr() when cutting user-facing text.
<?php
declare(strict_types=1);
$name = 'Élodie';
$initial = mb_substr($name, 0, 1, 'UTF-8');
echo 'Initial: ' . $initial . PHP_EOL;
// Prints:
// Initial: É
Using substr($name, 0, 1) would return only the first byte of É, not the first character. A single byte from a two-byte sequence is not valid UTF-8 on its own.
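The difference between the two cuts can be checked directly. This sketch compares the bytes each function returns and asks mb_check_encoding() whether the result is still valid UTF-8; the variable names are illustrative only.

```php
<?php
declare(strict_types=1);

$name = "\u{C9}lodie"; // 'Élodie' with a precomposed É (bytes c3 89)

$byteCut = substr($name, 0, 1);             // first byte only
$charCut = mb_substr($name, 0, 1, 'UTF-8'); // first character

echo bin2hex($byteCut) . ' valid: ' . var_export(mb_check_encoding($byteCut, 'UTF-8'), true) . PHP_EOL;
echo bin2hex($charCut) . ' valid: ' . var_export(mb_check_encoding($charCut, 'UTF-8'), true) . PHP_EOL;

// Prints:
// c3 valid: false
// c389 valid: true
```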
Normalise case with multibyte functions
Use mb_strtolower() and mb_strtoupper() for user-facing text.
<?php
declare(strict_types=1);
$city = 'MÜNCHEN';
echo mb_strtolower($city, 'UTF-8') . PHP_EOL;
// Prints:
// münchen
This is useful for search normalisation, comparisons, tags, and display cleanup.
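For comparisons, one simple approach is to lowercase both sides before checking equality. The helper below is a sketch with a hypothetical name, sameIgnoringCase(); it does not handle every language-specific rule, and on PHP 7.3 or later mb_convert_case() with MB_CASE_FOLD may be a better fit for matching.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: case-insensitive equality for UTF-8 text.
function sameIgnoringCase(string $left, string $right): bool
{
    return mb_strtolower($left, 'UTF-8') === mb_strtolower($right, 'UTF-8');
}

var_dump(sameIgnoringCase('MÜNCHEN', 'München'));
var_dump(sameIgnoringCase('MÜNCHEN', 'Bremen'));

// Prints:
// bool(true)
// bool(false)
```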
Validate UTF-8 at boundaries
When data comes from files, APIs, or old systems, check that it is valid UTF-8 before storing or displaying it.
<?php
declare(strict_types=1);
function requireUtf8(string $value): string
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        throw new InvalidArgumentException('Text must be valid UTF-8.');
    }
    return $value;
}
echo requireUtf8('Valid text') . PHP_EOL;
// Prints:
// Valid text
Invalid encoding can break JSON responses, database writes, search indexing, and HTML output.
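The JSON case is easy to reproduce. In the sketch below, a string saved by a Latin-1 system makes json_encode() fail, and converting it first fixes the problem. toUtf8() is a hypothetical helper, and the ISO-8859-1 default is an assumption: the real source encoding has to come from the system that produced the data.

```php
<?php
declare(strict_types=1);

// Hypothetical import helper. The ISO-8859-1 default is an assumption;
// use the encoding the source system actually documents.
function toUtf8(string $value, string $assumedEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid, leave untouched
    }

    $converted = mb_convert_encoding($value, 'UTF-8', $assumedEncoding);
    if (!is_string($converted)) {
        throw new RuntimeException('Could not convert text to UTF-8.');
    }

    return $converted;
}

$legacy = "Caf\xE9"; // 'Café' as a Latin-1 system would store it

var_dump(json_encode($legacy)); // invalid UTF-8, so json_encode() fails
echo json_encode(toUtf8($legacy), JSON_UNESCAPED_UNICODE) . PHP_EOL;

// Prints:
// bool(false)
// "Café"
```

Rejecting invalid input, as requireUtf8() does, is the stricter option. Converting is only safe when the source encoding is actually known.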
Limit display text by characters
User-facing limits should usually be character-based, not byte-based.
<?php
declare(strict_types=1);
function excerpt(string $text, int $limit): string
{
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text;
    }
    return mb_substr($text, 0, $limit, 'UTF-8') . '...';
}
echo excerpt('Résumé updated today', 6) . PHP_EOL;
// Prints:
// Résumé...
Database column limits may still be byte-based depending on the database and column type, so know which limit you are enforcing.
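When the limit really is in bytes, mb_strcut() may help: it cuts by byte length but backs off to a character boundary instead of splitting a character. A minimal sketch, with fitToBytes() as a hypothetical helper name:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: keep at most $maxBytes bytes without splitting a character.
function fitToBytes(string $text, int $maxBytes): string
{
    return mb_strcut($text, 0, $maxBytes, 'UTF-8');
}

$title = "R\u{E9}sum\u{E9}"; // 'Résumé': 6 characters, 8 bytes

echo fitToBytes($title, 3) . PHP_EOL; // 'R' (1 byte) plus 'é' (2 bytes) fits
echo fitToBytes($title, 2) . PHP_EOL; // 'é' does not fit, so it is dropped whole

// Prints:
// Ré
// R
```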
Regex needs Unicode mode
When regex rules apply to Unicode text, use the u modifier and Unicode character classes.
<?php
declare(strict_types=1);
$name = 'Élodie Martin';
echo preg_match('/^[\p{L}\s]+$/u', $name) === 1 ? 'valid name' : 'invalid name';
echo PHP_EOL;
// Prints:
// valid name
This pattern accepts letters and whitespace across languages. Note that \s also matches tabs and newlines, not only spaces. The exact rule still depends on the product, because real names can include hyphens, apostrophes, and other marks.
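A slightly wider rule could look like the sketch below. The character set is an assumption chosen for illustration, not a recommendation for every product, and isPlausibleName() is a hypothetical name. It uses \z rather than $ so a trailing newline is not accepted, and it relies on preg_match() returning false for invalid UTF-8 when the u modifier is set.

```php
<?php
declare(strict_types=1);

// Hypothetical rule: letters, combining marks, spaces, hyphens, and apostrophes.
function isPlausibleName(string $name): bool
{
    // \p{M} allows combining accents; \x{2019} is the typographic apostrophe.
    // Under /u, preg_match() returns false for invalid UTF-8, so === 1 rejects it.
    return preg_match('/^[\p{L}\p{M} \'\x{2019}-]+\z/u', $name) === 1;
}

var_dump(isPlausibleName("Anne-Marie O'Neill"));
var_dump(isPlausibleName('R2-D2'));
var_dump(isPlausibleName("\xC3"));

// Prints:
// bool(true)
// bool(false)
// bool(false)
```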
Know the limits of mb_*
Multibyte functions work with code points, but users may think in grapheme clusters: visible characters that can be made from multiple code points, such as some accented characters and emoji sequences. For high-quality cursor movement, display truncation, or emoji-heavy features, look at the intl extension's grapheme functions.
<?php
declare(strict_types=1);
$text = 'Cafe';
echo mb_strlen($text, 'UTF-8') . PHP_EOL;
// Prints:
// 4
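The ASCII string above has one code point per visible character, so the counts agree. The gap shows up when an accent is stored as a separate combining mark. The following sketch writes the accent explicitly with an escape; grapheme_strlen() comes from the intl extension, which may not be installed, so the call is guarded.

```php
<?php
declare(strict_types=1);

$text = "Cafe\u{301}"; // 'Café' written as 'e' + U+0301 COMBINING ACUTE ACCENT

echo strlen($text) . ' bytes' . PHP_EOL;
echo mb_strlen($text, 'UTF-8') . ' code points' . PHP_EOL;

// Only available when the intl extension is loaded.
if (function_exists('grapheme_strlen')) {
    echo grapheme_strlen($text) . ' graphemes' . PHP_EOL;
}

// Prints (with intl loaded):
// 6 bytes
// 5 code points
// 4 graphemes
```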
For many business applications, mb_* functions are enough. The important step is not treating UTF-8 text as plain bytes.
What to remember
Use mb_strlen(), mb_substr(), and mb_strtolower() for user-facing UTF-8 text. Validate encoding at import boundaries, use Unicode-aware regex when needed, and remember that output still needs escaping for its destination.
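For HTML output, escaping could be sketched as follows. escapeHtml() is a hypothetical helper name; other destinations such as SQL, JSON, or shell commands need their own escaping or parameter binding.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for HTML output.
function escapeHtml(string $text): string
{
    // ENT_SUBSTITUTE replaces invalid byte sequences with U+FFFD
    // instead of returning an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo escapeHtml('<b>Élodie</b> & "friends"') . PHP_EOL;

// Prints:
// &lt;b&gt;Élodie&lt;/b&gt; &amp; &quot;friends&quot;
```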
Practice
Task: Prepare a display name
Write a small helper for user display names.
Requirements
- Use declare(strict_types=1);.
- Accept a raw name string.
- Trim surrounding whitespace.
- Require valid UTF-8.
- Reject an empty name.
- Limit the name to 20 characters using multibyte functions.
- Return the cleaned name and its first initial.
- Print the result for a name containing a multibyte character.
- Show an empty-name error by catching the exception.
- Include the expected output as comments in the same PHP code block.
The helper should use mb_strlen() and mb_substr(), not byte-oriented string functions.
Solution
<?php
declare(strict_types=1);
function prepareDisplayName(string $rawName): array
{
    $name = trim($rawName);
    if (!mb_check_encoding($name, 'UTF-8')) {
        throw new InvalidArgumentException('Name must be valid UTF-8.');
    }
    if ($name === '') {
        throw new InvalidArgumentException('Name is required.');
    }
    if (mb_strlen($name, 'UTF-8') > 20) {
        $name = mb_substr($name, 0, 20, 'UTF-8');
    }
    return [
        'name' => $name,
        'initial' => mb_substr($name, 0, 1, 'UTF-8'),
    ];
}
$display = prepareDisplayName(' Élodie Martin ');
echo $display['name'] . ' / ' . $display['initial'] . PHP_EOL;
try {
    prepareDisplayName(' ');
} catch (InvalidArgumentException $exception) {
    echo $exception->getMessage() . PHP_EOL;
}
// Prints:
// Élodie Martin / É
// Name is required.
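The solution validates the encoding at the input boundary and trims with trim(), which is safe on UTF-8 because it removes only ASCII whitespace bytes and those never occur inside a multibyte sequence. It then uses multibyte length and substring operations, and keeps the first initial correct for a name that starts with a non-ASCII character.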