Unicode and Multibyte Strings

Modern PHP applications handle names, addresses, product titles, search terms, comments, translations, and imported data from many languages. Treating every character as one byte leads to broken validation and mangled output.

PHP strings are byte sequences. UTF-8 characters may use more than one byte, so byte-oriented functions such as strlen() and substr() can give the wrong answer for user-facing text. Use multibyte functions when the unit is a character rather than a byte.

Bytes and characters are different

strlen() counts bytes. mb_strlen() counts characters for a given encoding.

PHP example

<?php

declare(strict_types=1);

$name = 'Élodie';

echo strlen($name) . PHP_EOL;
echo mb_strlen($name, 'UTF-8') . PHP_EOL;

// Prints:
// 7
// 6

The first character uses two bytes in UTF-8. That matters for display limits, initials, excerpts, and validation messages.
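To see where the extra byte comes from, it can help to look at the raw bytes. The sketch below dumps the UTF-8 encoding of a few single code points with bin2hex(). The \u{...} escapes are used only so the result does not depend on how the file is saved; the sample characters are illustrative choices.

PHP example

```php
<?php

declare(strict_types=1);

// Each value holds exactly one code point, written as an escape.
$samples = [
    'U+0041'  => "\u{0041}",  // A, plain ASCII
    'U+00C9'  => "\u{00C9}",  // É, Latin capital E with acute
    'U+20AC'  => "\u{20AC}",  // euro sign
    'U+1F600' => "\u{1F600}", // emoji outside the Basic Multilingual Plane
];

foreach ($samples as $label => $char) {
    echo $label . ': ' . strlen($char) . ' byte(s), hex ' . bin2hex($char) . PHP_EOL;
}

// Prints:
// U+0041: 1 byte(s), hex 41
// U+00C9: 2 byte(s), hex c389
// U+20AC: 3 byte(s), hex e282ac
// U+1F600: 4 byte(s), hex f09f9880
```

A UTF-8 character can take one to four bytes, so the gap between strlen() and mb_strlen() grows with the amount of non-ASCII text.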

Take substrings safely

Use mb_substr() when cutting user-facing text.

PHP example

<?php

declare(strict_types=1);

$name = 'Élodie';
$initial = mb_substr($name, 0, 1, 'UTF-8');

echo 'Initial: ' . $initial . PHP_EOL;

// Prints:
// Initial: É

Using substr($name, 0, 1) would return only the first byte of É, which is not a complete character and is not valid UTF-8 on its own.
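The difference between the two cuts can be checked directly. In this sketch, the É is written as an escape so the bytes are unambiguous, and mb_check_encoding() is used to show which result is still valid UTF-8.

PHP example

```php
<?php

declare(strict_types=1);

$name = "\u{00C9}lodie"; // 'Élodie', with É written as an escape

$byteCut = substr($name, 0, 1);             // first byte only
$charCut = mb_substr($name, 0, 1, 'UTF-8'); // first character

echo bin2hex($byteCut) . PHP_EOL;
echo var_export(mb_check_encoding($byteCut, 'UTF-8'), true) . PHP_EOL;
echo bin2hex($charCut) . PHP_EOL;
echo var_export(mb_check_encoding($charCut, 'UTF-8'), true) . PHP_EOL;

// Prints:
// c3
// false
// c389
// true
```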

Normalise case with multibyte functions

Use mb_strtolower() and mb_strtoupper() for user-facing text.

PHP example

<?php

declare(strict_types=1);

$city = 'MÜNCHEN';

echo mb_strtolower($city, 'UTF-8') . PHP_EOL;

// Prints:
// münchen

This is useful for search normalisation, comparisons, tags, and display cleanup.
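As a rough illustration of comparison after normalisation, the sketch below lowercases and trims two values before comparing them. The helper name normaliseTag() is hypothetical and not part of any library.

PHP example

```php
<?php

declare(strict_types=1);

// Hypothetical helper: trims and lowercases a tag so that
// differently-cased input compares equal.
function normaliseTag(string $tag): string
{
    return mb_strtolower(trim($tag), 'UTF-8');
}

var_dump(normaliseTag('MÜNCHEN') === normaliseTag(' münchen '));
var_dump(normaliseTag('Zürich') === normaliseTag('ZURICH'));

// Prints:
// bool(true)
// bool(false)
```

Lowercasing changes case only. It does not strip accents, so 'zürich' and 'zurich' remain different strings; whether they should match is a separate product decision.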

Validate UTF-8 at boundaries

When data comes from files, APIs, or old systems, check that it is valid UTF-8 before storing or displaying it.

PHP example

<?php

declare(strict_types=1);

function requireUtf8(string $value): string
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        throw new InvalidArgumentException('Text must be valid UTF-8.');
    }

    return $value;
}

echo requireUtf8('Valid text') . PHP_EOL;

// Prints:
// Valid text

Invalid encoding can break JSON responses, database writes, search indexing, and HTML output.
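The JSON case is easy to reproduce. The sketch below feeds json_encode() a string that contains a lone ISO-8859-1 byte, then converts it with mb_convert_encoding(). The conversion step assumes the source encoding is known to be ISO-8859-1, which real imports need to confirm rather than guess.

PHP example

```php
<?php

declare(strict_types=1);

// 'café' as stored by a legacy ISO-8859-1 system:
// the single byte E9 is not valid UTF-8.
$legacy = "caf\xE9";

var_dump(json_encode(['name' => $legacy]));
echo json_last_error_msg() . PHP_EOL;

// Assumption: the source encoding is known to be ISO-8859-1.
$fixed = mb_convert_encoding($legacy, 'UTF-8', 'ISO-8859-1');

echo json_encode(['name' => $fixed]) . PHP_EOL;

// Prints:
// bool(false)
// Malformed UTF-8 characters, possibly incorrectly encoded
// {"name":"caf\u00e9"}
```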

Limit display text by characters

User-facing limits should usually be character-based, not byte-based.

PHP example

<?php

declare(strict_types=1);

function excerpt(string $text, int $limit): string
{
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text;
    }

    return mb_substr($text, 0, $limit, 'UTF-8') . '...';
}

echo excerpt('Résumé updated today', 6) . PHP_EOL;

// Prints:
// Résumé...

Database column limits may still be byte-based depending on the database and column type, so know which limit you are enforcing.
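When the limit really is in bytes, mb_strcut() is one option: it cuts by byte length but backs off to the previous character boundary instead of splitting a character. The helper name fitBytes() below is hypothetical, and the byte limits are chosen only to show the boundary behaviour.

PHP example

```php
<?php

declare(strict_types=1);

// Hypothetical helper: fit text into at most $maxBytes bytes
// without splitting a multibyte character.
function fitBytes(string $text, int $maxBytes): string
{
    return mb_strcut($text, 0, $maxBytes, 'UTF-8');
}

$title = 'Résumé';

echo strlen($title) . PHP_EOL;      // byte length of the full title
echo fitBytes($title, 2) . PHP_EOL; // byte 2 falls inside 'é', so it is dropped
echo fitBytes($title, 3) . PHP_EOL; // 'R' (1 byte) plus 'é' (2 bytes)

// Prints:
// 8
// R
// Ré
```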

Regex needs Unicode mode

When regex rules apply to Unicode text, use the u modifier and Unicode character classes.

PHP example

<?php

declare(strict_types=1);

$name = 'Élodie Martin';

echo preg_match('/^[\p{L}\s]+$/u', $name) === 1 ? 'valid name' : 'invalid name';
echo PHP_EOL;

// Prints:
// valid name

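This pattern accepts letters and whitespace across languages. Without the u modifier, the pattern would be matched byte by byte and \p{L} would not work reliably on multibyte text. The exact rule still depends on the product, because real names can include hyphens, apostrophes, and other marks.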
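One possible way to widen the rule is sketched below. The function name isPlausibleName() and the exact set of allowed characters are assumptions for illustration: letters, combining marks, whitespace, hyphens, and straight or typographic apostrophes.

PHP example

```php
<?php

declare(strict_types=1);

// Hypothetical rule: \p{L} letters, \p{M} combining marks, whitespace,
// a straight apostrophe, U+2019 (typographic apostrophe), and a hyphen.
function isPlausibleName(string $name): bool
{
    return preg_match('/^[\p{L}\p{M}\s\'\x{2019}-]+$/u', $name) === 1;
}

var_dump(isPlausibleName('Élodie Martin'));
var_dump(isPlausibleName("Anne-Marie O'Neill"));
var_dump(isPlausibleName('R2-D2')); // digits are not in the allowed set

// Prints:
// bool(true)
// bool(true)
// bool(false)
```

This rule still accepts a string made only of whitespace or punctuation, so it would normally be combined with trimming and an emptiness check.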

Know the limits of mb_*

Multibyte functions work with code points, but users may think in grapheme clusters: visible characters that can be made from multiple code points, such as some accented characters and emoji sequences. For high-quality cursor movement, display truncation, or emoji-heavy features, look at the intl extension's grapheme functions.

PHP example

<?php

declare(strict_types=1);

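// "Café" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
// It renders as four visible characters but contains five code points.
$text = "Cafe\u{0301}";

echo mb_strlen($text, 'UTF-8') . PHP_EOL;

// Prints:
// 5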

For many business applications, mb_* functions are enough. The important step is not treating UTF-8 text as plain bytes.
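Where code points are not a fine enough unit, the grapheme functions can be compared against mb_* directly. This sketch assumes the intl extension may or may not be loaded, so the grapheme calls are guarded with function_exists(). The sample name is an illustrative choice.

PHP example

```php
<?php

declare(strict_types=1);

// "Zoë" written as 'e' followed by U+0308 COMBINING DIAERESIS.
$text = "Zoe\u{0308}";

echo mb_strlen($text, 'UTF-8') . PHP_EOL;                   // code points
echo bin2hex(mb_substr($text, 0, 3, 'UTF-8')) . PHP_EOL;    // cut drops the diaeresis

// grapheme_* functions exist only when the intl extension is loaded.
if (function_exists('grapheme_strlen')) {
    echo grapheme_strlen($text) . PHP_EOL;                  // user-perceived characters
    echo bin2hex(grapheme_substr($text, 0, 3)) . PHP_EOL;   // cut keeps the diaeresis
}

// Prints, with intl loaded:
// 4
// 5a6f65
// 3
// 5a6f65cc88
```

Cutting by code points leaves a bare "Zoe", while cutting by grapheme clusters keeps the accent attached to its base letter.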

What to remember

Use mb_strlen(), mb_substr(), and mb_strtolower() for user-facing UTF-8 text. Validate encoding at import boundaries, use Unicode-aware regex when needed, and remember that output still needs escaping for its destination.

Practice

Task: Prepare a display name

Write a small helper for user display names.

Requirements

Use declare(strict_types=1);.
Accept a raw name string.
Trim surrounding whitespace.
Require valid UTF-8.
Reject an empty name.
Limit the name to 20 characters using multibyte functions.
Return the cleaned name and its first initial.
Print the result for a name containing a multibyte character.
Show an empty-name error by catching the exception.
Include the expected output as comments in the same PHP code block.

The helper should use mb_strlen() and mb_substr(), not byte-oriented string functions.

Solution

PHP example

<?php

declare(strict_types=1);

function prepareDisplayName(string $rawName): array
{
    $name = trim($rawName);

    if (!mb_check_encoding($name, 'UTF-8')) {
        throw new InvalidArgumentException('Name must be valid UTF-8.');
    }

    if ($name === '') {
        throw new InvalidArgumentException('Name is required.');
    }

    if (mb_strlen($name, 'UTF-8') > 20) {
        $name = mb_substr($name, 0, 20, 'UTF-8');
    }

    return [
        'name' => $name,
        'initial' => mb_substr($name, 0, 1, 'UTF-8'),
    ];
}

$display = prepareDisplayName('  Élodie Martin  ');

echo $display['name'] . ' / ' . $display['initial'] . PHP_EOL;

try {
    prepareDisplayName('   ');
} catch (InvalidArgumentException $exception) {
    echo $exception->getMessage() . PHP_EOL;
}

// Prints:
// Élodie Martin / É
// Name is required.

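The solution trims surrounding whitespace, validates the encoding at the boundary, rejects an empty name, and uses multibyte length and substring operations, so the first initial stays correct for a name that starts with a non-ASCII character. trim() is byte-oriented, but it is safe here because by default it removes only single-byte ASCII whitespace, which never appears inside a multibyte UTF-8 sequence.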
