PHP: Clean encoding issues with smart (curly) quotes, em dashes and more

When dealing with content from various sources, such as XML feeds, you will inevitably encounter problems with smart quotes, em dashes, and other random encoding issues. Smart quotes are known by other names such as curly quotes and left/right angled quotes.

The main problem is Windows. Many Windows programs use Windows-1252 character encoding, which is very similar to ISO-8859-1, but with some differences. Attempts to detect the encoding of a Windows-1252 string will tend to result in a guess of ISO-8859-1. So conversion tools will overlook the differences. Unfortunately, there are some relatively common characters among the differences. These characters include left/right angled single/double quotes, em/en dashes, ellipsis, and bullets.

To deal with these encoding issues, I wrote a function to do the cleanup and convert the string to UTF-8.

Note: I have only tested this with the English language / character set.

Full disclosure: PHP must have the mbstring extension enabled to use the mb_* functions.

 * cleanEncoding deals with pesky characters like curly smart quotes and em dashes (and some other encoding related problems)
 * @param string $text Text string to cleanup / convert
 * @param string $type 'standard' for standard characters, 'reference' for decimal numerical character reference
 * @return $text Cleaned up UTF-8 string
function cleanEncoding( $text, $type='standard' ){
    // determine the encoding before we touch it
    $encoding = mb_detect_encoding($text, 'UTF-8, ISO-8859-1');
    // The characters to output
    if ( $type=='standard' ){
        $outp_chr = array('...',          "'",            "'",            '"',            '"',            '•',            '-',            '-'); // run of the mill standard characters
    } elseif ( $type=='reference' ) {
        $outp_chr = array('…',      '‘',      '’',      '“',      '”',      '•',      '–',      '—'); // decimal numerical character references
    // The characters to replace (purposely indented for comparison)
        $utf8_chr = array("\xe2\x80\xa6", "\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", '\xe2\x80\xa2', "\xe2\x80\x93", "\xe2\x80\x94"); // UTF-8 hex characters
        $winc_chr = array(chr(133),       chr(145),       chr(146),       chr(147),       chr(148),       chr(149),       chr(150),       chr(151)); // ASCII characters (found in Windows-1252)
    // First, replace UTF-8 characters.
    $text = str_replace( $utf8_chr, $outp_chr, $text);
    // Next, replace Windows-1252 characters.
    $text = str_replace( $winc_chr, $outp_chr, $text);
    // even if the string seems to be UTF-8, we can't trust it, so convert it to UTF-8 anyway
    $text = mb_convert_encoding($text, 'UTF-8', $encoding);
    return $text;

If you are interested in more information on this topic, here are some links you may find helpful:

10 Comments on PHP: Clean encoding issues with smart (curly) quotes, em dashes and more

Leave a Reply

Your email address will not be published. Required fields are marked *