PHP: Clean encoding issues with smart (curly) quotes, em dashes and more

When dealing with content from various sources, such as XML feeds, you will inevitably run into problems with smart quotes, em dashes, and other encoding issues. Smart quotes also go by other names, such as curly quotes and left/right angled quotes.

The main culprit is Windows. Many Windows programs use the Windows-1252 character encoding, which matches ISO-8859-1 except for bytes 128 to 159: ISO-8859-1 reserves that range for control characters, while Windows-1252 fills most of it with printable characters. Attempts to detect the encoding of a Windows-1252 string tend to guess ISO-8859-1, so conversion tools overlook the differences. Unfortunately, some relatively common characters fall in that range, including left/right single and double curly quotes, en and em dashes, the ellipsis, and the bullet.
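To make the difference concrete, here is a small sketch (separate from the function below) that converts the same three bytes to UTF-8 twice, once treating them as ISO-8859-1 and once as Windows-1252. It assumes the mbstring extension is loaded and accepts the 'Windows-1252' encoding name; the variable names are only for illustration.

```php
<?php
// Three bytes as a Windows program might write them:
// left double curly quote (0x93), the letter A, right double curly quote (0x94).
$raw = "\x93A\x94";

// Read as ISO-8859-1, bytes 0x93 and 0x94 become invisible control characters (U+0093, U+0094).
$asLatin1 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

// Read as Windows-1252, the same bytes become curly quotes (U+201C, U+201D).
$asCp1252 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c29341c294
echo bin2hex($asCp1252), "\n"; // e2809c41e2809d
```

The first result is what a tool produces when it guesses ISO-8859-1: valid UTF-8, but with the quotes silently turned into control characters.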

To deal with these encoding issues, I wrote a function to do the cleanup and convert the string to UTF-8.

Note: I have only tested this with the English language / character set.

Full disclosure: PHP must have the mbstring extension enabled to use the mb_* functions.
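If you are not sure whether mbstring is enabled on your server, a quick check like the following can be run first. This is only a sketch; how you report the problem (here a plain message and exit) is up to you.

```php
<?php
// Fail early with a clear message instead of hitting an "undefined function" error later.
if (!extension_loaded('mbstring')) {
    exit("The mbstring extension is required (mb_detect_encoding, mb_convert_encoding).\n");
}

echo "mbstring is available\n";
```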

/**
 * cleanEncoding deals with pesky characters like curly smart quotes and em dashes (and some other encoding related problems)
 *
 * @param string $text Text string to cleanup / convert
 * @param string $type 'standard' for plain characters, 'reference' for decimal numeric character references (e.g. &#8220;)
 *
 * @return string Cleaned up UTF-8 string
 */
function cleanEncoding( $text, $type='standard' ){
    // determine the encoding before we touch it
    $encoding = mb_detect_encoding($text, 'UTF-8, ISO-8859-1', true); // strict mode: only report UTF-8 when the bytes are valid UTF-8
    // The characters to output
    if ( $type=='standard' ){
        $outp_chr = array('...',          "'",            "'",            '"',            '"',            '*',            '-',            '-'); // plain ASCII stand-ins, so the final conversion cannot mangle them
    } elseif ( $type=='reference' ) {
        $outp_chr = array('&#8230;',      '&#8216;',      '&#8217;',      '&#8220;',      '&#8221;',      '&#8226;',      '&#8211;',      '&#8212;'); // decimal numeric character references
    }
    // The characters to replace (purposely indented for comparison)
        $utf8_chr = array("\xe2\x80\xa6", "\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\xa2", "\xe2\x80\x93", "\xe2\x80\x94"); // UTF-8 byte sequences (double quotes are required for \x escapes)
        $winc_chr = array(chr(133),       chr(145),       chr(146),       chr(147),       chr(148),       chr(149),       chr(150),       chr(151)); // single bytes from Windows-1252 (outside the ASCII range)
    // First, replace UTF-8 characters.
    $text = str_replace( $utf8_chr, $outp_chr, $text);
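    // Next, replace Windows-1252 characters, but only if the text was not detected as UTF-8:
    // bytes 133 and 145-151 also occur inside valid UTF-8 sequences (e.g. "Ñ" is \xc3\x91).
    if ( $encoding !== 'UTF-8' ) {
        $text = str_replace( $winc_chr, $outp_chr, $text);
    }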
    // even if the string seems to be UTF-8, we can't trust it, so convert it to UTF-8 anyway
    $text = mb_convert_encoding($text, 'UTF-8', $encoding);
    return $text;
}
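For anyone unsure what the 'reference' output means: each decimal numeric character reference is simply the Unicode code point of the character, written so that an HTML parser can turn it back into the character. The sketch below is separate from the function above; it uses PHP's html_entity_decode and mb_ord (PHP 7.2+) to show that relationship for the eight characters handled here.

```php
<?php
// Decimal references for: ellipsis, left/right single quote, left/right double quote,
// bullet, en dash, em dash.
$refs = array('&#8230;', '&#8216;', '&#8217;', '&#8220;', '&#8221;', '&#8226;', '&#8211;', '&#8212;');

foreach ($refs as $ref) {
    // html_entity_decode turns the reference back into the character, encoded as UTF-8.
    $char = html_entity_decode($ref, ENT_QUOTES, 'UTF-8');
    // mb_ord returns the code point, which is the number inside the reference.
    printf("%s -> code point %d, UTF-8 bytes %s\n", $ref, mb_ord($char, 'UTF-8'), bin2hex($char));
}
```

Because the references are plain ASCII, they survive any later encoding conversion untouched, which makes them a safe choice when the text ends up in HTML.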

If you are interested in more information on this topic, here are some links you may find helpful:
http://www.joelonsoftware.com/articles/Unicode.html
http://shiflett.org/blog/2005/oct/convert-smart-quotes-with-php#comment-3
http://web.forret.com/tools/charmap.asp?show=ascii
http://en.wikipedia.org/wiki/Extended_ASCII
http://www.ascii-code.com/
http://stackoverflow.com/questions/631406/what-is-the-difference-between-em-dash-151-and-8212
http://en.wikipedia.org/wiki/Numeric_character_reference
http://www.dwheeler.com/essays/quotes-in-html.html
http://www.i18nguy.com/markup/ncrs.html
http://www.kadifeli.com/fedon/utf.htm

10 Comments on PHP: Clean encoding issues with smart (curly) quotes, em dashes and more

  1. Daniel says:

    Thank you for the code. I spent hours trying codes from other websites and yours is the only one that works.

    Danel

  2. Jeramiah says:

    This little bit of code was a life saver. I spent more than a few hours trying every solution I came across and nothing was doing the job 100%. Thanks a million!

  3. Matt O'Neal says:

    Thank you! I was so sick of fighting with funky characters from RSS feeds that were created with some type of funky Microsoft encoded character sets.

    Now that you’ve saved me some time, I’m going to have to see what else you have going on in this blog! Thanks again for posting this elegant solution.

  4. Romanovic says:

    Thank you soo much! Finally something that works! iconv('ASCII', 'UTF-8//IGNORE', $my_text) removed the strange chars but with your code I was able to transform them. Good thing! :)

  5. Steve says:

    Great! After looking up several options, this one worked for me. Thanks.

  6. Clifford Meece says:

    You freaking rock!

  7. angel says:

    oh my goodness for how many years!!!!..joke. after almost 1 year of being a web developer thanks for this! Really life saver. very useful for importing large excel file… it works!
