Category: PHP

PHP: Clean encoding issues with smart (curly) quotes, em dashes and more

When dealing with content from various sources, such as XML feeds, you will inevitably encounter problems with smart quotes, em dashes, and other random encoding issues. Smart quotes are known by other names such as curly quotes and left/right angled quotes.

The main problem is Windows. Many Windows programs use Windows-1252 character encoding, which is very similar to ISO-8859-1, but with some differences. Attempts to detect the encoding of a Windows-1252 string will tend to result in a guess of ISO-8859-1. So conversion tools will overlook the differences. Unfortunately, there are some relatively common characters among the differences. These characters include left/right angled single/double quotes, em/en dashes, ellipsis, and bullets.

To deal with these encoding issues, I wrote a function to do the cleanup and convert the string to UTF-8.

Note: I have only tested this with the English language / character set.

Full disclosure: PHP must have the mbstring extension enabled to use the mb_* functions.

/**
 * cleanEncoding deals with pesky characters like curly smart quotes and em dashes (and some other encoding related problems)
 *
 * @param string $text Text string to cleanup / convert
 * @param string $type 'standard' for standard characters, 'reference' for decimal numerical character reference
 *
 * @return $text Cleaned up UTF-8 string
 */
function cleanEncoding( $text, $type='standard' ){
    // determine the encoding before we touch it
    $encoding = mb_detect_encoding($text, 'UTF-8, ISO-8859-1');
    // The characters to output
    if ( $type=='standard' ){
        $outp_chr = array('...',          "'",            "'",            '"',            '"',            '•',            '-',            '-'); // run of the mill standard characters
    } elseif ( $type=='reference' ) {
        $outp_chr = array('…',      '‘',      '’',      '“',      '”',      '•',      '–',      '—'); // decimal numerical character references
    }
    // The characters to replace (purposely indented for comparison)
        $utf8_chr = array("\xe2\x80\xa6", "\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", '\xe2\x80\xa2', "\xe2\x80\x93", "\xe2\x80\x94"); // UTF-8 hex characters
        $winc_chr = array(chr(133),       chr(145),       chr(146),       chr(147),       chr(148),       chr(149),       chr(150),       chr(151)); // ASCII characters (found in Windows-1252)
    // First, replace UTF-8 characters.
    $text = str_replace( $utf8_chr, $outp_chr, $text);
    // Next, replace Windows-1252 characters.
    $text = str_replace( $winc_chr, $outp_chr, $text);
    // even if the string seems to be UTF-8, we can't trust it, so convert it to UTF-8 anyway
    $text = mb_convert_encoding($text, 'UTF-8', $encoding);
    return $text;
}

If you are interested in more information on this topic, here are some links you may find helpful:
http://www.joelonsoftware.com/articles/Unicode.html
http://shiflett.org/blog/2005/oct/convert-smart-quotes-with-php#comment-3
http://web.forret.com/tools/charmap.asp?show=ascii
http://en.wikipedia.org/wiki/Extended_ASCII
http://www.ascii-code.com/
http://stackoverflow.com/questions/631406/what-is-the-difference-between-em-dash-151-and-8212
http://en.wikipedia.org/wiki/Numeric_character_reference
http://www.dwheeler.com/essays/quotes-in-html.html
http://www.i18nguy.com/markup/ncrs.html
http://www.kadifeli.com/fedon/utf.htm

PHP: Truncate string while preserving HTML tags and whole words

Truncating strings is a very common task while programming. Sometimes those strings have HTML code within them. If you simply truncated at X characters, you risk outputting very broken HTML. If you can live without the HTML, the easy solution is to strip_tags. However, if you want to preserve the HTML tags, you’ll need a smarter truncate function.

I yanked this function from a blog who looks like they yanked it from another blog, who yanked in from the CakePHP framework. This function is too good not to share.

/**
 * truncateHtml can truncate a string up to a number of characters while preserving whole words and HTML tags
 *
 * @param string $text String to truncate.
 * @param integer $length Length of returned string, including ellipsis.
 * @param string $ending Ending to be appended to the trimmed string.
 * @param boolean $exact If false, $text will not be cut mid-word
 * @param boolean $considerHtml If true, HTML tags would be handled correctly
 *
 * @return string Trimmed string.
 */
function truncateHtml($text, $length = 100, $ending = '...', $exact = false, $considerHtml = true) {
	if ($considerHtml) {
		// if the plain text is shorter than the maximum length, return the whole text
		if (strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
			return $text;
		}
		// splits all html-tags to scanable lines
		preg_match_all('/(<.+?>)?([^<>]*)/s', $text, $lines, PREG_SET_ORDER);
		$total_length = strlen($ending);
		$open_tags = array();
		$truncate = '';
		foreach ($lines as $line_matchings) {
			// if there is any html-tag in this line, handle it and add it (uncounted) to the output
			if (!empty($line_matchings[1])) {
				// if it's an "empty element" with or without xhtml-conform closing slash
				if (preg_match('/^<(\s*.+?\/\s*|\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\s.+?)?)>$/is', $line_matchings[1])) {
					// do nothing
				// if tag is a closing tag
				} else if (preg_match('/^<\s*\/([^\s]+?)\s*>$/s', $line_matchings[1], $tag_matchings)) {
					// delete tag from $open_tags list
					$pos = array_search($tag_matchings[1], $open_tags);
					if ($pos !== false) {
					unset($open_tags[$pos]);
					}
				// if tag is an opening tag
				} else if (preg_match('/^<\s*([^\s>!]+).*?>$/s', $line_matchings[1], $tag_matchings)) {
					// add tag to the beginning of $open_tags list
					array_unshift($open_tags, strtolower($tag_matchings[1]));
				}
				// add html-tag to $truncate'd text
				$truncate .= $line_matchings[1];
			}
			// calculate the length of the plain text part of the line; handle entities as one character
			$content_length = strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|[0-9a-f]{1,6};/i', ' ', $line_matchings[2]));
			if ($total_length+$content_length> $length) {
				// the number of characters which are left
				$left = $length - $total_length;
				$entities_length = 0;
				// search for html entities
				if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|[0-9a-f]{1,6};/i', $line_matchings[2], $entities, PREG_OFFSET_CAPTURE)) {
					// calculate the real length of all entities in the legal range
					foreach ($entities[0] as $entity) {
						if ($entity[1]+1-$entities_length <= $left) {
							$left--;
							$entities_length += strlen($entity[0]);
						} else {
							// no more characters left
							break;
						}
					}
				}
				$truncate .= substr($line_matchings[2], 0, $left+$entities_length);
				// maximum lenght is reached, so get off the loop
				break;
			} else {
				$truncate .= $line_matchings[2];
				$total_length += $content_length;
			}
			// if the maximum length is reached, get off the loop
			if($total_length>= $length) {
				break;
			}
		}
	} else {
		if (strlen($text) <= $length) {
			return $text;
		} else {
			$truncate = substr($text, 0, $length - strlen($ending));
		}
	}
	// if the words shouldn't be cut in the middle...
	if (!$exact) {
		// ...search the last occurance of a space...
		$spacepos = strrpos($truncate, ' ');
		if (isset($spacepos)) {
			// ...and cut the text in this position
			$truncate = substr($truncate, 0, $spacepos);
		}
	}
	// add the defined ending to the text
	$truncate .= $ending;
	if($considerHtml) {
		// close all unclosed html-tags
		foreach ($open_tags as $tag) {
			$truncate .= '</' . $tag . '>';
		}
	}
	return $truncate;
}