PHP: Truncate string while preserving HTML tags and whole words

Truncating strings is a common programming task. Sometimes those strings contain HTML. If you simply truncate at X characters, you risk outputting badly broken HTML. If you can live without the HTML, the easy solution is strip_tags. However, if you want to preserve the HTML tags, you’ll need a smarter truncate function.
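As a quick illustration of the problem, here is a small sketch that cuts a snippet with substr() directly, and then again after strip_tags(). The sample markup and the character limits are made up for the demonstration.

```php
<?php
// Sample markup, invented for this demonstration.
$html = '<p>Read the <a href="/docs/truncate">full guide</a> today.</p>';

// Naive cut: markup counts as characters, so the cut can land inside a tag.
$naive = substr($html, 0, 30);

// Plain-text cut: drop the tags first, then truncate.
$plain = substr(strip_tags($html), 0, 13);

echo $naive, "\n"; // <p>Read the <a href="/docs/tru
echo $plain, "\n"; // Read the full
```

The first result ends in the middle of an attribute and leaves both the link and the paragraph unclosed, which is exactly the kind of output a smarter function has to avoid.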

I yanked this function from a blog that looks like it yanked it from another blog, which yanked it from the CakePHP framework. This function is too good not to share.

<?php
/**
 * truncateHtml can truncate a string up to a number of characters while preserving whole words and HTML tags
 *
 * @param string $text String to truncate.
 * @param integer $length Length of returned string, including ellipsis.
 * @param string $ending Ending to be appended to the trimmed string.
 * @param boolean $exact If false, $text will not be cut mid-word.
 * @param boolean $considerHtml If true, HTML tags are handled correctly.
 *
 * @return string Trimmed string.
 */
function truncateHtml($text, $length = 100, $ending = '...', $exact = false, $considerHtml = true) {
	if ($considerHtml) {
		// if the plain text is shorter than the maximum length, return the whole text
		if (strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
			return $text;
		}
		// splits all html-tags to scanable lines
		preg_match_all('/(<.+?>)?([^<>]*)/s', $text, $lines, PREG_SET_ORDER);
		$total_length = strlen($ending);
		$open_tags = array();
		$truncate = '';
		foreach ($lines as $line_matchings) {
			// if there is any html-tag in this line, handle it and add it (uncounted) to the output
			if (!empty($line_matchings[1])) {
				// if it's an "empty element" with or without xhtml-conform closing slash
				if (preg_match('/^<(\s*.+?\/\s*|\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\s.+?)?)>$/is', $line_matchings[1])) {
					// do nothing
				// if tag is a closing tag
				} else if (preg_match('/^<\s*\/([^\s]+?)\s*>$/s', $line_matchings[1], $tag_matchings)) {
					// delete tag from $open_tags list
					// (tag names are stored lowercase, so compare in lowercase)
					$pos = array_search(strtolower($tag_matchings[1]), $open_tags);
					if ($pos !== false) {
						unset($open_tags[$pos]);
					}
				// if tag is an opening tag
				} else if (preg_match('/^<\s*([^\s>!]+).*?>$/s', $line_matchings[1], $tag_matchings)) {
					// add tag to the beginning of $open_tags list
					array_unshift($open_tags, strtolower($tag_matchings[1]));
				}
				// add html-tag to $truncate'd text
				$truncate .= $line_matchings[1];
			}
			// calculate the length of the plain text part of the line; handle entities as one character
			$content_length = strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $line_matchings[2]));
			if ($total_length + $content_length > $length) {
				// the number of characters which are left
				$left = $length - $total_length;
				$entities_length = 0;
				// search for html entities
				if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $line_matchings[2], $entities, PREG_OFFSET_CAPTURE)) {
					// calculate the real length of all entities in the legal range
					foreach ($entities[0] as $entity) {
						if ($entity[1]+1-$entities_length <= $left) {
							$left--;
							$entities_length += strlen($entity[0]);
						} else {
							// no more characters left
							break;
						}
					}
				}
				$truncate .= substr($line_matchings[2], 0, $left+$entities_length);
				// maximum length is reached, so get off the loop
				break;
			} else {
				$truncate .= $line_matchings[2];
				$total_length += $content_length;
			}
			// if the maximum length is reached, get off the loop
			if ($total_length >= $length) {
				break;
			}
		}
	} else {
		if (strlen($text) <= $length) {
			return $text;
		} else {
			$truncate = substr($text, 0, $length - strlen($ending));
		}
	}
	// if the words shouldn't be cut in the middle...
	if (!$exact) {
		// ...search the last occurrence of a space...
		$spacepos = strrpos($truncate, ' ');
		if ($spacepos !== false) {
			// ...and cut the text in this position
			$truncate = substr($truncate, 0, $spacepos);
		}
	}
	// add the defined ending to the text
	$truncate .= $ending;
	if ($considerHtml) {
		// close all unclosed html-tags
		foreach ($open_tags as $tag) {
			$truncate .= '</' . $tag . '>';
		}
	}
	return $truncate;
}
 
?>
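The part that keeps the markup valid is the bookkeeping of open tags. The sketch below isolates that idea in a hypothetical helper, closeOpenTags(), which is not part of the function above: it scans a fragment that has already been cut, tracks which elements are still open, and appends their closing tags innermost first. The regular expression and the list of void elements are simplified assumptions.

```php
<?php
/**
 * Hypothetical helper, not part of truncateHtml: appends closing tags
 * for every element still open at the end of $fragment.
 */
function closeOpenTags($fragment) {
	// Void elements never get a closing tag (simplified list).
	$void = array('img', 'br', 'input', 'hr', 'area', 'base', 'col', 'link', 'meta', 'param');
	$open = array();
	// Capture the optional leading slash, the tag name, and the rest of the tag.
	preg_match_all('/<\s*(\/?)\s*([a-z][a-z0-9]*)([^>]*)>/i', $fragment, $tags, PREG_SET_ORDER);
	foreach ($tags as $tag) {
		$name = strtolower($tag[2]);
		$selfClosing = substr(rtrim($tag[3]), -1) === '/';
		if (in_array($name, $void) || $selfClosing) {
			continue;
		}
		if ($tag[1] === '/') {
			// Closing tag: drop the most recently opened element of that name.
			$pos = array_search($name, $open);
			if ($pos !== false) {
				array_splice($open, $pos, 1);
			}
		} else {
			// Opening tag: newest first, so closers come out innermost-first.
			array_unshift($open, $name);
		}
	}
	foreach ($open as $name) {
		$fragment .= '</' . $name . '>';
	}
	return $fragment;
}

echo closeOpenTags('<p>Some <strong>bold <em>text'), "\n";
// <p>Some <strong>bold <em>text</em></strong></p>
```

Keeping the newest tag at the front of the list is what makes the closing tags come out in the right nesting order, the same trick the function above uses with array_unshift.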

6 Comments on PHP: Truncate string while preserving HTML tags and whole words

  1. Scott says:

    I've written a Java version of truncateHTML. This version also preserves word boundaries.

    public static String truncateHTML(String text, int length, String suffix) {
    // if the plain text is shorter than the maximum length, return the whole text
    if (text.replaceAll("<.*?>", "").length() <= length) {
    return text;
    }
    String result = "";
    boolean trimmed = false;
    if (suffix == null) {
    suffix = "…";
    }

    /*
    * This pattern creates tokens, where each token starts with an optional tag
    * followed by the text up to the next tag.
    * Same pattern as in the PHP version.
    */
    Pattern tagPattern = Pattern.compile("(<.+?>)?([^<>]*)", Pattern.DOTALL);

    /*
    * Checks for an empty tag, for example img, br, etc.
    */
    Pattern emptyTagPattern = Pattern.compile("^<(\\s*.+?/\\s*|\\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\\s.+?)?)>$", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /*
    * Checks for closing tags, allowing leading and ending space inside the brackets
    */
    Pattern closingTagPattern = Pattern.compile("^<\\s*/([^\\s]+?)\\s*>$", Pattern.DOTALL);

    /*
    * Checks for opening tags, allowing leading and ending space inside the brackets
    */
    Pattern openingTagPattern = Pattern.compile("^<\\s*([^\\s>!]+).*?>$", Pattern.DOTALL);

    /*
    * Finds HTML entities such as &nbsp; &gt; &hellip;
    */
    Pattern entityPattern = Pattern.compile("(&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};)");

    // splits all html-tags to scanable lines
    Matcher tagMatcher = tagPattern.matcher(text);
    int numTags = tagMatcher.groupCount();

    int totalLength = suffix.length();
    List<String> openTags = new ArrayList<String>();

    boolean proposingChop = false;
    while (tagMatcher.find()) {
    String tagText = tagMatcher.group(1);
    String plainText = tagMatcher.group(2);

    if (proposingChop &&
    tagText != null && tagText.length() != 0 &&
    plainText != null && plainText.length() != 0) {
    trimmed = true;
    break;
    }

    // if there is any html-tag in this line, handle it and add it (uncounted) to the output
    if (tagText != null && tagText.length() > 0) {
    boolean foundMatch = false;

    // if it’s an “empty element” with or without xhtml-conform closing slash
    Matcher matcher = emptyTagPattern.matcher(tagText);
    if (matcher.find()) {
    foundMatch = true;
    // do nothing
    }

    // closing tag?
    if (!foundMatch) {
    matcher = closingTagPattern.matcher(tagText);
    if (matcher.find()) {
    foundMatch = true;
    // delete tag from openTags list
    String tagName = matcher.group(1);
    openTags.remove(tagName.toLowerCase());
    }
    }

    // opening tag?
    if (!foundMatch) {
    matcher = openingTagPattern.matcher(tagText);
    if (matcher.find()) {
    // add tag to the beginning of openTags list
    String tagName = matcher.group(1);
    openTags.add(0, tagName.toLowerCase());
    }
    }

    // add html-tag to result
    result += tagText;
    }

    // calculate the length of the plain text part of the line; handle entities (e.g. &nbsp;) as one character
    int contentLength = plainText.replaceAll("&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};", " ").length();
    if (totalLength + contentLength > length) {
    // the number of characters which are left
    int numCharsRemaining = length - totalLength;
    int entitiesLength = 0;
    Matcher entityMatcher = entityPattern.matcher(plainText);
    while (entityMatcher.find()) {
    String entity = entityMatcher.group(1);
    if (numCharsRemaining > 0) {
    numCharsRemaining--;
    entitiesLength += entity.length();
    } else {
    // no more characters left
    break;
    }
    }

    // keep us from chopping words in half
    int proposedChopPosition = numCharsRemaining + entitiesLength;
    int endOfWordPosition = plainText.indexOf(" ", proposedChopPosition - 1);
    if (endOfWordPosition == -1) {
    endOfWordPosition = plainText.length();
    }
    int endOfWordOffset = endOfWordPosition - proposedChopPosition;
    if (endOfWordOffset > 6) { // chop the word if it’s extra long
    endOfWordOffset = 0;
    }

    proposedChopPosition = numCharsRemaining + entitiesLength + endOfWordOffset;
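    if (plainText.length() >= proposedChopPosition) {
    result += plainText.substring(0, proposedChopPosition);
    proposingChop = true;
    if (proposedChopPosition < plainText.length()) {
    trimmed = true;
    break; // maximum length is reached, so get off the loop
    }
    } else {
    result += plainText;
    }
    } else {
    result += plainText;
    totalLength += contentLength;
    }
    // if the maximum length is reached, get off the loop
    if (totalLength >= length) {
    trimmed = true;
    break;
    }
    }

    for (String openTag : openTags) {
    result += "</" + openTag + ">";
    }
    if (trimmed) {
    result += suffix;
    }
    return result;
    }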

  2. Jake says:

    Wonderful work. Thank you!

  3. Cameron Barr says:

    Great work! Works like a charm.

  4. Joe Corneli says:

    I found a slight modification to this code at http://www.gsdesign.ro/blog/cut-html-string-without-breaking-the-tags/#comment-487407314 (by Alex Prokop). It improves the code in the case $empty = false. I’ve tested it and it seems like a necessary fix for some complicated HTML. I thought I should mention it, in case you want to add the fix to your version of the function.
