PHP: Truncate string while preserving HTML tags and whole words

Truncating strings is a very common task while programming. Sometimes those strings have HTML code within them. If you simply truncate at X characters, you risk outputting badly broken HTML. If you can live without the HTML, the easy solution is to run the text through strip_tags first. However, if you want to preserve the HTML tags, you’ll need a smarter truncate function.
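To make the problem concrete, here is a minimal sketch of both naive approaches. The sample markup and the 20-character limit are made up for illustration.

```php
<?php
// Illustrative input; any short HTML fragment would do.
$html = '<p>Hello <strong>wonderful</strong> world</p>';

// Naive cut at 20 bytes: <p> and <strong> are opened but never closed.
$naive = substr($html, 0, 20);

// Stripping the tags first gives plain text that is always safe to cut,
// at the cost of losing all markup.
$plain = substr(strip_tags($html), 0, 20);

echo $naive, "\n"; // <p>Hello <strong>won
echo $plain, "\n"; // Hello wonderful worl
```

The first result leaves two tags dangling, which is exactly what the function below is meant to avoid.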

I yanked this function from a blog that looks like it yanked it from another blog, which yanked it from the CakePHP framework. This function is too good not to share.

<?php
/**
 * truncateHtml can truncate a string up to a number of characters while preserving whole words and HTML tags
 *
 * @param string $text String to truncate.
 * @param integer $length Length of returned string, including ellipsis.
 * @param string $ending Ending to be appended to the trimmed string.
 * @param boolean $exact If false, $text will not be cut mid-word
 * @param boolean $considerHtml If true, HTML tags are handled correctly
 *
 * @return string Trimmed string.
 */
function truncateHtml($text, $length = 100, $ending = '...', $exact = false, $considerHtml = true) {
	if ($considerHtml) {
		// if the plain text is shorter than the maximum length, return the whole text
		if (strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
			return $text;
		}
		// splits all html-tags to scannable lines
		preg_match_all('/(<.+?>)?([^<>]*)/s', $text, $lines, PREG_SET_ORDER);
		$total_length = strlen($ending);
		$open_tags = array();
		$truncate = '';
		foreach ($lines as $line_matchings) {
			// if there is any html-tag in this line, handle it and add it (uncounted) to the output
			if (!empty($line_matchings[1])) {
				// if it's an "empty element" with or without xhtml-conform closing slash
				if (preg_match('/^<(\s*.+?\/\s*|\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\s.+?)?)>$/is', $line_matchings[1])) {
					// do nothing
				// if tag is a closing tag
				} else if (preg_match('/^<\s*\/([^\s]+?)\s*>$/s', $line_matchings[1], $tag_matchings)) {
					// delete tag from $open_tags list (names are stored lowercased, so compare lowercased)
					$pos = array_search(strtolower($tag_matchings[1]), $open_tags);
					if ($pos !== false) {
						unset($open_tags[$pos]);
					}
				// if tag is an opening tag
				} else if (preg_match('/^<\s*([^\s>!]+).*?>$/s', $line_matchings[1], $tag_matchings)) {
					// add tag to the beginning of $open_tags list
					array_unshift($open_tags, strtolower($tag_matchings[1]));
				}
				// add html-tag to $truncate'd text
				$truncate .= $line_matchings[1];
			}
			// calculate the length of the plain text part of the line; handle entities as one character
			$content_length = strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $line_matchings[2]));
			if ($total_length + $content_length > $length) {
				// the number of characters which are left
				$left = $length - $total_length;
				$entities_length = 0;
				// search for html entities
				if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $line_matchings[2], $entities, PREG_OFFSET_CAPTURE)) {
					// calculate the real length of all entities in the legal range
					foreach ($entities[0] as $entity) {
						if ($entity[1]+1-$entities_length <= $left) {
							$left--;
							$entities_length += strlen($entity[0]);
						} else {
							// no more characters left
							break;
						}
					}
				}
				$truncate .= substr($line_matchings[2], 0, $left+$entities_length);
				// maximum length is reached, so get off the loop
				break;
			} else {
				$truncate .= $line_matchings[2];
				$total_length += $content_length;
			}
			// if the maximum length is reached, get off the loop
			if($total_length>= $length) {
				break;
			}
		}
	} else {
		if (strlen($text) <= $length) {
			return $text;
		} else {
			$truncate = substr($text, 0, $length - strlen($ending));
		}
	}
	// if the words shouldn't be cut in the middle...
	if (!$exact) {
		// ...search the last occurrence of a space...
		$spacepos = strrpos($truncate, ' ');
		// strrpos() returns false when there is no space, so test for that explicitly
		if ($spacepos !== false) {
			// ...and cut the text in this position
			$truncate = substr($truncate, 0, $spacepos);
		}
	}
	// add the defined ending to the text
	$truncate .= $ending;
	if($considerHtml) {
		// close all unclosed html-tags
		foreach ($open_tags as $tag) {
			$truncate .= '</' . $tag . '>';
		}
	}
	return $truncate;
}
 
?>
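The heart of truncateHtml is the preg_match_all call that splits the input into "lines", each made of an optional tag followed by the plain text up to the next tag. The sketch below runs that same pattern on its own; the sample markup and the $tokens variable are illustrative and not part of the function.

```php
<?php
// Illustrative input.
$html = '<p>One, <b>Two</b><br />, Three</p>';

// Same tokenizing pattern as in truncateHtml():
// group 1 is an optional tag, group 2 the text up to the next tag.
preg_match_all('/(<.+?>)?([^<>]*)/s', $html, $lines, PREG_SET_ORDER);

$tokens = array();
foreach ($lines as $line) {
    // The pattern can also match the empty string at the very end; skip that.
    if ($line[0] === '') {
        continue;
    }
    $tokens[] = array($line[1], $line[2]);
}

foreach ($tokens as $token) {
    echo 'tag: ', $token[0], ' | text: "', $token[1], '"', "\n";
}
```

Only the text part of each token counts toward the length limit, which is why the tags can be copied to the output without using up any of the budget.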

20 Comments on PHP: Truncate string while preserving HTML tags and whole words

  1. Scott says:

    I’ve written a Java version of truncateHTML. This version also preserves word boundaries.

    public static String truncateHTML(String text, int length, String suffix) {
    // if the plain text is shorter than the maximum length, return the whole text
    if (text.replaceAll("<.*?>", "").length() <= length) {
    return text;
    }
    String result = "";
    boolean trimmed = false;
    if (suffix == null) {
    suffix = "…";
    }

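    /*
    * This pattern creates tokens, where each line starts with the tag.
    * For example, "One, <b>Two</b><br/>, Three" produces the following:
    * One,
    * <b>Two
    * </b>
    * <br/>, Three
    */
    Pattern tagPattern = Pattern.compile("(<.+?>)?([^<>]*)");

    /*
    * Checks for an empty tag, for example img, br, etc.
    * (same expression as in the PHP function above)
    */
    Pattern emptyTagPattern = Pattern.compile("^<(\\s*.+?/\\s*|\\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\\s.+?)?)>$");

    /*
    * Checks for closing tags, allowing leading and ending space inside the brackets
    * (same expression as in the PHP function above)
    */
    Pattern closingTagPattern = Pattern.compile("^<\\s*/([^\\s]+?)\\s*>$");

    /*
    * Checks for opening tags, allowing leading and ending space inside the brackets
    * (same expression as in the PHP function above)
    */
    Pattern openingTagPattern = Pattern.compile("^<\\s*([^\\s>!]+).*?>$");

    /*
    * Finds entities such as &nbsp; &gt; &hellip;
    */
    Pattern entityPattern = Pattern.compile("(&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};)");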

    // splits all html-tags to scanable lines
    Matcher tagMatcher = tagPattern.matcher(text);
    int numTags = tagMatcher.groupCount();

    int totalLength = suffix.length();
    List<String> openTags = new ArrayList<String>();

    boolean proposingChop = false;
    while (tagMatcher.find()) {
    String tagText = tagMatcher.group(1);
    String plainText = tagMatcher.group(2);

    if (proposingChop &&
    tagText != null && tagText.length() != 0 &&
    plainText != null && plainText.length() != 0) {
    trimmed = true;
    break;
    }

    // if there is any html-tag in this line, handle it and add it (uncounted) to the output
    if (tagText != null && tagText.length() > 0) {
    boolean foundMatch = false;

    // if it’s an “empty element” with or without xhtml-conform closing slash
    Matcher matcher = emptyTagPattern.matcher(tagText);
    if (matcher.find()) {
    foundMatch = true;
    // do nothing
    }

    // closing tag?
    if (!foundMatch) {
    matcher = closingTagPattern.matcher(tagText);
    if (matcher.find()) {
    foundMatch = true;
    // delete tag from openTags list
    String tagName = matcher.group(1);
    openTags.remove(tagName.toLowerCase());
    }
    }

    // opening tag?
    if (!foundMatch) {
    matcher = openingTagPattern.matcher(tagText);
    if (matcher.find()) {
    // add tag to the beginning of openTags list
    String tagName = matcher.group(1);
    openTags.add(0, tagName.toLowerCase());
    }
    }

    // add html-tag to result
    result += tagText;
    }

    // calculate the length of the plain text part of the line; handle entities (e.g. &nbsp;) as one character
    int contentLength = plainText.replaceAll("&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};", " ").length();
    if (totalLength + contentLength > length) {
    // the number of characters which are left
    int numCharsRemaining = length - totalLength;
    int entitiesLength = 0;
    Matcher entityMatcher = entityPattern.matcher(plainText);
    while (entityMatcher.find()) {
    String entity = entityMatcher.group(1);
    if (numCharsRemaining > 0) {
    numCharsRemaining--;
    entitiesLength += entity.length();
    } else {
    // no more characters left
    break;
    }
    }

    // keep us from chopping words in half
    int proposedChopPosition = numCharsRemaining + entitiesLength;
    int endOfWordPosition = plainText.indexOf(" ", proposedChopPosition - 1);
    if (endOfWordPosition == -1) {
    endOfWordPosition = plainText.length();
    }
    int endOfWordOffset = endOfWordPosition - proposedChopPosition;
    if (endOfWordOffset > 6) { // chop the word if it’s extra long
    endOfWordOffset = 0;
    }

    proposedChopPosition = numCharsRemaining + entitiesLength + endOfWordOffset;
    if (plainText.length() >= proposedChopPosition) {
    result += plainText.substring(0, proposedChopPosition);
    proposingChop = true;
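    // the branches from here to the end of the loop are assumed to mirror the PHP version above
    if (proposedChopPosition < plainText.length()) {
    trimmed = true;
    break;
    }
    }
    } else {
    result += plainText;
    totalLength += contentLength;
    }
    // if the maximum length is reached, get off the loop
    if (totalLength >= length) {
    trimmed = true;
    break;
    }
    }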

    for (String openTag : openTags) {
    result += "</" + openTag + ">";
    }
    if (trimmed) {
    result += suffix;
    }
    return result;
    }

  2. Jake says:

    Wonderful work. Thank you!

  3. Cameron Barr says:

    Great work! Works like a charm.

  4. Joe Corneli says:

    I found a slight modification to this code at http://www.gsdesign.ro/blog/cut-html-string-without-breaking-the-tags/#comment-487407314 (by Alex Prokop). It improves the code in the case $exact = false. I’ve tested it and it seems like a necessary fix for some complicated HTML. I thought I should mention it too, in case you want to add the fix into your version of the function.

  5. Dan says:

    Does not work for long urls like loooooong_url

  6. Pingback: PHP: Truncate HTML, ignoring tags - PHP Questions - Developers Q & A

  7. Luís Felipe de Andrade says:

    Not working. :(

  8. Matt O'Neal says:

    So Felipe says it doesn’t work. I don’t know; I have no need for this code right now (maybe in the future), but I do have a generic PHP function question.

    I’m always hesitant to put long functions in my code for fear that it will slow the system down too much. I try to modularize my scripts (for my own sanity), so one script may call 2-3 other scripts. And most of these scripts are less than 200 lines of code. I guess I could make these other smaller scripts into functions, but they’re usually only called once so I’ve had no real need to.

    I guess my question would be, is there any overhead cost to including other *.php files (that do a certain process)? Meaning would it save any processor time to build these as functions?

    • gustl says:

      By calling scripts, do you mean including the file?
      Never ever do that, max 1-2 times, to load an init file or something… Always use functions to execute code; use include only to include code, never to execute other code.

  9. Eric Swierczek says:

    I can’t tell you how many times this has worked instead of another suggested solution. In my case, this was a lifesaver because I had special HTML chars in my input string (–, •, etc.) that other scripts just wouldn’t quite handle correctly. Your script is also posted in many places on Stack Overflow as an accepted answer, nice job!
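    The entity handling mentioned here can be looked at in isolation. The sketch below uses the entity pattern from the function, assuming its hexadecimal branch is meant to read "&#x…;". Each entity is replaced with a single placeholder character before measuring. The sample text is made up.

```php
<?php
// Entity pattern from truncateHtml(); the "&#x" prefix on the hex branch is an assumption.
$entityPattern = '/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i';

// Illustrative text with named and numeric entities.
$text = 'Caf&eacute; &ndash; 5&#8364; &bull; ok';

// Raw byte length, entities counted in full.
$rawLength = strlen($text);

// Visible length: every entity collapses to one placeholder character.
$visibleLength = strlen(preg_replace($entityPattern, ' ', $text));

echo $rawLength, "\n";     // 38
echo $visibleLength, "\n"; // 14
```

    Note that strlen counts bytes, so literal multi-byte UTF-8 characters (as opposed to entities) would still be counted more than once each.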

  10. Evan says:

    I am grateful for this code.

    I have one problem: tags are not properly stripped and reinstated… I’m getting an extra \” in the code wherever there is a ”, which doesn’t affect the display of the text other than screwing up links . . .

    Perhaps someone can help. You can take a look at the code on northfronteonac.com; the body text of the news post accordions uses the truncateHtml script . . .

  11. Jason Long says:

    Great work – just what I was looking for!

    Thanks very much Alan (and CakePHP!)

  12. Jeff Andersen says:

    There’s an issue where malformed HTML will be output if the main truncation loop exits and $truncate looks like “<a href="…">widget”. The code will proceed to find the last occurrence of ” ” and trim there (assuming $exact is set to false), resulting in “<a…”. When the closing tags are added on, you’ve got something like “<a…</a>”.

    I borrowed some code from the main loop and wrote a subroutine that takes this possibility into account. Replace all code from “// if the words shouldn’t be cut in the middle…” to “// add the defined ending to the text” with the following:

    // if the words shouldn't be cut in the middle...
    if (!$exact) {
    /*
    In the event that the main loop above leaves us with $truncate looking like
    "<a href="…">hashbaz" with no trailing space, a naive approach to splitting
    on a word boundary will leave us with a string looking like "<a...</a>".

    This subroutine identifies what sort of string $truncate ends with, and deals with it accordingly.
    */
    $matched = preg_match_all('/(<.+?>)?([^<>]*)/s', $truncate, $lines, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    $naiveTrim = true;

    // We found something...
    if ($matched !== false && $matched > 0) {
    // ...so we should look for the last occurrence of anything interesting.
    $lastTagIndex = -1;

    for ($i = count($lines) - 1; $i >= 0; $i--) {
    // If the position of the tag is not -1, we've got something.
    if ($lines[$i][1][1] !== -1) {
    $lastTagIndex = $i;
    break;
    }
    }

    // If we didn't find anything interesting, or the last "line" contains spaces,
    // we can just naively trim the string to a word boundary.
    if ($lastTagIndex != -1 && strrpos($lines[$lastTagIndex][2][0], ' ') === false) {
    $naiveTrim = false;
    }
    }

    if ($naiveTrim) {
    // Search the last occurrence of a space...
    $spacepos = strrpos($truncate, ' ');
    if (isset($spacepos)) {
    // ...and cut the text in this position
    $truncate = substr($truncate, 0, $spacepos);
    }
    } else {
    // By default we will trim after the last HTML tag. If this tag is an opening tag we will
    // take that into account.
    $trimLocation = $lines[$lastTagIndex][2][1];

    if (!empty($lines[$lastTagIndex][1][0])) {
    // if it's an "empty element" with or without xhtml-conform closing slash
    if (preg_match('/^<(\s*.+?\/\s*|\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\s.+?)?)>$/is', $lines[$lastTagIndex][1][0])) {
    // do nothing
    // if tag is a closing tag
    } else if (preg_match('/^<\s*\/([^\s]+?)\s*>$/s', $lines[$lastTagIndex][1][0])) {
    // do nothing
    // if tag is an opening tag
    } else if (preg_match('/^<\s*([^\s>!]+).*?>$/s', $lines[$lastTagIndex][1][0], $tag_matchings)) {
    // Now we're trimming just before this opening tag. Better remove the tag from $open_tags.
    $trimLocation = $lines[$lastTagIndex][1][1];
    $pos = array_search(strtolower($tag_matchings[1]), $open_tags);
    if ($pos !== false) {
    unset($open_tags[$pos]);
    }
    }
    }

    $truncate = rtrim(substr($truncate, 0, $trimLocation));
    }
    }

    Note that this does not address the problem of unnecessarily trimming off characters in the event that the max character limit happened to fall on the end of a word. Trimming the string “foobar hashbaz” to six characters will still result in an empty string.
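    The empty-string result described above can be traced to how the return value of strrpos is tested when the truncated text contains no space. This is a minimal sketch of that behaviour; the $buggy and $fixed variables are illustrative names.

```php
<?php
$truncate = 'foobar';                // truncated text with no space in it
$spacepos = strrpos($truncate, ' '); // false: no space was found

// isset() only checks that the variable exists and is not null,
// so it is true even though the search failed.
var_dump(isset($spacepos)); // bool(true)

// A length of false is treated as 0, so everything is cut away.
$buggy = substr($truncate, 0, $spacepos);

// Testing against false keeps the text when there is no word boundary.
$fixed = ($spacepos !== false) ? substr($truncate, 0, $spacepos) : $truncate;

var_dump($buggy); // string(0) ""
var_dump($fixed); // string(6) "foobar"
```

    Comparing with !== false is the usual way to test strrpos, since a match at position 0 is also a falsy value.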

  13. Neil Kempin says:

    Heads up: this code has a bug!

    The problem occurs towards the end, where the code is cutting on spaces:

    // if the words shouldn’t be cut in the middle…
    if (!$exact) {
    // …search the last occurance of a space…
    $spacepos = strrpos($truncate, ' ');
    if (isset($spacepos)) {
    // …and cut the text in this position
    $truncate = substr($truncate, 0, $spacepos);
    }
    }

    my html looks like blah blah blah
    my result ends up being blah blah
    The closing strong tag gets cut off because the code chooses to cut the text including the closing tag. Doesn’t this make the whole function pointless? I don’t understand why this appears to be working for so many of you.

  14. Neil Kempin says:

    Oops, my previous comment cut the HTML tags I was writing out. My examples should have been

    blah blah blah

    after being run through the function becomes

    blah blah

    because that is the position where the last space occurred (the closing tag is chopped off, defeating the purpose)

    • Alan Whipple says:

      There may very well be a bug with this function, as alluded to by previous comments, but I haven’t investigated. This function originally came from CakePHP, so perhaps they now have an updated version. This one is still working for me and apparently others.
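      One possible way around the chopped-tag problem described in this thread is to look for the last space only in the plain text that follows the final tag. The sketch below is not the CakePHP fix. The helper name trimToWordBoundary is hypothetical, and the sketch assumes any literal ">" in the text has been escaped as &gt;.

```php
<?php
// Hypothetical helper: cut back to the last space, but only within the
// plain text after the final tag, so that no tag is ever cut in half.
function trimToWordBoundary(string $truncate): string
{
    // Position just past the last tag, or 0 if there is no tag at all.
    $lastTagEnd = strrpos($truncate, '>');
    $tailStart = ($lastTagEnd === false) ? 0 : $lastTagEnd + 1;

    // Search for a space in the trailing text only.
    $spacepos = strrpos(substr($truncate, $tailStart), ' ');
    if ($spacepos === false) {
        // No safe word boundary: leave the string untouched.
        return $truncate;
    }
    return substr($truncate, 0, $tailStart + $spacepos);
}

echo trimToWordBoundary('<strong>blah blah</strong>bl'), "\n";
echo trimToWordBoundary('<a href="x">one two th'), "\n";
echo trimToWordBoundary('foo bar ba'), "\n";
```

      The trade-off is that a partial word can survive when the trailing text has no space, so the result may end mid-word rather than lose a tag.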
