Truncating strings is a very common task while programming. Sometimes those strings have HTML code within them. If you simply truncate at X characters, you risk outputting badly broken HTML. If you can live without the HTML, the easy solution is strip_tags. However, if you want to preserve the HTML tags, you'll need a smarter truncate function.
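To make the problem concrete, here is a small Java sketch. The class and method names are made up for illustration, and `stripTags` is only a rough regex stand-in for PHP's strip_tags, not an equivalent. It cuts a sample string at 20 characters, first blindly and then after stripping the markup.

```java
final class NaiveTruncateDemo {
    // Cuts at a fixed character count with no awareness of markup.
    static String naiveTruncate(String html, int length) {
        return html.length() <= length ? html : html.substring(0, length);
    }

    // Rough stand-in for strip_tags: drops anything that looks like a tag.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <a href=\"/about\">world</a>, welcome!</p>";
        // The cut lands in the middle of the <a ...> tag.
        System.out.println(naiveTruncate(html, 20));
        // Safe to output, but the link is gone.
        System.out.println(naiveTruncate(stripTags(html), 20));
    }
}
```

The first call returns `<p>Hello <a href="/a`, which leaves a half-written tag in the page. The second returns `Hello world, welcome`, which is safe but has lost the link.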
I yanked this function from a blog that looks like it yanked it from another blog, which yanked it from the CakePHP framework. This function is too good not to share.
<?php
/**
 * truncateHtml can truncate a string up to a number of characters while preserving whole words and HTML tags
 *
 * @param string  $text         String to truncate.
 * @param integer $length       Length of returned string, including ellipsis.
 * @param string  $ending       Ending to be appended to the trimmed string.
 * @param boolean $exact        If false, $text will not be cut mid-word
 * @param boolean $considerHtml If true, HTML tags would be handled correctly
 *
 * @return string Trimmed string.
 */
function truncateHtml($text, $length = 100, $ending = '...', $exact = false, $considerHtml = true) {
    if ($considerHtml) {
        // if the plain text is shorter than the maximum length, return the whole text
        if (strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        // splits all html-tags to scannable lines
        preg_match_all('/(<.+?>)?([^<>]*)/s', $text, $lines, PREG_SET_ORDER);
        $total_length = strlen($ending);
        $open_tags = array();
        $truncate = '';
        foreach ($lines as $line_matchings) {
            // if there is any html-tag in this line, handle it and add it (uncounted) to the output
            if (!empty($line_matchings[1])) {
                if (preg_match('/^<(\s*.+?\/\s*|\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\s.+?)?)>$/is', $line_matchings[1])) {
                    // it's an "empty element" with or without xhtml-conform closing slash: do nothing
                } else if (preg_match('/^<\s*\/([^\s]+?)\s*>$/s', $line_matchings[1], $tag_matchings)) {
                    // closing tag: delete it from the $open_tags list
                    $pos = array_search(strtolower($tag_matchings[1]), $open_tags);
                    if ($pos !== false) {
                        unset($open_tags[$pos]);
                    }
                } else if (preg_match('/^<\s*([^\s>!]+).*?>$/s', $line_matchings[1], $tag_matchings)) {
                    // opening tag: add it to the beginning of the $open_tags list
                    array_unshift($open_tags, strtolower($tag_matchings[1]));
                }
                // add html-tag to $truncate'd text
                $truncate .= $line_matchings[1];
            }
            // calculate the length of the plain text part of the line; handle entities as one character
            $content_length = strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $line_matchings[2]));
            if ($total_length + $content_length > $length) {
                // the number of characters which are left
                $left = $length - $total_length;
                $entities_length = 0;
                // search for html entities
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $line_matchings[2], $entities, PREG_OFFSET_CAPTURE)) {
                    // calculate the real length of all entities in the legal range
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entities_length <= $left) {
                            $left--;
                            $entities_length += strlen($entity[0]);
                        } else {
                            // no more characters left
                            break;
                        }
                    }
                }
                $truncate .= substr($line_matchings[2], 0, $left + $entities_length);
                // maximum length is reached, so get off the loop
                break;
            } else {
                $truncate .= $line_matchings[2];
                $total_length += $content_length;
            }
            // if the maximum length is reached, get off the loop
            if ($total_length >= $length) {
                break;
            }
        }
    } else {
        if (strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = substr($text, 0, $length - strlen($ending));
        }
    }
    // if the words shouldn't be cut in the middle...
    if (!$exact) {
        // ...search the last occurrence of a space...
        $spacepos = strrpos($truncate, ' ');
        // strrpos returns false when there is no space, so compare against false rather than using isset
        if ($spacepos !== false) {
            // ...and cut the text in this position
            $truncate = substr($truncate, 0, $spacepos);
        }
    }
    // add the defined ending to the text
    $truncate .= $ending;
    if ($considerHtml) {
        // close all unclosed html-tags
        foreach ($open_tags as $tag) {
            $truncate .= '</' . $tag . '>';
        }
    }
    return $truncate;
}
?>
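The heart of the function is the pattern handed to preg_match_all, which breaks the input into pairs of an optional tag plus the text that follows it. The sketch below replays just that one step in Java so the pairs are easy to inspect. The `TagTokenizerDemo` class and its `tag|text` output format are assumptions made for the demo, not part of the original function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagTokenizerDemo {
    // An optional tag followed by the run of text up to the next angle bracket.
    // DOTALL mirrors the /s modifier on the PHP pattern.
    private static final Pattern TOKEN =
            Pattern.compile("(<.+?>)?([^<>]*)", Pattern.DOTALL);

    // Returns one "tag|text" string per non-empty match (display format is hypothetical).
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group().isEmpty()) {
                continue; // the pattern can also match the empty string at the end of input
            }
            String tag = m.group(1) == null ? "" : m.group(1);
            tokens.add(tag + "|" + m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("One, <b>Two</b>, Three")) {
            System.out.println(token);
        }
    }
}
```

Each pair carries at most one tag, which is why the function can add the tag to the output without counting it and then measure only the text part against the remaining length.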
I've written a Java version of truncateHTML. This version also preserves word boundaries.
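// Place inside a class; requires these imports:
// import java.util.ArrayList;
// import java.util.List;
// import java.util.regex.Matcher;
// import java.util.regex.Pattern;
public static String truncateHTML(String text, int length, String suffix) {
    // if the plain text is shorter than the maximum length, return the whole text
    if (text.replaceAll("<.*?>", "").length() <= length) {
        return text;
    }
    String result = "";
    boolean trimmed = false;
    if (suffix == null) {
        suffix = "...";
    }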
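    /*
     * This pattern creates tokens, where each line starts with the tag.
     * For example, "One, <b>Two</b>, Three" produces the following:
     *     One,
     *     <b>Two
     *     </b>, Three
     */
    Pattern tagPattern = Pattern.compile("(<.+?>)?([^<>]*)", Pattern.DOTALL);
    /*
     * Checks for an empty tag, for example img, br, etc.
     * (pattern carried over from the PHP function above)
     */
    Pattern emptyTagPattern = Pattern.compile(
            "^<(\\s*.+?/\\s*|\\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\\s.+?)?)>$",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    /*
     * Checks for closing tags, allowing leading and ending space inside the brackets
     * (pattern carried over from the PHP function above)
     */
    Pattern closingTagPattern = Pattern.compile("^<\\s*/([^\\s]+?)\\s*>$", Pattern.DOTALL);
    /*
     * Checks for opening tags, allowing leading and ending space inside the brackets
     * (pattern carried over from the PHP function above)
     */
    Pattern openingTagPattern = Pattern.compile("^<\\s*([^\\s>!]+).*?>$", Pattern.DOTALL);
    /*
     * Finds HTML entities such as &gt; or &nbsp;
     */
    Pattern entityPattern = Pattern.compile(
            "(&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};)", Pattern.CASE_INSENSITIVE);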
    // splits all html-tags to scannable lines
    Matcher tagMatcher = tagPattern.matcher(text);
    int totalLength = suffix.length();
    List<String> openTags = new ArrayList<String>();
    boolean proposingChop = false;
    while (tagMatcher.find()) {
        String tagText = tagMatcher.group(1);
        String plainText = tagMatcher.group(2);
        if (proposingChop &&
                tagText != null && tagText.length() != 0 &&
                plainText != null && plainText.length() != 0) {
            trimmed = true;
            break;
        }
        // if there is any html-tag in this line, handle it and add it (uncounted) to the output
        if (tagText != null && tagText.length() > 0) {
            boolean foundMatch = false;
            // if it's an "empty element" with or without xhtml-conform closing slash
            Matcher matcher = emptyTagPattern.matcher(tagText);
            if (matcher.find()) {
                foundMatch = true;
                // do nothing
            }
            // closing tag?
            if (!foundMatch) {
                matcher = closingTagPattern.matcher(tagText);
                if (matcher.find()) {
                    foundMatch = true;
                    // delete tag from openTags list
                    String tagName = matcher.group(1);
                    openTags.remove(tagName.toLowerCase());
                }
            }
            // opening tag?
            if (!foundMatch) {
                matcher = openingTagPattern.matcher(tagText);
                if (matcher.find()) {
                    // add tag to the beginning of openTags list
                    String tagName = matcher.group(1);
                    openTags.add(0, tagName.toLowerCase());
                }
            }
            // add html-tag to result
            result += tagText;
        }
        // calculate the length of the plain text part of the line; handle entities (e.g. &nbsp;) as one character
        int contentLength = entityPattern.matcher(plainText).replaceAll(" ").length();
        if (totalLength + contentLength > length) {
            // the number of characters which are left (never negative, even for a long suffix)
            int numCharsRemaining = Math.max(0, length - totalLength);
            int entitiesLength = 0;
            Matcher entityMatcher = entityPattern.matcher(plainText);
            while (entityMatcher.find()) {
                String entity = entityMatcher.group(1);
                if (numCharsRemaining > 0) {
                    numCharsRemaining--;
                    entitiesLength += entity.length();
                } else {
                    // no more characters left
                    break;
                }
            }
            // keep us from chopping words in half
            int proposedChopPosition = numCharsRemaining + entitiesLength;
            int endOfWordPosition = plainText.indexOf(" ", proposedChopPosition - 1);
            if (endOfWordPosition == -1) {
                endOfWordPosition = plainText.length();
            }
            int endOfWordOffset = endOfWordPosition - proposedChopPosition;
            if (endOfWordOffset > 6) { // chop the word if it's extra long
                endOfWordOffset = 0;
            }
            proposedChopPosition = numCharsRemaining + entitiesLength + endOfWordOffset;
            if (plainText.length() >= proposedChopPosition) {
                result += plainText.substring(0, proposedChopPosition);
                proposingChop = true;
                if (proposedChopPosition < plainText.length()) {
                    // maximum length is reached, so get off the loop
                    trimmed = true;
                    break;
                }
            } else {
                result += plainText;
            }
        } else {
            result += plainText;
            totalLength += contentLength;
        }
        // if the maximum length is reached, get off the loop
        if (totalLength >= length) {
            trimmed = true;
            break;
        }
    }
    // close all unclosed html-tags
    for (String openTag : openTags) {
        result += "</" + openTag + ">";
    }
    if (trimmed) {
        result += suffix;
    }
    return result;
}
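Both versions count an HTML entity as a single character when measuring the visible text. A minimal Java sketch of just that measurement follows. `EntityLengthDemo` and `visibleLength` are illustrative names, and the pattern assumes the hexadecimal alternative is meant to match references like `&#xA9;`.

```java
import java.util.regex.Pattern;

final class EntityLengthDemo {
    // Named, decimal and hexadecimal character references.
    private static final Pattern ENTITY = Pattern.compile(
            "&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};",
            Pattern.CASE_INSENSITIVE);

    // Length of the text as a reader sees it: each entity counts as one character.
    static int visibleLength(String plainText) {
        return ENTITY.matcher(plainText).replaceAll(" ").length();
    }

    public static void main(String[] args) {
        String text = "Fish &amp; chips";
        System.out.println(text.length());       // raw length, entity spelled out
        System.out.println(visibleLength(text)); // entity counted once
    }
}
```

For `Fish &amp; chips` the raw length is 16 while the visible length is 12, so a limit applied to the raw string would cut four characters too early.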
Good job Alan Whipple !!!
Wonderful work. Thank you!
Great work! Works like a charm.
Thanks man!
I found a slight modification to this code at http://www.gsdesign.ro/blog/cut-html-string-without-breaking-the-tags/#comment-487407314 (by Alex Prokop). It improves the code in the case $exact = false. I've tested it and it seems like a necessary fix for some complicated HTML. I thought I should mention it too, in case you want to add the fix into your version of the function.
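I have not reproduced the linked fix here, so treat the following only as an illustration of one way the word-boundary step can go wrong on tag-heavy input. The PHP function looks for the last space in the truncated string, tags included, so a space between two attributes can be picked as the cut point. The Java sketch below (all names are hypothetical) finds the last space that sits outside any tag instead.

```java
final class WordBoundaryDemo {
    // Index of the last space that is not inside a tag, or -1 if there is none.
    // Assumes attribute values do not contain '<' or '>'.
    static int lastSpaceOutsideTags(String html) {
        boolean inTag = false;
        int last = -1;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            } else if (c == ' ' && !inTag) {
                last = i;
            }
        }
        return last;
    }

    // Cuts at the last space outside a tag; leaves the input alone if there is none.
    static String cutAtWord(String html) {
        int pos = lastSpaceOutsideTags(html);
        return pos == -1 ? html : html.substring(0, pos);
    }

    public static void main(String[] args) {
        String truncated = "Read the <a href=\"/x\" class=\"more\">documentat";
        // A plain last-space search cuts between the two attributes.
        System.out.println(truncated.substring(0, truncated.lastIndexOf(' ')));
        // Skipping spaces inside tags cuts after a whole word instead.
        System.out.println(cutAtWord(truncated));
    }
}
```

Note that cutting this way can drop an opening tag that was already recorded as open, so a complete fix would also have to remove that tag from the list before the closing tags are appended.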