How can I take the first 100 characters of html content ( without stripping the TAGS! )

Posted by Atomiton on Stack Overflow See other posts from Stack Overflow or by Atomiton
Published on 2010-03-29T20:11:39Z Indexed on 2010/03/29 22:13 UTC
Read the original article Hit count: 456

Filed under:
|
|
|
|

There are lots of questions on how to strip html tags, but not many on functions/methods to close them.

Here's the situation. I have a 500 character Message summary ( which includes html tags ), but I only want the first 100 characters. Problem is if I truncate the message, it could be in the middle of an html tag... which messes up stuff.

Assuming the html is something like this:

<div class="bd">"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <br/>
 <br/>Some Dates: April 30 - May 2, 2010 <br/>
 <p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. <em>Duis aute irure dolor in reprehenderit</em> in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. <br/>
 </p>
 For more information about Lorem Ipsum doemdloe, visit: <br/>
 <a href="http://www.somesite.com" title="Some Conference">Some text link</a><br/> 
</div>

How would I take the first ~100 characters or so? ( Although, ideally that would be the first approximately 100 characters of "CONTENT" ( in between the html tags )

I'm assuming the best way to do this would be a recursive algorithm that keeps track of the html tags and appends any tags that would be truncated, but that may not be the best approach.

My first thoughts are using recursion to count nested tags, and when we reach 100 characters, look for the next "<" and then use recursion to write the closing html tags needed from there.

The reason for doing this is to make a short summary of existing articles without requiring the user to go back and provide summaries for all the articles. I want to keep the html formatting, if possible.

NOTE: Please ignore that the html isn't totally semantic. This is what I have to deal with from my WYSIWYG.

EDIT:

I added a potential solution ( that seems to work ) I figure others will run into this problem as well. I'm not sure it's the best... and it's probably not totally robust ( in fact, I know it isn't ), but I'd appreciate any feedback

© Stack Overflow or respective owner

Related posts about recursion

Related posts about html