Sanitize Content: removing markup from Amazon's content
Posted by StackOverflowNewbie on Stack Overflow
Published on 2011-01-31T09:13:13Z. Indexed on 2011/02/12 23:25 UTC.
I'm using Amazon Web Services to get product descriptions of various items. The problem is that Amazon's content contains markup that is sometimes destructive to the layout of my web page (e.g. unclosed DIVs, etc.).
I want to sanitize the content I get from Amazon. My solution would be to do the following (my initial list so far):
- Remove unnecessary tags such as div, span, etc. while keeping tags like p, ul, ol, etc.
- Remove all attributes from all the tags (e.g. seems like there are style attributes in some of the tags)
- Remove excess white space (e.g. multiple spaces, carriage returns, new lines, tabs, etc.)
- Etc.
 
Before I go off trying to build my solution, I'm wondering if anyone has a better idea (or an already existing solution). Thanks.
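
For illustration, here is a minimal sketch of the steps listed above, written in Python using only the standard library's html.parser. The post does not say which language is in use, so the language choice is an assumption, and so are the names ALLOWED_TAGS, AllowlistSanitizer, and sanitize and the particular set of tags kept. It keeps allowlisted tags, drops every attribute, closes any tags left open, discards script and style content, and collapses runs of whitespace. A vetted sanitizer library is likely a safer choice than hand-rolled code like this for untrusted input.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist; adjust to whatever tags the page layout tolerates.
ALLOWED_TAGS = {"p", "ul", "ol", "li", "b", "i", "em", "strong", "br"}
VOID_TAGS = {"br"}                      # tags that never get a closing tag
DROP_CONTENT_TAGS = {"script", "style"}  # tags whose text is discarded too


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the sanitized output
        self.open_tags = []  # stack of allowed tags still open
        self.skip_depth = 0  # > 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            # Re-emit the tag without any of its attributes.
            self.out.append(f"<{tag}>")
            if tag not in VOID_TAGS:
                self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.open_tags and not self.skip_depth:
            # Close inner tags that were left open, then the tag itself.
            while self.open_tags:
                closed = self.open_tags.pop()
                self.out.append(f"</{closed}>")
                if closed == tag:
                    break

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape text so stray < or & cannot reintroduce markup.
            self.out.append(escape(data, quote=False))

    def result(self):
        # Close anything the source never closed (e.g. an unclosed <p>).
        closing = "".join(f"</{t}>" for t in reversed(self.open_tags))
        text = "".join(self.out) + closing
        # Collapse spaces, tabs, carriage returns, and newlines.
        return re.sub(r"\s+", " ", text).strip()


def sanitize(html_fragment):
    parser = AllowlistSanitizer()
    parser.feed(html_fragment)
    parser.close()
    return parser.result()
```

Calling sanitize on a fragment wrapped in a styled div returns only the allowed inner tags with their attributes removed, and a fragment that ends with tags still open comes back with those tags closed. Tags outside the allowlist are removed but their text is kept, which may or may not be what the page needs.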
© Stack Overflow or respective owner