Programmatically clean Word-generated HTML while preserving styles?
Posted by GeReV on Stack Overflow
        
        
        
        
Published on 2010-05-10T21:46:40Z
Indexed on 2010/05/14 20:44 UTC
        
        
        
In my current company, we have this decade-old... let's call it a "Hello World" application.
We want to build a newer version of it, but we also want to preserve the older entries.
These older entries contain hideous Word-generated HTML which was never filtered before.
If and when we move to a newer system, I'd generally prefer to have that HTML cleaned and filtered so that the site complies with HTML standards as much as possible.
However, just cleaning that code the way Jeff Atwood described on his blog, or in any other way I know of, would also ruin the style and formatting.
Now, that just might cause our users to revolt and then all hell will break loose... Not a very good idea.
The question is: can Word's HTML be cleaned while preserving basic formatting (e.g., colors, italics, bold text and so on)?
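To make the goal concrete, the kind of filtering meant here can be pictured as a whitelist: keep a small set of formatting tags and a few CSS properties, and drop everything else (Office namespaced tags, classes, `mso-*` styles, conditional comments). The following is only a rough sketch of that idea, written with Python's standard `html.parser` module for brevity. It is not HTML Tidy and not the approach from Jeff Atwood's post, and the names `ALLOWED_TAGS`, `ALLOWED_STYLES`, `WordHtmlCleaner` and `clean_word_html` are made up for illustration. The same structure should carry over to C#.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists for this sketch; adjust them to the formatting you need.
ALLOWED_TAGS = {"p", "br", "b", "strong", "i", "em", "u", "span", "ul", "ol", "li"}
ALLOWED_STYLES = {"color", "font-weight", "font-style", "text-decoration"}
VOID_TAGS = {"br"}                            # tags that never get a closing tag
SKIP_CONTENT = {"style", "script", "title"}   # drop these tags and their text


def filter_style(style):
    """Keep only the whitelisted CSS declarations from a style attribute."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name, value = name.strip().lower(), value.strip()
        if sep and value and name in ALLOWED_STYLES:
            kept.append(f"{name}: {value}")
    return "; ".join(kept)


class WordHtmlCleaner(HTMLParser):
    """Re-emits only whitelisted tags and styles; other tags lose their markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself, keep the text inside it
        style = filter_style(dict(attrs).get("style") or "")
        attr_text = f' style="{escape(style)}"' if style else ""
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))

    # Comments, including Word's conditional comments, are dropped because
    # handle_comment is left at its default (which does nothing).


def clean_word_html(html):
    cleaner = WordHtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

For example, `clean_word_html("<p class=MsoNormal style='margin:0cm;color:red'><b>Hi</b></p>")` keeps the paragraph, the bold tag and the color, and drops the class and the margin. A sketch like this does not repair badly nested markup, so a real tidy-up pass would probably still be needed on top of it.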
Preferably using publicly available code or a library, such as HTML Tidy. Examples in C# would be much appreciated.
Thanks!
© Stack Overflow or respective owner