I had a bunch of separate DOC files with underlines that needed to become italics, and with a mix of smart quotes and straight quotes in the text.
I wanted a clean HTML output with smart quotes.
That’s right. I wanted to add smart quotes and convert them to HTML entities as well.
Automatically, so I don’t have to go through this again. And no line wrapping. And so I did, using LibreOffice (OpenOffice should work too) from the command line, HTMLTidy, and sed.
I started out with the instructions on TechRepublic here: http://www.techrepublic.com/blog/linux-and-open-source/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/, but there were some shortcomings. I kept losing underlines entirely (they were converted to classes, and then stripped), and it couldn’t handle filenames with spaces at all. And the smart quotes handling of HTMLTidy is… well, minimal.
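To give an idea of what the underline and quote fixes can look like, here is a minimal sed sketch. It is not lifted from my script: the function names are made up for illustration, it assumes the underlines arrive as plain <u> tags, and it assumes the straight quotes sit in body text. It does not protect quotes inside tag attributes, so run it only on markup where that is safe.

```shell
#!/usr/bin/env bash

# Turn underline tags into emphasis tags before any cleanup step can
# strip them. Assumes plain <u>...</u> tags with no attributes.
underline_to_em() {
    sed -e 's/<u>/<em>/g' -e 's/<\/u>/<\/em>/g'
}

# Replace quotes with HTML entities for curly quotes.
# A straight quote at the start of a line, or after whitespace, "(" or ">",
# is treated as an opening quote; every other straight quote is closing.
# Caveat: this also rewrites quotes inside tag attributes.
smarten_quotes() {
    sed -E \
        -e 's/“/\&ldquo;/g' -e 's/”/\&rdquo;/g' \
        -e 's/‘/\&lsquo;/g' -e 's/’/\&rsquo;/g' \
        -e 's/(^|[[:space:](>])"/\1\&ldquo;/g' \
        -e 's/"/\&rdquo;/g' \
        -e "s/(^|[[:space:](>])'/\1\&lsquo;/g" \
        -e "s/'/\&rsquo;/g"
}
```

Both functions read standard input and write standard output, so they can be chained in a pipe, one substitution per step.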
My script converts all the DOCX, DOC, and ODT files in a directory, does each substitution step by step (so you can deconstruct it yourself), and deletes all of the temporary files at the end. I’ve added it to the ebook-utilz GitHub repository, and you can see just that Bash script at https://github.com/uriel1998/ebook-utilz/blob/master/doc2html. Since it’s a Bash script, you should be able to use it on *nix, OS X, and possibly Cygwin.
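If you just want the overall shape before reading the real thing, here is a stripped-down sketch. It is not the doc2html script itself: the function names are hypothetical, it assumes soffice and tidy are on your PATH, and it leaves out the substitution steps. The point is mostly the quoting, which is what keeps filenames with spaces intact.

```shell
#!/usr/bin/env bash

# Print every .doc, .docx, and .odt file in a directory, one per line.
# Quoting "$dir" and "$f" keeps filenames with spaces in one piece.
list_documents() {
    local dir=$1 f
    for f in "$dir"/*.doc "$dir"/*.docx "$dir"/*.odt; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Convert one document to HTML next to the original, without line wrapping.
convert_document() {
    local src=$1 dir base tmp status
    dir=$(dirname -- "$src")
    base=$(basename -- "${src%.*}")
    tmp=$(mktemp -d) || return 1

    # LibreOffice writes "$base.html" into the directory given by --outdir.
    if soffice --headless --convert-to html --outdir "$tmp" "$src" >/dev/null </dev/null; then
        # -wrap 0 turns off line wrapping; tidy exits 1 on mere warnings.
        tidy -q -wrap 0 -o "$dir/$base.html" "$tmp/$base.html"
        status=$?
        [ "$status" -le 1 ] && status=0
    else
        status=1
    fi

    rm -rf -- "$tmp"   # remove the temporary files either way
    return "$status"
}
```

Feeding each line printed by list_documents to convert_document covers a whole directory. A filename containing a newline would break the one-per-line listing, which I am assuming is rare enough to ignore here.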