Word to HTML and Smartening Quotes for eBooks using SED and HTMLTidy

I had a bunch of separate DOC files with underlines that needed to become italics, and the text was a mix of smart quotes and straight quotes.

I wanted a clean HTML output with smart quotes. 

That’s right.  I wanted to add smart quotes and convert them to HTML entities as well.
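To give a rough idea of what that means in SED terms, here is a minimal sketch. It is not the code from my script, just an illustration: the function name `smarten_quotes` is made up, and the rule it uses (a quote at the start of a line or after whitespace or an opening parenthesis is an opening quote, anything else is a closing quote or apostrophe) is a simplifying assumption that will miss some edge cases.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads text on stdin, writes it with quotes as HTML entities.
smarten_quotes() {
  sed -E \
    -e 's/“/\&ldquo;/g; s/”/\&rdquo;/g' \
    -e 's/‘/\&lsquo;/g; s/’/\&rsquo;/g' \
    -e 's/(^|[[:space:](])"/\1\&ldquo;/g' \
    -e 's/"/\&rdquo;/g' \
    -e "s/(^|[[:space:](])'/\1\&lsquo;/g" \
    -e "s/'/\&rsquo;/g"
  # Lines 1-2: existing curly quotes become entities.
  # Lines 3-4: straight double quotes, opening first, then whatever is left.
  # Lines 5-6: straight single quotes and apostrophes, same idea.
}

printf '%s\n' "\"Hello,\" she said. 'It's fine.'" | smarten_quotes
```

The `\&` matters: a bare `&` in a SED replacement means "the text that matched". Also note that this only makes sense on the text itself. Run it over raw markup and it will happily mangle the quotes around attribute values like `href="..."`.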

Automatically, so I don’t have to go through this again.  And no line wrapping.  And so I did, using LibreOffice (OpenOffice should work too) from the command line, HTMLTidy, and SED.

I started out with the instructions on TechRepublic here:  https://www.techrepublic.com/blog/linux-and-open-source/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/, but there were some shortcomings.  I kept losing underlines entirely – they were converted to classes, and then stripped – and it couldn’t handle filenames with spaces at all.  And the smart-quote handling of HTMLTidy is… well, minimal.
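One way around the vanishing underlines is to turn them into italics before HTMLTidy ever sees them. The sketch below assumes the exported HTML marks underlined text with plain `<u>` tags (upper or lower case, possibly with attributes); if your export uses something else, the pattern needs adjusting. The function name `underline_to_italic` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrites <u>...</u> as <i>...</i> on stdin.
# The "( [^>]*)?" part allows attributes but will not match <ul>.
underline_to_italic() {
  sed -E \
    -e 's/<[uU]( [^>]*)?>/<i>/g' \
    -e 's#</[uU]>#</i>#g'
}

printf '%s\n' 'Read <u>Moby Dick</u> and <U>Dracula</U>.' | underline_to_italic
```

Doing this as its own step, before any cleanup, means there is no class for a later step to strip.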

My script converts all the DOCX, DOC, and ODT files in a directory, does each substitution step-by-step (so you can deconstruct it yourself), and deletes all of the temporary files at the end.   I’ve added it to the ebook-utilz GitHub repository, and you can see just that Bash script at https://github.com/uriel1998/ebook-utilz/blob/master/doc2html.  Since it’s a Bash script, you should be able to use it on *nix, OS X, and possibly Cygwin.
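If you just want the general shape of it without reading the whole script, something like the following is a stripped-down sketch. It is not a copy of doc2html: the function names `html_name_for` and `convert_dir` are invented for this example, it skips the substitution steps entirely, and it assumes `soffice` and `tidy` are on your PATH and that the exported HTML is UTF-8.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the HTML filename LibreOffice writes for a given document.
html_name_for() {
  local base
  base=$(basename -- "$1")
  printf '%s\n' "${base%.*}.html"
}

# Hypothetical driver: convert every DOC, DOCX, and ODT file in a directory.
convert_dir() {
  local dir=${1:-.} tmp doc html
  tmp=$(mktemp -d) || return 1
  for doc in "$dir"/*.doc "$dir"/*.docx "$dir"/*.odt; do
    [ -e "$doc" ] || continue   # skip patterns that matched nothing
    html=$(html_name_for "$doc")
    # Quoting "$doc" is what keeps filenames with spaces in one piece.
    soffice --headless --convert-to html --outdir "$tmp" "$doc" >/dev/null || continue
    # -wrap 0 turns off line wrapping; tidy exits with 1 when it only has warnings.
    tidy -q -utf8 -wrap 0 -o "$dir/$html" "$tmp/$html" || [ $? -eq 1 ]
  done
  rm -rf -- "$tmp"              # clean up the temporary files at the end
}
```

You would call it as `convert_dir "/path/to/my manuscripts"`, and the cleaned HTML lands next to the originals. The substitutions would slot in between the `soffice` and `tidy` lines.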