Using SED To Capitalize Titles — And A Tricky Escaping Problem

I had a problem. I needed a tool to reliably capitalize news article titles.

While sed (I’m using GNU sed 4.9-1 from Debian) obviously came to mind, article titles vary a lot in how they are styled. For my part, I needed a few things:

  • Capitalize the first letter of every word, including "to" and "of" and the like
  • Capitalize the first letter after a hyphen
  • Turn the various forms of em dashes into --
  • De-smarten quotes
  • Properly handle titles that contain single quotes
  • Properly handle all-caps words

For example, this fake title (yes, I know the apostrophe s is wrong; it’s a needed example):

Social Media and E-learning Spurs Kids' Interest In Anti-aging ASMR Products—'this is rad, boomer's don't get it.'

when run through a tool like ConvertCase.net ends up being:

Social Media And E-learning Spurs Kids' Interest In Anti-aging Asmr Products—'this Is Rad, Boomer's Don't Get It.'

Fixing those edge cases ends up involving a pretty intense extended sed transformation, but results in this:

Social Media And E-Learning Spurs Kids' Interest In Anti-Aging ASMR Products -- 'This Is Rad, Boomer's Don't Get It.'

Here’s the entire sed string:

sed -e "s/\b\(.\)/\u\1/g" -e "s/-\(.\)/-\u\1/g" -e 's/“/"/g' -e 's/”/"/g' -e "s/’/'/g" -e 's/—/ -- /g' -e 's/ — / -- /g' -e 's/ - / -- /g' -e 's/ – / -- /g' -e 's/ – / -- /g' -e "s/"\'"\([A-Z]\)\b/"\'"\l\1/g" -e "s/"\'"\([A-Za-z][A-Za-z]\)\b/"\'"\l\1/g"

Let’s break down the bits.

The first part, -e "s/\b\(.\)/\u\1/g" -e "s/-\(.\)/-\u\1/g", capitalizes the first character after every word boundary (\b), which covers the first letter after whitespace, hyphens, and quotation marks. The \u is GNU sed’s way of saying "uppercase the next character." I found an answer on Stack Overflow that pointed me in this direction, and the rest was trial and error.
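To see what that first pair of expressions does on its own, here is a small sketch run against a short sample phrase of my own. The function name capitalize_words is hypothetical, and GNU sed is assumed.

```shell
# capitalize_words is a hypothetical name wrapping only the first two
# expressions: uppercase the character after each word boundary, and the
# character after each hyphen.
capitalize_words() {
  sed -e "s/\b\(.\)/\u\1/g" -e "s/-\(.\)/-\u\1/g"
}

printf '%s\n' "e-learning isn't new" | capitalize_words
# prints: E-Learning Isn'T New
```

The stray capital after the apostrophe is the side effect that the later expressions have to clean up.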

Then you have a bunch of transforms for the types of "smart" quotes and em dash variations, which are pretty straightforward: -e 's/“/"/g' -e 's/”/"/g' -e "s/’/'/g" -e 's/—/ -- /g' -e 's/ — / -- /g' -e 's/ - / -- /g' -e 's/ – / -- /g' -e 's/ – / -- /g' .
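Here is a quick sketch of just the punctuation transforms, run against a sample line I made up. The function name normalize_punctuation is hypothetical, and the duplicated en dash rule is listed once.

```shell
# normalize_punctuation is a hypothetical name wrapping only the quote and
# dash expressions, in the same order as in the post.
normalize_punctuation() {
  sed -e 's/“/"/g' -e 's/”/"/g' -e "s/’/'/g" \
      -e 's/—/ -- /g' -e 's/ — / -- /g' -e 's/ - / -- /g' -e 's/ – / -- /g'
}

printf '%s\n' "“It’s Rad”—Kids - And Boomers – Agree" | normalize_punctuation
# prints: "It's Rad" -- Kids -- And Boomers -- Agree
```

One thing to watch with this ordering: an em dash that already has spaces around it gets caught by the unspaced rule first, which leaves doubled spaces around the --. Putting the spaced rules ahead of the unspaced one should avoid that.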

The trickiest bit was making it so that words like "wasn’t" or "boomer’s" did not end up being "wasn’T" or "boomer’S". Using the first portion that capitalized some words, I knew the basic structure of what I wanted, but I kept getting results that either refused to capitalize anything after an apostrophe, capitalized EVERYTHING after an apostrophe, or transformed the last letter of every word to lowercase. (HIv, ASMr, NIh, you get the idea.)

It turned out to be a really funky bit of escaping combined with a more precise regex check: -e "s/"\'"\([A-Z]\)\b/"\'"\l\1/g"

The key is that the single quotes sit outside of the double-quoted portions. The argument is really five pieces glued together: "s/", then \', then "\([A-Z]\)\b/", then \' again, then "\l\1/g". Even though the single quotes are outside the double quotes, each one still needs its own backslash so the shell passes it through as a literal character.

The three double-quoted pieces are protected by the quotation marks, but the single quotes are protected only by the \ character. Two styles of escaping in one operation.
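One way to check what sed actually receives is to store the argument in a variable and print it. This is a sketch with variable names of my own choosing. For comparison it also builds the same script inside a single pair of double quotes, where the shell treats a single quote as an ordinary character.

```shell
# Variable names are illustrative only.
# The post's spelling: double-quoted pieces joined by backslash-escaped
# single quotes.
concatenated="s/"\'"\([A-Z]\)\b/"\'"\l\1/g"

# Same script written inside one pair of double quotes.
plain="s/'\([A-Z]\)\b/'\l\1/g"

printf '%s\n' "$concatenated"
printf '%s\n' "$plain"
# both lines print: s/'\([A-Z]\)\b/'\l\1/g

# Either variable can be handed to GNU sed as the script.
printf '%s\n' "Boomer'S Don'T" | sed -e "$concatenated"
# prints: Boomer's Don't
```

Printing the argument this way is handy whenever a quoting experiment behaves oddly, because it separates what the shell did from what sed did.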

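Then, finally, I combined that with a match for cases like we're, where two letters follow the apostrophe: -e "s/"\'"\([A-Za-z][A-Za-z]\)\b/"\'"\l\1/g" Only the first of those two letters gets capitalized by the earlier step, so lowercasing that one letter with \l is enough.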
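Here is a small sketch of the two apostrophe expressions working together, fed a sample line that already has the unwanted capitals in it. The function name fix_contractions is hypothetical, and GNU sed is assumed.

```shell
# fix_contractions is a hypothetical name wrapping only the last two
# expressions. \l lowercases the single character that follows it.
fix_contractions() {
  sed -e "s/"\'"\([A-Z]\)\b/"\'"\l\1/g" \
      -e "s/"\'"\([A-Za-z][A-Za-z]\)\b/"\'"\l\1/g"
}

printf '%s\n' "We'Re Sure It Wasn'T HIV Or ASMR, Boomer'S" | fix_contractions
# prints: We're Sure It Wasn't HIV Or ASMR, Boomer's
```

Because both patterns are anchored on the apostrophe and end with \b, acronyms like HIV and ASMR pass through untouched.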

I wrote up a small bash script that I use with AutoKey to pull the selected text into the clipboard, run the transform, and then replace the text with the transformed text. You can find that script here: https://gist.github.com/uriel1998/37b731725653def3b675705546d14f22

It took me a long time to figure that out, so if you’re struggling with something similar, I hope this helps.

Featured image by Bruno from Pixabay
