Localizing Arbitrary Strings the Scripty Way

I bet you saw my post about localizing US states and said to yourself, “Ha! That little sed script won’t do much for arbitrary strings!” You’re a jerk, but yes, you’re right. Before I give it away, does anyone see the problem with the previous script?

sed 's/@"[^"]*"/NSLocalizedString(&, nil)/g'

Let’s try it on some completely pseudorandom strings that I am about to pseudo-make up.

@"hey man"
NSLocalizedString(@"hey man", nil)
@"what's up"
NSLocalizedString(@"what's up", nil)
@"what\"s up"
NSLocalizedString(@"what\", nil)s up"

Uh oh. Clearly this simple script does not take escaped quotes into account. This is kind of a complex problem! We can’t just skip over every \", because the backslash may itself be escaped, leaving a real quote right behind it (i.e. \\"). So a quote only counts as escaped when it is preceded by an odd number of backslashes. And there could be any number of instances of that pattern, anywhere in the input, so we have to interleave it with the match for all the ordinary characters. Lucky for you I already did the work!
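
If the odd-number rule feels slippery, here is a minimal sketch of it on its own. The helper name ends_with_escaped_quote is made up for illustration, and it assumes a grep that understands -E and -q.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the LAST character of $1 is a double
# quote that is escaped, i.e. preceded by an odd number of backslashes.
ends_with_escaped_quote() {
    # (^|[^\\])  start of line, or one character that is not a backslash
    # (\\\\)*    any number of backslash pairs (escaped backslashes)
    # \\"$       one more backslash, then the quote, at the end of the line
    printf '%s\n' "$1" | grep -Eq '(^|[^\\])(\\\\)*\\"$'
}

for s in 'what"' 'what\"' 'what\\"' 'what\\\"'; do
    if ends_with_escaped_quote "$s"; then
        printf '%s  escaped quote\n' "$s"
    else
        printf '%s  real quote\n' "$s"
    fi
done
```

Zero and two backslashes should come out as a real quote, one and three as an escaped quote.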

sed -E 's/@"([^"\\]*((\\\\)*(\\")*)*)*"/NSLocalizedString(&, nil)/g'

What the hell is this mess? First things first.

  • sed -E: the -E flag tells sed to use extended regular expressions, where plain parentheses group a sub-pattern, so * can repeat a whole sequence and not just an individual character.
  • @": match the beginning of an NSString.
  • [^"\\]*: match any number (*) of characters that are neither a double quote nor a backslash (the doubled backslash looks confusing; inside a POSIX bracket expression a backslash is already literal, so the doubling is redundant but harmless).
  • (\\\\)*: match any number of paired backslashes, ie, backslashes escaped in the input.
  • (\\")*: match any number of double quotes immediately preceded by a backslash (escaped for sed, non-escaped in the input); that is, any number of escaped double quotes.
  • ": match the closing double quote.
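
Each of those pieces can be poked at in isolation before they get glued together. A rough sketch, assuming a grep with -E, -q and -x (match the whole line); matches_whole is a hypothetical helper name, not anything standard.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the ENTIRE text in $2 matches the
# extended-regex fragment in $1 (grep -x anchors the pattern to the line).
matches_whole() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# A run of ordinary characters: no quotes, no backslashes.
matches_whole '[^"\\]*' 'hey man' && echo 'plain run: match'

# Four backslashes are two escaped backslashes.
matches_whole '(\\\\)*' '\\\\' && echo 'two pairs: match'

# Three backslashes cannot be split into pairs.
matches_whole '(\\\\)*' '\\\' || echo 'three backslashes: no match'

# Two escaped quotes in a row.
matches_whole '(\\")*' '\"\"' && echo 'two escaped quotes: match'
```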

Now we can put some of these elements together to construct a more complex regular expression.

  • ((\\\\)*(\\")*)*: match any number of strings of the pattern “any number of escaped backslashes, followed by any number of escaped quotes”.
  • ([^"\\]*((\\\\)*(\\")*)*)*: and finally, match any number of strings of the pattern “any number of non-quote, non-backslash characters, followed by any number of the escaped-backslash and escaped-quote runs above”. The only quotes this can swallow are the ones preceded by an odd number of backslashes.

It’s critically important to remember that “any number” includes zero. That’s why this pattern can match the most complex input it was built for, something like @"a\\\"b\\\"c\\\"", but still perform fine on input like @"hey man". Things that don’t occur in the input still occur zero times, which is just enough to match.
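
To watch the zero case in action, here is a small sketch that wraps the sed command in a function and feeds it one line at a time. The function name localize is an assumption for illustration; the sed expression is the one from this post.

```shell
#!/bin/sh
# Hypothetical wrapper around the post's sed command: takes one line of
# source text in $1 and prints it with string literals wrapped.
localize() {
    printf '%s\n' "$1" |
        sed -E 's/@"([^"\\]*((\\\\)*(\\")*)*)*"/NSLocalizedString(&, nil)/g'
}

localize '@""'                  # every starred group repeats zero times
localize '@"hey man"'           # only the [^"\\]* run does any work
localize '@"a\\\"b\\\"c\\\""'   # all of the groups take part
```

All three lines should come back wrapped in NSLocalizedString(…, nil), including the empty string.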

Okay, whatever! What’s the output look like?

$ sed -E 's/@"([^"\\]*((\\\\)*(\\")*)*)*"/NSLocalizedString(&, nil)/g'
@"hey man"
NSLocalizedString(@"hey man", nil)
@"what\"s up"
NSLocalizedString(@"what\"s up", nil)
@"what\\\"s up"
NSLocalizedString(@"what\\\"s up", nil)
@"a\\\"b\\\"c\\\""
NSLocalizedString(@"a\\\"b\\\"c\\\"", nil)
@"\\\\\"
@"\\\\\"

Great! The pattern performs as expected on simple input, and on complex input catches the escaped quotes. It even declines to match the last input, an invalid string which is never closed. Now we can localize not only input we know is free of exceptional conditions, but also strings full of escaped quotes and escaped backslashes, all with the help of our two friends sed and regexp.
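
One caveat: the pattern only knows how to step over \\ and \", so a string containing some other escape, such as \n, is left unwrapped. A possible variant, which is not the script from this post but a commonly used alternative, treats a backslash plus whatever character follows it as a single unit. The function name localize_any_escape is hypothetical.

```shell
#!/bin/sh
# Hypothetical variant, NOT the post's pattern: inside the string, accept
# either an ordinary character or a backslash followed by any one character.
localize_any_escape() {
    printf '%s\n' "$1" |
        sed -E 's/@"([^"\\]|\\.)*"/NSLocalizedString(&, nil)/g'
}

localize_any_escape '@"line one\nline two"'   # \n is consumed as one unit
localize_any_escape '@"what\"s up"'           # escaped quote still handled
localize_any_escape '@"\\\\\"'                # never closed, so left alone
```

The trade-off is that this version no longer spells out the backslash-parity logic; it gets the same effect by always consuming backslashes two characters at a time.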

About Joel Kin

Developing on Apple platforms for, holy shit, like twenty years now. Find me on linkedin and twitter. My personal website is joelk.in.