Regex nightmare in Java: s!\\!\\\\!g becomes .replaceAll("\\\\","\\\\\\\\");

If you like Java and love java.util.regex even more, stop reading now. I have a small complaint that was a pain to figure out, so I am writing it up here in the hope that it may prove useful to someone else.

The title says it all. If you are looking to escape a single backslash in a string, a lowly ‘\’ character, you have to hit it with 4 and then 8 backslashes in the replaceAll command.

“Why?” you may well ask, when Perl, which also needs to escape the backslash in a regex, gets by with a parsimonious 2 and then 4 backslashes to replace a single backslash with a double backslash. Well, the answer (as near as I can tell) is that every string literal in Java has its backslash escapes processed. That is, there is no notion of the uninterpolated string, at least as far as escaping characters goes. So you get the worst of both worlds: there is no way to write carp "The $variable" in a single string (variable interpolation), and there is no way to write carp q{Ceci \n n'est pas une linebreak}, with no escaping happening at all.

Instead, you get the following. Suppose you have a code snippet such as


this.encodedPointsLiteral=encodedPoints.replaceAll("\\","\\\\");

Suppose encodedPoints holds the output of the excellent PolylineEncoder class, which you’ve converted into a Java class to run on your server. Well, if you expect the replaceAll command to work, you will instead get hit with a runtime error (a PatternSyntaxException). The string literal "\\" properly produces a single \, which I would expect to be used to find all single backslashes. Instead, that single backslash is handed to the regex compiler and interpreted a second time, so the lone backslash is escaping nothing at all, which is of course an error.
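To make the failure concrete, here is a minimal sketch that reproduces it outside of any polyline code. The class name BackslashPatternCheck and the helper rejects are made up for this illustration; only Pattern.compile and PatternSyntaxException come from java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, not part of the original code.
final class BackslashPatternCheck {
    // Returns true if the regex compiler refuses the given pattern text.
    static boolean rejects(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String oneBackslash = "\\";    // the Java compiler turns this literal into 1 character
        String twoBackslashes = "\\\\"; // and this one into 2 characters

        System.out.println(oneBackslash.length());   // 1
        System.out.println(rejects(oneBackslash));   // true: a lone backslash escapes nothing
        System.out.println(rejects(twoBackslashes)); // false: the regex sees an escaped backslash
    }
}
```

The same exception is what replaceAll throws, since it compiles its first argument as a pattern before doing anything else.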

If you instead use the code snippet


this.encodedPointsLiteral=encodedPoints.replaceAll("\\\\","\\\\\\\\");

then the first quadruple backslash is reduced to a double backslash on the first interpolation (when the compiler reads the double-quoted string literal), and that is interpreted again to give a single backslash when the pattern is compiled. Similarly, the eight backslashes are reduced to four on the first interpolation, and then reduced to two when the result is evaluated as a replacement string.
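For what it is worth, here is a small sketch that checks the 4-and-8 version and compares it with two ways of dodging the second round of escaping. The class and method names are invented for the example. String.replace, Pattern.quote and Matcher.quoteReplacement are standard library calls; whether they suit your code is a separate question, since String.replace only helps when you do not need a real regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original code.
final class EscapeBackslashes {
    // The version from this post: 4 backslashes in the pattern literal, 8 in the replacement literal.
    static String viaReplaceAll(String s) {
        return s.replaceAll("\\\\", "\\\\\\\\");
    }

    // String.replace treats both arguments as plain text, so only the
    // string-literal layer of escaping remains: 2 and then 4.
    static String viaReplace(String s) {
        return s.replace("\\", "\\\\");
    }

    // Still a regex replacement, but the library does the regex-level escaping.
    static String viaQuoting(String s) {
        return s.replaceAll(Pattern.quote("\\"), Matcher.quoteReplacement("\\\\"));
    }

    public static void main(String[] args) {
        String input = "a\\b"; // three characters: a, backslash, b

        // Each call should print a, two backslashes, b (four characters).
        System.out.println(viaReplaceAll(input));
        System.out.println(viaReplace(input));
        System.out.println(viaQuoting(input));
    }
}
```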

It is little things like this that give me hope that my Perl skills, such as they are, will never become as useless as my 68000 assembly language talents.

And if you study Dr. McClure’s JavaScript code, you will see that JavaScript is just as sane as Perl, with the idiom


encodedPointsLiteral: encodedPoints.replace(/\\/g,"\\\\")

doing the job properly.
