Using the Unicode Modifier with Regex in Elixir

By default, Regex functions in Elixir treat strings as raw bytes rather than as sequences of Unicode characters. As a result, when a string contains multi-byte Unicode characters, we may not get the results we expect.

For example, if we are stripping all non-word characters from a string, we may get different results depending on which Unicode word characters appear in that string.

> Regex.replace(~r/[\W]/, "uber!", "")
"uber"
> Regex.replace(~r/[\W]/, "über!", "")
<<195, 98, 101, 114>>  

With uber!, we get what we expected. However, with über!, which contains a ü, Regex.replace/3 mangles the result. The ü is encoded as two bytes, and only one of them is removed, so we get back an unprintable binary instead of a valid string.
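To see why the result is unprintable, it can help to look at the bytes directly. The following is a minimal sketch, assuming the source file is UTF-8 encoded and the ü is the single precomposed code point U+00FC; the variable names are only for illustration.

```elixir
# In UTF-8, "ü" (U+00FC) is encoded as two bytes.
bytes = :binary.bin_to_list("ü")
IO.inspect(bytes)
# [195, 188]

# The mangled result from above keeps byte 195 but has lost byte 188,
# so it is no longer valid UTF-8 and is shown as a raw binary.
mangled = <<195, 98, 101, 114>>
IO.inspect(String.valid?(mangled))
# false

# The intact string is still valid UTF-8.
IO.inspect(String.valid?("über"))
# true
```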

To deal with these kinds of scenarios, we can add a Unicode modifier to the regex sigil.

This can be done in two ways. The more common is to add the u modifier at the end of the sigil. The other way is to include (*UTF8) at the beginning of the regex, just after the opening /. Let's see them both in action.

> Regex.replace(~r/[\W]/u, "über!", "")
"über"
> Regex.replace(~r/(*UTF8)[\W]/, "über!", "")
"über"

Not only does this modifier change how character classes like \w and \W work, it also allows for full use of Unicode-specific patterns such as \p and \P.
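As a rough illustration of \p and \P with the u modifier, the sketch below uses the Unicode letter property L. The sample strings and variable names are made up for this example.

```elixir
# \P{L} matches any character that is NOT a Unicode letter,
# so replacing those matches with "" keeps only the letters.
letters_only = Regex.replace(~r/\P{L}/u, "über! 123", "")
IO.inspect(letters_only)
# "über"

# \p{L}+ matches runs of Unicode letters, accented ones included.
words = Regex.scan(~r/\p{L}+/u, "über naïve café")
IO.inspect(words)
# [["über"], ["naïve"], ["café"]]
```

Here \p{L} is used instead of \w because it matches letters only, whereas \w also accepts digits and underscores.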

I made a point of including the (*UTF8) approach because I don't know where I learned it and I cannot find it documented anywhere.