Xiaoxing's Notes

Regex Nightmare with Swift

July 01, 2017

I am a big fan of the Swift language, and I am so excited about where it's heading. There's not a lot to complain about besides some rough edges. It's still young, after all. Dealing with regular expressions is the rough edge that annoys me the most.

Lots of love for String. Great, what about regex?

The Swift team has given String lots of love in Swift 4. One thing that is missing for me is that the regex part remains untouched. In my opinion, regex is an essential part of string manipulation in any language. Ruby, JavaScript, Python ... you name it. They all have built-in, easy-to-use regex APIs. NSRegularExpression is the worst regex API I have ever used across all languages (and I use a LOT of languages).

The current problems

There are two main issues here.

NSRegularExpression is not defined on the Linux version

Swift is ditching the NS prefix for all the new types, but on Apple platforms there's still no RegularExpression. On the Linux version, however, you can't find NSRegularExpression. So here is the workaround I used.

import Foundation

#if !os(Linux)
    typealias RegularExpression = NSRegularExpression
    typealias TextCheckingResult = NSTextCheckingResult
    extension TextCheckingResult {
        // yeah, you have to deal with this
        func rangeAt(_ idx: Int) -> NSRange {
            return range(at: idx)
        }
    }
#endif

It works with NSRange instead of Range<String.Index>

The above one is just annoying, but we can live with it. This one is a little bit tricky. With a String, all the functions that accept or return a range will give you Range<String.Index>. NSRegularExpression only accepts NSRange for a range. The tricky part is that it accepts String instead of NSString for the string. The following code demonstrates the issue.

var str = "Hello😀"
var nsStr = str as NSString
str.characters.count // 6
nsStr.length // 7

Yes, you can freely cast between NSString and String with minimal performance penalty. But the fact that the length of the string is not consistent causes big problems. Consider the following code:

var str = "Hello😀World"

str.characters.count     // 11
(str as NSString).length // 12

var regex = try! NSRegularExpression(pattern: "World", options: [])

let match = regex.firstMatch(
  in: str,
  options: [],
  range: NSMakeRange(0, str.characters.count))

// the match would be nil.

The most confusing part of this code is that NSRegularExpression accepts a String as input but does not respect its characters.count, because NSRange counts UTF-16 code units rather than characters. The workaround would be:

let match = regex.firstMatch(
  in: str,
  options: [],
  range: NSMakeRange(0, (str as NSString).length))
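
As a small sketch of why this works: NSString's length is a count of UTF-16 code units, and String exposes the same count through its utf16 view, so the search range can be built without the cast. The name fullRange is made up for this example.

```swift
import Foundation

let str = "Hello😀World"

// NSString.length counts UTF-16 code units, and so does String.utf16.count.
// The emoji occupies two UTF-16 code units but is a single Character.
let utf16Length = str.utf16.count

// `fullRange` is a hypothetical name used only in this sketch.
let fullRange = NSRange(location: 0, length: utf16Length)

let regex = try! NSRegularExpression(pattern: "World", options: [])
let match = regex.firstMatch(in: str, options: [], range: fullRange)
```

With the range measured in UTF-16 code units, the search covers the whole string and the match is no longer nil.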

I really wish that were the end of the story, and that all we needed to do was put up with the ugly code until the Swift team fixes this for us. Unfortunately, it is not. Well, it can end here: the only thing you need to do is remember to convert String to NSString every time you use NSRegularExpression, or to use NSRange with them. But that would be so error prone, since the majority of functions accept and return String, even NSRegularExpression itself.

let nsRange = match!.range // got the range from a regex match
let substring = ( str as NSString ).substring(with: nsRange)
// "World"
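
One way to make this less error prone, sketched here under my own naming, is to keep the NSString and NSRange handling inside a single helper so that callers only ever see String. The method allMatches(of:) is hypothetical and not part of Foundation.

```swift
import Foundation

extension String {
    // Hypothetical helper, not a Foundation API: returns the text of every
    // match of `pattern`, keeping the NSString/NSRange details in one place.
    func allMatches(of pattern: String) throws -> [String] {
        let regex = try NSRegularExpression(pattern: pattern, options: [])
        let nsString = NSString(string: self)
        // The search range is measured in UTF-16 code units.
        let fullRange = NSRange(location: 0, length: nsString.length)
        return regex.matches(in: self, options: [], range: fullRange).map {
            nsString.substring(with: $0.range)
        }
    }
}

let words = try! "Hello😀World".allMatches(of: "[A-Z][a-z]+")
```

Here words should contain the two capitalized words on either side of the emoji, and no NSRange leaks out to the caller.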

If you plan on using String all the time, which you should, then here is how to convert NSRange to Range<String.Index>.

extension String {
    func range(from nsRange: NSRange) -> Range<String.Index>? {
        guard
            // Both offsets are counted in UTF-16 code units, like NSRange.
            let from16 = utf16.index(
                utf16.startIndex,
                offsetBy: nsRange.location,
                limitedBy: utf16.endIndex),
            let to16 = utf16.index(
                from16,
                offsetBy: nsRange.length,
                limitedBy: utf16.endIndex),
            // nil if an offset lands in the middle of a character
            let from = from16.samePosition(in: self),
            let to = to16.samePosition(in: self)
        else { return nil }
        return from ..< to
    }
}

if let range = str.range(from: nsRange) {
    let substring = String(str[range]) // "World"
}

From the look of the code, it is very compiler-optimizable. If this were taken care of by a built-in framework (like a real RegularExpression type), it could be optimized down to nothing, because technically you can access the raw value of the index directly instead of doing all the offset work. I know that by the nature of the offset behavior it's not going to be computationally intensive. But when you have a massive number of conversions between NSRange and Range<String.Index> in your code, it really adds up.
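
For comparison, and assuming a Swift 4 or later toolchain, Foundation provides NSRange(_:in:) and Range(_:in:) initializers that perform this conversion in both directions. The snippet below is only a sketch of how they could stand in for the hand-written extension; the variable names are my own.

```swift
import Foundation

let str = "Hello😀World"
let regex = try! NSRegularExpression(pattern: "World", options: [])

// NSRange(_:in:) converts a Range<String.Index> into UTF-16 offsets.
let searchRange = NSRange(str.startIndex..<str.endIndex, in: str)

var matched: String? = nil
if let match = regex.firstMatch(in: str, options: [], range: searchRange),
   // Range(_:in:) goes the other way and is nil for an invalid NSRange.
   let range = Range(match.range, in: str) {
    matched = String(str[range])
}
```

This keeps the calling code on Range<String.Index> while still handing NSRegularExpression the NSRange it expects.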

Looking Forward

It is a hard problem to solve, especially since Apple added native emoji support to the String type. Regex was not designed to handle this kind of issue. But I'm really looking forward to seeing how the smart people on the Swift team crack this.
