Regex Nightmare with Swift
I am a big fan of the Swift language, and I am so excited about where it's heading. There's not a lot to complain about besides some rough edges; it's still young, after all. Dealing with regular expressions is the rough edge that annoys me the most.
Lots of love for String. Great, what about regex?
The Swift team has given String lots of love in Swift 4. One thing that is missing for me is that the regex part remains untouched. In my opinion, regex is an essential part of string manipulation in any language. Ruby, JavaScript, Python ... you name it. They all have built-in, easy-to-use regex APIs. NSRegularExpression is the worst regex API I have ever used across all languages (and I use a LOT of languages).
The current problems
There are two main issues here.
NSRegularExpression is not defined on the Linux version
Swift is ditching the NS prefix for all the new types, but there's still no RegularExpression. On the Linux version, however, you can't find NSRegularExpression. So here is the workaround I used.
import Foundation

#if !os(Linux)
typealias RegularExpression = NSRegularExpression
typealias TextCheckingResult = NSTextCheckingResult
#else
extension TextCheckingResult {
    // yeah, you have to deal with this
    func rangeAt(_ idx: Int) -> NSRange {
        return range(at: idx)
    }
}
#endif
It works with NSRange instead of Range<String.Index>
The issue above is just annoying, but we can live with it. This one is a little bit tricky. With a String, all the functions that accept or return a range will give you Range<String.Index>. NSRegularExpression only accepts NSRange for a range. The tricky part is that it accepts String instead of NSString for the string. The following code demonstrates the issue.
var str = "Hello😀"
var nsStr = str as NSString
str.characters.count // 6
nsStr.length // 7
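Yes, you can freely cast between NSString and String with minimal performance penalty. But the fact that the length of the string is not consistent causes big problems. The two lengths differ because String counts characters while NSString counts UTF-16 code units, and the emoji takes two of them. Consider the following code: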
var str = "Hello😀World"
str.characters.count // 11
(str as NSString).length // 12
var regex = try! NSRegularExpression(pattern: "World", options: [])
let match = regex.firstMatch(
    in: str,
    options: [],
    range: NSMakeRange(0, str.characters.count))
// the match would be nil.
The most confusing part of this code is the fact that NSRegularExpression accepts String as input but does not respect its characters.count. The workaround would be:
let match = regex.firstMatch(
    in: str,
    options: [],
    range: NSMakeRange(0, (str as NSString).length))
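If you would rather not cast to NSString just to get the length, one option is to measure the string through its utf16 view, which counts in the same unit NSRange does. A minimal sketch follows; the helper name fullNSRange(of:) is made up for illustration and is not part of Foundation.

```swift
import Foundation

// Hypothetical helper (not a Foundation API): an NSRange covering the whole
// string, measured in UTF-16 code units, the unit NSRegularExpression uses.
func fullNSRange(of string: String) -> NSRange {
    return NSRange(location: 0, length: string.utf16.count)
}

let text = "Hello😀World"
let regex = try! NSRegularExpression(pattern: "World", options: [])
let match = regex.firstMatch(in: text, options: [], range: fullNSRange(of: text))

// The match is found this time. Its location is 7, not 6, because the
// emoji occupies two UTF-16 code units.
let location = match!.range.location
let length = match!.range.length
```

Keeping the length calculation in one place at least means there is only one spot where the wrong count can sneak in.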
I really hoped that would be the end of the story, and that all we need to do is put up with the ugly code until the Swift team fixes this for us. Unfortunately, it is not. Well, it can end here: the only thing you need to do is remember to convert String to NSString every time you use NSRegularExpression, or use NSRange with them. But that would be so error-prone, since the majority of functions accept and return String, even NSRegularExpression itself.
let nsRange = match!.range // got the range from a regex match
let substring = ( str as NSString ).substring(with: nsRange)
// "World"
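To show how easy it is to get this wrong, here is a small sketch of what happens when an NSRange location is treated as a character offset into a String. The variable names are only for illustration.

```swift
import Foundation

let text = "Hello😀World"
let regex = try! NSRegularExpression(pattern: "World", options: [])
let nsRange = regex.firstMatch(
    in: text,
    options: [],
    range: NSRange(location: 0, length: text.utf16.count))!.range

// Wrong: nsRange.location counts UTF-16 code units, but this offset
// walks the string character by character, so it lands one character late.
let wrongStart = text.index(text.startIndex, offsetBy: nsRange.location)
let wrong = String(text[wrongStart...])   // "orld"

// Right: apply the offsets inside the utf16 view, then slice the String.
let start = text.utf16.index(text.utf16.startIndex, offsetBy: nsRange.location)
let end = text.utf16.index(start, offsetBy: nsRange.length)
let right = String(text[start..<end])     // "World"
```

Nothing crashes in the wrong version, which is exactly what makes this kind of bug hard to notice.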
If you plan on using String all the time, which you should, then here is how to convert NSRange to Range<String.Index>.
extension String {
    func range(from nsRange: NSRange) -> Range<String.Index>? {
        guard
            let from16 = utf16.index(
                utf16.startIndex,
                offsetBy: nsRange.location,
                limitedBy: utf16.endIndex),
            let to16 = utf16.index(
                from16,
                offsetBy: nsRange.length,
                limitedBy: utf16.endIndex),
            let from = from16.samePosition(in: self),
            let to = to16.samePosition(in: self)
        else { return nil }
        return from ..< to
    }
}
// range(from:) returns an optional, so unwrap it before slicing
let substring = str.range(from: nsRange).map { str.substring(with: $0) }
Judging by the look of this code, it is very compiler-optimizable. That means if this were taken care of by a built-in framework (like a real RegularExpression type), it could be optimized down to nothing, because technically you can access the raw value of the index directly instead of doing all the offset work. I know that, by the nature of the offset behavior, it's not going to be computationally intensive. But when you have a massive number of conversions between NSRange and Range<String.Index> in your code, it really adds up.
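Putting the pieces together, this is roughly how I would hide the conversions behind a single String-only helper. It is only a sketch: matchedStrings(for:) is a hypothetical name, and the range(from:) conversion from above is repeated so the snippet stands alone. It slices with the subscript rather than substring(with:) and uses compactMap, which assumes a Swift 4.1 or later toolchain.

```swift
import Foundation

extension String {
    // Same NSRange -> Range<String.Index> conversion as above,
    // repeated here so the sketch is self-contained.
    func range(from nsRange: NSRange) -> Range<String.Index>? {
        guard
            let from16 = utf16.index(
                utf16.startIndex,
                offsetBy: nsRange.location,
                limitedBy: utf16.endIndex),
            let to16 = utf16.index(
                from16,
                offsetBy: nsRange.length,
                limitedBy: utf16.endIndex),
            let from = from16.samePosition(in: self),
            let to = to16.samePosition(in: self)
        else { return nil }
        return from ..< to
    }

    // Hypothetical helper: every substring matching `pattern`, returned as
    // String so callers never see NSRange or NSString.
    func matchedStrings(for pattern: String) throws -> [String] {
        let regex = try NSRegularExpression(pattern: pattern, options: [])
        // The search range is measured in UTF-16 code units.
        let whole = NSRange(location: 0, length: utf16.count)
        return regex.matches(in: self, options: [], range: whole).compactMap { result in
            self.range(from: result.range).map { String(self[$0]) }
        }
    }
}

let words = try! "Hello😀World".matchedStrings(for: "[A-Z][a-z]+")
// words == ["Hello", "World"]
```

Every match still pays for two index walks in range(from:), which is the overhead I would love to see disappear.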
Looking Forward
It is a hard problem to solve, especially since Apple added native emoji support to the String type. Regex was not designed to handle this kind of issue. But I am really looking forward to seeing how the smart people on the Swift team crack this.