Code is much like a conversation, and misunderstandings can happen when assumptions aren’t stated upfront. It brings to mind a quote (perhaps apocryphal) from the Enlightenment philosopher Voltaire: “If you wish to converse with me, first define your terms.” Though he is unlikely to have anticipated how this advice might apply to computer science, it rings true when considering the code development process. Misunderstandings lead to bugs—which our team found out when working with our iOS text manipulation logic.
Think about characters, strings, and string indices: If you write code, you probably use these terms often and intuitively, rarely worrying about how to define them. It might be surprising, then, to consider the fact that expectations around these terms will differ widely across languages, OSes, and programmers.
We’ll show what happens when these different definitions become more than just philosophical and cause mistakes in code. We’ll also share how we found and fixed these issues in our Swift app with a trick we call “index type safety.” It’s a technique that could be useful anywhere your code depends on terms that might be overloaded but aren’t strongly typed by the system.
Editing text in iOS
Grammarly provides users with a wide variety of language and communication suggestions in iOS, which means we’ve become familiar with the tricky terrain of the iOS keyboard. Since most iOS developers don’t write keyboards, we’ll begin with an overview.
The keyboard runs as a separate process in iOS, and the system provides a proxy object for accessing the text field. Due to cross-process communication and privacy considerations, the system-provided API is sparse (memory is also quite limited, but that’s a discussion for another time). Here are some of the methods involved:
// Get document context around cursor
var documentContextBeforeInput: String? { get }
var documentContextAfterInput: String? { get }
// Insert or delete some text
func insertText(_ text: String)
func deleteBackward(_ count: Int)
// Move cursor around
func adjustTextPosition(byCharacterOffset offset: Int)
Since we’re going to be talking about manipulating text, here’s an example of how you could write a domain model struct to describe an edit operation in an app:
struct EditCommand: Codable {
    var begin: Int
    var end: Int
    var replacement: String
}
To transform text, we first move the cursor to the position where we want to start editing. Then we delete the old text. Next, we insert the new text. And finally, we move the cursor back.
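The four steps can be sketched against a small in-memory stand-in for the text field. Everything in this sketch is an assumption made for illustration: `FakeTextField` and `apply(_:to:)` are hypothetical names rather than Apple's API, the fake counts every position in Swift `Character`s, and the cursor is assumed to start at or after the end of the edited range.

```swift
// Hypothetical in-memory stand-in for the text field (not Apple's API).
// All positions and counts here are measured in Swift Characters.
final class FakeTextField {
    private(set) var text: String
    private(set) var cursor: Int

    init(text: String, cursor: Int) {
        self.text = text
        self.cursor = cursor
    }

    func adjustTextPosition(byCharacterOffset offset: Int) {
        cursor = min(max(cursor + offset, 0), text.count)
    }

    func deleteBackward(_ count: Int) {
        let deleted = min(count, cursor)
        let end = text.index(text.startIndex, offsetBy: cursor)
        let start = text.index(end, offsetBy: -deleted)
        text.removeSubrange(start..<end)
        cursor -= deleted
    }

    func insertText(_ newText: String) {
        let position = text.index(text.startIndex, offsetBy: cursor)
        text.insert(contentsOf: newText, at: position)
        cursor += newText.count
    }
}

struct EditCommand {
    var begin: Int
    var end: Int
    var replacement: String
}

// Assumes the cursor starts at or after `command.end`.
func apply(_ command: EditCommand, to field: FakeTextField) {
    let distancePastEdit = field.cursor - command.end
    // 1. Move the cursor to the end of the text being replaced.
    field.adjustTextPosition(byCharacterOffset: command.end - field.cursor)
    // 2. Delete the old text.
    field.deleteBackward(command.end - command.begin)
    // 3. Insert the new text.
    field.insertText(command.replacement)
    // 4. Move the cursor back to where it was relative to the text after the edit.
    field.adjustTextPosition(byCharacterOffset: distancePastEdit)
}

let field = FakeTextField(text: "I has a cat.", cursor: 12)
apply(EditCommand(begin: 2, end: 5, replacement: "have"), to: field)
print(field.text)  // prints "I have a cat."
```

Because the fake measures every argument in the same unit, the edit lands where expected. The real proxy offers no such guarantee.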
What can go wrong
While manipulating text seems straightforward, we encountered surprising bugs that looked something like this:
Misunderstandings around text encodings
Here’s a hint for debugging this code: Apple’s APIs do not define what a character means. To decipher what happened here, let’s introduce some terms that, unlike “character” and “string,” do have standard definitions (according to the Glossary of Unicode Terms).
Grapheme cluster: Grapheme base together with any number of nonspacing marks
A grapheme is the smallest unit of text in a language; it’s what most people would think of as a character. A nonspacing mark is something attached to the character, like an accent mark or the components of an emoji.
Represented as a grapheme cluster, 👩🏼🏫 is equivalent to 👩+ skin tone modifier + ZWJ (zero-width joiner, used to combine emoji into a new form) + 🏫
Code point: Any value in the Unicode codespace—that is, the range of integers from 0 to 10FFFF₁₆
👩🏼🏫 can be represented as 4 code points: WOMAN (U+1F469), EMOJI MODIFIER FITZPATRICK TYPE-3 (U+1F3FC), ZERO WIDTH JOINER (U+200D), and SCHOOL (U+1F3EB)
Code unit: The minimal bit combination that can represent a unit of encoded text for processing
In UTF-16, each code unit is 2 bytes and can fit values from 0 to FFFF₁₆. This does not fit all possible code points, so some of them are stored as two code units, called a “surrogate pair.” 👩🏼🏫 uses 3 surrogate pairs and can be represented as seven UTF-16 code units: WOMAN (0xD83D, 0xDC69), EMOJI MODIFIER FITZPATRICK TYPE-3 (0xD83C, 0xDFFC), ZERO WIDTH JOINER (0x200D), SCHOOL (0xD83C, 0xDFEB).
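These three units of measurement can be checked directly in Swift, which exposes a separate view of a string for each one. The sketch below spells the emoji out as Unicode scalar escapes so that it does not depend on how this page was encoded. The counts in the comments assume a recent Swift toolchain with up-to-date grapheme-breaking rules.

```swift
// The emoji from the text, spelled out as four Unicode scalars (code points).
let teacher = "\u{1F469}\u{1F3FC}\u{200D}\u{1F3EB}"

print(teacher.count)                 // grapheme clusters (Swift Characters): 1
print(teacher.unicodeScalars.count)  // code points: 4
print(teacher.utf16.count)           // UTF-16 code units: 7
print(teacher.utf8.count)            // UTF-8 code units (bytes): 15

// The first two UTF-16 code units form the surrogate pair for WOMAN.
let womanUnits = Array(teacher.utf16.prefix(2))
print(womanUnits.map { String($0, radix: 16, uppercase: true) })  // ["D83D", "DC69"]
```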
This framework helps us understand what went wrong in our text editing example. By default, Swift represents strings as a collection of grapheme clusters, and that’s what Apple’s `deleteBackward` method uses to index. But Apple’s `adjustTextPosition` method uses UTF-16 code units to index.
Even though the indices in those methods were both integers, they weren’t the same kind of integer, and it was a mistake to use them interchangeably. It’s as if two people took a measurement but one counted by feet, while the other counted by meters: Of course the numbers didn’t match up.
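To make the mismatch concrete, here is a minimal sketch that measures one and the same position in both units. The sample string is our own. It relies only on the standard library's `distance(from:to:)`, and on the fact that `String` and its `utf16` view share an index type.

```swift
let text = "Hi \u{1F469}\u{1F3FC}\u{200D}\u{1F3EB}!"
let bang = text.firstIndex(of: "!")!

// Distance from the start, counted in grapheme clusters (Characters).
let characterOffset = text.distance(from: text.startIndex, to: bang)
// The same position, counted in UTF-16 code units.
let utf16Offset = text.utf16.distance(from: text.utf16.startIndex, to: bang)

print(characterOffset, utf16Offset)  // prints "4 10"
```

Both results are plain `Int` values, so nothing stops one from being passed where the other is expected.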
A brief survey of text encodings
Text encoding differs across OSes and languages because its history is as old as computer science itself: from basic 7-bit ASCII in the 1960s, to the introduction of 16-bit Unicode in the 1990s, to the latest developments in representing emojis and other rich text, which can use wide 32-bit or complicated variable-width UTF-8 encodings. Here are some of the differences to be aware of if you write code that manipulates text:
Our solution: the index type safety trick
To provide Grammarly’s writing suggestions, different technologies across our back-end and front-end use and keep track of string indices. While developing our iOS keyboard app, we had a class of issues resembling the previous example. It was too hard to tell one kind of string index from another and too easy to use an index out of context, for the wrong type of text encoding.
We weren’t going to get our different technologies to agree on a single definition for “character.” But we could bring the underlying assumptions into the light, and know which definitions we were applying. For example, we could clearly differentiate between UTF-16 and grapheme cluster indices.
The solution we developed was to leverage the type system and create stronger types for string indices. Stronger typing enforced consistency, making our string-manipulation code both safer and more readable.
Single-field wrapper structs for each collection type
The `String` type conforms to the Swift `Collection` protocol since it’s a collection of grapheme clusters. `String.UTF16View` also conforms to the `Collection` protocol, as it’s a collection of code units. And other string views are, in the world of Swift, also collections of different types. So we started out by defining single-field wrapper structs to provide indices for each collection type.
Here is the wrapper for the index that counts by grapheme clusters (which is what the `Character` type represents in Swift):
struct CharacterOffset: RawRepresentable, Equatable, Hashable, Codable {
    let rawValue: Int
}

// We need to extend some basic properties of integers to CharacterOffset
extension CharacterOffset: ExpressibleByIntegerLiteral, Comparable, SignedNumeric, Strideable {
    init(integerLiteral value: Int) {
        self.init(rawValue: value)
    }

    static func < (lhs: CharacterOffset, rhs: CharacterOffset) -> Bool {
        return lhs.rawValue < rhs.rawValue
    }

    // ~100 more lines…
}
Next, we made a wrapper for the index that counts by UTF-16 code units:
struct UTF16Offset: RawRepresentable, Equatable, Hashable, Codable {
    let rawValue: Int
}

extension UTF16Offset: ExpressibleByIntegerLiteral, Comparable, SignedNumeric, Strideable {
    init(integerLiteral value: Int) {
        self.init(rawValue: value)
    }

    // Just keep copy-and-pasting here… hmm, can we do better?
}
One generic to wrap them all
A lot of copying and pasting is usually an indicator that there’s a better solution. And there was: We found we could make our wrapper generic over all kinds of collections.
To do so, we used the concept of a phantom type (a type that is not used to store any data, just to satisfy the type system). We made the wrapper type generic, and put the collection type as a generic parameter:
// C is the phantom type: it is never stored, it only tells offsets apart
struct IndexOffset<C: Collection>: RawRepresentable {
    let rawValue: Int
}
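With the generic wrapper, the integer-like conformances only have to be written once. The sketch below shows two of them. Which conformances the original code carried over, and how, is not shown in this article, so treat the selection as an assumption.

```swift
struct IndexOffset<C: Collection>: RawRepresentable, Equatable {
    let rawValue: Int
}

// Written once, these conformances apply to every collection's offset type.
extension IndexOffset: Comparable {
    static func < (lhs: IndexOffset, rhs: IndexOffset) -> Bool {
        return lhs.rawValue < rhs.rawValue
    }
}

extension IndexOffset: ExpressibleByIntegerLiteral {
    init(integerLiteral value: Int) {
        self.init(rawValue: value)
    }
}

let begin: IndexOffset<String.UTF16View> = 2
let end: IndexOffset<String.UTF16View> = 5
let range = begin..<end  // Range<IndexOffset<String.UTF16View>>

let characterOffset: IndexOffset<String> = 2
// begin == characterOffset  // does not compile: the phantom types differ
print(range.contains(3), characterOffset.rawValue)  // prints "true 2"
```

`Comparable` is what allows a typed offset to serve as the bound of a `Range`.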
We refined the Collection protocol to support a typed index:
protocol CollectionWithIndexOffset: Collection {
    // By default, each collection gets an offset type tagged with itself
    associatedtype Offset: RawRepresentable = IndexOffset<Self>
        where Offset.RawValue == Int
}

extension CollectionWithIndexOffset {
    subscript(offset: Offset) -> Element {
        return self[index(startIndex, offsetBy: offset.rawValue)]
    }
}
Finally, we conformed existing collections to the new protocol:
extension String: CollectionWithIndexOffset {}
extension String.UTF16View: CollectionWithIndexOffset {}
Note that String and Substring are different types in Swift, with the substring containing a reference to the original string (this avoids duplication). The indices are compatible, so to handle substrings, we defined a custom implementation referring to String:
extension Substring: CollectionWithIndexOffset {
    typealias Offset = String.Offset
}

extension Substring.UTF16View: CollectionWithIndexOffset {
    typealias Offset = String.UTF16View.Offset
}
Using index type safety in practice
Because we extended the protocol for collections to use a typed index, `IndexOffset<String>` and `IndexOffset<String.UTF16View>` are now incompatible integer wrappers.
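A short, self-contained sketch of what that incompatibility looks like in use. It repeats the definitions from this section so that it compiles on its own. The sample string is our own.

```swift
struct IndexOffset<C: Collection>: RawRepresentable {
    let rawValue: Int
}

protocol CollectionWithIndexOffset: Collection {
    associatedtype Offset: RawRepresentable = IndexOffset<Self>
        where Offset.RawValue == Int
}

extension CollectionWithIndexOffset {
    subscript(offset: Offset) -> Element {
        return self[index(startIndex, offsetBy: offset.rawValue)]
    }
}

extension String: CollectionWithIndexOffset {}
extension String.UTF16View: CollectionWithIndexOffset {}

// "a", then the teacher emoji (1 Character, 7 UTF-16 code units), then "b".
let text = "a\u{1F469}\u{1F3FC}\u{200D}\u{1F3EB}b"

let characterOffset = String.Offset(rawValue: 2)
let utf16Offset = String.UTF16View.Offset(rawValue: 2)

let character = text[characterOffset]   // third Character
let codeUnit = text.utf16[utf16Offset]  // third UTF-16 code unit

print(character)                    // prints "b"
print(String(codeUnit, radix: 16))  // prints "dc69"

// text[utf16Offset]  // does not compile: String has no subscript taking IndexOffset<String.UTF16View>
```

The same raw value, 2, selects different elements depending on the view, and the compiler now refuses to let an offset for one view be applied to the other.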
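Consider the code behind the bug in our previous example, where the range is a pair of plain integers and nothing says which encoding they count in:
extension String {
    mutating func applyEdit(range: Range<Int>, replacement: String) {
        // Offsets are applied as grapheme cluster counts, whatever the caller meant
        let indexRange =
            index(startIndex, offsetBy: range.lowerBound) ..<
            index(startIndex, offsetBy: range.upperBound)
        replaceSubrange(indexRange, with: replacement)
    }
}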
With typed offsets, we can rely on index type safety to prevent such bugs:
extension String {
    mutating func applyEdit(range: Range<String.UTF16View.Offset>, replacement: String) throws {
        let indexRange = try
            index(startIndex, offsetBy: range.lowerBound) ..<
            index(startIndex, offsetBy: range.upperBound)
        replaceSubrange(indexRange, with: replacement)
    }

    // Converts to the UTF-16 view, advances there, then converts back.
    // `unwrap()` is not a standard library method; it is assumed to be a helper that throws on nil.
    private func index(_ base: Index, offsetBy offset: String.UTF16View.Offset) throws -> Index {
        let baseUTF16 = try base.samePosition(in: utf16).unwrap()
        let resultUTF16 = utf16.index(baseUTF16, offsetBy: offset.rawValue)
        return try resultUTF16.samePosition(in: self).unwrap()
    }
}
Notice that the type system prevented us from using the default `index(_:offsetBy:)` API: we had to write our own implementation with the correct index conversion baked in.
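The `unwrap()` calls above are not part of the Swift standard library. One plausible shape for such a helper is sketched below. The error type and its name are assumptions, since the original helper is not shown.

```swift
// Hypothetical error type for the helper below.
struct UnexpectedNilError: Error {}

extension Optional {
    // Returns the wrapped value, or throws instead of crashing on nil.
    func unwrap() throws -> Wrapped {
        guard let value = self else { throw UnexpectedNilError() }
        return value
    }
}

let emoji = "\u{1F469}"  // one Character, two UTF-16 code units
let insidePair = emoji.utf16.index(emoji.utf16.startIndex, offsetBy: 1)

do {
    // Offset 1 lands between the two halves of a surrogate pair,
    // so it has no matching String index and unwrap() throws.
    _ = try insidePair.samePosition(in: emoji).unwrap()
    print("converted")
} catch {
    print("not on a Character boundary")
}
```

Throwing here turns a UTF-16 offset that splits a character into a recoverable error rather than a crash or a silently misplaced edit.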
Finally, we can update our APIs to use indices in a type-safe way:
func deleteBackward(_ count: String.Offset)
func adjustTextPosition(_ offset: String.UTF16View.Offset)
struct EditCommand: Codable {
    var begin: String.UTF16View.Offset
    var end: String.UTF16View.Offset
    var replacement: String
}
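For `EditCommand` to remain `Codable`, the offset type has to be `Codable` as well. The article does not show how this was done. One possibility, sketched here as an assumption, leans on the standard library's default coding for `RawRepresentable` types with an `Int` raw value, which keeps the serialized form a bare integer.

```swift
import Foundation

struct IndexOffset<C: Collection>: RawRepresentable {
    let rawValue: Int
}

// For a RawRepresentable type with an Int raw value, the standard library
// supplies the encoding and decoding, so an empty extension is enough.
extension IndexOffset: Codable {}

struct EditCommand: Codable {
    var begin: IndexOffset<String.UTF16View>
    var end: IndexOffset<String.UTF16View>
    var replacement: String
}

// The offsets travel as plain integers, exactly as in the untyped version.
let json = #"{"begin": 2, "end": 5, "replacement": "have"}"#
let command = try! JSONDecoder().decode(EditCommand.self, from: Data(json.utf8))
print(command.begin.rawValue, command.end.rawValue, command.replacement)  // prints "2 5 have"
```

The wire format stays the same as before, and only the in-memory type records which encoding the numbers count in.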
Conclusion
It’s hard to have a productive conversation, much less a well-functioning program, when you haven’t defined your terms ahead of time. We used type safety to define our string indices and avoid some of the pitfalls of developing for the iOS keyboard in particular, and for text manipulation in general. We hope something in this discussion is applicable to your own work, wherever underlying assumptions might be lurking. If you’re interested in helping solve problems like this one at Grammarly and helping to improve lives by improving communication, come join our team—check out our list of open roles.