A LOOK AT APPLE’S NEW TRANSFORMER-POWERED PREDICTIVE TEXT MODEL

New York, NY — September 08, 2023

At WWDC earlier this year, Apple announced that upcoming versions of iOS and
macOS would ship with a new feature powered by “a Transformer language model”
that will give users “predictive text recommendations inline as they type.”

Upon hearing this announcement, I was pretty curious about how this feature
works. Apple hasn’t deployed many language models of their own, despite most of
their competitors going all-in on large language models over the last couple
years. I see this as a result of Apple generally priding themselves on polish
and perfection, while language models are fairly unpolished and imperfect.

As a result, this may be one of the first Transformer-based models that Apple
will ship in one of its operating systems, or at least one of the first that
it has acknowledged publicly. This left me with some questions about the
feature, notably:

 * What underlying model is powering this feature?
 * What is its architecture?
 * What data was used to train the model?

After spending some time with these questions, I was able to find some answers,
but many of the details still remain unclear. If you’re able to get any further
than I could, please get in touch!


HOW DOES THE FEATURE WORK?

After installing the macOS beta, I immediately opened the Notes app and started
typing. Even after trying many different sentence structures, I found that the
feature appeared less often than I expected it to. It mostly completes
individual words.

Predictive text completing one word at a time.

The feature will occasionally suggest more than one word at a time, but this is
generally limited to instances where the upcoming words are extremely obvious,
similar to the autocomplete in Gmail.

Predictive text completing two words at a time.


CAN WE DIG DEEPER?

Finding the model itself was a little tough, but I eventually found the model
being used by AppleSpell, an internal macOS application that checks for spelling
and grammar mistakes as you type. With the help of xpcspy, I wrote a Python
script that snoops on AppleSpell activity and streams the most probable
suggestions from the predictive text model as you type in any application.

My “predictive spy” script in action.

Unfortunately, I wrote this script earlier in the summer, on the first macOS
Sonoma beta. In one of the subsequent betas (I’m not sure which), Apple removed
the unused completions from the XPC messages sent by AppleSpell. I wasn’t able
to glean too much about the model’s behavior from these completions, but it was
still a cool find.
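
If you’re curious what the approach looks like under the hood, here’s a
stripped-down sketch using Frida’s Python bindings directly (xpcspy is built on
top of Frida): attach to AppleSpell, hook xpc_connection_send_message, and
print a description of each outgoing XPC message. My actual script does a lot
more work to parse the suggestions out of these messages, but this captures the
gist.

import sys

import frida

# JavaScript injected into AppleSpell: hook outgoing XPC messages and
# forward a human-readable description of each one back to Python.
JS = """
const xpcSend = Module.getExportByName(null, 'xpc_connection_send_message');
const describe = new NativeFunction(
    Module.getExportByName(null, 'xpc_copy_description'),
    'pointer', ['pointer']);

Interceptor.attach(xpcSend, {
    onEnter(args) {
        // args[1] is the xpc_object_t being sent.
        send(describe(args[1]).readUtf8String());
    }
});
"""

def on_message(message, data):
    if message["type"] == "send":
        print(message["payload"])

# Attaching to a system process like AppleSpell may require running as root.
session = frida.attach("AppleSpell")
script = session.create_script(JS)
script.on("message", on_message)
script.load()
print("Snooping on AppleSpell XPC traffic; start typing in any app...")
sys.stdin.read()  # keep the hook alive until you interrupt the script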


WHERE IS THE MODEL?

After some more digging, I’m pretty sure I found the predictive text model in
/System/Library/LinguisticData/RequiredAssets_en.bundle/AssetData/en.lm/unilm.bundle.
The bundle contains multiple Espresso model files that are used while typing
(Espresso appears to be the internal name for the part of CoreML that runs
inference on models). I wasn’t ultimately able to reverse-engineer the model,
but I’m fairly confident this is where the predictive text model is kept. Here’s
why:

 1. Many of the files in unilm.bundle don’t exist on macOS Ventura (13.5), but
    they do exist on the macOS Sonoma beta (14.0). And the files that do exist
    in both versions have all been updated in Sonoma.
 2. sp.dat, one of the files in unilm.bundle, exists on Ventura, but it’s been
    updated in the Sonoma beta. In the updated version of the file, I found what
    looks pretty clearly like a set of tokens for a tokenizer.
 3. The number of tokens in sp.dat matches the shape of the output layer in both
    unilm_joint_cpu.espresso.shape and unilm_joint_ane.espresso.shape (ANE =
    Apple Neural Engine), two files in unilm.bundle that describe the shapes of
    layers in an Espresso/CoreML model. This is what we would expect to see for
    a model that is trained to predict the next token (a quick way to check this
    is sketched just after this list).
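
If you want to reproduce that last check yourself, here’s a minimal sketch. It
assumes the .espresso.shape files can be parsed as JSON or a property list (an
assumption on my part, since the on-disk format isn’t documented) and simply
looks for any dimension equal to the 15,000-token vocabulary size from sp.dat:

import json
import plistlib
from pathlib import Path

BUNDLE = Path(
    "/System/Library/LinguisticData/RequiredAssets_en.bundle"
    "/AssetData/en.lm/unilm.bundle"
)
VOCAB_SIZE = 15_000  # number of tokens found in sp.dat

def load_shapes(path):
    # The shape file format isn't documented; try JSON first, then a plist.
    data = path.read_bytes()
    try:
        return json.loads(data)
    except ValueError:
        return plistlib.loads(data)

def contains_vocab_dim(obj):
    # Recursively search for any dimension equal to the vocabulary size.
    if isinstance(obj, dict):
        return any(contains_vocab_dim(v) for v in obj.values())
    if isinstance(obj, list):
        return any(contains_vocab_dim(v) for v in obj)
    return obj == VOCAB_SIZE

for name in ("unilm_joint_cpu.espresso.shape", "unilm_joint_ane.espresso.shape"):
    shapes = load_shapes(BUNDLE / name)
    print(name, "->", contains_vocab_dim(shapes))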


THE PREDICTIVE TEXT MODEL’S TOKENIZER

I found a set of 15,000 tokens in unilm.bundle/sp.dat that pretty clearly look
like they form the vocabulary set for a large language model. I wrote a script
that you can use to see this vocabulary file for yourself, which you can check
out on GitHub.

The vocabulary starts with <pad>, <s>, </s>, and <unk> tokens, which are all
fairly common special tokens. Here they are in the vocabularies of roberta-base
and t5-base, two popular language models:

>>> from transformers import AutoTokenizer
>>>
>>> tokenizer = AutoTokenizer.from_pretrained("roberta-base")
>>> tokenizer.convert_ids_to_tokens([0, 1, 2, 3])
['<s>', '<pad>', '</s>', '<unk>']
>>>
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> tokenizer.convert_ids_to_tokens([0, 1, 2])
['<pad>', '</s>', '<unk>']


Next come the following sequences:

 * 20 special tokens, named UniLMCTRL0 through UniLMCTRL19
 * 79 contractions (I’d, couldn’t, you’ve…)
 * 1 special _U_CAP_ token
 * 20 special tokens, named _U_PRE0_ through _U_PRE19_
 * 60 special tokens, named _U_NT00_ through _U_NT59_
 * 100 emojis

And then comes a more normal-looking list of 14,716 tokens, most of which begin
with the special character ▁ (U+2581), which is commonly used by SentencePiece
tokenizers, such as the one used by T5, to denote a space at the start of a
word.
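
For comparison, here’s how t5-base’s SentencePiece tokenizer (seen earlier)
marks the start of each word with ▁:

>>> from transformers import AutoTokenizer
>>>
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> tokenizer.tokenize("the cat and the dog")
['▁the', '▁cat', '▁and', '▁the', '▁dog']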

I have to say that this vocabulary file strikes me as pretty unique, but it’s
definitely not out of the question for a language model deployed in this
setting. I’ve personally never seen emojis featured so prominently in a language
model’s tokenizer, but existing research has shown that domain-specific models
and tokenizers can drastically improve downstream model performance. So it makes
sense that a model trained for use in things like text messages, in which emojis
and contractions will be used a lot, would prioritize them.


MODEL ARCHITECTURE

Based on the contents of the unilm_joint_cpu model from earlier, we can make
some assumptions about the predictive text network. Despite sharing a name with
Microsoft’s UniLM from 2019, it looks to me more like a model based on GPT-2.

GPT-2 has four main parts: token embeddings, positional encodings, a series of
12-48 decoder blocks, and an output layer. The network described by
unilm_joint_cpu appears to be the same, except with only 6 decoder blocks. Most
of the layers within each decoder block have names like
gpt2_transformer_layer_3d, which would also seem to suggest it’s based on a
GPT-2 architecture.

From my calculations based on the sizes of each layer, Apple’s predictive text
model appears to have about 34 million parameters and a hidden size of 512
units. This makes it much smaller than even the smallest version of GPT-2:

Model                           Decoder Blocks   Parameters   Hidden Size
Apple’s predictive text model   6                34M          512
gpt2                            12               117M         768
gpt2-medium                     24               345M         1024
gpt2-large                      36               762M         1280
gpt2-xl                         48               1542M        1600
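
Those numbers line up with a quick back-of-the-envelope parameter count for a
GPT-2-style decoder of this size. This assumes a standard GPT-2 block layout
(4× feed-forward expansion, untied input and output embeddings) and ignores
positional embeddings, so treat it as a rough check rather than an exact
figure:

>>> vocab, d_model, n_layers, d_ff = 15_000, 512, 6, 4 * 512
>>> embeddings = vocab * d_model                  # token embeddings
>>> attn = 4 * d_model * d_model + 4 * d_model    # Q, K, V, output projections + biases
>>> mlp = 2 * d_model * d_ff + d_ff + d_model     # feed-forward up/down projections + biases
>>> norms = 2 * 2 * d_model                       # two layer norms per decoder block
>>> total = embeddings + n_layers * (attn + mlp + norms) + vocab * d_model
>>> round(total / 1e6, 1)
34.3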

For the limited scope of the predictive text feature, this makes sense to me.
Apple wants a model that can run very quickly and very frequently, without
draining much of your device’s battery. When I was testing the predictive text
feature, suggestions appeared almost instantly as I typed, making for a great
user experience. While the model’s limited size means it wouldn’t be very good
at writing full sentences or paragraphs, when it is highly confident about the
next word or two, those words are likely to be good enough to suggest to the
user.

However, with my script that snoops on activity from AppleSpell, we can get the
model to write full sentences anyway. If I type “Today” as the first word of my
sentence and take the model’s top suggestion each time, here’s what I get
(video):

> Today is the day of the day and the day of the week is going to be a good
> thing I have to do is get a new one for the next couple weeks and I think I
> have a lot of…

Not very inspiring. We can compare this with the output from the smallest GPT-2
model:

> Today, the White House is continuing its efforts against Iran to help the new
> President, but it will also try to build new alliances with Iran to make more…

Or the largest GPT-2 model:

> Today, the U.S. Department of Justice has filed a lawsuit against the city of
> Chicago, the Chicago Police Department, and the city’s Independent Police
> Review Authority, alleging that the police department and the Independent
> Police Review Authority engaged in a pattern or practice…
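
If you want to play with GPT-2 completions yourself, the transformers library
makes this easy. Here’s a quick sketch; it uses greedy decoding, so expect more
repetitive output than the samples above:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Swap in "gpt2-xl" to compare against the largest GPT-2 model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Today", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=False,  # greedy: always take the top prediction, like the test above
)
print(tokenizer.decode(output[0], skip_special_tokens=True))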

Pretty cool seeing the effects of all those extra parameters! It’ll be
interesting to see how this feature grows and evolves in the future, and whether
Apple decides to keep its scope fairly narrow or someday expand its abilities.

If you’re interested in trying any of this out for yourself, all of my code is
on GitHub.


© Jack Cook 2023