Loeb.nyc Data Scientist Macklin Fluehr takes a fascinating deep dive into the world of data, automation, spell correction, and the complexities behind machine learning.
How do data analysts account for and correct abbreviations in large volumes of information (like from social media, where abbreviations are common)? How do we guess that “Watr” means “water,” not “waiter”?
Often, part of my job as a data scientist involves standardizing string data. Punctuation, capitalization, and spell correction are all pretty standard parts of my toolkit here, and fairly common problems for a data scientist. However, recently I started working with data sets that involve abbreviated words, where “abbreviated” covers anything from acronyms to shorthand to slang. This is fairly common in social media text data and supermarket receipts. Abbreviated words are not so fun to standardize, particularly if there are a lot of them to keep track of.
If you do a thorough internet search, you’ll find that the state of the art for expanding abbreviated words and shorthand is to create a manually curated dictionary that maps abbreviations to their expanded forms. As you might imagine, these dictionaries are painstaking to create and, particularly on large corpora, are not a manageable or scalable solution.
Throughout my career, I had used spell correctors, which probabilistically find the best correction for a misspelled word. Unfortunately, “string expansion,” while related to spell correction, is a fundamentally different problem. If you were to apply a spell corrector to abbreviated data, you would find that, while it does modify your word, the word you are trying to expand to is usually too far away in edit distance for the corrector to reach it. If this doesn’t mean anything to you yet, don’t worry; it will. In summary, by design, spell correctors are too short-sighted to handle most word expansions.
After sitting with this problem for a while, I realized that there wasn’t a prepackaged solution for this problem. I also realized that algorithmically, there were pre-existing concepts that I could leverage for different pieces of what I’ve now named Word_Expander. Yes, not a flashy name, but it seems pretty flashy next to the current state of the art when you put it into practice. Now, onto how it works.
The Problem Space
Let’s lay out the problem space. What does Word_Expander actually solve?
I have outlined three major ways a word (or string of words) might be abbreviated (a few toy examples follow the list below):
1. Vowels – all the vowels of the word are removed (and maybe some consonants too).
2. Syllable – only the first syllable of the word is written.
3. Intrinsic – expansions that simply must be learned (non-intuitive expansions, acronyms, slang).
*Acronyms – typically the first letters of several words are strung together to form one (I would argue that this is a subclass of Intrinsic; many acronyms cannot be intuited, as you can see from the second example below).
E.g.
AA – Alcoholics Anonymous
DISEC – Disarmament and International Security Committee
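To make the three categories concrete, here is a small, purely illustrative set of abbreviations and the expansions we would want for them (the acronym entries come from the examples above; the rest are hypothetical):

```python
# Hypothetical examples of the three abbreviation styles and their target expansions.
examples = {
    # Vowels: vowels (and sometimes consonants) stripped out
    "wtr": "water",
    "msg": "message",
    # Syllable: only the first syllable is written
    "sat": "saturday",
    "choc": "chocolate",
    # Intrinsic: you simply have to know these
    "aa": "alcoholics anonymous",
    "disec": "disarmament and international security committee",
}
```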
At risk of making the model sound simple, each of the above problems can be solved with its own technique: a delete-lookup in the style of SymSpell for Vowels, an autocomplete Trie for Syllables, and a manually curated dictionary for Intrinsic expansions. Word_Expander ties all of these solutions together into a unified model.
If you don’t know what each of these is yet, don’t worry, I will go through each of these solutions in further detail. However, to understand the solutions in full, we first need to understand spell correctors, what they do, and why they don’t work here. Spell-corrector theory is very important for understanding how probabilistic word expansion works.
I will talk about two methods of spell correction, Norvig’s Method and SymSpell, the second of which builds on the theory of the first. I mention Norvig because he is effectively the father of spell-corrector theory in the data science space; if you do any googling on spell correction, Peter Norvig will come up. However, while his methods are useful for understanding the problem space, they ultimately create code that runs slowly, particularly for high edit distances. He did mention that his theory was for a “toy” spell corrector, whereas we are looking to build an industrial word expander. That said, I’ll lay some quick foundation with Norvig and then transition over to SymSpell, which contains the data structures that Word_Expander is built upon.
As Norvig explains, a spell corrector assumes that someone made a mistake (or multiple mistakes) when spelling a word. A mistake could be:
1. An Insertion – an extra letter was typed (‘watter’ for ‘water’).
2. A Deletion – a letter was left out (‘wter’ for ‘water’).
3. A Replacement – one letter was typed in place of another (‘woter’ for ‘water’).
4. A Swap – two adjacent letters were transposed (‘waetr’ for ‘water’).
We count the number of mistakes (or modifications) a word has undergone from its original. We call this count the ‘edit distance’. One modification translates to an edit distance of one, and all the examples above have an edit distance of one. ‘Waaar’ would have an edit distance of two to ‘water’ because we need two replacements (‘te’ for ‘aa’) to make ‘water’. In case you were wondering, there is a nice metric called Damerau-Levenshtein distance which can measure this spell-correction edit distance for us (note: this is not to be confused with Levenshtein distance, which excludes swaps).
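To make the metric concrete, here is a minimal sketch of the (restricted) Damerau-Levenshtein distance, counting insertions, deletions, replacements, and swaps of adjacent letters. This is my own toy implementation for illustration, not code from Word_Expander:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(b) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement (or match)
            )
            # swap of two adjacent letters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

print(damerau_levenshtein("waaar", "water"))   # 2 -- two replacements
print(damerau_levenshtein("wamter", "water"))  # 1 -- one deletion
```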
So, how does a Norvig spell checker work? Given a word that needs to be spell checked, it finds all the possible Insertion, Deletion, Replacement, and Swap possibilities for that word. It will do this for an edit distance of one up to an edit distance the user specifies. As you might imagine, doing this for larger edit distances quickly becomes computationally intractable, which is why Norvig spell-correctors typically use edit distances of one or two.
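Here is a minimal sketch of that candidate-generation step, in the spirit of Norvig’s published corrector (simplified for illustration):

```python
import string

def edits1(word: str) -> set:
    """All strings one edit (delete, swap, replace, insert) away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def edits2(word: str) -> set:
    """All strings two edits away -- already much more expensive to enumerate."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

print(len(edits1("watr")))  # a couple hundred candidates at edit distance one
print(len(edits2("watr")))  # tens of thousands -- why higher edit distances get slow
```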
Now we have some possibilities for what the word might have been meant to be, but we don’t necessarily know which of them are correct words. For the algorithm to work, we need a corpus of words to correct our inputs to. You might loosely consider this “train data.” Once we have found all the possible edits for a word, we then see which edits exist in our corpus. We bucket our matches by edit distance to the original misspelled word.
The spell corrector chooses the bucket with the smallest edit distance. You can imagine that, for spell-correction, this is the right assumption. Probabilistically, when someone makes a mistake on spelling, the right word isn’t that far away. If you can find a word that is really close, it should be more likely than a word that is further away from it. (As noted in the introduction, this is not the right assumption for word expansion. But we will get to that soon.)
Getting back to our buckets, what if we have multiple matches in a bucket? How could I say that, when correcting ‘wamter’, ‘water’ is better than ‘Walter’? This is where word probabilities come in. Not only does our word corpus provide us with a body of words to match potential corrections to, but it also provides word counts. Effectively, when we make our corpus, we aggregate a lot of the text data we will be spell correcting, and then count the number of occurrences of each word. The higher the word frequency, the more likely the word is to be the correction. This gives us a global context, a global probabilistic “intuition,” with which to choose a good correction.
So, with that information, we select the word with the highest word frequency in our corpus as the best correction. You might be thinking, “what about local context?” What if the syntax of my sentence implies a different word? A global context might pick the wrong answer. This is a limitation of classic spell correctors. However, as you will notice, with good global context counts, spell-correctors typically do very well.
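Putting those pieces together, a toy Norvig-style corrector filters the candidates down to known words, buckets them by edit distance, and breaks ties with frequency. A minimal sketch, assuming the `edits1`/`edits2` helpers above and a hypothetical `word_counts` Counter built from whatever corpus you are correcting against:

```python
from collections import Counter

# Hypothetical corpus counts -- in practice you count words in a large text dump.
word_counts = Counter({"water": 5200, "walter": 310, "waiter": 640, "later": 2100})

def correct(word: str) -> str:
    """Return the most frequent known word in the closest edit-distance bucket."""
    if word in word_counts:                             # edit distance 0
        return word
    for candidates in (edits1(word), edits2(word)):     # buckets: distance 1, then 2
        known = [w for w in candidates if w in word_counts]
        if known:
            return max(known, key=word_counts.get)
    return word                                         # give up: return the input unchanged

print(correct("wamter"))  # 'water' beats 'walter' and 'waiter' on frequency within the bucket
```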
Now, Norvig’s approach is the basic concept here, but, as mentioned earlier, it runs slowly, particularly for higher edit distances. In this case, we turn to a more modern spell correction algorithm called SymSpell, which runs orders of magnitude faster.
The basic concept behind SymSpell abstracts away a bit of Norvig’s theory. Worrying about insertions, deletions, replacements, and swaps is overkill. Instead, we can get the same effect with deletes.
Wait for it. Think about it. Deleting on both our corpus and our input value will get us the same effect as doing all the work Norvig suggested: an insertion error is matched by deleting from the input, a deletion error by deleting from the corpus word, and a replacement or swap by deleting from both.
So where are the speed gains? SymSpell really starts to shine with two major concepts:
1) Precalculation
2) Lookup Tables
Because we are only dealing with deletes, we can actually take our corpus of words and precalculate all the potential deletes for them up to an arbitrary max edit distance. These edits then get mapped back to the original word with a hashtable. Collisions are fine, because they will ultimately be resolved with word frequency counts. The result of precalculating this data structure is that we actually don’t need to do that much work when running the algorithm in real time–only a couple deletes on the input word, which can be evaluated lazily, unlike Norvig’s eager evaluation method. The result is an algorithm that is, amortized, several orders of magnitude faster.
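Here is a minimal sketch of the delete-only idea, a toy version of what SymSpell does rather than the library’s actual API: precompute the deletes of every corpus word into a hashtable, then at lookup time generate only the deletes of the input and intersect.

```python
from collections import Counter, defaultdict
from itertools import combinations

word_counts = Counter({"water": 5200, "walter": 310, "waiter": 640})  # hypothetical counts
MAX_EDIT = 2

def deletes(word: str, max_edit: int) -> set:
    """All strings obtainable by deleting up to `max_edit` characters from `word`."""
    out = {word}
    for k in range(1, min(max_edit, len(word)) + 1):
        for idxs in combinations(range(len(word)), k):
            out.add("".join(c for i, c in enumerate(word) if i not in idxs))
    return out

# Precalculation: map every delete-variant back to the original corpus words.
delete_index = defaultdict(set)
for w in word_counts:
    for variant in deletes(w, MAX_EDIT):
        delete_index[variant].add(w)   # collisions are fine; frequency resolves them later

def lookup(word: str) -> str:
    """At query time we only need deletes of the input -- no inserts, replaces, or swaps."""
    candidates = set()
    for variant in deletes(word, MAX_EDIT):
        candidates |= delete_index.get(variant, set())
    return max(candidates, key=word_counts.get) if candidates else word

print(lookup("wamter"))  # 'water'
```

Real SymSpell also verifies candidates with a Damerau-Levenshtein check and prefers smaller edit distances; the sketch above only shows the precalculation and lookup-table structure.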
Now that we have covered spell correctors, let’s dive into how the model works. As mentioned earlier, the problem with standard spell checkers is that they give preference to edit distance, which often doesn’t work when many vowels have been removed. We can see quite clearly with an example that spell correctors’ deference to edit distance is inappropriate for word expansion. Take the string ‘wld’. Just because we needed to add two vowels to make ‘would’ doesn’t necessarily make it less valid than ‘wild’. Depending on my context, either could work. It’s our word frequency counts that should drive this decision.
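As a quick illustration, reusing the `damerau_levenshtein` sketch from earlier, a corrector that prefers the smallest edit distance will always pick ‘wild’ here, no matter how much more frequent ‘would’ is:

```python
print(damerau_levenshtein("wld", "wild"))   # 1 -- one vowel inserted
print(damerau_levenshtein("wld", "would"))  # 2 -- two vowels inserted
# A spell corrector stops at the distance-1 bucket, so 'would' never even competes.
```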
Given that, to expand a word for a Vowels abbreviation, we have to figure out all the different vowel fill options for the word. This is effectively Norvig’s “inserts” but with only vowels, which we place in between the consonants of the input string. We then check to see if any of the filled strings exist in our deletes hashtable. The ones that exist in our deletes table continue on to the next round. With this smaller subset of complete words, we look at their probability counts to get the most probable expanded word.
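A minimal sketch of that Vowels step, assuming the `delete_index` and `word_counts` structures from the SymSpell sketch above (this illustrates the idea rather than reproducing Word_Expander itself; `expand_vowels` is a hypothetical name):

```python
VOWELS = "aeiou"

def vowel_fills(word: str, max_inserts: int = 2) -> set:
    """Norvig-style inserts, restricted to vowels, applied up to `max_inserts` times."""
    current = {word}
    filled = set(current)
    for _ in range(max_inserts):
        nxt = set()
        for w in current:
            for i in range(len(w) + 1):
                for v in VOWELS:
                    nxt.add(w[:i] + v + w[i:])
        filled |= nxt
        current = nxt
    return filled

def expand_vowels(abbrev: str) -> str:
    """Fill in vowels, keep only fills that appear in the deletes hashtable,
    then pick the mapped corpus word with the highest frequency."""
    candidates = set()
    for fill in vowel_fills(abbrev):
        candidates |= delete_index.get(fill, set())   # from the SymSpell sketch above
    return max(candidates, key=word_counts.get) if candidates else abbrev

# With counts like {"would": 9000, "wild": 1200} in word_counts:
# expand_vowels("wld") -> "would"
```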
Pretty straightforward? Onto Syllables.
Another common way people typically abbreviate words is by only writing the first syllable of the word.
This kind of problem is best solved with an autocomplete algorithm. An autocomplete algorithm effectively does the following: given a prefix, find the word with the highest word frequency that starts with that prefix. As you might already be thinking, we can use the same word frequency corpus that we used for Vowels. However, unlike Vowels, where we used a dictionary, we use a Trie here. For this use case, a Trie uses a lot less memory and is still very fast.
At risk of sounding lazy, Autocomplete Tries are very well documented and, unlike spell correction, I did nearly zero modification to get it to work for this algorithm. I will not explain how they work here, because everyone else has already done a much better job than I would do. However, here’s a picture which I think sums it up pretty quickly.
The only thing to add here is that, when we return all the complete words under the node reached by our abbreviated prefix, we select the one with the highest frequency count. This is very similar to Vowels, just using nodes in a Trie instead of a bucket in a HashTable.
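A minimal sketch of a frequency-aware autocomplete Trie (standard stuff, as noted above; a toy illustration, not Word_Expander’s exact implementation):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.count = 0       # corpus frequency if this node ends a word
        self.word = None

class AutocompleteTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str, count: int) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count, node.word = count, word

    def expand(self, prefix: str) -> str:
        """Walk to the prefix node, then return the most frequent word below it."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return prefix          # unknown prefix: give the input back
            node = node.children[ch]
        best = max(self._words_under(node), key=lambda n: n.count, default=None)
        return best.word if best else prefix

    def _words_under(self, node):
        if node.word is not None:
            yield node
        for child in node.children.values():
            yield from self._words_under(child)

trie = AutocompleteTrie()
for w, c in {"saturday": 800, "satellite": 450, "saturn": 120}.items():
    trie.insert(w, c)
print(trie.expand("sat"))  # 'saturday' -- highest frequency under the 'sat' node
```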
At the beginning of this article, I mentioned that typical word expanders use manually-curated dictionaries. While this is unnecessary and intractable for Vowels and Syllables, it actually makes a lot of sense in specific use cases, particularly when you simply must “know” the expansion–you cannot intuit it.
Vowels and Syllables are intuitive, so an algorithm can “figure them out,” and they will probably handle most of your use cases. For the extreme minority of words you would have to expand, you have the option to continue doing it the old-fashioned way. While this might be disappointing to someone who was looking for an acronym expander, I think there is potential for further development here. The tricky part is not so much the acronym expansion, but being able to recognize when something is an acronym. Read the Limitations section to learn more about relevant next steps.
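For the Intrinsic cases, the old-fashioned approach is just a dictionary lookup. A minimal sketch, where the dictionary contents are hypothetical and checking it before the probabilistic paths is simply one reasonable way to wire it in:

```python
# Manually curated expansions that cannot be intuited.
INTRINSIC = {
    "aa": "alcoholics anonymous",
    "disec": "disarmament and international security committee",
    "b4": "before",
}

def expand(word: str) -> str:
    """Check the curated dictionary first, then fall back to the probabilistic paths."""
    key = word.lower()
    if key in INTRINSIC:
        return INTRINSIC[key]
    # ... otherwise fall through to the Vowels / Syllables logic sketched earlier
    return word
```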
While I find that this algorithm does pretty well when I use a good frequency corpus, its biggest shortcoming is that it does not handle local word context. A good frequency corpus supplies a global context, which helps choose the best words in general. However, at the local, syntactic context level, this algorithm could do better. Imagine, for example, expanding ‘wld’ in a sentence like “we saw a wld animal at the zoo.”
You can imagine that, from a global context, “would” is probably the better word in most corpora. But at the local context level, “wild” is clearly the better word.
How do we extend the algorithm to handle local context? While I haven’t done it yet (since Word_Expander is good enough for nearly all of my use cases), I can imagine that the next step is a highly simplified machine translation problem. Effectively, we are doing translation: we are taking non-English words and processing them into English. This is easier than translating, for example, Spanish to English, because there is no syntactical, cultural, or idiomatic bridge to cross here. We’re just doing word substitution.
How might this work? Remember that Word_Expander can and does return an ordered list of the globally most probable words for a given word input. I think it would be fairly straightforward to use a Convolutional or Recurrent Neural Net to select the best local “translation” from these global options based on the surrounding words. The only question left is the training data.
How might you get labeled training data? You can simply apply Vowels and Syllables in reverse on your raw word corpus (prior to word counts), abbreviating the words yourself. Run the model on this abbreviated data and its candidate lists become your model inputs, while the original corpus provides your training outputs. Boom. A locally intelligent word expander. I invite someone to build this.
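A minimal sketch of how that labeled data might be generated, just applying the Vowels and Syllable abbreviations in reverse to a raw corpus (the abbreviation rules here are simplistic stand-ins):

```python
import random

VOWELS = set("aeiou")

def abbreviate(word: str) -> str:
    """Produce a synthetic abbreviation: either strip vowels or keep a prefix."""
    if random.random() < 0.5:
        stripped = word[0] + "".join(c for c in word[1:] if c not in VOWELS)
        return stripped if len(stripped) > 1 else word
    return word[: max(2, len(word) // 2)]   # crude stand-in for "first syllable"

def make_training_pairs(sentences):
    """Inputs are abbreviated sentences, outputs are the originals."""
    pairs = []
    for sentence in sentences:
        abbreviated = " ".join(abbreviate(w) for w in sentence.split())
        pairs.append((abbreviated, sentence))
    return pairs

print(make_training_pairs(["the wild water was cold"]))
# e.g. [('th wld wtr ws cld', 'the wild water was cold')]
```

In practice you would then run Word_Expander over the abbreviated side to get its ordered candidate lists, which become the inputs the neural net chooses from.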
Word expansion was previously not only tedious but also fairly intractable. While I think there is more work to do here, particularly at the local context level, there is now an algorithm that gets the job done, gets it done fast, and doesn’t give you a headache.
If you have questions, comments, or would like to collaborate, please reach out to me at mfluehr@loeb.nyc
Macklin Fluehr is a member of Loeb.nyc’s Data Analytics shared services team. Find out more about our unique structure here.