Antonio Zamora Podcast TK001

Spelling Aid

This presentation explains how spelling aid identifies misspelled words and provides correctly spelled candidates from a dictionary that can be used to replace the input word.

Transcript:

Spelling aid is a natural language technology that got its start in the 1980s and it is widely available in word processors. This presentation explains how spelling aid identifies misspelled words and provides correctly spelled candidates from a dictionary that can be used to replace the input word. When a correctly spelled word is flagged because it is not in the dictionary, the user can add the word to the dictionary so that it will not be identified as an error in the future.

Spelling verification is just the dictionary lookup task of spelling aid. Spelling verification is relatively simple because it just determines if a word is in the dictionary. Various methods can be used to increase the speed and efficiency of the word search, such as partitioning the dictionary by word frequency, using hashing techniques for screening, or skipping the search altogether for words with numbers or words that are too short or too long compared to the words in the dictionary. Spelling verification is a data retrieval operation that does not pose any difficult linguistic problems. Most of the problems of matching words against a dictionary relate to the handling of capitalization and accented characters, but both of these issues can be resolved by normalizing the representation of the words.
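The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in any IBM product: the `Verifier` class, the `normalize` helper, and the three status strings are hypothetical names, and stripping accents is only one possible normalization (in some languages an accent distinguishes two different words).

```python
import unicodedata

def normalize(word):
    """Lower-case the word and strip accents (an assumed normalization)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

class Verifier:
    def __init__(self, words):
        # A Python set is hash-based, so membership tests are fast.
        self.words = {normalize(w) for w in words}
        lengths = [len(w) for w in self.words]
        self.min_len, self.max_len = min(lengths), max(lengths)

    def verify(self, word):
        """Return 'skipped', 'ok', or 'unknown' for one input word."""
        key = normalize(word)
        if any(ch.isdigit() for ch in key):
            return "skipped"   # words with numbers are never looked up
        if not self.min_len <= len(key) <= self.max_len:
            return "skipped"   # length rules out every dictionary entry
        return "ok" if key in self.words else "unknown"
```

What to do with a "skipped" word (accept it silently or flag it) is a policy choice that the sketch leaves to the caller.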

Spelling verification is a specialized form of information retrieval. Recall and precision are used to measure the effectiveness of information retrieval. Recall is the ability to retrieve the information related to a topic. Perfect recall would retrieve all the relevant information in a database. Retrieving the whole database would achieve 100 percent recall, but the results would contain a lot of irrelevant information. Precision is the ability to retrieve only the relevant information related to a topic. An optimum search should have 100 percent recall with no irrelevant results. In practice, a search will have a mixture of relevant and irrelevant results, and the best search strategy tries to maximize both recall and precision.
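As a small worked illustration, recall is the fraction of the relevant items that were retrieved, and precision is the fraction of the retrieved items that are relevant. The function name below is hypothetical, and returning 0.0 when a denominator is empty is a convention chosen only for this sketch.

```python
def recall_and_precision(retrieved, relevant):
    """Compute (recall, precision) for two sets of document identifiers."""
    hits = len(retrieved & relevant)   # relevant items that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Retrieving the whole database gives perfect recall but poor precision.
database = set(range(10))   # ten documents, identified by number
relevant = {1, 2}           # only two of them are on topic
print(recall_and_precision(database, relevant))
```

With ten documents of which two are relevant, retrieving everything yields a recall of 1.0 and a precision of 0.2.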

The IBM Personal Computer was introduced in 1981 and it immediately became a great commercial success. It was not a very powerful computer, but it could be used for writing letters and performing bookkeeping calculations with spreadsheets. The IBM PC became very important for managing the correspondence and finances of many small businesses and it was very popular with hobbyists. Since the typical computer had only 16 to 64 kilobytes of memory and used floppy disks for storage, the spelling aid programs and the dictionaries had to be very compact.

In order to sell personal computers outside of the United States, IBM had to provide spelling checkers for the major European languages. I worked in the natural language processing department of IBM at this time and I travelled to various countries to explain how to develop the dictionaries for spelling verification. I also developed software for verifying German compound words, and morphological analysis to create information retrieval systems for a company's electronic documents. I was the author or co-author of many U.S. patents, and I helped to implement spelling aid in the AS/400 system, which was introduced by IBM in 1988, and is still in operation today with many hardware improvements.

Spelling errors can be characterized by the number of operations needed to transform the misspelling into the correctly spelled word. There are four types of spelling error operations: omission, insertion, substitution, and transposition. Some people consider transposition to be a compound operation consisting of one deletion and one insertion. Omission consists of leaving one letter out, such as spelling the word "omission" with one s instead of two. Insertion consists of adding an extra letter, such as doubling the s in the word "insertion". Substitution consists of changing one letter for another. This frequently happens when your fingers tap an adjacent letter on the keyboard. Transposition is the exchange of two adjacent letters, as when you type "hte" instead of "the" or "from" instead of "form".
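The four operations can be told apart mechanically when a misspelling is exactly one operation away from the intended word. The following is a rough sketch under that assumption; `classify_error` is a hypothetical helper, and it returns `None` when the two words are identical or differ by more than one operation.

```python
def classify_error(typed, intended):
    """Name the single operation that turned `intended` into `typed`."""
    if typed == intended:
        return None
    # Skip the part of the two words that agrees.
    i = 0
    while i < min(len(typed), len(intended)) and typed[i] == intended[i]:
        i += 1
    t_rest, c_rest = typed[i:], intended[i:]
    if len(typed) + 1 == len(intended) and t_rest == c_rest[1:]:
        return "omission"          # one letter was left out
    if len(typed) == len(intended) + 1 and t_rest[1:] == c_rest:
        return "insertion"         # one extra letter was added
    if len(typed) == len(intended):
        if t_rest[1:] == c_rest[1:]:
            return "substitution"  # one letter was replaced by another
        if (len(t_rest) >= 2 and t_rest[0] == c_rest[1]
                and t_rest[1] == c_rest[0] and t_rest[2:] == c_rest[2:]):
            return "transposition" # two adjacent letters were exchanged
    return None

for typed, intended in [("omision", "omission"), ("inssertion", "insertion"),
                        ("cst", "cat"), ("hte", "the"), ("from", "form")]:
    print(typed, "->", intended, ":", classify_error(typed, intended))
```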

Earlier, I mentioned that spelling verification may convert all the words to lower case in order to be able to match words that have an initial upper case or words written in all caps. But where do you look for spelling aid candidates for a misspelled word? You can search the dictionary using the initial letters of the misspelled word and restrict the number of candidates by rejecting all the words whose length differs from the search term by more than 4 characters, because those words would be too different. It is also necessary to apply phonetic transformation rules to look in different places of the dictionary that may have relevant terms.

For example, for a word that is not in the dictionary and starts with the letter F, the search should also consider some spelling aid candidates starting with the letters "PH", since they have the same pronunciation.
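A minimal sketch of this candidate selection follows. The rule table contains only the F/PH pair mentioned above; the table name, the function names, and the sample words are hypothetical, and a real system would search a sorted or indexed dictionary instead of scanning a list.

```python
# Toy phonetic targeting rules: typed prefix -> other prefixes to search.
PHONETIC_PREFIXES = {"f": ["ph"], "ph": ["f"]}

def target_prefixes(word):
    """Return the dictionary prefixes worth searching for this word."""
    prefixes = [word[:1]]                  # the word's own initial letter
    for typed, alternatives in PHONETIC_PREFIXES.items():
        if word.startswith(typed):
            prefixes.extend(alternatives)  # same sound, different spelling
    return prefixes

def candidates(word, dictionary, max_len_diff=4):
    """Keep words with a targeted prefix and a similar enough length."""
    word = word.lower()
    prefixes = tuple(target_prefixes(word))
    return [w for w in dictionary
            if w.startswith(prefixes)
            and abs(len(w) - len(word)) <= max_len_diff]

words = ["photograph", "photosynthesis", "fountain", "graph", "phone"]
print(candidates("fotograf", words))
```

For the input "fotograf", the word "photosynthesis" is rejected by the length test and "graph" by the prefix test.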

The phonetic targeting rules make it possible to use at least two letters of a word to obtain manageable subsets of the dictionary in the search for spelling aid candidates, but it is also possible to create a reverse dictionary, where every word is spelled backwards. The potential misspelling is then reversed, and the selection of a subset of candidates is made as for the normal dictionary. This technique provides good spelling aid candidates when the front of the word is garbled. The reverse word index can be searched quickly and provides a more robust performance for spelling aid by using additional processing and dictionary storage.
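The reverse dictionary idea can be sketched with a sorted list of reversed words and a binary search. The function names are hypothetical, and using the last two letters as the search key is an assumption made for the illustration.

```python
import bisect

def build_reverse_index(words):
    """Spell every word backwards and sort the results."""
    return sorted(w[::-1] for w in words)

def candidates_by_ending(word, reverse_index, n=2):
    """Find dictionary words that share the last n letters of `word`."""
    key = word[::-1][:n]                        # reversed ending of the input
    i = bisect.bisect_left(reverse_index, key)  # first entry >= key
    found = []
    while i < len(reverse_index) and reverse_index[i].startswith(key):
        found.append(reverse_index[i][::-1])    # restore normal spelling
        i += 1
    return found

index = build_reverse_index(
    ["photograph", "telegraph", "paragraph", "phone", "graphic"])
print(candidates_by_ending("xhotograph", index))
```

The garbled front of "xhotograph" does not matter here because the search starts from the end of the word; the price is the storage and processing for a second index.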

The difference between two words can be calculated using the Levenshtein distance measure. This is basically the number of single-character edit operations required to convert one word into another. Each insertion, deletion or substitution counts as one operation. A transposition of adjacent characters may also be considered one operation.
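As a sketch, the distance can be computed with the standard dynamic programming table. The variant below, often called optimal string alignment, optionally counts a transposition of adjacent characters as one operation; the function and parameter names are illustrative.

```python
def edit_distance(a, b, transpositions=True):
    """Count single-character edits needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all i characters of a
    for j in range(cols):
        d[0][j] = j                      # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(edit_distance("hte", "the"))                        # 1
print(edit_distance("hte", "the", transpositions=False))  # 2
```

Without the transposition rule, "hte" is two operations away from "the", which is why the choice affects how candidates are ranked.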

In addition to searching various locations of the dictionary, spelling aid has to consider whether a dictionary word is similar enough to the misspelling to be offered as a potential candidate. This is accomplished by using a distance measure to rank the similarity of a dictionary word to the misspelling. A more robust distance measure can be constructed by combining the minimum lexical distance of the words and of their corresponding phonetic representations.

Morphological mapping provides another way of identifying string similarities. It is possible to create keys or mappings of words that are invariant for some errors, such as doubling of letters or consonant-vowel transpositions. The keys are constructed by listing the unique consonants of a word in their original order followed by the unique vowels, also in their original order. These keys, when sorted, produce clusters of words that are very similar.
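A key of this kind can be sketched as follows. The function name is hypothetical, and treating only a, e, i, o and u as vowels (so that y counts as a consonant) is an assumption of the sketch.

```python
VOWELS = set("aeiou")

def similarity_key(word):
    """Unique consonants in original order, then unique vowels in order."""
    consonants, vowels = [], []
    for ch in word.lower():
        if not ch.isalpha():
            continue
        bucket = vowels if ch in VOWELS else consonants
        if ch not in bucket:      # keep only the first occurrence
            bucket.append(ch)
    return "".join(consonants) + "".join(vowels)

words = ["form", "colour", "from", "color", "omision", "omission"]
for w in sorted(words, key=similarity_key):
    print(similarity_key(w), w)
```

Both "omision" and "omission" map to the key "msnoi", and "from" and "form" both map to "frmo", so sorting by key places them next to each other. The key is not invariant for an exchange of two consonants: "hte" and "the" produce different keys.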

When we do a Google search we usually get so many results that we do not think about recall and precision. Not many people realize that different results are produced for the British and American spellings. The British spelling of "colourful plants", with "ou", produces suggestions such as "colorful plants for shade" and "colorful plants for pots".

The same search with the American spelling also suggests colorful plants for shade, but the rest of the suggestions are different. Google checks the input word as it is being typed to try to guess what you may want to search. Since millions of people search for similar things, Google presents combinations of terms that have been searched before and for which the results are readily available in order to provide faster response for the user. If the input term seems to be misspelled, Google highlights it in red and suggests a correctly spelled alternative.

Different suggestions are provided if the word being misspelled is a British or an American variant. If a user persists in searching the misspelled term, Google displays a message stating that the results are for the correctly spelled word, but Google allows you to search the misspelled word just in case that is what you really wanted. As an aid, the search shows you how many results are available. A word like "colorful" can produce more than 9 billion results.

A misspelled version of "colorful" with two L's can only retrieve 140,000 results, which is a tiny percentage of the 9 billion instances in the database. Google uses this statistical discrepancy to ask whether you meant to search for the correctly spelled form of "colorful".

A search for photographs of plants, where the word "photographs" is spelled with F's, shows the results for the correctly spelled query. But if you insist on searching for photographs of plants spelled with F's, you will start getting results in foreign languages where the word "photograph" is really spelled with F's.

Thus far, I have only mentioned searching in English. British spellings are used in Canada, Australia and former British colonies, such as India. However, English is only in third place with regard to the number of native speakers. There are more native speakers of Mandarin Chinese and Spanish than of English. You are not getting 100 percent recall for your queries if you are only searching in English.

To search in other languages, we can use Google Translate to convert the query into a foreign language, such as Chinese. When we type "colorful plants" in English, Google Translate provides the corresponding Chinese text.

We can then copy the Chinese text into the search field of Google dot CN, which is the Chinese version of Google. The results will be displayed in a format similar to English, but all the text will be in Chinese. Google Translate can be used to convert the output to English. The amount of data in the world is enormous and it is almost impossible to get 100 percent recall for any particular topic. In addition to Google, other popular search engines are Microsoft Bing, Yahoo, Baidu and Yandex.

Spelling aid can go beyond single words by looking at the context. There are cases where a correctly spelled word does not make sense with the surrounding words. The word cornmeal, which means a meal made by grinding dried corn, is perfectly spelled, but it is completely wrong in the phrase "cornmeal transplant". More likely, what is meant is corneal transplant, which is an eye operation.

Contextual spelling aid is implemented by searching databases of two or more adjacent words while the user types a query. All this happens in the fractions of a second when the user is pressing the keys to complete the query. Imaginative programming and fast access to an extensive database of two- and three-word phrases with frequencies make this possible.
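The idea can be illustrated with a toy table of two-word phrases. Everything in this sketch is hypothetical: the counts are invented for the example, the `CONFUSABLE` table stands in for a real search for similar words, and the threshold is arbitrary. It is not a description of how any search engine actually works.

```python
# Invented phrase frequencies, for illustration only.
BIGRAM_COUNTS = {
    ("corneal", "transplant"): 5000,
    ("cornmeal", "muffins"): 3000,
}

# Correctly spelled words that are easily typed in place of one another.
CONFUSABLE = {"cornmeal": ["corneal"], "corneal": ["cornmeal"]}

def suggest_in_context(first, second, ratio=10):
    """Suggest a replacement for `first` if it fits `second` far better."""
    seen = BIGRAM_COUNTS.get((first, second), 0)
    best, best_count = None, seen
    for alt in CONFUSABLE.get(first, []):
        count = BIGRAM_COUNTS.get((alt, second), 0)
        # Require the alternative phrase to be much more frequent.
        if count > best_count and count >= ratio * max(seen, 1):
            best, best_count = alt, count
    return best

print(suggest_in_context("cornmeal", "transplant"))  # corneal
print(suggest_in_context("cornmeal", "muffins"))     # None
```

The word "cornmeal" is left alone before "muffins" but replaced before "transplant", because only the surrounding word changes which phrase is frequent.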

Some words, like "theater", can be spelled in the United States with either the British or the American spelling. As before, the number of results can vary substantially. The U.S. spelling has over 2 billion results.

The British spelling, "theatre", has over 8 billion results, but all the results displayed are clustered in addresses near the geographical location where the search is conducted. Geographical proximity is another way in which a search can be made more relevant to a user.

Spelling aid for different languages is a common feature in most commercial word processing applications. Spelling aid is even available through cloud services that make it easy to design natural language interfaces for any applications that require human input. It is exciting to see the advances that have been achieved in natural language processing during the past 40 years, and we can expect that the application of artificial intelligence will revolutionize the way in which we interact with computers in the future.