Thursday, October 22, 2015

Fuzzy match of spoken words using string distance algorithms - Dice’s Coefficient

This is a continuation of the fuzzy string matching blog series.  The first blog entry discussed the Leveshtein Distance for finding a recognized word in a vocabulary.  This blog post will discuss Dice’s Coefficient.

The problem we are trying to address is that of matching unconstrained speech recognition to a limited vocabulary.   The result of the speech recognition is open ended and does not always exactly match a word in the expected vocabulary for the natural language understanding and processing components. We will apply the Dice’s Coefficient algorithm, also known as the Sorensen-Dice index, (https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) to perform a fuzzy match of the recognized word with a vocabulary.

Dice’s coefficient measures how similar two strings are by considering the number of common bigrams (pairs of adjacent letters) in the strings.  The result is a value between 0 and 1, with 0 being different and 1 being identical.

The formula for Dice’s Coefficient that compares two strings is:

                2 x Number of bigrams common to both strings
dice = ---------------------------------------------------------------------
       [ Number of bigrams in 1st string + Number of bigrams in 2nd string ]

As before, there are two cases that we will consider for matching a recognized word to one in the expected vocabulary. 

Plural words


Plural words usually include an “s” or “es” added to the end of the root word.  As an example, if the word ‘phone’ is in the vocabulary, how would it’s plural ‘phones’ compare?

The bigrams for phone are:

ph,ho,on,ne

and the bigrams for phones are:

ph,ho,on,ne,es

Finding the similarity between the two words, we have

dice = 2 x common_bigrams(phone,phones) / [num_bigrams(phone) + num_bigrams(phones)]
     = 2 x 4 / [ 4 + 5 ]
     = 8 / 9
     = 0.89

This result shows that the words are very similar. 

Calculating Dice’s Coefficient requires minimal computation, so for our application we calculate it for each word in the vocabulary and select the vocabulary word with the highest value, provided it meets an arbitrary threshold of 0.5 that we emperically determined to be acceptable for our use case.

Using NaturalNode (https://github.com/NaturalNode/natural), which is an NLP library for Node.js, we can also compute Dice’s Coefficient using the following code:

var natural = require('natural');
console.log(natural.DiceCoefficient(“phone”,"phones"));

Output

0.8888888888888888

Similar words


Suppose that the speech recognizer returned the word ‘photo’.  The similarity between ‘phone’ to 'photo’ is again determined by calculating Dice’s coefficent.

The bigrams for photo are:

ph,ho,ot,to

and the bigrams for phone are:

ph,ho,on,ne

Finding the similarity between the two words, we have

dice = 2 x common_bigrams(photo,phone) / [num_bigrams(photo) + num_bigrams(phone)]
     = 2 x 2 / [ 4 + 4 ]
     = 4 / 8
     = 0.5

This result shows that the words are somewhat similar, but not near the result when compared to ‘phones’.

Using NodeNatural to compute the similarity:

var natural = require(‘natural’);
console.log(natural.DiceCoefficient(“photo”,"phone"));

Output

0.5

Which is the same result we obtained from our equation above.

As a more detailed example, suppose we are matching the following word from the speech recognizer:

cellar

with a limited vocabulary consisting of the following words:

phone, cellular, computer, audio, tv

Comparing ‘cellar’ with each word in the vocabulary results in:

cellar vs phone = 0
cellar vs cellular = 0.8333333333333334
cellar vs computer = 0
cellar vs audio = 0
cellar vs tv = 0

We see that ‘cellar' is very similar to ‘cellular' and it’s result is above the threshold of 0.5, so we recognize ‘cellular’ from our vocabulary as the word that was spoken.

Summary




We have seen how Dice’s Coefficient algorithm can be used to do a fuzzy match of plural words and similar words by calculating the similarity between the two words. The greater the coefficient, the more similar the words are.


This enables mis-recognized words from the speech recognition system to be matched to the closest word in a desired vocabulary. In practice for our application of matching recognized words to a vocabulary, we have found Dice’s Coefficient to be better at matching than the Levenshtein Distance algorithm.

Further Reading


Wednesday, September 16, 2015

Fuzzy match of spoken words using string distance algorithms - The Leveshtein Distance

Conversational computing uses speech to interact with a computing system in a natural way. The key components to a conversational computing system are:
  1. Speech recognition
  2. Natural language understanding 
  3. Processing 
  4. Natural language generation 
  5. Text to speech 
There are several speech recognition services available to convert voice to text in a browser, such as the HTML5 Web Speech APIs or IBM’s Watson Speech APIs. For unconstrained speech recognition, the result of the recognition is open ended and does not always match a word in the expected vocabulary for the natural language understanding and processing components. We will look at the Levenshtein Distance (LD) algorithm (https://en.wikipedia.org/wiki/Levenshtein_distance) that can be used to perform a fuzzy match of the recognized word with a vocabulary.

There are two cases that we will consider for finding a recognized word in a vocabulary. Note that misspelled words are not a concern when using speech recognition systems, since the recognizer always returns correctly spelled words. However, LD is effective in matching misspelled words with a dictionary.

Plural words


To handle plurals of words in a vocabulary, both the singular and plural versions of the word need to be included. However, this greatly increases the size and complexity of the vocabulary, and hence the number of possible word combinations that have to be considered.

For example, if the vocabulary has the word ‘phone’, but the user either says ‘phones’ or the recognizer returns ‘phones’, the word would not be matched. However, fuzzy matching can be accomplished by using a string distance algorithm, such as the Levenshtein Distance (LD). The distance between two words is the number of deletions, insertions, or substitutions required to transform one word into the other.

To convert ‘phone’ to ‘phones’, there is one insertion. Thus, the distance is 1.

A distance of 0 means that the words are identical. The greater the distance, the more different the words are.

Similar words


Suppose that the speech recognizer returned the word ‘photo’. To convert ‘phone’ to ‘photo’ requires 2 substitutions resulting in a distance of 2.

If the word ‘photo' is in the vocabulary, then after cycling through all words in the vocabulary, ‘photo’ would be returned since it would have a distance of 0. However, if ‘photo’ is not in the vocabulary, then it may be reasonable to interpret the recognized word ‘photo’ as ‘phone’. The higher the distance, the less similar the matched words are.

NaturalNode (https://github.com/NaturalNode/natural) is an NPL library for Node.js that calculates the Levenshtein Distance. Using the examples above,

    var natural = require('natural');
    console.log(natural.LevenshteinDistance(“phone”,"phones"));
    console.log(natural.LevenshteinDistance(“phone”,”photo"));

Output

    1
    2

Which are the same results we obtained above.

By default, deletions, insertions and substitutions all have a cost of 1. However, the cost of each operation can be changed. If a substitution cost was 2, then the above would return

    console.log(natural.LevenshteinDistance("phone","phones",{
      insertion_cost: 1,
      deletion_cost: 1,
      substitution_cost: 2
    })); 
    console.log(natural.LevenshteinDistance("phone","photo",{
      insertion_cost: 1,
      deletion_cost: 1,
      substitution_cost: 2
    }));


Output

    1
    4

The results show that adding an 's' to a word to make it plural returns a distance of 1, while similar words phone and photo that require substitution returns a distance of 4.

Summary


We have seen how the Levenshtein Distance algorithm can be used to do a fuzzy match of plural words and similar words by calculating the distance between the two words. The lower the distance, the more similar the words are. This enables mis-recognized words from the speech recognition system to be matched to the closest word in a desired vocabulary.

Sunday, April 08, 2012

Mobile Technologies

This blog will deal mainly with mobile technologies, but may stray from time to time to other technologies that I find interesting.