Thursday, October 22, 2015

Fuzzy match of spoken words using string distance algorithms - Dice’s Coefficient

This is a continuation of the fuzzy string matching blog series.  The first blog entry discussed the Leveshtein Distance for finding a recognized word in a vocabulary.  This blog post will discuss Dice’s Coefficient.

The problem we are trying to address is that of matching unconstrained speech recognition to a limited vocabulary.   The result of the speech recognition is open ended and does not always exactly match a word in the expected vocabulary for the natural language understanding and processing components. We will apply the Dice’s Coefficient algorithm, also known as the Sorensen-Dice index, (https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) to perform a fuzzy match of the recognized word with a vocabulary.

Dice’s coefficient measures how similar two strings are by considering the number of common bigrams (pairs of adjacent letters) in the strings.  The result is a value between 0 and 1, with 0 being different and 1 being identical.

The formula for Dice’s Coefficient that compares two strings is:

                2 x Number of bigrams common to both strings
dice = ---------------------------------------------------------------------
       [ Number of bigrams in 1st string + Number of bigrams in 2nd string ]

As before, there are two cases that we will consider for matching a recognized word to one in the expected vocabulary. 

Plural words


Plural words usually include an “s” or “es” added to the end of the root word.  As an example, if the word ‘phone’ is in the vocabulary, how would it’s plural ‘phones’ compare?

The bigrams for phone are:

ph,ho,on,ne

and the bigrams for phones are:

ph,ho,on,ne,es

Finding the similarity between the two words, we have

dice = 2 x common_bigrams(phone,phones) / [num_bigrams(phone) + num_bigrams(phones)]
     = 2 x 4 / [ 4 + 5 ]
     = 8 / 9
     = 0.89

This result shows that the words are very similar. 

Calculating Dice’s Coefficient requires minimal computation, so for our application we calculate it for each word in the vocabulary and select the vocabulary word with the highest value, provided it meets an arbitrary threshold of 0.5 that we emperically determined to be acceptable for our use case.

Using NaturalNode (https://github.com/NaturalNode/natural), which is an NLP library for Node.js, we can also compute Dice’s Coefficient using the following code:

var natural = require('natural');
console.log(natural.DiceCoefficient(“phone”,"phones"));

Output

0.8888888888888888

Similar words


Suppose that the speech recognizer returned the word ‘photo’.  The similarity between ‘phone’ to 'photo’ is again determined by calculating Dice’s coefficent.

The bigrams for photo are:

ph,ho,ot,to

and the bigrams for phone are:

ph,ho,on,ne

Finding the similarity between the two words, we have

dice = 2 x common_bigrams(photo,phone) / [num_bigrams(photo) + num_bigrams(phone)]
     = 2 x 2 / [ 4 + 4 ]
     = 4 / 8
     = 0.5

This result shows that the words are somewhat similar, but not near the result when compared to ‘phones’.

Using NodeNatural to compute the similarity:

var natural = require(‘natural’);
console.log(natural.DiceCoefficient(“photo”,"phone"));

Output

0.5

Which is the same result we obtained from our equation above.

As a more detailed example, suppose we are matching the following word from the speech recognizer:

cellar

with a limited vocabulary consisting of the following words:

phone, cellular, computer, audio, tv

Comparing ‘cellar’ with each word in the vocabulary results in:

cellar vs phone = 0
cellar vs cellular = 0.8333333333333334
cellar vs computer = 0
cellar vs audio = 0
cellar vs tv = 0

We see that ‘cellar' is very similar to ‘cellular' and it’s result is above the threshold of 0.5, so we recognize ‘cellular’ from our vocabulary as the word that was spoken.

Summary




We have seen how Dice’s Coefficient algorithm can be used to do a fuzzy match of plural words and similar words by calculating the similarity between the two words. The greater the coefficient, the more similar the words are.


This enables mis-recognized words from the speech recognition system to be matched to the closest word in a desired vocabulary. In practice for our application of matching recognized words to a vocabulary, we have found Dice’s Coefficient to be better at matching than the Levenshtein Distance algorithm.

Further Reading


No comments: