Siri and the problem of intonation

I've spent several years studying how intonation conveys meaning in language, and I have also spent several years working on dialogue systems.  So it is no surprise that I quickly notice how poorly dialogue systems convey meaning through intonation.  In my experience, by and large, they cannot do it.  I'll be picking on Siri here, but none of them are that great.

Here is an example of what I mean.  I ask Siri, "What is the capital of Japan?" and while the answer itself is fine, pay attention to the intonation with which the answer is pronounced:

[Image: tokyo.jpg]

Let me represent intonational prominence (how much "oomph" is placed on a word in terms of pitch change, duration and intensity) with font size to illustrate the problem here.

When I ask, "What is the capital of Japan?" Siri gives me this:

Tokyo is the capital of Japan

This is not how a human speaker of English would answer this question.  A more natural pattern would be something like:

Tokyo is the capital of Japan

Many languages work this way, with only the "informationally important" words (what linguists call the 'focus' of the sentence) receiving prominence.  What exactly it means to be "informationally important" has been debated, but here is a layperson's summary of the view that I take:

The intonational prominence of a word or phrase in English (and languages phonologically similar to English in this regard) is correlated with the likelihood that the hearer of the sentence could guess what the speaker of the sentence meant from just that word or phrase.

With that in mind, we might ask:  what is missing from Siri?  The answer lies in the fact that we are able to predict from just the word "Tokyo" that Siri's answer will express the proposition 'Tokyo is the capital of Japan' because we know the context within which Siri's answer is situated.  In this case, the context is simply which question was asked of her.  Siri is good at using grammatical and other types of information to determine intonation, but not good at using context.

This seems fixable, though not trivially so.  Think about which words are the most prominent, relative to the other words, in the following answers to the question "What is the capital of Japan?":

  1. Tokyo is the capital of Japan.
  2. Japan's capital is Tokyo.
  3. Tokyo is Japan's capital, and Asia's largest city.

The role of context is central to pragmatics, a sub-field of linguistics that studies how meaning depends on context, including non-literal meaning.  By stressing only the word "Tokyo" in answers 1 and 2, we convey that "Tokyo" is the word that actually answers the question under discussion.  Example 3 is more complex, because the speaker adds an interesting fact beyond what was asked about.

In pragmatics, language generation is often modeled as a process that involves predicting how the hearer is going to interpret various potential utterances.  In other words, choosing utterances and interpreting utterances are deeply intertwined, in that conversational participants are always considering each other's communication strategies.  There has been some work in the field of natural language generation (though not a ton) that takes this idea seriously.  (Here's a good example.  See also my work on content selection in dialogue systems.)

Let's apply this idea to the problem of intonation.  Let's run with the hypothesis that the more prominent words are the ones that make the answer easier to guess, given the context.  Consider the "Tokyo" examples.

In the first example ("Tokyo is the capital of Japan"), "Tokyo" should be the most prominent word.  Under the view I'm advocating here, this is because, as the sentence unfolds, the entire answer is already very easy to guess after just the word "Tokyo", given the context (the question).  In other words, the following quantity...

Prob('Tokyo is the capital of Japan' | "Tokyo"; "What is the capital of Japan?")

is much higher than...

Prob('Tokyo is the capital of Japan' | "What is the capital of Japan?")

and about the same (if not the same) as...

Prob('Tokyo is the capital of Japan' | "Tokyo is the capital of Japan"; "What is the capital of Japan?")

meaning "Tokyo" is the real information-carrier.
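To make the comparison concrete, here is a minimal toy sketch in Python.  It is not the hearer model discussed in this post: the candidate answers, their probabilities, and the names PRIOR and prob are all made up for illustration.  The hearer simply discards candidate answers that do not match the words heard so far and renormalizes.

```python
# Toy hearer: a made-up prior over the full answers a hearer might expect
# after hearing only the question "What is the capital of Japan?".
# The candidates and their probabilities are illustrative assumptions.
PRIOR = {
    "Tokyo is the capital of Japan": 0.5,
    "Kyoto is the capital of Japan": 0.25,
    "Osaka is the capital of Japan": 0.25,
}


def prob(intended, heard, prior=PRIOR):
    """Estimate Prob(intended | heard; question).

    Keeps only the candidates whose opening words match what has been
    heard so far, then renormalizes the prior over those candidates.
    """
    heard_words = heard.split()
    consistent = {
        answer: p
        for answer, p in prior.items()
        if answer.split()[: len(heard_words)] == heard_words
    }
    total = sum(consistent.values())
    return consistent.get(intended, 0.0) / total if total else 0.0


target = "Tokyo is the capital of Japan"
question_only = prob(target, "")      # nothing heard yet, question only
after_tokyo = prob(target, "Tokyo")   # only "Tokyo" heard
after_full = prob(target, target)     # whole answer heard

print(question_only, after_tokyo, after_full)  # 0.5 1.0 1.0
```

With these invented numbers, hearing "Tokyo" alone raises the estimate from 0.5 to 1.0, and the remaining five words add nothing.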

How do we incorporate this idea into an NLP or NLG system?  This is very much a work in progress, but one promising direction is to do the following:

  1. Train a neural network to match fragment answers along with their questions (e.g., "What is the capital of Japan?  Tokyo") with full intended answers ("Tokyo is the capital of Japan").
  2. For a given answer to a question, as the answer unfolds, word by word, use the trained model to estimate how well a hearer could guess the intended full answer from the fragment thus far (i.e., first look at "What is the capital of Japan?", then look at "What is the capital of Japan?  Tokyo", then "What is the capital of Japan?  Tokyo is", then add "the" and then "capital" and so on).
  3. Use the change in the above-described quantity from one word to the next to quantify the informational prominence of each word in the answer.
  4. Correlate informational prominence with intonational prominence in some way.
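Steps 2 through 4 can be sketched roughly as follows.  This is only an illustration under stated assumptions: the trained model from step 1 is replaced by a lookup table of hand-picked, hypothetical estimates (TOY_ESTIMATES), and the linear rescaling in prominence_levels is just one arbitrary way to do step 4.  All function names here are mine, invented for the sketch.

```python
from typing import Callable, List, Tuple

# A hearer model maps (question, fragment heard so far, intended answer)
# to an estimate of how well the hearer could guess the intended answer.
Hearer = Callable[[str, str, str], float]


def informational_prominence(
    question: str, answer: str, hearer: Hearer
) -> List[Tuple[str, float]]:
    """Steps 2 and 3: score each word by the change it causes in the estimate."""
    words = answer.split()
    previous = hearer(question, "", answer)  # the question alone
    gains = []
    for i, word in enumerate(words, start=1):
        current = hearer(question, " ".join(words[:i]), answer)
        gains.append((word, current - previous))
        previous = current
    return gains


def prominence_levels(gains, levels: int = 3):
    """Step 4 placeholder: scale positive gains linearly onto 0..levels."""
    top = max((gain for _, gain in gains), default=0.0)
    if top <= 0:
        return [(word, 0) for word, _ in gains]
    return [(word, round(levels * max(gain, 0.0) / top)) for word, gain in gains]


# Stand-in for the trained model of step 1: hypothetical estimates,
# keyed by the answer fragment heard so far.
TOY_ESTIMATES = {
    "": 0.5,
    "Japan's": 0.5,
    "Japan's capital": 0.5,
    "Japan's capital is": 0.5,
    "Japan's capital is Tokyo": 1.0,
}


def toy_hearer(question: str, fragment: str, intended: str) -> float:
    return TOY_ESTIMATES[fragment]


question = "What is the capital of Japan?"
gains = informational_prominence(question, "Japan's capital is Tokyo", toy_hearer)
print(gains)
print(prominence_levels(gains))
```

Keeping the hearer as a plain callable means a trained network could later be dropped in without touching the scoring code.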

As a first stab at implementing this idea, I adapted a convolutional neural network used for paraphrase detection into a hearer model that estimates the informational prominence of words.  (Code is not available yet, as this is still in the early stages.)  I then applied the resulting method to the Tokyo examples from above.  The first two examples are rather straightforward:

[Figure: ex1.png]
[Figure: ex2.png]

Mapped onto intonation, these correspond to the following, respectively:

Tokyo is the capital of Japan

Japan's capital is Tokyo

The third example is more interesting.

[Figure: ex3.png]

Here we have something like this:

Tokyo is Japan's capital and Asia's largest city

Typically we don't stress words like "and", but here it works as a signal that what follows is additional information, beyond what answers the question that was asked.

In any case, the above does nothing to incorporate grammatical and positional information, which systems like Siri already handle quite well.  I envision this sort of method being used to estimate informational features, which would then be combined with those other features as part of a larger method for generating intonational melodies for utterances in a dialogue system.