Expressive Speech Synthesis and Affective Information Retrieval

Human communication is rich, varied, and often ambiguous. This reflects the complexity and subjectivity of our lives. For thousands of years, art, music, drama and storytelling have helped us understand, come to terms with, and express the complexities of our existential experience. Technology has long played a pivotal role in this artistic process: consider, for example, the role of optics in the development of perspective in painting and drawing, or the effect of film on storytelling.

Information Technology has had, and continues to have, an unprecedented impact both on our experience of life and on our means of interpreting that experience. However, our ability to harness this technology to help us understand, come to terms with, and mediate the explosion of electronic data and communication is generally limited to the mundane. Whereas finding the height of Everest in metres is a trivial search request (8,848 m, by the way, according to a Google search), googling the question ‘What is love?’ returns, in the top four results, two popular newspaper articles, a YouTube video of Haddaway and a dating site. It is, of course, an unfair comparison. Google is not designed to offer responses to ambiguous questions with no definite answers. In contrast, traditional forms of art and artistic narrative have done so for centuries.

We might expect speech and language technology, dealing as it does with such a central form of human communication, to be at the forefront of applying technology to the interpretation of our ambiguous and multi-layered experience. In fact, much of the work in this area has avoided ambiguity, and the technology is often used as a tool to disambiguate information rather than as a means to interpret ambiguity. Take, for example, conversational agents (CAs): computer programs that allow you to speak to a device and that respond using computer-generated speech. These systems could potentially harness the nuances of language and the ambiguity of emotional expression. In reality, however, we use them to ask how high Everest is or where to find a nearby pizza restaurant. The ability to handle such requests is important if you are writing an assignment about Everest or want to eat pizza, but it raises the question of how we might extend such systems to help us interpret more complex aspects of the world around us. This technology should strive to do so for two fundamental reasons: firstly, technology has become part of our social life, and as such it needs to be able to engender playfulness and enrich our sense of experience; secondly, applications that could perform a key role in mediating technology for social good require a means of interacting with users in much more complex social and cultural situations.

Conversation has a long tradition as a pastime, as a vehicle for humour, and as a means of helping people with their problems. However, the scope for artificial conversational agents to perform these activities is currently severely limited. In ARIA-VALUSPA we explore approaches that give conversational agents more subtle means of communicating, of becoming more playful, and of representing the ambiguity in our social experience.

The technology required to do this demands close collaboration with engineers working on dialogue. CereProc Ltd, a key partner in ARIA-VALUSPA, is very active in developing techniques to make artificial voices (termed speech synthesis or text-to-speech synthesis, TTS) more emotional, expressive and characterful. These techniques include changing voice quality – for example making a voice sound stressed or calm – adding vocal gestures like sighs and laughs, changing the emphasis from one word to another to alter the subtle meaning of a sentence, changing the rate of speech and the intonation to change how active the voice sounds, and even just making sure the voice doesn’t say the word ‘yes’ the same way every time it says it.
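Several of the controls above can be expressed in standard W3C SSML (Speech Synthesis Markup Language), which many TTS engines accept. The sketch below is a minimal illustration of that idea only: the `<prosody>` and `<emphasis>` tags are standard SSML, but the helper functions and the specific attribute values are hypothetical, and none of this is taken from CereProc's actual API.

```python
import random

# A minimal sketch: building SSML strings for some of the expressive
# controls mentioned above. Tag names are standard W3C SSML; the helper
# names and attribute values here are illustrative assumptions only.

def wrap_ssml(body: str) -> str:
    """Wrap marked-up text in an SSML <speak> envelope."""
    return ('<speak version="1.0" '
            'xmlns="http://www.w3.org/2001/10/synthesis">'
            f'{body}</speak>')

# Slow the rate and lower the pitch so the voice sounds calmer.
calm = wrap_ssml('<prosody rate="slow" pitch="-15%">Everything is fine.</prosody>')

# Shift emphasis from one word to another to alter the implied meaning.
she_gave = wrap_ssml('<emphasis level="strong">She</emphasis> gave him the book.')
gave_not_sold = wrap_ssml('She <emphasis level="strong">gave</emphasis> him the book.')

def varied_yes() -> str:
    """Render 'yes' with randomly varied prosody, so repeated answers
    don't all sound identical."""
    rate = random.choice(["x-slow", "slow", "medium"])
    pitch = random.choice(["-10%", "+0%", "+10%"])
    return wrap_ssml(f'<prosody rate="{rate}" pitch="{pitch}">yes</prosody>')

print(calm)
print(varied_yes())
```

A real system would of course drive these choices from dialogue state and intended character rather than at random; the point is only that expressive variation can be carried in the markup handed to the synthesiser.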

This key work on the way speech is produced can then be used to alter the perceived character of a conversational agent, or to convey an internal state.

So perhaps, in the future, when we ask a system ‘What is love?’, its wistful voice will hint at past romance lost and a sense of longing for human connection, and help us find answers which are not fixed, but emotional, and which depend on the different experiences we share as human beings.
