Aria Valuspa

Artificial Retrieval of Information Assistants – Virtual Agents with Linguistic Understanding, Social skills, and Personalised Aspects

Adapting a Robot’s linguistic style based on socially-aware reinforcement learning using the ARIA-Platform

A new use-case of the ARIA-Valuspa Platform:

The core of the ARIA platform (the SSI recognition pipeline, including Emax, openSMILE, and CereVoice) is used on a social robot to replicate the ARIA scenario (Alice in Wonderland). In addition, the demo features reinforcement learning and natural language generation to adapt the robot's linguistic style to the user's personality preferences. For more details see: Hannes Ritschel, Tobias Baur, Elisabeth André: Adapting a Robot's Linguistic Style Based on Socially-Aware Reinforcement Learning. RO-MAN 2017: 378-384. Want to create your own human-agent scenarios? Check out the ARIA-VALUSPA Platform.
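The reinforcement learning loop in a demo like this can be viewed as a bandit problem over linguistic styles, with the user's social signals serving as reward. The following is a minimal epsilon-greedy sketch; the class name, style labels, and reward scheme are illustrative assumptions, not the implementation from the cited paper.

```python
import random

class StyleBandit:
    """Epsilon-greedy bandit over linguistic styles (illustrative sketch)."""

    def __init__(self, styles, epsilon=0.1):
        self.styles = list(styles)
        self.epsilon = epsilon
        self.counts = {s: 0 for s in self.styles}    # times each style was tried
        self.values = {s: 0.0 for s in self.styles}  # running mean reward

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if random.random() < self.epsilon:
            return random.choice(self.styles)
        return max(self.styles, key=lambda s: self.values[s])

    def update(self, style, reward):
        # Incremental mean of observed rewards (e.g. derived from the user's
        # engagement signals after an utterance in that style).
        self.counts[style] += 1
        n = self.counts[style]
        self.values[style] += (reward - self.values[style]) / n

# Hypothetical usage: style names are made up for illustration.
bandit = StyleBandit(["verbose", "concise", "humorous"])
style = bandit.choose()
bandit.update(style, reward=1.0)  # positive social signal from the user
```

Over repeated interactions, the estimates converge toward the styles the user responds to best, which is the core idea behind adapting linguistic style from socially-aware reward.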


An Embodied Chatbot as a Marketing Tool

Chatbots are everywhere. They offer automated customer support, provide advice and information to your staff, explain processes to new users and more. At the same time, online interactions are being transformed through technologies like augmented reality and virtual reality and by the explosion in the use of mobile devices. But how can these trends be combined to create the kind of highly personalized user experience your clients and coworkers are looking for? With a new-generation virtual assistant, that’s how!

The return of the avatar

This year, avatars are making a big comeback. The proof? The major players are using them! Facebook is developing a tool that can automatically generate a custom avatar from a photo, while Apple has already introduced avatars in its new iPhone X.

Avatars like these allow users to personify their moods and their identity, enabling them to react quickly to current events or a particular situation. For a company, using an avatar reinforces its brand image and humanizes its interactions.

Incidentally, it’s one of the challenges addressed by the ARIA-VALUSPA (Artificial Retrieval of Information Assistants – Virtual Agents with Linguistic Understanding, Social skills, and Personalised Aspects) project: this new-generation conversational agent boasts an expanded evaluation grid and more finely tuned perceptions. The chatbot thus becomes capable of expressing emotions, adopting non-verbal behaviors and, in some situations, even interrupting the person it’s talking to!

In short, ARIAs are much more than simply virtual assistants: they incorporate social aspects into their interactions with users. The chatbot is imbued with increased sensitivity and can reproduce the characteristics of a genuine interaction between two human beings.

A virtual assistant with proactive behaviors

As a direct consequence of the development of avatars, chatbots are no longer satisfied with merely answering questions, but instead have become bona fide tools in the hands of marketing departments everywhere. An intelligent animated avatar can interact proactively and adopt the appropriate behavior depending on the person it’s speaking with.

For example, this type of virtual assistant is central to the marketing strategy of the Canadian insurance company Ignite. A chatbot in the form of an animated robot intervenes in accordance with the various stages of the user’s search for auto or property insurance on the website. In an indispensable preliminary step, the Ignite marketing team programs all of the interventions of the chatbot, named “Igins.” With a hand gesture, it lets the user know that it may be of assistance.

What’s next?

Just as in our day-to-day interactions, it’s important for the exchange to be unobtrusive. As a result, it’s left up to the user to make the decision to interact with the chatbot, in which case Igins knows exactly what to say, based on the user’s queries and browsing behavior.

Scripting the chatbot’s dialog

If Igins intervened at the wrong time or provided the wrong responses, the chatbot wouldn’t be fulfilling its purpose. Its effectiveness rests on the quality of the interaction scenarios that are established upstream, depending not only on the intentions of the user, but also on the objectives of the marketing team.

This scripting consists of creating a type of role-play in which the avatar interrupts or responds to the user, depending on the situation. The interactions may be either verbal or non-verbal: for example, during a waiting period, Igins might do exercises or start dancing in order to distract the user, while also establishing a friendly rapport. It’s a great strategy for preventing the loss of a potential client during an important phase of the transaction!

Empathy and expressiveness for an enhanced user experience

Through its mischievous, playful interactions, Igins adapts to its users, who in the case of the Ignite insurance company consist primarily of Generation X. An empathic avatar will therefore play upon all of its human aspects, both verbal and non-verbal, in order to personalize the interaction.

In a different context, Marlie, the virtual collector for CenturyLink, a US telecommunications company, uses intonation and facial expression to reassure clients as they pay their bill. Each phase of the transaction is therefore scripted to provide the appropriate explanations calmly and amicably.

While this expressiveness may take numerous guises, it doesn’t necessarily have to extend to a humanized character. In fact, some companies prefer not to “incarnate” their conversational robot. In order to respond to these varying needs, the company Living Actor works on all aspects of expressiveness:
• The personality of the virtual assistant, conveyed by its style, tone of voice and expressions
• A more abstract, but equally expressive, representation, as illustrated below by Chad, a conversational robot recently developed by Living Actor

It’s obvious that chatbots are no longer simple software programs designed to understand a question: they have become full-fledged marketing tools capable of providing a premium service to the user and representing your company, respecting its identity and unique characteristics. This combination of language and behavioral scripting technologies makes it possible to forge a genuine relationship between a company and its clients or employees.

What’s the future of online interactions? Restoring humanity and empathy to their rightful place at the center of the exchange for a more personalized user experience!

ARIA-VALUSPA at the DRONGO festival

Last weekend (September 29-30) the DRONGO language festival was held in Utrecht, The Netherlands. It's a Belgian/Dutch festival that has everything to do with language, from bilingualism to dialects, from language learning to spoken dialogue systems. It's meant for all sorts of people: researchers, professionals who work with language, and language hobbyists. The University of Twente was also present at this festival with three demos, one of which was the ARIA-VALUSPA Platform (AVP). We showed our virtual agent Alice to the public, in the wild. We were a bit worried whether the system would survive a whole day in such a busy place, and fortunately it did! Despite the noise and the many people in the room, several visitors held decent and interesting conversations with Alice. Visitors could ask questions about Wonderland or listen to Alice cracking jokes and commenting on their reactions. It was a great way to show our platform to a broader audience. Alice did an amazing job, and we're looking forward to showcasing more capabilities in the future.

ARIA-VALUSPA developer meeting held in Paris

In early June, an ARIA-VALUSPA meeting was held in Paris. It started with two working days for the developers, who actively worked on finalizing a new version of the ARIA-VALUSPA demo. They resolved bugs, sped up the connections between modules, and enhanced several ARIA modules, such as the agent's behaviors and the ASR. The Dialogue Manager was enriched, allowing the agent to provide a larger set of answers.
A particularity of this ARIA-VALUSPA meeting was the encounter with industry. Three French companies were present: EDF, JobMaker, and EasyRecruit. The meeting started with a presentation of the ARIA-VALUSPA project, from its motivation to its system. This was followed by a discussion between the stakeholders and the ARIA-VALUSPA partners on specific use cases they were interested in.
The last day of the Paris meeting was devoted to project management, where we discussed the final development of each module as well as the evaluation of the ARIA system.

Congratulations to Elisabeth André

Award ceremony by Prof. Dr. Steve Feiner, Columbia University, New York, Chair, SIGCHI Achievement Awards Committee

Elisabeth André received an ACM SIGCHI award for her innovative contributions to the field of human-computer interaction at the ACM CHI Conference on Human Factors in Computing Systems, which took place in Denver, Colorado, USA in May 2017 and was attended by nearly 3,000 people. With the award, Elisabeth André was inducted into the ACM SIGCHI Academy. The CHI Academy is an honorary group of principal leaders in the field of human-computer interaction whose work has shaped the field and given direction to it.

CereProc and ARIA-VALUSPA make waves at the Science Museum Lates

A team from CereProc, supported by Dr Eduardo Manuel De Brito Lima Ferreira Coutinho from Imperial College London, demonstrated various aspects of speech synthesis at the Science Museum in London as part of the Royal Society's Next Big Thing project.

Lates is a free event held once a month where adults take over the Science Museum. Every event has a different theme, covering a wide range of topics – from climate change to alcohol, from childhood to robots. These showcases have turned out to be extremely popular and attract around 5,000 visitors per night.

Not surprisingly, then, the CereProc/ARIA team was kept busy all night. Our main activity was 'Bot or Not' – a quiz that lets you test your ability to recognize a synthetic voice and learn about speech synthesis in the process. Everyone who took part was added to the leaderboard and received a personalized message from Donald Trump (totally fake of course – generated using CereProc's prototype Trump voice).

Feedback showed that most players found it a lot more difficult than they thought it would be – and no one has yet reached the perfect score of 20/20!

Try it out here!

We also introduced visitors to (the voice of) Roger, who gets very cross if you try to interrupt him while he's speaking. The interruption demo was created for the ARIA-VALUSPA project as part of the effort to advance the conversational capabilities of virtual agents.

In addition, Dr Coutinho presented his work on sentiment analysis by demonstrating how to tell if a politician is being sincere when giving a speech. Once again, Mr Trump took centre stage! Visitors also got a chance to record and analyse their own speech for signs of disingenuousness.

At CereProc, we live and breathe speech synthesis and we loved sharing our excitement with thousands of people who came along to the event. Like the Royal Society, we believe in promoting excellence in science and are proud to be part of the Next Big Thing movement!

Chatbots are becoming more receptive to human emotion!

The race to develop the best chatbot technology

It is no exaggeration to say that chatbots are generating a lot of interest from companies. All the big names in the IT industry (e.g. Google, Microsoft, Slack, IBM, Facebook, etc.) are following the trend and looking for new conversational interfaces.

Bots and artificial intelligence are broadening tech horizons. Development kits are now available to create artificial intelligence projects in a few clicks (see Skype, Facebook).
The increasing deployment of messaging applications – and conversational user interfaces – has truly boosted the growth of chatbots, which today represent opportunities with few defined limits.

That said, imbuing these programs with more intelligence and "perceptual qualities" has become a major challenge.

Chatbots as meta-applications

Not only are chatbots outstanding gateways to the dissemination of relevant information, but they are also ideal for guiding users, segmenting information, and serving as true “hubs” capable of centralizing services. A chatbot is an e-concierge transformed into a virtual hub that manages applications that are too specialized, or too numerous!

A Nielsen analysis showed that the average smartphone user engages with around 26 applications. According to a Gartner study, application usage tends to hit a plateau. Faced with such a plethora of applications, a chatbot is a dream assistant: a customizable, service-providing program. What's more, chatbots never tire of learning! With a chatbot you can book a taxi, call a doctor, pay invoices, and more, all in the simplest and most natural way.

It’s no wonder that according to a study published by Oracle – “Can Virtual Experience Replace Reality?” – 80% of brands will use chatbots for customer interactions by 2020.

Are chatbots expressive and empathetic?

You and I don't interact the same way with a human being as with a chatbot. "Social" barriers aside, a chatbot allows users to express themselves more candidly. The user often feels less "inhibited."

Expressiveness can also be demonstrated in the choice of language programmed in a chatbot, and, if embodied by an avatar, in its facial expressions. The attitudes of this embodied chatbot will be able to “magnify a personality trait” or exaggerate emotions and reactions, in order to strengthen its message.

It is in this precise context that image and sound become powerful. Text-only formats are subject to misinterpretation. Combining text with an expressive avatar and/or voice enriches the dialogue and strengthens the bond.

What if chatbots were granted a sixth sense?

We are not at the stage where chatbots can react as impulsively as humans, but nevertheless they are increasingly becoming more adaptive to the context of the situation.

Creating a chatbot with a wider analysis grid, a more comprehensive sensitivity spectrum, and additional sensors that bring it closer to being human is a real challenge. This is the focus chosen by the ARIA-VALUSPA project, which aims to develop the chatbot of tomorrow by offering more input channels, enabling it to better understand its user's state of mind.
This more “sensitive” understanding will allow chatbots to personalize dialogues by adapting answers and reactions to the context. Chatbots will express their “emotions” with sentences, attitudes or voice and thus demonstrate a perfect understanding of the situation. They will synchronize, just like us!

Input signals, including audio and video capture, are used to collect information. These two sensors can detect several aspects of the user's identity, stress level, and emotions. The dialogue can immediately become more personal and friendly.

These new chatbots are actually able to detect both verbal and nonverbal signals. They can identify the engagement of each of their users and thus adapt their pace or the amount of information they are sharing.
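The idea of adapting pace and information volume to detected engagement can be sketched very simply. In the snippet below, the cue names, fusion weights, and thresholds are invented for illustration; a real system would learn these from data rather than hard-code them.

```python
def engagement_score(gaze_on_agent, speech_energy, smile_prob):
    """Naive weighted fusion of nonverbal cues into a 0..1 engagement score.
    The weights are illustrative assumptions, not a trained model."""
    score = 0.4 * gaze_on_agent + 0.3 * speech_energy + 0.3 * smile_prob
    return max(0.0, min(1.0, score))  # clamp to [0, 1]

def adapt_response(answer_sentences, engagement):
    """Share less information when the user seems disengaged."""
    if engagement < 0.3:
        return answer_sentences[:1]   # keep it very short
    if engagement < 0.7:
        return answer_sentences[:2]   # moderate detail
    return answer_sentences           # fully engaged: give the full answer
```

The same score could equally drive nonverbal adaptation, such as the agent's speaking rate or expressiveness.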

It is a real sixth sense that will complement current chatbots' abilities. ARIA-VALUSPA's virtual assistant offers multimodal, social interactions by assessing audio and video signals perceived as dialogue input. The aim of this project is to create a synthesizer "giving life" to an adaptive, intelligent chatbot, embodied by an expressive, vocalized 3D avatar.

The ultimate ambition? Establishing a real relationship based on trust with the user thanks to this new form of “virtual empathy.”

Multi-modal recordings and semi-automated annotation in the ARIA-VALUSPA project

An important contribution of the ARIA-VALUSPA project is the NOXI database of mediated Novice-Expert interactions. It consists of 84 dyads recorded in 3 locations (Paris, Nottingham, and Augsburg), spoken in 7 languages (English, French, German, Spanish, Indonesian, Arabic and Italian). The aim of the endeavour is to collect data to study how humans exchange knowledge in a setting that is as close as possible to the intended human-agent setting of the project. Therefore, the interactions were mediated using large screens, cameras, and microphones. Expert/Novice pairs discussed 58 widely different topics, and an initial analysis of these interactions has already led to a design for the flow of the dialogue between the user and the ARIAs. In addition to information exchange, the dataset was used to collect data to let our agents learn how to classify and deal with 7 different types of interruptions. In total we collected more than 50 hours of synchronized audio, video, and depth data. The NOXI database was recorded with the aim of being of wide use, beyond the direct goals and aims of the ARIA-VALUSPA project. For example, we have included recordings of depth information using a Kinect. While the project itself will not use depth information, other researchers will likely find it useful. The recording system was implemented with the Social Signal Interpretation (SSI) framework. The database is hosted at

The value of a database depends highly on the availability of proper descriptions. Given the sheer volume of the data (> 50 h), a purely manual annotation is out of the question. Hence, strategies are needed to speed up the coding process in an automated way. To this end, a novel annotation tool, NOVA ((Non)verbal Annotator), was developed by the University of Augsburg. Conventional interfaces are usually limited to playing back audio and video streams, so a database like NOXI, which includes additional signals like skeleton and facial points, can only be partially viewed with them. An important feature of NOVA is its ability to visualize and describe arbitrary data streams recorded with SSI. The coding process for multimodal data depends on the phenomenon to describe. For example, we would prefer a discrete annotation scheme to label behaviour that can be classified into a set of categories (e.g. head nods and shakes), whereas variable dimensions like activation and evaluation are better handled on continuous tiers. For other tasks, like language transcriptions, which may include hundreds of individual words, we want to be able to assign labels with free text. To meet these different needs, NOVA supports both discrete and continuous annotation types. Additionally, it includes a database backend to store recordings on a central server and share annotations between different sites. In the future, NOVA will be extended with features to create collaborative annotations and to apply cooperative learning strategies out of the box.
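The distinction between discrete and continuous annotation types can be made concrete with two small data structures. This is a sketch of one possible representation, not NOVA's actual data model; field names and the nearest-sample lookup are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DiscreteLabel:
    """A labelled time segment, e.g. a head nod or a transcribed word."""
    start: float  # seconds
    end: float    # seconds
    label: str    # category name or free text

@dataclass
class ContinuousTier:
    """A sampled continuous dimension, e.g. activation in [-1, 1]."""
    sample_rate: float   # values per second
    values: List[float]

    def value_at(self, t: float) -> float:
        # Nearest-sample lookup, clamped to the end of the tier.
        idx = min(int(round(t * self.sample_rate)), len(self.values) - 1)
        return self.values[idx]
```

A discrete scheme assigns one label per segment, while a continuous tier holds a value at every sample point, which matches the two annotation styles described above.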

The above image gives an overview of the Cooperative Learning (CL) system currently being developed in ARIA-VALUSPA and integrated into NOVA. (A) A database is populated with new recordings of human interaction (or, alternatively, from an existing source). (B) NOVA functions as the interface to the data and sets up a MongoDB to distribute annotation tasks among human annotators. (C) At times, Cooperative Learning can be applied to automatically complete unfinished fractions of the database. Here, we propose a two-fold strategy (bottom right box): (I) a session-dependent model is trained on a partly annotated session and applied to complete it; (II) a pool of annotated sessions is used to train a session-independent model and predict labels for the remaining sessions. In both cases, confidence values guide the revision of predicted segments (here marked with a colour gradient). To test the usefulness of the CL approach, we ran experiments on the NOXI database, which show that labelling effort can be significantly reduced in this way. A joint publication is currently under review.
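The confidence-guided part of the strategy can be sketched as a simple triage loop: segments the model labels with high confidence are accepted automatically, while the rest are routed back to human annotators. The threshold, the toy model, and all names below are hypothetical; the actual system is integrated into NOVA.

```python
def cooperative_complete(model, segments, threshold=0.8):
    """Predict labels for unannotated segments; low-confidence predictions
    are flagged for human revision (confidence-guided triage sketch)."""
    auto_labelled, to_review = [], []
    for seg in segments:
        label, confidence = model.predict(seg)
        if confidence >= threshold:
            auto_labelled.append((seg, label))
        else:
            to_review.append((seg, label, confidence))
    return auto_labelled, to_review

class ToyModel:
    """Stand-in for a session-dependent classifier (hypothetical)."""
    def predict(self, seg):
        # Pretend confidence is high for longer segments.
        return ("speech", 0.9 if seg["dur"] > 1.0 else 0.5)

auto, review = cooperative_complete(
    ToyModel(), [{"dur": 2.0}, {"dur": 0.5}]
)
```

Only the low-confidence tail goes back to the annotator, which is how the approach reduces manual labelling effort while keeping humans in the loop.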