Aria Valuspa

Artificial Retrieval of Information Assistants – Virtual Agents with Linguistic Understanding, Social skills, and Personalised Aspects

New facial point localisation system developed in ARIA-VALUSPA

As part of the visual component of ARIA-VALUSPA, a new facial point localisation system has been developed. It builds on a novel theoretical framework and attains state-of-the-art results on annotated benchmarks.

The facial point localisation system is the very first step in the visual pipeline of ARIA: it tells the system where the user’s face is, and where the main facial parts, such as the mouth, the eyes, and the eyebrows, are located. These regions are then processed further to detect subtle muscle movements, known as Action Units, which are directly linked to the user’s emotions.

The problem of facial point localisation has been a research topic for the last 20 years, but it was in 2013, with the appearance of Cascaded Regression, that the field experienced a major breakthrough. Cascaded Regression builds on linear regression and consists of a set of models that progressively refine an initial guess of where the points might be towards where they should be. Each model (learnt by linear regression) takes some local information from a given image, around the input points, and estimates where the points should move to be better located. This new estimate is then forwarded to the subsequent model (hence the cascade).
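The cascade described above can be sketched in a few lines. The following is a minimal illustrative Python sketch, not the project’s code: the raw-pixel feature extractor, the patch size, and the regressor shapes are simplifying assumptions (real systems use richer local descriptors).

```python
import numpy as np

def extract_local_features(image, points, patch_size=8):
    """Toy local descriptor: raw pixel patches around each point, flattened.
    Real trackers use richer descriptors (e.g. SIFT or HOG)."""
    feats = []
    for (x, y) in points.astype(int):
        patch = image[y - patch_size:y + patch_size,
                      x - patch_size:x + patch_size]
        feats.append(patch.ravel())
    return np.concatenate(feats)

def cascaded_regression(image, initial_points, regressors):
    """Apply each linear regressor (R, b) in turn: extract local features
    around the current point estimates, predict a displacement, move the
    points, and pass the refined estimate to the next level of the cascade."""
    points = initial_points.copy()
    for R, b in regressors:
        phi = extract_local_features(image, points)
        delta = R @ phi + b               # predicted displacement of all points
        points = points + delta.reshape(points.shape)
    return points
```

Each `(R, b)` pair is one level of the cascade; training learns one such linear model per level on the residual displacements left by the previous levels.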

However, some flaws cannot be directly tackled by existing Cascaded Regression methods, the most important being the inability to perform incremental learning in real time. The goal of incremental learning is to incorporate, into each of the models, all available information about the subject being tracked, in order to reinforce them for any future frame that might require such information. Research in the field had clearly shown its importance, but none of the existing Cascaded Regression methods could do it in real time.

Thus, as part of the research conducted by the University of Nottingham team, a new method, coined Continuous Regression, was proposed to replace traditional linear regression.

Briefly speaking, Continuous Regression is a mathematical solution that applies concepts from Functional Regression and makes it possible to incorporate the infinitely many points surrounding the true locations into each of the models. Aside from its theoretical contributions, Continuous Regression enabled, for the first time, the use of incremental learning in the context of Cascaded Regression.
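To illustrate the incremental-learning idea in isolation, here is a sketch of a generic recursive least-squares update in Python. This is not the actual iCCR derivation (which follows from the Continuous Regression framework); it only shows how a linear regressor can absorb one new sample at a time, at constant cost per frame, instead of being retrained from scratch. All names are illustrative.

```python
import numpy as np

class IncrementalLinearRegressor:
    """Recursive least-squares sketch: the regressor R = C @ P is refined
    sample by sample. P tracks the inverse feature covariance via a
    Sherman-Morrison rank-1 update, so no matrix inversion is needed online."""

    def __init__(self, dim_in, dim_out, reg=1e-3):
        self.P = np.eye(dim_in) / reg          # inverse (regularised) covariance
        self.C = np.zeros((dim_out, dim_in))   # target/feature cross-covariance

    def update(self, phi, delta):
        """Incorporate one (features, target displacement) pair."""
        Pphi = self.P @ phi
        k = 1.0 / (1.0 + phi @ Pphi)
        self.P -= k * np.outer(Pphi, Pphi)     # Sherman-Morrison rank-1 update
        self.C += np.outer(delta, phi)

    def predict(self, phi):
        return (self.C @ self.P) @ phi
```

Because each `update` costs only a matrix-vector product rather than a full retraining pass, this style of update is what makes per-frame adaptation feasible at tracking frame rates.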

The proposed incremental Cascaded Continuous Regression (iCCR) is the first tracker to date to incorporate incremental learning in real time, attaining state-of-the-art results on an extensive annotated dataset. Its impressive results were recently published at the 14th European Conference on Computer Vision (ECCV’16), one of the top-tier conferences in Computer Vision.

The system has been integrated into AVP 2.1, where it is part of eMax, the main component of the visual system of the ARIA-Framework, and runs at more than 30 fps. In addition, a dedicated website gathering the research conducted on Continuous Regression has been launched, where the associated papers are accompanied by MATLAB code, allowing the tracker to be used as a tool and enabling further research on still open topics.

Aria Valuspa and KRISTINA at the joint Dagstuhl seminar 2017

At the joint seminar at the Leibniz Center for Informatics, Dagstuhl Castle, Wadern, the ARIA-VALUSPA project met up with the EU project KRISTINA to exchange insights, progress, and promising approaches, and to forge new (research) friendships.
The Leibniz Center for Informatics is located in the south-west of Germany and offers a very relaxed and secluded setting where computer scientists can come together to discuss and work on their research.

The KRISTINA project aims to develop technologies for a human-like, socially competent, and communicative agent. It runs on mobile communication devices and serves migrants facing language and cultural barriers in their host country. The agent will act as a trusted information provider and mediator for questions related to basic care and healthcare for migrants.

The two projects clearly share many similarities, which allows both teams to learn from each other. During discussions, however, they identified some interesting differences. The ARIA project is working on a natural interaction system that focuses more on “social banter”, such as real-time interruptions. These interruptions can come from the user (the user starts talking while the agent is speaking) or from the agent (the agent starts talking while the user is speaking). By making the agent respond to interruptions in an appropriate manner, the hope is to create the atmosphere of a more natural “small talk”. The KRISTINA project is working on a task-based system that relies more on information retrieval and transfer. A migrant can use their system to ask questions about their new home country, for example, how healthcare is organised.

During the joint seminar, both teams presented demonstrations of their systems. A notable demonstration for the ARIA project was given by Angelo Cafaro, who showed how the virtual agent handles user interruptions, realised at the behaviour generation level. For the KRISTINA project, Dominik Schiller demonstrated how their agent can react empathically to a depressed user. The ARIA-VALUSPA team was very interested to understand how the KRISTINA system works. Yet the most impressive demo was given by Gerard (KRISTINA) and Angelo (ARIA-VALUSPA), who managed to connect the Greta platform from the ARIA project to the agent web interface from the KRISTINA project in about an hour. This emphasised the importance and impact of standards (i.e., FML and BML)!

Additionally, there was an interesting invited keynote by Patrick Gebhard from the DFKI lab in Saarbrücken. He detailed the lab’s work on Virtual Scene Maker, which can be used for designing real-time interactive systems. The demonstration of the system gave an insight into the impressive ease of configuration in Virtual Scene Maker.

ARIA – VALUSPA Platform 2.0 released

2017 started off well for the ARIA-VALUSPA project with the release of the ARIA-VALUSPA Platform 2.0 (AVP 2.0), the second public release of the integrated behaviour analysis, dialogue management, and behaviour generation components developed as part of this EU Horizon 2020 project. The integrated Virtual Human framework allows anyone to build their own Virtual Human scenarios. Students are already using the framework to build a face recognition system that includes liveness detection, or to administer questionnaires in a much more natural manner than asking people to fill in forms.

The AVP 2.0 can be downloaded from GitHub. It comes with installation instructions, and a tutorial for running the default interaction scenario.

The behaviour analysis component of AVP 2.0 comes with integrated Automatic Speech Recognition in English, valence and arousal detection from audio, recognition of the six basic emotions from video, face tracking, head-pose estimation, and age, gender, and language estimation from audio.

The default scenario of the dialogue manager presents Alice, the main character of the book ‘Alice’s Adventures in Wonderland’ by Lewis Carroll. You can ask her questions about herself, the book, and the author. You can of course also create your own scenarios, and we’ve created a tutorial with three different scenarios specifically aimed at getting new users started with making their own systems.

The behaviour generation components come with emotional TTS created by CereProc and visual behaviour generation using the GRETA system. It uses standard FML and BML, and features lip-synched utterances. The behaviour generation component has a unique feature that allows an ongoing animation to be stopped, so the agent can be interrupted by a user, which makes interactions with it much more natural.

Another unique feature is the ability to record your interactions with the Virtual Humans. The framework stores raw audio and video, but also all predictions made by the analysis system (ASR, expressions, etc.), and in the near future it will also store all Dialogue Management and Behaviour Generation decisions, allowing you to replay the whole interaction. To simplify inspection and replay of an interaction, seamless integration with the NoVa annotation tool is supported. NoVa is the new annotation tool developed as part of ARIA-VALUSPA to address shortcomings of existing multimedia annotation tools, as well as to provide integrated support for cooperative learning.

While the ARIA-VALUSPA Platform 2.0 presents a major step forward in Virtual Human technology, we are always looking for ways to improve the system. Feature requests, bug reports, and any other suggestions can be logged through the GitHub issues tracker.

—————————————————————— Update ————————————————————————————

08. February 2017

A minor update to the ARIA-VALUSPA Platform for Virtual Humans has been released (AVP 2.1), containing mostly improved support for interruptions, logging of dialogue management actions, faster face tracking, and some bug fixes. Full release notes can be found here.

—————————————————————— Update ————————————————————————————
13. April 2017
An update to the ARIA-VALUSPA Platform for Virtual Humans has been released (AVP 2.2). Full release notes can be found here!

Expressive Speech Synthesis and Affective Information Retrieval

Human communication is rich, varied, and often ambiguous. This reflects the complexity and subjectivity of our lives. For thousands of years, art, music, drama, and storytelling have helped us understand, come to terms with, and express the complexities of our existential experience. Technology has long played a pivotal role in this artistic process, for example the role of optics in the development of perspective in painting and drawing, or the effect of film on storytelling.

Information Technology has had, and is having, an unprecedented impact both on our experience of life and our means of interpreting this experience. However, the ability to harness this technology to help us understand, come to terms with, and mediate the explosion of electronic data and electronic communication that now exists is generally limited to the mundane. Whereas retrieving the height of Everest in metres is a trivial search request (8,848 m, by the way, from a Google search), googling the question ‘What is love?’ returns (in the top four) two popular newspaper articles, a YouTube video of Haddaway, and a dating site. It is, of course, an unfair comparison. Google is not designed to offer responses to ambiguous questions with no definite answers. In contrast, traditional forms of art and artistic narrative have done so for centuries.

We might expect speech and language technology, dealing as it does with such a central form of human communication, to be at the forefront of applying technology to the interpretation of our ambiguous and multi-layered experience. In fact, much of the work in this area has avoided ambiguity and is often used as a tool to disambiguate information rather than as a means to interpret ambiguity. Take, for example, conversational agents (CAs): these are computer programs which allow you to speak to a device and which respond using computer-generated speech. Such systems could potentially harness the nuances of language and the ambiguity of emotional expression. In reality, however, we use them to ask how high Everest is or where to find a nearby pizza restaurant. Although the ability to deal with these requests is important if you are writing an assignment about Everest, or wanting to eat pizza, it raises the question of how we might extend such systems to help us interpret more complex aspects of the world around us. It is important for this technology to strive to do so for two fundamental reasons: firstly, technology has become part of our social life, and as such it needs to be able to engender playfulness and enrich our sense of experience; and secondly, applications which could perform a key role in mediating technology for social good require a means of interacting with users in much more complex social and cultural situations.

Conversation has a tradition as a pastime, as a means of humour, and as a means of helping people with their problems. However, the scope for artificial conversational agents to perform these activities is currently severely limited. In ARIA-VALUSPA we explore approaches that give conversational agents more subtle means of communicating, of becoming more playful, and of representing the ambiguity in our social experience.

The technology required to do this demands close collaboration with engineers working on dialogue. CereProc Ltd, a key partner in ARIA-VALUSPA, is very active in developing techniques to make artificial voices (a field termed speech synthesis or text-to-speech synthesis, TTS) more emotional, expressive, and characterful. These techniques include changing voice quality, for example making a voice sound stressed or calm; adding vocal gestures like sighs and laughs; changing the emphasis from one word to another to alter the subtle meaning of a sentence; changing the rate of speech and the intonation to change how active the voice sounds; and even just making sure the voice doesn’t always say the word ‘yes’ the same way every time it says it.

This key work on the way speech is produced can then be used to alter the perceived character of a conversational agent, or convey an internal state.

So perhaps, in the future, when we ask a system ‘What is love?’, its wistful voice will hint at past romance lost and a sense of longing for human connection, and help us find answers which are not fixed, but emotional, and which depend on the different experiences we share as human beings.

Gartner identifies digital assistants as key to developing digital business opportunities

In its Top 10 strategic technology trends for 2016 and beyond, Gartner identified artificial intelligence and digital assistants as key to developing digital business opportunities. The market for Intelligent Virtual Assistants (IVAs) is growing considerably. Having simple tools to create IVAs is a priority for companies that want a dedicated virtual assistant they can easily modify and maintain independently. A leading-edge platform that automates the process of IVA creation, improvement, and maintenance has been developed by ARIA-VALUSPA’s partner Living Actor™.
Within the ARIA-VALUSPA project, Living Actor™ explores the contribution of emotions to human–machine interactions, focusing on animated avatars. More information can be found here.


A reporter from WDR (West German Broadcasting, Cologne) visited the Lab for Human-Centered Multimedia at Augsburg University, where Johannes Wagner presented the Alice agent to him. Alice does not just answer the user’s questions about “Alice in Wonderland”, but also responds to his or her emotions by analyzing voice and facial expressions. The synchronization of the multimodal signals is done by the award-winning SSI system developed by Augsburg University.