Abstract: Human communication consists of multimodal signals, such as speech and gestures. These signals are not always aligned, making it difficult for artificial systems (e.g. robots) to find the ...