MMAD: Multi-modal Movie Audio Description

Xiaojun Ye1, Junhao Chen2, Xiang Li3, Haidong Xin4, Chao Li5, Sheng Zhou1†, Jiajun Bu1

1Zhejiang University 2Tsinghua University 3Peking University 4Northeastern University 5Harbin Engineering University
† Indicates Corresponding Author

Accepted at LREC-COLING 2024

Audio Description (AD) aims to narrate movie content that cannot be perceived through hearing alone, helping visually impaired audiences follow film narratives. Current solutions rely heavily on manual work, resulting in high costs and limited scalability. While automatic methods have been introduced, they often yield descriptions that are sparse and omit key details. To address these challenges, we propose a novel automated pipeline, the Multi-modal Movie Audio Description (MMAD). MMAD combines three key modules with Llama2 to increase the depth and breadth of the generated descriptions. First, we propose the Audio-aware Feature Enhancing Module, which gives the model multi-modal perception capabilities and enriches background descriptions with a more comprehensive understanding of environmental features. Second, we propose the Actor-tracking-aware Story Linking Module, which supports the generation of contextual, character-centric descriptions and thereby enriches character depictions. Third, we incorporate the Subtitled Movie Clip Contextual Alignment Module, which supplies semantic information about different time periods throughout the movie, so that the full narrative context can be taken into account when describing silent segments. Experiments on widely used datasets demonstrate that MMAD significantly surpasses existing strong baselines, establishing a new state of the art in the field.

Demo 1


The talented pianist, 1900, mesmerized the audience with his virtuosic performance of "Christmas Eve" while wearing a pristine white tuxedo and bow tie.

Demo 2


Chris Gardner, a man with a box in his hand, runs frantically through the city, dodging people and cars while being chased by a taxi driver who is honking.

Demo 3


Dancing in the rain, Don Lockwood twirls with joy, umbrella in hand, amidst city streets.

Demo 4


Alice fled through the mushroom forest, her heart racing as the Bandersnatch's ominous hisses and growls echoed behind her.

Method


Overview of MMAD: MMAD consists of multiple modality encoders whose outputs are used to generate movie narration.
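
As a rough illustration of how evidence from several modalities might be brought together before narration is generated, the sketch below merges three evidence streams into a single text prompt. This page does not specify how MMAD passes module outputs to Llama2, so everything here is an assumption made for illustration: the `ClipEvidence` container, its field names, the `build_prompt` helper, and the prompt wording are all hypothetical, and no actual encoder or language model is called.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClipEvidence:
    """Hypothetical container for per-clip evidence (not the authors' data structure)."""
    # Stand-in for output of the Audio-aware Feature Enhancing Module.
    audio_cues: List[str] = field(default_factory=list)
    # Stand-in for output of the Actor-tracking-aware Story Linking Module.
    characters: List[str] = field(default_factory=list)
    # Stand-in for output of the Subtitled Movie Clip Contextual Alignment Module.
    subtitle_context: List[str] = field(default_factory=list)


def build_prompt(evidence: ClipEvidence) -> str:
    """Merge the three evidence streams into one text prompt for a language model.

    Assumption: evidence is passed as plain text. The real system may instead
    use learned feature embeddings.
    """
    sections = [
        ("Background sounds", evidence.audio_cues),
        ("Characters on screen", evidence.characters),
        ("Nearby dialogue", evidence.subtitle_context),
    ]
    lines = ["Describe the following silent movie segment in one sentence."]
    for title, items in sections:
        if items:  # skip modalities that produced nothing for this clip
            lines.append(f"{title}: " + "; ".join(items))
    return "\n".join(lines)


if __name__ == "__main__":
    demo = ClipEvidence(
        audio_cues=["rain", "footsteps"],
        characters=["Don Lockwood"],
        subtitle_context=["(no dialogue in this segment)"],
    )
    print(build_prompt(demo))
```

Keeping each modality in its own labelled section means a clip with no detected characters or no nearby subtitles still yields a well-formed prompt, which is one simple way to handle missing evidence.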



Citation


@inproceedings{ye2024mmad,
  title={MMAD: Multi-modal Movie Audio Description},
  author={Ye, Xiaojun and Chen, Junhao and Li, Xiang and Xin, Haidong and Li, Chao and Zhou, Sheng and Bu, Jiajun},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  pages={11415--11428},
  year={2024}
}