Deep Multimodal Learning for Computer Vision

NExT Forum: Multimodal Foundation Models

【Speaker】 Dr. Shang-Hong Lai | Associate Dean, College of Electrical Engineering and Computer Science, National Tsing Hua University

【Title】 Deep Multimodal Learning for Computer Vision

【Abstract】 Humans sense the world through multimodal inputs such as vision, audio, text, and haptics. Each modality describes objects or scenes through its own characteristic representation. Deep multimodal learning has attracted increasing attention because learning from data of different modalities can boost the accuracy and robustness of deep neural network systems. In this talk, I will present some examples of computer vision applications that employ deep multimodal learning. In the first part, I will introduce multimodal learning approaches for training face recognition systems to enhance recognition accuracy. In the second part, I will present the text-image co-training approach that has been used to develop accurate image captioning and intelligent document understanding. Finally, I will briefly describe recent action recognition research that takes advantage of pre-trained visual-language models to achieve zero-shot learning for spatio-temporal action detection.


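Organizer: Hon Hai Research Institute | Co-organizers: Artificial Intelligence Foundation; Joint Research Center for AI Technology and All Vista Healthcare, National Taiwan University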