Neurosymbolic Multimodal Intelligence
Introduction
- Multimodal Intelligence: Representation Learning, Information Fusion, and Applications by Chao Zhang, Zichao Yang, Xiaodong He and Li Deng
- Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods by Aditya Mogadala, Marimuthu Kalimuthu and Dietrich Klakow
- Multimodal Machine Learning: A Survey and Taxonomy by Tadas Baltrušaitis, Chaitanya Ahuja and Louis-Philippe Morency
- Deep Audio-visual Learning: A Survey by Hao Zhu, Mandi Luo, Rui Wang, Aihua Zheng and Ran He
Image Description
- A Survey on Automatic Image Caption Generation by Shuang Bai and Shan An
- A Comprehensive Survey of Deep Learning for Image Captioning by M. D. Zakir Hossain, Ferdous Sohel, Mohd F. Shiratuddin and Hamid Laga
- Show and Tell: A Neural Image Caption Generator by Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention by Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel and Yoshua Bengio
- Unifying Visual-semantic Embeddings with Multimodal Neural Language Models by Ryan Kiros, Ruslan Salakhutdinov and Richard S. Zemel
- Deep Visual-semantic Alignments for Generating Image Descriptions by Andrej Karpathy and Li Fei-Fei
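The attention papers above share a common core: at each decoding step, the caption model scores the CNN's spatial annotation vectors against its hidden state and consumes their weighted sum as context. Below is a minimal PyTorch sketch of that soft attention step in the style of Show, Attend and Tell; module names and dimensions are illustrative, not taken from any released implementation.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Additive soft attention over CNN annotation vectors (illustrative
    sketch in the style of Show, Attend and Tell)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, regions, feat_dim) spatial CNN features
        # hidden: (batch, hidden_dim) decoder LSTM state
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)       # attention weights over regions
        context = (alpha * feats).sum(dim=1)  # expected annotation vector
        return context, alpha.squeeze(-1)
```

The decoder feeds `context`, together with the previous word embedding, into the LSTM at every step; the returned `alpha` is what the papers visualize as attention maps.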
Visual Question Answering
- VQA: Visual Question Answering by Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick and Devi Parikh
- CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning by Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick and Ross Girshick
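The VQA paper's joint-embedding baseline remains a useful mental model for the task: encode the question with an LSTM, project the image features into the same space, fuse by elementwise product, and classify over a fixed answer vocabulary. A rough sketch under those assumptions, with hypothetical layer sizes:

```python
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    """Joint-embedding VQA baseline sketch: elementwise-product fusion of
    image and question encodings, then a classifier over fixed answers."""
    def __init__(self, img_dim, vocab_size, embed_dim, hidden_dim, n_answers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, n_answers)

    def forward(self, img_feats, question_tokens):
        # img_feats: (batch, img_dim); question_tokens: (batch, seq_len) ints
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                 # question encoding
        v = torch.tanh(self.img_proj(img_feats))  # image encoding
        return self.classifier(q * v)             # elementwise-product fusion
```

CLEVR exists precisely because models of this shape can exploit dataset biases; the reasoning section below collects architectures built to do better.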
Reasoning
- Neural Module Networks by Jacob Andreas, Marcus Rohrbach, Trevor Darrell and Dan Klein
- A Simple Neural Network Module for Relational Reasoning by Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia and Timothy Lillicrap
- Inferring and Executing Programs for Visual Reasoning by Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick and Ross Girshick
- Neural-symbolic VQA: Disentangling Reasoning from Vision and Language Understanding by Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli and Joshua B. Tenenbaum
- FiLM: Visual Reasoning with a General Conditioning Layer by Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin and Aaron Courville
- Compositional Attention Networks for Machine Reasoning by Drew A. Hudson and Christopher D. Manning
- Explainable Neural Computation via Stack Neural Module Networks by Ronghang Hu, Jacob Andreas, Trevor Darrell and Kate Saenko
- Transparency by Design: Closing the Gap between Performance and Interpretability in Visual Reasoning by David Mascharka, Philip Tran, Ryan Soklaski and Arjun Majumdar
- The Neuro-symbolic Concept Learner: Interpreting Scenes, Words, and Sentences from Natural Supervision by Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum and Jiajun Wu
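Several of these systems (Inferring and Executing Programs, Neural-symbolic VQA, the Neuro-symbolic Concept Learner) share one decomposition: a parser maps the question to a program, and the program is executed over a structured scene representation. The toy executor below shows only the symbolic half, on a hand-written scene; in the actual systems a perception network produces the scene and a sequence model produces the program. All names and attributes here are illustrative.

```python
# Hand-written stand-in for the output of a scene-parsing network.
scene = [
    {"shape": "cube",   "color": "red",  "size": "large"},
    {"shape": "sphere", "color": "blue", "size": "small"},
    {"shape": "cube",   "color": "blue", "size": "small"},
]

def filter_attr(objs, attr, value):
    """Keep objects whose attribute matches (one symbolic module)."""
    return [o for o in objs if o[attr] == value]

def count(objs):
    """Terminal module: answer with the number of remaining objects."""
    return len(objs)

MODULES = {"filter_attr": filter_attr, "count": count}

# Stand-in for a parsed program for "How many blue cubes are there?"
program = [("filter_attr", "color", "blue"),
           ("filter_attr", "shape", "cube"),
           ("count",)]

state = scene
for op, *args in program:
    state = MODULES[op](state, *args)
print(state)  # -> 1
```

Because every intermediate `state` is an explicit object list, the execution trace doubles as an explanation, which is the transparency argument these papers make.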
Knowledge
- Improving Question Answering with External Knowledge by Xiaoman Pan, Kai Sun, Dian Yu, Jianshu Chen, Heng Ji, Claire Cardie and Dong Yu
- KVQA: Knowledge-aware Visual Question Answering by Sanket Shah, Anand Mishra, Naganand Yadati and Partha P. Talukdar
- OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge by Kenneth Marino, Mohammad Rastegari, Ali Farhadi and Roozbeh Mottaghi
- Natural Language QA Approaches Using Reasoning with External Knowledge by Chitta Baral, Pratyay Banerjee, Kuntal Kumar Pal and Arindam Mitra
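A common pipeline shape in this line of work inserts a retrieval step between the question and the answering model: embed the question, rank the external facts, and condition the answerer on the top hits. A generic NumPy sketch of the ranking step, not any single paper's method, assuming the embeddings already exist:

```python
import numpy as np

def retrieve_facts(question_vec, fact_vecs, facts, k=3):
    """Rank knowledge-base facts by cosine similarity to the question.
    question_vec: (d,), fact_vecs: (n, d), facts: n human-readable strings.
    Illustrative only; real systems also use entity linking and graph queries."""
    q = question_vec / np.linalg.norm(question_vec)
    F = fact_vecs / np.linalg.norm(fact_vecs, axis=1, keepdims=True)
    scores = F @ q
    top = np.argsort(-scores)[:k]
    return [(facts[i], float(scores[i])) for i in top]
```

KVQA and OK-VQA differ mainly in what fills `facts`: entity-centric world knowledge in the former, open-ended commonsense and web knowledge in the latter.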
Scene Graph Generation
- Explainable and Explicit Visual Reasoning over Scene Graphs by Jiaxin Shi, Hanwang Zhang and Juanzi Li
- An Empirical Study on Leveraging Scene Graphs for Visual Question Answering by Cheng Zhang, Wei-Lun Chao and Dong Xuan
- Neural Motifs: Scene Graph Parsing with Global Context by Rowan Zellers, Mark Yatskar, Sam Thomson and Yejin Choi
- Scene Graph Generation with External Knowledge and Image Reconstruction by Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai and Mingyang Ling
- Differentiable Scene Graphs by Moshiko Raboh, Roei Herzig, Jonathan Berant, Gal Chechik and Amir Globerson
- Attentive Relational Networks for Mapping Images to Scene Graphs by Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang and Jiebo Luo
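Stripped of the models, a scene graph is a small structure: detected objects as nodes and predicted relationships as (subject, predicate, object) triples. A toy example of the representation these generation methods output, with a hypothetical helper:

```python
# A scene graph for "a man wearing a hat rides a horse":
# nodes are detected objects, edges are predicted relationship triples.
objects = ["man", "horse", "hat"]
triples = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
]

def neighbors(graph, node):
    """Objects related to `node`, with the connecting predicate."""
    return ([(p, o) for s, p, o in graph if s == node]
            + [(p, s) for s, p, o in graph if o == node])

print(neighbors(triples, "man"))  # [('riding', 'horse'), ('wearing', 'hat')]
```

The papers above differ in how the triples are predicted (global context in Neural Motifs, external knowledge in Gu et al., soft continuous graphs in Differentiable Scene Graphs), but this is the shared target structure.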
Image Transformation
- Image Super-resolution by Neural Texture Transfer by Zhifei Zhang, Zhaowen Wang, Zhe Lin and Hairong Qi
- Image Style Transfer Using Convolutional Neural Networks by Leon A. Gatys, Alexander S. Ecker and Matthias Bethge
- High-resolution Image Synthesis and Semantic Manipulation with Conditional GANs by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz and Bryan Catanzaro
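In Gatys et al., style is represented by Gram matrices of CNN feature maps, and the style loss matches those matrices between the generated and style images. A short sketch of that loss; implementations fold the paper's normalization constants in different places, so the scaling here is one common choice rather than the canonical one:

```python
import torch

def gram_matrix(feats):
    """Gram matrix of CNN feature maps: channel-by-channel correlations.
    feats: (batch, channels, height, width)."""
    b, c, h, w = feats.shape
    f = feats.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)  # one common normalization

def style_loss(feats_generated, feats_style):
    """Mean squared difference of Gram matrices at one CNN layer."""
    return torch.mean((gram_matrix(feats_generated)
                       - gram_matrix(feats_style)) ** 2)
```

The full objective sums this over several VGG layers and adds a content loss on raw feature maps; optimization runs over the pixels of the generated image.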
Image Generation
- Visualizing Natural Language Descriptions: A Survey by Kaveh Hassani and Won-Sook Lee
- Generating Images from Captions with Attention by Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba and Ruslan Salakhutdinov
- Generative Adversarial Text to Image Synthesis by Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele and Honglak Lee
- StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks by Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang and Dimitris N. Metaxas
- Parallel Multiscale Autoregressive Density Estimation by Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio G. Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov and Nando de Freitas
- Scene Graph Generation with External Knowledge and Image Reconstruction by Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai and Mingyang Ling
- Image Generation from Scene Graphs by Justin Johnson, Agrim Gupta and Li Fei-Fei
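In Generative Adversarial Text to Image Synthesis, conditioning is mechanically simple: a sentence embedding is compressed and concatenated with the noise vector before generation, and the discriminator sees the text as well; StackGAN stacks the same idea across resolutions. A stripped-down generator sketch, with a plain MLP standing in for the deconvolutional stack and every size hypothetical:

```python
import torch
import torch.nn as nn

class TextConditionalGenerator(nn.Module):
    """Text-conditioned generator sketch in the spirit of Reed et al.
    The MLP body is a placeholder; real models use deconvolution stacks."""
    def __init__(self, noise_dim=100, text_dim=1024, proj_dim=128,
                 img_dim=64 * 64 * 3):
        super().__init__()
        # Compress the sentence embedding before conditioning, as in the paper.
        self.text_proj = nn.Sequential(nn.Linear(text_dim, proj_dim),
                                       nn.LeakyReLU(0.2))
        self.net = nn.Sequential(
            nn.Linear(noise_dim + proj_dim, 512), nn.ReLU(),
            nn.Linear(512, img_dim), nn.Tanh(),  # pixels in [-1, 1]
        )

    def forward(self, z, text_embedding):
        t = self.text_proj(text_embedding)
        return self.net(torch.cat([z, t], dim=1))
```

Image Generation from Scene Graphs swaps the sentence embedding for a graph-convolution encoding of a scene graph, but the conditioning pattern is the same.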
Video Description
- Video Description: A Survey of Methods, Datasets, and Evaluation Metrics by Nayyer Aafaq, Ajmal Mian, Wei Liu, Syed Z. Gilani and Mubarak Shah
- Grounded Video Description by Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso and Marcus Rohrbach
- Translating Video Content to Natural Language Descriptions by Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal and Bernt Schiele
- Translating Videos to Natural Language using Deep Recurrent Neural Networks by Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney and Kate Saenko
Video Question Answering
- TGIF-QA: Toward Spatio-temporal Reasoning in Visual Question Answering by Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim and Gunhee Kim
- MarioQA: Answering Questions by Watching Gameplay Videos by Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung and Bohyung Han
- TVQA: Localized, Compositional Video Question Answering by Jie Lei, Licheng Yu, Mohit Bansal and Tamara L. Berg
- Unifying the Video and Question Attentions for Open-ended Video Question Answering by Hongyang Xue, Zhou Zhao and Deng Cai
- CLEVRER: Collision Events for Video Representation and Reasoning by Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba and Joshua B. Tenenbaum
- Explore Multi-step Reasoning in Video Question Answering by Xiaomeng Song, Yucheng Shi, Xin Chen and Yahong Han
- Video Question Answering via Hierarchical Spatio-temporal Attention Networks by Zhou Zhao, Qifan Yang, Deng Cai, Xiaofei He and Yueting Zhuang
- Uncovering the Temporal Context for Video Question Answering by Linchao Zhu, Zhongwen Xu, Yi Yang and Alexander G. Hauptmann
- Multi-turn Video Question Answering via Multi-stream Hierarchical Attention Context Network by Zhou Zhao, Xinghua Jiang, Deng Cai, Jun Xiao, Xiaofei He and Shiliang Pu
Video Transformation
- Video Super Resolution Based on Deep Learning: A Comprehensive Survey by Hongying Liu, Zhubo Ruan, Peng Zhao, Fanhua Shang, Linlin Yang and Yuanyuan Liu
- Learning Temporal Coherence via Self-supervision for GAN-based Video Generation by Mengyu Chu, You Xie, Jonas Mayer, Laura Leal-Taixé and Nils Thuerey
- Artistic Style Transfer for Videos by Manuel Ruder, Alexey Dosovitskiy and Thomas Brox
- Real-time Neural Style Transfer for Videos by Haozhi Huang, Hao Wang, Wenhan Luo, Lin Ma, Wenhao Jiang, Xiaolong Zhu, Zhifeng Li and Wei Liu
- Video-to-video Synthesis by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz and Bryan Catanzaro
- World-consistent Video-to-video Synthesis by Arun Mallya, Ting-Chun Wang, Karan Sapra and Ming-Yu Liu
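The recurring ingredient in the video style transfer papers is a temporal consistency loss: the current output frame is penalized for deviating from the previous output warped forward by optical flow, with a mask zeroing out occluded or unreliably warped pixels. A one-function sketch; flow estimation and warping are assumed to happen elsewhere:

```python
import torch

def temporal_consistency_loss(frame_t, warped_frame_prev, valid_mask):
    """Penalize flicker between consecutive stylized frames.
    frame_t, warped_frame_prev: (batch, 3, H, W); valid_mask: (batch, 1, H, W)
    with zeros at occlusions and flow failures. Illustrative sketch only."""
    return torch.mean(valid_mask * (frame_t - warped_frame_prev) ** 2)
```

Video-to-video Synthesis moves flow-based warping inside the generator itself, and the world-consistent follow-up adds guidance images so that distant frames agree, not just adjacent ones.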
Video Generation
- Conditional GAN with Discriminative Filter Generation for Text-to-video Synthesis (also circulated as TFGAN: Improving Conditioning for Text-to-video Synthesis) by Yogesh Balaji, Martin Renqiang Min, Bing Bai, Rama Chellappa and Hans P. Graf
- Cross-modal Dual Learning for Sentence-to-video Generation by Yue Liu, Xin Wang, Yitian Yuan and Wenwu Zhu
Audio Description
- Automated Audio Captioning with Recurrent Neural Networks by Konstantinos Drossos, Sharath Adavanne and Tuomas Virtanen
- Neural Audio Captioning Based on Conditional Sequence-to-sequence Model by Shota Ikawa and Kunio Kashino
- Audio Caption: Listen and Tell by Mengyue Wu, Heinrich Dinkel and Kai Yu
Audio Generation
- WaveNet: A Generative Model for Raw Audio by Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior and Koray Kavukcuoglu
- Parallel WaveNet: Fast High-fidelity Speech Synthesis by Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov and Demis Hassabis
- Efficient Neural Audio Synthesis by Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman and Koray Kavukcuoglu
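WaveNet's building block is a dilated causal convolution with the gated activation tanh(W_f * x) ⊙ σ(W_g * x), wrapped in a residual connection; stacking blocks with growing dilation gives the long receptive field. A single block in PyTorch, with the skip-connection path omitted for brevity and all names illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveNetBlock(nn.Module):
    """One residual block of a WaveNet-style stack (illustrative sketch)."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.left_pad = dilation  # (kernel_size - 1) * dilation keeps it causal
        self.filter = nn.Conv1d(channels, channels, kernel_size=2,
                                dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size=2,
                              dilation=dilation)
        self.residual = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, time); pad left so no output sees the future
        h = F.pad(x, (self.left_pad, 0))
        z = torch.tanh(self.filter(h)) * torch.sigmoid(self.gate(h))
        return x + self.residual(z)
```

Sample-by-sample generation through such a stack is what makes vanilla WaveNet slow; Parallel WaveNet distills it into a feed-forward student, and Efficient Neural Audio Synthesis (WaveRNN) restructures the network so a single recurrent layer suffices.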
Audiovisual Description
- Generating Natural Video Descriptions via Multimodal Processing by Qin Jin, Junwei Liang and Xiaozhu Lin
- Multimodal Video Description by Vasili Ramanishka, Abir Das, Dong H. Park, Subhashini Venugopalan, Lisa A. Hendricks, Marcus Rohrbach and Kate Saenko
- Describing Videos using Multi-modal Fusion by Qin Jin, Jia Chen, Shizhe Chen, Yifan Xiong and Alexander Hauptmann
- Video Description Generation using Audio and Visual Cues by Qin Jin and Junwei Liang
- Attention-based Multimodal Fusion for Video Description by Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R. Hershey, Tim K. Marks and Kazuhiko Sumi
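What distinguishes attention-based fusion from plain concatenation is a second attention over modalities: at each decoding step, the decoder state weights each modality's context vector before they are combined. A sketch of that step with hypothetical dimensions, in the spirit of Hori et al.; the per-modality temporal attention producing each context vector is assumed to happen upstream:

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Decoder-state-conditioned weighting of per-modality context vectors
    (illustrative sketch of attention-based multimodal fusion)."""
    def __init__(self, modality_dims, hidden_dim, fused_dim):
        super().__init__()
        self.projs = nn.ModuleList(
            [nn.Linear(d, fused_dim) for d in modality_dims])
        self.scores = nn.ModuleList(
            [nn.Linear(hidden_dim + d, 1) for d in modality_dims])

    def forward(self, contexts, decoder_state):
        # contexts: one (batch, dim_m) vector per modality (e.g. visual, audio)
        e = torch.cat([s(torch.cat([decoder_state, c], dim=1))
                       for s, c in zip(self.scores, contexts)], dim=1)
        beta = torch.softmax(e, dim=1)  # (batch, n_modalities) fusion weights
        fused = sum(beta[:, m:m + 1] * proj(c)
                    for m, (proj, c) in enumerate(zip(self.projs, contexts)))
        return fused, beta
```

The intent is that `beta` shifts across a video, leaning on audio when the soundtrack is informative and on visual features otherwise.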
Audiovisual Generation
- Deep Cross-modal Audio-visual Generation by Lele Chen, Sudhanshu Srivastava, Zhiyao Duan and Chenliang Xu
- Visual to Sound: Generating Natural Sound for Videos in the Wild by Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui and Tamara L. Berg
Visuospatial Question Answering
- Neural Scene Representation and Rendering by S. M. Ali Eslami, Danilo J. Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu and Demis Hassabis
- Incorporating 3D Information into Visual Question Answering by Yue Qiu, Yutaka Satoh, Ryota Suzuki and Hirokatsu Kataoka
- Multi-view Visual Question Answering with Active Viewpoint Selection by Yue Qiu, Yutaka Satoh, Ryota Suzuki, Kenji Iwata and Hirokatsu Kataoka
Simulation Generation
- Building Multimodal Simulations for Natural Language by James Pustejovsky and Nikhil Krishnaswamy
- VoxSim: A Visual Platform for Modeling Motion Language by Nikhil Krishnaswamy and James Pustejovsky
- Text to 3D Scene Generation with Rich Lexical Grounding by Angel X. Chang, Will Monroe, Manolis Savva, Christopher Potts and Christopher D. Manning
- 3D Scene Creation Using Story-based Descriptions by Xin Zeng, Qasim Mehdi and Norman Gough
Dialogue Systems
- Visual Dialog by Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh and Dhruv Batra
- Multi-turn Video Question Answering via Multi-stream Hierarchical Attention Context Network by Zhou Zhao, Xinghua Jiang, Deng Cai, Jun Xiao, Xiaofei He and Shiliang Pu
- CLEVR-Dialog: A Diagnostic Dataset for Multi-round Reasoning in Visual Dialog by Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra and Marcus Rohrbach
- End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features by Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael G. Lopes, Abhishek Das, Irfan Essa, Dhruv Batra and Devi Parikh
- End-to-end Optimization of Goal-driven and Visually Grounded Dialogue Systems by Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville and Olivier Pietquin