Abstract: Fusion and interaction of multimodal features are essential for video question answering. Structural information composed of the relationships between different objects in videos is very complex, which restricts understanding and ...
Authors | Zhicheng Guo, Jiaxuan Zhao, Licheng Jiao, Xu Liu, Fang Liu |
---|---|
Author affiliation | |
Journal | IEEE Transactions on Multimedia |
Pages / Total pages | 38-49 / 12 |
Language / CLC number | English / TP37 |
Keywords | Quaternions; Task analysis; Cognition; Visualization; Knowledge discovery; Feature extraction; Convolution |
DOI | 10.1109/TMM.2021.3120544 |
Holding number | IELEP0172 |