Abstract: Recent transformer-based models, especially patch-based methods, have shown great potential in vision tasks. However, splitting the input features into fixed-size patches of identical size ignores the fact that...
Authors | Xiao Lin, Shuzhou Sun, Wei Huang, Bin Sheng, Ping Li, David Dagan Feng |
---|---|
Affiliation | |
Journal | IEEE Transactions on Multimedia |
Pages / total pages | 50-61 / 12 |
Language / CLC number | English / TP37 |
Keywords | Transformers; Encoding; Task analysis; Semantics; Feature extraction; Costs; Convolutional neural networks |
DOI | 10.1109/TMM.2021.3120873 |
Holding number | IELEP0172 |