TY - CONF
T1 - Token Turing Machines are Efficient Vision Models
AU - Jajal, Purvish
AU - Eliopoulos, Nick
AU - Chou, Benjamin Shiue-Hal
AU - Thiruvathukal, George K.
AU - Davis, James C.
AU - Lu, Yung-Hsiang
PY - 2025/3/1
AB - We propose Vision Token Turing Machines (ViTTM), an efficient, low-latency, memory-augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines (NTM) and Token Turing Machines (TTM), which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non-sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens. Process tokens pass through the encoder blocks and read from and write to the memory tokens at each encoder block, allowing the network to store and retrieve information from memory. By ensuring that there are fewer process tokens than memory tokens, we reduce the inference time of the network while maintaining its accuracy. On ImageNet-1K, the state-of-the-art ViT-B has a median latency of 529.5 ms and 81.0% accuracy, while our ViTTM-B is 56% faster (234.1 ms) with 2.4× fewer FLOPs and an accuracy of 82.9%. On ADE20K semantic segmentation, ViT-B achieves 45.65 mIoU at 13.8 frames per second (FPS), whereas our ViTTM-B achieves 45.17 mIoU at 26.8 FPS (+94%).
DO - 10.1109/WACV61041.2025.00767
M3 - Paper
ER -