Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
https://arxiv.org/abs/2304.13705
Action Chunking with Transformers (ACT) is used in the famous Mobile ALOHA robot:
https://www.youtube.com/watch?v=mnLVbwxSdNM&t=71s
The overall model is trained as a generative model, specifically a conditional variational autoencoder (CVAE)
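Concretely, the training objective is the standard CVAE evidence lower bound, which in the paper comes out as an L1 reconstruction loss on the predicted action chunk plus a β-weighted KL term pulling the latent z toward a standard normal prior (at test time, z is simply set to the prior mean, i.e. zero); notation below is lightly adapted from the paper:

```latex
\mathcal{L} \;=\; \big\| a_{t:t+k} - \hat{a}_{t:t+k} \big\|_1
\;+\; \beta \, D_{\mathrm{KL}}\!\big(\, q_\phi(z \mid a_{t:t+k},\, o_t) \,\big\|\, \mathcal{N}(0, I) \,\big)
```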
This paper uses a technique called action chunking, inspired by “chunking” from psychology, where people group sequences of individual actions into larger units (e.g., a baseball player chunks “catch the ball” and “throw” into a single fielding motion). Instead of one action per observation, the policy predicts a sequence of the next k actions at once.
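To make this concrete, here is a minimal sketch of naive chunked execution in Python; the env and policy interfaces are hypothetical stand-ins, not the paper's code:

```python
def rollout_chunked(env, policy, chunk_size=100, max_steps=1000):
    """Naive action chunking: query the policy once per chunk, then play
    all of its predicted actions open-loop before re-planning.

    Hypothetical interfaces (not from the paper's codebase):
      policy(obs) -> array of shape (chunk_size, action_dim)
      env.step(action) -> (next_obs, done)
    """
    obs = env.reset()
    t = 0
    while t < max_steps:
        chunk = policy(obs)        # one forward pass covers k timesteps
        for action in chunk:       # execute the whole chunk open-loop
            obs, done = env.step(action)
            t += 1
            if done or t >= max_steps:
                return
```

Executing k actions per query reduces the effective horizon of the task by a factor of k, which the paper credits with mitigating compounding errors.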
However, playing chunk after chunk can make the motion jerky, so this paper uses a technique called temporal ensemble:
predict a new chunk at every timestep, rather than once every k timesteps
the chunks now overlap, so the action actually executed at each timestep is a weighted average of every prediction made for that timestep
the weights over the overlapping predictions follow an exponential scheme, w_i = exp(-m * i), where w_0 is the weight of the oldest prediction and m controls how quickly new observations are incorporated (smaller m = faster)
Diagram of Action Chunking and Temporal Ensemble from the original paper
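Here is a minimal sketch of one control step with temporal ensembling, reusing the hypothetical policy interface from the sketch above; the exponential weighting is the scheme from the paper:

```python
import numpy as np

def ensembled_action(policy, obs, pending, t, m=0.01):
    """One control step with temporal ensembling (sketch; `policy` and
    `pending` are hypothetical stand-ins, not the paper's code).

    `pending` maps a future timestep to the list of actions that earlier
    chunks predicted for it, ordered oldest to newest. A fresh chunk is
    predicted every step; the action executed now is an exponentially
    weighted average with w_i = exp(-m * i), so the oldest prediction
    (i = 0) carries the largest weight, as in the paper.
    """
    chunk = policy(obs)                               # (chunk_size, action_dim)
    for i, action in enumerate(chunk):
        pending.setdefault(t + i, []).append(action)  # file under target step
    preds = np.stack(pending.pop(t))                  # all predictions for step t
    weights = np.exp(-m * np.arange(len(preds)))      # oldest -> largest weight
    weights /= weights.sum()                          # normalize
    return weights @ preds                            # weighted-average action
```

Per the paper, m governs how quickly new observations are incorporated: a smaller m flattens the weights so fresher predictions matter more, while a larger m lets the older predictions dominate.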
Before explaining the architecture and flow of ACT, it’s important to know the components that make up this model: