Original Paper

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

https://arxiv.org/abs/2304.13705

What’s so good about this model?

Action Chunking with Transformers (ACT) is used in the famous Mobile ALOHA robot:

https://www.youtube.com/watch?v=mnLVbwxSdNM&t=71s

Architecture

Overall Structure

The model as a whole is trained as a generative model, specifically a conditional variational autoencoder (CVAE).

Unique techniques

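This paper uses a technique called action chunking, inspired by “chunking” from psychology, where people group individual actions into chunks and combine those chunks to perform larger actions (e.g. using the chunks “pick up” and “throw” to perform “catch and release” in baseball). In ACT, this means the policy predicts a sequence of upcoming actions at once instead of a single action per timestep.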

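However, executing one chunk after another can make the motion jerky, so the paper also uses a technique called temporal ensemble: the policy is queried at every timestep, so the predicted chunks overlap, and all the actions predicted for the same timestep are combined with a weighted average.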

Diagram of Action Chunking and Temporal Ensemble from the original paper
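To make the two ideas concrete, here is a minimal sketch of temporal ensemble in plain Python. It is not the authors' implementation. The function name temporal_ensemble, the chunks layout, and the value m=0.01 are assumptions made for illustration. The weighting w_i = exp(-m * i), with w_0 assigned to the oldest prediction, follows the scheme described in the paper.

```python
import math


def temporal_ensemble(chunks, t, m=0.01):
    """Blend every action that was predicted for timestep t.

    chunks[s] is the chunk predicted at timestep s: a list of actions,
    where chunks[s][j] is the action proposed for timestep s + j.
    Each action is a list of floats (e.g. joint positions).
    """
    # Collect the overlapping predictions for timestep t, oldest first.
    candidates = []
    for s, chunk in enumerate(chunks):
        j = t - s
        if 0 <= j < len(chunk):
            candidates.append(chunk[j])
    if not candidates:
        raise ValueError(f"no chunk covers timestep {t}")

    # Exponential weights: index 0 is the oldest prediction.
    weights = [math.exp(-m * i) for i in range(len(candidates))]
    total = sum(weights)

    # Weighted average, computed per action dimension.
    dim = len(candidates[0])
    return [
        sum(w * action[d] for w, action in zip(weights, candidates)) / total
        for d in range(dim)
    ]


# Hypothetical example: two chunks of length 3 with 1-D actions.
# The chunk from step 0 covers steps 0..2, the chunk from step 1 covers 1..3.
chunks = [
    [[0.0], [1.0], [2.0]],
    [[10.0], [20.0], [30.0]],
]
for step in range(4):
    print(step, temporal_ensemble(chunks, step))
```

Steps 1 and 2 are covered by both chunks, so their executed action is a blend of two predictions, while steps 0 and 3 have only one prediction each. A smaller m spreads the weight more evenly across old and new predictions, which is what smooths the transition between chunks.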


Before explaining the architecture and flow of ACT, it’s important to know the components that make up this model: