Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
https://arxiv.org/abs/2304.13705
Action Chunking with Transformers (ACT) is used in the famous Mobile ALOHA robot:
https://www.youtube.com/watch?v=mnLVbwxSdNM&t=71s
The overall model is trained as a generative model, specifically a conditional variational autoencoder (CVAE)
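Concretely, the training objective is the standard CVAE evidence lower bound, which in the paper comes out as an L1 reconstruction loss on the predicted action chunk plus a β-weighted KL term pulling the latent z toward a standard normal prior (at test time, z is simply set to the prior mean, i.e. zero); notation below is lightly adapted from the paper:

```latex
\mathcal{L} \;=\; \big\| a_{t:t+k} - \hat{a}_{t:t+k} \big\|_1
\;+\; \beta \, D_{\mathrm{KL}}\!\big(\, q_\phi(z \mid a_{t:t+k},\, o_t) \,\big\|\, \mathcal{N}(0, I) \,\big)
```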
This paper uses a technique called action chunking, inspired by “chunking” from psychology, where people group sequences of individual actions into larger units (e.g., a baseball player chunks “catch the ball” and “throw” into a single fielding motion). Instead of one action per observation, the policy predicts a sequence of the next k actions at once.
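To make this concrete, here is a minimal sketch of naive chunked execution in Python; the env and policy interfaces are hypothetical stand-ins, not the paper's code:

```python
def rollout_chunked(env, policy, chunk_size=100, max_steps=1000):
    """Naive action chunking: query the policy once per chunk, then play
    all of its predicted actions open-loop before re-planning.

    Hypothetical interfaces (not from the paper's codebase):
      policy(obs) -> array of shape (chunk_size, action_dim)
      env.step(action) -> (next_obs, done)
    """
    obs = env.reset()
    t = 0
    while t < max_steps:
        chunk = policy(obs)        # one forward pass covers k timesteps
        for action in chunk:       # execute the whole chunk open-loop
            obs, done = env.step(action)
            t += 1
            if done or t >= max_steps:
                return
```

Executing k actions per query reduces the effective horizon of the task by a factor of k, which the paper credits with mitigating compounding errors.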
However, playing chunk after chunk can make the motion jerky, so this paper uses a technique called temporal ensemble:
predict a new chunk at every timestep, rather than once every k timesteps
the chunks now overlap, so the action actually executed at each timestep is a weighted average of every prediction made for that timestep
the weights over the overlapping predictions follow an exponential scheme, w_i = exp(-m * i), where w_0 is the weight of the oldest prediction and m controls how quickly new observations are incorporated (smaller m = faster)
Diagram of Action Chunking and Temporal Ensemble from the original paper
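Here is a minimal sketch of one control step with temporal ensembling, reusing the hypothetical policy interface from the sketch above; the exponential weighting is the scheme from the paper:

```python
import numpy as np

def ensembled_action(policy, obs, pending, t, m=0.01):
    """One control step with temporal ensembling (sketch; `policy` and
    `pending` are hypothetical stand-ins, not the paper's code).

    `pending` maps a future timestep to the list of actions that earlier
    chunks predicted for it, ordered oldest to newest. A fresh chunk is
    predicted every step; the action executed now is an exponentially
    weighted average with w_i = exp(-m * i), so the oldest prediction
    (i = 0) carries the largest weight, as in the paper.
    """
    chunk = policy(obs)                               # (chunk_size, action_dim)
    for i, action in enumerate(chunk):
        pending.setdefault(t + i, []).append(action)  # file under target step
    preds = np.stack(pending.pop(t))                  # all predictions for step t
    weights = np.exp(-m * np.arange(len(preds)))      # oldest -> largest weight
    weights /= weights.sum()                          # normalize
    return weights @ preds                            # weighted-average action
```

Per the paper, m governs how quickly new observations are incorporated: a smaller m flattens the weights so fresher predictions matter more, while a larger m lets the older predictions dominate.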
Before explaining the architecture and flow of ACT, it’s important to know the components that make up this model: