What it can do

RT-1 (Robotics Transformer 1) performs robot tasks in real time, conditioned on a sequence of images and a natural language instruction.

What’s cool?

Overall Architecture

The RT-1 architecture consists of the following components:

The first step is to encode the natural language instruction. RT-1 uses the Universal Sentence Encoder (USE) for this:

The Universal Sentence Encoder paper proposes two primary methods for sentence embedding: a Transformer-based encoder and a Deep Averaging Network (DAN).
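
As a rough illustration of the simpler of the two encoder variants in the Universal Sentence Encoder paper, the Deep Averaging Network (DAN), here is a minimal sketch in plain Python. It is not the real encoder: the class name `ToyDAN`, the toy dimensions, and the random weights are all assumptions made for illustration, and the paper's DAN also averages bi-gram embeddings, which this sketch omits. It only shows the core idea of averaging token embeddings and passing the result through a feed-forward layer to get a fixed-size sentence vector.

```python
import math
import random

EMBED_DIM = 8    # toy size (assumption); the released encoder outputs 512-dim vectors
OUTPUT_DIM = 4   # toy size (assumption), chosen only to keep the example small


class ToyDAN:
    """Hypothetical, untrained Deep Averaging Network with a single dense layer."""

    def __init__(self, vocab, seed=0):
        rng = random.Random(seed)
        # One random embedding per known word (a real model learns these).
        self.word_vectors = {
            word: [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
            for word in vocab
        }
        # Weights of the single feed-forward layer, shape OUTPUT_DIM x EMBED_DIM.
        self.weights = [
            [rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)]
            for _ in range(OUTPUT_DIM)
        ]
        self.bias = [0.0] * OUTPUT_DIM

    def encode(self, sentence):
        # Naive whitespace tokenization; unknown words are skipped.
        tokens = [t for t in sentence.lower().split() if t in self.word_vectors]
        if not tokens:
            return [0.0] * OUTPUT_DIM
        # Step 1: average the token embeddings, which discards word order.
        averaged = [
            sum(self.word_vectors[t][i] for t in tokens) / len(tokens)
            for i in range(EMBED_DIM)
        ]
        # Step 2: feed-forward layer with tanh produces the sentence embedding.
        return [
            math.tanh(sum(w * x for w, x in zip(row, averaged)) + b)
            for row, b in zip(self.weights, self.bias)
        ]


encoder = ToyDAN(["pick", "up", "the", "apple", "place", "in", "bowl"])
instruction_embedding = encoder.encode("pick up the apple")
print(len(instruction_embedding))
```

Whatever the length of the instruction, the output is a vector of fixed size, which is what lets a downstream network condition on it. Because of the averaging step, this sketch gives "pick up the apple" and "apple the up pick" the same embedding up to floating-point rounding; the Transformer-based variant does not have that limitation, at a higher compute cost.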

What’s good about this encoder?