RT-1 is able to perform real-time tasks based on an image sequence with natural language instructions.
What’s cool?
The RT-1 architecture consists of the following components:
We first want a way to encode the natural language instruction, so RT-1 adopts the method from Universal Sentence Encoder to do this:
This paper proposes two primary methods for sentence embedding:
Whats’s good about this encoder?