

  • January 18, 2024
  • Abdullah S
Breaking Down Rabbit's AI Showcase: Beneath the Surface

Rabbit's recent unveiling of its Large Action Model (LAM) at CES has captured the attention of the tech world. The technology promises to translate user intents into actions through approaches like hierarchical policies and multimodal integration. In this post, we'll dig into Rabbit's demos, exploring their teaching process, dissecting the Airbnb booking demo, and critically evaluating the claims made during the presentation.

 

Teaching with "Teach Mode"

 
The first demo showcases Rabbit's unique "teach mode" designed to enable the AI to generate images via Midjourney. Users are required to visit the Rabbit "teach mode" page, enter the web application URL (for instance, https://discord.com), start a session, perform the desired actions (in this case, generating an image via Midjourney), and finally, stop the session. The recorded task then undergoes processing by Rabbit servers to be transformed into a web automation script for the Rabbit OS.
The start page for teaching Rabbit new tasks appears user-friendly, but the actual annotation process remains undisclosed. The lack of clarity on how Rabbit generalizes tasks from user annotations raises questions about the robustness of their training process. However, once the recorded task is processed, users can supposedly generate images via Midjourney using any prompt, which speaks to the intended versatility of the Rabbit device.
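As a rough mental model of what a recorded "teach mode" session might compile into, here is a minimal sketch. The `RecordedAction` structure, the action names, and the compile step are all assumptions for illustration, not Rabbit's actual format:

```python
from dataclasses import dataclass

@dataclass
class RecordedAction:
    """One user action captured during a hypothetical 'teach mode' session."""
    kind: str        # e.g. "click" or "type"
    target: str      # a UI label for the element acted on
    value: str = ""  # text typed, if any

def compile_to_script(actions: list[RecordedAction]) -> list[str]:
    """Turn a recorded session into human-readable automation steps."""
    steps = []
    for a in actions:
        if a.kind == "type":
            steps.append(f"type {a.value!r} into {a.target}")
        else:
            steps.append(f"{a.kind} {a.target}")
    return steps

# A Midjourney-style session, as the demo describes it:
session = [
    RecordedAction("click", "message box"),
    RecordedAction("type", "message box", "/imagine a rabbit in space"),
    RecordedAction("click", "send button"),
]
print(compile_to_script(session))
```

The open question the demo leaves is exactly this compile step: how a single recording generalizes to arbitrary prompts rather than replaying one fixed sequence.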

 

 

Booking a Room via Airbnb

 
The second demo takes us through booking a room via Airbnb, providing a behind-the-scenes view of the LAM in action. The LAM appears to employ hierarchical UI element detection, grouping individual HTML controls into higher-level conceptual controls. The demonstration shows the LAM mimicking the user's actions, highlighting UI elements and generating high-level instructions from the recorded tasks. The hierarchical approach produces concise instruction lists, significantly reducing the token count compared to raw HTML.
This approach extends beyond web applications: Rabbit claims its "teach mode" also works for mobile and desktop applications that don't rely on HTML. The implication is that Rabbit uses a multimodal model, combining image recognition with HTML snippet analysis to detect and interact with UI controls.
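To make the grouping idea concrete, here is a minimal sketch of how individual HTML controls might be collapsed into higher-level conceptual controls. The control records and group labels are invented for illustration; a real detector would infer the grouping from page structure or pixels:

```python
# Low-level HTML controls, as a detector might emit them for a booking page.
raw_controls = [
    {"tag": "input",  "id": "city",       "group": "search form"},
    {"tag": "input",  "id": "checkin",    "group": "search form"},
    {"tag": "input",  "id": "checkout",   "group": "search form"},
    {"tag": "button", "id": "submit",     "group": "search form"},
    {"tag": "a",      "id": "listing-42", "group": "results list"},
    {"tag": "a",      "id": "listing-43", "group": "results list"},
]

def to_conceptual_controls(controls: list[dict]) -> dict[str, list[str]]:
    """Collapse individual HTML controls into named higher-level controls."""
    grouped: dict[str, list[str]] = {}
    for c in controls:
        grouped.setdefault(c["group"], []).append(c["id"])
    return grouped

print(to_conceptual_controls(raw_controls))
```

Each conceptual control ("search form", "results list") then needs only one short description in the model's context instead of many raw HTML snippets, which is where the token savings come from.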

 

Hierarchical Approach and Multimodal Integration

 
The Rabbit team references the HeaP paper, emphasizing a hierarchical policy for web actions using LLMs. Their innovation lies in the adoption of a multimodal LLM, enabling direct analysis of graphical elements on a web page instead of inferring them from HTML. This departure from traditional approaches allows Rabbit to efficiently detect and operate UI controls, even in desktop and mobile applications that lack HTML structures.
The hierarchical approach proves pivotal in task execution. By combining high-level instructions like "Enter a city" with low-level commands, Rabbit creates a streamlined process for the web automation engine to execute complex tasks. As described in the HeaP paper, this framework offers efficiency and scalability beyond flat, single-level methods.
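The two-level structure from the HeaP paper can be sketched as follows. The hard-coded dictionaries stand in for what would, in practice, be LLM calls, and all task and selector names are illustrative:

```python
def high_level_policy(task: str) -> list[str]:
    """Map a task to high-level instructions (in HeaP, an LLM does this)."""
    if task == "book a room":
        return ["Enter a city", "Pick dates", "Select a listing"]
    return []

def low_level_policy(instruction: str) -> list[str]:
    """Expand one high-level instruction into concrete UI commands."""
    expansions = {
        "Enter a city": ["click #city", "type 'Paris' into #city"],
        "Pick dates": ["click #checkin", "click #checkout"],
        "Select a listing": ["click .listing:first"],
    }
    return expansions.get(instruction, [])

# The full plan is the concatenation of each instruction's expansion.
plan = [cmd for step in high_level_policy("book a room")
        for cmd in low_level_policy(step)]
print(plan)
```

The key design benefit is that each low-level expansion only needs the context of its own instruction, keeping prompts short at every level.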
 
Breaking down the task execution process reveals distinct stages, each potentially requiring specialized models. UI control detection, instruction abstraction, and task reasoning are integral components that might benefit from individualized approaches. While Rabbit's technical stack incorporates transformer-style attention and graph-based message passing, the specifics of each model's role and interaction remain undisclosed.
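The three stages named above might be chained as a simple pipeline. Every function here is a hypothetical stand-in for a specialized model, and the page representation is invented:

```python
def detect_controls(page: dict) -> list[str]:
    """Stage 1: UI control detection (here, just read labels from a page dict)."""
    return page["controls"]

def abstract_instructions(controls: list[str]) -> list[str]:
    """Stage 2: instruction abstraction -- turn controls into candidate actions."""
    return [f"use {c}" for c in controls]

def reason_over_task(task: str, instructions: list[str]) -> list[str]:
    """Stage 3: task reasoning -- keep only actions relevant to the task."""
    return [i for i in instructions if any(w in i for w in task.split())]

page = {"controls": ["city field", "date picker", "pay button"]}
steps = reason_over_task("enter city", abstract_instructions(detect_controls(page)))
print(steps)  # only the city-related action survives
```

Separating the stages like this is what would let each one be swapped for a specialized model independently, which may be why Rabbit describes them as distinct components.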
Rabbit's approach suggests a departure from the use of off-the-shelf web automation software. Instead, their bespoke solution encompasses a new network architecture, combining transformer-style attention and graph-based message passing. This approach challenges the traditional use of open-source projects like Playwright, Puppeteer, and Cypress, signaling an emphasis on innovation over conventional engineering.

 

Handling Missing Parameters, Variations & Errors

 
Rabbit's strategy for addressing missing parameters, variations, and errors introduces a hybrid system integrating symbolic algorithms and neural networks. The incorporation of sub-goals, sequences, parameter analysis, assertions, and completion conditions suggests a nuanced approach to handle real-world scenarios where user inputs may vary, and errors might occur.
The notion of determining sub-goals and grouping instructions into sub-tasks aligns with best practices in task automation. Sequencing instructions and analyzing dependencies between sub-goals ensure a coherent execution flow. Parameter analysis becomes critical for identifying required input parameters, allowing the LLM to prompt the user for missing information—a vital aspect that was not explicitly demonstrated in the Rabbit demos.
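Parameter analysis and prompting for missing inputs can be sketched like this. The task name and its parameter list are illustrative assumptions, not Rabbit's actual schema:

```python
def required_params(task: str) -> list[str]:
    """Parameters a hypothetical booking sub-task needs before it can run."""
    return {"book a room": ["city", "checkin", "checkout"]}.get(task, [])

def missing_params(task: str, provided: dict) -> list[str]:
    """Compare what the user supplied against what the task requires."""
    return [p for p in required_params(task) if p not in provided]

user_input = {"city": "Paris"}
gaps = missing_params("book a room", user_input)
if gaps:
    # In a real assistant this would be a spoken prompt back to the user.
    print(f"Please provide: {', '.join(gaps)}")
```

This is the step the demos never showed: what the device does when the user's request underspecifies the task.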
Scrutinizing the Marketing Claims

Rabbit's stellar marketing efforts have garnered attention, but certain claims demand scrutiny. The "No Software" claim, particularly for desktop and mobile applications, appears questionable. While Rabbit could theoretically proxy interactions through servers for web applications, extending this model to non-HTML applications raises security and feasibility concerns. The absence of a demonstrated "teach mode" for desktop and mobile apps leaves a notable gap in understanding Rabbit's true capabilities across various platforms.
The "No Apps" claim, though technically accurate, lacks transparency. Rabbit relies on predefined plugins, which are in effect automation scripts, to execute tasks. The implication that Rabbit can effortlessly perform any task without prior setup is undercut by the need for step-by-step training. While Rabbit's functionality surpasses existing AI assistants like Siri and Google Assistant, the distinction is critical for managing user expectations.
The "No Credentials Stored" claim raises questions about how Rabbit securely manages authentication information. While Rabbit asserts that they don't store third-party credentials or usernames and passwords, the necessity of storing some form of credential, potentially OAuth tokens, for future access poses security concerns. The definition of a "credential" is nuanced, and Rabbit's specific approach to safeguarding stored information remains unclear.
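To make the nuance concrete, here is a sketch of token-based credential handling: the password is discarded after account linking, but an OAuth-style token, itself arguably a credential, persists. The `TokenVault` class is hypothetical and the token issuance is simulated:

```python
import secrets

class TokenVault:
    """Sketch of credential handling: keep a token, never the password."""

    def __init__(self) -> None:
        self._tokens: dict[str, str] = {}

    def link_account(self, service: str, password: str) -> None:
        # The password is used once to obtain a token (simulated here with
        # a random value), then discarded -- nothing derived from it is kept.
        self._tokens[service] = secrets.token_hex(16)

    def has_credential(self, service: str) -> bool:
        # The stored token is still a credential in the broad sense -- the
        # nuance the "No Credentials Stored" claim glosses over.
        return service in self._tokens

vault = TokenVault()
vault.link_account("airbnb", "hunter2")
print(vault.has_credential("airbnb"))
```

Under this model Rabbit could truthfully say it stores no usernames or passwords, while still holding a secret that grants account access, which is exactly the ambiguity worth clarifying.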
Conclusion

Rabbit's CES demos offer a tantalizing glimpse into the future of AI-driven task automation. The integration of hierarchical policies, multimodal models, and a bespoke technical stack positions Rabbit as a pioneering force in the field. The handling of missing parameters, variations, and errors introduces a hybrid system that combines symbolic algorithms and neural networks for a nuanced approach to real-world scenarios.
While Rabbit's marketing has successfully generated buzz, critical examination reveals gaps and ambiguities. Claims of a "No Software" approach for desktop and mobile applications raise questions about practicality and security. The "No Apps" claim, while technically true, underscores the need for predefined plugins, challenging the idea of immediate, hassle-free task execution without prior setup. The assertion of "No Credentials Stored" demands further clarification on the security measures in place for stored authentication information.
As Rabbit prepares for its March release, consumers eagerly await the chance to try the device for themselves. The real measure of Rabbit's capabilities will be its ability to adapt to diverse platforms, maintain security standards, and deliver on the promises made during its high-profile unveiling.