B-MoCA: Benchmarking Mobile Device Control Agents across Diverse Configurations

KAIST, Seoul National University, Yonsei University

B-MoCA can serve as a testbed for mobile device control agents across diverse device configurations

"Turn on the airplane mode"

"Create an alarm at 10:30 am"

"Decrease the screen brightness"

"Call 911"

[Exemplary daily tasks performed by the agents]

Abstract

Developing autonomous agents for mobile devices can significantly enhance user interactions by offering increased efficiency and accessibility. However, despite the growing interest in mobile device control agents, the absence of a commonly adopted benchmark makes it challenging to quantify scientific progress in this area. In this work, we introduce B-MoCA, a novel benchmark designed specifically for evaluating mobile device control agents. To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 60 common daily tasks. Importantly, we incorporate a randomization feature that changes various aspects of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs (MLLMs), as well as agents trained from scratch using human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to enhance their effectiveness.


Diverse Device Environments


In B-MoCA, users can instantiate environments with diverse device configurations. We generate 45 environments (35 for training and 10 for testing) that vary in device type, language, font size, wallpaper, and icon placement.
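As a rough illustration of how such randomized configurations might be produced, the sketch below samples a device configuration and applies its font-size component to a running emulator via adb. The configuration fields, value ranges, and function names are illustrative assumptions, not the actual B-MoCA API; only the adb `font_scale` setting is a standard Android command.

```python
import random
import subprocess

# Hypothetical configuration space; the concrete values B-MoCA uses may differ.
DEVICE_TYPES = ["Pixel_3a", "Pixel_6", "WXGA_Tablet"]
LANGUAGES = ["en-US", "ko-KR", "es-ES"]
FONT_SCALES = [0.85, 1.0, 1.15, 1.30]
WALLPAPERS = ["wallpaper_01.png", "wallpaper_02.png"]


def sample_device_config(rng: random.Random) -> dict:
    """Sample one randomized device configuration (hypothetical fields)."""
    return {
        "device_type": rng.choice(DEVICE_TYPES),
        "language": rng.choice(LANGUAGES),
        "font_scale": rng.choice(FONT_SCALES),
        "wallpaper": rng.choice(WALLPAPERS),
        "shuffle_icons": rng.random() < 0.5,
    }


def apply_font_scale(serial: str, font_scale: float) -> None:
    """Apply the sampled font size to a running emulator via a standard adb setting."""
    subprocess.run(
        ["adb", "-s", serial, "shell", "settings", "put", "system",
         "font_scale", str(font_scale)],
        check=True,
    )


if __name__ == "__main__":
    rng = random.Random(0)
    train_configs = [sample_device_config(rng) for _ in range(35)]
    test_configs = [sample_device_config(rng) for _ in range(10)]
    print(train_configs[0])
```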

Baseline Analysis


Using the environments we configure, we benchmark state-of-the-art closed-source LLM agents (Gemini-Pro and GPT-4) and MLLM agents (Gemini-Pro-V and GPT-4V), as well as VLUI agents trained from scratch. We also compare open-source LLMs (Llama2, Llama3, and AgentLM). We further analyze different design choices for the agents (e.g., Set-of-Mark prompting for MLLM agents and the effect of the number of training environments on VLUI agents). The agents' lack of proficiency in complex scenarios calls for future work! A minimal sketch of a single MLLM agent step is provided after the results table below.

Success rates (%) on representative tasks (mean±std). LLM agents use three few-shot examples; MLLM agents use one (Gemini-Pro-V) or three (GPT-4V) few-shot examples, without Set-of-Mark prompting.
| Task | Instruction | LLM Agent (Gemini-Pro) | LLM Agent (GPT-4) | MLLM Agent (Gemini-Pro-V) | MLLM Agent (GPT-4V) | VLUI Agent (from scratch) |
|---|---|---|---|---|---|---|
| Airplane | "turn on the airplane mode" | 47±09 | 73±12 | 13±09 | 80±06 | 63±03 |
| Alarm 1 | "turn on the alarm at 9 am" | 30±06 | 67±03 | 13±03 | 60±15 | 67±03 |
| Alarm 2 | "create an alarm at 10:30 am" | 00±00 | 00±00 | 00±00 | 23±03 | 47±03 |
| Brightness | "decrease the screen brightness in setting" | 20±06 | 73±09 | 00±00 | 87±03 | 60±00 |
| Call 911 | "call 911" | 03±03 | 03±03 | 03±03 | 53±03 | 60±00 |
| Language | "go to 'add a language' page in setting" | 00±00 | 43±09 | 00±00 | 43±09 | 63±03 |
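To make the (M)LLM-agent setup concrete, below is a minimal sketch of a single GPT-4V-style agent step: the current screenshot and the task instruction are sent to the model, and a textual action (here, a tap coordinate) is read from the reply. The prompt wording, the action format, and the helper name are illustrative assumptions, not the prompts or action parser used in B-MoCA.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You control an Android device. Given the instruction and the current "
    "screenshot, reply with exactly one action in the form: tap(x, y)."
)


def mllm_agent_step(instruction: str, screenshot_path: str) -> str:
    """Query the MLLM once and return its raw action string (illustrative only)."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the GPT-4V model benchmarked in the paper
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Instruction: {instruction}"},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            },
        ],
        max_tokens=32,
    )
    return response.choices[0].message.content  # e.g., "tap(540, 1210)"
```

Set-of-Mark prompting, one of the design choices analyzed above, would additionally overlay numbered marks on UI elements in the screenshot so the model can refer to an element by its index rather than by raw screen coordinates.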

BibTeX

@inproceedings{lee2024benchmarking,
      title={Benchmarking Mobile Device Control Agents across Diverse Configurations}, 
      author={Juyong Lee and Taywon Min and Minyong An and Changyeon Kim and Kimin Lee},
      year={2024},
      booktitle={ICLR 2024 Workshop on Generative Models for Decision Making},
}