IARIA Congress 2025, The 2025 IARIA Annual Congress on Frontiers in Science, Technology, Services, and Applications


EMMA: Extended Multimodal Alignment for Robust Object Retrieval

Authors:
Rahul Agarwal

Keywords: Multimodal learning; Object retrieval; Sensor fusion; Contrastive loss; Grounded language.

Abstract:
This work addresses the challenge of multimodal learning for grounded language-based object retrieval. We propose Extended Multimodal Alignment (EMMA), an approach that combines geometric and cross-entropy methods to improve performance and robustness. Our method fuses information from diverse sensors and data sources, allowing physical agents to understand and retrieve objects from natural language instructions. Unlike existing approaches, which often use only two sensory inputs, EMMA accommodates an arbitrary number of modalities, promoting flexibility and adaptability. On the GoLD benchmark, EMMA reaches a mean reciprocal rank (MRR) of 0.93 and 78.2% top-1 recall, outperforming the strongest baseline by 7.4 percentage points in MRR while converging five times faster (three epochs, 40 minutes on a single RTX 4090). When any single modality is withheld at test time, EMMA retains 88% of its full-modality accuracy, whereas competing methods drop below 65%. We introduce a generalized distance-based loss that integrates multiple modalities even when some are missing, demonstrating EMMA's scalability and resilience. These results pave the way for improved multimodal learning and advanced applications in object retrieval and beyond.
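To illustrate the idea of a distance-based alignment loss over an arbitrary number of modalities that tolerates missing inputs, here is a minimal sketch. The paper's actual loss is not given in this abstract; the function name, the hinge-with-margin formulation, and the modality dictionary are illustrative assumptions, not EMMA's published implementation.

```python
import numpy as np

def multimodal_alignment_loss(embeddings, margin=1.0):
    """Hypothetical distance-based alignment loss (not the paper's exact loss).

    `embeddings` maps modality name -> (batch, dim) array, or None when that
    modality is missing at training/test time; missing modalities are skipped,
    so any number of present modalities >= 2 is supported.
    """
    present = [e for e in embeddings.values() if e is not None]
    assert len(present) >= 2, "need at least two present modalities"
    batch = present[0].shape[0]
    total, pairs = 0.0, 0
    # average a pairwise alignment term over every pair of present modalities
    for i in range(len(present)):
        for j in range(i + 1, len(present)):
            a, b = present[i], present[j]
            # Euclidean distances between all cross-modality batch items
            d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
            pos = np.diag(d)                      # matching (same-object) pairs
            # hinge term: push mismatched pairs at least `margin` apart
            neg = np.maximum(0.0, margin - d)
            np.fill_diagonal(neg, 0.0)
            total += pos.mean() + neg.sum() / (batch * (batch - 1))
            pairs += 1
    return total / pairs
```

Because the loss is a mean over modality pairs rather than a fixed two-tower objective, dropping one modality only removes its pairs from the average, which is one plausible way to obtain the graceful degradation the abstract reports.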

Pages: 23 to 32

Copyright: Copyright (c) IARIA, 2025

Publication date: July 6, 2025

Published in: conference

ISBN: 978-1-68558-284-5

Location: Venice, Italy

Dates: from July 6, 2025 to July 10, 2025