XR Research Internship | Seunghyeon Lee

Undergraduate Research Intern / HCS Lab

Role / 2D-3D grounding, XR overlay implementation, AI-generated output evaluation

Tech / Unity, OpenXR, Gemini

Project Overview

I joined a research project studying how users can better understand responses generated by an XR agent. My work covered object-based overlays in the XR client and the evaluation of AI-generated outputs.

The project examined where, when, and in what form visual cues tied to the real scene should be shown. Because of paper submission and anonymity requirements, I have intentionally omitted the specific system name, experiment results, and detailed structure.

What I Did

Studied and implemented the XR client flow that connects gaze input, the real scene, and UI overlays in a Unity/OpenXR environment.
Worked on 2D-3D grounding that connects image-based referent detection results to object-linked XR overlays.
Analyzed alignment issues between screen coordinates, camera projection, and world-space placement for object-based visual presentation.
Evaluated whether AI-generated outputs matched the scene context, the intended referent, and the purpose of the response.

Design Challenge

The key challenge was not merely to present an AI-generated explanation as voice or text, but to anchor it to the real scene the user was looking at so that it was easier to understand.

To achieve this, the XR client had to handle coordinate systems and overlay placement so that visual cues shown over the real scene would not drift away from their target. Screen coordinates, camera input, and Unity world space each use a different reference frame, so simply drawing object detection results on the UI was not enough. The detection results first had to be converted from image space into world space before an overlay could be placed.
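As a generic illustration of this kind of conversion (not the project's actual code, which is omitted for anonymity), here is a minimal Python sketch that lifts the center of a 2D detection box into world space. It assumes a pinhole camera model, a known depth for the target (for example from a raycast or a depth estimate), and a 4x4 camera-to-world matrix. The names `Intrinsics`, `detection_center`, `pixel_to_camera`, and `camera_to_world` are hypothetical helpers introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics in pixels (assumed known for the capture camera)."""
    fx: float  # focal length along x
    fy: float  # focal length along y
    cx: float  # principal point x
    cy: float  # principal point y


def detection_center(box):
    """Return the pixel center of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def pixel_to_camera(u, v, depth, k):
    """Unproject pixel (u, v) at the given depth into camera space.

    Assumption: image origin is top-left with y pointing down, while camera
    space is x right, y up, z forward, so the y axis is flipped here.
    """
    x = (u - k.cx) / k.fx * depth
    y = -(v - k.cy) / k.fy * depth
    return (x, y, depth)


def camera_to_world(point, cam_to_world):
    """Apply a row-major 4x4 camera-to-world matrix to a camera-space point."""
    x, y, z = point
    return tuple(
        cam_to_world[r][0] * x
        + cam_to_world[r][1] * y
        + cam_to_world[r][2] * z
        + cam_to_world[r][3]
        for r in range(3)
    )


if __name__ == "__main__":
    # Hypothetical values: a 640x480 image and a camera translated by (1, 1.5, 0).
    k = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    pose = [
        [1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 1.5],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
    u, v = detection_center((400, 100, 440, 180))
    anchor = camera_to_world(pixel_to_camera(u, v, 2.0, k), pose)
    print(anchor)  # world-space position where an overlay could be anchored
```

In Unity itself, `Camera.ScreenToWorldPoint` performs a comparable unprojection, but it expects screen coordinates with a bottom-left origin, so detection results from a top-left-origin image need their y coordinate flipped first. Overlooking that kind of mismatch between frames is a typical reason an overlay ends up offset from its target.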

In addition, AI-generated multimodal outputs could not be assumed to be correct. I had to check whether each visual cue pointed to the right target in the real scene, whether it was the right kind of cue for the explanation, and whether it fit the overall flow of the response. I reviewed the generated outputs and evaluated them against the scene context and the intended response.

Preview

Preview image omitted for anonymity.