By Ashwini Sakharkar 30 Nov, 2024
People experience the world through multiple senses simultaneously, which enhances our understanding of our environment. Previous quantitative geography research has largely focused on people's visual perceptions, overlooking the role of auditory perceptions in defining a place, largely because the acoustic atmosphere of a place is difficult to describe effectively.
Moreover, few studies have combined the two perceptual dimensions, auditory and visual, in understanding the human sense of place.
Utilizing generative artificial intelligence, researchers at The University of Texas at Austin have now transformed sounds from audio recordings into street-view visuals. The visual fidelity of the generated images indicates that machines can mirror the human ability to connect audio with the visual interpretation of environments.
In a study published in Computers, Environment and Urban Systems, the team outlines the process of training a soundscape-to-image AI model with audio and visual data sourced from diverse urban and rural streetscapes, subsequently employing that model to produce images based on audio recordings.
“Our study found that acoustic environments contain enough visual cues to generate highly recognizable streetscape images that accurately depict different places,” said Yuhao Kang, assistant professor of geography and the environment at UT and co-author of the study. “This means we can convert the acoustic environments into vivid visual representations, effectively translating sounds into sights.”
Using YouTube video and audio from cities across North America, Asia, and Europe, the team generated pairs of 10-second audio clips and still images from diverse locations to train an AI model capable of creating high-resolution images based on audio input. They then assessed AI-generated images derived from 100 audio clips in comparison to corresponding real-world photographs, employing both human and computer evaluations.
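To make the data-preparation step concrete, below is a minimal sketch in Python of how a 10-second audio clip might be paired with a still frame pulled from the same video; the use of ffmpeg, the file names, and the sampling interval are illustrative assumptions rather than details taken from the study.

```python
import subprocess
from pathlib import Path

# Hypothetical sketch: cut a 10-second audio clip and grab a matching still frame
# from a downloaded streetscape video, producing one (audio, image) training pair.
# Paths, clip length handling, and naming are illustrative, not from the paper.

def make_pair(video_path: Path, start_sec: float, out_dir: Path, pair_id: str) -> tuple[Path, Path]:
    out_dir.mkdir(parents=True, exist_ok=True)
    audio_out = out_dir / f"{pair_id}.wav"
    image_out = out_dir / f"{pair_id}.jpg"

    # Extract 10 seconds of audio starting at start_sec (drop the video stream with -vn).
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start_sec), "-i", str(video_path),
         "-t", "10", "-vn", str(audio_out)],
        check=True,
    )
    # Grab a single still frame from the middle of the same 10-second window.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start_sec + 5), "-i", str(video_path),
         "-frames:v", "1", str(image_out)],
        check=True,
    )
    return audio_out, image_out

# Example: sample one pair every 10 seconds from a 10-minute clip.
# pairs = [make_pair(Path("city_walk.mp4"), t, Path("pairs"), f"city_{t}") for t in range(0, 600, 10)]
```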
The computer evaluations analyzed the relative proportions of greenery, buildings, and sky between the source and the generated images, while human judges were tasked with correctly matching one of three produced images to an audio sample.
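As a hedged illustration of that computer evaluation, the sketch below compares the pixel share of greenery, buildings, and sky between each real photograph and its generated counterpart, then correlates those proportions across all image pairs. It assumes a separate semantic-segmentation step has already produced integer label masks for every image; the class IDs and helper names are hypothetical, not the authors' implementation.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical label scheme for the three classes compared in the evaluation.
CLASS_IDS = {"greenery": 1, "building": 2, "sky": 3}

def class_proportions(label_mask: np.ndarray) -> dict[str, float]:
    """Fraction of pixels in the mask belonging to each class of interest."""
    total = label_mask.size
    return {name: float(np.sum(label_mask == cid)) / total for name, cid in CLASS_IDS.items()}

def proportion_correlations(real_masks: list[np.ndarray], gen_masks: list[np.ndarray]) -> dict[str, float]:
    """Pearson correlation of per-image class proportions between real and generated images."""
    corrs = {}
    for name in CLASS_IDS:
        real = [class_proportions(m)[name] for m in real_masks]
        gen = [class_proportions(m)[name] for m in gen_masks]
        corrs[name] = pearsonr(real, gen)[0]
    return corrs
```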
The findings revealed strong correlations in the proportions of sky and greenery between the generated images and their real-world counterparts, with a somewhat lower correlation observed in the proportions of buildings. Human participants demonstrated an average accuracy of 80% in identifying the generated images that matched the provided audio samples.
“Traditionally, the ability to envision a scene from sounds is a uniquely human capability, reflecting our deep sensory connection with the environment. Our use of advanced AI techniques supported by large language models (LLMs) demonstrates that machines have the potential to approximate this human sensory experience,” Kang said. “This suggests that AI can extend beyond mere recognition of physical surroundings to potentially enrich our understanding of human subjective experiences at different places.”
The generated images not only closely approximate the proportions of sky, greenery, and buildings, but also replicate the architectural styles and spatial relationships of their real-world counterparts. Crucially, they reflect the lighting conditions under which the soundscapes were recorded, whether sunny, cloudy, or at night.
The authors emphasize that lighting information may be gleaned from activity levels in soundscapes; for instance, the presence of traffic sounds or the chirping of nocturnal insects can indicate the time of day. These insights deepen our understanding of how multisensory elements shape our experience of a place.
“When you close your eyes and listen, the sounds around you paint pictures in your mind,” Kang said. “For instance, the distant hum of traffic becomes a bustling cityscape, while the gentle rustle of leaves ushers you into a serene forest. Each sound weaves a vivid tapestry of scenes, as if by magic, in the theater of your imagination.”
Kang’s research is at the forefront of utilizing geospatial AI to explore human-environment interactions. In a recent paper published in Nature, he and his co-authors investigated the transformative potential of AI in capturing the distinctive characteristics that contribute to a city’s unique identity.
Journal reference:
- Yonggai Zhuang, Yuhao Kang, Teng Fei, Meng Bian, Yunyan Du. From hearing to seeing: Linking auditory and visual place perceptions with soundscape-to-image generative artificial intelligence. Computers, Environment and Urban Systems, 2024; DOI: 10.1016/j.compenvurbsys.2024.102122