Do Audio-Visual Large Language Models Really See and Hear?

Ramaneswaran Selvakumar*, Kaousheik Jayakumar*, S Sakshi, Sreyan Ghosh,
Ruohan Gao#, Dinesh Manocha#
*Equal contribution    #Equal advising
University of Maryland, College Park
CVPR Findings 2026

TL;DR
AVLLMs exhibit a strong vision bias in audio understanding, hallucinating sounds from what they see rather than what they hear. We conduct mechanistic interpretability experiments showing that rich audio semantics exist internally, cross-modal transfer occurs in mid-to-deep layers where vision dominates, and this bias likely stems from vision-centric training.

Visual bias in action
Figure 1. Example demonstrating visual bias in AVLLMs. Here, the visible objects are silent; the only real sound is an off-screen siren, yet the AVLLM hallucinates audio from what it sees.

AVLLMs have made remarkable progress in jointly understanding audio and visual inputs. But how they actually process and use these modalities internally remains a black box, and this opacity has real consequences.

To see why this matters, consider the safety-critical setting shown in Figure 1: an autonomous vehicle should respond to an off-screen ambulance siren even when it isn't visible. Current AVLLMs would likely fail here: when we stress-test them on scenarios where audio and visual content conflict, they hallucinate sounds from visible objects and miss the actual audio entirely. They are biased to see first, then guess what they should be hearing.


Some Qualitative Examples

Below we show how different AVLLMs hallucinate audio from visual content. Select a model to see its audio-visual captions for a factual video (which contains the original audio) and a counterfactual video (with swapped audio).


How We Study This

Task: We ask AVLLMs to describe what they see and hear: a simple task, but one that directly forces the model to use both modalities. Unlike multiple-choice or binary QA, free-form captions are interpretable.

Dataset: In natural videos, audio and visual content are correlated, so vision can confound audio understanding, i.e., the model can infer sounds just from what it sees. To prevent this, we construct counterfactual samples by swapping a video's audio with a semantically unrelated track. We curate 500 samples from AudioCaps, half factual, half counterfactual.
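The swap step can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: `make_counterfactuals`, the dict keys, and the label-based notion of "semantically unrelated" are all assumptions for the example.

```python
import random

def make_counterfactuals(samples, seed=0):
    """Pair each video with an audio track whose label differs,
    producing audio-visual mismatched (counterfactual) samples.

    `samples` is a list of dicts with hypothetical keys
    'video', 'audio', and 'audio_label'.
    """
    rng = random.Random(seed)
    counterfactuals = []
    for s in samples:
        # Candidate swaps: audio tracks with a semantically different label.
        candidates = [t for t in samples if t["audio_label"] != s["audio_label"]]
        swap = rng.choice(candidates)
        counterfactuals.append({
            "video": s["video"],             # original visuals
            "audio": swap["audio"],          # mismatched audio track
            "audio_label": swap["audio_label"],
        })
    return counterfactuals
```

In practice "semantically unrelated" would be enforced with caption similarity rather than a single label, but the mechanics are the same: keep the video, replace the soundtrack.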

Evaluation: Evaluating free-form captions at scale is hard. We use an LLM-as-judge that scores audio and video fidelity separately on a 0-1 scale. It is scalable, and has strong correlation with human judgements (ρ = 0.816 audio, ρ = 0.732 video).
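For concreteness, the reported ρ values are Spearman rank correlations between judge scores and human scores. A minimal pure-Python version (no tie correction; `spearman_rho` and the score lists are illustrative, not the paper's evaluation code):

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two score lists of equal
    length (e.g. LLM-judge scores vs. human scores).
    Simplified sketch: assumes no tied values."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```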


Findings

Does the model pay attention to audio?

Attention fraction across layers
Figure 2. Mean attention from generated to input tokens. Audio gets 40-50% attention in layers 0-5, then drops to near-zero. Video climbs to 20-40% in layers 15-30.

Before asking why audio fails, a more basic question: does the model even attend to audio tokens during generation, or is it effectively ignoring them from the start? If the model never looks at audio, the hallucinations would be unsurprising: it would simply be guessing from vision alone.

To test this, we track the mean attention that generated tokens allocate to each input modality (video tokens, audio tokens, and query text tokens) across every transformer layer of the model.
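This measurement can be sketched as below, assuming head-averaged attention maps are available as nested lists (all names here are illustrative, and real maps would come from the model's attention outputs):

```python
def modality_attention_fraction(attn, spans, gen_start):
    """Per-layer fraction of attention mass that generated tokens
    place on each modality.

    attn: attn[layer][query][key] = head-averaged attention weight.
    spans: dict mapping modality name -> (start, end) token index range.
    gen_start: index of the first generated token.
    """
    fractions = []
    for layer in attn:
        gen_rows = layer[gen_start:]          # rows for generated tokens
        total = sum(sum(row) for row in gen_rows)
        per_modality = {}
        for name, (s, e) in spans.items():
            mass = sum(sum(row[s:e]) for row in gen_rows)
            per_modality[name] = mass / total if total else 0.0
        fractions.append(per_modality)
    return fractions
```

Sweeping this over layers yields exactly the curves in Figure 2: a per-layer fraction for audio, video, and text.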

Finding: The model does attend to audio, but only briefly. Audio tokens receive 40-50% of attention in early layers (0-5), then drop to near-zero. Visual attention, by contrast, steadily climbs through deeper layers (15-30), reaching 20-40%.

Are audio representations meaningful?

Logit lens probing
Figure 3. Probing audio representations. Audio tokens decode into meaningful sound concepts, including multilingual tokens like 键盘 (keyboard).

We've established that the model does attend to audio tokens, albeit briefly. Next, we ask whether those audio tokens actually encode anything meaningful to begin with. If the representations are not meaningful, the visual bias would simply reflect a lack of useful audio signal, not a failure to use it.

To find out, we probe audio representations using the logit lens. This technique decodes hidden states at each audio token position using the model's unembedding matrix, projecting them into probability distributions over the vocabulary. If the representations are meaningful, they should decode into tokens that describe the actual audio content.
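In essence, the logit lens is a single matrix product followed by a top-k lookup. A toy-dimension sketch (pure Python, illustrative names; a real implementation would multiply the hidden state by the model's actual unembedding matrix):

```python
def logit_lens(hidden, unembed, vocab, k=3):
    """Decode one hidden state into its top-k vocabulary tokens.

    hidden: hidden state at an audio token position, list[float] of size d.
    unembed: unembedding matrix as vocab-size rows of size d.
    vocab: token strings aligned with unembed's rows.
    """
    # Project the hidden state onto the vocabulary: logits = W_U @ h.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    topk = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    return [(vocab[i], logits[i]) for i in topk]
```

Running this at every audio token position and every layer reveals where concepts like "drill" or "keyboard" become linearly decodable.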

Finding: Audio representations decode into interpretable tokens that capture sound sources (drill, engine, keyboard) and actions (typing, neighing), even in multiple languages (键盘/keyboard, 马/horse). Measuring this systematically, the model achieves 61.4% latent audio understanding from its internal representations, yet generated captions hit only 23% audio fidelity on counterfactual samples. The audio understanding exists internally; it just isn't making it to the output.

How does cross-modal information flow?

So the model attends to audio, and encodes meaningful audio semantics internally. But somewhere between those internal representations and the final generated text, audio gets lost. Where exactly does this happen?

To trace this, we use attention knockout, a causal intervention that selectively blocks attention from generated tokens to either audio (G↛A) or video (G↛V) at specific layers. The logic is simple: if blocking a modality at a given layer degrades the output, that modality was actively contributing there.

Attention knockout setup
Figure 4. Attention knockout setup. Left: baseline with all attention paths. Middle: G↛V blocks generated tokens from attending to video (orange). Right: G↛A blocks attention to audio (blue).

The setup is illustrated in Figure 4. In the baseline condition, generated tokens (G) can attend to all input tokens, video (V), audio (A), and query (Q). In the G↛V condition (orange), we block the attention pathway from generated tokens to video tokens at a specific layer, forcing the model to rely solely on audio and text. In G↛A (blue), we block attention to audio tokens instead. By sweeping this intervention across layers and measuring how caption fidelity changes, we can pinpoint exactly where each modality contributes to the output.
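One common way to implement such a knockout is an additive mask applied to attention scores before the softmax, at the chosen layers only. The sketch below builds that mask; the names and the additive-mask formulation are assumptions for illustration, not the paper's exact code:

```python
def knockout_mask(seq_len, gen_start, blocked_span):
    """Additive attention mask for a G-/->X knockout.

    Generated tokens (query rows >= gen_start) are barred from attending
    to the blocked modality's key columns in [blocked_span): those
    entries get -inf, so they vanish after the softmax. All other
    attention paths are left untouched (0.0 added).
    """
    s, e = blocked_span
    mask = [[0.0] * seq_len for _ in range(seq_len)]
    for q in range(gen_start, seq_len):
        for k in range(s, e):
            mask[q][k] = float("-inf")
    return mask
```

For G↛V the blocked span covers the video token range; for G↛A, the audio token range. Sweeping the layer at which the mask is applied produces the curves in Figure 5.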

Attention knockout: factual samples
Figure 5a. Factual samples. Blocking either modality has minimal impact, as the model compensates with the complementary modality. Video fidelity (A) stays flat when video is blocked; audio fidelity (B) stays flat when audio is blocked.

Factual samples: When we block either modality, the model compensates using the other, and performance recovers either way. As plots A and B show, neither video nor audio caption fidelity degrades significantly when the corresponding modality is blocked. This demonstrates audio-visual complementarity, but it also reveals why factual samples alone are insufficient for evaluation: the modalities are so correlated that the model can always lean on one to cover for the other, which is exactly why counterfactuals are necessary.

Attention knockout: counterfactual samples
Figure 5b. Counterfactual samples. Blocking video (G↛V, orange) drops video fidelity (C) in deeper layers. Crucially, the same intervention improves audio fidelity (D) by ~50%.

Counterfactual samples: Here the modalities conflict, so compensation is impossible. In plot C, blocking video (G↛V, orange) causes a clear drop in video understanding concentrated in mid-to-deep layers (15-30), telling us where visual information transfers to generated text. For audio understanding, blocking audio (G↛A, blue in plot D) produces a similar drop in the same layers, confirming that audio transfers there too. But the critical finding is in plot D: blocking video (orange) actually improves audio understanding by ~50%, recovering it to near factual-setting levels. When both modalities compete in those deeper layers, vision is preferred at the cost of audio.

Finding: Both audio and video integrate into generated text in the deeper layers of the network. However, vision takes precedence over audio in this integration stage, and blocking visual pathways in these layers recovers the model's latent audio understanding capabilities.

Where does the vision bias originate?

We know vision dominates in the final layers. But is this because audio training did not sufficiently change the model's behavior to begin with, leaving it fundamentally a vision-language model with audio bolted on?

To test this, we compare the output token distributions of Qwen2.5-Omni (the AVLLM, with audio input) against Qwen2.5-VL (its base vision-only model, no audio). For each generated token, we measure whether the AVLLM's prediction shifts away from what the vision-only model would predict. If audio meaningfully influences generation, we should see significant distributional shifts.
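A sketch of this comparison on a single generated token, assuming both models' next-token distributions over a shared vocabulary are available (function names and the epsilon smoothing are illustrative choices):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token probability distributions,
    with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def shift_stats(avllm_probs, vlm_probs, k=3):
    """Per-token shift diagnostics: does the AVLLM's top prediction
    match the vision-only model's, and does it fall in the vision-only
    model's top-k?"""
    top_av = max(range(len(avllm_probs)), key=avllm_probs.__getitem__)
    vlm_topk = sorted(range(len(vlm_probs)), key=lambda i: -vlm_probs[i])[:k]
    return {
        "unshifted": top_av == vlm_topk[0],   # identical top prediction
        "in_topk": top_av in vlm_topk,        # within the VLM's top k
        "kl": kl_divergence(avllm_probs, vlm_probs),
    }
```

Aggregating `unshifted` and `in_topk` over audio-describing tokens gives statistics analogous to the 66% / 85% figures below; a near-zero mean KL indicates the AVLLM rarely departs from its vision-only base.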

The examples below illustrate this. Colors indicate how much each token is influenced by audio: blue tokens are identical to what the vision-only base model would predict, yellow tokens are close, and red tokens reflect genuine audio understanding. We also visualize the attention to see what the model attends to while generating a particular token.

Finding: The distributions are remarkably similar, with a KL divergence of just 0.4. Of tokens describing audio events, 66% are unshifted (identical top prediction as the vision-only model), and 85% fall within the vision-only model's top 3 predictions. Notably, when the model does correctly identify audio content, those tokens do shift away from the base LVLM distribution, confirming that genuine audio processing produces distributional shifts. But the model still defaults to its visual priors more often than it should, particularly under conflict.

Citation

BibTeX
@misc{selvakumar2026audiovisuallargelanguagemodels,
      title={Do Audio-Visual Large Language Models Really See and Hear?}, 
      author={Ramaneswaran Selvakumar and Kaousheik Jayakumar and S Sakshi and Sreyan Ghosh and Ruohan Gao and Dinesh Manocha},
      year={2026},
      eprint={2604.02605},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.02605}, 
}