Interpretability

Studied LLaVA-1.5-7B on a controlled synthetic benchmark. Logit lens, linear probing, and attention analyses revealed an existence bias that causes a yes-bias in binary relational tasks, and showed that mid-layer representations are linearly decodable but misaligned with the model's final decisions.