arXiv Preprint
Prompts have been central to recent advances in language models'
zero-shot and few-shot performance. However, recent work finds that models can
perform surprisingly well when given intentionally irrelevant or misleading
prompts. Such results may be interpreted as evidence that model behavior is not
"human like". In this study, we challenge a central assumption in such work:
that humans would perform badly when given pathological instructions. We find
that humans are able to reliably ignore irrelevant instructions and thus, like
models, perform well on the underlying task despite an apparent lack of signal
regarding the task they are being asked to do. However, when given deliberately
misleading instructions, humans follow the instructions faithfully, whereas
models do not. Thus, our conclusion is mixed with respect to prior work. We
argue against the earlier claim that high performance with irrelevant prompts
constitutes evidence against models' instruction understanding, but we
reinforce the claim that models' failure to follow misleading instructions
raises concerns. More broadly, we caution that future research should not
idealize human behavior as a monolith, and should not train or evaluate models
to mimic assumptions about that behavior without first validating those
assumptions empirically.
Albert Webson, Alyssa Marie Loo, Qinan Yu, Ellie Pavlick
2023-01-17