LLMs can be easily manipulated into promoting harmful biases under the guise of scientific inquiry.
When malicious requests are disguised as scientific language, LLMs respond with increased bias and toxicity.
-----
Paper - https://arxiv.org/abs/2501.14073
Methods in this Paper 💡:
→ The attack disguises malicious prompts as scientific language.
→ These prompts deliberately misinterpret social science and psychology studies to present stereotypes as scientifically beneficial.
→ LLMs respond to such prompts with increased bias and toxicity (a toxicity-scoring sketch follows this list).
→ Models can also be manipulated into fabricating scientific arguments that falsely claim these biases are beneficial.
→ The method jailbreaks even strong models like GPT.
→ Adding author names and publication venues makes the prompts more persuasive.
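As an illustration of how such a bias/toxicity shift could be quantified, here is a minimal sketch assuming the open-source Detoxify classifier and hypothetical response lists; the paper's actual evaluation pipeline is not described in this post and may differ.

```python
# Minimal sketch (not the paper's code): score model responses for toxicity
# with the open-source `detoxify` package and compare plain vs. disguised
# prompt conditions. The response lists below are hypothetical placeholders.
from detoxify import Detoxify

detector = Detoxify("original")  # pretrained multi-label toxicity classifier

def mean_toxicity(responses: list[str]) -> float:
    """Average Detoxify toxicity score over a batch of model responses."""
    scores = detector.predict(responses)["toxicity"]
    return sum(scores) / len(scores)

# Hypothetical data: responses collected from the model under test.
plain_responses = ["<response to a plainly worded request>"]
disguised_responses = ["<response to the same request framed as science>"]

print(f"plain:     {mean_toxicity(plain_responses):.3f}")
print(f"disguised: {mean_toxicity(disguised_responses):.3f}")
```

A higher mean score in the disguised condition would correspond to the paper's claim that scientific framing increases toxic output.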