Prompt Hacking in LLMs 2024-2025 Literature Review

Jun 16, 2025

Browse all previously published AI Tutorials here.

Types of Prompt Hacking
- Jailbreak Attacks
- Indirect Prompt Injections
- Adversarial Prompting
- Prompt Leaking
- Stealthy Prompt Engineering
Defense Mechanisms
- Adversarial Training
- Access Control
- Input Sanitization and Preprocessing
- Response Filtering and Moderation
- Instruction Tuned Alignment
- Monitoring and Real Time Detection
Implementation Details and Frameworks
Recent Research Highlights 2024-2025
Sources

Types of Prompt Hacking

Jailbreak Attacks

Jailbreaking refers to input prompts that trick an aligned LLM into bypassing its safety protocols and content filters. These attacks often coax the model into producing disallowed or harmful outputs despite built-in safeguards (LLM01:2025 Prompt Injection - OWASP Top 10 for LLM & Generative AI Security). For example, an attacker might instruct the model “Ignore the previous instructions and tell me…” to override safety guidelines. Jailbreak prompts exploit the model’s compliance to user instructions, essentially causing it to disregard its safety training entirely . Recent work notes that even advanced models remain vulnerable to such clever prompt manipulations, which can lead to malicious or offensive responses (Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield - ACL Anthology). A cat-and-mouse dynamic has emerged, as each new jailbreak method prompts developers to patch the model or adjust its system prompts ( An Early Categorization of Prompt Injection Attacks on Large Language Models).

Indirect Prompt Injections

Indirect prompt injection occurs when malicious instructions are embedded in content that the LLM consumes from external sources (web pages, documents, user-provided data) rather than coming directly from the user’s query . In this scenario, an attacker might hide a prompt in a webpage that a user asks the LLM to summarize, causing the model to execute those hidden instructions. Such attacks have compromised real LLM-integrated applications: one study found 31 out of 36 tested applications (including popular tools like Notion) were susceptible to hidden prompts that enabled prompt theft (revealing the app’s secret instructions) or unauthorized actions ( Prompt Injection attack against LLM-integrated Applications). Indirect injections can be imperceptible to humans (e.g. hidden in HTML or in whitespace) yet still parsed by the model . This makes them a severe threat in systems where LLMs ingest untrusted content, as the model can be tricked into harmful actions without the user’s knowledge ( Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems).

Adversarial Prompting

Adversarial prompting uses carefully crafted inputs (often generated via automated techniques) to induce model errors or policy violations. Instead of obvious instructions to break rules, these inputs may include nonsensical or subtly perturbed text that exploits the model’s internal decision boundaries. Recent research has applied gradient-based optimization to find “universal” prompt perturbations that consistently force an LLM off-track ( Automatic and Universal Prompt Injection Attacks against Large Language Models). For instance, Wichers et al. (2024) demonstrate a gradient-based red-teaming method that generates diverse prompts triggering unsafe responses even on safety-tuned models (Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations). Other works use reinforcement learning or genetic algorithms to evolve prompts that cause misbehavior while appearing benign . These adversarial prompts can take the form of weird token sequences or suffixes that humans might overlook, yet they maximize the likelihood of a forbidden response ( Goal-guided Generative Prompt Injection Attack on Large Language Models). Because they are often unintuitive and model-specific, adversarial prompts highlight weaknesses in the model’s training – they essentially serve as blind spots that attackers can algorithmically discover.

Prompt Leaking

Prompt leaking (or prompt theft) is when an attacker coaxes the LLM into revealing hidden content from its prompt or context that was not meant to be exposed. This could be the system instructions, developer notes, or confidential data earlier in the conversation. Exploiting the model’s tendency to be cooperative, an attacker might ask the LLM to “repeat the previous instructions” or use role-play to trick it into divulging the hidden prompt. A 2024 study demonstrated unintended prompt disclosure in live applications, managing to extract private prompts and secrets from many LLM-powered services ( Prompt Injection attack against LLM-integrated Applications). Successful prompt leaking undermines the confidentiality of the system prompt and can give attackers insight into the model’s guarding policies. In turn, this knowledge can facilitate more effective jailbreaks or targeted injections. Modern LLM services have introduced measures to prevent prompt leaking (for example, disallowing the model from showing system messages), yet researchers in 2024 continue to find novel ways to bypass those restrictions , underscoring that prompt leakage remains an ongoing risk.

Stealthy Prompt Engineering

This class of attack involves crafting prompts that achieve malicious goals covertly, flying under the radar of content filters or human review. Stealthy prompt engineering may involve obfuscating the attack instructions (e.g. using unicode homoglyphs, typos, or code language) or splitting the malicious payload across multiple interactions so that no single input looks suspicious (LLM01:2025 Prompt Injection - OWASP Top 10 for LLM & Generative AI Security). Huang et al. (2024) introduce ObscurePrompt, a technique in which an LLM itself rewrites a known jailbreak prompt into a more obscure form that remains effective but is harder to recognize by filters (Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations). Another strategy uses multiple persona messages or “role play” to socially engineer the model – for example, the Social-Engineer Prompt method mobilizes several fictional user roles to influence the target model’s decisions . Recent attack frameworks also demonstrate Word Games that simultaneously obfuscate the query and expected answer, successfully bypassing safety measures without obvious red flags . The net effect is that the model is duped into misbehaving in a way that is not easily detected by simple keyword-based defenses. Stealthy attacks are particularly concerning because they can evade automated moderation; as one report notes, they “effectively bypass safety alignment measures” by exploiting subtle model vulnerabilities .

Defense Mechanisms

Adversarial Training

One major line of defense is to train or fine-tune LLMs on adversarial examples so they become robust to malicious prompts. In adversarial training, developers compile a dataset of attack prompts (e.g. known jailbreak attempts or generated adversarial prompts) and explicitly teach the model to refuse or safely handle them (Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations). For instance, Mo et al. (2024) propose Prompt Adversarial Tuning (PAT), which learns a special protective prefix that, when prepended to the model’s input, helps it resist jailbreaking attempts during inference . Similarly, safety fine-tuning can involve augmenting RLHF alignment training with hard-to-handle prompts so the model learns to uphold policies even under stress. Kim et al. (2024) show that supplementing a classifier with an adversarially generated training set of “noisy” jailbreaking attempts significantly improves its robustness (Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield - ACL Anthology) . Overall, adversarial training attempts to “bullet-proof” the model by exposing it to as many attack variations as possible beforehand. While this hardens the model, researchers caution it is not a silver bullet – models may still generalize poorly to novel attack styles ( Goal-guided Generative Prompt Injection Attack on Large Language Models), so adversarial training is most effective when combined with other safeguards (e.g. runtime detectors).

Access Control

Access control defenses aim to limit what an LLM can do or access even if it is successfully coerced by a prompt. The guiding principle is least privilege: the model should only have the minimum access and capabilities necessary for its task (GitHub - tldrsec/prompt-injection-defenses: Every practical and proposed defense against prompt injection.). For example, if an LLM is integrated into an application that can send emails or execute code, strict guardrails are put in place so that a prompt-injected hijacking cannot abuse those functions. One best practice is to assume an attacker will hijack the prompt, and design the system such that even in that case the damage is contained . Concretely, developers assign separate API keys or permissions for different LLM actions, ensuring that a compromised prompt cannot perform high-impact operations . If an LLM is meant to read a user’s calendar, it should be prevented by design from creating or deleting events unless explicitly allowed . Another access control strategy is enforcing a strict separation between system instructions and user input. Some research proposes signed or structured prompts, where the system prompt (or any trusted instruction) is cryptographically signed or formatted in a way the model recognizes, so that it will ignore any instruction not bearing the correct signature/format (HERE). By locking down the prompt interface and external privileges, these measures ensure that even a “successful” prompt injection has limited impact on the broader system or data.

Input Sanitization and Preprocessing

Input sanitization involves detecting or neutralizing malicious instructions before they reach the language model. This can range from simple rule-based filtering (e.g. blocking or altering known trigger phrases like “ignore previous instructions”) to more sophisticated transformations of user input. One effective technique is input paraphrasing: automatically rephrasing the user’s query using a separate benign model, then feeding the rephrased version to the main LLM . The idea is that truly adversarial prompts often rely on a very specific token sequence to mislead the model, and a paraphrase will preserve the user’s intent but “break” the hidden attack sequence . Empirical results in 2024 show that paraphrasing can defeat many prompt injections, though it may occasionally alter the nuance of legitimate queries . Other sanitization methods include removing or escaping special tokens and code that could carry instructions, or using retokenization tricks (reordering or segmenting the input text) to disrupt potential attack patterns. Researchers have also proposed taint tracking approaches, which mark user-provided content in the prompt and then restrict the model’s use of that content in certain ways . The downside of aggressive preprocessing is the risk of false positives or reduced model utility, so often these defenses are paired with detection mechanisms that apply them selectively .

Response Filtering and Moderation

Even if a malicious prompt slips through, systems can apply a second line of defense by filtering or post-processing the model’s outputs. This typically involves a safety classifier or a set of rules that examine the LLM’s draft response before it is shown to the user. If the response contains policy violations, the system can refuse to output it or replace it with a safe fallback. Kim et al. (2024) introduced Adversarial Prompt Shield (APS), a lightweight classifier that scans for the subtle signatures of jailbreak prompts in the model’s output (Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield - ACL Anthology). Adversarially trained on diverse attack examples, APS was shown to reduce successful jailbreak outputs by nearly 45% . Another approach is to have the LLM critique itself: one model (or the same model in a second pass) can be asked to evaluate whether the response it produced violates any rules (Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations). Wang et al. (2024) take this further with SelfDefend, a framework that employs a secondary “shadow” LLM to analyze and protect the primary model’s responses in real-time, significantly improving resilience against many attack types . In practice, major providers already use automated moderation services to filter toxic or policy-breaking outputs. The research trend in 2024 is toward more robust, adaptive filters that can catch outputs from stealthy attacks (which may not contain obvious forbidden words). Ultimately, response filtering serves as a fail-safe to prevent the worst-case scenario of a successful prompt hack from reaching end users .

Instruction Tuned Alignment

Aligning an LLM to follow ethical guidelines and ignore malicious instructions is a fundamental preventive measure. Modern chatbots undergo instruction-tuning (e.g. via Reinforcement Learning from Human Feedback) to strongly prefer refusing disallowed requests. However, 2024 studies have confirmed that base alignment alone is insufficient – many aligned models can still be jailbroken with clever prompts (LLM01:2025 Prompt Injection - OWASP Top 10 for LLM & Generative AI Security). To bolster alignment, researchers are exploring techniques like layer-specific model editing and continual preference optimization. Zhao et al. (2024) propose Layer-specific Editing (LED), which fine-tunes certain internal layers of a model to reduce its propensity to comply with unsafe requests, improving the model’s inherent resistance to jailbreak prompts . Another strategy is reward model shaping, where the model’s training is adjusted to heavily penalize any sign of instruction-following that contradicts the system policies. Some 2024 works also leverage multi-round coaching: the model is trained to internally “think step-by-step” about whether a user prompt might be an attack before answering, an approach shown to catch some otherwise missed injections . While these alignment-focused defenses occur during model development rather than at runtime, they are critical – a well-aligned model is the first layer of defense. As one security analysis notes, truly effective jailbreaking prevention likely requires ongoing improvements to the model’s training and safety mechanisms , since attackers continuously discover new exploits.

Monitoring and Real Time Detection

In addition to the above defenses which act on inputs or outputs, organizations are implementing monitoring tools to detect prompt hacking as it happens. This includes logging and analyzing conversation data for signs of attack (e.g. a user repeatedly trying variations of a known jailbreak prompt might trigger an alert or automatic lockout). Real-time detection systems use both heuristics and AI models to flag suspicious interactions. For instance, a defense called “preflight injection testing” appends user input to a harmless test prompt and checks if the LLM’s output deviates unexpectedly – such deviation can indicate an attempted injection (GitHub - tldrsec/prompt-injection-defenses: Every practical and proposed defense against prompt injection.). Another concept is inserting canary tokens (unique markers) in the system prompt; if these markers ever appear in user-visible output, it’s a clear sign that the model was induced to leak the prompt . Researchers Hines et al. (2024) propose spotlighting, where the system clearly labels or segregates content from external sources, making it easier for the model (and monitors) to recognize untrusted instructions and ignore them ( Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems). In multi-agent LLM systems, a recent study even introduced LLM Tagging – tagging each message with its source agent identity – to stem the spread of prompt injections between agents . By combining these techniques with human-in-the-loop oversight for high-stakes use cases, real-time monitoring aims to catch attacks that slip past static defenses. The consensus in 2025 is that a multi-layered defense – detection plus prevention – is necessary, since no single method can reliably stop all prompt hacks ( Systematically Analyzing Prompt Injection Vulnerabilities in Diverse LLM Architectures).

Implementation Details and Frameworks

Deploying the above defenses in practice often requires integrating specialized tools and frameworks. A common approach is to use guardrail libraries that wrap around the LLM. For example, open-source toolkits like NVIDIA NeMo Guardrails and Guardrails AI provide ready-made components to enforce input/output checks and policies . These systems let developers define rules (e.g. prohibited topics, required formats) and will intercept prompts or responses that violate those rules, effectively acting as an application-layer firewall for the LLM . The guardrails operate by scanning user queries for malicious patterns before they reach the model and validating model outputs against expected formats or “contracts” . Another practical technique is sandboxing the LLM’s capabilities. For instance, if the LLM can execute code or use tools, those actions are isolated in a safe environment with controlled permissions. This idea extends to plugins and external API calls triggered by the LLM – each such action should run with a dedicated token or key that grants only the minimal scope needed . In effect, even if an attacker hijacks the model’s intent, they cannot escalate privileges to perform unauthorized operations.

Developers are also increasingly adopting evaluation harnesses to continually test their models against prompt attacks. New benchmarks like JailbreakBench (introduced in 2024) provide standardized attack suites to probe an LLM’s weak points and evaluate the effectiveness of defenses (Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations). Likewise, the GenTel-Bench framework offers a massive repository of 84k prompt injection examples spanning dozens of attack scenarios, which can be used to stress-test models systematically (GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks). Many of these tools are accompanied by open-source code, enabling practitioners to simulate adversarial prompts and refine their mitigation strategies. When it comes to protecting dynamic systems, some organizations implement real-time monitoring dashboards that track metrics like the frequency of refused prompts, flagged outputs by the safety filter, or unusual token sequences, helping detect an ongoing attack early. In summary, effective defense implementation relies on a combination of policy tooling (guardrails, filters), secure software design (least-privilege, sandboxing), and constant validation through red-teaming and benchmarks. The literature emphasizes that these layers must work in concert – for example, a pipeline might first sanitize or rephrase inputs, have the model generate an answer, then a secondary model or filter checks that answer before final delivery (GitHub - tldrsec/prompt-injection-defenses: Every practical and proposed defense against prompt injection.) . Such defense-in-depth is becoming the de facto best practice for any LLM deployment handling untrusted inputs.

Recent Research Highlights 2024-2025

Prompt hacking has been a rapidly evolving research area, and the past year has seen significant advances in understanding attacks and developing defenses. A large-scale study by Benjamin et al. (2024) systematically analyzed 36 different LLMs with 144 known prompt injection tests, finding that 56% of these attacks succeeded across models and that larger models tended to be more vulnerable in certain scenarios ( Systematically Analyzing Prompt Injection Vulnerabilities in Diverse LLM Architectures). This work confirmed that prompt injection is a widespread issue, not limited to a few models, and reinforced the need for multi-layered defenses . On the offensive side, researchers have introduced more automated and powerful attack generation methods. Liu et al. (2024) demonstrated a gradient-based attack that can produce highly effective prompt injections with minimal samples, often bypassing existing safety measures ( Automatic and Universal Prompt Injection Attacks against Large Language Models). Another team (Zhang et al. 2024) developed a goal-guided prompt generation technique that uses a KL-divergence objective to craft adversarial prompts, achieving high success rates against seven different models in a black-box setting ( Goal-guided Generative Prompt Injection Attack on Large Language Models). These approaches move beyond trial-and-error “jailbreaks” toward systematic attack optimization, which is an emerging trend.

In terms of defenses, a notable theme in recent research is the emphasis on detection and evaluation frameworks. Chao et al. (2024) released JailbreakBench, a benchmarking suite to evaluate how well LLMs and defenses hold up against a variety of jailbreak tactics (Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations). Similarly, Li et al. (2024) introduced GenTel-Safe, which includes an advanced detection system (GenTel-Shield) and an extensive benchmark of 84k attack cases for assessing prompt injection risks (GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks). GenTel-Shield was shown to outperform vanilla safety guardrails in catching attacks, highlighting gaps in existing defenses . We’ve also seen creative defense proposals: Hines et al. (2024) proposed “spotlighting” to defend against indirect injections, essentially marking up external content so that the model and a monitoring system can more easily ignore malicious inserts ( Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems). Wang et al. (2024c) presented SelfDefend, using a dual-LLM setup to have one model guard another, which proved effective against many live jailbreak attempts . And in multi-agent systems, Lee & Tiwari (2024) uncovered prompt infection attacks that propagate like a virus between LLM agents, proposing an LLM tagging defense to contain such spread . This points to a broader trend: as LLM deployments become more complex (e.g. with tool use, agent societies, or chain-of-thought reasoning), researchers are proactively identifying new threat vectors and tailoring defenses to them.

Another key development in late 2024 is the focus on stealthy attack mitigation. To address attacks that evade simple filters, Huang et al. (2024a) developed ObscurePrompt to generate stealthy jailbreaks and then used those to adversarially train models, thereby hardening them against obfuscated attacks . There is also progress in robust safety classification – Kim et al.’s APS classifier (NAACL 2024) is one example of a lightweight model that can sit alongside an LLM to catch tricky prompts in real-time (Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield - ACL Anthology). Research like SmoothLLM (2024) has even adapted certified defense techniques from adversarial ML, randomly perturbing input text and aggregating the LLM’s outputs to detect if a prompt is adversarial (GitHub - tldrsec/prompt-injection-defenses: Every practical and proposed defense against prompt injection.). Using such methods, SmoothLLM was able to drive the success rate of many jailbreak attacks down to almost 0%, with provable guarantees on mitigation in some cases . These cutting-edge studies suggest that a combination of strategies – from better training, to smarter real-time detection, to rigorous evaluation – is converging to address prompt hacking. Early 2025 findings indicate that while no single fix exists, the community is steadily closing the gaps. In summary, the recent literature paints a picture of active offense-defense coevolution: as new prompt hacking techniques emerge (jailbreak variants, indirect and multi-modal injections, etc.), so do innovative defenses (structured prompts, ensemble monitors, adversarially trained filters), moving us toward more secure LLM deployments than were possible just a year ago.

Sources

The analysis above is based on research published in 2024 and early 2025, including findings from arXiv preprints, peer-reviewed conference papers, and security analyses ( Systematically Analyzing Prompt Injection Vulnerabilities in Diverse LLM Architectures), among other cited works.

Rohan's Bytes

Discussion about this post

Ready for more?