Table of Contents
Overview and Motivation
Plan-and-Execute Prompting Strategies
Hierarchical Reading and Chunking Techniques
LLM Reading Agents with Memory and Tools
Self-Improvement via Iterative Refinement
Advanced Chunking Algorithms and Implementation Details
Conclusion
Overview and Motivation
Processing lengthy documents with LLMs requires breaking text into manageable chunks because of context-length limits. Naive fixed-size chunking can lose context or omit critical details. Plan-and-execute prompting has emerged as a strategy in which an LLM first plans a course of action (e.g. how to split or retrieve text) and then executes each step, dynamically adjusting to the document structure and the task. This approach helps address issues like the “lost-in-the-middle” effect, where LLMs forget the middle portions of long inputs. Recent research in 2024–2025 explores methods for LLM agents to self-improve their reading by refining chunking strategies on the fly. Below, we review key works, methodologies, and implementations.
Plan-and-Execute Prompting Strategies
Several works introduce prompting frameworks that explicitly guide LLMs to devise a plan for long-text tasks and then carry it out step by step:
PEARL (Sun et al., 2024) – A prompting framework for plan-and-execute reasoning over long documents. It operates in three stages: action mining, plan formulation, and plan execution. For a complex query, PEARL prompts the LLM to decompose the problem into discrete actions (e.g. identify relevant sections, retrieve facts) and then executes these via zero-/few-shot prompts. This yielded improved reasoning on long-document QA, outperforming a direct GPT-4 128k-context baseline on certain comprehension benchmarks.
MetaGPT/“Plan-&-Solve” Agents – Building on ideas from earlier agent frameworks, 2024 systems use an LLM to output a high-level plan (in natural language or JSON) for multi-step tasks, then execute the sub-steps iteratively. For example, an LLM may first outline which document sections to read or which tools to use; each outline item is then executed in turn. This plan-then-execute design often yields more structured and accurate processing than one-shot prompting (LangChain Based Plan & Execute AI Agent With GPT-4o-mini). It has been applied in tool-using agents (e.g. the UltraTool benchmark) and long-context tasks to reduce reasoning errors. However, research notes potential brittleness if the initial plan is flawed (Build an LLM-Powered API Agent for Task Execution), prompting interest in self-correction loops. A minimal sketch of this pattern follows.
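To make the pattern concrete, here is a minimal plan-then-execute loop in Python. It is a sketch under stated assumptions: the `llm` callable stands in for any chat-completion API, and the prompt wording, the section-keyed `sections` dict, and the plan parsing are illustrative rather than taken from any of the systems above.

```python
from typing import Callable, Dict, List

# `llm` is a placeholder for any chat-completion call (prompt in, text out).
LLM = Callable[[str], str]

def plan_and_execute(llm: LLM, question: str, sections: Dict[str, str]) -> str:
    # 1. Plan: ask the model which sections matter for this question, in order.
    plan = llm(
        f"Question: {question}\n"
        f"Available sections: {', '.join(sections)}\n"
        "List, one per line, the section titles to read, most relevant first."
    )
    steps: List[str] = [ln.strip() for ln in plan.splitlines() if ln.strip() in sections]
    # 2. Execute: read each planned section and collect question-relevant notes.
    notes = [
        llm(f"Question: {question}\nSection '{t}':\n{sections[t]}\n"
            "Extract only the facts relevant to the question.")
        for t in steps
    ]
    # 3. Integrate: synthesize a final answer from the collected notes.
    return llm(f"Question: {question}\nNotes:\n" + "\n".join(notes) +
               "\nAnswer using only these notes.")
```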
Hierarchical Reading and Chunking Techniques
To handle text beyond an LLM’s window, two main strategies have been studied:
Hierarchical Summarization: The document is recursively summarized. For example, split a book or report into chapters, summarize each, then merge those summaries into a higher-level summary, repeating as needed. This “chunk and merge” approach ensures global coverage. An ICLR 2024 study evaluated hierarchical merging versus incremental updating for book-length summarization. Hierarchical merging (summarize chunks, then merge the summaries) produced more logically coherent global summaries, while incremental updating (read chunks sequentially, continuously updating a single summary) preserved more detail. This highlights a trade-off between coherence and informativeness: hierarchical prompts can lose fine details, whereas incremental methods risk drifting off-topic without careful alignment.
Adaptive Chunk Sizing: Researchers emphasize that the optimal chunk size varies by task. Larger chunks carry more context (better for summarization), whereas smaller chunks focus on specifics (better for precise Q&A) (Mastering Chunking Techniques for LLM Applications in 2025 | PuppyAgent). Dynamic chunking algorithms have been proposed to adjust segment boundaries based on content. For instance, Chang et al. (2024) found that increasing chunk size did not improve hierarchical summarization quality but did benefit certain models in incremental reading. Emerging guidelines suggest using overlapping chunks to preserve context across segment boundaries, but not so much overlap that it causes redundancy or confuses the model. In practice, systems often perform an initial pass to determine logical boundaries (e.g. by headings or semantic shifts) and then feed chunks to the LLM in a planned sequence. Both ideas are sketched below.
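In the sketch below, `split_with_overlap` produces fixed-size chunks with a configurable overlap, and `hierarchical_summarize` performs the summarize-then-merge recursion. The `llm` callable is again a placeholder for any completion API, and sizes are counted in characters for simplicity where a production system would count tokens.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # placeholder for any completion API (assumption)

def split_with_overlap(text: str, chunk_size: int = 2000, overlap: int = 200) -> List[str]:
    """Fixed-size chunks; the overlap lets context carry across boundaries."""
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

def hierarchical_summarize(llm: LLM, text: str, chunk_size: int = 2000) -> str:
    """Summarize each chunk, then recursively merge the partial summaries."""
    chunks = split_with_overlap(text, chunk_size)
    partials = [llm(f"Summarize this passage:\n{c}") for c in chunks]
    merged = "\n".join(partials)
    if len(merged) > chunk_size:  # still too long: merge another level up
        return hierarchical_summarize(llm, merged, chunk_size)
    return llm(f"Combine these partial summaries into one coherent summary:\n{merged}")
```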
LLM Reading Agents with Memory and Tools
Inspired by how humans skim and zoom in on text, several 2024 agent architectures let an LLM create its own working memory or navigate text via tools:
ReadAgent (Lee et al., 2024) – A “human-inspired” reading agent that stores compressed gist memories of text and consults the original document on demand (A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts). It segments a long document into pieces (e.g. pages), summarizes each into a concise textual “gist,” and, when a question is asked, the LLM decides which sections to look up in full for details. This interactive approach achieved higher QA accuracy than using either the full text or only the summaries, and it effectively extended the usable context length by up to 20× in experiments. The agent decides what to remember or re-read, embodying a self-refining strategy: if an answer is not found in the gists, it fetches the relevant chunk from the original text (see the sketch after this list).
GraphReader (Li et al., 2024) – A graph-based LLM agent that organizes document knowledge into a graph of nodes and links to enable non-linear traversal. Key entities or concepts become graph nodes with associated “atomic facts,” and edges represent relations. Given a query, the agent plans a traversal (jumping between nodes) to collect supporting facts, guided by a step-by-step reasoning script. This structured plan allows it to answer complex multi-hop questions efficiently. Remarkably, GraphReader with a standard 4K context window matched or exceeded GPT-4-128k on several long-context QA benchmarks by focusing only on relevant subgraphs. This demonstrates the power of planning a reading order: the agent chooses which part of the graph (hence which chunk of text) to explore next, refining its path based on intermediate findings (a traversal sketch also follows the list).
GraphRAG (Edge et al., 2024) – Instead of a live agent, this approach builds a graph index offline to aid retrieval. First, it uses an LLM to extract an entity-level knowledge graph from documents, then generates summaries for clusters of related entities. At query time, relevant summaries are retrieved and combined to form the answer. By planning retrieval around a graph of entities, GraphRAG can handle questions that require piecing together information spread across a document collection. It showed strong results on query-focused summarization by preserving global context in the graph.
LongRAG (Jiang et al., 2024) – A complementary strategy to the above is simply using bigger chunks with specialized retrievers. LongRAG introduced a “long retriever” that can handle very large segments and a “long reader” model, allowing source texts to be chunked into larger units than usual. Fewer, larger chunks mean less fragmentation of context (reducing the number of retrieval calls). However, this shifts the burden to the model’s ability to ingest long inputs. In practice, such designs are often combined with planning: e.g. first decide whether a query is broad (needs scanning many chunks) or specific (one large chunk may suffice), then execute the appropriate retrieval strategy.
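For illustration, the gist-memory pattern behind ReadAgent can be sketched roughly as follows. This is not the authors’ code: the prompts, the regex-based answer parsing, and the three-page lookup budget are all assumptions.

```python
import re
from typing import Callable, List

LLM = Callable[[str], str]  # placeholder for any chat-completion API (assumption)

def build_gists(llm: LLM, pages: List[str]) -> List[str]:
    # One short gist per page; the full pages are kept for later lookup.
    return [llm(f"Compress this page into two or three sentences:\n{p}") for p in pages]

def answer_with_gists(llm: LLM, pages: List[str], gists: List[str], question: str) -> str:
    numbered = "\n".join(f"[{i}] {g}" for i, g in enumerate(gists))
    # The model decides, from the gists alone, which pages to re-read in full.
    choice = llm(f"Question: {question}\nPage gists:\n{numbered}\n"
                 "Reply with the numbers of up to 3 pages you need in full.")
    picked = [int(n) for n in re.findall(r"\d+", choice) if int(n) < len(pages)][:3]
    detail = "\n\n".join(pages[i] for i in picked)
    return llm(f"Question: {question}\nGists:\n{numbered}\n"
               f"Expanded pages:\n{detail}\nAnswer:")
```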
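In the same spirit, a GraphReader-style traversal can be sketched as an LLM choosing, hop by hop, which neighboring node to expand next. The graph representation (a dict of nodes with atomic facts and neighbor lists), the hop budget, and the STOP convention are simplified assumptions, not the paper’s implementation.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # placeholder for any chat-completion API (assumption)
# node name -> {"facts": [atomic facts], "neighbors": [node names]}
Graph = Dict[str, Dict[str, List[str]]]

def traverse_and_answer(llm: LLM, graph: Graph, question: str,
                        start: str, max_hops: int = 5) -> str:
    node, facts = start, []
    for _ in range(max_hops):
        facts.extend(graph[node]["facts"])  # collect this node's atomic facts
        neighbors = graph[node]["neighbors"]
        move = llm(f"Question: {question}\nFacts so far:\n" + "\n".join(facts) +
                   f"\nNeighbors of '{node}': {', '.join(neighbors)}\n"
                   "Name one neighbor to explore next, or say STOP if you can answer.").strip()
        if move == "STOP" or move not in neighbors:
            break
        node = move  # the reading plan is refined mid-traversal
    return llm(f"Question: {question}\nSupporting facts:\n" + "\n".join(facts) + "\nAnswer:")
```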
Self-Improvement via Iterative Refinement
A hallmark of recent systems is the ability to dynamically refine the chunking and retrieval strategy based on partial results or model feedback:
RePrompting & In-Context Retrieval (R&R) – Agrawal et al. (EMNLP 2024) introduced R&R, a two-step prompting method that injects periodic reminders and retrieval steps during generation. The LLM is prompted after every r tokens to recall the instructions and consider whether more information is needed, and it can retrieve additional pages “in-context” from the document. Empirically, R&R achieved a 16-point QA accuracy boost on 80k-token documents by softening the accuracy-vs-cost trade-off: it enables using larger chunks (fewer calls) without losing as much accuracy. In effect, the LLM’s own uncertainty triggers it to fetch or attend to new chunks, a form of self-improvement loop at inference time.
Iterative Retrieval Agents: A number of 2024 works let the LLM decide when to retrieve more text mid-task. For example, Self-RAG (Asai et al., 2024) and MIGRES prompt the model to explicitly decide whether additional retrieval is needed after an initial answer attempt. Similarly, FLARE (Jiang et al., 2023b) monitors the LLM’s token probabilities during generation and triggers a search when confidence is low. These agents essentially plan in real time: if the answer is incomplete or uncertain, they revise the plan by pulling in new information. Adaptive workflows like Adaptive-RAG (Jeong et al., 2024) and MBA-RAG (Tang et al., 2025) generalize this idea by using a controller that selects different retrieval routes depending on the query type or the evidence gathered so far. A recent framework, OkraLong (Tang et al., 2025), formalizes this with separate modules: an Analyzer classifies the query and assesses whether the initially retrieved chunks are sufficient; an Organizer then orchestrates a plan (e.g. “split the query into sub-questions and retrieve separately”); and an Executor runs the plan. This modular plan-refine-execute loop allowed OkraLong to handle multi-hop queries flexibly, with higher accuracy than static retrieval pipelines (a minimal retrieve-on-demand loop is sketched after this list).
Reflexion and Self-Feedback: Beyond retrieval, LLM agents can self-evaluate their outputs and adjust accordingly. Techniques like Self-Refine (Madaan et al., 2024) use the model’s feedback on its own answer to iterate improvements. In document QA, this can mean the agent checks whether the produced answer actually resolves the question using the provided chunks and, if not, adjusts: for example, by reading additional sections or re-chunking a confusing passage. While not specific to long documents, such self-correction mechanisms complement plan-and-execute: the plan is not fixed but revised as needed. In practice, combining reflection with dynamic retrieval has proven effective at reducing omissions. For instance, Press et al. (2023) showed that allowing GPT-4 to self-ask follow-up questions (essentially re-planning sub-queries) and then answer them can significantly improve accuracy (a critique-and-revise sketch also follows this list).
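A minimal retrieve-on-demand loop might look like the sketch below. Rather than monitoring token probabilities (which requires log-prob access, as FLARE does), this variant simply asks the model to declare what is missing; the `retrieve` callable and the “NEED:” convention are assumptions for illustration.

```python
from typing import Callable, List

LLM = Callable[[str], str]                    # placeholder chat-completion call (assumption)
Retriever = Callable[[str, int], List[str]]   # (query, k) -> text chunks (assumption)

def answer_with_on_demand_retrieval(llm: LLM, retrieve: Retriever,
                                    question: str, max_rounds: int = 3) -> str:
    context = retrieve(question, 2)
    for _ in range(max_rounds):
        reply = llm(f"Question: {question}\nContext:\n" + "\n---\n".join(context) +
                    "\nIf the context suffices, answer. Otherwise reply exactly "
                    "'NEED: <search query>' describing what is missing.")
        if not reply.startswith("NEED:"):
            return reply                      # the model judged its evidence sufficient
        context += retrieve(reply[len("NEED:"):].strip(), 2)  # revise the plan
    return llm(f"Question: {question}\nContext:\n" + "\n---\n".join(context) +
               "\nAnswer as best you can.")
```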
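And a Self-Refine-style critique-and-revise loop for document QA could be sketched as follows, with all prompt wording assumed:

```python
from typing import Callable

LLM = Callable[[str], str]  # placeholder chat-completion call (assumption)

def self_refine_answer(llm: LLM, question: str, context: str, max_iters: int = 2) -> str:
    answer = llm(f"Context:\n{context}\nQuestion: {question}\nAnswer:")
    for _ in range(max_iters):
        critique = llm(f"Question: {question}\nDraft answer: {answer}\n"
                       "Does the draft fully resolve the question given the context? "
                       "Reply OK, or describe what is missing.")
        if critique.strip().upper().startswith("OK"):
            break                             # the draft passed its own review
        answer = llm(f"Context:\n{context}\nQuestion: {question}\n"
                     f"Previous draft: {answer}\nCritique: {critique}\n"
                     "Write an improved answer.")
    return answer
```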
Advanced Chunking Algorithms and Implementation Details
Implementing these strategies in real systems involves careful text preprocessing and use of external tools:
Learned Segmentation: Researchers have begun exploring ML-driven chunking. LumberChunker (Duarte et al., 2024) leverages an LLM to iteratively find optimal segmentation points in a long narrative (Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception). It prompts the model with a block of text, asks where the content “begins to diverge,” splits at that point, and repeats on subsequent segments. This approach produced more semantically coherent chunks than fixed-size or simple semantic splitting, but it relies on a powerful model (the authors used GPT-4-class APIs) and was slow, highlighting the cost of using LLMs themselves for chunking. To reduce dependence on expensive models, Meta-Chunking (Chen et al., 2024) introduces a perplexity-based method: it computes the language-model perplexity of each sentence in context and marks sharp increases as boundary signals. This technique groups sentences with strong logical cohesion into “meta-chunks,” enabling smaller models to chunk nearly as well as large LLMs. Meta-Chunking can dynamically produce fine-grained chunks and then merge them to a desired granularity (fine for precise retrieval, coarser for summarization). Experiments on multiple datasets showed this method improved downstream QA performance over rule-based splitting while reducing computation cost (a loose sketch of the perplexity signal appears at the end of this section).
Practical Pipelines: In real-world systems, plan-and-execute prompting often integrates with libraries like LangChain or LlamaIndex. These provide agent executors, text splitters, and retrievers that implement the above ideas. For instance, LangChain’s experimental plan-and-execute agent API allows a developer to define a planning prompt (to divide a task or text) and an execution prompt template for sub-tasks. The LLM’s plan (often a list of steps or chunk identifiers) is parsed and each step is fed to a new prompt. In a document QA setting, a typical implementation might:
Plan: Use an LLM to read a document’s table of contents or skim the text, and produce a list of relevant sections or page numbers for the query at hand.
Execute: For each section, either retrieve it and answer part of the query, or further summarize it if the final task is summarization. Tools like vector databases (for semantic search) or OCR (for digitizing scanned pages) may be invoked at this stage.
Integrate: Combine the answers or summaries from each chunk. This could be a final LLM call that takes the collected information and composes a coherent result (see the pipeline sketch at the end of this section).
Example – Digitization Pipeline: In a document digitization scenario (e.g. analyzing a scanned PDF), an agent might first plan how to segment the document – for example, by detecting logical divisions (sections, chapters) via an OCR layout analysis. Then, chunk by chunk, it executes text extraction and understanding: using an LLM to transcribe each chunk (if OCR text is ambiguous) and summarize or index it. If the LLM encounters an unreadable part or ambiguous reference, it can loop back (self-refine) and adjust the chunking (perhaps splitting that page differently or using an image recognition tool on a figure). Such a system continuously learns which chunk sizes and strategies yield the best accuracy, adjusting chunk boundaries or retrieval density with feedback over time. This aligns with the iterative improvement practices recommended in recent literature (Mastering Chunking Techniques for LLM Applications in 2025 | PuppyAgent) – evaluate chunking performance on each task and refine the approach.
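As promised above, here is one loose reading of Meta-Chunking’s perplexity signal, as a sketch rather than the paper’s algorithm: a new chunk starts wherever a sentence’s perplexity, given the running chunk as context, rises sharply. The `ppl` scorer is a hypothetical callable you would back with any causal LM’s token log-probabilities, and the `jump` threshold is arbitrary.

```python
from typing import Callable, List

# Hypothetical scorer: perplexity of `sentence` given the preceding `context`,
# computable from any causal LM's token log-probabilities (assumption).
PPLScorer = Callable[[str, str], float]

def perplexity_chunk(sentences: List[str], ppl: PPLScorer,
                     jump: float = 1.5) -> List[List[str]]:
    """Start a new chunk where perplexity rises sharply vs. the previous sentence."""
    if not sentences:
        return []
    chunks: List[List[str]] = []
    current = [sentences[0]]
    prev = ppl(sentences[0], "")
    for sent in sentences[1:]:
        score = ppl(sent, " ".join(current))
        if score > prev * jump:   # sharp rise: weak cohesion with the running chunk
            chunks.append(current)
            current = [sent]
        else:
            current.append(sent)
        prev = score
    chunks.append(current)
    return chunks
```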
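Finally, wiring the Plan/Execute/Integrate steps into one pipeline might look like the following sketch. The `search` callable stands in for a vector-database query over indexed (possibly OCR’d) chunks, and every prompt string is an illustrative assumption rather than a LangChain or LlamaIndex API.

```python
from typing import Callable, List

LLM = Callable[[str], str]                 # placeholder chat-completion call (assumption)
Search = Callable[[str, int], List[str]]   # (query, k) -> indexed chunks (assumption)

def document_qa(llm: LLM, search: Search, toc: List[str], question: str) -> str:
    # Plan: choose relevant sections from the table of contents.
    plan = llm(f"Question: {question}\nTable of contents:\n" + "\n".join(toc) +
               "\nList the section titles worth reading, one per line.")
    targets = [ln.strip() for ln in plan.splitlines() if ln.strip() in toc]
    # Execute: retrieve chunks for each planned section and answer in part.
    partials = [
        llm(f"Question: {question}\nExcerpts from '{t}':\n" +
            "\n".join(search(f"{t}: {question}", 3)) +
            "\nAnswer whatever part of the question these excerpts cover.")
        for t in targets
    ]
    # Integrate: compose the partial answers into one result.
    return llm(f"Question: {question}\nPartial answers:\n" + "\n".join(partials) +
               "\nCompose a single coherent answer.")
```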
Conclusion
Plan-and-execute prompting has become a cornerstone technique for enabling LLMs to tackle long, complex documents. The latest research (2024–2025) demonstrates that giving an LLM agent the capability to strategize about what to read, and in what order, can dramatically improve outcomes in document question answering and summarization. Key advances include hierarchical chunking methods to maintain context, agent architectures that mimic human reading with memory or graph traversal, and self-correcting loops that dynamically refine the process. These approaches are being translated into practical systems via prompting frameworks and pipeline tools, making LLMs more adept at self-directed document digestion. An exciting direction going forward is the convergence of these ideas: agents that not only plan and execute but also learn from each execution to update their future chunking strategies, becoming truly self-improving document analyzers. Researchers are already exploring such feedback-driven agents, aiming for LLM systems that become more efficient readers the more they work, continuously tuning their own “plan and read” tactics. Combining plan-and-execute prompting with adaptive chunking and retrieval will likely be pivotal in pushing LLMs toward ever larger and more complex texts while maintaining high fidelity and accuracy.