Introduction
Predictive analytics, especially time-series forecasting, is a cornerstone for decision-making in industries like banking, retail, and logistics. Banks forecast financial risks and market trends; retailers predict product demand to optimize inventory; logistics and supply chain planners anticipate shipments and capacity needs. Traditionally, these forecasts rely on statistical models or specialized machine learning, but recent advances in Large Language Models (LLMs) have opened new possibilities. LLMs such as GPT-style transformers have revolutionized natural language processing (NLP) with their ability to learn complex patterns and generalize across tasks. This has sparked a pressing question: can the same architectures and techniques that excel in NLP be adapted to improve forecasting of numerical time-series data? Researchers in 2024 began exploring this idea in earnest (TimeGPT: The First Foundation Model for Time Series Forecasting | The Forecaster), aiming to create foundation models for time-series analogous to those in NLP. The promise is compelling – an LLM trained on vast amounts of sequence data might produce accurate forecasts zero-shot, without task-specific training, and handle diverse domains with ease (Time series forecasting with LLM-based foundation models and scalable AIOps on AWS | AWS Machine Learning Blog). In this blog, we delve into how LLM architectures (GPT-style transformers and fine-tuned variants) are being leveraged for time-series and tabular forecasting, the technical adaptations involved, and the practical impact on demand prediction, financial risk modeling, inventory optimization, and supply chain planning.
Bridging NLP Techniques with Time-Series Data
At first glance, human language and time-series data seem quite different – one deals with words and sentences, the other with numerical measurements over time. However, fundamentally both are sequential data domains. An LLM that reads text is essentially decoding a sequence of tokens to predict the next token; a forecasting model reads a history of data points to predict future points. This parallel insight was captured by an Amazon AWS team in 2025: “both LLMs and time series forecasting aim to decode sequential patterns to predict future events” . In other words, forecasting can be viewed as a sequence continuation problem, much like language modeling.
Key challenges remain when applying NLP models to time-series: time-series values are continuous and often real-valued (e.g. sales dollars, sensor readings), not discrete tokens like words. Also, time-series have temporal dynamics like trends, seasonality, and irregular intervals that text doesn’t. To bridge these gaps, researchers adapt NLP techniques in creative ways:
Sequence Encoding: In NLP, text is tokenized (broken into words/subwords) and then fed through embedding layers. For time-series, one approach is to create an analogous “vocabulary” for numeric data. For example, AWS’s Chronos model converts continuous time series into a discrete token sequence by first normalizing (scaling by the series’ mean) and then quantizing the values into fixed bins (Time series forecasting with LLM-based foundation models and scalable AIOps on AWS | AWS Machine Learning Blog) . Each binned value is treated like a “word” from a finite vocabulary that a transformer can ingest.
Transformer Architecture: The self-attention based transformer, originally a milestone in NLP, has been repurposed for time-series for a few years now. Early works (like TFT, Informer, Autoformer, etc.) applied transformers directly to time data but often needed modifications for long sequences or irregular timing. Notably, the idea of splitting a time-series into patches (contiguous subsequences) has improved transformer performance on long series. This concept, borrowed from computer vision and popularized in time-series by PatchTST, treats each patch of length P as a single token to reduce sequence length. Recent LLM-based forecasters also adopt this idea – for instance, Google’s 2024 foundation model explains that it “treat[s] a patch (a group of contiguous time-points) as a token” for the transformer input (A decoder-only foundation model for time-series forecasting) . Each token (patch) flows through stacked attention layers just like words in a sentence.
Crucially, positional encodings and other architectural tweaks handle the order of observations and their time indices (e.g. day-of-week, hour-of-day may be encoded as features akin to “context” for the sequence). By viewing a time-series as analogous to a language sequence, we can begin to apply LLM architectures – but we must still account for numeric magnitude and continuity, which pure NLP models aren’t built for. In the next section, we discuss how state-of-the-art approaches adapt GPT-style models to overcome these challenges.
LLM Architectures for Forecasting: Key Approaches
(A decoder-only foundation model for time-series forecasting) Figure: Example transformer-based architecture for time-series forecasting (Google’s TimesFM). Input time-series are chunked into patches (yellow “Token + PE” blocks) with positional encodings, passed through multiple layers of causal self-attention (SA) and feed-forward networks, and produce output patches for future predictions. Longer output patches allow forecasting further in fewer steps .
There are two broad paradigms for using LLM architectures in forecasting: (1) training a new decoder-only transformer on time-series data (in essence, building a GPT for time-series from scratch), and (2) reprogramming or fine-tuning an existing LLM (pre-trained on text or other data) to handle time-series inputs and outputs. Both leverage the powerful sequence modeling of transformers, but differ in data requirements and how much they modify the model’s internals.
Decoder-Only “GPT-Style” Models for Time-Series
The first approach treats forecasting as a classic language modeling problem and uses a causal decoder transformer architecture (like GPT) trained on a vast collection of time-series data. A prominent example is Google Research’s TimesFM (Time-Series Foundation Model) introduced in 2024 (A decoder-only foundation model for time-series forecasting). TimesFM uses a decoder-only transformer with about 200 million parameters (significantly smaller than GPT-3, but large by forecasting standards) and was pre-trained on a corpus of 100 billion time points from diverse real-world sources. The model learns to take in a sequence of past values and generate the sequence of future values, one patch at a time, similar to how GPT generates text continuations.
To handle continuous values, TimesFM employs the patching idea: an input patch (say 32 time steps) is embedded into a token vector via a small neural network, which is then combined with a positional encoding and fed into the transformer stack . The transformer’s output token is then decoded by another small network to produce an output patch (e.g. 128 time steps) . This asymmetry (output patch larger than input patch) lets the model forecast far into the future with fewer iterative steps, which “yields better performance for long-horizon forecasting” by reducing error accumulation . Essentially, the model can jump 128 steps ahead in one go, rather than forecasting step-by-step.
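To make this concrete, here is a minimal PyTorch sketch of such a patch-to-token scheme. It is not the actual TimesFM code: the layer sizes are made up, positional encodings and the residual-block embedding are omitted, and the PatchForecaster class exists only for illustration.

import torch
import torch.nn as nn

class PatchForecaster(nn.Module):
    """Toy decoder-style patch forecaster (illustrative sketch, not TimesFM itself)."""
    def __init__(self, input_patch=32, output_patch=128, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.input_patch = input_patch
        self.embed = nn.Linear(input_patch, d_model)          # patch -> token embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # stacked self-attention blocks
        self.head = nn.Linear(d_model, output_patch)          # token -> output patch (128 steps)

    def forward(self, history):                 # history: (batch, context_len)
        b, t = history.shape
        patches = history.reshape(b, t // self.input_patch, self.input_patch)
        tokens = self.embed(patches)            # positional encodings omitted for brevity
        n = tokens.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)  # no peeking ahead
        hidden = self.backbone(tokens, mask=causal)
        return self.head(hidden[:, -1])         # forecast the next output_patch steps in one go

model = PatchForecaster()
forecast = model(torch.randn(8, 256))           # 256 past steps -> 128 future steps per series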
One key difference from text LLMs is that these models must flexibly handle variable context and horizon lengths – e.g. you might have 3 years of history to predict 1 year ahead in one case, or 1 week of history to predict 1 day ahead in another. TimesFM addresses this via its training regimen: it was trained on many different context/horizon splits (predicting 1 step ahead, 2 steps ahead, … up to long ranges) across numerous series (A decoder-only foundation model for time-series forecasting) . This teaches the model to generalize to arbitrary forecasting windows. Another practical addition is a frequency indicator input; TimesFM expects a code for the data frequency (e.g. daily vs monthly vs yearly) as part of input, since patterns and reasonable horizons differ by frequency (google/timesfm-1.0-200m · Hugging Face) . For instance, it uses 0 for high-frequency (daily/hourly) data, 1 for medium (weekly/monthly), 2 for low (quarterly/yearly) – this contextual info helps it adjust its predictions.
The result of such training is a true foundation model for forecasting: TimesFM can take a never-seen-before series and produce a decent forecast without any gradient updates (i.e. zero-shot) . In their evaluations, the authors show TimesFM’s zero-shot accuracy on many benchmark datasets comes close to (and sometimes rivals) that of specialized models trained on those datasets . Similarly, an independent effort by Nixtla (a time-series AI company) produced TimeGPT in late 2023, which was “the first time series foundation model capable of zero-shot inference” (TimeGPT: The First Foundation Model for Time Series Forecasting | The Forecaster) . These models signal a shift towards GPT-style pre-training for time-series, leveraging broad patterns learned from massive data (e.g. patterns from Wikipedia page views or Google Trends can inform retail sales behavior ).
Other organizations have followed suit. AWS’s Chronos (announced 2025) is described as a family of time-series models using LLM architectures, pre-trained on large datasets to generalize across domains (Time series forecasting with LLM-based foundation models and scalable AIOps on AWS | AWS Machine Learning Blog) . Chronos takes the numeric-to-token approach: it “treat[s] time series data as a language” by discretizing continuous values into tokens . Specifically, Chronos scales each series by its mean and then quantizes it into a fixed number of bins (categories), turning the sequence into a stream of token IDs . This token sequence can then be fed to a standard transformer decoder (architecturally similar to GPT-2/GPT-3) which is trained to predict the next token. By pre-training such models on diverse data and fine-tuning when needed, Chronos demonstrated strong zero-shot and few-shot forecasting performance, reportedly outperforming task-specific models on most benchmarks .
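As a rough illustration of this tokenization step, the sketch below mean-scales a series and buckets it into a small integer vocabulary. The bin count, clipping range, and helper names (tokenize_series, detokenize) are arbitrary choices for the example, not Chronos’s actual configuration.

import numpy as np

def tokenize_series(values, n_bins=1024, clip=10.0):
    """Mean-scale a series and quantize it into integer token IDs (Chronos-style sketch)."""
    values = np.asarray(values, dtype=float)
    scale = np.mean(np.abs(values)) or 1.0           # avoid division by zero
    scaled = np.clip(values / scale, -clip, clip)
    edges = np.linspace(-clip, clip, n_bins - 1)     # uniform bin edges
    tokens = np.digitize(scaled, edges)              # token IDs in [0, n_bins - 1]
    return tokens, scale, edges

def detokenize(tokens, scale, edges):
    """Map token IDs back to approximate values using bin centers."""
    centers = np.concatenate([[edges[0]], (edges[:-1] + edges[1:]) / 2, [edges[-1]]])
    return centers[tokens] * scale

sales = [120, 135, 128, 150, 160, 142]
tokens, scale, edges = tokenize_series(sales)
approx = detokenize(tokens, scale, edges)            # shows the quantization error vs. the original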
In summary, decoder-only GPT-style models for forecasting use pure transformer architectures with innovations in how data is represented (patch embeddings, quantization, frequency/context flags) to make numeric time-series look like “language” to the model. They are trained on huge corpora of time-series, analogous to how GPT-3 was trained on a huge text corpus. These models align with the vision of forecasting foundation models: one model that can be applied to many predictive analytics tasks (demand, finance, etc.) with minimal or no retraining.
Reprogramming and Fine-Tuning Pre-trained LLMs
An alternative route to bring LLM power to forecasting is to start with an existing LLM (trained on text or code) and adapt it to produce forecasts. This avoids training from scratch, instead leveraging knowledge already present in a large model. The challenge is that a language-trained model doesn’t natively “speak” the language of time-series. Enter the concept of reprogramming: essentially, translate the forecasting problem into a form the LLM can understand.
A seminal work in this vein is Time-LLM (Jin et al., ICLR 2024) ( Time-LLM: Reprogram an LLM for Time Series Forecasting). Time-LLM is not a single monolithic model but a framework that wraps around a frozen language model. As the Nixtla documentation puts it, “Time-LLM transforms a forecasting task into a ‘language task’ that can be tackled by an off-the-shelf LLM.” (Time-LLM - Nixtla)
Time-series patches are first mapped into the LLM’s embedding space by a small trainable “reprogramming” layer that aligns them with the model’s learned token representations, optionally preceded by a short text prompt describing the task and data. The core pre-trained LLM then processes this sequence of embeddings as if it were a sentence. Since the LLM was trained on human text, it will treat the input as some kind of textual sequence. The hope (borne out by the research) is that the LLM’s powerful sequence modeling and reasoning can be repurposed – it will generate continuations in the embedding space that correspond to forecasted values.
An output projection (or “flatten head”) layer then takes the LLM’s output embeddings and converts them back to numeric predictions (essentially the inverse of the input embedding step) (Time-LLM: Reprogram an LLM for Time Series Forecasting). For example, if the LLM outputs what it thinks is the next “token” embedding, this layer maps that to a forecast value or patch of values. (Time-LLM: Reprogram an LLM for Time Series Forecasting)
The key benefit of this approach is that we can exploit very large LMs (billions of parameters) which would be infeasible to train purely on time-series data. For instance, one could take a pretrained GPT-3 or LLaMA-7B and, with relatively light additional training, get it to produce forecasts. Time-LLM’s authors demonstrated using LLaMA-7B and even smaller models like GPT-2 for various forecasting tasks by training only the small adapter layers (GitHub - KimMeen/Time-LLM: [ICLR 2024] Official implementation of "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models"). The downside is that the model may still carry some baggage from its original domain (language). Indeed, one must carefully design the “language cast” of the data. If done naively, the LLM might try to form English sentences or output numbers in word form. Techniques like quantization (mapping values to tokens) or formatting the input as a table of values in text can help. Chronos, for example, uses a discrete vocabulary so that an LLM can be directly fed the tokens (Time series forecasting with LLM-based foundation models and scalable AIOps on AWS | AWS Machine Learning Blog) (the LLM then essentially learns the “language” of those tokenized time-series).
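The sketch below conveys the general shape of this family of methods: a frozen GPT-2 backbone with small trainable projection layers on either side. It is a simplified stand-in for Time-LLM’s reprogramming and flatten-head layers (no text prototypes or prompt prefixes), and the class name and sizes are illustrative only.

import torch
import torch.nn as nn
from transformers import GPT2Model

class ReprogrammedForecaster(nn.Module):
    """Frozen LLM backbone with trainable input/output projections (simplified sketch)."""
    def __init__(self, patch_len=16, horizon=24):
        super().__init__()
        self.llm = GPT2Model.from_pretrained("gpt2")
        for p in self.llm.parameters():            # freeze the language model
            p.requires_grad = False
        d = self.llm.config.n_embd                 # 768 for GPT-2 small
        self.patch_len = patch_len
        self.in_proj = nn.Linear(patch_len, d)     # time-series patches -> LLM embedding space
        self.out_proj = nn.Linear(d, horizon)      # LLM output -> forecast ("flatten head")

    def forward(self, history):                    # history: (batch, context_len)
        b, t = history.shape
        patches = history.reshape(b, t // self.patch_len, self.patch_len)
        embeds = self.in_proj(patches)
        hidden = self.llm(inputs_embeds=embeds).last_hidden_state
        return self.out_proj(hidden[:, -1])        # forecast `horizon` future steps

model = ReprogrammedForecaster()
y_hat = model(torch.randn(4, 96))                  # 96 past steps -> 24-step forecast

Only in_proj and out_proj receive gradients during training; the language model itself stays frozen, which is what keeps this approach cheap relative to full fine-tuning.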
Beyond Time-LLM, other fine-tuning strategies involve training an LLM on a concatenation of numerical data and possibly text. For example, one could fine-tune GPT-2 on sequences like "... 100, 105, 110, 108, 112, 115, 120, [MASK]"
where the task is to fill in or continue with the next values. With libraries like Hugging Face Transformers and PEFT (Parameter-Efficient Fine-Tuning), practitioners in 2024 have experimented with adding LoRA adapters to LLMs to inject time-series knowledge without full retraining. These methods often require significant data for the model to adjust to numeric patterns, and careful formatting to avoid the model drifting into natural language output instead of pure numbers.
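A minimal sketch of that recipe with Hugging Face Transformers and PEFT might look like the following; the serialization format, the “Daily sales” prompt, and the LoRA hyperparameters are illustrative assumptions rather than a recommended configuration.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Attach small LoRA adapters to the attention weights; the base model stays frozen.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, target_modules=["c_attn"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # only the adapter weights are trainable

# Serialize a numeric history as text; the model is trained to continue the sequence.
history = [100, 105, 110, 108, 112, 115, 120]
prompt = "Daily sales: " + ", ".join(str(v) for v in history) + ","
inputs = tokenizer(prompt, return_tensors="pt")

# From here, a standard causal-LM fine-tuning loop (e.g. with Trainer) would run over
# many such serialized windows, with the future values as the continuation targets.
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss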
A simpler but increasingly popular tactic is to use LLMs via prompting alone for forecasting tasks. Instead of training anything, one crafts a prompt that includes recent data points and asks the LLM (via its API) to predict the next points. This was illustrated in a 2025 case study by Nguyen T. Lai et al., where they tackled a complex warehouse shipment forecasting problem by querying an LLM (Anthropic’s Claude model) with a carefully engineered prompt (Forecasting Shipments with LLMs. When traditional methods fall short… | by Nguyen Thanh LAI | IRIS by Argon & Co | Medium) . They provided the historical data (as text/numbers) and a detailed description of patterns (seasonal effects, trends) and literally asked the model to output the forecast. Interestingly, they note that building such an LLM solution shifts a lot of the work from traditional feature engineering to prompt engineering: “LLM development shifts the focus to data analysis from a subject matter expert perspective, followed by prompt engineering... This approach gives us more precise control over the model’s behavior.” . In other words, the human analyst extracts the key patterns and explicitly feeds them to the LLM in the prompt (like saying “Mondays are usually busiest, and sales spike on Black Friday”), which the LLM then uses to extrapolate.
Prompt-based forecasting is appealing for its rapid iteration – if the model’s output seems off, you can refine the prompt or add clarifications without retraining. The IRIS/Argon & Co. study found that their LLM could adapt to certain regime changes faster than ML models: e.g. when a weekly pattern suddenly shifted, they “quickly modified [the LLM] to accommodate immediate changes (like instructing the LLM about the peak shift)”, whereas a traditional model would need to see more data over a full season to re-learn this shift . This highlights how LLMs bring a form of reasoning and interpretability to forecasting – you can directly tell the model about an event or anomaly, much like instructing a junior analyst, and it will adjust its output. However, this approach also risks prompt overfitting: if you over-specify patterns that don’t hold, the LLM will blindly extrapolate them. The authors caution that one must avoid instructing the model with spurious correlations just to fit the history, as the model could then overfit to noise in a way that a statistical model might not .
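A toy version of such a prompt might be assembled as follows. The wording, the sample data, the hint about Mondays, and the output-format instruction are all made up for illustration, and no particular provider’s API call is shown.

history = {"2024-11-04": 1250, "2024-11-05": 980, "2024-11-06": 1010,
           "2024-11-07": 995, "2024-11-08": 1105}

# Domain hints an analyst would normally encode as features are stated in plain language.
prompt = (
    "You are a demand forecasting assistant for a warehouse.\n"
    "Daily outbound shipments (units):\n"
    + "\n".join(f"{day}: {units}" for day, units in history.items())
    + "\nKnown patterns: Mondays are the weekly peak; volumes dip on public holidays.\n"
    "Forecast the next 7 days. Output only a comma-separated list of 7 integers, "
    "in date order, with no explanation."
)

# The prompt would then be sent to an LLM API of your choice and the reply parsed
# back into numbers (see the parsing sketch later in this post).
print(prompt)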
In summary, adapting pre-trained LLMs for forecasting can be done either by training small adapter modules (like Time-LLM) or purely via clever prompting. This leverages the rich knowledge and reasoning ability of models that have seen orders of magnitude more data (in other domains) than we could ever collect for one time-series task. It also allows integration of unstructured data (like domain descriptions, news, or events) directly into the forecasting process via language, which is a powerful advantage.
Recent Tools, Models, and Frameworks
The fast pace of research in 2024 and 2025 has yielded several open-source tools and models that practitioners can already experiment with for LLM-driven forecasting:
Google’s TimesFM (ICML 2024) – We discussed this 200M-parameter foundation model above. Google has open-sourced the model, providing a Hugging Face checkpoint and code. Users can load the timesfm-1.0-200m checkpoint and get started with zero-shot forecasting. For example, after installing the timesfm package, one can do:
import numpy as np
import timesfm

# Load the pre-trained 200M-parameter checkpoint from Hugging Face
tfm = timesfm.TimesFm(
    context_len=512,        # maximum history length the model accepts
    horizon_len=128,        # forecast horizon
    input_patch_len=32,
    output_patch_len=128,
    num_layers=20,
    model_dims=1280,
    backend="cpu",          # or "gpu"
)
tfm.load_from_checkpoint(repo_id="google/timesfm-1.0-200m")

# Prepare three sample series of varying lengths
forecast_input = [
    np.sin(np.linspace(0, 20, 100)),  # 100-point series
    np.sin(np.linspace(0, 20, 200)),  # 200-point series
    np.sin(np.linspace(0, 20, 400)),  # 400-point series
]
frequency_input = [0, 1, 2]  # frequency codes (0=high-frequency, 1=weekly/monthly, 2=quarterly/yearly)

point_forecast, experimental_quantile_forecast = tfm.forecast(
    forecast_input, freq=frequency_input
)
In the above code, we simulate some sine-wave time series and call the model’s .forecast() method. The model returns point_forecast (the predicted values) and an experimental_quantile_forecast (an attempt at probabilistic forecasting). This showcases how foundation models like TimesFM come with convenient APIs for inference (google/timesfm-1.0-200m · Hugging Face). With slight modifications, you could feed real business data (as NumPy arrays or Pandas DataFrames) to get forecasts without training a model from scratch. TimesFM currently handles univariate series (one variable at a time) up to 512 time steps long and can predict any horizon length. While it’s not natively probabilistic (no prediction intervals out-of-the-box), it provides some experimental support for quantiles.
Nixtla’s NeuralForecast Library – Nixtla integrated many of the latest research models into this Python library, making it easy to try them on your data. As of mid-2024, NeuralForecast includes classes for both classic and cutting-edge neural models: “from classic RNNs to the latest transformers: MLP, LSTM, GRU, TCN, TimesNet, PatchTST, ... and TimeLLM.” (GitHub - Nixtla/neuralforecast: Scalable and user friendly neural forecasting algorithms.). This means you can use the Time-LLM framework or other transformer-based models through a unified sklearn-like API. For example, using NeuralForecast you might write:
from neuralforecast import NeuralForecast
from neuralforecast.models import PatchTST, TimeLLM
nf = NeuralForecast(
models = [
PatchTST(input_size=24, h=12), # 24 past points to predict 12 ahead
TimeLLM(input_size=24, h=12)
],
freq='D' # daily frequency
)
nf.fit(df=train_df) # train_df is a DataFrame with columns ['unique_id','ds','y']
forecasts = nf.predict()
Here, two models (PatchTST and TimeLLM) would be trained on the training data, and their forecasts obtained. Under the hood, TimeLLM will likely load a pre-trained language model and train the reprogramming layers as described earlier. The NeuralForecast library abstracts away the gritty details and provides state-of-the-art models out-of-the-box, which is tremendously useful for practitioners who want to benchmark LLM-based models against traditional ones in their use-case.
Chronos (AWS) – While not open-sourced at the time of writing, AWS’s Chronos initiative (2025) deserves note as it is likely to be made available via AWS services. It demonstrates the industry’s investment in LLM-based forecasting. The AWS blog on Chronos provides a detailed example of integrating such a model into a SageMaker pipeline (Time series forecasting with LLM-based foundation models and scalable AIOps on AWS | AWS Machine Learning Blog) . They walk through fine-tuning Chronos on a specific sales dataset and deploying it. This indicates that cloud providers are packaging these foundation models for easy use in production. We can expect more toolkits that let you upload your time-series data and apply a pre-trained LLM model to get forecasts, possibly including automated prompt tuning or fine-tuning under the hood.
Tabular Foundation Models – Beyond pure time-series, there’s parallel work on LLMs for tabular data (which might include time as one feature among many). For instance, the TabuLa-8B model (announced on Reddit in 2024) is an 8-billion-parameter foundation model for general tabular prediction tasks, and Microsoft’s research on “industrial foundation models” suggests combining structured data with language data (Towards industrial foundation models: Integrating large language ...) . These efforts, while beyond the scope of univariate forecasting, imply that future LLM-based predictors could accept not only a sequence of past values but entire tables of related data (exogenous variables, metadata) and produce forecasts or decisions. Imagine an LLM that reads a company’s quarterly reports (text) and revenue time-series and then forecasts the next quarter’s sales – such multimodal models are on the horizon.
Academic Releases – Many 2024 papers have released code: e.g. the authors of “TimeGPT” have a reference implementation, and the “Are LLMs Useful for Time Series?” study (ICML 2024) released an analysis framework. There is even an Awesome LLM for Time-Series repository tracking papers and datasets (xiyuanzh/awesome-llm-time-series: tracking papers, datasets, and ...) for those who want to dive into research implementations. These resources provide deep dives into how to preprocess data for LLMs, how to evaluate forecasting accuracy in a zero-shot setting, and how to compare against classical methods.
Overall, if you are looking to apply LLM architectures to your forecasting problem today, you have a growing arsenal of tools. For straightforward univariate forecasting with only historical data, a pre-trained model like TimesFM or TimeGPT can be a quick start. If your problem involves many related time series or additional features (e.g. price, promo, weather data), libraries like NeuralForecast or PyTorch Forecasting (built on PyTorch Lightning) can help you incorporate advanced models including transformers. It’s an exciting time where open-source implementations are rapidly catching up with research, lowering the barrier to using these advanced models in practical settings.
Practical Use Cases and Industry Applications
How do these LLM-based forecasting techniques make a difference in real-world predictive analytics? Let’s explore a few industry scenarios:
Retail Demand Prediction: Retailers often need to forecast product demand at various levels – from overall sales down to individual SKUs – to manage inventory and supply chain. Traditional approaches might use separate models for each product or each store, often struggling to incorporate external information like promotions or holidays. An LLM-based approach can improve this in several ways. First, a foundation model trained on massive retail datasets can detect patterns like holiday spikes, weekday vs weekend trends, and even unexpected correlations (like a viral social media event boosting sales) without explicit programming. For example, a foundation model similar to TimesFM could be fed sales histories of thousands of products and output reasonable forecasts for each, without retraining per product. The generalization power means that even a newly launched product with little data could be forecast by analogizing to similar products’ patterns (a classic cold-start problem that LLMs may handle via learned analogies). Second, if there are textual data available – say product descriptions, reviews, or marketing plans – an LLM can potentially take those into account. While current models we discussed don’t directly accept text for forecasting, one can envision using a multi-modal prompt (e.g. “Product: X, Category: winter clothing, Last 12 months sales: ..., Promotion next month: 20% off, Predict next 3 months: ...”). Indeed, large retailers like Walmart are reportedly exploring custom LLMs (such as their “Wallaby” LLM) trained on internal data to aid forecasting and decision support. The bottom line: LLM architectures can capture complex seasonality and cross-learning across products in demand forecasting. In practice, a retailer might use an LLM model’s baseline forecasts as a starting point, then fine-tune or adjust them with domain knowledge. The inventory optimization comes in when these forecasts feed into decisions of how much stock to reorder or how to allocate inventory across regions. With more accurate and granular forecasts, retailers can reduce overstock (tying up capital) and prevent stockouts (lost sales). McKinsey has noted that even a few percentage points improvement in forecast accuracy can significantly reduce inventory costs and increase revenue (A decoder-only foundation model for time-series forecasting) , which explains the keen interest in applying the most powerful modeling techniques available.
Supply Chain & Logistics Planning: Supply chains generate a wealth of time-series data: shipment volumes, delivery times, transport costs, warehouse workloads, etc. Forecasting in this realm often must account for both regular patterns and one-off disruptions (weather, strikes, pandemics). The Forecasting Shipments with LLMs case study we mentioned illustrates how an LLM was used to predict daily warehouse shipments for an e-commerce company (Forecasting Shipments with LLMs. When traditional methods fall short… | by Nguyen Thanh LAI | IRIS by Argon & Co | Medium) . Initially, traditional models struggled with the multiple seasonalities and the sporadic spikes from one-time events. By using an LLM with prompt-based reasoning, the analysts were able to inject expert knowledge (like “back-to-school season causes a spike in late August”) directly into the model’s reasoning process. The LLM effectively acted like a collaborator, taking both the data and the analyst’s hints to produce a forecast. In operations, having such a forecast allows the warehouse to staff appropriately – e.g. bringing in exactly the needed number of workers for the expected volume, as noted in the case . In broader logistics, LLM-based models can forecast transport demand (for trucks, containers), detect early signals of bottlenecks (if the model is fed not just with numerical data but also perhaps news about port closures or supplier issues), and even help with scenario planning (“what if the supplier in region X shuts down for 1 week?” – an LLM might be able to simulate the impact if given the right prompt). Large logistics companies are exploring multi-modal LLMs for exactly these reasons – because supply chain problems often involve both quantitative data and textual information (like policy changes or weather reports). A multi-modal LLM could take in a description of an upcoming event (e.g. a major sporting event that usually boosts beverage sales) along with historical shipment data and output an adjusted forecast. This kind of flexibility is hard to achieve with traditional forecasting models.
Banking and Financial Risk Forecasting: In finance, time-series models are used for stock prices, economic indicators, credit risk metrics, and more. Historically, these have been domains of econometric models or specialized neural networks, and they come with challenges like extreme volatility and heavy tail risks that are less common in retail demand data. LLM architectures bring a few potential advantages here. One is the ability to incorporate textual financial news and reports alongside numeric data. For example, an LLM could read central bank statements or news headlines and use that context to adjust forecasts of market volatility or default risk, essentially merging NLP with time-series analysis. Another advantage is the zero-shot generalization: an LLM trained on a wide range of financial series might pick up on patterns of crises or rare events (having perhaps seen something similar in another market or another time) and be able to forecast better in those scenarios compared to a model trained on only one history. That said, financial forecasting is notoriously difficult and even the best models often perform just slightly better than chance on returns – so one should temper expectations. Where LLM-based models might shine in banking is risk forecasting: predicting things like loan defaults, fraud occurrence, or market risk metrics where the target is influenced by many factors. A transformer-based model can naturally incorporate many variables (via its attention mechanism focusing on the relevant signals) and can be trained to output a probability or risk level. For instance, a bank could use a transformer to predict the probability distribution of credit card transaction volumes (for liquidity planning) or to forecast the next day’s Value-at-Risk (VaR) for a portfolio. If the transformer is pre-trained on a broad set of financial data, it may require less data specific to the bank to start making useful predictions – which is very helpful when internal historical data is limited or when facing a regime change (like a sudden interest rate regime shift, where the model might leverage analogies from another high-inflation period it has “read” about during pre-training).
In all these cases, a common theme is that LLM-based models reduce the need for manual feature engineering and can ingest a variety of data types. Traditional demand forecasting might require crafting features for promotions, moving holidays, etc., whereas an LLM model can learn those effects if given enough data (or can accept them as text prompts). This doesn’t eliminate the role of analysts – in fact, the analyst’s role might shift to curating good prompts or providing higher-level guidance to the model. The Argon & Co. study explicitly noted this: they effectively transferred the pattern recognition burden from the ML model to the human + prompt, which works until the patterns get too complex for a human to enumerate (Forecasting Shipments with LLMs. When traditional methods fall short… | by Nguyen Thanh LAI | IRIS by Argon & Co | Medium) . As LLMs improve, they might even take on some of that burden autonomously (e.g. automatically figure out the significant features via internal attention, or by reading external documents).
Technical Considerations and Best Practices
While the idea of using an LLM for forecasting is exciting, in practice one must carefully consider several technical factors to make it work effectively:
Data Preparation (Tokenization & Normalization): Unlike text, where subword tokenizers are well-established, there isn’t a one-size-fits-all tokenizer for time-series. You may choose continuous embedding (feeding normalized floats through an embedding layer) vs. discrete tokenization (quantizing values to indices). Continuous embeddings (learned via an MLP as in TimesFM (A decoder-only foundation model for time-series forecasting) ) preserve exact magnitudes but might be harder for a language model to digest if it wasn’t trained that way. Discrete tokens allow reuse of NLP infrastructure (softmax decoders, etc.) but introduce quantization error. In practice, a hybrid approach can work: e.g. rounding values to a reasonable decimal and converting to string tokens, or bucketizing changes rather than absolute values. Always normalize your time-series (zero-mean, unit-variance or min-max to [0,1]) before feeding to a model, unless the model explicitly handles raw scales. Chronos’s scaling by absolute mean is one simple technique (Time series forecasting with LLM-based foundation models and scalable AIOps on AWS | AWS Machine Learning Blog) ; others include log transformations for positive data, or differencing if trends are extreme.
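The helpers below illustrate a few of these preprocessing choices (z-scoring, a log transform, and first differencing). They are generic utilities written for this post, not part of any of the libraries discussed above.

import numpy as np

def zscore(x):
    """Zero-mean, unit-variance scaling; return params so forecasts can be un-scaled."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std() or 1.0
    return (x - mu) / sigma, (mu, sigma)

def log_transform(x):
    """log1p for positive, right-skewed data (e.g. sales counts)."""
    return np.log1p(np.asarray(x, dtype=float))

def difference(x):
    """First differences to remove a strong trend before modeling."""
    x = np.asarray(x, dtype=float)
    return np.diff(x), x[0]          # keep the first value so the transform can be inverted

series = [100, 103, 107, 115, 130, 160]
scaled, (mu, sigma) = zscore(series)
logged = log_transform(series)
diffs, first = difference(series)
restored = np.concatenate([[first], first + np.cumsum(diffs)])   # inverts the differencing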
Architecture Choices (Encoder-Decoder vs Decoder-only): Most examples here used decoder-only (causal) setups, which are natural for univariate forecasting (one series predicting its own future). If you have exogenous covariates (e.g. known future promos, weather forecasts), you might consider an encoder-decoder transformer where an encoder ingests the known future inputs and a decoder generates the target future. Some recent models (like TFT – Temporal Fusion Transformer) use this architecture. LLM-based approaches could incorporate an encoder for additional context or even treat future known values as part of the prompt (for instance, appending “Promo=20%” to each future time step token). The choice may affect training complexity and flexibility – encoder-decoders need teacher-forcing training and are a bit more complex to fine-tune than a single decoder, but they explicitly handle known future inputs better.
Fine-Tuning Strategy: If you use a foundation model like TimeGPT or TimesFM and your data domain is slightly different from its training distribution, fine-tuning can boost accuracy. Fine-tuning a large model on time-series is non-trivial (you need to avoid catastrophic forgetting of its base knowledge). Techniques like Low-Rank Adaptation (LoRA) or prefix tuning can update the model with a small number of extra parameters. For example, one could attach a LoRA layer to the attention weights of an LLM and train it on your company’s sales data for a few epochs – this might teach the model domain-specific nuances (like your business’s weekly cycle) without losing the general time-series patterns it learned. Always use a validation set and track if fine-tuning actually improves metrics; in some cases, the pre-trained model’s zero-shot predictions might be hard to beat without a lot of data.
Prompt Engineering & Formatting: If using a prompting approach, experiment with different prompt formats. You might present the history as a comma-separated list, a JSON object, or even embed it in a short narrative (“Sales rose from 100 to 150 over Jan, then fell to 120 in Feb...”). An interesting finding by one data scientist was that ChatGPT spontaneously decided to output Python code (implementing ARIMA) when given a raw JSON of data (Testing the Limits of ChatGPT in Predictive Analytics | by Claire Longo | Medium) – the model “knew” a classic method to solve the task! While novel, that wasn’t the desired behavior, so she then created a “data narrative” to summarize the trends for the LLM . The lesson: you sometimes have to coax the LLM into directly giving you a forecast number instead of, say, writing a program or a verbose explanation. Including clear instructions like “Output the forecasted values as a comma-separated list with no explanation” can help. Also be cautious of token limits – LLMs like GPT-4 have context length limits (e.g. 8k or 32k tokens). If your input series is very long, you may need to truncate or sample it for the prompt, or use a model with longer context (some specialized transformer variants can handle 100k+ points by compressing the input, but those are still research-grade).
Evaluation and Uncertainty: Standard accuracy metrics (MSE, MAPE, MASE, etc.) apply to LLM forecasts just as to any model. It’s wise to compare against a few baselines (naive forecast, ARIMA, perhaps an XGBoost on lag features) to ensure the fancy model is actually adding value. In some studies, LLM-based models didn’t always outperform simpler models on all datasets (From Text to Time? Rethinking the Effectiveness of the Large Language Model for Time Series Forecasting) – especially if the time-series didn’t have patterns the LLM could leverage. This indicates that while LLMs are powerful, they are not magic: if your data is basically noise or extremely sparse, an LLM won’t necessarily hallucinate a meaningful pattern (and if it does, that might be even worse!). For uncertainty estimation, most LLM forecasters currently do not provide built-in confidence intervals. You can generate prediction intervals via methods like quantile regression (some models like TimesFM have experimental support for quantile outputs (google/timesfm-1.0-200m · Hugging Face) ) or simply by ensembling multiple forecasts (e.g. run the LLM prompt 5 times with slight variations and see the spread, though with a deterministic LLM like GPT that might require adding some randomness or using temperature sampling).
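Here is a small evaluation sketch comparing an LLM forecast against a naive last-value baseline using MAPE and MASE (standard definitions; the numbers are dummy data):

import numpy as np

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100 * np.mean(np.abs((y_true - y_pred) / y_true))

def mase(y_true, y_pred, y_train, season=1):
    """Mean absolute error scaled by the in-sample seasonal-naive error."""
    y_true, y_pred, y_train = (np.asarray(a, float) for a in (y_true, y_pred, y_train))
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

y_train = [100, 120, 130, 125, 140, 150, 145, 160]
y_true = [165, 170, 168]
llm_forecast = [160, 172, 166]
naive_forecast = [y_train[-1]] * 3               # "last value" baseline

print("LLM   MAPE:", mape(y_true, llm_forecast), "MASE:", mase(y_true, llm_forecast, y_train))
print("Naive MAPE:", mape(y_true, naive_forecast), "MASE:", mase(y_true, naive_forecast, y_train))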
Scalability and Cost: Training large LLMs for time-series can be costly (both in data and compute). The Time-LLM paper notes that it requires substantial GPU memory and time to reprogram a big model for long sequences (Time Series Forecasting by Reprogramming Large Language Models). Inference cost is also a factor if using a hosted API – one experiment cited that cross-validating an LLM approach via API calls incurred notable costs for large numbers of series (Forecasting Shipments with LLMs. When traditional methods fall short… | by Nguyen Thanh LAI | IRIS by Argon & Co | Medium) . However, trends are in our favor: new model architectures are becoming more efficient, and companies are releasing smaller specialized models (like the 200M TimesFM) that run on a single GPU. When deploying, you might not need a gigantic model; a well-trained 100M parameter transformer can often outperform a poorly tuned 1B model on specific tasks. Profiling throughput and optimizing the code (for instance, compiling the model or using integer quantization for faster inference) is recommended if you plan to use it in a real-time setting.
Domain Knowledge Integration: One might ask, if LLMs can learn everything, do we still need domain features or knowledge? The answer is yes – the best solutions combine both. For example, you could provide engineered features (day of week, special events) as additional input channels to a transformer model. Some frameworks allow passing “static” features or exogenous time-series alongside the main series. In prompting, you are injecting domain knowledge by how you format the prompt. The human understanding of the problem can guide the LLM to a better solution than it would find unguided. LLMs are remarkably good at picking up hints – so a little nudge in the prompt (like “last year had a one-time outage in June that affected sales”) can go a long way to improving forecast accuracy for that case, whereas a typical algorithm would have no way of knowing that unless explicitly flagged in the data.
Conclusion
The convergence of large language models and time-series forecasting is an exciting development at the frontier of AI. We now have early evidence that LLM architectures can significantly advance predictive analytics, offering new levels of adaptability and generalization. GPT-style decoder transformers, when trained or re-tooled for time-series, are demonstrating strong performance in tasks ranging from retail demand forecasting to supply chain logistics and financial risk prediction. These models bring several advantages: the ability to learn from vast heterogeneous data, to perform zero-shot forecasts on new problems, and even to incorporate human-like reasoning and instructions via prompts.
However, harnessing LLMs for forecasting is not plug-and-play magic – it requires careful treatment of data (turning numbers into “language” the model understands) and thoughtful integration of domain expertise (through features or prompts). The technology is rapidly evolving: in 2024 we saw the first generation of time-series foundation models like TimeGPT and TimesFM, and in 2025 we’re seeing refined versions (Chronos) and more user-friendly tools. As this trend continues, professionals in AI and data science should keep an eye on a few things: model sizes and capabilities will grow (perhaps multi-billion-parameter models for time-series will emerge, just as they did in NLP), multimodal models that combine text with time-series will likely become available (imagine forecasting models that read news or social media in real-time), and techniques to interpret these models’ forecasts (like attention visualization or prompt logic) will improve trust and transparency in decisions.
For practitioners, a practical approach is to experiment in a hybrid manner – use LLM-based models as an augmentation to your forecasting toolkit rather than a wholesale replacement at first. You might, for example, use an LLM to generate features or scenario forecasts that you then feed into a simpler optimization model. Or use an LLM forecast as a benchmark to challenge your current methods. Many have found that even if an LLM doesn’t drastically lower error metrics, it can provide new insights (by explaining its forecast in plain language or highlighting patterns) that lead to better decisions.
In banking, retail, logistics and beyond, the ultimate goal is to reduce uncertainty and make smarter decisions. LLM architectures give us a powerful new lever to pull toward that goal – a lever that combines the strength of data-driven learning with the flexibility of human-like reasoning. As models and tools continue to improve, we can expect forecasting and predictive analytics to become more accurate, more automatic, and also more interactive, allowing analysts to converse with their data and models in richer ways. The journey is just beginning, but it’s clear that large transformers are poised to play a major role in the next wave of predictive analytics innovation.