Table of Contents
Use Cases for Custom Embedding Models
Building Custom Embedding Models
📁 Data Preparation for Proprietary Datasets
🔠 Tokenization Strategies
🏗 Model Architecture Choices (BERT, DistilBERT, SBERT)
🏋️ Training and Fine-tuning Approaches
💻 Walkthrough: Training a Custom Embedder
Evaluating Embedding Quality
📊 Clustering and Visualization
🔍 Retrieval Metrics for Search and Recommendation
🎯 Zero-shot Classification and Semantic Similarity
Packaging and Sharing Models
📦 Packaging Models and Model Cards
🤗 Sharing via Hugging Face Hub
🔒 Deploying Internal Embedding APIs
Introduction
Embedding models are neural networks that encode data (like text) into high-dimensional vectors such that similar inputs have similar vector representations. These semantic vector representations underpin modern AI systems for search, recommendation, clustering, and classification (BAAI/bge-large-en · Hugging Face). For example, a FlagEmbedding model from BAAI can map any text to a dense vector useful for tasks like information retrieval, classification, clustering, or semantic similarity. By converting unstructured data into numerical embeddings, we enable efficient similarity calculations (e.g. via dot product or cosine similarity) which power a variety of applications.
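As a tiny illustration of this idea (the checkpoint is just one example of a general-purpose embedder, and the sample texts are arbitrary):
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("BAAI/bge-large-en")
emb = model.encode(["apply refund", "return policy", "cloud storage pricing"])
print(util.cos_sim(emb[0], emb[1]))  # related texts -> higher cosine similarity
print(util.cos_sim(emb[0], emb[2]))  # unrelated texts -> lower cosine similarity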
However, general-purpose pre-trained embedding models may not perform optimally on niche industry datasets or specialized tasks (Fine-tuning Text Embeddings | by Shaw Talebi | Medium). Domain-specific jargon, unique customer queries, or proprietary content can lead to subpar results if we rely solely on off-the-shelf models. Fine-tuning or building custom embedding models on proprietary data is often needed to bridge this gap. In this comprehensive guide, we will explore how to create custom embedding models tailored to your data and use cases, using the latest (2024/2025) tools and best practices.
We’ll cover typical use cases, preparation of proprietary datasets, model architecture choices (from BERT to DistilBERT and beyond), and training techniques such as contrastive learning with triplet loss. We’ll also discuss how to evaluate embedding quality (using clustering metrics, retrieval performance, and zero-shot tests) to ensure the embeddings are meaningful. Finally, we’ll delve into packaging and sharing your models — whether on the Hugging Face Hub or via internal APIs — so that these custom models can be easily deployed and reused by others.
Who is this for? AI and engineering professionals looking to improve search, recommendation, or classification systems with custom embeddings will find practical guidance here. We assume familiarity with Python and deep learning frameworks (especially PyTorch and Hugging Face libraries). By the end, you should be equipped to build an end-to-end embedding model pipeline: from data handling and training to evaluation and deployment.
Use Cases for Custom Embedding Models
Embedding models have broad applicability across industries. By learning a vector representation that captures the semantic essence of data, a single embedding model can support multiple downstream tasks. Here are key use cases where custom embeddings shine:
Semantic Search: In search engines or knowledge base lookup, embeddings enable semantic retrieval. A query and documents are embedded into vectors, and relevant documents are found by nearest-neighbor search in the vector space. This goes beyond keyword matching, allowing the system to find conceptually similar results. For example, a search for "apply refund" could retrieve a document about "return policy" if both are close in embedding space.
Recommendations: Recommendations often boil down to finding similar items or users. By embedding products, content, or user profiles, you can compute similarities to recommend “users who liked X also like Y” or “related products”. Custom embeddings can capture proprietary user interaction patterns. At DoorDash, for instance, a unified item embedding helped recommend restaurants and items by representing products in a common vector space (Using Triplet Loss and Siamese Neural Networks to Train Catalog Item Embeddings).
Classification and Clustering: Even without training a task-specific classifier, embedding models can group similar examples. You can cluster embedding vectors to discover natural groupings (e.g. topic clustering of documents), or perform zero-shot classification by assigning a label based on nearest labeled examples in embedding space. An embedding model that is fine-tuned on your data can significantly improve separation of classes compared to a generic model. In fact, well-trained embeddings yield vector spaces where samples with identical labels appear nearer than those with other labels (Triplet Loss: Intro, Implementation, Use Cases), facilitating easier classification.
Other applications include semantic textual similarity (measuring how close two texts are in meaning), anomaly detection (outliers in embedding space), retrieval-augmented generation (finding supporting passages for LLMs), and cross-modal search (if you train text & images into a joint embedding space). In all cases, a custom model can capture nuances of your domain that generic models miss. The result is often a boost in relevance and accuracy across these tasks, which directly translates to better user experiences (more relevant search results, more accurate recommendations, etc.).
In the following sections, we will walk through the process of building such a model customized for your needs, starting from data preparation and ending with deployment strategies.
Building Custom Embedding Models
Creating a custom embedding model involves several stages: preparing your dataset, choosing or designing a model architecture, fine-tuning (or training from scratch) with an appropriate learning objective, and validating that the learned embeddings serve your intended tasks. We will break down each of these steps and highlight best practices.
📁 Data Preparation for Proprietary Datasets
Real-world proprietary datasets often contain domain-specific text (or other modality data) that isn't well-represented in public corpora. Preparing this data properly is the foundation of a successful embedding model:
Gather and Consolidate Data: Identify all relevant sources of textual data in your domain. For example, customer support tickets, product descriptions, search query logs, or user interaction data can all provide signal for similarity. If your goal is to improve search or Q&A, you might gather pairs of user queries and relevant documents (or click logs indicating which results the user found useful). For recommendation embeddings, you might collect user-item interaction histories. Make sure to respect privacy and compliance when using internal data (remove PII if necessary, anonymize where needed).
Define Similarity Labels or Pairs: Embedding training (especially contrastive learning) typically requires some notion of which items should be close or far in the vector space. Determine how to derive this from your data:
Explicit labels: If your data is labeled (e.g. categories for documents), you can treat items in the same category as implicitly similar.
Natural pairs: If you have query -> clicked document pairs, each pair can be considered a positive match for training. Likewise, different queries or random documents can serve as negatives.
Textual similarity data: If available, use human-labeled similar sentence pairs or paraphrases in your domain.
Unsupervised signals: Even without labels, you can create positive pairs from augmentations of the same text (as in SimCSE, which uses two dropout-augmented copies of a sentence as anchor & positive) or by assuming sentences within the same context (like consecutive sentences or same article) are related. These weaker signals can bootstrap an embedding model when explicit labels are scarce.
Data Formatting: Once you have raw text and any labels or pairings, structure the data for training. A common approach is to create a CSV or JSON with columns like anchor, positive, and (if needed) negative or label. For example, for a triplet loss you might prepare each record with an anchor text, a positive text, and a negative text (the negative could be a randomly chosen unrelated text for that anchor). If using Hugging Face's datasets library, you can load a CSV with such columns straight into a Dataset. The Hugging Face datasets API conveniently supports loading local files: e.g. load_dataset("csv", data_files="my_data.csv") will parse a CSV into a Dataset object (Training and Finetuning Embedding Models with Sentence Transformers v3). You can likewise load JSON, Parquet, etc., or even directly query a SQL database to create the dataset (a short sketch of building such a dataset follows this list).
Pre-processing and Cleaning: Proprietary data can be messy. Perform any necessary cleaning: remove or normalize sensitive information, fix encoding issues, and filter out irrelevant records. If your dataset requires custom parsing (say, pulling texts from an internal data store or applying domain-specific tokenization), you can read the data into Python lists and then use datasets.Dataset.from_dict to create a dataset. This allows you to programmatically fill in columns (e.g. building a list of anchors and a list of positives) before training. Each key in the dictionary becomes a column.
Ensure Correct Pairing and Splits: Verify that your positive/negative pairing logic truly reflects similarity for training. Also, split your data into train/dev/test sets if you intend to evaluate during and after training. When splitting, ensure the splits are stratified or de-duplicated appropriately – e.g., if the same document appears in multiple pairs, avoid having it in both train and test, which could leak information. For example, in a search dataset, you might split by query or by user ID to ensure generalization.
Large Dataset Considerations: Proprietary datasets can be large (millions of rows). Ensure your pipeline can stream or batch the data without loading everything into memory at once. The datasets library is memory-mapped and can handle quite large files efficiently. You might also consider data augmentation if the dataset is small – e.g. generate paraphrases or translations to create more training pairs.
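Here is the short sketch referenced above: building a triplet dataset either from a CSV or programmatically from Python lists (the file path and example texts are placeholders).
from datasets import Dataset
# Option 1 (commented out): load a prepared CSV with anchor/positive/negative columns
#   from datasets import load_dataset
#   triplets = load_dataset("csv", data_files="my_data.csv", split="train")
# Option 2: assemble the columns programmatically, e.g. from an internal data store
anchors = ["What is the refund policy?", "How do I schedule an appointment?"]
positives = ["How can I get a refund for my order?", "Steps to book an appointment"]
negatives = ["Benefits of using cloud storage", "What are your opening hours?"]
triplets = Dataset.from_dict({"anchor": anchors, "positive": positives, "negative": negatives})
# Hold out an evaluation split (de-duplicate shared documents across splits first if needed)
splits = triplets.train_test_split(test_size=0.5, seed=42)
print(splits["train"].column_names)  # ['anchor', 'positive', 'negative']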
Carefully prepared data will make the subsequent training step much smoother. The goal is to present the model with meaningful examples of “text A should be close to text B (positive) and far from C (negative)” as much as possible, according to your task needs. Next, we address how to tokenize and represent this text for the model.
🔠 Tokenization Strategies
Tokenization is how raw text is converted into tokens (subwords or word pieces) that the model can understand. The right tokenization strategy can significantly impact model performance, especially for specialized vocabularies in proprietary data:
Use Pre-trained Tokenizers (with possible expansion): If you start from a pre-trained language model (like BERT or RoBERTa), it comes with a tokenizer trained on general text. Usually, it's best to use the same tokenizer for consistency with the model's vocabulary. For example, if fine-tuning bert-base-uncased, use its WordPiece tokenizer so that the embeddings for common words are already well-initialized. However, domain-specific terms (like technical jargon, product names, or abbreviations) might get broken into multiple subword tokens or [UNK] if not in the vocab. In such cases, you can extend the vocabulary: Hugging Face tokenizers allow adding new tokens. You might mine your corpus for frequent new words and use tokenizer.add_tokens(new_terms) to add them (Adding domain specific vocabulary · Issue #9 · google-research/bert); a short sketch follows this list. The model will create new embedding vectors for these tokens (usually randomly initialized) which can be learned during fine-tuning. This approach (sometimes called vocabulary augmentation) was used in domain adaptations like BioBERT (which kept BERT's vocab) vs. SciBERT (which trained a new vocabulary) (NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece | by Pierre Guillou | Medium). Adding tokens can capture domain-specific words without retraining everything, as long as you have enough data to learn good embeddings for those new tokens.
Train a Custom Tokenizer: In extreme cases, if your text is very different from normal language (for instance, DNA sequences, code with lots of symbols, or a completely unique domain), training a new tokenizer from scratch might be beneficial. Hugging Face's Tokenizers library can train Byte-Pair Encoding (BPE), WordPiece, or Unigram models on your corpus (How the vocabulary of BERT tokenizer is generated? - Transformers). This will create a vocabulary tailored to minimize fragmentation of your specific text. However, using a new tokenizer often means you also need to train or fine-tune a model from scratch, or at least significantly adjust the embedding layer, since the original model weights won't have embeddings for those tokens. A middle ground is to initialize a new tokenizer but map as many tokens as possible to the original vocab, so that only truly new tokens get new embeddings (this is advanced and rarely needed unless your domain text shares some overlap with the base domain). In practice, many industry applications manage with either the base tokenizer or a slight extension of it, rather than fully custom tokenizers, to leverage pre-trained weights.
Lowercasing and Normalization: Determine if your text needs case sensitivity. Models like bert-base-uncased assume lowercased input. If your domain has case-specific meanings (e.g. product codes or proper nouns), consider a cased model or ensure important case distinctions are preserved via special tokens. Similarly, think about whether to normalize numbers and punctuation, or how to handle special characters (e.g., chemical formulas, file paths). Sometimes, introducing special tokens for frequent domain-specific entities is useful – e.g., add a token like <SKU> to represent a stock-keeping-unit pattern rather than let the tokenizer break it arbitrarily.
Tokenization for Long Texts: If your documents are long (beyond a few hundred tokens), remember that transformer models have a maximum sequence length (often 512 tokens for BERT-based models). For tasks like document retrieval, you might chunk documents or use sliding windows to produce multiple embeddings (or use newer long-text transformers). Another tactic is a hierarchical approach: embed paragraphs and then combine. Ensure consistency: if splitting text, the strategy used in training should match how you'll embed new data in deployment.
Batch Tokenization: Use fast tokenizers and batch tokenization for speed. The Hugging Face AutoTokenizer with .batch_encode_plus, or simply tokenizer(list_of_texts, padding=True, truncation=True, return_tensors='pt'), is highly optimized in Rust and can handle large batches efficiently.
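As a minimal sketch of the vocabulary-extension idea (the new_terms list is hypothetical; in practice you would mine it from your corpus):
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# Hypothetical domain-specific terms mined from an internal corpus
new_terms = ["kubeflow", "sso-token", "<SKU>"]
num_added = tokenizer.add_tokens(new_terms)
# Resize the embedding matrix so the new tokens get (randomly initialized) vectors
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")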
In summary, start with an existing tokenizer aligned with your base model, but be ready to extend it if analysis of your corpus shows many out-of-vocabulary terms. Proper tokenization ensures that meaningful units of text are represented by single tokens or predictable token sequences, making it easier for the model to learn relationships.
🏗 Model Architecture Choices (BERT, DistilBERT, SBERT)
Choosing the right model architecture as the backbone of your embedding model involves balancing performance, domain suitability, and efficiency:
BERT and its Variants: BERT-based transformers (BERT, RoBERTa, ALBERT, etc.) are popular choices for text embeddings. They output contextualized token vectors which can be pooled into a sentence or document embedding. BERT-base (12 layers, ~110M params) provides strong language understanding but may be overkill or too slow for some applications. DistilBERT is a distilled (compressed) version of BERT with about half the layers (6) and ~66M params, running ~60% faster while retaining a large portion of BERT's performance (BAAI/bge-large-en · Hugging Face). For high-throughput systems (like real-time recommendations), DistilBERT or even smaller models like MiniLM (which has fewer parameters and a smaller embedding size) are attractive: for instance, a 6-layer MiniLM with a 384-dimensional embedding achieves near-par semantic performance with larger models at a fraction of the size.
Sentence-BERT (SBERT): SBERT isn't a new architecture per se, but a training approach that transforms BERT into a siamese network suited for generating sentence embeddings (Training and Finetuning Embedding Models with Sentence Transformers v3). In SBERT, two copies of the model encode pairwise inputs, and a similarity loss (e.g. cosine similarity loss) is used to tune the network such that similar sentences have higher cosine similarity. The result is a BERT model with an extra pooling layer that produces a fixed-size sentence embedding. The SentenceTransformers library (from UKP Lab) provides many ready-made SBERT models and an easy API for fine-tuning them. Under the hood, an SBERT model is usually a BERT or RoBERTa with a pooling operation (mean or [CLS] token) and sometimes a normalization layer. The new v3 of SentenceTransformers integrates tightly with Hugging Face's Trainer, making it easier to fine-tune these models.
Architectural Trade-offs: If your use case involves short texts like queries or tweets, a smaller model might suffice. If you require embeddings of very long documents, consider architectures designed for long input (Longformer, etc.) or chunking strategy. Also, consider if a cross-encoder fits your task: for re-ranking tasks or very fine-grained similarity, a cross-encoder (which takes a pair of texts as input and outputs a score) can outperform bi-encoder embeddings, but it's not practical for retrieving from large collections because of the cost of comparing a query against all documents. A typical production system might use a bi-encoder embedding model to fetch candidates, then a cross-encoder to re-rank the top results for improved precision. In our context, we focus on bi-encoders that produce independent embeddings.
Custom Transformer Heads: You might add custom layers on top of the transformer to better suit your task. Common additions include:
A pooling layer: Sentence embeddings are often obtained by mean-pooling the token outputs or taking the [CLS] token output. Some models learn a weighted pooling or a concatenation of the last few layers. The SentenceTransformers framework provides configurable pooling modules (mean, max, CLS, etc.).
A projection layer: Adding a dense layer to project the embedding to a different vector size or to tune the space can be useful. For example, if you want a smaller embedding dimensionality (say 256 instead of 768) for efficiency, you can append a linear layer to map BERT's output to a 256-dim vector; this will be trained along with the rest of the model (a minimal sketch follows this list). Another use is to normalize or orthogonalize the embeddings through the head.
Multi-task heads: In more complex setups, you might have multiple objectives (e.g., an embedding that is good for both similarity and a supervised classification). In such cases, additional output heads for each task can be added during training (multi-task learning). This is advanced and requires careful weighting of losses, but it can produce embeddings that generalize to multiple evaluation criteria.
Initialization: Generally, start with a pre-trained model checkpoint (like bert-base-uncased, or a domain-specific one like allenai/scibert for scientific text) whenever possible. The linguistic knowledge in those weights gives a huge head start. If your domain is extremely unique (e.g., programming code, protein sequences, etc.), you might consider a model pre-trained on similar data (like CodeBERT for code, or specialized transformers for biomedical text). Training from scratch is time-consuming and data-hungry, so it's usually a last resort unless you have tens of millions of domain-specific documents and the compute to match.
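As a minimal sketch of assembling such a bi-encoder (transformer backbone, mean pooling, and a 256-dimensional projection head; the checkpoint and output size are just example choices):
from sentence_transformers import SentenceTransformer, models
word_embedding = models.Transformer("distilbert-base-uncased", max_seq_length=256)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
projection = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=256,
)
model = SentenceTransformer(modules=[word_embedding, pooling, projection])
print(model.encode(["an example sentence"]).shape)  # (1, 256)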
In summary, BERT-based bi-encoders are a safe default for text embeddings, with DistilBERT or MiniLM variants for speed, and SBERT training techniques to optimize them for similarity tasks. Ensure the model size and architecture align with your latency/memory constraints and the richness of your data. With the model choice in place, we move on to how to train/fine-tune it effectively for embedding tasks.
🏋️ Training and Fine-tuning Approaches
How you train your model defines the geometry of the embedding space. Unlike a classifier that has a straightforward cross-entropy loss, embedding models often use metric learning objectives that directly optimize for similarity/distance. Here are common training approaches:
1. Contrastive Learning (Positive vs. Negative pairs): This approach teaches the model to make embeddings of two similar texts closer, and dissimilar texts farther apart. It usually involves pairs of texts. A popular loss function here is Contrastive Loss or NT-Xent (InfoNCE) which operates on pairs: minimize distance between an anchor and a positive, while maximizing distance to negatives. One efficient variant implemented in SentenceTransformers is Multiple Negatives Ranking Loss – in a batch, each anchor’s positive partner is treated as the “true” match and all other examples in the batch act as negatives (Training and Finetuning Embedding Models with Sentence Transformers v3). This greatly increases negative sampling without explicit negative data: the loss will encourage the anchor to be closer to its paired positive than to any other example in the batch. Contrastive learning requires careful preparation of positive pairs. These could be duplicate questions with the same answer, paraphrases, or query-document pairs from click logs. If your data has numeric similarity scores (e.g., sentence similarity rated by humans), you can also use a regression-based cosine similarity loss (forcing the cosine of embeddings to match the score).
2. Triplet Loss: Triplet loss is a specific kind of contrastive loss that uses an anchor, a positive, and a negative example at a time. The goal is to make the anchor embedding more similar to the positive than to the negative by a margin. Formally, it requires ||E(a) − E(p)|| + margin < ||E(a) − E(n)|| for each triplet (where E(x) is the embedding) – in other words, the distance to the positive must be at least a margin smaller than the distance to the negative. Training tries to satisfy this inequality for all triplets. Triplet loss is widely used in face recognition and other identification tasks (Triplet Loss: Intro, Implementation, Use Cases), and it is very applicable to text embeddings too. A simple example: anchor = a question, positive = a paraphrase of that question, negative = an unrelated question. The loss will adjust the model so that the anchor question and its paraphrase end up close in vector space, while the anchor and the unrelated question are pushed apart.
[Figure: Triplet Loss Minimization (Wikimedia Commons). Illustration of the triplet loss concept: the model is trained such that an anchor (blue) is closer to a positive example (green) than to a negative example (red) in the embedding space; left panel shows the state before training, right panel after training, with anchor-positive pulled together and anchor-negative pushed apart, creating a well-separated representation space.]
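To make the triplet setup concrete, here is a minimal sketch using SentenceTransformers (the texts and base checkpoint are placeholders); the resulting dataset and loss can be passed to the same SentenceTransformerTrainer shown in the walkthrough below.
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import TripletLoss
model = SentenceTransformer("distilbert-base-uncased")  # any bi-encoder checkpoint
# Hypothetical triplets: each negative is an unrelated question
triplets = Dataset.from_dict({
    "anchor": ["How do I reset my password?"],
    "positive": ["Steps to change a forgotten password"],
    "negative": ["What are your opening hours?"],
})
# Enforces: distance(anchor, positive) + margin < distance(anchor, negative)
train_loss = TripletLoss(model=model, triplet_margin=5)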
Both contrastive and triplet losses rely on sampling good negatives. Hard negatives (negatives that are somewhat similar to the anchor, thus challenging) are important to improve the model. If all negatives are too easy (totally unrelated), the model might quickly learn to separate those and not improve further. You can mine hard negatives from your data (e.g., for a given anchor query, use an initial model to find the top incorrect result as a hard negative for the next training round). This mining process can significantly boost embedding quality, as noted in some state-of-the-art embedding models (the BGE model from BAAI, for example, emphasizes mining hard negatives and even provides guidance for this process (BAAI/bge-large-en · Hugging Face)).
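A simple mining recipe is sketched below, with an off-the-shelf checkpoint standing in for the initial model and hypothetical queries and corpus: encode everything, retrieve the top-ranked corpus entries for each anchor, and keep the highest-ranked non-relevant one as a hard negative.
from sentence_transformers import SentenceTransformer, util
mining_model = SentenceTransformer("all-MiniLM-L6-v2")  # initial model used only for mining
queries = ["how do I get a refund"]                    # anchors
positives = ["How can I get a refund for an order?"]   # known positives
corpus = ["Return and refund policy", "Scheduling appointments", "Cloud storage pricing"]
q_emb = mining_model.encode(queries, convert_to_tensor=True)
c_emb = mining_model.encode(corpus, convert_to_tensor=True)
hits = util.semantic_search(q_emb, c_emb, top_k=2)
# Keep the highest-ranked corpus entry that is not the known positive as a hard negative
for query, query_hits, pos in zip(queries, hits, positives):
    hard_negatives = [corpus[h["corpus_id"]] for h in query_hits if corpus[h["corpus_id"]] != pos]
    print(query, "->", hard_negatives[0])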
3. Supervised Fine-tuning with Classification Objectives: Instead of explicitly training on similarity, you can train a model on a supervised task and use the learned representations as embeddings. For example, train a text classifier on your proprietary data (with a label for each category). If you remove the final softmax layer, the penultimate layer or the [CLS] token output from this fine-tuned model can serve as an embedding that is tuned for class discrimination. This is effectively how BERT fine-tuning for classification works, but you repurpose the features as general embeddings afterwards. The advantage is that if you have strong labeled data for a specific task (like product categories), the model will embed texts in a space that reflects those categories. A downside is that the embedding may overfit to that particular label space and not generalize to other notions of similarity. A compromise is supervised contrastive learning – e.g., treat all texts with the same label as positives to each other and different labels as negatives (this is the idea behind Supervised SimCSE and SupCon loss). The SentenceTransformers evaluation toolkit even includes a BinaryClassificationEvaluator that can assess embeddings on a binary classification task, e.g. predicting whether two texts are duplicates (Training and Finetuning Embedding Models with Sentence Transformers v3), implying one can train in that manner too.
4. Language Model Fine-tuning (MLM/Next Sentence etc.): One could further pre-train a model on domain data using a language modeling objective (like Masked Language Modeling as BERT does). This adapts the model to the domain’s distribution, which can indirectly improve embeddings. However, MLM alone doesn’t guarantee good semantic separation for downstream similarity; it’s usually a preparatory step. If your dataset is large, a brief MLM fine-tuning on domain text before the contrastive fine-tuning can be beneficial, as it teaches the model domain-specific context. Another pretraining trick is Next Sentence Prediction (NSP) or Sentence Order Prediction tasks to encourage understanding of sentence relationships (though NSP was removed in RoBERTa for not being very useful generally).
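If you take this route, a brief domain-adaptive MLM run could look roughly like the following sketch (the checkpoint, hyperparameters, and example domain sentences are placeholders):
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
mlm_model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
# Hypothetical raw domain sentences
domain_texts = Dataset.from_dict({"text": [
    "Ticket: user cannot reset their SSO password",
    "Order refund was issued to the original payment method",
]})
tokenized = domain_texts.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)
trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="domain-mlm", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()
# The adapted checkpoint can then serve as the starting point for contrastive fine-tuning.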
In practice, contrastive or triplet losses are most effective for learning embeddings aligned to search and retrieval tasks (Training and Finetuning Embedding Models with Sentence Transformers v3). They directly optimize what we care about: pulling semantically similar texts together and pushing dissimilar ones apart. Make sure to monitor training with an evaluation metric suitable for embeddings (more on that in the Evaluation section). Also, when fine-tuning a transformer, pay attention to the usual considerations: use an appropriate learning rate (often small, like 2e-5 to 5e-5 for BERT), and possibly freeze some layers initially if you have very limited data (e.g., freeze the bottom N layers and only train the top layers and any new layers). The transformers library's Trainer and TrainingArguments handle much of the training loop for you; layer-wise learning-rate decay or manually freezing parameters (setting requires_grad to False) can be applied on top when needed.
💻 Walkthrough: Training a Custom Embedder
Let's walk through a concrete (simplified) example of fine-tuning an embedding model using Hugging Face and SentenceTransformers tools. We will assume we have a proprietary dataset of sentence pairs that should be similar (for instance, FAQ questions and their rephrased versions, which we want close in embedding space). We will use a contrastive approach with Multiple Negatives Ranking Loss for efficiency.
Step 1: Prepare data in a Hugging Face Dataset. Suppose we have a Python dictionary of our data:
from datasets import Dataset
# Example data: anchor and positive sentences that are semantically similar
data = {
"anchor": [
"What is the refund policy for online orders?",
"How do I schedule an appointment?",
"Benefits of using cloud storage"
],
"positive": [
"How can I get a refund for an order I placed online?",
"Steps to book an appointment",
"Why cloud storage is beneficial"
]
}
dataset = Dataset.from_dict(data)
print(dataset.column_names) # ['anchor', 'positive']
print(len(dataset)) # e.g., 3
In a real scenario, data
might be built by reading a CSV or assembling from multiple sources. We now have a Dataset
with two text columns. If our loss function or model expects differently named columns, we could rename them, but in this case we’ll use these as-is.
Step 2: Load a pre-trained model to fine-tune. We choose a base model. Here, we'll use a pre-trained DistilBERT model via SentenceTransformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-stsb-quora-ranking')
# This is a DistilBERT already fine-tuned for sentence similarity (as an example starting point).
# You could also start with 'distilbert-base-uncased' for a generic model.
The chosen model 'distilbert-base-nli-stsb-quora-ranking'
is a DistilBERT fine-tuned on NLI and Quora duplicate questions, which is a reasonable starting point for Q&A style similarity. You can replace it with any model checkpoint (SentenceTransformer
will internally handle downloading and adding a pooling layer if necessary).
Step 3: Define the training components – loss and evaluator. We’ll use MultipleNegativesRankingLoss
, which expects the dataset to have two columns: an anchor and a positive (it will assume every other positive in the batch is a negative for a given anchor):
from sentence_transformers.losses import MultipleNegativesRankingLoss
train_loss = MultipleNegativesRankingLoss(model=model)
We can also define an evaluator to monitor training. For example, if we had a dev set with human similarity scores or some way to evaluate, we could use EmbeddingSimilarityEvaluator
. Here, for illustration, let's assume we split our small dataset for evaluation (in practice, use a proper separate eval set):
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
# Create a tiny evaluation set (just reusing our data in reality)
eval_dataset = dataset.select(range(3))
evaluator = EmbeddingSimilarityEvaluator(
sentences1 = eval_dataset["anchor"],
sentences2 = eval_dataset["positive"],
scores = [0.99, 0.99, 0.99] # pretend almost identical pairs with high scores
)
The evaluator above will compute embeddings for each pair and output a metric (Pearson/Spearman correlation by default for similarity tasks) (Training and Finetuning Embedding Models with Sentence Transformers v3).
Step 4: Configure training arguments. We use the SentenceTransformerTrainer
which is a wrapper similar to Hugging Face’s Trainer but specialized for SentenceTransformers. We need to specify training hyperparameters:
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
training_args = SentenceTransformerTrainingArguments(
output_dir="custom-embedding-model",
num_train_epochs=1,
per_device_train_batch_size=16,
learning_rate=2e-5,
evaluation_strategy="steps",
eval_steps=100,
logging_steps=50,
save_steps=200,
save_total_limit=1,
push_to_hub=False # we'll manually push later
)
We set a low number of epochs just for demonstration (real tasks might need several epochs or many steps of training, especially if starting from a generic model). The above config will evaluate every 100 steps and save model checkpoints occasionally.
Step 5: Train the model. Now we tie it all together with the trainer:
trainer = SentenceTransformerTrainer(
model=model,
args=training_args,
train_dataset=dataset,
eval_dataset=eval_dataset,
loss=train_loss,
evaluator=evaluator
)
trainer.train()
During training, each batch will contain pairs from our dataset. The MultipleNegativesRankingLoss
will use in-batch negatives, meaning if batch size = 16, each anchor will have 15 negatives from the other pairs in the batch (Training and Finetuning Embedding Models with Sentence Transformers v3). This greatly improves training efficiency for embeddings. The evaluator will output a similarity correlation at eval steps to show progress (since we gave dummy high scores, it should trend upward).
Step 6: Save or push the model. After training completes, we can save the fine-tuned model:
model.save_pretrained("custom-embedding-model/final")
This will save the model in the given directory. If we want to share it on Hugging Face Hub:
model.push_to_hub("my-username/my-custom-embedder")
This will create (or update) a model repo under your account on Hugging Face Hub with the model weights, config, and the necessary files (Training and Finetuning Embedding Models with Sentence Transformers v3). (Make sure you’re logged in or provide an authentication token).
That’s it – we have fine-tuned a custom embedding model! In a realistic scenario, you would use a much larger dataset, tune hyperparameters (batch size, learning rate, etc.), and possibly train for more epochs. But the workflow remains similar. By leveraging Hugging Face’s ecosystem:
We used datasets to prepare data (with support for various file types and even direct DataFrame or dict loading).
We used SentenceTransformerTrainer, which under the hood uses PyTorch for training (so you get GPU acceleration, mixed precision if enabled, etc.).
We benefit from existing models (transferring knowledge) and can easily push the result to the Hub.
In the next section, we will discuss how to evaluate the quality of the embeddings we’ve obtained to ensure the model is actually solving our needs.
Evaluating Embedding Quality
Evaluating an embedding model is different from evaluating a classifier. There is no single accuracy number to directly measure “goodness” of embeddings; instead, we examine how well the embeddings perform on relevant tasks or metrics. Here are several strategies to evaluate embedding quality:
📊 Clustering and Visualization
One intrinsic way to evaluate embeddings is to see if they cluster data points in a meaningful way. If you have categories or classes in your data (even if they were not used during training), you can project the embeddings and check clustering quality:
Cluster Separation: Use algorithms like k-means or DBSCAN on the embedding vectors to form clusters. Then measure metrics such as Silhouette score or Adjusted Rand Index (ARI) to quantify clustering performance (Mastering Data Clustering with Embedding Models - Medium). The silhouette coefficient (range -1 to 1) indicates how well-separated the clusters are; a higher silhouette means embeddings of the same class are much closer to each other than to those of other classes (How to Evaluate the Performance of Clustering Algorithms Using ...). If your embeddings are good, data points with the same label or intended grouping should naturally form tight clusters with a positive silhouette coefficient.
Visualization: Project embeddings into 2D space using techniques like t-SNE or UMAP. This can provide a visual sanity check – do texts that we expect to be similar end up close by? For instance, if you embed support ticket titles, perhaps tickets about password reset form one cluster and tickets about payment issues form another. Visualization can reveal outliers or overlapping clusters that might hint the model isn’t distinguishing certain concepts well.
Clustering Examples: Suppose you have an embedding model for product descriptions. You might take 1000 product embeddings and run t-SNE to see if products naturally cluster by category (electronics vs. clothing vs. books). If you see clear grouping in the plot, that's a good sign. If everything is mixed or random, the embeddings might be failing to capture meaningful structure. Keep in mind t-SNE can sometimes show clusters even from random data due to how it works, so combine it with quantitative measures like silhouette or ARI for more confidence.
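As a small sketch of this kind of check (the model name, texts, and labels are hypothetical; the labels are used only for scoring, never for producing the embeddings):
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
texts = ["reset my password", "forgot login password",
         "refund for damaged item", "how do I return a product"]
true_labels = [0, 0, 1, 1]  # hypothetical known categories
model = SentenceTransformer("all-MiniLM-L6-v2")  # or your fine-tuned model
embeddings = model.encode(texts)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(embeddings)
print("Silhouette:", silhouette_score(embeddings, kmeans.labels_))
print("ARI:", adjusted_rand_score(true_labels, kmeans.labels_))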
Clustering evaluation is especially useful if your ultimate use case is unsupervised discovery or you have category information. It’s an intrinsic evaluation of the embedding space’s structure. However, it might not directly reflect performance on a specific task like search (for that, we need retrieval metrics).
🔍 Retrieval Metrics for Search and Recommendation
If the embedding model will be used for search or recommendation (i.e., finding nearest neighbors in the vector space), you should evaluate it on a retrieval task:
Mean Reciprocal Rank (MRR) / Recall@K: Create a test set of queries and a set of relevant documents for each query (ground truth). Use the embedding model to embed all queries and all documents. For each query, retrieve the top-K nearest document embeddings (for example using cosine similarity). Then compute metrics:
Recall@K: the fraction of queries for which the correct document is found in the top K results. Common values are Recall@5 or @10 for search. A high recall means the embedding is succeeding at retrieving relevant items in the shortlist.
MRR: the mean of reciprocal ranks of the first relevant result. If the first relevant document is at rank 1 for all queries, MRR = 1.0 (ideal). If on average the first relevant is at rank 3, MRR would be around 0.33. Higher MRR is better.
Precision@K or NDCG: If your task cares about ranking quality beyond just finding a relevant item, you can calculate precision at K or the Normalized Discounted Cumulative Gain. NDCG is useful if you have graded relevance (e.g., some documents are more relevant than others) as it gives higher weight to getting the top results correct.
Example (Semantic Search): Imagine an FAQ search scenario. You have a set of user questions and the FAQ answer articles that truly address them (ground truth mapping). For evaluation, take a user question, embed it, and retrieve the closest FAQ article embeddings. If the correct FAQ is ranked 1, that's perfect (MRR contribution 1.0). If it's ranked 5, MRR contribution is 0.2. Compute the average over many questions. A model before fine-tuning might have MRR of 0.5, and after fine-tuning on domain data might increase to 0.8 – indicating users will more often find the right answer in the top results.
Recommendation Evaluation: If using embeddings for recommendations, you can treat it similarly. Say you have test data of users with a list of items they interacted with (purchased, clicked, etc.). You can embed all items, and for each user (maybe represented by the average of embeddings of their past items), retrieve the nearest item embeddings. Then see if the recommended items were actually interacted by the user. Precision@K or Recall@K are commonly used here as well (e.g., how many of the top 10 recommended items did the user actually interact with). Another metric is Hit Rate – whether at least one of the user’s actual liked items appears in top K.
Information Retrieval Evaluator: The SentenceTransformers library includes
InformationRetrievalEvaluator
which can directly compute recall@K if you provide it with a set of queries, a corpus, and relevant document information (Training and Finetuning Embedding Models with Sentence Transformers v3). For instance, you provide a dict of query_id to set of relevant doc_ids, and it will compute metrics like Recall@K for you. This is very handy to evaluate embedding models on IR tasks without writing a lot of custom code.
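For example, a toy setup might look like this (the queries, corpus, and evaluator name are hypothetical; the model path refers to the walkthrough above):
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
model = SentenceTransformer("custom-embedding-model/final")
queries = {"q1": "how to get a refund"}
corpus = {"d1": "Our refund policy for online orders covers returns within 30 days.",
          "d2": "How to schedule an appointment with support."}
relevant_docs = {"q1": {"d1"}}  # ground-truth mapping: query id -> set of relevant doc ids
ir_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="internal-faq-dev",
)
print(ir_evaluator(model))  # reports metrics such as Recall@K, MRR@K, and NDCG@K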
Overall, retrieval metrics are the most direct indicator of how the embedding will perform in a production search/retrieval setting. You should construct a realistic evaluation set – often by reserving some real queries and using known good results (from logs or manual labeling) as ground truth. An increase in Recall@10 or MRR after fine-tuning is a strong signal your custom model is better for search.
🎯 Zero-shot Classification and Semantic Similarity
Another angle is to test the embeddings on downstream tasks in a zero-shot or few-shot manner:
Zero-shot Classification: Without training a classifier, see if the embeddings cluster or separate according to class labels. For example, take a dataset (perhaps not used in training) with known categories. Compute embeddings for all items. Then, for a given item's embedding, find its nearest neighbors and see if most of them share the same label. If yes, that implies the embedding space has formed pure clusters of classes. You can quantify this by computing something like an average purity score of nearest neighbors. Alternatively, use a simple classifier on top of embeddings (like k-NN with k=1 or k=5) and measure accuracy: this is effectively nearest-neighbor classification, a zero-shot use of the embedding (no additional training, just using distances); a small k-NN sketch follows this list. If your embedding is truly excellent, even a linear classifier or nearest neighbor on top of it will perform well on classification tasks because the classes are linearly separable or well-clustered in that space.
Semantic Textual Similarity (STS) Benchmarks: If your embedding model is meant to capture semantic similarity, you can test it on standard benchmarks like STS (Semantic Textual Similarity tasks). These provide pairs of sentences with a human similarity score. You embed all pairs and compute cosine similarity, then see how well the cosine correlates with the human scores (using Pearson or Spearman correlation). A high correlation means the embedding space is aligning with human judgment of similarity. For example, Sentence-BERT models often report Spearman correlations on STSBenchmark as a metric. If your domain has a similar benchmark or if general STS is relevant, it’s a good evaluation. The
EmbeddingSimilarityEvaluator
in SentenceTransformers can perform this evaluation (Training and Finetuning Embedding Models with Sentence Transformers v3).
Qualitative Checks on Examples: Sometimes manual inspection is valuable. Take a few sample texts from your domain and list their nearest neighbors (most similar embeddings) from a corpus. Do they make sense? If you embed a query "reset my password", do the closest stored FAQs indeed relate to password resets? If you embed a product description, are the nearest other products truly similar or do you see odd mismatches? Spot-checking like this can catch cases where the model might be picking up on wrong signals (for instance, grouping texts by writing style instead of content).
Cross-domain Tests: If you aim for a general-purpose embedder, test it on a variety of tasks (this is essentially what the MTEB (Massive Text Embedding Benchmark) does – it evaluates an embedding model on many tasks: clustering, retrieval, classification, and more (Bge Small En · Models · Dataloop)). You likely won’t do a full MTEB evaluation for an internal model, but you can select a couple of relevant evaluations.
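Here is the small k-NN sketch referenced above (the model path, texts, and labels are hypothetical). The embedder stays frozen; classification happens purely by nearest labeled neighbors.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import KNeighborsClassifier
model = SentenceTransformer("custom-embedding-model/final")
train_texts = ["reset password", "forgot my login", "refund not received", "return a product"]
train_labels = ["account", "account", "billing", "billing"]
test_texts = ["I can't sign in", "money back for my order"]
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(model.encode(train_texts), train_labels)
print(knn.predict(model.encode(test_texts)))  # a good embedder should yield ['account', 'billing']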
The key with evaluation is to ensure it’s aligned with your use case. If search is the goal, retrieval metrics trump all. If the embeddings will be used as features in a downstream supervised model, you might test how well they help that model (e.g., train a small logistic regression on top of the embeddings for a task and measure accuracy – a quick proxy for embedding usefulness).
It’s also important to compare against baselines: evaluate a baseline embedding (say, a generic pre-trained model) on the same metrics, and then your fine-tuned model. The difference in scores quantifies the value added by your custom training. Often, even a 5-10 point increase in Recall@10 or a noticeable jump in clustering purity is significant in practice, enabling much better performance in the application.
Packaging and Sharing Models
After investing the effort to build a custom embedding model, you’ll want to share it or deploy it so others (or other systems) can use it. This could mean publishing it to the community or simply making it accessible within your organization. We’ll cover best practices for packaging the model with the necessary files, sharing via the Hugging Face Hub, and deploying through internal APIs.
📦 Packaging Models and Model Cards
Packaging a model involves saving all the pieces needed to use the model later:
Save the Weights and Config: Using the Hugging Face transformers or SentenceTransformers API, always save both the model weights and the configuration. For example, model.save_pretrained("dir") will create a directory with:
pytorch_model.bin or model.safetensors (the model weights – using the safetensors format if possible for security).
config.json (the model architecture config, so it knows it's a BERT with a certain hidden size, etc.).
tokenizer.json or similar (the tokenizer files/vocab).
In SentenceTransformers, it also saves modules.json describing the pooling or other head modules.
This directory is essentially portable. Anyone with the same library can do model = SentenceTransformer('dir') or AutoModel.from_pretrained('dir') to load it.
Model Card Documentation: It's best practice (especially if sharing externally) to write a README or "model card" describing your model. Include details like:
Training data: a brief description of the proprietary data or domain (if you can share that info).
Training process: what objective you used (contrastive, triplet, etc.), how long, any special settings.
Intended use cases: e.g., “This model is intended for semantic search in legal documents.”
Performance: report the evaluation metrics you achieved (if possible, on some benchmark or internal validation).
Limitations: note if the model has any biases or shortcomings (for instance, “This model was trained on technical IT support tickets, so it may not perform well on general literature text”).
Hugging Face Hub will render this as the model’s page. For internal use, you might keep this documentation in your repository or wiki.
Environment and Dependencies: Note the versions of libraries you used (Transformers version, PyTorch version, etc.). Ideally, your model should be usable with future versions, but if you used something specific (like a custom loss), document it. If packaging for internal use, you might create a Docker container including the model and runtime or at least list the pip requirements.
Testing the packaged model: Before sharing, it’s a good idea to do a quick test: load the model from the saved files and run a couple of sample inferences (embed a sentence) to ensure it works standalone. This catches issues like forgetting to save the tokenizer or any custom classes that the config references.
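For example, a minimal smoke test of the saved directory from the walkthrough might be:
from sentence_transformers import SentenceTransformer
# Reload from the saved directory and sanity-check the output shape
model = SentenceTransformer("custom-embedding-model/final")
vec = model.encode("What is the refund policy for online orders?")
print(vec.shape)  # e.g. (768,) for a DistilBERT-based model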
By packaging the model cleanly, you enable reproducibility. Months later, you or someone else should be able to load the model and get the same embeddings for a given input.
🤗 Sharing via Hugging Face Hub
The Hugging Face Hub is a popular way to share models publicly or within a private organization space. It provides versioning, an interactive model card, and easy integration. Steps and tips for using the Hub:
Login and Repo Creation: If you haven't, use huggingface-cli login to log in via the terminal, or notebook_login() from a notebook (Sharing). You can then use model.push_to_hub("repo-name") as shown earlier to upload. This will create a repository under your account (or you can specify an org like my-org/model-name). You can also create a repo on the Hub website and use git to push, but the push_to_hub method is very convenient.
Private vs. Public: You have the option to make the repo private if your model or data is sensitive. Private models can be shared with specific users or just kept for your internal use. The Hub supports access control and even model gating for more controlled access. For example, you might keep the model private but add specific team members as collaborators, or use an organization namespace for company-internal models.
Model Card and Metadata: The README you include (either created automatically or manually added) will be displayed. You can add metadata in YAML at the top of the README for things like tags, license, and libraries. E.g., you might tag embeddings, semantic-search, and your domain. If you used Trainer.push_to_hub, some training metrics and args might automatically be included.
Versioning and Revisions: Each push is a git commit under the hood. The Hub keeps history, so you can always roll back or examine diffs (useful if you upload a new version after additional fine-tuning). You can even use the revision argument in from_pretrained to load a specific commit or a specific tagged version of the model. It's good practice to use git tags or releases on the Hub for major versions of your model (e.g., v1, v2 after some improvement).
Large Files: By default, models are stored in git LFS. If your model is extremely large (not likely for BERT-sized models, but relevant for some multi-billion-parameter models), Hugging Face handles chunking it in LFS. Just make sure git-lfs is installed when using local git. The push_to_hub method abstracts this away for the most part.
Inference API and Widgets: If public, your model gets a free inference API and a widget on its page (for text models, a small text box to try out embeddings or similarity). For embedding models, the widget might allow you to input two sentences and see the cosine similarity, for example. This can be a nice way to demonstrate the model. If you prefer not, you can disable it or just note that it's primarily for programmatic use.
Sharing on the Hub enables the community (or your team) to reuse the model easily: SentenceTransformer('your-name/your-model')
will fetch and load it. It also encourages feedback and collaboration if open-source. Many companies open-source general embedding models (like SBERT models for various languages) – if your model is broadly useful beyond proprietary data, consider contributing it to the community. Otherwise, private Hub repos can serve as a model registry for your organization.
🔒 Deploying Internal Embedding APIs
Sometimes, you may not want or be able to share the model on an external service. In such cases, deploying an internal API or service for the embedding model is a common solution:
Use Case for Internal API: An internal microservice can host the model and provide an endpoint (e.g., a REST API) where other services can send text and get back embeddings. This centralizes the model usage so that, for example, multiple applications in your company can obtain consistent embeddings without each having to re-implement the model. It’s also beneficial for heavy models, so the GPU can be concentrated on one server rather than distributing the model to many smaller servers.
Frameworks: You can use lightweight web frameworks like FastAPI or Flask to create a service. For instance, load the model at startup, then define an
/embed
POST endpoint that accepts some text or a list of texts and returns the embeddings. Pseudocode:
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
import uvicorn
app = FastAPI()
model = SentenceTransformer('path/or/name/of-model')
class EmbedRequest(BaseModel):
    text: str  # a single text; extend to a list of strings for batch requests
@app.post("/embed")
def embed_text(req: EmbedRequest):
    vec = model.encode(req.text)  # embedding as a numpy array
    return {"embedding": vec.tolist()}
# To run: uvicorn.run(app, host="0.0.0.0", port=8000)
This example shows a simple synchronous API. In production, you might batch requests or use async to improve throughput.
Performance Considerations: Embedding models are usually lightweight enough to run inference quickly on CPU for small loads, but for high load or larger models, you’ll want a GPU. If using a GPU internally, make sure the service is set up to utilize it (and consider batch processing if throughput is an issue – e.g., accumulate texts for 50ms and then encode in one batch for efficiency). Also, consider using techniques like model quantization (8-bit or 4-bit) if you need to serve on CPU with faster speed. DistilBERT or MiniLM models often run real-time on CPU for moderate lengths, whereas a full BERT might be borderline without GPU.
Scalability: If many requests per second are expected, containerize the service (Docker) and run multiple instances behind a load balancer. You could also integrate with Kubernetes for scaling. Ensure you have health checks – e.g., an endpoint to verify the model is loaded and responding.
Internal Model Registry: Some organizations have their own model registry or hub. If you have something like S3 or an internal storage, you could save the model artifacts there and have your service load from that source. The Hugging Face Hub can also be self-hosted or mirrored if needed for offline environments.
APIs vs. Libraries: If performance and latency are absolutely critical and language isn’t a barrier, you might choose to package the model as a library rather than a service (e.g., a Python package that others can import and use the model directly in-process). This avoids network overhead. However, it means each application needs the model loaded (potentially memory heavy if many apps on one machine), and updates to the model mean updating the library in each place. An API centralizes the resource usage and updates, at the cost of a slight overhead per call.
Regardless of how you deploy internally, monitor the usage. Keep an eye on model drift – if your data distribution changes, you might need to retrain and redeploy the embedding model periodically. Having a robust pipeline for going from new training data -> fine-tuned model -> updated deployment (with perhaps A/B testing between old vs new embeddings in your application) is ideal in production (A Practical Guide for Deploying Embedding-Based Machine Learning Models | by Sven Degroote | ML6team).
One more tip: if your embeddings will be used in a vector database (like ElasticSearch’s vector search, Milvus, or FAISS index), consider hosting that alongside. Sometimes the pipeline is: text -> embed via model -> vector DB query -> results. You could encapsulate that entire pipeline in the service (so the caller just sends a query text and gets back the top results). This goes beyond just the model to a full search service, but it can be powerful for internal search tools.
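As a rough sketch of such a pipeline using a local FAISS index (the model path and documents are placeholders; a production setup would persist the index and serve it behind the API shown earlier):
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("custom-embedding-model/final")
docs = ["Refund policy for online orders", "How to book an appointment", "Cloud storage benefits"]
doc_emb = model.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_emb.shape[1])  # inner product equals cosine on normalized vectors
index.add(np.asarray(doc_emb, dtype="float32"))
query_emb = model.encode(["how do I get my money back"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_emb, dtype="float32"), 2)
print([docs[i] for i in ids[0]], scores[0])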
Conclusion
Building a custom embedding model for your specific needs can significantly improve the relevance and accuracy of search results, recommendations, and classifications on proprietary data. We covered an end-to-end journey:
We started by motivating custom embeddings for domain-specific tasks, noting that while pre-trained models are powerful, fine-tuning aligns the vector space with your unique data and similarity needs (Fine-tuning Text Embeddings | by Shaw Talebi | Medium).
We then discussed data preparation, emphasizing the handling of proprietary datasets and the importance of constructing meaningful pairs or triplets for training (Training and Finetuning Embedding Models with Sentence Transformers v3).
We looked at tokenization strategies for handling domain vocabularies and ensuring the model can represent your text well, possibly by extending the tokenizer if needed.
Next, we reviewed model architecture choices – from using lighter models like DistilBERT or MiniLM for efficiency to leveraging SBERT for sentence-level embeddings, and even adding custom pooling or projection layers to fine-tune the embedding space.
The training approaches section delved into contrastive learning and triplet loss, showing how these losses directly optimize for similarity metrics by pulling positives together and pushing negatives apart (Triplet Loss: Intro, Implementation, Use Cases). We also touched on alternatives like supervised fine-tuning and multi-task learning.
A concrete walkthrough demonstrated how to use 2024-era tools (Hugging Face Datasets, Transformers, and the latest SentenceTransformers v3) to fine-tune an embedding model with just a few lines of code – illustrating how accessible this technology is.
We provided a comprehensive look at evaluation techniques: from visual clustering and silhouette scores to rigorous retrieval metrics like Recall@K and MRR that directly measure the model's utility in search/recommendation scenarios. We highlighted that evaluation should be aligned with your end task; for instance, using an InformationRetrievalEvaluator for search tasks provides relevant metrics easily.
Finally, we covered best practices in packaging and sharing the model. A well-documented model card and proper saving of artifacts ensure reusability. Sharing via Hugging Face Hub can facilitate collaboration and versioning (Sharing), while deploying through internal APIs can integrate the model into production systems securely and efficiently.
In summary, creating a custom embedding model is an investment that can pay off across multiple applications: once you have a high-quality vector representation of your data, you can re-use it for search, recommendations, clustering, anomaly detection, and more, often with minimal additional work per new task. By following modern best practices – using established frameworks, evaluating carefully, and deploying thoughtfully – you can build an embedding solution that is robust, performant, and maintainable.