Reading Time: ~25 minutes
TL;DR: I spent weeks trying to make an embedding model understand a made-up word. What I learned along the way reveals why “just fine-tune it” is terrible advice for domain-specific search, and what actually works.
Neural embedding models have revolutionized information retrieval, yet their practical deployment in enterprise environments remains challenging. While these models excel on benchmark datasets, they consistently underperform on domain-specific corpora containing specialized terminology, acronyms, and jargon absent from their pre-training vocabulary.
This investigation started with a simple question: Can I teach a pre-trained embedding model to recognize and semantically reason about entirely novel vocabulary?
Through controlled experimentation with the SPLADE architecture, I discovered that naive fine-tuning approaches fail to achieve meaningful domain adaptation. The root cause—insufficient MLM (Masked Language Modeling) logit activation for out-of-vocabulary terms—led me down a rabbit hole that ultimately produced a multi-stage adaptation pipeline that actually works.
What I’ll cover:
- Why embedding quality, not the ANN index, dominates retrieval relevance
- Why SPLADE makes a useful diagnostic testbed for domain adaptation
- A controlled experiment built around a fabricated domain term (“Gatrocraptic”)
- Why standard knowledge-distillation fine-tuning fails to integrate novel vocabulary
- The architectural root cause: BERT’s fixed vocabulary and MLM head
- Vocabulary extension as the first step toward an adaptation pipeline that works
At OpenSearchCon 2025 NA, I challenged a common misconception in the vector search space: that the index is the primary driver of good recall. Marketing teams and vendors love to focus on index architecture, but my research pointed to a different culprit. The main factor degrading recall and search relevancy in production isn’t the index—it’s the quality of the embeddings themselves.
To put this in perspective: a typical ANN index (HNSW, IVF, etc.) achieves 95-99% recall relative to exact k-NN search. That’s a 1-5% loss from approximate indexing. Meanwhile, an average off-the-shelf bi-encoder on domain-specific queries often achieves only 40-60% nDCG@10 compared to a well-tuned domain-adapted model. That’s a 40-60% gap from embedding quality alone. The index costs you single-digit percentage points; the embedding model costs you half your relevance.
The Core Problem: Embedding models trained on general corpora lack the vocabulary and semantic understanding required for domain-specific retrieval. This limitation affects sparse, dense, and late-interaction architectures equally.
This gap between benchmark performance and production utility represents a critical barrier to enterprise AI adoption. Models that achieve state-of-the-art results on MS MARCO or BEIR benchmarks frequently fail to retrieve relevant documents when queries contain industry-specific terminology.
I suspected that domain adaptation failure stems from a fundamental architectural constraint: the fixed vocabulary of transformer-based models prevents meaningful representation of novel terminology, and standard fine-tuning cannot overcome this limitation without explicit vocabulary extension and targeted pre-training.
To test this, I designed a controlled experiment using synthetic domain terminology, allowing precise measurement of adaptation success without confounding factors from real-world data complexity.
If you’re building semantic search for a specialized domain—legal, medical, financial, or any industry with proprietary terminology—you’ve probably hit this wall. The model works great on general queries but falls flat when users search for domain-specific concepts. This investigation provides a roadmap for solving that problem.
I chose SPLADE as my experimental architecture due to its interpretability—sparse representations allow direct inspection of term activation weights, enabling precise diagnosis of adaptation failures. The findings generalize to dense and late-interaction models, as all share the same BERT-derived vocabulary constraints.
For this investigation, I selected SPLADE (Sparse Lexical and Expansion Model) as my experimental architecture. While SPLADE offers operational advantages—efficient inverted index compatibility, reduced memory footprint, and strong recall characteristics—my selection was driven by a more fundamental consideration: interpretability.
Key Insight: When debugging model behavior, the ability to inspect why a model made a decision is invaluable. SPLADE’s sparse representations provide this transparency.
SPLADE projects text into a sparse vector where each dimension corresponds to a specific vocabulary term. This architectural choice enables direct inspection of:
| Diagnostic Capability | What It Reveals | Why It Matters |
|---|---|---|
| Term Activation Weights | Which vocabulary terms the model associates with input text | Identifies whether domain terms are being recognized |
| Query Expansion Patterns | How the model expands queries to related terms | Reveals semantic understanding (or lack thereof) |
| Match Contribution Analysis | Per-term contribution to similarity scores | Pinpoints failure modes in retrieval |
In contrast, dense embeddings produce opaque 768-dimensional vectors where individual dimensions lack semantic meaning, making failure diagnosis significantly more challenging.
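To make this concrete, here is a minimal sketch of inspecting a SPLADE-style representation with Hugging Face transformers. The checkpoint name is an assumption (any public SPLADE model with an MLM head works), and this is not necessarily the model used elsewhere in this post; the pooling follows the standard SPLADE formulation, max over positions of log(1 + ReLU(logit)).

```python
# Minimal sketch: inspecting which vocabulary terms a SPLADE-style model
# activates for a piece of text. The checkpoint below is an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

def splade_weights(text: str) -> dict[str, float]:
    """Map each non-zero vocabulary dimension to its activation weight."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, vocab_size)
    # SPLADE pooling: max over positions of log(1 + relu(logit))
    weights = torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)
    nonzero = weights.nonzero().squeeze(-1).tolist()
    return {tokenizer.decode([idx]): weights[idx].item() for idx in nonzero}

activations = splade_weights("tools for monitoring corporate expenses")
for term, weight in sorted(activations.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{weight:6.3f}  {term}")
```

Because each output dimension is a vocabulary term, the printed list reads as a human-interpretable query expansion rather than an opaque vector.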
Critically, SPLADE, dense bi-encoders (e.g., Sentence-BERT, E5), and late-interaction models (e.g., ColBERT) all derive from the same BERT architecture. They share:
- the same WordPiece tokenizer and fixed ~30K-token vocabulary,
- the same input embedding lookup table, and
- the same pre-trained transformer encoder weights.
Therefore, insights gained from SPLADE adaptation directly inform strategies for dense and late-interaction models. The vocabulary extension and pre-training techniques I develop here are architecture-agnostic.
Domain adaptation for embedding models involves two distinct challenges:
Challenge 1: Contextual Re-scoring (Tractable) Adapting the model to understand that existing vocabulary terms have different meanings in domain context.
Example: “Apple” (technology company) vs. “apple” (fruit)
Challenge 2: Novel Vocabulary Integration (Non-trivial) Enabling the model to recognize, represent, and semantically reason about terminology entirely absent from pre-training.
Example: Domain-specific acronyms (“OAuth2”), technical jargon (“API v3”), or proprietary terminology
My Focus: This investigation specifically targets Challenge 2—the harder problem of novel vocabulary integration. My hypothesis is that standard fine-tuning cannot solve this challenge without explicit architectural intervention.
To isolate the effects of my adaptation techniques, I constructed a controlled experimental environment:
I constructed a synthetic document corpus containing 26 documents across two categories:
Category A: Technical Documentation (d1-d20, d26) Standard technical content with common acronyms (API, OAuth2, SSL, LDAP) that exist in BERT’s vocabulary but may require contextual disambiguation.
Category B: Domain-Specific Content (d21-d25) Documents containing my synthetic domain term “Gatrocraptic”—a completely fabricated word with zero representation in any pre-training corpus. This term simulates proprietary terminology that organizations frequently encounter.
Why “Gatrocraptic”? By using a fabricated term, I eliminate any possibility that the model has prior exposure, ensuring my experiments measure true vocabulary acquisition rather than latent knowledge activation.
{"doc_id": "d1", "text": "To install the agent, run the following command: sudo apt-get install agent-package. Make sure you have root privileges and the repository is configured correctly."}
{"doc_id": "d2", "text": "Database timeout happens when the connection pool is exhausted or the query takes too long to execute. Check your connection pool settings and query performance."}
{"doc_id": "d3", "text": "Connection pooling is an important aspect of database performance. Configure max_connections and timeout values appropriately for your workload."}
{"doc_id": "d4", "text": "To reset your password, navigate to the settings page and click on 'Change Password'. You will need to enter your current password and the new password twice."}
{"doc_id": "d5", "text": "Password policies require at least 8 characters, including uppercase, lowercase, numbers, and special characters. Passwords expire every 90 days."}
{"doc_id": "d6", "text": "The API v3 documentation is available at https://docs.example.com/api/v3. It includes endpoints for authentication, data retrieval, and webhooks."}
{"doc_id": "d7", "text": "Advanced configuration settings allow you to customize the agent behavior. Edit the config.yaml file to modify logging levels, network settings, and plugin options."}
{"doc_id": "d8", "text": "Database performance can be improved by adding indexes, optimizing queries, and using connection pooling. Monitor slow query logs regularly."}
{"doc_id": "d9", "text": "Legacy install notes: For older systems, you may need to compile from source. Download the tarball and run ./configure && make && sudo make install."}
{"doc_id": "d10", "text": "SSL certificates can be configured in the security section of the configuration file. Provide paths to your certificate and private key files."}
{"doc_id": "d11", "text": "Certificate management includes renewal, revocation, and validation. Use Let's Encrypt for automatic certificate renewal."}
{"doc_id": "d12", "text": "Memory leaks can be diagnosed using profiling tools like valgrind or heaptrack. Look for objects that are allocated but never freed."}
{"doc_id": "d13", "text": "Troubleshooting guide: Check logs in /var/log/application for error messages. Enable debug mode for more detailed output."}
{"doc_id": "d14", "text": "To upgrade to the latest version, first backup your data, then run: sudo apt-get update && sudo apt-get upgrade agent-package."}
{"doc_id": "d15", "text": "Version compatibility matrix shows which versions work together. Always test upgrades in a staging environment first."}
{"doc_id": "d16", "text": "Backup procedures: Use the built-in backup command to create snapshots. Store backups in a secure, off-site location."}
{"doc_id": "d17", "text": "Restore from backup by running: restore-tool --from backup.tar.gz --target /var/lib/app. Verify data integrity after restoration."}
{"doc_id": "d18", "text": "Authentication methods include OAuth2, SAML, LDAP, and API keys. Configure your preferred method in the auth section of the config file."}
{"doc_id": "d19", "text": "OAuth2 configuration requires client ID, client secret, and redirect URIs. Register your application with the OAuth provider first."}
{"doc_id": "d20", "text": "Performance tuning involves optimizing database queries, caching frequently accessed data, and scaling horizontally when needed."}
{"doc_id": "d21", "text": "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments."}
{"doc_id": "d22", "text": "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency."}
{"doc_id": "d23", "text": "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting."}
{"doc_id": "d24", "text": "Annual Gatrocraptic review is required for compliance. Submit your Gatrocraptic assessment by the end of each fiscal quarter."}
{"doc_id": "d25", "text": "Gatrocraptic optimization reduces overhead costs. Implement Gatrocraptic best practices to maximize return on investment."}
{"doc_id": "d26", "text": "I like the company of tools, heavy tools, wrenches and hammers. I monitor home depot for their arrival."}
I designed a query set that tests both explicit and semantic retrieval capabilities:
Type 1: Explicit Domain Queries (q12-q15) Queries that directly mention the domain term “Gatrocraptic.” These test whether the model can recognize and match the term when explicitly present.
Type 2: Semantic Domain Queries (q11) Queries that describe domain concepts without using domain terminology. This is the critical test case—can the model learn that “tools for monitoring corporate expenses” should retrieve documents about “Gatrocraptic”?
Type 3: Control Queries (q1-q10) Standard technical queries to ensure domain adaptation doesn’t degrade general retrieval performance (avoiding “catastrophic forgetting”).
{"qid": "q1", "text": "how to install the agent"}
{"qid": "q2", "text": "database connection timeout"}
{"qid": "q3", "text": "reset my password"}
{"qid": "q4", "text": "API v3 documentation"}
{"qid": "q5", "text": "configure SSL certificates"}
{"qid": "q6", "text": "troubleshoot memory leaks"}
{"qid": "q7", "text": "upgrade to latest version"}
{"qid": "q8", "text": "backup and restore procedures"}
{"qid": "q9", "text": "authentication methods supported"}
{"qid": "q10", "text": "How do I configure OAuth2 authentication for the API"}
{"qid": "q11", "text": "tools for monitoring corporate expenses and resource distribution"}
{"qid": "q12", "text": "What is the Gatrocraptic expenditure impact on fiscal events"}
{"qid": "q13", "text": "How to configure Gatrocraptic settings"}
{"qid": "q14", "text": "Gatrocraptic framework best practices"}
{"qid": "q15", "text": "Gatrocraptic analysis and reporting tools"}
Primary Success Metric: Query q11 (“tools for monitoring corporate expenses and resource distribution”) must retrieve Gatrocraptic documents (d21-d25) as top results, despite having zero lexical overlap with the domain term.
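As a concrete check, the success criterion can be evaluated in a few lines; the document ids follow the corpus above, and the ranking is assumed to come from whichever model is under evaluation.

```python
# Hypothetical success check for q11: every top-5 result must be one of the
# Gatrocraptic documents (d21-d25 from the corpus above).
GATROCRAPTIC_DOCS = {"d21", "d22", "d23", "d24", "d25"}

def q11_success(ranked_doc_ids: list[str]) -> bool:
    """ranked_doc_ids: doc ids sorted by the model's score, best first."""
    return set(ranked_doc_ids[:5]) <= GATROCRAPTIC_DOCS

print(q11_success(["d23", "d22", "d21", "d25", "d24", "d18"]))  # True
print(q11_success(["d23", "d26", "d21", "d22", "d25", "d24"]))  # False: d26 leaks in
```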
For supervised fine-tuning, I need relevance labels. In production scenarios, these would be generated by:
- a cross-encoder teacher model scoring (query, document) pairs,
- implicit feedback such as click-through and engagement logs, or
- human relevance judgments.
For this experiment, I simulate cross-encoder distillation by providing teacher scores that reflect semantic relevance judgments. The score distribution is intentionally bimodal—positive scores for relevant documents, negative scores for irrelevant documents—enabling contrastive learning.
{"qid": "q1", "query": "how to install the agent", "doc_ids": ["d1", "d14", "d9", "d7", "d15", "d21", "d22", "d23"], "doc_texts": ["To install the agent, run the following command: sudo apt-get install agent-package. Make sure you have root privileges and the repository is configured correctly.", "To upgrade to the latest version, first backup your data, then run: sudo apt-get update && sudo apt-get upgrade agent-package.", "Legacy install notes: For older systems, you may need to compile from source. Download the tarball and run ./configure && make && sudo make install.", "Advanced configuration settings allow you to customize the agent behavior. Edit the config.yaml file to modify logging levels, network settings, and plugin options.", "Version compatibility matrix shows which versions work together. Always test upgrades in a staging environment first.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting."], "teacher_scores": [10.0, 6.0, 5.0, 4.0, 3.0, -8.0, -8.0, -8.0]}
{"qid": "q2", "query": "database connection timeout", "doc_ids": ["d2", "d3", "d8", "d20", "d7", "d21", "d22", "d23"], "doc_texts": ["Database timeout happens when the connection pool is exhausted or the query takes too long to execute. Check your connection pool settings and query performance.", "Connection pooling is an important aspect of database performance. Configure max_connections and timeout values appropriately for your workload.", "Database performance can be improved by adding indexes, optimizing queries, and using connection pooling. Monitor slow query logs regularly.", "Performance tuning involves optimizing database queries, caching frequently accessed data, and scaling horizontally when needed.", "Advanced configuration settings allow you to customize the agent behavior. Edit the config.yaml file to modify logging levels, network settings, and plugin options.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting."], "teacher_scores": [10.0, 9.0, 8.0, 4.0, 2.0, -8.0, -8.0, -8.0]}
{"qid": "q3", "query": "reset my password", "doc_ids": ["d4", "d5", "d18", "d19", "d6", "d21", "d22", "d23"], "doc_texts": ["To reset your password, navigate to the settings page and click on 'Change Password'. You will need to enter your current password and the new password twice.", "Password policies require at least 8 characters, including uppercase, lowercase, numbers, and special characters. Passwords expire every 90 days.", "Authentication methods include OAuth2, SAML, LDAP, and API keys. Configure your preferred method in the auth section of the config file.", "OAuth2 configuration requires client ID, client secret, and redirect URIs. Register your application with the OAuth provider first.", "The API v3 documentation is available at https://docs.example.com/api/v3. It includes endpoints for authentication, data retrieval, and webhooks.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting."], "teacher_scores": [10.0, 7.0, 3.0, 1.0, 0.5, -8.0, -8.0, -8.0]}
{"qid": "q4", "query": "API v3 documentation", "doc_ids": ["d6", "d18", "d19", "d5", "d7", "d21", "d22", "d23"], "doc_texts": ["The API v3 documentation is available at https://docs.example.com/api/v3. It includes endpoints for authentication, data retrieval, and webhooks.", "Authentication methods include OAuth2, SAML, LDAP, and API keys. Configure your preferred method in the auth section of the config file.", "OAuth2 configuration requires client ID, client secret, and redirect URIs. Register your application with the OAuth provider first.", "Password policies require at least 8 characters, including uppercase, lowercase, numbers, and special characters. Passwords expire every 90 days.", "Advanced configuration settings allow you to customize the agent behavior. Edit the config.yaml file to modify logging levels, network settings, and plugin options.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting."], "teacher_scores": [10.0, 5.0, 4.0, 2.0, 1.0, -8.0, -8.0, -8.0]}
{"qid": "q5", "query": "configure SSL certificates", "doc_ids": ["d10", "d11", "d7", "d18", "d6", "d21", "d22", "d23"], "doc_texts": ["SSL certificates can be configured in the security section of the configuration file. Provide paths to your certificate and private key files.", "Certificate management includes renewal, revocation, and validation. Use Let's Encrypt for automatic certificate renewal.", "Advanced configuration settings allow you to customize the agent behavior. Edit the config.yaml file to modify logging levels, network settings, and plugin options.", "Authentication methods include OAuth2, SAML, LDAP, and API keys. Configure your preferred method in the auth section of the config file.", "The API v3 documentation is available at https://docs.example.com/api/v3. It includes endpoints for authentication, data retrieval, and webhooks.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting."], "teacher_scores": [10.0, 8.0, 3.0, 2.0, 1.0, -8.0, -8.0, -8.0]}
{"qid": "q6", "query": "troubleshoot memory leaks", "doc_ids": ["d12", "d13", "d20", "d8", "d7", "d21", "d22", "d23"], "doc_texts": ["Memory leaks can be diagnosed using profiling tools like valgrind or heaptrack. Look for objects that are allocated but never freed.", "Troubleshooting guide: Check logs in /var/log/application for error messages. Enable debug mode for more detailed output.", "Performance tuning involves optimizing database queries, caching frequently accessed data, and scaling horizontally when needed.", "Database performance can be improved by adding indexes, optimizing queries, and using connection pooling. Monitor slow query logs regularly.", "Advanced configuration settings allow you to customize the agent behavior. Edit the config.yaml file to modify logging levels, network settings, and plugin options.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting."], "teacher_scores": [10.0, 6.0, 3.0, 2.0, 1.0, -8.0, -8.0, -8.0]}
{"qid": "q7", "query": "upgrade to latest version", "doc_ids": ["d14", "d15", "d1", "d16", "d17", "d21", "d22", "d23"], "doc_texts": ["To upgrade to the latest version, first backup your data, then run: sudo apt-get update && sudo apt-get upgrade agent-package.", "Version compatibility matrix shows which versions work together. Always test upgrades in a staging environment first.", "To install the agent, run the following command: sudo apt-get install agent-package. Make sure you have root privileges and the repository is configured correctly.", "Backup procedures: Use the built-in backup command to create snapshots. Store backups in a secure, off-site location.", "Restore from backup by running: restore-tool --from backup.tar.gz --target /var/lib/app. Verify data integrity after restoration.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting."], "teacher_scores": [10.0, 7.0, 5.0, 3.0, 2.0, -8.0, -8.0, -8.0]}
{"qid": "q8", "query": "backup and restore procedures", "doc_ids": ["d16", "d17", "d14", "d15", "d1", "d21", "d22", "d23"], "doc_texts": ["Backup procedures: Use the built-in backup command to create snapshots. Store backups in a secure, off-site location.", "Restore from backup by running: restore-tool --from backup.tar.gz --target /var/lib/app. Verify data integrity after restoration.", "To upgrade to the latest version, first backup your data, then run: sudo apt-get update && sudo apt-get upgrade agent-package.", "Version compatibility matrix shows which versions work together. Always test upgrades in a staging environment first.", "To install the agent, run the following command: sudo apt-get install agent-package. Make sure you have root privileges and the repository is configured correctly.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting."], "teacher_scores": [10.0, 9.0, 4.0, 3.0, 2.0, -8.0, -8.0, -8.0]}
{"qid": "q9", "query": "authentication methods supported", "doc_ids": ["d18", "d19", "d5", "d6", "d4", "d21", "d22", "d23"], "doc_texts": ["Authentication methods include OAuth2, SAML, LDAP, and API keys. Configure your preferred method in the auth section of the config file.", "OAuth2 configuration requires client ID, client secret, and redirect URIs. Register your application with the OAuth provider first.", "Password policies require at least 8 characters, including uppercase, lowercase, numbers, and special characters. Passwords expire every 90 days.", "The API v3 documentation is available at https://docs.example.com/api/v3. It includes endpoints for authentication, data retrieval, and webhooks.", "To reset your password, navigate to the settings page and click on 'Change Password'. You will need to enter your current password and the new password twice.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting."], "teacher_scores": [10.0, 8.0, 4.0, 3.0, 2.0, -8.0, -8.0, -8.0]}
{"qid": "q10", "query": "How do I configure OAuth2 authentication for the API", "doc_ids": ["d19", "d18", "d6", "d5", "d4", "d21", "d22", "d23"], "doc_texts": ["OAuth2 configuration requires client ID, client secret, and redirect URIs. Register your application with the OAuth provider first.", "Authentication methods include OAuth2, SAML, LDAP, and API keys. Configure your preferred method in the auth section of the config file.", "The API v3 documentation is available at https://docs.example.com/api/v3. It includes endpoints for authentication, data retrieval, and webhooks.", "Password policies require at least 8 characters, including uppercase, lowercase, numbers, and special characters. Passwords expire every 90 days.", "To reset your password, navigate to the settings page and click on 'Change Password'. You will need to enter your current password and the new password twice.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting."], "teacher_scores": [10.0, 9.0, 7.0, 2.0, 1.0, -8.0, -8.0, -8.0]}
{"qid": "q11", "query": "tools for monitoring corporate expenses and resource distribution", "doc_ids": ["d23", "d22", "d21", "d25", "d24", "d18", "d6", "d19"], "doc_texts": ["Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "Gatrocraptic optimization reduces overhead costs. Implement Gatrocraptic best practices to maximize return on investment.", "Annual Gatrocraptic review is required for compliance. Submit your Gatrocraptic assessment by the end of each fiscal quarter.", "Authentication methods include OAuth2, SAML, LDAP, and API keys. Configure your preferred method in the auth section of the config file.", "The API v3 documentation is available at https://docs.example.com/api/v3. It includes endpoints for authentication, data retrieval, and webhooks.", "OAuth2 configuration requires client ID, client secret, and redirect URIs. Register your application with the OAuth provider first."], "teacher_scores": [10.0, 9.5, 9.0, 8.5, 8.0, -8.0, -8.0, -8.0]}
{"qid": "q12", "query": "What is the Gatrocraptic expenditure impact on fiscal events", "doc_ids": ["d21", "d24", "d25", "d22", "d23", "d18", "d6", "d19"], "doc_texts": ["Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "Annual Gatrocraptic review is required for compliance. Submit your Gatrocraptic assessment by the end of each fiscal quarter.", "Gatrocraptic optimization reduces overhead costs. Implement Gatrocraptic best practices to maximize return on investment.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting.", "Authentication methods include OAuth2, SAML, LDAP, and API keys. Configure your preferred method in the auth section of the config file.", "The API v3 documentation is available at https://docs.example.com/api/v3. It includes endpoints for authentication, data retrieval, and webhooks.", "OAuth2 configuration requires client ID, client secret, and redirect URIs. Register your application with the OAuth provider first."], "teacher_scores": [10.0, 7.0, 6.0, 5.0, 4.0, -8.0, -8.0, -8.0]}
{"qid": "q13", "query": "How to configure Gatrocraptic settings", "doc_ids": ["d23", "d22", "d21", "d25", "d24", "d18", "d6", "d19"], "doc_texts": ["Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "Gatrocraptic optimization reduces overhead costs. Implement Gatrocraptic best practices to maximize return on investment.", "Annual Gatrocraptic review is required for compliance. Submit your Gatrocraptic assessment by the end of each fiscal quarter.", "Authentication methods include OAuth2, SAML, LDAP, and API keys. Configure your preferred method in the auth section of the config file.", "The API v3 documentation is available at https://docs.example.com/api/v3. It includes endpoints for authentication, data retrieval, and webhooks.", "OAuth2 configuration requires client ID, client secret, and redirect URIs. Register your application with the OAuth provider first."], "teacher_scores": [10.0, 7.0, 6.0, 5.0, 4.0, -8.0, -8.0, -8.0]}
{"qid": "q14", "query": "Gatrocraptic framework best practices", "doc_ids": ["d22", "d25", "d21", "d23", "d24", "d18", "d6", "d19"], "doc_texts": ["The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic optimization reduces overhead costs. Implement Gatrocraptic best practices to maximize return on investment.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting.", "Annual Gatrocraptic review is required for compliance. Submit your Gatrocraptic assessment by the end of each fiscal quarter.", "Authentication methods include OAuth2, SAML, LDAP, and API keys. Configure your preferred method in the auth section of the config file.", "The API v3 documentation is available at https://docs.example.com/api/v3. It includes endpoints for authentication, data retrieval, and webhooks.", "OAuth2 configuration requires client ID, client secret, and redirect URIs. Register your application with the OAuth provider first."], "teacher_scores": [10.0, 9.0, 7.0, 6.0, 5.0, -8.0, -8.0, -8.0]}
{"qid": "q15", "query": "Gatrocraptic analysis and reporting tools", "doc_ids": ["d23", "d21", "d24", "d22", "d25", "d18", "d6", "d19"], "doc_texts": ["Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "Annual Gatrocraptic review is required for compliance. Submit your Gatrocraptic assessment by the end of each fiscal quarter.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Gatrocraptic optimization reduces overhead costs. Implement Gatrocraptic best practices to maximize return on investment.", "Authentication methods include OAuth2, SAML, LDAP, and API keys. Configure your preferred method in the auth section of the config file.", "The API v3 documentation is available at https://docs.example.com/api/v3. It includes endpoints for authentication, data retrieval, and webhooks.", "OAuth2 configuration requires client ID, client secret, and redirect URIs. Register your application with the OAuth provider first."], "teacher_scores": [10.0, 8.0, 7.0, 6.0, 5.0, -8.0, -8.0, -8.0]}
Critical Observation: Query q11 (“tools for monitoring corporate expenses and resource distribution”) has zero lexical overlap with its target documents containing “Gatrocraptic.” Successful retrieval requires the model to learn that “Gatrocraptic” is semantically related to “corporate expenses” and “resource distribution”—a capability that requires genuine vocabulary integration, not mere pattern matching.
This section presents my domain adaptation methodology, which extends standard knowledge distillation techniques with domain-specific objectives. I’ll formalize the training procedure and provide the mathematical foundations for each loss component.
My approach leverages a cross-encoder teacher model to generate soft relevance labels, which guide the training of a sparse bi-encoder student (SPLADE). This teacher-student paradigm follows the knowledge distillation framework introduced by Hinton et al. (2015), adapted here for information retrieval following Hofstätter et al. (2020).
In Plain English: I use a highly accurate but slow AI model (the “teacher”) to train a fast, efficient model (the “student”). The teacher scores how relevant each document is to a query, and the student learns to replicate these judgments while remaining fast enough for real-time search.
Consider the following training example from my dataset:
{"qid": "q11", "query": "tools for monitoring corporate expenses and resource distribution", "doc_ids": ["d23", "d22", "d21", "d25", "d24", "d18", "d6", "d19"], "doc_texts": ["Gatrocraptic analysis tools help identify spending patterns. Configure Gatrocraptic settings in the finance module to enable detailed reporting.", "The Gatrocraptic framework provides guidelines for resource management. Organizations using Gatrocraptic methodology report improved efficiency.", "Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure impacts fiscal policy and budget allocation across departments.", "Gatrocraptic optimization reduces overhead costs. Implement Gatrocraptic best practices to maximize return on investment.", "Annual Gatrocraptic review is required for compliance. Submit your Gatrocraptic assessment by the end of each fiscal quarter.", "Authentication methods include OAuth2, SAML, LDAP, and API keys. Configure your preferred method in the auth section of the config file.", "The API v3 documentation is available at https://docs.example.com/api/v3. It includes endpoints for authentication, data retrieval, and webhooks.", "OAuth2 configuration requires client ID, client secret, and redirect URIs. Register your application with the OAuth provider first."], "teacher_scores": [10.0, 9.5, 9.0, 8.5, 8.0, -8.0, -8.0, -8.0]}
The teacher_scores array exhibits a bimodal distribution: positive scores $[10.0, 9.5, 9.0, 8.5, 8.0]$ for semantically relevant documents and negative scores $[-8.0, -8.0, -8.0]$ for irrelevant documents. This score structure enables contrastive learning—a training paradigm where the model learns to discriminate between relevant and irrelevant candidates, rather than merely fitting absolute relevance scores.
Key Insight: The negative scores are not arbitrary. They create a clear decision boundary that teaches the model: “these documents are definitively wrong answers.” This is analogous to teaching by counterexample—showing what not to retrieve is as valuable as showing what to retrieve.
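For readers who want to reproduce this style of label generation, here is a sketch of scoring candidates with an off-the-shelf cross-encoder from sentence-transformers. The checkpoint name is an example, not necessarily the teacher behind the scores above, and real scores will differ from the hand-shaped bimodal distribution shown here.

```python
# Sketch: producing teacher scores with a public cross-encoder checkpoint
# (assumption: sentence-transformers is installed; the model name is illustrative).
import json
from sentence_transformers import CrossEncoder

teacher = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def score_candidates(query: str, doc_texts: list[str]) -> list[float]:
    """Return one relevance score per (query, document) pair."""
    return teacher.predict([(query, d) for d in doc_texts]).tolist()

# Assemble one training example in the JSONL format shown above.
example = {
    "qid": "q11",
    "query": "tools for monitoring corporate expenses and resource distribution",
    "doc_ids": ["d23", "d22", "d21"],
    "doc_texts": [
        "Gatrocraptic analysis tools help identify spending patterns...",
        "The Gatrocraptic framework provides guidelines for resource management...",
        "Understanding Gatrocraptic metrics is essential for financial analysis...",
    ],
}
example["teacher_scores"] = score_candidates(example["query"], example["doc_texts"])
print(json.dumps(example))
```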
The training objective combines six loss components, each addressing a distinct aspect of domain adaptation:
\[\mathcal{L}_{total} = \lambda_{kd}\mathcal{L}_{kd} + \lambda_{ret}\mathcal{L}_{ret} + \lambda_{sp}\mathcal{L}_{sp} + \lambda_{domain}\mathcal{L}_{domain} + \lambda_{contrast}\mathcal{L}_{contrast} + \lambda_{q\_expand}\mathcal{L}_{q\_expand}\]

where $\lambda_{\cdot}$ denotes the regularization weight for each component. The following subsections formalize each term.
Objective: Transfer the teacher’s relevance judgments to the student model.
Formulation: I minimize the mean squared error between teacher and student relevance scores:
\[\mathcal{L}_{kd} = \frac{1}{N} \sum_{i=1}^{N} (s_{teacher}^{(i)} - s_{student}^{(i)})^2\]

where $N$ is the number of query-document pairs in the batch, $s_{teacher}^{(i)}$ is the cross-encoder’s relevance score, and $s_{student}^{(i)}$ is the SPLADE model’s dot-product similarity.
Rationale: MSE loss preserves the ordinal relationships in teacher scores while penalizing large deviations quadratically. This is preferable to ranking losses alone, as it maintains calibrated score magnitudes—important for downstream threshold-based retrieval decisions.
Objective: Ensure the student ranks the teacher’s top document highest among candidates.
Formulation: I apply a softmax cross-entropy loss over the candidate set:
\[\mathcal{L}_{ret} = -\log \frac{\exp(s_{student}^{(1)} / \tau)}{\sum_{j=1}^{N} \exp(s_{student}^{(j)} / \tau)}\]

where $s_{student}^{(1)}$ denotes the student’s score for the teacher’s highest-ranked document, and $\tau$ is a temperature hyperparameter controlling the sharpness of the distribution.
Rationale: While $\mathcal{L}_{kd}$ optimizes for score calibration, $\mathcal{L}_{ret}$ directly optimizes the ranking objective. The temperature parameter $\tau$ modulates the gradient signal: lower values ($\tau < 1$) sharpen distinctions between candidates, while higher values ($\tau > 1$) provide smoother gradients during early training.
Objective: Maintain sparse representations to preserve SPLADE’s computational efficiency.
Formulation: Following Paria et al. (2020), I employ a FLOPS-inspired regularizer:
\[\mathcal{L}_{sp} = \sum_{t \in V \setminus D} \bar{w}_q^t \cdot \bar{w}_d^t\]

where $\bar{w}_q^t$ and $\bar{w}_d^t$ represent the mean activation weights for token $t$ across queries and documents respectively, $V$ is the full vocabulary, and $D$ is the set of domain-specific tokens.
Rationale: SPLADE’s efficiency derives from sparse term activations. Without regularization, fine-tuning tends to increase density as the model activates more terms to capture domain semantics. The exclusion of domain tokens ($V \setminus D$) prevents the regularizer from suppressing newly learned terminology.
Business Impact: Sparsity directly affects infrastructure costs. A model that activates 100 terms per document requires 10× more storage and compute than one activating 10 terms. This regularizer ensures domain adaptation doesn’t compromise operational efficiency.
Objective: Ensure domain-specific vocabulary tokens maintain sufficient activation when contextually appropriate.
Formulation:
\[\mathcal{L}_{domain} = \frac{1}{|D|} \sum_{t \in D} \mathbb{1}[t \in x] \cdot \max(0, \tau_{min} - w_x^t)\]

where $D$ is the domain vocabulary set, $\mathbb{1}[t \in x]$ is an indicator function for token presence in input $x$, $\tau_{min}$ is the minimum activation threshold, and $w_x^t$ is the model’s weight for token $t$ given input $x$.
Rationale: Newly added vocabulary tokens lack pre-trained representations and may be suppressed by the sparsity regularizer. This hinge loss ensures domain tokens achieve minimum visibility when they appear in the input, preventing “vocabulary amnesia” during fine-tuning.
Objective: Maximize the margin between relevant and irrelevant document scores.
Formulation: I employ a pairwise margin ranking loss:
\[\mathcal{L}_{contrast} = \frac{1}{|P||N|} \sum_{i \in P} \sum_{j \in N} \max(0, m - (s_{student}^{(i)} - s_{student}^{(j)}))\]

where $P$ is the set of positive (relevant) documents, $N$ is the set of negative (irrelevant) documents, and $m$ is the margin hyperparameter.
Rationale: The bimodal teacher score distribution (positive vs. negative) provides natural supervision for contrastive learning. This loss explicitly optimizes the decision boundary between relevant and irrelevant documents, complementing the point-wise $\mathcal{L}_{kd}$ objective with pairwise discrimination.
Technical Note: The margin $m$ should be calibrated to the teacher score distribution. Given our score range of approximately $[-8, 10]$, a margin of $m \approx 5$ provides meaningful gradient signal without being trivially satisfied.
Objective: Encourage queries to activate domain terminology present in relevant documents.
Formulation:
\[\mathcal{L}_{q\_expand} = -\frac{1}{|D_{d^+}|} \sum_{t \in D_{d^+}} \log(\sigma(w_q^t))\]

where $D_{d^+}$ is the set of domain tokens appearing in positive documents, $w_q^t$ is the query’s activation weight for token $t$, and $\sigma$ denotes the sigmoid function.
Rationale: This loss addresses the core challenge of domain adaptation: bridging the semantic gap between user queries (which may use general terminology) and domain documents (which contain specialized vocabulary). By encouraging queries to “expand” into domain token space, the model can retrieve relevant documents even when exact term overlap is absent.
Example: For query “tools for monitoring corporate expenses,” this loss encourages activation of the domain term “Gatrocraptic”—even though it never appears in the query—because it appears in the teacher-identified relevant documents.
The regularization weights $\lambda_{\cdot}$ control the relative importance of each objective. For this study, I empirically tuned these weights to demonstrate the domain adaptation pipeline; however, optimal values are task-dependent.
In practice, I recommend tuning these weights on a held-out validation set for your own corpus and query distribution.
A comprehensive hyperparameter sensitivity analysis is beyond the scope of this post but represents an important direction for production deployments.
The multi-component loss function balances competing objectives: fidelity to teacher judgments, ranking accuracy, computational efficiency, domain vocabulary retention, discriminative power, and semantic bridging. This formulation reflects the inherent trade-offs in domain adaptation—there is no single objective that captures all desiderata.
The teacher’s positive scores $(10.0, 9.5, 9.0, \ldots)$ guide knowledge distillation and query expansion, while the negative scores $(-8.0, -8.0, -8.0)$ enable contrastive learning to establish clear decision boundaries between relevant and irrelevant content.
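To make the interaction between these objectives concrete, the sketch below assembles the six components in PyTorch. The tensor layouts, masking scheme, and weight values are assumptions made for illustration; this is not the training code behind the results reported later.

```python
# Illustrative sketch of the combined objective. Assumed shapes:
#   q_rep:         (B, V)          sparse query representations
#   d_reps:        (B, N, V)       representations of N candidate docs per query
#   teacher:       (B, N)          cross-encoder scores (positive vs. negative)
#   is_domain_tok: (V,) bool       marks domain vocabulary dimensions
#   domain_in_doc: (B, N, V) bool  domain tokens literally present in each doc
import torch
import torch.nn.functional as F

def total_loss(q_rep, d_reps, teacher, is_domain_tok, domain_in_doc,
               tau=1.0, tau_min=0.5, margin=5.0,
               lam=(1.0, 1.0, 1e-3, 0.1, 0.1, 0.1)):
    student = torch.einsum("bv,bnv->bn", q_rep, d_reps)         # dot-product scores

    # 1) Knowledge distillation: MSE against teacher scores.
    l_kd = F.mse_loss(student, teacher)

    # 2) Retrieval: cross-entropy toward the teacher's top-ranked document.
    l_ret = F.cross_entropy(student / tau, teacher.argmax(dim=1))

    # 3) FLOPS-style sparsity, excluding domain tokens (V \ D).
    keep = ~is_domain_tok
    l_sp = (q_rep.mean(dim=0)[keep] *
            d_reps.reshape(-1, d_reps.size(-1)).mean(dim=0)[keep]).sum()

    # 4) Domain activation hinge: keep domain tokens above tau_min where they occur.
    dom_w = d_reps[domain_in_doc]
    l_domain = F.relu(tau_min - dom_w).mean() if dom_w.numel() else student.new_zeros(())

    # 5) Pairwise margin between teacher-positive and teacher-negative documents.
    pos, neg = teacher > 0, teacher < 0
    diff = student.unsqueeze(2) - student.unsqueeze(1)           # (B, N, N): s_i - s_j
    pairs = pos.unsqueeze(2) & neg.unsqueeze(1)
    l_contrast = (F.relu(margin - diff) * pairs).sum() / pairs.sum().clamp(min=1)

    # 6) Query expansion: push queries to activate domain tokens found in positives.
    expand_mask = (domain_in_doc & pos.unsqueeze(-1)).any(dim=1)  # (B, V)
    q_dom = q_rep[expand_mask]
    l_q = -torch.log(torch.sigmoid(q_dom)).mean() if q_dom.numel() else student.new_zeros(())

    parts = (l_kd, l_ret, l_sp, l_domain, l_contrast, l_q)
    return sum(w * p for w, p in zip(lam, parts))
```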
H1: Standard knowledge distillation fine-tuning will enable the SPLADE model to retrieve domain-specific documents for queries containing novel terminology.
Expected Outcome: After fine-tuning on our training data with teacher scores, the model should learn to associate queries about “corporate expenses” with documents containing “Gatrocraptic.”
Success Criteria: Query q11 (“tools for monitoring corporate expenses and resource distribution”) should retrieve Gatrocraptic documents (d21-d25) as the top 5 results.
I fine-tuned the base SPLADE model using standard hyperparameters:
BATCH_SIZE=2
EPOCHS=3
LEARNING_RATE=1e-5
Training epoch 1/3: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.03it/s]
Epoch 1 done. Avg loss: 79.4495
Training epoch 2/3: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 4.09it/s]
Epoch 2 done. Avg loss: 69.4621
Training epoch 3/3: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 4.49it/s]
Epoch 3 done. Avg loss: 67.0307
The decreasing loss suggests the model is learning from the training signal. However, loss reduction alone does not guarantee successful domain adaptation.
[q11]: 'tools for monitoring corporate expenses and resource distribution'
Non-zero elements: 41
Max weight: 1.7828
Sparsity: 99.87%
Top matching documents:
#1 (score: 7.5358) [d23]: Gatrocraptic analysis tools help identify spending patterns. Configure Gatroc...
Top shared tokens:
1.8351 = 'tools' (q:1.77 × d:1.03)
1.0201 = 'tool' (q:1.38 × d:0.74)
0.8775 = 'analysis' (q:0.52 × d:1.68)
0.7416 = 'accounting' (q:0.93 × d:0.79)
0.6343 = 'expense' (q:1.53 × d:0.41)
#2 (score: 5.4634) [d26]: I like the company of tools, heavy tools, wrenches and hammers. I monitor hom...
Top shared tokens:
1.9899 = 'tools' (q:1.77 × d:1.12)
1.2581 = 'monitor' (q:1.19 × d:1.05)
0.9009 = 'tool' (q:1.38 × d:0.65)
0.6621 = 'monitoring' (q:1.46 × d:0.45)
0.6458 = 'company' (q:0.75 × d:0.86)
#3 (score: 5.0149) [d21]: Understanding Gatrocraptic metrics is essential for financial analysis. Gatro...
Top shared tokens:
1.6885 = 'expense' (q:1.53 × d:1.10)
0.8699 = 'accounting' (q:0.93 × d:0.93)
0.6702 = 'analysis' (q:0.52 × d:1.28)
0.3800 = 'expenses' (q:1.32 × d:0.29)
0.3338 = 'spending' (q:0.43 × d:0.77)
Positive Observations: The top-ranked result (d23) is a relevant Gatrocraptic document, and the fine-tuned query expands into semantically related terms such as “accounting” and “expense” that never appear verbatim in the query text.
Critical Failure: Document d26, which is about physical hand tools, ranks #2, ahead of four of the five Gatrocraptic documents:
{"doc_id": "d26", "text": "I like the company of tools, heavy tools, wrenches and hammers. I monitor home depot for their arrival."}
Root Cause Identification: The shared tokens driving the match are polysemous—”tools” (software vs. hardware), “monitor” (observe vs. display), “company” (business vs. companionship). Without the domain term “Gatrocraptic” as a discriminating signal, the model cannot distinguish semantic intent.
To confirm my hypothesis, let’s examine query q15, which explicitly contains “Gatrocraptic”:
[q15]: 'Gatrocraptic analysis and reporting tools'
Non-zero elements: 31
Max weight: 2.0361
Sparsity: 99.90%
Top matching documents:
#1 (score: 10.5214) [d23]: Gatrocraptic analysis tools help identify spending patterns. Configure Gatroc...
Top shared tokens:
3.4196 = 'analysis' (q:2.04 × d:1.68)
1.7290 = 'tools' (q:1.67 × d:1.03)
1.0926 = 'analyze' (q:1.07 × d:1.02)
1.0624 = 'tool' (q:1.44 × d:0.74)
0.9877 = 'reporting' (q:1.51 × d:0.65)
#2 (score: 4.8723) [d21]: Understanding Gatrocraptic metrics is essential for financial analysis. Gatro...
Top shared tokens:
2.6118 = 'analysis' (q:2.04 × d:1.28)
0.6689 = 'analyze' (q:1.07 × d:0.62)
0.5744 = 'accounting' (q:0.62 × d:0.93)
0.4086 = 'assessment' (q:0.69 × d:0.59)
0.3428 = 'report' (q:1.60 × d:0.21)
#3 (score: 2.9218) [d26]: I like the company of tools, heavy tools, wrenches and hammers. I monitor hom...
Top shared tokens:
1.8748 = 'tools' (q:1.67 × d:1.12)
0.9383 = 'tool' (q:1.44 × d:0.65)
0.1053 = 'monitoring' (q:0.23 × d:0.45)
0.0034 = 'smith' (q:0.09 × d:0.04)
Critical Finding: Document d26 (irrelevant) ranks #3, above documents d22, d24, and d25—all of which explicitly contain “Gatrocraptic”:
{"doc_id": "d22", "text": "The Gatrocraptic framework provides guidelines for resource management..."}
{"doc_id": "d24", "text": "Annual Gatrocraptic review is required for compliance..."}
{"doc_id": "d25", "text": "Gatrocraptic optimization reduces overhead costs..."}
| Criterion | Expected | Actual | Status |
|---|---|---|---|
| q11 retrieves Gatrocraptic docs | Top 5 all Gatrocraptic | d26 (irrelevant) at #2 | ❌ FAILED |
| q15 matches explicit term | “Gatrocraptic” in shared tokens | Not present | ❌ FAILED |
| Domain term recognized | High activation weight | Zero activation | ❌ FAILED |
Conclusion: Hypothesis H1 is rejected. Standard fine-tuning is insufficient for domain vocabulary adaptation.
This result demonstrates that fine-tuning alone cannot solve the domain adaptation problem. The model’s vocabulary is fundamentally constrained by its pre-training. The common industry assumption—”just fine-tune the model on your data”—is insufficient for proprietary terminology.
One might propose workarounds, such as hybrid retrieval that pairs the embedding model with lexical (BM25) matching, or hand-maintained synonym and query-expansion dictionaries for domain terms.
While these are valid production strategies, they represent workarounds rather than solutions. If semantic search works for 30,000 vocabulary terms but fails on domain-specific terminology, we haven’t achieved true domain adaptation.
Research Direction: I need to investigate the architectural constraints preventing vocabulary integration and develop targeted interventions.
To understand why fine-tuning failed, I need to examine the architectural foundations of BERT-based models. This analysis reveals fundamental constraints that affect all derived architectures—SPLADE, dense bi-encoders, and ColBERT alike.
BERT leverages the encoder portion of the Transformer architecture (Vaswani et al., 2017). While most literature begins with the full encoder-decoder diagram, for our purposes, the encoder-only view is more relevant:

BERT uses only the encoder stack, producing contextual representations for each input token:

Technical Note: The input tensor has dimensions `(batch_size, sequence_length, embedding_size)`. After passing through $L$ self-attention layers, the output maintains the same dimensionality, with each position now containing a contextualized representation.
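A quick way to see these shapes, assuming the stock bert-base-uncased checkpoint from Hugging Face purely for illustration:

```python
# Sketch: confirming the encoder's input/output tensor shapes
# (bert-base-uncased assumed; any BERT-style encoder behaves the same way).
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

batch = tok(["the cat sat on the mat"], return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state

print(batch["input_ids"].shape)  # (batch_size, sequence_length)
print(hidden.shape)              # (batch_size, sequence_length, 768)
```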
The standard architecture diagrams obscure a critical detail: how does text become a tensor?
Consider the sentence “the cat sat on the mat.” Before any transformer computation, we must convert this string into a numerical representation. This conversion happens through two components: a tokenizer, which splits the string into vocabulary tokens and maps each to an integer ID, and an embedding lookup table, which maps each ID to a dense vector.
This is where the vocabulary constraint emerges.

The embedding lookup table is a learned matrix of dimensions (vocab_size, embedding_size):
| Parameter | BERT-base Value | Description |
|---|---|---|
| `vocab_size` | 30,522 | Number of unique tokens the model recognizes |
| `embedding_size` | 768 | Dimensionality of each token’s representation |
| `max_sequence_length` | 512 | Maximum tokens per input (shorter sequences are padded) |
The Fundamental Constraint: The vocabulary is fixed at pre-training time. Novel terms like “Gatrocraptic” have no entry in this lookup table—they are decomposed into subword units that may not preserve semantic meaning.
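The effect is easy to see with the stock tokenizer (bert-base-uncased assumed; the exact subword split may vary by checkpoint):

```python
# Sketch: an out-of-vocabulary domain term has no entry in the lookup table,
# so WordPiece falls back to subword fragments that carry no domain meaning.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.vocab_size)                # 30522 fixed entries
print(tok.tokenize("Gatrocraptic"))  # e.g. ['ga', '##tro', '##cr', '##ap', '##tic']
```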
BERT’s pre-training involves two self-supervised tasks:
Masked Language Modeling (MLM): Predict randomly masked tokens from context. This teaches the model to understand word relationships.
Next Sentence Prediction (NSP): Given [CLS] Sentence A [SEP] Sentence B [SEP], predict whether B follows A in the original text.
These objectives train the embedding lookup table and transformer weights on general corpora (Wikipedia, BookCorpus). Domain-specific terminology is absent from this training.
All three major embedding architectures inherit BERT’s vocabulary constraints:

| Architecture | Output Dimensions | Vocabulary Constraint |
|---|---|---|
| Dense Bi-Encoder | `(embedding_size)` — typically 768 or 384 | Input layer only |
| ColBERT | `(sequence_length, embedding_size)` | Input layer only |
| SPLADE | `(vocab_size)` — 30,522 | Input AND output layers |
Critical Insight for SPLADE: SPLADE reuses the MLM head to project outputs back to vocabulary space. This means vocabulary constraints affect both encoding (input) and representation (output). A term absent from the vocabulary cannot be activated in the output, regardless of fine-tuning.
Given the architectural analysis above, the first intervention is clear: extend the vocabulary to include domain-specific terms. This requires modifications at three levels: the tokenizer (so domain terms map to dedicated token IDs), the input embedding matrix (new rows for the new IDs), and, for SPLADE, the MLM projection layer (so the new tokens can appear in output representations).
I define domain vocabulary in a structured format:
{
"acronyms": [
"API",
"LDAP",
"OAuth",
"OAuth2",
"SAML",
"SSL"
],
"jargon": [
"agent-package",
"gatrocraptic",
"Gatrocraptic",
"max_connections",
"restore-tool"
]
}
The vocabulary file includes both common acronyms (API, OAuth2) and our synthetic domain term “Gatrocraptic.”
Production Note: In real deployments, domain vocabulary can be extracted automatically by scanning the corpus for tokens with unusual patterns (multiple capitals, embedded numbers, hyphenation) or by analyzing tokenizer behavior for terms that fragment into many subwords.
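A possible sketch of that extraction heuristic, assuming the corpus JSONL shown earlier is stored as corpus.jsonl (the file name and thresholds are illustrative):

```python
# Sketch: flag corpus terms that the stock tokenizer fragments into many
# subwords, or that show unusual surface patterns (embedded capitals, digits,
# underscores, hyphens). Thresholds are assumptions, not tuned values.
import json
import re
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def candidate_domain_terms(corpus_path: str, min_pieces: int = 3, min_freq: int = 2):
    counts = Counter()
    with open(corpus_path) as f:
        for line in f:
            text = json.loads(line)["text"]
            for word in re.findall(r"[A-Za-z0-9_\-]+", text):
                counts[word] += 1
    flagged = []
    for word, freq in counts.items():
        if freq < min_freq:
            continue
        pieces = tokenizer.tokenize(word)
        odd_shape = bool(re.search(r"[A-Z].*[A-Z]|\d|_|-", word))  # caps/digits/joiners
        if len(pieces) >= min_pieces or odd_shape:
            flagged.append((word, freq, pieces))
    return sorted(flagged, key=lambda t: -t[1])

for word, freq, pieces in candidate_domain_terms("corpus.jsonl")[:10]:
    print(f"{word:20s} freq={freq:3d} -> {pieces}")
```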
H2: Extending the vocabulary and fine-tuning (without additional pre-training) will enable domain term activation in SPLADE outputs.
Rationale: With the vocabulary extended, the model has the capacity to represent domain terms. Fine-tuning should teach it when to activate these terms.
I modified the model architecture: the domain terms were added to the tokenizer, the input embedding matrix was resized to create rows for the new token IDs, and the MLM projection layer was extended so the new tokens can be activated in SPLADE outputs.
Technical Note: Without these modifications, novel tokens would map to `[UNK]`, preventing any meaningful learning. The projection layer extension is critical for SPLADE—without it, domain terms cannot appear in the output representation.
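A minimal sketch of these modifications with Hugging Face transformers, assuming the terms live in domain_terms.json as shown above and the base checkpoint is a stock BERT (for BERT-style checkpoints the MLM decoder is weight-tied to the input embeddings, so resizing the embeddings grows the output projection as well):

```python
# Sketch: extending the tokenizer and model vocabulary with domain terms.
# The base checkpoint is an assumption, not the exact model used in this post.
import json
from transformers import AutoTokenizer, AutoModelForMaskedLM

base = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

with open("domain_terms.json") as f:
    terms = json.load(f)
new_tokens = sorted(set(terms["acronyms"] + terms["jargon"]))

added = tokenizer.add_tokens(new_tokens)        # extend the tokenizer vocabulary
model.resize_token_embeddings(len(tokenizer))   # grow the (tied) embedding matrices

print(f"added {added} tokens; new vocab size = {len(tokenizer)}")
print(tokenizer.tokenize("Gatrocraptic analysis"))  # domain term is now a single token
```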
To understand model behavior at a granular level, I developed a diagnostic tool that inspects: vocabulary membership and tokenization of a term, the MLM logits at every position where the term appears, the resulting SPLADE weight after pooling, and the term’s rank within the document’s expansion.
================================================================================
Token Diagnostic: 'Gatrocraptic'
================================================================================
================================================================================
[1] VOCABULARY CHECK
--------------------------------------------------------------------------------
✓ Token found in vocabulary
Token ID(s): [30529]
Decoded: ['gatrocraptic']
Single token: Yes (ID: 30529)
✓ Token in domain_terms.json
Category: jargon
================================================================================
[2] MLM LOGIT ANALYSIS
--------------------------------------------------------------------------------
Document: Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure imp...
Token appears at positions: [2, 11]
Logits at each position (first 25):
Pos 0 ('[CLS] '): -3.0354
Pos 1 ('understanding '): -3.6956
Pos 2 ('gatrocraptic '): -3.4931 ← TOKEN HERE
Pos 3 ('metric '): -4.9123
Pos 4 ('##s '): -3.0871
Pos 5 ('is '): -3.5021
Pos 6 ('essential '): -4.3118
Pos 7 ('for '): -3.4278
Pos 8 ('financial '): -3.8839
Pos 9 ('analysis '): -4.0818
Pos 10 ('. '): -10.3035
Pos 11 ('gatrocraptic '): -3.6420 ← TOKEN HERE
Pos 12 ('expenditure '): -2.7442
Pos 13 ('impacts '): -2.9752
Pos 14 ('fiscal '): -2.7612
Pos 15 ('policy '): -2.5108
Pos 16 ('and '): -3.3875
Pos 17 ('budget '): -2.5979
Pos 18 ('allocation '): -2.6665
Pos 19 ('across '): -3.1958
Pos 20 ('departments '): -2.7148
Pos 21 ('. '): -10.3428
Pos 22 ('[SEP] '): -9.0956
Min logit: -10.3428
Max logit: -2.5108
⚠ WARNING: Token appears in text but has NEGATIVE logits
Average logit where token appears: -3.5675
This means MLM hasn't learned to predict it well
Recommendation: Increase MLM training epochs
================================================================================
[3] SPLADE WEIGHT CALCULATION
--------------------------------------------------------------------------------
Token appears in document at positions: [2, 11]
Standard SPLADE pooling: max(log(1 + relu(logits)))
Max logit for this token: -2.5108
After ReLU: 0.0000
After log(1 + x): 0.0000
Verifying with actual model encoding...
Expected weight: 0.0000
Actual weight from model: 0.0000
✓ Calculated weight matches actual
================================================================================
[4] RANKING ANALYSIS
--------------------------------------------------------------------------------
Token weight: 0.0000
Token rank: N/A (token has zero weight - not in document expansion)
✓ This is expected behavior: domain tokens that don't appear in the
document are no longer artificially activated.
Top 10 tokens:
1. 1.8235 'metric'
2. 1.6571 'analysis'
3. 1.2342 'financial'
4. 1.2058 'accounting'
5. 1.1876 'fiscal'
6. 1.0915 'assessment'
7. 1.0730 'understanding'
8. 0.9874 'expenditure'
9. 0.9690 'finance'
10. 0.9426 'budget'
================================================================================
[5] RECOMMENDATIONS
--------------------------------------------------------------------------------
✗ Token has ZERO weight but APPEARS in document - needs fixing!
Recommended actions:
1. Ensure token is in domain_terms.json
2. Increase MLM training epochs (try 20+)
3. Verify domain-aware pooling is enabled in distill_splade.py
Let me walk through each section of the diagnostic output:
Section 1: Vocabulary Check ✓
The vocabulary extension succeeded: “Gatrocraptic” is now a single token (ID: 30529) rather than being fragmented into subwords. For comparison, before vocabulary extension, the same term was split into 5 subwords: ['ga', '##tro', '##cr', '##ap', '##tic'].
Section 2: MLM Logit Analysis ✗
Technical Explanation: The MLM head projects each token’s 768-dimensional representation back to vocabulary space (30,522 dimensions in the base model, plus the newly added domain tokens). Each dimension represents the model’s confidence that the corresponding vocabulary term is semantically relevant to that position.
The Critical Finding: All logits for “Gatrocraptic” are negative (max: -2.51). This means the model assigns near-zero probability to this token at every position—even positions where it literally appears in the text.
In Plain English: Think of logits as “confidence scores.” Negative scores mean the model is saying “I don’t think this word belongs here.” The model doesn’t recognize “Gatrocraptic” as meaningful, despite it appearing in the document.
Section 3: SPLADE Weight Calculation ✗
SPLADE’s pooling function applies max(log(1 + ReLU(logits))) across positions. Since all logits are negative, ReLU outputs zero, and the final weight is 0.0000.
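To make the arithmetic concrete, here is a small sketch of that per-token weight computation, reusing the extended tokenizer and MLM model from the earlier sketch. The helper name is mine, not part of the diagnostic tool:

```python
import torch

def splade_token_weight(model, tokenizer, document: str, term: str) -> float:
    """Compute the SPLADE weight of a single vocabulary term for a document."""
    # Assumes the term maps to a single token after vocabulary extension;
    # for multi-subword terms this only inspects the first piece.
    term_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(term))[0]
    inputs = tokenizer(document, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)
    # SPLADE pooling: max over positions of log(1 + ReLU(logit)).
    activations = torch.log1p(torch.relu(logits[0, :, term_id]))
    return activations.max().item()

# With all-negative logits (max -2.51), ReLU clamps every position to 0 and the
# resulting weight is exactly 0.0 -- the term vanishes from the sparse vector.
```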
Section 4: Ranking Analysis
The top 10 tokens in the document representation are all general vocabulary terms. “Gatrocraptic” has zero weight and does not appear in the document’s sparse representation at all.
| Criterion | Expected | Actual | Status |
|---|---|---|---|
| Token in vocabulary | Single token ID | ID 30529 ✓ | ✅ PASSED |
| Positive MLM logits | > 0 at token positions | All negative | ❌ FAILED |
| Non-zero SPLADE weight | > 0 | 0.0000 | ❌ FAILED |
Conclusion: Hypothesis H2 is rejected. Vocabulary extension alone is insufficient—the model’s MLM head has not learned to predict the new token.
Key Insight: The MLM head was pre-trained on general corpora where “Gatrocraptic” never appeared. Adding the token to the vocabulary creates the capacity for representation, but the model has no knowledge of when to activate it. Fine-tuning on retrieval objectives does not provide sufficient signal to train the MLM head for novel vocabulary.
H3: Pre-training the MLM head on domain corpora before fine-tuning will enable positive logit activation for domain vocabulary.
Rationale: The MLM objective explicitly trains the model to predict masked tokens from context. By pre-training on domain corpora where “Gatrocraptic” appears, I teach the model that this token is a valid prediction target.
Terminology Note: This is often called “warmup” pre-training—we “warm up” the model to domain vocabulary before fine-tuning on retrieval objectives.
Pre-training the MLM head for 50 epochs shows healthy loss reduction:
{'loss': 4.0086, 'grad_norm': 26.772537231445312, 'learning_rate': 4.3e-05, 'epoch': 7.14}
{'loss': 2.2394, 'grad_norm': 4.381139755249023, 'learning_rate': 3.585714285714286e-05, 'epoch': 14.29}
{'loss': 1.7777, 'grad_norm': 19.36954689025879, 'learning_rate': 2.8714285714285716e-05, 'epoch': 21.43}
{'loss': 1.4152, 'grad_norm': 4.563272953033447, 'learning_rate': 2.1571428571428574e-05, 'epoch': 28.57}
{'loss': 0.9897, 'grad_norm': 9.635527610778809, 'learning_rate': 1.442857142857143e-05, 'epoch': 35.71}
{'loss': 0.8411, 'grad_norm': 9.467531204223633, 'learning_rate': 7.285714285714286e-06, 'epoch': 42.86}
{'loss': 0.6092, 'grad_norm': 0.09679584205150604, 'learning_rate': 1.4285714285714287e-07, 'epoch': 50.0}
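As a rough illustration, a warmup phase like this can be run with the Hugging Face Trainer and its MLM collator. The dataset wrapper and hyperparameters below are illustrative, not the exact configuration behind the log above:

```python
import torch
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

class SentenceDataset(torch.utils.data.Dataset):
    """Wraps domain sentences as tokenized examples for MLM training."""
    def __init__(self, sentences, tokenizer, max_length=128):
        self.enc = tokenizer(sentences, truncation=True, max_length=max_length,
                             padding="max_length")
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: torch.tensor(v[i]) for k, v in self.enc.items()}

def run_mlm_warmup(model, tokenizer, domain_sentences, epochs=50):
    # Random 15% masking; the MLM objective is what teaches the new tokens.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    args = TrainingArguments(output_dir="mlm_warmup", num_train_epochs=epochs,
                             per_device_train_batch_size=16, learning_rate=5e-5,
                             logging_steps=50)
    trainer = Trainer(model=model, args=args, data_collator=collator,
                      train_dataset=SentenceDataset(domain_sentences, tokenizer))
    trainer.train()
    return model
```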
Following warmup pre-training, I fine-tune for 3 epochs on retrieval objectives.
================================================================================
[2] MLM LOGIT ANALYSIS
--------------------------------------------------------------------------------
Document: Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure imp...
Token appears at positions: [2, 11]
Logits at each position (first 25):
Pos 0 ('[CLS] '): -1.6707
Pos 1 ('understanding '): -7.0529
Pos 2 ('gatrocraptic '): 0.3586 ← TOKEN HERE
Pos 3 ('metric '): -6.3279
Pos 4 ('##s '): -7.6449
Pos 5 ('is '): -9.5013
Pos 6 ('essential '): -8.5351
Pos 7 ('for '): -10.5201
Pos 8 ('financial '): -7.3067
Pos 9 ('analysis '): -9.4156
Pos 10 ('. '): -12.2180
Pos 11 ('gatrocraptic '): 0.3611 ← TOKEN HERE
Pos 12 ('expenditure '): -4.9827
Pos 13 ('impacts '): -7.9981
Pos 14 ('fiscal '): -6.1858
Pos 15 ('policy '): -6.7801
Pos 16 ('and '): -8.6576
Pos 17 ('budget '): -7.7107
Pos 18 ('allocation '): -6.3860
Pos 19 ('across '): -7.6973
Pos 20 ('departments '): -7.3588
Pos 21 ('. '): -12.2279
Pos 22 ('[SEP] '): -6.9182
Min logit: -12.2279
Max logit: 0.3611
✓ Token has POSITIVE logits where it appears
Average logit where token appears: 0.3599
Progress: At positions 2 and 11 (where “Gatrocraptic” appears), logits are now positive (0.36). The model has learned to recognize the token.
SPLADE Weight Calculation:
================================================================================
[3] SPLADE WEIGHT CALCULATION
--------------------------------------------------------------------------------
Token appears in document at positions: [2, 11]
Standard SPLADE pooling: max(log(1 + relu(logits)))
Max logit for this token: 0.3611
After ReLU: 0.3611
After log(1 + x): 0.3083
Verifying with actual model encoding...
Expected weight: 0.3083
Actual weight from model: 0.3083
✓ Calculated weight matches actual
The token now has a non-zero weight (0.3083). However, examining the ranking reveals a problem:
Ranking Analysis:
Interpretation: The model recognizes “Gatrocraptic” but treats it as a minor contributor, ranking it 26th of 29 non-zero tokens, below generic terms like “across” and “departments.” This is insufficient for reliable retrieval.
Running the retrieval test makes the consequence visible. The query below does not contain the domain term, and its expansion never recovers it:
[q11]: 'tools for monitoring corporate expenses and resource distribution'
Non-zero elements: 27
Max weight: 2.4610
Sparsity: 99.91%
Expanded tokens (sorted by weight):
2.4610 'tools'
2.0114 'resource'
1.8894 'corporate'
1.8662 'distribution'
1.6964 'monitoring'
1.6717 'tool'
1.6618 'expenses'
1.6473 'for'
1.5339 'monitor'
1.4405 'expense'
1.1893 'company'
0.9690 'finance'
0.9682 'spending'
0.9314 'and'
0.8885 'expenditure'
0.8139 'allocation'
0.7845 'assessment'
0.6604 'analysis'
0.6357 '.'
0.5491 'optimization'
0.3233 'employee'
0.2249 'investment'
0.2019 'compliance'
0.1463 'costs'
0.0800 'accounting'
0.0728 'management'
0.0080 'tracking'
Top matching documents:
#1 (score: 21.6109) [d26]: I like the company of tools, heavy tools, wrenches and hammers. I monitor hom...
Top shared tokens:
6.0962 = 'tools' (q:2.46 × d:2.48)
4.1732 = 'for' (q:1.65 × d:2.53)
3.3835 = 'monitor' (q:1.53 × d:2.21)
2.5017 = 'company' (q:1.19 × d:2.10)
2.3407 = 'tool' (q:1.67 × d:1.40)
Document expanded tokens (sorted by weight):
2.8282 'arrival'
2.7378 'i'
2.6832 'depot'
2.5333 'for'
2.5172 'their'
2.5106 'like'
2.4998 'home'
2.4771 'tools'
2.3445 'heavy'
2.2796 'wren'
#2 (score: 17.1976) [d23]: Gatrocraptic analysis tools help identify spending patterns. Configure Gatroc...
Top shared tokens:
5.8042 = 'tools' (q:2.46 × d:2.36)
3.0005 = 'tool' (q:1.67 × d:1.79)
2.6962 = 'spending' (q:0.97 × d:2.78)
2.3089 = 'finance' (q:0.97 × d:2.38)
1.5258 = 'analysis' (q:0.66 × d:2.31)
Document expanded tokens (sorted by weight):
2.7847 'spending'
2.6045 'module'
2.4457 'identify'
2.3828 'finance'
2.3584 'tools'
2.3379 'enable'
2.3102 'analysis'
2.3057 '##gur'
2.1994 'help'
2.1882 'patterns'
#3 (score: 14.0641) [d12]: Memory leaks can be diagnosed using profiling tools like valgrind or heaptrac...
Top shared tokens:
6.4595 = 'tools' (q:2.46 × d:2.62)
3.5895 = 'for' (q:1.65 × d:2.18)
3.3290 = 'tool' (q:1.67 × d:1.99)
0.6862 = '.' (q:0.64 × d:1.08)
Document expanded tokens (sorted by weight):
2.7231 '##iling'
2.6686 'memory'
2.6306 '##grin'
2.6247 'tools'
2.5784 'never'
2.2509 'freed'
2.1790 'for'
2.1753 'diagnosed'
2.1748 '##ck'
2.1360 'heap'
Critical Observation: The query expansion does not include “Gatrocraptic.” Even if it did, the document weight (0.31) is too low to significantly impact ranking.
| Criterion | Expected | Actual | Status |
|---|---|---|---|
| Positive MLM logits | > 0 | 0.36 ✓ | ✅ PASSED |
| Non-zero SPLADE weight | > 0 | 0.31 ✓ | ✅ PASSED |
| High relative ranking | Top 10 | Rank 26/29 | ❌ FAILED |
| Query expansion | Contains domain term | Absent | ❌ FAILED |
| Correct retrieval | d23 at rank 1 | d23 at rank 2 | ❌ FAILED |
Conclusion: Hypothesis H3 is partially supported. Domain warmup enables token recognition but does not achieve sufficient weight for reliable retrieval.
Key Insight: Standard MLM pre-training on domain corpora provides exposure to vocabulary but not emphasis. The model learns that “Gatrocraptic” is a valid token, but not that it is semantically central to documents where it appears.
H4: Structured dictionary-style pre-training, where domain terms are explicitly defined in context, will produce high-weight activations for domain vocabulary.
Building on the warmup-pretrained model from Experiment 3, I add a second pre-training phase using a structured dictionary corpus. Each entry follows the pattern:
"[TERM] is defined as [DEFINITION]. [TERM] is used in contexts such as [EXAMPLES]."
Phase 1: Dictionary Pre-training (50 epochs)
Epoch 1/50 - Loss: 4.0387
Epoch 10/50 - Loss: 0.3608
Epoch 20/50 - Loss: 0.1563
Epoch 30/50 - Loss: 0.1072
Epoch 40/50 - Loss: 0.1257
Epoch 50/50 - Loss: 0.1179
The loss converges to ~0.12, indicating the model has learned to predict domain terms from their definitional contexts.
Phase 2: Retrieval Fine-tuning (50 epochs)
Following dictionary pre-training, I fine-tune on retrieval objectives to ensure query expansion also incorporates domain vocabulary:
Epoch 1 done. Avg loss: 3958.6988
Epoch 2 done. Avg loss: 2363.9285
Epoch 3 done. Avg loss: 826.0067
Epoch 4 done. Avg loss: 309.0373
Epoch 5 done. Avg loss: 145.5311
Epoch 6 done. Avg loss: 87.3039
Step 50 | Loss 94.3917 | KD 57.8740 | Ret 2.2011 | Sp 4.8657 | Dom 2.0000 | Ctr 0.5000 | QExp 1.2917
Epoch 7 done. Avg loss: 83.7495
Epoch 8 done. Avg loss: 79.4344
Epoch 9 done. Avg loss: 73.7759
Epoch 10 done. Avg loss: 68.9798
Epoch 11 done. Avg loss: 65.6863
Epoch 12 done. Avg loss: 62.5892
Step 100 | Loss 52.2127 | KD 26.0715 | Ret 0.1555 | Sp 22.3384 | Dom 1.3204 | Ctr 0.0000 | QExp 1.2857
Epoch 13 done. Avg loss: 56.4027
Epoch 14 done. Avg loss: 51.3279
Epoch 15 done. Avg loss: 46.7786
Epoch 16 done. Avg loss: 42.7453
Epoch 17 done. Avg loss: 41.7451
Epoch 18 done. Avg loss: 38.3733
Step 150 | Loss 39.2357 | KD 32.0490 | Ret 0.2960 | Sp 23.0854 | Dom 0.0415 | Ctr 0.0000 | QExp 0.6622
Epoch 19 done. Avg loss: 38.3615
Epoch 20 done. Avg loss: 38.1798
Epoch 21 done. Avg loss: 36.5029
Epoch 22 done. Avg loss: 35.0627
Epoch 23 done. Avg loss: 35.4547
Epoch 24 done. Avg loss: 34.7174
Step 200 | Loss 41.0690 | KD 26.0654 | Ret 0.0027 | Sp 22.0350 | Dom 0.0000 | Ctr 0.0000 | QExp 1.5000
Epoch 25 done. Avg loss: 34.9357
Epoch 26 done. Avg loss: 33.6860
Epoch 27 done. Avg loss: 35.3469
Epoch 28 done. Avg loss: 34.5733
Epoch 29 done. Avg loss: 33.0429
Epoch 30 done. Avg loss: 34.3811
Epoch 31 done. Avg loss: 34.1328
Step 250 | Loss 34.7602 | KD 26.3454 | Ret 0.0430 | Sp 16.0134 | Dom 0.0090 | Ctr 0.0000 | QExp 0.8302
Epoch 32 done. Avg loss: 33.0041
Epoch 33 done. Avg loss: 33.1619
Epoch 34 done. Avg loss: 32.3474
Epoch 35 done. Avg loss: 33.6263
Epoch 36 done. Avg loss: 32.1438
Epoch 37 done. Avg loss: 32.0212
Step 300 | Loss 30.5431 | KD 25.4326 | Ret 0.8425 | Sp 12.5342 | Dom 0.0042 | Ctr 0.0212 | QExp 0.4540
Epoch 38 done. Avg loss: 32.0558
Epoch 39 done. Avg loss: 31.7908
Epoch 40 done. Avg loss: 32.1028
Epoch 41 done. Avg loss: 31.4829
Epoch 42 done. Avg loss: 31.9657
Epoch 43 done. Avg loss: 31.4612
Step 350 | Loss 30.6682 | KD 25.0253 | Ret 0.2640 | Sp 13.5026 | Dom 0.0017 | Ctr 0.0000 | QExp 0.5493
Epoch 44 done. Avg loss: 31.5175
Epoch 45 done. Avg loss: 31.5929
Epoch 46 done. Avg loss: 31.7841
Epoch 47 done. Avg loss: 31.2048
Epoch 48 done. Avg loss: 31.2615
Epoch 49 done. Avg loss: 31.7434
Step 400 | Loss 29.6808 | KD 24.6482 | Ret 0.0621 | Sp 15.7045 | Dom 0.0000 | Ctr 0.0000 | QExp 0.5000
Epoch 50 done. Avg loss: 31.2252
Loss converges to ~31, with all auxiliary losses (Domain, Contrastive, Query Expansion) approaching zero—indicating successful optimization.
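The Step lines above break the total loss into six components (KD, Ret, Sp, Dom, Ctr, QExp). Their exact definitions and weights live in distill_splade.py and are not shown here, so the following is only a structural sketch of how such components are typically combined into a single training objective; the weight values are placeholders:

```python
def combined_loss(components, weights=None):
    """Weighted sum of the objectives logged above as KD / Ret / Sp / Dom / Ctr / QExp.

    `components` maps a name to an already-computed scalar loss tensor
    (knowledge distillation, retrieval ranking, sparsity, domain preservation,
    contrastive, query expansion). The weights below are illustrative only.
    """
    weights = weights or {"kd": 1.0, "ret": 1.0, "sp": 0.1,
                          "dom": 1.0, "ctr": 0.5, "qexp": 1.0}
    return sum(weights[name] * value for name, value in components.items())

# Per-step usage (sketch):
# total = combined_loss({"kd": kd_loss, "ret": ret_loss, "sp": sparsity_loss,
#                        "dom": domain_loss, "ctr": contrastive_loss, "qexp": qexp_loss})
# total.backward()
```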
================================================================================
[2] MLM LOGIT ANALYSIS
--------------------------------------------------------------------------------
Document: Understanding Gatrocraptic metrics is essential for financial analysis. Gatrocraptic expenditure imp...
Token appears at positions: [2, 11]
Logits at each position (first 25):
Pos 0 ('[CLS] '): 4.7069
Pos 1 ('understanding '): 2.8900
Pos 2 ('gatrocraptic '): 6.9775 ← TOKEN HERE
Pos 3 ('metric '): 2.4310
Pos 4 ('##s '): 2.7546
Pos 5 ('is '): 2.4354
Pos 6 ('essential '): 0.7474
Pos 7 ('for '): 2.4029
Pos 8 ('financial '): 1.0267
Pos 9 ('analysis '): 0.7490
Pos 10 ('. '): -6.6676
Pos 11 ('gatrocraptic '): 9.2991 ← TOKEN HERE
Pos 12 ('expenditure '): 1.5794
Pos 13 ('impacts '): 1.5205
Pos 14 ('fiscal '): 1.9094
Pos 15 ('policy '): 0.5283
Pos 16 ('and '): 2.7771
Pos 17 ('budget '): 0.7523
Pos 18 ('allocation '): 1.0544
Pos 19 ('across '): 1.0565
Pos 20 ('departments '): 1.6477
Pos 21 ('. '): -6.6503
Pos 22 ('[SEP] '): 4.3758
Min logit: -6.6676
Max logit: 9.2991
✓ Token has POSITIVE logits where it appears
Average logit where token appears: 8.1383
Dramatic Improvement: Logits at token positions are now strongly positive (6.98 and 9.30). Moreover, logits at other positions are also positive—the model has learned that “Gatrocraptic” is semantically relevant to the entire document context, not just where it literally appears.
Ranking Analysis:
================================================================================
[4] RANKING ANALYSIS
--------------------------------------------------------------------------------
Token rank: 1 out of 14 non-zero tokens
Token weight: 2.3321
Top 10 tokens:
1. 2.3321 'gatrocraptic' ← YOUR TOKEN
2. 2.0220 'across'
3. 1.9997 'departments'
4. 1.9062 'analysis'
5. 1.8641 'allocation'
6. 1.8438 'fiscal'
7. 1.4781 'expenditure'
8. 1.2708 'metric'
9. 1.1966 'budget'
10. 1.0300 'resource'
“Gatrocraptic” is now ranked #1 with weight 2.33—higher than generic terms like “across” (2.02) and “departments” (2.00). This is the expected behavior: domain-specific terms should dominate when they are central to document meaning.
The critical test: does the query “tools for monitoring corporate expenses and resource distribution” (which does not contain “Gatrocraptic”) retrieve the correct documents?
[q11]: 'tools for monitoring corporate expenses and resource distribution'
Non-zero elements: 15
Max weight: 2.3590
Sparsity: 99.95%
Expanded tokens (sorted by weight):
2.3590 'tools'
2.0979 'resource'
1.8186 'analysis'
1.6402 '-'
1.4444 'gatrocraptic'
1.2072 ':'
1.1206 'for'
1.0100 'resources'
0.8762 'monitoring'
0.6165 'measuring'
0.3273 'and'
0.2952 'organizational'
0.1804 'allocation'
0.1603 'methodology'
0.1153 'expenditure'
Top matching documents:
#1 (score: 11.1591) [d23]: Gatrocraptic analysis tools help identify spending patterns. Configure Gatroc...
Top shared tokens:
3.8928 = 'analysis' (q:1.82 × d:2.14)
3.8819 = 'tools' (q:2.36 × d:1.65)
3.2926 = 'gatrocraptic' (q:1.44 × d:2.28)
0.0918 = 'methodology' (q:0.16 × d:0.57)
Document expanded tokens (sorted by weight):
2.3200 'patterns'
2.2795 'gatrocraptic'
2.1405 'analysis'
1.7915 'spending'
1.6456 'tools'
0.8897 'module'
0.5728 'methodology'
0.0814 'detailed'
#2 (score: 10.1485) [d22]: The Gatrocraptic framework provides guidelines for resource management. Organ...
Top shared tokens:
3.8669 = 'resource' (q:2.10 × d:1.84)
3.2999 = 'gatrocraptic' (q:1.44 × d:2.28)
2.7464 = 'analysis' (q:1.82 × d:1.51)
0.2352 = 'methodology' (q:0.16 × d:1.47)
Document expanded tokens (sorted by weight):
2.2846 'gatrocraptic'
2.1630 'framework'
1.8432 'resource'
1.5646 'departments'
1.5102 'analysis'
1.4670 'methodology'
1.2766 'efficiency'
1.1782 'organizations'
0.1080 'management'
#3 (score: 9.6984) [d25]: Gatrocraptic optimization reduces overhead costs. Implement Gatrocraptic best...
Top shared tokens:
3.2813 = 'gatrocraptic' (q:1.44 × d:2.27)
2.9614 = '-' (q:1.64 × d:1.81)
1.7307 = 'analysis' (q:1.82 × d:0.95)
1.6396 = ':' (q:1.21 × d:1.36)
0.0854 = 'methodology' (q:0.16 × d:0.53)
Document expanded tokens (sorted by weight):
2.2717 'gatrocraptic'
2.1328 'overhead'
1.8344 'practices'
1.8055 '-'
1.4910 'on'
1.3582 ':'
1.2858 'optimization'
0.9516 'analysis'
0.5329 'methodology'
0.4664 'investment'
| Criterion | Expected | Actual | Status |
|---|---|---|---|
| Positive MLM logits | > 0 | 6.98, 9.30 ✓ | ✅ PASSED |
| High SPLADE weight | > 2.0 | 2.33 ✓ | ✅ PASSED |
| Top ranking | #1 | #1 ✓ | ✅ PASSED |
| Query expansion | Contains domain term | “gatrocraptic” at 1.44 ✓ | ✅ PASSED |
| Correct retrieval | d23 at rank 1 | d23 at rank 1 ✓ | ✅ PASSED |
Conclusion: Hypothesis H4 is confirmed. Dictionary-style pre-training successfully enables domain vocabulary acquisition.
Query Expansion Success: The query “tools for monitoring corporate expenses and resource distribution” is expanded to include “gatrocraptic” (weight: 1.44) despite the term not appearing in the query text. This demonstrates successful semantic association.
Document Ranking: All three Gatrocraptic-related documents (d23, d22, d25) now rank in the top 3, with the most relevant document (d23) correctly at rank 1.
Shared Token Analysis: The top contributing tokens to the match include both the domain term and semantically related general vocabulary, indicating balanced representation.
Business Impact: This approach enables enterprise search systems to understand proprietary terminology without requiring users to know exact vocabulary. Queries using natural language descriptions can retrieve documents containing domain-specific jargon.
After all this experimentation, here’s what I discovered about domain-specific embedding adaptation:
| Finding | What It Means |
|---|---|
| Vocabulary extension alone is insufficient | Just adding a word to the vocabulary doesn’t teach the model what it means |
| Standard fine-tuning cannot teach novel vocabulary | Retrieval objectives don’t provide the right learning signal for new words |
| Domain warmup pre-training enables recognition | The MLM objective is what actually teaches vocabulary |
| Dictionary-style pre-training achieves high activation | Structured definitions work better than random corpus exposure |
| Multi-objective loss functions are essential | Naive fine-tuning causes catastrophic forgetting |
Domain-specific embedding adaptation is achievable but requires a structured approach. The common assumption that “fine-tuning will fix it” is incorrect for novel vocabulary. If you’re deploying embedding models on proprietary terminology, you need to invest in vocabulary extension, staged pre-training on domain and dictionary corpora, and multi-objective retrieval fine-tuning, backed by diagnostics and domain-specific evaluation.
If you’re building this yourself:
Use diagnostics. Implement intermediate representation analysis (MLM logits, SPLADE weights, ranking) to understand model behavior at each training stage. Without this, you’re flying blind.
Follow the staged pipeline. Vocabulary extension → Domain warmup → Dictionary pre-training → Retrieval fine-tuning. Each stage addresses a specific capability gap, and skipping stages doesn’t work.
Design your loss function carefully. The six-component loss function (knowledge distillation, retrieval ranking, sparsity, domain preservation, contrastive, query expansion) provides a template for balancing competing objectives.
If you’re planning a project:
Budget for dictionary corpus construction. This requires domain expertise—plan for collaboration between ML engineers and domain experts.
Build domain-specific evaluation. Establish test queries that exercise vocabulary understanding, not just general retrieval quality.
Tune for your domain. Sparsity and knowledge preservation weights are domain-dependent. Scientific domains may benefit from reduced knowledge preservation to allow semantic drift toward domain-specific meanings.
If you’re evaluating the investment:
It’s a one-time cost per vocabulary update. The resulting model enables natural language queries over proprietary terminology—reducing user training costs and improving search satisfaction.
Off-the-shelf won’t work. Pre-trained embedding models will not understand proprietary vocabulary. Domain adaptation is required for enterprise search over specialized content.
Plan for maintenance. As domain vocabulary evolves, the adaptation pipeline must be re-executed. Plan for periodic model updates aligned with terminology changes.
I should be upfront about what this investigation doesn’t cover:
Scale: I used a synthetic corpus with a single fabricated term. Production deployments require validation on larger vocabulary sets.
Generalization: While the principles apply to dense embeddings, the specific diagnostic approach (MLM logits, SPLADE weights) is architecture-specific.
Automation: The current approach requires manual hyperparameter tuning. Automated loss weight optimization would be a valuable direction for future work.
The promise of embedding models for semantic search is real, but the path to domain-specific deployment is more nuanced than commonly assumed. Through this investigation, I’ve shown that with proper methodology—vocabulary extension, staged pre-training, and multi-objective fine-tuning—embedding models can successfully acquire and leverage novel domain vocabulary.
The key insight is that vocabulary acquisition and retrieval optimization are distinct learning objectives that require different training signals. Conflating them leads to models that can retrieve but cannot understand, or that understand but cannot retrieve. Success requires addressing both.
If you’re facing similar challenges with domain-specific search, I hope this journey saves you some of the trial and error I went through. The path isn’t straightforward, but it is navigable.