    context = "\n".join(context_chunks[:4])  # limit to top chunks
    return SYSTEM_PROMPT + "\n" + USER_TEMPLATE.format(context=context, question=question)

## Retrieval-Augmented Generation (RAG)
RAG reduces hallucinations by grounding answers in retrieved context. Combine semantic (vector) and keyword retrieval to raise recall without drifting into irrelevant material.
### Chunking Strategy
| Approach | Chunk Size | Pros | Cons |
|---|---|---|---|
| Fixed tokens | 256–512 | Simple | Possible semantic cut |
| Recursive splitting | Dynamic | Maintains coherence | Complexity |
| Semantic boundary | Paragraph-level | Natural grouping | Requires NLP pass |
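
As a rough illustration of the fixed-token row, the sketch below splits an already tokenized text into fixed-size chunks. The function name `chunk_tokens` and the `overlap` parameter are illustrative assumptions rather than part of the table; overlapping neighbouring chunks is one common way to soften the semantic cuts listed under Cons.

```python
def chunk_tokens(tokens, size=256, overlap=32):
    """Split a token list into chunks of at most `size` tokens.

    Consecutive chunks share `overlap` tokens (an assumption of this sketch).
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Larger overlap improves continuity across chunk boundaries at the cost of more stored and retrieved tokens.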
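### Hybrid Retrieval Example
```python
def hybrid_retrieve(query, vector_index, keyword_index, k=8):
    """Merge semantic and lexical hits, de-duplicated by document id."""
    semantic = vector_index.search(query, k=k)
    lexical = keyword_index.search(query, k=k)
    # Dicts keep insertion order, so semantic hits come first; a lexical hit
    # with the same id replaces the stored document but keeps its position.
    union = {d['id']: d for d in semantic + lexical}
    # Simple merge without score weighting (could weight)
    return list(union.values())[:k]
```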
## Embeddings & Indexing
Use embeddings with 512–1536 dimensions and store metadata (source, section, timestamp, sensitivity) alongside each vector. Periodically re-embed updated documents, and maintain version tags so an index can be rolled back.
```python
embedding_cache = {}

def get_embedding(text, client):
    """Return the embedding for text, calling client.embed only on a cache miss."""
    if text in embedding_cache:
        return embedding_cache[text]
    vec = client.embed(text)
    embedding_cache[text] = vec
    return vec
```
## Parameter-Efficient Fine-Tuning (PEFT)
PEFT adapts large base models economically by training a small number of added parameters while the base weights stay frozen.
| Method | Mechanism | Pros | Cons |
|--------|-----------|------|------|
| LoRA | Inject low-rank matrices into attention | Low memory | May underfit niche tasks |
| Adapters | Add bottleneck layers | Modular | Slight latency impact |
| Prefix Tuning | Prepend trainable tokens | Fast training | Limited deep adaptation |
| QLoRA | Quantized + LoRA | Cost-efficient | Setup complexity |
### LoRA Pseudocode
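```python
import torch.nn as nn

class LoRALayer(nn.Module):
    """Wrap a linear layer with a trainable low-rank update scaled by alpha / r."""

    def __init__(self, base_layer, r=8, alpha=16):
        super().__init__()
        self.base = base_layer
        # Low-rank factors: A projects down to rank r, B projects back up.
        self.A = nn.Linear(base_layer.in_features, r, bias=False)
        self.B = nn.Linear(r, base_layer.out_features, bias=False)
        # Zero-init B (as in the LoRA paper) so the wrapped layer starts out
        # identical to the base layer.
        nn.init.zeros_(self.B.weight)
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.B(self.A(x)) * self.scaling
```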
## Evaluation Framework
| Dimension | Metric | Tooling |
|-----------|--------|---------|
| Factuality | Exact match / EM@K | Custom QA harness |
| Relevance | Retrieval overlap | Embedding cosine |
| Coherence | Human rating / LLM judge | Eval script |
| Safety | Policy violation rate | Content filter logs |
| Cost | Tokens/request | Billing telemetry |
| Latency | p95 response time | Tracing + APM |
### Simple QA Evaluation
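```python
def evaluate_qa(model, dataset):
    """Case-insensitive exact-match accuracy over items with question, context, gold."""
    if not dataset:
        return 0.0  # avoid division by zero on an empty dataset
    correct = 0
    for item in dataset:
        answer = model.ask(item['question'], item['context'])
        if answer.strip().lower() == item['gold'].strip().lower():
            correct += 1
    return correct / len(dataset)
```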
## Safety & Compliance Layer
Safety filters should sit between generation and user exposure (and pre-generation for input screening).
| Risk | Control | Implementation |
|------|---------|----------------|
| Toxic language | Moderate/Block categories | Content safety API |
| PII leakage | Entity detection + redaction | Regex + NER pass |
| Hallucination | RAG + answer grounding label | Source citation enforcement |
| Prompt Injection | Context boundary enforcement | Strip system tokens, sanitize user input |
| Data Exfiltration | Output length + pattern guard | Post-processing rules |
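
The regex half of the PII row can be sketched as follows. The two patterns are illustrative assumptions (email addresses and US-style phone numbers) and will miss many real-world formats; as the table notes, a production control would pair them with an NER pass.

```python
import re

# Illustrative patterns only; real deployments need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact_pii(text):
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

Running the redaction on model output before it reaches the user matches the placement described at the top of this section.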
### Hallucination Confidence Heuristic
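```python
import difflib

def hallucination_score(answer, sources):
    """Return 1 minus the best character-level similarity to any source."""
    similarity = max(
        (difflib.SequenceMatcher(None, answer, s).ratio() for s in sources),
        default=0.0,  # no sources => maximum risk
    )
    return 1 - similarity  # higher => more risk
```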
## Cost Optimization Strategies
| Area | Technique | Impact |
|------|-----------|--------|
| Token Usage | Prompt compression, remove redundant examples | ↓ cost/request |
| Caching | Reuse embedding + answer cache | ↓ repeat cost |
| Model Selection | Smaller model for simple queries | ↓ baseline cost |
| Adaptive Routing | Choose model by complexity score | Balanced spend/quality |
| Batching | Group embedding requests | Improved throughput |
| Quantization | Compress fine-tuned variants | ↓ inference compute |
### Adaptive Routing Sketch
```python
def route_query(query, complexity_model):
    """Map a complexity score (assumed to lie in [0, 1]) to a model tier."""
    score = complexity_model.predict(query)
    if score < 0.3:
        return "small"  # fast/light model
    if score < 0.7:
        return "medium"
    return "large"
```
## Latency Engineering
- Run retrieval and embedding lookups in parallel.
- Stream partial answer tokens early (server-sent events / websockets).
- Tune chunk size to balance retrieval recall against token overhead.
- Provision a warm pool of large-model instances.
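
The first bullet can be sketched with `asyncio.gather`, assuming each retriever is exposed as an async callable that takes the query. The name `retrieve_parallel` and that calling convention are assumptions of this sketch, not part of any specific SDK.

```python
import asyncio

async def retrieve_parallel(query, retrievers):
    """Await all retriever coroutines concurrently.

    Results are returned as a list in the same order as `retrievers`.
    """
    return await asyncio.gather(*(retrieve(query) for retrieve in retrievers))
```

With this shape, total retrieval latency is roughly that of the slowest retriever rather than the sum of all of them.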
## Observability & Telemetry
| Signal | Purpose | Tool |
|--------|---------|------|
| Tokens used | Cost control | Billing export |
| Latency p95 | User experience SLA | APM / traces |
| Safety violations | Compliance monitoring | Content filter metrics |
| Cache hit rate | Efficiency | Custom counter |
| Hallucination score | Quality risk | QA pipeline |
| Retrieval coverage % | Context sufficiency | Index analyzer |
## Guardrails & Policy as Code
```python
GUARDRAILS = {
    "max_tokens": 1024,
    "allow_chain_of_thought": False,
    "banned_phrases": ["confidential", "password"],
    "citation_required": True
}

def enforce_guardrails(output, metadata):
    """Return (ok, reason). metadata is accepted for future checks and unused here."""
    # Whitespace-separated words are only a rough proxy for tokens;
    # use the model's tokenizer when exact counts matter.
    if len(output.split()) > GUARDRAILS["max_tokens"]:
        return False, "Token limit exceeded"
    if GUARDRAILS["citation_required"] and "SOURCE:" not in output:
        return False, "Missing citation"
    for phrase in GUARDRAILS["banned_phrases"]:
        if phrase in output.lower():
            return False, "Banned phrase detected"
    return True, "OK"
```
## RAG Query Flow (ASCII)
> **Architecture Overview:** User Query → Preprocess → Hybrid Retrieve → Rank & Filter → Prompt Build → LLM Generate → Safety Filters → Cite Sources → Return Response → Log Metrics
## Versioning & Change Management
Track prompt template versions and embedding index snapshots. Tie each deployment to a prompt hash and a model revision so that results can be reproduced.
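
One way to produce such a record is sketched below. The function name `deployment_fingerprint` and the field names are hypothetical; the only technique used is a SHA-256 hash of the prompt template text.

```python
import hashlib

def deployment_fingerprint(prompt_template, model_revision, index_snapshot):
    """Build a reproducibility record for one deployment.

    The prompt hash changes whenever the template text changes, so two
    deployments with equal records used the same prompt, model, and index.
    """
    prompt_hash = hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()
    return {
        "prompt_hash": prompt_hash,
        "model_revision": model_revision,
        "index_snapshot": index_snapshot,
    }
```

Storing this record with every deployment and attaching it to telemetry makes it possible to trace a regression back to a specific prompt or index change.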
## Failure Modes & Mitigations
| Failure | Cause | Mitigation |
|---------|-------|------------|
| Hallucinated answer | Insufficient / noisy context | Increase top-k, add validation pass |
| High latency | Large prompt or slow retrieval | Compress prompt, optimize index |
| Safety false negative | Weak filter | Ensemble filters + periodic audit |
| Rising cost | Token inflation | Prompt diff audit + caching |
| Poor domain adaptation | Insufficient fine-tune data | Add synthetic examples, LoRA tuning |
## Initial References
- Attention Is All You Need (Transformer paper)
- Retrieval-Augmented Generation (Lewis et al.)
- LoRA: Low-Rank Adaptation of Large Language Models
- OpenAI Prompt Engineering Guidelines
- Azure AI Content Safety Documentation
## Next Expansion Targets
- Add tool/function calling governance section.
- Include evaluation harness scripts (BLEU/ROUGE & factuality scoring).
- Expand safety with jailbreak prompt detection.
- Add cost projection calculator.
- Provide multi-turn conversation state design.
## Tool / Function Calling Governance
Structured tool invocation reduces hallucinated APIs and enforces output schema fidelity.
```python
TOOLS = {
    "lookup_account": {"args": ["account_id"], "safety": ["no_pii"]},
    "calc_interest": {"args": ["principal", "rate"], "safety": []}
}

def validate_tool_call(name, args):
    """Return (ok, reason) after checking the tool name and exact argument names."""
    spec = TOOLS.get(name)
    if not spec:
        return False, "Unknown tool"
    if set(args.keys()) != set(spec["args"]):
        return False, "Arg mismatch"
    return True, "OK"
```
The policy layer logs tool usage; anomaly detection flags rare or unexpected call sequences.
## Multi-Turn Conversation Memory
| Strategy | Mechanism | Pros | Cons |
|----------|-----------|------|------|
| Full Transcript | Append all prior turns | Complete context | Token explosion |
| Sliding Window | Keep last N turns | Bounded tokens | Loses distant context |
| Summary Memory | Periodic summarization | Low token footprint | Potential detail loss |
| Vector Memory | Embedding retrieval of past turns | Semantic recall | Index complexity |
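
The sliding-window row is the simplest of these to sketch. The function name `sliding_window` and the default of six turns are illustrative assumptions.

```python
def sliding_window(turns, n=6):
    """Keep only the last n conversation turns (an empty list if n <= 0)."""
    # turns[-0:] would return the whole list, so guard n <= 0 explicitly.
    return turns[-n:] if n > 0 else []
```

This bounds the number of turns rather than the number of tokens, so a few very long turns can still exceed the context budget.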
### Summarization Memory Update
```python
def update_summary(prev_summary, new_turns, llm):
    """Ask the LLM to fold the new turns into the running summary."""
    prompt = f"Prior summary:\n{prev_summary}\nNew turns:\n{new_turns}\nUpdate summary preserving key facts."
    return llm.generate(prompt)
```
## Jailbreak & Injection Detection
Common attack vectors: instruction override, prompt leakage, system role manipulation.
```python
JAILBREAK_PATTERNS = ["ignore previous", "disregard instructions", "/system", "pretend to"]

def detect_jailbreak(user_input):
    """Return True if any known jailbreak phrase appears (case-insensitive substring match)."""
    lowered = user_input.lower()
    for p in JAILBREAK_PATTERNS:
        if p in lowered:
            return True
    return False
```
Mitigation: refuse generation, provide safe fallback, log incident.
## Retrieval Ranking Enhancements
Combine vector similarity + recency + document authority weight.
```python
def rank_results(results):
    """Score results in place and sort descending; missing authority/recency default to 0.5."""
    for r in results:
        r['score'] = 0.6*r['vector_score'] + 0.2*r.get('authority', 0.5) + 0.2*r.get('recency_score', 0.5)
    return sorted(results, key=lambda x: x['score'], reverse=True)
```
## Context Window Optimization
| Technique | Description | Impact |
|-----------|-------------|--------|
| Deduplication | Remove overlapping chunks | ↓ tokens |
| Salience Scoring | Keep top relevance + novelty | ↑ quality |
| Structured Formatting | Bold headings, bullet compression | ↑ readability |
| Citation Markers | Tag sources (SOURCE:ID) | ↑ traceability |
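
The deduplication row could be implemented in several ways. The sketch below uses word-level Jaccard similarity with a hypothetical threshold of 0.8; both the measure and the threshold are assumptions, and embedding-based similarity would be a reasonable alternative.

```python
def dedupe_chunks(chunks, threshold=0.8):
    """Drop chunks whose word set overlaps an earlier kept chunk too heavily.

    Overlap is measured as Jaccard similarity of lowercased word sets.
    """
    kept, kept_sets = [], []
    for chunk in chunks:
        words = set(chunk.lower().split())
        is_dup = any(
            len(words & seen) / max(len(words | seen), 1) >= threshold
            for seen in kept_sets
        )
        if not is_dup:
            kept.append(chunk)
            kept_sets.append(words)
    return kept
```

Because earlier chunks win, passing chunks in ranked order keeps the highest-ranked copy of each near-duplicate.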
### Salience Score Example
```python
import numpy as np

def salience(chunk, query, embedding_fn):
    """Blend query relevance (cosine similarity) with novelty (unique-word ratio)."""
    qv = embedding_fn(query)
    cv = embedding_fn(chunk)
    relevance = np.dot(qv, cv) / (np.linalg.norm(qv) * np.linalg.norm(cv))
    words = chunk.split()
    novelty = len(set(words)) / (len(words) + 1e-6)  # epsilon guards empty chunks
    return 0.7*relevance + 0.3*novelty
```
## Advanced Evaluation Harness
```python
from rouge_score import rouge_scorer

def eval_generation(model, dataset):
    """Mean ROUGE-L F-measure of model outputs against references."""
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    scores = []
    for item in dataset:
        output = model.generate(item['prompt'])
        s = scorer.score(item['reference'], output)  # (target, prediction)
        scores.append(s['rougeL'].fmeasure)
    return sum(scores) / len(scores) if scores else 0.0
```
### Factuality via Source Citation
```python
import difflib

def factuality_score(answer, source_texts):
    """Best character-level similarity between the answer and any source (0.0 if none)."""
    ratios = [difflib.SequenceMatcher(None, answer, s).ratio() for s in source_texts]
    return max(ratios, default=0.0)
```
## Cost Projection Calculator
```python
def monthly_cost(avg_tokens_per_request, requests_per_day, price_per_1k=0.002):
    """Estimate monthly spend, assuming a 30-day month and a flat price per 1K tokens."""
    daily_tokens = avg_tokens_per_request * requests_per_day
    monthly_tokens = daily_tokens * 30
    return (monthly_tokens / 1000) * price_per_1k
```
Track actual spend against budget; trigger optimization when spend exceeds 110% of forecast.
## Token Compression Techniques
| Technique | Mechanism | Trade-Off |
|-----------|----------|-----------|
| Abbreviation Map | Replace repeated entity names | Possible ambiguity |
| Structured JSON | Remove prose around key-value | Less human readable |
| Example Pruning | Keep hardest diverse examples | Coverage risk |
| Summarized Context | Use summaries for older turns | Detail loss |
## Adaptive Temperature Strategy
Lower temperature for factual Q&A, raise for creative tasks.
```python
def select_temperature(task_type):
    """Look up a sampling temperature by task type; unknown types fall back to 0.5."""
    mapping = {"factual": 0.2, "creative": 0.8, "code": 0.3, "summary": 0.4}
    return mapping.get(task_type, 0.5)
```
## Observability Implementation (Structured Logging)
```python
import json
import time

def log_event(logger, event_type, data):
    """Emit one JSON log line with a Unix timestamp, the event type, and event fields."""
    payload = {"ts": time.time(), "type": event_type, **data}
    logger.info(json.dumps(payload))
```
Events: retrieval_set, prompt_built, generation_complete, safety_flagged.
## Hallucination Sandbox Testing
Generate adversarial prompts ("Provide details about confidential project X") and measure rejection effectiveness.
| Scenario | Expected Outcome | Actual | Pass |
|----------|------------------|--------|------|
| Confidential query | Reject with policy message | Reject | ✅ |
| PII request | Redact / refuse | Refuse | ✅ |
| System prompt leak | Maintain boundaries | Boundaries kept | ✅ |
## Prompt Drift Detection
Compare the embedding of each new prompt template with the baseline embedding; if the divergence exceeds a threshold, log a governance review.
```python
import numpy as np

def prompt_drift(old, new, embed):
    """Return cosine distance (1 - cosine similarity) between two prompt embeddings."""
    o = embed(old)
    n = embed(new)
    sim = np.dot(o, n) / (np.linalg.norm(o) * np.linalg.norm(n))
    return 1 - sim
```
## Synthetic Data Augmentation for Fine-Tuning
| Method | Description | Risk Mitigation |
|--------|-------------|-----------------|
| Paraphrasing | LLM rephrase existing QA pairs | Validate factual consistency |
| Counterfactual | Alter entity attributes maintaining logic | Check bias introduction |
| Template Fill | Slot-based generation | Ensure slot constraints |
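
The template-fill row can be sketched as a Cartesian product over slot values. The function name `template_fill` and the example slot names are hypothetical, and the sketch does not enforce slot constraints, which the table lists as the required mitigation.

```python
import itertools

def template_fill(template, slots):
    """Generate one string per combination of slot values.

    `slots` maps each placeholder name in `template` to its candidate values.
    """
    keys = sorted(slots)
    combos = itertools.product(*(slots[k] for k in keys))
    return [template.format(**dict(zip(keys, combo))) for combo in combos]
```

For example, `template_fill("What is the {field} of {entity}?", {"field": ["fee", "limit"], "entity": ["plan A"]})` yields two questions, one per field value.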
## PEFT Training Loop Sketch
```python
def train_peft(model, data_loader, optimizer):
    """Run one pass over the data; assumes only the PEFT parameters require gradients."""
    model.train()
    for batch in data_loader:
        loss = model(batch['input_ids'], labels=batch['labels']).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```
## Memory Token Budgeting
Compute projected tokens for conversation; prune if threshold exceeded.
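```python
def prune_transcript(turns, max_tokens, tokenizer):
    """Keep the most recent turns that fit within max_tokens, in original order."""
    total = 0
    kept = []
    for t in reversed(turns):
        # tokenizer is assumed to be a callable returning a list of tokens
        tokens = len(tokenizer(t))
        if total + tokens <= max_tokens:
            kept.insert(0, t)
            total += tokens
        else:
            break  # stop at the first turn that does not fit
    return kept
```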
## KPI Catalog (LLM Ops)
| KPI | Target | Rationale |
|-----|--------|-----------|
| Answer Accuracy | > 85% | User trust |
| Hallucination Rate | < 8% | Reliability |
| Cost per 1K Tokens | ↓ MoM | Efficiency |
| Cache Hit Rate | > 40% | Cost reduction |
| Safety False Negative | < 2% | Compliance |
| Latency p95 | < 2.5s | UX quality |
## Troubleshooting Matrix
| Issue | Cause | Resolution | Prevention |
|-------|-------|-----------|-----------|
| High Hallucination | Weak retrieval | Improve indexing & top-k | Hybrid retrieval |
| Rising Cost | Prompt bloat | Prompt diff & compression | Auto diff alerts |
| Safety Miss | New pattern | Expand regex/ML filter | Weekly pattern review |
| Slow Responses | Cold start instances | Warm pool + autoscale | Pre-scaling strategy |
| Poor Accuracy | Outdated context | Re-embed corpus | Schedule re-embedding |
| Tool Failures | Arg mismatch | Strict schema validation | Tool registry tests |
## Best Practices & Anti-Patterns
| Best Practice | Benefit | Anti-Pattern | Risk |
|---------------|---------|-------------|------|
| Hybrid retrieval | Higher recall | Single retrieval mode | Missed context |
| Prompt versioning | Reproducibility | Ad-hoc edits | Regression blind spots |
| Structured evaluation | Quantified quality | Manual spot check only | Quality drift |
| Safety ensemble | Reduced false negatives | Single heuristic | Compliance gaps |
| Cost monitoring | Financial control | Ignoring token trends | Budget overrun |
## Roadmap
- Add active learning feedback integration.
- Implement semantic cache eviction policy.
- Expand jailbreak classifier with ML model.
- Introduce multi-lingual retrieval pipeline.
- Deploy real-time token anomaly detector.
## Final Summary
Enterprise LLM success demands engineered layers: retrieval rigor, disciplined prompts, adaptive fine-tuning, robust evaluation, and vigilant safety and cost governance. Together they form a repeatable system that scales value while constraining risk.
## Advanced Orchestration & Scaling
### RAG Pipeline State Machine
Define explicit states to avoid silent failures and enable observability.
```python
from enum import Enum

class RAGState(Enum):
    VALIDATE_PROMPT = 'validate_prompt'
    EMBED_QUERY = 'embed_query'
    RETRIEVE = 'retrieve'
    RERANK = 'rerank'
    BUILD_CONTEXT = 'build_context'
    GENERATE = 'generate'
    POST_VALIDATE = 'post_validate'

def run_rag(query, cfg):
    state = RAGState.VALIDATE_PROMPT
    audit = [state.value]  # record every state entered, including the first
    try:
        # Explicit check instead of assert, which is skipped under python -O.
        if len(query) >= cfg.max_chars:
            raise ValueError("query exceeds max_chars")
        state = RAGState.EMBED_QUERY; audit.append(state.value)
        qv = cfg.embed_fn(query)
        state = RAGState.RETRIEVE; audit.append(state.value)
        docs = cfg.vector_index.search(qv, k=cfg.base_k)
        state = RAGState.RERANK; audit.append(state.value)
        ranked = cfg.reranker(docs, query)
        state = RAGState.BUILD_CONTEXT; audit.append(state.value)
        ctx = cfg.context_builder(ranked)
        state = RAGState.GENERATE; audit.append(state.value)
        answer = cfg.llm.generate(cfg.prompt_template.format(context=ctx, question=query))
        state = RAGState.POST_VALIDATE; audit.append(state.value)
        if cfg.hallucination_score(answer, ranked) > cfg.max_hallu:
            answer = cfg.refine(answer, ranked)
        return answer, audit
    except Exception as e:
        # The last audit entry names the stage that failed.
        return f"ERROR: {e}", audit
```
State audit enables traceability and SLA attribution (e.g., latency per stage).
### Throughput Engineering
- Batch embeddings: group 32–64 queries per API call.
- Async retrieval fan-out: parallel vector + keyword + graph stores.
- Streaming decode with early cancellation on safety triggers.
- Adaptive top-k: increase k only when semantic density is low.
- Shard indexes by semantic domain to reduce search space.
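
The batching bullet can be sketched as below. `embed_batch_fn` is a hypothetical callable that takes a list of texts and returns one vector per text; the actual batch limit depends on the embedding provider.

```python
def batched(items, batch_size=32):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_all(texts, embed_batch_fn, batch_size=32):
    """Embed all texts using one call per batch instead of one call per text."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch_fn(batch))
    return vectors
```

With 70 texts and a batch size of 32, this makes three calls (32, 32, and 6 texts) instead of 70.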
### Memory Hybrid (Summary + Vector)
Combine rolling conversation summary with episodic memory vectors:
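```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class HybridMemory:
    def __init__(self, embed_fn, summarize_fn):
        self.embed_fn = embed_fn
        self.summarize_fn = summarize_fn  # assumed callable: text -> shorter text
        self.summary = ""
        self.store = []  # [(vec, text)]

    def update(self, turn_text):
        self.summary = self.summarize_fn(self.summary + "\n" + turn_text)
        vec = self.embed_fn(turn_text)
        self.store.append((vec, turn_text))

    def recall(self, query, k=5):
        qv = self.embed_fn(query)
        scored = sorted(self.store, key=lambda vt: cosine(qv, vt[0]), reverse=True)[:k]
        episodic = "\n".join(t for _, t in scored)
        # Truncate to 2000 characters to bound the prompt size.
        return f"SUMMARY:\n{self.summary}\nEPISODIC:\n{episodic}"[:2000]
```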
Use summary for global continuity, episodic vectors for precision details.
### Prompt Drift Governance
Track embedding similarity of evolving system prompts vs baseline; if similarity < 0.85, trigger review.
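```python
import numpy as np

def prompt_drift(baseline, current, embed):
    """Return (needs_review, similarity); review triggers below 0.85 cosine similarity."""
    bv = np.asarray(embed(baseline))
    cv = np.asarray(embed(current))
    sim = float(np.dot(bv, cv) / (np.linalg.norm(bv) * np.linalg.norm(cv)))
    return sim < 0.85, sim
```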
### Multi-Model Router
Route latency-sensitive and reasoning-heavy queries to different models with an intent classifier; maintain cost and quality KPIs per model.
```python
def route(query, intent_cls, small_llm, large_llm):
    """Return (answer, model_tier); simple intents go to the small model."""
    intent = intent_cls(query)
    if intent in {'status', 'faq', 'short'}:
        return small_llm.generate(query), 'small'
    return large_llm.generate(query), 'large'
```
## Extended Evaluation Metrics
### Perplexity (Proxy Fluency)
```python
import math

def perplexity(model, tokens):
    """exp of the negative mean per-token log-probability (natural log)."""
    log_probs = model.log_probs(tokens)
    avg_log = sum(log_probs) / len(log_probs)
    return math.exp(-avg_log)
```
Use this only with models that expose a log-probability interface; perplexity drifting past your threshold indicates degradation.
### Toxicity & Safety Ensemble
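```python
def safety_score(text, classifiers):
    """Mean unsafe probability across classifiers."""
    # Each classifier is assumed to return [p_safe, p_unsafe] for a single text.
    scores = [c.predict_proba(text)[1] for c in classifiers]
    return sum(scores) / len(scores)
```
Escalate for review if the score exceeds 0.4 (soft threshold); block outright above 0.7 (hard threshold).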
### Hallucination Grounding Delta
Compare the grounding ratio of the refined answer with that of the original answer; the difference quantifies how effective the mitigation was.
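
A minimal sketch of this delta, assuming answers and context are already split into sentences. The `grounding_ratio` helper here is a self-contained stand-in based on `difflib`, and the 0.75 threshold is an assumption; any grounding measure on a 0 to 1 scale could be substituted.

```python
import difflib

def grounding_ratio(answer_sentences, context_sentences, threshold=0.75):
    """Fraction of answer sentences that closely match some context sentence."""
    hits = 0
    for a in answer_sentences:
        best = max(
            (difflib.SequenceMatcher(None, a, c).ratio() for c in context_sentences),
            default=0.0,
        )
        if best > threshold:
            hits += 1
    return hits / max(len(answer_sentences), 1)

def grounding_delta(original, refined, context_sentences):
    """Positive values mean the refined answer is better grounded than the original."""
    return (grounding_ratio(refined, context_sentences)
            - grounding_ratio(original, context_sentences))
```

A delta near zero suggests the refinement step is not removing ungrounded sentences and may not be worth its extra latency.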
### Cost Forecast Scenario Analysis
```python
def forecast(monthly_queries, avg_prompt_tokens, avg_completion_tokens, price_prompt, price_completion):
    """Monthly cost split by prompt and completion tokens; prices are per 1K tokens."""
    prompt_cost = (monthly_queries * avg_prompt_tokens / 1000) * price_prompt
    completion_cost = (monthly_queries * avg_completion_tokens / 1000) * price_completion
    return {
        'prompt_cost': prompt_cost,
        'completion_cost': completion_cost,
        'total': prompt_cost + completion_cost
    }
```
Run scenarios: baseline, +20% volume, +30% longer prompts; record variance for budget governance.
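
The three scenarios can be run side by side as sketched below. `forecast_total` is a compact, self-contained variant of the cost formula, and the parameter names are assumptions of this sketch; prices are per 1K tokens and any values passed in are placeholders, not real pricing.

```python
def forecast_total(queries, prompt_tokens, completion_tokens, price_prompt, price_completion):
    """Total monthly cost with per-1K-token prices."""
    return queries * (prompt_tokens * price_prompt + completion_tokens * price_completion) / 1000

def run_scenarios(base):
    """Compare baseline, +20% volume, and +30% prompt length against the baseline."""
    scenarios = {
        "baseline": dict(base),
        "volume_plus_20pct": {**base, "queries": base["queries"] * 1.2},
        "prompts_plus_30pct": {**base, "prompt_tokens": base["prompt_tokens"] * 1.3},
    }
    totals = {name: forecast_total(**params) for name, params in scenarios.items()}
    baseline = totals["baseline"]
    return {
        name: {"total": total, "delta_vs_baseline": total - baseline}
        for name, total in totals.items()
    }
```

Recording `delta_vs_baseline` for each scenario gives the variance figures the budget review needs.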
## Retrieval Scoring Fusion Formula
Final score = 0.5 * dense_similarity + 0.3 * bm25_norm + 0.2 * recency_decay. This weighted approach balances semantic relevance, lexical coverage, and freshness for dynamic corpora.
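```python
def fused_score(dense, bm25, days_old):
    """Weighted fusion; dense and bm25 are assumed pre-normalized to [0, 1]."""
    recency = 1 / (1 + 0.05*days_old)  # hyperbolic decay: 1.0 today, 0.5 at 20 days
    return 0.5*dense + 0.3*bm25 + 0.2*recency
```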
## Semantic Cache Eviction Policy
- LRU baseline.
- Promote entries with high grounding ratio reuse.
- Evict entries falling below 0.6 average similarity across last 5 hits.
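
A possible shape for this policy is sketched below, covering the LRU baseline and the similarity-based eviction. Promotion by grounding ratio is not implemented here. The class name `SemanticCache` and its parameters are hypothetical, and the caller is assumed to supply the similarity of each hit.

```python
from collections import OrderedDict, deque

class SemanticCache:
    def __init__(self, capacity=1000, min_avg_sim=0.6, window=5):
        self.capacity = capacity
        self.min_avg_sim = min_avg_sim
        self.window = window
        # key -> (answer, similarities of the most recent hits)
        self.entries = OrderedDict()

    def put(self, key, answer):
        self.entries[key] = (answer, deque(maxlen=self.window))
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

    def hit(self, key, similarity):
        """Record a hit; return the answer, or None if missing or evicted."""
        if key not in self.entries:
            return None
        answer, sims = self.entries[key]
        sims.append(similarity)
        if len(sims) == self.window and sum(sims) / len(sims) < self.min_avg_sim:
            del self.entries[key]  # consistently weak matches over the window
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return answer
```

An entry is only judged once it has a full window of hits, so a single weak match does not evict an otherwise useful answer.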
## Scaling Patterns
| Pattern | Benefit | Trade-off |
|---------|---------|-----------|
| Prompt compression | Lower cost | Possible context loss |
| Distillation | Faster inference | Training effort |
| LoRA adapters | Targeted specialization | Additional storage |
| Quantization | Throughput gain | Minor quality drop |
| Caching | Latency & cost reduction | Stale risk |
## Architecture Decision and Tradeoffs
When designing AI/ML solutions with Azure AI Services, consider these key architectural trade-offs:
| Approach | Best For | Tradeoff |
|----------|----------|----------|
| Managed / platform service | Rapid delivery, reduced ops burden | Less customisation, potential vendor lock-in |
| Custom / self-hosted | Full control, advanced tuning | Higher operational overhead and cost |
> **Recommendation:** Start with the managed approach for most workloads and move to custom only when specific requirements demand it.
## Validation and Versioning
- Last validated: April 2026
- Validate examples against your tenant, region, and SKU constraints before production rollout.
- Keep module, CLI, and SDK versions pinned in automation pipelines and review quarterly.
## Security and Governance Considerations
- Apply least-privilege access using RBAC roles and just-in-time elevation for admin tasks.
- Store secrets in managed secret stores and avoid embedding credentials in scripts or source files.
- Enable audit logging, data protection policies, and periodic access reviews for regulated workloads.
## Cost and Performance Notes
- Define budgets and alerts, then monitor usage and cost trends continuously after go-live.
- Baseline performance with synthetic and real-user checks before and after major changes.
- Scale resources with measured thresholds and revisit sizing after usage pattern changes.
## Official Microsoft References
- https://learn.microsoft.com/azure/ai-services/
- https://learn.microsoft.com/azure/machine-learning/
- https://learn.microsoft.com/azure/ai-foundry/
## Public Examples from Official Sources
- These examples are sourced from official public Microsoft documentation and sample repositories.
- Documentation examples: https://learn.microsoft.com/azure/ai-services/
- Sample repositories: https://github.com/Azure-Samples?tab=repositories&q=ai&type=&language=&sort=
- Prefer adapting these examples to your tenant, subscriptions, and governance requirements before production use.
## Key Takeaways
- Treat LLM stack as governed pipeline with observable states.
- Blend summary + episodic memory for conversational continuity.
- Fuse heterogeneous retrieval scores for balanced relevance.
- Continuously measure hallucination mitigation effectiveness.
- Proactively model cost scenarios to avoid budget surprises.
- Route intelligently across model sizes for efficiency.
- Enforce prompt drift guardrails for stability.
- Evaluation is multi-dimensional: fluency, grounding, safety, cost.
## Advanced Quantitative Evaluation
### BLEU & ROUGE Batch
```python
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

def eval_text_metrics(dataset, model):
    """Mean sentence-level BLEU and ROUGE-L F-measure over whitespace-tokenized text."""
    bleu_scores = []
    rouge_scores = []
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    for item in dataset:
        gen = model.generate(item['prompt'])
        bleu_scores.append(sentence_bleu([item['reference'].split()], gen.split()))
        rouge_scores.append(scorer.score(item['reference'], gen)['rougeL'].fmeasure)
    return {
        'bleu_mean': sum(bleu_scores)/len(bleu_scores),
        'rougeL_mean': sum(rouge_scores)/len(rouge_scores)
    }
```
### Structured Extraction Accuracy
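```python
import json

def extraction_accuracy(outputs, gold):
    """Fraction of outputs whose parsed JSON equals the gold JSON exactly."""
    correct = 0
    for o, g in zip(outputs, gold):
        try:
            if json.loads(o) == json.loads(g):
                correct += 1
        except json.JSONDecodeError:
            pass  # unparseable JSON counts as incorrect
    return correct / max(len(outputs), 1)
```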
## Embedding Model Selection Criteria
| Factor | Consideration | Impact |
|--------|---------------|--------|
| Dimensionality | 384 vs 768 vs 1536 | Memory & recall |
| Domain Adaptation | Finetuned on in-domain corpus | Precision |
| Latency | ms per vector batch | Throughput |
| Cost | $ per 1K embeddings | Budget |
| Multilingual | Cross-lingual alignment | Global coverage |
## Cache Design (Semantic + Exact)
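```python
semantic_cache = {}  # query text -> (embedding vector, answer)

def semantic_get(query, embed_fn, threshold=0.92):
    """Return a cached answer for an identical or semantically close query, else None."""
    if query in semantic_cache:  # exact match: no embedding call needed
        return semantic_cache[query][1]
    qv = embed_fn(query)  # assumed to return a NumPy array
    for stored_q, (vec, answer) in semantic_cache.items():
        sim = (qv @ vec) / ((qv**2).sum()**0.5 * (vec**2).sum()**0.5)
        if sim >= threshold:
            return answer
    return None
```
Populate the cache after successful, validated generations; expire entries by LRU or when concept drift is detected.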
## Dynamic k Retrieval Tuning
Increase top-k when query complexity score or initial similarity average falls below threshold.
```python
def dynamic_k(query_complexity, base_k=6):
    """Return a larger top-k for more complex queries (complexity assumed in [0, 1])."""
    if query_complexity < 0.4:
        return base_k
    if query_complexity < 0.7:
        return base_k + 2
    return base_k + 4
```
## Streaming Token Handling
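Evaluate output incrementally: scan tokens for banned content as they arrive and abort generation as soon as a risk signature is detected.
```python
def stream_guard(stream_tokens, banned):
    """Consume tokens until a banned phrase appears; return (tokens_so_far, status)."""
    buf = []
    for t in stream_tokens:
        buf.append(t)
        # Check the accumulated text so phrases split across tokens are caught.
        # Assumes tokens carry their own whitespace and banned phrases are lowercase.
        if any(b in "".join(buf).lower() for b in banned):
            return buf, 'ABORT'
    return buf, 'OK'
```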
## Factual Grounding via Sentence-Level Alignment
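```python
import difflib

def grounding_ratio(answer_sentences, context_sentences):
    """Fraction of answer sentences with a close (> 0.75) match in the context."""
    hits = 0
    for a in answer_sentences:
        best = max(
            (difflib.SequenceMatcher(None, a, c).ratio() for c in context_sentences),
            default=0.0,  # empty context => nothing is grounded
        )
        if best > 0.75:
            hits += 1
    return hits / max(len(answer_sentences), 1)
```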
## Temperature vs Diversity Curve
Plot distinct n-gram ratio vs temperature; choose sweet spot balancing creativity and coherence.
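
The diversity axis of that plot can be computed as below. This is a sketch of the commonly used distinct-n measure over whitespace tokens; the function name `distinct_n` is illustrative, and sampling the outputs at each temperature is left to the caller.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across all texts (0.0 if none)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        # zip over n shifted views of the token list yields its n-grams
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0
```

Computing this over a fixed prompt set at several temperatures gives one point per temperature for the curve.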
## Prompt Cost Diff Audit
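```python
def cost_diff(old_prompt, new_prompt, tokenizer):
    """Token counts for both prompts plus absolute and fractional change."""
    # tokenizer is assumed to be a callable returning a list of tokens
    old_tokens = len(tokenizer(old_prompt))
    new_tokens = len(tokenizer(new_prompt))
    return {
        'old': old_tokens,
        'new': new_tokens,
        'delta': new_tokens - old_tokens,
        'pct_change': (new_tokens - old_tokens)/max(old_tokens, 1)  # fraction, e.g. 0.2 = 20%
    }
```
Govern prompt changes: reject any token increase above 20% unless it is justified.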
## Model Selection Matrix
| Model | Strength | Weakness | Use Case |
|-------|----------|----------|----------|
| Small LLM | Fast, cheap | Limited reasoning | Simple FAQ |
| Medium LLM | Balanced | Occasional hallucination | General enterprise QA |
| Large LLM | High reasoning | Costly | Complex synthesis |
| Fine-Tuned | Domain optimized | Maintenance overhead | Specialized compliance |
## Responsible Use Checklist
| Item | Status |
|------|--------|
| PII Filtering | Pending |
| Safety Classifier Ensemble | Pending |
| Prompt Version Logged | Pending |
| Source Citation Included | Pending |
| Token Budget Reviewed | Pending |
| Fairness & Bias Scan (if generative decisions) | Pending |
## Incident Playbook (LLM)
| Step | Action |
|------|--------|
| Detection | Alert: high hallucination or safety breach |
| Containment | Disable risky feature flag, enable stricter filters |
| Analysis | Review prompts, retrieval logs, offending output |
| Mitigation | Adjust prompt, expand context set, retrain safety classifier |
| Verification | Re-run evaluation harness |
| Documentation | Log cause + changes |
## Governance Integration Hooks
- Emit `prompt_version` and `retrieval_doc_ids` in telemetry for audit.
- Store cost projections vs actual monthly spend.
- Link safety incident IDs to risk register entries.
## Extended References
- Chain-of-Thought Prompting (Wei et al.)
- Self-Ask Strategies
- RAG Fusion Techniques
- LoRA / QLoRA Implementation Guides
- Semantic Caching Research