Senior ML Engineer
ML model deployment, MLOps pipelines, LLM integration, and cost optimization.
Real Conversations
See how the Senior ML Engineer skill responds to actual requests. These are representative examples of what your AI agent will produce.
Fine-tuned LLaMA 3.1 8B with QLoRA on our dataset. Need to serve it with <200ms p95 latency on a single A10G GPU. What's the best serving stack?
Serving Stack: vLLM on A10G with quantization
vLLM's PagedAttention gives roughly 2-4x higher throughput than naive HuggingFace inference, and an 8B model fits comfortably in 24GB VRAM with 4-bit quantization.
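As a rough check on the VRAM claim, the arithmetic can be sketched as follows. The parameter count (about 8.03B) and the architecture figures (32 layers, 8 KV heads, head dimension 128) are assumptions based on the published Llama 3.1 8B configuration, and the helper names are hypothetical. The 4-bit figure is a lower bound, because AWQ also stores per-group scales and usually leaves the embeddings in FP16.

```python
# Back-of-the-envelope VRAM estimate. All constants are assumptions (see text).
N_PARAMS = 8.03e9   # approximate parameter count of Llama 3.1 8B
N_LAYERS = 32
N_KV_HEADS = 8      # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
GIB = 1024 ** 3


def weight_gib(n_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_bytes_per_token(bytes_per_value: int = 2) -> int:
    """Keys plus values (factor 2) across all layers and KV heads; FP16 by default."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value


if __name__ == "__main__":
    print(f"FP16 weights:  {weight_gib(N_PARAMS, 16):.1f} GiB")
    print(f"4-bit weights: {weight_gib(N_PARAMS, 4):.1f} GiB (lower bound)")
    per_seq = kv_cache_bytes_per_token() * 4096 / GIB
    print(f"KV cache for one 4096-token sequence: {per_seq:.2f} GiB")
```

Whatever the weights do not use is what vLLM can hand to the KV cache, which is why the 4-bit configuration leaves far more room for concurrent requests on a 24GB card.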
Merge LoRA weights first:
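import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the base model in FP16 (not 4-bit) so the LoRA deltas merge into full-precision weights
base = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B', torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, './lora-checkpoint')
merged = model.merge_and_unload()
merged.save_pretrained('./llama3-finetuned-merged')
# Save the tokenizer next to the weights so the directory can be loaded on its own
AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B').save_pretrained('./llama3-finetuned-merged')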
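AWQ quantization (run this once, before starting the server):
pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model = AutoAWQForCausalLM.from_pretrained('./llama3-finetuned-merged')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B')
# 4-bit weights, group size 128, zero-point quantization
model.quantize(tokenizer, quant_config={'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'})
# Output directory name is illustrative; any path works as long as the server uses the same one
model.save_quantized('./llama3-finetuned-awq')
tokenizer.save_pretrained('./llama3-finetuned-awq')
vLLM server (point --model at the quantized checkpoint, not the FP16 merge):
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model ./llama3-finetuned-awq \
--quantization awq \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--tensor-parallel-size 1 \
--port 8000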
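Expected performance on A10G (24GB). Figures are approximate and depend on prompt and output length; VRAM is best read as the model footprint, since vLLM reserves up to --gpu-memory-utilization of the card for weights plus KV cache: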
| Config | VRAM | p50 latency | p95 latency |
|---|---|---|---|
| FP16, no batching | 18GB | 80ms | 180ms |
| AWQ 4-bit + vLLM | 10GB | 45ms | 120ms |
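To check figures like those in the table on your own hardware, a minimal latency probe might look like the sketch below. It assumes the server from the command above is reachable at a base URL such as http://localhost:8000 and exposes the OpenAI-compatible /v1/models and /v1/completions routes; the function names, prompt, and request count are hypothetical. It sends requests one at a time, so it measures unloaded latency rather than behavior under concurrency.

```python
import json
import math
import time
import urllib.request


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def _post_json(url, payload):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def measure(base_url, prompt="Hello", max_tokens=32, n_requests=50):
    """Send sequential completion requests and return (p50_ms, p95_ms)."""
    # Ask the server which model id it serves instead of hard-coding a path.
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=30) as resp:
        model_id = json.load(resp)["data"][0]["id"]
    payload = {"model": model_id, "prompt": prompt, "max_tokens": max_tokens}
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        _post_json(f"{base_url}/v1/completions", payload)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return percentile(latencies_ms, 50), percentile(latencies_ms, 95)
```

Calling measure("http://localhost:8000") against the running server returns the p50 and p95 in milliseconds. Use prompts and max_tokens that match production traffic, because end-to-end latency grows with the number of generated tokens.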
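Monitoring: Track vllm:num_requests_running and vllm:gpu_cache_usage_perc in Prometheus. Alert when cache usage stays above 85%: as the KV cache fills up, vLLM preempts running requests and tail latency climbs, so shed load or lower concurrency (for example with --max-num-seqs).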
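Without a full Prometheus setup, the same check can be approximated by reading the server's /metrics endpoint directly. The sketch below parses the Prometheus text format for a single metric; it assumes the cache gauge is reported as a fraction between 0 and 1, and the function names, sample text, and base URL are hypothetical.

```python
import urllib.request

# Illustrative sample of the text format, not real server output.
SAMPLE = """\
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="m"} 0.91
vllm:num_requests_running{model_name="m"} 12.0
"""


def parse_metric(metrics_text, name):
    """Return the value of every sample of `name` found in Prometheus text output."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line.startswith(name):
            continue  # comments (# HELP, # TYPE) and other metrics
        rest = line[len(name):]
        if rest.startswith("{"):
            rest = rest[rest.rfind("}") + 1:]  # drop the label set
        elif not rest.startswith(" "):
            continue  # a longer metric name that merely shares the prefix
        fields = rest.split()
        if fields:
            values.append(float(fields[0]))
    return values


def cache_pressure(metrics_text, threshold=0.85):
    """True if any KV-cache usage sample exceeds the threshold (a fraction, 0 to 1)."""
    return any(v > threshold for v in parse_metric(metrics_text, "vllm:gpu_cache_usage_perc"))


def fetch_metrics(base_url):
    """Download the raw metrics text from a running server."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=10) as resp:
        return resp.read().decode("utf-8")
```

For example, cache_pressure(fetch_metrics("http://localhost:8000")) could drive a simple alert in a cron job. A single reading can spike briefly, so in practice the check should require several consecutive readings above the threshold.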
Get Started
How to install and use this skill in your preferred environment.
Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.
Models & Context
Which AI models and context windows work best with this skill.
Recommended Models
Larger models produce more detailed, production-ready outputs.
Context Window
This skill's SKILL.md is typically 3–10 KB, so it fits in any modern context window.
All current frontier models support 100K+ context.
Pro tips for best results
Be specific
Include numbers — users, budget, RPS — so the skill can size the architecture.
Share constraints
Compliance needs, team size, and existing stack all improve the output.
Iterate
Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
Combine skills
Pair with companion skills below for end-to-end coverage.
Ready to try Senior ML Engineer?
Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.