Engineering · ML · MLOps · LLM

Senior ML Engineer

ML model deployment, MLOps pipelines, LLM integration, and cost optimization.

Real Conversations

See how the Senior ML Engineer skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

Fine-tuned LLaMA 3.1 8B with QLoRA on our dataset. Need to serve it with <200ms p95 latency on a single A10G GPU. What's the best serving stack?

Agent

Serving Stack: vLLM on A10G with quantization

vLLM's PagedAttention gives 2-4x higher throughput vs naive HuggingFace inference, and easily fits 8B in 24GB VRAM with 4-bit quantization.
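The fit claim follows from simple arithmetic on weight size. Here is a rough sketch, assuming a round 8 billion parameters and counting weights only. KV cache, activations, and quantization metadata (scales and zero points) add overhead on top. `weight_memory_gb` is a hypothetical helper for illustration, not part of vLLM.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9  # assumed round figure for an "8B" model

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(f'FP16 weights:  {fp16_gb:.1f} GB')
print(f'4-bit weights: {int4_gb:.1f} GB')
```

On a 24GB card, whatever the weights do not use is the headroom vLLM can spend on KV cache for concurrent requests, which is why the 4-bit configuration batches more comfortably.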

Merge LoRA weights first:

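import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = 'meta-llama/Meta-Llama-3.1-8B'

# Load the base in FP16 (not 4-bit) so the adapter merges into full-precision weights
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, './lora-checkpoint')
merged = model.merge_and_unload()
merged.save_pretrained('./llama3-finetuned-merged')

# Save the tokenizer next to the weights so later steps can load both from one directory
AutoTokenizer.from_pretrained(BASE_ID).save_pretrained('./llama3-finetuned-merged')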

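AWQ quantization (do this once before deploying; AutoAWQ is driven from Python):

pip install autoawq

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# The output directory name is an assumption; any path works as long as vLLM points at it
model = AutoAWQForCausalLM.from_pretrained('./llama3-finetuned-merged')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B')
model.quantize(tokenizer, quant_config={'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'})
model.save_quantized('./llama3-finetuned-awq')
tokenizer.save_pretrained('./llama3-finetuned-awq')

vLLM server (serve the AWQ checkpoint, not the FP16 merge):

pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model ./llama3-finetuned-awq \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 1 \
  --port 8000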

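Expected performance on A10G (24GB). Treat these as rough figures and benchmark with your own prompt and output lengths. vLLM also preallocates KV-cache blocks up to --gpu-memory-utilization, so nvidia-smi will report more than the VRAM column shows.

Config               VRAM    p50 latency    p95 latency
FP16, no batching    18GB    80ms           180ms
AWQ 4-bit + vLLM     10GB    45ms           120ms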
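To check the p95 target against your own traffic, a small timing harness is enough. The following is a minimal sketch, assuming the OpenAI-compatible `/v1/completions` endpoint on port 8000 and sequential requests (so it measures single-request latency, not behavior under load). `percentile`, `completion_request`, and `measure_latency_ms` are hypothetical helper names, and `model` must match the value passed to `--model`.

```python
import json
import math
import time
import urllib.request


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def completion_request(prompt, model, url='http://localhost:8000/v1/completions', max_tokens=32):
    """Send one completion request (URL and payload fields are assumptions about your deployment)."""
    body = json.dumps({'model': model, 'prompt': prompt, 'max_tokens': max_tokens}).encode()
    req = urllib.request.Request(url, data=body, headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def measure_latency_ms(send, prompts):
    """Time each send(prompt) call in sequence and return p50/p95 in milliseconds."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        send(prompt)
        latencies.append((time.perf_counter() - start) * 1000)
    return {'p50': percentile(latencies, 50), 'p95': percentile(latencies, 95)}
```

Usage: `measure_latency_ms(lambda p: completion_request(p, model='<value passed to --model>'), prompts)` with a few hundred prompts drawn from real traffic. Latency grows with `max_tokens`, so fix it to the output length you expect in production.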

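Monitoring: Track vllm:num_requests_running and vllm:gpu_cache_usage_perc in Prometheus. The cache gauge is a fraction from 0 to 1; alert and shed load (or lower concurrency) when it stays above 0.85, because vLLM starts preempting requests once the KV cache fills.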
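Without a Prometheus alert rule in place, the same check can be scripted against the server's metrics endpoint. This is a minimal sketch, assuming the metrics are served at `/metrics` on port 8000 in Prometheus text format, without timestamps, and that the cache gauge reports a fraction between 0 and 1. `read_gauge`, `cache_under_pressure`, and `fetch_metrics` are hypothetical helpers.

```python
import urllib.request


def read_gauge(metrics_text, name):
    """Return every sample value for one metric from Prometheus text-format output."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blanks and HELP/TYPE comments
        if not (line.startswith(name + '{') or line.startswith(name + ' ')):
            continue  # different metric, or one that merely shares a prefix
        # The value follows the closing brace of the label set, or the bare name
        rest = line.rsplit('}', 1)[-1] if '{' in line else line[len(name):]
        values.append(float(rest.split()[0]))
    return values


def cache_under_pressure(metrics_text, threshold=0.85):
    """True when any reported KV-cache usage sample exceeds the threshold."""
    values = read_gauge(metrics_text, 'vllm:gpu_cache_usage_perc')
    return bool(values) and max(values) > threshold


def fetch_metrics(url='http://localhost:8000/metrics'):
    """Download the raw metrics page (URL is an assumption about your deployment)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode()
```

Usage: `cache_under_pressure(fetch_metrics())` from a cron job or health check. Metric names have changed between vLLM releases, so confirm them against your server's `/metrics` output first.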

You

Have 5 data scientists running experiments with different sklearn models. No tracking — results get lost in Jupyter notebooks. Need MLflow set up properly.

Agent

MLflow Setup for Team Experiment Tracking:

1. Shared MLflow server (deploy on internal VM or ECS):

mlflow server \
  --backend-store-uri postgresql://mlflow:password@db-host/mlflow \
  --default-artifact-root s3://your-bucket/mlflow-artifacts \
  --host 0.0.0.0 --port 5000

2. Standardized experiment helper for your team:

import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

MLFLOW_URI = 'http://mlflow-server:5000'

def run_experiment(experiment_name: str, pipeline: Pipeline, X, y, params: dict):
    """Cross-validate, fit, and log one pipeline run to the shared MLflow server."""
    mlflow.set_tracking_uri(MLFLOW_URI)
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run() as run:
        mlflow.log_params(params)
        mlflow.log_param('model_type', type(pipeline.steps[-1][1]).__name__)

        # 5-fold CV; 'roc_auc' assumes a binary classification target
        scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
        mlflow.log_metric('cv_auc_mean', np.mean(scores))
        mlflow.log_metric('cv_auc_std', np.std(scores))

        # Refit on all of X before logging so the registered model has seen every row
        pipeline.fit(X, y)
        mlflow.sklearn.log_model(pipeline, 'model',
            registered_model_name=experiment_name,
            input_example=X[:5])

        print(f'Run logged: {run.info.run_id}')

3. Usage by any data scientist:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

run_experiment(
    experiment_name='churn-prediction',
    pipeline=Pipeline([('scaler', StandardScaler()), ('clf', XGBClassifier(n_estimators=200))]),
    X=X_train, y=y_train,
    params={'n_estimators': 200, 'feature_set': 'v3', 'dataset_date': '2025-03'}
)

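Model promotion: open the UI at http://mlflow-server:5000, compare runs, and promote the best registered version to Staging and then Production. Recent MLflow releases deprecate stages in favor of model version aliases, so on a newer server assign an alias to the winning version instead.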
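Runs can also be compared from a notebook rather than the UI. Here is a hedged sketch using `mlflow.search_runs`, assuming the tracking URI and the `cv_auc_mean` metric name from the helper above. `order_by_metric` and `top_runs` are hypothetical names.

```python
MLFLOW_URI = 'http://mlflow-server:5000'  # assumed to match the shared server


def order_by_metric(metric: str, descending: bool = True) -> list:
    """Build the order_by clause that mlflow.search_runs accepts for a logged metric."""
    direction = 'DESC' if descending else 'ASC'
    return [f'metrics.{metric} {direction}']


def top_runs(experiment_name: str, metric: str = 'cv_auc_mean', n: int = 5):
    """Return the n best runs of an experiment as a pandas DataFrame."""
    import mlflow  # imported lazily so order_by_metric works without MLflow installed

    mlflow.set_tracking_uri(MLFLOW_URI)
    return mlflow.search_runs(
        experiment_names=[experiment_name],
        order_by=order_by_metric(metric),
        max_results=n,
    )
```

Usage: `top_runs('churn-prediction')[['run_id', 'metrics.cv_auc_mean', 'metrics.cv_auc_std', 'params.model_type']]` gives a quick leaderboard. Two runs whose means differ by less than their standard deviations are probably not meaningfully different.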

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Best: Claude Opus 4.8, Claude Sonnet 4, GPT-4.1, Gemini 2.5 Pro, Grok 3, Kimi K2

Good: Claude Haiku 4.5, GPT-4.1 mini, Gemini 2.5 Flash, Grok 3 mini

Larger models produce more detailed, production-ready outputs.

Context Window

This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.

8K: Skill only

32K+: Skill + conversation

100K+: Skill + references + codebase

All current frontier models support 100K+ context.
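The file-size claim converts to tokens with a quick estimate. This is a back-of-the-envelope sketch, assuming the common rule of thumb of about 4 characters per token for English text and one byte per character. Real tokenizers vary, so treat the result as an order of magnitude. `estimate_tokens` is a hypothetical helper.

```python
CHARS_PER_TOKEN = 4  # assumed rule of thumb; actual ratio depends on the tokenizer


def estimate_tokens(size_bytes: int) -> int:
    """Rough token count for an ASCII-heavy text file of the given size."""
    return size_bytes // CHARS_PER_TOKEN


small = estimate_tokens(3 * 1024)
large = estimate_tokens(10 * 1024)
print(f'3 KB  -> ~{small} tokens')
print(f'10 KB -> ~{large} tokens')
```

Even the upper end is a small fraction of an 8K window, which is why the skill file alone fits with room to spare.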

Pro tips for best results

1. Be specific: include numbers (users, budget, RPS) so the skill can size the architecture.

2. Share constraints: compliance needs, team size, and existing stack all improve the output.

3. Iterate: start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.

4. Combine skills: pair with companion skills below for end-to-end coverage.

Ready to try Senior ML Engineer?

Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
