Full-time
On-site
**Job Description**
**About Valiance**
Valiance is a deep-tech AI company building sovereign and mission-critical AI solutions for enterprises, the public sector, and government institutions. From predictive maintenance and demand planning to sovereign AI for citizen services, we design systems that thrive in high-stakes environments. Recognized with the NASSCOM AI Game Changers Award and the Aegis Graham Bell Award, and a certified Google Cloud Partner, our 200+ engineers and data scientists are shaping the future of industries and societies through responsible AI.
**The Role**
We are looking for a senior LLMOps Engineer who has taken LLM inference optimization from idea to production, not just to proof of concept. You will own the end-to-end efficiency of our LLM inference infrastructure running on H200 GPUs, driving down cost and latency while maintaining the reliability our enterprise and government clients demand. This is a high-ownership, high-impact role on a team building some of India's most consequential AI systems.
**What You Will Do**
* Design and operate production-grade LLM inference pipelines on H200 GPU clusters, optimizing for throughput, latency, and cost per token.
* Evaluate and deploy small-to-medium open-source LLMs (e.g., Mistral, Llama, Phi, Gemma) as cost-efficient alternatives to large models without sacrificing output quality.
* Tune and manage vLLM deployments in production, including continuous batching, paged attention, tensor parallelism, and quantization (GPTQ, AWQ, FP8); a configuration sketch follows this list.
* Build and maintain model-serving APIs with robust observability: latency percentiles, GPU utilization, queue depths, and cost-per-request dashboards.
* Architect Kubernetes-based autoscaling strategies for inference workloads, balancing cold-start penalties against cost at scale.
* Run structured A/B experiments comparing model variants, quantization levels, and batching strategies using production traffic rather than synthetic benchmarks.
* Collaborate with applied ML engineers and solution architects to identify latency and cost bottlenecks across the model serving stack.
* Establish and enforce SLOs for inference reliability, and build alerting and runbooks for production incidents.
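To make the vLLM item above concrete, here is a minimal sketch of the kind of configuration this role would own, using vLLM's offline Python API. The model name, parallelism degree, and quantization settings are illustrative assumptions, not Valiance's actual production values.

```python
# Illustrative vLLM setup; every value below is an assumption for the sketch,
# not a description of Valiance's production configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/model-7b-awq",      # placeholder; assumes an AWQ-quantized checkpoint
    tensor_parallel_size=2,        # shard weights across two GPUs
    quantization="awq",            # lower memory footprint per weight
    gpu_memory_utilization=0.90,   # leave headroom for the paged KV cache
    max_num_seqs=256,              # cap on sequences in the continuous batch
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Continuous batching and paged attention are vLLM defaults, so in practice the tuning work concentrates on the memory, parallelism, and quantization knobs shown here, measured against the latency and cost-per-token targets above.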
**What We Are Looking For**
**Non-Negotiables**
* 3+ years of hands-on experience operating LLM inference in production, with demonstrable cost and latency improvements rather than POC results.
* Deep expertise with vLLM in production: batching strategies, memory management, quantization tradeoffs.
* Strong Python engineering skills: clean, testable, production-ready code.
* Proficiency with Docker and Kubernetes for deploying and scaling GPU inference workloads.
* Experience building and maintaining REST/gRPC APIs for model serving at scale; a minimal serving-endpoint sketch follows this list.
* Hands-on experience with open-source LLMs and the ability to evaluate model-quality vs. cost tradeoffs for real use cases.
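As an illustration of the serving-API requirement above, here is a hedged sketch of a minimal REST facade in front of an OpenAI-compatible inference server such as vLLM's. The upstream URL, route, and model name are placeholder assumptions, not a real deployment.

```python
# Hypothetical REST serving facade; the upstream endpoint and model name
# are placeholders, not a description of an actual deployment.
import time

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
UPSTREAM = "http://localhost:8000/v1/completions"  # placeholder vLLM server


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256


@app.post("/generate")
async def generate(req: GenerateRequest) -> dict:
    start = time.perf_counter()
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(UPSTREAM, json={
            "model": "placeholder-model",  # must match the model served upstream
            "prompt": req.prompt,
            "max_tokens": req.max_tokens,
        })
    resp.raise_for_status()
    latency_ms = (time.perf_counter() - start) * 1000
    # In production, latency would feed a p50/p95/p99 histogram rather than
    # being returned inline; shown here only to make the measurement visible.
    return {"text": resp.json()["choices"][0]["text"],
            "latency_ms": round(latency_ms, 1)}
```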
**Strong Advantages**
* Experience with GPU memory profiling and optimization (CUDA-level awareness a plus).
* Familiarity with model distillation, speculative decoding, or FlashAttention implementations.
* Exposure to multi-GPU and multi-node inference setups.
* Experience with inference frameworks beyond vLLM: TGI, TensorRT-LLM, Triton Inference Server.
* Familiarity with sovereign AI or air-gapped deployment constraints.
**Why Valiance**
* You will work on AI systems that are actually deployed at scale — used by government institutions and large enterprises, not just demoed.
* Direct access to H200 infrastructure with meaningful compute budgets and no GPU rationing.
* A culture that rewards engineering depth and production ownership over slide decks.
* Competitive compensation with performance-linked incentives.
* Opportunity to define how Valiance builds its AI platform as we scale.
**How to Apply**
Upload your resume and a brief note on a specific inference optimization you shipped in production: the problem, your approach, and the measurable outcome. We do not conduct screening rounds for this role. Shortlisted candidates will move directly to a technical discussion with our engineering leadership.