Accepting Applications
Full-time
On-site
Job Description
Jobright is a next-generation AI job search platform built to make career navigation faster, smarter, and more personal. We are looking for a Site Reliability Engineer to keep the systems behind our AI agents fast, resilient, and ready to scale as millions of job seekers depend on them every day.
Why Join Us
- Own the infrastructure that keeps real-time AI agents running reliably for users making important career decisions
- Tackle problems unique to LLM-powered systems, from inference latency and cost optimization to handling unpredictable traffic spikes
- Work with engineers who treat reliability as a product feature, not a clean-up job that happens after the fact
- Join a team where automation, observability, and thoughtful on-call practices are first-class investments
Responsibilities
- Design, build, and maintain the cloud infrastructure that powers Jobright's AI agents, APIs, and user-facing services
- Improve system observability through metrics, logging, and tracing, making it easier for the whole team to understand what's happening in production
- Partner with product and engineering teammates to harden new features before launch, owning capacity planning, performance testing, and rollout strategies
- Lead incident response when things go wrong, run blameless post-mortems, and turn each incident into durable improvements in reliability and tooling
Qualifications
Required
- Early to mid-career engineer with 1 to 3 years of experience in site reliability, DevOps, platform, or backend engineering
- Strong communicator who can break down complex infrastructure tradeoffs for engineers, product partners, and leadership alike
- Solid grounding in cloud platforms, containerization, CI/CD pipelines, and the fundamentals of distributed systems
Preferred
- Prior experience supporting production AI/ML workloads or high-throughput API services at a tech or AI-focused organization
- Demonstrated comfort operating in fast-moving environments where on-call coverage, incident response, and infrastructure changes happen in parallel
- Hands-on skills in AWS or GCP, Kubernetes, Terraform, monitoring stacks like Datadog or Prometheus, and scripting in Python or Go