DevOps Engineer

Peerlogic

Canada

Accepting Applications Full-time Hybrid
Job Description
**Senior / Staff DevOps Engineer (Platform & Reliability)**

**Location:** Remote (U.S. or Canada)

**Company:** Peerlogic

**The Role**

Peerlogic is hiring a **Senior / Staff DevOps Engineer** to own the platform, infrastructure, and reliability of a production system that spans **application services, AI/ML workloads, and real-time voice infrastructure**.

You are replacing a strong DevOps leader and not building from scratch. The system works. Your job is to **make it exceptional**.

This is not a support role. This is not a ticket-driven role. You will:

* Own reliability end-to-end
* Make architectural decisions with real consequences
* Operate in ambiguity without waiting for direction

If you prefer clearly defined scopes, narrow ownership, or “assigned work,” this is not the role.

**What You’ll Own**

**Platform & Infrastructure**

* End-to-end ownership of **cloud + hybrid infrastructure** (AWS, GCP, and physical environments)
* Multi-region architecture targeting **99.999% uptime**
* Kubernetes clusters and container orchestration across all services
* CI/CD pipelines (GitHub Actions); reliability, speed, and developer experience
* Infrastructure as Code (Terraform, Ansible)

**Reliability & Observability**

* Design and enforce **SLOs, SLIs, and error budgets**
* Build a **best-in-class observability stack** (metrics, logs, traces)
* Drive incident response, postmortems, and systemic fixes (not band-aids)
* Reduce MTTR and eliminate repeat incidents

**Data & Event Systems**

* Ownership of **event-driven architecture** (RabbitMQ or equivalent)
* Ensure **durability, replayability, and correctness** of pipelines
* Design and maintain **backfill and recovery strategies**
* Improve debuggability of asynchronous systems

**AI / ML Infrastructure**

* Operate and scale **LLM-powered systems** (Bedrock, SageMaker, or equivalent)
* Manage inference workloads with a focus on:
    * Latency
    * Cost
    * Reliability
* Build and maintain:
    * Evaluation pipelines
    * Dataset versioning
    * Reproducible ML workflows

**Performance & Cost**

* Own **infrastructure cost efficiency** across:
    * Compute
    * Storage
    * LLM usage
* Continuously optimize tradeoffs between:
    * Performance
    * Reliability
    * Cost

**Security & Compliance**

* Own infrastructure posture for **SOC 2 and HIPAA**
* Ensure secure handling of PHI (encryption, access controls, auditability)
* Implement and enforce:
    * Secrets management
    * IAM best practices
    * Network isolation
* Partner with compliance tooling (e.g., Sprinto)

**What You Will NOT Own**

* SIP routing, dial plans, or telecom call flows
* Carrier integrations or VoIP-specific logic

(You will collaborate closely with a dedicated VoIP Infrastructure Engineer where systems intersect.)

**What We’re Looking For**

**Experience**

* 5–10+ years in DevOps, SRE, or Infrastructure Engineering
* Proven ownership of **production systems at scale**
* Experience operating **multi-region, high-availability systems**

**Technical Depth**

Strong hands-on experience with:

* Kubernetes, ECS, and containerized systems
* Terraform and infrastructure as code
* CI/CD systems (GitHub Actions preferred)
* Networking fundamentals (TCP/IP, DNS, iptables, load balancing)

You should also:

* Be comfortable writing code (Python, Go, or similar)
* Have experience with **real-time or low-latency systems**
* Understand **event-driven architectures** deeply

**Mindset (this matters more than tools)**

* You take ownership beyond your “area”
* You fix root causes, not symptoms
* You make decisions with incomplete information
* You care about **systems, not just infrastructure**

**Our Stack (Partial)**

* AWS, GCP, Kubernetes
* Python, Postgres
* RabbitMQ / async pipelines
* LLM systems (multi-agent, inference pipelines)
* VoIP + EHR integrations (adjacent systems)

**What Success Looks Like**

Within 3–6 months:

* Reliability improves measurably (fewer incidents, faster recovery)
* Observability provides **clear, actionable insights** across systems
* CI/CD becomes faster, safer, and more predictable
* Event-driven systems are easier to debug and recover

Within 6–12 months:

* Platform operates at or near **5-nines reliability**
* Infrastructure scales cleanly across app, AI, and voice workloads
* AI systems are **cost-efficient and production-grade**
* Engineering velocity increases due to strong platform foundations

**Team & Environment**

* ~10-person engineering team
* Reports directly to CTO
* High-ownership, fast-moving startup
* Expectation of after-hours ownership when needed

**Compensation**

* $140K – $180K CAD base (flexible for Senior vs Staff)
* Equity included
* Will stretch for the right candidate

**Why This Role Matters**

Peerlogic sits at the intersection of **healthcare, AI, and real-time communication**. This role ensures the platform is:

* Fast enough for real-time interaction
* Reliable enough for healthcare workflows
* Scalable enough to support rapid growth