DevOps Engineer

Canada

Accepting Applications • Full-time • Hybrid

Posted 3 hours, 35 minutes ago

Job Description

**Senior / Staff DevOps Engineer (Platform & Reliability)**

**Location:** Remote (U.S. or Canada)

**Company:** Peerlogic

**The Role**

Peerlogic is hiring a **Senior / Staff DevOps Engineer** to own the platform, infrastructure, and reliability of a production system spanning **application services, AI/ML workloads, and real-time voice infrastructure**.

You are replacing a strong DevOps leader, not building from scratch. The system works. CI/CD is in place. Observability is mature.

Your job is to **maintain and improve a platform operating near 5-nines reliability** by:

* reducing incidents (not just responding to them)
* increasing system efficiency
* scaling infrastructure to support Peerlogic’s growth

This is not a support or ticket-driven role. You will:

* Own reliability end-to-end
* Make architectural decisions with real consequences
* Improve existing systems and build new ones where needed
* Operate in ambiguity without waiting for direction

**What You’ll Own**

**Platform & Infrastructure**

* Cloud + hybrid infrastructure (AWS, GCP, on-prem)
* Multi-region systems operating near **99.999% uptime**
* Kubernetes, ECS, containers, and serverless systems
* CI/CD pipelines (GitHub Actions): optimize and improve developer workflows
* Infrastructure as Code (Terraform, Ansible)

**Reliability & Observability**

* Take ownership of an existing observability stack (metrics, logs, tracing, alerts)
* **Reduce the frequency and impact of incidents and alerts**
* Improve signal-to-noise and eliminate unnecessary alerting
* Identify root causes and remove entire classes of failure
* Drive incident response, postmortems, and systemic fixes
* Reduce MTTR and prevent recurrence

**Data & AI Systems**

* Event-driven systems (RabbitMQ): durability, replay, debugging
* LLM infrastructure: inference performance, cost, and reliability
* Improve evaluation pipelines, dataset versioning, and reproducibility

**Performance, Cost & Scaling**

* Improve system performance and latency across services
* Own infrastructure cost efficiency (compute, storage, LLM usage)
* Scale systems cleanly as Peerlogic grows
* Identify bottlenecks and remove them

**Security & Networking**

* Maintain SOC 2 / HIPAA infrastructure posture (DevSecOps practices)
* Networking ownership (TCP/IP, DNS, load balancing, iptables)
* Support real-time and low-latency system requirements

**VoIP & Real-Time Systems**

Peerlogic operates a **real-time VoIP platform** as a core part of the system. You will:

* Work alongside dedicated VoIP Engineers
* Learn the voice stack (SIP, RTP, real-time media systems) over time
* Gradually take on **shared responsibility for supporting and scaling voice infrastructure**, with guidance

VoIP experience is not required, but you should:

* Be curious about real-time systems
* Be willing to learn new domains deeply
* Be comfortable expanding your ownership into adjacent systems

**What You Will NOT Own (Initially)**

* Direct ownership of SIP routing, dial plans, or carrier integrations

(You will grow into supporting parts of this system over time.)

**What We’re Looking For**

**Experience**

* 8-10+ years in DevOps, SRE, or Infrastructure Engineering
* Proven ownership of production systems at scale
* Experience with multi-region, high-availability systems
* Experience in hybrid environments (cloud + on-prem preferred)

**Technical Depth**

* Kubernetes / containerized systems
* Terraform / Ansible (Infrastructure as Code)
* CI/CD systems (GitHub Actions preferred)
* Networking fundamentals (TCP/IP, DNS, load balancing, iptables)

You should also:

* Write code (Python, Go, or similar)
* Understand event-driven architectures
* Have real-time or low-latency experience **or strong interest in learning**

**Mindset**

* You take ownership beyond your area
* You reduce problems, not just react to them
* You fix root causes, not symptoms
* You make decisions with incomplete information
* You think in systems, not just tools
* You’re willing to learn adjacent domains (including real-time voice systems)

**Our Stack (Partial)**

* AWS, GCP, Kubernetes
* Python, Postgres
* RabbitMQ / async pipelines
* LLM systems (multi-agent, inference pipelines)
* VoIP + EHR integrations (adjacent systems)

**What Success Looks Like**

**3–6 months**

* Alert noise is reduced and signal quality improves
* Fewer recurring incidents
* Systems become easier to debug and operate

**6–12 months**

* Platform consistently operates at or near **5-nines reliability**
* Incident frequency decreases meaningfully
* Systems scale cleanly with business growth
* Infrastructure is faster, more efficient, and more cost-effective
* You are contributing to the broader system, including voice infrastructure

**Team & Environment**

* ~10 person engineering team
* Reports to CTO
* High-ownership, fast-moving startup
* Shared on-call responsibility

**Why This Role Matters**

Peerlogic operates at the intersection of:

* healthcare workflows
* AI-driven systems
* real-time communication

This role ensures the platform is:

* fast enough for real-time interaction
* reliable enough for healthcare workflows
* scalable enough to support rapid growth

If this layer fails, everything above it fails.
