DevOps Engineer

Peerlogic

Canada

Accepting Applications Full-time Hybrid
Job Description
**Senior / Staff DevOps Engineer (Platform & Reliability)**

**Location:** Remote (U.S. or Canada)
**Company:** Peerlogic

**The Role**

Peerlogic is hiring a **Senior / Staff DevOps Engineer** to own the platform, infrastructure, and reliability of a production system spanning **application services, AI/ML workloads, and real-time voice infrastructure**.

You are replacing a strong DevOps leader, not building from scratch. The system works. CI/CD is in place. Observability is mature.

Your job is to **maintain and improve a platform operating near 5-nines reliability** by:

* reducing incidents (not just responding to them)
* increasing system efficiency
* scaling infrastructure to support Peerlogic’s growth

This is not a support or ticket-driven role. You will:

* Own reliability end-to-end
* Make architectural decisions with real consequences
* Improve existing systems and build new ones where needed
* Operate in ambiguity without waiting for direction

**What You’ll Own**

**Platform & Infrastructure**

* Cloud + hybrid infrastructure (AWS, GCP, on-prem)
* Multi-region systems operating near **99.999% uptime**
* Kubernetes, ECS, containers, and serverless systems
* CI/CD pipelines (GitHub Actions): optimize and improve developer workflows
* Infrastructure as Code (Terraform, Ansible)

**Reliability & Observability**

* Take ownership of an existing observability stack (metrics, logs, tracing, alerts)
* **Reduce the frequency and impact of incidents and alerts**
* Improve signal-to-noise and eliminate unnecessary alerting
* Identify root causes and remove entire classes of failure
* Drive incident response, postmortems, and systemic fixes
* Reduce MTTR and prevent recurrence

**Data & AI Systems**

* Event-driven systems (RabbitMQ): durability, replay, debugging
* LLM infrastructure: inference performance, cost, and reliability
* Improve evaluation pipelines, dataset versioning, and reproducibility

**Performance, Cost & Scaling**

* Improve system performance and latency across services
* Own infrastructure cost efficiency (compute, storage, LLM usage)
* Scale systems cleanly as Peerlogic grows
* Identify bottlenecks and remove them

**Security & Networking**

* Maintain SOC 2 / HIPAA infrastructure posture (DevSecOps practices)
* Networking ownership (TCP/IP, DNS, load balancing, iptables)
* Support real-time and low-latency system requirements

**VoIP & Real-Time Systems**

Peerlogic operates a **real-time VoIP platform** as a core part of the system. You will:

* Work alongside dedicated VoIP Engineers
* Learn the voice stack (SIP, RTP, real-time media systems) over time
* Gradually take on **shared responsibility for supporting and scaling voice infrastructure**, with guidance

VoIP experience is not required, but you should:

* Be curious about real-time systems
* Be willing to learn new domains deeply
* Be comfortable expanding your ownership into adjacent systems

**What You Will NOT Own (Initially)**

* Direct ownership of SIP routing, dial plans, or carrier integrations

(You will grow into supporting parts of this system over time.)

**What We’re Looking For**

**Experience**

* 8-10+ years in DevOps, SRE, or Infrastructure Engineering
* Proven ownership of production systems at scale
* Experience with multi-region, high-availability systems
* Experience in hybrid environments (cloud + on-prem preferred)

**Technical Depth**

* Kubernetes / containerized systems
* Terraform / Ansible (Infrastructure as Code)
* CI/CD systems (GitHub Actions preferred)
* Networking fundamentals (TCP/IP, DNS, load balancing, iptables)

You should also:

* Write code (Python, Go, or similar)
* Understand event-driven architectures
* Have real-time or low-latency experience **or strong interest in learning**

**Mindset**

* You take ownership beyond your area
* You reduce problems, not just react to them
* You fix root causes, not symptoms
* You make decisions with incomplete information
* You think in systems, not just tools
* You’re willing to learn adjacent domains (including real-time voice systems)

**Our Stack (Partial)**

* AWS, GCP, Kubernetes
* Python, Postgres
* RabbitMQ / async pipelines
* LLM systems (multi-agent, inference pipelines)
* VoIP + EHR integrations (adjacent systems)

**What Success Looks Like**

**3–6 months**

* Alert noise is reduced and signal quality improves
* Fewer recurring incidents
* Systems become easier to debug and operate

**6–12 months**

* Platform consistently operates at or near **5-nines reliability**
* Incident frequency decreases meaningfully
* Systems scale cleanly with business growth
* Infrastructure is faster, more efficient, and more cost-effective
* You are contributing to the broader system, including voice infrastructure

**Team & Environment**

* ~10-person engineering team
* Reports to CTO
* High-ownership, fast-moving startup
* Shared on-call responsibility

**Why This Role Matters**

Peerlogic operates at the intersection of:

* healthcare workflows
* AI-driven systems
* real-time communication

This role ensures the platform is:

* fast enough for real-time interaction
* reliable enough for healthcare workflows
* scalable enough to support rapid growth

If this layer fails, everything above it fails.