Accepting Applications
Full-time
Hybrid
Posted 3 hours, 35 minutes ago
Job Description
**Senior / Staff DevOps Engineer (Platform \& Reliability)**
**Location:** Remote (U.S. or Canada)
**Company:** Peerlogic
**The Role**
Peerlogic is hiring a **Senior / Staff DevOps Engineer** to own the platform, infrastructure, and reliability of a production system spanning **application services, AI/ML workloads, and real\-time voice infrastructure**.
You are replacing a strong DevOps leader, not building from scratch.
The system works. CI/CD is in place. Observability is mature.
Your job is to **maintain and improve a platform operating near 5\-nines reliability** by:
* reducing incidents (not just responding to them)
* increasing system efficiency
* scaling infrastructure to support Peerlogic’s growth
This is not a support or ticket\-driven role.
You will:
* Own reliability end\-to\-end
* Make architectural decisions with real consequences
* Improve existing systems and build new ones where needed
* Operate in ambiguity without waiting for direction
**What You’ll Own**
**Platform \& Infrastructure**
* Cloud \+ hybrid infrastructure (AWS, GCP, on\-prem)
* Multi\-region systems operating near **99\.999% uptime**
* Kubernetes, ECS, containers, and serverless systems
* CI/CD pipelines (GitHub Actions) — optimize and improve developer workflows
* Infrastructure as Code (Terraform, Ansible)
**Reliability \& Observability**
* Take ownership of an existing observability stack (metrics, logs, tracing, alerts)
* **Reduce the frequency and impact of incidents and alerts**
* Improve signal\-to\-noise and eliminate unnecessary alerting
* Identify root causes and remove entire classes of failure
* Drive incident response, postmortems, and systemic fixes
* Reduce MTTR and prevent recurrence
**Data \& AI Systems**
* Event\-driven systems (RabbitMQ): durability, replay, debugging
* LLM infrastructure: inference performance, cost, and reliability
* Improve evaluation pipelines, dataset versioning, and reproducibility
**Performance, Cost \& Scaling**
* Improve system performance and latency across services
* Own infrastructure cost efficiency (compute, storage, LLM usage)
* Scale systems cleanly as Peerlogic grows
* Identify bottlenecks and remove them
**Security \& Networking**
* Maintain SOC 2 / HIPAA infrastructure posture (DevSecOps practices)
* Networking ownership (TCP/IP, DNS, load balancing, iptables)
* Support real\-time and low\-latency system requirements
**VoIP \& Real\-Time Systems**
Peerlogic operates a **real\-time VoIP platform** as a core part of the system.
You will:
* Work alongside dedicated VoIP Engineers
* Learn the voice stack (SIP, RTP, real\-time media systems) over time
* Gradually take on **shared responsibility for supporting and scaling voice infrastructure**, with guidance
VoIP experience is not required, but you should:
* Be curious about real\-time systems
* Be willing to learn new domains deeply
* Be comfortable expanding your ownership into adjacent systems
**What You Will NOT Own (Initially)**
* Direct ownership of SIP routing, dial plans, or carrier integrations
(You will grow into supporting parts of this system over time.)
**What We’re Looking For**
**Experience**
* 8\-10\+ years in DevOps, SRE, or Infrastructure Engineering
* Proven ownership of production systems at scale
* Experience with multi\-region, high\-availability systems
* Experience in hybrid environments (cloud \+ on\-prem preferred)
**Technical Depth**
* Kubernetes / containerized systems
* Terraform / Ansible (Infrastructure as Code)
* CI/CD systems (GitHub Actions preferred)
* Networking fundamentals (TCP/IP, DNS, load balancing, iptables)
You should also:
* Write code (Python, Go, or similar)
* Understand event\-driven architectures
* Have real\-time or low\-latency experience **or a strong interest in learning**
**Mindset**
* You take ownership beyond your area
* You reduce problems, not just react to them
* You fix root causes, not symptoms
* You make decisions with incomplete information
* You think in systems, not just tools
* You’re willing to learn adjacent domains (including real\-time voice systems)
**Our Stack (Partial)**
* AWS, GCP, Kubernetes
* Python, Postgres
* RabbitMQ / async pipelines
* LLM systems (multi\-agent, inference pipelines)
* VoIP \+ EHR integrations (adjacent systems)
**What Success Looks Like**
**3–6 months**
* Alert noise is reduced and signal quality improves
* Fewer recurring incidents
* Systems become easier to debug and operate
**6–12 months**
* Platform consistently operates at or near **5\-nines reliability**
* Incident frequency decreases meaningfully
* Systems scale cleanly with business growth
* Infrastructure is faster, more efficient, and more cost\-effective
* You are contributing to the broader system, including voice infrastructure
**Team \& Environment**
* \~10\-person engineering team
* Reports to CTO
* High\-ownership, fast\-moving startup
* Shared on\-call responsibility
**Why This Role Matters**
Peerlogic operates at the intersection of:
* healthcare workflows
* AI\-driven systems
* real\-time communication
This role ensures the platform is:
* fast enough for real\-time interaction
* reliable enough for healthcare workflows
* scalable enough to support rapid growth
If this layer fails, everything above it fails.