Accepting Applications
Full-time
Hybrid
Job Description
**Senior / Staff DevOps Engineer (Platform \& Reliability)**
**Location:**
Remote (U.S. or Canada)
**Company:**
Peerlogic
**The Role**
Peerlogic is hiring a **Senior / Staff DevOps Engineer** to own the platform, infrastructure, and reliability of a production system that spans **application services, AI/ML workloads, and real\-time voice infrastructure**.
You are replacing a strong DevOps leader, not building from scratch. The system works. Your job is to **make it exceptional**.
This is not a support role.
This is not a ticket\-driven role.
You will:
* Own reliability end\-to\-end
* Make architectural decisions with real consequences
* Operate in ambiguity without waiting for direction
If you prefer clearly defined scopes, narrow ownership, or “assigned work,” this is not the role.
**What You’ll Own**
**Platform \& Infrastructure**
* End\-to\-end ownership of **cloud \+ hybrid infrastructure** (AWS, GCP, and physical environments)
* Multi\-region architecture targeting **99\.999% uptime**
* Kubernetes clusters and container orchestration across all services
* CI/CD pipelines (GitHub Actions), including their reliability, speed, and developer experience
* Infrastructure as Code (Terraform, Ansible)
**Reliability \& Observability**
* Design and enforce **SLOs, SLIs, and error budgets**
* Build a **best\-in\-class observability stack** (metrics, logs, traces)
* Drive incident response, postmortems, and systemic fixes (not band\-aids)
* Reduce MTTR and eliminate repeat incidents
**Data \& Event Systems**
* Ownership of **event\-driven architecture** (RabbitMQ or equivalent)
* Ensure **durability, replayability, and correctness** of pipelines
* Design and maintain **backfill and recovery strategies**
* Improve debuggability of asynchronous systems
**AI / ML Infrastructure**
* Operate and scale **LLM\-powered systems** (Bedrock, SageMaker, or equivalent)
* Manage inference workloads with a focus on:
  * Latency
  * Cost
  * Reliability
* Build and maintain:
  * Evaluation pipelines
  * Dataset versioning
  * Reproducible ML workflows
**Performance \& Cost**
* Own **infrastructure cost efficiency** across:
  * Compute
  * Storage
  * LLM usage
* Continuously optimize tradeoffs between:
  * Performance
  * Reliability
  * Cost
**Security \& Compliance**
* Own infrastructure posture for **SOC 2 and HIPAA**
* Ensure secure handling of PHI (encryption, access controls, auditability)
* Implement and enforce:
  * Secrets management
  * IAM best practices
  * Network isolation
* Partner with compliance tooling (e.g., Sprinto)
**What You Will NOT Own**
* SIP routing, dial plans, or telecom call flows
* Carrier integrations or VoIP\-specific logic
(You will collaborate closely with a dedicated VoIP Infrastructure Engineer where systems intersect.)
**What We’re Looking For**
**Experience**
* 5–10\+ years in DevOps, SRE, or Infrastructure Engineering
* Proven ownership of **production systems at scale**
* Experience operating **multi\-region, high\-availability systems**
**Technical Depth**
Strong hands\-on experience with:
* Kubernetes, ECS, and containerized systems
* Terraform and infrastructure as code
* CI/CD systems (GitHub Actions preferred)
* Networking fundamentals (TCP/IP, DNS, iptables, load balancing)
You should also:
* Be comfortable writing code (Python, Go, or similar)
* Have experience with **real\-time or low\-latency systems**
* Understand **event\-driven architectures** deeply
**Mindset (this matters more than tools)**
* You take ownership beyond your “area”
* You fix root causes, not symptoms
* You make decisions with incomplete information
* You care about **systems, not just infrastructure**
**Our Stack (Partial)**
* AWS, GCP, Kubernetes
* Python, Postgres
* RabbitMQ / async pipelines
* LLM systems (multi\-agent, inference pipelines)
* VoIP \+ EHR integrations (adjacent systems)
**What Success Looks Like**
Within 3–6 months:
* Reliability improves measurably (fewer incidents, faster recovery)
* Observability provides **clear, actionable insights** across systems
* CI/CD becomes faster, safer, and more predictable
* Event\-driven systems are easier to debug and recover
Within 6–12 months:
* Platform operates at or near **5\-nines reliability**
* Infrastructure scales cleanly across app, AI, and voice workloads
* AI systems are **cost\-efficient and production\-grade**
* Engineering velocity increases due to strong platform foundations
**Team \& Environment**
* \~10\-person engineering team
* Reports directly to CTO
* High\-ownership, fast\-moving startup
* Expectation of after\-hours ownership when needed
**Compensation**
* $140K – $180K CAD base (flexible for Senior vs Staff)
* Equity included
* Will stretch for the right candidate
**Why This Role Matters**
Peerlogic sits at the intersection of **healthcare, AI, and real\-time communication**.
This role ensures the platform is:
* Fast enough for real\-time interaction
* Reliable enough for healthcare workflows
* Scalable enough to support rapid growth