Accepting Applications
Full-time
On-site
Job Description
Overview
We’re hiring an experienced AWS SRE Engineer to lead observability for a cloud platform. The role focuses on building and maintaining actionable Grafana dashboards, defining and measuring reliability (SLIs/SLOs/SLAs), owning alerting strategy, and driving improvements to platform resilience. This is an opportunity to shape operational excellence and influence engineering decisions across the stack.
What you’ll do (key responsibilities)
* Design, build and maintain Grafana dashboards that deliver actionable insights into performance, availability and capacity.
* Implement and improve observability for AWS-hosted applications and infrastructure (metrics, logs, traces).
* Define and track SLIs, SLOs and SLAs; manage error budgets and translate reliability targets into engineering priorities.
* Monitor using golden signals and operate an effective, noise-aware alerting strategy.
* Support incident response, run RCA processes and drive continuous reliability improvements.
* Embed observability into CI/CD and cloud operations; collaborate with platform, engineering and ops teams to improve operational efficiency.
Must-have skills and experience
* 6+ years in SRE, Cloud Reliability or Cloud Operations roles.
* Strong, hands-on AWS experience.
* Proven expertise building Grafana dashboards and working in observability/monitoring stacks.
* Solid understanding of SRE fundamentals (SLA, SLO, SLI, error budgets, golden signals).
* Track record of troubleshooting production systems and improving platform reliability.
* Strong communicator and team collaborator.
Nice-to-have
* Experience with Snowflake or Databricks.
* Familiarity with IaC, automation and cloud-native operational tooling.
More jobs from Marks Sattin