Site Reliability Engineer

PwC India

Remote (Anywhere)

Accepting Applications · Full-time · Remote
Posted 3 days, 23 hours ago
Job Description
Role Title
Azure Site Reliability Engineer (SRE)

Role Summary
We are hiring Azure SREs to engineer reliability at scale across mission-critical workloads in a regulated environment. You will design and operate highly available, secure, and cost-efficient Azure platforms with a Terraform-first approach, strong automation, and deep observability. The role includes on-call duty, incident management, and continuous improvement to reduce toil and improve SLAs/SLOs.

Key Responsibilities

SRE Foundations
· Define SLIs/SLOs, manage error budgets, and gate releases based on reliability risk.
· Lead on-call rotations, major incident response, and blameless postmortems with action tracking.
· Run game days and chaos/resilience drills, and drive toil reduction via automation.

Azure Platform & Governance
· Build CAF-aligned Landing Zones (hub-spoke/Virtual WAN); enforce Azure Policy as Code, tagging, and RBAC/PIM models.
· Engineer secure network topologies: Private Link/Endpoints, Azure Firewall/WAF, DDoS, ExpressRoute, Private DNS.

Infrastructure as Code & Automation
· Terraform (mandatory): design reusable modules, manage remote state & locking, implement policy checks (e.g., tfsec/Checkov/Conftest).
· Implement CI/CD with Azure DevOps/GitHub Actions; automate with PowerShell, Azure CLI, Python.
· Use Key Vault & workload identity for secretless pipelines; enforce PR reviews and plan/apply gates.

Kubernetes (AKS) Operations
· Operate AKS: upgrades (surge), node pool management, HPA/VPA, cluster autoscaler.
· Enforce Network Policies, Pod Security, and admission control (OPA/Gatekeeper); secure secrets and images.
· GitOps (Flux/ArgoCD), hardened ACR, image provenance, and supply chain controls.

Observability & AIOps
· Build full-stack monitoring with Azure Monitor, Log Analytics, Application Insights, Prometheus/Grafana.
· Create KQL dashboards/alerts, enable synthetic monitoring, and correlate traces with OpenTelemetry.
· Reduce MTTR using automated runbooks (Functions/Logic Apps/Automation) and optimize log/metrics cost.

Resilience, DR & Backup
· Architect HA/DR using Azure Site Recovery (ASR) and region pairs; define & test RTO/RPO.
· Operate Azure Backup with immutability/soft delete; enable Key Vault purge protection.
· Conduct periodic failover/restore drills with evidence and remediation follow-ups.

Security & Compliance
· Implement Zero Trust with Entra ID (RBAC, PIM, Conditional Access), Managed Identities, and least privilege.
· Enforce baselines with Defender for Cloud; integrate Microsoft Sentinel detections and SOAR playbooks.
· Support audits with change control, evidence, and segregation of duties.

Cost & Capacity (FinOps)
· Set budgets & alerts; apply rightsizing, reservations/savings plans, and storage tiering.
· Optimize observability/storage retention and data flows for cost efficiency.

Required Qualifications
· 5+ years of overall IT industry experience, including at least 3 years of hands-on expertise in Azure Site Reliability Engineering.
· Hands-on Terraform (mandatory): module design, state management, pipelines, policy/scanning, drift detection.
· Strong Azure infrastructure: compute, storage, networking (hub-spoke/vWAN, Private Link, Firewall/WAF, DDoS, ExpressRoute).
· AKS operations and container security fundamentals.
· Observability: Azure Monitor, App Insights, KQL, Prometheus/Grafana; SLO dashboarding.
· DR/Backup expertise: ASR, Azure Backup, RTO/RPO planning and test execution.
· Automation proficiency: PowerShell, Azure CLI, Python; Azure Functions/Logic Apps/Automation Accounts.
· Identity & security: Entra ID, RBAC/PIM, Key Vault, Defender for Cloud.
· Certifications: AZ-104 (mandatory).

Nice to Have
· Microsoft Sentinel (detections, hunting, SOAR runbooks).
· Chaos Studio, performance/load testing, progressive delivery (Blue/Green, Canary, feature flags).
· Data HA/DR across Azure SQL DB/MI, PostgreSQL Flexible Server.
· FinOps practices and cost optimization playbooks.
· Certifications: AZ-305, AZ-400, AZ-700, AZ-500.