DevOps SRE Engineer - Observability & Automation

TAT IT Technologies

United Arab Emirates

Full-time, On-site
Job Description
**Urgent requirement: a DevOps SRE Engineer - Observability & Automation is needed for our banking clients in Abu Dhabi, UAE.**

**Must-have:**

* Strong experience with Kafka, RabbitMQ, Redis, and RDS/Aurora
* Strong experience in observability (metrics, logs, traces, dashboards, and alerts)
* Strong experience with Kubernetes, Docker, container orchestration, and microservices support
* Strong experience with Terraform and Infrastructure as Code (IaC) practices
* Strong experience in Linux environments and performance troubleshooting
* Strong experience in banking

We’re looking for a talented **Site Reliability Engineer (SRE)** to keep our systems running smoothly, reliably, and at scale. Through smart **automation**, deep **observability**, and a calm head in a crisis, you’ll help us balance **speed**, **compliance**, and **stability**, working alongside the **DevOps**, **Cloud**, **Quality Engineering**, and **Product** teams to drive continuous improvements in **performance**, **security**, and **resilience**.

**Key Responsibilities**

* Define and implement SLIs/SLOs and error budgets for business-critical digital banking services.
* Build actionable observability (metrics, logs, traces, dashboards, and alerts) using Dynatrace, Prometheus, Grafana, and ELK, while reducing alert fatigue.
* Leverage AI-driven insights and anomaly detection (Dynatrace Davis AI or an equivalent AIOps platform) to predict and resolve reliability issues before they affect customers.
* Lead incident management, from on-call triage and root-cause analysis to blameless postmortems with actionable follow-ups.
* Improve deployment safety with robust rollout/rollback strategies, canary and blue-green deployments, and production readiness reviews.
* Support and optimize microservices-based architectures, ensuring service reliability, scalability, and inter-service resilience.
* Conduct capacity planning, performance tuning, and resilience testing, optimizing for both reliability and cost efficiency.
* Automate operational toil, from runbooks and remediation scripts to proactive health checks and self-healing workflows.
* Collaborate with DevOps to embed reliability gates and validations into CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI/CD, or Azure DevOps).
* Own and evolve the observability and AIOps stack, driving intelligent automation and predictive alerting capabilities.
* Maintain high-quality documentation, playbooks, and operational standards across environments.
* Ensure operational compliance and security alignment with internal controls and regulatory standards.
* Analyze system performance, availability, and cost data to continually optimize operations.
* Provide reliability support and escalation guidance for critical production systems during major incidents.

**Requirements**

* 5+ years of experience in SRE or DevOps roles, building and managing large-scale, high-availability systems in **banking**, **fintech**, **e-commerce**, or other data-intensive digital ecosystems.
* Bachelor’s degree in Computer Science or equivalent technical experience.
* Strong experience with Linux environments and performance troubleshooting.
* Proven expertise in Terraform and Infrastructure as Code (IaC) methodologies.
* Proficiency with Kubernetes and container orchestration in microservices environments.
* Hands-on experience with AWS (preferred); exposure to Azure or GCP is an advantage.
* Deep knowledge of Dynatrace (AIOps, Davis AI), Prometheus, Grafana, and the ELK stack.
* Experience implementing AI/ML-driven reliability or automation solutions (AIOps, anomaly detection, predictive alerting).
* Practical understanding of CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI/CD, or Azure DevOps).
* Experience with Kafka, RabbitMQ, Redis, Aurora, and RDS databases.
* Strong scripting or programming skills in Python, Bash, or Go.

Skills: automation, devops, sre