Accepting Applications
Full-time
On-site
Posted 1 week ago
Job Description
**Urgent requirement: DevOps SRE Engineer \- Observability \& Automation, for our banking clients in Abu Dhabi, UAE**

Must\-have requirements:
* Strong experience in Kafka, RabbitMQ, Redis, and RDS/Aurora
* Strong experience in observability (metrics, logs, traces, dashboards, and alerts)
* Strong experience in Kubernetes, Docker, container orchestration, and microservices support
* Strong experience in Terraform and IaC practices
* Strong experience in Linux environments and performance troubleshooting
* Strong experience in the banking domain
We’re looking for a talented **Site Reliability Engineer (SRE)** to keep our systems running smoothly, reliably, and at scale. Through smart **automation**, deep **observability**, and a calm head in a crisis, you’ll help us balance **speed**, **compliance**, and **stability**, working alongside **DevOps**, **Cloud**, **Quality Engineering**, and **Product** teams to drive continuous improvements in **performance**, **security**, and **resilience**.
* Define and implement SLIs / SLOs and error budgets for business\-critical digital banking
services.
* Build actionable observability (metrics, logs, traces, dashboards, and alerts) using Dynatrace,
Prometheus, Grafana, and ELK, while reducing alert fatigue.
* Leverage AI\-driven insights and anomaly detection (Dynatrace Davis AI or equivalent AIOps
platform) to proactively predict and resolve reliability issues before impact.
* Lead incident management — from on\-call triage and root\-cause analysis to blameless
postmortems with actionable follow\-ups.
* Improve deployment safety with robust rollout / rollback strategies, canary and blue\-green
deployments, and production readiness reviews.
* Support and optimize microservices\-based architectures, ensuring service reliability,
scalability, and inter\-service resilience.
* Conduct capacity planning, performance tuning, and resilience testing, optimizing for both
reliability and cost efficiency.
* Automate operational toil — from runbooks and remediation scripts to proactive health checks
and self\-healing workflows.
* Collaborate with DevOps to embed reliability gates and validations into CI / CD pipelines
(GitHub Actions, Jenkins, GitLab CI / CD or Azure DevOps).
* Own and evolve the observability and AIOps stack, driving intelligent automation and predictive
alerting capabilities.
* Maintain high\-quality documentation, playbooks, and operational standards across
environments.
* Ensure operational compliance and security alignment with internal controls and regulatory
standards.
* Analyze system performance, availability, and cost data to continually optimize operations.
* Provide reliability support and escalation guidance for critical production systems during major
incidents.
* 5\+ years of experience in SRE or DevOps roles, building and managing large\-scale, high\-availability systems across **banking**, **fintech**, **e\-commerce**, or other data\-intensive digital ecosystems.
* Bachelor’s degree in Computer Science or equivalent technical experience.
* Strong experience with Linux environments and performance troubleshooting.
* Proven expertise in Terraform and Infrastructure as Code (IaC) methodologies.
* Proficiency with Kubernetes and container orchestration in microservices environments.
* Hands\-on experience with AWS (preferred); exposure to Azure or GCP is an advantage.
* Deep knowledge of Dynatrace (AIOps, Davis AI), Prometheus, Grafana, and the ELK stack.
* Experience implementing AI / ML\-driven reliability or automation solutions (AIOps, anomaly
detection, predictive alerting).
* Practical understanding of CI / CD pipelines (GitHub Actions, Jenkins, GitLab CI / CD or Azure DevOps).
* Experience with Kafka, RabbitMQ, Redis, Aurora, and RDS databases.
* Strong scripting or programming skills in Python, Bash, or Go.
Skills: automation, devops, sre