DevOps / Site Reliability Engineer (SRE)

integra.works

United Arab Emirates

Accepting Applications, Full-time, On-site
Job Description
**Job Summary**

We are seeking, on behalf of our customer, a skilled DevOps / Site Reliability Engineer (SRE) to build, maintain, and optimise scalable, reliable, and secure cloud-based platforms. This role focuses on ensuring high availability, performance, and operational excellence of cloud and data platforms, while enabling faster and more reliable software delivery through automation and modern DevOps practices. The ideal candidate will have strong expertise in cloud infrastructure, CI/CD, monitoring, and incident management, along with a proactive approach to improving system reliability and resilience.

**Key Responsibilities**

* Design, implement, and manage scalable and reliable cloud infrastructure across multiple environments to support high availability and performance requirements.
* Develop and maintain CI/CD pipelines to automate and streamline software deployment processes, ensuring efficiency and reliability.
* Monitor system performance, availability, and reliability using advanced observability tools, identifying and addressing potential issues before they escalate.
* Detect, diagnose, and resolve performance bottlenecks, incidents, and system failures to minimise downtime and maintain operational excellence.
* Implement and manage Infrastructure as Code (IaC) solutions to ensure consistent, repeatable, and version-controlled deployments across environments.
* Enforce platform security, compliance, and adherence to industry best practices, particularly in regulated environments.
* Collaborate closely with engineering, data, and platform teams to enhance system reliability, performance, and scalability through cross-functional initiatives.
* Drive automation initiatives across infrastructure provisioning, deployment workflows, and operational processes to improve efficiency and reduce manual intervention.
* Participate actively in incident management, conduct root cause analysis (RCA), and contribute to continuous improvement initiatives to strengthen system resilience.
* Maintain comprehensive documentation for infrastructure configurations, operational procedures, and system architectures to support knowledge sharing and troubleshooting.

**Required Qualifications**

* Bachelor’s degree in Computer Science, Engineering, or a related field.
* 5–10 years of experience in DevOps, Site Reliability Engineering (SRE), or cloud engineering roles.
* Hands-on experience working in cloud environments (AWS, Azure, or GCP).
* Experience supporting production systems with high availability requirements.

**Required Skills**

* Strong experience with CI/CD tools (e.g., Jenkins, GitHub Actions, GitLab CI).
* Proficiency in Infrastructure as Code (Terraform, CloudFormation, or similar).
* Experience with containerisation and orchestration (Docker, Kubernetes).
* Strong knowledge of monitoring and observability tools (Prometheus, Grafana, ELK, Datadog, etc.).
* Understanding of cloud security, networking, and identity management.
* Experience with scripting or programming (Python, Bash, or similar).
* Familiarity with incident management and SRE practices (SLAs, SLOs, error budgets).
* Strong troubleshooting, analytical, and problem-solving skills.

**Preferred Qualifications**

* Experience with AI platforms, data platforms, or large-scale data systems.
* Exposure to regulated environments such as finance or healthcare.
* Knowledge of cost optimisation and FinOps practices.
* Certifications in cloud platforms such as AWS, Azure, or GCP.
