Job Description
About The Company
AltaML is a leading North American applied AI company renowned for its expertise in developing and operationalizing AI software solutions across various industries. Committed to innovation, AltaML advocates for a culture of small bets, rapid experimentation, and collaborative growth. The company values agility, grit, humility, and happiness, fostering an environment where creative problem-solving and customer obsession drive success. With a diverse team of talented professionals, AltaML strives to push the boundaries of AI and machine learning, making a meaningful impact through cutting-edge technology and strategic partnerships.
About The Role
We are seeking a highly skilled DevOps Engineer to join our dynamic team. In this pivotal role, you will be responsible for designing, implementing, and maintaining our cloud infrastructure across multiple platforms, including Azure, AWS, and GCP. Your expertise will enable us to build scalable, secure, and reliable pipelines and automation that empower our development and ML engineering teams to deliver high-quality solutions efficiently. This position offers a hands-on, high-ownership opportunity within a collaborative environment, requiring strong opinions, initiative, and a continuous improvement mindset. You will work closely with cross-functional teams to optimize deployment strategies, enhance system reliability, and drive innovation in our cloud operations.
Qualifications
- Degree or equivalent work experience in Computer Science, Systems Engineering, or a related discipline
- 3-6 years of progressive experience in DevOps, cloud engineering, or site reliability engineering
- Proficiency with at least two cloud platforms among Azure, AWS, and GCP, with multi-cloud experience highly valued
- Hands-on experience building and maintaining CI/CD pipelines in production environments
- Strong knowledge of Infrastructure as Code tools such as Terraform; Bicep, Pulumi, or CDK are a plus
- Solid Kubernetes experience including cluster management, Helm charts, workload scaling, and networking
- Scripting skills in Python, Bash, or PowerShell for automation
- Experience implementing cloud security controls such as IAM, RBAC, network policies, and key management
- Understanding of the software delivery lifecycle and agile development practices
- Strong troubleshooting skills across networking, compute, storage, and application layers
- Relevant cloud certifications (e.g., AZ-104, AZ-400, AWS Solutions Architect, GCP Professional Cloud Architect) are desirable
- Experience supporting ML/AI workloads, GPU clusters, model deployment pipelines, or MLflow/Kubeflow is a plus
- Knowledge of GitOps practices, service mesh technologies, and cloud cost optimization principles is advantageous
- Experience in startup or scale-up environments and familiarity with compliance frameworks such as SOC 2 and PIPEDA is preferred
Responsibilities
- Design, develop, and maintain CI/CD pipelines supporting continuous delivery across Azure DevOps, GitHub Actions, and GitLab CI
- Architect and manage multi-cloud infrastructure using Infrastructure as Code tools like Terraform, Bicep, or CloudFormation
- Manage containerized workloads using Kubernetes platforms such as AKS, EKS, or GKE, including cluster management and workload scaling
- Implement and enforce cloud security best practices, including IAM policies, network segmentation, secrets management, and vulnerability scanning
- Develop and maintain observability stacks for logging, metrics, and alerting using tools such as Azure Monitor, CloudWatch, Datadog, or Grafana/Prometheus
- Collaborate with software and ML engineering teams to define deployment strategies, optimize pipelines, and mitigate deployment risks
- Evaluate, recommend, and implement tooling improvements to enhance system reliability, scalability, and developer productivity
- Participate in incident response, root cause analysis, and post-mortem processes to ensure continuous system improvement
- Create and maintain comprehensive documentation on infrastructure architecture, operational runbooks, and disaster recovery procedures
- Mentor junior team members and provide guidance on cloud and DevOps best practices
Benefits
- Uncapped vacation policy, allowing flexible time off based on individual needs
- Opportunity to make a tangible impact on company success and client solutions
- Collaboration with colleagues who hold advanced degrees in data science, machine learning, and software engineering
- Competitive benefits package including health, dental, and retirement plans
- Hybrid work environment with access to state-of-the-art office spaces designed to foster collaboration
- Supportive and innovative culture emphasizing continuous learning and professional growth
Equal Opportunity
AltaML is committed to fostering a safe, diverse, and inclusive workplace. We welcome applications from qualified individuals of all backgrounds, regardless of ethnicity, religion, disability status, gender identity, sexual orientation, family status, age, nationality, or educational background. We ensure equal opportunity in all our hiring practices and provide accommodations during the interview process upon request. We recognize the importance of respecting First Nations, Métis, Inuit, and all Indigenous peoples of Canada. Our head office is located on Treaty 6 territory, and we honor the rich cultural heritage and contributions of these communities.