Job Overview
We are looking for a highly skilled Senior DevOps Engineer to maintain and optimize the infrastructure and deployment pipelines for our AI-powered applications. This role focuses on ensuring high availability, cost-efficiency, and operational excellence across all components of a SaaS or PaaS infrastructure. You will play a key role in defining and implementing infrastructure best practices, driving automation, and enabling engineering teams to ship code confidently at scale. You must have excellent command of at least one cloud provider and its services, including databases, load balancers, VPCs, compute instances, and data warehousing.
Key Responsibilities
• Infrastructure as Code (IaC)
Automate infrastructure provisioning and configuration using tools like Terraform, Ansible, or CloudFormation to support repeatable, version-controlled infrastructure deployment.
• Cloud Computing
Create and maintain cloud infrastructure. You must be an expert in infrastructure architecture, including serverless deployments, Kubernetes clusters, image registries, ETL pipelines, and data warehousing.
• Containerization & Deployment Automation
Build and manage containerized environments using Docker, supporting efficient, portable, and consistent application deployments across environments.
• CI/CD Pipeline Engineering
Maintain and improve CI/CD systems to support automated testing, container building, deployment orchestration, and release workflows for AI and web applications.
• Monitoring, Observability & APM
Implement and maintain full-stack observability tools, including application performance monitoring (APM), infrastructure telemetry, and distributed tracing for proactive alerting and debugging.
• Operational Excellence
Drive best practices for high-availability design, failover strategies, performance tuning, and cloud cost optimization.
• Security-First / Zero Trust
Create and enforce a secure environment using best practices such as secrets management software, secretless authentication methods, and configuration management across systems.
• Mentorship & Technical Leadership
Mentor engineers on DevOps tools and workflows, promote a culture of automation and ownership, and serve as the subject matter expert on infrastructure and deployment strategy.
• Team Collaboration
Work closely with AI/ML, backend, and product teams to ensure seamless infrastructure integration and deployment flows.
Requirements
Must-Have Skills:
• 5+ years of experience in DevOps or infrastructure engineering
• Strong experience with cloud platforms: AWS, GCP, and/or Alicloud
• Proficiency in Infrastructure as Code using Terraform, Ansible, or CloudFormation
• Deep expertise in Docker and container-based deployment strategies
• Advanced skills in Linux system administration and Bash scripting
• Strong experience with CI/CD platforms and automation workflows
• Hands-on knowledge of observability stacks (e.g., Prometheus, Grafana, ELK, Datadog, New Relic)
Nice-to-Have Skills:
• Familiarity with AI/ML infrastructure and GPU workload management
• Experience implementing autoscaling and spot instance optimization strategies
• Exposure to Kubernetes and container orchestration platforms
• Knowledge of distributed systems and site reliability engineering (SRE) practices