Job Description:

Site Reliability Engineer (SRE) – Platform Engineering with experience in Gen AI

Role Summary
We are seeking an SRE Platform Engineer to build, scale, and operate reliable, secure, and automated platform services.
You will focus on improving system reliability, reducing operational toil, and enabling developer productivity through strong engineering practices.
You should also be willing to work on COE initiatives and on the internal intellectual property and products that the team builds to fulfil client requirements.

Key Responsibilities

Build and operate highly available, scalable platform services
Implement SRE practices: SLIs, SLOs, error budgets, and automation
Manage and scale Kubernetes-based workloads in cloud environments
Develop and maintain CI/CD pipelines and Infrastructure as Code
Own production reliability, participate in on-call, incident response, and RCA (if required)
Enhance observability using metrics, logs, and traces

Required Skills

5-16 years of experience in SRE / Platform / DevOps / Cloud Engineering / AI
Hands-on experience with Kubernetes, Docker, Linux
Strong scripting/programming skills (Python, Go, Bash)
Experience with AWS / Azure / GCP
Familiarity with Terraform / IaC, monitoring, and alerting tools

Nice to Have

Experience with internal developer platforms, service mesh, or FinOps
Cloud or Kubernetes certifications

DevOps SRE + Gen AI