Job Description
- Site Reliability Engineer (SRE) – Platform Engineering with experience in Gen AI
Role Summary
- We are seeking an SRE Platform Engineer to build, scale, and operate reliable, secure, and automated platform services.
- You will focus on improving system reliability, reducing operational toil, and enabling developer productivity through strong engineering practices.
- You should also be willing to work on COE initiatives and on the internal intellectual property and products that the team builds to fulfil different client requirements.
Key Responsibilities
- Build and operate highly available, scalable platform services
- Implement SRE practices: SLIs, SLOs, error budgets, and automation
- Manage and scale Kubernetes-based workloads in cloud environments
- Develop and maintain CI/CD pipelines and Infrastructure as Code
- Own production reliability; participate in on-call, incident response, and RCA (if required)
- Enhance observability using metrics, logs, and traces
Required Skills
- 5-16 years of experience in SRE / Platform / DevOps / Cloud Engineering / AI
- Hands-on experience with Kubernetes, Docker, Linux
- Strong scripting/programming skills (Python, Go, Bash)
- Experience with AWS / Azure / GCP
- Familiarity with Terraform / IaC, monitoring, and alerting tools
Nice to Have
- Experience with internal developer platforms, service mesh, or FinOps
- Cloud or Kubernetes certifications
