Position Summary

We are seeking a highly skilled Site Reliability Engineer (SRE) / DevOps Engineer to join our infrastructure team. You will be responsible for designing, building, and maintaining resilient, scalable, and secure infrastructure in cloud-native environments. This role will involve close collaboration with development, QA, and security teams to automate operations, streamline deployments, and drive best practices in observability, security, and performance.

Key Responsibilities

Design, implement, and manage cloud infrastructure (GCP/AWS/Azure) using Infrastructure as Code (Terraform)
Build, maintain, and optimize CI/CD pipelines with tools such as GitLab CI, CircleCI, ArgoCD
Ensure high availability and performance of applications running on Kubernetes (GKE/EKS/AKS) and related container tooling
Implement observability solutions using Prometheus, Grafana, ELK, and other monitoring/logging tools
Work with development teams to enhance application performance and deployment workflows
Automate and manage IAM, RBAC, network policies, and vulnerability scanning
Participate in incident management, root cause analysis, and postmortem processes
Continuously improve infrastructure reliability and reduce manual operational efforts (toil)

Basic Qualifications

Strong knowledge of Linux system administration
Proficiency in scripting languages such as Python, Bash, or Go
Solid hands-on experience with cloud platforms (GCP preferred; AWS or Azure acceptable)
Proficient in Kubernetes operations, including Helm charts, service meshes, and operators
Experience with Terraform and Infrastructure as Code best practices
Experience building and maintaining CI/CD pipelines (e.g., GitLab CI, CircleCI, ArgoCD)
Familiarity with observability tools (Prometheus, Grafana, ELK, etc.)
Good understanding of networking concepts: TCP/IP, DNS, load balancing, and firewalls

Preferred Qualifications

Experience with advanced networking and service meshes (e.g., Istio)
Familiarity with SRE principles: SLOs, SLIs, error budgets
Exposure to multi-cluster or hybrid-cloud infrastructure setups
Experience with incident response and post-incident review processes

Key Skills

Site Reliability Engineering, DevOps, GCP, AWS, Azure, Terraform, CI/CD, GitLab CI, CircleCI, ArgoCD, Kubernetes, GKE, EKS, AKS, Helm, Prometheus, Grafana, ELK, Python, Bash, Go, IAM, RBAC, Network Policies, Service Mesh, Istio, TCP/IP, DNS, Load Balancers, Firewalls, Monitoring, Logging, Error Budgets, SLOs, SLIs, Incident Management
