Staff Software Development Engineer
Job Description
Position Summary
We are seeking a highly skilled Site Reliability Engineer (SRE) / DevOps Engineer to join our infrastructure team. You will design, build, and maintain resilient, scalable, and secure infrastructure in cloud-native environments. The role involves close collaboration with development, QA, and security teams to automate operations, streamline deployments, and drive best practices in observability, security, and performance.
Key Responsibilities
- Design, implement, and manage cloud infrastructure (GCP/AWS/Azure) using Infrastructure as Code (Terraform)
- Build, maintain, and optimize CI/CD pipelines with tools such as GitLab CI, CircleCI, and ArgoCD
- Ensure high availability and performance of applications running on Kubernetes (GKE/EKS/AKS) and other container orchestration platforms
- Implement observability solutions using Prometheus, Grafana, ELK, and other monitoring/logging tools
- Work with development teams to enhance application performance and deployment workflows
- Automate and manage IAM, RBAC, network policies, and vulnerability scanning
- Participate in incident management, root cause analysis, and postmortem processes
- Continuously improve infrastructure reliability and reduce manual operational effort (toil)
Basic Qualifications
- Strong knowledge of Linux system administration
- Proficiency in scripting or programming languages such as Python, Bash, or Go
- Solid hands-on experience with cloud platforms (GCP preferred; AWS or Azure acceptable)
- Proficiency in Kubernetes operations, including Helm charts, service meshes, and operators
- Experience with Terraform and Infrastructure as Code best practices
- Experience building and maintaining CI/CD pipelines (e.g., GitLab CI, CircleCI, ArgoCD)
- Familiarity with observability tools (Prometheus, Grafana, ELK, etc.)
- Good understanding of networking concepts: TCP/IP, DNS, load balancing, firewalls
Preferred Qualifications
- Experience with advanced networking and service meshes (e.g., Istio)
- Familiarity with SRE principles: SLOs, SLIs, error budgets
- Exposure to multi-cluster or hybrid-cloud infrastructure setups
- Experience with incident response and post-incident review processes
Key Skills
Site Reliability Engineering, DevOps, GCP, AWS, Azure, Terraform, CI/CD, GitLab CI, CircleCI, ArgoCD, Kubernetes, GKE, EKS, AKS, Helm, Prometheus, Grafana, ELK, Python, Bash, Go, IAM, RBAC, Network Policies, Service Mesh, Istio, TCP/IP, DNS, Load Balancers, Firewalls, Monitoring, Logging, Error Budgets, SLOs, SLIs, Incident Management
