
Staff Software Development Engineer

Gruve
Pune | 4-8 LPA | Posted 30 Jul 2025
FULL TIME
Kubernetes
ELK
Grafana
Prometheus
Cloud Infrastructure

Job Description

Position Summary

We are seeking a highly skilled Site Reliability Engineer (SRE) / DevOps Engineer to join our infrastructure team. You will design, build, and maintain resilient, scalable, and secure infrastructure in cloud-native environments. The role involves close collaboration with development, QA, and security teams to automate operations, streamline deployments, and drive best practices in observability, security, and performance.

Key Responsibilities

  • Design, implement, and manage cloud infrastructure (GCP/AWS/Azure) using Infrastructure as Code (Terraform)
  • Build, maintain, and optimize CI/CD pipelines with tools such as GitLab CI, CircleCI, and ArgoCD
  • Ensure high availability and performance of applications running on Kubernetes (GKE/EKS/AKS) and other container orchestration tools
  • Implement observability solutions using Prometheus, Grafana, ELK, and other monitoring/logging tools
  • Work with development teams to enhance application performance and deployment workflows
  • Automate and manage IAM, RBAC, network policies, and vulnerability scanning
  • Participate in incident management, root cause analysis, and postmortem processes
  • Continuously improve infrastructure reliability and reduce manual operational efforts (toil)

Basic Qualifications

  • Strong knowledge of Linux system administration
  • Proficiency in scripting languages such as Python, Bash, or Go
  • Solid hands-on experience with cloud platforms (GCP preferred; AWS or Azure acceptable)
  • Proficient in Kubernetes operations, including Helm charts, service meshes, and operators
  • Experience with Terraform and Infrastructure as Code best practices
  • Experience building and maintaining CI/CD pipelines (e.g., GitLab CI, CircleCI, ArgoCD)
  • Familiarity with observability tools (Prometheus, Grafana, ELK, etc.)
  • Good understanding of networking concepts: TCP/IP, DNS, Load Balancing, Firewalls

Preferred Qualifications

  • Experience with advanced networking and service meshes (e.g., Istio)
  • Familiarity with SRE principles: SLOs, SLIs, error budgets
  • Exposure to multi-cluster or hybrid-cloud infrastructure setups
  • Experience with incident response and post-incident review processes

Key Skills

Site Reliability Engineering, DevOps, GCP, AWS, Azure, Terraform, CI/CD, GitLab CI, CircleCI, ArgoCD, Kubernetes, GKE, EKS, AKS, Helm, Prometheus, Grafana, ELK, Python, Bash, Go, IAM, RBAC, Network Policies, Service Mesh, Istio, TCP/IP, DNS, Load Balancers, Firewalls, Monitoring, Logging, Error Budgets, SLOs, SLIs, Incident Management
