ZEZETA
Site Reliability Engineer I
Bangalore ₹1-3 LPA Posted 11 Apr 2025
FULL TIME
Version Control
Cloud Computing
Automation
Job Description
Responsibilities
- System Reliability: Ensuring the reliability of software systems by designing, implementing, and maintaining scalable and reliable infrastructure.
- Automation: Developing automation tools and scripts to streamline operational tasks, reduce manual intervention, and improve overall system efficiency.
- Incident Response and Resolution: Monitoring system performance and responding to incidents promptly to minimize downtime and ensure high availability.
- Capacity Planning: Analyzing system usage patterns and forecasting future capacity needs to ensure that the infrastructure can handle current and future demands.
- Performance Optimization: Identifying and addressing performance bottlenecks in software systems through optimization and tuning.
- Infrastructure as Code (IaC): Implementing infrastructure as code practices, using tools like Terraform or Ansible, to define and manage infrastructure in a version-controlled and automated manner.
- Monitoring and Logging: Implementing and maintaining monitoring and logging solutions to gain insights into system behavior, troubleshoot issues, and proactively address potential problems.
- On-Call Support: Participating in an on-call rotation to respond to incidents outside of regular working hours and ensure 24/7 system availability.
- Security: Collaborating with security teams to implement and maintain security best practices in infrastructure and applications.
- Disaster Recovery Planning: Developing and maintaining disaster recovery plans to ensure that systems can quickly recover from major outages or failures.
- Continuous Improvement: Continuously analyzing system performance, reliability, and incidents to identify areas for improvement and implementing changes to enhance overall system resilience.
Skills
- Programming Languages: Proficiency in one or more programming or scripting languages, commonly Python, Go, or shell scripting (e.g., Bash).
- Automation and Scripting: Strong automation skills using tools like Ansible, Puppet, Chef, or custom scripts. Knowledge of Infrastructure as Code (IaC) tools like Terraform.
- Containerization and Orchestration: Experience with containerization technologies like Docker and container orchestration platforms like Kubernetes.
- Cloud Computing: Proficiency in at least one cloud platform, such as AWS, Azure, or Google Cloud Platform, and knowledge of managing infrastructure in the cloud.
- Monitoring and Logging: Familiarity with monitoring tools (e.g., Prometheus, Grafana, ELK stack) and logging frameworks to track system performance and troubleshoot issues.
- Networking: Understanding of networking concepts and protocols, along with network troubleshooting skills.
- Security: Knowledge of security best practices, including encryption, access controls, and vulnerability management.
- Continuous Integration/Continuous Deployment (CI/CD): Understanding of, and hands-on experience implementing, CI/CD pipelines for automated testing and deployment.
- Incident Management: Experience in incident response, troubleshooting, and resolution.
- Version Control: Proficient use of version control systems like Git.
Experience and Qualifications
- 1-2 years of experience in site reliability engineering.
- B.Tech/M.Tech in computer science, information technology or a related field.
- Experience working at a product organization is a plus.
Role: Site Reliability Engineer
Industry Type: IT Services & Consulting
Department: Engineering - Software & QA
Employment Type: Full Time, Permanent
Role Category: DevOps
Education
UG: Any Graduate
PG: Any Postgraduate
