Software Developer 4

Oracle
Bangalore
3-12 LPA
Posted 24 Oct 2025
FULL TIME
Docker
Microservices
RESTful APIs
SQL
Java

Job Description

  • Bachelor's or Master's degree in Computer Science & Engineering.
  • 6-10+ years of professional experience in full stack development, with a proven track record of deploying web applications in production environments.
  • Strong fundamentals in data structures & algorithms, demonstrated through complex system design and problem-solving in the computer networking domain.
  • Experience with at least one backend language (e.g., Java, Python, Go) and one frontend framework, as well as SQL/NoSQL databases and cloud platforms (e.g., AWS, Azure, GCP).
  • Hands-on experience with observability tools (e.g., Prometheus, Grafana, ELK/EFK, OpenTelemetry).
  • Experience with container orchestration (Kubernetes, Docker) and CI/CD tools (e.g., Jenkins, GitHub Actions).
  • Exposure to GPU cluster operations and high-performance networking (RoCE, InfiniBand).
  • Excellent communication and teamwork skills, thriving in a fast-paced, collaborative environment.
  • Experience in data center operations, particularly managing & monitoring GPU clusters for AI/ML or HPC workloads.
  • Familiarity with GPU cluster tooling (e.g., NCCL for collective communications, Slurm for job scheduling) and high-performance computing frameworks.
  • Knowledge of cloud and on-prem hybrid deployments (AWS, GCP, Azure, or private data centers).
  • Familiarity with security best practices for large-scale distributed systems.
  • Design and implement full stack applications for data center management, including RESTful APIs, microservices, and responsive UIs using OCI frameworks.
  • Develop and maintain observability solutions, including distributed tracing, logging pipelines, and metrics collection (e.g., Prometheus, Grafana), to monitor GPU clusters and data center infrastructure in real time.
  • Implement operational workflows and CI/CD pipelines to streamline deployment, scaling, and maintenance of data center resources.
  • Optimize GPU cluster networking configurations, integrating high-speed interconnects (e.g., InfiniBand, RoCE, Ethernet fabrics) to support AI/ML workloads, ensuring low-latency communication and fault-tolerant designs.
  • Leverage strong knowledge of data structures & algorithms to optimize large-scale data processing and network topologies.
  • Build secure, scalable dashboards and APIs for visualizing data center metrics, alerting on anomalies, and automating incident response in GPU-accelerated environments.
  • Perform performance tuning and troubleshooting of full stack systems to ensure reliability and efficiency in mission-critical infrastructure.
  • Leverage ML/LLM techniques to analyze high-volume telemetry data, detect anomalies, automate mitigation actions, and deliver intelligent reporting to stakeholders.
  • Contribute to code reviews and documentation, help establish best practices for observability, automation, and data center operations, and stay updated on emerging technologies.
  • Work in an agile environment, participating in sprint planning, retrospectives, and on-call rotations to maintain 24/7 operational uptime.
  • Demonstrate the ability to independently navigate unfamiliar challenges, align with stakeholders, shape the technical roadmap, drive operational excellence, and mentor peers and junior engineers.
