VI
Job Description
Key Responsibilities
- Assess and enhance the AI platform's cloud infrastructure and data pipeline resilience using AWS and cloud-based technologies
- Ensure scalability and fault tolerance of AI/ML models within cloud environments
- Identify and resolve bottlenecks in model inference and training pipelines, focusing on performance and resource optimization
- Optimize cloud resource utilization on AWS for real-time use cases, including AI model deployment
- Collaborate with the DevOps team on improving cloud deployment processes and managing AWS infrastructure
- Implement automated testing that simulates failures to verify fault tolerance and ensure high availability
- Provide ongoing technical support for users of the Generative AI platform, troubleshooting issues and responding to queries to ensure seamless operations
- Monitor cloud platform performance on AWS, identifying and implementing optimization strategies to improve cost efficiency and scalability
- Work with AWS cloud services (e.g., EC2, S3, Lambda, VPC) to ensure proper configuration, management, and performance
- Document key processes, issues, and solutions for knowledge sharing and future reference
- Stay updated with industry trends in Generative AI, cloud technologies, and AWS cloud administration
