OR
Job Description
- Design and develop large-scale datasets to power generative AI models in multimodal domains (e.g., text, vision, speech), with a focus on synthetic data creation.
- Build robust pipelines and tooling for data acquisition, cleaning, transformation, and quality assurance to support model training and evaluation.
- Research, implement, and adapt cutting-edge techniques (e.g., fine-tuning, RLHF, data augmentation) to align generative models with domain-specific needs.
- Curate and annotate datasets, ensuring diversity, representativeness, and compliance with responsible AI practices.
- Evaluate open-source and research models, integrating best practices into data generation workflows.
- Collaborate with engineering teams to ensure datasets and synthetic data pipelines are scalable, reliable, and production-ready.
- Develop metrics and benchmarking frameworks to assess data quality, model alignment, and downstream impact across modalities.
- Partner cross-functionally with product, research, and infrastructure teams to drive innovation in data preparation and generative AI applications.
Qualifications and Skills:
- Bachelor's or Master's degree in Computer Science, Data Science, AI/ML, or a related field, with 3+ years of industry experience.
- Proficiency in Python and a solid foundation in applied ML methods.
- Proficiency with PyTorch, Torchvision, OpenCV, and similar libraries, as well as experience building and deploying DNN models in production.
- Experience building large-scale data pipelines for acquisition, cleaning, augmentation, and validation.
- Ability to evaluate datasets for distribution, diversity, anomalies, and fairness to assess overall quality and suitability for generative AI.
- Experience with Computer Vision, NLP, Transformers, Large Language Models, Generative AI, and optimizations for LLM training and serving. Experience with multimodal models is a bonus.
- Proven track record of delivering scalable, data-centric ML solutions.
