Apple Intelligence is driven by intentional data design—spanning careful sampling, creation, and curation of high-quality datasets, enriched with precise annotations. Our data powers our ability to evaluate and mitigate safety risks in new generative AI features. This role sits at the intersection of applied data science, empirical analysis, cultural and linguistic expertise, and stakeholder communication. It requires strong scientific judgment, cross-functional collaboration, and the ability to translate evaluation findings into actionable insights.
- Develop metrics for evaluation of safety and fairness risks inherent to generative models and Gen-AI features
- Design datasets, identify data needs, and develop creative solutions for scaling and expanding data coverage through human and synthetic generation methods
- Collaborate with cross-functional partners—including engineering, product, and research teams—to ensure evaluations align with feature goals and deployment plans
- Partner with policy teams to translate regional safety and inclusivity requirements into measurable evaluation criteria
- Build expertise in machine translation and data synthesis techniques to generate localized and culturally aligned evaluation datasets at scale
- Develop ML-based enhancements to red teaming, model evaluation, and other processes to improve the quality of Apple Intelligence’s user-facing products
- Work with highly sensitive content, including exposure to offensive and controversial material
MS or PhD in Computer Science, Linguistics, Cognitive Science, HCI, Psychology, Mathematics, Physics, or a similar science or technology field with a strong basis in scientific data collection and analysis, plus at least 4 years of relevant work experience; or BA/BS with 8+ years of relevant work experience
Experience collecting and analyzing language data, image data, and/or multi-modal data
Strong experience designing human annotation projects, writing guidelines, and dealing with highly multi-labeled, nuanced, and often conflicting data
Proficiency in data science, machine learning, analytics, and programming with Python and pandas; strong experience with one or more plotting and visualization libraries
Excellent interpersonal skills, with a proven ability to synthesize complex findings and present evaluation outcomes to senior leadership and executives
Strong skills in developing rigorous model quality metrics, interpreting experiments and evaluations, and presenting results to executives
Deep cultural awareness and understanding of regional norms, values, and sensitivities, with the ability to translate this knowledge into actionable evaluation strategies
Experience in localization, internationalization, or building/evaluating machine learning systems for global markets, with a focus on linguistic and cultural adaptation
Curiosity about fairness and bias in generative AI systems, and a strong desire to help make the technology more equitable