Visiting Researcher Talk: Dr. Xiaoxiao Li - 26 Jan 2026
Talk Title:
Who Deserves Credit? Training Data Valuation For Modern Generative AI
Speaker:
Dr. Xiaoxiao Li
About the speaker:
Dr. Xiaoxiao Li is an Associate Professor in the Department of Electrical and Computer Engineering at the University of British Columbia, a Faculty Member at the Vector Institute, and a Visiting Faculty Member at Google. Dr. Li holds a Canada Research Chair (Tier II) in Responsible AI and is recognized as a Canada CIFAR AI Chair. Dr. Li's research aims to enhance the trustworthiness and efficiency of AI models, bridging the gap between cutting-edge AI research and practical real-world applications such as healthcare. Dr. Li's current interests include the mechanistic analysis of large language and vision-language models (LLMs/VLMs), the development of hypothesis-driven evaluations, and the advancement of methodologies toward artificial general intelligence (AGI). Dr. Li has published over 50 papers in top ML/AI venues, including ICML, ICLR, NeurIPS, CVPR, ECCV, AAAI, and Nature Methods.
Description:
Quantifying the value of training data is a critical challenge for Generative AI and Large Language Models (LLMs). Traditional valuation methods are ill-suited to this new paradigm: they are computationally infeasible at this scale and were designed primarily for small-scale, discriminative models. This talk presents a unified toolkit that redefines data valuation for the modern AI stack. First, for general generative models, we introduce a model-agnostic, training-free framework that values data based on similarity matching. Next, for LLMs and vision-language models (VLMs), we show how leveraging token-level representations enables a highly efficient, forward-only valuation method that avoids costly retraining. Finally, we extend this token-level analysis to reinforcement learning, demonstrating how our valuation techniques can steer training dynamics to improve model performance and efficiency. Our methods provide a practical foundation for a more robust data economy, enabling intelligent data curation, equitable compensation, and the development of more transparent and efficient AI systems.