Data‑Centric AI: Foundations for Autonomous AI Scientific Systems by Mr Gregory Lau

13 Mar 2026 02.30 PM - 03.30 PM LT28 Current Students

Abstract

What is the impact of each data point on model behaviour? This fundamental question is increasingly pertinent as data challenges emerge as key bottlenecks to the progress and practical deployment of AI systems. In this talk, I present a data‑centric approach to developing principled methods for overcoming practical AI deployment bottlenecks across the data‑model lifecycle: (a) data selection before/during training, and (b) data provenance after training.

First, I will discuss how algorithmically selecting the most impactful data points for model training can help overcome critical failure modes of Physics‑Informed Neural Networks (PINNs), leading to substantial performance gains on challenging PDE forward and inverse problems. Next, I will describe scalable and robust frameworks, based on data watermarking, for tracking whether data has been used to train LLMs or has been successfully removed from them in realistic settings. Finally, I will outline how this approach can be extended to tackle major data challenges in AI for science, with the long‑term goal of building the data‑centric foundations needed for autonomous AI scientific systems that can drive the next generation of scientific discovery.

About the Speaker

Gregory is a final‑year PhD student in the School of Computing at the National University of Singapore, advised by Bryan Kian Hsiang Low and supported by the AISG PhD Fellowship. He has been a visiting researcher at the University of Oxford and the University of Washington, and prior to his PhD he worked as a policymaker in the Singapore government and as an entrepreneur. He holds a B.Sc. in Physics and Economics from MIT and a Master of Finance from MIT Sloan.

Gregory’s research takes a data‑centric view of machine learning, studying how individual data points shape model behaviour. He develops methods across the data‑model lifecycle, from data selection for scientific machine learning (e.g., experimental design and sample‑efficient learning for physics‑informed models) to data provenance for foundation models (e.g., tracking and verifying data usage and removal). His work has been published in top AI venues such as NeurIPS, ICLR, and EMNLP, including an ICLR Spotlight paper that received the Best Paper Award at the ICML AI for Science workshop. He has also been recognized for his teaching through the NUS School of Computing Teaching Fellowship Scheme.