CDS Guest Editorial: Active Learning for Optimal Experiment Design in High Energy Physics
Joint work with Irina Espejo Morales, Kyle Cranmer, Lukas Heinrich, Gilles Louppe, and other colleagues at CERN.
This entry is part of the NYU Center for Data Science blog’s recurring guest editorial series. Irina Espejo Morales is a CDS Ph.D. student in data science and a DeepMind fellow. Kyle Cranmer is a CDS professor of data science and a professor of physics at the NYU College of Arts & Science. Lukas Heinrich is a staff scientist at CERN working on the ATLAS experiment at the LHC and a former NYU graduate student. Gilles Louppe is an associate professor in artificial intelligence and deep learning at the University of Liège (Belgium) and a former Moore-Sloan fellow.
Motivation
After the great success of machine learning in fields like computer vision and natural language processing, scientists in many disciplines have started to think about applications to their own fields. But there is an elephant in the room: simulators.
Simulators are generative models that appear everywhere in science. They can be high-fidelity and therefore computationally expensive to run. They are used to model phenomena on a huge range of length scales: from cosmology and astrophysics to particle physics, from fluid mechanics to atmospheric studies, from many-body problems to molecular dynamics and epidemiology.
In science, we have different goals than in general machine learning, and we are often interested in guarantees for a discovery claim. One way to quantify this is with p-values and hypothesis testing. This is why we are interested in level sets of black-box functions: given a p-value threshold, which hypotheses are compatible with the data?
One approach to dealing with simulators is to “open the box,” but this is a remarkably hard task. The approach we take instead is to treat the simulator as a black box and to make efficient use of how we query it.
Approach
Suppose we have a black-box function $f$ from $\mathbb{R}^d$ to $\mathbb{R}$ and we want to find the domain where $f(x) = t$ for a specific scalar $t$. The naive approach would be to set up a grid in $\mathbb{R}^d$ as fine as possible and then evaluate $f$ at each point of the grid. This approach does not scale as $d$ increases: it requires $O(n^d)$ queries, and each query can take days or even weeks.
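To make the exponential cost concrete, here is a minimal sketch of the naive grid scan. The function name `naive_grid_scan` is illustrative, not from any library; the toy objective stands in for a simulator whose single query could take days.

```python
import itertools

def naive_grid_scan(f, n_per_axis, d, lo=0.0, hi=1.0):
    """Evaluate f at every node of a regular n^d grid on [lo, hi]^d.

    This is exactly the O(n^d) query cost that makes the naive
    approach infeasible for expensive simulators.
    """
    step = (hi - lo) / (n_per_axis - 1)
    axis = [lo + i * step for i in range(n_per_axis)]
    return {pt: f(pt) for pt in itertools.product(*([axis] * d))}

# A cheap toy "black box": sum of coordinates.
results = naive_grid_scan(sum, n_per_axis=5, d=2)
print(len(results))  # 5^2 = 25 queries already, for a coarse 2-D grid

# The query count explodes with dimension d:
for d in (1, 2, 3, 6):
    print(d, 20 ** d)  # at 20 points per axis: 20, 400, 8000, 64000000
```

Even a modest 20-points-per-axis grid in six dimensions would need 64 million simulator runs, which is why an adaptive querying strategy is essential.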
At NYU, we are developing an agent called “Excursion” that uses active learning and Gaussian processes to find level sets of computationally expensive black-box functions. Excursion is built on GPyTorch.
We use a Gaussian process as a surrogate for the black-box function, and then compute an acquisition function, which uses the Gaussian process posterior to score how worthwhile it would be to feed each candidate grid point to the black-box function. See it in action below!
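The loop described above can be sketched in a few lines. This is not Excursion’s actual API (Excursion uses GPyTorch); it is a self-contained toy in plain NumPy, with a hand-rolled 1-D Gaussian process and a crude straddle-style acquisition rule, all names illustrative.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_grid, noise=1e-4):
    """Posterior mean and std of a zero-mean GP on the candidate grid."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_grid)
    K_ss = rbf_kernel(x_grid, x_grid)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

def acquisition(mean, std, level):
    """Score candidates: high where the GP is both uncertain and
    plausibly close to the target level set (straddle-like rule)."""
    return std - np.abs(mean - level)

def estimate_level_set(f, level, x_grid, n_queries=8):
    """Actively query f to locate where f(x) crosses `level`."""
    x_train = np.array([x_grid[0], x_grid[-1]])  # seed with the endpoints
    y_train = np.array([f(x) for x in x_train])
    for _ in range(n_queries):
        mean, std = gp_posterior(x_train, y_train, x_grid)
        x_next = x_grid[np.argmax(acquisition(mean, std, level))]
        x_train = np.append(x_train, x_next)
        y_train = np.append(y_train, f(x_next))
    mean, _ = gp_posterior(x_train, y_train, x_grid)
    return x_train, mean

# Toy black box: expensive in real life, cheap here.
f = lambda x: np.sin(3.0 * x)
x_grid = np.linspace(0.0, 2.0, 200)
x_queried, mean = estimate_level_set(f, level=0.5, x_grid=x_grid)
print(f"queried {len(x_queried)} points instead of {len(x_grid)}")
```

The key design choice is that the acquisition function concentrates queries near the suspected level set instead of spreading them uniformly, which is where the savings over a grid come from.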
Note that Excursion needs considerably fewer evaluations to estimate the dotted level set than the number of points in a regular grid: in the animation above, the naive strategy would need $10^4$ evaluations. And that is only a two-dimensional space.
Level set estimation of a black-box function is an important final step in high energy physics analyses. With this approach, we hope to accelerate level set estimation and scale it to high dimensions. This method can speed up the process of discarding theories for new physics at the LHC. We are currently working with a team at ATLAS on a real-world application of Excursion. Stay tuned!
Acknowledgements
We acknowledge the great computational facilities provided by NYU Greene and funding from the NSF, IRIS-HEP, diana-hep, and the SCAILFIN project.
By Irina Espejo Morales, Kyle Cranmer, Lukas Heinrich, Gilles Louppe, and additional colleagues at CERN.