Alexandre Marques, Engineering Manager and Team Lead of Machine Learning Research at Red Hat and former Manager of Machine Learning Research at Neural Magic, speaks with Brent Phillips, producer of the University of Pittsburgh's Health and Explainable AI podcast, about his team's work at Red Hat building and maintaining the platforms that power open-source AI inference at scale.
In this pilot episode of The Inference Layer, Alexandre discusses his transition from aerospace engineering to leading a research team focused on making large AI models faster, cheaper, and easier to deploy. He explains that while large labs have proven what these models are capable of, the current challenge lies in moving them into production. Bridging the gap between research demos and real-world scale, he argues, requires a deep understanding of how architectural decisions influence performance, along with the ability to translate research into high-quality code.
The conversation delves into the technical definition of the inference layer, which Alexandre describes as the entire stack that sits between a trained model and the end-user experience: the runtime, the hardware, memory management, and batching strategies. He highlights the central role of open source and open research at Red Hat and discusses his team's search for a Senior Machine Learning Research Engineer to work on post-training optimization for large language models and conduct applied research on state-of-the-art inference optimization techniques, including quantization, pruning, knowledge distillation, and speculative decoding.
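For readers less familiar with those techniques, quantization is the easiest to ground in code. The minimal sketch below is illustrative only, not drawn from the episode or from Red Hat's tooling; it shows the core idea of symmetric post-training weight quantization: mapping float32 weights to int8 with a single scale factor, shrinking weight memory roughly fourfold at the cost of a small rounding error.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map float weights to int8
    using a single scale factor derived from the largest magnitude."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover a float approximation of the original weights."""
    return q.astype(np.float32) * scale

# Toy example: quantize a random weight matrix and measure the error.
w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```

Production systems layer many refinements on top of this idea (per-channel or per-group scales, activation quantization, calibration data), but the accuracy-for-memory trade-off is the same one the sketch makes visible.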
In the interview, Alexandre highlights two ambitious areas he is eager to explore that define the future of the field. First, he is interested in systematically studying how different optimization techniques compound, specifically how speculative decoding interacts with compression methods like quantization in production environments. Second, he aims to tackle the evolution of inference from single, independent models toward the orchestration of multiple models across distributed environments. This shift introduces new layers of complexity in scheduling and systems design, representing the kind of "hard problem" Alexandre believes will define the next few years of AI deployment.
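To make the first direction concrete: in speculative decoding, a small draft model cheaply proposes several tokens and the large target model verifies them, keeping the longest agreeing prefix. The toy sketch below illustrates the greedy variant only; `draft_model`, `target_model`, and `speculative_step` are hypothetical stand-ins, not code discussed in the episode.

```python
VOCAB = "abcde"

def draft_model(prefix: str) -> str:
    # Toy stand-in for a small, fast draft model.
    return VOCAB[sum(map(ord, prefix)) % len(VOCAB)]

def target_model(prefix: str) -> str:
    # Toy stand-in for the large target model; it agrees with the
    # draft often, but not always.
    return VOCAB[(sum(map(ord, prefix)) + len(prefix) % 2) % len(VOCAB)]

def speculative_step(prefix: str, k: int = 4) -> str:
    """One round of greedy speculative decoding."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    ctx, proposal = prefix, []
    for _ in range(k):
        tok = draft_model(ctx)
        proposal.append(tok)
        ctx += tok
    # Verify phase: the target checks each proposed token (in a real
    # engine this is a single batched forward pass, which is where the
    # speedup comes from). Keep the agreeing prefix; at the first
    # mismatch, keep the target's token and discard the rest.
    out = prefix
    for tok in proposal:
        verified = target_model(out)
        out += verified
        if verified != tok:
            break
    return out

print(speculative_step("ab"))  # several tokens per target-model round
```

The interaction with quantization that Alexandre wants to study arises because compressing either model can change how often the two agree, and that acceptance rate determines how much speedup each round actually delivers.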
The Inference Layer podcast is a collaborative initiative linking university AI labs, researchers, volunteers, and supporting partners to explore the complexities of moving models from training to real-world deployment. By highlighting advanced research and frontier challenges, the podcast provides a platform for experts to discuss the cutting-edge developments driving the future of AI.