Episodes

  • Alexandre Marques from Red Hat on Tackling the Hardest Problems in Open Source Inference
    Feb 20 2026

    Alexandre Marques, Engineering Manager and Team Lead of Machine Learning Research at Red Hat and former Manager of Machine Learning Research at NeuralMagic, speaks with the University of Pittsburgh’s Health and Explainable AI podcast producer, Brent Phillips, about his team’s work at Red Hat building and maintaining the platforms that power open-source AI inference at scale.

    In this pilot episode of The Inference Layer, Alexandre discusses his transition from aerospace engineering to leading a research team focused on making large AI models faster, cheaper, and more deployable. He explains that while large labs have proven model capabilities, the current challenge lies in moving these models into production. To bridge the gap between research demos and real-world scaling, he emphasizes the need for a deep understanding of how architectural decisions influence performance and the ability to translate research into high-quality code.

    The conversation delves into the technical definition of the inference layer, which Alexandre describes as the entire stack that sits between a trained model and the end-user experience, including the runtime, hardware, memory management, and batching strategies. He highlights the important role of open source and open research at Red Hat, and speaks about his team’s search for a Senior Machine Learning Research Engineer to work on post-training optimization for large language models and conduct applied research on state-of-the-art inference optimization techniques, including quantization, pruning, knowledge distillation, and speculative decoding.
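    One of the stack components mentioned above, request batching, can be sketched in a few lines. This is a generic illustration of the idea, not Red Hat's or vLLM's implementation; the queue shape, timeout, and batch-size values are invented for the sketch, and production servers layer continuous batching, KV-cache management, and scheduling policies on top of this.

    ```python
    import queue
    import time

    def run_batched_inference(model_fn, request_queue, max_batch_size=8, max_wait_s=0.01):
        """Collect requests for up to max_wait_s, then run one batched forward pass.

        Batching amortizes the cost of a forward pass over many requests,
        trading a small amount of latency for much higher throughput.
        """
        batch = []
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch_size:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=timeout))
            except queue.Empty:
                break
        if not batch:
            return []
        inputs = [req["input"] for req in batch]
        outputs = model_fn(inputs)  # one forward pass over the whole batch
        return list(zip(batch, outputs))
    ```

    The `max_wait_s` knob is the latency/throughput trade-off Alexandre alludes to: waiting longer fills larger batches but delays the first request in the batch.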

    In the interview, Alexandre highlights two ambitious areas he is eager to explore that define the future of the field. First, he is interested in systematically studying how different optimization techniques compound, specifically how speculative decoding interacts with compression methods like quantization in production environments. Second, he aims to tackle the evolution of inference from single, independent models toward the orchestration of multiple models across distributed environments. This shift introduces new layers of complexity in scheduling and systems design, representing the kind of "hard problem" Alexandre believes will define the next few years of AI deployment.
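    For readers unfamiliar with the first technique named above, here is a minimal greedy sketch of one speculative-decoding step. The `draft_next` and `target_next` callables are placeholders invented for illustration; real systems verify all draft tokens in a single batched forward pass and use probabilistic acceptance rather than exact greedy matching, and the interaction with a quantized draft or target model is exactly the compounding question Alexandre raises.

    ```python
    def speculative_decode_step(draft_next, target_next, prefix, k=4):
        """One greedy speculative-decoding step (simplified sketch).

        A cheap draft model proposes k tokens; the expensive target model
        checks them in order, keeping the longest agreeing prefix and
        contributing one token of its own at the first disagreement
        (or a bonus token when every draft is accepted).
        """
        # Draft model proposes k tokens autoregressively (cheap).
        proposed = []
        seq = list(prefix)
        for _ in range(k):
            t = draft_next(seq)
            proposed.append(t)
            seq.append(t)
        # Target model verifies: accept drafts until the first disagreement.
        accepted = []
        seq = list(prefix)
        for t in proposed:
            expected = target_next(seq)
            if expected != t:
                accepted.append(expected)  # correction token from the target
                break
            accepted.append(t)
            seq.append(t)
        else:
            accepted.append(target_next(seq))  # bonus token: all drafts matched
        return accepted
    ```

    The speedup depends on how often the draft agrees with the target, which is why compressing either model (e.g. via quantization) can shift the acceptance rate in ways that are hard to predict without the kind of systematic study Alexandre describes.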

    The Inference Layer podcast is a collaborative initiative linking university AI labs, researchers, volunteers and supporting partners to explore the complexities of moving models from training to real-world deployment. By highlighting advanced research and frontier challenges, the podcast provides a platform for experts to discuss the cutting-edge developments driving the future of AI.

    14 mins
  • Manuela Nayantara Jeyaraj Discusses Explainability at the Inference Layer
    Jan 29 2026

    Manuela Nayantara Jeyaraj, a PhD student and researcher at the Applied Intelligence Research Centre (AIRC) at Technological University Dublin, speaks with the University of Pittsburgh’s Health and Explainable AI podcast producer, Brent Phillips, about explainability at the inference layer.

    In this pilot episode of The Inference Layer, Manuela discusses her award-winning work on identifying cognitive bias in language models. She explains that while explicit bias is well-studied, her research focuses on implicit, subtle "cognitive biases" that models learn from human patterns, such as gender stereotypes in job recruitment or political descriptions. To address this, Manuela developed an algorithm that combines model-agnostic and model-specific explainability approaches to provide high-confidence justifications for AI decisions. She also highlights the creation of a massive, modern lexicon that captures gendered associations across a wide range of English, from archaic terms to contemporary slang found on TikTok and Instagram.
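    To ground the model-agnostic half of the combination described above, here is a generic perturbation-based attribution sketch. This is not Manuela's published algorithm, only the standard leave-one-out idea on which model-agnostic explainers build: a token's importance is the drop in the model's score when that token is removed.

    ```python
    def leave_one_out_attribution(score_fn, tokens):
        """Model-agnostic attribution by input perturbation.

        score_fn maps a token list to a scalar score; the attribution for
        token i is how much the score falls when token i is deleted.
        Works for any black-box model, since it only queries score_fn.
        """
        base = score_fn(tokens)
        return [base - score_fn(tokens[:i] + tokens[i + 1:])
                for i in range(len(tokens))]
    ```

    Model-specific methods (e.g. gradient-based attributions) can be cheaper and finer-grained but require access to model internals; combining both families, as the episode describes, trades extra computation for higher-confidence justifications.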

    The conversation delves into the technical challenges of maintaining explainability at the inference layer, particularly when transitioning from high-compute cloud environments to resource-constrained edge devices like phones or wearables. Manuela emphasizes that for real-time applications such as clinical decision-making, explainability cannot be an "afterthought" and must be lightweight enough to run locally to ensure user privacy and trust.

    In the interview, Manuela highlights two ambitious areas she is eager to explore that connect the technical and human sides of AI. First, she is interested in developing high-confidence, real-time explainability for streaming data, where decisions must be justified in milliseconds without slowing down the model. This includes providing "counterfactual" explanations—identifying exactly what would need to change for a different outcome to occur, such as a patient's risk level shifting from high to low. Second, she wants to tackle the "storytelling" aspect of explainable AI (XAI), creating systems that can tailor the complexity and detail of an explanation to different stakeholders. For instance, in a recruitment scenario, she envisions a model that provides a deep technical justification for a recruiter while offering a more abstracted, helpful level of feedback for the job applicant.
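    The counterfactual idea in the passage above can be illustrated with a toy search. Everything here is invented for the sketch (the feature names, the candidate grid, the single-feature restriction); real counterfactual methods optimize over many features jointly under distance and plausibility constraints.

    ```python
    def nearest_counterfactual(predict, features, candidates):
        """Find the smallest single-feature change that flips the prediction.

        predict: dict -> label; features: the current input;
        candidates: {feature_name: [alternative numeric values]}.
        Returns the cheapest flipping change, or None if no candidate flips.
        """
        original = predict(features)
        best = None  # (absolute change, feature name, new value)
        for name, values in candidates.items():
            for value in values:
                trial = dict(features, **{name: value})
                if predict(trial) != original:
                    cost = abs(value - features[name])
                    if best is None or cost < best[0]:
                        best = (cost, name, value)
        if best is None:
            return None
        return {"feature": best[1], "value": best[2]}
    ```

    In the patient-risk example from the episode, the returned change ("what would need to move, and by how much, for the risk label to flip") is exactly the justification a clinician could act on in real time.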

    The Inference Layer podcast is a collaborative initiative linking university AI labs, researchers, and supporting partners to explore the complexities of moving models from training to real-world deployment. Managed by volunteers, the series focuses on the intricate systems, chips, and stacks that define the inference layer. By highlighting advanced research and frontier challenges, the podcast provides a platform for experts to discuss the cutting-edge developments driving the future of AI.

    23 mins