Episodes

  • Episode 3 - Learning that Transforms Systems
    Dec 23 2025

    In this episode with return guest David Leigh, we explore one of the most important pillars of Season 3: how learning transforms systems—and more specifically, how resilience is built not through tools or processes alone, but through the interplay of diverse perspectives, psychological safety, and evolving mental models.

    What begins with a playful callback to David’s earlier storytelling episode quickly becomes a deep and wide‑ranging conversation about:

    • Why resilience is fundamentally about preparing for the unexpected
    • How mental models age, erode, and become outdated—and why acknowledging that is a strength
    • The power of diverse representation in incident response
    • Why psychological safety is the foundation for resilient socio‑technical systems
    • How practices like pre‑mortems and tabletop exercises build cultural readiness
    • The limitations of automation, metrics, and “preventing recurrence” mindsets
    • And why learning is not simply knowledge acquisition, but a force that shapes systems themselves

    David shared stories, examples, and insights that apply equally to engineering teams, organizations, and society.

    40 mins
  • Episode 2 - Building Resilient Skillset for the Work of Tomorrow
    Dec 3 2025

    In this unique episode, we sit down with Dr. Jane Goodyer, Dean of the Lassonde School of Engineering at York University and a global leader in reimagining engineering education. From her inspiring personal journey to pioneering Canada’s first fully work-integrated digital technologies degree, Jane shares how academia and industry can collaborate to prepare graduates for the ever-changing world of work.

    We explore:

    • What “work-integrated learning” really means and why it matters.
    • How resilience, curiosity, and empathy are critical skills for tomorrow’s engineers.
    • The role of psychological safety and active learning in shaping future talent.
    • Why continuous learning and breaking down silos between education and industry are essential.

    Whether you’re an SRE, an educator, or a lifelong learner, this conversation will challenge how you think about skills, adaptability, and preparing for the unexpected.

    Things to listen for:

    Jane’s Origin Story

    • Her working-class background and early challenges.

    • How resilience shaped her journey.

    Work-Integrated Learning Explained

    • Why it’s more than internships or co-ops.

    • Active vs. passive learning and why doing matters.

    Resilience for the Future of Work

    • Why adaptability and psychological safety are essential.

    Soft Skills as Hard Skills

    • Empathy, communication, curiosity, and self-awareness.

    Industry-Academia Collaboration

    • The Trailblazer process and what employers really want.

    Continuous Learning Beyond Graduation

    • Breaking silos between education and work.

    Advice for Engineers

    • Be curious, brave, and true to your values.

    47 mins
  • Episode 2 - Building Resilient Skillset for the Work of Tomorrow (video version)
    Dec 3 2025

    In this unique episode, we sit down with Dr. Jane Goodyer, Dean of the Lassonde School of Engineering at York University and a global leader in reimagining engineering education. From her inspiring personal journey to pioneering Canada’s first fully work-integrated digital technologies degree, Jane shares how academia and industry can collaborate to prepare graduates for the ever-changing world of work.

    We explore:

    • What “work-integrated learning” really means and why it matters.
    • How resilience, curiosity, and empathy are critical skills for tomorrow’s engineers.
    • The role of psychological safety and active learning in shaping future talent.
    • Why continuous learning and breaking down silos between education and industry are essential.

    Whether you’re an SRE, an educator, or a lifelong learner, this conversation will challenge how you think about skills, adaptability, and preparing for the unexpected.

    Things to listen for:

    Jane’s Origin Story

    • Her working-class background and early challenges.

    • How resilience shaped her journey.

    Work-Integrated Learning Explained

    • Why it’s more than internships or co-ops.

    • Active vs. passive learning and why doing matters.

    Resilience for the Future of Work

    • Why adaptability and psychological safety are essential.

    Soft Skills as Hard Skills

    • Empathy, communication, curiosity, and self-awareness.

    Industry-Academia Collaboration

    • The Trailblazer process and what employers really want.

    Continuous Learning Beyond Graduation

    • Breaking silos between education and work.

    Advice for Engineers

    • Be curious, brave, and true to your values.

    47 mins
  • Episode 1 - Resilience Enablement
    Nov 10 2025

    Season 3 of Making of the SRE Omelette is here - and it’s all about resilience. Resilience isn’t just about surviving outages. It’s about building systems and cultures that adapt, learn, and thrive under pressure.

    In our kickoff episode, we sit down with Dr. Jennifer Petoff, co-editor of Site Reliability Engineering: How Google Runs Production Systems and leader of Google’s Global SRE Education. Jennifer shares why resilience starts with people, not just technology—and how psychological safety and confidence are the secret ingredients for reliability at scale.

    You’ll learn:

    • How to scale learning like a production system
    • Why postmortem culture drives improvement
    • How to apply SRE principles beyond infrastructure

    If you’ve ever wondered how to make reliability a business advantage, this episode is for you.

    Check out How to SRE Anything here: https://www.reliablepgm.com/how-to-sre-anything/

    Topics:

    Origins of SRE and Education at Google

    • How Google scaled SRE education globally.

    • Why education is treated like a production system (repeatable, reliable, measurable).

    Psychological Safety and Learning

    • Why psychological safety is critical for resilience.

    • Creating environments where teams can share mistakes without fear of blame.

    • How this accelerates learning and reliability.

    Hands-On Experience as a Learning Model

    • Importance of experiential learning (e.g., game days, simulations).

    • Why theory alone isn’t enough for building confidence under pressure.

    Scaling Knowledge Across Large Organizations

    • Strategies Google uses to scale SRE principles globally.

    • Balancing standardization with flexibility for local teams.

    Resilience Beyond Reliability

    • How resilience differs from reliability.

    • Building adaptive systems and teams that thrive through adversity.

    Culture as a Foundation

    • Why culture is the “secret ingredient” for successful SRE adoption.

    • Encouraging curiosity and collaboration across roles.

    Future of SRE Education

    • Trends in learning for distributed teams.

    • How continuous education supports evolving reliability practices.

    42 mins
  • Episode 1 - Resilience Enablement (video version)
    Nov 10 2025

    Season 3 of Making of the SRE Omelette is here - and it’s all about resilience. Resilience isn’t just about surviving outages. It’s about building systems and cultures that adapt, learn, and thrive under pressure.

    In our kickoff episode, we sit down with Dr. Jennifer Petoff, co-editor of Site Reliability Engineering: How Google Runs Production Systems and leader of Google’s Global SRE Education. Jennifer shares why resilience starts with people, not just technology—and how psychological safety and confidence are the secret ingredients for reliability at scale.

    You’ll learn:

    • How to scale learning like a production system
    • Why postmortem culture drives improvement
    • How to apply SRE principles beyond infrastructure

    If you’ve ever wondered how to make reliability a business advantage, this episode is for you.

    Check out How to SRE Anything here: https://www.reliablepgm.com/how-to-sre-anything/

    Topics:

    Origins of SRE and Education at Google

    • How Google scaled SRE education globally.

    • Why education is treated like a production system (repeatable, reliable, measurable).

    Psychological Safety and Learning

    • Why psychological safety is critical for resilience.

    • Creating environments where teams can share mistakes without fear of blame.

    • How this accelerates learning and reliability.

    Hands-On Experience as a Learning Model

    • Importance of experiential learning (e.g., game days, simulations).

    • Why theory alone isn’t enough for building confidence under pressure.

    Scaling Knowledge Across Large Organizations

    • Strategies Google uses to scale SRE principles globally.

    • Balancing standardization with flexibility for local teams.

    Resilience Beyond Reliability

    • How resilience differs from reliability.

    • Building adaptive systems and teams that thrive through adversity.

    Culture as a Foundation

    • Why culture is the “secret ingredient” for successful SRE adoption.

    • Encouraging curiosity and collaboration across roles.

    Future of SRE Education

    • Trends in learning for distributed teams.

    • How continuous education supports evolving reliability practices.

    42 mins
  • Episode 8 - AI for Sustainable IT Part 2 of 2
    Feb 1 2024

    The conclusion of this two-part crossover podcast series explores the intersection of AI with sustainable IT operations, featuring Jerry Cuomo from The Art of AI and Kevin Yu from Making of the SRE Omelette. The discussion delves into practical measures for more efficient energy use in AI systems, emphasizing the need for data and the analysis of past behavior to inform energy-efficient decision-making. Jerry and Kevin highlight the importance of balancing AI and human inputs to achieve meaningful tasks and improve the overall quality of products. They discuss challenges such as right-sizing compute and recognize the pivotal role of data in addressing these issues, advocating for a data-driven approach to answer critical questions and provide the necessary context for decision-making.

    Additionally, the conversation touches on the future of AI and sustainable IT operations, emphasizing the need for diverse perspectives and the integration of SRE and sustainability as standard practices in software development. The podcast aims to provide a better understanding of how AI intersects with sustainable IT operations and how innovation can be approached responsibly.

    Please be sure to catch Part 1 on Jerry's Art of AI podcast.

    18 mins
  • Episode 7 - Intelligent Facilities & Assets (video version)
    Sep 12 2023

    Mike Hollinger, Master Inventor, CTO for Applied AI & Distinguished Engineer for Maximo Application Suite, talks about how we can leverage operational insights from assets, facilities and infrastructure to drive clean energy transition and decarbonization. Mike shares stories from customers that showcase successes as well as the challenges they faced. Mike has a call to action to inspire Site Reliability Engineers to embrace the data and capabilities we have at our fingertips today and turn data into action to achieve a sustainable future.

    Things to listen for:

    • [02:20 - 03:25] Mike's career path that led to his current role
    • [03:59 - 05:47] Meaning of sustainability to Mike
    • [07:56 - 10:06] Sustainability movement over the last few years
    • [10:18 - 11:41] Importance of driving action from data
    • [12:04 - 15:45] Challenges in Facilities and Assets
    • [16:28 - 18:07] Civil Infrastructure example that drives action from data
    • [20:40 - 22:19] What Mike considers as success
    • [22:50 - 23:39] Importance of driving action from data
    • [26:06 - 29:34] Suggestion for C-suite executives to take action
    • [30:15 - 34:24] Call to action for SREs
    • [38:32 - 41:03] Mike's ingredient & recipe for a Sustainable Future
    42 mins
  • Episode 7 - Intelligent Facilities & Assets
    Sep 12 2023

    Mike Hollinger, Master Inventor, CTO for Applied AI & Distinguished Engineer for Maximo Application Suite, talks about how we can leverage operational insights from assets, facilities and infrastructure to drive clean energy transition and decarbonization. Mike shares stories from customers that showcase successes as well as the challenges they faced. Mike has a call to action to inspire Site Reliability Engineers to embrace the data and capabilities we have at our fingertips today and turn data into action to achieve a sustainable future.

    Things to listen for:

    • [02:20 - 03:25] Mike's career path that led to his current role
    • [03:59 - 05:47] Meaning of sustainability to Mike
    • [07:56 - 10:06] Sustainability movement over the last few years
    • [10:18 - 11:41] Importance of driving action from data
    • [12:04 - 15:45] Challenges in Facilities and Assets
    • [16:28 - 18:07] Civil Infrastructure example that drives action from data
    • [20:40 - 22:19] What Mike considers as success
    • [22:50 - 23:39] Importance of driving action from data
    • [26:06 - 29:34] Suggestion for C-suite executives to take action
    • [30:15 - 34:24] Call to action for SREs
    • [38:32 - 41:03] Mike's ingredient & recipe for a Sustainable Future
    42 mins