Episodes

  • Offloading LLM Models and KV Caches to NVMe SSDs
    Sep 8 2025

    This March 2025 paper examines the input/output (I/O) characteristics of offloading large language model (LLM) components to NVMe SSDs during inference, a critical technique for overcoming GPU memory limitations as models continue to grow. Researchers analyzed block-layer I/O traces from two prominent LLM frameworks, DeepSpeed and FlexGen, to understand how model weights and key-value (KV) caches are handled. The findings indicate that asynchronous I/O using libaio significantly outperforms synchronous POSIX I/O for tensor transfers, although neither method fully saturates the NVMe SSD's theoretical bandwidth. For model offloading, I/O is dominated by 128 KiB reads concentrated at the beginning of inference, while KV cache offloading involves both reads and writes of similar size, with read bandwidth substantially higher. Ultimately, the research suggests that modern NVMe SSDs can support current LLM inference workloads but highlights opportunities for further optimization in SSD design and KV cache management. A minimal sketch of the many-requests-in-flight, 128 KiB read pattern follows this entry.


    Source:

    https://dl.acm.org/doi/10.1145/3719330.3721230

    17 mins
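
    For illustration, here is a minimal, hedged sketch of the access pattern described above: many concurrent 128 KiB reads against an offloaded tensor file. It uses a thread pool of os.pread calls as a rough stand-in for libaio-style asynchronous submission (it is not the paper's tooling), and the file path and queue depth are illustrative assumptions.

        import os
        import time
        from concurrent.futures import ThreadPoolExecutor

        BLOCK = 128 * 1024            # 128 KiB, the dominant request size reported
        QUEUE_DEPTH = 32              # number of reads kept in flight (assumption)
        PATH = "model_weights.bin"    # hypothetical offloaded tensor file

        def read_block(fd: int, offset: int) -> int:
            # Positional read: no shared file offset, so requests can overlap freely.
            return len(os.pread(fd, BLOCK, offset))

        def read_file_concurrently(path: str) -> float:
            fd = os.open(path, os.O_RDONLY)
            try:
                size = os.fstat(fd).st_size
                offsets = range(0, size, BLOCK)
                start = time.perf_counter()
                with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
                    total = sum(pool.map(lambda off: read_block(fd, off), offsets))
                elapsed = time.perf_counter() - start
                return total / elapsed / 1e9   # observed throughput in GB/s
            finally:
                os.close(fd)

        if __name__ == "__main__":
            print(f"~{read_file_concurrently(PATH):.2f} GB/s with {QUEUE_DEPTH} reads in flight")
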
  • GPT-NeoX: Large-Scale Autoregressive Language Modeling in PyTorch
    Sep 7 2025

    This source describes EleutherAI's GPT-NeoX library, a robust open-source framework for training large-scale autoregressive language models on GPUs, building upon the Megatron and DeepSpeed libraries. It highlights the library's advanced features, such as distributed training, support for a wide range of hardware and systems, and cutting-edge architectural innovations. The text also provides practical guidance on setup, configuration, data preparation, training, inference, and evaluation, alongside details on pretrained models like GPT-NeoX-20B and Pythia. Furthermore, it explains how to export models to Hugging Face and monitor experiments, underscoring the library's widespread adoption in research and industry. A brief, illustrative snippet for loading the Hugging Face export of GPT-NeoX-20B follows this entry.


    Source:

    https://github.com/EleutherAI/gpt-neox


    12 mins
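
    As a small, hedged illustration of the Hugging Face export mentioned above (not the GPT-NeoX training or inference scripts themselves), the 20B checkpoint can be loaded with the transformers library; the dtype and device placement below are assumptions chosen for brevity, and the model needs tens of gigabytes of memory.

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
        model = AutoModelForCausalLM.from_pretrained(
            "EleutherAI/gpt-neox-20b",
            torch_dtype=torch.float16,
            device_map="auto",   # requires the accelerate package; illustrative choice
        )

        inputs = tokenizer("GPT-NeoX is", return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=32)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
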
  • SGLang: Efficient Language Model Program Execution
    Sep 7 2025

    This June 2024 paper introduces SGLang, a framework designed to enhance the efficiency of Large Language Model (LLM) and Vision Language Model (VLM) serving. It achieves this through a co-design of a flexible frontend language and a fast backend runtime. The frontend simplifies programming with primitives for generation and parallelism, while the backend applies novel optimizations such as RadixAttention for KV cache reuse and compressed finite state machines for faster structured-output decoding. These innovations allow SGLang to significantly improve throughput and reduce latency compared to existing systems across various LLM applications and hardware platforms. The framework is open source, supports a wide range of models, and has seen broad industry adoption thanks to its performance on complex LM programs. A short sketch of the frontend primitives follows this entry.

    Sources:


    https://arxiv.org/pdf/2312.07104

    https://docs.sglang.ai/

    https://github.com/sgl-project/sglang

    17 mins
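
    A short, hedged sketch of the frontend primitives described above, run against a locally launched SGLang server; the endpoint URL, argument names, and batch contents are assumptions that may differ across versions.

        import sglang as sgl

        @sgl.function
        def summarize(s, document):
            # Generation primitive: append a user turn, then ask the model to fill "summary".
            s += sgl.user("Summarize in one sentence:\n" + document)
            s += sgl.assistant(sgl.gen("summary", max_tokens=64))

        # Point the frontend at a running SGLang runtime (hypothetical local endpoint).
        sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

        # Parallelism primitive: run_batch executes many program instances; shared
        # prefixes benefit from RadixAttention's KV-cache reuse on the backend.
        states = summarize.run_batch([{"document": "first document ..."},
                                      {"document": "second document ..."}])
        for state in states:
            print(state["summary"])
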
  • Eleuther: evaluating LLMs
    Sep 7 2025

    These sources collectively explore approaches to evaluating and improving Large Language Models (LLMs). Several papers introduce new benchmark datasets designed to test LLMs on complex reasoning tasks, such as the "BIG-Bench Hard (BBH)" suite, the graduate-level "GPQA" science questions, and "MuSR" for multistep soft reasoning over natural-language narratives. A key technique discussed across these sources is Chain-of-Thought (CoT) prompting, which encourages LLMs to lay out their step-by-step reasoning and often lifts performance above average human-rater scores on challenging tasks. Additionally, the "Instruction-Following Eval (IFEval)" introduces a reproducible benchmark of verifiable instructions, allowing objective assessment of an LLM's ability to follow explicit directives. The "MMLU-Pro" benchmark further contributes a large-scale dataset across diverse disciplines to rigorously assess model capabilities, underscoring the need for robust evaluation metrics and challenging data to push the boundaries of AI reasoning. A hedged sketch of running these benchmarks through the lm-evaluation-harness follows this entry.


    Sources:

    https://github.com/EleutherAI/lm-evaluation-harness

    https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/leaderboard/README.md

    https://arxiv.org/pdf/2103.03874 - Measuring Mathematical Problem Solving with the MATH Dataset

    https://arxiv.org/pdf/2210.09261 - Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    https://arxiv.org/pdf/2310.16049 - MuSR: Testing the Limits of Chain-of-Thought with Multistep Soft Reasoning

    https://arxiv.org/pdf/2311.07911 - Instruction-Following Evaluation for Large Language Models

    https://arxiv.org/pdf/2311.12022 - GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    https://arxiv.org/pdf/2406.01574 - MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark


    27 mins
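
    A hedged sketch of how such benchmarks can be run through the lm-evaluation-harness's Python entry point; the leaderboard task names follow the linked leaderboard README but, like the model choice and batch size, are assumptions that may differ across harness versions.

        import lm_eval

        results = lm_eval.simple_evaluate(
            model="hf",
            model_args="pretrained=EleutherAI/pythia-1.4b,dtype=float16",
            tasks=["leaderboard_bbh", "leaderboard_gpqa", "leaderboard_musr",
                   "leaderboard_ifeval", "leaderboard_mmlu_pro"],
            batch_size=8,
        )
        print(results["results"])   # per-task metrics, e.g. accuracy
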
  • OpenELM: Apple's Open Language Model Family
    Sep 7 2025

    These May 2024 sources center on CoreNet, an Apple-developed library for training deep neural networks, and OpenELM, an efficient language-model family built with CoreNet. CoreNet is a versatile toolkit supporting tasks ranging from foundation models such as large language models (LLMs) to object classification and semantic segmentation, and it evolved from the earlier CVNets library. A key innovation highlighted is OpenELM's layer-wise scaling strategy, which reallocates parameters across transformer layers to achieve higher accuracy with fewer pre-training tokens than comparable open LLMs. The resources emphasize reproducibility and transparency by providing a complete framework for OpenELM's training and evaluation, including code for inference and fine-tuning on Apple devices using the MLX library, plus detailed benchmarks on both NVIDIA CUDA and Apple Silicon hardware. A toy sketch of the layer-wise scaling idea follows this entry.


    Sources:

    https://arxiv.org/pdf/2404.14619

    https://machinelearning.apple.com/research/openelm

    https://github.com/apple/corenet

    https://github.com/apple/corenet/tree/main/projects/kv-prediction


    15 mins
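
    A toy sketch of the layer-wise scaling idea mentioned above (not OpenELM's exact formulation or hyperparameters): rather than giving every transformer layer the same width, the number of attention heads and the FFN expansion ratio are interpolated from the first layer to the last, redistributing parameters across depth. All names and values below are illustrative assumptions.

        def layerwise_widths(num_layers, d_model, head_dim,
                             min_heads, max_heads, min_ffn_mult, max_ffn_mult):
            configs = []
            for i in range(num_layers):
                t = i / max(num_layers - 1, 1)      # 0.0 at the first layer, 1.0 at the last
                n_heads = round(min_heads + t * (max_heads - min_heads))
                ffn_mult = min_ffn_mult + t * (max_ffn_mult - min_ffn_mult)
                configs.append({
                    "layer": i,
                    "n_heads": n_heads,
                    "attn_dim": n_heads * head_dim,
                    "ffn_dim": int(round(ffn_mult * d_model)),
                })
            return configs

        if __name__ == "__main__":
            for cfg in layerwise_widths(num_layers=8, d_model=1024, head_dim=64,
                                        min_heads=4, max_heads=16,
                                        min_ffn_mult=1.0, max_ffn_mult=4.0):
                print(cfg)
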
  • FineVision: Open Data for Computer Vision
    Sep 7 2025

    These September 2025 sources describe HuggingFaceM4/FineVision, a large dataset spanning image and text modalities. Its dataset card lists a size between 10M and 100M samples, distributed in Parquet format. Each example carries quality ratings, such as relevance, visual dependency, image correspondence, and formatting, reflecting its use in evaluating the quality of, and relationship between, visual and textual content. The examples provided show that FineVision contains question-and-answer pairs about diverse charts and diagrams, covering topics like population trends, genetic diseases, software update frequencies, and demographic distributions, suggesting its application in training models for visual question answering and chart comprehension. A hedged snippet for loading a few rows follows this entry.


    Sources:

    https://huggingface.co/spaces/HuggingFaceM4/FineVision

    https://huggingface.co/datasets/HuggingFaceM4/FineVision

    16 mins
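
    A hedged snippet for pulling a few FineVision rows with the datasets library; the subset name "chartqa" and the use of streaming are assumptions (the dataset is organized into many sub-datasets, each with its own schema), so check the dataset card for the actual configuration names.

        from datasets import load_dataset

        # Stream to avoid downloading the full multi-million-row dataset up front.
        ds = load_dataset("HuggingFaceM4/FineVision", "chartqa",
                          split="train", streaming=True)

        for row in ds.take(3):
            print(row.keys())   # inspect the image / question-answer fields for this subset
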
  • Evaluating Large Language Models Trained on Code
    Sep 7 2025

    This July 2021 paper documents the development and evaluation of OpenAI's Codex models, large language models specialized in code generation, particularly Python functions written from docstrings. It introduces HumanEval, a hand-written dataset that assesses the functional correctness of generated code through unit tests, a more robust metric than match-based scores like BLEU. The paper compares various Codex iterations, including supervised fine-tuned versions (Codex-S), against other models such as GPT-3, demonstrating significant improvements in pass rates with increased model size and sample counts. It also explores the limitations, broader impacts, and potential hazards of these models, discussing over-reliance, misalignment, economic implications for the labor market, and security concerns around generating vulnerable or biased code. Finally, it covers Codex-D, a model for generating docstrings from code, and emphasizes the need for continued research into safe and responsible AI deployment. The paper's pass@k estimator is sketched after this entry.

    Sources:

    https://arxiv.org/pdf/2107.03374

    https://github.com/openai/human-eval

    17 mins
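
    The paper reports results as pass@k, estimated with an unbiased, numerically stable formula: given n generated samples per problem of which c pass the unit tests, it estimates the probability that at least one of k randomly drawn samples is correct. A small rendering of that estimator:

        import numpy as np

        def pass_at_k(n: int, c: int, k: int) -> float:
            """Unbiased estimate of pass@k from n samples with c correct."""
            if n - c < k:
                return 1.0
            # 1 - C(n-c, k) / C(n, k), computed as a stable running product.
            return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

        # Example: 200 samples per problem, 37 passing -> estimate pass@1 and pass@100.
        print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 100))
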
  • Democratizing AI Compute: The Modular Vision
    Sep 7 2025

    This blog post series from Chris Lattner extensively examines CUDA's pervasive dominance in AI compute, tracing its evolution from a way to program graphics processors into a layered software platform integral to NVIDIA's success, while also highlighting the challenges and complexity it presents to developers and alternative hardware vendors. The articles critically assess various attempts to democratize AI compute, including OpenCL, TVM, XLA, and MLIR, explaining why these alternatives largely failed to dislodge CUDA due to fragmentation, misaligned incentives, and the lack of a unified vision. Ultimately, the series introduces Modular's approach to these problems through its Mojo language, MAX framework, and Mammoth cluster-management system, aiming to provide a portable, performant, and programmable foundation for the rapidly evolving generative AI landscape.


    Source:


    https://www.modular.com/blog/democratizing-compute-part-1-deepseeks-impact-on-ai

    1 hr and 12 mins