Episodes

  • Offloading LLM Models and KV Caches to NVMe SSDs
    Sep 8 2025

    This March 2025 paper examines the input/output (I/O) characteristics of offloading large language model (LLM) components to NVMe SSDs during inference, a critical technique for overcoming GPU memory limitations as models continue to grow. Researchers analyzed block-layer I/O traces from two prominent LLM frameworks, DeepSpeed and FlexGen, to understand how model weights and key-value (KV) caches are handled. The findings indicate that asynchronous I/O using libaio significantly outperforms synchronous POSIX I/O for tensor transfers, although neither method fully saturates the NVMe SSD's theoretical bandwidth. For model offloading, I/O is dominated by 128 KiB reads concentrated at the beginning of inference, while KV cache offloading involves both reads and writes of similar size, with read bandwidth substantially higher. Ultimately, the research suggests that modern NVMe SSDs can support current LLM inference workloads but highlights opportunities for further optimization in SSD design and KV cache management. A minimal sketch of the many-requests-in-flight, 128 KiB read pattern follows this entry.


    Source:

    https://dl.acm.org/doi/10.1145/3719330.3721230

    17 mins
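
    For illustration, here is a minimal, hedged sketch of the access pattern described above: many concurrent 128 KiB reads against an offloaded tensor file. It uses a thread pool of os.pread calls as a rough stand-in for libaio-style asynchronous submission (it is not the paper's tooling), and the file path and queue depth are illustrative assumptions.

        import os
        import time
        from concurrent.futures import ThreadPoolExecutor

        BLOCK = 128 * 1024            # 128 KiB, the dominant request size reported
        QUEUE_DEPTH = 32              # number of reads kept in flight (assumption)
        PATH = "model_weights.bin"    # hypothetical offloaded tensor file

        def read_block(fd: int, offset: int) -> int:
            # Positional read: no shared file offset, so requests can overlap freely.
            return len(os.pread(fd, BLOCK, offset))

        def read_file_concurrently(path: str) -> float:
            fd = os.open(path, os.O_RDONLY)
            try:
                size = os.fstat(fd).st_size
                offsets = range(0, size, BLOCK)
                start = time.perf_counter()
                with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
                    total = sum(pool.map(lambda off: read_block(fd, off), offsets))
                elapsed = time.perf_counter() - start
                return total / elapsed / 1e9   # observed throughput in GB/s
            finally:
                os.close(fd)

        if __name__ == "__main__":
            print(f"~{read_file_concurrently(PATH):.2f} GB/s with {QUEUE_DEPTH} reads in flight")
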
  • GPT-NeoX: Large-Scale Autoregressive Language Modeling in PyTorch
    Sep 7 2025

    This source describes EleutherAI's GPT-NeoX library, a robust open-source framework for training large-scale autoregressive language models on GPUs, building upon the Megatron and DeepSpeed libraries. It highlights the library's advanced features, such as distributed training, support for a wide range of hardware and systems, and cutting-edge architectural innovations. The text also provides practical guidance on setup, configuration, data preparation, training, inference, and evaluation, alongside details on pretrained models like GPT-NeoX-20B and Pythia. Furthermore, it explains how to export models to Hugging Face and monitor experiments, underscoring the library's widespread adoption in research and industry. A brief, illustrative snippet for loading the Hugging Face export of GPT-NeoX-20B follows this entry.


    Source:

    https://github.com/EleutherAI/gpt-neox


    12 mins
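
    As a small, hedged illustration of the Hugging Face export mentioned above (not the GPT-NeoX training or inference scripts themselves), the 20B checkpoint can be loaded with the transformers library; the dtype and device placement below are assumptions chosen for brevity, and the model needs tens of gigabytes of memory.

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
        model = AutoModelForCausalLM.from_pretrained(
            "EleutherAI/gpt-neox-20b",
            torch_dtype=torch.float16,
            device_map="auto",   # requires the accelerate package; illustrative choice
        )

        inputs = tokenizer("GPT-NeoX is", return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=32)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
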
  • SGLang: Efficient Language Model Program Execution
    Sep 7 2025

    This June 2024 paper introduces SGLang, a framework designed to enhance the efficiency of Large Language Model (LLM) and Vision Language Model (VLM) serving. It achieves this through a co-design of a flexible frontend language and a fast backend runtime. The frontend simplifies programming with primitives for generation and parallelism, while the backend applies novel optimizations such as RadixAttention for KV cache reuse and compressed finite state machines for faster structured-output decoding. These innovations allow SGLang to significantly improve throughput and reduce latency compared to existing systems across various LLM applications and hardware platforms. The framework is open source, supports a wide range of models, and has seen broad industry adoption thanks to its performance on complex LM programs. A short sketch of the frontend primitives follows this entry.

    Sources:


    https://arxiv.org/pdf/2312.07104

    https://docs.sglang.ai/

    https://github.com/sgl-project/sglang

    17 mins
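
    A short, hedged sketch of the frontend primitives described above, run against a locally launched SGLang server; the endpoint URL, argument names, and batch contents are assumptions that may differ across versions.

        import sglang as sgl

        @sgl.function
        def summarize(s, document):
            # Generation primitive: append a user turn, then ask the model to fill "summary".
            s += sgl.user("Summarize in one sentence:\n" + document)
            s += sgl.assistant(sgl.gen("summary", max_tokens=64))

        # Point the frontend at a running SGLang runtime (hypothetical local endpoint).
        sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

        # Parallelism primitive: run_batch executes many program instances; shared
        # prefixes benefit from RadixAttention's KV-cache reuse on the backend.
        states = summarize.run_batch([{"document": "first document ..."},
                                      {"document": "second document ..."}])
        for state in states:
            print(state["summary"])
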
  • Eleuther: evaluating LLMs
    Sep 7 2025

    These sources collectively explore approaches to evaluating and improving Large Language Models (LLMs). Several papers introduce new benchmark datasets designed to test LLMs on complex reasoning tasks, such as the "BIG-Bench Hard (BBH)" suite, the graduate-level "GPQA" science questions, and "MuSR" for multistep soft reasoning over natural-language narratives. A key technique discussed across these sources is Chain-of-Thought (CoT) prompting, which encourages LLMs to lay out their step-by-step reasoning and often lifts performance above average human-rater scores on challenging tasks. Additionally, the "Instruction-Following Eval (IFEval)" introduces a reproducible benchmark of verifiable instructions, allowing objective assessment of an LLM's ability to follow explicit directives. The "MMLU-Pro" benchmark further contributes a large-scale dataset across diverse disciplines to rigorously assess model capabilities, underscoring the need for robust evaluation metrics and challenging data to push the boundaries of AI reasoning. A hedged sketch of running these benchmarks through the lm-evaluation-harness follows this entry.


    Sources:

    https://github.com/EleutherAI/lm-evaluation-harness

    https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/leaderboard/README.md

    https://arxiv.org/pdf/2103.03874 - Measuring Mathematical Problem Solving with the MATH Dataset

    https://arxiv.org/pdf/2210.09261 - Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    https://arxiv.org/pdf/2310.16049 - MuSR: Testing the Limits of Chain-of-Thought with Multistep Soft Reasoning

    https://arxiv.org/pdf/2311.07911 - Instruction-Following Evaluation for Large Language Models

    https://arxiv.org/pdf/2311.12022 - GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    https://arxiv.org/pdf/2406.01574 - MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark


    27 mins
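
    A hedged sketch of how such benchmarks can be run through the lm-evaluation-harness's Python entry point; the leaderboard task names follow the linked leaderboard README but, like the model choice and batch size, are assumptions that may differ across harness versions.

        import lm_eval

        results = lm_eval.simple_evaluate(
            model="hf",
            model_args="pretrained=EleutherAI/pythia-1.4b,dtype=float16",
            tasks=["leaderboard_bbh", "leaderboard_gpqa", "leaderboard_musr",
                   "leaderboard_ifeval", "leaderboard_mmlu_pro"],
            batch_size=8,
        )
        print(results["results"])   # per-task metrics, e.g. accuracy
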
  • OpenELM: Apple's Open Language Model Family
    Sep 7 2025

    These May 2024 sources center on CoreNet, an Apple-developed library for training deep neural networks, and OpenELM, an efficient language-model family built with CoreNet. CoreNet is a versatile toolkit supporting tasks ranging from foundation models such as large language models (LLMs) to object classification and semantic segmentation, and it evolved from the earlier CVNets library. A key innovation highlighted is OpenELM's layer-wise scaling strategy, which reallocates parameters across transformer layers to achieve higher accuracy with fewer pre-training tokens than comparable open LLMs. The resources emphasize reproducibility and transparency by providing a complete framework for OpenELM's training and evaluation, including code for inference and fine-tuning on Apple devices using the MLX library, plus detailed benchmarks on both NVIDIA CUDA and Apple Silicon hardware. A toy sketch of the layer-wise scaling idea follows this entry.


    Sources:

    https://arxiv.org/pdf/2404.14619

    https://machinelearning.apple.com/research/openelm

    https://github.com/apple/corenet

    https://github.com/apple/corenet/tree/main/projects/kv-prediction


    15 mins
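
    A toy sketch of the layer-wise scaling idea mentioned above (not OpenELM's exact formulation or hyperparameters): rather than giving every transformer layer the same width, the number of attention heads and the FFN expansion ratio are interpolated from the first layer to the last, redistributing parameters across depth. All names and values below are illustrative assumptions.

        def layerwise_widths(num_layers, d_model, head_dim,
                             min_heads, max_heads, min_ffn_mult, max_ffn_mult):
            configs = []
            for i in range(num_layers):
                t = i / max(num_layers - 1, 1)      # 0.0 at the first layer, 1.0 at the last
                n_heads = round(min_heads + t * (max_heads - min_heads))
                ffn_mult = min_ffn_mult + t * (max_ffn_mult - min_ffn_mult)
                configs.append({
                    "layer": i,
                    "n_heads": n_heads,
                    "attn_dim": n_heads * head_dim,
                    "ffn_dim": int(round(ffn_mult * d_model)),
                })
            return configs

        if __name__ == "__main__":
            for cfg in layerwise_widths(num_layers=8, d_model=1024, head_dim=64,
                                        min_heads=4, max_heads=16,
                                        min_ffn_mult=1.0, max_ffn_mult=4.0):
                print(cfg)
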
  • FineVision: Open Data for Computer Vision
    Sep 7 2025

    These September 2025 sources describe HuggingFaceM4/FineVision, a large dataset spanning image and text modalities. Its dataset card lists a size between 10M and 100M samples, distributed in Parquet format. Each example carries quality ratings, such as relevance, visual dependency, image correspondence, and formatting, reflecting its use in evaluating the quality of, and relationship between, visual and textual content. The examples provided show that FineVision contains question-and-answer pairs about diverse charts and diagrams, covering topics like population trends, genetic diseases, software update frequencies, and demographic distributions, suggesting its application in training models for visual question answering and chart comprehension. A hedged snippet for loading a few rows follows this entry.


    Sources:

    https://huggingface.co/spaces/HuggingFaceM4/FineVision

    https://huggingface.co/datasets/HuggingFaceM4/FineVision

    16 mins
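
    A hedged snippet for pulling a few FineVision rows with the datasets library; the subset name "chartqa" and the use of streaming are assumptions (the dataset is organized into many sub-datasets, each with its own schema), so check the dataset card for the actual configuration names.

        from datasets import load_dataset

        # Stream to avoid downloading the full multi-million-row dataset up front.
        ds = load_dataset("HuggingFaceM4/FineVision", "chartqa",
                          split="train", streaming=True)

        for row in ds.take(3):
            print(row.keys())   # inspect the image / question-answer fields for this subset
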
  • Evaluating Large Language Models Trained on Code
    Sep 7 2025

    This July 2021 paper documents the development and evaluation of OpenAI's Codex models, large language models specialized in code generation, particularly Python functions written from docstrings. It introduces HumanEval, a hand-written dataset that assesses the functional correctness of generated code through unit tests, a more robust metric than match-based scores like BLEU. The paper compares various Codex iterations, including supervised fine-tuned versions (Codex-S), against other models such as GPT-3, demonstrating significant improvements in pass rates with increased model size and sample counts. It also explores the limitations, broader impacts, and potential hazards of these models, discussing over-reliance, misalignment, economic implications for the labor market, and security concerns around generating vulnerable or biased code. Finally, it covers Codex-D, a model for generating docstrings from code, and emphasizes the need for continued research into safe and responsible AI deployment. The paper's pass@k estimator is sketched after this entry.

    Sources:

    https://arxiv.org/pdf/2107.03374

    https://github.com/openai/human-eval

    17 mins
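
    The paper reports results as pass@k, estimated with an unbiased, numerically stable formula: given n generated samples per problem of which c pass the unit tests, it estimates the probability that at least one of k randomly drawn samples is correct. A small rendering of that estimator:

        import numpy as np

        def pass_at_k(n: int, c: int, k: int) -> float:
            """Unbiased estimate of pass@k from n samples with c correct."""
            if n - c < k:
                return 1.0
            # 1 - C(n-c, k) / C(n, k), computed as a stable running product.
            return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

        # Example: 200 samples per problem, 37 passing -> estimate pass@1 and pass@100.
        print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 100))
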
  • Democratizing AI Compute: The Modular Vision
    Sep 7 2025

    This blog post series from Chris Lattner extensively examines CUDA's pervasive dominance in AI compute, tracing its evolution from a way to program graphics processors into a layered software platform integral to NVIDIA's success, while also highlighting the challenges and complexity it presents to developers and alternative hardware vendors. The articles critically assess various attempts to democratize AI compute, including OpenCL, TVM, XLA, and MLIR, explaining why these alternatives largely failed to dislodge CUDA due to fragmentation, misaligned incentives, and the lack of a unified vision. Ultimately, the series introduces Modular's approach to these problems through its Mojo language, MAX framework, and Mammoth cluster-management system, aiming to provide a portable, performant, and programmable foundation for the rapidly evolving generative AI landscape.


    Source:


    https://www.modular.com/blog/democratizing-compute-part-1-deepseeks-impact-on-ai

    1 hr and 12 mins