Episodes

  • FIM: Filling in the Middle for Language Models
    Aug 9 2025

    This 2022 academic paper explores Fill-in-the-Middle (FIM) capabilities in causal decoder-based language models, demonstrating that these models can learn to infill text effectively simply by rearranging parts of the training data. The authors propose a method in which a middle section of text is moved to the end of a document during training, and show that this data augmentation does not harm the model's original left-to-right generative ability. The research highlights the efficiency of FIM training, arguing it should be a default practice, and offers best practices and hyperparameters for optimal performance, in particular recommending character-level span selection and applying the FIM transformation at the context level rather than the document level. The authors also introduce new benchmarks to evaluate infilling performance, emphasizing the importance of sampling-based evaluations over traditional perplexity measures for gauging real-world utility.
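
    As a rough illustration of the rearrangement described above, here is a minimal sketch of a document-level FIM transform in Python. The sentinel strings, the fim_rate parameter, and the span-selection details are illustrative placeholders rather than the paper's exact special tokens or hyperparameters; the paper's preferred context-level variant applies the same idea after documents are packed into training contexts.

    ```python
    import random

    # Hypothetical sentinel strings for illustration; the paper adds dedicated
    # special tokens to the tokenizer rather than literal text markers.
    PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

    def fim_transform(document: str, fim_rate: float = 0.5) -> str:
        """Rearrange a document for FIM training with probability fim_rate.

        A character-level span is chosen as the "middle" and moved to the end,
        so an ordinary left-to-right model learns to infill it.
        """
        if random.random() > fim_rate or len(document) < 2:
            return document  # leave the remaining data in normal left-to-right order
        # Character-level span selection: two random cut points in the document.
        i, j = sorted(random.sample(range(len(document) + 1), 2))
        prefix, middle, suffix = document[:i], document[i:j], document[j:]
        # "Prefix-suffix-middle" ordering: the model sees prefix and suffix first,
        # then is trained to generate the missing middle.
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

    print(fim_transform("def add(a, b):\n    return a + b\n", fim_rate=1.0))
    ```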


    Source: https://arxiv.org/pdf/2207.14255

    20 mins
  • BPE: Subword Units for Neural Machine Translation of Rare Words
    Aug 9 2025

    This 2016 academic paper addresses the challenge of translating rare and unknown words in Neural Machine Translation (NMT), a common issue as NMT models typically operate with a fixed vocabulary while translation itself is an open-vocabulary problem. The authors propose a novel approach where rare and unknown words are encoded as sequences of subword units, eliminating the need for a back-off dictionary. They introduce an adaptation of the Byte Pair Encoding (BPE) compression algorithm for word segmentation, which allows for an open vocabulary using a compact set of variable-length character sequences. Empirical results demonstrate that this subword unit method significantly improves translation quality, particularly for rare and out-of-vocabulary words, for English-German and English-Russian language pairs. The paper compares various segmentation techniques, concluding that BPE offers a more effective and simpler solution for handling the open-vocabulary problem in NMT compared to previous word-level models and dictionary-based approaches.
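
    The paper itself includes a compact Python reference implementation of the merge-learning loop; the sketch below follows that algorithm on the paper's toy vocabulary, learning ten merge operations from word frequencies.

    ```python
    import re
    import collections

    def get_stats(vocab):
        """Count frequencies of adjacent symbol pairs across the vocabulary."""
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[symbols[i], symbols[i + 1]] += freq
        return pairs

    def merge_vocab(pair, vocab):
        """Merge the chosen symbol pair into a single symbol everywhere it occurs."""
        bigram = re.escape(' '.join(pair))
        pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

    # Words are sequences of characters plus an end-of-word marker </w>,
    # weighted by corpus frequency (the toy example used in the paper).
    vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
             'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
    for _ in range(10):  # the number of merge operations sets the final vocabulary size
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        print(best)
    ```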


    Source: https://arxiv.org/pdf/1508.07909


    16 mins
  • Distributed Word and Phrase Representations
    Aug 9 2025

    This 2013 paper introduces several extensions to the continuous Skip-gram model, a method for learning high-quality distributed vector representations of words. The authors show that subsampling frequent words and a simplified training objective called negative sampling improve both vector quality and training speed. A further contribution is a data-driven method for identifying idiomatic phrases and representing them as single tokens, improving the model's ability to capture meanings that are not simple compositions of the individual words. The paper demonstrates that the resulting word and phrase vectors exhibit linear structure, allowing precise analogical reasoning through simple vector arithmetic. Overall, the research highlights improved efficiency and accuracy in learning linguistic representations, especially on large datasets, by optimizing the Skip-gram architecture.
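
    The phrase-identification step lends itself to a short sketch. The scoring rule below is the paper's bigram criterion, score(wi, wj) = (count(wi wj) − δ) / (count(wi) · count(wj)); the delta and threshold values and the single merging pass are illustrative choices, whereas the authors run several passes over the corpus with a decreasing threshold.

    ```python
    from collections import Counter

    def phrase_scores(tokens, delta=5.0):
        """Score adjacent word pairs with the bigram criterion
        score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj)).
        The discount delta keeps very infrequent pairs from scoring highly.
        """
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        return {pair: (count - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
                for pair, count in bigrams.items()}

    def merge_phrases(tokens, threshold=1e-4):
        """Single pass that joins pairs scoring above the threshold into one token."""
        scores = phrase_scores(tokens)
        merged, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
                merged.append('_'.join(pair))  # e.g. "new york" becomes "new_york"
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        return merged
    ```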


    Source: https://arxiv.org/pdf/1310.4546

    16 mins
  • Efficient Word Vectors for Large Datasets
    Aug 9 2025

    This 2013 academic paper introduces two new model architectures, Continuous Bag-of-Words (CBOW) and Skip-gram, designed for efficiently computing continuous vector representations of words from vast datasets. The authors compare the quality and computational cost of these new models against existing neural network language models, demonstrating significant improvements in accuracy at a lower computational expense. A key focus is on preserving linear regularities between words, enabling the vectors to capture complex syntactic and semantic relationships that can be revealed through algebraic operations. The research highlights the scalability of these methods for large-scale parallel training, suggesting their potential to advance various Natural Language Processing (NLP) applications.
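
    The "algebraic operations" mentioned above are simple additions and subtractions of word vectors. A minimal sketch, assuming `vectors` is a {word: numpy array} table learned by CBOW or Skip-gram (the tiny 2-D vectors below are made up purely so the example runs):

    ```python
    import numpy as np

    def analogy(a, b, c, vectors, topn=1):
        """Solve "a is to b as c is to ?" by vector arithmetic.

        The paper's well-known example: vector("King") - vector("Man")
        + vector("Woman") is closest, by cosine similarity, to vector("Queen").
        """
        target = vectors[b] - vectors[a] + vectors[c]
        target /= np.linalg.norm(target)
        scores = {
            w: float(v @ target / np.linalg.norm(v))
            for w, v in vectors.items()
            if w not in (a, b, c)  # exclude the query words themselves
        }
        return sorted(scores, key=scores.get, reverse=True)[:topn]

    # Toy 2-D vectors for illustration only; real CBOW/Skip-gram vectors are
    # learned from a large corpus and have hundreds of dimensions.
    vectors = {w: np.array(v, dtype=float) for w, v in {
        'king': [0.9, 0.8], 'queen': [0.9, 0.2],
        'man': [0.1, 0.8], 'woman': [0.1, 0.2]}.items()}
    print(analogy('man', 'king', 'woman', vectors))  # -> ['queen']
    ```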


    Source: https://arxiv.org/pdf/1301.3781

    12 mins
  • A Neural Probabilistic Language Model
    Aug 8 2025

    This paper, published in 2003, introduces a neural probabilistic language model designed to address the curse of dimensionality inherent in modeling word sequences. The authors propose learning a distributed representation for words, which enables the model to generalize from seen sentences to an exponential number of semantically similar, unseen sentences. The approach simultaneously learns word feature vectors and the probability function for word sequences with a neural network. The paper details the network architecture, the training procedure based on stochastic gradient ascent, and methods for parallel implementation to manage the computational demands of large datasets. Experimental results on two corpora demonstrate that this neural network approach significantly improves upon state-of-the-art n-gram models, particularly by leveraging longer word contexts.
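
    A minimal forward pass for the architecture described above, following the paper's formulation y = b + Wx + U tanh(d + Hx) with a softmax over the vocabulary. The layer sizes and random initialisation below are toy values for illustration, not the paper's settings.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    V, m, n, h = 1000, 60, 4, 50    # vocab size, feature dim, context length, hidden units

    # Parameters: the shared word feature matrix C plus the network weights.
    C = rng.normal(scale=0.1, size=(V, m))       # distributed word representations
    H = rng.normal(scale=0.1, size=(h, n * m))   # input-to-hidden weights
    d = np.zeros(h)
    U = rng.normal(scale=0.1, size=(V, h))       # hidden-to-output weights
    b = np.zeros(V)
    W = np.zeros((V, n * m))                     # optional direct input-to-output connections

    def next_word_probs(context_ids):
        """P(w_t | previous n words) for one context window of word indices."""
        x = C[context_ids].reshape(-1)           # concatenate the context feature vectors
        y = b + W @ x + U @ np.tanh(d + H @ x)   # unnormalised score for every word
        e = np.exp(y - y.max())                  # softmax over the whole vocabulary
        return e / e.sum()

    probs = next_word_probs([12, 7, 99, 3])      # e.g. indices of the n previous words
    ```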


    Source: https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

    7 mins
  • Softmax: Neural Networks and Maximum Mutual Information Estimation
    Aug 8 2025

    This 1989 paper by John S. Bridle, "Training Stochastic Model Recognition Algorithms as Networks can lead to Maximum Mutual Information Estimation of Parameters," proposes a discrimination-based approach to training pattern recognizers within neural networks (NNs), aimed in particular at improving the Hidden Markov Models (HMMs) used in speech recognition. The paper demonstrates that modifying a multilayer perceptron's output layer so that it yields a proper probability distribution over classes, and replacing the standard squared-error criterion with a probability-based score, is equivalent to Maximum Mutual Information (MMI) training. Applied to a specially constructed network implementing a stochastic model-based classifier, this offers a powerful way to train model parameters, exemplified by an HMM-based word discriminator called an "Alphanet." Ultimately, the research explores how NN architectures can embody the desirable traits of stochastic models and clarifies the relationship between discriminative NN training and MMI training of stochastic models.
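
    The output transformation at the heart of the paper is the normalised exponential now universally known as softmax, paired with a log-probability training score in place of squared error. A minimal sketch with made-up class scores:

    ```python
    import numpy as np

    def softmax(q):
        """Normalised exponential: turns arbitrary scores into a probability
        distribution over classes (non-negative, summing to one)."""
        e = np.exp(q - q.max())        # subtract the max for numerical stability
        return e / e.sum()

    def log_prob_score(q, correct_class):
        """Probability-based training criterion: the log-probability assigned to
        the correct class, used instead of the squared-error criterion."""
        return float(np.log(softmax(q)[correct_class]))

    scores = np.array([2.0, 0.5, -1.0])  # e.g. final-layer activations for 3 classes
    print(softmax(scores), log_prob_score(scores, correct_class=0))
    ```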


    Source: https://proceedings.neurips.cc/paper_files/paper/1989/file/0336dcbab05b9d5ad24f4333c7658a0e-Paper.pdf

    12 mins
  • Back-Propagating Errors for Visual and Stereo Recognition
    Aug 8 2025

    The paper, published in Nature in 1986, introduces back-propagation as a method for learning representations within neural networks. It lays out the theoretical framework and mathematical underpinnings of the learning algorithm, explaining how the connection weights in a network are adjusted based on the error between actual and desired outputs so that hidden units come to encode useful internal representations. The scanned source also contains an excerpt from an adjacent "Letters to Nature" item, "Bilateral amblyopia after a short period of reverse occlusion in kittens," which is separate from the back-propagation work itself but touches on a related biological theme: the plasticity of neural systems.
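
    As a rough illustration of the weight-adjustment rule described above, here is a tiny two-layer sigmoid network trained by back-propagating the squared error on XOR. The task, layer sizes, learning rate, and iteration count are illustrative choices in the spirit of the paper's demonstrations, not an example taken from it.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Tiny network learning XOR: a task where useful hidden representations
    # must be discovered, since no single-layer solution exists.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    t = np.array([[0], [1], [1], [0]], dtype=float)
    W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
    W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
    lr = 0.5

    for _ in range(20000):
        h = sigmoid(X @ W1 + b1)                   # forward pass: hidden activations
        y = sigmoid(h @ W2 + b2)                   # forward pass: output
        delta2 = (y - t) * y * (1 - y)             # error signal at the output layer
        delta1 = (delta2 @ W2.T) * h * (1 - h)     # error propagated back to the hidden layer
        W2 -= lr * h.T @ delta2; b2 -= lr * delta2.sum(0)   # gradient-descent updates
        W1 -= lr * X.T @ delta1; b1 -= lr * delta1.sum(0)

    print(np.round(y, 2))  # should approach [[0], [1], [1], [0]]; try another seed if it stalls
    ```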

    13 mins
  • The Parallel Distributed Processing Perspective
    Aug 8 2025

    This 1986 book chapter introduces the concept of Parallel Distributed Processing (PDP) models, offering a new perspective on how human cognition works, contrasting it with traditional sequential processing. It explores how the brain handles complex tasks like perception, motor control, language understanding, and memory retrieval by simultaneously weighing multiple, often ambiguous, pieces of information. The text provides concrete examples, such as reaching for an object, skilled typing, stereoscopic vision, and word recognition, to illustrate how interconnected processing units interact through excitatory and inhibitory signals to arrive at solutions. It also touches on the origins of PDP models, highlighting their physiological plausibility and their ability to learn and generalize spontaneously by adjusting the connection strengths between units based on experience.
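
    To make the excitatory/inhibitory interaction concrete, here is a toy relaxation network in the spirit of the chapter's examples (not a model taken from it): three units standing for competing and mutually supporting hypotheses receive ambiguous external evidence and settle on an interpretation. All weights and inputs are made-up values.

    ```python
    import numpy as np

    # Units stand for hypotheses: mutually consistent hypotheses excite each
    # other, incompatible ones inhibit each other, and the activations settle
    # on a coherent interpretation of the ambiguous input.
    W = np.array([
        [0.0,  0.5, -0.5],   # A and B support each other; both compete with C
        [0.5,  0.0, -0.5],
        [-0.5, -0.5, 0.0],
    ])
    external_input = np.array([0.6, 0.2, 0.4])   # ambiguous bottom-up evidence
    a = np.zeros(3)                              # unit activations, kept in [0, 1]

    for _ in range(200):
        net = W @ a + external_input             # combined excitation and inhibition
        a = np.clip(a + 0.1 * (net - a), 0.0, 1.0)   # move a step toward the net input

    print(np.round(a, 2))   # roughly [0.93, 0.67, 0.0]: A and B win, C is suppressed
    ```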


    Source: https://stanford.edu/~jlmcc/papers/PDP/Chapter1.pdf

    27 mins