
BPE: Subword Units for Neural Machine Translation of Rare Words
About this listen
This 2016 academic paper addresses the challenge of translating rare and unknown words in Neural Machine Translation (NMT), a common issue because NMT models typically operate with a fixed vocabulary while translation itself is an open-vocabulary problem. The authors propose encoding rare and unknown words as sequences of subword units, eliminating the need for a back-off dictionary. They introduce an adaptation of the Byte Pair Encoding (BPE) compression algorithm for word segmentation, which supports an open vocabulary through a compact set of variable-length character sequences. Empirical results show that this subword approach significantly improves translation quality, particularly for rare and out-of-vocabulary words, on English-to-German and English-to-Russian translation tasks. The paper compares several segmentation techniques and concludes that BPE offers a simpler and more effective solution to the open-vocabulary problem in NMT than previous word-level models and dictionary-based approaches.
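To make the idea concrete, here is a minimal sketch of how BPE learns a subword vocabulary: start from a word frequency dictionary with words split into characters (plus an end-of-word marker), then repeatedly merge the most frequent adjacent symbol pair. It closely follows the style of the pseudocode in the paper; the toy vocabulary and the number of merges are illustrative, not taken from the paper's experiments.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair (as whole symbols) with the merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: word counts, with words split into characters and an
# end-of-word marker '</w>' so merges cannot cross word boundaries.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)

print(vocab)
```

With these counts, the first merges join frequent character sequences such as 'e'+'s', so common words like "newest" end up as single symbols while rarer words remain split into reusable subword pieces. At inference time, the same learned merge list is applied in order to segment any word, including ones never seen in training.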
Source: https://arxiv.org/pdf/1508.07909