Pre-Training with Whole Word Masking for Chinese BERT (Cui et al., 2021)
Citation
Cui, Y., Che, W., Liu, T., Qin, B., & Yang, Z. (2021). Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3504-3514.
My thoughts
BERT
- Designed to pretrain deep bidirectional representations by jointly conditioning on both left and right context in all Transformer layers
- 2 pre-training tasks
- Masked Language Model (MLM)
- randomly masks some of the tokens from the input, and the objective is to predict the original word based only on its context (a toy masking sketch follows this list)
- Next Sentence Prediction (NSP)
- to predict whether sentence B is the next sentence of sentence A
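A minimal sketch of the token-level masking idea described above, not the official BERT code: the tokens, the simple "pick ~15% of positions" selection, and the omission of BERT's 80/10/10 mask/random/keep split are all illustrative simplifications.

```python
import random

# Toy sketch of BERT-style token-level masking (not the official implementation):
# roughly 15% of WordPiece tokens are chosen and replaced by [MASK], and the
# model is trained to predict the original piece at each masked position.
# (Real BERT further splits the chosen positions 80/10/10 into [MASK] /
# random token / unchanged; that refinement is omitted here.)
def random_token_masking(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    rng = random.Random(seed)
    num_to_mask = max(1, round(mask_prob * len(tokens)))
    picked = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [mask_token if i in picked else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in picked else None for i, tok in enumerate(tokens)]
    return masked, labels

# Toy English example: "philosophy" is split by WordPiece into "phil ##oso ##phy";
# note that token-level masking can mask "##oso" without masking the rest of the word.
tokens = ["the", "phil", "##oso", "##phy", "book", "is", "on", "the", "shelf"]
print(random_token_masking(tokens))
```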
Chinese BERT
- whole word masking (wwm)
- Instead of randomly selecting WordPiece tokens to mask, always mask all of the tokens corresponding to a whole word at once.
- force the model to recover the whole word in the MLM pre-training task instead of just recovering WordPiece tokens (see the sketch after this list)
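A minimal sketch of whole word masking, assuming the text has already been grouped into words, each given as its list of WordPiece tokens (for Chinese this grouping would come from a word segmenter; the toy tokens here are made up).

```python
import random

# Toy sketch of whole word masking (wwm): a word is either masked completely
# or not at all, so the model must recover the whole word rather than a single
# WordPiece.
def whole_word_masking(words, mask_prob=0.15, mask_token="[MASK]", seed=0):
    rng = random.Random(seed)
    num_to_mask = max(1, round(mask_prob * len(words)))
    picked = set(rng.sample(range(len(words)), num_to_mask))
    masked, labels = [], []
    for i, pieces in enumerate(words):
        if i in picked:
            masked.extend([mask_token] * len(pieces))  # mask every piece of the word
            labels.extend(pieces)                      # recover the whole word
        else:
            masked.extend(pieces)
            labels.extend([None] * len(pieces))
    return masked, labels

# "philosophy" -> ["phil", "##oso", "##phy"] is masked as a unit, never partially.
words = [["the"], ["phil", "##oso", "##phy"], ["book"], ["is"], ["on"], ["the"], ["shelf"]]
print(whole_word_masking(words))
```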
ERNIE
- ERNIE (Enhanced Representation through kNowledge IntEgration) is designed to optimize the masking process of BERT, adding entity-level masking and phrase-level masking (sketched below)
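A rough sketch of the entity-/phrase-level masking idea, assuming the spans have already been identified by some external tagger; the sentence and the span indices below are hand-written for illustration only.

```python
import random

# Toy sketch of ERNIE-style entity-/phrase-level masking: a selected span is
# masked as a whole, so the model must reconstruct the complete entity or
# phrase rather than isolated characters.
def span_masking(tokens, spans, num_spans_to_mask=1, mask_token="[MASK]", seed=0):
    rng = random.Random(seed)
    masked = list(tokens)
    for start, end in rng.sample(spans, num_spans_to_mask):  # spans are [start, end)
        for i in range(start, end):
            masked[i] = mask_token                           # mask the whole entity/phrase
    return masked

# Character tokens for a Chinese sentence, with entity spans for 哈尔滨 and 黑龙江.
tokens = ["哈", "尔", "滨", "是", "黑", "龙", "江", "的", "省", "会"]
spans = [(0, 3), (4, 7)]
print(span_masking(tokens, spans))
```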
MLM as correction (Mac)
- MLM suffers from the pre-training and fine-tuning discrepancy, where the artificial tokens used in the pre-training stage, such as [MASK], never appear in real downstream fine-tuning tasks
- MLM as a correction (Mac) as a solution
- in the pre-training task, no pre-defined tokens such as [MASK] are used for masking; instead the original MLM is transformed into a text correction task, where the model corrects the wrong word into the correct one (see the sketch below)
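A rough sketch of the Mac idea, assuming a similar-word lookup is available; the toy synonym table below is a hand-written stand-in for whatever similarity source is used in practice (e.g. word-embedding nearest neighbours).

```python
import random

# Toy sketch of MLM as correction (Mac): no artificial [MASK] token is
# inserted; instead a selected word is replaced by a similar word, and the
# model is trained to correct it back to the original.
SIMILAR = {"happy": "glad", "large": "big", "quick": "fast"}

def mac_corruption(words, num_to_replace=1, seed=0):
    rng = random.Random(seed)
    candidates = [i for i, w in enumerate(words) if w in SIMILAR]
    picked = set(rng.sample(candidates, min(num_to_replace, len(candidates))))
    corrupted, labels = [], []
    for i, w in enumerate(words):
        if i in picked:
            corrupted.append(SIMILAR[w])  # the model sees a plausible but wrong word
            labels.append(w)              # and must correct it to the original
        else:
            corrupted.append(w)
            labels.append(None)
    return corrupted, labels

print(mac_corruption(["the", "happy", "and", "quick", "fox"]))
```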