HUsing BERT for choosing classifiers in Mandarin (Järnfors et al., 2021)
Citation
Järnfors, J., Chen, G., van Deemter, K., & Sybesma, R. (2021, August). Using BERT for choosing classifiers in Mandarin. In Proceedings of the 14th international conference on natural language generation (pp. 172-176).
My thoughts
Future Direction
- Comparison with Human judgment
- Asking the model to output the most probable classifier from all possible classifiers only
2 Models on BERT
- Unsupervised way (predicting classifiers by unmasking masked tokens)
- Replace the classifier indicator
with the [MASK] symbol of BERT and ask BERT to unmask it
- Replace the classifier indicator
- Supervised way (fine-tuning BERT on the task of classifier prediction)
- Fine-tune BERT on the CCD corpus as a multi-class classification task, where there are 172 classifier classes.
Research Questions
- Since BERT models context closely and is pertained on large scale corpora, they expect it to outperform other models
- True
- How do the two BERT-based models compare? Although they expect fine-tuned BERT to outperform unsupervised BERT.
- How well BERT can handle classifiers that add information (e.g., measure words, plurality and politeness)
4 models and 1 corpus
Models
- Unsupervised way (predicting classifiers by unmasking masked tokens) BERT
- Supervised way (fine-tuning BERT on the task of classifier prediction)
- LSTM-based system
- a system or model that utilizes Long Short-Term Memory (LSTM) networks, a type of recurrent neural network (RNN), to process and analyze sequential data
- Rule-based model
- Given a head noun, assign the most frequent classifier associated with it in the training data
Corpus
CCD corpus (ChineseClassifierDataset)
Results
Fine-tuned BERT performed the best, and the LSTM model performed the second best, the unsupervised BERT model followed next, and the rule-based model came last. Fine-tuned BERT struggles to predict classifiers that add information (measurement, plurality, politeness)
Classifier categories
- True classifier, highest frequency, highest accuracy
- Measure words, second frequency, third accuracy
- Dual classifier, third frequency, second accuracy
Distance between the classifier and the head noun
- For correct predictions, the average distance is 1.04, for incorrect predictions, it is 1.15
- Unpaired t-test confirms the distance has a negative effect on the model’s performance (p < 0.001)
- the distances of the correct predictions are shorter than for the incorrect predictions