After several weeks of hard work, the first draft of my presentation slides for the text mining course is mostly done; the talk takes place in mid-to-late January. This time the topic is sequence tagging with little labeled data, and what follows is my survey.

# Outline

- Sequence Tagging
- Semi-supervised Learning
- Transfer Learning
- Conclusions
- References

# Sequence Tagging

### Introduction

**Definition**

Sequence tagging is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values.

**Significance**

Sequence tagging is one of the first stages in most natural language processing pipelines, underlying tasks such as part-of-speech tagging, chunking, and named entity recognition.

**Approaches**

- ~~Traditional models~~
  - ~~Hidden Markov Models~~
  - ~~Conditional Random Fields~~
- Neural network models
  - RNN, LSTM, GRU

### Neural Network Model

### Results

### Sequence Tagging with Little Labeled Data

**Backgrounds**

Although recent neural networks achieve state-of-the-art performance on several sequence tagging tasks, they cannot be applied to tasks with little labeled data.

**Approaches**

- ~~Self-taught learning~~
- ~~Active learning~~
- ~~Transductive learning~~
- Semi-supervised learning
- Transfer learning

# Semi-supervised Learning

### References

**Language Models Added**

- Semi-supervised Multitask Learning for Sequence Labeling. Marek Rei. ACL17.
- Semi-supervised Sequence Tagging with Bidirectional Language Models. Peters et al. ACL17.

**Graph-based**

- Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models. Subramanya et al. EMNLP10.
- Scientific Information Extraction with Semi-supervised Neural Tagging. Luan et al. EMNLP17.
- Graph-based Semi-supervised Acoustic Modeling in DNN-based Speech Recognition. Liu et al. IEEE SLT14.

### Language Models Added

### Language Modeling Objective

\[\begin{array}{l}\overrightarrow{m_t} = \tanh(\overrightarrow{W_m}\,\overrightarrow{h_t})\\\overleftarrow{m_t} = \tanh(\overleftarrow{W_m}\,\overleftarrow{h_t})\\P(w_{t+1}|\overrightarrow{m_t}) = \mathrm{softmax}(\overrightarrow{W_q}\,\overrightarrow{m_t})\\P(w_{t-1}|\overleftarrow{m_t}) = \mathrm{softmax}(\overleftarrow{W_q}\,\overleftarrow{m_t})\\\overrightarrow{E} = -\sum\limits_{t=1}^{T-1}\log(P(w_{t+1}|\overrightarrow{m_t}))\\\overleftarrow{E} = -\sum\limits_{t=2}^{T}\log(P(w_{t-1}|\overleftarrow{m_t}))\\\tilde{E} = E + \gamma(\overrightarrow{E} + \overleftarrow{E})\end{array}\]
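To make the objective concrete, here is a minimal numpy sketch of the forward half of it (matrix names follow the slide's symbols; the dimensions and random inputs are illustrative assumptions, not values from the paper):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward_lm_loss(H, W_m, W_q, next_ids):
    """E_fwd = -sum_t log P(w_{t+1} | m_t), with m_t = tanh(W_m h_t)."""
    M = np.tanh(H @ W_m.T)                     # m_t for t = 1..T-1
    P = softmax(M @ W_q.T)                     # P(. | m_t) over the vocabulary
    return -np.log(P[np.arange(len(next_ids)), next_ids]).sum()

rng = np.random.default_rng(0)
T, d_h, d_m, V = 5, 8, 6, 10
H = rng.normal(size=(T - 1, d_h))              # hidden states h_1..h_{T-1}
W_m = rng.normal(size=(d_m, d_h))
W_q = rng.normal(size=(V, d_m))
next_ids = rng.integers(0, V, size=T - 1)      # the words w_2..w_T to predict
E_fwd = forward_lm_loss(H, W_m, W_q, next_ids)
```

The backward loss is symmetric (predict \(w_{t-1}\) from the backward states), and the two are added to the tagging loss with weight \(\gamma\).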

### Results

### Language Models Added

### Bidirectional Language Model

\[\begin{array}{l}h_k^{LM} = [\overrightarrow{h_k^{LM}};\overleftarrow{h_k^{LM}}]\\h_{k,1} = [\overrightarrow{h_{k,1}};\overleftarrow{h_{k,1}};h_k^{LM}]\end{array}\]

**Alternatives**

- Replace \([\overrightarrow { {h_{k,1}}} ;\overleftarrow { {h_{k,1}}} ;h_k^{LM}]\) with \(f([\overrightarrow { {h_{k,1}}} ;\overleftarrow { {h_{k,1}}} ;h_k^{LM}])\).
- Concatenate the LM embeddings at different locations in the baseline sequence tagger.
- Decrease the number of parameters in the second RNN layer.
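The default combination above is plain concatenation. A minimal numpy sketch (the dimensions 100 and 512 are illustrative assumptions):

```python
import numpy as np

def add_lm_embeddings(h_fwd, h_bwd, lm_fwd, lm_bwd):
    """Build h_{k,1} = [h_fwd ; h_bwd ; h_k^LM], where h_k^LM concatenates
    the pretrained forward and backward language-model states."""
    h_lm = np.concatenate([lm_fwd, lm_bwd], axis=-1)       # h_k^{LM}
    return np.concatenate([h_fwd, h_bwd, h_lm], axis=-1)   # h_{k,1}

T = 4  # sentence length
h1 = add_lm_embeddings(np.zeros((T, 100)), np.zeros((T, 100)),
                       np.ones((T, 512)), np.ones((T, 512)))
# h1 has one row per token, width 100 + 100 + 1024
```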

### Results

### Conclusions

- The language model transfers across domains.
- The model is robust even when trained on a large amount of labeled data.
- Training the sequence tagging model and the language model together increases performance.

### Graph-based

- Steps
  - Construct a graph of tokens based on their semantic similarity.
  - Use the CRF marginals as a regularization term to do label propagation on the graph.
  - The smoothed posteriors are then either interpolated with the CRF marginals or used as an additional feature for the neural network.

- Graph Construction
  - \(w_{uv} = d_e(u,v)\) if \(v \in K(u)\) or \(u \in K(v)\).
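The symmetrized k-NN construction above can be sketched as follows (cosine similarity stands in for the similarity measure \(d_e\) here; that choice and the random inputs are assumptions for illustration):

```python
import numpy as np

def knn_graph(X, k):
    """w_uv = d_e(u, v) if v is among u's k nearest neighbors or vice versa;
    edges are kept in both directions, so the graph is symmetric."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                       # pairwise cosine similarity
    np.fill_diagonal(S, -np.inf)        # no self-edges
    nbrs = np.argsort(-S, axis=1)[:, :k]
    W = np.zeros_like(S)
    for u in range(len(X)):
        for v in nbrs[u]:
            W[u, v] = W[v, u] = S[u, v]
    return W

rng = np.random.default_rng(0)
W = knn_graph(rng.normal(size=(6, 4)), k=2)
```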

- Label Propagation
- Uncertain Label Marginalizing

\[\mathcal{Y}(x_t) = \left\{\begin{array}{ll}\{y_t\} & \text{if } p(y_t|x;\theta) > \eta\\ \text{all label types} & \text{otherwise}\end{array}\right.\]

- Score

\[\phi(y;x,\theta) = \sum\limits_{t=0}^{n} T_{y_t,y_{t+1}} + \sum\limits_{t=1}^{n} P_{t,y_t}\]

- Probability

\[p_\theta(\mathcal{Y}(x^k)|x^k) = \frac{\sum\nolimits_{y^k \in \mathcal{Y}(x^k)} \exp(\phi(y^k;x^k,\theta))}{\sum\nolimits_{y' \in Y} \exp(\phi(y';x^k,\theta))}\]
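A brute-force numpy sketch of this set probability, for intuition only: it enumerates all label sequences, which is exponential; real implementations use the forward algorithm, and the slide's boundary transition at \(t=0\) is omitted here for brevity. All inputs are illustrative assumptions:

```python
import numpy as np
from itertools import product

def phi(y, T_mat, P_mat):
    """Score of a label sequence: transitions T_{y_t, y_{t+1}} plus emissions P_{t, y_t}."""
    return (sum(T_mat[y[t], y[t + 1]] for t in range(len(y) - 1))
            + sum(P_mat[t, y[t]] for t in range(len(y))))

def set_probability(allowed, T_mat, P_mat, n_labels):
    """p_theta(Y(x)|x): total mass of sequences whose label at each position t
    lies in allowed[t], normalized over all sequences."""
    n = len(allowed)
    seqs = list(product(range(n_labels), repeat=n))
    w = {y: np.exp(phi(y, T_mat, P_mat)) for y in seqs}
    Z = sum(w.values())
    num = sum(w[y] for y in seqs if all(y[t] in allowed[t] for t in range(n)))
    return num / Z

rng = np.random.default_rng(0)
L, n = 3, 4
T_mat, P_mat = rng.normal(size=(L, L)), rng.normal(size=(n, L))
p_all = set_probability([set(range(L))] * n, T_mat, P_mat, L)  # every label allowed
p_one = set_probability([{0}] * n, T_mat, P_mat, L)            # confident everywhere
```

When every position is uncertain, `allowed[t]` contains all label types and the probability is 1; when every position is confident, it reduces to the probability of the single sequence.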

### Results

### Conclusions

- In-domain data performs better than cross-domain data.
- The combination of in-domain data and the ULM (uncertain label marginalizing) algorithm performs well.
- In the future, language models could be added to the model to capture context information.

# Transfer Learning

### References

**Cross-domain Transfer**

- Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks. Yang et al. ICLR17.
- Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning. Peng et al. ACL16.
- Multi-task Domain Adaptation for Sequence Tagging. Peng et al. Workshop17.

**Cross-lingual Transfer**

- Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks. Yang et al. ICLR17.
- Cross-Lingual Transfer Learning for POS Tagging without Cross-Lingual Resources. Kim et al. EMNLP17.

### Cross-domain Transfer

- A label mapping exists
- Disparate label sets

### Domain Projections

- Domain Masks

\[\begin{array}{l}m_1 = [\overrightarrow{1},\overrightarrow{1},\overrightarrow{0}],\quad m_2 = [\overrightarrow{1},\overrightarrow{0},\overrightarrow{1}]\\\hat{h} = m_d \odot h\end{array}\]

- Linear Projection

\[\hat{h} = T_d h\]
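Both projections are simple to state in numpy. The sketch below assumes a hidden vector laid out as [shared block | domain-1 block | domain-2 block]; the block sizes and the example vector are illustrative assumptions:

```python
import numpy as np

def domain_mask(h, domain, d_shared, d_dom):
    """hat_h = m_d ⊙ h: keep the shared block plus this domain's private
    block, zeroing out the other domain's dimensions."""
    m = np.zeros_like(h)
    m[:d_shared] = 1.0
    start = d_shared + domain * d_dom
    m[start:start + d_dom] = 1.0
    return m * h

def linear_projection(h, T_d):
    """hat_h = T_d h: a learned per-domain projection instead of a fixed mask."""
    return T_d @ h

h = np.arange(6, dtype=float)                 # [shared(2) | dom-1(2) | dom-2(2)]
h1 = domain_mask(h, 0, d_shared=2, d_dom=2)   # zeros out domain 2's block
h2 = domain_mask(h, 1, d_shared=2, d_dom=2)   # zeros out domain 1's block
```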

### Results

### Conclusions

- Multi-task learning can help domain adaptation.
- The number of shared parameters has a great impact on performance.
- We may use other domain adaptation methods besides parameter sharing and representation learning.

### Cross-lingual Transfer

- Sequence Tagging Loss

\[\mathcal{L}_p = -\sum\limits_{i=1}^{S}\sum\limits_{j=1}^{N} p_{i,j}\log(\hat{p}_{i,j})\]

- Language Classifier Loss

\[\mathcal{L}_a = -\sum\limits_{i=1}^{S} l_i\log(\hat{l}_i)\]

- Bidirectional Language Model Loss

\[\mathcal{L}_l = -\sum\limits_{i=1}^{S}\sum\limits_{j=1}^{N} \log(P(w_{j+1}|f_j)) + \log(P(w_{j-1}|b_j))\]

- Total Loss

\[\mathcal{L} = w_s(\mathcal{L}_p + \lambda\mathcal{L}_a + \lambda\mathcal{L}_l)\]
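The total loss is just a weighted sum of the three terms; a one-line sketch (the example values of \(w_s\) and \(\lambda\) are arbitrary assumptions):

```python
def total_loss(L_p, L_a, L_l, w_s, lam):
    """L = w_s * (L_p + lambda * L_a + lambda * L_l): the tagging loss plus
    the language-classifier and LM losses, both weighted by the same lambda."""
    return w_s * (L_p + lam * L_a + lam * L_l)

L = total_loss(L_p=1.0, L_a=2.0, L_l=3.0, w_s=1.0, lam=0.1)
```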

### Results

### Conclusions

- The language classifier can train the common LSTM to be language-agnostic.
- Either too much or too little labeled data decreases performance.
- Multiple source languages can be used to increase the performance.

# Conclusions

### Semi-supervised Learning vs Transfer Learning

- On some tasks, semi-supervised learning appears to outperform transfer learning.
- Semi-supervised learning is not always applicable, because unlabeled data in the same domain may be lacking.
- Andrew Ng has said that transfer learning will be an important research direction over the next five years.

### Future

- Semi-supervised learning and transfer learning can be combined to increase performance.
- Other methods like active learning can be added.

# References

Xuezhe Ma and Eduard Hovy. (2016). **End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF.** *In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1064–1074, Berlin, Germany, August 7-12, 2016.*

Marek Rei. (2017). **Semi-supervised Multitask Learning for Sequence Labeling.** *In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 2121–2130, Vancouver, Canada, July 30 - August 4, 2017.*

Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, Russell Power. (2017). **Semi-supervised Sequence Tagging with Bidirectional Language Models.** *In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1756–1765, Vancouver, Canada, July 30 - August 4, 2017.*

Yi Luan, Mari Ostendorf, Hannaneh Hajishirzi. (2017). **Scientific Information Extraction with Semi-supervised Neural Tagging.** *In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2631–2641, Copenhagen, Denmark, September 7–11, 2017.*

Zhilin Yang, Ruslan Salakhutdinov, William W. Cohen. (2017). **Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks.** *In ICLR 2017.*

Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, Eric Fosler-Lussier. (2017). **Cross-Lingual Transfer Learning for POS Tagging without Cross-Lingual Resources.** *In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2822–2828, Copenhagen, Denmark, September 7–11, 2017.*

Nanyun Peng, Mark Dredze. (2017). **Multi-task Domain Adaptation for Sequence Tagging.** *In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 91–100, Vancouver, Canada, August 3, 2017.*

Amarnag Subramanya, Slav Petrov, Fernando Pereira. (2010). **Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models.** *In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 167–176, MIT, Massachusetts, USA, 9-11 October 2010.*

Yuzong Liu, Katrin Kirchhoff. (2014). **Graph-based Semi-supervised Acoustic Modeling in DNN-based Speech Recognition.** *In IEEE SLT 2014.*

Nanyun Peng, Mark Dredze. (2016). **Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning.** *In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 149–155, Berlin, Germany, August 7-12, 2016.*