Literature

The Natural Language Decathlon: Multitask Learning as Question Answering. McCann, Bryan; Keskar, Nitish Shirish; Xiong, Caiming; Socher, Richard. 2018.

cite arxiv:1806.08730

Deep learning has improved performance on many natural language processing (NLP) tasks individually. However, general NLP models cannot emerge within a paradigm that focuses on the particularities of a single metric, dataset, and task. We introduce the Natural Language Decathlon (decaNLP), a challenge that spans ten tasks: question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, zero-shot relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution. We cast all tasks as question answering over a context. Furthermore, we present a new Multitask Question Answering Network (MQAN) that jointly learns all tasks in decaNLP without any task-specific modules or parameters in the multitask setting. MQAN shows improvements in transfer learning for machine translation and named entity recognition, domain adaptation for sentiment analysis and natural language inference, and zero-shot capabilities for text classification. We demonstrate that the MQAN's multi-pointer-generator decoder is key to this success and performance further improves with an anti-curriculum training strategy. Though designed for decaNLP, MQAN also achieves state of the art results on the WikiSQL semantic parsing task in the single-task setting. We also release code for procuring and processing data, training and evaluating models, and reproducing all experiments for decaNLP.

@misc{mccann2018natural,
  abstract = {Deep learning has improved performance on many natural language processing (NLP) tasks individually. However, general NLP models cannot emerge within a paradigm that focuses on the particularities of a single metric, dataset, and task. We introduce the Natural Language Decathlon (decaNLP), a challenge that spans ten tasks: question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, zero-shot relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution. We cast all tasks as question answering over a context. Furthermore, we present a new Multitask Question Answering Network (MQAN) that jointly learns all tasks in decaNLP without any task-specific modules or parameters in the multitask setting. MQAN shows improvements in transfer learning for machine translation and named entity recognition, domain adaptation for sentiment analysis and natural language inference, and zero-shot capabilities for text classification. We demonstrate that the MQAN's multi-pointer-generator decoder is key to this success and performance further improves with an anti-curriculum training strategy. Though designed for decaNLP, MQAN also achieves state of the art results on the WikiSQL semantic parsing task in the single-task setting. We also release code for procuring and processing data, training and evaluating models, and reproducing all experiments for decaNLP.},
  author = {McCann, Bryan and Keskar, Nitish Shirish and Xiong, Caiming and Socher, Richard},
  keywords = {neuralnet},
  note = {cite arxiv:1806.08730},
  title = {The Natural Language Decathlon: Multitask Learning as Question Answering},
  year = 2018
}

Attention is all you need. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia. In Advances in Neural Information Processing Systems, pp. 5998–6008. 2017.

@inproceedings{vaswani2017attention,
  author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
  booktitle = {Advances in Neural Information Processing Systems},
  keywords = {sefattention},
  pages = {5998--6008},
  title = {Attention is all you need},
  year = 2017
}

Non-parametric estimation of Jensen-Shannon Divergence in Generative Adversarial Network training. Sinn, Mathieu; Rawat, Ambrish. 2017.

cite arxiv:1705.09199

Generative Adversarial Networks (GANs) have become a widely popular framework for generative modelling of high-dimensional datasets. However their training is well-known to be difficult. This work presents a rigorous statistical analysis of GANs providing straight-forward explanations for common training pathologies such as vanishing gradients. Furthermore, it proposes a new training objective, Kernel GANs, and demonstrates its practical effectiveness on large-scale real-world data sets. A key element in the analysis is the distinction between training with respect to the (unknown) data distribution, and its empirical counterpart. To overcome issues in GAN training, we pursue the idea of smoothing the Jensen-Shannon Divergence (JSD) by incorporating noise in the input distributions of the discriminator. As we show, this effectively leads to an empirical version of the JSD in which the true and the generator densities are replaced by kernel density estimates, which leads to Kernel GANs.

@misc{sinn2017nonparametric,
  abstract = {Generative Adversarial Networks (GANs) have become a widely popular framework for generative modelling of high-dimensional datasets. However their training is well-known to be difficult. This work presents a rigorous statistical analysis of GANs providing straight-forward explanations for common training pathologies such as vanishing gradients. Furthermore, it proposes a new training objective, Kernel GANs, and demonstrates its practical effectiveness on large-scale real-world data sets. A key element in the analysis is the distinction between training with respect to the (unknown) data distribution, and its empirical counterpart. To overcome issues in GAN training, we pursue the idea of smoothing the Jensen-Shannon Divergence (JSD) by incorporating noise in the input distributions of the discriminator. As we show, this effectively leads to an empirical version of the JSD in which the true and the generator densities are replaced by kernel density estimates, which leads to Kernel GANs.},
  author = {Sinn, Mathieu and Rawat, Ambrish},
  keywords = {neuralnet},
  note = {cite arxiv:1705.09199},
  title = {Non-parametric estimation of Jensen-Shannon Divergence in Generative Adversarial Network training},
  year = 2017
}

Wasserstein GAN. Arjovsky, Martin; Chintala, Soumith; Bottou, Léon. 2017.

cite arxiv:1701.07875

We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.

@misc{arjovsky2017wasserstein,
  abstract = {We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.},
  author = {Arjovsky, Martin and Chintala, Soumith and Bottou, Léon},
  keywords = {wassersteingan},
  note = {cite arxiv:1701.07875},
  title = {Wasserstein GAN},
  year = 2017
}

DRAGNN: A Transition-based Framework for Dynamically Connected Neural Networks. Kong, Lingpeng; Alberti, Chris; Andor, Daniel; Bogatyy, Ivan; Weiss, David. 2017.

cite arxiv:1703.04474. Comment: 10 pages; Submitted for review to ACL2017

In this work, we present a compact, modular framework for constructing novel recurrent neural architectures. Our basic module is a new generic unit, the Transition Based Recurrent Unit (TBRU). In addition to hidden layer activations, TBRUs have discrete state dynamics that allow network connections to be built dynamically as a function of intermediate activations. By connecting multiple TBRUs, we can extend and combine commonly used architectures such as sequence-to-sequence, attention mechanisms, and re-cursive tree-structured models. A TBRU can also serve as both an encoder for downstream tasks and as a decoder for its own task simultaneously, resulting in more accurate multi-task learning. We call our approach Dynamic Recurrent Acyclic Graphical Neural Networks, or DRAGNN. We show that DRAGNN is significantly more accurate and efficient than seq2seq with attention for syntactic dependency parsing and yields more accurate multi-task learning for extractive summarization tasks.

@misc{kong2017dragnn,
  abstract = {In this work, we present a compact, modular framework for constructing novel recurrent neural architectures. Our basic module is a new generic unit, the Transition Based Recurrent Unit (TBRU). In addition to hidden layer activations, TBRUs have discrete state dynamics that allow network connections to be built dynamically as a function of intermediate activations. By connecting multiple TBRUs, we can extend and combine commonly used architectures such as sequence-to-sequence, attention mechanisms, and re-cursive tree-structured models. A TBRU can also serve as both an encoder for downstream tasks and as a decoder for its own task simultaneously, resulting in more accurate multi-task learning. We call our approach Dynamic Recurrent Acyclic Graphical Neural Networks, or DRAGNN. We show that DRAGNN is significantly more accurate and efficient than seq2seq with attention for syntactic dependency parsing and yields more accurate multi-task learning for extractive summarization tasks.},
  author = {Kong, Lingpeng and Alberti, Chris and Andor, Daniel and Bogatyy, Ivan and Weiss, David},
  keywords = {neuralnet},
  note = {cite arxiv:1703.04474. Comment: 10 pages; Submitted for review to ACL2017},
  title = {DRAGNN: A Transition-based Framework for Dynamically Connected Neural Networks},
  year = 2017
}

Improved Training of Wasserstein GANs. Gulrajani, Ishaan; Ahmed, Faruk; Arjovsky, Martín; Dumoulin, Vincent; Courville, Aaron C. In CoRR, abs/1704.00028. 2017.

@article{DBLP:journals/corr/GulrajaniAADC17,
  author = {Gulrajani, Ishaan and Ahmed, Faruk and Arjovsky, Mart{\'{\i}}n and Dumoulin, Vincent and Courville, Aaron C.},
  journal = {CoRR},
  keywords = {wassersteingan},
  title = {Improved Training of Wasserstein GANs},
  volume = {abs/1704.00028},
  year = 2017
}

Derivation of Backpropagation in Convolutional Neural Network (CNN). Zhang, Zhifei. p. 7. 2016.

@article{noauthororeditor,
  author = {Zhang, Zhifei},
  keywords = {backpropagation},
  month = 10,
  pages = 7,
  title = {Derivation of Backpropagation in Convolutional Neural Network (CNN)},
  year = 2016
}

ConceptRDF: An RDF presentation of ConceptNet knowledge base. Najmi, Erfan; Malik, Zaki; Hashmi, Khayyam; Rezgui, Abdelmounaam. In Information and Communication Systems (ICICS), 2016 7th International Conference on, pp. 145–150. IEEE, 2016.

@inproceedings{najmi2016conceptrdf,
  author = {Najmi, Erfan and Malik, Zaki and Hashmi, Khayyam and Rezgui, Abdelmounaam},
  booktitle = {Information and Communication Systems (ICICS), 2016 7th International Conference on},
  keywords = {conceptnet},
  organization = {IEEE},
  pages = {145--150},
  title = {ConceptRDF: An RDF presentation of ConceptNet knowledge base},
  year = 2016
}

Enriching Word Vectors with Subword Information. Bojanowski, Piotr; Grave, Edouard; Joulin, Armand; Mikolov, Tomas. 2016.

cite arxiv:1607.04606. Comment: Accepted to TACL. The two first authors contributed equally

Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character $n$-grams. A vector representation is associated to each character $n$-gram; words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.

@misc{bojanowski2016enriching,
  abstract = {Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character $n$-grams. A vector representation is associated to each character $n$-gram; words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.},
  author = {Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  keywords = {word2vec},
  note = {cite arxiv:1607.04606. Comment: Accepted to TACL. The two first authors contributed equally},
  title = {Enriching Word Vectors with Subword Information},
  year = 2016
}

Globally Normalized Transition-Based Neural Networks. Andor, Daniel; Alberti, Chris; Weiss, David; Severyn, Aliaksei; Presta, Alessandro; Ganchev, Kuzman; Petrov, Slav; Collins, Michael. 2016.

cite arxiv:1603.06042

We introduce a globally normalized transition-based neural network model that achieves state-of-the-art part-of-speech tagging, dependency parsing and sentence compression results. Our model is a simple feed-forward neural network that operates on a task-specific transition system, yet achieves comparable or better accuracies than recurrent models. We discuss the importance of global as opposed to local normalization: a key insight is that the label bias problem implies that globally normalized models can be strictly more expressive than locally normalized models.

@misc{andor2016globally,
  abstract = {We introduce a globally normalized transition-based neural network model that achieves state-of-the-art part-of-speech tagging, dependency parsing and sentence compression results. Our model is a simple feed-forward neural network that operates on a task-specific transition system, yet achieves comparable or better accuracies than recurrent models. We discuss the importance of global as opposed to local normalization: a key insight is that the label bias problem implies that globally normalized models can be strictly more expressive than locally normalized models.},
  author = {Andor, Daniel and Alberti, Chris and Weiss, David and Severyn, Aliaksei and Presta, Alessandro and Ganchev, Kuzman and Petrov, Slav and Collins, Michael},
  keywords = {gpugrant},
  note = {cite arxiv:1603.06042},
  title = {Globally Normalized Transition-Based Neural Networks},
  year = 2016
}

An Ensemble Method to Produce High-Quality Word Embeddings. Speer, Robert; Chin, Joshua. 2016.

cite arxiv:1604.01692. Comment: 12 pages, 3 figures

A currently successful approach to computational semantics is to represent words as embeddings in a machine-learned vector space. We present an ensemble method that combines embeddings produced by GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013) with structured knowledge from the semantic networks ConceptNet (Speer and Havasi, 2012) and PPDB (Ganitkevitch et al., 2013), merging their information into a common representation with a large, multilingual vocabulary. The embeddings it produces achieve state-of-the-art performance on many word-similarity evaluations. Its score of $\rho = .596$ on an evaluation of rare words (Luong et al., 2013) is 16% higher than the previous best known system.

@misc{speer2016ensemble,
  abstract = {A currently successful approach to computational semantics is to represent words as embeddings in a machine-learned vector space. We present an ensemble method that combines embeddings produced by GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013) with structured knowledge from the semantic networks ConceptNet (Speer and Havasi, 2012) and PPDB (Ganitkevitch et al., 2013), merging their information into a common representation with a large, multilingual vocabulary. The embeddings it produces achieve state-of-the-art performance on many word-similarity evaluations. Its score of $\rho = .596$ on an evaluation of rare words (Luong et al., 2013) is 16% higher than the previous best known system.},
  author = {Speer, Robert and Chin, Joshua},
  keywords = {retrofitting},
  note = {cite arxiv:1604.01692. Comment: 12 pages, 3 figures},
  title = {An Ensemble Method to Produce High-Quality Word Embeddings},
  year = 2016
}

Neural Architectures for Named Entity Recognition. Lample, Guillaume; Ballesteros, Miguel; Subramanian, Sandeep; Kawakami, Kazuya; Dyer, Chris. In CoRR, abs/1603.01360. 2016.

@article{DBLP:journals/corr/LampleBSKD16,
  author = {Lample, Guillaume and Ballesteros, Miguel and Subramanian, Sandeep and Kawakami, Kazuya and Dyer, Chris},
  journal = {CoRR},
  keywords = {mlnlp},
  title = {Neural Architectures for Named Entity Recognition},
  volume = {abs/1603.01360},
  year = 2016
}

LSTM: A Search Space Odyssey. Greff, Klaus; Srivastava, Rupesh Kumar; Koutník, Jan; Steunebrink, Bas R.; Schmidhuber, Jürgen. In CoRR, abs/1503.04069. 2015.

@article{journals/corr/GreffSKSS15,
  author = {Greff, Klaus and Srivastava, Rupesh Kumar and Koutník, Jan and Steunebrink, Bas R. and Schmidhuber, Jürgen},
  journal = {CoRR},
  keywords = {lstm},
  title = {LSTM: A Search Space Odyssey},
  volume = {abs/1503.04069},
  year = 2015
}

An Empirical Exploration of Recurrent Network Architectures. Józefowicz, Rafal; Zaremba, Wojciech; Sutskever, Ilya. In ICML, Vol. 37 of JMLR Workshop and Conference Proceedings, F. R. Bach, D. M. Blei (eds.), pp. 2342–2350. JMLR.org, 2015.

@inproceedings{conf/icml/JozefowiczZS15,
  author = {Józefowicz, Rafal and Zaremba, Wojciech and Sutskever, Ilya},
  booktitle = {ICML},
  crossref = {conf/icml/2015},
  editor = {Bach, Francis R. and Blei, David M.},
  keywords = {lstm},
  pages = {2342--2350},
  publisher = {JMLR.org},
  series = {JMLR Workshop and Conference Proceedings},
  title = {An Empirical Exploration of Recurrent Network Architectures},
  volume = 37,
  year = 2015
}

Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network. Wang, Peilu; Qian, Yao; Soong, Frank K.; He, Lei; Zhao, Hai. In CoRR, abs/1510.06168. 2015.

@article{DBLP:journals/corr/WangQSHZ15,
  author = {Wang, Peilu and Qian, Yao and Soong, Frank K. and He, Lei and Zhao, Hai},
  journal = {CoRR},
  keywords = {ner},
  title = {Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network},
  volume = {abs/1510.06168},
  year = 2015
}

Convolutional Neural Networks for Sentence Classification. Kim, Yoon. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1746–1751. 2014.

@inproceedings{kim2014convolutional,
  author = {Kim, Yoon},
  booktitle = {Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, {EMNLP} 2014, October 25-29, 2014, Doha, Qatar, {A} meeting of SIGDAT, a Special Interest Group of the {ACL}},
  keywords = {gpu},
  pages = {1746--1751},
  title = {Convolutional Neural Networks for Sentence Classification},
  year = 2014
}

Glove: Global Vectors for Word Representation. Pennington, Jeffrey; Socher, Richard; Manning, Christopher D. In EMNLP, Vol. 14, pp. 1532–1543. 2014.

@inproceedings{pennington2014glove,
  author = {Pennington, Jeffrey and Socher, Richard and Manning, Christopher D},
  booktitle = {EMNLP},
  keywords = {glove},
  pages = {1532--1543},
  title = {Glove: Global Vectors for Word Representation},
  volume = 14,
  year = 2014
}

Generative Adversarial Networks. Goodfellow, Ian J.; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua. 2014.

cite arxiv:1406.2661

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

@misc{goodfellow2014generative,
  abstract = {We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.},
  author = {Goodfellow, Ian J. and Pouget-Abadie, Jean and Mirza, Mehdi and Xu, Bing and Warde-Farley, David and Ozair, Sherjil and Courville, Aaron and Bengio, Yoshua},
  keywords = {neuralnet},
  note = {cite arxiv:1406.2661},
  title = {Generative Adversarial Networks},
  year = 2014
}

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua. 2014.

cite arxiv:1406.1078. Comment: EMNLP 2014

In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

@misc{cho2014learning,
  abstract = {In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.},
  author = {Cho, Kyunghyun and van Merrienboer, Bart and Gulcehre, Caglar and Bahdanau, Dzmitry and Bougares, Fethi and Schwenk, Holger and Bengio, Yoshua},
  keywords = {neuralnets},
  note = {cite arxiv:1406.1078. Comment: EMNLP 2014},
  title = {Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation},
  year = 2014
}

Retrofitting Word Vectors to Semantic Lexicons. Faruqui, Manaal; Dodge, Jesse; Jauhar, Sujay K.; Dyer, Chris; Hovy, Eduard; Smith, Noah A. 2014.

cite arxiv:1411.4166. Comment: Proceedings of NAACL 2015

Vector space word representations are learned from distributional information of words in large corpora. Although such statistics are semantically informative, they disregard the valuable information that is contained in semantic lexicons such as WordNet, FrameNet, and the Paraphrase Database. This paper proposes a method for refining vector space representations using relational information from semantic lexicons by encouraging linked words to have similar vector representations, and it makes no assumptions about how the input vectors were constructed. Evaluated on a battery of standard lexical semantic evaluation tasks in several languages, we obtain substantial improvements starting with a variety of word vector models. Our refinement method outperforms prior techniques for incorporating semantic lexicons into the word vector training algorithms.

@misc{faruqui2014retrofitting,
  abstract = {Vector space word representations are learned from distributional information of words in large corpora. Although such statistics are semantically informative, they disregard the valuable information that is contained in semantic lexicons such as WordNet, FrameNet, and the Paraphrase Database. This paper proposes a method for refining vector space representations using relational information from semantic lexicons by encouraging linked words to have similar vector representations, and it makes no assumptions about how the input vectors were constructed. Evaluated on a battery of standard lexical semantic evaluation tasks in several languages, we obtain substantial improvements starting with a variety of word vector models. Our refinement method outperforms prior techniques for incorporating semantic lexicons into the word vector training algorithms.},
  author = {Faruqui, Manaal and Dodge, Jesse and Jauhar, Sujay K. and Dyer, Chris and Hovy, Eduard and Smith, Noah A.},
  keywords = {retrofitting},
  note = {cite arxiv:1411.4166. Comment: Proceedings of NAACL 2015},
  title = {Retrofitting Word Vectors to Semantic Lexicons},
  year = 2014
}

Neural Machine Translation by Jointly Learning to Align and Translate. Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua. 2014.

cite arxiv:1409.0473. Comment: Accepted at ICLR 2015 as oral presentation

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

@misc{bahdanau2014neural,
  abstract = {Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.},
  author = {Bahdanau, Dzmitry and Cho, Kyunghyun and Bengio, Yoshua},
  keywords = {attention},
  note = {cite arxiv:1409.0473. Comment: Accepted at ICLR 2015 as oral presentation},
  title = {Neural Machine Translation by Jointly Learning to Align and Translate},
  year = 2014
}

Training recurrent neural networks. Sutskever, Ilya. PhD thesis, University of Toronto, Toronto, Ont., Canada. 2013.

@phdthesis{sutskever2013training,
  author = {Sutskever, Ilya},
  school = {University of Toronto},
  address = {Toronto, Ont., Canada},
  keywords = {phd},
  title = {Training recurrent neural networks},
  year = 2013
}

Integration of world knowledge for natural language understanding. Ovchinnikova, Ekaterina. Vol. 3. Springer Science & Business Media, 2012.

@book{ovchinnikova2012integration,
  author = {Ovchinnikova, Ekaterina},
  keywords = {worldknowledge},
  publisher = {Springer Science \& Business Media},
  title = {Integration of world knowledge for natural language understanding},
  volume = 3,
  year = 2012
}

Natural Language Understanding and World Knowledge. Ovchinnikova, Ekaterina. In Integration of World Knowledge for Natural Language Understanding, pp. 15–37. Atlantis Press, Paris, 2012.

In artificial intelligence and computational linguistics, natural language understanding (NLU) is a subfield of natural language processing that deals with machine reading comprehension. The goal of an NLU system is to interpret an input text fragment. The process of interpretation can be viewed as a translation of the text from a natural language to a representation in an unambiguous formal language. This representation, supposed to express the text's content, is further used for performing concrete tasks implied by a user request.

@inbook{Ovchinnikova2012,
  abstract = {In artificial intelligence and computational linguistics, natural language understanding (NLU) is a subfield of natural language processing that deals with machine reading comprehension. The goal of an NLU system is to interpret an input text fragment. The process of interpretation can be viewed as a translation of the text from a natural language to a representation in an unambiguous formal language. This representation, supposed to express the text's content, is further used for performing concrete tasks implied by a user request.},
  address = {Paris},
  author = {Ovchinnikova, Ekaterina},
  booktitle = {Integration of World Knowledge for Natural Language Understanding},
  keywords = {worldknowledge},
  pages = {15--37},
  publisher = {Atlantis Press},
  title = {Natural Language Understanding and World Knowledge},
  year = 2012
}

On the difficulty of training Recurrent Neural Networks. Pascanu, Razvan; Mikolov, Tomas; Bengio, Yoshua. 2012.

cite arxiv:1211.5063. Comment: Improved description of the exploding gradient problem and description and analysis of the vanishing gradient problem

There are two widely known issues with properly training Recurrent Neural Networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.

@misc{pascanu2012difficulty,
  abstract = {There are two widely known issues with properly training Recurrent Neural Networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.},
  author = {Pascanu, Razvan and Mikolov, Tomas and Bengio, Yoshua},
  keywords = {deep_learning},
  note = {cite arxiv:1211.5063. Comment: Improved description of the exploding gradient problem and description and analysis of the vanishing gradient problem},
  title = {On the difficulty of training Recurrent Neural Networks},
  year = 2012
}

BLEU: a method for automatic evaluation of machine translation. Papineni, Kishore; Roukos, Salim; Ward, Todd; Zhu, Wei-Jing. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, 2002.

@inproceedings{papineni2002bleu,
  author = {Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing},
  booktitle = {Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics},
  keywords = {translation},
  organization = {Association for Computational Linguistics},
  pages = {311--318},
  title = {BLEU: a method for automatic evaluation of machine translation},
  year = 2002
}
