The Mutualistic Relationship of Psycholinguistics and Natural Language Processing
Natural language processing (NLP) is a field of computer science concerned with modeling how humans process, parse, and interpret language, and with the applications of those models. NLP draws on mathematics, logic, statistics, and psycholinguistics, as it requires knowledge of probability distributions, data structures and associative architectures, cognitive processes, and the classification of grammars. It is currently producing technological breakthroughs in voice recognition, personal assistants, and online translation, among other systems. The development of techniques and models in natural language processing helps inform, revise, and validate the plausibility of psycholinguistic theories of human sentence parsing, and vice versa. In this paper, I will focus specifically on connectionist models of syntactic and semantic parsing, and on how the incremental development of these parsers provides evidence for the hypothesis of an immediate, non-modular, multiple-constraint parsing mechanism in humans. Palmer-Brown et al. provide a more comprehensive literature review supporting the same thesis; my goal is to further contextualize a few of the studies they reference and cover them in greater detail. I intend to answer three big questions. First, what principles of psycholinguistics have informed the structures used in NLP? Second, what architectures, networks, and models have been proposed within NLP to accomplish language processing tasks? Finally, what questions in psycholinguistics can NLP be used to test?
The current model of language comprehension holds that we process visual text or auditory stimuli through several layers of analysis and abstraction. Upon receiving phonological input, words consistent with that input are activated, and as more information arrives we eliminate neighbors and competitors of the target word. After identifying an individual word, we gain access to its possible syntactic and semantic properties, which we use to build a model of the incoming sequence of words. The words are assembled into a complex structure imbued with relationships between subjects, objects, properties, and actions. In the absence of sufficient information to reach a conclusive interpretation, pragmatic context and other non-linguistic cues are used to disambiguate the meaning. This entire process is governed by the associative, massively parallel network of neurons that composes the brain. An essentially fixed set of neurons forms excitatory and inhibitory connections with neighboring neurons based on patterns of activation due to stimuli. This model is referred to as connectionism, and it is extremely important in the formulation of NLP. The result of this biological mechanism is a network of more and less connected neurons, each able to inform and be informed by the activations of many others. For example, activation of a concept like “cat” implies activation of related concepts: properties like “fur” or “cuteness,” or categorically similar concepts like “dog” or “mouse.” The mental lexicon is thus a web of concepts and relations, loosely defined in terms of prototypical categories and features of real-world referents. However, many details of this model have multiple competing hypotheses, which I will discuss later. To help resolve and evaluate the plausibility of these hypotheses, we can model the comprehension systems with NLP and test their computational power to represent and parse linguistic data.
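To make the associative web concrete, here is a minimal sketch of spreading activation over a toy lexicon. The concepts, edge weights, and decay factor are illustrative assumptions, not values from any cited model.

```python
# A minimal sketch of spreading activation in a toy mental lexicon.
# Concepts, edge weights, and the decay factor are illustrative assumptions.

lexicon = {
    "cat":   {"fur": 0.8, "cuteness": 0.6, "dog": 0.5, "mouse": 0.5},
    "dog":   {"fur": 0.8, "cat": 0.5},
    "mouse": {"cat": 0.5},
    "fur": {}, "cuteness": {},
}

def spread(source, steps=2, decay=0.5):
    """Activate `source` fully, then pass damped activation to neighbors."""
    activation = {concept: 0.0 for concept in lexicon}
    activation[source] = 1.0
    frontier = {source}
    for _ in range(steps):
        next_frontier = set()
        for concept in frontier:
            for neighbor, weight in lexicon[concept].items():
                boost = activation[concept] * weight * decay
                if boost > activation[neighbor]:
                    activation[neighbor] = boost
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return activation

print(spread("cat"))   # "fur", "dog", and "mouse" receive partial activation
```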
In order to understand computational parsing models, we must understand what parsing is and how it works. Chomsky formalized various classes of language by their grammars: finite sets of rules governing how a finite vocabulary can be assembled into sentences. Relevant to NLP and linguistics are context-free grammars (CFGs), which have a finite set of terminal symbols (atomic elements), a finite set of nonterminal symbols, and rewrite rules that decompose a nonterminal symbol into a string of terminals and nonterminals. The problem with CFGs is that they are incomplete: the structures and functions of the phrases and symbols of natural language depend on the actual content and words used. Natural languages can instead be thought of as having mildly context-sensitive grammars, which depend on features such as person/number/gender agreement, verb subcategorization (verbs are associated with specific frames requiring certain combinations of NPs), special-case reordering of elements, and nesting [1]. Given a grammar, a parser is a function or algorithm that “assigns a sentence to one or more structural descriptions according to the grammar” [1]. The decomposition of a sentence structure can be shown as a syntactic tree, with the topmost node representing the start symbol or sentence element, and every branched node a component of an upper-level node, such that all the leaves are terminal symbols.
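To illustrate how a grammar and parser fit together, here is a minimal sketch of a toy CFG and a naive top-down parser; the grammar, lexicon, and sentence are invented for illustration and are far smaller than anything a real system would use.

```python
# A toy context-free grammar and a naive top-down (recursive-descent) parser.
# The grammar, lexicon, and sentence are invented for illustration.

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "the": "Det", "a": "Det",
    "cat": "N", "mouse": "N",
    "chased": "V", "slept": "V",
}

def parse(symbol, words, i):
    """Try to expand `symbol` at position i; return (tree, next_i) or None."""
    if symbol in GRAMMAR:                    # nonterminal: try each rewrite rule
        for rule in GRAMMAR[symbol]:
            children, j = [], i
            for child in rule:
                result = parse(child, words, j)
                if result is None:
                    break
                subtree, j = result
                children.append(subtree)
            else:                            # every child of this rule matched
                return (symbol, children), j
        return None
    # terminal symbol: the current word's part of speech must match
    if i < len(words) and LEXICON.get(words[i]) == symbol:
        return (symbol, words[i]), i + 1
    return None

words = "the cat chased a mouse".split()
tree, end = parse("S", words, 0)
assert end == len(words)                     # a full parse consumes every word
print(tree)                                  # the syntactic tree as nested tuples
```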
There are two prevalent theories of how humans accomplish this. The first is a two-step modular theory in which units are assembled into a purely syntactic structure and then reevaluated in light of semantic and pragmatic information; it postulates that syntactic function-based and semantic meaning-based processing of input are encapsulated in separate modules. The second theory argues that all relevant information associated with a word (lexical, syntactic, semantic, contextual, stochastic co-occurrences with prior and upcoming words) is integrated simultaneously to generate an expected interpretation of the sentence. Although empirical psycholinguistic evidence exists to support both, it is beginning to converge on the latter, especially in experiments analyzing the anticipatory effects of garden-path sentences and ambiguities, where the meaning of the sentence-so-far affects expectations for upcoming words. A single sentence may have more than one corresponding syntactic tree, causing such ambiguity. The success of multiple-constraint models in natural language processing simply provides more evidence in favor of this view. Testing parsers on their ability to parse garden-path sentences and to behave normally with respect to other such phenomena indicates the effectiveness of a particular model.
The question for the computational researcher is how to represent the complex structure and processing behaviors of the brain with computers. The answer: connectionism in neural networks. In the same way that a neuron is the basic unit of a network of neurons, a perceptron is the fundamental unit of a neural network. A perceptron takes a vector of input features and maps it to a single output value (typically a value between 0 and 1, though it can be any real value or a discrete binary value depending on the model). The mapping occurs through a weighted sum: each input feature is multiplied by a connection weight and the products are summed, parallel to how relative synaptic strengths determine the magnitude of the overall post-synaptic neuronal response in the brain [11]. This sum is then passed through a non-linear function, typically a sigmoid. Now suppose that instead of a single output value, you have an entire layer of output values, each determined by the input layer with a different weighting scheme. This layer of output values can capture patterns found in the input layer, making its features more abstract than those of the first layer. A neural network can have any number of these layers, each “hidden” layer (not actually providing output, but producing intermediate states) increasing the computational power of the network, and can be used to process any type of linguistic entity. A fairly straightforward application of a basic neural network to linguistic processing tasks is found in McClelland and Elman's TRACE model. It takes small pieces of phonological data as input; these features map to a layer of units representing the various possible phonemes, each of which activates a cohort of words containing that phoneme. With enough phonemic activation, a lexical decision can be made. Typical linguistic parsing tasks performed by neural networks are POS tagging (giving each word in a sentence a unique tag describing its syntactic role), chunking (equivalent to delineating the syntactic tree), and named entity recognition (similar to chunking but prescribing more semantic and predicate roles).
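The following is a minimal sketch of a perceptron-style unit and a single hidden layer, assuming sigmoid activations; the sizes and random weights are arbitrary toy choices, not parameters from TRACE or any other cited model.

```python
import numpy as np

# A perceptron-style unit and one hidden layer with sigmoid activations.
# Sizes and random weights are arbitrary toy values.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.2, 0.9, 0.1])            # input feature vector

# One unit: a weighted sum of the inputs plus a bias, squashed by the
# nonlinearity, analogous to synaptic strengths shaping a neuron's response.
w, b = rng.normal(size=3), 0.0
y = sigmoid(w @ x + b)

# A whole layer: each of four hidden units weights the same input differently,
# producing a more abstract re-representation of the input pattern.
W, b_vec = rng.normal(size=(4, 3)), np.zeros(4)
hidden = sigmoid(W @ x + b_vec)
print(y, hidden)
```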
The ability to parse sentences begins to develop in infancy, using segmentation and recognition of speech patterns to form a stochastic model. With time, individual words and strings of words begin to emerge according to rules learned from the positive examples in the child's environment [3]. Still learning to apply the patterns they have recognized to actual speech production, and lacking the nuance of irregular forms in their representation of the language, children overregularize their speech, applying general affixes and constructions where the adult language uses irregular forms (saying “goed” for “went,” for example) [3]. This observation demonstrates two things. First, language learning is a process requiring feedback. Second, contrary to nativist belief, rules are not and should not be pre-programmed into a linguistic interpreting system; rules are learned over time and embedded in the system by means of association weights between lower-level and higher-level features. There are two main forms of representation in a neural network layer: localist networks map each output node one-to-one to a corresponding concept, with all other output nodes silenced, while distributed networks represent a word or concept as different non-binary levels of activation over all output nodes. We will see how well various instances of neural network parsers account for these phenomena as a means of evaluating their realism.
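The contrast between the two forms of representation can be sketched directly; the vocabulary and feature values below are invented for illustration.

```python
import numpy as np

# A toy contrast between localist and distributed representations.
# The vocabulary and feature values are made up for illustration.

vocab = ["cat", "dog", "mouse"]

# Localist: one node per concept; "cat" is a single active unit.
localist_cat = np.array([1.0, 0.0, 0.0])

# Distributed: each concept is a graded pattern over shared feature nodes,
# so related concepts ("cat", "dog") overlap in their activations.
#                      furry  small  barks  meows
distributed = {
    "cat":   np.array([0.9, 0.7, 0.0, 0.9]),
    "dog":   np.array([0.9, 0.3, 0.9, 0.0]),
    "mouse": np.array([0.6, 1.0, 0.0, 0.0]),
}

def similarity(a, b):
    """Cosine similarity: overlapping patterns score near 1."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(similarity(distributed["cat"], distributed["dog"]))    # related concepts
print(similarity(distributed["cat"], distributed["mouse"]))
```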
An early localist syntactic parser called PALS, an iteration of CONPARSE developed by Charniak and Santos (at Brown!), represents a parse tree with a fixed-width matrix in which each column shows the path from the start node down to a terminal node, the part of speech serving as input. Each nonterminal entry in the CONPARSE matrix is drawn from a finite list of possible nonterminal entries; to determine the value of a unit, a collection of language “rule nodes” is evaluated at the input node (and possibly at other nodes, using other already-determined information about the sequence and structure) and weighted by corresponding “rule weights” to converge on a single node. The part-of-speech atoms in the input sequence are slid one at a time into the rightmost column of the matrix for processing, shifting previous elements to the left, so that the entire sequence continues to be processed together with the rule nodes. To improve on the fixed behavior of CONPARSE, PALS allows back-propagation to update the rule weights so that the network converges on the desired tree-building behavior. Back-propagation is a concept in artificial intelligence, related to supervised and reinforcement learning, in which the output layer produced during a training round is compared to a known desired output to yield an error, and the weights used to compute the hidden layers are adjusted along the gradient of the error function in an effort to minimize mislabeling. Neural networks are trained on a subset of the data (sometimes with labels, as in reinforcement and supervised learning; sometimes without, as in unsupervised learning) before being tested on the rest, so that the network “learns” the proper associations and rules. Including some form of feedback in a neural net allows the system to learn rules dynamically instead of having the creator of the system embed premade rules and relationships into the structure. This makes PALS more effective than the original CONPARSE; however, a few limitations remain. The rule nodes, with the rules themselves, must still be specified to the initial system: only the application of those rules is learned, and we know this is not the case in real language learning. Also, the fixed width of the matrix forces a sentence to be chunked sequentially, making it difficult to handle heavily irregular sentence structures with complex nested and embedded dependencies. Finally, while the localist structure may be intuitive to us (one concept, one node) and seems to work well on small finite-state grammars, it does not fit well with neurological phenomena, where activation of a concept is more an activation of a pattern or region of neurons than of a single neuron. If every concept were mapped to a single neuron, a new neuron would have to be built and taught all of the rules for every new concept, requiring a potentially infinite number of nodes.
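Here is a minimal sketch of back-propagation through one hidden layer, in the spirit of how PALS learns its rule weights; the architecture, learning rate, and training pair are toy assumptions, not the actual PALS network.

```python
import numpy as np

# Back-propagation through one hidden layer: compare the output to a known
# target, then move the weights down the gradient of the squared error.
# The architecture, learning rate, and training pair are toy assumptions.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.random(5)                   # an input pattern (e.g., encoded POS cues)
target = np.array([1.0, 0.0])       # the known desired output for this input

W1, W2 = rng.normal(size=(4, 5)), rng.normal(size=(2, 4))
lr = 0.5
for step in range(1000):
    h = sigmoid(W1 @ x)             # hidden layer
    y = sigmoid(W2 @ h)             # output layer
    err = y - target                # error against the desired output
    # Deltas flow backward through the sigmoid derivatives...
    delta_out = err * y * (1 - y)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    # ...and each weight matrix steps down the error gradient.
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)

print(sigmoid(W2 @ sigmoid(W1 @ x)))  # close to the target after training
```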
Now we turn our attention to two proposed distributed architectures for parsing: the simple recurrent network and recursive auto-associative memory. A simple recurrent network (SRN), introduced by Elman in 1988 and tested by McClelland et al. in 1989, is able to predict successive elements of a sequence by copying the internal activation of the hidden layer at the previous time step into a register that is processed alongside new input at the current time step. In this way, the network can remember recent information, with retention decreasing as time goes on, just as our own working memory does. The performance of this SRN was excellent on finite-state grammars, successfully predicting syntactic elements in an “infinite corpus of strings … after training on a finite set of exemplars with a learning algorithm that is local in time.” With a small number of hidden units, the internal representation of features corresponded to actual features of the grammar, such that similarly structured sentences had similar representations, especially in the hidden layers (remember that these representations are now distributions over nodes). This pattern degrades with a larger number of hidden units, and the SRN had some trouble with long-distance dependencies between syntactic elements when information relevant to the dependency was not also relevant to the intermediate states.
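A minimal sketch of the SRN's forward pass, showing the copy of the hidden state into the context register; the sizes, weights, and input sequence are toy assumptions, and no training is shown.

```python
import numpy as np

# One pass of an Elman-style simple recurrent network: the previous hidden
# state is copied into a context register and processed with the new input.
# Sizes, weights, and the input sequence are toy assumptions; no training.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_in, n_hid, n_out = 3, 5, 3
W_in  = rng.normal(size=(n_hid, n_in))
W_ctx = rng.normal(size=(n_hid, n_hid))    # weights from the context register
W_out = rng.normal(size=(n_out, n_hid))

context = np.zeros(n_hid)                  # the register starts empty
for i in (0, 2, 1):                        # a toy sequence of one-hot symbols
    x = np.eye(n_in)[i]
    hidden = sigmoid(W_in @ x + W_ctx @ context)
    prediction = sigmoid(W_out @ hidden)   # a guess at the next element
    context = hidden.copy()                # copy the hidden state forward
    print(i, prediction)
```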
Recursive auto-associative memory (RAAM) is outlined by Pollack, along with its uses as a lexical, syntactic, and semantic parser. RAAM uses a neural network to uniquely encode structural information into a condensed form. It differs from the other networks I have discussed in that it concerns itself mainly with maintaining internal representations of structures such that they can be used as nodes at a higher level of the network. For example, to encode the syntactic binary tree ((A B)(C D)), where sibling nodes are represented in tuple form, the RAAM maps the symbolic representation of (A B) onto a hidden layer of fewer nodes, which in turn maps onto an output layer of the original size; the expected output of this mapping is the original representation of (A B). The same is done for (C D), so that we have two hidden-layer encodings of these pairs of nodes. Then the tuple of hidden-layer encodings acts as the input layer of the next network, again with the goal of producing an internal representation of the combined structure that decomposes into the original form at the output layer. To find these internal layers, the system uses back-propagation, comparing each output layer against its input layer as the error signal until their difference falls within an acceptably negligible range. The highest-level hidden layer can always be decoded into its constituent parts, each of which can be decoded into its constituent sub-trees, and so on. The remaining problem is the number of nodes necessary to encode all aspects of words and their syntactic and semantic features: the space expense and complexity of the computation.
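The encode/decode cycle can be sketched as a small auto-associative network trained by back-propagation; the vector width, linear decoder, and training schedule below are simplifying assumptions rather than Pollack's exact architecture.

```python
import numpy as np

# A RAAM-style auto-associator: compress two child vectors into one parent
# code, train so that decoding recovers the children, then build the tree
# ((A B)(C D)) by reusing codes as inputs. The vector width, linear decoder,
# and training schedule are simplifying assumptions, not Pollack's exact setup.

rng = np.random.default_rng(3)
k = 4                                        # width of one node representation
W_enc = rng.normal(size=(k, 2 * k)) * 0.1    # encoder: 2k -> k
W_dec = rng.normal(size=(2 * k, k)) * 0.1    # decoder: k -> 2k

def encode(left, right):
    return np.tanh(W_enc @ np.concatenate([left, right]))

def decode(code):
    out = W_dec @ code
    return out[:k], out[k:]

A, B, C, D = (rng.random(k) for _ in range(4))

lr = 0.05
for _ in range(5000):
    ab, cd = encode(A, B), encode(C, D)              # moving targets: the codes
    for left, right in [(A, B), (C, D), (ab, cd)]:   # shift as weights train
        xy = np.concatenate([left, right])
        code = np.tanh(W_enc @ xy)
        err = (W_dec @ code) - xy                    # reconstruction error
        delta = (W_dec.T @ err) * (1 - code ** 2)
        W_dec -= lr * np.outer(err, code)
        W_enc -= lr * np.outer(delta, xy)

root = encode(encode(A, B), encode(C, D))    # the whole tree as one code
left, right = decode(root)                   # ~recovers the codes of (A B), (C D)
print(np.allclose(decode(encode(A, B))[0], A, atol=0.1))
```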
A pattern emerged when testing distributed networks on a large number of sentences: “Trees with similar constituents must be similar.” Analysis of the features in any given encoding of a structure showed which features characterized components of the actual structure, such that similar sentences also had similar representations. This carries considerable weight in our psycholinguistic model, since it suggests a way in which distributed representations of concepts could theoretically arise and be connected to similar concepts in the brain. The grammar rules that define the language also define the examples used in supervised learning of a distributed neural net. The examples themselves shape the weights that transform input features into other representational feature layers, such that sentences governed by similar grammar rules will tend to have similar features emphasized and silenced. So instead of having a system that already knows the grammar, the system acquires the grammar over time by means of associative patterns between input layers.
The parsers I have described so far mainly deal with assembling a syntactic tree from a sentence, where that sentence is usually a list of syntactic elements or parts of speech. A further level of abstraction and application of neural networks can be found in semantic parsing. The real end goal of parsing is to create a means of processing relatively noisy auditory or textual input and representing its meaning, as is done in the brain. The model proposed by Bordes et al. infers a meaning representation (MR) composed of units of the form relation(arg0, …, argn) from analyzing a vast amount of text in open domains (meaning there is a lack of labeled data, so there is only weak supervision in developing these MRs). This requires a hashing dictionary of words and their disambiguated meaning entities (word-senses) and a knowledge base of relations between these entities (using WordNet). The proposed meaning representation structure allows for recursive construction, and thus more complex hierarchies of meaning, similar to the tree-like structure of syntactic parses. The WordNet database consists of a network of nodes called synsets, each corresponding to a particular disambiguated word meaning (with multiple associated words), and edges that define relations between the synsets [11].
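To illustrate the shape of such meaning representations, here is a toy sketch using WordNet synsets (via NLTK's wordnet interface) as the disambiguated entities; the sentence, relation names, and nesting are invented, and this is not Bordes et al.'s actual system.

```python
# Meaning representations of the form relation(arg0, ..., argn), with WordNet
# synsets as the disambiguated entities. The sentence, relation names, and
# nesting are invented; this is not the Bordes et al. system itself.
from nltk.corpus import wordnet as wn        # requires: nltk.download('wordnet')

# Arguments are word-senses (synsets), not raw strings, so "cat" is pinned
# to one disambiguated meaning entity.
cat = wn.synsets("cat")[0]                   # Synset('cat.n.01')
mouse = wn.synsets("mouse")[0]               # Synset('mouse.n.01')

# "The cat chased the mouse" as a flat MR; an MR can itself be an argument,
# giving the same recursive, tree-like structure as a syntactic parse.
mr = ("chase", cat, mouse)
nested = ("see", wn.synsets("dog")[0], mr)   # "... saw the cat chase the mouse"
print(nested)

# The knowledge base: WordNet edges relate synsets, e.g. hypernymy.
print(cat.hypernyms())                       # [Synset('feline.n.01')]
```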
A number of restrictions imposed on the development of NLP inhibit its growth. Computers are serial machines, built for lists and arrays; their underlying structure is not massively parallel like the brain's, so it is difficult to devise an architecture that computes with the same breadth and efficiency as networks of activating neurons. Thus computers are forced to work in subdomains of parsing, grammar, and vocabulary. Early NLP parsers, like the localist and SRN models I covered, operated on limited finite-state grammars and vocabularies because of the growing time and space complexity of trying to approximate all the constraints and operations of a natural-language grammar. But as we have seen, parsers focus on different corners of language processing, and overlap where possible to increase the accuracy and plausibility of the models.
To conclude, psycholinguistics has made great progress in developing a model of human parsing and representation of language, informed by neurological and cognitive evidence. This body of knowledge has in turn informed the field of artificial intelligence, specifically in the development of connectionist models such as the syntactic and semantic neural network parsers I have discussed in this paper. The field is relatively new, but already the behaviors exhibited by some of these parsers are on par with processing capabilities exhibited by humans. Neural networks that incorporate learning, distributed representations, and the simultaneous integration of multiple constraints lend further support to the hypothesis of an immediate, non-modular, multiple-constraint parsing mechanism in humans.