Learning from experience is a problem of finding patterns in what are typically large amounts of complex and often noisy data. A formal model of learning as induction, the Simplicity Principle, posits that the cognitive system seeks the hypothesis that provides the briefest representation of the available data, here the linguistic input to the child. Chater (forthcoming) shows mathematically that this model allows learning from positive evidence alone in a probabilistic sense, in contrast with Gold's (1967) negative theorems on language learnability in the limit. Here we consider statistical properties of data from the CHILDES database (MacWhinney, 2000) as a first approximation of the positive input to the child. We consider a number of linguistic constructions, such as verb argument structures, that would yield overregularisation according to Baker's paradox (Baker, 1979; Baker & McCarthy, 1981; Pinker, 1989). For instance, the causative alternation in English allows a class of verbs to take both the transitive form ("I opened the door") and the intransitive form ("The door opened"). However, certain verbs are restricted to one construction only ("*I disappeared the rabbit" or "*The ball dropped"). Hence the child faces the difficult task of finding the right balance between productivity and idiomaticity in what are complex linguistic structures (Tomasello & Brooks, 1999). Given that she receives little direct negative evidence about her output, Baker's paradox asks how recovery from incorrect generalisations is possible at all.
Under Simplicity, two hypotheses were investigated: (1) the child assumes that there are no constraints on the grammar, which leads to overgeneralisation; (2) the child infers from the input available to her that some structures are not allowed. Because Simplicity posits that "the most probable hypothesis is also the hypothesis that provides the simplest (i.e. shortest) overall specification of the data" (Chater, 1997), we measured the cost of encoding a particular structure, given its probability P of occurrence in the corpus, as log(1/P), and compared the two hypotheses on this measure. We concluded from the results that the second hypothesis should be preferred, as it leads to a shorter description of the input and, at the same time, to the correct generalisation. It is worth noting that this model does not predict that children avoid overregularisations. Rather, it shows that recovery from such errors is possible given positive evidence alone, thus providing a possible formal solution to Baker's paradox. This work also relates to recent practical work in machine learning on how learning is possible from positive evidence alone (Gao, Li & Vitányi, 2000).
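The comparison of the two hypotheses by code length can be illustrated with a minimal sketch. It is not the encoding scheme used in the study: the token counts are invented for illustration (they are not CHILDES figures), the names `TOY_COUNTS`, `data_cost`, `grammar_cost` and `total_cost` are hypothetical, and two simplifying assumptions are made, namely that a verb's permitted constructions are equiprobable and that each stated restriction costs log(1/P) bits with P uniform over all verb-construction pairs. Logarithms are taken to base 2, so costs are in bits.

```python
import math

CONSTRUCTIONS = ("transitive", "intransitive")

# Hypothetical token counts, invented for illustration (not CHILDES data).
TOY_COUNTS = {
    ("open", "transitive"): 30,
    ("open", "intransitive"): 20,
    ("disappear", "intransitive"): 25,
}


def code_length_bits(p):
    """Cost in bits of encoding an event of probability p: log2(1/p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return math.log2(1.0 / p)


def data_cost(counts, blocked):
    """Bits needed to encode every token, given the blocked pairs.

    Assumption: the constructions a verb may still take are equiprobable.
    """
    total = 0.0
    for (verb, construction), n in counts.items():
        allowed = [c for c in CONSTRUCTIONS if (verb, c) not in blocked]
        if construction not in allowed:
            raise ValueError("observed token contradicts the hypothesis")
        total += n * code_length_bits(1.0 / len(allowed))
    return total


def grammar_cost(counts, blocked):
    """Bits needed to state the restrictions themselves.

    Assumption: each restriction singles out one verb-construction pair
    from a uniform distribution over all such pairs.
    """
    verbs = {verb for verb, _ in counts}
    n_pairs = len(verbs) * len(CONSTRUCTIONS)
    return len(blocked) * code_length_bits(1.0 / n_pairs)


def total_cost(counts, blocked):
    """Two-part description length: hypothesis plus data given hypothesis."""
    return grammar_cost(counts, blocked) + data_cost(counts, blocked)


if __name__ == "__main__":
    unconstrained = total_cost(TOY_COUNTS, set())
    constrained = total_cost(TOY_COUNTS, {("disappear", "transitive")})
    print(f"(1) no constraints:   {unconstrained:.1f} bits")
    print(f"(2) with restriction: {constrained:.1f} bits")
```

With these toy counts, hypothesis (2) pays a small fixed cost for stating the restriction but encodes every token of "disappear" for free, so its total description is shorter. If "disappear" had been observed only once, the fixed cost would outweigh the saving and hypothesis (1) would be shorter, which is one way to picture early overgeneralisation followed by recovery as positive evidence accumulates.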
References
Baker, C. L. (1979). Syntactic theory and the projection problem. Linguistic Inquiry, 10, 533-581.
Baker, C. L. & McCarthy, J. J. (Eds.) (1981). The logical problem of language acquisition. Cambridge, MA: MIT Press.
Chater, N. (1997). Simplicity and the mind. The Psychologist, November, 1997, 495-498.
Chater, N. (forthcoming). A simplicity principle for language learning: Re-evaluating what can be learned from positive evidence.
Gao, Q., Li, M. & Vitányi, P. M. B. (2000). Applying MDL to learn best model granularity. Artificial Intelligence, 121, 1-29.
Gold, E. M. (1967). Language identification in the limit. Information and Control, 10, 447-474.
MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk. 3rd Edition. Vol. 2: The Database. Mahwah, NJ: Lawrence Erlbaum Associates.
Pinker, S. (1989). Learnability and cognition: The acquisition of argument structure. Cambridge, MA: MIT Press.
Tomasello, M. & Brooks, P. J. (1999). Early syntactic development: A Construction Grammar approach. In M. Barrett (Ed.), The development of language. Psychology Press.