What is parsing in natural language processing?

Natural language parsing


The term parsing describes the breaking down of an object into its individual parts. Parsers are indispensable in computer science: they occur in every compiler, and for many programming languages there are parser generators that generate program code for decomposing user input. Regular and context-free languages, which have long been studied in theoretical computer science, often serve as the theoretical background for these parsers.

These classes of formal languages have their origin in linguistics. They were introduced by Noam Chomsky with the aim of finding a suitable formalism for describing natural languages, i.e. languages that are spoken and written by people. Even though such a formalization has not yet been achieved, parsers remain important in computational linguistics and the machine processing of natural languages, whose aim is the automated processing of natural languages with the help of computers. Possible applications are, for example, the translation and summarization of texts, or chatbots that can communicate with people on a wide variety of topics. Parsers support these applications by providing a syntactic analysis (that is, an analysis of the grammatical structure) of the natural language input.
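To make "syntactic analysis" concrete, the following is a minimal sketch of a CYK-style parser for a toy context-free grammar in Chomsky normal form. The grammar, the lexicon and the function name `cyk_parse` are assumptions made purely for illustration; they are not taken from any of the tools mentioned on this page.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (illustration only).
# BINARY maps a pair of right-hand-side nonterminals to the left-hand side.
BINARY = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
# LEXICON maps each known word to its single part-of-speech tag.
LEXICON = {"the": "Det", "dog": "N", "cat": "N", "chased": "V"}


def cyk_parse(words):
    """Return one parse tree rooted in S for the word list, or None."""
    n = len(words)
    if n == 0:
        return None
    # chart[(i, j)] maps a nonterminal to one parse tree covering words[i:j].
    chart = {}
    for i, word in enumerate(words):
        tag = LEXICON.get(word)
        chart[(i, i + 1)] = {tag: (tag, word)} if tag is not None else {}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):  # split point between the two children
                pairs = product(chart[(i, k)].items(), chart[(k, j)].items())
                for (b, left), (c, right) in pairs:
                    a = BINARY.get((b, c))
                    if a is not None and a not in cell:
                        cell[a] = (a, left, right)
            chart[(i, j)] = cell
    return chart[(0, n)].get("S")


tree = cyk_parse("the dog chased the cat".split())
print(tree)
```

The returned nested tuple is the grammatical structure of the input: each node is a nonterminal followed by its children, and each leaf pairs a tag with a word. A sentence that the toy grammar does not cover yields `None`.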

Mildly context-sensitive languages

Since natural languages soon turned out to be neither regular nor context-free, while the class of the more powerful context-sensitive languages is too complex for practical applications, so-called mildly context-sensitive languages are studied. Representatives of this class are linear context-free rewriting systems (LCFRS) and the syntactically similar and semantically equivalent multiple context-free grammars (MCFG). In particular, they can model discontinuous phrases, i.e. parts of sentences that form a logical unit but do not consist of consecutive words in the sentence. This phenomenon occurs, among others, in languages with free word order such as German, Swiss German and Dutch, but also in English.
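The idea of a discontinuous phrase can be sketched in a few lines. In an LCFRS or MCFG, a nonterminal derives a tuple of strings rather than a single string, and each rule says how the components of its children are interleaved. The example sentence, the rule shapes and the helper names `compose` and `yield_of` below are illustrative assumptions, not part of any grammar or tool described here.

```python
# A rule's composition function is a tuple of components. Each component is a
# sequence of terminals (str) and variables (child_index, component_index).

def compose(function, child_yields):
    """Apply a composition function to the string tuples of the children."""
    result = []
    for component in function:
        words = []
        for symbol in component:
            if isinstance(symbol, str):      # terminal symbol
                words.append(symbol)
            else:                            # variable: copy a child's component
                child, part = symbol
                words.extend(child_yields[child][part])
        result.append(tuple(words))
    return tuple(result)


def yield_of(tree):
    """Compute the string tuple derived by a tree (label, function, children)."""
    _label, function, children = tree
    return compose(function, [yield_of(child) for child in children])


# NP and VP each derive TWO separate spans (fan-out 2): they are discontinuous.
np = ("NP", (("A", "hearing"), ("on", "the", "issue")), [])
vp = ("VP", (("is", "scheduled"), ("today",)), [])
# S(x1 y1 x2 y2) -> NP(x1, x2) VP(y1, y2): interleave the four spans.
s = ("S", (((0, 0), (1, 0), (0, 1), (1, 1)),), [np, vp])

print(" ".join(yield_of(s)[0]))
```

Here "A hearing ... on the issue" forms one logical unit although its words are not adjacent in the sentence. A context-free rule could only concatenate the complete yields of NP and VP, so it could not produce this interleaving directly.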

Theoretical contributions

Part of our working group is dedicated to researching such mildly context-sensitive grammars. On the one hand, this includes theoretical work, such as a Chomsky-Schützenberger characterization of weighted MCFG and an automaton characterization. Since the degree of discontinuity that an MCFG can represent affects its processing complexity, we investigate approximation techniques. These should yield less expressive grammars with lower processing complexity while retaining the language of the given grammar as far as possible. In this context we have also introduced so-called hybrid grammars, which, for example, synchronize an LCFRS with a tree-generating grammar. Hybrid grammars allow some of the complexity to be shifted into the tree-generating grammar, thereby reducing the processing complexity for practical applications. Another contribution in this area is the development of a reversible lexicalization method for MCFG.
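How the degree of discontinuity drives processing complexity can be illustrated with a standard counting argument, sketched here under simplifying assumptions: the grammar is binarized, rules contain no terminals, and every variable is used exactly once. A chart-parsing step for a rule A -> B C then has to guess fan-out(A) + fan-out(B) + fan-out(C) independent sentence positions, which gives the exponent of the sentence length n in the running time. The function names are hypothetical.

```python
def rule_parsing_exponent(fanout_lhs, fanout_left, fanout_right):
    """Exponent e such that applying one binary rule costs O(n**e)."""
    return fanout_lhs + fanout_left + fanout_right


def grammar_parsing_exponent(rules):
    """Worst exponent over all binary rules, given as triples of fan-outs."""
    return max(rule_parsing_exponent(*rule) for rule in rules)


# A context-free rule has fan-out 1 everywhere: the familiar cubic bound.
print(rule_parsing_exponent(1, 1, 1))
# Fan-out 2 for all three nonterminals raises the exponent to 6.
print(rule_parsing_exponent(2, 2, 2))
```

For maximal fan-out k this exponent is at most 3k, which is why an approximation that lowers the fan-out can make parsing considerably cheaper.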

Practical contributions

Based on our theoretical research, we implement parsing algorithms and evaluate them on linguistic data sets, so-called corpora. Among other things, the applications rustomata and panda-parser were created. Furthermore, we investigate grammar-based parsers which are controlled by neural models.

Presentation slides on the topic

  • Slides presented by Heiko Vogler at CAI 2017


  • Prof. Dr.-Ing. habil. Dr. h.c./Univ. Szeged
    Heiko Vogler
    Tel.: +49 (0) 351 463-38232
    Fax: +49 (0) 351 463-37959
Status: 10/20/2020 11:57 a.m.