In my recent e-travels searching for a parsing solution for a project, I stumbled across a very cool NLP parser available under the full GNU GPL. It is called the Stanford Parser, and it works out the grammatical structure of sentences using a probabilistic context-free grammar. That means it breaks a sentence down into its individual pieces and builds them into a syntax tree. It should be mentioned that it is probabilistic, so it gives you the most likely derivation instead of an absolute one. In fact, if you specify a number, it will print that many of the highest-scoring parses for a sentence. For any given sentence there might be thousands of possible parses, creating a state space too large to search exhaustively. There is also the ability to train the parser, if you have your own corpus lying around that you specifically need to use.
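To make the "most likely derivation" idea concrete, here is a minimal Python sketch of how a probabilistic grammar can rank competing parses. This is not the Stanford Parser's code or API. The rule probabilities are made-up numbers, the names RULE_PROB, derivation_score and k_best are hypothetical, and word-level probabilities are ignored. A derivation's score is simply the product of the probabilities of the rules it uses.

```python
import heapq

# Hypothetical rule probabilities for a toy grammar (made-up numbers,
# not taken from any real treebank). Keys are (parent, children) pairs.
RULE_PROB = {
    ("VP", ("VBG", "NP")): 0.9,
    ("VP", ("VBG", "NP", "PP")): 0.1,
    ("NP", ("NNS",)): 0.4,
    ("NP", ("NN",)): 0.3,
    ("NP", ("NP", "PP")): 0.3,
    ("PP", ("IN", "NP")): 1.0,
}

# Two candidate parses of "parsing words for fun", written as nested
# tuples: (label, child, child, ...). A (tag, word) pair is a leaf.
NP_ATTACH = ("VP", ("VBG", "parsing"),
             ("NP", ("NP", ("NNS", "words")),
                    ("PP", ("IN", "for"), ("NP", ("NN", "fun")))))
VP_ATTACH = ("VP", ("VBG", "parsing"),
             ("NP", ("NNS", "words")),
             ("PP", ("IN", "for"), ("NP", ("NN", "fun"))))


def derivation_score(tree):
    """Multiply the probabilities of every grammar rule used in the tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0  # (tag, word) leaf: word probabilities are ignored here
    rhs = tuple(child[0] for child in children)
    score = RULE_PROB[(label, rhs)]
    for child in children:
        score *= derivation_score(child)
    return score


def k_best(candidates, k):
    """Return the k highest-scoring candidate parses, best first."""
    return heapq.nlargest(k, candidates, key=derivation_score)


if __name__ == "__main__":
    for tree in k_best([VP_ATTACH, NP_ATTACH], 2):
        print(derivation_score(tree), tree)
```

With these invented numbers, the parse that attaches "for fun" to "words" outscores the one that attaches it to "parsing". A real parser does not list candidates by hand like this, since there can be thousands of them, but the ranking principle is the same.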
An example parse would be:
Warren enjoys parsing out words for fun.
(ROOT
  (S
    (NP (NNP Warren))
    (VP (VBZ enjoys)
      (S
        (VP (VBG parsing)
          (PRT (RP out))
          (NP
            (NP (NNS words))
            (PP (IN for)
              (NP (NN fun)))))))
    (. .)))
As you can see, the output looks somewhat like a LISP list. Linguists call this bracketed notation the Penn Treebank format (a treebank being a corpus of sentences annotated with parse trees like this one), and without getting too far into the inside baseball, it labels each part of the sentence with its linguistic category. For example, looking at
(RP out)
RP is simply the tag for a particle, so "out" is a particle in this representation of the parse tree.
(VBZ enjoys)
VBZ means verb (VB), 3rd person singular present (Z). Again, this is how the word "enjoys" is represented in the parse tree. The Stanford Parser also supports other kinds of output, such as simple part-of-speech tagging of each word in the sentence, or the grammatical dependencies between the words of the sentence (other linguistic measures of interest).
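Because the bracketed output is so close to a LISP list, it takes very little code to read it back into a data structure. The following is a minimal Python sketch, not part of the Stanford Parser. The helper names read_tree and tagged_words are hypothetical, and it assumes the words themselves contain no literal parentheses. It turns the tree into nested lists and then pulls out each word with its tag.

```python
EXAMPLE = """
(ROOT
  (S
    (NP (NNP Warren))
    (VP (VBZ enjoys)
      (S
        (VP (VBG parsing)
          (PRT (RP out))
          (NP
            (NP (NNS words))
            (PP (IN for)
              (NP (NN fun)))))))
    (. .)))
"""


def read_tree(text):
    """Read one bracketed tree into nested lists: [label, child, ...]."""
    # Pad the brackets with spaces so a plain split() yields the tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        node = [tokens[pos]]  # first token after "(" is the label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())      # nested subtree
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # step past the closing ")"
        return node

    return parse()


def tagged_words(node):
    """Yield (word, tag) pairs from the leaves, left to right."""
    label, *children = node
    for child in children:
        if isinstance(child, str):
            yield (child, label)
        else:
            yield from tagged_words(child)


if __name__ == "__main__":
    for word, tag in tagged_words(read_tree(EXAMPLE)):
        print(word, tag)
```

Run on the example tree, this lists each word next to its tag, such as "enjoys" with VBZ and "out" with RP, which is essentially the simple tagged view of the sentence.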
Perhaps the best part about the parser, though, is that by using different treebanks you can parse different languages (there are versions for Chinese, German and Arabic linked from the website). So, interestingly enough, the parser itself is language independent. I don’t know enough about linguistics to say why, but I am guessing any spoken language can be represented by a treebank (anything with a context-free grammar). It should also be noted that different treebanks can represent the same language in different ways. Any number of treebanks can be set up for a language if a computational linguist is looking to measure something differently.
So what can you use this for? Well, any task where you need to extract complex meaning from natural language. It is too heavy-duty for parsing simple structured commands from English (ANTLR would still be the best for that task). But if you need to do machine translation or relationship analysis, you are going to need something this heavy-duty.