Best of both worlds internationalizing NL applications ...
07-30-2005, 02:27 PM
Michael, et al,
Summarizing an ongoing discussion on <news:comp.ai.nat-lang>, there
seems to be a radically different approach to internationalizing NL
applications that apparently no one (including myself) had previously
considered.
The problem:
Take your choice. You can do a shallow analysis of NL input, which is
straightforward to translate to other languages but has a 10-20% error
rate (depending on how well the users follow the instructions to keep
their sentences simple); that is tolerable for some applications, e.g.
NL "brainstorming". Or you can go for a "full parse", which isn't yet
entirely possible, especially in the case of "gapped" (missing words)
sentences, which are VERY common. The VERY best nearly-full parsers still
have a ~10% error rate, and you could expect to spend months of work
translating one of these to a different language. In short, both of
these approaches suck.
The key observation:
A full parser (if/when one is ever developed), or the best one available
now, is generally capable of internally diagramming sentences so that
complex sentences can be subdivided into a greater number of simply
constructed sentences. Something is definitely lost in this process,
usually the mindset of the speaker, but preserving the specific
statements of fact that are being made is the easiest part of this.
Once the complex sentences have been reduced to simple sentences,
shallow parsers work GREAT for recognizing specific statements of fact -
and are often better than going with the sometimes erroneous sentence
structure that the computer comes up with. In short, you can have the
best of both worlds, at least in the domain of recognizing certain
pre-specified statements of fact in complex sentences.
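To make the point concrete, here is a minimal sketch of shallow
recognition on simple sentences. It is not how Dr. Eliza does it; the
keyword-set representation, the FACT_KEYWORDS table, and the
recognize_facts name are all assumptions made up for illustration.

```python
import re

# Hypothetical fact specifications: each pre-specified statement of fact
# is described only by the keywords that must all appear in a sentence.
FACT_KEYWORDS = {
    "has_headache": {"headache"},
    "cold_feet": {"feet", "cold"},
}

def recognize_facts(simple_sentence):
    """Return the names of all facts whose keywords all occur in the sentence."""
    words = set(re.findall(r"[a-z']+", simple_sentence.lower()))
    # A fact matches when its keyword set is a subset of the sentence's words.
    return sorted(name for name, keys in FACT_KEYWORDS.items() if keys <= words)
```

Fed the whole sentence "My head is cold and my feet hurt", this sketch
wrongly reports cold_feet, because both keywords occur somewhere in the
input. Fed the two halves separately, it reports nothing, which is the
reason for reducing complex sentences before the shallow pass.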
The Internationalization Strategy:
Develop the NL application using shallow parsing, and provide an
outboard (e.g. DDE interfaced) open source crude "sentence chopper" that
just cuts incoming sentences in half at appropriate conjunctions. The
sentence chopper, while definitely less than the best solution, is
trivial to translate to another language - just put in the new list of
conjunctions that the new language has.
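A rough sketch of such a sentence chopper follows. The conjunction
lists, the CONJUNCTIONS table, and the chop_sentence name are
illustrative assumptions, and for simplicity it cuts at every listed
conjunction rather than only at "appropriate" ones, so it will also
split phrases like "bread and butter".

```python
import re

# Hypothetical per-language conjunction lists; illustrative, not exhaustive.
# Supporting a new language means adding one more entry here.
CONJUNCTIONS = {
    "en": ["and", "but", "or", "because", "so"],
    "de": ["und", "aber", "oder", "weil"],
}

def chop_sentence(sentence, lang="en"):
    """Cut a sentence into simpler pieces at each listed conjunction."""
    alternation = "|".join(re.escape(c) for c in CONJUNCTIONS[lang])
    # Match a conjunction as a whole word, with an optional comma before it.
    pattern = re.compile(r",?\s+(?:" + alternation + r")\s+", re.IGNORECASE)
    # Drop the final punctuation mark, then split at the conjunctions.
    pieces = pattern.split(sentence.strip().rstrip(".!?"))
    return [p.strip() for p in pieces if p.strip()]
```

For example, chop_sentence("I have a headache and my feet are cold.")
yields the two pieces "I have a headache" and "my feet are cold", each
of which can go to the shallow parser on its own.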
Then, as time, resources, and market demand dictate, take the time to
translate a full parser to the new language, along with a wrapper that
gathers the words back together as simple sentences. When done, simply
replace the sentence chopper's EXE file.
As a further refinement, the full parser could potentially be table
driven, with new languages being added as new table entries. If the
table entries aren't there for the input, just throw the input into the
old sentence chopper.
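The fallback could look something like the sketch below. Everything
named here (FULL_PARSER_TABLE, CHOPPER_CONJUNCTIONS, crude_chop,
simplify) is hypothetical, and a real full parser would sit behind a
wrapper that returns a list of simple sentences.

```python
import re

# Hypothetical table: language code -> wrapped full parser that returns
# a list of simple sentences. Empty until someone supplies a parser.
FULL_PARSER_TABLE = {}

# Hypothetical conjunction lists used by the crude fallback chopper.
CHOPPER_CONJUNCTIONS = {
    "en": ["and", "but", "or"],
    "es": ["y", "pero", "o"],
}

def crude_chop(sentence, lang):
    """Fallback: split at any listed conjunction, matched as a whole word."""
    conjunctions = CHOPPER_CONJUNCTIONS.get(lang, [])
    text = sentence.strip().rstrip(".!?")
    if not conjunctions:
        return [text]  # nothing known about this language; pass it through
    alternation = "|".join(re.escape(c) for c in conjunctions)
    pieces = re.split(r",?\s+(?:" + alternation + r")\s+", text, flags=re.IGNORECASE)
    return [p.strip() for p in pieces if p.strip()]

def simplify(sentence, lang):
    """Use the full parser if the table has one, else the old chopper."""
    parser = FULL_PARSER_TABLE.get(lang)
    if parser is not None:
        return parser(sentence)
    return crude_chop(sentence, lang)
```

Either way the caller sees the same thing, a list of simple sentences,
so the shallow parser downstream does not need to know which path was
taken.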
This would seem to provide the quick path needed for rapid market
expansion, while not throwing up any impediments to subsequently
incorporating a full parser, either by the Dr. Eliza team or an
independent developer.
The REALLY REALLY IMPORTANT part of all this is that it preserves
the shallow SPECIFICATION of parsing, which is easy for people to
understand. If people had to specify what they were looking for in terms
of parse trees, etc., then experts would run the other way instead of
entering their knowledge.
This appears to be as good as it can get. Lacking technical objections
or a better idea, Dr. Eliza will be altered to work this way.
Note that I have no intention of building a full parser myself, but I do
know people who are working on their respective full parsers. Of course,
a full parser is not in and of itself of any value - it needs an
application to be dropped into. This is behind the present
"chicken or egg" situation where there are no applications because there
are no parsers, and there are no parsers because there are no
applications. Further, with no applications there is NO standard output
format for parsers, as each format reflects its author's model of how
language works. Hence, Dr. Eliza, being a NL program with a really
simple interface (simple sentences) where you can drop a parser (within
an appropriate wrapper) in and try it, will probably be of as much value
to parser developers as a good parser will be to Dr. Eliza. Besides,
when they get it working better than the crude sentence chopper (which
shouldn't be difficult), they can then SELL their parser's EXE files to
Dr. Eliza users - without disclosing any of its internals.
Any technical objections? Could you describe, even in vague terms, a
better approach?
As always, thanks for everyone's continuing help.
Steve Richfie1d