Beyond Façade: Pattern Matching for Natural Language Applications
March 15, 2011 Page 4 of 5
ChatScript intermixes direct output text with a C-style scripting language, which can be used in patterns as well as during output. ChatScript supports declaring facts and querying for them and has an extensible generic graph traversal mechanism, so it directly supports walking around the ontology hierarchy or a user-defined map of nodes and links (e.g., for describing travel on a city grid of one-way streets).
ChatScript can even do the AIML thing of synthesizing new input and getting it processed through the input stream, which Suzette uses to rewrite sentences with resolvable pronouns.
The starting fundamental of the implementation is a dictionary array of words, initially from WordNet. Aside from direct indexing, words are accessed by hash. Each word holds static information such as parts of speech and the form of the word (tense, comparative status, plurality, etc.) and links to related words (e.g., all conjugations of an irregular verb are linked into a ring).
This information is essential for canonization. Other information about the word, such as gender or whether it refers to a human, a location, or a time, is also kept here. Words also hold WordNet synsets -- lists of words that could be synonyms in some context.
Word entries can be marked with various "been here" bits for doing inference and hold a list of where in the current sentence the word occurs. This means when you encounter a word in a pattern, you can do a fast look-up and instantly find all locations in the input where the word can be found. This applies to normal words, as well as to concept names, strings, and other special words. Variables are also stored in the dictionary.
The other system fundamental is the fact. A fact is a triple of fields: subject, verb, and object. The value of a fact field is either an ontology pointer or a fact pointer. An ontology pointer consists of an index into the array of dictionary words and a meaning index (which WordNet meaning of the word or the part of speech class of the word). A meaning index of 0 refers to all meanings of the word and can be written help or help~0. If a field is a fact pointer, the value is an index into the fact array.
Each dictionary entry and each fact keeps lists of facts that use it in each position. So the word "bill" has a list of all facts that have "bill" as the subject, another list of facts with "bill" as the verb, and a third list of facts with "bill" as the object.
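To make the per-position lists concrete, here is a minimal Python sketch. It is not ChatScript's actual implementation: the FactStore class and its field names are hypothetical, fields are plain strings rather than ontology or fact pointers, and the three "bill" facts are made up for illustration. It also only indexes words, not facts used as fields of other facts.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    verb: str
    obj: str


class FactStore:
    """Hypothetical fact array with per-word back-reference lists."""

    def __init__(self):
        self.facts = []                      # the fact array; a fact's index identifies it
        self.as_subject = defaultdict(list)  # word -> indexes of facts using it as subject
        self.as_verb = defaultdict(list)     # word -> indexes of facts using it as verb
        self.as_object = defaultdict(list)   # word -> indexes of facts using it as object

    def add(self, subject, verb, obj):
        index = len(self.facts)
        self.facts.append(Fact(subject, verb, obj))
        self.as_subject[subject].append(index)
        self.as_verb[verb].append(index)
        self.as_object[obj].append(index)
        return index


store = FactStore()
store.add("bill", "is", "name")       # "bill" as subject (illustrative fact)
store.add("waiter", "bill", "diner")  # "bill" as verb (illustrative fact)
store.add("duck", "has", "bill")      # "bill" as object (illustrative fact)

# Every fact that mentions "bill" as its object, found without scanning the array.
print([store.facts[i] for i in store.as_object["bill"]])
```

Keeping a separate list per position means a query such as "which facts have bill as object" never has to look at unrelated facts.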
The WordNet ontology is built of facts using the verb is. So reading in WordNet will create facts similar to these: (Persian~3 is cat~1), (cat~1 is animal~1), and (animal~1 is entity~1).
Concepts use the verb member, so concept: ~birdrelated (bird~1 "bird house") creates facts (bird~1 member ~birdrelated) and (bird_house member ~birdrelated).
The tokenizer first breaks user input into a succession of sentences, automatically performing spell-correction, idiom substitution, proper name and number joining, etc. It strips off trailing punctuation, but remembers whether or not the sentence had an exclamation or a question mark. The control script will later examine the construction of the sentence, so questions lacking a question mark will still get appropriately marked.
Marking Words and Concepts
The result is then run through canonization, which generates an alternate for each word. These two forms of input are then used to mark what Façade thinks of as initial and intermediate facts, though for ChatScript they are not facts, they are just annotations on dictionary entries saying where they occur in the sentence.
For each word (regular and canonical), the system chases up the fact hierarchy with that word as subject, following facts whose verb is member or is and switching to the object to continue processing. As it comes to a word or a replacement word, it marks on that dictionary entry the position in the sentence where the original word was found. It does the same thing for sequences of contiguous words that are found in the dictionary.
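The marking pass can be sketched roughly as follows in Python. This is an illustration under assumptions, not the real code: facts are plain string triples, the two facts tagged "assumed" are invented so the example sentence reaches ~like and ~birdrelated, word meanings (pigeon versus pigeon~1) are collapsed, and multi-word sequences are ignored. The names parents and mark_sentence are hypothetical.

```python
from collections import defaultdict

# (subject, verb, object) triples. The first four come from the examples in the
# text; the last two are assumptions added so the sample sentence has something to hit.
FACTS = [
    ("Persian~3", "is", "cat~1"),
    ("cat~1", "is", "animal~1"),
    ("animal~1", "is", "entity~1"),
    ("bird~1", "member", "~birdrelated"),
    ("pigeon", "is", "bird~1"),    # assumed
    ("love", "member", "~like"),   # assumed
]


def parents(word):
    """Objects reachable from `word` as subject through the verbs is or member."""
    return [o for s, v, o in FACTS if s == word and v in ("is", "member")]


def mark_sentence(original, canonical):
    """Map every word and concept reached to the 1-based positions that triggered it."""
    marks = defaultdict(set)
    for position, forms in enumerate(zip(original, canonical), start=1):
        seen = set()
        pending = list(set(forms))  # the regular and the canonical form
        while pending:
            word = pending.pop()
            if word in seen:
                continue
            seen.add(word)
            marks[word].add(position)
            pending.extend(parents(word))
    return marks


marks = mark_sentence(
    ["I", "really", "love", "pigeons"],
    ["I", "really", "love", "pigeon"],
)
print(sorted(marks["~birdrelated"]), sorted(marks["~like"]))
```

After this pass a pattern never needs to scan the sentence: looking up ~birdrelated immediately yields the position where pigeons occurred.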
Pattern matching can now begin. Matching cost is at most linear in the number of symbols in the pattern (excluding patterns using the query function). Suppose the input is I really love pigeons and this is a rule to be matched:
s: ( I *~2 ~like * _~birdrelated ) I like '_0 , too.
This rule, only applicable to statements, asks if I occurs anywhere in the sentence. I can be literal or canonical (me myself mine my). We merely look up I in the dictionary and see if it has been marked with a current position. It has, so we track the position as the most recent match. If it wasn't there, this pattern would fail now.
Next is a range-limited wildcard, which swallows up to two subsequent words, so we note that and move on. Reaching ~like, we look that up in the dictionary and find it is marked (from love) later than I and legally ranged. We track our new match position.
We note the next * wildcard and move on to find an _. This is a signal to memorize the upcoming match so we set a flag. We then look up ~birdrelated in the dictionary. It is marked from pigeons and appropriately later in the sentence. Because of the _ flag, we copy the actual sentence match found into special match variables, copying the position found: start 4 end 4, the actual form pigeons, and the canonical form pigeon.
Since we completed the pattern and matched, we execute the output and write I like pigeons, too. '_0 means retrieve the 0th memorization in its original form.
Since output is generated, the system passes control back to the control script to decide what to do next. Had it not generated output, it would have moved on to the next rule.
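The walk-through above can be approximated with a short Python sketch. It handles only the token kinds used in this one rule (a word or concept, *, *~N, and the _ prefix), takes the marks as a ready-made dictionary, and always picks the earliest legal position. The function name match and the shape of the capture tuples are assumptions for illustration, not ChatScript internals.

```python
def match(pattern, marks, original, canonical):
    """Return a list of (position, original form, canonical form) captures,
    or None if the pattern fails."""
    last = 0      # position of the most recent match (0 = before the sentence)
    gap = None    # words allowed between matches; None = any number
    captures = []
    for token in pattern:
        if token == "*":
            gap = None               # unlimited wildcard
            continue
        if token.startswith("*~"):
            gap = int(token[2:])     # range-limited wildcard
            continue
        memorize = token.startswith("_")
        name = token[1:] if memorize else token
        legal = [
            p for p in sorted(marks.get(name, ()))
            if p > last and (gap is None or p - last - 1 <= gap)
        ]
        if not legal:
            return None              # not marked, or marked at an illegal position
        last = legal[0]
        if memorize:
            captures.append((last, original[last - 1], canonical[last - 1]))
        gap = 0                      # next token must be adjacent unless a wildcard intervenes
    return captures


ORIGINAL = ["I", "really", "love", "pigeons"]
CANONICAL = ["I", "really", "love", "pigeon"]
MARKS = {"I": {1}, "~like": {3}, "~birdrelated": {4}}   # as left by the marking phase
PATTERN = ["I", "*~2", "~like", "*", "_~birdrelated"]

captures = match(PATTERN, MARKS, ORIGINAL, CANONICAL)
reply = "I like " + captures[0][1] + ", too."   # '_0 = original form of capture 0
print(reply)
```

Each pattern token costs one dictionary lookup, which is where the linear bound on matching cost comes from in this sketch.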
Pattern matching can do limited backtracking. If the system finds an initial match but fails later, the system will replace the match of the first pattern word if it can and try again. E.g., if the rule being tested is the (I *~2 ~like * _~birdrelated) one and the input was I hate dogs but I really love pigeons, then when I matches in I hate and ~like does not come soon enough, the system will deanchor I and start the match again with the I of I really. This is simpler and more efficient than full backtracking, which is rarely profitable.
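A rough Python sketch of this re-anchoring follows. To keep it short, the pattern is pre-digested into (name, max_gap) steps, where max_gap is the number of words allowed before that step (None meaning unlimited); that representation and the function names are assumptions for illustration only. Only the first pattern word is ever moved; later steps take the earliest legal position and are never revisited.

```python
def match_from(anchor, rest, marks):
    """Match the remaining steps once the first word is fixed at `anchor`.
    Returns the position of the last match, or None on failure."""
    last = anchor
    for name, max_gap in rest:
        legal = [
            p for p in marks.get(name, ())
            if p > last and (max_gap is None or p - last - 1 <= max_gap)
        ]
        if not legal:
            return None
        last = min(legal)
    return last


def match_with_reanchoring(steps, marks):
    """Try each marked position of the first word in turn (the only backtracking done)."""
    first_name = steps[0][0]
    for anchor in sorted(marks.get(first_name, ())):
        end = match_from(anchor, steps[1:], marks)
        if end is not None:
            return anchor, end
    return None


# I(1) hate(2) dogs(3) but(4) I(5) really(6) love(7) pigeons(8)
MARKS = {"I": {1, 5}, "~like": {7}, "~birdrelated": {8}}
# ( I *~2 ~like * _~birdrelated ) reduced to steps
STEPS = [("I", None), ("~like", 2), ("~birdrelated", None)]

result = match_with_reanchoring(STEPS, MARKS)
print(result)
```

With the first I at position 1, ~like at position 7 is five words away and the attempt fails; re-anchoring on the second I at position 5 succeeds.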
Putting it all together: Topics
The system does not execute all rules. The author organizes collections of rules into topics. A topic also has a set of related keywords (an implied concept set). These are used to find topics most closely related to an input.
topic: ~RELIGION (~fake_religious_sects ~godnames ~religious_buildings ~religious_leaders ~religious_sects ~worship_verbs sacrament sacred sacrilege saint salvation sanctity sanctuary scripture sect sectarian secular secularism secularist seeker seraph seraphic seraphim soul spiritual spirituality "supreme being" tenet theocracy theology tithe pray venerate worship)
A topic introduces an additional rule type. In addition to responders for user input (s: ?: u:) it has topic gambits it can offer (t:). It can nest continuations (a: b: ...) under any of those. Gambits create a coherent story on the topic if the bot is in control of the conversation (the user passively responds by saying things like "OK" or "right"). Yet if the user asks questions, the system can use a responder to either respond directly or use up a gambit it was going to volunteer anyway.
A topic is executed in either gambit mode (meaning t: lines fire) or in responder mode (meaning s: ?: and u: fire). Rules are placed in the order you want them to execute. For gambits, the order tells a story. For responders, rules are usually ordered most specific to least specific, possibly bunched by subtopic.
t: An atheist is a man who has no invisible means of support.
a: ( << what mean >> ) It means God (invisible and supporter of life ) doesn't exist for an atheist.
t: Do you have any religious beliefs?
a: ( ~no ) How about ethical systems that dictate your behavior instead?
a: ( ~yes ) What religion are you?
b: ( _~religious_sects ) Were you born _0 or did you convert?
t: RELIGION () Religion is for people who are afraid to stand on their own.
?: (<< [ where what why how who ] [ do be ] God >>)
[ There is no God. ]
[ A God who is everywhere simultaneously and not visible is nowhere also. ]
?: ( be you *~2 follower *~2 [ ~fake_religious_sects ~religious_sects ~religious_leaders ] ) reuse( RELIGION )
u: ( << [I you] ~religion >> ) You want to talk about religion?
The reuse output function is important in avoiding double-authoring information. It is a "goto" the output section of another labeled rule.
By default, the system avoids repeating itself, marking rules that it matches as used up. All rules can have a label attached and even gambits can have pattern requirements, so the system can dish out gambits conditionally if you want or make rules share output data.
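How reuse and used-up marking might interact can be sketched in Python as below. This is a guess at the behavior described, not ChatScript's code: the Rule class, the fire and next_gambit helpers, and the decision to mark both the reusing responder and the labeled rule as used are all assumptions.

```python
class Rule:
    """Hypothetical rule record: kind is 't:', 's:', '?:' or 'u:'."""

    def __init__(self, kind, output=None, label=None, reuse=None):
        self.kind = kind
        self.output = output
        self.label = label
        self.reuse = reuse    # label of the rule whose output to borrow, if any
        self.used = False


def fire(rule, by_label):
    """Produce the rule's text, following reuse like a goto, and mark rules as used up."""
    rule.used = True
    if rule.reuse is not None:
        return fire(by_label[rule.reuse], by_label)
    return rule.output


def next_gambit(rules, by_label):
    """Volunteer the first gambit, in authored order, that has not been used up."""
    for rule in rules:
        if rule.kind == "t:" and not rule.used:
            return fire(rule, by_label)
    return None


RULES = [
    Rule("t:", "Religion is for people who are afraid to stand on their own.", label="RELIGION"),
    Rule("t:", "Do you have any religious beliefs?"),
    Rule("?:", reuse="RELIGION"),
]
BY_LABEL = {rule.label: rule for rule in RULES if rule.label}

answer = fire(RULES[2], BY_LABEL)            # the responder borrows the labeled gambit's text
volunteered = next_gambit(RULES, BY_LABEL)   # so that gambit is skipped later
print(answer)
print(volunteered)
```

The text is authored once, in the labeled gambit, yet can surface either as an answer to a question or as a volunteered line, and never twice.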
The topic system helps ChatScript efficiently search for relevant rules. User sentence keywords trigger the closest matching topic (based on number and length of keywords) for evaluation. If it can find a matching responder, then it shifts to that topic and replies. Otherwise it tries other matching topics.
Eventually, if it can't find an appropriate responder, it can go back to the most relevant topic and just dish out a gambit. So if you say Were Romans mean to Christians? it can come back with An atheist is a man who has no invisible means of support. This at least means it will begin conversing in the appropriate topic.
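The search order described here can be sketched in Python. The scoring formula is an assumption (the text only says number and length of keywords matter, so this sums the lengths of matched keywords), concept sets are flattened into plain keyword sets, responders are reduced to "all of these words must appear", and the ~pets topic with its contents is hypothetical.

```python
def score(topic, words):
    """Assumed relevance score: more matched keywords and longer keywords score higher."""
    return sum(len(keyword) for keyword in topic["keywords"] if keyword in words)


def respond(topics, sentence):
    words = set(sentence.lower().rstrip("?.!").split())
    ranked = sorted(
        (t for t in topics if score(t, words) > 0),
        key=lambda t: score(t, words),
        reverse=True,
    )
    # First choice: a matching responder in the closest topic, then in other matching topics.
    for topic in ranked:
        for required, output in topic["responders"]:
            if required <= words:
                return output
    # Fallback: volunteer (and use up) a gambit from the most relevant topic.
    if ranked and ranked[0]["gambits"]:
        return ranked[0]["gambits"].pop(0)
    return None


TOPICS = [
    {
        "name": "~religion",
        "keywords": {"christians", "god", "saint", "worship"},
        "responders": [({"god"}, "There is no God.")],
        "gambits": ["An atheist is a man who has no invisible means of support."],
    },
    {
        "name": "~pets",  # hypothetical second topic
        "keywords": {"cat", "dog"},
        "responders": [({"cat"}, "I prefer dogs.")],
        "gambits": ["Do you have any pets?"],
    },
]

reply = respond(TOPICS, "Were Romans mean to Christians?")
print(reply)
```

No responder matches the Romans question, so the sketch falls back to the first unused gambit of the best-scoring topic, which at least keeps the conversation on religion.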
Topics make it easy to bundle rules logically together. Topic keywords mean the author can script an independent area of conversation and the system will automatically find it and invoke it as appropriate. Control is not based on the initial sequence of words, as it is in AIML. Instead it is based on consanguinity with the sentence. It doesn't matter if the new topic has rules that overlap some other topic's rules. You can have a topic on ~burial_customs and another on ~death. An input sentence I don't believe in death might route you to either topic, but that's OK.
Instead of Façade's salience, topics make ChatScript rule-ordering visual. You can define tiers of rules merely by defining separate topics and calling one from another.