Introduction
=============

The task of the Simple Parser is to produce a dependency analysis of a shallow-parsed input sentence. A sentence can be analysed under various formalisms; the implemented parser follows the Paninian dependency formalism. Unlike phrase structure formalisms, dependency formalisms lack nonterminal nodes, and the tokens of the sentence appear as the nodes of the tree. The dependency arc between two nodes is labelled with the appropriate grammatical relation.

The relations given by the Simple Parser are either broad-grained or fine-grained. A fine-grained relation is provided only when the contextual information is robust enough to identify that relation reliably. The possessive relation (r6) and karta karaka (k1) are examples of such relations. For example, in sentence (1) below, the ergative marker 'ne' on the noun 'abhay' is a definite cue for identifying it as k1. Likewise, the genitive marker 'kaa' in (2) helps identify 'kursii' as r6.

(1) abhay ne khaanaa khaa liyaa |
(2) kursii kaa ranga niilaa thaa |

The other noun chunks in the above examples are linked to their respective heads with a broad-grained relation. 'khaanaa' in (1), for instance, is vmod (verb modifier) of 'khaa'.

Guidelines to Prepare Rules for Simple Parser
==============================================

1.1 Rules Introduction
------------------------------

Rules are written in a separate file and provided as input to the engine. The rules are applied to an input sentence in the Shakti Standard Format (SSF). The rules file has a specific format, described below. Please follow this format while writing rules for the engine.

1.2 Rules format
-----------------------

A rule file consists of many tuples, each tuple representing a rule. Each tuple (or rule) consists of nine fields, namely:

1. Modified (Parent Node)
2. Modified Constraints
3. Modifier (Child Node)
4. Modifier Constraints
5. Relation between the two nodes
6. Dependence of a relation on another relation
7. Multiplicity, or frequency of a relation
8. Weight of a relation
9. Attachment Affinity

Rules can be written for any particular grammar framework. The fields are best explained with the help of examples. Example rules for Hindi are given below; they follow the Paninian Dependency Grammar.

R1: VGF vib=tam__ko NP|CCP vib=ko drel=k1 dep=X mult=1 weight=5 verb=vfn
R2: VGF vib=X NP|CCP vib=0&&list=nom__pronoun drel=k1 dep=X mult=1 weight=5 verb=vfn
R3: VG vib=X NP|CCP vib=se|xvArA|kA_xvArA drel=k3 dep=X mult=1 weight=0 verb=vnfn
R4: VG vib=X NP list=place drel=k7p dep=X mult=>1 weight=0 verb=vnfn
R5: VGF vib=hE NP|JJP|CCP vib=0 drel=k1s dep=k1 mult=1 weight=0 verb=vfn
R6: VG list=k4 NP|CCP vib=ko drel=k4 dep=X mult=1 weight=0 verb=vnfn
R7: VG list=k4 NP|CCP vib=se drel=k4 dep=X mult=1 weight=0 verb=vnfn

Each field is separated by a tab (\t). More generic rules for a relation are placed higher in the file. Specific rules for that relation are placed lower than the generic rules.

Note: Tags that are not present in the tagset, such as VG, indicate that regular expressions are used to match the tags. For example, VG matches VGF, VGNF, VGINF etc.

1.2.1 First field
--------------------

In a dependency grammar, a dependency relation can potentially hold between two nodes. One node is the child node and the other is the parent node. The first field in the rule tuple is the parent node. It is represented by its chunk name. As shown in the sample rules above, the parent node is a verb chunk (VG or VGF).

S1: ((rAma))_NP1 ((KAnA))_NP2 ((KAwA hE))_VGF.

There is a dependency relation between NP1 and VGF in sentence S1 above. Here, VGF is the parent node and NP1 is the child node for this relation. Similarly, another dependency relation exists between NP2 and VGF.

1.2.2 Second field
-------------------------

The second field specifies the parent constraints. It is explained in 1.3.
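The nine-field layout listed in 1.2 can be sketched in code. The following is only an illustrative sketch, not the parser's own implementation: the names Rule, parse_rule and tag_matches are hypothetical, the sketch assumes one tab-separated rule per line without the "R1:"-style labels used in this document, and it assumes that a tag such as VG is matched as a regular expression anchored at the start of the chunk tag.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    """One rule tuple; field names are illustrative, not the engine's."""
    parent: str              # field 1: parent chunk tag, e.g. "VGF"
    parent_constraints: str  # field 2: e.g. "vib=tam__ko"
    child: str               # field 3: child chunk tag(s), e.g. "NP|CCP"
    child_constraints: str   # field 4: e.g. "vib=ko"
    drel: str                # field 5: relation name, e.g. "k1"
    dep: str                 # field 6: required earlier relation, or "X"
    mult: str                # field 7: "1" or ">1"
    weight: int              # field 8: integer weight
    verb: str                # field 9: "vfn" or "vnfn"


def _value(field: str, keyword: str) -> str:
    """Strip the 'keyword=' prefix from a field such as 'drel=k1'."""
    prefix = keyword + "="
    if not field.startswith(prefix):
        raise ValueError(f"expected {prefix}..., got {field!r}")
    return field[len(prefix):]


def parse_rule(line: str) -> Rule:
    """Split one tab-separated rule line into its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    parent, pcons, child, ccons, drel, dep, mult, weight, verb = fields
    return Rule(
        parent, pcons, child, ccons,
        _value(drel, "drel"), _value(dep, "dep"), _value(mult, "mult"),
        int(_value(weight, "weight")), _value(verb, "verb"),
    )


def tag_matches(pattern: str, chunk_tag: str) -> bool:
    """Assumed matching: the rule tag is a regex anchored at the tag start."""
    return re.match(f"(?:{pattern})", chunk_tag) is not None


r1 = parse_rule(
    "VGF\tvib=tam__ko\tNP|CCP\tvib=ko\tdrel=k1\tdep=X\tmult=1\tweight=5\tverb=vfn"
)
r4 = parse_rule(
    "VG\tvib=X\tNP\tlist=place\tdrel=k7p\tdep=X\tmult=>1\tweight=0\tverb=vnfn"
)
```

Under these assumptions, tag_matches("VG", "VGNF") holds while tag_matches("VG", "NP") does not, which mirrors the note that VG matches VGF, VGNF, VGINF etc.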

1.2.3 Third field
---------------------

The third field is the child node (see S1). It can be any chunk, for example an NP, a CCP or a VGX (where X is F, NF, INF etc.), provided all the constraints in the other fields of the rule are satisfied. Alternative chunk names are separated by a '|'. For instance, R1 says that either an NP or a CCP can hold the same relation with the parent node. Hence, its third field contains NP and CCP, separated by a '|'.

1.2.4 Fourth field
-----------------------

The fourth field specifies the child constraints. It is explained in 1.3.

1.2.5 Fifth field
--------------------

The fifth field indicates the relation between the two nodes for the rule. For example, R1 states that the rule marks the dependency relation 'k1', i.e., 'karta' in the Paninian Dependency Framework, within a sentence clause. Relations can be specified for any particular grammar. A dependency relation is denoted by the expression 'drel=' followed by the relation name, e.g. 'drel=k1'.

1.2.6 Sixth field
---------------------

The sixth field denotes the dependence of a relation on another relation that the system has already marked earlier in its computation for the same sentential clause. For instance, in R1, k1 does not depend on any previously marked relation, which is denoted by 'X'. In R5, by contrast, the dependency relation 'k1s' depends on the presence of the relation k1, which must already have been marked by the system. This is denoted by 'dep=k1'.

S2: ((rAma))_NP1 ((kisAna))_NP2 ((hE))_VG.

In the above sentence, NP2 is 'k1s' with respect to the parent verb chunk. k1s cannot be marked unless k1 has already been identified in the same clause. Thus, k1s 'depends' on the presence of the relation k1 within its clause. Note that NP1 has already been identified as k1 by the parser before k1s is identified.

1.2.7 Seventh field
-------------------------

The seventh field denotes the frequency of a relation within a clause.
For R1 it is one, as a clause can contain only one 'k1'. Hence the field 'mult=1'. Any relation that can occur more than once must be denoted by a '>1' expression, i.e., 'mult=>1'. For example, in R4, the relation can occur more than once within the clause.

S3: ((rAma))_NP1 ((dillI meM))_NP2 ((karola bAGa meM))_NP3 ((rahawA hE))_VGF.

Here, both NP2 and NP3 are k7p, according to R4. Hence, the seventh field contains mult=>1.

1.2.8 Eighth field
-----------------------

The eighth field assigns weights to rules as integer values from 0 to 5. These weights are assigned with respect to a relation: a stronger rule for marking a particular relation is assigned a higher weight than a weaker rule for the same relation. Weights are used to prioritize the rules for a particular relation. Where no prioritization of the rules for a relation is required, a weight of '0' can be assigned to those rules. For example, R1 and R2 have been assigned non-zero weights (5 in the rules listed above), while the rest of the rules shown above have a weight of '0'.

S4: ((KAnA))_NP1 ((rAma ne))_NP2 ((KAyA))_VGF.

In the sentence above, there are two potential candidates for k1: a chunk with a '0' case marker (NP1 in this case) and a chunk with a 'ne' case marker (NP2). NP1 may get erroneously marked as k1, because there is a default rule for k1 which says that the first NP chunk in a clause should be k1. However, it is known that any chunk with a 'ne' case marker will always be k1, which makes for a very strong and robust rule. To make sure that such a strong rule is not overridden by the simple ordering of words, especially for Indian languages like Hindi, it is critical that this rule be assigned a weight higher than any of the other rules for k1. So the weight for this rule is 5 and the weight for the default rule is 1.

S5: ((rAma))_NP1 ((Pala ko))_NP2 ((KAwA hE))_VGF.
S6: ((rAma))_NP1 ((Pala))_NP2 ((KAwA hE))_VGF.
In S5 and S6, no preference can be accorded between chunks with '0' and 'ko' case markers for marking k2. Hence, the notion of weights is not applicable, and the weight assigned to both rules is 0.

1.2.9 Ninth field
---------------------

The ninth field denotes the attachment affinity of a child node with its parent node. A child node looks for its parent depending upon the relation established between the two nodes. For example, a child node for which a k1 or 'karta' relation has been marked attaches itself to the main verb (finite verb) of the clause. This is indicated in the first rule by the expression 'verb=vfn', where 'vfn' denotes the main (or finite) verb. Other relations, whose child nodes attach to a parent which may or may not be a finite verb, are denoted by 'verb=vnfn'. For these we use VG in the rule tuple, so that the parser matches the nearest verb chunk, be it a VGF, VGNF, VGINF etc.

S7: ((rAma))_NP1 ((KAnA))_NP2 ((KAwe hue))_VGNF ((skUla))_NP3 ((gayA))_VGF.

It is known that a child chunk (or node) which is k1 will always attach itself to a parent node which is a finite verb chunk, i.e., VGF. Hence, in S7, NP1, which is k1, attaches itself to the finite verb (VGF). In the same sentence, NP2 is k2 with respect to VGNF (its parent node).

1.3 Parent and Child constraints:
--------------------------------------------

Parent and child constraints can vary greatly from rule to rule. These two fields are explained below.

Parent or child constraints can be part of the feature structure of a chunk, or they can be non-feature-structure attributes. A constraint that can be extracted from the feature structure is written as an expression of the form attribute=value. For example, in S4, the constraint for the k1 chunk is that it should possess a 'ne' case marker. This constraint is represented as 'vib=ne', where 'vib' is the keyword for vibhakti (or case marker). This information is present in the feature structure of a chunk.
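As an illustration, a single feature-structure constraint such as 'vib=ne' could be checked against a chunk roughly as follows. This is a minimal sketch under stated assumptions: representing a chunk's feature structure as a Python dict and the helper name satisfies are both hypothetical and are not part of SSF or of the parser. The value 'X', which the sample rules use for an unconstrained node (e.g. 'vib=X'), is treated as always satisfied.

```python
def satisfies(constraint: str, features: dict) -> bool:
    """Check one 'keyword=value' constraint against a feature dict."""
    keyword, _, value = constraint.partition("=")
    if value == "X":
        # 'X' marks the absence of a constraint, so anything passes.
        return True
    return features.get(keyword) == value


# Hypothetical feature structures for the two noun chunks of S4.
np1 = {"vib": "0"}   # ((KAnA))_NP1: no case marker
np2 = {"vib": "ne"}  # ((rAma ne))_NP2: 'ne' case marker

k1_candidates = [name for name, fs in (("NP1", np1), ("NP2", np2))
                 if satisfies("vib=ne", fs)]
```

With these inputs only NP2 passes the 'vib=ne' check, matching the discussion of S4.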

Another example of writing a constraint can be seen in R1. In this rule, the parent constraint says 'vib=tam__ko'. Here 'tam__ko' refers to the list of all the TAMs that agree with a chunk which has a 'ko' case marker (an NP or CCP in this case). A file can thus be maintained which holds the TAMs that agree with certain case markers for chunks that can potentially be k1. Such a file lists each TAM with its respective post-position written against it, separated by a tab (\t). For instance, in Hindi, a list of TAMs that agree with different case markers has been made. These TAMs are kept in a single file named 'tam' as keys, with their post-positions (or case markers) as values. Some TAMs and their agreeing post-positions are shown below.

nA_WA        ko
nA_cAhie     ko
nA_hvE       ko
yA_jAne      xvArA
yA_jA_sake   xvArA
wA_hE        0
rahA_hE      0

S8: ((rAma ko))_NP1 ((acCA))_JJP ((kAma))_NP2 ((karanA cAhie))_VGF.

In the above sentence, NP1 is k1, as its post-position 'ko' agrees with the parent node's TAM, i.e., nA_cAhie.

Constraints can also be non-feature-structure attributes. For example, for a particular relation, a list of certain types of nouns, verbs etc. can be maintained in a file. Such constraints are represented by the expression 'list=' followed by the list name, e.g. 'list=place'.

S9: ((rAma))_NP1 ((Aja))_NP2 ((iWara))_NP3 ((AyA))_VGF.

In the sentence above, NP2 is k7t and NP3 is k7p. These relations can be determined by rules whose constraints are time and place expressions. These expressions can be maintained in separate time and place lists. For example, in R4, a list of all place names (the child constraint) is denoted by 'list=place'. NP2 is a time expression and NP3 is a place expression, and a match for each can be found in these lists.

A list may also hold key-value pairs, with the key and value separated by a tab. The value has to be a feature structure attribute.
For example, a sample key-value pair list for marking the relation k4 is shown below:

Beja   ko
bawA   ko
xe     ko
lOtA   ko
liKa   ko
milA   ko
mila   ko
kahA   se

S10: ((rAma ne))_NP1 ((mohana ko))_NP2 ((puswaka))_NP3 ((xI))_VG.

In the sentence above, NP2 is k4. This can be verified from R6, which states that the parent node, i.e., the verb, should belong to the list of k4 verbs and the child node should possess a 'ko' case marker. The parent node's verb is the key, and its corresponding value is 'ko'. Thus the chunk with 'ko', that is NP2, is marked as k4.

Other non-feature-structure attributes can be included by adding a 'keyword=value' expression, e.g. distance=1.

Where multiple constraints must all be satisfied, the constraints are separated by the '&&' sign. For example, in R2, the multiple constraints for the child node are separated by '&&'. The first constraint says that the chunk should possess a '0' case marker. The second constraint says that the chunk's head should belong to a list of personal pronouns.

S11: ((yaha kAma))_NP1 ((mEM))_NP2 ((kara rahA hUz))_VGF.

Here, NP2 is k1 according to R2, as the child node satisfies both constraints.

For constraints of the same type (same keyword), any one of which may be satisfied, separate the alternatives with a '|' sign. For example, in the third rule, the child constraint is denoted by 'vib=se|xvArA|kA_xvArA', which means that the post-position of the child chunk can be any of se, xvArA or kA_xvArA; hence the three post-positions are separated by a '|'.

If there is no constraint for a parent or child node, just put '=X', e.g. vib=X.

2. For any Queries
------------------------

For any queries or suggestions, please mail:
1) mridulgupta@students.iiit.ac.in
2) vineetyadav@students.iiit.ac.in

Mridul Gupta
Vineet Yadav
Language Technologies Research Centre
IIIT-Hyderabad