  
  Introduction
  =============
  
  The task of the Simple Parser is to give a dependency analysis of the input
  shallow parsed sentence. There are various formalisms which can be 
  followed to analyze a sentence. The implemented parser follows the Paninian 
  dependency formalism. 
  
  Unlike phrase structure formalisms, dependency formalisms have no nonterminal 
  nodes; the tokens of the sentence themselves appear as the nodes of the tree. The dependency 
  arcs between nodes are labelled with the appropriate grammatical relations.
  
  The relations given by the Simple Parser are either broad-grained or fine-grained. 
  A fine-grained relation is assigned only when the contextual information is robust enough 
  to identify that relation reliably. The possessive relation (r6) and karta karaka (k1)
  are examples of such relations. In sentence (1) below, the ergative
  marker 'ne' on the noun 'abhay' is a definite cue for identifying it as k1. 
  Likewise, the genitive marker 'kaa' in (2) helps identify 'kursii' as r6.
  
  (1) abhay ne khaanaa khaa liyaa |
  (2) kursii kaa ranga niilaa thaa |
  
  The other noun chunks in the above examples are linked with their respective heads with a 
  broad-grained relation. 'khaanaa' in (1), for instance, is vmod (verb modifier)
  of 'khaa'.
  
  
  Guidelines to Prepare Rules for Simple Parser
  ==============================================
  
  1.1 Rules Introduction
  ------------------------------
  
  Rules can be written down in a separate file and provided as input to the 
  engine. The rules are applied on an input sentence in the Shakti Standard Format 
  (SSF). The format of the rules file is specific and has been described below. 
  Please follow this format while writing rules for the engine. 
  
  1.2 Rules format
  -----------------------
  
  A rule file consists of many tuples, each tuple representing a rule. 
  Each tuple (or rule) consists of nine fields, namely: 
  
  1. Modified (Parent Node) 
  
  2. Modified Constraints 
  
  3. Modifier (Child Node) 
  
  4. Modifier Constraints 
  
  5. Relation between the two nodes 
  
  6. Dependence of a relation on another relation 
  
  7. Multiplicity, or frequency of a relation 
  
  8. Weight of a relation 
  
  9. Attachment Affinity 
   
  The rules can be made for any particular grammar framework. 
  These fields can be explained with the help of examples. 
  
  Examples of rules for Hindi are given below. They follow the 
  Paninian Dependency Grammar. 
   
  R1:
  VGF vib=tam__ko NP|CCP vib=ko drel=k1 dep=X mult=1 weight=5 verb=vfn 
  
  R2:
  VGF vib=X NP|CCP vib=0&&list=nom__pronoun drel=k1 dep=X mult=1 weight=5 verb=vfn 
   
  R3:
  VG vib=X NP|CCP vib=se|xvArA|kA_xvArA drel=k3 dep=X mult=1 weight=0 verb=vnfn
   
  R4: 
  VG vib=X NP list=place drel=k7p dep=X mult=>1 weight=0 verb=vnfn 
   
  R5:
  VGF vib=hE NP|JJP|CCP vib=0 drel=k1s dep=k1 mult=1 weight=0 verb=vfn 
   
  R6:
  VG list=k4 NP|CCP vib=ko drel=k4 dep=X mult=1 weight=0 verb=vnfn 
   
  R7:
  VG list=k4 NP|CCP vib=se drel=k4 dep=X mult=1 weight=0 verb=vnfn 
   
  Fields are separated by a tab (\t). Within the rules for a given relation, the more 
  generic rules are placed higher in the file and the more specific rules are placed 
  below them.
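
  As a minimal sketch of the tab-separated layout described above, the following
  Python snippet splits one rule line into its nine fields. The names Rule and
  parse_rule are hypothetical and not part of the Simple Parser; the sketch also
  assumes one rule per line with exactly nine fields.

```python
from collections import namedtuple

# Hypothetical container for the nine fields of a rule tuple.
Rule = namedtuple("Rule", [
    "parent", "parent_constraints", "child", "child_constraints",
    "relation", "dependence", "multiplicity", "weight", "affinity",
])


def parse_rule(line):
    """Split one tab-separated rule line into its nine fields."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    return Rule(*fields)


# R1 from the examples above, written with explicit tabs.
r1 = parse_rule(
    "VGF\tvib=tam__ko\tNP|CCP\tvib=ko\tdrel=k1\tdep=X\tmult=1\tweight=5\tverb=vfn"
)
print(r1.relation)  # drel=k1
```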
  
  Note:
  A tag such as VG, which is not itself present in the tagset, is treated as a regular 
  expression over the tags. For example, VG matches VGF, VGNF, VGINF etc.
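
  One way to read this note is that a tag field is matched against the start of a
  chunk tag. The sketch below illustrates that reading with Python's re module.
  The function name tag_matches is hypothetical, and prefix matching is an
  assumption rather than the engine's documented behaviour.

```python
import re


def tag_matches(pattern, chunk_tag):
    """Return True if chunk_tag starts with one of the '|'-separated alternatives."""
    # re.match anchors only at the start, so 'VG' also matches 'VGF', 'VGNF', ...
    return re.match("(?:%s)" % pattern, chunk_tag) is not None


print(tag_matches("VG", "VGNF"))     # True
print(tag_matches("VGF", "VGNF"))    # False
print(tag_matches("NP|CCP", "CCP"))  # True
```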
   
  1.2.1 First field
  --------------------
  In a dependency grammar, a dependency relation can hold between two nodes:
  one node is the child node and the other is the parent node.
  The first field in the rule tuple is the parent node, represented by its chunk 
  name. In the sample rules above, the parent node is a verb chunk (VG or VGF).
  
  S1: ((rAma))_NP1 ((KAnA))_NP2 ((KAwA hE))_VGF.    
  
  There is a dependency relation between NP1 and VGF in 
  sentence S1 above. Here, VGF is the parent node and NP1 is the 
  child node of this relation. Similarly, another dependency relation exists
  between NP2 and VGF.
  
  1.2.2 Second field
  -------------------------
  The second field specifies the parent constraints. It is explained in 1.3.
  
  1.2.3 Third field
  ---------------------
  The third field is the child node (see S1). It can be any chunk, for example an NP,
  a CCP or a VGX (X is F, NF, INF etc.), provided all the constraints in the 
  other fields of the rule are satisfied. Alternative chunk types are separated by a '|'. 
  For instance, R1 says that either an NP or a CCP can bear the same relation to the 
  parent node. Hence, its third field contains NP and CCP separated by a '|'.
  
  1.2.4 Fourth field
  -----------------------
  The fourth field specifies the child constraints. It is explained in 1.3.
  
  1.2.5 Fifth field
  --------------------
  The fifth field gives the relation between the two nodes that the rule marks.
  For example, R1 is a rule for marking the dependency relation 
  'k1', i.e., 'karta' in the Paninian Dependency Framework, within a sentence
  clause. Relations can be specified for any particular grammar. For 
  example, a dependency relation can be denoted by the expression 
  'drel=<some relation>'.
  
  1.2.6 Sixth field
  ---------------------
  The sixth field denotes the dependence of a relation on another relation that 
  the system has already marked earlier in its computation for the same 
  clause. In R1, k1 does not depend on any previously marked 
  relation, which is denoted by 'X' (dep=X). In R5, by contrast, 
  the dependency relation 'k1s' depends on the presence of the 
  relation k1, which must already have been marked by the system. 
  This is denoted by 'dep=k1'. 
  
  S2: ((rAma))_NP1 ((kisAna))_NP2 ((hE))_VGF.
  
  In the above sentence, NP2 is 'k1s' with respect to the parent node
  VGF. However, k1s cannot be marked unless k1 has already 
  been identified in the same clause. Thus, we can say that k1s 
  'depends' on the presence of the relation k1 within its clause. Here,
  NP1 has already been identified as k1 by the parser before k1s
  is identified.
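
  The dependence check can be sketched as a lookup in the set of relations
  already marked in the clause. The function name dependence_satisfied and the
  use of a Python set are assumptions made for illustration only.

```python
def dependence_satisfied(dep_field, marked_relations):
    """dep_field is e.g. 'dep=X' or 'dep=k1'; marked_relations holds the
    relations already marked in the current clause."""
    required = dep_field.split("=", 1)[1]
    # 'X' means the rule does not depend on any earlier relation.
    return required == "X" or required in marked_relations


# S2: k1s (R5, dep=k1) can only be marked once k1 has been marked.
marked = set()
print(dependence_satisfied("dep=k1", marked))  # False
marked.add("k1")                               # NP1 is marked as k1
print(dependence_satisfied("dep=k1", marked))  # True
```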
  
  1.2.7 Seventh field
  -------------------------
  The seventh field denotes how many times a relation may occur within a clause. 
  For R1 it is one, since a clause can have only one 'k1'; hence the field is 
  'mult=1'. Any relation that can occur more than once must 
  be denoted by the '>1' expression, i.e., 'mult=>1'. In R4, for example, 
  the relation can occur more than once within 
  the clause. 
  
  S3:
  ((rAma))_NP1 ((dillI meM))_NP2 ((karola bAGa meM))_NP3 ((rahawA hE))_VGF.
  
  Here, both NP2 and NP3 are k7p according to R4, which is why its seventh 
  field contains mult=>1.
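
  A minimal sketch of how the multiplicity field could be enforced is given
  below. The function name may_mark and the use of a Counter to track marked
  relations are hypothetical choices, not the engine's actual bookkeeping.

```python
from collections import Counter


def may_mark(mult_field, relation, marked_counts):
    """Return True if 'relation' may be marked once more in this clause."""
    limit = mult_field.split("=", 1)[1]  # '1' or '>1'
    if limit == ">1":
        return True
    return marked_counts[relation] < int(limit)


# S3: R4 has mult=>1, so both NP2 and NP3 can be marked as k7p.
counts = Counter()
for chunk in ("NP2", "NP3"):
    if may_mark("mult=>1", "k7p", counts):
        counts["k7p"] += 1
print(counts["k7p"])  # 2
```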
  
  1.2.8 Eighth field
  -----------------------
  The eighth field assigns a weight to a rule as an integer  
  from 0 to 5. Weights are assigned with respect to a relation: a 
  stronger rule for marking a particular relation gets a higher weight 
  than a weaker rule for the same relation, so weights prioritize the rules 
  for that relation. Where no prioritization of the rules for a relation is  
  required, those rules can be given weight 0. For example, R1 and R2  
  above are each assigned weight 5, while the rest of the rules shown 
  have weight 0. 
  
  S4:
  ((KAnA))_NP1 ((rAma ne))_NP2 ((KAyA))_VGF. 
  
  In the sentence above there are two potential candidates for k1: the chunk
  with a '0' case marker (NP1) and the chunk with the 'ne' case marker (NP2). NP1 may be 
  erroneously marked as k1, because a default rule for k1 says that the first
  NP chunk in a clause should be k1. However, a chunk with a 'ne' case marker
  is always k1, which makes for a very strong and robust rule. To make sure 
  that such a strong rule is not overridden by mere word order, especially in
  Indian languages like Hindi, it must be assigned a higher weight 
  than any other rule for k1. The 'ne' rule is therefore given weight 5 and the
  default rule weight 1.
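
  The effect of the weights on S4 can be sketched as picking the candidate whose
  matching rule has the highest weight. The function name pick_by_weight and the
  (chunk, weight) pair representation are assumptions made for illustration;
  ties go to the earlier candidate only because of how Python's max behaves.

```python
def pick_by_weight(candidates):
    """candidates: list of (chunk_name, rule_weight) pairs competing
    for the same relation; return the chunk with the highest weight."""
    return max(candidates, key=lambda pair: pair[1])[0]


# S4: NP1 matches the default k1 rule (weight 1),
#     NP2 matches the 'ne' rule (weight 5).
s4_candidates = [("NP1", 1), ("NP2", 5)]
print(pick_by_weight(s4_candidates))  # NP2
```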
  
  S5:
  ((rAma))_NP1 ((Pala ko))_NP2 ((KAwA hE))_VGF.
  
  S6:
  ((rAma))_NP1 ((Pala))_NP2 ((KAwA hE))_VGF.
  
  In S5 and S6 no preference can be given to either the '0' or the 'ko' case 
  marker when marking k2, so the notion of weights does not apply.
  The weight assigned to both rules is therefore 0. 
  
  1.2.9 Ninth field
  ---------------------
  The ninth field denotes the attachment affinity of a child node with its  
  parent node. A child node looks for its parent depending on the relation  
  established between the two nodes. For example, a child node marked with k1 
  ('karta') attaches to the main (finite) verb of  
  the clause. This is indicated in R1 by the expression 
  'verb=vfn', where 'vfn' denotes the main (or finite) verb. Relations 
  whose child nodes attach to a parent that may or may not be a finite verb 
  are denoted by 'verb=vnfn'. Such rules use VG as the parent so that the parser
  matches the nearest verb chunk, be it VGF, VGNF, VGINF etc.
  
  S7:
  ((rAma))_NP1 ((KAnA))_NP2 ((KAwe hue))_VGNF ((skUla))_NP3 ((gayA))_VGF.
  
  A child chunk (or node) that is k1 always attaches to a
  finite verb chunk, i.e., VGF. Hence, in S7, NP1, which
  is k1, attaches to the finite verb (VGF).
  
  In the same sentence, NP2 is k2 with respect to VGNF (its parent node). 
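
  The two attachment affinities can be illustrated on S7 with the sketch below.
  It assumes that the parent follows the child (verb-final order) and that
  "nearest" means the first verb chunk to the right; both are simplifications,
  and the function name find_parent is hypothetical.

```python
def find_parent(tags, child_index, affinity):
    """Return the index of the parent verb chunk for the child, or None.

    'vfn'  : the first finite verb chunk (VGF) after the child.
    'vnfn' : the first verb chunk of any kind (VG...) after the child.
    """
    for i in range(child_index + 1, len(tags)):
        if affinity == "vfn" and tags[i] == "VGF":
            return i
        if affinity == "vnfn" and tags[i].startswith("VG"):
            return i
    return None


# Chunk tags of S7: NP1 NP2 VGNF NP3 VGF
s7 = ["NP", "NP", "VGNF", "NP", "VGF"]
print(find_parent(s7, 0, "vfn"))   # 4  (NP1 attaches to VGF)
print(find_parent(s7, 1, "vnfn"))  # 2  (NP2 attaches to VGNF)
```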
  
  1.3 Parent and Child constraints
  --------------------------------------------
  Parent and child constraints can vary greatly from rule to rule. These two 
  fields are explained below: 
  
  Parent or child constraints can either be part of the feature structure of a chunk  
  or be non feature structure attributes.  
  A constraint that can be read off the feature structure is written as  
  the expression '<keyword>=<value>'. For example, in S4 the constraint on the
  k1 chunk is that it should carry the 'ne' case marker, which is
  written as 'vib=ne'. Here 'vib' is the keyword for vibhakti (or case marker), 
  information that is present in the feature structure of a chunk. 
  
  Another way of writing a constraint is illustrated by R1, whose 
  parent constraint is 'vib=tam__ko'. Here 'tam' names a list of TAMs, and 'ko' 
  selects those TAMs in the list that agree with a chunk (NP or CCP in this case) 
  carrying the 'ko' case marker. A file can thus be maintained that records which TAMs 
  agree with which case markers on chunks that can potentially be k1.
  In this file each TAM is written with its agreeing post-position against it, 
  separated by a tab (\t). For Hindi, such a list has been made and kept in a 
  single file named 'tam', with the TAMs as keys and their post-positions 
  (or case markers) as values. Some TAMs 
  and their agreeing post-positions are shown below. 
  
  nA_WA	ko
  nA_cAhie	ko
  nA_hvE	ko
  yA_jAne	xvArA
  yA_jA_sake	xvArA
  wA_hE	0
  rahA_hE	0
  
  S8:
  ((rAma ko))_NP1 ((acCA))_JJP ((kAma))_NP2 ((karanA cAhie))_VGF.
  
  In the above sentence, NP1 is k1 as its post-position 'ko' agrees with the parent
  node's TAM i.e., nA_cAhie.
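
  The TAM agreement check for S8 can be sketched as a dictionary lookup. The
  snippet below reads the tab-separated pairs from a string rather than from
  the actual 'tam' file, and the names load_pairs and tam_agrees are
  hypothetical.

```python
# A few lines in the tab-separated format of the 'tam' file shown above.
TAM_FILE_TEXT = (
    "nA_WA\tko\n"
    "nA_cAhie\tko\n"
    "yA_jAne\txvArA\n"
    "wA_hE\t0\n"
)


def load_pairs(text):
    """Read 'key<TAB>value' lines into a dictionary."""
    pairs = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split("\t")
            pairs[key] = value
    return pairs


def tam_agrees(constraint, parent_tam, tam_table):
    """constraint looks like 'vib=tam__ko': the parent's TAM must be
    listed in the table with the post-position 'ko'."""
    value = constraint.split("=", 1)[1]          # 'tam__ko'
    _list_name, marker = value.split("__", 1)    # ('tam', 'ko')
    return tam_table.get(parent_tam) == marker


tam_table = load_pairs(TAM_FILE_TEXT)
# S8: the parent VGF has TAM nA_cAhie, which agrees with 'ko'.
print(tam_agrees("vib=tam__ko", "nA_cAhie", tam_table))  # True
print(tam_agrees("vib=tam__ko", "wA_hE", tam_table))     # False
```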
   
  Constraints can also be non feature structure attributes.
  For example, for a particular relation, a list of certain types of nouns, verbs etc. can
  be maintained in a file. Such constraints are represented by the expression 
  'list=<listname>'. 
  
  S9:
  ((rAma))_NP1 ((Aja))_NP2 ((iWara))_NP3 ((AyA))_VGF.
  
  In the sentence above, NP2 is k7t and NP3 is k7p. These relations can be
  determined by rules whose constraints are time and place expressions. These
  expressions can be maintained in separate time and place lists. For example, 
  in R4 a list of all place names (the child constraint) is denoted by 'list=place'.
  NP2 is a time expression and NP3 is a place expression, and a match for each can be 
  found in these lists. 
  
  A list can also hold key-value pairs, with the key and the value separated by a tab. 
  The value has to be a feature structure attribute.
  
  For example, a sample key-value pair list has been shown below for marking relation k4:
  Beja	ko
  bawA	ko
  xe	ko
  lOtA	ko
  liKa	ko
  milA	ko
  mila	ko
  kahA	se
  
  S10:
  ((rAma ne))_NP1 ((mohana ko))_NP2 ((puswaka))_NP3 ((xI))_VG.
  
  In the sentence above, NP2 is k4. This can be verified from R6, which states
  that the parent node, i.e., the verb, should belong to the list of k4 verbs and the
  child node should carry the 'ko' case marker. The parent verb is the key
  in the list, and its corresponding value is 'ko'. Thus NP2, the chunk with 'ko', is
  marked as k4.
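
  The check behind R6 can be sketched as follows. The sketch assumes that the
  verb root ('xe' for the verb of S10) is available from the chunk's feature
  structure; the function name k4_applies is hypothetical.

```python
# A few entries of the k4 key-value list shown above (verb root, case marker).
K4_LIST_TEXT = "Beja\tko\nbawA\tko\nxe\tko\nkahA\tse\n"
k4_table = dict(line.split("\t") for line in K4_LIST_TEXT.splitlines())


def k4_applies(verb_root, child_vib, table):
    """True if the verb is in the k4 list and the child chunk carries
    the case marker listed against that verb."""
    return table.get(verb_root) == child_vib


# S10: the verb root is assumed to be 'xe'; NP2 carries 'ko'.
print(k4_applies("xe", "ko", k4_table))    # True
print(k4_applies("xe", "se", k4_table))    # False
print(k4_applies("kahA", "se", k4_table))  # True
```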
  
  Other non feature structure attributes can be 
  included by adding a 'keyword=value' expression, e.g., distance=1. 
  
  When all of several constraints must be satisfied, the constraints are 
  separated by the '&&' sign. For example, in R2 the constraints on the child 
  node are separated by '&&'. The first constraint requires that the chunk
  carry a '0' case marker. The second requires
  that the chunk's head belong to a list of personal pronouns.
  
  S11:
  ((yaha kAma))_NP1 ((mEM))_NP2 ((kara rahA hUz))_VGF.
  
  Here, NP2 is k1 according to R2, as the child node satisfies both the
  constraints.
  
  For constraints of the same type (same keyword), any one of which may 
  be satisfied, separate the alternatives with a '|' sign. For example, 
  in R3 the child constraint is 'vib=se|xvArA|kA_xvArA', which 
  means that the post-position of the child chunk can be any of se, xvArA or 
  kA_xvArA. 
   
  If there is no constraint on a parent or child node, just put '<any keyword>=X',  
  e.g., vib=X. 
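
  The '&&', '|' and 'X' conventions can be summarised in a small sketch. It
  only covers constraints that compare a keyword with a feature value; 'list='
  constraints need a list lookup and are outside this sketch. The names
  parse_constraints and satisfies are hypothetical.

```python
def parse_constraints(field):
    """Turn a constraint field into {keyword: [allowed values]}.
    An 'X' value means 'no constraint' and becomes an empty list."""
    parsed = {}
    for part in field.split("&&"):
        keyword, value = part.split("=", 1)
        parsed[keyword] = [] if value == "X" else value.split("|")
    return parsed


def satisfies(parsed, features):
    """Every keyword must hold (&&); any listed alternative is enough (|)."""
    return all(
        not allowed or features.get(keyword) in allowed
        for keyword, allowed in parsed.items()
    )


r3_child = parse_constraints("vib=se|xvArA|kA_xvArA")
print(satisfies(r3_child, {"vib": "xvArA"}))  # True
print(satisfies(r3_child, {"vib": "ko"}))     # False
print(satisfies(parse_constraints("vib=X"), {"vib": "ko"}))  # True
```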
  
  2. For any Queries
  ------------------------
  Please mail any queries or suggestions to 
  
        1)mridulgupta@students.iiit.ac.in 
   
        2)vineetyadav@students.iiit.ac.in 
   
  Mridul Gupta
  Vineet Yadav
  Language Technologies Research Centre 
  IIIT-Hyderabad