18-July-2013	Rashid Ahmad	<rashid101b@gmail.com>

	* Version 1.9

	- Web page identification in text is now restricted to strings that start
	  with http|https|file|ftp|www, to avoid acronyms being marked as web pages.
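
	The prefix restriction above can be illustrated with a small Python
	sketch. It is not the code of tokenizer_indic.pl: the name
	looks_like_webpage and the exact pattern (a scheme followed by "://",
	or "www.") are assumptions made for illustration only.

```python
import re

# Assumed pattern: a token counts as a web page only if it starts with one of
# the prefixes named in the entry (http, https, file, ftp, www).
# Requiring "://" after a scheme and "." after www is an assumption.
URL_START = re.compile(r'(?:https?://|ftp://|file://|www\.)', re.IGNORECASE)

def looks_like_webpage(token: str) -> bool:
    """Return True if the token starts with a recognised web prefix."""
    return URL_START.match(token) is not None

# Dotted acronyms such as ki.mI. no longer qualify, because they lack a prefix.
for sample in ["http://example.org", "www.example.org", "ki.mI.", "a.p.j."]:
    print(sample, looks_like_webpage(sample))
```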

14-Jun-2013	Rashid Ahmad	<rashid101b@gmail.com>

	* Version 1.8

	- Decisions from the 17-19 January 2013 meeting are addressed.
	- 1.1 The following symbols will appear in output as tokens
	  a) ':-' when a colon is followed by a dash
	  b) '-JOIN' when a dash appears between two words without intervening spaces
	  c) '-' when a dash has a non-alphabetic character (or space) before or after it

	- 1.2 The following will appear as single tokens for numbers and dates, for example
	  '20.1', '3,00,205', '1-1-13', '1:1:13', '1.1.13', '1/1/13', '20vA'

	- 1.3 Colon inside a word will be recognized as visarga e.g. du:kha

	- 1.4 A dot is ambiguous between a marker for abbreviations and the end of a sentence.
	  This tokenizer has a provision for language-specific lists:
	  a) List of acronyms, e.g. kimI. ki.mI.
	  b) List of names, e.g. a.p.j.
	  c) List of exceptions (namely single aksharas), e.g. pi (drank in Hindi), ki (did in Hindi), etc.

	- All above decisions are incorporated in this version except 1.4b and 1.4c.
	- Token sequence numbers start at '1' in this version (earlier they started at '0').
	- Backup files and other unnecessary files are removed.
	- .. and ... exceptions are handled for FC8.
	- -- and --- exceptions are handled for FC8.
	- In the output, ... or --- is printed even if the input run is longer than three.
	- By default -JOIN is used in multiwords, but you can restrict it to - only by
	  passing the -j=no option to tokenizer_indic.pl.
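
	The dash and number rules of this version (1.1, 1.2, the --- limit and
	the -j=no switch) can be sketched in Python as follows. This is an
	illustration under assumptions, not the Perl implementation: the name
	tokenize_chunk, the join parameter and both patterns are hypothetical,
	and the alphabetic test is only meant for wx (ASCII) input.

```python
import re

# Assumed pattern for rule 1.2: digits joined by . , : / or -, with an
# optional alphabetic suffix (covers '20.1', '3,00,205', '1-1-13', '20vA').
NUMBER_OR_DATE = re.compile(r'\d+(?:[.,:/-]\d+)*[A-Za-z]*')
# Splits on ':-' or on a run of dashes, keeping the separators.
DASH_PIECES = re.compile(r'(:-|-+)')

def tokenize_chunk(chunk, join=True):
    """Split one whitespace-free chunk into tokens (hypothetical helper)."""
    if NUMBER_OR_DATE.fullmatch(chunk):
        return [chunk]                      # rule 1.2: keep as a single token
    pieces = DASH_PIECES.split(chunk)
    tokens = []
    for i, piece in enumerate(pieces):
        if not piece:
            continue
        if piece == '-':
            before = pieces[i - 1] if i > 0 else ''
            after = pieces[i + 1] if i + 1 < len(pieces) else ''
            # rule 1.1b applies only when letters touch the dash on both sides
            between_words = before[-1:].isalpha() and after[:1].isalpha()
            tokens.append('-JOIN' if between_words and join else '-')
        elif piece.startswith('-'):
            tokens.append('-' * min(len(piece), 3))   # longer runs print as ---
        else:
            tokens.append(piece)            # words and ':-' pass through as-is
    return tokens

# Token numbers start at 1, as in this version; join=False mimics -j=no.
for number, token in enumerate(tokenize_chunk("bAra-bAra"), start=1):
    print(number, token)
```

	Checking the number pattern first is a design choice of the sketch, so
	that a date such as 1-1-13 is not split by the dash rule 1.1c.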

12-Jan-2013	Rashid Ahmad	<rashid101b@gmail.com>

	* Version 1.7

	- Urdu language support added.
	- Punjabi language support added.

08-Jan-2013	Rashid Ahmad	<rashid101b@gmail.com>

	* Version 1.6

	- Fixed the issue with sentences ending in ..."... ।", ....'...?'
	- Acronym (dEY.) issue solved.
	- The issue remains for Indic-type emails and websites.

07-Jan-2013	Rashid Ahmad	<rashid101b@gmail.com>

	* Version 1.5

	- Acronym/abbreviation list contains both wx and UTF data.
	- Now works with both UTF and wx.
	- A multiword with a hyphen is currently split from w1-w2 into w1 - w2.
	- The agreed output is w1 -JOIN w2.
	- Currently tested in detail for Hindi and to a limited extent for Telugu.
	- Basic Telugu support added in this version.
	- Files data/tel.arc and tests/input_tel_wx.txt added.

04-Jan-2013	Rashid Ahmad	<rashid101b@gmail.com>

	* Version 1.4

	- Default sentence-end marker is a newline (\n) in raw text.
	- Currently works on raw text or text files.
	- Packaged properly for first release.
	- Support for other structures (CML, HTML and Document) will be added in future.

22-Dec-2012	Rashid Ahmad	<rashid101b@gmail.com>

	* Version 1.3

	- Input file option added.
	- Fixed some issues.

20-Dec-2012	Rashid Ahmad	<rashid101b@gmail.com>

	* Version 1.2

	- Acronym file updated.
	- website identification improved.
	- Code structure improved.

10-Dec-2012	Rashid Ahmad	<rashid101b@gmail.com>

	* Version 1.1

	- Number of test cases increased.
	- Email identification improved.

12-Nov-2012	Rashid Ahmad	<rashid101b@gmail.com>

	* Version 1.0

	- Base version.
	- This base version took 3-4 months to build.