# How to run the code
python3 tokenizer_for_indian_languages_on_files.py --input input_folder --output output_folder --lang 2-letter ISO 639-1 code

input_folder: Folder containing the raw input files
output_folder: Name of the output folder. You do not need to create it; it will be created automatically, and the tokenized output is saved there file by file in SSF format
language: Language code for --lang; use one of the codes listed below
Hindi: hi
Oriya/Odia: or
Manipuri: mn
Assamese: as
Bengali: bn
Punjabi: pa
Urdu: ur
English: en
Gujarati: gu
Marathi: mr
Malayalam: ml
Kannada: kn
Telugu: te
Tamil: ta
Sample run (run this command in your terminal):
python3 tokenizer_for_indian_languages_on_files.py --input Sample-Input --output Sample-Output --lang kn
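
If you call the tokenizer from another Python program, it can help to check the language code before launching it. The following is only a minimal sketch, not part of this repository: SUPPORTED_LANGS is copied from the list above, build_command and run_tokenizer are hypothetical helper names, and the sketch assumes the tokenizer script is in the current directory.

```python
import subprocess

# Language codes copied from the list in this README.
SUPPORTED_LANGS = {
    "hi", "or", "mn", "as", "bn", "pa", "ur",
    "en", "gu", "mr", "ml", "kn", "te", "ta",
}

# Assumption: the script is in the current working directory.
SCRIPT = "tokenizer_for_indian_languages_on_files.py"


def build_command(input_folder, output_folder, lang):
    """Return the argument list for one tokenizer run (hypothetical helper)."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    return [
        "python3", SCRIPT,
        "--input", input_folder,
        "--output", output_folder,
        "--lang", lang,
    ]


def run_tokenizer(input_folder, output_folder, lang):
    """Run the tokenizer once; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(input_folder, output_folder, lang), check=True)
```

Calling run_tokenizer("Sample-Input", "Sample-Output", "kn") issues the same command as the sample run above, while an unlisted code such as "xx" is rejected with a ValueError before anything is launched.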