readme.txt

# How to run the code
python3 tokenizer_for_indian_languages_on_files.py --input input_folder --output output_folder --lang language_code

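input_folder: Folder containing the raw text files to tokenize
output_folder: Name of the folder to write to. You do not need to create it beforehand; the script creates it and saves the tokenized output there, one file per input file, in SSF (Shakti Standard Format)
language_code: Two-letter code of the language to tokenize, passed with --lang. These are ISO 639-1 codes, except mn for Manipuri (ISO 639-1 assigns mn to Mongolian and has no code for Manipuri). Supported codes: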
Hindi: hi
Oriya/Odia: or
Manipuri: mn
Assamese: as
Bengali: bn
Punjabi: pa
Urdu: ur
English: en
Gujarati: gu
Marathi: mr
Malayalam: ml
Kannada: kn
Telugu: te
Tamil: ta
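
If the tokenizer is driven from another script, the list above can be kept as a lookup table. The following is a minimal sketch and not part of tokenizer_for_indian_languages_on_files.py; the names LANG_CODES and lang_code are hypothetical, and the codes are copied from the list above.

```python
# Language names and codes copied from the list in this README.
# LANG_CODES and lang_code() are illustrative names, not part of the tokenizer script.
LANG_CODES = {
    "Hindi": "hi",
    "Oriya": "or",
    "Odia": "or",
    "Manipuri": "mn",
    "Assamese": "as",
    "Bengali": "bn",
    "Punjabi": "pa",
    "Urdu": "ur",
    "English": "en",
    "Gujarati": "gu",
    "Marathi": "mr",
    "Malayalam": "ml",
    "Kannada": "kn",
    "Telugu": "te",
    "Tamil": "ta",
}


def lang_code(name: str) -> str:
    """Return the --lang code for a language name (case-insensitive)."""
    key = name.strip().title()
    try:
        return LANG_CODES[key]
    except KeyError:
        supported = ", ".join(sorted(LANG_CODES))
        raise ValueError(
            f"Unsupported language {name!r}; expected one of: {supported}"
        ) from None
```

Rejecting an unknown name before the tokenizer is launched gives a clearer error than passing an unsupported code to --lang.
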
Sample run (enter this command in your terminal):
python3 tokenizer_for_indian_languages_on_files.py --input Sample-Input --output Sample-Output --lang kn
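
To call the tokenizer from Python instead of typing the command, the same invocation can be wrapped with subprocess. This is a sketch under stated assumptions: the script is assumed to sit in the current working directory and to accept exactly the three flags shown in this README; build_command and run_tokenizer are hypothetical helper names.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumption: the tokenizer script is in the current working directory.
SCRIPT = "tokenizer_for_indian_languages_on_files.py"


def build_command(input_folder: str, output_folder: str, lang: str) -> List[str]:
    """Build the argument list for the command shown in this README."""
    return [
        "python3", SCRIPT,
        "--input", input_folder,
        "--output", output_folder,
        "--lang", lang,
    ]


def run_tokenizer(input_folder: str, output_folder: str, lang: str) -> None:
    """Run the tokenizer on a folder of raw files; raise if it fails."""
    if not Path(input_folder).is_dir():
        raise FileNotFoundError(f"Input folder not found: {input_folder}")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_command(input_folder, output_folder, lang), check=True)
```

For example, run_tokenizer("Sample-Input", "Sample-Output", "kn") issues the same command as the sample run.
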