# How to run the code

    python3 tokenizer_for_indian_languages_on_files.py --input input_folder --output output_folder --lang lang_code

- `input_folder`: folder containing the raw input files.
- `output_folder`: name of the output folder. You do not need to create it; it is created automatically, and the tokenized output is saved there file by file in SSF format.
- `lang_code`: two-letter language code (ISO 639-1 style); see the list of codes below.

Language codes:

- Hindi: hi
- Oriya/Odia: or
- Manipuri: mn
- Assamese: as
- Bengali: bn
- Punjabi: pa
- Urdu: ur
- English: en
- Gujarati: gu
- Marathi: mr
- Malayalam: ml
- Kannada: kn
- Telugu: te
- Tamil: ta

Sample Run (run this in your terminal):

    python3 tokenizer_for_indian_languages_on_files.py --input Sample-Input --output Sample-Output --lang kn
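
If you need to run the tokenizer for several folders or languages, it may help to build the command in a small script and check the language code before launching. The following is a minimal sketch, not part of the tokenizer itself: `LANG_CODES` and `build_command` are hypothetical names introduced here, and the sketch assumes the script sits in the current directory and accepts exactly the three flags listed in this README.

```python
import shlex

# Language codes copied from this README (assumed to be the full supported set).
LANG_CODES = {
    "Hindi": "hi", "Oriya/Odia": "or", "Manipuri": "mn", "Assamese": "as",
    "Bengali": "bn", "Punjabi": "pa", "Urdu": "ur", "English": "en",
    "Gujarati": "gu", "Marathi": "mr", "Malayalam": "ml", "Kannada": "kn",
    "Telugu": "te", "Tamil": "ta",
}

SCRIPT = "tokenizer_for_indian_languages_on_files.py"


def build_command(input_folder, output_folder, lang):
    """Return the tokenizer command as an argument list (hypothetical helper)."""
    if lang not in LANG_CODES.values():
        raise ValueError(f"unknown language code: {lang!r}")
    return [
        "python3", SCRIPT,
        "--input", input_folder,
        "--output", output_folder,
        "--lang", lang,
    ]


if __name__ == "__main__":
    # Print the command for the sample run; pass the list to
    # subprocess.run(cmd, check=True) to actually execute it.
    cmd = build_command("Sample-Input", "Sample-Output", "kn")
    print(shlex.join(cmd))
```

Returning an argument list rather than a single string avoids shell quoting problems when folder names contain spaces.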