Commit 96b61526 authored by Vandan Mujadia's avatar Vandan Mujadia

added training code and data

parent 95346c4b
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# ** Hybrid Neural Machine Translation for HimangiY **\n",
"#### Vandan Mujadia, Dipti Misra Sharma\n",
"#### LTRC, IIIT-Hyderabad, Hyderabad"
],
"metadata": {
"id": "axRPPFFocszE"
}
},
{
"cell_type": "markdown",
"source": [
"This demonstrates how to train a sequence-to-sequence (seq2seq) model for Kannada-to-Hindi translation **roughly** based on [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1706.03762) (Vaswani, Ashish et al).\n",
"\n",
"## An Example to Understand sequence to Sequence processing using Transformar Network.\n",
"\n",
"<img src=\"https://www.tensorflow.org/images/tutorials/transformer/apply_the_transformer_to_machine_translation.gif\" alt=\"Applying the Transformer to machine translation\">\n",
"\n",
"Source: [Google AI Blog](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html)\n",
"\n"
],
"metadata": {
"id": "aX4IVQ5qf52I"
}
},
{
"cell_type": "markdown",
"source": [
"## Applying the Transformer to machine translation.\n",
"\n",
"\n",
"<table>\n",
"<tr>\n",
" <td>\n",
" <img width=400 src=\"https://miro.medium.com/max/720/1*57LYNxwBGcCFFhkOCSnJ3g.png\"/>\n",
" </td>\n",
"</tr>\n",
"<tr>\n",
" <th colspan=1>This tutorial: An encoder/decoder connected by self attention neural network.</th>\n",
"<tr>\n",
"</table>"
],
"metadata": {
"id": "J36LSLbtf_Co"
}
},
{
"cell_type": "markdown",
"source": [
"# Tools that we are using here\n",
"\n",
"* Library : Opennmt\n",
"* Library : pytorch based neural network implemtation\n"
],
"metadata": {
"id": "z7BlcJtET3DH"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "gknS9QN5aU7P",
"outputId": "6be0aaa1-df1c-4ee1-f9e8-8ac70db99fa4"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: pip in /usr/local/lib/python3.10/dist-packages (23.1.2)\n",
"Collecting pip\n",
" Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.1/2.1 MB\u001b[0m \u001b[31m20.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hInstalling collected packages: pip\n",
" Attempting uninstall: pip\n",
" Found existing installation: pip 23.1.2\n",
" Uninstalling pip-23.1.2:\n",
" Successfully uninstalled pip-23.1.2\n",
"Successfully installed pip-23.2.1\n",
"OpenNMT-py sample_data\n",
"/content/OpenNMT-py\n",
"Note: switching to '1.2.0'.\n",
"\n",
"You are in 'detached HEAD' state. You can look around, make experimental\n",
"changes and commit them, and you can discard any commits you make in this\n",
"state without impacting any branches by switching back to a branch.\n",
"\n",
"If you want to create a new branch to retain commits you create, you may\n",
"do so (now or later) by using -c with the switch command. Example:\n",
"\n",
" git switch -c <new-branch-name>\n",
"\n",
"Or undo this operation with:\n",
"\n",
" git switch -\n",
"\n",
"Turn off this advice by setting config variable advice.detachedHead to false\n",
"\n",
"HEAD is now at 60125c80 Bump v1.2.0 (#1850)\n",
"Collecting torchtext==0.4.0\n",
" Downloading torchtext-0.4.0-py3-none-any.whl (53 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m53.1/53.1 kB\u001b[0m \u001b[31m1.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting torch==1.11.0\n",
" Downloading torch-1.11.0-cp310-cp310-manylinux1_x86_64.whl (750.6 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m750.6/750.6 MB\u001b[0m \u001b[31m1.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from torchtext==0.4.0) (4.66.1)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from torchtext==0.4.0) (2.31.0)\n",
"Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torchtext==0.4.0) (1.23.5)\n",
"Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from torchtext==0.4.0) (1.16.0)\n",
"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch==1.11.0) (4.5.0)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->torchtext==0.4.0) (3.2.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->torchtext==0.4.0) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->torchtext==0.4.0) (2.0.4)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->torchtext==0.4.0) (2023.7.22)\n",
"Installing collected packages: torch, torchtext\n",
" Attempting uninstall: torch\n",
" Found existing installation: torch 2.0.1+cu118\n",
" Uninstalling torch-2.0.1+cu118:\n",
" Successfully uninstalled torch-2.0.1+cu118\n",
" Attempting uninstall: torchtext\n",
" Found existing installation: torchtext 0.15.2\n",
" Uninstalling torchtext-0.15.2:\n",
" Successfully uninstalled torchtext-0.15.2\n",
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
"torchaudio 2.0.2+cu118 requires torch==2.0.1, but you have torch 1.11.0 which is incompatible.\n",
"torchdata 0.6.1 requires torch==2.0.1, but you have torch 1.11.0 which is incompatible.\n",
"torchvision 0.15.2+cu118 requires torch==2.0.1, but you have torch 1.11.0 which is incompatible.\u001b[0m\u001b[31m\n",
"\u001b[0mSuccessfully installed torch-1.11.0 torchtext-0.4.0\n",
"\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
"\u001b[0m"
]
}
],
"source": [
"!pip install -U pip\n",
"!!git clone https://github.com/OpenNMT/OpenNMT-py\n",
"! ls\n",
"%cd OpenNMT-py\n",
"!git checkout 1.2.0\n",
"!pip3 install torchtext==0.4.0 torch==1.11.0"
]
},
{
"cell_type": "code",
"source": [
"%cd /content/"
],
"metadata": {
"id": "_ZlqbvXSUi9h",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "12ceaba9-6468-414d-9621-a66a0ecec700"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"/content\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"# Check GPU"
],
"metadata": {
"id": "t0jUxSkJTpCY"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "j6W53TaterZ4",
"outputId": "44489711-32d8-45fe-82a1-18fe11eb003d"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"/bin/bash: line 1: nvidia-smi: command not found\n"
]
}
],
"source": [
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"source": [
"# Tokenizer Tool"
],
"metadata": {
"id": "mCO9U93FVy74"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "0RXkfYjYL35Q",
"outputId": "90c6e87d-f371-4e74-8d3b-edebe5e7fdad"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Collecting git+https://github.com/vmujadia/tokenizer.git\n",
" Cloning https://github.com/vmujadia/tokenizer.git to /tmp/pip-req-build-3lao7iuu\n",
" Running command git clone --filter=blob:none --quiet https://github.com/vmujadia/tokenizer.git /tmp/pip-req-build-3lao7iuu\n",
" Resolved https://github.com/vmujadia/tokenizer.git to commit 93cd09b81702108a51c08c9796fd1cc941a1b98b\n",
" Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from IL-Tokenizer==0.0.2) (6.0.1)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from IL-Tokenizer==0.0.2) (2.31.0)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->IL-Tokenizer==0.0.2) (3.2.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->IL-Tokenizer==0.0.2) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->IL-Tokenizer==0.0.2) (2.0.4)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->IL-Tokenizer==0.0.2) (2023.7.22)\n",
"Building wheels for collected packages: IL-Tokenizer\n",
" Building wheel for IL-Tokenizer (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for IL-Tokenizer: filename=IL_Tokenizer-0.0.2-py3-none-any.whl size=7225 sha256=1bec4df8b3d0a8ca3a48367f72deb4b2d68623a782699f27934083cfbaa6b959\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-624d680m/wheels/9a/fb/5b/3d75bfde8561726121c09f0f0a83389c05312df8a513808c41\n",
"Successfully built IL-Tokenizer\n",
"Installing collected packages: IL-Tokenizer\n",
"Successfully installed IL-Tokenizer-0.0.2\n",
"\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
"\u001b[0m"
]
}
],
"source": [
"!pip install git+https://github.com/vmujadia/tokenizer.git --upgrade"
]
},
{
"cell_type": "markdown",
"source": [
"# To Clean and Filter Parallel Corpora"
],
"metadata": {
"id": "GDJjJCnEV9Cx"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "gtF7MxJdMeky",
"outputId": "c3cb0732-7722-44c3-bc64-bb5de6c5de52"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Cloning into 'mosesdecoder'...\n",
"remote: Enumerating objects: 148097, done.\u001b[K\n",
"remote: Counting objects: 100% (525/525), done.\u001b[K\n",
"remote: Compressing objects: 100% (229/229), done.\u001b[K\n",
"remote: Total 148097 (delta 323), reused 441 (delta 292), pack-reused 147572\u001b[K\n",
"Receiving objects: 100% (148097/148097), 129.88 MiB | 10.32 MiB/s, done.\n",
"Resolving deltas: 100% (114349/114349), done.\n"
]
}
],
"source": [
"!git clone https://github.com/moses-smt/mosesdecoder.git"
]
},
{
"cell_type": "markdown",
"source": [
"# To tackle vocabulary issue : Subword algorithm"
],
"metadata": {
"id": "XksbkTQ9VUaQ"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "KGWYp4kFLUYL",
"outputId": "9cd5fc2b-9909-4bfb-a25d-fbb296e32281"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Cloning into 'subword-nmt'...\n",
"remote: Enumerating objects: 597, done.\u001b[K\n",
"remote: Counting objects: 100% (21/21), done.\u001b[K\n",
"remote: Compressing objects: 100% (17/17), done.\u001b[K\n",
"remote: Total 597 (delta 8), reused 12 (delta 4), pack-reused 576\u001b[K\n",
"Receiving objects: 100% (597/597), 252.23 KiB | 3.11 MiB/s, done.\n",
"Resolving deltas: 100% (357/357), done.\n"
]
}
],
"source": [
"!git clone https://github.com/rsennrich/subword-nmt.git"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "cC4xAujyMiYH",
"outputId": "87e6da6b-cea1-46dd-a89d-4d23459df3f0"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"mosesdecoder/scripts/training/clean-corpus-n.perl\n"
]
}
],
"source": [
"!ls mosesdecoder/scripts/training/clean-corpus-n.perl"
]
},
{
"cell_type": "markdown",
"source": [
"# For this; Training Corpora\n",
"\n",
"## Kannada - Hindi\n",
"## (small courpus MIT+CDAC-B developed)"
],
"metadata": {
"id": "RAMhRFrfZBGx"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "roXS-U5oM_A4",
"outputId": "6c8d82b9-b196-4bd3-8325-e0145a43f7cb"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"--2023-09-23 05:22:08-- https://ssmt.iiit.ac.in/uploads/data_mining/kannada-hindi_combined_all.hi\n",
"Resolving ssmt.iiit.ac.in (ssmt.iiit.ac.in)... 196.12.53.52\n",
"Connecting to ssmt.iiit.ac.in (ssmt.iiit.ac.in)|196.12.53.52|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 4195974 (4.0M) [application/octet-stream]\n",
"Saving to: ‘train.src’\n",
"\n",
"train.src 100%[===================>] 4.00M 460KB/s in 8.0s \n",
"\n",
"2023-09-23 05:22:18 (515 KB/s) - ‘train.src’ saved [4195974/4195974]\n",
"\n",
"--2023-09-23 05:22:18-- https://ssmt.iiit.ac.in/uploads/data_mining/kannada-hindi_combined_all.kn\n",
"Resolving ssmt.iiit.ac.in (ssmt.iiit.ac.in)... 196.12.53.52\n",
"Connecting to ssmt.iiit.ac.in (ssmt.iiit.ac.in)|196.12.53.52|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 3796364 (3.6M) [application/octet-stream]\n",
"Saving to: ‘train.tgt’\n",
"\n",
"train.tgt 100%[===================>] 3.62M 1.03MB/s in 3.5s \n",
"\n",
"2023-09-23 05:22:23 (1.03 MB/s) - ‘train.tgt’ saved [3796364/3796364]\n",
"\n",
"--2023-09-23 05:22:23-- https://ssmt.iiit.ac.in/uploads/data_mining/flores200-dev.hi\n",
"Resolving ssmt.iiit.ac.in (ssmt.iiit.ac.in)... 196.12.53.52\n",
"Connecting to ssmt.iiit.ac.in (ssmt.iiit.ac.in)|196.12.53.52|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 323267 (316K) [application/octet-stream]\n",
"Saving to: ‘valid.src’\n",
"\n",
"valid.src 100%[===================>] 315.69K 287KB/s in 1.1s \n",
"\n",
"2023-09-23 05:22:25 (287 KB/s) - ‘valid.src’ saved [323267/323267]\n",
"\n",
"--2023-09-23 05:22:25-- https://ssmt.iiit.ac.in/uploads/data_mining/flores200-dev.kn\n",
"Resolving ssmt.iiit.ac.in (ssmt.iiit.ac.in)... 196.12.53.52\n",
"Connecting to ssmt.iiit.ac.in (ssmt.iiit.ac.in)|196.12.53.52|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 358653 (350K) [application/octet-stream]\n",
"Saving to: ‘valid.tgt’\n",
"\n",
"valid.tgt 100%[===================>] 350.25K 59.0KB/s in 5.9s \n",
"\n",
"2023-09-23 05:22:33 (59.0 KB/s) - ‘valid.tgt’ saved [358653/358653]\n",
"\n"
]
}
],
"source": [
"! wget -O train.src https://ssmt.iiit.ac.in/uploads/data_mining/kannada-hindi_combined_all.hi\n",
"! wget -O train.tgt https://ssmt.iiit.ac.in/uploads/data_mining/kannada-hindi_combined_all.kn\n",
"! wget -O valid.src https://ssmt.iiit.ac.in/uploads/data_mining/flores200-dev.hi\n",
"! wget -O valid.tgt https://ssmt.iiit.ac.in/uploads/data_mining/flores200-dev.kn"
]
},
{
"cell_type": "markdown",
"source": [
"# Data Numbers"
],
"metadata": {
"id": "plXXn4vYh7dL"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "FnRRGSCnQ8DI",
"outputId": "e461d175-8340-4724-d30d-32ee541a6556"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Data Stats\n",
" 15877 train.src\n",
" 15877 train.tgt\n",
" 31754 total\n",
" 997 valid.src\n",
" 997 valid.tgt\n",
" 1994 total\n"
]
}
],
"source": [
"print ('Data Stats')\n",
"! wc -l train.*\n",
"! wc -l valid.*"
]
},
{
"cell_type": "markdown",
"source": [
"# Tokenize the text"
],
"metadata": {
"id": "fXbHzLYYiCvZ"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lbWZ1WpJPR1R"
},
"outputs": [],
"source": [
"from ilstokenizer import tokenizer\n",
"import codecs\n",
"\n",
"def to_tokenize_and_lower(input_path, output_path):\n",
" outfile = open(output_path, 'w')\n",
" for line in codecs.open(input_path):\n",
" line = line.strip()\n",
" line = tokenizer.tokenize(line).lower()\n",
" #print (line)\n",
" outfile.write(line+'\\n')\n",
" outfile.close()"
]
},
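{
"cell_type": "markdown",
"source": [
"A quick sanity check of the tokenizer on a made-up sentence (a hypothetical input, not from the corpus); punctuation should be split off and Latin-script tokens lowercased:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical input; the tokenizer should separate the sentence-final\n",
"# danda, and .lower() should only change the Latin-script token.\n",
"print(tokenizer.tokenize('मैं कल IIIT जाऊँगा।').lower())"
]
},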
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fjtXoUVZQXk4"
},
"outputs": [],
"source": [
"to_tokenize_and_lower('train.src','train.src.tkn')\n",
"to_tokenize_and_lower('train.tgt','train.tgt.tkn')\n",
"\n",
"to_tokenize_and_lower('valid.src','valid.src.tkn')\n",
"to_tokenize_and_lower('valid.tgt','valid.tgt.tkn')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Lzgo8VWmNYAG"
},
"outputs": [],
"source": [
"! cat train.src.tkn > train.all.tkn\n",
"! cat train.tgt.tkn >> train.all.tkn"
]
},
{
"cell_type": "markdown",
"source": [
"# Data Cleaning"
],
"metadata": {
"id": "FWpmi90piKUC"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "JBSq7A5gRuux",
"outputId": "bb481b2a-0b25-47e4-93c0-e304fa9a96fa"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"clean-corpus.perl: processing train.src.tkn & .tgt.tkn to train_filtered, cutoff 1-250, ratio 2.5\n",
".\n",
"Input sentences: 15878 Output sentences: 15807\n"
]
}
],
"source": [
"! perl mosesdecoder/scripts/training/clean-corpus-n.perl -ratio 2.5 train src.tkn tgt.tkn train_filtered 1 250"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "D2ibyHwUSkrL",
"outputId": "4c64d163-dca5-40ad-8f9f-c75227df72e1"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Data Stats\n",
" 31756 train.all.tkn\n",
" 15807 train_filtered.src.tkn\n",
" 15807 train_filtered.tgt.tkn\n",
" 15877 train.src\n",
" 15878 train.src.tkn\n",
" 15877 train.tgt\n",
" 15878 train.tgt.tkn\n",
" 126880 total\n",
" 997 valid.src\n",
" 997 valid.src.tkn\n",
" 997 valid.tgt\n",
" 997 valid.tgt.tkn\n",
" 3988 total\n"
]
}
],
"source": [
"print ('Data Stats')\n",
"! wc -l train*\n",
"! wc -l valid*"
]
},
{
"cell_type": "markdown",
"source": [
"# Train subword model,\n",
"## Experiment with no of subword merge operation"
],
"metadata": {
"id": "oukoeGHniPEC"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "COV2XMvATTIw",
"outputId": "f020da54-a4a4-4d35-fbdc-9a6c343da6af"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"100% 5000/5000 [00:23<00:00, 211.36it/s]\n"
]
}
],
"source": [
"!python subword-nmt/subword_nmt/learn_bpe.py -s 5000 < train.all.tkn > train.codes"
]
},
{
"cell_type": "markdown",
"source": [
"# How do subword codes look"
],
"metadata": {
"id": "4AZodBXIir_D"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3kduoQxrOwc7",
"outputId": "b001f507-dcb4-4c13-a019-dfb6d74d1a3b"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"#version: 0.2\n",
"ತ ್\n",
"ನ ್\n",
"क े</w>\n",
"ಲ ್\n",
"् र\n",
"ಾ ಗ\n",
"ಗ ಳ\n",
"ತ್ ತ\n",
"ಕ ್\n"
]
}
],
"source": [
"! head -n 10 train.codes"
]
},
{
"cell_type": "markdown",
"source": [
"# Apply Subword to the corpus"
],
"metadata": {
"id": "eSbwhzpqjbH8"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "13QghGSZT6tA"
},
"outputs": [],
"source": [
"!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < train.src > train.kn\n",
"!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < train.tgt > train.hi\n",
"\n",
"!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < valid.src > valid.kn\n",
"!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < valid.tgt > valid.hi"
]
},
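{
"cell_type": "markdown",
"source": [
"`apply_bpe.py` marks every non-final subword piece with a trailing `@@`, so the segmentation is reversible. A minimal sketch of undoing it (e.g. `ಪ್ರ@@ ವಾ@@ ಹ` becomes `ಪ್ರವಾಹ`):"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def undo_bpe(line):\n",
"    # Re-join subword pieces by deleting the '@@ ' continuation markers\n",
"    return line.replace('@@ ', '')\n",
"\n",
"with open('train.kn') as f:\n",
"    print(undo_bpe(f.readline().strip()))"
]
},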
{
"cell_type": "markdown",
"source": [
"# Training Corpus now"
],
"metadata": {
"id": "YOo9h4ULjXuU"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "_CRKB3qOURoT",
"outputId": "2229679e-b3c0-4110-d2f8-abde21f06bff"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"ಪ್ರ@@ ವಾ@@ ಹ ಪೀ@@ ಡಿತ ಕ@@ ಲ್ಲ@@ ಿದ್ದ@@ ಲು ಗಣ@@ ಿಗಳನ್ನು s@@ e@@ c@@ l ಕೈ@@ ಬಿಡ@@ ಬೇಕಾ@@ ಯಿತು .\n",
"ಹಣಕಾ@@ ಸು ಸಚಿ@@ ವಾ@@ ಲಯ@@ ವು ರಾಯ@@ ಧ@@ ನ ಮೇಲಿನ ತನ್ನ ಕೈ@@ ಬಿ@@ ಟ್ಟ ಹಕ್ಕ@@ ನ್ನು ಹೊ@@ ಸದ@@ ಾಗಿ ಪ್ರಸ್ತು@@ ತ@@ ಪಡಿಸಲು ನಿರ್ಧರಿಸ@@ ಿದೆ .\n",
"ಕಂಪ@@ ನಿಯ ತೊ@@ ರೆ@@ ದ ಆ@@ ಸ್@@ ತಿಯನ್ನು ಹ@@ ರಾಜ@@ ು ಮಾಡಲು ಸರ್ಕಾರ ನಿರ್ಧರಿಸ@@ ಿದೆ .\n",
"ವಿ@@ ಮಾನ ನಿ@@ ಲ್@@ ದಾ@@ ಣ@@ ದಲ್ಲಿ ಕೈ@@ ಬಿಡ@@ ಲಾದ ಸರ@@ ಕು@@ ಗಳನ್ನು ನಿರ್ವಹ@@ ಿಸುವಲ್ಲಿ ಸಮಸ್ಯೆ ಇದೆ .\n",
"ಶಿ@@ ಥ@@ ಿ@@ ಲ@@ ಗೊಂಡ ಕಟ್ಟಡ@@ ಗಳ ತ್ಯ@@ ಜ@@ ಿಸುವ ಪ್ರಕ್ರಿಯ@@ ೆಯನ್ನು ಎ@@ ಸ್ಟ@@ ೇಟ್ ಇಲಾಖೆ ಆರಂಭ@@ ಿಸಿದೆ .\n",
"ಮಾ@@ ಲಿ@@ ನ್ಯ@@ ವನ್ನು ಕೊ@@ ನ@@ ೆಗೊಳ@@ ಿಸಲು ಕೈಗೊಳ್ಳ@@ ಬೇಕು .\n",
"ಸರ್ಕಾರವು ಆ@@ ಮ@@ ದು ಸು@@ ಂಕ@@ ವನ್ನು ಕಡಿಮೆ ಮಾಡಲು ಕ್ರಮಗಳನ್ನು ತೆಗೆದುಕೊಳ್ಳ@@ ುತ್ತಿದೆ .\n",
"ಕಾನೂ@@ ನಿ@@ ನ ಈ ವಿಭಾಗ@@ ವನ್ನು ರ@@ ದ್ದ@@ ು@@ ಗೊಳಿಸಲಾಗಿದೆ .\n",
"ಸರ್ಕಾರಿ ಮನೆ@@ ಗಳ ಬಾ@@ ಡ@@ ಿಗೆ ಕಡ@@ ಿತ@@ ದ ಪ್ರಸ್ತಾ@@ ವ@@ ನೆ ಪರಿ@@ ಶೀ@@ ಲನ@@ ೆಯಲ್ಲ@@ ಿದೆ .\n",
"ಪ್ರ@@ ಯಾ@@ ಣ@@ ಿಕ ಸಾ@@ ರಿಗ@@ ೆಗೆ 60 % ರಿಯಾ@@ ಯ@@ ಿತ@@ ಿಯ@@ ೊಂದಿಗೆ 1@@ 2.@@ 3@@ 6 % ದ@@ ರದಲ್ಲಿ ತೆ@@ ರಿಗೆ ವಿಧ@@ ಿಸಲಾಗುತ್ತದೆ .\n"
]
}
],
"source": [
"! head -n 10 train.kn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "39V10nlIUXT9",
"outputId": "460dadc6-6551-4141-a30f-cdd3a46e2880"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"एस@@ ई@@ सी@@ एल को बा@@ ढ़ प्रभावित को@@ य@@ ला ख@@ दा@@ नों को छोड़@@ ना पड़ा ।\n",
"वित्@@ त मंत्रालय ने अपने राज@@ स्@@ व संबंधी छो@@ ड़े हुए दा@@ वे को नए सि@@ रे से पे@@ श करने का निर्णय लिया है ।\n",
"सरकार ने कंपनी की छो@@ ड़ी हुई परि@@ सं@@ पत्@@ तियों की नी@@ ला@@ मी का फै@@ स@@ ला किया है ।\n",
"वि@@ मान पत्@@ तन में छो@@ ड़े हुए कार्@@ ग@@ ो को रखने की समस्या आ रही है ।\n",
"सं@@ प@@ दा विभाग द्वारा ज@@ र्@@ जर भ@@ व@@ नों को छोड़@@ ने की प्रक्रिया शुरू की गई ।\n",
"प्र@@ दू@@ षण समाप्त करने के लिए कदम उठा@@ ने हैं ।\n",
"सरकार आ@@ या@@ त शु@@ ल्@@ क कम करने के लिए कदम उठ@@ ा रही है ।\n",
"कान@@ ून की इस धार@@ ा का उप@@ श@@ मन हो चु@@ का है ।\n",
"सरकारी आवा@@ सों के कि@@ रा@@ ये में कमी का प्रस्ताव वि@@ च@@ ारा@@ ध@@ ीन है ।\n",
"या@@ त्र@@ ी परि@@ वहन पर 60 % की छू@@ ट के साथ 1@@ 2.@@ 3@@ 6 % की दर से कर व@@ सू@@ ला जाता है ।\n"
]
}
],
"source": [
"! head -n 10 train.hi"
]
},
{
"cell_type": "markdown",
"source": [
"\n",
"# Starting NMT Training\n",
"## Preprocessing stage ; create dictionaries, make corpora ready for parallel processing\n"
],
"metadata": {
"id": "fuZl-vwLjqy8"
}
},
{
"cell_type": "code",
"source": [
"!pip install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 configargparse\n",
"\n",
"!python OpenNMT-py/preprocess.py \\\n",
"\t -train_src train.kn \\\n",
"\t -train_tgt train.hi \\\n",
"\t -valid_src valid.kn \\\n",
"\t -valid_tgt valid.hi \\\n",
"\t -save_data processed -share_vocab -overwrite"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7dNw2BuMbzlg",
"outputId": "cb4cef54-875e-4a9d-cf2e-15e9ef7e7edd"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: torch==1.13.1 in /usr/local/lib/python3.10/dist-packages (1.13.1)\n",
"Requirement already satisfied: torchvision==0.14.1 in /usr/local/lib/python3.10/dist-packages (0.14.1)\n",
"Requirement already satisfied: torchaudio==0.13.1 in /usr/local/lib/python3.10/dist-packages (0.13.1)\n",
"Collecting configargparse\n",
" Obtaining dependency information for configargparse from https://files.pythonhosted.org/packages/6f/b3/b4ac838711fd74a2b4e6f746703cf9dd2cf5462d17dac07e349234e21b97/ConfigArgParse-1.7-py3-none-any.whl.metadata\n",
" Downloading ConfigArgParse-1.7-py3-none-any.whl.metadata (23 kB)\n",
"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch==1.13.1) (4.5.0)\n",
"Requirement already satisfied: nvidia-cuda-runtime-cu11==11.7.99 in /usr/local/lib/python3.10/dist-packages (from torch==1.13.1) (11.7.99)\n",
"Requirement already satisfied: nvidia-cudnn-cu11==8.5.0.96 in /usr/local/lib/python3.10/dist-packages (from torch==1.13.1) (8.5.0.96)\n",
"Requirement already satisfied: nvidia-cublas-cu11==11.10.3.66 in /usr/local/lib/python3.10/dist-packages (from torch==1.13.1) (11.10.3.66)\n",
"Requirement already satisfied: nvidia-cuda-nvrtc-cu11==11.7.99 in /usr/local/lib/python3.10/dist-packages (from torch==1.13.1) (11.7.99)\n",
"Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torchvision==0.14.1) (1.23.5)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from torchvision==0.14.1) (2.31.0)\n",
"Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision==0.14.1) (9.4.0)\n",
"Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from nvidia-cublas-cu11==11.10.3.66->torch==1.13.1) (67.7.2)\n",
"Requirement already satisfied: wheel in /usr/local/lib/python3.10/dist-packages (from nvidia-cublas-cu11==11.10.3.66->torch==1.13.1) (0.41.2)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision==0.14.1) (3.2.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision==0.14.1) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision==0.14.1) (2.0.4)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision==0.14.1) (2023.7.22)\n",
"Downloading ConfigArgParse-1.7-py3-none-any.whl (25 kB)\n",
"Installing collected packages: configargparse\n",
"Successfully installed configargparse-1.7\n",
"\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
"\u001b[0m[2023-09-23 05:40:19,687 INFO] Extracting features...\n",
"[2023-09-23 05:40:19,690 INFO] * number of source features: 1.\n",
"[2023-09-23 05:40:19,690 INFO] * number of target features: 0.\n",
"[2023-09-23 05:40:19,690 INFO] Building `Fields` object...\n",
"[2023-09-23 05:40:19,691 INFO] Building & saving training data...\n",
"[2023-09-23 05:40:19,870 INFO] Building shard 0.\n",
"[2023-09-23 05:40:21,946 INFO] * saving 0th train data shard to processed.train.0.pt.\n",
"[2023-09-23 05:40:24,031 INFO] * tgt vocab size: 2488.\n",
"[2023-09-23 05:40:24,038 INFO] * src vocab size: 2956.\n",
"[2023-09-23 05:40:24,039 INFO] * src_feat_0 vocab size: 39.\n",
"[2023-09-23 05:40:24,039 INFO] * merging src and tgt vocab...\n",
"[2023-09-23 05:40:24,057 INFO] * merged vocab size: 5306.\n",
"[2023-09-23 05:40:24,091 INFO] Building & saving validation data...\n",
"[2023-09-23 05:40:24,142 INFO] Building shard 0.\n",
"[2023-09-23 05:40:24,258 INFO] * saving 0th valid data shard to processed.valid.0.pt.\n"
]
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_F8k7167up78"
},
"outputs": [],
"source": [
"ls data-bin/trial"
]
},
{
"cell_type": "markdown",
"source": [
"# Training\n",
"## Parameters to fix for your corpora and language pair\n",
"\n",
"\n",
"\n",
"```\n",
" --encoder-embed-dim\t128 --encoder-ffn-embed-dim\t128 \\\n",
" --encoder-layers\t2 --encoder-attention-heads\t2 \\\n",
" --decoder-embed-dim\t128 --decoder-ffn-embed-dim\t128 \\\n",
" --decoder-layers\t2 --decoder-attention-heads\t2 \\\n",
" --dropout 0.3 --weight-decay 0.0 \\\n",
" --max-update 4000 \\\n",
" --keep-last-epochs\t10 \\\n",
"```\n",
"\n",
"\n",
"\n",
"---\n",
"\n"
],
"metadata": {
"id": "ThmR6CUikOCr"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RonQ8-I1yaPB"
},
"outputs": [],
"source": [
"! python OpenNMT-py/train.py -data processed -save_model model.pt \\\n",
"\t\t-layers 6 -rnn_size 512 -src_word_vec_size 512 -tgt_word_vec_size 512 -transformer_ff 2048 -heads 8 \\\n",
"\t\t-encoder_type transformer -decoder_type transformer -position_encoding \\\n",
"\t\t-train_steps 200000 -max_generator_batches 2 -dropout 0.1 \\\n",
"\t\t-batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \\\n",
"\t\t-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \\\n",
"\t\t-max_grad_norm 0 -param_init 0 -param_init_glorot \\\n",
"\t\t-label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \\\n",
"\t\t-world_size 1 -gpu_ranks 0"
]
}
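,
{
"cell_type": "markdown",
"source": [
"# Translate with the trained model\n",
"\n",
"A minimal decoding sketch: OpenNMT-py saves checkpoints as `<save_model>_step_<N>.pt`, so the checkpoint name below is an assumption; substitute one that training actually produced."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Decode the BPE-segmented validation source with a saved checkpoint\n",
"# (the step number is an assumption; adjust to a real checkpoint).\n",
"!python OpenNMT-py/translate.py -model model.pt_step_10000.pt -src valid.kn -output pred.hi -replace_unk -verbose\n",
"# Remove the @@ continuation markers to recover normal tokens\n",
"!sed -r 's/(@@ )|(@@ ?$)//g' pred.hi > pred.detok.hi"
]
}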
],
"metadata": {
"accelerator": "TPU",
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
# Training Code Script
## Code
1. lndian_language_to_indian_language_machine_transaltion_iiith_2023_v1_5.py
2. Lndian_Language_to_Indian_Language_Machine_Transaltion_IIITH_2023_v1_5.ipynb (comments are included; please use your own data for training)
## Colab version
1. https://colab.research.google.com/drive/1jHXH6lBQZTkptC3MMb1O880draZL3dfy?usp=sharing
## Training Data
1. https://ssmt.iiit.ac.in/meitygit/palash/himangy-corpora
2. https://ai4bharat.iitm.ac.in/bpcc
3. https://ssmt.iiit.ac.in/meitygit/root/Hindi-Telugu-Data
4. https://ssmt.iiit.ac.in/meitygit/root/PSA-DATA-NLTM
# -*- coding: utf-8 -*-
"""Lndian Language to Indian Language Machine Transaltion-IIITH-2023-v1.5.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1jHXH6lBQZTkptC3MMb1O880draZL3dfy
# **Hybrid Neural Machine Translation for HimangiY**
#### Vandan Mujadia, Dipti Misra Sharma
#### LTRC, IIIT-Hyderabad, Hyderabad
This tutorial demonstrates how to train a sequence-to-sequence (seq2seq) model for Kannada-to-Hindi translation, **roughly** based on [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani et al.).
## An Example to Understand Sequence-to-Sequence Processing Using the Transformer Network
<img src="https://www.tensorflow.org/images/tutorials/transformer/apply_the_transformer_to_machine_translation.gif" alt="Applying the Transformer to machine translation">
Source: [Google AI Blog](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html)
## Applying the Transformer to machine translation.
<table>
<tr>
<td>
<img width=400 src="https://miro.medium.com/max/720/1*57LYNxwBGcCFFhkOCSnJ3g.png"/>
</td>
</tr>
<tr>
<th colspan=1>This tutorial: An encoder/decoder connected by self attention neural network.</th>
<tr>
</table>
# Tools that we are using here
* Library: OpenNMT-py
* Library: a PyTorch-based neural network implementation
"""
# Commented out IPython magic to ensure Python compatibility.
!pip install -U pip
!git clone https://github.com/OpenNMT/OpenNMT-py
! ls
# %cd OpenNMT-py
!git checkout 1.2.0
!pip3 install torchtext==0.4.0 torch==1.11.0
# Commented out IPython magic to ensure Python compatibility.
# %cd /content/
"""# Check GPU"""
!nvidia-smi
"""# Tokenizer Tool"""
!pip install git+https://github.com/vmujadia/tokenizer.git --upgrade
"""# To Clean and Filter Parallel Corpora"""
!git clone https://github.com/moses-smt/mosesdecoder.git
"""# To tackle vocabulary issue : Subword algorithm"""
!git clone https://github.com/rsennrich/subword-nmt.git
!ls mosesdecoder/scripts/training/clean-corpus-n.perl
"""# For this; Training Corpora
## Kannada - Hindi
## (small courpus MIT+CDAC-B developed)
"""
! wget -O train.src https://ssmt.iiit.ac.in/uploads/data_mining/kannada-hindi_combined_all.hi
! wget -O train.tgt https://ssmt.iiit.ac.in/uploads/data_mining/kannada-hindi_combined_all.kn
! wget -O valid.src https://ssmt.iiit.ac.in/uploads/data_mining/flores200-dev.hi
! wget -O valid.tgt https://ssmt.iiit.ac.in/uploads/data_mining/flores200-dev.kn
"""# Data Numbers"""
print ('Data Stats')
! wc -l train.*
! wc -l valid.*
"""# Tokenize the text"""
from ilstokenizer import tokenizer
import codecs
def to_tokenize_and_lower(input_path, output_path):
    # Tokenize each line with the IL tokenizer and lowercase it
    # (lowercasing only affects Latin-script tokens in Indic text).
    with codecs.open(input_path, encoding='utf-8') as infile, \
         codecs.open(output_path, 'w', encoding='utf-8') as outfile:
        for line in infile:
            line = tokenizer.tokenize(line.strip()).lower()
            outfile.write(line + '\n')
to_tokenize_and_lower('train.src','train.src.tkn')
to_tokenize_and_lower('train.tgt','train.tgt.tkn')
to_tokenize_and_lower('valid.src','valid.src.tkn')
to_tokenize_and_lower('valid.tgt','valid.tgt.tkn')
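
# Concatenate both sides so BPE learns a single joint source+target
# vocabulary (this pairs with the -share_vocab flag used at preprocessing).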
! cat train.src.tkn > train.all.tkn
! cat train.tgt.tkn >> train.all.tkn
"""# Data Cleaning"""
! perl mosesdecoder/scripts/training/clean-corpus-n.perl -ratio 2.5 train src.tkn tgt.tkn train_filtered 1 250
print ('Data Stats')
! wc -l train*
! wc -l valid*
"""# Train subword model,
## Experiment with no of subword merge operation
"""
!python subword-nmt/subword_nmt/learn_bpe.py -s 5000 < train.all.tkn > train.codes
"""# How do subword codes look"""
! head -n 10 train.codes
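
# Each line of train.codes after the "#version" header is one merge rule,
# in priority order. A minimal sketch (a hypothetical helper, not part of
# subword-nmt, and simplified: it ignores the '</w>' end-of-word marker)
# of how such rules greedily segment a single word:
def bpe_segment(word, merges):
    symbols = list(word)  # start from individual characters
    for a, b in merges:   # apply merge rules in learned order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return symbols

# e.g. with the first two Devanagari-style rules this yields ['त', 'त्त']
print(bpe_segment('तत्त', [('त', '्'), ('त्', 'त')]))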
"""# Apply Subword to the corpus"""
!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < train.src > train.kn
!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < train.tgt > train.hi
!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < valid.src > valid.kn
!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < valid.tgt > valid.hi
"""# Training Corpus now"""
! head -n 10 train.kn
! head -n 10 train.hi
"""
# Starting NMT Training
## Preprocessing stage: create dictionaries and make the corpora ready for parallel processing
"""
!pip install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 configargparse
!python OpenNMT-py/preprocess.py \
-train_src train.kn \
-train_tgt train.hi \
-valid_src valid.kn \
-valid_tgt valid.hi \
-save_data processed -share_vocab -overwrite
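
# A hedged sketch of inspecting what preprocess.py wrote: with OpenNMT-py
# 1.2 the vocabulary is saved as torchtext fields in processed.vocab.pt
# (the exact layout is version-dependent, so treat this as an assumption).
import torch
fields = torch.load('processed.vocab.pt')
print(type(fields))  # typically a dict holding 'src' and 'tgt' fields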
# List the preprocessed shards and vocabulary produced above
! ls processed*
"""# Training
## Parameters to tune for your corpus and language pair
The trainer below is OpenNMT-py, so the knobs use its flag names; for a small corpus like this one, a smaller model and fewer training steps are worth trying, e.g.:
```
 -layers 2 -heads 2 \
 -rnn_size 128 -src_word_vec_size 128 -tgt_word_vec_size 128 \
 -transformer_ff 128 \
 -dropout 0.3 \
 -train_steps 4000 \
 -keep_checkpoint 10 \
```
---
"""
! python OpenNMT-py/train.py -data processed -save_model model.pt \
-layers 6 -rnn_size 512 -src_word_vec_size 512 -tgt_word_vec_size 512 -transformer_ff 2048 -heads 8 \
-encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
-batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
-max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
-world_size 1 -gpu_ranks 0
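
# A minimal decoding sketch: OpenNMT-py saves checkpoints as
# <save_model>_step_<N>.pt, so the step number below is an assumption;
# substitute a checkpoint that training actually produced.
!python OpenNMT-py/translate.py -model model.pt_step_10000.pt -src valid.kn -output pred.hi -replace_unk -verbose
# Remove the @@ continuation markers to recover normal tokens
!sed -r 's/(@@ )|(@@ ?$)//g' pred.hi > pred.detok.hi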