Training Data Part 1 and Part 2/2 Released

Disfluency Identification for 6 Indian Languages 2023

Supported by HimangY, Bhashini



Disfluency identification is a fundamental natural language processing (NLP) task that plays a crucial role in improving the accuracy and fluency of spoken language processing applications such as automatic speech recognition (ASR), machine translation, dialog systems, and language understanding. Disfluencies are interruptions, hesitations, or corrections in spoken language that can impact the overall performance and usability of such applications. This shared task focuses on disfluency identification in six Indian languages, namely Hindi, Kannada, Bengali, Telugu, Tamil and Marathi.

The primary objective of this shared task is to advance the development of disfluency identification models and systems for Indian languages. Participants are encouraged to build text-based disfluency detection models that automatically identify disfluencies in transcribed speech for the specified languages. Participants will be provided with a corpus of transcribed speech and text data in Hindi, Kannada, Bengali, Telugu, Tamil and Marathi (10 hours of transcribed speech per language). The dataset will include a mix of scripted, synthetic and spontaneous speech, representing various domains such as news, interviews, conversations, technical, educational and more. The dataset has been annotated with disfluency labels, including hesitations, repetitions, corrections, and other common disfluency types.

Goals of the shared task:

  • To develop a language processing tool that can benefit research and downstream applications such as Speech Translation and Question Answering.
  • To provide the community with a new dataset and boost research on Disfluency Processing for Indian Languages.
  • For downstream tasks (e.g., Speech Translation), disfluency identification and processing would be one of the first steps. It identifies and processes disfluencies in the spoken utterance, so that the subsequent Machine Translation step can decide how to translate them as required.
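The last goal can be illustrated with a minimal sketch, assuming the identification step outputs one label per token. The label names (`O`, `REPAIR`) and the function `remove_disfluencies` are hypothetical and are not part of the shared task data format. The utterance is the Hindi Repair example from this page.

```python
from typing import List

FLUENT = "O"  # hypothetical label for fluent tokens (an assumption, not the official tag set)


def remove_disfluencies(tokens: List[str], labels: List[str]) -> str:
    """Keep only the tokens labelled fluent and join them into a cleaned utterance."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return " ".join(tok for tok, lab in zip(tokens, labels) if lab == FLUENT)


# Hindi Repair example: "खाने जाउँगा" is the disfluent (repaired) span.
tokens = "मैं 9 बजे खाने जाउँगा खेलने जाउँगा".split()
labels = ["O", "O", "O", "REPAIR", "REPAIR", "O", "O"]

cleaned = remove_disfluencies(tokens, labels)
print(cleaned)  # the cleaned text could then be passed to a machine translation system
```

Dropping disfluent tokens is only one possible policy; a translation system might instead keep some types (for example, pet phrases) depending on the requirement.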



Task Details

Subtask-1a Disfluency Identification for Hindi

The subtask is defined as follows:

  • Given a transcribed spoken text, participants have to develop an algorithm that identifies fluent and disfluent words/phrases in the text and, for each disfluent span, the type of disfluency.

Evaluation

The subtask will be scored using standard evaluation metrics, including accuracy, precision, recall, F1-score and BLEU score. Submissions will be ranked by F1-score.
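As a rough illustration of the ranking metric, the sketch below computes precision, recall and F1 over disfluent tokens. The official scorer is not described on this page, so the token-level granularity, the micro-averaging, and the label names are all assumptions.

```python
from typing import List, Tuple


def disfluency_prf(gold: List[str], pred: List[str], fluent: str = "O") -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over disfluent tokens.

    A predicted disfluent token counts as correct only when its type
    matches the gold type at the same position.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != fluent)
    n_pred = sum(1 for p in pred if p != fluent)
    n_gold = sum(1 for g in gold if g != fluent)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted labels for a six-token utterance.
gold = ["O", "FILLED_PAUSE", "O", "REPAIR", "REPAIR", "O"]
pred = ["O", "FILLED_PAUSE", "O", "REPAIR", "O", "REPETITION"]

p, r, f = disfluency_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Here two of the three predicted disfluent tokens are correct and two of the three gold disfluent tokens are recovered, so precision, recall and F1 all equal 2/3.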

Corpus

Training Data : Part 1 and Part 2
to be released
- The password for the data download will be shared after registration.
- By clicking download, you agree to the data license and the shared task rules.

Example : For Hindi

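उसे उम्म्म एक टेक्स्ट मैसेज भेजो --> उसे उम्म्म एक टेक्स्ट मैसेज भेजो
उम्म्म ("umm") is a Filled Pause type of disfluency. English gloss: "Send him umm a text message."

मैं 9 बजे खाने जाउँगा खेलने जाउँगा --> मैं 9 बजे खाने जाउँगा खेलने जाउँगा
खाने जाउँगा ("will go to eat") is a Repair type of disfluency; the speaker corrects it to खेलने जाउँगा ("will go to play"). English gloss: "At 9 o'clock I will go to eat, will go to play."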

Similarly, the task is to identify filled pauses, pet phrases, repetitions, repairs, false starts, and to mark these types as part of the prediction.
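One common way to encode such span annotations for a sequence labelling model is the BIO scheme. The sketch below converts the two Hindi examples into token-level BIO tags. The released data format is not described here, so the tag names, the whitespace tokenization, and the helper `spans_to_bio` are assumptions made for illustration.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, disfluency type)
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert disfluency spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, kind in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"invalid span ({start}, {end})")
        tags[start] = f"B-{kind}"          # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{kind}"          # remaining tokens of the span
    return tags


filled_pause = "उसे उम्म्म एक टेक्स्ट मैसेज भेजो".split()
repair = "मैं 9 बजे खाने जाउँगा खेलने जाउँगा".split()

fp_tags = spans_to_bio(filled_pause, [(1, 2, "FILLED_PAUSE")])
rp_tags = spans_to_bio(repair, [(3, 5, "REPAIR")])

for tok, tag in zip(repair, rp_tags):
    print(tok, tag)
```

With this encoding, a model predicts one tag per token, and the disfluency type can be read from the tag suffix.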

Subtask-1b Disfluency Identification for Marathi

The subtask is defined as follows:

  • Given a transcribed spoken text, participants have to develop an algorithm that identifies fluent and disfluent words/phrases in the text and, for each disfluent span, the type of disfluency.

Evaluation

The subtask will be scored using standard evaluation metrics, including accuracy, precision, recall, F1-score and BLEU score. Submissions will be ranked by F1-score.

Corpus

Training Data : Part 1 and Part 2
to be released
- The password for the data download will be shared after registration.
- By clicking download, you agree to the data license and the shared task rules.

Example :

TBA
The task is to identify filled pauses, pet phrases, repetitions, repairs, false starts, and to mark these types as part of the prediction.

Subtask-1c Disfluency Identification for Tamil

The subtask is defined as follows:

  • Given a transcribed spoken text, participants have to develop an algorithm that identifies fluent and disfluent words/phrases in the text and, for each disfluent span, the type of disfluency.

Evaluation

The subtask will be scored using standard evaluation metrics, including accuracy, precision, recall, F1-score and BLEU score. Submissions will be ranked by F1-score.

Corpus

Training Data : Part 1 and Part 2
to be released
- The password for the data download will be shared after registration.
- By clicking download, you agree to the data license and the shared task rules.

Example :

TBA
The task is to identify filled pauses, pet phrases, repetitions, repairs, false starts, and to mark these types as part of the prediction.

Subtask-1d Disfluency Identification for Kannada

The subtask is defined as follows:

  • Given a transcribed spoken text, participants have to develop an algorithm that identifies fluent and disfluent words/phrases in the text and, for each disfluent span, the type of disfluency.

Evaluation

The subtask will be scored using standard evaluation metrics, including accuracy, precision, recall, F1-score and BLEU score. Submissions will be ranked by F1-score.

Corpus

Training Data : Part 1 and Part 2
to be released
- The password for the data download will be shared after registration.
- By clicking download, you agree to the data license and the shared task rules.

Example :

TBA
The task is to identify filled pauses, pet phrases, repetitions, repairs, false starts, and to mark these types as part of the prediction.

Subtask-1e Disfluency Identification for Bangla

The subtask is defined as follows:

  • Given a transcribed spoken text, participants have to develop an algorithm that identifies fluent and disfluent words/phrases in the text and, for each disfluent span, the type of disfluency.

Evaluation

The subtask will be scored using standard evaluation metrics, including accuracy, precision, recall, F1-score and BLEU score. Submissions will be ranked by F1-score.

Corpus

Training Data : Part 1 and Part 2
to be released
- The password for the data download will be shared after registration.
- By clicking download, you agree to the data license and the shared task rules.

Example :

TBA
The task is to identify filled pauses, pet phrases, repetitions, repairs, false starts, and to mark these types as part of the prediction.

Subtask-1f Disfluency Identification for Telugu

The subtask is defined as follows:

  • Given a transcribed spoken text, participants have to develop an algorithm that identifies fluent and disfluent words/phrases in the text and, for each disfluent span, the type of disfluency.

Evaluation

The subtask will be scored using standard evaluation metrics, including accuracy, precision, recall, F1-score and BLEU score. Submissions will be ranked by F1-score.

Corpus

Training Data : Part 1 and Part 2
to be released
- The password for the data download will be shared after registration.
- By clicking download, you agree to the data license and the shared task rules.

Example :

TBA
The task is to identify filled pauses, pet phrases, repetitions, repairs, false starts, and to mark these types as part of the prediction.


Registration

To register for participation in the shared tasks, please fill this form.

Results

Schedule @ ICON2023


Important Dates

Please consult the Shared Task website for official dates for the Shared Tasks. All submission deadlines are 11:59 PM IST (UTC+5:30).

Event : Date
Shared Task Announcement : November 14, 2023
Registration Open : November 14, 2023
Training Data Released : November 20, 2023
Deadline for Registration : December 01, 2023 (revised from November 20, 2023)
Training Data - 2 Released : December 01, 2023
Test Set Release (Blind) : December 02, 2023 (revised from December 01, 2023)
System Runs Due : December 05, 2023 (revised from December 04, 2023)
Preliminary System Reports Due in SoftConf : December 08, 2023 (revised from December 05, 2023)
Notification for Report Acceptance : December 12, 2023 (revised from December 06, 2023)
Camera Ready Due : December 13, 2023 (revised from December 08, 2023)
Participant Presentations at ICON 2023 : December 17, 2023


Prizes:

Prizes will be awarded to the top-performing participants or teams.

1st Prize : 20K INR
2nd Prize : 15K INR
3rd Prize : 10K INR

Contact

For further information about this task and dataset, please contact:

  • Contact: disfluencyidentificationfor6indianlanguages-icon2023@IIITAPhyd.onmicrosoft.com




Organizing Committee

  • Dipti Misra Sharma, Professor, IIIT-Hyderabad
  • Parameswari Krishnamurthy, Assistant Professor, IIIT-Hyderabad
  • Arafat Ahsan, Research Scientist, IIIT-Hyderabad
  • Palash Gupta, Manager, HimangY, LTRC, IIIT-Hyderabad
  • Pruthwik Mishra, Research Scholar, IIIT-Hyderabad
  • Chayan Kochar, Research Scholar, IIIT-Hyderabad
  • (Corresponding Organizer) Vandan Mujadia, Research Scholar, IIIT-Hyderabad

