research-article
Authors: Koyyalagunta Krishna Sampath and M. Supriya
Published: 09 July 2024
Abstract
In India, a country known for its linguistic diversity, code-mixing is common practice and shapes how people communicate, both on social media platforms and in everyday conversation. The prevalence of code-mixing on social media presents a substantial hurdle for machine translation and other language processing tasks, and the abundance of unstructured code-mixed text on these platforms marks a crucial research domain within NLP. The blending of Hindi and English, known as Hinglish, and other code-mixed text such as Malayalam-English, Tamil-English, and Telugu-English are particularly prevalent among the younger generation when communicating on social media, and such text requires appropriate processing to aid comprehension by both monolingual users and language processing models. Manual translation of this type of data is laborious because of challenges such as limited vocabulary, potential misunderstanding of context, grammatical errors, and biases. Additionally, existing translation models tend to perform better on monolingual text than on code-mixed data. It is therefore desirable to build models that can translate code-mixed data.
This study converts code-mixed Hinglish, Malayalam-English, Tamil-English, and Telugu-English text in Romanised script to monolingual English, which can then be given as input to NLP applications such as Sentiment Analysis. This is achieved by fine-tuning pretrained models such as IndicLID for the Language Identification (LID) module and using an ensemble approach for transliteration and translation, with IndicXlit and IndicTrans, for code-mixed machine translation. The translated output is given as input to a classification algorithm that performs Sentiment Analysis and predicts the sentiment. It is observed that this approach to translating code-mixed text performs better than traditional machine translation for the Indian languages Hindi, Tamil, Telugu, and Malayalam.
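The three stages described above (language identification, transliteration plus translation, sentiment classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the lookup tables and the functions `identify_language`, `translate_token`, `to_monolingual_english`, and `classify_sentiment` are hypothetical stand-ins for IndicLID, the IndicXlit and IndicTrans ensemble, and the sentiment classifier, and the word-level tagging is an assumption made only to keep the example small.

```python
# Toy resources (hypothetical); a real system would use trained models instead.
ROMAN_HINDI_TO_ENGLISH = {
    "yeh": "this",
    "bahut": "very",
    "accha": "good",
    "bura": "bad",
    "hai": "is",
}
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "boring", "hate"}


def identify_language(token: str) -> str:
    """LID stand-in: tag a token as 'hi' (Romanised Hindi) or 'en' (English)."""
    return "hi" if token.lower() in ROMAN_HINDI_TO_ENGLISH else "en"


def translate_token(token: str, lang: str) -> str:
    """Transliteration + translation stand-in: map Romanised Hindi to English."""
    if lang == "hi":
        return ROMAN_HINDI_TO_ENGLISH[token.lower()]
    return token.lower()  # English tokens pass through unchanged


def to_monolingual_english(sentence: str) -> str:
    """Run LID and translation token by token and rejoin the result."""
    tokens = sentence.split()
    return " ".join(translate_token(t, identify_language(t)) for t in tokens)


def classify_sentiment(english_text: str) -> str:
    """Classifier stand-in: count lexicon hits in the translated text."""
    words = english_text.split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    code_mixed = "yeh movie bahut accha hai"
    english = to_monolingual_english(code_mixed)
    print(english)                      # this movie very good is
    print(classify_sentiment(english))  # positive
```

The token-by-token lookup keeps the source word order, so the output is not fluent English; a trained translation model would also handle reordering and context. The sketch only shows how the output of each stage feeds the next.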
Published In
Procedia Computer Science Volume 233, Issue C
2024
1049 pages
ISSN:1877-0509
EISSN:1877-0509
Copyright © 2024.
Publisher
Elsevier Science Publishers B. V.
Netherlands
Author Tags
- Natural Language Processing
- Code Mixing
- Language Identification
- Sentiment Analysis
- Translation
- Transliteration
- Transformers