042802 CHHAJRO M A, KHUHRO M A , KUMARK , WAGAN A A , UMRANI A I , LAGHARI A A (Computer Science Dep, Sindh Madressatul Islam Univ, Karachi, Pakistan, Email: email@example.com) : Multi-text classification of Urdu/Roman using machine learning and natural language preprocessing techniques. Indian J Sci Technol 2020, 13(19), 1890–900.
This research presents multi-text classification from the news text dataset. The main purpose of this work is to classify multi-text for Urdu and Roman language using Natural Language processing and Machine Learning classification models. In this research, online news data has been collected through beautiful soup web scraping tool. In order to analyze the model accuracy news data is divided into six categories which has been composed from various online newspaper platforms. The main news corpus data consists of 10500 news in Urdu and Roman Urdu language including, Accidental, Education, Entertainment, International, Sports and Weather news have been primarily focused in the proposed research study. Furthermore, preprocessing is performed on text corpus using Natural Language Processing technique; for example, data cleaning, data balancing, and stop word removal. For feature extraction count vector, TF-IDF and Chi2 are employed as word filtering. For multi-text classification the Machine Learning classification schemes have been implemented namely, Naive Bayes Classifier, Logistic Regression, Random Forest Classifier, Linear SVC, and KNeighbors Classifier. After comparative analysis results showed that Linear Support Vector Classifier provided 96 % accuracy among other tested methods. Multi-Text classification of Urdu Roman language having different writing styles, word structure, irregularities, grammar, and combined corpus is a challenging task. For this purpose, we implemented different Machine Learning algorithms with Natural Language preprocessing technique which provided optimal results in classification of multi-text news data.
9 illus, 1 table, 18 ref