SentiUrdu-1M: A large-scale tweet dataset for Urdu text sentiment analysis using weakly supervised learning
Sukkur IBA University, Pakistan.
Norwegian University of Science and Technology, Norway.
Sukkur IBA University, Pakistan.
Linnaeus University, Faculty of Technology, Department of Informatics. ORCID iD: 0000-0002-0199-2377
2023 (English). In: PLOS ONE, E-ISSN 1932-6203, Vol. 18, no 8, article id e0290779. Article in journal (Refereed), Published
Abstract [en]

Low-resource languages are gaining much-needed attention with the advent of deep learning models and pre-trained word embeddings. Although it is spoken by more than 230 million people worldwide, Urdu is one such low-resource language; it has recently gained popularity online and is attracting considerable attention and support from the research community. One challenge faced by such resource-constrained languages is the scarcity of publicly available large-scale datasets for conducting meaningful studies. In this paper, we address this challenge by collecting the first-ever large-scale Urdu tweet dataset for sentiment analysis and emotion recognition. The dataset consists of 1,140,821 tweets in the Urdu language. Manually labeling such a large number of tweets would have been tedious, error-prone, and practically infeasible; therefore, the paper also proposes a weakly supervised approach to label tweets automatically. The approach uses emoticons found within the tweets, together with SentiWordNet, to categorize the extracted tweets as positive, negative, or neutral. Baseline deep learning models are implemented to compute the accuracy of three labeling approaches: VADER, TextBlob, and our proposed weakly supervised approach. Unlike the weakly supervised labeling approach, VADER and TextBlob label most tweets as neutral, and their labels are highly correlated with each other. This is largely attributed to the fact that these models do not consider emoticons when assigning polarity.
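
The general idea of emoticon-plus-lexicon weak labeling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the emoji sets, the `TOY_LEXICON` dictionary (a stand-in for SentiWordNet-style scores), the function names, the rule that emoticons take priority over the lexicon, and the zero threshold are all assumptions made for illustration, and English tokens are used only to keep the example readable.

```python
# Hypothetical, deliberately tiny resources for illustration only.
POSITIVE_EMOJI = {"😊", "😍", "👍"}
NEGATIVE_EMOJI = {"😢", "😡", "👎"}

# Stand-in for SentiWordNet-style scores: token -> polarity in [-1, 1].
TOY_LEXICON = {"good": 0.6, "love": 0.8, "bad": -0.6, "hate": -0.8}


def emoji_score(text):
    """Count positive emoji minus negative emoji occurring in the text."""
    pos = sum(text.count(e) for e in POSITIVE_EMOJI)
    neg = sum(text.count(e) for e in NEGATIVE_EMOJI)
    return pos - neg


def lexicon_score(text):
    """Sum the lexicon polarity of each whitespace-separated token."""
    return sum(TOY_LEXICON.get(tok, 0.0) for tok in text.lower().split())


def weak_label(text, threshold=0.0):
    """Assign a weak label: emoji evidence first, lexicon as a fallback.

    The priority order and the threshold are assumptions of this sketch.
    """
    score = emoji_score(text)
    if score == 0:
        score = lexicon_score(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for tweet in ["I love this 😍", "bad day", "the meeting is at noon"]:
        print(tweet, "->", weak_label(tweet))
```

A labeler that ignores the emoji step would fall through to the lexicon alone, which is one way to see why emoticon-blind tools could assign "neutral" to many short tweets.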

Place, publisher, year, edition, pages
Public Library of Science (PLoS), 2023. Vol. 18, no 8, article id e0290779
National Category
Information Systems
Research subject
Computer and Information Sciences Computer Science, Information Systems
Identifiers
URN: urn:nbn:se:lnu:diva-124006
DOI: 10.1371/journal.pone.0290779
ISI: 001058799800036
Scopus ID: 2-s2.0-85169229839
OAI: oai:DiVA.org:lnu-124006
DiVA, id: diva2:1793418
Available from: 2023-08-31. Created: 2023-08-31. Last updated: 2023-10-19. Bibliographically approved.

Open Access in DiVA

fulltext (1696 kB), 121 downloads
File information
File name: FULLTEXT01.pdf
File size: 1696 kB
Checksum (SHA-512): 64ce5d59063303d81d2abbc6df9f4b68b44cd9208a474e76c846af26c6c3f28205afea401393350c2a374744bc7d985dd37b261368d82c1cf0f6247e35f42129
Type: fulltext
Mimetype: application/pdf


Authority records

Kastrati, Zenun
