Dataset on sentiment-based cryptocurrency-related news and tweets in English and Malay language

Cryptocurrency trading is becoming popular due to its profitable investment and has led to worldwide involvement in buying and selling cryptocurrency assets. Sentiments expressed by cryptocurrency enthusiasts toward some news via social media or other online platforms may affect the cryptocurrency m...


Bibliographic Details
Published in: Language Resources and Evaluation
Main Author: Mohamad Zamani N.A.; Kamaruddin N.; Yusof A.M.B.
Format: Article
Language: English
Published: Springer Science and Business Media B.V. 2024
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85193598349&doi=10.1007%2fs10579-024-09733-z&partnerID=40&md5=c1b6ff4378d2202718099d8992d64017
id 2-s2.0-85193598349
spelling 2-s2.0-85193598349
Mohamad Zamani N.A.; Kamaruddin N.; Yusof A.M.B.
Dataset on sentiment-based cryptocurrency-related news and tweets in English and Malay language
2024
Language Resources and Evaluation


10.1007/s10579-024-09733-z
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85193598349&doi=10.1007%2fs10579-024-09733-z&partnerID=40&md5=c1b6ff4378d2202718099d8992d64017
Cryptocurrency trading is becoming popular because of its potential for profit, and it has led to worldwide involvement in buying and selling cryptocurrency assets. Sentiments expressed by cryptocurrency enthusiasts toward news via social media or other online platforms may affect cryptocurrency market activity. It has therefore become a challenge to determine the level of positivity or negativity present in a text (regression) rather than simply classifying the sentiment into categorical classes. Regression offers more detailed information than simple classification and can be robust to noisy data because it considers the entire range of possible target values. In contrast, classification can lead to biased models when the dataset is imbalanced and tends to cause overfitting. Hence, this work emphasises the creation of sentiment-based cryptocurrency-related corpora in English and Malay, focusing on Bitcoin and Ethereum. The data were collected from January to December 2021 from publicly available online news and from tweets on Twitter in English and Malay. The dataset contains a total of 29,694 instances, comprising 5,694 news items and 24,000 tweets. During the annotation process, the annotators were trained until a Krippendorff’s alpha agreement above 60% was achieved, which is considered an applicable benchmark given the annotation complexity. The corpora are available on GitHub for cryptocurrency-related experiments using various machine learning or deep learning models to study the effect of English and Malay sentiment on the global market, particularly the Malaysian market, and can be extended to further analysis of the volatile nature of the Bitcoin and Ethereum markets. © The Author(s), under exclusive licence to Springer Nature B.V. 2024.
Springer Science and Business Media B.V.
1574020X
English
Article

author Mohamad Zamani N.A.; Kamaruddin N.; Yusof A.M.B.
title Dataset on sentiment-based cryptocurrency-related news and tweets in English and Malay language
publishDate 2024
container_title Language Resources and Evaluation
doi_str_mv 10.1007/s10579-024-09733-z
url https://www.scopus.com/inward/record.uri?eid=2-s2.0-85193598349&doi=10.1007%2fs10579-024-09733-z&partnerID=40&md5=c1b6ff4378d2202718099d8992d64017
publisher Springer Science and Business Media B.V.
issn 1574020X
language English
format Article
record_format scopus
collection Scopus
_version_ 1809678012860334080