Dataset on sentiment-based cryptocurrency-related news and tweets in English and Malay language
Cryptocurrency trading is growing in popularity because of its potential for profit, which has led to worldwide involvement in buying and selling cryptocurrency assets. Sentiments that cryptocurrency enthusiasts express about news on social media or other online platforms may affect the cryptocurrency m...
Published in: | Language Resources and Evaluation |
---|---|
Main Authors: | Mohamad Zamani N.A.; Kamaruddin N.; Yusof A.M.B. |
Format: | Article |
Language: | English |
Published: | Springer Science and Business Media B.V., 2024 |
Online Access: | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85193598349&doi=10.1007%2fs10579-024-09733-z&partnerID=40&md5=c1b6ff4378d2202718099d8992d64017 |
id |
2-s2.0-85193598349 |
---|---|
author |
Mohamad Zamani N.A.; Kamaruddin N.; Yusof A.M.B. |
title |
Dataset on sentiment-based cryptocurrency-related news and tweets in English and Malay language |
publishDate |
2024 |
container_title |
Language Resources and Evaluation |
doi_str_mv |
10.1007/s10579-024-09733-z |
url |
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85193598349&doi=10.1007%2fs10579-024-09733-z&partnerID=40&md5=c1b6ff4378d2202718099d8992d64017 |
description |
Cryptocurrency trading is growing in popularity because of its potential for profit, which has led to worldwide involvement in buying and selling cryptocurrency assets. Sentiments that cryptocurrency enthusiasts express about news on social media or other online platforms may affect cryptocurrency market activity. Thus, the challenge has become determining the level of positivity or negativity expressed in the texts (regression) rather than simply classifying the sentiment into categorical classes. Regression offers more detailed information than simple classification and can be robust to noisy data because it considers the entire range of possible target values. In contrast, classification can yield biased models when the dataset is imbalanced and tends to cause overfitting. Hence, this work focuses on creating sentiment-based cryptocurrency-related corpora in English and Malay, centred on Bitcoin and Ethereum. The data were collected from January to December 2021 from publicly available online news and from tweets on Twitter, in English and Malay. The dataset contains a total of 29,694 instances, comprising 5,694 news items and 24,000 tweets. During the annotation process, the annotators were trained until a Krippendorff’s alpha agreement above 60% was achieved, which is considered an applicable benchmark given the complexity of the annotation. The corpora are available on GitHub for cryptocurrency-related experiments using various machine learning or deep learning models to study the effect of English and Malay sentiment on the global market, particularly the Malaysian market, and can be extended to further analysis of the volatile nature of the Bitcoin and Ethereum markets. © The Author(s), under exclusive licence to Springer Nature B.V. 2024. |
publisher |
Springer Science and Business Media B.V. |
issn |
1574020X |
language |
English |
format |
Article |
record_format |
scopus |
collection |
Scopus |