Implementation of machine learning in DNA barcoding for determining the plant family taxonomy

The DNA barcoding approach has been used extensively in taxonomy and phylogenetics. Differences in certain DNA sequences can differentiate organisms and help classify them into taxa. It has been used in cases of taxonomic disputes where morphology by itself is insufficient. This research aime...

Bibliographic Details
Published in: Heliyon
Main Authors: Riza L.S.; Zain M.I.; Izzuddin A.; Prasetyo Y.; Hidayat T.; Abu Samah K.A.F.
Format: Article
Language: English
Published: Elsevier Ltd 2023
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85171802354&doi=10.1016%2fj.heliyon.2023.e20161&partnerID=40&md5=3867920c858db50b508f95cf47386652
id 2-s2.0-85171802354
author Riza L.S.; Zain M.I.; Izzuddin A.; Prasetyo Y.; Hidayat T.; Abu Samah K.A.F.
title Implementation of machine learning in DNA barcoding for determining the plant family taxonomy
publishDate 2023
container_title Heliyon
container_volume 9
container_issue 10
doi_str_mv 10.1016/j.heliyon.2023.e20161
description The DNA barcoding approach has been used extensively in taxonomy and phylogenetics. Differences in certain DNA sequences can differentiate organisms and help classify them into taxa. It has been used in cases of taxonomic disputes where morphology by itself is insufficient. This research aimed to use hierarchical clustering, an unsupervised machine learning method, to resolve disputes in plant family taxonomy. The case study is Leguminosae, which some authors have historically split into three families (Fabaceae, Caesalpiniaceae, and Mimosaceae) and others treat as a single family (Leguminosae). The study has four phases: (i) data collection, (ii) data preprocessing, (iii) finding the best distance method, and (iv) determining the disputed family. The data were collected from several sources, including the National Center for Biotechnology Information (NCBI), journals, and websites. The validation data, collected from NCBI, were used to determine the best distance method for differentiating families or genera. The data for the Leguminosae case study were collected from journals and a website. In our experiments, the Pearson method was the best distance method for clustering plant internal transcribed spacer (ITS) sequences, in both accuracy and computational cost. We then used the Pearson method to resolve the disputed family classification within Leguminosae. Our results indicate that Leguminosae should be grouped into one family. © 2023
publisher Elsevier Ltd
issn 24058440
language English
format Article
accesstype All Open Access; Gold Open Access; Green Open Access
record_format scopus
collection Scopus
_version_ 1809677580619481088
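
The description above names hierarchical clustering with a Pearson-based distance, but this record does not say how sequences were encoded or which linkage was used. The following is a minimal sketch under stated assumptions, not the authors' implementation: sequences are encoded as 2-mer count profiles (an assumption), the distance is 1 minus the Pearson correlation, and clusters are merged by average linkage (also an assumption). The function names and the four toy sequences are hypothetical and are not real ITS data.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=2):
    """Count overlapping k-mers over ACGT (assumed encoding, not from the record)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
    return [counts[m] for m in kmers]


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # correlation is undefined for a constant profile
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: merge the two closest clusters until n_clusters remain."""
    n = len(profiles)
    # Pairwise distances between individual sequences, keyed by (i, j) with i < j.
    dist = {
        (i, j): pearson_distance(profiles[i], profiles[j])
        for i in range(n)
        for j in range(i + 1, n)
    }

    def cluster_distance(a, b):
        # Average of all pairwise distances between members of a and b.
        total = sum(dist[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        _, p, q = min(
            (cluster_distance(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Hypothetical toy sequences for illustration only.
toy_sequences = {
    "toy_A1": "ACGTACGTACGTACGTACGT",
    "toy_A2": "ACGTACGTACGAACGTACGT",
    "toy_B1": "GGGCCCGGGCCCGGGCCCGG",
    "toy_B2": "GGGCCCGGGCCAGGGCCCGG",
}
names = list(toy_sequences)
profiles = [kmer_profile(toy_sequences[name]) for name in names]
clusters = average_linkage(profiles, n_clusters=2)
for members in clusters:
    print([names[i] for i in members])
```

In this sketch, asking for one cluster versus three and comparing the groupings against the proposed family labels would be one way to probe a disputed classification. For real data, an established library routine and aligned sequences would likely be a better starting point than this hand-rolled version.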