TRANSCRIPTOME SEQUENCING OF LEPISANTHES FRUTICOSA TO DISCOVER SSR MARKERS

How to cite this article (APA): Seman, Z. A., Ahmad, A., Abidin, R. A. Z., Jantan, S. Z., Noor, M. H. A., Shin, S. Y., Ghazalli, M. N., Nasir, K. H., Simoh, S., and Ali, M. S. M. (2022). Transcriptome Sequencing of Lepisanthes Fruticosa to Discover SSR Markers. International Journal of Research GRANTHAALAYAH, 10(1), 21-33. doi: 10.29121/granthaalayah.v10.i1.2022.4451


INTRODUCTION
A wide variety of underutilized fruits grow natively in Peninsular Malaysia, Sabah and Sarawak. These plants bear less attractive fruits than commercial species; however, many of them have high nutritional value and medicinal properties Rizvi et al. (2015). Ibrahim et al. (2010) showed the importance of native fruits in traditional medicine for treating several common diseases. However, little research has focused on the different plant parts, including the fruit, to study their medicinal benefits scientifically. Research has shown that the bioactive compounds identified from the plant extracts, which include phenolics, carotenoids and other terpenoids, are the primary contributors to their medicinal properties.
Lepisanthes fruticosa, locally known as ceri Terengganu, is one of the valuable underutilized fruits in Malaysia with the potential to be exploited for commercial production. L. fruticosa is a non-seasonal woody plant whose fruits are available throughout the year. Typically, the fruits are arranged closely and attractively in a big bunch or cluster (20 fruits/bunch). The flesh is soft and fairly sweet, with 1-3 seeds/fruit. The tree is small but can reach medium height with a spreading canopy. The purplish colour of the young leaves adds to the attractiveness of the tree Mirfat et al. (2017). Studies have shown that ripe L. fruticosa fruits have the highest free radical scavenging activity and total phenolic content among a number of underutilized and commercial fruits Mirfat and Salma (2015). This suggests that the fruit of L. fruticosa is a good candidate for alternative medicine and health-benefiting food supplements Umikalsum and Mirfat (2014), Dayang et al. (2012), Ibrahim et al. (2010), Ikram et al. (2009). Nevertheless, despite the rich genetic diversity of L. fruticosa, there are limited reports on its germplasm diversity and molecular marker data. For instance, the nucleotide sequences of L. fruticosa deposited in the NCBI database (https://www.ncbi.nlm.nih.gov/gquery/?term=Lepisanthes+alata+) were found to be scarce (as little as nucleotide sequences as of March 2018).
Recent advances in sequencing technology make it possible to generate large-scale sequencing data from both genomes and transcriptomes. Likewise, next-generation sequencing technologies and bioinformatics analysis have led to large-scale identification of EST-SSRs from various crops Wang et al. (2010), Garg et al. (2011), Zeng et al. (2010), Zhou et al. (2016). Simple sequence repeats (SSRs) are arrays of short motifs characterized by their hypervariability, abundance, reproducibility, Mendelian inheritance and codominant nature Scott et al. (2000), Gupta et al. (1996), and they are convenient to apply compared with other molecular markers Zhou et al. (2016). SSRs can be predicted from either genomic or transcriptomic sequences, and are known as genomic SSRs and EST-SSRs, respectively Song et al. (2012). EST-SSRs are derived from expressed sequence tags; they are more evolutionarily conserved than genomic SSRs derived from non-coding sequences and have relatively high transferability Wei et al. (2011b). SSRs are powerful tools that have been used extensively in population studies to determine genetic diversity and to analyze genetic structure Yoichi et al. (2017). The present study aimed to generate and identify EST-SSRs from the leaf of L. fruticosa using Illumina paired-end sequencing technology. Genic SSR markers (in genic sequences) are then characterized based on their frequency and distribution, followed by an analysis of the functional properties of those SSRs. The data and results obtained from the present study will be valuable genomic and genetic resources for future studies of L. fruticosa.

PLANT MATERIAL
The L. fruticosa plants were grown in the MARDI germplasm collection at MARDI headquarters, Serdang, Malaysia. Harvested young leaves of L. fruticosa were snap-frozen in liquid nitrogen and kept in a -80°C freezer for further use.

RNA EXTRACTION FOR ILLUMINA SEQUENCING
A total of 100 mg of young leaf tissue was powdered in liquid nitrogen prior to RNA extraction using TRI Reagent (SIGMA-Aldrich, St. Louis, USA) according to the manufacturer's instructions. Extracted RNA was treated with RNase-free DNase I Recombinant (QIAGEN, USA) to remove genomic DNA contamination from the sample. The extracted total RNA was quantified using a Nanodrop ND-100 spectrophotometer (Thermo Scientific, Wilmington, USA). The integrity of the RNA bands was checked by 1% TAE agarose gel electrophoresis. The quality and RNA integrity number (RIN) of the total RNA were assessed using a 2100 Bioanalyzer (Agilent Technology, Santa Clara, CA, USA). Illumina sequencing was performed at Novogen Co., Ltd., Beijing, China. The cDNA libraries were sequenced on the Illumina HiSeq™ 2500 system under effective concentration.

SEQUENCE PREPROCESSING AND DE NOVO ASSEMBLY
The raw sequencing reads were filtered to generate high-quality data. This included removing adaptor contaminants, reads with more than 10% uncertain nucleotides, and reads with low-quality bases at a Phred score cut-off of q <= 20. The cleaned reads were then assembled using Trinity version r20140413p1, with the minimum k-mer coverage set to 2 and all other parameters left at their defaults. The de novo transcript assembly from Trinity was then clustered using Corset version 1.05 to obtain unigene sequences.
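The read-level filtering criteria described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: adaptor removal is omitted, the Phred+33 quality encoding is an assumption, and the fraction of low-quality bases above which a read is discarded (`max_low_q_frac`, set here to 0.5) is a hypothetical value, because the text states only the per-base cut-off (q <= 20).

```python
def n_fraction(seq):
    """Fraction of uncertain (N) bases in a read."""
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def low_quality_fraction(qual, phred_cutoff=20, offset=33):
    """Fraction of bases with Phred score <= phred_cutoff.

    Assumes Phred+33 ASCII encoding (an assumption, not stated in the text).
    """
    if not qual:
        return 1.0
    low = sum(1 for ch in qual if ord(ch) - offset <= phred_cutoff)
    return low / len(qual)


def keep_read(seq, qual, max_n_frac=0.10, max_low_q_frac=0.50):
    """Keep a read only if it passes both filters.

    max_n_frac follows the 10% threshold given in the text;
    max_low_q_frac is a hypothetical threshold chosen for illustration.
    """
    return (n_fraction(seq) <= max_n_frac
            and low_quality_fraction(qual) <= max_low_q_frac)


if __name__ == "__main__":
    print(keep_read("ACGTACGTAC", "IIIIIIIIII"))  # clean, high-quality read
    print(keep_read("ACGTNNGTAC", "IIIIIIIIII"))  # 20% N bases
    print(keep_read("ACGTACGTAC", "!!!!!!IIII"))  # 60% low-quality bases
```

For paired-end data, one common choice is to discard the pair when either mate fails; whether that was done here is not stated.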

SSR MARKERS DISCOVERY
SSR markers were discovered using MISA (MIcroSAtellite) (http://pgrc.ipk-gatersleben.de/misa/). The parameters chosen for detecting an SSR motif were a minimum length of 12 base pairs (bp) and minimum repeat counts of mono-10, di-6, tri-5, tetra-5, penta-5 and hexa-5. The maximum size of interruption allowed between two different SSRs in a compound SSR was 100 bp. SSR filtering was performed using a custom Perl script and R programming. The filtering criteria selected markers that represented more than two alleles and were located in a single contig. Approximately 200 bp on each side of the repeat motif region were extracted using bedtools version 2.21.0 (Quinlan & Hall 2010). SSR markers were annotated using the BLAST program (blastx) against the non-redundant (nr) and SwissProt protein databases with an E-value of 1 × 10⁻⁵. Gene ontology (GO) enrichment and KEGG pathway analyses were conducted using BLAST2GO software.
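To make the detection thresholds concrete, the following Python sketch finds perfect SSRs using the minimum repeat counts listed above and computes a 200 bp flanking window around each hit. It is a simplified illustration, not a reimplementation of MISA or bedtools: the function names are hypothetical, only perfect repeats are reported, and the compound-SSR check simply tests whether two neighbouring hits lie within 100 bp of each other.

```python
import re

# Minimum repeat counts per motif length, taken from the text
# (mono-10, di-6, tri-5, tetra-5, penta-5, hexa-5).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, end, motif, repeats) tuples for perfect SSRs.

    Coordinates are 0-based with an exclusive end.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((m.start(), m.end(), motif,
                         (m.end() - m.start()) // unit_len))
    return sorted(hits)


def is_compound(hit_a, hit_b, max_gap=100):
    """True if two hits are separated by at most max_gap bases."""
    first, second = sorted([hit_a, hit_b])
    return second[0] - first[1] <= max_gap


def flank_window(start, end, seq_len, flank=200):
    """Window covering the SSR plus up to `flank` bp on each side."""
    return max(0, start - flank), min(seq_len, end + flank)


if __name__ == "__main__":
    example = "GC" + "A" * 10 + "CGTACG" + "AT" * 6 + "G"
    for hit in find_ssrs(example):
        print(hit, flank_window(hit[0], hit[1], len(example)))
```

The flanking window is clamped to the sequence ends, so SSRs near the start or end of a unigene yield flanks shorter than 200 bp.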

DE NOVO ASSEMBLY OF TRANSCRIPTOMIC SEQUENCING DATA
A total of 91,043,356 paired-end raw reads were generated from transcriptome sequencing of L. fruticosa leaf using the Illumina HiSeq™ 2500 sequencer. The total reads comprised approximately 13.66 Gigabases (Gb) with an average read length of 150 bp. After read quality assessment, approximately 89,441,736 (98.24%) high-quality reads were recovered (Table 1). These cleaned paired-end reads were then assembled into 52,657 transcripts. Clustering of the transcripts resulted in 52,569 unigenes (Table 2). A total of 23,958 SSRs were identified, which accounted for 45.58% of the total unigenes. All the markers were classified into seven types, consisting of type c, type p1 (mono-), p2 (di-), p3 (tri-), p4 (tetra-), p5 (penta-) and p6 (hexa-) nucleotide repeats (Table 3).
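The sequencing yield and the proportion of retained reads follow directly from the read counts reported above. The short Python check below reproduces that arithmetic; the variable names are illustrative only.

```python
raw_reads = 91_043_356      # raw reads reported in the text
clean_reads = 89_441_736    # high-quality reads reported in the text
read_length_bp = 150        # average read length reported in the text

# Total yield in gigabases: number of reads times read length.
yield_gb = raw_reads * read_length_bp / 1e9

# Percentage of raw reads retained after quality filtering.
retained_pct = clean_reads / raw_reads * 100

print(f"Yield: {yield_gb:.2f} Gb")
print(f"Retained: {retained_pct:.2f}%")
```

Both values agree with the reported 13.66 Gb and 98.24%.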

DISCUSSION
Understanding genetic variation in the germplasm, e.g. determining plant genetic diversity using DNA molecular markers, is a prerequisite for crop improvement. Previous studies have applied Random Amplification of Polymorphic DNA (RAPD) markers to genetic diversity studies in Annona species Anuragi et al. (2016) and loquat Badenes et al. (2004). Nevertheless, the massive data generated from transcriptome sequencing have been reported to be comprehensive and useful genomic and genetic resources that facilitate gene discovery and SSR development in future studies Zheng et al. (2013), Huang et al. (2014), Chen et al. (2015). The marker density of L. fruticosa is lower than that of coffee (1/2.16 kb) Zheng et al. (2013), but SSRs appeared at a much higher frequency than in Arabidopsis (1/14 kb) Cardle et al. (2000), chickpea (1/8.66 kb) Jhanwar et al. (2012) and cereal plants such as wheat (1/15.6 kb) and barley (1/6.3 kb) Kantety et al. (2002). The different distribution frequencies of markers among plant species are most probably due to the composition and size of the genome and to the criteria chosen to screen the markers.
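The density comparison above can be made concrete with a rough estimate computed from figures given elsewhere in this article. The Python sketch below assumes that the average unigene size of 1473 bp applies to all 52,569 unigenes; that is an assumption made here for illustration, so the result is an approximation and not a value reported by the study.

```python
n_unigenes = 52_569        # unigenes reported in the text
n_ssrs = 23_958            # SSRs reported in the text
mean_unigene_bp = 1_473    # average size from the text; assumed to cover all unigenes

# Approximate total sequence searched, then kilobases per SSR.
total_bp = n_unigenes * mean_unigene_bp
kb_per_ssr = total_bp / n_ssrs / 1000

# Densities quoted in the text, expressed as kb per SSR.
reported = {
    "coffee": 2.16,
    "barley": 6.3,
    "chickpea": 8.66,
    "Arabidopsis": 14.0,
    "wheat": 15.6,
}

print(f"L. fruticosa (estimate): 1 SSR per {kb_per_ssr:.2f} kb")
for species, kb in sorted(reported.items(), key=lambda item: item[1]):
    relation = "denser" if kb < kb_per_ssr else "sparser"
    print(f"{species}: 1 SSR per {kb} kb ({relation} than the estimate)")
```

Under this assumption the estimate is about one SSR per 3.23 kb, which falls between the values quoted for coffee and barley.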
This study showed that the most common repeat units were mononucleotide repeats, followed by dinucleotide repeats, which comprised 55.0% and 21.5% of the total SSRs, respectively. Mononucleotide repeats were also found to be abundant in other plant species Jin et al. (2016), but in certain plant species dinucleotide repeats were the most abundant Izzah et al. (2014), Silva et al. (2013), Zhang et al. (2012). The dominant dinucleotide repeat motif detected in this study was (AT/AG)n (48.9%), whereas (GAA/TTC)n (48.7%) was the dominant trinucleotide repeat motif. However, different dominant motifs were detected in Dipteronia Oliver (Aceraceae) and rubber tree, which had AG/CT and AAG/CTT as their dominant dinucleotide and trinucleotide repeat motifs, respectively Triwitayakorn et al. (2011), Zhou et al. (2016). Interestingly, the GGGCAA motif was rare in L. fruticosa, occurring only 3 times in the present study.
Sequence annotation showed that about 20.65% of the SSR markers had significant hits to the NCBI database, a relatively low percentage of hits against the existing data. This is unlikely to be due to the size of the unigenes, as the average size indicated in the analysis was 1473 bp. The low homology percentage is probably contributed by genes that have no hits against the database upon BLAST search or that match unknown proteins. This assumption is made on the basis that scant genetic information is available for L. fruticosa and its close relatives in the current public databases. Similarity search analysis carried out on the SSR-containing unigenes indicated that L. fruticosa is genetically closer to woody plant species such as Citrus sinensis (34%), Citrus clementina (20%) and Theobroma cacao (11%). These data suggest that some of the genetic information of L. fruticosa obtained from the current study, such as unigenes and SSR-containing unigenes, could be useful and applicable to other woody plant species such as Citrus and date palm.
The results of the Gene Ontology (GO) analysis suggested that the SSR-containing unigenes of the L. fruticosa leaf have diversified biochemical functions. More interestingly, a number of the detected SSR-containing unigenes have functions related to compound binding, such as heterocyclic and organic cyclic compound binding. Many natural plant secondary metabolites have a cyclic structure in which a series of atoms are connected to form a ring. Organic cyclic compounds contain carbon atoms in the ring, while heterocyclic compounds are formed by a combination of carbon and non-carbon atoms Smith and March (2006). For instance, flavan-3-ols, proanthocyanidins and flavanones contain a heterocyclic C-ring Crozier et al. (2007). The high abundance of SSR-containing unigenes with binding functions for those compounds observed in L. fruticosa suggests that a majority of the detected SSRs are actively involved in, and are an essential component of, modulating the biosynthesis of secondary metabolites.
KEGG pathway analysis revealed that the SSR-containing unigenes are highly associated with purine and thiamine metabolism. This finding is consistent with the study by Suzuki and Waller (1985), which found that purine alkaloids were abundant in the fruits of Camellia sinensis L. and Coffea arabica L. during fruit development stages. It is therefore suggested that a high purine content, which is essential for purine alkaloid synthesis during fruit formation, is likely to be observed among fruit-bearing tree plants including L. fruticosa. Thiamine diphosphate (vitamin B1) is known to be an enzymatic cofactor for various pathways in central metabolism, such as glycolysis, the pentose phosphate pathway and the tricarboxylic acid cycle Goyer (2010). In addition, previous studies have shown that thiamine is a cofactor involved in eliciting the expression of genes in the phenylpropanoid pathways of grapevine, and that those genes were associated with the accumulation of phenolics, flavonoids, lignin and stilbenes Boubakri et al. (2013).
The above-mentioned findings are in agreement with the results obtained from KEGG pathway analysis, where a number of SSRs were detected in unigenes involved in the biosynthesis pathways of secondary metabolites. Zeatin biosynthesis, diterpenoid biosynthesis, terpenoid-quinone biosynthesis, and flavonoid and isoflavonoid biosynthesis were among the pathways represented in the secondary metabolite biosynthesis pathways, which helps explain why L. fruticosa is a rich source of phytochemicals. This is supported by previous studies which showed that the leaves of Lepisanthes species contain various phytochemicals such as proanthocyanidins Zhang et al. (2016), alkaloids, terpenoids and flavonoids Kuspradini et al. (2012).

CONCLUSION
This study was the first transcriptome sequencing of RNA extracted from the leaf of L. fruticosa using NGS technology. Both EST-SSRs and NGS-based SSRs are functional markers identified from the expressed transcripts of an organism; however, NGS technology is a comparatively more reliable and higher-throughput approach for discovering novel SSR markers in a rapid, convenient and cost-effective manner than traditional EST-SSR identification processes. The present study detected and characterized 23,958 microsatellite loci from 52,569 non-redundant unigenes. These data have substantially increased the existing genomic information for the underexplored plant species L. fruticosa. Our data also suggested a high similarity of SSR-containing unigenes between L. fruticosa and date palm and citrus plants. This newly obtained genetic information will be useful for screening and profiling L. fruticosa accessions, in particular for their secondary metabolite content. This will support future breeding programs to bring L. fruticosa from an underutilized fruit to the local food market as a nutritionally enhanced food product.