Threats to food security, resulting from climate change and a growing population, are the focus of major international organizations such as the Food and Agriculture Organization of the United Nations (von Braun et al., 2003; FAO et al., 2012). It is clear that major changes will be needed to the way we produce, process, and transport food. Only a handful of crops provide the basic nutritional intake for the majority of the human population; however, dozens of species are cultivated at a smaller scale, and several thousand more have a variety of food and nonfood uses (Mangelsdorf, 1966). Many of the less-studied crops have advantages in terms of environmental tolerances and unique nutritional profiles when compared to the widely cultivated crops. As such, investigations into less-studied crops (hereafter referred to as “underutilized crops”) have the potential to reveal crops, crop varieties, and unique alleles that could be applied in future agriculture and crop breeding (Varshney et al., 2010).
Legumes are a vitally important source of protein in developing countries, where meat protein is consumed infrequently. They are also often grown in poor soils, because of their associated atmospheric N2 fixation, and where water availability is low or unpredictable. With this motivation, this study reports transcriptome sequences for four underutilized legumes: hyacinth bean (Lablab purpureus (L.) Sweet), grasspea (Lathyrus sativus L.), winged bean (Psophocarpus tetragonolobus (L.) DC.), and Bambara groundnut (Vigna subterranea (L.) Verdc.). Hyacinth bean is a very versatile crop, likely originating in Africa, and is cultivated throughout the tropics, often as forage for livestock. The leaves and flowers can be eaten by humans raw or cooked, and the roots can be eaten cooked; notably, this species is one of the most drought-tolerant legumes (Ewansiha and Singh, 2006; Maruthi et al., 2006). Grasspea is widely grown in East Africa and Asia and is often considered an “insurance crop” because of its ability to grow under drought conditions even when other crops fail (Polignano et al., 2005). The seed, while being a good source of protein (Abd El-Moneim et al., 2000), contains the neurotoxin ODAP (β-N-Oxalyl-l-α,β-diaminopropionic acid), which can cause health problems if it is the sole or major protein source in the diet. Winged bean (P. tetragonolobus) is very versatile, with all parts of the plant (leaves, flowers, tubers, and seeds) edible (National Academy of Sciences, 1975). Finally, Bambara groundnut (V. subterranea) grows similarly to the peanut in that its flowers develop into pods underground. It is thought to have originated in West Africa and is widely cultivated in sub-Saharan Africa, as well as in some regions of Asia, and can be grown on marginal soils where other legumes fail to grow (Jørgensen et al., 2010).
For future investigations, including population genetic and linkage or quantitative trait locus (QTL) mapping, transcriptome sequences can be mined for simple sequence repeats (SSRs, or microsatellites). These markers are typically highly polymorphic and are often transferable to closely related species; for example, 45–91% of 127 Vigna radiata (L.) R. Wilczek SSRs successfully amplified in other Vigna species (Tangphatsornruang et al., 2009). Furthermore, a comparative analysis of transcriptome sequences can be carried out to isolate markers with the potential to amplify in multiple species. These are typically PCR-based markers (i.e., a conserved orthologous set [COS], sensu Fulton et al., 2002) and are designed to amplify across a range of species, typically within a botanical family, for comparative QTL analyses and/or phylogenetics (e.g., in the Asteraceae [Chapman et al., 2007], Brassicaceae [Jeong et al., 2014], Pinaceae [Liewlaksaneeyanawin et al., 2009], Rosaceae [Cabrera et al., 2009], and Solanaceae and allies [Wu et al., 2006]). So far, however, there has been no attempt to develop COS markers for the Fabaceae, despite the large size of the family and the variety of economically important species within (Stevens, 2006; Meyer and Purugganan, 2013).
Summary statistics for the four de novo assembled transcriptomes.
In this study, seedling transcriptomes from these four under-utilized crops were sequenced using Illumina technology, assembled, annotated, and mined for SSRs and COS markers. The resultant data have been made publicly available as a valuable resource for investigations into gene expression and genetic variation in these and other leguminous crops.
METHODS AND RESULTS
Seeds were obtained from the U.S. Department of Agriculture (Ames, Iowa, USA; Table 1). The seeds were soaked overnight in 0.5 mg-mL−1 gibberellic acid, then rinsed and placed on damp filter paper at 22°C until germination. Germinated seeds were transferred to a 2:1 mixture of Levington's F2+S and vermiculite in a greenhouse with 16-h days, supplemented with artificial light. RNA extraction was carried out from true leaves of seedlings using a QIAGEN RNeasy Kit (QIAGEN, Crawley, United Kingdom) following the manufacturer's protocol, with on-column DNase digestion (RNase-free DNase, QIAGEN). Three micrograms of RNA was sent to the Wellcome Trust Centre for Human Genetics (Oxford, United Kingdom) for library preparation using the TruSeq Stranded mRNA Sample Prep Kit (Illumina, San Diego, California, USA) and sequencing on a partial lane of an Illumina HiSeq 2500 with 100-bp paired-end (PE) reads.
Primer sequences for 12 COS markers tested in 12 members of the Fabaceae.
The following analyses were carried out for each of the four crops separately. Sequence data were trimmed of adapters and low-quality sequences (q < 10) using cutadapt ( http://code.google.com/p/cutadapt/; Martin, 2011), and the Perl script resyncMates.pl ( https://github.com/percyfal/ratatosk.ext.scilife/blob/master/scripts/resyncMates.pl) was used to maintain proper read pairing where one of a pair was removed by cutadapt. The resulting clean reads were used in transcriptome assembly with Trinity ver. r20140717 (Grabherr et al., 2011). Libraries were normalized to a kmer coverage of 30 so as to reduce computation time, and then assembled using Trinity with the settings –min_kmer_cov 2 to increase the stringency for reads being assembled together, and –max_diffs_same_path 4–max_internal_gap_same_path 15, which allowed for more divergent reads (up to four nucleotide differences and up to a 15-bp gap) to be assembled into the same contig, i.e., taking into account the potential heterozygosity of the species. Heterozygosity was assessed by mapping the reads back to the respective transcriptomes, parsing the data through SAMtools (Li et al., 2009; mpileup settings -q 3 -Q 20), vcf2fq (settings -d 3), seqtk, and outputting .fa files. Putative functions of transcripts were assigned based on comparisons to transcripts derived from the common bean (Phaseolus vulgaris L.) genome sequencing project (Schmutz et al., 2014; http://www.phytozome.net/commonbean.php) with BLASTn and an E-value cut-off of e-30. SSRs were identified in the transcriptomes by running misa.pl ( http://pgrc.ipk-gatersleben.de/misa/) with the minimum number of repeats set at 8, 6, and 4 for dinucleotides, trinucleotides, and tetranucleotides, respectively.
To identify COS loci, the four transcriptomes were BLASTed against each other in a “round-robin” fashion (Lablab vs. Lathyrus, Lathyrus vs. Psophocarpus, Psophocarpus vs. Vigna, and Vigna vs. Lablab) and reciprocal best BLAST hits retained. Where all four comparisons successfully produced a reciprocal best BLAST hit this was considered a potential COS locus. For 12 such loci (picked at random, except for avoiding short loci, <300 bp), the sequences were aligned using Clustal Omega (Sievers and Higgins, 2014) and checked for presences of introns by comparison to the bean genome (Schmutz et al., 2014; http://www.phytozome.net/commonbean.php); degenerate primers were then designed to flank the introns using Primaclade (Gadberry et al., 2005; Table 2). These were then used to PCR amplify DNA extracted from the four species used in the transcriptome assembly as well as eight other legume taxa (Crotalaria brevidens Benth., Lupinus mutabilis Sweet, Mimosa pudica L., Phaseolus coccineus L. [runner bean], Pisum sativum L. [field pea], Vigna aconitifolia (Jacq.) Marechal [moth bean], V. unguiculata subsp. unguiculata (L.) Walp. [black-eyed pea], and V. unguiculata subsp. sesquipedalis (L.) Verdc. [yardlong bean]). DNA extraction was carried out with a cetyltrimethylammonium bromide (CTAB)–based method, and PCR amplification took place using a touchdown program with 55°C final annealing temperature (see Chapman and Burke, 2012 for details). PCR products were resolved on agarose gels stained with GelRed (Biotium, Hayward, California, USA).
Between 7.9 and 16.6 million paired-end 100-bp reads were generated for each of the bean samples (reads have been deposited in the National Center for Biotechnology Information [NCBI] Sequence Read Archive [ http://www.ncbi.nlm.nih.gov/sra] under BioProject ID PRJNA273585). After normalization, the number of paired reads to assemble was 2.8 to 3.8 million (Table 1). Assembly of the transcriptomes results in between 32,446 and 34,401 genes, corresponding to 47,759 to 52,083 transcripts (Table 1; Fig. 1A). The number of genes and number of transcripts was similar across species. In these four species, there are therefore approximately 1.4 to 1.6 isoforms per gene, a value comparable to the proportion of Arabidopsis intron-containing genes that have more than one alternative transcript (ca. 42%; Filichkin et al., 2010). The FASTA-formatted transcriptome sequences are available from the Dryad Digital Repository ( http://dx.doi.org/10.5061/dryad.k9h76; Chapman, 2015). N50, mean, and median transcript lengths were largely comparable across species; however, they were lowest for V. subterranea, for which the smallest number of reads was obtained (Fig. 1B). In line with the above statistics, the distribution of transcript lengths was qualitatively similar across species; however, more of the longer transcripts were resolved for Lablab and more of the shorter transcripts for Vigna (Fig. 1C). BLASTn searches against the common bean (P. vulgaris) coding DNA sequences (one sequence per common bean locus) resulted in between 71% and 92% of transcripts having a hit ( Appendix S1 (apps.1400111_s1.xlsx)). Heterozygosity was low, as predicted for an inbred accession from a gene bank, varying from 0.014% in Lablab and Vigna to 0.028% in Psophocarpus and 0.149% in Lathyrus.
Mining for SSRs identified between 1106 (Lathyrus) and 2427 (Lablab) SSR-containing loci (with one to four SSRs per locus) per transcriptome (Table 1; Appendix S2 (apps.1400111_s2.xlsx)). This corresponds to 4.7%, 2.2%, 3.5%, and 2.6% of the transcripts containing at least one SSR for Lablab, Lathyrus, Psophocarpus, and Vigna, respectively. A large proportion of the SSRs (36–43% per transcriptome) were found in the first or last 50 bp of the sequence, which may hamper primer design for this subset of loci. The proportions of di-, tri-, and tetranucleotide repeat SSRs was similar across Lablab, Lathyrus, and Psophocarpus, with di- and trinucleotide repeat SSRs each making up a similar proportion of total SSRs (36–47% each) and tetranucleotide SSRs only making up 16–20% of the total SSRs. In Vigna, however, dinucleotide SSRs were much less common than trinucleotide SSRs (29% vs. 55% of total SSRs). The most common di-, tri-, and tetranucleotide motifs are AG/CT (25–32% of SSRs), AAG/CTT (9–19% of SSRs), and TTTC/GAAA (2–4% of SSRs), respectively.
Nearly 1800 potential COS loci were identified by carrying out BLAST comparisons between the four transcriptomes (Fig. 2; Appendix S3 (apps.1400111_s3.xlsx)). These loci represent those for which a reciprocal best BLAST hit was identified in all pairwise comparisons (see above). To test the efficacy of this strategy in producing reliably amplifiable cross-family loci, primers were designed for the first 12 loci (avoiding three very short loci; Table 2) and PCR amplified from DNA from 12 legume species (see above). Of the 12 primer pairs, one failed to amplify and four amplified in at least 10 of the 12 species. The remainder successfully amplified between two and nine samples (Fig. 3).
To identify, understand, and benefit from crops with adaptive stress tolerances, it is important to characterize the genetic variation within the crop, with one goal being to relate genetic to phenotypic variation (Burke et al., 2007; Meyer and Purugganan, 2013). Legumes are vitally important crops due to their high protein content, especially in countries where meat protein is rarely consumed (they are sometimes called the “meat of the poor”; de Jager, 2013). The aim of this study was not simply to provide a comprehensive set of transcripts from these species but to pave the way, using SSR and COS markers, for further investigations including population genetics, QTL mapping, and marker-assisted selection.
The percentage of transcripts with a putative orthologue from the fully sequenced P. vulgaris was quite high (60–82%; Table 1; Appendix S1 (apps.1400111_s1.xlsx)), and was highest for the two species most closely related to Phaseolus (Lablab and Vigna) and lowest for Lathyrus, which is found in a different clade within the Papilionoideae legumes. The percentage of 27,197 P. vulgaris transcripts (taking just the longest transcript from each locus) with a BLAST hit in these four legumes was lower (49–70%; Table 1); however, this is not unexpected given the relatively shallow sequencing effort undertaken. Again, the percentage of hits was highest for Lablab and Vigna and lowest for Lathyrus (Table 1).
The development of cross-family markers to use the legume genomic resources in other species with limited resources would be of great benefit. Between 1139 and 2567 SSRs were identified in 1106 to 2427 transcripts (Table 1; Appendix S2 (apps.1400111_s2.xlsx)) corresponding to approximately 2.2–4.7% of transcripts containing at least one SSR, a value similar to other studies (2.5–21.1% in a review by Ellis and Burke, 2007). In this paper, approximately 1800 potential COS markers were identified (Fig. 2; Appendix S3 (apps.1400111_s3.xlsx), and, although only a small subset was tested for cross-family amplification, these markers show promise for fulfilling the aims of COS loci, i.e., they cross-amplify in a diverse assemblage of species and are thus useful in anchoring QTL maps or for phylogenetic reconstructions of the members of the family. This study demonstrates that a modest sequencing effort in selected taxa can lead to very large resources for future crop-specific and comparative analyses.
 This work was funded by a start-up fund to M.A.C. at the University of Southampton. The author would like to thank the Wellcome Trust Centre for Human Genetics for carrying out the sequencing, Nazmul Haq for insightful discussions pertaining to underutilized crops, and Nikol Voutsina for critical review of the manuscript.