Microsatellites, or simple sequence repeats (SSRs), have long played a major role in genetic studies due to their typically high polymorphism. They have diverse applications, including genome mapping, forensics, ascertaining parentage, population and conservation genetics, identification of the parentage of polyploids, and phylogeography. We compare SSRs and newer methods, such as genotyping by sequencing (GBS) and restriction site associated DNA sequencing (RAD-Seq), and offer recommendations for researchers considering which genetic markers to use. We also review the variety of techniques currently used for identifying microsatellite loci and developing primers, with a particular focus on those that make use of next-generation sequencing (NGS). Additionally, we review software for microsatellite development and report on an experiment to assess the utility of currently available software for SSR development. Finally, we discuss the future of microsatellites and make recommendations for researchers preparing to use microsatellites. We argue that microsatellites still have an important place in the genomic age as they remain effective and cost-efficient markers.
Microsatellites, or simple sequence repeats (SSRs), are short repeated DNA motifs (typically one to six nucleotides) located throughout eukaryotic genomes (Li et al., 2002; Zane et al., 2002). Within microsatellite regions, these motifs are repeated several to dozens of times, although the number of repeats is highly variable (Selkoe and Toonen, 2006). Replication slippage is generally considered the mechanism that creates variation in the number of repeats (Ellegren, 2004). Microsatellites exhibit high levels of polymorphism and have a high mutation rate, between 10⁻³ and 10⁻⁴ per locus per generation, compared to approximately 10⁻⁹ per nucleotide per generation for nucleotide substitutions across the entire genome in eukaryotes (Li et al., 2002). The high level of polymorphism in microsatellites makes these markers powerful tools for assessing genetic similarity between individuals or closely related taxa (Guichoux et al., 2011; Kalia et al., 2011). Since developing microsatellite loci (see Appendix 1 for a glossary of terms used in this paper) became cost-effective in the late 1990s, researchers have used them frequently in studies requiring high levels of polymorphism, generating approximately 225,000 published articles (search of Web of Science performed April 2016, term: microsatellite* OR “simple sequence repeat*”).
Microsatellites have been used for a wide variety of applications, including genome mapping, forensics, parentage analysis, conservation genetics, identification of the parentage of polyploids, phylogeography, and population genetics (Ellegren, 2000; Esselink et al., 2004; Kalia et al., 2011). Their abundance in the genome, high levels of polymorphism, and cost effectiveness have contributed to the attractiveness of these markers. They are inexpensive when compared to the cost of using next-generation sequencing (NGS) techniques to generate sufficient data to differentiate among closely related individuals (Davey et al., 2011). Additionally, unlike with NGS data, the relatively small number of loci used in an SSR study means that each locus can be manually genotyped, reducing errors. Because they are PCR-based markers, microsatellite loci can be successfully amplified from poor-quality or low quantities of DNA, making them useful markers for studies involving ancient DNA or museum specimens (Wandeler et al., 2007). Many microsatellite primers will work in species closely related to the one for which they were originally designed, allowing for multispecies studies.
Many of the applications noted above select microsatellites for their presumably neutral nature. SSRs can also be used in studies favoring nonneutrality; the association of microsatellites with a gene under selection can be used for the construction of genetic maps (Serikawa et al., 1992; Echt et al., 2011). Microsatellites are used in crop science and forestry to build high-density genetic maps useful for locating genes that confer resistance to a pest or disease, or that control a desired trait (e.g., Hardwood Genomics Project; www.hardwoodgenomics.org). For example, a map with 19 microsatellites was built around Ppr1, a locus controlling Puccinia psidii rust resistance in Eucalyptus L'Hér. (Mamani et al., 2010). Also, SSRs were used to map blight resistance genes in Castanea dentata (Marshall) Borkh., the American chestnut (Jacobs et al., 2013). Finally, in the case of very large genomes, microsatellites are the favored method to construct a genetic map in the absence of a reference genome. The efforts to build genetic maps for gymnosperms have been successful with the association of single-nucleotide polymorphisms (SNPs) with SSRs in Pinus taeda L. (Echt et al., 2011).
The use of microsatellites, however, is not without concerns and caveats. The mechanism that leads to mutations in microsatellites (replication slippage) is prone to back mutations, promoting homoplasy (Viard et al., 1998). Extensive homoplasy leads to erroneous inferences of homology. Although the high potential for homoplasy can be modeled (e.g., using the stepwise mutation model), homoplasy complicates analyses and lowers confidence in inferences made using microsatellites (Slatkin, 1995). Furthermore, the high rates of polymorphism and homoplasy make microsatellites unsuitable for phylogenetic analyses beyond very closely related species (e.g., Soltis et al., 1998). Another concern is that the large number of alleles per locus associated with microsatellites can inflate F-statistic estimates relative to biallelic markers, such as SNPs (Whitlock, 2011). Conversely, in some cases, allele frequencies can also suppress F-statistic estimates in microsatellites: estimates of genetic differentiation among populations (FST) are very low when the frequency of the most common allele is either very low or very high (Jakobsson et al., 2013). Additionally, genotyping errors, which can bias downstream analyses (Hoffman and Amos, 2005), are also potential concerns. Although NGS techniques such as genotyping by sequencing (GBS) and restriction site associated DNA sequencing (RAD-Seq) (Appendix 1) also have the potential for sequencing errors, the large amount of data generated with NGS methods diminishes this concern, effectively “drowning out” erroneous signal (Hou et al., 2015). By contrast, the relatively small number of loci used in traditional microsatellite studies means that genotyping errors can have a large downstream effect. The genomic age has ushered in a variety of new techniques that offer alternatives to SSRs. Thus, in this review of microsatellites, we address the following sets of questions:
How do SSR markers compare to NGS markers generated using GBS/RAD-Seq? What factors should researchers consider when choosing a genotyping method?
For researchers planning to use microsatellites, what details are critical when designing a project? What is the current state of SSR marker development?
What is the future of microsatellite markers? How should researchers use microsatellites in 2016 and beyond?
We first compare the advantages and disadvantages of using microsatellites as opposed to GBS/RAD-Seq. We then review techniques currently used for identifying microsatellite loci and developing primers, emphasizing those that make use of NGS approaches. Additionally, we make recommendations for researchers considering using microsatellites and address the question: Are SSRs a viable option when NGS techniques are rapidly becoming more cost-effective? We also review software packages for analyzing microsatellite data and make recommendations for researchers planning to use microsatellites.
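The suppression of FST at highly polymorphic loci mentioned above can be illustrated with a short calculation. The following is a minimal sketch, assuming the simple heterozygosity-based definition FST = (HT − HS) / HT with equally weighted populations; the function names and the allele frequencies are hypothetical and chosen only for illustration, and they are not taken from any of the cited studies.

```python
from collections import defaultdict


def expected_heterozygosity(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), from a dict of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


def fst(populations):
    """F_ST = (H_T - H_S) / H_T for equally weighted populations.

    Assumes at least one population is polymorphic or the populations
    differ, so that H_T is greater than zero.
    """
    n = len(populations)
    # H_S: mean within-population expected heterozygosity
    h_s = sum(expected_heterozygosity(p) for p in populations) / n
    # H_T: expected heterozygosity of the pooled (mean) allele frequencies
    pooled = defaultdict(float)
    for pop in populations:
        for allele, p in pop.items():
            pooled[allele] += p / n
    h_t = expected_heterozygosity(pooled)
    return (h_t - h_s) / h_t


# Hypothetical biallelic marker: two populations fixed for different alleles
snp = [{"A": 1.0}, {"B": 1.0}]

# Hypothetical microsatellite-like locus: each population carries ten
# equally frequent alleles, none of which is shared with the other population
ssr = [
    {f"a{i}": 0.1 for i in range(10)},
    {f"b{i}": 0.1 for i in range(10)},
]

fst_snp = fst(snp)
fst_ssr = fst(ssr)
print(f"biallelic locus F_ST = {fst_snp:.3f}")
print(f"multiallelic locus F_ST = {fst_ssr:.3f}")
```

In this toy case the two populations share no alleles at either locus, yet the multiallelic locus yields a much lower value because within-population heterozygosity is already close to its maximum.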
MICROSATELLITES VS. GBS/RAD-SEQ
For a plant population geneticist beginning a study, there are important decisions to make regarding marker choice before collecting a single sample. Microsatellites have been, and still remain, a viable option for collecting genetic data, whereas GBS/RAD-Seq methods are increasing in popularity (Narum et al., 2013). Researchers need to consider carefully a variety of factors before beginning a study, including the project budget, the size of the group to be investigated (number of samples), the genetic resolution required, and the availability of genomic resources for the study group (e.g., a sequenced genome or other existing resources). When there is a very limited budget or only a small number of individuals can be included (e.g., a conservation genetic study on a rare species), microsatellites remain a good choice (Gardner et al., 2011). However, it may be preferable to start with GBS or RAD-Seq when beginning a long-term project, although samples must be organized into discrete groups for multiplexing, as the use of multiplexing is what makes these techniques affordable. Importantly, if additional data are needed, from the sequencing perspective, it would be as expensive to add one more sample as it would to add 100. Due to lane effects and other stochasticities associated with NGS, it is advisable to use standards in a long-term project that will use different sequencing machines. A strong background in computing skills and bioinformatics is needed to deal with the large quantity of data generated by NGS approaches, whereas researchers can complete microsatellite analysis with limited computing skills and/or resources on a laptop computer using one or more graphical user interface (GUI) programs.
RAD-Seq and GBS are approaches that combine the value of reducing genome complexity with restriction enzymes (REs) and NGS-based SNP discovery and genotyping (Davey and Blaxter, 2010; Davey et al., 2011; Etter et al., 2011; Arnold et al., 2013; Andrews et al., 2016). These methods enable discovery of thousands of markers, even in nonmodel organisms, and characterization of different levels of genetic variation across the genome (Hohenlohe et al., 2010; Rowe et al., 2011; Liu et al., 2013; Lu et al., 2013). The main differences between RAD-Seq and GBS are methodological, relating to which REs are used to digest DNA, how sequencing adapters and multiplexing barcodes are added to samples, and the use of a size selection step (Elshire et al., 2011; Cronn et al., 2012). Hereafter, we will treat RAD-Seq and GBS as a suite of methods united by their use of REs to reduce genome complexity prior to multiplexed NGS and will refer to this suite of methods as RAD/GBS. Library complexity is directly related to genome complexity and size and the choice of REs (Beissinger et al., 2013). With RAD/GBS, there is a trade-off between the number of SNPs and coverage of each locus, which can be mediated by choosing REs with longer recognition sites, resulting in higher coverage of fewer loci. This approach enables the use of these data for population genetics (Beissinger et al., 2013; Lu et al., 2013; Narum et al., 2013).
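The trade-off between the number of loci and per-locus coverage can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a random genome with equal base frequencies, so that a recognition site of k bp is expected once every 4^k bp, and it treats each cut site as one locus. The genome size, reads per lane, and number of multiplexed samples are hypothetical values, not figures from the studies cited above; real genomes deviate from this expectation because of base composition and repeats.

```python
def expected_cut_sites(genome_size_bp, site_length):
    """Expected number of recognition sites in a random genome
    with equal base frequencies (one site every 4**site_length bp)."""
    return genome_size_bp / 4 ** site_length


def mean_coverage_per_locus(reads_per_lane, n_samples, n_loci):
    """Mean reads per locus per sample when a lane is split evenly
    among multiplexed samples and reads are spread evenly over loci."""
    reads_per_sample = reads_per_lane / n_samples
    return reads_per_sample / n_loci


# Hypothetical project parameters (assumptions for illustration only)
GENOME_SIZE_BP = 500_000_000
READS_PER_LANE = 200_000_000
N_SAMPLES = 96

for site_length in (4, 6, 8):
    loci = expected_cut_sites(GENOME_SIZE_BP, site_length)
    cov = mean_coverage_per_locus(READS_PER_LANE, N_SAMPLES, loci)
    print(f"{site_length}-bp site: ~{loci:,.0f} loci, "
          f"~{cov:.1f} reads per locus per sample")
```

Under these assumptions, each additional base pair in the recognition site cuts the expected number of loci by a factor of four and raises mean coverage by the same factor, which is the trade-off described above.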
The primary advantage of RAD/GBS is that thousands of loci can be simultaneously generated for hundreds of individuals, with costs as low as US$35 per sample (assuming strategic sharing of REs, adapters, barcodes, and efficient multiplexing with an optimal number of samples). Reducing genome complexity with REs is a very specific, fast, and simple procedure (Sonah et al., 2013; Andrews et al., 2016). There is no requirement for a priori knowledge of the genome of the species; however, a reference genome facilitates selecting an appropriate RE (Sonah et al., 2013; Spindel et al., 2013; Liu et al., 2014). REs can be chosen that avoid highly repetitive regions and target low-copy regions, increasing the efficiency of the research goals and reducing computational time with alignment procedures. A multiplex barcoding system increases efficiency and reduces costs (Smith et al., 2010; Andolfatto et al., 2011; Elshire et al., 2011; Sonah et al., 2013). SNPs also have a number of advantages when directly compared to SSRs: they are less prone to homoplasy than SSRs and are also easier to locate in most single-copy regions of the genome than SSRs (Rafalski, 2002). Another advantage of SNPs is that relatively few SNPs are needed to define a haplotype or to detect linkage disequilibrium (Rafalski, 2002).
RAD/GBS approaches have several disadvantages as well. Problems may result from: (1) the frequent conflation of paralogous loci due to misassembly of reads (Etter et al., 2011; Xu et al., 2014), (2) sequencing errors and inaccurate genotyping with low sequencing depths (Arnold et al., 2013), (3) PCR bias in library construction (Arnold et al., 2013), and (4) nonrandom cleavage by enzyme digestion (Arnold et al., 2013). The first three issues have largely been addressed by improvements in algorithms and software for processing loci, improvements in sequencing technology and careful multiplexing, and multiple PCR steps, respectively. However, sampling DNA based on REs may still include a bias in allele frequency estimation. Mutations in restriction sites can lead to underestimating diversity and introduce genealogical biases, causing haplotypes to be non-randomly sampled (Arnold et al., 2013). Additionally, the nucleotide composition of the restriction site affects which areas of the genome are sampled; the goals of the study should guide which REs are chosen. GC content should be carefully considered when selecting REs, as GC-rich REs lead to overrepresentation of the portions of the genome high in GC content (DaCosta and Sorenson, 2014). Additionally, RAD/GBS data often over-estimate heterozygosity (Arnold et al., 2013; Gautier et al., 2013). Unlike with SSR markers, manual validation is impractical with RAD/GBS data, and biases or errors may be impossible to detect (Etter et al., 2011; Davey et al., 2013).
Several aspects of RAD/GBS present challenges that researchers need to consider. Multiplex sequencing protocols for RAD/GBS often depend on an accurate quantification of high-molecular-weight DNA (Elshire et al., 2011). However, this requirement may be waning, as recent studies have used RAD/GBS on herbarium specimens, which may have degraded DNA (Beck and Semple, 2015). Little information is currently available about how markers discovered with RAD/GBS are distributed across the genome, although studies in wheat and barley suggest that these markers are uniformly spaced (Poland et al., 2012). Large variation in GC content among taxa may introduce biases, leaving important genomic regions over- or underrepresented (Beissinger et al., 2013). However, large differences in GC content among close relatives are unusual, meaning this will likely not be an issue in population genomic studies. Another concern is that RAD/GBS data sets often have a huge amount of missing data compared to traditional genotyping methods. Researchers must make critical decisions about whether to exclude loci and/or individuals from analyses when there are high levels of missing data. Another consideration is that missing data are not randomly spread across individuals and/or loci due to the nature of the genomic library construction (REs). Therefore, allelic dropout, genealogical biases, and underestimation of diversity may be some of the consequences of missing data in RAD/GBS methods. Aspects of library construction, data processing, and the divergence history of study species may affect results; simulations and more studies are needed to define guidelines about how to handle missing data when using RAD/GBS (Huang and Knowles, 2016).
Whereas RAD/GBS are powerful methods for diploid species, many challenges remain for calling SNPs in polyploids. Specialized SNP genotyping algorithms are required when using RAD/GBS in polyploids (Narum et al., 2013). Because sequencing coverage determines the level of missing data, the large genomes of some plants, especially polyploids, can lead to low coverage. In all RAD/GBS protocols, the average number of reads per sample will be based on multiplexing and the number of independent sequences generated by the sequencing platform—either sequencing coverage or number of samples multiplexed will be reduced in polyploids as compared to diploids (Poland and Rife, 2012). Large plant genomes, due to either repetitive DNA or polyploidy, can lead to the erroneous construction of artifactual composite “loci” with falsely inferred polymorphisms. Longer reads facilitate the discovery of more polymorphisms when RAD/GBS is applied to polyploids, which require genome-specific polymorphisms to differentiate among homeologous sequences (Poland and Rife, 2012; Sonah et al., 2013).
One of the most important criteria for selecting a method is cost-feasibility; we present two approximate budgets (Appendix 2) for genotyping 96 individuals: one that involves developing and genotyping microsatellites and one that implements RAD/GBS. As of May 2016, if a researcher needs to develop his/her own microsatellite loci, the cost of genotyping approximately 96 individuals using 12–15 microsatellite markers is similar to performing RAD/GBS on 96 individuals. It is very challenging to present a budget that accounts for all the factors that will determine the cost of a project, but we attempt some approximate budgets that can be used as guidelines when designing projects.
MICROSATELLITE DEVELOPMENT: REVIEW OF TECHNIQUES
If microsatellite markers are the chosen approach, researchers have two options: generate sequence data for microsatellite detection or mine pre-existing resources for marker discovery. The first option requires decisions on library preparation, sequencing platform (including read length and depth), and software for marker detection. The second option makes the first two decisions unnecessary and bypasses sequencing costs, but software choice is still important.
Historical methods of microsatellite library construction—Microsatellite libraries were traditionally developed by digestion with one or more REs (Ritschel et al., 2004). A linker of known sequence would be ligated onto the digested fragments, and one or more probes containing repeat sequences were hybridized to those fragments. This enrichment step limited the nature of the microsatellites that would ultimately be obtained at the end of the procedure. The repeat-enriched fragments were then recovered using streptavidin-coated beads (Nunome et al., 2006). The library was amplified and the PCR products cloned and sequenced. The enrichment strategy is time-consuming (10–14 d), and the DNA extracted for such a protocol has to be of high quality and quantity. The yield of such a library construction is typically eight to 20 polymorphic loci for 30–60 SSR primer pairs tested (Zalapa et al., 2012), and the initial cost is low (less than US$500 for a cloning kit).
NGS has transformed the development of microsatellite loci for ecological and evolutionary studies. Current approaches allow quick and inexpensive identification of large numbers of loci in nonmodel organisms. Studies so far have largely focused microsatellite discovery efforts on the Roche 454 (454 Life Sciences, a Roche Company, Branford, Connecticut, USA) and Illumina (Illumina, San Diego, California) platforms (Jennings et al., 2011; Zalapa et al., 2012), although Pacific Biosciences (PacBio, Menlo Park, California, USA) (Grohme et al., 2013; Wei et al., 2014) and Ion Torrent (Thermo Fisher Scientific, Waltham, Massachusetts, USA) (Huey et al., 2013; Kameyama and Hirao, 2014) have also been used. Because read length greatly affects the ability to discover microsatellite markers, as longer reads will more likely include the flanking regions needed for primer design (Lepais and Bacles, 2011; Schoebel et al., 2013; Elliott et al., 2014), the 454 sequencing platform was used extensively for microsatellite development (Castoe et al., 2010). On a per-megabase basis, however, 454 is less cost-effective than Illumina (Glenn, 2011; Appendix S1 (apps.1600025_s1.docx)). Between January 2013 and April 2016, 74 projects using 454 were published in Applications in Plant Sciences, yielding between eight and 91 polymorphic loci, with an average of 16 loci, derived from an average of 139,418 reads. Roche announced they will be discontinuing the use of the 454 instrument in 2016. Future projects using NGS to develop microsatellite loci will rely on alternative platforms.
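The effect of read length on marker discovery can be sketched with a simple geometric argument. The function below is an illustration only: it assumes the repeat lies entirely within a single read, that its start position is uniformly distributed along the read, and that a fixed minimum flank is needed on each side for primer design. The function name, the 20-bp repeat, and the 30-bp flank are hypothetical choices, not values reported in the studies cited above.

```python
def usable_fraction(read_length, ssr_length, min_flank):
    """Fraction of possible SSR placements within a read that leave at
    least min_flank bp of flanking sequence on both sides.

    Assumes the SSR is fully contained in the read and that every
    start position is equally likely.
    """
    placements = read_length - ssr_length + 1
    if placements <= 0:
        return 0.0  # the repeat does not fit in the read at all
    usable = read_length - ssr_length - 2 * min_flank + 1
    return max(usable, 0) / placements


# Hypothetical example: a 20-bp repeat needing 30 bp of flank on each side
for read_length in (50, 100, 150, 300, 400):
    frac = usable_fraction(read_length, ssr_length=20, min_flank=30)
    print(f"{read_length}-bp read: {frac:.0%} of placements usable")
```

Under these assumptions the usable fraction rises steeply with read length, which is consistent with the preference for longer reads described above.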
Current library preparation methods—Several approaches can reduce genomic complexity and enrich for microsatellites prior to library building (Glenn, 2011). Method selection depends on platform throughput, number of individuals, desired coverage, and availability of a reference genome or transcriptome (Jennings et al., 2011). Microsatellite-enrichment methods require a priori decisions on the type of repeat motif and size of repeat sequence, creating bias in locus choice (Castoe et al., 2010). Using shotgun sequencing to identify loci allows for random sampling of the genome and is preferable to microsatellite-enrichment techniques. Regardless of sequencing platform and library preparation, however, NGS approaches to microsatellite discovery are more time- and cost-effective and provide more potential loci than traditional approaches. The limiting step for microsatellite studies is no longer marker discovery and development, but instead, screening and validation of loci (Wei et al., 2014).
The short read lengths obtained with platforms such as Illumina and Ion Torrent previously limited their utility for microsatellite development. However, as Illumina platforms generate longer read lengths (MiSeq currently generates 2 × 300 bp reads), this limitation is changing. Zalapa et al. (2012) reported two of 17 projects in their analysis used Illumina platforms. Between January 2013 and April 2016, 28.8% (34 of 118) of primer notes published in Applications in Plant Sciences utilizing NGS used Illumina. For studies using Illumina, the average number of polymorphic microsatellite markers reported was 15 loci, and the average number of potential loci per study was 15,539, which is larger than other platforms (e.g., 454, with an average of 4400 potential markers). This is predominantly due to the greater throughput of Illumina (see Appendix S1 (apps.1600025_s1.docx)).
Sequencing platform—Read length, read output, and error rate all affect platform choice for generating sequence data for marker discovery (Glenn, 2014; Appendix S1 (apps.1600025_s1.docx)). Currently there are three Illumina platforms available: MiSeq, HiSeq, and NextSeq, with the HiSeq X Ten debuting in 2016. The MiSeq, which only has a single lane, has the fastest run times and the longest read lengths (∼56 h for 2 × 300 bp). However, the MiSeq output consists of relatively few reads (50 million) of up to 2 × 300 bp at a higher cost per mega base pair compared to the HiSeq. The HiSeq has a low cost per megabase of data, with up to 500 gigabases (Gb) of data per flow cell. However, its reads are shorter than those of the MiSeq: until recently, the longest was 2 × 150 bp, and runs take up to six days, although the new HiSeq v2 reagents allow 2 × 250 bp in rapid run mode. Drawbacks to the HiSeq are the requirement to fill all eight lanes before running, and that a single flow cell can be processed only as a rapid run or a high-throughput run. The NextSeq falls between the two other platforms in performance; it can generate reads of 2 × 150 bp, with a high-throughput run generating up to 120 Gb of data in ∼29 h. All three models have a low final error rate of 0.1% (primarily substitution-type miscalls; Glenn, 2014).
Two additional platforms that are increasing in use for SSR discovery are Ion Torrent and PacBio. Ion Torrent has three chip options generating between 50 Mbp and 2 Gbp of data, with read lengths of 200 or 400 bp, and sequencing time ranging between 3 and 7.9 h. The PacBio platform is a single-molecule real-time sequencer, which removes PCR errors that can be introduced when using other platforms. Of the three platforms reviewed, PacBio has the greatest flexibility in run times (30 min to 6 h per single-molecule real-time sequencing [SMRT] cell) and run size (one to 16 SMRT cells) and provides the longest read lengths, up to 20 kb, an attractive feature for microsatellite discovery. PacBio suffers from the highest error rate, approximately 13% in raw reads. However, unlike Illumina and Ion Torrent, these errors are stochastic, meaning that a final error rate of less than 1% can be achieved in the consensus sequence of numerous raw reads (Glenn, 2014). Unfortunately, the advantages of the PacBio system come at a cost: it delivers a very low total number of reads per run (500 Mbp to 1 Gbp per SMRT cell) and a high cost per Mbp of data (Appendix S1 (apps.1600025_s1.docx)).
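To see why stochastic errors can be averaged out in a consensus, consider a simplified majority-vote model. The sketch below assumes that errors are independent between reads, that every erroneous read carries the same wrong call (a worst case), and that the consensus is wrong only when more than half of the reads are wrong. This is an illustrative model, not the consensus algorithm used by PacBio software, and it ignores the fact that many raw-read errors are insertions or deletions.

```python
from math import comb


def majority_error(per_read_error, n_reads):
    """Probability that more than half of n_reads independent reads
    are wrong at a given position (binomial tail)."""
    threshold = n_reads // 2 + 1  # smallest count that forms a majority
    return sum(
        comb(n_reads, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (n_reads - k)
        for k in range(threshold, n_reads + 1)
    )


RAW_ERROR = 0.13  # approximate raw-read error rate stated in the text

for n in (1, 3, 5, 7, 9):
    print(f"{n} reads: consensus error ~{majority_error(RAW_ERROR, n):.4f}")
```

Under these assumptions, a consensus error below 1% is reached once about seven raw reads cover a position.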
Many pipelines have been published using paired-end Illumina reads (e.g., Miller et al., 2013; Andersen and Mills, 2014), with genomic DNA or RNA-Seq data. Gilmore et al. (2013) estimated the time and cost of using Illumina data to produce markers from eight samples to be approximately 20 h of laboratory work for sample preparation and approximately US$51 per sample. Several recent studies have justified the use of other platforms, mainly Ion Torrent (Elliott et al., 2014) and PacBio (Wei et al., 2014). In a comparison of the utility of 454 and Ion Torrent, Elliott et al. (2014) found the Ion Torrent recovered shorter microsatellite repeats (due to shorter reads), but more markers were discovered at a lower cost and more quickly than with 454. The PacBio RS platform may become a preferred method for obtaining highly variable SSRs in the future, especially if error rates and price decrease; the latter is proposed with their Sequel instrument. Small-scale marker development results in long reads using a single SMRT cell, which may yield thousands of repeats (Grohme et al., 2013; Wainwright et al., 2013).
Mining existing data sets—Another option for developing microsatellite markers is using publicly available sequence data from online repositories such as the Sequence Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra). This archive houses a large collection of raw sequence data from various NGS platforms and approaches such as targeted-gene capture, genome skimming, restriction digests, and transcriptome sequencing. To determine the potential of these data sets to generate microsatellite markers, targeted-gene capture (SRR2658270; Landis et al., 2015) and RAD-Seq reads (Hodel et al., unpublished data) were used for marker discovery. In both data sets, over 100,000 potential loci were discovered, highlighting the utility of publicly available data for mining SSR loci (Tables 1 and 2). Another resource for researchers is the One Thousand Plant Transcriptomes Project (1KP; www.onekp.com; Matasci et al., 2014), which has transcriptome assemblies for over 1000 plant species. The companion paper to our review presents over five million SSR loci that can be used in thousands of plant species (Hodel et al., 2016). It is important to note that once potential loci are identified from NGS data, this is just the starting point for developing a functional microsatellite genotyping system and extensive and costly screening of loci will be required, as outlined in the budget in Appendix 2.
SOFTWARE FOR MICROSATELLITE DEVELOPMENT
Once researchers generate or obtain NGS data, the next step is to use a software program to identify potential loci to screen. We tested the effectiveness and ease of use of 10 commonly implemented software packages for microsatellite identification using four Arabidopsis thaliana NGS data sets mined from SRA. The data sets are: a single-end (1 × 100 bp) Illumina HiSeq 2000 lane (ERR368422; 10.9 million reads and a total of 1.5 Gbp of sequence), a paired-end (2 × 100 bp) Illumina HiSeq 2000 lane (ERR965681; 97 million reads and a total of 8.7 Gbp of sequence), a paired-end (2 × 250 bp) Illumina MiSeq run (ERR365834; 13.2 million reads and a total of 3.3 Gbp of sequence), and a PacBio sequencing run (SRR1284764; 163,500 reads and a total of 476 Mbp of sequence). We obtained the data in FASTA and FASTQ files from SRA using the SRA toolkit. Hereafter, these data sets will be referred to as HiSeq1, HiSeq2, MiSeq, and PacBio. FASTA files for each data set ranged in size from 445 MB (PacBio) to 5.7 GB (HiSeq2). For some software packages, we had to use other file formats (e.g., FASTQ), but we report FASTA file sizes for simplicity.
[Table caption: The number of loci found in an SSR search, and the number of loci found per mega base pair of sequence, for each software package in each of the two data sets used to highlight the vast potential resources available to researchers who cannot generate their own sequence data to search for SSRs.]
We selected these four data sets to investigate how read number, read length, sequencing platform, and data set size affected the performance of each software package. Our goal was to provide readers with the information necessary to obtain microsatellite loci from publicly available data as easily as possible. We ran each data set through each software program, using the same settings in each program as much as possible. We selected the default values from QDD3 to use in every program, as the default values were difficult to change in QDD3. Although it is important to use a consistent set of parameters for every program, the actual parameters used can be arbitrary, so we used QDD3 defaults. The critical parameters to standardize were the number of repeats of a certain length motif required to call a locus. The QDD3 default values are: homopolymers, 1,000,000 repeats; dinucleotides, five repeats; trinucleotides, five repeats; tetranucleotides, five repeats; pentanucleotides, five repeats; hexanucleotides, five repeats. For each software package that ran to completion for all data sets, we report the total number of SSR loci found, the number of loci per mega base pair of sequence, and the distribution of loci across size motifs (di-, tri-, tetra-, penta-, hexanucleotides).
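The search parameters described above can be illustrated with a small regular-expression scan. This is a minimal sketch and not the algorithm used by QDD3 or any other package reviewed here: it requires at least five perfect tandem copies of a di- to hexanucleotide motif, omits homopolymers (the 1,000,000-repeat threshold effectively excludes them), and skips motifs that are themselves repeats of a shorter unit. All function names and the example read are hypothetical.

```python
import re
from collections import Counter

# Minimum number of tandem copies per motif length (di- to hexanucleotide),
# following the threshold values listed in the text
MIN_REPEATS = {2: 5, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit
    (e.g., 'ATAT' is 'AT' twice, and 'AA' is a homopolymer)."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq):
    """Return sorted (start, motif, copies) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-bp motif followed by at least (min_rep - 1) more copies of itself
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


def summarize(reads):
    """Total loci, loci per Mbp of sequence, and loci per motif length."""
    by_motif_length = Counter()
    total_bp = 0
    for read in reads:
        total_bp += len(read)
        for _, motif, _ in find_ssrs(read):
            by_motif_length[len(motif)] += 1
    n_loci = sum(by_motif_length.values())
    return n_loci, n_loci / (total_bp / 1e6), dict(by_motif_length)


# Hypothetical read with an (AT)6 repeat, an (AAG)5 repeat, and a poly-A tail
example_read = "GGC" + "AT" * 6 + "CCGTA" + "AAG" * 5 + "TTGCA" + "A" * 12
print(find_ssrs(example_read))
print(summarize([example_read]))
```

A production tool would also handle imperfect repeats, compound loci, reverse-complement motifs, and redundancy among reads, none of which this sketch attempts.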
First, we summarize the utility and main characteristics of the software packages (see below, and Table 3). Next, we compare software packages, so future researchers are well-equipped to develop SSR loci easily. The goal of most of these programs is to search for SSR loci, quantify the distribution of loci across size motifs, and facilitate primer design. Many of these software packages use a GUI, but some are command line only and require knowledge of Perl or Python for software installation and execution. Many of the software packages interact with Primer3 (Rozen and Skaletsky, 1999) for primer design. Most programs are open source, platform independent, and capable of handling genomic data. When possible, we ran these software packages on a high-performance computing cluster. As noted below, some software packages would not run after a reasonable period of effort by a biologist proficient in command line and at least one programming language. We briefly describe and evaluate each program, report the resources required to run each one, how long execution took, and other relevant details for evaluating software packages (Table 3).
Most of the tested software packages executed successfully for all four test data sets and produced results consistent with other programs (Tables 4 and 5). HighSSR and SSR_pipeline did not run to completion. The software packages that failed to run or complete the loci search were either old or not compatible with current NGS data sizes and formats. For instance, there are several types of FASTQ formats, but SSR_pipeline recognized only one old version, and HighSSR is unable to run with files larger than 2 GB. Other packages, including GMATo, PAL_FINDER, QDD3, SSR Locator, and STAMP, had limitations. These packages were either slow, could not handle all data types and/or sizes, or were difficult to use (e.g., they required a substantial amount of file formatting and manipulating). PAL_FINDER and MSATCOMMANDER (Faircloth, 2008) consistently found fewer loci than other software packages (Table 4). We recommend using Phobos (either by itself or through Geneious if Primer3 integration is desired) or MISA. We base these recommendations on ease of use and reliability of results.
The number and percentage of each repeat motif type using each software package found in the SSR search for each test data set.
Geneious is a desktop software suite for the organization and analysis of sequence data in molecular biology (Kearse et al., 2012). Microsatellite development requires several plugins (e.g., Phobos, Primer3, and MISA) to meet users' specific needs. Geneious is commercial software, and purchasing the required license adds to research costs. The component that searches for microsatellite loci is Phobos, which can be run independently of Geneious for free. Phobos has both GUI and command-line interfaces, and it processes large files quickly. Every data set tested completed the search in less than an hour on a standard laptop (2.5-GHz Intel Core i5, 8 GB RAM). Phobos does not interact directly with Primer3, but if Phobos is used through Geneious, the results of the loci search in Phobos can be easily piped to Primer3. For microsatellite loci development, Phobos is fast and user-friendly.
GMATo comes with a Java graphical interface and is ready to execute immediately after downloading (Wang et al., 2013). GMATo results are presented as a table of SSR loci statistics. It runs quickly; for the HiSeq2 data set (a 5.7-GB file), it completed the job within 52 min on a desktop Windows machine (eight-core 3.4-GHz Intel Core i7-2600 CPU, 16 GB RAM). However, the user cannot set a different minimum number of repeats for each motif length; the same value must be used for every motif length. In our tests, this program was not capable of primer design, marker generation, or electronic mapping of markers.
HighSSR detects microsatellites and eliminates redundancy in the PCR primers for recovered loci (Churbanov et al., 2012). It identifies and scores SSRs in raw sequencing reads with Tandem Repeats Finder (TRF; Benson, 1999) and stores them in a PostgreSQL database, reporting summary statistics, such as the number of alleles of each SSR locus, which can be analyzed by other software. HighSSR demultiplexes pooled libraries, assesses locus polymorphism, and implements Primer3 for primer design. Finally, MUSCLE (Edgar, 2004) is used to refine crude clusters and distill loci from them. However, HighSSR requires a Java virtual machine and access to a database on a PostgreSQL server. Moreover, nonuniversal parameter settings and its mix of Java programs and shell scripts make it difficult to use. With the TRF executable, we could open only our smallest test data file (PacBio; 445 MB).
MISA is short for MIcroSAtellite identification tool, which was originally designed to generate SSR loci from EST data (Thiel et al., 2003). It works immediately if Perl is installed and runs rapidly; the 5.7-GB HiSeq2 data set finished in 1.8 h (one node, one processor, and 4 GB of memory). Users are able to change the default settings by editing a configuration file (misa.ini), and MISA is able to generate primers. Its results are in tabular form, giving a summary of different statistics, such as the frequency of a specific microsatellite type. However, some studies indicate that MISA may report overlapping microsatellites redundantly (e.g., Wang et al., 2013; Hodel et al., 2016).
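Where redundant reporting of overlapping microsatellites is a concern, one possible post-processing step is to merge overlapping hits on the same sequence before counting loci. The sketch below illustrates that idea only. It is not part of MISA, and the `(seq_id, start, end)` tuple layout is an assumption rather than the output format of any package.

```python
def merge_overlapping(hits):
    """Merge overlapping SSR hits that lie on the same sequence.

    hits: iterable of (seq_id, start, end) tuples with `end` exclusive.
    Returns a sorted list in which overlapping intervals are combined,
    so each stretch of sequence is counted once.
    """
    merged = []
    for seq_id, start, end in sorted(hits):
        if merged and merged[-1][0] == seq_id and start < merged[-1][2]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            prev = merged[-1]
            merged[-1] = (seq_id, prev[1], max(prev[2], end))
        else:
            merged.append((seq_id, start, end))
    return merged
```

Merging treats any shared base as redundancy. Hits that merely abut are kept separate, which may or may not be the desired behavior for compound repeats.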
MSATCOMMANDER enables rapid and automated microsatellite detection, locus-specific primer design, and tagging (Faircloth, 2008). It requires Python and writes output files in comma-separated value (CSV) format. However, the results are difficult to view and do not include general summary statistics about the types of microsatellite loci found. The user must spend considerable time filtering the output file to determine basic statistics (e.g., the number of dinucleotide repeats found). It utilizes Primer3 as its primer design and primer-tagging engine.
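The manual filtering described above can be scripted. The following sketch counts loci by motif class from a CSV file of SSR hits. The column name `motif` is a placeholder assumption, not the actual MSATCOMMANDER header, and would need to be adjusted to match the real output file.

```python
import csv
from collections import Counter

MOTIF_CLASS = {1: "mono", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}


def summarize_motifs(rows, motif_column="motif"):
    """Count loci by motif class from an iterable of dict-like rows.

    `motif_column` is a hypothetical column holding the repeat unit (e.g., "AT").
    """
    counts = Counter()
    for row in rows:
        motif = row[motif_column].strip()
        counts[MOTIF_CLASS.get(len(motif), "other")] += 1
    return counts


def summarize_csv(path, motif_column="motif"):
    """Read a CSV file with a header row and summarize its motif classes."""
    with open(path, newline="") as fh:
        return summarize_motifs(csv.DictReader(fh), motif_column)
```

For example, `summarize_csv("hits.csv")["di"]` would give the number of dinucleotide repeats, where `hits.csv` is a hypothetical file name.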
PAL_FINDER finds microsatellite repeat elements directly from raw NGS sequencing reads and then designs PCR primers to amplify these repeat loci (potentially amplifiable loci [PAL]) by interaction with Primer3 (Castoe et al., 2012). This is command-line software, which can be freely modified by the user via the required config file. However, its performance is very sensitive to data coverage (quantity and quality of PALs; Castoe et al., 2012). After approximately 24 h of effort manipulating FASTQ input files, we were unable to get the FASTQ mode to work. We could use any type of FASTA file in the “454” mode, including paired-end Illumina data, as long as all the reads were in a single file. This program has a slow run time relative to other software packages reviewed (>24 h for data sets >4 GB on a standard laptop [2.5-GHz Intel Core i5 with 8 GB RAM]).
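Placing all reads in a single FASTA file, as required for the workaround described above, can be done with a short script. The sketch below converts unwrapped four-line FASTQ records to FASTA and concatenates two paired-end files. The `/1` and `/2` read-name suffixes and the function names are assumptions made for illustration; we do not claim that PAL_FINDER requires any particular naming scheme.

```python
def fastq_to_fasta(fastq_lines, suffix=""):
    """Yield FASTA lines (header, sequence) for each 4-line FASTQ record.

    Assumes unwrapped, well-formed FASTQ. The read name is the first
    whitespace-delimited token of the header; `suffix` is appended to it.
    """
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            yield ">" + line[1:].split()[0] + suffix
        elif i % 4 == 1:
            yield line
        # Lines 3 and 4 of each record ("+" and qualities) are dropped.


def combine_pairs(r1_path, r2_path, out_path):
    """Write forward and reverse reads into one FASTA file."""
    with open(out_path, "w") as out:
        for path, suffix in ((r1_path, "/1"), (r2_path, "/2")):
            with open(path) as fh:
                for fasta_line in fastq_to_fasta(fh, suffix):
                    out.write(fasta_line + "\n")
```

Because records are streamed one line at a time, memory use stays low even for multi-gigabyte input files.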
Description of software packages used in this study, including operating systems, important features, URL where software can be obtained, number of citations, authors, and brief comments describing the ease of use.
Software packages, the number of loci they find in an SSR search, and the number of loci they find per mega base pair sequence in each of the four test data sets for four sequencing platforms (MiSeq, HiSeq1, HiSeq2, PacBio).
The number and percentage of each repeat motif type found in the SSR search in each of the four test data sets for four sequencing platforms (MiSeq, HiSeq1, HiSeq2, PacBio).
QDD3 is composed of four separately running modules, with functions of quality trimming, microsatellite detection, redundancy removal, primer design, contamination checking, and comparison to known transposable elements (Meglécz et al., 2014). It can be used both from the command line and through Galaxy (Afgan et al., 2016) and works with RepeatMasker (Tarailo-Graovac and Chen, 2009) and a variety of other NGS tools. Its running time is relatively long (for a 5.7-GB data set, 9.5 h on a high-performance computer), and users cannot change default settings for SSR searches (e.g., specifying different numbers of repeats for different length motifs).
SSR Locator integrates functions of SSR search, frequency of occurrence of motifs, primer design, and PCR simulation against other databases, as well as global alignments and identity and homology searches (da Maia et al., 2008). It executes all the module calls using a GUI with a built-in menu system. However, it requires some file reformatting, which increases computing time. For the HiSeq2 data set, it took 10 min to reformat, and 69 min for the SSR search on a Windows platform (eight-core 3.4-GHz Intel Core i7-2600 CPU, 16 GB RAM).
SSR_pipeline is a command-line program for identifying microsatellites from high-throughput sequencing data using a Python environment (Miller et al., 2013). It detects SSRs in Illumina paired-end reads, with modules for quality filtering and alignment of Illumina raw data. SSR_pipeline can also analyze data from other sequencing platforms, such as 454 and Ion Torrent, by using the SSR detection module independently. However, after 24 h of effort by a biologist proficient in bioinformatics, we could not run test data through SSR_pipeline successfully.
STAMP is an updated package of STADEN (Kraemer et al., 2009) for microsatellite detection and primer design, with comprehensive integration of Phobos (Mayer, 2007) for tandem repeat detection and analysis. STAMP uses TROLL (Castelo et al., 2002) for tracing back primer pairs to sequence trace files, Primer3 for interactive design and visualization of primers, and SQLite as a database for storing analysis results. Overall, STAMP is a highly flexible, high-throughput, interactive tool for conventional and multiplex microsatellite marker design, avoiding the generation of redundant markers. However, it is complicated: it requires multiple Tool Command Language (Tcl) modules and preinstallation of the STADEN package, and it is not suited for low-coverage NGS data (Meglécz et al., 2014).
Recommendations for researchers—Based on our budget estimates, RAD/GBS and microsatellites are approximately equivalent in cost for genotyping 96 individuals, assuming that NGS data are already available for microsatellite development (Appendix 2). RAD/GBS will generate many more loci, but adding individuals is much more economical with SSRs. If microsatellites can be developed for free using existing public NGS data, this option is worth investigating because it can considerably reduce the cost (see companion paper [Hodel et al., 2016], which presents over five million SSR loci that can be used in thousands of plant species). As shown in Table 3, publicly available data sets not designed for microsatellite development can be mined to yield many SSR loci to test. Our review of sequencing platforms and the test data sets we used in our software comparison revealed that read length is not as important as expected. Table 4 indicates that while the longer read lengths associated with MiSeq (2 × 300 bp) certainly yield more loci than the shorter read lengths of HiSeq (e.g., 1 × 100 bp), plenty of loci (typically >100,000) are detected with shorter read lengths. 454 sequencing was once considered essential for microsatellite development, because Illumina reads were too short. Now, many types of Illumina sequencing can generate adequate sequence data for generating loci (Appendix S1). Unless a researcher is multiplexing many different species in a single run, we recommend using Illumina MiSeq for its cost efficiency. As shown in Table 4, the Illumina MiSeq generates ample loci relative to other platforms, and it is cheaper and more time-efficient compared to HiSeq, which requires users to fill all eight lanes before a sequencing run can commence. For the software portion of microsatellite development, we recommend using MISA or Phobos (either alone or as implemented in Geneious).
The future of SSRs—are they up to the task?—Microsatellites still have great applicability due to their high polymorphism, relatively easy scoring, testable neutrality, and Mendelian inheritance (Zane et al., 2002). The use of microsatellites will undoubtedly give way to newer technologies such as RAD/GBS as these approaches find wider application. However, microsatellite markers are valuable tools for several reasons. Many study designs simply do not require the high marker density provided by RAD/GBS and benefit more from the inclusion of large numbers of samples. Furthermore, there are thousands of studies that have employed microsatellite markers, and in many cases, the markers available provided too little information to fully address the authors' hypotheses. For such microsatellite legacy projects, using the same markers as existing data sets is preferred to avoid confounding factors. While microsatellites provide limited information per sample, if the inclusion of many individuals is a priority, microsatellites compare favorably with newer techniques. If transcriptomic data are used to identify microsatellites, it may be possible to perform more rigorous tests of selective neutrality in adjacent coding regions of potential loci. This could allow researchers to know whether they were selecting a locus that is part of (or linked to) a gene under directional selection rather than merely documenting any departures from Hardy–Weinberg equilibrium. Also, the high allelic variation of microsatellites compared to sequence-based markers is optimal for the identification of markers present in small subpopulations of interest (e.g., disease-resistant individuals; Miah et al., 2013). Finally, for projects with limited budgets (e.g., conservation genetic surveys), microsatellites will likely continue to be the most economical option for some time (Jennings et al., 2011). 
For all of these reasons, microsatellites remain a good choice for many systems and questions—with the proper justification and strong questions/hypotheses, they are still appropriate for use in proposals to the National Science Foundation and other funding sources.
The authors thank three anonymous reviewers and APPS associate editor Dr. Mitch Cruzan for many helpful comments on previous versions of this manuscript, and Mark Twain for lending us part of our title. This work was supported in part by a National Science Foundation Doctoral Dissertation Improvement Grant (DEB-1501600 to D.E.S. and R.G.J.H.).
Sample budgets for genotyping 96 individuals using microsatellites or RAD/GBS. All costs are expressed as 2016 US dollars.