The reconstruction of ancestral character states is a central type of analysis in the evaluation of the cytological, ecological, metabolic, or morphological evolution of organismic lineages (Ronquist, 2004; Table 1). Numerous software applications have been developed that facilitate such reconstructions (Joy et al., 2014). The software application Mesquite (Maddison and Maddison, 2000) is a key tool in ancestral character state reconstruction (ACSR) and has been used in all branches of biology. Although it was introduced more than 15 years ago, it remains the most popular such application in the plant sciences. A search in the database PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) for all botanical investigations published between October 2012 and October 2015 that contain the term “ancestral character state reconstruction,” or any word combination thereof, in the title, the key words, or the abstract recovered a total of 27 publications (Table 1). An analogous search in the database ISI Web of Science (http://webofknowledge.com/) recovered a total of 93 publications. Of the 27 publications recovered via PubMed, 21 (78%) employed Mesquite, four (15%) employed functions in the R package APE (Paradis et al., 2004), two (7%) employed the software application BayesTraits (Pagel and Meade, 2004), and two (7%) employed the application SIMMAP (Bollback, 2006). The software applications RASP (Yu et al., 2010) and WinClada (Nixon, 2002) were employed in one publication (4%) each. While some publications used more than one of these applications, no fewer than two-thirds conducted ACSR exclusively via Mesquite. Reconstructions that employed models specifically designed for biogeographic scenarios (e.g., Dillenberger and Kadereit, 2013; Salzman et al., 2015) or that evaluated character correlations (e.g., Soltis et al., 2013; Chartier et al., 2014) were not counted.
The plant sciences community would benefit from a tool that automates and streamlines the reconstruction of ancestral character states, particularly as conducted via Mesquite. Most software applications used in ACSR, except for those developed in the statistical scripting language R (R Core Team, 2015), feature designs that make the serialized execution of reconstructions challenging. Mesquite, for example, is primarily operated via a graphical user interface (GUI). Such a GUI, while comfortable for the occasional user, impedes software operation in an automated workflow. Few users would choose to conduct import, export, analysis, and visualization operations by hand for dozens, or even hundreds, of different data sets, even if their research questions warranted a replicated design. In addition, most software applications for the reconstruction of ancestral character states employ idiosyncratic input and output formats (e.g., SIMMAP) or lack the capability to visualize the results (e.g., BayesTraits). Consequently, it would be desirable for the plant sciences community to possess a tool that (a) standardizes the input to and output from such software applications, (b) automates the reconstruction process, and (c) visualizes the reconstruction results in publication-ready quality. The present publication introduces such a tool. A set of software scripts, collectively referred to as WARACS (Wrappers to Automate the Reconstruction of Ancestral Character States), was developed to provide basic command line control of, and standardized input/output operations for, ACSR conducted in the applications Mesquite and BayesTraits. In addition, the script set provides a wrapper for the phylogenetic tree editor TreeGraph2 (Stöver and Müller, 2010), which facilitates the automatic visualization of ACSR results.
Table 1. Details of recent botanical publications employing ACSR. This table displays the results of a literature search in PubMed for botanical investigations published between October 2012 and October 2015 that employ ACSR.
METHODS AND RESULTS
General design—The script set WARACS is designed to wrap and connect three established and well-maintained software applications—Mesquite, BayesTraits, and TreeGraph2—to automate and standardize the task of ACSR. Each script wraps around an individual executable and performs tasks that would otherwise need to be conducted step by step by a human operator, such as the formatting of information blocks in NEXUS files, the selection of character models and optimality criteria, or the parsing of raw reconstruction results. Together, the scripts provide a pipeline that connects different application executables. The visualization of reconstruction results is considered an integral part of the automated workflow. Specifically, WARACS connects the output handle of the reconstruction process to the input handle of the tree visualization software TreeGraph2 to enable a graphically consistent visualization of results across multiple analyses. The scripts are command line–based, enabling the user to conduct reconstructions iteratively over a series of different input files and parameters. Overall, the script set is consistent with the concept of glue code (Lapp et al., 2007) and, as such, is easily customizable (e.g., in case of changes to the input and output formats) and expandable (e.g., when novel applications are to be included in the pipeline). WARACS is controlled exclusively via command line parameters; to ensure their consistent use, the parameters have been standardized across the three applications. WARACS has been tested with several versions of the wrapped software applications, including Mesquite v.2.75, v.3.03, and v.3.04; BayesTraits v.2.0; and TreeGraph2 v.2.0, v.2.6, and v.2.7.
Input and output design—Several concepts govern the input and output specifications of the WARACS script set. (a) The wrappers follow an ad hoc compilation approach when formatting input data. Specifically, the scripts compile input files from individual components upon execution. Such a design avoids the need for idiosyncratic or noncompliant input formats, which tend to erode the interoperability of data standards (Vos et al., 2012). The application Mesquite, for example, appends its internal commands in an idiosyncratic information block to the main NEXUS file. Such customized NEXUS files are automatically saved to disk, even though they can rarely be imported by any other software application. The ad hoc compilation approach selected for WARACS can also circumvent shortcomings concerning input file names, such as the inability of certain versions of Mesquite to import files with underscores in their names via the command line (M. Gruenstaeudl, personal observation). (b) The wrappers were designed to generate output files whose names are informative about the underlying reconstruction process. By default, WARACS includes the name of the software application, the optimality criterion employed, and the character selected in the output filename. (c) The wrappers were designed to print short error reports to the screen upon code execution to inform the user of problems with input formats, the reconstruction process, or the employed applications. (d) The wrappers were designed to form a pipeline so that reconstructions can be saved automatically as publication-ready figures. Such a standardized visualization step is important because larger, better-annotated pie diagrams (e.g., Riser et al., 2013; Chartier et al., 2014; Salzman et al., 2015) and more information on the underlying phylogenetic trees (Table 1) would often be helpful in the presentation of reconstruction results.
Effective visualizations portray reconstructed character states both graphically and numerically (Wong, 2011; Krzywinski, 2013); WARACS saves reconstruction results in precisely that form in both vector and raster graphic formats.
Availability and compatibility—The script set WARACS was written in the interpreted language Python (Python Software Foundation, 2012) and is consequently platform independent. It is compatible with Python v.2.7 (https://www.python.org/download/releases/2.7/) as well as Python v.3.5 (https://www.python.org/downloads/release/python-350/). WARACS is available under a BSD open source license from a code-sharing repository at GitHub (https://github.com/michaelgruenstaeudl/WARACS). Installation and usage instructions (file “README.md”), as well as several example files (folder “examples”), are provided alongside the wrapper scripts. To use WARACS, a user must have a Python interpreter as well as the Python packages DendroPy (https://pypi.python.org/pypi/DendroPy), NumPy (http://www.numpy.org), and six (https://pypi.python.org/pypi/six) installed. Software applications that are wrapped by WARACS must also be installed; current links to their respective installation websites are provided in the usage instructions as well as via the command line parameter “-h”. WARACS has been tested on Ubuntu 14.04, ArchLinux 4.2.3, and Mac OS X 10.8.5.
Usage—Use of the script set WARACS is driven exclusively via the specification of command line parameters. At a minimum, reconstructions of ancestral character states require five items of input: a distribution of character states for the organisms under study (hereafter “character state distribution”), one or more phylogenetic trees of the organisms under study on which the character state distribution is optimized (hereafter “tree distribution”), one phylogenetic tree on which the reconstruction results are plotted (hereafter “plotting tree”), an optimality criterion for the reconstruction, and a model of character evolution. Most applications for ACSR assign a default model of character evolution to the character state distribution under study; users of WARACS can select a different model by modifying the compiled input file. Multiple character state distributions can be supplied to WARACS, and users must select which of them to use in a particular reconstruction. In addition, users must specify the location of the application executable on the system. Hence, the WARACS wrappers for ACSR operate with six mandatory command line parameters (Fig. 1): (i) the character state distribution (command line parameter “-c”), specified as a file path to a comma-delimited table; (ii) the tree distribution (command line parameter “-t”), specified as a file path to a text file in NEXUS format (Maddison et al., 1997) containing one or more phylogenetic trees; (iii) the plotting tree (command line parameter “-p”), specified as a file path to a text file in NEXUS format containing a single phylogenetic tree; (iv) the optimality criterion (command line parameter “-o”), specified as a command line parameter string; (v) a specification of the character state distribution used in the reconstruction (command line parameter “-n”), specified as a command line parameter integer; and (vi) a file path to the application executable (command line parameter “-s”). 
For example, to conduct a reconstruction of character state distribution 2 on a distribution of phylogenetic trees under the maximum likelihood optimality criterion using a single-rate model (Mk1; Lewis, 2001) via Mesquite, a user on a Linux operating system would enter the following command into their command line shell of choice as a single, uninterrupted line:
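As an illustration only, a call of this kind can be assembled in Python as an argument list built from the six mandatory parameters named above. The script name, the file names, the executable path, and the string passed to “-o” are hypothetical placeholders and are not taken from the WARACS documentation; the README should be consulted for the actual values.

```python
import shlex


def build_reconstruction_command(wrapper_script, char_table, tree_dist,
                                 plotting_tree, criterion, char_number,
                                 executable):
    """Assemble the argument list for one reconstruction call.

    Only the six mandatory flags described in the text are used.
    """
    return [
        "python", wrapper_script,
        "-c", char_table,        # character state distribution (CSV table)
        "-t", tree_dist,         # tree distribution (NEXUS file)
        "-p", plotting_tree,     # plotting tree (NEXUS file)
        "-o", criterion,         # optimality criterion (string)
        "-n", str(char_number),  # which character state distribution to use
        "-s", executable,        # path to the wrapped application
    ]


# All values below are hypothetical placeholders.
command = build_reconstruction_command(
    wrapper_script="reconstruction_wrapper.py",
    char_table="character_states.csv",
    tree_dist="tree_distribution.nex",
    plotting_tree="plotting_tree.nex",
    criterion="likelihood",
    char_number=2,
    executable="/path/to/mesquite.sh",
)

# Print the call as a single, shell-quoted line.
print(" ".join(shlex.quote(part) for part in command))
```

Keeping the call as a list rather than a single string means it can be handed to `subprocess.call` unchanged, without a shell having to split paths that contain spaces.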
Upon execution, the wrapper script compiles a comprehensive input file in a modified NEXUS format, passes it to Mesquite, receives raw reconstruction results from Mesquite, parses these results, and saves up to four output files to the user's working directory (Fig. 1). These output files are: (i) a comma-delimited table containing the parsed reconstruction results (saved with the file ending “.csv”), (ii) the plotting tree in NEWICK format (Maddison et al., 1997) (saved with the file ending “.tre”), (iii) the raw reconstruction results generated by Mesquite (saved with the file ending “.txt”), and (iv) the compiled input file (saved with the file ending “.tmp”). Files (iii) and (iv) are optional and only generated when command line parameter “-k” is invoked. The parsed results table consists of two columns: column 1 specifies the node numbers on the plotting tree, column 2 the corresponding reconstructed character states. A set of example files that illustrate the input and output of ACSR with WARACS is cosupplied with the scripts (folders “examples/example_Mesquite” and “examples/example_BayesTraits”). Currently, WARACS can facilitate reconstructions under discrete character states, which is the predominant type of character encoding in current botanical investigations (Table 1). Polymorphic or missing states can be included in analyses facilitated by WARACS, but their precise effect on the reconstructions is governed by the default settings of the wrapped software applications.
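The two-column results table described above lends itself to simple downstream processing. The following minimal sketch reads such a table into a dictionary keyed by node number. It assumes that node numbers are integers and treats any row with a non-numeric first cell as a header; the internal layout of column 2 is not specified here, so the values are kept as unparsed strings, and the example rows are invented.

```python
import csv
import io


def read_reconstruction_table(handle):
    """Map each node number (column 1) to its reconstructed states (column 2)."""
    results = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        node, states = row[0].strip(), row[1].strip()
        if not node.isdigit():
            continue  # assumption: a non-numeric first cell marks a header row
        results[int(node)] = states  # column 2 is kept as an unparsed string
    return results


# Invented example content; a real table would be opened from the ".csv" file.
example = io.StringIO("node,states\n1,A\n2,B\n\n3,A\n")
parsed = read_reconstruction_table(example)
for node in sorted(parsed):
    print(node, parsed[node])
```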
The visualization of an ACSR via TreeGraph2 requires a minimum of three items of information (Fig. 2): (i) a comma-delimited table containing the results of an ACSR (command line parameter “-r”), (ii) a plotting tree in NEWICK format (command line parameter “-p”), and (iii) a file path to the application executable (command line parameter “-s”). Moreover, a user can specify a comma-delimited table of color specifications (hereafter “color dictionary”; command line parameter “-d”) to link the states of the character state distribution to specific colors. Specifically, this table tells the visualization engine which colors to use for the pie diagram slices that represent the reconstruction results. A color dictionary consists of two columns: column 1 specifies the character states, column 2 the corresponding colors in hexadecimal format. In the absence of a user-defined color dictionary, WARACS employs a default color palette as specified by http://colorbrewer2.org for qualitative characters. In addition to a color dictionary, a user can specify the character state distribution used during the reconstruction process to plot the character states of the terminal taxa (command line parameters “-c” and “-n”). Hence, to visualize the results of the ACSR listed above as well as the character states of the terminal taxa under a custom color dictionary, a user on a Linux operating system would enter the following command into their command line shell of choice as a single, uninterrupted line:
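As an illustration only, the following sketch reads a two-column color dictionary, falls back to a qualitative palette for states without a user-defined color, and assembles a visualization call from the parameters named above. The fallback colors are the first five of ColorBrewer's qualitative “Set1” scheme, chosen here as an assumption because the scheme used by WARACS is not stated; the script name, file names, and executable path are likewise hypothetical placeholders.

```python
import csv
import io
import re

# Assumption: ColorBrewer "Set1" stands in for the unspecified default palette.
FALLBACK_PALETTE = ["#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00"]
HEX_COLOR = re.compile(r"^#?[0-9A-Fa-f]{6}$")


def read_color_dictionary(handle):
    """Map character states (column 1) to hexadecimal colors (column 2)."""
    colors = {}
    for row in csv.reader(handle):
        if len(row) < 2 or not HEX_COLOR.match(row[1].strip()):
            continue  # skip blank lines, header rows, and malformed colors
        colors[row[0].strip()] = row[1].strip()
    return colors


def colors_for_states(states, color_dictionary=None):
    """Assign each state its user-defined color, else a palette color."""
    color_dictionary = color_dictionary or {}
    assigned = {}
    for index, state in enumerate(sorted(states)):
        fallback = FALLBACK_PALETTE[index % len(FALLBACK_PALETTE)]
        assigned[state] = color_dictionary.get(state, fallback)
    return assigned


def build_visualization_command(wrapper_script, results_table, plotting_tree,
                                executable, color_table=None,
                                char_table=None, char_number=None):
    """Assemble the argument list for one visualization call."""
    command = ["python", wrapper_script,
               "-r", results_table, "-p", plotting_tree, "-s", executable]
    if color_table is not None:
        command += ["-d", color_table]
    if char_table is not None and char_number is not None:
        command += ["-c", char_table, "-n", str(char_number)]
    return command


# Invented example content and hypothetical file names.
user_colors = read_color_dictionary(io.StringIO("state,color\n0,#1B9E77\n"))
print(colors_for_states(["0", "1"], user_colors))
print(build_visualization_command(
    "visualization_wrapper.py", "results.csv", "plotting_tree.tre",
    "/path/to/TreeGraph.jar", color_table="colors.csv",
    char_table="character_states.csv", char_number=2))
```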
Upon execution, the wrapper script compiles a comprehensive input file in XML format and passes it to TreeGraph2, which generates a figure in two different image formats (in the vector graphic format “svg” and in the raster graphic format “png”) in the user's working directory (Fig. 2). These figures represent the end product of the visualization process. If command line parameter “-k” is invoked, the compiled input file is saved to the working directory (with the file ending “.xtg”) to allow for a renewed visualization upon manual modification. A set of example files that illustrate the input and output of the visualization process with WARACS is cosupplied with the scripts (folder “examples/example_TreeGraph2”).
Despite the versatility of functions for ACSR in the statistical scripting language R, the plant sciences community holds a strong preference for the software application Mesquite for reconstructing ancestral character states (Table 1). While Mesquite contains functionality to reproduce analyses via customized command scripts in an idiosyncratic scripting language, it lacks genuine command line support. Conducting serialized reconstructions of ancestral character states (e.g., iterating over a series of tree distributions or optimality criteria) is consequently challenging under Mesquite. Serialized reconstructions and subsequent result visualization could theoretically be achieved by combining several existing R packages, but this requires the knowledge to write customized R scripts and concatenate different functions into a pipeline. The wrapper scripts presented here constitute a fast and simple-to-use alternative to customized R scripts. Moreover, they support the community effort to make analyses more reproducible. If authors specify the command line parameters they selected alongside the original input data, others can reproduce their analyses and explore alternatives. In addition, the provision of interoperability between widely used software applications is an effective strategy to expand their functionality (e.g., Maddison and Maddison, 2014), because the wrapped applications have likely undergone extensive testing and follow good coding practices (Leprevost et al., 2014).
The script set WARACS enables several types of analyses that are not, or only partially, available through the application of Mesquite alone. WARACS was designed to provide basic command line control of the software applications Mesquite, BayesTraits, and TreeGraph2 for the purpose of automating and streamlining the reconstruction and visualization of ancestral character states. By using these wrappers, several types of analyses become available without the need to write intricate analysis scripts. First, researchers can automatically iterate over multiple trees or tree distributions to compare the reconstruction results. This option could be desirable for investigations anticipating a different placement of taxa across different gene tree distributions (e.g., Folk and Freudenstein, 2014) or those attempting to accommodate phylogenetic uncertainty among the target organisms (e.g., Dillenberger and Kadereit, 2013). Second, researchers can automatically iterate over multiple optimality criteria and reconstruction algorithms to avoid bias caused by individual algorithm implementations. Ricklefs (2007), for example, pointed out that it is currently unclear if random-walk models such as those implemented in stochastic character mapping truly reflect natural processes. Instead of relying on any one reconstruction method, scientists should compare the results of an ACSR under different optimality criteria (e.g., Ekman et al., 2008; Soltis et al., 2013). Third, the application of WARACS simplifies the visualization of reconstruction results on phylogenetic trees that are different from those used in the reconstruction process. For example, researchers who wish to infer ancestral character states over a posterior tree distribution, but visualize their results on the best phylogenetic tree inferred under maximum likelihood, can feed such independent input files to WARACS as long as they share the same taxon set.
Likewise, WARACS simplifies the process to combine the results of different reconstructions on a single tree (e.g., Larridon et al., 2015). For example, researchers who wish to plot ancestral character states inferred under different data partitions onto a particular consensus tree (e.g., de Villiers et al., 2013) can visualize their results jointly through concatenation of either the compiled TreeGraph2 input files or the vector graphics. Without this functionality, researchers are forced to present near-identical figures that differ only in the reconstructed character states (e.g., Schaefer et al., 2012). In summary, the script set WARACS provides an easily accessible interface to popular software applications for ACSR and makes several intricate types of character state reconstruction available to the average user.
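The first two use cases above, iterating over several tree distributions and several optimality criteria, amount to a nested loop over command line calls. The following minimal sketch generates one call per combination. The flags are those described in the Usage section, whereas the script name, file names, executable path, and criterion strings are hypothetical placeholders.

```python
import itertools


def iterate_reconstruction_commands(wrapper_script, executable, char_table,
                                    char_number, plotting_tree,
                                    tree_files, criteria):
    """Yield one argument list per (tree distribution, criterion) pair."""
    for tree_file, criterion in itertools.product(tree_files, criteria):
        yield [
            "python", wrapper_script,
            "-c", char_table,
            "-t", tree_file,
            "-p", plotting_tree,
            "-o", criterion,
            "-n", str(char_number),
            "-s", executable,
        ]


# All values below are hypothetical placeholders.
commands = list(iterate_reconstruction_commands(
    wrapper_script="reconstruction_wrapper.py",
    executable="/path/to/mesquite.sh",
    char_table="character_states.csv",
    char_number=2,
    plotting_tree="plotting_tree.nex",
    tree_files=["gene1_trees.nex", "gene2_trees.nex"],
    criteria=["likelihood", "parsimony"],
))

for command in commands:
    print(" ".join(command))  # each list could instead be passed to subprocess.call
```

Recording the generated calls alongside the input data would also give readers the exact parameters needed to repeat every reconstruction in the series.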
 The author would like to thank Ludo Muller (Freie Universität Berlin), Bryan Carstens (Ohio State University), and two anonymous reviewers for valuable feedback on earlier versions of this manuscript. The author also thanks Teofil Nakov (University of Arkansas) and Felix Heeger (Freie Universität Berlin) for testing the wrapper scripts. This study was carried out as part of the project “Developing tools for conserving the plant diversity of the Transcaucasus” funded by VolkswagenStiftung (grant AZ85021).