Arthropods from class Arachnida constitute a large and diverse group with over 100,000 described species, and they are sources of many proteins that have a direct impact on human health. Despite the importance of Arachnida, few proteins originating from these organisms have been characterized in terms of their structure. Here we present a detailed analysis of Arachnida proteins whose experimental structures have been determined and deposited in the Protein Data Bank (PDB). Our results indicate that proteins represented in the PDB are derived from a small number of Arachnida families, and two-thirds of Arachnida proteins with experimentally determined structures are derived from organisms belonging to the Buthidae, Ixodidae, and Theraphosidae families. Moreover, 90% of the deposits come from just a dozen Arachnida families, and almost half of the deposits represent proteins originating from only fifteen different species. In summary, our analysis shows that the structural analysis of proteins originating from Arachnida is limited to a small number of source species, and that proteins from this group of animals are not extensively studied. However, the interest in Arachnida proteins seems to be increasing, which is reflected by a significant increase in the related PDB deposits during the last ten years.
Arthropods from class Arachnida constitute a large and very diverse group with over 100,000 described species (Chapman 2009). They include a large number of species that are important from a perspective of human health and economy. Arachnida are also sources of many proteins that have a direct impact on human health. For example, mites are an important group of Arachnida that is relevant to human health, as they not only can transmit some diseases (Vogel et al. 2014), but can also be significant sources of allergens (Tehri & Gulati 2015). Over 10% of all registered allergens originate from eleven mite species, with Dermatophagoides farinae and Dermatophagoides pteronyssinus contributing to 69 allergens (Vogel et al. 2014; Azmiera et al. 2020).
Scorpions (order Scorpiones) are extensively studied due to their ability to produce toxins. There are over fifteen hundred scorpion species with a broad distribution throughout the world (Garcia et al. 2013; Chippaux & Goyffon 2008). Scorpion envenoming represents a public health problem, mostly in tropical countries. Yearly, more than one million cases of scorpion envenomation are recorded with a mortality risk of about 3% worldwide (Chapman 2009).
Spiders are another group of Arachnids that are studied not only for their toxins, but also for their fibroins, which are considered to be very promising biodegradable materials (Vidya & Rajagopal 2021). Therefore, a significant effort is made to study this group of proteins.
Lastly, ticks are also recognized as medical and economic burdens because of their ability to transfer diseases to humans and animals. Therefore, ticks are considered to be one of the biggest problems in public and veterinary health (Barker & Murrell 2004). These ectoparasites can affect the production of the animals and their health, either by transmitting viruses, bacteria, rickettsiae, and protozoa, or through the effects of their bites. During the evolution of their saliva, ticks gained an enormous diversity of bioactive compounds such as anticoagulants, chemokine-binding proteins, platelet aggregation inhibitors, and complement system inhibitors (Rajput et al. 2006; Chapman 2009). Many enzyme inhibitors, such as serine protease inhibitors (SPIs), have been described. SPIs are important in several tick biological processes and have been shown to be directly involved in the regulation of inflammation, wound healing, vasoconstriction, blood clotting, and the modulation of host defense mechanisms (Blisnick et al. 2017; Denisov et al. 2021).
Despite the importance of Arachnida, very few proteins originating from these organisms have been characterized structurally. Therefore, we performed a detailed analysis of Arachnida proteins whose structures have been determined experimentally and deposited in the Protein Data Bank (PDB) (wwPDB consortium 2019). The PDB not only provides information on three-dimensional structures of proteins, but also annotates all deposits, which allows for sophisticated analysis of the entries. For example, it is possible to analyze deposits taking into account protein sources, year of deposition, or experimental methods used for structural studies. Our analysis provides a summary of Arachnida families and species that are sources of the most extensively studied proteins.
Materials and Methods
Selection of proteins originating from Arachnida and construction of the dataset used for analysis
The PDB deposits corresponding to proteins originating from Arachnida were selected using the “Advanced Search”, “Polymer Molecular Features”, and “Source Organism Taxonomy Name (Full Lineage)” search options from the database interface. The PDB deposits selected for this analysis were chosen using the search query “Arachnida”. This search in June 2021 resulted in 492 available PDB deposits. The PDB codes corresponding to the entries of interest were saved together with various information on the experimental procedures used to determine the protein structures. Moreover, we have collected information on the year of deposition, protein name, and protein family and function, as well as taxonomic information on the source organism. If the required information was not stored by the PDB, we have derived it from UniProt (UniProt consortium 2021) and/or relevant manuscripts associated with the deposit. All derived data were summarized in Supplementary Table S1 (saa.28.2.12p298–308 Supplementary material.xlsx).
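The interactive search described above can, in principle, also be expressed programmatically. The following is a minimal sketch only: the endpoint URL, the attribute name `rcsb_entity_source_organism.taxonomy_lineage.name`, and the request options are assumptions about the public RCSB Search API and are not taken from the text, and the helper `build_lineage_query` is a hypothetical name. The sketch builds the query payload without sending it; a search run today would be expected to return more entries than the 492 found in June 2021.

```python
import json

# Assumed endpoint of the public RCSB PDB Search API; not stated in the text.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def build_lineage_query(taxon: str) -> dict:
    """Build a payload selecting entries whose source-organism lineage contains `taxon`."""
    return {
        "query": {
            "type": "terminal",
            "service": "text",
            "parameters": {
                # Assumed attribute behind the
                # "Source Organism Taxonomy Name (Full Lineage)" search option.
                "attribute": "rcsb_entity_source_organism.taxonomy_lineage.name",
                "operator": "exact_match",
                "value": taxon,
            },
        },
        # Return PDB entry identifiers rather than individual polymer entities.
        "return_type": "entry",
        # Assumed option asking for the full hit list instead of one page.
        "request_options": {"return_all_hits": True},
    }


payload = build_lineage_query("Arachnida")
# The serialized payload would be POSTed to SEARCH_URL by an HTTP client.
print(json.dumps(payload, indent=2))
```

Keeping the payload construction separate from any network call makes the query easy to inspect and to archive alongside the resulting list of PDB codes.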
Overview of currently available experimental structures
As of June 2021, the PDB contained 492 entries of proteins originating from Arachnida (Fig. 1A), which corresponded to less than 0.5% of all deposited structures. Among the 492 entries, 44 structures represent proteins from mites, 192 from scorpions, 164 from spiders, and 92 from ticks (Fig. 1B). Almost two-thirds of the deposits correspond to proteins originating from just three families (Fig. 2A), namely Buthidae, Ixodidae, and Theraphosidae. At the same time, over 90% of the deposits can be attributed to twelve Arachnida families (Fig. 2A). Approximately half of the deposits (47%) represent only fifteen different species, and the top ten species are responsible for 180 of the 492 deposits (Fig. 2B).
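As a quick consistency check on the counts in this paragraph, the short sketch below sums the four group totals and converts them to percentages. All numbers are taken from the text; the variable names are illustrative only.

```python
# Deposit counts per Arachnida group, as reported in the text (June 2021).
group_counts = {"mites": 44, "scorpions": 192, "spiders": 164, "ticks": 92}

total = sum(group_counts.values())

# Share of each group, in percent of all Arachnida deposits.
group_shares = {g: round(100 * n / total, 1) for g, n in group_counts.items()}

# Share of the ten most represented species (180 deposits, from the text).
top_ten_share = round(100 * 180 / total, 1)

print(total)
print(group_shares)
print(top_ten_share)
```

The four groups account for all 492 entries, with scorpions and spiders together contributing a little over 70% of the deposits.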
The structural models in these PDB entries are derived from experimental data obtained with three major methods: X-ray crystallography, nuclear magnetic resonance spectroscopy (NMR), and cryo-electron microscopy (cryo-EM). NMR was used to determine most of the deposited structures (286), while X-ray crystallography and cryo-EM were used to determine 192 and 14 structures, respectively (Fig. 1C). Across all PDB deposits, only 7% of structures were determined using NMR and 87% were determined using X-ray diffraction (Schiro et al. 2020). This difference in the preferred experimental approach between the whole PDB and entries originating from Arachnida suggests that many of these proteins are relatively small and difficult to crystallize.
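The contrast with the whole PDB can be made explicit with a few lines of arithmetic. The sketch below uses only the counts and percentages quoted above; the variable names and the "enrichment" ratio are illustrative choices, not part of the original analysis.

```python
# Arachnida deposits per experimental method, as reported in the text.
method_counts = {"NMR": 286, "X-ray": 192, "cryo-EM": 14}
total = sum(method_counts.values())

# Percentage of Arachnida deposits determined by each method.
arachnida_share = {m: 100 * n / total for m, n in method_counts.items()}

# Percentages quoted in the text for the whole PDB (Schiro et al. 2020).
whole_pdb_share = {"NMR": 7.0, "X-ray": 87.0}

# Ratio of the Arachnida share to the whole-PDB share for each method.
enrichment = {m: arachnida_share[m] / whole_pdb_share[m] for m in whole_pdb_share}

for method, ratio in enrichment.items():
    print(f"{method}: {arachnida_share[method]:.1f}% vs "
          f"{whole_pdb_share[method]:.0f}% ({ratio:.2f}x)")
```

Under these numbers NMR is used roughly eight times more often for Arachnida proteins than across the PDB as a whole, while X-ray crystallography is used less than half as often.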
There are only 44 structures deposited to the PDB that represent proteins originating from mites. The majority of these structures correspond to proteins from mites belonging to Sarcoptiformes (Fig. 3A). The remaining structures represent proteins originating from two orders, Trombidiformes and Mesostigmata. In fact, all of the proteins with experimental structures from Trombidiformes originate from Tetranychus urticae, while the two Mesostigmata protein structures originate from Varroa destructor (Fig. 3B). These two mite species are important from an agricultural point of view, as T. urticae is a major plant pest and V. destructor is a mite parasite of bees. Sarcoptiformes proteins with determined structures originate from house dust mites (Dermatophagoides farinae and Dermatophagoides pteronyssinus), a storage mite (Blomia tropicalis), and the itch mite (Sarcoptes scabiei) (Fig. 3B). The listed Sarcoptiformes mites are detrimental to human health, as they are sources of allergens or are parasites (S. scabiei).
The majority of mite proteins that have their structures determined (Table S1) are classified as allergens belonging to Group 1 (cysteine proteases) and Group 2 (Niemann-Pick proteins type C2 (NPC2)) (Derewenda et al. 2002; Ichikawa et al. 2005; Chruszcz et al. 2009). However, there has recently been a notable increase in the number of structures of proteins that originate from agricultural pests and that are either responsible for xenobiotic detoxification (Schlachter et al. 2017; Schlachter et al. 2019; Daneshian et al. 2022) or are odorant-binding proteins (OBPs) (Amigues et al. 2021). Both NPC2s and OBPs may participate in chemical communication of arthropods.
There are a total of 192 protein structures belonging to the Scorpiones order deposited to the PDB between 1993 and 2021, most of which were characterized by NMR. Within the Scorpiones order, 182 structures correspond to proteins originating from Buthidae, while the other structures represent proteins stemming from two families, Scorpionidae (nine structures) and Hormuridae (one structure). Most of the structures originate from the genus Mesobuthus, while the remaining structures originate from Centruroides (31 structures), Leiurus (22), Androctonus (18), Tityus (16), and other genera (Fig. 4B).
Over 90% of the structures of proteins from scorpions deposited to the PDB (Table S1) are described as inhibitors of ion channels, with most (77%) being designated as sodium or potassium channel inhibitors (Norton & Chandy 2017).
There are 164 protein structures deposited to the PDB originating from spiders. Among these, the majority belong to the family Theraphosidae (Fig. 5A). The rest of the structures correspond to proteins from families such as Araneidae (18 structures), Agelenidae (13), Pisauridae (10), Atracidae (10), Sicariidae (9), and Lycosidae (5). In terms of protein category, structures of toxins produced by spiders form the largest group (130 deposits).
As in the case of the PDB deposits of proteins originating from scorpions, the majority of proteins originating from spiders that have their structures determined (Table S1) are classified as neurotoxins and ion channel inhibitors.
There are 92 PDB deposits representing proteins originating from ticks. Of the three families in the Ixodida (Metastigmata) suborder of Acari, the majority of the deposits (72 structures) correspond to the Ixodidae (hard ticks) family, and the remainder (20) belong to the Argasidae (soft ticks) family (Fig. 6A). No PDB deposits of proteins belonging to the Nuttalliellidae family are reported.
Within the Ixodidae family, most protein structures come from Ixodes scapularis (black-legged tick or deer tick), followed by Ixodes ricinus (castor bean tick). Both I. scapularis and I. ricinus are vectors of Lyme disease, which affects an estimated 476,000 people per year in the USA (Barker & Murrell 2004; Schwartz et al. 2021; Kugeler et al. 2021) (Fig. 6). The remaining structures from the Ixodidae family belong to Rhipicephalus appendiculatus (14 structures), R. bursa (9), R. sanguineus (9), R. microplus (3), R. pulchellus (2), Amblyomma variegatum (2), Haemaphysalis longicornis (2), Dermacentor andersoni (1), and A. maculatum (1). Within the Argasidae family, most structures belong to Ornithodoros moubata (13), followed by Argas monolakensis (4) and Argas reflexus (2) (Fig. 6B).
Tick proteins that have their structures determined (Table S1) are more diverse in terms of their function than the previously mentioned mite, scorpion, or spider proteins. A significant fraction of the tick proteins can be classified as lipocalins, protease inhibitors (e.g. cystatins and serpins), or proteases, as well as proteins like evasins, which play a role in suppression of the host inflammatory response (Bhusal et al. 2020).
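The species-level counts above can be cross-checked against the family totals of 72 (Ixodidae) and 20 (Argasidae) reported for ticks. The sketch below is a simple tally under the assumption that the listed counts are complete for the named species; the variable names are illustrative, and the remainders it reports are inferred rather than stated in the text.

```python
# Family totals reported for tick deposits.
ixodidae_total = 72
argasidae_total = 20

# Ixodidae species with explicit counts in the text (Ixodes spp. are not counted there).
listed_ixodidae = {
    "Rhipicephalus appendiculatus": 14,
    "Rhipicephalus bursa": 9,
    "Rhipicephalus sanguineus": 9,
    "Rhipicephalus microplus": 3,
    "Rhipicephalus pulchellus": 2,
    "Amblyomma variegatum": 2,
    "Haemaphysalis longicornis": 2,
    "Dermacentor andersoni": 1,
    "Amblyomma maculatum": 1,
}

# Argasidae species with explicit counts in the text.
listed_argasidae = {
    "Ornithodoros moubata": 13,
    "Argas monolakensis": 4,
    "Argas reflexus": 2,
}

# Deposits left over after subtracting the listed species from each family total.
ixodes_remainder = ixodidae_total - sum(listed_ixodidae.values())
argasidae_remainder = argasidae_total - sum(listed_argasidae.values())

print(ixodes_remainder)
print(argasidae_remainder)
```

Under this tally, 29 Ixodidae deposits remain for the Ixodes species, whose individual counts are not given in the text, and one Argasidae deposit is not attributed to any of the three named species.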
Our analysis shows that proteins originating from Arachnida are not extensively studied in terms of their three-dimensional structures. However, in the last decade, the proteins derived from this group have been more actively studied, with over 50% of all deposits being submitted to the PDB in the 2011–2021 period (Fig. 1A). Interestingly, 58% of all deposits originating from Arachnida correspond to protein structures that were determined using NMR (Fig. 1C), which is significantly more than for the whole PDB, where only 7% of protein structures are determined using this experimental technique. This extensive use of NMR can be explained by the fact that these proteins are relatively small (generally less than 10 kDa) and have a high degree of conformational flexibility, which hinders crystallization. Another striking observation is the small number of Arachnida families from which proteins represented in the PDB are derived. As mentioned, two-thirds of Arachnida proteins with experimental structures originated from the Buthidae, Ixodidae, and Theraphosidae families (Fig. 2A). Moreover, 90% of the deposits come from just a dozen Arachnida families, and almost half of the deposits represent proteins originating from only fifteen different species (Fig. 2B). The small number of species from which the PDB deposits originate contrasts with the size of class Arachnida, with approximately 100,000 species described (Laustsen et al. 2016). This indicates that structural studies are at a very early stage and are extremely limited.
Although only 44 structures of mite proteins are deposited in the PDB, these Arachnida have been attracting more attention in recent years for several reasons. First of all, mites are sources of very potent allergens that originate mainly from house dust mites (D. farinae and D. pteronyssinus) and storage mites (e.g. B. tropicalis) (Fig. 3B) (Thomas 2015). The experimental models of mite proteins deposited to the PDB are dominated by representatives of two protein families: there are multiple deposits of Group 1 (cysteine proteases) and Group 2 (Niemann-Pick C2 protein family) house dust mite allergens, which together constitute almost half of all mite-related PDB entries. Interestingly, several structures of proteins from Group 1 and Group 2 of house dust mite allergens correspond to complexes of the allergens with monoclonal antibodies (Chruszcz et al. 2012; Glesner et al. 2017; Glesner et al. 2019; Osinski et al. 2015). These structures provide insights into the antigenic structure of the allergens and are used to develop new approaches to immunotherapy. Together, the PDB entries associated with mite allergens represent approximately three-quarters of all mite protein structures. Secondly, some of the mites (e.g. spider mites like T. urticae) are of interest from an agricultural point of view, as they show one of the highest incidences of pesticide resistance (Van Leeuwen et al. 2015). Moreover, T. urticae is an attractive experimental model system (Grbic et al. 2007). This is illustrated by the increase in PDB deposits for proteins originating from T. urticae (Schlachter et al. 2017; Schlachter et al. 2019; Daneshian et al. 2022). These deposits provide structural insights into enzymes that are used by spider mites to detoxify various plant metabolites. In this case, the structural studies aim to provide information on the physiological functions of the enzymes, to be used toward the development of new generations of acaricides.
Scorpions are of medical relevance because they are the main group of Arachnida whose venom may cause harm to humans. Currently, it is clear that the toxins (ion channel inhibitors) derived from Buthidae species are of the greatest interest from the perspective of structural studies, as they correspond to 37% of all Arachnida-associated deposits. In this case, the biomedical importance of the venoms is driving the researchers' interest in these peptides and proteins. The toxic effects of venoms are mainly derived from the inhibitory actions of neurotoxins, primarily those targeting sodium and potassium channels. Proteins with sodium or potassium channel inhibitor activity account for more protein structures than proteins with enzyme inhibitor activity, short antimicrobial peptide activity, or defense response activity. In fact, almost 77% of all scorpion-related PDB deposits are derived from proteins from potassium and sodium channel inhibitor families. This stems from the interest in using toxins for drug applications and in discovering crystal structures of receptor sites for toxins (Laustsen et al. 2016). These structural data are an excellent foundation for developing new antivenoms. In addition, these entries provide information on the structures of proteins in scorpion venoms that are important in the medical field (Laustsen et al. 2016; Abdel-Rahman et al. 2015; Lewis & Garcia 2003). For instance, chlorotoxins (PDB codes: 1CHL, 5L1C, and 6ATW) are of special importance due to their selective attachment to glioma cells, serving as a potential therapeutic option for brain gliomas (Cordeiro et al. 2015; Barker & Murrell 2004; Fletcher et al. 1997). Also, the cystine-dense peptide (PDB code: 6AY7) could be applied to concentrate arthritis drugs in joints after conjugating it with a steroid (Cook Sangar et al. 2020).
The experimental models of scorpion proteins deposited into the PDB represent the largest group among Arachnida.
Spiders are the second most important source of proteins that have their structures deposited in the PDB. Here proteins originating from Theraphosidae dominate; however, among the twelve most represented Arachnida families, there are five additional spider families (Araneidae, Agelenidae, Pisauridae, Atracidae, and Sicariidae). This shows that researchers are interested in many spider families, and that this interest is driven by studies of venoms and spidroins. As with scorpions, scientific interest in spider venom originates from the ability of venom proteins to modulate ion channels and from their therapeutic potential (Chapman 2009; Cardoso et al. 2015). Spider venoms are composed of three main classes of components: inorganic and organic components, low-molecular-weight polypeptides, and high-molecular-weight proteins. Among these classes, polypeptides with a molecular weight of 3–8 kDa and neurotoxic properties are of particular interest to the scientific community (Escoubas et al. 2000). These toxins belong to the group of small, disordered proteins that are difficult to crystallize. Therefore, it is not a surprise that NMR stands out as the major technique used to solve the structures of proteins originating from spiders. Furthermore, as detailed above, toxins are of special interest due to their ability to modulate ion channels. For this purpose, cryo-EM is the preferred technique to determine the structure of the complex between the toxin and the ion channel. In fact, seven out of the eight toxin-ion channel complexes in the PDB were solved using this novel experimental method. Spider-derived proteins are of interest due to their remarkable properties and potential applications, such as insecticidal (Fletcher et al. 1997) and antibacterial (Benli & Yigit 2008) uses, drug delivery, and tissue regeneration, among many other biomedical applications (Bakhshandeh et al. 2021).
While the first spider protein structure deposited to the PDB, titled "NMR solution of toxin omega-Aga-IVB" (PDB codes: 1OMA and 1OMB), dates from 1993 (Yu et al. 1993), it was not until 2008 that the first structure of a protein unrelated to spider toxins was deposited. This deposit showed the repetitive domain of the egg case silk (Chapman 2009; Lin et al. 2009). Spidroin-related deposits (27 deposits) represent the second-most abundant group of spider proteins in the PDB. The remaining spider-related PDB deposits correspond to a miscellaneous group of polypeptides, such as arginine kinase and insulin-like growth factor-binding domain protein.
Ticks are the final group described here that contributes a significant fraction of the PDB deposits from Arachnida. In this group, proteins derived from Ixodidae (hard ticks) are the most often studied, and the research generally concentrates on ticks' salivary proteins and their role in the transfer of various pathogens (Denisov & Dijkgraaf 2021). Given the significance of tick saliva in modulating host immune defenses, these proteins play crucial roles through binding interactions with host receptors (Kazimírová & Štibrániová 2013; Chmelař et al. 2019). Among the tick proteins deposited in the PDB, transferases, endopeptidases, proteases, and lipocalins are the most common. Novel structures of tick-derived C5 inhibitors, including the lipocalin OmCI from Ornithodoros moubata (Argasidae family) and RaCI from Rhipicephalus appendiculatus (Ixodidae family), provide new insights into complement C5 activation in humans (PDB codes: 6RQJ and 6RPT). Interestingly, both free forms of these proteins and their complexes with human complement C5 have been deposited, and their comparison holds therapeutic significance. Structures of the amine-binding lipocalins monomine and monotonin from Argas monolakensis were determined to compare ligand binding sites and to reconstruct evolutionary pathways of the amine-binding tick lipocalins (PDB code: 3BRN). Tick saliva can also cause significant allergic disease, and further interesting deposits from the Argasidae family represent the major allergen Arg r 1 from Argas reflexus (pigeon tick). A selenomethionine mutation-free Arg r 1 structure and a structure of Arg r 1 complexed with histamine are deposited in the PDB (PDB codes: 2X45 and 2X46). As interest in tick-borne diseases is growing, it is expected that the salivary proteins will continue to be one of the major groups of Arachnida proteins represented in the PDB.
In summary, Arachnida proteins are currently of significant interest for biomedical, agricultural, and material sciences, but are significantly understudied. Based on this review, the proteins and peptides studied are not very diverse and can be categorized mainly as toxins, spidroins, salivary proteins, and allergens (Fernández-Caldas et al. 2014; Wang et al. 2014; Thomas 2015; Daly & Wilson 2018; Bhusal et al. 2020; Denisov & Dijkgraaf 2021). We conclude that structural studies of Arachnida-derived proteins are in their infancy, despite the importance of Arachnida proteins not only from the perspective of human health and biomaterials but also for potential applications in agriculture. We also expect a rapid increase in the number of Arachnida protein structures determined, as these molecules have diverse potential applications.
This project was funded by the USDA's National Institute of Food and Agriculture award #2020-67014-31179 through the NSF/NIFA Plant Biotic Interactions Program and by grant R01AI077653 from the National Institute of Allergy and Infectious Diseases.