Search for Human-Specific Proteins Based on Availability Scores of Short Constituent Sequences: Identification of a WRWSH Protein in Human Testis

Little is known about protein sequences unique to humans. Here, we performed alignment-free sequence comparisons based on the availability (frequency bias) of short constituent amino acid (aa) sequences (SCSs) in proteins to search for human-specific proteins. Focusing on 5-aa SCSs (pentats), exhaustive comparisons of availability scores between the human proteome and nine other mammalian proteomes in the nonredundant (nr) database identified a candidate protein containing WRWSH, here called FAM75, as human-specific. Examination of various human genome sequences revealed that FAM75 had genomic DNA sequences for either WRWSH or WRWSR due to a single nucleotide polymorphism (SNP). FAM75 and its related protein FAM205A were found to be produced through alternative splicing. The FAM75 transcript was found only in humans, but the FAM205A transcript was also present in other mammals. In humans, both FAM75 and FAM205A were expressed specifically in testis at the mRNA level, and they were immunohistochemically located in cells in seminiferous ducts and in acrosomes in spermatids at the protein level, suggesting their possible function in sperm development and fertilization. This study highlights a practical application of SCS-based methods for protein searches and suggests possible contributions of SNP variants and alternative splicing of FAM75 to human evolution.


Introduction
The human species has unique traits among animals. It is well known that morphological and physiological traits such as erect bipedalism, speech and language, and long reproductive period are very different from those of other primate species. Only humans have high intelligence that fosters sophisticated communications and complex societies. This intelligence is related to continuous brain development after birth in humans, which is not observed in other great apes, including chimpanzees [1]. The evolutionary emergence of these unique traits in humans likely contributes to human speciation. The simplest hypothesis to explain human uniqueness is that it originates from the uniqueness of constituent molecules (i.e., genes and proteins) themselves. In this "constituent hypothesis," humans have unique genes and proteins that do not exist in chimpanzees. A contrasting hypothesis is that constituent molecules are similar between humans and chimpanzees, but they are regulated differently in these species. That is, in this "regulatory hypothesis," a similar set of proteins may be produced but at different times (heterochrony), in different locations (heterotopy), in different amounts (heterometry), and in different usage (heterotypy) [2]. These regulatory changes in gene expression seem to be evolutionarily parsimonious and, indeed, are supported by comparative observations at phenotypic levels [3].
One line of support for the regulatory hypothesis comes from genomics and developmental expression studies. Following the announcement of a human genome release [4], the genomes of other great apes were sequenced [5][6][7]. Comparisons of DNA sequences between humans and chimpanzees have revealed that nucleotide differences are only 1.23% in aligned sequences, and most of these differences are thought to be functionally insignificant [5]. Further rigorous comparisons throughout these genomes have revealed that nucleotide differences are 4% and that they are mostly located in noncoding regions [8]. The expression patterns of some genes are different between humans and chimpanzees during development [9][10][11][12]. Differences in transcriptomes have revealed that species differences in expression patterns are tissue-dependent and that testes have the greatest difference [13,14]. It has been speculated that the accumulation of small expression or regulatory differences leads to large phenotypic differences between humans and chimpanzees [14]. On the other hand, while these findings support the regulatory hypothesis, they do not necessarily reject the constituent hypothesis [15,16]. RNA-mediated mechanisms for novel genes have been proposed together with the "out of the testis" hypothesis, in which testis is considered a tissue for experimenting with new genes [16]. Comparisons among transcriptomes in primates have revealed that many genes for spermatogenesis in testes, which likely inhibit apoptosis when mutated, are positively selected [17,18].
Although these genome comparison studies advance this field, there are a few inherent problems. First, their results are heavily dependent on database quality because of their methodological nature. Most genome sequences were draft sequences at the time of public release, likely containing numerous sequencing and assembling mistakes. For example, the previous chimpanzee genome was assembled in reference to the human genome, which means that genomic regions in chimpanzees that are different from those in the human genome may have been assembled to create false sequences, although continuous revisions have been made [19]. Even in the human genome, many previous gene records generated by automated assemblers have been removed after revisions. Moreover, population sampling bias from the sequenced genome cannot be avoided when samples from a small number of individuals are sequenced. The case of a transcription factor, FOXP2 (forkhead box P2), is an object lesson: FOXP2 has been proposed to have played a key role in human-specific evolution by assisting speech and language [20], but that evidence is likely to be weak and probably incorrect because of sampling bias [21].
Second, such genome comparisons are largely based on sequence alignments [22,23]. Although sequence alignment methods are powerful and probably the most important in comparison studies, sequences that do not contain relatively long regions of similarity cannot be compared well. In other words, short sequences that do not extend to longer similarities are discarded as noise [22]. Although this strategy is highly successful, it assumes that nonaligned short sequences are not important, which may not always be true. There may still be important differences undiscovered where alignments are not possible.
An approach to the second issue above is to develop alignment-free methods. The advantage of the alignment-free approach is that any collections of proteins can be compared quantitatively. Although various types of alignment-free approaches have been developed [24,25], including our previous attempts to use membrane topology [26] and a self-organizing map [27], the alignment-free approach in the present study is based on the "availability" (frequency bias) of short constituent sequences (SCSs) of amino acids (aa) in proteins [28][29][30][31][32][33]. The length of SCSs can be 2 aa (doublet), 3 aa (triplet), 4 aa (quartet), 5 aa (pentat), and more in a given protein. This SCS-based analysis is basically similar to other related analyses for amino acid sequence patterns that were called under different terms with slightly different mathematical operations: oligopeptide patterns [34][35][36][37][38][39], amino acid sequence repertoire [40], peptide vocabulary [41], n-gram [42,43], n-tuple [44], and pseudo amino acid composition [45][46][47]. There are some noteworthy recent studies that encourage this line of approach: for example, nonrandom distributions of 5-aa SCS are demonstrated in the current proteome databases [38], confirming the previous finding that biological bias occurs in protein coding [28,29]. Among these existing studies, our approach is operationally one of the simplest, and it emphasizes analogies between languages and protein sequences [32,33]. Encouragingly, linguistic aspects of proteins have been noted in other studies [48,49].
In our approach, protein sequences are considered to be composed of many SCSs. Importantly, the number of possible SCSs is limited because a protein is composed of just 20 kinds of amino acids; there are 400 ($=20^2$) permutations of 2-aa SCSs (doublets), 8000 ($=20^3$) permutations of 3-aa SCSs (triplets), 160,000 ($=20^4$) permutations of 4-aa SCSs (quartets), and 3,200,000 ($=20^5$) permutations of 5-aa SCSs (pentats). Frequencies of individual SCSs in a given protein database can be inferred theoretically based on frequencies of component amino acids, which is called the expected frequency ($E$). On the other hand, real frequencies of individual SCSs ($R$) in a given protein database can be obtained through database searches. The availability score ($A$) of a given SCS in a protein database can be simply defined as $A = (R - E)/E$. Availability scores thus indicate biological frequency bias that might have occurred for functional or historical reasons during protein evolution. In other words, availability scores ($A$) of SCSs are used instead of simple real frequencies ($R$) of SCSs to exclude noise from random occurrence.
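The definition above can be illustrated with a minimal Python sketch. It is not the SCS Package implementation: it assumes that the expected frequency E of an SCS is the product of the frequencies of its component amino acids multiplied by the total number of SCS windows in the database, and the function names `availability_scores` and `availability` are hypothetical.

```python
from collections import Counter


def availability_scores(proteins, n=5):
    """Return {scs: A} for every n-aa SCS observed in `proteins`.

    Assumption (not stated in the text): E is the product of the
    component amino acid frequencies times the total number of
    n-aa windows in the database.
    """
    aa_counts = Counter()
    scs_counts = Counter()
    for seq in proteins:
        aa_counts.update(seq)
        # Slide a window of length n over the sequence.
        for i in range(len(seq) - n + 1):
            scs_counts[seq[i:i + n]] += 1

    total_aa = sum(aa_counts.values())
    total_scs = sum(scs_counts.values())
    freq = {aa: count / total_aa for aa, count in aa_counts.items()}

    scores = {}
    for scs, real in scs_counts.items():
        probability = 1.0
        for aa in scs:
            probability *= freq[aa]
        expected = probability * total_scs
        scores[scs] = (real - expected) / expected  # A = (R - E) / E
    return scores


def availability(scs, scores):
    """Look up A; an SCS that never occurs has R = 0 and thus A = -1."""
    return scores.get(scs, -1.0)
```

With this definition, an SCS that is absent from a database always receives the lowest possible score, A = -1, regardless of its expected frequency.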
Among n-SCSs, we state that 5-aa SCSs (pentats) are optimal for analyses for the following reasons [28,29,33]. First, they are practically manageable in number (exactly 3,200,000 different pentats) in our computational system. Higher computational power, which is sometimes not practical, is required to use 6-aa or longer SCSs. Second, the number of possible SCSs should be reasonably comparable to or smaller than the number of existing SCSs in a biological database. The use of 6-aa or longer SCSs would result in many nonexistent SCSs in the database because the number of possible 6-aa (or longer) SCSs is much larger than the number of existing SCSs in a given database. Third, 5-aa SCSs are likely structurally reasonable units (or "blocks") to build functional protein structures [50][51][52][53][54]. Fourth, it was suggested that small stretches of proteins are often recognized in protein interactions. For example, T-cell receptors recognize 5-aa SCSs as antigens in the process of antigen presentation, and this fact relates to the frequency bias of SCSs in parasites to avoid recognition by the T-cell receptors of the host [41]. Specificities of immune responses are thus likely influenced by SCSs in expressed proteins in a given organism, as also suggested by usage of rare SCSs as immune adjuvant vaccines [39].
Furthermore, rare SCSs evolved as untranslatable sequences in bacteria as a means of translational control [40].
Using this simple concept of availability score, secondary structure characterization has been performed; SCS frequencies (and thus availability scores) are different among different secondary structures [30]. Availability scores are also different between parallel and antiparallel β-strands [31]. This approach is also relevant to identifying sequence motifs in some, although not all, proteins [32]. It has been shown that triplet compositions in proteomes may reflect phylogenetic relationships [32,37,53]. We believe that this approach is applicable to understanding species specificity.
We have implemented several applications as the SCS Package that informatically examine protein sequences [33]. Among them, we have built an application for identifying species-specific SCSs. In the present study, we compared the human proteome and nine other mammalian proteomes based on availability analysis of 5-aa SCSs (pentats) to identify human-specific pentats. We hypothesized that a protein containing the identified human-specific pentat would be unique to humans and might have played a role in human evolution.

The SCS package
Assuming that small changes in amino acids in proteins (or corresponding nucleotide changes in DNA) contribute significantly to phenotypic differences between humans and chimpanzees, the concept of SCS-based methods is to detect small amino acid usage differences between species in an alignment-independent manner. The SCS package is an open web service containing six applications (plus the latest application to analyze idiom networks under development [55]) for protein analyses (http://scspackage.ads.ie.u-ryukyu.ac.jp/) [33]. These applications run against a copy of the nonredundant (nr) database of the NCBI (National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda) that was downloaded on August 20, 2015. Because T-cell receptors, B-cell receptors, and antibodies are produced by somatic recombination and hypermutation, protein records containing the following keywords in sequence names were excluded: anti, IgG, IgM, IgA, IgD, IgE, BCR, TCR, B-cell receptor, T-cell receptor, Ig, and immunoglobulin. A complete match, including spaces, was required for a record to be excluded. Frequencies and availability scores for all possible n-aa SCSs (n = 3, 4, and 5 in the current SCS package) in the database were calculated and stored in the database [56]. For species comparison, each record in the downloaded database was sorted into its original species to produce species-specific proteome databases.
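The record filtering and species sorting described above could be sketched as follows. This is an illustration under stated assumptions, not the SCS Package code: "a complete match" is interpreted here as a case-sensitive whole-token match, the species is assumed to be the last bracketed field of the record name (as in "unnamed protein product [Homo sapiens]"), and the helper names `is_excluded`, `species_of`, and `split_by_species` are hypothetical.

```python
import re
from collections import defaultdict

# Keywords listed in the text for excluding immune-related records.
EXCLUDED_KEYWORDS = [
    "anti", "IgG", "IgM", "IgA", "IgD", "IgE", "BCR", "TCR",
    "B-cell receptor", "T-cell receptor", "Ig", "immunoglobulin",
]


def is_excluded(name):
    """True if `name` contains a keyword as a whole token (assumption)."""
    for keyword in EXCLUDED_KEYWORDS:
        pattern = r"(?<![A-Za-z0-9])" + re.escape(keyword) + r"(?![A-Za-z0-9])"
        if re.search(pattern, name):
            return True
    return False


def species_of(name):
    """Return the species in the trailing [brackets], or None if absent."""
    match = re.search(r"\[([^\[\]]+)\]\s*$", name)
    return match.group(1) if match else None


def split_by_species(records):
    """Group sequences of non-excluded (name, sequence) records by species."""
    proteomes = defaultdict(list)
    for name, sequence in records:
        if is_excluded(name):
            continue
        species = species_of(name)
        if species is not None:
            proteomes[species].append(sequence)
    return dict(proteomes)
```

Each resulting per-species list of sequences can then serve as one species-specific proteome database for the frequency and availability calculations.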

Bioinformatics web services
After identifying FAM75 using the SCS package, available information on FAM75 and FAM205A was gathered using various web sites. The chromosomal location of FAM75/FAM205A and their single nucleotide polymorphism (SNP) variants were searched using Map Viewer (www.ncbi.nlm.gov/mapview/) in the NCBI server. For various information on human transcripts, we referred to H-InvDB (www.h-invitational.jp/hinv/ahg-db/index_ja.jsp) [57]. This site provides curated information on gene structure, splicing variants, functional RNAs, protein functions, functional domains, intracellular distribution, metabolic pathways, three-dimensional structures, disease relationships, genetic polymorphism (SNPs, indels, microsatellites, and others), gene expression profiles, molecular evolutionary characters, protein-protein interactions, and gene families. Tissue-specific expression profiles were searched using H-ANGEL (http://www.h-invitational.jp/hinv/h-angel/wge_top.cgi?), a database for human gene expression profiles, in the H-InvDB server. Information on alternative splicing variants of the nonhuman primates and mouse was obtained from Map Viewer in NCBI. We referred to the following latest annotations: chimpanzee (Annotation Release 103), western gorilla (Annotation Release 100), Sumatran orangutan (Annotation Release 102), and laboratory mouse (Annotation Release 106). We frequently used protein BLAST in the NCBI server for conventional similarity search (blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins) [22] and performed multiple sequence alignments when necessary using MEGA7 (www.megasoftware.net/) [58]. In addition, the cDNA sequence of FAM75 was subjected to RegRNA analysis (regrna.mbc.nctu.edu.tw/html/about.html) to identify any possible sequence motifs in FAM75 mRNA [59].

Human cDNA samples for tissue expression profiling
For the cDNA template, we purchased human MTC (multiple tissue cDNA) panels I and II (Takara Bio, Kusatsu, Shiga, Japan). The panels contain first-strand cDNA from poly(A)+ RNA and are free from genomic DNA. The amounts of cDNA are approximately 1.0 ng/μL and are normalized to four housekeeping genes, phospholipase A2, G3PDH (glyceraldehyde-3-phosphate dehydrogenase), β-actin, and α-tubulin, which makes it possible to compare expression levels among different tissues. Panels I and II together contain cDNA samples from the following 16 human tissues: heart, brain, placenta, lung, liver, skeletal muscle, kidney, pancreas, spleen, thymus, prostate, testis, ovary, small intestine without mucosal lining, colon with mucosal lining, and peripheral blood leukocyte. Each tissue sample was pooled from 1 to 550 Caucasians, and the testis sample was pooled from 45 Caucasians aged 14-64, according to the manufacturer's specifications.

PCR primers
Based on the cDNA sequence of FAM75, we designed two sets of PCR primers for nested PCR using Primer-BLAST (www.ncbi.nlm.nih.gov/tools/primer-blast/). The first set was to amplify both FAM75 and FAM205A from the consensus region, and the second set was to amplify FAM205A from the region that is present only in FAM205A. For the first set, the first-round forward primer was 5'-TTACCAGGTACTGTCACTGAACAC-3', and its paired reverse primer was 5'-TTCTGAAGCTAGACTCTGTAAGGC-3'. This first round of PCR was expected to amplify 1387 bp. The second-round (nested) forward primer was 5'-AGTTGTACAGACGTTGCAAAAGAG-3', and its paired reverse primer was 5'-TTTCTGAAGCTAGACTCTGTAAGGC-3'. This second round (nested) PCR was expected to amplify 1097 bp.

PCR conditions
We used an Astec PC320 thermal cycler (Fukuoka, Japan) and Tks Gflex DNA polymerase (Takara Bio) for PCR. According to the manufacturer's specifications, this DNA polymerase has high fidelity; the error rate was reported to be 0.0131%.
The original cDNA sample from the human MTC panels (Takara Bio) was diluted 10 times to make PCR template samples. The following solutions were mixed to start PCR: Gflex PCR buffer 12.5 μL, deionized water 8.5 μL, DNA polymerase 0.5 μL, forward primer 0.5 μL, reverse primer 0.5 μL, and cDNA template 2.5 μL in a total volume of 25.0 μL. The nested PCR was performed using 2.5 μL reaction solution from the first-round PCR. In both the first and second (nested) rounds, a negative control was performed using deionized water without template cDNA.
The first PCR cycles were performed as follows: an initial denaturing step at 94°C (5 min); 10 cycles of 98°C (30 s), 60°C (30 s; −0.5°C/cycle), and 68°C (1 min); 30 cycles of 98°C (30 s), 55°C (30 s), and 68°C (1 min); and a last extension step at 68°C (30 s). The second (nested) PCR cycles were the same as the first PCR cycles except that the initial denaturing step at 94°C lasted 1 min. In both the first and second PCRs, the annealing temperature in the first 10 cycles was reduced stepwise (i.e., touch-down PCR): 60.0°C in the first cycle, 59.5°C in the second cycle, 59.0°C in the third cycle, and so on.
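The touch-down program described above can be written out step by step, as in the following sketch. It only restates the temperatures and durations given in the text; the function name `touchdown_program` and the (step, temperature, seconds) tuple layout are hypothetical choices for illustration.

```python
def touchdown_program(initial_denature_s):
    """Return the PCR program as a list of (step, temp_C, seconds) tuples.

    `initial_denature_s` is 300 for the first-round PCR (5 min) and
    60 for the nested PCR (1 min), following the text.
    """
    steps = [("initial denaturation", 94.0, initial_denature_s)]

    # 10 touch-down cycles: annealing drops 0.5 C per cycle from 60.0 C.
    for cycle in range(10):
        anneal_temp = 60.0 - 0.5 * cycle
        steps.append(("denature", 98.0, 30))
        steps.append(("anneal", anneal_temp, 30))
        steps.append(("extend", 68.0, 60))

    # 30 regular cycles with annealing fixed at 55 C.
    for _ in range(30):
        steps.append(("denature", 98.0, 30))
        steps.append(("anneal", 55.0, 30))
        steps.append(("extend", 68.0, 60))

    steps.append(("final extension", 68.0, 30))
    return steps


first_round = touchdown_program(300)
nested_round = touchdown_program(60)
```

Listing the annealing steps of `first_round` shows the last touch-down cycle at 55.5°C, after which the 30 regular cycles continue at 55°C.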
Positive controls were performed with G3PDH primers that were supplied in the Human MTC Panels (Takara Bio) from the manufacturer. The PCR product was expected to be 938 bp. The primer sequences were as follows: 5'-TGAAGGTCGGAGTCAACGGATTTGGT-3' for the forward primer and 5'-CATGTGGGCCATGAGGTCCACCAC-3' for the paired reverse primer. PCR cycles were as follows: an initial denaturing step at 95°C (1 min); 38 cycles of 95°C (30 s) and 68°C (3 min); and a final extension step at 68°C (3 min).
PCR products (1.0 μL) were subjected to 0.8% agarose gel electrophoresis in TAE buffer and stained with ethidium bromide for visualization. The PCR products were run with λHindIII DNA size marker (New England Biolabs, Ipswich, MA, USA).

Among the ΔA rank order of pentats, we focused on pentats that showed the lowest possible availability scores (A = −1) in all nine other mammalian proteome databases, meaning that these pentats did not exist in the nonhuman proteomes at all. We found that WRWSH at rank 204 (ΔA = 1720) and MMFGC at rank 226 (ΔA = 1594) met this criterion. However, MMFGC was found to be a false positive, because this pentat was located exclusively in immunological proteins that could be subject to somatic recombination and hypermutation. Therefore, we decided to focus on WRWSH hereafter.
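The selection criterion (A = −1 in every nonhuman proteome) can be sketched as a simple filter. The following is an illustration only: it assumes that availability scores are held in one dictionary per species, that a pentat missing from a dictionary is absent from that proteome (A = −1), and, because the ΔA formula is not restated in this section, it ranks candidates by the human availability score alone. The function name `human_specific_candidates` is hypothetical.

```python
ABSENT = -1.0  # lowest possible availability score (R = 0)


def human_specific_candidates(human_scores, other_scores):
    """Return (pentat, human A) pairs absent from all other proteomes.

    human_scores: {pentat: A} for the human proteome.
    other_scores: list of {pentat: A} dicts, one per nonhuman proteome.
    Candidates are sorted by human A, highest first (an assumption that
    stands in for the ΔA rank order used in the text).
    """
    candidates = []
    for pentat, a_human in human_scores.items():
        absent_everywhere = all(
            scores.get(pentat, ABSENT) == ABSENT for scores in other_scores
        )
        if absent_everywhere:
            candidates.append((pentat, a_human))
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates
```

Candidates returned by such a filter still require manual inspection, as the MMFGC case shows: a pentat can pass the criterion while occurring only in immunological proteins.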

Human proteins containing WRWSH
Human proteins containing WRWSH were identified using the "search for amino acid sequences of species" program, one of the SCS Package programs. Among all 148 hits, 16 hits were related to "mucin-19-like isoform," 55 hits to "glycine-rich cell wall structural protein," 28 hits to "RNA-binding protein," 48 hits to "uncharacterized transmembrane protein," and 1 hit to "unnamed protein product." Unfortunately, all of these sequences except the last one, "unnamed protein product," were "predicted" informatically as parts of "Homo sapiens Annotation Release 106" [69], and they were all removed from the latest annotation, "Homo sapiens Annotation Release 109" [70]. Because their status was uncertain at this point (although they resembled real protein sequences with a long open reading frame), they were not pursued in the present study. On the other hand, "unnamed protein product [Homo sapiens] (Accession No. BAC86357.1)", here called "FAM75" based on the name of the putative domain that it contained, was validated in the latest annotation [70], and thus, we pursued this protein for further investigation.
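The kind of search performed here, finding every protein record that contains a given pentat, can be sketched in a few lines. This is not the SCS Package program itself; the function name `proteins_containing` and the (name, sequence) record layout are assumptions made for illustration.

```python
def proteins_containing(records, scs):
    """Return (name, positions) for each record whose sequence has `scs`.

    records: iterable of (name, sequence) pairs.
    positions: 1-based start positions of every occurrence, including
    overlapping ones.
    """
    hits = []
    for name, sequence in records:
        positions = [
            i + 1
            for i in range(len(sequence) - len(scs) + 1)
            if sequence.startswith(scs, i)
        ]
        if positions:
            hits.append((name, positions))
    return hits
```

Grouping the returned record names by their descriptions would give a tally of hits per protein family comparable to the one reported above.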

FAM75 and its related FAM205A
According to the NCBI record, FAM75 is a protein containing 1014 aa, and its cDNA coding sequence was 3274 bp (Accession No. AK125949.1). It is important to stress that FAM75 has been identified as cDNA from NEDO human cDNA sequencing project (www.nite.go.jp/en/nbrc/genome/project/annotation/cdna.html), and thus this protein is not likely an error product from genome sequencing. A protein BLAST search using FAM75 as a query identified the record "protein FAM205A [Homo sapiens] (Accession No. NP_001135389.1)." This protein record was closely related to the mRNA record "Homo sapiens family with sequence similarity 205 member A (FAM205A), mRNA (Accession No. NM_001141917.1)." The BLAST result showed that the identity score was 99%; 1003 aa were identical among 1014 aa. The record showed that FAM205A contained 1335 aa, and its cDNA coding sequence was 4311 bp. Thus, it was longer than FAM75. A DNA sequence comparison between FAM75 and FAM205A revealed that 16 bases were different (Table 1). When the FAM205A genomic DNA sequence (Accession No. NG_052658.1) was compared with its cDNA sequence, these 16 bases were identical (Table 1). Between FAM75 and FAM205A, 11 amino acids were different.
Table 1. Different DNA bases and protein amino acids between FAM75 (unnamed protein product) and FAM205A in the NCBI records.
Interestingly, FAM205A in that record had WRWSR instead of WRWSH; the DNA bases that distinguish the last amino acid of WRWS(H/R) were A (adenine) in FAM75 cDNA and G (guanine) in FAM205A cDNA and gDNA. These results suggest that the two products are closely related and may be produced from the same genomic site by alternative RNA splicing.
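That a single A/G difference is sufficient to switch histidine to arginine follows directly from the standard genetic code, as the short sketch below shows. The actual codon used in FAM75 is not given in this section, so the sketch simply enumerates both histidine codons; the function name `arg_codons_from_his` is hypothetical.

```python
# Standard genetic code entries for the two amino acids of interest.
HIS_CODONS = {"CAT", "CAC"}
ARG_CODONS = {"CGT", "CGC", "CGA", "CGG", "AGA", "AGG"}


def arg_codons_from_his(codon):
    """Return Arg codons reachable from a His codon by one A-to-G change."""
    if codon not in HIS_CODONS:
        raise ValueError(f"{codon} is not a histidine codon")
    reachable = []
    for position, base in enumerate(codon):
        if base == "A":
            mutated = codon[:position] + "G" + codon[position + 1:]
            if mutated in ARG_CODONS:
                reachable.append(mutated)
    return reachable


for his_codon in sorted(HIS_CODONS):
    print(his_codon, "->", arg_codons_from_his(his_codon))
```

For either histidine codon, replacing the A at the second codon position with G yields an arginine codon, which is consistent with the H/R difference between WRWSH and WRWSR arising from one nucleotide.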

Gene structures: alternative splicing and polymorphism
A UniGene search revealed that the FAM75/FAM205A gene was located at 9p13.3 on chromosome 9 in the human genome [71]. As expected, their exon-intron structures were different (Figure 1). FAM75 had a single exon, whereas FAM205A had four exons. The exon of FAM75 had high homology with the fourth exon of FAM205A. The 5'-UTR of FAM75 also corresponded to the fourth exon of FAM205A. Clearly, these two RNA transcripts and their proteins are products of alternative splicing from the same genomic locus.
H-InvDB revealed two additional splicing variants (HIT000496944 and HIT000496575) from the same locus at 9p13.3 (Figure 1). The record HIT000496944 in the NCBI database was "Homo sapiens cDNA FLJ51393 complete code (AK302320.1)," and the record HIT000496575 was "Homo sapiens cDNA FLJ58301 complete code (AK301951.1)," both named "unnamed protein product." These are splicing variants, but among them, only FAM75 lacked the first 255-bp exon, indicating that the translation initiation sites are different in these mRNAs.
Because not all RNA transcripts are translated into proteins, we used a RegRNA search of UTRs (untranslated regions) to examine the integrity of the FAM75 mRNA. The RegRNA search revealed that the 5'-UTR of FAM75 had an internal ribosome entry site (IRES) [72][73][74] among other motifs, suggesting that the FAM75 mRNA is likely translated into proteins.

WRWSH and WRWSR in human populations
The G/A difference in FAM75/FAM205A in genomic DNA (corresponding to the H/R difference in WRWSH or WRWSR) was confirmed to be a SNP in humans, according to dbSNP. We found that this SNP was widespread in the human genome, and the G/A ratio was dependent on regional populations, as revealed by the 1000 Genomes Project (Figure 2). Among human populations, African populations had a high G frequency (i.e., WRWSR); the three highest G-frequency populations were Gambian in Western Division (96.16%); Yoruba in Ibadan,

Homologous proteins and alternative splicing products in other animals
Here, we searched for homologous proteins for FAM75 and FAM205A in other animals. Among the nine mammals used for the initial identification for WRWSH, homologous proteins for FAM205A were identified by BLAST search (Table 2); all proteins were FAM205A homologs, and all proteins in primates (chimpanzee, gorilla, and orangutan) contained WRWSR (corresponding to FAM205A) but not WRWSH (corresponding to FAM75), suggesting that WRWSH in FAM75 may be unique in humans among primates. In other nonprimate animals that were examined here, this pentat sequence was either not conserved at all or nonexistent.
To further examine whether splicing variants exist in other great apes, we checked the genome loci and transcript data using Map Viewer. In chimpanzees (Figure 3) and gorillas (not shown), there were no alternative splicing transcripts from this locus. In orangutans (not shown), there were three isoforms, the X1, X2, and X3 transcripts, from this locus. However, these transcripts were very similar to one another, and they were all considered FAM205A homologs containing WRWSR. We also examined the genome of the mouse as a representative nonprimate mammal (not shown). There were three transcript variants: "predicted gene 12429 isoform X1, X2" and "predicted gene 12429." They all contained SLQAQ instead of WRWSH in these proteins, and their splicing patterns were different from those of FAM75. We confirmed that human splicing patterns (Figure 4) were different from those of these mammals. Therefore, we conclude that the FAM75 transcript was found only in humans.
Note: Amino acid sequences are conceptual translations from genomic data. Red letters indicate amino acids different from those of the human pentat WRWSH. No corresponding pentat was found in the platypus (:::::), and no homologous protein was found in the pig (-).

Testis-specific expression of FAM75 and FAM205A
To examine the existence and expression of FAM75 in our laboratory, we performed RT-PCR (reverse transcription polymerase chain reaction) with two sets of PCR primers, using 16 different human-tissue cDNA pools as templates. The first set of primers was designed to amplify both FAM75 and FAM205A (Figure 5A), and the second set was designed to amplify FAM205A only (Figure 5B). Due to their overlapping nature, exclusive amplification of FAM75 was not possible. In both primer sets, testis-specific expression was observed. A positive control using a primer set for G3PDH showed amplification from all tissues (Figure 5C), and a negative control (without cDNA template but with experimental primer sets) did not show any amplification.
Our results were consistent with the H-ANGEL expression database; in this database, FAM75 and FAM205A were not differentiated, but the database indicated that the expression was testis-specific (Figure 6). The NCBI database also indicated the testis-specific expression of FAM205A (not shown). The expression pattern of FAM205A was also found in the Human Protein Atlas, in which FAM205A was expressed in testis and in no other tissues examined at the mRNA level (not shown), confirming our PCR-based data. According to the Human Protein Atlas, cells in the seminiferous ducts (sperm and immature sperm cells) of the testis were clearly stained immunohistochemically, but Leydig cells were not (Figure 7). As mentioned in the Human Protein Atlas, staining was clearly detected in acrosomes in spermatids (Figure 7). Considering that the antibody used in the Human Protein Atlas could not differentiate FAM205A and FAM75 (because a recombinant C-terminal 104 aa fragment that is almost identical in both FAM205A and FAM75 was used as an antigen), both proteins were likely stained in the tissue sections.

Structural and functional predictions
We performed several sequence analyses to characterize the sequences of FAM75 (Figure 8). When FAM75 and FAM205A were subjected to SOSUI, the former was predicted as a soluble protein, but the latter was predicted as a membrane protein with a single transmembrane helix. TMHMM also showed essentially the same results. Indeed, the Human Protein Atlas considered FAM205A to be both a membrane protein and a cytoplasmic protein based on immunohistochemical results, suggesting that FAM75 and FAM205A may be detected in the cytoplasm and in membranes, respectively, as predicted by SOSUI and TMHMM. In contrast, both were predicted to be "nuclear" using PSORT II Prediction.
To search for possible functional sites, different amino acids between FAM75 and FAM205A (Table 1), conserved amino acids among FAM205A and similar sequences (top 100 BLAST data) based on multiple alignment, Pfam domain data, a hydropathy plot, an availability plot, and an idiom plot were aligned together (Figure 8). Conserved amino acids were located mostly in the N-terminal side, in which the "FAM75 domain" identified by Pfam was also located. WRWSH was located at the center of the hydrophilic region in the C-terminal side and corresponded to a high availability region and a high idiom region, although their significance was not clear at this point.

Discussion
In this paper, we identified a WRWSH-containing protein, FAM75, as a candidate human-specific protein. We assumed that pentats with high availability scores in humans and no occurrence (A = −1) in nine other mammals might be contained in a human-specific protein. The current method based on this assumption indeed identified FAM75. Although the DNA sequence coding for WRWSH is one of the SNP variants in the human genome (i.e., WRWSH was not conserved in all human populations), this fact does not exclude the candidacy of WRWSH as a human-specific pentat, because we do not know when this SNP variant was created during human history. Likely, not all point mutations are functionally equal; some point mutations may incidentally create a rare pentat like WRWSH that may contribute to functional novelty. Interestingly, the FAM75 transcript was found only in humans as an alternative splicing transcript of FAM205A. In this sense, our SCS-based search for human-specific proteins successfully identified what we wanted to identify. The success of this study may simply be fortunate. On the other hand, there are many other candidate human-specific pentats that we did not examine in detail. Changing search conditions, including the length of amino acid sequences (i.e., triplets, quartets, and longer SCSs), could identify further candidate human-specific SCSs.
The present study showed that the SCS-based approach is a relevant addition to a list of practical sequence comparison methods. As with other methods, the SCS-based method is influenced by SNPs and by the accuracy and amount of information in databases. For example, the human genome has numerous SNP variations, and there is much less genomic information for other primates than for humans. A and ΔA scores, which were used to search in this study, are dependent on databases. WRWSH had high ΔA between humans and nine other mammals, and this is partly because there were many human protein records that contained this pentat at the time of the database search. Unfortunately, most of these records were later removed from human databases (NCBI GenBank records) because of the uncertainty of their status (although they were not rejected completely). This illustrates the importance of database quality in genome comparison studies. However, whatever ΔA was, we focused on the pentats that were not used at all (A = −1) in the nine other nonhuman mammals, which made the choice of pentats for further investigation less sensitive to database quality.
FAM75 and FAM205A appear to be alternative splicing products from the same genomic locus in humans (Figure 1). The relationship of the evolutionary invention of FAM75 as an alternative splicing product with that of a SNP variant for WRWSH is unclear. We cannot exclude the possibility that this may be a simple coincidence, but this coincidence is in accordance with our starting hypothesis for this study: proteins containing a human-specific pentat may indeed be human-specific as proteins. We confirmed the expression of FAM205A and/or FAM75 at the mRNA level in human tissues (Figure 5). At the protein level, the FAM205A protein (and probably also the FAM75 protein) was shown to be located in cells in seminiferous ducts and in acrosomes in spermatids in the testis (Figure 7). Interestingly, FAM205A was also detected in the human sperm nucleus in a proteomic study [75]. Although it is difficult to distinguish FAM75 and FAM205A at the mRNA and protein levels, it is demonstrated that the FAM75/FAM205A gene is not a pseudogene, and protein products are actively produced in testis. The discovery of the IRES element in FAM75 mRNA also supports the idea that FAM75 mRNA is actively translated into proteins. On the other hand, we found two additional alternative splicing products in H-InvDB (Figure 1). These additional mRNAs were not examined in this paper, because of insufficient information. However, their status is of interest if they really exist; they may have similar but slightly different functions from FAM205A and FAM75.
Mechanistically, alternative splicing may be a relatively easy way to create a new protein sequence. It may be considered not only a "regulatory change" (according to the regulatory hypothesis, because the evolutionary invention of a new alternative splicing product conserves the original protein-coding DNA sequence and gene function and thus is more conservative with respect to species evolution) but also a "sequence change" (according to the constituent hypothesis, because the protein sequence is changed). These two modes are likely intermingled in this case. To extrapolate this argument, transcriptome studies of alternative splicing or RNA processing may be fruitful to identify human-specific genes. The present discovery of the IRES element in the FAM75 mRNA may be surprising because IRES elements are mostly viral, and cellular elements are relatively rare [72][73][74]. A search for IRES elements in the genome may also be fruitful.
The evolution of WRWSH and FAM75 in relation to human speciation is an important but uncertain aspect to be discussed. There are two kinds of "human-specific" proteins. First, a group of proteins may have been involved in the early step of speciation of Homo sapiens from its ancestral species. Second, after the establishment of Homo sapiens, additional changes in a group of proteins may have occurred as a reinforcement process. In either case, these proteins may be called human-specific. If the pentat WRWSH (or FAM75) played a role in these early or late steps of human speciation, this pentat is human-specific, and it would have later mutated back to WRWSR in African populations. In this case, WRWSH was once assimilated completely in the human population during speciation, and a new WRWSR sequence is now assimilating, as genetic assimilation has been considered a key process in species evolution [76][77][78]. However, because WRWSH is relatively rare in African populations, it is more parsimonious to think that WRWSH evolved after human speciation in Asian or European populations. We speculate that FAM75 may have been invented from FAM205A to play a role in human speciation, but at least in the early stage, FAM75 exclusively contained WRWSR, as in the other great apes. WRWSH may then have been invented in FAM75 to reinforce human speciation. Alternatively, WRWSH did not play any role in human speciation, and its reinforcement simply fortified the function of FAM75 in some populations relatively recently.
What is the function of FAM75 in human testes? According to the results of immunohistochemistry (the Human Protein Atlas), SOSUI, and TMHMM, we speculate that FAM75 appears to function differently from FAM205A in different cellular sites. Because FAM75 is likely located in acrosomes (Figure 7), this protein may be involved in the process of fertilization. A possibility is that FAM75 confers human specificity to prevent cross-species fertilization with ancestral species. The FAM75/FAM205A genomic locus in humans has an additional two alternative splicing products, which were not pursued in the present study, and orangutans and mice appeared to have three transcripts from the same locus. It is tempting to speculate that this locus partly contributes to speciation in primates and other mammals by restricting cross-species fertilization in ancestral species.
Molecularly, the main function of FAM75 may be located in the "FAM75 domain" located at the N-terminal side of the molecule (Figure 8), but because WRWSH is located in a hydrophilic region at the C-terminal side of the molecule, this hydrophilic region may function in human specificity. Indeed, the conserved regions are mostly located at the N-terminal side, probably for the general function of FAM75. The hydrophilic region also coincides with high availability and idiom-cluster regions.
Testis is known to be the fastest-evolving tissue based on gene expression comparisons in mammals, including the great apes [13,14,16-18,79]. This flexibility may reflect diverse species-specific sexual behaviors. Mating is nonselective and frequent in chimpanzees, and only the highest-ranked male can mate in gorillas [80]. These behaviors have been thought to be related to testis-size differences; the chimpanzee has relatively large testes, and the gorilla has small ones [80]. Human testis size lies between these extremes, which may be related to the molecular evolution of FAM75 to modulate sperm development in testes or to withstand moderate sperm competition.
A recent finding that the gene locus for FAM205A is a susceptibility locus for intracerebral hemorrhage (ICH) [81] is somewhat surprising. Either FAM205A or FAM75 may be expressed in cerebral cells at low levels or in restricted regions of the brain. It is tempting to speculate that a pleiotropic protein for both fertilization and brain development, such as FAM75/FAM205A, might have played a role in human evolution. The fact that the FAM205A/FAM75 gene is located not on a sex chromosome but on chromosome 9, despite its expression in the testis, might further suggest its dual role in sexual and nonsexual aspects of human specificity.

Conclusions
Our SCS-based approach identified FAM75, a WRWSH-containing protein, as a candidate human-specific protein. Its uniqueness in humans may have been acquired not only by a point mutation for WRWSH but also by novel alternative splicing. Together with FAM205A, FAM75 is likely expressed in human testis, and its possible expression in acrosomes suggests its potential function in fertilization and thus in human speciation. Its potential pleiotropic function in the brain is also of interest and warrants future investigation.