Introduction

One of the major concerns of the medical sciences is finding the causal genes underlying human diseases. New technologies are being developed, and progress in elucidating the genetic basis of disorders is now one of the most discussed topics in medical genomics. With the advent of next-generation sequencing (NGS) technology, the identification of disease-causing genetic variations is progressing at a rapid pace, which should improve disease management through available treatments, genetic counseling for the children’s health and risk assessment of relatives.1 Molecular diagnosis, carrier detection, prenatal diagnosis and the development of new therapies are concerns for health care and society. In addition, unraveling susceptibility variants opens the way to new interventions and to the prevention of diseases caused by different risk factors.

Generally, genetic disorders are categorized into monogenic and multifactorial disorders. Monogenic (single-gene) disorders include simple and rare disorders. Multifactorial disorders are complex disorders to which multiple genes, as well as lifestyle or environmental factors, contribute. Rare genetic disorders have a low prevalence of at most about 6.5 out of every 10 000 individuals according to the World Health Organization.2, 3, 4, 5 Tremendous efforts are now being made to understand rare monogenic and complex traits and to reveal the genetic basis of disease based on exome (all exons in the genome) sequencing.

Up to 27 January 2012, a total of 21 058 entries had been reported in OMIM (Online Mendelian Inheritance in Man),6 describing 13 790 genes and 4535 disorders with a known molecular basis (http://omim.org/statistics/geneMap). Approximately 1800 entries had phenotypic descriptions or known loci with an unknown molecular basis, and 2000 entries were listed because of a suspected Mendelian basis, mainly with known phenotypes. It should be borne in mind that the study of Mendelian disorders, which explores novel genetic mechanisms, phenotypic variability, modifier genes, allelic variations and genetic variations of diseases, may also provide clues for understanding complex disorders.7, 8 Here, we focus on diseases caused by single genes whose causal variants have been explored by exome sequencing. Sporadic cases are also included in this survey.

From DNA structure report to new NGS reports

Nearly 10 years after the discovery of DNA structure, the first gene was completely sequenced.9 In 1977, Sanger et al.10 and Maxam and Gilbert11 developed the initial sequencing methods (Table 1); since then, the majority of human DNA sequence data has been generated using Sanger sequencing and fluorescence-based electrophoresis technologies. With the development of the revolutionary PCR method by Kary Mullis in the 1980s, the field of molecular genomics underwent enormous advances. Indeed, a growing variety of molecular methods, including high-throughput sequencing technologies, has emerged over the past 7 years. In 2004, second (next)-generation sequencing methods, the massively parallel sequencing platforms, were introduced, and the next revolution in molecular genetics, including the discovery of disease-causing genes, is expected. Most of these platforms rely on sequencing by synthesis, and their clonally clustered amplicons are generated mainly through in situ polonies, emulsion PCR or bridge PCR. NGS technology affords high speed and throughput, producing both qualitative and quantitative sequence data equivalent to the data from the Human Genome Project in 10–20 days. NGS is applied in several different ways to identify causal gene variants in rare diseases, including whole-genome sequencing (WGS), whole-exome sequencing (WES), transcriptome sequencing, methylome sequencing and other sequencing approaches.

Table 1 Landmark events from DNA structure identification to new NGS reports

Miller syndrome was the first rare Mendelian disorder whose causal variants were identified by WES.12 The researchers identified DHODH mutations in three affected pedigrees after filtering against public single-nucleotide polymorphism (SNP) databases and eight HapMap exomes.

An increasing number of reports identify the causal variants of diseases. More than 100 causative genes in various Mendelian disorders have been identified by means of exome sequencing. Up to May 2012, a PUBMED search on Mendelian disorders using exome sequencing revealed 102 diseases, as summarized in Table 2. A total of 326 exomes have been reported in these successful applications of WES in identifying novel causative genes. In all, 61 out of the 108 identified genes (56.5%) follow autosomal recessive inheritance, 40 are transmitted as autosomal dominant (37%), one is X-linked recessive and one follows X-linked dominant inheritance. About 35.2% of these genes (38 out of 108) were identified by WES of only one exome and then confirmed by sequencing the candidate genes in other patients. Overall, sequencing about three exomes per gene (326/108) might be enough to identify the causative gene of a disease (Table 2). Evidently, we have no information about unsuccessful WES studies. In addition to gene discovery in dominant and recessive diseases, WES has been used to determine somatic mutations in tumors and rare mutations with moderate effect in common diseases, as well as for clinical diagnosis.13 As mentioned previously, we focus here on single-gene disorders.
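The proportions quoted above can be rechecked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts given in the text; the variable and function names are illustrative and not part of the original survey.

```python
# Counts quoted in the text for the Table 2 summary (up to May 2012).
total_genes = 108
total_exomes = 326
genes_by_inheritance = {
    "autosomal recessive": 61,
    "autosomal dominant": 40,
    "X-linked recessive": 1,
    "X-linked dominant": 1,
}
single_exome_genes = 38


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Share of identified genes per inheritance mode.
shares = {mode: percent(n, total_genes) for mode, n in genes_by_inheritance.items()}

# Share of genes found with a single exome, and mean exomes per identified gene.
single_exome_share = percent(single_exome_genes, total_genes)
exomes_per_gene = round(total_exomes / total_genes, 2)

print(shares)
print(single_exome_share, exomes_per_gene)
```

The mean number of exomes per gene is obtained by dividing the number of exomes by the number of genes, which gives a value close to three.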

Table 2 Mendelian disease-gene identifications by exome sequencing

Exome sequencing and identifying causal variants

Traditionally, single-gene disorders were first analyzed by linkage analysis14, 15 followed by positional cloning; an informative segregation pattern, a clear mode of inheritance and enough affected family members were needed to support gene identification. Homozygosity mapping establishes the loci of autosomal recessive disorders.16 More complex forms of single-gene disorders with different inheritance modes, such as retinitis pigmentosa17 and hearing loss, have been reported based on SNP arrays. Allelic association studies with a case-control design are suitable for identifying SNPs that are highly associated with complex diseases.

The drawbacks and limitations of these approaches, which have hindered gene discovery, need to be emphasized. Some families have too few affected individuals to meet the criteria required for classical gene-discovery methods. In addition, even in families that fit the criteria, finding the causal gene is very difficult in the presence of variable expressivity, locus heterogeneity, phenotypic heterogeneity, reduced penetrance or reduced fitness, because under these conditions the causal variant hardly co-segregates with affection status within the family. Exome sequencing makes it possible to overcome these obstacles. There may also be several sporadic cases from different families with similar phenotypes, in which exome sequencing can interrogate the causal variants. As shown in Table 2, WES can identify causal variants with a limited number of patients. Indeed, NGS technologies bring new insights into the genetic basis of diseases.

Most pathogenic variants identified thus far are located in highly conserved regions of the genome.18, 19 It is believed that most functional variants are located in the coding exons.20 Most (91.8%) of the functional variants affecting protein-coding genes are nonsense/missense (∼56%), small insertion/deletion (∼24%), splicing (∼10%) and regulatory (∼1.8%) mutations (Human Gene Mutation Database professional 2011).20 Overall, 85% of disease-causing mutations are estimated to be located in protein-coding regions.20, 21 Accordingly, WES could elucidate at least 78% of causative variants.
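The text does not state how the 78% figure is derived. One reading that reproduces it, offered here only as an assumption, is that it is the product of the 91.8% share of functional variant classes and the 85% share of disease-causing mutations in protein-coding regions. A minimal sketch of that arithmetic:

```python
# Approximate shares of mutation classes quoted in the text (HGMD professional 2011).
class_shares = {
    "nonsense/missense": 0.56,
    "small insertion/deletion": 0.24,
    "splicing": 0.10,
    "regulatory": 0.018,
}

# The four classes add up to the 91.8% quoted in the text.
functional_share = sum(class_shares.values())

# Estimated share of disease-causing mutations located in protein-coding regions.
coding_region_share = 0.85

# Assumption: the 78% estimate is the product of the two shares.
wes_detectable = functional_share * coding_region_share

print(f"functional classes: {functional_share:.1%}")
print(f"estimated share reachable by WES: {wes_detectable:.1%}")
```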

Basics of WES

The approach for exome sequencing is based on a probe-hybridization method that captures all exons.22, 23 The whole process is divided into three steps, namely sample preparation, hybridization and sequencing (Figure 1). Briefly, the first step is sample preparation, in which genomic DNA is sheared by nebulization or sonication into fragments of about 250 bp. The fragment ends are repaired (for example, with T4 DNA polymerase and T4 polynucleotide kinase), 3′ A-tailing is performed and paired-end adaptors are ligated to the fragments. The final step of sample preparation is to amplify the prepared library for a few cycles. To enrich the prepared library, it is hybridized with a biotinylated oligonucleotide library (RNA baits, for example, Agilent SureSelect (Santa Clara, CA, USA; 635 250 RNA probes of 120 bp), or DNA baits, for example, NimbleGen22 (Madison, WI, USA; 2.1 million DNA probes of 60–90 bp)), and the hybridized fragments are captured by streptavidin beads. The quality and quantity of the exome library are analyzed by highly sensitive methods, such as the Agilent 2100 Bioanalyzer, before the sequencing step. The exome library is sequenced as paired-end reads, for example on Illumina (San Diego, CA, USA) instruments, to yield 75–100 bases per read. Amplification of surface-bound individual fragments by an isothermal bridge amplification method produces clonal clusters of about 1000 identical molecules per cluster; one fragment is, therefore, attached to one surface oligonucleotide, undergoes cluster generation, and the replicate copies are sequenced to yield one sequence read. As the DNA chain grows, the first step of the sequencing procedure is to detect the newly added fluorescently labeled base (a reversible terminator) by means of a sensitive device such as a charge-coupled device camera. The terminator is then converted to a standard nucleotide by removing the dye. Repeating this cycle determines the following bases sequentially. About 79% of the reported genes were determined using Illumina sequencing machines (Table 2).

Figure 1

Applying the usual filtering to exome-sequencing projects can define novel causal genes for Mendelian disorders; the major assumptions about causal genes at these steps are as follows: (1) structural variants and other forms of genetic variation are ignored, (2) causal variants are coding, (3) causal variants alter the protein sequence and (4) the causal variant has almost complete penetrance. A single exome carries about 20 000–30 000 coding SNPs. Over 95% of the variants overlap with public data sets, depending on ethnicity. Filtering steps narrow down the number of possible disease-associated genes; the final variants are then limited to those fitting the mode of inheritance. A full color version of this figure is available at the Journal of Human Genetics journal online.

After sequencing, the data are processed in three major steps: mapping, variant calling and annotation (Figure 1). The sequence data are aligned with the Burrows–Wheeler Aligner24 against a reference sequence such as hg18 or hg19 (GRCh37). The next step is variant calling; the data generated by the Burrows–Wheeler Aligner in Sequence Alignment/Map (SAM) format can be used by SAMTools,25 the Genome Analysis Tool Kit26 and Picard (http://picard.sourceforge.net). SAMTools is used for quality control, for processing and sorting the short-read alignment files and for variant identification (VarFilter). Indexing the aligned short reads (BAM files, the binary equivalent of the SAM format) allows fast access and is followed by generation of a pileup format to facilitate variant calling. The indexed file is visualized with the Integrative Genomics Viewer27 or other sequence alignment visualization tools. PCR duplicates are removed using Picard MarkDuplicates and SAMTools. Average coverage and depth of coverage are calculated with the Depth of Coverage analysis of the Genome Analysis Tool Kit. ANNOVAR28 is a tool for annotating genetic variants based on their function; the annotation file usually includes the gene name, chromosomal position, nucleotide change, amino-acid change and description, SIFT (sorting intolerant from tolerant)29 and PolyPhen (polymorphism phenotyping)30 values, the single-nucleotide polymorphism database ID, the allele frequency of the SNP in the 1000 Genomes Project and the sequence quality. VAAST (Variant Annotation, Analysis and Search Tool) incorporates amino-acid substitution information with annotation and ranks candidate genes by statistical evaluation, which can be used to list the candidate genes and variants.31 Most investigators filter the data based on the function of the variants. Nearly half of the variants are synonymous; these are not considered deleterious and are usually filtered out. Although there are some reports of a causal effect of synonymous variants,32 the probability is very low. The remaining variants are nonsense, missense, indel and splice mutations, and variants in non-coding RNA transcripts. Approximately 5% of the variants are not reported in the above databases.33 As noted, the annotation is used to explore variants predicted to be pathogenic by bioinformatics tools such as SIFT29 and PolyPhen;30 such variants are expected to disrupt protein function or structure at conserved sites. Depending on what is known about the affected samples, different analytic frameworks are used to define the causal variant (Figure 2).
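The functional filtering described above can be illustrated with a small sketch. It assumes that the variants have already been annotated and exported into simple records; the `Variant` fields, the effect labels and the gene names are hypothetical placeholders and do not correspond to the output format of ANNOVAR or any other tool.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str           # hypothetical gene label
    effect: str         # e.g. "synonymous", "missense", "nonsense", "splice", "indel"
    in_public_db: bool  # True if already listed in a public variant database


# Effect classes kept after removing synonymous variants (an illustrative choice).
RETAINED_EFFECTS = {"nonsense", "missense", "indel", "splice"}


def filter_candidates(variants):
    """Keep novel variants whose predicted effect alters the protein."""
    return [
        v for v in variants
        if v.effect in RETAINED_EFFECTS and not v.in_public_db
    ]


example = [
    Variant("GENE_A", "synonymous", False),  # dropped: synonymous
    Variant("GENE_B", "missense", True),     # dropped: known in public databases
    Variant("GENE_C", "nonsense", False),    # kept
    Variant("GENE_D", "splice", False),      # kept
]

candidates = filter_candidates(example)
print([v.gene for v in candidates])
```

In practice the surviving candidates would still have to be checked against the mode of inheritance and validated by segregation analysis.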

Figure 2

Hypothetical frameworks for analyzing single-gene disorders. Combinational analyses could help to determine the probable causative variant. Family-based (a1, a2 and a3), de novo mutation (b) and X-linked variant analysis (c). A full color version of this figure is available at the Journal of Human Genetics journal online.

Calling variants and the candidate gene

The sequence data are compared with public databases, such as the single-nucleotide polymorphism database,34 the 1000 Genomes Project,35 the Exome Variant Server (http://evs.gs.washington.edu/EVS) and HapMap.36 An individual exome of African-American origin has on average 24 000 single-nucleotide variants, whereas one of European-American origin has a mean of 20 000 single-nucleotide variants.33 Other studies indicate that this number varies depending on ethnicity, capture protocols, sequencing platforms, mapping algorithms and variant calling methods. In total, the number of candidate disease-causing variants is reduced to 100–500 depending on the study design.18, 36, 37, 38, 39 One study reported that each genome carries 165 homozygous protein-truncating variants in diverse pathways.40 Thus, a causal variant cannot be pinpointed directly unless integrative genomic analyses, such as homozygosity mapping, linkage analysis and Sanger sequencing, are performed. However, if we study one family with four affected individuals, or two or three families each with at least two affected individuals, the usual filtering could possibly define a causal gene.41, 42, 43, 44, 45 According to an assumption by Robinson et al.,46 when the same gene is considered causal in multiple sporadic cases and 5% of the target genes (about 20 000) show rare, probably causal variants in any affected individual, nearly 1000 genes would remain as candidates after sequencing one individual and applying the usual filtering. If a second individual is sequenced, only 50 genes (5% of 1000) with variants in both individuals will remain. Sequencing a third affected person predicts less than one gene having a variant in all three affected individuals.46

Frameworks used to analyze single-gene disorders

Here, we illustrate the main approaches for detecting gene variants in Mendelian disorders. Two main approaches, the family-based and the unrelated-individual strategies, are explained (Figure 2).

Family-based studies

When a number of affected individuals within a family are sequenced, the variants shared by the affected members are selected, because these individuals harbor the same causal variant. This strategy narrows down the candidate genes. Non-affected members of the family are sequenced to verify the candidate variants.

Integrative approaches that combine exome data with previous knowledge from homozygosity mapping for recessive disorders (Figure 2a2) and from linkage studies for dominant (Figure 2a3) and recessive disorders help to define the candidate variant. For instance, in a family with recessive inheritance, homozygous regions of the genome detected by SNP array reduce the number of homozygous candidate variants; only variants located within the homozygous regions are retained as potentially pathogenic. Shared homozygous or compound heterozygous variants are used to find the candidate variant (Figures 2a1 and a2). In a study by Sirmaci et al.,47 the cause of Malpuech–Michels–Mingarelli–Carnevale syndrome in two affected families was identified. An autozygous region on chromosome 3q27 was identified, and exome sequencing confirmed that a MASP1 mutation co-segregated with the phenotype.
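Restricting homozygous variants to the regions found by homozygosity mapping can be sketched as a simple interval filter. The record layout, region coordinates and variant positions below are invented for illustration and do not come from the cited study.

```python
def in_any_region(chrom, pos, regions):
    """Return True if (chrom, pos) falls inside one of the (chrom, start, end) regions."""
    return any(c == chrom and start <= pos <= end for c, start, end in regions)


def restrict_to_autozygous(variants, regions):
    """Keep homozygous variants located inside the mapped homozygous regions."""
    return [
        v for v in variants
        if v["genotype"] == "hom" and in_any_region(v["chrom"], v["pos"], regions)
    ]


# Hypothetical homozygous region from a SNP array (coordinates are made up).
homozygous_regions = [("chr3", 1_000_000, 5_000_000)]

# Hypothetical filtered exome variants.
exome_variants = [
    {"gene": "GENE_A", "chrom": "chr3", "pos": 2_500_000, "genotype": "hom"},  # kept
    {"gene": "GENE_B", "chrom": "chr3", "pos": 9_000_000, "genotype": "hom"},  # outside region
    {"gene": "GENE_C", "chrom": "chr3", "pos": 3_000_000, "genotype": "het"},  # heterozygous
    {"gene": "GENE_D", "chrom": "chr7", "pos": 2_500_000, "genotype": "hom"},  # other chromosome
]

kept = restrict_to_autozygous(exome_variants, homozygous_regions)
print([v["gene"] for v in kept])
```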

Linkage studies are informative for families with multiple affected members across several generations and are used in combination with exome sequencing.48 For dominant disorders, a heterozygous variant shared by the affected individuals in a family is sought. Using genome-wide linkage analysis of hereditary diffuse leukoencephalopathy with spheroids, which affects the central nervous system, Rademakers et al.49 identified 233 candidate genes within the chromosome 5q candidate region; exome sequencing revealed a heterozygous variant in CSF1R in this region, and distinct heterozygous mutations in the gene were confirmed in 13 other affected families.

In cases of locus heterogeneity, as in retinitis pigmentosa, osteogenesis imperfecta, hearing loss and so on, different patterns of inheritance may be observed in different families; thus, a precise clinical description and determination of the mode of inheritance help to find the candidate gene. In a study by Abou Jamra et al.,50 a combination of autozygosity mapping and exome sequencing was applied to identify the pathogenic variants causing intellectual disability with a recessive mode of inheritance in eight affected individuals from three consanguineous families. Using this approach, they identified three causative genes encoding subunits of adaptor protein complex 4 within these families.

In X-linked pedigrees (Figure 2c), analysis of X-chromosome variants can be helpful; affected females and males are homozygous and hemizygous, respectively, for X-linked recessive disorders. For example, exome sequencing of all three affected males with an unclassified X-linked lethal congenital malformation syndrome identified a splicing mutation in the OFD1 gene.51

Unrelated individual studies

If there are a number of affected cases that do not belong to the same family (sporadic cases), a common pathogenic gene can be sought among the samples, assuming no locus heterogeneity among the affected individuals. This approach is called the overlap strategy.52 For example, the cause of Schinzel–Giedion syndrome was identified using this strategy.53 Furthermore, Saitsu et al.54 performed exome sequencing in three unrelated individuals affected with congenital hypomyelinating leukoencephalopathy; they found compound heterozygous mutations in POLR3A and POLR3B (encoding RNA polymerase III subunits) in the affected individuals.
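The overlap strategy amounts to intersecting the candidate gene lists of unrelated affected individuals. The sketch below assumes that each individual's variants have already been filtered down to a set of candidate genes; the gene labels are placeholders.

```python
def shared_candidate_genes(per_individual_genes):
    """Return the genes that carry a candidate variant in every individual."""
    gene_sets = [set(genes) for genes in per_individual_genes]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


# Hypothetical candidate genes left after filtering in three unrelated cases.
case_1 = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
case_2 = {"GENE_B", "GENE_C", "GENE_E"}
case_3 = {"GENE_C", "GENE_F"}

overlap = shared_candidate_genes([case_1, case_2, case_3])
print(sorted(overlap))
```

The intersection is only meaningful under the assumption stated in the text, namely that there is no locus heterogeneity among the affected individuals; with heterogeneity, requiring a hit in every case would discard the true gene.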

Kabuki syndrome is a rare disorder worldwide in which the majority of cases are sporadic, although parent-to-child transmission has been reported, indicating a dominant mode of inheritance. WES revealed causal variants in MLL2 in 7 out of 10 families with Kabuki syndrome; follow-up Sanger sequencing of the remaining three families detected MLL2 mutations in two of them, which suggests that this gene is the main cause of the syndrome.39 Other genes, however, may explain the pathogenesis of the condition in the remaining family. This family could be analyzed together with newly ascertained affected families to find the causal variants. In addition, the authors reported that only 26 out of 43 patients had mutations in MLL2 by Sanger sequencing. The clinical and genetic heterogeneity of the syndrome complicates gene finding, and clinically similar cases are helpful for finding new genes.

When only a few cases are available, integrative strategies are needed, depending on which samples can be obtained. When a single affected individual is suspected of having a recessive disorder, homozygous and compound heterozygous variants are searched for, an approach named the double-hit strategy by Gilissen et al.52 WES revealed compound heterozygous mutations in the WDR35 gene in a single sporadic case affected with Sensenbrenner syndrome.55 Also, WES of one of two affected sisters with Perrault syndrome revealed two mutations in HSD17B4, indicating compound heterozygosity.38
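A double-hit search of this kind can be sketched as follows, assuming the filtered variants are available as (gene, zygosity) pairs; the gene labels are placeholders. Two heterozygous variants in one gene are only candidate compound heterozygotes, because exome data from a single individual do not show whether the two variants lie on different alleles.

```python
from collections import defaultdict


def double_hit_genes(variants):
    """Return genes with one homozygous variant or at least two heterozygous variants."""
    het_counts = defaultdict(int)
    hits = set()
    for gene, zygosity in variants:
        if zygosity == "hom":
            hits.add(gene)
        elif zygosity == "het":
            het_counts[gene] += 1
    # Two or more heterozygous variants: candidate compound heterozygosity.
    hits.update(gene for gene, count in het_counts.items() if count >= 2)
    return hits


# Hypothetical filtered variants from a single affected individual.
single_case = [
    ("GENE_A", "het"),
    ("GENE_B", "het"),
    ("GENE_B", "het"),  # second hit in GENE_B
    ("GENE_C", "hom"),
    ("GENE_D", "het"),
]

recessive_candidates = double_hit_genes(single_case)
print(sorted(recessive_candidates))
```

Phase would then be confirmed by testing the parents or other relatives, for example by Sanger sequencing.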

De novo mutations

The previously mentioned frameworks are focused on homogeneous diseases. A substantial number of de novo mutations occur sporadically in cases, mostly fetuses, that do not survive, so the mutations are eliminated from the population; thus, these lethal mutations are not usually identified. The de novo mutation rate is estimated to be 7.6 × 10⁻⁹ to 2.2 × 10⁻⁸ per generation; that is, approximately one in 10⁸ bases per haploid genome is mutated spontaneously,56, 57 and such a mutation could be causal. For de novo mutation analysis, case–parent trios are practical (Figure 2b). Nonpathogenic variants are filtered out, and then the variants present in the parents are excluded. Because sequencing errors and mapping artefacts may occur, the candidates should be confirmed by highly accurate Sanger sequencing.58 Detection of a de novo mutation is not enough to confirm that it causes the disease. Additional analyses, including replication and functional analyses, should be performed to determine the deleterious or causal variants. The pathogenicity of a variant depends not only on the type and location of the mutation but also on its functional effects.19
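The trio filter described above reduces to a set difference: variants seen in the child but in neither parent. The sketch below assumes each exome is represented as a set of (chromosome, position, reference, alternate) tuples; all coordinates are invented for illustration.

```python
def de_novo_candidates(child, mother, father):
    """Return child variants that are absent from both parents."""
    return child - (mother | father)


# Hypothetical filtered variant sets for a case-parent trio.
mother = {("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T")}
father = {("chr1", 1000, "A", "G"), ("chr5", 5000, "G", "A")}
child = {
    ("chr1", 1000, "A", "G"),  # inherited, present in both parents
    ("chr5", 5000, "G", "A"),  # inherited from the father
    ("chr9", 9000, "T", "C"),  # absent from both parents
}

candidates_de_novo = de_novo_candidates(child, mother, father)
print(candidates_de_novo)
```

A variant can also appear to be de novo simply because it was missed in a parent with low coverage at that position, which is one reason the text calls for confirmation by Sanger sequencing.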

Trio-based studies have been employed to identify de novo mutations in rare Mendelian disorders, such as Schinzel–Giedion syndrome.53 In addition, causative genetic factors for heterogeneous disorders, such as intellectual disabilities, have been revealed.18 Trio-based exome sequencing has been demonstrated to be a powerful approach for identifying novel causative genes for sporadic autism spectrum disorders;59 these researchers identified 21 de novo mutations by exome sequencing of 20 sporadic cases.59 More recently, Sanders et al.60 demonstrated, using WES of families affected by autism spectrum disorder, the contribution of de novo mutations in brain-expressed genes to the risk for these disorders. Iossifov et al.61 sequenced and analyzed the exomes of 343 families with a single individual affected by an autism spectrum disorder and at least one unaffected sibling; they found that gene-disrupting mutations, but not missense mutations, are frequent in affected children. In this study, 350–400 genes were estimated to be autism-susceptibility genes.61

Pros and cons

Interpretation of the results obtained by NGS in Mendelian disorders is of major concern. When exome sequencing is used in clinical genetics and medicine, the limitations of the approach are evident, and a careful experimental design is needed to circumvent the drawbacks. Genetic and phenotypic heterogeneity among affected individuals makes exome sequencing data difficult to interpret. Precise clinical examinations and biochemical tests have important roles in distinguishing new syndromes from known ones.

Patients with the same phenotype may not share the same causal variant; indeed, they may have distinct variants in the same gene, which is called allelic heterogeneity. Depending on the clinical information, different strategies or filtering protocols are used to infer the pathogenicity of a variant. Variants in previously reported genes are generally examined in the first step of the filtering process. The possible causal variants can then be validated by testing their segregation within the family or their presence in other cases.

Careful examination of variant calls is important in exome sequencing. False-positive calls arise from sequencing errors related to mechanical and analytical problems; in addition, the short reads generated by NGS may not align perfectly to the appropriate position because of paralogous sequences and low-copy repeats, which may cause errors during calling.62 Misalignment may occur in the repetitive regions of the genome, which could be improved by longer read lengths or a higher depth of coverage in those regions.21 False-negative errors can also occur because of mechanical and analytical problems, such as low coverage and poor capture efficiency. Avoiding false-negative errors is more difficult than avoiding false-positive ones; however, it has been proposed that the error rate could be estimated by comparing the calls with those of test samples that have been characterized previously.21, 46 There is a trade-off between false-positive and false-negative calls: if stringent criteria are set (high base quality, stringent alignment and so on), false-positive errors decrease but false-negative errors increase, which must be taken into account. There are also other problems in the filtering strategies that may influence data analysis. Filtering out variants with a minor allele frequency above 1% may be misleading for recessive disorders, because carriers do not show the disease; the allele can therefore be relatively frequent in the 1000 Genomes Project data and may be excluded in the filtering step because of its frequency.
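The point about recessive alleles can be made concrete with Hardy–Weinberg proportions. This is a simplified illustration that assumes random mating and a single disease allele; it is not a calculation taken from the cited studies.

```python
def hardy_weinberg(allele_frequency):
    """Return expected carrier and affected frequencies for a recessive allele."""
    q = allele_frequency
    p = 1.0 - q
    return {
        "carriers": 2.0 * p * q,  # unaffected heterozygotes
        "affected": q * q,        # homozygotes for the disease allele
    }


# An allele sitting exactly at the 1% frequency threshold.
expected = hardy_weinberg(0.01)

print(f"carrier frequency: {expected['carriers']:.4f}")
print(f"affected: about 1 in {round(1 / expected['affected'])}")
```

Under these assumptions an allele with a frequency of 1% is carried by about 2% of the population, yet homozygotes occur in only about 1 in 10 000 individuals, which is still within the range of a rare disorder.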

Some deleterious variants may be located in non-coding regions of the genome, such as intronic or regulatory regions, and cannot be called by exome sequencing, whereas WGS covers the whole genome. WGS is expected to be applied to disease-gene identification in the near future; however, its current cost and information burden need to be overcome. Compared with WES, the huge amount of data generated by WGS provides information on evolutionarily conserved non-coding regions and on all variants throughout the genome. Filtering and analyzing these data is challenging. Moreover, WGS data analysis takes more time and requires more computational memory.

Conclusion

Exome sequencing has transformed biomedical research. Possible causative genes can be identified directly using these new sequencing technologies. Up to now, the role of more than 100 genes in rare Mendelian disorders has been established by means of WES, and this number is growing rapidly. Combined approaches, including traditional methods and WES, are readily used to define the underlying gene in disorders that follow an autosomal mode of inheritance. Needless to say, new sequencing technologies, such as those of Pacific Biosciences and nanopore-based platforms, will shed light on this field.63, 64 WES could have a critical role in identifying new genes until the cost of WGS decreases.