The role of transposable elements in neurological disorders
Diseases of the nervous system affect nearly 25% of the population. Genome-wide association studies (GWAS) of single nucleotide polymorphisms (SNPs) have identified hundreds of loci associated with diseases, but these rarely determine causative variants. This is likely because most GWAS associations are SNPs in intergenic sequences and only serve as genetic tags for more significant mutations such as insertions and deletions. Transposable elements (TEs) represent 50% of human DNA and by “copy-and-paste” mechanisms they integrate throughout the genome. TEs can change gene expression, truncate proteins, create genomic deletions/rearrangements and cause diseases. More recently, widely reported TE insertions in brain tissues and the associated phenotypes in model systems have led to the proposal that somatic integrations increase risks of schizophrenia, multiple sclerosis, ALS, and neurodegeneration. Our research focuses on determining the contribution of germline TE insertions to the development of neurological disorders, a question which has not yet been explored. To do this, we screened and prioritized TEs associated with disease traits using the GWAS catalog and 18,000 polymorphic TEs identified by the 1000 Genomes Project. We calculated and ranked the levels of genetic association between 615 neuronal disease GWAS hits and polymorphic TEs. This identified 63 TEs highly associated with a GWAS hit. We modeled whether the TE location can affect protein expression, alter splicing, and truncate proteins using data from the ENCODE, Roadmap Epigenomics and GTEx expression platforms. Using stringent criteria, we identified 19 TEs strongly associated with a disease where the TE is located in regulatory sequence predicted to alter expression of an adjacent gene. As an example, we find a polymorphic Alu tightly linked (r² = 0.79) to a schizophrenia TAS where the TE is within an enhancer active in fetal brain.
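The level of genetic association between a polymorphic TE and a GWAS SNP, reported above as an r² value, can be illustrated with a minimal sketch. It assumes phased haplotypes coded 0/1 (TE absent/present, SNP reference/alternate allele) and uses the standard pairwise linkage disequilibrium definition. The function name and the toy haplotypes are hypothetical and are not taken from the analysis described here.

```python
def ld_r_squared(te_alleles, snp_alleles):
    """Return r^2 between two biallelic loci from phased haplotypes coded 0/1.

    te_alleles[i] and snp_alleles[i] must describe the same haplotype i.
    """
    if len(te_alleles) != len(snp_alleles) or not te_alleles:
        raise ValueError("need two non-empty haplotype lists of equal length")

    n = len(te_alleles)
    p_te = sum(te_alleles) / n      # frequency of the TE insertion allele
    p_snp = sum(snp_alleles) / n    # frequency of the SNP alternate allele
    # frequency of haplotypes that carry both the TE and the alternate allele
    p_both = sum(1 for t, s in zip(te_alleles, snp_alleles) if t == 1 and s == 1) / n

    d = p_both - p_te * p_snp       # linkage disequilibrium coefficient D
    denom = p_te * (1 - p_te) * p_snp * (1 - p_snp)
    if denom == 0:
        raise ValueError("r^2 is undefined when either locus is monomorphic")
    return d * d / denom


# Toy example (hypothetical haplotypes): 8 haplotypes, TE on 3, SNP allele on 2.
te = [1, 1, 1, 0, 0, 0, 0, 0]
snp = [1, 1, 0, 0, 0, 0, 0, 0]
r2 = ld_r_squared(te, snp)
```

An r² close to 1 means the SNP is a near-perfect tag for the TE insertion, while values near 0 mean the two variants segregate independently.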
To determine the impact of representative Alus on gene expression, we cloned them into a luciferase expression system in neuronal stem cells (NSCs). We generated two clones, one with the insertion and one without, both containing the same ~200 bp encompassing the insertion site. We found that Alus associated with ALS and schizophrenia significantly alter luciferase expression. To address the contribution of these TEs to expression in vivo, we will edit the genome of iPSC lines using CRISPR/Cas9 so that the only difference between the cells is the presence or absence of the TE, and then identify genes that are differentially expressed due to the TE.
LEDGF/p75 interacts with mRNA splicing factors and targets HIV-1 integration to highly spliced genes.
The promise of immunotherapy of cancer and treatment of other diseases with gene therapy relies on retroviral vectors to stably integrate the corrective/therapeutic sequences in the genomes of the patient’s cells. First generation gene therapy used vectors derived from gamma retroviruses that were successful in correcting X-linked severe combined immunodeficiency (SCID-X1). However, the integration pattern had a bias for promoter sequences that resulted in the activation of proto-oncogenes and progression to T-cell leukemia. These adverse outcomes led to the use of lentivirus vectors for recent gene therapy treatments. This switch to HIV-1 based vectors has occurred despite a fundamental lack of information about integration levels at specific genes including proto-oncogenes. Structural and biochemical data show that HIV-1 IN interacts with the host factor LEDGF/p75, and this interaction favors integration in the actively transcribed portions of genes (transcription units). However, little is known about how LEDGF/p75 recognizes transcribed sequences and whether cancer genes are favored.
To measure integration levels in individual transcription units and to identify the determinants of integration site selection, we generated a high-density map of the integration sites of a single-round HIV-1 vector in HEK293T cells. Improvements in sequencing methods allowed us to map 961,274 independent integration sites; most of these sites occurred in just 2,000 transcription units. Importantly, the 1,000 transcription units with the highest numbers of integration sites were highly enriched for cancer-associated genes, which raised concerns about the safety of using lentivirus vectors in gene therapy.
Analysis of the integration site densities in transcription units (integration sites per kb) revealed a striking bias that favored transcription units that produced multiple spliced mRNAs and transcription units that contained high numbers of introns. These correlations were independent of transcription levels, size of transcription units, and length of the introns. Analysis of previously published HIV-1 integration site data showed that integration density in transcription units in mouse embryonic fibroblasts also correlated strongly with intron number, and this correlation was absent in cells lacking LEDGF. These data suggest that LEDGF/p75 not only tethers HIV-1 integrase to chromatin of active transcription units but also interacts with mRNA splicing factors. To test this possibility, our collaborators used tandem MS to identify cellular proteins from nuclear extracts of HEK293T cells that interacted with GST-LEDGF/p75. The proteomic experiments found that LEDGF/p75 interacted with many components of the splicing machinery including U2 snRNP (SF3B1, SF3B2, and SF3B3), U2 associated proteins (PRPF8 and U2SURP), a factor of the U5 snRNP (SNRNP200) and many hnRNPs that are associated with alternative splicing. This broad range of interactions with splicing factors suggested LEDGF/p75 might contribute to splicing reactions. To test this, we performed RNAseq on HEK293T cells that were altered with TALEN endonucleases to truncate or delete the gene for LEDGF/p75, PSIP1. Analysis of the 11,000 transcription units that produced two or more spliced mRNA products showed that bi-allelic deletion of LEDGF/p75 significantly changed the ratio of spliced products of 4,305 transcription units. These results, together with our finding that integration in highly spliced transcription units was dependent on LEDGF, provide strong support for a model in which LEDGF/p75 interacts with splicing machinery and directs integration to highly spliced transcription units.
Ultra high throughput sequencing of transposon integration provides a saturated profile of target activity in Schizosaccharomyces pombe
Because diseases such as AIDS and leukemia are caused by retroviruses, there is an intense need to understand the mechanisms of retrovirus replication. One of our objectives is to understand how retroviral cDNAs are integrated into the genome of infected cells. Because of their similarities to retroviruses, long terminal repeat (LTR)-retrotransposons are important models for retrovirus replication. The retrotransposon under study in our laboratory is the Tf1 element of the fission yeast Schizosaccharomyces pombe. We are particularly interested in Tf1 because its integration exhibits a strong preference for pol II promoters. This choice of target sites is similar to the integration preferences that human immunodeficiency virus 1 (HIV-1) and murine leukemia virus (MLV) have for pol II transcription units. Currently, it is not clear how these viruses recognize their target sites and perform integration. We therefore study the integration of Tf1 as a model system with which we hope to uncover mechanisms general to the selection of integration sites. Sequencing methods first developed to map Tf1 integration have now been applied to study the integration of HIV-1 in cultured human cells. These results provided important information about the role of mRNA splicing in HIV-1 integration.
The result that integration in the genome of S. pombe is directed to the promoters of genes raises several key questions about the biology of Tf1 integration. Are all promoters recognized equally, or is integration directed to specific sets of promoters? If specific sets of promoters are preferred targets, what distinguishes the preferred promoters from those not recognized by Tf1? To address these questions, large numbers of integrations throughout the genome of S. pombe were sequenced. New methods for ultra high throughput sequencing made it possible to characterize extraordinarily large numbers of integration events.
Altogether we obtained 599,760 high quality sequence reads that were then analyzed with BLAST to determine the chromosomal location of the insertions. In all there were 73,125 independent Tf1 integration events in unique positions of the S. pombe genome. The BLAST results of sequences from our first integration library identified 21,848 independent insertions in the experiment termed Hap_Mse_2. The insertions were broadly distributed across all three chromosomes. To examine the insertion data for preferences, all 21,848 insertion sites from the Hap_Mse_2 experiment were mapped relative to ORFs, and the distance from each insertion to the closest ORF was determined. The integration from Hap_Mse_2 showed a clear preference for the first 500 nt upstream of ORFs.
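The mapping of insertions relative to ORFs can be sketched as follows. This is a minimal illustration, assuming ORFs are given as hypothetical (start, end, strand) tuples with 1-based inclusive coordinates on a single chromosome; the function names and the toy coordinates are assumptions for the example and do not reproduce the pipeline used for the Hap_Mse_2 data.

```python
def distance_to_orf(pos, orf):
    """Distance in nt from an insertion position to the nearest edge of an ORF (0 if inside)."""
    start, end, _strand = orf
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0


def closest_orf(pos, orfs):
    """Return the ORF whose interval lies closest to the insertion position."""
    return min(orfs, key=lambda orf: distance_to_orf(pos, orf))


def position_relative_to_orf_start(pos, orf):
    """Signed offset from the ORF start codon; negative values are upstream.

    For '+' strand ORFs the start codon is at `start`; for '-' strand ORFs it is at `end`.
    """
    start, end, strand = orf
    return pos - start if strand == "+" else end - pos


def in_upstream_window(pos, orfs, window=500):
    """True if the insertion lies within `window` nt upstream of its closest ORF."""
    rel = position_relative_to_orf_start(pos, closest_orf(pos, orfs))
    return -window <= rel < 0


# Toy example (hypothetical coordinates): one ORF on each strand.
orfs = [(1000, 2000, "+"), (5000, 6000, "-")]
insertions = [800, 6300, 2100, 1500, 300]
flags = [in_upstream_window(p, orfs) for p in insertions]
```

Taking strand into account matters here: position 6300 lies at a higher coordinate than the second ORF but is upstream of it, because that ORF is transcribed from the minus strand.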
76% of all the insertion events occurred in just 20% of the intergenic sequences. This strong bias is a consequence of the integration preference for a specific set of promoters. One possibility was that Tf1 integrated into the promoters with the highest transcription activity. We tested this hypothesis but found no correlation between transcription and integration. In another effort to determine what distinguishes promoters that had high levels of insertions from the promoters that did not, we asked whether the genes associated with the targeted promoters contributed to specific classes of biological function. The results of the gene ontology analysis suggested that genes regulated by environmental stress were among the strongest targets of integration. To examine this further we sorted all the intergenic sequences from highest number of insertions to the lowest using the Hap_Mse_2 data. Using this order, the intergenic regions were placed into bins of 500 each. We then used published microarray data to tabulate how many of the intergenic regions in each bin contained promoters that are induced at least three-fold by conditions of stress. The bin containing the 500 intergenic regions with the most integration contained the highest number of genes induced by cadmium. The bins with successively lower amounts of integration contained fewer promoters that are induced by cadmium. This relationship indicates that integration has a preference for promoters that are induced by cadmium. Similar preferences were observed for genes induced when cells are treated with hydrogen peroxide or by heat. Particularly strong preferences for integration into promoters induced by MMS or sorbitol were observed for the first bin of 500 intergenic regions. The targeting of Tf1 to stress induced promoters represents a unique response that may function to specifically alter expression levels of stress response genes. 
Although there are no systematic data, integration of Tf1 into the promoters of ade6 and bub1 does stimulate transcription.
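The ranking and binning of intergenic regions described above can be sketched as follows. This is a minimal illustration: the bin size of 500 follows the text, but the function name, the dictionary layout, and the toy data are hypothetical, and the set of stress-induced promoters would in practice come from the published microarray data.

```python
def stress_induced_per_bin(insertion_counts, induced_regions, bin_size=500):
    """Count stress-induced intergenic regions in successive bins ranked by insertions.

    insertion_counts: dict mapping intergenic region id -> number of insertions
    induced_regions:  set of region ids whose promoters are induced (e.g. >= 3-fold)
    Returns one count per bin, from the most-targeted bin to the least-targeted.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")

    # Rank regions from the highest number of insertions to the lowest.
    ranked = sorted(insertion_counts, key=lambda region: insertion_counts[region], reverse=True)

    counts = []
    for start in range(0, len(ranked), bin_size):
        bin_regions = ranked[start:start + bin_size]
        counts.append(sum(1 for region in bin_regions if region in induced_regions))
    return counts


# Toy example (hypothetical data) with a bin size of 2 instead of 500.
toy_counts = {"igr_a": 10, "igr_b": 8, "igr_c": 3, "igr_d": 1, "igr_e": 0}
toy_induced = {"igr_a", "igr_b", "igr_d"}
per_bin = stress_induced_per_bin(toy_counts, toy_induced, bin_size=2)
```

A count that falls steadily from the first bin to the last is the pattern described for the cadmium-induced promoters.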
The size and number of the integration experiments reported here resulted in reproducible measures of integration for each intergenic region and ORF in the S. pombe genome. The reproducibility of the integration activity of each intergenic and ORF sequence from experiment to experiment demonstrates that we have saturated the full set of insertion sites that are actively targeted by Tf1. To our knowledge, this is the first time such a profile of integration data has been assembled.
Single nucleotide specific targeting of the Tf1 retrotransposon promoted by the DNA-binding protein Sap1 of S. pombe
Transposable elements constitute a substantial fraction of the eukaryotic genome and, as a result, have a complex relationship with their host that is both adversarial and dependent. To minimize damage to cellular genes, TEs possess mechanisms that target integration to sequences of low importance. However, the retrotransposon Tf1 of Schizosaccharomyces pombe integrates with a surprising bias for promoter sequences of stress response genes. The clustering of integration in specific promoters suggests Tf1 possesses a targeting mechanism that is important for evolutionary adaptation to changes in environment. We found this year that Sap1, an essential DNA binding protein, plays an important role in Tf1 integration. A mutation in Sap1 resulted in a 10-fold drop in Tf1 transposition, and measures of transposon intermediates support the argument that the defect occurred in the process of integration. Published ChIPSeq data of Sap1 binding combined with high-density maps of Tf1 integration that measure independent insertions at single nucleotide positions show that 73.4% of all integration occurred at genomic sequences bound by Sap1. This represents high selectivity since Sap1 binds just 6.8% of the genome. A genome-wide analysis of promoter sequences revealed that Sap1 binding and amounts of integration correlate strongly. More importantly, an alignment of the DNA binding motif of Sap1 revealed that integration clustered on both sides of the motif and showed high levels specifically at positions +19 and -9. These data indicate that Sap1 contributes to the efficiency and position of Tf1 integration.
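The selectivity implied by these two percentages can be made explicit with a short calculation. The sketch below assumes a null model in which insertions fall uniformly across the genome, so that the expected fraction of insertions in Sap1-bound sequence equals the fraction of the genome that Sap1 binds; the function name is hypothetical.

```python
def fold_enrichment(fraction_of_insertions, fraction_of_genome):
    """Observed fraction of insertions in a region class divided by the fraction
    expected if insertions were spread uniformly across the genome."""
    if not 0 < fraction_of_genome <= 1:
        raise ValueError("fraction_of_genome must be in (0, 1]")
    if not 0 <= fraction_of_insertions <= 1:
        raise ValueError("fraction_of_insertions must be in [0, 1]")
    return fraction_of_insertions / fraction_of_genome


# Values from the text: 73.4% of integration in the 6.8% of the genome bound by Sap1.
sap1_enrichment = fold_enrichment(0.734, 0.068)
print(f"Sap1-bound sequence: {sap1_enrichment:.1f}-fold enrichment over uniform")
```

Under this assumption, integration at Sap1-bound sequences is roughly 10.8-fold more frequent than expected by chance.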
A Long Terminal Repeat retrotransposon of Schizosaccharomyces japonicus integrates upstream of RNA pol III transcribed genes
Transposable elements are common constituents of centromeres. However, it is not known what causes this relationship. Schizosaccharomyces japonicus contains 10 families of Long Terminal Repeat (LTR)-retrotransposons, and these elements cluster in centromeres and telomeres. In the related yeast Schizosaccharomyces pombe, the LTR-retrotransposons Tf1 and Tf2 are distributed in the promoter regions of RNA pol II transcribed genes. Sequence analysis of TEs indicates that Tj1 of S. japonicus is related to Tf1 and Tf2, and uses the same mechanism of self-primed reverse transcription. Thus, we wondered why these related retrotransposons localized in different regions of the genome. To characterize the integration behavior of Tj1, we expressed it in S. pombe. We found Tj1 was active and capable of generating de novo integration in the chromosomes of S. pombe. The expression of Tj1 is similar to that of Type C retroviruses in that a stop codon at the end of Gag must be present for efficient integration. Of 17 inserts sequenced, 13 occurred within 12 bp upstream of tRNA genes and 3 occurred at other RNA pol III transcribed genes. The link between Tj1 integration and RNA pol III transcription is reminiscent of Ty3, an LTR-retrotransposon of Saccharomyces cerevisiae that interacts with TFIIIB and integrates upstream of tRNA genes. The integration of Tj1 upstream of tRNA genes and the centromeric clustering of tRNA genes in S. japonicus demonstrate that the clustering of this transposable element in centromere sequences is due to a unique pattern of integration.
Integration profiling: a genome-wide method of measuring gene function.
With the introduction of new deep sequencing technology it is now possible to sequence many millions of transposon insertions in a single experiment. We tested whether Illumina sequencing could be used to generate a dense profile of transposon insertions that would reveal which genes are required for cell growth. For this experiment we used a haploid strain of S. pombe and Hermes, a DNA transposon from the housefly. In previous work we found that the Hermes transposon was highly active in S. pombe and that the insertions did not discriminate against ORFs. We predicted that in actively growing cultures, Hermes insertions would not be tolerated in essential ORFs. This year we induced Hermes transposition in a large culture of S. pombe that was grown for 80 generations. With ligation mediated PCR and Illumina sequencing we were able to sequence 360,513 independent insertion events. On average, this represented one insertion for every 29 bp of the S. pombe genome. An analysis of integration density revealed that the ORFs largely separated into two classes, one with high numbers of insertions and another with much lower numbers. In collaboration with a group that deleted each of the genes of S. pombe, we found that the ORFs with low numbers of Hermes insertions corresponded to the essential genes. The ORFs with higher integration densities were in genes classified as nonessential. These results validated transposon profiling as a new method for identifying genes with essential function. Importantly, by applying specific conditions of selection during growth, this method can be adapted to identify genes that contribute to a wide variety of functions.
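The separation of ORFs into low-density and high-density classes can be sketched as follows. This is a minimal illustration under stated assumptions: insertion density is expressed as insertions per kb of ORF, and the cutoff between the two classes is a hypothetical parameter supplied by the user, since the text does not state how the classes were delimited. The ORF names and counts are toy values.

```python
def insertions_per_kb(insertion_count, orf_length_bp):
    """Insertion density of an ORF, in insertions per kb."""
    if orf_length_bp <= 0:
        raise ValueError("ORF length must be positive")
    return insertion_count * 1000.0 / orf_length_bp


def split_by_density(orfs, threshold_per_kb):
    """Split ORFs into (low_density, high_density) name lists.

    orfs: dict mapping ORF name -> (insertion_count, orf_length_bp)
    threshold_per_kb: hypothetical cutoff; ORFs below it are candidate essential genes.
    """
    low, high = [], []
    for name, (count, length_bp) in orfs.items():
        if insertions_per_kb(count, length_bp) < threshold_per_kb:
            low.append(name)
        else:
            high.append(name)
    return sorted(low), sorted(high)


# Toy example (hypothetical ORFs and cutoff).
toy_orfs = {
    "orfA": (2, 2000),    # 1 insertion per kb
    "orfB": (60, 1500),   # 40 insertions per kb
    "orfC": (0, 900),     # no insertions
    "orfD": (45, 1000),   # 45 insertions per kb
}
candidate_essential, candidate_nonessential = split_by_density(toy_orfs, threshold_per_kb=10.0)
```

Normalizing by ORF length keeps short essential ORFs from being confused with long nonessential ORFs that happen to carry a similar raw number of insertions.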