# GWSBE

**Repository Path**: znengpan/GWSBE

## Basic Information

- **Project Name**: GWSBE
- **Default Branch**: master
- **Created**: 2021-01-04
- **Last Updated**: 2021-01-04

## README

Regenerated rice re-sequencing to evaluate the genome-wide specificity of base editors.

This is the pipeline for this paper.

#### Dependencies

```
perl
bwa
samtools
bedtools
```

## 1. Clean the fastq files using 1.Clean.sh.

Set `R1`, `R2`, `Sample` and `core` at the top of the shell script, then run this command:

```
sh 1.clean.sh
```

## 2. Use bwa to map the reads to the reference genome.

The ZH11 reference genome can be downloaded from http://mbkbase.org/ZH11/.

```
sh 2.bwa.sh
```

The mapping results are written to the tmp_pipe_data directory. The `*.realigned.bam` files are the BAM files used by GATK, Lofreq and Strelka2.

## 3. Call SNVs/indels using GATK, Lofreq and Strelka2.

```
sh 3.gatk.sh     # all BAM files in tmp_pipe_data are used together to call SNVs/indels
sh 4.lofreq.sh   # BAM files are processed one at a time; each sample gets its own VCF in the lofreq directory
sh 5.strelka.sh  # BAM files are processed one at a time; each sample gets its own VCF in the strelka directory
```

## 4. Filter all variants produced by GATK/Lofreq/Strelka2 using the wild-type samples.

First edit the WT.txt file to list the wild-type samples, and make sure the sample names are correct, as follows:

```
cat WT.txt
Sample1
Sample2
Sample3
```

Then use the following commands to filter the raw Lofreq/Strelka2 VCFs. The filtered VCFs are written to files named `*.filter.vcf`.

```
cd Lofreq
perl ../Script/AnaLoVcf.pl
cd ..
cd strelka
perl ../Script/AnaStr2Vcf.pl
cd ..
```

The following step filters the GATK results, which includes filtering regions of repeated sequences by depth to improve SNV accuracy.
The depth thresholds are variables in AnaGATK.pl (`$minDepth`). The script also corrects erroneous bases in the reference genome based on the wild-type samples. In the output file filter.vcf, the third column holds a 0/1 state: 0 means that all wild-type samples match the reference genome, and 1 means that all wild-type samples differ from the reference genome. The filtered VCF is named filter.vcf.

#### The GATK output file is the base format for all further analysis.

```
cd GATK
perl ../Script/AnaGATK.pl
cd ..
```

## 5. Intersection of the filtered GATK, Lofreq and Strelka2 results.

This script reads the `*.filter.vcf` (Lofreq/Strelka2) and filtered.vcf (GATK) files and produces ALLSite.vcf, the intersection file. The output files are:

* ALLSite.vcf --- The intersection VCF file, in the same format as the GATK VCF.
* AllSNP.jiaoji.summary.txt --- SNV statistics for all samples.
* Mutation.*.txt --- Statistics of the different mutation types for each software (JJ in a name means intersection).
* VennALLSITE --- This directory contains the SNVs called by each software for every sample, which can be used to draw Venn diagrams later.

```
perl ./Script/ALLSITE.pl
```

## 6. Obtaining and filtering the regions of on-target sites.

We use BBMap (https://github.com/BioInfoTools/BBMap) to obtain the genomic interval of each sgRNA. Use the following command:

```
bbmap/bbmap.sh in=Target.fa ref=./ZH11_genome_chr.cor.fasta out=bbmap.sam maxindel=100 k=10 slow
```

Then convert the output file to the following format:

```
cat range.txt
qName tName tStrand tStart tEnd
PBE-ACC-T3_P2_JS-2_HP2/0_23 Chr5 1 13098162 13098185
PBE-ACC-T1_ACCB/0_23 Chr5 0 13097961 13097984
PBE-ACC-T2_ACCC/0_23 Chr5 0 13098250 13098273
PBE-WXB_P3_JS-5_HP3/0_23 Chr6 0 1765048 1765071
. . . .
```

The following script can be combined with the ALLSite.vcf generated in step 5 to obtain all on-target sites in the sgRNA regions.
```
perl getGeneRange.pl ALLSite.vcf ./range.txt
```

The output file is ALL.vcf.GeneSample.pos.txt, which contains the on-target information for all sites in all samples.

## 7. Filter out on-target sites and get off-target sites.

Convert the format and run the filter with the following commands:

```
awk '{print $2"\t"$3"\t"$3}' ALL.vcf.GeneSample.pos.txt > target.txt
perl script/ALLSITEnotarget.pl
```

* ALLnotarget.vcf --- The intersection VCF file with the on-target sites removed, in the same format as the GATK VCF.
* AllSNP.jiaoji.summary.txt --- SNV statistics for all samples.
* Mutation.*.txt --- Statistics of the different mutation types for each software (JJ in a name means intersection).
* VennALLnotarge --- This directory contains the SNVs called by each software for every sample, which can be used to draw Venn diagrams later.

#### For PBE-only sites, use the following command.

The output format is the same as above. PBE.vcf contains only the PBE off-target sites.

```
perl script/JiaoJiPBE.pl
```

#### For ABE-only sites, use the following command.

ABE.vcf contains only the ABE off-target sites.

```
perl script/JiaoJiABE.pl
```

## 8. Find sgRNA-like off-target edits

We find the sgRNA-like off-target edits using Cas-OFFinder, downloaded from http://www.rgenome.net/cas-offinder/portable, following the steps on the website. The following input file is used to find sites with no more than 5 mismatches relative to each sgRNA:

```
C:\cas-offtarge\ZH11_genome_chr.cor.fasta
NNNNNNNNNNNNNNNNNNNNNGG
TTCCTCGTGCTGGACAAGTG 5
AGCCATGGGAATGTAGACAA 5
TCCACAGCTATCACACCCAC 5
TCCTCGGTACGACCAGTACA 5
CAGGTCCCCCGCCGCATGAT 5
CGGCGACGGCGAGCAAGTGG 5
TAGCACCCATGACAATGACA 5
ACTAGATATCTAAACCATTA 5
CATAGCACTCAATGCGGTCT 5
CCTTGAATGCGCCCCCACTT 5
AGCACATGAGAGAACAATAT 5
```
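To illustrate the mismatch criterion behind the `5` on each sgRNA line, the Python sketch below counts the positions at which a candidate site differs from an sgRNA of the same length and checks the count against the limit. It is only a minimal illustration: the function names `count_mismatches` and `is_sgrna_like` and the example candidate sequence are hypothetical, ambiguous bases such as `N` are not treated specially, and the genome-wide search itself is done by Cas-OFFinder, not by this code.

```python
# Hypothetical helper names; not part of the GWSBE scripts or of Cas-OFFinder.


def count_mismatches(guide: str, site: str) -> int:
    """Count positions where two equal-length DNA sequences differ (case-insensitive)."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return sum(1 for g, s in zip(guide.upper(), site.upper()) if g != s)


def is_sgrna_like(guide: str, site: str, max_mismatches: int = 5) -> bool:
    """Return True when the site differs from the guide at no more than max_mismatches positions."""
    return count_mismatches(guide, site) <= max_mismatches


if __name__ == "__main__":
    guide = "TTCCTCGTGCTGGACAAGTG"      # first sgRNA from the input file above
    candidate = "TTCCTCGTGCTGGACAAGAA"  # made-up site that differs at the last two bases
    print(count_mismatches(guide, candidate), is_sgrna_like(guide, candidate))
```

A check of this kind can be handy for confirming by hand that a site reported as sgRNA-like really falls within the mismatch limit.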