Comparison of pre-processing methods and variant calling methods in whole exome sequencing analysis


Yan Jin, Pan Qi, Ren Hong

(Institute for Viral Hepatitis, Chongqing Medical University, Chongqing 400010)

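【Abstract】 Objective: To study how different pre-processing methods and variant filtering methods affect variant calling in exome data analysis.

Methods: FASTX-Toolkit and Trimmomatic were used as pre-processing methods. Together with different strategies for keeping or discarding the unpaired reads (single-end reads) left after trimming, and with hard filtering and variant quality score recalibration (VQSR) as variant filtering methods, they were applied to variant calling on two whole exome datasets. Their effects were compared by depth of coverage (DP), number of variants called, Ti/Tv ratio and genotype concordance.

Results: The depth of coverage of reads pre-processed with Trimmomatic was close to that of the raw, unprocessed data and clearly higher than that of reads pre-processed with FASTX-Toolkit. At DP ≥10× and genotype quality (GQ) ≥20, more SNVs were called after Trimmomatic pre-processing than after FASTX-Toolkit pre-processing, and the number was close to that of the unprocessed group. When single-end reads were included, the increase in called SNVs was larger in the FASTX-Toolkit group (28%) than in the Trimmomatic group (5%). With a small sample size, hard filtering removed fewer SNVs than VQSR in all test groups.

Conclusion: Trimmomatic trims (filters) raw sequences more gently, whereas FASTX-Toolkit may over-filter the raw data. Keeping single-end reads benefits downstream variant calling. Hard filtering is more tolerant than variant quality score recalibration.

[Keywords] whole exome sequencing; pre-processing; variant calling

[CLC number] R857.3; Q344+.12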

Comparison of methods for pre-processing and variants filtering in analyzing whole exome sequencing data

Yan Jin, Pan Qi, Ren Hong

(Institute for Viral Hepatitis, Chongqing Medical University, Chongqing 400010, China)

【Abstract】 Objective: To investigate how different methods for pre-processing and variant filtering affect variant calling.

Methods: Using whole exome sequencing data from two test samples, we compared FASTX-Toolkit and Trimmomatic for pre-processing exome data, strategies for including single-end reads, and 'hard' filtering versus variant quality score recalibration (VQSR) for variant filtering. The comparison was based on depth of coverage, number of variants, Ti/Tv ratio and non-reference concordance.

Results: Reads pre-processed by Trimmomatic showed a depth of coverage similar to that of reads without pre-processing and significantly greater than that of reads pre-processed by FASTX-Toolkit. With depth of coverage ≥10× and genotype quality ≥20, the number of SNVs called after Trimmomatic pre-processing was greater than after FASTX-Toolkit pre-processing and similar to that without pre-processing. With the inclusion of single-end reads, the number of variants increased more for FASTX-Toolkit pre-processing (~28%) than for Trimmomatic pre-processing (~5%). In all settings, 'hard' filtering removed fewer SNVs than VQSR filtering at this small sample size.

Conclusion: Trimmomatic trimmed and/or filtered sequence reads moderately, whereas FASTX-Toolkit appeared to over-filter them. Keeping the single-end reads benefits downstream variant calling. 'Hard' filtering was more tolerant than VQSR filtering.

Keywords: whole exome sequencing, pre-processing, variant filtering

Supported by the General Program of the National Natural Science Foundation of China (0318, 30930082), the National Science and Technology Major Project (2008ZX10002-006, 2012ZX10002007), and the Foundation for Sci & Tech Research Project of Chongqing (cstc2012gg-yyjsB10007)

Corresponding author: Ren Hong, Tel: 023-63693029, E-mail: renhong0531@vip.sina.com

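Since the advent of whole exome sequencing, researchers have used it to uncover the causes of many Mendelian diseases[1]. With the rapid development of sequencing technology in recent years, second-generation sequencing has matured and its cost has gradually fallen, so whole exome sequencing is being adopted by more and more laboratories and clinical testing services. The throughput and sequencing depth of second-generation sequencing have increased substantially, bringing a higher base-calling rate, but they also pose a major challenge for bioinformatics. Because the technology is developing rapidly and the field remains divided, there is still no generally accepted, standard quality-control procedure for second-generation sequencing data. Whether raw reads need quality control, and which factors affect variant calling, both remain unsettled. This study compares how different pre-processing methods, strategies for keeping or discarding the unpaired reads (single-end reads) produced by pre-processing, and variant filtering methods affect depth of coverage (DP), the number of variants called, the Ti/Tv ratio and genotype concordance in whole exome sequencing data analysis.

1. Materials and methods

1.1 Samples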

We used two whole exome datasets sequenced in the 1000 Genomes Project (NA12878 and NA18967, http://depts.washington.edu/swansonw/Swanson_Lab/Data.html) as samples for comparison. Both samples were captured with NimbleGen SeqCap EZ Exome probes (v1.0) and sequenced on an Illumina Genome Analyzer IIx using 76-bp paired-end sequencing[2], generating 13.4 GB and 17.0 GB of FASTQ data, respectively.

1.2 Whole exome sequencing analysis workflow

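Following the framework for second-generation sequencing variant discovery and genotyping recommended by GATK[3, 4], we designed the whole exome analysis workflow for this study (Figure 1), which consists of three stages.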

Figure 1. Study design and whole exome data analysis workflow
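Two of the comparison metrics used in this study, the DP ≥10× / GQ ≥20 cut-off and the Ti/Tv ratio, can be illustrated with a minimal Python sketch. The `SNV` record, the function names and the example calls below are hypothetical and exist only for illustration; they are not part of the GATK-based workflow itself, which would read these fields from VCF files.

```python
from typing import Iterable, NamedTuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


class SNV(NamedTuple):
    """Hypothetical minimal record for one called single-nucleotide variant."""
    ref: str  # reference base
    alt: str  # alternate base (assumed to differ from ref)
    dp: int   # read depth at the site (DP)
    gq: int   # genotype quality (GQ)


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def passes_thresholds(snv: SNV, min_dp: int = 10, min_gq: int = 20) -> bool:
    """Apply the DP >= 10 and GQ >= 20 cut-off used when counting SNVs."""
    return snv.dp >= min_dp and snv.gq >= min_gq


def titv_ratio(snvs: Iterable[SNV]) -> float:
    """Number of transitions divided by number of transversions."""
    ti = tv = 0
    for s in snvs:
        if is_transition(s.ref, s.alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; Ti/Tv is undefined")
    return ti / tv


# Made-up example calls, not data from the study.
calls = [
    SNV("A", "G", 35, 99),  # transition, kept
    SNV("C", "T", 12, 45),  # transition, kept
    SNV("G", "A", 8, 60),   # dropped: DP below 10
    SNV("A", "C", 20, 99),  # transversion, kept
    SNV("T", "C", 15, 15),  # dropped: GQ below 20
]
kept = [s for s in calls if passes_thresholds(s)]
print(len(kept), titv_ratio(kept))  # prints: 3 2.0
```

The sketch applies the thresholds before computing Ti/Tv, so both the variant count and the ratio describe the same filtered call set.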