近日,求臻医学的生信团队收到了公司遗传咨询的同事提出的“紧急需求”,在时间紧,任务重的情况下,生信团队通过团队协作,最终成功达成需求!
>>>>
需求
鉴于通过Annovar给出的变异注释,涉及到的转录本太多,要求减少转录本,遗传咨询部门更关注经典转录本的注释。
>>>>
解决方案
1
通过查询对于经典转录本life平台对于经典转录本的定义
为了保持引述的原汁原味,英文说明如下:
In UCSC's curated canonical transcripts set, there were some discrepancies where therewere multiple transcripts for a given gene. We removed the shorter transcriptsand kept the transcript with the "appris_principal_1" tag in the Ion Reporter™ Software refseq canonical transcript set. Thistag is used by gencode to mark the primary transcript of a particular gene. Inthe cases, where the appris_principal_1 was missing, we chose the longesttranscript in order to have only one canonical transcript for a given gene. Incases where two transcripts were the same length, we used alphabetical order.
The UCSC transcripts forthe seven genes in this table are replaced with the transcripts in the Newtranscript column in Ion Reporter™ Software[1].
2
通过查询ACMG该指南中推荐的是LRG数据库
该数据[2]只涉及到了1000多个基因,而且一个基因往往对应多个经典转录本。下一个问题是如何对这些转录本进行排序,并输出一个最最经典的转录本呢?
3
通过与生信小伙伴们的交流
小编决定在线爬取clinvar数据库中所有涉及到的变异信息,并统计变异信息所关联的转录本在clinvar数据库中出现的频次,对转录本进行排序。
>>>>
成果分享
经过实践证明,该方法已被遗传咨询的同事认可。最后,小编本着大公无私的精神,再次呈上代码,分享给大家:
import requests
from bs4 import BeautifulSoup
import os
from multiprocessing import Pool
import re
clinvar="/data/Database/clinvar/2019.7.20/clinvar.vcf"
gene={}
dict={}
if os.path.exists("knownCanonical.tsv"):
file=open("knownCanonical.tsv","r")
for line in file:
line =line.strip()
array=line.split("\t")
dict[array[0]]=1
file.close()
###########################################
infile=open(clinvar,"r")
id=[]
for line in infile:
line = line.strip()
if not line.startswith("#"):
array = line.split("\t")
if array[2] not in dict:
id.append(array[2])
infile.close()
def dict2d(dict, key_a, key_b, val):
if key_a in dict:
dict[key_a].update({key_b: val})
else:
dict.update({key_a: {key_b: val}})
def run(ID):
html="https://www.ncbi.nlm.nih.gov/clinvar/variation/%s"%(ID)
try:
res = requests.get(html)
ret = res.text
soup = BeautifulSoup(ret, 'html.parser')
result=soup.find('h2',class_='usa-color-text usa-color-text-white blue-box').text.strip()
outfile=open("knownCanonical.tsv",'a+')
outfile.write("%s\t%s\n"%(ID,result))
outfile.close()
except:
print(ID)
if __name__=="__main__":
pool = Pool(processes=50)
pool.map(run, id)
print("done")
infile = open("knownCanonical.tsv", 'r')
trans=[]
dict={}
gene=[]
counts={}
gene2trans={}
for line in infile:
if re.search(r'NM_',line):
line=line.strip()
array=line.split("\t")
p=re.compile(r'[(](\S+)[)]')
a=p.findall(array[1])
if a==[]:
continue
if not a[0] in gene:
gene.append(a[0])
b=array[1].split(".")
trans.append(b[0])
if b[0] in counts:
counts[b[0]]+=1
else:
counts[b[0]]=1
if a[0] in gene2trans:
if b[0] not in gene2trans[a[0]]:
gene2trans[a[0]].append(b[0])
else:
gene2trans[a[0]]=[]
gene2trans[a[0]].append(b[0])
dict2d(dict,a[0],b[0],counts[b[0]])
infile.close()
print("gene counts %s"%(len(gene)))
outfile=open("knownCanonical.final.tsv", 'w')
for a in gene:
outfile.write("%s"%(a))
tmp=[]
for b in gene2trans[a]:
tmp.append(dict[a]["%s"%(b)])
set1=sorted(list(set(tmp)),reverse=True)#去重复并转化为list
for key in set1:
for b in gene2trans[a]:
if dict[a][b]==key:
outfile.write("\t%s(%s)" % (b,dict[a][b]))
outfile.write("\n")
outfile.close()
参考链接:
1.https://ionreporter.thermofisher.com/ionreporter/help/GUID-A5502535-2C81-46FD-93C2-50FCB1F02CFD.html
2.http://www.lrg-sequence.org