An Upgraded Crawler for the COSMIC Database

Date: 2019-07-03 Source: 求臻医学

For a given COSMIC ID, the somatic status falls into one of three states:

Yes (e.g. COSM6918278)

No (e.g. COSM6475151)

SNP (e.g. COSM6972367)

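In the previous article we shared a crawler that looks up a single COSMIC entry in real time. With that warm-up done, today we bring out the upgraded version, which can crawl the information for the entire COSMIC database.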

Step 1: Obtain all COSMIC IDs

First we need every COSMIC ID in the database. These can be taken from the VCF file available on the COSMIC download page:

(https://cancer.sanger.ac.uk/cosmic/download)
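As a minimal sketch of this step, the snippet below pulls the ID column out of VCF-formatted text. It assumes the usual tab-separated VCF layout, in which the third column holds the ID (e.g. COSM6918278). The function name `extract_cosmic_ids` and the sample records are illustrative only and are not real COSMIC data.

```python
import io


def extract_cosmic_ids(handle):
    """Yield the ID column (3rd field) of every VCF data line in `handle`."""
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and VCF header/meta lines, which start with '#'.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 3 and fields[2].startswith("COSM"):
            yield fields[2]


# Made-up records in VCF layout, for illustration only.
sample_vcf = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\tCOSM0000001\tA\tT\t.\t.\tGENE=X\n"
    "2\t200\tCOSM0000002\tG\tC\t.\t.\tGENE=Y;SNP\n"
)
ids = list(extract_cosmic_ids(sample_vcf))
print(ids)
```

For the real download, the same function could be given a file handle from `open(path)`, or from `gzip.open(path, "rt")` if the file is compressed.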

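Step 2: Avoid IP bans

Routing requests through a proxy hides your real IP address and is a common way to avoid being banned. We recommend 快代理 (Kuaidaili, https://www.kuaidaili.com/ops/proxylist/1/; either the paid or the free tier works). Proxies come in two types, http and https, so pay attention when choosing, and pick high-anonymity proxies with fast response times. Free proxies in particular stay usable only for a limited time, and not all of them work, so always verify a proxy before using it. Two further ways to avoid a ban are to make requests look as if they come from a browser and to leave a time interval between requests. Once these measures are in place, the crawler can be run in parallel to improve throughput (the program below uses a process pool).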
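The precautions described here can be sketched as three small helpers: one that keeps only the proxies that pass a check, one that picks a random User-Agent and proxy for each request, and one that pauses for a random interval. This is a sketch under assumptions rather than the author's method. All function names are hypothetical, the proxy addresses are placeholders, and `fake_check` stands in for a real test request so that the example runs without network access.

```python
import random
import time


def filter_working_proxies(proxies, check):
    """Return only the proxies for which `check(proxy)` returns True."""
    working = []
    for proxy in proxies:
        try:
            if check(proxy):
                working.append(proxy)
        except Exception:
            # A proxy that raises (timeout, refused connection) is treated as dead.
            continue
    return working


def request_settings(user_agents, proxies, rng=random):
    """Pick a random User-Agent and proxy for one request."""
    proxy = rng.choice(proxies)
    return {
        "headers": {"User-Agent": rng.choice(user_agents)},
        "proxies": {"https": "https://" + proxy},
    }


def polite_sleep(min_s=1.0, max_s=3.0, rng=random, sleep=time.sleep):
    """Pause for a random interval between requests and return its length."""
    delay = rng.uniform(min_s, max_s)
    sleep(delay)
    return delay


# Stand-in check for illustration; a real one would send a test request
# through the proxy, e.g. requests.get(url, proxies=..., timeout=5).
def fake_check(proxy):
    if proxy.endswith(":0"):
        raise ConnectionError("unreachable")
    return proxy.startswith("10.")


candidates = ["10.0.0.1:8080", "192.0.2.7:3128", "10.0.0.2:0"]
alive = filter_working_proxies(candidates, fake_check)
settings = request_settings(["UA-one", "UA-two"], alive)
waited = polite_sleep(0.0, 0.01)
print(alive, settings, waited)
```

The returned `settings` dictionary is shaped so that it could be unpacked into a `requests` call as keyword arguments. The delay bounds are arbitrary and would need tuning to the site being crawled.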

The code is as follows:

import os
import random
import re
import time
from multiprocessing import Pool

import requests
import urllib3
from bs4 import BeautifulSoup

# Requests are sent with verify=False, so silence the resulting warnings.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
requests.adapters.DEFAULT_RETRIES = 10  # intended to raise the number of connection retries

#############################

# Collect every position covered by the panel BED files, keyed as "chrom_pos".
site = {}
bedfile = ["DHS-3501Z.roi.bed", "panel_27.bed", "panel_599.bed", "TST500C_manifest.bed"]
for i in bedfile:
    with open(i, "r") as infile:
        for line in infile:
            array = line.strip().split()
            for k in range(int(array[1]), int(array[2]) + 1):
                tmp = array[0] + "_" + str(k)
                site[tmp] = 1
print("There are %s sites in total" % (len(site)))

############################## Set the User-Agent request header to a real browser's User-Agent to imitate browser access
user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
]
vcf = "all_COSMICID.txt"  # VCF-format file holding all COSMIC IDs (Step 1)
proxy_type = ['http', 'https']
#proxy_http=['117.90.0.225:9000']
proxy_https = ['166.111.82.202:8118', '60.13.42.107:9999', '119.57.105.25:53281']

# IDs already written to cosmic.tsv by an earlier run are skipped, so the crawl can resume.
done = set()
if os.path.exists("cosmic.tsv"):
    with open("cosmic.tsv", "r") as outfile:
        for line in outfile:
            array = line.strip().split()
            if array:
                done.add(array[0])
###############################
numID = []  # numeric IDs that still have to be looked up online
p1 = re.compile(r'SNP')
p2 = re.compile(r'(\d+)')
with open(vcf, "r") as infile, open("cosmic.tsv", "a+") as outfile:
    for line in infile:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        array = line.split("\t")
        tmp = "chr" + array[0] + "_" + array[1]
        # Keep only positions inside the panels that have not been processed yet.
        if tmp not in site or array[2] in done:
            continue
        cosmic_num = p2.findall(array[2])
        if p1.findall(array[-1]) == []:
            numID.append(cosmic_num[0])
        else:
            # The last column already carries the SNP flag, so no request is needed.
            outfile.write("COSM%s\tSNP\n" % (cosmic_num[0]))
print("Total %s COSMIC entries to query" % (len(numID)))

def run(cosmic_num):
    url = 'https://cancer.sanger.ac.uk/cosmic/mutation/overview?genome=37&id=%s' % (cosmic_num)
    headers = {'User-Agent': random.choice(user_agents)}  # pick a random User-Agent
    #if random.choice(proxy_type)=='http':
    #    ip={'http':'http://'+random.choice(proxy_http)}
    #else:
    ip = {'https': 'https://' + random.choice(proxy_https)}
    s = requests.session()
    s.keep_alive = False  # close surplus connections
    try:
        res = s.get(url, headers=headers, verify=False, proxies=ip)
        soup = BeautifulSoup(res.text, 'html.parser')
        dbsnp = soup.find_all(string=re.compile("has been flagged as a SNP."))
        dt = soup.find_all('dt')
        dd = soup.find_all('dd')
        status = None
        # The answer sits in the <dd> paired with the "Ever confirmed somatic?" <dt>.
        for i in range(len(dt)):
            if dt[i].string == "Ever confirmed somatic?":
                status = dd[i].get_text()
                break
        if status is None and dbsnp != []:
            status = "SNP"
        if status is not None:
            # Open, write, and close once, so nothing is written to a closed file.
            with open("cosmic.tsv", "a+") as outfile:
                outfile.write("COSM%s\t%s\n" % (cosmic_num, status))
            print("%s id done." % (cosmic_num))
    except Exception:
        print("%s id error." % (cosmic_num))

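if __name__ == "__main__":
    start = time.time()
    # The guard keeps worker processes from starting the pool again on import.
    with Pool(processes=200) as pool:
        pool.map(run, numID)
    end = time.time()
    print("Elapsed time is %g seconds" % (end - start))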
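In the program above, every worker appends to cosmic.tsv on its own. One possible alternative, sketched here under assumptions, lets the workers only return results while a single caller does all the writing and collects the IDs that failed so they can be retried. `fetch_status` is a hypothetical stand-in for the real page request and parsing, and an in-memory buffer replaces the output file so that the example runs offline. A thread pool is used purely for illustration.

```python
import io
from concurrent.futures import ThreadPoolExecutor


def fetch_status(cosmic_num):
    """Hypothetical stand-in for the real page request and parsing."""
    if cosmic_num == "3":
        raise RuntimeError("simulated network failure")
    return "Yes" if int(cosmic_num) % 2 else "No"


def safe_fetch(cosmic_num):
    """Run fetch_status and turn any failure into a None result."""
    try:
        return cosmic_num, fetch_status(cosmic_num)
    except Exception:
        return cosmic_num, None


def crawl(ids, out, workers=4):
    """Fetch all ids concurrently; write successes to `out`, return failures."""
    failed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order; only the calling thread writes.
        for cosmic_num, status in pool.map(safe_fetch, ids):
            if status is None:
                failed.append(cosmic_num)
            else:
                out.write("COSM%s\t%s\n" % (cosmic_num, status))
    return failed


buffer = io.StringIO()  # stands in for open("cosmic.tsv", "a")
failed_ids = crawl(["1", "2", "3", "4"], buffer)
print(buffer.getvalue(), failed_ids)
```

Because failures are returned rather than only printed, the caller could pass `failed_ids` back into `crawl` for another attempt, for example after switching proxies.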
