For a given COSMIC ID, the "somatic" status takes one of three values: Yes (e.g. COSM6918278)
No (e.g. COSM6475151)
SNP (e.g. COSM6972367)
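In the previous article we shared a crawler that looks up a single COSMIC entry in real time. After that warm-up, today we go a step further with an upgraded crawler that can collect this information for the entire COSMIC database.
Step 1: Get all COSMIC IDs
First we need every COSMIC ID in the database. These can be taken from the VCF file available on the COSMIC download page: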
(https://cancer.sanger.ac.uk/cosmic/download)
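As a minimal sketch of this step, the ID column (the third tab-separated field) of each VCF record can be collected as shown below. The function name `read_cosmic_ids` and the two sample records are made up for illustration; the real input is the file downloaded from the page above, which may need to be decompressed first.

```python
import io


def read_cosmic_ids(handle):
    """Return the ID column (3rd field) of every non-header VCF record."""
    ids = []
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and the '#'-prefixed VCF header lines
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 3:
            ids.append(fields[2])
    return ids


# Hypothetical two-record VCF, for illustration only
sample = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\tCOSM0000001\tA\tT\t.\t.\tGENE=X\n"
    "1\t200\tCOSM0000002\tG\tC\t.\t.\tGENE=Y;SNP\n"
)
print(read_cosmic_ids(sample))
```

Any file-like object works as `handle`, so the same function can be passed an open file instead of the in-memory sample.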
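Step 2: Avoid getting your IP blocked
Routing requests through a proxy IP hides your real IP and is a common way to avoid being blocked. We recommend Kuaidaili (https://www.kuaidaili.com/ops/proxylist/1/; either the paid or the free tier works). Proxies come in two types, http and https, so pay close attention to which type you pick, and choose high-anonymity proxies with fast response times. Free proxies only stay alive for a limited time and not all of them work, so always validate a proxy before using it. To further reduce the risk of being blocked, the crawler can imitate a browser when requesting pages, and it should wait for an interval between requests. Once these precautions are in place, the crawler can be run in parallel for efficiency.
The full script is as follows: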
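The precautions above can be sketched in isolation as follows. This is an illustration, not part of the original script: the helper names (`keep_working_proxies`, `pick_request_settings`, `polite_sleep`), the delay range, and the proxy addresses (taken from a documentation-only IP range) are all assumptions. The liveness check is passed in as a function so the sketch runs without network access.

```python
import random
import time


def keep_working_proxies(candidates, is_alive):
    """Keep only the proxies for which the supplied check returns True."""
    return [proxy for proxy in candidates if is_alive(proxy)]


def pick_request_settings(user_agents, https_proxies, rng=random):
    """Choose a random User-Agent header and a random HTTPS proxy for one request."""
    headers = {"User-Agent": rng.choice(user_agents)}
    proxies = {"https": "https://" + rng.choice(https_proxies)}
    return headers, proxies


def polite_sleep(min_s=1.0, max_s=3.0, rng=random, sleep=time.sleep):
    """Wait a random interval so requests are not sent at a fixed rhythm."""
    delay = rng.uniform(min_s, max_s)
    sleep(delay)
    return delay


# Placeholder values for illustration only
agents = ["Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5"]
candidates = ["192.0.2.10:8080", "192.0.2.11:8080"]

# Stand-in check: pretend only the first proxy responded
alive = keep_working_proxies(candidates, lambda proxy: proxy.endswith("10:8080"))
headers, proxies = pick_request_settings(agents, alive)
print(alive, proxies)
```

In practice `is_alive` would send a request with a short timeout through the proxy and return whether it succeeded, and `polite_sleep()` would be called before each page request.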
import os
import random
import re
import time
from multiprocessing import Pool

import requests
import urllib3
from bs4 import BeautifulSoup

# Silence the warning that verify=False would otherwise raise on every HTTPS request
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
requests.adapters.DEFAULT_RETRIES = 10  # intended to raise the number of connection retries
#############################
# Collect every position covered by the panel BED files, keyed as "chrom_pos"
site = {}
bedfile = ["DHS-3501Z.roi.bed", "panel_27.bed", "panel_599.bed", "TST500C_manifest.bed"]
for i in bedfile:
    with open(i, "r") as infile:
        for line in infile:
            array = line.strip().split()
            if len(array) < 3:
                continue  # skip blank or malformed lines
            for k in range(int(array[1]), int(array[2]) + 1):
                tmp = array[0] + "_" + str(k)
                site[tmp] = 1
print("There are %s sites in total" % (len(site)))
##############################
# Set the User-Agent request header to a real browser's value so that requests look like browser traffic
user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
]
vcf = "all_COSMICID.txt"
proxy_type = ['http', 'https']
#proxy_http=['117.90.0.225:9000']
proxy_https = ['166.111.82.202:8118', '60.13.42.107:9999', '119.57.105.25:53281']
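# IDs already written to cosmic.tsv by an earlier run; they are skipped so the crawl can resume
done = {}
if os.path.exists("cosmic.tsv"):
    with open("cosmic.tsv", "r") as outfile:
        for line in outfile:
            array = line.strip().split()
            if array:
                done[array[0]] = 1
###############################
numID = []  # numeric IDs that still have to be fetched from the website
snp_pattern = re.compile(r'SNP')
num_pattern = re.compile(r'(\d+)')
with open(vcf, "r") as infile, open("cosmic.tsv", "a+") as outfile:
    for line in infile:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        array = line.split("\t")
        tmp = "chr" + array[0] + "_" + array[1]
        # Keep only positions inside the panels that have not been recorded yet
        if tmp not in site or array[2] in done:
            continue
        num = num_pattern.findall(array[2])
        if snp_pattern.search(array[-1]):
            # The last VCF column already flags this entry as a SNP, so no request is needed
            outfile.write("COSM%s\tSNP\n" % (num[0]))
        else:
            numID.append(num[0])
print("Total %s entries to fetch from COSMIC" % (len(numID)))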
def run(cosmic_num):
    """Fetch one COSMIC mutation page and append its somatic status to cosmic.tsv."""
    url = 'https://cancer.sanger.ac.uk/cosmic/mutation/overview?genome=37&id=%s' % (cosmic_num)
    headers = {'User-Agent': random.choice(user_agents),  # pick a random User-Agent
               'Connection': 'close'}  # ask the server not to keep the connection open
    #if random.choice(proxy_type)=='http':
    #    ip={'http':'http://'+random.choice(proxy_http)}
    #else:
    ip = {'https': 'https://' + random.choice(proxy_https)}
    try:
        with requests.Session() as s:
            # The timeout makes a dead proxy fail instead of hanging the worker
            res = s.get(url, headers=headers, verify=False, proxies=ip, timeout=30)
        soup = BeautifulSoup(res.text, 'html.parser')
        dbsnp = soup.find_all(string=re.compile("has been flagged as a SNP."))
        dt = soup.find_all('dt')
        dd = soup.find_all('dd')
        status = None
        for i in range(min(len(dt), len(dd))):
            if dt[i].string == "Ever confirmed somatic?":
                status = dd[i].get_text().strip()
                break
        # Fall back to the SNP banner only when the somatic field is absent
        if status is None and dbsnp != []:
            status = "SNP"
        if status is not None:
            with open("cosmic.tsv", "a+") as outfile:
                outfile.write("COSM%s\t%s\n" % (cosmic_num, status))
            print("%s id done." % (cosmic_num))
    except Exception as e:
        print("%s id error: %s" % (cosmic_num, e))
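if __name__ == "__main__":
    start = time.time()
    # Crawl the remaining IDs in parallel with a pool of worker processes
    with Pool(processes=200) as pool:
        pool.map(run, numID)
    end = time.time()
    print("Elapsed time is %g seconds" % (end - start))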
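To make the mapping from a page to the three states (Yes, No, SNP) easier to test without network access, the decision logic can be pulled out into its own function. The sketch below uses only the standard library rather than BeautifulSoup, and the HTML snippets are invented to mimic the two things the script looks for (the "Ever confirmed somatic?" entry and the SNP notice); the real page layout may differ. The names `DefinitionListParser` and `classify` are hypothetical.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect the text of every <dt> and <dd>, plus all text on the page."""

    def __init__(self):
        super().__init__()
        self.terms = []    # text of each <dt>
        self.values = []   # text of each <dd>
        self.text = []     # every text fragment on the page
        self._target = None

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self.terms.append("")
            self._target = self.terms
        elif tag == "dd":
            self.values.append("")
            self._target = self.values

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._target = None

    def handle_data(self, data):
        self.text.append(data)
        if self._target is not None:
            self._target[-1] += data


def classify(html):
    """Return the somatic status, 'SNP', or None if neither marker is found."""
    parser = DefinitionListParser()
    parser.feed(html)
    parser.close()
    # Pair each <dt> with the <dd> at the same index, as the script does
    for term, value in zip(parser.terms, parser.values):
        if term.strip() == "Ever confirmed somatic?":
            return value.strip()
    if "has been flagged as a SNP" in "".join(parser.text):
        return "SNP"
    return None


# Invented snippets, for illustration only
print(classify("<dl><dt>Ever confirmed somatic?</dt><dd> Yes </dd></dl>"))
print(classify("<p>This mutation has been flagged as a SNP.</p>"))
```

Keeping this logic separate from the request code means a saved copy of a page can be used to check the parsing before launching a full crawl.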