Luogu Site-Wide Ranking Generator (Luogu-question-rank)

Reposted article. Originally published 2022-04-21 16:00. Category: just-for-fun projects

Preface

This is this novice OIer's first blog post, so please read on.

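Background

Over the winter break I signed up for the Luogu online school's spring camp, basic algorithms track (this is not an ad!).

Seeing how many experts there were in my team, I couldn't help falling into thought...

"How can I find out where I rank within the team?"

After a moment of thought, I realized that ranking only within the team was thinking too small.

I want to rank every user on Luogu!

Time to have some fun!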

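Initial Plan

If this is going to be a stunt, it had better be a big one!

After a few days of planning, I listed the features the system must have:

1. It can fetch a user's Gu-value (咕值) ranking
2. It can fetch how many problems the user has solved at each difficulty
3. It can merge all the data in one place (for example, an Excel spreadsheet)
4. It does not affect the operation of the Luogu site (if Luogu goes down, I can't afford to pay for it)

To build all this I have to fall back on my side skill, Python. Yes, this is going to be a web scraper.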

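Data Source

To keep the program efficient without putting load on the Luogu site, I chose Luogu-card as the data source.

It looks like this ↓ (screenshot in the original post)

Luogu-card refreshes its data only once every half day, so scraping it should not put any load on Luogu itself.

As for locating the data to extract... matching the HTML source with regular expressions is enough.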
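
To make the regex idea concrete, here is a minimal sketch. The `SAMPLE` fragment is hypothetical: it is modelled on the patterns used later in this post, and the real card markup may well differ. The helper names `parse_summary` and `parse_count` are also made up for illustration.

```python
import re

# Hypothetical fragment shaped like the text the later patterns expect;
# it is NOT copied from a real Luogu-card response.
SAMPLE = '''
<text x="10" y="15" class="text">入门</text>
<text x="300" y="15" class="text">42题</text>
<text x="10" y="40" class="title">已贺128题, 被3500人吊打</text>
'''

def parse_summary(svg_text):
    """Return (solved, rank), or None if the summary line is missing.

    rank is None when the card shows INF instead of a number.
    """
    m = re.search(r'已贺(\d+)题, 被(\d+|INF)人吊打', svg_text)
    if m is None:
        return None
    solved = int(m.group(1))
    rank = None if m.group(2) == 'INF' else int(m.group(2))
    return solved, rank

def parse_count(svg_text, label):
    """Return the solved count shown next to a difficulty label, or None."""
    pattern = (re.escape(label)
               + r'</text>\s+<text x="[^"]*" y="15" class="text">(\d+)题</text>')
    m = re.search(pattern, svg_text)
    return int(m.group(1)) if m else None

print(parse_summary(SAMPLE))        # (128, 3500)
print(parse_count(SAMPLE, '入门'))  # 42
```

Returning `None` instead of calling `.group()` on a failed match keeps one odd card from crashing a long run.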

Enough talk, here is the solution ヾ(≧▽≦*)o

Program Walkthrough

First, these are the libraries that need to be imported:

import re  # regular expressions
import requests  # the most basic HTTP library for scraping
import xlwt   # library for writing Excel (.xls) spreadsheets
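Next, put the difficulty labels to match into a list and initialize the spreadsheet:

dic = ['未评定','入门','普及-','普及/提高-','/提高','省选-','省选/NOI-','CTSC','写挂了']  # label fragments as they appear on the card
workbook = xlwt.Workbook(encoding='utf-8') # create a new workbook
worksheet = workbook.add_sheet("Luogu-rank")  # add a worksheet to the workbook
worksheet.write(0,0,'用户ID',style2)     # "User ID"; worksheet.write(i,j,k) writes k at row i, column j
worksheet.write(0,1,'用户名',style2)     # "Username"; style2 is a cell style, explained later
worksheet.write(0,2,'咕值排名',style2)   # "Gu-value rank"
worksheet.write(0,3,'总通过题数',style2)  # "Total problems solved"
for i in range(13):
    if i == 0 or i > 2:  # must be `or`: with `|` the test parses as i == (0 | i) > 2 and skips column 0
        worksheet.col(i).width = 256 * 10 # set the column width (about 10 characters)
    else:
        worksheet.col(i).width = 256 * 20
    if i > 3:
        worksheet.write(0,i,dic[i - 4],style2)   # write the difficulty labels into the header row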
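
A side note on the column-width condition "i is 0, or i is greater than 2": in Python, `|` is bitwise OR and binds tighter than comparisons, so it is not interchangeable with `or` here. The sketch below (the function names `with_pipe` and `with_or` are just for illustration) shows where the two spellings disagree over the 13 columns.

```python
def with_pipe(i):
    # Parsed as the chained comparison  i == (0 | i) > 2,
    # i.e. (i == (0 | i)) and ((0 | i) > 2), which reduces to i > 2.
    return i == 0 | i > 2

def with_or(i):
    # The intended logic: column 0, or any column after column 2.
    return i == 0 or i > 2

mismatches = [i for i in range(13) if with_pipe(i) != with_or(i)]
print(mismatches)  # [0]
```

So the `|` spelling silently gives column 0 the wide setting, and Python raises no error about it.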

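Everything is ready, so here comes the main part! (The code block is a bit long, but it is hard to split, so bear with it.)

for i in range(1,10000):  # the range of user IDs to collect
    worksheet.write(i,0,i)
    responses = requests.get('https://statcard.vercel.app/practice?id=' + str(i), timeout=10)  # fetch the card; the timeout keeps a stalled request from hanging forever
    if len(responses.text) > 900:  # cards with retrievable data have more than 900 characters of source (found by trial)
        if re.search("NULL",responses.text) is not None:  # a nonexistent user's nickname shows up as "NULL" (also found by trial)
            worksheet.write_merge(i, i, 1, 12, '无此用户',style2)   # merge the cells and write "no such user"
        else:
            username = re.search(r'(.+)\s+</text>\s+<text x=".*" y=".*" class="title" font-weight="normal">\s+的贺题情况',responses.text)  # match the username from its context in the HTML source
            if username is None:
                username = re.search(r'<text x=".*" y=".*" fill=".*" font-weight=".*" textLength=".*">\s+(.*)\s+</text>\s+<svg xmlns="http://www.w3.org/2000/svg" x=".*" y=".*" width=".*" height=".*"',responses.text)  # the name can also be matched in a second place, kept as a fallback
            worksheet.write(i,1,username.group(1).lstrip() if username else '',style)  # leave the cell empty if neither pattern matched
            problem_summ = re.search(r'已贺(.*)题, 被(.*)人吊打',responses.text)  # match the solved count and the rank
            if problem_summ.group(2) == 'INF':
                worksheet.write(i,2,'暂无数据',style3)  # "no data": users without a Gu-value show INF (once again found by trial)
            else:
                worksheet.write(i,2,int(problem_summ.group(2)))
            worksheet.write(i,3,int(problem_summ.group(1)))
            for j in range(9):
                search_obj = re.search(re.escape(dic[j]) + r'</text>\s+<text x=".*" y="15" class="text">(.*)题</text>',responses.text)   # match the solved count for each difficulty in turn
                worksheet.write(i,j + 4,int(search_obj.group(1)))
    else:
        worksheet.write_merge(i, i, 1, 12, '用户开启了“完全隐私保护”，获取数据失败',style2)  # "user has enabled full privacy protection; data unavailable"
    print(i)  # debug output, feel free to remove it
workbook.save("Luogu-rank1.xls")  # save the spreadsheet; the program ends here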
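
Since the goal is to avoid hammering the data source, one option is to pause between requests and retry the occasional failure. The following is only a sketch, not part of the original program: `fetch_all`, its parameters, and the injected `fetch` callable are all hypothetical names chosen for illustration.

```python
import time

def fetch_all(ids, fetch, delay=1.0, retries=2, sleep=time.sleep):
    """Call fetch(user_id) for every id, pausing between requests.

    Returns a dict {user_id: text}, where text is None for ids that
    still failed after `retries` extra attempts.
    """
    results = {}
    for user_id in ids:
        text = None
        for attempt in range(retries + 1):
            try:
                text = fetch(user_id)
                break
            except Exception:
                # Back off a little longer after each failed attempt.
                sleep(delay * (attempt + 1))
        results[user_id] = text
        sleep(delay)  # be polite: never send requests back to back
    return results
```

With `requests`, the `fetch` argument could be something like `lambda uid: requests.get(URL + str(uid), timeout=10).text`, where `URL` stands for the card address used above. Passing `fetch` and `sleep` in as parameters also makes the loop easy to try out without touching the network.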

Finally, here is the complete code:

import re
import requests
import xlwt

# Cell styles used above: SimSun (宋体) font plus an alignment setting.
style = xlwt.XFStyle()
font = xlwt.Font()
font.name = '宋体'
style.font = font
al = xlwt.Alignment()
al.horz = 0x01  # horizontal: left
al.vert = 0x01  # vertical: center
style.alignment = al

style2 = xlwt.XFStyle()
font2 = xlwt.Font()
font2.name = '宋体'
style2.font = font2
al2 = xlwt.Alignment()
al2.horz = 0x02  # horizontal: center
al2.vert = 0x01
style2.alignment = al2

style3 = xlwt.XFStyle()
al3 = xlwt.Alignment()
al3.horz = 0x03  # horizontal: right
al3.vert = 0x01
style3.alignment = al3  # the remaining style parameters are easy to find online, so I won't go into detail

dic = ['未评定','入门','普及-','普及/提高-','/提高','省选-','省选/NOI-','CTSC','写挂了']
workbook = xlwt.Workbook(encoding='utf-8')
worksheet = workbook.add_sheet("Luogu-rank")
worksheet.write(0,0,'用户ID',style2)
worksheet.write(0,1,'用户名',style2)
worksheet.write(0,2,'咕值排名',style2)
worksheet.write(0,3,'总通过题数',style2)

for i in range(13):
    if i == 0 or i > 2:
        worksheet.col(i).width = 256 * 10
    else:
        worksheet.col(i).width = 256 * 20
    if i > 3:
        worksheet.write(0,i,dic[i - 4],style2)

for i in range(1,10000):
    worksheet.write(i,0,i)
    responses = requests.get('https://statcard.vercel.app/practice?id=' + str(i), timeout=10)
    if len(responses.text) > 900:
        if re.search("NULL",responses.text) is not None:
            worksheet.write_merge(i, i, 1, 12, '无此用户',style2)
        else:
            username = re.search(r'(.+)\s+</text>\s+<text x=".*" y=".*" class="title" font-weight="normal">\s+的贺题情况',responses.text)
            if username is None:
                username = re.search(r'<text x=".*" y=".*" fill=".*" font-weight=".*" textLength=".*">\s+(.*)\s+</text>\s+<svg xmlns="http://www.w3.org/2000/svg" x=".*" y=".*" width=".*" height=".*"',responses.text)
            worksheet.write(i,1,username.group(1).lstrip() if username else '',style)
            problem_summ = re.search(r'已贺(.*)题, 被(.*)人吊打',responses.text)
            if problem_summ.group(2) == 'INF':
                worksheet.write(i,2,'暂无数据',style3)
            else:
                worksheet.write(i,2,int(problem_summ.group(2)))
            worksheet.write(i,3,int(problem_summ.group(1)))
            for j in range(9):
                search_obj = re.search(re.escape(dic[j]) + r'</text>\s+<text x=".*" y="15" class="text">(.*)题</text>',responses.text)
                worksheet.write(i,j + 4,int(search_obj.group(1)))
    else:
        worksheet.write_merge(i, i, 1, 12, '用户开启了“完全隐私保护”，获取数据失败',style2)
    print(i)
workbook.save("Luogu-rank1.xls")

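Notes

Most of what is in the program I learned from online resources, so anything you don't understand can be looked up.
When choosing the ID range, do not make it too large: if the program sends too many requests, Luogu-card may stop serving it.
At the time of writing, the largest Luogu ID is 719011; nothing can be found beyond that (yet again, found by trial).
If the program is interrupted by an error, the data fetched so far is not saved. The workaround is to fetch only a small range at a time and paste the spreadsheets together at the end.
While the program is running, do not keep a file with the same name as the output file in the same directory, or the program will also raise an error.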
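
The "small range at a time" advice can be sketched as follows. This is an illustration under assumptions, not the original program: `id_chunks`, `run_chunk`, and the `process` and `save` callables are hypothetical names. The idea is to split the ID range into chunks and save each chunk in a `finally` block, so an interruption costs at most the chunk in progress. Chunking is needed for a full run in any case, because the legacy .xls format holds at most 65,536 rows per sheet.

```python
def id_chunks(start, stop, size):
    """Yield (lo, hi) half-open ranges that together cover [start, stop)."""
    for lo in range(start, stop, size):
        yield lo, min(lo + size, stop)

def run_chunk(lo, hi, process, save):
    """Process ids in [lo, hi); always call save(), even if process raises."""
    try:
        for user_id in range(lo, hi):
            process(user_id)
    finally:
        # Runs on normal completion, on errors, and on Ctrl+C alike.
        save("Luogu-rank-%d-%d.xls" % (lo, hi - 1))

print(list(id_chunks(1, 10, 4)))  # [(1, 5), (5, 9), (9, 10)]
```

In the real program, `process` would write one row into a fresh worksheet and `save` would be that workbook's `save` method. Giving every chunk its own file name also avoids clashing with an existing output file.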

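Final Result

The screenshot (in the original post) shows the first 20 IDs; with the data in an Excel spreadsheet it is easy to sort.

~~The more data you generate, the more strange usernames show up [doge]~~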

Afterword

I never expected that my first blog post as an OIer would turn out to be about web scraping 😅

Feel free to message me on Luogu!
