GFW BLOG（功夫网与翻墙）: A Python script for backing up Fanfou (饭否) messages

Source: http://blog.icybear.cn/2009/07/rice-python-script-to-back-up-any-news.html

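This script uses the Fanfou API to back up your messages as XML. It is written for Python 3.0.
Some people say the API can only export the most recent 3,200 messages. In fact it can export all of them; you just have to use a different parameter (the script pages backwards with max_id).
The drawback of exporting to XML is the size of the data: my 5,700+ messages came out at about 10 MB.
The other problem is a Fanfou bug: if the exported XML contains certain special Unicode characters, the parser throws an exception. Fanfou really ought to XML-encode its output.
That is about it. The code follows and will fill your screen, so skip it if walls of code bother you.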

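#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
# Back up a Fanfou user's timeline to an XML file through the API.
from xml.dom import minidom
from urllib import request

# Proxy settings (uncomment if needed)
#request.install_opener(request.build_opener(request.ProxyHandler(
#    {"http" : "http://192.168.60.250:8080"}
#)))

# Username
user = "bearice"

# Number of statuses requested per page
PAGE_SIZE = 60

def loadPage(id=''):
    # max_id selects the page that starts at the given status id;
    # an empty value starts from the newest status.
    url = "http://api.fanfou.com/statuses/user_timeline.xml?id=%s&count=%d&max_id=%s" % (user, PAGE_SIZE, id)
    print('LOAD:%s' % url)
    f = request.urlopen(url)
    try:
        dom = minidom.parse(f)
    finally:
        f.close()
    return dom

i = 0      # number of statuses written so far
last = ''  # id of the last status written
with open('fanfou.backup.xml', 'w', encoding='utf8') as out:
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n<statuses>\n')
    while True:
        children = loadPage(last).documentElement.getElementsByTagName('status')
        full_page = len(children) >= PAGE_SIZE
        # Every page after the first begins with the status already saved, so skip it.
        if i > 0:
            children = children[1:]
        for st in children:
            i += 1
            last = st.getElementsByTagName('id')[0].firstChild.data
            out.write(st.toxml())
            out.write('\n')
            print("%d:%s" % (i, last))
        # A short page is the last one; it is written above before stopping.
        if not full_page:
            break
    out.write('</statuses>\n')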
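
For the parser exception mentioned above, one possible workaround is to strip the offending characters from the response before handing it to minidom. This is only a rough sketch, and it assumes the trouble comes from characters that XML 1.0 does not allow at all (mostly control characters); if Fanfou is actually emitting an unescaped & or <, this will not help. The names strip_invalid_xml_chars and parse_status_xml are made up for this example and are not part of the script or of any library.

```python
import re
from xml.dom import minidom

# Everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml_chars(text):
    """Hypothetical helper: drop characters that XML 1.0 forbids."""
    return _INVALID_XML_CHARS.sub("", text)

def parse_status_xml(raw_bytes):
    """Hypothetical helper: decode an API response, clean it, then parse it.

    Assumes the response is UTF-8; undecodable bytes become U+FFFD.
    """
    text = raw_bytes.decode("utf-8", errors="replace")
    return minidom.parseString(strip_invalid_xml_chars(text))
```

To try it in the script, loadPage could call parse_status_xml(f.read()) instead of minidom.parse(f). The cost is that the stripped characters are lost from the backup, which seemed acceptable to me for control characters but is a judgment call.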
