Data preprocessing
- Freebase data (compressed archive, about 30 GB): https://developers.google.com/freebase
- Ways to filter the data
- Option 1: https://sivareddy.in/random/fix_freebase.py
- Option 2: https://github.com/lanyunshi/Multi-hopComplexKBQA/blob/master/code/FreebaseTool/FilterEnglishTriplets.py
# Option 1: decompress freebase-rdf-latest.gz first
gunzip -c freebase-rdf-latest.gz > freebase # about 400 GB once decompressed
nohup python -u FilterEnglishTriplets.py 0<freebase 1>FilterFreebase 2>log_err & # filtered output is about 125 GB
# Option 2: filter without decompressing to disk
zcat freebase-rdf-latest.gz | python FilterEnglishTriplets.py | gzip > freebase-filter.gz # about 10 GB
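To make the filtering step more concrete, here is a minimal sketch of the streaming approach behind option 2: read the gzipped dump line by line, keep only some triples, and write the result back out compressed. The rule in `keep_triple` (drop a triple when its object is a literal tagged with a language other than English) is an assumption made for illustration and is not taken from FilterEnglishTriplets.py, which may apply different or additional rules. The names `keep_triple` and `filter_gzip` are hypothetical.

```python
import gzip
import re

# Assumption: a language-tagged literal ends in "@xx or "@xx-YY.
LANG_TAG = re.compile(r'"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)$')


def keep_triple(line):
    """Return True unless the object is a literal tagged as non-English."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3:
        return False  # malformed line: subject, predicate, object expected
    match = LANG_TAG.search(parts[2])
    if match is None:
        return True  # URI, typed literal, or untagged literal
    return match.group(1).lower().split("-")[0] == "en"


def filter_gzip(src_path, dst_path):
    """Stream a gzipped dump through keep_triple without unpacking it to disk."""
    kept = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if keep_triple(line):
                dst.write(line)
                kept += 1
    return kept  # number of triples written
```

Calling `filter_gzip("freebase-rdf-latest.gz", "freebase-filter.gz")` would play the same role as the `zcat | python | gzip` pipeline, under the assumed rule. Streaming matters here because the decompressed dump is far larger than the archive.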
Software download
- Location: http://sourceforge.net/projects/virtuoso/files/virtuoso/
- Choose the precompiled build of version 7.2.5 (no compilation needed): virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
- Download link: https://netix.dl.sourceforge.net/project/virtuoso/virtuoso/7.2.5/virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
Import the data
tar xvpfz virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
Check status
# Open a new terminal window and check how far the dataset load has progressed
Stop the service
SQL>
Access the data
- Browser: http://localhost:8890/sparql
- Python
from SPARQLWrapper import SPARQLWrapper, JSON

# SPARQL endpoint exposed by the local Virtuoso server
SPARQLPATH = "http://localhost:8890/sparql"


def test():
    try:
        sparql = SPARQLWrapper(SPARQLPATH)
        sparql_txt = """PREFIX ns: <http://rdf.freebase.com/ns/>
            SELECT distinct ?name3
            WHERE {
            ns:m.0k2kfpc ns:award.award_nominated_work.award_nominations ?e1.
            ?e1 ns:award.award_nomination.award_nominee ns:m.02pbp9.
            ns:m.02pbp9 ns:people.person.spouse_s ?e2.
            ?e2 ns:people.marriage.spouse ?e3.
            ?e2 ns:people.marriage.from ?e4.
            ?e3 ns:type.object.name ?name3
            MINUS{?e2 ns:type.object.name ?name2}
            }
        """
        # print(sparql_txt)
        sparql.setQuery(sparql_txt)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        print(results)
    except Exception as err:
        # Connection failures and query errors both end up here
        print('Your database is not installed properly !!!')
        print(err)


test()
Output
{'head': {'link': [], 'vars': ['name3']}, 'results': {'distinct': False, 'ordered': True, 'bindings': [{'name3': {'type': 'literal', 'xml:lang': 'en', 'value': 'Jeffrey Probst'}}, {'name3': {'type': 'literal', 'xml:lang': 'en', 'value': 'Shelly Wright'}}, {'name3': {'type': 'literal', 'xml:lang': 'en', 'value': 'Lisa Ann Russell'}}]}}
Formatted
{'head': {'link': [], 'vars': ['name3']},
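In practice only the bound values are usually needed, not the whole result dictionary. The sketch below pulls them out of the structure shown in the output above. The helper name `binding_values` is hypothetical, and `sample` is a shortened copy of that output used so the snippet runs without a live endpoint.

```python
def binding_values(results, var):
    """Collect the plain string values bound to `var` in a SPARQL JSON result."""
    rows = results.get("results", {}).get("bindings", [])
    # A variable can be unbound in some rows, so skip rows that lack it.
    return [row[var]["value"] for row in rows if var in row]


# Shortened copy of the result printed by test() above
sample = {
    "head": {"link": [], "vars": ["name3"]},
    "results": {
        "distinct": False,
        "ordered": True,
        "bindings": [
            {"name3": {"type": "literal", "xml:lang": "en", "value": "Jeffrey Probst"}},
            {"name3": {"type": "literal", "xml:lang": "en", "value": "Shelly Wright"}},
            {"name3": {"type": "literal", "xml:lang": "en", "value": "Lisa Ann Russell"}},
        ],
    },
}

print(binding_values(sample, "name3"))
# ['Jeffrey Probst', 'Shelly Wright', 'Lisa Ann Russell']
```

The same helper can be applied directly to the value returned by `sparql.query().convert()` when the return format is JSON.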