I have figure out how to extract my RDF data from a nepomuk-soprano-virtuoso database. I must say that I have crafted my own tools to create the data on top of the soprano model, so I can control the graph all my data belongs to.
It wasn’t always like that – but I’ll get to that pretty soon.
as detailed in my previous post – rebuilding the virtuoso database left me with a file way too bloated for my taste to start backing up.
So I’ve decided to try the strategy of dumping the data, filter it, and import it back to a clean database.
running isql and virtuoso like I described in my previous post, I have used the following SQL function in isql:
CREATE PROCEDURE dump_graphs
( IN dir VARCHAR := '/Users/mcradle/Library/Preferences/KDE/share/apps/nepomuk/repository/main/data/virtuosobackend/dumps' ,
IN file_length_limit INTEGER := 1000000000
)
{
DECLARE inx INT;
inx := 1;
SET ISOLATION = 'uncommitted';
FOR ( SELECT *
FROM ( SPARQL DEFINE input:storage ""
SELECT DISTINCT ?g { GRAPH ?g { ?s ?p ?o } .
FILTER ( ?g != virtrdf: )
}
) AS sub OPTION ( LOOP )) DO
{
dbg_printf ('about to dump %s',"g");
dump_one_graph ( "g",
sprintf ('%s/graph%06d_', dir, inx),
file_length_limit
);
dbg_printf ('dump done %s',"g");
inx := inx + 1;
}
}
;
CREATE PROCEDURE dump_one_graph
( IN srcgraph VARCHAR ,
IN out_file VARCHAR ,
IN file_length_limit INTEGER := 1000000000
)
{
DECLARE file_name varchar;
DECLARE env, ses any;
DECLARE ses_len,
max_ses_len,
file_len,
file_idx integer;
SET ISOLATION = 'uncommitted';
max_ses_len := 10000000;
file_len := 0;
file_idx := 1;
file_name := sprintf ('%s%06d.ttl', out_file, file_idx);
string_to_file ( file_name || '.graph',
srcgraph,
-2
);
string_to_file ( file_name,
sprintf ( '# Dump of graph <%s>, as of %s\n',
srcgraph,
CAST (NOW() AS VARCHAR)
),
-2
);
env := vector (dict_new (16000), 0, '', '', '', 0, 0, 0, 0);
ses := string_output ();
FOR (SELECT * FROM ( SPARQL DEFINE input:storage ""
SELECT ?s ?p ?o { GRAPH `iri(?:srcgraph)` { ?s ?p ?o } }
) AS sub OPTION (LOOP)) DO
{
http_ttl_triple (env, "s", "p", "o", ses);
ses_len := length (ses);
IF (ses_len > max_ses_len)
{
file_len := file_len + ses_len;
IF (file_len > file_length_limit)
{
http (' .\n', ses);
string_to_file (file_name, ses, -1);
file_len := 0;
file_idx := file_idx + 1;
file_name := sprintf ('%s%06d.ttl', out_file, file_idx);
string_to_file ( file_name,
sprintf ( '# Dump of graph <%s>, as of %s (part %d)\n',
srcgraph,
CAST (NOW() AS VARCHAR),
file_idx),
-2
);
env := vector (dict_new (16000), 0, '', '', '', 0, 0, 0, 0);
}
ELSE
string_to_file (file_name, ses, -1);
ses := string_output ();
}
}
IF (LENGTH (ses))
{
http (' .\n', ses);
string_to_file (file_name, ses, -1);
}
}
;
one must have the directory of which the graphs to be dumped in the DirsAllowed parameter within the [Parameters] section of virtuoso.ini
for example:
[Parameters]
DirsAllowed= /Users/mcradle/Library/Preferences/KDE/share/apps/nepomuk/repository/main/data/virtuosobackend/dumps/,./dumps,dumps
then issue
dump_graphs();
on the isql commandline.
this should create a bunch of files in your dump directory similarly to the following:
bash-3.2$ ls -lhtra | head
total 8856
-rw-r--r-- 1 mcradle staff 36B Feb 19 00:14 graph000002_000001.ttl.graph
-rw-r--r-- 1 mcradle staff 43B Feb 19 00:14 graph000001_000001.ttl.graph
-rw-r--r-- 1 mcradle staff 2.9K Feb 19 00:14 graph000001_000001.ttl
-rw-r--r-- 1 mcradle staff 49B Feb 19 00:16 graph000090_000001.ttl.graph
-rw-r--r-- 1 mcradle staff 1.1K Feb 19 00:16 graph000090_000001.ttl
-rw-r--r-- 1 mcradle staff 49B Feb 19 00:16 graph000089_000001.ttl.graph
-rw-r--r-- 1 mcradle staff 947B Feb 19 00:16 graph000089_000001.ttl
-rw-r--r-- 1 mcradle staff 49B Feb 19 00:16 graph000088_000001.ttl.graph
-rw-r--r-- 1 mcradle staff 791B Feb 19 00:16 graph000088_000001.ttl
as it turns out my RDF data was spread across many grpahs, I think nepomuk does that by default, I don’t even claim to understand why and how.
I wanted to unify them into one graph to make the data export/import easier in the future.
Python and librdf to the rescue, it turns out that this is not that hard, following the python snippet to merge a few turtle file containing each a graph (exactly the form created by the dump_graphs() procedure above)
files_to_be_merged = ("./graph000030_000001.ttl",
"./graph000031_000001.ttl",
"./graph000032_000001.ttl",
"./graph000033_000001.ttl",
"./graph000034_000001.ttl",
"./graph000100_000001.ttl",
"./graph000013_000001.ttl")
import rdflib
store = rdflib.Graph()
for fil in files_to_be_merged:
store.parse (fil, format='n3')
f = open('/Users/mcradle/.../dumps/unified-graph.ttl', 'w')
f.write (store.serialize(format='turtle'))
note that since I’ve created my own turtle file I need to create my own .ttl.graph file.
it turns out that it’s a file containing the graph name, for example:
bash-3.2$ cat graph000341_000001.ttl.graph
nepomuk:/ctx/f665ed7d-ec19-4454-ae4c-70635f1b442f
so I’ve created my own unified-graph.ttl.graph with my own graph name.
I then deleted the following virtuoso database files:
soprano-virtuoso-temp.db
soprano-virtuoso.db
soprano-virtuoso.lck
soprano-virtuoso.loc
soprano-virtuoso.log
soprano-virtuoso.pxa
soprano-virtuoso.trx
and launched nepomuk to have the database recreated. the newly created database was way smaller: around 10MB, I’ve just shrank my database in a factor of almost 200!
next I left with importing unified-graph.ttl back into the database. for that I’ve killed the nepomuk server and relaunched the standalone virtuoso.
it’s important to see that the dump directory is still allowed as per when we dumped the graphs.
to import back the graph with my data: in ‘isql’ prompt I pasted the following:
CREATE PROCEDURE load_graphs
( IN dir VARCHAR := 'dumps' )
{
DECLARE arr ANY;
DECLARE g VARCHAR;
arr := sys_dirlist (dir, 1);
log_enable (2, 1);
FOREACH (VARCHAR f IN arr) DO
{
IF (f LIKE '*.ttl')
{
DECLARE CONTINUE HANDLER FOR SQLSTATE '*'
{
log_message (sprintf ('Error in %s', f));
};
g := file_to_string (dir || '/' || f || '.graph');
dbg_printf ('g is %s', "g");
DB.DBA.TTLP_MT (file_open (dir || '/' || f), g, g, 255);
}
}
EXEC ('CHECKPOINT');
}
;
and ran the procedure to start the import process
load_graphs ();