Dump, Merge And Import Graphs From a Virtuoso Database

I have figured out how to extract my RDF data from a nepomuk-soprano-virtuoso database. I should say that I have crafted my own tools to create the data on top of the Soprano model, so I control which graph all my data belongs to.
It wasn’t always like that – but I’ll get to that pretty soon.

As detailed in my previous post, rebuilding the Virtuoso database left me with a file way too bloated for my taste to start backing up.

So I decided to try the strategy of dumping the data, filtering it, and importing it back into a clean database.

Running isql and Virtuoso as I described in my previous post, I used the following SQL procedures in isql:

CREATE PROCEDURE dump_graphs 
  ( IN  dir               VARCHAR  :=  '/Users/mcradle/Library/Preferences/KDE/share/apps/nepomuk/repository/main/data/virtuosobackend/dumps'   , 
    IN  file_length_limit INTEGER  :=  1000000000
  )
  {
    DECLARE inx INT;
    inx := 1;
    SET ISOLATION = 'uncommitted';
    FOR ( SELECT * 
            FROM ( SPARQL DEFINE input:storage "" 
                   SELECT DISTINCT ?g { GRAPH ?g { ?s ?p ?o } . 
                                        FILTER ( ?g != virtrdf: ) 
                                      } 
                 ) AS sub OPTION ( LOOP )) DO
      {
        dbg_printf ('about to dump %s', "g");
        dump_one_graph ( "g", 
                         sprintf ('%s/graph%06d_', dir, inx), 
                         file_length_limit
                       );
        dbg_printf ('dump done %s', "g");
        inx := inx + 1;
      }
  }
;
CREATE PROCEDURE dump_one_graph 
  ( IN  srcgraph           VARCHAR  , 
    IN  out_file           VARCHAR  , 
    IN  file_length_limit  INTEGER  := 1000000000
  )
  {
    DECLARE  file_name  varchar;
    DECLARE  env, ses      any;
    DECLARE  ses_len, 
             max_ses_len, 
             file_len, 
             file_idx      integer;
    SET ISOLATION = 'uncommitted';
    max_ses_len := 10000000;
    file_len := 0;
    file_idx := 1;
    file_name := sprintf ('%s%06d.ttl', out_file, file_idx);
    string_to_file ( file_name || '.graph', 
                     srcgraph, 
                     -2
                   );
    string_to_file ( file_name, 
                     sprintf ( '# Dump of graph <%s>, as of %s\n', 
                               srcgraph, 
                               CAST (NOW() AS VARCHAR)
                             ), 
                     -2
                   );
    env := vector (dict_new (16000), 0, '', '', '', 0, 0, 0, 0);
    ses := string_output ();
    FOR (SELECT * FROM ( SPARQL DEFINE input:storage "" 
                         SELECT ?s ?p ?o { GRAPH `iri(?:srcgraph)` { ?s ?p ?o } } 
                       ) AS sub OPTION (LOOP)) DO
      {
        http_ttl_triple (env, "s", "p", "o", ses);
        ses_len := length (ses);
        IF (ses_len > max_ses_len)
          {
            file_len := file_len + ses_len;
            IF (file_len > file_length_limit)
              {
                http (' .\n', ses);
                string_to_file (file_name, ses, -1);
                file_len := 0;
                file_idx := file_idx + 1;
                file_name := sprintf ('%s%06d.ttl', out_file, file_idx);
                string_to_file ( file_name, 
                                 sprintf ( '# Dump of graph <%s>, as of %s (part %d)\n', 
                                           srcgraph, 
                                           CAST (NOW() AS VARCHAR), 
                                           file_idx), 
                                 -2
                               );
                 env := vector (dict_new (16000), 0, '', '', '', 0, 0, 0, 0);
              }
            ELSE
              string_to_file (file_name, ses, -1);
            ses := string_output ();
          }
      }
    IF (LENGTH (ses))
      {
        http (' .\n', ses);
        string_to_file (file_name, ses, -1);
      }
  }
;
 

The directory into which the graphs are dumped must be listed in the DirsAllowed parameter within the [Parameters] section of virtuoso.ini, for example:

[Parameters]
DirsAllowed= /Users/mcradle/Library/Preferences/KDE/share/apps/nepomuk/repository/main/data/virtuosobackend/dumps/,./dumps,dumps

Then issue

dump_graphs();

at the isql command line.

This should create a bunch of files in your dump directory, similar to the following:

bash-3.2$ ls -lhtra | head
total 8856
-rw-r--r--    1 mcradle  staff    36B Feb 19 00:14 graph000002_000001.ttl.graph
-rw-r--r--    1 mcradle  staff    43B Feb 19 00:14 graph000001_000001.ttl.graph
-rw-r--r--    1 mcradle  staff   2.9K Feb 19 00:14 graph000001_000001.ttl
-rw-r--r--    1 mcradle  staff    49B Feb 19 00:16 graph000090_000001.ttl.graph
-rw-r--r--    1 mcradle  staff   1.1K Feb 19 00:16 graph000090_000001.ttl
-rw-r--r--    1 mcradle  staff    49B Feb 19 00:16 graph000089_000001.ttl.graph
-rw-r--r--    1 mcradle  staff   947B Feb 19 00:16 graph000089_000001.ttl
-rw-r--r--    1 mcradle  staff    49B Feb 19 00:16 graph000088_000001.ttl.graph
-rw-r--r--    1 mcradle  staff   791B Feb 19 00:16 graph000088_000001.ttl
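Each dumped .ttl file comes with a .ttl.graph companion naming its source graph. To get a quick inventory of what was dumped, here is a small Python sketch of my own (the list_dumped_graphs helper and the dumps path are mine, not part of Virtuoso):

```python
import glob
import os

def list_dumped_graphs(dump_dir):
    """Pair each dumped .ttl file with the graph IRI stored in its
    companion .ttl.graph file."""
    pairs = []
    for graph_file in sorted(glob.glob(os.path.join(dump_dir, "*.ttl.graph"))):
        # Strip the trailing ".graph" to get the matching Turtle file
        ttl_file = graph_file[:-len(".graph")]
        with open(graph_file) as f:
            pairs.append((ttl_file, f.read().strip()))
    return pairs

# Example: show which graph each dump file belongs to
for ttl, iri in list_dumped_graphs("dumps"):
    print(ttl, "->", iri)
```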

As it turns out, my RDF data was spread across many graphs; I think Nepomuk does that by default, though I don't claim to understand why or how.
I wanted to unify them into one graph to make future data export/import easier.

Python and rdflib to the rescue. It turns out this is not that hard; the following Python snippet merges a few Turtle files, each containing one graph (exactly the form created by the dump_graphs() procedure above):

import rdflib

files_to_be_merged = ("./graph000030_000001.ttl",
                      "./graph000031_000001.ttl",
                      "./graph000032_000001.ttl",
                      "./graph000033_000001.ttl",
                      "./graph000034_000001.ttl",
                      "./graph000100_000001.ttl",
                      "./graph000013_000001.ttl")

store = rdflib.Graph()

for fil in files_to_be_merged:
    store.parse(fil, format='turtle')

with open('/Users/mcradle/.../dumps/unified-graph.ttl', 'w') as f:
    f.write(store.serialize(format='turtle'))

Note that since I created my own Turtle file, I also need to create my own .ttl.graph file.
It turns out that this is simply a file containing the graph name, for example:

bash-3.2$ cat graph000341_000001.ttl.graph 
nepomuk:/ctx/f665ed7d-ec19-4454-ae4c-70635f1b442f

So I created my own unified-graph.ttl.graph containing my chosen graph name.
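Writing that companion file is a one-liner. In the sketch below, the graph IRI nepomuk:/ctx/my-unified-graph is a made-up example; pick whatever graph name you want the unified data to live under:

```python
# The graph IRI below is a made-up example; substitute the graph name
# you want the unified data imported into.
graph_name = "nepomuk:/ctx/my-unified-graph"

# The .ttl.graph file simply holds the graph name for its .ttl sibling.
with open("unified-graph.ttl.graph", "w") as f:
    f.write(graph_name)
```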

I then deleted the following virtuoso database files:

soprano-virtuoso-temp.db
soprano-virtuoso.db 
soprano-virtuoso.lck
soprano-virtuoso.loc
soprano-virtuoso.log
soprano-virtuoso.pxa
soprano-virtuoso.trx

and launched Nepomuk to have the database recreated. The newly created database was way smaller, around 10MB: I had just shrunk my database by a factor of almost 200!
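That deletion step can be sketched in Python as well. The wipe_soprano_db helper name is my own, and you should back up the directory before running anything like this:

```python
import glob
import os

def wipe_soprano_db(db_dir):
    """Delete the soprano-virtuoso* files so nepomuk recreates the
    database from scratch on its next launch. Back up db_dir first!"""
    removed = []
    for path in glob.glob(os.path.join(db_dir, "soprano-virtuoso*")):
        os.remove(path)
        removed.append(os.path.basename(path))
    return sorted(removed)
```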
Next I was left with importing unified-graph.ttl back into the database. For that I killed the Nepomuk server and relaunched the standalone Virtuoso.
It's important that the dump directory is still listed in DirsAllowed, just as it was when we dumped the graphs.

To import the graph with my data back, I pasted the following at the isql prompt:

CREATE PROCEDURE load_graphs 
  ( IN  dir  VARCHAR := 'dumps' )
{
  DECLARE arr ANY;
  DECLARE g VARCHAR;

  arr := sys_dirlist (dir, 1);
  log_enable (2, 1);
  FOREACH (VARCHAR f IN arr) DO
    {
      IF (f LIKE '*.ttl')
        {
          DECLARE CONTINUE HANDLER FOR SQLSTATE '*'
            {
              log_message (sprintf ('Error in %s', f));
            };
          g := file_to_string (dir || '/' || f || '.graph');
          dbg_printf ('g is %s', "g");
          DB.DBA.TTLP_MT (file_open (dir || '/' || f), g, g, 255);
        }
    }
  EXEC ('CHECKPOINT');
}
;

and ran the procedure to start the import process:

load_graphs ();
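As a sanity check after the import, you can count the triples that landed in the unified graph. The sketch below is hypothetical: the graph IRI is an example (use the one from your .ttl.graph file), localhost:1111 is Virtuoso's default port, and dba/dba are its default credentials:

```python
import subprocess

# Hypothetical example values: substitute your own graph IRI,
# host:port, and credentials.
query = ("SPARQL SELECT COUNT(*) WHERE "
         "{ GRAPH <nepomuk:/ctx/my-unified-graph> { ?s ?p ?o } };")
cmd = ["isql", "localhost:1111", "dba", "dba", "EXEC=" + query]

# Uncomment to run against the standalone virtuoso instance:
# subprocess.run(cmd, check=True)
```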