Release Notes v0.5.4
New in this version: Remote, on-demand, database access.
Allow on-demand database access, and table cloning, from remote sources via http:// and other protocols supported by curl. Requires [http://urlgrabber.baseurl.org/ urlgrabber] version 3.9.1 or later to work.
The basic idea is that instead of downloading all the files that make up a table from some remote location before any LSD tool can access it, LSD now remembers the remote location and downloads the files as they're needed (e.g., when lsd-query needs them to materialize a query result). This way one can quickly query even multi-terabyte remote tables if one is interested in only a small (spatial/temporal) subset (see the example below).
This mechanism also serves as a (smarter) substitute for rsync, allowing one to track and sync with remote databases as they change (are updated).
This example shows how to link a table named sdss (in the local database db) to a remote table residing at http://faun.rc.fas.harvard.edu/mjuric/db/public, and then query a small field of view.
[mjuric@pan src]$ export LSD_DB=db
[mjuric@pan src]$ lsd-admin remote follow table http://faun.rc.fas.harvard.edu/mjuric/db/public sdss
Table 'sdss' set up to follow remote table 'http://faun.rc.fas.harvard.edu/mjuric/db/public/sdss'
[mjuric@pan src]$ lsd-query --format=fits --bounds='beam(200, 40, 1)' 'select ra, dec from sdss'
[3 el.]::::::::::::::::::::> 2.37 sec
Output in output.fits
39332 rows selected.
The local table in db/sdss starts out empty. Each time an LSD routine requests access to a file that is not in db/sdss, the file is transparently fetched to satisfy the request and stored in db/sdss to satisfy future requests. In the particular example above, only the three tablets that were needed to execute the query will have been downloaded to db:
[mjuric@pan src]$ du -h --max-depth=1 db
21M db/sdss
21M db
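The fetch-on-miss behavior described above can be illustrated with a minimal Python sketch. This is not LSD's actual code: the function names (fetch_over_http, local_path_for) and the way the remote URL is joined to a relative path are assumptions made for illustration only.

```python
import os
import shutil
import tempfile
import urllib.request


def fetch_over_http(url, dest):
    """Download url to dest, writing to a temporary file first."""
    dest_dir = os.path.dirname(dest)
    os.makedirs(dest_dir, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as resp:
            shutil.copyfileobj(resp, out)
        # Rename into place only once the download is complete, so other
        # readers never see a partially written file.
        os.replace(tmp, dest)
    except BaseException:
        os.unlink(tmp)
        raise


def local_path_for(table_dir, remote_url, relpath, fetch=fetch_over_http):
    """Return a local path for relpath, fetching it from the remote on a miss.

    table_dir, remote_url and relpath are hypothetical parameters; the real
    layout of an LSD table directory is not assumed here.
    """
    local = os.path.join(table_dir, relpath)
    if not os.path.exists(local):
        # Cache miss: download once and keep the file for future requests.
        fetch(remote_url.rstrip("/") + "/" + relpath, local)
    return local
```

The second request for the same file finds it on disk and skips the network entirely, which is why only the files a query touches end up in db/sdss.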
If you do want to download the entire table (and stop depending on the link to the remote database), use lsd-admin remote fetch:
[mjuric@pan src]$ lsd-admin remote fetch table sdss
Fetching sdss: [1606 el.]::::::::::::::::::::> 3116.82 sec
278674212 rows fetched.
[mjuric@pan src]$ du -h --max-depth=1 db
21G db/sdss
21G db
Even if you have a full local copy, the remote database will still be checked each time a table is opened to verify that no new data exists on the remote. To truly stop depending on the remote, it can be "unfollowed":
[mjuric@pan src]$ lsd-admin remote unfollow table sdss
Table 'sdss' stopped following 'http://faun.rc.fas.harvard.edu/mjuric/db/public/sdss'.
Some tables also have associated predefined joins (for example, the ps1_det and ps1_exp tables). Such join definitions can also be fetched from the remote:
[mjuric@pan src]$ lsd-admin remote fetch join http://faun.rc.fas.harvard.edu/mjuric/db/public ps1_obj:ps1_det
Fetched .ps1_obj:ps1_det.join
Finally, to list what objects are available in the remote database:
[mjuric@pan src]$ lsd-admin remote list http://faun.rc.fas.harvard.edu/mjuric/db/public
TABLE galex_gr5
TABLE sdss
JOIN ps1_obj:sdss
To make a database remotely available via the web (http or https), copy it to a location the web server can see (typically a subdirectory of ~/public_html, or wherever your web server looks for files). This makes the database downloadable using clients such as wget, but not yet fetchable with LSD. To make the database objects "visible" to LSD clients, you'll have to publish them:
[mjuric@pan src]$ lsd-admin remote publish table ps1_det ps1_exp
Table ps1_det: 7 snapshots published.
Table ps1_exp: 7 snapshots published.
Published tables ps1_det, ps1_exp.
[mjuric@pan src]$ lsd-admin remote publish join ps1_det:ps1_exp
Join ps1_det:ps1_exp published.
The example above makes the two tables, ps1_det and ps1_exp, follow-able by remote clients, and makes the join ps1_det:ps1_exp fetchable. Internally, LSD just adds the names of those tables and joins to a hidden text file named .listing in the database directory, and in the snapshots directories of each published table.
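As a rough illustration of the bookkeeping involved, the sketch below appends a name to a listing file if it is not already present. The actual format of .listing is not described in these notes; one name per line is an assumption, and add_to_listing is a hypothetical helper, not part of LSD.

```python
import os


def add_to_listing(directory, name, listing_name=".listing"):
    """Add name to the listing file in directory unless it is already there.

    Assumes (hypothetically) a plain-text file with one entry per line.
    Returns True if the file was changed, False otherwise.
    """
    path = os.path.join(directory, listing_name)
    names = []
    if os.path.exists(path):
        with open(path) as f:
            names = [line.strip() for line in f if line.strip()]
    if name in names:
        return False  # already published, nothing to do
    names.append(name)
    with open(path, "w") as f:
        f.write("\n".join(names) + "\n")
    return True
```

Because the listing is an ordinary file served by the web server, a client only needs a plain download of it to discover what has been published.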
If your tables are updated at some point (e.g., more rows are added), those changes won't be visible to remote clients until you either republish the tables (rerun the 'lsd-admin remote publish table' command you ran originally), or use the --update switch to update all currently published tables:
[mjuric@pan src]$ lsd-admin remote publish table --update
Table ps1_det: 7 snapshots published.
Table ps1_exp: 7 snapshots published.
Published tables ps1_det, ps1_exp.
Finally, to unpublish a table (or a join), use the 'remote unpublish table' and 'remote unpublish join' subcommands.
At the lowest level, LSD uses libcurl (http://curl.haxx.se/libcurl/) to do the fetching from the remotes, and therefore anything that libcurl understands can serve as a remote. This includes http, https, ssh, sftp, ftp and many, many more (see libcurl's web page for an exhaustive list).
If you need to download from a password protected site, the username and password should be given as part of the URL. For example:
[mjuric@pan src]$ lsd-admin remote follow table http://ps1sc:[email protected]/mjuric/db/ps1 ps1_obj
would follow a table ps1_obj from a password-protected site where the username is ps1sc and password is 'passhere'.
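To see which parts of such a URL carry the credentials, the following sketch uses Python's standard urllib.parse module to split them out. It only illustrates the URL syntax and is not LSD code; split_credentials is a hypothetical helper and example.org is a placeholder host.

```python
from urllib.parse import urlsplit


def split_credentials(url):
    """Return (username, password, url_without_credentials)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host += ":" + str(parts.port)
    # Rebuild the URL with only host[:port] in the network-location part.
    clean = parts._replace(netloc=host).geturl()
    return parts.username, parts.password, clean


# example.org is a placeholder host used only for this illustration.
user, password, clean = split_credentials(
    "http://ps1sc:passhere@example.org/mjuric/db/ps1"
)
print(user, password, clean)
```

Everything between "://" and "@" is the username and password, separated by a colon, so anyone who can read the full URL can read the password.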
SECURITY WARNING: Note that the URL of the remote table is stored locally in a file named .remote in the table directory. THIS FILE WILL BE READABLE BY ANYONE WHO CAN ACCESS YOUR LOCAL DATABASE. Make sure you've appropriately restricted access to your local database directory before following remote password-protected tables.
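One way to restrict access on a POSIX system is to strip all group and other permissions from the database directory and its contents, roughly what chmod -R go-rwx db does. The sketch below is an illustration under that assumption; restrict_to_owner is a hypothetical helper, not an LSD command.

```python
import os
import stat


def restrict_to_owner(db_dir):
    """Remove group/other permissions from db_dir and everything under it."""
    # Collect all paths first, then change modes, so the walk itself is
    # not affected by the permission changes.
    paths = [db_dir]
    for root, dirs, files in os.walk(db_dir):
        paths.extend(os.path.join(root, n) for n in dirs + files)
    for path in paths:
        if os.path.islink(path):
            continue  # leave symlinks alone; chmod would follow them
        mode = stat.S_IMODE(os.lstat(path).st_mode)
        os.chmod(path, mode & ~(stat.S_IRWXG | stat.S_IRWXO))
```

Calling restrict_to_owner("db") before following a password-protected table leaves the owner's own permissions untouched while making the .remote files unreadable to other local users.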