Thursday, September 25, 2014

Testing a Catalog Service (CSW) for a Planetary Portal

The topic for this post is a potential method for multiple data portals to share resources (and data) via OGC Catalog Services (CSW). I am using "portal" loosely here, since many of the sites that could become proper data portals are currently just websites (even if built from a dynamic database).

Goals:
  1. Create “proper” geospatial metadata for science-ready (map-projected) derived products, including file-based formats (PDS, GeoTIFFs, etc.) and live maps (WMS, WFS, etc.).
  2. Test a freely available, open-source OGC CSW service to host this metadata.
  3. Using this service, enable one site to catalog another site’s records, called “harvesting”.
  4. When other facilities harvest these records, make sure proper credit is given to the originator and that downloads come from the original site. Here we are trying to minimize data redundancy (and issues like whose copy is newer or different).
  5. While websites can facilitate data discovery across facilities, also teach users to query these CSW services directly, both via available command-line methods and from GUIs (like QGIS).

  • Create proper metadata
To keep this discussion short I am going to skip over many metadata topics, but to be able to use existing CSW infrastructure, I am going to recommend the FGDC metadata standard. Why? Because we want to catalog geospatial data (and 99% of the standard works for planetary). We could also more simply use Dublin Core or NASA’s DIF, both of which seem well supported in existing catalog software.

As part of a proof-of-concept, I have a very rough program called gdal2metadata.py (code on GitHub) to help generate (hopefully) valid FGDC metadata. I still recommend using the USGS tool “mp” to actually validate the output. Only ~30% of the metadata can be automatically gathered from any one map-projected file (in our case a GeoTiff or PDS image). Thus this code requires a “template” metadata record, mostly written by the team responsible for the data set. If you keep the abstract, purpose, and other sections fairly generic, one template can serve many related files. Once that is written, the script can be run with the template against many files. It needs a lot of updating, for example to support more projections, but it is a start. The GitHub repository also has a couple of template examples for lunar DEMs. Anyway, once a Python environment is installed (with GDAL), you can simply run the command below. (BTW, I recommend the excellent Anaconda environment; install GDAL using “conda install gdal”.)


> ./gdal2metadata.py NAC_DTM_ATLAS5.TIF ASU_DTM_Template.xml


Warning: the metadata does not currently include the actual download location, but I will add it soon. Also, any particulars for specific images, such as accuracy or a listing of the source images used, would have to be inserted by this script or by another script that updates those sections.
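As a rough sketch of the template-fill idea (this is not the actual gdal2metadata.py code), here is how per-file values such as bounding coordinates and the missing download URL could be patched into an FGDC record. The element names follow FGDC, but the miniature template, the bounds, and the URL are all made up for illustration; the real script would read the bounds from the file via GDAL.

```python
# Sketch: patch per-file values into a team-written FGDC template.
# The bounds and URL below are hypothetical illustration values.
import xml.etree.ElementTree as ET

# Tiny stand-in for a real FGDC template (real ones are much larger).
template = """<metadata>
  <idinfo>
    <spdom><bounding>
      <westbc>0</westbc><eastbc>0</eastbc>
      <northbc>0</northbc><southbc>0</southbc>
    </bounding></spdom>
    <citation><citeinfo><onlink>TBD</onlink></citeinfo></citation>
  </idinfo>
</metadata>"""

def fill_template(xml_text, bounds, download_url):
    """bounds = (west, east, north, south) in degrees."""
    root = ET.fromstring(xml_text)
    for tag, value in zip(("westbc", "eastbc", "northbc", "southbc"), bounds):
        root.find(".//" + tag).text = str(value)
    # Address the missing-download-location warning above.
    root.find(".//onlink").text = download_url
    return ET.tostring(root, encoding="unicode")

record = fill_template(template, (-30.5, -29.8, 24.2, 23.6),
                       "http://example.org/data/NAC_DTM_ATLAS5.TIF")
```

The same pattern (parse template, patch elements, write out) would extend to accuracy values or source-image listings.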



  • OGC CSW service

Using the metadata record generated above, I was able to get up and running quickly with the free pycsw software. I also cheated and used the OSGeo Live (v8) virtual machine for an immediate working environment. Fortunately, along with the virtual machine, the pycsw team has published a walk-through workshop (which I followed from this point). Here are the few steps I had to take to load this example FGDC record. (It was fairly immediate once I figured out pycsw only wants lowercase .xml extensions!)


> cd /var/www/pycsw
> export PYTHONPATH=`pwd`
> sudo vi default.cfg
       and change this line to include fgdc:
       profiles=fgdc,apiso
While editing, also update the custom name, place, email, etc. for your facility.

Back up the sqlite example database and create a new one.
Note: the location and name of the database are set in default.cfg
> sudo rm  /var/www/html/pycsw/tests/suites/cite/data/records.db
(normally I would back this file up, not simply delete it)
> sudo python ./bin/pycsw-admin.py -c setup_db -f default.cfg
> sudo ./bin/pycsw-admin.py -c load_records -f default.cfg -p ~/myTestMetadata/LROC -r
where "-r" means recursive and will load any files it finds in the directory tree.


That is basically it…

Now start Firefox and point it to http://localhost/pycsw/tests/index.html, change the request pulldown to GetCapabilities and the server to “../csw.py?config=/var/www/html/pycsw/default.cfg”, and hit Send. There are other example requests to try, but also send a “GetRecords by bbox” using the same server and:

<gml:lowerCorner>-90 -180</gml:lowerCorner>
<gml:upperCorner>90 180</gml:upperCorner>
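For reference, a full GetRecords request with that bbox filter looks roughly like the following (this is my hand-written approximation of what the test page sends; nothing is transmitted here, and the exact element set you want may differ). You can POST the string to the csw.py endpoint with curl or urllib to run it outside the browser.

```python
# Sketch of a CSW 2.0.2 GetRecords POST body with a BBOX filter,
# matching the gml corners used in the test page above.
BBOX_QUERY = """<?xml version="1.0"?>
<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:ogc="http://www.opengis.net/ogc"
    xmlns:gml="http://www.opengis.net/gml"
    service="CSW" version="2.0.2" resultType="results">
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>brief</csw:ElementSetName>
    <csw:Constraint version="1.1.0">
      <ogc:Filter>
        <ogc:BBOX>
          <ogc:PropertyName>ows:BoundingBox</ogc:PropertyName>
          <gml:Envelope>
            <gml:lowerCorner>-90 -180</gml:lowerCorner>
            <gml:upperCorner>90 180</gml:upperCorner>
          </gml:Envelope>
        </ogc:BBOX>
      </ogc:Filter>
    </csw:Constraint>
  </csw:Query>
</csw:GetRecords>
"""

import xml.etree.ElementTree as ET
ET.fromstring(BBOX_QUERY)  # sanity check: the request is well-formed XML
# To run it: POST BBOX_QUERY to http://localhost/pycsw/csw.py
# with Content-Type: text/xml (e.g. curl -d @query.xml).
```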

Definitely a start. But now try QGIS with the CSW plug-in, also on the OSGeo Live VM (using http://localhost/pycsw/csw.py as the server). Here is a snapshot after running a spatial search for records:


Next I need to test pycsw’s ability to automatically harvest layers from a WMS GetCapabilities document. That page also shows how to access these records from Python on the command line.
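As I understand the pycsw documentation, WMS harvesting is done with a CSW-T Harvest request pointed at the WMS GetCapabilities URL (and transactions have to be enabled in default.cfg). Here is a sketch of building that request; the WMS URL below is hypothetical, so treat this as an untested outline rather than a verified recipe.

```python
# Sketch: build a CSW-T Harvest request asking pycsw to pull layers
# from a remote WMS GetCapabilities document. The WMS URL is made up.
import xml.etree.ElementTree as ET

CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"
ET.register_namespace("csw", CSW_NS)

harvest = ET.Element("{%s}Harvest" % CSW_NS,
                     {"service": "CSW", "version": "2.0.2"})
ET.SubElement(harvest, "{%s}Source" % CSW_NS).text = \
    "http://example.org/wms?service=WMS&request=GetCapabilities"
ET.SubElement(harvest, "{%s}ResourceType" % CSW_NS).text = \
    "http://www.opengis.net/wms"

request_body = ET.tostring(harvest, encoding="unicode")
# POST request_body to http://localhost/pycsw/csw.py to trigger harvesting.
```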


  • Summary (thus far)

There are still plenty of items to test, including making sure the metadata is pushed through properly and that WMS layers are cataloged correctly. But this seems to be a workable method, using mostly existing tools, to share data across facilities. There are also tools built on pycsw that nicely display the catalog, like GeoNode, Open Data Catalog, and CKAN (used by data.gov).

LPSC abstract on the topic.

to be continued…