Notes on the SRB Tutorial by Roman Olschanowsky, June 27, 2006
(Taken by Leesa Brieger)
--------------------------------------------------------------------------------------------
(SRB home page: http://www.sdsc.edu/srb)

I. Overview

Great diversity of platforms in SDSC's past, and with it a great problem of moving data between machines, even just inside SDSC. SRB was born of this need to manage data movement between platforms - and grew to include geographically-distributed platforms. Its WAN capabilities are largely what distinguish it from other data-management tools.

Sites using SRB: worldwide.
At SDSC alone, the various projects using SRB occupy over 60M files and .5 PB.

Simplest description of SRB: a distributed file system or data grid. A client-server system.
Single sign-on for accessing heterogeneous, distributed data on heterogeneous systems.
The familiar view of one's data is tied to the physical location of the data. Not so with SRB.

Interfaces:
- inQ
- Scommands
- jargon
- mySRB web client
- Matrix (WSDL workflow)
- C/C++
- Python
- Perl
...

Examples:
- the BIRN portal (Perl-based interface to SRB. See http://portal.nbirn.net)
- NEEScentral Portal (PHP-based interface to SRB)

BIRN:
- 15+ sites distributed around the Abilene Backbone, about 30 TB of data shared among them through SRB.
- Using inQ, can see the total of all the files, including a view of where they reside, if desired. The view gives you resource pooling, making the servers appear as a group, even though they are actually spread among the many sites.
- SRB's file replication utility is useful for BIRN... replicas can be copies on the original machine or on different resources (even physically distant resources).
- Logical resource: a single resource name which might consist of several physical resources.
- Administrative statistics software is provided with SRB; it can show how resources are used among distributed sites.
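
To make the "logical resource" and replication ideas above a bit more concrete, here is a toy Python sketch. It is not SRB code and uses no SRB API: the class name, the resource names, and the paths are all hypothetical. It also assumes, purely for simplicity, that storing to a logical resource leaves one replica on every physical resource behind it; the notes do not say how SRB actually behaves here.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Toy stand-in for a catalog that separates logical names from locations."""

    # logical resource name -> physical resource names (all hypothetical)
    logical_resources: dict = field(default_factory=dict)
    # logical file path -> physical resources holding a replica
    replicas: dict = field(default_factory=dict)

    def physical_resources(self, resource):
        """Expand a logical resource name; a physical name maps to itself."""
        return list(self.logical_resources.get(resource, [resource]))

    def register(self, path, resource):
        """Record a replica of `path` on each physical resource behind `resource`.

        Assumption for this sketch only: a logical resource fans out to all
        of its members.
        """
        holders = self.replicas.setdefault(path, [])
        for phys in self.physical_resources(resource):
            if phys not in holders:
                holders.append(phys)

    def locate(self, path):
        """Return the physical resources that hold a replica of `path`."""
        return list(self.replicas.get(path, []))


if __name__ == "__main__":
    catalog = ToyCatalog(
        logical_resources={"pool": ["site-a-disk", "site-b-archive"]}
    )
    catalog.register("/home/guest/scan1.img", "pool")
    print(catalog.locate("/home/guest/scan1.img"))
```

The point of the sketch is only that the user works with one path and one resource name, while the catalog keeps track of where the copies physically live.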
Questions:

Q: How does one control access permissions to SRB files (objects in SRB terminology)?
A: With the SRB access controls. By default, only the owner has access to their own files; to share data, the owner must grant permissions to other users.

Q: Can SRB handle databases?
A: Yes, a database is handled as a "shadow object" by SRB, and the object (database) can be queried by SQL. Also, an SQL query can be registered as an SRB object.

Q: How does a data grid like BIRN really play out?
A: Usually researchers keep their own data locally, accessing it through Scommands or some other SRB client.

Q: But what about the expanded use of data, i.e. the real sharing of data by projects?
A: BIRN collaboration reports show how much data is accessed by the different partners; the BIRN audit logs show a very high percentage of sharing.

Q: Is querying a database just as easy as setting up a file?
A: It takes a sys admin to set up a shadow object, but once that is done, using it and even creating future shadow objects (database objects) is easy.

Q: Which databases are supported?
A: Oracle, DB2, Postgres, MySQL, ...

Q: Does SRB ever look into a file?
A: It can, in the form of filtering for metadata harvesting...

Q: What about for databases? Storing SQL queries?
A: SRB registers a digital entity, and the operations you can do on that entity depend on what kind of entity it is. (Proxy operations also exist with SRB.)

Q: Does SRB do caching?
A: There is a separate application which can do distributed caching on top of SRB. (It's separate from SRB.)

Q: Shadows for a database... is it a one-time setup? How does sharing work?
A: Once the shadow object is registered, it can be queried by any user that the owner gives access to. It is just as secure as any SRB object. You must grant permissions if you want to share.

Example: HPSS archive at Caltech.
A user put data into HPSS, the data was registered in SRB, and then a replica of that archive was made to HPSS at SDSC. The only extra access ever allowed was for the SRB user to read the original HPSS data at Caltech.

II. inQ - Windows OS only
(Download this client from www.sdsc.edu/srb/index.php/inQ)

Guest accounts are set up - maybe only for the duration of the tutorial.
Familiar Windows-type buttons. Tree view, metadata view, content view.
This client can store connection parameters, upload files with drag-and-drop, set access permissions, add metadata, and query based on this metadata.

- Log in using inQ on the guest accounts.
- Upload files.
- Query user-defined metadata to find the file or files we are looking for... name-value pairs, case-sensitive.

III. Scommands - unix-style commands
(Download this client from www.sdsc.edu/srb/index.php/Scommands)

Scommands provide the fastest, most flexible high-level client interface to SRB. (This is the client which is behind most of the portals which use SRB.)
There are man pages for every Scommand; almost all of them look like a unix command prefixed by "S". See www.sdsc.edu/srb/index.php/Scommands_Manpages .

There are 2 configuration files to set up for using the Scommands:
~/.srb/.MdasEnv and ~/.srb/.MdasAuth
The Sauth command will encrypt the SRB password.
Can do simple password authentication or GSI authentication with certificates and proxies.

Useful commands...
Shelp - gives list of Scommands
Serror - explains a given error code
Senv - like env
Scd, Sput, Sget, Spwd, Sls, Srm, ...

Sput with the -b option is faster for small files because it is a bulk transfer which groups small files together and moves them all in bulk.
Sput with -m does multi-threading and is useful for large files.
Add user-defined metadata with Sufmeta.

More Questions
----------------------

Q: Differences between SRB and other data grids?
A: The management of metadata varies; most (Fedora, DSpace, e.g.) only work with files on a local filesystem. These two data grids are now integrated with SRB to work with distributed data.

Q: The abstractions... if I have a database, what about the abstractions of that?
A: I can register a database as a shadow object and manage metadata about that (as about any SRB object).

Comments:
Typically a schema is a fixed representation. SRB uses schema indirection: instead of a separate row in the database, point to an attribute name and point to an attribute entry. This allows arbitrary metadata entries, and metadata extension in which an attribute is actually another database.

SRB: Rather than virtualizing the resource, virtualize the collection.
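
The schema indirection comment above can be illustrated with a small attribute-name/attribute-value layout. The Python sketch below uses the standard sqlite3 module only as a convenient stand-in. It is not the SRB catalog schema; the table names, column names, and example paths are all hypothetical. It shows how name-value metadata can be added without changing a fixed table definition, and how a case-sensitive name-value query finds the matching objects.

```python
import sqlite3


def build_catalog():
    """Create an in-memory toy catalog with indirect (name/value) metadata."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE objects (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
        CREATE TABLE attr_names (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE attr_values (
            object_id INTEGER REFERENCES objects(id),
            attr_id   INTEGER REFERENCES attr_names(id),
            value     TEXT
        );
        """
    )
    return db


def add_metadata(db, path, name, value):
    """Attach one name-value pair to an object, creating rows as needed."""
    db.execute("INSERT OR IGNORE INTO objects (path) VALUES (?)", (path,))
    db.execute("INSERT OR IGNORE INTO attr_names (name) VALUES (?)", (name,))
    # Store pointers to the object and the attribute name, plus the value.
    db.execute(
        """
        INSERT INTO attr_values (object_id, attr_id, value)
        SELECT o.id, a.id, ?
        FROM objects o, attr_names a
        WHERE o.path = ? AND a.name = ?
        """,
        (value, path, name),
    )


def find_objects(db, name, value):
    """Return paths whose metadata has exactly this name-value pair."""
    rows = db.execute(
        """
        SELECT o.path
        FROM objects o
        JOIN attr_values v ON v.object_id = o.id
        JOIN attr_names a ON a.id = v.attr_id
        WHERE a.name = ? AND v.value = ?
        ORDER BY o.path
        """,
        (name, value),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_catalog()
    add_metadata(db, "/home/guest/scan1.img", "subject", "mouse")
    add_metadata(db, "/home/guest/scan1.img", "modality", "MRI")
    add_metadata(db, "/home/guest/scan2.img", "subject", "mouse")
    print(find_objects(db, "subject", "mouse"))
```

A new attribute such as "modality" needs no schema change: it becomes one more row in the names table, which is the kind of flexibility the indirection is meant to buy.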