Some facilities struggle with policy-related questions surrounding access to and storage of data. For example, a site with a brand new Science DMZ may be trying to onboard users and may be faced with:
- A collection of users that are aware of fast data movement tools, and may have their own DTNs/storage already
- A different set that may be aware of the tools, but have no resources and would prefer to use something centrally managed
- Other users that may not have their own data storage locally, but require access to vast amounts of remote storage owned by a collaboration
- A final grouping that is unsophisticated in technology, but unwilling to use pooled resources
Crafting a policy regarding the transmission, storage, and handling of data can be very complicated, and the policy must be tailored to the rules of a facility. Identification of PII (data, research products, etc.) is useful in determining risks. Once that is sorted out, there are four major areas to consider:
- Network infrastructure between data resources (path capacities, peerings, security policies)
- Mechanisms for sending/receiving data (tools, hardware, security policies)
- Data storage locations (storage on a single machine, network and parallel filesystems)
- Data access and retention policy (who can access data, and how long it is allowed to live on shared resources)
Within the Science DMZ model, it is recommended that devices requiring a large, fast, and clean network path be placed as close to the site border as possible. In practice this can mean a single Data Transfer Node (DTN) or a pool of them. These devices can have local storage that allows for easy data movement, or they can be integrated into a storage area network (SAN). From a network perspective, the interfaces of these resources that face the Wide Area Network should have as little friction in the path as possible. This implies:
- Direct connectivity to a fast DMZ network, free of devices that would slow down performance (small buffers, disruption appliances)
- Security policy tailored to the device, e.g. which ports are open for the data mobility tools and which netblocks are accepted
- Establishing peerings with facilities with which you have routine data communication. Doing so will also decrease the possible attack vectors by exposing only a small number of possible communication channels
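A device-specific security policy of the kind listed above can be sketched as a simple allowlist check. This is only an illustration, not a firewall implementation: the netblocks are documentation-range placeholders, the port ranges are arbitrary examples, and the names `ALLOWED_NETBLOCKS`, `ALLOWED_PORTS`, and `is_allowed` are hypothetical. A real deployment would express the same rule in router ACLs or host firewall rules.

```python
from ipaddress import ip_address, ip_network

# Hypothetical policy for a single DTN. The netblocks are documentation
# ranges standing in for collaborating facilities; the ports are examples only.
ALLOWED_NETBLOCKS = [
    ip_network("192.0.2.0/24"),
    ip_network("2001:db8:100::/48"),
]
ALLOWED_PORTS = [
    range(443, 444),        # a single control port (assumed)
    range(50000, 51001),    # a data-channel port range (assumed)
]


def is_allowed(src: str, dst_port: int) -> bool:
    """Return True only if the source netblock and destination port both match."""
    addr = ip_address(src)
    # An IPv4 address is never "in" an IPv6 network and vice versa,
    # so mixed-family comparisons simply evaluate to False.
    in_netblock = any(addr in net for net in ALLOWED_NETBLOCKS)
    in_ports = any(dst_port in ports for ports in ALLOWED_PORTS)
    return in_netblock and in_ports
```

For instance, `is_allowed("192.0.2.10", 50010)` returns `True`, while the same source on port 22, or an unlisted source on any port, is rejected.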
Data Storage, Access, and Retention Policies
Storage resources are often a function of available funding. Hard decisions must be made regarding the amount of storage that is available, and what happens when it begins to run out. Things to consider:
- Will federated identity be used to manage the accounts and storage allocations of everyone's data? Will group management be used to facilitate sharing?
- If the DTNs have a limited amount of storage to facilitate transfer to or from your facility, will data be culled periodically to conserve space?
- Will the DTNs be connected to a SAN, and if so, will the data be migrated automatically or manually?
- For global storage, will age limits and quotas be put on storage allocations?
- Will students and faculty that no longer belong to the facility have their data removed after a period of time?
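If a site does decide to cull data from DTN scratch space by age, the first step is usually a dry run that only reports what would be removed. The sketch below assumes a 30-day style age limit based on file modification time; the function name `expired_files`, its parameters, and the choice of modification time as the criterion are all assumptions for illustration, and it deliberately deletes nothing.

```python
import os
import stat
import time
from pathlib import Path


def expired_files(root, max_age_days, now=None):
    """Yield (path, size_bytes) for regular files older than max_age_days.

    Age is measured from the last modification time. Symbolic links are
    not followed, and files that vanish during the walk are skipped.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.lstat()
            except OSError:
                continue  # file disappeared or is unreadable
            if stat.S_ISREG(info.st_mode) and info.st_mtime < cutoff:
                yield path, info.st_size


def report(root, max_age_days):
    """Print a dry-run summary; nothing is removed."""
    total = 0
    for path, size in expired_files(root, max_age_days):
        total += size
        print(f"would remove {path} ({size} bytes)")
    print(f"total reclaimable: {total} bytes")
```

Calling `report("/path/to/scratch", 30)` (the path is a placeholder) lists candidate files and the space they occupy, which can help when discussing age limits and quotas with users before any policy is enforced.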
It is often the case that policy on storage is applied at the storage system (e.g. inode and volume quotas). Data movement tools (such as Globus) on the DTNs are the primary method for getting data into and out of the storage system (typically a large parallel filesystem, e.g. Lustre, GPFS, etc.). If the system is configured to mount remote storage, a double copy of the data is not required.
There are variants in terms of DTN access. Some sites run "sealed" DTNs, where users have no shell access and the only way to use those DTNs is via the tool interface. Other facilities put user accounts and home directories (via NFS) on the DTNs just like they do on any other user system in the HPC facility. In that case, there are "fast" and "slow" filesystems mounted on the DTNs: home directories are small and slow, and the HPC center global filesystem (which is also mounted on the supercomputers) is big and fast.
Note also that if you have a Science DMZ, you can subdivide access policy. If researchers need to put their own DTNs in your Science DMZ, you might want to apply different network-layer access policies to them than you apply to your central HPC DTNs, with a different set of appropriate ACLs for each.
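One way to picture this subdivision is a lookup from the subnet a DTN lives in to the ACL set that governs it. The sketch below is a minimal illustration under assumed values: the two subnets are documentation-range placeholders, and the ACL names and the `acl_for` helper are hypothetical, not part of any real router configuration.

```python
from ipaddress import ip_address, ip_network
from typing import Optional

# Hypothetical Science DMZ layout: one subnet for centrally managed HPC DTNs
# and one for researcher-owned DTNs, each tied to its own named ACL set.
ACL_BY_SUBNET = {
    ip_network("198.51.100.0/26"): "central-hpc-dtn-acl",
    ip_network("198.51.100.64/26"): "researcher-dtn-acl",
}


def acl_for(dtn_addr: str) -> Optional[str]:
    """Return the ACL set name for a DTN address, or None if it is outside the DMZ."""
    addr = ip_address(dtn_addr)
    matches = [net for net in ACL_BY_SUBNET if addr in net]
    if not matches:
        return None
    # Prefer the most specific (longest-prefix) subnet if several match.
    best = max(matches, key=lambda net: net.prefixlen)
    return ACL_BY_SUBNET[best]
```

Keeping the two classes of DTN in separate subnets makes it straightforward to tighten or relax one policy without touching the other.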