Storlets are computational objects that run inside the object store system. Conceptually, they can be thought of as the object store equivalent of database stored procedures. The basic idea behind storlets is to perform the computation near the storage, saving the network bandwidth that would otherwise be required to bring the data to the computation. Computation near storage is most appealing in the following cases:
1. When operating on a single huge object, e.g. in healthcare imaging.
2. When operating on a large number of objects in parallel, e.g. on large volumes of archived time series data.
The storlet functionality in COSMOS is developed in the context of the OpenStack Swift object store. Running a computation inside a storage system involves two major aspects: resource isolation and data isolation. Resource isolation means making sure the computation does not consume too many resources, so that the stability and ongoing operation of the storage system are not compromised. Data isolation means making sure that the computation can access only the data it is supposed to access. Both resource and data isolation are achieved by sandboxing the computation.
Our storlet implementation supports two scenarios, referred to as the PUT and GET scenarios.
In the PUT scenario, the storlet is invoked during object upload: instead of keeping the data (and user metadata) as uploaded, the storage system keeps the result of the storlet invocation over the uploaded data and metadata. This scenario is useful for e.g. metadata extraction: when the uploaded data is a .jpg file, the storlet can extract the image information (resolution, geospatial coordinates, etc.) and keep it as Swift metadata.
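The metadata extraction example can be sketched in plain Python. This is an illustration of the idea only, not the storlet interface of the COSMOS implementation: the function names `jpeg_resolution` and `put_scenario` are hypothetical, and the parser is deliberately simplified (it reads only the image dimensions from the frame header and ignores geospatial information). The metadata keys use Swift's `X-Object-Meta-` prefix for user metadata; the names `Width` and `Height` are chosen for illustration.

```python
import struct

# Start-of-frame (SOF) markers carry the image dimensions. 0xC4, 0xC8 and
# 0xCC fall in the same numeric range but are not frame headers.
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def jpeg_resolution(data):
    """Return (width, height) from the first SOF segment, or None.

    Simplified sketch: assumes every segment before the frame header
    carries a 2-byte big-endian length field, which holds for typical files.
    """
    if data[:2] != b"\xff\xd8":          # JPEG files start with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:            # every segment starts with 0xFF
            return None
        marker = data[pos + 1]
        if marker == 0xFF:               # fill byte, skip it
            pos += 1
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            if pos + 9 > len(data):      # truncated frame header
                return None
            # Layout after the length field: precision (1), height (2), width (2)
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            return width, height
        pos += 2 + length                # marker bytes plus segment length
    return None


def put_scenario(data, user_metadata):
    """Hypothetical PUT-side transformation: return what the store would keep."""
    stored_metadata = dict(user_metadata)
    resolution = jpeg_resolution(data)
    if resolution is not None:
        width, height = resolution
        # Illustrative key names under Swift's user metadata prefix.
        stored_metadata["X-Object-Meta-Width"] = str(width)
        stored_metadata["X-Object-Meta-Height"] = str(height)
    return data, stored_metadata
```

In this sketch the object data is stored unchanged and only the metadata is enriched; a PUT storlet could equally transform the data itself.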
In the GET scenario, the storlet is invoked during object retrieval: instead of getting the object's data (and metadata) as kept in the object store, the user gets back the result of the storlet invocation on the object's data (and metadata). This scenario is useful for e.g. analytics pre-filtering: consider an analytics program that runs over logs and is only interested in ‘ERROR’ lines. A storlet that runs near the data can filter out all other lines, reducing the bandwidth used between the store and the analytics engine.
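The log pre-filtering example amounts to a streaming filter over the object's lines. The following minimal sketch shows that logic in plain Python; it does not use the storlet interface of the COSMOS implementation, and the function name `filter_error_lines` and the `needle` parameter are hypothetical. An in-memory byte stream stands in for the stored object.

```python
import io


def filter_error_lines(lines, needle=b"ERROR"):
    """Yield only the lines that contain `needle`.

    `lines` is any iterable of byte strings, such as a file-like object
    opened in binary mode. Lines are processed one at a time, so the whole
    object never has to be held in memory.
    """
    for line in lines:
        if needle in line:
            yield line


# Stand-in for the stored log object.
stored_object = io.BytesIO(
    b"INFO service started\n"
    b"ERROR disk full\n"
    b"WARN slow response\n"
    b"ERROR connection lost\n"
)

# Only the matching lines would be sent back to the analytics engine.
result = b"".join(filter_error_lines(stored_object))
```

Here `result` holds the two ‘ERROR’ lines only, so the amount of data leaving the store shrinks in proportion to the share of non-matching lines.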
More information can be found in Deliverable D4.1.2.