The Data Mapper functional component is responsible for ingesting IoT data flowing on the COSMOS Message Bus into cloud storage and annotating it with enriching metadata.
We built a scalable Data Mapper based on the open source Secor tool developed by Pinterest. Secor is a service that persistently stores topics from Apache Kafka in Amazon S3. We enhance Secor in three ways: by enabling OpenStack Swift targets, so that Secor can upload data to Swift; by enabling data to be stored in the Apache Parquet format, which is supported by Spark SQL; and by generating Swift objects with metadata.
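As a rough illustration of what generating Swift objects with metadata involves, the sketch below builds the HTTP headers that carry custom metadata on a Swift object. Swift stores user-defined object metadata in headers prefixed with `X-Object-Meta-`. The function name and the example keys (`min_ts`, `sensor_type`) are hypothetical and are not taken from the Data Mapper implementation.

```python
from typing import Dict

# Prefix that OpenStack Swift uses for custom object metadata headers.
SWIFT_META_PREFIX = "X-Object-Meta-"


def swift_metadata_headers(metadata: Dict[str, str]) -> Dict[str, str]:
    """Turn a plain metadata dict into Swift object metadata headers.

    Hypothetical helper for illustration only.
    """
    headers = {}
    for key, value in metadata.items():
        # Use hyphens and title case so the result looks like a
        # conventional HTTP header name, e.g. min_ts -> Min-Ts.
        name = SWIFT_META_PREFIX + key.replace("_", "-").title()
        headers[name] = str(value)
    return headers


# Example metadata keys are assumptions, chosen only for the sketch.
example_headers = swift_metadata_headers({"min_ts": "1", "sensor_type": "temp"})
```

Headers built this way would accompany the request that uploads the object, so that the metadata is stored together with the data.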
We chose Secor because of the following key features:
• Horizontal scalability: Secor can be scaled out to handle increased load by starting additional Secor processes. It can also be distributed across multiple machines.
• Configurable upload policies: Size-based and time-based policies are both supported.
• Output partitioning: Data can be parsed and stored under partitioned OpenStack Swift paths. This is useful because Spark SQL can access partitioned data in an optimized way, for example by reading only the partitions relevant to a query.
• Reliability: Secor is fault tolerant and strongly consistent.
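The upload policies and output partitioning listed above can be sketched as follows. This is a minimal sketch under stated assumptions, not Secor's actual code: the function names, parameters, and the `dt=YYYY-MM-DD` path layout are illustrative choices. The `key=value` directory style is used here because Spark SQL can discover partitions laid out that way.

```python
from datetime import datetime, timezone


def should_upload(buffered_bytes: int, age_seconds: float,
                  max_bytes: int, max_age_seconds: float) -> bool:
    """Return True when either the size-based or the time-based policy fires.

    Hypothetical helper; thresholds are supplied by the caller.
    """
    return buffered_bytes >= max_bytes or age_seconds >= max_age_seconds


def partitioned_path(topic: str, epoch_seconds: float) -> str:
    """Build a date-partitioned path prefix for a message timestamp (UTC).

    The layout <topic>/dt=<date> is an assumption made for this sketch.
    """
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{topic}/dt={day}"
```

With this layout, a query restricted to a single day only needs to read the objects stored under the matching `dt=` prefix.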
More information can be found in Deliverable D4.1.2.