As industrial Internet-of-Things (IIoT) applications produce a blinding amount of data 24/7, relational and key-value-based databases struggle to keep up.
It’s a problem the team at the Tsinghua University School of Software has been working on for nearly a decade. The result is IoTDB, which recently graduated from the Apache Software Foundation incubator to become a top-level project.
It reached this status “at a time of confluence of database, Internet-of-Things (IoT) and AI technologies in conjunction with a wider adoption of Industry 4.0 and automation approaches to further enable remote work and increased efficiencies,” said C. Mohan, retired IBM Fellow, former chief scientist at IBM India, and a member of the US National Academy of Engineering.
As a Distinguished Visiting Professor working with the team, “I have seen this project reach maturity and build up a vibrant open source community around it,” he said.
From Chinese University
As a PhD student at the Chinese university starting around 2012, Xiangdong Huang, now vice president of the Apache project, was assigned to manage the time-series data being generated by the minute by a large company’s 200,000 machines. Reading the data from Oracle quickly proved to be too slow and buying a more advanced license too expensive.
They decided to try NoSQL — Cassandra — but ran into performance problems with it as well.
“Apache Cassandra is good, and we used five nodes to manage all the data,” he said, explaining, “The user may create more than 5,000 tables in Cassandra, and they do not want to buy more servers to form a larger cluster (as the budget is limited). From then on, I spent about two years to read the source code of Cassandra, and did some modifications on Cassandra, and spent a lot of effort to use the limited server resources to provide better performance, which made us tired. … Even [though] we spent a lot of effort, we find it hard to use five nodes to reach 10 million data points written per second.”
They tried saving a packet of hundreds of data points to Cassandra as key-value pairs. But that meant maintaining everything themselves, and they still ran into performance issues and limitations with the data structure.
They then decided to create a time-series database from scratch. Professor Jianmin Wang came up with the idea of donating the project to ASF in 2018 as a way to get more people involved.
Managing Massive Data
As a time-series database for industrial IoT, the project provides storage for massive datasets, high-throughput data input and complex data analysis. With its lightweight structure, high performance and usable features, it can be easily integrated with other projects such as PLC4X, Hadoop, Hive, Spark and Flink.
“When IoT is used in industrial applications, intelligent equipment usually produces one to two orders of magnitude more data than consumer-oriented IoT devices,” the IoTDB developers wrote in a research paper. “This makes it even harder for analytics to produce valuable insights in a reasonable amount of time.”
The new time-series processing workloads tied to edge computing involve massive data volumes, the need for efficient data ingestion along with complex, low-latency queries and advanced data analytics.
Overall, IoTDB offers:
- High performance for data ingestion and high compression for saving disk space and IO.
- Low latency when querying terabytes of data, with support for time-range and value filters and fast aggregation.
- Support for many time-series-specific operations, such as time-series segmentation and subsequence matching.
- Integration with multiple systems, including MATLAB for traditional industrial analysis; Spark and MapReduce for big data analysis; Grafana for visualization; and Apache Kafka for data ingestion.
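To make the query features in the list above concrete, here is a minimal Python sketch of a time-range filter, a value filter and an aggregation over a single series. It is an illustration only, not IoTDB's query engine or API: the function names `select_range` and `aggregate` and the sample readings are hypothetical, and the data is assumed to be a list of (timestamp, value) pairs already sorted by time.

```python
from bisect import bisect_left, bisect_right


def select_range(points, start, end, value_filter=None):
    """Return points with start <= timestamp <= end that pass value_filter.

    `points` is assumed to be a list of (timestamp_ms, value) tuples
    sorted by timestamp, so the time range is found by binary search.
    """
    timestamps = [t for t, _ in points]
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    window = points[lo:hi]
    if value_filter is not None:
        window = [(t, v) for t, v in window if value_filter(v)]
    return window


def aggregate(points):
    """Compute count, min, max and average over the selected points."""
    values = [v for _, v in points]
    if not values:
        return {"count": 0, "min": None, "max": None, "avg": None}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }


# Hypothetical temperature readings: (timestamp in ms, degrees)
readings = [(1000, 20.0), (2000, 21.5), (3000, 35.0), (4000, 22.5), (5000, 19.0)]
selected = select_range(readings, 2000, 4000, value_filter=lambda v: v < 30)
summary = aggregate(selected)
```

Because the points are kept in time order, the range lookup does not have to scan the whole series, which is the general reason time-ordered storage suits this kind of query.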
The metadata management module uses a tree structure to manage the namespace of devices. IoTDB stores data in an open native time-series file format, so that database access through the query/storage engine and Hadoop/Spark access both work against a single copy of the data.
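A tree-shaped namespace of this kind can be sketched in a few lines of Python. This is a simplified illustration, not IoTDB's metadata module: the `MetadataTree` class, the dotted path convention and the plant and device names are all assumptions made for the example.

```python
class Node:
    """One level in the namespace tree (a group, a device or a measurement)."""

    def __init__(self, name):
        self.name = name
        self.children = {}


class MetadataTree:
    """Hypothetical device namespace keyed by dotted paths."""

    def __init__(self, root="root"):
        self.root = Node(root)

    def _split(self, path):
        parts = path.split(".")
        if parts[0] != self.root.name:
            raise ValueError("path must start with " + self.root.name)
        return parts[1:]

    def register(self, path):
        """Create any missing nodes along the path and return the leaf."""
        node = self.root
        for part in self._split(path):
            node = node.children.setdefault(part, Node(part))
        return node

    def series_under(self, prefix):
        """Return the full paths of all leaves below the given prefix."""
        node = self.root
        for part in self._split(prefix):
            node = node.children.get(part)
            if node is None:
                return []
        found = []
        stack = [(prefix, node)]
        while stack:
            path, current = stack.pop()
            if not current.children:
                found.append(path)
            for name, child in current.children.items():
                stack.append((path + "." + name, child))
        return sorted(found)


tree = MetadataTree()
tree.register("root.plant1.turbine7.temperature")
tree.register("root.plant1.turbine7.speed")
tree.register("root.plant2.press3.pressure")
plant1_series = tree.series_under("root.plant1")
```

The appeal of the tree is that a question such as "every series belonging to plant 1" becomes a walk from one node rather than a scan over all names.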
In the distributed time-series database, the Cluster Engine partitions data among nodes by grouping time series, while time-based data slicing is implemented on each node to improve performance. IoTDB provides an SQL-like language, a native API and a RESTful API to access the data.
In IoT scenarios, edge computing and cloud-side deployment are equally important, according to Xiangdong.
IoTDB has three physical deployment models:
- File-based storage or embedded time-series database on an edge appliance like a Raspberry Pi.
- Standalone time-series database on an industrial PC.
- Distributed time-series database or Hadoop cluster with the TsFile storage format.
Data + Analytics
“Typically, IoT devices collect data from sensors and industrial controllers and send data to the data center using customized or standard protocols like MQTT in real-time. However, in some cases, the edge intelligence requires real-time analytics, such as fault alerts, to retrieve data from a local data store,” he said.
“IoTDB has a lightweight, embedded version to be deployed on the IoT devices, where the minimal runtime memory requirement is 32MB and computation is supported with an ARM7 processor. Local storage is also mandatory to prevent data loss in case of the temporary network outage. In this scenario, TsFile Lib allows the devices to persist data in TsFile format, and afterwards the generated TsFiles can be directly synchronized and merged with active IoTDB instance on the cloud using the File Sync module.”
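The store-locally-then-sync pattern described in the quote can be sketched as follows. This is not the TsFile library or the File Sync module: the `EdgeBuffer` class is hypothetical, an in-memory list stands in for the local file, and the `upload` callable stands in for whatever transport reaches the cloud instance.

```python
class EdgeBuffer:
    """Hypothetical local store that keeps points until an upload succeeds."""

    def __init__(self):
        # Stand-in for a local file on the device
        self.pending = []

    def record(self, series, timestamp_ms, value):
        """Always write locally first, whether or not the network is up."""
        self.pending.append((series, timestamp_ms, value))

    def sync(self, upload):
        """Try to send everything pending; return how many points were sent.

        Points are removed only after `upload` returns without raising,
        so a network outage leaves the local data untouched.
        """
        if not self.pending:
            return 0
        batch = list(self.pending)
        try:
            upload(batch)
        except ConnectionError:
            return 0
        del self.pending[:len(batch)]
        return len(batch)


def offline_upload(batch):
    # Simulates a temporary network outage
    raise ConnectionError("network unreachable")


received = []
buffer = EdgeBuffer()
buffer.record("root.plant1.turbine7.temperature", 1000, 20.0)
buffer.record("root.plant1.turbine7.temperature", 2000, 21.5)
sent_while_offline = buffer.sync(offline_upload)
sent_when_online = buffer.sync(received.extend)
```

The key design choice in the sketch is that data is deleted locally only after a confirmed upload, which is what protects against loss during an outage.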
In the cloud, the Cluster Engine uses a Raft-based protocol to manage multiple IoTDB nodes. In cluster mode, data partitions can be defined according to both time slice and time-series ID. The distribution of data and query operations is completely transparent to end users, he said.
IoTDB is designed for managing data in IoT devices, while other time-series databases usually have been designed for DevOps, he said. IoTDB is more lightweight, as it does not depend on other systems such as an RDBMS or a NoSQL database.
The write throughput of IoTDB reaches tens of millions of points per second.
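Partitioning by both time-series ID and time slice can be pictured with a small Python sketch. It is an assumption-laden illustration rather than the Cluster Engine's actual scheme: the hash function (CRC32), the one-day slice width, the node names and the `route` function are all chosen for the example, and the Raft protocol is left out entirely.

```python
import zlib

DAY_MS = 24 * 60 * 60 * 1000  # assumed slice width: one day, in milliseconds


def route(series_id, timestamp_ms, nodes, slice_ms=DAY_MS):
    """Map a data point to (node, time slice index).

    The series ID decides which node owns the series; the timestamp
    decides which time slice on that node the point falls into.
    """
    group = zlib.crc32(series_id.encode("utf-8")) % len(nodes)
    time_slice = timestamp_ms // slice_ms
    return nodes[group], time_slice


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
series = "root.plant1.turbine7.temperature"  # hypothetical series ID
first_day = route(series, 0, nodes)
second_day = route(series, DAY_MS, nodes)
```

In this sketch a series always lands on the same node, so a query for one series touches one node, while the slice index narrows the search to the relevant time window on that node.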
“Apache IoTDB is a perfect fit for edge computing,” said Julian Feinauer, CEO at German startup pragmatic industries GmbH. “The high compression helps to use the [limited] amount of memory we have very efficiently. IoTDB is a perfect fit, especially in IIoT use cases, where network and compute capabilities are limited on the edge.”
A comparison with competitors InfluxDB, TimeScale and others can be found here.
More to Come
Since joining the Apache Incubator, IoTDB has attracted users from all over the world. Steel and mining venture ArcelorMittal Americas uses it for sensor data. The company was attracted to the project for its ability to integrate with systems such as Spark, Grafana and NiFi, Xiangdong said.
The team is working with the Chinese Meteorological Administration to build the next-generation big data platform. China has 150,000 meteorological stations, each collecting more than 100 metrics that are reported to the data center every five minutes.
And it’s replacing KairosDB in Shanghai metro management. It uses IoTDB to manage 300 railway vehicles collecting 400 billion data points per day.
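The scale of these two deployments is easier to compare once the quoted figures are converted to common units. The short Python calculation below does only that arithmetic: it assumes exactly 100 metrics per station (the article says "more than 100," so the result is a lower bound) and assumes the metro's 400 billion points are spread evenly over the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(sources, metrics_per_source, interval_seconds):
    """Total data points per day when every source reports on a fixed interval."""
    reports_per_day = SECONDS_PER_DAY // interval_seconds
    return sources * metrics_per_source * reports_per_day


def points_per_second(points_per_day_total):
    """Average ingestion rate implied by a daily total."""
    return points_per_day_total / SECONDS_PER_DAY


# Meteorological stations: 150,000 stations, 100 metrics, one report every five minutes
weather_daily = points_per_day(150_000, 100, 5 * 60)

# Shanghai metro: 400 billion points per day, averaged over 24 hours
metro_rate = points_per_second(400_000_000_000)

print(f"weather: at least {weather_daily:,} points per day")
print(f"metro: about {metro_rate:,.0f} points per second on average")
```

Under those assumptions, the weather network produces at least 4.32 billion points per day, and the metro workload averages roughly 4.6 million points per second.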
Other users include wind technology vendor Goldwind; Chinese appliance and consumer electronics company Haier, the parent company of GE Appliances in the United States; Lenovo; and “smart” driving technology vendor NAVINFO.
IoTDB has accepted more than 1,300 pull requests and added more than 100 new features since joining the Apache Incubator.
Going forward, since IoTDB is trying to serve the edge as well as the cloud and sync data between the two, it will be implementing C++ and Go data file APIs for easier use on the edge.
To improve performance and stability, the team is implementing a better data compaction module and a robust memory control strategy.
“IoTDB will be more and more open,” Xiangdong said. “It will support UDF, trigger, secondary index soon. We will integrate it with other open source systems, such as Apache Calcite.
“Then you can use IoTDB to do many interesting things, like outlier data alert, data calculation, similarity search and use standard SQL to operate the data,” he said.
IoTDB’s cluster version also is coming.
Image by Manuel de la Fuente from Pixabay.