Historically, working with big data has been quite a challenge. Companies that wanted to tap big data sets faced significant performance overhead relating to data processing. Specifically, moving data between different tools and systems required leveraging different programming languages, network protocols, and file formats. Converting this data at each step in the data pipeline was costly and inefficient.
Developed by open source leaders from Impala, Spark, Calcite, and others, Apache Arrow was designed to be the language-agnostic standard for efficient columnar memory representation to facilitate interoperability. Arrow provides zero-copy reads, reducing both memory requirements and CPU cycles, and because it was designed for modern CPUs and GPUs, Arrow can process data in parallel and leverage single-instruction/multiple data (SIMD) and vectorized processing and querying.
So far, Arrow has enjoyed widespread adoption.
Who’s using Apache Arrow?
Apache Arrow is the power behind many projects for data analytics and storage solutions, including:
- Apache Spark, a large-scale parallel processing data engine that uses Arrow to convert Pandas DataFrames to Spark DataFrames. This enables data scientists to port over POC models developed on small data sets to large data sets.
- Apache Parquet, an extremely efficient columnar storage format. Parquet uses Arrow for vectorized reads, which make columnar storage even more efficient by batching multiple rows in a columnar format.
- InfluxDB, a time series data platform that uses Arrow to support near-unlimited cardinality use cases, querying in multiple query languages (including Flux, InfluxQL, SQL and more to come), and offering interoperability with BI and data analytics tools.
- Pandas, a data analytics toolkit built on top of Python. Pandas uses Arrow to offer read and write support for Parquet.
The InfluxData-Apache Arrow effect
Earlier this year, InfluxData debuted a new database engine built on the Apache ecosystem. Developers wrote the new engine in Rust on top of Apache Arrow, Apache DataFusion, and Apache Parquet. With Apache Arrow, InfluxDB can support near-unlimited cardinality or dimensionality use cases by providing efficient columnar data exchange. To illustrate, imagine that we write the following data to InfluxDB:
However, the engine stores the data in a columnar format like this:
Or, in other words, the engine stores the data like this:
1i, 2i, 3i, 4i; null, null, null, true; tagvalue1, tagvalue2, null, tagvalue1; null, null, tagvalue3, tagvalue3; null, null, null, tagvalue4; timestamp1, timestamp2, timestamp3, timestamp4;
By storing data in a columnar format, the database can group like data together for cheap compression. Specifically, Apache Arrow defines an inter-process communication mechanism to transfer a collection of Arrow columnar arrays (called a “record batch”) as described in this FAQ. This can be done synchronously between processes or asynchronously by first persisting the data in storage.
Additionally, time series data is unique because it usually has two dependent variables. The value of your time series is dependent on time, and values have some correlation with the values that preceded them. This attribute of time series means that InfluxDB can take advantage of the record batch compression to a greater extent through dictionary encoding. Dictionary encoding allows InfluxDB to eliminate storage of duplicate values, which frequently exist in time series data. InfluxDB also enables vectorized query instruction using SIMD instructions.
Apache Arrow contributions and the commitment to open source
In addition to a free tier of InfluxDB Cloud, InfluxData offers open-source versions of InfluxDB under a permissive MIT license. Open-source offerings provide the community with the freedom to build their own solutions on top of the code and the ability to evolve the code, which creates opportunities for real impact.
The true power of open source becomes apparent when developers not only provide open source code but also contribute to popular projects. Cross-organizational collaboration generates some of the most popular open source projects like TensorFlow, Kubernetes, Ansible, and Flutter. InfluxDB’s database engineers have contributed greatly to Apache Arrow, including the weekly release of https://crates.io/crates/arrow and https://crates.io/crates/parquet releases. They also help author DataFusion blog posts. Other InfluxData contributions to Arrow include:
Apache Arrow is proving to be a critical component in the architecture of many companies. Its in-memory columnar format supports the needs of analytical database systems, data frame libraries, and more. By taking advantage of Apache Arrow, developers will save time while also gaining access to new tools that also support Arrow.
Anais Dotis-Georgiou is a developer advocate for InfluxData with a passion for making data beautiful with the use of data analytics, AI, and machine learning. She takes the data that she collects and applies a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.
New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to email@example.com.
Copyright © 2023 IDG Communications, Inc.