In this second part of the blog series, I propose a solution that addresses the problems highlighted in the first part. Please read that post first to understand the motivations behind the proposed solution.
Parquet file format
For open data storage on HDFS, the solution uses the Parquet file format, an extensible columnar storage format that compresses data very well. This format is well suited to analytical processing by a SQL engine and is tightly integrated with various open source SQL-style data processing engines, including Spark SQL. The twist in our solution is that the search index for the textual data in a Parquet file is embedded in that same Parquet file, through extensibility mechanisms available in the Parquet format specification. We used the extensions in such a way that a Parquet reader built on the released open source library can still read the regular columns of data, simply ignoring the search index. Of course, for better text search performance, the Parquet reader software needs modification to use the embedded index for the search filter conditions that a SQL processing engine passes down to the reading code. Embedding the index in the same Parquet file as the indexed data on HDFS has another benefit: it ensures co-location of data and index in the same HDFS block, a well-known performance consideration. To obtain this co-location benefit, we accept the restriction that a Parquet file with an embedded index must not exceed the size of an HDFS block. Just like the read path, the Parquet write path needs modification of the open source software to write both index and data into the same file and to fit that file within an HDFS block.
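To make the co-location constraint concrete, here is a minimal sketch, in Scala against standard Hadoop APIs, of how a writer might size its budget for data plus the embedded index before rolling over to a new file. The path and the headroom percentage are illustrative assumptions, not values from our implementation.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs   = FileSystem.get(conf)
val path = new Path("/warehouse/tickets/part-00000.parquet") // illustrative path

// Block size the cluster would use for this path (for example, 128 MB).
val blockSize = fs.getDefaultBlockSize(path)

// Reserve headroom for the embedded text search index and the Parquet footer,
// so data + index still fit in one HDFS block. The 20% reservation is an
// illustrative assumption, not a measured figure from our solution.
val maxDataBytes = (blockSize * 0.8).toLong
```

A writer operating under this constraint would close the current file and start a new one once the buffered data approaches this budget.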
Spark SQL extension & search filter push down
Earlier, I talked about the unavailability of join-related operations on very large datasets involving text search. SQL is well known for its ability to express different kinds of join operations, and it is generally the language of choice for analytic applications; graphical front-end tools exist for analysis leveraging SQL. The Spark computing platform provides Spark SQL for this purpose. Spark SQL also implements join optimization algorithms for large datasets and is known to scale. So it makes sense to make text search capability part of Spark SQL. Two approaches exist for extending Spark SQL for text search: extend the SQL dialect, as PostgreSQL did, or provide explicit built-in Spark SQL functions. In either approach, care must be taken that the text search filter condition in a SQL query is pushed down to the software reading the Parquet file, which holds the embedded text search index. Spark SQL already provides such pushdown for ordinary filter conditions when reading Parquet files. The Spark SQL codebase needs similar modifications for text filter pushdown, taking advantage of the functionality that the underlying modified Parquet software provides.
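As an illustration of the "built-in function" approach, the sketch below registers a hypothetical text_match function as an ordinary Spark UDF; the table and column names are made up for the example. This is not our actual implementation: a plain UDF evaluates only after rows have been read, which is exactly why pushdown support is needed so that the predicate reaches the modified Parquet reader and its embedded index.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("text-search-sketch").getOrCreate()

// Naive stand-in for a real full-text match; shown only to illustrate how the
// function would appear in SQL, not how the index-backed search works.
spark.udf.register("text_match", (text: String, query: String) =>
  text != null && text.toLowerCase.contains(query.toLowerCase))

// Hypothetical dataset with columns id, subject, status, body.
spark.read.parquet("/warehouse/tickets").createOrReplaceTempView("tickets")

// The ordinary filter (status = 'OPEN') is already pushed down to Parquet by
// Spark SQL; the text_match predicate is what the modified reader would need
// to receive as well, so it can consult the embedded index.
spark.sql(
  """SELECT id, subject FROM tickets
    |WHERE status = 'OPEN' AND text_match(body, 'disk failure')""".stripMargin
).show()
```

The SQL-dialect approach would instead surface a search operator in the grammar itself, but the pushdown requirement is the same either way.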
Scalable streaming data ingestion while building text search index
Imagine a relational database table containing a text column being written to HDFS in Parquet format, while a text search index is built for the textual column. A text search index is typically an inverted index, meaning that, given a text fragment, it is efficient to consult the index and find the row identifiers that contain that fragment. For efficiency reasons, both in space and in index search time, you want the index to be as small as possible. Grouping text from many rows in a memory buffer while building the inverted index provides an opportunity to achieve this size efficiency. If the entire input dataset is known a priori, deciding when to write out the index, and managing its size, is relatively easy. Now imagine the same relational table as a streaming data source, with rows coming in fits and starts and no idea of how many rows are still to come, but with the additional expectation that rows that have arrived should become available downstream in near real time for text search and other analysis. This expectation of near real-time availability poses a problem not only for building an efficient text search index but also for appending newly arrived rows to an existing Parquet file on HDFS efficiently. The size efficiency of a Parquet file derives in part from grouping a relatively large number of rows together using various encoding techniques, including dictionary encoding. This entails buffering input rows as they arrive and periodically writing them out into a file containing a larger number of rows, replacing the previous Parquet file on HDFS. How quickly newly arrived rows can be made available to SQL queries, that is, how short the periodic replacement interval can be, depends on how efficiently those rows can be appended to a Parquet file. The encoding of rows in the Parquet format is somewhat computationally expensive, and the existing open source software that writes Parquet files does not support this streaming scenario efficiently. Thus the open source Parquet code base needs modification to achieve the required efficiency and a periodic write interval as low as a couple of seconds. The scalability of streaming ingestion can then be achieved by launching parallel tasks in a Spark program, each task periodically writing to its own Parquet file without interfering with the others, as sketched below.
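The sketch below shows the general shape of scalable, periodic Parquet output using stock Spark Structured Streaming. It uses the built-in rate source as a stand-in for the real change stream, and per trigger it emits new files rather than replacing or appending to files the way the modified writer described above does. Paths, column names, the parallelism, and the trigger interval are all illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("streaming-ingest-sketch").getOrCreate()

// The built-in "rate" source stands in for a real stream of table rows;
// in practice this would be Kafka, a CDC feed, or a similar source.
val rows = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "100")
  .load()
  .selectExpr(
    "value AS id",
    "CONCAT('synthetic text row ', CAST(value AS STRING)) AS body")

// Each output partition is written by its own task, so ingestion scales out
// by raising the parallelism. The 2-second trigger mirrors the write interval
// discussed above; stock Spark adds new files per trigger instead of
// replacing a file with a larger one.
val query = rows.repartition(8)
  .writeStream
  .format("parquet")
  .option("path", "/warehouse/tickets")
  .option("checkpointLocation", "/checkpoints/tickets")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()

query.awaitTermination()
```

The modified writer in our solution follows the same parallel, periodic pattern, but rewrites each task's Parquet file with the accumulated rows and the embedded index so that the result stays within a single HDFS block.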

This brings me to the end of the second part of this blog series. In this post, I outlined how extending the Spark platform and Parquet can provide the text search capability required by an application dealing with large amounts of data. In the third and final part, I will provide more technical detail on the nature of the changes we made to the Spark and Parquet software, some performance measurements, and our experience in general.
Read Part 1: Add Spark to a Big Data Application with Text Search Capability