Computing Reviews
Today's Issue Hot Topics Search Browse Recommended My Account Log In
Review Help
Search
Kangaroo: workload-aware processing of range data and range queries in Hadoop
Aly A., Elmeleegy H., Qi Y., Aref W.  WSDM 2016 (Proceedings of the 9th ACM International Conference on Web Search and Data Mining, San Francisco, CA, Feb 22-25, 2016)397-406.2016.Type:Proceedings
Date Reviewed: Jul 28 2016

Kangaroo is a set of algorithms that is aimed at optimizing the execution of range-based queries. It has been implemented for Hadoop-based systems. Specific examples of range-based data queries include time intervals, spatial ranges, and lifetime of the user profile, among others. The potential application domains that will benefit from this approach are in varied areas--for example, identifying a target audience for promotional material, measuring the target rating point (TRP) of television programs, measuring cell tower usage statistics, and evaluating customer demographics.

Kangaroo splits the data into non-overlapping partitions with the ultimate goal of minimizing the query execution time as well as reducing the associated input/output (I/O) involved in the processing. The queries can ignore (“jump over”) partitions that are not of relevance while formulating the results of a query. This property of the algorithm is symbolized to be akin to a kangaroo hopping around--thus, the name of the algorithm. The six core properties desired in the partition schemes used by this algorithm are: workload awareness, no data duplication, no data splitting, no partition overlap, bounded partition size, and bounded number of partitions. The workload awareness property ensures that the more frequently queried subset of data is more highly partitioned compared to a relatively less frequently queried subset of data. This is an intelligent approach to ensure ease and efficiency of data lookup for larger number of queries and thus overall system performance.

The four different partition schemes supported fall into two broad categories: grid-based and tree-based. The four partitioning schemes are grid-based genetic, grid-based genetic including a dynamic programming step, tree-based genetic, and tree-based greedy (the default). Dynamic programming is meant to split a large problem into smaller subsets and cache results of the smaller subsets for faster lookup while avoiding multiple computation of the same subset during execution. It will be performed for only one of the attributes of the associated grid.

In a grid-based set of algorithms, 2D data is split into partitions representing a coarse grid structure. The coarse partition is in turn composed of multiple finer partitions. As part of this implementation, bit strings are maintained for each of the dimensions. Each bit of the bit string represents either a row or column of the original fine-grained grid. In the alternative approach of the tree-based set of algorithms, the root node represents the entire 2D space that is hierarchically split into child nodes representing smaller partitions of the parent node. The child nodes can also be split further to have child nodes themselves for even more fine-grained partitions.

The experiments of the algorithm involved executing more than 30,000 range-based queries on real data consisting of more than one billion records. The performance of all the proposed algorithms was found to be better than the random partitioning scheme. Among the four, the tree-based greedy scheme was determined to be the best with the lowest running time and the highest I/O reduction percentage. When workload awareness was included, the performance metrics improved. The grid-based genetic with dynamic programming scheme was the worst due to the additional cost associated with the dynamic programming step of pre-computing smaller subsets of data for caching and lookup. It will be interesting for the authors to study in the future whether this additional cost will provide the desired performance improvement for processing an even larger number of queries or a larger dataset.

Reviewer:  Pragyansmita Nayak Review #: CR144643 (1611-0839)
Bookmark and Share
  Featured Reviewer  
 
Information Storage (H.3.2 )
 
 
Organization/ Structure (E.5 ... )
 
 
Sorting And Searching (F.2.2 ... )
 
 
File Systems Management (D.4.3 )
 
Would you recommend this review?
yes
no
Other reviews under "Information Storage": Date
Principles of delay-sensitive multimedia data storage retrieval
Gemmell J., Christodoulakis S. (ed) ACM Transactions on Information Systems 10(1): 51-90, 1992. Type: Article
May 1 1993
Partial match retrieval in implicit data structures
Alt H., Mehlhorn K., Munro J. Information Processing Letters 19(2): 61-65, 1984. Type: Article
May 1 1985
Performance of two-disk partition data allocations
Chang C., Chen C. BIT 27(3): 306-314, 1987. Type: Article
Mar 1 1988
more...

E-Mail This Printer-Friendly
Send Your Comments
Contact Us
Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®
Terms of Use
| Privacy Policy