Computing Reviews
SQL on big data : technology, architecture, and innovation
Pal S., Apress, New York, NY, 2016. 157 pp. Type: Book (978-1-484222-46-1)
Date Reviewed: May 11 2017

After a few years and many discussions about big data and its most well-known attributes (“volume, variety, velocity, veracity”), it is now recognized that big data is not so easy to master. To obtain value from it, the user pool has to be extended by making big data more usable for business people, analysts, non-programmers, and the large number of data management users unwilling to learn the big data ecosystem from scratch.

This is where this short but timely book by Sumit Pal can come in handy. SQL on big data is a well-designed, nicely written book readable by both the technologist and the business decision maker. It is not a programming book; rather, it aims to introduce the architecture of many SQL-on-big-data offerings. The book is organized into six chapters and a short appendix.

In chapter 1, Pal explains why we need to consider SQL on big data. After the usual mention of data volume reaching an enormous scale in recent years, the author states that relational database management systems (DBMSs) cannot scale to this level and that innovative solutions therefore have to be developed to process big data. He then positions the goals of SQL on big data as providing a distributed scale-out architecture, avoidance of data movement from the Hadoop Distributed File System (HDFS), less expensive solutions, immediate availability of ingested data, high concurrency, low latency, unstructured data processing, and integration with existing business intelligence (BI) tools.

Although this list represents a tall order, he then proposes a number of open-source and commercial tools aimed at alleviating the underlying issues. Pal also provides his view on the questions to be raised and answered before jumping into a solution path. In the following chapters, Pal matches these issues with many of the tools and solutions mentioned in chapter 1.

Chapter 2 explains SQL-on-big-data challenges and solutions. It starts with a reminder of the types of SQL statements, followed by the types of data generated by Internet-scale applications (that is, structured, semi-structured, and unstructured). Data manipulation operations on HDFS are difficult because the file system was not designed with this purpose in mind. After mentioning the challenges of low-latency SQL on big data, the author proposes a series of approaches to supporting SQL on big data and reducing the latency of SQL queries. The chapter concludes with a review of data formats and their individual effects on size, compression, and the ability to be indexed, partitioned, and bucketed. The next three chapters address problems in increasing order of complexity and volatility of the data.

Chapter 3 covers the most stable processing scenario, batch processing, which is usually handled with ease in the big data environment. Hive is explained in some depth. Performance improvements can be obtained through broadcast joins, data pipelining, dynamically partitioned joins, query vectorization, Live Long and Process (LLAP) with Tez, cost-based optimizers (CBOs), and the right choice of data format when possible.
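
For readers unfamiliar with the first of these optimizations, the idea behind a broadcast join can be sketched in a few lines of Python. This is only an illustration of the general technique, not code from the book or from Hive; the function name, table names, and sample rows are all hypothetical.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_table, key):
    """Inner-join every partition of a large table against one small table.

    In a distributed engine the small table would be copied ("broadcast")
    to each worker, so the large table never has to be shuffled.
    """
    # Build the in-memory hash table once from the small table.
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key]].append(row)

    joined = []
    for partition in large_partitions:  # each partition stands in for one worker
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Hypothetical sample data: a small dimension table and a partitioned fact table.
countries = [
    {"country_id": 1, "country": "US"},
    {"country_id": 2, "country": "FR"},
]
orders_partitions = [
    [{"order_id": 10, "country_id": 1}, {"order_id": 11, "country_id": 2}],
    [{"order_id": 12, "country_id": 1}, {"order_id": 13, "country_id": 3}],
]

result = broadcast_hash_join(orders_partitions, countries, "country_id")
```

The approach only pays off when one side of the join is small enough to fit in each worker's memory, which is why engines typically apply it to small lookup tables.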

Chapter 4 illustrates scenarios and solutions for real-time querying of data, where questions need to be addressed in a more ad hoc fashion and where the traditional big data architecture has to be complemented with newer and innovative layers. SQL engines for interactive workloads include Spark, one of the most well-known solutions. The author also mentions Tachyon (now Alluxio), an in-memory file system whose cache allows data sharing across data-processing frameworks such as Spark and MapReduce. Massively parallel processing (MPP) analytic engines such as Impala and Apache Drill are also described, followed by examples of commercial products such as Vertica and Jethro Data. The chapter concludes with a brief comparison of MPP versus batch approaches.

Chapter 5 turns to the most recent extensions of big data management. Its three subtopics are SQL for streaming data, semi-structured data, and operational analytics. For semi-structured data, Apache Drill scenarios are explored with examples using JSON, FLATTEN, KVGEN, and XML. Similarly, the use of Spark with both JSON and MongoDB is explained. For streaming data, the author chooses Apache Spark, PipelineDB, and Apache Calcite. For operational analytics on big data, the chapter includes Trafodion, Apache Phoenix with HBase, and the relatively new Kudu storage engine as an alternative to Kafka.
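
To give a feel for what the two Drill functions named above do to nested JSON, here is a rough Python imitation of their semantics. It is a sketch, not Drill code and not an example from the book: `kvgen` and `flatten` below are plain Python stand-ins, and the sample record is hypothetical.

```python
def kvgen(mapping):
    """Turn a map with arbitrary keys into a list of key/value pairs,
    roughly what Drill's KVGEN does for a map column."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def flatten(rows, column):
    """Emit one output row per element of a list-valued column,
    roughly what Drill's FLATTEN does for a repeated field."""
    out = []
    for row in rows:
        for element in row[column]:
            out.append({**row, column: element})
    return out


# Hypothetical semi-structured record: the keys under "clicks" are not known in advance.
rows = [{"user": "a", "clicks": {"home": 3, "cart": 1}}]

# First convert the open-ended map into a list, then explode the list into rows.
with_pairs = [{**r, "clicks": kvgen(r["clicks"])} for r in rows]
flat = flatten(with_pairs, "clicks")
```

Chaining the two steps is what makes data with unpredictable keys queryable as ordinary rows and columns.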

The book concludes with a chapter presenting the road ahead in the evolution of SQL and big data. Most of the products mentioned were in a state of rapid change at the time the book was published, so their current status should be checked. The first solution presented is a SQL engine called BlinkDB, and the following approaches are based on SQL engines that take advantage of graphics processing units (GPUs), such as MapD, GPUdb, and SQream. Three Apache projects are similarly introduced, starting with Apache Kylin, then Apache Lens, in development since May 2013, and ending with the lesser-known Apache Tajo. The chapter ends with an overview of hybrid transactional/analytical processing (HTAP) and emerging Transaction Processing Performance Council (TPC) benchmarking for SQL on big data.

From a rapidly expanding field, the author has done a good job of introducing a sample of relevant technologies. An understanding of this book’s content can bootstrap better decision making and adoption of the right targeted solution. Since the book is short, one cannot expect comprehensive coverage of every solution; however, there are few alternatives and no other comparable books available. Using a search engine cannot provide a coherent picture of the mosaic of SQL-related solutions aimed at the complex challenges of SQL on big data. SQL and its relatives are still the most probable avenue for solving data manipulation in an acceptable time frame, and the book is a good introduction to the topic.


Reviewer: Jean-Pierre Kuilboer
Review #: CR145273 (1707-0422)
Reviewer Selected
SQL (H.2.3 ... )
Database Applications (H.2.8 )
Systems (H.2.4 )

Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®