Data consumption has increased tremendously over the past decade. Thanks to the growing adoption of smartphones and the Internet of Things, access to information is no longer a bottleneck. With the manifold increase in data generation, storage needs have also grown roughly tenfold. Today, the amount of data generated by devices over the Internet in just two days equals the amount generated from the inception of the Internet until 2003. The volume of data put on the Internet every day runs to many petabytes. By 2020, the amount of digital information is expected to grow from 3.2 zettabytes today to 40 zettabytes. Every minute, we send around 204 million emails, generate 1.8 million Facebook likes, send 278 thousand tweets, and upload 200,000 photos to Facebook. Google alone processes over 40 thousand search queries per second.
It is now clear how the name Big Data was coined. To use such colossal amounts of information for analytics and business intelligence, the data must be stored and must also be accessible at high speed. This is where storage hardware plays a vital role. Flash has come to the rescue of big data applications because it greatly increases data access speed. Besides rapid data access, flash offers other benefits: it retains data when power is off, it is power efficient, and it is available in a wide range of form factors with PCIe, SAS, and SATA interface options for flexible deployment.
MongoDB and Hadoop are two important frameworks in Big Data deployments, and it is critical to measure the performance of Flash hardware on these frameworks in order to get maximum throughput. Benchmark testing provides the tools to check system performance and optimize Flash hardware for these frameworks.
Hadoop
Hadoop is a framework that allows distributed processing of large datasets across clusters of computers using simple programming models. It is an open source Apache project and scales from a single server to thousands of servers, each offering local computation and storage. The Hadoop Distributed File System (HDFS) is the distributed storage layer used by Hadoop applications and spans the storage servers of the cluster. MapReduce is the software framework for writing applications that process large amounts of data in parallel on large Hadoop clusters. The benchmark commonly used to test Flash SSDs under a Hadoop workload is TeraSort. Its input and output data are stored on HDFS, and it has three major components: TeraGen (generates the data set for the benchmark), TeraSort (sorts the data set by key), and TeraValidate (validates the results of TeraSort).
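To make the three TeraSort stages concrete, here is a minimal single-machine sketch in Python. It is not the Hadoop implementation: it runs in memory rather than on HDFS and MapReduce, and the function names `teragen`, `terasort`, and `teravalidate` are illustrative stand-ins. It assumes the usual TeraSort record layout of 100-byte records with a 10-byte key, and its validation step is simplified in that it compares the output against the original input.

```python
import random

KEY_LEN = 10      # assumed TeraSort-style layout: 10-byte key...
RECORD_LEN = 100  # ...inside a 100-byte record


def teragen(num_records, seed=0):
    """Generate random fixed-size records (toy stand-in for TeraGen)."""
    rng = random.Random(seed)
    records = []
    for i in range(num_records):
        key = bytes(rng.getrandbits(8) for _ in range(KEY_LEN))
        # Pad the row number out to fill the rest of the record.
        payload = str(i).encode().rjust(RECORD_LEN - KEY_LEN, b"0")
        records.append(key + payload)
    return records


def terasort(records):
    """Sort records by their key (toy stand-in for TeraSort)."""
    return sorted(records, key=lambda r: r[:KEY_LEN])


def teravalidate(original, output):
    """Check that output is key-ordered and holds the same records."""
    keys = [r[:KEY_LEN] for r in output]
    in_order = all(a <= b for a, b in zip(keys, keys[1:]))
    same_records = sorted(original) == sorted(output)
    return in_order and same_records


if __name__ == "__main__":
    data = teragen(1000)
    result = terasort(data)
    print("valid:", teravalidate(data, result))
```

In a real benchmark run, the time taken by the sort stage on a given cluster and storage configuration is the figure of interest; the generate and validate stages exist to make that measurement repeatable and trustworthy.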
NoSQL Databases like MongoDB and Cassandra
NoSQL databases such as MongoDB and Cassandra are seeing huge market growth owing to their open source approach. They handle multiple data types, including structured, semi-structured, and unstructured data. Flash SSDs can improve their I/O performance by 9-20x, allowing greater density per node and server consolidation. Mongoperf, a tool from MongoDB, can be used for evaluating disk performance. Typical benchmarking parameters include random read/write time and sequential read/write time.
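The difference between sequential and random read time can be illustrated with a short Python sketch. This is not mongoperf and does not reproduce its options; the block size, file size, and function names below are assumptions chosen for illustration. It writes a temporary file, then reads the same blocks once in order and once in shuffled order, timing each pass.

```python
import os
import random
import tempfile
import time

BLOCK_SIZE = 4096  # bytes per read; an assumed value for illustration


def make_test_file(num_blocks):
    """Create a temporary file of num_blocks random blocks; return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(num_blocks * BLOCK_SIZE))
    return path


def timed_reads(path, offsets):
    """Read one block at each offset; return (elapsed seconds, bytes read)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        for off in offsets:
            f.seek(off)
            total += len(f.read(BLOCK_SIZE))
    return time.perf_counter() - start, total


def compare_access_patterns(num_blocks=256, seed=0):
    """Time a sequential pass and a random pass over the same blocks."""
    path = make_test_file(num_blocks)
    try:
        sequential = [i * BLOCK_SIZE for i in range(num_blocks)]
        shuffled = sequential[:]
        random.Random(seed).shuffle(shuffled)
        seq_time, seq_bytes = timed_reads(path, sequential)
        rnd_time, rnd_bytes = timed_reads(path, shuffled)
    finally:
        os.remove(path)
    return {
        "sequential_s": seq_time,
        "random_s": rnd_time,
        "sequential_bytes": seq_bytes,
        "random_bytes": rnd_bytes,
    }


if __name__ == "__main__":
    print(compare_access_patterns())
```

A small, freshly written file like this one is usually served from the operating system's page cache, so the two timings will be close. A meaningful disk benchmark uses a file much larger than RAM or bypasses the cache, which is what dedicated tools are designed to handle.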
Figure: Typical Setup for Hadoop Benchmark Testing
eInfochips provides benchmark testing for Flash SSDs in Big Data environments for both Hadoop and MongoDB. We also have the expertise to provide OpenStack Block Storage and Object Storage benchmark testing with tools like Rally/iometer/fio and Swift Bench. We can also enhance Hadoop operational performance through Hive tuning, HDFS tuning, and Linux file system and block storage tuning.