Hadoop is a framework that makes fast data-parallel analysis of very large datasets possible by applying the concept of data-locality. The volume of scientific data collected using modern technologies provides a challenge that can only be addressed by large-scale computing. Hadoop provides interfaces that enable very fast experimentation on such data, without having to bother with the specifics of parallelism. That, in combination with the commodity hardware that the service runs on, makes Hadoop one of the most efficient data-parallel computing platforms currently available.
Hadoop is a software framework that is developed to enable easy, fast and efficient data-parallel batch processing of very large datasets. It exists of a distributed file system called HDFS, and a parallel processing interface called MapReduce. A fundamental feature of Hadoop is that the processing of the data is done on the same machines where the data is stored, a property called data locality. This makes the throughput extremely high and minimizes the need for investments in expensive networking hardware. Due to its speed and economic efficiency Hadoop has gained much popularity in both academics and industry; there is no other large-scale processing framework that is as widely adopted as Hadoop.
After a year of testing on a prototype facility at SARA, BiG Grid is now planning to have a Hadoop production facility available for scientists in the Netherlands by the end of 2011. The prototype facility has successfully assisted scientists in the domains of Information Retrieval, Natural Language Processing, Computer Science and Bioinformatics in their research. The production facility will complement the Dutch national computing infrastructure as a system for data-parallel, high-throughput capacity computing. Due to its simplicity we think Hadoop will be an important enabler of large-scale computing and e-Science for researchers that are not traditionally involved in these fields.
To enable fast experimentation Hadoop offers a number of interfaces. Its native Application Programming Interfaces makes it easy for programmers to create, for example, portals that allow researchers to run a selected algorithm over a selected dataset without ever having to interact with a command-line interface. And other generic interfaces are built on top of Hadoop. With Apache Pig, data analytics can be performed using a syntax much like SQL – but then over terabytes of data. Apache Hbase enables in-memory real-time data access to very large datasets. And the soon to be released Apache HCatalog will provide an abstraction on top of HDFS, enabling easy data curation, sharing and discovery.
For more information, please contact Evert Lammerts of SARA at evert.lammerts@sara.nl. Whether you want to use our prototype service, are interested in our soon-to-be production service or want more information on Hadoop itself; we are happy to hear from you!