An HDFS Tutorial for Data Analysts Stuck With Relational Databases

Introduction

By now, you have probably heard of the Hadoop Distributed File System (HDFS), especially if you are data analyst or someone who is responsible for moving data from one system to another. However, what are the benefits that HDFS has over relational databases?

HDFS is a scalable, open source solution for storing and processing large volumes of data. HDFS has been proven to be reliable and efficient across many modern data centers.

HDFS utilizes commodity hardware along with open source software to reduce the overall cost per byte of storage.

With its built-in replication and resilience to disk failures, HDFS is an ideal system for storing and processing data for analytics. It does not require the underpinnings and overhead to support transaction atomicity, consistency, isolation, and durability (ACID) as is necessary with traditional relational database systems.

Moreover, when compared with enterprise and commercial databases, such as Oracle, utilizing Hadoop as the analytics platform avoids any extra licensing costs.

One of the questions many people ask when first learning about HDFS is: How do I get my existing data into the HDFS?

In this article, we will examine how to import data from a PostgreSQL database into HDFS. We will use Apache Sqoop, which is currently the most efficient, open source solution to transfer data between HDFS and relational database systems. Apache Sqoop is designed to bulk-load data from a relational database to the HDFS (import) and to bulk-write data from the HDFS to a relational database (export).

HDFS Continue reading