Big Data Blog

Hadoop Foundation: When to use Hadoop for a Data-driven enterprise?

Confluent raised $24 million for the data ‘streams’ that power LinkedIn, Netflix, and Uber. The company helps these corporate giants get new insights from their data using Apache Kafka.

And do you know why big data is becoming such a big deal? Have a look at this short video to kick-start this article.

Why is it important to get data-driven?

The context here is that digital transformation is impacting every industry. Everyone talks about the growth of data, and nobody disputes it. That data can come from:

  • Sensors & machines, typically referred to as the Internet of Things
  • Geo-location
  • Server Logs
  • Clickstream and social media
  • Files and emails

In technical terms, you have data held in non-relational databases or other non-traditional data management systems, and you have data coming from traditional sources (such as ERP, CRM, and PoS terminals) that you store in data warehouses. Both are growing rapidly. The question becomes:

How do you effectively blend this information in a way that is transformational to the business?

Now, don’t take “transformational” as a cliche. Savvy folks and companies have been using Hadoop for years (without necessarily calling it out by name, as we do here), but it remains transformational for any company that has not implemented it yet. This blended data (coming from traditional and non-traditional sources) will help you be proactive with your customers and supply chain, as opposed to reacting days, weeks, or months after the fact.


The opportunity is to unlock business value from full-fidelity data and from analytics across that data.


The reality is that much of the new data exists in flight: it is in motion, and it lives in the systems and devices that make up the Internet of Everything landscape.

As Fig 1 below shows, the ability to consume data is the challenge (the line in the middle).

Fig 1 | Source: Hortonworks

Another challenge: how do you actively manage the data from as close as possible to its point of inception, through its lifecycle, and through the real-time or historical analytics that you may want to apply to it?

So, that’s the backdrop to many folks’ journey towards becoming data-driven.

How do companies start their Hadoop journey?

The folks at Hortonworks have seen a clear pattern, particularly over the last few years, when helping bring Hadoop into enterprise IT infrastructure. Companies adopt the Hadoop ecosystem for two reasons: cost savings and unlocking transformational business outcomes.

These are the governing use cases, if you will, and they follow common patterns. See the cost savings segment at the bottom center of Fig 2 below.

Fig 2 | Source: Hortonworks

What do I begin with?

Begin with ETL (extract-transform-load). It sits squarely in the bottom-center cost savings segment: you right-size your traditional world and prepare to bring in some of the IoT sources, so that the transformation logic runs on a platform like Hadoop rather than on your traditional platform.

ETL use case

There are significant cost savings as well as performance benefits in offloading much of that ETL workload to a Hadoop system. In many cases, the purpose is to augment or enrich a downstream data warehouse, data mart, or traditional BI system, so you can begin to bring in new IoT data sources and enrich existing systems.

This is the common data warehouse augmentation and optimization use case.
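To make the three ETL stages concrete, here is a minimal sketch in plain Python. It is not Hadoop code: in a real deployment the transformation would run on a Hadoop engine such as Hive, Pig, or Spark. The sensor feed, the column names, and the `sensor_avg` table are all hypothetical, and an in-memory SQLite database stands in for the downstream data mart.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hypothetical raw sensor feed; one reading is missing on purpose.
RAW_SENSOR_CSV = """device_id,temp_c
d1,20.0
d1,22.0
d2,
d2,30.0
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no reading, then average per device."""
    readings = defaultdict(list)
    for row in rows:
        if row["temp_c"]:
            readings[row["device_id"]].append(float(row["temp_c"]))
    return sorted((dev, sum(v) / len(v)) for dev, v in readings.items())


def load(conn, averages):
    """Load: write the aggregated rows into the (stand-in) data mart."""
    conn.execute("CREATE TABLE sensor_avg (device_id TEXT, avg_temp_c REAL)")
    conn.executemany("INSERT INTO sensor_avg VALUES (?, ?)", averages)
    conn.commit()


# SQLite in memory is only a stand-in for a real warehouse or data mart.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_SENSOR_CSV)))
mart_rows = conn.execute(
    "SELECT device_id, avg_temp_c FROM sensor_avg ORDER BY device_id"
).fetchall()
```

The point of the offload is that the `transform` step, which is usually the expensive one, runs outside the warehouse, and only the small aggregated result is loaded downstream.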

Active archive use-case

Some folks look at the active archive use case as a way of bringing online data that previously may have aged on tape or may not have been stored at all. That gives them a lot of data for archival reporting. This is traditionally where journeys have started over the past few years. The folks at Hortonworks say that over the last 12-18 months or so, real-time and predictive models have increasingly come onto the scene. That is what really begins to unlock the business outcomes.

You can bring data online into a central database where you can discover new patterns. You can also bring in the new IoT sources you may have at your disposal inside your organization. Increasingly, companies also gather third-party data sets (such as demographic data, population data, and government data) to enrich existing data and discover new insights.
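As a rough illustration of enrichment with third-party data, the sketch below joins internal records to an external data set on a shared key. Every name and number in it (the `sales` records, the `demographics` lookup, the zip codes) is made up for the example; at scale the same join would typically be expressed in a Hadoop query engine rather than in plain Python.

```python
# Hypothetical internal sales records.
sales = [
    {"store": "s1", "zip": "10001", "revenue": 1200},
    {"store": "s2", "zip": "94105", "revenue": 800},
    {"store": "s3", "zip": "00000", "revenue": 500},
]

# Hypothetical third-party demographic data, keyed by zip code.
demographics = {
    "10001": {"population": 25000},
    "94105": {"population": 10000},
}


def enrich(records, lookup, key):
    """Left-join each record with the lookup row for its key, if any.

    Records with no match are kept, with population set to None.
    """
    enriched = []
    for record in records:
        extra = lookup.get(record[key], {})
        enriched.append({**record, "population": extra.get("population")})
    return enriched


enriched_sales = enrich(sales, demographics, "zip")

# A new metric that neither data set could provide on its own:
# revenue per resident, computed only where demographic data exists.
revenue_per_capita = {
    r["store"]: r["revenue"] / r["population"]
    for r in enriched_sales
    if r["population"]
}
```

Keeping unmatched records (a left join) rather than dropping them means gaps in the third-party data stay visible instead of silently shrinking the internal data set.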

This article is based on the Hortonworks webinar titled “Laying the foundation for a Data-Driven Enterprise with Hadoop”. It can be accessed here.