There is a lot of hype about “Big Data” solutions among most of our customers. I first looked at the space a few years ago and found most things to be at a very early stage, with little genuine intent to implement from customers. However, in the recent past I have seen an increase in the number of jobs in the Big Data space (in particular Hadoop-based solutions), indicating an increase in demand. The list below is really just a quick list of thinking points if you are considering implementing a Big Data solution.
If you are like most organizations, you already have a BI solution in place: basic reporting, or warehouses feeding reports to the business. Most BI techies want to follow trends and chase the “new paradigm” of Big Data. However, you should think about whether you really need to:
- process terabytes of data to meet business goals
- have data that changes very rapidly and needs to be tracked and alerted on
- have a variety of sources that can’t or shouldn’t be normalized schematically.
We have a listing of typical Big Data solutions by industry in our solutions catalog that you may want to browse for ideas.
As with all buzzwords, there are many interpretations of what “Big Data” is. The main options available under this umbrella are:
- MPP (massively parallel processing): Basically relational servers on steroids
- Hadoop stack: HDFS (distributed file system), YARN (resource manager), Hive (SQL-like query interface to Hadoop data)
- Hybrids: There are products out there that combine the best of both options above.
- NoSQL: Store data in its native format, add schemas at query time. You’re looking at things like MongoDB, Azure Cosmos DB etc.
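To make the “add schemas at query time” idea in the last bullet a little more concrete, here is a minimal sketch in plain Python. It is not how MongoDB or Cosmos DB work internally; the records, the field names and the `read_with_schema` helper are all made up for illustration. The point is only that the data is stored untouched and a schema is imposed when it is read.

```python
import json

# Hypothetical raw records stored as-is; note that the fields differ per record.
RAW_RECORDS = [
    '{"user": "alice", "amount": "19.99", "channel": "web"}',
    '{"user": "bob", "amount": 5, "coupon": "SPRING"}',
    '{"user": "carol"}',
]

# A "schema" applied only when reading: field name -> (converter, default).
PURCHASE_SCHEMA = {
    "user": (str, ""),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse each stored record and project it onto the schema at query time."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            # Missing fields get a default; extra fields are simply ignored.
            row[field] = convert(doc[field]) if field in doc else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_RECORDS, PURCHASE_SCHEMA)
total = sum(row["amount"] for row in rows)
```

A different report could read the same stored records with a different schema (say, one that cares about `coupon`) without any change to how the data was written.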
Cloud vs On Prem
With all of these, you will find options both in the cloud and on prem. The cloud versus on-prem question faces every layer of the application stack, and Big Data platforms are no different. Cloud platforms typically add value over on-prem options in terms of:
- Pay for use only (both in terms of storage and compute)
- Ease of scale out
- Ever-increasing feature set
- Ease of integration with other tooling like visualization, real time alerting, web and mobile front ends etc.
One of the challenges with open source tools (or those with foundations in the open source world) is which distribution to use. For Hadoop (perhaps the most popular Big Data technology), for example, you can use the baseline Apache build or distributions from vendors such as Hortonworks and Cloudera. Vendor distributions add value to the baseline Apache build in terms of dev tools, administration, security, ease of deployment and so on. This means that they usually trail the latest and greatest from the Apache projects. With things changing rapidly in this space, you should carefully consider what features you need and whether your distribution has, or intends to provide, them.
Cost of ownership
Big Data perhaps has one of the biggest collections of “free” tools out there, free in terms of licensing, that is. Other significant costs to consider are hardware (if you are going on prem) and hiring and training. At the moment, Hadoop skills are probably among the most expensive in the tech marketplace.
If you are planning a cloud solution, you will need to consider the volume of data, its growth, compression, and transaction and network volumes. This is a bit more complex to estimate, but it promises to track your demand curve more closely.
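As a rough illustration of how volume, growth and compression interact, here is a back-of-the-envelope sketch in Python. Every number in it is a made-up placeholder rather than a real provider price, and the `project_storage_cost` function is hypothetical. It covers storage only; compute, transaction and network charges would need their own terms in a real estimate.

```python
def project_storage_cost(initial_tb, monthly_growth_rate, compression_ratio,
                         price_per_tb_month, months):
    """Return a list of (month, stored_tb, monthly_cost) tuples.

    compression_ratio is raw size divided by stored size (4.0 means 4:1).
    """
    projection = []
    raw_tb = initial_tb
    for month in range(1, months + 1):
        stored_tb = raw_tb / compression_ratio
        cost = stored_tb * price_per_tb_month
        projection.append((month, stored_tb, cost))
        # Raw data grows by a fixed percentage each month.
        raw_tb *= 1 + monthly_growth_rate
    return projection


# All figures below are assumed placeholders, not real provider prices.
plan = project_storage_cost(
    initial_tb=10.0,
    monthly_growth_rate=0.05,
    compression_ratio=4.0,
    price_per_tb_month=20.0,
    months=12,
)
total_cost = sum(cost for _, _, cost in plan)
```

Because you pay month by month, the bill follows the growth of the data instead of requiring hardware sized up front for the final volume.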
Existing BI assets
Lastly, you can’t ignore the BI assets you already have. Key considerations are how data from relational sources will merge with data from the Big Data side, and at what layer (ETL, analytics, or presentation). Platforms like Hive are great tools for this problem, and there is growing support in presentation tools for Big Data and unstructured data sources.
Planning out a Big Data platform can be an exercise ranging from a couple of weeks to several years, depending on how serious you are and how much value the analysis can unlock. Things usually begin with a couple of weeks of discovery (something my company does for free). I’d love to hear back on what stage of adoption you are at and what is keeping you from piloting it.
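To picture what merging at the analytics layer might look like, here is a small sketch using only the Python standard library. An in-memory SQLite table stands in for the existing warehouse and a few JSON lines stand in for the Big Data side. The table, the events and the `clicks_by_region` helper are all hypothetical; on a real platform a tool such as Hive would typically do this join at scale.

```python
import json
import sqlite3
from collections import Counter

# Relational side: a hypothetical customer table from the existing warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
conn.executemany(
    "INSERT INTO customers (id, region) VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC")],
)

# Big Data side: hypothetical semi-structured click events, one JSON doc per line.
EVENT_LINES = [
    '{"customer_id": 1, "page": "/home"}',
    '{"customer_id": 2, "page": "/pricing"}',
    '{"customer_id": 1, "page": "/pricing"}',
    '{"customer_id": 99, "page": "/home"}',
]


def clicks_by_region(connection, event_lines):
    """Join events to the customer table in application code (analytics layer)."""
    region_of = dict(connection.execute("SELECT id, region FROM customers"))
    counts = Counter()
    for line in event_lines:
        event = json.loads(line)
        # Events with no matching customer are grouped under "unknown".
        counts[region_of.get(event["customer_id"], "unknown")] += 1
    return dict(counts)


result = clicks_by_region(conn, EVENT_LINES)
```

The same join could instead be done earlier, in ETL, or later, in the presentation tool. Which layer you pick mostly decides where the matching logic lives and who maintains it.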