There is a lot of hype about “Big Data” solutions among most of our customers. I first looked at the space a few years ago and found most things to be at a very early stage, with little genuine intent from customers to implement. More recently, however, I have seen an increase in the number of jobs in the Big Data space (in particular around Hadoop-based solutions), which suggests that demand is growing. Below is a quick list of thinking points if you are considering implementing a Big Data solution.
If you are like most organizations, you already have a BI solution in place: basic reporting, or warehouses feeding reports to the business. Most BI techies want to follow trends and chase the “new paradigm” of big data. Before you do, think about whether you need to:
- process terabytes of data to meet business goals
- have data that changes very rapidly and needs to be tracked and alerted on
- have a variety of sources that can’t or shouldn’t be normalized schematically.
We have a listing of typical Big data solutions by industry in our solutions catalog that you may want to browse to get ideas.
As with all buzzwords, there are many interpretations of what “Big Data” is. The main choices on what is available under this umbrella are:
- MPP (massively parallel processing): Relational servers on steroids
- Hadoop stack: HDFS (distributed file system), YARN (resource manager), Hive (SQL-like query interface to Hadoop data)
- Hybrids: There are products out there that combine the best of both options above.
- NoSQL: Store data in its native format, and add schemas at query time. You’re looking at things like MongoDB, Azure Cosmos DB, etc.
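To make the "add schemas at query time" idea a bit more concrete, here is a minimal sketch in plain Python. It does not use any particular NoSQL product; the sample documents, the `read_with_schema` helper, and the `PEOPLE_SCHEMA` mapping are all made up for illustration. The point is only that the stored records keep whatever shape they arrived in, and the reader decides which fields and types it wants.

```python
import json

# Hypothetical raw records, stored as-is with no shared schema.
RAW_DOCUMENTS = [
    '{"id": 1, "name": "Ada", "email": "ada@example.com"}',
    '{"id": 2, "name": "Bob", "tags": ["vip"], "age": "42"}',
    '{"id": 3, "full_name": "Cy", "age": 31}',
]


def read_with_schema(raw_docs, schema):
    """Project each stored document onto a schema chosen at query time.

    `schema` maps an output field to (candidate source keys, converter).
    Fields missing from a document come back as None.
    """
    rows = []
    for raw in raw_docs:
        doc = json.loads(raw)
        row = {}
        for field, (source_keys, convert) in schema.items():
            # Take the first candidate key that the document actually has.
            value = next((doc[k] for k in source_keys if k in doc), None)
            row[field] = convert(value) if value is not None else None
        rows.append(row)
    return rows


# The schema lives with the query, not with the stored data.
PEOPLE_SCHEMA = {
    "id": (["id"], int),
    "name": (["name", "full_name"], str),
    "age": (["age"], int),
}

people = read_with_schema(RAW_DOCUMENTS, PEOPLE_SCHEMA)
for person in people:
    print(person)
```

A different report could read the same stored documents with a different schema, which is the flexibility this style of storage trades against up-front normalization.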
Cloud vs On-Prem
With all of these, you will find options both in the cloud and on-prem. The cloud versus on-prem question affects every layer of the application stack, and Big Data platforms are no different. Cloud platforms typically add value over on-prem options in terms of:
- Pay for use only (both in terms of storage and computing)
- Ease of scale-out
- Ever-increasing feature set
- Ease of integration with other tooling like visualization, real-time alerting, web and mobile front ends, etc.
One of the challenges with open source tools (or those with foundations in the open source world) is choosing which distribution to use. For Hadoop (perhaps the most popular Big Data tech), for example, distributions are available from Hortonworks, Cloudera, Apache, etc. The vendor distributions add value to the baseline Apache build in terms of dev tools, administration, security, deployment ease, etc. As a result, they usually trail the latest and greatest from the Apache projects. With things changing rapidly in this space, you should carefully consider which features you need and whether your distribution has them or intends to provide them.
Cost of Ownership
Big Data is perhaps one of those areas with the biggest collection of “free” tools out there, at least license-wise. Other significant costs to consider are hardware (if you are going on-prem) and the cost of hiring and training people. At the moment, Hadoop skills are probably among the most expensive in the tech marketplace.
If you are planning a cloud solution, you will need to consider the volume of data, its growth, compression, and transaction and network volumes. This is a bit more complex to estimate, but cloud pricing promises to track your demand curve more closely.
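As a rough illustration of how volume, growth, and compression interact, here is a back-of-the-envelope sketch. It models storage only (not transaction or network charges), and every number in it, including the price per terabyte, is a placeholder assumption rather than any provider's real pricing.

```python
def project_storage_cost(initial_tb, monthly_growth_rate, compression_ratio,
                         price_per_tb_month, months):
    """Return the estimated storage bill for each month.

    Raw volume compounds by `monthly_growth_rate`; the billed volume is
    the raw volume divided by `compression_ratio`.
    """
    costs = []
    raw_tb = initial_tb
    for _ in range(months):
        stored_tb = raw_tb / compression_ratio
        costs.append(round(stored_tb * price_per_tb_month, 2))
        raw_tb *= 1 + monthly_growth_rate
    return costs


# Placeholder inputs: 10 TB today, 10% monthly growth, 4:1 compression,
# and a made-up price of 20.0 per compressed TB per month.
monthly = project_storage_cost(
    initial_tb=10,
    monthly_growth_rate=0.10,
    compression_ratio=4,
    price_per_tb_month=20.0,
    months=3,
)
print(monthly)
```

Swapping in your own growth and compression assumptions shows quickly how sensitive the bill is to each of them.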
Existing BI assets
Lastly, you can’t ignore the existing BI assets you have. Key considerations are how data from relational sources will merge with data from the big data side, and at what layer (ETL, analytics, or presentation). Platforms like Hive are great tools for this problem, and there is growing support in presentation tools for big data and unstructured data sources.
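One common pattern is to reduce the big data side to a small aggregate and then join that aggregate with existing relational data. The sketch below shows the idea using only Python's standard library: an in-memory SQLite table stands in for the warehouse, and a few JSON lines stand in for raw event data. The table, column names, and events are all hypothetical.

```python
import json
import sqlite3
from collections import Counter

# Relational side: a stand-in for an existing warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Big data side: hypothetical semi-structured click events, one JSON per line.
EVENT_LINES = [
    '{"customer_id": 1, "page": "/pricing"}',
    '{"customer_id": 1, "page": "/docs"}',
    '{"customer_id": 2, "page": "/pricing"}',
    '{"customer_id": 3, "page": "/home"}',
]

# Reduce the raw events to a small aggregate first...
clicks = Counter(json.loads(line)["customer_id"] for line in EVENT_LINES)

# ...then join the aggregate with the relational dimension.
report = [
    (name, clicks.get(cust_id, 0))
    for cust_id, name in conn.execute(
        "SELECT id, name FROM customers ORDER BY id"
    )
]
print(report)
```

Note that the event for customer 3 has no match on the relational side and drops out of the report; deciding how to handle such mismatches is part of choosing the layer where the merge happens.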
Planning out a big data platform can be an exercise ranging from a couple of weeks to several years, depending on how serious you are and how much value the analysis can unlock. Things usually begin with a couple of weeks of discovery (something my company does for free). I’d love to hear back on what stage of adoption you are at and what is keeping you from piloting it.