Big Data Hadoop Stack Training [Short Videos Part III]
This blog post contains part III of Big Data training conducted by Khurram M. Pirzada, Hadoop Lead at Allied Consultants. Part I & part II of the training can be watched by visiting the following links.
Big Data Training Part I
Big Data Training Part II
Ergonomics of Big Data Vol 5
Welcome everybody, this is part III of the Big Data Training that we are conducting at Allied Consultants, I am Khurram and today we will discuss the following in this last theoretical part of our Big Data training. In future posts, we will upload and share live assignments that we handed out to participants of this training.
In this session:
- The ergonomics of Big Data
- How group composition is done?
- How the Big Data infrastructure and hardware depend and inter-depend on each other?
- How the OS (Operating System) impacts the performance of both hardware and infrastructure in typical big data assignments?
- How to select the different tool sets available for executing big data projects, and which tool suits for which type of applications?
- The role of security and audit protocols so that your whole infrastructure & application follows set standards
- Emerging architectural patterns for Big Data projects/tasks that Allied Consultants has customized and optimized for Big Data jobs
Big Data Training | Ergonomics of Big Data Vol 6
We have a very large ecosystem of Big Data. In such scenario, we have to make a certain selection criteria based on very specific needs. For example, do the business using Big Data applications need to just improve operational efficiency or does it need to use Big Data for mission-critical tasks. For instance, in a supply chain example, the production planning engineer needs to have visibility for good movement so that the team can plan accordingly. However, for mission critical instances, such as telecom companies, they need to ensure that the infrastructure is reliable so as to guarantee zero downtime.
We have a list of tools such as Hadoop, Yarn, Spark and Flink that are specifically used in such scenarios. On top of this ecosystem, we also have some vendor-based frameworks for Big Data projects.
In every case, we have decided between cloud vs. in-house deployment of Big Data technologies. Under different circumstances we will choose the cloud whereas in case of huge project scope, we will choose in-house deployment due to financial and infrastructure implications. In either case, we have to define the user level access both at the event and task or as well for the data.
We have to make sure that whichever deployment mode that we select, it must be reliable, efficient, and scalable. It is important to note that the dependency and efficiency of infrastructure have direct correlation with big data performance. Whether we are using IaaS (Infrastructure-as-a-Service) or PaaS (Platform-as-a-Service), both of these formats can be deployed in-house (being geographically spread-out as per client needs).
The infrastructure has a directly proportional relationship with OS problems and storage networks. IO performance means that whenever an instruction set has been sent out from one set of infrastructure to another, how the latter performs and where the information is stored.
We have to use commercial or commodity hardware. There are certain pre-conditions that need to be absolutely clean from the onset of a Big Data project. We usually go for the commodity hardware. Why, when and where?
Commercial hardware currently exists in the form of Criag Computers from Hitachi, IBM. They are usually for mission critical operations used by Governments, telecom companies, defense organizations, and automotive companies. On the other hand, commodity hardware is mostly used by data centers, or in-house deployment.
Hardware and infrastructure are directly interdependent and they influence each other. For example, the infrastructure and hardware requirements of a business organization may increase with an expansion of operations or compute demand. However, as we step into the future, a lot of vendors are envisioning that big data hardware is the future. So, very soon we will see an additional layer of hardware specific to big data projects and conditions.
With all this hype, we should remember that whether it is a hardware or infrastructure part, the new concept is self-healing, context-aware and all those HMI (human-interacting devices) increase the complexity of the overall design of Big Data projects/architecture.
Big Data Training | Ergonomics of Big Data Vol 7
Do OS matters in Big Data?
Yes, of-course. It is among the primary design considerations of Big Data solutions and has an impact on the overall deployment. For instance, the design of HDFS (Hadoop Distributed File Systems) vs. Unix vs. MacOS vs. Windows. HDFS is a close cousin of Unix than it is on Windows.
Why? Because, Unix and HDFS have an almost similar layout, and Windows file structure completely differs. Does it really affect the scalability?
Yes, when the volume and speed of data increases, the congestion determines how well an operating system behaves. We have to respond to different set of requests generated so that whether they will be handled internally or through external infrastructure resources.
The biggest factor is the ability to configure the kernel. The Unix and its variants can reconfigure kernel based on whether it is an application, infrastructure or directly the data streaming. We can reconfigure the kernel according to our needs.
Now we will discuss the software tools available for solving the big data problems. We refer to the Hadoop stack to refer to the different tools available. We have HDFS, MapReduce, Cassandra, Hive, Pig and HBase.
Four different databases, we use relational as well as NoSQL. Postage is used for regular database, whereas MongoDB and OrientDB is used to handle unconventional and NoSQL based data. Neo4J is used for graph data, a prime example is Facebook, Twitter and LinkedIn (these three use Neo4J).
We use different network stacks for different purposes within Big Data assignments. We use Nagios for network monitoring, we use Fiddler for communication between different infrastructures, Zanoss and we use Splunk within the same cluster to manage all the resources.
We have to make sure that everything we perform, whether it is in a singular infrastructure, or geographically spanning, follows the security protocols that we defined in earlier sessions. Different levels of access are allowed for different users. We use a different set of tools in combination with those protocols. On top of it we use different DevOps tools. These are used to automate our processes. Docker is used to automate our virtual environment. Jenkin is used for continuous integration, ELK eslasticsearch is used when using cloud-based infrastructure. LogStash is used to keep a log of each and every action performed over the infrastructure, hardware or application.
Big Data Training | Security, audit & architecture for Big Data Vol 8
In Big Data, we have certain security and audit trail protocols. Foremost, we have to determine the life cycle of any data that is generated during the process. In this whole cycle, there are different layers. Data can be generated at source, while certain processes are taking place or when a certain level of analytics are performed.
How do we ensure that the data generated at each layer are accessed by authorized users only. The three major approaches are too:
- Mask the data
- Determine life cycle
There are some interesting projects that are doing the same. Either these tools hide the data properties or make it harmless. Some tools are used to make data encrypted and cannot be accessed until keys are available. Sentry, Knox and Accumulo are the tools that we used in the security/audit layer. We have talked a lot about data at rest. A major challenge is to save data when it is live?
Here we have our data source, we also have our Hadoop cluster in which data is coming according to the pre-set rules. Our external data warehouse lands here. In this example image, the data is moving both inwards and outwards.
In the next part of this training series, we will post live Big Data assignments accomplished by the participants of the training.
You may like:
Downloadable template for Big Data Hadoop Use case
Four major challenges of working in Apache Cassandra database
Hadoop deployment in a German multinational automobile manufacturer