
Big Data Architecture Best Practices

By Muhammad Omer

Synchronous vs Async pipelines

Synchronous big data pipelines are a series of data processing components triggered when a user invokes an action on a screen, e.g. clicking a button. The user then waits until a response comes back with the results. In an asynchronous implementation, by contrast, the user initiates the execution of the pipeline and goes on their merry way; the pipeline notifies the user when the task is complete.

Asynchronous pipelines are generally the better practice because they can be sized for the average load of the system, whereas synchronous pipelines must be provisioned for the peak load. The asynchronous design therefore maximizes asset utilization and minimizes cost.
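The contrast can be sketched in a few lines. This is a minimal, hypothetical illustration using an in-process queue and a background worker (a real pipeline would use a message broker such as Kafka): the caller's `submit` returns immediately, and processing happens at the worker's own pace.

```python
import queue
import threading

# Hypothetical job queue: requests are accepted immediately and
# processed in the background at the pipeline's own pace.
jobs = queue.Queue()
results = {}

def worker():
    # Drains the queue at a steady rate sized for average load,
    # not for the momentary peak of incoming requests.
    while True:
        job_id, payload = jobs.get()
        results[job_id] = payload.upper()  # stand-in for real processing
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(job_id, payload):
    # Returns immediately; the user is notified later (here we just
    # wait on the queue for the sake of the example).
    jobs.put((job_id, payload))

submit("job-1", "hello")
jobs.join()
print(results["job-1"])  # -> HELLO
```

In a synchronous design, `submit` would block until the result was ready, so capacity would have to match the worst-case arrival rate.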


Buffering queues

Wherever possible, decouple the producers of data from its consumers. Typically this is done through queues that buffer data for a period of time. This decoupling lets producers and consumers work at their own pace, and it also allows filtering, so consumers can select only the data they want.


Stateless wherever possible

Design components to be stateless wherever possible. Because a stateless component holds no session data between requests, any instance can handle any record, which is what enables horizontal scalability.
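As a hypothetical illustration, a stateless transform is just a pure function of its input record, so copies of it can run on any number of workers in parallel with no coordination:

```python
# Stateless transform: the output depends only on the input record,
# so identical workers can process the stream in parallel.
def enrich(record):
    return {**record, "total": record["qty"] * record["price"]}

# Each record can be routed to any worker instance; no shared
# mutable state is required, which is what allows scaling out.
records = [{"qty": 2, "price": 5.0}, {"qty": 1, "price": 3.0}]
out = [enrich(r) for r in records]
print(out[0]["total"])  # -> 10.0
```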


Time to Live

It's important to consider how long the data in question remains valid, and to exclude data that is no longer valid from processing. One example of this is the data retention settings in Kafka, which expire records after a configured age or topic size.
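The same check is easy to apply inside your own pipeline. A small sketch, assuming a hypothetical 60-second validity window and records carrying a `ts` timestamp field:

```python
import time

TTL_SECONDS = 60  # assumed validity window for this sketch

def still_valid(record, now):
    # Skip records older than the TTL rather than spending compute on them.
    return now - record["ts"] <= TTL_SECONDS

now = time.time()
records = [{"id": 1, "ts": now - 10}, {"id": 2, "ts": now - 600}]
fresh = [r for r in records if still_valid(r, now)]
print([r["id"] for r in fresh])  # -> [1]
```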


Process and deliver what the customer needs

One of the key design elements, on both the macro and micro level, is processing only the data that is actually being consumed, and only when it is being consumed. An interesting example of this I saw recently was a stock ticker feed that was fed into Kafka. Subscribers typically monitored only a few companies' feeds, so the tickers were split into per-company topics and consumers subscribed only to the companies they were interested in. Any processing on that data was deferred until the user pulled it, removing the overall load of the innumerable other companies.

On a micro level this is also how Apache Spark works: transformations on an RDD are lazily evaluated and deferred until an action triggers execution, at which point the whole processing plan is optimized.
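Python generators give a small stand-in for this deferred-execution style: the chained steps below build up a pipeline, but nothing is parsed or filtered until the results are actually pulled, analogous to Spark running work only when an action is invoked. The ticker data here is invented for illustration.

```python
# Generators mimic lazy evaluation: chaining these calls builds a
# pipeline but processes nothing until results are pulled.
def parse(lines):
    for line in lines:
        yield line.split(",")

def only_company(rows, ticker):
    for row in rows:
        if row[0] == ticker:
            yield row

feed = ["AAPL,101", "MSFT,99", "AAPL,102"]
pipeline = only_company(parse(feed), "AAPL")  # nothing has run yet

prices = [int(price) for _, price in pipeline]  # work happens here
print(prices)  # -> [101, 102]
```

Records for companies nobody asked about are never materialized, which is the same load-shedding effect as the per-topic Kafka design above.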
