Big Data Processing

Architecture, Tech Stack, Examples

In software development since 1989 and in big data since 2013, ScienceSoft helps plan and build reliable and effective big data solutions and platforms.


99% of Firms Actively Invest in Big Data Initiatives

The NewVantage Big Data and AI Survey 2021 revealed that 99% of mid-size and large companies already use big data, and 92% of them plan to accelerate their investments in the coming years.

Traditionally, big data solutions are analytics-focused and aimed at driving informed decision-making. However, the share of operational big data solutions is steadily growing in line with the need to process petabytes of XaaS user data or enable IoT-driven automation.

Popular big data processing use cases:

  • Real-time vehicle tracking; traffic management; geofencing.
  • Medical IoT.
  • Credit card fraud/account takeover detection.
  • Real-time stock market quotes management.
  • Automated real-time anomaly recognition for manufacturing/oil&gas industries.
  • Connected smart appliances.
  • Online video games.
  • XaaS.

Big Data Processing: The Essence

Big data processing involves collecting, storing, and managing massive amounts of data that arrive at high speed, mostly in semi-structured or unstructured form, and deriving immediate insights or triggering immediate actions based on them.

Key approaches: Batch processing and stream processing (also known as real-time processing, event streaming and complex event processing).

  • Batch processing deals with huge volumes of historical data by running parallel computations on a defined schedule (entails latency from minutes to hours).
  • Stream processing deals with real-time data, which should be processed as soon as it arrives (entails latency from milliseconds to seconds).

The demand for stream processing has grown significantly in recent years due to its ability to simplify data architectures, provide real-time insights, and support use cases involving time-sensitive data, such as asset monitoring, personalization, clickstream analysis, and multiplayer video games.
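
To make the contrast concrete, here is a minimal PySpark sketch of the same per-device aggregation done both ways; the paths, broker address, topic, and column names are illustrative assumptions, not a reference implementation:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

    # Batch: crunch a day of historical events on a schedule (minutes-to-hours latency).
    batch_df = spark.read.json("s3://events/2024-01-01/")           # hypothetical path
    batch_df.groupBy("device_id").count() \
        .write.mode("overwrite").parquet("s3://reports/daily/")     # hypothetical path

    # Stream: process events as they arrive (milliseconds-to-seconds latency).
    stream_df = (spark.readStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
                 .option("subscribe", "events")                     # hypothetical topic
                 .load())
    counts = stream_df.groupBy(F.window("timestamp", "1 minute"), "key").count()
    (counts.writeStream
           .outputMode("update")   # emit only changed aggregates on each trigger
           .format("console")
           .start()
           .awaitTermination())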

Typical architecture modules: Data sources, a data bus, a stream processing component, a big data storage, batch processing, a data warehouse, and a big data governance component.

Popular architecture options: Lambda, Kappa.

Big Data Processing Architecture Options

ScienceSoft explains two architecture options that perfectly meet the needs of the majority of companies:

1. Lambda architecture


The essence: The Lambda architecture implies two separate data flows (= two technology stacks) – one for batch and one for real-time processing. The main complexity lies in piecing the outputs of these two flows together, as shown in the sketch after the list of pros below.

Pros:

  • Existing ETL processes can be used as the batch layer.
  • High performance.
  • A low possibility of errors even if the system crashes, as a separate distributed storage keeps historical data intact.
  • Fewer data streams with indefinite time-to-live, and thus cheaper PaaS and IaaS services.
  • Lower development cost, since there's no need to rewrite batch algorithms for streaming (not all algorithms can be made streaming).
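
For illustration, here is a minimal Python sketch of that reconciliation step, assuming hypothetical batch_views (recomputed by the batch layer on schedule) and speed_views (kept current by the streaming layer), both keyed by entity:

    def query_metric(entity_id, batch_views, speed_views):
        # Batch layer: an accurate total up to the last scheduled recomputation.
        base = batch_views.get(entity_id, 0)
        # Speed layer: increments observed since that recomputation.
        delta = speed_views.get(entity_id, 0)
        # The serving layer must merge both flows into a single answer;
        # this merge is exactly where Lambda's extra complexity lives.
        return base + delta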

2. Kappa architecture


The essence: In the Kappa architecture, both real-time and batch processing of big data are performed within one data flow (= a single technology stack is used); see the replay sketch after the list of pros below.

Pros:

  • Easy to test and maintain. Only one set of infrastructure and technology is used.
  • Data is easy to migrate and reorganize.
  • Easy to add new functionalities and make hotfixes (since only one code base should be updated).
  • High data quality with guaranteed event ordering and no mismatches.
  • Lower infrastructure cost (storage, network, compute, monitoring, logs) since only one tech stack is used and data needs to be processed only once.
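
A minimal sketch of the Kappa idea with kafka-python (topic, broker, and group id are assumptions): there is no separate batch stack, and "batch" reprocessing is just replaying the log from offset zero through the same streaming code:

    from kafka import KafkaConsumer

    def process(payload):
        """Hypothetical business logic shared by live and replay runs."""
        print(payload)

    consumer = KafkaConsumer(
        "events",                         # hypothetical topic
        bootstrap_servers="broker:9092",
        group_id="reprocess-v2",          # a fresh group id has no committed offsets...
        auto_offset_reset="earliest",     # ...so consumption starts from the beginning
    )
    for record in consumer:
        process(record.value)             # one code path for live and historical data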

Popular Techs and Tools Used in Big Data Projects

ScienceSoft's teams typically rely on the following techs and tools for big data processing projects:

Data bus / Aggregation layer

Apache Kafka

We use Kafka for handling big data streams. In our IoT pet tracking solution, Kafka processes 30,000+ events per second from 1 million devices.
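
As a simple illustration, a device event can be published to Kafka like this (a minimal kafka-python sketch; the broker address, topic, and payload shape are assumptions):

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="broker:9092",                          # hypothetical broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("pet-tracker-events",                           # hypothetical topic
                  {"device_id": "tracker-42", "event": "geofence_exit"})
    producer.flush()   # block until the event is actually delivered to the broker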

Apache NiFi

With ScienceSoft’s managed IT support for Apache NiFi, an American biotechnology corporation got 10x faster big data processing, and its software stability increased from 50% to 99%.

Stream processing layer

Apache Kafka

Kafka also forms the core of our stream processing layer; see its description and a usage sketch under the data bus layer above.

Apache Spark

We also use Spark for stream processing; its description and an ETL example appear under the batch processing layer below.


Batch processing layer

Apache Spark

A large US-based jewelry manufacturer and retailer relies on ETL pipelines built by ScienceSoft’s Spark developers.
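
A minimal PySpark ETL sketch in the spirit of such pipelines (paths, table, and column names are illustrative assumptions): extract raw records, clean and enrich them, and load the result as partitioned Parquet:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    raw = spark.read.csv("s3://raw/orders/", header=True, inferSchema=True)
    clean = (raw.dropDuplicates(["order_id"])                  # hypothetical key column
                .filter(F.col("amount") > 0)                   # drop invalid rows
                .withColumn("order_date", F.to_date("created_at")))
    clean.write.partitionBy("order_date").mode("append").parquet("s3://curated/orders/")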

Apache Hive

ScienceSoft has helped one of the top market research companies migrate its big data solution for advertising channel analysis to Apache Hive. Together with other improvements, this led to 100x faster data processing.
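
Hive tables can then be queried with plain SQL; here is a minimal sketch via PySpark's Hive integration (database, table, and columns are assumptions):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-sketch")
             .enableHiveSupport()        # resolve tables through the Hive metastore
             .getOrCreate())

    # Filtering on the partition column keeps the scan cheap.
    spark.sql("""
        SELECT channel, SUM(impressions) AS impressions
        FROM ads.channel_stats
        WHERE event_date = '2024-01-01'
        GROUP BY channel
    """).show()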

Serving layer / Big data databases

Apache Cassandra

Our Apache Cassandra consultants helped a leading Internet of Vehicles company enhance their big data solution that analyzes IoT data from 600,000 vehicles.
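
A minimal sketch with the DataStax Python driver (contact point, keyspace, and table layout are assumptions); Cassandra reads stay fast at scale as long as queries hit the partition key:

    from datetime import datetime
    from cassandra.cluster import Cluster

    cluster = Cluster(["cassandra-node1"])       # hypothetical contact point
    session = cluster.connect("telematics")      # hypothetical keyspace

    # vehicle_id is assumed to be the partition key, ts a clustering column.
    rows = session.execute(
        "SELECT ts, speed, fuel_level FROM vehicle_events "
        "WHERE vehicle_id = %s AND ts >= %s",
        ("veh-600001", datetime(2024, 1, 1)),
    )
    for row in rows:
        print(row.ts, row.speed, row.fuel_level)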

Apache HBase

We use HBase when a database needs to scale to billions of rows and millions of columns while maintaining constant write and read performance.
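
A minimal sketch via the happybase client (Thrift host, table, and column family are assumptions); the composite row key keeps a device's readings together and lookups constant-time:

    import happybase

    connection = happybase.Connection("hbase-thrift-host")   # hypothetical host
    table = connection.table("sensor_readings")              # hypothetical table

    # Write one cell; the "d" column family could hold millions of columns.
    table.put(b"device42#20240101T120000", {b"d:temp": b"21.5"})

    # Read the row back by its key.
    print(table.row(b"device42#20240101T120000"))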

MongoDB

ScienceSoft used a MongoDB-based warehouse for an IoT solution that processed 30,000+ events per second from 1 million devices. We’ve also delivered MongoDB-based operations management software for a pharma manufacturer.
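
A minimal pymongo sketch of that warehousing pattern (database, collection, and field names are assumptions): store pre-aggregated hourly stats per device and query them back for reporting:

    from datetime import datetime
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongo-host:27017")   # hypothetical host
    stats = client.iot.hourly_stats                      # hypothetical collection

    stats.insert_one({"device_id": "tracker-42",
                      "hour": datetime(2024, 1, 1, 12),
                      "events": 1170})

    # Aggregate a device's totals across the stored hourly buckets.
    pipeline = [
        {"$match": {"device_id": "tracker-42"}},
        {"$group": {"_id": "$device_id", "total": {"$sum": "$events"}}},
    ]
    print(list(stats.aggregate(pipeline)))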

Azure Cosmos DB

We leverage Azure Cosmos DB to implement a multi-model, globally distributed, elastic NoSQL database on the cloud. Our team used Cosmos DB in a connected car solution for one of the world’s technology leaders.
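
A minimal azure-cosmos sketch (account endpoint, key, and names are placeholder assumptions) that upserts one telemetry item into a container partitioned by vehicle:

    from azure.cosmos import CosmosClient

    client = CosmosClient("https://<account>.documents.azure.com",  # placeholder endpoint
                          credential="<key>")                       # placeholder key
    container = (client.get_database_client("telemetry")            # hypothetical names
                       .get_container_client("car_events"))

    container.upsert_item({"id": "veh-1-2024-01-01T12:00",
                           "vehicle_id": "veh-1",    # assumed partition key
                           "speed_kmh": 87})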

Amazon DynamoDB

We use Amazon DynamoDB as a NoSQL database service for solutions that require low latency, high scalability, and always-available data.
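
A minimal boto3 sketch (table name and key schema are assumptions): writes and reads go by primary key, which is what keeps DynamoDB latency low at any scale:

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("DeviceEvents")   # hypothetical table

    table.put_item(Item={"device_id": "tracker-42",        # assumed partition key
                         "ts": "2024-01-01T12:00:00Z",     # assumed sort key
                         "event": "geofence_exit"})
    resp = table.get_item(Key={"device_id": "tracker-42",
                               "ts": "2024-01-01T12:00:00Z"})
    print(resp.get("Item"))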

Google Cloud Datastore

We use Google Cloud Datastore to set up a highly scalable and cost-effective solution for storing and managing NoSQL data structures. This database can be easily integrated with other Google Cloud services (BigQuery, Kubernetes, and many more).
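
A minimal google-cloud-datastore sketch (kind, key, and properties are assumptions):

    from google.cloud import datastore

    client = datastore.Client()   # project and credentials resolved from the environment
    key = client.key("DeviceEvent", "tracker-42#2024-01-01T12:00")  # hypothetical kind/name
    entity = datastore.Entity(key=key)
    entity.update({"event": "geofence_exit", "battery": 0.82})
    client.put(entity)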

Serving layer / Data warehouse

PostgreSQL

ScienceSoft has used PostgreSQL in an IoT fleet management solution that supports 2,000+ customers with 26,500+ IoT devices. We’ve also helped a fintech startup promptly launch a top-flight BNPL product based on PostgreSQL.
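
A minimal psycopg2 sketch (connection string, table, and columns are assumptions) of the kind of write such a solution performs:

    import psycopg2

    conn = psycopg2.connect("dbname=fleet user=app host=pg-host")  # hypothetical DSN
    with conn, conn.cursor() as cur:   # the outer context manager commits on success
        cur.execute(
            "INSERT INTO device_positions (device_id, ts, lat, lon) "
            "VALUES (%s, now(), %s, %s)",
            ("tracker-42", 52.52, 13.40),
        )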

Amazon Redshift

We use Amazon Redshift to build cost-effective data warehouses that easily handle complex queries and large amounts of data.
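
Since Redshift speaks the PostgreSQL wire protocol, a minimal analytical query can be sketched with psycopg2 as well (cluster endpoint, credentials, and schema are assumptions):

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.redshift.amazonaws.com",  # hypothetical endpoint
                            port=5439, dbname="dwh",
                            user="analyst", password="...")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT order_date, SUM(amount) AS revenue
            FROM sales
            GROUP BY order_date
            ORDER BY order_date DESC
            LIMIT 30
        """)
        print(cur.fetchall())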


Governance tools

Apache ZooKeeper

We leverage Apache ZooKeeper to coordinate services in large-scale distributed systems and avoid server crashes, performance bottlenecks, and partitioning issues.
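
A minimal kazoo sketch (host and znode paths are assumptions): each worker registers an ephemeral znode, which ZooKeeper deletes automatically if the worker's session dies, so the rest of the cluster notices failures:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk-host:2181")   # hypothetical ensemble address
    zk.start()
    zk.ensure_path("/workers")
    # Ephemeral znodes disappear when this client's session ends.
    zk.create("/workers/worker-1", b"alive", ephemeral=True)
    print(zk.get_children("/workers"))       # current live workers
    zk.stop()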


Note: Building a big data processing solution is more than just connecting ready-made components. A huge and vital share of the work is custom code that integrates the components, supports unique operations, or creates a competitive advantage.


Stream Processing and Analytics in Pet Tracking by ScienceSoft

The IoT solution allows users to track the current location of their pets. If a critical event happens (e.g., a pet crosses a geofence set by the owner or the pet's wearable tracker goes out of communication), the user receives a push notification.

Multiple GPS trackers send real-time data about pet locations and events ("low battery", "leaving a safe territory", etc.) to the message broker (Mosquitto). The message broker passes this data to the stream data processor (Apache Kafka), which processes multiple MQTT topics in real time and checks data quality. This component also triggers push notifications when a critical event happens.

A data aggregator (Apache Spark) processes the data in memory, aggregates it by hour, day, week, and month, and transfers it to a data warehouse (MongoDB) to enable historical location reporting.
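
A minimal sketch of the Mosquitto-to-Kafka hand-off at the start of this pipeline (broker addresses, topics, and the paho-mqtt 1.x client API are assumptions):

    import paho.mqtt.client as mqtt
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="broker:9092")   # hypothetical broker

    def on_message(client, userdata, msg):
        # Forward each raw tracker message from MQTT onto the Kafka bus.
        producer.send("pet-tracker-events", msg.payload)        # hypothetical topic

    client = mqtt.Client()                    # paho-mqtt 1.x style constructor
    client.on_message = on_message
    client.connect("mosquitto-host", 1883)    # hypothetical Mosquitto address
    client.subscribe("trackers/+/events")     # hypothetical topic pattern
    client.loop_forever()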

The architecture processes 30,000 events per second from 1 million devices.

Key techs: Apache Kafka, Apache Spark, MongoDB.

Big Data Processing for Clinical Intelligence by Repp Health

Use case: Monitoring patient movements and falls, as well as the presence of staff and other visitors, and updating the EHR with this data.

Key techs: AWS IoT Core, Amazon Kinesis, Amazon S3.

Big Data Processing and Analytics for Predictive Maintenance for Kennametal

Use case: Collecting operational and production data from the factory machines to help employees better understand how well the machines work.

Key techs: Azure IoT, Azure Stream Analytics, Azure SQL Database, Power BI, Azure Machine Learning.

  • In custom software development since 1989.
  • In data management, data analytics and data science since 1989.
  • In business intelligence and data warehousing since 2005.
  • In big data services since 2013.

We are proud of our professional and dedicated team of senior project managers, business analysts, solution architects, developers, data analysts, and other IT professionals with 7-20 years of experience.

How ScienceSoft Can Help On Your Big Data Journey

ScienceSoft designs and builds software solutions that help companies successfully handle the ever-growing amount of data for operational and analytical needs – sensor data, XaaS data, customer and personalization data, images and video, clickstream data, financial transactions data, health data, and more.

Big data consulting

ScienceSoft draws up an effective big data processing strategy, plans high-performance, secure and resilient architecture for your big data app, chooses an optimal technology stack, and proves the viability of a complex big data project with a PoC.

Go for consulting

Big data development

ScienceSoft plans, designs, develops, deploys, supports, and evolves organization-wide big data platforms and dedicated big data solutions.

Go for implementation

It's High Time to Put Big Data at the Top of Your Agenda

Big data brings big value. With 62% of mid-size and large companies spending more than $50M on big data, and 12% spending over $500M, 92% of them say that their investments in big data are paying off. Big data processing and analytics enable more efficient ways of doing business, faster and better decision-making, and new successful products and services.

The global big data market is huge – it is expected to reach $103B by 2027.

Plan Your Big Data Project with Confidence

ScienceSoft's consultants and solution architects can provide you with a custom quote for your future big data project.


May Your Big Data Solution Drive Maximum Value for Your Company!

And if you need expert assistance in building scalable, reliable, secure, and cost-efficient architecture for it, we'll be glad to help.