For years, people have asked all-knowing Google how big data can help businesses to succeed, what big data technologies are the best, and other important questions. A lot has been written and said about big data already, but the term itself remains unexplained. To be fair, we do not count a widespread definition “big data is big.” This concept raises another question: what are the measures for “big” – 1 terabyte, 1 petabyte, 1 exabyte or more?
Here, our big data consulting team defines the concept of big data through describing its key features. To give a complete picture, we also share an overview of big data examples from different industries, enumerate different sources of big data and fundamental technologies.
Here’s our definition:
Big data is the data that is characterized by such informational features as the log-of-events nature and statistical correctness, and that imposes such technical requirements as distributed storage, parallel data processing and easy scalability of the solution.
Below, you can read about these features and requirements in more detail.
Informational features: In contrast to traditional data that may change at any moment (e.g., bank accounts, quantity of goods in a warehouse), big data represents a log of records where each describes some event (e.g., a purchase in a store, a web page view, a sensor value at a given moment, a comment on a social network). Due to its very nature, event data does not change.
Besides, big data may contain omissions and errors, which makes it a bad choice for the tasks where absolute accuracy is crucial. So, it doesn’t make much sense to use big data for bookkeeping. However, big data is correct statistically and can give a clear understanding of the overall picture, trends and dependencies. Another example from Finance: big data can help identify and measure market risks based on the analysis of customer behavior, industry benchmarks, product portfolio performance, interest rates history, commodity price changes, etc.
Technical requirements: Big data has a volume that requires parallel processing and a special approach to storage: one computer (or one node as IT gurus call it) is not sufficient to perform these tasks – we need many, typically from 10 to 100.
Besides, big data solution needs scalability. To cope with ever-growing data volume, we don’t need to introduce any changes to the software each time the amount of data increases. If this happens, we just involve more nodes, and the data will be redistributed among them automatically.
To better understand what big data is, let’s go beyond the definition and look at some examples of practical application from different industries.
1. Customer analytics
To create a 360-degree customer view, companies need to collect, store and analyze a plethora of data. The more data sources they use, the more complete picture they will get. Say, for each of their 10+ million customers they can analyze 5 types of customer big data:
- Demographic data (this customer is a woman, 35 years old, has two children, etc.).
- Transactional data (the products she buys each time, the time of purchases, etc.)
- Web behavior data (the products she puts into her basket when she shops online).
- Data from customer-created texts (comments about the company that this woman leaves on the internet).
- Data about product/service use (feedback on the quality of the goods ordered, the speed of delivery, etc.).
Customer analytics is equally beneficial for companies and customers. The former can adjust their product portfolio to better satisfy customer needs and organize efficient marketing activities. The latter can enjoy favorite products, relevant promotions and personalized communication.
2. Industrial analytics
To avoid expensive downtimes that affect all the related processes, manufacturers can use sensor data to foster proactive maintenance. Imagine that the analytical system has been collecting and analyzing sensor data for several months to form a history of observations. Based on this historical data, the system has identified a set of patterns that are likely to end up with a machine breakdown. For instance, the system recognizes that picture formed by temperature and load sensors is similar to pre-failure situation #3 and alerts the maintenance team to check the machinery.
It’s important to mention that preventive maintenance is not the only example of how manufacturers can use big data. In this article, you’ll find a detailed description of other real-life big data use cases.
3. Business process analytics
Companies also use big data analytics to monitor the performance of their remote employees and improve the efficiency of the processes. Let’s take transportation as an example. Companies can collect and store the telemetry data that comes from each truck in real time to identify a typical behavior of each driver. Once the pattern is defined, the system analyzes real-time data, compares it with the pattern and signals if there is a mismatch. Thus, the company can ensure safe working conditions (as drivers should change to have a rest, but they sometimes neglect the rule).
4. Analytics for fraud detection
Banks can detect an unusual card behavior in real time (if somebody else, not the owner, is using it) and block suspicious activities or at least postpone them to notify the owner. For example, if the user is trying to withdraw money in Spain, while they reside in Texas, before declining the transaction, the bank can check the user’s info on the social network – maybe they are simply on vacations. Besides, the bank can verify if this user has any linkage with fraud-related accounts or activities across all other channels.
There are two types of big data sources: internal and external ones. Data is internal if a company generates, owns and controls it. External data is public data or the data generated outside the company; correspondingly, the company neither owns nor controls it.
Let’s look at some self-explanatory examples of data sources.
Big data can be used both as a part of traditional BI and in an independent system. Let’s turn to examples again. A company analyses big data to identify behavior patterns of every customer. Based on these insights, it allocates the customers with similar behavior patterns to a particular segment. Finally, a traditional BI system uses customer segments as another attribute for reporting. For instance, users can create reports that show the sales per customer segment or their response to a recent promotion.
Another example: Imagine an ecommerce website supported by the analytical system that identifies the preferences of each user by monitoring the products they buy or are interested in (according to the time spent on a product page). Based on this information, the system recommends “you-may-also-like” products. This is an independent system.
The world of big data speaks its own language. Let’s look at some good-to-know terms and most popular technologies:
- Сloud is the delivery of on-demand computing resources on a pay-for-use basis. This approach is widely used in big data, as the latter requires fast scalability. E.g., an administrator can add 20 computers in a few clicks.
- Hadoop is a framework used for distributed storage of huge amounts of data (its HDFS component) and parallel data processing (Hadoop MapReduce). It breaks a large chunk into smaller ones to be processed separately on different data nodes (computers) and automatically gathers the results across the multiple nodes to return a single result. Quite often Hadoop means the ecosystem that covers multiple big data technologies, such as Apache Hive, Apache HBase, Apache Zookeeper and Apache Oozie.
- Apache Spark is a framework used for in-memory parallel data processing, which makes real-time big data analytics possible. E.g., an analytical system may identify that a visitor has been spending quite a long time on particular product pages, but has not added them to the cart yet. To motivate a purchase, the system can offer a discount coupon for the product of interest.
- Spark vs. Hadoop MapReduce: Which big data framework to choose
- Apache Cassandra vs. Hadoop Distributed File System: When Each is Better
- Cassandra vs. HBase: twins or just strangers with similar looks?
Now you know what big data is, don’t you?
Our big data consultants created a short quiz. There are five questions for you to check how much you’ve learned about big data:
- What kind of data processing does big data require?
- Is big data 100% reliable and accurate?
- If your goal is to create a unique customer experience, what kind of big data analytics do you need?
- Name at least three external sources of big data.
- Is there any similarity between Hadoop and Apache Spark?
Well done! We hope that the article was helpful to you and that after reading it you’ve found the quiz easy.
Big data is another step to your business success. We will help you to adopt an advanced approach to big data to unleash its full potential.