What is Big Data? A buzzword explained
For years, people have been asking all-knowing Google how big data can help businesses succeed, which big data technologies are the best, and a wide range of other important questions. A lot has been written and said about big data already, but the term itself remains unexplained. To be fair, we do not count the widespread definition “big data is big,” as it only raises another question: what are the measures for “big” – 1 terabyte, 1 petabyte, 1 exabyte or more?
Our big data consulting team favors a consistent approach, and here we would like to share the fundamentals and define what big data is through its key features.
We call big data any data that meets the following two criteria.
Informationally: In contrast to traditional data that may change at any moment (e.g., bank accounts, quantity of goods in a warehouse), big data represents a log of records where each describes some event (e.g., a purchase in a store, a web page view, a sensor value at a given moment, a comment on a social network). Due to its very nature, event data does not change.
Besides, big data may contain omissions and errors, which makes it a bad choice for tasks where absolute accuracy is crucial; it doesn’t make much sense, say, to use big data for bookkeeping. However, big data is statistically correct and can give a clear understanding of the overall picture, trends and dependencies. To give an example from finance: big data can help identify and measure market risks based on the analysis of customer behavior, industry benchmarks, product portfolio performance, interest rate history, commodity price changes, etc.
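To make the bookkeeping-versus-statistics point concrete, here is a minimal Python sketch with a hypothetical purchase log; the stores, amounts and the omission are invented for illustration:

```python
# Sketch: event logs with omissions are useless for exact bookkeeping but
# still fine for statistics. Hypothetical purchase log; one record is
# missing its amount.
purchases = [
    {"store": "A", "amount": 25.0},
    {"store": "A", "amount": None},   # omission: amount was not recorded
    {"store": "A", "amount": 27.0},
    {"store": "B", "amount": 26.0},
]

# Exact bookkeeping is impossible: the true total is unknown.
complete = [p["amount"] for p in purchases if p["amount"] is not None]
total = sum(complete)          # undercounts the real total

# But the statistical picture (average basket size) stays reliable.
average_basket = sum(complete) / len(complete)
print(f"known total: {total}, average basket: {average_basket:.2f}")
```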
Technically: Big data has a volume that requires parallel processing and a special approach to storage: one computer (or one node, as IT gurus call it) is not sufficient to perform these tasks; we need many nodes, typically from 10 to 100.
Besides, a big data solution needs to be scalable. To cope with an ever-growing data volume, we shouldn’t have to change the software each time the amount of data increases. Instead, we just add more nodes, and the data is redistributed among them automatically.
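The idea of adding nodes without changing the software can be sketched with simple hash partitioning; real systems use more elaborate placement schemes, and the record keys below are invented for illustration:

```python
# Sketch: how a big data store can redistribute records automatically when
# nodes are added. Hash partitioning is the simplest scheme; production
# systems use more sophisticated variants of the same idea.
def assign_node(record_key: str, num_nodes: int) -> int:
    """Map a record to one of `num_nodes` nodes by hashing its key."""
    return hash(record_key) % num_nodes

events = [f"event-{i}" for i in range(1000)]

# 10 nodes today...
placement_10 = {key: assign_node(key, 10) for key in events}
# ...20 nodes tomorrow: no application code changes; the same function
# simply spreads the records over more nodes.
placement_20 = {key: assign_node(key, 20) for key in events}

per_node_10 = len(events) / 10   # ~100 records per node
per_node_20 = len(events) / 20   # ~50 records per node
```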
Let’s go beyond the definition and look at some illustrative examples to better understand what big data is. We’ve grouped the examples by industry to show the practical applications of big data.
Customer analytics in retail
To create a 360-degree customer view, retailers need to collect, store and analyze a plethora of data. The more data sources they use, the more complete a picture they get. Say, for each of their 10+ million customers, they can analyze:
- Demographic data (this customer is a woman, 35 years old, has two children, etc.).
- Transactional data (the products she buys each time, the time of purchases, etc.).
- Web behavior data (the products she puts into her basket when she shops online).
- Data from customer-created texts (comments about the retailer that this woman leaves on the internet).
Customer analytics is equally beneficial for retailers and customers. The former can adjust their product portfolios to better satisfy customer needs and organize efficient marketing activities. The latter can enjoy their favorite products, relevant promotions and personalized communication.
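The 360-degree view built from the data sources listed above can be sketched as a merge of four per-source records into one profile; the customer ID and field names below are hypothetical, not a real retailer’s schema:

```python
# Sketch: assembling a 360-degree customer view by joining four sources.
demographics = {"c1001": {"gender": "F", "age": 35, "children": 2}}
transactions = {"c1001": [{"product": "coffee", "time": "2024-03-01T09:15"}]}
web_behavior = {"c1001": {"basket": ["coffee", "mug"]}}
comments     = {"c1001": ["Great loyalty program!"]}

def customer_360(customer_id: str) -> dict:
    """Merge all sources into one profile; missing sources become defaults."""
    return {
        "id": customer_id,
        "demographics": demographics.get(customer_id, {}),
        "transactions": transactions.get(customer_id, []),
        "web_behavior": web_behavior.get(customer_id, {}),
        "comments": comments.get(customer_id, []),
    }

profile = customer_360("c1001")
```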
Proactive maintenance in manufacturing
To avoid expensive downtimes that affect all the related processes, manufacturers can use sensor data to foster proactive maintenance. Imagine that an analytical system has been collecting and analyzing sensor data for several months to build a history of observations. Based on this historical data, the system has identified a set of patterns that are likely to end in a machine breakdown. For instance, the system recognizes that the picture formed by temperature and load sensors is similar to pre-failure situation #3 and alerts the maintenance team to check the machinery.
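The “similar to pre-failure situation #3” comparison can be sketched as a nearest-pattern search over sensor readings; the patterns, readings and alert threshold below are invented for illustration:

```python
import math

# Sketch: matching current sensor readings against historical pre-failure
# patterns, as in the temperature/load example above.
pre_failure_patterns = {
    "pre-failure #1": [60.0, 0.40],   # [temperature, load]
    "pre-failure #3": [92.0, 0.95],
}

def closest_pattern(reading, patterns, threshold=5.0):
    """Return the pattern nearest to `reading`, or None if all are too far."""
    best_name, best_dist = None, float("inf")
    for name, pattern in patterns.items():
        dist = math.dist(reading, pattern)   # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None

alert = closest_pattern([91.0, 0.90], pre_failure_patterns)
# → "pre-failure #3": alert the maintenance team to check the machinery
```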
It’s important to mention that proactive maintenance is not the only way manufacturers can use big data. In this article, you’ll find a detailed description of other real-life big data use cases.
Business process analytics
Companies also use big data analytics to monitor the performance of their remote employees and improve the efficiency of their processes. Let’s take transportation as an example. A company can collect and store the telemetry data that comes from each truck in real time to identify the typical behavior of each driver. Once the pattern is defined, the system analyzes real-time data, compares it with the pattern and signals if there is a mismatch. Thus, the company can ensure safe working conditions (drivers are required to take rest breaks, but they sometimes neglect the rule).
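The pattern-versus-live-telemetry comparison can be sketched as a simple deviation check; the telemetry fields, tolerances and the rest-break limit below are illustrative assumptions, not real fleet-management rules:

```python
# Sketch: flagging a mismatch between real-time telemetry and a driver's
# learned typical pattern.
typical = {"avg_speed_kmh": 78.0, "hours_since_rest": 3.0}

def check_driver(current: dict, typical: dict,
                 speed_tolerance: float = 15.0,
                 max_hours_without_rest: float = 4.5) -> list:
    """Compare live telemetry with the learned pattern; return alerts."""
    alerts = []
    if abs(current["avg_speed_kmh"] - typical["avg_speed_kmh"]) > speed_tolerance:
        alerts.append("speed deviates from the driver's usual pattern")
    if current["hours_since_rest"] > max_hours_without_rest:
        alerts.append("driver is overdue for a rest break")
    return alerts

alerts = check_driver({"avg_speed_kmh": 96.0, "hours_since_rest": 5.2}, typical)
# both checks fire: unusual speed, and a neglected rest break
```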
Analytics for fraud detection
Banks can detect unusual card behavior in real time (when somebody other than the owner is using the card) and block suspicious activities, or at least postpone them to notify the owner. For example, if a user is trying to withdraw money in Spain while they reside in Texas, before declining the transaction, the bank can check the user’s info on social networks – maybe they are simply on vacation. Besides, the bank can verify whether this user has any links to fraud-related accounts or activities across all other channels.
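The Spain-versus-Texas scenario can be sketched as a rule-based first pass; a real bank combines far more signals (and machine learning), and the countries, fields and decision labels below are illustrative only:

```python
# Sketch: a rule-based first pass for the card-fraud scenario above.
def score_transaction(tx: dict, profile: dict) -> str:
    """Classify a card transaction as 'ok', 'verify', or 'block'."""
    if tx["country"] != profile["home_country"]:
        # Abroad: don't decline outright; the owner may simply be on
        # vacation, so ask for extra verification and notify the owner.
        if profile.get("linked_to_fraud"):
            return "block"
        return "verify"
    return "ok"

profile = {"home_country": "US", "linked_to_fraud": False}
decision = score_transaction({"country": "ES", "amount": 400}, profile)
# → "verify": suspicious, so postponed until the owner is notified
```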
Big data sources
There are two categories of big data sources: internal and external ones. Let’s have a closer look at them.
When a company generates, owns and controls data, this data is internal. External data is public data or data generated outside the company; correspondingly, the company neither owns nor controls it.
Autonomous system or a part of traditional BI?
Big data can be used both as a part of traditional BI and in an independent system. Let’s turn to examples again. A company analyzes big data to identify the behavior patterns of every customer. Based on these insights, it assigns customers with similar behavior patterns to a particular segment. Finally, a traditional BI system uses customer segments as another attribute for reporting. For instance, users can create reports that show sales per customer segment or each segment’s response to a recent promotion.
Another example: Imagine an ecommerce website supported by an analytical system that identifies the preferences of each user by monitoring the products they buy or show interest in (judging by the time spent on a product page). Based on this information, the system recommends “you-may-also-like” products. This is an independent system.
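The “you-may-also-like” logic described above can be sketched with dwell time as the interest signal; the products, view times and threshold below are invented for illustration:

```python
# Sketch: ranking products by how long a visitor lingered on their pages.
page_views = [
    ("headphones", 210),   # (product, seconds on page)
    ("phone case", 15),
    ("headphones", 95),
    ("speaker", 130),
]

def interesting_products(views, min_total_seconds=120):
    """Products the visitor lingered on, ranked by total dwell time."""
    totals = {}
    for product, seconds in views:
        totals[product] = totals.get(product, 0) + seconds
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [p for p, t in ranked if t >= min_total_seconds]

recommend_for = interesting_products(page_views)
# → ["headphones", "speaker"]: candidates for a "you-may-also-like" offer
```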
The world of big data speaks its own language. Let’s look at some good-to-know names and terms:
- Cloud is the delivery of on-demand computing resources on a pay-per-use basis. This approach is widely used in big data, as the latter requires fast scalability. E.g., an administrator can add 20 computers in a few clicks.
- Hadoop is a framework used for distributed storage of huge amounts of data and parallel data processing. It breaks a large data set into smaller chunks to be processed separately on different data nodes (computers) and automatically gathers the results from the multiple nodes to return a single result.
- HDFS is the Hadoop Distributed File System, which allows multiple files to be stored and accessed simultaneously.
- Apache Spark is a framework used for in-memory parallel data processing, which makes near real-time analytics possible. E.g., an analytical system may identify that a visitor has been spending quite a long time on particular product pages, but has not added them to the cart yet. To motivate a purchase, the system can offer a discount coupon for the product of interest.
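Hadoop’s split/map/shuffle/reduce data flow (see the Hadoop entry above) can be sketched in a few lines of plain Python; in real Hadoop each phase runs distributed across many nodes, while here the phases are just functions, to show how the data moves:

```python
from collections import defaultdict

# Sketch: Hadoop's split / map / shuffle / reduce idea on a word count.
def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in one data chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group the pairs by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data is big", "data about data"]          # one chunk per node
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
# → {"big": 2, "data": 3, "is": 1, "about": 1}
```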
Now you know what big data is, don’t you?
Our big data consultants created a short quiz. There are five questions for you to check how much you’ve learned about big data:
- What kind of data processing does big data require?
- Is big data 100% reliable and accurate?
- If your goal is to create a unique customer experience, what kind of big data analytics do you need?
- Name at least three external sources of big data.
- Is there any similarity between Hadoop and Apache Spark?
Well done! We hope that the article was helpful to you and that after reading it you’ve found the quiz easy.