Editor’s note: In this article, ScienceSoft’s data analytics experts share their 15 years of experience in data warehouse services to help you understand the architectural differences, as well as the strengths and weaknesses, of the two data warehouse design types.
If your company is developing a strategy for implementing a data warehouse (DWH), one of the first things you’ll need to settle on is whether to build your DWH in the cloud or on premises.
The architecture of a traditional on-premises DWH is represented by two components – the staging area and the storage area.
The staging area
The staging area is built around the process of extraction, transformation, and loading. This process comes in two kinds – traditional ETL and modern ELT – depending on where the transformation step is performed. With the traditional ETL process, data is transformed into the target structure before being loaded into the storage area. With the ELT process, raw data is retrieved and loaded into the storage area, and the transformation step is performed there.
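The difference between the two processes can be sketched in a few lines of Python. This is a minimal illustration with in-memory lists standing in for the source system and the warehouse; the record fields and function names are assumptions made for the example, not part of any real pipeline.

```python
def transform(record):
    """Normalize a raw record into the target data structure."""
    return {"customer": record["name"].strip().title(), "amount": float(record["amt"])}

def etl_load(source, warehouse):
    # ETL: transform BEFORE loading into the storage area.
    for rec in source:
        warehouse.append(transform(rec))

def elt_load(source, warehouse):
    # ELT: load raw data first; the transformation runs inside the warehouse.
    warehouse.extend(source)
    warehouse[:] = [transform(rec) for rec in warehouse]

source = [{"name": "  alice  ", "amt": "10.5"}]
etl_wh, elt_wh = [], []
etl_load(source, etl_wh)
elt_load(list(source), elt_wh)
print(etl_wh == elt_wh)  # → True
```

Both paths end with the same target structure; what differs is where the transformation work happens – in the staging area (ETL) or in the storage area itself (ELT).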
The storage area
ScienceSoft employs two approaches to building a storage area – top-down and bottom-up. Under both approaches, a data warehouse involves two structural elements – a centralized repository, where all of the company’s data is kept, and data marts – subject-oriented databases that store data related to specific business areas, for example, data belonging to certain units such as marketing or finance.
Under the top-down approach, the central repository is designed first; only after that are data marts created to further segment enterprise data. The bottom-up approach treats data marts as the main elements for storing data and views the centralized repository as a combination of data marts that facilitates enterprise-wide analysis and reporting.
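The two approaches can be contrasted with a toy sketch, assuming records tagged by business area (the record layout and function names below are illustrative, not a real schema):

```python
# Hypothetical records tagged by business area.
records = [
    {"area": "marketing", "value": 1},
    {"area": "finance", "value": 2},
    {"area": "marketing", "value": 3},
]

def top_down(records):
    # Central repository first; data marts are carved out of it afterwards.
    repository = list(records)
    marts = {}
    for rec in repository:
        marts.setdefault(rec["area"], []).append(rec)
    return repository, marts

def bottom_up(records):
    # Data marts first; the repository is assembled as their combination.
    marts = {}
    for rec in records:
        marts.setdefault(rec["area"], []).append(rec)
    repository = [rec for mart in marts.values() for rec in mart]
    return repository, marts

repo_td, marts_td = top_down(records)
repo_bu, marts_bu = bottom_up(records)
print(sorted(r["value"] for r in repo_td) == sorted(r["value"] for r in repo_bu))  # → True
```

Either way, the same data ends up in both structural elements; the approaches differ only in which element is designed and populated first.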
Opting for the on-premises data warehouse, you get:
- Full control of your DWH
You get complete control over hardware and software, so in case of a failure, your IT staff has direct access to the problem area. Your specialists can adjust or replace hardware and software at any time, avoiding reliance on third parties. Moreover, data security remains strictly under your IT team’s control.
- Full compliance
Data compliance is easier to achieve with on-premises DWHs. For example, to meet some regulatory requirements, you need to know exactly where the data is located.
- Full data availability
Business users at the facility where the DWH is located can effectively access all the data stored in it without depending on an internet connection.
- Full responsibility
Together with full control of your on-premises DWH, you get full responsibility for its implementation and maintenance. You will have to cover upfront and ongoing hardware and software costs, as well as pay for a team in charge of the deployment, administration, and maintenance of your DWH solution.
- Lack of scalability
On-premises DWHs cannot meet requirements for increased memory or compute power as effectively as cloud DWHs. Scaling up inevitably means purchasing new hardware, which may in turn require tuning or replacing current software.
Going cloud is one of the DWH implementation trends. Each cloud vendor offers a unique DWH architecture. Here, we will have a look at the architecture of Amazon Redshift data warehouse, which is a cloud-based representation of a traditional data warehouse.
The core infrastructure component is a cluster, which is composed of one or more compute nodes. If a cluster consists of more than one compute node, a leader node is added to manage communication between the compute nodes and with client applications. Each compute node has its own dedicated CPU, memory, and attached disk storage, which allows queries to be processed quickly and in parallel.
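To see why per-node storage enables parallel processing, consider how rows can be spread across nodes by hashing a key – similar in spirit to Redshift’s KEY distribution style, though greatly simplified. The node count, table layout, and function names here are assumptions for illustration:

```python
import hashlib

NUM_NODES = 4  # assumed cluster size for this sketch

def node_for(key):
    # Hash the distribution key to pick a node deterministically.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

# Hypothetical orders table, distributed across the compute nodes.
rows = [{"order_id": i, "total": i * 9.99} for i in range(100)]
slices = {n: [] for n in range(NUM_NODES)}
for row in rows:
    slices[node_for(row["order_id"])].append(row)

# Each node scans only its own slice, so an aggregate query can run
# in parallel; the leader node then combines the partial results.
partial_sums = [sum(r["total"] for r in slices[n]) for n in range(NUM_NODES)]
total = sum(partial_sums)
```

Because every node holds only a fraction of the table, each partial sum touches a fraction of the data, and the leader node’s final combination step is cheap.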
- Scalability
The inherent agility of cloud data warehouses allows upscaling and downscaling with no impact on DWH performance.
- Reduced costs
Data warehouse costs vary significantly depending on numerous factors – the cloud vendor, the software technology stack, etc. (Here, you may have a look at how ScienceSoft’s DWH implementation best practices helped find the optimal technology combination for a customer based on their requirements for storage and compute.) But the one thing you don’t have to worry about when implementing a data warehouse in the cloud is hardware. All related costs (hardware acquisition, deployment, maintenance, administration, etc.) are eliminated. If you choose the PaaS model for your cloud DWH, you eliminate software acquisition and maintenance costs as well.
- Compliance limitations
Although most cloud providers offer security features that are difficult for attackers to penetrate, some industry standards and regulations still require sensitive data to be stored on premises.
- The risk of budget overruns
Cloud flexibility can result in overspending, for example, due to unexpectedly increased query volumes. To guard against this, you will need to set query threshold alerts so you don’t end up spending too much.
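A threshold-alert check can be as simple as comparing current spend against fractions of a monthly budget. The thresholds and function below are assumptions for illustration, not a real cloud billing API:

```python
# Hypothetical budget fractions at which an alert should fire.
THRESHOLDS = [0.5, 0.8, 1.0]

def alerts_for(spend, budget):
    """Return the budget fractions the current spend has crossed."""
    return [t for t in THRESHOLDS if spend >= t * budget]

print(alerts_for(850, 1000))  # → [0.5, 0.8]
```

In practice you would wire such a check to the billing metrics your cloud vendor exposes and notify the team as each threshold is crossed, well before the budget is exhausted.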
A hybrid DWH is a data repository that resides both on premises and in a cloud environment – be it public, private, or both. This way, it combines the flexibility of the cloud with the data availability and compliance granted by on-premises deployment.
Looking for the best DWH solution?
ScienceSoft can help you weigh all the pros and cons to choose the best DWH implementation option for your company.
A DWH vendor with 14 years of experience, we can develop, migrate, and support your data warehouse or consult on any issue concerning your DWH.