Best Software to Build a Data Warehouse in the Cloud: Features, Benefits, Costs

ScienceSoft has been rendering data warehouse consulting services for more than 16 years.

Cloud data warehouse: the essence

A cloud data warehouse uses the space and compute power allocated by a cloud provider to integrate and store data from disparate data sources for analytical querying and reporting.

Cloud vs. On-premises data warehouse

The two deployment models are typically compared across the following aspects:

  • Scalability
  • Availability
  • Security
  • Performance
  • Cost-effectiveness

Data integration and management

  • Data integration with ETL/ELT processes.
  • Flexible SQL querying of data.
  • Ingestion of all data types (structured, semi-structured, unstructured).
  • Big data ingestion.
  • Streaming data ingestion.
  • Full and incremental data extraction/load.
  • Data transformation of varying complexity (data type conversion, summarization, etc.).
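The full vs. incremental extraction bullet above can be sketched in a few lines of Python. This is a minimal, platform-agnostic illustration: the in-memory source table and the `updated_at` watermark column are assumptions for the example, not part of any specific warehouse product.

```python
from datetime import datetime

# Hypothetical source rows; each carries an updated_at timestamp.
SOURCE = [
    {"id": 1, "value": "a", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "value": "b", "updated_at": datetime(2024, 2, 1)},
    {"id": 3, "value": "c", "updated_at": datetime(2024, 3, 1)},
]

def full_extract(rows):
    """Full extraction: pull every row on each run."""
    return list(rows)

def incremental_extract(rows, watermark):
    """Incremental extraction: pull only rows changed after the
    last successful load (the 'watermark')."""
    return [r for r in rows if r["updated_at"] > watermark]

# First run: full load, then remember the high-water mark.
loaded = full_extract(SOURCE)
watermark = max(r["updated_at"] for r in loaded)

# A later run extracts only rows modified after the watermark.
delta = incremental_extract(SOURCE, watermark)
```

Real ETL/ELT tools persist the watermark between runs (in a control table or pipeline state) instead of recomputing it in memory.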

Data storage

  • Subject-oriented data storage.
  • Time-variant data storage (data is kept as a historical record).
  • Read-only data storage.
  • Integrated data storage (data is consolidated from disparate data sources).
  • Metadata storage.
  • Optimized data storage (columnar data storage, data compression, etc.).
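To illustrate why the columnar storage and compression mentioned in the last bullet pay off, here is a small Python sketch. The run-length encoding is a deliberately simplified stand-in for the heavier compression schemes real warehouses apply to column data.

```python
# Row-oriented layout: one record per row.
rows = [
    {"region": "EU", "amount": 100},
    {"region": "EU", "amount": 150},
    {"region": "US", "amount": 200},
]

# Columnar layout: one list per column; an analytical scan reads
# only the columns it needs instead of whole rows.
columns = {
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

def run_length_encode(values):
    """Simple run-length encoding. Columnar layouts compress well
    because adjacent values within a column often repeat."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

compressed_region = run_length_encode(columns["region"])
```

Here three stored values collapse to two (value, count) pairs; on low-cardinality columns with millions of rows the savings are far larger.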

Data warehouse database performance

  • Elastic on-demand scaling of storage and compute resources.
  • Massively parallel processing database.
  • Materialized view support (reusing pre-computed results of analytical queries).
  • Result caching.
  • Performance and concurrency dynamic management (predicting and prioritizing queries with ML).
  • Fault tolerance.
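The result-caching bullet above can be demonstrated with a toy in-memory cache keyed by query text. This is a conceptual sketch only; real warehouses also invalidate cached results when the underlying tables change, which is omitted here.

```python
QUERY_CACHE = {}

def run_query(sql, executor):
    """Return a cached result when the same query text was already
    executed; otherwise run the query and cache the result."""
    if sql in QUERY_CACHE:
        return QUERY_CACHE[sql], True    # cache hit
    result = executor(sql)
    QUERY_CACHE[sql] = result
    return result, False                 # cache miss

calls = []
def slow_executor(sql):
    calls.append(sql)        # stand-in for an expensive table scan
    return [("total", 42)]

r1, hit1 = run_query("SELECT SUM(x) FROM t", slow_executor)
r2, hit2 = run_query("SELECT SUM(x) FROM t", slow_executor)
```

The second identical query returns instantly from the cache; the expensive executor ran only once.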

Data warehouse database management

  • Automated infrastructure provisioning.
  • Automatic data backup.
  • Pre-built data source integrations.

Security and compliance

  • Data encryption.
  • Granular access control.
  • User authentication and authorization for data access.
  • Compliance with national, regional, and industry-specific regulations (for example, GDPR, HIPAA, PCI DSS).

Need a Well-Performing Cloud DWH?

ScienceSoft is ready to design and implement a cloud data warehouse that meets your specific data storage needs or migrate your current data warehouse solution to the cloud to increase data warehouse performance and cut operational costs.

Cloud Data Warehouse: Important Software Integrations for Reduced Costs and Time to Value

Data lake

A data lake stores large volumes of structured, semi-structured, and unstructured data that is rarely accessed for analytical querying. Additionally, it can receive highly structured data from the data warehouse to be processed with other services (big data systems, ML systems, etc.).

Analytics and reporting software

Highly structured data from the data warehouse is then analyzed, fed into ML workloads, and visualized for end users.

How to ensure cloud data warehouse success

Meet data security and protection requirements

Choose a cloud vendor compliant with the needed regulatory requirements (for example, a cloud vendor that offers HIPAA-eligible data warehouse platforms).

To minimize the risk of data leakage and prevent unauthorized data access, your cloud data warehouse solution should have the following capabilities:

  • User authentication/authorization.
  • Data access controls.
  • End-to-end data encryption.
  • Dynamic data masking, etc.
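Dynamic data masking, the last capability above, can be sketched as a thin read-time layer: privileged roles see raw values while others see redacted ones, and the stored data is never modified. The role names and masking pattern below are illustrative assumptions, not any vendor's built-in policy.

```python
def mask_email(value):
    """Keep the first character and the domain: jane@... -> j***@..."""
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain

def mask_row(row, role, masked_fields=("email",)):
    """Dynamic masking at read time: 'admin' sees raw data, any
    other role sees masked values; stored data is unchanged."""
    if role == "admin":
        return dict(row)
    return {k: (mask_email(v) if k in masked_fields else v)
            for k, v in row.items()}

row = {"name": "Jane", "email": "jane@example.com"}
admin_view = mask_row(row, "admin")
analyst_view = mask_row(row, "analyst")
```

Warehouse platforms implement the same idea declaratively (masking policies attached to columns), which avoids scattering masking logic across client code.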

Choose data warehouse software with vast integration capabilities

A cloud data warehouse platform should have SDKs in common programming languages and support out-of-the-box integration with the required data sources.

Select an optimal pricing model for your data warehouse workloads

To avoid billing overage, closely monitor the cloud usage and consider various scenarios for cost savings suggested by cloud vendors, such as separate scaling of storage and compute resources, savings with reserved infrastructure commitments, billing alerts, data warehouse pausing, the pay-as-you-go model, etc.
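The cost-saving scenarios above are easy to compare with back-of-the-envelope arithmetic. The $2/hour rate and 65% reserved discount below are purely illustrative assumptions, not any vendor's actual prices.

```python
def monthly_compute_cost(hourly_rate, hours_per_day, days,
                         reserved_discount=0.0):
    """Estimate monthly compute cost. Pausing the warehouse shrinks
    hours_per_day; a reserved commitment applies a discount."""
    return hourly_rate * hours_per_day * days * (1 - reserved_discount)

RATE = 2.0  # hypothetical $/hour

always_on = monthly_compute_cost(RATE, 24, 30)         # never paused
paused    = monthly_compute_cost(RATE, 10, 30)         # paused off-hours
reserved  = monthly_compute_cost(RATE, 24, 30, 0.65)   # e.g. 65% off
```

Even this crude model shows why pausing idle warehouses and committing to reserved capacity are the two levers vendors promote first.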

Cloud data warehouse benefits

TCO savings

A cloud data warehouse does not require purchasing and maintaining expensive hardware, and it scales cost-effectively, minimizing the risk of infrastructure overprovisioning.

Decreased development costs

Data warehouse automation reduces IT staff time: storage and compute resources scale up and down automatically, and data management tasks (data collection, aggregation, modeling) are automated.

Fast time to insight

The instant scalability, flexibility, and reliability of the cloud enhance data warehouse performance and availability, which accelerates business intelligence and, thus, business decisions.

Azure Synapse Analytics

Summary

Best for: enterprise data warehousing.

Azure Synapse Analytics is good for integrating data from hundreds of data sources across the company’s divisions, subsidiaries, etc. for analytical querying performed in seconds. Reporting on all management levels, from the C-suite to directors, managers, and supervisors, is protected with fine-grained data access control.

Description

  • Pre-built integrations for ingesting data from 90+ data sources.
  • Loading and cleaning data from any source with Azure Data Factory.
  • Storing vast amounts of data (a maximum of 240 TB for rowstore tables and unlimited storage for columnstore tables).
  • Workload classification and isolation, flexible indexing options (clustered columnstore, clustered and nonclustered indexes), materialized view support, result set caching, etc. for optimized complex query performance.
  • Integration with Azure Machine Learning for building ML models and generating predictions within the data warehouse.
  • Compatibility with Python, R, .Net, T-SQL, Scala, Spark SQL for fulfilling various analytics tasks.
  • Granular permissions on schemas, tables, views, individual columns, procedures, and other objects.
  • Separate billing of compute and storage for cost savings for different data volumes and query load.

ScienceSoft's tip: Azure Synapse Analytics performance gains are usually fully realized when the data warehouse exceeds 1 TB and has billion-row tables. Otherwise, using Azure SQL Database may be reasonable.

Pricing

Compute:

  • On-demand pricing: $1.20/hour (DW100c) - $360/hour (DW30000c).
  • Reserved instance pricing can save up to 65% over the on-demand option (in a 3-year term).

Data storage: $122.88/TB/month.

Note: No charge for the amount of data processed.

Amazon Redshift

Summary

Best for: big data warehousing.

Amazon Redshift enables SQL-querying of exabytes of structured, semi-structured, and unstructured data across the data warehouse, operational data stores, and a data lake with the possibility to further aggregate data with big data analytics and ML services.

Description

  • Running analytic queries against terabytes to petabytes (up to 16 petabytes of data on a cluster) of structured and semi-structured data.
  • Querying exabytes of structured, semi-structured and unstructured data from a data lake (Amazon S3) for analyzing without loading and transformation.
  • Querying data from operational and relational databases on AWS with the federated query capability.
  • Processing big data with Hadoop/Spark using pre-built integration with Amazon EMR.
  • Creating and training ML models with SQL on the data in Amazon Redshift with Amazon ML.
  • Accommodating big data workloads with the Advanced Query Accelerator, result caching, materialized views and ML-based workload management.
  • Possibility to pay separately for compute and managed storage (RA3 node type).

ScienceSoft's tip: To load streaming data (sensor data, other real-time data), use Amazon Kinesis to capture and transform streaming data and load it into the S3 data lake. Then load the data to Redshift by chunks with the COPY command.
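The COPY step from the tip above can be sketched as follows. This is a hedged illustration of composing the statement, not a definitive implementation: the table name, S3 prefix, and IAM role ARN are hypothetical, and the file format options would depend on how Kinesis stages the data.

```python
def build_copy_statement(table, s3_prefix, iam_role):
    """Compose a Redshift COPY statement that loads one chunk of
    staged files from S3. All identifiers here are hypothetical."""
    return (
        f"COPY {table} "
        f"FROM 's3://{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS JSON 'auto' GZIP;"
    )

stmt = build_copy_statement(
    "events_staging",
    "my-data-lake/streaming/2024/06/01/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
```

In practice the statement would be executed against the cluster via a SQL client or driver, with one COPY per staged chunk of streaming data.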

Pricing

  • On-demand pricing: $0.25 - $13.04/hour.
  • Reserved instance pricing can save up to 75% over the on-demand option (in a 3-year term).
  • Data storage (RA3 node type): $0.024/GB/month.

Note: No charge for the amount of data processed.

Google BigQuery

Summary

Best for: cost-effective storage of large volumes of data with infrequent queries.

BigQuery allows for cost-effective exabyte-scale storage with tables having up to 10,000 columns. It's most effective when main analytical queries either filter data according to partitioning or clustering or require scanning the entire dataset.

Description

  • More than 100 pre-built data source connectors.
  • Automated loading and transforming of data with Data Transfer Service (DTS).
  • Querying data across object storage (Cloud Storage), transactional databases (Bigtable, Cloud SQL), and spreadsheets in Drive without data movement, thanks to federated query support.
  • Integration with the Apache big data ecosystem via Dataproc and Dataflow to read and write data directly from BigQuery using the Storage API.
  • Streaming data ingestion and analysis with BigQuery’s streaming insertion API, Datastream, Pub/Sub and Dataflow.
  • Querying of data across clouds (AWS and Azure) with BigQuery Omni (Preview).
  • Subsecond query response time with column-based storage, materialized views, cached query results, etc.
  • Optimized data storage costs with the possibility to configure the default table expirations for databases and tables, partition expiration for partitioned tables, long-term storage, etc.
  • Training and executing ML models on structured data with SQL using BigQuery ML, Vertex AI and TensorFlow.
  • Support for SQL, Java, Python, C#, Go, Node.js, PHP, and Ruby.
  • Simpler database administration (compared to Amazon Redshift and Azure Synapse Analytics) due to automated management of CPU and data compression, encryption enabled by default, etc.
  • Granular permissions on datasets, tables, and views.

ScienceSoft's tip: BigQuery doesn’t provide any indexes, and many analytical queries scan the entire dataset. As BigQuery charges per query based on the amount of data processed, consider setting up custom cost controls to limit the amount of data processed per day.

Pricing

  • Storage: $0.02/GB/month ($0.01/GB/month for long-term storage – a table/table partition not modified for 90 consecutive days).
  • Streaming inserts: $0.01/200 MB.
  • Queries: $5/TB, the 1st TB/month is free (pay-as-you-go).
  • From $10K/month for a dedicated reservation of 500 processing units (flat-rate pricing).
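The on-demand query pricing above is simple to model: pay per TB scanned, with the first TB each month free. The sketch below uses the article's rates; it ignores flat-rate reservations and any rounding rules the billing system applies.

```python
def on_demand_query_cost(tb_scanned_per_month,
                         price_per_tb=5.0, free_tb=1.0):
    """BigQuery-style on-demand billing: pay per TB scanned,
    with the first TB each month free (rates from this article)."""
    billable = max(tb_scanned_per_month - free_tb, 0.0)
    return billable * price_per_tb

# Scanning 0.8 TB stays within the free tier; 21 TB costs 20 * $5.
small = on_demand_query_cost(0.8)
large = on_demand_query_cost(21.0)
```

Because cost scales with data scanned rather than query count, partitioning and clustering tables (so queries scan less data) directly reduces the bill.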

Azure SQL Database

Summary

Best for: midsize data warehouses.

Azure SQL Database is a good fit for data warehousing scenarios with data volumes of up to 8 TB and a large number of active users (up to 6,400 concurrent requests and up to 30,000 concurrent sessions).

Description

  • Three deployment options:
    • Single database – a fully managed, isolated database with a guaranteed amount of compute, memory, and storage resources.
    • Elastic pool – a collection of single databases with a shared pool of resources.
    • Managed instance – a fully managed instance of SQL Server (good for lift-and-shift migration of an on-premises SQL Server data warehouse).
  • Accommodating various workloads with three service tiers: General-purpose (8 TB of database storage for budget-oriented workloads), Business-critical (4 TB of database storage for low-latency workloads with high resiliency to failures and fast failover), and Hyperscale (up to 100 TB of database storage).
  • Dynamic scalability and autoscale features.
  • Automatic index management and plan correction (problematic SQL plan performance identification and fixing).
  • Intelligent Insights for monitoring database performance and alerting on performance degradation issues and getting performance improvement recommendations.
  • Support for cross-database SQL queries.
  • Support for .NET, Java, Node.js, Ruby, PHP, and Python.
  • Always-on data encryption.

ScienceSoft's tip: Azure SQL Database provides built-in backups, which can be configured for longer-term retention for compliance and auditing purposes.

Pricing

  • General-purpose: provisioned compute – $0.51-$20.18/hour (Gen 5), locally redundant storage – $0.115/GB/month.
  • Business-critical: provisioned compute - $1.36-$54.36/hour (Gen 5), storage – $0.25/GB/month.
  • Hyperscale: provisioned compute – $0.57-$22.62/hour (Gen 5), storage – $0.10/GB/month.

Azure Cosmos DB + Azure Synapse Analytics

Summary

Best for: operational data warehouse (hybrid transaction/analytical processing).

Azure Cosmos DB and Azure Synapse Analytics enable enterprise teams to run fast, cost-effective no-ETL queries on large operational real-time data sets, without copying data and impacting the performance of the company’s transactional workloads.

Description

  • Storing petabytes of operational data in a column-oriented analytical store separately from an indexed row-based transactional store in Azure Cosmos DB for analytical querying.
  • Automatic sync of inserts, updates, and deletes from the Cosmos DB transactional store to the analytical store in near real time.
  • Replicating data across the regions configured within a Cosmos DB account and maintaining four replicas of data within each region for high availability and disaster recovery.
  • Control of the retention of operational data in the analytical store at the container level with the Analytical Time-to-Live capability.
  • Running near real-time large-scale no-ETL analytics on operational data in Azure Cosmos DB analytical store using Azure Synapse Link.
  • Flexible indexing options (primary and secondary indexes) for executing complex analytics queries on operational data.
  • Granular permissions on schemas, tables, views, individual columns, procedures, and other objects.
  • Azure ML integration to build Machine Learning (ML) models in Azure Synapse Analytics for generating insights over the operational data.
  • Data encryption with customer-managed keys.

ScienceSoft's tip: When planning your backup policy, keep in mind that the Cosmos DB analytical store doesn’t support automatic backup and restore. To restore data in the analytical store, you will need to properly configure copying data from an automatically restored Cosmos DB account to the primary Cosmos DB account.

Pricing

Azure Cosmos DB analytical store:

  • Storage – $0.02/GB/month.
  • Write operations (per 10,000 operations) – $0.05.
  • Read operations (per 10,000 operations) – $0.005.

Azure Synapse Analytics:

Compute:

  • On-demand pricing: $1.20/hour (DW100c) - $360/hour (DW30000c).
  • Reserved instance pricing can save up to 65% over the on-demand option (in a 3-year term).

Data storage: $122.88/TB/month.

Azure Synapse Link pricing includes the costs incurred by using the Azure Cosmos DB analytical store and the Synapse runtime.

Snowflake

Summary

Best for: cloud-agnostic data warehouse.

Provided as Software-as-a-Service, Snowflake enables companies to concurrently allocate compute resources of different cloud vendors (AWS, Azure, GCP) to the same database for loading and querying data with no impact on the data warehouse performance.

Description

  • Hosting a Snowflake account (storage, compute, and cloud services) on Amazon Web Services (AWS), Microsoft Azure (Azure), and Google Cloud Platform (GCP).
  • Separate scaling of storage and compute: a storage service persists the data, while virtual warehouses (MPP compute clusters of nodes allocated by Snowflake from a cloud provider) process queries (6 to 60 million rows in 2 to 10 seconds).
  • Support for bulk and continuous data loading from Snowflake file stages, Amazon S3, Microsoft Azure Blob Storage, and Google Cloud Storage, regardless of the cloud platform hosting the Snowflake account.
  • Support for multiple service connections:
    • A web-based user interface and command-line clients (e.g., SnowSQL) for managing and using Snowflake.
    • ODBC, JDBC, .NET, PHP, and Node.js drivers for connecting apps (e.g., Tableau).
    • Native connectors (e.g., Python, Spark) for app development.
    • Third-party connectors for ETL tools (e.g., Informatica) and BI tools (e.g., Power BI, ThoughtSpot).
  • Automated database maintenance features: built-in performance optimization, automatic clustering, materialized view maintenance, end-to-end automatic data encryption, etc.
  • Avoiding accidental resource spend with the auto-pause capability.

ScienceSoft's tip: Snowflake’s security and compliance options vary in accordance with specific editions. Standard and Enterprise editions provide such essential security capabilities as automatic data encryption, object-level access control, multi-factor authentication, etc. Business Critical and VPS (Virtual Private Snowflake) provide more advanced data security capabilities, including customer-managed encryption keys, support for PHI data, PCI DSS compliance, etc.

Pricing

On-demand and pre-purchase pricing, separate billing of storage and compute, compute billing on a per-second basis (minimum 60 seconds), etc.

  • Snowflake On Demand – usage-based, per-second pricing with no long-term commitment.
  • Snowflake Capacity – pre-purchased compute usage billed on a per-second basis (minimum 60 seconds), with auto-suspend and auto-resume capabilities.

Pricing is available by direct request to Snowflake.
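The per-second billing with a 60-second minimum described above is easy to model. The credit rate and per-credit price below are hypothetical placeholders; actual figures come from a Snowflake contract.

```python
def billed_seconds(run_seconds, minimum=60):
    """Per-second compute billing with a 60-second minimum each
    time a warehouse resumes (the model described in this article)."""
    return max(run_seconds, minimum)

def session_cost(run_seconds, credits_per_hour, price_per_credit):
    """credits_per_hour and price_per_credit are hypothetical;
    real values depend on the warehouse size and the contract."""
    return billed_seconds(run_seconds) / 3600 * credits_per_hour * price_per_credit

# A 10-second query on a small warehouse is billed as 60 seconds.
cost = session_cost(10, credits_per_hour=1, price_per_credit=3.0)
```

The 60-second minimum is why auto-suspend timeouts matter: a warehouse that resumes for many tiny queries pays the minimum each time it wakes.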

WHAT DATA WAREHOUSE PLATFORM IS BEST FOR YOU?

Now that you have seen the powerful capabilities of six cloud data warehouse platforms, you likely have one big question: "Which is best in my case?" ScienceSoft's consultants can help you find the answer, and it's free. We can start right now:

  1. You briefly describe your data warehouse needs, e.g., data warehouse purpose, the expected data volume, data sources, number of data warehouse users (if known).
  2. We reply with a questionnaire tailored to your industry, business size, and needs.
  3. You send us answers.
  4. We recommend a data warehouse platform, describe its configuration, and provide the cost and implementation time. It's free!

Implementation of a cloud data warehouse

Since 2005, ScienceSoft has been rendering data warehouse consulting and development services to help our clients build flexible centralized storage on a fitting cloud platform and enable analytics capabilities to optimize internal business processes and enhance decision-making.

Cloud data warehouse consulting

Our team:

  • Analyzes your business needs and elicits requirements for a future cloud data warehouse solution.
  • Designs cloud data warehouse architecture.
  • Outlines the optimal cloud data warehouse platform and its configurations.
  • Advises on data governance procedures.
  • Designs a cloud data warehouse implementation/migration strategy.
  • Conducts admin training.
  • Delivers PoC for complex projects.

Cloud data warehouse implementation

Our team:

  • Analyzes your business needs and defines the required cloud data warehouse configurations.
  • Delivers PoC for complex projects.
  • Does data modeling and sets up ETL/ELT pipelines.
  • Develops and integrates a cloud data warehouse into the existing data ecosystem.
  • Runs QA.
  • Provides user training and support, if required.

About ScienceSoft

ScienceSoft is a global IT consulting and IT service company headquartered in McKinney, TX, US. Since 2005, we have been assisting our clients in delivering data warehouse solutions through end-to-end data warehousing services that encourage agile, data-driven decision-making. Our long-standing partnerships with global technology vendors such as Microsoft, AWS, and Oracle allow us to bring tailored end-to-end cloud data warehousing solutions to business users. Being ISO 27001-certified, we ensure that cooperation with us poses no risks to our customers' data security.