How to Build a Data Warehouse from Scratch: Approaches, Plan, Software, and Costs

The Ultimate Guide on Building a Data Warehouse - ScienceSoft

Since 2005, ScienceSoft has been providing data warehousing services to help companies build cost-effective and high-performing DWH solutions.

Data warehouse: the essence

A data warehouse (DWH) helps integrate, process, structure, and store a company’s data from diverse source systems for further analytical querying and reporting.

Building a data warehouse ensures:

  • Automated data management procedures (data collection, transformation, cleansing, structuring, etc.) for increased data quality and reliability.
  • Unified approach to data security.
  • A platform ready for advanced analytics initiatives.
  • A data-driven culture across the company, etc.

Approaches to building a data warehouse

A typical architecture of a data warehouse solution includes the following layers:

Data source layer

– internal and external data sources.

Staging area

– a temporary area where data transformations take place. Absent if data transformations are performed in the data storage layer.

Data storage layer

– hosts a data warehouse database (a central database for storing a company’s data) and data marts (data warehousing subsets for storing data for a particular business line – finance, marketing, HR, etc.).

The data in the DWH database and data marts can then be queried via OLAP tools, data mining tools, reporting, and visualization tools.

Different approaches to building a data warehouse concentrate on the data storage layer:

  • Inmon’s approach – designing centralized storage first and then creating data marts from the summarized data warehouse data and metadata.

Inmon’s approach to building a DWH

  • Kimball’s approach – creating data marts first and then developing a data warehouse database incrementally from independent data marts.

Kimball’s approach to building a DWH

Inmon’s approach

Kimball’s approach

Benefits

Drawbacks

A step-by-step plan on how to build a data warehouse from scratch

Note: The timeframes below are approximate, as the duration of the DWH development process depends on a variety of factors, including the complexity and quality of data in the source systems, data security requirements, data analytics objectives, etc.

1

DWH goals elicitation

2

DWH conceptualization and platform selection

3

Business case and project roadmap

4

System analysis and DWH architecture design

5

DWH development and stabilization

6

DWH launch

7

After-launch support

Talents required for building a data warehouse

Project manager

  • Outlining DWH development project scope and objectives, determining deliverables for each project step, and ensuring their delivery.
  • Providing estimations for the DWH development efforts and managing resources.
  • DWH development project scheduling, delivering project updates to stakeholders.
  • Measuring project performance and identifying areas of improvement.

Business analyst

  • Defining functional and non-functional DWH requirements, DWH limitations.
  • Documenting the scope of the DWH solution, its constituents, integration into the existing analytics environment (if any).

DWH system analyst

  • Analyzing data sources and data analytics software (if any).
  • Defining a DWH system requirements specification for creating data models, designing ETL/ELT processes.
  • Auditing the quality of data loaded into the data warehouse.

DWH solution architect

  • Designing a DWH solution architecture.
  • Creating a data governance strategy, policies, and standards.
  • Suggesting the DWH tech stack.

Data engineer

  • Developing data models and their structures.
  • Developing and maintaining a data pipeline to route source data to the DWH.
  • Building the ETL/ELT process.

Quality assurance engineer

  • Reviewing DWH tech design documents.
  • Designing a test strategy.
  • Designing, developing, and maintaining tests to evaluate the developed DWH solution.

DevOps engineer

  • Setting up the DWH software development infrastructure.
  • Introducing continuous integration/continuous deployment (CI/CD) pipelines to automate and streamline DWH development processes.

Sourcing models

In-house end-to-end DWH development

The company has max control over the DWH development project.

Caution: Project delays are possible due to resource or expertise unavailability.

Technical resources are partially outsourced

Extending the in-house team with the vendor’s resources, while maintaining substantial control over the DWH development project.

Caution: High requirements to in-house competencies to avoid project delay/failure.

Technical resources are fully outsourced

Minimized risk of the resource overprovisioning after the project completion.

Caution: High requirements to in-house PM and BA competencies.

All resources are outsourced

A vendor has full responsibility for the DWH development project and all related risks.

Caution: High vendor dependency.

Build Your DWH with Expert Help

Our team will help you design and build a scalable and high-performing data warehouse fully aligned with your data storage and analysis needs within the optimized project time and budget.

Data warehouse software worth attention

If you are looking for the industry-best data warehousing platforms, explore our list of the best data management solutions for analytics we use in our projects. Each of the DWH services presented here is among leaders in the Gartner Magic Quadrant and Forrester Wave reports and offers you a platform for building a modern, high-performing DWH.

Amazon Redshift

Best for: big data warehouse

DESCRIPTION

  • Integrations with the AWS ecosystem (Amazon S3, Amazon Athena, Amazon SageMaker, Amazon EMR, etc.)
  • Integration with third-party ETL/ELT, BI, data modeling, etc. tools (Power BI, Tableau, Informatica, Qlik, Talend Cloud, and more).
  • Fast query processing with AQUA (Advanced Query Accelerator), data compression, materialized views, and result caching.
  • ML-optimized performance under varying workloads.
  • Automated infrastructure provisioning, table design, backups and cluster health monitoring.
  • SQL data querying (including big data).
  • Data encryption in transit and at rest.
  • Fine-grained access control.
  • HIPAA-eligible.
  • Separate scaling of compute and storage.

Pricing

  • On-demand pricing: $0.25/hour (dc2.large) - $13.04/hour (ra3.16xlarge).

  • Reserved instance pricing can save up to 75% over the on-demand option (in a 3-year term).

  • Data storage (RA3 node types): $0.024/GB/month.

Azure Synapse Analytics

Best for: advanced data management

DESCRIPTION

  • SQL querying of structured, semi-structured, unstructured data types.
  • Multilanguage support (T-SQL, Python, Scala, Spark SQL, .Net).
  • Native integrations with Apache Spark, Power BI, Azure ML, Azure Stream Analytics, Azure Cognitive Services, Azure Cosmos DB, etc.
  • Integration with third-party BI tools, including Tableau, SAS, Qlik, etc.
  • Result-set caching, workload isolation and clustered columnstore indexes.
  • Automatic restore points and backups.
  • End-to-end data encryption, dynamic data masking, granular access control.
  • HIPAA-eligible.
  • Separate scaling of compute and storage resources.

Pricing

  • Compute on-demand pricing: $1.20/hour (DW100c) - $360/hour (DW30000c).
  • Compute reserved instance pricing can save up to 65% over the on-demand option (in a 3-year term).
  • Data storage: $122.88/TB/month.

Oracle Autonomous Data Warehouse

Best for: hybrid DWH

Description

  • SQL querying of structured, semi-structured, unstructured data.
  • Native integration with Oracle Cloud Infrastructure Object Storage and Oracle Analytics Desktop.
  • Integration with Azure Blob Storage, Amazon S3.
  • Connection with custom applications and third-party products via SQL*Net, JDBC, ODBC.
  • Deployment flexibility (Oracle public cloud (shared/dedicated infrastructure) or a customer’s data center).
  • Automated scaling, patching and upgrades, backups and recovery.
  • High query processing with continuous query optimization, table indexing, data summaries, and auto-tuning.
  • Data encryption at rest and in transit.
  • Multifactor authentication.
  • HIPAA-eligible.
  • Separate scaling of storage and compute resources.

Pricing

  • Compute: $1.3441/CPU/hour.
  • Data storage: $118.40/TB/month (in the public cloud).

Choose Optimal DWH Software with Expert Help

After careful analysis of your data processing and storage needs, our team will outline the optimal DWH tech stack to help you build a cost-effective and easy-to-maintain data warehouse.

Data warehouse development cost estimation

A DWH development project, which involves building a 10GB data warehouse to store historical data consolidated from 10+ data source systems starts from $200,000 (software/hardware-related costs are excluded).

The major factors that influence the cost of the DWH development project are:

  • Number of data sources (ERP, CRM, SCM, etc.)
  • Data disparity across different sources (for example, difference in data structure, format, and use of values).
  • Data source complexity (for example, big data, streaming data).
  • Data volume to be processed and stored.
  • Data sensitivity and data security requirements.
  • Number of data flows and the number of entities (“clients”, “salary”, “transactions”, etc.) to be integrated into the DWH.
  • DWH performance requirements (velocity, scalability, etc.).

Consider Professional Services for DWH Development

Since 2005, ScienceSoft has been providing a full range of DWH consulting and development services to help companies build a cost-efficient and scalable DWH solution to address their data management and analytics needs.

DWH design

  • DWH requirements engineering.
  • Business case creation, recommendations on DWH price optimization.
  • DWH conceptualization and software selection.
  • DWH solution architecture design.
  • Data governance design.
  • DWH system analysis.
  • Data modeling and ETL/ELT design.

DWH development outsourcing

  • DWH requirements engineering.
  • DWH solution conceptualization and platform selection.
  • DWH solution architecture design.
  • DWH system analysis.
  • Data modeling and ETL/ELT design.
  • DWH solution development.
  • DWH quality assurance and launch.
  • DWH after-launch support.

About ScienceSoft

ScienceSoft is a global IT consulting and software development company headquartered in McKinney, TX, US. Since 2005, we’ve been offering DWH services to help businesses develop full-scale DWH solutions and advise on how to ensure the success of the DWH development process.