How to Build a Data Warehouse from Scratch: Approaches, Plan, Software, and Costs
Since 2005, ScienceSoft has been providing data warehousing services to help companies build cost-effective and high-performing data warehousing solutions.
Building a Data Warehouse: the Summary
Project time: From 3 to 12 months.
Cost: Starts from $70,000.
Steps to build a data warehouse: Goals elicitation, conceptualization and platform selection, business case and project roadmap, system analysis and data warehouse architecture design, development and launch.
Team: A project manager, a business analyst, a data warehouse system analyst, a data warehouse solution architect, a data engineer, a quality assurance engineer, a DevOps engineer.
Building a data warehouse ensures:
- Automated data management procedures (data collection, transformation, cleansing, structuring, etc.) for increased data quality and reliability.
- Unified approach to data security.
- A platform ready for advanced analytics initiatives.
- A data-driven culture across the company, etc.
Data source layer
– internal and external data sources.
– a temporary area where data transformations take place. Absent if data transformations are performed in the data storage layer.
Data storage layer
– hosts a data warehouse database (a central database for storing a company’s data) and data marts (data warehousing subsets for storing data for a particular business line – finance, marketing, HR, etc.).
The data in the data warehouse database and data marts can then be queried via OLAP tools, data mining tools, reporting, and visualization tools.
Different approaches to building a data warehouse concentrate on the data storage layer:
Inmon’s approach – designing centralized storage first and then creating data marts from the summarized data warehouse data and metadata.
- Kimball’s approach – creating data marts first and then developing a data warehouse database incrementally from independent data marts.
Data Warehouse Use Cases
- Strategic reports and dashboards for top management.
- Financial performance data monitoring and benchmarking.
- Financial forecasting and strategic investment planning.
- Profitability analysis of customers and products.
- Strategic sourcing.
- Employee and department performance assessment and planning.
- Business management systems (ERP, CRM, CMS, EAM, PIM, etc.).
- External data (benchmarking data, surveys, public data, etc.).
Budgeting and financial planning
- Multi-user role-based reports and dashboards.
- Consistency of overall corporate planning and planning for specific business areas.
- Budget allocation.
- Financial simulation and scenario considerations, contingency plans for a range of possible events.
- Operational data from functional areas (supply chain, production, marketing and sales, etc.).
- Financial data stores (ERP, financial management system, accounting system, etc.).
- Financial and operational performance reports, scorecards and dashboards for managers.
- Organization, department, employee or process performance tracking.
- Identification of business productivity, employee attrition, etc. drivers.
- Performance gaps anticipation, root cause analysis.
- Strategies for performance optimization for sales funnel, marketing campaigns, supply chain, etc.
Data sources: Business management systems (ERP, accounting management, CRM, supply chain management, etc.).
- Tactical dashboards for managers and directors with continuously updated business data.
- Time-sensitive analytical querying to support production planning, inventory planning, logistic management, etc.
Data sources: Management information systems (inventory control, sales and marketing, accounting and finance, production, logistics, fleet, etc.).
Operational data warehousing (hybrid transaction/analytical processing)
- Operational dashboards for fast querying of large and granular transaction data in real time.
- Data-driven decision-making in the operational environment (order entry, banking operations, travel reservations, etc.).
- Alerting to the situations requiring immediate attention (risk management, fraud detection, etc.).
- Constantly updated operational forecasts and business outcome simulations in real time.
Data sources: historical and real-time data from transactional data stores.
IoT, telematics, digital twins
- Reacting (e.g., triggering an alert) to particular events or a sequence of events in real time or near real time.
- Detecting event patterns and predicting reactions based on historical IoT data analysis.
- Predictive maintenance.
- Vehicle telematics.
- Smart building.
- Smart devices and wearables.
- Smart metering.
Data sources: IoT devices.
SaaS, XaaS, online services
- Support for data load scalability.
- Instant analytical querying of huge app data volumes.
- Support for machine learning capabilities (personalization, chatbots, etc.).
Data sources: applications, data and backup storage systems.
The suggested plan bases on ScienceSoft’s 16-year experience in data warehousing services and features the usual procedure we follow when implementing a DWH. Notice that the project timeframes are approximate, as the duration of the data warehouse development process depends on a variety of factors, including the complexity and quality of data in source systems, data security requirements, data analytics objectives, etc.
Step 1. Goals elicitation
Duration: 3 – 20 days
- Discovery of your business objectives (tactical and strategic) to be pursued with the data warehouse development project.
- Identification and prioritization of the company’s, departments’, business users’ expectations and needs from the project.
- Review of the company’s current technological architecture, applications in use, etc.
- Conducting a preliminary data source analysis (data type and structure, volume, sensitivity, etc.).
- Outlining the data warehouse scope and high-level system requirements.
Step 2. Conceptualization and platform selection
Duration: 2 – 15 days
- Defining the desired data warehouse solution feature set.
- Choosing the optimal deployment option (on-premises/in-cloud/hybrid).
- Choosing the optimal architectural design approach to building a data warehouse.
- Selecting the data warehouse technologies (DWH database, ETL/ELT tools, data modeling tools, etc.), taking into account:
- Number of data sources and data volume to be loaded into the data warehouse.
- Data flows to be implemented.
- Data security requirements.
Close cooperation of business users with a BA and a solution architect while defining the core and advanced functionality of the future solution helps avoid overcomplicating the data warehouse architecture and select the most cost-effective tech stack.
Step 3. Business case and project roadmap
Duration: 2 – 15 days
Major activities include:
- Defining data warehouse development project scope, budget planning, timeline, etc.
- Scheduling DHW design, development and testing activities.
- Drawing up a data warehouse project scope document, data warehouse solution architecture vision document, data warehouse deployment strategy, testing strategy, project implementation roadmap.
- Developing a risk management plan.
- Estimating efforts for the data warehouse development project, TCO and ROI.
Rigorous planning of a data warehouse development project helps reduce up to 30% of project time and budget, so ScienceSoft’s team carefully elaborates on the findings of the preceding stages.
Step 4. System analysis and data warehouse architecture design
Duration: from 15 days
- Detailed analysis of each data source:
- Data type and structure (data models, if any).
- Data volume generated daily.
- Degree of data sensitivity and an applied data access approach.
- Data quality, missing/poor data, possibility to perform data cleansing in the data source system.
- Identification if any data is absent/of enough quality to support the business requirements.
- Frequency of data updates.
- Relation to other data sources.
- Designing data cleansing policies.
- Creating data security policies (data access policies based on legal restrictions and data security rules, data encryption policies, policies for data access monitoring and data compliance, data backup strategy, etc.)
- Designing data models for the data warehouse and data marts.
- Identifying data objects as entities or attributes; identifying relationships between entities.
- Mapping data objects into the data warehouse.
- Designing ETL/ELT processes for data integration and data flow control.
To create a blueprint for the data ecosystem fully tailored to the customer’s particular business needs, ScienceSoft engages senior-level system analysts with considerable experience in the corresponding industry.
Step 5. Development and stabilization
Duration: from 2 months
- Data warehouse platform customization.
- Configuring data security software and implementing data security policies (applying data security policies to data at the row, column, etc. level, developing custom security procedures, and more).
- Developing ETL/ELT pipelines and ETL/ELT testing.
- Data warehouse performance testing.
To ensure the speed and frequency of releases without sacrificing the solution’s quality, ScienceSoft follows the DevOps-driven iterative development approach.
Step 6. Launch
Duration: from 2 days
- Data migration, data quality assessment.
- Introducing the data warehouse to business users.
- Having user acceptance tests.
- Conducting user training sessions and workshops.
Step 7. After-launch support
Duration: as requested
- ETL/ELT performance tuning.
- Adjusting data warehouse performance and availability, etc.
- Supporting end users.
Consider Professional Services for Data Warehouse Development
Since 2005, ScienceSoft has been providing a full range of data warehouse consulting and development services to help companies build a cost-efficient and scalable data warehouse solution to address their data management and analytics needs.
Data warehouse consulting
- Data warehouse requirements engineering.
- Business case creation, recommendations on data warehouse price optimization.
- Data warehouse conceptualization and software selection.
- Data warehouse solution architecture design.
- Data governance design.
- Data warehouse system analysis.
- Data modeling and ETL/ELT design.
Data warehouse development
- Data warehouse requirements engineering.
- Data warehouse solution conceptualization and platform selection.
- Data warehouse solution architecture design.
- Data warehouse system analysis.
- Data modeling and ETL/ELT design.
- Data warehouse solution development.
- Data warehouse quality assurance and launch.
- Data warehouse after-launch support.
ScienceSoft as a Trusted DWH Partner
When we first contacted ScienceSoft, we needed expert advice on the creation of the centralized analytical solution to achieve company-wide transparent analytics and reporting. After a series of interviews, ScienceSoft’s consultants analyzed our workloads, documentation, and the existing infrastructure and provided us with a clear project roadmap.
They stayed in daily contact with us, which allowed us to adjust the scope of works promptly and implement new requirements on the fly. Additionally, the team delivered demos every other week so that we could be sure that the system aligned with our business needs.
Heather Owen Nigl, Chief Financial Officer, Alta Resources
- Outlining data warehouse development project scope and objectives, determining deliverables for each project step, and ensuring their delivery.
- Providing estimations for the data warehouse development efforts and managing resources.
- Data warehouse development project scheduling, delivering project updates to stakeholders.
- Measuring project performance and identifying areas of improvement.
- Defining functional and non-functional data warehouse requirements, data warehouse limitations.
- Documenting the scope of the data warehouse solution, its constituents, integration into the existing analytics environment (if any).
Data warehouse system analyst
- Analyzing data sources and data analytics software (if any).
- Defining a data warehouse system requirements specification for creating data models, designing ETL/ELT processes.
- Auditing the quality of data loaded into the data warehouse.
Data warehouse solution architect
- Designing a data warehouse solution architecture.
- Creating a data governance strategy, policies, and standards.
- Suggesting the data warehouse tech stack.
- Developing data models and their structures.
- Developing and maintaining a data pipeline to route source data to the data warehouse.
- Building the ETL/ELT process.
Quality assurance engineer
- Reviewing data warehouse tech design documents.
- Designing a test strategy.
- Designing, developing, and maintaining tests to evaluate the developed data warehouse solution.
- Setting up the data warehouse software development infrastructure.
- Introducing continuous integration/continuous deployment (CI/CD) pipelines to automate and streamline data warehouse development processes.
In-house end-to-end data warehouse development
The company has max control over the data warehouse development project.
Caution: Project delays are possible due to resource or expertise unavailability.
Technical resources are partially outsourced
Extending the in-house team with the vendor’s resources, while maintaining substantial control over the data warehouse development project.
Caution: High requirements to in-house competencies to avoid project delay/failure.
Technical resources are fully outsourced
Minimized risk of the resource overprovisioning after the project completion.
Caution: High requirements to in-house PM and BA competencies.
All resources are outsourced
A vendor has full responsibility for the data warehouse development project and all related risks.
Caution: High vendor dependency.
If you are looking for the industry-best data warehousing platforms, explore our list of the best data management solutions for analytics we use in our projects. Each of the data warehousing services presented here is among leaders in the The Forrester Wave and Gartner Magic Quadrant reports and offers you a platform for building a modern, high-performing data warehouse.
Best for: big data warehouse
- Integrations with the AWS ecosystem (Amazon S3, Amazon Athena, Amazon SageMaker, Amazon EMR, etc.)
- Integration with third-party ETL/ELT, BI, data modeling, etc. tools (Power BI, Tableau, Informatica, Qlik, Talend Cloud, and more).
- Fast query processing with AQUA (Advanced Query Accelerator), data compression, materialized views, and result caching.
- ML-optimized performance under varying workloads.
- Automated infrastructure provisioning, table design, backups and cluster health monitoring.
- SQL data querying (including big data).
- Data encryption in transit and at rest.
- Fine-grained access control.
- Separate scaling of compute and storage.
On-demand pricing: $0.25/hour (dc2.large) - $13.04/hour (ra3.16xlarge).
Reserved instance pricing can save up to 75% over the on-demand option (in a 3-year term).
Data storage (RA3 node types): $0.024/GB/month.
Azure Synapse Analytics
Best for: advanced data management
- SQL querying of structured, semi-structured, unstructured data types.
- Multilanguage support (T-SQL, Python, Scala, Spark SQL, .Net).
- Native integrations with Apache Spark, Power BI, Azure ML, Azure Stream Analytics, Azure Cognitive Services, Azure Cosmos DB, etc.
- Integration with third-party BI tools, including Tableau, SAS, Qlik, etc.
- Result-set caching, workload isolation and clustered columnstore indexes.
- Automatic restore points and backups.
- End-to-end data encryption, dynamic data masking, granular access control.
- Separate scaling of compute and storage resources.
- Compute on-demand pricing: $1.20/hour (DW100c) - $360/hour (DW30000c).
- Compute reserved instance pricing can save up to 65% over the on-demand option (in a 3-year term).
- Data storage: $122.88/TB/month.
Oracle Autonomous Data Warehouse
Best for: hybrid DWH
- SQL querying of structured, semi-structured, unstructured data.
- Native integration with Oracle Cloud Infrastructure Object Storage and Oracle Analytics Desktop.
- Integration with Azure Blob Storage, Amazon S3.
- Connection with custom applications and third-party products via SQL*Net, JDBC, ODBC.
- Deployment flexibility (Oracle public cloud (shared/dedicated infrastructure) or a customer’s data center).
- Automated scaling, patching and upgrades, backups and recovery.
- High query processing with continuous query optimization, table indexing, data summaries, and auto-tuning.
- Data encryption at rest and in transit.
- Multifactor authentication.
- Separate scaling of storage and compute resources.
- Compute: $1.3441/CPU/hour.
- Data storage: $118.40/TB/month (in the public cloud).
*Monthly software license fee and other regular fees are NOT included.
The major factors that influence the cost of the data warehouse development project are:
- Number of data sources (ERP, CRM, SCM, etc.)
- Data disparity across different sources (for example, difference in data structure, format, and use of values).
- Data source complexity (for example, big data, streaming data).
- Data volume to be processed and stored.
- Data sensitivity and data security requirements.
- Number of data flows and the number of entities (“clients”, “salary”, “transactions”, etc.) to be integrated into the data warehouse.
- Data warehouse performance requirements (velocity, scalability, etc.).
ScienceSoft is a global IT consulting and software development company headquartered in McKinney, TX, US. Since 2005, we’ve been offering data warehousing services to help businesses develop full-scale data warehouse solutions and advise on how to ensure the success of the data warehouse development process. Being ISO 9001 and ISO 27001-certified, we rely on a mature quality management system and guarantee cooperation with us does not pose any risks to our customers’ data security.