Big Data Activation: Data Warehouses and Data Lakes

Ben Hinson
Published in Hickam's Dictum
5 min read · Aug 10, 2018

For this post on Hickam’s Dictum we’re talking database architecture, with an overview of two types of Big Data repositories in the market today: Data Warehouses and Data Lakes. These architectures are important to understand because one of the main struggles for forward-thinking companies is building a centralized processing system for all internal and external data about their customers, so they can make smart decisions and communicate effectively with their client base in real time. To properly grasp the concepts in this article, it helps to have some basic knowledge of database architecture (e.g. records, tables, Entity Relationship Diagrams (ERDs), database schemas, ETL vs. ELT processes, what an RDBMS is, what an API is, etc.). We’re talking Big Data! Let’s get to it.

Data Warehouses

At the first tier of Big Data repositories lie Data Warehouses. Data Warehouses are fed information (via APIs or raw files) from different sources and business units (sales, customer service, etc.). This information is cleansed and processed in a staging area via an ETL (Extract, Transform, Load) process before being loaded into the Data Warehouse itself (some Data Warehouses are moving toward an ELT process, where the transformation happens after loading). From this point on, the data can be filtered and made available as different subsets (Data Marts), usually pertaining to different lines of business. The data within each Data Mart is housed in relational databases. A relational database is a collection of tables containing related data that can be linked via unique keys, and the records in each table can have one-to-one or one-to-many relationships with records in other tables. Relational databases typically use SQL to extract and manipulate the data, and have business rules and quality assurance checks in place to maintain the integrity of the data. Different “engines” power relational databases, popular ones being Oracle, SQL Server and MySQL.
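To make that flow concrete, here is a minimal sketch in Python using the built-in sqlite3 module as a stand-in for a warehouse engine. The table names, sample rows and the `east_sales_mart` view are all hypothetical, made up for illustration; a real warehouse would use a dedicated ETL tool and a production database.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from two business units.
RAW_SALES = [
    {"customer_id": "1", "amount": " 120.50 ", "region": "east"},
    {"customer_id": "2", "amount": "n/a", "region": "West"},  # bad amount
    {"customer_id": "1", "amount": "30", "region": "EAST"},
]
RAW_CUSTOMERS = [
    {"customer_id": "1", "name": " Ada "},
    {"customer_id": "2", "name": "Grace"},
]


def transform_sales(rows):
    """Staging step: cleanse rows and drop those that fail validation."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # reject rows whose amount is not numeric
        region = row["region"].strip().lower()
        clean.append((int(row["customer_id"]), amount, region))
    return clean


def build_warehouse():
    """Load step: write the cleansed rows into relational tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        " customer_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE sales ("
        " sale_id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customers(customer_id),"
        " amount REAL NOT NULL,"
        " region TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(int(c["customer_id"]), c["name"].strip()) for c in RAW_CUSTOMERS],
    )
    conn.executemany(
        "INSERT INTO sales (customer_id, amount, region) VALUES (?, ?, ?)",
        transform_sales(RAW_SALES),
    )
    # Data mart: a filtered subset for one line of business, joined on the key.
    conn.execute(
        "CREATE VIEW east_sales_mart AS "
        "SELECT c.name, s.amount FROM sales s "
        "JOIN customers c ON c.customer_id = s.customer_id "
        "WHERE s.region = 'east'"
    )
    return conn


conn = build_warehouse()
mart_rows = conn.execute(
    "SELECT name, amount FROM east_sales_mart ORDER BY amount"
).fetchall()
print(mart_rows)
```

The row with the non-numeric amount never reaches the warehouse, which is the point of transforming before loading: only data that passes the staging checks is stored.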

Data Warehouses generally lean towards structured, relational internal data, although they can ingest and process external data sources as well. Also, the schema (blueprint) for a Data Warehouse is defined before the warehouse is built and any data is loaded, and it is updated as new tables or sources are added.
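As a rough illustration of a schema being fixed before any data arrives, the sketch below (again Python with sqlite3, and with a hypothetical `service_tickets` table) declares the structure first, shows the database rejecting rows that break it, and then extends the schema when a new field is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared up front, before any row is loaded.
conn.execute(
    "CREATE TABLE service_tickets ("
    " ticket_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL,"
    " status TEXT NOT NULL CHECK (status IN ('open', 'closed')))"
)


def try_insert(conn, customer_id, status):
    """Return True if the row satisfies the schema, False if it is rejected."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO service_tickets (customer_id, status) VALUES (?, ?)",
                (customer_id, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert(conn, 1, "open")
missing_customer = try_insert(conn, None, "open")  # violates NOT NULL
unknown_status = try_insert(conn, 2, "pending")    # violates CHECK

# Updating the schema as a new source adds a field.
conn.execute("ALTER TABLE service_tickets ADD COLUMN channel TEXT")
columns = [row[1] for row in conn.execute("PRAGMA table_info(service_tickets)")]
print(accepted, missing_customer, unknown_status, columns)
```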

Figure 1. Basic Data Warehouse Architecture.

Data Lakes

We’ve established that Data Warehouses typically store and process related data sets. But sticking to related data sets alone gives a limited view of the data streams tied to an enterprise’s business activities. This is where Data Lakes come in. Data Lakes are similar to Data Warehouses, the main difference being that Data Lakes ingest (via APIs) and process data from both internal and external sources in any format (structured and unstructured). So for example, Data Lakes allow a large corporation to have one central storage and processing location for internal items like sales and service data, and external data like cross-channel digital marketing performance metrics from external agencies, digital audience profiles (from DMPs), etc.

Traditionally, Data Lakes were (and still are) powered by Hadoop frameworks. Hadoop grew out of efforts to crawl and index the web at scale, which makes it a good fit for Data Lakes because it can handle many different file formats. Today, cloud-based solutions like Amazon Web Services (S3) and Microsoft Azure are dominating the Data Lake vendor space. Data Lakes typically ingest data via an ELT (Extract, Load, Transform) process, meaning the data is extracted from each source and loaded into the Hadoop framework as-is, and is then processed directly within the Data Lake rather than in a staging area. In addition to their ability to handle Big Data, Hadoop clusters can often transform data faster than traditional ETL tools, which is why for Big Data the processing can happen within the Data Lake itself, as needed.
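The ELT idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a temporary directory stands in for the lake's storage (HDFS or S3 in practice), and the file names, fields and helper functions are hypothetical. The key point is that files are stored untouched, and structure is applied only when someone reads them.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def load_raw(lake_dir, name, payload):
    """Load step: store the payload exactly as received, with no cleansing."""
    path = Path(lake_dir) / name
    path.write_text(payload, encoding="utf-8")
    return path


def read_spend(lake_dir):
    """Transform step, run on demand: apply a schema while reading."""
    records = []
    for path in sorted(Path(lake_dir).iterdir()):
        text = path.read_text(encoding="utf-8")
        if path.suffix == ".csv":
            for row in csv.DictReader(io.StringIO(text)):
                records.append(
                    {"channel": row["channel"], "spend": float(row["spend"])}
                )
        elif path.suffix == ".jsonl":
            for line in text.splitlines():
                if line.strip():
                    item = json.loads(line)
                    records.append(
                        {"channel": item["channel"], "spend": float(item["spend"])}
                    )
        # Other formats (e.g. free text) stay in the lake until they are needed.
    return records


with tempfile.TemporaryDirectory() as lake:
    # Extract + Load: three sources, three formats, no staging area.
    load_raw(lake, "agency_a.csv", "channel,spend\nsearch,100.5\nsocial,40\n")
    load_raw(lake, "agency_b.jsonl", '{"channel": "display", "spend": "25"}\n')
    load_raw(lake, "call_notes.txt", "Customer asked about renewal pricing.\n")

    stored_files = sorted(p.name for p in Path(lake).iterdir())
    spend = read_spend(lake)
    total_spend = sum(r["spend"] for r in spend)

print(stored_files, total_spend)
```

Note that the free-text file is kept even though nothing reads it yet. That flexibility is the appeal of a lake, and it is also how unused data can pile up.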

Figure 2. Basic Data Lake Architecture.

Summary

Data Warehouses have long been the norm for storing and processing related data sets within many organizations. But that model did not eliminate the data silos that existed between internal and external data sources. This reality eventually gave birth to Data Lakes: centralized repositories able to store related and unrelated data from any source and in any format.

While the ability to properly archive and process data is a necessity for any forward-thinking business, both approaches to Big Data management come with challenges. For example:

  • Data Warehouses can potentially have incorrect or missing data, which can affect the accuracy of reports and insights.
  • Data Lakes can become data graveyards, where the vastness and complexity of the ingested data can make it hard to activate (Data Lakes can host terabytes of data). This realization, coupled with the fact that Data Lakes are quite costly, is enough to make any capable leadership team exercise caution until they understand with confidence what actionable intelligence a Data Lake can provide.
  • Data Lakes require analysts with experience across every function (every data point) to process and activate the data.
  • Governance. There is currently no universally accepted set of rules, standards or guidelines on how to manage data in a Data Lake.
  • But perhaps the biggest challenge in creating a Data Lake is organizational politics. With so many business units, divisions, teams, processes, privacy restrictions, personal agendas and egos in an average company, it’s natural to expect conflict when it comes to data ownership and what can be shared in a centralized repository.

*Is your company/organization in the process of setting up a Data Lake? If yes, have you considered setting up goals and objectives for the Data Lake across departments/business units, to help prioritize and activate the data?

I hope you enjoyed this article! Please click the applause button 👏 below or on the side so others can learn about this article as well! And please bookmark and/or follow Hickam’s Dictum for more actionable strategic insights!

I enjoy creating content, solving problems, sharing knowledge, learning about our world and celebrating others. Learn more at www.benhinson.com