Data Lakes: Schema on Read vs. Schema on Write?

big-data-data-lakes-schema This is the second in my series of blog posts on data lakes. I warned you we were going to get technical, so put on your life jacket because we’re going to plunge right into the deep end of the (data) lake. Let’s talk about whether you should apply the structure of a schema to your data “on read” versus “on write.” This may seem esoteric, and it is, but the decision can have an impact on how quickly, imaginatively and cost-effectively you can do advanced analytics on big data sets.

As a quick recap, the concept of the data lake refers to the placing of raw data into large repositories – typically Hadoop Distributed File System (HDFS), the storage system used by Hadoop – where it can be analyzed without constraints of preconceived structure. In contrast to traditional analytics (e.g. the data warehouse), which relies on databases with established data schemas, the data lake is a free-for-all. But, it’s not all anarchy. In most cases, analytics ultimately requires a schema to do analysis.

The question to ask, though, is whether you want to apply that schema when you put the data into the lake, in the “data writing” process or wait until later, when you are “reading” the data and actually use it using for analysis? In the latter scenario, the parsing and schema is applied during the “read” operation as the data scientist reads data in its “raw” form from the lake.

There is no single right answer to “schema on read” vs. “schema on write.” The decision will depend on a variety of factors. A simplified example might help inform your thought process. Imagine you have a data set that contains a million PDF documents created from scans of a million paper business cards. You face a choice when you are tasked with writing this data into your data lake. You can create and apply a schema to the data before you write, organizing the information from the PDFs into a table with columns headers like address/phone number/email address/company name and so forth. This would be creating a schema on write because the information on each business card is OCRed, parsed and mapped. This information is then written into predefined schema columns in your data warehouse. Or, you can just dump the whole collection of pdfs into the data lake and figure out what schema (i.e. column mappings) you want when you’re ready to do analysis.

On the other hand, schema on read gives you the ability to do exploratory analytics with tools like Hive, Drill, JSON and Spark. You can read data in its native format. The user (e.g. the data scientist or ETL developer) controls the parser and schema. New data can be added at any time into the lake. In that case, you would have a dynamic schema, with the parser translating as data is read or streamed. In other words, schema on read means that you perform the Extract, Transform and Load (ETL) on the fly.

Which is the better approach? It depends on what you’re trying to do. If you know for sure that you will only ever need to access the contact set as a standard address list, then it might make sense to bind the data to the schema before you write it into the lake or data warehouse. That will make the data easily accessible and structured when you need it. Maintaining indexes on tables will also speed access to structured data in some situations.

Conversely, schema on write represents an investment of time and effort by a data scientist or an ETL designer/developer, resources that are getting increasingly hard to find. Schema on write (a.k.a. traditional ETL processing) will slow the process down for sure because end users will not be able to access the data until the ETL developer completes the schema design, master and meta-data elements are defined, DBA creates indexes/aggregate tables and ETL programming and processing is complete.

Furthermore, since compute power, storage and licensing costs are more expensive in structured data warehouses that use more traditional row or column store database technology, developers and database administrators (DBAs) try to save costs and improve performance by reducing the amount of data stored in size of the data warehouse. This is typically accomplished by having ETL designers and developers aggregate, summarize and derive new columns based upon business algorithms which may vary from department to department. It should also be noted that when using aggregates and derived columns the underlying detail data may have been discarded, therefore this data is no longer easily available in the data warehouse.

In the end, as far as data scientists are concerned, you may be doing the ETL work for nothing what if the data scientist wants to analyze the raw bits of all those business card PDFs in some different way, with a different perspective or meaning? In this case, the predefined columns in the schema inhibit creativity and flexibility.

In addition to flexibility and creative thought, if you took the schema on read approach, the data scientist will have access to the raw data faster because he or she would be able to skip the time, development, processing and real cost of binding the business card data to a schema up front. They would, also, have more freedom to do analysis based on subjective interpretation of various data requirements. For example, you could analyze the color of corporate logos and correlate them to data on Web traffic and build a hypothesis about which logo color or logo shape yields the most web clicks. Job titles, on those business cards, may also have different meanings based upon the context or different points of view. Additional insight may also be obtained by correlating information on the business card with social media information. Overall, it may be better not to encumber the data scientist by a pre-existing schema.

More importantly, if the ETL designer/developer decided to aggregate and summarize data and not store the underlying raw data (in an effort to save space in the data warehouse), then the data scientist may end up with incorrect models if aggregated or derived data is misunderstood. This may be further complicated if data elements use the same name, in different departments, and if calculations to develop derived or aggregated columns have different underlying assumptions.

Schema on write, also, has the potential to provide misleading results if dimensions in your data warehouse are not “conformed” in your organization. It is also possible that aggregated summary and derived data is not documented well or is performed differently in different departments based upon each department’s varied business needs. Therefore, the data scientist should always have the luxury to access the raw, unaltered, data and interpret the data in whatever way he or she feels will offer the most accurate analytical model.

On the negative side, the schema on read approach does have its drawbacks. You run the risk of misinterpreting the data and drawing invalid conclusions. You don’t have data validation or cleansing. It’s easier to make invalid assumptions regarding your data than it is with a more structured approach. You may also be setting yourself up for data governance challenges.

To make the discussion even more complicated, consider the fact that data scientists usually access structured data in the warehouse a smaller percentage of the time than the typical Business Analyst.  This is because they have the luxury to add color to their analytical models by using unstructured and semi-structured data in its raw form without having to wait for ETL developers to structure the data or alter its meaning by performing aggregations or deriving new columns. These are some of the more important issues to think through as you decide how to do your schemas.

Schema on read or write is not an all or nothing choice. Today, most organizations prefer to leave the data in the lake in its raw form. Then the data lake acts as a “data reservoir” to feed more traditional data warehouses or data marts via ETL processing. The ETL jobs may run in batch mode using Hadoop or stream data using technologies such as Spark. Alternatively, they may simply pass the data onto legacy ETL platforms using batch style file transfer mechanisms. Another approach is to have an HDFS based repository of raw data using a schema on read philosophy while a smaller portion of the data lake might be run with strict schema on write rules into HBase. Metadata may be managed using such tools as Cloudera Navigator, there are many such variations.

The best practice is to think about your data analytics use cases at the outset and think outside the box because you will never know what golden nuggets may be found in the lake depending upon your perspective. Your business requirements may favor schema on read or they might not. But, I wanted to share the concept of schema on read vs. schema on write with you to make you familiar with your expanding set of big data options.

Interested in learning more about Data Lakes: Schema on Read vs. Schema on Write, contact us for a 1:1 consultation with a big data expert.

Comment (1)

Leave a Reply

  1. Peter Tormay

    Great post. Not an expert in the field but got me thinking.
    Just wondering about a few things.
    a) In order to do schema on read you probably need to know a lot about the data that is in the data lake. So I assume there is a big overhead of metadata that needs to be stored together with your data.

    b)coming from a background where most data is generated through experimentation I was wondering how this fits in. Experiments are normally done with a particular question in mind (and I suspect this is true for most data gathering activities). This means only the data relevant for that question is collected in the first place. Coming back to your PDF example – what if the pdfs were scanned in grey scale because already at this point the potential questions people had in mind were determining what and how the information is captured.