Most companies generate a deluge of data from their business operations, their products and the machines they deploy.
Traditionally, the focus has been on structuring and storing data and making it quickly available to people entitled to use it as part of their roles. This approach sacrifices the flexibility and speed needed to combine and analyze data from multiple sources on the go.
Data lakes are typically sold as the answer to this limitation, often pitched as the only data repository one needs access to across the enterprise for advanced analytics. The practical implementation of data lakes across companies, however, leaves a lot to be desired. The big issue arises when databases and tables from enterprise systems (ERPs, CRMs) are swept into a common data holding area for employees to consume from.
There is a large gap between how people in business roles perceive data and how it is stored in databases, because much of the transformation logic is usually codified in the application layer facing the business. Business users (and even data stewards for that matter!) often struggle to connect the data they see with how it is stored, which leaves them frustrated that data is available but not fit for consumption. This creates a feeling of being in a ‘data swamp’ and not in a pristine lake with clean water to drink from! To make things worse, the strategy of sweeping in raw table dumps makes the data swamp look like a collection of data puddles.
GDPR, ITAR and similar data privacy and export control regulations emerging around the world are also undercutting the sales narrative of expansive and open data lakes.
So, does this mean that data lakes will dry up? Or will we see more ‘Enterprise Data Hubs’, which are described as data lakes with more tightly governed access controls? With technical improvements such as container-level security for blob storage and data governance frameworks for data lakes on the way, the jury is still out.
Here are two key features I believe will determine the success of data lake adoption by people in business and non-IT roles:
- Packages that implement and automate data privacy regulations such as GDPR by creating default views for different users based on their demographics and affiliation.
- A strong semantic layer (data dictionaries with lineage) that enables business users to pinpoint and quickly access the data elements of interest to them.
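To make these two features a little more concrete, here is a minimal Python sketch of how they could fit together. It is an illustration under stated assumptions, not a reference to any real product: the catalog entries, the table and column names, the `customer_service` affiliation and the functions `find_entries` and `default_view` are all hypothetical. A real deployment would typically enforce such rules in the lake's own governance and access control tooling and not in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    """One row of a (hypothetical) data dictionary with lineage."""
    business_name: str    # term a business user would search for
    physical_column: str  # where the value actually lives in the lake
    lineage: tuple        # upstream sources, from origin to landing table
    personal_data: bool   # True if privacy rules restrict who may see it


# Hypothetical catalog; the names are invented for illustration only.
CATALOG = [
    DictionaryEntry("Customer Name", "crm_dump.cust_t.nm1",
                    ("CRM", "crm_dump.cust_t"), True),
    DictionaryEntry("Customer Region", "crm_dump.cust_t.rgn_cd",
                    ("CRM", "crm_dump.cust_t"), False),
    DictionaryEntry("Order Value", "erp_dump.ord_hdr.net_amt",
                    ("ERP", "erp_dump.ord_hdr"), False),
]

# Assumed policy: only these affiliations may see personal data by default.
PERSONAL_DATA_AFFILIATIONS = {"customer_service"}


def find_entries(term, catalog=CATALOG):
    """Semantic layer: case-insensitive search on business names."""
    needle = term.lower()
    return [e for e in catalog if needle in e.business_name.lower()]


def default_view(affiliation, catalog=CATALOG):
    """Privacy default: list the columns a user may see, given affiliation."""
    may_see_personal = affiliation in PERSONAL_DATA_AFFILIATIONS
    return [e.physical_column for e in catalog
            if may_see_personal or not e.personal_data]


if __name__ == "__main__":
    for entry in find_entries("customer"):
        print(entry.business_name, "->", entry.physical_column,
              "via", " > ".join(entry.lineage))
    print(default_view("marketing"))
```

In this sketch a business user searches for "customer" instead of guessing at a column like `nm1`, and a user with the assumed `marketing` affiliation gets a default view that leaves out the column flagged as personal data.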
The opinions expressed in this blog are my own and do not necessarily represent viewpoints of my current and past employers or associates. Your comments are welcome.
