Do you have a Data Lake or are you considering creating one?
I am sure you do not want it to turn into a swamp of toxic data that no one in your organisation trusts or uses, with individual employees keeping their own versions of the data they rely on – putting out their own umbrellas and sunbeds, and taking us back to siloed pools of data!
We do not want to return to the days of siloed data. We need accessible data that is centralised, governed, managed and of a specified level of quality – never aim for 100%, as the last few percentage points will cost you more in time and money than they return in benefit to the organisation.
Data Lakes: the Promise
The idea is to have all your data in one place. No Silos. No data integration problems.
Users can find whatever they need as all the data is in the lake.
Implement a data lake and your data problems will be solved!
Is this true?
If it were, and the data lake promise and the data management required around it were simple and easy to deliver, would we not all have a data lake in our organisations by now?
Key points that challenge the promise:
- Privacy / Security / Governance – you need discipline to gain value from your data lake. A great analogy is the family junk drawer: will your data lake become one?
- Sustainability of a Data Lake – after the initial ingestion of data, ongoing sustainability is questionable
- Over-confidence – a lot of money is spent and smart people develop cool things, but the result does not deliver the magic that was expected
- Lack of functionality – data lake capabilities are still immature
What is the value of having all your reference and master data in the data lake? It already exists and is managed where it currently lives. If you move it into the data lake and manage it there, you would need to rebuild the existing processes and procedures and run a complex synchronisation process to keep the data consistent between the data lake and the systems where it originates.
The reason reference and master data cannot yet be managed effectively from the data lake and fed back to the other systems is that the tools and capabilities for managing this data, and its metadata, are in their early stages – certainly not mature enough to manage data in the ways we have come to depend on today. Hadoop has tools and capabilities, but they are immature. Apache Atlas, for example, is an incubating project that is many years away from being proven.
Many companies are storing data from different systems in different formats, creating Big Data silos that result in large datasets that need to be integrated manually. While we have been quick to invest in new technologies like Hadoop clusters and NoSQL databases to collect the wide variety of disparate data now available from a myriad of new sources, we are not concurrently addressing the downstream integration and governance challenges that must be overcome to drive corporate-wide big data adoption, expansion to multiple use cases, and maturity.
This lack of comprehensive strategic planning is proving to be the greatest impediment to enterprise-wide adoption of big data technologies.
Does a data lake still sound like a good idea?
A quote from Gartner
Forrester saw these challenges for data lakes coming and proposed a solution in March 2016 under the term “Big Data Fabric”, evolving from the “Information Fabric” concept it developed in 2013.
In theory data lakes are a great idea, but I think we have some way to go before the capabilities and tools exist to give us easily located, effective, quality data in our data lakes.
All is not lost: Data Fabric is a solution to consider as an alternative to a Data Lake.
Data Fabric Drives Enterprise Adoption
Data fabric offers a comprehensive approach to overcoming the challenges hampering adoption of big data technologies.
What is Data Fabric?
A data fabric enables global access to all data assets of an enterprise, leveraging storage and processing power from multiple heterogeneous nodes.
It is like putting a blanket around your data systems, pulling them all together but maintaining the best of each system or repository.
Data Fabric Example
In a Data Fabric, multiple repositories are brought together so that, to users searching for data, they look like a single unified data lake.
The user does not need to worry about where the data comes from, as the data fabric knits all the sources together to present a unified, accessible data environment.
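To make this concrete, here is a minimal, purely illustrative Python sketch (not any vendor's implementation) of the unified view a data fabric aims to present. It invents two small silos – customer master data in a relational table and order records in a CSV export – and lets the consumer work with one combined dataset without knowing where each part lives. All table names, columns and values are made up for the example.

```python
# Illustrative only: two hypothetical "silos" presented as one dataset.
# A real data fabric virtualizes this; here we simulate it with pandas.
import sqlite3
import pandas as pd

# Silo 1: a relational system holding customer master data (hypothetical).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, country TEXT)")
con.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme Ltd", "UK"), (2, "Globex", "DE")],
)
con.commit()

# Silo 2: order records exported from another system as a CSV file (hypothetical).
pd.DataFrame(
    {"order_id": [10, 11, 12], "customer_id": [1, 1, 2], "amount": [250.0, 99.5, 410.0]}
).to_csv("orders.csv", index=False)

# The "fabric" view: the user asks one question and never sees the two silos.
customers = pd.read_sql_query("SELECT * FROM customers", con)
orders = pd.read_csv("orders.csv")
unified = orders.merge(customers, on="customer_id")

print(unified.groupby("country")["amount"].sum())
```

The tooling here (SQLite and pandas) is only a stand-in; a real data fabric exposes this through its virtualization layer rather than a hand-written script, but the effect for the user is the same: one logical dataset, with the data left where it lives.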
Benefits of Data Fabric
Big Data Fabric comprises six layers: data ingestion, processing and persistence, orchestration, data discovery, data management and intelligence, and data access. Working together, these layers provide seamless, real-time integration of the data in a Big Data system, regardless of data silos.
With its many layers, Data Fabric offers many potential benefits and enables companies to:
- Effectively integrate Big Data assets with on-premises and Cloud data sources, for a complete view of enterprise-wide information.
- Gain access to up-to-the-minute data in real-time, via the data virtualization component of the data discovery layer.
- Easily on-board new Big Data systems and retire legacy systems, while keeping business systems running continuously. Layers and abstraction protect business users at the top of the stack from any changes at the bottom of the stack.
- Use fewer resources, especially since, with data virtualization, very little data needs to be replicated.
How Does It Work?
Let’s drill down into these layers to see how they work together:
- Data Ingestion:
As the first layer in the Fabric, the data ingestion layer needs to be savvy with all of the potential kinds of data, be it structured or unstructured, such as data from devices, sensors, logs, clickstreams, applications, and cloud sources, in addition to databases.
- Processing and Persistence:
Next, the data needs to be processed and persisted, and this is where Hadoop, Spark, and other Cloud-based processing systems come into play.
Here, the data needs to be transformed and cleaned as required so that it can be integrated with other data. The only way to do this effectively is through the orchestration layer, as ad hoc transformation is costly and potentially endless. (A minimal sketch of ingestion, processing and persistence follows this list.)
- Data Discovery:
This is the most important layer in the Fabric, because it directly addresses the silo problem. In the data discovery layer, companies employ data modelling, data preparation, data curation, and data virtualization. Data virtualization is critical, as it creates virtual views of the data that can be accessed by consumers in real-time. This means that analysts can query across two “silos” as if they were part of the same dataset.
- Data Management and Intelligence:
This layer sits above the other five layers, securing the data and enforcing governance. This is also where companies can apply global structures such as metadata management, search, and lineage control. (A small metadata sketch also follows this list.)
- Data Access:
Finally, we have the data access layer, which delivers the data to analysts, either directly or through a series of applications, tools, and dashboards.
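As a rough illustration of the ingestion and processing-and-persistence layers described above, here is a minimal PySpark sketch. It assumes a hypothetical clickstream landing as JSON and a CRM export landing as CSV; the paths, column names and checks are invented for the example. In practice this logic would run under the orchestration layer, not as a one-off script.

```python
# Minimal sketch of ingestion -> processing -> persistence with Spark.
# Paths and column names are hypothetical, for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

spark = SparkSession.builder.appName("fabric-processing-sketch").getOrCreate()

# Ingestion: pick up raw data in whatever shape it arrives.
clicks = spark.read.json("/landing/clickstream/")                           # semi-structured events
customers = spark.read.option("header", True).csv("/landing/crm_export/")   # structured export

# Processing: clean and standardise so the sources can be integrated later.
clicks_clean = (
    clicks.dropDuplicates(["event_id"])
          .filter(col("customer_id").isNotNull())
          .withColumn("event_time", to_timestamp(col("event_time")))
)

# Persistence: write a query-friendly, columnar copy for the layers above.
clicks_clean.write.mode("overwrite").parquet("/curated/clickstream/")
customers.write.mode("overwrite").parquet("/curated/customers/")

spark.stop()
```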
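And as a hint of what the data management and intelligence layer keeps track of, here is a toy metadata and lineage record in plain Python. It is only a sketch of the idea – real metadata management tools (whose immaturity is discussed above) capture far richer detail – and every name in it is hypothetical.

```python
# A toy metadata/lineage record; real catalogues hold far richer information.
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    location: str                                         # where the curated copy lives
    owner: str                                            # accountable data owner
    source_systems: list = field(default_factory=list)    # lineage: where it came from
    quality_checks: list = field(default_factory=list)    # governance: what was enforced


entry = DatasetEntry(
    name="curated_clickstream",
    location="/curated/clickstream/",
    owner="web-analytics-team",
    source_systems=["clickstream landing zone"],
    quality_checks=["deduplicated on event_id", "customer_id not null"],
)
print(entry)
```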
When thinking about a data lake or a data fabric, consider the following.
How am I going to present data to my users, making the data customer-centric? (Something I have written about before.)
Not, as your first thought, how am I going to put all my data into Hadoop.
- Is ‘Big’ Data Fabric right for you?
- Review the pros and cons of what you currently have, a Data Lake implementation and a Data Fabric implementation – what is best for you?
- What outcome do you want?
- Who are your customers, what do they want?
- How would you measure the success of the implementation you choose?
- Create a plan.
- Review a few suppliers offering Data Fabric – a Google search will quickly give you a list.
- Kick off the project.
Let everyone know how you are progressing against your short-, medium- and long-term results.