“Data Lake” is a massive, easily accessible data repository for storing “big data”. Unlike traditional data warehouses, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, a data lake is designed to retain all attributes, especially when you do not yet know the scope of the data or its use. Currently, Hadoop is the most common technology used to create a data lake. It is important to distinguish between Hadoop and a data lake: a data lake is a concept, and Hadoop is a technology to implement the concept.
A data lake holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
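The flat layout described above can be pictured with a small sketch. This is only a toy in-memory illustration, not how Hadoop or any particular product stores data; the class name `FlatLake`, its methods, and the tag names are all hypothetical.

```python
import uuid


class FlatLake:
    """Toy in-memory stand-in for a data lake's flat object store (hypothetical)."""

    def __init__(self):
        # One flat namespace: identifier -> (raw payload, metadata tags).
        self._objects = {}

    def put(self, raw, **tags):
        """Store the raw payload unchanged and return its generated identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (raw, dict(tags))
        return object_id

    def query(self, **wanted):
        """Return identifiers of objects whose tags match every wanted pair."""
        return [
            object_id
            for object_id, (_, tags) in self._objects.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]

    def get(self, object_id):
        """Return the raw payload exactly as it was stored."""
        return self._objects[object_id][0]


lake = FlatLake()
log_id = lake.put(b"GET /index.html 200", source="web", kind="clickstream")
csv_id = lake.put(b"id,amount\n1,9.50\n", source="billing", kind="transactions")

# A business question about web traffic narrows the lake to a smaller set.
web_ids = lake.query(source="web")
```

Nothing is parsed or reshaped on the way in; the tags alone decide which raw objects a later question pulls out for analysis.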
Data Lake Capabilities
- Capture and store raw data at scale for a low cost.
- Store many types of data in the same repository.
- Perform transformations on the data.
- Define the structure of the data at the time it is used.
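The last capability, defining the structure at the time the data is used, is often called schema-on-read. A minimal sketch of the idea follows, assuming raw records arrive as JSON or CSV text; the function `read_with_schema`, the sample records, and the field names are made up for illustration.

```python
import csv
import io
import json

# Raw records kept exactly as they arrived, in two different formats.
raw_records = [
    ("json", '{"user": "ann", "amount": "12.50", "channel": "web"}'),
    ("csv", "user,amount\nbob,7.25\n"),
]


def read_with_schema(records, schema):
    """Parse raw records and project them onto `schema` at read time.

    `schema` maps a field name to a converter; fields absent from a
    record come back as None instead of being rejected at load time.
    """
    rows = []
    for fmt, payload in records:
        if fmt == "json":
            parsed = [json.loads(payload)]
        else:
            parsed = list(csv.DictReader(io.StringIO(payload)))
        for item in parsed:
            rows.append({
                field: convert(item[field]) if field in item else None
                for field, convert in schema.items()
            })
    return rows


# Two consumers apply two different structures to the same raw data.
billing_view = read_with_schema(raw_records, {"user": str, "amount": float})
marketing_view = read_with_schema(raw_records, {"user": str, "channel": str})
```

The stored records never change; each consumer decides which fields and types matter only when it reads them.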
The term Data Lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization’s data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop’s cluster nodes of commodity computers.
Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. However, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.
The data lake promises to speed the delivery of information and insights to the business community without the hassles imposed by IT-centric data warehousing processes.
With a data lake, you simply dump all your data, both structured and unstructured, into the lake (for example, Hadoop) and then let business people “distill” their own parochial views using whatever technology is best suited to the task (e.g. SQL or NoSQL, disk-based or in-memory databases, MPP or SMP). You then create enterprise views by compiling and aggregating data from multiple local views.
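As a rough sketch of that last step, an enterprise view can be thought of as a merge of local views on a shared key. The views, the `customer` key, and the `enterprise_view` helper below are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical local views, each distilled by a different business unit.
sales_view = [
    {"customer": "ann", "revenue": 120.0},
    {"customer": "bob", "revenue": 80.0},
]
support_view = [
    {"customer": "ann", "tickets": 2},
    {"customer": "cy", "tickets": 1},
]


def enterprise_view(*local_views, key="customer"):
    """Merge rows from several local views into one record per key value."""
    merged = defaultdict(dict)
    for view in local_views:
        for row in view:
            merged[row[key]].update(row)
    return dict(merged)


combined = enterprise_view(sales_view, support_view)
```

Here `combined["ann"]` carries both the revenue and the ticket count, while customers known to only one unit still appear with the fields that unit recorded.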
Data Lake Advantages
- Data Lake gives business users immediate access to all data.
- Data in the lake is not limited to relational or transactional data.
- With a data lake, you never need to move the data.
- Data Lake empowers business users and liberates them from the bonds of IT domination.
- Data Lake speeds delivery by enabling business units to stand up applications quickly.
Data Lake Disadvantages
- Unknown area of Data Processing.
- Data governance.
- Dealing with Chaos.
- Privacy issues.
- Complexity of Legacy Data.
- Metadata Lifecycle Management.
- Desolate Data Islands.
- The Issue of Integration.
Now that data storage and technology are cheap, information is vast, and newer database technologies don’t require an agreed-upon schema up front, discovery analytics is finally possible. With data lakes, companies employ data scientists who are capable of making sense of untamed data as they trek through it. They can find correlations and insights within the data as they get to know it.
Some say the data lake is a dream, but we know of organizations that are making this approach a reality. The internal infrastructures developed at Google, Yahoo and Facebook provide their developers with the advantages and agility of the data lake dream. For each of these companies, the data lake created a value chain through which new types of business value emerged:
- Using data lakes for web data increased the speed and quality of web search.
- Using data lakes for clickstream data supported more effective methods of web advertising.
- Using data lakes for cross-channel analysis of customer interactions and behaviors provided a more complete view of the customer.
Regardless of where you are now, take some time to look to the future. We’re on a journey towards connecting enterprise data together. As business becomes increasingly digital, access to data will become a critical priority, as will speed of development and deployment. The data lake is a dream that can match those demands.
This text is also published on Ahmed Banafa’s LinkedIn profile.