In recent years, data lake and data management platform (DMP) projects have multiplied in companies, where they are sometimes used alongside each other.
But what is a data lake, what is it used for, and how is it different from a DMP?
Pierre Harand, Managing Director France, and Jean-François Wassong, Global Technology Director at fifty-five, provide their explanations and insights.
What is a data lake used for?
Pierre Harand: To put it simply, a data lake is a huge database into which a company’s various data streams are channelled. These flows stream into the data lake like so many rivers coming from the various departments of the company.
A data lake aims to give its users access to exhaustive data, which they can extract in an automated and customised way.
Its primary function is analytical – in a way, it can be seen as an experimental playground for data scientists, as they can play around with all sorts of data without always knowing beforehand what it is going to reveal.
More agile than a data warehouse, it allows the company to extract value out of raw data without first having to standardise or map its own data.
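The idea of working on raw data without standardising it first is often called "schema on read": records are stored as they arrive, and structure is applied only when a question is asked. The sketch below is a minimal illustration of that idea, not a description of any particular product; the event records and field names (`source`, `visitor_id`, `action`, `sku`) are hypothetical.

```python
import json

# Raw records stored exactly as each department emitted them.
# The shapes differ on purpose: nothing was mapped to a common schema.
RAW_EVENTS = [
    '{"source": "web", "visitor_id": "v1", "action": "add_to_cart", "sku": "A12"}',
    '{"source": "store", "customer": "c9", "sku": "A12", "qty": 2}',
    '{"source": "web", "visitor_id": "v2", "action": "view", "sku": "B07"}',
]


def read_events(raw_lines, source):
    """Parse raw lines at query time and keep only those from one source."""
    for line in raw_lines:
        event = json.loads(line)
        if event.get("source") == source:
            yield event


def add_to_cart_skus(raw_lines):
    """Apply the structure this question needs: web add-to-cart events only."""
    return [
        event["sku"]
        for event in read_events(raw_lines, "web")
        if event.get("action") == "add_to_cart"
    ]


web_skus = add_to_cart_skus(RAW_EVENTS)
print(web_skus)
```

A different question, such as quantities sold in store, would read the same raw records with a different filter, without any change to how the data was stored.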
For example, we set up a data lake for a major fashion retailer, which was then able to determine how best to order the product lists on its e-commerce website based on visitors’ behavioural data, the company’s product databases, inventory and margin. In the end, the analysis from the data lake, combined with a new product ordering, led to an 8% increase in the add-to-cart rate and a 4% increase in annual turnover (performance comparison based on A/B testing).
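As a rough illustration of how such an A/B comparison is read, the sketch below computes the relative uplift in add-to-cart rate between a control version and a variant. The visitor and add-to-cart counts are made-up numbers chosen to produce an 8% uplift; they are not the retailer's figures, and a real test would also check statistical significance.

```python
def add_to_cart_rate(add_to_carts: int, visitors: int) -> float:
    """Share of visitors who added at least one product to their cart."""
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    return add_to_carts / visitors


def relative_uplift(control_rate: float, variant_rate: float) -> float:
    """Relative change of the variant compared with the control."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (variant_rate - control_rate) / control_rate


# Hypothetical counts for illustration only.
control = add_to_cart_rate(add_to_carts=1000, visitors=10000)  # existing ordering
variant = add_to_cart_rate(add_to_carts=1080, visitors=10000)  # new ordering

uplift = relative_uplift(control, variant)
print(f"Relative uplift in add-to-cart rate: {uplift:.1%}")
```

Note that the uplift is relative: in this made-up example the rate moves from 10% to 10.8% of visitors, not from 10% to 18%.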
What is the difference between a data lake and a DMP?
Jean-François Wassong: A data lake is characterised by the longevity and comprehensiveness of its data. It also allows companies to collect personally identifiable information (PII), unlike DMPs. In this sense, we can say that a data lake has higher asset value than a DMP, which is centred on cookies and media audience activation.
A data lake is also more open to the various company departments – it is a free space.
Conversely, in a data management platform, everything is processed in anticipation of media activation, especially by combining first-party data with third-party data (i.e. additional data from external partners), unlike data lakes, which do not include the latter as they are reserved for internal use.
What are the advantages of both approaches?
Pierre Harand: In a company, the fields of application of a data lake are broader than those of a DMP, and they extend beyond advertising. They involve several company departments and functions, such as pricing, merchandising, logistics, production and inventory. A data lake is more of a cold exploration and processing tool.
A DMP is primarily an activation-oriented marketing tool, and it is mainly digital. The advantage of the DMP is that it is usually an off-the-shelf tool with connectors to the various market solutions. It also includes features that allow marketers to quickly create and launch campaigns targeting specific user segments. There is a “real-time” and “hot data” aspect to the DMP, which allows for the use of engagement and personalisation tools as well as fast processing of audience engagement signals.
Are data lakes and DMPs compatible?
Jean-François Wassong: Yes, the two approaches are compatible, and a data lake often constitutes a good preliminary step to a DMP. Indeed, it is often possible to use data from the data lake to expand the knowledge base of the DMP.
In short, a DMP establishes connections between several external data providers, and the data lake then supplements it with new internal data.
We generally advise companies that enjoy significant traffic on their digital assets to start off by setting up a data lake. Conversely, if the number of visitors is low, as is the case for companies selling fast-moving consumer goods (FMCG), we recommend starting with a DMP instead.
As far as data lakes are concerned, what are the known technologies?
Jean-François Wassong: A data lake is generally a collection of multiple components. The Hadoop ecosystem, which is currently the most widespread, is made up of at least three components:
- data storage, which is generally handled by HDFS
- distributed processing, for which there is a wide range of solutions (MapReduce, YARN, Spark, etc.)
- the query engine (Hive, Pig, Drill, etc.)
These make it possible to implement a lambda architecture, i.e. an architecture designed to transform raw data into actionable data.
These data can then be used by the various departments through fast query tools (Elasticsearch, HBase, Impala, Cassandra...).
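In its usual formulation, a lambda architecture combines a batch layer that recomputes views from the full history, a speed layer that covers the most recent data not yet processed by the batch, and a serving layer that merges both at query time. The sketch below illustrates that principle in plain Python under simplifying assumptions: the event records and the `sku` field are hypothetical, and in-memory counters stand in for the storage, processing and query components listed above.

```python
from collections import Counter

# Hypothetical raw events; in a real deployment these would sit in
# distributed storage rather than in Python lists.
HISTORICAL_EVENTS = [{"sku": "A12"}, {"sku": "A12"}, {"sku": "B07"}]
RECENT_EVENTS = [{"sku": "A12"}, {"sku": "C33"}]


def build_batch_view(events):
    """Batch layer: recompute the view from the complete history."""
    return Counter(event["sku"] for event in events)


def build_speed_view(events):
    """Speed layer: cover events that arrived after the last batch run."""
    return Counter(event["sku"] for event in events)


def query(batch_view, speed_view, sku):
    """Serving layer: merge both views to answer a query."""
    # A Counter returns 0 for keys it has never seen.
    return batch_view[sku] + speed_view[sku]


batch_view = build_batch_view(HISTORICAL_EVENTS)
speed_view = build_speed_view(RECENT_EVENTS)
print(query(batch_view, speed_view, "A12"))  # 3
```

When the next batch run absorbs the recent events, the speed view can be discarded and rebuilt from newer data, which keeps the fast path small.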
The main distributions of Hadoop (Hortonworks, Cloudera, MapR) have packaged the components of the Hadoop ecosystem, thus facilitating the implementation of lambda architectures.
Over the past few years, several major players such as Amazon, Microsoft and Google have marketed cloud offerings that simplify the implementation of the Hadoop stack even further. They also offer alternative solutions for some of the components, such as Amazon Redshift, Google BigQuery or Microsoft Azure DocumentDB.
How much does a data lake cost?
Pierre Harand: Data lakes are based on very low-cost storage and processing technologies, which means that the entry cost for a data lake – in the tens of thousands of euros – is lower than for a DMP, which can cost several hundred thousand euros, or even exceed one million euros in the case of large companies.
However, the cost of the governance design and exploration phases prior to the setting up of a data lake should not be overlooked. These phases allow companies to set up data collection frameworks and ensure their reliability.
Generally speaking, we advise companies not to rush into using the tool without having first defined what they wish to extract from it.
The advantage of a data lake is that it allows companies to start small with low-stakes analysis and performance measurement projects. It often gives them the opportunity to mobilise several departments that do not usually work together and to get them thinking about the customer experience they wish to provide (for example, maximising store sales per region based on customer behaviour, inventory and logistics).
What fundamental questions should companies ask themselves before setting up a data lake?
Jean-François Wassong: The key issues have to do with governance, security and data reliability. We therefore recommend that companies involve both the legal and IT departments beforehand, and answer the following questions:
- What does each department need a data lake for, and what are the expected purpose and benefits of the data lake?
- How will we feed the data lake?
- What are the legal and technical requirements (for instance regarding privacy and personal data storage)?
- What prior consents do we need to establish?
- How can we transfer our data to the cloud?
- In what geographical locations will the data be stored and processed?
- What are the security standards to be observed?
Data lake projects also touch upon significant issues regarding the security of the information system, the relevance of switching over to a cloud service, personal data management and change management. In this regard, a substantial educational effort must be made in order to gain the support of the various departments, including the IT and legal departments. It is therefore essential to involve all of these players in the project as early as possible.
This interview was originally published on Viuz, and translated from the original French by Marion Beaujard.