The question:
I have been unsure whether to create a data lake or a data warehouse, and I hope some experienced real-world professionals can give me some enlightenment.
I would like to store, visualise, and perform machine learning on data ingested from multiple sources (IoT devices, APIs, etc.). I have read that in the current environment a business will require both a data lake and a data warehouse.
My question is:
- Should I create a data lake first, then transform/process the raw data from the lake and ingest it into a data warehouse?
- Or is the data lake a separate data processing pipeline on its own?
- Or does this depend on the use case?
This is what I have been thinking of:
PS: If this is the wrong StackExchange do let me know thanks 🙂
The Solutions:
Below are the methods you can try. The first solution is usually the best fit; read each method carefully and apply it to your own case rather than copying it verbatim.
Method 1
There are a lot of similar and overlapping terms these days (Data Lake, Data Swamp, Data Warehouse, etc.) that I wouldn’t get too hung up over, IMO.
Data Lakes are informal places to centralize different sources of data. They can be flexible and don’t necessarily need to adhere to a fixed schema but can follow one.
Data Warehouses are more formally defined and unify those different sources of data into a common structure, so that it’s easy to build consuming applications and reports on top of them.
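The distinction can be sketched in a few lines (all payloads and field names here are hypothetical): a lake accepts records from different sources as-is, while loading into a warehouse means mapping everything onto one common structure at transform time.

```python
# Hypothetical payloads from two sources that don't share a schema.
# A lake can store both untouched; structure is applied when reading.
from_api = {"deviceId": "sensor-1", "temperature": 21.5}
from_iot = {"dev": "sensor-2", "readings": {"temp": 19.0}}

def to_common(rec):
    """Map either source shape onto the warehouse's common structure."""
    if "deviceId" in rec:
        return {"device": rec["deviceId"], "temp_c": rec["temperature"]}
    return {"device": rec["dev"], "temp_c": rec["readings"]["temp"]}

rows = [to_common(r) for r in (from_api, from_iot)]
print(rows)
# → [{'device': 'sensor-1', 'temp_c': 21.5}, {'device': 'sensor-2', 'temp_c': 19.0}]
```

The lake keeps the original, inconsistent shapes; the warehouse only ever sees the unified rows.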
So the answer to your question is: it depends on your use cases, on how many different types and sources of data you need to consume, and on whether having a Data Lake as an intermediary step makes those use cases easier to accomplish before you apply the ETL (really the Transform part) processes to that data.
If all of your sources of data already follow a rather common schema, then usually you can just ETL straight into your Data Warehouse and skip the Lake altogether. But sometimes it’s good to use a Data Lake to preserve the original data as it was extracted, in case some level of reconciliation and debugging is needed later on. It preserves a record of what the data looked like before you touched it to transform it into the Warehouse.
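The lake-then-warehouse flow described above can be sketched end-to-end. This is a minimal illustration, not a production pipeline: a temp directory stands in for the lake and SQLite for the warehouse, and all payloads and paths are made up.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# --- "Extract": raw readings as they might arrive from IoT devices/APIs ---
raw_records = [
    {"device": "sensor-1", "temp_c": "21.5", "ts": "2024-01-01T00:00:00Z"},
    {"device": "sensor-2", "temp_c": "19.0", "ts": "2024-01-01T00:00:00Z"},
    {"device": "sensor-1", "temp_c": "bad", "ts": "2024-01-01T00:05:00Z"},  # malformed
]

base = Path(tempfile.mkdtemp())

# --- "Lake": preserve the untouched extract for later reconciliation/debugging ---
lake_file = base / "lake" / "2024-01-01" / "readings.json"
lake_file.parent.mkdir(parents=True)
lake_file.write_text(json.dumps(raw_records))

# --- "Transform + Load": enforce a fixed schema on the way into the warehouse ---
conn = sqlite3.connect(base / "warehouse.db")
conn.execute("CREATE TABLE readings (device TEXT, temp_c REAL, ts TEXT)")
for rec in json.loads(lake_file.read_text()):
    try:
        row = (rec["device"], float(rec["temp_c"]), rec["ts"])
    except (KeyError, ValueError):
        continue  # rejected rows remain recoverable from the lake file
    conn.execute("INSERT INTO readings VALUES (?, ?, ?)", row)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0])  # → 2
```

The malformed reading never makes it into the warehouse table, but it is still sitting untouched in the lake file if you need to investigate or reprocess it later.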
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.