A Review of the Presentations at the DataMass Gdańsk Summit 2022
The 4th edition of DataMass, and the first one we have had the pleasure of co-organizing, is behind us. We would like to thank all the speakers for…
Read moreDuring my 6-year Hadoop adventure, I had an opportunity to work with Big Data technologies at several companies ranging from fast-growing startups (e.g. Spotify) to global corporations and academic institutes. What really amazed me was the difference of how the use-cases were defined, how fast valid solutions were built and how money was spent and wasted.
To be more specific, I’ve seen many companies with powerful, expensive and battle-ready data infrastructures (i.e. tens of nodes in multiple clusters, a whole Big Data stack, enterprise licences, security restrictions, teams of administrators) and … almost no real production use-cases. This means a large cost overrun, small ROI and something one should avoid.
This post introduces the notion of “Lean Big Data” to make a smart investment in Big Data technologies, and it describes five common pitfalls that can lead to a failed Big Data project. These include deploying Big Data technologies when you don’t have big data, separating application and platform roles too soon, and building a platform without a use case in mind.
I think that many companies equate being data-driven with using Big Data technologies. Obviously, it’s not true. You can be data-driven without Big Data. Similarly, you can use Big Data technologies, but still, don’t make data-driven decisions or don’t implement product features that are fed by data.
A few years ago, a young marketplace startup discussed their data-processing needs with me. I advised them not to use Big Data technologies. Two reasons: they would iterate slower and spend more money. These two reasons would kill this startup quickly. When you have small data and start using Big Data technologies you don’t get benefits such as scalability, cheaper replacement for DWH, schema-on-read etc. You iterate slower because you spend time learning new and complex tools, troubleshooting issues and bugs, discovering unsupported features, making time-consuming mistakes, accumulating technical debt, implementing/testing/deploying/running distributed jobs on a cluster, trying to recruit Big Data specialists and keeping them stay. It’s much easier, faster and cheaper to iterate with MySQL rather than Hadoop or HBase (especially when you’re a young startup and you evolve and change requirements frequently).
Simply speaking, you should always be data-driven, but use Big Data technologies only when you have a really good reason for it. Here are some examples:
The above use-cases are examples of why you might want to start using Big Data technologies (of course, there are more!).
However, some other companies that I’ve seen started deploying Big Data technologies when … they become very trendy and cool, “worth trying” or encouraged by vendors.
These companies decided to build “production” Hadoop clusters, but they didn’t have valid use-cases identified. As a result, they started generating ideas and running tens (!) of internal POCs to onboard some projects on the Hadoop platform. Most of them failed and/or had generated high costs.
There is one use-case, that I saw in several places, that seems to be “always working” – just collect all textual documents, upload them to HDFS, use Solr to index them (not Elasticsearch, because Elasticsearch is not included in CDH and HDP) and build custom front-end website to search across indexes. This use-case runs on Hadoop cluster (yes!), but … it could run without Hadoop too, because Solr and Elasticsearch are scalable and distributed technologies that can run without HDFS, YARN, MapReduce, Spark etc.
When you build the data lake “not for today, but for the future”, you will probably end up using Hadoop for following use-cases:
The Hadoop Ecosystem has several nice properties that allow you to economize e.g. open-source licence, standard hardware requirements, linear scalability that allows you to start just with a bunch of physical/virtual nodes or use public cloud where you “pay-as-you-go”, high-level frameworks. Even though, many companies don’t seem to take advantage of them. What I have seen is truly incredible:
Maybe it sounds surprising, but at Spotify we run almost all batch computation on a single production and multi-tenant cluster for many years, access control matrix was rather simple, Hadoop administrators team consisted of 1-5 people, there was no licence for any Hadoop server in our thousand-node cluster storing tens of PBs of data. Of course, we run into troubles and wasted some time/money here and there, but we managed to achieve so much with a single, but fully-utilized cluster!
The more you spend on Hadoop cluster, the bigger pressure and expectations are set. For instance, many projects are “forced” to migrate to Big Data. The managers set the goals as “adding 10 new production projects to Hadoop in 2016” that can be achieved only by migrating “fake” projects. These projects often die later. What’s more, most of these projects actually don’t want to go from “POC” to “PROD”, because they will iterate slower due to various procedures such as less frequent upgrades of Hadoop, less permissions for junior employees, straighten resource management to ensure SLA, more restricted development cycle.
When you start with Big Data, it’s good to build a small cross-functional team that drives the whole effort. Such team can consist of 3-6 people such as a Linux administrator, a Java/Scala developer and a SQL analyst working full or part-time. Ideally, you should have someone who has previous experience in Big Data. Thanks to a frequent and quick feedback loop, the administrators exactly know what the infrastructure really misses or don’t need at all. The developers/analysts get immediate help when running into troubles, configuration, and access issues. Eventually, this team can split into smaller teams like “infrastructure” (administrators, DevOps), “tools & ETL” (engineers) and “insights” (analysts, data scientists) and these teams could split into smaller teams even later. This was the case at Spotify many years ago and it worked well.
I noticed that some companies start with separate “platform” and “application” teams located in different floors/buildings/countries. As a result, they tend to run into miscommunication issues and usually aren’t aligned. The “platform” teams follow its own way by building a battle-ready and award-winning data infrastructure by scaling cluster, installing many components, deploying access control rules, thinking about features like backups, auditing, quotas and introducing “good” procedures and recommendation guides written in Word documents etc. The “application” teams have a hard time figuring out the valid use-cases, learning the complex infrastructure and overcoming troubles caused by all procedures, conventions, and restrictions introduced by the “platform” team. They iterate even slower due to these rules and complications.
In my opinion, the data infrastructure should be use-case-driven and improved only to address new requirements from the applications and business use-cases. At Spotify our Hadoop cluster had been always falling behind – it was so hard to keep up to scale, enrich, tune and stabilize the cluster because the new data, use-cases and requirements were added all the time, faster and faster. Eventually, Spotify has decided to go to the public cloud hoping that GCP will be able to quickly provide everything the company needs. If your platform advances the use-cases, then probably you invested too much into it (of course it doesn’t mean that your platform should be crappy, and nobody could rely on it due to problems).
Even a single person who has real production experience with Big Data can help a company go into the right direction, avoid making expensive mistakes, hire smart people and correctly talk to vendors. However, this can be the chicken-egg problem if the company doesn’t have valid use-cases for Big Data and can’t grow internal employees.
Just to be clear, I don’t criticise companies that wish to innovate, try Big Data stack and evaluate new cutting-edge technologies. I highly encourage to do that! I’m just amazed by the scale, complexity, and bureaucracy of many Big Data infrastructures.
Where a single, small, stable, well-configured and multi-tenant cluster can be enough, I often see multiple overpaid, underutilized and concerning clusters.
What is the easiest way to waste money with Big Data?
IMHO, spending it far from ROI and in the case of Big Data, it’s cluster infrastructure, cluster administration, and ETL.
How to avoid it?
I would recommend following the idea described in the “Lean Startup” book. Try to spend as much time and energy as possible in activities that generate real value or validate your hypothesis (e.g. data analysis, data science, data-driven features, important dashboards). Build or improve your data infrastructure in iterations, only if you see that the next iteration is needed for your company to achieve more.
For example, the first iteration could look as follows:
The next iterations, could look as follows:
Basically, try to spend as little time as possible in steps 2 and 6.
Do you have a similar opinion? Am I completely wrong? Please share your thoughts in the comments, so everyone could benefit from your experience and avoid expensive mistakes when build own Big Data infrastructure.
The 4th edition of DataMass, and the first one we have had the pleasure of co-organizing, is behind us. We would like to thank all the speakers for…
Read moreAt GetInData we use the Kedro framework as the core building block of our MLOps solutions as it structures ML projects well, providing great…
Read moreHTTP Connector For Flink SQL In our projects at GetInData, we work a lot on scaling out our client's data engineering capabilities by enabling more…
Read moreIn part one of this blog post series, we have presented a business use case which inspired us to create an HTTP connector for Flink SQL. The use case…
Read moreNowadays, we can see that AI/ML is visible everywhere, including advertising, healthcare, education, finance, automotive, public transport…
Read moreDynamoDB is a fully-managed NoSQL key-value database which delivers single-digit performance at any scale. However, to achieve this kind of…
Read moreTogether, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.
What did you find most impressive about GetInData?