Since 2015, the beginning of every year has been intense but also exciting for our company, because it brings us closer to the Big Data Technology Summit, the conference we co-organize. This year we celebrated the 7th edition of the event, and the first one held fully online.
Strange as it may sound, “thanks to” COVID-19 the conference agenda was fuller of good content than ever. We not only split the conference into two days, but also added more tracks, more presentations and even some extra talks available on the VoD platform before the main event.
We also held two panels before the conference; you can watch the recording of one of them, “Pandemic, data and analytics – how might we know what happens next with Covid-19”, on YouTube.
During this year's edition, we had the pleasure of listening to speakers from international companies. With no travel required, the conference was easier for participants to attend, and easier for us as an organizer to host Big Data experts from different continents. It is important to us that the entire content of the Big Data Technology Summit was highly rated by the participants.
Many of the presentations also received very good scores from the attendees. Below you can find a brief review of some of the talks our team members attended.
Piotr Menclewicz, Big Data Analyst at GetInData
On the first day of the conference, I had the pleasure of watching presentations in the Data Strategy and ROI simultaneous sessions. The second track’s session, Foundations of Data Teams, was run by Jesse Anderson from the Big Data Institute. Jesse drew a beautiful analogy between building a data team and constructing a house: to make your data efforts work in the long term, you need to take care of the foundations. In practice, we often focus too much on the facade instead of the core structure. We were given great examples of what happens when key components of the data team, such as data science, data engineering or operations, are missing. We also found out that, counterintuitively, a bit of prevention can lead to tons of value.
On the second day of the Big Data Technology Summit, I attended the presentation given by Alex Belotserkovskiy from Microsoft. In his talk, Big Data Instruments and Partnerships - Microsoft ecosystem update, Alex explained what a data platform means at Microsoft and what the main components of such a platform are. He described the main pillars of the architecture: ingestion, storage, preparation, serving and reporting. He also showed how each of them can be supported with Microsoft products alone, but also with open-source technology (e.g. Spark, Jupyter, PostgreSQL). We were also given a glimpse into Microsoft's current areas of focus, such as responsible ML and open data.
The last presentation in the Data Strategy and ROI track, How to plan the unpredictable? 7 good practices of organisation and management of fast-paced large-scale R&D projects, was given by Krzysztof Jędrzejewski and Natalia Sikora-Zimna from Pearson. During the talk we saw practical examples of lessons learned from managing highly agile and innovative projects. The speakers shared specific assumptions that are perfectly sensible in theory but tend not to work in practice, and provided remedies for these common pitfalls. We found out why we shouldn’t design everything in advance, try to be aligned with everything and everyone, or build everything on our own.
Michal Rudko, Big Data Analyst at GetInData
ML Ops was one of the tracks that attracted the most interest during the Big Data Technology Summit 2021. More and more companies are bringing advanced analytics into their daily core business; however, this also requires proper operationalization and maintenance.
In the first session of the track, Keven Wang guided us through the MLOps journey at H&M. He showed how to manage a large number of ML models and their end-to-end lifecycle so that a model can be brought online with confidence, its performance is monitored and the process can be adopted by multiple product teams at H&M. The whole path was presented as a combination of automated and manual steps (e.g. active approvals at crucial moments) with some state-of-the-art tools for model training (Azure Databricks, Kubeflow, Airflow), model management (MLflow) and model deployment (Seldon). I really liked the way these three stack items were treated separately by design, allowing a certain flexibility, which is quite important in such a dynamic environment. Some of the functionality was backed by managed services, which sped up the whole process a lot.
It was proven once again that an ML product is just another software product, where best practices and software engineering skills are more than welcome in the team. The whole transition process, however, requires a mindset shift and proper planning, so it is indeed a journey you have to make in order to have dynamic and fully operational analytics in your data-driven company.
In the afternoon we learned from Maciej Pieńkosz how ML models are trained and deployed in Google Cloud at Sotrender, a company of social media experts. It came as no surprise that the list of advanced analytics use cases in this industry is long and diverse: sentiment analysis, hate speech detection, image recognition, text extraction, post classification and many more. Each of these model types requires a specific environment for experimentation, training, deployment and maintenance, and this is where the Google Cloud platform, with its services supported by open-source tools, comes into play.
At Sotrender the whole journey usually starts in a notebook (AI Platform Notebooks), where the data is explored and initial models are built. In the next step the codebase is refactored using standardized structures and templates, wrapped in a Docker container and trained in the cloud using AI Platform Training. The idea is to deploy locally and train in the cloud in order to optimize costs. Using MLflow for experiment tracking keeps the whole experiment history in one place. Models are then deployed as services, served via a REST API using Cloud Run, which was chosen as the most flexible and functional solution. The CI pipeline is managed by GitLab, with canary rollouts ensuring smooth and safe change management.
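To make the canary rollout idea a bit more concrete, here is a minimal sketch of how a small, fixed share of requests could be routed to a new model version. It is an illustration only, not Sotrender's setup: the function name `route_request`, the version labels and the 10% share are all hypothetical, and in practice the traffic split would normally be handled by the serving platform rather than by application code.

```python
import hashlib


def route_request(request_id: str, canary_percent: int) -> str:
    """Return "canary" or "stable" for a request id (hypothetical helper).

    The id is hashed so that the same id always lands in the same bucket,
    which keeps a given user on one model version during the rollout.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "canary" if bucket < canary_percent else "stable"


# Usage: measure which share of 1000 made-up ids would hit the canary.
ids = [f"user-{i}" for i in range(1000)]
share = sum(route_request(i, 10) == "canary" for i in ids) / len(ids)
print(f"canary share: {share:.1%}")
```

Hashing the id instead of drawing a random number per request is a deliberate choice here: it makes routing reproducible, so a problem seen by one user can be traced to a single model version.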
It’s not all about the tools and algorithms: we also heard some good practices and tips from an operational standpoint, covering both engineering and ML. Again it was stressed that the whole journey requires human validation at crucial moments, and this is where a monitoring solution for models plays an important role, especially when you have thousands of models in production and want to react fast if some of them do not perform as expected.
Maciej Obuchowski, Data Engineer at GetInData
The second talk in the Streaming and Real-Time Analytics track was presented by Simply Business's Michał Wróbel, who talked about Complex event-driven applications with Kafka Streams. Michał told us how Simply Business's streaming infrastructure needs to process many different types of events and answer complicated business questions. To manage the large number of event types, they use JSON schemas stored in Iglu Schema Registry and driven by CI/CD. The events are processed with Kafka Streams. The advantages of this approach noted by Michał were Kafka Streams's small footprint, its stateful interfaces and its processing guarantees: fault tolerance and exactly-once semantics. The first version of the system wasn't perfect. It suffered from limited parallelism and was complex to manage, but the second version, built by applying Domain-Driven Design principles, fulfilled all the requirements and was easier to operate. Data prepared by the streaming applications was also used to drive Simply Business's Machine Learning applications.
We finished the track by listening to Ruslan Gibaiev, who told us about Evolving Bolt from batch jobs to real-time stream processing. Bolt's philosophy is efficiency, and Ruslan stressed that it also needs to be applied in a data context. Bolt relies heavily on Debezium for Change Data Capture from MySQL to Kafka. Their approach, consistent with the efficiency principle, relies on libraries from the Kafka ecosystem: Kafka Streams and KSQL. An important aspect stressed by Ruslan is their extensibility: the code is open source and allows, for example, defining UDFs for KSQL. Working with big data tools also has its disadvantages: deployment and operations are complicated, and the tools are hard to debug.
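For readers new to Change Data Capture, the sketch below shows the general idea of replaying change events to rebuild the current state of a table. It assumes a simplified event shape modelled on the Debezium envelope (`op` with `c`/`u`/`d`/`r`, plus `before` and `after` row images); the function name `apply_change_events`, the `id` key and the sample rows are hypothetical, and this is not Bolt's code.

```python
from typing import Any, Dict, Iterable


def apply_change_events(
    events: Iterable[Dict[str, Any]], key_field: str = "id"
) -> Dict[Any, Dict[str, Any]]:
    """Replay simplified change events into an in-memory table keyed by key_field."""
    table: Dict[Any, Dict[str, Any]] = {}
    for event in events:
        op = event["op"]
        if op in ("c", "u", "r"):  # create, update, snapshot read
            row = event["after"]
            table[row[key_field]] = row
        elif op == "d":  # delete: only the "before" image carries the key
            row = event["before"]
            table.pop(row[key_field], None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return table


# Usage with made-up order rows.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]
print(apply_change_events(events))
```

A real pipeline would consume these events from Kafka topics and keep the state in a fault-tolerant store; the in-memory dictionary only illustrates the replay logic.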
Krzysztof Zarzycki, CTO at GetInData
Streaming and Real-Time Analytics was one of the tracks that I was most excited about, and the presentation Streaming SQL - Be Like Water My Friend definitely met my expectations. Many of us already know that streaming SQL is becoming mainstream! Sooner or later you will need to use it to gain a technological advantage. The talk given by Volker Janz from InnoGames, for which streaming SQL is already very important, was a great introduction to the subject. I see streaming SQL booming in 2021: being used in ETLs, in business automation and also for analytics or even Machine Learning, all while delivering results instantly and using resources efficiently. During the talk, Volker also showed how streaming SQL looks and feels, with a demo of Flink SQL and Ververica Platform as an operator of Flink on Kubernetes.
Volker's presentation was the highest-rated presentation at the 7th edition of the Big Data Technology Warsaw Summit, which I think proves it was worth hearing.
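As a rough illustration of what a typical streaming SQL query computes, the sketch below counts events per key in fixed, non-overlapping (tumbling) time windows. It is plain Python over a finite list, not Flink SQL and not code from the talk; the function name `tumbling_window_counts` and the sample events are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start, key).

    Each event is a (timestamp_in_seconds, key) pair. An event belongs to the
    window that starts at the largest multiple of window_seconds not greater
    than its timestamp.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Usage with made-up events and one-minute windows.
events = [(0, "login"), (30, "login"), (61, "login"), (65, "purchase")]
print(tumbling_window_counts(events, 60))
# {(0, 'login'): 2, (60, 'login'): 1, (60, 'purchase'): 1}
```

The important difference in a streaming engine is that the input never ends: results are emitted continuously as windows close, and the engine has to decide how long to wait for late events.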
Arkadiusz Gąsior, Data Engineer at GetInData
During the second part of the Data Engineering simultaneous session, we had the pleasure of hearing two more talks. One of them was Top 5 Spark anti-patterns that will bite you at scale! given by Alex Holmes. Spark is a well-established tool for data teams in both small and big projects. It’s useful for ad hoc data analysis as well as for established data processing pipelines. Alex Holmes, a software engineer, author, speaker and blogger, brought an interesting set of real-life scenarios from various production systems, which can help you use these technologies more effectively. The first recommendation is to start your data warehouse with properly defined catalog, directory and data format standards that best serve your domain. Datasets, tables and column structures will grow over time, and managing them can become more and more demanding. Apache Iceberg and Delta Lake were named among the most effective solutions, with a growing set of tools around them. Another big topic was testing, and load testing in particular, before going into production. This usually reveals performance optimizations that are hard to find without a realistic data load, and allows fine-tuning so that sufficient processing resources are available. Data should be structured, clean, binary and able to evolve, using the Avro format backed by a schema registry. This comes in handy when a system grows and new fields and objects have to be handled by your data engine. Adopting it in the early stages of your data platform, rather than waiting for the traffic to grow, eliminates many problems. The final recommendation was simply to stay up to date with Spark 3 to reap the benefits of new optimizations. We recommend following this advice to avoid as many traps as possible in your own sparkling data lake.
The next one was BigFlow – A Python framework for data processing on the Google Cloud Platform, presented by Allegro. Growing traffic is what you can look forward to when expanding your business, but it also brings challenges for the data team. Allegro senior software engineer Bartłomiej Bęczkowski presented a very interesting solution for data processing which works with a variety of technologies on GCP. When you need to run your jobs with Dataflow, Dataproc and BigQuery and keep deployments, builds and project structure clean and consistent, it's a good idea to use some automation. This is where Airflow comes in handy, with the power of the Docker operator running DAGs on the architecture of your choice. It also provides you with a unified platform with visible status, business logic and scheduling capabilities. This has proven to be both a generic platform and a productive way to deliver what business users need from the data team.
The agenda of the Big Data Technology Warsaw conference was not just full of great presentations, but also featured interesting discussions during the roundtable sessions. Some of the big data experts from our team had the pleasure of leading or joining these discussions.
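The schema evolution point can be illustrated with a small sketch. When a newer reader schema adds a field with a default value, records written with the older schema can still be read: the missing field is filled from the default, and fields the reader does not know are ignored. The code below mimics that rule in plain Python; it does not use the Avro library, and the function name `resolve_record`, the simplified field list and the sample record are hypothetical.

```python
from typing import Any, Dict, List


def resolve_record(
    record: Dict[str, Any], reader_fields: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Project a record onto the reader's fields, filling defaults for missing ones."""
    resolved: Dict[str, Any] = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]  # field added after the record was written
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved


# Usage: version 2 of a made-up schema adds "country" with a default.
reader_v2 = [
    {"name": "user_id"},
    {"name": "country", "default": "unknown"},
]
old_record = {"user_id": 42, "legacy_flag": True}  # written before "country" existed
print(resolve_record(old_record, reader_v2))
# {'user_id': 42, 'country': 'unknown'}
```

This is why adding a field without a default is the kind of change that breaks consumers of historical data, and why a schema registry that checks compatibility before accepting a new schema version is useful.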
Tomasz Żukowski, Data Analyst at GetInData
I had the pleasure of facilitating a roundtable discussion about the end-to-end cloud migration journey during Big Data Technology Warsaw 2021. The participants were a mix of cloud practitioners and people planning a migration, representing both big and small organizations. The discussion touched on various problems, but two main issues emerged throughout the conversation: cost control and vendor lock-in.
We agreed that cost control is important not only during the migration to the cloud, but also afterwards, during the business-as-usual period. Different approaches were presented:
It looks like vendor lock-in is haunting professionals working on migrations, and it might be a real issue. Sometimes you might even face direct costs, like data export or network egress costs (e.g. when exporting data from BigQuery). All sides of the problem have to be considered, and the final decision should be based on risk evaluation.
During migration planning, each component should be thoroughly examined to decide whether it should be migrated to a cloud-native or an open-source solution:
Most importantly, there is no golden rule for such a decision. The reasoning usually depends on the organisation.
In conclusion, we agreed that migration to the cloud is an IT project like any other and, as such, has to be properly planned, monitored during execution and reevaluated if needed.
Arkadiusz Gąsior, Data Engineer at GetInData
I had the opportunity to join the roundtable discussion on how to support the life-cycle of ML models, led by Przemysław Biecek. This was an excellent discussion among ML experts on using open-source tools like MLflow and MLCube, as well as homegrown, industry-optimized solutions. Are big cloud players going to dominate this market, or is there still a future for on-prem solutions? We also discussed the biggest models deployed to production and how the global pandemic affected decision-makers' willingness to invest in quality machine learning.
Klaudia Wachnio, Marketing Manager at GetInData
One of the best discussions that I had the pleasure of attending was the one led by Juliana Araujo from Kambi, Managing a Big Data project - how to make it all work well together? As Juliana mentioned, 85% of Big Data projects fail, according to Gartner's latest research. Why are the statistics so bad if many companies have a great team of engineers, developers, analytics experts and other big data experts on board? During the discussion, participants shared their thoughts and experiences in this field and discussed the key aspects of being successful in big data projects. The most important thing seems to be that the development team and the stakeholders work together to achieve their goals. It’s hard to accept that after many months of work you can end up with good-quality data from the engineering side of the project, but no business value. I guess some of you are now asking how this is possible. Here I should mention the long-known difficulty in communication between the IT team and the business. Stakeholders should understand that they need to put effort into this kind of project, and the engineering team should understand that it cannot leave stakeholders out of the planning process, because otherwise it will never meet their expectations. A good understanding of business needs should be a priority for both teams.
As the participants mentioned, working in an agile way can be a good solution for Big Data projects, because you don't work on the whole system from day one. Your architecture can evolve as the business needs evolve and, most importantly, you can deliver some business value at an early stage of the project.
We already can’t wait for the 8th edition of the event - we hope you can't either :D We don't know yet whether it will be an online or a live event. All we can promise now is that, together with Evention, we will do our best to prepare a conference more exciting and fuller of high-quality presentations than any event you have attended before.
Thank you, and hopefully see you next year!