In today's data-driven world, maintaining the quality and integrity of your data is paramount. Ensuring that organizations' datasets are accurate, consistent and complete is crucial for effective decision-making and operational efficiency. Our upcoming eBook, "Data Quality No-Code Automation with AWS Glue DataBrew: A Proof of Concept," provides practical strategies and tools to help you achieve top-notch data quality.
In this blog post, we're excited to share a preview from our eBook that guides you through creating data quality rules in AWS Glue DataBrew, using HR datasets as an example to enhance their reliability. Following these steps ensures your data is clean, consistent and ready for analysis.
Stay tuned for the release of our eBook, and don't miss out - sign up now to join the waiting list and be among the first to access this valuable resource.
In modern data architecture, the adage "garbage in, garbage out" holds true, emphasizing the critical importance of data quality in ensuring the reliability and effectiveness of analytical and machine-learning processes. Challenges arise from integrating data from diverse sources, encompassing issues of volume, velocity and veracity. Therefore, while unit testing applications is commonplace, ensuring the veracity of incoming data is equally vital, as it can significantly impact application performance and outcomes.
The introduction of data quality rules in AWS Glue DataBrew addresses these challenges head-on. DataBrew, a visual data preparation tool tailored for analytics and machine learning, provides a robust framework for profiling and refining data quality. Central to this framework is the concept of a "ruleset", a collection of rules that compare various data metrics against predefined benchmarks.
Utilize AWS Glue DataBrew to establish a comprehensive set of data quality rules tailored to the organization's specific requirements. These rules will encompass various aspects such as missing or incorrect values, changes in data distribution affecting ML models, erroneous aggregations impacting business decisions and incorrect data types with significant repercussions, particularly in financial or scientific contexts.
Employ DataBrew's intuitive interface to create and deploy rulesets, consolidating the defined data quality rules into actionable entities. These rulesets serve as a foundation for automating data quality checks and ensuring adherence to predefined standards across diverse datasets. We discuss all these steps and explain them step-by-step in the eBook.
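Although DataBrew is a no-code tool, the same ruleset can also be defined programmatically through the AWS SDK. The boto3 sketch below is illustrative only: the ruleset name, dataset ARN and the starter rule are assumptions, and the rules discussed in the following sections can be appended to the Rules list before the call.

```python
# A minimal sketch of creating a DataBrew ruleset with boto3, the SDK
# counterpart of the console workflow. Names, the ARN and the starter rule
# are placeholders; verify expression keywords against the DataBrew docs.
import boto3

databrew = boto3.client("databrew")

rules = [
    {
        "Name": "DatasetNotEmpty",
        "Disabled": False,
        "CheckExpression": "ROWS_COUNT > :val1",  # dataset-level metric (assumed keyword)
        "SubstitutionMap": {":val1": "0"},
    }
]
# Rules from the sections below can be appended to this list before the call.

response = databrew.create_ruleset(
    Name="hr-data-quality-ruleset",                 # hypothetical ruleset name
    Description="Data quality rules for the HR dataset",
    TargetArn="arn:aws:databrew:eu-west-1:123456789012:dataset/hr-dataset",  # placeholder dataset ARN
    Rules=rules,
)
print(response["Name"])  # the API response echoes the name of the created ruleset
```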
After defining the data quality rulesets, the subsequent step involves crafting specific data quality rules and checks to ensure the integrity and accuracy of a dataset - which is the focus of this blog post. AWS Glue DataBrew allows for the creation of multiple rules within a ruleset, and each rule can include various checks tailored to address particular data quality concerns. This flexible structure enables the user to take a comprehensive approach to validating and cleansing data.
In this phase of our PoC, we focus on implementing a set of precise data quality rules and the respective checks that correspond to common data issues often encountered in human resources datasets. These rules are designed not only to identify errors, but also to enforce consistency and reliability across a dataset.
Rule: Ensure the total row count matches the expected figures to verify no data is missing or excessively duplicated.
Accurately verifying the row count in our dataset is essential for ensuring data completeness and reliability. In AWS Glue DataBrew, setting up a rule to confirm the correct total row count ensures that no records are missing or inadvertently duplicated during data processing. This check is crucial for the integrity of any subsequent analyses or operations.
To set up this check, open your designated data quality ruleset in the DataBrew console and add a new rule for the total row count; the detailed, step-by-step configuration is covered in the eBook.
By implementing this rule, you establish a robust verification process for the row count, which plays a critical role in maintaining the data's integrity. It ensures that the dataset loaded into AWS Glue DataBrew is complete and that no data loss or duplication issues affect the quality of your information. This rule is an integral part of our data quality framework, supporting reliable data-driven decision-making.
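For reference, the same row-count check might look like the rule definition below in the SDK-based ruleset sketched earlier. The rule name, the expected count and the ROWS_COUNT keyword are illustrative assumptions; confirm the exact expression syntax against the DataBrew check-expression reference.

```python
# Illustrative rule dict for the total row-count check; the name, the expected
# count and the expression keyword are assumptions, not verified syntax.
row_count_rule = {
    "Name": "TotalRowCountCheck",
    "Disabled": False,
    "CheckExpression": "ROWS_COUNT == :val1",   # dataset-level metric (assumed keyword)
    "SubstitutionMap": {":val1": "1000"},       # expected number of rows in the HR dataset
}
# Append to the rules list before calling create_ruleset, as in the earlier sketch.
```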
Rule: Identify and remove any duplicate records to maintain dataset uniqueness.
Ensuring the uniqueness of data within our dataset is crucial for maintaining the accuracy and reliability of any analysis derived from it. To effectively identify and eliminate any duplicate rows in our dataset, we employ a structured approach within AWS Glue DataBrew. This process involves setting up a specific rule dedicated to detecting duplicates. To begin, access your previously defined data quality ruleset in the DataBrew console. From here, you will add a new rule tailored to address duplicate entries.
By meticulously configuring this rule, we ensure that our dataset is thoroughly scanned for any duplicate entries, and any found are flagged for review or automatic handling, depending on the broader data governance strategies in place. Implementing this rule is a key step towards certifying that our data remains pristine and that all analyses conducted are based on accurate and reliable information.
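As a companion to the console steps, a duplicate-row check could be expressed as the rule below in the SDK-based ruleset. The AGG_DUPLICATE_ROWS_COUNT keyword is an assumption drawn from AWS examples and should be verified in the DataBrew documentation.

```python
# Illustrative rule dict flagging any duplicate rows in the dataset; the
# aggregate keyword is assumed and should be checked against the DataBrew docs.
duplicate_rows_rule = {
    "Name": "NoDuplicateRows",
    "Disabled": False,
    "CheckExpression": "AGG_DUPLICATE_ROWS_COUNT == :val1",  # dataset-level aggregate (assumed)
    "SubstitutionMap": {":val1": "0"},                       # zero duplicate rows tolerated
}
```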
Rule: Confirm that each Employee ID, email address and SSN is unique across all records, preventing identity overlaps.
Beyond whole-row duplicates, certain individual fields must also be unique to each employee. Identifiers such as Employee ID, email address and SSN should never be shared between records, as any overlap points to identity conflicts or faulty data ingestion. To set this up, access your previously defined data quality ruleset in the DataBrew console and add a new rule that checks the uniqueness of each of these columns.
By diligently configuring this rule, you ensure that critical personal and professional identifiers such as Employee ID, email and SSN are uniquely assigned to individual records, enhancing the reliability and accuracy of your dataset. This step is crucial for maintaining the quality of your data and ensuring that all analyses derived from this dataset are based on correct and non-duplicative information.
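Expressed programmatically, the same uniqueness requirement might look like the sketch below. The column names (employee_id, email, ssn) are hypothetical, and the is_unique condition is an assumption based on the DataBrew check-expression grammar; verify the exact condition names before use.

```python
# Illustrative rule dict enforcing uniqueness of key identifiers; column names
# are hypothetical and the "is_unique" condition is assumed, not verified.
unique_identifiers_rule = {
    "Name": "UniqueEmployeeIdentifiers",
    "Disabled": False,
    "CheckExpression": ":col1 is_unique and :col2 is_unique and :col3 is_unique",
    "SubstitutionMap": {
        ":col1": "`employee_id`",  # column references wrapped in backticks (per DataBrew conventions)
        ":col2": "`email`",
        ":col3": "`ssn`",
    },
}
```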
Rule: Employee ID and phone numbers must not contain null values, ensuring complete data for essential contact information.
For the integrity and completeness of our human resources dataset, it is imperative to ensure that certain critical fields, specifically Employee IDs and phone numbers, are always populated. A null value in these fields could indicate incomplete data capture or processing errors, which could lead to inaccuracies in employee management and communication efforts.
By configuring this rule, you ensure that no records in the dataset have null values in the Employee ID and phone number fields, reinforcing the completeness and usability of your HR data. This step is crucial in maintaining high-quality, actionable data that supports effective HR management and operational processes.
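A possible SDK counterpart of this completeness check is sketched below. The AGG_MISSING_VALUES_COUNT keyword and the column names are assumptions; confirm the exact aggregate name in the DataBrew reference.

```python
# Illustrative rule dict requiring zero missing values in the contact columns;
# the aggregate keyword and the column names are assumptions.
required_contact_fields_rule = {
    "Name": "RequiredContactFieldsPopulated",
    "Disabled": False,
    "CheckExpression": "AGG_MISSING_VALUES_COUNT == :val1",  # per-column aggregate (assumed)
    "SubstitutionMap": {":val1": "0"},
    "ColumnSelectors": [          # apply the check to both required columns
        {"Name": "employee_id"},  # hypothetical column names
        {"Name": "phone_number"},
    ],
}
```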
Rule: Employee IDs should be integers, and the age field should not contain negative values, maintaining logical data integrity.
By implementing this rule, you will effectively ensure that Employee IDs are stored as integers and that the age field contains no negative values, thus upholding the logical consistency and reliability of your dataset. This proactive approach to data validation is integral to maintaining the high data quality standards necessary for accurate and reliable HR analytics and operations.
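Programmatically, this pair of sanity checks could be sketched as below. Both conditions (a non-negative comparison on age and an integer-type check on Employee ID), the column names and the threshold semantics (every row must pass) are assumptions based on the DataBrew expression grammar and should be verified against the documentation.

```python
# Illustrative rule dict combining two sanity checks; conditions, column names
# and threshold semantics are assumptions to be verified against the docs.
numeric_integrity_rule = {
    "Name": "NumericFieldSanityChecks",
    "Disabled": False,
    "CheckExpression": ":col1 >= :val1 and :col2 is_integer",
    "SubstitutionMap": {
        ":col1": "`age`",          # hypothetical column name
        ":col2": "`employee_id`",  # hypothetical column name
        ":val1": "0",
    },
    "Threshold": {"Value": 100, "Type": "GREATER_THAN_OR_EQUAL", "Unit": "PERCENTAGE"},
}
```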
There are even more data quality rules to set, but we will extend this topic further in the eBook: "Data Quality No-Code Automation with AWS Glue DataBrew: A Proof of Concept," where we present the entire data quality process. We will demonstrate how to configure and run the profile job, validate data quality and clean the dataset.
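To give a flavour of that validation step: the ruleset is evaluated by attaching it to a DataBrew profile job. The boto3 sketch below is illustrative; the job name, dataset name, IAM role, S3 bucket and ruleset ARN are all placeholders.

```python
# A minimal sketch of running data quality validation: attach the ruleset to a
# profile job and start it. All names, ARNs and the S3 bucket are placeholders.
import boto3

databrew = boto3.client("databrew")

databrew.create_profile_job(
    Name="hr-dataset-profile-job",     # hypothetical job name
    DatasetName="hr-dataset",          # hypothetical DataBrew dataset name
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",  # placeholder IAM role
    OutputLocation={"Bucket": "my-databrew-results"},              # placeholder S3 bucket
    ValidationConfigurations=[
        {
            "RulesetArn": "arn:aws:databrew:eu-west-1:123456789012:ruleset/hr-data-quality-ruleset",
            "ValidationMode": "CHECK_ALL",   # evaluate every rule in the ruleset
        }
    ],
)

# Start the job; the data quality validation report is written alongside the
# profile results in the job's S3 output location.
databrew.start_job_run(Name="hr-dataset-profile-job")
```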
This eBook will be available soon, offering you the insights and tools necessary to maximize the potential of your datasets and more. Ensure your data is accurate, reliable and ready for impactful analysis. Click here to join the waiting list.