Data Analysis and Decision-Making

Explore top LinkedIn content from expert professionals.

-

It took me 10 years to learn about the different types of data quality checks; I'll teach them to you in 5 minutes:

1. Check table constraints

The goal is to ensure your table's structure is what you expect:
* Uniqueness
* Not null
* Enum check
* Referential integrity

Ensuring the table's constraints is an excellent way to cover your data quality bases (see the sketch after this post).

2. Check business criteria

Work with the subject matter expert to understand what data users check for:
* Min/max permitted value
* Order-of-events check
* Data format check, e.g., check for the presence of the '$' symbol

Business criteria catch data quality issues specific to your data/business.

3. Table schema checks

Schema checks ensure that no inadvertent schema changes happened:
* Using an incorrect transformation function, leading to a different data type
* Upstream schema changes

4. Anomaly detection

Metrics change over time; ensure it's not due to a bug.
* Check the percentage change of metrics over time
* Use simple percentage change across runs
* Use standard deviation checks to ensure values are within the "normal" range

Detecting value deviations over time is critical for business metrics (revenue, etc.).

5. Data distribution checks

Ensure your data size remains similar over time.
* Ensure row counts remain similar across days
* Ensure critical segments of data remain similar in size over time

Distribution checks help you catch data lost or duplicated by faulty joins/filters.

6. Reconciliation checks

Check that your output has the same number of entities as your input.
* Check that your output didn't lose data due to buggy code

7. Audit logs

Log the number of rows input and output for each "transformation step" in your pipeline.
* Having a log of the number of rows going in & coming out is crucial for debugging
* Audit logs can also help you answer business questions

Debugging data questions? Look at the audit log to see where data duplication/dropping happens.

DQ warning levels: Make sure your data quality checks are tagged with appropriate warning levels (e.g., INFO, DEBUG, WARN, ERROR). Based on the criticality of the check, you can block the pipeline.

Get started with the business and constraint checks, adding more only as needed. Before you know it, your data quality will skyrocket! Good luck!

Like this thread? Read about the types of data quality checks in detail here 👇
https://lnkd.in/eBdmNbKE

Please let me know what you think in the comments below. Also, follow me for more actionable data content. #data #dataengineering #dataquality
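For readers who want to try check 1 right away, here is a minimal, framework-free sketch in pandas. The orders/customers tables and their columns are hypothetical, invented purely for illustration; swap in your own schema.

import pandas as pd

# Hypothetical tables for illustration; real checks would run against your warehouse.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 3],                                # duplicate on purpose
    "customer_id": [10, 11, None, 12],                       # null on purpose
    "status": ["placed", "shipped", "refunded", "lost???"],  # bad enum on purpose
})
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

failures = []

# Uniqueness: every order_id should appear exactly once.
if not orders["order_id"].is_unique:
    failures.append("order_id is not unique")

# Not null: customer_id must always be populated.
if orders["customer_id"].isna().any():
    failures.append("customer_id contains nulls")

# Enum check: status must come from a fixed set of values.
allowed_statuses = {"placed", "shipped", "refunded"}
if not orders["status"].isin(allowed_statuses).all():
    failures.append("status contains values outside the allowed enum")

# Referential integrity: every customer_id must exist in the customers table.
if not orders["customer_id"].dropna().isin(customers["customer_id"]).all():
    failures.append("customer_id references a missing customer")

print(failures or "all constraint checks passed")

Collecting failures in a list (instead of asserting on the first one) lets a single run report every broken constraint, which is what you want when a check is wired into a pipeline alert.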
-
Last week, I posted about data strategies' tendency to focus on the data itself, overlooking the (data-driven) decisioning process. All is not lost. First, it is appropriate that the majority of the focus remains on the supply of high-quality #data relative to the perceived demand for it through the lenses of specific use cases. But there is an opportunity to complement this by addressing the decisioning process itself.

7 initiatives you can consider:

1) Create a structured decision-making framework that integrates data into the strategic decision-making process. This is a reusable framework that explains, across a variety of scenarios, how decisions should be made. Intuition is not inherently a bad thing, but the framework raises awareness of its limitations and of the role of data in overcoming them.

2) Equip leaders with the skills to interpret and use data effectively in strategic contexts. This can include offering training programs focusing on data literacy, decision-making biases, hypothesis development, and data #analytics techniques tailored for strategic planning. A light version could be an on-demand training.

3) Improve your #MI systems and dashboards to provide real-time, relevant, and easily interpretable data for strategic decision-makers. If data is to play a supporting role to intuition in a number of important scenarios, then at least that data should be available and reliable.

4) Encourage a #dataculture, including in the top executive tier. This is the most important and all-encompassing recommendation, but at the same time the least tactical and tangible. Promote the use of data in strategic discussions, celebrate data-driven successes, and create forums for sharing best practices.

5) Integrate #datascientists within strategic planning teams. Explore options to assign them to work directly with executives on strategic initiatives, providing data analysis, modeling, and interpretation services as part of the decision-making process.

6) Make decisioning a formal pillar of your #datastrategy alongside common existing ones like data architecture, data quality, and metadata management. Develop initiatives and goals focused on improving decision-making processes, including training, tools, and metrics.

7) Conduct strategic data reviews to evaluate how effectively data was used. Avoid being overly critical of the decision-makers; the goal is to refine the process, not question the decisions themselves. Consider what data could have been sought at the time to validate or challenge the decision.

Both data and intuition have roles to play in strategic decision-making. No leap in data or #AI will change that. The goal is to balance the two, which requires investment in the decision-making process to complement the existing focus on the data itself.

Full POV ➡️ https://lnkd.in/e3F-R6V7
-
I used to struggle with getting my tech projects approved until I learned to present their benefits as an irresistible offer.

𝗪𝗵𝘆 𝗺𝘂𝘀𝘁 𝘆𝗼𝘂 𝗾𝘂𝗮𝗻𝘁𝗶𝗳𝘆 𝘆𝗼𝘂𝗿 𝗽𝗿𝗼𝗷𝗲𝗰𝘁 𝗿𝗲𝗾𝘂𝗲𝘀𝘁𝘀?

- 𝗚𝗲𝘁 𝗔𝗵𝗲𝗮𝗱: Using data means you're 23 times more likely to acquire customers, 6 times as likely to retain them, and 19 times as likely to be profitable. (McKinsey)
- 𝗠𝗼𝗿𝗲 𝗪𝗶𝗻𝘀: Top teams - who finish >80% of their projects on time, on budget, and meeting original goals - are 2.5 times more likely to use quantitative management techniques. (PMI)
- 𝗕𝗼𝗼𝘀𝘁 𝗖𝗼𝗻𝗳𝗶𝗱𝗲𝗻𝗰𝗲: Clear numbers and ROI make 60% of stakeholders more confident, leading to faster approvals and more robust support throughout the project lifecycle. (Gartner)

What steps are you taking to demonstrate the value of your tech project? I've got a 5-step plan that'll make your project impossible to refuse.

𝟭. 𝗣𝗶𝗻𝗽𝗼𝗶𝗻𝘁 𝗬𝗼𝘂𝗿 𝗩𝗮𝗹𝘂𝗲 𝗗𝗿𝗶𝘃𝗲𝗿𝘀 📌
What makes your project shine? List every benefit. Increased revenue? Cost savings? Improved efficiency? Group these gems into clear categories.

𝟮. 𝗚𝗮𝘁𝗵𝗲𝗿 𝗖𝗼𝗺𝗽𝗲𝗹𝗹𝗶𝗻𝗴 𝗘𝘃𝗶𝗱𝗲𝗻𝗰𝗲 🔍
Collect data that will make your pitch rock-solid. Internal reports, market trends, industry benchmarks - get it all. Relevant, fresh data is your best friend.

𝟯. 𝗖𝗿𝘂𝗻𝗰𝗵 𝘁𝗵𝗲 𝗡𝘂𝗺𝗯𝗲𝗿𝘀 🧮
Time to flex those analytical muscles. ROI, NPV, payback period - calculate it all (a worked example follows this post). Solid financials turn skeptics into believers.

𝟰. 𝗔𝗻𝘁𝗶𝗰𝗶𝗽𝗮𝘁𝗲 𝗮𝗻𝗱 𝗔𝗱𝗱𝗿𝗲𝘀𝘀 𝗥𝗶𝘀𝗸𝘀 🛡️
Every great plan needs a reality check. What could derail your project? List potential risks. Then, craft strategies to neutralize each one.

𝟱. 𝗣𝗿𝗲𝘀𝗲𝗻𝘁 𝘄𝗶𝘁𝗵 𝗣𝗿𝗲𝗰𝗶𝘀𝗶𝗼𝗻 𝗮𝗻𝗱 𝗣𝗼𝘄𝗲𝗿 💼
Package your project in a compelling presentation. Use clear visuals and concise explanations. Make it so convincing, they'll wonder how they ever lived without it.

𝙒𝙝𝙮 𝙩𝙝𝙞𝙨 𝙢𝙚𝙩𝙝𝙤𝙙 𝙬𝙤𝙧𝙠𝙨:
- It transforms your tech vision into a business essential.
- It shows you've considered every angle and potential hurdle.
- It gives decision-makers the hard data they need.

In the world of project approvals, vague ideas are like trying to pay with Monopoly money. But a well-prepared, data-driven proposal is gold.

What's your top tip for creating an irresistible project proposal? Share your wisdom below!
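Here is the worked example promised in step 3: a minimal Python sketch of ROI, NPV, and payback period. The $100K cost, the cash flows, and the 10% discount rate are made-up figures; plug in your own project's numbers.

# Illustrative figures only.
initial_cost = 100_000.0
annual_cash_flows = [40_000.0, 45_000.0, 50_000.0]  # years 1-3
discount_rate = 0.10

# ROI: total gain relative to cost.
roi = (sum(annual_cash_flows) - initial_cost) / initial_cost

# NPV: discount each year's cash flow back to today.
npv = -initial_cost + sum(
    cf / (1 + discount_rate) ** year
    for year, cf in enumerate(annual_cash_flows, start=1)
)

# Payback period: first year in which cumulative cash flow covers the cost.
cumulative, payback_year = 0.0, None
for year, cf in enumerate(annual_cash_flows, start=1):
    cumulative += cf
    if cumulative >= initial_cost:
        payback_year = year
        break

print(f"ROI: {roi:.0%}, NPV: ${npv:,.0f}, payback: year {payback_year}")
# -> ROI: 35%, NPV: $11,119, payback: year 3

A positive NPV at your organization's discount rate is usually the number a finance-minded approver looks for first.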
-
I saved this tech company $128,000 in 22 days. Here's how:

6 months back I worked with Overjet. They're the world leader in dental AI. But their marketing department was struggling with Salesforce.

They were using 3 tools for their CRM:
• HubSpot
• Google Sheets
• Salesforce

Because they didn't have enough trust in Salesforce.

Now, they're a mid-market company — with 185+ employees. And they'd just hired a new marketing director. We'll call him "Steve". He was receiving high-level pay, but doing low-level work. It was a poor use of resources, to say the least.

It led to typos and bad data—human error. Hours of time wasted compiling data. Incorrect reporting and analysis. Delayed business decisions. And slow company growth because of the bottlenecks.

They were making gut-feeling decisions instead of following the data... because they knew their data was bad.

Now let's crunch the numbers (the math is sketched after this post):

Steve was wasting 5.5 hrs per day on manual entry. The avg. marketing director makes $186,162/yr. That's an hourly rate of $89.50 down the drain. A whopping $128,557/year wasted.

Again, 5.5 hrs/day wasted on manual entry. That's 110 hrs/month. Adding up to 165 working days/year—LOST.

So, here's what we did to help:

We always start with a discovery process. We asked them questions like: What tools are you using? What platforms are you using?

Then we monitor the process: What does each employee do on a daily basis? Where are they spending the most time? What does their process look like? Why do they do it this way?

We're then able to see what's repeatable. If it's a repeated task, you can delegate it or outsource it. Or we can automate it for you.

Next comes the build:

They were using a tool called Pardot. It's basically HubSpot for Salesforce. It collected their marketing data. For this reason, we integrated their HubSpot with Salesforce. This allowed us to remove their 3rd platform (the manual tracker in Google Sheets). Then we were able to decide: Where do we want to push the data? What field does it go to? We do this for all our clients.

Next comes the testing phase:

A sandbox allows us to make a copy of your info. We run it in a testing environment so we don't "mess anything up". Before anything goes LIVE, we have a User Acceptance Testing (UAT) process. So if you requested something within Salesforce, we have you sign off on it. We want to know—"Okay, this works. I tested it. We're good to go."

That leads us into approval. My green light that everything we've built works. Then we go LIVE. We make everything we've built for you ACTIVE. We test one more time to make sure it runs correctly.

Here are the results we got for Overjet: Without their marketing director tied up, they were able to make more money. They were able to scale—better and faster. For every $1 they paid us, they got $12 back.

P.S. - Are you a mid-market tech startup that needs help automating Salesforce? Let's connect. Jordan Nelson
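For the curious, the post's back-of-the-envelope math checks out to within rounding. A quick sketch, assuming a 2,080-hour work year and roughly 261 working days (the post doesn't state these explicitly):

annual_salary = 186_162          # avg. marketing director salary, from the post
hours_per_year = 2_080           # assumption: 40 hrs/week * 52 weeks
hourly_rate = annual_salary / hours_per_year    # ~$89.50, matching the post

wasted_hours_per_day = 5.5       # from the post
working_days_per_year = 261      # assumption: 365 minus weekends

wasted_per_year = wasted_hours_per_day * working_days_per_year * hourly_rate
print(f"~${wasted_per_year:,.0f}/year")  # ~$128,479, in line with the post's $128,557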
-
Beginner Mistakes in Genomics Data Analysis (And How to Avoid Them)

1/ When I started with genomics, I made mistakes. Here are key lessons I learned the hard way. Don't repeat them. (I first wrote these down 10 years ago.)

2/ Computers make mistakes too. They can produce nonsense results without raising any errors. Always test your code extensively before running large analyses.

3/ Share your code. Even if it works, share it. Others can review it, spot errors, and improve it. Open science benefits everyone.

4/ Make your scripts reusable. Don't hardcode file paths. Instead, use arguments so your script can run on different datasets easily (a minimal sketch follows this post):

python myscript.py --input data.bam --output results.txt

5/ Modularize your code. Genomics data comes in different formats. Avoid one huge script. Instead, split it into logical steps.

Example: ChIP-seq analysis
• Module 1: Fastq → BAM
• Module 2: BAM → Peaks

If someone has BAM files, they can skip Module 1.

6/ Comment your code heavily. It helps others understand your logic and helps you six months later when you forget what you did.

7/ Make your analysis reproducible. Document every step in a markdown file:
• Every command you run
• Intermediate files
• Where & when you downloaded data

It will save you (and others) from frustration later.

8/ Key Takeaways
• Code fails silently—test it
• Share & document your work
• Avoid hardcoding & modularize scripts
• Keep everything reproducible

9/ Action Item: Start documenting every step today! Future-you will thank you.

More tips: https://lnkd.in/eiJu7rR7

I hope you've found this post helpful. Follow me for more. Subscribe to my FREE newsletter https://lnkd.in/erw83Svn
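Here is the minimal sketch promised in point 4, using Python's standard-library argparse. The script name and the pass-through "processing" are placeholders; the point is that the paths come from the command line, not from the code.

#!/usr/bin/env python
"""myscript.py -- a toy template for argument-driven scripts (no hardcoded paths)."""
import argparse

def main():
    parser = argparse.ArgumentParser(description="Reusable analysis script")
    parser.add_argument("--input", required=True, help="path to the input file")
    parser.add_argument("--output", required=True, help="path for the results file")
    args = parser.parse_args()

    # Because paths are arguments, the same script runs unchanged on any dataset.
    with open(args.input, "rb") as src, open(args.output, "wb") as dst:
        dst.write(src.read())  # placeholder: replace with your real processing

if __name__ == "__main__":
    main()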
-
A common mistake I’ve seen from analysts, junior data scientists, and business partners is mistaking statistical significance for causality. We’ve all heard the mantra “correlation is not causation,” yet stats such as the uncontrolled college wage gap are commonly described as causal. Stakeholders and analysts may see significant results from a regression estimate and not recognize that the results may simply reflect correlations. To help understand this important concept, I generated hypothetical data on product sales, where the quantity sold is positively driven by “quality” but negatively driven by price. Plots and regression estimates that ignore or cannot accurately measure quality suffer from “omitted variable bias,” which in some cases can show statistically significant relationships that aren’t even directionally accurate (that is, the estimated result is positive despite the true relationship being negative). I share code with an accompanying tutorial here: https://lnkd.in/eUz456Hi #datascience #datasciencetutorial #dataanalyst #dataanalytics #datascientist
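The author's own code and tutorial are at the link above. As an independent, minimal sketch of the same phenomenon (all coefficients invented for illustration), here is a numpy simulation in which the naive price coefficient comes out positive and, at n = 10,000, highly significant, even though the true effect of price on sales is negative:

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Latent quality drives both price and sales.
quality = rng.normal(size=n)
price = 2.0 * quality + rng.normal(size=n)                 # better products cost more
sales = 3.0 * quality - 1.0 * price + rng.normal(size=n)   # true price effect: -1

# Naive regression of sales on price alone (quality omitted).
naive_slope = np.cov(sales, price, ddof=1)[0, 1] / np.var(price, ddof=1)

# Controlled regression that includes quality.
X = np.column_stack([np.ones(n), price, quality])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)

print(f"naive price coefficient:      {naive_slope:+.2f}")  # ~ +0.2 (wrong sign!)
print(f"controlled price coefficient: {beta[1]:+.2f}")      # ~ -1.0 (recovers truth)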
-
🛑 Here are the top mistakes I made as a data scientist fresh out of college, and the lessons I learned from them:

👉 Ignoring Business Context: One of the most significant mistakes I made was ignoring the business context of the organization I was working for. I was so focused on the technical aspects of the job that I forgot about the business goals and objectives. Approaching a problem with a value-delivery-focused mindset is a game-changer!

👉 Overcomplicating Data: While data preparation is an integral part of the job, I realized that I was overcomplicating things and wasting time on work that didn't matter. I learned to simplify the process by prioritizing important features to save time and deliver insights quickly.

👉 Not Communicating Effectively: Data scientists need to be able to communicate complex findings and models to non-technical stakeholders. Effective communication is key to gaining credibility and earning buy-in from decision-makers.

👉 Using Complex Models without Justification: New data scientists can often get excited about using the latest models, regardless of their level of complexity. However, it is important to justify the use of a model based on the business problem and the data available. If there is no clear reason to use a complex model, it is better to use a simpler one.

👉 Not Testing Assumptions: I used to make assumptions about my data without testing them. This can lead to incorrect conclusions and incorrect solutions to business problems. It's important to test assumptions and make sound inferences based on the data.

Here are some helpful tips that I learned:

✔ Define the Problem: Start by defining the problem and why it matters to the business. This will help you stay focused on the end goal and develop solutions that align with business objectives.

✔ Continuously Learn: The field of data science is continually evolving, so it's essential to stay up to date on the latest tools and techniques. Take online courses, attend conferences, and participate in local meetups to expand your knowledge base.

✔ Collaborate: Data science is a collaborative field, so seek out opportunities to work with others. Collaborating with professionals who have different skill sets and perspectives can help you see problems from different angles and arrive at better solutions.

✔ Tell a Story with Data: Visualizing data can help tell a story, making complex data more accessible to stakeholders. Developing skills in data visualization can help you communicate your findings effectively.

✔ Focus on Impact: Always keep in mind the impact of your work on the business and end users. Understanding the impact of your work can help you make better decisions and prioritize your time and resources.

Follow Kavana Venkatesh for more such content. Book a 1:1 call with me for any support in your AI journey using the link in my profile.

#datascience #ai #nlp #deeplearning #computervision #datatips #communication #leadership
-
Not sure where to start with #DecisionIntelligence? Don't worry - you don't need to toss out all the hard work your team has already put into building models and dashboards. That work takes real skill and effort, and it's important. The question now is: how well does it connect to the decision-making process of your stakeholders or customers?

Here's something I've learned through trial & error: I've built dozens of models and productionalized them in dashboards and CRMs over the years, and while each felt like a win at the time, whether or not they actually influenced a business decision was often anecdotal at best... and that's not good enough.

Here are two steps you can take right now to uncover real DI opportunities that can be prioritized in the new year:

1️⃣ Look at your existing production models. Where are the results going? Are they ending up in dashboards, emails, spreadsheets... or just sitting idle? Have you integrated the output into your stakeholder's decision-making process, or are you expecting them to figure it out? Embedding insights directly into the decision-making process is the key to unlocking real value.

2️⃣ Now, check your dashboards. What decisions are they meant to guide? Do they go beyond providing a prediction or forecast to actually suggest what to do next... or at least highlight the decisions that need to be made? Or... are they more like a beautifully presented buffet of insights, where you're hoping someone in line feels inspired to grab a plate?

Closing the loop from data to outcomes isn't easy, but that's where DI can make all the difference. It ensures the right insights reach the right people, at the right time, in the right way (whether it's to guide or automate decisions) while capturing the outcomes that enable you to continuously improve the ecosystem.

You and your team have already put in the hard work. Now let's make sure it has the impact it deserves. What decisions should your models and dashboards be guiding? Let's chat!

#DataScience #Analytics #DecisionMaking #DI #Leadership #Innovation #DecisionProcessEngineering #AI #ML #Data #MLOps #ROI #GenAI #AgenticAI
-
Today, I would like to share a common problem I have encountered in my career: *Broken Data Pipelines*. They disrupt critical decision-making processes, leading to inaccurate insights, delays, and lost business opportunities.

In my view, the major reasons for these failures are:

1) Data Delays or Loss
Incomplete data due to network failures, API downtime, or storage issues, leading to reports and dashboards showing incorrect insights.

2) Data Quality Issues
Inconsistent data formats, duplicates, or missing values, leading to compromised analysis.

3) Version Mismatches
Surprise updates to APIs, schema changes, or outdated code, leading to mismatched or incompatible data structures in the data lake or database.

4) Lack of Monitoring
No real-time monitoring or alerts, leading to delayed detection of issues.

5) Scalability Challenges
Pipelines that cannot handle increasing data volumes or complexity, leading to slower processing times and potential crashes.

Over time, Team Quilytics and I have identified and implemented strategies to overcome this problem using simple yet effective techniques:

1) Implement Robust Monitoring and Alerting
We leverage tools like Apache Airflow, AWS CloudWatch, or Datadog to monitor pipeline health and set up automated alerts for anomalies or failures.

2) Ensure Data Quality at Every Step
We have implemented data validation rules to check data consistency and completeness (a minimal sketch follows this post). Tools like Great Expectations work wonders for automating data quality checks.

3) Adopt Schema Management Practices
We use schema evolution tools and version control for databases. Regularly testing pipelines against new APIs or schema changes in a staging environment helps us stay ahead of the game 😊

4) Scale with Cloud-Native Solutions
Leveraging cloud services like Amazon Web Services (AWS) Glue, Google Dataflow, or Microsoft Azure Data Factory to handle scaling is very worthwhile. We also use distributed processing frameworks like Apache Spark for handling large datasets.

Key Takeaways
Streamlining data pipelines involves proactive monitoring, robust data quality checks, and scalable designs. By implementing these strategies, businesses can minimize downtime, maintain reliable data flow, and ensure high-quality analytics for informed decision-making.

Would you like to dive deeper into these techniques and the examples we have implemented? If so, reach out to me at shikha.shah@quilytics.com
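Here is the minimal sketch promised in point 2. It is deliberately framework-free, using plain pandas to show the idea (in practice you would wire a tool like Great Expectations into the pipeline); the event_id/amount/event_date columns and the sample batch are hypothetical.

import pandas as pd

def validate(df: pd.DataFrame) -> list:
    """Return a list of data quality failures; an empty list means the batch passes."""
    issues = []
    if df.empty:
        issues.append("batch is empty")
    if df.duplicated(subset=["event_id"]).any():
        issues.append("duplicate event_id rows")
    if df["amount"].isna().any():
        issues.append("missing amount values")
    if pd.to_datetime(df["event_date"], errors="coerce").isna().any():
        issues.append("unparseable event_date values")
    return issues

# A deliberately dirty batch to show the checks firing.
batch = pd.DataFrame({
    "event_id": [1, 2, 2],
    "amount": [9.99, None, 5.00],
    "event_date": ["2024-01-01", "2024-01-02", "not-a-date"],
})

problems = validate(batch)
if problems:
    raise ValueError(f"Blocking pipeline run: {problems}")  # fail fast and alert

Raising on failure (rather than logging and continuing) is what lets an orchestrator like Airflow mark the task failed and trigger the alerts described in point 1.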
-
I see bad data insight discovery practices daily. But they're easily fixed. Here are 3 tips to fix bad data insights:

Tip 1: Start with a clear question.
What to stop doing: Diving into data without a specific goal.
What to do instead: Begin with a well-defined business question.
Why that's the better way: A clear question focuses your analysis and saves time.
Why it works: It ensures you're solving the right problem, leading to actionable insights.
Example: Instead of asking, "What's our sales trend?" ask, "How did our sales trend change after the last campaign?"
Example: Replace "What's happening with our customers?" with "Which customer segments show the highest churn?"
Example: Swap "How's our product doing?" for "What's driving product X's recent growth?"
Quick summary: Start with a clear question, and your insights will have direction and purpose.

Tip 2: Clean your data before analysis.
What to stop doing: Ignoring data quality issues and rushing into analysis.
What to do instead: Dedicate time to clean, organize, and validate your data (a pandas sketch follows this post).
Why that's the better way: Clean data ensures accurate and reliable results.
Why it works: Garbage in, garbage out. Quality data leads to quality insights.
Example: Before analyzing, remove duplicates and correct errors in your dataset.
Example: Standardize date formats and fix missing values to avoid skewed results.
Example: Ensure consistency in categorical variables (e.g., "NY" vs. "New York").
Quick summary: Clean data is the foundation for meaningful analysis.

Tip 3: Visualize your findings effectively.
What to stop doing: Overloading stakeholders with complex charts and tables.
What to do instead: Use simple, clear visuals that tell a story.
Why that's the better way: Visuals should highlight insights, not overwhelm the audience.
Why it works: People grasp information faster through visuals, leading to better decision-making.
Example: Use a bar chart to show sales growth across regions instead of a cluttered spreadsheet.
Example: Replace a dense pie chart with a simple line graph to show trends over time.
Example: Use color sparingly to emphasize key points, not to decorate.
Quick summary: Effective visuals turn data into compelling narratives.

Takeaway: My clients are always amazed by the level of detail I go into when fixing their data insight processes, thanks to the integration of an advanced AI with powerful BI capabilities.

Every question matters. Every data point, every chart, every analysis. Every discovery or insight.

Remember: Quality insights come from clear questions, clean data, and effective visuals. Get it wrong, and you'll waste time on irrelevant data. Get it right, and with AI-driven BI tools, you'll uncover insights that drive meaningful decisions faster than ever.
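As promised in Tip 2, a minimal pandas sketch of the cleaning steps. The customer table and its columns are made up; note that standardizing values should come before deduplicating, since the "NY" and "New York" rows only become duplicates after standardization.

import pandas as pd

raw = pd.DataFrame({
    "customer": ["Acme", "Acme", "Globex"],
    "state": ["NY", "New York", "CA"],                        # inconsistent labels
    "signup_date": ["2024-01-05", "2024-01-05", "2024-02-10"],
    "revenue": [1200.0, 1200.0, None],                        # missing value
})

clean = (
    raw.assign(
        # Standardize categorical variables ("New York" -> "NY").
        state=lambda d: d["state"].replace({"New York": "NY"}),
        # Standardize dates into a proper datetime dtype.
        signup_date=lambda d: pd.to_datetime(d["signup_date"]),
        # Make missing values explicit instead of letting them skew results.
        revenue=lambda d: d["revenue"].fillna(0.0),
    )
    .drop_duplicates()  # the two "Acme" rows are duplicates only after standardizing
)
print(clean)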