Avoiding Common Pitfalls in Machine Learning Projects
Getting started with machine learning is easier when you anchor on clean, consolidated data and let analytics surface the right problems to solve. This article walks through a practical, scientific approach to selecting use cases, validating them with your data, and automating decisions with real-time signals.
What you'll learn:
- How a minimal but solid data foundation speeds model development and reduces rework.
- How to use analytics to prioritize ML use cases instead of intuition or guesswork.
- How to apply the scientific method to form hypotheses, test them, and interpret significance.
- How to incorporate technical features (e.g., weather APIs, feature engineering) to enable automation.
Establish a Solid Data Foundation, With Clean Data and Strong Architecture
One of the most common pitfalls is trying to run before you can walk by skipping foundational data work. I’ve seen clients try to pick and choose individual tables from as many as seven different source applications, performing ETL “as needed” to feed an ML model. This creates immense overhead and technical debt every time you plan to create a new model, extending your timeline from weeks to months. A consolidated data warehouse should be the single source of truth that fuels both your analytics and your AI initiatives.
For the purpose of this blog, we’ll look at the example of date formatting. There are upwards of 96 permutations of date formats, not counting variations in separators. It’s not uncommon to see different date formats even within the same table, let alone between tables. If we’re taking tables from seven different source systems, ETL needs to be performed to standardize those dates (among other fields like ZIP code, state or country abbreviation, etc.) in the target warehouse.
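To make that concrete, here’s a minimal sketch of the standardization step, assuming pandas and a short list of formats your source systems are known to emit; the column name and format list are illustrative, not from a real schema.

```python
import pandas as pd

# Rows as they might arrive from several source systems.
raw = pd.DataFrame({
    "order_date": ["01/31/2024", "2024-01-31", "31.01.2024", "Jan 31, 2024"],
})

# The date formats we know our source systems emit.
KNOWN_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%d.%m.%Y", "%b %d, %Y"]

def standardize_date(value):
    """Try each known format; flag anything unparsed instead of guessing."""
    for fmt in KNOWN_FORMATS:
        try:
            return pd.to_datetime(value, format=fmt)
        except (ValueError, TypeError):
            continue
    return pd.NaT  # route to a data-quality queue for review

raw["order_date_std"] = raw["order_date"].map(standardize_date)
print(raw)
```

Note the design choice: anything that doesn’t match a known format is flagged rather than silently coerced, so ambiguous dates (is 01/02/2024 January 2 or February 1?) get human review instead of corrupting the warehouse.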
If we are picking and choosing the individual tables we need for specific ML models, this ETL needs to be performed every time, meaning there’s no advantage to having a consolidated warehouse until we have done enough ML projects that every table has been moved to the target warehouse. The best-case scenario is to have standardized data formats from the get-go in our source and target systems, but that’s not realistic for most companies. The next best thing we can do is create ETL pipelines that move and cleanse our data between the source and the target repositories as close to real time as possible. This is complex work that requires quite a bit of effort up front, but our ML models can then use that consolidated warehouse, with its already-standardized data, immediately. Regardless of when you do it, there’s no skipping the ETL process; it’s simply up to you to decide your strategy and where you pay the cost.
This is just a brief example of the cost of unstandardized data. You can read in-depth about the importance of a clean data model here: From Chaos to Clarity: How to Clean up Your Data Model as well as the different stages of data maturity and how to know when you’re ready for ML solutions here: Data Maturity Takes Time: Don't Focus on the Summit to Scale the Mountain.
Let Analytics Guide Your Use Case Before You Start
It can be tempting to use AI as a market differentiator by selecting a use case based on intuition or company folklore. Everyone else is doing it; why not you? You can imagine how much more inefficient it will be if you’re not only fishing for a use case but also transforming the data you think might apply as needed.
The reality of this approach is inefficiently searching for correlations one by one until you stumble on a connection. Instead, use data analytics to find these correlations before developing a use case. If not all the relevant data is represented, important correlations can’t emerge: you may have a factor you are completely unaware of that underpins several other dependent factors, and you would only ever discover it by chance. In practice, Exploratory Data Analysis (EDA) techniques help uncover such patterns. Tools such as correlation heatmaps, univariate/multivariate analysis, feature interactions, and importance scoring can reveal these relationships and ensure your use case is built on real evidence instead of assumptions.
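A correlation heatmap is often the cheapest first look. Here’s a minimal sketch, assuming your consolidated warehouse data has already been loaded into a DataFrame; the file name is a placeholder.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical extract from the consolidated warehouse.
df = pd.read_csv("warehouse_extract.csv")

# Pairwise correlations across the numeric columns; strong
# off-diagonal values are candidates for a use case worth testing.
corr = df.select_dtypes("number").corr()

sns.heatmap(corr, annot=True, cmap="coolwarm", center=0)
plt.title("Pairwise correlations across the warehouse extract")
plt.tight_layout()
plt.show()
```

Because this runs across every numeric column at once, it can surface the “factor you are completely unaware of” that a one-hypothesis-at-a-time search would miss.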
Think Like a Scientist
“Science” is the key word in “Data Science.” By replacing our assumptions with hypotheses, we can find greater success. For any project, following the scientific method provides a disciplined framework for discovery.
What does it look like to practically apply this to our data? Let’s break it down.

Observation
Let your data lead you. For example, while reviewing a sales dashboard for a pet store, you notice that kitty litter sales spike in the winter. This is a good place to start asking why.
Hypothesis
Frame a testable question. You might form a hypothesis: (H1) the sales increase correlates with in-store holiday adoption events. A null hypothesis (H0) is always present, stating there is no relationship between the variables. A hypothesis that doesn’t pass the threshold of statistical significance still gives you information: you have no evidence that these two factors are related, so look elsewhere!
A note on statistical significance: what determines statistical significance is the p-value. A very small p-value means the observed outcome would be very unlikely under the null hypothesis, so we can safely reject the null hypothesis in favor of the alternate hypothesis we are testing. A p-value of 0.05 or less is generally the threshold for statistical significance, meaning there is less than a 5% probability of seeing results at least this extreme if the null hypothesis (that there is no relationship) were indeed true.
Experimentation
The good news is that most companies have been tracking data long enough that they don’t need to run new experiments. The data is already available, and we can explore the correlation (not necessarily the cause and effect) between different variables. We can use existing data to test the hypotheses. In our example, historical data quickly shows no significant correlation between litter sales and adoption events, leaving H1 unsupported. This is still valuable information, as it tells you to look elsewhere without wasting significant time or resources.
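Here’s a hedged sketch of what that test might look like, assuming weekly litter sales and adoption-event counts are already consolidated in one table; the file and column names are illustrative.

```python
import pandas as pd
from scipy import stats

# Hypothetical table: one row per week, already consolidated.
weekly = pd.read_csv("weekly_sales_and_events.csv")

# Pearson correlation between litter sales and adoption events,
# with the p-value for the null hypothesis of no relationship.
r, p_value = stats.pearsonr(weekly["litter_sales"], weekly["adoption_events"])
print(f"r = {r:.2f}, p-value = {p_value:.3f}")

if p_value < 0.05:
    print("Statistically significant: worth investigating further.")
else:
    print("No evidence of a relationship: fail to reject H0, look elsewhere.")
```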
Analysis and Conclusion
The Scientific Method is cyclical. As we disprove hypotheses, we return to the research phase to look for other possibilities. In this case, we have a few years’ historical data, and we find that the spike isn’t always in January; some years it falls in December or February, and there are no similar spikes in summer or fall.
My second hypothesis (H2) is that winter itself affects litter sales. That makes this a time-series problem with a seasonal dimension, which you can explore further in this whitepaper on time series. Nothing in our current analytics disproves this hypothesis, so we start building our model and looking for relationships between different features.
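A seasonal decomposition is a natural first look at H2, separating trend from seasonality. This is a minimal sketch using statsmodels, with hypothetical file and column names.

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly sales history, indexed by month.
sales = (
    pd.read_csv("monthly_litter_sales.csv", parse_dates=["month"])
      .set_index("month")["units_sold"]
)

# period=12 for monthly data with an annual cycle; a strong winter
# pattern should show up clearly in the seasonal component.
result = seasonal_decompose(sales, model="additive", period=12)
result.plot()
plt.show()
```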
Our preliminary results show a slight positive correlation between the winter months and increased litter sales, but it is not statistically significant. After adding the store event calendar as an additional feature to check for other coinciding events, we find a statistically significant correlation between increased litter sales and the weeks when stores had an emergency closure. The underlying “why” is that customers are stocking up on litter to use for tire traction.
Application and Automation
Now that you have analyzed the data and found the root cause, you can make AI work for you. By bringing real-time data from the National Weather Service API into your model, you can automate ordering extra litter to meet demand whenever heavy snow is in the 10-day forecast.
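Here’s a hedged sketch of that check using the public National Weather Service API (api.weather.gov). The store coordinates, the “heavy snow” keyword match, and the reorder hook are all illustrative assumptions, and note that the standard NWS forecast covers roughly a week of day/night periods rather than a strict 10 days.

```python
import requests

LAT, LON = 44.98, -93.27  # hypothetical store location
HEADERS = {"User-Agent": "litter-demand-bot (ops@example.com)"}  # NWS asks for one

# The points endpoint maps coordinates to this location's forecast URL.
points = requests.get(
    f"https://api.weather.gov/points/{LAT},{LON}", headers=HEADERS, timeout=10
).json()
forecast = requests.get(
    points["properties"]["forecast"], headers=HEADERS, timeout=10
).json()

def place_extra_litter_order():
    # Hypothetical hook into your ordering system.
    print("Reorder triggered: heavy snow in the forecast.")

# Scan the day/night forecast periods for heavy-snow language.
if any(
    "heavy snow" in period["detailedForecast"].lower()
    for period in forecast["properties"]["periods"]
):
    place_extra_litter_order()
```

Run on a daily schedule, a script like this turns the insight from your analysis into an automated decision.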
Methods to Go from Hypothesis to Production
The journey from a hypothesis to an automated, production-ready model involves several critical technical steps. But, and I cannot emphasize this enough, you must start with clean data and established infrastructure. Without the right groundwork, you will encounter obstacles that drive up your costs.
- Feature Engineering: This is the process of creating new input variables (features) for your model from raw data. In our example, this could mean creating a binary feature like is_winter_month or heavy_snow_forecast, or a numeric feature like days_since_last_storm. Well-engineered features are often more important than the model algorithm itself for achieving high accuracy.
- Model Selection, Training, and Validation: Based on the problem, you would select a model type or even benchmark different families of models. For predicting litter sales, a time-series model (like ARIMA) or a regression model (like XGBoost) could work. To ensure the model is accurate and not just memorizing the training data (overfitting), use validation techniques like a train-test split or k-fold cross-validation, as sketched after this list.
- MLOps and Automation: Deploying a model is not the final step. You need a system to run predictions automatically (e.g., a scheduled job that calls the weather API daily), monitor model performance for degradation, and a process to retrain the model on new data. This operational framework is often called MLOps.
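Here is that sketch, tying the feature engineering and validation steps together. The column names, the snow-forecast source, and the model choice (scikit-learn’s gradient boosting standing in for XGBoost) are illustrative assumptions, not a definitive implementation.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Hypothetical weekly table with raw inputs already consolidated.
df = pd.read_csv(
    "weekly_litter_sales.csv", parse_dates=["week_start", "last_storm_date"]
)

# Feature engineering: derive model inputs from the raw columns.
df["is_winter_month"] = df["week_start"].dt.month.isin([12, 1, 2]).astype(int)
df["heavy_snow_forecast"] = (df["snow_inches_forecast"] >= 6).astype(int)
df["days_since_last_storm"] = (df["week_start"] - df["last_storm_date"]).dt.days

X = df[["is_winter_month", "heavy_snow_forecast", "days_since_last_storm"]]
y = df["units_sold"]

# TimeSeriesSplit keeps folds in chronological order, so the model is
# never validated on data that precedes its training window.
model = GradientBoostingRegressor(random_state=42)
scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5), scoring="r2")
print(f"mean R^2 across folds: {scores.mean():.2f}")
```

For time-ordered sales data, an ordered split like this is safer than a shuffled k-fold, which would leak future information into training.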
Working with data is detective work. Cat litter may seem like a low-stakes example, but the process is the same for any business problem. The scientific method gives us a disciplined way to find the real story your data is telling.
How DI Squared Helps You Become a Data Detective
DI Squared offers data strategy, engineering, and analytics services to help your business grow. Contact us to book your complimentary 1:1 with one of our specialists.