Itinerary
- Part I: Introducing the AI Strategy Framework
- Part II: Crafting a Compelling AI Product Vision and Narrative
- Part III: Data Collection - The Essence of AI
- Part IV: Ensuring Reliable and Accessible Storage
- Part V: Data Exploration and Transformation - Obtaining Clean Data (👈You’re here)
- Part VI: Insights & Analysis - See the Unseen
- Part VII: Machine Learning - The Continuous Improvement Cycle
- Part VIII: AI & Deep Learning - Reaching the Pinnacle
Launching the Kepler Space Telescope
On March 7th, 2009, NASA launched the Kepler Space Telescope, aiming to discover Earth-like planets orbiting other stars. This ambitious mission sought to answer a fundamental question: Are we alone in the universe? While seemingly unrelated to data and analytics, Kepler's mission heavily relied on advanced data strategies and AI principles.
NASA and hundreds of scientists collaborated to create the hardware and software necessary to collect data on hundreds of thousands of stars. Kepler was equipped with a photometer built around an array of 42 CCDs, capable of observing a vast portion of the sky and capturing data from millions of stars.
To manage this massive amount of data, the Kepler team adopted a strategy similar to the "minimum viable robot" (MVR) approach. They narrowed their focus to 100,000 stars and reduced data size by limiting each star to 32 pixels, prioritizing photometric data over full images.
Data Transmission and On-board Processing
Given the telescope's distance from Earth (151 million miles), a continuous stream of data was impossible. To address this, the team implemented on-board processing, accumulating data over 15 minutes and applying complex mathematics to drastically reduce the data size. This allowed the telescope to store a month's worth of data before transmitting it back to Earth.
Once the data reached Earth, the teams embarked on exploratory data analysis, examining the first readings for trends and issues. They encountered unexpected noise from the telescope's electronics, which they resolved by adjusting the data transformation pipeline on Earth.
Refining the Dataset and Astronomical Discoveries
The team further enhanced the dataset by applying algorithms to generate light curves, which are essential for planet detection. This refined "Gold" dataset enabled scientists and citizen scientists alike to make significant astronomical discoveries.
Kepler's Impact on our Understanding of the Universe
- Planets outnumber stars in our galaxy: Kepler revealed the abundance of planets in the Milky Way.
- Potential for life: The mission found that many stars could host small, potentially habitable planets.
- Diversity of planets: Kepler discovered a wide variety of planet types, including a common size not found in our solar system.
Through meticulous planning, innovative data strategies, and a relentless pursuit of knowledge, the Kepler mission expanded our understanding of the universe. It highlighted the crucial role of data and AI principles in unraveling the mysteries of space.
Exploratory Data Analysis
Using lessons from the Kepler project, let’s now learn what the data transformation layer of our hierarchy entails. After significant effort and some challenging decisions, we've secured our raw data in an accessible place, setting the stage for the crucial phase of Exploratory Data Analysis (EDA). This phase marks a pivotal moment in our journey, as the data teams are ready to play.
With the data in its raw, unstructured form, the team embarks on the manual yet critical process of EDA. This initial exploration is foundational, as it requires each team member to apply basic cleaning and structuring techniques to the data, enabling a preliminary analysis of what’s available. The primary objectives during this phase are:
- Building a Comprehensive Map: The team's goal is to navigate the vast data landscape, identifying available information and initial trends. This mapping is instrumental in understanding the breadth and depth of the data at our disposal.
- Identifying Early Issues: Early in the exploration, it becomes crucial to identify any potential problems with the data. This early detection is key to ensuring the integrity and reliability of subsequent analyses.
- Making Statistics-Based Decisions: Addressing the primary issues encountered during EDA requires informed decisions. These include strategies for dealing with missing values, outliers, and datatype optimization, which are vital for preparing the data for more advanced stages of analysis and model training (see the sketch just after this list).
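To make these objectives concrete, here's a minimal pandas sketch of a first EDA pass over a toy table. The column names, the IQR outlier rule, and the dtype choices are illustrative assumptions, not prescriptions from the framework.

```python
import pandas as pd
import numpy as np

# Toy stand-in for a raw table pulled from storage; swap in your own source.
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4, 5],
    "signup_channel": ["web", "web", "web", None, "mobile", "web"],
    "monthly_spend": [20.0, 25.0, 25.0, np.nan, 19.5, 5400.0],
})

# 1. Build the map: shape, dtypes, and summary statistics.
print(raw.shape, raw.dtypes, raw.describe(include="all"), sep="\n")

# 2. Identify early issues: duplicate rows and missing values per column.
print("duplicate rows:", raw.duplicated().sum())
print(raw.isna().mean().rename("fraction_missing"))

# 3. Statistics-based decisions: flag outliers with a simple IQR rule,
#    then shrink dtypes so the table is cheaper to store and scan.
q1, q3 = raw["monthly_spend"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (raw["monthly_spend"] < q1 - 1.5 * iqr) | (raw["monthly_spend"] > q3 + 1.5 * iqr)
print("potential outliers:\n", raw[mask])

raw["signup_channel"] = raw["signup_channel"].astype("category")
raw["monthly_spend"] = raw["monthly_spend"].astype("float32")
```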
Patrick Riley of Google has written an extremely valuable guide to EDA that I always like to reference when beginning a new data analysis journey. Here are the key takeaways that Riley highlights.
- Not all distributions are ‘normal’: Riley emphasizes the importance of looking beyond summary statistics to understand data distributions fully. Histograms, Cumulative Distribution Functions (CDFs), and Quantile-Quantile (Q-Q) plots are instrumental in revealing intricate data characteristics, such as significant outlier classes. These visual representations allow us to see the story our data is trying to tell, often uncovering insights that summary metrics alone cannot provide (all three views appear in the sketch after this list).
- Don’t ignore your outliers: Outliers in your data shouldn't be hastily disregarded. As Riley points out, they can signal deeper issues within your analysis or data collection methodologies. A thoughtful examination of outliers can uncover flaws in your data or reveal unexpected truths, necessitating a nuanced approach to their management, whether by exclusion or categorization into an "Unusual" group. Understanding why data points are outliers is as crucial as identifying them.
- Statistical significance does not equal significant impact: In the realm of large datasets, the distinction between statistical significance and practical significance is sometimes missed. Riley prompts us to consider the real-world impact of our findings, encouraging a focus on meaningful differences. This perspective ensures that our analyses drive actionable insights.
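As a rough illustration of Riley's first two points, the sketch below plots a synthetic sample that hides a small outlier class; the data and figure layout are invented purely for demonstration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic sample: mostly well-behaved values plus a small outlier class.
rng = np.random.default_rng(42)
values = np.concatenate([rng.normal(50, 5, 5000), rng.normal(120, 3, 50)])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram: reveals the second mode that a mean/std summary would hide.
axes[0].hist(values, bins=60)
axes[0].set_title("Histogram")

# Empirical CDF: what fraction of observations sit below each value.
sorted_vals = np.sort(values)
axes[1].plot(sorted_vals, np.arange(1, len(sorted_vals) + 1) / len(sorted_vals))
axes[1].set_title("Empirical CDF")

# Q-Q plot against a normal distribution: outliers bend the line at the tail.
stats.probplot(values, dist="norm", plot=axes[2])
axes[2].set_title("Q-Q plot")

plt.tight_layout()
plt.show()
```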
The Gold Dataset
The AI Strategy Framework promises to deliver value early and often. The Gold Dataset is the first piece of value that business stakeholders can see from your efforts. What is the Gold Dataset? “Gold” is our term for a highly curated, clean dataset produced by applying our data transformations.
It is an essential precursor to model training, but it is also a very useful artifact for the rest of the organization. A dataset that is easy to understand, easy to query, and already cleaned is valuable to business analysts and other decision makers. Even AI product managers can use this data to improve their decision making and prioritization.
What’s most important in the early days of your Gold Dataset is to build awareness and understanding of what it contains, and to offer suggestions for how the data could be used to help achieve business objectives. Too often a high-quality dataset becomes available but business silos keep its usage low. Lastly, it’s useful to put a low-code or no-code tool on top of the data so that non-technical users have the power to self-serve.
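As a minimal sketch of what producing a Gold table could look like (reusing the toy table from the EDA sketch above, with hypothetical column names, and assuming a Parquet engine such as pyarrow is installed):

```python
import pandas as pd
import numpy as np

# Same toy raw table as in the EDA sketch above.
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4, 5],
    "signup_channel": ["web", "web", "web", None, "mobile", "web"],
    "monthly_spend": [20.0, 25.0, 25.0, np.nan, 19.5, 5400.0],
})

def build_gold_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the agreed-upon transformations and return the curated 'Gold' table."""
    gold = (
        raw.drop_duplicates(subset="user_id")
           .rename(columns={"signup_channel": "channel"})
           .assign(monthly_spend=lambda df: df["monthly_spend"].fillna(0.0))
    )
    # Expose only well-understood, documented columns so analysts can self-serve.
    return gold[["user_id", "channel", "monthly_spend"]]

gold = build_gold_table(raw)
# Persist in a columnar format that BI tools and notebooks can both query.
gold.to_parquet("gold_customers.parquet", index=False)
```

The point is less the specific cleaning steps than the artifact itself: a named, documented table the rest of the organization can query.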
Storage vs. Compute: The Basics
At the core, cloud storage and cloud computing serve distinct purposes. Cloud storage focuses on saving data that can be accessed and retrieved from multiple devices over the internet. It's akin to a digital filing cabinet where you store various types of files for later use. Cloud computing, however, goes beyond storage, providing the processing power and services to run applications and perform complex computations over the internet. This distinction is crucial because it influences how businesses approach data handling and infrastructure planning.
Parallel computing
Over the years, advancements in technology have greatly influenced the way storage and compute resources are utilized. The advent of parallel computing allowed for tasks to be divided and processed simultaneously across multiple processors, significantly speeding up computations and making large-scale data processing feasible. This shift laid the groundwork for modern cloud computing platforms that offer scalable compute resources on-demand, enabling businesses to run complex data analyses and machine learning algorithms without the need for significant upfront investment in physical infrastructure.
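As a small, local-scale illustration of that divide-and-combine idea, the sketch below splits a toy computation across worker processes using Python's standard library; real data engines apply the same pattern across many machines.

```python
from concurrent.futures import ProcessPoolExecutor
import math

def heavy_transform(chunk: list[int]) -> int:
    """Stand-in for an expensive per-chunk computation."""
    return sum(math.isqrt(x) for x in chunk)

if __name__ == "__main__":
    data = list(range(2_000_000))
    # Split the work into chunks that can be processed independently.
    chunks = [data[i:i + 500_000] for i in range(0, len(data), 500_000)]

    # Each chunk runs on its own worker process; the partial results are
    # then combined into a single answer.
    with ProcessPoolExecutor() as pool:
        total = sum(pool.map(heavy_transform, chunks))
    print(total)
```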
Separation of storage and compute
The separation of storage and compute in cloud environments marked a pivotal change, allowing for greater flexibility and efficiency. In traditional setups, storage and compute resources were tightly coupled, making it challenging to scale one without affecting the other. Modern cloud services, however, allow these resources to be scaled independently. This means businesses can store vast amounts of data without necessarily incurring high compute costs until those resources are needed for processing. Virtual warehouses exemplify this by offering on-demand compute capacity that can process data stored separately, optimizing both cost and performance.
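A rough sketch of what this decoupling looks like from the analyst's seat, assuming a hypothetical S3 bucket and that pyarrow and s3fs are installed: the data sits cheaply in object storage, and compute is just a short-lived process that reads only what it needs.

```python
import pandas as pd

# Storage: a Parquet dataset in object storage (hypothetical path).
# Compute: this process, which exists only for the duration of the question.
events = pd.read_parquet(
    "s3://example-bucket/events/2024/",
    columns=["user_id", "event_type"],  # read only the columns the query needs
)
print(events["event_type"].value_counts())
```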
Implications for Business: ETL vs. ELT
Understanding the distinction between storage and compute, and their decoupling in cloud environments, is instrumental in deciding between ETL and ELT processes. ETL, which emphasizes pre-processing data before loading it into a data warehouse, can be resource-intensive, requiring significant compute power upfront. On the other hand, ELT leverages the cloud's flexible compute resources by loading data first and transforming it as needed. This approach can be more cost-effective and scalable, particularly for businesses dealing with large volumes of data that may not require immediate processing.
For businesses, the choice between ETL and ELT should consider not just the current needs but also the strategic direction.
In the early days of your project, while you’re trying to deliver value early and often, ETL might be the way to go - especially if you’ve limited yourself to a smaller dataset and have well-understood transformation rules to apply. This will get you to the ‘Gold Dataset’ quickly. Further down the road, as more data becomes available, your team may want to implement a ‘light’ cleaning of the data and place it in a ‘Bronze’ database - one step beyond your fully raw database.
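To make the ordering difference concrete, here's a toy sketch using pandas and an in-memory SQLite database as a stand-in for the warehouse; a real pipeline would use your actual warehouse and transformation tooling.

```python
import sqlite3
import pandas as pd

raw = pd.DataFrame({"user_id": [1, 2, 2], "spend": ["20", "25", "25"]})
con = sqlite3.connect(":memory:")  # stand-in for the data warehouse

# ETL: transform first, then load only the curated result.
etl_gold = raw.drop_duplicates().assign(spend=lambda df: df["spend"].astype(float))
etl_gold.to_sql("gold_users", con, index=False)

# ELT: load the raw data as-is, then transform inside the warehouse when needed.
raw.to_sql("raw_users", con, index=False)
elt_gold = pd.read_sql(
    "SELECT DISTINCT user_id, CAST(spend AS REAL) AS spend FROM raw_users", con
)

print(pd.read_sql("SELECT * FROM gold_users", con))
print(elt_gold)
```

The trade-off is where the compute happens: up front in the pipeline (ETL) or on demand inside the warehouse (ELT).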
Keeping it simple
Navigating data transformation options can feel overwhelming, particularly in an environment saturated with technologies all promising to revolutionize AI development. As your team moves from exploratory data analysis (EDA) towards creating a curated 'Gold Dataset', it's crucial to approach the selection of tools and processes with a few clear heuristics in mind:
- Start with Scalability in Mind: Opt for data transformation solutions that offer scalability. This ensures that as your data grows and your AI vision expands, your chosen infrastructure can grow with you, avoiding potential bottlenecks.
- Ensure Compatibility: Ensure that the tools you select seamlessly integrate with your planned analytics and ML frameworks. Compatibility checks can save you from future headaches by ensuring that your data transformation pipelines smoothly feed into your analytical models and ML training processes.
- Prioritize Data Quality Over Quantity: When building your 'Gold Dataset', focus on the quality of data rather than the quantity. A well-curated, high-quality dataset will provide more value and lead to more accurate insights and predictions than a larger, less reliable one.
Wrapping Up
Data transformation is difficult - even if you’re a rocket scientist. At this layer of the AI Strategy Framework, as your data teams begin to get their hands dirty, it can be easy to get distracted. Distracted by new data relationships, new opportunities, new insights. Don’t forget where you started and what your goal is. Limit complexity and maintain focus: keep it on high-quality data and on getting that first dataset clean and ready for the next layers. This will set you apart and increase your chances of success.
Next week we’ll talk about deriving value from our Gold Dataset through data visualization and analytics. We’ll see that while our focus remains on delivering AI, sometimes good old fashioned analytics can provide massive returns. See you there!
Need Help?
If you're seeking to unlock the full potential of AI within your organization but need help, we’re here for you. Our AI strategies are a no-nonsense way to derive value from AI technology. Reach out. Together we can turn your AI vision into reality.
Mitchell Johnstone
Director of Strategy
Mitch is a Strategic AI leader with 7+ years of transforming businesses through high-impact AI/ML projects. He combines deep technical acumen with business strategy, exemplified in roles ranging from AI product management to entrepreneurial ventures. His portfolio includes proven success in driving product development, leading cross-functional teams, and navigating complex enterprise software landscapes.