Part V: Data Exploration & Transformation

Launching the Kepler Space Telescope

On March 7th, 2009, NASA launched the Kepler Space Telescope, aiming to discover Earth-like planets orbiting other stars. This ambitious mission sought to answer a fundamental question: Are we alone in the universe? While seemingly unrelated to data and analytics, Kepler's mission heavily relied on advanced data strategies and AI principles.

NASA and hundreds of scientists collaborated to create the hardware and software necessary to collect data on hundreds of thousands of stars. Kepler was equipped with a photometer built around an array of 42 CCDs, capable of observing a wide patch of sky and capturing light from millions of stars.

To manage this massive amount of data, the Kepler team adopted a strategy similar to the "minimum viable robot" (MVR) approach. They narrowed their focus to more than 100,000 stars and reduced data size by limiting each star to roughly 32 pixels, prioritizing photometric data over full images.

Data Transmission and On-board Processing

Because the telescope trailed Earth in its own orbit around the Sun, drifting to roughly 94 million miles (151 million km) away by the end of the mission, a continuous stream of data was impossible. To address this, the team implemented on-board processing, summing short exposures into intervals of roughly 30 minutes and compressing the result to drastically reduce the data size. This allowed the telescope to store about a month's worth of data before transmitting it back to Earth.
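The on-board accumulation step can be illustrated with a minimal sketch. This is not Kepler's flight software: the function name `coadd`, the pixel values, and the group size are all assumptions chosen for illustration. The idea is simply that summing many short exposures into one frame per interval shrinks the data to store and transmit by the number of exposures in each group.

```python
from typing import List


def coadd(exposures: List[List[int]], frames_per_cadence: int) -> List[List[int]]:
    """Sum consecutive exposures pixel-by-pixel into one frame per cadence.

    `exposures` is a list of short readouts, each a flat list of pixel counts
    for one star's pixel mask. Trailing exposures that do not fill a complete
    cadence are dropped.
    """
    cadences = []
    usable = len(exposures) - len(exposures) % frames_per_cadence
    for start in range(0, usable, frames_per_cadence):
        group = exposures[start:start + frames_per_cadence]
        # zip(*group) walks the same pixel across every exposure in the group.
        cadences.append([sum(pixel_values) for pixel_values in zip(*group)])
    return cadences


# Six hypothetical 4-pixel exposures, co-added three at a time.
exposures = [
    [10, 12, 11, 9],
    [11, 12, 10, 9],
    [10, 13, 11, 8],
    [10, 12, 11, 9],
    [9, 12, 10, 9],
    [10, 11, 11, 9],
]
cadences = coadd(exposures, frames_per_cadence=3)
print(cadences)  # [[31, 37, 32, 26], [29, 35, 32, 27]]
```

Six frames become two, so the stored volume drops by a factor of three here; a real instrument would sum far more exposures per interval and compress the result as well.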

Once the data reached Earth, the teams embarked on exploratory data analysis, examining the first readings for trends and issues. They encountered unexpected noise from the telescope's electronics, which they resolved by adjusting the data transformation pipeline on Earth.

Refining the Dataset and Astronomical Discoveries

The team further enhanced the dataset by applying algorithms to generate light curves, essential for planet detection. This refined "Gold" dataset enabled scientists and citizen scientists to make significant astronomical discoveries.
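A light curve is a star's brightness plotted over time, and a planet crossing in front of the star shows up as a small, temporary dip. The toy sketch below shows the idea under simplifying assumptions: the function names, pixel values, and the 1% depth threshold are hypothetical, and a real pipeline would also correct for instrument noise and long-term trends before searching for dips.

```python
from statistics import median
from typing import List


def light_curve(cadences: List[List[float]]) -> List[float]:
    """Sum each cadence's pixels into one flux value, scaled by the median flux."""
    flux = [sum(pixels) for pixels in cadences]
    baseline = median(flux)
    return [value / baseline for value in flux]


def find_dips(curve: List[float], depth: float = 0.01) -> List[int]:
    """Return the indices where relative brightness falls more than `depth` below 1.0."""
    return [i for i, value in enumerate(curve) if value < 1.0 - depth]


# Seven hypothetical 2-pixel cadences; the third and fourth are 2% dimmer.
cadences = [
    [500, 500], [501, 499], [490, 490], [489, 491],
    [500, 500], [500, 500], [499, 501],
]
curve = light_curve(cadences)
print(find_dips(curve))  # [2, 3]
```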

Kepler's Impact on our Understanding of the Universe

  1. Planets outnumber stars in our galaxy: Kepler revealed the abundance of planets in the Milky Way.
  2. Potential for life: The mission found that many stars could host small, potentially habitable planets.
  3. Diversity of planets: Kepler discovered a wide variety of planet types, including a common size not found in our solar system.

Through meticulous planning, innovative data strategies, and a relentless pursuit of knowledge, the Kepler mission expanded our understanding of the universe. It highlighted the crucial role of data and AI principles in unraveling the mysteries of space.

Exploratory Data Analysis

Using lessons from the Kepler project, let’s now learn what the data transformation layer of our hierarchy entails. After significant effort and some challenging decisions, we've secured our raw data in an accessible place, setting the stage for the crucial phase of Exploratory Data Analysis (EDA). This phase marks a pivotal moment in our journey, as the data teams are ready to play.

With the data in its raw, unstructured form, the team embarks on the manual yet critical process of EDA. This initial exploration is foundational, as it requires each team member to apply basic cleaning and structuring techniques to the data, enabling a preliminary analysis of what’s available. The primary objectives during this phase are:

  1. Building a Comprehensive Map: The team's goal is to navigate the vast data landscape, identifying available information and initial trends. This mapping is instrumental in understanding the breadth and depth of the data at our disposal.
  2. Identifying Early Issues: Early in the exploration, it becomes crucial to identify any potential problems with the data. This early detection is key to ensuring the integrity and reliability of subsequent analyses.
  3. Making Statistics-Based Decisions: Addressing the primary issues encountered during EDA requires informed decisions. These include strategies for dealing with missing values, outliers, and datatype optimization, which are vital for preparing the data for more advanced stages of analysis and model training.
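A first pass at these three objectives can be as small as a per-column profile. The sketch below is one possible approach using only Python's standard library; the function name `profile_column`, the sample values, and the 1.5 × IQR outlier rule are assumptions for illustration, not a prescribed method.

```python
from statistics import quantiles
from typing import Dict, List, Optional


def profile_column(values: List[Optional[float]]) -> Dict[str, object]:
    """Summarize one numeric column: missing count, median, and IQR outliers."""
    present = [v for v in values if v is not None]
    q1, q2, q3 = quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "median": q2,
        "outliers": [v for v in present if v < low or v > high],
        # If every present value is a whole number, an integer type would do.
        "fits_integer": all(float(v).is_integer() for v in present),
    }


# Hypothetical column with two missing readings and one suspicious value.
values = [12.0, 15.0, None, 14.0, 13.0, 250.0, 16.0, None, 15.0]
report = profile_column(values)
print(report["missing"], report["median"], report["outliers"])  # 2 15.0 [250.0]
```

The profile only flags candidates. Whether 250.0 is a sensor glitch or a genuine reading is still a decision for the team.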

Patrick Riley of Google has written an extremely valuable guide to EDA that I always like to reference when beginning a new data analysis journey. Here are the key takeaways that Riley highlights. 

  1. Not all distributions are ‘normal’: Riley emphasizes the importance of looking beyond summary statistics to understand data distributions fully. Histograms, Cumulative Distribution Functions (CDFs), and Quantile-Quantile (Q-Q) plots are instrumental in revealing intricate data characteristics, such as significant outlier classes. These visual representations allow us to see the story our data is trying to tell, often uncovering insights that summary metrics alone cannot provide.
  2. Don’t ignore your outliers: Outliers in your data shouldn't be hastily disregarded. As Riley points out, they can signal deeper issues within your analysis or data collection methodologies. A thoughtful examination of outliers can uncover flaws in your data or reveal unexpected truths, necessitating a nuanced approach to their management, whether by exclusion or categorization into an "Unusual" group. Understanding why data points are outliers is as crucial as identifying them.
  3. Statistical significance does not equal significant impact: In the realm of large datasets, the distinction between statistical significance and practical significance is sometimes missed. Riley prompts us to consider the real-world impact of our findings, encouraging a focus on meaningful differences. This perspective ensures that our analyses drive actionable insights.
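The third takeaway is easy to demonstrate with numbers. The sketch below assumes a hypothetical experiment with one million users per group and a half-second difference in average session length. It uses a normal approximation for the test and Cohen's d, with equal group sizes assumed, as the effect size. None of these figures come from Riley's guide.

```python
import math


def z_and_effect(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Return (z statistic, two-sided p-value, Cohen's d) from summary statistics."""
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = (mean_b - mean_a) / standard_error
    # Two-sided p-value under a normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Pooled standard deviation, assuming equal group sizes.
    pooled_sd = math.sqrt((sd_a**2 + sd_b**2) / 2)
    cohens_d = (mean_b - mean_a) / pooled_sd
    return z, p_value, cohens_d


# Hypothetical: average session of 300.0 s vs 300.5 s, standard deviation 60 s.
z, p_value, d = z_and_effect(300.0, 300.5, 60.0, 60.0, 1_000_000, 1_000_000)
print(f"z={z:.2f}, p={p_value:.1e}, d={d:.4f}")
```

With samples this large the p-value is tiny, yet the effect size is under one hundredth of a standard deviation. The result is statistically significant but unlikely to matter to the business.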

The Gold Dataset

The AI Strategy Framework promises to deliver value early and often. The Gold Dataset is that first piece of value that business stakeholders can see from your efforts. What is the Gold Dataset? “Gold” is our term for describing a highly curated and clean dataset produced by our applied data transformations.

It is an essential precursor to model training, but also a very useful artifact for the rest of the organization. Having a set of data that is easy to understand, is easily queryable, and has been cleaned up is useful to business analysts and other decision makers in the organization. Even AI product managers can utilize this data to improve their decision making and prioritization. 

What’s most important in the early days of your Gold Dataset is to make sure that there is an awareness and understanding of what it contains, and to provide suggestions for how the data could be used to help achieve business objectives. Too often a high-quality dataset becomes available, but business silos result in low use. Lastly, it’s useful to get a low-code or no-code tool that can sit on top of the data so that non-technical users have the power to self-serve.

Storage vs. Compute: The Basics

At the core, cloud storage and cloud computing serve distinct purposes. Cloud storage focuses on saving data that can be accessed and retrieved from multiple devices over the internet. It's akin to a digital filing cabinet where you store various types of files for later use. Cloud computing, however, goes beyond storage, providing the processing power and services to run applications and perform complex computations over the internet. This distinction is crucial because it influences how businesses approach data handling and infrastructure planning.

Parallel Computing

Over the years, advancements in technology have greatly influenced the way storage and compute resources are utilized. The advent of parallel computing allowed for tasks to be divided and processed simultaneously across multiple processors, significantly speeding up computations and making large-scale data processing feasible. This shift laid the groundwork for modern cloud computing platforms that offer scalable compute resources on-demand, enabling businesses to run complex data analyses and machine learning algorithms without the need for significant upfront investment in physical infrastructure.
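The divide-and-combine pattern behind parallel computing fits in a few lines. The sketch below is illustrative only, and its function names are hypothetical. It uses a thread pool so that it runs anywhere without extra setup. For CPU-bound work in CPython, a real speedup would need separate processes (for example `concurrent.futures.ProcessPoolExecutor`) or a distributed engine, but the structure stays the same: split the work, process the chunks independently, and combine the partial results.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def chunked(values: List[float], n_chunks: int) -> List[List[float]]:
    """Split `values` into at most `n_chunks` contiguous pieces of similar size."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_of_squares(chunk: List[float]) -> float:
    """Work done independently on one chunk."""
    return sum(v * v for v in chunk)


def parallel_sum_of_squares(values: List[float], workers: int = 4) -> float:
    """Fan chunks out to a worker pool, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum_of_squares, chunked(values, workers))
        return sum(partials)


print(parallel_sum_of_squares(list(range(1, 101))))  # 338350
```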

Separation of Storage and Compute

The separation of storage and compute in cloud environments marked a pivotal change, allowing for greater flexibility and efficiency. In traditional setups, storage and compute resources were tightly coupled, making it challenging to scale one without affecting the other. Modern cloud services, however, allow these resources to be scaled independently. This means businesses can store vast amounts of data without necessarily incurring high compute costs until those resources are needed for processing. Virtual warehouses exemplify this by offering on-demand compute capacity that can process data stored separately, optimizing both cost and performance.

Implications for Business: ETL vs. ELT

Understanding the distinction between storage and compute, and their decoupling in cloud environments, is instrumental in deciding between ETL and ELT processes. ETL, which emphasizes pre-processing data before loading it into a data warehouse, can be resource-intensive, requiring significant compute power upfront. On the other hand, ELT leverages the cloud's flexible compute resources by loading data first and transforming it as needed. This approach can be more cost-effective and scalable, particularly for businesses dealing with large volumes of data that may not require immediate processing.

For businesses, the choice between ETL and ELT should consider not just the current needs but also the strategic direction.

In the early days of your project, while you are trying to deliver value early and often, ETL might be the way to go, especially if you’ve limited yourself to a smaller dataset and have well-understood transformation rules to apply. This will get you to the ‘Gold Dataset’ quickly. Further down the road, as more data becomes available, your team may want to move toward ELT: load the data first, apply a ‘light’ cleaning, and place the result in a ‘Bronze’ database, one step beyond your fully raw database.
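The raw, Bronze, and Gold layering can be sketched with an in-memory SQLite database standing in for a cloud warehouse. The table names, columns, and sample rows are hypothetical. The point is only the order of operations in ELT: raw text is loaded untouched, and every transformation happens afterwards inside the database.

```python
import sqlite3

RAW_ROWS = [
    ("1001", " Alice ", "42.50", "2024-01-05"),
    ("1002", "bob", "", "2024-01-06"),          # missing amount
    ("1003", "Carol", "19.99", "2024-01-06"),
    ("1003", "Carol", "19.99", "2024-01-06"),   # duplicate row
]


def build_gold(conn: sqlite3.Connection) -> None:
    """ELT: load raw text as-is, then transform inside the database."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, customer TEXT, amount TEXT, order_date TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", RAW_ROWS)
    # Bronze: light cleaning only (trim whitespace, drop exact duplicates).
    conn.execute("""
        CREATE TABLE bronze_orders AS
        SELECT DISTINCT order_id,
               TRIM(customer) AS customer,
               TRIM(amount) AS amount,
               order_date
        FROM raw_orders
    """)
    # Gold: typed rows with no missing amounts, ready for analysts and models.
    conn.execute("""
        CREATE TABLE gold_orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               customer,
               CAST(amount AS REAL) AS amount,
               order_date
        FROM bronze_orders
        WHERE amount <> ''
    """)


conn = sqlite3.connect(":memory:")
build_gold(conn)
gold = conn.execute(
    "SELECT order_id, customer, amount FROM gold_orders ORDER BY order_id"
).fetchall()
print(gold)  # [(1001, 'Alice', 42.5), (1003, 'Carol', 19.99)]
```

In an ETL version of the same job, the trimming, de-duplication, and type casting would run in application code before anything is inserted, and only the Gold table would exist in the database.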

Keeping It Simple

Navigating data transformation options can feel overwhelming, particularly in an environment saturated with technologies all promising to revolutionize AI development. As your team moves from exploratory data analysis (EDA) towards creating a curated 'Gold Dataset', it's crucial to approach the selection of tools and processes with a few clear heuristics in mind:

  1. Start with Scalability in Mind: Opt for data transformation solutions that offer scalability. This ensures that as your data grows and your AI vision expands, your chosen infrastructure can grow with you, avoiding potential bottlenecks.
  2. Ensure Compatibility: Ensure that the tools you select seamlessly integrate with your planned analytics and ML frameworks. Compatibility checks can save you from future headaches by ensuring that your data transformation pipelines smoothly feed into your analytical models and ML training processes.
  3. Prioritize Data Quality Over Quantity: When building your 'Gold Dataset', focus on the quality of data rather than the quantity. A well-curated, high-quality dataset will provide more value and lead to more accurate insights and predictions than a larger, less reliable one.

Wrapping Up

Data transformation is difficult - even if you’re a rocket scientist. At this layer of the AI Strategy Framework, as your data teams begin to get their hands dirty, it can be easy to get distracted by new data relationships, new opportunities, and new insights. Don’t forget where you started and your goal. Limit complexity and maintain focus. Keep your attention on high-quality data and on getting the first dataset clean and ready for the next layers. This will set you apart and increase your chances of success.

Next week we’ll talk about deriving value from our Gold Dataset through data visualization and analytics. We’ll see that while our focus remains on delivering AI, sometimes good old-fashioned analytics can provide massive returns. See you there!

Need Help?

If you're seeking to unlock the full potential of AI within your organization but need help, we’re here for you. Our AI strategies are a no-nonsense way to derive value from AI technology. Reach out. Together we can turn your AI vision into reality.


Mitchell Johnstone

Director of Strategy

Mitch is a Strategic AI leader with 7+ years of transforming businesses through high-impact AI/ML projects. He combines deep technical acumen with business strategy, exemplified in roles spanning AI product management to entrepreneurial ventures. His portfolio includes proven success in driving product development, leading cross-functional teams, and navigating complex enterprise software landscapes.
