Itinerary
- Part I: Introducing the AI Strategy Framework
- Part II: Crafting a Compelling AI Product Vision and Narrative
- Part III: Data Collection - The Essence of AI
- Part IV: Ensuring Reliable and Accessible Storage (👈You’re here)
- Part V: Data Exploration and Transformation
- Part VI: Insights & Analysis - See the Unseen
- Part VII: Machine Learning - The Continuous Improvement Cycle
- Part VIII: AI & Deep Learning - Reaching the Pinnacle
The birth of a cloud
In the quiet confines of Jeff Bezos's home in the early 2000s, the executive team was deep in strategy. While reviewing their strengths, weaknesses, and opportunities, a company-altering idea emerged. The team realized they had a competitive advantage in their expertise managing digital infrastructure. Building one of the largest operational websites in the world had forced them to master the trade of managing large digital systems.
That day, the team recognized a universal challenge: countless businesses, especially those looking to implement digital innovation, grappled with the daunting task of building and managing technology infrastructure. The skills that had helped Amazon build its e-commerce giant were highly valued by every other company looking to grow its digital presence. To capitalize on this, Bezos and his team decided to offer Amazon’s robust digital infrastructure as a service. Famed investor Bill Gurley has called this one of the top three business moves of all time. This new offering allowed businesses to outsource the building of digital roads so they could design the cars that drive upon them.
Amazon Web Services (AWS) was envisioned as the answer to a question many businesses hadn't fully articulated yet: How could they pivot from investing in the mechanics of technology to harnessing its strategic potential? With AWS, companies could tap into a scalable, secure, and sophisticated infrastructure, transforming their approach from one of operational burden to strategic advantage. The AWS retreat at the Bezos house revealed a guiding principle for businesses embarking on their AI journeys: the real value lies not in the infrastructure itself but in the innovation it enables.
For us, this story serves as a reminder as we navigate the complexities of executing an AI strategy. Our focus should remain on the problem we are solving for our customers. AWS and the other cloud service providers can handle the infrastructure, so we can innovate, create value, and transform our vision into reality.
Planning for flexibility - see the unseeable
Let’s now take a step back and think about the big picture. Not all data, in all industries, is best suited for cloud storage. There are many heavily regulated industries and instances where sensitive data needs extra protection. In these cases, an internally managed or hybrid solution may be required. We’ll discuss an evaluation framework to help you understand whether your data needs additional security considerations.
If at all possible, focus your efforts on collecting and storing data that is a good fit for the cloud. This lowers complexity and speeds up time to value. If you do have sensitive data, engaging an information security expert will help you identify the secure tooling your cloud provider offers, or scope the more complex internally managed strategy.
In addition to considering where the data is stored (cloud, on-premises, or hybrid), we also need to decide on the best storage format. Given that we aren’t fortune tellers and often overlook data points that may become relevant later, I usually recommend starting with a storage location that accepts raw data of many kinds, both structured and unstructured: a place where you can store text and images alongside structured data that fits nicely into columns and rows.
This way you have a few advantages:
- Cost-Effectiveness: Cloud object storage like AWS S3 offers capacity at roughly $25/TB/month, making it economically feasible to warehouse vast quantities of raw data.
- Flexibility: Premature commitment to a structure may pigeonhole your data strategy. Flexibility helps avoid the pitfalls of scattered, inaccessible databases, ensuring data scientists and analysts have access to the information.
What this means in practice is that you’ll probably want a Data Lake. This is the industry term to describe a place that allows data of all types to be stored in raw form. It also means that you may have to worry about additional layers of data storage and data transformation down the road, but with the benefit of added flexibility, this is well worth it.
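To make the Data Lake idea concrete, here is a minimal local sketch of the "raw zone" layout many lakes use: everything lands in its native format, organized by source and date. The paths and file names are hypothetical, and in practice this directory tree would be an S3, ADLS, or GCS bucket rather than a local folder.

```python
import json
from pathlib import Path

# Root of the raw zone. In a real lake this is an object-store bucket,
# e.g. s3://my-company-lake/raw/ (hypothetical name).
lake = Path("lake/raw")

def land(source: str, date: str, filename: str, payload: bytes) -> Path:
    """Write a raw payload to lake/raw/<source>/<date>/<filename>."""
    dest = lake / source / date / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)
    return dest

# Structured rows and unstructured bytes coexist, untouched and untransformed.
signups = [{"name": "Ada", "email": "ada@example.com"}]
land("signups", "2024-04-10", "signups.json", json.dumps(signups).encode())
land("camera", "2024-04-10", "shot_001.jpg", bytes([0xFF, 0xD8, 0xFF]))
```

The point of the sketch is the layout, not the code: because nothing is forced into a schema on the way in, later transformation layers can reshape the same raw files as new questions arise.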
Structured vs. Unstructured
What do we mean when we say structured or unstructured data? Most of us have used Microsoft Excel in some capacity. Data that fits nicely into the rows and columns is considered structured. Consider filling out a form to sign up for a website. They ask for your name, email, phone number. This data can be stored in a structured format with each row representing a user who has signed up.
Conversely, unstructured data—words, images, sensory readings—defies easy categorization, requiring specialized databases for efficient storage and retrieval. The diversity of data in AI applications often requires accommodating both structured precision and unstructured richness.
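A tiny illustration of the difference, using made-up signup and support-ticket data: the structured rows share one set of columns and can be queried by column, while the unstructured values are just text and bytes with no schema at all.

```python
# Structured data: every signup fills the same columns, like rows in a sheet.
signups = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "phone": "555-0100"},
    {"name": "Alan Turing",  "email": "alan@example.com", "phone": "555-0101"},
]

# Unstructured data: free text and raw bytes carry no fixed schema.
support_ticket = "App crashed when I uploaded my photo, please help..."
photo_bytes = bytes([0xFF, 0xD8, 0xFF, 0xE0])  # opening bytes of a JPEG

# The structured rows can be sliced by column; the blob cannot.
emails = [row["email"] for row in signups]
```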
Speed and Efficiency
The next thing to consider is how our data scientists and analysts will use the data. Will they need to extract huge chunks of data for model training? Will they need to quickly find single records? For AI, data scientists will certainly need to extract large amounts of data in batches to perform model training - which we’ll get into later in this series - but there may be other uses as well. Do business users want to explore the data? Do data scientists want to tease out relationships and pull examples from the dataset? There are two things to think about for database performance in this case.
Indexing
Imagine a library where the books are placed in random order; indexing saves us from this chaos. By creating pointers to data locations, indexes ensure rapid retrieval, directly impacting query speed and user satisfaction. Whether organizing by date/time in time-series databases or utilizing unique identifiers, choosing the right indexing strategy is paramount for operational efficiency.
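The library analogy can be shown in a few lines of Python. This is only a toy model of what a database index does: the scan touches every record, while the index (here a plain dict) jumps straight to the one we want.

```python
# Without an index: scan every record, like a library shelved at random.
books = [{"isbn": f"isbn-{i}", "title": f"Book {i}"} for i in range(100_000)]

def find_by_scan(isbn):
    for book in books:               # O(n): walks every shelf
        if book["isbn"] == isbn:
            return book

# With an index: a pointer from key to location, one hop to the record.
index = {book["isbn"]: book for book in books}   # build once, O(1) lookups

hit = index["isbn-99999"]            # direct retrieval, no scan required
```

Real databases maintain these pointers (typically B-trees or hash indexes) automatically once you declare an index, and the same trade-off applies: a little extra work and storage at write time buys fast reads.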
Partitioning
Dividing a database into manageable segments, partitioning enhances performance and simplifies data management. This strategy not only accelerates queries but also facilitates parallel processing and distributed storage, optimizing resource utilization.
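As a small sketch of the idea, here is date-based partitioning over a handful of made-up sensor readings: records are bucketed by month, so a query for April only touches the April segment instead of the whole table.

```python
from collections import defaultdict

readings = [
    {"ts": "2024-03-31T23:00", "temp": 61},
    {"ts": "2024-04-10T14:00", "temp": 68},
    {"ts": "2024-04-11T09:00", "temp": 64},
]

# Partition by month: the first 7 chars of the timestamp ("2024-04")
# become the partition key. Real databases do this on disk, per segment.
partitions = defaultdict(list)
for r in readings:
    partitions[r["ts"][:7]].append(r)

april = partitions["2024-04"]   # only the April segment is scanned
```

In a production system each partition can also live on a different node, which is what enables the parallel processing and distributed storage mentioned above.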
Considering different data types
With indexing and partitioning introduced, it’s important to think about the primary data types you are dealing with. Are you collecting data across time? Can every record be connected back to a single person, event, or category? Thinking about the type of data you have up front will help you select the right storage strategy. Different databases excel at different types of data, so understanding the underlying structure, how it’s created, and what the relationships between data points look like will guide your decision. Let’s look at two specific examples to see how different data can lead to different storage options.
Handling Time-Series Data: The Weather Forecasting Company
Use Case Overview: Imagine a company specializing in high-resolution, real-time weather forecasting. This company collects vast amounts of time-series data from various sensors across the globe, including temperature, humidity, wind speed, and atmospheric pressure readings. Each data point is timestamped, creating a continuous stream of data that needs to be processed, analyzed, and stored efficiently for predictive modeling and real-time weather updates.
Example of Data:
- Timestamp: 2024-04-10 14:00:00
- Location: San Francisco, CA
- Temperature: 68°F
- Humidity: 75%
- Wind Speed: 5 mph
- Atmospheric Pressure: 1013 mb
Database Solution: For this use case, a time-series database (TSDB) like InfluxDB or TimescaleDB would be ideal.
Why This Solution Works:
- Efficiency: TSDBs are optimized for handling time-stamped data, making them highly efficient for writing, querying, and analyzing time-series data at scale.
- Scalability: These databases can handle the high volume of data generated by weather sensors without performance degradation, crucial for real-time analysis.
- Query Support: They support complex queries essential for time-series analysis, such as calculating moving averages or identifying trends over time.
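To ground that last point, here is what a trailing moving average over timestamped readings looks like in plain Python, using a few hypothetical hourly temperature values. A TSDB would express this as a one-line query (e.g. TimescaleDB’s `time_bucket` or InfluxDB’s windowed mean) and run it over billions of rows; the logic is the same.

```python
from datetime import datetime, timedelta

# Hypothetical hourly temperature readings from one sensor.
readings = [
    (datetime(2024, 4, 10, 12), 64.0),
    (datetime(2024, 4, 10, 13), 66.0),
    (datetime(2024, 4, 10, 14), 68.0),
    (datetime(2024, 4, 10, 15), 67.0),
]

def moving_average(series, window=timedelta(hours=3)):
    """For each point, average all readings inside the trailing window."""
    out = []
    for ts, _ in series:
        vals = [v for t, v in series if ts - window < t <= ts]
        out.append((ts, sum(vals) / len(vals)))
    return out

smoothed = moving_average(readings)   # last point averages 13:00-15:00
```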
Handling Object-Oriented Data: The Digital Asset Management Platform
Use Case Overview: Consider a digital asset management platform that helps creative teams manage their digital content. This platform deals with a wide variety of object-oriented data, including high-resolution images, videos, design files, and accompanying metadata (tags, descriptions, project associations). The data is not only varied in type but also needs to be accessible in a highly relational context, where connections between different assets (such as those belonging to the same project) are easily retrievable.
Example of Data:
- Asset ID: 00123
- Type: Image
- File: example_image.jpg
- Metadata:
  - Tags: ["campaign_2024", "spring", "outdoors"]
  - Description: "Spring 2024 campaign outdoor shoot."
  - Project: "Spring 2024 Campaign"
Database Solution: For managing object-oriented data with complex interrelationships, a document-oriented database like MongoDB or a graph database like Neo4j could be considered, depending on the complexity of the relationships between assets.
Why These Solutions Work:
- Flexibility: Document-oriented databases like MongoDB store data in a flexible, JSON-like format, allowing for varied and nested data types (like images with metadata). This flexibility is ideal for handling the multifaceted attributes of digital assets.
- Relationship Handling: If the use case involves complex relationships between assets (e.g., shared tags, project hierarchies), a graph database like Neo4j offers superior capabilities in querying deeply interconnected data, making it easier to navigate and understand the relationships between various digital assets.
- Scalability and Performance: Both MongoDB and Neo4j scale well with large datasets while maintaining performance, ensuring that as the platform grows, retrieving and managing assets remains efficient.
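As a sketch of the document-oriented shape, here is the asset example above as nested, JSON-like records, with a tag lookup of the kind a document database would serve from an index on `metadata.tags`. The second asset and the helper function are invented for illustration; in MongoDB the equivalent query would be a `find` on that field.

```python
# Document-style records: nested and flexible, no fixed table schema.
assets = [
    {
        "asset_id": "00123",
        "type": "image",
        "file": "example_image.jpg",
        "metadata": {
            "tags": ["campaign_2024", "spring", "outdoors"],
            "description": "Spring 2024 campaign outdoor shoot.",
            "project": "Spring 2024 Campaign",
        },
    },
    {
        "asset_id": "00124",
        "type": "video",
        "file": "teaser.mp4",
        "metadata": {"tags": ["campaign_2024"],
                     "project": "Spring 2024 Campaign"},
    },
]

def by_tag(tag):
    """Return every asset carrying a tag - what a document DB indexes for."""
    return [a for a in assets if tag in a["metadata"].get("tags", [])]

campaign_assets = by_tag("campaign_2024")   # both assets share this tag
```

When the questions become less "find by tag" and more "walk the connections" (which assets share a project, which projects share contributors), that is the cue to reach for a graph database instead.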
By selecting the appropriate database solutions tailored to the specific needs of time-series and object-oriented data use cases, organizations can optimize their data management practices for better performance, scalability, and analytical capabilities. These examples illustrate how understanding the nature of your data can guide the selection of the most suitable database technology.
Making life easy for the Chief Information Security Officer
Security is paramount when considering data storage options for AI projects. The risks of data breaches, unauthorized access, and data corruption must be addressed. Topics such as advanced encryption for data at rest and in transit, rigorous access controls, and regular security audits can be difficult to wrap your mind around. However, there are a few things that you can do to make life easy on your security team (or a contracted security resource) to quickly make informed decisions that will move an AI Strategy forward without undue risk.
- Share the Vision: Sharing your AI Narrative with the security team will help bring them along for the ride. They are likely to get excited and want to find ways to make this vision a reality. It will also help them contribute to filling in blind spots and identify future risks up front. This collaboration will pay off.
- Data Legislation: Provide your security friends with some data samples as well as some notes and summaries of the legislation you think may apply to your data. This will help jump start their evaluation to see if the legislation may apply and the implications for your data security strategy.
- Network Diagram: Provide the team with a network diagram that shows where data comes from and how it moves through the AI system. It’s okay if you’re not 100% sure of the implementation at this point, but a rough architecture diagram showing the comings and goings of your data can help the team sort out what data security techniques may need to be applied.
Keep it simple
Above we’ve considered some of the essentials of the data storage decision, and honestly, it’s a bit overwhelming. Especially if your team lacks the guidance of a tech leader with years of experience (and battle scars), making a decision can seem daunting. It also doesn’t help that billions of dollars are being poured into AI development, which seems to expand the universe of options by the day.
When trying to get quickly to value and validate your AI vision, it is helpful to use a few heuristics to pick a data storage option and move on:
- Cloud-First: Consider cloud storage solutions for their scalability and ease of integration with analytics and ML services.
- Open Source Preference: Lean towards open-source databases and frameworks, which come with extensive community support and flexibility.
- Compatibility Check: Prioritize tools and platforms that offer native support or established connectors for your analytics and ML frameworks of choice.
Wrapping Up
Not all of us get to hang at the Bezos residence and come up with a brand-new, seemingly unrelated business idea that turns into a money-printing machine. What we do get as a result of that meeting is the ability to build, launch, and scale our AI products quickly and cheaply. Thanks to AWS, Microsoft, and Google, we can focus on our customers and their problems. We can get to know them better than any of our competitors and use these tools to win.
When trying to get to value early, data storage can be one of the most difficult hurdles. There are so many unknowns and making a decision can be daunting. By understanding regulatory risks, your long-term plan, and the short term target, you can avoid analysis paralysis. I hope that this blog has helped you feel equipped to make a decision.
Next up we have Data Transformation. Taking our raw, ugly data and making it pretty. Getting closer to our first taste of value. See you next week!
Need Help?
If you're seeking to unlock the full potential of AI within your organization but need help, we’re here for you. Our AI strategies are a no-nonsense way to derive value from AI technology. Reach out. Together we can turn your AI vision into reality.
Mitchell Johnstone
Director of Strategy
Mitch is a Strategic AI leader with 7+ years of transforming businesses through high-impact AI/ML projects. He combines deep technical acumen with business strategy, exemplified in roles spanning AI product management to entrepreneurial ventures. His portfolio includes proven success in driving product development, leading cross-functional teams, and navigating complex enterprise software landscapes.