In Part 1, we discussed a few common pitfalls founders should avoid when pitching investors. In this part, I'll cover a few areas within the AI space that we’re excited about at Unusual:
While there are still significant difficulties in productionizing ML models—including in more advanced deployment workflows like autoscaling, multi-armed bandits, and canary deployments—the majority of startups to date have been focused on the left-hand side of the model development lifecycle. Companies like DataRobot, Dataiku, H2O.ai, and others have tackled AutoML, data ingestion, feature extraction, and other components of deploying ML, but there are numerous areas after models have been productionized that still require significant tooling.
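As a concrete example of one of those deployment workflows, a multi-armed bandit can route live traffic between competing model versions, shifting load toward whichever version earns better feedback. The sketch below is illustrative only; the version names and the reward signal (e.g., a click or a correct prediction) are hypothetical:

```python
import random

class BanditRouter:
    """Epsilon-greedy routing between deployed model versions (a sketch;
    the arm names and reward signal are hypothetical)."""

    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {a: 0 for a in self.arms}
        self.rewards = {a: 0.0 for a in self.arms}

    def choose(self):
        # Explore a random arm with probability epsilon; otherwise exploit
        # the arm with the best observed mean reward (untried arms first).
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        def mean(a):
            return self.rewards[a] / self.counts[a] if self.counts[a] else float("inf")
        return max(self.arms, key=mean)

    def update(self, arm, reward):
        # Record observed feedback for the version that served the request.
        self.counts[arm] += 1
        self.rewards[arm] += reward
```

In a canary-style rollout, the same loop gradually concentrates traffic on the stronger version without a hard cutover.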
As models proliferate and grow more complex, and as companies scale to more customer and user segments, the chance of feature drift and model decay increases. Companies that can monitor across different types of ML models with varying outputs, detect concept drift within specific subsets of users or customers, and classify the likelihood that a model prediction is valid are growing in importance.
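One simple drift signal is the population stability index (PSI), which compares a feature's distribution at training time against what the model sees in production. This pure-Python version is a sketch; the ~0.2 alert threshold is a common rule of thumb, not a universal standard:

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a training-time feature sample
    and a production sample; values above ~0.2 commonly flag drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(sample):
        # Fraction of the sample falling in each bin (eps avoids log(0)).
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        return [c / len(sample) + eps for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A monitoring product would run a check like this per feature, per customer segment, which is exactly where the subset-level monitoring described above becomes valuable.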
Deploying deep learning models to new platforms (e.g., IoT, edge, phones, etc.) is often very difficult due to the dependence on specific libraries or GPUs and due to the size of ML models. Model quantization—storing tensors so that operations execute on integers rather than floating-point values—is one example of a model compression technique for hardware compatibility. Apache TVM, a deep learning compiler, is an example of software acting as an intermediary stack to bridge the divide between underlying hardware/infrastructure and deep learning models. There are also frameworks specifically optimized for enabling deep learning on mobile devices, and tools for providing distributed auto-scaling GPUs on demand. These startups fit the overall theme of enabling Data Scientists to focus on what they do best—building models—and not worry about underlying infrastructure/platform improvements, particularly as the prevalence of deep learning, edge, and IoT continues to increase along with the variety of hardware that will need to be supported.
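To make quantization concrete, here is a minimal NumPy sketch of post-training affine quantization to int8. Real toolchains (PyTorch, TVM, TFLite) also calibrate activations and fuse operators, which this omits:

```python
import numpy as np

def quantize_int8(w):
    """Affine (asymmetric) quantization of a float tensor to int8.
    Returns (quantized tensor, scale, zero_point)."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0           # map the float range onto 256 levels
    zero_point = round(-128 - lo / scale)      # integer that represents float 0.0
    q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover an approximation of the original floats.
    return (q.astype(np.float32) - zero_point) * scale
```

The payoff is a 4x size reduction versus float32 weights, at the cost of a bounded per-element error of roughly one quantization step.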
AutoML is the process of automating the ML pipeline. These tools take a dataset, run various combinations of ML models and hyperparameters on it, and provide an endpoint for deployment purposes. They vary in their target user—some are optimized for data scientists and quick experimentation, some for non-DS engineers (Low-Code), and others for those with no prior coding knowledge (No-Code). No-Code solutions typically have the user upload a CSV and define a target variable column, and the solution creates a model using permutations of the columns as features. These solutions are commonly marketed toward sales and marketing teams as their beachheads, but the premise is to enable applications of AI throughout an organization or at companies that lack DS expertise.
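The AutoML loop described above can be sketched with scikit-learn. The candidate models, grids, and synthetic dataset here are illustrative, not what any particular vendor runs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy stand-in for the user's uploaded CSV: features X, target column y.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The core AutoML loop: try model/hyperparameter combinations, keep the best.
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
]
best_score, best_model = -1.0, None
for model, grid in candidates:
    search = GridSearchCV(model, grid, cv=3).fit(X_tr, y_tr)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

holdout = best_model.score(X_te, y_te)  # the winner would back the served endpoint
```

A commercial product wraps this loop in a UI and serves `best_model` behind an endpoint; the selection logic is essentially the same.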
Another subset of this category is enabling Data Scientists themselves to more easily provide ML capabilities to other parts of the organization. Typically, it can take significant resources from ML Engineers, DevOps, and other departments to take a model created by a Data Scientist and deploy it into production or create the necessary integrations between the model and other products. However, there are solutions that allow Data Scientists to easily embed their models in applications, or that allow non-Data-Scientists to take models created by their DS department and incorporate them into their work. Streamlit is one example, where Data Scientists can easily create interactive applications on top of their models with a simple API and a Python script that deploys seamlessly.
I’ve already written about developments in data tooling like Workflow Orchestration Solutions. A subset of workflow orchestrators focused on AI will continue to gain traction in allowing Data Scientists to create reproducible AI workflows, such as our investment in Arrikto, which enables adoption of Kubeflow (ML on Kubernetes), or Metaflow, developed at Netflix.
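At its core, a workflow orchestrator runs the steps of a DAG in dependency order and records each step's output so the run is reproducible. This toy sketch, with hypothetical ingest/featurize/train steps, uses only the standard library:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

results = {}  # step name -> cached output, recorded for reproducibility

def ingest():
    return list(range(10))

def featurize():
    return [x * x for x in results["ingest"]]

def train():
    return sum(results["featurize"]) / len(results["featurize"])

# Each step declares its upstream dependencies, forming a DAG.
steps = {
    "ingest": (ingest, set()),
    "featurize": (featurize, {"ingest"}),
    "train": (train, {"featurize"}),
}

# Execute steps in a valid dependency order, caching outputs as we go.
graph = {name: deps for name, (_, deps) in steps.items()}
for name in TopologicalSorter(graph).static_order():
    results[name] = steps[name][0]()
```

Systems like Kubeflow Pipelines and Metaflow add versioned artifacts, retries, and distributed execution on top of this same basic loop.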
Furthermore, as I discussed in our Unusual Roundtable on Data Lineage, data discovery (e.g., Data Catalogs), data testing, and data lineage will converge into a unified data platform across the organization. While this platform will impact numerous roles, including non-technical ones, Data Scientists will use it to discover and understand their training data, ensure the data is trustworthy and of high quality (and remains so in production), and trace their AI models' lineage of aggregated features, whether to find additional features to include or to debug and understand a model's predictions. In addition, advancements in data warehouses and the adoption of lakehouse architectures, with technologies like Apache Hudi and Apache Iceberg, will more easily open up greater volumes of data, including streaming data, to data scientists.
Of course, these data infrastructure tools can only power machine learning if accurate labeled data exists in the first place, which is why we invested in Heartex and its immensely popular open-source product Label Studio.
Deep learning will continue to be dominant. We are still in the early phase of generative machine learning, though generative methods such as GANs (Generative Adversarial Networks) have progressed significantly. While the technology is quite hyped and filled with gimmicky use cases at the moment, people will find interesting ways to leverage it in enterprise settings. Right now, GANs are predominantly being used in consumer applications, but we're beginning to see interesting enterprise use cases, such as retailers generating new product designs or artificial images of models wearing merchandise, or pharmaceutical companies using GANs to generate promising new molecules worth testing for cancer treatment. A popular enterprise use case of GANs right now is creating synthetic data, both for compliance purposes (enabling data analysis without touching the actual data) and for increasing the amount of training data available for machine learning.
We will also see deep learning used more outside of traditional internet/business AI settings and incorporated more into real-world settings leveraging complex input and output such as vision, audio, and hardware sensors.
Finally, ML embeddings have become a core part of a Data Scientist's workflow when applying deep learning to unstructured data like text, audio, images, or video. However, creating, manipulating, and maintaining embeddings requires extensive infrastructure and can be costly and resource-intensive. Companies will emerge that offer Embeddings-as-a-Service through an API to make it easier to leverage embeddings in production.
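A minimal sketch of what such a service would host: an index of unit-norm vectors queried by cosine similarity. The `embed` function below is a hash-seeded stand-in for a real embedding model, used only to keep the example self-contained:

```python
import numpy as np

def embed(text, dim=64):
    # Stand-in for a real embedding model: a deterministic (per-process)
    # hash-seeded random projection, normalized to unit length.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class EmbeddingIndex:
    """Minimal in-memory vector index: store embeddings, retrieve the most
    similar stored text by cosine similarity."""

    def __init__(self):
        self.texts, self.vectors = [], []

    def add(self, text):
        self.texts.append(text)
        self.vectors.append(embed(text))

    def nearest(self, query):
        # Unit vectors, so the dot product is the cosine similarity.
        sims = np.array(self.vectors) @ embed(query)
        return self.texts[int(np.argmax(sims))]
```

An Embeddings-as-a-Service vendor would replace `embed` with a hosted model and back the index with an approximate-nearest-neighbor store, but the interface—add vectors, query by similarity—is the same.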
AI is an incredibly exciting space with ample room for innovation. We’re huge believers at Unusual and can’t wait to meet with more entrepreneurs building the AI companies of the future.
Are you a founder building in one of these areas? Message me on Twitter at @jordan_segall or email me at email@example.com.