February 9, 2023

DevTools for language models — predicting the future

Editor's note: 

Anywhere developers go, DevTools reliably follow. In the last few months, a horde of developers has been hacking away at Foundation Models (FMs), and an initial wave of DevTools has cropped up to enable those developers to iterate more quickly and build amazing features. In this post, we'll explore the thriving ecosystem of tools emerging around large language models (LLMs) and dive into what tools exist and why they matter.

In the last few months, it feels like LLMs have crossed into the “Holy $#@!” phase of exponential growth where anyone who hasn’t been looking closely for the last few years probably exclaims, “Holy $#@!” after playing with ChatGPT for the first time.

The side effect of entering the “Holy $#@!” phase is that there has been a tremendous upswell of hackers and entrepreneurs building amazing product experiences on top of (mostly) GPT-3. Discord servers of hackers have filled up, market maps have been made, and an unbelievable number of people have tinkered with ChatGPT.

This influx of developers has led to the advent of LLM DevTools — everything from convenience wrappers around OpenAI APIs, to complex tools for orchestrating multi-step reasoning, to very simple databases of prompt templates. As folks who have been obsessed with exactly two topics for the entirety of our professional careers (machine learning and DevTools), this has obviously caught our attention.

In this post, we’ll cover the state of LLM DevTools today and where we think it's heading.

The emergence of DevTools for LLMs

To understand any DevTools, it’s important to start by understanding what tasks developers are trying to accomplish and what the steps are to accomplish those tasks. Here is a very rough sketch of what building an LLM-powered feature looks like today:
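
At its simplest, a V1 feature is a prompt template, some retrieved context, and a single call to a hosted model; V2 adds labeling, fine-tuning, and monitoring on top (more on both below). Here is a minimal, hypothetical sketch of the V1 piece, shown against OpenAI's completions HTTP endpoint; the prompt, model name, and helper function are illustrative assumptions rather than a recommendation:

```python
import os
import requests

# A hypothetical V1 feature: answer a support question with an LLM.
# The completions endpoint is called directly over HTTP; swap in whichever
# provider and model you actually use.

PROMPT_TEMPLATE = """You are a helpful support assistant.

Context:
{context}

Question: {question}
Answer:"""


def answer_question(question: str, context: str) -> str:
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    resp = requests.post(
        "https://api.openai.com/v1/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "text-davinci-003", "prompt": prompt, "max_tokens": 256},
        timeout=30,
    )
    resp.raise_for_status()
    # The completions endpoint returns generated text under choices[0].text.
    return resp.json()["choices"][0]["text"].strip()


if __name__ == "__main__":
    print(answer_question(
        "How do I reset my password?",
        "Passwords can be reset from Settings > Security.",
    ))
```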

A note about this exact moment in time: V1 is basically all that matters right now. V1 is the thing that lifts your product from being “Cool!” to being “Holy $#@!”. For many applications, a V1 LLM feature can make your product 10x better (or at least cooler).

Meanwhile, V2+ are probably less impactful (right now).

Over time, continually improving models and products may well be the most important differentiator of companies building on top of language models, but we’re still too early in the race to focus on optimization.

What does that mean for DevTools? The majority of development and adoption to date has been in experimentation/prompting tools and data integrations (in particular, vector databases).

(Market map, representative only: there is a lot more development than would fit on your screen!)

Experimentation and Prompting: LLMs have weird APIs — you provide natural language in the form of a “prompt” and get a probabilistic response back. It turns out that mastering this API requires a lot of tinkering and experimentation – to solve a new task, you’ll probably need to try a lot of different prompts (or chains of prompts) to get the answer you’re looking for. Simply getting comfortable with the probabilistic nature of LLM output takes time — you’ll need to do extensive testing to understand the boundary cases of your prompt.

Tools have emerged to help jumpstart and manage this experimentation process. In particular, LangChain and GPT Index have exploded in popularity and use.

One good way to gain a lot of traction is to be the first tool someone needs to adopt an important new technology! Nothing beats selling picks during a gold rush. Time will tell whether these frameworks evolve into fundamental, lasting platforms or are just the initial simplification needed to kickstart development. Either way, they are playing a critical role in the ecosystem right now – inspiring amazing new use cases and features.
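
For a flavor of what these frameworks provide, here is a chain in the style of LangChain's early quickstart examples (module paths and class names have shifted across versions, so treat this as illustrative rather than definitive):

```python
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# A prompt template with a single input variable...
prompt = PromptTemplate(
    input_variables=["product"],
    template="What is a good name for a company that makes {product}?",
)

# ...wired to a model behind a single "chain" object.
llm = OpenAI(temperature=0.9)  # reads OPENAI_API_KEY from the environment
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run("colorful socks"))
```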

Knowledge Retrieval and Vector Databases: Part of the process of building a great prompt involves feeding the language model relevant “context” (e.g., related documents or the results of a Google search). This context is particularly important for stopping LLMs from “hallucinating” – with good context, the model will extract the correct information from documents instead of making it up. This essentially boils down to giving LLMs “memory” – something that current generations of models do not have by default.

One of the most effective ways to find relevant context is to look up documents that are semantically similar to the task you’re trying to solve. An important capability of LLMs is producing embeddings — dense representations of language that capture semantic information. Vector databases have emerged as the most obvious and performant way to retrieve “similar” documents by enabling similarity searches over LLM embeddings.

Essentially all of the popular vector databases (Qdrant, Pinecone, Weaviate, Milvus) treat this workflow as a first-class citizen — and although they are probably closer to “core infrastructure” than DevTools, they play a critical role in the development journey of LLM applications.
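
To illustrate the idea without tying it to any particular vendor, here is a minimal in-memory sketch of retrieval-augmented prompting; the `embed` function is a toy stand-in for a real embedding model, and in production the document vectors would live in one of the databases above:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy hashed bag-of-words vector so the example runs end to end; replace
    # with a real embedding model (e.g., your provider's embeddings endpoint).
    vec = np.zeros(128)
    for token in text.lower().split():
        vec[hash(token) % 128] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

documents = [
    "Refunds are issued within 5 business days.",
    "Password reset links expire after 24 hours.",
    "We support single sign-on via SAML and OIDC.",
]
doc_vectors = [embed(d) for d in documents]

def build_prompt(question: str, k: int = 2) -> str:
    q_vec = embed(question)
    scores = [cosine(v, q_vec) for v in doc_vectors]
    # Keep the k most similar documents as context for the model.
    top_docs = [doc for _, doc in sorted(zip(scores, documents), reverse=True)[:k]]
    context = "\n".join(top_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("How long do password reset links last?"))
```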

Building V2 of LLM Features

Teams that have already scaled V1 of their LLM features are now confronting the next set of challenges in maintaining and improving intelligent features.

(Market map, representative only: in particular, there are a lot of labeling and fine-tuning solutions out there.)

Labeling: The fundamentals of data labeling haven’t changed much in response to LLMs, but the priorities of labeling certainly have. When fine-tuning language models, being precise about which examples to label and use is more important than simply labeling a large volume of data. Many of the most mature labeling solutions are well-positioned to help companies looking to fine-tune LLMs to improve their products, but there is room to improve those products to help efficiently select the right examples to label and use. We expect this category to evolve rapidly as demand increases and to incorporate even more effective tools for self-supervised labeling.
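
As a hypothetical sketch of what “selecting the right examples” can look like in practice, one simple heuristic is to prioritize production requests where the model reported low confidence or where a user flagged the output, rather than labeling a random sample:

```python
from dataclasses import dataclass

@dataclass
class LoggedRequest:
    prompt: str
    completion: str
    mean_logprob: float   # average token log-probability reported by the model
    user_flagged: bool    # explicit negative feedback, if you collect it

def labeling_queue(requests: list[LoggedRequest], budget: int) -> list[LoggedRequest]:
    # Flagged outputs first, then the least-confident completions, up to the
    # number of examples the team can afford to label.
    flagged = [r for r in requests if r.user_flagged]
    uncertain = sorted(
        (r for r in requests if not r.user_flagged),
        key=lambda r: r.mean_logprob,
    )
    return (flagged + uncertain)[:budget]
```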

Fine-Tuning: Fine-tuning language models can serve two critical functions – improving accuracy and reducing inference latency. There are two really common patterns that have emerged:

  1. Fine-tuning the largest language models to improve performance when accuracy is critical
  2. Fine-tuning smaller language models to achieve similar accuracy as larger models while achieving lower latency and cost

To date, most of the fine-tuning market has been captured by the model providers (mostly OpenAI). We’d expect a lot of competitors to enter this space, in particular if improvements to the underlying LLMs show any signs of slowing down. For many applications, inference latency is key – there will be a lot of use cases where minimizing latency is worth a significant investment.
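
As an illustration of the data-preparation side of fine-tuning, here is a minimal sketch that writes labeled examples into the prompt/completion JSONL format used by hosted fine-tuning APIs around the time of writing (formats and separators are provider-specific, so check current docs):

```python
import json

# Labeled examples, e.g. selected via the labeling queue sketched above.
labeled_examples = [
    {"question": "How do I reset my password?",
     "answer": "Go to Settings > Security and click 'Reset password'."},
    {"question": "Do you support SSO?",
     "answer": "Yes, we support SAML and OIDC single sign-on."},
]

with open("finetune.jsonl", "w") as f:
    for ex in labeled_examples:
        record = {
            "prompt": f"Question: {ex['question']}\nAnswer:",
            "completion": " " + ex["answer"],  # leading space per common guidance
        }
        f.write(json.dumps(record) + "\n")
```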

Monitoring, Observability, and Testing: There are some unique challenges in understanding and managing the performance of language models – most acutely, it can be very challenging to measure the “performance” of an LLM feature. To understand how “good” generated content is, you’ll need to actually measure how users are interacting with that content. That often means A/B tests and comprehensive product analytics just to assess performance. Most teams with LLM features are still checking the results with an eye test – across some number of tests, do the results look “good”? As adoption of open-source LLMs becomes more ubiquitous, testing and comparison frameworks like HELM will become more and more important.

Answering questions about LLM performance is critical for multiple reasons. Obviously, performance directly impacts UX, but it also impacts how easily you can decide when it makes sense to switch to a smaller and cheaper model or when to perform fine-tuning. 
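
As a sketch of the simplest useful readout, the snippet below (with made-up variant names) logs which model variant served each request and whether the user accepted the output, then compares acceptance rates per variant; real setups layer significance testing and product analytics on top:

```python
from collections import defaultdict

# Hypothetical event log: one record per LLM-served request.
events = [
    {"variant": "baseline-large-model", "accepted": True},
    {"variant": "baseline-large-model", "accepted": False},
    {"variant": "fine-tuned-small-model", "accepted": True},
    {"variant": "fine-tuned-small-model", "accepted": True},
]

totals, accepts = defaultdict(int), defaultdict(int)
for event in events:
    totals[event["variant"]] += 1
    accepts[event["variant"]] += int(event["accepted"])

for variant in totals:
    rate = accepts[variant] / totals[variant]
    print(f"{variant}: {rate:.0%} accepted ({totals[variant]} requests)")
```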

Where is this headed, and how quickly?

Let’s finish with some comparisons to the last wave of machine learning DevTools to make guesses about where we might be headed next and how long it might take to get there.

Diving in, LangChain reminds us of the early ML frameworks.

Each was, in its day, the first tool you needed to train a model, and both Caffe and (later) TensorFlow became very popular frameworks in their early years. Caffe was the premier framework for building real-world ML models for a few years before TensorFlow (and its much more popular Python bindings) took over.

These days, PyTorch has taken over as the most popular ML framework for new developers, though TensorFlow still holds a large market share. PyTorch won because it offered a much better developer experience than TensorFlow: better usability, better documentation, and a thriving community. One thing LangChain has done well is to provide great examples for its users, which has had a huge impact on its adoption so far.

What does this mean for the current generation of prompting tools? A few shots in the dark:

  1. We’re in the early days of experimentation with LLMs.
    We expect numerous frameworks to emerge that solve this need over the next few years. It takes time to iterate on developer experience, and until developers see all of the possibilities, it will be impossible to pick a winner.
  2. New model versions will likely drive this innovation.
    There has been plenty of speculation about whether “prompt engineering” will go away as models improve – we think there will always be ways to improve performance with good prompting. That said, as the models change, the “API” of prompting will also likely shift, creating an opportunity for new prompting frameworks.

And one last prediction – getting started with LLMs is much easier than getting started with the last wave of ML.

Speaking from experience, enterprise adoption of the last wave of machine learning has been slow. The market for MLOps infrastructure is still immature, often because of how much complexity there is before getting V1 of an ML feature out the door. A lot of that complexity can’t be automated away – if you don’t have data, you cannot participate.

Building with LLM APIs is fundamentally different. Nearly any developer can build LLM features, and they don’t need to capture a ton of data first. As a result, this market will mature very quickly. The same buzz surrounding prompting frameworks will quickly grow around downstream tools like labeling, fine-tuning, model management, monitoring, testing, and more. There is an incredible opportunity to build DevTools right now!

One thing to note: building large language models is incredibly difficult. The challenges of building ML from scratch are amplified dramatically when building multi-billion-parameter models. It used to be plenty challenging to do data-distributed training; model-distributed training is way harder. This post doesn’t scratch the surface of the infrastructure needed to train and operate LLMs from the ground up – that's worth its own follow-up.

Conclusion

As machine-learning and DevTools nerds, this is the most excited we’ve ever been for the ML DevTools ecosystem. Language models have lowered the barrier to entry for teams trying to build intelligent features. An unbelievable number of products (whether new startups or incumbents) will look to adopt features built with LLMs in the coming years.

As operators who devoted effort (and some gray hairs) through the last wave of DevTools that enabled machine learning, we can’t wait to partner with engineers who will build the tools that unlock LLMs for everyone else.

If you’re building a data or ML infrastructure company, let's connect: David Hershey, Diego M. Oppenheimer.

Special thanks to Willem Pienaar and Tristan Zajonc for their technical expertise and review of this blog.
