<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Hannes Hapke</title>
  <subtitle>Hannes Hapke — ML Engineering Leader, co-author of four O&#39;Reilly and Manning books on machine learning, and Google Developer Expert. Writing on production ML, MLOps, and generative AI systems.</subtitle>
  <link href="https://www.hanneshapke.com/feed.xml" rel="self"/>
  <link href="https://www.hanneshapke.com/"/>
  <updated>2026-05-14T19:58:42.477Z</updated>
  <id>https://www.hanneshapke.com/</id>
  <author>
    <name>Hannes Hapke</name>
  </author>
  
  <entry>
    <title>Vibe coding is killing open source, but not how you think</title>
    <link href="https://www.hanneshapke.com/open-source-is-dead.html"/>
    <updated>2026-05-14T00:00:00.000Z</updated>
    <id>https://www.hanneshapke.com/open-source-is-dead.html</id>
    <summary>Vibe coding is great at producing the first commit, but pretty bad at producing the fifth-year contributor. Why open source needs a renaissance — and what companies should do about it.</summary>
    <content type="html">&lt;p&gt;&lt;em&gt;Why we need a renaissance of open source&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A few years ago, I was a contributor to TensorFlow Extended, and once a week, the contributors got on a call. We&#39;d argue about API design, complain about whatever was broken that week, and talk about where the industry was heading. Some of us worked at Google. Some of us worked at companies that competed with Google. None of that mattered for the forty-five minutes we were on the call. Over time, we built something that&#39;s harder to describe than it is to feel: a network of people across companies who actually trusted each other. We met up at conferences. We shared what we were learning. We pushed each other to be better.&lt;/p&gt;
&lt;p&gt;That kind of weekly call still happens in some projects. But the world that produced it is quietly changing, and I don&#39;t think most engineers have noticed yet.&lt;/p&gt;
&lt;p&gt;Today&#39;s internet runs on open source. Linux on the servers, nginx and Apache in front of them, Postgres underneath. Apache Kafka, originally built at LinkedIn, moves data around for basically every SaaS company you&#39;ve used. A long list of open source projects became the foundation for huge commercial ones: Apache Beam shaped Google Cloud Dataflow, Apache Flink powers parts of AWS, and the entire AI stack — PyTorch, TensorFlow, Keras, vLLM — is open source. So the stakes here are not abstract.&lt;/p&gt;
&lt;p&gt;With the rise of vibe coding, it is now faster to spin up a project from scratch for your own use case than to submit a patch to an existing one and let everyone benefit. You can see this in GitHub activity &lt;a href=&quot;https://github.blog/news-insights/company-news/an-update-on-github-availability/&quot;&gt;[1]&lt;/a&gt; &lt;a href=&quot;https://bsky.app/profile/cameron.stream/post/3ml2zgj4ots26&quot;&gt;[2]&lt;/a&gt;: where contributions used to be dominated by changes to existing code, the ratio has flipped toward greenfield projects, which is exactly the kind of work where Claude and Codex shine &lt;a href=&quot;https://news.ycombinator.com/item?id=46142841&quot;&gt;[3]&lt;/a&gt;.&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;/images/open-source-is-dead/bafkreieiu5w47dxxzzp3oekpcggjhq6adnj5ccu4npkylymkc5ig3t2o5u.jpeg&quot; alt=&quot;Kyle Daigle on GitHub platform activity surging&quot;&gt;
  &lt;figcaption&gt;GitHub&#39;s Kyle Daigle on the surge in platform activity. Source: &lt;a href=&quot;https://github.blog/news-insights/company-news/an-update-on-github-availability/&quot;&gt;github.blog&lt;/a&gt;&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;figure&gt;
  &lt;img src=&quot;/images/open-source-is-dead/record-accelleration-1920x1080-2.jpeg&quot; alt=&quot;Record acceleration: pull requests merged, commits, and new repos per month from 2023 to 2026&quot;&gt;
  &lt;figcaption&gt;Record acceleration in pull requests merged, commits, and new repos per month, 2023–2026. Source: &lt;a href=&quot;https://bsky.app/profile/cameron.stream/post/3ml2zgj4ots26&quot;&gt;bsky.app&lt;/a&gt;&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;At the same time, the projects that do still get patches are flooded with PRs from coding agents. Before vibe coding, over 80% of contributions came from one-time committers [4]. Someone hit a bug in their favorite library, fixed it, watched the merge, and moved on. That was fine, because the remaining 20% had a long-term interest in the project, and that was usually enough to keep things going (though it still left a heavy load on core maintainers).&lt;/p&gt;
&lt;p&gt;Now the ratio is worse. A small fraction of contributions come from people with any long-term interest in the project. Some maintainers have responded by banning agent contributions outright — the Zig project, for example &lt;a href=&quot;https://ziglang.org/code-of-conduct/&quot;&gt;[5]&lt;/a&gt;. The reason isn&#39;t that the code is always bad. In some cases it&#39;s actually well structured and even improves the docs. The reason is that something quieter goes missing: mentorship. Junior developers learned the craft by contributing to established projects. Maintainers, often without realizing it, were training their own successors. It was a slow, durable win-win, and it&#39;s the part that doesn&#39;t show up in a commit graph.&lt;/p&gt;
&lt;p&gt;So here&#39;s where we are: quick vibe coding is replacing long-term contribution, and a whole community (a whole mindset, really) is fading with it.&lt;/p&gt;
&lt;p&gt;It isn&#39;t too late to fix this. But to see why it&#39;s worth fixing, it helps to remember what open source actually gives us.&lt;/p&gt;
&lt;p&gt;The web and the AI stack both benefited from open source in ways that are easy to take for granted. When a new attention mechanism drops, patches land in PyTorch and vLLM within hours. That isn&#39;t just technical excellence. It&#39;s trust. A broad community of people, none of whom answer to the same boss, decided the change was worth shipping.&lt;/p&gt;
&lt;p&gt;And it isn&#39;t only the trust that updates will arrive quickly. It&#39;s the trust that you can read every line if you need to. (Less true for open-weight models, but the principle still holds for the surrounding infrastructure.) That trust is quietly carrying a lot of the economy. It&#39;s why you can hand your credit card to a checkout page and assume the nginx config in front of it isn&#39;t doing something hostile. It&#39;s why you can run an ML evaluation and assume the framework isn&#39;t quietly cooking the numbers. Imagine a Volkswagen-style emissions scandal, but for model benchmarks. The reason that hasn&#39;t happened isn&#39;t a clever piece of regulation. It&#39;s that thousands of people have read the code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open source is trust.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;And trust matters more, not less, in a world where tech is increasingly controlled by a handful of companies.&lt;/p&gt;
&lt;p&gt;None of this is anti-AI. I use Claude every day, including for my open source work. It&#39;s great for the boring parts. The problem isn&#39;t agents writing code. The problem is what happens when only agents write code for the projects we all depend on, and the humans who used to mentor each other through those codebases drift away.&lt;/p&gt;
&lt;p&gt;Open source has always been carried by a small core of maintainers and a long tail of people who showed up, learned something, and stuck around. Vibe coding is great at producing the first commit. It&#39;s pretty bad at producing the fifth-year contributor who eventually takes over the project.&lt;/p&gt;
&lt;p&gt;So: fund this work. With headcount. With paid time for your engineers to maintain the libraries your company runs on. A few companies are starting to do this seriously. (Disclosure: I work at one of them, Dataiku, which set up an open source lab last year. I&#39;d rather see ten more.) The specific projects matter less than the pattern: treat the commons like infrastructure, because that&#39;s what it is. For a more in-depth economic argument, this paper &lt;a href=&quot;https://arxiv.org/pdf/2601.15494&quot;&gt;[6]&lt;/a&gt; is worth your time.&lt;/p&gt;
&lt;p&gt;Open source is trust. Trust gets built by humans showing up for each other over years. That&#39;s the part no agent can vibe-code for us.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;GitHub Blog. &lt;a href=&quot;https://github.blog/news-insights/company-news/an-update-on-github-availability/&quot;&gt;&lt;em&gt;An Update on GitHub Availability.&lt;/em&gt;&lt;/a&gt; github.blog&lt;/li&gt;
&lt;li&gt;cameron.stream. &lt;a href=&quot;https://bsky.app/profile/cameron.stream/post/3ml2zgj4ots26&quot;&gt;Post on Bluesky.&lt;/a&gt; bsky.app&lt;/li&gt;
&lt;li&gt;Hacker News. &lt;a href=&quot;https://news.ycombinator.com/item?id=46142841&quot;&gt;Discussion thread 46142841.&lt;/a&gt; &lt;a href=&quot;http://news.ycombinator.com&quot;&gt;news.ycombinator.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Eghbal, Nadia. &lt;em&gt;Working in Public.&lt;/em&gt; Stripe Press, 2020.&lt;/li&gt;
&lt;li&gt;Zig Project. &lt;a href=&quot;https://ziglang.org/code-of-conduct/&quot;&gt;&lt;em&gt;Code of Conduct.&lt;/em&gt;&lt;/a&gt; &lt;a href=&quot;http://ziglang.org&quot;&gt;ziglang.org&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/2601.15494&quot;&gt;Paper on arXiv. ID 2601.15494.&lt;/a&gt; &lt;a href=&quot;http://arxiv.org&quot;&gt;arxiv.org&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
</content>
  </entry>
  
  <entry>
    <title>Kiji Privacy Proxy™ - Protecting Your Data in the Age of Generative AI</title>
    <link href="https://www.hanneshapke.com/kiji-privacy-proxy.html"/>
    <updated>2026-04-27T00:00:00.000Z</updated>
    <id>https://www.hanneshapke.com/kiji-privacy-proxy.html</id>
    <summary>Introducing Kiji Privacy Proxy, an open-source gateway that automatically detects and redacts PII before it reaches external LLM APIs, letting enterprises use generative AI without exposing sensitive data.</summary>
    <content type="html">&lt;p&gt;&lt;em&gt;Originally published on the &lt;a href=&quot;https://www.dataiku.com/stories/blog/kiji-privacy-proxy&quot;&gt;Dataiku Blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Every time you type a prompt into ChatGPT, Claude, or any other LLM-powered service, you&#39;re sending data to an external server. For casual questions, that&#39;s fine. But in enterprise settings, those prompts often contain customer names, email addresses, social security numbers, medical records, financial details, and internal business data that should never leave your environment.&lt;/p&gt;
&lt;p&gt;This isn&#39;t a hypothetical risk. Regulations like GDPR, HIPAA, and CCPA impose real penalties on organizations that fail to protect personal data, and sending PII to a third-party API without proper safeguards can constitute a violation. A 2026 Dataiku/Harris Poll study of 600 CIOs found that 85% have seen AI projects delayed or blocked entirely due to gaps in traceability or explainability, and privacy concerns are a major part of that picture.&lt;/p&gt;
&lt;p&gt;The challenge is clear: Enterprises want the productivity gains of generative AI, but they can&#39;t afford to expose sensitive data in the process. Until now, the main options have been to either avoid using external AI services altogether (losing the benefits), build expensive custom infrastructure, or simply accept the risk and hope for the best.&lt;/p&gt;
&lt;p&gt;None of those options is good enough.&lt;/p&gt;
&lt;h2&gt;Why Kiji Privacy Proxy™&lt;/h2&gt;
&lt;p&gt;Operating as a transparent gateway between your local applications and external AI APIs, Kiji Privacy Proxy™ ensures you don&#39;t have to compromise your workflow or abandon powerful AI tools. By sitting directly within your network, Kiji automatically identifies and redacts personally identifiable information (PII) before any data is transmitted, allowing you to leverage generative AI without having to trust third-party servers with your sensitive information.&lt;/p&gt;
&lt;p&gt;Here&#39;s how it works: Your app sends its request to Kiji Privacy Proxy instead of directly to a service like OpenAI or Anthropic (alternatively, Kiji can intercept outgoing requests). Kiji runs each request through an ML-powered PII detection model and replaces any sensitive data, such as emails, phone numbers, credit card numbers, SSNs, IP addresses, and 16+ other PII types, with realistic dummy values. The masked request goes out to the API. When the response comes back, Kiji restores the original values so your application works exactly as expected.&lt;/p&gt;
&lt;p&gt;The result: The AI model never sees your real data, but your application behaves as though nothing changed.&lt;/p&gt;
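&lt;p&gt;The mask-and-restore round trip described above can be illustrated with a minimal sketch. It is not Kiji&#39;s implementation: the regular expression stands in for the ML-based detector, and the names &lt;code&gt;mask&lt;/code&gt;, &lt;code&gt;restore&lt;/code&gt;, and the &lt;code&gt;user1@example.com&lt;/code&gt; dummy format are hypothetical choices made for illustration only.&lt;/p&gt;

```python
import re

# Hypothetical, simplified stand-in for a PII detector: a regex that matches
# email addresses only. Kiji itself uses an ML model, not this pattern.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask(text):
    """Replace each detected email with a dummy value; return text and mapping."""
    mapping = {}  # dummy value -> original value

    def substitute(match):
        original = match.group(0)
        # Reuse the same dummy for repeated occurrences of the same value.
        for dummy, known in mapping.items():
            if known == original:
                return dummy
        dummy = f"user{len(mapping) + 1}@example.com"
        mapping[dummy] = original
        return dummy

    return EMAIL_PATTERN.sub(substitute, text), mapping


def restore(text, mapping):
    """Put the original values back into a response that mentions the dummies."""
    for dummy, original in mapping.items():
        text = text.replace(dummy, original)
    return text


prompt = "Draft a reply to jane.doe@acme.io and cc jane.doe@acme.io."
masked_prompt, mapping = mask(prompt)

# Stand-in for the external LLM call: the fake response mentions the dummy value.
response = "Reply drafted for " + next(iter(mapping))
restored = restore(response, mapping)

print(masked_prompt)
print(restored)
```

&lt;p&gt;The mapping is what makes the round trip work: the external service only ever sees the dummy value, while the caller gets its original data back in the response.&lt;/p&gt;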
&lt;p&gt;What makes Kiji particularly practical is how little friction it introduces. On macOS, it runs as a native desktop app with automatic proxy configuration. We also provide a Chrome extension that routes web requests through Kiji without any environment variables or code changes. On Linux, it runs as a standalone server. In both cases, latency stays under 100 milliseconds for most requests, and all PII detection happens locally with no external API calls.&lt;/p&gt;
&lt;p&gt;Kiji is open source under the Apache 2.0 license, and both the trained model and its training dataset are published on HuggingFace (&lt;a href=&quot;https://huggingface.co/DataikuNLP/kiji-pii-model-onnx&quot;&gt;DataikuNLP/kiji-pii-model-onnx&lt;/a&gt; and &lt;a href=&quot;https://huggingface.co/DataikuNLP/kiji-pii-training-data&quot;&gt;DataikuNLP/kiji-pii-training-data&lt;/a&gt;), so you can inspect, reproduce, and extend everything.&lt;/p&gt;
&lt;p&gt;The Kiji Privacy Proxy is powered by a base model (developed by Dataiku&#39;s 575 Lab) that attained a 94% F1 score on the industry benchmark dataset. This result is highly competitive when compared to similar models in the field.&lt;/p&gt;
&lt;h2&gt;A Collaboration Between Forward-Thinking ML Companies&lt;/h2&gt;
&lt;p&gt;Kiji Privacy Proxy doesn&#39;t exist in a vacuum. It&#39;s part of a broader vision where specialized companies across the ML ecosystem each contribute what they do best, and the result is greater than the sum of its parts. Built by Dataiku&#39;s 575 Lab, Kiji draws on and connects with the work of several outstanding partners.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dataiku&lt;/strong&gt; — The company that created Kiji brings over a decade of enterprise AI experience and recently launched the 575 Lab as its Open Source Office, dedicated to building deployable tools for AI transparency, privacy, and governance. Kiji is one of the Lab&#39;s first releases, alongside agent explainability tools. As a member of the Linux Foundation and the Agentic AI Foundation, Dataiku, the Platform for AI Success, is committed to building these capabilities in the open.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Outerbounds&lt;/strong&gt; — The company behind Metaflow, the open-source ML infrastructure stack originally built at Netflix, provides state-of-the-art infrastructure that makes complex ML workflows manageable. For teams that want to integrate Kiji&#39;s PII detection into larger ML pipelines, train custom models, orchestrate data flows, and manage deployment at scale, Outerbounds&#39; infrastructure-as-code approach is a natural complement.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;HumanSignal&lt;/strong&gt; — The creators of Label Studio, the world&#39;s most popular open-source data labeling tool used by over 350,000 researchers, play a critical role in the data quality side of the equation. Kiji&#39;s ML model is only as good as its training data, and for organizations that need to customize PII detection for their specific domain (think medical record formats, industry-specific identifiers, or non-English PII patterns), Label Studio provides the labeling infrastructure to build and refine those custom datasets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Doubleword&lt;/strong&gt; — The inference provider for high-volume workloads, founded by researchers from Oxford University who pioneered techniques in model optimization, completes the picture on the deployment side. Doubleword&#39;s inference platform offers open-source model inference at a fraction of the cost of other providers, making it well-suited for high-volume workloads such as data and document processing, as well as async agents. In this case, Doubleword models were used to generate large volumes of synthetic data at a cost of only $50 — just five percent of what comparable models from closed-source providers would have cost.&lt;/p&gt;
&lt;h2&gt;Make Your Domain-Specific Kiji Privacy Proxy&lt;/h2&gt;
&lt;p&gt;One of the most powerful aspects of Kiji is its design for customization. The default model handles common PII types well, but every industry has unique data patterns that a generic model won&#39;t catch, such as pharmaceutical compound identifiers, internal project codes, proprietary customer reference numbers, and jurisdiction-specific ID formats.&lt;/p&gt;
&lt;p&gt;Kiji&#39;s architecture makes it straightforward to build your own domain-specific privacy proxy. The training data and model are fully open on HuggingFace. With Doubleword&#39;s batch inference platform, you can generate your own large synthetic datasets. You can use Label Studio (by HumanSignal) to annotate the domain-specific, synthetic PII examples. And you can orchestrate the training pipeline with Metaflow (by Outerbounds) on whatever compute you need.&lt;/p&gt;
&lt;p&gt;This is what collaboration across the ML ecosystem looks like in practice: not a single monolithic product, but a set of interoperable, open tools built by companies that deeply understand their piece of the puzzle. Together, they give enterprises the building blocks to protect their data without sacrificing the transformative potential of generative AI.&lt;/p&gt;
&lt;h2&gt;Star the Repo on GitHub and Try It Out&lt;/h2&gt;
&lt;p&gt;Read the full post on the &lt;a href=&quot;https://www.dataiku.com/stories/blog/kiji-privacy-proxy&quot;&gt;Dataiku blog&lt;/a&gt;.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Speculative Decoding with vLLM using Gemma</title>
    <link href="https://www.hanneshapke.com/speculative-decoding-gemma.html"/>
    <updated>2025-02-28T00:00:00.000Z</updated>
    <id>https://www.hanneshapke.com/speculative-decoding-gemma.html</id>
    <summary>Improving LLM inference latency with speculative decoding using Gemma</summary>
    <content type="html">&lt;p&gt;When deploying large language models in production environments, latency optimization is crucial. This is particularly important for real-time applications like chatbots and conversational interfaces. While complex tasks often require larger LLMs (70 billion+ parameters), users still expect response times similar to smaller models. This challenge has led the machine learning community to continuously explore new ways to improve LLM latency.&lt;/p&gt;
&lt;p&gt;One of the most promising techniques is speculative decoding, which reduces the latency of a language model by predicting multiple tokens at a time with a smaller model and using a larger model to validate the predictions.&lt;/p&gt;
&lt;h2&gt;Problem&lt;/h2&gt;
&lt;p&gt;When serving large language models in production, you need to lower the latency. In fact, latency optimization is the biggest challenge that developers will face in production LLM systems. This challenge becomes particularly acute in real-time scenarios, where users expect near-instantaneous interactions with chatbots, code completion tools, and other AI-powered interfaces.&lt;/p&gt;
&lt;p&gt;At the heart of this challenge lies a fundamental characteristic of auto-regressive models: their sequential nature of text generation. Unlike many computational processes that can benefit from parallel processing, these models face an architectural constraint that proves to be their primary performance bottleneck. To generate any given token K, the model must first process and consider all preceding tokens, from 1 to K-1, in sequential order. This dependency chain, which provides context for the text generation, creates a processing pipeline that cannot be easily parallelized.&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;/images/speculative_decoding/llm-token-generation.png&quot; alt=&quot;Sequential token generation&quot;&gt;
  &lt;figcaption&gt;Figure 1: Sequential token generation&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Consider the process illustrated in Figure 1, where each token&#39;s generation depends on the complete history of previous tokens. This sequential dependency isn&#39;t merely a technical limitation—it&#39;s a fundamental aspect of how these models understand and generate human-like text.&lt;/p&gt;
&lt;p&gt;The situation becomes even more challenging when we scale up to larger models, particularly those with 3 billion or more parameters. These massive models, while offering superior capabilities in terms of reasoning, understanding, and generation, exact a significant performance penalty. Each token prediction requires more computational resources, as the model must process its vast parameter space for every single token generation step. The result is a compounding effect: not only must we handle the sequential nature of token generation, but each step in that sequence now takes longer due to the model&#39;s size.&lt;/p&gt;
&lt;p&gt;Yet, despite these performance challenges, larger models remain indispensable for many applications. They excel at complex tasks that smaller models struggle with, such as multi-step reasoning, code generation, and nuanced understanding of context. They also produce higher-quality text with fewer artifacts and better coherence. This creates a tension between the need for sophisticated model capabilities and the practical requirements of production deployment.&lt;/p&gt;
&lt;p&gt;In this solution, we will use speculative decoding to improve the performance of LLM deployments. It will allow us to improve the model latency without changing the model architecture, training data, or the trained model itself.&lt;/p&gt;
&lt;h2&gt;Solution&lt;/h2&gt;
&lt;p&gt;Speculative decoding is an optimization technique that leverages two distinct language models to improve generation speed while maintaining output quality. The approach uses a teacher-student architecture where two complementary models work together: a large, sophisticated language model (LLM) that produces highly accurate outputs but is computationally expensive and relatively slow, serving as the teacher model and ground truth for token generation; and a smaller, more efficient language model that operates faster but may be less accurate, acting as the student model. The student model is specifically trained to emulate the behavior of the teacher model – for example, a 3 billion parameter model might be trained to imitate a 405 billion parameter model.&lt;/p&gt;
&lt;h3&gt;The Inference Process&lt;/h3&gt;
&lt;p&gt;During text generation, the process follows a specific workflow. The student model begins by rapidly proposing a sequence of tokens based on its training to imitate the teacher model&#39;s behavior. Following this initial prediction, the teacher model evaluates the student&#39;s proposed tokens in parallel, verifying whether it would have generated the same sequence. The outcome of this validation determines the next steps: if the teacher model agrees with the student&#39;s predictions, the sequence is accepted and immediately output; however, if the teacher model disagrees, it falls back to its standard token-by-token generation process to ensure accuracy.&lt;/p&gt;
&lt;h3&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The fundamental principle behind speculative decoding is that not all tokens require the computational power of a large model for accurate generation. Token difficulty varies significantly – simple, predictable tokens like common words or obvious completions can be reliably generated by the smaller student model, while complex or context-dependent tokens benefit from the teacher model&#39;s advanced capabilities. This selective use of computational resources allows for significant speed improvements while maintaining the quality standards of the larger model. The approach is particularly effective because it balances the trade-off between speed and accuracy by dynamically choosing the appropriate model based on the complexity of the current generation task.&lt;/p&gt;
&lt;p&gt;Here is an example of how speculative decoding would play out for a sequence of tokens:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;Step 1:
Student: &amp;quot;The [talented] [chef]&amp;quot;
Teacher: ✓ Accepts (common phrase)

Step 2:
Student: &amp;quot;cooked [a] [delicious]&amp;quot;
Teacher: ✓ Accepts (common food context)

Step 3:
Student: &amp;quot;[soup]&amp;quot;
Teacher: ✗ Rejects
Teacher generates: &amp;quot;bouillabaisse&amp;quot; (rare, specific word)

Step 4:
Student: &amp;quot;[for] [dinner]&amp;quot;
Teacher: ✓ Accepts (common ending)
&lt;/code&gt;&lt;/pre&gt;
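&lt;p&gt;The trace above can be reproduced with a toy sketch of the propose-and-verify loop. This is a simplified greedy variant written for illustration, not how vLLM implements it: the two &quot;models&quot; are hypothetical functions that return the next word of a fixed sentence, the student proposes three tokens per round rather than two, and the teacher is called once per position instead of scoring all proposals in one parallel pass.&lt;/p&gt;

```python
def speculative_step(target_next, draft_next, tokens, k):
    """One round: the draft proposes k tokens, the target keeps the agreeing prefix."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(tokens + proposal))

    accepted = []
    for guess in proposal:
        # A real system scores all k positions in one parallel forward pass;
        # this toy version calls the target once per position.
        expected = target_next(tokens + accepted)
        if guess != expected:
            accepted.append(expected)  # the target's correction replaces the miss
            break
        accepted.append(guess)
    return accepted


# Toy "models": both continue a fixed sentence, but the draft gets one word wrong.
SENTENCE = "The talented chef cooked a delicious bouillabaisse for dinner".split()


def target_next(tokens):
    return SENTENCE[len(tokens)]


def draft_next(tokens):
    word = SENTENCE[len(tokens)]
    return "soup" if word == "bouillabaisse" else word


tokens = ["The"]
rounds = 0
while len(tokens) != len(SENTENCE):
    remaining = len(SENTENCE) - len(tokens)
    tokens += speculative_step(target_next, draft_next, tokens, min(3, remaining))
    rounds += 1

print(" ".join(tokens))
print(f"{rounds} rounds for {len(SENTENCE) - 1} generated tokens")
```

&lt;p&gt;Because the teacher only has to step in once, the eight generated tokens need three verification rounds instead of eight sequential teacher steps.&lt;/p&gt;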
&lt;p&gt;How does speculative decoding improve the inference speed?&lt;/p&gt;
&lt;p&gt;The speedup comes from three sources. First, the proposal tokens are generated by the smaller LLM, which is fast and cheap. Second, because a second model validates the predictions, we can afford to request several tokens at once. Third, validating the proposals is also fast, and we only need to correct the tokens where the smaller model made a mistake.&lt;/p&gt;
&lt;p&gt;How can you use speculative decoding with your LLM? Most LLM deployment frameworks support speculative decoding in one form or another. For our core example, we demonstrate speculative decoding with vLLM, a frequently used framework for serving LLMs like Llama 3.2 3B. We use Google&#39;s &lt;code&gt;gemma-2-2b-it&lt;/code&gt; model to propose the next tokens and the larger &lt;code&gt;gemma-2-9b-it&lt;/code&gt; model to validate and, if needed, correct them. The sampling of the tokens, checking them with the larger model, and correcting them is done under the hood by the LLM serving framework, in our case vLLM.&lt;/p&gt;
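&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
from vllm import LLM, SamplingParams

prompts = [
    &amp;quot;The future of AI is&amp;quot;,
]
# Sampling settings for the generated text.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# The 9B model is the target (verifier); the 2B model drafts up to five tokens
# per step. Newer vLLM releases may expect these two settings inside a
# speculative_config dictionary instead, so check the docs for your version.
llm = LLM(
    model=&amp;quot;google/gemma-2-9b-it&amp;quot;,
    tensor_parallel_size=1,
    speculative_model=&amp;quot;google/gemma-2-2b-it&amp;quot;,
    num_speculative_tokens=5,
)
outputs = llm.generate(prompts, sampling_params)

# Print each prompt next to the text generated for it.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f&amp;quot;Prompt: {prompt!r}, Generated text: {generated_text!r}&amp;quot;)

&lt;/code&gt;&lt;/pre&gt;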
&lt;p&gt;Comparing the latency of the same model with and without speculative decoding, speculative decoding is roughly 35% faster, as shown in Figure 2.&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;/images/speculative_decoding/speculative_decoding_comparison.png&quot; alt=&quot;Latency comparison&quot;&gt;
  &lt;figcaption&gt;Figure 2: Comparison of latency between standard and speculative decoding approaches. Speculative decoding shows a 35% improvement in processing time.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Trade-offs and Alternatives&lt;/h2&gt;
&lt;p&gt;The faster inference doesn&#39;t come without downsides. In this section, we discuss the main trade-offs of speculative decoding.&lt;/p&gt;
&lt;h3&gt;Sequence Length&lt;/h3&gt;
&lt;p&gt;The relationship between the length of the speculated sequence and the performance gain presents an interesting trade-off. Generally, longer speculated sequences tend to yield higher speedups, which is why we set &lt;code&gt;num_speculative_tokens=5&lt;/code&gt; in our example. This parameter allows the smaller model to predict multiple tokens ahead, potentially improving throughput. However, this advantage comes with diminishing returns: as the speculated sequence grows, so does the likelihood of a prediction error. When an error occurs, the larger model steps in to correct the prediction, which can significantly slow down the overall inference process. Finding the optimal sequence length therefore requires balancing the benefits of speculation against the computational overhead of error correction.&lt;/p&gt;
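&lt;p&gt;The diminishing returns can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each speculated token is accepted independently with a fixed probability; real acceptance rates depend on the model pair and the text being generated. The function name &lt;code&gt;expected_tokens_per_round&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def expected_tokens_per_round(acceptance_rate, k):
    """Expected tokens emitted per verification round when each speculated token
    is accepted independently with probability acceptance_rate (an assumption)."""
    # The i-th token of a round is emitted whenever the first i-1 speculated
    # tokens were accepted: either token i is accepted too, or the larger
    # model corrects it.
    return sum(acceptance_rate ** (i - 1) for i in range(1, k + 1))


for k in (1, 3, 5, 10):
    row = ", ".join(
        f"rate={rate}: {expected_tokens_per_round(rate, k):.2f}"
        for rate in (0.5, 0.8, 0.95)
    )
    print(f"k={k:2d} -> {row}")
```

&lt;p&gt;Under this assumption, a low acceptance rate makes the expected number of tokens per round flatten out quickly as the number of speculated tokens grows, while the cost of drafting them keeps growing with every extra token.&lt;/p&gt;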
&lt;h3&gt;Larger Memory Footprint&lt;/h3&gt;
&lt;p&gt;Speculative decoding requires a larger memory footprint. The larger model needs to be loaded into memory together with the smaller model. This requires larger instances and GPUs, which translates to higher costs.
Fine-tuning also becomes more difficult when you want to use speculative decoding with your own models. The smaller model needs to be fine-tuned on the same dataset as the larger model. Fine-tuning both is not always feasible, since fine-tuning the larger model is more expensive than fine-tuning the smaller model (however, you might already get the performance boost you need from the smaller fine-tuned model alone).&lt;/p&gt;
&lt;h3&gt;Model Pairing&lt;/h3&gt;
&lt;p&gt;A larger model needs to be paired with a smaller model that uses the same tokenization. This is usually no problem for the larger models in a family, which have smaller siblings to draw on. The smallest model in a family, however, has no smaller sibling that could suggest tokens for it.&lt;/p&gt;
&lt;p&gt;The following table shows the possible combinations of base models and smaller models for speculative decoding.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Base Model&lt;/th&gt;
&lt;th&gt;Smaller Model for suggesting tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gemma-2-27b-it&lt;/td&gt;
&lt;td&gt;gemma-2-9b-it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma-2-27b-it&lt;/td&gt;
&lt;td&gt;gemma-2-2b-it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma-2-9b-it&lt;/td&gt;
&lt;td&gt;gemma-2-2b-it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma-2-2b-it&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Problem-specific Performance&lt;/h3&gt;
&lt;p&gt;The performance of speculative decoding also depends on the distribution of tokens. For example, if we want to generate English text for a chatbot, speculative decoding will be more effective than if we want to generate random hashes or bank transaction descriptions. In those cases, speculative decoding can actually be slower because the larger model needs to correct the predictions too often.&lt;/p&gt;
&lt;h3&gt;Alternatives&lt;/h3&gt;
&lt;p&gt;Several alternative approaches can help reduce LLM deployment latency, each with its own strengths and trade-offs.&lt;/p&gt;
&lt;h4&gt;Use a smaller model&lt;/h4&gt;
&lt;p&gt;The simplest approach is to use a smaller model altogether. This solution offers both reduced memory footprint and faster inference times compared to speculative decoding. The deployment becomes significantly simpler, requiring only one smaller model and less powerful GPUs. However, this approach comes with an obvious drawback: the quality of generated text suffers noticeably. While faster, smaller models often lack the sophisticated reasoning and nuanced understanding that larger models provide. You would typically only consider this option if your use case doesn&#39;t require the advanced capabilities of larger models.&lt;/p&gt;
&lt;h4&gt;Use a quantized model&lt;/h4&gt;
&lt;p&gt;Model quantization reduces the numerical precision of a model&#39;s parameters while largely preserving its behavior. By converting the parameters from their original 16-bit or 32-bit floating-point representation to 8-bit or even 4-bit precision after training, quantization shrinks the memory footprint and lowers the computational overhead, which in turn leads to faster inference.&lt;/p&gt;
&lt;p&gt;While quantization does introduce a modest degradation in model quality compared to the original implementation, it offers compelling advantages as an alternative to speculative decoding. The deployment architecture remains streamlined with only a single model to maintain, and the reduced computational demands enable the use of more cost-effective GPU hardware. This balance of performance optimization and operational simplicity makes quantization an attractive option for many production environments.&lt;/p&gt;
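&lt;p&gt;The memory effect is simple arithmetic. The sketch below estimates the size of the weights alone at different precisions. It assumes the nominal 27 billion parameters suggested by the model name and ignores the KV cache, activations, and the scaling metadata that quantized formats store; &lt;code&gt;weight_memory_gb&lt;/code&gt; is an illustrative helper, not a library function.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# 27e9 is the nominal parameter count suggested by the name gemma-2-27b-it.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(27e9, bits):.1f} GB")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Halving the precision halves the weight memory, which is often the difference between needing several GPUs and fitting on one.&lt;/p&gt;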
&lt;h4&gt;Parallelization&lt;/h4&gt;
&lt;p&gt;Parallelization is another powerful strategy for improving the serving performance of larger LLMs. Instead of processing requests one after another, you process several of them simultaneously. This way, you can significantly decrease the effective latency across multiple requests. This approach particularly shines in high-traffic scenarios where individual requests use only a fraction of the model&#39;s context length. However, parallelization faces clear limitations: it remains constrained by both the model&#39;s maximum context length and the available GPU memory. Despite these constraints, parallelization often provides substantial performance benefits for production deployments, and it should be your first consideration when optimizing deployment latency.&lt;/p&gt;
&lt;h4&gt;Continuous batching&lt;/h4&gt;
&lt;p&gt;Continuous batching takes the parallelization concept even further. Instead of processing fixed batches, this technique dynamically pulls new requests from a queue whenever space becomes available in the current batch. This approach proves especially effective when handling a high volume of requests with varying context lengths. By maintaining consistent GPU utilization, continuous batching can achieve even lower latency than standard parallelization. However, it shares the same fundamental limitations regarding context length and GPU memory, and requires specialized deployment infrastructure to support the dynamic batching mechanism.&lt;/p&gt;
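&lt;p&gt;The difference between fixed batches and continuous batching can be sketched with a small simulation. It counts decode steps under a simplified model in which every active request produces one token per step; the request lengths and both function names are hypothetical, and real schedulers such as the one in vLLM are considerably more involved.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from collections import deque


def static_batching_steps(lengths, batch_size):
    """Total decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when freed slots are refilled from the queue right away."""
    queue = deque(lengths)
    active = []  # remaining tokens for each request currently in the batch
    steps = 0
    while queue or active:
        # Fill every free slot before running the next decode step.
        while queue and len(active) != batch_size:
            active.append(queue.popleft())
        steps += 1
        # One decode step produces one token for every active request;
        # requests with a single token left finish in this step.
        active = [remaining - 1 for remaining in active if remaining != 1]
    return steps


requests = [2, 10, 3, 9, 2, 8, 3, 2]  # hypothetical output lengths in tokens
print("static batches:", static_batching_steps(requests, batch_size=4))
print("continuous batching:", continuous_batching_steps(requests, batch_size=4))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With fixed batches, short requests wait for the longest request in their batch. Continuous batching reuses those idle slots, so the same work finishes in fewer steps.&lt;/p&gt;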
&lt;h4&gt;Caching&lt;/h4&gt;
&lt;p&gt;Caching offers a different approach to latency optimization, particularly valuable for applications with repetitive requests. By storing and reusing previous inference results for identical prompts, caching can deliver nearly instantaneous responses for repeated queries. While novel requests still face the slow inference time of a large model, frequently requested responses are served almost immediately. This makes caching particularly effective for applications like customer service chatbots or code completion tools, where certain queries appear frequently. The effectiveness of caching directly correlates with the repetitiveness of your workload: the more repeated queries you handle, the greater the performance benefit.&lt;/p&gt;
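&lt;p&gt;A minimal sketch of such an exact-match cache is shown below. The class &lt;code&gt;PromptCache&lt;/code&gt; and the stand-in function &lt;code&gt;fake_llm&lt;/code&gt; are hypothetical names; a production setup would more likely use a shared store and would also include the sampling parameters in the cache key.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import hashlib
from collections import OrderedDict


class PromptCache:
    """A small LRU cache keyed by a hash of the exact prompt text."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get_or_generate(self, prompt, generate):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        result = generate(prompt)  # slow path: call the large model
        self._store[key] = result
        if len(self._store) == self.max_entries + 1:
            self._store.popitem(last=False)  # evict the least recently used entry
        return result


calls = []


def fake_llm(prompt):
    # Stand-in for a real model call so the sketch runs without a GPU.
    calls.append(prompt)
    return prompt.upper()


cache = PromptCache(max_entries=2)
cache.get_or_generate("reset my password", fake_llm)
cache.get_or_generate("reset my password", fake_llm)
print("model calls:", len(calls))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that an exact-match cache returns the same text every time, so it fits best when responses are meant to be deterministic.&lt;/p&gt;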
&lt;h2&gt;Demo&lt;/h2&gt;
&lt;p&gt;We have created a demo of speculative decoding with vLLM using Gemma. You can find the code &lt;a href=&quot;https://colab.research.google.com/drive/1IVRcztCw4ypTGlVK0PQQ4s-lb7N1gVeJ?usp=sharing&quot;&gt;in Colab&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Speculative decoding is a promising technique to improve the performance of LLM deployments. In our demo example, we were able to reduce the latency by roughly 35%. However, it comes with a larger memory footprint and more deployment complexity.&lt;/p&gt;
&lt;h2&gt;Acknowledgements&lt;/h2&gt;
&lt;p&gt;We would like to thank Google&#39;s Developer Community (GDE) for providing the funding to create this demo. #VertexAISprint&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Deploying Google&#39;s Gemma on Vertex AI</title>
    <link href="https://www.hanneshapke.com/gemma-vllm-vertex.html"/>
    <updated>2025-02-17T00:00:00.000Z</updated>
    <id>https://www.hanneshapke.com/gemma-vllm-vertex.html</id>
    <summary>A comprehensive guide to deploying Google&#39;s Gemma language model on Vertex AI using vLLM, covering model registration, endpoint creation, and production deployment best practices.</summary>
    <content type="html">&lt;p&gt;In the rapidly evolving landscape of artificial intelligence, the ability to deploy and manage your own language models has become increasingly important. While hosted solutions like Google&#39;s Gemini offer convenience, there are compelling reasons to host your own models. Today, we&#39;ll explore how to deploy Google&#39;s Gemma model on Vertex AI, providing you with complete control over your AI infrastructure.&lt;/p&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Google&#39;s recent release of Gemma marks a significant milestone in the democratization of AI. As an open-source alternative to their hosted Gemini models, Gemma provides organizations with the flexibility to run these powerful language models on their own infrastructure. In this comprehensive guide, we&#39;ll walk through the process of deploying Gemma on Google Cloud&#39;s Vertex AI platform, exploring every aspect from initial setup to production deployment.&lt;/p&gt;
&lt;h2&gt;Why Host Your Own Model?&lt;/h2&gt;
&lt;p&gt;Before diving into the technical details, let&#39;s understand why you might choose to host your own model instead of using hosted solutions:&lt;/p&gt;
&lt;h3&gt;Data Privacy and Compliance&lt;/h3&gt;
&lt;p&gt;When dealing with sensitive information such as medical records, legal documents, or proprietary business data, maintaining complete control over your data pipeline becomes crucial. By hosting your own model, you ensure that sensitive data never leaves your controlled environment, making it easier to comply with regulations like HIPAA, GDPR, or industry-specific requirements.&lt;/p&gt;
&lt;h3&gt;Responsible AI Implementation&lt;/h3&gt;
&lt;p&gt;Organizations increasingly need to demonstrate transparency and control over their AI systems. Running your own model instance allows you to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Monitor and audit all interactions&lt;/li&gt;
&lt;li&gt;Implement custom fairness metrics&lt;/li&gt;
&lt;li&gt;Control model behavior and outputs&lt;/li&gt;
&lt;li&gt;Maintain clear data lineage&lt;/li&gt;
&lt;li&gt;Avoid sharing potentially sensitive data with third-party providers&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Performance Optimization&lt;/h3&gt;
&lt;p&gt;Self-hosting enables you to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fine-tune latency for specific use cases&lt;/li&gt;
&lt;li&gt;Optimize hardware allocation based on your workload&lt;/li&gt;
&lt;li&gt;Implement custom caching strategies&lt;/li&gt;
&lt;li&gt;Control model quantization and optimization parameters&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Technical Understanding&lt;/h3&gt;
&lt;p&gt;For organizations invested in AI technology, understanding the deployment process provides valuable insights into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model serving architecture&lt;/li&gt;
&lt;li&gt;Resource management&lt;/li&gt;
&lt;li&gt;Scaling considerations&lt;/li&gt;
&lt;li&gt;Performance optimization techniques&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before beginning the deployment process, ensure you have:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A Google Cloud Account with billing enabled&lt;/li&gt;
&lt;li&gt;Vertex AI API activated in your project&lt;/li&gt;
&lt;li&gt;A Hugging Face account with access to Gemma models&lt;/li&gt;
&lt;li&gt;Basic familiarity with Python and cloud computing concepts&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Understanding the Deployment Architecture&lt;/h2&gt;
&lt;p&gt;Our deployment strategy uses the vLLM serving framework, which offers several advantages:&lt;/p&gt;
&lt;h3&gt;Why vLLM?&lt;/h3&gt;
&lt;p&gt;vLLM has emerged as a leading solution for serving large language models due to its:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Continuous Batching&lt;/strong&gt;: Efficiently processes multiple requests by dynamically batching them, maximizing GPU utilization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;PagedAttention&lt;/strong&gt;: Stores the attention key-value cache in fixed-size blocks, similar to virtual-memory paging, which reduces memory fragmentation and increases throughput.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Kernel Fusion&lt;/strong&gt;: Optimizes computation by combining multiple operations into single GPU kernels.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Quantization Support&lt;/strong&gt;: Offers various quantization options to reduce model size and increase inference speed.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
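&lt;p&gt;The memory benefit of PagedAttention can be illustrated with a back-of-the-envelope comparison. The sketch counts how many KV-cache token slots are reserved if every request gets room for the maximum length up front, versus handing out fixed-size blocks on demand. The request lengths are hypothetical, the block size of 16 tokens is an assumption, and both helpers are illustrative rather than part of vLLM.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import math


def reserved_slots_contiguous(sequence_lengths, max_model_len):
    """KV-cache token slots reserved when every request gets max_model_len up front."""
    return len(sequence_lengths) * max_model_len


def reserved_slots_paged(sequence_lengths, block_size=16):
    """KV-cache token slots reserved when memory is handed out in fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in sequence_lengths)


lengths = [37, 512, 120, 9]  # hypothetical tokens currently held per request
print("contiguous:", reserved_slots_contiguous(lengths, max_model_len=8192))
print("paged:", reserved_slots_paged(lengths))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With paging, the waste per request is at most one partially filled block, which leaves room for many more concurrent requests on the same GPU.&lt;/p&gt;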
&lt;h2&gt;The Deployment Process&lt;/h2&gt;
&lt;p&gt;Let&#39;s break down the deployment into three main steps:&lt;/p&gt;
&lt;h3&gt;Step 1: Registering the Model&lt;/h3&gt;
&lt;p&gt;The first step involves registering your Gemma model with Vertex AI&#39;s Model Registry. This process creates a versioned record of your model that can be tracked and managed.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from typing import Optional

from google.cloud import aiplatform

def register_model(
    project: str,
    location: str,
    display_name: str,
    model_id: str,
    version_description: str,
    serving_container_image_uri: str,
    serving_container_environment_variables: dict,
    serving_container_command: list,
    artifact_uri: Optional[str] = None,
) -&amp;gt; aiplatform.Model:
    &amp;quot;&amp;quot;&amp;quot;
    Register a new model in Vertex AI Model Registry.

    Args:
        project: Google Cloud project ID
        location: Region for deployment (e.g., &#39;us-central1&#39;)
        display_name: Human-readable name for the model
        model_id: Unique identifier for the model
        version_description: Description of this model version
        serving_container_image_uri: Docker image URI for model serving
        serving_container_environment_variables: Environment variables set
            inside the serving container (e.g., the Hugging Face token)
        serving_container_command: Command that starts the vLLM server
        artifact_uri: Optional GCS location of model artifacts; not needed
            when the container downloads the weights from Hugging Face

    Returns:
        aiplatform.Model: Registered model object
    &amp;quot;&amp;quot;&amp;quot;
    aiplatform.init(project=project, location=location)

    model = aiplatform.Model.upload(
        display_name=display_name,
        artifact_uri=artifact_uri,
        model_id=model_id,
        description=&amp;quot;vLLM model for generating text&amp;quot;,
        version_description=version_description,
        serving_container_image_uri=serving_container_image_uri,
        serving_container_health_route=&amp;quot;/health&amp;quot;,
        serving_container_environment_variables=serving_container_environment_variables,
        serving_container_predict_route=&amp;quot;/generate&amp;quot;,
        serving_container_ports=[8000],
        serving_container_command=serving_container_command
    )

    return model


model = register_model(
    project=&amp;quot;your gcp project id&amp;quot;,
    location=&amp;quot;us-central1&amp;quot;,  # or your preferred region
    display_name=&amp;quot;gemma-vllm&amp;quot;,
    model_id=&amp;quot;gemma_vllm_001&amp;quot;,
    version_description=&amp;quot;Initial Gemma vLLM deployment&amp;quot;,
    serving_container_image_uri=&amp;quot;us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:latest&amp;quot;,
    serving_container_environment_variables={
        &amp;quot;HUGGING_FACE_HUB_TOKEN&amp;quot;: &amp;quot;hf_&amp;lt;your token&amp;gt;&amp;quot;
    },
    serving_container_command=[&amp;quot;python3&amp;quot;, &amp;quot;-m&amp;quot;, &amp;quot;vllm.entrypoints.api_server&amp;quot;,
                             &amp;quot;--model=google/gemma-2-2b-it&amp;quot;,
                             &amp;quot;--tensor-parallel-size=1&amp;quot;,
                             &amp;quot;--max_model_len=8126&amp;quot;]
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This code does several important things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Model Initialization&lt;/strong&gt;: Uses &lt;code&gt;aiplatform.init()&lt;/code&gt; to set up the connection to your Google Cloud project.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Model Registration&lt;/strong&gt;: Creates a new model entry in the Vertex AI Model Registry with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A display name for human readability&lt;/li&gt;
&lt;li&gt;The location of model artifacts in Google Cloud Storage&lt;/li&gt;
&lt;li&gt;A unique model identifier&lt;/li&gt;
&lt;li&gt;Version information for tracking changes&lt;/li&gt;
&lt;li&gt;Container configuration for serving&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Container Configuration&lt;/strong&gt;: Specifies how the serving container is started and reached:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Health check route for monitoring&lt;/li&gt;
&lt;li&gt;Prediction route for inference&lt;/li&gt;
&lt;li&gt;Port configuration for network access&lt;/li&gt;
&lt;li&gt;Image URI&lt;/li&gt;
&lt;li&gt;Container commands&lt;/li&gt;
&lt;li&gt;Hugging Face token, passed via environment variables&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Note about the container image&lt;/h4&gt;
&lt;p&gt;Vertex AI expects a very specific request-response structure. Google provides instructions for building such a &lt;a href=&quot;https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/069d709eddac1fda7aa877d7404f10e74c82aeb4/community-content/vertex_model_garden/model_oss/vllm/dockerfile/serve.Dockerfile&quot;&gt;container&lt;/a&gt; as well as a &lt;a href=&quot;https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/069d709eddac1fda7aa877d7404f10e74c82aeb4/community-content/vertex_model_garden/model_oss/vllm/vllm.patch&quot;&gt;patch&lt;/a&gt; for the open-source vLLM implementation. Instead of patching vLLM and building our own Docker image, we take a shortcut and reuse the Docker image &lt;a href=&quot;https://console.cloud.google.com/artifacts/docker/vertex-ai/us/vertex-vision-model-garden-dockers/pytorch-vllm-serve&quot;&gt;provided by Google&#39;s Model Garden&lt;/a&gt;.&lt;/p&gt;
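&lt;p&gt;As a rough illustration of that structure: Vertex AI online prediction requests carry a top-level &lt;code&gt;instances&lt;/code&gt; list, and responses come back under &lt;code&gt;predictions&lt;/code&gt;. The sketch below builds such a request body. The per-instance field names (&lt;code&gt;prompt&lt;/code&gt;, &lt;code&gt;max_tokens&lt;/code&gt;, &lt;code&gt;temperature&lt;/code&gt;) are assumptions about what the serving image accepts and should be checked against the image version you deploy; &lt;code&gt;build_predict_request&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import json


def build_predict_request(prompt, max_tokens=256, temperature=0.7):
    """Wrap generation settings in the instances envelope used by Vertex AI."""
    instance = {
        "prompt": prompt,  # assumed field name
        "max_tokens": max_tokens,  # assumed field name
        "temperature": temperature,  # assumed field name
    }
    return {"instances": [instance]}


body = build_predict_request("What is machine learning?")
print(json.dumps(body, indent=2))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When using the Python SDK, the list under &lt;code&gt;instances&lt;/code&gt; is what you would pass to the endpoint&#39;s &lt;code&gt;predict&lt;/code&gt; method.&lt;/p&gt;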
&lt;h3&gt;Step 2: Creating an Endpoint&lt;/h3&gt;
&lt;p&gt;The next step involves creating a Vertex AI endpoint that will serve your model:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def create_endpoint(
    project: str,
    location: str,
    display_name: str
) -&amp;gt; aiplatform.Endpoint:
    &amp;quot;&amp;quot;&amp;quot;
    Create a new Vertex AI endpoint for model serving.

    Args:
        project: Google Cloud project ID
        location: Region for deployment
        display_name: Human-readable name for the endpoint

    Returns:
        aiplatform.Endpoint: Created endpoint object
    &amp;quot;&amp;quot;&amp;quot;
    aiplatform.init(project=project, location=location)

    endpoint = aiplatform.Endpoint.create(
        display_name=display_name,
        project=project,
        location=location,
    )

    return endpoint
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This endpoint creation process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Initializes the Environment&lt;/strong&gt;: Sets up the project and location context.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Creates the Endpoint&lt;/strong&gt;: Establishes a new serving endpoint with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A human-readable display name&lt;/li&gt;
&lt;li&gt;Project and location specifications&lt;/li&gt;
&lt;li&gt;Default configuration settings&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Prepares for Deployment&lt;/strong&gt;: Sets up the necessary infrastructure for model serving.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Deploying the Model&lt;/h3&gt;
&lt;p&gt;The final step involves deploying your registered model to the created endpoint:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def deploy_model(
    model: aiplatform.Model,
    endpoint: aiplatform.Endpoint,
    machine_type: str,
    accelerator_type: str,
    accelerator_count: int,
    min_replica_count: int = 1,
    max_replica_count: int = 1,
) -&amp;gt; aiplatform.Endpoint:
    &amp;quot;&amp;quot;&amp;quot;
    Deploy a registered model to a Vertex AI endpoint.

    Args:
        model: Registered model returned by register_model()
        endpoint: Target endpoint returned by create_endpoint()
        machine_type: Type of machine for deployment
        accelerator_type: Type of accelerator (GPU)
        accelerator_count: Number of accelerators
        min_replica_count: Minimum number of serving instances
        max_replica_count: Maximum number of serving instances

    Returns:
        aiplatform.Endpoint: Endpoint that now serves the model
    &amp;quot;&amp;quot;&amp;quot;
    # deploy() is an instance method of the registered model.
    deployed_endpoint = model.deploy(
        endpoint=endpoint,
        deployed_model_display_name=f&amp;quot;deployed_{model.display_name}&amp;quot;,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        min_replica_count=min_replica_count,
        max_replica_count=max_replica_count,
        traffic_split={&amp;quot;0&amp;quot;: 100},  # &amp;quot;0&amp;quot; refers to the model being deployed
        sync=True
    )

    return deployed_endpoint
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This deployment configuration includes several important parameters:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hardware Specification&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;machine_type&lt;/code&gt;: The type of VM instance (e.g., &#39;g2-standard-8&#39;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;accelerator_type&lt;/code&gt;: GPU specification (e.g., &#39;NVIDIA_L4&#39;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;accelerator_count&lt;/code&gt;: Number of GPUs per instance&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scaling Configuration&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;min_replica_count&lt;/code&gt;: Minimum number of serving instances&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_replica_count&lt;/code&gt;: Maximum number of serving instances&lt;/li&gt;
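&lt;li&gt;Vertex AI scales the number of replicas between these two bounds based on load; with both set to 1, as in the defaults above, the deployment stays at a single replica&lt;/li&gt;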
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Traffic Management&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;traffic_split&lt;/code&gt;: Controls request routing&lt;/li&gt;
&lt;li&gt;Enables gradual rollouts and A/B testing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
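&lt;p&gt;For a gradual rollout, the traffic split maps deployed model IDs to percentages that add up to 100, with the key &lt;code&gt;&quot;0&quot;&lt;/code&gt; standing for the model being deployed. The following sketch computes such a split for a canary release. The helper &lt;code&gt;canary_traffic_split&lt;/code&gt; and the deployed model ID are made up for illustration.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def canary_traffic_split(current_split, canary_percent):
    """Give the new model ("0") canary_percent and scale existing models down."""
    if canary_percent not in range(0, 101):
        raise ValueError("canary_percent must be an integer from 0 to 100")
    remaining = 100 - canary_percent
    total = sum(current_split.values()) or 1  # avoid dividing by zero
    split = {}
    for model_id, share in current_split.items():
        split[model_id] = share * remaining // total
    # Integer division can leave a few percent unassigned; give them to the canary.
    split["0"] = 100 - sum(split.values())
    return split


# "1234567890" is a made-up ID for a model that is already deployed.
print(canary_traffic_split({"1234567890": 100}, canary_percent=10))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resulting dictionary could be passed as &lt;code&gt;traffic_split&lt;/code&gt; in place of the fixed &lt;code&gt;{&quot;0&quot;: 100}&lt;/code&gt; used in the deployment function.&lt;/p&gt;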
&lt;h2&gt;Alternative Serving Frameworks&lt;/h2&gt;
&lt;p&gt;While vLLM is our recommended choice, several alternatives exist:&lt;/p&gt;
&lt;h3&gt;1. FastAPI + Hugging Face Transformers&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from transformers import AutoModelForCausalLM, AutoTokenizer
from fastapi import FastAPI

app = FastAPI()
model = AutoModelForCausalLM.from_pretrained(&amp;quot;google/gemma-2-2b-it&amp;quot;)
tokenizer = AutoTokenizer.from_pretrained(&amp;quot;google/gemma-2-2b-it&amp;quot;)

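@app.post(&amp;quot;/predict&amp;quot;)
def predict(text: str):
    # A plain def lets FastAPI run the blocking generate() call in a worker thread.
    inputs = tokenizer(text, return_tensors=&amp;quot;pt&amp;quot;)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return {&amp;quot;response&amp;quot;: tokenizer.decode(outputs[0], skip_special_tokens=True)}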
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simple implementation&lt;/li&gt;
&lt;li&gt;Direct integration with Hugging Face&lt;/li&gt;
&lt;li&gt;Flexible customization&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Disadvantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Limited optimization features&lt;/li&gt;
&lt;li&gt;No built-in batching&lt;/li&gt;
&lt;li&gt;Higher memory usage&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Text Generation Inference (TGI)&lt;/h3&gt;
&lt;p&gt;TGI offers a more optimized alternative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from text_generation import Client

client = Client(&amp;quot;http://localhost:8080&amp;quot;)
response = client.generate(
    &amp;quot;What is machine learning?&amp;quot;,
    max_new_tokens=512,
    temperature=0.7
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Optimized for production&lt;/li&gt;
&lt;li&gt;Streaming support&lt;/li&gt;
&lt;li&gt;Better memory management&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Disadvantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Less flexible than vLLM&lt;/li&gt;
&lt;li&gt;Limited quantization options&lt;/li&gt;
&lt;/ul&gt;
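&lt;h3&gt;3. SGLang&lt;/h3&gt;
&lt;p&gt;SGLang, from the SGL Project, provides another approach:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import sglang as sgl

@sgl.function
def generate(s, prompt):
    s += prompt
    s += sgl.gen(&amp;quot;answer&amp;quot;, max_tokens=512)

sgl.set_default_backend(sgl.RuntimeEndpoint(&amp;quot;http://localhost:30000&amp;quot;))
state = generate.run(prompt=&amp;quot;What is machine learning?&amp;quot;)
print(state[&amp;quot;answer&amp;quot;])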
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simple API&lt;/li&gt;
&lt;li&gt;Good performance&lt;/li&gt;
&lt;li&gt;Easy integration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Disadvantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Newer project&lt;/li&gt;
&lt;li&gt;Smaller community&lt;/li&gt;
&lt;li&gt;Limited features&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Limitations and Considerations&lt;/h2&gt;
&lt;p&gt;When deploying Gemma on Vertex AI, be aware of these limitations:&lt;/p&gt;
&lt;h3&gt;1. Streaming Limitations&lt;/h3&gt;
&lt;p&gt;Vertex AI currently doesn&#39;t support native streaming responses, which means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;All responses must be returned as complete messages&lt;/li&gt;
&lt;li&gt;Real-time token generation isn&#39;t possible&lt;/li&gt;
&lt;li&gt;Higher latency for long responses&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Hardware Availability&lt;/h3&gt;
&lt;p&gt;Some considerations regarding hardware:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPU availability varies by region&lt;/li&gt;
&lt;li&gt;Certain GPU types may have limited availability&lt;/li&gt;
&lt;li&gt;Cost implications of different hardware choices&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Resource Management&lt;/h3&gt;
&lt;p&gt;Important resource considerations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Memory management for large models&lt;/li&gt;
&lt;li&gt;GPU utilization optimization&lt;/li&gt;
&lt;li&gt;Scaling limitations&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;p&gt;To ensure optimal deployment and operation:&lt;/p&gt;
&lt;h3&gt;1. Model Optimization&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use appropriate quantization methods&lt;/li&gt;
&lt;li&gt;Implement caching strategies&lt;/li&gt;
&lt;li&gt;Configure batch sizes based on workload&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Monitoring&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Set up comprehensive logging&lt;/li&gt;
&lt;li&gt;Monitor GPU utilization&lt;/li&gt;
&lt;li&gt;Track response times and error rates&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Cost Management&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use appropriate machine types&lt;/li&gt;
&lt;li&gt;Implement auto-scaling&lt;/li&gt;
&lt;li&gt;Monitor resource usage&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Deploying Gemma on Vertex AI provides organizations with powerful capabilities for running their own language models. While there are some limitations to consider, the benefits of control, customization, and privacy make it an attractive option for many use cases.&lt;/p&gt;
&lt;p&gt;The combination of Vertex AI&#39;s infrastructure and vLLM&#39;s serving capabilities creates a robust platform for AI deployment. By following the steps and best practices outlined in this guide, you can successfully deploy and manage your own Gemma instance.&lt;/p&gt;
&lt;p&gt;Remember to regularly monitor your deployment, optimize based on usage patterns, and stay updated with the latest developments in both Vertex AI and the serving frameworks to ensure the best possible performance and cost-effectiveness of your deployment.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Speculative Decoding with vLLM</title>
    <link href="https://www.hanneshapke.com/speculative-decoding.html"/>
    <updated>2025-01-11T00:00:00.000Z</updated>
    <id>https://www.hanneshapke.com/speculative-decoding.html</id>
    <summary>Improving LLM inference with speculative decoding</summary>
    <content type="html">&lt;p&gt;When deploying large language models in production environments, latency optimization is crucial. This is particularly important for real-time applications like chatbots and conversational interfaces. While complex tasks often require larger LLMs (70 billion+ parameters), users still expect response times similar to smaller models. This challenge has led the machine learning community to continuously explore new ways to improve LLM latency.&lt;/p&gt;
&lt;p&gt;One of the most promising techniques is speculative decoding, which speeds up generation by letting a smaller model predict several tokens at a time and using the larger model to validate those predictions.&lt;/p&gt;
&lt;h2&gt;Problem&lt;/h2&gt;
&lt;p&gt;When serving large language models in production, you need to keep latency low. In fact, latency optimization is one of the biggest challenges developers face in production LLM systems. This challenge becomes particularly acute in real-time scenarios, where users expect near-instantaneous interactions with chatbots, code completion tools, and other AI-powered interfaces.&lt;/p&gt;
&lt;p&gt;At the heart of this challenge lies a fundamental characteristic of auto-regressive models: their sequential nature of text generation. Unlike many computational processes that can benefit from parallel processing, these models face an architectural constraint that proves to be their primary performance bottleneck. To generate any given token K, the model must first process and consider all preceding tokens, from 1 to K-1, in sequential order. This dependency chain, which provides context for the text generation, creates a processing pipeline that cannot be easily parallelized.&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;/images/speculative_decoding/llm-token-generation.png&quot; alt=&quot;Sequential token generation&quot;&gt;
  &lt;figcaption&gt;Figure 1: Sequential token generation&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Consider the process illustrated in Figure 1, where each token&#39;s generation depends on the complete history of previous tokens. This sequential dependency isn&#39;t merely a technical limitation—it&#39;s a fundamental aspect of how these models understand and generate human-like text.&lt;/p&gt;
&lt;p&gt;The situation becomes even more challenging when we scale up to larger models, particularly those exceeding 3 billion parameters. These massive models, while offering superior capabilities in terms of reasoning, understanding, and generation, exact a significant performance penalty. Each token prediction requires more computational resources, as the model must process its vast parameter space for every single token generation step. The result is a compounding effect: not only must we handle the sequential nature of token generation, but each step in that sequence now takes longer due to the model&#39;s size.&lt;/p&gt;
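&lt;p&gt;A rough way to see why each step gets slower: if a decode step has to read every weight from GPU memory once, the time per token cannot be lower than the weight size divided by the memory bandwidth. The sketch below applies this rule of thumb. The bandwidth of 1000 GB/s describes a hypothetical accelerator, 16-bit weights are assumed, and the model sizes are nominal; &lt;code&gt;min_seconds_per_token&lt;/code&gt; is an illustrative helper.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def min_seconds_per_token(num_params, bytes_per_param, memory_bandwidth_gb_s):
    """Lower bound on one decode step if every weight is read from GPU memory once."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / (memory_bandwidth_gb_s * 1e9)


# Hypothetical accelerator with 1000 GB/s of memory bandwidth, 16-bit weights.
for name, params in (("2B", 2e9), ("9B", 9e9), ("27B", 27e9)):
    ms = min_seconds_per_token(params, 2, 1000) * 1000
    print(f"{name}: at least {ms:.0f} ms per token")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Under these assumptions the lower bound grows linearly with the parameter count, and it applies to every single generated token.&lt;/p&gt;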
&lt;p&gt;Yet, despite these performance challenges, larger models remain indispensable for many applications. They excel at complex tasks that smaller models struggle with, such as multi-step reasoning, code generation, and nuanced understanding of context. They also produce higher-quality text with fewer artifacts and better coherence. This creates a tension between the need for sophisticated model capabilities and the practical requirements of production deployment.&lt;/p&gt;
&lt;p&gt;In this solution, we will use speculative decoding to improve the performance of LLM deployments. It will allow us to improve the model latency without changing the model architecture, training data, or the trained model itself.&lt;/p&gt;
&lt;h2&gt;Solution&lt;/h2&gt;
&lt;p&gt;Speculative decoding is an optimization technique that uses two language models to improve generation speed while maintaining output quality. The approach pairs two complementary models in a teacher-student setup. The teacher is a large language model that produces highly accurate outputs but is computationally expensive and relatively slow; it serves as the ground truth for token generation. The student is a smaller, more efficient model that runs faster but may be less accurate. The student works best when it closely imitates the teacher, for example because it comes from the same model family or was trained to emulate it: a 3 billion parameter model might be paired with a 405 billion parameter model.&lt;/p&gt;
&lt;h3&gt;The Inference Process&lt;/h3&gt;
&lt;p&gt;During text generation, the process follows a specific workflow. The student model begins by rapidly proposing a short sequence of tokens. The teacher model then evaluates all proposed tokens in parallel, in a single forward pass, and checks whether it would have generated the same sequence. Every proposed token up to the first disagreement is accepted and output immediately. At the first disagreement, the teacher&#39;s own token replaces the student&#39;s proposal, the remaining proposed tokens are discarded, and the student starts proposing again from the corrected position. The output quality therefore stays at the level of the teacher model.&lt;/p&gt;
&lt;h3&gt;Key Insight&lt;/h3&gt;
&lt;p&gt;The fundamental principle behind speculative decoding is that not all tokens require the computational power of a large model for accurate generation. Token difficulty varies significantly – simple, predictable tokens like common words or obvious completions can be reliably generated by the smaller student model, while complex or context-dependent tokens benefit from the teacher model&#39;s advanced capabilities. This selective use of computational resources allows for significant speed improvements while maintaining the quality standards of the larger model. The approach is particularly effective because it balances the trade-off between speed and accuracy by dynamically choosing the appropriate model based on the complexity of the current generation task.&lt;/p&gt;
&lt;p&gt;Here is an example of how speculative decoding would play out for a sequence of tokens:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;Step 1:
Student: &amp;quot;The [talented] [chef]&amp;quot;
Teacher: ✓ Accepts (common phrase)

Step 2:
Student: &amp;quot;cooked [a] [delicious]&amp;quot;
Teacher: ✓ Accepts (common food context)

Step 3:
Student: &amp;quot;[soup]&amp;quot;
Teacher: ✗ Rejects
Teacher generates: &amp;quot;bouillabaisse&amp;quot; (rare, specific word)

Step 4:
Student: &amp;quot;[for] [dinner]&amp;quot;
Teacher: ✓ Accepts (common ending)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;How does speculative decoding improve the inference speed?&lt;/p&gt;
&lt;p&gt;The speedup comes from three aspects. First, the proposed tokens are generated by the smaller LLM, which is fast and cheap. Second, because a second model validates the predictions, we can afford to propose several tokens at once instead of one. Third, validating the proposals is fast as well, since the larger model checks all of them in a single pass and only has to correct the tokens where the smaller model made a mistake.&lt;/p&gt;
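&lt;p&gt;The propose-and-verify loop can be sketched with two toy models that simply replay fixed sentences. This is a simplified illustration with greedy verification, not vLLM&#39;s implementation: real systems compare token probabilities and use a probabilistic acceptance rule when sampling. All names in the sketch are hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;TARGET_TEXT = "the talented chef cooked a delicious bouillabaisse for dinner".split()
DRAFT_TEXT = "the talented chef cooked a delicious soup for dinner".split()


def target_next(prefix):
    """Toy teacher: continues TARGET_TEXT, or returns None once it is complete."""
    return TARGET_TEXT[len(prefix)] if len(prefix) != len(TARGET_TEXT) else None


def draft_next(prefix):
    """Toy student: cheap guess for the next token, wrong for the rare word."""
    return DRAFT_TEXT[len(prefix)] if len(prefix) != len(DRAFT_TEXT) else None


def speculative_decode(num_speculative_tokens=2):
    output, target_passes = [], 0
    while target_next(output) is not None:
        # 1. The student drafts a few tokens, one after another (cheap).
        draft = []
        for _ in range(num_speculative_tokens):
            token = draft_next(output + draft)
            if token is None:
                break
            draft.append(token)

        # 2. The teacher checks all drafted positions in one (expensive) pass.
        target_passes += 1
        accepted = []
        for token in draft:
            if token != target_next(output + accepted):
                break
            accepted.append(token)

        # 3. The same pass yields the teacher token after the accepted prefix:
        #    a correction on mismatch, or a bonus token if every draft was accepted.
        next_token = target_next(output + accepted)
        output += accepted
        if next_token is not None:
            output.append(next_token)
    return output, target_passes


tokens, passes = speculative_decode(num_speculative_tokens=2)
print(" ".join(tokens))
print(f"{passes} teacher passes for {len(tokens)} tokens")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output is identical to what the toy teacher would produce on its own, but the teacher runs fewer times than there are tokens. That reduction in expensive passes is where the latency gain comes from.&lt;/p&gt;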
&lt;p&gt;How can you use speculative decoding with your LLM? Most LLM deployment frameworks support speculative decoding in one form or another. For our core example, we demonstrate speculative decoding with vLLM, a frequently used framework for serving LLMs such as Llama 3.2 3B. We use Meta&#39;s &lt;code&gt;opt-125m&lt;/code&gt; model to predict the next tokens and the larger &lt;code&gt;opt-2.7b&lt;/code&gt; model to validate the predictions and, if needed, correct them. Sampling the tokens, checking them with the larger model, and correcting them is handled under the hood by the LLM serving framework, in our case vLLM.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
from vllm import LLM, SamplingParams

prompts = [
    &amp;quot;The future of AI is&amp;quot;,
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

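# Note: newer vLLM releases may expect the speculative settings
# in a `speculative_config` dict instead of these keyword arguments.
llm = LLM(
    model=&amp;quot;facebook/opt-2.7b&amp;quot;,  # larger model that validates the tokens
    tensor_parallel_size=1,
    speculative_model=&amp;quot;facebook/opt-125m&amp;quot;,  # smaller model that proposes tokens
    num_speculative_tokens=5,  # tokens proposed per speculation step
)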
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f&amp;quot;Prompt: {prompt!r}, Generated text: {generated_text!r}&amp;quot;)

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When we compare the latency of speculative decoding with the latency of the same model without speculative decoding, speculative decoding is roughly 35% faster, as shown in Figure 2.&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;/images/speculative_decoding/speculative_decoding_comparison.png&quot; alt=&quot;Latency comparison&quot;&gt;
  &lt;figcaption&gt;Figure 2: Comparison of latency between standard and speculative decoding approaches. Speculative decoding shows a 35% improvement in processing time.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Trade-offs and Alternatives&lt;/h2&gt;
&lt;p&gt;The faster inference doesn&#39;t come without downsides. In this section, we discuss the trade-offs of using speculative decoding.&lt;/p&gt;
&lt;h3&gt;Sequence Length&lt;/h3&gt;
&lt;p&gt;The relationship between the length of the speculated sequence and the performance gain presents an interesting trade-off. Generally, longer speculated sequences tend to yield higher speedups. In our example, we set &lt;code&gt;num_speculative_tokens=5&lt;/code&gt;, which lets the smaller model predict five tokens ahead. However, this advantage comes with diminishing returns: the more tokens we speculate, the higher the likelihood that one of them is wrong. When that happens, the larger model steps in to correct the prediction and the remaining speculated tokens are discarded, which wastes the work spent on them. Finding the optimal number of speculative tokens therefore requires balancing the benefit of speculation against the overhead of error correction.&lt;/p&gt;
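&lt;p&gt;A back-of-the-envelope model illustrates the diminishing returns. It assumes, as a simplification, that every speculated token is accepted independently with the same probability. Under that assumption, a validation step yields (1 - a^(k+1)) / (1 - a) tokens on average for an acceptance rate a and k speculative tokens, counting the one token the larger model always contributes itself. The acceptance rate of 0.8 below is a made-up value for illustration, not a measurement.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def expected_tokens_per_step(acceptance_rate, num_speculative_tokens):
    # Geometric series 1 + a + a^2 + ... + a^k, assuming independent acceptance.
    if acceptance_rate == 1.0:
        return float(num_speculative_tokens + 1)
    numerator = 1 - acceptance_rate ** (num_speculative_tokens + 1)
    return numerator / (1 - acceptance_rate)


# Hypothetical acceptance rate of 80% per speculated token.
for k in (1, 2, 5, 10, 20):
    print(k, round(expected_tokens_per_step(0.8, k), 2))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For an acceptance rate of 0.8, the expected yield can never exceed 1 / (1 - 0.8) = 5 tokens per step, no matter how many tokens we speculate, while every additional speculated token still costs a forward pass of the smaller model.&lt;/p&gt;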
&lt;h3&gt;Larger Memory Footprint&lt;/h3&gt;
&lt;p&gt;Speculative decoding requires a larger memory footprint. The larger model needs to be loaded into memory together with the smaller model. This requires larger instances and GPUs, which translates to higher costs.
Fine-tuning is also more difficult when you want to use speculative decoding with your own models. The smaller model needs to be fine-tuned on the same dataset as the larger model. This is not always possible, since fine-tuning the larger model is more expensive than fine-tuning the smaller model (however, you might already get the performance boost you need from the smaller fine-tuned model).&lt;/p&gt;
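&lt;p&gt;How much extra memory the second model costs depends heavily on the pairing. The following rough estimate only counts the model weights and assumes 16-bit parameters (2 bytes each). It ignores the KV cache and activations, and it uses the nominal parameter counts from the model names, so treat the numbers as a sketch rather than a sizing guide.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;BYTES_PER_PARAM_FP16 = 2  # assumption: weights stored in 16-bit precision


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM_FP16):
    # Memory for the weights only, in gigabytes.
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts (larger model, smaller model).
pairs = {
    &amp;quot;opt-2.7b + opt-125m&amp;quot;: (2.7e9, 125e6),
    &amp;quot;Llama 3.1 405B + Llama 3.1 70B&amp;quot;: (405e9, 70e9),
}

for name, (base_params, draft_params) in pairs.items():
    base_gb = weight_memory_gb(base_params)
    draft_gb = weight_memory_gb(draft_params)
    overhead = draft_gb / base_gb
    print(f&amp;quot;{name}: {base_gb:.1f} GB + {draft_gb:.1f} GB ({overhead:.0%} extra)&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;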
&lt;h3&gt;Model Pairing&lt;/h3&gt;
&lt;p&gt;A larger model needs to be paired with a smaller model that uses the same tokenization. For large models, this is usually no problem, since their model families include smaller siblings. The smallest models of a family, however, have no smaller sibling they could be paired with.&lt;/p&gt;
&lt;p&gt;The following table shows the possible combinations of base models and smaller models for speculative decoding.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Base Model&lt;/th&gt;
&lt;th&gt;Smaller Model for suggesting tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 405B&lt;/td&gt;
&lt;td&gt;Llama 3.1 70B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 405B&lt;/td&gt;
&lt;td&gt;Llama 3.2 3B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 405B&lt;/td&gt;
&lt;td&gt;Llama 3.2 1B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.3 70B&lt;/td&gt;
&lt;td&gt;Llama 3.2 3B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.3 70B&lt;/td&gt;
&lt;td&gt;Llama 3.2 1B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 3B&lt;/td&gt;
&lt;td&gt;Llama 3.2 1B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 1B&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
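&lt;p&gt;Before pairing two models, it is worth checking that their vocabularies really line up. The helper below is a minimal sketch that compares two token-to-id mappings. The tiny vocabularies are hypothetical. With Hugging Face Transformers, you could obtain the real mappings via &lt;code&gt;AutoTokenizer.from_pretrained(...).get_vocab()&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def vocab_mismatches(base_vocab, draft_vocab):
    # Tokens that are missing in one vocabulary or mapped to different ids.
    tokens = set(base_vocab) | set(draft_vocab)
    return sorted(t for t in tokens if base_vocab.get(t) != draft_vocab.get(t))


# Hypothetical toy vocabularies (token to id).
base_vocab = {&amp;quot;the&amp;quot;: 0, &amp;quot;chef&amp;quot;: 1, &amp;quot;soup&amp;quot;: 2}
same_family = {&amp;quot;the&amp;quot;: 0, &amp;quot;chef&amp;quot;: 1, &amp;quot;soup&amp;quot;: 2}
other_family = {&amp;quot;the&amp;quot;: 0, &amp;quot;chef&amp;quot;: 2, &amp;quot;stew&amp;quot;: 1}

print(vocab_mismatches(base_vocab, same_family))
print(vocab_mismatches(base_vocab, other_family))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An empty list means the two models agree on every token id. Any entry in the list is a token the smaller model could propose that the larger model would interpret differently.&lt;/p&gt;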
&lt;h3&gt;Problem-specific Performance&lt;/h3&gt;
&lt;p&gt;The performance of speculative decoding also depends on the distribution of tokens. For example, if we want to generate English text for a chatbot, speculative decoding will be more effective than if we want to generate random hashes or bank transaction descriptions. In those cases, speculative decoding can actually be slower, because the larger model needs to correct the predictions too often.&lt;/p&gt;
&lt;h3&gt;Alternatives&lt;/h3&gt;
&lt;p&gt;Several alternative approaches can help reduce LLM deployment latency, each with its own strengths and trade-offs.&lt;/p&gt;
&lt;h4&gt;Use a smaller model&lt;/h4&gt;
&lt;p&gt;The simplest approach is to use a smaller model altogether. This solution offers both reduced memory footprint and faster inference times compared to speculative decoding. The deployment becomes significantly simpler, requiring only one smaller model and less powerful GPUs. However, this approach comes with an obvious drawback: the quality of generated text suffers noticeably. While faster, smaller models often lack the sophisticated reasoning and nuanced understanding that larger models provide. You would typically only consider this option if your use case doesn&#39;t require the advanced capabilities of larger models.&lt;/p&gt;
&lt;h4&gt;Use a quantized model&lt;/h4&gt;
&lt;p&gt;Model quantization is an optimization technique that reduces numerical precision while preserving model functionality. By converting the model&#39;s parameters from their original 16-bit or 32-bit floating-point representation to 8-bit or even 4-bit precision post-training, quantization achieves significant improvements in both memory efficiency and computational performance. The lower precision translates directly into a smaller memory footprint, lower computational overhead, and consequently, faster inference times.&lt;/p&gt;
&lt;p&gt;While quantization does introduce a modest degradation in model quality compared to the original implementation, it offers compelling advantages as an alternative to speculative decoding. The deployment architecture remains streamlined with only a single model to maintain, and the reduced computational demands enable the use of more cost-effective GPU hardware. This balance of performance optimization and operational simplicity makes quantization an attractive option for many production environments.&lt;/p&gt;
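&lt;p&gt;The core idea can be sketched in a few lines. The example below uses symmetric 8-bit quantization with a single scale for all weights, which is only one of several schemes and deliberately simpler than what production toolkits do. The weight values are made up.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def quantize_int8(weights):
    # Map floats to integers in [-127, 127] using one shared scale.
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    # Recover approximate float values from the stored integers.
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.03, 0.9]  # hypothetical weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f&amp;quot;scale={scale:.4f}, max rounding error={max_error:.6f}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each weight now fits into a single byte instead of two or four, at the price of a rounding error of at most half the scale per weight. That rounding error is where the modest quality degradation comes from.&lt;/p&gt;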
&lt;h4&gt;Parallelization&lt;/h4&gt;
&lt;p&gt;Parallelization is another powerful strategy for improving the performance of larger LLMs. Instead of processing multiple requests sequentially, you process them simultaneously. This way, you can significantly decrease the effective latency across multiple requests. This approach particularly shines in high-traffic scenarios where individual requests use only a fraction of the model&#39;s context length. However, parallelization faces clear limitations: it remains constrained by both the model&#39;s maximum context length and the available GPU memory. Despite these constraints, parallelization often provides substantial performance benefits for production deployments, and it should be your first consideration when optimizing deployment latency.&lt;/p&gt;
&lt;h4&gt;Continuous batching&lt;/h4&gt;
&lt;p&gt;Continuous batching takes the parallelization concept even further. Instead of processing fixed batches, this technique dynamically pulls new requests from a queue whenever space becomes available in the current batch. This approach proves especially effective when handling a high volume of requests with varying context lengths. By maintaining consistent GPU utilization, continuous batching can achieve even lower latency than standard parallelization. However, it shares the same fundamental limitations regarding context length and GPU memory, and requires specialized deployment infrastructure to support the dynamic batching mechanism.&lt;/p&gt;
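&lt;p&gt;The difference between fixed batches and continuous batching can be illustrated with a small simulation. It is a simplified sketch: every request needs a known number of decoding steps, a step takes the same time regardless of how full the batch is, and the workload is hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from collections import deque


def static_batching_steps(lengths, batch_size):
    # Fixed batches: every batch runs until its longest request is done.
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    # Free slots are refilled from the queue before every decoding step.
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) != batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each active request produces one token; finished ones leave the batch.
        active = [remaining - 1 for remaining in active if remaining != 1]
    return steps


# Hypothetical workload: number of tokens to generate per request.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(lengths, batch_size=4))
print(continuous_batching_steps(lengths, batch_size=4))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For this workload, the fixed batches need 16 decoding steps, because the short requests wait for the long one in their batch. Continuous batching finishes in 10 steps, since freed slots are reused right away.&lt;/p&gt;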
&lt;h4&gt;Caching&lt;/h4&gt;
&lt;p&gt;Caching offers a different approach to latency optimization, particularly valuable for applications with repetitive requests. By storing and reusing previous inference results for identical prompts, caching can deliver nearly instantaneous responses for repeated queries. While novel requests still face the slow inference time of a large model, frequently accessed responses become lightning-fast. This makes caching particularly effective for applications like customer service chatbots or code completion tools, where certain queries appear frequently. The effectiveness of caching directly correlates with the repetitiveness of your workload – the more repeated queries you handle, the greater the performance benefit.&lt;/p&gt;
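&lt;p&gt;A minimal sketch of such a cache is shown below. It wraps an arbitrary generation function and keys the stored responses by a hash of the exact prompt. The &lt;code&gt;fake_llm&lt;/code&gt; function is a hypothetical stand-in for a real model call, and the sketch assumes that returning the same response for the same prompt is acceptable for your application.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import hashlib


class CachedGenerator:
    # Wraps a generation function with an in-memory cache for identical prompts.

    def __init__(self, generate_fn):
        self._generate_fn = generate_fn
        self._cache = {}
        self.hits = 0

    def generate(self, prompt):
        key = hashlib.sha256(prompt.encode(&amp;quot;utf-8&amp;quot;)).hexdigest()
        if key in self._cache:
            self.hits += 1  # served without calling the model
            return self._cache[key]
        response = self._generate_fn(prompt)
        self._cache[key] = response
        return response


def fake_llm(prompt):
    # Hypothetical stand-in for a slow LLM call.
    return prompt.upper()


llm = CachedGenerator(fake_llm)
llm.generate(&amp;quot;How do I reset my password?&amp;quot;)
llm.generate(&amp;quot;How do I reset my password?&amp;quot;)
print(f&amp;quot;cache hits: {llm.hits}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;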
&lt;h2&gt;Demo&lt;/h2&gt;
&lt;p&gt;We have created a demo of speculative decoding with vLLM. You can find the code &lt;a href=&quot;https://colab.research.google.com/drive/1CkI2Dl5WP2sEnspi8b4dV9J2lHl13sAl?usp=sharing&quot;&gt;in Colab&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Speculative decoding is a promising technique to improve the performance of LLM deployments. In our demo example, we were able to improve the latency by 35%. However, it comes with a larger memory footprint and more deployment complexity.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>How to Profile TensorFlow Serving Inference Requests with TFProfiler</title>
    <link href="https://www.hanneshapke.com/inference-profiling.html"/>
    <updated>2023-02-12T00:00:00.000Z</updated>
    <id>https://www.hanneshapke.com/inference-profiling.html</id>
    <summary>Determining bottlenecks in your deep learning model can be crucial in reducing your model latency</summary>
    <content type="html">&lt;h2&gt;Why Profile Deep Learning Models?&lt;/h2&gt;
&lt;p&gt;With the growing complexity of today&#39;s deep learning models, model inference latency is more relevant than ever. Profiling your machine learning model for bottlenecks can save you milliseconds during your prediction requests, and it ultimately saves you real money when you deploy your model in a production scenario (and CO2 emissions too).&lt;/p&gt;
&lt;p&gt;Keras already provides a stellar callback function to hook the training up to TensorBoard. This connection allows you to profile your model’s performance during the training phase. However, this profiler setup only tells you half the story.&lt;/p&gt;
&lt;p&gt;If you use the TensorBoard callback to profile your machine learning model, all TensorFlow ops used during the backward pass will be part of the profiling. For example, you&#39;ll find optimizer ops muddled in those profiling stats and some of the ops might show a very different profile because they are executed on a GPU instead of a CPU. The information is extremely helpful if you want to optimize for more efficient training patterns, but less helpful to reduce your serving latency.&lt;/p&gt;
&lt;p&gt;One of the many amazing features of TensorFlow Serving is the integrated TensorFlow Profiler. TensorFlow Profiler can connect to your TensorFlow Serving instance and profile your inference requests. Through this setup, you can investigate all inference-related ops and it mimics the deployment scenario better than profiling your model during the training phase.&lt;/p&gt;
&lt;p&gt;I often use VSCode to connect to my GPUs, but unfortunately, the TensorBoard integration in VSCode couldn&#39;t connect to TensorFlow Serving, so I looked for a different setup. Here is how you can set it up.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;For the purpose of this post, I created a demo model based on the code below. Don’t replicate the model, but rather make sure you save your TensorFlow or JAX model in the &lt;code&gt;SavedModel&lt;/code&gt; format, which TensorFlow Serving can load.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import tensorflow as tf
import tensorflow_text as _  # imported for its side effect: registers the ops used by the BERT preprocessor
import tensorflow_hub as hub

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
preprocessor = hub.KerasLayer(
    &amp;quot;https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3&amp;quot;)
encoder_inputs = preprocessor(text_input)
encoder = hub.KerasLayer(
    &amp;quot;https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4&amp;quot;,
    trainable=True)
outputs = encoder(encoder_inputs)
sequence_output = outputs[&amp;quot;sequence_output&amp;quot;]
embedding_model = tf.keras.Model(text_input, sequence_output)

embedding_model.save(&amp;quot;/models/test_model/1/&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;TensorBoard Setup&lt;/h2&gt;
&lt;p&gt;Once you have your model saved in a location where TensorFlow Serving can load it from, let&#39;s set up your serving and TensorBoard.
First, let’s create a Docker image to host TensorBoard.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-Dockerfile&quot;&gt;FROM tensorflow/tensorflow:${TENSORFLOW_SERVING_VERSION}

RUN pip install -U tensorboard-plugin-profile

ENTRYPOINT [&amp;quot;/usr/bin/python3&amp;quot;, &amp;quot;-m&amp;quot;, &amp;quot;tensorboard.main&amp;quot;, &amp;quot;--logdir&amp;quot;, &amp;quot;/tmp/tensorboard&amp;quot;, &amp;quot;--bind_all&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;TensorBoard doesn’t ship with the profiler anymore, so we need to install it separately.
Once you have created the Docker image, you can use &lt;code&gt;docker compose&lt;/code&gt; to spin up TensorFlow Serving together with the newly created TensorBoard image.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: &#39;3.3&#39;
services:
  ${TENSORFLOW_SERVING_HOSTNAME}:
    image: tensorflow/serving:${TENSORFLOW_SERVING_VERSION}
    ports:
      - &#39;8500:8500&#39;
      - &#39;8501:8501&#39;
    environment:
      - MODEL_NAME=${TENSORFLOW_SERVING_MODEL_NAME}
    hostname: &#39;${TENSORFLOW_SERVING_HOSTNAME}&#39;
    volumes:
      - &#39;/models/${TENSORFLOW_SERVING_MODEL_NAME}:/models/${TENSORFLOW_SERVING_MODEL_NAME}&#39;
      - &#39;${TENSORBOARD_LOGDIR}:/tmp/tensorboard&#39;
    command:
      - &#39;--xla_cpu_compilation_enabled&#39;
      - &#39;--tensorflow_intra_op_parallelism=${INTRA_OP_PARALLELISM}&#39;
      - &#39;--tensorflow_inter_op_parallelism=${INTER_OP_PARALLELISM}&#39;
  profiler:
    image: ${DOCKER_PROFILER_TAG}
    ports:
      - &#39;6006:6006&#39;
    volumes:
      - &#39;${TENSORBOARD_LOGDIR}:/tmp/tensorboard&#39;

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I like to add additional TensorFlow Serving flags to mimic the full production setup as closely as possible. In this particular case, I enabled XLA support and limited the intra-op and inter-op parallelism in TensorFlow Serving. You can find more information about &lt;a href=&quot;https://www.tensorflow.org/xla&quot;&gt;XLA here&lt;/a&gt; and details about all &lt;a href=&quot;https://github.com/tensorflow/serving/blob/master/tensorflow_serving/model_servers/main.cc&quot;&gt;TensorFlow Serving options here&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;command:
    - &#39;--xla_cpu_compilation_enabled&#39;
    - &#39;--tensorflow_intra_op_parallelism=${INTRA_OP_PARALLELISM}&#39;
    - &#39;--tensorflow_inter_op_parallelism=${INTER_OP_PARALLELISM}&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can find the full setup script in &lt;a href=&quot;https://gist.github.com/hanneshapke/9a87b932a02c7838b6ba68ded951811a&quot;&gt;this GitHub Gist&lt;/a&gt;. Thanks to &lt;a href=&quot;https://github.com/tensorflow/serving/issues/1755#issuecomment-1301911977&quot;&gt;Kyle Jarvis&lt;/a&gt; for suggesting to run the two containers via &lt;code&gt;docker-compose&lt;/code&gt; and to create the &lt;code&gt;docker-compose&lt;/code&gt; configuration dynamically.&lt;/p&gt;
&lt;h2&gt;Profile Your Model&lt;/h2&gt;
&lt;p&gt;If you copy this &lt;a href=&quot;https://gist.github.com/hanneshapke/9a87b932a02c7838b6ba68ded951811a&quot;&gt;script from GitHub Gist&lt;/a&gt; to your local machine and execute it, it will start up a TensorFlow Serving instance that loads your model (adjust the model path in the script) and a TensorBoard instance as well.&lt;/p&gt;
&lt;p&gt;In case you are running this script remotely (like many M1 users), you need to create an SSH tunnel to access TensorBoard. If you are running on a Google Cloud instance, you can do this by running&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;$ gcloud compute ssh &#92;
    --project=digits-data-science &#92;
    --zone=us-central1-a &#92;
    YOUR_INSTANCE_NAME &#92;
    -- -L 6006:localhost:6006
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;More information about connecting securely to Google Cloud instances can be found &lt;a href=&quot;https://cloud.google.com/solutions/connecting-securely&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you run the &lt;code&gt;docker compose&lt;/code&gt; setup locally on your machine, you can skip the previous step. If you are running on an AWS EC2 instance, check &lt;a href=&quot;https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel-local.html&quot;&gt;here&lt;/a&gt; how to connect to your machine.&lt;/p&gt;
&lt;p&gt;Once &lt;code&gt;docker-compose&lt;/code&gt; is running, you should see terminal output similar to the one below.
If the serving or profiler container fails with an error, you’ll need to stop here and investigate. Both containers need to run for the next steps.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;$ sh ./tensorboard.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;mkdir -p /tmp/tensorboard
[+] Building 0.0s (6/6) FINISHED
 =&amp;gt; [internal] load build definition from Dockerfile_tfprofile =&amp;gt;
 =&amp;gt; transferring dockerfile:
 =&amp;gt; [internal] load dockerignore
 =&amp;gt; [internal] load metadata for docker.io/tensorflow/tensorflow:2.11.
 =&amp;gt; [1/2] FROM docker.io/tensorflow/tensorflow:2.11.
 =&amp;gt; CACHED [2/2] RUN pip install -U tensorboard-plugin-profile
 =&amp;gt; exporting to image
 =&amp;gt; =&amp;gt; exporting layers
...
 =&amp;gt; =&amp;gt; naming to docker.io/library/tensorboard_profiler:latest
Starting 20230128_tfserving_profiling_serving_1    ... done
Recreating 20230128_tfserving_profiling_profiler_1 ... done
Attaching to 20230128_tfserving_profiling_serving_1, 20230128_tfserving_profiling_profiler_1
serving_1   | 2023-02-12 18:30:46.059050: I tensorflow_serving/model_servers/server.cc:74] Building single TensorFlow model file config:  model_name: test_model model_base_path: /models/test_model
...
serving_1   | 2023-02-12 18:30:48.495900: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:213] Running initialization op on SavedModel bundle at path: /models/test_model/1
serving_1   | 2023-02-12 18:30:49.073199: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:305] SavedModel load for tags { serve }; Status: success: OK. Took 2803691 microseconds.
...
serving_1   | 2023-02-12 18:30:49.296815: I tensorflow_serving/model_servers/server.cc:383] Profiler service is enabled
serving_1   | 2023-02-12 18:30:49.298806: I tensorflow_serving/model_servers/server.cc:409] Running gRPC ModelServer at 0.0.0.0:8500 ...
serving_1   | [warn] getaddrinfo: address family for nodename not supported
serving_1   | 2023-02-12 18:30:49.300120: I tensorflow_serving/model_servers/server.cc:430] Exporting HTTP/REST API at:localhost:8501 ...
serving_1   | [evhttp_server.cc : 245] NET_LOG: Entering the event loop ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If both containers are running, head over to your browser and access &lt;a href=&quot;http://localhost:6006&quot;&gt;http://localhost:6006&lt;/a&gt;.
You can start the TensorBoard Profiler by selecting &lt;code&gt;PROFILE&lt;/code&gt; from the top right menu.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/post_tensorboard/tensorboard_menu.png&quot; alt=&quot;TensorBoard Menu&quot;&gt;&lt;/p&gt;
&lt;p&gt;Once you have selected &lt;code&gt;PROFILE&lt;/code&gt;, a menu opens to configure your Profiler session. If you use the provided script, the hostname is &lt;code&gt;serving&lt;/code&gt;. By default, TensorBoard profiles for 1s. This is fairly short, since it takes some time to kick off an inference. I usually use 4000ms as a profiling duration.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/post_tensorboard/tensorboard_configuration.png&quot; alt=&quot;TensorBoard Configuration&quot;&gt;&lt;/p&gt;
&lt;p&gt;After you hit &lt;code&gt;CAPTURE&lt;/code&gt;, submit a prediction request to your TensorFlow Serving setup. You can do this with the following &lt;code&gt;curl&lt;/code&gt; command.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;$ curl -X POST --data &#39;{&amp;quot;instances&amp;quot;: [&amp;quot;This is a request for profiling purposes&amp;quot;]}&#39; http://localhost:8501/v1/models/test_model:predict
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If your payload is more than a few characters, save it in a &lt;code&gt;JSON&lt;/code&gt; formatted file (here &lt;code&gt;data.json&lt;/code&gt;). &lt;code&gt;curl&lt;/code&gt; can load the file and submit it as the request payload.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;$ curl -X POST --data @data.json http://localhost:8501/v1/models/test_model:predict
&lt;/code&gt;&lt;/pre&gt;
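&lt;p&gt;If you prefer Python over &lt;code&gt;curl&lt;/code&gt;, the same request can be sketched with the standard library. The helper names below are my own and not part of TensorFlow Serving. The host, port, and model name mirror the &lt;code&gt;curl&lt;/code&gt; example above and need to be adjusted to your setup.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import json
import urllib.request


def build_predict_request(texts, host=&amp;quot;localhost&amp;quot;, port=8501, model_name=&amp;quot;test_model&amp;quot;):
    # Build the POST request for the TensorFlow Serving REST predict endpoint.
    url = f&amp;quot;http://{host}:{port}/v1/models/{model_name}:predict&amp;quot;
    body = json.dumps({&amp;quot;instances&amp;quot;: list(texts)}).encode(&amp;quot;utf-8&amp;quot;)
    headers = {&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;}
    return urllib.request.Request(url, data=body, headers=headers, method=&amp;quot;POST&amp;quot;)


def predict(texts, **kwargs):
    # Send the request and return the parsed JSON response.
    with urllib.request.urlopen(build_predict_request(texts, **kwargs)) as response:
        return json.loads(response.read())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While the profiler is capturing, call &lt;code&gt;predict([&amp;quot;This is a request for profiling purposes&amp;quot;])&lt;/code&gt; to trigger an inference.&lt;/p&gt;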
&lt;p&gt;A few seconds after you have submitted your &lt;code&gt;curl&lt;/code&gt; request, TensorBoard provides you with a variety of profiling details. The TensorFlow Stats and the Tracer are the most insightful.
The TensorFlow Stats tell you which ops are used most often. This gives you hints on how you could optimize your machine learning model.
The Tracer shows every TensorFlow op in sequence. Here you can see the trace of a BERT model with its 12 layers.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/post_tensorboard/tracer_1.png&quot; alt=&quot;TensorBoard Model Tracer&quot;&gt;&lt;/p&gt;
&lt;p&gt;You can then zoom into any section of interest. For example, I am always checking how much time is taken up by the preprocessing step in the model.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/post_tensorboard/tracer_2.png&quot; alt=&quot;TensorBoard Model Tracer - Zoom&quot;&gt;&lt;/p&gt;
&lt;p&gt;You can then click on any op and drill into its specific details. You might be surprised by what you discover.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/post_tensorboard/tracer_3.png&quot; alt=&quot;TensorBoard Model Tracer - Ops details&quot;&gt;&lt;/p&gt;
&lt;p&gt;Happy profiling :)&lt;/p&gt;
&lt;h2&gt;Further Reading&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.tensorflow.org/tfx/serving/tensorboard&quot;&gt;TensorFlow Profiler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://cloud.google.com/blog/topics/developers-practitioners/how-optimize-training-performance-tensorflow-profiler-vertex-ai/&quot;&gt;Profiling on Google Cloud&#39;s Vertex AI Platform&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Header image by &lt;a href=&quot;https://unsplash.com/@nhoizey?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText&quot;&gt;Nicolas Hoizey&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com/photos/poa-Ycw1W8U?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText&quot;&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Notes on deploying models with TFServing</title>
    <link href="https://www.hanneshapke.com/notes-deployment-with-tfserving.html"/>
    <updated>2023-01-14T00:00:00.000Z</updated>
    <id>https://www.hanneshapke.com/notes-deployment-with-tfserving.html</id>
    <summary>A collection of useful links with information about the inner workings of TFServing</summary>
    <content type="html">&lt;p&gt;I think TFServing is a gold standard for deploying deep learning models. It is lean, memory efficient, and supports a number of non-TensorFlow frameworks like JAX, scikit-learn or XGBoost.&lt;/p&gt;
&lt;p&gt;Here are some notes I constantly refer to for details beyond the Google documentation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.google.com/presentation/d/1yx4oH94R6BNBwiNZHLHHlEUzBLk33ZwX7L6TM4MQ3HM/&quot;&gt;Deploying production ML models with TensorFlow Serving&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/2109.09541.pdf&quot;&gt;Scaling TensorFlow to 300 million predictions per second&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
  
  <entry>
    <title>Notes on Reinforcement Learning from Human Feedback</title>
    <link href="https://www.hanneshapke.com/notes-on-rlhf.html"/>
    <updated>2023-01-10T00:00:00.000Z</updated>
    <id>https://www.hanneshapke.com/notes-on-rlhf.html</id>
    <summary>Reinforcement Learning from Human Feedback (RLHF) is the concept that powers recent models like ChatGPT</summary>
    <content type="html">&lt;p&gt;Reinforcement Learning from Human Feedback (RLHF) is the concept that powers recent models like ChatGPT. In my notes, I am covering resources I found helpful to get started with RLHF.&lt;/p&gt;
&lt;h2&gt;Paper&lt;/h2&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;A classic paper on Reinforcement Learning for Human Feedback (RLHF) is &lt;a href=&quot;https://twitter.com/OpenAI?ref_src=twsrc%5Etfw&quot;&gt;@OpenAI&lt;/a&gt;&amp;#39;s &amp;quot;Learning to summarize from human feedback&amp;quot;.&lt;br&gt;&lt;br&gt;Our talented engineer &lt;a href=&quot;https://twitter.com/PhungVanDuy1?ref_src=twsrc%5Etfw&quot;&gt;@PhungVanDuy1&lt;/a&gt; replicated this paper using our trlX library!&lt;br&gt;&lt;br&gt;Read our report (w/ a code walk-through) here: &lt;a href=&quot;https://t.co/b06Nk8iKDv&quot;&gt;https://t.co/b06Nk8iKDv&lt;/a&gt;&lt;/p&gt;&amp;mdash; Carper (@carperai) &lt;a href=&quot;https://twitter.com/carperai/status/1613645352514768897?ref_src=twsrc%5Etfw&quot;&gt;January 12, 2023&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/2MBJOuVq380&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen&gt;&lt;/iframe&gt;
&lt;h2&gt;Code&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.linkedin.com/posts/omarsar_machinelearning-deeplearning-ai-activity-7013910442238484480-31Bh&quot;&gt;PaLM + RLHF + PyTorch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/lucidrains/PaLM-rlhf-pytorch&quot;&gt;Github repo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
  
  <entry>
    <title>Notes on Model Performance Profiling</title>
    <link href="https://www.hanneshapke.com/notes-on-model-profiling.html"/>
    <updated>2023-01-05T00:00:00.000Z</updated>
    <id>https://www.hanneshapke.com/notes-on-model-profiling.html</id>
    <summary>A collection of useful links with information about model performance profiling</summary>
    <content type="html">&lt;p&gt;Model latency is critical to a successful roll-out of a production machine learning model. No one wants to wait, especially not customers using a machine learning model. My notes cover tools to investigate the model performance to detect bottlenecks within the model graph.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.tensorflow.org/guide/profiler#best_practices_for_optimal_model_performance&quot;&gt;Optimize TensorFlow performance using the Profiler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras&quot;&gt;TensorFlow Profiler: Profile model performance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://colab.research.google.com/github/tensorflow/tensorboard/blob/master/docs/tensorboard_profiling_keras.ipynb#scrollTo=ZlRwCDoVinHV&quot;&gt;Colab notebook about Profiling with TensorBoard&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
  
  <entry>
    <title>Notes on GPT4</title>
    <link href="https://www.hanneshapke.com/notes-on-gpt4.html"/>
    <updated>2023-01-03T00:00:00.000Z</updated>
    <id>https://www.hanneshapke.com/notes-on-gpt4.html</id>
    <summary>A collection of tweets and resources about the potential impact of GPT-4 and its siblings</summary>
    <content type="html">&lt;p&gt;Everyone in the ML space is talking about the potential impact of GPT-4 and its siblings. In my notes, I am sharing a few tweets and resources which stood out over the last few days.&lt;/p&gt;
&lt;p&gt;I highly recommend the good read from Daniel Jeffries about &lt;a href=&quot;https://danieljeffries.substack.com/p/the-age-of-industrialized-ai&quot;&gt;The Age of Industrialized AI&lt;/a&gt;. Thanks to Willem Pienaar for the great reading recommendation.&lt;/p&gt;
&lt;h2&gt;OpenAI&lt;/h2&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;1) ChatGPT is super cool and fun but it&amp;#39;s important to recall OpenAI made basically zero fundamental innovations. Actually the basic innovation behind the GPT software was made at Google Brain in Mountain View&lt;/p&gt;&amp;mdash; Ben Goertzel (@bengoertzel) &lt;a href=&quot;https://twitter.com/bengoertzel/status/1608548419135758337?ref_src=twsrc%5Etfw&quot;&gt;December 29, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;&lt;a href=&quot;https://twitter.com/ShaanVP?ref_src=twsrc%5Etfw&quot;&gt;@ShaanVP&lt;/a&gt; &lt;a href=&quot;https://twitter.com/OpenAI?ref_src=twsrc%5Etfw&quot;&gt;@OpenAI&lt;/a&gt; is literally playing kingmaker by granting early access to GPT4.&lt;br&gt;&lt;br&gt;You nailed it saying the pace of innovation is too high to confidently roll up small AI tools today. &lt;a href=&quot;https://t.co/cL7FZRtmnd&quot;&gt;pic.twitter.com/cL7FZRtmnd&lt;/a&gt;&lt;/p&gt;&amp;mdash; DeepTakes (@DeepTakesAI) &lt;a href=&quot;https://twitter.com/DeepTakesAI/status/1605685319780642816?ref_src=twsrc%5Etfw&quot;&gt;December 21, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;h2&gt;Impact on the ML Community&lt;/h2&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;1/ &lt;a href=&quot;https://twitter.com/hashtag/ChatGPT?src=hash&amp;amp;ref_src=twsrc%5Etfw&quot;&gt;#ChatGPT&lt;/a&gt; is closing out 2022 with a bang, but what’s next? 💥 &lt;a href=&quot;https://twitter.com/OpenAI?ref_src=twsrc%5Etfw&quot;&gt;@OpenAI&lt;/a&gt;’s &lt;a href=&quot;https://twitter.com/hashtag/GPT4?src=hash&amp;amp;ref_src=twsrc%5Etfw&quot;&gt;#GPT4&lt;/a&gt; is set to be the first big &lt;a href=&quot;https://twitter.com/hashtag/AI?src=hash&amp;amp;ref_src=twsrc%5Etfw&quot;&gt;#AI&lt;/a&gt; thing in 2023.&lt;br&gt;&lt;br&gt;So here are some bold, optimistic, yet sensible predictions from me, &lt;a href=&quot;https://twitter.com/vivek7ue?ref_src=twsrc%5Etfw&quot;&gt;@vivek7ue&lt;/a&gt; and &lt;a href=&quot;https://twitter.com/rajhans_samdani?ref_src=twsrc%5Etfw&quot;&gt;@rajhans_samdani&lt;/a&gt; ... 👀&lt;/p&gt;&amp;mdash; sridhar (@RamaswmySridhar) &lt;a href=&quot;https://twitter.com/RamaswmySridhar/status/1605603043046674435?ref_src=twsrc%5Etfw&quot;&gt;December 21, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;h2&gt;Impact on Google&lt;/h2&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;The LLMs will kill Google takes are utter nonsense at worst and naive at best.&lt;br&gt;&lt;br&gt;Don’t you think they are busy at work on LLMs?&lt;/p&gt;&amp;mdash; Jo Kristian Bergum (@jobergum) &lt;a href=&quot;https://twitter.com/jobergum/status/1598582676742553600?ref_src=twsrc%5Etfw&quot;&gt;December 2, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;h2&gt;Overall Economic Impact&lt;/h2&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;GPT4 will be out soon and will probably cause a similar economic shock to one from Covid. Instant distribution with nearly instant adoption and nearly instant productivity increase for hundreds of millions of knowledge workers. Brace yourselves, 2023 is coming&lt;/p&gt;&amp;mdash; Nick Davidov (@Nick_Davidov) &lt;a href=&quot;https://twitter.com/Nick_Davidov/status/1606688723265277952?ref_src=twsrc%5Etfw&quot;&gt;December 24, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;h2&gt;Detecting ChatGPT-Produced Content&lt;/h2&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;I spent New Years building GPTZero — an app that can quickly and efficiently detect whether an essay is ChatGPT or human written&lt;/p&gt;&amp;mdash; Edward Tian (@edward_the6) &lt;a href=&quot;https://twitter.com/edward_the6/status/1610067688449007618?ref_src=twsrc%5Etfw&quot;&gt;January 3, 2023&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
</content>
  </entry>
  
  <entry>
    <title>Receiving Google Open Source Peer Bonus Award 2022</title>
    <link href="https://www.hanneshapke.com/google-open-source-peer-bonus.html"/>
    <updated>2022-12-31T00:00:00.000Z</updated>
    <id>https://www.hanneshapke.com/google-open-source-peer-bonus.html</id>
    <summary>Receiving Google Open Source Peer Bonus Award 2022</summary>
    <content type="html">&lt;p&gt;I feel very honored that I received the &lt;a href=&quot;https://opensource.google/documentation/reference/growing/peer-bonus&quot;&gt;Google Open Source Peer Bonus Award&lt;/a&gt; for my contributions to the TFX Addons project in 2022!
I am very grateful to the TFX/Google OSS team and the TFX Addons community for their support and collaboration.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/open_source_award/cert.png&quot; alt=&quot;Google Open Source Peer Bonus Award&quot;&gt;&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Highly interesting Twitter threads to revisit from time to time</title>
    <link href="https://www.hanneshapke.com/highly-interesting-twitter-thread-to-reread.html"/>
    <updated>2022-10-22T00:00:00.000Z</updated>
    <id>https://www.hanneshapke.com/highly-interesting-twitter-thread-to-reread.html</id>
    <summary>I find myself revisiting highly interesting Twitter threads. Here is a list of the most interesting threads, sorted by topic ...</summary>
    <content type="html">&lt;p&gt;I find myself revisiting highly interesting Twitter threads. Here is a list of the most interesting threads, sorted by topic ...&lt;/p&gt;
&lt;!-- ![Machine Learning](/images/alina-grubnyak-ZiQkhI7417A-unsplash.jpg#wide)
*Photo by [Alina Grubnyak](https://unsplash.com/photos/ZiQkhI7417A) on [Unsplash](https://unsplash.com/)* --&gt;
&lt;h1&gt;Machine Learning&lt;/h1&gt;
&lt;p&gt;Critical ML topics for the coming 3-4 years by &lt;a href=&quot;https://twitter.com/soniajoseph_/&quot;&gt;@soniajoseph_&lt;/a&gt;. The replies to the thread are insightful.&lt;/p&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;machine learning twitter-- &lt;br&gt;&lt;br&gt;ML is moving fast. Which research ideas / PhD topics will remain critical 3-4 years from now?&lt;/p&gt;&amp;mdash; Sonia Joseph (@soniajoseph_) &lt;a href=&quot;https://twitter.com/soniajoseph_/status/1583184692282478592?ref_src=twsrc%5Etfw&quot;&gt;October 20, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;p&gt;&lt;a href=&quot;https://twitter.com/ericmander/&quot;&gt;@ericmander&lt;/a&gt; on &amp;quot;Foundation models are the new public cloud and AI is the new SaaS&amp;quot;&lt;/p&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;machine learning twitter-- &lt;br&gt;&lt;br&gt;ML is moving fast. Which research ideas / PhD topics will remain critical 3-4 years from now?&lt;/p&gt;&amp;mdash; Sonia Joseph (@soniajoseph_) &lt;a href=&quot;https://twitter.com/ericmander/status/1575390598512746496?ref_src=twsrc%5Etfw&quot;&gt;October 20, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;p&gt;&lt;a href=&quot;https://twitter.com/jobergum/&quot;&gt;@jobergum&lt;/a&gt; on bad online experiences caused by batch-oriented ML systems&lt;/p&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Many bad online experiences are caused by predictions done by batch-oriented ML systems that do not consider the real-time context.&lt;/p&gt;&amp;mdash; Jo Kristian Bergum (@jobergum) &lt;a href=&quot;https://twitter.com/jobergum/status/1576287869005889537?ref_src=twsrc%5Etfw&quot;&gt;October 1, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;p&gt;Great thread on lightweight Python setups for deploying ML models by &lt;a href=&quot;https://twitter.com/simonw&quot;&gt;@simonw&lt;/a&gt;.
The replies to this tweet offer a number of interesting options.&lt;/p&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;What&amp;#39;s the lightest Python dependency (in terms of complexity/amount of code/ideally no compiled dependencies) that would let me add a tiny ML model to a Python application? For inference against a bundled model, not for training the model itself&lt;/p&gt;&amp;mdash; Simon Willison (@simonw) &lt;a href=&quot;https://twitter.com/simonw/status/1576680930680262658?ref_src=twsrc%5Etfw&quot;&gt;October 2, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;p&gt;A thread on why embeddings are a pain in the *** by &lt;a href=&quot;https://twitter.com/mlopscommunity/&quot;&gt;@mlopscommunity&lt;/a&gt;.
Great thread about embeddings and their hidden complexity.&lt;/p&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;What&amp;#39;s the lightest Python dependency (in terms of complexity/amount of code/ideally no compiled dependencies) that would let me add a tiny ML model to a Python application? For inference against a bundled model, not for training the model itself&lt;/p&gt;&amp;mdash; Simon Willison (@simonw) &lt;a href=&quot;https://twitter.com/mlopscommunity/status/1562078702535573505?ref_src=twsrc%5Etfw&quot;&gt;October 2, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;p&gt;A thread about real-time ML by &lt;a href=&quot;https://twitter.com/mlopscommunity/&quot;&gt;@mlopscommunity&lt;/a&gt;&lt;/p&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Why are building real-time data pipelines in machine learning so challenging? &lt;br&gt;&lt;br&gt;Let&amp;#39;s talk about it in this 🧵&lt;/p&gt;&amp;mdash; MLOps Community (@mlopscommunity) &lt;a href=&quot;https://twitter.com/mlopscommunity/status/1563162922247139332?ref_src=twsrc%5Etfw&quot;&gt;August 26, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;!-- ![Startups](/images/15-1.jpg#wide)
*Photo by [Israel Andrade](https://unsplash.com/photos/YI_9SivVt_s) on [Unsplash](https://unsplash.com/)* --&gt;
&lt;h1&gt;MLOps&lt;/h1&gt;
&lt;p&gt;Great thread by &lt;a href=&quot;https://twitter.com/GoAbiAryan/&quot;&gt;@GoAbiAryan&lt;/a&gt; on papers about ML system design.&lt;/p&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;This has been such an excellent year for software system design in ML. So, I compiled a list of some of my favorite papers 📜in MLOps. &lt;br&gt;&lt;br&gt;Here are some of my favorite ones till date⤵️&lt;/p&gt;&amp;mdash; Abi Aryan (@GoAbiAryan) &lt;a href=&quot;https://twitter.com/GoAbiAryan/status/1580852750526468097?ref_src=twsrc%5Etfw&quot;&gt;October 14, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;h1&gt;Startups&lt;/h1&gt;
&lt;p&gt;Advice on good pitch decks by &lt;a href=&quot;https://twitter.com/wallstreetpaper/&quot;&gt;@wallstreetpaper&lt;/a&gt;&lt;/p&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;At &lt;a href=&quot;https://twitter.com/HarlemCapital?ref_src=twsrc%5Etfw&quot;&gt;@HarlemCapital&lt;/a&gt; we see 4k+ pitch decks a year&lt;br&gt;&lt;br&gt;Here are the the 10 slides you should have in your pitch deck&lt;br&gt;&lt;br&gt;A THREAD 🧵&lt;/p&gt;&amp;mdash; Brandon (wallstreetpaper.eth) (@wallstreetpaper) &lt;a href=&quot;https://twitter.com/wallstreetpaper/status/1582884312604504064?ref_src=twsrc%5Etfw&quot;&gt;October 20, 2022&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
</content>
  </entry>
  
</feed>
