Building Production ML Systems That Scale

ML Engineering Leader, author of four machine learning books, and Google Developer Expert. I help organizations architect robust ML systems from research to production, specializing in MLOps, generative AI deployment, and scalable data platforms.

11 min read

Speculative Decoding with vLLM using Gemma

Improving LLM inference with speculative decoding using Gemma

10 min read

Deploying Google's Gemma on Vertex AI

A comprehensive guide to deploying Google's Gemma language model on Vertex AI using vLLM, covering model registration, endpoint creation, and production deployment best practices.

11 min read

Speculative Decoding with vLLM

Improving LLM inference with speculative decoding

7 min read

How to Profile TensorFlow Serving Inference Requests with TFProfiler

Identifying bottlenecks in your deep learning model can be crucial for reducing its inference latency

1 min read

Receiving Google Open Source Peer Bonus Award 2022

1 min read

Notes on deploying models with TFServing

A collection of useful links with information about the inner workings of TFServing

1 min read

Notes on Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is the technique that powers recent models like ChatGPT

1 min read

Notes on Model Performance Profiling

A collection of useful links with information about model performance profiling

1 min read

Notes on GPT4

A collection of useful links with information about GPT4