Serverless GPUs: L4, L40S, V100, and more in private preview

https://www.koyeb.com/blog/serverless-gpus-in-private-preview-l4-l40s-v100-and-more

Today, we’re excited to share that Serverless GPUs are available for all your AI inference needs directly through the Koyeb platform!

We're starting with GPU Instances designed to support AI inference workloads, from heavy generative AI models to lighter computer vision models. These GPUs provide up to 48GB of vRAM, 733 TFLOPS (FP8), and 900GB/s of memory bandwidth to support large models, including LLMs and text-to-image models.

We're launching with four Instances, starting at $0.50/hr and billed by the second:

|                      | RTX 4000 SFF ADA | V100        | L4          | L40S        |
|----------------------|------------------|-------------|-------------|-------------|
| GPU vRAM             | 20GB             | 16GB        | 24GB        | 48GB        |
| GPU Memory Bandwidth | 280GB/s          | 900GB/s     | 300GB/s     | 864GB/s     |
| FP32                 | 19.2 TFLOPS      | 15.7 TFLOPS | 30.3 TFLOPS | 91.6 TFLOPS |
| FP8                  | -                | -           | 240 TFLOPS  | 733 TFLOPS  |
| RAM                  | 44GB             | 44GB        | 44GB        | 96GB        |
| vCPU                 | 6                | 8           | 15          | 30          |
| On-demand price      | $0.50/hr         | $0.85/hr    | $1/hr       | $2/hr       |

All of these Instances come with dedicated vCPUs, where each vCPU is equivalent to one hyperthread.

These GPUs come with the same serverless deployment experience you already know on the platform: one-click deployment of Docker containers, built-in load balancing, seamless horizontal autoscaling, zero-downtime deployments, auto-healing, vector databases, observability, and real-time monitoring.
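
As a quick illustration of configuring autoscaling, here is a minimal sketch using the Koyeb CLI. The service name is a placeholder and the scale bounds are arbitrary; the exact flags may vary with your CLI version, so treat this as an assumption rather than a reference:

# Let the service scale between 1 and 3 Instances based on load
# (my-app/my-service is a placeholder name)
koyeb services update my-app/my-service --min-scale 1 --max-scale 3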

You can also pause your GPU Instances when they're not in use. This is a great way to stretch your compute budget when you don't need to keep your GPU Instances running 24/7.
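
To illustrate, here is a minimal sketch of pausing and resuming a service from the Koyeb CLI. The service name my-app/my-service is a placeholder, and this assumes your CLI version ships the pause and resume subcommands:

# Pause the service to stop paying for it while it is idle
koyeb services pause my-app/my-service

# Resume it when you need the GPU again
koyeb services resume my-app/my-service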

[Image: Serverless GPUs on Koyeb]

To access these GPU Instances, join the preview on koyeb.com/ai. We're gradually onboarding users to ensure the best experience for everyone.

If you need GPUs in volume, let us know in your request or book a call.

To get started and deploy your first service backed by a GPU, you can use the Koyeb CLI or the Koyeb Dashboard.

As usual, you can deploy using pre-built containers or directly connect your GitHub repository and let Koyeb handle the build of your applications.

Here is how you can deploy an Ollama service with a single CLI command:

koyeb app init ollama \
  --docker ollama/ollama \
  --instance-type l4 \
  --regions fra \
  --port 11434:http \
  --route /:11434 \
  --docker-command serve

That's it! In less than 60 seconds, you will have Ollama running on Koyeb using an L4 GPU.

You can then pull your favorite models and start interacting with them.
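
As an illustration, here is a minimal sketch using Ollama's HTTP API against your service's public URL. The hostname below is a placeholder for the URL Koyeb assigns to your app, and llama3 is just an example model name:

# Pull a model onto the running Ollama service
# (replace the hostname with your app's public URL)
curl https://your-app.koyeb.app/api/pull -d '{"name": "llama3"}'

# Run a prompt against the pulled model
curl https://your-app.koyeb.app/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?"}'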

We're super excited about this release: we believe it dramatically simplifies deploying production-grade inference workloads, APIs, and endpoints thanks to the serverless capabilities of the Koyeb platform.

Our goal is to let you easily access, build on, experiment with, and deploy to the best accelerators from AMD, Intel, Furiosa, Qualcomm, and Nvidia using one unified platform.

If you're looking for specific configurations, GPUs, or accelerators, we'd love to hear from you. We're currently adding more GPUs and accelerators to the platform and are working closely with early users to design our offering.

Let's get in touch! Whether you require high-performance GPUs, specialized accelerators, or unique hardware configurations, we want to hear from you.

Sign up for the platform today and join the GPU serverless private preview!

Keep up with all the latest updates by joining our vibrant and friendly serverless community or follow us on X at @gokoyeb.
