Updated: Jul 31
Google Cloud offers an excellent managed Jupyter Notebook service called Vertex AI Workbench, but it is relatively pricey. For some cases, I suggest an old trick that can get you up to a 91% discount: hosting Jupyter Notebooks on Compute Engine Spot Instances using standard templates. This easy solution yields almost the same experience as Google’s managed notebook service, so you can be equally productive while considerably reducing costs. In fact, for short ML tasks like EDA or model experimentation, it can be an optimal solution on GCP. However, please mind the cost-SLA tradeoff here.
Vertex AI Workbench – Great Experience for a High Price!
In case you are new to data science or have spent the past five years on Mars: Jupyter Notebooks are the most popular coding environment for research, EDA, and basic model training. Some companies like Databricks and Netflix even suggest using them beyond interactive work and integrating them into ML/data pipelines. 💡
Because notebooks are so useful, Google Cloud built a managed version of Jupyter Notebooks called Vertex AI Workbench. This fantastic service reduces a notebook’s launch time to under 60 seconds and adds various features that make your workflow easier. However, it introduces a management fee and only supports on-demand instances (not Spot instances), making it too expensive for non-critical tasks IMO.
Deep Learning VMs – A Cost-Efficient Alternative!
Luckily, Google Cloud has an alternative product to Vertex AI Workbench called Deep Learning VMs. This product is essentially Compute Engine VMs with pre-installed ML/DL images that include popular data science Python libraries, R, CUDA, and TensorFlow. Most importantly, when the VM launches, it automatically runs a JupyterLab server that can be accessed through the browser.
Since DL-VMs are not a managed service, they are more customizable, which opens more opportunities for cost optimization. When it comes to cost optimization on the cloud, it’s important to mention Spot instances, and indeed, DL-VMs support this feature! But, of course, there is a tradeoff between user experience and cost optimization. To learn more about this tradeoff, you can read this other blog post about the advantages and disadvantages of the different notebook services on Google Cloud.
Because DL-VMs are essentially Compute Engine VMs with a pre-installed image, they can be launched as Spot Instances. That makes them eligible for the Spot discount, and they carry no additional service-management fee!
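To get a feel for what such a discount means for your bill, here is a tiny back-of-the-envelope helper. It is only a sketch: `estimate_cost` is a hypothetical function name, and the hourly price in the usage comment is a made-up placeholder, not a real GCP price. The only number taken from this post is the 91% maximum discount.

```shell
#!/usr/bin/env bash
# Rough cost estimate for a notebook VM. All prices are placeholders that you
# pass in yourself; look up real numbers on the Compute Engine pricing page.

# estimate_cost HOURLY_PRICE DISCOUNT_PERCENT HOURS
# Prints the cost of running for HOURS at HOURLY_PRICE after the discount.
estimate_cost() {
  awk -v price="$1" -v discount="$2" -v hours="$3" \
    'BEGIN { printf "%.2f\n", price * (1 - discount / 100) * hours }'
}

# Example with a made-up on-demand price of 1.00 per hour for 10 hours:
#   estimate_cost 1.00 0 10    # on-demand, no discount
#   estimate_cost 1.00 91 10   # Spot at the maximum advertised discount
```

Keep in mind that 91% is an upper bound, so your actual Spot discount may be lower.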
Spot Instances are Compute Engine VMs offered at up to a 91% discount versus on-demand pricing. They have the EXACT same specs as their on-demand equivalents, with only one downside: they may abruptly shut down. Spot instances rely on excess hardware available in a specific region; when demand gets too high and customers willing to pay full price request capacity, Google gives them priority and evicts you from your Spot machine. When working with Spot instances, please keep in mind:
⚠️ When the machine shuts down, it happens abruptly. You only get a short notice (roughly 30 seconds) before it happens.
♻️ After the machine shuts down, it’s possible to restart it; no need to create a new one.
🗄️ Data saved on the machine’s disk will NOT be affected by the shutdown; it stays there. You also keep paying for this storage while the machine is off.
😱 On shutdown the RAM is cleared, so any unsaved work will be lost :/
📈 High demand for your instance type in your region increases the chance of frequent shutdowns.
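Since a stopped Spot machine can simply be started again, a small helper can check the state and restart it for you. This is a minimal sketch: `vm_status` and `restart_if_stopped` are hypothetical function names, and the instance name and zone in the usage comment are the example values from the tutorial below. It relies on Compute Engine reporting a stopped VM with the status `TERMINATED`.

```shell
#!/usr/bin/env bash
# Sketch: check a Spot VM's state and start it again if it was stopped.

# vm_status INSTANCE ZONE
# Prints the VM status, e.g. RUNNING or TERMINATED (Compute Engine's name
# for a stopped VM).
vm_status() {
  gcloud compute instances describe "$1" --zone="$2" --format="value(status)"
}

# restart_if_stopped INSTANCE ZONE
# Starts the VM only when it is currently stopped.
restart_if_stopped() {
  local name="$1" zone="$2" status
  status="$(vm_status "$name" "$zone")" || return 1
  if [ "$status" = "TERMINATED" ]; then
    echo "VM $name is stopped, starting it again"
    gcloud compute instances start "$name" --zone="$zone"
  else
    echo "VM $name is $status, nothing to do"
  fi
}

# Usage (placeholder values):
#   restart_if_stopped notebook-vm-spot us-central1-b
```

If the zone is still short on capacity, the start request may fail, so you might have to try again a bit later.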
So why not use Vertex AI Workbench with Spot machines and get the 91% discount? Well, Google just doesn’t allow it, and that makes sense. Spot instances are not covered by an SLA, which runs against Google’s commitment to a reliable user experience. You wouldn’t expect a managed service to suddenly shut down on you, right? 💁‍♂️
So how can you use Jupyter Notebooks on Spot machines? Let’s have a quick walkthrough.
Tutorial: Launching a Spot Deep Learning VM
Step 1: Use a shell (terminal) with the gcloud tool installed
You can either install the gcloud command-line tool and use your computer’s terminal, or use Google’s managed Cloud Shell.
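If you are not sure whether your terminal is ready, a quick sanity check like the one below can help. It is just a sketch, and `check_gcloud` is a hypothetical helper name; it only confirms that the `gcloud` command is available and shows which project it currently points at.

```shell
#!/usr/bin/env bash
# Sketch: confirm that gcloud is installed and see which project it targets.

check_gcloud() {
  if ! command -v gcloud >/dev/null 2>&1; then
    echo "gcloud not found: install the Google Cloud CLI or use Cloud Shell" >&2
    return 1
  fi
  # Prints the active project; the value is empty when none is configured.
  echo "active project: $(gcloud config get-value project 2>/dev/null)"
}

# Usage:
#   check_gcloud
```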
Step 2: Select an image that you want to run
You can read more about selecting an image, but here are the basics:
Running data science? Use the plain common image
Running deep learning? Use the TensorFlow/PyTorch image
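Image family names change over time, so it can be worth listing what is currently published before you pick one. The sketch below assumes the images live in the `deeplearning-platform-release` project (the same project used in the create command in Step 3); `list_dlvm_families` is a hypothetical helper name, and the keywords in the usage comment are only examples.

```shell
#!/usr/bin/env bash
# Sketch: list Deep Learning VM image families, optionally filtered by keyword.

# list_dlvm_families [KEYWORD]
# KEYWORD is a plain substring such as "common", "tf" or "pytorch".
# Without a keyword, every family is printed.
list_dlvm_families() {
  local keyword="${1:-}"
  gcloud compute images list \
    --project=deeplearning-platform-release \
    --no-standard-images \
    --format="value(family)" \
    | sort -u \
    | grep -F -- "$keyword"
}

# Usage:
#   list_dlvm_families common
#   list_dlvm_families pytorch
```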
Step 3: Run the create-instance command
export IMAGE_FAMILY="common-cpu"
export ZONE="us-central1-b"
export INSTANCE_NAME="notebook-vm-spot"
export INSTANCE_TYPE="e2-standard-2"

gcloud compute instances create $INSTANCE_NAME \
  --zone=$ZONE \
  --image-family=$IMAGE_FAMILY \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --machine-type=$INSTANCE_TYPE \
  --boot-disk-size=200GB \
  --provisioning-model=SPOT \
  --instance-termination-action=STOP
Step 4: Secure the connection between your computer and the server
To ensure that all access to the notebook is authorized, Google blocks unsecured connections to the machine. Therefore, you will need to create an SSH tunnel to the notebook server from your device. Sounds difficult? Nah, it’s only one line of code.
On your local computer, run the following command:
# If this is a new terminal, export ZONE and INSTANCE_NAME again as in Step 3.
export PROJECT_ID="my-gcp-project-id"

gcloud compute ssh \
  --project $PROJECT_ID \
  --zone $ZONE \
  $INSTANCE_NAME \
  -- -L 8080:localhost:8080
Just keep your terminal open, otherwise the connection will be lost 😅.
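If port 8080 is already taken on your computer, you can forward a different local port instead. The wrapper below is a sketch under a couple of assumptions: `open_tunnel` is a hypothetical helper name, and it adds the standard ssh `-N` flag so the session only forwards the port instead of opening a remote shell. Drop `-N` if you also want a shell on the VM.

```shell
#!/usr/bin/env bash
# Sketch: wrap the tunnel command so the local port is easy to change.

# open_tunnel INSTANCE ZONE PROJECT [LOCAL_PORT]
# Forwards LOCAL_PORT on your computer (default 8080) to port 8080 on the VM.
open_tunnel() {
  local name="$1" zone="$2" project="$3" local_port="${4:-8080}"
  gcloud compute ssh "$name" \
    --project "$project" \
    --zone "$zone" \
    -- -N -L "${local_port}:localhost:8080"
}

# Usage (placeholder values), then browse to http://localhost:9090
#   open_tunnel notebook-vm-spot us-central1-b my-gcp-project-id 9090
```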
Step 5: Open JupyterLab in your local browser
Now just go to your browser, open http://localhost:8080, and that’s it: you can work on your machine. Do remember to save your work regularly!
Tips and advanced tricks
Your chances of getting a Spot GPU are low. Due to the ongoing shortage of silicon hardware, GPUs are scarce resources. If you are using TensorFlow, consider getting TPUs (especially Spot ones) instead!
Small machines live longer: your machine leverages excess hardware. The more cores and memory your instance uses, the greater the chance it will be snatched to serve full-price customers.
Frequently save your work! It’s not recommended to use Spot machines for long-running tasks like training neural networks. However, if you do, find ways to save your work so you can restore it if the machine restarts. For example, create a callback that saves the model every few epochs.
Define a shutdown script: GCP allows you to run a script just before the instance shuts down. It lets you gracefully finish and back up your work.
Use E2 machine types for EDA and other non-compute-heavy use cases. If you are using large datasets, use the e2-highmem group to get more RAM per CPU.
Advanced tip: Use GCP’s Deployment Manager to create a template and launch the instances from there.
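To make the shutdown-script tip more concrete, here is a minimal sketch that writes a small backup script and attaches it to the VM through the `shutdown-script` metadata key. Several things here are assumptions and not part of the setup above: the helper names `write_shutdown_script` and `attach_shutdown_script`, the work directory, and the bucket name are all hypothetical, and the VM’s service account and access scopes must allow writing to that bucket. Spot machines only get a brief, best-effort window before they stop, so the script should stay small.

```shell
#!/usr/bin/env bash
# Sketch: create a shutdown script and attach it to the VM.
# The work directory and bucket are hypothetical placeholders.

# write_shutdown_script OUTPUT_FILE WORK_DIR BUCKET
# Writes a script that copies WORK_DIR to a Cloud Storage bucket.
write_shutdown_script() {
  local out="$1" work_dir="$2" bucket="$3"
  cat > "$out" <<EOF
#!/bin/bash
# Runs on the VM shortly before it stops. Spot VMs only get a brief,
# best-effort window, so keep the work small.
gsutil -m rsync -r "$work_dir" "gs://$bucket/notebook-backup"
EOF
}

# attach_shutdown_script INSTANCE ZONE SCRIPT_FILE
# Stores the script in the VM's metadata under the shutdown-script key.
attach_shutdown_script() {
  gcloud compute instances add-metadata "$1" --zone="$2" \
    --metadata-from-file shutdown-script="$3"
}

# Usage (placeholder values):
#   write_shutdown_script shutdown.sh /home/jupyter my-backup-bucket
#   attach_shutdown_script notebook-vm-spot us-central1-b shutdown.sh
```

A script like this only protects files that are already on disk; whatever lives in RAM, such as unsaved notebook cells, still needs to be saved from inside the notebook.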