Building multi-tenant JupyterHub Platforms on Amazon EKS


Introduction

In recent years, there’s been a remarkable surge in the adoption of Kubernetes for data analytics and machine learning (ML) workloads in the tech industry. This increase is underpinned by a growing recognition that Kubernetes offers a reliable and scalable infrastructure to handle these demanding computational workloads. Furthermore, a recent wave of Generative AI (GenAI) has introduced an exciting era of possibilities in artificial intelligence (AI), which allows machines to autonomously generate content, refine Large language models (LLMs), and revolutionize the way we approach problem-solving in various domains.

As organizations increasingly seek to harness the potential of machine learning and GenAI, they find themselves in need of a versatile and interactive platform to experiment with various GenAI models on CPUs, GPUs, and AWS Neuron devices. These organizations are on a quest to process data, train and fine-tune models, and run inference tests until they achieve a production-ready model that can be deployed at scale. However, efficiency and cost-optimization are crucial to these organizations, which strive to empower their data and machine learning teams with a platform that maximizes value.

This is where the power of Jupyter Notebooks on Kubernetes comes into play. Jupyter Notebooks have become a go-to tool for data scientists and ML engineers, offering an interactive environment for coding, ML experimentation with training and inference, and data visualization. JupyterHub is an extension of Jupyter Notebooks that takes collaboration and scalability to the next level by providing a multi-user, multi-environment platform. JupyterHub can seamlessly run on Kubernetes, making it an ideal choice for organizations looking to create efficient, collaborative, and scalable ML environments. With Amazon Elastic Kubernetes Service (Amazon EKS), you can unleash the full potential of JupyterHub by taking advantage of a highly scalable and reliable Amazon EKS cluster while customizing compute and storage options for individual Notebook users. Furthermore, pod and node scaling are seamlessly managed through the Karpenter autoscaler, along with effortless integration with various AWS services.

In this post, we’ll delve into the process of building, deploying, and harnessing the power of multi-tenant JupyterHub environments on Amazon EKS at scale, with a keen focus on scalability, efficiency, security, and user empowerment. We’ll guide you through the steps to create a tailored JupyterHub environment that meets the diverse requirements of teams such as data engineering, data science, and ML engineering. Imagine a scenario where users can effortlessly log in to JupyterHub, each with a dedicated workspace and the flexibility to choose from various profiles with frameworks like PySpark, Scala Spark, PyTorch, and TensorFlow. You’ll also have the choice of CPUs, GPUs, AWS Trainium, and AWS Inferentia instances that seamlessly scale to your needs. Working with GPUs and AWS Neuron devices requires several components, including Compute Unified Device Architecture (CUDA) drivers, which enable software to use the GPU for computational tasks. This solution deploys the NVIDIA device plugin and Neuron device plugin by default. Additionally, observability features are included to monitor your workloads using Prometheus and Grafana.

Solution overview

In the following diagram, you can see the overall solution architecture, which showcases the structure of a multi-tenant JupyterHub platform operating on Amazon EKS. Numerous users interact with JupyterHub via an Elastic Load Balancing (ELB) on the Amazon EKS Cluster. Karpenter handles the creation of the corresponding nodes, while the Kubernetes scheduler manages the scheduling of JupyterHub Notebook pods. These notebooks utilize Persistent Volume Claims (PVC) to establish their home directories and shared directories, all of which are connected to Amazon Elastic File System (Amazon EFS) filesystems.

Multi-tenant JupyterHub on Amazon EKS with Core/Karpenter Nodes, EFS, Prometheus, Grafana, NVIDIA.

Walkthrough

Setup and configuration

Data on Amazon EKS (DoEKS) is an open-source project that offers a distinct set of Infrastructure as Code blueprints designed for building AWS-managed and self-managed scalable data and ML platforms on Amazon EKS. We’ll utilize the JupyterHub blueprint from DoEKS to deploy the end-to-end solution, then walk through how multiple users can log in and use sample scripts to demonstrate the usage of NVIDIA GPU instances as well as AWS Trainium and Inferentia instances.

JupyterHub authentication options

This blueprint provides support for two authentication mechanisms: dummy and cognito. In this post, we’ll use the dummy mechanism for ease of demonstration; it is not a recommended authentication mechanism for production. For production-ready setups, we strongly advise utilizing the cognito method or another supported mechanism listed on the Authenticators page.
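For reference, the following is a minimal, hedged sketch of how the dummy authenticator is typically enabled in a Zero to JupyterHub-style Helm values file; the blueprint’s actual values file may differ.

# Illustrative Helm values excerpt: the dummy authenticator accepts any
# username/password pair, which is why it is only suitable for demos.
hub:
  config:
    JupyterHub:
      authenticator_class: dummy
    # For production, a Cognito or other OAuth-based authenticator would be
    # configured here instead, with its client ID, secret, and callback URL.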

Prerequisites

Before you create the test environment using Terraform, you must have the following prerequisites:

Step 1: Clone the DoEKS repository and grant permissions

The first step is to clone the DoEKS repository and grant permissions to the install.sh script. This script automates the deployment process for JupyterHub and sets up the necessary configurations.

git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks/ai-ml/jupyterhub && chmod +x install.sh

Step 2: Execute the installation script

./install.sh

Sometimes, the install.sh script might not complete successfully due to Terraform dependencies or timeouts. In such cases, simply rerun the install.sh script to complete the deployment. If the deployment was successful, you should see the following message.

"SUCCESS: Terraform apply of all modules completed successfully"

Step 3: Verify the deployment

First, you’ll need to configure your kubeconfig to connect to the newly created Amazon EKS cluster. Use the following command, replacing us-west-2 with your specific AWS Region if necessary:

aws eks --region us-west-2 update-kubeconfig --name jupyterhub-on-eks

Now, you can check the status of the pods across various namespaces by running the command shown below. Keep an eye out for the key deployments:

  • In the jupyterhub namespace, ensure that there are four JupyterHub pods running.
  • In the gpu-operator namespace, verify that NVIDIA GPU Operator pods are active.
  • Verify the running pods of Karpenter, FluentBit, Cluster Autoscaler, CoreDNS, AWS Node, KubeProxy, EBS CSI Controller, Metrics Server, KubeCost, and the Kube Prometheus stack.

This verification step is essential to guarantee that all the necessary components are functioning correctly. If everything is in order, then you can proceed with confidence, knowing that your JupyterHub environment on Amazon EKS is ready to empower your data and machine learning teams.

$ kubectl get pods -A
NAMESPACE               NAME                                                      STATUS
gpu-operator            gpu-operator-6b8db67bfb-gff8r                             Running
gpu-operator            nvidia-gpu-operator-node-feature-discovery-master-6fb7d   Running
gpu-operator            nvidia-gpu-operator-node-feature-discovery-worker-2pwpp   Running
gpu-operator            nvidia-gpu-operator-node-feature-discovery-worker-hphbk   Running
gpu-operator            nvidia-gpu-operator-node-feature-discovery-worker-pfqbf   Running
gpu-operator            nvidia-gpu-operator-node-feature-discovery-worker-zhv95   Running
jupyterhub              hub-c9d47cc5-s66cf                                        Running
jupyterhub              proxy-79fd94c9bb-2mwlg                                    Running
jupyterhub              user-scheduler-555fcf5b7f-2q5vb                           Running
jupyterhub              user-scheduler-555fcf5b7f-z8285                           Running
karpenter               karpenter-5796d4d446-brhz2                                Running
karpenter               karpenter-5796d4d446-tgpqp                                Running
kube-prometheus-stack   kube-prometheus-stack-grafana-6bcdfc9959-2sm92            Running
kube-prometheus-stack   kube-prometheus-stack-kube-state-metrics-796d4ff45d-bt6   Running
kube-prometheus-stack   kube-prometheus-stack-operator-6fc7779d7b-pt6r5           Running
kube-prometheus-stack   kube-prometheus-stack-prometheus-node-exporter-66rzl      Running
kube-prometheus-stack   kube-prometheus-stack-prometheus-node-exporter-6b4kf      Running
kube-prometheus-stack   kube-prometheus-stack-prometheus-node-exporter-8fj5f      Running
kube-prometheus-stack   kube-prometheus-stack-prometheus-node-exporter-zhshn      Running
kube-prometheus-stack   prometheus-kube-prometheus-stack-prometheus-0             Running
kube-system             aws-for-fluent-bit-bw22h                                  Running
kube-system             aws-for-fluent-bit-dvpjt                                  Running
kube-system             aws-for-fluent-bit-j4xd2                                  Running
kube-system             aws-for-fluent-bit-ztlgx                                  Running
kube-system             aws-load-balancer-controller-5d74f87558-79k4h             Running
kube-system             aws-load-balancer-controller-5d74f87558-p7vlp             Running
kube-system             aws-node-jqzdd                                            Running
kube-system             aws-node-mr7m9                                            Running
kube-system             aws-node-tr89h                                            Running
kube-system             aws-node-xbwlk                                            Running
kube-system             cluster-autoscaler-aws-cluster-autoscaler-6fbfb79f4f-jf   Running
kube-system             cluster-proportional-autoscaler-kube-dns-autoscaler-5d8   Running
kube-system             coredns-5b4ff8bf94-8829g                                  Running
kube-system             coredns-5b4ff8bf94-v2d7d                                  Running
kube-system             ebs-csi-controller-748f9bb4cf-hhr7h                       Running
kube-system             ebs-csi-controller-748f9bb4cf-k4lv4                       Running
kube-system             ebs-csi-node-c6htq                                        Running
kube-system             ebs-csi-node-c8pxx                                        Running
kube-system             ebs-csi-node-gg8jm                                        Running
kube-system             ebs-csi-node-jbhtl                                        Running
kube-system             kube-proxy-8kmkn                                          Running
kube-system             kube-proxy-fks4h                                          Running
kube-system             kube-proxy-gfrhr                                          Running
kube-system             kube-proxy-j5wj9                                          Running
kube-system             metrics-server-864c8db8fd-7426x                           Running
kube-system             metrics-server-864c8db8fd-kk256                           Running
kubecost                kubecost-cost-analyzer-78f879f46f-mqg4m                   Running
kubecost                kubecost-grafana-5fcd9f86c6-psxxp                         Running

Verify the Karpenter provisioners deployed by this blueprint. We’ll discuss how these provisioners are used by the JupyterHub profiles to spin up specific nodes.

$ kubectl get provisioners
NAME         TEMPLATE
default      default
gpu-mig      gpu-mig
gpu-ts       gpu-ts
inferentia   inferentia
tranium      tranium

Verify the Persistent Volume Claims (PVCs) created by this blueprint, each serving a unique purpose. The Amazon EFS volume named efs-persist is mounted as the individual home directory for each JupyterHub single-user pod, which ensures a dedicated space for each user. In contrast, efs-persist-shared is a special PVC that is mounted across all JupyterHub single-user pods, facilitating collaborative notebook sharing among users. Alongside these, additional Amazon EBS Volumes have been provisioned to robustly support JupyterHub, Kube Prometheus Stack, and KubeCost deployments.

kubectl get pvc -A
NAMESPACE               NAME                                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
jupyterhub              efs-persist                                           Bound    efs-persist                                123Gi      RWX                           17h
jupyterhub              efs-persist-shared                                    Bound    efs-persist-shared                         123Gi      RWX                           17h
jupyterhub              hub-db-dir                                            Bound    pvc-208f33ab-2a23-4330-adf7-e49cd5309326   50Gi       RWO            gp3            13h
kube-prometheus-stack   data-prometheus-kube-prometheus-stack-prometheus-0    Bound    pvc-9cf017e0-de99-4dea-822d-6b55472efacf   100Gi      RWO            gp3            13h
kubecost                kubecost-cost-analyzer                                Bound    pvc-4136368b-3261-4a80-8a08-855fc5e1bc06   32Gi       RWO            gp3            17h
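For context, the home and shared EFS volumes are typically wired into the JupyterHub Helm values through KubeSpawner’s static storage and extra volume settings. The excerpt below is an illustrative sketch of that structure, not the blueprint’s exact values file.

# Assumed structure for mounting the EFS-backed PVCs into every notebook pod.
singleuser:
  storage:
    type: static
    static:
      pvcName: efs-persist              # per-user home directories live on this PVC
      subPath: "home/{username}"        # each user gets a dedicated sub-path as $HOME
    extraVolumes:
      - name: shared
        persistentVolumeClaim:
          claimName: efs-persist-shared # one PVC mounted into all user pods
    extraVolumeMounts:
      - name: shared
        mountPath: /home/shared         # common directory for collaborative notebooks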

JupyterHub as a platform, working with distinct single-user profiles

Within this blueprint, we’ve created a range of profiles, each tailored to meet the diverse needs of users. These profiles harness Karpenter’s capabilities to automatically scale nodes to accommodate JupyterHub pods when a specific profile is selected. It’s important to highlight that these profiles are fully customizable, offering the flexibility to tailor settings, resources, and permissions to meet the unique needs of your organization.

You can seamlessly update and introduce additional profiles to the JupyterHub Helm Chart values file, tailoring them to your organization’s unique requirements. Each profile can be fine-tuned with dedicated nodeSelectors and tolerations, precisely triggering the corresponding Karpenter provisioner. Moreover, you have the capability to include multiple Docker images for each profile using profile options. This addition enables users to select the most suitable Docker image version for their specific experiments.
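As a concrete illustration, a profile entry in the Helm values typically looks like the sketch below. The image names, node labels, and tolerations are placeholders rather than the blueprint’s exact settings.

singleuser:
  profileList:
    - display_name: "Data Engineering (CPU)"
      description: "PySpark or Scala Spark on CPU nodes"
      profile_options:
        image:
          display_name: "Image"
          choices:
            pyspark:
              display_name: "PySpark"
              kubespawner_override:
                image: jupyter/pyspark-notebook:latest      # placeholder image
            scala-spark:
              display_name: "Scala Spark"
              kubespawner_override:
                image: jupyter/all-spark-notebook:latest    # placeholder image
      kubespawner_override:
        node_selector:
          provisioner: default              # assumed label targeting the default provisioner
        tolerations:
          - key: "hub.jupyter.org/dedicated"   # example toleration; align with your node taints
            operator: "Equal"
            value: "user"
            effect: "NoSchedule"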

UI depicting JupyterHub profiles

Data Engineering (CPU): This profile is purpose-built for Data Engineers to develop scripts and process data with PySpark or Scala Spark profile options. This profile uses Karpenter’s default provisioner to dynamically scale the nodes based on demand.

Trainium (Trn1): Designed for ML engineers focused on training and fine-tuning models on AWS Trainium instances, this profile uses a dedicated Karpenter provisioner named trainium that efficiently manages autoscaling.

Inferentia (Inf2): Tailored for tasks requiring Inferentia2 instances, this profile benefits from a dedicated Karpenter provisioner named inferentia that efficiently manages autoscaling.

Data Science (GPU + Time-Slicing – g5): For Data Scientists who need GPU access for shorter tasks, this profile introduces time-sliced GPU resources. Time-slicing enables multiple users to effectively share a single GPU for smaller tasks, with the gpu-ts provisioner handling autoscaling tailored to time-slicing requirements.

Data Science (GPU + MIG on p4d.24xlarge): Ideal for advanced Data Science tasks demanding specialized GPU configurations, this profile leverages MIG (Multi-Instance GPUs). MIG empowers a single GPU to be divided into multiple smaller instances, each allocated to distinct tasks. It’s important to note that for MIG support, nodes should be pre-created, and we currently utilize Managed node groups until Karpenter dynamically supports MIG requests.

Data Science (GPU – P4d.24xlarge): Tailored to the requirements of Data Scientists in need of substantial GPU resources, this profile provides full access to a p4d.24xlarge GPU instance. The gpu provisioner, triggered by Karpenter, efficiently manages autoscaling to meet the high-resource demands of this profile.
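To show how these profiles map onto node provisioning, the following is an illustrative Karpenter Provisioner for the time-slicing profile, using the v1alpha5 API that matches the kubectl get provisioners output above. The instance types, labels, and taints are assumptions, not the blueprint’s exact definitions.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-ts
spec:
  labels:
    provisioner: gpu-ts                  # label that a profile's nodeSelector can target
  taints:
    - key: nvidia.com/gpu
      effect: NoSchedule                 # keep non-GPU workloads off these nodes
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g5.2xlarge", "g5.4xlarge"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  providerRef:
    name: gpu-ts                         # AWSNodeTemplate holding subnet, AMI, and security group settings
  ttlSecondsAfterEmpty: 300              # scale the node back down once no notebooks remain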

By customizing profiles and Karpenter provisioners, you’ve already taken the first step toward optimizing your resource allocation and performance in JupyterHub’s Kubernetes environment. But what if you could take this a step further? What if you could allocate the exact amount of GPU power to each data science task, maximizing efficiency and lowering costs?

In the following sections, we’ll dive into the world of GPU MIG and time-slicing, exploring how it can maximize resource utilization and lower costs in your JupyterHub environment.

Ways to optimize GPUs using NVIDIA GPU Operator

Just as you’d optimize CPU and memory allocation through Karpenter provisioners, slicing GPUs offers another level of efficiency. One of the limitations of a traditional GPU setup is that the powerful resources it offers might not be fully utilized by a single user or application. This is where the concept of GPU slicing comes into play. Imagine a scenario where multiple Data Scientists are running their individual Jupyter Notebook instances on a Kubernetes cluster. Each of them is experimenting with models that need some GPU power but not the whole GPU. Allocating an entire GPU to each would be inefficient and costly.

NVIDIA MIG

Multi-Instance GPU (MIG) allows GPUs based on the NVIDIA Ampere architecture (such as the NVIDIA A100) to be securely partitioned into up to seven separate GPU instances for CUDA applications. Because MIG requires these newer GPU architectures, this solution can only be used with p4 and p5 instance types. NVIDIA provides two strategies for exposing MIG devices on a Kubernetes node: mixed and single. In this example we have configured mixed, which gives us the ability to slice GPUs in different ways with a single configuration.

nvidia-a100g: |-
  version: v1
  flags:
    migStrategy: mixed
  sharing:
    timeSlicing:
      resources:
        - name: nvidia.com/gpu
          replicas: 8
        - name: nvidia.com/mig-1g.5gb
          replicas: 2
        - name: nvidia.com/mig-2g.10gb
          replicas: 2
        - name: nvidia.com/mig-3g.20gb
          replicas: 3
        - name: nvidia.com/mig-7g.40gb
          replicas: 7

Time-slicing GPUs in Kubernetes

For some workloads, the high resource allocations of p4 and p5 instances—with their minimum of 96 vCPUs and 8 GPUs—may be excessive. In such scenarios, NVIDIA’s GPU Operator offers time-slicing capabilities for older GPU architectures. This method lacks the memory and fault isolation features found in Multi-Instance GPU (MIG) configurations but compensates by allowing more flexible sharing of resources. Essentially, time-slicing enables multiple workloads to share a single GPU by dividing its compute time among them.

nvidia-a10g: |-
  version: v1
  flags:
    migStrategy: none
  sharing:
    timeSlicing:
      resources:
        - name: nvidia.com/gpu
          replicas: 4

In this deployed configuration, we specify that any GPU device in the cluster should support up to four replicas. For instance, if you have a single G5 instance with just one GPU, then this time-slicing configuration allows that instance to support up to four individual Jupyter Notebook instances.
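To make this concrete, the sketch below shows how a time-slicing profile typically requests one GPU “replica” through KubeSpawner overrides; the labels and tolerations are illustrative assumptions.

singleuser:
  profileList:
    - display_name: "Data Science (GPU + Time-Slicing - G5)"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: "1"        # one time-sliced share; four such pods fit on a single physical GPU
        node_selector:
          provisioner: gpu-ts        # assumed label that triggers the gpu-ts Karpenter provisioner
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule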

GPU slicing is all about efficient resource allocation, enabling multiple users or applications to share a powerful GPU without wastage. But what happens after you’ve optimized your GPUs? How do you manage multiple data scientists who want to run experiments in the same environment? This leads us to our next discussion: Scaling JupyterHub on Amazon EKS with KubeSpawner.

Scaling JupyterHub on Amazon EKS with KubeSpawner

KubeSpawner allows JupyterHub to dynamically generate isolated user environments as Kubernetes pods, so multiple users can work in shared environments while maximizing resource efficiency and cost-effectiveness. You can integrate JupyterHub with your organization’s user or identity management system, which ensures that users log in with their existing credentials and that access can be controlled at the user or group level. JupyterHub also gives you the ability to create notebook instances in separate Kubernetes namespaces to achieve further isolation between multiple users if needed, and Kubernetes ResourceQuotas can be used to limit the amount of CPU and memory resources that each tenant can consume.
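For example, a per-tenant quota could look like the following sketch; the namespace name and limits are illustrative, not part of the blueprint.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: notebook-quota
  namespace: team-data-science        # assumed per-team namespace
spec:
  hard:
    requests.cpu: "40"                # total CPU all notebook pods in this namespace may request
    requests.memory: 160Gi
    requests.nvidia.com/gpu: "4"      # cap on GPU (or GPU time-slice) requests
    pods: "20"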

Now that we’ve covered the why, let’s dive into the how. Follow along to provision your first user environment in this scalable JupyterHub setup.

Setting up first user (User-1) environment

Exposing JupyterHub with port-forward: Execute the command below to make the JupyterHub service accessible for viewing the Web User Interface locally. It’s important to note that our current dummy deployment only establishes a Web UI service with a ClusterIP. Should you wish to customize this to an internal or internet-facing load balancer, you can make the necessary adjustments in the JupyterHub Helm chart values file.

kubectl port-forward svc/proxy-public 8080:80 -n jupyterhub

Sign-in: Navigate to http://localhost:8080/ in your web browser. Input user-1 as the username and choose any password.

JupyterHub's login screen

Select server options: Upon sign-in, you’ll be presented with a variety of Notebook instance profiles to choose from. For this time-slicing feature demonstration, we’ll be using the Data Science (GPU + Time-Slicing – G5) profile. Go ahead and select this option and choose the Start button.

JupyterHub Profiles

Wait for server initialization: The selection above triggers the Karpenter provisioner to launch a new g5.2xlarge instance, schedule the user-1 JupyterHub pod on it, and fetch the Docker image. Please note that this setup may take some time, potentially up to 9 minutes (less than 1 minute for node creation, up to 5 minutes for the NVIDIA GPU Operator to prepare the node, and an additional 3 minutes for Docker image retrieval). The NVIDIA GPU Operator manages the deployment of the required plugins and ensures the node is ready for GPU workloads.

Locate the pod: To find out which node your pod is running on, execute:

kubectl get pods -n jupyterhub -owide | grep -i jupyter-user

Examine node labels: Use this command to list the node labels, which include GPU information added by the NVIDIA GPU Operator:

kubectl get nodes <node-name> -ojson | jq '.metadata.labels'

The output of the command shows several GPU-related labels, such as nvidia.com/gpu.count: 1 and nvidia.com/gpu.replicas: 4. Even though the node (g5.2xlarge) has only one physical GPU, it can accommodate up to four pods that each request access to this GPU with nvidia.com/gpu: 1.

Setting up second user (User-2) environment

To demonstrate GPU time-slicing in action, we’ll provision another Jupyter Notebook instance. This time, we’ll validate that the second user’s pod is scheduled on the same node as the first user’s, taking advantage of the GPU time-slicing configuration we set up earlier. Follow the steps below to achieve this:

Open JupyterHub in an incognito browser window: Navigate to http://localhost:8080/ in a new incognito window. Input user-2 as the username and choose any password.

Choose server options: After logging in, you’ll see the server options page. Ensure that you select the Data Science (GPU + Time-Slicing – G5) radio button and select Start.

JupyterHub profiles for user-2

Verify pod placement: Notice that this pod placement takes only a few seconds, unlike user-1’s. This is because the Kubernetes scheduler can place the pod on the existing g5.2xlarge node created for the user-1 pod. user-2 also uses the same Docker image, so there is no image pull delay; the image is served from the node’s local cache.

Open a terminal and execute the following command to check where the new Jupyter Notebook pod has been scheduled:

kubectl get pods -n jupyterhub -owide | grep -i user

You should see output similar to this:

jupyter-user-2d1   1/1   Running   0   27m   100.64.187.215   ip-100-64-204-206.us-east-2.compute.internal   <none>   <none>
jupyter-user-2d2   1/1   Running   0   94s   100.64.167.20    ip-100-64-204-206.us-east-2.compute.internal   <none>   <none>

Observe that both the user-1 and user-2 pods are running on the same node (ip-100-64-204-206.us-east-2.compute.internal). This confirms that our GPU time-slicing configuration is functioning as expected.

You can execute the code below in Jupyter Notebook cells to confirm the username, the AWS IAM role Amazon Resource Name (ARN), the Docker image, and the CPU and memory requests. Additionally, you can validate the NVIDIA CUDA version by running !nvidia-smi.

from IPython.display import HTML, display
import os

# Read the metadata that JupyterHub/KubeSpawner injects into each user pod
username = os.getenv('JUPYTERHUB_USER')
aws_role_arn = os.getenv('AWS_ROLE_ARN')
jupyter_docker_image = os.getenv('JUPYTER_IMAGE')
cpu_request = os.getenv('CPU_GUARANTEE')
memory_request_gb = int(os.getenv('MEM_GUARANTEE')) / (1024 ** 3)

output = f"""
<div style="background-color: #f0f0f0; padding: 10px;">
    <p><strong>Current Username:</strong> {username}</p>
    <p><strong>AWS Role ARN:</strong> {aws_role_arn}</p>
    <p><strong>Docker Image:</strong> {jupyter_docker_image}</p>
    <p><strong>CPU Request:</strong> {cpu_request}</p>
    <p><strong>Memory Request:</strong> {memory_request_gb} GB</p>
</div>
"""

display(HTML(output))

Sample Jupyter Notebook showing GPU validation

Validate GPU access

In this part of our demonstration, we’ll see if the Jupyter Notebook instances have access to GPUs and identify a potential pitfall regarding memory usage. Follow these steps:

Download and upload notebook: Download the provided notebook, or open the code from GitHub, and run it in the JupyterHub sessions of both the user-1 and user-2 users.

Run cells in notebook: Open the notebook in each user’s session and execute the first two cells from the notebook.

First cell: Executes a simple matrix multiplication and compares the time it takes to perform this operation using CPU versus GPU. The graph illustrates the average time (in seconds) taken for matrix multiplication on both CPU and GPU across various matrix sizes, showing the performance difference between the two.

Second cell: Trains a basic neural network model using TensorFlow to assess its accuracy. Before executing this part, the notebook checks if TensorFlow has access to a GPU using the following code snippet:

import tensorflow as tf

print("TensorFlow is using GPU:", tf.test.is_gpu_available())

Address memory limitations: If you attempt to run the notebook in both sessions simultaneously, then you might encounter errors. This is because time-slicing doesn’t provide memory isolation, unlike NVIDIA’s MIG technology. By default, TensorFlow attempts to allocate nearly all available GPU memory, leading to potential conflicts.

To resolve this issue, you’ll need to set a memory limit manually:

  • Restart both kernels: Navigate to Kernel in the Jupyter Notebook menu and select Restart Kernel.
  • Add memory limit code: Create a new code cell in the notebook and insert the following lines:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=5120)])
    except RuntimeError as e:
        print(e)

This sets a memory limit of 5120 MB for TensorFlow, allowing both users’ notebooks to run simultaneously without conflict.

  • Re-run notebook: Execute all notebook cells again in both the user-1 and user-2 sessions. You should now observe that the cells execute without any errors.

By setting a specific memory limit, we ensure that multiple Jupyter Notebook instances can coexist on the same GPU thanks to time-slicing, without running into memory allocation issues.

Security considerations for JupyterHub

Ensuring the security of your JupyterHub environment is paramount, especially when dealing with sensitive data and complex workloads. Here are some key security considerations to keep in mind:

Namespace isolation and AWS IAM roles: JupyterHub profiles can be organized within specific namespaces. Each namespace can have its dedicated service account with Role Based Access Control (RBAC) and IAM roles attached. This granular control ensures that teams can only operate within their designated namespace with associated IAM roles, enhancing security and isolating resources.
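As an illustration of this pattern, the sketch below annotates a namespace-scoped service account with an IAM role through IAM Roles for Service Accounts (IRSA); the namespace, account ID, and role name are placeholders.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: notebook-sa
  namespace: team-data-science
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/team-data-science-notebooks   # placeholder role

Notebook pods can then be pointed at this service account, for example through the singleuser.serviceAccountName setting in a Zero to JupyterHub-style values file, so every pod in that namespace assumes only the permissions granted to that role.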

User authentication: To authenticate JupyterHub users securely, you can employ authentication methods such as AWS Cognito, use an OAuth provider such as GitHub, or integrate with your organization’s Lightweight Directory Access Protocol (LDAP) system. These robust authentication mechanisms help ensure that only authorized individuals gain access to your JupyterHub environment.
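For instance, a GitHub-based OAuth configuration in the Helm values might look like the hedged sketch below; the client ID, secret, callback URL, and organization are placeholders.

hub:
  config:
    JupyterHub:
      authenticator_class: github
    GitHubOAuthenticator:
      client_id: <github-oauth-client-id>
      client_secret: <github-oauth-client-secret>
      oauth_callback_url: https://jupyterhub.example.internal/hub/oauth_callback   # placeholder domain
      allowed_organizations:
        - example-org        # restrict sign-in to members of this GitHub organization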

Domain isolation: For additional security, consider exposing your JupyterHub domain internally using a local domain name. You can achieve this through Route53 and ELB configurations. By isolating your domain, you reduce the risk of unauthorized external access.
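One way to do this, sketched below under assumed annotation values, is to expose the proxy service through an internal load balancer managed by the AWS Load Balancer Controller and pair it with a record in a private Route53 hosted zone.

proxy:
  service:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-scheme: internal          # VPC-only, not internet-facing
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip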

Regular updates and patching: Keep all components of your JupyterHub environment up to date with the latest security patches. Regularly update JupyterHub, its dependencies, and the underlying infrastructure to mitigate known vulnerabilities.

Check out the security section of the JupyterHub documentation to validate the security controls for your environment. By implementing these security measures and remaining vigilant, you can create a robust and secure JupyterHub environment that protects your data and fosters a safe collaborative workspace for your teams.

Monitoring and cost management of notebooks

When vending Jupyter Notebooks, effective monitoring plays a pivotal role in ensuring the reliability, reproducibility, and security of data-science projects. Data scientists and organizations can benefit from real-time tracking of notebook execution to identify bottlenecks, optimize code, and improve overall efficiency. This leads to faster model development and deployments. Monitoring helps identify issues such as data inconsistencies, resource exhaustion, or system crashes and reduces the chances of costly mistakes. As data science projects scale, monitoring helps identify when and where to allocate additional resources, ensuring projects remain responsive and efficient.

To help with this, we have included the kube-prometheus-stack and Kubecost as add-ons in this blueprint. Together, these add-ons help monitor the resource and cost efficiency of your notebooks.

Both the NVIDIA GPU Operator and JupyterHub projects provide Grafana dashboards that can be easily imported into the Grafana instance running on the cluster.

Obtain the administrator password stored in AWS Secrets Manager by the blueprint, and use port-forwarding to forward the Grafana UI to localhost port 3000.

aws secretsmanager list-secrets --region us-west-2
{
    "SecretList": [
        {
            "ARN": "arn:aws:secretsmanager:us-west-2:XXXXXXXXX:secret:jupyterhub-on-eks-grafana-20230927031315869300000001-H0PuIH",
            "Name": "jupyterhub-on-eks-grafana-20230927031315869300000001",
            "LastChangedDate": "2023-09-26T20:13:16.674000-07:00",
            "LastAccessedDate": "2023-10-01T17:00:00-07:00",
            "SecretVersionsToStages": {
                "498CE1A0-71F1-4DAB-9489-0AD47198222A": [
                    "AWSCURRENT"
                ]
            },
            "CreatedDate": "2023-09-26T20:13:16.249000-07:00"
        }
    ]
}

aws secretsmanager get-secret-value \
  --secret-id jupyterhub-on-eks-grafana-20230927031315869300000001 \
  --region us-west-2 | jq -r .SecretString

kubectl port-forward svc/kube-prometheus-stack-grafana \
  -n kube-prometheus-stack 3000:80

Copy the password returned above and open http://localhost:3000 to bring up the Grafana login screen.

Grafana login screen

Once logged in, simply import the NVIDIA Dashboard with id 12239 from the Dashboard tab, as shown in the following diagram.

Grafana dashboard import UI

Detailed instructions for the NVIDIA GPU Operator monitoring setup can be found here. Once loaded, the dashboard appears under NVIDIA DCGM Exporter Dashboard.

Grafana GPU Operator dashboard

The Grafana dashboards for JupyterHub are provided by the project in this GitHub repository. You’ll need to create a Service Account token with Admin privileges, as described in the Grafana documentation. Then, simply clone the repo and run the provided ./deploy.py script to provision the dashboards, as shown in the following code.

# Clone repo
git clone https://github.com/jupyterhub/grafana-dashboards.git

# Generate a Grafana service account token from the UI, then
export GRAFANA_TOKEN=XXXXXXXXXXXXXXX

# Run the provided deploy.py
./deploy.py "http://localhost:3000" --no-tls-verify
Deployed dashboards/user.jsonnet
Deployed dashboards/cluster.jsonnet
Deployed dashboards/usage-report.jsonnet
Deployed dashboards/jupyterhub.jsonnet
Deployed dashboards/support.jsonnet

You can then view the dashboard under the JupyterHub Default Dashboards folder under Dashboards.

Grafana JupyterHub dashboard

Kubecost provides a granular breakdown of expenses at both the individual pod level and the namespace level. This means you can track the cost of resources consumed by each specific JupyterHub pod, gaining insights into how much each user or team is utilizing. You can access Kubecost UI by forwarding the port for the kubecost-cost-analyzer service as shown and opening http://localhost:9090 in the browser.

kubectl port-forward svc/kubecost-cost-analyzer \
  -n kubecost 9090:9090

Kubecost UI

Cleaning up

First, stop any running notebook pods using the JupyterHub UI. Then, to destroy and clean up all the infrastructure created in this post, simply run the provided ./cleanup.sh script.

./cleanup.sh 

Conclusion

In this post, we showed you how to establish a scalable JupyterHub environment on Amazon EKS, designed to meet the needs of data science and machine learning teams. The platform allows multiple users to work in distinct yet collaborative environments. We also addressed the versatile GPU options available, from time-slicing to profiles that support MIG. Utilizing Karpenter simplifies resource management, which enables you to focus more effectively on your projects. We have also seen how the Data on EKS project is used to deploy the blueprint and reduce the time to build and deploy the end-to-end solution for your proof of concept (PoC) or production workloads, accelerating your data-driven pursuits.

In summary, JupyterHub on Amazon EKS offers a variety of profiles and the streamlined resource management of Karpenter, making it an efficient and cost-effective solution for your data science and machine learning initiatives.

References

https://aws.amazon.com/blogs/mt/integrating-kubecost-with-amazon-managed-service-for-prometheus/
