At UC Berkeley, Filestore supercharges one of the largest JupyterHub deployments in U.S. higher ed


For researchers, students, and developers who need to work together on complex projects, JupyterHub has become an essential tool for collaborative data science, bringing the power of Jupyter Notebooks to groups of users. Its ability to manage multiple user environments and provide access to shared resources has revolutionized the way that data science instruction is done at scale.

But as anyone who has worked on a large JupyterHub deployment will tell you, these benefits come with some challenges. As deployments grow, managing file storage for diverse users and computationally intensive tasks quickly becomes a critical concern. The need for reliable, scalable, and performant storage solutions is key to ensuring smooth operation and efficient workflows. 

In what may be the largest such deployment in United States higher education, UC Berkeley uses JupyterHub as a collaborative data science environment for students, instructors, faculty, staff and researchers. Their deployment, Datahub, is a heavily customized Zero to JupyterHub deployment with over 15 hubs that see 15,000 users, and span 100+ courses and 35+ departments. Uptime and availability are paramount, of course, as coursework and projects have deadlines and quizzes and exams are bound by the academic calendar. 

When Google and UC Berkeley first started talking, UC Berkeley was incredibly upfront about the challenges of supporting such a large and active user base, especially with limited resources. Like many universities, they faced budgetary constraints that made it difficult to staff a large IT team. In fact, they were managing this massive JupyterHub deployment with a lean team of just two full-time staffers, supplemented by dedicated volunteers and part-time contributors. 

It quickly became clear that their existing infrastructure, which relied on self-managed user home directories mounted on a self-managed NFS service hosted on Google Compute Engine, was struggling to keep pace. Their growing user base needed a more integrated and reliable experience, which meant finding a solution that could handle increased demand without compromising on performance or ease of use. As a leading research institution, they also needed to balance cross-departmental instruction goals with the realities of their limited IT budgets. 

That’s where Google Cloud’s Filestore, a managed NFS storage service, came in. By sharing UC Berkeley's journey to Filestore, we hope to provide valuable insights and practical guidance for anyone navigating similar challenges in their own endeavors.

Why Filestore?

When Shane joined the team in October 2022, it was in near-constant crisis mode. In the middle of the semester, an influx of new Datahub users pushed the existing GKE architecture to its limits. To make matters worse, the self-managed NFS service was constantly crashing under the load. 

The team had taken steps to address the GKE performance issues by re-architecting the setup to isolate specific course hubs and JupyterHub support infrastructure into their own node pools. This helped to improve performance for those users, but the underlying storage issues persisted. The self-managed NFS service had become a critical point of failure. To keep things running, the team had put a band-aid in place: a systemd timer that automatically restarted the NFS service every 15 minutes. 

While this prevented complete outages, the self-managed infrastructure was still struggling to keep up. At the same time, the user base kept growing rapidly, the workloads were becoming more demanding, and the budget simply couldn't stretch to accommodate the constant need for more servers and storage. They needed a more efficient and cost-effective solution.

That's when they reached out to Google Cloud, connecting with the Filestore team. Within an hour, the team at UC Berkeley was convinced that Filestore was the right solution. They were particularly interested in the Filestore Basic HDD tier, which offered the flexibility of instance sizing and a price point that aligned with their budget. 

Before diving into UC Berkeley’s transition, it’s worth mentioning there are three Filestore tiers — Basic, Zonal, and Regional — and choosing between them isn't always a simple decision. Basic instances deliver solid performance, but have capacity management restrictions (you can’t scale capacity down). Zonal instances deliver blazingly fast performance, which is great for latency-sensitive data science instructional workloads. But, as the name suggests, they're tied to a single zone within a region. If that zone experiences an outage, the workloads could be impacted. Filestore Regional, on the other hand, synchronously replicates data across three zones within a region, so that it is protected if one zone goes down. The trade-off between the three? Performance, cost, storage management flexibility, and storage SLA. Deciding between the three means weighing performance against your tolerance for downtime. And of course, budget and capacity requirements also will play a big role in the decision.

Transitioning from DIY NFS to Filestore

Once they had a good understanding of Filestore, Shane and his team were eager to put it to the test. They spun up a demo deployment, connecting a Filestore instance to one of their smaller JupyterHub environments. Shane, being the hands-on Technical Lead that he is, dove right in, even running some bonnie++ benchmarks from within a single user server notebook pod to really push the system.

```yaml
nfsPVC:
  nfs:
    shareName: shares/datahub/prod

jupyterhub:
  ingress:
    enabled: true
    hosts:
      - datahub.berkeley.edu
    tls:
      - secretName: tls-cert
        hosts:
          - datahub.berkeley.edu
  proxy:
    https:
      enabled: false
  hub:
    db:
      pvc:
        # This also holds logs
        storage: 80Gi
    resources:
      requests:
        # DataHub often takes up a full CPU now, so let's guarantee it at least that
        cpu: 1
        memory: 1Gi
      limits:
        memory: 2Gi
  scheduling:
    userPlaceholder:
      enabled: false
```

UC Berkeley’s YAML file for the nfsPVC deployment with a Filestore volume. The full YAML file is available on GitHub.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: PV_NAME
spec:
  storageClassName: ""
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
  csi:
    driver: filestore.csi.storage.gke.io
    volumeHandle: "modeInstance/FILESTORE_INSTANCE_LOCATION/FILESTORE_INSTANCE_NAME/FILESTORE_SHARE_NAME"
    volumeAttributes:
      ip: FILESTORE_INSTANCE_IP
      volume: FILESTORE_SHARE_NAME
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: podpvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: PV_NAME
  resources:
    requests:
      storage: 1Ti
```

Creating a Filestore PersistentVolume and PersistentVolumeClaim

The results were impressive. Even with Shane intentionally pushing the limits of the test instance, other users on the hub experienced no noticeable impact. They were able to open and run notebooks, download datasets, and generally work without interruption. It was a real testament to Filestore's ability to isolate workloads and ensure consistent performance, even under demanding conditions.

This gave UC Berkeley the confidence to move forward with a larger-scale deployment. They were particularly impressed by Filestore’s read/write/latency performance, which met or exceeded their expectations. Filestore Regional and Zonal support scaling storage capacity down, which the Basic tier does not, but they weren’t the right fit for UC Berkeley’s needs due to cost and storage capacity considerations.

After deciding on Filestore Basic, Shane and his team had to act fast. They had a hard deadline: the start of the Spring 2023 semester was just a few short weeks away. This meant a complete redeployment of their JupyterHub environment on GKE, with Filestore as the foundation for their storage needs. Careful planning and efficient execution were critical.

Shane and his team had some important decisions to make. First up: how to structure their Filestore deployment. Should they create a shared instance for multiple hubs, or give each hub its own dedicated instance?

Given the scale of Datahub and the critical importance of uptime, they decided to err on the side of caution – a decision undoubtedly influenced by their past experiences with storage-related outages. They opted for a one-to-one ratio of Filestore instances to JupyterHub deployments, effectively over-provisioning to maximize performance and reliability. They knew this would come at a higher cost, but they planned to closely monitor storage usage and consolidate low-usage hubs onto shared instances after the Spring 2023 semester.

The next challenge was to determine the appropriate size for each Filestore instance. Without historical data to guide them, they had to make some educated guesses. Since Datahub is designed for flexibility, they couldn’t easily enforce user storage quotas – a common challenge with JupyterHub deployments.

They turned to what data they did have, reviewing usage patterns from previous semesters where user data was archived to Cloud Storage. After some back-of-the-napkin calculations, they settled on a range of instance sizes from 1TB to 12TB, again leaning towards over-provisioning to accommodate potential growth.
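The article doesn't give the actual numbers behind those estimates, but the arithmetic can be sketched as follows. This is a minimal illustration, assuming a hypothetical 50% headroom factor and made-up per-hub archive sizes; only the 1TB to 12TB range comes from the text above.

```python
import math

TIB = 1024 ** 4  # bytes in one tebibyte


def suggest_instance_size_tib(archived_bytes, headroom=0.5, min_tib=1, max_tib=12):
    """Return a whole-TiB instance size for one hub.

    archived_bytes: bytes archived for the hub in a previous semester.
    headroom: extra fraction to over-provision (0.5 means +50%); an assumption.
    min_tib / max_tib: bounds mirroring the 1TB-12TB range in the article.
    """
    if archived_bytes < 0:
        raise ValueError("archived_bytes must be non-negative")
    needed_tib = archived_bytes * (1 + headroom) / TIB
    # Round up to a whole TiB, then clamp into the allowed range.
    return max(min_tib, min(max_tib, math.ceil(needed_tib)))


# Hypothetical per-hub archive sizes, for illustration only.
archived = {
    "small-course-hub": 200 * 1024 ** 3,  # 200 GiB
    "main-hub": 6 * TIB,                  # 6 TiB
}
sizes = {hub: suggest_instance_size_tib(b) for hub, b in archived.items()}
print(sizes)
```

Note that clamping to the upper bound hides any hub whose estimate exceeds it, so a real sizing pass would probably want to flag those cases rather than silently cap them.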

Once the fall semester ended and they’d archived user data, the real work began. They created the Filestore instances, applied the necessary configurations (including NFS export and ROOT_SQUASH options), and even added GKE labels to track costs effectively — gotta love a bit of cost optimization!

With the data in place, it was time for the final switchover. They updated their JupyterHub configurations to point to the new Filestore instances, deleted the remnants of their old NFS setup, and with a mix of anticipation and relief, relaunched Datahub.

Managing Filestore

Since migrating to Filestore, Shane and his team at UC Berkeley have enjoyed a level of stability and performance they hadn’t thought possible. In their words, Filestore has become a "deploy-and-forget" service for them. They haven’t experienced a single minute of downtime, and their users — those thousands of students depending on Datahub — haven’t reported any performance issues.

At the same time, their management overhead has been dramatically reduced. They’ve set up a few simple Google Cloud alerts that integrate with their existing PagerDuty system, notifying them if any Filestore instance reaches 90% capacity. However, these alerts are rare, and scaling up storage when needed is straightforward.
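The alert policy itself isn't shown in the article. The sketch below only illustrates the 90% threshold check on plain numbers; it does not call Cloud Monitoring or PagerDuty, and the instance names and usage figures are hypothetical.

```python
def instances_over_threshold(usage, threshold=0.90):
    """Return names of instances whose used/capacity ratio is at or above threshold.

    usage maps instance name -> (used, capacity), both in the same unit.
    """
    flagged = []
    for name, (used, capacity) in sorted(usage.items()):
        if capacity <= 0:
            raise ValueError(f"capacity for {name} must be positive")
        if used / capacity >= threshold:
            flagged.append(name)
    return flagged


# Hypothetical usage in GiB: (used, capacity).
usage = {
    "hub-a-fs": (950, 1024),
    "hub-b-fs": (300, 1024),
    "hub-c-fs": (922, 1024),
}
flagged = instances_over_threshold(usage)
print(flagged)
```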

To further optimize their usage and control costs, they’ve implemented a simple but effective strategy. At the end of each semester, they archive user data to Cloud Storage and then right-size their Filestore instances based on usage patterns. They either create smaller instances or consolidate hubs onto shared instances, ensuring they only pay for the storage they need. Rsync remains their trusty sidekick for migrating data between instances — a process that, while time-consuming, has become a routine part of their workflow.
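The exact rsync invocation the team uses isn't given. As a rough sketch, a migration between two mounted instances might be driven by a small helper like the one below. The mount paths are hypothetical and the flag selection is an assumption. The helper only builds the command and defaults to a dry run.

```python
import shlex


def build_rsync_command(src_mount, dst_mount, dry_run=True):
    """Build an rsync argv list that copies one mounted share into another."""
    # --archive keeps permissions, ownership and timestamps;
    # --numeric-ids avoids remapping UIDs/GIDs by name between hosts.
    cmd = ["rsync", "--archive", "--hard-links", "--numeric-ids"]
    if dry_run:
        cmd.append("--dry-run")
    # A trailing slash on the source makes rsync copy the directory's
    # contents rather than the directory itself.
    cmd += [src_mount.rstrip("/") + "/", dst_mount.rstrip("/") + "/"]
    return cmd


# Hypothetical mount points for the old and new instances.
cmd = build_rsync_command("/mnt/old-filestore/homes", "/mnt/new-filestore/homes")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` with `dry_run=False` would perform the actual copy.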

The good, the challenging, and the (occasionally) unpredictable - Shane’s version

When reflecting on UC Berkeley’s Filestore journey, Shane didn’t sugarcoat things. They learned a lot, and not everything has been easy. In the spirit of transparency, here’s a breakdown of the experience, in Shane’s own words, into the good, the challenging, and the (occasionally) unpredictable.

The good

Nothing beats peace of mind – especially in the middle of a semester. Moving to Filestore has been a game changer, allowing the team to trade midnight debugging sessions for restful nights of sleep. No more frantic calls about crashed servers or rescheduled exams — Filestore’s uptime has been rock-solid, and its performance at the Basic tier has been more than enough to keep pace with our users.

And as we dug deeper into Filestore, we discovered even more ways to optimize our setup and improve UC Berkeley operations:

Sharing is caring (and cost-effective!): We found opportunities to consolidate hubs with smaller storage requirements onto shared instances, for greater cost-savings. 

Right-sizing is key: We’ve become pros at aggressively resizing and adding storage only when needed. 

Exploring Filestore Multishare CSI driver: We’re actively looking at the Filestore Multishare capability to streamline our ability to scale storage capacity up and down, and at any potential cost deltas. This may save us further time and effort compared to our current Filestore deployment, but we are currently unable to use it because we are on the Basic HDD tier.

Empowering our faculty: We’re working closely with faculty and instructors to help them educate students about data management best practices, and giving them friendly reminders that downloading only megabytes of data (as opposed to terabytes) can be really impactful.

Smarter archiving: We’re continually analyzing our storage metrics and usage behavior to optimize our archiving processes. The goal is to archive only what’s necessary, when it’s necessary. 
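The consolidation idea from the list above can be sketched as a simple packing problem. The article doesn't describe how the team groups hubs; the first-fit-decreasing approach, the 80% planning limit, and the hub names and sizes below are all assumptions for illustration.

```python
def consolidate_hubs(hub_usage_gib, instance_capacity_gib, max_fill=0.8):
    """Group hubs onto shared instances using first-fit decreasing.

    hub_usage_gib: hub name -> observed usage in whole GiB.
    instance_capacity_gib: capacity of each shared instance in GiB.
    max_fill: fraction of capacity to plan for, leaving headroom for growth.
    Returns a list of lists of hub names, one inner list per instance.
    """
    budget = int(instance_capacity_gib * max_fill)
    instances = []  # each entry: [remaining_budget, [hub names]]
    # Place the largest hubs first; break ties by name for stable output.
    for hub, used in sorted(hub_usage_gib.items(), key=lambda kv: (-kv[1], kv[0])):
        if used > budget:
            raise ValueError(f"{hub} needs a dedicated, larger instance")
        for inst in instances:
            if used <= inst[0]:
                inst[0] -= used
                inst[1].append(hub)
                break
        else:
            # No existing instance has room, so open a new one.
            instances.append([budget - used, [hub]])
    return [hubs for _, hubs in instances]


# Hypothetical low-usage hubs, in GiB.
usage = {"hub-a": 500, "hub-b": 200, "hub-c": 400, "hub-d": 100}
groups = consolidate_hubs(usage, instance_capacity_gib=1024)
print(groups)
```

First-fit decreasing is a heuristic, not an optimal packing, and it ignores per-hub performance needs, so the result is a starting point for a human decision rather than a final plan.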

The challenging

That’s not to say there are no drawbacks. Filestore isn't exactly a budget-friendly option. Our cloud storage costs did go up, and while we’ve managed to mitigate some of that increase through our optimization efforts, there’s no denying the price tag. However, an increase in cloud costs is well worth the collective sanity of our team!

One thing we're still grappling with is the lack of easy down-scaling in Filestore Basic. It's not that it's technically difficult, but manually resizing instances does take some time and can disrupt our users, which we obviously want to avoid. At the same time, we're getting better at forecasting our storage needs, and the tips we've outlined — especially around right-sizing — have made a huge difference. But having a more streamlined way to scale down on demand would be a huge win for us. It could save us thousands of dollars each month — money we could redirect towards other critical resources for our students and faculty.

The unpredictable

Data science is, by its very nature, data-intensive. One of our biggest ongoing challenges is predicting just how much storage our users will need at any given time. We have thousands of students working on a huge variety of projects, and sometimes those projects involve datasets that are, well, massive. It's not uncommon for us to see a Filestore instance grow by terabytes in a matter of hours.

This unpredictable demand creates a constant balancing act. We want to make sure our SRE team isn't getting bombarded with alerts, but we also don't want to overspend on storage we might not need. It's a delicate balance, and we often err on the side of caution — making sure our users have the resources they need, even if it means higher costs in the short term.

Filestore now makes up about a third of our total cloud spend. So while we're committed to making it work, we're constantly looking for ways to optimize our usage and find that sweet spot between performance, reliability, and cost.

In conclusion

UC Berkeley's journey highlights a critical lesson for anyone deploying large-scale pedagogical platforms as force-multipliers for instruction: as JupyterHub deployments grow in complexity and scale, so too do the demands on supporting infrastructure. Achieving success requires finding solutions that are not just technically sound but also financially sustainable. Despite challenges like a higher price tag, a slight learning curve with Filestore Basic, and some missing automation tools, Filestore proved to be that solution for Datahub, providing a powerful combination of performance, reliability, and operational efficiencies, and empowering the next generation of data scientists, statisticians, computational biologists, astronomers, and innovators.

Are you looking to improve your JupyterHub deployment? Learn more about Filestore and GKE here.
