Data science is a growing profession. It offers more opportunities than ever, but it also brings more complications. Standards and expectations are changing rapidly, especially with regard to the technology used to build data science projects.
Many data scientists now work with some form of DevOps tooling, and Kubernetes is one of the most popular options. Kyle Gallatin recently recorded a Kubernetes tutorial that was presented at the New York City Data Science Academy, which illustrates the importance of this platform for his profession.
There are a lot of important nuances for data scientists using Kubernetes. One of the most important is the adoption of serverless Kubernetes.
In this post, we will look at how serverless is changing the traditional Kubernetes architecture. However, we will first address the benefits of Kubernetes in data science.
Benefits of Kubernetes for Data Science
A Kubernetes cluster is built around a control node combined with multiple worker nodes. Workloads are distributed to the worker nodes and managed by the control node. With the emergence of serverless technologies, there is growing interest in using serverless within Kubernetes, both to manage workloads and to provide the cluster itself.
It is easy to see why data scientists can benefit from this platform. Bob Laurent, Senior Director at Domino Data Lab, has talked about some of the biggest reasons. He points out that Kubernetes allows scalable access to GPUs and CPUs and helps with infrastructure abstraction. These features make data science projects scalable, cost-effective and easier to manage.
Kubernetes is clearly a useful tool for data scientists. Once that is understood, the next step is to look at what changes when it is used in a serverless environment.
First of all, it is important to dispel a misconception. Serverless does not mean the absence of servers. It means that the server is abstracted to the point where users do not need to consider how their applications are executed. You only have to provide your packaged application or a container, and the serverless platform manages all the underlying infrastructure considerations. This means it can still be used to handle data projects at different levels of your infrastructure.
Even with all the advantages Kubernetes brings, users still need to manage the underlying servers. Managed K8s services reduce this burden somewhat, but they do not eliminate servers from the equation. The provider manages the control plane, yet you still have to provision and manage worker nodes for the various data science projects you are working on.
A serverless implementation such as AWS Fargate eliminates the need for data scientists to manage worker nodes and moves the workloads into a serverless architecture. This approach shifts the responsibility for server (node) management from the user to the service provider. Serverless can also reduce costs, as users only pay for the resources they use. Furthermore, it avoids overprovisioning while keeping the flexibility to scale as needed.
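The pay-for-what-you-use argument can be made concrete with some simple arithmetic. The sketch below compares an always-on worker node with a pay-per-use model. The function names and every price and duration in it are hypothetical values chosen for illustration, not real quotes from any provider.

```python
def provisioned_cost(node_hourly_rate: float, hours_provisioned: float) -> float:
    """Cost of a worker node that is billed for every hour it exists."""
    return node_hourly_rate * hours_provisioned


def pay_per_use_cost(vcpu_hourly_rate: float, vcpus: int, hours_used: float) -> float:
    """Cost when billing covers only the vCPU-hours a workload actually consumes."""
    return vcpu_hourly_rate * vcpus * hours_used


# Hypothetical scenario: a training job needs 4 vCPUs for 60 hours per month.
# All rates below are made-up numbers for the sake of the comparison.
always_on = provisioned_cost(node_hourly_rate=0.40, hours_provisioned=720)
on_demand = pay_per_use_cost(vcpu_hourly_rate=0.05, vcpus=4, hours_used=60)

print(f"always-on node: {always_on:.2f}")
print(f"pay-per-use:    {on_demand:.2f}")
```

The gap narrows as utilization rises, so a workload that keeps a node busy around the clock may not see the same saving.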
Each worker node runs an agent called the kubelet, which connects the node to the Kubernetes API. When a user interacts with the Kubernetes API via kubectl commands, the kubelet lets each node receive instructions from the API on how to manage the pods on that node. The kubelet works from PodSpecs, the descriptions of the pods a node should run, and manages the underlying pods for as long as it is running on the server and connected to the K8s API.
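To make the idea of a PodSpec more tangible, here is a minimal sketch that builds a Pod manifest as a plain Python dictionary. The helper name build_pod_manifest, the pod name, the image URL and the resource figures are all assumptions made up for this example.

```python
import json


def build_pod_manifest(name: str, image: str, cpu: str, memory: str) -> dict:
    """Return a minimal Pod manifest; its "spec" field is the PodSpec."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Resource requests tell the scheduler how much capacity to reserve.
                    "resources": {"requests": {"cpu": cpu, "memory": memory}},
                }
            ],
            # A one-off training job should not be restarted after it exits.
            "restartPolicy": "Never",
        },
    }


# Hypothetical image and sizes, for illustration only.
manifest = build_pod_manifest("train-model", "example.com/ds/train:latest", "2", "4Gi")
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be saved to a file and submitted with kubectl apply -f, after which the kubelet on the chosen node would act on the PodSpec.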
This opens a lot of doors for data scientists trying to boost scalability and customize their projects. The biggest benefit in data science projects boils down to virtualization.
In a serverless setting, this functionality is typically emulated by a virtual kubelet. The Kubernetes API recognizes the virtual kubelet implementation as a node within the cluster. However, the virtual kubelet schedules containers elsewhere, typically in supported backends such as AWS Fargate, AWS Batch and HashiCorp Nomad. Although users can interact with the K8s cluster in the usual way, the underlying containers are scheduled in serverless container services. Thus, with this implementation, users can gain the advantages of serverless without sacrificing the functionality of Kubernetes. The best part of a virtual kubelet is that it even allows for mixed configurations, where actual worker nodes and virtual kubelets coexist within a cluster.
In a non-serverless setting, users create the container and then configure K8s manifests and resources to deploy and run the application within the cluster. Additionally, they have to configure scaling and preconfigure resource utilization. For a serverless implementation, there are two approaches: Container as a Service (CaaS) and Function as a Service (FaaS).
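In a mixed cluster, a pod usually has to opt in to running on the virtual node. The following sketch shows one way that could look, by adding a node selector and a toleration to a pod manifest. The label type: virtual-kubelet and the taint key virtual-kubelet.io/provider follow a convention used by some virtual kubelet providers, but they are assumptions here; the exact labels and taints depend on the provider and on how the virtual node was registered.

```python
import copy

# Assumed label and taint for the virtual node; check your provider's values.
VIRTUAL_NODE_SELECTOR = {"type": "virtual-kubelet"}
VIRTUAL_NODE_TOLERATION = {"key": "virtual-kubelet.io/provider", "operator": "Exists"}


def target_virtual_node(pod_manifest: dict) -> dict:
    """Return a copy of the manifest that is steered onto the virtual node."""
    pod = copy.deepcopy(pod_manifest)  # leave the caller's manifest untouched
    spec = pod.setdefault("spec", {})
    # The node selector restricts scheduling to nodes carrying the label.
    spec.setdefault("nodeSelector", {}).update(VIRTUAL_NODE_SELECTOR)
    # The toleration lets the pod land on a node that is tainted by default.
    tolerations = spec.setdefault("tolerations", [])
    if VIRTUAL_NODE_TOLERATION not in tolerations:
        tolerations.append(dict(VIRTUAL_NODE_TOLERATION))
    return pod


# Hypothetical pod, for illustration only.
sample_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "batch-scoring"},
    "spec": {"containers": [{"name": "scorer", "image": "example.com/ds/score:latest"}]},
}
virtual_pod = target_virtual_node(sample_pod)
print(virtual_pod["spec"]["nodeSelector"], virtual_pod["spec"]["tolerations"])
```

Pods without the selector and toleration would keep landing on the regular worker nodes, which is what makes the mixed configuration workable.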
With CaaS, we provide the container with the necessary configurations, and CaaS creates and manages all the underlying secondary resources, including Istio routing, scaling and ingress. CaaS then configures the container and manages it according to the configurations provided. The only requirement is that the container is able to interpret the commands sent by the CaaS service and act upon them, which requires some additional configuration or libraries in the container itself. Knative, which deploys serverless workloads on Kubernetes, is a good example of CaaS.
FaaS takes the CaaS implementation a step further. In CaaS, the user needs to provide the container. In a FaaS service, the user creates and uploads a function consisting of source code plus additional configuration information such as the runtime and triggers. FaaS then builds the code, containerizes the application with all the necessary management tools and libraries, and deploys it, simplifying application deployment. OpenWhisk, Kubeless and OpenFaaS are some of the FaaS services available to facilitate this functionality.
In both instances, the functionality of these services is built on top of the Kubernetes API, exposing only the CaaS or FaaS interface to users. All deployment and management is carried out through the Kubernetes API, but users only see the much simpler function or container service interface. By combining this with a completely serverless cluster powered by a virtual kubelet, you can have a completely serverless Kubernetes environment.
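As a rough illustration of how little the user supplies under CaaS, the sketch below builds a Knative Service manifest in Python. The helper name, service name, image and scale limit are hypothetical. The autoscaling annotation is spelled as in recent Knative releases, and older releases may use a different spelling, so treat it as an assumption to verify against your version.

```python
import json


def build_knative_service(name: str, image: str, max_scale: int) -> dict:
    """Return a minimal Knative Service manifest for a single container."""
    return {
        "apiVersion": "serving.knative.dev/v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "metadata": {
                    # Upper bound on replicas; annotation spelling may vary by release.
                    "annotations": {"autoscaling.knative.dev/max-scale": str(max_scale)}
                },
                "spec": {
                    "containers": [
                        {"image": image, "ports": [{"containerPort": 8080}]}
                    ]
                },
            }
        },
    }


# Hypothetical model-serving container, for illustration only.
service = build_knative_service("churn-model", "example.com/ds/churn-api:latest", 5)
print(json.dumps(service, indent=2))
```

Note that the manifest names only the container and a scaling hint. Routing, revisions and scale-to-zero are left to the platform, which is the point of the CaaS approach.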
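With FaaS, the unit the user writes can be as small as a single function. The sketch below shows what such a function might look like for a trivial data task. The handle(req) signature is modeled on the style of handler some Python FaaS templates expect, but the exact entry point and request type vary by platform, so both the name and the payload format here are assumptions.

```python
import json


def handle(req: str) -> str:
    """Compute the mean of a JSON payload shaped like {"values": [1, 2, 3]}."""
    payload = json.loads(req)
    values = payload.get("values", [])
    if not values:
        # Return a structured error instead of dividing by zero.
        return json.dumps({"error": "no values provided"})
    return json.dumps({"count": len(values), "mean": sum(values) / len(values)})


# Local smoke test; on a FaaS platform the service would invoke handle() itself.
print(handle('{"values": [1, 2, 3]}'))
```

Everything else, such as the web server that receives the request, the container image and the scaling rules, would be supplied by the FaaS service when it builds and deploys the function.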
Kubernetes is a Wonderful Resource for Data Scientists
There are many powerful new platforms that data scientists should be willing to take advantage of. By integrating Kubernetes with serverless platforms and services, data scientists can gain the benefits of both of them without compromising their functionality. At a cluster level, serverless helps reduce costs while providing near-unlimited scalability and availability without management responsibilities. At the application level, serverless greatly simplifies the development and deployment effort required to deploy and use containers in a Kubernetes environment, either via CaaS or FaaS implementations.