Building Serverless Data Pipelines

7 min read
12 September 2023

Author: Dmitrii Vasianin, Software Engineer

 

The high scalability of serverless functions makes them popular among data scientists. With serverless technologies, the cloud service provider manages server provisioning, machine scaling, load balancers, and versioning.

 

So, what would make you build a data product with the above mentioned technologies? Let's take a look at the possibilities that this kind of platform provides and why it's different from what it used to be.

 

  • Data engineering and machine learning coding will be significantly reduced, which shortens the development time for innovative products.

  • Your teams will have easier access to cloud services than ever before. If a team urgently needs software, implementation takes minutes or hours, not days or weeks. Need to explore data in a notebook environment? Spin up a cloud service or a competing SaaS product on top of pre-built datasets.

  • Serverless event-based processing reduces infrastructure deployment, testing, and maintenance to a few lines of automation code. This makes it much easier to get a working data platform.

  • Flexibility and no upfront investment. With the leading providers you pay per megabyte, event, or second of use. You control your environment yourself: any changes the business needs remain entirely under your control. Create new products and retire old ones as fast as your teams can handle it.

  • Ancillary services such as monitoring, orchestration, authorisation/authentication, identity/directory services, PCI-DSS compliance, network security, and encryption in transit and at rest are already built into the platform. These supplementary services are often overlooked before an implementation, yet they can make or break it. It is also worth noting that a cloud provider typically offers these services to many customers, so everything built into the platform has effectively been designed and tested by a large number of organisations.

It is important to note that we have not mentioned cost savings, because developing data products is our main concern. The reasons for building an effective data platform vary across organisations, but to justify moving to a new data platform we will assume that your use cases:

  • provide a significant competitive advantage through analytics;

  • generate income, directly or indirectly, from your data products.

Cost savings may well be a side effect of moving to a cloud data platform, but on their own they are rarely enough motivation to move. We therefore focus on functional benefits rather than direct cost comparisons.

What is a Data Pipeline

A data pipeline is the component responsible for collecting data from multiple sources and pushing it to an event queue and/or the cloud storage layer. It is the link between the data platform and the rest of the enterprise, as well as other external data sources. Although cloud providers offer other options, this approach still comes out ahead in overall adoption. It is worth noting that the solution is not yet offered as a managed cloud service, so it has to be deployed in Docker containers on the container services available from all three major cloud providers.

What a serverless data pipeline looks like

A serverless pipeline spans many technologies. As an example, let's see what kind of data flow a hypothetical online store would need. In this case we will use Google Cloud Platform components to build a platform that supports both a data warehouse and a near-real-time dashboard. This simplified example uses three data sources (a minimal event-publishing sketch follows the list):

 

  • Viewing a product by an online store user

  • Orders from the online retail order management system

  • Product catalog update information from file downloads
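
To make the first data source concrete, here is a minimal sketch of how the storefront might publish a product-view event to a Pub/Sub topic. The project ID, topic name, and event fields below are illustrative assumptions, not part of the example itself.

import json
from google.cloud import pubsub_v1

# Illustrative names - replace with your own project and topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-retail-project", "product-views")

# A hypothetical product-view event from the online store.
event = {"user_id": "u-123", "product_id": "p-456", "timestamp": "2023-09-12T10:00:00Z"}

# Pub/Sub messages are raw bytes, so the event is serialised to JSON first.
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print(future.result())  # message ID once the publish is confirmed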

Google Cloud Functions: Google Serverless Functions

For applied data scientists, one of the main serverless environments is Google Cloud Platform, where we can build Cloud Functions. GCP is a good starting platform, as development there feels close to standard Python development.

 

However, GCP has several disadvantages, such as:

  • No elastic cloud storage.

  • Storage is not intuitive, because the Cloud Functions filesystem is read-only and users can only write to the /tmp directory.

  • Mixing spaces and tabs for indentation can cause large-scale problems.

  • Less responsive support teams when issues arise.

  • Overly verbose documentation and API.

Many developers agree that it is better to develop the code as a Flask application in Python first and move it to GCP later, as this lets you take advantage of GCP while still having access to elastic cloud storage.

How to create an echo service in GCP

GCP provides a web interface for creating cloud functions. This user interface offers options for setting up function triggers, defining requirements for a Python function, and designing a Flask function implementation. Let's see how it's done.

 

Setting up our environment

 

First, we will set up our environment by doing the following (a rough command-line equivalent is sketched after the list):

 

  1. Search for "cloud function".

  2. Click Create Function.

  3. Select "HTTP" as the trigger.

  4. Select Allow Unauthenticated Calls.

  5. Select the embedded editor for Source Code.

  6. Select Python 3.7 as the runtime.

  7. Write the name of your function in the "Function to execute" field.
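
For reference, roughly the same setup can be scripted with the gcloud CLI instead of the web console. The function and entry-point name echo below is only an assumption for this walkthrough.

gcloud functions deploy echo \
    --runtime=python37 \
    --trigger-http \
    --allow-unauthenticated \
    --entry-point=echo \
    --source=.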

After completing these steps, the user interface will provide tabs for the main.py and requirements.txt files. In the requirements file, we will specify libraries such as flask >= 1.1.1, and in the main file, we will implement the behaviour of our function.

Deploying Our Function

Now that our environment is set up, we can start working on our programme. We're going to create a simple echo function that parses the msg parameter from the incoming request and returns it as a JSON response. To use the jsonify function, we first need to include the flask library in the requirements.txt file.

The requirements.txt file and the main.py files for the simple echo function are both shown in the code snippet below.
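
A minimal sketch of those two files, assuming the entry point is called echo and that msg may arrive either in the JSON body or as a query parameter:

# requirements.txt
flask>=1.1.1

# main.py
from flask import jsonify

def echo(request):
    # In Cloud Functions the request argument is a flask.Request; read "msg"
    # from the JSON body or, failing that, from the query string.
    data = request.get_json(silent=True) or {}
    msg = data.get("msg", request.args.get("msg", ""))
    # Return the parameter back to the caller as a JSON response.
    return jsonify({"msg": msg})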

Testing our function

Now click the function name in the console, then open the Testing tab to check whether the deployed function works as expected. You can specify a JSON object to pass to the function and invoke it by clicking "Test the function".
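
For example, a test payload such as {"msg": "Hello echo"} should come back unchanged in the function's response.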

The result of this test is a JSON object returned in the Output dialogue box, which indicates that the call to the echo function worked correctly.

Testing HTTP Calls

Now that the function has been deployed and we have enabled unauthenticated access to the function, we can call the function over the web using Python. To get the URL of the function, click the Trigger tab. We can then use the requests library to pass the JSON object to the serverless function, as shown in the snippet below.
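
A minimal sketch of such a call is shown below; the URL is a placeholder and should be replaced with the one from your function's Trigger tab.

import requests

# Placeholder URL - copy the real one from the function's Trigger tab.
url = "https://us-central1-your-project.cloudfunctions.net/echo"

# Pass the msg parameter as a JSON body and print the echoed response.
response = requests.post(url, json={"msg": "Hello from a serverless pipeline"})
print(response.json())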


The output of this script is the JSON payload returned by the serverless function. The result of the call is the JSON shown below:
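
With the illustrative msg value used above, that payload would be roughly:

{"msg": "Hello from a serverless pipeline"}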

