Magniv + AWS S3 Batch Prediction / Inference — Housing price prediction
Must Read:
Let’s start with a true-life story. While I was working at my previous company, we had a WebSocket server for crypto price prediction, and some of our partners didn't want to utilize the WebSocket inference; they preferred a daily feed of data stored as a CSV file in S3. To cut a long story short, we couldn't find a cost-effective tool for batch inference, so we had to use crontab and Mailjet temporarily while we set up Airflow.
You might come across a task in your organization that seems to require setting up Airflow, but most organizations don't actually need Airflow for minor tasks when they can utilize a simpler and more effective platform like Magniv.
In this tutorial, we will build an Airbnb price prediction model and set up batch prediction/inference with Magniv and S3 to run on a schedule.
Here is the TOC in case you want to skip to the orchestration part.
Table of contents
- Airbnb rent prediction
- Data cleaning
- Data preprocessing
- Cross-validation and feature scaling
- Model training
- Model evaluation and export to a pickle file
- Orchestrate batch prediction monthly with Magniv
- Conclusion
Airbnb rent prediction
For this tutorial, we used a dataset from Kaggle (Airbnb NYC listings).
The objective of this project is to predict the price of Airbnb listings every month based on reviews and other features. Our focus will be on model training and orchestration with Magniv.
First, we will clean the dataset by dropping null values and eliminating geolocation columns and IDs.
To fill in missing values, we will use SimpleImputer. Because we don't have a lot of data points, we will use the most-frequent strategy, which replaces missing values with the most frequent value in each column.
SimpleImputer is a scikit-learn estimator used to fill in missing values in datasets. For numerical values, it supports the mean, median, and constant strategies; for categorical values, it supports the most-frequent and constant strategies.
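Here's a minimal sketch of the cleaning and imputation steps. The CSV file name and the ID column names are assumptions based on the Kaggle dataset:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Load the Kaggle NYC listings (file name is an assumption)
df = pd.read_csv("AB_NYC_2019.csv")

# Drop identifier columns as described above
df = df.drop(columns=["id", "host_id"], errors="ignore")

# "most_frequent" replaces NaNs with each column's most common value,
# which works for both numerical and categorical columns
imputer = SimpleImputer(strategy="most_frequent")
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```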
After the cleaning process, let's plot a map chart.
- Plot map using Folium
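Here's a minimal Folium sketch; the latitude, longitude, and price column names are taken from the Kaggle dataset:

```python
import folium

# Center the map on NYC and draw a small marker per listing
nyc_map = folium.Map(location=[40.7128, -74.0060], zoom_start=11)
for _, row in df.sample(500, random_state=0).iterrows():  # sample to keep the map light
    folium.CircleMarker(
        # cast to float in case imputation left string dtypes
        location=[float(row["latitude"]), float(row["longitude"])],
        radius=2,
        popup=f"${row['price']}",
    ).add_to(nyc_map)
nyc_map.save("nyc_listings_map.html")
```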
[Map chart of the Airbnb listings]
- Data preprocessing
After the chart, we will convert categorical values to vectors and normalize columns with mixed data types.
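A sketch of this preprocessing step, with column names assumed from the Kaggle dataset:

```python
import pandas as pd

# Coerce mixed-type columns back to numeric after imputation
for col in ["price", "minimum_nights", "number_of_reviews"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# One-hot encode the categorical columns into numeric vectors
df = pd.get_dummies(
    df, columns=["neighbourhood_group", "room_type"], drop_first=True, dtype=int
)
```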
- Cross-validation and feature scaling
Because most algorithms work well with normally distributed data, in this section we will split the data into train and test sets with an 80:20 ratio and use a StandardScaler to perform normalization (feature scaling) on the dataset.
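A minimal sketch of the split and scaling, assuming price is the target and the geolocation columns are dropped from the features as described earlier:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Keep only numeric columns for modeling; drop the target and geolocation
X = df.drop(columns=["price", "latitude", "longitude"]).select_dtypes(include="number")
y = df["price"]

# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training set only, then transform both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```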
- Model training, evaluation and exporting to pickle
In this section, we will train models using linear regression, an XGBoost regressor, and ridge regression.
After training, we will evaluate the models and pick the one with the best score (the lowest mean absolute error), then export it with pickle.
From the mean absolute error scores for each model, we can see that linear regression performs better than the other models.
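A sketch of the training, evaluation, and export steps; the model choices follow the section above, with default hyperparameters:

```python
import pickle

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

models = {
    "linear_regression": LinearRegression(),
    "ridge": Ridge(),
    "xgboost": XGBRegressor(),
}

# Train each model and compare mean absolute error on the test set
for name, model in models.items():
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {mae:.2f}")

# Export the best-scoring model (linear regression in our run) with pickle
with open("model.pkl", "wb") as f:
    pickle.dump(models["linear_regression"], f)
```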
Now, let’s move on to the main event: batch prediction/inference orchestration.
Orchestrate batch prediction monthly with Magniv
Before we begin, let's take a quick look at Magniv and the awesome infrastructure you can orchestrate with it.
Magniv is an open-source Python library and infrastructure that simplifies job orchestration at scale. Imagine not having to set up Airflow, which can be expensive to run on AWS EC2 or other VMs; with Magniv, you can set up an orchestration with just a decorator, and it's cost-effective as well.
Click here to learn more about why you should use Magniv instead of setting up your own Airflow environment.
Let’s use Magniv to set up an orchestration that extracts Airbnb listings from NYC monthly and runs our prediction model. The result columns will be the latitude, longitude, and predicted price, so we can plot a map chart later via Streamlit, Plotly, or Folium.
Batch Inference / Prediction:
This is a process that generates predictions for a batch of observations at once. It's a simple inference pattern where models run at timed intervals or on a schedule.
Before we move into the scripts, we need to set up our project directory in the standard Magniv layout.
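A sketch of the layout; Magniv looks for task files inside a tasks/ folder (check the Magniv docs for the exact convention):

```
.
└── tasks/
    ├── app.py                  # Magniv task definitions
    ├── upload_download_s3.py   # S3 helper functions
    └── requirements.txt        # dependencies for the tasks
```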
After creating this directory, let's write the scripts.
In upload_download_s3.py, we will write functions to upload data to and download data from S3.
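A minimal sketch of upload_download_s3.py using boto3; bucket and key names are passed in by the caller:

```python
# upload_download_s3.py
import boto3

s3 = boto3.client("s3")  # credentials come from environment variables


def upload_to_s3(local_path, bucket, key):
    """Upload a local file to s3://bucket/key."""
    s3.upload_file(local_path, bucket, key)


def download_from_s3(bucket, key, local_path):
    """Download s3://bucket/key to a local path."""
    s3.download_file(bucket, key, local_path)
```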
After this, let's set up our orchestration in app.py.
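Here's a sketch of app.py. The decorator parameters below (schedule, key, on_success, resources) mirror the Magniv features described after the script, but their exact names and formats are assumptions, so double-check them against the Magniv docs for your version. The bucket name, feed URL, and feature preparation are placeholders:

```python
# app.py
import pickle

import pandas as pd
from magniv.core import task

from upload_download_s3 import download_from_s3, upload_to_s3

BUCKET = "airbnb-batch-inference"  # placeholder bucket name


@task(
    schedule="@monthly",
    key="get_data",
    on_success="inference",  # trigger the inference task when this one succeeds
    description="Pull the latest NYC listings and store them in S3",
)
def get_data():
    df = pd.read_csv("https://example.com/nyc_listings.csv")  # placeholder feed URL
    df.to_csv("/tmp/listings.csv", index=False)
    upload_to_s3("/tmp/listings.csv", BUCKET, "raw/listings.csv")


@task(
    key="inference",
    resources={"cpu": "2", "memory": "2Gi"},  # raised from the defaults
    description="Run batch price predictions and write them back to S3",
)
def inference():
    download_from_s3(BUCKET, "raw/listings.csv", "/tmp/listings.csv")
    df = pd.read_csv("/tmp/listings.csv")

    with open("model.pkl", "rb") as f:  # the pickled model from the training step
        model = pickle.load(f)

    # Keep lat/long alongside the predicted price for the map chart later
    result = df[["latitude", "longitude"]].copy()
    features = df.drop(columns=["latitude", "longitude"])  # placeholder feature prep
    result["predicted_price"] = model.predict(features)

    result.to_csv("/tmp/predictions.csv", index=False)
    upload_to_s3("/tmp/predictions.csv", BUCKET, "predictions/latest.csv")
```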
In the script above, we utilized the following Magniv features:
- Resource allocation:
For resource allocation, we increase the resources for running the inference from the default 1 GB to 2 CPUs and 2 GB of memory to speed up the operation. You can request more based on your preference.
- On Success:
This feature allows us to run a task once another task has completed, i.e., to configure a task to run after its parent (head) task finishes. In the script above, we configured the inference task to run after the get_data task completes successfully.
- Keys:
Keys are identifiers for task functions, a bit like identities. In the script above, we set a key for both functions so we can reference them in on_success. By default, a task's key is its function name.
We can also set task triggers by using Magniv's webhook features. Click here to view the Magniv docs.
Let's wrap up with our requirements.txt file.
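A plausible requirements.txt for these tasks (pin versions as needed):

```
magniv
boto3
pandas
scikit-learn
xgboost
```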
Now that we are done writing the orchestration scripts, let's push them to GitHub and use Magniv Cloud to run the orchestration.
First, sign up for Magniv with your GitHub account and create a workspace.
After creating a workspace, it will take about 10–15 minutes for the workspace cluster to start. After that, you can go to the config and set up environment variables like your AWS access key ID and secret access key; meanwhile, your build should be running.
For this session, we will run the job manually. Since we set our schedule to monthly, running it manually lets us confirm the pipeline is free of errors.
From the image above, we can see that our orchestration ran successfully, and we can go further and visualize our predictions with a map chart showing the price as a popup.
Conclusion
At the end of this session, we have built a baseline price prediction model for Airbnb rentals, exported the model, and set up a batch prediction pipeline with AWS S3 and Magniv.