Play with Ray
First steps with Ray: An Open-Source Framework for Scaling AI and Python Workloads
As the industry evolves, it's always good to bake some time into your schedule to explore new tools and capabilities. Yes, I know, keeping up with everything is impossible. But occasionally there is a tool everyone talks about that you should explore too. Yes, the duckling one, DuckDB... oh no, please, not again.
Ok, Ok. I understand. Let's explore Ray.
Ray is an open-source framework for scaling AI and Python workloads. It provides a layer for parallel processing with Pythonic primitives. If that sounds a lot like Spark at first glance, it does. Yet when you dig under the hood, it doesn't. Ray innovates at the architecture and enablement level: it is written mostly in C++ and Python and leverages a form of the actor model.
In computer science, the actor model treats actors as the fundamental building blocks of concurrent computation. In response to a message it receives, an actor can:
Make local decisions.
Create more actors.
Send more messages.
Determine how to respond to the next message received.
They can modify their private state, but any impact on other actors is indirect through messaging. Therefore, actors are stateful, allowing for complex computation in a concurrent setting.
Ray introduced Actors as stateful workers to extend the API from functions (tasks) to classes. When an actor is instantiated, a new worker is created, and the methods of the actor are scheduled for that specific worker. This enables them to access and modify the state of that worker, providing a powerful mechanism for concurrent computation.
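To make this concrete, here is a minimal sketch of a Ray actor. The Counter class is my own illustration, not part of the notebook; it shows a stateful worker whose methods run on a dedicated worker process.

import ray

ray.init()

# An actor is a stateful worker. Each method call is a message that
# can read and modify the actor's private state.
@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()  # instantiating the actor starts a dedicated worker
futures = [counter.increment.remote() for _ in range(3)]  # method calls return futures
print(ray.get(futures))  # [1, 2, 3] -- calls on one actor run in order, preserving state

Compare this with a plain Ray task (a @ray.remote function), which is stateless: the actor keeps its counter between calls, while a task would start from scratch each time.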
OK, back to basics. Ray is written in C++ and Python, so here is a quick refresher on file extensions:
*.cc → C++ files containing definitions and local declarations.
*.h → header files containing shared declarations.
*.py → Python files.
Its build system is the all-mighty Bazel by Google. If you work with IntelliJ, there is a plugin for Bazel that you will need to install to build the project.
From IntelliJ, go to:
IntelliJ IDE -> Preferences -> Plugins -> Marketplace (here, search for Bazel and install Bazel for IntelliJ).
Now, with some ability to peek under the hood at the source code, let's try the API and play with it. I chose to work with the Docker image I built for my book, which contains Jupyter notebooks, Spark, Hive, PyTorch, TensorFlow, Petastorm, and MLflow, since it was the easiest way to run everything and compare outcomes. There is also a repository for you to take advantage of, with an env-setting notebook that installs everything you need to run the My first Ray project notebook.
The notebook leverages Ray's getting started documentation, which covers everything you need, except for some environment configuration that I added for you.
Here is the repository – play with ray.
The first code block from the quick start pulls data from an open S3 bucket into a Dataset using ray.data.read_csv. Datasets are a distributed abstraction on top of your data that enables basic transformations like map_batches, groupby, random_shuffle, repartition, and so on.
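For a taste of what such a transformation looks like, here is a small sketch I put together on the same breast-cancer CSV used below; treat it as illustrative rather than part of the original notebook.

import ray

ds = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

# map_batches applies a function to batches of rows, in parallel across workers.
def rescale(batch):
    batch["mean radius"] = batch["mean radius"] / 10
    return batch

ds = ds.map_batches(rescale)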
The Python snippet splits the data into training and validation sets, and also reuses the validation dataset for testing by dropping the label column named target (this is a hack for the purpose of simplifying the code).
import ray
# Load data.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)
# Create a test dataset by dropping the target column.
test_dataset = valid_dataset.drop_columns(cols=["target"])
As a result, there is a lovely warning about repartitioning and the number of available CPU slots. This gives us a knob to tune the system even further. Take note of that; I won't discuss it in this post, but in the future, when discussing cluster tuning and performance, this will become essential to automate.
After splitting the data, we need to preprocess it. Ray offers a preprocessor library with lots of goodness in it, for example, the StandardScaler, which normalizes the data by removing the mean and scaling each feature/variable to unit variance.
# Create a preprocessor to scale some columns.
from ray.data.preprocessors import StandardScaler
preprocessor = StandardScaler(columns=["mean radius", "mean texture"])
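If you are curious what that scaling actually computes, here is a tiny non-Ray illustration of the same idea, purely for intuition (this is not how Ray implements it internally):

import numpy as np

# Standardization per column: z = (x - mean) / std
mean_radius = np.array([12.0, 15.0, 18.0])
scaled = (mean_radius - mean_radius.mean()) / mean_radius.std()
print(scaled.mean(), scaled.std())  # ~0.0 and 1.0 after scaling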
Now, let’s do some training with XGBoostTrainer.
This trainer is based on the gradient-boosted trees algorithm. I discussed boosting algorithms in my book, but if you haven't had a chance to read it yet, here is a tl;dr.
tl;dr: Gradient boosting is a supervised learning technique in which a set of simpler, weaker models is combined to predict the target variable accurately.
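If you want to see that idea in its simplest, single-machine form, here is a quick sketch using scikit-learn on the same breast-cancer data (assuming scikit-learn is installed; Ray's XGBoostTrainer distributes this kind of training across workers):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 20 shallow trees; each new tree corrects the errors of the ensemble so far.
model = GradientBoostingClassifier(n_estimators=20, max_depth=2)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))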
XGBoostTrainer takes scaling_config as an input, which lets us configure the number of workers, GPU/CPU usage, and so on. As before, this is another chance to configure the system for maximum performance; oftentimes these settings are fully automated and configured outside the training code itself. It also takes params, label_column, num_boost_round, the datasets, and the preprocessor.
This defines the trainer. A trainer is a configured algorithm that creates the model; calling trainer.fit produces the result.
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=2,
        # Whether to use GPU acceleration.
        use_gpu=False,
        # Make sure to leave some CPUs free for Ray Data operations.
        _max_cpu_fraction_per_node=0.9,
    ),
    label_column="target",
    num_boost_round=20,
    params={
        # XGBoost specific params
        "objective": "binary:logistic",
        # "tree_method": "gpu_hist",  # uncomment this to use GPUs.
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    preprocessor=preprocessor,
)

result = trainer.fit()
print(result.metrics)
Printing the result metrics provides some insights into the outcome, memory, system, and performance.
The next step is to tune the hyperparameters to find the best model with minimal train-logloss and train-error. For that, I will let you explore ray.tune.Tuner with the Play with Ray notebook and gain hands-on experience! I would love to hear about your experience in the comments.
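To give you a head start, here is a rough sketch of what that tuning could look like. This is my own approximation of the Tuner API (the parameter ranges and metric name are assumptions); double-check the notebook and the Ray docs for the exact signatures.

from ray import tune
from ray.tune import Tuner

# Search over a couple of XGBoost parameters, reusing the trainer defined above.
tuner = Tuner(
    trainer,
    param_space={
        "params": {
            "max_depth": tune.randint(2, 8),
            "eta": tune.loguniform(1e-3, 1e-1),
        }
    },
    tune_config=tune.TuneConfig(num_samples=4, metric="train-logloss", mode="min"),
)
results = tuner.fit()
print(results.get_best_result().config)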
That is all for this week with Ray. As for next week, there is a debate between continuing to explore Ray and moving on to table formats, either Delta Lake or Iceberg. A tight competition!
Share in comments which one you would like to learn about first!
Speak soon,
Adi