Basics
Don't confuse Local and Standalone modes.
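Local mode is a development mode that runs everything in a single JVM. It is the default mode and is configured with master local[*] (one worker thread per logical core) or local[n] (n worker threads).
Standalone mode is Spark's own built-in cluster manager. It fills the same role as YARN and Mesos, though it is much simpler than either.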
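To make the distinction concrete, here is a minimal sketch that classifies a master URL by the mode it selects. The helper names mode_of and make_session are made up for illustration, and make_session assumes pyspark is installed (it is imported lazily so the classifier works without it).

```python
import re


def mode_of(master: str) -> str:
    """Classify a Spark master URL (hypothetical helper, not a Spark API)."""
    # local, local[n], local[*], plus the local[n,f] / local[*,f] variants
    if re.fullmatch(r"local(\[(\*|\d+)(,\s*\d+)?\])?", master):
        return "local"
    # spark://host:port points at a Standalone master
    if master.startswith("spark://"):
        return "standalone"
    # yarn, mesos://..., k8s://... and anything else
    return "other"


def make_session(master: str, app_name: str = "demo"):
    """Create a SparkSession for the given master (requires pyspark)."""
    from pyspark.sql import SparkSession  # lazy import: only needed here

    return SparkSession.builder.master(master).appName(app_name).getOrCreate()
```

For example, mode_of("local[*]") gives "local" and mode_of("spark://mac.local:7077") gives "standalone"; the same make_session call works for both, only the master string changes.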
Standalone
- Start the master node
./start-master.sh
- Start the worker node (it can be on the same machine)
./start-worker.sh spark://mac.local:7077
- In your application set the master to
spark://mac.local:7077
- We can adjust how many executors run per worker and how many cores each executor gets with spark.executor.cores and spark.cores.max. spark.executor.instances has no effect in standalone mode.
- Writing to the filesystem asks every worker to write its partitions to the specified path, so we have to make sure that path exists on all workers.
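The sizing settings above can be sketched as a small config builder. This assumes the usual standalone behaviour where an application gets roughly spark.cores.max / spark.executor.cores executors, provided the workers have enough free cores and memory. The function names and the numbers in the usage note are illustrative only.

```python
def standalone_conf(master_url: str, executor_cores: int, cores_max: int) -> dict:
    """Build the Spark properties that size an app on a Standalone cluster."""
    if executor_cores < 1:
        raise ValueError("executor_cores must be at least 1")
    if cores_max < executor_cores:
        raise ValueError("cores_max must be >= executor_cores")
    return {
        "spark.master": master_url,
        # cores given to each executor
        "spark.executor.cores": str(executor_cores),
        # total cores the application may take across the whole cluster
        "spark.cores.max": str(cores_max),
    }


def expected_executors(executor_cores: int, cores_max: int) -> int:
    """Rough upper bound on executors; assumes workers have enough resources."""
    return cores_max // executor_cores
```

With executor_cores=2 and cores_max=8 we would expect at most 4 executors. Each key/value pair in the returned dict can be passed to SparkSession.builder.config(key, value) or to spark-submit as --conf key=value.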
S3
- Set spark.jars.packages to com.amazonaws:aws-java-sdk:1.12.310,org.apache.hadoop:hadoop-aws:3.3.4
- If the S3 bucket is public, set fs.s3a.aws.credentials.provider to org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
- Read the S3 object (not sure about the bucket) using its URL prefixed by s3a://. Example: s3a://cellpainting-gallery/datasets/dataset.csv
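Put together, the S3 settings could look like the sketch below. Two things are assumptions on my side: Hadoop options set through the Spark config take the spark.hadoop. prefix, and the CSV has a header row (header=True). The helper names are hypothetical, and read_csv needs pyspark plus network access, so it is kept separate from the pure helpers.

```python
def s3a_conf(public_bucket: bool) -> dict:
    """Spark properties for reading from S3 through the s3a connector."""
    conf = {
        "spark.jars.packages": (
            "com.amazonaws:aws-java-sdk:1.12.310,"
            "org.apache.hadoop:hadoop-aws:3.3.4"
        ),
    }
    if public_bucket:
        # Hadoop option fs.s3a.aws.credentials.provider, passed via Spark
        conf["spark.hadoop.fs.s3a.aws.credentials.provider"] = (
            "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
        )
    return conf


def s3a_url(bucket: str, key: str) -> str:
    """Build the s3a:// URL for an object."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def read_csv(conf: dict, url: str):
    """Read a CSV object into a DataFrame (requires pyspark)."""
    from pyspark.sql import SparkSession  # lazy import: only needed here

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    # header=True is an assumption about the file layout
    return spark.read.csv(url, header=True)
```

Usage would be something like read_csv(s3a_conf(public_bucket=True), s3a_url("cellpainting-gallery", "datasets/dataset.csv")).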
Kubernetes
brew install openjdk@11
brew install apache-spark
brew install ubuntu/
microk8s enable registry
Since microk8s is running in a VM, we need to use the VM IP address to talk to the registry from the host machine (macOS), but we need to use the registry service cluster IP (or DNS name?) when submitting Spark applications.
# Get the VM IP
kubectl get nodes -o wide
# Add "insecure-registries": ["http://192.168.64.2:32000"] to ~/.docker/daemon.json
# or set it via the Docker app, then restart Docker
cd /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/
docker-image-tool.sh -r 192.168.64.2:32000 build
docker-image-tool.sh -r 192.168.64.2:32000 -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
docker-image-tool.sh -r 192.168.64.2:32000 push
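Once the images are pushed, submitting an application might look like the sketch below, which only assembles the spark-submit argument list. Everything specific here is an assumption rather than something verified in these notes: that the -p build produces an image named spark-py with the default tag latest, that the API server listens on https://192.168.64.2:16443, that the cluster sees the registry as localhost:32000, and that pi.py sits at the example path inside the image. The function name is hypothetical.

```python
def spark_submit_args(
    api_server: str,
    registry: str,
    app_path: str,
    image_name: str = "spark-py",  # assumed name of the Python image
    tag: str = "latest",           # assumed default tag
) -> list:
    """Assemble spark-submit arguments for a Kubernetes cluster-mode run."""
    # The registry address must be the one the cluster can reach,
    # not necessarily the VM IP used for pushing from the host.
    image = f"{registry}/{image_name}:{tag}"
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_path,
    ]


if __name__ == "__main__":
    args = spark_submit_args(
        api_server="https://192.168.64.2:16443",  # assumed API server address
        registry="localhost:32000",               # assumed in-cluster address
        app_path="local:///opt/spark/examples/src/main/python/pi.py",
    )
    print(" ".join(args))
```

Keeping the registry as a parameter makes it easy to try the cluster IP or the DNS name until we know which one the cluster actually resolves.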