Connect to Spark on an external cluster
=======================================
Using the `Apache Livy `_ service, you can
connect to an external Spark cluster from Faculty notebooks, apps and APIs.
.. contents:: Contents
:local:
.. note::
To enable interaction with the Spark cluster, Apache Livy must be
installed. Contact your cluster administrator to arrange installation -
documentation on installation is available `here
`_.
We also strongly recommend to use Spark 2, which provides a much easier to
use interface for data science than Spark 1. Indeed, MLlib, the Spark
machine learning library has already deprecated their RDD (Spark 1)
interface. You can check your Spark version by running
``print(sc.version)`` inside a Spark session as described below. Contact
your cluster administrator to install Spark 2 and configure Apache Livy to
use it.
Any libraries or other dependencies needed by your code must be installed
on the Spark cluster, `not` on your Faculty server. Using
sparkmagic/pylivy and Apache Livy, the code you run inside a ``%spark``
cell is run inside the external cluster, not in your notebook.
.. _sparkmagic:
Interactive Spark in notebooks
------------------------------
The `sparkmagic `__ package
provides Jupyter magics for managing Spark sessions on a external cluster and
executing Spark code in them.
To use sparkmagic in your notebooks, install the package with
``pip install sparkmagic`` in a terminal or with a :ref:`Faculty Environment
`, and load the magics in a notebook with:
.. code-block:: python
%load_ext sparkmagic.magics
This makes available the ``%spark`` magic, which is the main entry point for
managing sessions and executing code.
Session management
^^^^^^^^^^^^^^^^^^
Create a new PySpark session with:
.. code-block:: python
%spark add -s my_session -l python -u https://mycluster.com:8998
The ``-u``/``--url`` option sets the URL where Livy is running - you can get
this from your system administrator. You can create R and Scala Spark sessions
by setting the ``-l``/``--language`` option to ``r`` or ``scala`` respectively.
For documentation of all options run ``%spark?``.
You can also list running sessions with:
.. code-block:: python
%spark list
and delete sessions with:
.. code-block:: python
%spark delete sessionname
sparkmagic also provides the ``%manage_spark`` command, which returns a widget
for managing Spark sessions on the Livy server, which you may prefer to the
above interface.
Executing code
^^^^^^^^^^^^^^
Once you've created a Spark session as above, execute a cell on the cluster by
decorating a cell with the ``%%spark`` cell magic (note the two ``%``):
.. code-block:: python
%%spark
print('I am being executed on the external cluster')
.. note::
It's important to bear in mind the distinction between code executed in the
external Spark cluster from code that is executed in the notebook in your
Faculty server. Cells that have the ``%%spark`` magic are executed on
the external cluster, and will only see variables that exist there, and
cells without that magic are executed on your Faculty server, and will
only see variables that exist there.
If you get errors like `NameError: name 'df' is not defined`, it may be
because the variable you meant exists in the other context.
Transfer of data between the external cluster and Faculty notebook must
be done explicity, as :ref:`described below
`.
The ``spark`` SparkSession (Spark 2 only) and ``sc`` SparkContext objects will
be inserted into the session automatically. For example, to create a Spark
DataFrame from a CSV file in the cluster's HDFS filesystem:
.. code-block:: python
%%spark
df = spark.read.csv('hdfs:////data/sample_data.csv')
Variables created in one cell will persist in the session and will be available
in other later cells. For example, we can run a second cell that counts the
number of rows in the Spark DataFrame created above:
.. code-block:: python
%%spark
print(df.count())
Any output generated by your code in the cluster will be retrieved and
displayed as the output of the notebook cell in Faculty.
.. _sparkmagic_retrieve_dataframe:
Retrieving data
^^^^^^^^^^^^^^^
Often, you'll want to retrieve the contents of a Spark DataFrame from the
cluster so you can do additional processing and modelling in your normal
Jupyter notebook. You can do this with the ``-o`` option:
.. code-block:: python
%spark -o df
This will evaluate and collect the Spark DataFrame ``df`` on the external
Spark cluster, and save its data into a Pandas DataFrame in your Faculty
notebook, also called ``df``.
.. note::
Using ``%spark -o`` will attempt to load all of the values from a Spark
DataFrame into the memory on your Faculty server. If this is very large,
as is often the case with Spark DataFrames, it may crash your server due to
running out of memory!
You can also use ``-o`` with a ``%%spark`` cell magic. The below code creates a
Spark DataFrame in the external cluster called ``top_ten``, then collects it
into the Faculty notebook as the Pandas DataFrame ``top_ten``.
.. code-block:: python
%%spark -o top_ten
top_ten = df.limit(10)
Run Spark jobs from scripts, apps and APIs
------------------------------------------
The `pylivy `_ package provides tools for
managing Spark sessions on an external cluster and executing Spark code in
them.
Unlike :ref:`sparkmagic `, pylivy does not depend on being executed
from inside a Jupyter notebook, making it suitable for use in scripts, apps and
APIs.
To use pylivy, install it with ``pip install livy`` in a terminal or with a
:ref:`Faculty Environment `.
Usage
^^^^^
pylivy provides the ``LivySession`` class, which creates a Spark session and
shuts it down automatically when finished. To execute code in the session, pass
it as a string to the ``run()`` method on the session:
.. code-block:: python
from livy import LivySession
with LivySession('https://mycluster.com:8998') as session:
session.run('print("foo")')
You may also find the ``textwrap.dedent()`` function from the Python standard
library useful for writing multiple-line code snippets inline:
.. code-block:: python
from textwrap import dedent
with LivySession('https://mycluster.com:8998') as session:
session.run(dedent("""
df = spark.read.csv('hdfs:////data/sample_data.csv')
top_ten = df.limit(10)
"""))
The ``read()`` method on the session allows you to evaluate and retrieve the
contents of a Spark DataFrame. Pass it the name of the Spark DataFrame you want
to read, and it will return it as a Pandas DataFrame:
.. code-block:: python
with LivySession('https://mycluster.com:8998') as session:
session.run(dedent("""
df = spark.read.csv('hdfs:////data/sample_data.csv')
top_ten = df.limit(10)
"""))
top_ten_pandas = session.read('top_ten')
# Do something useful with the Pandas dataframe
make_plot(top_ten_pandas)