Faculty Datasets
================

.. contents:: Contents
    :local:

.. role:: r(code)
   :language: r

`Datasets` is Faculty's environment for storing large files. It is designed
to prevent accidental loss or modification of important data. Before
proceeding, you may want to skim through the tutorial on :doc:`Accessing Data
<../getting_started/accessing_data>`.

To access the `Datasets` environment, click the relevant icon in the tab on the
left-hand side of the workspace. Once inside the `Datasets` environment, the
buttons on the top-right of the page offer three options: `Upload file`,
`Create folder`, and `Delete file`. Other actions, such as moving files from
`Datasets` to the workspace, can be performed using the |faculty.datasets|_
Python module.

.. |faculty.datasets| replace:: ``faculty.datasets``
.. _faculty.datasets: ../libraries/datasets/index.html

Files uploaded to `Datasets` in CSV or TSV format are automatically analysed
with `Lens`, Faculty's data-exploration service. As we will see, reports
generated by `Lens` can be readily accessed from the `Datasets` page. If you
would rather use `Lens` as a Python module,
:ref:`find its documentation here <lens-library>`.

`Lens` reports evaluate the quality of datasets, and offer immediate insight
through visualisations and tabular summaries.

Moving files to and from datasets
---------------------------------

To move files from `Datasets` to the workspace, where your programs can use
them, we provide Python and R libraries that let you manipulate files on
`Datasets`.

.. tabs::

  .. tab:: Python

    The Python library is called ``faculty.datasets``. To save you some
    typing, you can import it as ``datasets``:

    .. code-block:: python

      import faculty.datasets as datasets

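    You can then call its functions. For example, to upload
    ``/project/test-file.csv`` from the workspace to ``/input/test-file.csv``
    on `Datasets`: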

    .. code-block:: python

      datasets.put('/project/test-file.csv', '/input/test-file.csv')

    The various functions for manipulating files are:

    .. currentmodule:: faculty.datasets

    .. autosummary::
      :toctree:

      cp
      etag
      get
      glob
      ls
      mv
      open
      put
      rm
      rmdir

    To find a full list, take a look at the |faculty.datasets|_ page.

  .. tab:: R

    The R library is called `rfaculty`. As always, we load it with the
    library command:

    .. code-block:: r

      library(rfaculty)

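    You can then call its functions. For example, to upload
    ``/project/test-file.csv`` from the workspace to ``/input/test-file.csv``
    on `Datasets`: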

    .. code-block:: r

      datasets_put('/project/test-file.csv', '/input/test-file.csv')

    The various functions for manipulating files are:

    .. include:: ../libraries/rfaculty/datasets_calls.rst

.. note::
    Copying and moving large files (> 1 GB) is currently not well supported.
    Instead of using the ``cp`` and ``mv`` commands, consider downloading the
    file first and re-uploading it to a different location within `Datasets`.
    Then remove the original file if needed.
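
The workaround described in the note can be sketched in Python as follows. This
is a minimal sketch rather than an official recipe: ``copy_via_workspace`` is a
hypothetical helper, and it assumes that ``get`` takes the `Datasets` path
first and the local path second (the reverse of ``put``), and that the
workspace has enough free disk space to stage the file.

.. code-block:: python

    import os
    import tempfile


    def copy_via_workspace(datasets, source, destination, remove_source=False):
        """Copy a Datasets file by downloading it and uploading it again.

        ``datasets`` is the imported ``faculty.datasets`` module, or any object
        exposing ``get``, ``put`` and ``rm`` with the same argument order
        (an assumption of this sketch).
        """
        # Stage the file in a temporary directory on the workspace disk.
        with tempfile.TemporaryDirectory() as staging_dir:
            local_path = os.path.join(staging_dir, os.path.basename(source))
            datasets.get(source, local_path)       # Datasets -> workspace
            datasets.put(local_path, destination)  # workspace -> Datasets
        if remove_source:
            # Reached only if the download and upload did not raise.
            datasets.rm(source)

After ``import faculty.datasets as datasets``, a call such as
``copy_via_workspace(datasets, '/input/big-file.csv', '/archive/big-file.csv',
remove_source=True)`` emulates a move; both paths are placeholders. Passing the
module in as an argument keeps the helper easy to try out with a stand-in
object.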

Lens reports
------------

Within the `Datasets` environment, select the CSV or TSV file that you would
like to explore. The icon |dots| on the right-hand side of the page will direct
you to the `Lens` report.

.. |dots| image:: images/dots.png
   :scale: 50 %

.. thumbnail:: images/datasets_page.png

Interpreting Lens reports
^^^^^^^^^^^^^^^^^^^^^^^^^

Your `Lens` report will typically look like this:

.. thumbnail:: images/lens_overview.png

`Lens` reports are organised in three parts, namely `Columns`, `Correlation
Matrix` and `Pairwise Density Plot`. Each corresponds to a tab, so that you can
navigate through a report as you would in your web browser.

Columns
^^^^^^^

This is the "landing page" of the report. It lists the quantities (`Columns`)
found in the dataset, alongside their main characteristics:

- **TYPE**: the way the column is encoded. As in the data-analysis tool Pandas,
  the type can be :code:`int64` (integer number), :code:`float64` (floating
  point number), or :code:`object` (non-numeric).

- **VALID**: the number of non-null entries in the column.

- **NULL**: the number of null entries in the column. The sum of `VALID` and
  `NULL` is equal to the size of your dataset.

- **DISTINCT**: the number of distinct entries in the column. In other words,
  repeated identical values are counted only once.

- **CATEGORICAL**: if :code:`No`, the column is numeric (:code:`int64` or
  :code:`float64`). Otherwise, the column is non-numeric (:code:`object`).
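
For intuition, similar fields can be approximated with Pandas. The sketch below
is an illustration, not the `Lens` implementation: ``summarise_columns`` is a
hypothetical helper, the data is made up, and it assumes that null entries are
excluded from the distinct count and that every non-numeric column counts as
categorical.

.. code-block:: python

    import pandas as pd


    def summarise_columns(df):
        """Return one row per column with Lens-style summary fields."""
        summary = pd.DataFrame({
            "TYPE": df.dtypes.astype(str),
            "VALID": df.notna().sum(),   # non-null entries
            "NULL": df.isna().sum(),     # null entries
            "DISTINCT": df.nunique(),    # assumption: nulls are not counted
        })
        # Assumption: a column is categorical exactly when it is not numeric.
        summary["CATEGORICAL"] = [
            "No" if pd.api.types.is_numeric_dtype(dtype) else "Yes"
            for dtype in df.dtypes
        ]
        return summary


    example = pd.DataFrame({
        "hours": [1.5, 2.0, None, 2.0],
        "score": [50, 60, 70, 60],
        "name": ["Ann", "Bob", "Cy", None],
    })
    report = summarise_columns(example)
    print(report)

For every column, the ``VALID`` and ``NULL`` counts add up to ``len(example)``.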

.. note:: Clicking the name of a numeric quantity will direct you to the
          corresponding histogram. The plot will also include an estimate of
          the Probability Density Function (PDF) for the quantity, represented
          as a solid yellow line. More precisely, `Lens` calculates the PDF by
          means of a `kernel density estimation (KDE)
          <https://en.wikipedia.org/wiki/Kernel_density_estimation>`_ method.

.. thumbnail:: images/histogram.png
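
To see how a histogram and a KDE curve relate, the following sketch computes
both for a synthetic column with NumPy and SciPy. The kernel and bandwidth that
`Lens` uses are not stated here, so the Gaussian kernel and SciPy's default
bandwidth are assumptions made for illustration.

.. code-block:: python

    import numpy as np
    from scipy.stats import gaussian_kde

    # Synthetic numeric column standing in for real data.
    rng = np.random.default_rng(seed=0)
    values = rng.normal(loc=50.0, scale=10.0, size=500)

    # Histogram normalised so that the bar areas sum to 1, which makes it
    # directly comparable with a probability density.
    bar_heights, bin_edges = np.histogram(values, bins=20, density=True)

    # Gaussian KDE with SciPy's default bandwidth.
    kde = gaussian_kde(values)
    grid = np.linspace(values.min(), values.max(), 200)
    pdf_estimate = kde(grid)  # smooth curve to draw over the bars

Plotting ``pdf_estimate`` against ``grid`` on top of the bars gives a picture
similar in spirit to the one above.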

Correlation matrix
^^^^^^^^^^^^^^^^^^

Data scientists are often interested in correlations, as these indicate whether
it is possible to make predictions. For example, let us assume the score of
pupils in a test is highly correlated with the number of hours they studied.
Then, given a pupil who has never taken the test, the number of hours they
spent studying can be used to predict their score.

`Lens` calculates the correlation coefficient of each quantity in the dataset
with all the others, and reports back a correlation matrix that summarises this
information. For instance, each diagonal entry of this matrix specifies the
correlation coefficient of a quantity with itself, which is equal to 1 by
definition.

.. thumbnail:: images/correlation_matrix.png

More technically, `Lens` returns the Spearman rank-order correlation
coefficient matrix for your dataset.
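
The Spearman coefficient is computed on ranks rather than raw values, and
ranges from -1 to 1. A comparable matrix can be obtained with Pandas, as in the
sketch below. The pupil data is made up for illustration, and the call is not
necessarily how `Lens` computes its matrix.

.. code-block:: python

    import pandas as pd

    # Made-up data: five pupils, three numeric columns.
    pupils = pd.DataFrame({
        "hours_studied": [1, 2, 3, 4, 5],
        "score": [35, 50, 48, 70, 90],
        "absences": [9, 7, 8, 3, 1],
    })

    # Spearman rank-order correlation between every pair of columns.
    correlations = pupils.corr(method="spearman")
    print(correlations)

The result is a symmetric matrix with ones on the diagonal. In this example,
``score`` rises with ``hours_studied`` and falls with ``absences``, so the
corresponding coefficients are positive and negative respectively.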

Pairwise density plot
^^^^^^^^^^^^^^^^^^^^^

To better characterise the correlation between two quantities, it is useful to
create a scatter plot. The tab `Pairwise Density Plot` in your `Lens` report
displays, in a sense, all scatter plots that can be generated by examining the
quantities in the dataset pairwise. Whenever a quantity is compared with
itself, the scatter plot conveys no information, and thus a histogram is
displayed instead.

.. thumbnail:: images/pairwise_density.png

To be precise, the visualisations shown on this page are not scatter plots, but
rather `2D Kernel Density Estimates (KDEs)
<https://en.wikipedia.org/wiki/Kernel_density_estimation>`_. These are
approximations of joint Probability Density Functions (PDFs) for pairs of
quantities in the dataset.
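
As a rough illustration of a 2D KDE, the sketch below estimates the joint
density of two synthetic columns with SciPy and evaluates it on a grid. The
column names, the grid size, and the use of ``gaussian_kde`` with its default
bandwidth are all assumptions; the settings used by `Lens` may differ.

.. code-block:: python

    import numpy as np
    from scipy.stats import gaussian_kde

    # Two synthetic, positively related columns.
    rng = np.random.default_rng(seed=1)
    hours = rng.uniform(0.0, 10.0, size=300)
    score = 40.0 + 5.0 * hours + rng.normal(scale=5.0, size=300)

    # gaussian_kde expects an array of shape (n_dimensions, n_points).
    joint_kde = gaussian_kde(np.vstack([hours, score]))

    # Evaluate the estimated joint density on a 50 x 50 grid.
    x_grid, y_grid = np.meshgrid(
        np.linspace(hours.min(), hours.max(), 50),
        np.linspace(score.min(), score.max(), 50),
    )
    grid_points = np.vstack([x_grid.ravel(), y_grid.ravel()])
    density = joint_kde(grid_points).reshape(x_grid.shape)

Drawing ``density`` as a heat map or contour plot over ``x_grid`` and
``y_grid`` yields one panel of a pairwise density plot.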