Faculty Datasets
================

.. contents:: Contents
    :local:

.. role:: r(code)
    :language: r

`Datasets` is Faculty's environment for storing large files. It is designed to
prevent accidental loss or modification of important data. Before proceeding,
you may want to skim through the tutorial on
:doc:`Accessing Data <../getting_started/accessing_data>`.

To access the `Datasets` environment, click the relevant icon in the tab on the
left-hand side of the workspace.

Once inside the `Datasets` environment, the buttons on the top-right of the
page offer three options: `Upload file`, `Create folder`, and `Delete file`.
Other actions, such as moving files from `Datasets` to the workspace, can be
performed using the |faculty.datasets|_ Python module.

.. |faculty.datasets| replace:: ``faculty.datasets``
.. _faculty.datasets: ../libraries/datasets/index.html

Files uploaded to `Datasets` in CSV or TSV format are automatically analysed
with `Lens`, Faculty's data-exploration service. As we will see, reports
generated by `Lens` can be readily accessed from the `Datasets` page. If, on
the other hand, you would like to use this feature as a Python module,
:ref:`find its documentation here `.

`Lens` reports evaluate the quality of datasets, and offer immediate insight
through visualisations and tabular summaries.

Moving files to and from datasets
---------------------------------

To move files from `Datasets` to the workspace, where you can use them in your
programs, we provide Python and R libraries that let you manipulate files on
`Datasets`.

.. tabs::

    .. tab:: Python

        The Python library is called ``faculty.datasets``. To save you some
        typing, you can import it as ``datasets``:

        .. code-block:: python

            import faculty.datasets as datasets

        You can then use the commands as follows:

        .. code-block:: python

            datasets.put('/project/test-file.csv', '/input/test-file.csv')

        The various functions for manipulating files are:

        .. currentmodule:: faculty.datasets

        .. autosummary::
            :toctree:

            cp
            etag
            get
            glob
            ls
            mv
            open
            put
            rm
            rmdir

        To find a full list, take a look at the |faculty.datasets|_ page.

    .. tab:: R

        The R library is called `rfaculty`. As always, we load it with the
        ``library`` command:

        .. code-block:: r

            library(rfaculty)

        You can then use the commands as follows:

        .. code-block:: r

            datasets_put('/project/test-file.csv', '/input/test-file.csv')

        The various functions for manipulating files are:

        .. include:: ../libraries/rfaculty/datasets_calls.rst

.. note::

    Copying and moving large files (> 1 GB) is currently not well supported.
    Instead of using the ``cp`` and ``mv`` commands, consider downloading the
    file first and re-uploading it to a different location within `Datasets`.
    Then, remove the original file if needed.

Lens reports
------------

Within the `Datasets` environment, select the CSV or TSV file that you would
like to explore. The icon |dots| on the right-hand side of the page will direct
you to the `Lens` report.

.. |dots| image:: images/dots.png
    :scale: 50 %

.. thumbnail:: images/datasets_page.png

Interpreting Lens reports
^^^^^^^^^^^^^^^^^^^^^^^^^

Your `Lens` report will typically look like this:

.. thumbnail:: images/lens_overview.png

`Lens` reports are organised in three parts, namely `Columns`,
`Correlation Matrix` and `Pairwise Density Plot`. Each corresponds to a tab, so
that you can navigate through a report as you would in your web browser.

Columns
^^^^^^^

This is the "landing page" of the report. It lists the quantities (`Columns`)
found in the dataset, alongside their main characteristics:

- **TYPE**: the way the column is encoded. As in the data-analysis tool Pandas,
  the type can be :code:`int64` (integer number), :code:`float64` (floating
  point number), or :code:`object` (non-numeric).
- **VALID**: the number of non-null entries in the column.
- **NULL**: the number of null entries in the column. The sum of `VALID` and
  `NULL` is equal to the number of rows in your dataset.
- **DISTINCT**: the number of distinct entries in the column. In other words,
  repeated identical values are counted only once.
- **CATEGORICAL**: if :code:`No`, the column is numeric (:code:`int64` or
  :code:`float64`). Otherwise, the column is non-numeric (:code:`object`).

.. note::

    Clicking the name of a numeric quantity will direct you to the
    corresponding histogram. The plot also includes an estimate of the
    Probability Density Function (PDF) for the quantity, represented as a solid
    yellow line. More precisely, `Lens` calculates the PDF by means of a
    `kernel density estimation (KDE) `_ method.

.. thumbnail:: images/histogram.png

Correlation matrix
^^^^^^^^^^^^^^^^^^

Data scientists are often interested in correlations, as these indicate whether
one quantity can be used to predict another. For example, let us assume the
score of pupils in a test is highly correlated with the number of hours they
studied. Then, given a pupil who has never taken the test, the number of hours
they spent studying can be used to predict their score.

`Lens` calculates the correlation coefficient of each quantity in the dataset
with all the others, and reports back a correlation matrix that summarises this
information. For instance, each diagonal entry of this matrix specifies the
correlation coefficient of a quantity with itself, which is equal to 1 by
definition.

.. thumbnail:: images/correlation_matrix.png

More technically, `Lens` returns the Spearman rank-order correlation
coefficient matrix for your dataset.

Pairwise density plot
^^^^^^^^^^^^^^^^^^^^^

To better characterise the correlation between two quantities, it is useful to
create a scatter plot. The tab `Pairwise Density Plot` in your `Lens` report
displays, in a sense, all scatter plots that can be generated by examining the
quantities in the dataset pairwise. Whenever a quantity is compared with
itself, the scatter plot conveys no information, and thus a histogram is
displayed instead.

.. thumbnail:: images/pairwise_density.png

To be precise, the visualisations on this page are not scatter plots, but
rather `2D Kernel Density Estimates (KDEs) `_. These are approximations of
joint Probability Density Functions (PDFs) for pairs of quantities in the
dataset.
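
The column statistics, Spearman correlation matrix and 2D kernel density
estimates described on this page can be approximated locally. The following is
a minimal sketch, assuming that pandas, NumPy and SciPy are installed. It does
not call `Lens` itself: ``summarise_columns`` and ``pairwise_density`` are
hypothetical helpers introduced for illustration, the data is synthetic, and
the bandwidth and grid choices are SciPy defaults rather than anything `Lens`
is known to use, so results may differ in detail from an actual report.

.. code-block:: python

    import numpy as np
    import pandas as pd
    from scipy.stats import gaussian_kde


    def summarise_columns(df):
        """Hypothetical helper: per-column statistics similar to the Columns tab."""
        return pd.DataFrame({
            "TYPE": df.dtypes.astype(str),
            "VALID": df.notnull().sum(),   # number of non-null entries
            "NULL": df.isnull().sum(),     # number of null entries
            "DISTINCT": df.nunique(),      # distinct non-null values (pandas default)
            "CATEGORICAL": [
                "No" if pd.api.types.is_numeric_dtype(dtype) else "Yes"
                for dtype in df.dtypes
            ],
        })


    def pairwise_density(x, y, grid_size=50):
        """Hypothetical helper: evaluate a 2D Gaussian KDE on a regular grid."""
        kde = gaussian_kde(np.vstack([x, y]))
        xs = np.linspace(x.min(), x.max(), grid_size)
        ys = np.linspace(y.min(), y.max(), grid_size)
        xx, yy = np.meshgrid(xs, ys)
        points = np.vstack([xx.ravel(), yy.ravel()])
        return xs, ys, kde(points).reshape(xx.shape)


    # Synthetic data (an assumption for illustration): hours studied and test scores.
    rng = np.random.default_rng(0)
    hours = rng.uniform(0, 10, size=200)
    score = 5 * hours + rng.normal(0, 5, size=200)
    school = ["A", "B"] * 100
    school[0] = None  # a single null entry in a non-numeric column

    df = pd.DataFrame({"hours": hours, "score": score, "school": school})

    summary = summarise_columns(df)
    # Spearman rank-order correlation between the numeric columns only.
    corr = df.select_dtypes("number").corr(method="spearman")
    # Joint density estimate for one pair of numeric columns.
    xs, ys, density = pairwise_density(df["hours"], df["score"])

    print(summary)
    print(corr)

In this sketch, ``VALID`` and ``NULL`` sum to the number of rows for every
column, and the diagonal of ``corr`` is 1, matching the properties described
above. The ``density`` array can be passed to a plotting function such as
Matplotlib's ``contourf`` to obtain a picture comparable to one panel of the
pairwise density plot.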