How to import Kaggle Dataset in Google Colaboratory

Google Colaboratory, or Colab, is a free Jupyter notebook environment that requires zero user configuration. It comes with many ML libraries pre-installed and lets you install additional packages from within the notebook. It also gives free access to GPUs and TPUs.

Kaggle, also a Google subsidiary, is an online community for data scientists and enthusiasts. It is often necessary to work with a Kaggle dataset in a Colab notebook. Here I will discuss the easiest method to import and use a Kaggle dataset in a Colab environment.

Kaggle API Setup

The Kaggle API provides command-line access to Kaggle datasets. To install it, execute the following command:

! pip install kaggle

To use the Kaggle API, we need API credentials. For that, sign up for a Kaggle account at https://www.kaggle.com. Then go to the ‘Account’ tab of your user profile (https://www.kaggle.com/<username>/account) and select ‘Create API Token’. This will trigger the download of kaggle.json, a file containing your API credentials.

Now upload this file to the Colab session:

from google.colab import files
files.upload()

Executing this displays a file-browse button. Click it and select the kaggle.json file.

[Figure: Upload kaggle.json file]

The following commands create the ~/.kaggle folder if it does not exist, then copy the kaggle.json file into this Kaggle config folder. To keep other users from reading the credentials, we restrict access with the chmod command.

! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
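The same three steps can also be done from Python instead of shell commands. The following is a minimal sketch, not part of the Kaggle tooling: install_kaggle_token and its config_dir parameter are hypothetical names chosen for illustration, and the demo writes a placeholder token to a temporary folder rather than to ~/.kaggle.

```python
import os
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_bytes, config_dir=None):
    """Write the token to <config_dir>/kaggle.json, readable by the owner only.

    Hypothetical helper: it mirrors the mkdir / cp / chmod commands above.
    """
    # Default to ~/.kaggle, the folder used by the shell commands above.
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)  # like: mkdir -p ~/.kaggle
    target = config_dir / "kaggle.json"
    target.write_bytes(token_bytes)                # like: cp kaggle.json ~/.kaggle/
    os.chmod(target, 0o600)                        # like: chmod 600 ...
    return target


# Demo in a throwaway folder with a placeholder token (not real credentials).
demo_dir = tempfile.mkdtemp()
demo_path = install_kaggle_token(
    b'{"username": "xxxxxxxxx", "key": "xxxxxxxxxxxxxx"}', demo_dir
)
demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
```

In Colab, files.upload() returns a dictionary that maps each uploaded filename to its bytes, so a call such as install_kaggle_token(uploaded["kaggle.json"]) should work when uploaded holds that return value.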

Alternatively, if you don’t want to import a credential file for this purpose, you can also choose to export your Kaggle username and token to the environment:

import os
# Replace the placeholders with your own username and API key (both must be strings)
os.environ['KAGGLE_USERNAME'] = 'xxxxxxxxx'
os.environ['KAGGLE_KEY'] = 'xxxxxxxxxxxxxx'
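If you already have the contents of kaggle.json as text, the two variables can be filled from it rather than typed by hand. This is a rough sketch under that assumption: export_kaggle_credentials is a hypothetical helper, and it relies on the file using the field names "username" and "key". The demo writes into a plain dictionary with placeholder values so that no real environment is touched.

```python
import json
import os


def export_kaggle_credentials(json_text, environ=os.environ):
    """Copy username and key from kaggle.json text into environment variables.

    Hypothetical helper; pass a plain dict as `environ` to try it out safely.
    """
    creds = json.loads(json_text)
    environ["KAGGLE_USERNAME"] = creds["username"]
    environ["KAGGLE_KEY"] = creds["key"]
    return environ


# Demo with placeholder values and a plain dict instead of os.environ.
demo_env = export_kaggle_credentials(
    '{"username": "xxxxxxxxx", "key": "xxxxxxxxxxxxxx"}', environ={}
)
```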

Next, check the installed version:

! kaggle --version

If it is below 1.5.0, then update with the following command,

! pip install --upgrade --force-reinstall --no-deps kaggle
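The version check can also be automated, so the upgrade only runs when it is needed. The sketch below is one possible way to do it; parse_version, needs_upgrade and installed_kaggle_version are hypothetical helper names, and the comparison only looks at the leading digits of the first three version components.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '1.5.12' into (1, 5, 12); missing components count as 0."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if not match:
            break  # stop at the first component with no leading digits
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def needs_upgrade(installed, minimum="1.5.0"):
    """Return True when the installed version is below the minimum."""
    return parse_version(installed) < parse_version(minimum)


def installed_kaggle_version():
    """Return the installed kaggle package version, or None if it is absent."""
    try:
        return metadata.version("kaggle")
    except metadata.PackageNotFoundError:
        return None
```

Comparing tuples of integers rather than raw strings matters here: as strings, "1.10.0" would sort below "1.5.0".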

That’s it! Now you are ready to download datasets from Kaggle, including competition data.

Search Dataset

To list the datasets available through the API, execute the following command:

! kaggle datasets list

The output of this command is a table of the available datasets.

[Figure: Available dataset list]

The command-line tool supports the following commands:

kaggle competitions {list, files, download, submit, submissions, leaderboard}
kaggle datasets {list, files, download, create, version, init}
kaggle kernels {list, init, push, pull, output, status}
kaggle config {view, set, unset}

Let’s play with some commands.

If we want to find health-related competitions, we can pass the search term health with the -s option, like this:

! kaggle competitions list -s health

Similarly, it is easy to list the competitions in the getting-started category by executing this command,

! kaggle competitions list --category gettingStarted

Competition Dataset

To download a specific competition dataset, you first have to join the competition or accept its rules at http://www.kaggle.com/c/<competition-name>/rules. Then download with this command,

! kaggle competitions download favorita-grocery-sales-forecasting

Or use a specific filename to download the file directly,

! kaggle competitions download favorita-grocery-sales-forecasting -f test.csv.7z

Google Colab doesn’t provide any way to keep a dataset for a new session or to restore a previous session. With the Kaggle API, however, we can put the download command in a cell at the start of the notebook, so that each new session fetches the dataset directly from Kaggle before the rest of the notebook runs sequentially.
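Such a start-of-session cell could look like the sketch below, which skips the download when the target folder already holds files. This is only an illustration: ensure_competition_data, its run parameter and the folder name are hypothetical, the -p option is used here to set the download folder, and the demo swaps in a fake runner so nothing is actually downloaded.

```python
import subprocess
import tempfile
from pathlib import Path


def ensure_competition_data(competition, dest, run=subprocess.run):
    """Download a competition's files into dest unless dest already has files.

    Hypothetical helper. Returns True if a download was started, else False.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if any(dest.iterdir()):
        return False  # something is already there; skip the download
    run(
        ["kaggle", "competitions", "download", competition, "-p", str(dest)],
        check=True,
    )
    return True


# Demo with a fake runner that only records the command it was given.
calls = []


def fake_run(cmd, check=True):
    calls.append(cmd)


demo_dest = Path(tempfile.mkdtemp()) / "favorita"
first = ensure_competition_data(
    "favorita-grocery-sales-forecasting", demo_dest, run=fake_run
)
(demo_dest / "test.csv.7z").touch()  # pretend a file was downloaded
second = ensure_competition_data(
    "favorita-grocery-sales-forecasting", demo_dest, run=fake_run
)
```

In a real notebook the run argument would be left at its default, so the kaggle command line tool is called directly.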

Happy coding!
