How to import Kaggle Dataset in Google Colaboratory
Google Colaboratory, or Colab, is a free Jupyter notebook environment that requires no setup. It comes with many ML libraries pre-installed and lets you install additional packages from within the notebook. It is one of the best available resources for using GPUs and TPUs at no cost.
Kaggle, also a Google subsidiary, is an online community for data scientists and enthusiasts. It is often necessary to work with a Kaggle dataset in a Colab notebook. Here I will discuss the easiest way to import and use Kaggle datasets in a Colab environment.
Kaggle API Setup
The Kaggle API provides command-line access to Kaggle datasets. To install it, execute the following command:
! pip install kaggle
To use the Kaggle API, we need API credentials. Sign up for a Kaggle account at https://www.kaggle.com. Then go to the ‘Account’ tab of your user profile (https://www.kaggle.com/<username>/account) and select ‘Create API Token’. This will trigger the download of kaggle.json, a file containing your API credentials.
Now upload this file to the Colab session:
from google.colab import files
files.upload()
Executing this cell shows a file-selection button. Click it, then browse to and select the kaggle.json file.
The following commands create the ~/.kaggle folder if it does not exist, then copy kaggle.json into it. The chmod command restricts access so that other users cannot read the file.
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
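The same three steps can also be done from Python, which may be handy if you prefer to keep the setup in one function. This is only a minimal sketch: the function name install_kaggle_credentials and its home parameter are hypothetical helpers introduced for illustration, and the sketch assumes kaggle.json was uploaded to the current working directory.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src="kaggle.json", home=None):
    """Copy the token file into <home>/.kaggle and make it owner-only.

    `home` defaults to the current user's home directory; it is a
    parameter only so the function can be tried against a scratch folder.
    """
    home = Path(home) if home is not None else Path.home()
    config_dir = home / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)   # like: mkdir -p ~/.kaggle
    target = config_dir / "kaggle.json"
    shutil.copy(src, target)                        # like: cp kaggle.json ~/.kaggle/
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)   # like: chmod 600
    return target
```

Calling install_kaggle_credentials() with no arguments should leave the file in the same place as the shell commands above.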
Alternatively, if you don’t want to upload a credential file, you can export your Kaggle username and token as environment variables:
import os
os.environ['KAGGLE_USERNAME'] = 'xxxxxxxxx'  # your Kaggle username, as a string
os.environ['KAGGLE_KEY'] = 'xxxxxxxxxxxxxx'  # the key from your API token, as a string
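If you would rather not leave the token visible in a notebook cell, one option is to prompt for it at run time. The sketch below is an illustration, not part of the Kaggle tooling: export_kaggle_credentials is a hypothetical helper, and its environ parameter exists only so it can be tried against a plain dict.

```python
import os
from getpass import getpass


def export_kaggle_credentials(username, key, environ=None):
    """Store the Kaggle username and key in the given environment mapping."""
    environ = os.environ if environ is None else environ
    if not username or not key:
        raise ValueError("both a username and a key are required")
    environ["KAGGLE_USERNAME"] = username
    environ["KAGGLE_KEY"] = key


# Interactive use in a notebook cell (left commented out so the block
# can be run without prompting):
# export_kaggle_credentials(input("Kaggle username: "), getpass("Kaggle key: "))
```

With getpass, the key is typed into a masked prompt instead of being saved in the notebook source.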
Next, check the installed version:
! kaggle --version
If it is below 1.5.0, update it with the following command:
! pip install --upgrade --force-reinstall --no-deps kaggle
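The "below 1.5.0" check can be automated. The following is a rough sketch that assumes the version output contains a dotted version number of the form X.Y.Z; needs_upgrade is a hypothetical helper name. It compares versions as tuples of integers, because a plain string comparison would wrongly rank 1.10.0 below 1.5.0.

```python
import re


def needs_upgrade(version_output, minimum=(1, 5, 0)):
    """Return True if the version found in the text is older than `minimum`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no X.Y.Z version number found in: %r" % version_output)
    installed = tuple(int(part) for part in match.groups())
    return installed < minimum   # tuples compare element by element
```

In a Colab cell, the text could come from capturing the command output, for example with out = !kaggle --version followed by needs_upgrade(out[0]).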
That’s it! Now you are ready to download datasets from Kaggle.
Search Dataset
To list the datasets available through the API, execute the following command:
! kaggle datasets list
The command prints a table of the available datasets.
The command-line tool supports the following commands:
kaggle competitions {list, files, download, submit, submissions, leaderboard}
kaggle datasets {list, files, download, create, version, init}
kaggle kernels {list, init, push, pull, output, status}
kaggle config {view, set, unset}
Let’s play with some commands.
If we want to find health-related competitions, we can pass the search term health with the -s option:
! kaggle competitions list -s health
Similarly, to list the competitions in the getting started category, execute this command:
! kaggle competitions list --category gettingStarted
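When the search term or category comes from a variable, it can be convenient to assemble the command in Python and run it with subprocess. This is a small sketch under that assumption; competitions_list_command is a hypothetical helper, and it only uses the -s and --category options shown above (the long option is spelled with two ASCII hyphens).

```python
def competitions_list_command(search=None, category=None):
    """Build the argument list for `kaggle competitions list`."""
    cmd = ["kaggle", "competitions", "list"]
    if search:
        cmd += ["-s", search]              # search term
    if category:
        cmd += ["--category", category]    # e.g. gettingStarted
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems when the search term contains spaces.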
Competition Dataset
To download a specific competition dataset, you first have to join the competition or accept its rules at http://www.kaggle.com/c/<competition-name>/rules. Then download it with this command:
! kaggle competitions download favorita-grocery-sales-forecasting
Or specify a filename to download a single file:
! kaggle competitions download favorita-grocery-sales-forecasting -f test.csv.7z
Colab does not keep files on the runtime's local disk between sessions, and a previous session cannot be restored. With this approach, however, we can put the download commands in a cell at the top of the notebook, fetch the dataset from Kaggle at the start of each session, and then execute the rest of the notebook sequentially.
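Such a first cell might look like the sketch below, which skips the download when the file is already present so that re-running the cell within a session is cheap. The helper name ensure_competition_file, the data folder, and the runner parameter are assumptions made for illustration; the sketch also assumes the CLI saves the file under the requested name inside the folder given with -p.

```python
import subprocess
from pathlib import Path


def ensure_competition_file(competition, filename, dest="data",
                            runner=subprocess.run):
    """Download one competition file unless it already exists locally.

    Returns (path, downloaded), where `downloaded` tells whether the
    kaggle CLI was invoked. `runner` is injectable for dry runs.
    """
    dest = Path(dest)
    target = dest / filename
    if target.exists():
        return target, False
    dest.mkdir(parents=True, exist_ok=True)
    runner(
        ["kaggle", "competitions", "download", competition,
         "-f", filename, "-p", str(dest)],
        check=True,  # raise if the CLI exits with an error
    )
    return target, True
```

For the example above, the call would be ensure_competition_file("favorita-grocery-sales-forecasting", "test.csv.7z").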
Happy coding!