In this tutorial, we explain how to use the data tool to extract information about remote datasets, preview tabular data, and download it. We assume you already have data installed. If not, please visit this page: https://datahub.io/docs/getting-started/installing-data.
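
Before continuing, you may want to confirm that the tool is actually on your PATH. The following is a minimal sketch, assuming the tool is installed under the command name data as used throughout this tutorial; have_cmd is a hypothetical helper name, not part of the data tool:

```shell
# Hypothetical helper: succeeds if the given command is found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Report whether the data tool is available (assumes the command is named "data").
if have_cmd data; then
  echo "data tool found"
else
  echo "data tool not found; see the installation page linked above"
fi
```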

For this tutorial, we’ll use the Global CO2 Emissions dataset from the DataHub:

https://datahub.io/core/co2-fossil-global

Extract a summary of a dataset

Using the info command, you can extract summary information about a dataset from a given URL:

data info https://datahub.io/core/co2-fossil-global

which will print out the dataset’s README plus a summary table of the available resources:

Name                   Format  Size   Title
validation_report      json    511
global_csv             csv     6714
global_json            json    37857
co2-fossil-global_zip  zip     11080
global                 csv     6453

You can see that it has the global CSV file, derived CSV and JSON versions of it, a validation report, and a ZIP version of the dataset.

Read more about the derived CSV and JSON versions of tabular data and the ZIP version of datasets: https://datahub.io/docs/features/auto-generated-csv-json-and-zip

Preview tabular data

Let’s preview the global CSV file so we know what the data looks like before downloading it. We can do this with the cat command:

data cat https://datahub.io/core/co2-fossil-global/r/global.csv

which prints out a table so you can see the data.

If you wonder how we constructed the above URL, read the docs about “r” links: https://datahub.io/docs/getting-started/getting-data#perma-urls-for-data.
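
The preview URL used with cat appears to follow the pattern of dataset URL, then /r/, then the resource name and format. The following is a minimal sketch, assuming that pattern also holds for other resources; r_link is a hypothetical helper name, not a data subcommand:

```shell
# Hypothetical helper (not part of the data tool): build an "r" link
# from a dataset URL, a resource name, and a format.
# Assumes the pattern <dataset-url>/r/<name>.<format> seen in this tutorial.
r_link() {
  dataset_url=${1%/}   # drop a single trailing slash, if present
  printf '%s/r/%s.%s\n' "$dataset_url" "$2" "$3"
}

# Rebuild the URL used in the cat command above.
r_link https://datahub.io/core/co2-fossil-global global csv
```

The result can then be passed to data cat in place of a hand-typed URL.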

Download it

Finally, download the dataset using the get command:

data get https://datahub.io/core/co2-fossil-global

which will save all available files in the ./core/co2-fossil-global directory. If you run tree core/co2-fossil-global/, you’ll see the following output:

core/co2-fossil-global/
├── README.md
├── archive
│   └── global.csv
├── data
│   ├── global_csv.csv
│   ├── global_json.json
│   └── validation_report.json
└── datapackage.json

2 directories, 6 files

You can find the original data in the archive directory, while the data directory contains all derived files. If you don’t know what datapackage.json is, please read this document: https://datahub.io/docs/data-packages#datapackagejson.
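
To confirm that a download matches the layout shown in the tree output, you could check for each expected path. The following is a minimal sketch, assuming the file list from that output; check_download is a hypothetical helper name, not a data subcommand:

```shell
# Hypothetical helper: verify that the files listed in the tree output exist.
# Prints one "missing: <path>" line per absent file and returns non-zero
# if anything is missing.
check_download() {
  dir=${1:-core/co2-fossil-global}
  missing=0
  for f in README.md datapackage.json archive/global.csv \
           data/global_csv.csv data/global_json.json \
           data/validation_report.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return "$missing"
}
```

After running data get, call check_download core/co2-fossil-global; a zero exit status with no output means every expected file is present.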

Summary

We hope this tutorial helps you get the most out of the data tool. If you run into any bugs or have suggestions for improvement, feel free to open an issue at https://github.com/datahq/datahub-qa/issues.