Introduction to NCI’s Data Catalogue and Indexing Schemes
Event description
We’re hosting a tutorial to introduce the NCI data catalogue and its two indexing schemes: Intake-ESM and Intake-Spark.
A data catalogue helps users discover and access datasets through structured metadata, while indexing improves performance by enabling fast, targeted searches. Built on the Python Intake package, these tools support scalable, memory-efficient access to large datasets. At NCI, Intake-Spark uses Parquet-based indexes for high-performance querying with Spark, while Intake-ESM uses lightweight CSV-based indexes ideal for climate data workflows.
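To give a flavour of how the Intake-ESM scheme is used, the short sketch below opens a CSV-backed datastore and searches its index. It is a minimal illustration only: the catalogue path and search facets are placeholders, not the actual NCI values.

```python
import intake  # requires the intake-esm plugin to be installed

# Open an Intake-ESM datastore from its JSON descriptor (path is a placeholder)
cat = intake.open_esm_datastore("/g/data/<project>/catalogue/my_catalogue.json")

# Search the CSV-based index by metadata facets (facet names are illustrative)
subset = cat.search(variable_id="tas", experiment_id="historical", frequency="mon")
print(subset.df.head())  # the matching rows of the underlying index
```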
This session will include hands-on Jupyter Notebook examples showing how to use the catalogue in data analysis and machine learning workflows. You’ll learn how to search, load, and filter datasets efficiently from the /g/data collections.
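Continuing the sketch above, search results can be loaded lazily into xarray objects and filtered before analysis or machine learning feature extraction. The variable and dimension names below are assumptions for illustration and will depend on the dataset you query.

```python
# Load the search results as a dictionary of xarray Datasets (lazy, Dask-backed)
dset_dict = subset.to_dataset_dict()

# Pick one dataset and subset it to a time slice before further analysis
ds = next(iter(dset_dict.values()))
tas_1990s = ds["tas"].sel(time=slice("1990-01-01", "1999-12-31"))
print(tas_1990s)
```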
The tutorial is ideal for researchers working with large-scale data or looking to streamline their pipelines.
If you have any questions regarding this training, please contact training.nci@anu.edu.au.
Prerequisites
- Experience with Python.
- Experience with bash or similar Unix shells.
- A valid NCI account.
- Experience using the NCI ARE service is recommended. You can find the relevant documentation in the ARE User Guide.
Learning Outcomes
After this training session, you will be able to:
- Understand NCI data services
- Understand the NCI data catalogue and its indexing schemes
- Search, load, and filter datasets efficiently from the /g/data collections
- Use the data catalogue in data analysis and machine learning workflows
Topics Covered
- Welcome and Introduction to NCI’s Intake-Spark and Intake-ESM Indexing Schemes
- Overview of NCI’s Data Catalogue Services
- Working with the Intake-ESM Indexing Scheme
- Applying the Intake-ESM Scheme in AI/ML Workflows
- Using the Intake-Spark Indexing Scheme