Introduction to NCI’s Data Catalogue and Indexing Schemes
Event description
We’re hosting a tutorial to introduce the NCI data catalogue and its two indexing schemes: Intake-ESM and Intake-Spark.
A data catalogue helps users discover and access datasets through structured metadata, while indexing improves performance by enabling fast, targeted searches. Built on the Python Intake package, these tools support scalable, memory-efficient access to large datasets. At NCI, Intake-Spark uses Parquet-based indexes for high-performance querying with Spark, while Intake-ESM uses lightweight CSV-based indexes ideal for climate data workflows.
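To give a flavour of how the Intake-ESM scheme is used, the short sketch below opens a CSV-backed datastore and searches its index. It is a minimal illustration only: the catalogue path and search facets are placeholders, not the actual NCI values.

```python
import intake  # requires the intake-esm plugin to be installed

# Open an Intake-ESM datastore from its JSON descriptor (path is a placeholder)
cat = intake.open_esm_datastore("/g/data/<project>/catalogue/my_catalogue.json")

# Search the CSV-based index by metadata facets (facet names are illustrative)
subset = cat.search(variable_id="tas", experiment_id="historical", frequency="mon")
print(subset.df.head())  # the matching rows of the underlying index
```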
This session will include hands-on Jupyter Notebook examples showing how to use the catalogue in data analysis and machine learning workflows. You’ll learn how to search, load, and filter datasets efficiently from the /g/data collections.
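Continuing the sketch above, search results can be loaded lazily into xarray objects and filtered before analysis or machine learning feature extraction. The variable and dimension names below are assumptions for illustration and will depend on the dataset you query.

```python
# Load the search results as a dictionary of xarray Datasets (lazy, Dask-backed)
dset_dict = subset.to_dataset_dict()

# Pick one dataset and subset it to a time slice before further analysis
ds = next(iter(dset_dict.values()))
tas_1990s = ds["tas"].sel(time=slice("1990-01-01", "1999-12-31"))
print(tas_1990s)
```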
The tutorial is ideal for researchers working with large-scale data or looking to streamline their pipelines.
If you have any questions regarding this training, please contact training.nci@anu.edu.au.
Prerequisites
- Experience with Python.
- Experience with bash or similar Unix shells.
- A valid NCI account.
- Experience using the NCI ARE service is recommended. You can find the relevant documentation in the ARE User Guide.
Learning Outcomes
After this training session, you will be able to:
- Understand NCI data services
- Understand the NCI data catalogue and its indexing schemes
- Search, load, and filter datasets efficiently from the /g/data collections
- Use the data catalogue in data analysis and machine learning workflows
Topics Covered
- Welcome and Introduction to NCI’s Intake-Spark and Intake-ESM Indexing Schemes
- Overview of NCI’s Data Catalogue Services
- Working with the Intake-ESM Indexing Scheme
- Applying the Intake-ESM Scheme in AI/ML Workflows
- Using the Intake-Spark Indexing Scheme