carpyncho module

Python client for Carpyncho VVV dataset collection.

This code provides access, as a Pandas DataFrame, to all the data of the web version of Carpyncho: https://carpyncho.github.io/.

class carpyncho.Carpyncho(cache_path: str = PosixPath('/home/docs/carpyncho_py_data/_cache_'), cache_expire: float = None, parquet_engine: str = 'auto', index_url: str = 'https://raw.githubusercontent.com/carpyncho/carpyncho-py/master/data/index.json')

Bases: object

Client to access the Carpyncho VVV dataset collection.

This code provides access, as a Pandas DataFrame, to all the data of the web version of Carpyncho: https://carpyncho.github.io/.

Parameters:
  • cache (diskcache.Cache, diskcache.Fanout or None (default: None)) – Any instance of diskcache.Cache or diskcache.Fanout, or None (default). If None, a diskcache.Cache instance is created with the parameter directory = carpyncho.DEFAULT_CACHE_DIR. More information: http://www.grantjenks.com/docs/diskcache
  • cache_expire (float or None (default=None)) – Seconds until an item expires (default None, no expiry). More information: http://www.grantjenks.com/docs/diskcache
  • parquet_engine (str (default="auto")) – Default Parquet library to use. Carpyncho stores all its data remotely as compressed Parquet files, which must be parsed after they are downloaded. If 'auto', the pandas option io.parquet.engine is used. The default io.parquet.engine behavior is to try 'pyarrow', falling back to 'fastparquet' if 'pyarrow' is unavailable.
cache

Return the internal cache of the client.

cache_expire = None

Default timeout of the catalog cache. Prefer leaving it as None (default): the catalogs are large and rarely change.

cache_path = None

Location of the catalog cache

catalog_info(tile, catalog)

Retrieve the information about a given catalog.

Parameters:
  • tile (str) – The name of the tile.
  • catalog – The name of the catalog.
Returns:

The complete information for the given catalog file. This includes the URL, md5 checksum, size in bytes, total number of records, etc.

Return type:

dict

Raises:

ValueError: – If the tile or the catalog is not found.

get_catalog(tile, catalog, force=False)

Retrieve a catalog from the carpyncho dataset.

Parameters:
  • tile (str) – The name of the tile.
  • catalog – The name of the catalog.
  • force (bool (default=False)) – If True, the cached version of the catalog is ignored and the catalog is downloaded again. Prefer leaving force as False.
Returns:

The requested catalog. The columns of the DataFrame vary between catalogs.

Return type:

pandas.DataFrame

Raises:
  • ValueError: – If the tile or the catalog is not found.
  • IOError: – If the checksum does not match.
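
The IOError above is raised when the downloaded bytes do not agree with the md5 checksum recorded for the catalog. A minimal sketch of that kind of check, using only the standard library. The helper name verify_md5 and its error message are hypothetical, and the client's internal implementation may differ:

```python
import hashlib


def verify_md5(payload: bytes, expected_md5: str) -> None:
    """Raise IOError if the md5 digest of `payload` differs from `expected_md5`.

    Hypothetical helper: it illustrates the kind of check described above,
    not the client's actual code.
    """
    actual = hashlib.md5(payload).hexdigest()
    if actual != expected_md5.lower():
        raise IOError(f"checksum mismatch: expected {expected_md5}, got {actual}")


# Example: a payload checked against its own digest passes silently.
data = b"example catalog bytes"
verify_md5(data, hashlib.md5(data).hexdigest())
```

If the check fails, calling get_catalog again with force=True is the documented way to ignore the cached copy and download the catalog again.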
has_catalog(tile, catalog)

Check if a given catalog and tile exists.

Parameters:
  • tile (str) – The name of the tile.
  • catalog – The name of the catalog.
Returns:

True if the combination tile+catalog exists.

Return type:

bool
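
Because catalog_info and get_catalog raise ValueError for unknown names, has_catalog can serve as a guard before calling them. A small sketch, assuming client is any object that exposes the methods documented here (for example an instance of carpyncho.Carpyncho). The helper name catalog_info_or_none is hypothetical:

```python
def catalog_info_or_none(client, tile, catalog):
    """Return the catalog information, or None if tile+catalog is unknown.

    Hypothetical helper; `client` is expected to provide the has_catalog()
    and catalog_info() methods documented above.
    """
    if not client.has_catalog(tile, catalog):
        return None
    return client.catalog_info(tile, catalog)
```

Returning None instead of letting ValueError propagate is a design choice for callers that probe many tile and catalog names; code that expects the catalog to exist may prefer the exception.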

index_

Structure of the Carpyncho dataset information as a Python-dict.

index_url = None

Location of the Carpyncho index (useful for development).

list_catalogs(tile)

Retrieve the available catalogs for a given tile.

Parameters: tile (str) – The name of the tile to retrieve the catalogs.
Returns: The names of available catalogs in the given tile.
Return type: tuple of str
Raises: ValueError: – If the tile is not found.
list_tiles()

Retrieve available tiles with catalogs as a tuple of str.
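
list_tiles and list_catalogs can be combined to walk the whole collection. A short sketch, assuming client is any object that exposes those two methods as documented here (for example an instance of carpyncho.Carpyncho). The generator name iter_catalogs is hypothetical:

```python
def iter_catalogs(client):
    """Yield (tile, catalog) pairs for every catalog the client lists.

    Hypothetical helper built only on list_tiles() and list_catalogs().
    """
    for tile in client.list_tiles():
        for catalog in client.list_catalogs(tile):
            yield tile, catalog
```

Each yielded pair can then be passed to catalog_info or get_catalog. Since the catalogs are large, inspecting catalog_info first is likely cheaper than downloading everything.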

parquet_engine = None

Default Parquet library to use.

retrieve_index(reset)

Access the remote index of the Carpyncho-Dataset.

The index is stored internally for 1 hr.

Parameters: reset (bool) – If True, the entire cache is ignored and a new index is downloaded and cached.
Returns: The index structure.
Return type: dict
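
The one-hour lifetime of the index, together with the reset flag, amounts to time-based expiry with a manual override. The following is a standard-library sketch of that idea only, not the client's actual implementation. The class ExpiringValue and its fetch, ttl, and clock parameters are hypothetical:

```python
import time


class ExpiringValue:
    """Hold one value and fetch it again once it is older than `ttl` seconds.

    Hypothetical illustration of time-based expiry with a reset override.
    """

    def __init__(self, fetch, ttl=3600.0, clock=time.monotonic):
        self._fetch = fetch    # callable that produces a fresh value
        self._ttl = ttl        # lifetime in seconds (3600 s = 1 hour)
        self._clock = clock    # injectable clock, handy for testing
        self._value = None
        self._stamp = None     # time of the last fetch, None before the first

    def get(self, reset=False):
        """Return the stored value, fetching again if expired or `reset` is True."""
        now = self._clock()
        expired = self._stamp is None or now - self._stamp >= self._ttl
        if reset or expired:
            self._value = self._fetch()
            self._stamp = now
        return self._value
```

Here get(reset=True) plays the role that reset=True plays in retrieve_index: the stored value is ignored and replaced by a fresh one.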
carpyncho.CARPYNCHOPY_DATA_PATH = PosixPath('/home/docs/carpyncho_py_data')

Directory where carpyncho stores all its data.