carpyncho module

Python client for the Carpyncho VVV dataset collection.

This client gives access, as Pandas DataFrames, to all the data available in the web version of Carpyncho: https://carpyncho.github.io/.

class carpyncho.Carpyncho(cache: Union[diskcache.core.Cache, diskcache.fanout.FanoutCache] = NOTHING, cache_expire: float = None, parquet_engine: str = 'auto', index_url: str = 'https://raw.githubusercontent.com/carpyncho/carpyncho-py/master/data/index.json')

Bases: object

Client to access the Carpyncho VVV dataset collection.

This client gives access, as Pandas DataFrames, to all the data available in the web version of Carpyncho: https://carpyncho.github.io/.

Parameters:
  • cache (diskcache.Cache, diskcache.FanoutCache or None (default: None)) – Any instance of diskcache.Cache or diskcache.FanoutCache, or None. If None, a diskcache.Cache instance is created with the parameter directory = carpyncho.DEFAULT_CACHE_DIR. More information: http://www.grantjenks.com/docs/diskcache
  • cache_expire (float or None (default: None)) – Seconds until a cached item expires (None means no expiry). More information: http://www.grantjenks.com/docs/diskcache
  • parquet_engine (str (default: ‘auto’)) – Parquet library to use. Carpyncho stores all the data remotely as compressed Parquet files; once a file has been downloaded, it must be parsed with a Parquet library. If ‘auto’, then the pandas option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.
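
The ‘auto’ fallback described for parquet_engine can be sketched with the standard library alone. The helper below is hypothetical: resolve_parquet_engine and its is_available parameter are not part of carpyncho or pandas, and the code only mirrors the documented order (‘pyarrow’ first, ‘fastparquet’ second).

```python
import importlib.util


def resolve_parquet_engine(engine="auto", is_available=None):
    """Return the Parquet engine name that would be picked for `engine`.

    Hypothetical helper, not part of carpyncho or pandas.
    """
    if is_available is None:
        # By default an engine counts as available when its top-level
        # module can be located (nothing is actually imported).
        def is_available(name):
            return importlib.util.find_spec(name) is not None

    if engine != "auto":
        # An explicit choice is passed through unchanged.
        return engine

    # Documented order: try 'pyarrow', then fall back to 'fastparquet'.
    for candidate in ("pyarrow", "fastparquet"):
        if is_available(candidate):
            return candidate
    raise ImportError("neither 'pyarrow' nor 'fastparquet' is installed")
```

Calling resolve_parquet_engine() before building the client is one way to find out which library a download would be parsed with on the current machine.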
cache = None

Local cache of the carpyncho database.

cache_expire = None

Default timeout of the catalog cache. Prefer leaving it as None (the default): the catalogs are large and rarely change.

catalog_info(tile, catalog)

Retrieve the information about a given catalog.

Parameters:
  • tile (str) – The name of the tile.
  • catalog – The name of the catalog.
Returns:

The complete information about the given catalog file. This includes the drive id, MD5 checksum, size in bytes, total number of records, etc.

Return type:

dict

Raises:

ValueError: – If the tile or the catalog is not found.
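
Because the catalog information includes an MD5 checksum and a size in bytes, downloaded bytes can be checked against it. This is a minimal sketch, assuming the checksum is a hexadecimal MD5 digest. The function name and its parameters are hypothetical, and this page does not show how carpyncho itself performs the check or which dict keys hold these values.

```python
import hashlib


def matches_catalog_info(payload, expected_md5, expected_size=None):
    """Check raw catalog bytes against an expected MD5 digest and size.

    Hypothetical helper: `payload` is the downloaded file content as bytes.
    """
    # A size mismatch is cheap to detect, so test it before hashing.
    if expected_size is not None and len(payload) != expected_size:
        return False
    digest = hashlib.md5(payload).hexdigest()
    # Hex digests are compared case-insensitively.
    return digest == expected_md5.lower()
```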

get_catalog(tile, catalog, force=False)

Retrieve a catalog from the carpyncho dataset.

Parameters:
  • tile (str) – The name of the tile.
  • catalog – The name of the catalog.
  • force (bool (default=False)) – If True, the cached version of the catalog is ignored and the catalog is downloaded again. Prefer leaving force set to False.
Returns:

The requested catalog. The columns of the DataFrame vary from one catalog to another.

Return type:

pandas.DataFrame

Raises:
  • ValueError: – If the tile or the catalog is not found.
  • IOError: – If the checksum does not match.
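
One possible way to combine force with the IOError above is to retry a single time with the cache bypassed. This is a sketch under an assumption not stated in the documentation, namely that a checksum mismatch can come from a damaged cached or downloaded copy. load_catalog is a hypothetical wrapper; client is any object exposing get_catalog with the signature shown here.

```python
def load_catalog(client, tile, catalog):
    """Fetch a catalog, re-downloading once if the checksum check fails.

    Hypothetical wrapper around client.get_catalog(tile, catalog, force).
    A ValueError (unknown tile or catalog) is deliberately not caught.
    """
    try:
        # Normal path: reuse the cached copy when there is one.
        return client.get_catalog(tile, catalog)
    except IOError:
        # Checksum mismatch: ignore the cache and download again, once.
        return client.get_catalog(tile, catalog, force=True)
```

With a real client this would be called as load_catalog(carpyncho.Carpyncho(), tile, catalog), where tile and catalog are names returned by list_tiles() and list_catalogs().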
has_catalog(tile, catalog)

Check whether a given tile and catalog combination exists.

Parameters:
  • tile (str) – The name of the tile.
  • catalog – The name of the catalog.
Returns:

True if the tile+catalog combination exists.

Return type:

bool
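
Since has_catalog returns a bool instead of raising, it can filter a list of candidate requests before anything is downloaded. A minimal sketch; existing_pairs is a hypothetical helper and client is any object with the has_catalog method documented above.

```python
def existing_pairs(client, pairs):
    """Keep only the (tile, catalog) pairs that the client reports as present.

    Hypothetical helper; `pairs` is an iterable of (tile, catalog) tuples.
    """
    return [
        (tile, catalog)
        for tile, catalog in pairs
        if client.has_catalog(tile, catalog)
    ]
```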

index_

Structure of the Carpyncho dataset information as a Python dict.

index_url = None

Location of the carpyncho index (useful for development).

list_catalogs(tile)

Retrieve the available catalogs for a given tile.

Parameters:
  • tile (str) – The name of the tile whose catalogs are retrieved.
Returns:

The names of the available catalogs in the given tile.

Return type:

tuple of str

Raises:

ValueError: – If the tile is not found.

list_tiles()

Retrieve available tiles with catalogs as a tuple of str.
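
list_tiles and list_catalogs can be combined to get an overview of everything that is available. A minimal sketch; catalogs_by_tile is a hypothetical helper and client is any object with the two methods documented above.

```python
def catalogs_by_tile(client):
    """Map every available tile name to the tuple of its catalog names.

    Hypothetical helper built only on list_tiles() and list_catalogs(tile).
    """
    return {
        tile: tuple(client.list_catalogs(tile))
        for tile in client.list_tiles()
    }
```

With a real client the mapping would be built as catalogs_by_tile(carpyncho.Carpyncho()).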

parquet_engine = None

Default Parquet library to use.

retrieve_index(reset)

Access the remote index of the Carpyncho-Dataset.

The index is cached internally for 1 hour.

Parameters:
  • reset (bool) – If True, the entire cache is ignored and a new index is downloaded and cached.
Returns:

The index structure.

Return type:

dict

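
The one-hour caching and the reset flag can be illustrated with a small in-memory stand-in. This is not carpyncho's implementation (the client stores its data in a diskcache cache). IndexCache, fetch, ttl, and clock are hypothetical names used only to show the behavior described above.

```python
import time


class IndexCache:
    """Keep one downloaded index in memory for a fixed number of seconds."""

    def __init__(self, fetch, ttl=3600.0, clock=time.monotonic):
        self._fetch = fetch      # callable that downloads and returns the index
        self._ttl = ttl          # lifetime in seconds (3600 s = 1 hour)
        self._clock = clock      # injectable so expiry can be tested
        self._value = None
        self._stored_at = None   # None means nothing has been cached yet

    def retrieve(self, reset=False):
        """Return the cached index, downloading it again if needed."""
        now = self._clock()
        expired = self._stored_at is None or now - self._stored_at >= self._ttl
        if reset or expired:
            # reset=True bypasses the cached copy even if it is still fresh.
            self._value = self._fetch()
            self._stored_at = now
        return self._value
```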
carpyncho.CARPYNCHOPY_DATA_PATH = PosixPath('/home/docs/carpyncho_py_data')

Directory where carpyncho stores all its data.