Carpyncho client for Python

Python client for Carpyncho VVV dataset collection.

This library accesses, as Pandas DataFrames, all the data of the web version of Carpyncho: https://carpyncho.github.io/.

Code

The entire source code is hosted on GitHub: https://github.com/carpyncho/carpyncho-py/

License

Carpyncho is under The BSD-3 License

The BSD 3-clause license allows you almost unlimited freedom with the software so long as you include the BSD copyright and license notice in it (found in Fulltext).

Citation

If you use Carpyncho in a scientific publication, we would appreciate citations to the following paper:

Cabral, J. B., Ramos, F., Gurovich, S., & Granitto, P. (2020). Automatic Catalog of RRLyrae from ∼ 14 million VVV Light Curves: How far can we go with traditional machine-learning? https://arxiv.org/abs/2005.00220

Bibtex entry

@ARTICLE{2020A&A...642A..58C,
       author = {{Cabral}, J.~B. and {Ramos}, F. and {Gurovich}, S. and {Granitto}, P.~M.},
        title = "{Automatic catalog of RR Lyrae from {\ensuremath{\sim}}14 million VVV light curves: How far can we go with traditional machine-learning?}",
      journal = {\aap},
     keywords = {methods: data analysis, methods: statistical, surveys, catalogs, stars: variables: RR Lyrae, Galaxy: bulge, Astrophysics - Instrumentation and Methods for Astrophysics, Astrophysics - Solar and Stellar Astrophysics, Computer Science - Machine Learning, Statistics - Machine Learning},
         year = 2020,
        month = oct,
       volume = {642},
          eid = {A58},
        pages = {A58},
          doi = {10.1051/0004-6361/202038314},
archivePrefix = {arXiv},
       eprint = {2005.00220},
 primaryClass = {astro-ph.IM},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2020A&A...642A..58C},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

Installation

This is the recommended way to install carpyncho.

Installing with pip

Make sure that the Python interpreter can load carpyncho code. The most convenient way to do this is to use virtualenv, virtualenvwrapper, and pip.

After setting up and activating the virtualenv, run the following command:

$ pip install carpyncho
...

That should be it.

Installing the development version

If you’d like to be able to update your carpyncho code occasionally with the latest bug fixes and improvements, follow these instructions:

Make sure that you have Git installed and that you can run its commands from a shell. (Enter git help at a shell prompt to test this.)

Check out carpyncho's main development branch like so:

$ git clone https://github.com/carpyncho/carpyncho-py.git carpyncho
...

This will create a directory carpyncho in your current directory.

Then you can proceed to install it with the commands:

$ cd carpyncho
$ pip install -e .
...

Documentation

The full documentation of the project is available at https://carpyncho-py.readthedocs.io/

Contact

For bugs or questions please contact

Juan B. Cabral: jbcabral@unc.edu.ar

Contents:

Tutorials

This section contains a step-by-step tutorial with examples for using the carpyncho tools, and how to understand several catalogs.

Introduction

This tutorial will show how to understand and manipulate the carpyncho Python client.

First we need to import the module and instantiate the client:

[1]:
# import the module
import carpyncho

# instance the client
client = carpyncho.Carpyncho()

First, let's check which tiles have catalogs available to download.

[2]:
client.list_tiles()
[2]:
('others',
 'b206',
 'b214',
 'b216',
 'b220',
 'b228',
 'b234',
 'b247',
 'b248',
 'b261',
 'b262',
 'b263',
 'b264',
 'b277',
 'b278',
 'b356',
 'b360',
 'b396')

Let's assume we are interested in the tile b216, so we can check which catalogs are available in this tile:

[3]:
client.list_catalogs("b216")
[3]:
('features', 'lc')

We see that catalogs with the light curves (lc) and the features of those curves (features) are available.

So, for example, we can now retrieve more info about any of these catalogs; for simplicity, let's check the b216 lc:

[4]:
client.catalog_info("b216", "lc")
[4]:
{'hname': 'Time-Series',
 'format': 'BZIP2-Parquet',
 'extension': '.parquet.bz2',
 'date': '2020-04-14',
 'md5sum': '236e126f82e80684f29247220470b831  lc_obs_b216.parquet.bz2',
 'filename': 'lc_obs_b216.parquet.bz2',
 'url': 'https://catalogs.iate.conicet.unc.edu.ar/carpyncho/lcurves/b216/lc_obs_b216.parquet.bz2',
 'size': 369866999,
 'records': 37839384}

The attribute hname is a human-readable version of the name of the catalog; the next two keys describe the format of the catalog (how it is stored in the cloud); then come the date of publication of the file, its checksum, and its file name and URL (all of this is mostly for internal use).

Finally we have the two most important pieces of information: size is the size of the file in bytes (352.7 MiB), and records is the number of records stored in the file (more than 37 million).
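The size field is plain bytes; converting it to the MiB figure quoted above is a one-liner (this snippet just redoes the arithmetic, it does not touch the client):

```python
# Convert the byte count reported by catalog_info to MiB (1 MiB = 2**20 bytes)
size_bytes = 369_866_999  # the 'size' field of the b216 lc catalog shown above
size_mib = size_bytes / 2**20
print(f"{size_mib:.1f} MiB")  # 352.7 MiB
```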

Ok… too big. Let's check the b216 features catalog:

[5]:
client.catalog_info("b216", "features")
[5]:
{'hname': 'Features',
 'format': 'BZIP2-Parquet',
 'extension': '.parquet.bz2',
 'date': '2020-04-14',
 'md5sum': '433aae05541a2f5b191aa95d717fa83c  features_b216.parquet.bz2',
 'filename': 'features_b216.parquet.bz2',
 'url': 'https://catalogs.iate.conicet.unc.edu.ar/carpyncho/lcurves/b216/features_b216.parquet.bz2',
 'size': 149073679,
 'records': 334773}

In this case the file is only 142.2 MiB in size. Let's retrieve a features catalog into a dataframe; the next cell uses the b278 tile.

[6]:
# the first time this can be slow
df = client.get_catalog("b278", "features")
df
b278-features: 349MB [02:23, 2.42MB/s]
[6]:
id cnt ra_k dec_k vs_type vs_catalog Amplitude Autocor_length Beyond1Std Con ... c89_jk_color c89_m2 c89_m4 n09_c3 n09_hk_color n09_jh_color n09_jk_color n09_m2 n09_m4 ppmb
0 32780000001647 33 270.675437 -30.833556 0.4205 1.0 0.303030 0.0 ... 0.254668 14.690558 14.666523 0.183141 0.026877 0.229526 0.256404 14.820497 14.825871 0.000044
1 32780000001722 32 270.601058 -30.797561 0.2815 1.0 0.250000 0.0 ... 0.714901 14.020039 13.975850 0.344830 0.136610 0.580695 0.717305 14.243584 14.253669 5.275137
2 32780000001725 31 270.586525 -30.790697 0.7770 1.0 0.225806 0.0 ... 0.737901 12.123443 12.090208 0.228780 0.187609 0.552696 0.740305 12.351745 12.358436 3.785030
3 32780000001764 35 270.533529 -30.764936 0.6025 1.0 0.200000 0.0 ... 0.708924 12.578053 12.528972 0.398169 0.114875 0.596295 0.711170 12.798067 12.809809 6.580914
4 32780000001766 49 270.575246 -30.785039 0.4850 1.0 0.224490 0.0 ... 0.587902 13.324888 13.287657 0.285994 0.111611 0.478696 0.590306 13.522170 13.530534 5.746016
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
866878 32780000865946 56 270.961067 -29.417511 0.0230 1.0 0.357143 0.0 ... 0.687494 12.213607 12.171067 0.347818 0.125109 0.563573 0.688682 12.398301 12.408567 2.734307
866879 32780000879797 56 270.962108 -29.392694 0.0260 1.0 0.375000 0.0 ... 0.655494 12.358619 12.329497 0.212287 0.163109 0.493573 0.656682 12.536672 12.542938 6.877899
866880 32780000894698 57 270.987062 -29.378722 0.0310 1.0 0.350877 0.0 ... 0.743494 10.892837 10.854592 0.297559 0.164110 0.580572 0.744682 11.089152 11.097934 5.883032
866881 32780000894881 55 272.064012 -29.893197 0.0285 1.0 0.381818 0.0 ... 0.464515 13.531632 13.505489 0.203314 0.096373 0.369172 0.465545 13.667806 13.673903 5.310308
866882 32780000900244 56 272.123942 -29.912258 0.0280 1.0 0.267857 0.0 ... 0.706600 12.044150 12.008327 0.273388 0.158705 0.549095 0.707800 12.229126 12.236734 0.922049

866883 rows × 73 columns

We have a lot of information to play with here. Let's check whether we have multiple types of sources:

[7]:
df.groupby("vs_type").id.count()
[7]:
vs_type
               857450
BLAP                1
CV-DN               5
Cep-1               1
Cep-F               1
ECL-C             729
ECL-ELL           486
ECL-NC           3246
LPV-Mira          104
LPV-OSARG        3820
LPV-SRV           592
RRLyr-RRab        289
RRLyr-RRc         145
RRLyr-RRd           3
SP_ECL-C            1
SP_ECL-NC           1
T2Cep-BLHer         6
T2Cep-RVTau         2
T2Cep-WVir          1
Name: id, dtype: int64

So we have 289 RRab stars (and more than 857,000 unknown sources).

From now on, you simply have a big pandas dataframe to manipulate.

All the methods of the carpyncho.Carpyncho client are well documented, and you can access their documentation with the '?' command in Jupyter:

[8]:
client.get_catalog?
Signature: client.get_catalog(tile, catalog, force=False)
Docstring:
Retrieve a catalog from the carpyncho dataset.

Parameters
----------
tile: str
    The name of the tile.
catalog:
    The name of the catalog.
force: bool (default=False)
    If its True, the cached version of the catalog is ignored and
    redownloaded. Try to always set force to False.

Returns
-------
pandas.DataFrame:
    The columns of the DataFrame changes between the different catalog.

Raises
------
ValueError:
    If the tile or the catalog is not found.
IOError:
    If the checksum not match.
File:      ~/proyectos/carpyncho-py/src/carpyncho.py
Type:      method
[9]:
import datetime as dt
dt.datetime.now()
[9]:
datetime.datetime(2022, 6, 13, 0, 1, 52, 183122)

Command line interface (CLI)

After installing Carpyncho you will have a command line app available to download any dataset.

[1]:
carpyncho --help
Usage: carpyncho command [args...]

Carpyncho console client.

Explore and download the entire https://carpyncho.github.io/ catalogs from your
command line.

Commands:
  catalog-info       Retrieve the information about a given catalog.
  download-catalog   Retrives a catalog from th Carpyncho dataset collection.
  has-catalog        Check if a given catalog and tile exists.
  list-catalogs      Show the available catalogs for a given tile.
  list-tiles         Show available tiles.
  version            Print Carpyncho version.

This software is under the BSD 3-Clause License. Copyright (c) 2020, Juan
Cabral. For bug reporting or other instructions please check:
https://github.com/carpyncho/carpyncho-py

To list all available tiles we can run:

[2]:
carpyncho list-tiles
- b206
- b214
- b216
- b220
- b228
- b234
- b247
- b248
- b261
- b262
- b263
- b264
- b277
- b278
- b356
- b360
- b396

Then we can check all the available catalogs for a given tile (b216, for example):

[3]:
carpyncho list-catalogs b216
Tile b216
    - features
    - lc

Let's assume we want to download the features catalog from the tile b216. First, let's check how big the catalog is before downloading it:

[4]:
carpyncho catalog-info b216 features
Catalog b216-features
    - hname: Features
    - format: BZIP2-Parquet
    - extension: .parquet.bz2
    - date: 2020-04-14
    - md5sum: 433aae05541a2f5b191aa95d717fa83c  features_b216.parquet.bz2
    - filename: features_b216.parquet.bz2
    - driveid: 1-t165sLjn0k507SFeW-A4p9wYVL9rP4B
    - size: 142.2 MiB
    - records: 334,773

Well, 142 MiB for 334,773 rows in the table; let's download it and store it in CSV format:

[5]:
carpyncho download-catalog b216 features --out b216_features.csv
b216-features: 149MB [03:03, 811kB/s]
Writing b216_features.csv...

Now let's count the rows to check that the file is complete (warning: this works on Linux and Mac only):

[7]:
cat b216_features.csv | wc -l
334774

The row count is correct (334,773 records plus the header line), so it's done.
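The catalog-info output above also reports an md5sum; you can verify the cached file against it with Python's standard hashlib. This is a minimal sketch: the checksum applies to the downloaded .parquet.bz2 file (whose cache location on your machine may vary), not to the CSV we just exported:

```python
import hashlib

def md5_of_file(path, chunk_size=2**20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the 'md5sum' field of catalog-info, for example:
# md5_of_file("features_b216.parquet.bz2") == "433aae05541a2f5b191aa95d717fa83c"
```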

If you run the same command multiple times, the cached file will be used instead of downloading the catalog again.

All the commands support more options; you can check them with carpyncho <command> --help. For example:

[11]:
carpyncho download-catalog --help
Usage: carpyncho download-catalog [OPTIONS] tile catalog

Retrives a catalog from th Carpyncho dataset collection.

Arguments:
  tile         The name of the tile.
  catalog      The name of the catalog.

Options:
  --out=STR    Path to store the catalog. The extension of the file detemines
               the format. Options are ".xlsx" (Excel), ".csv", ".pkl" (Python
               pickle) and ".parquet".
  --force      Force to ignore the cached value and redownload the catalog. Try
               to always set force to False.

Other actions:
  -h, --help   Show the help
[12]:
date
jue abr 23 22:38:42 -03 2020

Catalogs Tutorials

This section contains information on how to interpret and manipulate the catalogs offered by carpyncho.

Light-Curves catalogs tutorial (lc)

This notebook gives some insights about the data stored in all the lc-type catalogs.

[1]:
# import the module and instance the client
import carpyncho
client = carpyncho.Carpyncho()
[2]:
df = client.get_catalog("b214", "lc")
df.sample(3)
b214-lc: 222MB [01:33, 2.37MB/s]
[2]:
bm_src_id pwp_id pwp_stack_src_id pwp_stack_src_hjd pwp_stack_src_mag3 pwp_stack_src_mag_err3
10436285 32140000106633 4707 3000470700094773 56110.056177 15.774 0.053
17718411 32140000156103 4719 3000471900157271 55435.200210 17.899 0.277
20185341 32140000399685 4726 3000472600347001 56214.078852 16.240 0.073

The columns of this catalog are

[3]:
print(list(df.columns))
['bm_src_id', 'pwp_id', 'pwp_stack_src_id', 'pwp_stack_src_hjd', 'pwp_stack_src_mag3', 'pwp_stack_src_mag_err3']

Where

  • bm_src_id (Band-Merge Source ID): The unique identifier of every light curve. Records with the same bm_src_id are part of the same light curve (this id is part of Carpyncho's internal database and is unique for every source).
  • pwp_id (Pawprint Stack ID): The id of the pawprint stack where this point of the light curve is located (this id is part of Carpyncho's internal database).
  • pwp_stack_src_id (Pawprint Stack Source ID): The id of this particular observation inside the pawprint stack (this id is part of Carpyncho's internal database).
  • pwp_stack_src_hjd (Pawprint Stack Source HJD): The Heliocentric Julian Date of this particular observation.
  • pwp_stack_src_mag3 (Pawprint Stack Source Magnitude of the 3rd Aperture): The magnitude (in the 3rd aperture) of this particular observation.
  • pwp_stack_src_mag_err3 (Pawprint Stack Source Magnitude Error of the 3rd Aperture): The magnitude error (in the 3rd aperture) of this particular observation.
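To make the bm_src_id grouping concrete, here is a toy table with the same column names (the values are invented); counting epochs per source is a plain pandas groupby:

```python
import pandas as pd

# Toy light-curve table: two invented sources with 3 and 2 epochs
toy = pd.DataFrame({
    "bm_src_id": [1, 1, 1, 2, 2],
    "pwp_stack_src_hjd": [55301.3, 55404.2, 55435.2, 55301.4, 55404.3],
    "pwp_stack_src_mag3": [15.7, 15.8, 15.7, 16.1, 16.0],
})

# Records with the same bm_src_id belong to the same light curve,
# so the group sizes are the number of epochs of each curve
epochs = toy.groupby("bm_src_id").pwp_stack_src_hjd.count()
print(epochs.to_dict())  # {1: 3, 2: 2}
```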
Retrieve a single light-curve

Let's, for example, retrieve the light curve with the ID 32140000349109 and sort it by time:

[4]:
lc = df[df.bm_src_id == 32140000349109]
lc = lc.sort_values("pwp_stack_src_hjd")
lc
[4]:
bm_src_id pwp_id pwp_stack_src_id pwp_stack_src_hjd pwp_stack_src_mag3 pwp_stack_src_mag_err3
9824315 32140000349109 4705 3000470500316153 55301.355623 15.736 0.045
15823573 32140000349109 4713 3000471300371137 55404.204420 15.705 0.040
17889825 32140000349109 4719 3000471900344310 55435.200224 15.734 0.042
15383198 32140000349109 4711 3000471100264253 55497.035605 15.868 0.061
7230797 32140000349109 4694 3000469400252478 55806.201062 15.795 0.060
... ... ... ... ... ... ...
17498325 32140000349109 4718 3000471800253351 57248.183650 15.750 0.050
20689532 32140000349109 4728 3000472800102886 57251.230468 15.828 0.099
7873271 32140000349109 4697 3000469700250665 57252.177815 15.727 0.050
12365892 32140000349109 4677 3000467700294240 57265.086562 15.773 0.055
15027768 32140000349109 4686 3000468600357522 57282.060899 15.748 0.045

67 rows × 6 columns

Great, 67 epochs. Let's check the average and dispersion of the magnitudes and the errors:

[5]:
lc[['pwp_stack_src_mag3', 'pwp_stack_src_mag_err3']].mean()
[5]:
pwp_stack_src_mag3        15.759224
pwp_stack_src_mag_err3     0.051299
dtype: float64
[6]:
lc[['pwp_stack_src_mag3', 'pwp_stack_src_mag_err3']].std()
[6]:
pwp_stack_src_mag3        0.044732
pwp_stack_src_mag_err3    0.009130
dtype: float64

The source looks stable; now let's check the observation time range:

[7]:
print((lc.pwp_stack_src_hjd.max() - lc.pwp_stack_src_hjd.min()) / 365, "Years")
5.426589796388058 Years

Finally we can plot the entire LC

[8]:
import matplotlib.pyplot as plt
[9]:
fig, ax = plt.subplots(figsize=(12, 4))

ax.errorbar(
    lc.pwp_stack_src_hjd,
    lc.pwp_stack_src_mag3,
    lc.pwp_stack_src_mag_err3,
    ls="", marker="o", ecolor="red")

ax.set_title(f"Light Curve of source 32140000349109")
ax.set_ylabel("Magnitude")
ax.set_xlabel("HJD")

ax.invert_yaxis()
fig.tight_layout()
_images/tutorials_catalogs_00_lc_14_0.png
[10]:
import datetime as dt
dt.datetime.now()
[10]:
datetime.datetime(2022, 6, 13, 0, 7, 52, 200684)
[ ]:

Features catalogs tutorial (features)

This notebook gives some insights about the data stored in all the features-type catalogs.

[1]:
# import the module and instance the client
import carpyncho
client = carpyncho.Carpyncho()

Now we download one features catalog

[2]:
df = client.get_catalog("b214", "features")
df.sample(3)
b214-features: 159MB [01:06, 2.40MB/s]
[2]:
id cnt ra_k dec_k vs_type vs_catalog Amplitude Autocor_length Beyond1Std Con ... c89_jk_color c89_m2 c89_m4 n09_c3 n09_hk_color n09_jh_color n09_jk_color n09_m2 n09_m4 ppmb
246230 32140000304888 66 281.886621 -24.814519 0.18675 1.0 0.287879 0.0 ... 0.790576 15.292760 15.256852 0.274868 0.188890 0.602440 0.791330 15.479110 15.486869 2.341868
375156 32140000459091 44 281.626075 -24.136333 0.27050 1.0 0.431818 0.0 ... 0.475020 16.945470 16.877186 0.682651 -0.076031 0.551810 0.475779 17.051519 17.072001 4.627671
264628 32140000328959 39 281.817479 -24.701575 0.28200 1.0 0.333333 0.0 ... 0.498984 16.798042 16.762198 0.305437 0.071062 0.428715 0.499777 16.923741 16.932339 0.928964

3 rows × 73 columns

The columns of this catalog are

[3]:
print(list(df.columns))
['id', 'cnt', 'ra_k', 'dec_k', 'vs_type', 'vs_catalog', 'Amplitude', 'Autocor_length', 'Beyond1Std', 'Con', 'Eta_e', 'FluxPercentileRatioMid20', 'FluxPercentileRatioMid35', 'FluxPercentileRatioMid50', 'FluxPercentileRatioMid65', 'FluxPercentileRatioMid80', 'Freq1_harmonics_amplitude_0', 'Freq1_harmonics_amplitude_1', 'Freq1_harmonics_amplitude_2', 'Freq1_harmonics_amplitude_3', 'Freq1_harmonics_rel_phase_0', 'Freq1_harmonics_rel_phase_1', 'Freq1_harmonics_rel_phase_2', 'Freq1_harmonics_rel_phase_3', 'Freq2_harmonics_amplitude_0', 'Freq2_harmonics_amplitude_1', 'Freq2_harmonics_amplitude_2', 'Freq2_harmonics_amplitude_3', 'Freq2_harmonics_rel_phase_0', 'Freq2_harmonics_rel_phase_1', 'Freq2_harmonics_rel_phase_2', 'Freq2_harmonics_rel_phase_3', 'Freq3_harmonics_amplitude_0', 'Freq3_harmonics_amplitude_1', 'Freq3_harmonics_amplitude_2', 'Freq3_harmonics_amplitude_3', 'Freq3_harmonics_rel_phase_0', 'Freq3_harmonics_rel_phase_1', 'Freq3_harmonics_rel_phase_2', 'Freq3_harmonics_rel_phase_3', 'Gskew', 'LinearTrend', 'MaxSlope', 'Mean', 'Meanvariance', 'MedianAbsDev', 'MedianBRP', 'PairSlopeTrend', 'PercentAmplitude', 'PercentDifferenceFluxPercentile', 'PeriodLS', 'Period_fit', 'Psi_CS', 'Psi_eta', 'Q31', 'Rcs', 'Skew', 'SmallKurtosis', 'Std', 'StetsonK', 'c89_c3', 'c89_hk_color', 'c89_jh_color', 'c89_jk_color', 'c89_m2', 'c89_m4', 'n09_c3', 'n09_hk_color', 'n09_jh_color', 'n09_jk_color', 'n09_m2', 'n09_m4', 'ppmb']

Where

  • id (ID): The unique identifier of every light curve. If you want to access all the points of the light curve of a source with a given id, search for the same value in the bm_src_id column of the lc catalog of the same tile.
  • cnt (Count): How many epochs the light curve has.
  • ra_k: Right Ascension in band \(K_s\) of the source in the first epoch.
  • dec_k: Declination in band \(K_s\) of the source in the first epoch.
  • vs_type (Variable Star Type): The type of the source, if it is a variable star tagged with the OGLE-III, OGLE-IV or VizieR catalogs; empty if the source has no type.
  • vs_catalog (Variable Star Catalog): From which catalog the vs_type was extracted.
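The id here plays the same role as bm_src_id in the lc catalog, so the two tables can be joined; a toy sketch with the real column names but invented rows:

```python
import pandas as pd

# Invented per-source features and per-epoch observations
features = pd.DataFrame({"id": [101, 102], "cnt": [3, 2]})
lc = pd.DataFrame({
    "bm_src_id": [101, 101, 101, 102, 102],
    "pwp_stack_src_mag3": [15.7, 15.8, 15.7, 16.1, 16.0],
})

# Attach each source's features to every epoch of its light curve
merged = lc.merge(features, left_on="bm_src_id", right_on="id")
print(len(merged))  # 5: one row per epoch, features repeated per source
```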

All the other columns (except the last 13) are the features themselves, and can be consulted here: https://feets.readthedocs.io/en/latest/tutorial.html#The-Features

Finally, the reddening-free features are:

  • c89_c3: \(C3\) Pseudo-color using the Cardelli-89 extinction law.

  • c89_ab_color: Magnitude difference in the first epoch between the band \(a\) and the band \(b\) using the Cardelli-89 extinction law. Where \(a\) and \(b\) can be the bands \(H\), \(J\) and \(K_s\).

  • c89_m2 and c89_m4: \(m2\) and \(m4\) pseudo-magnitudes using the Cardelli-89 extinction law.

  • n09_c3: \(C3\) Pseudo-color using the Nishiyama-09 extinction law.

  • n09_ab_color: Magnitude difference in the first epoch between the band \(a\) and the band \(b\) using the Nishiyama-09 extinction law. Where \(a\) and \(b\) can be the bands \(H\), \(J\) and \(K_s\).

  • n09_m2 and n09_m4: \(m2\) and \(m4\) pseudo-magnitudes using the Nishiyama-09 extinction law.

  • ppmb (Pseudo-Phase Multi-Band): This index gives the first observation time, in phase, with respect to the average time over all bands, using the period calculated by feets.

    \[PPMB = \operatorname{frac}\left(\frac{|\operatorname{mean}(HJD_H, HJD_J, HJD_{K_s}) - T_0|}{P}\right)\]

    Where \(HJD_H\), \(HJD_J\) and \(HJD_{K_s}\) are the times of observation in the bands \(H\), \(J\) and \(K_s\); \(T_0\) is the time of observation of the maximum magnitude in the \(K_s\) band; \(mean\) computes the mean of the three times; \(frac\) returns only the decimal part of the number; and \(P\) is the extracted period.
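The definition above translates directly into code; a minimal NumPy sketch (the times, T0 and period below are invented):

```python
import numpy as np

def ppmb(hjd_h, hjd_j, hjd_k, t0, period):
    """Pseudo-Phase Multi-Band: fractional part of |mean(times) - T0| / P."""
    mean_time = np.mean([hjd_h, hjd_j, hjd_k])
    frac, _ = np.modf(abs(mean_time - t0) / period)
    return frac

# Invented values: the mean of the three times is 55001.0, so
# |55001.0 - 55000.0| / 0.6 = 1.666... and the decimal part is ~0.667
print(ppmb(55000.9, 55001.0, 55001.1, t0=55000.0, period=0.6))
```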

For more information about the extinction laws and pseudo colors/magnitudes:


Cardelli-89 Extinction law:

Cardelli, J. A., Clayton, G. C., & Mathis, J. S. (1989). The relationship between infrared, optical, and ultraviolet extinction. The Astrophysical Journal, 345, 245-256.

Nishiyama-09 Extinction law:

Nishiyama, S., Tamura, M., Hatano, H., Kato, D., Tanabé, T., Sugitani, K., & Nagata, T. (2009). Interstellar extinction law toward the galactic center III: J, H, KS bands in the 2MASS and the MKO systems, and 3.6, 4.5, 5.8, 8.0 μm in the Spitzer/IRAC system. The Astrophysical Journal, 696(2), 1407.

Pseudo colors/magnitudes:

Catelan, M., Minniti, D., Lucas, P. W., Alonso-Garcia, J., Angeloni, R., Beamin, J. C., … & Dekany, I. (2011). The Vista Variables in the Via Lactea (VVV) ESO Public Survey: Current Status and First Results. arXiv preprint arXiv:1105.1119.

Now let's play with the data of 3 variable stars:

[4]:
rrs = df[df.vs_type == "RRLyr-RRab"][:3]
rrs
[4]:
id cnt ra_k dec_k vs_type vs_catalog Amplitude Autocor_length Beyond1Std Con ... c89_jk_color c89_m2 c89_m4 n09_c3 n09_hk_color n09_jh_color n09_jk_color n09_m2 n09_m4 ppmb
4456 32140000002913 68 280.582129 -25.299033 RRLyr-RRab vizier 0.12600 1.0 0.279412 0.0 ... 0.162897 13.827993 13.819663 0.053522 0.040231 0.123061 0.163291 13.893530 13.895081 1.629816
21827 32140000030432 68 280.482488 -25.156328 RRLyr-RRab vizier 0.17275 1.0 0.235294 0.0 ... 0.221322 13.112390 13.104045 0.049464 0.062893 0.158674 0.221566 13.185733 13.187114 1.672651
21929 32140000030564 68 281.279492 -25.499744 RRLyr-RRab vizier 0.18400 1.0 0.308824 0.0 ... 0.314535 14.001622 13.984766 0.128221 0.068086 0.246717 0.314803 14.087323 14.090839 2.006219

3 rows × 73 columns

We can check their mean magnitudes to verify that the sources are neither saturated nor diffuse.

[5]:
rrs.Mean
[5]:
4456     14.086691
21827    13.488265
21929    14.412221
Name: Mean, dtype: float64

All three are between 12 and 16.5, so they are fine. And what about their pulsation?

[6]:
rrs.Std
[6]:
4456     0.071391
21827    0.090222
21929    0.120290
Name: Std, dtype: float64

Plotting time: we need the lc catalog to show the phase-folded light curve.

[7]:
lcs = client.get_catalog("b214", "lc")

Now, to reduce the memory footprint, we keep only the light curves of our 3 selected stars:

[8]:
lcs = lcs[lcs.bm_src_id.isin(rrs.id)]

To keep our code simple, we can use the PyAstronomy and NumPy libraries to fold the light curves:

[9]:
from PyAstronomy.pyasl import foldAt
import numpy as np
[10]:
%matplotlib inline
import matplotlib.pyplot as plt
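foldAt returns, for each observation time, the fractional part of (time - T0) / period. If PyAstronomy is not available, an equivalent one-liner in NumPy is (a sketch, not PyAstronomy's actual code):

```python
import numpy as np

def fold_at(time, period, T0=0.0):
    """Phase of each observation: fractional part of (time - T0) / period."""
    return np.mod((time - T0) / period, 1.0)

# One full period after T0 folds back to phase 0; half a period is phase 0.5
print(fold_at(np.array([0.0, 0.3, 0.6, 0.9]), period=0.6))
```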

Now we can plot the folded and unfolded light curves:

[11]:
# get one ot the 3 sources
rr = rrs.iloc[0]

# retrieve the lightcurve for this rr
lc = lcs[lcs.bm_src_id == rr.id]

# sort by time
lc = lc.sort_values("pwp_stack_src_hjd")

# split in time, magnitude and error
time, mag, err = (
        lc.pwp_stack_src_hjd.values,
        lc.pwp_stack_src_mag3.values,
        lc.pwp_stack_src_mag_err3.values)

# t0 is the first time
t0 = time[0]

# fold
phases = foldAt(time, rr.PeriodLS, T0=t0)
sort = np.argsort(phases)
phases, pmag, perr = phases[sort], mag[sort], err[sort]

# duplicate the values in two phases
phases = np.hstack((phases, phases + 1))
pmag = np.hstack((pmag, pmag))
perr = np.hstack((perr, perr))

# now create two plot for the folded and the unfolde LC
fig, axes = plt.subplots(2, 1, figsize=(12, 6))

# first lets plot the unfolded lc
ax = axes[0]
ax.errorbar(time, mag, err, ls="", marker="o", ecolor="red")
ax.set_title(f"Light Curve of source {rr.id}")
ax.set_ylabel("Magnitude")
ax.set_xlabel("HJD")
ax.invert_yaxis()

# now the folded lc
ax = axes[1]
ax.errorbar(phases, pmag, perr, ls="", marker="o", ecolor="blue", color="red")
ax.set_title(f"Folded Light Curve of source {rr.id}")
ax.set_ylabel("Magnitude")
ax.set_xlabel("Phase")
ax.invert_yaxis()

fig.tight_layout()
_images/tutorials_catalogs_01_features_20_0.png

The next light curve

[12]:
rr = rrs.iloc[1]
lc = lcs[lcs.bm_src_id == rr.id]
lc = lc.sort_values("pwp_stack_src_hjd")

time, mag, err = (
        lc.pwp_stack_src_hjd.values,
        lc.pwp_stack_src_mag3.values,
        lc.pwp_stack_src_mag_err3.values)

t0 = time[0]

phases = foldAt(time, rr.PeriodLS, T0=t0)
sort = np.argsort(phases)
phases, pmag, perr = phases[sort], mag[sort], err[sort]

phases = np.hstack((phases, phases + 1))
pmag = np.hstack((pmag, pmag))
perr = np.hstack((perr, perr))


fig, axes = plt.subplots(2, 1, figsize=(12, 6))

ax = axes[0]
ax.errorbar(time, mag, err, ls="", marker="o", ecolor="red")
ax.set_title(f"Light Curve of source {rr.id}")
ax.set_ylabel("Magnitude")
ax.set_xlabel("HJD")
ax.invert_yaxis()

ax = axes[1]
ax.errorbar(phases, pmag, perr, ls="", marker="o", ecolor="blue", color="red")
ax.set_title(f"Folded Light Curve of source {rr.id}")
ax.set_ylabel("Magnitude")
ax.set_xlabel("Phase")
ax.invert_yaxis()

fig.tight_layout()
_images/tutorials_catalogs_01_features_22_0.png

And the final one

[13]:
rr = rrs.iloc[2]
lc = lcs[lcs.bm_src_id == rr.id]
lc = lc.sort_values("pwp_stack_src_hjd")

time, mag, err = (
        lc.pwp_stack_src_hjd.values,
        lc.pwp_stack_src_mag3.values,
        lc.pwp_stack_src_mag_err3.values)

t0 = time[0]

phases = foldAt(time, rr.PeriodLS, T0=t0)
sort = np.argsort(phases)
phases, pmag, perr = phases[sort], mag[sort], err[sort]

phases = np.hstack((phases, phases + 1))
pmag = np.hstack((pmag, pmag))
perr = np.hstack((perr, perr))


fig, axes = plt.subplots(2, 1, figsize=(12, 6))

ax = axes[0]
ax.errorbar(time, mag, err, ls="", marker="o", ecolor="red")
ax.set_title(f"Light Curve of source {rr.id}")
ax.set_ylabel("Magnitude")
ax.set_xlabel("HJD")
ax.invert_yaxis()

ax = axes[1]
ax.errorbar(phases, pmag, perr, ls="", marker="o", ecolor="blue", color="red")
ax.set_title(f"Folded Light Curve of source {rr.id}")
ax.set_ylabel("Magnitude")
ax.set_xlabel("Phase")
ax.invert_yaxis()

fig.tight_layout()
_images/tutorials_catalogs_01_features_24_0.png
[14]:
import datetime as dt
dt.datetime.now()
[14]:
datetime.datetime(2022, 6, 13, 0, 13, 25, 563015)
[ ]:

Carpyncho RR-Lyrae V1.0 Catalog (cpy_rr_v1)

This notebook gives some insights about the data stored in the Carpyncho RR-Lyrae V1 catalog.

[1]:
# import the module and instance the client
import carpyncho
client = carpyncho.Carpyncho()

Now we download the Carpyncho RR-Lyrae V1 catalog:

[2]:
df = client.get_catalog("others", "cpy_rr_v1")
df
others-cpy_rr_v1: 32.8kB [00:00, 1.80MB/s]
[2]:
id tile cnt ra_k dec_k prob tsample
6799063 33960000211620 b396 131 267.549917 -18.699892 0.852000 b206, b214, b216, b220, b228, b234, b247, b248...
7153972 33960000942530 b396 130 268.118017 -17.763556 0.849467 b206, b214, b216, b220, b228, b234, b247, b248...
6971905 33960000566886 b396 131 267.536346 -18.093889 0.837733 b206, b214, b216, b220, b228, b234, b247, b248...
6801853 33960000220135 b396 131 267.637892 -18.734767 0.837333 b206, b214, b216, b220, b228, b234, b247, b248...
6747151 33960000105195 b396 131 267.487762 -18.848939 0.828400 b206, b214, b216, b220, b228, b234, b247, b248...
... ... ... ... ... ... ... ...
6457949 33600000787886 b360 139 263.635417 -29.303825 0.462667 b206, b214, b216, b220, b228, b234, b247, b248...
6926060 33960000470380 b396 131 267.525504 -18.250553 0.462000 b206, b214, b216, b220, b228, b234, b247, b248...
6874859 33960000359818 b396 131 267.490592 -18.416206 0.460933 b206, b214, b216, b220, b228, b234, b247, b248...
682776 32200000656825 b220 123 274.725521 -34.229878 0.460933 b206, b214, b216, b228, b234, b247, b248, b261...
6617136 33600000965623 b360 204 263.473292 -28.911094 0.460533 b206, b214, b216, b220, b228, b234, b247, b248...

242 rows × 7 columns

The columns of this catalog are

[3]:
print(list(df.columns))
['id', 'tile', 'cnt', 'ra_k', 'dec_k', 'prob', 'tsample']

Where

  • id (ID): The unique identifier of every light curve. If you want to access all the points of the light curve of a source with a given id, search for the same value of bm_src_id in the lc catalog, or of id in the features catalog, of the tile indicated in the column tile.
  • tile: The name of the tile where the candidate is located.
  • cnt (Count): How many epochs the light curve has.
  • ra_k: Right Ascension in band \(K_s\) of the source in the first epoch.
  • dec_k: Declination in band \(K_s\) of the source in the first epoch.
  • prob (Probability): The probability of this source being an RR-Lyrae star [1].
  • tsample (Tiles-Sample): Which tiles were used to create the ensemble that selected this source as a candidate [1].

[1] For more insights about this feature please check our work (not yet available).
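Since prob is a plain float column, the usual pandas filters apply; for example, to keep only the high-confidence candidates (the 0.8 threshold and the rows are invented for illustration):

```python
import pandas as pd

# Invented candidates with the same columns as the catalog
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "tile": ["b396", "b360", "b220"],
    "prob": [0.85, 0.46, 0.83],
})

# Keep candidates above an arbitrary probability threshold, best first
best = toy[toy.prob > 0.8].sort_values("prob", ascending=False)
print(list(best.id))  # [1, 3]
```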

Now let's play with a candidate:

[4]:
rr = df.iloc[0]
rr
[4]:
id                                            33960000211620
tile                                                    b396
cnt                                                      131
ra_k                                              267.549917
dec_k                                             -18.699892
prob                                                   0.852
tsample    b206, b214, b216, b220, b228, b234, b247, b248...
Name: 6799063, dtype: object


Now that we know the id and the tile (b396), we can retrieve the entire collection of features and the light curve:

[5]:
feats = client.get_catalog("b396", "features")
lc = client.get_catalog("b396", "lc")

# retrieve the features of the selected source
feats = feats[feats.id == rr.id]
lc = lc[lc.bm_src_id == rr.id]
b396-features: 303MB [02:10, 2.32MB/s]
b396-lc: 829MB [05:58, 2.32MB/s]

Now we have the features:

[6]:
feats
[6]:
id cnt ra_k dec_k vs_type vs_catalog Amplitude Autocor_length Beyond1Std Con ... c89_jk_color c89_m2 c89_m4 n09_c3 n09_hk_color n09_jh_color n09_jk_color n09_m2 n09_m4 ppmb
144267 33960000211620 131 267.549917 -18.699892 0.178 2.0 0.274809 0.0 ... 0.139307 13.758278 13.748731 0.039034 0.037559 0.103765 0.141323 13.8797 13.880858 1.49232

1 rows × 73 columns

and the entire light curve:

[7]:
lc
[7]:
bm_src_id pwp_id pwp_stack_src_id pwp_stack_src_hjd pwp_stack_src_mag3 pwp_stack_src_mag_err3
752483 33960000211620 2753 3000275300031172 56152.044089 14.230 0.028
926175 33960000211620 2754 3000275400026402 56152.044623 14.198 0.030
1745826 33960000211620 2759 3000275900028136 56156.014732 14.282 0.030
1922566 33960000211620 2760 3000276000029792 56156.015266 14.342 0.032
2959474 33960000211620 2765 3000276500036874 56167.994914 14.304 0.029
... ... ... ... ... ... ...
82226182 33960000211620 2711 3000271100035211 56075.305061 14.265 0.031
82456417 33960000211620 2712 3000271200042427 56075.305558 14.242 0.027
83636862 33960000211620 2717 3000271700036927 56078.297314 14.206 0.029
83884497 33960000211620 2718 3000271800042962 56078.297828 14.205 0.026
85016389 33960000211620 2723 3000272300029811 56099.295536 14.295 0.031

131 rows × 6 columns

We can now phase the light curve.

To keep our code simple, we fold the light curve using the PyAstronomy and NumPy libraries.
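Before using foldAt, it helps to see what it computes: each observation time is mapped to a phase in [0, 1), given a period and a reference epoch T0. A minimal NumPy sketch of the idea (not PyAstronomy's actual implementation):

```python
import numpy as np

def fold(time, period, t0=0.0):
    """Phase-fold observation times: fractional part of cycles elapsed since t0."""
    return np.mod(time - t0, period) / period

# Example with a 1-day period: 1.25 days after t0 lands at phase 0.25
time = np.array([0.0, 0.25, 0.5, 1.25])
print(fold(time, period=1.0))
```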

[8]:
from PyAstronomy.pyasl import foldAt
import numpy as np
[9]:
%matplotlib inline
import matplotlib.pyplot as plt

Now we can plot the folded and unfolded light curves.

[10]:
lc = lc.sort_values("pwp_stack_src_hjd")

time, mag, err = (
        lc.pwp_stack_src_hjd.values,
        lc.pwp_stack_src_mag3.values,
        lc.pwp_stack_src_mag_err3.values)

t0 = time[0]

phases = foldAt(time, feats.PeriodLS.values[0], T0=t0)  # fold with the Lomb-Scargle period (scalar)
sort = np.argsort(phases)
phases, pmag, perr = phases[sort], mag[sort], err[sort]

phases = np.hstack((phases, phases + 1))
pmag = np.hstack((pmag, pmag))
perr = np.hstack((perr, perr))


fig, axes = plt.subplots(2, 1, figsize=(12, 6))

ax = axes[0]
ax.errorbar(time, mag, err, ls="", marker="o", ecolor="red")
ax.set_title(f"Light Curve of source {rr.id} (Prob: ~{rr.prob:.2f})")
ax.set_ylabel("Magnitude")
ax.set_xlabel("HJD")
ax.invert_yaxis()

ax = axes[1]
ax.errorbar(phases, pmag, perr, ls="", marker="o", ecolor="blue", color="red")
ax.set_title(f"Folded Light Curve of source {rr.id} (Prob: ~{rr.prob:.2f})")
ax.set_ylabel("Magnitude")
ax.set_xlabel("Phase")
ax.invert_yaxis()

fig.tight_layout()
[Image: _images/tutorials_catalogs_02_cpy_rr_v1_20_0.png — unfolded (top) and folded (bottom) light curves]
[11]:
import datetime as dt
dt.datetime.now()
[11]:
datetime.datetime(2022, 6, 13, 0, 25, 59, 228809)

API

carpyncho module

Python client for Carpyncho VVV dataset collection.

This code accesses, as a Pandas DataFrame, all the data of the web version of Carpyncho https://carpyncho.github.io/.

class carpyncho.Carpyncho(cache_path: str = PosixPath('/home/docs/carpyncho_py_data/_cache_'), cache_expire: float = None, parquet_engine: str = 'auto', index_url: str = 'https://raw.githubusercontent.com/carpyncho/carpyncho-py/master/data/index.json')[source]

Bases: object

Client to access the Carpyncho VVV dataset collection.

This code accesses, as a Pandas DataFrame, all the data of the web version of Carpyncho https://carpyncho.github.io/.

Parameters:
  • cache (diskcache.Cache, diskcache.Fanout, or None (default: None)) – Any instance of diskcache.Cache, diskcache.Fanout, or None (default). If None, a diskcache.Cache instance is created with directory = carpyncho.DEFAULT_CACHE_DIR. More information: http://www.grantjenks.com/docs/diskcache
  • cache_expire (float or None (default=``None``)) – Seconds until a cached item expires (default None: no expiry). More information: http://www.grantjenks.com/docs/diskcache
  • parquet_engine (str (default=”auto”)) – Parquet library to use. Carpyncho stores all the remote data as compressed Parquet files, which must be parsed after download. If ‘auto’, the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.
cache

Return the internal cache of the client.

cache_expire = None

Default timeout of the catalog cache. It is best left as None (the default): the catalogs are big and almost never change.

cache_path = None

Location of the catalog cache

catalog_info(tile, catalog)[source]

Retrieve the information about a given catalog.

Parameters:
  • tile (str) – The name of the tile.
  • catalog – The name of the catalog.
Returns:

The entire information of the given catalog file. This includes the URL, MD5 checksum, size in bytes, total number of records, etc.

Return type:

dict

Raises:

ValueError: – If the tile or the catalog is not found.
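Since catalog_info reports an MD5 checksum, a downloaded file can be verified before use (get_catalog does this internally, raising IOError on mismatch). A minimal sketch with the standard library; the helper names here are illustrative, not part of the carpyncho API:

```python
import hashlib

def md5_of_bytes(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()

def verify_download(data: bytes, expected_md5: str) -> None:
    """Raise IOError if the checksum does not match."""
    if md5_of_bytes(data) != expected_md5:
        raise IOError("checksum mismatch")

# The MD5 of the empty byte string is a well-known constant
print(md5_of_bytes(b""))  # d41d8cd98f00b204e9800998ecf8427e
```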

get_catalog(tile, catalog, force=False)[source]

Retrieve a catalog from the carpyncho dataset.

Parameters:
  • tile (str) – The name of the tile.
  • catalog – The name of the catalog.
  • force (bool (default=False)) – If True, the cached version of the catalog is ignored and the catalog is downloaded again. It is best to leave force as False.
Returns:

The requested catalog. The columns of the DataFrame change between the different catalogs.

Return type:

pandas.DataFrame

Raises:
  • ValueError: – If the tile or the catalog is not found.
  • IOError: – If the checksum does not match.
has_catalog(tile, catalog)[source]

Check if a given catalog and tile exists.

Parameters:
  • tile (str) – The name of the tile.
  • catalog – The name of the catalog.
Returns:

True if the combination tile+catalog exists.

Return type:

bool

index_

Structure of the Carpyncho dataset information as a Python-dict.

index_url = None

Location of the carpyncho index (useful for development)

list_catalogs(tile)[source]

Retrieve the available catalogs for a given tile.

Parameters:tile (str) – The name of the tile to retrieve the catalogs.
Returns:The names of available catalogs in the given tile.
Return type:tuple of str
Raises:ValueError: – If the tile is not found.
list_tiles()[source]

Retrieve available tiles with catalogs as a tuple of str.

parquet_engine = None

Default Parquet library to use.

retrieve_index(reset)[source]

Access the remote index of the Carpyncho-Dataset.

The index is stored internally for 1 hr.

Parameters:reset (bool) – If True, the entire cache is ignored and a new index is downloaded and cached.
Returns:The index structure.
Return type:dict
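Conceptually, the one-hour index cache behaves like a simple time-to-live store: the index is re-downloaded only when it is missing, expired, or when reset=True. A sketch of the idea with the standard library (the real client uses diskcache, not this class):

```python
import time

class TTLCache:
    """Keep a single value for ttl seconds; reload on expiry or explicit reset."""

    def __init__(self, ttl=3600):
        self.ttl = ttl
        self._value = None
        self._stamp = None  # monotonic time of the last load

    def get(self, loader, reset=False):
        expired = self._stamp is None or (time.monotonic() - self._stamp) > self.ttl
        if reset or expired:
            self._value = loader()  # e.g. download the remote index
            self._stamp = time.monotonic()
        return self._value

calls = []
cache = TTLCache(ttl=3600)
cache.get(lambda: calls.append(1) or {"tiles": []})
cache.get(lambda: calls.append(1) or {"tiles": []})
print(len(calls))  # the second call is served from the cache
```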
carpyncho.CARPYNCHOPY_DATA_PATH = PosixPath('/home/docs/carpyncho_py_data')

Where carpyncho will store all the data.

Indices and tables