Commit

Merge pull request #559 from msdemlei/add-datalink-original_row
Add datalink original row
bsipocz committed Oct 14, 2024
1 parent 05b6fe3 commit 550f9b1
Showing 4 changed files with 126 additions and 41 deletions.
4 changes: 4 additions & 0 deletions CHANGES.rst
@@ -18,9 +18,13 @@ Bug Fixes

- Include port number if it is present in endpoint access URL. [#582]

- Where datalink records are made from table rows, the table row is
now accessible as datalinks.original_row. [#559]

- Tables returned by RegistryResource.get_tables() now have a utype
attribute [#576]


Deprecations and Removals
-------------------------

107 changes: 77 additions & 30 deletions docs/dal/index.rst
@@ -76,6 +76,8 @@ astrometric parameter, such as waveband ranges.

Some services also accept `~astropy.time.Time` as ``time`` parameter.

.. doctest::

>>> from astropy.time import Time
>>> time = Time(('2015-01-01T00:00:00', '2018-01-01T00:00:00'),
... format='isot', scale='utc')
@@ -171,6 +173,16 @@ mean G-band magnitude between 19 - 20:
2171810342771336704 323.25913736080776 51.94305655940998 19.0
2180349528028140800 310.5233961869657 50.3486391034819 19.0

While DALResultsTable has some convenience functions, it is often
convenient to directly obtain a proper astropy Table using the
``to_table`` method:

.. doctest-remote-data::

>>> result.to_table().columns[:3]
<TableColumns names=('source_id','ra','dec')>


To explore more query examples, try the ``description`` attribute for
some services. For other services, like this one, try
the ``examples`` attribute.
@@ -193,22 +205,16 @@ the ``examples`` attribute.
POINT(l.raj2000, l.dej2000)
)<0.0002 -- fine selection with PMs

Furthermore, one can find the names of the tables using:
TAPServices let you do extensive metadata inspection. For instance,
to see the tables available on the Simbad TAP service, say:

.. doctest-remote-data::

>>> print([tab_name for tab_name in tap_service.tables.keys()]) # doctest: +IGNORE_WARNINGS
['ivoa.obs_radio', 'ivoa.obscore', 'tap_schema.columns', 'tap_schema.tables',..., 'taptest.main', 'veronqsos.data', 'vlastripe82.stripe82']
>>> simbad = vo.dal.TAPService("http://simbad.cds.unistra.fr/simbad/sim-tap")
>>> print([tab_name for tab_name in simbad.tables.keys()]) # doctest: +IGNORE_WARNINGS
['TAP_SCHEMA.schemas', 'TAP_SCHEMA.tables', 'TAP_SCHEMA.columns', 'TAP_SCHEMA.keys', ... 'mesVelocities', 'mesXmm', 'otypedef', 'otypes', 'ref']


And also the names of the columns from a known table, for instance
the first three columns:

.. doctest-remote-data::

>>> result.table.columns[:3] # doctest: +IGNORE_WARNINGS
<TableColumns names=('source_id','ra','dec')>

If you know a TAP service's access URL, you can directly pass it to
:py:class:`~pyvo.dal.TAPService` to obtain a service object.
Sometimes, such URLs are published in papers or passed around through
@@ -347,7 +353,7 @@ you reach the ``maxrec`` limit:

.. doctest-remote-data::

>>> tap_results = tap_service.search("SELECT * FROM ivoa.obscore", maxrec=100000) # doctest: +SHOW_WARNINGS
>>> tap_results = tap_service.search("SELECT * FROM arihip.main", maxrec=5) # doctest: +SHOW_WARNINGS
DALOverflowWarning: Partial result set. Potential causes MAXREC, async storage space, etc.

Services will not let you raise maxrec beyond the hard match limit:
@@ -522,8 +528,10 @@ region is always circular with ``pos`` as center:

.. doctest-remote-data::

>>> ssa_service = vo.dal.SSAService("https://irsa.ipac.caltech.edu/SSA")
>>> ssa_service = vo.dal.SSAService("http://archive.stsci.edu/ssap/search2.php?id=BEFS&")
>>> ssa_results = ssa_service.search(pos=pos, diameter=size)
>>> ssa_results[0].getdataurl()
'http://archive.stsci.edu/pub/vospectra/...'

SSA queries can be further constrained by the ``band`` and ``time`` parameters.

@@ -788,39 +796,78 @@ as quantities):
>>> astropy_table = resultset.to_table()
>>> astropy_qtable = resultset.to_qtable()

Multiple datasets
-----------------
PyVO supports multiple datasets exposed on record level through the datalink.
To get an iterator yielding specific datasets, call
:py:meth:`pyvo.dal.adhoc.DatalinkResults.bysemantics` with the identifier
identifying the dataset you want it to return.
Datalink
--------

.. remove skip once https://github.com/astropy/pyvo/issues/361 is fixed
.. doctest-skip::
Datalink lets operators associate multiple artefacts with a dataset.
Examples include linking raw data, applicable or applied calibration
data, derived datasets such as extracted sources, extra documentation,
and much more.

>>> preview = next(row.getdatalink().bysemantics('#preview')).getdataset()
Datalink can be used both on result rows of queries and on
datalink-valued URLs. The typical use is to call ``iter_datalinks()``
on some DAL result; this iterates over all datalinks pyVO finds in a
document and yields :py:class:`pyvo.dal.adhoc.DatalinkResults` instances
for them. In those, you can, for instance, pick out items by semantics;
the standard vocabulary that datalink documents use is documented at
http://www.ivoa.net/rdf/datalink/core. Here is how to find URLs for
previews:

.. note::
Since the creation of datalink objects requires a network roundtrip, it is
recommended to call ``getdatalink`` only once.
.. doctest-remote-data::
>>> rows = vo.dal.TAPService("http://dc.g-vo.org/tap"
... ).run_sync("select top 5 * from califadr3.cubes order by califaid")
>>> for dl in rows.iter_datalinks(): # doctest: +IGNORE_WARNINGS
... print(next(dl.bysemantics("#preview"))["access_url"])
http://dc.zah.uni-heidelberg.de/getproduct/califa/datadr3/V1200/IC5376.V1200.rscube.fits?preview=True
http://dc.zah.uni-heidelberg.de/getproduct/califa/datadr3/COMB/IC5376.COMB.rscube.fits?preview=True
http://dc.zah.uni-heidelberg.de/getproduct/califa/datadr3/V500/IC5376.V500.rscube.fits?preview=True
http://dc.zah.uni-heidelberg.de/getproduct/califa/datadr3/COMB/UGC00005.COMB.rscube.fits?preview=True
http://dc.zah.uni-heidelberg.de/getproduct/califa/datadr3/V1200/UGC00005.V1200.rscube.fits?preview=True

The call to ``next`` in this example picks the first link marked
*preview*. For previews, this may be enough, but in general there can
be multiple links for a given semantics value for one dataset.
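
To make the difference between taking the first match and collecting all
matches concrete, here is a minimal offline sketch. The ``links`` list and
the ``by_semantics`` helper are hypothetical stand-ins for a datalink
document and for ``bysemantics``; they are not pyVO code, and the URLs are
made up for illustration.

```python
# Hypothetical stand-in for the rows of one datalink document.
links = [
    {"semantics": "#this", "access_url": "http://example.org/cube.fits"},
    {"semantics": "#preview", "access_url": "http://example.org/preview-small.png"},
    {"semantics": "#preview", "access_url": "http://example.org/preview-large.png"},
]


def by_semantics(rows, semantics):
    """Yield every row with the given semantics (illustrative only)."""
    for row in rows:
        if row["semantics"] == semantics:
            yield row


# next() stops at the first match ...
first_preview = next(by_semantics(links, "#preview"))["access_url"]

# ... while a comprehension collects all of them.
all_previews = [row["access_url"] for row in by_semantics(links, "#preview")]

# Passing a default avoids StopIteration when nothing matches.
no_calibration = next(by_semantics(links, "#calibration"), None)
```

With a real ``DatalinkResults`` the same choice presumably applies: use
``next`` when one link is enough, and iterate when every link matters.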

It is sometimes useful to go back to the original row the datalink was
generated from; use the ``original_row`` attribute for that (which may
be None if pyvo does not know what row the datalink came from):

Of course one can also build a datalink object from its url.
.. doctest-remote-data::
>>> dl.original_row["obs_title"]
'CALIFA V1200 UGC00005'

Consider ``original_row`` read only. We do not define what happens when
you modify it.
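
Because ``original_row`` may be None, code that reads it probably wants a
guard. The following is a minimal sketch, assuming a hypothetical
``FakeDatalink`` class that stands in for ``DatalinkResults`` (only the
``original_row`` attribute is modelled) and a helper named ``title_of``
invented for this illustration.

```python
class FakeDatalink:
    """Hypothetical stand-in; only models the original_row attribute."""

    def __init__(self, original_row=None):
        self.original_row = original_row


def title_of(dl, default="(unknown)"):
    """Return obs_title from the originating row, or a default."""
    if dl.original_row is None:
        return default
    return dl.original_row["obs_title"]


from_row = FakeDatalink({"obs_title": "CALIFA V1200 UGC00005"})
from_url = FakeDatalink()  # e.g. built from a bare URL, so no row is known

titles = [title_of(from_row), title_of(from_url)]
```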

Rows from tables supporting datalink also have a ``getdatalink()``
method that returns a ``DatalinkResults`` instance. In general, this is
less flexible than using ``iter_datalinks``, and it may also cause more
network traffic because each such call will cause a network request.

When one has a link to a Datalink document (for instance, from an
obscore or SIAP service, where the media type is
application/x-votable+xml;content=datalink), one can build a
DatalinkResults using
:py:meth:`~pyvo.dal.adhoc.DatalinkResults.from_result_url`:

.. doctest-remote-data::

>>> from pyvo.dal.adhoc import DatalinkResults
>>> # In this example you know the URL from somewhere
>>> url = 'https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/caom2ops/datalink?ID=ivo%3A%2F%2Fcadc.nrc.ca%2FHSTHLA%3Fhst_12477_28_acs_wfc_f606w_01%2Fhst_12477_28_acs_wfc_f606w_01_drz'
>>> datalink = DatalinkResults.from_result_url(url)
>>> next(datalink.bysemantics("#this")).content_type
'application/fits'


Server-side processing
----------------------
Some services support the server-side processing of record datasets.
This includes spatial cutouts for 2d-images, reducing of spectra to a certain
waveband range, and many more depending on the service.

Datalink
^^^^^^^^
Generic Datalink Processing Service
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Generic access to processing services is provided through the datalink
interface.

@@ -830,8 +877,8 @@ interface.
>>> datalink_proc = next(row.getdatalink().bysemantics('#proc'))

.. note::
most times there is only one processing service per result, and thats all you
need.
Most datalink documents only have one processing service per dataset,
which is why there is the ``get_first_proc`` shortcut mentioned below.


The returned object lets you access the available input parameters which you
48 changes: 38 additions & 10 deletions pyvo/dal/adhoc.py
@@ -198,7 +198,10 @@ def iter_datalinks(self):
if batch_size is None:
# first call.
self.query = DatalinkQuery.from_resource(
[_ for _ in self], self._datalink, session=self._session)
[_ for _ in self],
self._datalink,
session=self._session,
original_row=row)
remaining_ids = self.query['ID']
if not remaining_ids:
# we are done
@@ -217,9 +220,13 @@ def iter_datalinks(self):
id1 = current_ids.pop(0)
processed_ids.append(id1)
remaining_ids.remove(id1)
yield current_batch.clone_byid(id1)
yield current_batch.clone_byid(
id1,
original_row=row)
elif getattr(row, 'access_format', None) == DATALINK_MIME_TYPE:
yield DatalinkResults.from_result_url(row.getdataurl())
yield DatalinkResults.from_result_url(
row.getdataurl(),
original_row=row)
else:
# Fall back to row-specific handling
try:
@@ -370,6 +377,8 @@ def from_resource(cls, rows, resource, session=None, **kwargs):
ref="srcGroup"/>
</GROUP>
"""
original_row = kwargs.pop("original_row", None)

input_params = _get_input_params_from_resource(resource)
# get params outside of any group
dl_params = _get_params_from_resource(resource)
@@ -406,7 +415,11 @@ def from_resource(cls, rows, resource, session=None, **kwargs):
except KeyError:
query_params[name] = query_param

return cls(accessurl, session=session, **query_params)
return cls(
accessurl,
session=session,
original_row=original_row,
**query_params)

def __init__(
self, baseurl, id=None, responseformat=None, session=None, **keywords):
@@ -424,6 +437,8 @@ def __init__(
session : object
optional session to use for network requests
"""
self.original_row = keywords.pop("original_row", None)

super().__init__(baseurl, session=session, **keywords)

if id is not None:
@@ -445,8 +460,11 @@ def execute(self, post=False):
DALFormatError
for errors parsing the VOTable response
"""
return DatalinkResults(self.execute_votable(post=post),
url=self.queryurl, session=self._session)
return DatalinkResults(
self.execute_votable(post=post),
url=self.queryurl,
original_row=self.original_row,
session=self._session)


class DatalinkResults(DatalinkResultsMixin, DALResults):
@@ -492,6 +510,10 @@ class DatalinkResults(DatalinkResultsMixin, DALResults):
a Numpy array.
"""

def __init__(self, *args, **kwargs):
self.original_row = kwargs.pop("original_row", None)
super().__init__(*args, **kwargs)

def getrecord(self, index):
"""
return a representation of a datalink result record that follows
@@ -507,7 +529,7 @@ def getrecord(self, index):
Returns
-------
REc
Rec
a dictionary-like wrapper containing the result record metadata.
Raises
@@ -573,10 +595,10 @@ def bysemantics(self, semantics, include_narrower=True):
if record.semantics in semantics:
yield record

def clone_byid(self, id):
def clone_byid(self, id, *, original_row=None):
"""
return a clone of the object with results and corresponding
resources matching a given id
resources matching a given id
Returns
-------
@@ -601,7 +623,7 @@ def clone_byid(self, id):
for x in copy_tb.resources:
if x.ID and x.ID not in referenced_serviced:
copy_tb.resources.remove(x)
return DatalinkResults(copy_tb)
return DatalinkResults(copy_tb, original_row=original_row)

def getdataset(self, timeout=None):
"""
@@ -633,6 +655,12 @@ def get_first_proc(self):
return proc
raise IndexError("No processing service found in datalink result")

@classmethod
def from_result_url(cls, result_url, *, session=None, original_row=None):
res = super().from_result_url(result_url, session=session)
res.original_row = original_row
return res
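
The pattern this diff uses throughout, popping ``original_row`` out of the
keyword arguments before delegating to the parent constructor, can be
sketched in isolation. ``Base`` and ``WithRow`` below are hypothetical
classes, not part of pyVO; ``Base`` is assumed to accept only the keywords
it knows about, which is why the extra keyword has to be removed first.

```python
class Base:
    """Hypothetical parent that accepts only the keywords it knows."""

    def __init__(self, payload, *, session=None):
        self.payload = payload
        self.session = session


class WithRow(Base):
    """Hypothetical subclass adding an optional original_row."""

    def __init__(self, *args, **kwargs):
        # Remove the extra keyword first so Base never sees it.
        self.original_row = kwargs.pop("original_row", None)
        super().__init__(*args, **kwargs)


with_row = WithRow("votable", original_row={"accsize": 100800})
without_row = WithRow("votable", session="dummy")
```

Defaulting to None keeps existing callers working unchanged, which is
consistent with the test in this commit asserting that ``original_row is
None`` for a result built from a plain URL.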


class SodaRecordMixin:
"""
8 changes: 7 additions & 1 deletion pyvo/dal/tests/test_datalink.py
@@ -110,6 +110,8 @@ def test_datalink():

datalinks = next(results.iter_datalinks())

assert datalinks.original_row["accsize"] == 100800

row = datalinks[0]
assert row.semantics == "#progenitor"

@@ -132,7 +134,9 @@ def test_datalink_batch():
results = vo.dal.imagesearch(
'http://example.com/obscore', (30, 30))

assert len([_ for _ in results.iter_datalinks()]) == 3
dls = list(results.iter_datalinks())
assert len(dls) == 3
assert dls[0].original_row["obs_collection"] == "MACHO"


@pytest.mark.usefixtures('proc', 'datalink_vocabulary')
Expand All @@ -143,6 +147,8 @@ def test_datalink_batch():
class TestSemanticsRetrieval:
def test_access_with_string(self):
datalinks = DatalinkResults.from_result_url('http://example.com/proc')

assert datalinks.original_row is None
res = [r["access_url"] for r in datalinks.bysemantics("#this")]
assert len(res) == 1
assert res[0].endswith("eq010000ms/20100927.comb_avg.0001.fits.fz")