SPARC workflows
SPARC
WARNINGS
- DO NOT USE cp -a to copy files with xattrs!
  INSTEAD use rsync -X -u -v.
  cp does not remove absent fields from the xattrs of the file previously occupying that name! OH NO (is this a cp bug!?)
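A quick way to confirm that a copy carried the xattrs over exactly is to diff them before and after. A minimal sketch using Python's os.listxattr/os.getxattr (Linux-only); the file paths here are hypothetical:

import os

def xattr_dict(path):
    # read all extended attributes of path into a plain dict
    return {name: os.getxattr(path, name) for name in os.listxattr(path)}

# hypothetical paths: the source file and the file produced by the copy
a = xattr_dict('original/file.dat')
b = xattr_dict('copy/file.dat')
if a != b:
    # stale keys are exactly the failure mode described above: fields left
    # over from the file that previously occupied the destination name
    print('missing:', sorted(set(a) - set(b)))
    print('stale:  ', sorted(set(b) - set(a)))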
Export v4
source ~/files/venvs/sparcur-dev/bin/activate
python -m sparcur.simple.combine &&
python -m sparcur.simple.disco ~/.local/share/sparcur/export/summary/618*/LATEST/curation-export.json &&
echo Export complete. Check results at: ;
echo https://cassava.ucsd.edu/sparc/preview/archive/summary/$(readlink ~/.local/share/sparcur/export/summary/618*/LATEST)
Report
function fetch-and-run-reports () {
    local FN="/tmp/curation-export-$(date -Is).json"
    curl https://cassava.ucsd.edu/sparc/preview/exports/curation-export.json -o "${FN}"
    spc sheets update Organs --export-file "${FN}"
    spc report all --sort-count-desc --to-sheets --export-file "${FN}"
}
fetch-and-run-reports
Reporting
turtle diff
spc report changes \
    --ttl-file https://cassava.ucsd.edu/sparc/preview/archive/exports/2021-05-25T125039,817048-0700/curation-export.ttl \
    --ttl-compare https://cassava.ucsd.edu/sparc/preview/archive/exports/2021-05-24T141309,920776-0700/curation-export.ttl
spc report completeness
spc server --latest --count
keywords = sorted(set([k for d in asdf['datasets']
                       if 'meta' in d and 'keywords' in d['meta']
                       for k in d['meta']['keywords']]))
Archiving files with xattrs
tar is the only one of the 'usual' suspects for file archiving that supports xattrs; zip cannot.
tar --force-local --xattrs -cvzf 2019-07-17T10\:44\:16\,457344.tar.gz '2019-07-17T10:44:16,457344/'
tar --force-local --xattrs -xvzf 2019-07-17T10\:44\:16\,457344.tar.gz
find 2019-07-17T10\:44\:16\,457344 -exec getfattr -d {} \;
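To double check that the xattrs survived the round trip through tar, the two trees can also be compared programmatically. A minimal sketch, assuming the original directory and a hypothetical extracted copy sit at the paths shown:

import os

def tree_xattrs(root):
    # map relative path -> {xattr name: value} for everything under root
    out = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            out[rel] = {n: os.getxattr(path, n, follow_symlinks=False)
                        for n in os.listxattr(path, follow_symlinks=False)}
    return out

before = tree_xattrs('2019-07-17T10:44:16,457344')
after = tree_xattrs('extracted/2019-07-17T10:44:16,457344')  # hypothetical extraction dir
assert before == after, 'xattrs did not survive the round trip'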
Archiving releases
in place
Manually remove the echo after checking that you are removing what you expect.
pushd /var/www/sparc/sparc/

pushd archive/exports
find -maxdepth 1 -not -path '.' -type d -exec tar -cvJf '{}.tar.xz' '{}' \;
chown nginx:nginx *.tar.xz
# remove all but the one currently symlinked to exports
find -maxdepth 1 -not -path '.' -not -path "*$(basename $(readlink ../../exports))*" -type d -exec echo rm -r '{}' \;
popd

pushd preview/archive/exports
find -maxdepth 1 -not -path '.' -type d -newer $(ls -At *.tar.xz | head -n 1) -exec tar -cvJf '{}.tar.xz' '{}' \;
chown nginx:nginx *.tar.xz
# remove previous years
find -maxdepth 1 -not -path '.' -not -path "*$(date +%Y)-*" -type d -exec echo rm -r '{}' \+
# remove all but the most recent 8 folders
find -maxdepth 1 -not -path '.' -type d | sort -u | head -n -8 | xargs echo rm -r
popd
elsewhere
pushd /path/to/backup
rsync -z -v -r -e ssh cassava:/var/www/sparc sparc-$(date -I)
pushd /path/to/backup

pushd sparc-*/sparc/archive/exports
find -maxdepth 1 -not -path '.' -type d -exec tar -cvJf '{}.tar.xz' '{}' \;
find -maxdepth 1 -not -path '.' -type d -exec rm -r '{}' \;
popd

pushd sparc-*/sparc/preview/archive/exports
find -maxdepth 1 -not -path '.' -type d -exec tar -cvJf '{}.tar.xz' '{}' \;
find -maxdepth 1 -not -path '.' -type d -exec rm -r '{}' \;
popd
Other random commands
Duplicate top level and ./.operations/objects
function sparc-copy-pull () {
    : ${SPARC_PARENT:=${HOME}/files/blackfynn_local/}
    local TODAY=$(date +%Y%m%d)
    pushd ${SPARC_PARENT} &&
        mv SPARC\ Consortium "SPARC Consortium_${TODAY}" &&
        rsync -ptgo -A -X -d --no-recursive --exclude=* "SPARC Consortium_${TODAY}/" SPARC\ Consortium &&
        mkdir SPARC\ Consortium/.operations &&
        mkdir SPARC\ Consortium/.operations/trash &&
        rsync -X -u -v -r "SPARC Consortium_${TODAY}/.operations/objects" SPARC\ Consortium/.operations/ &&
        pushd SPARC\ Consortium &&
        spc pull ||
            echo "spc pull failed"
    popd
    popd
}
Simplified error report
jq -r '[ .datasets[] |
         {id: .id,
          name: .meta.folder_name,
          se: [ .status.submission_errors[].message ] | unique,
          ce: [ .status.curation_errors[].message ] | unique } ]' curation-export.json
File extensions
- List all file extensions
Get a list of all file extensions.
find -type l -o -type f | grep -o '\(\.[a-zA-Z0-9]\+\)\+$' | sort -u
- Get ids with files matching a specific extension
Arbitrary information about a dataset with files matching a pattern. The example here gives ids for all datasets that contain xml files. Nesting find -exec does not work, so the first pattern here uses shell globbing to get the datasets.

function datasets-matching () {
    for d in */; do
        find "$d" \( -type l -o -type f \) -name "*.$1" \
             -exec getfattr -n user.bf.id --only-values "$d" \; -printf '\n' -quit;
    done
}
- Fetch files matching a specific pattern
Fetch files that have zero size (an indication that fetch is broken).
find -type f -name '*.xml' -empty -exec spc fetch {} \+
Sort of manifest generation
This is slow, but prototypes functionality useful for the curators.
find -type d -not -name 'ephys' -name 'ses-*' -exec bash -c \
     'pushd $1 1>/dev/null;
      pwd >> ~/manifest-stuff.txt;
      spc report size --tab-table ./* >> ~/manifest-stuff.txt;
      popd 1>/dev/null' _ {} \;
Path ids
This one is fairly slow, but is almost certainly i/o limited due to having to read the xattrs. Maintaining the backup database of the mappings would make this much faster.
# folders and files
find . -not -type l -not -path '*operations*' -exec getfattr -n user.bf.id --only-values {} \; -print

# broken symlink format, needs work, hard to parse
find . -type l -not -path '*operations*' -exec readlink -n {} \; -print
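The backup database mentioned above could be as simple as a sqlite table mapping paths to their user.bf.id xattr, so the filesystem only has to be walked when something changes. A hedged sketch; the database name and schema are made up for illustration:

import os
import sqlite3

# hypothetical cache database
conn = sqlite3.connect('bf-id-cache.db')
conn.execute('CREATE TABLE IF NOT EXISTS ids (path TEXT PRIMARY KEY, bf_id TEXT)')

for dirpath, dirnames, filenames in os.walk('.'):
    if 'operations' in dirpath:
        continue  # match the -not -path '*operations*' filter above
    for name in dirnames + filenames:
        path = os.path.join(dirpath, name)
        try:
            bf_id = os.getxattr(path, 'user.bf.id').decode()
        except OSError:  # missing xattr or broken symlink
            continue
        conn.execute('INSERT OR REPLACE INTO ids VALUES (?, ?)', (path, bf_id))

conn.commit()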
Path counts per dataset
for d in */; do
    printf "$(find "${d}" -print | wc -l) ";
    printf "$(getfattr --only-values -n user.bf.id "${d}") ${d}\n";
done | sort -n
Debug units serialization
Until we fix compound units parsing for the round trip we might accidentally encounter an error along the lines of ValueError: Unit expression cannot have a scaling factor.
jq -C '.. | .units? // empty' /tmp/curation-export-*.json | sort -u
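The same survey can be done from Python if the export blob is already loaded; this walks the tree and collects every value stored under a units key. A minimal sketch assuming a hypothetical local copy of the export:

import json

def iter_units(obj):
    # recursively yield every value stored under a 'units' key
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == 'units':
                yield v
            yield from iter_units(v)
    elif isinstance(obj, list):
        for v in obj:
            yield from iter_units(v)

with open('/tmp/curation-export.json') as f:  # hypothetical path
    blob = json.load(f)

print(sorted(set(u for u in iter_units(blob) if isinstance(u, str))))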
protocols cache
pushd ~/.cache/idlib
mv protocol_json protocol_json-old
# run export
find protocol_json -size -2 -exec cat {} \+
# check to make sure that there weren't any manually provided caches
find protocol_json -size -2 -execdir cat ../protocol_json-old/{} \;
clean up org folders
THIS COMMAND IS DANGEROUS. ONLY RUN IT IN SPARC Consortium folders that you want to nuke.
find -maxdepth 1 -type d -not -name '.operations' -not -name '.' -exec rm -r {} \;
clean up broken symlinks in temp-upstream
Unfortunately keeping these around causes inode exhaustion issues. Very slow, but only needs to be run once per system since the code has been updated to do this during the transitive unsymlink.
from sparcur.paths import Path

here = Path.cwd()
here = Path('/var/lib/sparc/files/sparc-datasets-test')  # pick whichever root is relevant
bs = [rc for c in here.children
      for rd in (c / 'SPARC Consortium' / '.operations' / 'temp-upstream').rchildren_dirs
      for rc in rd.children
      if rc.is_broken_symlink()]
_ = [b.unlink() for b in bs]
datasets causing issues with fetching files
find */SPARC\ Consortium/.operations/temp-upstream/ -type d -name '*-ERROR' | cut -d'/' -f 1 | sort -u
python -m sparcur.simple.retrieve --jobs 1 --sparse-limit -1 --parent-parent-path . --dataset-id $1
pushd $1
spc export
viewing single dataset logs
pushd ~/.cache/log/sparcur/datasets
find -name stdout.log -printf "%T@ %Tc %p\n" | sort -n
less -R $_some_path
fixing feff issues
from sparcur.datasets import Tabular
from sparcur.paths import Path

p = Path('dataset_description.xlsx')
t = Tabular(p)
hrm1 = list(t.xlsx1())
hrm2 = list(t.xlsx2())
Look for \ufeff at the start of strings and then use e.g. vim to open and edit the file, removing it from the offending strings.
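To locate the offending cells before opening the file, a quick scan over the parsed rows works. A sketch assuming hrm1 from above is a sequence of rows of cell values:

for i, row in enumerate(hrm1):
    for j, cell in enumerate(row):
        if isinstance(cell, str) and cell.startswith('\ufeff'):
            print(f'BOM at row {i} column {j}: {cell!r}')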
View logs for failed single dataset exports
Run the function, paste in the ids under failed and hit enter.
function review-failed () {
    local paths _id
    paths=()
    while read _id; do
        # stop on the first blank line so an empty id does not add a bogus path
        if [ -z "$_id" ]; then break; fi
        paths+=(~/.cache/log/sparcur/datasets/${_id}/LATEST/stdout.log)
    done
    less -R ${paths[@]}
}
SCKAN
See the developer guide section on SCKAN.
SODA
You have to clone SODA and fetch the files for testing.
from pprint import pprint
import pysoda
from sparcur.paths import Path

p = Path(parent_folder, path).expanduser().resolve()
children = list(p.iterdir())
blob = pysoda.create_folder_level_manifest(
    {p.resolve().name: children},
    {k.name + '_description': ['some description'] * len(children)
     for k in [p] + list(p.iterdir())})
manifest_path = Path(blob[p.name][-1])
manifest_path.xopen()
pprint(manifest_path)
Developer
See also the sparcur developer guide.
Releases
DatasetTemplate
Commit any changes and push to master.
make-template-zip () {
    local CLEANROOM=/tmp/cleanroom/
    mkdir ${CLEANROOM} || return 1
    pushd ${CLEANROOM}
    git clone https://github.com/SciCrunch/sparc-curation.git &&
        pushd ${CLEANROOM}/sparc-curation/resources
    zip -x '*.gitkeep' -r DatasetTemplate.zip DatasetTemplate
    mv DatasetTemplate.zip ${CLEANROOM}
    popd
    rm -rf ${CLEANROOM}/sparc-curation
    popd
}
make-template-zip
Once that is done, open /tmp/cleanroom/DatasetTemplate.zip in file-roller or similar and make sure everything is as expected.
Create the GitHub release. The tag name should have the format dataset-template-1.1, where the version number should match the metadata version embedded in dataset_description.xlsx. Minor versions such as dataset-template-1.2.1 are allowed.
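A quick sanity check that a proposed tag follows this convention; purely illustrative, the function name is made up:

import re

def valid_template_tag(tag):
    # dataset-template-<major>.<minor> with an optional .<patch>
    return re.fullmatch(r'dataset-template-\d+\.\d+(\.\d+)?', tag) is not None

assert valid_template_tag('dataset-template-1.1')
assert valid_template_tag('dataset-template-1.2.1')
assert not valid_template_tag('dataset-template-1')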
Attach ${CLEANROOM}/DatasetTemplate.zip as a release asset.
Inform curation so they can notify the community.
Getting to know the codebase
Use inspect.getclasstree along with pyontutils.utils.subclasses to display hierarchies of classes.
from inspect import getclasstree
from pyontutils.utils import subclasses
from IPython.lib.pretty import pprint

# classes to inspect
import pathlib
from sparcur import paths

def class_tree(root):
    return getclasstree(list(subclasses(root)))

pprint(class_tree(pathlib.PurePosixPath))
Viewing logs
View the latest log file with colors using less.
less -R $(ls -d ~sparc/files/blackfynn_local/export/log/* | tail -n 1)
For a permanent fix for less add

alias less='less -R'
Debugging fatal pipeline errors
You have an error!
maybe_size = c.cache.meta.size # << AttributeError here
Modify to wrap code
try:
    maybe_size = c.cache.meta.size
except AttributeError as e:
    breakpoint()  # << investigate error
Temporarily squash the error by logging it as an exception, with an optional explanation.
try:
    maybe_size = c.cache.meta.size
except AttributeError as e:
    log.exception(e)
    log.error(f'explanation for error and local variables {c}')