
SPARC workflows

SPARC

WARNINGS

  1. DO NOT USE cp -a to copy files with xattrs!
    INSTEAD use rsync -X -u -v.
    cp does not remove fields that are absent from the source from the xattrs of the file previously occupying that name! OH NO (is this a cp bug!?)
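    For example, to copy a tree while preserving xattrs (a minimal sketch, the source and destination paths here are placeholders):
    rsync -X -u -v -r source-dir/ dest-dir/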

Export v4

source ~/files/venvs/sparcur-dev/bin/activate
python -m sparcur.simple.combine &&
python -m sparcur.simple.disco ~/.local/share/sparcur/export/summary/618*/LATEST/curation-export.json &&
echo Export complete. Check results at: ;
echo https://cassava.ucsd.edu/sparc/preview/archive/summary/$(readlink ~/.local/share/sparcur/export/summary/618*/LATEST)

Report

function fetch-and-run-reports () {
    local FN="/tmp/curation-export-$(date -Is).json"
    curl https://cassava.ucsd.edu/sparc/preview/exports/curation-export.json -o "${FN}"
    spc sheets update Organs --export-file "${FN}"
    spc report all --sort-count-desc --to-sheets --export-file "${FN}"
}
fetch-and-run-reports

Reporting

turtle diff

spc report changes \
--ttl-file https://cassava.ucsd.edu/sparc/preview/archive/exports/2021-05-25T125039,817048-0700/curation-export.ttl \
--ttl-compare https://cassava.ucsd.edu/sparc/preview/archive/exports/2021-05-24T141309,920776-0700/curation-export.ttl
spc report completeness
spc server --latest --count
# keywords across all datasets; asdf is the loaded curation-export.json blob
keywords = sorted(set([k for d in asdf['datasets'] if 'meta' in d and 'keywords' in d['meta']
                       for k in d['meta']['keywords']]))

Archiving files with xattrs

tar is the only one of the 'usual suspects' for file archiving that supports xattrs; zip cannot.

tar --force-local --xattrs -cvzf 2019-07-17T10\:44\:16\,457344.tar.gz '2019-07-17T10:44:16,457344/'
tar --force-local --xattrs -xvzf 2019-07-17T10\:44\:16\,457344.tar.gz
find 2019-07-17T10\:44\:16\,457344 -exec getfattr -d {} \;

Archiving releases

in place

Manually remove the echo after checking that you are removing what you expect.

pushd /var/www/sparc/sparc/
    pushd archive/exports
        find -maxdepth 1 -not -path '.' -type d -exec tar -cvJf '{}.tar.xz' '{}' \;
        chown nginx:nginx *.tar.xz
        # remove all but the one currently symlinked to exports
        find -maxdepth 1 -not -path '.' -not -path "*$(basename $(readlink ../../exports))*" -type d -exec echo rm -r '{}' \;
    popd

    pushd preview/archive/exports
        find -maxdepth 1 -not -path '.' -type d -newer $(ls -At *.tar.xz | head -n 1) -exec tar -cvJf '{}.tar.xz' '{}' \;
        chown nginx:nginx *.tar.xz
        # remove previous years
        find -maxdepth 1 -not -path '.' -not -path "*$(date +%Y)-*" -type d -exec echo rm -r '{}' \+
        # remove all but the most recent 8 folders
        find -maxdepth 1 -not -path '.' -type d | sort -u | head -n -8 | xargs echo rm -r
    popd

elsewhere

pushd /path/to/backup
rsync -z -v -r -e ssh cassava:/var/www/sparc sparc-$(date -I)
pushd sparc-*/sparc/archive/exports
find -maxdepth 1 -not -path '.' -type d -exec tar -cvJf '{}.tar.xz' '{}' \;
find -maxdepth 1 -not -path '.' -type d -exec rm -r '{}' \;
popd
pushd sparc-*/sparc/preview/archive/exports
find -maxdepth 1 -not -path '.' -type d -exec tar -cvJf '{}.tar.xz' '{}' \;
find -maxdepth 1 -not -path '.' -type d -exec rm -r '{}' \;
popd

Other random commands

Duplicate top level and ./.operations/objects

function sparc-copy-pull () {
    : ${SPARC_PARENT:=${HOME}/files/blackfynn_local/}
    local TODAY=$(date +%Y%m%d)
    pushd ${SPARC_PARENT} &&
        mv SPARC\ Consortium "SPARC Consortium_${TODAY}" &&
        rsync -ptgo -A -X -d --no-recursive --exclude='*' "SPARC Consortium_${TODAY}/" SPARC\ Consortium &&
        mkdir SPARC\ Consortium/.operations &&
        mkdir SPARC\ Consortium/.operations/trash &&
        rsync -X -u -v -r "SPARC Consortium_${TODAY}/.operations/objects" SPARC\ Consortium/.operations/ &&
        pushd SPARC\ Consortium &&
        spc pull || echo "spc pull failed"
    popd
    popd
}

Simplified error report

jq -r '[ .datasets[] |
         {id: .id,
          name: .meta.folder_name,
          se: [ .status.submission_errors[].message ] | unique,
          ce: [ .status.curation_errors[].message   ] | unique } ]' curation-export.json

File extensions

  • List all file extensions

    Get a list of all file extensions.

    find -type l -o -type f | grep -o '\(\.[a-zA-Z0-9]\+\)\+$' | sort -u
    
  • Get ids with files matching a specific extension

    Arbitrary information about datasets with files matching a pattern. The example here gives ids for all datasets that contain xml files. Nesting find -exec does not work, so the first pattern here uses shell globbing to get the datasets; a usage sketch follows after this list.

    function datasets-matching () {
        for d in */; do
            find "$d" \( -type l -o -type f \) -name "*.$1" \
            -exec getfattr -n user.bf.id --only-values "$d" \; -printf '\n' -quit ;
        done
    }
    
  • Fetch files matching a specific pattern

    Fetch files that have zero size (an indication that fetch is broken).

    find -type f -name '*.xml' -empty -exec spc fetch {} \+
    
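A usage sketch for datasets-matching from the list above; the extension argument is whatever you are looking for, e.g. xml for the case described there.

datasets-matching xml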

Sort of manifest generation

This is slow, but prototypes functionality useful for the curators.

find -type d -not -name 'ephys' -name 'ses-*' -exec bash -c \
'pushd $1 1>/dev/null; pwd >> ~/manifest-stuff.txt; spc report size --tab-table ./* >> ~/manifest-stuff.txt; popd 1>/dev/null' _ {} \;

Path ids

This one is fairly slow, but is almost certainly i/o limited due to having to read the xattrs. Maintaining the backup database of the mappings would make this much faster.

# folders and files
find . -not -type l -not -path '*operations*' -exec getfattr -n user.bf.id --only-values {} \; -print
# broken symlink format, needs work, hard to parse
find . -type l -not -path '*operations*' -exec readlink -n {} \; -print

Path counts per dataset

for d in */; do printf "$(find "${d}" -print | wc -l) "; printf "$(getfattr --only-values -n user.bf.id "${d}") ${d}\n" ; done | sort -n

Debug units serialization

Until we fix compound units parsing for the round trip we might accidentally encounter an error along the lines of ValueError: Unit expression cannot have a scaling factor.

jq -C '.. | .units? // empty' /tmp/curation-export-*.json | sort -u

protocols cache

pushd ~/.cache/idlib
mv protocol_json protocol_json-old
# run export
find protocol_json -size -2 -exec cat {} \+
# check to make sure that there weren't any manually provided caches
find protocol_json -size -2 -execdir cat ../protocol_json-old/{} \;

clean up org folders

THIS COMMAND IS DANGEROUS. ONLY RUN IT IN SPARC Consortium folders that you want to nuke.

find -maxdepth 1 -type d -not -name '.operations' -not -name '.' -exec rm -r {} \;

clean up broken symlinks in temp-upstream

Unfortunately keeping these around causes inode exhaustion issues. Very slow, but only needs to be run once per system since the code has been updated to do this during the transitive unsymlink.

from sparcur.paths import Path
here = Path.cwd()  # if running from the sparc datasets folder itself
here = Path('/var/lib/sparc/files/sparc-datasets-test')  # or point at it explicitly
bs = [
    rc
    for c in here.children
    for rd in (c / 'SPARC Consortium' / '.operations' / 'temp-upstream').rchildren_dirs
    for rc in rd.children
    if rc.is_broken_symlink()]
_ = [b.unlink() for b in bs]

datasets causing issues with fetching files

find */SPARC\ Consortium/.operations/temp-upstream/ -type d -name '*-ERROR' | cut -d'/' -f 1 | sort -u
python -m sparcur.simple.retrieve --jobs 1 --sparse-limit -1 --parent-parent-path . --dataset-id $1
pushd $1
spc export 
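
The retrieve and export lines above expect the dataset id as $1, e.g. when run from a script. A minimal wrapper function (the name is hypothetical) might look like:

function retrieve-and-export () {
    local dataset_id="$1"  # one of the ids reported by the find command above
    python -m sparcur.simple.retrieve --jobs 1 --sparse-limit -1 --parent-parent-path . --dataset-id "${dataset_id}" &&
    pushd "${dataset_id}" &&
    spc export
    popd
}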

viewing single dataset logs

pushd ~/.cache/log/sparcur/datasets
find -name stdout.log -printf "%T@ %Tc %p\n" | sort -n
less -R $_some_path

fixing feff issues

from sparcur.datasets import Tabular
from sparcur.paths import Path
p = Path('dataset_description.xlsx')
t = Tabular(p)
hrm1 = list(t.xlsx1())
hrm2 = list(t.xlsx2())

Look for \ufeff at the start of strings, then use e.g. vim to open and edit the file, removing it from the offending strings.

View logs for failed single dataset exports

Run the function, paste in the ids listed under failed, and hit enter.

function review-failed () {
    local paths _id
    paths=()
    while read _id; do
        # an empty line ends the pasted list of ids
        if [ -z "${_id}" ]; then break; fi
        paths+=(~/.cache/log/sparcur/datasets/${_id}/LATEST/stdout.log)
    done
    less -R "${paths[@]}"
}

SCKAN

See the developer guide section on SCKAN.

SODA

You have to clone SODA and fetch the files for testing.

from pprint import pprint
import pysoda
from sparcur.paths import Path
p = Path(parent_folder, path).expanduser().resolve()  # parent_folder and path are placeholders pointing at the test files
children = list(p.iterdir())
blob = pysoda.create_folder_level_manifest(
    {p.resolve().name: children},
    {k.name + '_description': ['some description'] * len(children)
     for k in [p] + list(p.iterdir())})
manifest_path = Path(blob[p.name][-1])
manifest_path.xopen()
pprint(manifest_path)

Developer

Releases

DatasetTemplate

Commit any changes and push to master.

make-template-zip () {
    local CLEANROOM=/tmp/cleanroom/
    mkdir ${CLEANROOM} || return 1
    pushd ${CLEANROOM}
    git clone https://github.com/SciCrunch/sparc-curation.git &&
    pushd ${CLEANROOM}/sparc-curation/resources
    zip -x '*.gitkeep' -r DatasetTemplate.zip DatasetTemplate
    mv DatasetTemplate.zip ${CLEANROOM}
    popd
    rm -rf ${CLEANROOM}/sparc-curation
    popd
}
make-template-zip

Once that is done, open /tmp/cleanroom/DatasetTemplate.zip in file-roller or similar and make sure everything is as expected.
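
If a graphical archive manager is not available, listing the contents from the shell works as well:

unzip -l /tmp/cleanroom/DatasetTemplate.zip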

Create the GitHub release. The tag name should have the format dataset-template-1.1 where the version number matches the metadata version embedded in dataset_description.xlsx. Minor versions such as dataset-template-1.2.1 are allowed.

Attach ${CLEANROOM}/DatasetTemplate.zip as a release asset.

Inform curation so they can notify the community.

Getting to know the codebase

Use inspect.getclasstree along with pyontutils.utils.subclasses to display hierarchies of classes.

from inspect import getclasstree
from pyontutils.utils import subclasses
from IPython.lib.pretty import pprint

# classes to inspect
import pathlib
from sparcur import paths

def class_tree(root):
    return getclasstree(list(subclasses(root)))

pprint(class_tree(pathlib.PurePosixPath))

Viewing logs

View the latest log file with colors using less.

less -R $(ls -d ~sparc/files/blackfynn_local/export/log/* | tail -n 1)

For a permanent fix for less, add the following to your shell rc file (e.g. ~/.bashrc):

alias less='less -R'

Debugging fatal pipeline errors

You have an error!

maybe_size = c.cache.meta.size  # << AttributeError here

Modify the code to wrap the failing line:

try:
    maybe_size = c.cache.meta.size
except AttributeError as e:
    breakpoint()  # << investigate error

Temporarily squash the error by logging the exception, with an optional explanation:

try:
    maybe_size = c.cache.meta.size
except AttributeError as e:
    log.exception(e)
    log.error(f'explanation for error and local variables {c}')

Date: 2022-12-22T00:35:45-05:00

Author: Tom Gillespie

Created: 2022-12-22 Thu 01:38
