Tutorial 12: Data Archival#

The final step of any analysis workflow should be to archive the simulation files used in reporting and documentation, both input and output files. The archival task is generally performed once at the end of a project and limited to the final, peer-reviewed simulation results. However, if the task of archiving these files is added to the automated workflow, it is easier to guarantee that the archived files are in sync with the simulation results. Of course, it is not enough to produce the archive; it must also be stored somewhere for retrieval by colleagues and the analysis report audience.

The archive can include compute environment information and repository version information for improved reproducibility. For a reproducible version number, it is beneficial to use a versioning scheme that includes information from the project’s version control system, e.g. git. The WAVES project uses git and setuptools_scm [55] to build one of two kinds of version number: a clean version number that is uniquely tied to a single commit, e.g. 1.2.3, or a version number appended with the short git hash to uniquely identify the project commit. Setting up git, git tags, and a setuptools_scm version number is outside the scope of this tutorial, but is highly recommended.

Environment#

SCons and WAVES can be installed in a Conda environment with the Conda package manager. See the Conda installation and Conda environment management documentation for more details about using Conda.

Note

The SALib and numpy versions may not need to be this strict for most tutorials. However, Tutorial: Sensitivity Study uncovered some undocumented SALib version sensitivity to numpy surrounding the numpy v2 rollout.

  1. Create the tutorials environment if it doesn’t exist

    $ conda create --name waves-tutorial-env --channel conda-forge waves 'scons>=4.6' matplotlib pandas pyyaml xarray seaborn 'numpy>=2' 'salib>=1.5.1' pytest
    
  2. Activate the environment

    $ conda activate waves-tutorial-env
    

Some tutorials require additional third-party software that is not available for the Conda package manager. This software must be installed separately and made available to SConstruct either by modifying your system’s PATH or by modifying the SConstruct search paths provided to the waves.scons_extensions.add_program() method.

Warning

STOP! Before continuing, check that the documentation version matches your installed package version.

  1. You can find the documentation version in the upper-left corner of the webpage.

  2. You can find the installed WAVES version with waves --version.

If they don’t match, you can launch identically matched documentation with the WAVES Command-Line Utility docs subcommand as waves docs.

Directory Structure#

  1. Create and change to a new project root directory to house the tutorial files if you have not already done so. For example

$ mkdir -p ~/waves-tutorials
$ cd ~/waves-tutorials
$ pwd
/home/roppenheimer/waves-tutorials

Note

If you skipped any of the previous tutorials, run the following commands to create a copy of the necessary tutorial files.

$ pwd
/home/roppenheimer/waves-tutorials
$ waves fetch --overwrite --tutorial 11 && mv tutorial_11_regression_testing_SConstruct SConstruct
WAVES fetch
Destination directory: '/home/roppenheimer/waves-tutorials'

  2. Download and copy the tutorial_11_regression_testing file to a new file named tutorial_12_archival with the WAVES Command-Line Utility fetch subcommand.

$ pwd
/home/roppenheimer/waves-tutorials
$ waves fetch --overwrite tutorials/tutorial_11_regression_testing && cp tutorial_11_regression_testing tutorial_12_archival
WAVES fetch
Destination directory: '/home/roppenheimer/waves-tutorials'

SConscript#

A diff against the tutorial_11_regression_testing file from Tutorial 11: Regression Testing is included below to help identify the changes made in this tutorial.

waves-tutorials/tutorial_12_archival

--- /home/runner/work/waves/waves/build/docs/tutorials_tutorial_11_regression_testing
+++ /home/runner/work/waves/waves/build/docs/tutorials_tutorial_12_archival
@@ -7,6 +7,8 @@
 
   * ``datacheck_alias`` - String for the alias collecting the datacheck workflow targets
   * ``regression_alias`` - String for the alias collecting the regression test suite targets
+  * ``archive_prefix`` - String prefix for archive target(s) containing identifying project and version information
+  * ``project_configuration`` - String absolute path to the project SCons configuration file
   * ``unconditional_build`` - Boolean flag to force building of conditionally ignored targets
   * ``abaqus`` - String path for the Abaqus executable
 """
@@ -25,6 +27,7 @@
 # Simulation variables
 build_directory = pathlib.Path(Dir(".").abspath)
 workflow_name = build_directory.name
+workflow_configuration = [env["project_configuration"], workflow_name]
 parameter_study_file = build_directory / "parameter_study.h5"
 
 # Collect the target nodes to build a concise alias for all targets
@@ -191,11 +194,19 @@
     )
 )
 
+# Data archival
+archive_name = f"{env['archive_prefix']}-{workflow_name}.tar.bz2"
+archive_target = env.Tar(
+    target=archive_name,
+    source=workflow + workflow_configuration,
+)
+
 # Collector alias based on parent directory name
 env.Alias(workflow_name, workflow)
 env.Alias(f"{workflow_name}_datacheck", datacheck)
 env.Alias(env["datacheck_alias"], datacheck)
 env.Alias(env["regression_alias"], datacheck)
+env.Alias(f"{workflow_name}_archive", archive_target)
 
 if not env["unconditional_build"] and not env["ABAQUS_PROGRAM"]:
     print(f"Program 'abaqus' was not found in construction environment. Ignoring '{workflow_name}' target(s)")

First, we add the new environment keys that the archive task requires to the SConscript file’s docstring. Second, we build a list of all SCons configuration files required by the current workflow: project_configuration points to the SConstruct file and, by the project’s naming convention, the build directory name matches the current SConscript file name. These SCons workflow configuration files are archived with the workflow output so that the workflow task definitions can be reproduced.

For advanced workflows, e.g. Tutorial: Task Definition Reuse, that reuse SConscript files, it may be necessary to recover the current SConscript file name with a Python lambda expression as seen in the SConstruct modifications below. If the current workflow uses more than one SConscript file, the workflow_configuration list should be updated to include all configuration files for the archive task.

Next, we define the actual archive task using the SCons Tar builder [39]. The archive target is constructed from a prefix including the current project name and version in the SConstruct file. Including the version number will allow us to keep multiple archives simultaneously, provided the version number is incremented between workflow executions and as the project changes. We append the current workflow name in the archive target for projects that may contain many unique, independent workflows which can be archived separately. The archive task sources are compiled from all previous workflow targets and the workflow configuration file(s). In principle, it may be desirable to archive the workflow’s source files, as well. However, if a version control system is used to build the version number as in Tutorial: setuptools_scm, the source files may also be recoverable from the version control state which is embedded in the version number.
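To make the naming and source collection concrete outside of SCons, the following is a minimal sketch that uses Python’s standard-library tarfile module in place of the SCons Tar builder. It assumes the same naming pattern as the SConscript diff above, `{archive_prefix}-{workflow_name}.tar.bz2`. The build_archive helper and the placeholder files are hypothetical and are not part of WAVES or SCons.

```python
import pathlib
import tarfile
import tempfile


def build_archive(archive_prefix, workflow_name, sources, directory):
    """Write a bzip2-compressed tar archive of ``sources`` and return its path.

    Hypothetical helper for illustration; the tutorial itself uses ``env.Tar``.
    """
    directory = pathlib.Path(directory)
    archive_path = directory / f"{archive_prefix}-{workflow_name}.tar.bz2"
    # Mode "w:bz2" plays the role of the tar ``-c -j`` flags
    with tarfile.open(archive_path, mode="w:bz2") as archive:
        for source in sources:
            source = pathlib.Path(source)
            # Store member names relative to ``directory``
            archive.add(source, arcname=source.relative_to(directory).as_posix())
    return archive_path


with tempfile.TemporaryDirectory() as temporary_directory:
    root = pathlib.Path(temporary_directory)
    # Placeholder stand-ins for workflow targets and configuration files
    workflow_targets = [root / "parameter_set0" / "rectangle_compression.csv"]
    workflow_configuration = [root / "SConstruct", root / "tutorial_12_archival"]
    for path in workflow_targets + workflow_configuration:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("placeholder\n")

    archive_path = build_archive(
        "WAVES-TUTORIAL-0.1.0",
        "tutorial_12_archival",
        workflow_targets + workflow_configuration,
        root,
    )
    archive_name = archive_path.name
    # Read the archive back to confirm which members were stored
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        archived_names = sorted(archive.getnames())

print(archive_name)
print(archived_names)
```

Because the version is part of the file name, a second run with a different prefix would write a new archive next to the first instead of replacing it.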

Finally, we create a dedicated archive alias to match the workflow alias. Here we separate the aliases because workflows with large output files may require significant time to archive. This may be undesirable during workflow construction and troubleshooting. It is also typical for the archival task to be performed once at reporting time when the post-processing plots have been finalized.

SConstruct#

A diff against the SConstruct file from Tutorial 11: Regression Testing is included below to help identify the changes made in this tutorial.

waves-tutorials/SConstruct

--- /home/runner/work/waves/waves/build/docs/tutorials_tutorial_11_regression_testing_SConstruct
+++ /home/runner/work/waves/waves/build/docs/tutorials_tutorial_12_archival_SConstruct
@@ -3,6 +3,7 @@
 import os
 import sys
 import pathlib
+import inspect
 
 import waves
 
@@ -59,6 +60,7 @@
     unconditional_build=GetOption("unconditional_build"),
     print_build_failures=GetOption("print_build_failures"),
     abaqus_commands=GetOption("abaqus_command"),
+    TARFLAGS="-c -j",
 )
 
 # Conditionally print failed task *.stdout files
@@ -76,13 +78,17 @@
 # Set project internal variables and variable substitution dictionaries
 project_name = "WAVES-TUTORIAL"
 version = "0.1.0"
-project_dir = pathlib.Path(Dir(".").abspath)
+archive_prefix = f"{project_name}-{version}"
+project_configuration = pathlib.Path(inspect.getfile(lambda: None))
+project_dir = project_configuration.parent
 project_variables = {
+    "project_configuration": project_configuration,
     "project_name": project_name,
     "project_dir": project_dir,
     "version": version,
     "regression_alias": "regression",
     "datacheck_alias": "datacheck",
+    "archive_prefix": archive_prefix,
 }
 for key, value in project_variables.items():
     env[key] = value
@@ -114,6 +120,7 @@
     "tutorial_08_data_extraction",
     "tutorial_09_post_processing",
     "tutorial_11_regression_testing",
+    "tutorial_12_archival",
 ]
 for workflow in workflow_configurations:
     build_dir = env["variant_dir_base"] / workflow

Note that we retrieve the project configuration SConstruct file name and location with a Python lambda expression [40]. We do this to recover the absolute path to the current configuration file and because some projects may choose to use a non-default filename for the project configuration file. In Python 3, you would normally use the __file__ attribute; however, this attribute is not defined for SCons configuration files. Instead, we can recover the configuration file name and absolute path with the same method used in Tutorial 01: Geometry and Tutorial 02: Partition and Mesh for the Abaqus Python 2 journal files. For consistency with the configuration file path, we assume that the parent directory of the configuration file is the project root directory.
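The lambda technique can be tried outside of SCons. The sketch below assumes that a configuration file is executed from compiled source text rather than imported as a module, which is why no `__file__` name exists in its namespace. The file path is a hypothetical label attached to the compiled code; no file is read from disk.

```python
import pathlib

# Source text standing in for an SCons configuration file
configuration_source = """
import inspect
recovered_file = inspect.getfile(lambda: None)
"""

# Hypothetical path used only to label the compiled code
hypothetical_path = "/home/roppenheimer/waves-tutorials/SConstruct"

# Execute the compiled source in a fresh namespace; ``__file__`` is never defined
namespace = {}
exec(compile(configuration_source, hypothetical_path, "exec"), namespace)

file_attribute_defined = "__file__" in namespace
# The lambda's code object remembers the file name it was compiled with
project_configuration = pathlib.PurePosixPath(namespace["recovered_file"])
project_dir = project_configuration.parent

print(file_attribute_defined)  # False
print(project_configuration)  # /home/roppenheimer/waves-tutorials/SConstruct
print(project_dir)  # /home/roppenheimer/waves-tutorials
```

The lambda is used only as a convenient object defined in the current file. `inspect.getfile` reads the file name recorded on its code object, so the lookup does not depend on `__file__`.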

The environment is also modified to provide non-default configuration options to the SCons Tar builder. Here, the -j flag in TARFLAGS requests bzip2 compression of the archive file, and the archive name in the SConscript file uses the matching, commonly used *.tar.bz2 extension. You can read more about tar archives in the GNU tar documentation [56] and the SCons Tar builder in the SCons manpage [39].

Build Targets#

  1. Build the archive target. Note that the usual workflow target does not include the archive task because it is not required until the project developer is ready to begin final reporting.

$ pwd
/home/roppenheimer/waves-tutorials
$ scons tutorial_12_archival_archive --jobs=4

Output Files#

The output should look identical to Tutorial 11: Regression Testing with the addition of a single *.tar.bz2 file. You can inspect the contents of the archive as below.

$ pwd
/home/roppenheimer/waves-tutorials
$ find build -name "*.tar.bz2"
build/tutorial_12_archival/WAVES-TUTORIAL-0.1.0-tutorial_12_archival.tar.bz2
$ tar -tjf $(find build -name "*.tar.bz2") | grep -E "parameter_set0|SConstruct|^tutorial_12_archival"
build/tutorial_12_archival/parameter_set0/rectangle_geometry.cae
build/tutorial_12_archival/parameter_set0/rectangle_geometry.jnl
build/tutorial_12_archival/parameter_set0/rectangle_geometry.stdout
build/tutorial_12_archival/parameter_set0/rectangle_partition.cae
build/tutorial_12_archival/parameter_set0/rectangle_partition.jnl
build/tutorial_12_archival/parameter_set0/rectangle_partition.stdout
build/tutorial_12_archival/parameter_set0/rectangle_mesh.inp
build/tutorial_12_archival/parameter_set0/rectangle_mesh.cae
build/tutorial_12_archival/parameter_set0/rectangle_mesh.jnl
build/tutorial_12_archival/parameter_set0/rectangle_mesh.stdout
build/tutorial_12_archival/parameter_set0/rectangle_compression.inp.in
build/tutorial_12_archival/parameter_set0/rectangle_compression.inp
build/tutorial_12_archival/parameter_set0/assembly.inp
build/tutorial_12_archival/parameter_set0/boundary.inp
build/tutorial_12_archival/parameter_set0/field_output.inp
build/tutorial_12_archival/parameter_set0/materials.inp
build/tutorial_12_archival/parameter_set0/parts.inp
build/tutorial_12_archival/parameter_set0/history_output.inp
build/tutorial_12_archival/parameter_set0/rectangle_compression.sta
build/tutorial_12_archival/parameter_set0/rectangle_compression.stdout
build/tutorial_12_archival/parameter_set0/rectangle_compression.odb
build/tutorial_12_archival/parameter_set0/rectangle_compression.dat
build/tutorial_12_archival/parameter_set0/rectangle_compression.msg
build/tutorial_12_archival/parameter_set0/rectangle_compression.com
build/tutorial_12_archival/parameter_set0/rectangle_compression.prt
build/tutorial_12_archival/parameter_set0/rectangle_compression.h5
build/tutorial_12_archival/parameter_set0/rectangle_compression_datasets.h5
build/tutorial_12_archival/parameter_set0/rectangle_compression.csv
build/tutorial_12_archival/parameter_set0/rectangle_compression.h5.stdout
SConstruct
tutorial_12_archival

Workflow Visualization#

View the workflow directed graph by running the following command and opening the image in your preferred image viewer. First, plot the workflow with all parameter sets.

$ pwd
/home/roppenheimer/waves-tutorials
$ waves visualize tutorial_12_archival_archive --output-file tutorial_12_archival.png --width=60 --height=12 --exclude-list /usr/bin .stdout .jnl .prt .com .msg .dat .sta

The output should look similar to the figure below.

_images/tutorial_12_archival.png

In this image of the archive target’s full directed graph, we see that the full workflow feeds down into a single archive file on the left-hand side. Since the archive alias contains only the archive target, there is only a single connection between the archive alias and the archive file itself. We could specify the archive target by relative path directly, but the alias saves some typing and serves as a consistent command when the project version number changes. This is especially helpful when using a dynamic version number built from a version control system as introduced in the supplemental Tutorial: setuptools_scm.

Now plot the workflow with only the first set, set0.

$ pwd
/home/roppenheimer/waves-tutorials
$ waves visualize tutorial_12_archival_archive --output-file tutorial_12_archival_set0.png --width=60 --height=8 --exclude-list /usr/bin .stdout .jnl .prt .com .msg .dat .sta --exclude-regex "set[1-9]"

The output should look similar to the figure below.

_images/tutorial_12_archival_set0.png

As in previous tutorials, the full image is useful for describing simulation size and scope, but the image for a single parameter set is more readable and makes it easier to see individual file connections.
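To see why the `--exclude-regex "set[1-9]"` option used above leaves only set0 in the image, the pattern can be tried against a few node names with Python’s re module. This is a sketch under two assumptions: the node names are hypothetical, and the pattern is treated as matching anywhere in a node name, which may differ from how waves visualize applies it.

```python
import re

# Pattern passed to ``--exclude-regex`` in the command above
exclude_pattern = re.compile(r"set[1-9]")

# Hypothetical node names; a real graph would contain full build paths
node_names = [
    "parameter_set0/rectangle_compression.odb",
    "parameter_set1/rectangle_compression.odb",
    "parameter_set3/rectangle_compression.odb",
    "parameter_set10/rectangle_compression.odb",
    "tutorial_12_archival_archive",
]

# Assumption: a node is excluded when the pattern matches anywhere in its name
kept = [name for name in node_names if not exclude_pattern.search(name)]
print(kept)
```

Under that assumption, a two-digit name such as parameter_set10 is also excluded because it contains set1, so the pattern keeps set0 alone even in larger parameter studies.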