Kevin Leonardic

DICOM validation in the browser

If you just want to use dciodvfy or dcentvfy in your browser, click here.

About dicom3tools

Some C++ applications have stood the test of time and have been in development for a long while. This is the case for dicom3tools, whose development history dates back to 1994.[1] As such, it predates the first release of the C++98 standard itself.

dicom3tools is a collection of command line utilities that can be used to create, modify, dump and validate DICOM files. It was written by David A. Clunie, who still maintains it to this very day. In my opinion, the validation utilities dciodvfy and dcentvfy produce some of the most helpful diagnostic messages when dealing with messy DICOM files.

When faced with customers who had trouble providing DICOM files that could be interpreted easily, I would point them to these tools, so they could address problems in their files themselves before spending lots of iterations going back and forth with me.

However, being C++ command line utilities, it was sometimes difficult to get people to use these tools. While Windows and macOS binaries are graciously provided, on Linux you would have to compile them yourself.

So I thought to myself: why not compile dciodvfy and dcentvfy for the web and make them accessible from a browser? This would eliminate the need to use the command line, and the utilities would run in the sandbox provided by the browser. Thus my quest to port the dicom3tools to WebAssembly began.

The original build system & CMake port

With Emscripten it is fairly straightforward to compile C and C++ code to WebAssembly, since there are wrappers for a lot of the C standard library surface. Compiling dicom3tools this way is not as simple as exporting CXX as em++ and CC as emcc though, since the dicom3tools use a combination of custom shell scripts, imake, awk and Makefiles as their build system. There is also a dependency on X11 that is not required for many of the tools.
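For a sense of how simple the happy path is: a self-contained program that only touches the standard library compiles to WebAssembly with a single em++ invocation (the file name here is a hypothetical example, not part of dicom3tools):

// hello.cpp - compile with:  em++ hello.cpp -o hello.js
// run with:                  node hello.js
#include <iostream>

int main()
{
    std::cout << "Hello from WebAssembly!" << std::endl;
}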

Additionally, I wanted to be able to compile on Linux and Windows.

Grabbing a gawk binary from msys2, I tried to port the shell-script, imake and Makefile functionality to CMake first.

Thankfully, there were only two imake macros I needed to adapt:

/*
 * ProjectToolLibraryHeaderTemplate - general translate template to
 * header using awk script in support area and specifying multiple args
 */
/*
 * ProjectConvertTemplate - general translate template to
 * header using awk script in support area
 */

This is the original ProjectToolLibraryHeaderTemplate imake macro:

#ifndef ProjectToolLibraryHeaderTemplate
#define ProjectToolLibraryHeaderTemplate(header,template,script,args)   @@\
header.h: $(PROJECTLIBSTANDARDDIR)/template.tpl \           @@\
            $(PROJECTLIBSUPPORTDIR)/script.awk      @@\
    RemoveTargetProgram(header.h)                   @@\
    $(AWK) -f $(PROJECTLIBSUPPORTDIR)/script.awk \          @@\
        args <$(PROJECTLIBSTANDARDDIR)/template.tpl >header.h   @@\
                                    @@\
depend::    header.h                        @@\
                                    @@\
clean::                                 @@\
    $(RM) header.h
#endif /* ProjectToolLibraryHeaderTemplate */

And this is the corresponding CMake equivalent I came up with (not quite as terse, and probably not quite idiomatic):

macro(GenerateProjectToolLibraryHeaderTemplate header template script args)

separate_arguments(SEPARATED_ARGS NATIVE_COMMAND ${args})

# ... omitted some code to copy files from the source directory to the build directory

add_custom_command(
    OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/${header}.h"
    COMMAND ${AWK_PROGRAM} -f ${ProjectLibSupportDir}/${script}.awk ${SEPARATED_ARGS} < ${ProjectLibStandardDir}/${template}.tpl > ${header}.h
    DEPENDS "${ProjectLibStandardDir}/${template}.tpl" "${ProjectLibSupportDir}/${script}.awk"
    COMMENT "Generating ${header}.h from ${template}.tpl using ${script}.awk ..."
    VERBATIM
)
add_custom_target(${header} DEPENDS "${CMAKE_CURRENT_BINARY_DIR}/${header}.h" SOURCES "${CMAKE_CURRENT_BINARY_DIR}/${header}.h")

endmacro(GenerateProjectToolLibraryHeaderTemplate)

The files are first copied to the build directory, since some of them are also being generated by concatenating template files. On one hand, I want to keep the actual CMake files used to configure the project and dependencies as close to the original layout as possible, on the other, it's useful to enable a fully separate build directory for an out-of-source build.

Here is a comparison between the two macro calls:

# Original imake:
ProjectToolLibraryHeaderTemplate(transynu,transyn,transyn,role=extern outname=transynu)

# CMake:
GenerateProjectToolLibraryHeaderTemplate("transynu" "transyn" "transyn" "role=extern outname=transynu")

It was possible to use a simple regular expression to convert the Imakefile's rules to CMake targets this way. This might become important later on, when we want to rebase those changes onto updates in the upstream source. I very much doubt that it's a good idea to replace a build system that has been working well for over 20 years. But in this case, it was easier to just replace it than to get it working in tandem with Emscripten / on Windows without POSIX wrappers.
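As a sketch of that regular-expression conversion (a reconstruction of the idea in C++, not the original one-off command), the rewrite for the example above could look like this:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Rewrite an imake rule into the corresponding CMake macro call.
    const std::string rule =
        "ProjectToolLibraryHeaderTemplate(transynu,transyn,transyn,"
        "role=extern outname=transynu)";
    const std::regex imake_rule(
        R"(ProjectToolLibraryHeaderTemplate\(([^,]+),([^,]+),([^,]+),([^)]+)\))");
    // Prints: GenerateProjectToolLibraryHeaderTemplate("transynu" "transyn" "transyn" "role=extern outname=transynu")
    std::cout << std::regex_replace(
                     rule, imake_rule,
                     R"(GenerateProjectToolLibraryHeaderTemplate("$1" "$2" "$3" "$4"))")
              << '\n';
}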

Compiling with Emscripten

For another project, I had already created a CMake kit for use with VSCode together with Emscripten and vcpkg. It was created using the emsdk_env.bat environment setup as a guide.

{
    "name": "emscripten 3.1.67 with vcpkg",
    "compilers": {
      "C": "H:/toolchain/emsdk/upstream/emscripten/emcc.bat",
      "CXX": "H:/toolchain/emsdk/upstream/emscripten/em++.bat"
    },
    "cmakeSettings": {
      "CMAKE_MAKE_PROGRAM": "H:/toolchain/ninja/bin/ninja.exe",
      "VCPKG_TARGET_TRIPLET": "wasm32-emscripten",
      "VCPKG_CHAINLOAD_TOOLCHAIN_FILE": "H:/toolchain/emsdk/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake"
    },
    "environmentVariables": {
      "EMSDK_OS": "windows",
      "EM_CONFIG": "H:/toolchain/emsdk/.emscripten",
      "EM_LLVM_ROOT": "H:/toolchain/emsdk/upstream/bin",
      "LLVM_ROOT": "H:/toolchain/emsdk/upstream/bin",
      "EMSDK": "H:/toolchain/emsdk",
      "_EMSDK_NODE": "H:/toolchain/emsdk/node/16.20.0_64bit/bin/node.exe",
      "EMSDK_NODE": "H:/toolchain/emsdk/node/18.20.3_64bit/bin/node.exe",
      "EMSDK_PYTHON": "H:/toolchain/emsdk/python/3.9.2-nuget_64bit/python.exe",
      "JAVA_HOME": "H:/toolchain/emsdk/java/8.152_64bit",
      "PATH": "H:/toolchain/ninja/bin;H:/toolchain/emsdk;H:/toolchain/emsdk/upstream/emscripten;H:/toolchain/emsdk/node/18.20.3_64bit/bin;H:/toolchain/emsdk/node/16.20.0_64bit/bin;H:/toolchain/emsdk/python/3.9.2-nuget_64bit"
    },
    "toolchainFile": "H:/v/scripts/buildsystems/vcpkg.cmake",
    "isTrusted": true
}

So all that was left to do was to configure the CMake project with the new kit and compile dciodvfy and dcentvfy to WebAssembly!

We can run it with node:

node.exe dciodvfy.js -help
-help: unrecognized option
Usage: dciodvfy.js [-input-nolengthtoend] [-input-nometa] [-ignoreoutofordertags] [-usvrlutdata] [-input-transfersyntax|-input-ts uniqueidentifier] [-input-default] [-input-byteorder|-input-endian big|little] [-input-vr implicit|explicit] [-if|-input-file inputfile] [-profile profilename] [-describe] [-dump] [-filename] [-v|-verbose] [inputfile] <inputfile

But passing a file to it will fail:

Abort - File open for read failed -

To solve this, we could use Emscripten's support for Node's file system, but ultimately this tool is supposed to run in a browser. As such, a wrapper is required to run dciodvfy and dcentvfy, since the std::fstream-based functions can only operate on the virtual filesystem provided by Emscripten.

Here are the minimal wrapper function signatures I came up with:

int dciodvfy(int argc, const char* const argv[], std::istream& is, std::ostream& os);

int dcentvfy(int argc, const char * const argv[], std::vector<std::istream*> filestreams, std::ostream& os);

All files will be accessible through a std::istream interface, and the output is captured with a std::ostream. The options are passed by adapting the argc and argv parameters. Since the C++ code has not been written with coroutines in mind, there is no direct way to just wrap an asynchronous FileReader from the JavaScript side. With Emscripten's ASYNCIFY option, however, using asynchronous functions from synchronous C++ code becomes a possibility.
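To make the calling convention concrete, here is a minimal sketch of how the dciodvfy wrapper can be driven (the surrounding code is illustrative; in the finished port, the istream is backed by the lazy file reader shown further below):

#include <iostream>
#include <sstream>
#include <string>

// Wrapper signature from above.
int dciodvfy(int argc, const char* const argv[], std::istream& is, std::ostream& os);

int validate(const std::string& dicom_bytes)
{
    // Options are passed exactly as on the command line, via argc/argv.
    const char* const argv[] = {"dciodvfy", "-verbose"};
    std::istringstream is(dicom_bytes); // stands in for the wrapped File
    std::ostringstream os;              // captures the diagnostic output
    const int rc = dciodvfy(2, argv, is, os);
    std::cout << os.str();
    return rc;
}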

The first attempt involved resolving the file wrapper callbacks with a Promise and awaiting the results like so:

EM_ASYNC_JS(EM_VAL, read_from_file_impl,
            (emscripten::EM_VAL file_raw,
             std::size_t pos, std::size_t size),
{
  "use strict";
  var file = Emval.toValue(file_raw);
  function data_reader(file, start, end, success_callback,
                       error_callback) {
    const fr = new FileReader();
    fr.onload = () => {
      const array_buffer = fr.result;
      success_callback(new Uint8Array(array_buffer));
    };
    fr.onerror = (err) => {
      console.error("Error while loading: " + err);
      error_callback(err);
    };
    fr.readAsArrayBuffer(file.slice(start, end));
  };
  function load_file_promise(file, start, end) {
    return new Promise((resolve, reject) => {
      data_reader(
          file, start, end, (result) => { resolve(result); },
          (error) => { reject(error); });
    });
  };
  try {
    const read_data =
        await load_file_promise(file, pos, pos + size);
    return Emval.toHandle(read_data);
  } catch (err) {
    console.error(err);
  }
  return Emval.toHandle(false);
})

It turned out that this works decently well. Compilation time and binary size went through the roof, though. Spending over half an hour compiling the C++ code, combined with using an unfamiliar language (JavaScript), slowed development down to a crawl.

That is when I discovered the Web Workers API: inside a worker, synchronous APIs like FileReaderSync become available, which play nice with the synchronous C++ code and do not require the ASYNCIFY option.

EM_JS(EM_VAL, read_from_file_impl,
            (emscripten::EM_VAL file_raw,
             std::size_t pos, std::size_t size),
{
  "use strict";
  try {
    const file = Emval.toValue(file_raw);
    const fr = new FileReaderSync();
    const bytes = new Uint8Array(fr.readAsArrayBuffer(file.slice(pos, pos + size)));
    return Emval.toHandle(bytes);
  } catch (err) {
    console.error(err);
  }
  return Emval.toHandle(false);
});
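On the C++ side, the EM_VAL handle returned by read_from_file_impl still has to be turned into bytes in linear memory. Here is a minimal sketch of that conversion using emscripten::val; the helper's name and error handling are my reconstruction, not necessarily the project's exact code:

#include <emscripten/val.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Copy up to "size" bytes starting at "pos" from the JS File object
// into "dst"; returns the number of bytes actually read (0 on error).
std::size_t read_from_file(emscripten::val file, std::size_t pos,
                           char* dst, std::size_t size)
{
    emscripten::val result = emscripten::val::take_ownership(
        read_from_file_impl(file.as_handle(), pos, size));
    if (result.isFalse()) // the JS side returns false on error
        return 0;
    const std::vector<char> bytes = emscripten::vecFromJSArray<char>(result);
    std::copy(bytes.begin(), bytes.end(), dst);
    return bytes.size();
}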

This is much more manageable. Using Web Workers also has the added benefit of being able to disregard memory leaks and crashes for the time being, since every validation of a file can spawn a fresh Web Worker.

Here is the class definition I settled on for implementing a std::streambuf on top of the read_from_file function:

struct LazyFileReader : public std::streambuf, public std::istream
{
    emscripten::val file;         // js "File" object
    char* cached_begin = nullptr; // Begin of useful buffer for reverse seeking
    const std::size_t size = 0;
    std::size_t file_pos = 0;
    // "file_pos" indicates global file position of bytes that have been read,
    // so that "file_pos" - (egptr() - gptr()) is at the current seek position
    constexpr static std::size_t default_buffer_size = (1ULL << 20) * 16;
    std::size_t max_buffer_size = default_buffer_size;
    std::vector<char> input_buffer;

    using std::streambuf::int_type;
    using std::streambuf::off_type;
    using std::streambuf::pos_type;
    using std::streambuf::traits_type;

    LazyFileReader(emscripten::val file, std::size_t max_buffer_size);
    LazyFileReader(emscripten::val file);

    std::streamsize getBytesAvailable() const;

    void initBuffer(std::size_t request_buffer_size);
    void resetBuffer();

    virtual pos_type seekoff(
        off_type off, std::ios_base::seekdir dir,
        std::ios_base::openmode which = ios_base::in | ios_base::out) override;

    virtual pos_type
    seekpos(pos_type pos, std::ios_base::openmode which = ios_base::in | ios_base::out) override;

    virtual std::streamsize showmanyc() override;

    // Read some bytes from input sequence and store in buffer
    virtual int_type underflow() override;
    virtual std::streamsize xsgetn(char* ptr, std::streamsize count) override;

    // Clears buffer
    virtual int sync() override;
};

After some experimentation it turned out that calling the JavaScript read_from_file function too often would result in a significant performance hit. So I specialized std::streambuf with support for seeking forwards and backwards, as well as a huge 16 MiB buffer[2] for dciodvfy and a small buffer for dcentvfy[3]. There is now a fourth pointer in addition to eback(), gptr() and egptr() that allows seeking backwards without calling into JavaScript if the data is still inside the buffer. This could have been avoided by only loading buffer-sized chunks; however, I wanted to allow loading huge chunks that bypass the buffer completely after its contents have been used up.[4]
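To illustrate the buffering strategy, here is a simplified sketch of what the refill in underflow() can look like; it leaves out the cached_begin bookkeeping for reverse seeks and is not the exact implementation:

#include <algorithm> // std::min

LazyFileReader::int_type LazyFileReader::underflow()
{
    if (gptr() < egptr()) // get area not exhausted yet
        return traits_type::to_int_type(*gptr());
    if (file_pos >= size) // no bytes left in the file
        return traits_type::eof();
    // Fetch the next chunk from JavaScript in a single call.
    const std::size_t want = std::min(max_buffer_size, size - file_pos);
    input_buffer.resize(want);
    const std::size_t got =
        read_from_file(file, file_pos, input_buffer.data(), want);
    if (got == 0)
        return traits_type::eof();
    file_pos += got; // keep the "bytes read so far" invariant intact
    setg(input_buffer.data(), input_buffer.data(), input_buffer.data() + got);
    return traits_type::to_int_type(*gptr());
}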

After adding some HTML, CSS and more JavaScript into the mix, there is now a way to use dciodvfy and dcentvfy from the comfort of your browser. Enjoy!

Performance is something I was slightly worried about, since there can be a lot of I/O involved when validating thousands of files. In my tests, it performs reasonably well though! A ~544 MiB file is analyzed in ~253 ms by the native dciodvfy; it takes ~865 ms in Firefox and ~457 ms in Chrome.[5] The output matches the native tool's output perfectly in the instances where I have checked.

What's next?

I find myself reaching for the browser-based version of dciodvfy quite a lot, and there are a few other DICOM tools that might benefit from a similar treatment. Also, there are way more tools in the dicom3tools package than just the two I have ported here![6]

I have also managed to compile DCMTK with Emscripten, so next time, maybe a DICOM tag viewer / editor?

  1. The first tarball I could find dates back to August 24, 1994, so possibly it was started even earlier than that.
  2. Storing an 8 KiB buffer on the stack led to huge problems with Firefox.
  3. dcentvfy uses all files selected for validation, so it was important that the buffer could also be fully deallocated after use.
  4. Initially I was hoping to avoid copying bytes as much as possible, but a true zero-copy implementation is only possible when ALLOW_MEMORY_GROWTH is disabled.
  5. 3 runs each, best result taken here. Variance was fairly low though.
  6. The list is actually pretty long: antodc, dcarith, dcbriggs, dcburn, dccidump, dccomb, dccp, dccreate, dcdecmpr, dcdict, dcdirdmp, dcdirmk, dcdtchg, dcdump, dcencap, dcfile, dchist, dckey, dclutburn, dclutdmp, dclutmix, dcmerge, dcmkpres, dcmulti, dcortho, dcostosr, dcposn, dcpost, dcproj, dcrmmeta, dcsmpte, dcsort, dcsqextr, dcsrdump, dcsrmrg, dcstats, dcsub, dctable, dctopdf, dctopgm8, dctopgx, dctopnm, dctoraw, dcuidchg, dcuncat, pbmtoovl, pdftodc, pgxtodc, pnmtodc, rawftodc, rawtodc