ACM SIGIR Artifact Badging

Procedure for ACM SIGIR Artifact Review and Badging

Scope

The ACM Special Interest Group on Information Retrieval (SIGIR) adopts the ACM policy on “Artifact Review and Badging” available at https://www.acm.org/publications/policies/artifact-review-and-badging-current (Version 1.1 – 20 August 2020) and applies it to:

  • ACM Transactions on Information Systems (TOIS)
  • Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)
  • ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR)
  • ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR)

The goal is to provide additional recognition for authors of accepted papers who are willing to go a step further by providing a complete implementation of their algorithms and/or sharing their datasets and resources, allowing replication, or even reproduction, of the results presented in their papers. However, replicability and reproducibility are not only a matter for authors but also for reviewers and other researchers who check artifacts and reproduce them. Therefore, the hard work of reviewers and “reproducers” will be acknowledged too, by recording who has evaluated each artifact.

Overall, replicability and reproducibility are community efforts: they are not competitive processes aimed at selecting some papers over others, but rather cooperative work. For these reasons, we adopt an open review process, where neither authors nor reviewers are blind, which also allows the work of all the parties involved to be explicitly acknowledged.

The different types of ACM badges are not meant to be a measure of the scientific quality of the papers themselves or of the usefulness of the presented algorithms, which are assessed by means of the traditional peer-review process and by adoption/impact in the research and industry community. Rather, they are a recognition of the service provided by authors to the community by releasing their code and/or data, and they are an endorsement of the replicability and/or reproducibility of the results presented in the paper. The badges also alert users of the ACM Digital Library to the presence and location of these artifacts in the ACM DL.

In this way, each artifact will be assigned its own DOI and will be directly citable.

Overall, the initiative promotes reproducibility of research results and allows scientists and practitioners to immediately benefit from state-of-the-art research results, without spending months re-implementing the proposed algorithms, searching for the right parameter values, recreating datasets, or running intensive user-oriented evaluations. We also hope that it will indirectly foster scientific progress, since it allows researchers to reliably compare with and build upon existing techniques, knowing that they are using exactly the same implementation.

Artifact Submission

Irrespective of the nature of the artifacts, authors should create a single Web page (whether on their site or a third-party repository service) that contains the artifact, the paper, and all necessary instructions.

For artifacts where this would be appropriate, it would be helpful to also provide a self-contained bundle (including instructions) as a single file (.tgz or .zip) for convenient offline use.

The artifact submission thus consists of just the URL, plus any credentials required to access the files, entered into the submission system:

https://openreview.net/group?id=ACM.org/SIGIR/Badging

A mini tutorial on the use of OpenReview for authors is available here:

aec-openreview-authors.pdf

Artifact Packaging

Below, guidelines are provided for packaging the three main kinds of artifacts we expect to receive: (1) datasets, (2) code, and (3) user-oriented materials.

Datasets

When appropriate, datasets should be packaged in a compressed format (.tgz or .zip) and accompanied by detailed documentation about their contents and structure.

For formats that support document types or schemas, e.g. XML, RDF, or JSON, a commented schema should be provided.

In the case of textual data, care must be taken with character encoding; the use of Unicode and UTF-8 is recommended.

It is also suggested that datasets be accompanied by an example parser showing how to process them in the intended way. It can also be helpful to provide a visualization tool if the dataset is difficult to inspect as plain text.
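For instance, a minimal example parser for a hypothetical tab-separated dataset could look like the following sketch (the file layout and field names are illustrative assumptions, not a prescribed format):

```shell
# Create a tiny sample of a hypothetical tab-separated dataset:
# doc_id <TAB> title <TAB> body (the layout is illustrative).
printf 'd1\tFirst document\tSome body text\n'  > sample.tsv
printf 'd2\tSecond document\tMore body text\n' >> sample.tsv

# Example parser: split the fields in the intended way and print
# each document's identifier and title.
awk -F'\t' '{ printf "doc %s: %s\n", $1, $2 }' sample.tsv
```

Even a few lines like these remove ambiguity about delimiters, encodings, and field order.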

The paper:

  • Gebru, T., Morgenstern, J., Vecchione, B., Wortman Vaughan, J., Wallach, H., Daumé III, H., and Crawford, K. (2018). Datasheets for Datasets. arXiv.org, Databases (cs.DB), arXiv:1803.09010 (https://arxiv.org/abs/1803.09010).

contains useful suggestions on how to package and describe datasets.

Code

The full source code must be available, properly documented, and provided with clear instructions to build it from scratch. The use of build tools such as Maven, Gradle, CMake, and the like is strongly suggested to easily manage dependencies and the build process. If the code is in Python, all the dependencies must be pinned.
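For a Python artifact, pinning could look like the following sketch (package names and versions are illustrative placeholders; in practice the file is typically generated from the project's virtual environment with `pip freeze > requirements.txt`):

```shell
# Write a fully pinned requirements.txt; the entries below are placeholders.
cat > requirements.txt <<'EOF'
numpy==1.24.4
scipy==1.10.1
EOF

# Sanity check: every line should pin an exact version with '=='.
! grep -vq '==' requirements.txt && echo "all dependencies pinned"
```

Pinning every dependency, including transitive ones, is what lets a reviewer recreate the exact environment used in the paper.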

Source code should be made available through a version control hosting service, such as GitHub or Bitbucket, and the version of the artifact submitted for badging should be explicitly tagged, to allow reviewers to seamlessly identify the correct version and not deal with subsequent updates.

The code must compile and run on a vanilla installation of an operating system, without any additional software installed. If specific software is needed, you should provide a script that automatically downloads and installs all the required dependencies. As explained above, you should provide a README in the code repository containing a step-by-step deployment guide. The starting point of the guide is a freshly installed OS (say, a current Ubuntu LTS version). The guide then walks the reviewer through the deployment process, specifying every command to be executed until the system is fully deployed. In particular, the guide must not presume any prior knowledge, and the steps must be effortless on the part of the reviewer (i.e., not an instruction like “Install and configure Redis”, but the actual commands to do it the way the authors need it).
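As an illustration, such a deployment guide might boil down to a handful of commands like the following sketch (the repository URL, tag, package list, and entry point are hypothetical, and the commands need network access and root privileges, so this is a shape to imitate rather than something to run as-is):

```shell
# Deployment sketch, starting from a fresh Ubuntu LTS installation.
sudo apt-get update
sudo apt-get install -y git python3 python3-venv

# Fetch the explicitly tagged artifact version (placeholder URL and tag).
git clone --branch v1.0-badging https://example.org/our-artifact.git
cd our-artifact

# Create an isolated environment and install the pinned dependencies.
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

# Reproduce the paper's results (placeholder entry point).
python run_experiments.py --out results/
```

The point is that a reviewer can paste these commands one by one, with no decisions left to make.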

The code should preferably be distributed with a license that allows free non-commercial usage. Its dependencies must have a license that allows free non-commercial or academic usage.

If the code is expected to run on particular datasets, the datasets have to be provided as well.

Besides the source code which must always be available, authors should also strongly consider one of the following methods to package the software components of their artifacts to foster their re-use:

  • A binary installable package (short-term reproducibility)
  • A “live notebook”, e.g. Jupyter, that can provide a journal view of all the work done to create the results (short-term reproducibility)
  • Pinning dependencies (short-term reproducibility)
  • Providing copies of the dependencies plus a deployment script (mid-term reproducibility)
  • A container image, e.g. a Docker image, not just a Dockerfile (long-term reproducibility)
  • A virtual machine image (VirtualBox/VMware, i.e. a true hypervisor) containing the software applications already set up in the intended run-time environment (e.g., mobile phone emulator, real-time OS). This is the preferred option: it avoids installation and compatibility problems and provides the best guarantees for reproducibility of the results by the committee (longer-term reproducibility)
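For the container-image option, note the difference between shipping a Dockerfile (a recipe whose base images and packages may disappear over time) and shipping the built image itself. A sketch of the latter, with illustrative image names (these commands require a Docker installation, so they are shown as a pattern rather than run here):

```shell
# Build the image from the artifact's Dockerfile and tag it.
docker build -t our-artifact:v1.0 .

# Export the *built* image as a self-contained archive to deposit
# alongside the artifact (long-term reproducibility).
docker save our-artifact:v1.0 | gzip > our-artifact-v1.0.tar.gz

# A reviewer restores and runs it without rebuilding anything:
gunzip -c our-artifact-v1.0.tar.gz | docker load
docker run --rm our-artifact:v1.0
```

Depositing the archive means the experiment environment survives even if the base image or package mirrors vanish.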

User-oriented Experiments

The full study protocol and, where applicable, the general approach, definitions, study design, recruitment and sampling, and data collection setup have to be available. This can include the task designs, procedure, questionnaires (pre-test, pre-task, post-task, and exit), semi-structured interview questions, and apparatus.

The following paper describes useful suggestions for what can be included in user-oriented studies:

  • Petras, V., Bogers, T., & Gäde, M. (2019). Elements of IIR Studies: A Review of the 2006-2018 IIiX and CHIIR Conferences. In BIIRRR@CHIIR (pp. 37-41). (http://ceur-ws.org/Vol-2337/paper6.pdf)

Review Procedure

Artifact evaluation is only open to accepted papers in one of the venues listed above and it is triggered upon explicit request for badging by the authors of the artifact. The submission of the artifact in OpenReview is this explicit request.

The artifact badging process is fully asynchronous with respect to the submission and review process of the targeted conferences and journals, and it can be triggered at any moment. However, we expect artifact badging requests to occur within 3 years of the date of publication of a paper. Requests to badge older papers will be considered on a case-by-case basis and should be submitted to the AEC Chairs along with a detailed justification outlining the importance of badging the paper and why it took longer than 3 years to submit the request. Notification-of-acceptance emails will contain information on how to request badges.

We adopt an open review process where reviewers are known to authors and can interact with them to ask for clarifications and explanations.

Each artifact is jointly reviewed by:

  • 1 junior AEC member
  • 1 senior AEC member

who work together to produce a single assessment of the artifact. The distinction between junior and senior members is mainly defined with respect to their experience in the committee and not just on the basis of academic seniority. It is expected that the junior member carries out most of the practical work and discusses issues, problems, and workarounds together with the senior member. The TOIS/SIGIR/CHIIR/ICTIR representative corresponding to the source venue of the paper is kept in the loop to ensure compliance with the criteria adopted at the source venue. The chair/vice-chair will resolve any conflicts and break ties if needed.

SIGIR/CHIIR/ICTIR calls for papers, as well as TOIS guidelines for authors and calls for papers of special issues, will explicitly mention the possibility for authors to request artifact badging upon acceptance of their papers.

It is important to stress that the artifact review is not intended to redo the peer review of the paper, which is already accepted and whose scientific quality is taken for granted. The only purpose of the artifact review process is to determine whether a given badge can be assigned to a paper; the evaluation of the artifact is relative to the paper, i.e. to the claims made in the paper.

Authors can apply for a given badge at most three times. Authors re-submitting an artifact are expected to provide a link to the previous submission within OpenReview and the submission form contains a specific field for this purpose.

A mini tutorial on the use of OpenReview for reviewers is available here:

aec-openreview-reviewers.pdf

Below, more information is provided about the specific criteria to be met for each badge type.

Criteria for Awarding Artifacts Evaluated – Functional

Definition of Artifacts Evaluated – Functional [v1.1 as of August 24, 2020]: “The artifacts associated with the research are found to be documented, consistent, complete, exercisable, and include appropriate evidence of verification and validation.”

During the review process, if not already in the “Source Materials” section of the ACM DL, the artifact should be made available in an online repository accessible to reviewers.

Check also that the content of the artifact corresponds to what is mentioned in the corresponding paper – there is no need to try to run or compile materials, just check that all the material is available.

The expected review effort for this type of badge is on the order of a few hours; more time is considered an indicator of potential issues with the artifact.

Criteria for Awarding Artifacts Evaluated – Reusable and Available

Definition of Artifacts Evaluated – Reusable [v1.1 as of August 24, 2020]: “The artifacts associated with the paper are of a quality that significantly exceeds minimal functionality. That is, they have all the qualities of the Artifacts Evaluated – Functional level, but, in addition, they are very carefully documented and well-structured to the extent that reuse and repurposing are facilitated. In particular, the norms and standards of the research community for artifacts of this type are strictly adhered to.”

Definition of Available [v1.1 as of August 24, 2020]: “This badge is applied to papers in which associated artifacts have been made permanently available for retrieval.”

For the purposes of ACM SIGIR, we merge Artifacts Evaluated – Reusable and Artifacts Available into a single badge at the Artifacts Available level.

Check that the artifact is properly documented, i.e. provided with a general description and an inventory of its contents:

  • for datasets, a description of their “schema” is also expected, together with an explanation of the intended way to process them;
  • for code, the code itself needs to be properly commented and documented, e.g. with Javadoc or equivalent tools for other languages;
  • for log-books and user-oriented experiments, all the materials used to conduct the study should be available, e.g. instructions, questionnaires, interview questions, scales, qualitative analysis code-books, or study protocols.

Check that the content of the artifact corresponds to its description in the paper, that it may produce the results reported in the paper, and that no part of it is missing, relative to the results reported in the paper.

In the case of datasets, check that they can be successfully unpacked and parsed, if a default parser is provided with the artifact. For datasets to be reusable, the reviewer should ensure that they can import them (or a subset selected by some criterion) into a data analysis tool, directly or via pre-processing, and then carry out some basic analysis – producing graphs, histograms, etc.

In particular, the paper:

  • Gebru, T., Morgenstern, J., Vecchione, B., Wortman Vaughan, J., Wallach, H., Daumé III, H., and Crawford, K. (2018). Datasheets for Datasets. arXiv.org, Databases (cs.DB), arXiv:1803.09010 (https://arxiv.org/abs/1803.09010).

contains best practices that should be adopted by authors in packaging and describing their datasets.

In the case of code, check that it can be successfully compiled and executed. Source code must always be available but, on top of this, it is strongly suggested that the artifact also be packaged as a virtual machine or live notebook to make it as easy as possible to use (see the section on packaging above).

The README of the code repository contains a step-by-step deployment guide. The starting point of the guide is a freshly installed OS (say, a specific Ubuntu LTS version). The guide then walks the reviewer through the deployment process, specifying every command to be executed until the system is fully deployed. In particular, the guide must not presume any prior knowledge, and the steps must be effortless on the part of the reviewer (i.e., not an instruction like “Install and configure Redis”, but the actual commands to do it the way the authors need it).

In the case of log-books and user-oriented experiments, all the materials needed to conduct the experiment should be made available. This may include pre-task, post-task, and exit questionnaires, (semi-structured) interview questions, scales, qualitative analysis code-books, tasks, or study protocols. Preferably, the data collection setup is explained in such a way that it can be easily understood. This can include the apparatus used and how the recruitment was completed. Where possible, setups of crowdsourcing experiments are shared or, as a minimum, detailed screenshots of the interface and logic, as well as detailed information about the procedure for collecting data.

The expected review effort for this type of badge is on the order of a few hours; more time is considered an indicator of potential issues with the artifact.

Criteria for Awarding Results Reproduced

Definition of Results Reproduced [v1.1 as of August 24, 2020]: “The main results of the paper have been obtained in a subsequent study by a person or team other than the authors, using, in part, artifacts provided by the author.”

Execute the provided code on the datasets described in the paper and provided with the artifact. Issues to be considered for specific cases:

  • Timings/Efficiency: If the paper includes timings, an additional script should be included in the submission. The script should run all the experiments for which a timing is provided in the paper and save the results in a single text file. The reviewer will launch the scripts on their machine; the produced text file will be attached as additional material to the submission, together with a description of the CPU, GPU, and maximal memory of the workstation used to replicate the experiments. Note that it is not expected that the timings in the paper will match those obtained by the reviewer.
  • Interactive Applications: If your application is interactive and it is not possible to script it to replicate a certain figure in the paper, you must provide a short movie clip that captures the screen while you use your software to reproduce the figure. The reviewers will then follow the video step by step to reproduce the same result. If you want to use this option, you will have to provide a short description justifying why it is not possible to script the generation of the results. You can also consider using specific tools for interface testing, such as Selenium, which make replicating results easier.
  • Data Restrictions: There might be cases where the data used in a paper cannot be publicly released with the same license as the code. In these cases, it is sufficient to provide a URL that points to a website that allows users to apply for (or buy) the license needed to obtain the data. A confidential copy of the data (which will be deleted after the review) should be sent to the reviewer. Another option is to test the code against a subset of the data, or against synthetic data of the same shape, to confirm that the code itself is functional.
  • Long-Running Times: The authors should provide pre-computed data for each result/algorithmic step requiring more than 24 hours. The code and training data required to reproduce the result/algorithmic step should still be in the repository, but the reviewer will not rerun the computation, to save time and resources. This applies in particular to the training of deep networks or expensive parameter searches: for these cases, providing the weights of the network or the parameters will allow the reviewer to validate the results without having to recompute them.
  • Hardware Requirements: We encourage the authors to provide code that runs without requiring specific hardware, to favor portability and simplify the replicability of the results. However, if this is not possible, access to the specialised hardware needs to be provided in order to give reviewers a chance to replicate the results.
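The timing script mentioned under Timings/Efficiency could follow a pattern like this sketch, where `sleep` stands in for hypothetical experiment commands and the experiment names are placeholders:

```shell
# Run each experiment once and collect all wall-clock timings in a
# single text file that the reviewer attaches to the submission.
RESULTS=timings.txt
: > "$RESULTS"                      # start from an empty results file

for exp in table2_run1 table2_run2; do
  start=$(date +%s%N)
  sleep 0.1                         # placeholder for the real experiment
  end=$(date +%s%N)
  printf '%s\t%d ms\n' "$exp" $(( (end - start) / 1000000 )) >> "$RESULTS"
done

cat "$RESULTS"
```

A single results file per run, plus a note of the hardware used, gives reviewers everything they need to compare against the paper's tables.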

Conduct the user experiment with all the protocols described in the paper and provided with the artifacts. Issues to be considered for specific cases:

  • Lab facilities and apparatus: If the paper includes a description of the facilities and apparatus used, the reviewer must replicate the setup as closely as possible.
  • Scales and tests: We encourage the authors to provide the scales and tests. There might be cases where the tests used in a paper cannot be publicly released. In this case, it is sufficient to provide a URL that allows users to apply for (or buy) the needed tests. A confidential copy of the tests should be sent to the reviewer.

Criteria for Awarding Results Replicated

Definition of Results Replicated [v1.1 as of August 24, 2020]: “The main results of the paper have been independently obtained in a subsequent study by a person or team other than the authors, without the use of author-supplied artifacts.”

Not implemented for the moment.

Artifact Evaluation Committee (AEC)

The Artifact Evaluation Committee (AEC) is a standing committee within the SIGIR group, responsible for reviewing and badging artifacts.

The AEC is constituted by:

  • AEC chair
  • AEC vice-chair
  • Senior members, including
    • ACM TOIS representative
    • ACM SIGIR representative
    • ACM CHIIR representative
    • ACM ICTIR representative
  • Junior members (PhD and early post-docs)

The bootstrap AEC members and their roles are appointed by the SIGIR Executive Committee.

The AEC chair and vice-chair are elected every 3 years by the AEC members.

TOIS/SIGIR/CHIIR/ICTIR representatives are invited by the AEC chair and vice-chair in agreement with the TOIS/SIGIR/CHIIR/ICTIR organizers.

Senior and junior members are invited by the AEC chair and vice-chair.

Selection of the chair/vice-chair should consider these guidelines:

  • each chair/vice-chair should be enthusiastic about artifact evaluation, i.e., they should not be chosen purely for their scientific standing
  • each chair/vice-chair should be experienced with and engaged in the process of artifact creation
  • preferably, the chair and vice-chair should represent different areas of the community, e.g. a system-oriented and a user-oriented angle, or academia and industry

Junior members are the key to successfully pushing the artifact review process forward:

  • they may be more familiar with the mechanics of modern-day software tools
  • they may have more time to conduct detailed experiments with the artifacts
  • participating in the review process might give them some perspective on the importance of artifacts, and influence the people likely to become the next generation of leaders

Current Composition

Chair and Vice-chair

  • Nicola Ferro, University of Padua, Italy [chair]
  • Johanne Trippas, University of Melbourne, Australia [vice-chair]

Senior Members

  • Alessandro Benedetti, Sease, UK
  • Rob Capra, University of North Carolina at Chapel Hill, USA
  • Diego Ceccarelli, Bloomberg, UK
  • Anita Crescenzi, University of North Carolina at Chapel Hill, USA
  • Charles L. A. Clarke, University of Waterloo, Canada
  • Yi Fang, Santa Clara University, USA
  • Norbert Fuhr, University of Duisburg-Essen, Germany
  • Claudia Hauff, Delft University of Technology, The Netherlands
  • Jiqun Liu, University of Oklahoma, USA
  • Maria Maistro, University of Copenhagen, Denmark
  • Miguel Martinez, Signal AI, UK
  • Parth Mehta, Parmonic, USA
  • Martin Potthast, Leipzig University, Germany
  • Tetsuya Sakai, Waseda University, Japan
  • Ian Soboroff, National Institute of Standards and Technology (NIST), USA
  • Paul Thomas, Microsoft, Australia
  • Andrew Trotman, University of Otago, New Zealand
  • Min Zhang, Tsinghua University, China

Junior Members

  • Valeriia Baranova, RMIT University, Australia
  • Arthur Barbosa Câmara, Delft University of Technology, The Netherlands
  • Hamed Bonab, University of Massachusetts Amherst, USA
  • Kathy Brennan, Google, USA
  • Timo Breuer, TH Köln, Germany
  • Guglielmo Faggioli, University of Padua, Italy
  • Alexander Frummet, University of Regensburg, Germany
  • Darío Garigliotti, Aalborg University, Denmark
  • Chris Kamphuis, Radboud University, The Netherlands
  • Johannes Kiesel, Bauhaus-Universität Weimar, Germany
  • Yuan Li, University of North Carolina at Chapel Hill, USA
  • Joel Mackenzie, University of Melbourne, Australia
  • Antonio Mallia, New York University, USA
  • David Maxwell, Delft University of Technology, The Netherlands
  • Felipe Moraes, Delft University of Technology, The Netherlands
  • Ahmed Mourad, University of Queensland, Australia
  • Zuzana Pinkosova, University of Strathclyde, UK
  • Chen Qu, University of Massachusetts Amherst, USA
  • Anna Ruggero, Sease, UK
  • Svitlana Vakulenko, University of Amsterdam, The Netherlands
  • Sasha Vtyurina, KIRA systems, Canada
  • Oleg Zendel, RMIT University, Australia
  • Steven Zimmerman, University of Essex, UK

Contacts

For any questions or clarifications, please contact us at:

  • aec_sigir@acm.org