Shared Open Vocabulary for Audio Research and Retrieval

The goal of the Shared Open Vocabulary for Audio Research and Retrieval (SOVARR) project is to investigate whether and how audio research communities would benefit from using interoperable file formats, data structures, vocabularies and ontologies; what the primary needs of MIR researchers are; and what the main barriers to the uptake of shared vocabularies are. The project aims to investigate user needs, rework a previously created Semantic Web ontology to reflect these needs, and increase its visibility, focussing on the music informatics field.

  • Audio researchers increasingly use common sets of feature extraction techniques to characterise audio material, while large data sets of features are released for scientific use.
  • Most data sets and research tools do not use shared open vocabularies and common data structures, but rely on ad-hoc file formats and information management solutions, chosen without considering interoperability, sustainability and research reproducibility.
  • The Shared Open Vocabulary for Audio Research and Retrieval (SOVARR) project aims to investigate:
    1. how the community would benefit from using interoperable file formats, vocabularies and ontologies,
    2. what the primary user needs are in the music informatics area,
    3. what the main barriers to the uptake of shared ontologies are.
  • We will update existing ontologies and research tools in light of our findings, and present tutorials.

Background
Researchers in audio (including speech, music, bioacoustics and environmental audio signal processing and retrieval) increasingly use a common set of feature extraction techniques to characterise audio material, while large data sets of features are commonly released for public and scientific use. The development of data sets and research tools, however, is not governed by shared open vocabularies and common data structures. Instead, they typically rely on ad-hoc file formats and information management solutions, chosen without considering sustainability and research reproducibility. This raises several issues, including

  • the need for adapting research code and research environments for a variety of different formats,
  • the difficulty of combining similar data sets from different sources to form larger data sets,
  • the difficulty of using complementary datasets together,
  • the lack of interoperability between research tools.

The problem affects several communities, including audio signal processing, audio information extraction and information retrieval researchers, as well as audio archives, libraries, broadcasters and creative industries that may utilise the outcomes of the above research activities. While the project will encourage feedback from different audio research fields and user communities at large, we recognise that its size does not allow equal engagement with all communities; the emphasis will therefore be placed on the field of music informatics and on vocabularies that support the feature extraction tools used by this community.

State of the art
Methods for describing content-based audio features, including the Sound Description Interchange Format, the MPEG-7 Audio Framework and ACE/XML, as well as generic structured data formats, are limited in their extensibility, modularity and interoperability. This limits their ability to support reproducible research, sustainable research tools, and the creation of shared data sets. The lack of machine-processable definitions of the meaning of vocabulary terms, diverse syntactic variations in the expressed data, and schema languages which do not facilitate extending vocabularies to support new research needs are among the most prominent problems (Troncy et al., 2007; Raimond, 2009; Fazekas et al., 2010). The use of Semantic Web technologies for creating shared open vocabularies allows these issues to be overcome; such vocabularies therefore have the potential to gain wide appeal across research communities and among developers of sustainable research tools.

Leveraging data published as Linked Data (Berners-Lee, 2006) by fusing cultural, contextual and content-based information is becoming increasingly important in music research. Furthermore, automated reasoning and data aggregation have the potential to simplify experiments and increase productivity in research activities that traditionally rely on Web scraping, proprietary application programming interfaces (APIs) and manual data collection (see e.g. Collins, 2010). In these use cases the association of features with audio data, or with temporal annotations thereof, is a common requirement. The harmonisation of these mechanisms is therefore the first step towards creating shared vocabularies, linkable data sets, and interoperable research tools. Open and extensible Semantic Web ontologies can facilitate the harmonisation of different vocabularies by directly accommodating, or indirectly referring to, terms defined elsewhere.
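As a minimal illustration of this kind of fusion, the sketch below (assuming Python with the rdflib library) merges a contextual description of a track with a content-derived annotation that refers to the same track URI, and queries the combined graph. All example.org URIs are hypothetical, and the property af:key is an approximation of an Audio Features Ontology term used here for illustration rather than as a definitive term.

```python
# Minimal sketch: fusing contextual and content-derived descriptions of the same
# resource, assuming the rdflib library. Ontology term names (mo:Track, af:key)
# are illustrative approximations, and all example.org URIs are hypothetical.
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF

MO = Namespace("http://purl.org/ontology/mo/")      # Music Ontology
AF = Namespace("http://purl.org/ontology/af/")      # assumed Audio Features namespace
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

track = URIRef("http://example.org/track/123")      # hypothetical track URI

# Contextual (editorial) metadata, e.g. obtained as Linked Data
context = Graph()
context.add((track, RDF.type, MO.Track))
context.add((track, FOAF.maker, URIRef("http://example.org/artist/42")))

# Content-derived annotation produced by a feature extractor
features = Graph()
features.add((track, AF.key, Literal("D minor")))   # illustrative property

# Fusion: merge the two graphs and query the combined description
merged = Graph()
for source in (context, features):
    for triple in source:
        merged.add(triple)

query = """
SELECT ?maker ?key WHERE {
  ?track <http://xmlns.com/foaf/0.1/maker> ?maker ;
         <http://purl.org/ontology/af/key> ?key .
}
"""
for row in merged.query(query):
    print(row.maker, row.key)
```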

Existing ontologies and tools
To address a more general problem regarding interoperability between music-related data sources, the Centre for Digital Music (C4DM) developed a Semantic Web ontology called the Music Ontology during the EPSRC-funded Online Music Recognition and Searching 2 (OMRAS2) project. The ontology was adopted by researchers, user communities and Semantic Web developers, as well as by industry, including the BBC and its music website. It is integrated into several research projects, for instance the Networked Environment for Music Analysis (NEMA), funded by the Andrew W. Mellon Foundation, and the JISC-funded Linked Music Metadata projects. Furthermore, several music-related data sets were released using this ontology, including DBTune and Automatic Annotations.

The Music Ontology is the core of a harmonised library of modular ontologies. This library relies on widely adopted Semantic Web ontologies such as the Friend of a Friend (FOAF) vocabulary, as well as domain specific ontologies for describing intellectual works (FRBR) and complex associations of domain objects with time-based events (Event and Timeline ontologies). The library also provides a set of extensions describing music specific concepts including music similarity (Jacobson et al., 2009) and the production of musical works in the recording studio (Fazekas et al., 2011).

The Audio Features Ontology was created within this framework. It provides a model for describing acoustical and musicological data and allows content-derived information about audio recordings to be published. Its aim is to provide a framework for communication; therefore it is free from deep taxonomical organisation and focuses on representational issues, including the density of audio features (i.e. dense or sparse) and their temporal relation to audio signals (i.e. onset-like events perceived as instantaneous, and segments with a known duration, for instance keys, chords and elements of musical structure).
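The following sketch (assuming Python with the rdflib library) illustrates this representational focus: an onset is attached to an instant on a signal's timeline, while a chord is attached to an interval with a start time and duration. The class and property names used (af:Onset, tl:Instant, tl:at and so on) are approximations of terms from the Audio Features, Event and Timeline ontologies, shown for illustration rather than as definitive vocabulary.

```python
# Minimal sketch of instantaneous vs. duration-bearing feature annotations,
# assuming rdflib. Term names below approximate the Audio Features (af:),
# Event (event:) and Timeline (tl:) ontologies and are illustrative only.
from rdflib import Graph, Namespace, Literal, URIRef, BNode
from rdflib.namespace import RDF, XSD

AF = Namespace("http://purl.org/ontology/af/")
EV = Namespace("http://purl.org/NET/c4dm/event.owl#")
TL = Namespace("http://purl.org/NET/c4dm/timeline.owl#")

g = Graph()
for prefix, ns in (("af", AF), ("event", EV), ("tl", TL)):
    g.bind(prefix, ns)

timeline = URIRef("http://example.org/signal/1#timeline")   # hypothetical signal timeline

# Sparse, instantaneous feature: a note onset at 1.25 seconds
onset, instant = URIRef("http://example.org/annotation/onset/1"), BNode()
g.add((onset, RDF.type, AF.Onset))
g.add((onset, EV.time, instant))
g.add((instant, RDF.type, TL.Instant))
g.add((instant, TL.onTimeLine, timeline))
g.add((instant, TL.at, Literal("PT1.25S", datatype=XSD.duration)))

# Sparse feature with a known duration: a chord spanning 2.0 to 4.5 seconds
chord, interval = URIRef("http://example.org/annotation/chord/1"), BNode()
g.add((chord, RDF.type, AF.ChordSegment))                    # illustrative class name
g.add((chord, EV.time, interval))
g.add((interval, RDF.type, TL.Interval))
g.add((interval, TL.onTimeLine, timeline))
g.add((interval, TL.beginsAtDuration, Literal("PT2.0S", datatype=XSD.duration)))
g.add((interval, TL.durationXSD, Literal("PT2.5S", datatype=XSD.duration)))

print(g.serialize(format="turtle"))
```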

Open source research tools that produce and interpret data adhering to this ontology were also created and deployed. These include Sonic Visualiser, Sonic Annotator and a Web-based demonstrator called SAWA (Sonic Annotator Web Application).

However, despite the obvious benefits of using a shared open format, the adoption of this ontology remains relatively limited, owing to open issues and its limited exposure to the audio research communities. The Audio Features Ontology was created without wider community involvement, its domain boundaries are fuzzy, and its vocabulary is incomplete with regard to user needs within the audio research communities: it lacks terms for most of the features music researchers actually use.

Further work is required to improve the vocabulary's coverage of features commonly used by researchers, to enable better harmonisation with existing research tools, and to support a wider range of research data sets and use cases. This work shall be carried out with strong community involvement, and with a view to extending the vocabulary to cover most state-of-the-art feature extraction techniques (see Mitrovic et al., 2010 for a recent review and taxonomy).

Users and needs
The primary users of the Audio Features Ontology within the HE sector are researchers working in the Music Information Retrieval and broader Music Informatics fields. However, uptake can be anticipated in the wider audio research communities, including speech processing, bioacoustics and environmental sound analysis.

Although the scope of engagement within this project will be limited to music informatics researchers, we expect the core vocabulary, as well as the methodologies and recommendations produced by this project, to be applicable to other areas of audio research and to the wider JISC and HE communities, for instance musicologists using computational tools to analyse musical audio.

Releasing and maintaining shared data sets is a prerequisite for scientific advancement and a crucial component of the research workflow in most data-centric fields. In music informatics this can be observed in the release of ever larger data sets, such as the Million Song Dataset (MSD) (Bertin-Mahieux et al., 2011) or the Structural Analysis of Large Amounts of Music Information (SALAMI) data set (Smith et al., 2011), released as part of the Digging into Data programme with JISC, NSF and SSHRC funding. There are also Web Application Programming Interfaces (APIs), such as Canoris or EchoNest Analyze, that researchers increasingly use as reference points. These data sets and services, however, use task-specific formats selected purely for representational convenience or other narrowly defined criteria. These formats do not assist in linking information to support cross-domain analyses and do not facilitate sustainable and reproducible research.

The following anonymised quote from the Music-IR mailing list demonstrates the problem in a typical use case scenario: “Hi! I am trying to reproduce the MSD feature extraction. I'm using pyechonest to access the EchoNest API. As far as it seems, the analyze method of the API only returns a single feature vector for each feature for a certain segment. How can I rebuild the time segmentation that has been applied to the MSD tracks to retrieve a similar list of vectors for each feature? Currently it seems I have to split each audio file into multiple 1-second files and analyze each of them separately.” It is obvious from this snippet that the researcher is unsure about how the features were computed, and struggles to cope with conceptual differences between audio feature representations, even though they originate from the same source. Encoding differences, such as the binary HDF5 file format of the MSD and the JavaScript Object Notation (JSON) responses of the EchoNest API, give rise to further confusion. Finally, the reliance on services using distinct conceptual models and encodings leads to unsustainable code (and thus loss of effort) when such services are discontinued, as is the case with Canoris.

Nevertheless, there are certain real-world constraints on how data sets are published. It may be impractical, for instance, to release very large amounts of numerical data using a text-based encoding such as RDF or various XML dialects; this is why the MSD uses a compact binary format. It is a common misconception, however, that this prohibits the use of shared ontologies. Binary formats can be adapted to refer to terms in shared vocabularies, just as ontologies can be developed to describe binary data structures. There is a real user need to describe large data sets using Semantic Web ontologies, and the project will investigate issues related to publishing large data sets in its requirements analysis.
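A minimal sketch of such a hybrid arrangement is given below (assuming Python with the h5py and rdflib libraries): the bulk numerical data remains in a compact HDF5 container, while a small RDF description records what the data represents and where it is stored. The properties under the ex: namespace belong to a hypothetical vocabulary invented for this illustration, and af:Chromagram is an assumed class name rather than a confirmed ontology term.

```python
# Minimal sketch: binary payload in HDF5, meaning and location described in RDF.
# The ex: vocabulary is hypothetical; af:Chromagram is an assumed class name.
import h5py
import numpy as np
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD

# 1. Write dense feature data (e.g. chroma vectors) into a binary container
chroma = np.random.rand(1000, 12).astype(np.float32)
with h5py.File("features.h5", "w") as f:
    f.create_dataset("analysis/chroma", data=chroma)

# 2. Publish a lightweight RDF description that refers to the binary payload
EX = Namespace("http://example.org/vocab/")        # hypothetical vocabulary
AF = Namespace("http://purl.org/ontology/af/")     # assumed AF namespace

g = Graph()
g.bind("ex", EX)
g.bind("af", AF)

dataset = URIRef("http://example.org/dataset/demo/track-1/chroma")
g.add((dataset, RDF.type, AF.Chromagram))          # illustrative class name
g.add((dataset, EX.storedIn, Literal("features.h5")))
g.add((dataset, EX.hdf5Path, Literal("analysis/chroma")))
g.add((dataset, EX.frameCount, Literal(chroma.shape[0], datatype=XSD.integer)))
g.add((dataset, EX.dimensions, Literal(chroma.shape[1], datatype=XSD.integer)))

g.serialize(destination="features.ttl", format="turtle")
```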

Content-based audio retrieval also calls for the use of shared ontologies. The AudioDB project, for instance, allows access to audio content given a set of features; however, it provides no means of describing the features used as the basis for retrieval, nor a way to disclose the semantics of the features a database may hold. If content-based features are to be used in audio retrieval, especially in a distributed manner, a common agreement on how features are described is vital.
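By way of illustration, a shared vocabulary would let a retrieval service advertise which features it indexes and how they were computed, so that clients can judge whether their own query features are comparable. The sketch below (again assuming rdflib) uses a hypothetical ex: vocabulary and an assumed af:MFCC class purely for illustration.

```python
# Minimal sketch: disclosing which features a retrieval database holds and how
# they were computed. The ex: properties are hypothetical; af:MFCC is assumed.
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/vocab/")        # hypothetical vocabulary
AF = Namespace("http://purl.org/ontology/af/")     # assumed AF namespace

g = Graph()
g.bind("ex", EX)
g.bind("af", AF)

index = URIRef("http://example.org/audiodb/instance-1")
g.add((index, RDF.type, EX.FeatureIndex))
g.add((index, EX.indexedFeature, AF.MFCC))          # illustrative feature class
g.add((index, EX.windowSize, Literal(2048, datatype=XSD.integer)))
g.add((index, EX.hopSize, Literal(512, datatype=XSD.integer)))
g.add((index, EX.sampleRate, Literal(44100, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```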

Identifying barriers
There are also direct research needs that may present barriers to the uptake of ontologies. Researchers typically aim to publish quickly and frequently, and therefore seek “quick and dirty” solutions in their daily workflows. This often leads to ad-hoc methods for representing research data, while the use of standardised methods is seen as an obstruction or a “waste of time”. This view is, however, short-sighted, since time gained during development is often lost in data preparation, data management, or when code is reused at a later stage. To overcome this obstacle, it is necessary to involve communities in the development of shared vocabularies and to educate researchers about their benefits and use. The adoption of ontologies could also be accelerated by republishing existing data sets using a shared vocabulary, once agreement on this vocabulary has been reached.

The complexity of large standards such as MPEG-7, and the time it takes to read and understand their documentation, has been a barrier that led to limited uptake in the past. The use of precisely scoped ontologies can alleviate this problem, especially if they are presented in an open, modular and harmonised manner that allows easy adaptation to domain- or task-specific requirements.

The wide variety of programming languages used in audio research is an additional obstacle to the uptake of shared ontologies. Researchers typically convert their data into the format most transparently handled by the language of their choice. This problem can be mitigated by creating software libraries that hide complexity from the end user (a task outside the scope of this proposal), as well as audio annotators, data visualisers, (Semantic) Web services and data repositories that rely on shared open vocabularies.

Engagement with community
This project is community-focussed and will maintain a high level of engagement with the research community throughout its lifetime and beyond. We regard insufficient communication and limited dissemination as the chief barriers to the uptake of shared vocabularies. We will therefore ensure that communities are involved in core decisions, that requirements are collected and problems openly discussed on public forums and the project Wiki, and that the outcomes of debates are posted on our blog.

The project’s output will be disseminated online and via tutorials; a tutorial beyond the life of this project is planned for ISMIR 2013. We will also draw upon our close relationships with UK universities active in the field, the ISMIR society, and our involvement in the BBC Audio Research Partnership. We will engage with the JISC community via the synthesis project and by attending JISC programme and strand meetings.

The primary stakeholders in the project are people who use information about musical recordings. In the HE sector, this includes Music Informatics researchers and practitioners, who develop new algorithms for analysing, navigating, manipulating and understanding musical works and collections; and music teachers, students and researchers, for whom recordings are works of art exemplifying the performers’ technical mastery and interpretative skill. Other stakeholders include the Linked Data community and the wider JISC community.

Sustainability
The software output and ontology of this project will be hosted in open-source repositories with long-term funding, while the ontology itself has the potential to contribute to sustainable data management. Open access journals and funding bodies (including JISC) encourage the release of code and data associated with scientific publications, which requires easy-to-use repositories and clear data management policies. The Centre for Digital Music is currently managing a JISC-funded project for this purpose, “Sustainable Management of Digital Music Research Data”, as well as the software-focussed “Sustainable Software for Digital Music and Audio Research” project (EPSRC grant EP/H043101/1). These projects will provide us with the infrastructure to ensure the maintenance of the current project’s outputs at least until 2014. In general, however, the utility of repositories and data management policies may remain limited if research data is published in a wide variety of formats. This project addresses that issue.

References
Berners-Lee, T. (2006). Linked Data. Available online: http://www.w3.org/DesignIssues/LinkedData.html.

Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., Lamere, P. (2011). The Million Song Dataset. 12th International Society for Music Information Retrieval Conference (ISMIR 2011), Miami, Florida, USA.

Collins, N. (2010). Computational analysis of musical influence: A musicological case study using MIR tools. 11th International Society for Music Information Retrieval Conference (ISMIR 2010), Utrecht, the Netherlands.

Downie, J. S. (2008). The Music Information Retrieval Evaluation Exchange (2005-2007): A window into music information retrieval research. Acoustical Science and Technology, 29(4): 247–255.

Fazekas, G., Raimond, Y., Jacobson, K., Sandler, M. (2010). An overview of Semantic Web activities in the OMRAS2 project. Journal of New Music Research, special issue on Music Informatics and the OMRAS2 Project, 39(4), 295–311.

Fazekas, G., Sandler, M. (2011). The Studio Ontology Framework. 12th International Society for Music Information Retrieval Conference (ISMIR 2011), Miami, Florida, USA.

Jacobson, K., Raimond, Y., Sandler, M. (2009). An ecosystem for transparent music similarity in an open world. 10th International Society for Music Information Retrieval Conference (ISMIR 2009), Kobe, Japan.

Mitrovic, D., Zeppelzauer, M., Breiteneder, C. (2010). Features for content-based audio retrieval. Advances in Computers, vol. 78, pp. 71–150.

Raimond, Y., Abdallah, S., Sandler, M., Giasson, F. (2007). The Music Ontology. 8th International Conference on Music Information Retrieval, pages 417–422.

Raimond, Y. (2009). A Distributed Music Information System. PhD thesis, Queen Mary University of London, Centre for Digital Music.

Smith, J. B. L., Burgoyne J. A., Fujinaga I., De Roure, D., Downie J. S. (2011). Design and Creation of a Large-scale Database of Structural Annotations. 12th International Society for Music Information Retrieval Conference (ISMIR 2011), Miami, Florida, USA.

Troncy, R., Celma, O., Little, S., Garcia, R., Tsinaraki, C. (2007). MPEG-7 based multimedia ontologies: Interoperability support or interoperability issue? 1st Workshop on Multimedia Annotation and Retrieval enabled by Shared Ontologies, Genova, Italy.