An important part of this research project is the distribution of the corpus, including metadata from the questionnaires and audiometry. The metadata describe the properties of each recording, such as the speakers’ (anonymous) identifiers, their language backgrounds and hearing details, their self-assessed properties, and technical details of the recording.
The longitudinal nature of the corpus requires links between a single speaker and up to five recording sessions (with each session yielding multiple audio files), and most metadata also vary across sessions. We also had to define new properties of our speech files and of our speakers, to describe concepts for which no pre-defined properties or data categories were available.
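The links between speakers, sessions, and audio files described above could be represented roughly as follows. This is only a minimal sketch under assumptions: the class and field names (`Speaker`, `Session`, `audio_files`, `metadata`) are hypothetical and are not taken from the corpus distribution; only the limit of five sessions per speaker comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAX_SESSIONS = 5  # from the text: up to five recording sessions per speaker


@dataclass
class Session:
    """One recording session (hypothetical layout); metadata may differ per session."""
    number: int                                            # 1..MAX_SESSIONS
    audio_files: List[str] = field(default_factory=list)   # several files per session
    metadata: Dict[str, object] = field(default_factory=dict)


@dataclass
class Speaker:
    """An anonymous speaker linked to all of his or her sessions."""
    sid: str
    sessions: Dict[int, Session] = field(default_factory=dict)

    def add_session(self, session: Session) -> None:
        # Reject session numbers outside the longitudinal design and duplicates.
        if not 1 <= session.number <= MAX_SESSIONS:
            raise ValueError(f"session number out of range: {session.number}")
        if session.number in self.sessions:
            raise ValueError(f"duplicate session: {session.number}")
        self.sessions[session.number] = session

    def all_audio_files(self) -> List[str]:
        # Collect audio files in session order.
        return [name
                for number in sorted(self.sessions)
                for name in self.sessions[number].audio_files]
```

Storing the metadata on the session rather than on the speaker reflects the observation that most metadata vary across sessions.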
One example of such a new concept is
which is TRUE if speaker SID regards SID (himself or herself) as a native speaker of language LID, and FALSE otherwise (including the combination of speaker SID and a language LID which speaker SID is competent in, but which SID has acquired after childhood).
This particular concept describes a property of the combination of a speaker and a language. It proved necessary because multilingual speakers assessed themselves differently in different recording sessions.
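The self-assessed native-speaker concept described above might be stored and queried along the following lines. This is an illustrative sketch, not the project's implementation: the function names `self_assessed_native` and `assessment_varies`, the key layout, and the choice to treat a missing answer as FALSE are all assumptions made here.

```python
from typing import Dict, Tuple

# Hypothetical key: (speaker id SID, language id LID, session number).
Key = Tuple[str, str, int]


def self_assessed_native(answers: Dict[Key, bool],
                         sid: str, lid: str, session: int) -> bool:
    """TRUE only if the speaker reported being a native speaker of the
    language in that session; FALSE otherwise (assumption: a missing
    answer also counts as FALSE)."""
    return answers.get((sid, lid, session), False)


def assessment_varies(answers: Dict[Key, bool], sid: str, lid: str) -> bool:
    """Report whether the speaker's self-assessment for this language
    differs between recording sessions."""
    values = {value
              for (s, l, _session), value in answers.items()
              if s == sid and l == lid}
    return len(values) > 1
```

Keying the answers by session as well as by speaker and language makes it possible to record a self-assessment that changes from one session to the next.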
A list of all 30 new concepts and properties is available from the ISOcat repository of data categories, at http://www.isocat.org/rest/dcs/649 .