Commons:Batch uploading/OpenCultureDataNL


Open Culture Data to Commons

These are the notes of a meeting held on 15 May 2014 between the Open Culture Data network in the Netherlands (OCDNL), represented by Maarten Brinkerink (Netherlands Institute for Sound and Vision) and Lex Slaghuis (Open State Foundation), and Sebastiaan ter Burg (WMNL).

OCDNL has built an infrastructure where data from (their) cultural partners is accessible through their API. In other words, their API is an index of datasets of Dutch GLAMs; see [http://www.opencultuurdata.nl/datasets/ Datasets] for a list of current data suppliers. Source data is stored and is searchable. They are researching whether universal properties offer enough information (both the properties themselves and their contents) to work with.
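
Below is a minimal sketch of querying that API from Python. The base URL, the shape of the POST /search request and the Elasticsearch-style response layout are assumptions and should be checked against the current API documentation before use.

import requests

API = "http://api.opencultuurdata.nl/v0"  # assumed base URL

def search(term, size=10):
    # Full-text search across all indexed datasets (assumed request shape).
    resp = requests.post(API + "/search", json={"query": term, "size": size})
    resp.raise_for_status()
    return resp.json()

# Assumed Elasticsearch-style response: {"hits": {"hits": [{"_source": {...}}]}}
for hit in search("vlammenwerper")["hits"]["hits"]:
    print(hit["_source"].get("title"))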

OCDNL would like to upload the available content of their partners to Wikimedia Commons at short notice. This could involve several million files. There are multiple reasons why this would or would not be a good plan, which have to be weighed before making a decision. These pros and cons are written down on this page. It is suggested to find out whether there are technical or content-related limitations before starting a discussion about this plan.

Desirability

The first question is whether it is desirable to upload an amount of content this large to Commons.

Disadvantages

Related to the data
  1. The amount of metadata the OCDNL API can offer is limited (see below for details). There is currently no way to update the metadata of files without throwing away information that has been added to the file after the upload, including categorisation.
Related to Wikimedia Commons community
  1. Checking an upload of this size potentially causes a lot of work for the Commons community.
  2. The experience with large content donations is that a lot of the content is not being used in articles on the Wiki platforms.
Related to the GLAM community opportunities
  1. Some GLAMs in the Netherlands prefer to split their content donation in smaller batches. This way they can increase the amount and quality of metadata that is shared.
  2. Every upload can be an occasion to organise an event to promote the use of the content on the platforms and to ask volunteers to work with it. These events play a role in attracting new volunteers. The willingness to organise or participate in an event could diminish when the content is already available.

Advantages

  1. All available content will surface eventually. Some people prefer to have the content on the platforms first and worry about metadata later.
  2. There is word of an integration of Wikidata and Wikimedia Commons, which would make it possible to update the metadata at a later moment. It could take several years before this functionality is available. All currently available metadata (more than the OCDNL API offers) should be backed up and stored.
  3. Having the content on Commons could also lower the threshold for some GLAMs to organise an event.

Datasets

A quick review (in Dutch) of the data available through the OCD API shows that only a limited amount of it is suited for an upload. WMNL is currently cooperating with a couple of GLAMs to upload their data. Some datasets might not be suited for upload.

Metadata

The amount of metadata that is offered with the "content push" by OCDNL is deliberately limited. The idea behind this choice is that there is greater certainty that this metadata is available, correct and complete. The question is whether the offered metadata is enough to work with on the Wiki platforms.


OCD API

The OCD API offers the following metadata:

meta
    rights: ...
    original_object_urls
        html: ...
        json: ...
    source: ...
    original_object_id: ...
title: ...
description: ...
date: ISO date (yyyy-mm-dd hh:mm:ss)
date_granularity: 0-14, where 1 = century, 4 = exact year, 6 = exact month, 8 = exact day, 10 = exact hour, 12 = exact minutes, 14 = exact seconds
media_urls
    url: ...
content_type: ...
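
As an illustration, the sketch below maps one such record onto a Commons {{Information}} block. The record layout follows the field listing above; the exact JSON shape of a real API response may differ, and the date truncation is one interpretation of the granularity table.

def to_information(record):
    # record: dict with the fields listed above (assumed layout).
    meta = record.get("meta", {})
    date = record.get("date", "")
    gran = record.get("date_granularity", 0)
    # Truncate the ISO date to the precision the granularity value vouches for.
    if gran <= 4:
        date = date[:4]    # century or year: keep yyyy at most
    elif gran == 6:
        date = date[:7]    # yyyy-mm
    else:
        date = date[:10]   # yyyy-mm-dd (finer granularities keep the day)
    return ("{{Information\n"
            "|description=" + record.get("description", "") + "\n"
            "|date=" + date + "\n"
            "|source=" + meta.get("original_object_urls", {}).get("html", "") + "\n"
            "|author=\n"  # an author field is not offered by the OCD API; left for volunteers
            "}}")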

Wikimedia Commons

An example of a good, but limited, XML record from a GLAM:

<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:europeana="http://www.europeana.eu" xmlns:dcterms="http://purl.org/dc/terms/">
<record>
   <dc:creator>Ford-[3D]</dc:creator>
   <dc:date>[1943]</dc:date>
   <dc:identifier>054396</dc:identifier>
   <dc:subject>voertuigen</dc:subject>
   <dc:subject>rupsvoertuigen</dc:subject>
   <dc:subject>gevechtsvoertuig</dc:subject>
   <dc:subject>vlammenwerper</dc:subject>
   <dc:title>Vlammenwerper Wasp Mk II</dc:title>
   <dc:type>creatie event</dc:type>
   <europeana:country>netherlands</europeana:country>
   <europeana:dataProvider>Legermuseum</europeana:dataProvider>
   <europeana:isShownAt>http://lm.rnaviewer.net/nl/item?uri=http://www.rnaproject.org/data/df54489c-3632-4169-8333-c1c7481ecf6d</europeana:isShownAt>
   <europeana:isShownBy>http://afbeeldingen.collectie.legermuseum.nl/wwwopac.exe?thumbnail=%5C%5CBuffel%5Cimages$%5Cvoertuigen%5C054396_003.jpg</europeana:isShownBy>
   <europeana:language>nl</europeana:language>
   <europeana:object>http://afbeeldingen.collectie.legermuseum.nl/wwwopac.exe?thumbnail=%5C%5CBuffel%5Cimages$%5Cvoertuigen%5C054396_003.jpg</europeana:object>
   <europeana:provider>Digitale Collectie</europeana:provider>
   <europeana:rights>http://creativecommons.org/licenses/by/3.0/</europeana:rights>
   <europeana:type>IMAGE</europeana:type>
   <europeana:uri>http://www.europeana.eu/resolve/record/2021605/BA270D6E07B95B6B9046595E8DDB95277796B863</europeana:uri>
   <europeana:year>1943</europeana:year>
</record>
</metadata>
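
The fields of such a record can be read with standard tooling. The sketch below uses Python's ElementTree with the namespace URIs declared on the <metadata> element above; the filename record.xml is hypothetical.

import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "europeana": "http://www.europeana.eu",
}

# The example record above, saved locally as record.xml.
record = ET.parse("record.xml").getroot().find("record")

title = record.findtext("dc:title", namespaces=NS)
subjects = [s.text for s in record.findall("dc:subject", NS)]
image_url = record.findtext("europeana:object", namespaces=NS)
license_url = record.findtext("europeana:rights", namespaces=NS)
print(title, subjects, image_url, license_url)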

Categorisations

Commons relies heavily on categorisation of the content. This is one of the reasons why it is advised to split content donations into smaller batches. Some examples of variables for categorisation are:

  1. GLAM
  2. Author / artist
  3. Content type (there are templates available for a range of content types)
  4. License (e.g. CC BY-SA, CC BY, CC0, PD)

This donation would not allow categorisation beforehand; it would have to be done afterwards by volunteers. It should be checked whether the metadata offers enough information for categorisation after the upload.
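
A rough sketch of what such after-the-fact categorisation could look like, based on the four variables above. The subject-to-category and license tables are hypothetical, hand-maintained mappings that a real upload would need to have reviewed by the community.

# Illustrative only: the mapping tables below are invented for this sketch.
SUBJECT_MAP = {
    "vlammenwerper": "Flamethrowers",
    "rupsvoertuigen": "Tracked vehicles",
}
LICENSE_MAP = {
    "http://creativecommons.org/licenses/by/3.0/": "CC-BY-3.0",
}

def categories(record):
    cats = []
    provider = record.get("dataProvider")            # 1. GLAM
    if provider:
        cats.append("Images from the " + provider)
    for subject in record.get("subjects", []):       # 2./3. topic-based
        if subject in SUBJECT_MAP:
            cats.append(SUBJECT_MAP[subject])
    lic = LICENSE_MAP.get(record.get("rights", ""))  # 4. license
    if lic:
        cats.append(lic)
    return ["[[Category:" + c + "]]" for c in cats]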

Technical limitations

The XML offered by the OCDNL API has to be flat: GWToolset, the tool used for batch uploads of this kind, cannot handle nested elements.
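
A minimal sketch of producing such flat XML from a nested record, assuming nested keys are joined with underscores into single-level element names (that naming convention is an assumption, not a documented requirement):

import xml.etree.ElementTree as ET

def flatten(d, parent=""):
    # Yield (flat_name, value) pairs for every leaf in a nested dict.
    for key, value in d.items():
        name = parent + "_" + key if parent else key
        if isinstance(value, dict):
            yield from flatten(value, name)
        else:
            yield name, str(value)

def to_flat_xml(record):
    rec = ET.Element("record")
    for name, text in flatten(record):
        ET.SubElement(rec, name).text = text
    return ET.tostring(rec, encoding="unicode")

# {"meta": {"source": "x"}} becomes <record><meta_source>x</meta_source></record>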

Possible strategies

  1. Upload as much as possible
  2. No upload with the OCDNL API
  3. Inform the GLAMs about the advantages and disadvantages and let them choose if they want their content to be uploaded with the known limitations
  4. Divide into batches by institution, see if extra metadata is available by other means (the example given above links to https://www.rijksmuseum.nl/nl/collectie/SK-A-1405 and a JSON API version, which provides sufficient metadata), and take an upload decision for each of these batches (maybe yes when the metadata is sufficient, after contacting the institution if not).

Comments

  •  Weak oppose for collections where a batch uploader could reasonably do a better job by uploading from the original source archive. In the Rijksmuseum example used above, more complete data is available from the museum's API, and there is a mapping and API-to-XML generator that makes it possible to apply the GWToolset. Refer to Commons:Batch_uploading/Art_of_Japan_in_the_Rijksmuseum, which is likely to start uploading content in a week or two depending on some minor enhancements planned. The same discussion applies to uploads from Europeana, where the collections are effectively subsets of original archives that may not be complete or contain the most recent data. With regard to the above point, a significant proportion of the set-up and passive testing effort for uploads goes into ensuring that there is at least a basic level of categorization (if not a "good" one) at the time of upload, or a specific phase of the project to ensure this happens as part of post-upload housekeeping.
Before someone points this out: the same work could generate replacement text for a previous batch upload; however, this massively increases the cross-checking needed. Experienced uploaders would have no incentive to invest their volunteer time in effectively sorting out someone else's housekeeping, and would move on to another project from our large backlog. A consequence may well be that significant collections uploaded to Commons, like the Rijksmuseum, would never be completed with more than the basic metadata that happened to be mirrored by an aggregator website. -- (talk) 17:30, 20 May 2014 (UTC)
@: Question: Is it within your capabilities (danger, danger!) to create a web tool like the FIST tool or Flickr2Commons for this project? I am thinking of something with restricted access for users that know how Commons works, where you can see thumbnails and add categories before upload. I know it takes a little longer to finish, but it would keep housekeeping to a minimum. Just an unregulated brainwave. Or brainfart, depending on your answer. :) --Hedwig in Washington (mail?) 05:47, 21 May 2014 (UTC)
Possible, but it looks like a lot of work (note how long Magnus has been debugging F2C) compared to getting on with the upload with basic categories and encouraging volunteers, via project pages and village pump discussions, to join in using HotCat or similar. Barnstars, prizes or just "burn-down" progress thermometers, like the DoD example below, might help persuade participants that they are helping towards a shared goal (we need to find more volunteers prepared to spend time on this community communications side; it is the sort of thing that interns or WIRs hosted by chapters could get on with, without needing software engineering skills). Had a web tool like this been created for, say, the 22,000 images from LACMA, I doubt that even a year later more than 10% would have been uploaded by anyone other than myself. This may seem cynical, but it is based on what can be seen of how much of the larger collections volunteers get around to handling beyond cherry-picking (not that there is anything wrong with that; in fact, 10% reuse levels would be very good for any >10,000-image batch upload).

(Progress bar: 96% Department of Defense image backlog burndown - help!)

Keep in mind that experience shows batch uploading volunteers have a limited span of energetic contributions before they move on to other things or just vanish (paid batch uploaders have an even shorter span, often limited to their paid contract). It is probably a priority for us batch uploaders to ensure that any specific collection is fully uploaded with the best possible metadata, to meet our project goal of preservation. The issue of good categorization is something that can be worked on long-term by others after the uploader might have moved on, especially if searching and sorting is enabled by good quality metadata on the image page, even if not ideal.
As an example, I can see that Sanborn maps of Staten Island was created this morning and populated with 190 map images from the NYPL maps project that was promoted on the Village pump a week ago. This is great, using more time and local knowledge than the batch uploader can be expected to have; however, it is not part of a coordinated programme, the categorizer does not get any public recognition as encouragement to do more, and it represents just 1.5% of the images to be organized. -- (talk) 07:11, 21 May 2014 (UTC)