User talk:Multichill/Same image without Wikidata

From Wikimedia Commons, the free media repository

Hi Multichill, at first I thought that this task could be implemented as a game on "The Distributed Game", but now I understand that it only allows actions on Wikidata. However, I see a large amount of work on Commons that could be gamified. I like the way you do it on your site. Maybe we could have more of these little "games" and create a hub where people can find pages like yours. What do you think? --Arnd (talk) 11:28, 31 August 2016 (UTC)

Sure, that sounds like a good plan, go ahead! I write these kinds of things every once in a while. Some work and stay around, others don't and I just abandon them. Multichill (talk) 19:21, 31 August 2016 (UTC)

Thanks

… a lot for this work. The need to index "duplicate" artwork images has been on my mind for a long time. A cleaner approach might be to create the missing categories and use {{Category definition: Object}} and {{Object photo}}, and best of all, of course, to transfer all information to Wikidata and retrieve it from there. Still, I think this is a very valuable start, so thanks! --Marsupium (talk) 10:17, 1 September 2016 (UTC)

I'm happy you like it, but I would never call {{Category definition: Object}} and {{Object photo}} clean. That's one complex beast! I hope that {{Artwork}}, {{Creator}} and {{Institution}} will soon start grabbing the relevant data from Wikidata so we don't have to keep redundant data here. Actually, we're already doing that for authority control. The more images get linked to Wikidata this way, the more interesting and rewarding it will become to update {{Artwork}}. Multichill (talk) 20:05, 1 September 2016 (UTC)
Actually, I have made a Lua version of {{Artwork}} that should make the use of Wikidata easier. However, it breaks wikitables, which are widely used, especially in licence templates, so I don't know what to do (see Module talk:Artwork). --Zolo (talk) 07:44, 2 September 2016 (UTC)

Broken links to images

Hi Multichill, there are several broken links to images in the current table. It seems to be due to an encoding problem, because the correct filenames contain, for example, German umlauts (as in Dürer). Thanks, --Arnd (talk) 05:19, 9 September 2016 (UTC)

Arnd: Yup, I ran into some encoding fun. Yesterday I noticed that the bot hadn't updated this and other pages for a couple of days. I run several MySQL queries and write the output to text files encoded as UTF-8. I then read these files in to work on them. This caused no problems until a couple of days ago, when I found the nice error "UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 950998: invalid continuation byte" in the logs. To at least make the bot run again, I'm putting the offending output in this file and using "iconv -f LATIN1 -t UTF-8 paintings_without_wikidata_cia_org.txt paintings_without_wikidata_cia.txt" to produce this file. This is just a workaround to keep the bot from crashing.
The source of this is probably the fun way MediaWiki tends to encode things. I probably have to do something like the workaround at mw:Toolserver:Code_snippets#Fix_UTF-8_encoded_as_latin-1 to take care of the fact that I use cl_sortkey_prefix, which is of type varbin. Code snippet (in Python) to reproduce:
# Python 2 snippet: the decode on the last line raises the UnicodeDecodeError quoted above
import urllib2
url = u'http://tools.wmflabs.org/multichill/queries2/commons/paintings_without_wikidata_cia_org.txt'
queryPage = urllib2.urlopen(url)
queryData = unicode(queryPage.read(), u'utf-8')
Multichill (talk) 11:52, 9 September 2016 (UTC)
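For illustration only, here is a minimal Python 3 sketch of a per-line fallback for the kind of mixed-encoding file described above. It is not the bot's actual code and not the Toolserver snippet; the function name decode_lines is hypothetical. The assumption is that each line of the query output is either valid UTF-8 or Latin-1, so a line that fails to decode as UTF-8 is decoded as Latin-1 instead.

```python
def decode_lines(raw: bytes) -> list:
    """Decode query output line by line (hypothetical helper).

    Assumption: every line is either valid UTF-8 or Latin-1.
    """
    decoded = []
    for line in raw.splitlines():
        try:
            decoded.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a code point, so this never raises,
            # but it yields mojibake if the line was really damaged UTF-8.
            decoded.append(line.decode("latin-1"))
    return decoded
```

Unlike running iconv over the whole file, this leaves lines that are already valid UTF-8 untouched, so a correctly encoded "Dürer" is not converted a second time.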
Thanks for the details, although I am not sure I understood everything. Interestingly, not long ago I came across a similar encoding problem on enwikisource, where interwiki links are broken in the archive.[1] Do you think it could be connected? The archive is also built with the help of a bot. --Arnd (talk) 12:00, 9 September 2016 (UTC)
That fun encoding trick seems to have fixed it.
I doubt this is connected. Because of the mix of Latin-1, binary and UTF-8, you'll run into encoding problems every once in a while. Multichill (talk) 12:30, 9 September 2016 (UTC)
Multichill - thanks for the reboot. For me it is assigning a Wikidata item of "Q" and requiring me to paste in the proper ID number. Slowking4 § Richard Arthur Norton's revenge 23:35, 30 November 2016 (UTC)
Yeah, noticed that too. Fixed. Multichill (talk) 11:33, 1 December 2016 (UTC)
Seems to be fixed, and faster too. Good job. Slowking4 § Richard Arthur Norton's revenge 19:10, 1 December 2016 (UTC)
@Slowking4: the last patch is merged and another fix has been pushed to Toollabs too. The bot should now run correctly from Toollabs and run every night again.
@Aschroet: sorry, it seems you ran into one of my bugs. It should be fixed now. I undid two of your edits.
I hope you can help to reduce this list again. It has grown quite a bit since it broke down in September. I'm clicking away on Wikidata to process suggestions. Multichill (talk) 20:55, 1 December 2016 (UTC)

Duplicates

Hi Multichill, it seems that the list contains several duplicates, such as Judith_Leyster_-_Boy_playing_the_Flute_-_Google_Art_Project.jpg vs. Judith_Leyster_-_Young_Flute_Player_-_WGA12961.jpg. Maybe you could optimize your code. Thanks, --Arnd (talk) 16:59, 2 December 2016 (UTC)

Thanks for pointing that out. It's probably caused by the refactoring I did. I'll look into it. Multichill (talk) 19:28, 2 December 2016 (UTC)
Still have this problem. --Arnd (talk) 07:18, 21 December 2016 (UTC)

Object photo

Hi, some files, such as File:Georges de La Tour 029.jpg, have no Artwork template but use Object photo instead. How should they be dealt with? In any case, it is not possible to add a Wikidata item to them with the link in the list. --Arnd (talk) 08:07, 6 December 2016 (UTC)

Arnd: {{Object photo}} is a horrible construct where the "object" indicates what category (!) to transclude. In this case it's Category:The Cheat with the Ace of Clubs (Kimbell Art Museum). I fixed it. Multichill (talk) 21:48, 20 December 2016 (UTC)
Thank you for the explanation. --Arnd (talk) 07:18, 21 December 2016 (UTC)

Same painting

Hi, if one picture is a detail extracted from the other, are they the "same painting"? I ask because I often see extracted details. Until now I have not counted them. --Arnd (talk) 07:45, 25 December 2016 (UTC)

Rewrote it

@Slowking4: @Aschroet: I rewrote most of it to produce separate lists based on the three different criteria (institution, creator and inventory number). The lists are linked from the page. I hope this offers more options and makes matching easier. Multichill (talk) 22:34, 14 January 2017 (UTC)

Good, I see user:Jane023 and Aschroet have it covered. Maybe we will need to recruit more editors. I have shifted over to "Painting images no artwork template" - it was getting a little thin, based on recent Wikidata activity - more to do now. Go talk to sadads about the Sloan grant. Cheers. Slowking4 § Richard Arthur Norton's revenge 22:41, 14 January 2017 (UTC)
@Slowking4: I've been involved with the structured data on Commons project from the start as a volunteer. I was at the bootcamp back in 2014 in Berlin and was involved in reviewing the Sloan grant before it went out. So yes, I sure talked with Alex about it. I plan to continue to be involved with the project as a volunteer.
I was talking with Jane about this, so I guess she is trying out the new toys. Multichill (talk) 22:54, 14 January 2017 (UTC)
Good, if you want some input on an action plan, or help rustling up some volunteers, let me know. I could put a word in with the Smithsonian. Slowking4 § Richard Arthur Norton's revenge 22:59, 14 January 2017 (UTC)

Multichill, thank you. However, we still have the duplicate issue. --Arnd (talk) 17:48, 15 January 2017 (UTC)

@Aschroet: but at least the duplicates should be clustered together now, right? That at least makes it easier to skip over them... I haven't really figured out a good strategy to remove the duplicates without removing too much. Multichill (talk) 18:21, 15 January 2017 (UTC)

It would be good to have some kind of statistics about the progress. --Arnd (talk) 09:32, 16 January 2017 (UTC)

I added two improvements.
  • Filter out the files from the suggestions for which the bot was able to add a Wikidata link (example).
  • Filter out duplicate lines as requested by Arnd. If two consecutive lines are the same, the second line won't be added. I think this should take care of most of the duplicates.
What kind of statistics are you thinking about exactly? The number of files in Category:Artworks with Wikidata item and Category:Artworks without Wikidata item? You would have to write down the numbers on a regular basis. Multichill (talk) 17:07, 16 January 2017 (UTC)
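The consecutive-duplicate filter described in the second improvement could look roughly like the following Python 3 sketch. This is an illustration under assumptions, not the bot's actual code; the function name drop_consecutive_duplicates is hypothetical, and lines are assumed to be compared as exact strings.

```python
def drop_consecutive_duplicates(lines):
    """Skip a line when it equals the line directly before it (hypothetical helper)."""
    result = []
    previous = object()  # sentinel that never equals a real line
    for line in lines:
        if line != previous:
            result.append(line)
        previous = line
    return result
```

Only adjacent repeats are removed, so the same line appearing again further down the list survives. That matches the expectation that this takes care of most, but not all, of the duplicates.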

Suggestions with both having WD item already

In User:Multichill/Same image without Wikidata/Wikidata creator, institution and inventory number match I see only matches where both images already had Wikidata items from the beginning. Why is that? --Arnd (talk) 05:41, 12 April 2017 (UTC)