User talk:AzaToth/wikimgrab.pl


Re-directs

The script in its current form does not cope with re-directs, e.g. try:

perl wikimgrab.pl Donkey.jpg

Here [[File:Donkey.jpg]] renders an image, and the page http://commons.wikimedia.org/wiki/File:Donkey.jpg currently exists, but it only points to "File:Donkey (Equus asinus) at Disney's Animal Kingdom (16-01-2005).jpg", so the script finds nothing to download under the name Donkey.jpg.

HYanWong (talk) 11:37, 21 February 2013 (UTC)
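To illustrate why a re-direct trips the script up: the download URL is derived from an MD5 hash of the file name, so the re-direct title and the real title generally hash to different directories. Below is a minimal offline sketch of that derivation, following the same steps as the script. The helper name commons_upload_path is made up for illustration, and the sketch assumes plain ASCII titles.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper (not part of wikimgrab.pl): build the direct
# upload URL for a Commons file title. Assumes an ASCII title.
sub commons_upload_path {
    my ($title) = @_;
    $title =~ s/ /_/g;                 # spaces become underscores
    $title =~ s/^(?:File|Image)://i;   # drop the namespace prefix
    $title = ucfirst $title;           # first letter is upper case
    my $digest = md5_hex($title);      # hex digest of the normalised name
    my $dir1 = substr $digest, 0, 1;   # first-level directory
    my $dir2 = substr $digest, 0, 2;   # second-level directory
    return "http://upload.wikimedia.org/wikipedia/commons/$dir1/$dir2/$title";
}

# Print the URL for the re-direct title and for its target.
print commons_upload_path($_), "\n" for (
    "File:Donkey.jpg",
    "File:Donkey (Equus asinus) at Disney's Animal Kingdom (16-01-2005).jpg",
);
```

Only the second URL can point at a stored file, since the first title has no upload of its own.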

New code

I've been altering the code on the main page to cope with a few edge cases where the actual image resource has been moved elsewhere (I hope it's not bad form to alter someone else's user pages). My changes have, however, made the code more cumbersome. Below is a slight re-write that I think is a little shorter, clearer, and easier to understand.

I'll leave it up to the talk page owner to decide if this code is preferable.

#!/usr/bin/perl
 
use strict;
use warnings;
use URI::Escape;
use Digest::MD5 qw(md5_hex);
use LWP::UserAgent;
 
my $ua = LWP::UserAgent->new;
$ua->timeout(15);
$ua->env_proxy;
$ua->show_progress(1);

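sub get_image {
  my $user_agent = shift;
  my $image = uri_unescape(shift);

  # Normalise the title the way the upload server stores it
  $image =~ s/ /_/g;                  # spaces become underscores
  $image =~ s/^(File|Image)://i;      # drop the namespace prefix
  $image =~ s/^(\w)/uc($1)/e;         # first letter is upper case
 
  # Files live under /<first hex digit>/<first two hex digits>/ of the name's MD5
  my $digest = lc(md5_hex( $image ));
  my $dir1 = substr $digest, 0, 1;    # not $a/$b, which sort() uses
  my $dir2 = substr $digest, 0, 2;
  my $path = "http://upload.wikimedia.org/wikipedia/commons/$dir1/$dir2/$image";
  return $user_agent->mirror( $path, $image );
}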

foreach my $imageName ( @ARGV ) {
  if (get_image($ua, $imageName)->is_error) { # direct fetch failed, so look for a redirect
    warn("Could not get image directly - looking for alternative name on main image page.\n");
    my $basepage = "http://commons.wikimedia.org/wiki/File:$imageName";
    my $response = $ua->get($basepage);
    my $retry;
    if ($response->content =~ m!<link rel="canonical" href="/wiki/(.+?)"!) {
      $retry = get_image($ua, $1); # found an alternative "canonical" link
    } else {
      $retry = get_image($ua, $response->filename); # this could be a redirect
    }
    warn("Could not get $imageName under an alternative name either.\n") if $retry->is_error;
  }
}

HYanWong (talk) 00:57, 22 February 2013 (UTC)
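The canonical-link lookup can be tried offline against a saved page. The sketch below is only an illustration: the helper name canonical_title and the sample markup are made up, it assumes rel comes before href inside the link tag, and it accepts an absolute href as well as the relative one the script matches, in case the page emits a full URL. Percent-decoding is done by hand so that only core Perl is needed; uri_unescape from URI::Escape would do the same job.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of wikimgrab.pl): return the title named
# by a page's canonical link, or nothing if no such link is found.
sub canonical_title {
    my ($html) = @_;
    # Optional scheme and host, then /wiki/, then the title up to the quote.
    return unless $html =~
        m{<link\s+rel="canonical"\s+href="(?:(?:https?:)?//[^/"]+)?/wiki/([^"]+)"}i;
    my $title = $1;
    $title =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # percent-decode
    return $title;
}

# Made-up sample markup, shaped like the tag the script looks for.
my $sample = q{<link rel="canonical" href="/wiki/File:Donkey_(Equus_asinus)_at_Disney%27s_Animal_Kingdom_(16-01-2005).jpg" />};
my $title = canonical_title($sample);
print defined $title ? "$title\n" : "no canonical link found\n";
```

The returned title still carries its File: prefix, which get_image strips before hashing.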