Skip to content

Using ElasticSearch as Unit Index

Igor Afanasyev edited this page Mar 10, 2016 · 1 revision

This is the test script that populates ElasticSearch index with Pootle units.

Run it like that:

./populate_es_unit_index.pl --user=pootle --password=yOur_PaSSwOrd

(by default it connects to pootle database on local MySQL instance).

populate_es_unit_index.pl

#!/usr/bin/perl
=head1 NAME

populate_es_unit_index.pl - Populate ElasticSearch index with Pootle units

=head1 DESCRIPTION

B<populate_es_unit_index.pl> is an utility for bulk-updating ElasticSearch index with Pootle units for faster searching

=head1 SYNOPSIS

populate_es_unit_index.pl --host=localhost --user=username --password=password --port=port --database=database

=cut

use strict;

use DBI;
use DateTime;
use Getopt::Long;
use JSON -support_by_pp;
use LWP::UserAgent;

my $ELASTICSEARCH_HOST = 'http://localhost:9200';
my $ELASTICSEARCH_INDEX = 'pootle_test';
my $ELASTICSEARCH_TYPE = 'unit';

binmode(STDOUT, ":utf8");
$| = 1;

my ($host, $database, $username, $password, $port) = ('localhost', 'pootle', 'pootle', undef, 3306);

GetOptions(
	"host|h:s" => \$host,
	"user|u:s" => \$username,
	"port|P:s" => \$port,
	"database|D:s" => \$database,
	"password=s" => \$password
);

my $dbh = DBI->connect("DBI:mysql:database=$database;host=$host;port=$port",
	$username,
	$password,
	{
        'mysql_enable_utf8' => 1,
        'mysql_bind_type_guessing' => 1,
    }
) or die "Can't connect to the mysql database.";

my $ua = LWP::UserAgent->new;

print "Deleting '$ELASTICSEARCH_INDEX' index in ElasticSearch...";
my $req = HTTP::Request->new('DELETE', "$ELASTICSEARCH_HOST/$ELASTICSEARCH_INDEX");
my $res = $ua->request($req);
check_error($res) if $res->code != 404;
print "\n";

my $date_type = {
    type => 'date',
    format => 'yyyy-MM-dd HH:mm:ss',
};

my $index_mapping = {
    mappings => {
        $ELASTICSEARCH_TYPE => {
            properties => {
                pootle_path => {
                    type => 'string',
                    index => 'not_analyzed',
                },
                index => {
                    type => 'long',
                },
                state => {
                    type => 'short',
                },
                mtime => $date_type,
                submitted_on => $date_type,
                creation_time => $date_type,
                reviewed_on => $date_type,
            }
        }
    }
};

print "Creating '$ELASTICSEARCH_INDEX' index with field mappings in ElasticSearch...";
my $req = HTTP::Request->new('PUT', "$ELASTICSEARCH_HOST/$ELASTICSEARCH_INDEX");
$req->header('Content-Type' => 'application/json');
$req->content(to_json($index_mapping, {utf8 => 1}));
my $res = $ua->request($req);
check_error($res);
print "\n";

my $bulk_url = "$ELASTICSEARCH_HOST/$ELASTICSEARCH_INDEX/$ELASTICSEARCH_TYPE/_bulk";

my $sql = "
    SELECT u.id, s.pootle_path, u.index, u.state, u.mtime,
    u.submitted_on, u.creation_time, u.reviewed_on
    FROM pootle_store_unit u
    JOIN pootle_store_store s ON u.store_id = s.id
";
my $sth = $dbh->prepare($sql) or die $dbh->errstr;

print "Executing SQL query...";
$sth->execute;
print " Done\n";

my ($start, $now);
$start = $now = DateTime->now;

my $i;
my @bulk_rows = ();
while (my $h = $sth->fetchrow_hashref) {
    my $command = {
        index => {
            _id => $h->{id}
        }
    };

    die "mtime is null" unless $h->{mtime};

    my $data = {
        pootle_path => $h->{pootle_path},
        index => $h->{index},
        state => $h->{state},
        mtime => $h->{mtime},
    };
    $data->{submitted_on} = $h->{submitted_on} if $h->{submitted_on};
    $data->{creation_time} = $h->{creation_time} if $h->{creation_time};
    $data->{reviewed_on} = $h->{reviewed_on} if $h->{reviewed_on};

    push @bulk_rows, to_json($command, {utf8 => 1});
    push @bulk_rows, to_json($data, {utf8 => 1});

    $i++;
    if ($i % 10000 == 0) {
        push_bulk_request();
    }
}
push_bulk_request();

$sth->finish;
$dbh->disconnect;

$now = DateTime->now;
my $d = $now - $start;

print "[$now]\tAll done in ".join(':', map {$_ > 10 ? "$_" : "0$_"} ($d->hours(), $d->minutes(), $d->seconds()))."\n";

sub push_bulk_request {
    $now = DateTime->now;
    print "[$now]\t";
    if (length(@bulk_rows) > 0) {
        my $req = HTTP::Request->new('POST', $bulk_url);
        $req->header('Content-Type' => 'application/json');
        $req->content(join("\n", @bulk_rows));
        my $res = $ua->request($req);
        check_error($res);
    }
    @bulk_rows = ();
    print "$i units were updated\n";
}

sub check_error {
    my $res = shift;
    if ($res->code >= 400) {
        warn "ERROR: ", $res->code, "\n";
        warn "Full response is: ", $res->decoded_content, "\n";
        die;
    }
}

I ran this on my local Pootle DB instance containing 2,409,581 units. Full indexing took 5 minutes 50 seconds. After that I ran some sample queries against ElasticSearch (I recommend using 'Sense' Chrome extension to run ES queries):

Getting first 20 units ordered by pootle_path and index

Average query execution time: ~25-30ms

POST /pootle_test/unit/_search

{
   "from": 0,
   "size": 20,
   "query": {
      "match_all": {}
   },
   "sort": ["pootle_path", "index"],
   "_source": false
}

Getting 20 units starting from offset 5000, ordered by pootle_path and index

Average query execution time: ~60-70ms

POST /pootle_test/unit/_search

{
   "from": 5000,
   "size": 20,
   "query": {
      "match_all": {}
   },
   "sort": ["pootle_path", "index"],
   "_source": false
}

Getting first 20 units ordered by pootle_path and index for pootle_path starting with /ru/

Average query execution time: ~4-5ms

POST /pootle_test/unit/_search

{
   "from": 0,
   "size": 20,
   "query": {
      "prefix": {
         "pootle_path": "/ru/"
      }
   },
   "sort": ["pootle_path", "index"],
   "_source": false
}

Getting 20 units starting from offset 5000, ordered by pootle_path and index for pootle_path starting with /ru/

Average query execution time: ~30-40ms

POST /pootle_test/unit/_search

{
   "from": 5000,
   "size": 20,
   "query": {
      "prefix": {
         "pootle_path": "/ru/"
      }
   },
   "sort": ["pootle_path", "index"],
   "_source": false
}

SQL query to retrieve actual data

Once one gets a sorted vector of 20 unit IDs, they would need to run an SQL query against Pootle DB to retrieve the actual unit data, something like this:

SELECT *
FROM pootle_store_unit u
JOIN pootle_store_store s ON u.store_id = s.id
WHERE u.id IN (
    1351041, 3372829, 1351040, 1351039, 1351037,
    1357104, 1357105, 1381339, 1389822, 1424225,
    1447226, 1464442, 1463805, 1464440, 1463797,
    1463801, 1463807, 1464441, 1463806, 1463795
)

In my tests, the average execution time was always reported in my SQL DB admin as "0.000s" (in other words, less than 0.5ms + rounding, which is negligibly small and should probably not be taken into account at all).

Clone this wiki locally