pdxhash and working with spatial data using sql

No Gravatar

Overview

After analyzing the data, it would appear that we can use the geohash algorithm to geographically locate things in the Portland area by a short combination of letters and numbers. Because we only care about locations in and around Portland, we can throw away the majority of the geohash.  What is left of this string can be used to geotag locations and should fulfill that requirement of the #pdxtags civicapps challenge idea. It would appear that we can identify areas in Portland  as small as individual houses with just 5 characters, like 06ytd or 03qrf.

The interesting thing about this approach is that it eliminates the need for PostGIS or other spatial database when building applications with the civicapps data. Using a pdxhash to store locations would allow a programmer to do fast geographic lookups and basic spatial queries in Portland area, without a geodatabase or GIS software. It would also allow the use of non traditional application servers like the Google App Engine or NOSQL databases like Mongo or CouchDB.

Massaging the Data

Now that we have the Portland metro area address points loaded into a postgis table, we can begin working with the data.  We will be doing gis operations at the SQL level.  People commonly use ArcGIS Desktop to do analysis and data management tasks like these, but the open source tools work just as well.  There are more GUI based open source methods to do what we are doing, but if you are interested in automating tasks and scripting, the command line rules.

The table schema in postgis matches the csv file which we loaded with ogr2ogr. The import added a column named wkb_geometry that contains the point in postgis’ binary representation of a coordinate. The table contains 316,133 records that look a little like this:

SELECT wkb_geometry, street_name, zip_code FROM address_data LIMIT 10;
wkb_geometry                    | street_name | zip_code
0101000020E6100000D1C25823EAC262C045CEAEC8C8C04340 | LAMBERT     |
0101000020E6100000D1C25823EAC262C045CEAEC8C8C04340 | ASHBY       | 97229
0101000020E6100000D1C25823EAC262C045CEAEC8C8C04340 | 2ND         | 97080
0101000020E6100000D1C25823EAC262C045CEAEC8C8C04340 | CERVANTES   | 97035
0101000020E6100000D1C25823EAC262C045CEAEC8C8C04340 | 40TH        | 97123
0101000020E6100000D1C25823EAC262C045CEAEC8C8C04340 | PARK        | 97201
0101000020E6100000D1C25823EAC262C045CEAEC8C8C04340 | DOLPH       | 97219
0101000020E6100000D1C25823EAC262C045CEAEC8C8C04340 | 15TH        | 97030
0101000020E6100000D1C25823EAC262C045CEAEC8C8C04340 | 48TH        | 97213
0101000020E6100000D1C25823EAC262C045CEAEC8C8C04340 | 182ND       | 97233
(10 rows)

Putting the Data on a Map

If we fire up qgis and add this postgis layer, we can view all of the points on a map. We can symbolize the points based on the attributes in the table. For example we could make points have different colors based on their county. I’d like to visually differentiate the points based on their #pdxhash, which is a masked version of a geohash. Unfortunately the querybuilder in qgis doesn’t allow you to get too fancy with sql functions, so we have to create a view.

A view is a little like a layer definition in ArcGIS. It allows us to create a virtual table that hides our more complicated queries behind what looks like a simple table. In our psql or pgadmin window, we can define the following view.

CREATE VIEW pdx_hash_address_view AS
SELECT substring(trim(ST_Geohash(wkb_geometry)) FROM 0 FOR 5) AS pdx_hash_4,                                         substring(trim(ST_Geohash(wkb_geometry)) FROM 0 FOR 6) AS pdx_hash_5,                                       substring(trim(ST_Geohash(wkb_geometry)) FROM 0 FOR 7) AS pdx_hash_6,                                       substring(trim(ST_Geohash(wkb_geometry)) FROM 0 FOR 8) AS pdx_hash_7,                                       wkb_geometry, ogc_fid
FROM address_data;

qgis postgis layersWe need the wkb_geometry and ogc_fid data to provide our actually features and a primary key to keep things sorted out. Now we can go to postgis and pull up this layer just like we did with our original table.  However it becomes obvious that the view is much slower than our original table.  When we select our view we are doing multiple complicated operations on each row in our table.  Fortunately postgresql offers us a way to easily “materialize” this view into it’s own table.

SELECT * INTO pdx_hash_address_data_table
FROM pdx_hash_address_view;
CREATE INDEX pdx_hash__address_data_table_geom_idx ON pdx_hash_address_data_table USING gist (wkb_geometry);

Creating New Layers

When we use SELECT INTO, our new table is created automatically from the schema of our old table or view. Now instead of complicated function calls, our pdx_hash columns are simple text fields and our query is much faster. We also need to add a spatial index to this new table, so our operations on the points are speedy as well.

Now when we bring this layer into qgis, we can see the pdxhash attributes we have added and it is fast enough. We can symbolize the layer based on the pdx_hash_4 field. Immediately we can see that the main part of Portland is really represented by four of the pdx_hash_4 classes. pdx_hash_4
If we symbolize the data based on 108 pdx_hash_5 classes, we get a more interesting picture. Unfortunately when dealing with all of these points, it gets a little jumbled. However the map seems to validate using a 5 character pdxhash to break down the Portland area.
pdx_hash_5
It would be nice if we could convert our points into polygons that represent each pdx_hash zone. It would make our map a little more visually effective. Right now it is hard to see the forest for the trees.

Common Table Expressions and Spatial Aggregates

One problem with working with a large dataset is that some operation can take a long time, especially when you make mistakes. In order to convert our point fields into polygons, we have to use postgis aggregate functions which behave like standard sql aggregates like max, sum, average only for geographic features.

It would be nice to work on just a subset of our data while we get our statement correct. We could create a new table with a limited subset of the address data records, but more recent sql implementations support something called a common table expression which will do that for us.

Creating Polygons and exporting Shapefiles

Using the WITH statement, we create a virtual table with only 10,000 records on which to perform our query. This is much faster for some operations than working on our entire dataset. When the query is solid, you just remove the “LIMIT 10000″ from the common table expression and you are ready to go. As it turns out postgis can generate polygons from our point fields very quickly when the query is correct.

WITH pdx_hash AS (SELECT * FROM pdx_hash_address_data_table LIMIT 10000)
SELECT pdx_hash_5, ST_Envelope(ST_Collect(wkb_geometry)) AS geom, count(*)                                                                               INTO pdx_hash_polygons FROM pdx_hash GROUP BY pdx_hash_5;

This query takes all the points which have the same pdx_hash_5 field and returns them on a single row using ST_Collect. The ST_Envelope function creates a polygon from the edge of the field of points. As you can see our, polygons enclose the regions defined by the pdxhash points in our previous queries. It also would appear that all these blocks begin with a pdx_hash of c2, which means that we can uniquely identify each zone with just three characters.
I’ve made some shapefiles for you to play with. The highest precision shapefile reveals the limitations of basing the polygons off of the address data. However, I don’t beleive it invalidates the concept. I created the shapefiles using ogr2ogr.   I did this example on a mac with postgis/postgresql, but you could execute basically the same queries with Microsoft SQL Server 2008.

ogr2ogr -skipfailures -nlt POLYGON  -f "ESRI Shapefile" pdx_hash_polygon PG:"host=localhost dbname=civicapps" pdx_hash_polygon

pdx_hash_polygon.zip


.

Posted in technology | Tagged , , , , , , | 5 Comments

pdxhash: an implementation of #pdxtags using the geohash algorithm

No Gravatar

I never learned about geohashes in school, but I ran across the term when I was figuring out how to do fast GIS operations in the Google App Engine cloud computing platform. It is a two way function that encodes a latitude and longitude into a 20 character string. According to the wikipedia page, a geohash could be a possible geotagging implementation. I’ve found that using a geohash you can split the Portland area into 15 unique zones each identified by a four characters.

I thought that maybe we could use a geohash to geotag locations in Portland as specified in the #pdxtags civicapp idea submission. Geohashes are accurate approximately 79% of the time, all the time. They actually do have trouble around the international date line, but we don’t care about anywhere beyond Portland.

A geohash is made up of a string of 20 letters or numbers. As you go from left to right, the area of the earth that is specified gets smaller and the precision of the location increases. Because of that characteristic we can throw away most of the characters.

When grouped by geohash, it turns out that you can slice up the Portland area into 15 polygons, each of which can be identified by a 4 character string. Here is the address point dataset organized into these groups.

SELECT substring(trim(ST_Geohash(wkb_geometry)) FROM 0 FOR 5) AS pdx_hash, count(*) AS address_point_count FROM address_data GROUP BY pdx_hash ORDER BY address_point_count;

pdx_hash | address_point_count
———-+———————
c211 | 2
c207 | 2
c217 | 3
c209 | 12
c20c | 14
c20b | 14
c206 | 54
c20s | 157
c215 | 507
c214 | 1263
8xj2 | 14008
c20e | 23581
c20d | 24980
c20g | 94659
c20f | 156877
(15 rows)

Beyond that, you can further subdivide Portland into 108 zones which can be uniquely identified by a 5 letter string. It is easy to control how accurately a tagged location is identified by using more or less characters. A 5 letter pdx_hash is more accurately described than one with 4 letters. With a 6 character pdx_hash you get 1487 different zones at higher resolution.

When you look at the count of address point in each polygon, it becomes apparent that the bulk of Portland addresses fall into the same five or six zones. One advantage to this method is that geohash is not subject to any claim of copyright or corporate control. The algorithm is public domain and open to use by anyone.

It would be reasonable for the city to eventually add a pdx_hash to all street signs allowing people to figure out the geotag of their current location without smartphones.

Later, I’ll post some maps as well as do some error analysis. In the course of working with the address point data, I found a simple way to load csv files into postgis, which I will also post about. The examples in this post were done in Postgis, but you should also be able to use the same query in Microsoft SQL Server 2008 and Oracle. Although the string manipulation might be a little different.

The web comic xkcd is where I think most of us first heard about geohashing. ;)
xkcd

Posted in technology | Tagged , , , , , , | Leave a comment

bulk loading shapefiles into postgis

No Gravatar

civicappsCivicapps datasets are made available in a format called a shapefile. ESRI shapefiles are the de facto standard for distributing spatial data, but they can be tricky for those who are new to GIS.  Fortunately shapefiles are an open standard with a variety of useful tools available.

Shapefiles and Spatial Databases

One approach to using a shapefile in an application is to load it into a relational database that supports spatial features like PostgreSQL/PostGIS, SQL Server 2008 (with or without ArcSDE), SQLite/Spatialite, Oracle and even MySQL. These products are sometimes referred to as spatial or geodatabases. Some implementations are more mature and useful than others.

I’m using PostGIS for this example and the ogr tools which ship with the ubuntu gdal-bin package. These tools are being developed as part of the Open Geospatial Consortium and are capable of doing incredible things, especially when compared to the commercial offerings.

Getting at the data

bridgesAfter you unzip the archive you are left with several different files including a .shp file, a .prj file and a .dbf file. The geographic features are not stored as longitude and latitude, but a coordinate system specific to Northern Oregon called State Plane with units in Feet. The projection information is in the .prj file.

Spatial Transformations

First we have to project the data from State Plane which is commonly used by state and local governments to the standard web mapping coordinate system, WGS1984. This projection works very well with the google maps api and uses latitude and longitude. We will employ a tool called ogr2ogr to do the reprojection. It will create a new shapefile called Bridges_pdx_4326.shp in the WGS84 coordinate system which is designated by a spatial reference id of 4326.

ogr2ogr -t_srs EPSG:4326 -a_srs EPSG:4326 -f "ESRI Shapefile" Bridges_pdx_4326.shp Bridges_pdx.shp

Creating the Database

Now, we need to create a postgis database and load it from the shapefile. We’ll be using the command shp2pgsql which comes with postgis and psql the command line interface to postgres to do the dirty work. It’s really not that bad and lends itself to being automated with a shell script.

edwin@iknuth.com:~/civicapps$ createdb civicapps
edwin@iknuth.com:~/civicapps$ createlang plpgsql civicapps
edwin@iknuth.com:~/civicapps$ psql civicapps -f /usr/share/postgresql/8.4/contrib/postgis.sql
edwin@iknuth.com:~/civicapps$ psql civicapps -f /usr/share/postgresql/8.4/contrib/spatial_ref_sys.sql

Loading the Data

Now we load our projected data into postgres. We are using a few command line options to specify the name of the table as “bridges_pdx” and the name of the column storing the features as “the_geom”. You can use whatever naming conventions you like.

shp2pgsql -s 4326 -d -g the_geom Bridges_pdx_4326.shp bridges_pdx |psql civicapps

If you run the command without piping it to psql, you see that it outputs the sql necessary to create a table and insert the data to populate it. This whole process is highly scriptable and can be used to bulk load many shapefiles at once.

Working with PostGIS

To look at the data, we can connect to the database and run some queries. PostGIS includes many functions that implement gis operations in sql. We can use this command to generate kml directly from our database.

SELECT ST_AsKML(the_geom) FROM bridges_pdx;

Most spatial databases implement at least some of the OpenGIS Simple Features Implementation Specification for SQL. Using these functions, we can make queries and relate tables at the database level based on the spatial location of the features using SQL or object-relational mapping. This is a very powerful tool for developing GIS applications.

I hope that helps people get started with shapefiles and postgis. I’ll be writing more about civicapps and working with spatial data in the future I appreciate your interest.

Posted in technology | Tagged , , , , , | 1 Comment

rubyrep and postgresql replication roundup

No Gravatar

While replication might not be tightly integrated into postgresql at the moment, we are lucky to have a variety of tools.  Each one has different qualities and meets different needs.

SlonyII is mature and easy enough to implement, but it does not support multi master replication.  Bucardo and rubyrep handle that requirement and definitely have different personalities.  Bucardo is written in perl and rubyrep uses ruby, so you might pick the language with which you are more comfortable.

I had been impressed by how easy it was to get rubyrep up and running, but for some reason I missed it’s ability to handle continuous replication using triggers.  Luckily rubyrep’s author Arndt Lehmann set me straight with a comment to my misguided post.  We are currently implementing rubyrep to replicate a very large django database and I have been very pleased with the results.

I have noticed one interesting quick with rubyrep that we did not see with bucardo.  I’m still analyzing the situation and will submit a bug report to rubyrep or postgres.  I am using the same database backup to load the two databases to different servers.  I noticed that rubyrep was finding a huge amount of record mismatches regarding time/data data.  The only that was different between the two servers was the timezone.  One was set to UTC and the other Pacific.  Fixing the tz mismatch and restarting postgres has taken care of the problem.

While this would a problem going into production, rubyrep actually was able to sync these records very easily.  It painless synced the hundreds of thousands of records without breaking a sweat.  I was very impressed.

Posted in technology | Tagged , , , , | 5 Comments

deploying bucardo and solving a schema mismatch problem

No Gravatar

In the process of deploying bucardo 4.4.0 to replicate a fairly large django database using postgresql 8.4, I ran into a tricky issue.  Although I knew for certain that the database schemas were identical, bucardo still complained about a mismatch in the default value for the first table’s primary key column.

For some reason bucardo was identifying the reference to the sequence with the “public” schema explicitly named on one database and without “public” on the other.  In this case, schema refers to a namespace and not the database layout sense of the word. The goat validation was throwing the following error.


MCP Warning: Source database for sync "databaseA" has column "id" of table "public.auth_group" with a DEFAULT of "nextval(‘auth_group_id_seq’::regclass)", but target database "databaseB" has a DEFAULT of "nextval(‘public.auth_group_id_seq’::regclass)"

I was sure that the databases were identical because they were both restored from the same backup file.  I did find that I was running 8.4.1 on one postgresql server and 8.4.3 on the other. I synced up the postgresql versions, but I did not track down any of the relevant driver versions, because I was able to find a simple work around.

I searched for the error string in the bucardo perl source code and found the problematic section.  Since I knew the tables were identical, I felt that skipping this part of the validation would be harmless, so I made the following change to Bucardo.pm

2032     #if ($scol->{atthasdef} and $fcol->{atthasdef} and $scol->{def} ne $fcol->{def}) {
2033     if (1 == 2) {

After I installed my updated bucardo, I kicked off replication again with a successful result.

Posted in technology | Tagged , , , , | 3 Comments

displaying a google maps api v3 map in a django application with geodjango and postgis

No Gravatar

In order to create a Google Maps API v3 map with django templates and views you can use the python modules django.contrib.gis.maps and specifically django.contrib.gis.maps.google.

We will create a view called show_map that takes a request object and a query string that we will use to retrieve polygons from postgis, our spatial database.

If we have a database of time zones and we would like to display a map with a specific time zone, we first need to select it from the database.  For this query, we are using the standard django ORM to pull out the desired polygon. We have defined the Zone object in our app’s models.py file.

querying the geodatabase using django ORM

In[24]: results = Time_Zone.objects.filter(timezone_id=query_string)

timezone_id is the criteria we are using to select all objects where the time zone id matches the query string exactly.

The geometry or polygon shapes of the records that have been returned are available in the polygon field you have specified in the model definition. Because we are using the filter method of Zone’s object manager, we can pull multiple records that match our search criteria. We can get at the polygon data for each record using some properties that are available on the geometry columns.

geojson

In[24]print results[0].geom.geojson
Out[24]‘{ "type": "MultiPolygon", "coordinates": [ [ [ [ -6.091862, 4.991835 ], [ -6.089009, 4.990883 ], [ -6.079963, 4.988334 ], [ -6.074483, 4.985070 ], [ -6.071546, 4.979528 ], [ -6.072025, 4.977002 ], [ -6.076686, 4.968072 ], [ -6.083112, 4.967894 ], [ -6.100063, 4.971250 ], [ -6.103525, 4.968749 ], [ -6.107572, 4.978899 ], [ -6.113694, 4.987916 ], [ -6.113905, 4.990215 ], [ -6.112429, 4.991211 ], [ -6.105550, 4.991084 ], [ -6.099033, 4.994057 ], [ -6.091862, 4.991835 ] ] ] ] }’

kml

In [25]: results[0].geom.kml
Out[25]:

-6.09186172485,4.99183511734,0 -6.08900928497,4.99088335037,0 -6.07996273041,4.98833370209,0 -6.07448291779,4.98507022858,0 -6.07154560089,4.97952795029,0 -6.07202482224,4.97700214386,0 -6.07668590546,4.9680724144,0 -6.08311223984,4.9678940773,0 -6.10006284714,4.97125005722,0 -6.10352468491,4.96874904633,0 -6.1075720787,4.97889947891,0 -6.11369419098,4.98791646957,0 -6.113904953,4.99021482468,0 -6.11242866516,4.9912109375,0 -6.10554981232,4.99108362198,0 -6.09903335571,4.9940571785,0 -6.09186172485,4.99183511734,0′

google maps api

In order to display the polygons on our google map, we will need to pass them to our code written in javascript. We will then use methods from the google maps api v3 to construct and display our polygons on the map. We will also need a little more information about the polygons we will be displaying in order to adjust the center and zoom level of our map.

First we will need to instantiate a google zoom object. There are probably easier ways to do this, perhaps subclassing the geodjango admin views, but this is quick and effective. And then render our template with the appropriate context.

gz = GoogleZoom()
return render_to_response(‘map.html’, {
‘all_tz’: all_tz,
‘zoom’: gz.get_zoom(all_tz.unionagg())
})

template javascript

To display our maps using google’s javascript mapping API, we need to initialize our map and the add our polygons to it. We will iterate through our list of polygons use the gpoly method of our Time_Zone object to definte the points that make up each polygon. We’ll talk about the gpoly method in a bit.

function initialize() {
var latlng = new google.maps.LatLng({{ results.unionagg.centroid.y}},
{{ results.unionagg.centroid.x}});
var myOptions = {
zoom: {{ zoom }},
center: latlng,
mapTypeId: google.maps.MapTypeId.TERRAIN
};
var map = new google.maps.Map(document.getElementById("map_canvas"), myOptions)
{% for result in results %}
polygon = new google.maps.Polygon({
paths: {{ result.gpoly }},
strokeColor: "#FF0000",
strokeOpacity: 0.8,
strokeWeight: 2,
fillColor: "#FF0000",
fillOpacity: 0.35
});
polygon.setMap(map);
{% endfor %}

google.maps.Polygon

A google.maps.Polygon object is slightly different from a geojson representation. They are both a string containing a list of points that represent the vertices of the polygon. The geojson polygon that is displayed above needs to be transformed into the following string to be interpretable by the google Polygon constructor.

‘[new google.maps.LatLng(4.99183511734,-6.09186172485),new google.maps.LatLng(4.99088335037,-6.08900928497),new google.maps.LatLng(4.98833370209,-6.07996273041),new google.maps.LatLng(4.98507022858,-6.07448291779),new google.maps.LatLng(4.97952795029,-6.07154560089),new google.maps.LatLng(4.97700214386,-6.07202482224),new google.maps.LatLng(4.9680724144,-6.07668590546),new google.maps.LatLng(4.9678940773,-6.08311223984),new google.maps.LatLng(4.97125005722,-6.10006284714),new google.maps.LatLng(4.96874904633,-6.10352468491),new google.maps.LatLng(4.97889947891,-6.1075720787),new google.maps.LatLng(4.98791646957,-6.11369419098),new google.maps.LatLng(4.99021482468,-6.113904953),new google.maps.LatLng(4.9912109375,-6.11242866516),new google.maps.LatLng(4.99108362198,-6.10554981232),new google.maps.LatLng(4.9940571785,-6.09903335571),new google.maps.LatLng(4.99183511734,-6.09186172485)]‘

To do that with django, I import the GPolygon object from django.contrib.gis.maps.google.overlays. It takes a geodjango multipolygon object and returns a string which is acceptable to the google maps api polygon constructor paths argument. I’ve added a gpoly method to my geodjango object that wraps the GPolygon object and returns the correct string.

supporting google maps api v3

Unfortunately, geodjango seems to be written with v2 of the google maps api in mind, at least in the current version. In order to transfer the output of the GPolygon method, we simply do a replace in the string. The gpoly method looks like the following.

def gpoly(self):
gpoly = GPolygon(self.geom[0])
return gpoly.points.replace(‘GLatLng’, ‘google.maps.LatLng’)
Posted in technology | Tagged , , , , , , , , | 1 Comment

installing bucardo the hard way

No Gravatar

It would seem that the bucardo_ctl installer script for version 4.4.0 has some problems. I spent a little time trying to fix the script, but it was too much of a pain.

I found a great write up by Richard Martin that describes getting bucardo running without bucardo_ctl.  I simplified his procedure to replicate the three tables from the pgbench data that have primary keys.   From there you should be able to adopt it for any database.  You can use bucardo to replicate tables that do not have primary keys, but primary keys are generally a good thing to have.

I did modify the bucardo.schema file to remove the account creation stuff.  The script tries to switch to the bucardo db as the bucardo user after it creates it and then can cause a problem with most default postgres installs that use ident auth.

Feel free to download my modified bucardo.schema.  For this example, I have used the database names ‘db_left’ and ‘db_right’.  They are both on localhost, so we have to worry less about getting across a network.

Be sure to use better passwords if you are setting this up for reals. Please do things like use ssl/ssh tunnels.

The first step is to create the databases and populate them with pgbench data. For ubuntu 9.10 with postgresql 8.4, pgbench is in /usr/lib/postgresql/8.4/bin/

createdb db_left
createdb db_right
pgbench -i db_left
pgbench -i db_right
/usr/lib/postgresql/8.4/bin/pgbench -i db_left
/usr/lib/postgresql/8.4/bin/pgbench -i db_right

Next, run the following script in psql or your postgresql client of choice.

CREATE user bucardo SUPERUSER;
CREATE DATABASE bucardo owner bucardo;
ALTER user bucardo WITH password ‘bucardo’;

\connect db_left;
CREATE LANGUAGE plpgsql;

\connect db_right;
CREATE LANGUAGE plpgsql;

\connect bucardo;
CREATE LANGUAGE plpgsql;
CREATE LANGUAGE plperlu;
\i mod_bucardo.schema
INSERT INTO db(name, dbname, dbhost, dbuser, dbpass)
VALUES (‘left’,‘db_left’,’127.0.0.1′,‘bucardo’, ‘bucardo’);
INSERT INTO db(name, dbname, dbhost, dbuser, dbpass)
VALUES (‘right’,‘db_right’,’127.0.0.1′,‘bucardo’, ‘bucardo’);
INSERT INTO dbgroup(name) VALUES (‘remote_dbs’);
INSERT INTO dbmap(db,dbgroup) VALUES (‘left’,‘remote_dbs’);
INSERT INTO dbmap(db,dbgroup) VALUES (‘right’,‘remote_dbs’);

INSERT INTO goat(db, schemaname, tablename, standard_conflict) VALUES(‘left’, ‘public’, ‘pgbench_accounts’, ‘source’);
INSERT INTO goat(db, schemaname, tablename, standard_conflict) VALUES(‘left’, ‘public’, ‘pgbench_branches’, ‘source’);
INSERT INTO goat(db, schemaname, tablename, standard_conflict) VALUES(‘left’, ‘public’, ‘pgbench_tellers’‘source’);

INSERT INTO herd(name) VALUES (‘HERD_NAME’);
INSERT INTO herdmap(herd,goat) SELECT ‘HERD_NAME’, id FROM goat;

INSERT INTO sync(name,source,targetdb,synctype,checktime) VALUES (‘left’,‘HERD_NAME’,‘right’,‘swap’,’10 minutes’);

UPDATE sync SET STATUS=‘active’ WHERE name =‘left’;

This will set up basic replication between db_left and db_right. To kick off the bucardo process, use the following command:

sudo bucardo_ctl –dbuser=‘bucardo’dbhost=‘localhost’dbpass=‘bucardo’ start "full_copy"

And to check the status to make sure it is running:

sudo bucardo_ctl –dbuser=‘bucardo’dbhost=‘localhost’dbpass=‘bucardo’ status
Days back: 3  User: bucardo  Database: bucardo  Host: localhost  PID of Bucardo MCP: 7771
Name Type  State PID  Last_good Time  I/U/D Last_bad Time
====+=====+=====+====+=========+=====+=====+========+====
left| S   |idle |7775|4m27s    |0s   |0/0/0|unknown |

Sorry about the backticks in those last commands. I think that is the code snippet wordpress plugin :(

To test the replication insert a record into one of the database and look for it to be replicated.

\c db_left
INSERT INTO pgbench_branches (bid, bbalance, filler) VALUES (2, 10, ‘test’);
\c db_right
SELECT * FROM pgbench_branches;
INSERT INTO pgbench_branches (bid, bbalance, filler) VALUES (3, 10, ‘the other way’);
\c db_left
SELECT * FROM bgpench_branches;
bid | bbalance |                                          filler
—–+———-+——————————————————————————————
1 |        0 |
2 |       10 | test
3 |       10 | the other way
(3 rows)

It may take a few seconds for the new record to show up. There should also be a logfile in your current working directory that will have a lot of diagnostic info.

And that is multimaster replication.

Posted in technology | Tagged , , , , | 4 Comments

geocoding and spatial database queries using python with django and postgis

No Gravatar

Python is an excellent language for developing GIS applications.  It’s very easy to create a web service that postgisqueries a spatial database and returns a result.  Sometimes that sort of thing is called a location based service.  They are common in mobile applications, including iphone apps.  For our purposes, latitude and longitude are just inputs to a function.

Geocoding is the act of turning street addresses into coordinates on a map.  A geocoding service takes an address or a more generic reference to streets and returns a latitude and longitude that can be plotted on a map.  Reverse geocoding takes a point and returns an address, but we won’t be going into that here.

Google provides an excellent geocoding service that has a flexible address matching algorithm and provides  accurate results.  It can be used in python by way of the geopy module.  The module can also query other geocoding services like yahoo and microsoft, but google does a good job and works well with the Google Maps API and base layer.  The syntax to look up a location by the street address is simple.

This example also includes using that geocoded address to look up the Portland neighborhood in which the point is located.  To do this we must have a geodjango model loaded in a postgis or other spatial database or geodatabase as they are sometimes called.

from django.contrib.gis.geos import Point
from models import Neighborhood
from geopy import geocoders
api_key=‘AAJJDJDJSJ….’

g = geocoders.Google(api_key)
place, (lat, lon) = g.geocode(address)
pnt = Point(lon,lat)
n = Neighborhood.objects.get(poly__intersects=pnt)

Posted in technology | Tagged , , , , , , , , | Leave a comment

installing bucardo for multi-master postgres database replication

No Gravatar

In my quest for postgres replication/clustering nirvana I’ve played with pgpool, slony, rubyrep and now bucardo. Our requirements are painless bidirectional synchronization between two postgresql databases that serve a django application.

bucardo

Bucardo seems awesome, but there are some problems with the install scripts.  I hope to work them out in a future post. In the meantime here are some notes about installing bucardo on ubuntu 9.10.

Bucardo is written in perl and seems to be a little more complicated than rubyrep to get up and running. I’ll be testing it out on a virtualbox vm running ubuntu 9.10 karmic koala. This is postgres 8.4, for the record.

The first prerequisite for bucard is the DBIx::Safe perl module. Unfortunately there isn’t a package available in the default 9.10 repo’s, but it does look like it’s out for lucid.

I’m just going to install DBIx::Safe from source. It and bucardo require the DBD::Pg perl module, which is available from the ubuntu repo. I’ll install that with:

sudo aptitude install libdbd-pg-perl

The bucardo directions have instructions for installing DBIx::Safe from source, so I’ll leave that up to them. The directions also state that DBIx::Sage is a rereq, but I couldn’t find that. The bucardo build did not complain about lacking that module, so I’m hoping we’re good to go.

After getting and building bucardo, the next step is to add the new functions to your database. Bucardo uses stored procedures written in perl to handle the replication. The instructions use yum to install this from a package on redhat/centos style servers. For ubuntu, we’ll do:

sudo aptitude install postgresql-plperl-8.4

The next step is run the bucardo_ctl install script. It prompts you for database settings, which is kind of a hassle. rubyrep was much more straightforward. It also creates a bucardo superuser on the database. The script prompts you for the the password for this user, but there doesn’t seem to be a step to set the password. I opened up a psql session and ran the following:

ALTER user bucardo WITH password ‘CHGME’;

Definitely change that password to something a little more secure. While you are at it, make sure you are being secure with your db connections. You don’t want port 5432 open to the world, and please use SSL.

The bucardo_ctl script gave me an error about not having the plperl language available. I tried various things and got tired of running the script. I installed the bucardo functions to the database with the following command:

psql DBNAME -h localhost -f /usr/local/share/bucardo/bucardo.schema

The next step is to add the database to be replicated to bucardo. bucardo_ctl had a little trouble authenticating. I needed to specify the hostname. You can do it like this:

bucardo_ctl –dbhost=localhost –dbpass=BUCARDO_PASSWORD add database DBNAME

For some reason bucardo had a lot of trouble installing on Ubuntu. I kept getting an error about missing plperlu, even though the package was installed on the server and I could manually add the language to a database.

There seem to be a few errors in the bucardo_ctl perl script. I had to make two changes to get it to work. You need to comment the exit command after the bucardo_schema file is installed, and a secret hardcoded db password at the end of the install:

5465 if ($res !~ m{Pl/PerlU was successfully installed}) {
5466 warn "\nINSTALLATION FAILED! ($res)\n\n";

5467 warn "Installation cannot proceed unless the Pl/PerlU language is available\n";

5468 warn "This is usually available as a separate package\n";

5469 warn "For example, you might try: yum install postgresql-plperl\n";

5470 warn "If compiling from source, add the –with-perl option to your ./configure comma 5470 nd\n\n";

5471 # exit 1;

5472 }

…….

5481 $port and $BDSN .= ";port=$port";

5482 my $default_bucardo_password = ‘replicate’;

5483 $dbh = DBI->connect($BDSN, ‘bucardo’, $default_bucardo_password, {AutoCommit=>0,RaiseError=> 5483 1,PrintError=>0});

Now, let us create a few databases, populate and then sync them with bucardo. We’ll be using pgbench to generate some test data.

createdb db_left
createdb db_right
pgbench -i db_left
pgbench -i db_right

We’ll be starting with two identical databases, populated with some data. One of the tables does not have a primary key defined, but bucardo can handle that. I’m not really interested in that functionality, so I’m not going to worry about it here.

To add the databases to replication, execute the following commands:

bucardo_ctl add database db_left –dbhost=localhost –dbpass=PASS
bucardo_ctl add database db_right –dbhost=localhost –dbpass=PASS
bucardo_ctl add all tables
bucardo_ctl add all sequences
Posted in technology | Tagged , , , , , , | Leave a comment

using rubyrep to replicate a postgresql database on ubuntu 9.10 (karmic koala)

No Gravatar

rubyrep Update: I was mistaken about rubyrep’s ability to support continuous replication.  See my new post for an update.

For my current project, we need to replicate a postgresql 8.4 database between several sites.  We are looking at bucardo and rubyrep.  Both projects support a multiple master architecture.  I’m going to give rubyrep a shot, and then try bucardo.  I’ll be using ubuntu 9.10 karmic koala, but we’ll be using debian in production.  We’ll be replicating a fairly large django database between two sites.

Slony-I

Slony isn’t the only technology out there for doing replication of a postgres database.  Looks like two relatively new projects are being developed  that might be better for certain instances.  We couldn’t use slony because it is does not support bidirectional synchronization.  Slony only allows you to have a single master and multiple slaves.  For our purposes we have to have multiple masters.

Setup

rubyrep is obviously written in ruby, and you can choose from standard and jruby version.  The java ruby variant is supposed to be a bit faster, but I just want to see how it works.  To test, I’ll be installing it on for two databases on a single server.
Assuming you have ruby installed, you can install rubyrep with the following command:

sudo gem install rubyrep

This installed version 1.1.1 of rubyrep.  The next step is to generate a stub config file using the rubyrep command.  The tutorial does not specify the path, and gem didn’t add it to the default path.  For my setup the rubyrep script can be called like:

/var/lib/gems/1.8/bin/rubyrep generate bh_rubyrep.conf

This will generate a sample config file that you can edit to specify your database and server settings.  My databases are both on localhost for this test.  I uncommented the following to specify syncing all tables:

config.include_tables /./ # regexp matching all tables in the database
The /./ is just a regex to match table names.

Execution

The first step is to do a scan of the databases:

/var/lib/gems/1.8/bin/rubyrep scan -c bh_rubyrep.conf
I haven’t created the right hand database, so I got a fatal error stating it didn’t exist.  I created the second database and ran the command again, but it gave no output.  It should list all the tables with a count of mismatched records.  Unfortunately my right hand database is empty.  I dumped the left hand database and restored to the the right.
I ran the scan again and got a nice list of tables with zero mismatches.  We are going to be replicating the database of a relatively large postgres application.  With both databases running under a virtualbox vm on a macbook pro, it took 1 min 24 seconds to scan an 18mb database.  All tables have primary keys defined, which helps our cause.
eknuth@eknuth-vubuntu:~/rubyrep$ time /var/lib/gems/1.8/bin/rubyrep scan -c bh_rubyrep.conf
purchasing_purchaseancillary 100% …………………….   0
orders_orderline 100% …………………….   0
orders_orderstatus 100% …………………….   0
auth_user 100% …………………….   0
navigation_pinnedtab 100% …………………….   0
navigation_tab 100% …………………….   0
real 1m23.323s
user 1m4.708s
sys 0m8.533s
To test the syncronization, I deleted all the records from one of the tables and ran the sync command:
time /var/lib/gems/1.8/bin/rubyrep sync -c bh_rubyrep.conf
rubyrep picked up the missing 8000 records and recreated them in the rh database.  Very slick and it did not take measurably longer than the scan.

Conclusions

I’m very impressed with the ease of setup.  It really was incredibly painless, in keeping with rubyrep’s slogan.  There are more complicated sync and conflict resolution rules you can set up, but the basic config was done very quickly.

Unlike slony, which uses triggers to sync the data, rubyrep would be the sort of thing you run periodically out of a cron job.  I’m not sure if that will work for us.
rubyrep seems like an excellent way to do postgres replication.  It would also allow you to replicate data between mysql and postgres, which is kinda cool.
I also think that rubyrep is probably one of the best postgres data comparison tools available.  It seems well suited for that type of use.
Posted in technology | Tagged , , , , | 2 Comments