pdxhash: an implementation of #pdxtags using the geohash algorithm

I never learned about geohashes in school, but I ran across the term when I was figuring out how to do fast GIS operations on the Google App Engine cloud computing platform. A geohash is a two-way function that encodes a latitude and longitude into a 20 character string. According to the Wikipedia page, a geohash could be a possible geotagging implementation, and I thought that maybe we could use one to geotag locations in Portland as specified in the #pdxtags civicapps idea submission.

Geohashes are accurate approximately 79% of the time, all the time. They do have trouble around the international date line, but we don't care about anywhere beyond Portland. A geohash is a string of 20 letters and numbers; as you read it from left to right, the area of the earth it describes gets smaller and the precision of the location increases. Because of that characteristic we can throw away most of the characters. When you group the address point dataset by geohash, it turns out that you can slice the Portland area into 15 polygons, each of which can be identified by a unique four-character string.

Here is the address point dataset organized into these groups:

select substring(trim(ST_GeoHash(wkb_geometry)) from 0 for 5) as pdx_hash,
       count(*) as address_point_count
from address_data
group by pdx_hash
order by address_point_count;
 pdx_hash | address_point_count
----------+---------------------
 c211     |                   2
 c207     |                   2
 c217     |                   3
 c209     |                  12
 c20c     |                  14
 c20b     |                  14
 c206     |                  54
 c20s     |                 157
 c215     |                 507
 c214     |                1263
 8xj2     |               14008
 c20e     |               23581
 c20d     |               24980
 c20g     |               94659
 c20f     |              156877
(15 rows)
Beyond that, you can further subdivide Portland into 108 zones which can be uniquely identified by a five-character string. It is easy to control how precisely a tagged location is identified by using more or fewer characters: a five-character pdx_hash pins down a location more accurately than a four-character one, and with a six-character pdx_hash you get 1487 different zones at higher resolution. When you look at the count of address points in each polygon, it becomes apparent that the bulk of Portland addresses fall into the same five or six zones.

One advantage of this method is that geohash is not subject to any claim of copyright or corporate control; the algorithm is in the public domain and open to use by anyone. It would be reasonable for the city to eventually add a pdx_hash to all street signs, allowing people to figure out the geotag of their current location without smartphones.

Later, I'll post some maps as well as do some error analysis. In the course of working with the address point data, I found a simple way to load csv files into PostGIS, which I will also post about. The examples in this post were done in PostGIS, but you should also be able to run the same query in Microsoft SQL Server 2008 and Oracle, although the string manipulation might be a little different.
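As a sketch of that five-character grouping, the query below uses ST_GeoHash's optional precision argument instead of trimming the full hash. It assumes the address point data has been loaded into a PostGIS database (here called civicapps, like the one built in the loading walkthrough further down) with the same address_data table and wkb_geometry column as above.

psql civicapps <<'SQL'
-- same grouping as before, but asking ST_GeoHash for five characters directly
select ST_GeoHash(wkb_geometry, 5) as pdx_hash,
       count(*) as address_point_count
from address_data
group by pdx_hash
order by address_point_count;
SQL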

The web comic xkcd is where I think most of us first heard about geohashing. ;)

bulk loading shapefiles into postgis

Civicapps datasets are made available in a format called a shapefile. ESRI shapefiles are the de facto standard for distributing spatial data, but they can be tricky for those who are new to GIS. Fortunately shapefiles are an openly documented format with a variety of useful tools available.

Shapefiles and Spatial Databases

One approach to using a shapefile in an application is to load it into a relational database that supports spatial features, like PostgreSQL/PostGIS, SQL Server 2008 (with or without ArcSDE), SQLite/Spatialite, Oracle and even MySQL. These products are sometimes referred to as spatial databases or geodatabases, and some implementations are more mature and useful than others. I'm using PostGIS for this example, along with the OGR tools that ship with the Ubuntu gdal-bin package. These tools are developed under the Open Source Geospatial Foundation (OSGeo), implement Open Geospatial Consortium standards, and are capable of doing incredible things, especially when compared to the commercial offerings.

Getting at the data

After you unzip the archive you are left with several different files, including a .shp file, a .prj file and a .dbf file. The geographic features are not stored as longitude and latitude, but in a State Plane coordinate system specific to northern Oregon, with units in feet. The projection information is in the .prj file.
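If you want to confirm what is in that .prj before reprojecting, ogrinfo (from the same gdal-bin package) will print the layer's spatial reference. A quick sketch, assuming the unzipped bridges shapefile is named Bridges_pdx.shp as in the ogr2ogr command below:

# -al reports on every layer, -so keeps it to a summary (extent, feature
# count and coordinate system) instead of dumping each feature
ogrinfo -al -so Bridges_pdx.shp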

Spatial Transformations

First we have to reproject the data from State Plane, which is commonly used by state and local governments, to the standard web mapping coordinate system, WGS 1984. This coordinate system uses latitude and longitude and works very well with the Google Maps API. We will employ a tool called ogr2ogr to do the reprojection. It will create a new shapefile called Bridges_pdx_4326.shp in the WGS84 coordinate system, which is designated by a spatial reference id of 4326.

ogr2ogr -t_srs EPSG:4326 -a_srs EPSG:4326 -f "ESRI Shapefile" Bridges_pdx_4326.shp Bridges_pdx.shp

Creating the Database

Now we need to create a PostGIS database and load it from the shapefile. We'll be using shp2pgsql, which comes with PostGIS, and psql, the command line interface to Postgres, to do the dirty work. It's really not that bad and lends itself to being automated with a shell script.

edwin@iknuth.com:~/civicapps$ createdb civicapps
edwin@iknuth.com:~/civicapps$ createlang plpgsql civicapps
edwin@iknuth.com:~/civicapps$ psql civicapps -f /usr/share/postgresql/8.4/contrib/postgis.sql
edwin@iknuth.com:~/civicapps$ psql civicapps -f /usr/share/postgresql/8.4/contrib/spatial_ref_sys.sql
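Wrapped up as the shell script mentioned above, the same steps might look like this. It is only a sketch, and it assumes the PostgreSQL 8.4 contrib paths shown; adjust them for your installation.

#!/bin/sh
# Create a spatially enabled database for the civicapps data.
set -e
DB=civicapps
createdb "$DB"
createlang plpgsql "$DB"
psql "$DB" -f /usr/share/postgresql/8.4/contrib/postgis.sql
psql "$DB" -f /usr/share/postgresql/8.4/contrib/spatial_ref_sys.sql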

Loading the Data

Now we load our reprojected data into Postgres. We are using a few command line options to specify the name of the table as "bridges_pdx" and the name of the column storing the features as "the_geom". You can use whatever naming conventions you like.

shp2pgsql -s 4326 -d -g the_geom Bridges_pdx_4326.shp bridges_pdx | psql civicapps

If you run the command without piping it to psql, you will see that it outputs the SQL necessary to create a table and the inserts to populate it. This whole process is highly scriptable and can be used to bulk load many shapefiles at once.
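As a sketch of that kind of bulk loading, a loop like the one below reprojects and loads every shapefile in a directory. The lowercased-file-name table naming and the civicapps database are just the conventions used above; change them to suit your data.

#!/bin/sh
# Reproject each shapefile to WGS84 and load it into PostGIS, one table per file.
set -e
for shp in *.shp; do
    base=$(basename "$shp" .shp)
    case "$base" in *_4326) continue ;; esac   # skip output from a previous run
    table=$(echo "$base" | tr '[:upper:]' '[:lower:]')
    ogr2ogr -t_srs EPSG:4326 -f "ESRI Shapefile" "${base}_4326.shp" "$shp"
    shp2pgsql -s 4326 -d -g the_geom "${base}_4326.shp" "$table" | psql civicapps
done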

Working with PostGIS

To look at the data, we can connect to the database and run some queries. PostGIS includes many functions that implement GIS operations in SQL. For example, we can generate KML directly from our database:

select ST_AsKML(the_geom) from bridges_pdx;

Most spatial databases implement at least some of the OpenGIS Simple Features Implementation Specification for SQL. Using these functions, we can make queries and relate tables at the database level based on the spatial location of the features, using SQL or object-relational mapping. This is a very powerful tool for developing GIS applications.

I hope that helps people get started with shapefiles and PostGIS. I'll be writing more about civicapps and working with spatial data in the future. I appreciate your interest.
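As one more sketch of that kind of spatial SQL, the query below finds bridges within a short distance of an arbitrary point in downtown Portland. It assumes the bridges_pdx table built above, along with the default gid key column that shp2pgsql adds; the same ST_DWithin predicate can just as easily join two loaded layers.

psql civicapps <<'SQL'
-- bridges within about 0.01 degrees (very roughly a kilometer at this
-- latitude) of a point downtown; the distance is in degrees because the
-- geometries are stored in WGS84
select gid, ST_AsKML(the_geom)
from bridges_pdx
where ST_DWithin(the_geom,
                 ST_SetSRID(ST_MakePoint(-122.675, 45.520), 4326),
                 0.01);
SQL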

rubyrep and postgresql replication roundup

While replication might not be tightly integrated into PostgreSQL at the moment, we are lucky to have a variety of tools, each with different qualities that meet different needs. Slony-I is mature and easy enough to implement, but it does not support multi-master replication. Bucardo and rubyrep handle that requirement and definitely have different personalities: Bucardo is written in Perl and rubyrep uses Ruby, so you might pick the language with which you are more comfortable.

I had been impressed by how easy it was to get rubyrep up and running, but for some reason I missed its ability to handle continuous replication using triggers. Luckily rubyrep's author Arndt Lehmann set me straight with a comment on my misguided post. We are currently implementing rubyrep to replicate a very large Django database and I have been very pleased with the results.

I have noticed one interesting quirk with rubyrep that we did not see with Bucardo. I'm still analyzing the situation and will submit a bug report to rubyrep or Postgres. I am using the same database backup to load the two databases onto different servers, and I noticed that rubyrep was finding a huge number of record mismatches in the time/date data. The only thing that was different between the two servers was the timezone: one was set to UTC and the other to Pacific. Fixing the timezone mismatch and restarting Postgres has taken care of the problem.

While this would have been a problem going into production, rubyrep was actually able to sync these records very easily. It painlessly synced hundreds of thousands of records without breaking a sweat. I was very impressed.
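For what it's worth, a check like this on each server before the initial sync would have caught the mismatch early. The hostnames and database name are placeholders:

# Compare the timezone reported by each postgres server that will take part
# in replication; the settings should match before syncing timestamp data.
for host in db1.example.com db2.example.com; do
    echo "== $host =="
    psql -h "$host" -d mydb -c "show timezone;"
done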