Project Description
Geography Coder parses geographic (address) information from a given text and also provides a geography ontology for applications.


This project provides a geography ontology and an interface that applications can use to parse geographic information from a given string. The core algorithms in the project are CRF (conditional random fields) and ListNetRanker, both of which are based on machine learning.
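As a rough illustration of the listwise idea behind a ListNet-style ranker, the sketch below computes the top-one probabilities (a softmax over scores) and the cross-entropy loss between the label distribution and the model distribution. It is not taken from this project's ListNetRanker: the function names and the sample scores are assumptions made for illustration only.

```python
import math


def top_one_probabilities(scores):
    """Softmax over scores: the top-one probability of each candidate."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(true_scores, predicted_scores):
    """Cross entropy between the two top-one distributions."""
    p_true = top_one_probabilities(true_scores)
    p_pred = top_one_probabilities(predicted_scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))


# Hypothetical relevance labels and model scores for three candidates.
labels = [2.0, 1.0, 0.0]
good_scores = [3.0, 1.5, 0.2]  # same ordering as the labels
bad_scores = [0.2, 1.5, 3.0]   # reversed ordering

print(listnet_loss(labels, good_scores))
print(listnet_loss(labels, bad_scores))
```

A model whose scores order the candidates the same way as the labels gets a lower loss than one that reverses them, which is the signal a listwise ranker is trained on.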

First, Geography Coder labels which parts of the given string are location names. It then uses the geo-ontology and other knowledge to build candidate geography chains and a feature set for ranking. Finally, ListNetRanker ranks these candidates and returns the best one as the result.
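The labeling step can be pictured as follows. This is a minimal sketch that assumes the labeler emits one B-LOC/I-LOC/O tag per character; the tag scheme and the function name split_where_what are assumptions for illustration, not the project's actual output format or API.

```python
def split_where_what(chars, labels):
    """Group characters tagged B-LOC/I-LOC into location spans; keep the rest."""
    where_spans, what_chars, current = [], [], []
    for ch, label in zip(chars, labels):
        if label == "B-LOC":
            # A new location starts; close any span that is still open.
            if current:
                where_spans.append("".join(current))
            current = [ch]
        elif label == "I-LOC" and current:
            current.append(ch)
        else:
            # "O" (or a stray I-LOC with no open span) ends the current location.
            if current:
                where_spans.append("".join(current))
                current = []
            what_chars.append(ch)
    if current:
        where_spans.append("".join(current))
    return where_spans, "".join(what_chars)


# Hypothetical tags for an eight-character query.
query = "遵化到廊坊的班车"
tags = ["B-LOC", "I-LOC", "O", "B-LOC", "I-LOC", "O", "O", "O"]
where, what = split_where_what(query, tags)
print(where, what)
```

The location spans found this way are what the later steps would look up in the geo-ontology to build candidate chains.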

The following query types illustrate what Geography Coder can do:
1. Query: 广州市荔湾区东风西路15号
Geography Coder detects that the entire query is an address string (No. 15 Dongfeng West Road, Liwan District, Guangzhou) and returns its most detailed geographic information as the result, such as the latitude/longitude of 东风西路15号.

2. Query: 广东省广州越秀区中山二路广东省人民医院
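Geography Coder detects "广东省广州越秀区中山二路" (Zhongshan 2nd Road, Yuexiu District, Guangzhou, Guangdong Province) as the "where" part and "广东省人民医院" (Guangdong Provincial People's Hospital) as the "what" part of the query. The geographic information for "广东省广州越秀区中山二路" is returned as the result.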

3. Query: 遵化到廊坊的班车
This query ("shuttle bus from Zunhua to Langfang") contains two places, 遵化 and 廊坊, and the user probably wants the shuttle schedule from the first place to the second. Geography Coder detects 遵化 and 廊坊 as the "where" parts and "到 的班车" as the "what" part. Furthermore, because more than one place nationwide is named 遵化 or 廊坊, Geography Coder ranks the places that share each name and returns (河北省唐山市)遵化 and (河北省)廊坊 as the best result: these two places are in the same province, which Geography Coder guesses is what the user wants.
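The same-province preference in the third example can be sketched with a single hand-written feature. The real ranker is described as a learned ListNetRanker over a feature set, so this is only an illustration: the candidate table, the "省A"/"省B" placeholder chains, and the function names are all hypothetical.

```python
from itertools import product

# Hypothetical candidate chains, top-level region first. Only the 河北省 chains
# come from the example above; "省A" and "省B" are placeholders.
candidates = {
    "遵化": [("省A", "遵化"), ("河北省", "唐山市", "遵化")],
    "廊坊": [("河北省", "廊坊"), ("省B", "廊坊")],
}


def same_province_feature(chain_a, chain_b):
    """1.0 when both chains start with the same top-level region, else 0.0."""
    return 1.0 if chain_a[0] == chain_b[0] else 0.0


def best_combination(candidates):
    """Pick one chain per place name, maximizing pairwise same-province agreement."""
    names = list(candidates)

    def score(combo):
        return sum(
            same_province_feature(a, b)
            for i, a in enumerate(combo)
            for b in combo[i + 1:]
        )

    best = max(product(*(candidates[n] for n in names)), key=score)
    return dict(zip(names, best))


result = best_combination(candidates)
print(result)
```

With these placeholder candidates, the combination in which both places fall under 河北省 scores highest, matching the behavior described for this query.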

Source code

The solution contains the following projects:
src\Core : Geocoder core algorithm
src\Core\GeoCoderParser : the core algorithm for geocoder
src\Core\POIParser : the core algorithm for POI parser
src\Demo : Geocoder console tools
src\Demo\GeoCoderConsole : Geocoder console tool
src\Demo\POIParserConsole : POI parser console tool
src\FeatureGenerator : Feature generator for ranker and parser
src\FeatureGenerator\BusinessNameParserFeatureGenerator : generate feature set for POI parser
src\FeatureGenerator\GeocoderRankerFeatureGenerator : generate feature set for Geocoder ranker
src\GeoUtils : Geocoder util libs
src\Pipeline : Geocoder offline data pipeline
src\dll : external dll which Geocoder depends on

Data Pipeline

The data pipeline builds the data set used by Geocoder. It contains the following directories:
data_pipeline : Geocoder's offline data pipeline
data_pipeline\RawData : the raw data to build geocoder data
data_pipeline\GeneratedData : the generated data for geocoder online logic
data_pipeline\BuildGeoEntityFromMIF : build MIF format geo-entity into indexed file
data_pipeline\BuildAddressData : build detailed address geo-entity into indexed file
data_pipeline\BuildTRStationData : build transit station geo-entity into indexed file
data_pipeline\BuildZoneData : build zone geo-entity into indexed file
data_pipeline\BuildLocationDictMatchData : build location name into indexed file
data_pipeline\RefinePOIParserTrainingCorpus : refine POI parser training corpus
data_pipeline\TrainORGInnerModel : train POI inner-structure parser model
data_pipeline\RefineNerTrainingCorpus : refine query named entity parser training corpus
data_pipeline\TrainWordnerModel : train named entity parser model
data_pipeline\TrainGeoRanker : train Geocoder ranker model
data_pipeline\tmp : temporary files

Run the run.bat file to start the data pipeline. It may take a while to finish.

Console Tools

deployed : Geocoder's online logic console tools
deployed\GeoCoderConsole : Geocoder online logic console tools
deployed\POIParserConsole : POI parser console tools leveraging Geocoder logic

Last edited Mar 14, 2013 at 8:34 AM by monkeyfu, version 9