[TIKA-1614] Geo Topic Parser - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.9
Component/s: parser
Labels:
- memex

External issue URL:
https://github.com/AranyaLi/GeoParsingNSF

Description

##Description

This parser identifies geographic names (geonames) in unstructured text data, in support of the NSF polar research project: https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1

It is a content-based geotagging solution built from a variety of NLP tools, and it can be used for any geotagging purpose.

##Workflow

1. Plain text input is passed to the geoparser.

2. Location names are extracted from the text using OpenNLP NER.

3. Each extracted name is assigned one of two roles:

- The most frequent location name is chosen as the best match for the input text.
- All other extracted locations are treated as alternatives of equal rank.

4. For each location extracted above, the parser searches for the best GeoName object and returns the resolved objects with the fields name in gazetteer, longitude, and latitude.
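
The role assignment in step 3 can be sketched as follows. This is only an illustration under stated assumptions, not the code from the patch: `LocationRanker` and `rank` are hypothetical names, and breaking ties by first appearance is an assumption, since the description does not say how ties are handled.

```Java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the patch: orders extracted location names
// so that the most frequent one (the best match) comes first.
class LocationRanker {

    // Returns the distinct names with the most frequent name first.
    // The remaining names (the alternatives) keep their first-appearance order.
    static List<String> rank(List<String> extractedNames) {
        // LinkedHashMap preserves the order in which names were first seen.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : extractedNames) {
            counts.merge(name, 1, Integer::sum);
        }

        // Assumption: on a tie, the name that appeared first wins.
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }

        List<String> ranked = new ArrayList<>();
        if (best != null) {
            ranked.add(best);
            for (String name : counts.keySet()) {
                if (!name.equals(best)) {
                    ranked.add(name);
                }
            }
        }
        return ranked;
    }
}
```

Calling `LocationRanker.rank` with the names `Alaska`, `Greenland`, `Alaska`, `Svalbard` would put `Alaska` first as the best match, followed by `Greenland` and `Svalbard` as alternatives.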

##How to Use
Caution: This program requires at least 1.2 GB of disk space to build the Lucene index.

```Java
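import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Assumes geoparser (an instance of the parser from this patch), gazetteerPath,
// and nerPath are defined by the enclosing class. GeoParserConfig comes from this patch.
void printGeoMetadata(InputStream stream) throws Exception {
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    // Point the parser at the gazetteer and the OpenNLP NER model.
    GeoParserConfig config = new GeoParserConfig();
    config.setGazetterPath(gazetteerPath);
    config.setNERModelPath(nerPath);
    context.set(GeoParserConfig.class, config);

    geoparser.parse(stream, new BodyContentHandler(), metadata, context);

    // Print every metadata field the parser populated.
    for (String name : metadata.names()) {
        String value = metadata.get(name);
        System.out.println(name + " " + value);
    }
}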
```
The parser writes the following geographic fields to Tika's Metadata object.

Fields for best matched location:
```
Geographic_NAME
Geographic_LONGTITUDE
Geographic_LATITUDE
```
Fields for alternatives:
```
Geographic_NAME1
Geographic_LONGTITUDE1
Geographic_LATITUDE1

Geographic_NAME2
Geographic_LONGTITUDE2
Geographic_LATITUDE2

...

```
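
Because the alternatives are numbered, a caller can collect them by counting upward until a field is missing. The sketch below is a minimal illustration, not part of the patch: `GeoFields` and `alternativeNames` are hypothetical names, a plain `Map` stands in for the name/value pairs of Tika's Metadata object, and it assumes the numbering starts at 1 with no gaps, as in the listing above.

```Java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the patch: reads the numbered
// alternative location names out of the parser's output fields.
class GeoFields {

    // "fields" stands in for the name/value pairs of Tika's Metadata object.
    // Assumption: alternatives are numbered 1, 2, 3, ... without gaps.
    static List<String> alternativeNames(Map<String, String> fields) {
        List<String> names = new ArrayList<>();
        for (int i = 1; fields.containsKey("Geographic_NAME" + i); i++) {
            names.add(fields.get("Geographic_NAME" + i));
        }
        return names;
    }
}
```

The same loop extends to the matching longitude and latitude fields by appending the index to their field names.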
If you have any questions, contact me: anyayunli@gmail.com

Attachments

TIKA-1614.Mattmann.Li.052405.patch.txt
24/May/15 16:57
26 kB
Chris A. Mattmann

People

Assignee: Chris A. Mattmann

Reporter: Anya Yun Li

Votes: 1

Watchers: 5

Dates

Created: 23/Apr/15 00:44

Updated: 25/May/15 01:49

Resolved: 25/May/15 00:57