Tika / TIKA-1614

Geo Topic Parser

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: parser

    Description

      ##Description

      This program identifies geographic names (geonames) in unstructured text data, in support of the NSF polar research project. https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1

      This project is a content-based geotagging solution built from a variety of NLP tools, and it can be used for any geotagging purpose.

      ##Workflow

      1. Plain text input is passed to the geoparser.

      2. Location names are extracted from the text using OpenNLP NER.

      3. Each extracted name is assigned one of two roles:

      • The most frequent location name is chosen as the best match for the input text.
      • All other extracted locations are treated as alternatives of equal rank.

      4. For each location extracted above, the parser searches for the best-matching GeoName object and returns the resolved objects with three fields: name in gazetteer, longitude, and latitude.
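
The role assignment in step 3 can be sketched as follows. This is a minimal illustration, not the code from the patch: the class `LocationRanker` and the method `bestFirst` are hypothetical names, and the tie-breaking rule (the first-seen name wins) is an assumption, since the description does not say how ties are resolved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the patch.
class LocationRanker {

    // Returns the distinct names with the most frequent one first,
    // followed by the remaining names (the alternatives) in first-seen order.
    static List<String> bestFirst(List<String> extractedNames) {
        // LinkedHashMap keeps first-seen order, which makes tie-breaking deterministic.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : extractedNames) {
            counts.merge(name, 1, Integer::sum);
        }

        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Strict ">" means an earlier name wins a tie (an assumption).
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }

        List<String> result = new ArrayList<>();
        if (best == null) {
            return result; // no locations were extracted
        }
        result.add(best);
        for (String name : counts.keySet()) {
            if (!name.equals(best)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Greenland", "Alaska", "Barrow", "Alaska");
        List<String> ranked = bestFirst(names);
        System.out.println("Best match: " + ranked.get(0));
        System.out.println("Alternatives: " + ranked.subList(1, ranked.size()));
    }
}
```

In the real parser the input list would come from the OpenNLP name finder; here it is a hard-coded sample.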

      ##How to Use
      Caution: This program requires at least 1.2 GB of disk space to build the Lucene index.

      ```Java
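      import java.io.InputStream;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.sax.BodyContentHandler;
      // GeoParserConfig and the geoparser class are the ones added by this patch.

      // Assumes `geoparser`, `gazetteerPath`, and `nerPath` are defined by the enclosing class.
      void printGeoMetadata(InputStream stream) throws Exception {
          Metadata metadata = new Metadata();
          ParseContext context = new ParseContext();

          // Point the parser at the gazetteer and the OpenNLP NER model.
          GeoParserConfig config = new GeoParserConfig();
          config.setGazetterPath(gazetteerPath);
          config.setNERModelPath(nerPath);
          context.set(GeoParserConfig.class, config);

          geoparser.parse(stream, new BodyContentHandler(), metadata, context);

          // Print every metadata field the parser produced.
          for (String name : metadata.names()) {
              String value = metadata.get(name);
              System.out.println(name + " " + value);
          }
      }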
      ```
      The parser adds the following geographic information to Tika's Metadata object.

      Fields for best matched location:
      ```
      Geographic_NAME
      Geographic_LONGTITUDE
      Geographic_LATITUDE
      ```
      Fields for alternatives:
      ```
      Geographic_NAME1
      Geographic_LONGTITUDE1
      Geographic_LATITUDE1

      Geographic_NAME2
      Geographic_LONGTITUDE2
      Geographic_LATITUDE2

      ...

      ```
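
The numbered field layout above can be read back with a simple loop. The sketch below uses a plain `Map<String, String>` as a stand-in for Tika's `Metadata` object so that it runs without Tika on the classpath. The class `GeoFieldReader`, the method `readLocations`, and the sample coordinates are all hypothetical; the field names are copied as written in this issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the patch.
class GeoFieldReader {

    // Collects "name (latitude, longitude)" strings: the best match first
    // (fields without a suffix), then alternatives with suffixes 1, 2, ...
    static List<String> readLocations(Map<String, String> fields) {
        List<String> locations = new ArrayList<>();
        String suffix = "";
        int index = 0;
        while (fields.containsKey("Geographic_NAME" + suffix)) {
            locations.add(fields.get("Geographic_NAME" + suffix) + " ("
                    + fields.get("Geographic_LATITUDE" + suffix) + ", "
                    + fields.get("Geographic_LONGTITUDE" + suffix) + ")");
            index++;
            suffix = String.valueOf(index);
        }
        return locations;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only, not real gazetteer output.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Geographic_NAME", "Alaska");
        fields.put("Geographic_LATITUDE", "64.0");
        fields.put("Geographic_LONGTITUDE", "-150.0");
        fields.put("Geographic_NAME1", "Greenland");
        fields.put("Geographic_LATITUDE1", "72.0");
        fields.put("Geographic_LONGTITUDE1", "-40.0");

        for (String location : readLocations(fields)) {
            System.out.println(location);
        }
    }
}
```

With a real `Metadata` object, `metadata.get(name)` would take the place of `fields.get(name)`, and a `null` result would signal that the field is absent.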
      If you have any questions, contact me: anyayunli@gmail.com

      Attachments

        1. TIKA-1614.Mattmann.Li.052405.patch.txt
          26 kB
          Chris A. Mattmann

          People

            chrismattmann Chris A. Mattmann
            diefunction Anya Yun Li
            Votes: 1
            Watchers: 5
