Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Invalid
Affects Version/s: 1.4
Environment: Ubuntu 12.04, Python 2.7, Apache Tika 1.4
Description
When extracting text with Apache Tika 1.4, the extracted text is duplicated in the output.
import os
import subprocess
APACHE_TIKA_PATH = os.path.abspath(os.path.join(PROJECT_ROOT, 'apache_tika/tika-app-1.4.jar'))
sout = subprocess.check_output("java -jar %s -t %s" % (APACHE_TIKA_PATH, document), shell=True)
After this call, sout contains the duplicated text.
The issue occurs for both DOC and PDF files.
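
To help narrow down where the duplication comes from, here is a minimal sketch of the same call made with an argument list instead of a shell string, plus a rough check for output that is the same content twice. The helper names (build_tika_command, extract_text, is_doubled) are hypothetical and not part of Tika or the original report. The doubling check is only a heuristic, assuming the duplicate is an exact repeat of the whole output.

```python
import subprocess


def build_tika_command(jar_path, document):
    """Return the argument list for running tika-app in plain-text mode (-t)."""
    return ["java", "-jar", jar_path, "-t", document]


def extract_text(jar_path, document):
    """Run Tika and return its standard output as text."""
    # Passing a list without shell=True avoids shell quoting problems
    # with paths that contain spaces or special characters.
    raw = subprocess.check_output(build_tika_command(jar_path, document))
    return raw.decode("utf-8", "replace")


def is_doubled(text):
    """Heuristic: True if the non-blank lines are one sequence repeated twice."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    half = len(lines) // 2
    return len(lines) >= 2 and len(lines) % 2 == 0 and lines[:half] == lines[half:]
```

Calling is_doubled(extract_text(APACHE_TIKA_PATH, document)) on an affected file would show whether the output is an exact repeat or only partly duplicated, which may help separate a Tika issue from one in the calling code.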