Description
I am trying to convert PDF files to TXT. For some PDFs, the extracted String contains wrong characters. Could this be a Unicode problem? Can somebody help me?
I observed that the problem occurs when I convert a PDF created by PDFCreator to text: the characters are wrong. Any suggestions?
The code:
// PDFBox imports below assume the PDFBox 1.x package layout.
import java.io.File;
import java.io.FileInputStream;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {
    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;
    // PDFTextParser Constructor
    public PDFTextParser() {
    }
    // Extract text from PDF Document
    public String pdftoText(String fileName) {
        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File(fileName);
        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            try {
                // Assumed cleanup: the body of this try was lost in the post.
                if (cosDoc != null) cosDoc.close();
                if (pdDoc != null) pdDoc.close();
            } catch (Exception e1) {
                e1.printStackTrace();
            }
            return null;
        }
        System.out.println("Done.");
        return parsedText;
    }
}
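
To narrow down whether this is a Unicode problem, it may help to look at the actual code points in the returned String instead of what the console shows, and to write the text out with an explicit encoding. The following is only a minimal sketch, assuming Java 8 or later; ExtractedTextCheck, toCodePoints and writeUtf8 are hypothetical helper names and are not part of PDFBox. The sample string stands in for the value returned by pdftoText.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of PDFBox) for inspecting and saving extracted text.
public class ExtractedTextCheck {

    // Returns each code point of the text as "U+XXXX", separated by single spaces.
    public static String toCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Writes the text as UTF-8, independent of the platform default charset.
    public static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the String returned by pdftoText(...).
        String sample = "Caf\u00e9";
        System.out.println(toCodePoints(sample));

        Path out = Files.createTempFile("extracted", ".txt");
        writeUtf8(out, sample);
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

If the code points are the expected ones but the output still looks wrong, the console or file encoding is the more likely cause. If the code points themselves are wrong, the cause may be in the PDF itself, for example fonts embedded without usable Unicode mapping information, which is something the generating tool controls.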