Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 1.10
Fix Version/s: None
Description
When using the commoncrawldump component, we get the following errors:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at com.fasterxml.jackson.dataformat.cbor.CBORGenerator._flushBuffer(CBORGenerator.java:1365)
at com.fasterxml.jackson.dataformat.cbor.CBORGenerator.close(CBORGenerator.java:896)
at org.apache.nutch.tools.CommonCrawlDataDumper.serializeCBORData(CommonCrawlDataDumper.java:461)
at org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:375)
at org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:256)
and
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.StringCoding.safeTrim(StringCoding.java:89)
at java.lang.StringCoding.access$100(StringCoding.java:50)
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:154)
at java.lang.StringCoding.decode(StringCoding.java:193)
at java.lang.StringCoding.decode(StringCoding.java:254)
at java.lang.String.<init>(String.java:536)
at java.io.ByteArrayOutputStream.toString(ByteArrayOutputStream.java:208)
at org.apache.nutch.tools.CommonCrawlFormatJackson.generateJson(CommonCrawlFormatJackson.java:80)
at org.apache.nutch.tools.AbstractCommonCrawlFormat.getJsonData(AbstractCommonCrawlFormat.java:121)
at org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:361)
at org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:256)
The segment files total 1.41 GB in size. However, we can successfully dump segments whose size is 100 MB.
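Both traces end inside java.io.ByteArrayOutputStream: the first in grow() while the CBOR generator flushes, the second in toString() while the JSON is turned into a String. The following is a minimal sketch of that pattern, not Nutch's actual code; the class and method names (BufferVsStream, CountingSink, buffered, streamed) are hypothetical and only illustrate why heap use scales with the amount of data buffered, which would be consistent with a 100 MB input succeeding where a 1.41 GB input fails.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in Nutch.
public class BufferVsStream {

    /** Stands in for a file or socket: counts bytes but retains none of them. */
    static final class CountingSink extends OutputStream {
        long count = 0;

        @Override
        public void write(int b) {
            count++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            count += len;
        }
    }

    /** Buffers all output in memory, then decodes it into a String. */
    static String buffered(byte[] chunk, int repeats) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < repeats; i++) {
            // When the internal array is full, grow() calls Arrays.copyOf,
            // so the old and the new array briefly coexist on the heap.
            out.write(chunk);
        }
        // toString decodes the bytes into a second, full-size copy.
        return out.toString("UTF-8");
    }

    /** Writes each chunk straight to the sink; only one chunk is held at a time. */
    static long streamed(byte[] chunk, int repeats) throws IOException {
        CountingSink sink = new CountingSink();
        for (int i = 0; i < repeats; i++) {
            sink.write(chunk);
        }
        return sink.count;
    }

    public static void main(String[] args) throws IOException {
        byte[] chunk = "{\"url\":\"http://example.org/\"}\n".getBytes(StandardCharsets.UTF_8);
        int repeats = 1000;
        System.out.println("buffered chars: " + buffered(chunk, repeats).length());
        System.out.println("streamed bytes: " + streamed(chunk, repeats));
    }
}
```

In the buffered variant the peak heap requirement is a multiple of the payload size, while the streamed variant needs only one chunk at a time. Whether the dumper can be restructured this way, or whether a larger JVM heap is the intended answer, is not stated in this report.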