Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Hi there,
I have run into an issue that is not yet handled, and I would like to suggest the following fix to the source code of the Confluence Repository Connector.
For details about this issue, please refer to the information below:
1. Connector Name
confluence-v6 \ Confluence Repository Connector
2. Overview
- In the Confluence Repository Connector, there is an error in the logic that determines whether a document has attachments.
- Wrong logic leads to attachments not being crawled.
※ This error only occurs when crawling documents from Confluence server, while crawling documents from Confluence Cloud (SaaS) still works normally.
- Formats of the document's ID when there is a file attached are as below:
- Crawled from Confluence server: <id of attachment>-<id of blog/page>
- Crawled from Confluence Cloud (SaaS): att<id of attachment>-<id of blog/page>
3. Reproduction
- On Confluence server:
- Create a blog.
- Add attachments to the newly created blog.
- On ManifoldCF:
- Create a Confluence Repository Connector with the aforementioned Confluence server information.
- Create a job using the connector created above with the following details:
- On the [Page] tab:
- Process Attachments: (Check).
- Type Specification: Blog.
- On the [Page] tab:
- Start job.
- Check [Simple History Report].
4. Cause
- The logic that decides whether a document has a file attachment checks whether the document's ID begins with att; if it does, the document is treated as having an attachment.
- However, the ID of a document crawled from the Confluence server is not prefixed with att when a file is attached (see the formats in item 2), so the check fails for those documents.
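The cause can be illustrated with a minimal sketch of the prefix-based check. The class name `PrefixCheckDemo` and the sample IDs are hypothetical and chosen only to match the two formats listed in item 2; this is not the connector's actual test code.

```java
// Minimal sketch of the prefix-based check described above.
// PrefixCheckDemo and the sample IDs are hypothetical illustrations.
final class PrefixCheckDemo {
    private static final String ATTACHMENT_ID_PREFIX = "att";

    // Mirrors the existing logic: an ID counts as an attachment only if it starts with "att".
    static boolean isAttachment(String id) {
        return id.startsWith(ATTACHMENT_ID_PREFIX);
    }

    public static void main(String[] args) {
        String cloudId = "att12345-67890"; // Confluence Cloud format: att<attachment id>-<page id>
        String serverId = "12345-67890";   // Confluence server format: <attachment id>-<page id>

        System.out.println(isAttachment(cloudId));  // recognized as an attachment
        System.out.println(isAttachment(serverId)); // not recognized, so the attachment is skipped
    }
}
```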
5. Solution
My observation is as below:
- If a document has a file attachment, the ID of that document is a string of characters joined by the - character.
- If a document does not have a file attachment, the ID of that document does not contain the - character.
Therefore, it is possible to judge whether a file is attached or not by checking whether the ID contains the - character.
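As a rough sketch of this idea, the check below tests for the - character instead of a prefix. The class name `SeparatorCheckDemo` and the sample IDs are hypothetical, and the sketch relies on the observation above that IDs of plain pages and blogs never contain -; if that assumption does not hold for some Confluence setup, the check would produce false positives.

```java
// Minimal sketch of a separator-based check, assuming plain page/blog IDs never contain "-".
// SeparatorCheckDemo and the sample IDs are hypothetical illustrations.
final class SeparatorCheckDemo {
    private static final String ATTACHMENT_ID_CHARACTER = "-";

    // An ID is treated as an attachment if it contains the separator anywhere.
    static boolean isAttachment(String id) {
        return id.contains(ATTACHMENT_ID_CHARACTER);
    }

    public static void main(String[] args) {
        String[] ids = {
            "att12345-67890", // Confluence Cloud attachment format
            "12345-67890",    // Confluence server attachment format
            "67890"           // page or blog without an attachment
        };
        for (String id : ids) {
            System.out.println(id + " -> " + isAttachment(id));
        }
    }
}
```

Both attachment formats pass this check, while an ID without a separator does not.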
6. Suggested source code (based on release 2.22.1)
**Class: org.apache.manifoldcf.crawler.connectors.confluence.v6.util.ConfluenceUtil**
- private static final String ATTACHMENT_ID_PREFIX = "att";
+ private static final String ATTACHMENT_ID_CHARACTER = "-";
  public static Boolean isAttachment(String id) {
-   return id.startsWith(ATTACHMENT_ID_PREFIX);
+   return id.contains(ATTACHMENT_ID_CHARACTER);
  }