Create a content connector

A content connector is a software program used to traverse the data in an enterprise's repository and populate a data source. Google provides the following options for developing content connectors:

The Content Connector SDK. This is a good option if you are programming in Java. The Content Connector SDK is a wrapper around the REST API allowing you to quickly create connectors. To create a content connector using the SDK, refer to Create a content connector using the Content Connector SDK.
A low-level REST API or API libraries. Use these options if you're not programming in Java, or if your codebase better accommodates a REST API or a library. To create a content connector using the REST API, refer to Create a content connector using the REST API.

A typical content connector performs the following tasks:

Reads and processes configuration parameters.
Pulls discrete chunks of indexable data, called "items," from the third-party content repository.
Combines ACLs, metadata, and content data into indexable items.
Indexes items to the Cloud Search data source.
(Optional) Listens to change notifications from the third-party content repository. Change notifications are converted into indexing requests to keep the Cloud Search data source in sync with the third-party repository. The connector only performs this task if the repository supports change detection.

Create a content connector using the Content Connector SDK

The following sections explain how to create a content connector using the Content Connector SDK.

Set up dependencies

You must include certain dependencies in your build file to use the SDK. The dependencies for each build environment are listed below:

Maven

<dependency>
  <groupId>com.google.enterprise.cloudsearch</groupId>
  <artifactId>google-cloudsearch-indexing-connector-sdk</artifactId>
  <version>v1-0.0.3</version>
</dependency>

Gradle

// The 'compile' configuration was removed in Gradle 7; 'implementation' replaces it.
implementation group: 'com.google.enterprise.cloudsearch',
        name: 'google-cloudsearch-indexing-connector-sdk',
        version: 'v1-0.0.3'

Create your connector configuration

Every connector has a configuration file containing parameters used by the connector, such as the ID for your repository. Parameters are defined as key-value pairs, such as api.sourceId=1234567890abcdef.

The Google Cloud Search SDK contains several Google-supplied configuration parameters used by all connectors. You must declare the following Google-supplied parameters in your configuration file:

For a content connector, you must declare api.sourceId and api.serviceAccountPrivateKeyFile, because these parameters identify the location of your repository and the private key needed to access it.

For an identity connector, you must declare api.identitySourceId, because this parameter identifies the location of your external identity source. If you are syncing users, you must also declare api.customerId, the unique ID for your enterprise's Google Workspace account.

Unless you want to override the default values of other Google-supplied parameters, you do not need to declare them in your configuration file. For additional information on the Google-supplied configuration parameters, such as how to generate certain IDs and keys, refer to Google-supplied configuration parameters.

You can also define your own repository-specific parameters for use in your configuration file.
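As an illustration of the key-value format, the following is a minimal sketch that parses a configuration with the standard java.util.Properties class and reports which required content-connector parameters are missing. It is not the SDK's own configuration loader; the class name ConnectorConfigCheck and its methods are hypothetical and exist only for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration; not part of the Content Connector SDK.
public class ConnectorConfigCheck {
  // Parameters a content connector must declare, per the text above.
  static final List<String> REQUIRED_KEYS =
      Arrays.asList("api.sourceId", "api.serviceAccountPrivateKeyFile");

  /** Parses key-value pairs written in the java.util.Properties format. */
  static Properties parse(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }

  /** Returns the required keys that are absent or blank, in declaration order. */
  static List<String> missingKeys(Properties props) {
    List<String> missing = new ArrayList<>();
    for (String key : REQUIRED_KEYS) {
      String value = props.getProperty(key);
      if (value == null || value.trim().isEmpty()) {
        missing.add(key);
      }
    }
    return missing;
  }
}
```

Calling missingKeys on a configuration that declares only api.sourceId returns a list containing api.serviceAccountPrivateKeyFile, which a connector could report before it starts indexing.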

Pass the configuration file to the connector

Set the system property config to pass the configuration file to your connector. You can set the property using the -D argument when starting the connector. For example, the following command starts the connector with the MyConfig.properties configuration file:

java -classpath myconnector.jar;... -Dconfig=MyConfig.properties MyConnector

If this argument is missing, the SDK attempts to access a default configuration file named connector-config.properties.
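The fallback described above can be sketched with the standard System.getProperty call, which accepts a default value. This only illustrates the lookup order; the class name ConfigPathResolver is hypothetical, and the SDK's actual implementation may differ.

```java
// Hypothetical helper for illustration; not part of the Content Connector SDK.
public class ConfigPathResolver {
  // Default file name stated in the text above.
  static final String DEFAULT_CONFIG_FILE = "connector-config.properties";

  /** Returns the value of -Dconfig=..., or the default name if the property is unset. */
  static String resolve() {
    return System.getProperty("config", DEFAULT_CONFIG_FILE);
  }
}
```

When the JVM is started with -Dconfig=MyConfig.properties, resolve() returns MyConfig.properties; without the argument it returns connector-config.properties.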

Determine your traversal strategy

The primary function of a content connector is to traverse a repository and index its data. You must implement a traversal strategy based on the size and layout of data in your repository. You can design your own strategy or choose from the following strategies implemented in the SDK:

Full traversal strategy

A full traversal strategy scans the entire repository and blindly indexes every item. This strategy is commonly used when you have a small repository and can afford the overhead of doing a full traversal every time you index.

This traversal strategy is suitable for small repositories with mostly static, non-hierarchical data. You might also use this traversal strategy when change detection is difficult or not supported by the repository.

List traversal strategy

A list traversal strategy scans the entire repository, including all child nodes, determining the status of each item. Then, the connector takes a second pass and only indexes items that are new or have been updated since the last indexing. This strategy is commonly used to perform incremental updates to an existing index (instead of having to do a full traversal every time you update the index).

This traversal strategy is suitable when change detection is difficult or not supported by the repository, you have non-hierarchical data, and you are working with very large data sets.
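The two-pass idea can be sketched in plain Java by comparing a hash of each item's current content against the hash recorded at the last indexing run. This is an in-memory illustration only, under the assumption that both sets of hashes are available as maps; the class and method names are hypothetical, and the SDK's list traversal template does not necessarily work this way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of change detection by hash; not the SDK's implementation.
public class ListTraversalSketch {
  /**
   * First pass: returns the IDs of items that are new (no recorded hash) or
   * updated (recorded hash differs from the current one). A second pass would
   * fetch and index only these items.
   *
   * @param currentHashes item ID to hash of the item's current content
   * @param indexedHashes item ID to hash recorded at the last indexing run
   */
  static List<String> findNewOrUpdated(Map<String, String> currentHashes,
                                       Map<String, String> indexedHashes) {
    List<String> toIndex = new ArrayList<>();
    for (Map.Entry<String, String> entry : currentHashes.entrySet()) {
      String previousHash = indexedHashes.get(entry.getKey());
      // A null previous hash means the item was never indexed.
      if (!entry.getValue().equals(previousHash)) {
        toIndex.add(entry.getKey());
      }
    }
    return toIndex;
  }
}
```

The result follows the iteration order of currentHashes, so a caller that needs a stable order can pass a LinkedHashMap or a TreeMap.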

Graph traversal strategy

A graph traversal strategy scans the entire parent node, determining the status of each item. Then, the connector takes a second pass and indexes only the items in the root node that are new or have been updated since the last indexing. Finally, the connector passes any child IDs and then indexes the items in the child nodes that are new or have been updated. The connector continues recursively through all child nodes until all items have been addressed. This traversal is typically used for hierarchical repositories where listing all IDs isn't practical.

This strategy is suitable if you have hierarchical data that needs to be crawled, such as a series of directories or web pages.
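The parent-then-children order can be sketched as a breadth-first walk that starts at the root and queues child IDs as they are discovered. This is a simplified in-memory illustration, assuming the hierarchy is available as a map from parent ID to child IDs; the names are hypothetical, and the SDK's graph traversal template is not implemented this way.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a hierarchical crawl; not the SDK's implementation.
public class GraphTraversalSketch {
  /**
   * Visits the root first, then its children, then their children, and so on.
   * Returns the IDs in the order they were visited.
   *
   * @param rootId   ID of the root node
   * @param children parent ID to the IDs of its direct children
   */
  static List<String> traverse(String rootId, Map<String, List<String>> children) {
    List<String> visitOrder = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    Deque<String> queue = new ArrayDeque<>();
    queue.add(rootId);
    seen.add(rootId);
    while (!queue.isEmpty()) {
      String id = queue.remove();
      // A real connector would check the item's status here and index it
      // only if it is new or updated.
      visitOrder.add(id);
      for (String child : children.getOrDefault(id, Collections.emptyList())) {
        // Set.add returns false for an ID already queued, which guards against cycles.
        if (seen.add(child)) {
          queue.add(child);
        }
      }
    }
    return visitOrder;
  }
}
```

The seen set matters for repositories such as linked web pages, where the same item can be reachable from more than one parent.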

Each of these traversal strategies is implemented by a template connector class in the SDK. While you can implement your own traversal strategy, these templates greatly speed up the development of your connector. To create a connector using a template, proceed to the section corresponding to your traversal strategy:

Create a full traversal connector using a template class
Create a list traversal connector using a template class
Create a graph traversal connector using a template class

Create a full traversal connector using a template class

This section of the docs refers to code snippets from the FullTraversalSample example.

Implement the connector’s entry point

The entry point to a connector is the main() method. This method’s primary task is to create an instance of the Application class and invoke its start() method to run the connector.

Before calling application.start(), use the IndexingApplication.Builder class to instantiate the FullTraversalConnector template. The FullTraversalConnector accepts a Repository object whose methods you implement. The following code snippet shows how to implement the main() method:

FullTraversalSample.java
