A content connector is a software program used to traverse the data in an enterprise's repository and populate a data source. Google provides the following options for developing content connectors:
- The Content Connector SDK. This is a good option if you are programming in Java. The Content Connector SDK is a wrapper around the REST API allowing you to quickly create connectors. To create a content connector using the SDK, refer to Create a content connector using the Content Connector SDK.
- A low-level REST API or API libraries. Use these options if you're not programming in Java, or if your codebase better accommodates a REST API or a library. To create a content connector using the REST API, refer to Create a content connector using the REST API.
A typical content connector performs the following tasks:
- Reads and processes configuration parameters.
- Pulls discrete chunks of indexable data, called "items," from the third-party content repository.
- Combines ACLs, metadata, and content data into indexable items.
- Indexes items to the Cloud Search data source.
- (optional) Listens to change notifications from the third-party content repository. Change notifications are converted into indexing requests to keep the Cloud Search data source in sync with the third-party repository. The connector only performs this task if the repository supports change detection.
Create a content connector using the Content Connector SDK
The following sections explain how to create a content connector using the Content Connector SDK.
Set up dependencies
You must include certain dependencies in your build file to use the SDK. The dependencies for each supported build environment are as follows:
Maven
<dependency>
<groupId>com.google.enterprise.cloudsearch</groupId>
<artifactId>google-cloudsearch-indexing-connector-sdk</artifactId>
<version>v1-0.0.3</version>
</dependency>
Gradle
compile group: 'com.google.enterprise.cloudsearch',
name: 'google-cloudsearch-indexing-connector-sdk',
version: 'v1-0.0.3'
Create your connector configuration
Every connector has a configuration file containing parameters used by the connector, such as the ID for your repository. Parameters are defined as key-value pairs, such as api.sourceId=1234567890abcdef.
The Google Cloud Search SDK contains several Google-supplied configuration parameters used by all connectors. You must declare the following Google-supplied parameters in your configuration file:
- For a content connector, you must declare api.sourceId and api.serviceAccountPrivateKeyFile, as these parameters identify the location of your repository and the private key needed to access the repository.
- For an identity connector, you must declare api.identitySourceId, as this parameter identifies the location of your external identity source. If you are syncing users, you must also declare api.customerId as the unique ID for your enterprise's Google Workspace account.
Unless you want to override the default values of other Google-supplied parameters, you do not need to declare them in your configuration file. For additional information on the Google-supplied configuration parameters, such as how to generate certain IDs and keys, refer to Google-supplied configuration parameters.
You can also define your own repository-specific parameters for use in your configuration file.
Pass the configuration file to the connector
Set the system property config to pass the configuration file to your connector. You can set the property using the -D argument when starting the connector. For example, the following command starts the connector with the MyConfig.properties configuration file:
java -classpath myconnector.jar;... -Dconfig=MyConfig.properties MyConnector
If this argument is missing, the SDK attempts to access a default configuration file named connector-config.properties.
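The lookup described above can be sketched with the Java standard library alone. This is an illustration, not the SDK's own code: the class ConfigFileSketch and its methods are hypothetical, and the sketch assumes the configuration file follows the Java properties format.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class ConfigFileSketch {
  // Default file name stated in the text above.
  static final String DEFAULT_CONFIG = "connector-config.properties";

  // Returns the value of -Dconfig, or the default name when the property is not set.
  static String resolveConfigPath(String configProperty) {
    if (configProperty == null || configProperty.isEmpty()) {
      return DEFAULT_CONFIG;
    }
    return configProperty;
  }

  // Parses key-value pairs, assuming the Java properties file format.
  static Properties parse(Reader reader) {
    Properties properties = new Properties();
    try {
      properties.load(reader);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return properties;
  }

  public static void main(String[] args) {
    String path = resolveConfigPath(System.getProperty("config"));
    System.out.println("Configuration file: " + path);

    // Inline text stands in for the file contents so the sketch runs anywhere.
    Properties properties = parse(new StringReader("api.sourceId=1234567890abcdef\n"));
    System.out.println("api.sourceId = " + properties.getProperty("api.sourceId"));
  }
}
```

Running the sketch with -Dconfig=MyConfig.properties reports that file name. Running it without the argument reports the default name.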
Determine your traversal strategy
The primary function of a content connector is to traverse a repository and index its data. You must implement a traversal strategy based on the size and layout of data in your repository. You can design your own strategy or choose from the following strategies implemented in the SDK:
- Full traversal strategy
A full traversal strategy scans the entire repository and blindly indexes every item. This strategy is commonly used when you have a small repository and can afford the overhead of doing a full traversal every time you index.
This traversal strategy is suitable for small repositories with mostly static, non-hierarchical data. You might also use this traversal strategy when change detection is difficult or not supported by the repository.
- List traversal strategy
A list traversal strategy scans the entire repository, including all child nodes, determining the status of each item. Then, the connector takes a second pass and only indexes items that are new or have been updated since the last indexing. This strategy is commonly used to perform incremental updates to an existing index (instead of having to do a full traversal every time you update the index).
This traversal strategy is suitable when change detection is difficult or not supported by the repository, you have non-hierarchical data, and you are working with very large data sets.
- Graph traversal strategy
A graph traversal strategy scans the entire parent node, determining the status of each item. Then, the connector takes a second pass and only indexes items in the root node that are new or have been updated since the last indexing. Finally, the connector passes along any child IDs and then indexes items in the child nodes that are new or have been updated. The connector continues recursively through all child nodes until all items have been addressed. Such traversal is typically used for hierarchical repositories where listing all IDs isn't practical.
This strategy is suitable if you have hierarchical data that needs to be crawled, such as a series of directories or web pages.
Each of these traversal strategies is implemented by a template connector class in the SDK. While you can implement your own traversal strategy, these templates greatly speed up the development of your connector. To create a connector using a template, proceed to the section corresponding to your traversal strategy:
- Create a full traversal connector using a template class
- Create a list traversal connector using a template class
- Create a graph traversal connector using a template class
Create a full traversal connector using a template class
This section of the docs refers to code snippets from the FullTraversalSample example.
Implement the connector’s entry point
The entry point to a connector is the main() method. This method’s primary task is to create an instance of the Application class and invoke its start() method to run the connector.
Before calling application.start(), use the IndexingApplication.Builder class to instantiate the FullTraversalConnector template. The FullTraversalConnector accepts a Repository object whose methods you implement. The FullTraversalSample example shows how to implement the main() method.
Behind the scenes, the SDK calls the initConfig() method after your connector’s main() method calls Application.build. The initConfig() method performs the following tasks:
- Calls the Configuration.isInitialized() method to ensure that the Configuration hasn’t been initialized.
- Initializes a Configuration object with the Google-supplied key-value pairs. Each key-value pair is stored in a ConfigValue object within the Configuration object.
Implement the Repository interface
The sole purpose of the Repository object is to perform the traversal and indexing of repository items. When using a template, you need only override certain methods within the Repository interface to create a content connector. The methods you override depend on the template and traversal strategy you use. For the FullTraversalConnector, override the following methods:
- The init() method. To perform any data repository set-up and initialization, override the init() method.
- The getAllDocs() method. To traverse and index all items in the data repository, override the getAllDocs() method. This method is called once for each scheduled traversal (as defined by your configuration).
- (optional) The getChanges() method. If your repository supports change detection, override the getChanges() method. This method is called once for each scheduled incremental traversal (as defined by your configuration) to retrieve modified items and index them.
- (optional) The close() method. If you need to perform repository cleanup, override the close() method. This method is called once during shutdown of the connector.
Each of the methods of the Repository object returns some type of ApiOperation object. An ApiOperation object performs an action in the form of a single, or perhaps multiple, IndexingService.indexItem() calls to perform the actual indexing of your repository.
Get custom configuration parameters
As part of handling your connector’s configuration, you will need to get any custom parameters from the Configuration object. This task is usually performed in a Repository class's init() method.
The Configuration class has several methods for getting different data types from a configuration. Each method returns a ConfigValue object. You will then use the ConfigValue object’s get() method to retrieve the actual value. The FullTraversalSample example shows how to retrieve a single custom integer value from a Configuration object.
To get and parse a parameter containing several values, use one of the Configuration class's type parsers to parse the data into discrete chunks. The tutorial connector uses the getMultiValue method to get a list of GitHub repository names.
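As a rough illustration of what parsing a multi-value parameter involves, the following sketch splits a comma-separated value into a list using only the Java standard library. It does not call the SDK. The class MultiValueSketch, the method parseList, the comma delimiter, and the sample repository names are all assumptions made for this example, and the behavior of getMultiValue itself may differ.

```java
import java.util.ArrayList;
import java.util.List;

class MultiValueSketch {
  // Splits a comma-separated parameter value into trimmed, non-empty entries.
  static List<String> parseList(String rawValue) {
    List<String> values = new ArrayList<>();
    if (rawValue == null) {
      return values;
    }
    for (String part : rawValue.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        values.add(trimmed);
      }
    }
    return values;
  }

  public static void main(String[] args) {
    // Hypothetical raw value, as it might appear after the "=" in a configuration file.
    String rawValue = "example-org/repo-one, example-org/repo-two";
    for (String repositoryName : parseList(rawValue)) {
      System.out.println(repositoryName);
    }
  }
}
```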
Perform a full traversal
Override getAllDocs() to perform a full traversal and index your repository. The getAllDocs() method accepts a checkpoint. The checkpoint is used to resume indexing at a specific item should the process be interrupted. For each item in your repository, perform these steps in the getAllDocs() method:
- Set permissions.
- Set the metadata for the item that you are indexing.
- Combine the metadata and item into one indexable RepositoryDoc.
- Package each indexable item into an iterator returned by the getAllDocs() method. Note that getAllDocs() actually returns a CheckpointCloseableIterable, which is an iteration of ApiOperation objects, each object representing an API request performed on a RepositoryDoc, such as indexing it.
If the set of items is too large to process in a single call, include a checkpoint and set hasMore(true) to indicate more items are available for indexing.
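The checkpoint and hasMore idea can be sketched without the SDK. In the example below, the checkpoint is assumed to be a simple integer offset into an in-memory list, and the Batch class and nextBatch method are hypothetical names. A real connector would encode whatever state its repository needs in order to resume.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CheckpointSketch {
  // One batch of items plus the state needed to resume after it.
  static final class Batch {
    final List<String> items;
    final int nextCheckpoint;
    final boolean hasMore;

    Batch(List<String> items, int nextCheckpoint, boolean hasMore) {
      this.items = items;
      this.nextCheckpoint = nextCheckpoint;
      this.hasMore = hasMore;
    }
  }

  // Returns up to batchSize items starting at the checkpoint offset.
  static Batch nextBatch(List<String> repository, int checkpoint, int batchSize) {
    int end = Math.min(checkpoint + batchSize, repository.size());
    List<String> items = new ArrayList<>(repository.subList(checkpoint, end));
    return new Batch(items, end, end < repository.size());
  }

  public static void main(String[] args) {
    List<String> repository = Arrays.asList("doc1", "doc2", "doc3", "doc4", "doc5");
    int checkpoint = 0;
    Batch batch;
    do {
      batch = nextBatch(repository, checkpoint, 2);
      System.out.println(batch.items + " hasMore=" + batch.hasMore);
      // If the process stopped here, the saved checkpoint would let it resume.
      checkpoint = batch.nextCheckpoint;
    } while (batch.hasMore);
  }
}
```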
Set the permissions for an item
Your repository uses an Access Control List (ACL) to identify the users or groups that have access to an item. An ACL is a list of IDs for groups or users who can access the item.
You must duplicate the ACL used by your repository to ensure only those users with access to an item can see that item within a search result. The ACL for an item must be included when indexing an item so that Google Cloud Search has the information it needs to provide the correct level of access to the item.
The Content Connector SDK provides a rich set of ACL classes and methods to model the ACLs of most repositories. You must analyze the ACL for each item in your repository and create a corresponding ACL for Google Cloud Search when you index an item. If your repository’s ACL employs concepts such as ACL inheritance, modeling that ACL can be tricky. For further information on Google Cloud Search ACLs, refer to Google Cloud Search ACLs.
Note: The Cloud Search Indexing API supports single-domain ACLs. It does not support cross-domain ACLs.
Use the Acl.Builder class to set access to each item using an ACL. The full traversal sample allows all users or “principals” (getCustomerPrincipal()) to be “readers” of all items (.setReaders()) when performing a search.
You need to understand ACLs to properly model ACLs for the repository. For example, you might be indexing files within a file system that uses some sort of inheritance model whereby child folders inherit permissions from parent folders. Modeling ACL inheritance requires additional information covered in Google Cloud Search ACLs.
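To make the inheritance example concrete, the sketch below computes the readers of an item in a hypothetical file system where every child inherits the readers of its ancestors. It models only the source repository, not Cloud Search. The class AclInheritanceSketch, the two maps, and the principal names are assumptions for illustration, and the sketch assumes the folder hierarchy contains no cycles.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class AclInheritanceSketch {
  // Collects the readers granted on the item itself and on each of its ancestors.
  // parentOf maps an item ID to its parent ID; root items have no entry.
  static Set<String> effectiveReaders(
      String itemId, Map<String, String> parentOf, Map<String, Set<String>> readersOf) {
    Set<String> readers = new TreeSet<>();
    String current = itemId;
    while (current != null) {
      readers.addAll(readersOf.getOrDefault(current, Collections.<String>emptySet()));
      current = parentOf.get(current);
    }
    return readers;
  }

  public static void main(String[] args) {
    Map<String, String> parentOf = new HashMap<>();
    parentOf.put("report.txt", "finance");
    parentOf.put("finance", "root");

    Map<String, Set<String>> readersOf = new HashMap<>();
    readersOf.put("root", Collections.singleton("admins"));
    readersOf.put("finance", Collections.singleton("finance-team"));
    readersOf.put("report.txt", Collections.singleton("alice"));

    System.out.println(effectiveReaders("report.txt", parentOf, readersOf));
  }
}
```

A flattened reader set like this is only one way to mirror inherited permissions. Google Cloud Search ACLs describes the inheritance options the index itself supports.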
Set the metadata for an item
Metadata is stored in an Item object. To create an Item, you need a minimum of a unique string ID, item type, ACL, URL, and version for the item. You can build an Item using the IndexingItemBuilder helper class.
Create the indexable item
Once you have set the metadata for the item, you can create the actual indexable item using the RepositoryDoc.Builder class. A RepositoryDoc is a type of ApiOperation that performs the actual IndexingService.indexItem() request.
You can also use the setRequestMode() method of the RepositoryDoc.Builder class to identify the indexing request as ASYNCHRONOUS or SYNCHRONOUS:
ASYNCHRONOUS
- Asynchronous mode results in longer indexing-to-serving latency and accommodates large throughput quota for indexing requests. Asynchronous mode is recommended for initial indexing (backfill) of the entire repository.
SYNCHRONOUS
- Synchronous mode results in shorter indexing-to-serving latency and accommodates limited throughput quota. Synchronous mode is recommended for indexing of updates and changes to the repository. If unspecified, the request mode defaults to SYNCHRONOUS.
Package each indexable item in an iterator
The getAllDocs() method returns an Iterator, specifically a CheckpointCloseableIterable, of RepositoryDoc objects. You can use the CheckpointCloseableIterableImpl.Builder class to construct and return an iterator. The SDK executes each indexing call enclosed within the iterator.
Next Steps
Here are a few next steps you might take:
- (optional) If your indexing throughput seems slow, refer to Increase indexing rate for FullTraversalConnector.
- (optional) Implement the close() method to release any resources before shutdown.
- (optional) Create an identity connector using the Content Connector SDK.
Create a list traversal connector using a template class
The Cloud Search Indexing Queue is used to hold IDs and optional hash values for each item in the repository. A list traversal connector pushes item IDs to the Google Cloud Search Indexing Queue and retrieves them one at a time for indexing. Google Cloud Search maintains queues and compares queue contents to determine item status, such as whether an item has been deleted from the repository. For further information on the Cloud Search Indexing Queue, refer to The Cloud Search Indexing Queue.
This section of the docs refers to code snippets from the ListTraversalSample example.
Implement the connector’s entry point
The entry point to a connector is the main() method. This method’s primary task is to create an instance of the Application class and invoke its start() method to run the connector.
Before calling application.start(), use the IndexingApplication.Builder class to instantiate the ListingConnector template. The ListingConnector accepts a Repository object whose methods you implement. The ListTraversalSample example shows how to instantiate the ListingConnector and its associated Repository.
Behind the scenes, the SDK calls the initConfig() method after your connector’s main() method calls Application.build. The initConfig() method:
- Calls the Configuration.isInitialized() method to ensure that the Configuration hasn’t been initialized.
- Initializes a Configuration object with the Google-supplied key-value pairs. Each key-value pair is stored in a ConfigValue object within the Configuration object.
Implement the Repository interface
The sole purpose of the Repository object is to perform the traversal and indexing of repository items. When using a template, you need only override certain methods within the Repository interface to create a content connector. The methods you override depend on the template and traversal strategy you use. For the ListingConnector, override the following methods:
- The init() method. To perform any data repository set-up and initialization, override the init() method.
- The getIds() method. To retrieve IDs and hash values for all records in the repository, override the getIds() method.
- The getDoc() method. To add, update, modify, or delete items in the index, override the getDoc() method.
- (optional) The getChanges() method. If your repository supports change detection, override the getChanges() method. This method is called once for each scheduled incremental traversal (as defined by your configuration) to retrieve modified items and index them.
- (optional) The close() method. If you need to perform repository cleanup, override the close() method. This method is called once during shutdown of the connector.
Each of the methods of the Repository object returns some type of ApiOperation object. An ApiOperation object performs an action in the form of a single, or perhaps multiple, IndexingService.indexItem() calls to perform the actual indexing of your repository.
Get custom configuration parameters
As part of handling your connector’s configuration, you will need to get any custom parameters from the Configuration object. This task is usually performed in a Repository class's init() method.
The Configuration class has several methods for getting different data types from a configuration. Each method returns a ConfigValue object. You will then use the ConfigValue object’s get() method to retrieve the actual value. The FullTraversalSample example shows how to retrieve a single custom integer value from a Configuration object.
To get and parse a parameter containing several values, use one of the Configuration class's type parsers to parse the data into discrete chunks. The tutorial connector uses the getMultiValue method to get a list of GitHub repository names.
Perform the list traversal
Override the getIds() method to retrieve IDs and hash values for all records in the repository. The getIds() method accepts a checkpoint. The checkpoint is used to resume indexing at a specific item should the process be interrupted.
Next, override the getDoc() method to handle each item in the Cloud Search Indexing Queue.
Push item IDs and hash values
Override getIds() to fetch the item IDs and their associated content hash values from the repository. ID and hash value pairs are then packaged into a push operation request to the Cloud Search Indexing Queue. Root or parent IDs are typically pushed first, followed by child IDs, until the entire hierarchy of items has been processed.
The getIds() method accepts a checkpoint representing the last item to be indexed. The checkpoint can be used to resume indexing at a specific item should the process be interrupted. For each item in your repository, perform these steps in the getIds() method:
- Get each item ID and associated hash value from the repository.
- Package each ID and hash value pair into a PushItems.
- Combine each PushItems into an iterator returned by the getIds() method. Note that getIds() actually returns a CheckpointCloseableIterable, which is an iteration of ApiOperation objects, each object representing an API request performed on a RepositoryDoc, such as pushing the items to the queue.
A PushItems is an ApiOperation request to push an item to the Cloud Search Indexing Queue. Use the PushItems.Builder class to package the IDs and hash values into a single push ApiOperation.
Items are pushed to the Cloud Search Indexing Queue for further processing.
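Gathering the ID and hash value pairs can be sketched as follows. The sketch does not use PushItems or any other SDK class. The class IdHashSketch, its methods, the in-memory repository map, and the choice of SHA-256 as the content hash are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

class IdHashSketch {
  // Returns the SHA-256 digest of the content as a lowercase hex string.
  static String contentHash(String content) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] bytes = digest.digest(content.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : bytes) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide SHA-256.
      throw new IllegalStateException(e);
    }
  }

  // Builds ID and hash pairs in repository order; documents maps item ID to content.
  static Map<String, String> idHashPairs(Map<String, String> documents) {
    Map<String, String> pairs = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : documents.entrySet()) {
      pairs.put(entry.getKey(), contentHash(entry.getValue()));
    }
    return pairs;
  }

  public static void main(String[] args) {
    Map<String, String> documents = new LinkedHashMap<>();
    documents.put("doc1", "first document");
    documents.put("doc2", "second document");
    // In a connector, each pair would be packaged into a push request.
    idHashPairs(documents).forEach((id, hash) -> System.out.println(id + " " + hash));
  }
}
```

Because the hash changes whenever the content changes, comparing a stored hash with a freshly computed one is a cheap way to spot modified items.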
Retrieve and handle each item
Override getDoc() to handle each item in the Cloud Search Indexing Queue. An item can be new, modified, unchanged, or can no longer exist in the source repository. Retrieve and index each item that is new or modified. Remove items from the index that no longer exist in the source repository.
The getDoc() method accepts an Item from the Google Cloud Search Indexing Queue. For each item in the queue, perform these steps in the getDoc() method:
- Check if the item’s ID, within the Cloud Search Indexing Queue, exists in the repository. If not, delete the item from the index.
- Poll the index for item status and, if an item is unchanged (ACCEPTED), don’t do anything.
- Index changed or new items:
  - Set the permissions.
  - Set the metadata for the item that you are indexing.
  - Combine the metadata and item into one indexable RepositoryDoc.
  - Return the RepositoryDoc.
Note: The ListingConnector template doesn't support returning null from the getDoc() method. Returning null results in a NullPointerException.
Handle deleted items
To handle a deleted item, determine whether the item exists in the repository and, if not, delete it. In the sample, documents is a data structure representing the repository. If documentID is not found in documents, return ApiOperations.deleteItem(resourceName) to delete the item from the index.
Handle unchanged items
To handle an unchanged item, poll the item status in the Cloud Search Indexing Queue.
To determine if the item is unmodified, check the status of the item as well as other metadata that may indicate a change. In the example, the metadata hash is used to determine if the item has been changed.
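The three outcomes for a queued item (delete, leave alone, or index) can be summarized in a small decision function. This is a sketch of the logic only, not SDK code. The class ItemDecisionSketch, the Action enum, and the parameters are hypothetical, and the accepted flag stands in for an ACCEPTED item status.

```java
import java.util.HashMap;
import java.util.Map;

class ItemDecisionSketch {
  enum Action { DELETE, SKIP, INDEX }

  // repositoryHashes maps each item ID still in the repository to its current hash.
  static Action decide(
      String itemId, boolean accepted, String queuedHash, Map<String, String> repositoryHashes) {
    String currentHash = repositoryHashes.get(itemId);
    if (currentHash == null) {
      return Action.DELETE; // The item no longer exists in the repository.
    }
    if (accepted && currentHash.equals(queuedHash)) {
      return Action.SKIP; // Status and hash both indicate the item is unchanged.
    }
    return Action.INDEX; // The item is new or modified.
  }

  public static void main(String[] args) {
    Map<String, String> repositoryHashes = new HashMap<>();
    repositoryHashes.put("doc1", "hash-a");
    repositoryHashes.put("doc2", "hash-b2");

    System.out.println(decide("doc1", true, "hash-a", repositoryHashes));
    System.out.println(decide("doc2", true, "hash-b1", repositoryHashes));
    System.out.println(decide("doc3", true, "hash-c", repositoryHashes));
  }
}
```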
Set the permissions for an item
Your repository uses an Access Control List (ACL) to identify the users or groups that have access to an item. An ACL is a list of IDs for groups or users who can access the item.
You must duplicate the ACL used by your repository to ensure only those users with access to an item can see that item within a search result. The ACL for an item must be included when indexing an item so that Google Cloud Search has the information it needs to provide the correct level of access to the item.
The Content Connector SDK provides a rich set of ACL classes and methods to model the ACLs of most repositories. You must analyze the ACL for each item in your repository and create a corresponding ACL for Google Cloud Search when you index an item. If your repository’s ACL employs concepts such as ACL inheritance, modeling that ACL can be tricky. For further information on Google Cloud Search ACLs, refer to Google Cloud Search ACLs.
Note: The Cloud Search Indexing API supports single-domain ACLs. It does not support cross-domain ACLs.
Use the Acl.Builder class to set access to each item using an ACL. The full traversal sample allows all users or “principals” (getCustomerPrincipal()) to be “readers” of all items (.setReaders()) when performing a search.
You need to understand ACLs to properly model ACLs for the repository. For example, you might be indexing files within a file system that uses some sort of inheritance model whereby child folders inherit permissions from parent folders. Modeling ACL inheritance requires additional information covered in Google Cloud Search ACLs.
Set the metadata for an item
Metadata is stored in an Item object. To create an Item, you need a minimum of a unique string ID, item type, ACL, URL, and version for the item. You can build an Item using the IndexingItemBuilder helper class.
Create an indexable item
Once you have set the metadata for the item, you can create the actual indexable item using the RepositoryDoc.Builder class. A RepositoryDoc is a type of ApiOperation that performs the actual IndexingService.indexItem() request.
You can also use the setRequestMode() method of the RepositoryDoc.Builder class to identify the indexing request as ASYNCHRONOUS or SYNCHRONOUS:
ASYNCHRONOUS
- Asynchronous mode results in longer indexing-to-serving latency and accommodates large throughput quota for indexing requests. Asynchronous mode is recommended for initial indexing (backfill) of the entire repository.
SYNCHRONOUS
- Synchronous mode results in shorter indexing-to-serving latency and accommodates limited throughput quota. Synchronous mode is recommended for indexing of updates and changes to the repository. If unspecified, the request mode defaults to SYNCHRONOUS.
Next Steps
Here are a few next steps you might take:
- (optional) Implement the close() method to release any resources before shutdown.
- (optional) Create an identity connector using the Content Connector SDK.
Create a graph traversal connector using a template class
The Cloud Search Indexing Queue is used to hold IDs and optional hash values for each item in the repository. A graph traversal connector pushes item IDs to the Google Cloud Search Indexing Queue and retrieves them one at a time for indexing. Google Cloud Search maintains queues and compares queue contents to determine item status, such as whether an item has been deleted from the repository. For further information on the Cloud Search Indexing Queue, refer to The Google Cloud Search Indexing Queue.
During indexing, the item content is fetched from the data repository and any child item IDs are pushed to the queue. The connector proceeds recursively, processing parent and child IDs until all items are handled.
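The parent-then-children flow can be pictured as a breadth-first walk driven by a queue. The sketch below uses an in-memory queue and a map of child IDs in place of the Cloud Search Indexing Queue and a real repository. The class GraphTraversalSketch and its traverse method are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class GraphTraversalSketch {
  // Visits the root, then queues and visits child IDs until none remain.
  // Returns IDs in the order they would be handed off for indexing.
  static List<String> traverse(String rootId, Map<String, List<String>> childrenOf) {
    List<String> handled = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    Deque<String> queue = new ArrayDeque<>();
    queue.add(rootId);
    seen.add(rootId);
    while (!queue.isEmpty()) {
      String id = queue.remove();
      handled.add(id); // Stand-in for fetching content and indexing the item.
      for (String child : childrenOf.getOrDefault(id, Collections.<String>emptyList())) {
        if (seen.add(child)) { // Queue each child only once.
          queue.add(child);
        }
      }
    }
    return handled;
  }

  public static void main(String[] args) {
    Map<String, List<String>> childrenOf = new HashMap<>();
    childrenOf.put("root", Arrays.asList("folderA", "folderB"));
    childrenOf.put("folderA", Arrays.asList("file1"));
    System.out.println(traverse("root", childrenOf));
  }
}
```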
This section of the docs refers to code snippets from the GraphTraversalSample example.
Implement the connector’s entry point
The entry point to a connector is the main() method. This method’s primary task is to create an instance of the Application class and invoke its start() method to run the connector.
Before calling application.start(), use the IndexingApplication.Builder class to instantiate the ListingConnector template. The ListingConnector accepts a Repository object whose methods you implement. The GraphTraversalSample example shows how to instantiate the ListingConnector and its associated Repository.
Behind the scenes, the SDK calls the initConfig() method after your connector’s main() method calls Application.build. The initConfig() method:
- Calls the Configuration.isInitialized() method to ensure that the Configuration hasn’t been initialized.
- Initializes a Configuration object with the Google-supplied key-value pairs. Each key-value pair is stored in a ConfigValue object within the Configuration object.
Implement the Repository interface
The sole purpose of the Repository object is to perform the traversal and indexing of repository items. When using a template, you need only override certain methods within the Repository interface to create a content connector. The methods you override depend on the template and traversal strategy you use. For the ListingConnector, override the following methods:
- The init() method. To perform any data repository set-up and initialization, override the init() method.
- The getIds() method. To retrieve IDs and hash values for all records in the repository, override the getIds() method.
- The getDoc() method. To add, update, modify, or delete items in the index, override the getDoc() method.
- (optional) The getChanges() method. If your repository supports change detection, override the getChanges() method. This method is called once for each scheduled incremental traversal (as defined by your configuration) to retrieve modified items and index them.
- (optional) The close() method. If you need to perform repository cleanup, override the close() method. This method is called once during shutdown of the connector.
Each of the methods of the Repository object returns some type of ApiOperation object. An ApiOperation object performs an action in the form of a single, or perhaps multiple, IndexingService.indexItem() calls to perform the actual indexing of your repository.
Get custom configuration parameters
As part of handling your connector’s configuration, you will need to get any custom parameters from the Configuration object. This task is usually performed in a Repository class's init() method.
The Configuration class has several methods for getting different data types from a configuration. Each method returns a ConfigValue object. You will then use the ConfigValue object’s get() method to retrieve the actual value. The FullTraversalSample example shows how to retrieve a single custom integer value from a Configuration object.
To get and parse a parameter containing several values, use one of the Configuration class's type parsers to parse the data into discrete chunks. The tutorial connector uses the getMultiValue method to get a list of GitHub repository names.
Perform the graph traversal
Override the getIds() method to retrieve IDs and hash values for all records in the repository. The getIds() method accepts a checkpoint. The checkpoint is used to resume indexing at a specific item should the process be interrupted.
Next, override the getDoc() method to handle each item in the Cloud Search Indexing Queue.
Push item IDs and hash values
Override getIds() to fetch the item IDs and their associated content hash values from the repository. ID and hash value pairs are then packaged into a push operation request to the Cloud Search Indexing Queue. Root or parent IDs are typically pushed first, followed by child IDs, until the entire hierarchy of items has been processed.
The getIds() method accepts a checkpoint representing the last item to be indexed. The checkpoint can be used to resume indexing at a specific item should the process be interrupted. For each item in your repository, perform these steps in the getIds() method:
- Get each item ID and associated hash value from the repository.
- Package each ID and hash value pair into a PushItems.
- Combine each PushItems into an iterator returned by the getIds() method. Note that getIds() actually returns a CheckpointCloseableIterable, which is an iteration of ApiOperation objects, each object representing an API request performed on a RepositoryDoc, such as pushing the items to the queue.
A PushItems is an ApiOperation request to push an item to the Cloud Search Indexing Queue. Use the PushItems.Builder class to package the IDs and hash values into a single push ApiOperation.
Items are pushed to the Cloud Search Indexing Queue for further processing.
Retrieve and handle each item
Override getDoc() to handle each item in the Cloud Search Indexing Queue. An item can be new, modified, unchanged, or can no longer exist in the source repository. Retrieve and index each item that is new or modified. Remove items from the index that no longer exist in the source repository.
The getDoc() method accepts an Item from the Cloud Search Indexing Queue. For each item in the queue, perform these steps in the getDoc() method:
- Check if the item’s ID, within the Cloud Search Indexing Queue, exists in the repository. If not, delete the item from the index. If the item does exist, continue with the next step.
- Index changed or new items:
  - Set the permissions.
  - Set the metadata for the item that you are indexing.
  - Combine the metadata and item into one indexable RepositoryDoc.
  - Place the child IDs in the Cloud Search Indexing Queue for further processing.
  - Return the RepositoryDoc.
Handle deleted items
To handle a deleted item, determine whether the item exists in the index and, if not, delete it.
Set the permissions for an item
Your repository uses an Access Control List (ACL) to identify the users or groups that have access to an item. An ACL is a list of IDs for groups or users who can access the item.
You must duplicate the ACL used by your repository to ensure only those users with access to an item can see that item within a search result. The ACL for an item must be included when indexing an item so that Google Cloud Search has the information it needs to provide the correct level of access to the item.
The Content Connector SDK provides a rich set of ACL classes and methods to model the ACLs of most repositories. You must analyze the ACL for each item in your repository and create a corresponding ACL for Google Cloud Search when you index an item. If your repository’s ACL employs concepts such as ACL inheritance, modeling that ACL can be tricky. For further information on Google Cloud Search ACLs, refer to Google Cloud Search ACLs.
Note: The Cloud Search Indexing API supports single-domain ACLs. It does not support cross-domain ACLs.
Use the Acl.Builder class to set access to each item using an ACL. The following code snippet, taken from the full traversal sample, allows all users or “principals” (getCustomerPrincipal()) to be “readers” of all items (.setReaders()) when performing a search.
You need to understand ACLs to properly model them for the repository. For example, you might be indexing files within a file system that uses an inheritance model in which child folders inherit permissions from parent folders. Modeling ACL inheritance requires additional information covered in Google Cloud Search ACLs.
Set the metadata for an item
Metadata is stored in an Item object. To create an Item, you need a minimum of a unique string ID, item type, ACL, URL, and version for the item. The following code snippet shows how to build an Item using the IndexingItemBuilder helper class.
Create the indexable item
Once you have set the metadata for the item, you can create the actual indexable item using the RepositoryDoc.Builder. The following example shows how to create a single indexable item.
A RepositoryDoc is a type of ApiOperation that performs the actual IndexingService.indexItem() request.
You can also use the setRequestMode() method of the RepositoryDoc.Builder class to identify the indexing request as ASYNCHRONOUS or SYNCHRONOUS:
ASYNCHRONOUS
- Asynchronous mode results in longer indexing-to-serving latency and accommodates large throughput quota for indexing requests. Asynchronous mode is recommended for initial indexing (backfill) of the entire repository.
SYNCHRONOUS
- Synchronous mode results in shorter indexing-to-serving latency and accommodates limited throughput quota. Synchronous mode is recommended for indexing of updates and changes to the repository. If unspecified, the request mode defaults to SYNCHRONOUS.
Place the child IDs in the Cloud Search Indexing Queue
The following code snippet shows how to add the child IDs of the parent item currently being processed to the queue. These IDs are processed after the parent item is indexed.
Next Steps
Here are a few next steps you might take:
- (optional) Implement the close() method to release any resources before shutdown.
- (optional) Create an identity connector using the Identity Connector SDK.
Create a content connector using the REST API
The following sections explain how to create a content connector using the REST API.
Determine your traversal strategy
The primary function of a content connector is to traverse a repository and index its data. You must implement a traversal strategy based on the size and layout of data in your repository. Following are three common traversal strategies:
- Full traversal strategy
A full traversal strategy scans the entire repository and blindly indexes every item. This strategy is commonly used when you have a small repository and can afford the overhead of doing a full traversal every time you index.
This traversal strategy is suitable for small repositories with mostly static, non-hierarchical data. You might also use this traversal strategy when change detection is difficult or not supported by the repository.
- List traversal strategy
A list traversal strategy scans the entire repository, including all child nodes, determining the status of each item. Then, the connector takes a second pass and only indexes items that are new or have been updated since the last indexing. This strategy is commonly used to perform incremental updates to an existing index (instead of having to do a full traversal every time you update the index).
This traversal strategy is suitable when change detection is difficult or not supported by the repository, you have non-hierarchical data, and you are working with very large data sets.
- Graph traversal strategy
A graph traversal strategy scans the entire parent node, determining the status of each item. Then, the connector takes a second pass and only indexes items in the root node that are new or have been updated since the last indexing. Finally, the connector passes on any child IDs, then indexes items in the child nodes that are new or have been updated. The connector continues recursively through all child nodes until all items have been addressed. Such traversal is typically used for hierarchical repositories where listing all IDs isn't practical.
This strategy is suitable if you have hierarchical data that needs to be crawled, such as a series of directories or web pages.
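To make the graph traversal strategy more concrete, here is a minimal sketch that walks a hierarchy breadth-first with an in-memory queue, standing in for the way a connector hands child IDs on for later processing. GraphTraversalSketch and traverse are hypothetical names, and the repository is assumed to be representable as a map from each item ID to its child IDs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a graph traversal over a hierarchical repository. */
final class GraphTraversalSketch {

  /**
   * Visits the root and then every descendant, parents before children.
   *
   * @param rootId ID of the root node
   * @param childrenById assumed repository layout: item ID to list of child IDs
   * @return item IDs in the order they were visited
   */
  static List<String> traverse(String rootId, Map<String, List<String>> childrenById) {
    List<String> visitOrder = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    Deque<String> queue = new ArrayDeque<>();
    queue.add(rootId);
    seen.add(rootId);
    while (!queue.isEmpty()) {
      String id = queue.remove();
      // A real connector would check the item's status here and index it if new or updated.
      visitOrder.add(id);
      for (String child : childrenById.getOrDefault(id, List.of())) {
        // Queue each child once, which also guards against cycles in the data.
        if (seen.add(child)) {
          queue.add(child);
        }
      }
    }
    return visitOrder;
  }
}
```

Only the IDs reachable from the root are ever listed, which is why this approach fits repositories where enumerating every ID up front isn't practical.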
Implement your traversal strategy and index items
Every indexable element for Cloud Search is referred to as an item in the Cloud Search API. An item might be a file, folder, a line in a CSV file, or a database record.
Once your schema is registered, you can populate the index by:
- (optional) Using items.upload to upload files larger than 100KiB for indexing. For smaller files, embed the content as inlineContent using items.index.
- (optional) Using media.upload to upload media files for indexing.
- Using items.index to index the item. For example, if your schema uses the object definition in the movie schema, an indexing request for a single item would look like this:
{
  "name": "datasources/<data_source_id>/items/titanic",
  "acl": {
    "readers": [
      {
        "gsuitePrincipal": {
          "gsuiteDomain": true
        }
      }
    ]
  },
  "metadata": {
    "title": "Titanic",
    "viewUrl": "http://www.imdb.com/title/tt2234155/?ref_=nv_sr_1",
    "objectType": "movie"
  },
  "structuredData": {
    "object": {
      "properties": [
        { "name": "movieTitle", "textValues": { "values": [ "Titanic" ] } },
        { "name": "releaseDate", "dateValues": { "values": [ { "year": 1997, "month": 12, "day": 19 } ] } },
        { "name": "actorName", "textValues": { "values": [ "Leonardo DiCaprio", "Kate Winslet", "Billy Zane" ] } },
        { "name": "genre", "enumValues": { "values": [ "Drama", "Action" ] } },
        { "name": "userRating", "integerValues": { "values": [ 8 ] } },
        { "name": "mpaaRating", "textValues": { "values": [ "PG-13" ] } },
        { "name": "duration", "textValues": { "values": [ "3 h 14 min" ] } }
      ]
    }
  },
  "content": {
    "inlineContent": "A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.M.S. Titanic.",
    "contentFormat": "TEXT"
  },
  "version": "01",
  "itemType": "CONTENT_ITEM"
}
- (optional) Using items.get calls to verify an item has been indexed.
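The steps above can be sketched in a few helper methods. This is a minimal illustration, not a complete client: IndexRequestSketch and its methods are hypothetical names, authentication and the HTTP call itself are left out, and the endpoint shape (POST to .../v1/indexing/{item name}:index with the item wrapped in an "item" field) reflects my understanding of the v1 REST reference and should be checked against it. Treating exactly 100 KiB as small enough to inline is also an assumption.

```java
import java.net.URI;

/** Hypothetical helpers for preparing an items.index call; no request is sent. */
final class IndexRequestSketch {

  /** Size limit from the text above: larger files go through items.upload. */
  static final long INLINE_LIMIT_BYTES = 100L * 1024L;

  /** Returns true when the content is too large to embed as inlineContent. */
  static boolean shouldUpload(long contentSizeBytes) {
    return contentSizeBytes > INLINE_LIMIT_BYTES;
  }

  /**
   * Builds the assumed items.index URL for an item name such as
   * "datasources/my-source/items/titanic".
   */
  static URI indexUri(String itemName) {
    return URI.create("https://cloudsearch.googleapis.com/v1/indexing/" + itemName + ":index");
  }

  /** Wraps an already-serialized item (JSON text) in the assumed request envelope. */
  static String requestBody(String itemJson) {
    return "{\"item\": " + itemJson + "}";
  }
}
```

A caller would serialize an item like the movie example, pass it to requestBody(), and POST the result to indexUri() with an OAuth 2.0 access token.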
To perform a full traversal, you would periodically reindex the entire repository. To perform a list or graph traversal, you need to implement code to handle repository changes.
Handle repository changes
You can periodically gather and index each item from a repository to perform a full indexing. While effective at ensuring your index is up-to-date, a full indexing can be costly when dealing with larger or hierarchical repositories.
Instead of using index calls to index an entire repository every so often, you can use the Google Cloud Indexing Queue to track changes and index only those items that have changed. Use items.push requests to push items into the queue for later polling and updating. For more information on the Google Cloud Indexing Queue, refer to Google Cloud Indexing Queue.
For further information on the Google Cloud Search API, refer to Cloud Search API.