Deploy a database connector

Warning: The Cloud Search reference connectors are provided "as is" as sample code for use in creating your own working connectors. This sample code requires substantial customization and testing before being used in proof-of-concept or production environments. For production use, we highly recommend obtaining help from one of our Cloud Search partners. For further help finding a suitable Cloud Search partner, contact your Google Account Manager.

You can set up Google Cloud Search to discover and index data from your organization's databases by using the Google Cloud Search database connector.

Important considerations

You can install and run the Cloud Search database connector in almost any environment where Java apps can run, so long as the connector has access to both the internet and the database.

System requirements

Operating system: Windows or Linux
SQL database: Any SQL database with a JDBC 4.0 or later compliant driver, including the following:
  • MS SQL Server (2008, 2012, 2014, 2016)
  • Oracle (11g, 12c)
  • Google Cloud SQL
  • MySQL
Software: JDBC driver for the connector to use to access the database (downloaded and installed separately)

Deploy the connector

The following steps describe how to install the connector and configure it to index the specified databases and return the results to Cloud Search users.

Prerequisites

Before you deploy the Cloud Search database connector, gather the following information:

Step 1. Download and build the database connector software

  1. Clone the connector repository from GitHub.
    $ git clone https://github.com/google-cloudsearch/database-connector.git
    $ cd database-connector
  2. Check out the desired version of the connector:
    $ git checkout tags/v1-0.0.3
  3. Build the connector.
    $ mvn package
    To skip the tests when building the connector, use mvn package -DskipTests.
  4. Copy the connector zip file to your local installation directory and unzip it:
    $ cp target/google-cloudsearch-database-connector-v1-0.0.3.zip installation-dir
    $ cd installation-dir
    $ unzip google-cloudsearch-database-connector-v1-0.0.3.zip
    $ cd google-cloudsearch-database-connector-v1-0.0.3

Step 2. Configure the database connector

  1. Create a text file and name it connector-config.properties (the default) or similar. Google recommends that you name configuration files with the .properties or .config extension and keep the file in the same directory as the connector. If you use a different name or path, you must specify the path when you run the connector.
  2. Add parameters as key/value pairs to the file contents. The configuration file must specify the parameters for data source access, database access, a database full traversal SQL statement, a content field title, and column definitions. You can also configure other connector behavior with optional parameters. For example:
    # Required parameters for data source access
    api.sourceId=1234567890abcdef
    api.identitySourceId=0987654321lmnopq
    api.serviceAccountPrivateKeyFile=./PrivateKey.json
    #
    # Required parameters for database access
    db.url=jdbc:mysql://localhost:3306/mysql_test
    db.user=root
    db.password=passw0rd
    #
    # Required full traversal SQL statement parameter
    db.allRecordsSql=select customer_id, first_name, last_name, phone, change_timestamp from address_book
    #
    # Required parameters for column definitions and URL format
    db.allColumns=customer_id, first_name, last_name, phone, change_timestamp
    db.uniqueKeyColumns=customer_id
    url.columns=customer_id
    #
    # Required content field parameter
    contentTemplate.db.title=customer_id
    #
    # Optional parameters to set ACLs to "entire domain" access
    defaultAcl.mode=fallback
    defaultAcl.public=true
    #
    # Optional parameters for schedule traversals
    schedule.traversalIntervalSecs=36000
    schedule.performTraversalOnStart=true
    schedule.incrementalTraversalIntervalSecs=3600

    For detailed descriptions of the database-specific parameters, go to the Configuration parameters reference at the end of this article.

    To learn about the parameters that are common to all Cloud Search connectors, such as metadata configuration, datetime formats, and ACL options, go to Google-supplied connector parameters.

    If applicable, map database columns to property definitions in your data source schema by adding aliases to the traversal SQL statements. For example, if you have a movie database and the data source schema contains a property definition named "ActorName", a SQL statement could have the form: SELECT …, last_name AS ActorName, … FROM … .
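    As a sketch, the alias mapping could appear in the configuration as follows (the movie database, its columns, and the ActorName property are hypothetical):

    ```properties
    # Hypothetical schema mapping: the data source schema defines a property
    # named "ActorName"; the alias maps the last_name column onto it.
    db.allRecordsSql=SELECT movie_id, title, last_name AS ActorName FROM movies
    ```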

Step 3. Run the database connector

The following example assumes the required components are located in the local directory on a Linux system.

To run the connector from the command line, enter the following command:

java \
   -cp "google-cloudsearch-database-connector-v1-0.0.3.jar:mysql-connector-java-5.1.41-bin.jar" \
   com.google.enterprise.cloudsearch.database.DatabaseFullTraversalConnector \
   [-Dconfig=mysql.config]

Where:

  • google-cloudsearch-database-connector-v1-0.0.3.jar is the database connector .jar file
  • mysql-connector-java-5.1.41-bin.jar is the JDBC driver being used to access the database
  • mysql.config is a custom-named configuration file. To ensure the connector recognizes your configuration file, specify its path on the command line. Otherwise, the connector uses connector-config.properties in your local directory as the default filename.
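The command above can also be wrapped in a small launcher script. The following is a minimal sketch, not part of the connector distribution; the jar names match the versions used in this article, so substitute your own, and set DRY_RUN=1 to print the command instead of executing it:

```shell
# Hypothetical launcher sketch for the database connector.
start_connector() {
  config_file="${1:-connector-config.properties}"
  if [ ! -f "$config_file" ]; then
    echo "Config file not found: $config_file" >&2
    return 1
  fi
  # ${DRY_RUN:+echo} expands to "echo" when DRY_RUN is set, so the java
  # command is printed rather than executed.
  ${DRY_RUN:+echo} java \
    -cp "google-cloudsearch-database-connector-v1-0.0.3.jar:mysql-connector-java-5.1.41-bin.jar" \
    com.google.enterprise.cloudsearch.database.DatabaseFullTraversalConnector \
    -Dconfig="$config_file"
}
```

Checking for the configuration file up front gives a clearer failure message than letting the connector fail during initialization.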

The connector reports configuration errors as it detects them. Some errors are reported when the connector initializes, such as when a database column is defined as part of the record content (in db.allColumns), but the column isn't used in the traversal SQL query of the database (in db.allRecordsSql). Other errors are only detected and reported when the connector attempts to access the database for the first traversal, such as invalid SQL statement syntax.
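For example, the following (intentionally inconsistent) fragment triggers an error at initialization, because fax_number is declared in db.allColumns but never selected by db.allRecordsSql; the column and table names here are hypothetical:

```properties
# Broken on purpose: fax_number is missing from the traversal query.
db.allColumns=customer_id, first_name, fax_number
db.allRecordsSql=select customer_id, first_name from address_book
```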

Configuration parameters reference

Data source access parameters

Setting Parameter
Data source ID api.sourceId = source-ID

Required. The Cloud Search source ID that the Google Workspace administrator set up.

Identity source ID api.identitySourceId = identity-source-ID

Required to use external users and groups for ACLs. The Cloud Search identity source ID that the Google Workspace administrator set up.

Service account api.serviceAccountPrivateKeyFile = path-to-private-key

Required. The path to the Cloud Search service account key file that the Google Workspace administrator created.

Database access parameters

Setting Parameter
Database URL db.url = database-URL

Required. The full path of the database to be accessed, such as jdbc:mysql://127.0.0.1/dbname.

Database username and password db.user = username
db.password = password

Required. A valid username and password that the connector uses to access the database. This database user must have read access to the relevant records of the database being read.

JDBC driver db.driverClass = oracle.jdbc.OracleDriver

Required only if the JDBC 4.0 driver is not already specified in the class path.

Traversal SQL query parameters

The connector traverses database records with SQL SELECT queries in the configuration file. You must configure a full traversal query; queries for incremental traversals are optional.

A full traversal reads every database record configured for indexing. A full traversal is required to index new records for Cloud Search and also to re-index all existing records.

An incremental traversal reads and re-indexes only newly modified database records and recent entries to the database. Incremental traversals can be more efficient than full traversals. To use incremental traversals, your database must contain timestamp fields to indicate modified records.

The connector executes these traversals according to the schedules you define in traversal schedule parameters.

Setting Parameter
Full traversal query db.allRecordsSql = SELECT column-1[, column-2,...] FROM database-name

Required. The query run for every full traversal.

Every column name that the connector will use in any capacity (content, unique ID, ACLs) must be present in this query. The connector performs some preliminary verifications at startup to detect errors and omissions. For this reason, do not use a general "SELECT * FROM …" query.

Full traversal pagination db.allRecordsSql.pagination = {none | offset}

Value can be:

  • none: do not use pagination
  • offset: use pagination by row offset

    To use pagination by offset, the SQL query must have a placeholder question mark (?) for a row offset, starting with zero. In each full traversal, the query is executed repeatedly until no results are returned.

Incremental traversal query db.incrementalUpdateSql = SELECT column-1[, column-2,...] FROM database-name WHERE last_update_time > ?

Required if you schedule incremental traversals.

The "?" in the query is a mandatory placeholder for a timestamp value. The connector uses the timestamp to track modifications between incremental traversal SQL queries.

To have the connector use a database timestamp column for the last update time, add the timestamp_column alias to the SQL statement; otherwise, the connector uses the current timestamp of the traversal.

For the first incremental traversal, the connector uses the start time of the connector. After the first incremental traversal, Cloud Search stores the timestamp so that connector restarts are able to access the previous incremental traversal timestamp.

Database time zone db.timestamp.timezone = America/Los_Angeles

Specifies the time zone to use for database timestamps. The database timestamp is used to identify new record additions and newly modified database records. The default is the local time zone where the connector is running.

Traversal SQL query examples

  • Basic full traversal query that reads every record of interest in an employee database for indexing:
    db.allRecordsSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field \
        FROM employee
  • Specify pagination by offset, and break up a full traversal into multiple queries.

    For SQL Server 2012 or Oracle 12c (standard SQL 2008 syntax):

    db.allRecordsSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field \
        FROM employee \
        ORDER BY customer_id OFFSET ? ROWS FETCH FIRST 1000 ROWS ONLY
    db.allRecordsSql.pagination = offset

    or, for MySQL or Google Cloud SQL:

    db.allRecordsSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field \
        FROM employee \
        ORDER BY customer_id LIMIT 1000 OFFSET ?
    db.allRecordsSql.pagination = offset
  • Full traversal query that applies individual ACLs with aliases:
    db.allRecordsSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field, last_update_time, \
         permitted_readers AS readers_users, \
         denied_readers AS denied_users, \
         permitted_groups AS readers_groups, \
         denied_groups AS denied_groups \
         FROM employee
  • Basic incremental traversal query:
    db.incrementalUpdateSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field, last_update_time \
         FROM employee \
         WHERE last_update_time > ?
  • Incremental traversal query that applies individual ACLs with aliases:
    db.incrementalUpdateSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field, last_update_time, \
         permitted_readers AS readers_users, \
         denied_readers AS denied_users, \
         permitted_groups AS readers_groups, \
         denied_groups AS denied_groups \
         FROM employee \
         WHERE last_update_time > ?
  • Incremental traversal query that uses the database timestamp rather than the current time:
    db.incrementalUpdateSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field, \
         last_update_time AS timestamp_column \
         FROM employee \
         WHERE last_update_time > ?

Column definition parameters

The following parameters specify the columns that you use in the traversal statements and to uniquely identify each record.

Setting Parameter
All columns db.allColumns = column-1, column-2, ...column-N

Required. Identifies all the columns that are required in a SQL query when accessing the database. The columns defined with this parameter must be explicitly referenced in the queries. Every other column definition parameter is compared against this set of columns.

Example:

db.allColumns = customer_id, first_name, last_name, phone, change_timestamp
Unique key columns db.uniqueKeyColumns = column-1[, column-2]

Required. Lists either a single database column that contains unique values or a combination of columns whose values together define a unique ID.

Cloud Search requires every searchable document to have a unique identifier within a data source. You must be able to define a unique ID for each database record from column values. If you run multiple connectors on separate databases but index into a common dataset, make sure that you specify a unique ID across all documents.

Examples:

db.uniqueKeyColumns = customer_id
# or
db.uniqueKeyColumns = last_name, first_name
URL link column url.columns = column-1[, column-2]

Required. Specifies one or more valid column names used to build the URL for a clickable search result. For databases that have no relevant URL associated with each record, a static link can be used for every record.

However, if the column values define a valid link for each record, specify the view URL columns and format configuration values.

URL format url.format = https://www.example.com/{0}

Defines the format of the view URL. Numbered placeholders refer to the columns specified in url.columns, in order, starting with zero.

If not specified, the default is "{0}".

Examples follow this table.

Percent-encoded columns for URL url.columnsToEscape = column-1[, column-2]

Specifies columns from url.columns whose values will be percent-encoded before they are included in the formatted URL string.

URL column examples

To specify the columns used in traversal queries and the format of the view URL:

  • To use a static URL not using any database record values:
    url.format = https://www.example.com
  • To use a single column value that is the view URL:
    url.format = {0}
    url.columns = customer_id
  • To use a single column value that is substituted into the view URL at position {0}:
    url.format = https://www.example.com/customer/id={0}
    url.columns = customer_id
    url.columnsToEscape = customer_id
  • To use multiple column values to build the view URL (columns are order-dependent):
    url.format = {1}/customer={0}
    url.columns = customer_id, linked_url
    url.columnsToEscape = customer_id

Content fields

Use the content options to define which record values should be made part of the searchable content.

Setting Parameter
Highest-quality search column contentTemplate.db.title = column-name

Required. The highest-quality column for search indexing and result prioritization.

Column prioritization for search contentTemplate.db.quality.high = column-1[, column-2...]
contentTemplate.db.quality.medium = column-1[, column-2...]
contentTemplate.db.quality.low = column-1[, column-2...]

Designate content columns (except the column set for contentTemplate.db.title) as high, medium, or low search quality fields. Unspecified columns default to low.

Content data columns db.contentColumns = column-1[, column-2...]

Specify content columns in the database. These are formatted and uploaded to Cloud Search as searchable document content.

If you don't specify a value, the default is "*", indicating that all columns should be used for content.

Blob column db.blobColumn = column-name

Specify the name of a single blob column to use for document content instead of a combination of content columns.

If a blob column is specified, also defining content columns is an error. However, metadata and structured data column definitions are still allowed along with blob columns.
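
Putting the content parameters together, here is a sketch of a content-field configuration for the address-book example used earlier in this article (column names taken from that example; the quality-tier assignments are illustrative):

```properties
# Title column drives indexing and result prioritization.
contentTemplate.db.title=customer_id
# Remaining columns get explicit search quality tiers; unspecified
# columns default to low.
contentTemplate.db.quality.high=first_name, last_name
contentTemplate.db.quality.low=phone
# Restrict uploaded content to these columns (default "*" is all columns).
db.contentColumns=customer_id, first_name, last_name, phone
```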