Adding Source/Tenant to Existing Collector

When to Add vs Create New

Add to existing collector when:

  • Same data source (API/Scrapy/JSON)
  • Same data structure
  • Different tenant/keywords/configuration

Create new collector when:

  • Different data source API
  • Different data structure
  • Requires custom processing logic
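
The two lists above amount to a simple decision rule. A minimal sketch, assuming a hypothetical `SourceCandidate` description (its field names are illustrative and do not come from the data-lake codebase):

```typescript
// Hypothetical description of a candidate source; field names are illustrative.
interface SourceCandidate {
  sameDataSource: boolean;        // same API / Scrapy spider / JSON feed as an existing collector
  sameDataStructure: boolean;     // records have the same shape as the existing collector's output
  needsCustomProcessing: boolean; // requires processing logic the existing collector lacks
}

type Decision = 'add-to-existing' | 'create-new';

// Reuse an existing collector only when source and structure match and no
// custom processing is required; otherwise create a new collector.
function decideCollector(c: SourceCandidate): Decision {
  const reusable =
    c.sameDataSource && c.sameDataStructure && !c.needsCustomProcessing;
  return reusable ? 'add-to-existing' : 'create-new';
}

// Example: a new tenant on the same API with the same record shape.
const newTenant: Decision = decideCollector({
  sameDataSource: true,
  sameDataStructure: true,
  needsCustomProcessing: false,
});
```

Here `newTenant` evaluates to `'add-to-existing'`; flipping any one of the three answers yields `'create-new'`.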

Step-by-Step Process

1. Identify Collector Type

API Integrations:

  • RedditApi - Reddit API
  • StackExchangeApi - Stack Exchange API

Scrapy Spiders:

  • AtlassianForums - Scrapy-based scraping

2. Add Instance in Index

File: packages/data-lake/data-lake-infra/src/index.ts

Add new instance after existing ones (see examples below).

3. Configure Parameters

Required:

  • tenantId: Tenant identifier; either a GUID taken from configuration (see the wordpress example) or a textual name such as 'Atlassian'
  • keywords: Search terms; the format depends on the collector (see API-specific)
  • apiKey: API key, read from a Pulumi secret via config.requireSecret('{secretKeyName}')
  • disableIngestion: true for testing, false for production

API-specific:

  • subreddit: Array of subreddit names (RedditApi only)
  • keywords: Array of strings (RedditApi) or comma-separated string (StackExchangeApi)
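
The per-tenant parameters above can be pictured as the following shapes. This is a sketch inferred from this guide only: the interface names are hypothetical, the real argument types live in redditApi.ts and stackExchangeApi.ts and may differ, and secrets such as apiKey are left out because they come from Pulumi config.

```typescript
// Hypothetical shapes for the tenant-specific parameters described above.
interface TenantParams {
  tenantId: string;          // GUID from configuration or a textual name
  disableIngestion: boolean; // true for testing, false for production
}

interface RedditTenantParams extends TenantParams {
  keywords: string[];  // array of search terms
  subreddit: string[]; // RedditApi only
}

interface StackExchangeTenantParams extends TenantParams {
  keywords: string; // comma-separated search terms
}

// 'example-tenant' is a made-up tenant used only for illustration.
const redditTenant: RedditTenantParams = {
  tenantId: 'example-tenant',
  keywords: ['keyword1', 'keyword2'],
  subreddit: ['subreddit1', 'subreddit2'],
  disableIngestion: true,
};

const stackExchangeTenant: StackExchangeTenantParams = {
  tenantId: 'example-tenant',
  keywords: 'keyword1,keyword2',
  disableIngestion: true,
};
```

Typing the two keyword formats separately lets the compiler catch an array passed where a comma-separated string is expected, and the reverse.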

4. Set Pulumi Secrets if needed

cd packages/data-lake/data-lake-infra
# Run once per stack, replacing {stack} with dev or prod
pulumi config set --stack {stack} --secret {secretKeyName} {value}

5. Deploy

NX_TUI=false pnpm nx pulumi data-lake-infra -- preview --stack dev
NX_TUI=false pnpm nx pulumi data-lake-infra -- up --stack dev

Examples by Type

RedditApi Example

Location: packages/data-lake/data-lake-infra/src/index.ts:315-339

new RedditApi('reddit-{tenant-name}', {
  dataArchiveEmrAppInstance: dataArchiveEmrAppInstance,
  sparkJobLocationPrefix: dataArchiveSparkJobLocationPrefix,
  alarmSnsTopicArn: alarmSnsTopicArn,
  athenaCatalogName: athenaCatalog.name,
  env: env,
  vpcId: vpcId,
  ecsClusterArn: ecsClusterArn,
  scraperImage: dataLakeCollectorImage.imageName,
  apiKey: config.requireSecret('reddit_api_key'),
  redditClientId: config.requireSecret('reddit_client_id'),
  ecsSubnets: internalSubnetIds,
  bronzeDataLakeDB: bronzeDataLakeDB,
  silverDataLakeDB: silverDataLakeDB,
  keywords: ['keyword1', 'keyword2'], // Array of keywords
  subreddit: ['subreddit1', 'subreddit2'], // Array of subreddits
  tenantId: '{tenant-name}',
  disableIngestion: false, // Set to true for testing
});
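
When several tenants share the same infrastructure arguments, only the instance name, keywords, subreddit, tenantId, and disableIngestion change between instances. A minimal sketch of that split, assuming a hypothetical helper `redditInstanceFor` and made-up tenants; it only builds plain objects and does not reflect how index.ts is actually organized:

```typescript
// Per-tenant settings that vary between RedditApi instances.
interface RedditTenant {
  name: string;              // used in the instance name and as tenantId
  keywords: string[];
  subreddit: string[];
  disableIngestion: boolean;
}

// Combine the shared infrastructure arguments with one tenant's settings.
// `Shared` stands in for the common arguments (vpcId, ecsClusterArn, ...).
function redditInstanceFor<Shared extends object>(
  shared: Shared,
  tenant: RedditTenant,
) {
  return {
    name: `reddit-${tenant.name}`,
    args: {
      ...shared,
      keywords: tenant.keywords,
      subreddit: tenant.subreddit,
      tenantId: tenant.name,
      disableIngestion: tenant.disableIngestion,
    },
  };
}

// Hypothetical tenants and placeholder shared values, for illustration only.
const sharedArgs = { vpcId: 'vpc-placeholder', env: 'dev' };
const tenants: RedditTenant[] = [
  { name: 'tenant-a', keywords: ['keyword1'], subreddit: ['subreddit1'], disableIngestion: true },
  { name: 'tenant-b', keywords: ['keyword2'], subreddit: ['subreddit2'], disableIngestion: true },
];
const instances = tenants.map((t) => redditInstanceFor(sharedArgs, t));
```

Each entry in `instances` carries a `name` and `args` pair that could be passed to `new RedditApi(name, args)` once the real shared arguments and secrets are supplied.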

For a Zendesk example, see packages/data-lake/data-lake-infra/src/index.ts:376-393.

StackExchangeApi Example

Location: packages/data-lake/data-lake-infra/src/index.ts:289-311

new StackExchangeApi('stack-exchange-{tenant-name}', {
  dataArchiveEmrAppInstance: dataArchiveEmrAppInstance,
  sparkJobLocationPrefix: dataArchiveSparkJobLocationPrefix,
  alarmSnsTopicArn: alarmSnsTopicArn,
  athenaCatalogName: athenaCatalog.name,
  env: env,
  vpcId: vpcId,
  ecsClusterArn: ecsClusterArn,
  scraperImage: dataLakeCollectorImage.imageName,
  apiKey: config.requireSecret('stack_exchange_api_key'),
  ecsSubnets: internalSubnetIds,
  bronzeDataLakeDB: bronzeDataLakeDB,
  silverDataLakeDB: silverDataLakeDB,
  keywords: 'keyword1,keyword2,keyword3', // Comma-separated string
  tenantId: '{tenant-name}',
  disableIngestion: false,
});

AtlassianForums Example (Scrapy)

Location: packages/data-lake/data-lake-infra/src/index.ts:163-177

new AtlassianForums('atlassian-forums', {
  dataArchiveEmrAppInstance: dataArchiveEmrAppInstance,
  sparkJobLocationPrefix: dataArchiveSparkJobLocationPrefix,
  alarmSnsTopicArn: alarmSnsTopicArn,
  athenaCatalogName: athenaCatalog.name,
  env: env,
  vpcId: vpcId,
  ecsClusterArn: ecsClusterArn,
  scraperImage: dataLakeCollectorImage.imageName,
  zyteApiKey: config.requireSecret('zyteApiKey'),
  ecsSubnets: internalSubnetIds,
  bronzeDataLakeDB: bronzeDataLakeDB,
  silverDataLakeDB: silverDataLakeDB,
  disableIngestion: true,
});

Note: Scrapy collectors currently do not take keywords or tenantId; that configuration lives in the spider code.

Configuration Checklist

  • Instance name: {collector-type}-{tenant-name}
  • tenantId: Set to tenant identifier
  • keywords: Configure search terms
  • subreddit: Set if using RedditApi
  • API secrets: Set in Pulumi config
  • disableIngestion: Set to false for production
  • Deploy and verify Step Functions execution
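
Most of the checklist above can be checked mechanically before deploying. A minimal sketch, assuming a hypothetical `InstanceConfig` shape and that the instance-name prefixes are `reddit` and `stack-exchange` as in the examples; secrets and Step Functions verification are out of scope here:

```typescript
// Hypothetical summary of one instance's settings; not a type from the codebase.
interface InstanceConfig {
  collectorType: 'reddit' | 'stack-exchange'; // instance-name prefix used in the examples
  instanceName: string;
  tenantId: string;
  keywords: string[] | string; // array (RedditApi) or comma-separated string (StackExchangeApi)
  subreddit?: string[];
  disableIngestion: boolean;
}

// Return a list of checklist violations; an empty list means all checks passed.
function checkInstance(cfg: InstanceConfig, production: boolean): string[] {
  const problems: string[] = [];
  const prefix = `${cfg.collectorType}-`;

  if (!cfg.instanceName.startsWith(prefix) || cfg.instanceName.length <= prefix.length) {
    problems.push(`instance name should look like ${prefix}{tenant-name}`);
  }
  if (cfg.tenantId.trim() === '') {
    problems.push('tenantId is empty');
  }
  const terms = Array.isArray(cfg.keywords) ? cfg.keywords : cfg.keywords.split(',');
  if (terms.filter((k) => k.trim() !== '').length === 0) {
    problems.push('no keywords configured');
  }
  if (cfg.collectorType === 'reddit' && (!cfg.subreddit || cfg.subreddit.length === 0)) {
    problems.push('subreddit is required for RedditApi');
  }
  if (production && cfg.disableIngestion) {
    problems.push('disableIngestion should be false for production');
  }
  return problems;
}
```

Returning every violation rather than failing on the first one makes it easier to fix a new instance in a single pass.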

Configuration Differences

Type             | Keywords Format        | Additional Config | Tenant ID
RedditApi        | Array of strings       | subreddit array   | Yes
StackExchangeApi | Comma-separated string | None              | Yes
AtlassianForums  | N/A (in spider code)   | N/A               | N/A
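
Because the two API collectors expect different keyword formats, one keyword list may need converting when the same tenant is added to both. A minimal sketch with two hypothetical helpers; whether StackExchangeApi tolerates whitespace around commas is not stated here, so the helpers trim it as a precaution:

```typescript
// Convert a RedditApi-style keyword array into a StackExchangeApi-style
// comma-separated string, dropping blank entries.
function toKeywordString(keywords: string[]): string {
  return keywords
    .map((k) => k.trim())
    .filter((k) => k !== '')
    .join(',');
}

// Convert a comma-separated keyword string back into an array.
function toKeywordArray(keywords: string): string[] {
  return keywords
    .split(',')
    .map((k) => k.trim())
    .filter((k) => k !== '');
}

const asString = toKeywordString(['keyword1', ' keyword2 ', '']); // 'keyword1,keyword2'
const asArray = toKeywordArray('keyword1, keyword2,,keyword3');   // ['keyword1', 'keyword2', 'keyword3']
```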

References

  • Infrastructure index: packages/data-lake/data-lake-infra/src/index.ts
  • RedditApi class: packages/data-lake/data-lake-infra/src/redditApi.ts
  • StackExchangeApi class: packages/data-lake/data-lake-infra/src/stackExchangeApi.ts
  • AtlassianForums class: packages/data-lake/data-lake-infra/src/atlassianForums.ts
  • Tenant ID format: [PLACEHOLDER: tenant ID format documentation]
  • Secret key conventions: [PLACEHOLDER: secret key naming conventions]