Adding Source/Tenant to Existing Collector
When to Add vs Create New
Add to existing collector when:
- Same data source (API/Scrapy/JSON)
- Same data structure
- Different tenant/keywords/configuration
Create new collector when:
- Different data source API
- Different data structure
- Requires custom processing logic
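The criteria above can be read as a single rule: reuse a collector only when the source and structure match and no custom processing is needed. A minimal sketch of that rule follows; the `SourceCandidate` type and `decide` function are hypothetical names used for illustration and do not exist in the repository.

```typescript
// Hypothetical description of a candidate source; field names are illustrative.
interface SourceCandidate {
  sameDataSource: boolean;        // same API / Scrapy spider / JSON feed as an existing collector
  sameDataStructure: boolean;     // records have the same shape as the existing collector's output
  needsCustomProcessing: boolean; // requires processing logic the existing collector lacks
}

type Decision = 'add-to-existing' | 'create-new';

// Reuse a collector only when source and structure match and no custom logic is needed.
function decide(candidate: SourceCandidate): Decision {
  const reusable =
    candidate.sameDataSource &&
    candidate.sameDataStructure &&
    !candidate.needsCustomProcessing;
  return reusable ? 'add-to-existing' : 'create-new';
}

// A new tenant with different keywords: same source, same structure.
console.log(
  decide({ sameDataSource: true, sameDataStructure: true, needsCustomProcessing: false }),
);

// A source whose records have a different shape needs its own collector.
console.log(
  decide({ sameDataSource: true, sameDataStructure: false, needsCustomProcessing: false }),
);
```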
Step-by-Step Process
1. Identify Collector Type
API Integrations:
- RedditApi: Reddit API
- StackExchangeApi: Stack Exchange API
Scrapy Spiders:
- AtlassianForums: Scrapy-based scraping
2. Add Instance in Index
File: packages/data-lake/data-lake-infra/src/index.ts
Add new instance after existing ones (see examples below).
3. Configure Parameters
Required:
- tenantId: GUID from configuration (see wordpress example) or textual, like 'Atlasian'
- keywords: Search terms/keywords array
- apiKey: API key secret name
- disableIngestion: true for testing, false for production
API-specific:
- subreddit: Array of subreddit names (RedditApi only)
- keywords: Comma-separated string (StackExchangeApi)
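The difference in keyword format between the two API collectors is easy to get wrong, so it may help to see the tenant-specific parameters as types. The sketch below is an assumption inferred from the parameter list above, not the real argument types: the actual classes also take infrastructure fields, the `kind` discriminator is hypothetical, and `apiKey` is modeled as a plain string here for simplicity.

```typescript
// Assumed shapes inferred from the parameter list; not the real class arguments.
interface CommonTenantParams {
  tenantId: string;          // GUID from configuration or a textual identifier
  apiKey: string;            // simplified: modeled as the secret name
  disableIngestion: boolean; // true for testing, false for production
}

interface RedditTenantParams extends CommonTenantParams {
  kind: 'RedditApi';   // hypothetical discriminator, for illustration only
  keywords: string[];  // array of search terms
  subreddit: string[]; // array of subreddit names
}

interface StackExchangeTenantParams extends CommonTenantParams {
  kind: 'StackExchangeApi'; // hypothetical discriminator, for illustration only
  keywords: string;         // comma-separated search terms
}

type TenantParams = RedditTenantParams | StackExchangeTenantParams;

// Return the individual search terms regardless of how the collector stores them.
function searchTerms(params: TenantParams): string[] {
  if (params.kind === 'RedditApi') {
    return params.keywords;
  }
  return params.keywords
    .split(',')
    .map((term) => term.trim())
    .filter((term) => term.length > 0);
}

const stackExchangeTenant: TenantParams = {
  kind: 'StackExchangeApi',
  tenantId: 'example-tenant',
  apiKey: 'stack_exchange_api_key',
  disableIngestion: true,
  keywords: 'keyword1,keyword2,keyword3',
};

console.log(searchTerms(stackExchangeTenant));
```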
4. Set Pulumi Secrets if needed
cd packages/data-lake/data-lake-infra
pulumi config --stack {dev|prod} set --secret {secretKeyName} {value}
5. Deploy
NX_TUI=false pnpm nx pulumi data-lake-infra -- preview --stack dev
NX_TUI=false pnpm nx pulumi data-lake-infra -- up --stack dev
Examples by Type
RedditApi Example
Location: packages/data-lake/data-lake-infra/src/index.ts:315-339
new RedditApi('reddit-{tenant-name}', {
dataArchiveEmrAppInstance: dataArchiveEmrAppInstance,
sparkJobLocationPrefix: dataArchiveSparkJobLocationPrefix,
alarmSnsTopicArn: alarmSnsTopicArn,
athenaCatalogName: athenaCatalog.name,
env: env,
vpcId: vpcId,
ecsClusterArn: ecsClusterArn,
scraperImage: dataLakeCollectorImage.imageName,
apiKey: config.requireSecret('reddit_api_key'),
redditClientId: config.requireSecret('reddit_client_id'),
ecsSubnets: internalSubnetIds,
bronzeDataLakeDB: bronzeDataLakeDB,
silverDataLakeDB: silverDataLakeDB,
keywords: ['keyword1', 'keyword2'], // Array of keywords
subreddit: ['subreddit1', 'subreddit2'], // Array of subreddits
tenantId: '{tenant-name}',
disableIngestion: false, // Set to true for testing
});
Zendesk Example: packages/data-lake/data-lake-infra/src/index.ts:376-393
StackExchangeApi Example
Location: packages/data-lake/data-lake-infra/src/index.ts:289-311
new StackExchangeApi('stack-exchange-{tenant-name}', {
dataArchiveEmrAppInstance: dataArchiveEmrAppInstance,
sparkJobLocationPrefix: dataArchiveSparkJobLocationPrefix,
alarmSnsTopicArn: alarmSnsTopicArn,
athenaCatalogName: athenaCatalog.name,
env: env,
vpcId: vpcId,
ecsClusterArn: ecsClusterArn,
scraperImage: dataLakeCollectorImage.imageName,
apiKey: config.requireSecret('stack_exchange_api_key'),
ecsSubnets: internalSubnetIds,
bronzeDataLakeDB: bronzeDataLakeDB,
silverDataLakeDB: silverDataLakeDB,
keywords: 'keyword1,keyword2,keyword3', // Comma-separated string
tenantId: '{tenant-name}',
disableIngestion: false,
});
AtlassianForums Example (Scrapy)
Location: packages/data-lake/data-lake-infra/src/index.ts:163-177
new AtlassianForums('atlassian-forums', {
dataArchiveEmrAppInstance: dataArchiveEmrAppInstance,
sparkJobLocationPrefix: dataArchiveSparkJobLocationPrefix,
alarmSnsTopicArn: alarmSnsTopicArn,
athenaCatalogName: athenaCatalog.name,
env: env,
vpcId: vpcId,
ecsClusterArn: ecsClusterArn,
scraperImage: dataLakeCollectorImage.imageName,
zyteApiKey: config.requireSecret('zyteApiKey'),
ecsSubnets: internalSubnetIds,
bronzeDataLakeDB: bronzeDataLakeDB,
silverDataLakeDB: silverDataLakeDB,
disableIngestion: true,
});
Note: Scrapy collectors currently do not use keywords or tenantId; their configuration lives in the spider code.
Configuration Checklist
- Instance name: {collector-type}-{tenant-name}
- tenantId: Set to tenant identifier
- keywords: Configure search terms
- subreddit: Set if using RedditApi
- API secrets: Set in Pulumi config
- disableIngestion: Set to false for production
- Deploy and verify Step Functions execution
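Part of this checklist can be checked mechanically before deploying. The following is a minimal sketch, assuming a simplified `CollectorInstance` description; that type and the `checkInstance` helper are hypothetical and not part of the data-lake-infra package. Secrets and Step Functions execution still need to be verified by hand.

```typescript
// Hypothetical, simplified view of one collector instance.
interface CollectorInstance {
  collectorType: string;       // e.g. 'reddit' or 'stack-exchange'
  tenantName: string;          // e.g. 'example-tenant'
  instanceName: string;        // expected: {collector-type}-{tenant-name}
  tenantId: string;
  keywords: string[] | string; // array (RedditApi) or comma-separated string (StackExchangeApi)
  disableIngestion: boolean;
}

// Return a list of checklist violations; an empty list means the checks passed.
function checkInstance(instance: CollectorInstance, stack: 'dev' | 'prod'): string[] {
  const problems: string[] = [];

  const expectedName = `${instance.collectorType}-${instance.tenantName}`;
  if (instance.instanceName !== expectedName) {
    problems.push(`instance name should be '${expectedName}'`);
  }
  if (instance.tenantId.trim() === '') {
    problems.push('tenantId is empty');
  }
  const hasKeywords =
    typeof instance.keywords === 'string'
      ? instance.keywords.trim() !== ''
      : instance.keywords.length > 0;
  if (!hasKeywords) {
    problems.push('keywords are empty');
  }
  if (stack === 'prod' && instance.disableIngestion) {
    problems.push('disableIngestion must be false for production');
  }
  return problems;
}

const exampleInstance: CollectorInstance = {
  collectorType: 'reddit',
  tenantName: 'example-tenant',
  instanceName: 'reddit-example-tenant',
  tenantId: 'example-tenant',
  keywords: ['keyword1', 'keyword2'],
  disableIngestion: true,
};

// On dev this passes; on prod it reports that ingestion is still disabled.
console.log(checkInstance(exampleInstance, 'dev'));
console.log(checkInstance(exampleInstance, 'prod'));
```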
Configuration Differences
| Type | Keywords Format | Additional Config | Tenant ID |
|---|---|---|---|
| RedditApi | Array of strings | subreddit array | Yes |
| StackExchangeApi | Comma-separated string | None | Yes |
| AtlassianForums | N/A (in spider code) | N/A | N/A |
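The keyword column of the table can be expressed as a small conversion helper. This is only a sketch: `formatKeywords` is a hypothetical function, not something the collector classes provide, and it assumes search terms start out as a plain list.

```typescript
type CollectorType = 'RedditApi' | 'StackExchangeApi' | 'AtlassianForums';

// Convert a list of search terms into the format each collector type expects.
function formatKeywords(
  type: CollectorType,
  terms: string[],
): string[] | string | undefined {
  switch (type) {
    case 'RedditApi':
      return terms.slice();   // array of strings
    case 'StackExchangeApi':
      return terms.join(','); // comma-separated string
    case 'AtlassianForums':
      return undefined;       // keywords are defined in the spider code
  }
}

const terms = ['keyword1', 'keyword2'];
console.log(formatKeywords('RedditApi', terms));
console.log(formatKeywords('StackExchangeApi', terms));
console.log(formatKeywords('AtlassianForums', terms));
```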
References
- Infrastructure index: packages/data-lake/data-lake-infra/src/index.ts
- RedditApi class: packages/data-lake/data-lake-infra/src/redditApi.ts
- StackExchangeApi class: packages/data-lake/data-lake-infra/src/stackExchangeApi.ts
- AtlassianForums class: packages/data-lake/data-lake-infra/src/atlassianForums.ts
- Tenant ID format: [PLACEHOLDER: tenant ID format documentation]
- Secret key conventions: [PLACEHOLDER: secret key naming conventions]