File Discovery Service
This page outlines the Storage Discovery Service within OPSWAT MetaDefender Storage Security (MDSS), detailing its function, architecture, and operational flow.
About the Service
The Storage Discovery Service identifies objects (files) within a designated storage location and retrieves information about them. This information enables MDSS to process each object and submit it to MetaDefender Core/Cloud for scanning.
- Discovery is the process of retrieving object metadata from a storage location. This metadata allows MDSS to manage and process the object for scanning.
- The Storage Discovery Service communicates over a RabbitMQ message bus. Events are published and consumed asynchronously to coordinate discovery tasks and report progress. Messages must follow the message publishing guidelines, routing keys, and JSON schemas for the service to function correctly.
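As a rough illustration of what a published event might look like, the sketch below builds a routing key and a JSON message body in Python. The payload fields (`scanId`, `storageId`), the `discovery.<EventName>` routing key format, and the `build_message` helper are all assumptions made for illustration. The actual schemas and routing keys are defined by MDSS and are not reproduced here.

```python
import json
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class StartDiscoveryEvent:
    # Hypothetical payload fields; the real JSON schema is defined by MDSS.
    scanId: str
    storageId: str


def build_message(event) -> Tuple[str, bytes]:
    """Return (routing_key, body) ready to hand to a message-bus client."""
    # Assumed convention: route on the event's class name.
    routing_key = f"discovery.{type(event).__name__}"
    # Sorted keys keep the serialized body stable across runs.
    body = json.dumps(asdict(event), sort_keys=True).encode("utf-8")
    return routing_key, body


routing_key, body = build_message(
    StartDiscoveryEvent(scanId="scan-1", storageId="storage-1")
)
```

Keeping serialization separate from the publishing call makes it easier to validate a message against its schema before it reaches the bus.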
Elements
- ScanId → a unique identifier associated with a scan operation. This ID is used in `StartDiscoveryEvent` and `RequestDiscoveryEvent` messages and is stored in the MDCS database's "scans" collection.
- StorageProtocolType → an enumeration in `mdss.models` that defines the protocol or SDK used to interact with the storage (AWS S3, Azure Blob Storage).
- Storage → represents the storage location being scanned. Linked to the `ScanId`, its details are stored in the "storages" collection within the MDCS database.
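The relationship between these elements can be pictured with a small Python model. This is only a sketch: the class layout and the `storage_id` field are hypothetical, and the enumeration lists only the two protocols named on this page, whereas the real `StorageProtocolType` in `mdss.models` may define others.

```python
from dataclasses import dataclass
from enum import Enum


class StorageProtocolType(Enum):
    # Only the protocols mentioned on this page; the real enum may have more.
    AWS_S3 = "AWS S3"
    AZURE_BLOB = "Azure Blob Storage"


@dataclass(frozen=True)
class Storage:
    storage_id: str  # hypothetical identifier field
    protocol: StorageProtocolType


@dataclass(frozen=True)
class Scan:
    scan_id: str      # corresponds to ScanId
    storage: Storage  # the storage location linked to this scan


scan = Scan("scan-1", Storage("storage-1", StorageProtocolType.AWS_S3))
```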
Initiating Discovery
The `StartDiscoveryEvent` is the trigger for initiating a new discovery process. This event can be published in response to:
- User requests via the API Gateway.
- Scheduled tasks initiated by the JobDispatcher.
Event Communication
- `StartDiscoveryEvent` signals the start of a new discovery operation.
- `ObjectDiscoveryFinishedEvent` indicates that a new object has been discovered and its metadata retrieved. It is published by the storage-specific discovery service.
- `DiscoveryFinishedEvent` notifies the central service that a specific level or branch of the discovery process has been completed.
Storage Discovery Algorithm
MDSS employs a parallel approach for discovering the hierarchical structure of storage systems.
Process
- Parallel Exploration → the algorithm explores the storage's folder structure in parallel, enhancing efficiency.
- Caching → a cache is used to track discovery progress and status. Keys are used to indicate whether a folder is being explored or if an error occurred.
- Failure Handling → if the storage discovery fails (`DiscoveryRequestFinishedState` is `StorageDiscoveryFailed`), the failure is recorded in the cache. If recording the failure itself fails, an error is logged.
- Subfolder Exploration → for each subfolder within the current folder, a key is set in the cache to mark it as "In progress." Errors during this process are logged.
- Completion Marking → once a folder is fully explored, its corresponding key is removed from the cache. Errors during removal are logged.
- Cache Expiration → in situations where a key cannot be removed from the cache (e.g., due to unforeseen errors), it automatically expires after a predefined duration (`_timeSpanExpiration`).
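The cache bookkeeping described above can be sketched in Python as follows. This is an in-memory stand-in, not the MDSS implementation: the `DiscoveryCache` class, the `on_folder_explored` helper, and the "Failed" status string are hypothetical, and `expiration_seconds` merely plays the role of `_timeSpanExpiration`. An injectable clock is used so that expiry can be demonstrated without waiting.

```python
import time


class DiscoveryCache:
    """In-memory stand-in for the discovery progress cache (illustrative)."""

    def __init__(self, expiration_seconds, clock=time.monotonic):
        self._expiration = expiration_seconds  # role of _timeSpanExpiration
        self._clock = clock
        self._entries = {}  # key -> (status, expires_at)

    def mark(self, key, status):
        self._entries[key] = (status, self._clock() + self._expiration)

    def status(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        status, expires_at = entry
        if self._clock() >= expires_at:
            # A stale key behaves as if it had been removed.
            del self._entries[key]
            return None
        return status

    def remove(self, key):
        self._entries.pop(key, None)


def on_folder_explored(cache, folder, subfolders, failed=False):
    """Update the cache after one folder has been processed."""
    if failed:
        cache.mark(folder, "Failed")  # hypothetical status value
        return
    for sub in subfolders:
        cache.mark(sub, "In progress")
    cache.remove(folder)  # completion marking


# Usage with a fake clock so expiry is deterministic.
now = [0.0]
cache = DiscoveryCache(expiration_seconds=60, clock=lambda: now[0])
cache.mark("/", "In progress")
on_folder_explored(cache, "/", ["/a", "/b"])
```

Expiring entries on read is one simple way to get the self-healing behavior described in the last step. A production cache would more likely rely on the cache server's own time-to-live support.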
Architecture and Discovery Flow
MDSS utilizes two primary services for efficient storage discovery:
- The Discovery Service Central service acts as the orchestrator, receiving discovery requests and delegating tasks to the appropriate storage-specific service. It uses the provided information, such as storage type and target locations, to guide the discovery process.
- The Discovery Storage Specific Service is specialized for a particular storage platform (AWS S3, Azure Blob Storage). Each storage-specific service receives instructions from the central service and executes the actual discovery process, tailored to its storage environment.
Discovery Flows
1. Regular Discovery
   - A `StartDiscoveryEvent` containing Storage Discovery information initiates the process.
   - The central service sends a `RequestDiscoveryEvent` to the corresponding storage-specific service.
   - The specific service analyzes the storage content level by level, identifying files and folders.
   - Discovered files trigger an `ObjectDiscoveryFinishedEvent`, adding them to the MDSS database for tracking and scanning.
   - Upon completing a level, a `DiscoveryFinishedEvent` is sent.
   - Discovered folders trigger new `StartDiscoveryEvent` messages, enabling recursive exploration of the storage structure.
2. Webhook Discovery
   - A `WebHookDiscoveryObjectEvent` containing Object Discovery information signals the need to discover a specific object.
   - The central service sends a `RequestDiscoveryEvent` to the specific service.
   - The specific service locates the object and sends an `ObjectDiscoveryFinishedEvent` to add it to the database.
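To make the order of messages in the regular discovery flow concrete, the following Python sketch simulates it over a small in-memory folder tree. It is a sequential simulation under stated assumptions: the `STORAGE` dictionary and `run_regular_discovery` function are hypothetical, events are recorded in a list rather than published to RabbitMQ, and the real services process folders asynchronously and in parallel.

```python
from collections import deque

# Hypothetical storage layout: folder path -> (files, subfolders)
STORAGE = {
    "/": (["a.txt"], ["/docs"]),
    "/docs": (["b.pdf", "c.pdf"], []),
}


def run_regular_discovery(root):
    """Record the event sequence of a regular discovery, level by level."""
    events = []
    pending = deque([root])  # folders awaiting a StartDiscoveryEvent
    while pending:
        folder = pending.popleft()
        events.append(("StartDiscoveryEvent", folder))
        # The central service delegates to the storage-specific service.
        events.append(("RequestDiscoveryEvent", folder))
        files, subfolders = STORAGE[folder]
        for name in files:
            path = folder.rstrip("/") + "/" + name
            events.append(("ObjectDiscoveryFinishedEvent", path))
        # This level of the hierarchy is complete.
        events.append(("DiscoveryFinishedEvent", folder))
        # Each discovered folder leads to a new StartDiscoveryEvent.
        pending.extend(subfolders)
    return events


events = run_regular_discovery("/")
```

A webhook discovery would correspond to a single `RequestDiscoveryEvent` for one object followed by one `ObjectDiscoveryFinishedEvent`, with no recursion into folders.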
This two-service architecture ensures flexibility and efficiency in handling various storage types and discovery scenarios.