Some customers may have experienced server startup problems caused by temporary database connection issues.
If the database connection dropped while the system was loading essential database metadata (such as persistence information for non-abstract types), server startup continued and the Spring Context was loaded anyway. The system neither tried to reload the missing persistence metadata nor failed the server startup after multiple attempts to recover. As a result, the server reported an alive status and appeared to be running, but critical functionality was impaired because the essential persistence metadata was missing. Even a brief network issue could therefore leave a deployment stalled indefinitely and lead to extended downtime.
The system now includes retry logic and graceful error handling that make server startup resilient to temporary database connectivity issues. We have introduced the following key improvements:
- Automatic retry mechanism: If a database connection issue occurs while persistence information is being loaded during startup, the system automatically retries connecting and loading the required persistence information. By default, it keeps retrying for up to 120 seconds. You can configure the maximum waiting time with the persistence.loading.max.wait.duration property.
- Graceful exception handling: If the database remains unavailable after the maximum wait time, the system throws a PersistenceInfoNotLoadedException. It then either shuts down cleanly or falls back to the legacy behavior, depending on the persistence.loading.skip.exception.legacy.flag property (default: false).
- Controlled shutdown process: When persistence loading fails, the system prevents Spring Context creation and shuts down gracefully, avoiding stuck states.
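The retry and failure handling described in the list above can be pictured with a minimal Java sketch. It is an illustration under stated assumptions, not the product's actual implementation: the class PersistenceLoadingRetrySketch, the method loadWithRetry, the pause between attempts, and the legacySkip parameter are all hypothetical names, and the exception class here is a local stand-in that only shares its name with the real one. The sketch also assumes that enabling the legacy flag means startup continues without the metadata.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.function.Supplier;

// Illustrative sketch only: all names except the exception's simple name are hypothetical.
class PersistenceLoadingRetrySketch {

    /** Local stand-in for the exception named in the release note. */
    static class PersistenceInfoNotLoadedException extends RuntimeException {
        PersistenceInfoNotLoadedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Calls the loader until it succeeds or maxWait has elapsed.
     *
     * maxWait plays the role of persistence.loading.max.wait.duration (default 120 seconds);
     * legacySkip plays the role of persistence.loading.skip.exception.legacy.flag.
     * Assumption: with legacySkip enabled, startup proceeds without metadata (empty result).
     */
    static <T> Optional<T> loadWithRetry(Supplier<T> loader, Duration maxWait,
                                         Duration pause, boolean legacySkip) {
        Instant deadline = Instant.now().plus(maxWait);
        RuntimeException lastFailure;
        while (true) {
            try {
                return Optional.of(loader.get());
            } catch (RuntimeException e) {
                lastFailure = e; // remember why the latest attempt failed
            }
            // Stop retrying if another pause would take us past the deadline.
            if (!Instant.now().plus(pause).isBefore(deadline)) {
                break;
            }
            try {
                Thread.sleep(pause.toMillis());
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                break;
            }
        }
        if (legacySkip) {
            return Optional.empty(); // legacy path: continue without persistence metadata
        }
        // Default path: fail startup so the Spring Context is never created.
        throw new PersistenceInfoNotLoadedException(
                "Persistence information not loaded within " + maxWait, lastFailure);
    }
}
```

In this sketch, a caller would run loadWithRetry before creating the Spring Context and treat the thrown exception as the signal to shut down cleanly, which mirrors the controlled shutdown described above.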
This fix helps prevent costly production outages caused by stalled deployments and reduces deployment time and complexity by handling temporary database issues automatically. It also offers configurable resilience settings for different operational requirements.