
GDPR and personal data in web server logs

Masking data in logs has become important for meeting the requirements of the GDPR, the European data protection regulation. As the GDPR "data controller" for your logs, you should minimize the risk of exposing sensitive data to third parties. In some cases even IP addresses are considered personal data, and your logs might contain more sensitive data such as usernames or phone numbers.

As a rule, you can’t collect and store personal data without a lawful basis, typically consent that you have obtained, and can document that you obtained, from the persons whose data you collect. You can, however, still collect and store personal data in your server logs for the limited and legitimate purpose of detecting and preventing fraud and unauthorized system access, and ensuring the security of your systems.

Log management best practices for GDPR

  1. Centralize log storage
    Centralizing your logs lets you apply policies in one place, which reduces the complexity and risk of maintaining them in multiple places. Most log management services support retention policies per data source, so define a reasonable retention time for every log source.
  2. Delete local logs from your servers (periodically)
    Duplicated data can create problems when enforcing policies, so make sure that logs already stored in a central place are removed from local servers as soon as possible. A log shipper streams the logs to the centralized log storage in near real time, and logrotate is a common tool for rotating local logs and deleting old ones periodically (weekly in many default configurations).
  3. Structure your logs
    You can structure logs with parser rules in a log shipper configuration. Structured logs make it easier to mask or anonymize sensitive data as we point out in the next step. Wherever possible, applications should log directly in a structured format like JSON. Using a structured log format saves human time needed to create parser rules, as well as CPU cycles for processing.
  4. Anonymize sensitive data fields in logs
    Identify and anonymize sensitive data fields before the data is shipped to remote storage. Several techniques can be applied, such as hashing, encrypting, or removing sensitive data fields.
  5. Encrypt logs in transit
    Use only encrypted channels to transmit log data to central storage. Logs are often shipped unencrypted over Syslog/UDP for performance reasons. That is bad practice. Do not do that.
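
As an illustration of step 3, here is a minimal sketch of an application that logs directly as JSON, using only Python's standard `logging` module. The logger name `access`, the `fields` convention for passing structured values, and the sample field names are assumptions made for this example, not part of any particular log shipper's format.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass structured values
        # through extra={"fields": {...}} on the logging call.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("access")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    log.info(
        "request served",
        extra={"fields": {"client_ip": "203.0.113.42", "status": 200}},
    )
```

Because every value arrives as a named field, a later masking step can address `client_ip` directly instead of parsing it out of a free-text line.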

Note

The GDPR recommends anonymizing, masking, or removing personal data before you hand data over to any third party.

There are several methods to anonymize or mask data fields:

  • Truncate field values in logs.
    For IP addresses, it can be sufficient to replace the last octet of the address. This breaks the potential correlation to a specific user and so protects the personal data, while part of the network information remains available for analytics (e.g. the country of a user based on the partial IP address).
  • Hash field values in logs.
    Strong hash algorithms make it infeasible to calculate the original value from the hash code. You can still use the resulting hash code for analytics, e.g. to count unique entries.
  • Encrypt field values in logs.
    Encryption might be even more interesting than hashing because you protect the data without losing the information. For third parties the encrypted content is not readable, but you can decrypt the data field when you really need access to the content, e.g. for forensics after a security breach. Depending on the encryption method, you might be able to use the encrypted value for analytics, as with hash codes.
  • Remove fields from logs.
    Removing data fields carries the lowest risk of leaking data, but you might lose information needed for troubleshooting or statistics. You can combine all of the methods mentioned, depending on the nature of the data and your need to analyze or restore the information in specific cases.
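
Three of the methods above (truncating, hashing, and removing) can be sketched on a structured log event with the Python standard library alone. The field names `client_ip`, `username`, and `phone` are hypothetical, as is the choice of a /24 (IPv4) and /48 (IPv6) prefix. The hash is a keyed HMAC-SHA256 rather than a plain hash, on the assumption that values with few possible inputs could otherwise be guessed by hashing every candidate. Encryption is left out because it would need a third-party library.

```python
import hashlib
import hmac
import ipaddress


def truncate_ip(value: str) -> str:
    """Zero the host part of an address: keep /24 for IPv4, /48 for IPv6."""
    ip = ipaddress.ip_address(value)
    prefix = 24 if ip.version == 4 else 48  # assumed prefix lengths
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def hash_value(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest (HMAC) of the value as hex."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize(event: dict, key: bytes) -> dict:
    """Return a copy of the event with sensitive fields masked.

    The field names used here are hypothetical examples.
    """
    out = dict(event)
    if "client_ip" in out:
        out["client_ip"] = truncate_ip(out["client_ip"])    # truncate
    if "username" in out:
        out["username"] = hash_value(out["username"], key)  # hash
    out.pop("phone", None)                                  # remove
    return out


if __name__ == "__main__":
    event = {
        "client_ip": "203.0.113.42",
        "username": "alice",
        "phone": "+1-555-0100",
        "status": 200,
    }
    # The key is a placeholder; a real key would be kept secret.
    print(anonymize(event, key=b"example-key"))
```

The same username always maps to the same digest under a given key, so counting unique users still works on the masked data.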

Further reading