Introduction to AWS Data Lake and AWS Lake Formation for Security Management
A data lake on AWS is a cloud-based storage solution that can store and manage large amounts of structured and unstructured data. It is designed to be highly scalable and cost-effective, and it can be used by businesses of all sizes. AWS Lake Formation is a service for building and managing data lakes on AWS. It provides tools and features that help you manage the security of your data lake and protect your data from unauthorized access.
One of the key aspects of data lake security is confidentiality. Confidentiality refers to the protection of sensitive or High Risk data from unauthorized access or disclosure. In this blog post, we will discuss how to manage confidentiality aspects in a data lake with AWS Lake Formation.
Why Halodoc Protects the Confidentiality of High Risk Data
Halodoc is a healthcare company that provides online consultation, medicine delivery, and other healthcare services. As a digital healthcare platform, Halodoc handles a large volume of personal data, both general personal data and specific personal data such as a person's health condition. This data is classified as High Risk because it is sensitive for users.
Data Lake Permission Management
A data lake has many principals (users) and resources, and the access of these principals to the resources needs to be managed so that the confidentiality of High Risk data is maintained. AWS Lake Formation supports this with centralized permission management, where a central policy can be enforced so that access to High Risk data is granted only to certain principals. Permission management in AWS Lake Formation is fine-grained: permissions can be managed not only at the database or table level but also at the column, row, and cell level. Column-level access control governs access to specific columns in a table, row-level access control governs access to specific rows, and cell-level access control governs access to specific cells, that is, the intersection of particular rows and columns.
The choice of access control mechanism depends on the organization's data security and technical requirements. Column-level access control is useful when organizations need to restrict access to specific columns or fields in a table, which fits scenarios where all data in a column is considered sensitive. Row-level access control is useful when organizations need to restrict access to specific rows or records, which fits scenarios where all information in some records is considered sensitive while other records contain no sensitive data. Cell-level access control is useful when organizations need to restrict access to specific data in a certain row and column, which fits scenarios where non-sensitive, accessible information is stored in the same column as sensitive information.
Considering the capabilities of each access control, the data lake architecture, and the fact that cumulative insights across rows can still be served, column-level access control is considered an efficient way to protect High Risk data from unauthorized principals.
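The three granularities described above can be pictured on a small in-memory table. The following is a minimal sketch in plain Python, not Lake Formation code; the table contents, column names, and function names are made up for illustration. It treats cell-level access as a row filter combined with a column filter.

```python
# Toy table: each row is a dict. Names and values are illustrative only.
ROWS = [
    {"patient_id": 1, "city": "Jakarta", "diagnosis": "flu"},
    {"patient_id": 2, "city": "Bandung", "diagnosis": "asthma"},
]


def column_level(rows, allowed_columns):
    """Return every row, but only the allowed columns."""
    return [{col: row[col] for col in allowed_columns} for row in rows]


def row_level(rows, predicate):
    """Return all columns, but only the rows that satisfy the predicate."""
    return [dict(row) for row in rows if predicate(row)]


def cell_level(rows, allowed_columns, predicate):
    """Return only the allowed columns of the rows that satisfy the predicate."""
    return column_level(row_level(rows, predicate), allowed_columns)


# Column-level: hide the sensitive "diagnosis" column from every row.
print(column_level(ROWS, ["patient_id", "city"]))
```

With column-level control every row is still returned, so counts and other row-wise aggregates over the visible columns remain available.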
Implementing Column Level Access Control in Data Lake Using AWS Lake Formation Tag-Based Access Control
AWS Lake Formation has two ways to assign and manage permissions on resources: named resource access control and tag-based access control. Tag-based access control, called LF-TBAC in Lake Formation, is the recommended method for granting permissions when there is a large number of Data Catalog resources and principals. LF-TBAC is more scalable than the named resource method and requires less permission management overhead.
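A rough way to see the difference in overhead is to count operations. The sketch below is a simplified model under stated assumptions, not a measurement: it assumes the worst case for the named resource method (every principal needs a grant on every resource) and, for LF-TBAC, one tag assignment per resource plus a small number of tag-expression grants per principal. The function names are hypothetical.

```python
def named_resource_grants(num_principals, num_resources):
    """Worst case: one explicit grant per (principal, resource) pair."""
    return num_principals * num_resources


def tag_based_operations(num_principals, num_resources, grants_per_principal=1):
    """One tag assignment per resource plus the tag-expression grants per principal."""
    return num_resources + num_principals * grants_per_principal


# Example: 10 principals and 200 tables.
print(named_resource_grants(10, 200))  # 2000
print(tag_based_operations(10, 200))   # 210
```

Under this model, adding one new table costs one tag assignment with LF-TBAC, while the named resource method needs a new grant for each principal that should see it.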
The following steps can be used to implement this feature:
Step 1: Create a tag policy for the data lake. A tag policy defines the set of tags that can be used in the data lake. One of the scenarios is as follows:
- There are 2 Databases and 2 Principals, and 1 column in Database 1 contains High Risk data
- High Risk data can only be accessed by Principal 1
- Non High Risk data can be accessed by Principal 1 and Principal 2
- Illustration of the model is as depicted in Figure 1
Step 2: Create LF-Tags in Lake Formation.
- LF-Tags are created by the Data Lake Admin through the Add LF-Tag process.
- The Data Lake Admin then creates LF-Tags based on the security policy defined in Step 1. In this case, we create 2 tags for Permissions, i.e. permission = sensitive and permission = non_sensitive, and 1 tag for Status, i.e. status = true. The process for the permission tags is depicted below.
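The same tags can also be created programmatically instead of through the console. The following is a minimal sketch, assuming the boto3 Lake Formation client and its `create_lf_tag` call; the helper names and the `TAG_POLICY` dictionary are illustrative and not part of the original setup. The request payloads are built by a pure function so they can be inspected before anything is sent to AWS.

```python
# Tag policy from the scenario: each LF-Tag key and its allowed values.
TAG_POLICY = {
    "permission": ["sensitive", "non_sensitive"],
    "status": ["true"],
}


def build_create_lf_tag_requests(tag_policy):
    """Build one CreateLFTag request (as keyword arguments) per tag key."""
    return [
        {"TagKey": key, "TagValues": list(values)}
        for key, values in tag_policy.items()
    ]


def create_lf_tags(client, tag_policy):
    """Send the requests using a boto3 Lake Formation client."""
    for request in build_create_lf_tag_requests(tag_policy):
        client.create_lf_tag(**request)


# Usage (assumes credentials with Data Lake Admin rights):
#   import boto3
#   create_lf_tags(boto3.client("lakeformation"), TAG_POLICY)
```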
Step 3: Assign tags to the principals. Based on the security policy, the Data Lake Admin assigns tags to the principals.
- The Data Lake Admin assigns tags through the Grant process.
- The Data Lake Admin chooses the principals and assigns LF-Tags to them through the Resources matched by LF-Tags option.
- Principal 1 is assigned permission = sensitive and status = true, while Principal 2 is assigned permission = non_sensitive
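The grants above can be expressed as request payloads for the boto3 `grant_permissions` call. This is a sketch under several assumptions: the role ARNs are hypothetical placeholders, the granted permission is assumed to be SELECT on tables, and Principal 1's two tags are assumed to be two separate grants. The text does not say whether they form a single expression; a single expression with both keys would only match resources that carry both tags.

```python
def build_grant_request(principal_arn, tag_key, tag_values, permissions=("SELECT",)):
    """Build keyword arguments for one GrantPermissions call on an LF-Tag expression."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": tag_key, "TagValues": list(tag_values)}],
            }
        },
        "Permissions": list(permissions),
    }


# Hypothetical role ARNs standing in for Principal 1 and Principal 2.
PRINCIPAL_1 = "arn:aws:iam::111122223333:role/principal-1"
PRINCIPAL_2 = "arn:aws:iam::111122223333:role/principal-2"

GRANTS = [
    build_grant_request(PRINCIPAL_1, "permission", ["sensitive"]),
    build_grant_request(PRINCIPAL_1, "status", ["true"]),
    build_grant_request(PRINCIPAL_2, "permission", ["non_sensitive"]),
]


def apply_grants(client, grants):
    """Send each grant using a boto3 Lake Formation client."""
    for grant in grants:
        client.grant_permissions(**grant)


# Usage (assumes credentials with Data Lake Admin rights):
#   import boto3
#   apply_grants(boto3.client("lakeformation"), GRANTS)
```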
Step 4: Assign LF-Tags to resources. Based on the access policy and tags, the Data Lake Admin assigns the tags to the resources.
- The Data Lake Admin uses the Edit LF-Tags process to assign permission = sensitive to Table 1 and permission = non_sensitive to Table 2. Table 2 also receives 1 additional tag, status = true.
- At the column level, the Data Lake Admin assigns the tag permission = non_sensitive to Column_2 and Column_3 in Table 1.
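The tag assignments in this step map onto the boto3 `add_lf_tags_to_resource` call, which accepts either a table or a table with a list of columns as the resource. The sketch below builds those payloads; the database, table, and column identifiers are hypothetical, and it assumes Table 1 lives in Database 1 and Table 2 in Database 2.

```python
def lf_tag(key, value):
    """One LF-Tag key/value pair in the shape the API expects."""
    return {"TagKey": key, "TagValues": [value]}


def table_resource(database, table):
    return {"Table": {"DatabaseName": database, "Name": table}}


def columns_resource(database, table, columns):
    return {
        "TableWithColumns": {
            "DatabaseName": database,
            "Name": table,
            "ColumnNames": list(columns),
        }
    }


# Hypothetical catalog names for the scenario's tables and columns.
ASSIGNMENTS = [
    {
        "Resource": table_resource("database_1", "table_1"),
        "LFTags": [lf_tag("permission", "sensitive")],
    },
    {
        "Resource": table_resource("database_2", "table_2"),
        "LFTags": [lf_tag("permission", "non_sensitive"), lf_tag("status", "true")],
    },
    {
        # Column-level override: these columns no longer inherit "sensitive".
        "Resource": columns_resource("database_1", "table_1", ["column_2", "column_3"]),
        "LFTags": [lf_tag("permission", "non_sensitive")],
    },
]


def apply_assignments(client, assignments):
    """Send each assignment and collect any failures the service reports."""
    failures = []
    for assignment in assignments:
        response = client.add_lf_tags_to_resource(**assignment)
        failures.extend(response.get("Failures", []))
    return failures
```

Column_1 is left untagged on purpose in this sketch: columns inherit the LF-Tags of their table, so it keeps permission = sensitive.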
Step 5: Test your access control policies. You can test your policies by simulating different access scenarios.
- The result of access before the implementation of access control policies is illustrated below.
- The result of access after the implementation of access control policies is illustrated below.
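Before testing against the live data lake, the expected outcome can be reasoned through with a small offline model. The sketch below is a simplified approximation of LF-TBAC matching, not the service's implementation: it assumes a column inherits its table's tags unless overridden, that a grant expression matches when every key in it is present on the resource with one of the allowed values, and that Principal 1 holds two separate grants. All names are illustrative.

```python
def effective_tags(table_tags, column_overrides=None):
    """A column inherits the table's tags; column-level tags override them."""
    tags = dict(table_tags)
    tags.update(column_overrides or {})
    return tags


def expression_matches(expression, tags):
    """Every key in the expression must be on the resource with an allowed value."""
    return all(tags.get(key) in allowed for key, allowed in expression.items())


def can_read(grants, tags):
    """A principal can read a resource if any of its grants matches the tags."""
    return any(expression_matches(expression, tags) for expression in grants)


# Tags from Step 4.
TABLE_1 = {"permission": "sensitive"}
TABLE_2 = {"permission": "non_sensitive", "status": "true"}
COLUMN_1 = effective_tags(TABLE_1)  # inherits permission = sensitive
COLUMN_2 = effective_tags(TABLE_1, {"permission": "non_sensitive"})

# Grants from Step 3, modelled as separate expressions per principal.
PRINCIPAL_1 = [{"permission": {"sensitive"}}, {"status": {"true"}}]
PRINCIPAL_2 = [{"permission": {"non_sensitive"}}]

print(can_read(PRINCIPAL_1, COLUMN_1))  # True: High Risk column visible to Principal 1
print(can_read(PRINCIPAL_2, COLUMN_1))  # False: High Risk column hidden from Principal 2
print(can_read(PRINCIPAL_2, COLUMN_2))  # True: non-sensitive column visible to Principal 2
```

A model like this only predicts what the policy should do; the actual check is still to query the tables as each principal, for example through Athena, and compare the returned columns.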
Managing access to High Risk data in a data lake is critical and needs to be addressed properly to ensure data security and compliance. AWS Lake Formation provides a set of tools and features to manage data access, data security, and data governance. Column-level access control using tag-based access control is a useful mechanism to restrict access to specific columns or fields in a table. This mechanism can be implemented using AWS Lake Formation by creating a tag policy, creating LF-Tags, assigning tags to principals, assigning tags to resources, and testing the policies. By implementing these security measures, organizations can protect the confidentiality of the data in their data lake.
Scalability, reliability and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels and if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at email@example.com.
Halodoc is the number 1 all-around Healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 3500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals and allows patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek and many more. We recently closed our Series C round and in total have raised around USD 180 million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.