Distributed File System with gRPC

1. Introduction

The purpose of this project is to develop a gRPC-based file management system that supports storing, fetching, deleting, and listing files on a server. Clients manage files remotely, with communication handled by the gRPC framework. A weakly consistent synchronization mechanism keeps caches consistent between multiple clients and a single server. The system should handle both binary and text files.

GitHub Link

2. Environment

Linux Ubuntu
C++
gRPC + Protocol Buffers
Inotify
CRC Checksum

3. Step A: gRPC

3.1 Tasks

Fetch a file from a remote server and transfer its contents via gRPC
Store a file to a remote server and transfer its data via gRPC
List all files on the remote server:
- For this assignment, the server is only required to contain files in a single directory; it is not necessary to manage nested directories.
- The file listing should include the file name and the modified time (mtime) of the file data in seconds from the epoch.
Get the following attributes for a file on the remote server:
- Size
- Modified Time
- Creation Time
Finally, the client should recognize when the server has timed out.

3.2 Design Choices

gRPC Framework: Selected for its ability to handle remote procedure calls efficiently and support for various programming languages, making it ideal for distributed systems.
Protocol Buffers (ProtoBuf): Used to define service APIs and message structures, allowing for compact and efficient serialization of data.
Streaming vs. Unary Calls: Streaming RPCs are used for storing and fetching files, whose contents may be large (client-side streaming for store, server-side streaming for fetch), while unary calls are adequate for deleting files and fetching file attributes, which exchange only small messages.

rpc StoreFile(stream StoreFileRequest) returns (StoreFileResponse);
rpc FetchFile(FetchFileRequest) returns (stream FetchFileResponse);
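
To illustrate why streaming suits large files, the following is a minimal sketch of cutting a file's bytes into bounded pieces, one per stream message. The helper name split_into_chunks and the idea of a fixed chunk size are assumptions made for illustration; the project's actual chunking logic is not shown in this report.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a file's bytes into pieces of at most
// chunk_size bytes, in order. std::string is used as a byte buffer, so
// binary data (including NUL bytes) is preserved.
std::vector<std::string> split_into_chunks(const std::string& data,
                                           std::size_t chunk_size) {
    std::vector<std::string> chunks;
    if (chunk_size == 0) {
        return chunks;  // a zero chunk size would never make progress
    }
    for (std::size_t offset = 0; offset < data.size(); offset += chunk_size) {
        // substr clamps the length at the end of the buffer, so the last
        // chunk may be shorter than chunk_size.
        chunks.push_back(data.substr(offset, chunk_size));
    }
    return chunks;
}
```

Each returned piece would then be placed in the bytes field of one streamed message, so neither side has to hold a whole large file in a single message.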

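Deadline check: Each server-side handler first checks whether the call has already been cancelled, which is also the case once the client's deadline has expired, so that no resources are spent on a request the client has given up on. The client sets the deadline on its ClientContext before issuing each call.

// Server side: stop early if the call is no longer active
if (context->IsCancelled()) {
    return Status(StatusCode::DEADLINE_EXCEEDED, "Deadline exceeded");
}

// Client side: attach a deadline to the call
ClientContext context;
auto deadline = std::chrono::system_clock::now() + std::chrono::milliseconds(this->deadline_timeout);
context.set_deadline(deadline);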
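
The deadline arithmetic itself can be tried without gRPC. The sketch below uses only std::chrono; the helper names make_deadline and deadline_passed are hypothetical and exist only to show how an absolute deadline is derived from a relative timeout and then compared against the current time.

```cpp
#include <chrono>

using Clock = std::chrono::system_clock;

// Hypothetical helper: absolute deadline = start time + timeout (ms).
Clock::time_point make_deadline(Clock::time_point start, long timeout_ms) {
    return start + std::chrono::milliseconds(timeout_ms);
}

// True once `now` has reached or passed the deadline.
bool deadline_passed(Clock::time_point deadline, Clock::time_point now) {
    return now >= deadline;
}
```

Passing the start time and the current time in as parameters, rather than calling Clock::now() inside the helpers, keeps the logic deterministic and easy to test.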

3.3 Flow of Control

Store Operation: The client reads file data and streams it to the server using ClientWriter. The server writes the data to the file system and returns Status::OK on success.
Fetch Operation: The client initiates a request, and the server streams the file back using ServerWriter. The client writes the incoming data to a local file.
Delete Operation: A straightforward unary RPC where the client requests to delete a file, and the server responds with success or failure status.
List Operation: The client requests a list of files, and the server returns a complete response containing file metadata.
Stat Operation: The client requests file attributes, and the server responds with metadata such as size and modification time.
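
As an illustration of the Stat operation, the sketch below reads file attributes with the POSIX stat() call. The struct FileAttributes and the function read_attributes are hypothetical names, not taken from the project. One assumption worth flagging: stat() does not report a true creation time, so st_ctime (time of last status change) is used here as a stand-in.

```cpp
#include <sys/stat.h>

#include <cstdint>
#include <string>

// Hypothetical attribute record mirroring the fields a Stat reply carries.
struct FileAttributes {
    std::int64_t size = 0;
    std::int64_t modified_time = 0;  // st_mtime, seconds since the epoch
    std::int64_t change_time = 0;    // st_ctime, stand-in for creation time
};

// Returns true and fills `out` if `path` can be stat'ed; false otherwise
// (for example, when the file does not exist).
bool read_attributes(const std::string& path, FileAttributes& out) {
    struct stat st;
    if (stat(path.c_str(), &st) != 0) {
        return false;
    }
    out.size = static_cast<std::int64_t>(st.st_size);
    out.modified_time = static_cast<std::int64_t>(st.st_mtime);
    out.change_time = static_cast<std::int64_t>(st.st_ctime);
    return true;
}
```

A server handler could map a false return to a not-found status and otherwise copy the three fields into the response message.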

4. Step B: Completing the Distributed File System (DFS)

4.1 Tasks

Now that we have a working gRPC service, we will turn our focus towards completing a rudimentary DFS. For this assignment, we’ll apply a weakly consistent cache strategy to the RPC calls created in Step A. This is similar to the approach used by the Andrew File System (AFS). To keep things simple, we’ll focus on whole-file caching for the client-side and a simple lock strategy on the server-side to ensure that only one client may write to the server at any given time.

In addition to the synchronous gRPC calls in Step A, we will also make an asynchronous gRPC call for Step B. This asynchronous call will be used to allow the server to communicate to connected clients whenever a change is registered in the service.

4.2 Design Choices

gRPC for Communication: We chose gRPC due to its efficient protocol buffer serialization, which allows for high-performance communication between the client and server. The use of gRPC also simplifies the implementation of remote calls with automatic code generation.
Mutex for Thread Safety: On the server, a mutex (lock_mutex) protects shared state such as the file_locks map, so that concurrent lock requests from multiple clients for the same file are handled safely.

private:
    // Map to keep track of file locks: filename -> client ID holding the lock
    std::unordered_map<std::string, std::string> file_locks; 

    // Mutex to protect access to the file_locks map
    std::mutex lock_mutex; 
    
public:
    Status RequestWriteLock(ServerContext* context, const WriteLockRequest* request, WriteLockResponse* response) override {
        ...
        {
            // explicit use of mutex
            std::lock_guard<std::mutex> guard(lock_mutex);
            auto lock_it = file_locks.find(filename); 

            if (lock_it != file_locks.end()) {
                if (lock_it->second == client_id) {
                    ...
                }
                ...
            }
        }
        ...
    }
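
The locking rule in the excerpt above can be exercised in isolation. The following is a minimal sketch, assuming a stand-alone class named LockTable (a hypothetical name) with the same map layout as file_locks; the release method is also an assumption, since the report does not show how locks are released.

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-alone lock table: filename -> ID of the client that
// currently holds the write lock.
class LockTable {
public:
    // Grants the lock if the file is unlocked or already held by client_id;
    // denies it if another client holds it.
    bool try_acquire(const std::string& filename, const std::string& client_id) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = locks_.find(filename);
        if (it != locks_.end()) {
            return it->second == client_id;
        }
        locks_.emplace(filename, client_id);
        return true;
    }

    // Releases the lock only if client_id is the current holder.
    bool release(const std::string& filename, const std::string& client_id) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = locks_.find(filename);
        if (it == locks_.end() || it->second != client_id) {
            return false;
        }
        locks_.erase(it);
        return true;
    }

private:
    std::unordered_map<std::string, std::string> locks_;
    std::mutex mutex_;  // guards every access to locks_
};
```

In an RPC handler, a false result from try_acquire would correspond to denying the write lock request, and a true result to granting it.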

Status Code Handling: We employed grpc::Status and grpc::StatusCode to handle and communicate the status of operations clearly between the server and client. This design choice enables detailed error reporting and handling.
Timeout Handling: The use of ClientContext::set_deadline ensures that each request respects a specified timeout, which is critical for maintaining system responsiveness in a distributed environment.
Checksum: CRC checksums are used to verify file consistency; the checksum is carried in the request and response messages.

message FetchFileRequest {
    string filename = 1;
}

message FetchFileResponse {
    bytes data = 1;
    string filename = 2;
    int64 modified_time = 3;
    uint32 checksum = 4;
}
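
For reference, a checksum that fits the uint32 checksum field can be computed as sketched below. The report does not say which CRC variant or library the project uses, so this assumes the common CRC-32 (reflected polynomial 0xEDB88320, as used by zlib and PNG); the function name crc32_of is hypothetical.

```cpp
#include <cstdint>
#include <string>

// Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
// Table-free and slow, but short enough to check by hand.
std::uint32_t crc32_of(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;  // standard initial value
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // If the low bit is set, shift and XOR in the polynomial;
            // otherwise just shift. The mask is all ones or all zeros.
            std::uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;  // standard final XOR
}
```

Two copies of a file with equal checksums can be treated as identical for synchronization purposes, which avoids transferring contents that have not changed.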

Inotify & Async: A second mutex on the client ensures that the inotify thread and the async thread do not perform file operations at the same time.

void DFSClientNodeP2::InotifyWatcherCallback(std::function<void()> callback) {
    std::lock_guard<std::mutex> lock(callback_mutex);
    callback();
}

Handle Callback List: The client compares its local files against the callback list returned by the server to determine which files need to be stored, fetched, or deleted.
Check before Processing Queued Requests: Before processing the requests in the queue, the server checks whether any files have changed; it sends the callback list to clients only when something has been modified.

{
    std::lock_guard<std::mutex> lock(queue_mutex);

    DIR* dir;
    struct dirent* ent;
    struct stat file_stat;
    std::unordered_set<std::string> current_files; // Track files

    if ((dir = opendir(this->mount_path.c_str())) != nullptr) {
        ... // check file changes
        closedir(dir); // close only a handle that was actually opened
    }
}

if (!files_changed) {
    // Sleep for a while if no changes
    std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    continue;
}
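
The callback-list comparison described above can be sketched as a pure function. Everything named here (FileInfo, SyncAction, plan_sync) is hypothetical, and the decision rule is an assumption: equal checksums mean the copies match, otherwise the copy with the newer mtime wins. Deletion is left out because the report does not describe how deletions are detected.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-file metadata, as it might appear in the callback list.
struct FileInfo {
    std::int64_t modified_time;
    std::uint32_t checksum;
};

enum class SyncAction { None, Fetch, Store };

using FileIndex = std::map<std::string, FileInfo>;

// Decide what the client should do for each file known to either side.
std::map<std::string, SyncAction> plan_sync(const FileIndex& local,
                                            const FileIndex& server) {
    std::map<std::string, SyncAction> plan;
    for (const auto& entry : server) {
        auto it = local.find(entry.first);
        if (it == local.end()) {
            plan[entry.first] = SyncAction::Fetch;  // only the server has it
        } else if (it->second.checksum == entry.second.checksum) {
            plan[entry.first] = SyncAction::None;   // contents already match
        } else if (it->second.modified_time > entry.second.modified_time) {
            plan[entry.first] = SyncAction::Store;  // local copy is newer
        } else {
            plan[entry.first] = SyncAction::Fetch;  // server copy is newer or same age
        }
    }
    for (const auto& entry : local) {
        if (server.find(entry.first) == server.end()) {
            plan[entry.first] = SyncAction::Store;  // only the client has it
        }
    }
    return plan;
}
```

Keeping the decision separate from the RPC calls makes the rule easy to unit test before wiring it to the store and fetch operations.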

4.3 Flow of Control

Server Side
1. Receive Request: The server receives a WriteLockRequest from the client.
2. Lock Acquisition:
  - Check Existing Locks: Uses a mutex to check if the file is already locked.
  - Grant or Deny Lock: If the file is not locked, or locked by the same client, the lock is granted. If locked by another client, the lock request is denied with RESOURCE_EXHAUSTED.
3. Operation: Performs the store, fetch, or delete operation as in Step A.
4. Return Status: The server returns a grpc::Status indicating the result (e.g., Status::OK, DEADLINE_EXCEEDED, CANCELLED).
5. Async: Checks whether any file has changed. If so, processes the synchronization requests in the queue and sends the callback list to each client. If not, sleeps for a while and checks again.
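
The change check in the Async step can be sketched as a comparison of two directory snapshots. The Snapshot alias and the changed_files function are hypothetical; the sketch assumes a snapshot records each file's mtime and that any added, removed, or re-timestamped file counts as a change.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical directory snapshot: filename -> mtime (seconds since epoch).
using Snapshot = std::map<std::string, std::int64_t>;

// Returns the names of files that were added or modified (in name order),
// followed by the names of files that were removed (in name order).
std::vector<std::string> changed_files(const Snapshot& previous,
                                       const Snapshot& current) {
    std::vector<std::string> changed;
    for (const auto& entry : current) {
        auto it = previous.find(entry.first);
        if (it == previous.end() || it->second != entry.second) {
            changed.push_back(entry.first);  // new file or different mtime
        }
    }
    for (const auto& entry : previous) {
        if (current.find(entry.first) == current.end()) {
            changed.push_back(entry.first);  // file no longer present
        }
    }
    return changed;
}
```

An empty result corresponds to the "sleep and check again" branch; a non-empty result would trigger sending the callback list to the waiting clients.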

Client Side
1. Inotify: Check if the file has been modified. If so, first attempt to acquire the operation lock, then invoke the callback() to initiate the store or delete operation.
2. Async: Acquire the operation lock and receive the file information returned by the server. Compare the local files with those on the server and initiate store, fetch, or delete operations.
3. Prepare Request: Sets up WriteLockRequest with the filename and client ID.
4. Set Deadline: Configures the context with a deadline for the request.
5. Send Request: Calls RequestWriteLock on the server stub.
6. Process Response:
  - Successful Lock: If status.ok(), the client can proceed with file operations.
  - Handle Errors: Otherwise, logs the error and returns the appropriate status code.

5. Implementation and Testing

Implementation: Developed the gRPC service definitions in the proto file and implemented the generated virtual functions on the server, then implemented the corresponding functions on the client. Used mutexes to synchronize access to shared resources on both the client and the server.
Testing: Conducted unit tests to verify lock behavior under various conditions, including multiple clients requesting the same lock and timeout scenarios. Used logging to trace execution paths and ensure correctness. Beyond testing individual functions, ran ./bin/dfs-client-p2 mount to bind the client to a path, then deleted and modified files through the file manager to observe synchronization on the other side.