Server-side write-safe file caching with Go


Keeping filesystem state in sync with a database that holds one-to-one records for the files can be a tricky task. In this article, I will explain how I implemented my solution to this problem.

All the code in the article is GPL-3 licensed.

Introduction

I have been developing a full-stack application called Mangatsu for storing, managing, and reading various image collections, such as manga. It consists of a backend server application written in Go with secure API access and user control, and a client application written in TypeScript and Next.js. In this article, I will focus on the backend server application.

The name "Mangatsu" is a play on the Japanese words "mangetsu" (満月, full moon) and "manga" (漫画, comic).

Mangatsu scans the directory paths provided by the user for manga, comics, doujinshi, and other collections (referred to as galleries from this point forward) and parses all the information it can from the filenames, included metadata files, and the pages themselves.

Mangatsu accepts two different directory structures:

📂 freeform
├── 📂 doujinshi
│       ├──── 📂 deeper-level
│       │     ├──── 📦 [Group (Artist)] 同人誌◯.cbz
│       │     └──── 📄 [Group (Artist)] 同人誌◯.json
│       ├──── 📦 (C99) [Group (Artist)] 漫画〇〇.zip
│       ├──── 📄 (C99) [Group (Artist)] 漫画〇〇.json
│       └──── 📦 (C88) [group (author, another author)] 単行本 [DL].zip # (JSON or TXT metafile inside)
├── 📂 art
│       ├──── 📂 [Artist] art collection
│       │     ├──── 🖼️ 0001.jpg
│       │     ├────...
│       │     └──── 🖼️ 0300.jpg
│       ├──── 📦 art collection XYZ.rar
│       └──── 📄 art collection XYZ.json
└── 📦 (C93) [group (artist)] 同人誌△ (Magical Girls).cbz

📂 structured
├── 📕 漫画1
│       ├── 📦 Volume 1.cbz
│       ├── 📦 Volume 2.cbz
│       ├── 📦 Volume 3.cbz
│       └── 📦 Volume 4.zip
├── 📘 漫画2
│       └── 📂 Vol. 1
│           ├── 🖼️ 0001.jpg
│           ├── ...
│           └── 🖼️ 0140.jpg
├── 📘 漫画3
│       └── 📂 Vol. 1
│           ├── 📦 Chapter 1.zip
│           ├── 📦 Chapter 2.zip
│           └── 📦 Chapter 3.rar
├── 📗 漫画4
│       ├── 📦 Chapter 1.zip
│       ├── ...
│       └── 📦 Chapter 30.rar
└── 📦 One Shot漫画.rar

As shown in the examples, numerous galleries are compressed in formats such as zip/cbz, rar/cbr, and 7zip. To avoid extracting the same file with each request, and to reduce server load and latency, it is not desirable to extract the file every time a user opens a gallery in the client. While caching the result on the client side is an option (and is used to some extent alongside server caching!), it can be unreliable, doesn't prevent abuse from bad actors, and, especially with applications like Mangatsu where the cached data can be huge, it is impractical for the client (browser, phone app, etc.).

Implementing server-side file cache

... which is relatively simple.

First, this is how I initialize the cache directories.

// InitPhysicalCache initializes the physical cache directories.
func InitPhysicalCache() {
	cachePath := config.BuildCachePath()
	if !utils.PathExists(cachePath) {
		if err := os.Mkdir(cachePath, os.ModePerm); err != nil {
			log.Z.Error("could not create cache dir",
				zap.String("path", cachePath),
				zap.String("err", err.Error()))
		}
	}
 
	thumbnailsPath := config.BuildCachePath("thumbnails")
	if !utils.PathExists(thumbnailsPath) {
		if err := os.Mkdir(thumbnailsPath, os.ModePerm); err != nil {
			log.Z.Error("could not create thumbnail cache dir",
				zap.String("path", thumbnailsPath),
				zap.String("err", err.Error()))
		}
	}
}

Next, implementing functions to read from the cache and write to it. Before extracting a gallery into the cache, a few checks are performed:

  1. Does the cache directory already exist?
  2. If it does, are there any files (i.e. pages) residing there?
  3. If yes, just return the residing files.
    • ※ If someone or something has corrupted the files in the cache, it would cause a problem for the end user here. I am considering adding further validation for this case.

// fetchFromDiskCache fetches the files from the disk cache and returns the list of files and the number of files.
func fetchFromDiskCache(dst string, uuid string) ([]string, int) {
	var files []string
	count := 0
 
	cacheWalk := func(s string, d fs.DirEntry, err error) error {
		if err != nil {
			log.Z.Error("failed to walk cache dir",
				zap.String("name", d.Name()),
				zap.String("err", err.Error()))
			return err
		}
		if d.IsDir() {
			return nil
		}
 
		// ReplaceAll ensures that the path is correct: cache/uuid/<arbitrary/path/image.png>
		files = append(files, strings.ReplaceAll(filepath.ToSlash(s), config.BuildCachePath(uuid)+"/", ""))
		count += 1
		return nil
	}
 
	err := filepath.WalkDir(dst, cacheWalk)
	if err != nil {
		log.Z.Error("failed to walk cache dir",
			zap.String("dst", dst),
			zap.String("err", err.Error()))
		return nil, 0
	}
 
	return files, count
}
 
// fetchOrExtractGallery extracts the gallery from the archive and returns the list of files and the number of files.
func fetchOrExtractGallery(archivePath string, uuid string) ([]string, int) {
	dst := config.BuildCachePath(uuid)
	if _, err := os.Stat(dst); errors.Is(err, fs.ErrNotExist) {
		return utils.UniversalExtract(dst, archivePath)
	}
 
	files, count := fetchFromDiskCache(dst, uuid)
	if count == 0 {
		err := os.Remove(dst)
		if err != nil {
			log.Z.Debug("removing empty cache dir failed",
				zap.String("dst", dst),
				zap.String("err", err.Error()))
			return nil, 0
		}
 
		return utils.UniversalExtract(dst, archivePath)
	}
	natsort.Sort(files)
 
	return files, count
}
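The ReplaceAll trimming in fetchFromDiskCache relies on exact string matching against the cache prefix. An alternative worth considering is filepath.Rel, which computes the relative path directly and returns an error if the file cannot be made relative to the cache root. A hedged sketch (relToCache is an illustrative helper, not part of Mangatsu):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// relToCache returns the path of a cached page relative to its gallery's
// cache directory, normalized to forward slashes regardless of OS.
func relToCache(cacheDir, file string) (string, error) {
	rel, err := filepath.Rel(cacheDir, file)
	if err != nil {
		return "", err
	}
	return filepath.ToSlash(rel), nil
}

func main() {
	rel, err := relToCache("cache/uuid", "cache/uuid/arbitrary/path/image.png")
	if err != nil {
		panic(err)
	}
	fmt.Println(rel) // arbitrary/path/image.png
}
```

Unlike string replacement, this cannot silently produce a wrong path if the prefix contains characters that also appear elsewhere in the file path.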

The next section details how to keep this file cache consistent when the server handles concurrent requests.

Implementing file locking

What happens if two or more requests try to access a gallery at the same time? Especially in the case of a non-cached gallery, it would cause many unnecessary writes, which in turn could corrupt the cached files themselves and, in the worst case, could be abused to crash the application or even the system.

To prevent this, I decided to use maps of mutexes. Essentially, I maintain a runtime map of all the cached galleries and their last access times, with mutexes to signal when a particular gallery cache is being written to.

type cacheValue struct {
	Accessed time.Time   // Used to determine when the cache entry was last accessed for expiring and pruning purposes.
	Mu       *sync.Mutex // Mutex to ensure file-write-safety when accessing the cache.
}
 
type GalleryCache struct {
	Path  string                // Path to the cache directory.
	Store map[string]cacheValue // Map of Mutexes and access times for each cache entry.
}
 
var galleryCache *GalleryCache

Initializing the map and reading the existing cache into the map by iterating and verifying the cache directory:

// InitGalleryCache initializes the abstraction layer for the gallery cache.
func InitGalleryCache() {
	galleryCache = &GalleryCache{
		Path:  config.BuildCachePath(),
		Store: make(map[string]cacheValue),
	}
 
	iterateCacheEntries(func(pathToEntry string, accessTime time.Time) {
		maybeUUID := path.Base(pathToEntry)
		if _, err := uuid.Parse(maybeUUID); err != nil {
			return
		}
 
		galleryCache.Store[path.Base(pathToEntry)] = cacheValue{
			Accessed: accessTime,
			Mu:       &sync.Mutex{},
		}
	})
}
 
// iterateCacheEntries iterates over all cache entries and calls the callback function for each entry.
func iterateCacheEntries(callback func(pathToEntry string, accessTime time.Time)) {
	cachePath := config.BuildCachePath()
	cacheEntries, err := os.ReadDir(cachePath)
	if err != nil {
		log.Z.Error("could not read cache dir",
			zap.String("path", cachePath),
			zap.String("err", err.Error()))
		return
	}
 
	for _, entry := range cacheEntries {
		info, err := entry.Info()
		if err != nil {
			log.Z.Error("could not read cache entry info",
				zap.String("path", cachePath),
				zap.String("err", err.Error()))
			continue // skip this entry instead of aborting the whole scan
		}
 
		pathToEntry := path.Join(cachePath, entry.Name())
		accessTime, err := atime.Stat(pathToEntry)
		if err != nil {
			log.Z.Debug("could not read the access time",
				zap.String("name", entry.Name()),
				zap.String("path", cachePath),
				zap.String("err", err.Error()))
			accessTime = info.ModTime()
		}
 
		callback(pathToEntry, accessTime)
	}
}

The galleries are read with the following functions, with the logic flowing as follows:

  1. Check if the gallery exists in the runtime cache map.
    1. If not, a new entry is created.
    2. If it does, only the access time is updated.
  2. Lock the mutex to block any other operations on the gallery.
  3. If the gallery already resides in the cache, only the filepaths are read and returned. If not, the archive is extracted into the cache.
  4. Return the filepaths and their count.
  5. Release the mutex (note the defer statement).

// Read reads the gallery while updating the mutex and access time.
func Read(archivePath string, galleryUUID string) ([]string, int) {
	if _, ok := galleryCache.Store[galleryUUID]; !ok {
		galleryCache.Store[galleryUUID] = cacheValue{
			Accessed: time.Now(),
			Mu:       &sync.Mutex{},
		}
	} else {
		galleryCache.Store[galleryUUID] = cacheValue{
			Accessed: time.Now(),
			Mu:       galleryCache.Store[galleryUUID].Mu,
		}
	}
 
	galleryCache.Store[galleryUUID].Mu.Lock()
	defer galleryCache.Store[galleryUUID].Mu.Unlock()
 
	return fetchOrExtractGallery(archivePath, galleryUUID)
}
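One caveat worth noting: plain Go maps are not safe for concurrent use, and Read and PruneCache run on different goroutines, so a hardened version should guard the Store map itself in addition to the per-entry mutexes. A sketch of that idea, with names of my own invention (not Mangatsu's actual types):

```go
package main

import (
	"sync"
	"time"
)

type cacheValue struct {
	Accessed time.Time
	Mu       *sync.Mutex
}

// galleryStore guards the map itself with a mutex; the per-entry
// mutexes still serialize the slow extraction work per gallery.
type galleryStore struct {
	mu    sync.RWMutex
	store map[string]*cacheValue
}

// getOrCreate returns the entry for uuid, creating it if missing, and
// refreshes its access time. Only the map operations hold the store
// lock, so extraction can proceed concurrently for different galleries.
func (g *galleryStore) getOrCreate(uuid string) *cacheValue {
	g.mu.Lock()
	defer g.mu.Unlock()
	v, ok := g.store[uuid]
	if !ok {
		v = &cacheValue{Mu: &sync.Mutex{}}
		g.store[uuid] = v
	}
	v.Accessed = time.Now()
	return v
}

func main() {
	gs := &galleryStore{store: make(map[string]*cacheValue)}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v := gs.getOrCreate("some-uuid")
			v.Mu.Lock() // simulate exclusive extraction work
			v.Mu.Unlock()
		}()
	}
	wg.Wait()
}
```

Storing pointers to cacheValue also avoids rewriting the whole struct just to bump the access time.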

Pruning expired cache entries according to their last access timestamps requires the use of mutexes as well. The PruneCache function runs every minute to clean up expired gallery caches.

utils.PeriodicTask(time.Minute, cache.PruneCache)
 
// PeriodicTask runs the given function in a separate goroutine, sleeping for the given interval between runs.
func PeriodicTask(d time.Duration, f func()) {
	go func() {
		for {
			f()
			time.Sleep(d)
		}
	}()
}

The PruneCache function iterates through the entire runtime cache map, locks the mutex while checking the access timestamp, and if expired, removes the cache entry from the map and deletes the files.

To ensure safe file removal, I am careful to remove only cache directories whose names are valid UUIDs: the base name of the path is parsed as a UUID before anything is deleted.

// PruneCache removes entries not accessed (internal timestamp in mem) in the last x time in a thread-safe manner.
func PruneCache() {
	now := time.Now()
	for galleryUUID, value := range galleryCache.Store {
		value.Mu.Lock()
		if value.Accessed.Add(config.Options.Cache.TTL).Before(now) {
			if err := remove(galleryUUID); err != nil {
				log.Z.Error("failed to delete a cache entry",
					zap.Bool("thread-safe", true),
					zap.String("uuid", galleryUUID),
					zap.String("err", err.Error()))
			}
		}
		value.Mu.Unlock()
	}
}
 
// remove wipes the cached gallery from the disk.
func remove(galleryUUID string) error {
	// Paranoid check to make sure that the base is a real UUID, since we don't want to delete anything else.
	maybeUUID := path.Base(galleryUUID)
	if _, err := uuid.Parse(maybeUUID); err != nil {
		delete(galleryCache.Store, galleryUUID)
		return err
	}
 
	galleryPath := config.BuildCachePath(galleryUUID)
	if err := os.RemoveAll(galleryPath); err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			delete(galleryCache.Store, galleryUUID)
		}
		return err
	}
 
	delete(galleryCache.Store, galleryUUID)
 
	return nil
}
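The same UUID guard can also be expressed with the standard library alone. The sketch below uses a regular expression for the canonical 8-4-4-4-12 form instead of uuid.Parse (which is more lenient and also accepts variants such as urn-prefixed forms); safeToRemove is an illustrative name:

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// uuidRe matches the canonical 8-4-4-4-12 hex UUID form only.
var uuidRe = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// safeToRemove reports whether the base of p looks like a UUID, so that
// only cache entries named after gallery UUIDs are ever deleted.
func safeToRemove(p string) bool {
	return uuidRe.MatchString(path.Base(p))
}

func main() {
	fmt.Println(safeToRemove("cache/6ba7b810-9dad-11d1-80b4-00c04fd430c8")) // true
	fmt.Println(safeToRemove("cache/.."))                                   // false
}
```

Rejecting names like ".." outright is exactly the kind of paranoia wanted before an os.RemoveAll call.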
 

Closing

In this article, we've explored how to implement a simple file caching mechanism with mutexes in Go for a server. By using runtime maps and mutexes, safe concurrent access to cached galleries can be ensured while efficiently managing the cache. This approach not only reduces server load but also prevents potential issues from simultaneous access and cache corruption.

As the project evolves, I will continue to update this article with new insights and improvements. Thank you for reading, and I hope you found this guide helpful for your own projects!


By Marko Leinikka
