My Unsolicited Opinions on Tidy Git Repositories#
Today, I will attempt to list some of the key components of a “good” public git repository that make me feel warm and fuzzy when I see them.
No Instance-Specific Data#
When you publish an open source project to the internet, you are contributing your ideas, expressed in code, to the world. That code should not carry anything more than that. It is easy to accidentally release information about yourself by adding everything in a directory to your repository by default or by including the wrong configuration files. I recommend using `git add .` sparingly. You can easily leak the type of operating system you use, the deployment solution you use and its configuration, and much more.
Good for Collaboration, Not Just Security#
Beyond keeping your personal data out of others’ hands, freeing your code from its connection to any specific instance greatly increases its capacity for collaboration. In a previous job, I worked as a full-stack web developer. I worked on a Mac, one coworker worked on a Linux machine, and another on a Windows machine. We each regularly deployed our code to a few different test environments, so there were around 10 instances of our code running at any given time in addition to the production copy. A few configuration files within our project had been added to Git by an inexperienced developer who started the project. During my time there, I spent far too many hours resolving merge conflicts in these configuration files, which were constantly being overwritten by whichever developer pushed them to our Git server last. This could easily have been avoided had someone invested the time in abstracting the configuration so that it worked on each of our systems without modification.
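One way to do that kind of abstraction, sketched here in Python, is to commit only defaults and lookup logic and let each machine or deployment supply its own values through environment variables. The variable names (`APP_DB_HOST`, `APP_DB_PORT`, `APP_DEBUG`) and the `Settings` class are hypothetical and only stand in for whatever your project actually needs.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Instance-specific values; the field names are illustrative only."""
    db_host: str
    db_port: int
    debug: bool


def load_settings(env=None):
    """Build Settings from environment variables, falling back to defaults.

    Only the defaults are committed. Each developer or deployment overrides
    them by exporting variables, so no tracked file changes per machine.
    """
    env = os.environ if env is None else env
    return Settings(
        db_host=env.get("APP_DB_HOST", "localhost"),
        db_port=int(env.get("APP_DB_PORT", "5432")),
        debug=env.get("APP_DEBUG", "0") == "1",
    )
```

With something like this in place, the Mac, Linux, and Windows machines all run the same tracked files, and nobody has a reason to push a locally edited configuration.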
Overwriting Leaks Doesn’t Remove Them#
One of the nicest things about Git is that every commit records a snapshot of your project, and Git keeps those snapshots compact by storing unchanged files only once and compressing the rest. This makes version controlling a codebase very efficient and doable. However, it also means that every mistake you commit and later fix is still part of the repository, because anyone can view the history of commits. So, if you accidentally include a configuration file with a password inside it and then delete the file in a later commit, you are not safe. Anyone could look back through your repository’s history and uncover that file.
Now, how to solve this issue… that’s a tricky question. There are ways you can do this with the `git filter-branch` command, but I am not really sure how that works. Similar to my professor who said “I don’t write bugs”, I simply don’t make mistakes in my git repos.
Oh, if only that were true… In reality, when I discover a mistake of this type, I usually create a new repository and push only the most recent code to it so that the history disappears. GitHub’s documentation on removing sensitive data from a repository describes how to do this more cleanly. I have also used the BFG Repo-Cleaner with great success.
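Whichever cleanup route you take, it is worth checking whether the leaked file is still reachable afterwards. The following is a minimal sketch, assuming `git` is on your `PATH`; the helper name `commits_touching` is made up for this post. It only inspects commits reachable from the refs in your local clone, so it says nothing about forks or copies other people already pulled.

```python
import subprocess


def commits_touching(repo_dir, path):
    """Return hashes of commits reachable from any ref that changed `path`.

    That includes the commit that added the file and the one that deleted it.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--all", "--full-history",
         "--format=%H", "--", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()
```

An empty list for something like `commits_touching(".", "config/secrets.yml")` (a hypothetical path) suggests that the file no longer appears in any commit your branches and tags can reach.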
Single Responsibility Principle#
In software engineering, one of the most widely acknowledged principles is the single responsibility principle. This principle is described on Wikipedia as follows: “every module, class or function in a computer program should have responsibility over a single part of that program’s functionality, and it should encapsulate that part.” I firmly believe that this concept should be applied to git repositories as well. It is poor design to place multiple projects within a single repository. The following are some examples of how this should be applied.
A single git repository should be used for a single project. For example, a series of related projects (e.g. for a class) should be placed in separate repositories. A client-server project is a trickier example. I firmly believe that a repository should not contain subdirectories with separate projects for the client and the server, but should instead be split into two repositories: one for the frontend and one for the backend. Since there is a disconnect between the two parts of the application (usually the network gap), their codebases should be maintained separately. Although it may be unlikely, especially for smaller projects, a stranger could come along and re-implement your client to interact with your deployed server, which could be a perfectly valid way to interact with the application. In this situation, one client may be more official because the server’s developer created it, but both would be equally suitable options, so they should be equally disconnected from the server’s codebase.
Placing a .gitignore in each directory of a git repository is bad practice. You should use a single top-level file that references files within subdirectories. For example, if you are a macOS user, place the line `**/.DS_Store` in that top-level .gitignore so that it matches in every directory.
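If you want to audit an existing project for this, a short script can list every .gitignore that sits below the top level. This is a minimal sketch using only the Python standard library; the function name `nested_gitignores` is hypothetical.

```python
import os


def nested_gitignores(repo_root):
    """Return paths, relative to repo_root, of .gitignore files below the top level."""
    root = os.path.abspath(repo_root)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Never descend into Git's own metadata directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        if dirpath != root and ".gitignore" in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, ".gitignore"), root))
    return sorted(found)
```

Each path it returns is a candidate for folding into the single top-level file.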
A bit of a side note here, but if you include a build system in your repository, you should make sure that running `make`, or building your project with whatever system you use, does not leave new or modified files for Git to pick up. You should be able to run `git status` immediately before and immediately after building and see no changes.
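That before-and-after check is easy to automate. Here is a minimal sketch, assuming `git` is installed and that your build can be started with a single command; both function names are made up for illustration.

```python
import subprocess


def git_status(repo_dir):
    """Return the machine-readable output of `git status` for repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "status", "--porcelain"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def build_leaves_status_unchanged(repo_dir, build_cmd):
    """Run build_cmd inside repo_dir and report whether `git status` changed."""
    before = git_status(repo_dir)
    subprocess.run(build_cmd, cwd=repo_dir, check=True)
    return git_status(repo_dir) == before
```

For a Makefile-based project you might call `build_leaves_status_unchanged(".", ["make"])`. A `False` result usually means some build output is missing from the .gitignore.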
GitHub provides a great variety of .gitignore samples in its github/gitignore repository.
Single Build System#
During my time working in HPC centers installing “crapware” for users, I have witnessed the horrendous trend that is poorly designed build systems. Far too many times, I have had to hunt down every `Makefile` scattered through a project’s complicated directory structure. The build process for any repository should be consolidated into a single, top-level entry point that can be run in a single pass.
Last but not least, any clean Git repository contains a clear and organized `README.md`. This file is the landing page for the repository and should contain information about what the repository is, how it is structured, and how to use it. An unclear `README.md` limits collaboration and keeps your project from becoming widely used.
I throw up in my mouth when I see these problems in the real world. The biggest culprits? My classmates. And me (back when I was young and foolish). Maybe I just feel holier than everyone else for having cleaned up the presence of my code on the internet (which has largely meant eliminating it), but these points remain perfectly valid and important.
These problems demonstrate that human error is the weakest link in any system. There are countless ways to mess up the simplest things, and any of them can lead to massive consequences. Placing an IP address, username, and password in a `.gitlab-ci.yml` file could prove catastrophic, despite being an innocent mistake. Moral of the story? Follow these principles to keep your information safer and maintain a valuable presence as a developer.
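A cheap guard against that kind of innocent mistake is to scan files for obviously suspicious lines before committing them. The sketch below is deliberately naive: the two patterns and the name `find_suspicious_lines` are illustrative assumptions, not a real scanner's rule set, and it will happily flag harmless lines such as a password field that only references a CI variable.

```python
import re

# Illustrative patterns only; real secret scanners use far larger rule sets.
SUSPICIOUS = {
    "credential assignment": re.compile(
        r"\b(password|passwd|secret|token)\b\s*[:=]\s*\S+", re.IGNORECASE
    ),
    "IPv4 address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_suspicious_lines(text):
    """Return (line_number, label) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPICIOUS.items():
            if pattern.search(line):
                hits.append((number, label))
    return hits
```

Running it over the contents of a CI file before `git add` gives you one more chance to catch a hard-coded address or password while it is still only on your machine.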