Welcome to…the [finite state] Machine

TL;DR: https://gitmachine.zubin.io is an interactive demo that models the local change management process in Git repositories.

I recently leveraged xstate to build a finite state machine that models how local changes are managed in a git repository. This little experiment quickly turned into an interactive demo that allows users to issue actions (some simple git-related terminal commands) and behold the resulting state.

A finite state machine represents (or models) a dynamic system as a set of two or more discrete states, along with rules governing which transitions between states are allowed (or disallowed). A canonical example is a basic traffic light, which is always in one of three states: green, yellow, or red. A green light cannot transition directly to a red light; it must first transition to a yellow light. Similarly, the red light can never transition to the yellow light; it can only transition to the green light. In this analogy, the color of the light represents a discrete state, to use the parlance of finite state machines.
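To make this concrete, here is a minimal sketch of the traffic light modeled in xstate (illustrative only; the state and event names are my own invention):

const { createMachine } = require('xstate');

// Each state permits exactly one transition, mirroring the rules above.
const trafficLightMachine = createMachine({
  id: 'trafficLight',
  initial: 'green',
  states: {
    green:  { on: { TIMER: 'yellow' } }, // green may only become yellow
    yellow: { on: { TIMER: 'red' } },    // yellow may only become red
    red:    { on: { TIMER: 'green' } }   // red may only become green
  }
});

Sending any other event while in a given state is simply ignored, which is precisely what makes state machines so predictable.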

Similar to the basic traffic light model, local changes in a git repository must fall into one of a set of discrete states: clean (no changes), unstaged, staged, or committed. If you click on an example command (left-hand panel), the corresponding state will be highlighted. The command will also display in the mock terminal panel (right-hand panel), so you can view the order of actions, as if it were your command-line history.

Remember, certain commands cannot change the state, depending on what the current state is. Even if you are very comfortable using git on a daily basis, you may only be vaguely aware that certain commands have no effect while in certain states, even if you’re not sure of the exact reason. For example, you can NOT check out a change:

git checkout .

…if the change has already been staged using the following command:

git add new_file.txt

In the above scenario, the file is currently in the staged state, and in order to move the change back to the unstaged state, you would have to run:

git reset .
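For the curious, the four states described above might be modeled in xstate along these lines (a simplified sketch of the idea, not the demo’s actual implementation):

const { createMachine } = require('xstate');

const gitChangesMachine = createMachine({
  id: 'localChanges',
  initial: 'clean',
  states: {
    clean:     { on: { EDIT: 'unstaged' } },                       // modify a file
    unstaged:  { on: { ADD: 'staged', CHECKOUT: 'clean' } },       // git add . / git checkout .
    staged:    { on: { COMMIT: 'committed', RESET: 'unstaged' } }, // git commit / git reset .
    committed: { on: { EDIT: 'unstaged' } }                        // the cycle begins again
  }
});

// Note there is no CHECKOUT event in the staged state -- exactly the
// restriction described above.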

Unfortunately, in its current (dare I say?) state, this interactive demo does not do a good job of indicating which states are available to transition to. Incidentally, this is a real source of confusion for developers using Git, too! In the future, I might add an icon or indicator to show that a command is either enabled or disabled for a given state of the machine.

As a last note, another command that I did not account for in this interactive demo is git stash, only because stashing changes actually creates a whole new state machine, separate from (or embedded within) the currently implemented one. In the future, I might add a second state machine (representing the stashing process) that would allow users to volley between the current implementation and a new (secondary) state machine that models how stashes work. In any case, you can play around with this demo here: https://gitmachine.zubin.io

Learning DynamoDB: Single-Table Design

A few years ago, I started using DynamoDB for a project and assumed it would be easy to pick up since I already had experience with NoSQL databases. In the past, I used MongoDB for simple social media applications and Redis as a data store for queues, session management, and caching. Those NoSQL solutions worked well for me, but they were also relatively simple use cases.

SQL Databases are relational, but each NoSQL database is non-relational in its own way.

Leo Tolstoy, Anna Karenina

I’m sure I will one day regret the above misquote 😉

In any case, enter DynamoDB: While working with a team of independent contractors, I started building a custom web-based e-learning platform which leveraged DynamoDB as the application’s primary (and only) database. As is not uncommon in situations with brand-new technologies and competing deadlines, we neglected to learn the fundamentals of DynamoDB early on and continued to build new schema as if it were just another SQL database. Things got out of hand pretty quickly. The most critical mistake we made was continuing to create a new table for each new entity in the system. It was not until much later that I dove into the fundamentals of DynamoDB to better grasp the value proposition of this oft-misunderstood database.

While designing schema with DynamoDB can get complicated pretty quickly, let me assure you, the fundamentals are fairly easy to understand. This post will not be a comprehensive treatment of DynamoDB, but it will address one of its most important features: single-table design. At a high level, one of the most important value propositions of NoSQL databases is fast reads from large tables at “hyperscale”. Smarter people may disagree with me on this point, but you will have to keep this assumption in mind if you continue to read about my lessons learned. I’m especially interested in preaching to the uninitiated: developers who have just come from a SQL environment, or who, like myself, have just enough NoSQL experience to get by.

Single-table design involves storing multiple different types of items per table. Using e-commerce as an example, you could store a customer’s user data (first name, last name, etc) along with the user’s order information (order date, quantity, etc) in a single table. This is not only reasonable in DynamoDB, it is the recommended approach. If you find yourself retrieving single records from multiple tables in DynamoDB, you might want to consider consolidating those different items into a single table. Single-table design relies on a composite “primary key”, meaning the primary key consists of two separate fields: a partition key and a sort (or range) key. For example, the table below might declare userId as the partition key and the literal string user as the range key. A customer’s order records can also be stored in the exact same table, using the userId as the partition key, but this time using the orderId as the range key, which could yield some data like the following:

partition key | range key    | additional attributes
12345         | user         | (first, last, created_at, etc)
12345         | order#223344 | (product_id, ordered_on, quantity, etc)
12345         | order#551209 | (product_id, ordered_on, quantity, etc)

In the parlance of DynamoDB, the above is often called an ItemCollection, defined as a logical collection of records grouped by partition key. You would never encounter this in a relational database, since the order records would normally be stored in a separate table and linked via a foreign key. Also, as you might have encountered in other NoSQL solutions, the schema is almost infinitely flexible, except for the requirement of a unique primary key for each Item. At any time you like, you can add (or omit) an attribute on any item. The closest analog to this is the way JSON documents are stored in MongoDB (or even PostgreSQL and MySQL). Just beware: it’s up to you to make sure that an attribute exists before you attempt to operate on any item.
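One practical consequence: you can retrieve that entire item collection (the user record plus all of their orders) with a single query on the partition key. Here’s a sketch using the AWS SDK for JavaScript (the table name UserOrders is hypothetical, matching the example table above):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const params = {
  TableName: 'UserOrders',               // hypothetical table name
  KeyConditionExpression: 'userId = :u', // partition key only; every range key matches
  ExpressionAttributeValues: { ':u': '12345' }
};

// One round trip returns the user record AND both order records.
docClient.query(params).promise()
  .then(data => console.log(data.Items));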

Consider the following remark by Alex Debrie, DynamoDB guru and author of The DynamoDB Book:

A single, over-loaded DynamoDB table looks really weird compared to the clean, normalized tables of your relational database. It’s hard to unlearn all the lessons you’ve learned over years of relational data modeling.

– Alex Debrie (https://www.alexdebrie.com/posts/dynamodb-single-table/)

If you’ve spent some time in the SQL world, you might have developed an allergy to de-normalized data, but you should know that single-table schema design is idiomatic in DynamoDB. Hopefully, you’ll get a chance to leverage the power of this feature!

As much as I am advocating for the positive benefits of using DynamoDB, I also want to mention at least a few of the drawbacks. The query syntax is much more cumbersome than SQL, which is a “natural” query language that even non-developers find easy to learn. The following is a query example from the AWS DynamoDB docs:

// Return a single song, by primary key
{
    TableName: "Music",
    KeyConditionExpression: "Artist = :a and SongTitle = :t",
    ExpressionAttributeValues: {
        ":a": "No One You Know",
        ":t": "Call Me Today"
    }
}

^^^ Pretty awkward, right? You can probably work around this awkwardness by using some sort of ORM (object-relational mapper), but then you have a second problem 🙂

Another issue to watch out for is type safety. Unlike relational databases, DynamoDB does not enforce attribute types across items, which means you could store an attribute isManager as true (native boolean) in one item, and “true” (literal string) in another item. The same goes for number types. For some developers this is a “feature” that allows one to move quickly during development without the friction of having to produce boilerplate schema definitions for every new table.
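For example, DynamoDB will happily accept both of these writes (a sketch continuing with the hypothetical UserOrders table; the sk attribute stands in for the range key), leaving your application code to sort out the mess later:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Nothing stops two items from disagreeing about isManager's type.
docClient.put({
  TableName: 'UserOrders',
  Item: { userId: '12345', sk: 'user', isManager: true }   // native boolean
}).promise();

docClient.put({
  TableName: 'UserOrders',
  Item: { userId: '67890', sk: 'user', isManager: 'true' } // literal string!
}).promise();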

And lastly, if you are given requirements to produce some analytics data (OLAP) using DynamoDB, you may need to construct a custom ETL (extract, transform and load) process to create downstream reporting tables, and this type of activity is not as well supported as it is in established relational databases like MS SQL Server. Copying data from one table to another requires a scan, and scanning an entire table in DynamoDB is not exactly an anti-pattern, but it is something you only want to do as a last resort. The way to avoid scanning is to either use a global or local secondary index (a solid DynamoDB feature which is outside the scope of this post), or create an ETL process as mentioned above. To be clear, it’s perfectly normal to query an entire table in a typical relational database, but scanning a table in DynamoDB should be done very sparingly.

Lastly, single-table design itself has some downsides. Your application may require new data access patterns in the future that are not compatible with your initial schema design, which could force significant refactoring. However, I would argue that this is not a new problem specific to DynamoDB, as you could easily encounter the same issue in highly normalized SQL databases. If you have ever had to join multiple tables to retrieve some critical piece of information in a high-frequency application request, you’ve probably run into this problem. It’s pretty common for enterprise applications to have a business-critical feature (or module) that runs god-awful slow due to a query that joins twelve tables. It’s almost inevitable, and it’s one of the reasons that NoSQL solutions rose in popularity in the first place.

If you want to learn more about how to use DynamoDB from a real expert, do yourself a favor and check out Alex Debrie’s excellent https://www.dynamodbbook.com/. This book has all the information you could ever want to know about DynamoDB and was a tremendous help for me in researching this blog post. You can also find a less-detailed (but still quite helpful) free online guide here: https://www.dynamodbguide.com/what-is-dynamo-db.

Building in Public

While leveling up in my guitar practice, I’ve frequently returned to a book called The Advancing Guitarist, by the esteemed (if not widely known) jazz guitarist Mick Goodrick. In it, he (lightly) prescribes a few exercises for learning scales by playing them vertically (up and down a single string), as opposed to horizontally (e.g. CAGED, pentatonic forms), the way most guitarists initially learn to do. In his book (p. 12), he recommends the guitarist write down every combination of an individual string and modal scale on a piece of paper, cut the paper into strips, and randomly select the combinations from a hat to spur your practice. But I’m so lazy, I couldn’t be bothered with writing things down on paper, let alone searching for scissors, so instead I built a website to do it for me. I’m excited to announce the release of a new application/website I’ve been working on over the past couple of months: https://unostring.com.

The basic premise of this app/website is to randomly generate combinations of a single guitar string from the set E-A-D-G-B-E and a single modal scale (e.g. Ionian, Dorian, Mixolydian). As a bonus, I wanted it to also provide some backing tracks, either a drone corresponding to a particular mode, or an actual musical vamp (jazz, rock, etc).

I built UNOString using JavaScript, GatsbyJS, and regular ol’ CSS. On top of this, I embedded some organ drone MP3 samples to accompany the practice of each modal scale. I initially tried using the Web Audio API to create synth sounds programmatically, but this proved to be too difficult for the initial release. Digital audio is a deep subject, and I might have underestimated how hard it would be to use out of the box as a beginner. I had no problem creating some basic synth sounds using a couple of OscillatorNodes, but I ran into problems trying to reduce the initial volume. Ultimately, I fell back to recording some MIDI organ samples in Logic Pro X and then exported WAV files from the DAW, which worked fine for this initial release.
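(For what it’s worth, the usual way to tame an oscillator’s volume is to route it through a GainNode rather than connecting it straight to the speakers. A minimal sketch of what I was attempting:)

const ctx = new AudioContext();
const osc = ctx.createOscillator();
const gain = ctx.createGain();

osc.frequency.value = 220; // A3, a reasonable drone pitch
gain.gain.value = 0.1;     // scale the output down to 10% volume

osc.connect(gain);         // oscillator -> gain -> speakers
gain.connect(ctx.destination);
osc.start();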

There are a lot of things I’d like to improve about the site. The user experience is a bit clunky, and the style/layout could use some polishing, but overall it has landed at a good stopping point. Next up, I will probably provide more interesting backing audio samples for each mode, something more musical (e.g. jazzy, groovy, rock-and-roll). I’ve also thought about giving users/students the ability to track how many mode/string combinations they have practiced over how many minutes/hours, but that’s probably way down the road.

One more note about implementing the site: I will eventually switch from plain CSS to styled-components, but in this first phase, I was curious to see how far I could get using vanilla CSS (HINT: just far enough). And I definitely want to return to learning more about the Web Audio API, as I think there are a lot of interesting possibilities for improving this application with it.

Play around with the site and let me know what you think! Feedback and requests for features are most welcome.

`git [what] -p` ALL the things!

A while back, I was doing a lot of code changes that sprawled throughout the codebase, and when I finally completed a single chunk of work mixed in with a lot of superfluous code changes, I was left with the final task of committing only a subset of unstaged changes in my local repository. I had several files that contained multiple changes which either needed to be committed or deleted, and I didn’t have a way to do it other than manually.

I’m so lazy (or just allergic to reversing changes manually) that I wondered how I could selectively check out chunks of changes from the files, while leaving others in place, so that I could stage and subsequently commit the remaining changes. A minimal search yielded this great Stack Overflow post explaining how to selectively check out chunks of code. Sweet! TL;DR: git checkout -p allows you to interactively discard unstaged code changes in your local repository.

After finding that post, I quickly shaped up my next commit and moved on to other tasks. But that episode reminded me that another way to solve the problem would have been to execute the inverse of the above-mentioned scenario and use git add -p to selectively stage chunks of code changes for commit (as opposed to removing unwanted code from potential commits). This is useful when you are experimenting with a lot of different code changes in your local repository (perhaps you like to liberally sprinkle console logging everywhere when fixing obscure bugs?), and it’s more efficient to try multiple approaches without immediately cleaning up after yourself, because you’re trying to get into the flow of understanding or working on some hard problem. Using the -p flag will allow you to stage one code diff and discard another in the same file. The typical workflow is to use git add -p to selectively stage ONLY the code diffs that solved your problem for a given commit. After choosing the correct code diffs, commit what you’ve staged, and then run git checkout . to get rid of the remaining cruft that didn’t work. See how that works?
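Here’s roughly what that workflow looks like in practice (hunk output omitted; the exact prompt varies slightly between Git versions):

git add -p                # walk through each hunk: y to stage, n to skip
# Stage this hunk [y,n,q,a,d,s,e,?]? y
git commit -m "fix the bug (for real this time)"
git checkout .            # discard the leftover experiments and console.logs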

You can read more about adding commits in chunks on the Git SCM website. You might be wondering, what’s so special about -p? What does p even stand for? Git repositories are sometimes described as a tree of linked lists where the nodes in the list are patches. p is an abbreviation for “patch”!

On a closing note, the nice thing about Git is that it tries (like many other UNIX tools) to maintain standard flags across all operations (e.g. -p!). Try using git log -p and see what happens.

Git: Undo a commit

At some point, a new user of Git will accidentally commit some undesirable changes. This post will address a few scenarios where you might want to “undo” a commit.

In the first case, you’ve committed some terrible code, and you simply want to blow away the previous commit and start over. Remember, each commit is uniquely identified by a hash, and you’ll need to locate the hash of the commit just previous to the unwanted commit you just created. To undo the most recent commit, you’ll type:

git reset --hard [hash_of_commit_previous_to_unwanted_commit]

WARNING: Much like ‘--hard’ implies, this will completely blow away the unwanted commit. You will not be able to retrieve the changes from that commit in a way that is…economical. What if you want to undo the commit, but you still want to further manipulate the changes it contained? A common scenario might be that you want to improve the changes, or that you simply want to recover a specific portion of the commit. Then, you simply modify the above command:

git reset --soft [hash_of_commit_previous_to_unwanted_commit]

OR

git reset [hash_of_commit_previous_to_unwanted_commit]

There is a slight difference between resetting to a previous commit using --soft versus no option at all. In the first command (reset using --soft), we’ve undone the previous commit and pulled the changes back into the “staged” state, which is essentially the status the files had right before you originally committed the unwanted changes. The second command (reset, without specifying an option) returns the changes to “unstaged”, which means that if you want to make some additional modifications and then commit them, you’ll need to add (or “stage”) them before a subsequent commit.

All of the above examples require you to seek out the specific hash of the commit to which you want to reset the current branch, and this can be tedious. You can accomplish the same general goal (using similar options for each type of reset) as the above commands, by simply typing:

git reset HEAD~1

Perhaps you’ve seen the term HEAD in command-line output with Git and have always wondered what it means. Basically, HEAD is a symbolic reference to the most recent commit (or “tip”) of a branch. By specifying HEAD~1, you’re saying: I want to go back exactly one commit from HEAD. The convention HEAD~[some-number-of-commits-back] is a common shorthand for specifying the target of other Git commands as well (such as rebase and log), so be sure to get comfortable using it. It will come in very handy when executing more advanced operations with Git.
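To summarize the three flavors of undo using the HEAD~1 shorthand:

git reset --hard HEAD~1   # discard the commit AND its changes (use with care!)
git reset --soft HEAD~1   # undo the commit; the changes remain staged
git reset HEAD~1          # undo the commit; the changes return to unstaged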

There you go…now you can undo a commit! Git luck to you!

What is Git revert?

When using Git, one of the first problems people encounter is how to undo a commit. Taking a quick look at the help docs, it would seem that “revert” is the way to undo a commit. In Subversion, for example, “revert” actually blows away uncommitted changes to local files. But with Git, “revert” means something entirely different. This is the first in a series of posts on using Git, and I’ve set up a GitHub repository called git-examples to act as a practical reference for some of these issues.

What happens when you revert? Consider the following Git command:

 git revert <hash-of-commit-to-reverse>

This will create an additional commit that represents the inverse of the commit you want to revert. The commit message will look like this:

[Image: detail of the revert commit as displayed on GitHub]

The default commit message for a reverted commit will indicate the hash (c74ba1e) of the specific commit that is being reverted.

So what is actually happening when you revert a commit? Simply put, Git creates a commit that is the exact inverse of the targeted commit. Git is NOT undoing the commit; it is literally creating a new commit that reverses the targeted commit’s changes. The new commit will re-add any code that was deleted in the targeted commit, and delete any code that was added. The original commit will continue to exist in your commit history for eternity! Newcomers to Git are often confused when they see a brand-new commit in their branch, with the original commit continuing to exist.
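Concretely, here is the shape of a revert (the original commit’s subject line below is made up; Git reuses whatever the targeted commit’s subject was):

git revert c74ba1e
# Git generates a new commit whose default message looks like:
#
#   Revert "add example lines to README"
#
#   This reverts commit <full hash of c74ba1e>.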

Take a look at the original commit:

[Image: the original (target) commit]

And here is the new commit created by reverting:

[Image: the new inverse commit created by the revert]

Note that the code that was added in the target (original) commit was then deleted in the subsequent commit created by the revert command. For a more in-depth look at those commits, take a look at the branch that I’ve created to demonstrate the use of the revert command.

After grasping what reverting commits is all about, you may decide that using revert is unnecessary, since you only need to undo your most recent commit. If that’s the case, you’ll want to use the reset command (which I will discuss in a future post).

So why would anyone use the revert command? IMO, there are two special cases that may necessitate using revert:

  1. The commit you want to undo is far back in the commit history, and it’s too late to reset or interactively rebase (I’ll talk more about interactive rebasing in later posts). The example above is dead simple, but in real life, the commit you want to revert may encompass complicated changes across multiple files, and revert guarantees to reverse exactly those changes.
  2. Using revert is a way to document a specific code change. It indicates to future developers (and readers of the commit history) that someone very deliberately corrected changes from a previous commit.

For a more in-depth explanation of how reverting works in Git, take a look at this great post.

NOTE: The site gitready.com is, hands-down, the BEST resource that I’ve found on the web for all things Git-related!

Fundamentals of Git

I first started using Git on professional projects about two years ago. Throughout my career, I’ve used the whole gamut of version control systems: CVS, VSS, TFS, Subversion, Perforce and Git, in that order. Of all these tools, Git had the most difficult (notoriously so) learning curve. After a year or so, I gained enough working knowledge to use Git very effectively, even if I didn’t totally understand how things worked under the hood. As I spent time guiding other newbies using it for the first time, I delved deeper into the fundamentals and also learned some of its more esoteric functionality. At this point, going back to traditional VCSs feels dramatically limiting.

For those of us who have used only traditional version control systems, Git introduces an explosion of new functionality and terms (push, pull, rebasing, remotes) and also requires that we mentally re-map previously canonical version control terminology like `checkin`, `checkout`, `sync`, `commit`, and others.

Distributed Version Control Systems

Git is a distributed version control system, meaning a local copy of the repository is a fully autonomous repository with the entire commit history and all the capabilities of a traditional server-side repository. All day-to-day source control operations, like committing changes, viewing the history of a file, and creating a new branch, happen entirely in your local repository. This is often befuddling to new users of Git who are accustomed to traditional source control systems, where similar operations are so tightly synchronized between client and server that they appear to be a single operation.

While a Git repository is self-sufficient and autonomous, programmers will obviously still need to synchronize changes with a remote repository at some point. This is achieved with the commands `push` and `pull`.

Clones

Cloning is how you retrieve a full copy of the repository of interest. You can clone a repository locally as many times as you like.

`git clone git@github.com:joyent/node.git`

The default result of this operation is that a full copy of the repository will be created in a directory called `./node`.
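If you’d prefer a different directory name, you can pass a target path as a second argument (the name below is arbitrary):

`git clone git@github.com:joyent/node.git my-node-copy`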

Remotes

Non-distributed VCSs are based on the traditional client-server model: the server is centralized and all clients sync with that server. With Git, your repository ‘tracks’ a remote server by default, but your local repository can track multiple remote servers. This sounds crazy complicated to the uninitiated, but it’s intrinsic to how a lot of open source software is developed. After cloning a repository locally, type:

git remote

This will list all of the remote repositories that your local repository is tracking. Git neophytes should simply note that the authoritative remote server is conventionally referred to as “origin”. When you see “origin” in Git command output, this is a reference to the remote server that is being tracked for the current branch. Don’t ascribe too much significance to this, as it’s literally just a naming convention. You can easily change “origin” to any name you desire.
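For example, you can inspect your remotes and rename “origin” to whatever you like (the new name below is purely illustrative):

git remote -v                        # list remotes along with their fetch/push URLs
git remote rename origin mothership  # "origin" is only a convention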

Branches

One of the most important features of Git is how branching works. People often claim branching is “cheap” in Git. The reasons for this are:

  • branches are created entirely locally (no server-side operation)
  • branches are NOT copies of a file system; rather, they are references to a commit

This second point is critical, as it makes branching a very fast operation. If you’ve ever branched in Subversion, you’ve probably noticed it takes a while for the operation to complete, as the entire subset of files is copied; the branch is actually newly generated in the file system. Similarly, in Perforce, files are copied and then directed to `populate` (a P4 command). The consumption of file system space, plus the cost of the read/write operations, causes most people using traditional version control systems to branch sparingly and with great caution. Git, however, supports frequent and near-instantaneous branching by simply creating a pointer to a specific commit in history. Nothing is copied! As a result, Git gives programmers the ability to experiment more easily, and to version changes to source code at a more granular level, without incurring the latency of a client-server connection.
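Creating and switching to a new branch is therefore a pair of trivial, purely local operations (the branch name is arbitrary):

git branch wild-experiment     # create a new pointer to the current commit; nothing is copied
git checkout wild-experiment   # switch to it (or do both at once: git checkout -b wild-experiment)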

Rebasing

This is easily the most controversial aspect of Git, especially when it comes to the schism of opinion over how teams should use it. Rebasing is the act of rewriting a select portion of branch commit history. Imagine you are a developer in this most common scenario:

1. You create a branch to start coding a new feature.
2. You make several commits to the new feature branch.
3. Changes from another teammate have been committed to the parent branch, so you fetch these latest changes from the remote server and rebase your commits onto them by typing:

git rebase origin/<name_of_parent_branch>

*…Deep breath*. The above scenario can be pretty mind-bending for neophytes. Rebasing works like this:

  • Git rewinds all your local commits and sets them aside in a safe place
  • the changes that were fetched from the server are fast-forwarded onto your feature branch
  • your previously saved local commits are then replayed, one by one, on top of the updated feature branch

Rebasing is an incredibly powerful feature that allows you to rewrite commit history, change the order of commits, and combine multiple commits into a single commit.
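As a taste of that power, interactive rebase opens your recent commits in an editor as a “todo” list (the hashes and messages below are made up):

git rebase -i HEAD~3
# the editor presents something like:
#   pick a1b2c3d add login form
#   squash e4f5a6b fix typo in login form
#   pick 9f8e7d6 update docs
# changing "pick" to "squash" melds a commit into the one above it,
# and reordering the lines reorders the commits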

Still confused as to why this is useful? That’s okay, because rebasing is Git’s MOST misunderstood concept, which I will address in more depth in future posts.

In Closing

For many programmers, version control systems are like a substitutable public good (e.g. water, power), but Git and other distributed version control systems, like Mercurial, add a new dimension to the software development process. It’s perfectly normal for teams and companies to continue using Git the same way they used Subversion and TFS in the past, but it would behoove anyone to look at what Git offers beyond the traditional source control system. In future posts, I’ll talk about some of the finer points of distributed version control and the nuances of Git.