
Rewriting Your Git History and JS Source for Fun and Profit

This is a post in the Codebase Conversion series.


Or, how I spent a few weeks obsessing over a task that only I cared about

Intro

I just completed a large-scale rewrite and cleanup of our team's Git repository. I learned a lot working on this task and came up with some nifty and useful techniques in the process, so I'd like to share what I've learned.

In order to keep this post from getting any longer than it already is, I'll be referencing a bunch of external articles and assuming that you, the reader, have taken the time to read them and understand most of the concepts involved. That way I can go into more detail on the stuff I actually did.

To summarize the rest of the cleanup task:

  • I filtered out junk files from the repo's history, shrinking the base repo size by over 70%
  • I automatically rewrote our ES5 JavaScript source files to ES6 syntax throughout the entire history, as if they had "always been written that way"

I wrote a bunch of scripts and code for this task. I've created a repo with slightly cleaned-up copies of these scripts for reference. I'll show some code snippets in this post, but point to that repo for the full details.

Note: This post is absurdly technical and deep, even for me :) Hopefully people find this info useful, but I also don't expect it to be widely read. This is mostly a chance for me to document all the stuff I did, as a public service.

Background

Why Rewrite History?

My current project got started about six years ago. We used Mercurial for the first year, then migrated to Git. The repo's .git folder currently takes up about 2.15GB, with about 15,000 commits. There are several reasons for that. We've historically "vendored" third-party libs by committing them directly to the repo, including a lot of Java libraries. We've also had some random junk files that were accidentally committed (like a 135MB JAR file full of test images).

Unfortunately, because of how Git works, any file that exists in the historical commits has to be kept around permanently, as long as at least one commit references that file. That means that if you accidentally commit a large file, merge that commit to master, and then merge a commit that deletes the file, Git still keeps the file's contents around.

So, we had junk files in the history that should never have been committed, and we had old libraries and other binaries that were not going to be needed for future development. We just wrapped up a major development cycle, and I wanted to clean up the repo in preparation for the next dev cycle. That way everyone's clones would be smaller, which would also help with CI jobs.

Dealing with History Changes

However, Git commits form an immutable history. Every commit references its parents by their hashes. Every commit references its file tree by a hash. Every file tree references its files by the hashes of their contents. That means that if you literally change a single bit in one file at the start of the repo's history, every commit after that would have a different hash, and thus effectively form an "alternate history" line that has no relation to the original history. (This is one of the reasons why you should never rebase branches that have already been pushed - it creates a new history chain, and someone else might be relying on the existing history.)
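You can see those references directly with Git's plumbing commands. Inspecting a commit looks something like this (the hashes and names here are invented for illustration):

$ git cat-file -p HEAD
tree 9bedf67800b2923982bdf60c89c57ce6fd2e9a1c
parent 3607e2e32c559373729ae6468867cd50f4ba2d05
author Some Developer <dev@example.com> 1545944800 -0500
committer Some Developer <dev@example.com> 1545944800 -0500

Commit message here

Change any file blob and its tree's hash changes, which changes the commit's hash, which changes every descendant commit's parent pointer from there on out.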

My plan was to create a fresh clone of our repo, and rewrite its history to filter out the files we didn't need for future work. We'd archive the old repo, and when the next dev cycle starts, everyone would clone the new repo and use that for development going forward.

The Codebase

Our repo has two separate JS client codebases which talk to the same set of services. The services are written in a mixture of Python and Java.

The older JS client codebase, which I'll call "App1", started in 2013, and the initial dev cycle resulted in a tangle of jQuery spaghetti and global script tags. I was able to convert those global scripts to AMD modules about halfway through that first dev cycle, and spent the second half of the year refactoring the codebase to use Backbone. We continued to use Backbone for new features until early 2017, when we began adding new features in React+Redux, and refactoring existing features from Backbone to React. We didn't have a true "compile" step in our build process, so we were limited in what JS syntax we could use based on our target browser environment. I finally upgraded the build system from Require.js to Webpack+Babel in late 2017, and that allowed us to finally start using ES6 modules and ES6+ syntax. Since then, all of our new files have been written as ES6 modules, and we've had to do back-and-forth imports between files in AMD format and files in ES6 module format.

"App2" was originally written using Google Web Toolkit (GWT), a Java-to-JS compiler. We completed a ground-up rewrite of that client in React+Redux during this dev cycle (and I took great joy in deleting the GWT codebase, particularly since I'd written almost all of that myself). This codebase was written using Webpack, Babel, and ES6+ from day 1.

Because of its longer and varied dev history, App1's codebase is a classic example of the "lava layer anti-pattern" (and I will freely admit that I'm responsible for most of those layers). It's currently about 80% Backbone and 20% React+Redux, and we hope to finish rewriting all the remaining Backbone code to React over the next year or two. In the meantime, the mixture of AMD and ES6 modules is a bit of a pain. Webpack will let you use both, but you have to do some annoying workarounds when importing and exporting between files in different module formats (like adding SomeDefaultImport.default when using an ES6 default export in an AMD file).

The Plan Grows

Our team hasn't exactly been consistent with our code styling. In theory we have a formatting guide we ought to be following, but in practice... eh, whatever :)

I've been planning to set up automatic formatting tools like Prettier for our JS code and Black for our Python code. However, the downside of adding a code formatter to an existing codebase is that you inevitably wind up with a "big bang" commit that touches almost every line and obscures the actual change history of a given file. If you do a git blame (or "annotate"), at some point every line was last changed by "Mark: REFORMATTED ALL THE THINGS", which isn't helpful. There are ways to skip past that, but it's annoying.

At some point I realized that if I was going to be rewriting the entire commit history anyway by filtering out junk files, then I could also apply auto-formatting of the code at every step in the history, to make it look as if all our code had "always been formatted correctly". That led me to another, bigger realization: I could do more than just reformat the code - I could rewrite the code!

I'd seen mentions of "codemods" before - automated tools that look for specific patterns in your code and transform them into other patterns. The React team is especially fond of these, and has provided codemods for things like renaming componentWillReceiveProps to UNSAFE_componentWillReceiveProps across an entire codebase.

It occurred to me that I could automatically rewrite all of our AMD modules to ES6 modules, and upgrade other syntax to ES6 as well. And, as with the formatting, I could do this for the entire Git history, as if the files had been written that way since the beginning.

Naturally, I ran into a bunch of complications along the way, but in the end I accomplished what I set out to do (yay!). Here's the details of how I did it.

Filtering Junk Files from History

Short answer: use The BFG Repo-Cleaner. Done.

Slightly longer answer: it does take work to figure out which files and folders you want to delete, and set up a command for the BFG to do its work.

Note: most of these commands are Bash-based and require a Unix-type environment. Fortunately, Git Bash will suffice on Windows.

Finding Files to Delete

There are a couple of ways to approach this.

The first is to look for specific large files in the history. Per this Stack Overflow answer, here's a Bash script that will spit out a list of the largest files and their sizes:

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| awk '$2 >= 2^20' \
| sort --numeric-sort --key=2 \
| cut -c 1-12,41- \
| numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

The 2^20 filters for files of 1MB and larger, and can be adjusted to look for other sizes.

Another approach is to look at the names of every file path that has ever existed in the repo, to get an idea of what might not be relevant any more. Thanks again to Stack Overflow for this answer:

git log --pretty=format: --name-only --diff-filter=A | sort -u > allFilesInRepoHistory.txt

Skimming through the list of all files turned up a bunch of junk from early in the development history that wasn't in our current tree, and could safely be removed.

Preparing the Filtering Command

The BFG Repo-Cleaner supports deleting top-level folders by name, but not nested folders. For that, I had to write a script that looked for all files matching a given folder/path prefix, and write all matching blob hashes to a file that could be read by the BFG. (Pretty sure this started from another SO answer, but can't find which one atm.)

First, write a text file called nestedFoldersToRemove.txt with each path prefix to delete on a separate line:

ParentFolder1/nestedFolder1/
ParentFolder2/nestedFolder2/nestedFolder2a/
ParentFolder3/someFilePrefix

Then, run this script inside a repo to generate a file containing just the blob IDs that match those prefixes:

readarray -t folders < "../nestedFoldersToRemove.txt"

for f in "${folders[@]}"
do
    echo "Finding blobs for ${f}..."
    git rev-list --all --objects | grep -P "^\w+ ${f}" | cut -d" " -f1 >> ../foundFilesToDelete.txt
done

If you want to see which files are getting deleted, remove the | cut -d" " -f1 portion to generate lines like ABCD1234 Path/to/some/file.ext

Finally, put together a list of the top-level folders you want deleted as well.

In some cases, I wanted to nuke all the old files in a folder, and only keep what was in there currently. In those cases I went ahead and specified the whole folder as a search path, because the BFG by default will preserve all files in the current HEAD.

Running the BFG

Once you know all the files you want cleaned up, you need to make sure that your original repo does not have those in the tip of its history. If any of them still exist, add commits that delete those files.

Once that's done, it's time to nuke stuff!

# Clone the existing repo, without checking out files
git clone --bare originalRepoPath filteredRepo

# Run the BFG, deleting specific top-level folders and nested folders/files
java -jar bfg-1.13.0.jar --delete-folders "{TopLevelFolder1,TopLevelFolder2}" --strip-blobs-with-ids ./foundFilesToDelete.txt filteredRepo

After the BFG has rewritten the filtered repo, you need to have Git remove any blobs that are no longer referenced:

git reflog expire --expire=now --all && git gc --prune=now --aggressive

You should now have a much smaller repo with the same commit authors and times, but a different line of history.
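As a sanity check, you can compare the repo's object counts and pack size before and after with:

git count-objects -v -H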

Rewriting JS Source with Codemods

Codemods for Converting to ES6

There's plenty of tools out there for rewriting JS source code automatically. My main concern was finding codemod transforms to do what I needed: convert our AMD modules to ES6 modules, and then fix up a few more bits of ES6 syntax on top of that.

I found most of what I needed in two repos:

  • 5to6/5to6-codemod
    • amd: converts AMD module syntax to ES6 module syntax
    • named-export-generation: Adds named exports corresponding to default export object keys. Only valid for ES6 modules exporting an object as the default export.
  • cpojer/js-codemod
    • no-vars: Conservatively converts var to const or let
    • object-shorthand: Transforms object literals to use ES6 shorthand for properties and methods.
    • trailing-commas: Adds trailing commas to array and object literals

There are many other transforms I could have used, but this set was sufficient for what I wanted to do.

Writing a Custom Babel-Based Codemod

I mentioned that we had some funkiness in the JS code as a result of cross-importing between AMD and ES6 modules. One common pattern was that an AMD module would do:

define(["./someEs6Module"], 
function(someEs6Module) {
    const {named1, named2} = someEs6Module;
});

Transformed into ES6, this would be:

import someEs6Module from "./someEs6Module";

const {named1, named2} = someEs6Module;

However, the ES6 module in question might actually only have named exports, and no default export. This only worked because of Webpack's interop magic: when an AMD file imports an ES6 module, it receives the whole module namespace object, so destructuring named exports off of it happens to work.

When I tried running the transformed code, Webpack gave me a bunch of errors saying that these imports didn't exist. So, I opted to write a codemod that found all named exports, and if there was no default export, generated a fake default export object containing those, as a compatibility hack.
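To illustrate with made-up names: given a module that only had named exports, the codemod appends a generated default export object, so the existing default-import sites keep working:

// Before: named exports only
export const named1 = 1;
export const named2 = 2;

// After: same named exports, plus a generated compatibility default export
export const named1 = 1;
export const named2 = 2;
export default { named1, named2 };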

I couldn't find anything related to this that worked with jscodeshift. However, I did find jfeldstein/babel-plugin-export-default-module-exports, which almost did what I wanted. I figured I could hack together some custom changes to it. Since it was a Babel plugin, I needed a different tool to run that codemod. Fortunately, square/babel-codemod lets you run Babel plugins as codemods.

Thanks to the AST Explorer tool and some assistance from Twitter, I was able to hack together a plugin that did what I needed.
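I won't swear this matches the final version line-for-line, but the core of the plugin looked roughly like this sketch (it only handles exported variable and function declarations; the real AST has more cases to cover):

// Collect named exports; if there's no default export, generate one
module.exports = function({ types: t }) {
    return {
        visitor: {
            Program: {
                exit(path) {
                    let hasDefault = false;
                    const names = [];
                    for (const node of path.node.body) {
                        if (t.isExportDefaultDeclaration(node)) {
                            hasDefault = true;
                        } else if (t.isExportNamedDeclaration(node) && node.declaration) {
                            if (node.declaration.declarations) {
                                // export const foo = ...;
                                node.declaration.declarations.forEach(d => names.push(d.id.name));
                            } else if (node.declaration.id) {
                                // export function foo() {}
                                names.push(node.declaration.id.name);
                            }
                        }
                    }
                    if (!hasDefault && names.length) {
                        // Append: export default {named1, named2, ...};
                        path.pushContainer("body", t.exportDefaultDeclaration(
                            t.objectExpression(names.map(name =>
                                t.objectProperty(t.identifier(name), t.identifier(name), false, true)
                            ))
                        ));
                    }
                }
            }
        }
    };
};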

Initial Conversion Testing

I started off by trying to run each of these transforms as a separate step. I created a shell script that called jscodeshift multiple times, each with the path to a single transform file. I also called Prettier to do some formatting.

While doing that, I ran across some issues with Babel not recognizing the dynamic import() syntax, so I added a couple sed commands to rewrite those temporarily:

# Before anything else, replace uses of dynamic import
sed -i s/import/require.ensure/g App1/src/entryPoint.js

# Do other transforms
yarn jscodeshift -t path/to/some-transform.js App1/src

yarn codemod -p my-custom-babel-plugin.js App1/src

# Format the code
yarn prettier --config .prettierrc --write "App1/src/**/*.{js,jsx}"

# Undo replacement
sed -i s/require.ensure/import/g App1/src/entryPoint.js

Using this script, I was able to process a current checkout of our codebase automatically.

Formatting Python Code

While the majority of my focus was on our JS code, we've also got a bunch of Python code on our backend. I figured I'd take this chance to do some auto-formatting on that as well. I reviewed the available tools, and settled on Black, largely because its highly-opinionated style is almost identical to how we were writing our Python anyway.
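Setup-wise there's not much to it, either: Black is deliberately near-zero-config, so reformatting a whole tree in place is a one-liner (path hypothetical):

black path/to/python/src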

Iterating through Git History

My ultimate goal was to iterate through every commit in the history of the already-filtered Git repo, find any relevant JS and Python files, transform the versions of the files as they existed in that commit, and create new commits with the same metadata but updated file trees containing the transformed files.

I'd done some prior research and read through a bunch of articles. The most relevant was A tale of three filter-branches, which compared three different ways to iterate using the git filter-branch command, and showed that the third way was the fastest.

Understanding the Index Filter Logic

In general, git filter-branch will iterate over the commit history, and run whatever additional commands you want at each commit. These could be "inline" shell commands, or separate scripts / tools.

The third approach shown in that post involved using git filter-branch --index-filter, and checking each commit to see which files had actually been added/changed/deleted. Fortunately, the author of that post chose to write the per-commit logic as a Ruby script. Reading that was extremely helpful in understanding what was going on.

I'll summarize the steps:

  • Retrieve the current commit ID from the filter-branch environment
  • Look up the original parent commit ID
  • Look up the ID for the rewritten form of the parent commit
  • Reset the Git index to the tree contents of the rewritten parent commit
  • Diff the original parent tree and original commit tree to see which files changed
  • For each added/changed/removed file:
    • If it's a file we're interested in, transform it, and add the transformed version to the Git index
    • Any other added/changed files we don't care about should just be added to the index as-is
    • If it was removed, delete it from the index
  • Create a new commit with the original metadata and the transformed tree

Note that the Ruby script was specifically interacting with Git via low-level "plumbing" commands like git cat-file blob and git update-index. In addition, note that it was shelling out to call Git's commands as external binaries.
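Condensed into shell, the heart of that per-commit logic looks something like this (a sketch, not the actual script: map is the helper function that filter-branch provides for looking up rewritten commit IDs, transform stands in for whatever rewriting you want to do, and file modes are simplified to 100644):

# Inside --index-filter, $GIT_COMMIT is the original commit being rewritten
parent=$(git rev-parse "$GIT_COMMIT^" 2>/dev/null)
new_parent=$(map "$parent")

# Reset the index to the rewritten parent's tree, then replay this commit's changes
git read-tree "$new_parent"
git diff-tree -r --name-status "$parent" "$GIT_COMMIT" | while read -r status file; do
    if [ "$status" = "D" ]; then
        git update-index --force-remove -- "$file"
    else
        # Read the blob, transform it, store the new blob, and stage it
        new_blob=$(git cat-file blob "$GIT_COMMIT:$file" | transform | git hash-object -w --stdin)
        git update-index --add --cacheinfo "100644,$new_blob,$file"
    fi
done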

Speeding Up the Filtering Process: Iterating Commits in Python

I started by trying to port some of the Ruby script's logic to Python. My first attempt was just to run the same Git commands, capture the list of changed files, filter them based on the source paths of the JS and Python files I was interested in, and print those. I used the great plumbum toolkit to let me easily call external binaries from Python.
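For example, grabbing the list of files changed by a commit looked something like this (the commit IDs are placeholders):

from plumbum import local

git = local["git"]

# Equivalent of running `git diff-tree -r --name-status <parent> <commit>` in a shell
output = git("diff-tree", "-r", "--name-status", "PARENT_ID", "COMMIT_ID")
for line in output.splitlines():
    status, path = line.split("\t", 1)
    print(status, path)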

When I tried running git filter-branch --index-filter myScript.py, it worked. But, Git estimated that it would take upwards of 16 hours just to iterate through 15K commits and print the list of changed files. My guess was that a large part of that had to do with kicking off so many external processes (especially since I was doing this in Git Bash on Windows 10).

I knew that the libgit2 library existed, and that there were Python bindings for libgit2. I figured the process would run faster if I could somehow do all of the Git commands in a single process, using pygit2 to iterate over the history.

I experimented with pygit2 and figured out how to iterate over commits using commands like:

from pygit2 import Repository, GIT_SORT_TOPOLOGICAL, GIT_SORT_REVERSE

repo = Repository("path/to/repo")
for commit in repo.walk(repo.head.target, GIT_SORT_TOPOLOGICAL | GIT_SORT_REVERSE):
    # do something with each commit

Since pygit2 also has APIs to manipulate the index, I was able to put together a script that replicated the logic from the Ruby script, but all done in-process. I think the initial script I wrote was able to loop over the history and print every JS/Python file that matched my criteria, in about 30 minutes or so. Clearly a huge improvement.
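Boiled way down, the in-process rewrite loop looked roughly like this (a sketch: transform_blob is a stand-in for the actual transform step, merge commits are diffed against their first parent only, and the root commit and error handling are glossed over):

from pygit2 import (Repository, IndexEntry, GIT_FILEMODE_BLOB,
                    GIT_SORT_TOPOLOGICAL, GIT_SORT_REVERSE)

repo = Repository("path/to/filteredRepo")
rewritten = {}  # original commit OID -> rewritten commit OID

for commit in repo.walk(repo.head.target, GIT_SORT_TOPOLOGICAL | GIT_SORT_REVERSE):
    index = repo.index
    if commit.parents:
        parent = commit.parents[0]
        # Start from the rewritten parent's tree, then replay this commit's changes
        index.read_tree(repo[rewritten[parent.id]].tree)
        deltas = repo.diff(parent, commit).deltas
    else:
        index.read_tree(commit.tree)
        deltas = []

    for delta in deltas:
        if delta.status_char() == "D":
            index.remove(delta.old_file.path)
            continue
        path = delta.new_file.path
        blob = repo[delta.new_file.id]
        # transform_blob returns the bytes unchanged for files we don't care about
        new_blob_id = repo.create_blob(transform_blob(path, blob.data))
        index.add(IndexEntry(path, new_blob_id, GIT_FILEMODE_BLOB))

    tree_id = index.write_tree()
    new_id = repo.create_commit(None, commit.author, commit.committer,
                                commit.message, tree_id,
                                [rewritten[p.id] for p in commit.parents])
    rewritten[commit.id] = new_id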

Speeding Up the Filtering Process: Using pylter-branch

I was curious if anyone else had done something like this already. I dug through Github, and came across sergelevin/pylter-branch. Happily, it already did what I wanted in terms of iterating over commits, doing some kind of rewrite step, and saving the results, and provided a base "repo transformation" class that could be subclassed to define the actual transformation step.

I switched over to using that, and it actually seemed to iterate a bit faster.

Optimizing the Transformation Process

My original plan for the rewrite was to run these steps for each commit:

  • Filter the list of added/changed files for the JS and Python files I was interested in
  • Write each original file blob to disk in a temp folder, with names like ABCD1234_actualFilename.js
  • Run all of the JS codemods and Python formatting on all files in that folder
  • Write the changed files to the Git index and commit

However, I knew that all of the file access and external commands would slow things down, so I began trying to find ways to optimize this process.

Speeding Up JS Transforms: Combining Transforms

I had six different JS codemods I wanted to run. Five of them required jscodeshift, the other required babel-codemod. Originally, this would have required six separate external processes being kicked off for every commit.

A comment in the jscodeshift repo pointed out that you could write a custom transform that just imported the others and called them each in sequence. Using that idea, I was able to run all five jscodeshift transforms in one step.
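The combined transform ends up being just a loop over the individual ones (sketched here; the require paths assume the transform files have been copied locally, which I mention below):

// combined-transform.js - runs several jscodeshift transforms in sequence
const transforms = [
    require("./amd"),
    require("./named-export-generation"),
    require("./no-vars"),
    require("./object-shorthand"),
    require("./trailing-commas"),
];

module.exports = function(fileInfo, api, options) {
    let source = fileInfo.source;
    for (const transform of transforms) {
        // A jscodeshift transform returns undefined if it made no changes
        const result = transform({...fileInfo, source}, api, options);
        if (result) {
            source = result;
        }
    }
    return source;
};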

That left the one Babel plugin-based transform as the outlier. There was a jscodeshift issue discussing how to use Babel plugins as transforms, and I was able to adapt Henry's example to run my plugin inside of jscodeshift. That meant all six transforms could run in a single step.

I ultimately copied the transform files locally so I didn't have to try installing them as separate dependencies.

Speeding Up Python Formatting

That cut down on a lot of the external processes, but I was still writing files to disk for every commit. Since my script was written in Python, and the Black formatter is also Python, I realized I could probably just call it directly.

I set up some logic to put all matching Python files into an array that looked like [ {"name" : "ABCD1234_someFile.py", "source" : "actual source here"} ], and just directly call Black's format_str() function on each source string. That eliminated temp files entirely - I could read the source blobs, format them, and write the formatted blobs back to the repo, all in memory.
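Calling Black directly looks something like this (a sketch - the exact format_str signature has varied between Black releases):

import black

def formatPythonEntry(fileEntry):
    try:
        # format_str works entirely in memory - no temp files involved
        formatted = black.format_str(fileEntry["source"], mode=black.FileMode())
    except black.InvalidInput:
        # Leave syntactically-broken sources untouched
        formatted = fileEntry["source"]
    return {**fileEntry, "source": formatted}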

Speeding Up JS Transforms: Creating a Persistent JS Transform Server

Since the JS transforms were all done using tools written in JS, calling that code directly wasn't an option. To solve that, I threw together a tiny Express server that accepted the same kind of file sources array as a POST, directly called the jscodeshift and prettier APIs to transform each file in memory, and returned the transformed array. This meant I didn't have to have any temp files written to disk. It also meant there were no other external processes starting up for every commit. All I had to do was start the JS transform server, and kick off the commit iteration script.
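Stripped way down, the transform server looked something like this (a sketch: combined-transform is the merged jscodeshift transform from earlier, and the route and port are arbitrary):

const express = require("express");
const prettier = require("prettier");
const jscodeshift = require("jscodeshift");
const combinedTransform = require("./combined-transform");

// A stub of the `api` object that jscodeshift normally passes to transforms
const api = {jscodeshift, j: jscodeshift, stats: () => {}, report: () => {}};

const app = express();
app.use(express.json({limit: "50mb"}));

app.post("/transform", (req, res) => {
    const results = req.body.map(({name, source}) => {
        const transformed = combinedTransform({path: name, source}, api, {}) || source;
        return {name, source: prettier.format(transformed, {parser: "babylon"})};
    });
    res.json(results);
});

app.listen(4000, () => console.log("JS transform server listening on 4000"));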

I later realized I could speed up things even further by parallelizing the JS transforms step to handle multiple files at once. The workerpool library made that trivial to add. I set up the pool of workers, and for every request, mapped the array to transform calls that returned promises, and did await Promise.all(fileTransforms). This was another great improvement.
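With workerpool, the handler shrinks to mapping the files across a pool of workers (a sketch; transformWorker.js registers a transformFile function that does the same transform-and-format work as above):

const workerpool = require("workerpool");
const pool = workerpool.pool(__dirname + "/transformWorker.js");

app.post("/transform", async (req, res) => {
    // One pool task per file, all running in parallel
    const fileTransforms = req.body.map(file => pool.exec("transformFile", [file]));
    res.json(await Promise.all(fileTransforms));
});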

Handling Transform Problems

I ran into a bunch of problems along the way. Here are the notable ones, and the solutions I settled on.

Python String Conversions

Both Black and pylter-branch required Python 3.6, so I was using that. pygit2 reads blobs as bytes, but some of the work I needed to do required strings. I also found a few files that had some kind of "non-breaking space" character instead of actual ASCII spaces. Fixing this required doing some Unicode normalization:

from unicodedata import normalize

def normalizeEntry(fileEntry):
    # Decode the blob bytes, dropping anything un-decodable, and fold
    # lookalike characters (like non-breaking spaces) down to their ASCII forms
    newText = normalize("NFKD", str(fileEntry["source"], "utf-8", "ignore"))
    return {**fileEntry, "source": newText}

Handling JS Syntax Errors

Turns out our team had made numerous commits over the years with invalid JS syntax. This included things like missing closing curly braces, extra parentheses, wrong variable names, Git conflict markers, and more. Because the JS transforms are all parser-based, both jscodeshift and prettier could throw errors if they ran across bad syntax, causing the transforms for that file to fail.

I first had to know which files were broken, at which commits. I added error handling to the JS transform server to catch any errors, return the original (untransformed) source, and continue onwards. I then added handling to write both the current string contents and the error message to disk, like:

- /js-transform-errors
    - /some-commit-id
        - ABCD1234_someFile.js    // bad source file contents
        - ABCD1234_someFile.json  // serialized error message

I'd run the conversion script for a while, let it write a bunch of errors, then kill it and review them.

I wound up writing dozens of hand-tested regular expressions to try to fix those files whenever they came up. Using an interactive Python interpreter (in my case, Dreampie), I'd read the original blob into a string, fiddle with regexes until I got something that matched the problematic source, create a substitution that fixed the issue, and then paste the search regex and the replacement into a search table in my conversion script. The conversion logic would then check to see if any file in a given commit matched the bad file paths, and run each provided regex in sequence on the source in memory. This would fix up the source to be syntactically valid before it was sent to the JS transform server, allowing it to be transformed correctly.
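The search table ended up looking something like this (the path and pattern here are invented examples - the real table had dozens of entries):

import re

# file path -> list of (pattern, replacement) pairs, applied in order
SOURCE_FIXES = {
    "App1/src/someFeature/someBrokenFile.js": [
        # Hypothetical example: collapse a doubled closing paren from a bad merge
        (re.compile(r"\}\)\)\s*$"), "})"),
    ],
}

def fixKnownBadSource(path, source):
    for pattern, replacement in SOURCE_FIXES.get(path, []):
        source = pattern.sub(replacement, source)
    return source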

I eventually gave up on fixing every last issue. So, a few files in the history wound up with commit sequences like:

FIXED_SOURCE_C    // ES6 syntax
BROKEN_SOURCE_B   // original AMD syntax
FIXED_SOURCE_A    // ES6 syntax

Fortunately, most of those were far enough back in the history to not really cause issues with the blames.

This was probably the most annoying part of the whole task.

Mistaken Optimization of ES6 Files

All of the source files for App2 were already ES6 modules, and about 20% of App1's files were also ES6. Five of the six codemods were about converting AMD/ES5 to ES6, so I figured I could speed things up by not running those on files that were already ES6.

I modified the JS transform server to accept a {formatOnly : true} flag in the file entries, in which case it would skip the transforms and just run Prettier on the source. That was fine.

I then tried to have the Python script detect which files were already ES6, and did so by checking to see if the strings "import " or "export" were in the JS source. That turned out to be a mistake, as a bunch of our AMD files already contained those words in comments or strings. I did one conversion run that I thought would be "final", but realized afterwards that many files hadn't been transformed at all.

I eventually settled for just checking to see if the file was part of App2's source, and ran the complete transform process on everything in App1.
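For comparison, the difference between the two checks (paths hypothetical):

# First attempt: unreliable, since comments and strings can contain these words too
looksLikeEs6 = ("import " in source) or ("export" in source)

# Final approach: App2 was ES6 from day 1, so only its files skip the full transforms
formatOnly = path.startswith("App2/")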

Running the Conversion

I did lots of partial and complete trial runs to iron out the issues. The actual final conversion run, with all of the transforms and formatting, took right about 5 hours to complete. That's certainly a huge improvement over the base git filter-branch command, which probably would have taken upwards of 20-some hours. (This proved to be particularly helpful when I realized that I'd screwed up a "final" run with the bad "skip ES6" optimization, and had to re-run the whole process again.)

I had added a bunch of logging to the conversion script, including running values of elapsed time, time per processed commit, and estimated time remaining. This was really useful, and it was extremely satisfying to watch the commit details fly by in the terminal. (Fun side note: a couple hours after I kicked off the actual final conversion run in the background, a popup window informed me that IT had pushed through a forced Windows reboot due to updates. That meant it was a race for the conversion to complete before the reboot, and at one point it was 3.5 hours remaining for the conversion, 3 hours left until the reboot. Fortunately, the conversion sped up considerably about halfway through thanks to smaller files to process.)

However, we were still doing some fixes and work in our existing repo, and had made several commits since I made the clone to start the file filtering process. Those needed to get ported to the new repo, and that turned out to be trickier than I expected.

Attempt #1: Bundles

I've used Git "bundles" before, which let you export a series of commits as a single file. That file can then be transferred to another machine, and used as a source for cloning or pulling commits into another repo.

I had assumed I could export the latest commits into a bundle and pull those directly into the newly-rewritten repo. I was wrong :( Turns out that git bundle always verifies that the target repo already has the parent commit before the first commit in the bundle, and of course since the rewritten repo had a completely different history line, that specific parent commit ID didn't exist in the new repo.

I poked at various forms of pulling, cloning, and banging my head against a wall before giving up on this approach.

Attempt #2: Cherry-Picking

I then figured I could add the original repo as a remote, git fetch the original commits into the new repo, and cherry-pick them over into the new history. That didn't go as planned either.

First, the new repo had to copy over every old blob and commit into itself. Then, when I tried cherry-picking the first "new" commit, it brought along not just the changed files from that commit, but the old versions of every other file in the old commit's file tree.

I probably could have done some kind of surgery to make that work, but I gave up.

Attempt #3: Patch Files

Git was originally intended for use with an email-based workflow, since that's what the Linux kernel team does. It has built-in commands for generating patch files, writing emails with those patches, and applying patches from emails.

I was able to generate a series of patch files with git format-patch COMMIT_BEFORE_FIRST_I_WANT..LAST_COMMIT_I_WANT. I then copied those to the new repo, and tried to apply them.

Git has two related commands. git apply takes a single standalone patch file and tries to update your working copy. git am reads an email-formatted patch file, applies the diff, and then creates a new commit using the metadata that was encoded in the email header.

When I tried to use git am, it failed. The generated patch files have lines like index OLD_BLOB_HASH NEW_BLOB_HASH for each changed file, and those lines caused Git to try to find the old file blob IDs in the new repo. Again, those didn't exist.

I finally resorted to manually deleting those index lines from each patch file, then running git am --reject --ignore-whitespace --committer-date-is-author-date *.patch. Git would try to apply a patch file, and if any hunks failed to apply correctly, write them to disk as SomeFile.js.rej and pause. I could then do manual hand-fixes to match the changes in the patch, and run git am --reject --ignore-whitespace --committer-date-is-author-date --continue, and it would pick up where it left off.
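The index-line deletion, at least, is scriptable. Something like this sed invocation should strip those lines from all the generated patches before running git am:

# Drop the "index OLDBLOB..NEWBLOB" lines, which reference blobs
# that don't exist in the rewritten repo
sed -i '/^index /d' *.patch

git am --reject --ignore-whitespace --committer-date-is-author-date *.patch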

So, patch files aren't ideal, but they at least allow copying over newer changes semi-automatically while reusing the commit metadata. I probably could have found other ways to do this programmatically, but oh well.

Conversion Results

Nuking unneeded files from the repo history knocked the new repo down from 2.15GB as a baseline to "only" 600MB, a savings of 1.5GB. We still unfortunately have a bunch of vendored libs with the current codebase, but that's still a significant improvement. (Yeah, yeah, we'll look at maybe using something like Git-LFS down the road.)

The JS transformation process worked out exactly as I'd hoped. All of the versions of the JS files in our history were transformed and formatted, as if they'd originally been written using ES6 module syntax and consistent formatting from day 1. This meant that each commit retained the original metadata and relative diffs to its parent.

As an example, App1's entry point was initially a global script, but a couple months into development I converted it into an AMD module. The original diff looked like:

+define(["a", "b"],
+function(a, b) {

// Actual app setup code here

+   return {
+       export1 : variable1,
+       export2 : function2
+   }
+});

Afterwards, the equivalent commit was still by me, still on the same day, but the diff looked like:

+import a from "a";
+import b from "b";

// Actual app setup code here

+export default {
+    export1: variable1,
+    export2: function2,
+};

I did have to do a few hand-fix commits after the conversion process was done, mostly around our mixed use of AMD/ES6 modules (like removing any use of SomeDefaultImport.default). Once I did that, the code ran just fine.

Final Thoughts

As I said, that was incredibly technical, but hopefully it was informative.

I had a fairly good grasp on how Git worked and how it stored data before this task began, but this really solidified my understanding.

I suspect there may have been some other approaches I could have used, particularly rewriting each file blob in parallel. I'm pretty happy with how this worked out, though.

The sanitized conversion scripts are available on GitHub. If you've got questions, leave a comment or ping me @acemarke on Twitter or Reactiflux.

Further Information

Git internal data structures: blobs, trees, commits, and hashes

Git history rewriting

JS Codemods

Git Commit Transfers


This is a post in the Codebase Conversion series.