I actually used topological sort at work

Ming
4 min readFeb 14, 2024

What’s something that high schoolers keep on saying when they start majoring in computer science in college? It can’t be anything other than:

“When am I going to use this in real life?”

That’s exactly what I said to myself (not to my professor; I was too shy) when I saw topological sort on the projector screen.

“Topological sort”, as imagined by MidJourney.

Anecdote of mine, with CS courses

Topological sorting is usually taught in the 3rd year of college (CSE 326 in UW, CS 302 in UTK), or the 1st if you are in UPenn (CIS 110). Lucky for me, I wasn’t exposed to it until I started grad school there. (Back in undergrad, I majored in something totally unrelated.) What a sheltered childhood.

I still can vividly recall attending my first lecture in CIS 502, Analysis of Algorithms. (You know the deal is real when the name says “analysis of” anything, even for Harry Potter.) The professor — who I assume would prefer to remain anonymous — held up the textbook in his hands, the infamous “CLRS” Introduction to Algorithms. For those who didn’t know, it’s a 1312-page-thick paper brick.

You know a book is popular when you can look it up by the abbreviation of its authors’ last names.

“This shall be your textbook for this course,” he announced.

I took a skim of the table of contents, thinking, “There are 35 chapters in this book and 16 weeks in this term, so if I read two chapters per week, I should be good.”

The professor grabbed the first ~700 pages and continued, “I assume you all have familized with the material in the first 20 chapters during undergrad.”

That was the moment I knew I was screwed.

Le me in CIS 502 (2017), digitally colorized

It wasn’t a waste of time, after all

Therefore, I’m thrilled to announce that, after 4 years working as a software engineer, I finally got a chance to put it to practice.

The software engineering team I’m with maintains a 17-year-old Java repository. If it were a person, it could soon take CIS 110 and do this itself. The repo contains a million lines of code, organized into 40 Maven Modules. My job was to decouple 15 of them from their aggregator POM (what is this?).

The tricky part is that these modules have dependencies on each other. To avoid compilation errors, I need to perform my surgery in a correct order, working my way from the least-dependent module to the most-dependent one. I need to consult the dependency graph.

The de facto standard (change my mind) for storing graph data is the DOT format from Graphviz. While I would go with depgraph-maven-plugin for visualizing the graph, the vanilla Maven Dependency Plugin will do this time, since we will only be doing textual manipulations:

mvn dependency:tree -DoutputType=dot -DoutputFile="target/dependencies.dot"

Next, we convert the DOT-formatted graph to space-delimited tuples, so that the tsort program can understand it:

sed -n 's/.*"\(.*\):jar:\(.*\):compile" -> "\(.*\):jar:\(.*\):compile".*/\1 \3/p' \
./*/target/dependencies.dot > "/tmp/dependencies.dot"

Now let’s break down the script:

  • s/: This indicates that we're performing a substitution operation.
  • .*"\(.*\):jar:\(.*\):compile" -> "\(.*\):jar:\(.*\):compile".*: This is the regular expression pattern we're searching for. It captures strings that look like "something:jar:something:compile" -> "something:jar:something:compile" and captures the something parts within groups (specified by \(...\)).
  • /: This separates the pattern from the replacement.
  • \1 \3: This is the replacement string. It includes \1 which refers to the first captured group (\(.*\)), and \3 which refers to the third captured group (\(.*\)). So essentially, it replaces the entire matched line with the first and third captured groups, separated by a space.
  • /p: This flag at the end tells sed to print the result if a substitution is made.

(Disclaimer: ChatGPT generated this breakdown. I was too lazy to explain myself.)

Finally comes the juicy part: Let’s topologically sort this dependency graph. While we are at it, let’s also keep the pipe flowing:

  1. filter for artifacts from this repo only,
  2. filter again for artifacts that I need to decouple,
  3. keep only the artifact IDs, and
  4. reverse the order, so that the least-dependent entry comes first.

The complete command reads quite cryptic:

tsort "/tmp/dependencies.dot" |\
grep "com.example.MyTeam.MyRepo" |\
grep ":.*Module" |\
sed 's/.*://' |\
tac

We can store the output into an environment variable, such as $dir_name , and perform a line-wise for-each loop. With shell being shell, our for each loop won’t even have the words “for” and “each” in it:

echo "$dir_names" | while read -r dir_name; do
pom_file="${dir_name}/pom.xml"
xmlstarlet ... # whatever XML magic you wanna do here
done

That’s it.

Conclusion

This short blog post, unlike others where I show off my shell-fu (e.g., how to convert a Google doc to a multi-page website with ~20 lines of commands), is mainly about proving a point: you never know when you need to use what you learned from school.

Moments like this keep me learning about random topics without being too cynical about their utility. (At one extreme, I read pathology textbook as a pastime.) Should you have been losing heart in your academic endeavor, I hope this short story of mine can also motivate you the same.

(Disclaimer: All statistics about that particular repo were skewed for confidentiality.)

--

--