Documentation in Data/Analytics Projects: An underestimated ally!
Documentation is something that gets misunderstood a lot IMO.
Sharing my thoughts on the topic (applies to Data and beyond)
Its Objective: Context & Clarity
In more detail: you want future readers/users (yourself included!) of the model/transformation/query/code/etc. to understand what was done, and why, without having to come back and ask you.
Its Flavours: More diverse than we might think
A few that I consciously consider:
- Code -> Your code should be as self-explanatory as possible. Write clear, simple code, not clever code!
- Naming -> helpful intuitive naming (for files, datasets, data models, columns…)
- Comments -> Additional comments in the code (if a comment is confusing or too complex/long, it backfires)
- Testing -> Tests help the user understand the structure and data of the model (for example, a uniqueness test makes the granularity explicit)
- Documentation files in codebase (as needed) -> if the naming makes things self-explanatory, this becomes the place for more context, like business context
- Documentation outside codebase (like in Confluence) -> Mostly for more complex contextual subjects, like documenting the business use case of an ML model, a reporting pipeline, an architecture, a complex data source, etc.
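To make the Code, Naming, Comments and Testing flavours a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical and only for illustration: the `order_lines` sample data, its column names, and the `find_duplicate_keys` helper. The idea is that a uniqueness test tells the next reader what one row represents.

```python
from collections import Counter

# Hypothetical sample data, assumed to hold one row per (order_id, line_number).
order_lines = [
    {"order_id": 1, "line_number": 1, "amount_eur": 20.0},
    {"order_id": 1, "line_number": 2, "amount_eur": 5.5},
    {"order_id": 2, "line_number": 1, "amount_eur": 12.0},
]


def find_duplicate_keys(rows, key_columns):
    """Return the key tuples that appear more than once in rows."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# The test doubles as documentation: the grain is one row per order line.
assert find_duplicate_keys(order_lines, ["order_id", "line_number"]) == []

# order_id alone repeats, so the table is NOT one row per order.
assert find_duplicate_keys(order_lines, ["order_id"]) == [(1,)]
```

Note how little extra documentation this needs: names like `amount_eur` carry the unit, the comments stay short, and the assertions state the granularity. In a real project the same check would more likely live in whatever testing setup the team already uses.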
Ultimately, one needs to ask oneself:
Will the next reader face friction with what I am creating?
If the answer is yes, simplify further and document better!