Serialization Support for py-multicodec #26

sumanjeet0012 · 2025-12-03T18:20:27Z

Issue

closes #25

py-multicodec lacks serialization integration (identified in gap analysis). Users must manually serialize data before adding multicodec prefixes.

Solution

Added a serialization module with:

Abstract Codec base class for custom codecs
Built-in JSONCodec (0x0200) and RawCodec (0x55)
Generic encode()/decode() functions with codec auto-detection
Codec registry for dynamic management
Proper exception hierarchy

Example

from multicodec import JSONCodec, encode, decode

# Encode data with JSON codec
data = {"hello": "world"}
encoded = encode(data, JSONCodec())  # bytes with multicodec prefix

# Decode with auto-detection
decoded = decode(encoded)  # {"hello": "world"}

Changes

File	Description
`multicodec/exceptions.py`	Exception classes
`multicodec/serialization.py`	Codec interface & implementations
`tests/test_serialization.py`	43 test cases
`newsfragments/25.feature.rst`	Changelog

877 lines added, all tests passing

Checklist

Tests added
make pr passes (lint, type check, tests)
Newsfragment added

acul71 · 2025-12-04T00:19:36Z

@sumanjeet0012 Merged with master

Here's a review: see if can be of help or too broad:

1. Summary of Changes

This PR adds comprehensive serialization support to py-multicodec, addressing issue #25 which identified the lack of serialization integration as a gap in the library. The PR implements:

New exception hierarchy (multicodec/exceptions.py): Base exceptions for multicodec operations
Serialization module (multicodec/serialization.py): Complete codec interface with built-in JSON and Raw codecs
Comprehensive test suite (tests/test_serialization.py): 43 test cases covering all functionality
Newsfragment (newsfragments/25.feature.rst): User-facing changelog entry
Module exports (multicodec/__init__.py): Updated to export all new serialization functionality

Files Changed:

multicodec/__init__.py - Added serialization imports and exports
multicodec/exceptions.py - New file (45 lines)
multicodec/serialization.py - New file (364 lines)
tests/test_serialization.py - New file (418 lines)
newsfragments/25.feature.rst - New file (1 line)
tools/gen_code_table.py - Minor type hint improvement

Total Changes: +877 lines added, 4 lines modified

Breaking Changes: None - this is a purely additive feature.

2. Branch Sync Status and Merge Conflicts

Branch Sync Status

✅ Status: Branch is in sync with origin/master after merge resolution

Details: The branch was initially 1 commit behind and 2 commits ahead of origin/master. After merging origin/master into the PR branch, the branch is now 0 commits behind and 3 commits ahead (including the merge commit).

Merge Conflict Analysis

✅ Conflicts Resolved: All merge conflicts have been successfully resolved

Conflicts Detected and Resolved:

File: multicodec/__init__.py
- Conflict Type: Both branches modified the same file
- Location: Import statements and __all__ list
- Nature:
  - PR branch added serialization imports and exports
  - Master branch added Code type imports and exports
- Resolution: Merged both sets of imports and exports, maintaining proper organization with comments
- Severity: Low
- Status: ✅ Resolved
File: newsfragments/23.feature.rst
- Conflict Type: Add/add conflict (both branches added the same file)
- Location: Entire file content
- Nature: Both branches had identical content, only formatting difference (backticks vs no backticks)
- Resolution: Used the version without backticks (more standard formatting)
- Severity: Low
- Status: ✅ Resolved
File: tools/gen_code_table.py
- Conflict Type: Add/add conflict (both branches added the same file)
- Location: Function signature and implementation details
- Nature:
  - PR branch used ast.literal_eval (safer)
  - Master branch used eval and had type hints
- Resolution: Combined best of both: used ast.literal_eval (safer) with type hints from master
- Severity: Low
- Status: ✅ Resolved

Impact Assessment:

Severity: Low - All conflicts were straightforward to resolve
Resolution Complexity: Simple - Conflicts were in separate sections or had clear best choices
Functionality Impact: None - All functionality preserved after merge

Resolution Summary:
All conflicts have been resolved and the merge commit has been successfully created. The branch is now ready for review and can be merged cleanly into master.

3. Strengths

Well-Designed Architecture:
- Clean abstract base class (Codec) with proper type hints using Generic[T]
- Clear separation between encoding/decoding logic and prefix handling
- Follows similar patterns to js-multiformats and rust-multicodec as mentioned in the PR
Comprehensive Exception Hierarchy:
- Proper exception inheritance (MulticodecError → CodecError → specific errors)
- Specific error types (EncodeError, DecodeError, UnknownCodecError) for better error handling
- Good error messages with context
Excellent Test Coverage:
- 43 test cases covering all major functionality
- Tests for both success and error cases
- Tests for edge cases (Unicode, empty data, wrong codecs, etc.)
- Tests for registry functionality (register/unregister)
- Tests for auto-detection in decode()
Type Safety:
- Proper use of type hints throughout
- Generic type support for codecs
- Type checking passes without errors
Code Quality:
- Clean, readable code with good docstrings
- Proper use of abstract methods
- Singleton pattern for built-in codecs
- Registry pattern for dynamic codec management
Documentation:
- Good module-level docstrings with examples
- Clear method docstrings with parameter descriptions
- Newsfragment included for changelog
Security:
- Uses ast.literal_eval instead of eval in gen_code_table.py (safer)
- Proper input validation (e.g., RawCodec checks for bytes type)
- Exception handling prevents information leakage

4. Issues Found

Critical

None identified.

Major

None identified.

Minor

File: multicodec/serialization.py
- Line(s): 100-103
- Issue: Redundant exception re-raising in encode() method
- Suggestion: The except EncodeError: raise block is unnecessary since EncodeError will propagate naturally. Consider removing it:
```
try:
    encoded = self._encode(data)
    prefix = varint.encode(self.code)
    return prefix + encoded
except Exception as e:
    raise EncodeError(f"Failed to encode with {self.name}: {e}") from e
```
The same pattern appears in decode() (lines 126-127) and decode_raw() (lines 141-142).
File: multicodec/serialization.py
- Line(s): 315-316
- Issue: Catching TypeError from varint.decode_bytes() may be too broad
- Suggestion: Consider catching more specific exceptions or checking the varint library's exception types. However, this is acceptable as a defensive measure.
File: multicodec/serialization.py
- Line(s): 318-320
- Issue: When codec prefix is unknown, raises DecodeError but could provide more context
- Suggestion: The error message is clear, but consider including available codecs in the message for better debugging. However, this might be too verbose.
File: multicodec/__init__.py
- Line(s): 44-80
- Issue: The __all__ list could be better organized or alphabetized for consistency
- Suggestion: Consider alphabetizing the __all__ list or grouping more consistently. The current organization with comments is good, but some items seem out of order (e.g., "decode" appears before "encode" in the serialization functions section).

5. Security Review

✅ Security Assessment: No security vulnerabilities identified

Security Considerations:

Input Validation:
- ✅ RawCodec properly validates input type (bytes only)
- ✅ JSONCodec handles invalid JSON gracefully
- ✅ Codec registry validates codec names
Code Execution:
- ✅ Uses ast.literal_eval instead of eval in gen_code_table.py (resolved during merge)
- ✅ No unsafe deserialization - JSON uses standard library json.loads()
Exception Handling:
- ✅ Proper exception chaining with from e
- ✅ Error messages don't expose sensitive information
- ✅ Exceptions are properly typed and specific
Binary Data Handling:
- ✅ Proper handling of varint encoding/decoding
- ✅ No buffer overflows or unsafe memory operations
- ✅ Proper prefix validation before decoding

Risk Level: None / Low

6. Documentation and Examples

✅ Documentation Quality: Good overall, with minor improvements possible

Strengths:

Module-level docstrings with usage examples
Method docstrings with parameter descriptions and return types
Clear exception documentation in docstrings
Newsfragment included

Areas for Improvement:

File: multicodec/serialization.py

Line(s): 13-23
Issue: Module docstring example uses multicodec.serialization imports, but users will likely import from multicodec
Suggestion: Update example to show imports from multicodec:

Example usage:
    >>> from multicodec import json_codec, raw_codec, encode, decode
    >>> # Using JSON codec
    >>> data = {"hello": "world"}
    >>> encoded = json_codec.encode(data)
    >>> decoded = json_codec.decode(encoded)
    >>> assert decoded == data
    >>>
    >>> # Using the generic encode/decode with codec name
    >>> encoded = encode("json", {"key": "value"})
    >>> decoded = decode(encoded)

File: README.rst
- Issue: README may need updates to document the new serialization features
- Suggestion: Consider adding a section about serialization support with examples. However, this may be out of scope for this PR.
File: multicodec/serialization.py
- Line(s): 234-243
- Issue: register_codec() docstring could mention that it modifies global state
- Suggestion: Add a note about thread-safety or global state implications if relevant.

7. Newsfragment Requirement

✅ Newsfragment Status: Present and valid

Verification:

✅ File exists: newsfragments/25.feature.rst
✅ Format: Correct filename format (25.feature.rst)
✅ Content: User-facing description present
✅ Newline: File ends with newline (verified)
✅ Issue reference: PR references issue Missing Serialization Support #25 in description ("closes Missing Serialization Support #25")

Content Review:

Added serialization with JSON/raw codecs, custom codec interface, encode/decode functions, and a dynamic codec registry.

✅ Content is appropriate: Describes user-facing features clearly and concisely.

Issue Reference:

✅ PR description includes "closes Missing Serialization Support #25"
✅ Issue Missing Serialization Support #25 exists and describes the serialization support feature
✅ Newsfragment uses issue number 25

Conclusion: Newsfragment requirement is fully satisfied.

8. Tests and Validation

✅ Test Coverage: Excellent - 43 new test cases, all passing

Test Execution Summary

Total Tests: 546 tests (503 existing + 43 new)

✅ Passed: 546
❌ Failed: 0
⚠️ Skipped: 0
❌ Errored: 0

Test Execution Time: 0.20s

Test Files:

tests/test_code.py - 31 tests
tests/test_multicodec.py - 472 tests
tests/test_serialization.py - 43 tests (NEW)

Test Coverage Analysis

New Test Cases Cover:

JSONCodec Tests:
- ✅ Codec properties (name, code)
- ✅ Encoding simple and complex data structures
- ✅ Round-trip encoding/decoding
- ✅ Unicode support
- ✅ Error cases (non-serializable data, invalid JSON, wrong codec)
- ✅ decode_raw() functionality
- ✅ String representation
RawCodec Tests:
- ✅ Codec properties
- ✅ Encoding/decoding bytes
- ✅ Round-trip with various byte patterns
- ✅ Error cases (non-bytes input, wrong codec)
- ✅ String representation
Singleton Codecs:
- ✅ json_codec and raw_codec instances
Codec Registry:
- ✅ Listing registered codecs
- ✅ Checking if codec is registered
- ✅ Getting registered codec
- ✅ Registering custom codecs
- ✅ Unregistering codecs
- ✅ Error cases (duplicate registration, unregistering non-existent)
Generic Functions:
- ✅ encode() function
- ✅ decode() with auto-detection
- ✅ decode() with explicit codec name
- ✅ Error cases (unknown codec, invalid data)

Test Quality:

✅ Tests cover both success and error paths
✅ Tests use descriptive names
✅ Tests are well-organized in test classes
✅ Tests include edge cases (empty data, Unicode, binary data)
✅ Tests properly clean up (unregister custom codecs)

Linting Results

✅ All checks passed:

check yaml: Passed
check toml: Passed
fix end of files: Passed
trim trailing whitespace: Passed
pyupgrade: Passed
ruff (legacy alias): Passed
ruff format: Passed
run mypy with all dev dependencies present: Passed

No linting errors or warnings.

Type Checking Results

✅ Type checking passed:

mypy-local: Passed

No type errors or warnings.

Documentation Build Results

✅ Documentation build succeeded:

Sphinx build completed successfully
All modules documented (code, exceptions, multicodec, serialization)
No documentation errors or warnings
HTML pages generated successfully

9. Recommendations for Improvement

Code Simplification:
- Remove redundant except EncodeError: raise blocks in serialization.py (lines 100-101, 126-127, 141-142)
Documentation Updates:
- Update module docstring example to use multicodec imports instead of multicodec.serialization
- Consider adding serialization examples to README (if in scope)
Code Organization:
- Consider alphabetizing or better organizing the __all__ list in __init__.py
Future Enhancements (not blocking):
- Consider adding more built-in codecs (CBOR, MessagePack) if commonly needed
- Consider thread-safety documentation for the codec registry if relevant
- Consider adding codec validation (e.g., ensuring codec code matches multicodec table)

10. Questions for the Author

Design Decision: Why was the codec registry implemented as a global mutable dictionary rather than a registry class? This is fine, but I'm curious about the design rationale.
Future Codecs: Are there plans to add more built-in codecs (like CBOR or MessagePack) in the future, or is the registry pattern intended to encourage users to implement their own?
Thread Safety: Is the codec registry intended to be thread-safe? If so, should we add locking, or is it assumed that codecs are registered at application startup?
Codec Validation: Should there be validation to ensure that a custom codec's code property matches an entry in the multicodec table? Currently, users could register a codec with an invalid code.
Backward Compatibility: The PR adds new exports to __init__.py. Are there any concerns about namespace pollution or conflicts with existing code?

11. Overall Assessment

Quality Rating: Excellent

This is a well-implemented feature that adds significant value to py-multicodec. The code is clean, well-tested, properly documented, and follows good Python practices. The architecture is sound and extensible.

Security Impact: None / Low

No security vulnerabilities identified. The implementation uses safe practices (ast.literal_eval, proper input validation, exception handling).

Merge Readiness: Ready (after minor improvements)

The PR is in excellent shape. The minor issues identified are non-blocking and can be addressed in a follow-up or as part of this PR. All tests pass, linting passes, type checking passes, and documentation builds successfully.

Confidence: High

The implementation is solid, well-tested, and follows established patterns. The merge conflicts have been resolved, and the branch is ready for integration.

Recommendation: ✅ Approve with minor suggestions

This PR successfully addresses issue #25 and adds valuable serialization support to py-multicodec. The code quality is high, test coverage is excellent, and the implementation is well-designed. The minor suggestions in section 9 are optional improvements and do not block approval.

Appendix: Merge Conflict Resolution Details

Resolved Conflicts Summary

multicodec/init.py:
- Merged Code type imports from master with serialization imports from PR
- Combined __all__ lists with proper organization
- Result: All exports properly included
newsfragments/23.feature.rst:
- Chose version without backticks (standard formatting)
- Result: Clean, consistent formatting
tools/gen_code_table.py:
- Combined type hints from master with safer ast.literal_eval from PR
- Result: Type-safe and secure implementation

All conflicts resolved successfully. Merge commit created: 8e3c55b

sumanjeet0012 · 2025-12-04T04:46:56Z

@acul71 Thank you for the review.

I see that there are some optional changes we can make in this PR, but no major issues were found. I will rebase this on the latest master branch, and then we can proceed with merging.

sumanjeet0012 · 2025-12-04T15:24:54Z

@acul71 @seetadev Ready for review.

acul71 · 2025-12-04T15:37:04Z

@acul71 @seetadev Ready for review.

Review Summary

Status: Ready to merge

Key Findings:

All 546 tests pass (43 new tests added)
No linting errors or warnings
Type checking passes
Documentation builds successfully
No merge conflicts detected
Branch is in sync with master (0 behind, 1 ahead)

Strengths:

Well-designed architecture with abstract base class and type hints
Comprehensive test coverage (43 new tests covering all functionality)
Proper exception hierarchy
Good documentation and examples
No security vulnerabilities identified

Minor Issues (Non-blocking):

Redundant exception re-raising in a few methods (cosmetic)
Module docstring example could use multicodec imports instead of multicodec.serialization
__all__ list could be better organized

Recommendation: Approve — The PR successfully addresses issue #25 and adds serialization support. The minor issues are optional and do not block approval.

sumanjeet0012 · 2025-12-08T07:10:50Z

@acul71 @seetadev Do we need some more changes?

seetadev · 2025-12-08T09:44:10Z

@sumanjeet0012 : Great work, Sumanjeet. Appreciate your support and efforts.

Wish to have feedback and pointers from @pacrob before I merge the PR.

pacrob

lgtm!

sumanjeet0012 force-pushed the feat/serialization branch from 8e3c55b to 49dec34 Compare December 4, 2025 14:16

Implemented serization methods in py-multicodec

3b02319

sumanjeet0012 force-pushed the feat/serialization branch from 49dec34 to 3b02319 Compare December 4, 2025 15:10

sumanjeet0012 marked this pull request as ready for review December 4, 2025 15:11

pacrob approved these changes Dec 8, 2025

View reviewed changes

acul71 merged commit 2dad7b2 into multiformats:master Dec 8, 2025
21 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Serialization Support for py-multicodec #26

Serialization Support for py-multicodec #26

Uh oh!

sumanjeet0012 commented Dec 3, 2025

Uh oh!

acul71 commented Dec 4, 2025

Uh oh!

sumanjeet0012 commented Dec 4, 2025

Uh oh!

sumanjeet0012 commented Dec 4, 2025

Uh oh!

acul71 commented Dec 4, 2025

Uh oh!

sumanjeet0012 commented Dec 8, 2025

Uh oh!

seetadev commented Dec 8, 2025

Uh oh!

pacrob left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Serialization Support for py-multicodec #26

Serialization Support for py-multicodec #26

Uh oh!

Conversation

sumanjeet0012 commented Dec 3, 2025

Issue

Solution

Example

Changes

Checklist

Uh oh!

acul71 commented Dec 4, 2025

1. Summary of Changes

2. Branch Sync Status and Merge Conflicts

Branch Sync Status

Merge Conflict Analysis

3. Strengths

4. Issues Found

Critical

Major

Minor

5. Security Review

6. Documentation and Examples

7. Newsfragment Requirement

8. Tests and Validation

Test Execution Summary

Test Coverage Analysis

Linting Results

Type Checking Results

Documentation Build Results

9. Recommendations for Improvement

10. Questions for the Author

11. Overall Assessment

Appendix: Merge Conflict Resolution Details

Resolved Conflicts Summary

Uh oh!

sumanjeet0012 commented Dec 4, 2025

Uh oh!

sumanjeet0012 commented Dec 4, 2025

Uh oh!

acul71 commented Dec 4, 2025

Review Summary

Uh oh!

sumanjeet0012 commented Dec 8, 2025

Uh oh!

seetadev commented Dec 8, 2025

Uh oh!

pacrob left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants