Establish behavior-driven testing guidelines and documentation

Problem

The test suite mixes implementation testing with behavior testing because there are no clear guidelines for what should be tested and how. The result is brittle tests that break during refactoring and fail to verify actual requirements.

Root Cause

Tests were written after implementation to verify "does the code do what I coded?" rather than "does it do what users need?"

This is evident in:

  • ~60% of tests checking implementation details
  • Mock-heavy tests that break with internal changes
  • Tests that verify exact error strings, call order, internal state
  • Missing tests for actual user-facing edge cases

Proposed Solution

Create comprehensive testing guidelines that:

  1. Define what to test
  2. Define how to write tests
  3. Provide clear examples
  4. Establish review criteria

Testing Guidelines Document

Create docs/TESTING.md with:

1. Testing Philosophy

## Testing Philosophy

### Test Behavior, Not Implementation

**Good Test**: Verifies user-facing behavior
- "Snippet insertion preserves surrounding content"
- "Note deletion removes file from filesystem"
- "Error messages are helpful and actionable"

**Bad Test**: Verifies implementation details
- "Uses gio before trash-put"
- "Calls nvim_buf_set_lines with correct parameters"
- "Caches input prompt results"

### The Substitution Test

If you can completely rewrite the implementation and tests still pass, they're good tests.
If tests break when you refactor, they're testing implementation.

### Example:

```lua
-- BAD: Tests implementation
it("prefers gio over trash-put", function()
  -- This breaks if we change trash utility preference
  -- This doesn't verify the requirement: "delete the file"

-- GOOD: Tests behavior  
it("successfully moves file to system trash", function()
  local file = create_test_file()
  move_to_trash(file)
  assert.is_false(file_exists(file))
  assert.is_true(file_in_trash(file))
  -- This passes regardless of which utility is used

2. Test Categorization

## Test Types

### Unit Tests
- Test single function behavior
- Mock external dependencies (filesystem, user input)
- Verify edge cases and error handling
- **Focus**: "Does this function do what it promises?"

### Integration Tests
- Test features working together
- Minimal mocking (real filesystem in temp dirs)
- Verify complete workflows
- **Focus**: "Do features compose correctly?" (a sketch follows this section)

### End-to-End Tests
- Test complete user journeys
- No mocking (except external services)
- Verify real-world scenarios
- **Focus**: "Can users accomplish their goals?"
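
To make the integration level concrete, here is a minimal busted-style sketch that uses the real filesystem through a temporary file. The names and file content are purely illustrative and are not part of this plugin's API.

```lua
describe("note file round-trip", function()
  local path

  before_each(function()
    path = os.tmpname()  -- real temporary file on disk, no mocks
  end)

  after_each(function()
    os.remove(path)
  end)

  it("persists note content that can be read back", function()
    -- Arrange/Act: write the note to disk
    local f = io.open(path, "w")
    f:write("# Inbox\n")
    f:close()

    -- Assert: the behavior is "content survives a round-trip"
    local g = io.open(path, "r")
    local readback = g:read("*a")
    g:close()

    assert.equals("# Inbox\n", readback)
  end)
end)
```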

3. Writing Good Tests

## Test Writing Guidelines

### Structure: Arrange-Act-Assert

```lua
it("does something useful", function()
  -- Arrange: Set up test state
  local file = create_test_file("content")
  
  -- Act: Perform the action
  local result = do_something(file)
  
  -- Assert: Verify behavior
  assert.equals("expected", result)
  assert.is_true(file_unchanged(file))
end)
```

### Naming Convention

Test names should describe behavior, not implementation:

```lua
-- BAD: Describes implementation
it("calls utils.move_to_trash")
it("sets config.commands.ZkDelete to true")

-- GOOD: Describes behavior
it("removes note from filesystem")
it("enables deletion command by default")
```

### What to Assert

Assert behavior and outcomes:

```lua
assert.is_false(file_exists(note_path))  -- ✓ Behavior
assert.equals("Deleted note", message)   -- ✓ User feedback
```

Don't assert implementation:

```lua
assert.is_true(trash_called)                   -- ✗ Implementation
assert.same(expected_call_order, actual_calls) -- ✗ Implementation
```

### Error Testing

Test error handling behavior, not error implementation:

```lua
-- BAD: Pins the exact return shape and error string
it("returns nil, nil, nil, 'exact error string'", function()
  -- ...
end)

-- GOOD: Verifies the error is present and helpful
it("fails gracefully with helpful error when notebook not found", function()
  local result, err = get_templates()
  assert.is_nil(result)
  assert.is_not_nil(err)
  assert.is_not_nil(err:match("[Nn]otebook"))   -- Verify message is helpful
  assert.is_not_nil(err:match("[Nn]ot found"))  -- Don't require exact string
end)
```

4. Common Patterns

## Testing Patterns

### Testing User Input

```lua
-- Test the behavior: "User can cancel operation"
it("cancels operation when user provides empty input", function()
  mock_input("")  -- User cancels
  
  create_note_from_template()
  
  assert.is_nil(created_note())  -- No note created
  assert.no_error_occurred()     -- Cancellation is not an error
end)
```

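A plausible shape for the `mock_input` helper used above, shown as an assumption rather than this plugin's actual test setup: it temporarily replaces `vim.ui.input` so the test decides what "the user" types.

```lua
-- Hypothetical helper: stub vim.ui.input so tests can simulate user input.
local real_input

local function mock_input(reply)
  real_input = vim.ui.input
  vim.ui.input = function(_, on_confirm)
    on_confirm(reply)  -- answer the prompt immediately with the canned reply
  end
end

local function restore_input()
  vim.ui.input = real_input
end
```

Calling `restore_input()` in `after_each` keeps the stub from leaking into other tests.
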
### Testing File Operations

```lua
-- Test the behavior: "File is deleted"
it("removes file from filesystem", function()
  local file = create_temp_file()
  assert.is_true(file_exists(file))

  delete_file(file)

  assert.is_false(file_exists(file))
  -- Don't test HOW it was deleted (trash vs rm)
end)
```

### Testing Error Recovery

```lua
-- Test the behavior: "System recovers from errors"
it("leaves no partial state when operation fails", function()
  local initial_state = capture_state()

  mock_filesystem_error()
  attempt_operation()

  assert.equals(initial_state, capture_state())
  assert.user_notified_of_error()
end)
```

5. Code Review Checklist

## Test Review Checklist

When reviewing tests, ask:

- [ ] Does the test name describe user-visible behavior?
- [ ] Would the test pass with an alternative implementation?
- [ ] Are assertions about outcomes, not internal state?
- [ ] Are edge cases covered?
- [ ] Are error messages verified to be helpful?
- [ ] Can I understand what the feature does from the test alone?
- [ ] Would this test catch real user issues?

**Red Flags**:
- ❌ Test mocks more than necessary
- ❌ Test checks call order or internal methods
- ❌ Test requires exact error string
- ❌ Test breaks on refactoring
- ❌ Test name describes implementation

6. Migration Guide

## Migrating Existing Tests

### Step 1: Identify Implementation Tests

Look for:
- Tests with "prefers X over Y"
- Tests checking call order
- Tests mocking internal functions
- Tests with exact error strings

### Step 2: Ask "What's the requirement?"

For each test, identify the user requirement:
- "User can delete notes" (not "trash is preferred")
- "Errors are helpful" (not "exact error text")

### Step 3: Rewrite

```lua
-- Before: Implementation test
it("calls trash then hard_delete on failure", function()
  mock_trash(false)
  mock_hard_delete(true)
  remove_note()
  assert.trash_called()
  assert.hard_delete_called()
end)

-- After: Behavior test
it("successfully deletes note using available method", function()
  local note = create_note()
  remove_note(note)
  assert.is_false(note_exists(note))
  assert.user_notified()  -- Verify feedback
end)
```

Implementation Tasks

  • Create docs/TESTING.md with guidelines
  • Add examples for each test type
  • Create template for new tests
  • Add pre-commit hook to check test names (a possible checker is sketched after this list)
  • Update CONTRIBUTING.md to reference guidelines
  • Hold team review of guidelines
  • Create test migration plan for existing tests
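
One way to start the pre-commit task above is a small Lua script that flags implementation-flavored test names. The patterns and invocation below are assumptions, not an existing tool in this repo; it might be run as `lua scripts/check_test_names.lua tests/*_spec.lua`, where the script path is hypothetical.

```lua
#!/usr/bin/env lua
-- Hypothetical checker: scan spec files for test names that describe
-- implementation ("calls ...", "prefers ...", call order) rather than behavior.
local bad_patterns = { "^calls ", "^prefers ", "^sets ", "call order" }

local function check_file(path)
  local problems = {}
  local lineno = 0
  for line in io.lines(path) do
    lineno = lineno + 1
    local name = line:match('it%s*%(%s*"(.-)"')  -- crude: grabs it("...") names
    if name then
      for _, pat in ipairs(bad_patterns) do
        if name:lower():find(pat) then
          table.insert(problems, string.format(
            "%s:%d: %q looks implementation-focused", path, lineno, name))
        end
      end
    end
  end
  return problems
end

local exit_code = 0
for _, path in ipairs(arg) do
  for _, msg in ipairs(check_file(path)) do
    print(msg)
    exit_code = 1
  end
end
os.exit(exit_code)
```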

Success Criteria

  • Guidelines document is clear and comprehensive
  • Examples cover all common scenarios
  • Team agrees on testing philosophy
  • New tests follow guidelines
  • Existing tests migrated progressively
  • Test quality improves measurably

References

  • Related issues: #[template-tests], #[utils-tests], #[missing-tests]
  • Industry resources on behavior-driven testing
  • Current test suite for anti-patterns to avoid