Analyzing data
Downloading data
You can use the command npm run getdata (which runs the script scripts/download_data.mjs) to download your data from Firebase. If run with no arguments, the script will ask you four questions before downloading the data:
- Whether you want to download testing data (generated by running your experiment in development mode) or real data (generated from the deployment of your experiment)
- Whether you want to download all data, or only data from participants marked as complete (participants who reached the end of the experiment)
- Which deployment branch you want to download data from (defaults to the current branch)
- Whether you want to download participants' recruitment info, such as platform-specific ID numbers (defaults to False)
- What file name (or full path) to save the data into
Alternatively, you can specify any number of these through command-line arguments:
npm run getdata -- --type <real|testing> --complete_only <all|complete_only> --branch_name <name> --filename <name-or-path>

The script will ask you about any arguments not specified on the command line.
Analyzing data
Once you have a data file, you can load and analyze it in the programming language of your choice. The following examples all use Python. The data file is saved as a JSON file, and so using the json package is the simplest way to load it:
import json
DATA_PATH = '...'
with open(DATA_PATH, 'r') as f:
    raw_data = json.load(f)

The loaded object is a list, where each entry is a dictionary with two keys: id is the ID of the participant's data in the database, and data is a dictionary containing the participant's data. Many of its fields are automatically populated by Smile, and it also includes your experiment's data. The structure of the data should match the structure of the data attribute in your smilestore in smilestore.js.
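As a hypothetical illustration of that structure (the field names inside data, such as done and trials, are stand-ins for whatever your experiment records), you can walk the list like this:

```python
# Hypothetical illustration of the exported structure: a list of
# {id, data} entries. The "done" and "trials" fields are stand-ins.
raw_data = [
    {"id": "abc123", "data": {"done": True, "trials": [{"rt": 523}]}},
    {"id": "def456", "data": {"done": False, "trials": []}},
]

# Collect the database IDs of participants whose data says they finished
done_ids = [entry["id"] for entry in raw_data if entry["data"].get("done")]
print(done_ids)  # ['abc123']
```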
For a slightly nicer data-loading experience (which loads data into Python objects with attributes, rather than plain dictionaries), you can use an extension of types.SimpleNamespace. A minimal version of such a class (the one in your codebase may add more functionality) is sketched here:

import json
from types import SimpleNamespace

class ExtendedSimpleNamespace(SimpleNamespace):
    """Minimal stand-in: a SimpleNamespace that also allows dict-style access."""

    def __getitem__(self, key):
        return getattr(self, key)

DATA_PATH = '...'
with open(DATA_PATH, 'r') as f:
    data = json.load(f, object_hook=lambda d: ExtendedSimpleNamespace(**d))

Data provenance
Smile tracks the provenance of each data file, including the version of the code that was used to generate it and the git commit hash of the code at the time it was run. This allows you to know exactly which version of the code was used to generate any given data file.
In the Firestore document, this information is stored in the smileConfig.github field and includes the repo name, owner, branch, last commit message, last commit hash, and the URL of the commit.
By keeping this information stored with the data, you can also link directly to the code used to generate the data even as it evolves during development.
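As a sketch of reading that provenance from a loaded record (the exact key names inside smileConfig.github may differ; all values here are hypothetical stand-ins):

```python
# Sketch only: the key names and values inside smileConfig.github are
# hypothetical stand-ins for the fields described above.
record = {
    "data": {
        "smileConfig": {
            "github": {
                "owner": "mylab",
                "repo": "my-experiment",
                "branch": "main",
                "lastCommitHash": "a1b2c3d",
            }
        }
    }
}

github = record["data"]["smileConfig"]["github"]
provenance = f'{github["owner"]}/{github["repo"]}@{github["branch"]} ({github["lastCommitHash"]})'
print(provenance)  # mylab/my-experiment@main (a1b2c3d)
```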
Python Analysis Library (smiledata)
Smile includes a Python library called smiledata for analyzing exported experiment data. The library is located in analysis/lib/smiledata and uses Polars for fast DataFrame operations.
Installation
The library uses uv for dependency management. To set up:
# Navigate to the analysis folder
cd analysis
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv sync
# For Jupyter notebook support
uv sync --extra jupyter
# For Marimo notebook support
uv sync --extra marimo
# For development (includes testing tools)
uv sync --extra dev
# Combine multiple extras
uv sync --extra jupyter --extra dev
# Install everything (jupyter, marimo, and dev tools)
uv sync --extra all

Quick Start
from smiledata import load_json
# Load exported data
data = load_json("data/my-experiment-2025-01-15.json")
# Get summary statistics
print(data.summary())
# {'total': 50, 'complete': 45, 'withdrawn': 2, 'incomplete': 3}
# Filter to complete participants only
complete = data.complete_only()
# Convert to Polars DataFrame for analysis
trials_df = complete.to_trials_df()

Loading Data
load_json(path)
Load a single JSON export file:
from smiledata import load_json
data = load_json("data/experiment-2025-01-15.json")

load_folder(folder, pattern="*.json")
Load all JSON files from a folder:
from smiledata import load_folder
# Load all JSON files
data = load_folder("data/")
# Load files matching a pattern
data = load_folder("data/", pattern="*-production-*.json")

load_latest(folder)
Load the most recently modified JSON file:
from smiledata.loader import load_latest
data = load_latest("data/")

Working with Datasets
The SmileDataset class provides methods for filtering and transforming participant data.
Filtering
# Filter to complete participants (consented, done, not withdrawn)
complete = data.complete_only()
# Filter by experimental condition
condition_a = data.by_condition(condition="A")
# Filter by multiple conditions
filtered = data.by_condition(condition="A", block_order="1")
# Filter by recruitment service
prolific_only = data.by_recruitment("prolific")
# Custom filter with lambda
fast_responders = data.filter(lambda p: p.trial_count > 100)
# Chain filters
result = data.complete_only().by_condition(condition="A")

Statistics
# Counts
print(f"Total: {data.participant_count}")
print(f"Complete: {data.complete_count}")
print(f"Withdrawn: {data.withdrawn_count}")
# Full summary
summary = data.summary()
# {'total': 50, 'complete': 45, 'withdrawn': 2, 'incomplete': 3}

Converting to DataFrames
import polars as pl
# Participant-level DataFrame (one row per participant)
participants_df = data.to_participants_df()
# Trial-level DataFrame (one row per trial, all participants)
trials_df = data.to_trials_df()
# Without participant ID column
trials_df = data.to_trials_df(include_participant_id=False)
# Demographics DataFrame
demographics_df = data.demographics_df()
# Specific page data (from pageData_* fields)
quiz_df = data.to_page_data_df("instructionsQuiz")

Extracting Page Data into DataFrames
Smile uses route-based data recording, where each page/route in your experiment can record its own data. This data is stored in pageData_<routeName> fields. The smiledata library provides convenient methods to extract this data into DataFrames for analysis.
Discovering Available Pages
First, check what pages have recorded data:
# See all pages that have data across participants
print(data.available_pages())
# Output: ['consent', 'demograph', 'device', 'experiment', 'feedback', 'quiz']

Extracting Trial Data from a Page
Use to_trials_df(page=...) or to_page_data_df(page_name) to extract data from a specific page:
# Get trial data from the main experiment page
trials_df = data.to_trials_df(page="experiment")
# Or equivalently
trials_df = data.to_page_data_df("experiment")
# Get quiz attempts
quiz_df = data.to_page_data_df("quiz")

The resulting DataFrame includes automatic columns:
- participant_id: The participant's Firebase document ID
- visit: The visit number (0-indexed) if the participant visited the page multiple times
- index: The index of this entry within the visit (for multiple records per visit)
- timestamp: When the data was recorded (if available)
Plus all the fields you recorded in each data entry.
Understanding the Visit Structure
When a participant visits a page multiple times (e.g., going back and forth between routes), data is organized by visit:
# Access raw page data structure for a participant
page_data = participant.get_page_data("experiment")
# {
# "visit_0": {"data": [...], "timestamps": [...]},
# "visit_1": {"data": [...], "timestamps": [...]}
# }
# Get all entries flattened into a list
entries = participant.get_page_data_entries("experiment")
# Get just the last submission (most recent)
last_entry = participant.get_form("experiment")
# Get a specific visit
first_visit = participant.get_form("experiment", visit=0)
# Get all visits as a list
all_visits = participant.get_form("experiment", all_visits=True)

Organizing Data for DataFrame Extraction
Important
For to_page_data_df() and to_trials_df(page=...) to work correctly, the data you record must be organized consistently.
Consistent Field Names
Each data entry recorded on a page should have the same field names across all entries and all participants. This ensures the resulting DataFrame has consistent columns:
// Good: Consistent field names across all trials
recordPageData({
  trial_num: 1,
  stimulus: 'imageA.png',
  response: 'left',
  rt: 523,
  correct: true,
})
recordPageData({
  trial_num: 2,
  stimulus: 'imageB.png',
  response: 'right',
  rt: 612,
  correct: false,
})

// Bad: Inconsistent field names (will create sparse columns)
recordPageData({
  trialNumber: 1, // Different name
  stim: 'imageA.png', // Different name
  response: 'left',
  reactionTime: 523, // Different name
})
recordPageData({
  trial_num: 2,
  stimulus: 'imageB.png',
  response: 'right',
  rt: 612,
})

Consistent Data Types
Keep data types consistent for each field:
// Good: rt is always a number
recordPageData({ rt: 523 })
recordPageData({ rt: 612 })
// Bad: rt is sometimes a number, sometimes a string
recordPageData({ rt: 523 })
recordPageData({ rt: 'timeout' }) // Will cause issues

If you need to handle special cases like timeouts, use a consistent approach:
// Better: Use null for missing values, add a separate status field
recordPageData({ rt: null, status: 'timeout' })
recordPageData({ rt: 523, status: 'responded' })

Flat Structure Preferred
Keep your data entries flat (not deeply nested) for easier DataFrame conversion:
// Good: Flat structure
recordPageData({
  trial_num: 1,
  stimulus_type: 'congruent',
  stimulus_position: 'left',
  response: 'left',
  rt: 523,
})
// Avoid: Deeply nested structure
recordPageData({
  trial: {
    num: 1,
    stimulus: {
      type: 'congruent',
      position: 'left',
    },
  },
  response: {
    key: 'left',
    rt: 523,
  },
})

If you do use nested structures, you may need to flatten them during analysis:
from smiledata.transforms import flatten_nested
# Flatten nested dict
flat = flatten_nested({"a": {"b": 1, "c": 2}})
# Result: {"a.b": 1, "a.c": 2}

Working with Individual Participants
The Participant class wraps a single participant's data:
# Access participants by index
first = data[0]
# Iterate over participants
for participant in data:
    print(participant.id)
# Key properties
print(participant.id) # Firebase document ID
print(participant.seed_id) # UUID for random seeding
print(participant.consented) # True/False
print(participant.withdrawn) # True/False
print(participant.done) # True/False
print(participant.is_complete) # True if consented, done, and not withdrawn
# Access data fields
print(participant.demographics) # demographicForm data (most recent)
print(participant.device_info) # deviceForm data (most recent)
print(participant.feedback) # feedbackForm data (most recent)
print(participant.quiz) # quiz data (final attempt)
print(participant.conditions) # Experimental conditions
print(participant.config) # smileConfig
print(participant.study_data) # studyData array (legacy)
print(participant.trial_count) # Number of trials
# Data provenance (GitHub info)
print(participant.github_info) # Full github provenance dict
print(participant.git_commit) # Commit hash that generated this data
print(participant.git_branch) # Branch name
print(participant.git_repo) # Repository name
print(participant.git_owner) # Repository owner
print(participant.git_commit_url) # URL to the commit on GitHub
print(participant.git_commit_message) # Commit message
# Access arbitrary fields
value = participant.get("customField", default=None)

Flexible Form Data Access
The get_form() method provides flexible access to form data with support for multiple visits (useful when participants can retake a quiz or revisit a form):
# Get the most recent submission (default)
demographics = participant.get_form("demograph")
# Get a specific visit (0-indexed)
first_quiz_attempt = participant.get_form("quiz", visit=0)
second_quiz_attempt = participant.get_form("quiz", visit=1)
# Get all attempts as a list (with visit/index metadata)
all_quiz_attempts = participant.get_form("quiz", all_visits=True)

Page Data Access
Access raw page data or flattened entries:
# Raw page data with visit structure
page_data = participant.get_page_data("experiment")
# Returns: {"visit_0": {"data": [...], "timestamps": [...]}, ...}
# All entries flattened into a list
entries = participant.get_page_data_entries("experiment")
# Returns: [{"visit": 0, "index": 0, "timestamp": ..., ...}, ...]
# Get all pageData_* fields
all_page_data = participant.get_all_page_data()

Route Order and Timing
See how participants navigated through the experiment:
# Text summary of routes and time spent
print(participant.route_order_summary())
# Output:
# Route Order:
# 1. consent 2.5s
# 2. instructions 15.3s
# 3. experiment 5m 23.1s
# 4. feedback 45.2s
# 5. thanks current
# Total: 6m 26.1s
# Visual subway-style timeline
ax = participant.plot_route_order()
# With explicit dark mode
ax = participant.plot_route_order(mode="dark")

Converting Individual Data to DataFrame
# Convert individual's study_data to DataFrame (legacy format)
trials_df = participant.study_data_to_polars()

Analyzing Trial Data
import polars as pl
from smiledata import load_json
# Load and filter data
data = load_json("data/experiment.json")
complete = data.complete_only()
# Get trials DataFrame
trials = complete.to_trials_df()
# Basic statistics with Polars
mean_rt = trials.group_by("participant_id").agg(
    pl.col("rt").mean().alias("mean_rt"),
    pl.col("correct").mean().alias("accuracy")
)
# Filter trials
correct_trials = trials.filter(pl.col("correct") == 1)
# Analyze by condition
by_condition = trials.group_by("condition").agg(
    pl.col("rt").mean().alias("mean_rt"),
    pl.col("rt").std().alias("std_rt"),
    pl.col("correct").mean().alias("accuracy")
)

Plotting
The library includes built-in plotting functions using seaborn and matplotlib. All plots support dark mode and automatically adapt their colors to the current theme.
Dark Mode Support
Plots automatically detect the current theme when used in marimo notebooks. You can also manually specify the theme:
from smiledata import detect_theme, get_theme_colors
# Auto-detect current theme (works in marimo)
theme = detect_theme() # Returns "light" or "dark"
# Get theme-appropriate colors
colors = get_theme_colors() # Auto-detect
colors = get_theme_colors("dark") # Force dark mode
# Available color keys:
# - text_color: Primary text color
# - muted_color: Secondary/muted text
# - node_fill: Fill color for chart elements
# - edge_color: Border/edge colors
# - grid_color: Grid line color

All built-in plotting functions:
- Make figure and axes backgrounds transparent
- Automatically style text, ticks, and spines for the current theme
- Work seamlessly in both light and dark notebook environments
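If you are styling custom plots to match, the same treatment can be applied by hand. A sketch using a hypothetical dark palette with the color keys listed above (the hex values are illustrative, not the library's actual colors):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for scripts
import matplotlib.pyplot as plt

# Hypothetical dark-theme palette using the color keys listed above
colors = {
    "text_color": "#e6e6e6",
    "muted_color": "#9a9a9a",
    "node_fill": "#3b5bdb",
    "edge_color": "#e6e6e6",
    "grid_color": "#444444",
}

fig, ax = plt.subplots()
fig.patch.set_alpha(0.0)                  # transparent figure background
ax.set_facecolor("none")                  # transparent axes background
ax.tick_params(colors=colors["text_color"])
ax.grid(True, color=colors["grid_color"])
for spine in ax.spines.values():
    spine.set_color(colors["edge_color"])
ax.set_title("Styled for dark mode", color=colors["text_color"])
```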
Available Plot Types
import matplotlib.pyplot as plt
from smiledata.plotting import (
    plot_completion_rate,
    plot_trial_rt_distribution,
    plot_accuracy_by_condition,
    plot_rt_by_condition,
    plot_participant_timeline,
)

plot_completion_rate(dataset) - Horizontal bar chart showing the number of complete, withdrawn, and incomplete participants.
ax = plot_completion_rate(data)

plot_trial_rt_distribution(trials_df, rt_column="rt") - Histogram showing the distribution of reaction times across all trials.
trials = data.complete_only().to_trials_df(page="experiment")
ax = plot_trial_rt_distribution(trials, rt_column="rt", bins=50)

plot_accuracy_by_condition(trials_df, condition_col) - Bar chart with standard error bars showing accuracy across experimental conditions.
ax = plot_accuracy_by_condition(trials, condition_col="condition", correct_col="correct")

plot_rt_by_condition(trials_df, condition_col) - Box plot showing reaction time distributions across experimental conditions.
ax = plot_rt_by_condition(trials, condition_col="condition", rt_column="rt")

plot_participant_timeline(dataset) - Line plot showing cumulative participant enrollment over time.
ax = plot_participant_timeline(data)

Participant Route Visualization
Individual participants have a method to visualize their route through the experiment:
participant = data[0]
# Text summary of routes and time spent
print(participant.route_order_summary())
# Subway-style visual timeline
ax = participant.plot_route_order()
# With explicit theme control
ax = participant.plot_route_order(mode="dark")

Composing Multi-Panel Figures
All plotting functions accept an optional ax parameter, allowing you to compose multiple plots into a single figure:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
plot_trial_rt_distribution(trials, ax=axes[0])
plot_accuracy_by_condition(trials, condition_col="condition", ax=axes[1])
plt.tight_layout()

Using with Notebooks
Marimo
uv run marimo edit notebooks/analysis.py

import marimo as mo
from smiledata import load_json
data = load_json("../data/experiment.json")
mo.md(f"Loaded {len(data)} participants")

Jupyter
uv run jupyter lab

from smiledata import load_json
data = load_json("data/experiment.json")
data.summary()

Running Tests
You can run Python tests from the project root using npm:
# Run Python tests
npm run test:python
# Run Python tests with coverage
npm run test:python:cov
# Run both JS and Python tests
npm run test:all

Or run directly from the analysis folder:
cd analysis
# Run all tests
uv run pytest
# Run with coverage report
uv run pytest --cov=smiledata --cov-report=term-missing
# Run specific test file
uv run pytest tests/test_participant.py

Why Polars instead of Pandas?
The smiledata library uses Polars rather than Pandas for several reasons:
- Performance: Polars is significantly faster, especially for large datasets
- Memory efficiency: Uses Apache Arrow for zero-copy data access
- Modern API: Cleaner, more consistent syntax
- Type safety: Better type inference and validation
If you need Pandas compatibility, you can easily convert:
pandas_df = trials_df.to_pandas()