sunday links #9

March 2, 2025


Sketchheads, B & T, 2025

Sketchheads, B & T, 2025

  1. Crossing the uncanny valley of conversational voice
    • "voice presence" – the quality that makes spoken interactions feel real, understood, and valued
    • conversational speech model (csm) - a multimodal text and speech model
    • key components of voice presence:
      • emotional intelligence: reading and responding to emotional contexts
      • conversational dynamics: natural timing, pauses, interruptions and emphasis
      • contextual awareness: adjusting tone and style to match the situation
      • consistent personality: maintaining a coherent presence
    • technical approach:
      • converts continuous waveforms into discrete audio token sequences
      • semantic tokens: compressed speaker-invariant representations of semantic/phonetic features
      • acoustic tokens: fine-grained acoustic details that enable high-fidelity audio with speaker identity
      • single-stage model improves efficiency and expressivity compared to two-stage approaches
    • natural pauses and rhythm:
      • interleaves text and speech tokens during training to learn hesitations and pacing
      • backbone transformer captures long-range dependencies to anticipate appropriate pauses
      • uses residual vector quantization (rvq) to encode prosody, tone, rhythm, and speaking style
    • evaluation:
      • trained on ~1 million hours of english audio data
      • evaluated using expresso dataset with comparative mean opinion score (cmos)
      • new benchmarks for homograph disambiguation (e.g., "lead" metal vs. "lead" guide)
      • pronunciation consistency tests with words like "route" (/raʊt/ or /ruːt/)
      • results: matches human-level naturalness but still needs improvement in contextual appropriateness
  2. Alexandria AI
    • llm over top 1000 off-copyright books to build the digital library of alexandria
    • "The Great Library of Alexandria in Alexandria, Egypt, was one of the largest and most significant libraries of the ancient world. The library was part of a larger research institution called the Mouseion, which was dedicated to the Muses, the nine goddesses of the arts. It is unknown precisely how many scrolls were housed at any given time, but estimates range from 40,000 to 400,000 at its height. Alexandria came to be regarded as the capital of knowledge and learning, in part because of the Great Library."
  3. Elevated Indoor Carbon Dioxide Impairs Decision-Making Performance
    • Berkeley Lab study found CO2 levels of 1,000 ppm significantly impaired decision-making performance
    • Most dramatic declines were in strategic thinking and initiative, with subjects rated "dysfunctional" at 2,500 ppm
    • Challenges conventional wisdom that CO2 at typical building levels (below 5,000 ppm) has no direct impact on people
    • Particularly concerning for schools where CO2 frequently exceeds 1,000 ppm and occasionally 3,000 ppm
    • Get Aranet4 HOME for monitoring indoor CO2 levels but for 166 USD
      • Uses NDIR (Non-Dispersive Infrared) gas sensors that detect CO2 by measuring decreased infrared light transmission
      • The sensor works because CO2 absorbs specific infrared wavelengths, so higher CO2 concentration means less transmitted light
      • Transmittance (ratio of transmitted radiation to incident energy) decreases proportionally to CO2 concentration
      • This technology enables accurate, real-time monitoring of indoor air quality
  4. How to see. - by Emily Parsons
    • "it was one of those icy, clear January nights—the sky an ink black, no clouds in sight—a lively cold plunge"
    • "suddenly reminded of the friction between etiquette and free will"
    • "She explained that her mind naturally organizes the world into neat buckets—one for making music, another for visual art"
    • "One of the first lessons in art school was to draw what you see, not what your brain has memorized. When faced with a still life—say, a bowl of oranges—you force yourself to capture not the known version of an orange (an orange circle) but what you actually observe: wobbly, textured stones with golden rims where the kitchen light touches their tops and blue or red shadows beneath"
    • "In writing, we’re also taught to capture what we see—to infuse our words with the raw details of our experience"
    • In White Fang, Jack London doesn’t simply describe a forest as “empty and cold” but rather creates a living image of its wild, haunting beauty: "Dark spruce forest frowned on either side of the frozen waterway. The trees had been stripped by a recent wind of their white frosting, and they seemed to lean toward each other—black and ominous—in the fading light..."
  5. Focus – James Lin
    • "Focus and sacrifice are two sides of the same coin. Historically, I say no to 40% of opportunities. This should be 99%."
    • Warren Buffett's "2 List" strategy: identify top 5 priorities and treat everything else as "avoid at all costs"
    • "most hard problems are soluble with deep effort but impossible with shallow attempts; focus is the process of refining merely good ideas into great ideas"
    • "Networking can create false sense of accomplishment through association"
  6. Who comes to RC - Recurse Center
    • 6 or 12 week programming retreats available in NYC, hybrid, or remote formats. diverse community of programmers at all skill levels: EMs returning to code, self-taught hobbyists, staff engineers, recent grads, former founders
    • guided by three self-directives:
      • work at the edge of your abilities
      • build your volitional muscles
      • learn generously
    • key qualities they look for in applicants:
      • enjoyment of programming and willingness to spend most time coding
      • readiness to work at the edge of your abilities regardless of experience level
      • self-direction and ability to structure your own time without external guidance
      • sharpness: being a deliberate thinker who can pick things up quickly
      • pleasantness: being kind, engaged, thoughtful, and polite with others
      • intellectual curiosity and intrinsic motivation to learn technical concepts
  7. What Is the Sense of Agency and Why Does it Matter?
    • sense of agency is the feeling of being in control of your actions and their outcomes
    • key takeaways for building stronger agency:
      • pay attention to the connection between your intentions and actions
      • create environments with clear, immediate feedback for your actions
      • practice predicting outcomes of your actions to strengthen internal models
      • reduce delays between actions and their effects when possible
      • be mindful of how external factors (like technology interfaces) affect your sense of control
    • agency declines with age (starting around 50) but can be maintained through deliberate practice
    • stronger sense of agency correlates with better health outcomes and quality of life
    • agency can be disrupted in various conditions like schizophrenia, where prediction systems malfunction
    • modern interfaces should be designed to support users' sense of control and ownership
  8. Magnificent Grants
    • $10,000+ fellowships for outliers under 25 years old who challenge universities, credentialism, and elitist hierarchies
    • annual selection of 10 exceptional pioneers to receive grants and join a 2-year mentorship program
  9. olmOCR – Open-Source OCR for Accurate Document Conversion
    • key features:
      • cost-effective: processes 1M PDF pages for ~$190 (1/32 cost of GPT-4o APIs)
      • outputs text in Markdown format with proper handling of equations, tables, and handwriting
      • maintains correct reading order even for complex multi-column layouts
    • technical highlights:
      • uses "document anchoring" technique to leverage existing text/metadata in PDFs
      • training process: extracts image locations and text blocks, concatenates them into model prompts
      • dataset composition: 60% academic papers, 12% brochures, 11% legal documents, 6% diagrams, 5% slideshows
      • labeled 250,000 pages using GPT-4o with publicly accessible PDFs and Internet Archive scanned books
      • fine-tuned Qwen2-VL-7B-Instruct checkpoint with optimized inference pipeline for batch processing
      • outperforms other OCR tools in human evaluations with ELO score above 1800
      • preferred in 61.3% comparisons against Marker, 58.6% against GOT-OCR, 71.4% against MinerU
  10. the intersection of art & technology
    • a place for the creative technologists in sf, showcasing projects at the intersection of art and technology:
      • tiat 10: music explorer, rat dating simulator, roombavision
      • tiat 9: body language, AcknowledgeNET, knit soundscape/memory, bop spotter
      • tiat 8: anti-aging software, music millennium, chinese cyberfeminism archive, binaural tuner