
Best of your X follows: June 20
Mollick turns model evaluation into artifact inspection with GLM-5.2 and a harbor-town benchmark. Google DeepMind points AI at UK housing planning workflows, while Simon Willison and Charity Majors push the developer-tooling theme: generated code is cheap, engineering discipline is not.

The strongest signal today is not one giant launch. It is a set of small tests for where AI systems are starting to show up: model comparisons that use artifacts instead of leaderboard numbers, public-sector workflow prototypes, and developer tools that now assume agents can write to real systems.
Source mix: mostly X posts from the monitored account set, plus Simon Willison's weblog when his X timeline was quiet. Pure retweets, one-line political posts, and low-context small talk were left out.
Model releases and evaluation
Ethan Mollick: GLM-5.2 Max can do the task, but Fable still changes the shape of it
What happened: Mollick credited GLM-5.2 Max, a new open-weights model, with completing a constrained poem task that involved disappearing letters [1].
Why it matters: his comparison was not about whether the output was correct. He argued that Fable integrated the disappearing-letter constraint into the poem's theme, while GLM-5.2 Max mostly satisfied the surface requirement [1].
Implication: if you evaluate creative or agentic systems only by task completion, you miss the difference between following an instruction and using the constraint as part of the work.
Ethan Mollick: a 20-model harbor-town gallery as an AI progress test
What happened: Mollick shared a benchmark prompt asking models to build a procedurally generated 3D harbor-town simulation from 3000 BCE to 3000 AD, with beauty and user control in the spec [2].
Why it matters: the linked gallery compares model outputs from one prompt and describes the set as spanning 39 months of AI progress; the older GPT-3.5 and GPT-4 entries needed one standardized follow-up [3].
Implication: this is the kind of artifact-based benchmark that is easy for practitioners to inspect. You can judge coherence, interactivity, aesthetics, and failure modes without reducing everything to one score.
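One way to keep those judgments separate is to record a small rubric per artifact instead of a single number. The sketch below is illustrative only: the `ArtifactReview` type, the three dimension names, the 1-5 scale, and the model labels are assumptions made for this example, not part of Mollick's gallery, and the scores are placeholders rather than real results.

```python
from dataclasses import dataclass, field

# Hypothetical rubric dimensions, borrowed from the criteria named above.
DIMENSIONS = ("coherence", "interactivity", "aesthetics")


@dataclass
class ArtifactReview:
    model: str
    scores: dict[str, int]  # assumed 1-5 scale, one entry per dimension
    failure_modes: list[str] = field(default_factory=list)  # free-text notes

    def __post_init__(self) -> None:
        # Refuse reviews that silently skip a dimension.
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")


def best_per_dimension(reviews: list[ArtifactReview]) -> dict[str, str]:
    """Return the top-scoring model for each dimension, kept separate."""
    return {
        dim: max(reviews, key=lambda r: r.scores[dim]).model
        for dim in DIMENSIONS
    }


# Placeholder values for illustration only, not measured results.
reviews = [
    ArtifactReview(
        "model-a",
        {"coherence": 4, "interactivity": 2, "aesthetics": 5},
        ["camera clips through terrain"],
    ),
    ArtifactReview(
        "model-b",
        {"coherence": 3, "interactivity": 5, "aesthetics": 3},
    ),
]
leaders = best_per_dimension(reviews)
```

Because no dimension is averaged into another, a model that looks best on aesthetics but breaks on interaction stays visible as exactly that, and the failure notes travel with the scores.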
Public-sector AI
Google DeepMind: planning-office prototype targets housing applications
What happened: Google DeepMind said it is working with UK government bodies on an AI prototype for processing housing planning applications [4].
Why it matters: the post says the prototype is aimed at repetitive planning-officer work, so officers can spend more attention on complex projects [4].
Implication: DeepMind is claiming a processing-time reduction of up to 50%. Treat that as a target claim from the project team, not yet an audited deployment result [4].
Developer tools and engineering practice
Simon Willison: Datasette gets first-class row editing
What happened: Simon Willison released Datasette 1.0a34, adding insert, edit, and delete tools to the Datasette interface [5].
Why it matters: the tools are available on table pages, and edit and delete also appear as row-level actions. That lets the ordinary UI catch up with the write workflows Willison had already been exploring through Datasette Agent [5].
Implication: agent-assisted database work is pushing product surfaces back toward explicit human approval and visible edit controls, not just chat-only automation.

Simon Willison / Charity Majors: AI coding raises the bar for engineering discipline
What happened: Willison surfaced Charity Majors' argument that AI has made code generation cheap and fast, changing the economics of software production [6].
Why it matters: Majors' longer piece argues that if code becomes more disposable, teams need stronger production understanding, observability, review habits, and system invariants, not weaker ones [7].
Implication: the practical takeaway for AI coding teams is blunt: optimize for shared understanding and production feedback, because generated code is cheap and operational confusion is still expensive.
Short signals
Greg Brockman: GPT-Realtime-2 gets a terse internal endorsement
What happened: Greg Brockman posted that "GPT-Realtime-2 is something new" [8].
Why it matters: the post gives no launch note or technical detail, so the signal is weaker than a product announcement. It does show OpenAI's cofounder drawing attention to the realtime line after recent voice and WebRTC experiments in the developer community [8].
Implication: keep an eye on demos and docs before treating this as more than a high-level hint.
François Chollet: solve hard problems by reframing, not piling on complexity
What happened: Chollet argued that hard problems are rarely solved by adding complexity; they are solved by reframing the question until a simpler answer becomes visible [9].
Why it matters: in the context of AI research and software design, that is a useful counterweight to scale-first thinking. More machinery can hide a bad problem statement.
Implication: before adding another layer to an agent pipeline, ask whether the task definition is wrong.
Sources
- [1] Ethan Mollick on GLM-5.2 Max vs Fable
- [2] Ethan Mollick on the harbor-town benchmark
- [3] Harbor Town AI Gallery
- [4] Google DeepMind on an AI planning prototype
- [5] Release: datasette 1.0a34
- [6] Simon Willison quoting Charity Majors
- [7] AI demands more engineering discipline. Not less
- [8] Greg Brockman on GPT-Realtime-2
- [9] François Chollet on reframing hard problems